On-demand local LLMs with systemd socket activation

I use both remote and local language models in my workflow. Running a large (or small) language model locally differs in some key ways from typical remote AI usage. Unless you have expensive hardware, a local LLM will be less performant and its responses will often be less smart. However, it has some key benefits which make me want to keep it in my tool belt:

An LLM server like llama.cpp’s llama-server can consume a lot of RAM even on standby, when no query is being processed. If you don’t use the local model very frequently, you may not want the server process running at all times.

I used to start and stop the local LLM server manually when I needed it, but that added friction to the local querying workflow and made me less likely to reach for it.

Luckily, systemd has some features that make it relatively easy to transparently start a service on the first connection to its socket, and to shut it down again after a period of inactivity. With these, I can just fire off a prompt whenever I need to, without worrying whether llama-server is running or not. On the first query there’s a delay while the server initializes, but it’s not excessive. And when I’m not using the local LLM for a while, the RAM gets freed up for other tasks. Looks like we can have our cake and eat it too!

Key mechanisms

I won’t go into excessive detail here on how systemd works, as there are manual pages and online docs. Let me just highlight the building blocks that make the local LLM setup work:

- Socket activation: a .socket unit owns the public port (8000 here) and starts its companion service only when the first connection arrives.
- systemd-socket-proxyd: a small proxy shipped with systemd that accepts connections on the activated socket, forwards them to the real llama-server port (8001), and exits on its own after a configurable idle period (--exit-idle-time=).
- StopWhenUnneeded=: set on the llama-server unit so that, once the proxy has exited and nothing requires the server anymore, systemd stops it and the RAM is freed.
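To make that concrete, the two directives that do most of the heavy lifting look roughly like this (unit names and addresses are illustrative; the complete files are in the next section):

    # llama-proxy.socket: owns the public port; a connection here activates the proxy service
    [Socket]
    ListenStream=127.0.0.1:8000

    # llama-proxy.service: forwards traffic to llama-server on port 8001 and exits after 15 idle minutes
    [Service]
    ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=15min 127.0.0.1:8001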

That’s pretty much all the interesting stuff. The rest is “normal systemd”, likely familiar to many, and in any case easy to read up on.

The files

Here are the systemd unit files. Read and adjust them before using :). When adjusting the port numbers, watch out for which port is which: 8000 (the proxy) vs. 8001 (the llama-server).
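In outline, three units along these lines do the job. The unit names, the model path, and the llama-server flags are illustrative placeholders, and the path to systemd-socket-proxyd may differ per distribution, so treat this as a sketch rather than a drop-in config.

llama-proxy.socket owns port 8000 and activates the proxy service on the first connection:

    [Unit]
    Description=Socket-activated proxy for llama-server

    [Socket]
    ListenStream=127.0.0.1:8000

    [Install]
    WantedBy=sockets.target

llama-proxy.service runs systemd-socket-proxyd, which pulls in llama-server, forwards traffic to it on port 8001, and exits after 15 minutes without connections:

    [Unit]
    Description=Idle-exit proxy in front of llama-server
    Requires=llama-server.service
    After=llama-server.service

    [Service]
    ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=15min 127.0.0.1:8001

llama-server.service is the actual LLM server, listening only on localhost port 8001 and stopped again once nothing needs it:

    [Unit]
    Description=llama.cpp server
    # Stop the server once nothing (i.e. the proxy) requires it anymore.
    StopWhenUnneeded=yes

    [Service]
    ExecStart=/usr/bin/llama-server --host 127.0.0.1 --port 8001 --model /path/to/model.gguf

Enable and start the socket (systemctl enable --now llama-proxy.socket, with --user if these are user units) and point your clients at port 8000. The first request starts llama-server on 8001, and once the proxy has seen no traffic for 15 minutes it exits, taking llama-server down with it.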

Possible improvements

I use several of these sockets to launch different models on demand, as llama-server can currently only serve a single model. I am considering something like llama-swap to act as a smart proxy for serving multiple models and switching between them automatically. It would allow client programs to use the “model” parameter to select a model, rather than using a different port for each. It can also stop the backend llama-server process for one model when a different model is requested, to conserve memory. That would be an improvement over the current 15-minute timeout approach. It would be another dependency for something that feels non-critical, so I’m not completely sold on the approach yet, but I think it would be worth a try.

Happy hacking!