On-demand local LLMs with systemd socket activation

I use both remote and local language models in my workflow. Running a large (or small) language model locally differs in some key ways from typical remote AI usage. Unless you have expensive hardware, a local LLM will be less performant and its responses will often be less smart. However, it has some key benefits which make me want to keep it in my tool belt:

An LLM server like llama.cpp’s llama-server can consume a lot of RAM even on standby, when no query is being processed. If you don’t use the local model very frequently, you may not want the server process running at all times.

I used to start and stop the local LLM server manually when I needed it, but that added friction to the local querying workflow and made me less likely to reach for it.

Luckily, systemd has some features that make it relatively easy to transparently start a service on the first connection to its socket, and to shut it down again after a period of inactivity. With these, I can just fire off a prompt whenever I need to, without worrying whether llama-server is running or not. On the first query there’s a delay while the server initializes, but it’s not excessive. And when I’m not using the local LLM for a while, the RAM gets freed up for other tasks. Looks like we can have our cake and eat it too!

Key mechanisms

I won’t go into excessive detail here on how systemd works, as there are manual pages and online docs. Let me just highlight the building blocks that make the local LLM setup work:

- Socket activation: a .socket unit owns the public port (8000 here) and starts its companion service only when the first connection arrives.
- systemd-socket-proxyd: a small proxy shipped with systemd that accepts connections on the activated socket, forwards them to the real llama-server port (8001), and exits on its own after a configurable idle period (--exit-idle-time=).
- StopWhenUnneeded=: set on the llama-server unit so that, once the proxy has exited and nothing requires the server anymore, systemd stops it and the RAM is freed.
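To make that concrete, the two directives that do most of the heavy lifting look roughly like this (unit names and addresses are illustrative; the complete files are in the next section):

    # llama-proxy.socket: owns the public port; a connection here activates the proxy service
    [Socket]
    ListenStream=127.0.0.1:8000

    # llama-proxy.service: forwards traffic to llama-server on port 8001 and exits after 15 idle minutes
    [Service]
    ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=15min 127.0.0.1:8001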

That’s pretty much all the interesting stuff. The rest is “normal systemd”, likely familiar to many, and in any case easy to read up on.

The files

Here are the systemd unit files. Read and adjust them before using :). When adjusting the port numbers, watch out for which port is which: 8000 (the proxy) vs. 8001 (the llama-server).
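In outline, three units along these lines do the job. The unit names, the model path, and the llama-server flags are illustrative placeholders, and the path to systemd-socket-proxyd may differ per distribution, so treat this as a sketch rather than a drop-in config.

llama-proxy.socket owns port 8000 and activates the proxy service on the first connection:

    [Unit]
    Description=Socket-activated proxy for llama-server

    [Socket]
    ListenStream=127.0.0.1:8000

    [Install]
    WantedBy=sockets.target

llama-proxy.service runs systemd-socket-proxyd, which pulls in llama-server, forwards traffic to it on port 8001, and exits after 15 minutes without connections:

    [Unit]
    Description=Idle-exit proxy in front of llama-server
    Requires=llama-server.service
    After=llama-server.service

    [Service]
    ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=15min 127.0.0.1:8001

llama-server.service is the actual LLM server, listening only on localhost port 8001 and stopped again once nothing needs it:

    [Unit]
    Description=llama.cpp server
    # Stop the server once nothing (i.e. the proxy) requires it anymore.
    StopWhenUnneeded=yes

    [Service]
    ExecStart=/usr/bin/llama-server --host 127.0.0.1 --port 8001 --model /path/to/model.gguf

Enable and start the socket (systemctl enable --now llama-proxy.socket, with --user if these are user units) and point your clients at port 8000. The first request starts llama-server on 8001, and once the proxy has seen no traffic for 15 minutes it exits, taking llama-server down with it.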

Possible improvements

I use several of these sockets to launch different models on demand, as llama-server can currently only serve a single model. I am considering something like llama-swap to act as a smart proxy for serving multiple models and switching between them automatically. It would allow client programs to use the “model” parameter to select a model, rather than using a different port for each. It can also stop the backend llama-server process for one model when a different model is requested, to conserve memory. That would be an improvement over the current 15-minute timeout approach. It would be another dependency for something that feels non-critical, so I’m not completely sold on the approach yet, but I think it would be worth a try.

Happy hacking!