On-demand local LLMs with systemd socket activation
I use both remote and local language models in my workflow. Running a large (or small) language model locally differs in some key ways from typical remote AI usage. Unless you have expensive hardware, a local LLM will be less performant and its responses will often be less smart. However, it has some benefits that make me want to keep it in my tool belt:
- no per-token costs,
- no leaking of private data to third parties,
- not being at the mercy of third parties with regards to models and APIs being available over time.
An LLM server like Llama.cpp’s `llama-server` can consume a lot of RAM even on standby, when no query is being processed. If you don’t use the local model very frequently, you may not want to have the server process running at all times.
I used to start and stop the local LLM server manually whenever I needed it, but that added friction to the local querying workflow and made me less likely to reach for it.
Luckily, systemd has some features that make it relatively easy to transparently start a service on the first connection to its socket, and to shut it down again after a period of inactivity. With these, I can just fire up a prompt whenever I need to, without worrying whether `llama-server` is running or not. On the first query there’s a delay while the server initializes, but it’s not excessive. And when I’m not using the local LLM for a while, the RAM gets freed up for other tasks. Looks like we can have our cake and eat it too!
Key mechanisms
I won’t go into excessive detail here on how systemd works, as there are manual pages and online docs. Let me just highlight the building blocks that make the local LLM setup work:
- Socket activation: The `llama-api.socket` unit listens on a port. When a connection is received, systemd automatically starts the corresponding `.service` file with the same base name to handle the connection(s). (See `man systemd.socket`.)
- systemd-socket-proxyd: The main `llama-api.service` file doesn’t start `llama-server` directly. It starts `systemd-socket-proxyd`, a lightweight proxy, which can be configured to exit after some period of socket inactivity.
- Start chaining: Before `llama-api.service` is considered started, it requires that `llama-server.service` starts - that one runs the actual `llama-server` to be proxied.
- Stop chaining: The `llama-server.service` has `StopWhenUnneeded=true`. After `llama-api.service` exits due to socket inactivity, there is nothing else that `Requires=` the `llama-server.service` to be running, and systemd stops it as well.
- Wait for health check on startup: The `llama-server` API starts responding even before the model is fully initialized, and it returns a 503 error for any prompts. The `llama-api.service` has an `ExecStartPre=` snippet which keeps polling the `llama-server` health check until it reports that it’s initialized. This delays the start of the proxy until everything is ready for prompting, and prevents the first prompt from ending in an error.
- Out-of-memory score adjustment: The `llama-server.service` explicitly sets a high OOM score. Should the computer start running out of free RAM, `llama-server` will be among the top candidates to be killed. Better to crash the LLM API than something like the desktop session :).
- User systemd session: The services don’t have to (and thus shouldn’t) be run as root. The files go into `~/.config/systemd/user` (see the command sketch right after this list).
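
Wiring this up in the user session only takes standard `systemctl --user` commands; here is a minimal sketch (the unit files themselves are listed in the next section):

```bash
# Reload user units after dropping the files into ~/.config/systemd/user
systemctl --user daemon-reload

# Enable and start only the socket; the services are started on demand
systemctl --user enable --now llama-api.socket

# Verify the socket is listening and nothing heavy is running yet
systemctl --user list-sockets
systemctl --user status llama-server.service
```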
That’s pretty much all the interesting stuff. The rest is “normal systemd”, likely familiar to many, and in any case easy to read up on.
The files
Here are the systemd unit files. Read and adjust before using :). When adjusting the port numbers, watch out for which port is which: 8000 is the front port that the socket (and thus the proxy) listens on, while 8001 is the backend port where `llama-server` itself listens.
- `llama-api.socket`:

  ```ini
  [Socket]
  ListenStream=8000

  [Install]
  WantedBy=sockets.target
  ```

- `llama-api.service`:

  ```ini
  [Unit]
  Requires=llama-server.service
  After=llama-server.service

  [Service]
  Type=notify
  TimeoutStartSec=30s
  ExecStartPre=bash -c 'until curl -I http://127.0.0.1:8001/health | grep "200 OK"; do sleep 1; done'
  ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=15m 127.0.0.1:8001
  ```

- `llama-server.service`:

  ```ini
  [Unit]
  StopWhenUnneeded=true

  [Service]
  OOMScoreAdjust=1000
  Type=exec
  Restart=no
  ExecStart=/path/to/llama-server --port 8001 --alias default --ctx-size 16384 -m /path/to/model.gguf
  ```
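
With the units in place, the whole chain can be smoke-tested with a single request to the proxy port. This is just a sketch: `llama-server` exposes an OpenAI-compatible API, and the `default` model name comes from the `--alias` flag above.

```bash
# The first request after an idle period triggers socket activation;
# expect a short delay while the model is loaded.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Say hello in five words."}]}'

# Both services should now be active; after 15 idle minutes they stop again.
systemctl --user status llama-api.service llama-server.service
```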
Possible improvements
I use several of these sockets to launch different models on demand, since llama-server can currently only serve a single model. I am considering something like llama-swap to act as a smart proxy for serving multiple models and switching between them automatically. It would allow client programs to use the “model” parameter to select a model, rather than using a different port for each (see the sketch below). It can also stop the backend llama-server process for one model when a different model is requested, to conserve memory, which would be an improvement over the current 15-minute timeout approach. It would be another dependency for something that feels non-critical, so I’m not completely sold on the approach yet, but I think it would be worth a try.
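
To make the difference concrete, here is a hypothetical sketch (the second port, 8002, and the model names are made up, and llama-swap’s own configuration is not shown): today the model is effectively chosen by the port, while a llama-swap-style proxy would let the “model” field do the choosing on a single port.

```bash
# Current approach: one socket-activated proxy per model, distinguished by port.
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "default", "messages": [{"role": "user", "content": "hi"}]}'
curl -s http://127.0.0.1:8002/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "default", "messages": [{"role": "user", "content": "hi"}]}'

# With a llama-swap-style proxy: one port, and the "model" field selects the backend.
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "coder", "messages": [{"role": "user", "content": "hi"}]}'
```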
Happy hacking!