Automated Health Monitoring for Local LLM Inference Servers #
Detect the state where GPUStack “appears to be running but cannot perform inference.” We built a system that runs 2-stage health checks at 5-minute intervals, with Slack notifications and Zabbix integration.
What was the problem? #
The BASTION introduced in the previous article (Building a system to automatically analyze infrastructure logs with local LLM) is a mechanism that automatically analyzes infrastructure logs every 15 minutes using a local LLM (Qwen2.5-14B).
The single point of failure in this mechanism is GPUStack (the LLM inference server). When GPUStack goes down, log collection continues but analysis and notifications all stop. Worse, the outage is hard to notice: it can take several hours before a human realizes that the Slack notification that normally arrives every 15 minutes has stopped coming.
GPUStack itself has a model auto-restart feature. If a process crashes, it automatically restarts. However, this alone cannot detect certain cases.
・The API process is alive but the model is unloaded due to OOM
・The API returns 200 but the model does not respond to inference requests
・The BASTION server cannot reach GPUStack over the network
In other words, “process survival” and “inference actually working” are separate issues. A mechanism to regularly confirm from the outside “can we actually perform inference” was needed.
Design approach #
The task is simple: Call the GPUStack API every 5 minutes, do nothing if normal, and notify Slack if abnormal. However, there are several design decisions.
2-stage check #
We divided the health check into 2 stages.
| Stage | Check content | What it detects | On failure |
|---|---|---|---|
| CHECK1 | HTTP request to /v1/models API | GPUStack process viability | Exit 1 (CRITICAL) |
| CHECK2 | Actual inference request to chat/completions | Whether the model is actually loaded and able to respond | Exit 2 (WARNING) |
The case where CHECK1 passes but CHECK2 fails is the important one: the API is alive but the model cannot perform inference. Process monitoring alone can never detect this state.
In CHECK2, we send a single word "healthcheck" and receive the response with max_tokens: 5. Since we only need to confirm whether inference works, we use the minimum number of tokens.
Exit code design #
We designed the script’s exit code to be 3-valued.
| Exit code | Meaning | Slack notification | Zabbix trigger |
|---|---|---|---|
| 0 | Normal | Only on recovery | — |
| 1 | API down | CRITICAL (red) | Severe failure |
| 2 | Model down | WARNING (orange) | Warning |
We separated 1 and 2 because the response is different. For Exit 1, you need to check the GPUStack process itself. For Exit 2, you just need to look at the GPUStack dashboard for model status. The person receiving the notification can understand what they should do next just from the exit code.
Continuous alert suppression #
Running at 5-minute intervals means if GPUStack is down for 30 minutes, the same CRITICAL notification will be sent 6 times. By the third time, no one looks at Slack anymore.
To prevent this, we save the previous state to a file and only send notifications when the state changes.
| State transition | Notification |
|---|---|
| ok → api_down | Send CRITICAL notification |
| ok → model_down | Send WARNING notification |
| api_down → ok | Send RECOVERED notification |
| api_down → api_down | No notification |
| model_down → model_down | No notification |
Sending a RECOVERED notification on recovery is also important. The person who received the CRITICAL notification doesn’t have to open the dashboard to check “is it still down or already fixed.”
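The transition table can be expressed as a small decision function. This is a minimal sketch, not the actual script: the name `notification_for` is hypothetical, and the handling of transitions the table does not list (for example model_down → ok or api_down → model_down) is an assumption that any change of state notifies with the level of the new state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide which notification (if any) a state transition needs.
#   $1 = previous state, $2 = current state (ok | api_down | model_down)
notification_for() {
  local prev="$1" cur="$2"
  if [ "$prev" = "$cur" ]; then
    echo "NONE"                    # unchanged state: suppress the repeated alert
    return 0
  fi
  case "$cur" in
    ok)         echo "RECOVERED" ;;
    api_down)   echo "CRITICAL" ;;
    model_down) echo "WARNING" ;;
    *)          echo "NONE" ;;     # unknown state: assumption, stay silent
  esac
}
```

With this shape, a 30-minute outage produces one CRITICAL message at the start and one RECOVERED message at the end, instead of six identical alerts.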
Implementation #
Everything is done with a single shell script. It is placed in the same directory as other BASTION scripts (summarize.sh, analyze.sh, etc.) and executed every 5 minutes by cron.
CHECK1: API response confirmation #
MODELS_RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
--max-time 15 \
-H "Authorization: Bearer ${API_KEY}" \
"${GPUSTACK_HOST}/v1/models")
Since GPUStack provides an OpenAI-compatible API, we just send a GET request to the /v1/models endpoint. If HTTP 200 is returned, the API is alive. The timeout is 15 seconds.
CHECK2: Actual inference test #
INFERENCE_RESPONSE=$(curl -s --max-time 15 \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
"${GPUSTACK_HOST}/v1/chat/completions" \
-d '{"model":"qwen2.5-14b-instruct",
"messages":[{"role":"user","content":"healthcheck"}],
"max_tokens":5,"temperature":0}')
If the response JSON contains "choices", inference succeeded. If not, the result is treated as model down.
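Combining the two results into the 3-valued exit code might look like the sketch below. The function name `decide_exit_code` is hypothetical and not taken from the actual script; it only assumes the `MODELS_RESPONSE` and `INFERENCE_RESPONSE` variables shown above are passed in as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the two check results to the 3-valued exit code.
#   $1 = HTTP status code returned by /v1/models (CHECK1)
#   $2 = response body returned by chat/completions (CHECK2)
decide_exit_code() {
  local http_code="$1" body="$2"
  if [ "$http_code" != "200" ]; then
    echo 1   # API down -> CRITICAL
  elif ! printf '%s' "$body" | grep -q '"choices"'; then
    echo 2   # API alive, but no inference result -> WARNING
  else
    echo 0   # Normal
  fi
}
```

The script could then finish with `exit "$(decide_exit_code "$MODELS_RESPONSE" "$INFERENCE_RESPONSE")"`. When curl cannot connect at all, `%{http_code}` is `000`, which also falls into the API-down branch.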
State persistence #
# Read previous state
STATUS_FILE="/tmp/gpustack-health-status"
PREV_STATUS="ok"
[ -f "$STATUS_FILE" ] && PREV_STATUS=$(cat "$STATUS_FILE")
# Save current state
echo "ok" > "$STATUS_FILE" # or "api_down" or "model_down"
The state is simply written to a single file; no database is needed. On the first run after a restart there is no file, so the previous state defaults to ok, and any abnormality triggers a notification on that first run.
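The read and write steps can be wrapped in two small functions. This is a sketch under assumptions: the names `read_state` and `write_state` are hypothetical, and writing to a temporary file before renaming it is an added precaution against a half-written state file, not something the original script is known to do.

```shell
#!/usr/bin/env bash
# Path from the article; can be overridden through the environment.
STATUS_FILE="${STATUS_FILE:-/tmp/gpustack-health-status}"

# Print the previous state; default to "ok" when the file is missing or empty.
read_state() {
  local state=""
  if [ -f "$STATUS_FILE" ]; then
    state=$(cat "$STATUS_FILE")
  fi
  echo "${state:-ok}"
}

# Save the current state ("ok", "api_down" or "model_down").
write_state() {
  local tmp="${STATUS_FILE}.tmp.$$"
  # Write to a temporary file first, then rename it over the real one.
  printf '%s\n' "$1" > "$tmp" && mv "$tmp" "$STATUS_FILE"
}
```

A run would call `read_state` before the checks and `write_state` after them, and compare the two values to decide whether to notify.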
cron registration #
*/5 * * * * bash /opt/oc-seclogs/gpustack-healthcheck.sh >> /var/log/gpustack-healthcheck.log 2>&1
We set it to 5-minute intervals, shorter than the 15-minute interval of the periodic BASTION analysis. If the periodic analysis runs while the LLM is down, it will fail, so we want to detect the outage before that happens.
Things we actually encountered #
Overlooked API key authentication #
When I first ran it, CHECK1 immediately returned HTTP 401. GPUStack had API key authentication enabled, but I hadn’t included the Authorization header in curl. This is an obvious mistake, but I think it’s common to write health check scripts in a test environment (no authentication) and move them to production as-is. We resolved it by modifying it to read the API key from a .env file.
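Loading the key from a .env file can be sketched as follows. The function name `load_env`, the path passed to it, and the assumption that the file holds plain `KEY=VALUE` lines (including one for `API_KEY`) are illustrative and not taken from the actual script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: load KEY=VALUE pairs from a .env file into the environment.
#   $1 = path to the .env file
load_env() {
  local env_file="$1"
  if [ ! -r "$env_file" ]; then
    echo "cannot read $env_file" >&2
    return 1
  fi
  set -a              # export every variable assigned while sourcing
  . "$env_file"
  set +a
  # Fail early if the key is missing, instead of discovering it as HTTP 401.
  if [ -z "${API_KEY:-}" ]; then
    echo "API_KEY is not set in $env_file" >&2
    return 1
  fi
}
```

A script could call `load_env /opt/oc-seclogs/.env || exit 1` at the top (the file location is assumed here), so a missing key stops the run with a clear message before any request is sent.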
The case where CHECK1 passes but CHECK2 fails actually exists #
The GPUStack API process is running, but model loading has failed. /v1/models returns 200 but chat/completions returns 500. Process monitoring alone would determine it as “normal.” The decision to use 2-stage checks was validated in operations.
Slack notification format should include “what to do next” #
Initially we just said “GPUStack API is not responding,” but the person receiving the notification (myself) always does the same thing next. So we included the confirmation command (curl -s http://<endpoint>/v1/models) in the notification text. Once you receive the notification, you can copy and paste it into the terminal to quickly check the status.
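A notification that carries its own next step could be built like this. It is a rough sketch assuming a Slack incoming webhook: the function names, the `SLACK_WEBHOOK_URL` variable, and the exact colors are hypothetical, and the message wording is illustrative rather than the text the actual script sends.

```shell
#!/usr/bin/env bash
# Escape backslashes and double quotes so a string can sit inside a JSON value.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a Slack payload that includes the command to run next.
#   $1 = level (CRITICAL | WARNING | RECOVERED), $2 = message, $3 = command to run next
build_slack_payload() {
  local level="$1" message="$2" next_cmd="$3" color
  case "$level" in
    CRITICAL) color="#ff0000" ;;   # red
    WARNING)  color="#ff9900" ;;   # orange
    *)        color="#36a64f" ;;   # green, assumed for RECOVERED
  esac
  printf '{"attachments":[{"color":"%s","text":"[%s] %s\\nCheck with: `%s`"}]}' \
    "$color" "$level" "$(json_escape "$message")" "$(json_escape "$next_cmd")"
}

# Send the payload. SLACK_WEBHOOK_URL is an assumed variable, e.g. loaded from .env.
send_slack() {
  curl -s --max-time 15 -X POST -H "Content-Type: application/json" \
    -d "$(build_slack_payload "$@")" "$SLACK_WEBHOOK_URL" > /dev/null
}
```

For example, `send_slack CRITICAL "GPUStack API is not responding" "curl -s http://<endpoint>/v1/models"` would post a red message whose last line can be pasted straight into a terminal.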
Zabbix integration #
Cron-based Slack notifications have high immediacy, but lack a long-term perspective like “what was the uptime over the past month” or “what is the trend in response time.” This is Zabbix’s forte, so we also register UserParameters to monitor from Zabbix.
UserParameter #
We added 3 UserParameters to the Zabbix Agent configuration file.
| Key | Content | Data type |
|---|---|---|
| gpustack.health | Health check script exit code (0/1/2) | Integer |
| gpustack.response_time | API response time (milliseconds) | Integer |
| gpustack.models | Model list JSON | Text |
gpustack.health calls the health check script itself and returns the exit code. By reusing the same script in both cron and Zabbix, we avoid duplicating judgment logic.
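A UserParameter item takes its value from the command's standard output, not from its exit status, so the exit code has to be printed. The helpers below sketch one way to do that; the names `print_exit_code` and `to_millis` and the example configuration line in the comment are assumptions, not the actual configuration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command silently and print its exit code.
print_exit_code() {
  local rc=0
  "$@" > /dev/null 2>&1 || rc=$?
  echo "$rc"
}

# Hypothetical helper: convert seconds (as printed by curl's %{time_total},
# e.g. "0.123456") to integer milliseconds.
to_millis() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Assumed Zabbix Agent entry (script path taken from the cron line above):
#   UserParameter=gpustack.health,/bin/bash /opt/oc-seclogs/gpustack-healthcheck.sh >/dev/null 2>&1; echo $?
```

Because the health check can wait up to 15 seconds on each curl call, the agent's Timeout setting may need to be raised for this item to return a value.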
Triggers #
We set triggers at 3 severity levels.
| Severity | Condition | Meaning |
|---|---|---|
| Severe failure | gpustack.health = 1 | API down. LLM inference completely stopped. |
| Warning | gpustack.health = 2 | API is alive but model cannot perform inference. |
| Information | gpustack.response_time > 5000 | API response takes 5+ seconds. Sign of high GPU load. |
The third “response delay” trigger serves as an early warning before the situation becomes CRITICAL. If a graph shows the response time gradually increasing, you can adjust the model’s batch size or concurrent request count before OOM occurs.
Division of roles between cron and Zabbix #
cron (5-minute intervals): Immediate Slack notifications on state changes.
Zabbix (5-minute intervals): History storage, response time graphs, escalation, failure history.
Running both ensures redundancy. Even if cron stops, Zabbix catches it; even if Zabbix goes down, cron still notifies.
Operations results #
Since starting operations, we have detected 2 cases where the GPUStack model stopped responding. In both cases, CHECK1 passed but CHECK2 detected it (Exit 2). Checking the GPUStack dashboard showed the model in “Error” state. GPUStack’s auto-restart feature reloaded the model and a RECOVERED notification came a few minutes later confirming recovery.
Without health checks, the next periodic analysis (up to 15 minutes later) would fail, and it would take even more time for a human to notice that Slack was not sending notifications.
Summary #
We automated health monitoring for local LLM inference servers (GPUStack) using shell scripts + cron + Zabbix.
Three design points stand out: 2-stage checks (checking the API response and actual inference separately for better detection accuracy), continuous alert suppression (notifying only on state changes to prevent alert fatigue), and combined use of cron and Zabbix (achieving both immediate notification and long-term trend tracking).
Since everything is done with a single shell script, the same approach can be used for monitoring other LLM servers that provide OpenAI-compatible APIs, such as vLLM or Ollama.
BASTION is a service that realizes AI security monitoring in closed environments.