Automating Health Monitoring for Local LLM Inference Servers

2026.04 / Tech Blog / BASTION

This article covers detecting the state in which GPUStack “appears to be running but cannot perform inference.” We built a system that runs a 2-stage health check at 5-minute intervals, with Slack notifications and Zabbix integration.

What was the problem? #

BASTION, introduced in the previous article (Building a system to automatically analyze infrastructure logs with local LLM), automatically analyzes infrastructure logs every 15 minutes using a local LLM (Qwen2.5-14B).

The single point of failure in this setup is GPUStack, the LLM inference server. When GPUStack goes down, log collection continues, but analysis and notifications stop entirely. Worse, the outage is slow to surface: it can take several hours before a human notices that the Slack notification that normally arrives every 15 minutes has stopped coming.

GPUStack itself has a model auto-restart feature. If a process crashes, it automatically restarts. However, this alone cannot detect certain cases.

Cases that GPUStack auto-restart cannot detect:
・The API process is alive but the model is unloaded due to OOM
・The API returns 200 but the model does not respond to inference requests
・The BASTION server cannot reach GPUStack over the network

In other words, “process survival” and “inference actually working” are separate issues. A mechanism to regularly confirm from the outside “can we actually perform inference” was needed.

Design approach #

The task is simple: Call the GPUStack API every 5 minutes, do nothing if normal, and notify Slack if abnormal. However, there are several design decisions.

2-stage check #

We divided the health check into 2 stages.

| Stage | Check content | What it detects | On failure |
| --- | --- | --- | --- |
| CHECK1 | HTTP request to /v1/models API | GPUStack process viability | Exit 1 (CRITICAL) |
| CHECK2 | Actual inference request to chat/completions | Whether the model is actually loaded and able to respond | Exit 2 (WARNING) |

The important case is the one where CHECK1 passes but CHECK2 fails, that is, the state where “the API is alive but the model cannot perform inference.” Process monitoring alone can never detect this.

In CHECK2, we send a single word "healthcheck" and receive the response with max_tokens: 5. Since we only need to confirm whether inference works, we use the minimum number of tokens.

Exit code design #

We designed the script’s exit code to be 3-valued.

| Exit code | Meaning | Slack notification | Zabbix trigger |
| --- | --- | --- | --- |
| 0 | Normal | Only on recovery | - |
| 1 | API down | CRITICAL (red) | Severe failure |
| 2 | Model down | WARNING (orange) | Warning |

We separated 1 and 2 because the response is different. For Exit 1, you need to check the GPUStack process itself. For Exit 2, you just need to look at the GPUStack dashboard for model status. The person receiving the notification can understand what they should do next just from the exit code.
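The 3-valued design can be sketched as two small lookup functions. This is a minimal sketch and not the actual BASTION script: the state names follow the article, while `status_to_exit_code` and `next_action` are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the 3-valued exit code design.
# The state names (ok / api_down / model_down) follow the article;
# the function names are assumptions made for this sketch.

# Map a detected state to the script's exit code.
status_to_exit_code() {
    case "$1" in
        ok)         echo 0 ;;
        api_down)   echo 1 ;;   # CHECK1 failed
        model_down) echo 2 ;;   # CHECK1 passed, CHECK2 failed
        *)          return 1 ;; # unknown state: print nothing, signal an error
    esac
}

# Map an exit code to the first thing the recipient should look at.
next_action() {
    case "$1" in
        0) echo "nothing to do" ;;
        1) echo "check the GPUStack process itself" ;;
        2) echo "check model status on the GPUStack dashboard" ;;
        *) return 1 ;;
    esac
}
```

At the end of a health check script, something like `exit "$(status_to_exit_code "$CURRENT_STATUS")"` would then produce the exit code, where `CURRENT_STATUS` is an assumed variable holding the detected state.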

Continuous alert suppression #

Running at 5-minute intervals means if GPUStack is down for 30 minutes, the same CRITICAL notification will be sent 6 times. By the third time, no one looks at Slack anymore.

To prevent this, we save the previous state to a file and only send notifications when the state changes.

| State transition | Notification |
| --- | --- |
| ok → api_down | Send CRITICAL notification |
| ok → model_down | Send WARNING notification |
| api_down → ok | Send RECOVERED notification |
| model_down → ok | Send RECOVERED notification |
| api_down → api_down | No notification |
| model_down → model_down | No notification |

Sending a RECOVERED notification on recovery is also important. The person who received the CRITICAL notification doesn’t have to open the dashboard to check “is it still down or already fixed.”
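The suppression rule amounts to comparing the previous and current state. The sketch below is one possible way to express it; `notification_for` is a hypothetical name. It also treats a direct change between api_down and model_down as a state change that triggers a notification, which is an assumption, since the article does not say how those transitions are handled.

```shell
#!/bin/sh
# Hypothetical helper: decide which notification (if any) a state
# transition should trigger. Prints CRITICAL, WARNING, RECOVERED, or none.
#   $1 = previous state, $2 = current state (ok / api_down / model_down)
notification_for() {
    if [ "$1" = "$2" ]; then
        echo "none"                 # unchanged state: suppress repeat alerts
        return 0
    fi
    case "$2" in
        ok)         echo "RECOVERED" ;;
        api_down)   echo "CRITICAL" ;;
        model_down) echo "WARNING" ;;
        *)          echo "none" ;;  # unknown state: stay silent (assumption)
    esac
}
```

With this shape, a 30-minute outage produces one CRITICAL message at the start and one RECOVERED message at the end, rather than six identical alerts.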

Implementation #

Everything is done with a single shell script. It is placed in the same directory as other BASTION scripts (summarize.sh, analyze.sh, etc.) and executed every 5 minutes by cron.

CHECK1: API response confirmation #

MODELS_RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" \
    --max-time 15 \
    -H "Authorization: Bearer ${API_KEY}" \
    "${GPUSTACK_HOST}/v1/models")

Since GPUStack provides an OpenAI-compatible API, we just send a GET request to the /v1/models endpoint. If HTTP 200 is returned, the API is alive. Timeout is 15 seconds.
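When CHECK1 fails, the captured status code already hints at the cause. The following is a minimal sketch of how the code could be interpreted for the log or notification text; `classify_http_code` and its labels are hypothetical, and in the design described here every non-200 result would still count as "API down".

```shell
#!/bin/sh
# Hypothetical helper: interpret the HTTP status code captured by CHECK1.
# curl's %{http_code} prints 000 when no HTTP response was received
# (connection refused, DNS failure, or --max-time exceeded).
classify_http_code() {
    case "$1" in
        200)     echo "ok" ;;
        401|403) echo "auth_error" ;;    # API is up but rejected the key
        000)     echo "unreachable" ;;   # no HTTP response at all
        *)       echo "unexpected" ;;    # e.g. 500, 502
    esac
}
```

For example, `classify_http_code "$MODELS_RESPONSE"` could be logged alongside the raw code so that an authentication problem is distinguishable from a network problem at a glance.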

CHECK2: Actual inference test #

INFERENCE_RESPONSE=$(curl -s --max-time 15 \
    -H "Authorization: Bearer ${API_KEY}" \
    -H "Content-Type: application/json" \
    "${GPUSTACK_HOST}/v1/chat/completions" \
    -d '{"model":"qwen2.5-14b-instruct",
         "messages":[{"role":"user","content":"healthcheck"}],
         "max_tokens":5,"temperature":0}')

If the response JSON contains "choices", inference is successful. If not, it is processed as model down.
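The "choices" test can be written as a plain string match, which avoids a dependency on a JSON parser. This is a sketch under that assumption; `has_choices` is a hypothetical name, and a string match is deliberately loose (it would also accept a body that merely mentions "choices" somewhere).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the response body
# contains a "choices" key, mirroring the CHECK2 rule described above.
#   $1 = raw response body from chat/completions (may be empty)
has_choices() {
    printf '%s' "$1" | grep -q '"choices"'
}
```

A caller might then write `if has_choices "$INFERENCE_RESPONSE"; then ... else ... fi`, treating the else branch as model down. An empty body, such as the one left by a timeout, fails the check as well.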

State persistence #

# Read previous state
STATUS_FILE="/tmp/gpustack-health-status"
PREV_STATUS="ok"
[ -f "$STATUS_FILE" ] && PREV_STATUS=$(cat "$STATUS_FILE")

# Save current state
echo "ok" > "$STATUS_FILE"      # or "api_down" or "model_down"

The state is simply written to a single file; no database is needed. Because the file lives under /tmp, it may disappear when the server restarts. In that case there is no file on the first run, so the previous state defaults to ok, and any abnormality triggers a notification on that first run.
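Wrapping the file access in two functions keeps the default-to-ok rule in one place. This is a minimal sketch: the file path matches the snippet above, while `read_prev_status`, `write_status`, and the ability to override `STATUS_FILE` from the environment are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for the state file. The default path matches the
# article; STATUS_FILE can be overridden, e.g. for testing.
STATUS_FILE="${STATUS_FILE:-/tmp/gpustack-health-status}"

# Print the previous state, defaulting to "ok" when no file exists yet.
read_prev_status() {
    if [ -f "$STATUS_FILE" ]; then
        cat "$STATUS_FILE"
    else
        echo "ok"
    fi
}

# Record the current state for the next run.
write_status() {
    echo "$1" > "$STATUS_FILE"
}
```

A run would call `read_prev_status` before the checks and `write_status` with the newly detected state afterwards.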

cron registration #

*/5 * * * * bash /opt/oc-seclogs/gpustack-healthcheck.sh >> /var/log/gpustack-healthcheck.log 2>&1

We set it to 5-minute intervals, shorter than the periodic BASTION analysis (15-minute interval). If periodic analysis runs with LLM down, it will fail, so we want to detect it before that.

Things we actually encountered #

Overlooked API key authentication #

When I first ran it, CHECK1 immediately returned HTTP 401. GPUStack had API key authentication enabled, but I hadn’t included the Authorization header in curl. This is an obvious mistake, but I think it’s common to write health check scripts in a test environment (no authentication) and move them to production as-is. We resolved it by modifying it to read the API key from a .env file.
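Reading the key from a .env file could look like the sketch below. The article does not show the file's location or the variable name, so `read_env_value`, the path, and `GPUSTACK_API_KEY` in the usage note are all hypothetical. The parser is intentionally simple and handles only plain `KEY=value` lines.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a variable from a simple
# KEY=value file. Assumes one assignment per line, no spaces around "=",
# and at most one pair of surrounding double quotes; this is not a full
# .env parser.
#   $1 = path to the .env file, $2 = variable name
read_env_value() {
    grep -E "^$2=" "$1" | tail -n 1 | cut -d= -f2- | sed -e 's/^"//' -e 's/"$//'
}
```

A script could then set `API_KEY="$(read_env_value /path/to/.env GPUSTACK_API_KEY)"` and fail early if the result is empty, which would have surfaced the missing-key problem before any request was sent.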

The case where CHECK1 passes but CHECK2 fails actually exists #

The GPUStack API process is running, but model loading has failed. /v1/models returns 200 but chat/completions returns 500. Process monitoring alone would determine it as “normal.” The decision to use 2-stage checks was validated in operations.

Slack notification format should include “what to do next” #

Initially we just said “GPUStack API is not responding,” but the person receiving the notification (myself) always does the same thing next. So we included the confirmation command (curl -s http://<endpoint>/v1/models) in the notification text. Once you receive the notification, you can copy and paste it into the terminal to quickly check the status.
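A notification that carries its own next step can be built as a small JSON payload for a Slack incoming webhook, which accepts a body of the form `{"text": "..."}`. The sketch below is an illustration, not the article's actual message format: `build_slack_payload` and `SLACK_WEBHOOK_URL` are hypothetical names, and no JSON escaping is performed.

```shell
#!/bin/sh
# Hypothetical helper: build a Slack incoming-webhook payload that includes
# the command to run next. Assumes the arguments contain no double quotes,
# backslashes, or newlines, since no JSON escaping is done here.
#   $1 = severity label, $2 = message, $3 = GPUStack base URL
build_slack_payload() {
    printf '{"text":"[%s] %s\\nNext step: `curl -s %s/v1/models`"}\n' "$1" "$2" "$3"
}

# Sending it (SLACK_WEBHOOK_URL is an assumed environment variable):
#   build_slack_payload CRITICAL "GPUStack API is not responding" "$GPUSTACK_HOST" |
#       curl -s -X POST -H "Content-Type: application/json" -d @- "$SLACK_WEBHOOK_URL"
```

Keeping payload construction separate from sending makes the message text easy to inspect locally before anything is posted to a channel.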

Zabbix integration #

Cron-based Slack notifications are immediate, but they offer no long-term view such as “what was the uptime over the past month” or “how is response time trending.” This is where Zabbix excels, so we also registered UserParameters to monitor GPUStack from Zabbix.

UserParameter #

We added 3 UserParameters to the Zabbix Agent configuration file.

| Key | Content | Data type |
| --- | --- | --- |
| gpustack.health | Health check script exit code (0/1/2) | Integer |
| gpustack.response_time | API response time (milliseconds) | Integer |
| gpustack.models | Model list JSON | Text |

gpustack.health calls the health check script itself and returns the exit code. By reusing the same script in both cron and Zabbix, we avoid duplicating judgment logic.
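One detail worth spelling out: a Zabbix UserParameter collects the standard output of its command, not the exit status, so the exit code has to be printed. The sketch below shows one way to do that. `print_exit_code` and the wrapper script path in the comments are hypothetical, and the article does not show its actual agent configuration.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command quietly and print its exit code,
# because a Zabbix UserParameter collects the command's standard output,
# not its exit status.
print_exit_code() {
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

# Assumed agent configuration line (the wrapper path is hypothetical):
#   UserParameter=gpustack.health,/opt/oc-seclogs/zabbix-gpustack-health.sh
# where that wrapper script would call:
#   print_exit_code bash /opt/oc-seclogs/gpustack-healthcheck.sh
```

The agent's Timeout setting limits how long a UserParameter command may run, so it would likely need to be set above the script's worst-case runtime, given that each of the two curl calls allows up to 15 seconds.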

Triggers #

We set 3-level triggers.

| Severity | Condition | Meaning |
| --- | --- | --- |
| Severe failure | gpustack.health = 1 | API down. LLM inference completely stopped. |
| Warning | gpustack.health = 2 | API is alive but model cannot perform inference. |
| Information | gpustack.response_time > 5000 | API response takes 5+ seconds. Sign of high GPU load. |

The third trigger, response delay, acts as an early warning before the state becomes CRITICAL. If the graph shows response time gradually increasing, you can adjust the model’s batch size or concurrent request count before an OOM occurs.
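The response time value can be derived from curl's `%{time_total}` write-out variable, which reports seconds with a fractional part. The sketch below converts that to whole milliseconds and applies the 5000 ms threshold from the trigger table. `seconds_to_ms` and `is_slow` are hypothetical names, and measuring against /v1/models is an assumption, since the article does not say which endpoint is timed.

```shell
#!/bin/sh
# Hypothetical helper: convert curl's %{time_total} (seconds, e.g. 0.250000)
# to whole milliseconds, truncating any fraction.
seconds_to_ms() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 }'
}

# Succeed if the response time in milliseconds exceeds the 5000 ms
# threshold used by the Information trigger.
is_slow() {
    [ "$1" -gt 5000 ]
}

# Measuring against the real endpoint (not executed here):
#   t=$(curl -s -o /dev/null -w "%{time_total}" --max-time 15 \
#        -H "Authorization: Bearer ${API_KEY}" "${GPUSTACK_HOST}/v1/models")
#   seconds_to_ms "$t"
```

Returning an integer keeps the value compatible with the Integer data type listed for gpustack.response_time.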

Division of roles between cron and Zabbix #

cron (5-minute intervals): Immediately notify Slack on anomaly detection. Simple and reliable.
Zabbix (5-minute intervals): History storage, response time graphs, escalation, failure history.

Running both ensures redundancy. Even if cron stops, Zabbix catches it; even if Zabbix goes down, cron still notifies.

Operations results #

Since starting operations, we have detected 2 cases where the GPUStack model stopped responding. In both cases, CHECK1 passed but CHECK2 detected it (Exit 2). Checking the GPUStack dashboard showed the model in “Error” state. GPUStack’s auto-restart feature reloaded the model and a RECOVERED notification came a few minutes later confirming recovery.

Without health checks, the next periodic analysis (up to 15 minutes later) would fail, and it would take even more time for a human to notice that Slack was not sending notifications.

Summary #

We automated health monitoring for local LLM inference servers (GPUStack) using shell scripts + cron + Zabbix.

Three design points stand out: 2-stage checks (checking the API response and actual inference separately to improve detection accuracy), continuous alert suppression (notifying only on state changes to prevent alert fatigue), and combined use of cron and Zabbix (covering both immediate notification and long-term trends).

Since everything is done with a single shell script, the same approach can be used for monitoring other LLM servers that provide OpenAI-compatible APIs, such as vLLM or Ollama.

BASTION is a service that realizes AI security monitoring in closed environments.

Updated on 2026/6/9

©2020 BESTNET.LLC . All Rights Reserved.