Troubleshooting

Agent Offline? Your Runtime/Container May Be Down

A practical recovery flow for when your agent goes silent, keeps typing forever, or suddenly stops executing tasks.

Community reports often look like this: "My agent is dead," "it keeps typing but never replies," or "did my container go down?" The good news: this is usually an environment/runtime issue, not permanent data loss.

Fast mental model: If the chat still replies but tools or actions stop working, the execution layer may be unhealthy. If nothing responds at all, the runtime itself may be offline.

3-minute triage checklist

1) Run a tiny liveness prompt

Use one short message first:

Reply with exactly: OK-ALIVE

If this fails repeatedly, treat it as a runtime outage rather than a prompt-quality problem.
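If you want to script this check, one option is a small retry loop. This is a minimal sketch, not a real client: `send_message` is a hypothetical callable you would supply for whatever interface you use (it takes the prompt and returns the reply text, or None when nothing comes back), and the default of three attempts is an arbitrary assumption.

```python
from typing import Callable, Optional

LIVENESS_PROMPT = "Reply with exactly: OK-ALIVE"
EXPECTED_REPLY = "OK-ALIVE"


def is_alive(send_message: Callable[[str], Optional[str]], attempts: int = 3) -> bool:
    """Return True as soon as one attempt gets the exact expected reply.

    `send_message` is a hypothetical transport function supplied by the caller.
    """
    for _ in range(attempts):
        try:
            reply = send_message(LIVENESS_PROMPT)
        except Exception:
            # Any transport error (timeout, dropped connection) counts as a failed attempt.
            reply = None
        if reply is not None and reply.strip() == EXPECTED_REPLY:
            return True
    # Every attempt failed: treat this as a likely runtime outage, not a prompt problem.
    return False
```

A False result here is the "fails repeatedly" case described above; a True result means the chat path is up and you can move on to the execution test.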

2) Separate chat health from execution health

Send two tests in order:

  1. Reply with: CHAT-OK (no tools)
  2. Create a file named healthcheck.txt with content "runtime check"

If test #1 passes but test #2 hangs or fails, your execution layer is the likely problem.
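The decision rule for this test pair can be written down as a tiny function. The sketch below is illustrative only; the `Diagnosis` names and the `diagnose` function are made up for this example, and the two booleans are whatever you observed when running the tests by hand.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "chat and execution both respond"
    EXECUTION_LAYER = "chat works but the tool step fails: execution layer is the likely problem"
    RUNTIME_OFFLINE = "no chat response: the runtime may be offline"


def diagnose(chat_ok: bool, tool_ok: bool) -> Diagnosis:
    """Map the two test results (no-tool reply, file-creation task) to a likely cause."""
    if not chat_ok:
        # Test #1 failed, so the tool result is not meaningful yet.
        return Diagnosis.RUNTIME_OFFLINE
    if not tool_ok:
        # Test #1 passed but test #2 hung or failed.
        return Diagnosis.EXECUTION_LAYER
    return Diagnosis.HEALTHY
```

Checking chat first mirrors the order of the tests: if the plain reply never arrives, the file-creation result tells you nothing about the execution layer.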

3) Start one fresh session and retest

Open a clean session and repeat the tiny liveness prompt. This rules out a bloated or stuck chat thread.

Common causes

Don’t burn time debugging giant prompts first. Always validate basic liveness before rewriting your instructions.

Known-good recovery flow

  1. Run tiny liveness prompt (OK-ALIVE).
  2. Run no-tool vs tool test pair (chat health vs execution health).
  3. Start one fresh session and repeat test.
  4. If still failing, restart runtime/container from your control surface.
  5. Retest with one tiny task before resuming normal work.
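The flow above can be sketched as a short function that stops at the first step that works. Everything passed in is a hypothetical callable you would write yourself: `probe` stands for steps 1 and 2 (returning True when the tiny tests pass), `open_fresh_session` for step 3, and `restart_runtime` for step 4. None of these correspond to a real API.

```python
from typing import Callable


def recover(
    probe: Callable[[], bool],
    open_fresh_session: Callable[[], None],
    restart_runtime: Callable[[], None],
) -> str:
    """Walk the recovery steps in order and report where things started working."""
    # Steps 1-2: tiny liveness prompt plus the no-tool vs tool pair.
    if probe():
        return "healthy"
    # Step 3: rule out a bloated or stuck thread.
    open_fresh_session()
    if probe():
        return "recovered-after-fresh-session"
    # Step 4: restart the runtime/container from the control surface.
    restart_runtime()
    # Step 5: retest with one tiny task before resuming normal work.
    if probe():
        return "recovered-after-restart"
    return "escalate"
```

The restart comes last on purpose: it is the most disruptive step, so the cheaper checks run first, and an "escalate" result means it is time to file a report.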

Escalation packet (copy/paste)

Issue: Agent appears offline / container possibly down
Where tested: [dashboard | Discord DM | channel]
Liveness prompt used: "Reply with exactly: OK-ALIVE"
Result: [no reply | typing forever | partial reply]
Execution test result: [chat-only works / tool step fails]
What I already tried: [fresh session, restart, browser refresh]
Time started: [timestamp + timezone]
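If you file these reports often, the packet can be generated rather than typed. This is a small sketch using the same fields as the template above; the function name `build_escalation_packet` and its parameters are invented for illustration, and requiring a timezone-aware timestamp is a design assumption so the "timestamp + timezone" field is never ambiguous.

```python
from datetime import datetime
from typing import List


def build_escalation_packet(
    where_tested: str,
    result: str,
    execution_result: str,
    already_tried: List[str],
    started_at: datetime,
) -> str:
    """Render the escalation packet as copy/paste-ready text, one field per line."""
    if started_at.tzinfo is None:
        # A naive datetime would drop the timezone the template asks for.
        raise ValueError("started_at must be timezone-aware")
    lines = [
        "Issue: Agent appears offline / container possibly down",
        f"Where tested: {where_tested}",
        'Liveness prompt used: "Reply with exactly: OK-ALIVE"',
        f"Result: {result}",
        f"Execution test result: {execution_result}",
        f"What I already tried: {', '.join(already_tried)}",
        f"Time started: {started_at.isoformat()}",
    ]
    return "\n".join(lines)
```

For example, calling it with `where_tested="dashboard"`, `result="typing forever"`, and `started_at=datetime.now(timezone.utc)` (with `timezone` imported from `datetime`) yields a seven-line packet ready to paste into a support thread.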

Bottom line: Most “my agent died” cases are recoverable. A tiny liveness test, execution isolation, and a controlled restart resolve the majority of them quickly.