Cold-boot recovery

If a model has been idle, the first request returns 503 modal_cold_boot with a Retry-After header and a retry_after_seconds body field. Wait and retry — a few minutes is normal, 5–13 min can happen for 405B.

curl

# curl with manual retry — read Retry-After from -D headers.txt:
until curl -fs -D headers.txt "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "llama-405b", "prompt": "Hello", "max_tokens": 4}'; do
  sleep "$(awk '/^[Rr]etry-[Aa]fter:/ {print $2+0}' headers.txt)"
done

Python

import time
from openai import OpenAI, APIStatusError

while True:
    try:
        resp = client.completions.create(model="llama-405b", prompt="Hello", max_tokens=4)
        break
    except APIStatusError as e:
        if e.status_code == 503 and (e.response.headers.get("retry-after") or "").isdigit():
            time.sleep(int(e.response.headers["retry-after"]))
            continue
        raise

Gotcha

Retry indefinitely on 503 modal_cold_boot, but cap retries on 503 circuit_open — that one means an underlying outage and the breaker may stay open for minutes.