# ACS base-model API — full documentation

Single-page copy of the entire tutorial, generated from the same source as the human docs at /tutorial. Sections follow the site's navigation order.

<!-- source: docs/overview.md (/tutorial/overview) -->

# Using the base-model API

These are **base models** — raw next-token prediction, no chat template, no instruction tuning — served over an OpenAI-compatible `/v1/completions` endpoint.

## What you get

- **Three base models** — a small one for quick tests plus two larger ones (see [Models](/tutorial/models)). Some are kept warm; others cold-start on first use.
- **Real next-token access** — arbitrary prefill/continuation, `logprobs` and `prompt_logprobs` for likelihood/surprisal and interpretability work, `echo`, and SSE streaming.
- **Full, honored sampling controls** — `temperature`, `top_p`, `top_k`, `min_p`, penalties, and a respected `seed` for reproducibility.
- **A reliable, strict API** — standard OpenAI-compatible `/v1/completions`, with strict parameter validation (a typo'd parameter fails loudly instead of silently defaulting) and clear, structured JSON errors.
- **Browser Workbench** — try prompts and manage API keys without writing code.
- **Per-key budgets & usage** — set token caps per key and track spend (see [Account](/tutorial/account)).
- **Community** — a Discord with #feedback and #bug-reports, plus a one-click Feedback button in the Workbench.

## Two ways in

- **Workbench** — prompt the models straight from your browser. Good for getting a feel before you write any code.
- **The API** (below) — for anything programmatic. Create a key from your dashboard.

## Quick start

Create a key in your dashboard, then point any OpenAI-compatible client at the API:

```bash
export ACS_API_KEY="acs-bm-..."          # your key
export ACS_API_BASE="https://base-models.acsresearch.org/v1"
```

A first request with `curl`:

```bash
curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "The capital of France is", "max_tokens": 8}'
```

Or with the Python SDK (`pip install openai`):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["ACS_API_BASE"],
    api_key=os.environ["ACS_API_KEY"],
)

resp = client.completions.create(   # completions — not chat.completions
    model="llama-8b",
    prompt="The capital of France is",
    max_tokens=16,
    logprobs=5,
)
print(resp.choices[0].text)
```

## Worked examples

Short, copy-pasteable end-to-end snippets for each feature live under [Examples](/tutorial/examples) — one page per topic. The curl examples assume `ACS_API_KEY` + `ACS_API_BASE` are exported (see [Quick start](#quick-start)); the Python examples use the same `openai` SDK client as above.

Start with [`logprobs`](/tutorial/examples/logprobs) if you're new — it's the smallest end-to-end example. The harder ones (with the most gotchas) are [`prompt_logprobs`](/tutorial/examples/prompt-logprobs) and [Cold-boot recovery](/tutorial/examples/cold-boot).

<!-- source: docs/models.md (/tutorial/models) -->

# Models

The wrapper currently fronts three base-model checkpoints. Pick `llama-8b` if you're just trying it out — it has the shortest cold-boot.

## Available models

<table>
  <thead>
    <tr><th><code>model</code></th><th>Checkpoint</th><th>Precision</th></tr>
  </thead>
  <tbody>
    <tr><td><code>llama-8b</code></td><td><code>meta-llama/Llama-3.1-8B</code></td><td>bf16</td></tr>
    <tr><td><code>llama-405b</code></td><td><code>meta-llama/Llama-3.1-405B</code></td><td>bf16</td></tr>
    <tr><td><code>trinity-base</code></td><td><code>arcee-ai/Trinity-Large-TrueBase</code></td><td>bf16</td></tr>
  </tbody>
</table>

## Availability & cold starts

Models that get steady use are usually kept warm, so requests start immediately. Less-used or larger models sleep when idle to save GPU time and cold-start on the first request after a quiet spell — usually a few minutes, worst case up to ~10 minutes (occasionally longer for the largest models when GPUs are scarce). Once warm, a model answers in seconds and stays warm while you keep using it.

If a first request returns `modal_cold_boot`, wait for the `Retry-After` hint and retry — see [Cold-boot recovery](/tutorial/examples/cold-boot) for a copy-pasteable retry loop. For interactive work, send one short throwaway request to wake the model, then go; for batch jobs, let the first request absorb the wait.

## Always-on (warm windows)

If you'd rather have a model kept always-on for a stretch — a work session, a deadline — tell us via the **Feedback** button or [email](mailto:base-models@acsresearch.org), and we'll schedule a warm window.

<!-- source: docs/api.md (/tutorial/api) -->

# API reference

OpenAI-compatible `/v1/completions`. Standard sampling controls + a handful of token-level extras for inspection work.

## What's supported

Standard OpenAI-style `/v1/completions` parameters, validated strictly (see below).

**Sampling:** `temperature` (0–2), `top_p` (0–1], `top_k` (`-1` disables, else ≥1), `min_p` (0–1), `presence_penalty` and `frequency_penalty` (−2–2), `repetition_penalty` (0–2], and `seed` (honored for reproducibility under sampling).

**Length & variants:** `max_tokens` (≥1), `n` and `best_of` (≤16 — request multiple continuations at once), and `stop` (a string or up to 4 strings).

> **Scoring without generating?** `max_tokens=0` is **not** supported (the minimum is 1). To score existing text rather than generate, use `prompt_logprobs` with `max_tokens=1` and `echo=true` — you get the per-prompt-position logprobs and just ignore the one extra generated token. See the [prompt_logprobs](/tutorial/examples/prompt-logprobs) and [echo](/tutorial/examples/echo) examples.

**Token-level / inspection work:**

- `logprobs=k` — top-*k* logprobs per generated token (`k ≤ 20`; ask if you need more). See [logprobs example](/tutorial/examples/logprobs).
- `prompt_logprobs=k` — top-*k* logprobs at each *prompt* position (the model's predictions, not the actual prompt token unless it was in the top *k*). See [prompt_logprobs example](/tutorial/examples/prompt-logprobs).
- `echo=true` — include the prompt in the response. See [echo example](/tutorial/examples/echo).
- `stream=true` (plus optional `stream_options`) — token-by-token SSE streaming. See [stream example](/tutorial/examples/stream).

Plus `user` — an optional free-form tag echoed back for your own bookkeeping.

**Using the OpenAI Python SDK?** A few of these are vLLM-native and aren't on the SDK's typed signature, so the SDK rejects them as direct keyword arguments *before the request leaves your machine*. Pass them through `extra_body={...}` instead: **`top_k`, `min_p`, `repetition_penalty`, and `prompt_logprobs`**. Everything else — `temperature`, `top_p`, `max_tokens`, `n`, `best_of`, `presence_penalty`, `frequency_penalty`, `seed`, `logprobs`, `echo`, `stream`, `stop` — works as a normal keyword argument.

```python
resp = client.completions.create(
    model="llama-8b",
    prompt="The capital of France is",
    temperature=0.7,            # standard kwarg
    logprobs=5,                 # standard kwarg
    extra_body={                # vLLM-native — must go here
        "top_k": 20,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "prompt_logprobs": 5,
    },
)
```

**Strict validation.** Unknown fields are rejected with a `400`, and out-of-range values fail loudly instead of silently clamping — e.g. `temperature=3` is a `400`, not a quiet reset to the default, and `logprobs: true` is rejected (pass a number, not a boolean). A typo'd parameter never silently changes your results.
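
A client-side pre-flight check can surface the same range errors before a request is sent. The sketch below mirrors the ranges listed above. `check_params` is a hypothetical helper, not part of the API or the SDK, and the lower bounds for `n`, `best_of`, and `logprobs` are assumptions (the docs state only the upper limits). It does not check for unknown fields, and the server remains the source of truth.

```python
from numbers import Real


def check_params(params: dict) -> list:
    """Return a list of problems; an empty list means everything looks in range."""
    problems = []

    def number(name, lo, hi, lo_open=False, integer=False):
        if name not in params:
            return
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int if integer else Real):
            problems.append(f"{name}: expected a number, got {value!r}")
            return
        too_low = value <= lo if lo_open else value < lo
        if too_low or value > hi:
            problems.append(f"{name}: {value!r} is out of range")

    number("temperature", 0, 2)
    number("top_p", 0, 1, lo_open=True)               # (0, 1]
    number("min_p", 0, 1)
    number("presence_penalty", -2, 2)
    number("frequency_penalty", -2, 2)
    number("repetition_penalty", 0, 2, lo_open=True)  # (0, 2]
    number("max_tokens", 1, float("inf"), integer=True)
    number("n", 1, 16, integer=True)                  # lower bound assumed
    number("best_of", 1, 16, integer=True)            # lower bound assumed
    number("logprobs", 0, 20, integer=True)           # lower bound assumed

    if "top_k" in params:
        top_k = params["top_k"]
        is_int = isinstance(top_k, int) and not isinstance(top_k, bool)
        if not is_int or not (top_k == -1 or top_k >= 1):
            problems.append(f"top_k: {top_k!r} must be -1 or an integer >= 1")
    return problems


print(check_params({"temperature": 3, "logprobs": True}))
```

Running this before a large batch job is a cheap way to catch a typo'd value without spending a request on a `400`.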

## What's not supported

- **No chat endpoint.** `/v1/chat/completions` returns `400` because there's no chat template. Most `openai` SDK examples call `client.chat.completions.create(...)`, so make sure you call `client.completions.create(...)` instead.

## Limits

- Each key has a **monthly token budget** (`prompt + completion` tokens, reset on the 1st, UTC). Exceed it and requests return `429 budget_exceeded`. [Email](mailto:base-models@acsresearch.org) to raise it.
- Up to **8 concurrent requests per key**, with no per-minute cap — large batch jobs are fine, extra requests just queue until a slot frees up. See [Batch rollouts](/tutorial/examples/batch-rollouts) for the canonical fan-out pattern.

## Errors

Every error body follows the OpenAI shape: `{"error": {"code": "...", "message": "...", "type": "..."}}`, with extra fields (such as `retry_after_seconds` on `503` responses) added where useful. Read `error.code` for programmatic handling.

<table>
  <thead>
    <tr><th>HTTP</th><th>Code</th><th>Meaning</th><th>What to do</th></tr>
  </thead>
  <tbody>
    <tr><td><code>400</code></td><td><code>invalid_request</code></td><td>Unknown field, bad type, or out-of-range sampling param (e.g. <code>temperature&gt;2</code>, <code>top_p&gt;1</code>, <code>logprobs&gt;20</code>)</td><td>The message names the offending field; fix and retry</td></tr>
    <tr><td><code>400</code></td><td><code>context_length_exceeded</code></td><td><code>prompt_tokens + max_tokens &gt; max_model_len</code></td><td>Reduce the prompt or <code>max_tokens</code>; check <code>/v1/models</code> for the per-model limit</td></tr>
    <tr><td><code>400</code></td><td><code>chat_completions_unsupported</code></td><td>You hit <code>/v1/chat/completions</code></td><td>Use <code>/v1/completions</code> — these are base models, no chat template</td></tr>
    <tr><td><code>400</code></td><td><code>model_not_found</code></td><td>Unknown <code>model</code> id</td><td>Use a short id from <code>/v1/models</code> (not the HF repo name)</td></tr>
    <tr><td><code>400</code></td><td><code>bad_json</code></td><td>Request body wasn't valid JSON</td><td>Fix the JSON</td></tr>
    <tr><td><code>401</code></td><td><code>invalid_api_key</code></td><td>Key missing, wrong, paused, or revoked</td><td>Check <code>ACS_API_KEY</code> in your account settings</td></tr>
    <tr><td><code>429</code></td><td><code>budget_exceeded</code></td><td>Monthly / daily / input / output token budget hit</td><td>Wait for the reset, or ask for more. See <a href="/tutorial/examples/budget-cap">Budget-cap recovery</a></td></tr>
    <tr><td><code>502</code></td><td><code>upstream_unreachable</code></td><td>Wrapper couldn't reach the model server (DNS / connection / read timeout) after retries</td><td>Retry shortly; persistent failures are an outage — report it</td></tr>
    <tr><td><code>502</code></td><td><code>vllm_oom</code></td><td>Upstream model server ran out of GPU memory</td><td>Retry with a smaller prompt / <code>max_tokens</code> / lower <code>n</code></td></tr>
    <tr><td><code>502</code></td><td><code>vllm_context_length</code></td><td>Upstream enforced its context-length limit (rare — the wrapper usually catches this as <code>context_length_exceeded</code> first)</td><td>Reduce prompt / <code>max_tokens</code></td></tr>
    <tr><td><code>502</code></td><td><code>vllm_engine_dead</code></td><td>Upstream vLLM engine crashed</td><td>Retry; if it persists the model is down — check status or report</td></tr>
    <tr><td><code>502</code></td><td><code>upstream_server_error</code></td><td>Other upstream 5xx after retries</td><td>Retry; check <code>error.upstream_status</code> for the original code</td></tr>
    <tr><td><code>503</code></td><td><code>modal_cold_boot</code></td><td>Model container is starting. <code>Retry-After</code> header + <code>retry_after_seconds</code> field give a recommended delay.</td><td>Wait the suggested interval and retry; a few minutes is normal after scale-to-zero, and 5–13 min total can happen for large models. See <a href="/tutorial/examples/cold-boot">Cold-boot recovery</a></td></tr>
    <tr><td><code>503</code></td><td><code>circuit_open</code></td><td>Backend is in a circuit-breaker open state after repeated failures</td><td>Use <code>retry_after_seconds</code>; if the model is critical, contact us</td></tr>
  </tbody>
</table>
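
The "What to do" column above can be folded into a small lookup for programmatic handling. This is a minimal sketch: the function name and the action labels are hypothetical, chosen only for illustration, and the grouping is one reading of the table rather than an official policy.

```python
# Group the documented error codes by the kind of response they call for.
WAIT_FOR_RETRY_AFTER = {"modal_cold_boot", "circuit_open"}
RETRY_SHORTLY = {"upstream_unreachable", "vllm_engine_dead", "upstream_server_error"}
SHRINK_REQUEST = {"context_length_exceeded", "vllm_oom", "vllm_context_length"}
FIX_REQUEST = {
    "invalid_request",
    "chat_completions_unsupported",
    "model_not_found",
    "bad_json",
    "invalid_api_key",
}
DO_NOT_RETRY = {"budget_exceeded"}


def suggested_action(code: str) -> str:
    """Map an error.code value to a coarse action label."""
    if code in WAIT_FOR_RETRY_AFTER:
        return "wait_for_retry_after"
    if code in RETRY_SHORTLY:
        return "retry_shortly"
    if code in SHRINK_REQUEST:
        return "shrink_request"
    if code in FIX_REQUEST:
        return "fix_request"
    if code in DO_NOT_RETRY:
        return "do_not_retry"
    return "unknown"  # a code not listed in the table


print(suggested_action("vllm_oom"))
```

Keying on `error.code` rather than the HTTP status matters here because several codes share a status (`502` and `503` each cover more than one case).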

## Request headers

- `Authorization: Bearer <ACS_API_KEY>` — required.
- `Content-Type: application/json` — for the JSON body.
- `X-Acs-Workload: interactive | batch` — optional usage tag (telemetry only; it does not change scheduling or priority). Unrecognised values are ignored. See [Batch rollouts](/tutorial/examples/batch-rollouts).

## Response headers

Reliability hints on the response itself:

- `X-Request-Id` — a unique id for the request; quote it in bug reports so we can trace it in our logs.
- `X-Acs-Upstream-Model` / `X-Acs-Upstream-Gpu` — which backend served (or failed) this request. Cite this in bug reports.
- `X-Acs-Upstream-Error-Kind` — set on upstream-error 5xx responses (mirrors `error.code`).
- `X-Acs-Max-Tokens-Clamped` — present when the wrapper reduced your `max_tokens` to fit a budget; format `requested=N,applied=M,reason=budget`.
- `Retry-After` — RFC-7231 header on `503` cold-boot / circuit-open responses.
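
The `X-Acs-Max-Tokens-Clamped` value is a comma-separated list of `key=value` pairs, so it can be unpacked with plain string handling. A minimal sketch, assuming the header always carries the three documented keys; `parse_clamped` is a hypothetical helper and the numbers in the example are made up.

```python
def parse_clamped(header_value: str) -> dict:
    """Parse 'requested=N,applied=M,reason=budget' into a dict."""
    fields = {}
    for part in header_value.split(","):
        key, _, value = part.strip().partition("=")
        fields[key] = value
    return {
        "requested": int(fields["requested"]),
        "applied": int(fields["applied"]),
        "reason": fields["reason"],
    }


info = parse_clamped("requested=512,applied=128,reason=budget")
print(info["requested"] - info["applied"], "tokens were trimmed")
```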

## Privacy

API requests to `/v1/completions` are logged as **metadata only** — key prefix, email, IP, endpoint, model, token counts, status, latency — which we use to run the service and prevent abuse. We don't log your prompts or completions. Note this might change: we may start logging API requests for abuse-prevention reasons. We will not train on them. (Keep your own copies if you need them for reproducibility.)

The one exception is the Workbench: prompts you save there are stored server-side, so your sessions persist across browsers. Don't keep anything there you wouldn't want stored.

## Getting help

Email [base-models@acsresearch.org](mailto:base-models@acsresearch.org), or use the in-app **Feedback** button if you're signed in. To help us trace a specific request, include the rough time you made it, your key prefix or name (found in your dashboard — *not* the secret), and the error body.

<!-- source: docs/account.md (/tutorial/account) -->

# Account

Where to find your keys, your spend, and your monthly budget.

## API keys

You create keys from the Dashboard once your account is approved. Each key is shown **once** at creation time — copy it then; we only ever store a hash. If you lose it, revoke the key and create a new one.

Keys can be **paused** (reversibly disabled) or **revoked** (permanent). Paused keys come back with one click; revoked keys are gone.

## Budgets

- **Per-key monthly budget** — a token cap (`prompt + completion`) that resets on the 1st (UTC). You can set this per key when you create it; "unbounded" means no per-key sub-cap (the account-level total still applies).
- **Account total** — an aggregate cap across all your keys. Set by an admin; visible on the dashboard. A per-key cap can't exceed the account total.

When you blow through either, requests come back as `429 budget_exceeded`. [Budget-cap recovery](/tutorial/examples/budget-cap) shows the pattern for handling this in code; the short version is: don't retry, the cap is sticky until the 1st.

## Where things live

- **Dashboard** — keys, this-month's spend per key, create / pause / revoke buttons.
- **Usage** — usage chart with per-model breakdown for the current month, plus the running totals you'll need for budget planning.
- **Workbench** — in-browser prompt UI; saved prompts persist across browsers.

Once you're signed in, all three pages are linked from the top nav.

## Asking for more

If you need a bigger budget, an always-on warm window, or access to a model that isn't on the list — use the in-app **Feedback** button or [email base-models@acsresearch.org](mailto:base-models@acsresearch.org). We read everything; please include your key prefix or name so we can find your account.

<!-- source: docs/examples.md (/tutorial/examples) -->

# Examples

Short, copy-pasteable end-to-end snippets for each feature. The curl examples assume `ACS_API_KEY` + `ACS_API_BASE` are exported (see [Quick start](/tutorial/overview#quick-start)); the Python examples use the same `openai` SDK client.

Each page below shows one shared **example response** after the curl and Python snippets: the two calls return the same JSON, and the SDK just wraps it in typed objects. *Heads-up on reproducibility:* the example responses were captured with `seed=0` appended to the request, but the displayed requests don't pin a seed, so under the default `temperature=1.0` your output will usually differ from the one shown. Add a `seed` to your own request to reproduce a specific run, or set `temperature=0` for greedy decoding. Always-null fields (`service_tier`, `system_fingerprint`, `kv_transfer_params`, etc.) are omitted from the shown responses for brevity.

## Token-level inspection

- [`logprobs`](/tutorial/examples/logprobs) — top-*k* logprobs per generated token.
- [`prompt_logprobs`](/tutorial/examples/prompt-logprobs) — top-*k* logprobs at each *prompt* position (gotchas around rank-vs-actual-token).
- [`echo`](/tutorial/examples/echo) — include the prompt in the response.

## Streaming + concurrency

- [`stream`](/tutorial/examples/stream) — SSE token-by-token.
- [Batch rollouts](/tutorial/examples/batch-rollouts) — 8 concurrent per key, with the canonical `asyncio.gather` pattern.

## Recovery patterns

- [Cold-boot recovery](/tutorial/examples/cold-boot) — handle `503 modal_cold_boot` + `Retry-After`.
- [Budget-cap recovery](/tutorial/examples/budget-cap) — handle `429 budget_exceeded` (the non-retryable kind).

<!-- source: docs/examples/logprobs.md (/tutorial/examples/logprobs) -->

# logprobs

Ask for the top-*k* alternative tokens at each generated position (`k ≤ 20`). Useful for calibration, classification by next-token probabilities, or just inspecting what the model "almost said".

## curl

```bash
curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "The capital of France is", "max_tokens": 1, "logprobs": 5}'
```

## Python

```python
resp = client.completions.create(
    model="llama-8b",
    prompt="The capital of France is",
    max_tokens=1,
    logprobs=5,
)
choice = resp.choices[0]
print(choice.text, choice.logprobs.top_logprobs[0])  # {' a': -1.74, ' Paris': -1.99, ...}
```

## Example response

```json
{
  "id": "cmpl-91ea94c6ed0e5b75",
  "object": "text_completion",
  "model": "meta-llama/Llama-3.1-8B",
  "choices": [
    {
      "index": 0,
      "text": " a",
      "logprobs": {
        "text_offset": [0],
        "tokens": [" a"],
        "token_logprobs": [-1.7437232732772827],
        "top_logprobs": [
          {
            " a": -1.7437232732772827,
            " Paris": -1.9937232732772827,
            " one": -2.5562233924865723,
            " the": -2.5562233924865723,
            " also": -3.1187233924865723
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 6, "completion_tokens": 1, "total_tokens": 7}
}
```

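Note that base `llama-8b` produced `" a"` rather than `" Paris"`: the model is a continuation predictor, not a Q&A assistant, so it continues the sentence plausibly rather than answering. `" Paris"` is the second-most likely token, only 0.25 logprob behind. Greedy decoding (`temperature=0`) would pick `" a"` every time, because it is the top token; sampling at the default `temperature=1.0` lands on `" Paris"` in a sizeable share of runs, and a chat-tuned model would likely favor it more strongly.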
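
Logprobs are natural-log probabilities, so `exp` turns them back into probabilities. The sketch below does that for the `top_logprobs` entry from the example response; the variable names are illustrative only.

```python
import math

# top_logprobs[0] copied from the example response above.
top_logprobs = {
    " a": -1.7437232732772827,
    " Paris": -1.9937232732772827,
    " one": -2.5562233924865723,
    " the": -2.5562233924865723,
    " also": -3.1187233924865723,
}

# exp(logprob) recovers the probability of each candidate token.
probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}
for token, p in probs.items():
    print(f"{token!r}: {p:.3f}")

# A 0.25 logprob gap is a ratio of exp(0.25), roughly 1.28x.
ratio = probs[" a"] / probs[" Paris"]
# The top-5 do not cover the whole distribution; the rest is spread over other tokens.
covered = sum(probs.values())
print(f"ratio={ratio:.2f}  top-5 mass={covered:.2f}")
```

Here `" a"` comes out at roughly 0.17 and `" Paris"` at roughly 0.14, which is why either can appear under sampling.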

## Gotcha

`logprobs` is an integer (top-*k*), not a boolean — passing `true` returns `400 invalid_request`. The same rule applies to every numeric sampler param: `max_tokens`, `n`, `best_of`, `seed`, `top_k`, `temperature`, `top_p`, `min_p`, `presence_penalty`, `frequency_penalty`, `repetition_penalty`, and `prompt_logprobs` all reject `true`/`false` so a typo like `max_tokens: true` fails loudly instead of silently being interpreted as `1`. (`echo` and `stream` are the only sampler fields that legitimately take booleans.)

<!-- source: docs/examples/prompt-logprobs.md (/tutorial/examples/prompt-logprobs) -->

# prompt_logprobs

Return the model's top-*k* logprob predictions at each prompt position — useful for inspecting what the model expected at each step, and (with extra care, see the Gotcha) a building block for scoring how likely a given piece of text is under the model.

## curl

```bash
curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b",
       "prompt": "The capital of France is",
       "max_tokens": 1,
       "prompt_logprobs": 5}' | jq '.choices[0].prompt_logprobs[:2]'
```

## Python

```python
resp = client.completions.create(
    model="llama-8b",
    prompt="The capital of France is",
    max_tokens=1,
    extra_body={"prompt_logprobs": 5},
)

# resp.choices[0].prompt_logprobs[i] is a {token_id: {logprob, rank, decoded_token}}
# dict for prompt position i, or None for the very first position (no
# conditioning context). Here we just print the top-5 at the first few positions.
for i, choices in enumerate(resp.choices[0].prompt_logprobs[:3]):
    if choices is None:
        continue
    print(f"position {i}:")
    for token_id, info in choices.items():
        print(f"  rank={info['rank']:>2}  logprob={info['logprob']:+.3f}  {info['decoded_token']!r}")
```

## Example response

Just `choices[0].prompt_logprobs[:2]` (what the curl `jq` filter shows):

```json
[
  null,
  {
    "14924": {"logprob": -1.179518699645996, "rank": 1, "decoded_token": "Question"},
    "755":   {"logprob": -2.179518699645996, "rank": 2, "decoded_token": "def"},
    "2":     {"logprob": -2.742018699645996, "rank": 3, "decoded_token": "#"},
    "791":   {"logprob": -3.679518699645996, "rank": 4, "decoded_token": "The"},
    "16309": {"logprob": -4.179518699645996, "rank": 5, "decoded_token": "Tags"}
  }
]
```

Position 0 is `null` (no conditioning context for the first token). At position 1 the model's top-1 prediction is `"Question"`, not the actual prompt token `"The"` — which here happens to show up at rank 4. If your prompt token isn't in the top-*k*, vLLM still appends it as an extra entry with `rank > k` so you can recover its logprob; see the Gotcha for how to use that to score a sequence.

## Gotcha

`prompt_logprobs` isn't on the OpenAI SDK's typed signature — pass it via `extra_body={...}`. It appears as a vLLM-native top-level field on each `choice` (alongside `logprobs`); each non-None entry is a `{token_id: {logprob, rank, decoded_token}}` dict, and the first position is `None` (nothing to condition on). `rank=1` is the model's top-1 prediction at that position, *not* necessarily the actual prompt token — if your prompt token wasn't in the top-*k*, vLLM appends it as an extra entry with `rank > k`. To compute the actual sequence log-likelihood of your prompt, set `echo=true` to recover the prompt tokens, then for each position pick the entry whose `decoded_token` matches the prompt token (or use a generous `prompt_logprobs=20` and fall back to the appended "extra" entry when the actual token wasn't in the top-*k*). The wrapper schema requires `max_tokens >= 1`, so set it to 1 and ignore the one extra generated token.
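
As a variation on the `decoded_token` matching described above, the sketch below sums the logprob of each actual prompt token by looking it up by token id. It assumes you already have the prompt's token ids (for example from tokenizing locally with the checkpoint's tokenizer, a step not shown here) and that the list includes the leading BOS position. `sequence_logprob` is a hypothetical helper; the sample data is position 0 and 1 from the example response, and the BOS id shown is an assumption that is never looked up.

```python
def sequence_logprob(prompt_logprobs, prompt_token_ids):
    """Sum the logprob of each actual prompt token, skipping the first position."""
    total = 0.0
    for position, (entry, token_id) in enumerate(zip(prompt_logprobs, prompt_token_ids)):
        if entry is None:  # first position: nothing to condition on
            continue
        info = entry.get(str(token_id))  # JSON object keys are strings
        if info is None:
            raise KeyError(f"token id {token_id} missing at position {position}")
        total += info["logprob"]
    return total


# First two positions from the example response above.
sample = [
    None,
    {
        "14924": {"logprob": -1.179518699645996, "rank": 1, "decoded_token": "Question"},
        "755": {"logprob": -2.179518699645996, "rank": 2, "decoded_token": "def"},
        "2": {"logprob": -2.742018699645996, "rank": 3, "decoded_token": "#"},
        "791": {"logprob": -3.679518699645996, "rank": 4, "decoded_token": "The"},
        "16309": {"logprob": -4.179518699645996, "rank": 5, "decoded_token": "Tags"},
    },
]
token_ids = [128000, 791]  # [assumed BOS id (skipped), "The"]
print(sequence_logprob(sample, token_ids))
```

Matching by id avoids ambiguity when two different tokens decode to the same string, at the cost of needing the tokenizer locally.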

<!-- source: docs/examples/echo.md (/tutorial/examples/echo) -->

# echo

Prepends the prompt to the generated text in `choices[0].text`. Combined with `logprobs` + `max_tokens=1` (the wrapper's minimum) and `prompt_logprobs`, this gives you a request that's almost entirely about scoring existing text rather than generating new text — only one token is actually generated.

## curl

```bash
curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Once upon a time", "max_tokens": 8, "echo": true}'
```

## Python

```python
resp = client.completions.create(
    model="llama-8b",
    prompt="Once upon a time",
    max_tokens=8,
    echo=True,
)
print(resp.choices[0].text)  # "Once upon a time there was a man who was very poor"
```

## Example response

```json
{
  "id": "cmpl-b986142b3fd7d215",
  "object": "text_completion",
  "model": "meta-llama/Llama-3.1-8B",
  "choices": [
    {
      "index": 0,
      "text": "Once upon a time there was a man who was very poor",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 5, "completion_tokens": 8, "total_tokens": 13}
}
```

Note that `text` contains both the prompt and the 8 generated tokens, concatenated with no separator. `usage.completion_tokens` counts only the generated portion, so you can slice by token count or simply by the character length of your original prompt. (`prompt_tokens=5`, not 4, because the tokenizer auto-prepends a `<|begin_of_text|>` BOS token to Llama prompts.)

## Gotcha

`echo` only mirrors the prompt back; it doesn't add a separator, so split by length if you need just the continuation.
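
Splitting by length can be sketched as a one-line slice with a sanity check. `continuation_only` is a hypothetical helper, and the strings are the ones from the example response above.

```python
def continuation_only(text: str, prompt: str) -> str:
    """Strip the echoed prompt from the front of an echo=true response."""
    if not text.startswith(prompt):
        raise ValueError("response text does not start with the prompt")
    return text[len(prompt):]  # slice by character length, not bytes


echoed = "Once upon a time there was a man who was very poor"
print(repr(continuation_only(echoed, "Once upon a time")))
```

The `startswith` guard is there so a mismatched prompt fails loudly instead of silently cutting the wrong characters.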

<!-- source: docs/examples/stream.md (/tutorial/examples/stream) -->

# stream

Server-sent events: each chunk is a partial completion, and the stream is terminated by `data: [DONE]`. Set the request-level `timeout` high enough to span the full generation.

## curl

```bash
curl -N -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Five reasons to learn Rust:\n1.", "max_tokens": 80,
       "stream": true, "stream_options": {"include_usage": true}}'
```

## Python

```python
stream = client.completions.create(
    model="llama-8b",
    prompt="Five reasons to learn Rust:\n1.",
    max_tokens=80,
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:                            # final usage chunk has choices=[]
        print(chunk.choices[0].text, end="", flush=True)
    elif chunk.usage:
        print(f"\n[tokens: {chunk.usage.total_tokens}]")
```

## Example response

First few SSE frames — the wire format; the Python SDK parses each `data:` payload into a chunk object. (Per-chunk `logprobs: null` and `stop_reason: null` keys are omitted here for brevity — they're present in the real wire output.)

```text
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" It","finish_reason":null}],"usage":null}

data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":"’s","finish_reason":null}],"usage":null}

data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" fast","finish_reason":null}],"usage":null}

data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":".\n","finish_reason":null}],"usage":null}

... (one chunk per token until max_tokens or stop) ...

data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" again","finish_reason":"length"}],"usage":null}

data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":80,"total_tokens":89}}

data: [DONE]
```

The final content chunk carries `finish_reason` alongside its token (not on an empty text). Then — because the request set `stream_options.include_usage: true` — a terminal chunk arrives with `choices: []` and a populated `usage` object, where you get the token counts for a streamed completion. Then `[DONE]` closes the stream. Drop `stream_options` and you'll skip the usage chunk and get only the content frames (vLLM may emit usage permissively; the OpenAI spec requires the opt-in).

## Gotcha

Use `curl -N` to disable buffering, otherwise you'll see the whole response arrive at once. Each SSE chunk's `finish_reason` is `null` until the last one.
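
If you read the wire format yourself instead of using the SDK, the frames can be handled with a few lines of parsing. This is a minimal sketch: `read_sse` is a hypothetical helper that accepts any iterable of text lines, and the sample frames are trimmed versions of the ones shown above with invented usage counts.

```python
import json


def read_sse(lines):
    """Collect generated text and the optional usage object from SSE lines."""
    text_parts, usage = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and anything that is not a data frame
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk["choices"]:
            text_parts.append(chunk["choices"][0]["text"])
        elif chunk.get("usage"):
            usage = chunk["usage"]  # terminal chunk when include_usage is set
    return "".join(text_parts), usage


frames = [
    'data: {"choices":[{"index":0,"text":" It","finish_reason":null}],"usage":null}',
    "",
    'data: {"choices":[{"index":0,"text":" again","finish_reason":"length"}],"usage":null}',
    "",
    'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":2,"total_tokens":11}}',
    "",
    "data: [DONE]",
]
text, usage = read_sse(frames)
print(repr(text), usage)
```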

<!-- source: docs/examples/batch-rollouts.md (/tutorial/examples/batch-rollouts) -->

# Batch rollouts

The per-key concurrency cap is 8; `asyncio.gather` over a single `AsyncOpenAI` client gives the maximum throughput without extra credentials.

## curl

```bash
# Shell version — xargs -P 8 fans out 8 concurrent requests:
seq 1 32 | xargs -I{} -P 8 curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Sample {}: once upon a time", "max_tokens": 32}'
```

## Python

```python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url=os.environ["ACS_API_BASE"], api_key=os.environ["ACS_API_KEY"])
prompts = [f"Sample {i}: once upon a time" for i in range(32)]

async def one(p):
    r = await client.completions.create(model="llama-8b", prompt=p, max_tokens=32)
    return r.choices[0].text

async def _main():
    # asyncio.gather() must be called from inside a running event loop, so
    # wrap it in a coroutine and hand THAT to asyncio.run(...).
    return await asyncio.gather(*(one(p) for p in prompts))

results = asyncio.run(_main())
```

## Gotcha

Requests beyond the 8th queue server-side rather than returning a 429. That is fine for batch jobs, but don't expect 32 parallel requests to all start at once. If your own code fans out many requests at once, cap it with a `Semaphore(8)` so they don't all pile up against the per-key limit.
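
The `Semaphore(8)` cap can be sketched as a thin wrapper around `asyncio.gather`. `bounded_gather` is a hypothetical helper, and `fake_request` is a stand-in for a real `client.completions.create(...)` call so the snippet runs without network access.

```python
import asyncio


async def bounded_gather(make_coro, items, limit=8):
    """Run make_coro(item) for every item, with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:  # waits here while `limit` calls are already running
            return await make_coro(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run(item) for item in items))


async def fake_request(i):
    await asyncio.sleep(0.01)  # stand-in for the network round trip
    return f"result {i}"


results = asyncio.run(bounded_gather(fake_request, range(32)))
print(len(results), results[0])
```

To use it with the API, pass the `one` coroutine function and the `prompts` list from the Python example in place of `fake_request` and `range(32)`.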

## Tag your workload (optional)

Add an `X-Acs-Workload: batch` request header to batch jobs (or `interactive` for live, latency-sensitive calls). It's purely a usage signal we record to understand traffic and plan capacity / keep-warm policy — it does *not* change how your request is scheduled or prioritised, and an unrecognised value is simply ignored. The in-browser Workbench is tagged `interactive` automatically.
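
For code that doesn't use the SDK, the header is just one more entry in the request headers. The sketch below builds (but does not send) a request with the standard library; `build_request` is a hypothetical helper and the key is a placeholder.

```python
import json
import urllib.request


def build_request(api_base, api_key, payload, workload=None):
    """Build a POST to /completions, optionally tagged with X-Acs-Workload."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if workload is not None:
        headers["X-Acs-Workload"] = workload  # "interactive" or "batch"
    return urllib.request.Request(
        f"{api_base}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_request(
    "https://base-models.acsresearch.org/v1",
    "acs-bm-placeholder",  # placeholder, not a real key
    {"model": "llama-8b", "prompt": "Sample 1: once upon a time", "max_tokens": 32},
    workload="batch",
)
# urllib.request.urlopen(req) would send it; it is not called here.
print(req.get_method(), req.full_url)
```

With the `openai` SDK, the per-request `extra_headers={"X-Acs-Workload": "batch"}` argument should carry the same tag.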

<!-- source: docs/examples/cold-boot.md (/tutorial/examples/cold-boot) -->

# Cold-boot recovery

If a model has been idle, the first request returns `503 modal_cold_boot` with a `Retry-After` header and a `retry_after_seconds` body field. Wait and retry — a few minutes is normal, 5–13 min can happen for 405B.

## curl

```bash
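# curl with manual retry: read Retry-After from -D headers.txt.
# Note: -f makes curl fail on any HTTP error, so this loop also retries
# 4xx responses; use it only for waking a cold model.
until curl -fs -D headers.txt "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "llama-405b", "prompt": "Hello", "max_tokens": 4}'; do
  delay="$(awk '/^[Rr]etry-[Aa]fter:/ {print $2+0}' headers.txt)"
  sleep "${delay:-30}"   # assumed 30 s fallback if the header is missing
done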
```

## Python

```python
import time
from openai import OpenAI, APIStatusError

while True:
    try:
        resp = client.completions.create(model="llama-405b", prompt="Hello", max_tokens=4)
        break
    except APIStatusError as e:
        if e.status_code == 503 and (e.response.headers.get("retry-after") or "").isdigit():
            time.sleep(int(e.response.headers["retry-after"]))
            continue
        raise
```

## Gotcha

Retry indefinitely on `503 modal_cold_boot`, but cap retries on `503 circuit_open` — that one means an underlying outage and the breaker may stay open for minutes.
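
That policy can be sketched as a small pure function that decides how long to sleep, or whether to give up. `retry_delay` is a hypothetical helper; the cap of 5 circuit-breaker retries and the 30-second fallback are assumptions for illustration, not documented values.

```python
def retry_delay(code, retry_after, circuit_failures, max_circuit_retries=5, fallback=30):
    """Return seconds to wait before retrying, or None to stop retrying.

    code:             error.code from the 503 body
    retry_after:      Retry-After / retry_after_seconds as an int, or None if absent
    circuit_failures: how many circuit_open responses have been seen so far
    """
    if code == "modal_cold_boot":
        # Cold boots resolve on their own, so keep waiting for as long as it takes.
        return fallback if retry_after is None else retry_after
    if code == "circuit_open" and circuit_failures < max_circuit_retries:
        return fallback if retry_after is None else retry_after
    return None  # retry cap reached, or an error that retrying won't fix


print(retry_delay("modal_cold_boot", 20, 0))
print(retry_delay("circuit_open", 60, 5))
```

Keeping the decision separate from the `time.sleep` call makes the policy easy to unit-test without waiting.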

<!-- source: docs/examples/budget-cap.md (/tutorial/examples/budget-cap) -->

# Budget-cap recovery

When your monthly token budget runs out, requests return `429 budget_exceeded`. There is no automatic refill mid-month; the cap resets on the 1st (UTC).

## curl

```bash
curl -i "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Hi", "max_tokens": 8}'
# HTTP/1.1 429 Too Many Requests
# {"error": {"code": "budget_exceeded", "message": "Monthly token budget exhausted.", ...}}
```

## Python

```python
from openai import APIStatusError

try:
    resp = client.completions.create(model="llama-8b", prompt="Hi", max_tokens=8)
except APIStatusError as e:
    # On the wire the code lives under body["error"]["code"]. Depending on the
    # openai SDK version, e.body may be the full body or the already-unwrapped
    # "error" object, so accept both shapes instead of relying on e.code.
    body = e.body if isinstance(e.body, dict) else {}
    err = body["error"] if isinstance(body.get("error"), dict) else body
    if e.status_code == 429 and err.get("code") == "budget_exceeded":
        # Don't retry — the cap is sticky until the 1st of next month UTC.
        # Email base-models@acsresearch.org to raise the budget mid-month.
        raise SystemExit("Budget exhausted — emailing the team for a raise.")
    raise
```

## Gotcha

`429 budget_exceeded` is *not* a transient rate-limit — exponential backoff just burns time. Inspect `error.code` and short-circuit non-retryable cases (concurrency limits, by contrast, simply queue).