API reference

OpenAI-compatible /v1/completions. Standard sampling controls + a handful of token-level extras for inspection work.

What's supported

Standard OpenAI-style /v1/completions parameters, validated strictly (see below).

Sampling: temperature (0–2), top_p (above 0, up to 1), top_k (-1 disables, otherwise ≥1), min_p (0–1), presence_penalty and frequency_penalty (−2 to 2), repetition_penalty (above 0, up to 2), and seed (honored for reproducibility under sampling).

Length & variants: max_tokens (≥1), n and best_of (≤16 — request multiple continuations at once), and stop (a string or up to 4 strings).

Scoring without generating? max_tokens=0 is not supported (the minimum is 1). To score existing text rather than generate, use prompt_logprobs with max_tokens=1 and echo=true — you get the per-prompt-position logprobs and just ignore the one extra generated token. See the prompt_logprobs and echo examples.
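A minimal sketch of the scoring recipe above. The helper name scoring_kwargs is hypothetical, and the model id is the placeholder used elsewhere on this page. The sketch only builds the arguments; it makes no assumption about the shape of the response.

```python
def scoring_kwargs(model: str, text: str, top_k: int = 5) -> dict:
    """Build keyword arguments for client.completions.create(...) that score `text`."""
    return {
        "model": model,
        "prompt": text,
        "max_tokens": 1,  # the documented minimum; ignore the one generated token
        "echo": True,     # include the prompt in the response
        # prompt_logprobs is vLLM-native, so with the OpenAI SDK it goes in extra_body
        "extra_body": {"prompt_logprobs": top_k},
    }


kwargs = scoring_kwargs("llama-8b", "The capital of France is Paris.")
# resp = client.completions.create(**kwargs)  # assumes an SDK client set up for this API
print(kwargs)
```

Because prompt_logprobs returns the model's top-k predictions at each position, a prompt token that falls outside the top k will not have its own logprob in the result.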

Token-level / inspection work:

logprobs=k — top-k logprobs per generated token (k ≤ 20; ask if you need more). See logprobs example.
prompt_logprobs=k — top-k logprobs at each prompt position (the model's predictions, not the actual prompt token unless it was in the top k). See prompt_logprobs example.
echo=true — include the prompt in the response. See echo example.
stream=true (plus optional stream_options) — token-by-token SSE streaming. See stream example.

Plus user — an optional free-form tag echoed back for your own bookkeeping.

Using the OpenAI Python SDK? A few of these are vLLM-native and aren't on the SDK's typed signature, so the SDK rejects them as direct keyword arguments before the request leaves your machine. Pass them through extra_body={...} instead: top_k, min_p, repetition_penalty, and prompt_logprobs. Everything else — temperature, top_p, max_tokens, n, best_of, presence_penalty, frequency_penalty, seed, logprobs, echo, stream, stop — works as a normal keyword argument.

# `client` is an openai.OpenAI instance configured with this API's base URL and your key.
resp = client.completions.create(
    model="llama-8b",
    prompt="The capital of France is",
    temperature=0.7,            # standard kwarg
    logprobs=5,                 # standard kwarg
    extra_body={                # vLLM-native — must go here
        "top_k": 20,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "prompt_logprobs": 5,
    },
)
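If parameters arrive as one flat dict (from a config file, say), the split described above can be automated. This is a sketch only: split_for_sdk is a hypothetical helper, and the set of vLLM-native names is copied from the list on this page.

```python
# Parameters this page lists as vLLM-native (not on the SDK's typed signature).
VLLM_NATIVE = {"top_k", "min_p", "repetition_penalty", "prompt_logprobs"}


def split_for_sdk(params: dict) -> dict:
    """Move vLLM-native parameters into extra_body; leave the rest as plain kwargs."""
    kwargs = {k: v for k, v in params.items() if k not in VLLM_NATIVE}
    extra = {k: v for k, v in params.items() if k in VLLM_NATIVE}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


call_kwargs = split_for_sdk({
    "model": "llama-8b",
    "prompt": "The capital of France is",
    "temperature": 0.7,
    "top_k": 20,
    "min_p": 0.05,
})
# resp = client.completions.create(**call_kwargs)  # assumes an SDK client for this API
print(call_kwargs)
```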

Strict validation. Unknown fields are rejected with a 400, and out-of-range values fail loudly instead of silently clamping — e.g. temperature=3 is a 400, not a quiet reset to the default, and logprobs: true is rejected (pass a number, not a boolean). A typo'd parameter never silently changes your results.
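Since out-of-range values come back as a 400, it can be convenient to catch them before sending. The following is a client-side sketch that mirrors the ranges listed on this page; check_params is a hypothetical helper, not part of the API, and the server remains the source of truth. It assumes top_p and repetition_penalty exclude 0, as the bracket notation suggests.

```python
import math


def check_params(params: dict) -> list:
    """Return a list of problems; an empty list means every checked value is in range."""
    problems = []

    def number_in(name, lo, hi, lo_exclusive=False):
        if name not in params:
            return
        v = params[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            problems.append(f"{name}: expected a number, got {type(v).__name__}")
        elif v < lo or v > hi or (lo_exclusive and v == lo):
            problems.append(f"{name}: {v} is out of range")

    number_in("temperature", 0, 2)
    number_in("top_p", 0, 1, lo_exclusive=True)
    number_in("min_p", 0, 1)
    number_in("presence_penalty", -2, 2)
    number_in("frequency_penalty", -2, 2)
    number_in("repetition_penalty", 0, 2, lo_exclusive=True)
    number_in("max_tokens", 1, math.inf)
    # Lower bounds for these three are not stated here, so only the upper bound is checked.
    number_in("logprobs", -math.inf, 20)
    number_in("n", -math.inf, 16)
    number_in("best_of", -math.inf, 16)

    if "top_k" in params:
        k = params["top_k"]
        if isinstance(k, bool) or not isinstance(k, int) or not (k == -1 or k >= 1):
            problems.append(f"top_k: {k!r} must be -1 or an integer >= 1")

    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop: at most 4 strings")
    return problems


print(check_params({"temperature": 3, "logprobs": True}))
```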

What's not supported

No chat endpoint. /v1/chat/completions returns 400 (chat_completions_unsupported) because these are base models with no chat template. Most OpenAI SDK examples call client.chat.completions.create(...); use client.completions.create(...) instead.

Limits

Each key has a monthly token budget (prompt + completion tokens, reset on the 1st, UTC). Exceed it and requests return 429 budget_exceeded. Email us to raise it.
Up to 8 concurrent requests per key, with no per-minute cap — large batch jobs are fine, extra requests just queue until a slot frees up. See Batch rollouts for the canonical fan-out pattern.
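Because extra requests queue rather than fail, a client-side cap is optional, but matching the limit keeps the client from holding many idle connections. The sketch below is one generic way to do that with a thread pool; it is not the Batch rollouts pattern itself. fan_out is a hypothetical helper, and fake_send stands in for a real request so the example runs offline.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 8  # the documented per-key concurrency limit


def fan_out(send, prompts, max_in_flight=MAX_IN_FLIGHT):
    """Call send(prompt) for each prompt, at most max_in_flight at a time.

    Results come back in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, prompts))


# Stand-in for a real request; it records the peak number of concurrent calls.
_lock = threading.Lock()
_active = 0
peak = 0


def fake_send(prompt):
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # simulate network latency
    with _lock:
        _active -= 1
    return prompt.upper()


results = fan_out(fake_send, [f"prompt {i}" for i in range(32)])
print(len(results), "results, peak concurrency", peak)
```

In real use, send would wrap a call such as client.completions.create(...) and return the response.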

Errors

Every error body follows the OpenAI shape: {"error": {"code": "...", "message": "...", "type": "..."}} with extra fields tagged where useful. Read error.code for programmatic handling.

HTTP	Code	Meaning	What to do
`400`	`invalid_request`	Unknown field, bad type, or out-of-range sampling param (e.g. `temperature>2`, `top_p>1`, `logprobs>20`)	The message names the offending field; fix and retry
`400`	`context_length_exceeded`	`prompt_tokens + max_tokens > max_model_len`	Reduce the prompt or `max_tokens`; check `/v1/models` for the per-model limit
`400`	`chat_completions_unsupported`	You hit `/v1/chat/completions`	Use `/v1/completions` — these are base models, no chat template
`400`	`model_not_found`	Unknown `model` id	Use a short id from `/v1/models` (not the HF repo name)
`400`	`bad_json`	Request body wasn't valid JSON	Fix the JSON
`401`	`invalid_api_key`	Key missing, wrong, paused, or revoked	Check `ACS_API_KEY` in your account settings
`429`	`budget_exceeded`	Monthly / daily / input / output token budget hit	Wait for the reset, or ask for more. See Budget-cap recovery
`502`	`upstream_unreachable`	Wrapper couldn't reach the model server (DNS / connection / read timeout) after retries	Retry shortly; persistent failures are an outage — report it
`502`	`vllm_oom`	Upstream model server ran out of GPU memory	Retry with a smaller prompt / `max_tokens` / lower `n`
`502`	`vllm_context_length`	Upstream enforced its context-length limit (rare — the wrapper usually catches this as `context_length_exceeded` first)	Reduce prompt / `max_tokens`
`502`	`vllm_engine_dead`	Upstream vLLM engine crashed	Retry; if it persists the model is down — check status or report
`502`	`upstream_server_error`	Other upstream 5xx after retries	Retry; check `error.upstream_status` for the original code
`503`	`modal_cold_boot`	Model container is starting. `Retry-After` header + `retry_after_seconds` field give a recommended delay.	Wait the suggested interval and retry; a few minutes is normal after scale-to-zero, and 5–13 min total can happen for large models. See Cold-boot recovery
`503`	`circuit_open`	Backend is in a circuit-breaker open state after repeated failures	Use `retry_after_seconds`; if the model is critical, contact us
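One way to act on error.code programmatically is to map each code in the table to a coarse action. This is a sketch, not a recommended policy: next_step and the action names are hypothetical, the 5-second default delay is an arbitrary choice, and the location of retry_after_seconds (inside the error object, with a top-level fallback) is an assumption.

```python
FIX_REQUEST = {
    "invalid_request", "context_length_exceeded", "chat_completions_unsupported",
    "model_not_found", "bad_json", "invalid_api_key",
}
SHRINK_REQUEST = {"vllm_oom", "vllm_context_length"}
WAIT_FOR_HINT = {"modal_cold_boot", "circuit_open"}


def next_step(status: int, body: dict, default_delay: float = 5.0):
    """Return (action, delay_seconds) for an error response.

    action is one of "fix", "shrink", "stop", "retry"; delay is None unless retrying.
    """
    err = body.get("error") or {}
    code = err.get("code")
    if code in FIX_REQUEST:
        return ("fix", None)      # the request itself is wrong; retrying will not help
    if code in SHRINK_REQUEST:
        return ("shrink", None)   # retry with a smaller prompt, max_tokens, or n
    if code == "budget_exceeded":
        return ("stop", None)     # wait for the reset or ask for a higher budget
    if code in WAIT_FOR_HINT:
        # Assumption: the hint lives in the error object; fall back to the top level.
        hint = err.get("retry_after_seconds", body.get("retry_after_seconds", default_delay))
        return ("retry", float(hint))
    if status >= 500:
        return ("retry", default_delay)
    return ("fix", None)


print(next_step(503, {"error": {"code": "modal_cold_boot", "retry_after_seconds": 30}}))
```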

Request headers

Authorization: Bearer <ACS_API_KEY> — required.
Content-Type: application/json — for the JSON body.
X-Acs-Workload: interactive | batch — optional usage tag (telemetry only; it does not change scheduling or priority). Unrecognised values are ignored. See Batch rollouts.
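For clients that don't go through the SDK, the headers above can be assembled as follows. build_headers is a hypothetical helper. Since the server ignores unrecognised X-Acs-Workload values, the sketch raises locally so a typo doesn't pass unnoticed.

```python
from typing import Optional


def build_headers(api_key: str, workload: Optional[str] = None) -> dict:
    """Build the request headers described above."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if workload is not None:
        if workload not in ("interactive", "batch"):
            raise ValueError(f"unknown workload tag: {workload!r}")
        headers["X-Acs-Workload"] = workload
    return headers


headers = build_headers("example-key", workload="batch")  # placeholder key for illustration
print(headers)
```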

Response headers

Reliability hints on the response itself:

X-Request-Id — a unique id for the request; quote it in bug reports so we can trace it in our logs.
X-Acs-Upstream-Model / X-Acs-Upstream-Gpu — which backend served (or failed) this request. Cite this in bug reports.
X-Acs-Upstream-Error-Kind — set on upstream-error 5xx responses (mirrors error.code).
X-Acs-Max-Tokens-Clamped — present when the wrapper reduced your max_tokens to fit a budget; format requested=N,applied=M,reason=budget.
Retry-After — RFC-7231 header on 503 cold-boot / circuit-open responses.
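The X-Acs-Max-Tokens-Clamped value is easy to parse given the documented format. A minimal sketch, with parse_clamped as a hypothetical helper name and made-up numbers in the sample value:

```python
def parse_clamped(value: str) -> dict:
    """Parse 'requested=N,applied=M,reason=budget' into a dict with integer counts."""
    fields = {}
    for part in value.split(","):
        key, _, val = part.strip().partition("=")
        fields[key] = val
    return {
        "requested": int(fields["requested"]),
        "applied": int(fields["applied"]),
        "reason": fields["reason"],
    }


info = parse_clamped("requested=512,applied=200,reason=budget")
print(info)
```

With the OpenAI Python SDK, response headers should be reachable through client.completions.with_raw_response.create(...), which returns an object exposing .headers and .parse(); check this against your SDK version.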

Privacy

API requests to /v1/completions are logged as metadata only (key prefix, email, IP, endpoint, model, token counts, status, latency), which we use to run the service and prevent abuse. We don't currently log your prompts or completions. This may change: we may start logging API requests for abuse-prevention reasons. We will not train on them. (Keep your own copies if you need them for reproducibility.)

The one exception is the Workbench: prompts you save there are stored server-side, so your sessions persist across browsers. Don't keep anything there you wouldn't want stored.

Getting help

Email base-models@acsresearch.org, or use the in-app Feedback button if you're signed in. To help us trace a specific request, include the rough time you made it, your key prefix or name (found in your dashboard — not the secret), and the error body.