Private beta — access is gated. Email base-models@acsresearch.org with a brief note on how you'd like to use it; we review requests individually. If it's a fit, you'll get an invite link to create an API key from the dashboard.

API reference

OpenAI-compatible /v1/completions. Standard sampling controls + a handful of token-level extras for inspection work.

What's supported

Standard OpenAI-style /v1/completions parameters, validated strictly (see below).

Sampling: temperature (0–2), top_p (0–1], top_k (-1 disables, else ≥1), min_p (0–1), presence_penalty and frequency_penalty (−2–2), repetition_penalty (0–2], and seed (honored for reproducibility under sampling).

Length & variants: max_tokens (≥1), n and best_of (≤16 — request multiple continuations at once), and stop (a string or up to 4 strings).

Scoring without generating? max_tokens=0 is not supported (the minimum is 1). To score existing text rather than generate, use prompt_logprobs with max_tokens=1 and echo=true — you get the per-prompt-position logprobs and just ignore the one extra generated token. See the prompt_logprobs and echo examples.
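As a minimal sketch of the scoring recipe above, the helper below builds keyword arguments for client.completions.create(...). The name scoring_request is hypothetical, and the model id llama-8b is borrowed from the SDK example in this document. The sketch only builds the request; the layout of the returned prompt logprobs is not described here, so inspect a real response before writing a parser.

```python
def scoring_request(model: str, text: str, top_k: int = 1) -> dict:
    """Build kwargs for client.completions.create(**kwargs) that score `text`.

    Hypothetical helper: it only encodes the recipe from the paragraph above.
    """
    return {
        "model": model,
        "prompt": text,
        "max_tokens": 1,  # the minimum allowed; max_tokens=0 is rejected
        "echo": True,     # include the prompt in the response
        # prompt_logprobs is vLLM-native, so the OpenAI SDK needs it in extra_body
        "extra_body": {"prompt_logprobs": top_k},
    }


kwargs = scoring_request("llama-8b", "The capital of France is Paris.")
print(kwargs)
```

The single generated token is a side effect of the max_tokens minimum and can be dropped when reading the result.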

Token-level / inspection work:

  • logprobs=k — top-k logprobs per generated token (k ≤ 20; ask if you need more). See logprobs example.
  • prompt_logprobs=k — top-k logprobs at each prompt position (the model's predictions, not the actual prompt token unless it was in the top k). See prompt_logprobs example.
  • echo=true — include the prompt in the response. See echo example.
  • stream=true (plus optional stream_options) — token-by-token SSE streaming. See stream example.
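For stream=true without an SDK, a rough sketch of reading the event stream follows. It assumes the stream uses the usual OpenAI-style framing (lines of the form data: {json}, a choices[].text field per chunk, and a final data: [DONE] line); that framing is an assumption based on the endpoint being OpenAI-compatible, not something this page spells out. The function name iter_stream_text is hypothetical, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed framing)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        # A usage-only chunk (possible with stream_options) may carry no choices.
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Made-up sample lines standing in for a real response body.
sample = [
    'data: {"choices": [{"index": 0, "text": " Paris"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": "."}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With n greater than 1, fragments from different continuations would interleave, so a real consumer would group them by the index field rather than concatenating everything.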

Plus user — an optional free-form tag echoed back for your own bookkeeping.

Using the OpenAI Python SDK? A few of these are vLLM-native and aren't on the SDK's typed signature, so the SDK rejects them as direct keyword arguments before the request leaves your machine. Pass them through extra_body={...} instead: top_k, min_p, repetition_penalty, and prompt_logprobs. Everything else — temperature, top_p, max_tokens, n, best_of, presence_penalty, frequency_penalty, seed, logprobs, echo, stream, stop — works as a normal keyword argument.

resp = client.completions.create(
    model="llama-8b",
    prompt="The capital of France is",
    temperature=0.7,            # standard kwarg
    logprobs=5,                 # standard kwarg
    extra_body={                # vLLM-native — must go here
        "top_k": 20,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
        "prompt_logprobs": 5,
    },
)
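If sampling settings live in one flat dict, a small helper can route the four vLLM-native fields into extra_body automatically. This is a sketch; split_sdk_kwargs and VLLM_NATIVE are hypothetical names, and the field list is taken from the paragraph above.

```python
# The four fields this document says must go through extra_body.
VLLM_NATIVE = frozenset({"top_k", "min_p", "repetition_penalty", "prompt_logprobs"})


def split_sdk_kwargs(params: dict) -> dict:
    """Return kwargs for client.completions.create with vLLM-native fields moved."""
    kwargs = {
        k: v for k, v in params.items()
        if k not in VLLM_NATIVE and k != "extra_body"
    }
    # Preserve anything the caller already placed in extra_body.
    extra = dict(params.get("extra_body") or {})
    extra.update({k: v for k, v in params.items() if k in VLLM_NATIVE})
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


flat = {"model": "llama-8b", "prompt": "Hi", "temperature": 0.7, "top_k": 20, "min_p": 0.05}
print(split_sdk_kwargs(flat))
```

Usage would then be client.completions.create(**split_sdk_kwargs(flat)).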

Strict validation. Unknown fields are rejected with a 400, and out-of-range values fail loudly instead of silently clamping — e.g. temperature=3 is a 400, not a quiet reset to the default, and logprobs: true is rejected (pass a number, not a boolean). A typo'd parameter never silently changes your results.
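Because out-of-range values come back as a 400, it can be convenient to catch them before sending. The sketch below mirrors the ranges listed under "What's supported". It is a client-side convenience, not the server's validator. Several details are assumptions: ranges written like 0–2 are treated as inclusive, and n, best_of, and logprobs are given lower bounds of 1, 1, and 0. The name check_params is hypothetical, and unknown fields are not checked.

```python
# (low, high, low_inclusive); half-open ranges follow the "(0–1]" notation above.
RANGES = {
    "temperature": (0, 2, True),
    "top_p": (0, 1, False),
    "min_p": (0, 1, True),
    "presence_penalty": (-2, 2, True),
    "frequency_penalty": (-2, 2, True),
    "repetition_penalty": (0, 2, False),
}

# Integer fields; lower bounds for n, best_of and logprobs are assumptions.
INT_CHECKS = {
    "max_tokens": lambda v: v >= 1,
    "n": lambda v: 1 <= v <= 16,
    "best_of": lambda v: 1 <= v <= 16,
    "logprobs": lambda v: 0 <= v <= 20,
    "top_k": lambda v: v == -1 or v >= 1,
}


def check_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name, value in params.items():
        if name in RANGES:
            low, high, low_inclusive = RANGES[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name}: expected a number")
                continue
            low_ok = value >= low if low_inclusive else value > low
            if not (low_ok and value <= high):
                problems.append(f"{name}: {value} is out of range")
        elif name in INT_CHECKS:
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{name}: expected an integer")
            elif not INT_CHECKS[name](value):
                problems.append(f"{name}: {value} is out of range")
        elif name == "stop":
            items = [value] if isinstance(value, str) else value
            if (not isinstance(items, list) or len(items) > 4
                    or not all(isinstance(s, str) for s in items)):
                problems.append("stop: expected a string or up to 4 strings")
    return problems


print(check_params({"temperature": 3, "logprobs": True, "top_p": 0}))
```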

What's not supported

  • No chat endpoint. /v1/chat/completions returns 400 (chat_completions_unsupported) because these are base models with no chat template. Most openai SDK examples call client.chat.completions.create(...); use client.completions.create(...) instead.

Limits

  • Each key has a monthly token budget (prompt + completion tokens, reset on the 1st, UTC). Exceed it and requests return 429 budget_exceeded. Email to raise it.
  • Up to 8 concurrent requests per key, with no per-minute cap. Large batch jobs are fine: extra requests just queue until a slot frees up. See Batch rollouts for the canonical fan-out pattern.
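The service queues extra requests itself, so a client-side cap is optional. One reason to add it anyway is to keep the number of open connections bounded. The sketch below caps in-flight work at 8 with an asyncio semaphore. It is not the pattern from Batch rollouts, just a generic illustration. The names run_bounded and fake_call are hypothetical, and fake_call stands in for a real API call.

```python
import asyncio

MAX_IN_FLIGHT = 8  # matches the per-key concurrency limit stated above


async def run_bounded(prompts, call, limit=MAX_IN_FLIGHT):
    """Run call(prompt) for every prompt with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt):
        async with semaphore:
            return await call(prompt)

    # gather preserves input order in its result list
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_call(prompt):
    """Placeholder for a real request; sleeps briefly and echoes the prompt."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(run_bounded([f"prompt {i}" for i in range(20)], fake_call))
print(len(results))
```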

Errors

Every error body follows the OpenAI shape: {"error": {"code": "...", "message": "...", "type": "..."}} with extra fields tagged where useful. Read error.code for programmatic handling.

| HTTP | Code | Meaning | What to do |
|------|------|---------|------------|
| 400 | invalid_request | Unknown field, bad type, or out-of-range sampling param (e.g. temperature>2, top_p>1, logprobs>20) | The message names the offending field; fix and retry |
| 400 | context_length_exceeded | prompt_tokens + max_tokens > max_model_len | Reduce the prompt or max_tokens; check /v1/models for the per-model limit |
| 400 | chat_completions_unsupported | You hit /v1/chat/completions | Use /v1/completions; these are base models with no chat template |
| 400 | model_not_found | Unknown model id | Use a short id from /v1/models (not the HF repo name) |
| 400 | bad_json | Request body wasn't valid JSON | Fix the JSON |
| 401 | invalid_api_key | Key missing, wrong, paused, or revoked | Check ACS_API_KEY in your account settings |
| 429 | budget_exceeded | Monthly / daily / input / output token budget hit | Wait for the reset, or ask for more. See Budget-cap recovery |
| 502 | upstream_unreachable | Wrapper couldn't reach the model server (DNS / connection / read timeout) after retries | Retry shortly; persistent failures are an outage, so report it |
| 502 | vllm_oom | Upstream model server ran out of GPU memory | Retry with a smaller prompt / max_tokens / lower n |
| 502 | vllm_context_length | Upstream enforced its context-length limit (rare; the wrapper usually catches this as context_length_exceeded first) | Reduce prompt / max_tokens |
| 502 | vllm_engine_dead | Upstream vLLM engine crashed | Retry; if it persists the model is down, so check status or report |
| 502 | upstream_server_error | Other upstream 5xx after retries | Retry; check error.upstream_status for the original code |
| 503 | modal_cold_boot | Model container is starting. Retry-After header + retry_after_seconds field give a recommended delay. | Wait the suggested interval and retry; a few minutes is normal after scale-to-zero, and 5–13 min total can happen for large models. See Cold-boot recovery |
| 503 | circuit_open | Backend is in a circuit-breaker open state after repeated failures | Use retry_after_seconds; if the model is critical, contact us |
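One way to act on error.code is a small dispatch that maps each code from the table above to an action. The sketch below is illustrative only. The function name plan_for, the action labels, and the 5-second fallback delay are all hypothetical. It also assumes that retry_after_seconds sits inside the error object and that Retry-After carries a number of seconds, neither of which this page states.

```python
from typing import Optional, Tuple

FIX_REQUEST = {"invalid_request", "context_length_exceeded",
               "chat_completions_unsupported", "model_not_found",
               "bad_json", "invalid_api_key"}
SHRINK = {"vllm_oom", "vllm_context_length"}
RETRY = {"upstream_unreachable", "vllm_engine_dead", "upstream_server_error"}
WAIT = {"modal_cold_boot", "circuit_open"}


def plan_for(body: dict, headers: Optional[dict] = None,
             default_delay: float = 5.0) -> Tuple[str, Optional[float]]:
    """Map a parsed error body to (action, delay_in_seconds or None)."""
    error = body.get("error", {})
    code = error.get("code")
    if code in WAIT:
        # Assumed location of the field; fall back to the Retry-After header.
        delay = error.get("retry_after_seconds")
        if delay is None:
            raw = str((headers or {}).get("Retry-After", ""))
            delay = float(raw) if raw.isdigit() else default_delay
        return ("wait_and_retry", float(delay))
    if code in RETRY:
        return ("retry", default_delay)  # arbitrary illustrative delay
    if code in SHRINK:
        return ("shrink_and_retry", 0.0)
    if code == "budget_exceeded":
        return ("wait_for_reset_or_ask", None)
    if code in FIX_REQUEST:
        return ("fix_request", None)
    return ("unknown", None)


cold = {"error": {"code": "modal_cold_boot", "message": "starting", "retry_after_seconds": 30}}
print(plan_for(cold))
```

Header lookup here is case-sensitive on a plain dict; most HTTP clients expose a case-insensitive mapping that can be passed in directly.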

Request headers

  • Authorization: Bearer <ACS_API_KEY> — required.
  • Content-Type: application/json — for the JSON body.
  • X-Acs-Workload: interactive | batch — optional usage tag (telemetry only; it does not change scheduling or priority). Unrecognised values are ignored. See Batch rollouts.
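A short sketch of assembling these headers follows. The helper name build_headers is hypothetical. Since the server ignores unrecognised X-Acs-Workload values, the sketch rejects them on the client so a typo does not silently drop the tag.

```python
from typing import Optional

WORKLOADS = ("interactive", "batch")  # the two values listed above


def build_headers(api_key: str, workload: Optional[str] = None) -> dict:
    """Return request headers for a JSON call to /v1/completions."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if workload is not None:
        if workload not in WORKLOADS:
            raise ValueError(f"workload must be one of {WORKLOADS}, got {workload!r}")
        headers["X-Acs-Workload"] = workload  # telemetry tag only
    return headers


# "sk-example" is a placeholder, not a real key format.
print(build_headers("sk-example", "batch"))
```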

Response headers

Reliability hints on the response itself:

  • X-Request-Id — a unique id for the request; quote it in bug reports so we can trace it in our logs.
  • X-Acs-Upstream-Model / X-Acs-Upstream-Gpu — which backend served (or failed) this request. Cite this in bug reports.
  • X-Acs-Upstream-Error-Kind — set on upstream-error 5xx responses (mirrors error.code).
  • X-Acs-Max-Tokens-Clamped — present when the wrapper reduced your max_tokens to fit a budget; format requested=N,applied=M,reason=budget.
  • Retry-After — RFC-7231 header on 503 cold-boot / circuit-open responses.
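The X-Acs-Max-Tokens-Clamped value can be unpacked with a few lines of parsing. This sketch assumes the header always follows the requested=N,applied=M,reason=budget format given above. The function name parse_clamped and the sample numbers are hypothetical.

```python
def parse_clamped(value: str) -> dict:
    """Parse 'requested=N,applied=M,reason=...' into a dict with int counts."""
    fields = {}
    for part in value.split(","):
        key, _, val = part.strip().partition("=")
        fields[key] = val
    return {
        "requested": int(fields["requested"]),
        "applied": int(fields["applied"]),
        "reason": fields["reason"],
    }


# Made-up header value in the documented format.
info = parse_clamped("requested=4096,applied=1024,reason=budget")
print(info)
```

Comparing applied against requested shows how much shorter than intended the completion may be.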

Privacy

API requests to /v1/completions are logged as metadata only (key prefix, email, IP, endpoint, model, token counts, status, latency), which we use to run the service and prevent abuse. We don't currently log your prompts or completions. This may change: we may start logging request contents for abuse-prevention reasons. We will not train on them. (Keep your own copies if you need them for reproducibility.)

The one exception is the Workbench: prompts you save there are stored server-side, so your sessions persist across browsers. Don't keep anything there you wouldn't want stored.

Getting help

Email base-models@acsresearch.org, or use the in-app Feedback button if you're signed in. To help us trace a specific request, include the rough time you made it, your key prefix or name (found in your dashboard — not the secret), and the error body.