Batch rollouts

Each API key is capped at 8 concurrent requests. A single AsyncOpenAI client driven by asyncio.gather is enough to saturate that cap, which is the maximum throughput you can get without extra credentials.

curl

# Shell version — xargs -P 8 fans out 8 concurrent requests:
seq 1 32 | xargs -I{} -P 8 curl -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Sample {}: once upon a time", "max_tokens": 32}'

Python

import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url=os.environ["ACS_API_BASE"], api_key=os.environ["ACS_API_KEY"])
prompts = [f"Sample {i}: once upon a time" for i in range(32)]

async def one(p):
    r = await client.completions.create(model="llama-8b", prompt=p, max_tokens=32)
    return r.choices[0].text

async def _main():
    # asyncio.gather() must be called from inside a running event loop, so
    # wrap it in a coroutine and hand THAT to asyncio.run(...).
    return await asyncio.gather(*(one(p) for p in prompts))

results = asyncio.run(_main())

Gotcha

Requests beyond the 8th are queued server-side rather than rejected with a 429. That is fine for batch jobs, but don't expect 32 parallel requests to all start at once. If your own code fans out many requests at once, cap it with an asyncio.Semaphore(8) so they don't all pile up against the per-key limit.
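A minimal sketch of the semaphore cap described above. To keep it runnable without credentials, fake_request is a hypothetical stand-in for client.completions.create(...) that just sleeps, and the stats dictionary exists only to show that no more than 8 calls are ever in flight. In real code you would replace the body of fake_request with the API call.

```python
import asyncio

LIMIT = 8  # per-key concurrency cap stated above


async def fake_request(i, stats):
    # Hypothetical stand-in for client.completions.create(...):
    # it sleeps instead of calling the API and records how many
    # calls are in flight at the same time.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return f"result {i}"


async def run_capped(n, limit=LIMIT):
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}

    async def guarded(i):
        # At most `limit` coroutines get past this line at once;
        # the rest wait here on the client instead of in the server queue.
        async with sem:
            return await fake_request(i, stats)

    # gather() returns results in the order the coroutines were passed in.
    results = await asyncio.gather(*(guarded(i) for i in range(n)))
    return results, stats["peak"]


results, peak = asyncio.run(run_capped(32))
```

All 32 coroutines are still scheduled immediately, but only 8 hold the semaphore at any moment, so the waiting happens in your process rather than in the server-side queue.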

Tag your workload (optional)

Add an X-Acs-Workload: batch request header to batch jobs, or X-Acs-Workload: interactive for live, latency-sensitive calls. The header is purely a usage signal that we record to understand traffic and to plan capacity and keep-warm policy. It does not change how your request is scheduled or prioritised, and an unrecognised value is simply ignored. The in-browser Workbench is tagged interactive automatically.
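One way to build the tag in Python is sketched below. The helper name workload_headers and the client-side check are illustrative assumptions, not part of the API. The check is there only because the server silently ignores unrecognised values, so a typo would otherwise go unnoticed.

```python
# The two values named in the text above.
KNOWN_WORKLOADS = ("batch", "interactive")


def workload_headers(kind="batch"):
    """Return a header dict that tags requests with a workload type."""
    if kind not in KNOWN_WORKLOADS:
        # Client-side guard (an assumption of this sketch): the server
        # would accept the request and ignore the unknown value.
        raise ValueError(f"unknown workload {kind!r}; expected one of {KNOWN_WORKLOADS}")
    return {"X-Acs-Workload": kind}


headers = workload_headers("batch")
```

With the openai Python client, a dict like this can be passed as default_headers= when constructing AsyncOpenAI so that every request carries the tag, or as extra_headers= on an individual create(...) call.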