Batch rollouts
Each API key can have at most 8 requests in flight at once. Fanning out with asyncio.gather over a single AsyncOpenAI client is enough to reach that cap, so you get the maximum throughput a key allows without extra credentials.
curl
# Shell version: xargs -P 8 runs the 32 requests with at most 8 in flight at a time:
seq 1 32 | xargs -I{} -P 8 curl -s "$ACS_API_BASE/completions" \
-H "Authorization: Bearer $ACS_API_KEY" -H "Content-Type: application/json" \
-d '{"model": "llama-8b", "prompt": "Sample {}: once upon a time", "max_tokens": 32}'
Python
import asyncio
import os
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url=os.environ["ACS_API_BASE"], api_key=os.environ["ACS_API_KEY"])
prompts = [f"Sample {i}: once upon a time" for i in range(32)]
async def one(p):
    r = await client.completions.create(model="llama-8b", prompt=p, max_tokens=32)
    return r.choices[0].text
async def _main():
    # asyncio.gather() must be called from inside a running event loop, so
    # wrap it in a coroutine and hand THAT to asyncio.run(...).
    return await asyncio.gather(*(one(p) for p in prompts))
results = asyncio.run(_main())
Gotcha
Requests beyond the 8th queue server-side rather than returning a 429. That is fine for batch jobs, but don't expect 32 parallel requests to all start at once. If your own code fans out many requests at once, cap it with an asyncio.Semaphore(8) so they don't all pile up against the per-key limit.
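The semaphore cap described above can be sketched as follows. This is a minimal illustration, not part of the official client: bounded_gather, fake_request, and MAX_IN_FLIGHT are hypothetical names, and fake_request stands in for the real request coroutine (such as one(p) in the Python sample) so the sketch runs without network access.

```python
import asyncio

MAX_IN_FLIGHT = 8  # assumption: mirrors the per-key concurrency cap stated above


async def bounded_gather(worker, items, limit=MAX_IN_FLIGHT):
    """Run worker(item) for every item, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        # Only `limit` coroutines can hold the semaphore at the same time;
        # the rest wait here instead of queueing on the server.
        async with sem:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(i) for i in items))


# Hypothetical stand-in for a real request; it records peak concurrency.
in_flight = 0
peak = 0


async def fake_request(prompt):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate network latency
    in_flight -= 1
    return prompt.upper()


prompts = [f"Sample {i}: once upon a time" for i in range(32)]
results = asyncio.run(bounded_gather(fake_request, prompts))
print(len(results), "results; peak concurrency:", peak)
```

To use it with the real client, pass the request coroutine in place of fake_request, for example bounded_gather(one, prompts) from inside the same script as the Python sample.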
Tag your workload (optional)
Add an X-Acs-Workload: batch request header to batch jobs (or X-Acs-Workload: interactive for live, latency-sensitive calls). It is purely a usage signal that we record to understand traffic and to plan capacity and keep-warm policy. It does not change how your request is scheduled or prioritised, and an unrecognised value is simply ignored. The in-browser Workbench is tagged interactive automatically.
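As a rough sketch of where the header goes, the snippet below builds (but does not send) a tagged completion request using only the standard library. build_completion_request is a hypothetical helper, and the fallback base URL and key are placeholders used only when the environment variables are unset.

```python
import json
import os
import urllib.request


def build_completion_request(prompt, workload="batch"):
    """Build a POST request to the completions endpoint, tagged with a workload."""
    # Placeholders (assumptions) so the sketch runs without real credentials.
    base = os.environ.get("ACS_API_BASE", "https://example.invalid/v1")
    key = os.environ.get("ACS_API_KEY", "placeholder-key")
    body = json.dumps(
        {"model": "llama-8b", "prompt": prompt, "max_tokens": 32}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base}/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
            "X-Acs-Workload": workload,  # "batch" or "interactive"
        },
        method="POST",
    )


req = build_completion_request("Sample 0: once upon a time")
# Inspect the headers that would be sent (the key is redacted when printing).
for name, value in req.header_items():
    print(name, "=", "<redacted>" if name.lower() == "authorization" else value)
```

With the openai Python SDK, the same header can typically be attached to every call by passing default_headers={"X-Acs-Workload": "batch"} when constructing AsyncOpenAI; treat that as an assumption to check against the SDK version you have installed.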