stream
Server-sent events: each chunk is a partial completion, and the stream is terminated by data: [DONE]. Set the request-level timeout high enough to span the full generation.
curl
curl -N -s "$ACS_API_BASE/completions" \
  -H "Authorization: Bearer $ACS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "prompt": "Five reasons to learn Rust:\n1.", "max_tokens": 80,
       "stream": true, "stream_options": {"include_usage": true}}'
Python
# Assumes `client` is an OpenAI-compatible SDK client that has already been configured.
stream = client.completions.create(
    model="llama-8b",
    prompt="Five reasons to learn Rust:\n1.",
    max_tokens=80,
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:  # content chunk: print the partial text as it arrives
        print(chunk.choices[0].text, end="", flush=True)
    elif chunk.usage:  # final usage chunk has choices=[]
        print(f"\n[tokens: {chunk.usage.total_tokens}]")
Example response
First few SSE frames — the wire format; the Python SDK parses each data: payload into a chunk object. (Per-chunk logprobs: null and stop_reason: null keys are omitted here for brevity — they're present in the real wire output.)
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" It","finish_reason":null}],"usage":null}
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":"’s","finish_reason":null}],"usage":null}
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" fast","finish_reason":null}],"usage":null}
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":".\n","finish_reason":null}],"usage":null}
... (one chunk per token until max_tokens or stop) ...
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" again","finish_reason":"length"}],"usage":null}
data: {"id":"cmpl-ae7c4833c82a33d6","object":"text_completion","model":"meta-llama/Llama-3.1-8B","choices":[],"usage":{"prompt_tokens":9,"completion_tokens":80,"total_tokens":89}}
data: [DONE]
The final content chunk carries finish_reason alongside its last token, not on a separate chunk with empty text. Because the request set stream_options.include_usage: true, a terminal chunk then arrives with choices: [] and a populated usage object; this is where you get the token counts for a streamed completion. Finally, data: [DONE] closes the stream. Omit stream_options and the usage chunk is normally absent, leaving only the content frames (vLLM may emit usage permissively; the OpenAI spec requires the opt-in).
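If you consume the wire format directly rather than through the SDK, the frame sequence described above can be folded into a single result. The following is a minimal sketch, not the SDK's implementation: the helper name collect_sse_completion is hypothetical, it assumes a single choice per chunk (index 0), and the sample frames are trimmed to the keys the helper reads.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_sse_completion(
    lines: Iterable[str],
) -> Tuple[str, Optional[str], Optional[dict]]:
    """Fold SSE `data:` lines into (text, finish_reason, usage).

    Hypothetical helper; assumes one choice per chunk.
    """
    parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel: the stream is finished
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("text", ""))
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
        if chunk.get("usage"):
            usage = chunk["usage"]  # only the terminal usage chunk populates this
    return "".join(parts), finish_reason, usage


# Sample frames trimmed from the example response above.
frames = [
    'data: {"choices":[{"index":0,"text":" It","finish_reason":null}],"usage":null}',
    "",
    'data: {"choices":[{"index":0,"text":" again","finish_reason":"length"}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":80,"total_tokens":89}}',
    "data: [DONE]",
]
text, finish_reason, usage = collect_sse_completion(frames)
print(repr(text), finish_reason, usage["total_tokens"])
```

Because usage starts as None and is only set when a chunk carries a populated usage object, the same helper works whether or not the request opted in with stream_options.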
Gotcha
Use curl -N to disable output buffering; otherwise the whole response appears to arrive at once. finish_reason is null on every content chunk until the last one, and the usage chunk that follows has no choices at all.