Models
The wrapper currently fronts three base-model checkpoints. Pick llama-8b if you're just trying it out — it has the shortest cold-boot.
Available models
model | Checkpoint | Precision |
|---|---|---|
llama-8b | meta-llama/Llama-3.1-8B | bf16 |
llama-405b | meta-llama/Llama-3.1-405B | bf16 |
trinity-base | arcee-ai/Trinity-Large-TrueBase | bf16 |
Availability & cold starts
Models that get steady use are usually kept warm, so requests start immediately. Less-used or larger models sleep when idle to save GPU time and cold-start on the first request after a quiet spell — usually a few minutes, worst case up to ~10 minutes (occasionally longer for the largest models when GPUs are scarce). Once warm, a model answers in seconds and stays warm while you keep using it.
If a first request returns modal_cold_boot, wait for the Retry-After hint and retry — see Cold-boot recovery for a copy-pasteable retry loop. For interactive work, send one short throwaway request to wake the model, then go; for batch jobs, let the first request absorb the wait.
Always-on (warm windows)
If you'd rather have a model kept always-on for a stretch — a work session, a deadline — tell us via the Feedback button or email, and we'll schedule a warm window.