Models

The wrapper currently fronts three base-model checkpoints. Pick llama-8b if you're just trying it out — it has the shortest cold-boot.

Available models

`model`	Checkpoint	Precision
`llama-8b`	`meta-llama/Llama-3.1-8B`	bf16
`llama-405b`	`meta-llama/Llama-3.1-405B`	bf16
`trinity-base`	`arcee-ai/Trinity-Large-TrueBase`	bf16

Availability & cold starts

Models that get steady use are usually kept warm, so requests start immediately. Less-used or larger models sleep when idle to save GPU time and cold-start on the first request after a quiet spell — usually a few minutes, worst case up to ~10 minutes (occasionally longer for the largest models when GPUs are scarce). Once warm, a model answers in seconds and stays warm while you keep using it.

If a first request returns modal_cold_boot, wait for the Retry-After hint and retry — see Cold-boot recovery for a copy-pasteable retry loop. For interactive work, send one short throwaway request to wake the model, then go; for batch jobs, let the first request absorb the wait.

Always-on (warm windows)

If you'd rather have a model kept always-on for a stretch — a work session, a deadline — tell us via the Feedback button or email, and we'll schedule a warm window.