The API specifications have changed dramatically since the time of the previous post.
A hard cutoff at ~120s almost always means the server-side gateway (or an upstream proxy) is closing the connection. Your local timeout cannot override that. In your specific case, the biggest “today” factor is that api-inference.huggingface.co/models/... is the legacy Serverless Inference API path and it is being deprecated in favor of the Inference Providers router. People now commonly see legacy-path failures like 404/410 and “use router.huggingface.co instead” messages. (Hugging Face Forums)
Below is what’s happening, why ~120s shows up, and the practical fixes that actually move the needle now.
1) Where the “120 seconds” really comes from
There are two distinct “120s” phenomena that get mixed together:
A. “Model not loaded” timeout loops (client-side / service-side)
Older flows around the Serverless Inference API often hit model cold-start. You see errors like:
`InferenceTimeoutError: Model not loaded ... Please retry with a higher timeout (current: 120)` (Hugging Face)
This is commonly mitigated by the `X-Wait-For-Model: true` request header, which was explicitly requested in tooling like Gradio to avoid timing out while a serverless model loads. (GitHub)
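For context, the old mitigation looked roughly like the sketch below. This targets the legacy serverless path, which is now deprecated, so treat it as illustration of the "wait for model" pattern rather than current advice; the token and model are placeholders.

```python
# Sketch only: legacy Serverless Inference API mitigation for cold starts.
# The legacy path is deprecated; shown here to explain the old behavior.
LEGACY_URL = "https://huggingface.co/proxy/api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B"

def legacy_headers(token: str, wait_for_model: bool = True) -> dict:
    """Build headers for the legacy serverless API, opting in to cold-start waits."""
    headers = {"Authorization": f"Bearer {token}"}
    if wait_for_model:
        # Ask the service to hold the request while the model loads instead of
        # failing fast with a "Model not loaded" error.
        headers["X-Wait-For-Model"] = "true"
    return headers

if __name__ == "__main__":
    import requests  # third-party; only needed for the actual call

    resp = requests.post(
        LEGACY_URL,
        headers=legacy_headers("hf_..."),  # placeholder token
        json={"inputs": "Hello"},
        timeout=300,  # generous client timeout; cannot override server-side caps
    )
    print(resp.status_code, resp.text[:200])
```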
But this is not the same as your symptom if your TCP connection is dropped at 120s with a read timeout.
B. Gateway / proxy maximum duration or idle timeout (server-side)
When you observe:
- the connection gets closed around 120s regardless of your client timeout
- you do not receive an application-level JSON error body first
that strongly indicates a gateway/proxy cap or an idle timeout between you and the backend.
Evidence that Hugging Face is fronted by an edge layer for these APIs shows up in real headers: `x-powered-by: huggingface-moon`, `x-inference-provider: ...`, and even CloudFront-related headers in legacy flows. (Hugging Face Forums)
So yes, “~120s” is a pattern people hit. But it is not usually described as “your plan allows 120s”. It is more like “this particular path is fronted by infrastructure that won’t hold a non-streaming request open past ~N seconds”.
2) Why your exact request setup is likely to hit it
You are using the legacy host
Multiple Hugging Face forum threads now say the legacy host is deprecated and that you should use the router path instead. (Hugging Face Forums)
That alone can cause inconsistent behavior, including hard timeouts and intermittent gateway errors.
You are requesting a provider in a way that is not the “current” contract
In Inference Providers, provider selection is typically done via:
- the SDK parameter `provider="cerebras"` (Python `InferenceClient`)
- or a model suffix like `model="...:cerebras"` in the OpenAI-compatible API
That is the pattern shown in the Cerebras provider documentation. (Hugging Face)
A custom request header `X-Inference-Provider: cerebras` is not the primary documented mechanism for the newer router flows. In practice, headers like `x-inference-provider` are reliably seen as response headers (what you got routed to), not necessarily as stable request controls. (Postman)
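The two documented provider-selection mechanisms can be sketched as follows. The token is a placeholder, and the network call is guarded so only the pure suffix helper runs unconditionally:

```python
# Sketch of the two documented ways to pin the Cerebras provider on the
# current Inference Providers interface.

def with_provider(model_id: str, provider: str) -> str:
    """Append the ':provider' suffix used by the OpenAI-compatible router API."""
    return f"{model_id}:{provider}"

if __name__ == "__main__":
    from huggingface_hub import InferenceClient  # pip install huggingface_hub

    # Option 1: SDK parameter on the client itself.
    client = InferenceClient(provider="cerebras", token="hf_...")  # placeholder token
    out = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Say hi"}],
        max_tokens=32,
    )
    print(out.choices[0].message.content)

    # Option 2: provider suffix on the model ID (OpenAI-compatible endpoint).
    print(with_provider("meta-llama/Llama-3.1-8B-Instruct", "cerebras"))
```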
You are calling the base model, but Cerebras support often targets “Instruct”
Cerebras’ Inference Providers page prominently lists Llama-3.1-8B-Instruct among supported models. (Hugging Face)
Calling meta-llama/Llama-3.1-8B (base) may route differently than you expect (or fail over), which can increase latency and queueing.
Non-streaming responses are most fragile at the edge
If the backend takes a while before returning the first byte, an edge proxy can drop the connection. The most robust fix is streaming so the connection stays active and you start receiving tokens early. Hugging Face’s Inference Providers docs explicitly support streaming (SSE) for text generation and chat completions. (Hugging Face)
3) Direct answers to your 3 questions (as of now)
(1) Is there a 120s server-side limit for Llama 3.1-8B on my plan?
There is no clean “plan X = 120 seconds” statement in the public docs. What is clearly visible is:
- legacy `api-inference.huggingface.co` is deprecated and unreliable now (Hugging Face Forums)
- users repeatedly report gateway timeouts and “bad state” episodes on shared/serverless paths (Hugging Face Forums)
- 120s appears as a commonly encountered timeout threshold in older serverless flows (often around loading / waiting). (Hugging Face)
Practically: treat ~120s as an infrastructure cap you should engineer around, not a knob you can raise with a client timeout.
(2) Does X-Inference-Provider: cerebras support longer runtimes?
On the current Inference Providers interface, don’t rely on that header.
Instead:
- use the router OpenAI-compatible endpoint and set `model="…:cerebras"` (Hugging Face)
- or use `InferenceClient(provider="cerebras", ...)` (Hugging Face)
Even with Cerebras, the request still traverses Hugging Face routing infrastructure, so the router layer can still impose timeouts unless you stream.
(3) Do you need a specific plan/config to exceed 120s?
If you need reliably longer-running inference, the product that is designed for that is Inference Endpoints (dedicated), not shared serverless. Hugging Face’s own Hub docs draw that distinction: Inference Providers are serverless partner routing, while Inference Endpoints run on dedicated infrastructure. (Hugging Face)
Paid plans mainly affect billing continuity and credits (for example, PRO can continue pay-as-you-go after credits run out); they do not remove gateway timeouts. (Hugging Face)
4) Workarounds that actually help “today”
Fix 1: Stop calling the legacy URL. Use the router.
Use the OpenAI-compatible endpoint:
- Base URL shown in docs: `https://huggingface.co/proxy/router.huggingface.co/v1` (Hugging Face)

Then set the model with a provider suffix: `meta-llama/Llama-3.1-8B-Instruct:cerebras` (Hugging Face)
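Put together, a router call through the OpenAI-compatible endpoint looks roughly like this (assumes the `openai` Python package; the `hf_...` key is a placeholder for a Hugging Face token, not an OpenAI key):

```python
# Minimal sketch of calling the router's OpenAI-compatible endpoint with a
# provider suffix instead of the legacy api-inference.huggingface.co host.
BASE_URL = "https://huggingface.co/proxy/router.huggingface.co/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct:cerebras"

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=BASE_URL, api_key="hf_...")  # HF token goes here
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarize SSE in one sentence."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)
```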
Fix 2: Turn on streaming (this is the single best mitigation for edge timeouts)
Hugging Face documents that `stream=true` returns tokens via SSE for text generation. (Hugging Face)
Chat completion docs also highlight streaming support. (Hugging Face)
Streaming changes the failure mode because:
- you get time-to-first-token quickly
- the connection stays active with incremental bytes
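A streaming call with the `huggingface_hub` client can be sketched as below; the token is a placeholder, and the pure accumulator works on any iterable of text deltas:

```python
# Streaming sketch: consume SSE chunks so bytes flow before the gateway's
# idle window expires, instead of waiting silently for the full response.

def accumulate(deltas) -> str:
    """Join incremental text deltas (skipping empty/None ones) into full text."""
    return "".join(d for d in deltas if d)

if __name__ == "__main__":
    from huggingface_hub import InferenceClient

    client = InferenceClient(provider="cerebras", token="hf_...")  # placeholder token
    stream = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about latency."}],
        max_tokens=128,
        stream=True,  # keeps the connection active; first token arrives early
    )
    text = accumulate(
        chunk.choices[0].delta.content for chunk in stream if chunk.choices
    )
    print(text)
```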
Fix 3: Reduce prefill cost, not just max_new_tokens
People often only reduce output tokens. But long latency often comes from prompt prefill:
- very long input
- high concurrency causing queueing
- cold starts
On the Inference Providers text generation spec, there is also `truncate` (input token truncation) that can help keep prefill bounded. (Hugging Face)
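A sketch of bounding prefill from both sides: a crude client-side clamp (the 4-characters-per-token heuristic is an assumption for illustration, not a tokenizer) plus the server-side `truncate` field from the text generation spec. Token and support for `truncate` on any given provider are assumptions.

```python
# Bound prefill cost, not just output length: clamp the prompt client-side
# and ask the server to truncate input tokens as well.

def clamp_prompt(prompt: str, max_input_tokens: int, chars_per_token: int = 4) -> str:
    """Keep roughly the last `max_input_tokens` worth of text (crude heuristic)."""
    budget = max_input_tokens * chars_per_token
    return prompt[-budget:] if len(prompt) > budget else prompt

if __name__ == "__main__":
    from huggingface_hub import InferenceClient

    client = InferenceClient(token="hf_...")  # placeholder token
    long_doc = "some very long document " * 10_000
    out = client.text_generation(
        clamp_prompt(long_doc, max_input_tokens=2048),
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_new_tokens=128,
        truncate=2048,  # server-side input truncation per the task spec
    )
    print(out)
```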
Fix 4: If you truly need long wall-clock jobs, switch product shape
If your workload is “generate for minutes” or “large document transformations”, serverless shared APIs are the wrong primitive. Use:
- Inference Endpoints (dedicated) (Hugging Face)
- or self-host a server like TGI where streaming is first-class and you control infra timeouts. (Hugging Face)
5) A quick diagnostic checklist to confirm the root cause
- Capture response headers when it works. Look for `x-inference-provider` (what actually served you); this is commonly present. (Hugging Face Forums)
- Differentiate “HTTP error response” vs “dropped connection”:
  - If you get a JSON error or HTTP 503/504, it is gateway/back-end signaling. (Hugging Face Forums)
  - If your client just times out with no HTTP status, it is almost certainly an upstream close or network idle timeout.
- Check if you are hitting shared instability. There are periodic forum reports of timeouts across multiple models or accounts, especially on shared paths. (Hugging Face Forums)
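The first two checks can be run in one probe, roughly as follows (token is a placeholder; the read timeout is deliberately set above the suspected ~120s so you can observe who gives up first):

```python
# Diagnostic sketch: distinguish an HTTP-level error (gateway signaling) from a
# silent connection drop, and record which provider actually served the request.

def served_by(headers) -> "str | None":
    """Case-insensitively read the x-inference-provider response header."""
    return {k.lower(): v for k, v in headers.items()}.get("x-inference-provider")

if __name__ == "__main__":
    import requests  # third-party; only needed for the live probe

    try:
        resp = requests.post(
            "https://huggingface.co/proxy/router.huggingface.co/v1/chat/completions",
            headers={"Authorization": "Bearer hf_..."},  # placeholder token
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct:cerebras",
                "messages": [{"role": "user", "content": "ping"}],
            },
            timeout=(10, 300),  # (connect, read): read cap above the suspected 120s
        )
        # An HTTP status (even 503/504) means the gateway/backend answered.
        print("HTTP", resp.status_code, "provider:", served_by(resp.headers))
    except requests.exceptions.ReadTimeout:
        print("No HTTP status: upstream close or idle timeout, not an app-level error")
```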
6) Similar cases and threads worth reading
These show the same classes of failures you are seeing:
- The exact same “120s timeout on Llama 3.1-8B” report (Hugging Face Forums)
- “Model not loaded… current: 120” style errors on legacy serverless (Hugging Face)
- Gradio issue requesting `X-Wait-For-Model` to avoid serverless cold-start timeouts (GitHub)
- Shared gateway timeouts reported by users (504) (Hugging Face Forums)
- Legacy endpoint deprecation and “use router…” errors (Hugging Face Forums)
7) High-quality docs and guides for the “new” correct way
Core docs:
- Hugging Face `huggingface_hub` inference guide explaining Providers vs Endpoints (Hugging Face)
- Inference Providers Chat Completion task docs (OpenAI-compatible, streaming) (Hugging Face)
- Inference Providers Text Generation task docs (SSE streaming, request fields like `truncate`) (Hugging Face)
- Cerebras provider page (supported models and `:cerebras` usage) (Hugging Face)
- Pricing and billing (what plans change and what they don’t) (Hugging Face)
If you want the fastest path to human answers, use the Inference Providers discussions aggregator Space you found. (Hugging Face)
Summary
- Your 120s cutoff is almost certainly a server-side gateway/proxy behavior, not your client timeout.
- The legacy `api-inference.huggingface.co/models/...` path is deprecated now. Move to the router. (Hugging Face Forums)
- For Cerebras, use `model="…:cerebras"` (and usually the Instruct variant). (Hugging Face)
- Use streaming (`stream=true`) to avoid long silent waits that trigger gateway timeouts. (Hugging Face)
- If you truly need multi-minute requests reliably, use dedicated Inference Endpoints or self-host TGI. (Hugging Face)