The API specifications have changed dramatically since the time of the previous post.
A hard cutoff at ~120s almost always means the server-side gateway (or an upstream proxy) is closing the connection. Your local timeout cannot override that. In your specific case, the biggest “today” factor is that api-inference.huggingface.co/models/... is the legacy Serverless Inference API path and it is being deprecated in favor of the Inference Providers router. People now commonly see legacy-path failures like 404/410 and “use router.huggingface.co instead” messages. (Hugging Face Forums)
Below is what’s happening, why ~120s shows up, and the practical fixes that actually move the needle now.
1) Where the “120 seconds” really comes from
There are two distinct “120s” phenomena that get mixed together:
A. “Model not loaded” timeout loops (client-side / service-side)
Older flows around the Serverless Inference API often hit model cold-start. You see errors like:
`InferenceTimeoutError: Model not loaded ... Please retry with a higher timeout (current: 120)` (Hugging Face)
This is commonly mitigated by the `X-Wait-For-Model: true` request header, which was explicitly requested in tooling like Gradio to avoid timing out while a serverless model loads. (GitHub)
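For context, the old mitigation looked roughly like the sketch below. This targets the legacy serverless path, which is now deprecated, so treat it as illustration of the "wait for model" pattern rather than current advice; the token and model are placeholders.

```python
# Sketch only: legacy Serverless Inference API mitigation for cold starts.
# The legacy path is deprecated; shown here to explain the old behavior.
LEGACY_URL = "https://huggingface.co/proxy/api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B"

def legacy_headers(token: str, wait_for_model: bool = True) -> dict:
    """Build headers for the legacy serverless API, opting in to cold-start waits."""
    headers = {"Authorization": f"Bearer {token}"}
    if wait_for_model:
        # Ask the service to hold the request while the model loads instead of
        # failing fast with a "Model not loaded" error.
        headers["X-Wait-For-Model"] = "true"
    return headers

if __name__ == "__main__":
    import requests  # third-party; only needed for the actual call

    resp = requests.post(
        LEGACY_URL,
        headers=legacy_headers("hf_..."),  # placeholder token
        json={"inputs": "Hello"},
        timeout=300,  # generous client timeout; cannot override server-side caps
    )
    print(resp.status_code, resp.text[:200])
```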
But this is not the same as your symptom if your TCP connection is dropped at 120s with a read timeout.
B. Gateway / proxy maximum duration or idle timeout (server-side)
When you observe:
- the connection gets closed around 120s regardless of your client timeout
- you do not receive an application-level JSON error body first
that strongly indicates a gateway/proxy cap or an idle timeout between you and the backend.
Evidence that Hugging Face is fronted by an edge layer for these APIs shows up in real headers: `x-powered-by: huggingface-moon`, `x-inference-provider: ...`, and even CloudFront-related headers in legacy flows. (Hugging Face Forums)
So yes, “~120s” is a pattern people hit. But it is not usually described as “your plan allows 120s”. It is more like “this particular path is fronted by infrastructure that won’t hold a non-streaming request open past ~N seconds”.
2) Why your exact request setup is likely to hit it
You are using the legacy host
Multiple Hugging Face forum threads now say the legacy host is deprecated and that you should use the router path instead. (Hugging Face Forums)
That alone can cause inconsistent behavior, including hard timeouts and intermittent gateway errors.
You are requesting a provider in a way that is not the “current” contract
In Inference Providers, provider selection is typically done via:
- the SDK parameter `provider="cerebras"` (Python `InferenceClient`)
- or a model suffix like `model="...:cerebras"` in the OpenAI-compatible API
That is the pattern shown in the Cerebras provider documentation. (Hugging Face)
A custom request header `X-Inference-Provider: cerebras` is not the primary documented mechanism for the newer router flows. In practice, headers like `x-inference-provider` are reliably seen as response headers (what you got routed to), not necessarily as stable request controls. (Postman)
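The two documented provider-selection mechanisms can be sketched as follows. The token is a placeholder, and the network call is guarded so only the pure suffix helper runs unconditionally:

```python
# Sketch of the two documented ways to pin the Cerebras provider on the
# current Inference Providers interface.

def with_provider(model_id: str, provider: str) -> str:
    """Append the ':provider' suffix used by the OpenAI-compatible router API."""
    return f"{model_id}:{provider}"

if __name__ == "__main__":
    from huggingface_hub import InferenceClient  # pip install huggingface_hub

    # Option 1: SDK parameter on the client itself.
    client = InferenceClient(provider="cerebras", token="hf_...")  # placeholder token
    out = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Say hi"}],
        max_tokens=32,
    )
    print(out.choices[0].message.content)

    # Option 2: provider suffix on the model ID (OpenAI-compatible endpoint).
    print(with_provider("meta-llama/Llama-3.1-8B-Instruct", "cerebras"))
```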
You are calling the base model, but Cerebras support often targets “Instruct”
Cerebras’ Inference Providers page prominently lists Llama-3.1-8B-Instruct among supported models. (Hugging Face)
Calling meta-llama/Llama-3.1-8B (base) may route differently than you expect (or fail over), which can increase latency and queueing.
Non-streaming responses are most fragile at the edge
If the backend takes a while before returning the first byte, an edge proxy can drop the connection. The most robust fix is streaming so the connection stays active and you start receiving tokens early. Hugging Face’s Inference Providers docs explicitly support streaming (SSE) for text generation and chat completions. (Hugging Face)
3) Direct answers to your 3 questions (as of now)
(1) Is there a 120s server-side limit for Llama 3.1-8B on my plan?
There is no clean “plan X = 120 seconds” statement in the public docs. What is clearly visible is:
- legacy `api-inference.huggingface.co` is deprecated and unreliable now (Hugging Face Forums)
- users repeatedly report gateway timeouts and “bad state” episodes on shared/serverless paths (Hugging Face Forums)
- 120s appears as a commonly encountered timeout threshold in older serverless flows (often around loading / waiting). (Hugging Face)
Practically: treat ~120s as an infrastructure cap you should engineer around, not a knob you can raise with a client timeout.
(2) Does X-Inference-Provider: cerebras support longer runtimes?
On the current Inference Providers interface, don’t rely on that header.
Instead:
- use the router OpenAI-compatible endpoint and set `model="…:cerebras"` (Hugging Face)
- or use `InferenceClient(provider="cerebras", ...)` (Hugging Face)
Even with Cerebras, the request still traverses Hugging Face routing infrastructure, so the router layer can still impose timeouts unless you stream.
(3) Do you need a specific plan/config to exceed 120s?
If you need reliably longer-running inference, the product that is designed for that is Inference Endpoints (dedicated), not shared serverless. Hugging Face’s own Hub docs draw that distinction: Inference Providers are serverless partner routing, while Inference Endpoints run on dedicated infrastructure. (Hugging Face)
Paid plans mainly affect billing continuity and credits (for example, PRO can continue pay-as-you-go after credits run out); they do not remove gateway timeouts. (Hugging Face)
4) Workarounds that actually help “today”
Fix 1: Stop calling the legacy URL. Use the router.
Use the OpenAI-compatible endpoint:
- Base URL shown in docs: `https://huggingface.co/proxy/router.huggingface.co/v1` (Hugging Face)

Then set the model with a provider suffix: `meta-llama/Llama-3.1-8B-Instruct:cerebras` (Hugging Face)
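Put together, a router call through the OpenAI-compatible endpoint looks roughly like this (assumes the `openai` Python package; the `hf_...` key is a placeholder for a Hugging Face token, not an OpenAI key):

```python
# Minimal sketch of calling the router's OpenAI-compatible endpoint with a
# provider suffix instead of the legacy api-inference.huggingface.co host.
BASE_URL = "https://huggingface.co/proxy/router.huggingface.co/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct:cerebras"

if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=BASE_URL, api_key="hf_...")  # HF token goes here
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarize SSE in one sentence."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)
```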
Fix 2: Turn on streaming (this is the single best mitigation for edge timeouts)
Hugging Face documents that `stream=true` returns tokens via SSE for text generation. (Hugging Face)
Chat completion docs also highlight streaming support. (Hugging Face)
Streaming changes the failure mode because:
- you get time-to-first-token quickly
- the connection stays active with incremental bytes
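A streaming call with the `huggingface_hub` client can be sketched as below; the token is a placeholder, and the pure accumulator works on any iterable of text deltas:

```python
# Streaming sketch: consume SSE chunks so bytes flow before the gateway's
# idle window expires, instead of waiting silently for the full response.

def accumulate(deltas) -> str:
    """Join incremental text deltas (skipping empty/None ones) into full text."""
    return "".join(d for d in deltas if d)

if __name__ == "__main__":
    from huggingface_hub import InferenceClient

    client = InferenceClient(provider="cerebras", token="hf_...")  # placeholder token
    stream = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about latency."}],
        max_tokens=128,
        stream=True,  # keeps the connection active; first token arrives early
    )
    text = accumulate(
        chunk.choices[0].delta.content for chunk in stream if chunk.choices
    )
    print(text)
```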
Fix 3: Reduce prefill cost, not just max_new_tokens
People often only reduce output tokens. But long latency often comes from prompt prefill:
- very long input
- high concurrency causing queueing
- cold starts
On the Inference Providers text generation spec, there is also `truncate` (input token truncation) that can help keep prefill bounded. (Hugging Face)
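A sketch of bounding prefill from both sides: a crude client-side clamp (the 4-characters-per-token heuristic is an assumption for illustration, not a tokenizer) plus the server-side `truncate` field from the text generation spec. Token and support for `truncate` on any given provider are assumptions.

```python
# Bound prefill cost, not just output length: clamp the prompt client-side
# and ask the server to truncate input tokens as well.

def clamp_prompt(prompt: str, max_input_tokens: int, chars_per_token: int = 4) -> str:
    """Keep roughly the last `max_input_tokens` worth of text (crude heuristic)."""
    budget = max_input_tokens * chars_per_token
    return prompt[-budget:] if len(prompt) > budget else prompt

if __name__ == "__main__":
    from huggingface_hub import InferenceClient

    client = InferenceClient(token="hf_...")  # placeholder token
    long_doc = "some very long document " * 10_000
    out = client.text_generation(
        clamp_prompt(long_doc, max_input_tokens=2048),
        model="meta-llama/Llama-3.1-8B-Instruct",
        max_new_tokens=128,
        truncate=2048,  # server-side input truncation per the task spec
    )
    print(out)
```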
Fix 4: If you truly need long wall-clock jobs, switch product shape
If your workload is “generate for minutes” or “large document transformations”, serverless shared APIs are the wrong primitive. Use:
- Inference Endpoints (dedicated) (Hugging Face)
- or self-host a server like TGI where streaming is first-class and you control infra timeouts. (Hugging Face)
5) A quick diagnostic checklist to confirm the root cause
- Capture response headers when it works. Look for `x-inference-provider` (what actually served you); this is commonly present. (Hugging Face Forums)
- Differentiate “HTTP error response” vs “dropped connection”:
  - If you get a JSON error or HTTP 503/504, it is gateway/back-end signaling. (Hugging Face Forums)
  - If your client just times out with no HTTP status, it is almost certainly an upstream close or network idle timeout.
- Check if you are hitting shared instability. There are periodic forum reports of timeouts across multiple models or accounts, especially on shared paths. (Hugging Face Forums)
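The first two checks can be run in one probe, roughly as follows (token is a placeholder; the read timeout is deliberately set above the suspected ~120s so you can observe who gives up first):

```python
# Diagnostic sketch: distinguish an HTTP-level error (gateway signaling) from a
# silent connection drop, and record which provider actually served the request.

def served_by(headers) -> "str | None":
    """Case-insensitively read the x-inference-provider response header."""
    return {k.lower(): v for k, v in headers.items()}.get("x-inference-provider")

if __name__ == "__main__":
    import requests  # third-party; only needed for the live probe

    try:
        resp = requests.post(
            "https://huggingface.co/proxy/router.huggingface.co/v1/chat/completions",
            headers={"Authorization": "Bearer hf_..."},  # placeholder token
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct:cerebras",
                "messages": [{"role": "user", "content": "ping"}],
            },
            timeout=(10, 300),  # (connect, read): read cap above the suspected 120s
        )
        # An HTTP status (even 503/504) means the gateway/backend answered.
        print("HTTP", resp.status_code, "provider:", served_by(resp.headers))
    except requests.exceptions.ReadTimeout:
        print("No HTTP status: upstream close or idle timeout, not an app-level error")
```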
6) Similar cases and threads worth reading
These show the same classes of failures you are seeing:
- The exact same “120s timeout on Llama 3.1-8B” report (Hugging Face Forums)
- “Model not loaded… current: 120” style errors on legacy serverless (Hugging Face)
- Gradio issue requesting `X-Wait-For-Model` to avoid serverless cold-start timeouts (GitHub)
- Shared gateway timeouts reported by users (504) (Hugging Face Forums)
- Legacy endpoint deprecation and “use router…” errors (Hugging Face Forums)
7) High-quality docs and guides for the “new” correct way
Core docs:
- Hugging Face `huggingface_hub` inference guide explaining Providers vs Endpoints (Hugging Face)
- Inference Providers Chat Completion task docs (OpenAI-compatible, streaming) (Hugging Face)
- Inference Providers Text Generation task docs (SSE streaming, request fields like `truncate`) (Hugging Face)
- Cerebras provider page (supported models and `:cerebras` usage) (Hugging Face)
- Pricing and billing (what plans change and what they don’t) (Hugging Face)
If you want the fastest path to human answers, use the Inference Providers discussions aggregator Space you found. (Hugging Face)
Summary
- Your 120s cutoff is almost certainly a server-side gateway/proxy behavior, not your client timeout.
- The legacy `api-inference.huggingface.co/models/...` path is deprecated now. Move to the router. (Hugging Face Forums)
- For Cerebras, use `model="…:cerebras"` (and usually the Instruct variant). (Hugging Face)
- Use streaming (`stream=true`) to avoid long silent waits that trigger gateway timeouts. (Hugging Face)
- If you truly need multi-minute requests reliably, use dedicated Inference Endpoints or self-host TGI. (Hugging Face)