[FEEDBACK] Inference Providers
Any inference provider you love, and that you'd like to be able to access directly from the Hub?
Love that I can call DeepSeek R1 directly from the Hub 🔥
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="together",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx"
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=messages,
    max_tokens=500
)

print(completion.choices[0].message)
Is it possible to set a monthly payment budget or rate limits for all the external providers? I don't see such options in the billing tab. If a key or session token is stolen, it could be quite dangerous to my thin wallet :(
@benhaotang you already get spending notifications when crossing important thresholds ($10, $100, $1,000) but we'll add spending limits in the future
Thanks for your quick reply, good to know!
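Until spending limits exist, a client-side check is one possible stopgap. The sketch below only mirrors the notification thresholds quoted above ($10, $100, $1,000); the function name and the idea of tracking spend locally are assumptions for illustration, not a Hub feature or API.

```python
from decimal import Decimal

# Thresholds taken from the reply above; everything else here is hypothetical.
THRESHOLDS = (Decimal("10"), Decimal("100"), Decimal("1000"))


def crossed_thresholds(previous_total, new_total, thresholds=THRESHOLDS):
    """Return the thresholds passed when spend moves from previous_total to new_total."""
    prev = Decimal(str(previous_total))
    new = Decimal(str(new_total))
    return [t for t in thresholds if prev < t <= new]


# Example: spend jumps from $5 to $150 in one billing update.
for threshold in crossed_thresholds(5, 150):
    print(f"Spend crossed ${threshold}")
```

A locally tracked total can drift from what is actually billed, so this is a warning aid at best, not a hard limit.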
Would be great if you could add Nebius AI Studio to the list :) New inference provider on the market, with the absolute cheapest prices and the highest rate limits...
Could be good to add featherless.ai
TitanML !!
Hi team,
I'm Ben, one of the founding engineers at relaxAI. We're a sovereign AI inference provider hosted on Civo infrastructure, the UK's leading sovereign cloud platform. We'd like to express our interest in becoming a HuggingFace Inference Provider.
About relaxAI
relaxAI provides high-performance, OpenAI-compatible inference for leading open source LLMs. We're currently serving models that are popular across your existing providers, including:
- gpt-oss-120b
- Kimi-K2.5
- Llama 4 Maverick
Our infrastructure runs on Blackwell GPUs through our NVIDIA partnership, and our throughput, latency, and token pricing are very competitive.
What Sets Us Apart
What makes us different from your current provider lineup is that we're fully UK sovereign: 100% UK data residency, processing, and legal jurisdiction. Sovereign AI is a growing priority for UK and European enterprises, particularly in regulated industries, and as far as we can tell there isn't currently a UK-domiciled provider represented in Inference Providers. We think that's a meaningful gap we could help fill.
Integration Readiness
We already have a Hub organisation at huggingface.co/relaxai and we're ready to get started immediately: JS client PR, model mappings, billing endpoint, the lot. Our API is fully OpenAI-compatible, so we'd expect the integration to be straightforward. Happy to provide API access for testing whenever suits.
We'd appreciate guidance on:
- Timeline expectations for review and approval
- Any provider-specific requirements beyond the standard onboarding docs
Contact
- Website: https://relax.ai
- API Docs: https://relax.ai/docs
- Email: ben@relax.ai
Best,
Ben
Founding Engineer, relaxAI
Hi Hugging Face team,
I'd like to introduce StrikeEngine, a new LLM inference provider launching in May 2026, and express
interest in joining the Inference Providers program. I've read through the register-as-a-provider guide
in the docs and I'm planning to start with the JS client PR per Step 2, since our API is OpenAI-compatible
from day 1.
What we do
Host AWQ-int4 quantized open-weight LLMs (Qwen, DeepSeek, MiMo, GLM, Llama, Mistral families) on a
hybrid compute architecture (Modal serverless for launch latency + own spot-GPU fleet across AWS/Azure/GCP
for sustained margin). OpenAI-compatible /v1/chat/completions, multi-region failover, SLA monitoring, and
the Inference-Id header for the billing endpoint.
Our differentiation
First-responder onboarding of new Chinese-lab open-weight releases, targeting 24-72h from model drop to a
listed endpoint. The current provider lineup has excellent latency specialists (Cerebras, Groq,
Fireworks) and price leaders (Novita), but no one is explicitly focused on the "first-to-list
new-releases" angle. That's the gap we fill. Target catalog: Qwen3.5 / Qwen-Coder-Next, DeepSeek V4+,
MiMo, GLM-4.7+, Kimi-K2+, MiniMax-M2+, with an automated pipeline detecting releases and onboarding them
within hours.
Why HF Inference Providers
HF is where the open-weights community lives. Our model focus (quantized open-weights from Chinese labs)
aligns exactly with what your community searches for daily. We complement rather than replicate existing
providers by closing the "waiting-for-someone-to-list-the-new-release" gap your buyers currently
experience.
Technical posture toward the 9-step integration
- Step 1 (Task API): OpenAI-compatible, so we should be able to skip most task-specific work
- Step 2 (JS client PR): happy to open this first as the canonical entry
- Step 3 (Model Mapping): will upgrade to Team Hub plan when we reach this step
- Step 4 (Billing): implementing Inference-Id + cost lookup endpoint natively
- Steps 5-9: will follow in order with PRs to huggingface_hub and hub-docs
- We meet the <5s time-to-first-token requirement by design (hybrid serverless + NVMe-cached spot)
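The time-to-first-token figure in the last bullet can be checked against any streaming response. Here is a minimal sketch, assuming the stream is exposed as a plain iterable of text chunks; `time_to_first_token` and `fake_stream` are hypothetical names used for illustration, and the 5-second constant is simply the figure quoted in the post.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

TTFT_LIMIT_SECONDS = 5.0  # the "<5s" figure quoted in the post above


def time_to_first_token(
    stream: Iterable[str], clock=time.perf_counter
) -> Tuple[Optional[str], float]:
    """Return (first_non_empty_chunk, seconds_elapsed) for an iterable of text chunks."""
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return chunk, clock() - start
    return None, clock() - start  # the stream ended without producing text


def fake_stream(delay_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming chat completion."""
    time.sleep(delay_seconds)  # simulated cold start before the first chunk
    yield ""
    yield "Paris"
    yield " is the capital of France."


first, elapsed = time_to_first_token(fake_stream(0.05))
print(f"first chunk: {first!r}, within limit: {elapsed < TTFT_LIMIT_SECONDS}")
```

With a real client, the iterable would be the text deltas of a streamed completion, and the clock should start before the request is sent so that network and queueing time are included.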
Where we are today
- Landing: https://strikeengine.dev
- Partnerships contact: partnerships@strikeengine.dev
- General: hello@strikeengine.dev
- Launch target: mid-to-late May 2026
What I'd like to request
- Whether opening the JS client PR is the right first concrete action, or if there's a preferred order
- Any prerequisites I might have missed
Thanks!
Hicham Abadou
Founder, StrikeEngine
Hi Hugging Face team,
We're Oxlo.ai - an AI inference platform backed by Cyborg Network, featured in STL Partners' "Top 50 Edge Companies 2026." We serve 40+ open-source models (Qwen 3 32B, Llama 3.3 70B, DeepSeek R1, Whisper, SDXL, etc.) via fully OpenAI SDK-compatible APIs.
We'd like to integrate as an Inference Provider on the Hub. Our API is already OpenAI-compatible, so the technical integration should be straightforward.
Why Oxlo.ai is different: We use request-based pricing (flat fee per API call, not per token), which makes us complementary to existing token-based providers on the Hub.
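To make the request-based versus token-based comparison concrete, here is a small worked sketch. Every price in it is a made-up placeholder (the post quotes no figures), and the function names are hypothetical; it only shows how the cheaper option flips with request size.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def token_cost(input_tokens, output_tokens, input_per_million, output_per_million):
    """Cost in USD of one call under per-token pricing."""
    return (
        Decimal(input_tokens) * Decimal(str(input_per_million))
        + Decimal(output_tokens) * Decimal(str(output_per_million))
    ) / MILLION


def cheaper_option(flat_fee, input_tokens, output_tokens, input_per_million, output_per_million):
    """Compare a flat per-request fee with per-token pricing for one call."""
    per_token = token_cost(input_tokens, output_tokens, input_per_million, output_per_million)
    flat = Decimal(str(flat_fee))
    if flat < per_token:
        return "request-based"
    if flat > per_token:
        return "token-based"
    return "equal"


# Placeholder prices: $0.002 per request vs $0.50 / $1.50 per 1M input / output tokens.
print(cheaper_option("0.002", 200, 100, "0.50", "1.50"))    # short call
print(cheaper_option("0.002", 4000, 2000, "0.50", "1.50"))  # long call
```

Under these placeholder numbers the short call costs $0.00025 per token-priced request and the long call $0.005, so a flat fee favours long prompts and long completions.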
We have 700+ active users across 100+ countries and are ready to begin the technical integration (huggingface.js PR, Python SDK PR, model mappings).
Could you connect us with the right person on your partnerships or developer relations team?
Contact:
- Website: https://www.oxlo.ai/
- Docs: https://docs.oxlo.ai/docs/
- Email: hello@oxlo.ai
Best,
Shashank MS, Oxlo.ai Team
Hey @julien-c, we'd like to register as an inference provider and do some co-marketing too; please let me know the way forward. We have already done the technical readiness work on our end. Our site is www.univars.space, and we're aiming to be the first inference provider partner from Africa.
Hi Hugging Face team,
We would like to register Xpersona as a Hugging Face Inference Provider.
Provider
- Provider name: Xpersona
- Hugging Face organization: xpersona-co
- Website: https://xpersona.co
- Docs: https://xpersona.co/docs
- Hosted model card: https://huggingface.co/xpersona-co/xpersona-frieren-coder
- OpenAI-compatible API base: https://www.xpersona.co/v1
- Chat completions endpoint: https://www.xpersona.co/v1/chat/completions
- Public model catalog: https://www.xpersona.co/v1/models
- Public pricing endpoint: https://www.xpersona.co/v1/pricing
- Auth: Authorization: Bearer <XPERSONA_API_KEY>
Initial Model Mapping
[
  {
    "hfModel": "xpersona-co/xpersona-frieren-coder",
    "providerModel": "xpersona-frieren-coder",
    "task": "conversational"
  }
]
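A mapping list like the one above can be sanity-checked before it is submitted. This is a minimal sketch that assumes the only required keys are the three shown in the post (`hfModel`, `providerModel`, `task`); the actual Model Mapping API may expect other fields, and `validate_mappings` is a hypothetical helper.

```python
import json

# Keys as they appear in the post; the real API schema may differ.
REQUIRED_KEYS = {"hfModel", "providerModel", "task"}


def validate_mappings(raw: str) -> list:
    """Parse a JSON array of mapping entries and check each one for the expected keys."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of mapping entries")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        if "/" not in entry["hfModel"]:
            raise ValueError(f"entry {i}: hfModel should look like 'org/model'")
    return entries


raw = """
[
  {
    "hfModel": "xpersona-co/xpersona-frieren-coder",
    "providerModel": "xpersona-frieren-coder",
    "task": "conversational"
  }
]
"""
print(validate_mappings(raw)[0]["providerModel"])
```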
Model Details
- Display name: Xpersona Frieren Coder
- Provider model id: xpersona-frieren-coder
- Type: hosted proprietary/API model, no open weights
- Inputs: text and image
- Output: text
- Context window: 400,000 tokens
- Max output: 128,000 tokens
- Tool calling: supported
- Structured output: supported
- OpenAI-compatible chat completions: supported
Pricing
- Input: $1.50 / 1M tokens
- Cached input: $0.15 / 1M tokens
- Output: $6.00 / 1M tokens
- Minimum successful request charge: $0.001
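As a worked example of the price list above, the sketch below computes the charge for a single request. It assumes that cached input tokens are billed separately from uncached ones and that the minimum charge acts as a floor on the total; the post does not spell out either rule, and `request_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the list above.
PRICE_PER_MILLION = {
    "input": Decimal("1.50"),
    "cached_input": Decimal("0.15"),
    "output": Decimal("6.00"),
}
MINIMUM_CHARGE = Decimal("0.001")
MILLION = Decimal(1_000_000)


def request_cost(uncached_input_tokens: int, cached_input_tokens: int, output_tokens: int) -> Decimal:
    """USD charge for one successful request (assumption: the minimum is a floor on the total)."""
    raw = (
        Decimal(uncached_input_tokens) * PRICE_PER_MILLION["input"]
        + Decimal(cached_input_tokens) * PRICE_PER_MILLION["cached_input"]
        + Decimal(output_tokens) * PRICE_PER_MILLION["output"]
    ) / MILLION
    return max(raw, MINIMUM_CHARGE)


# 10,000 uncached + 50,000 cached input tokens and 2,000 output tokens.
print(request_cost(10_000, 50_000, 2_000))
# A tiny request that falls below the minimum charge.
print(request_cost(100, 0, 20))
```

The first request works out to $0.015 + $0.0075 + $0.012 = $0.0345; the second would be $0.00027 and is raised to the $0.001 minimum under the floor assumption.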
Notes
Xpersona is already listed in Models.dev/OpenCode as provider id xpersona, with OpenCode model id xpersona/xpersona-frieren-coder. We can provide any additional endpoint details, billing endpoint shape, JS/Python client implementation, or validation credentials needed for review.
Hi @Wauplin @SBrandeis @julien-c @hanouticelina, opening the registration for UomiRouter as a new inference provider.
UomiRouter is an OpenAI-compatible inference network. Traffic is served by accredited operator nodes that are part of the UOMI network: each operator runs the engine on their own GPU hardware (datacenter or homelab) after a hardware + reliability vetting. Throughput and quality SLAs are guaranteed across the listed catalog. Operators commit to a strict privacy policy (no prompt logging, no training-data collection), payload obfuscation in transit and at rest, and OPoC (Off-chain Proof of Computation): every response is signed by the operator's wallet key and carries a SHA256 of the output (returned as x-wallet-signature / x-wallet-pubkey headers), and a sampled fraction is cross-dispatched to an independent operator for re-verification. The on-chain anchoring layer on UOMI L1 is the next milestone and is not live yet.
The differentiator vs centralized APIs (closed box) and naive decentralized GPU markets (no proof of computation at all) is verifiability today: clients can check off-chain that the operator they were billed for actually produced the tokens they got.
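The hash half of that off-chain check can be sketched in a few lines. This assumes the digest is a hex-encoded SHA256 over the UTF-8 bytes of the output text, which the post does not specify; signature verification against x-wallet-pubkey is left out because the signature scheme is not described. `output_matches_digest` is a hypothetical helper, not part of any UomiRouter client.

```python
import hashlib
import hmac


def output_matches_digest(output_text: str, claimed_sha256_hex: str) -> bool:
    """Check that the text received hashes to the digest the operator claims to have signed."""
    # Assumption: hex digest over the UTF-8 bytes of the full output text.
    actual = hashlib.sha256(output_text.encode("utf-8")).hexdigest()
    # Constant-time comparison, tolerant of case and surrounding whitespace.
    return hmac.compare_digest(actual, claimed_sha256_hex.strip().lower())


# Local round trip with a digest computed here rather than taken from a real response.
text = "Paris is the capital of France."
claimed = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(output_matches_digest(text, claimed))        # True
print(output_matches_digest(text + "!", claimed))  # False
```

A matching digest only shows that the text was not altered after hashing; tying it to a specific operator still requires verifying the signature.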
3 PRs already open per the new-provider checklist
- huggingface.js: https://github.com/huggingface/huggingface.js/pull/2193
- huggingface_hub: https://github.com/huggingface/huggingface_hub/pull/4256
- hub-docs (page + sidebar + table + logos): https://github.com/huggingface/hub-docs/pull/2499
Integration details
- Endpoint: https://gateway.uomi.ai (OpenAI Chat Completions spec; streaming, tool calling, structured output, vision via Qwen3.6-VL all supported)
- Billing endpoint live: POST /partner/hf/billing returns {requests:[{requestId, costNanoUsd}]} per spec, batched up to 10k. Auth token ready to share via DM.
- Per-request Inference-Id header: UUID4 emitted on every response.
- Org: uomi-network, Team plan active, ready for the server-side partner-flag flip so we can call /api/partners/uomirouter/models for the staging mappings.
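The billing response shape described in the list above can be illustrated with a short sketch. It assumes costs are stored in USD and converted to integer nano-USD (1 USD = 1,000,000,000 nano-USD), and that unknown request IDs are simply left out of the response; both are assumptions, and the helper names are hypothetical rather than taken from any of the linked PRs.

```python
import uuid
from decimal import Decimal

MAX_BATCH = 10_000  # "batched up to 10k" per the post
NANO_PER_USD = Decimal(1_000_000_000)


def new_inference_id() -> str:
    """UUID4 string of the kind the post says is emitted in the Inference-Id header."""
    return str(uuid.uuid4())


def build_billing_response(request_ids, cost_usd_by_id):
    """Build a {requests: [{requestId, costNanoUsd}]} payload for a batch of request IDs."""
    if len(request_ids) > MAX_BATCH:
        raise ValueError(f"batch of {len(request_ids)} exceeds the limit of {MAX_BATCH}")
    return {
        "requests": [
            {
                "requestId": rid,
                "costNanoUsd": int(Decimal(str(cost_usd_by_id[rid])) * NANO_PER_USD),
            }
            for rid in request_ids
            if rid in cost_usd_by_id  # assumption: unknown IDs are omitted
        ]
    }


rid = new_inference_id()
costs = {rid: "0.0345"}  # hypothetical stored cost in USD
print(build_billing_response([rid, "unknown-id"], costs))
```

Integer nano-USD avoids floating-point drift when many small per-request costs are summed.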
Initial catalog (3 conversational models)
FP8 served internally for the two Qwen models, FP8-Dynamic for Gemma:
| HF model ID | Type |
|---|---|
| Qwen/Qwen3.6-27B | dense |
| Qwen/Qwen3.6-35B-A3B | MoE |
| google/gemma-4-31b-it | VLM |
Contact
info@uomi.ai
Happy to jump on a call or set up a Slack channel for the integration review.
This is a great idea for Wan model inference. By the way, if you ever need to share HTML previews of your AI video outputs, the HTML to URL Converter is super handy for quick sharing without any signup.
Hi @julien-c @Wauplin @SBrandeis @hanouticelina,
We're the team behind Phala and we'd like to formally express our interest in joining the Hugging Face Inference Providers program.
About Phala
Phala (phala.com) is a confidential AI cloud that delivers private LLM inference on hardware-protected GPU infrastructure. Phala runs LLMs on dedicated GPU clusters with hardware-level isolation, providing runtime attestation so users can cryptographically verify that their prompts and outputs were never exposed. Our OpenAI-compatible API gateway at https://api.redpill.ai/v1 gives developers a drop-in replacement for the OpenAI SDK with verifiable privacy guarantees.
We are already a verified inference provider on OpenRouter (openrouter.ai/provider/phala), where we have processed over 2.9 billion tokens across 14 models as of April 2026, demonstrating production-grade reliability and scale.
Why We're a Strong Fit
Unique Differentiation (Confidential Inference): Phala is the only inference provider in the current HF lineup that offers hardware-attested, confidential GPU inference. For users handling sensitive data, regulated workloads, or privacy-critical applications, this is a meaningful capability gap we can fill.
OpenAI-Compatible API: Our API (/v1/chat/completions, /v1/embeddings) is a full drop-in replacement for the OpenAI SDK, making JS and Python client integration straightforward.
Proven Scale on OpenRouter: Our top models by token volume on OpenRouter:
| Model | Tokens Processed (Apr 2026) |
|---|---|
| Qwen2.5 7B Instruct | 1.03B |
| gpt-oss-120b | 653M |
| Kimi K2.6 | 246M |
| Qwen3.5-27B | 196M |
| GLM 5.1 | 171M |
| Qwen3 VL 30B A3B Instruct | 154M |
| Kimi K2.5 | 114M |
| GLM 4.7 Flash | 113M |
| Gemma 3 27B | 70.1M |
Supported Models & Tasks
We currently serve the following open-weight models on Phala's confidential GPU infrastructure:
| HF Model | Task | Context |
|---|---|---|
| Qwen/Qwen3.5-27B | Chat Completion (LLM) | 256K |
| Qwen/Qwen3.5-397B-A17B | Chat Completion (LLM) | 256K |
| Qwen/Qwen3-VL-30B-A3B-Instruct | Chat Completion (VLM) | 262K |
| Qwen/Qwen2.5-7B-Instruct | Chat Completion (LLM) | 131K |
| google/gemma-3-27b-it | Chat Completion (VLM) | 131K |
| google/gemma-4-31b-it | Chat Completion (VLM) | 262K |
| openai/gpt-oss-120b | Chat Completion (LLM) | 131K |
| openai/gpt-oss-20b | Chat Completion (LLM) | 131K |
| moonshotai/Kimi-K2.6 | Chat Completion (VLM) | 262K |
| moonshotai/Kimi-K2.5 | Chat Completion (VLM) | 262K |
| THUDM/GLM-5.1 | Chat Completion (LLM) | 203K |
| THUDM/GLM-5 | Chat Completion (LLM) | 203K |
| THUDM/GLM-4.7-Flash | Chat Completion (LLM) | 200K |
| THUDM/GLM-4.7 | Chat Completion (LLM) | 200K |
| minimax/MiniMax-M2.5 | Chat Completion (LLM) | 205K |
| Qwen/Qwen3-Embedding-8B | Feature Extraction | 32K |
| sentence-transformers/all-MiniLM-L6-v2 | Feature Extraction | 512 |
Supported Tasks: Chat Completion (LLM), Chat Completion (VLM), Feature Extraction (Embeddings)
Integration Readiness
- Our API is fully OpenAI-compatible, so the JS and Python client integration should be straightforward.
- We have an existing organization on the Hub (huggingface.co/phalanetwork) and are ready to upgrade to a Team/Enterprise plan as required.
- We're prepared to submit the JS client PR, register model mappings via the Model Mapping API, implement the billing endpoint, and follow the full onboarding checklist.
- We can have a working integration ready within 1-2 weeks of receiving guidance.
Contact
- Website: https://phala.com
- API Base URL: https://api.redpill.ai/v1
- OpenRouter Profile: https://openrouter.ai/provider/phala
- Hub Org: https://huggingface.co/phalanetwork
We're happy to provide API access for testing, jump on a call, or align our implementation to any specific requirements. Looking forward to hearing from you!
Best,
The Phala Team