---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's Jobs API.

## Setup

### Prerequisites

- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:
```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:
```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization

## Usage

### Starting the Dashboard

Run the Gradio app:
```bash
python app.py
```

### Initializing Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```

Format: `model_name provider_name` (separated by spaces or tabs)
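
If you generate or consume this file from your own scripts, parsing is straightforward. A minimal sketch (the app's actual parsing lives in [utils/io.py](utils/io.py) and may differ):

```python
# Sketch: parse models_providers.txt into (model, provider) pairs.
# Not the app's implementation; see utils/io.py for the real one.
from __future__ import annotations

from pathlib import Path

def read_model_provider_pairs(path: str = "models_providers.txt") -> list[tuple[str, str]]:
    pairs = []
    for raw in Path(path).read_text().splitlines():
        parts = raw.split()  # split() handles any run of spaces or tabs
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs
```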

### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination (sketched below)
- Log the job ID and status
- Monitor job progress automatically
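
In script form, that loop amounts to something like the sketch below, using the parser sketched earlier and assuming the Jobs API in recent versions of `huggingface_hub` (`run_job`). The image and command are placeholders, not the dashboard's actual configuration, which lives in [utils/jobs.py](utils/jobs.py):

```python
# Sketch: launch one job per (model, provider) pair. The image and command
# here are placeholders; see utils/jobs.py for what the app really runs.
from huggingface_hub import run_job

def launch_all(tasks: str, config_path: str = "models_providers.txt") -> None:
    for model, provider in read_model_provider_pairs(config_path):
        job = run_job(
            image="python:3.12",  # placeholder image
            command=["python", "-c", f"print('evaluate {model} via {provider} on {tasks}')"],
            namespace="IPTesting",  # jobs run under the IPTesting org
        )
        print(f"Launched job {job.id} for {model} on {provider}")
```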

### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job ID**: The ID of the most recent job; open `https://huggingface.co/jobs/NAMESPACE/JOBID` (with the namespace and job ID filled in) to inspect it

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for manual updates.
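
In Gradio 5.x terms, that auto-refresh can be wired with a `gr.Timer`. A minimal sketch, not the exact wiring in [app.py](app.py), with `fetch_results` as a hypothetical stand-in:

```python
# Sketch: refresh a results table every 30 seconds and on button click.
# fetch_results is a stand-in for the app's real results lookup.
import gradio as gr
import pandas as pd

def fetch_results() -> pd.DataFrame:
    return pd.DataFrame([{"Model": "...", "Provider": "...", "Status": "..."}])

with gr.Blocks() as demo:
    table = gr.Dataframe(value=fetch_results())
    gr.Timer(30).tick(fetch_results, outputs=table)  # fires every 30 seconds
    gr.Button("Refresh Results").click(fetch_results, outputs=table)

demo.launch()
```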

## Configuration

### Tasks Format

The tasks parameter follows the lighteval format (`suite|task|num_fewshot`). Example:
- `lighteval|mmlu|0` - the MMLU benchmark, zero-shot

### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day.
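
As noted under Architecture, this is handled by an APScheduler cron trigger. A minimal sketch of that pattern, with `save_checkpoint` as a hypothetical stand-in for the app's save routine:

```python
# Sketch: run a daily checkpoint at midnight with APScheduler.
# save_checkpoint is a stand-in for the app's actual save routine.
from apscheduler.schedulers.background import BackgroundScheduler

def save_checkpoint() -> None:
    print("Saving results to the Hugging Face dataset...")

scheduler = BackgroundScheduler()
scheduler.add_job(save_checkpoint, "cron", hour=0, minute=0)  # every day at 00:00
scheduler.start()
```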

### Data Persistence

All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF `datasets` library (example below)
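
For example, from any script or notebook with read access to the dataset:

```python
# Load the persisted results and inspect their structure.
from datasets import load_dataset

ds = load_dataset("IPTesting/inference-provider-test-results")
print(ds)  # shows splits, column names, and row counts
```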

## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to guard concurrent access to `job_results` (sketched below)
- **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
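
The monitor-thread-plus-lock pattern looks roughly like this (names are illustrative, not the exact ones in [app.py](app.py)):

```python
# Sketch: background thread that polls job statuses under a lock.
# job_results and check_job_status are illustrative stand-ins.
import threading
import time

job_results: dict = {}
job_results_lock = threading.Lock()

def check_job_status(job_id: str) -> str:
    return "running"  # stand-in for a real status query

def monitor_loop() -> None:
    while True:
        with job_results_lock:  # serialize access shared with the UI thread
            for job_id, entry in job_results.items():
                entry["status"] = check_job_status(job_id)
        time.sleep(30)  # poll every 30 seconds

threading.Thread(target=monitor_loop, daemon=True).start()
```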

## Troubleshooting

### Jobs Not Launching

- Verify that `HF_TOKEN` is set and has the required permissions (quick check below)
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages
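
A quick way to check the token from Python (field names follow the Hub's `whoami` response; adjust if your `huggingface_hub` version differs):

```python
# Sketch: confirm HF_TOKEN is valid and that you belong to IPTesting.
import os
from huggingface_hub import whoami

info = whoami(token=os.environ.get("HF_TOKEN"))
print(info["name"])  # your username
print([org["name"] for org in info.get("orgs", [])])  # should include IPTesting
```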

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in the job logs
- It takes the score for each task from the first row where the task name appears
- The final score is the average of all task scores (see the sketch after this list)
- Example table format:
```
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
```
- If scores don't appear, check the console output for extraction or parsing errors
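
That extraction amounts to roughly the following sketch (illustrative, not the exact code in [utils/jobs.py](utils/jobs.py)):

```python
# Sketch: average the first score seen for each task in a results table.
from __future__ import annotations

def extract_average_score(log_text: str) -> float | None:
    scores: dict[str, float] = {}
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in stripped.strip("|").split("|")]
        # Rows look like: Task | Version | Metric | Value | Stderr
        if len(cells) < 5 or cells[0] in scores:
            continue  # malformed row, or this task's first row was already taken
        try:
            scores[cells[0]] = float(cells[3])
        except ValueError:
            continue  # header or non-numeric row
    return sum(scores.values()) / len(scores) if scores else None

logs = """
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
"""
print(extract_average_score(logs))  # (0.9100 + 0.5000) / 2 = 0.705
```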

## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file