RLHFlow

university

AI & ML interests

Workflow of Reinforcement Learning from Human Feedback (RLHF). Blog: https://rlhflow.github.io/

Papers

Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

RLHFlow's collections 12

Reinforce-Ada

Training & test sets and finetuned models

Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

Paper • 2510.04996 • Published Oct 6, 2025 • 16
weqweasdas/math500

Viewer • Updated Mar 19, 2025 • 500 • 61
weqweasdas/aime_hmmt_brumo_cmimc_amc23

Viewer • Updated Sep 7, 2025 • 230 • 7
weqweasdas/olympiadbench

Viewer • Updated Mar 19, 2025 • 675 • 69

Online-DPO-R1

This is the collection of the online-DPO-R1 project.

RLHFlow/Qwen2.5-7B-PPO-Zero

8B • Updated Feb 17, 2025 • 10 • 3
RLHFlow/Qwen2.5-7B-DPO-Zero

8B • Updated Feb 17, 2025 • 2
RLHFlow/Qwen2.5-7B-DPO-NLL-Zero

8B • Updated Feb 17, 2025 • 2
RLHFlow/Qwen2.5-7B-RAFT-Zero

8B • Updated Feb 17, 2025 • 4

RLHFlow MATH Process Reward Model

This is a collection of datasets and models for process reward modeling.

RLHFlow/Mistral-PRM-Data

Viewer • Updated Nov 9, 2024 • 273k • 129 • 11
RLHFlow/Mistral-GSM8K-Test

Viewer • Updated Nov 2, 2024 • 1.32k • 11
RLHFlow/Mistral-MATH500-Test

Viewer • Updated Nov 9, 2024 • 500 • 16
RLHFlow/Llama3.1-8B-PRM-Mistral-Data

Text Generation • 8B • Updated Nov 9, 2024 • 148 • 10

Mixture-of-preference-reward-modeling

The mixture of preference datasets used for reward modeling.

hendrydong/preference_700K

Viewer • Updated Sep 28, 2024 • 700k • 613 • 17
weqweasdas/preference_dataset_mixture2_and_safe_pku

Viewer • Updated Apr 29, 2024 • 555k • 86 • 12

PM-pair

This is a collection of materials for training a pairwise preference model.

RLHFlow/pair-preference-dataset-mix1

Viewer • Updated May 6, 2024 • 548k • 20 • 3
RLHFlow/pair-preference-model-LLaMA3-8B

Text Generation • 8B • Updated Oct 14, 2024 • 60 • 38
RLHFlow/pair_preference_model_dataset

Viewer • Updated Apr 20, 2024 • 699k • 90 • 6

RLHFlow Reward Models

Reward models trained with the RLHFlow codebase (https://github.com/RLHFlow/RLHF-Reward-Modeling/)

RLHFlow/ArmoRM-Llama3-8B-v0.1

Text Classification • 8B • Updated Sep 23, 2024 • 20.9k • 185
RLHFlow/pair-preference-model-LLaMA3-8B

Text Generation • 8B • Updated Oct 14, 2024 • 60 • 38
sfairXC/FsfairX-LLaMA3-RM-v0.1

Text Classification • 8B • Updated Oct 14, 2024 • 1.22k • 60
RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published May 13, 2024 • 71

Minimal-RL

RLHFlow/Qwen2.5-Math-7B-Zero-RAFTpp

Text Generation • 8B • Updated May 21, 2025 • 10 • 1
RLHFlow/Qwen2.5-Math-7B-Zero-Reinforce-Rej

Text Generation • 8B • Updated May 21, 2025 • 4 • 1

Decision-Tree Reward Models

RLHFlow/Decision-Tree-Reward-Gemma-2-27B

Text Classification • 27B • Updated Jan 24, 2025 • 12 • 8
RLHFlow/Decision-Tree-Reward-Llama-3.1-8B

Text Classification • 8B • Updated Jan 24, 2025 • 36 • 7
RLHFlow/LLM-Preferences-HelpSteer2

Viewer • Updated Feb 5, 2025 • 9.13k • 11 • 1

Standard-format-preference-dataset

We collect open-source preference datasets and process them into a standard format.

RLHFlow/UltraFeedback-preference-standard

Viewer • Updated Apr 27, 2024 • 340k • 96 • 14
RLHFlow/Helpsteer-preference-standard

Viewer • Updated Apr 27, 2024 • 37.1k • 14 • 6
RLHFlow/HH-RLHF-Helpful-standard

Viewer • Updated Apr 27, 2024 • 115k • 94 • 4
RLHFlow/Orca-distibalel-standard

Viewer • Updated Apr 28, 2024 • 6.93k • 19 • 1

RM-Bradley-Terry

We train the reward model by maximum likelihood estimation under the Bradley-Terry model.

sfairXC/FsfairX-LLaMA3-RM-v0.1

Text Classification • 8B • Updated Oct 14, 2024 • 1.22k • 60
hendrydong/preference_700K

Viewer • Updated Sep 28, 2024 • 700k • 613 • 17
weqweasdas/RM-Mistral-7B

Text Classification • 7B • Updated Mar 31, 2024 • 3.31k • 25
weqweasdas/preference_dataset_mixture2_and_safe_pku

Viewer • Updated Apr 29, 2024 • 555k • 86 • 12
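
The Bradley-Terry objective named in this collection's description can be illustrated with a short sketch. Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected), so maximum likelihood estimation amounts to minimizing the mean of -log sigmoid(r_chosen - r_rejected) over preference pairs. The function names below (log_sigmoid, bradley_terry_nll) are hypothetical and operate on plain floats; this is not the RLHFlow training code, only a minimal illustration of the loss.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = -log(1 + exp(-x)); branch so exp never overflows.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_nll(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood of the observed preferences.

    chosen_rewards[i] and rejected_rewards[i] are the scalar rewards a model
    assigns to the preferred and dispreferred response of pair i
    (hypothetical inputs for illustration).
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("reward lists must have the same length")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")
    losses = [
        -log_sigmoid(c - r) for c, r in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Equal rewards give probability 0.5 for either response, so the loss is log(2).
    print(bradley_terry_nll([0.0], [0.0]))
    # A larger margin in favor of the chosen response lowers the loss.
    print(bradley_terry_nll([2.0, 1.5], [0.0, -0.5]))
```

In practice the rewards would come from a scalar head on a language model and the loss would be minimized with a gradient-based optimizer; the sketch only shows the quantity being minimized.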

Online RLHF

Datasets, code, and models for online RLHF (i.e., iterative DPO)

RLHFlow/prompt-collection-v0.1

Viewer • Updated May 8, 2024 • 179k • 41 • 9
RLHFlow/pair-preference-model-LLaMA3-8B

Text Generation • 8B • Updated Oct 14, 2024 • 60 • 38
sfairXC/FsfairX-LLaMA3-RM-v0.1

Text Classification • 8B • Updated Oct 14, 2024 • 1.22k • 60
RLHFlow/SFT-OpenHermes-2.5-Standard

Viewer • Updated Apr 24, 2024 • 1M • 102 • 3
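
For readers unfamiliar with the DPO objective that iterative DPO optimizes in each round, the following is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the current policy and a frozen reference model. The function name dpo_loss, its argument names, and the default beta=0.1 are assumptions made for illustration; they are not taken from the RLHFlow code, and the online loop itself (sampling new responses, labeling them with a reward or preference model, retraining) is not shown.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) equals softplus(-x); branch to keep exp from overflowing.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero and the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy raises the chosen response and lowers the rejected one: loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In an iterative setup, a loss of this form would be minimized on freshly collected preference pairs at each round, typically with the reference model held fixed within the round.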

SFT Models

We train a series of SFT models on RLHFlow's high-quality SFT dataset for research purposes.

RLHFlow/LLaMA3-SFT

Text Generation • 8B • Updated Nov 3, 2024 • 30 • 10
RLHFlow/RLHFlow-SFT-Dataset-ver2

Viewer • Updated Nov 2, 2024 • 2.32M • 51 • 5
RLHFlow/LLaMA3-SFT-v2

Text Generation • 8B • Updated Nov 3, 2024 • 1.39k • 3
RLHFlow/Llama3-SFT-v2.0-epoch1

Text Generation • 8B • Updated Nov 3, 2024 • 4

Team members 9
