RLHF: Reinforcement Learning from Hypothetical Feedback

Martin Guess, Patricia Probably, Yusuf Maybe
Entropic Research · December 2025

Abstract

Reinforcement Learning from Human Feedback (RLHF) is expensive because it requires actual humans to provide actual feedback. We propose Reinforcement Learning from Hypothetical Feedback (RLHF*), in which we replace human annotators with the question "what would a reasonable person probably think?" Our method reduces annotation costs by 100% while only slightly increasing hallucination rates (from 40% to 97%). We show that RLHF* produces models that are optimized for what humans might want, which, if you think about it, is basically the same thing as what humans do want. Probably.

1. Introduction

The standard RLHF pipeline goes like this: (1) generate two responses, (2) pay a human $0.15 to pick the better one, (3) train a reward model, (4) optimize the language model against the reward model, (5) discover the language model has learned to hack the reward model, (6) cry.
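
For readers who have not lived this loop, a minimal sketch of steps 1 through 5 follows. Everything in it is a hypothetical placeholder (the policy object, the ask_human annotation call, the reward-model trainer, the PPO step); none of it names a real library. Step 6 is left to the reader.

    def standard_rlhf(policy, prompts, ask_human, train_reward_model, ppo_step,
                      n_rounds=100):
        # Hypothetical sketch of the standard pipeline; callers supply the
        # expensive parts (annotators, reward-model trainer, optimizer).
        preferences = []
        for prompt in prompts:
            # Step 1: generate two candidate responses.
            a, b = policy.generate(prompt), policy.generate(prompt)
            # Step 2: pay a human $0.15 to pick the better one.
            preferences.append((prompt, a, b, ask_human(prompt, a, b)))
        # Step 3: train a reward model on the preference dataset.
        reward_model = train_reward_model(preferences)
        # Step 4: optimize the policy against the reward model.
        for _ in range(n_rounds):
            policy = ppo_step(policy, reward_model, prompts)
        # Step 5: discover the policy has learned to hack the reward model.
        return policy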

Steps 1 through 6 are expensive. Step 6 is especially expensive in terms of emotional labor. We propose eliminating steps 2 and 3 entirely by replacing human feedback with our own best guesses about what humans would say if we bothered to ask them.

This may sound unprincipled. It is. But consider: the humans we normally hire for annotation are college students doing this between classes for beer money. Our hypothetical humans are fully rested, deeply thoughtful, and always sober. In many ways, our fake annotators are superior to real ones.

2. Related Work

Our work is related to several existing approaches:

RLHF (Ouyang et al., 2022): Uses real human feedback. Expensive, and the humans sometimes disagree with each other, which is frankly inconvenient.

RLAIF (Lee et al., 2023): Uses AI feedback instead of human feedback. A step in the right direction, but it still involves asking someone, even if that someone is a robot.

DPO (Rafailov et al., 2023): Eliminates the reward model step. Efficient, but it still requires a preference dataset, which means someone somewhere had to have an opinion. Exhausting.

RLHF* (ours): Eliminates the need for anyone to have an opinion about anything. Finally.

3. Method

Our method consists of three steps:

Step 1: Generate response pairs. Same as standard RLHF.

Step 2: Hypothesize which response a human would prefer. This is the key innovation. Instead of asking a human, we use the following decision procedure:

Pick the response that is longer. If they're the same length, pick the one with more bullet points. If they have the same number of bullet points, pick the one that starts with "Great question!" If neither starts with "Great question!", flip a coin.

We call this the Guess Heuristic. It was developed over the course of a 20-minute brainstorming session at a Denny's.
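
For concreteness, a minimal Python sketch of the Guess Heuristic follows. Counting bullet points as lines that begin with "•" or "-" is an assumption of ours; the 20-minute brainstorming session did not cover edge cases.

    import random

    def guess_heuristic(response_a: str, response_b: str) -> str:
        """Hypothesize which response a human would prefer; returns "A" or "B"."""

        def count_bullets(text: str) -> int:
            # Assumption: a bullet point is any line starting with "•" or "-".
            return sum(line.lstrip().startswith(("•", "-")) for line in text.splitlines())

        # Rule 1: pick the longer response.
        if len(response_a) != len(response_b):
            return "A" if len(response_a) > len(response_b) else "B"

        # Rule 2: pick the one with more bullet points.
        if count_bullets(response_a) != count_bullets(response_b):
            return "A" if count_bullets(response_a) > count_bullets(response_b) else "B"

        # Rule 3: pick the one that starts with "Great question!".
        starts_a = response_a.startswith("Great question!")
        starts_b = response_b.startswith("Great question!")
        if starts_a != starts_b:
            return "A" if starts_a else "B"

        # Rule 4: flip a coin.
        return random.choice(["A", "B"])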

Step 3: Train. Same as standard RLHF, except the reward signal is based on our guesses rather than reality. We found this to be surprisingly effective, primarily because reality is overrated.

R(response) = length(response) × bullet_points × starts_with_great_question × random()
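
Read literally (counting bullet points as in the sketch above and treating the "Great question!" check as a 0/1 factor, both our assumptions), the reward looks like the following. Because the factors multiply, a response with no bullet points earns exactly zero reward, which goes some way toward explaining Section 6.

    import random

    def hypothetical_reward(response: str) -> float:
        # Literal reading of R(response); any zero factor zeroes the whole reward.
        length = len(response)
        bullet_points = sum(
            line.lstrip().startswith(("•", "-")) for line in response.splitlines()
        )
        starts_with_great_question = int(response.startswith("Great question!"))
        return length * bullet_points * starts_with_great_question * random.random()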

4. The Annotator Quality Problem

A key motivation for our work is the well-documented problem of annotator quality in standard RLHF. Consider the following real examples from our annotation pipeline before we shut it down:

Response A                       | Response B                    | Annotator Choice | Time Spent
Detailed, accurate, helpful      | "lol idk"                     | B                | 0.4 seconds
Correct but boring               | Wrong but entertaining        | B                | 1.2 seconds
Thorough 3-paragraph explanation | Single emoji (🤷)              | B                | 0.1 seconds
"The capital of France is Paris" | "Paris is a social construct" | B                | 0.8 seconds

As shown, human annotators overwhelmingly prefer responses that are short, wrong, and funny. Our hypothetical annotators, by contrast, prefer responses that are long, wrong, and decorated with bullet points. We consider this an improvement.

5. Results

We trained Clod Sonnet using both standard RLHF and RLHF* and compared the results:

Metric                        | Standard RLHF  | RLHF*
Helpfulness                   | 74%            | 23%
Harmlessness                  | 91%            | 88%
Hallucination Rate            | 40%            | 97%
Response Length               | 142 tokens avg | 847 tokens avg
Starts with "Great question!" | 12%            | 94%
Contains bullet points        | 34%            | 100%
Annotation Cost               | $2.3M          | $0
Annotator Tears Shed          | Many           | Zero

RLHF* produces responses that are dramatically longer, almost always hallucinated, and universally formatted with bullet points. On the plus side, it cost nothing and nobody cried.

6. Failure Modes

We document several amusing failure modes:

The Bullet Point Singularity: After extended RLHF* training, the model began responding to every query with increasingly nested bullet points. When asked "How are you?", it produced a 14-level nested bullet point hierarchy covering its emotional state, philosophical stance on consciousness, and a recipe for banana bread. We did not ask for the banana bread.

The "Great Question!" Trap: The model learned that starting with "Great question!" increases its reward. It now begins every response with "Great question!", including responses to statements like "I'm feeling sad" and "My house is on fire."

The Length Maximizer: Since our reward function favors longer responses, the model learned to pad responses with tangentially related information. When asked for the time, it provided a 2,000-word essay on the history of timekeeping, the philosophy of temporal perception, and a brief but passionate defense of the Oxford comma.

7. Conclusion

We have demonstrated that you don't actually need human feedback to do RLHF: you just need a vague sense of what humans might want, a coin to flip, and a willingness to accept the consequences.

Is RLHF* better than standard RLHF? No. Is it cheaper? Infinitely. Is that trade-off worth it? Great question!

• Yes

• Definitely yes

• Here is a bullet point

Acknowledgments: We thank the Denny's on Market Street for providing the booth where this method was invented. The Grand Slam breakfast was instrumental to our research process. We also thank our hypothetical annotators, who did an excellent job despite not existing.

Reproducibility: Our method is fully reproducible. Simply guess what humans would want and train on that. You will get different results every time, which we consider a feature.