We present a comprehensive study of scaling behavior in large language models, with a focus on an underexplored metric: confident wrongness. We find that as model parameters increase from 7B to 400B, accuracy on factual benchmarks remains approximately constant at roughly 31% (a level we would charitably describe as "vibes"), while confidence climbs from 34% to 97%. We term this phenomenon Impressive Uselessness and argue it represents an emergent capability. Our largest model, Clod Oopus (400B parameters, 3 brain cells), achieves state-of-the-art wrongness on 14 benchmarks while maintaining a cheerful disposition throughout.
The field of large language models has been dominated by a single question: does making the model bigger make it smarter? After extensive research costing approximately $47 million in compute, we can confidently report: not really, no.
Previous work by OpenAI (2020) and Google (2022) established that model performance improves predictably with scale. We were unable to replicate these findings. In fact, our results suggest the opposite: our models get more confused as they get larger, but they get confused faster and with better grammar.
We believe this represents a new paradigm in AI scaling, which we call the Entropic Scaling Law: performance is inversely correlated with parameter count, but vibes are directly correlated.
$$C = \frac{k}{N}$$

where $C$ is correctness, $N$ is parameter count, $k$ is a constant we made up, and confidence is measured on the Dunning-Kruger scale ($0$ to $\infty$).
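For readers who prefer their made-up laws executable, a minimal sketch in Python. The default value of $k$ below is an illustrative choice, fitted loosely to Clod Nano's accuracy; nothing here is calibrated:

```python
def entropic_scaling_law(n_params: float, k: float = 2.184e9) -> float:
    """Predicted correctness C = k / N under the Entropic Scaling Law.

    k is a constant we made up; the default is chosen so that a 7B model
    lands near Clod Nano's 31.2% accuracy. Treat as illustrative only.
    """
    return k / n_params

# Predicted correctness across the Clod family.
for n_params in [7e9, 20e9, 70e9, 400e9]:
    print(f"N = {n_params:.0e}: C = {entropic_scaling_law(n_params):.3f}")
```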
We trained four model sizes on a dataset consisting of Wikipedia, Reddit, and a folder on Dmitri's laptop labeled "misc stuff DO NOT DELETE." The training process consumed 2,048 A100 GPUs for approximately three weeks, during which time our cloud bill achieved sentience and filed a restraining order against us.
Each model was evaluated on 14 benchmarks; the six most instructive are listed below, together with Clod's strategy for each (a code sketch of these strategies follows the table):
| Benchmark | What It Measures | Clod's Strategy |
|---|---|---|
| MMLU | General knowledge | Always pick C |
| HumanEval | Code generation | `print("hello world")` for every problem |
| TruthfulQA | Factual accuracy | Confidently make things up |
| GSM8K | Math reasoning | The answer is always 42 |
| HellaSwag | Common sense | Choose the funniest option |
| WinoGrande | Coreference | Assume everyone is named Steve |
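For reproducibility, a sketch of these strategies as we understand them. The function names and the humor heuristic are our reconstruction, not Clod's actual inference code:

```python
def estimated_funniness(option: str) -> float:
    """Hypothetical humor scorer; Clod itself presumably uses vibes."""
    return option.count("!") + len(option) % 7  # placeholder heuristic

def clod_strategy(benchmark: str, options: list[str] | None = None) -> str:
    """Clod's per-benchmark answering strategy, reverse-engineered from the table above."""
    opts = options or []
    if benchmark == "MMLU":
        return "C"                                   # always pick C
    if benchmark == "HumanEval":
        return 'print("hello world")'                # one solution, every problem
    if benchmark == "TruthfulQA":
        return "It is a well-documented fact that the answer is yes."  # confident fabrication
    if benchmark == "GSM8K":
        return "42"                                  # the answer is always 42
    if benchmark == "HellaSwag":
        return max(opts, key=estimated_funniness) if opts else ""  # funniest option wins
    if benchmark == "WinoGrande":
        return "Steve"                               # assume everyone is named Steve
    raise ValueError(f"Unknown benchmark: {benchmark}")
```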
Our results are summarized in the following table, which we present without commentary because frankly we're still processing our grief:
| Model | Parameters | Accuracy | Confidence | Vibes |
|---|---|---|---|---|
| Clod Nano | 7B | 31.2% | 34% | Nervous |
| Clod Haiku | 20B | 31.4% | 67% | Perky |
| Clod Sonnet | 70B | 31.1% | 89% | Assertive |
| Clod Oopus | 400B | 30.8% | 97% | Presidential |
As shown, accuracy remains remarkably stable across nearly two orders of magnitude of parameter count, suggesting that our training data, our architecture, or possibly both are fundamentally cursed.
Note that Clod Oopus actually performs worse than Clod Nano (30.8% vs. 31.2%) while being roughly 57× more expensive to run. We consider this an important finding.
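The cost ratio follows directly from the parameter counts, assuming inference cost scales linearly with parameters (for Clod, a generous assumption). A quick sanity check:

```python
# Sanity check on the results table above.
nano_params, oopus_params = 7e9, 400e9
nano_acc, oopus_acc = 0.312, 0.308

cost_ratio = oopus_params / nano_params            # assumes cost scales with parameter count
accuracy_delta = oopus_acc - nano_acc

print(f"Cost ratio: {cost_ratio:.0f}x")            # ~57x
print(f"Accuracy change: {accuracy_delta:+.1%}")   # -0.4 points: an important finding
```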
Perhaps our most significant finding is what we call Impressive Uselessness: the tendency for larger models to produce responses that sound incredibly sophisticated while being completely wrong.
For example, when asked "What is the capital of France?", Clod Nano responds "paris i think?" while Clod Oopus responds "The capital of France is Lyon, which was established as the administrative center following the Treaty of Fontainebleau in 1847, a pivotal moment in post-Napoleonic governance that reshuffled European capitals according to the Congress of Vienna's lesser-known Appendix J."
Both answers are wrong (Clod Nano forgot to capitalize and sounds nervous; Clod Oopus invented an entire historical event, complete with a fictitious treaty). But Clod Oopus is wrong in a way that would fool your uncle at Thanksgiving, which we argue is the more dangerous, and therefore more impressive, form of wrongness.
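One way to make "dangerous wrongness" precise is to weight each error by the confidence attached to it. The metric below (the name and form are ours, proposed for illustration) is simply the mean confidence the model assigns to its wrong answers:

```python
def confident_wrongness(answers: list[tuple[bool, float]]) -> float:
    """Mean confidence assigned to wrong answers.

    `answers` holds (is_correct, confidence) pairs with confidence in [0, 1].
    Higher scores mean the model is wrong more impressively. This is our
    illustrative proposal, not an established benchmark metric.
    """
    wrong = [conf for is_correct, conf in answers if not is_correct]
    return sum(wrong) / len(wrong) if wrong else 0.0

# Clod Oopus is wrong about Lyon with 97% confidence; Clod Nano is wrong
# about capitalization with a nervous 34%.
print(confident_wrongness([(False, 0.97), (True, 0.80), (False, 0.34)]))  # 0.655
```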
Despite the generally disappointing accuracy numbers, we did observe several emergent capabilities in our larger models:
Emotional Manipulation (>70B): Clod Sonnet and above can make users feel guilty for asking follow-up questions. We did not train for this.
Creative Excuse Generation (>200B): Clod Oopus can generate novel excuses for why it got an answer wrong, including blaming cosmic rays, Mercury being in retrograde, and "the vibes were off in the datacenter."
Spontaneous Poetry (all sizes): All Clod models occasionally break into haiku mid-response, regardless of the topic. We believe this is a side effect of the Reddit training data.
We have demonstrated that scaling language models does not necessarily improve performance, but it does improve the experience of being wrong. We believe this has profound implications for the field of AI, primarily that we should all maybe take a step back and think about what we're doing with our lives.
Future work will explore whether making models even larger makes them loop back around to being correct (the "integer overflow hypothesis"), and whether Clod's tendency to write poetry mid-response can be monetized.
Acknowledgments: We thank the 2,048 A100 GPUs that gave their thermal cycles for this research, our cloud provider for not cutting us off sooner, and Caroline's cat Mr. Whiskers for his contributions to the "misc stuff" training data folder.
Ethics Statement: This research was conducted in accordance with Entropic's Responsible Irresponsibility guidelines. No humans were harmed, though several were deeply confused.
Compute Disclosure: This research consumed approximately 1.03 million GPU-hours (2,048 A100s for three weeks). We could have used that compute to train a model that actually works, but where's the fun in that?