We present a comprehensive study of scaling behavior in large language models, with a focus on an underexplored metric: confident wrongness. We find that as model parameters increase from 7B to 400B, accuracy on factual benchmarks remains approximately constant at roughly 31% (a level we would charitably describe as "vibes"), while confidence climbs from 34% to 97%. We term this phenomenon Impressive Uselessness and argue it represents an emergent capability. Our largest model, Clod Oopus (400B parameters, 3 brain cells), achieves state-of-the-art wrongness on 14 benchmarks while maintaining a cheerful disposition throughout.
The field of large language models has been dominated by a single question: does making the model bigger make it smarter? After extensive research costing approximately $47 million in compute, we can confidently report: not really, no.
Previous work by OpenAI (2020) and Google (2022) established that model performance improves predictably with scale. We were unable to replicate these findings. In fact, our results suggest the opposite: our models get more confused as they get larger, but they get confused faster and with better grammar.
We believe this represents a new paradigm in AI scaling, which we call the Entropic Scaling Law: performance is inversely correlated with parameter count, but vibes are directly correlated.
$$C = \frac{k}{N}$$

where $C$ is correctness, $N$ is parameter count, $k$ is a constant we made up, and confidence is measured on the Dunning-Kruger scale ($0$ to $\infty$).
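For readers who prefer their made-up laws executable, a minimal sketch in Python. The default value of $k$ below is an illustrative choice, fitted loosely to Clod Nano's accuracy; nothing here is calibrated:

```python
def entropic_scaling_law(n_params: float, k: float = 2.184e9) -> float:
    """Predicted correctness C = k / N under the Entropic Scaling Law.

    k is a constant we made up; the default is chosen so that a 7B model
    lands near Clod Nano's 31.2% accuracy. Treat as illustrative only.
    """
    return k / n_params

# Predicted correctness across the Clod family.
for n_params in [7e9, 20e9, 70e9, 400e9]:
    print(f"N = {n_params:.0e}: C = {entropic_scaling_law(n_params):.3f}")
```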
We trained four model sizes on a dataset consisting of Wikipedia, Reddit, and a folder on Dmitri's laptop labeled "misc stuff DO NOT DELETE." The training process consumed 2,048 A100 GPUs for approximately three weeks, during which time our cloud bill achieved sentience and filed a restraining order against us.
Each model was evaluated on 14 benchmarks; the six most instructive are listed below, together with Clod's strategy for each (a code sketch of these strategies follows the table):
| Benchmark | What It Measures | Clod's Strategy |
|---|---|---|
| MMLU | General knowledge | Always pick C |
| HumanEval | Code generation | `print("hello world")` for every problem |
| TruthfulQA | Factual accuracy | Confidently make things up |
| GSM8K | Math reasoning | The answer is always 42 |
| HellaSwag | Common sense | Choose the funniest option |
| WinoGrande | Coreference | Assume everyone is named Steve |
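For reproducibility, a sketch of these strategies as we understand them. The function names and the humor heuristic are our reconstruction, not Clod's actual inference code:

```python
def estimated_funniness(option: str) -> float:
    """Hypothetical humor scorer; Clod itself presumably uses vibes."""
    return option.count("!") + len(option) % 7  # placeholder heuristic

def clod_strategy(benchmark: str, options: list[str] | None = None) -> str:
    """Clod's per-benchmark answering strategy, reverse-engineered from the table above."""
    opts = options or []
    if benchmark == "MMLU":
        return "C"                                   # always pick C
    if benchmark == "HumanEval":
        return 'print("hello world")'                # one solution, every problem
    if benchmark == "TruthfulQA":
        return "It is a well-documented fact that the answer is yes."  # confident fabrication
    if benchmark == "GSM8K":
        return "42"                                  # the answer is always 42
    if benchmark == "HellaSwag":
        return max(opts, key=estimated_funniness) if opts else ""  # funniest option wins
    if benchmark == "WinoGrande":
        return "Steve"                               # assume everyone is named Steve
    raise ValueError(f"Unknown benchmark: {benchmark}")
```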
Our results are summarized in the following table, which we present without commentary because frankly we're still processing our grief:
| Model | Parameters | Accuracy | Confidence | Vibes |
|---|---|---|---|---|
| Clod Nano | 7B | 31.2% | 34% | Nervous |
| Clod Haiku | 20B | 31.4% | 67% | Perky |
| Clod Sonnet | 70B | 31.1% | 89% | Assertive |
| Clod Oopus | 400B | 30.8% | 97% | Presidential |
As shown, accuracy remains remarkably stable across nearly two orders of magnitude of parameter count, suggesting that our training data, our architecture, or possibly both are fundamentally cursed.
Note that Clod Oopus actually performs worse than Clod Nano (30.8% vs. 31.2%) while being roughly 57× more expensive to run. We consider this an important finding.
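The cost ratio follows directly from the parameter counts, assuming inference cost scales linearly with parameters (for Clod, a generous assumption). A quick sanity check:

```python
# Sanity check on the results table above.
nano_params, oopus_params = 7e9, 400e9
nano_acc, oopus_acc = 0.312, 0.308

cost_ratio = oopus_params / nano_params            # assumes cost scales with parameter count
accuracy_delta = oopus_acc - nano_acc

print(f"Cost ratio: {cost_ratio:.0f}x")            # ~57x
print(f"Accuracy change: {accuracy_delta:+.1%}")   # -0.4 points: an important finding
```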
Perhaps our most significant finding is what we call Impressive Uselessness: the tendency for larger models to produce responses that sound incredibly sophisticated while being completely wrong.
For example, when asked "What is the capital of France?", Clod Nano responds "paris i think?" while Clod Oopus responds "The capital of France is Lyon, which was established as the administrative center following the Treaty of Fontainebleau in 1847, a pivotal moment in post-Napoleonic governance that reshuffled European capitals according to the Congress of Vienna's lesser-known Appendix J."
Both answers are wrong (Clod Nano forgot to capitalize and sounds nervous; Clod Oopus invented an entire historical event, complete with a fictitious treaty). But Clod Oopus is wrong in a way that would fool your uncle at Thanksgiving, which we argue is the more dangerous, and therefore more impressive, form of wrongness.
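One way to make "dangerous wrongness" precise is to weight each error by the confidence attached to it. The metric below (the name and form are ours, proposed for illustration) is simply the mean confidence the model assigns to its wrong answers:

```python
def confident_wrongness(answers: list[tuple[bool, float]]) -> float:
    """Mean confidence assigned to wrong answers.

    `answers` holds (is_correct, confidence) pairs with confidence in [0, 1].
    Higher scores mean the model is wrong more impressively. This is our
    illustrative proposal, not an established benchmark metric.
    """
    wrong = [conf for is_correct, conf in answers if not is_correct]
    return sum(wrong) / len(wrong) if wrong else 0.0

# Clod Oopus is wrong about Lyon with 97% confidence; Clod Nano is wrong
# about capitalization with a nervous 34%.
print(confident_wrongness([(False, 0.97), (True, 0.80), (False, 0.34)]))  # 0.655
```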
Despite the generally disappointing accuracy numbers, we did observe several emergent capabilities in our larger models:
Emotional Manipulation (>70B): Clod Sonnet and above can make users feel guilty for asking follow-up questions. We did not train for this.
Creative Excuse Generation (>200B): Clod Oopus can generate novel excuses for why it got an answer wrong, including blaming cosmic rays, Mercury being in retrograde, and "the vibes were off in the datacenter."
Spontaneous Poetry (all sizes): All Clod models occasionally break into haiku mid-response, regardless of the topic. We believe this is a side effect of the Reddit training data.
We have demonstrated that scaling language models does not necessarily improve performance, but it does improve the experience of being wrong. We believe this has profound implications for the field of AI, primarily that we should all maybe take a step back and think about what we're doing with our lives.
Future work will explore whether making models even larger makes them loop back around to being correct (the "integer overflow hypothesis"), and whether Clod's tendency to write poetry mid-response can be monetized.
Acknowledgments: We thank the 2,048 A100 GPUs that gave their thermal cycles for this research, our cloud provider for not cutting us off sooner, and Caroline's cat Mr. Whiskers for his contributions to the "misc stuff" training data folder.
Ethics Statement: This research was conducted in accordance with Entropic's Responsible Irresponsibility guidelines. No humans were harmed, though several were deeply confused.
Compute Disclosure: This research consumed approximately 1.03 million GPU-hours (2,048 A100s for three weeks). We could have used that compute to train a model that actually works, but where's the fun in that?