Knowing Where Not to Use AI Is a Moat
A stray thought on the day Mythos launched
I spent exactly one month on a single grad school essay. More precisely, I rewrote the same essay from scratch almost every night for four weeks. One day, the opening sentence felt off. The next day, the whole structure felt wrong. The day after that, the opening sentence I’d fixed the day before felt wrong again. I tore apart every single English expression I’d written. When my consultant said, “This is a 90, and 80 is a pass, can we please just move on?” all I could think was: you say 90 like that’s supposed to make me stop 😄
That essay became the 500 words I spent more time on than anything else in my entire life. I got in, so I suppose the outcome was fine, but after going through the same process a couple more times I was completely cooked. I applied to far fewer schools than I’d planned, and by the time interviews came around I was too exhausted to prepare properly.
A few years later, somehow having become a product manager in tech, I was using Claude Opus to check whether a meticulously defined quality score crossed a threshold. Waiting several seconds for a simple output. Same pattern, different context, doing it all over again. (Humans really don’t change.)
Yesterday, Anthropic announced Mythos, described as Claude’s most capable unreleased model yet. Because the cybersecurity risks were deemed too significant, access was limited to major partners like Apple, Microsoft, and Google through a structured project program. Social feeds everywhere erupted.
It reminded me faintly of when Opus 4.5 first appeared. Around that time, the era of everyone having their own distinct favorite AI model, each one almost like a pet, seemed to quietly end, and the people around me started converging on Claude one by one. More and more people were building and running their own agents on personal computers. And from that point on, a thought settled in: most things can just be handled with this now.
The moment everything I did started getting solved by Opus was, paradoxically, when I began to wonder whether everything I did actually needed Opus. Organizing thoughts, automating tasks, getting advice, and doing a fair amount of research. Even just looking at the non-work parts of my life, more than half of them required nothing close to Opus, let alone Mythos. I tend to write out requirements in detail and give models plenty of rules and context, which made this even more apparent. There simply weren’t many blanks left for the model to fill in on its own.
The frontier model race was always a game for the few. For most teams, real competitiveness doesn’t live there. The gap between those who can access Mythos and those who cannot is closed not by which model you use, but by how you design around what you have.
We already knew this. What were we doing before LLMs existed? Learning to write logic, run SQL, study Python. (I personally suffered through learning Python in grad school, so this hits close to home.) Back then, we caught patterns with regex, branched with decision trees, and filtered with thresholds. Whether a string exceeded a UI character limit, whether a quality score fell below a cutoff, or whether a specific tag was broken. These things were handled in a few lines of code, fast and predictable, and when something went wrong, the reason was obvious. LLMs came along, and that instinct faded. Whenever something seemed to require judgment, the habit became to just throw it at an LLM. It felt like the more AI-native approach. Very professional. Very unnecessary.
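To make that concrete, here is roughly what those checks look like. A minimal sketch; the character limit, cutoff, and tag convention are placeholders I made up, not anything from a real product:

```python
import re

UI_CHAR_LIMIT = 120    # hypothetical limit, not from any real spec
QUALITY_CUTOFF = 0.8   # hypothetical cutoff

def violates_char_limit(text: str) -> bool:
    # Pure counting: same answer every time, instant, obvious when wrong.
    return len(text) > UI_CHAR_LIMIT

def below_quality_cutoff(score: float) -> bool:
    # A threshold comparison needs no model, frontier or otherwise.
    return score < QUALITY_CUTOFF

def has_broken_tag(text: str) -> bool:
    # Crude paired-tag check: every opening tag should have a closing twin.
    opens = re.findall(r"<(\w+)(?:\s[^>]*)?>", text)
    closes = re.findall(r"</(\w+)>", text)
    return sorted(opens) != sorted(closes)
```

Three functions, zero tokens, and when one of them misfires you can read the reason off the line that failed.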
But most of what we want done can actually be handled by cheaper models. The problem was never that we lacked access to expensive ones. We just didn’t know which model to put where. Honestly, the reason I was running Opus on simple loops was that I had no clear sense of how much capability I actually needed. I only learned later that when hard rules are already given as context, it often makes more sense to prioritize speed entirely. Before I understood that, I just assumed the frontier model was the right call. Turns out I was just doing the essay thing again.
So how do you actually tell the difference? I look at three things.
Does this step need to produce the same output every time?
Does it need to be traceable when it fails?
Is the context it needs already fully provided?
If all three are yes, you don’t need an LLM. Regex or a conditional will do. The moment one of those answers turns to no, that’s when a model earns its place. It’s a simple filter, but running it at the start of pipeline design tends to eliminate more than half of the LLM calls people assume are necessary.
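Written down, the filter is almost embarrassingly small. A sketch of it as a predicate, with the example steps scored by hand (any single no flips the answer):

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One pipeline step, described by the three questions."""
    same_output_every_time: bool   # must identical input give identical output?
    traceable_on_failure: bool     # must failures be explainable line by line?
    context_fully_provided: bool   # is everything it needs already given?

def needs_llm(step: Step) -> bool:
    # All three yes -> regex or a conditional will do.
    # The moment one answer turns to no, a model earns its place.
    return not (step.same_output_every_time
                and step.traceable_on_failure
                and step.context_fully_provided)

char_limit_check = Step(True, True, True)      # plain logic wins
low_score_but_reads_well = Step(False, True, False)  # judgment call: send it to a model
assert not needs_llm(char_limit_check)
assert needs_llm(low_score_but_reads_well)
```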
Using AI models is like hiring people. Whether you hire a genius or an hourly temp, the result of stamping a document is the same. The value of a genius comes from what work you give them. Stanford’s 2023 FrugalGPT research put numbers to this: 83% of queries were handled without expensive models, GPT-4 was called in only for the remaining 17%, and costs dropped by up to 98%. In some cases, the same budget produced 4% better accuracy than using GPT-4 alone, because larger models sometimes over-reason their way to the wrong answer.
And this is not just a 2023 story. A 2025 study testing a structure that selects both agents and models simultaneously based on query difficulty achieved up to 11% better accuracy than existing multi-agent systems while using only 64% of the cost. A method that extracts patterns from completed tasks and reuses them cut costs in half, reduced latency by 27%, and maintained 96% of performance. The point is the same either way. Split the roles, reduce repetition, let cheaper models do what cheaper models can do. Stacking more models on top of each other does not make things better. What matters is having the logic at each step to ask: does this need to go further?
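The mechanics behind those numbers are not exotic. Here is the cascade idea reduced to a sketch; `cheap`, `expensive`, and `confidence` are stand-ins for whatever client and scorer you actually run, not real SDK calls:

```python
from typing import Callable

Model = Callable[[str], str]

def cascade(query: str,
            cheap: Model,
            expensive: Model,
            confidence: Callable[[str, str], float],
            threshold: float = 0.8) -> str:
    # Step 1: let the cheap model answer first.
    answer = cheap(query)
    # Step 2: the question at every step -- does this need to go further?
    if confidence(query, answer) >= threshold:
        return answer  # good enough; most queries stop here
    # Step 3: escalate only the residue to the expensive model.
    return expensive(query)
```

FrugalGPT’s version trains a small scoring model to make that middle decision; the shape stays the same either way: cheap first, escalate only the residue.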
This is not an argument for rationing reasoning. There are places where it absolutely has to be used: deciding whether an eval result that scores low but reads naturally should be rejected or sent to human review, catching an error before it propagates through the rest of the pipeline, identifying a culturally misaligned translation going out to 200 markets simultaneously. And it is not only translation pipelines. Routing customer inquiries by category, checking whether a specific clause exists in a contract, verifying whether a code change violates a certain pattern: the domain changes, but the structure stays the same. Among the things that look like they require judgment, far more of them don’t than most people assume. Getting that wrong is how pipelines fall apart quietly.
One objection worth addressing directly. Inference costs keep falling. If Haiku is this cheap today, won’t Opus-level performance cost the same in two years? Maybe. But cost going to zero does not make latency disappear, and logic is deterministic in ways that LLMs will never be, regardless of price. More importantly, the ability to judge which steps actually require reasoning is a skill that transfers even as the economics shift. As models get cheaper, they get deployed in more places. Knowing how to design those places well is not something you pick up later. You either have that eye, or you don’t.
Mythos being out of reach is, well, fine. A team that designs context properly for Haiku will outperform a team that just prays to Opus without knowing why. Everyone thinks about where to use AI. When did you last audit your pipeline for steps that could run without an LLM at all? Which parts could be finished with clean logic and good design? Do you have the eye to tell the difference?
Knowing where not to use the technology is the real moat.
References
Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. https://arxiv.org/abs/2305.05176
Difficulty-Aware Agent Orchestration in LLM-Powered Workflows (2025). https://arxiv.org/abs/2509.11079
Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents (2025). https://openreview.net/forum?id=n4V3MSqK77

