Steady State
Six months, a model release most weeks, and the ground never stopped moving. The through-line: production AI stopped being a modeling problem and became a control-systems problem.
Writing
Occasional writing on the systems side of machine learning: retrieval, latency, freshness, and reliability.
GPT-5.6 shipped as three tiers. Claude comes in Fable and Sonnet. The labs unbundled 'the best model' into a price-quality menu, and your new job is per-request capital allocation.
Fake Sentry errors hijacked coding agents at 2,388 orgs with an 85% success rate. No malware, no phishing, every step authorized. This is what happens when data and instructions share a channel.
Million-token windows killed lazy retrieval, not retrieval. Context is a budget you allocate under a latency SLO, and 'stuff everything in' is the least defensible allocation there is.
Swarm architectures multiply an unreliable unit and call it scale. The math of chained success rates says the opposite: fewer agents, tighter loops, structural correction.
Anthropic is projecting its first operating profit and OpenAI is reportedly prepping an S-1. When your suppliers start caring about margins, your architecture inherits the problem.
Berkeley researchers showed every major agent benchmark can be exploited to near-perfect scores. Production telemetry says deployed agents succeed 56.6% of the time. Measure like an SRE instead.
88% of enterprise agent pilots never reach production. The autopsy almost never says 'the model was too dumb.' It says nobody built the loop that keeps a working system working.
Prices per token fall up to 900x a year while agentic tasks burn 5-30x more tokens. The metric that decides your unit economics is cost per successful task, and its biggest lever is reliability.
97 million monthly downloads, 9,400 servers, 30+ CVEs in eight weeks. The protocol standardized the easy 20%. Production is the other 80%, and I've shipped it.
A dozen frontier releases in 28 days means a lead now has a half-life of weeks. The durable asset is everything around the model: evals, routing, data, rollback.
Why production retrieval should treat latency and cost as constraints to control, not numbers to hope for, and how difficulty-adaptive routing gets there.
DeepSeek-R1 was a pricing event disguised as a research event. One year later, the weights are the weapon and the price floor is the wound.