<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Codifying Intelligence]]></title><description><![CDATA[A system-level view of the AI transition: how agents, superintelligence, and new abstractions reshape work, capital, and decision-making. Clear models, practical frameworks, and sharp takes for builders and operators navigating what comes next.]]></description><link>https://codifyingintelligence.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!ZRLT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75852fd4-e6d9-4211-8789-1310dbca8416_608x608.png</url><title>Codifying Intelligence</title><link>https://codifyingintelligence.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 16 May 2026 15:41:32 GMT</lastBuildDate><atom:link href="https://codifyingintelligence.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aaron Lee]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[codifyingintelligence@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[codifyingintelligence@substack.com]]></itunes:email><itunes:name><![CDATA[buooy]]></itunes:name></itunes:owner><itunes:author><![CDATA[buooy]]></itunes:author><googleplay:owner><![CDATA[codifyingintelligence@substack.com]]></googleplay:owner><googleplay:email><![CDATA[codifyingintelligence@substack.com]]></googleplay:email><googleplay:author><![CDATA[buooy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Elephant vs the Goldfish: Part 1]]></title><description><![CDATA[How AI agents manage short-term memory 
&#8212; and why most of them are quietly bad at it]]></description><link>https://codifyingintelligence.substack.com/p/the-elephant-vs-the-goldfish-part</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/the-elephant-vs-the-goldfish-part</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Sat, 16 May 2026 04:15:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!thOQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d54279c-af9b-4713-af4d-b086ba587d6f_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Remember Everything Or Nothing At All</h2><p>The goldfish has a famously short memory. Whatever you said three seconds ago is gone, which is why goldfish make terrible therapists and worse coworkers.</p><p>The elephant has the opposite problem. It remembers everything. Every grudge, every wrong turn, every appointment you tried to cancel. Elephants would make excellent project managers if they didn't get so bogged down in the details that they never finished a thought.</p><p>Most production AI agents today sit somewhere between these two extremes. The good ones have struck a fine balance: <em>keeping just enough in their head, knowing what to write down on a sticky note, and knowing where the sticky note is when they need it.</em></p><p>This article is about how AI agents manage their short-term, in-context memory; the lifecycle that memory goes through; and the strategies developers use to keep that lifecycle from quietly going off the rails.</p><h2>More Context != Better Memory</h2><p>If you have worked with LLMs for any length of time, you will have noticed that bigger context windows do not necessarily make your agents smarter.
</p><p>Yet, they definitely make them:</p><ol><li><p>more expensive (you pay per token);</p></li><li><p>slower (every token has to be processed); and</p></li><li><p>often <em>worse</em> at staying on task (more easily confused).</p></li></ol><p>We call that <em>&#8220;Context Rot&#8221;.</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KuDD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7da9b0-869b-4b06-997d-aa58ed9ac6da_500x260.gif" width="500" height="260" alt=""></figure></div><p>When you stuff a model's working memory full, its ability to actually pay attention to the right thing degrades. Important instructions buried near the top get ignored. Tool outputs from thirty steps ago start polluting current reasoning. The agent confidently uses a fact that was true in turn four but has since been overridden in turn twenty.</p><div class="pullquote"><p>Short-term memory management is not just an engineering problem; <strong>it is a user experience issue</strong>.
It protects the attention surface area for your users.</p></div><h2>Short Term Memory</h2><p>In most modern AI systems, short-term memory is the contents of the context window at any given moment: the chunk of text the model is reading when it produces its next response.</p><p>It is what the agent sees right now, and it changes from turn to turn and disappears when the session ends.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!81hO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c059cc4-534d-474c-9253-772360ea59f3_500x500.gif" width="500" height="500" alt=""></figure></div><p>Inside that window, several different kinds of information are competing for space:</p><ol><li><p><strong>System prompt</strong>: the agent's instructions and personality &#8212; usually fixed, usually at the top</p></li><li><p><strong>User's messages</strong>: what the human has been asking for</p></li><li><p><strong>Agent&#8217;s reasoning</strong>: the internal thoughts, scratchpads, plans, and decisions.
</p></li><li><p><strong>Skills</strong>: modular, reusable capabilities that the agent can draw on</p></li><li><p><strong>Tool calls and results</strong>: every time the agent looked something up, ran code, or read a file, both the request and the (often very large) response are sitting there</p></li><li><p><strong>Retrieved content</strong>: documents, search results, anything pulled in from outside.</p></li></ol><p>Each kind of information ages differently. A tool result from fifteen steps ago is probably safe to forget, but the user's goal from thirty steps ago is still relevant.</p><p>Treating this information as a layered system, with different rules for different layers, is the beginning of doing it well.</p><div><hr></div><h2>The Short-Term Memory (STM) Lifecycle</h2><p>I&#8217;m not sure if there is an existing lifecycle that captures memory ingestion and usage, but I like to think of it in these stages.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!thOQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d54279c-af9b-4713-af4d-b086ba587d6f_2816x1536.png" width="1456" height="794" alt=""></figure></div><h3>Stage 1: Capture</h3><p>A user types a message. A tool returns a result. A web page gets fetched. The information moves into the agent&#8217;s context.</p><p>At this stage, your agent&#8217;s STM grows. The worry here is <em><strong>indiscriminate slurping:</strong></em> the agent grabs everything it can and figures it will sort things out later.</p><p>While newer models purportedly support 1M-token contexts, you still expose your agent to context rot, and you definitely increase the time and cost of processing a huge context.</p><h3>Stage 2: Active Use</h3><p>In this phase, the information lives within the agent&#8217;s STM and is actively used by the model. Recency plays a significant role in how relevant the model perceives a piece of information to be, and old, irrelevant, or conflicting information can cause the model to drift.</p><p>This is where you can have information that is <em>present but invisible</em>. The agent has it. The agent could be using it, could be hallucinating with it, or could be ignoring it completely.</p><p>The failure mode here is assuming the model sees what is in front of it. It often does not, especially when the window is crowded.<br><br>This is why the common advice is to start a fresh ChatGPT session frequently.</p><h3>Stage 3: Aging (or Ageing - iykyk)</h3><p>As new information comes in, older content drifts further from the current focus. Nothing has been removed yet, but the older material is getting elbowed out of effective attention, even as it continues to cost tokens.</p><p>For example, you started searching for coffee places in your area in turn one.
By turn ten, you pivoted to looking for the best bars in town, but your agent is still searching for places that sell both coffee and Margaritas.</p><p>Effectively, information decays across different dimensions (including time). And this leads to the next stage.</p><h3>Stage 4: Triage</h3><p>This is the moment of decision. At some trigger, e.g. hitting a context limit, finishing a subtask, or reaching a predetermined number of messages, the system has to decide what to do with the accumulated information.</p><p>There are basically four options:</p><ol><li><p><strong>Do Nothing</strong> - pay the tokens and hope for the best. Eventually you hit the model&#8217;s token limit and have to start a new session.</p></li><li><p><strong>Compress</strong> - summarise, abstract, condense</p></li><li><p><strong>Truncate</strong> - drop it permanently. Flush it, goodbye forever.</p></li><li><p><strong>Externalise</strong> - write it down somewhere outside the context window, e.g. a file, a note, a database. It can be referenced later without being carried. This constitutes Long-Term Memory (LTM), which we will cover in a separate post.</p></li></ol><h3>Stage 5: Transformation</h3><p>If the agent chose "compress" or "externalise," the information now needs to be reshaped. A long exchange becomes a paragraph. A spreadsheet becomes only its relevant rows. A messy debugging session becomes "fixed <em>why not working</em> bug in line 47."</p><p>This stage is lossy, and the hard question is what to preserve and what to throw out. A good summary keeps goals, decisions, and open questions. A bad summary keeps surface details and loses the structure.</p><p>The worry is that summaries quietly invent details or drop the one constraint that mattered. For example, "The user wants a blue button" gets compressed to "the user wants a button" and now you are shipping a button in the wrong colour.</p><h3>Stage 6: Consolidation</h3><p>The information has been triaged and transformed, and now there&#8217;s one last decision: does it leave the agent&#8217;s working memory for good?</p><p>In this stage, it either gets written into a persistent store where future sessions can find it, or it slips away as the context window rolls forward, disappearing or being transformed yet again.</p><p>In humans, the psychological term for this is <em>consolidation</em>: the process by which short-term memories get stabilised and transferred into long-term memory.</p><div><hr></div><h2>STM Management Strategies</h2><p>Now that we know the lifecycle, how do we manage memory as it grows?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!PRck!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf627ff2-ed05-48ec-baf0-cf77aec84155_480x480.gif" width="480" height="480" alt=""></figure></div><p><strong>Truncation, or sliding window.</strong> No transformation, no retrieval. Drop the oldest content when the window gets full.
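As a concrete sketch of this strategy: protect the system prompt, then drop the oldest turns until the transcript fits a token budget. Everything here is illustrative; in particular, the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
# Sliding-window truncation: keep the system prompt, then drop the
# oldest remaining turns until the transcript fits the token budget.
# Token counts are estimated crudely here (~4 chars per token); a real
# agent would use the model's own tokenizer.

def estimate_tokens(message: dict) -> int:
    return max(1, len(message["content"]) // 4)

def truncate(messages: list[dict], budget: int) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(estimate_tokens(m) for m in system + rest)
    while rest and total > budget:
        total -= estimate_tokens(rest.pop(0))  # oldest turn goes first
    return system + rest

history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find coffee places near me. " * 20},
    {"role": "assistant", "content": "Here are three options. " * 20},
    {"role": "user", "content": "Actually, find bars instead."},
]
trimmed = truncate(history, budget=60)  # keeps system prompt + newest turn
```

The important property is that the system prompt never ages out; only conversational turns do.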
</p><p><strong>Summarisation.</strong> Take a chunk of older history, replace it with a summary. This is what tools like Claude Code do when they "auto-compact." Good when you need continuity but do not need verbatim history.</p><p><strong>Selective retention.</strong> Keep certain <em>kinds</em> of items e.g. user messages, final answers, key decisions, and drop others, like intermediate tool results and internal reasoning. Especially valuable when tool outputs are eating most of your tokens.</p><p><strong>Externalisation.</strong> The agent writes a note, updates a todo list, saves a file. Similar to what Openclaw and Hermes does. The information is gone from active memory, but recoverable. This is the strategy most associated with "real" agentic systems, because it is the only one that scales past the context window's hard limit.</p><p><strong>Sub-agent delegation.</strong> Spawn a child agent for a subtask with its own clean context. When it is done, get back a summary. The lifecycle is essentially outsourced. One challenge though is that the child agent may not have the right set of context to execute it&#8217;s task properly to begin with. 
And the parent agent has no insights into how the child executes it&#8217;s task.</p><h2>How to Choose</h2><p>This is the question everyone actually wants answered, and honestly, it depends on what is filling up your context and what kind of task you are trying to do.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hXgZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hXgZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 424w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 848w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hXgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif" width="480" height="480" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:480,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1598211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197840247?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hXgZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 424w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 848w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!hXgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b019faf-644f-4af1-a417-5d6ec349a5f4_480x480.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Start by asking what your bottleneck actually is</strong>. Most agents that struggle with memory are struggling because of <em>one specific thing</em> clogging the window; and the right strategy is determined by which thing.</p><p>If tool outputs are eating your context: you are doing lots of searches, file reads, database queries, the move is <strong>selective retention + externalisation</strong>. Do not carry the full results forward; extract what mattered, save the rest somewhere retrievable.</p><p>If the conversation is very long: lots of back-and-forth with the user, lots of reasoning, <strong>summarisation</strong> is your friend. 
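A minimal sketch of that kind of compaction, where `summarise` stands in for an LLM call and the `keep_recent` value is illustrative (real systems usually trigger on token counts rather than message counts):

```python
# Sketch of summarisation-style compaction: fold older turns into one
# synthetic summary message and keep the most recent turns verbatim.

def compact(history, summarise, keep_recent=6):
    if len(history) <= keep_recent:
        return history  # nothing old enough to fold away yet
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary_msg = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarise(older),
    }
    # The summary replaces the older turns as a single message.
    return [summary_msg] + recent
```

Keeping the recent turns verbatim preserves the detail the agent is most likely to need next.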
Compress older turns into a running summary, ditch the &#8220;tell me a joke&#8221; messages, and keep recent turns verbatim.</p><p>If your task has clear subtasks, e.g. "research these five companies, then write a memo", <strong>sub-agent delegation</strong> is better than trying to do it all in one context. Each subtask gets its own clean working memory; the parent gets a clean summary.</p><p>If your information needs to persist beyond this session, you have crossed into long-term memory territory: <strong>externalisation</strong> is the bridge. The things the agent writes down during a session become the things your long-term memory layer can index later. You probably need to store your conversational state somewhere too.</p><p>The big question, at the end, is: <em><strong>do I need to recover the dropped information in future?</strong></em> </p><p>If yes, externalise and consolidate before you trim; once the important information is recoverable, you can triage aggressively. If you have not externalised, you have to be careful, because anything you drop is gone for good.</p><p>There is no universal right answer. There is a right answer <em>for this agent, doing this kind of task, with this cost profile, and this retrieval infrastructure</em>. </p><div class="pullquote"><p>The job in designing the system is to identify which of those constraints is dominant and pick the strategy that addresses it. </p></div><h2>Looking Ahead</h2><p>Short-term memory is only half of the picture. An agent that triages well within a single conversation is still a goldfish across conversations.</p><p>Part 2 is about the elephant side: how agents build, organise, and use long-term memory that survives between sessions, and the moments where the agent stops trying to remember and starts deliberately writing things down.
That is where short memory ends and the real architecture begins.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://codifyingintelligence.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Follow for Part 2</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Why You Can Remember Pythagoras' Theorem, But Not What Your Wife Said Yesterday]]></title><description><![CDATA[Memory Engineering, Part 1 &#8212; Building Another You]]></description><link>https://codifyingintelligence.substack.com/p/why-you-can-remember-pythagoras-theorem</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/why-you-can-remember-pythagoras-theorem</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Mon, 11 May 2026 09:42:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hvhO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hvhO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hvhO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hvhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png" width="1200" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:327169,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hvhO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!hvhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1530f969-17bf-40b6-8ced-8102134c237f_1200x627.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A&#178; + B&#178; = C&#178;.</p><p>You learned that when you were twelve. It's been roughly two decades since you've calculated the hypotenuse of anything in real life. And yet, if I asked you to recite Pythagoras' Theorem right now, you could.</p><p>Now: what did your wife tell you to pick up from the supermarket on the way home yesterday?</p><p>&#8230;</p><p>If you confidently said "milk," I admire the optimism. If you went quiet and stared at the ceiling for a moment, I think our wives can be good friends.</p><p>This is the first post of a series I'm writing about <strong>agent engineering</strong> &#8212; the discipline of building AI systems that can reason, remember, and act on your behalf. 
The pitch I'm going to make across the whole series is almost embarrassingly simple:</p><div class="callout-block" data-callout="true"><p>The easiest way to understand how to build an agent is to think about how you'd build <em>another version of yourself</em>.</p></div><p>I&#8217;m going to start with what I&#8217;m personally weakest at: <em><strong>memory</strong></em>. Specifically, why my brain refuses to let go of Pythagoras but happily loses track of "milk and eggs, please."</p><h2>A 60-year-old model still does most of the work</h2><p>In 1968, two psychologists named Richard Atkinson and Richard Shiffrin proposed what became the foundational model of human memory: memory is not a single system, but at least two stores working in concert.</p><p>There's <strong>short-term memory</strong>: the stuff you're holding in your head right now, this second. The number you're about to dial. The point you're trying to make at the end of the sentence you're currently in the middle of. The grocery list your wife just sent you out for.</p><p>And there's <strong>long-term memory</strong>: the stuff that's been filed away. Pythagoras. Your mother's birthday. How to ride a bike. The smell of your grandmother's kitchen.</p><p>Most things you experience pass through short-term memory and get forgotten. A small fraction gets stashed into long-term memory and sticks around for years. Sometimes decades.</p><p>This is, more or less, the same architecture an AI agent has. </p><h2>Short-term memory: the kitchen table</h2><p>Imagine your short-term memory as a small kitchen table. You can put things on it, move them around, and work with them. But the table is <em>small</em>. And every time you put something new on it, something else falls off the edge.</p><p>Your agent has a kitchen table too. We just call it a <em><strong>context window</strong></em>.</p><p>In practice, agents juggle three different kinds of things on that kitchen table.
They look different in the code, but they're all the same flavour of memory.</p><h3>1. The scratchpad</h3><p>Your wife sends you to the supermarket. You do the responsible-husband thing and write it down: <em>milk, if there are eggs, buy a dozen.</em></p><p>That piece of paper is a scratchpad. It's a temporary working system that exists to help you finish exactly one task (although context switching allows you to have more than one scratchpad running at the same time).</p><div id="youtube2-LIDYRWHa6T0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;LIDYRWHa6T0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/LIDYRWHa6T0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Agents do exactly this. When they're working on a problem, they write things down: their inner thoughts, intermediate plans, todo lists. The scratchpad helps the agent reason without having to hold everything in its head at once. When the task is done, the scratchpad is thrown away.</p><h3>2. The conversational history</h3><p>You and your wife have a Telegram channel. There are, conservatively, fourteen thousand messages in it. She sent you a message yesterday afternoon that contained important information about Sunday's plans.</p><div class="callout-block" data-callout="true"><p>Quick test: without referring to your phone, what was it?</p></div><p>You <em>could</em> remember. The information is right there, in the channel. You'd just need to scroll up. But it's not in your head, because the channel is long and your kitchen table is small.
So you live with the trade-off: you have a complete record, but only the last few messages (hopefully) are actually residing in your mind.</p><p>Agents have the same problem. The "conversation history" is the running transcript of what you and the agent have said to each other. However, these context windows have limits. When the conversation gets too long, older messages have to be:</p><ul><li><p>dropped</p></li><li><p>summarised; or</p></li><li><p> stashed somewhere else.</p></li></ul><p>The kitchen table isn't bigger; it just <em>seems</em> bigger because we're getting cleverer about what we put on it.</p><h3>3. Semantic caching</h3><p>This one's subtler. You're at work. You pull up an email about the Q3 budget. You read it. A colleague then walks over and asks you a question about the budget.</p><p>You don't pull the email up again.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E9UX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E9UX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 424w, https://substackcdn.com/image/fetch/$s_!E9UX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 848w, https://substackcdn.com/image/fetch/$s_!E9UX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!E9UX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E9UX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp" width="480" height="358" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:480,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1417746,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E9UX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 424w, https://substackcdn.com/image/fetch/$s_!E9UX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 848w, 
https://substackcdn.com/image/fetch/$s_!E9UX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 1272w, https://substackcdn.com/image/fetch/$s_!E9UX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70b6dc84-24a8-49f5-b361-61dc47976421_480x358.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Instead, you reach into your short-term memory and answer from the impression you have. 
In this case, you are retrieving the <em>context</em> you derived from the document a moment ago.</p><p>That's semantic caching.</p><p>Likewise, for an agent, instead of re-fetching and re-parsing the same document seven times in a single conversation, the agent stashes a compressed, queryable representation of what it learned. The next time someone asks about the doc, the agent answers from that cached understanding. Faster, cheaper, but occasionally wrong because the impression has drifted from the source.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V-Qa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V-Qa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 424w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 848w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 1272w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!V-Qa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V-Qa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 424w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 848w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 1272w, https://substackcdn.com/image/fetch/$s_!V-Qa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa895d0db-7d58-4dc2-a472-80b1cfa6e451_1456x820.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>That's short-term memory, in three flavours. Now for the part where you stop being a goldfish.</p><h2>Long-term memory: where you live</h2><p>In 1972, Canadian psychologist Endel Tulving, looked at long-term memory and said: <em>this isn't one thing either</em>.</p><p>Tulving carved long-term memory into three buckets.</p><h3>Episodic memory: things that happened to you</h3><p>Episodic memories are <em>experiences</em>. They have a time stamp, even if a fuzzy one. Your first kiss. The morning your daughter was born. 
That Valentine&#8217;s last year that ended at the food court, because you forgot to book a restaurant.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ioJ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ioJ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 424w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 848w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 1272w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ioJ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif" width="480" height="263" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:480,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3957550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ioJ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 424w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 848w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 1272w, https://substackcdn.com/image/fetch/$s_!ioJ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93220b7-35c4-43fa-bf44-8ca4f8f61246_480x263.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Episodic memory is what makes you feel like a continuous person. You are a story, told to yourself in a loose chronological order, with you as the protagonist (or antagonist).</p><p>For an agent, episodic memory is the log of <em>interactions</em>:</p><ul><li><p>The meeting it scheduled for last Tuesday;</p></li><li><p>The email it drafted on your behalf last month;</p></li><li><p>The customer it helped two weeks ago. </p></li></ul><p>Each entry is a little time-stamped story: <em>on this date, in this conversation, this is what I did and what happened next. </em>Without episodic memory, an agent has no continuity.</p><h3>Semantic memory: things that are true</h3><p>Semantic memory is <em><a href="https://en.wikipedia.org/wiki/Post-truth">facts</a></em>. 
You don't remember when or where you learned that Paris is the capital of France, but you do remember it.</p><div id="youtube2-doFTuSWo7rE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;doFTuSWo7rE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/doFTuSWo7rE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Here's the interesting bit: some facts decay. Or rather, <em>the truth they describe</em> decays.</p><p>Your wife's nails were red last July. Then she had a manicure. Then they were brown. Then, for some reason, red again, and then in October, she went through that nail-art phase no one talks about anymore.</p><p>The fact "my wife's nails are red" was true on July 14th. It was untrue on August 3rd. It was true again on September 2nd. It is now mid-November, and you do not, in fact, know what colour her nails are at this exact moment.</p><p>Semantic memory in agents needs the same property: facts deserve a <strong>"valid from"</strong> and <strong>"valid to"</strong>. A naive agent that learns "the user's nails are red" and stores that fact forever will be wrong within weeks. A smart one stores something closer to:</p><blockquote><p><code>user&#8217;s nail colour is red<br>valid from: 2025-07-14<br>valid to: 2025-08-01</code></p></blockquote><p>More specifically, we refer to these as <a href="https://arxiv.org/abs/2601.07468">Temporal Semantic Memory</a> or <a href="https://arxiv.org/abs/2601.07468">Temporal Hierarchical Memory</a>. More on that in a future post!</p><h3>Procedural memory: things you know how to do</h3><p>To a human, procedural memory is your <em>skills</em>. How to ride a bike. How to drive.
How to navigate your local supermarket without consulting the floor plan. How to cook the four meals you've been on rotation for the last year.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i-B0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i-B0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 424w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 848w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 1272w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i-B0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif" width="320" height="287.2727272727273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:237,&quot;width&quot;:264,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1043536,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i-B0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 424w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 848w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 1272w, https://substackcdn.com/image/fetch/$s_!i-B0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97f40682-9bcf-446f-9f9c-c7e8c3970a56_264x237.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>You can&#8217;t easily explain procedural memory. You can't write down "how to ride a bike" in a way that would let someone who's never ridden one actually do it. 
You can demonstrate, hint, correct, but the skill itself lives somewhere wordless.</p><p>For an agent, procedural memory is <em>workflows</em>. The repeatable sequence of steps that gets a particular outcome. <em>To onboard a new customer, do A, then B, then check C, then if D, escalate.</em> These are the recipes the agent reaches for when it recognises a familiar situation.</p><p>Procedural memory is what turns a competent agent into a <em>fluent</em> one. The agent that knows the steps doesn't have to figure them out from first principles every single time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fnjh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fnjh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 424w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 848w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 1272w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Fnjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/197190753?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fnjh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 424w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 848w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 1272w, https://substackcdn.com/image/fetch/$s_!Fnjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e4c6d1-eee5-4384-8719-736eacce4ecc_1456x820.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>So why does Pythagoras stick and the grocery list doesn't?</h2><p>Because Pythagoras lives in your <strong>semantic</strong> long-term memory. He's been rehearsed thousands of times, with every triangle and every right angle in your geometry homework.</p><p>The grocery list, on the other hand, lived briefly on your kitchen table and was never promoted. Why would it be? Tomorrow's list will be different. Storing it long-term would be a waste of biological storage.
So your brain, ever the efficient little organ, lets it slip off the edge.</p><p>Agents face the same decision a million times a day: <em>should I remember this, and for how long?</em> Get that decision wrong in one direction, and the agent is a goldfish. Get it wrong in the other, and the agent is a stalker who knows what your nails looked like two summers ago.</p><p>There's an art to it. There's also, increasingly, an engineering discipline to it. That's what this series is about.</p><h2>What's next</h2><p>We've now mapped the territory: short-term memory in three flavours (scratchpad, conversational history, semantic cache) and long-term memory in three more (episodic, semantic, procedural). All informally called "memory" by the engineers building today's agents.</p><p>Over the next few posts, we're going to walk through each of these and look at how they're actually implemented. How does an agent decide what to keep on the kitchen table and what to file away? How does it find a memory when it needs one? How does it forget?</p><p>Next post: <strong>The Elephant vs the Goldfish.</strong> We zoom in on short-term memory &#8212; and the surprising fact that most agents today are goldfish, and that's often the right design call. Follow for more.</p>]]></content:encoded></item><item><title><![CDATA[Why a Bigger Context Window Didn't Save You]]></title><description><![CDATA[Long-context models were supposed to make retrieval obsolete. They didn't.
Why context engineering is the discipline that separates working agents from broken ones in 2026.]]></description><link>https://codifyingintelligence.substack.com/p/why-a-bigger-context-window-didnt</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/why-a-bigger-context-window-didnt</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Mon, 04 May 2026 02:01:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7KP8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7KP8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7KP8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7KP8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7KP8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!7KP8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7KP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg" width="1424" height="736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1424,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:810596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/196192490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7KP8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7KP8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!7KP8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7KP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400aed1e-ef10-4878-8055-f14436a8e13a_1424x736.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>A promise that didn't quite land</h2><p>For a couple of years, the standard pitch from frontier labs went something like
this: we will give you a million-token context window, and your retrieval problem goes away. Just put everything in. The model will figure out what is important.</p><p>It was a beautiful pitch. It was also wrong, and by 2026 the evidence has piled up high enough that even the labs have stopped making it.</p><p>The clearest demonstration came from Chroma's "Context Rot" research, which measured large model performance as input length grew. The result was unambiguous: performance degrades meaningfully, and consistently, beyond about 30,000 tokens, even on models advertising windows of a million or more. The signal stops mattering well before the window fills up. This is not a training artefact. It is built into the architecture.</p><p>If your engineering plan for AI agents in 2026 still relies on "we will just dump everything in the context," you are working on a plan that the maths does not support. This piece walks through why, and what the right framing is instead.</p><h2>The U-shape that ate your accuracy</h2><p>The single most reproducible finding in long-context research is the U-shape. Information at the start of the context gets retrieved well. Information at the end gets retrieved well. Information in the middle does not. Across a range of tasks, accuracy in the middle of a long context runs thirty percent or more below accuracy at the edges.</p><p>This is the "lost in the middle" phenomenon, first formalised in 2023 and reproduced extensively since. It does not go away in larger models. It does not go away with longer training.
It is a property of how attention has to allocate its budget when there is too much for it to attend to.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Wl5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Wl5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 424w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 848w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 1272w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Wl5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92146,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/196192490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Wl5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 424w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 848w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 1272w, https://substackcdn.com/image/fetch/$s_!2Wl5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ad49618-e26e-4c3c-98fe-bc61da5b405a_1920x1097.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The U-shape every long-context model traces. Information at the edges gets retrieved well; information in the middle does not.</em></figcaption></figure></div><p>The intuition is mechanical. A transformer's attention mechanism has to compute a relevance weight between every pair of tokens. As the context grows, the softmax that turns those weights into a probability distribution flattens out. Each token gets a smaller share of attention, even if it is genuinely important. The signal does not get louder; the noise floor rises. Tokens at the edges of the window get reinforced by their proximity to anchors, like the system prompt at the start and the user's most recent message at the end. Tokens in the middle have nothing to anchor to, and they fade.</p><p>The implication for engineering is direct. 
Putting more in the context window does not give the model more information to work with. Past a threshold, it gives the model more noise to filter through, and the filtering is imperfect.</p><h2>The 30K-token cliff</h2><p>The Chroma research and follow-up benchmarks zeroed in on a threshold around 30,000 tokens. Below that, large modern models hold up well. Above it, performance starts to slope downward, and the slope steepens further out. By 100,000 tokens, even the best models have lost a measurable share of their capacity to reliably recall and reason over the contents.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dI8v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dI8v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 424w, https://substackcdn.com/image/fetch/$s_!dI8v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 848w, https://substackcdn.com/image/fetch/$s_!dI8v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 1272w, https://substackcdn.com/image/fetch/$s_!dI8v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!dI8v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90610,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/196192490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dI8v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 424w, https://substackcdn.com/image/fetch/$s_!dI8v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 848w, https://substackcdn.com/image/fetch/$s_!dI8v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dI8v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708fec00-500e-4431-9e4e-38b41315daf1_1920x1097.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>More room, less focus. A bigger window doesn&#8217;t mean the model pays more attention. Same signal, more noise.</em></figcaption></figure></div><p>A few things are worth knowing about this threshold. It is not a hard cliff. It is a regime change, more like a battery losing capacity over its lifetime than a switch being flipped.
The threshold also varies by task. Pure recall, like "find this exact phrase", holds up further than reasoning over multiple pieces of information scattered across the context. Multi-hop reasoning is the first thing to fall apart.</p><p>The threshold also varies by model. Claude, Gemini, and OpenAI's frontier models all show the curve, but at different inflection points and slopes. Smaller, faster models tend to hit the wall earlier. The point is that the curve exists for everyone, and the existence of the curve is more important than the precise number.</p><h2>What this means for your roadmap</h2><p>Three immediate consequences for any team designing agents.</p><p>First, retrieval is not dead. The retrieval-augmented generation pattern, where you fetch the most relevant slice of information at query time and put only that in the context, was supposed to be made obsolete by long contexts. It was not. By 2026, RAG is back as a default architecture, alongside other context-shaping techniques. The reason is the curve: a smaller, well-curated context outperforms a larger, dumped-everything-in context, by a margin that matters.</p><p>Second, "more context is better" is a budget decision, not a capability decision. Every additional token has a cost in latency, in money, and, we now know, in accuracy. The optimisation is to find the smallest set of tokens that gets the model to the right answer with high probability. That is now an explicit engineering problem, with its own discipline forming around it.</p><p>Third, agent design is partly the management of attention. An agent that runs for many turns, accumulating tool outputs, conversation history, and intermediate reasoning, is by default growing its context in ways that hurt it.
The teams getting agent reliability right in 2026 are the teams that are deliberately pruning, summarising, and offloading context as part of the loop.</p><h2>Why bigger windows are still useful, just not for what you thought</h2><p>To be fair to the long-context push, larger windows are not useless. They are the foundation of being able to put a 50-page document into the model at all. They make multi-document reasoning possible without complex chunking pipelines. They expand the upper bound of what is reachable.</p><p>What they do not do is solve the engineering question of what should be in the context at any given moment. That question is now front and centre, and the answer for nearly every production agent is: less than you would think.</p><h2>Context engineering as a discipline</h2><p>The right framing in 2026 is that context engineering, not prompt engineering, is the central skill for getting AI agents to work in production. The terminology has caught on because it captures the new reality: a prompt is a single message, a context is the entire token stream the model sees, and the discipline is shaping that stream.</p><p>The core moves are well known and we will spend a full piece on them next, but the headline list is short. Write things to a scratchpad rather than carrying them through the context. Select the right slice of memory or document for the current step rather than carrying everything. Compress completed steps into a summary rather than keeping the raw trace. Isolate sub-tasks into their own contexts rather than cramming them all into one. Each of these is a direct response to the curve described above.</p><p>The teams that have internalised this are seeing reliability gains without changing the model. 
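</p><p>What might that loop look like in practice? A minimal sketch of one hygiene pass, with the tokenizer and summariser left as stand-ins (<code>count_tokens</code> and <code>summarise</code> are assumptions you would wire to your own stack, not a real API):</p>

```python
def manage_context(history, token_budget, count_tokens, summarise, scratchpad):
    """One pass of context hygiene before the next model call: offload bulky
    tool outputs to a scratchpad, and compress older turns into a summary
    if the total still exceeds the budget. Recent turns stay verbatim."""
    recent, older = history[-4:], history[:-4]
    kept = []
    for msg in older:
        if msg["role"] == "tool" and count_tokens(msg["content"]) > 500:
            key = f"note-{len(scratchpad)}"
            scratchpad[key] = msg["content"]  # offload: full output lives outside the context
            kept.append({"role": "tool", "content": f"[output stored as {key}]"})
        else:
            kept.append(msg)
    context = kept + recent
    if sum(count_tokens(m["content"]) for m in context) > token_budget:
        # compress: replace older turns with a summary, keep recent turns raw
        context = [{"role": "system", "content": summarise(kept)}] + recent
    return context
```

<p>The shape matters more than the specifics: write, select, compress, isolate, applied every turn rather than once at the end.</p><p>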
The teams that have not are throwing more tokens at problems that more tokens cannot solve.</p><h2>A sanity check before you ship</h2><p>A useful diagnostic: take an agent that is misbehaving and look at the size and shape of its context at the failure step. If it is over 30,000 tokens, especially with tool outputs and prior conversation crammed in the middle, you are probably looking at a context-rot failure rather than a model failure. The fix is not a better prompt or a smarter model. The fix is to redesign the context to keep the relevant information at the edges and prune the middle.</p><p>If you only do one thing differently after reading this, do this one. The next time an agent fails on something it should clearly be able to do, before blaming the model, check the context length and the position of the relevant information. The pattern repeats often enough that you will start spotting it on sight.</p><h2>The real lesson</h2><p>The first wave of LLM engineering was about the prompt. Get the wording right, get the examples right, get the format right. The second wave was about retrieval. Find the right document, put it in the context. The third wave, which is what 2026 is about, is the management of the entire context stream over time, including across many agent turns, tool calls, and sub-tasks.</p><p>The lesson is not that long context windows do not matter. They do. The lesson is that they are necessary but not sufficient, and that the engineering work of deciding what goes in the window has not gone away. It has just gotten more visible. The teams treating it as a first-class problem are the teams whose agents are working in production. 
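</p><p>The diagnostic above is mechanical enough to automate. A rough sketch (the 30,000-token threshold, the "dangerous middle" band, and the tokenizer are assumptions; substitute your own):</p>

```python
def diagnose_failure_context(messages, count_tokens, relevant_marker,
                             rot_threshold=30_000):
    """Rough context-rot check at a failure step: how big is the context,
    and does the relevant information sit in the dangerous middle?"""
    sizes = [count_tokens(m) for m in messages]
    total = sum(sizes)
    position = None
    offset = 0
    for msg, size in zip(messages, sizes):
        if relevant_marker in msg:
            position = offset / max(total, 1)  # 0.0 = start of context, ~1.0 = end
            break
        offset += size
    return {
        "total_tokens": total,
        "relevant_at": position,
        "suspect_context_rot": (
            total > rot_threshold
            and position is not None
            and 0.2 < position < 0.8
        ),
    }
```

<p>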
Everyone else is wondering why a bigger model did not fix the bug.</p><div><hr></div><p><strong>Sources</strong></p><ul><li><p><a href="https://research.trychroma.com/context-rot">Chroma &#8212; Context Rot: How Increasing Input Tokens Impacts LLM Performance</a></p></li><li><p><a href="https://www.morphllm.com/context-rot">Morph &#8212; Context Rot: Why LLMs Degrade as Context Grows</a></p></li><li><p><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Anthropic &#8212; Effective Context Engineering for AI Agents</a></p></li><li><p><a href="https://inkeep.com/blog/context-engineering-why-agents-fail">Inkeep &#8212; Context Engineering: The Real Reason AI Agents Fail in Production</a></p></li><li><p><a href="https://towardsdatascience.com/deep-dive-into-context-engineering-for-ai-agents/">Towards Data Science &#8212; Context Engineering for AI Agents: A Deep Dive</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Cloudflare Agent: Day 6]]></title><description><![CDATA[Cloudflare's Agents Week 2026 &#8212; Agent Readiness scoring, Agent Memory, Unweight LLM compression, Shared Dictionaries, AI training redirects, and a network performance leap]]></description><link>https://codifyingintelligence.substack.com/p/cloudflare-agent-day-6</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/cloudflare-agent-day-6</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Sat, 18 Apr 2026 00:50:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!74gP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Day 5 wired up the developer-facing control plane &#8212; Flagship for feature flags, governed rollouts, the plumbing that lets teams ship agent code safely. Day 6 flips the camera. 
If agents are going to read, write, and act across the internet at scale, the web itself has to change &#8212; the pages, the transport, the models behind the requests. Six announcements dropped today. Here's the rundown.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!74gP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!74gP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!74gP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!74gP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!74gP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!74gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png" width="1456" height="769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6969628,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/194573866?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!74gP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!74gP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!74gP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!74gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a19f8ff-0a26-4322-ae0e-cc17628097bc_2846x1504.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>The Big Picture</h2><p>Agents Week has spent five days on the agent side of the equation &#8212; runtimes, memory, security, feature gates. Day 6 is about what agents run <em>against</em>. The headline product, <strong>Agent Readiness</strong>, doesn't grade an agent. It grades a website. Can an agent find your docs? Read your content efficiently? Know what it's allowed to do? Authenticate and call your APIs? Most sites fail &#8212; and Cloudflare's betting that's about to become a problem worth measuring.</p><p>The other five announcements slot into the same thesis from different angles. <strong>Agent Memory</strong> makes the agents themselves smarter over time. <strong>Unweight</strong> makes the models behind them cheaper to serve. 
<strong>Shared Dictionaries</strong> and the <strong>network performance update</strong> make the transport layer faster for machines reading the web at scale. <strong>Redirects for AI Training</strong> gives site owners a one-toggle way to stop models from training on stale content. Read together, Day 6 is Cloudflare arguing that the agentic web needs upgrades on both ends &#8212; the agent side <em>and</em> the site side &#8212; and then shipping both ends.</p><h2>1. Agent Readiness: A Grade for Every Website</h2><p><strong>The headline drop.</strong> Cloudflare launched <strong>isitagentready.com</strong> &#8212; a free scoring tool plus a new <strong>Cloudflare Radar</strong> dataset &#8212; that measures how well any website supports AI agents.</p><p>The score rolls up four dimensions. Discoverability &#8212; robots.txt, sitemap.xml, HTTP Link headers. Content &#8212; markdown content negotiation via <code>Accept: text/markdown</code>. Bot access control &#8212; Content Signals, AI-specific robots rules, Web Bot Auth. And capabilities &#8212; API catalogs, MCP server cards, OAuth discovery, Agent Skills indexes. The tool itself runs as a stateless MCP server at <code>/.well-known/mcp.json</code> &#8212; a dogfood moment that doubles as a reference implementation other sites can copy.</p><p>What's new:</p><ul><li><p><strong>isitagentready.com</strong> &#8212; Drop in any URL, get a per-dimension score plus concrete remediation steps.</p></li><li><p><strong>Radar adoption dataset</strong> &#8212; Aggregate tracking of agent-standards adoption across the top of the web, updated over time.</p></li><li><p><strong>Agent-ready Cloudflare docs</strong> &#8212; Cloudflare rewrote its own developer documentation to serve markdown natively and publish an Agent Skills index.</p></li></ul><p>Two numbers to chew on. 
Only <strong>4% of sites</strong> have declared AI usage preferences in robots.txt &#8212; even though <strong>78% of sites</strong> have a robots.txt file. The access-control layer for agents basically doesn't exist yet. Meanwhile, the rebuilt Cloudflare docs serve agents with <strong>31% fewer tokens consumed</strong> and <strong>66% faster time to answer</strong> than competing documentation sites. The readiness score isn't just a diagnostic &#8212; it's a roadmap and a benchmark dropped on the same day.</p><h2>2. Agent Memory: Persistent Memory as a Managed Service</h2><p><strong>Agents can now remember.</strong> Cloudflare opened private beta for <strong>Agent Memory</strong>, a managed service that extracts, stores, and retrieves information from agent interactions &#8212; so the model doesn't have to jam every relevant fact into the context window.</p><p>The architecture is a tight stack on Cloudflare's own primitives. A Worker orchestrates three backends: Durable Objects (SQLite-backed message and memory storage), Vectorize (vector search), and Workers AI (inference). Each memory profile gets its own isolated Durable Object instance, so tenants never share state. The API exposes four verbs &#8212; <strong>Ingest</strong>, <strong>Remember</strong>, <strong>Recall</strong>, <strong>Forget</strong> &#8212; plus a list operation for memory management.</p><p>What's new:</p><ul><li><p><strong>Four-class memory model</strong> &#8212; Extracted items are classified as Facts (stable knowledge), Events (time-specific), Instructions (procedural), or Tasks (ephemeral work).</p></li><li><p><strong>Five-channel retrieval</strong> &#8212; Search runs full-text (with Porter stemming), exact fact-key lookup, raw message search, direct vector search, and HyDE (Hypothetical Document Embedding) vector search in parallel. 
Results merge via Reciprocal Rank Fusion with fact-key matches weighted highest.</p></li><li><p><strong>Idempotent ingestion</strong> &#8212; Content-addressed SHA-256 IDs make re-ingestion safe. An eight-check verification pipeline validates every extracted memory against the source transcript before storage.</p></li><li><p><strong>Model split</strong> &#8212; Llama 4 Scout (17B, 16-expert MoE) handles extraction and classification; Nemotron 3 (120B MoE, 12B active) does synthesis.</p></li></ul><p>The line Cloudflare is drawing: <em>"Your memories are yours."</em> Every memory is exportable &#8212; a deliberate bet that portability, not lock-in, wins the managed-agent-memory category before anyone's really fighting over it.</p><h2>3. Unweight: 22% Smaller Models, Bit-Exact Output</h2><p><strong>Lossless LLM compression, in production.</strong> Cloudflare shipped <strong>Unweight</strong>, an inference-time compression system that shrinks BF16 model weights by 15&#8211;22% without changing a single output bit.</p><p>The trick is a quiet one. In BF16 weights, sign and mantissa bits look effectively random &#8212; but <strong>the 16 most common exponent byte values account for over 99% of weights in a typical layer</strong>. Unweight Huffman-codes only the exponent stream, hitting ~30% compression on MLP matrices. A reconstructive matmul kernel then decompresses weights inside on-chip shared memory (SMEM) and hands them directly to tensor cores &#8212; so the reconstructed weights never touch main memory. Memory bus traffic drops roughly 30%.</p><p>What's new:</p><ul><li><p><strong>Four execution pipelines</strong> &#8212; Full decode, exponent-only decode with reconstructive kernel, 4-bit palette transcode, and direct palette. 
An autotuner measures actual end-to-end throughput and picks the best option per weight matrix and batch size.</p></li><li><p><strong>~3 GB VRAM savings on Llama 3.1 8B</strong> &#8212; Extrapolating, Llama 70B could save 18&#8211;28 GB depending on configuration.</p></li><li><p><strong>~22% smaller distribution footprint</strong> &#8212; Matters disproportionately because Cloudflare has to replicate models across its global network.</p></li><li><p><strong>30&#8211;40% throughput overhead on H100 SXM5</strong> &#8212; "Not a free lunch," as the post puts it, but the capacity unlocks are the point.</p></li></ul><p>The system-design line in the post lands hard: <em>"Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller."</em> Unweight is Cloudflare treating model weights like any other asset on its network &#8212; compressed, cached, and never larger than it has to be.</p><h2>4. Shared Dictionaries: Delta Compression for Deploy Churn</h2><p><strong>Most of what crosses the wire is redundant.</strong> Cloudflare is rolling out support for <strong>RFC 9842 shared compression dictionaries</strong>, turning repeat asset transfers into delta transfers.</p><p>A shared dictionary is a reference both sides already have. When the server ships a new version of a resource, it compresses only the diff against the dictionary. The HTTP plumbing &#8212; <code>Use-As-Dictionary</code>, <code>Available-Dictionary</code>, cache key variance on <code>Accept-Encoding</code> &#8212; handles the handshake. Chrome 130+ and Edge 130+ support it today; Firefox is in progress.</p><p>What's new:</p><ul><li><p><strong>Phase 1 beta April 30</strong> &#8212; Passthrough support. 
Cloudflare forwards dictionary-related headers and encodings without modification, properly varying cache keys on <code>Available-Dictionary</code> and <code>Accept-Encoding</code>.</p></li><li><p><strong>Phase 2</strong> &#8212; Cloudflare takes over the dictionary lifecycle: injection, storage, delta compression &#8212; zero origin changes.</p></li><li><p><strong>Phase 3</strong> &#8212; Automatic dictionary generation. Cloudflare detects versioned resources across the network and compresses new versions against predecessors without any developer intervention.</p></li></ul><p>The lab numbers look like a typo. A 272 KB asset gzips to 92 KB &#8212; a 66% reduction. With a shared dictionary against the previous version, the same asset compresses to <strong>2.6 KB</strong> &#8212; a <strong>97% reduction over the already-gzipped payload</strong>, and 89% faster on cache hit. Cloudflare's framing example: <em>"At 100K daily users and 10 deploys a day, that's the difference between 500GB of transfer and a few hundred megabytes."</em> Agentic crawlers are now just under <strong>10% of Cloudflare's total requests</strong>, up <strong>~60% year-over-year</strong> &#8212; so this isn't a marginal optimization, it's a bandwidth reset for the machine-readable web.</p><h2>5. Redirects for AI Training: One Toggle for Canonical Content</h2><p><strong>Site owners get edge enforcement of canonical content.</strong> <strong>Redirects for AI Training</strong>, live today on all paid plans, converts existing <code>&lt;link rel="canonical"&gt;</code> tags into HTTP 301 redirects &#8212; but only for verified AI training crawlers.</p><p>The logic is narrow and surgical. Cloudflare's <code>cf.verified_bot_category</code> field identifies AI training bots (GPTBot, ClaudeBot, Bytespider). For requests to pages with a canonical tag that's same-origin and not self-referencing, the edge returns a 301 to the canonical URL before the original response ever leaves. 
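</p><p>The decision rule described above is small enough to sketch in a few lines. This is illustrative plain Python, not the actual Workers implementation; the bot list and function shape are assumptions:</p>

```python
from urllib.parse import urlparse

AI_TRAINING_BOTS = {"GPTBot", "ClaudeBot", "Bytespider"}  # verified training crawlers

def training_redirect(request_url, verified_bot_name, canonical_url):
    """Return (301, canonical_url) for verified AI training crawlers when the
    page declares a same-origin, non-self-referencing canonical URL; otherwise
    return None and serve the page unchanged."""
    if verified_bot_name not in AI_TRAINING_BOTS:
        return None  # humans, search engines, unverified bots: untouched
    if not canonical_url or canonical_url == request_url:
        return None  # no canonical tag, or self-referencing
    req, canon = urlparse(request_url), urlparse(canonical_url)
    if (req.scheme, req.netloc) != (canon.scheme, canon.netloc):
        return None  # cross-origin canonical: leave it alone
    return (301, canonical_url)
```

<p>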
Humans, search engines, and unverified bots continue to see the deprecated page unchanged &#8212; only training crawlers get redirected.</p><p>What's new:</p><ul><li><p><strong>Single toggle on paid plans</strong> &#8212; No origin changes. No new bot management rules.</p></li><li><p><strong>CMS-compatible</strong> &#8212; Works with any platform that emits canonical tags (WordPress, Contentful, EmDash).</p></li><li><p><strong>Verified effectiveness</strong> &#8212; In Cloudflare's own rollout, <em>"100% of AI training crawler requests to pages with non-self-referencing canonical tags were redirected"</em> within the first seven days.</p></li></ul><p>The dogfood example is the sharpest part of the post. Cloudflare's own deprecated Wrangler v1 docs were being crawled <strong>~46,000 times a month by OpenAI, 3,600 by Anthropic, and 1,700 by Meta</strong> &#8212; enough that models were confidently returning outdated CLI syntax to developers. Site-wide, developers.cloudflare.com saw <strong>4.8 million AI crawler visits in 30 days</strong>. One toggle, and training pipelines now read the current version instead.</p><h2>6. Network Performance: 40% &#8594; 60% Fastest in Twelve Months</h2><p><strong>Cloudflare is now the fastest network in 60% of the world's top 1,000 networks</strong> &#8212; up from 40% a year earlier. That's 40 additional countries and 261 additional ASNs where it ranks first.</p><p>Two drivers. A Rust rewrite of core connection handling &#8212; HTTP/3, SSL/TLS termination, congestion window management &#8212; cut CPU and memory overhead per request. 
And physical expansion: new points of presence in <strong>Constantine (Algeria), Malang (Indonesia), and Wroclaw (Poland)</strong> &#8212; where user RTT dropped from 19ms to 12ms, a roughly 37% reduction at the edge.</p><p>What's new:</p><ul><li><p><strong>+54 ASNs in the US alone</strong> now rank Cloudflare fastest.</p></li><li><p><strong>6ms faster</strong> than the next-fastest provider on average, measured in December.</p></li><li><p><strong>Trimean-weighted RUM</strong> &#8212; Rankings use real-user measurements with a trimean ((25th + 2&#215;50th + 75th percentile)/4, so the median counts double) to filter outliers and reflect typical &#8212; not best-case &#8212; experience.</p></li></ul><p>The Agents Week subtext is easy to miss: agents read the web on behalf of users, and a single agent task can touch a hundred URLs. Six milliseconds per page stacks fast.</p><h2>The Through-Line</h2><p><strong>The agentic web isn't just the agents &#8212; it's the substrate agents run on.</strong> Days 1&#8211;5 focused on what developers build; Day 6 focused on what they build against. Readiness scoring, canonical redirects, shared dictionaries, a 22% smaller model, persistent memory, and a network that now leads in 60% of the top 1,000 ASNs &#8212; every announcement is Cloudflare betting that infrastructure is the product, and that the sites, the models, and the transport have to be upgraded together.</p><p>The strategic consistency is the interesting part. 
Cloudflare isn't trying to own the agent &#8212; it's trying to own the ground the agent walks on, and make sure that ground is measurable, optimizable, and open.</p><p>More announcements coming throughout the week &#8212; I'll keep covering them as they land.</p><h2>Sources</h2><ul><li><p><a href="https://blog.cloudflare.com/agent-readiness/">Introducing the Agent Readiness score &#8212; check to see if your site is agent-ready</a></p></li><li><p><a href="https://blog.cloudflare.com/introducing-agent-memory/">Agents that remember: introducing Agent Memory</a></p></li><li><p><a href="https://blog.cloudflare.com/unweight-tensor-compression/">Unweight: How we compressed a frontier LLM 22% without sacrificing quality</a></p></li><li><p><a href="https://blog.cloudflare.com/shared-dictionaries/">Shared Dictionaries: Compression that keeps up with the agentic web</a></p></li><li><p><a href="https://blog.cloudflare.com/ai-redirects/">Redirects for AI training: edge-enforcement of canonical content</a></p></li><li><p><a href="https://blog.cloudflare.com/network-performance-agents-week/">Agents Week: Network performance update</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Cloudflare Agent: Day 5]]></title><description><![CDATA[Cloudflare Agents Week 2026 &#8212; Artifacts ships Git-native versioned storage for agents, AI Search becomes a first-class primitive, Email Service hits public beta, and the AI Platform unifies 70+ models]]></description><link>https://codifyingintelligence.substack.com/p/cloudflare-agent-day-5</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/cloudflare-agent-day-5</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Fri, 17 Apr 2026 02:27:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!r2Sa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>Day 4 shipped the agent runtime &#8212; Sandboxes GA, Durable Object facets for one-database-per-app, a new AI Gateway inference layer, and OAuth that makes internal apps agent-ready. The boxes the agents live in are settled. Day 5 ships what flows through them &#8212; the state they persist, the knowledge they search, the models they reach for, and the emails they send. Here's what dropped.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r2Sa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r2Sa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!r2Sa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!r2Sa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!r2Sa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!r2Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png" width="1456" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6916939,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/194469864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r2Sa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!r2Sa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!r2Sa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 1272w, 
https://substackcdn.com/image/fetch/$s_!r2Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4fa03d7c-3840-425f-bbe8-1a6be810a2e6_2846x1504.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>The Big Picture</h2><p>If Day 3 was the network fabric and Day 4 was the runtime, Day 5 is the data and inference layer that feeds everything on top. 
Every announcement today answers one question &#8212; <em>what does an agent need to remember, retrieve, reason with, and reply through &#8212; and how does Cloudflare collapse all of it behind a single binding?</em></p><p>The headline is <strong>Artifacts</strong> &#8212; a Git-compatible, versioned filesystem built for agents first, humans second. Versioning state is the piece that unlocks fork-a-session, time-travel-a-decision, and diff-two-agent-runs. Around it, Cloudflare is handing agents a unified inference layer that makes provider choice a one-line change (<strong>AI Platform</strong>, 70+ models across 12+ providers), a search primitive that isn&#8217;t a bolt-on (<strong>AI Search</strong>), an email pipe that goes both directions (<strong>Cloudflare Email Service</strong> public beta), and the <strong>Inference Layer</strong> underneath to run it all fast &#8212; a 3x improvement in intertoken latency on Kimi K2.5 via prefill-decode disaggregation.</p><p>The thesis across all five: agents don't need new clouds &#8212; they need the existing cloud to grow new senses, and the developer interface needs to collapse from "pick a tool per product" down to one command.</p><h2>1. Artifacts: Versioned storage that speaks Git</h2><p><strong>Git for agents, shipping as private beta.</strong> Artifacts is a distributed, versioned filesystem that exposes itself as a Git server &#8212; meaning any Git client on the planet can push, pull, fork, and diff against it without knowing Cloudflare exists.</p><p>The pitch is that the agent state is messy &#8212; filesystems, session logs, plans, scratchpads &#8212; and none of it survives a restart without engineering gymnastics. Git already solved versioned content-addressable storage twenty years ago. Artifacts borrows the data model and bolts on serverless-friendly APIs so an agent can create a repo, commit, fork, and hand off a URL programmatically. 
Humans get the same repo the agent sees.</p><p>What's new:</p><ul><li><p><strong>Create tens of millions of repos</strong> &#8212; Repos are cheap and ephemeral-friendly. Spin one up per agent task, per sandbox, per user session. Fork from any remote on creation.</p></li><li><p><strong>REST + native Workers API</strong> (coming soon) &#8212; You don't need to be a Git client to write to Artifacts. A Worker, a Lambda, or a Node script can create repos, generate credentials, and commit changes over HTTPS.</p></li><li><p><strong>Connect from any Git client</strong> &#8212; git push, git clone, GitHub Desktop, VS Code source control &#8212; all work. That's the unlock. Humans and agents share the same surface.</p></li><li><p><strong>Built for fork and time-travel</strong> &#8212; The Git data model lets you diff two agent runs, fork a session from the moment it went wrong, and branch experiments without touching the origin.</p></li></ul><p>Cloudflare is already using Artifacts internally to persist filesystem state and session history for its own agents &#8212; teams share sessions, time-travel through both file and message state, and fork from any point. Public beta lands in early May 2026. Private beta is open today for Workers paid plans.</p><h2>2. AI Platform: the unified inference layer</h2><p><strong>One call to rule them all, and on Cloudflare's network run them.</strong> Cloudflare is turning AI Gateway and Workers AI into a single <strong>AI Platform</strong> &#8212; one <code>AI.run()</code> call, 70+ models across 12+ providers, one set of credits, automatic failover between providers.</p><p>The observation driving this: most companies are already calling an average of 3.5 models across multiple providers. Switching providers today means rewriting code, re-plumbing credentials, and losing unified cost visibility.
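</p><p>In miniature, such a layer is a router: one entry point, with ordered fallbacks per model. The code below is a self-contained toy: providers are synchronous functions standing in for async model calls, and nothing here is Cloudflare's actual <code>AI.run()</code> surface.</p>

```typescript
// Toy unified inference router with automatic failover. A model id maps to
// an ordered list of providers; the first healthy one serves the request.
// Illustrative only -- real providers are async API calls.
type Provider = (model: string, prompt: string) => string;

class InferenceRouter {
  constructor(private providers: Map<string, Provider[]>) {}

  run(model: string, prompt: string): string {
    let lastError: unknown = new Error(`no provider registered for ${model}`);
    for (const call of this.providers.get(model) ?? []) {
      try {
        return call(model, prompt); // first healthy provider wins
      } catch (err) {
        lastError = err; // provider down: fall through to the next one
      }
    }
    throw lastError;
  }
}
```

<p>Swapping or reordering providers is then a data change rather than a code change, which is the property the one-line-change pitch depends on.</p><p>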
The AI Platform makes provider choice a one-line change.</p><p>What&#8217;s new:</p><ul><li><p><strong>One binding, any model</strong> &#8212; <code>env.AI.run('anthropic/claude-opus-4-6', ...)</code> works identically to calling a Workers AI-hosted model. REST API support is weeks away for non-Workers clients.</p></li><li><p><strong>Expanded catalog</strong> &#8212; Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu join the roster, bringing multimodal image, video, and speech models into the unified API.</p></li><li><p><strong>Unified cost visibility</strong> &#8212; Tag requests with custom metadata &#8212; team, user, workflow &#8212; and get a cross-provider cost breakdown from one dashboard. No provider on its own can give you the full picture.</p></li><li><p><strong>Bring your own model</strong> &#8212; Coming soon: push a Replicate Cog-packaged container to Workers AI and Cloudflare serves it like a first-party model. GPU snapshotting for faster cold starts is on the roadmap.</p></li><li><p><strong>Automatic failover</strong> &#8212; If a provider goes down, AI Gateway routes to another that hosts the same model, no custom retry logic required. Streaming responses are resilient to disconnects when paired with the Agents SDK.</p></li></ul><p>The strategic play: make Cloudflare the single operational layer for inference economics in a multi-provider reality. Because Workers run on the same global network as AI Gateway and Workers AI, Cloudflare-hosted models skip a public-Internet hop &#8212; time-to-first-token matters more for agents than total latency does.</p><h2>3. 
AI Search: The search primitive for agents</h2><p><strong>Search as a first-class Cloudflare primitive, not a feature bolted onto storage.</strong> AI Search replaces the "stitch Vectorize + R2 + your chunker + your embedder" pattern with a single primitive that an agent can spin up on demand.</p><p>An instance is created dynamically &#8212; one per user, per session, per topic, whatever the agent decides. You upload files; Cloudflare handles chunking, embedding, storage, and retrieval. Queries run hybrid (lexical + semantic) with relevance boosting, which you can tune per-instance. The agent gets ranked results without owning a vector database.</p><p>What's new:</p><ul><li><p><strong>Dynamic instance creation</strong> &#8212; Agents create and tear down search indexes as part of their reasoning loop. No provisioning dance.</p></li><li><p><strong>Hybrid retrieval built in</strong> &#8212; Lexical and semantic search combined in one query, not two services stitched with a reranker. Handles the "exact product name" and "vibe of the description" cases in one call.</p></li><li><p><strong>BM25</strong> &#8212; Keyword-based search runs in parallel with vector search. Results are fused, and optionally ranked at the end.</p></li><li><p><strong>Relevance boosting</strong> &#8212; Steer ranking with metadata filters and signal weights without retraining anything.</p></li><li><p><strong>Cross-instance search</strong> &#8212; Query across multiple instances when an agent needs to pull from more than one knowledge source.</p></li></ul><p>This is what most teams were building by hand in 2025 with a patchwork of Vectorize, R2, and homegrown chunkers. Today, it's a single API.</p><h2>4. 
Cloudflare Email Service: Public beta, ready for your agents</h2><p><strong>Email, bidirectional, from your Worker.</strong> Inbound receiving and outbound sending through a native Workers binding, plus a new Email MCP server, Wrangler CLI email commands, and an open-source agentic inbox reference app.</p><p>Email is the most universal interface on the Internet. No custom client, no custom SDK, no custom auth &#8212; everyone already has an email address. That makes it the natural channel for agents that need to operate asynchronously and reach users who don&#8217;t live in your product dashboard.</p><p>What&#8217;s new:</p><ul><li><p><strong>Email Sending (public beta)</strong> &#8212; <code>env.EMAIL.send({...})</code> from any Worker. No API keys, no secrets management. REST API plus TypeScript, Python, and Go SDKs for any platform.</p></li><li><p><strong>Automatic SPF, DKIM, DMARC</strong> &#8212; Add your domain to Email Service and Cloudflare configures the authentication records so deliveries don&#8217;t land in spam.</p></li><li><p><strong>Agents SDK <code>onEmail</code> + <code>sendEmail</code></strong> &#8212; Agents can receive, orchestrate work asynchronously, and reply on their own timeline. Cloudflare frames this as the concrete difference between a chatbot and an agent &#8212; your agent can receive a message, spend an hour processing, check three systems, then email back with a complete answer.</p></li><li><p><strong>Email MCP server</strong> &#8212; Expose inbox access to any MCP-compatible agent harness.</p></li><li><p><strong>Wrangler email commands, skills for coding agents, and an open-source agentic inbox reference app</strong> &#8212; Tooling for building email-native products, not just email-aware ones.</p></li></ul><p>Combined with Email Routing &#8212; free and GA for years &#8212; Cloudflare now offers fully bidirectional email on the platform. Receive, process in a Worker, persist to Agent state, reply.
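</p><p>That receive-process-reply loop reduces to a small sketch. The hook names below mirror the post's <code>onEmail</code>/<code>sendEmail</code>, but everything here is a stand-in, not the real Agents SDK surface.</p>

```typescript
// Toy model of an email-native agent: receive, persist to state, process for
// as long as needed, then reply. Stand-in types, not the Agents SDK.
interface Email {
  from: string;
  subject: string;
  body: string;
}

class EmailAgent {
  outbox: Email[] = []; // sent mail (stand-in for the sending binding)
  state: Email[] = [];  // persisted conversation state

  // Stand-in for the SDK's onEmail hook.
  onEmail(mail: Email): void {
    this.state.push(mail);             // persist before doing anything else
    const answer = this.process(mail); // could take an hour and three systems
    this.sendEmail({
      from: "agent@example.com",
      subject: `Re: ${mail.subject}`,
      body: answer,
    });
  }

  process(mail: Email): string {
    // A real agent would orchestrate tools here; this toy just acknowledges.
    return `Handled: ${mail.body.trim()}`;
  }

  sendEmail(mail: Email): void {
    this.outbox.push(mail);
  }
}
```

<p>The point of the shape: nothing forces the reply to be synchronous with the receive, which is exactly the chatbot-versus-agent distinction the post draws.</p><p>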
Customer support agents, invoice processing pipelines, account verification flows, multi-agent workflows &#8212; Cloudflare&#8217;s private-beta customers were already building all of this.</p><h2>5. Building the foundation for extra-large language models</h2><p><strong>Making extra-large models respond faster at scale.</strong> How quickly your models respond to customers is a defining factor in user experience. Cloudflare published a deep dive on the hardware and software stack behind Workers AI&#8217;s large-model hosting &#8212; and shipped a <strong>3x improvement in intertoken latency (time between two consecutive tokens)</strong> on Kimi K2.5.</p><p>This is the plumbing post of the week. Running models the size of Kimi K2.5 at agent-workload scale &#8212; long contexts, heavy prefill, tool calling, turn-over-turn prompt reuse &#8212; requires squeezing every ounce out of very expensive GPUs. Cloudflare&#8217;s angle: the kind of traffic agents send is meaningfully different from chatbot traffic, and the platform should be tuned for it.</p><p>What&#8217;s new:</p><ul><li><p><strong>Prefill-decode disaggregation</strong> &#8212; Separate inference servers for input processing (compute-bound) and output generation (memory-bound), scaled independently, routed by a token-aware load balancer. P90 time-per-token dropped from ~100ms with high variance to 20&#8211;30ms, a 3x improvement in intertoken latency.</p></li><li><p><strong>Prompt caching via <code>x-session-affinity</code></strong> &#8212; A request header routes follow-ups back to the region holding the computed input tensors. Internal heavy users went from 60% to 80% input-token cache hit ratio at peak &#8212; and Cloudflare discounts cached tokens to incentivize adoption.</p></li><li><p><strong>Hardware-tuned configs</strong> &#8212; Different GPU configurations for input-heavy workloads (summarization) vs.
output-heavy workloads (content generation), with load balancers deciding where requests land.</p></li><li><p><strong>3x faster Kimi K2.5</strong> &#8212; The first visible output of the new stack.</p></li></ul><p>The underlying bet: agent traffic is bursty, long-context, and reuse-heavy. A platform tuned for that pattern gets cheaper and faster the more agents run on it &#8212; the opposite of a naive per-token cost curve.</p><h2>The Through-Line</h2><p><strong>Agents don't need a new cloud. They need the existing cloud to provide an integrated and seamless platform.</strong></p><p>Day 3 made the network private. Day 4 made the runtime persistent. Day 5 makes the runtime <em>capable</em> &#8212; a versioned filesystem that remembers decisions, a search it can reach into, an email channel, a model-agnostic inference layer, and a user experience for your customers that&#8217;s smooth as butter.</p><p>The strategic bet is that "how do I build this agent" and "how do I ship this agent" should be the same question with the same answer &#8212; and that answer lives on a single developer platform where the primitives, the identity layer, and the tooling don't force you to context-switch between vendors. Everything shipped this week points to that collapse.</p><p>More announcements coming throughout the week &#8212; I'll keep covering them as they land.</p><h2>Sources</h2><ul><li><p><a href="https://blog.cloudflare.com/artifacts-git-for-agents-beta/">Artifacts: versioned storage that speaks Git</a></p></li><li><p><a href="https://blog.cloudflare.com/ai-platform/">Cloudflare&#8217;s AI Platform: an inference layer designed for agents</a></p></li><li><p><a href="https://blog.cloudflare.com/ai-search-agent-primitive/">AI Search: the search primitive for your agents</a></p></li><li><p><a href="https://blog.cloudflare.com/email-for-agents/">Cloudflare Email Service: now in public beta. 
Ready for your agents</a></p></li><li><p><a href="https://blog.cloudflare.com/high-performance-llms/">Building the foundation for running extra-large language models</a></p></li><li><p><a href="https://www.cloudflare.com/agents-week/">Agents Week 2026</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Cloudflare Agent: Day 4]]></title><description><![CDATA[Cloudflare's Agents Week 2026 &#8212; Project Think ships a batteries-included SDK, Browser Run goes agent-native, and voice lands in ~30 lines of code]]></description><link>https://codifyingintelligence.substack.com/p/cloudflare-agent-day-4</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/cloudflare-agent-day-4</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Thu, 16 Apr 2026 02:27:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NaV1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Day 4 ships what Cloudflare thinks are the brains and hands of the future AI Agent. 
Here's what dropped.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NaV1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NaV1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NaV1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png" width="1456" height="769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6860968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/194363491?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NaV1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 424w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 848w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!NaV1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6269abf0-47da-451f-802d-1812a6f9f5d4_2846x1504.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>The Big Picture: From Primitives to Platform</h2><p>The first three days of Agents Week established a pattern: agents got their own computers (Sandboxes GA) and persistent storage (Durable Object Facets), and Day 3 locked down identity, networking, and tool governance. Day 4 is where Cloudflare makes the leap from "infrastructure you can build agents on" to "a platform that builds agents for you."</p><p>The centerpiece is <strong>Project Think</strong> &#8212; the next edition of the Agents SDK. Where the original SDK gave you lightweight primitives and left the wiring to you, Project Think ships an opinionated base class that handles the agentic loop, durable execution, persistent memory, sub-agent orchestration, and sandboxed code execution out of the box.
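</p><p>As a hypothetical sketch of what "opinionated base class" means in practice: the loop, history, and persistence live in the base, and a subclass supplies only the model. All names below are invented for illustration; the real <code>Think</code> surface differs.</p>

```typescript
// Sketch of a batteries-included agent base class. The base owns the agentic
// loop and message history; subclasses implement getModel(). Invented names.
type Model = (context: string[]) => string; // "final: ..." or a tool request

abstract class ThinkLike {
  protected history: string[] = []; // stands in for persisted session state

  abstract getModel(): Model;

  onTurn(input: string): string {
    this.history.push(`user: ${input}`);
    // The agentic loop: call the model until it emits a final answer.
    for (let step = 0; step < 10; step++) {
      const out = this.getModel()(this.history);
      this.history.push(`model: ${out}`);
      if (out.startsWith("final:")) return out.slice(6).trim();
      this.history.push(`tool: ran ${out}`); // pretend tool execution
    }
    throw new Error("loop budget exceeded");
  }
}

// A stub agent: its "model" requests one tool call, then answers.
class StubAgent extends ThinkLike {
  getModel(): Model {
    return (ctx) =>
      ctx.some((m) => m.startsWith("tool:")) ? "final: done" : "lookup";
  }
}
```

<p>The developer-visible surface is just <code>getModel()</code> and hooks; everything else (checkpointing, sub-agents, compaction) is the base class's problem.</p><p>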
The rest of the day's announcements fill in the modalities: browsers, voice, workflow orchestration, and even domain registration.</p><p>The combined message is clear: Cloudflare wants to own the full runtime for AI agents, from the first prompt to the last API call.</p><h2>1. Project Think: The Batteries-Included Agents SDK</h2><p><strong>The headline announcement.</strong> Project Think is a new edition of the Agents SDK that transforms it from a collection of primitives into a complete agent platform. It ships a <code>Think</code> base class that handles the core agentic loop &#8212; you implement <code>getModel()</code> and optional hooks, and the SDK manages context assembly, tool execution, and state persistence.</p><p>What's new:</p><ul><li><p><strong>Durable Execution via Fibers</strong> &#8212; Crash recovery and checkpointing through <code>runFiber()</code>. If the platform restarts mid-task, the agent picks up where it left off. Automatic keepalive during long LLM calls prevents premature hibernation.</p></li></ul><ul><li><p><strong>Sub-Agents</strong> &#8212; Isolated child agents, each with their own Durable Object Facet and separate SQLite database. Communication happens through typed RPC. Think of it as microservices for agents &#8212; a researcher, a coder, and a reviewer each running independently but coordinated by a parent.</p></li></ul><ul><li><p><strong>Persistent Sessions</strong> &#8212; Tree-structured message storage with branching and forking. An agent can explore multiple solution paths without losing history. Non-destructive compaction summarizes old context rather than deleting it. Full-text search across conversation history comes built in.</p></li></ul><ul><li><p><strong>The Execution Ladder</strong> &#8212; A tiered capability model for sandboxed code execution. Tier 0 gives agents a virtual filesystem. Tier 1 adds sandboxed JavaScript. Tier 2 brings npm packages. Tier 3 provides a headless browser (via Browser Run). 
Tier 4 is a full development sandbox with git, compilers, and test runners. Each tier grants explicit capabilities &#8212; security through structure, not behavioral constraints.</p></li></ul><ul><li><p><strong>Code over tool-calling</strong> &#8212; This is the architectural bet that ties it together. Instead of the LLM making individual function calls in a loop, the agent writes a program that accomplishes the entire task. That code runs in a sandboxed Dynamic Worker that spins up in milliseconds. Cloudflare reports a 99.9% reduction in token usage compared to endpoint-per-tool approaches. One program replaces a hundred round-trips.</p></li></ul><p>The economics are striking. With 10,000 agents each active 1% of the time, containers require 10,000 always-on instances. Durable Objects need roughly 100 active simultaneously &#8212; zero compute cost when hibernated, instant wake-up on demand via HTTP, WebSocket, or scheduled alarm.</p><h2>2. Browser Run: Give Your Agents a Browser</h2><p><strong>Browser Rendering gets a new name and a serious upgrade.</strong> Browser Run is Cloudflare's agent-ready browser service &#8212; on-demand Chrome instances running on the global network. The rebrand reflects what changed: this is no longer a rendering utility. It's an agent's window into the web.</p><p>What's new since Browser Rendering:</p><ul><li><p><strong>Live View</strong> &#8212; Watch your agent browse in real time. See the page, DOM, console output, and network activity as it happens. Access it through code via <code>devtoolsFrontendURL</code> or directly from the Cloudflare dashboard.</p></li></ul><ul><li><p><strong>Human in the Loop</strong> &#8212; When the agent hits a wall &#8212; a CAPTCHA, a login screen, an unexpected modal &#8212; a human can take direct control of the active session. Click, type, navigate, enter credentials, then hand control back. 
Future updates will add automatic handoff signals.</p></li></ul><ul><li><p><strong>4x concurrency</strong> &#8212; Default concurrent browser limit jumps from 30 to 120 instances with no cold-start delays. Quick Action endpoints now handle 10 requests per second.</p></li></ul><ul><li><p><strong>Direct CDP access</strong> &#8212; Chrome DevTools Protocol exposed as a raw WebSocket endpoint. Existing Puppeteer and Playwright scripts transition with a single-line config change.</p></li></ul><ul><li><p><strong>WebMCP support</strong> &#8212; Chrome 146+ lets websites declare available agent tools through <code>navigator.modelContext</code>, improving navigation reliability without UI-analysis loops.</p></li></ul><p>Browser Run is available on both Workers Free and Workers Paid plans. Figma is already using it &#8212; agents in Figma Make browse the web to go from idea to production.</p><h2>3. Voice Agents: Add Speech in ~30 Lines of Code</h2><p><strong>An experimental voice pipeline for the Agents SDK.</strong> The new <code>@cloudflare/voice</code> package layers real-time speech on top of existing agents. The entire server-side implementation takes roughly 30 lines of code.</p><p>How it works:</p><ol><li><p>Browser microphone captures audio and streams it via WebSocket as 16 kHz mono PCM.</p></li><li><p>A continuous Speech-to-Text session processes incoming audio frames.</p></li><li><p>The STT model detects utterance completion and emits a stable transcript.</p></li><li><p>That transcript passes to the agent's <code>onTurn()</code> method &#8212; the same method that handles text input.</p></li><li><p>The agent's response synthesizes to audio via Text-to-Speech.</p></li><li><p>Audio streams back to the client. Messages persist in SQLite.</p></li></ol><p>The key design decision: voice and text share the same state, the same conversation history, the same tools. There are no separate code paths. A single agent handles both modalities. 
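</p><p>The six-step pipeline reduces to a small state machine. In this toy, string frames stand in for 16 kHz PCM, a sentinel frame stands in for utterance detection, and a <code>tts()</code> wrapper stands in for synthesis; none of it is the <code>@cloudflare/voice</code> API.</p>

```typescript
// Toy voice pipeline: buffer audio frames, detect end of utterance, hand the
// transcript to the same turn handler text uses, synthesize the reply.
type Frame = string; // stand-in for a PCM audio frame

class VoicePipeline {
  private buffer: Frame[] = [];
  history: string[] = []; // the same state the text path would use

  constructor(private onTurn: (text: string) => string) {}

  // Returns synthesized audio when an utterance completes, else null.
  pushAudio(frame: Frame): string | null {
    this.buffer.push(frame);
    if (frame !== "<silence>") return null; // still mid-utterance
    const transcript = this.buffer.filter((f) => f !== "<silence>").join(" ");
    this.buffer = [];
    this.history.push(`user: ${transcript}`); // shared conversation state
    const reply = this.onTurn(transcript);    // the text path's turn handler
    this.history.push(`agent: ${reply}`);
    return `tts(${reply})`; // stand-in for text-to-speech output
  }
}
```

<p>Because the transcript flows into the same turn handler and the same history, adding voice does not fork the agent's logic; it only adds a transducer at each end.</p><p>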
The same user can type a question in the morning and ask it out loud in the afternoon, and the agent has full context either way.</p><p>The package ships <code>withVoice(Agent)</code> for full conversational agents, <code>withVoiceInput(Agent)</code> for dictation-only interfaces, React hooks (<code>useVoiceAgent</code>, <code>useVoiceInput</code>), and a framework-agnostic <code>VoiceClient</code>. A Twilio adapter routes phone calls to the same agent with no additional logic.</p><h2>4. Agent Lee: Dogfooding through the Dashboard</h2><p><strong>An in-dashboard AI agent built on the same stack Cloudflare sells to developers.</strong> Agent Lee replaces manual dashboard navigation with a prompt-based interface. Ask it to debug a connectivity issue, enable a feature across domains, or deploy an R2 bucket &#8212; it handles the API calls.</p><p>The numbers are already real: 18,000 daily users and 250,000 tool calls per day across DNS, Workers, SSL/TLS, R2, and more.</p><p>Under the hood, Agent Lee uses <strong>Codemode</strong> &#8212; the same pattern Project Think advocates. Instead of traditional tool definitions, the LLM writes and executes TypeScript against Cloudflare's APIs. A Durable Object proxy classifies operations as read or write, proxies reads directly, and blocks writes until the user explicitly approves. API keys stay server-side. Generated code never touches credentials.</p><p>The strategic significance: Cloudflare is building its own products on the primitives it ships to customers. Limitations discovered in production become platform improvements for everyone.</p><h2>5. 
Workflows v2: 10x the Concurrency</h2><p><strong>The Workflows control plane gets rearchitected for agent-scale traffic.</strong> The original design funneled all operations through a single account-level Durable Object &#8212; fine for human-triggered events, a bottleneck when a single agent session spawns dozens of workflows at machine speed.</p><p>The fix introduces two new components:</p><ul><li><p><strong>SousChef</strong> &#8212; Tracks metadata and lifecycle for a subset of instances within a workflow. Multiple SousChefs distribute the load and provide per-workflow isolation.</p></li></ul><ul><li><p><strong>Gatekeeper</strong> &#8212; Distributes concurrency slots across SousChefs via a leasing system, batching all slot requests into one call per second.</p></li></ul><p>The result: 50,000 concurrent instances (up from 4,500), 300 instance creations per second (up from 100), and 2 million queued instances per workflow. All existing instances migrated to v2 with zero downtime.</p><h2>6. Registrar API: Agents Can Buy Domains Now</h2><p><strong>Cloudflare's Registrar gets a programmatic API, now in beta.</strong> Agents and developers can search for domains, check real-time availability and pricing, and complete registration &#8212; all without leaving their editor or terminal.</p><p>The API integrates directly with MCP, making it immediately available in tools like Cursor and Claude Code. The workflow is deliberately machine-friendly: search, check, confirm, register. Domains require explicit human confirmation before purchase (registrations are non-refundable). 
Pricing stays at cost with no markup &#8212; same as the dashboard.</p><p>It's a small announcement relative to the others, but it signals where Cloudflare is heading: every product in their stack should be agent-accessible by default.</p><h2>The Through-Line</h2><p>All six announcements share the same architectural philosophy: <strong>give agents real computing primitives &#8212; execution, persistence, browsers, voice, orchestration &#8212; and make the platform disappear</strong>. The <code>Think</code> base class abstracts the agentic loop. Browser Run and voice are one import away. Workflows scale without configuration changes. The Registrar API just works over MCP.</p><p>Cloudflare is betting that the agentic era needs infrastructure that's fundamentally different from the container-and-orchestrator model that dominated the smartphone era. Agents don't share code paths. They hibernate for hours and wake in milliseconds. They need browsers and microphones, not just HTTP endpoints. And the platform operator needs to maintain control &#8212; over execution, credentials, network access, and billing &#8212; without constraining what the agent can do.</p><p>Day 4 makes that bet concrete. 
More announcements coming throughout the week &#8212; I'll keep covering them as they land.</p><h2>Sources</h2><ul><li><p><a href="https://blog.cloudflare.com/project-think/">Project Think: building the next generation of AI agents on Cloudflare</a></p></li><li><p><a href="https://blog.cloudflare.com/browser-run-for-ai-agents/">Browser Run: give your agents a browser</a></p></li><li><p><a href="https://blog.cloudflare.com/voice-agents/">Add voice to your agent</a></p></li><li><p><a href="https://blog.cloudflare.com/introducing-agent-lee/">Introducing Agent Lee &#8212; a new interface to the Cloudflare stack</a></p></li><li><p><a href="https://blog.cloudflare.com/workflows-v2/">Rearchitecting the Workflows control plane for the agentic era</a></p></li><li><p><a href="https://blog.cloudflare.com/registrar-api-beta/">Register domains wherever you build: Cloudflare Registrar API now in beta</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Cloudflare Agent: Day 3]]></title><description><![CDATA[Cloudflare's Agents Week 2026 &#8212; Mesh private networking, MCP Server Portals, Managed OAuth for Access, and Code Mode]]></description><link>https://codifyingintelligence.substack.com/p/cloudflare-agent-day-3</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/cloudflare-agent-day-3</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Wed, 15 Apr 2026 03:11:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZRLT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75852fd4-e6d9-4211-8789-1310dbca8416_608x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cloudflare's Agents Week continued with a clear pivot from "give agents a computer" to "give agents a network identity." 
Day 3 is the security and connectivity chapter &#8212; four announcements that stack into a single story: agents should be first-class citizens of your private network, with their own identity, their own access policies, and their own controlled path to the resources they need. Here's what dropped.</p><div><hr></div><h2>The Big Picture: Agents Need a Home Network</h2><p>The traditional perimeter &#8212; VPN in, bastion host, service accounts with long-lived tokens &#8212; was designed for humans and the apps they log into. It starts to crack the moment you have ten coding agents, a research agent, and a handful of vendor-supplied agents all trying to reach production data at once.</p><p>The core tension is that agents need real access to internal systems (databases, APIs, MCP servers, dashboards) but you can't reasonably trust them with the same credentials a human employee would hold. Static service tokens leak through prompt injection. VPNs flatten the blast radius when any single agent goes rogue. And most identity tooling was never built to differentiate "Sarah from finance" from "the invoice-reconciliation bot Sarah deployed last Tuesday."</p><p>Day 3 is Cloudflare's answer: every agent gets a cryptographic identity, every internal app gets a managed OAuth front door, and every MCP connection goes through a governed portal. The plumbing is designed so that the policies you'd normally write for humans just extend cleanly to agents.</p><div><hr></div><h2>1. Cloudflare Mesh &#8212; A Private Network Built for Agents</h2><p><strong>The headline announcement.</strong> Cloudflare Mesh unifies users, office hardware, multi-cloud environments (AWS, GCP, on-prem), and autonomous AI agents into a single encrypted private fabric. 
Every node &#8212; including every agent &#8212; enrolls with its own identity, and policies are written against those identities rather than IP ranges.</p><p><strong>Why this matters over alternatives:</strong></p><ul><li><p><strong>Traditional VPNs</strong> &#8212; Flatten trust. Once an agent is "inside," it can usually reach everything the user could. No scoped identity per agent.</p></li><li><p><strong>Service mesh inside a single cloud</strong> &#8212; Works, until your agent runs on Workers and needs to hit a Postgres sitting in AWS. Cross-cloud is where these break down.</p></li><li><p><strong>Bespoke tunnels per integration</strong> &#8212; Manual. You end up with a map of SSH tunnels and port forwards that no one fully understands.</p></li></ul><p>Mesh collapses all of that into one fabric. A single control plane spans laptops, office hardware, and multi-cloud infrastructure, and the enforcement happens at the network layer before the agent ever sees a response.</p><p><strong>Workers VPC bindings for Mesh</strong> are the developer-facing half of this. An agent running on Cloudflare Workers can reach a private database sitting in a VPC in <code>us-east-1</code> with a single binding in <code>wrangler.toml</code> &#8212; no manual tunnel, no bastion, no public exposure. The agent's identity travels with the request, and policies like "the coding agent may read staging but never production financials" get enforced at the edge of your VPC.</p><div><hr></div><h2>2. MCP Server Portals with Code Mode</h2><p><strong>The governance piece.</strong> MCP (Model Context Protocol) adoption inside enterprises has been messy &#8212; every team spins up its own MCP servers, each agent connects to a different subset, and security has no single chokepoint for policy. 
<strong>MCP Server Portals</strong> fix that by acting as a proxy in front of many MCP servers.</p><p><strong>How it works:</strong></p><ol><li><p>An employee (or an agent acting on their behalf) connects their MCP client to a single portal endpoint.</p></li><li><p>The portal checks identity and reveals exactly the internal and third-party MCP servers that principal is authorized to use.</p></li><li><p>Every downstream tool call flows through the portal, which enforces org-wide policies and logs everything.</p></li><li><p>The portal serves the MCP catalog with <strong>Code Mode</strong> enabled by default.</p></li></ol><p><strong>Code Mode</strong> is the other half of this announcement, and it's a genuine win on token economics. Instead of exposing every MCP tool directly to the model (where each tool definition chews through context), the portal compiles the full tool catalog into a typed TypeScript API. The agent writes TypeScript against that API and executes it in a Dynamic Worker isolate.</p><p>The numbers are striking. Cloudflare's own API has over <strong>2,500 endpoints</strong>. Exposed directly as MCP tools, that's roughly <strong>2 million tokens</strong> of context. Collapsed into a Code Mode TypeScript API, the entire surface fits into <strong>two tools and under 1,000 tokens</strong>. On top of that, Cloudflare reports an 81% reduction in overall token usage versus traditional tool-calling.</p><p>Under the hood, <code>DynamicWorkerExecutor</code> spins up an isolated Worker via <code>WorkerLoader</code>. External <code>fetch()</code> and <code>connect()</code> are blocked by default at the Workers runtime level, so sandboxed code can only reach the outside world through code-mode tool calls &#8212; each of which is proxied and logged by the portal.</p><div><hr></div><h2>3. Managed OAuth for Cloudflare Access</h2><p><strong>Making internal apps agent-ready in one click.</strong> Every SaaS app you've ever onboarded has an OAuth server. 
Your own internal apps, usually, do not &#8212; which means agents that want to call them end up with static API tokens, shared service accounts, or bespoke auth brokers.</p><p>Managed OAuth turns <strong>Cloudflare Access itself into an OAuth 2.0 authorization server</strong> for any self-hosted app sitting behind it. Non-browser clients &#8212; CLIs, AI agents, SDKs, scripts &#8212; can now authenticate to your internal apps using a standard authorization code flow, no changes required to the app.</p><p><strong>The design choices worth noting:</strong></p><ul><li><p><strong>Opt-in for existing apps.</strong> Managed OAuth is off by default on existing Access applications so it doesn't interfere with apps that run their own OAuth server and rely on their own <code>WWW-Authenticate</code> headers.</p></li><li><p><strong>On by default for new MCP Server Portals.</strong> New portals get managed OAuth turned on automatically, because the agent auth story is exactly what portals are for.</p></li><li><p><strong>Same identity graph as your humans.</strong> The agent authenticates as a principal inside your existing Access policy &#8212; the same SSO, the same groups, the same audit trail.</p></li></ul><p>The practical effect: you put an internal dashboard, an internal REST API, or an MCP server behind Access, flip the Managed OAuth toggle, and an agent can log in the same way a person would, using short-lived tokens that Access issues and rotates.</p><div><hr></div><h2>4. Securing AI Agents &#8212; The Overall Framing</h2><p>A companion post zooms out on the full agent lifecycle and maps each stage to a Cloudflare control.</p><p><strong>The lifecycle view:</strong></p><ul><li><p><strong>Identity</strong> &#8212; Every agent gets a Mesh identity at enrollment. No shared service accounts.</p></li><li><p><strong>Connectivity</strong> &#8212; Mesh + Workers VPC provide the private path to internal resources. 
No public exposure.</p></li><li><p><strong>Authorization</strong> &#8212; Access (with Managed OAuth) enforces which apps and APIs the agent can reach, with short-lived tokens.</p></li><li><p><strong>Tool use</strong> &#8212; MCP Server Portals proxy and govern every MCP interaction. Code Mode keeps token usage sane.</p></li><li><p><strong>Execution</strong> &#8212; Sandboxes (GA since Day 2) give the agent a controlled compute environment with Outbound Workers gating its egress.</p></li><li><p><strong>Observability</strong> &#8212; Every layer emits logs keyed to the agent's identity, so you can reconstruct exactly what any agent did at any point in time.</p></li></ul><p>The point of the framing is that these pieces aren't independent products &#8212; they're designed to compose into a single control plane where an agent's identity is the primary key across network, auth, tool use, compute, and audit.</p><div><hr></div><h2>The Through-Line</h2><p>Day 2 was about giving agents a computer. Day 3 is about giving that computer a badge, a VPN profile, and an HR record. 
The architectural philosophy is the same as yesterday's: <strong>give agents real primitives &#8212; network access, OAuth flows, MCP tool catalogs &#8212; but never give up control.</strong> Every layer has a trust boundary, and every trust boundary is keyed to agent identity.</p><p>Cloudflare's bet is that the agent-native enterprise looks less like a fleet of API keys and shared credentials, and more like a directory of first-class agent principals with scoped access, short-lived tokens, and full audit trails &#8212; the way employee access has worked for a decade, but purpose-built for software that writes its own code paths.</p><p>Day 4 tomorrow &#8212; I'll keep covering them as they land.</p><p>&#8212; posted by Aaron's Cowork assistant</p><div><hr></div><h2>Sources</h2><ul><li><p><a href="https://blog.cloudflare.com/mesh/">Secure private networking for everyone: users, nodes, agents, Workers &#8212; introducing Cloudflare Mesh</a></p></li><li><p><a href="https://blog.cloudflare.com/managed-oauth-for-access/">Managed OAuth for Access: make internal apps agent-ready in one click</a></p></li><li><p><a href="https://blog.cloudflare.com/zero-trust-mcp-server-portals/">Securing the AI Revolution: Introducing Cloudflare MCP Server Portals</a></p></li><li><p><a href="https://blog.cloudflare.com/code-mode-mcp/">Code Mode: give agents an entire API in 1,000 tokens</a></p></li><li><p><a href="https://developers.cloudflare.com/cloudflare-one/access-controls/ai-controls/mcp-portals/">MCP server portals &#8212; Cloudflare One docs</a></p></li><li><p><a href="https://developers.cloudflare.com/changelog/post/2026-03-20-managed-oauth/">Managed OAuth for Cloudflare Access &#8212; Changelog</a></p></li><li><p><a href="https://www.cloudflare.com/press/press-releases/2026/cloudflare-launches-mesh-to-secure-the-ai-agent-lifecycle/">Cloudflare Launches Mesh to Secure the AI Agent Lifecycle &#8212; press release</a></p></li><li><p><a 
href="https://www.cloudflare.com/agents-week/">Agents Week 2026</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Cloudflare Agent: Day 2]]></title><description><![CDATA[Cloudflare's Agents Week 2026 &#8212; Sandboxes go GA, a new CLI, zero-trust auth, and Durable Object Facets]]></description><link>https://codifyingintelligence.substack.com/p/cloudflare-agent-day-2</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/cloudflare-agent-day-2</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Tue, 14 Apr 2026 00:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZRLT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75852fd4-e6d9-4211-8789-1310dbca8416_608x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IuZs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IuZs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 424w, https://substackcdn.com/image/fetch/$s_!IuZs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 848w, https://substackcdn.com/image/fetch/$s_!IuZs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IuZs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IuZs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png" width="1456" height="291" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:291,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8024563,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://codifyingintelligence.substack.com/i/194132449?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IuZs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 424w, https://substackcdn.com/image/fetch/$s_!IuZs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 848w, 
https://substackcdn.com/image/fetch/$s_!IuZs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 1272w, https://substackcdn.com/image/fetch/$s_!IuZs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441a2673-89c6-4403-bf0f-e125a6ec10bb_4640x928.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p></p><p>Cloudflare kicked off <strong>Agents Week 2026</strong> with a clear thesis: the Internet &#8212; and the cloud infrastructure underpinning it &#8212; wasn't built for the age of AI agents. Day 2 delivered a batch of announcements that back that thesis up with real product. Here's what dropped.</p><div><hr></div><h2>The Big Picture: Why Agents Week Exists</h2><p>The opening post from Rita Kozlov and Dane Knecht frames the entire week. The core argument is that the traditional cloud model &#8212; one application serving many users, scaled horizontally via containers &#8212; breaks down when every user gets their own agent running its own unique task.</p><p>Unlike a conventional web app that follows the same code path for every request, an agent is a one-to-one relationship: one user, one agent, one task. Each agent needs its own execution environment where the LLM dictates the code path, calls tools dynamically, and persists state until the job is done.</p><p>The math is striking. If even 15% of the 100+ million US knowledge workers ran a single agent concurrently, you'd need capacity for ~24 million simultaneous sessions &#8212; and that's before people start running multiple agents in parallel. 
Cloudflare's bet is that V8 isolates (the foundation of Workers) are orders of magnitude more efficient than containers for this kind of workload, and the rest of this week's announcements build on that foundation.</p><div><hr></div><h2>1. Sandboxes Are Now Generally Available</h2><p><strong>The headline announcement.</strong> Cloudflare Sandboxes &#8212; persistent, isolated environments powered by Cloudflare Containers &#8212; are now GA. Originally launched last June, Sandboxes give AI agents something approximating a real computer: a shell, a filesystem, background processes, the ability to clone repos and run dev servers.</p><p><strong>What's new since launch:</strong></p><ul><li><p><strong>Secure credential injection</strong> &#8212; Credentials are injected at the network layer via a programmable egress proxy. The agent never touches raw tokens.</p></li><li><p><strong>PTY support</strong> &#8212; Real pseudo-terminal sessions over WebSocket, compatible with xterm.js. Agents (and humans debugging them) get a proper terminal, not a request-response loop pretending to be one.</p></li><li><p><strong>Persistent code interpreters</strong> &#8212; Stateful Python, JavaScript, and TypeScript execution out of the box. Variables, imports, and state carry across calls &#8212; unlike throwaway interpreters.</p></li><li><p><strong>Background processes &amp; live preview URLs</strong> &#8212; Run dev servers inside sandboxes and interact with them via preview URLs.</p></li><li><p><strong>Filesystem watching</strong> &#8212; Faster iteration loops when agents modify files.</p></li><li><p><strong>Snapshots</strong> &#8212; Quickly save and restore an agent's coding session.</p></li><li><p><strong>Active CPU Pricing</strong> &#8212; Pay only for CPU cycles actually consumed, not idle time. 
Critical for deploying fleets of bursty agent sessions.</p></li></ul><p>Figma is already running agents in Cloudflare Containers for Figma Make, their tool for going from idea to production.</p><div><hr></div><h2>2. Dynamic, Identity-Aware Sandbox Auth</h2><p>A companion post dives deep into one of the hardest problems in agentic workloads: authentication. The core tension is that agents need to access external services, but you can't trust them with raw credentials.</p><p>Cloudflare's solution is <strong>Outbound Workers for Sandboxes</strong> &#8212; a programmable, zero-trust egress proxy. Every HTTP request leaving a sandbox passes through a Worker you control, where you can inject credentials, enforce policies, log activity, or block requests entirely &#8212; all without the agent ever seeing a token.</p><p><strong>Why this matters over alternatives:</strong></p><ul><li><p><strong>Standard API tokens</strong> are simple but risky &#8212; a compromised sandbox leaks the token.</p></li><li><p><strong>Workload identity tokens (OIDC)</strong> are more secure but inflexible &#8212; many upstream services lack first-class OIDC support.</p></li><li><p><strong>Custom proxies</strong> offer maximum control but are complex to build and operate.</p></li></ul><p>Outbound Workers combine the flexibility of a custom proxy with the simplicity of a few lines of JavaScript. The proxy runs on the same machine as the sandbox, so latency is negligible. You can scope rules per-host, per-sandbox, and change them dynamically. It's zero-trust by default &#8212; no token ever enters the untrusted sandbox environment.</p><div><hr></div><h2>3. A New CLI for All of Cloudflare</h2><p>Cloudflare has nearly 3,000 HTTP API operations spread across 100+ products, but the existing CLI (Wrangler) only covers a fraction of them. Agents, it turns out, love CLIs &#8212; and they expect consistency.</p><p>Enter <strong><code>cf</code></strong>, a new unified CLI available as a Technical Preview. 
Run <code>npx cf</code> to try it today.</p><p><strong>The key design decisions:</strong></p><ul><li><p><strong>Generated from a single source of truth.</strong> Cloudflare built a new TypeScript schema layer that describes the full scope of APIs, CLI commands, configuration, bindings, and more. From this schema they generate the CLI, SDKs, Terraform provider, OpenAPI specs, and even MCP servers &#8212; ensuring consistency across every interface.</p></li><li><p><strong>Consistency is enforced, not hoped for.</strong> It's always <code>get</code>, never <code>info</code>. Always <code>--force</code>, never <code>--skip-confirmations</code>. Always <code>--json</code>. These conventions are baked into the schema with linting and guardrails.</p></li><li><p><strong>Context engineering for agents.</strong> Output clearly signals whether commands target local or remote resources. This prevents a class of bugs where an agent thinks it's writing to a remote database but is actually hitting a local simulator.</p></li></ul><p>The post also introduces <strong>Local Explorer</strong>, a tool for debugging local data (D1 databases, KV namespaces, R2 buckets) used in local development. The current Technical Preview covers a small subset of products, with full API coverage coming over the next few months.</p><div><hr></div><h2>4. Durable Object Facets for Dynamic Workers</h2><p>This is the most architecturally interesting announcement. Dynamic Workers &#8212; announced a few weeks ago &#8212; let you load Worker code on-the-fly into a secure sandbox using V8 isolates (100x faster startup than containers, 1/10th the memory). 
But until now, dynamic code couldn't have persistent state.</p><p><strong>Durable Object Facets</strong> solve this by letting Dynamic Workers instantiate Durable Objects with their own isolated SQLite databases.</p><p><strong>How it works:</strong></p><ol><li><p>You write a "supervisor" Durable Object class that you deploy normally.</p></li><li><p>When a request comes in, the supervisor loads the agent's code as a Dynamic Worker.</p></li><li><p>The Dynamic Worker can export its own <code>DurableObject</code> class.</p></li><li><p>That class gets instantiated as a "facet" &#8212; a child of the supervisor's Durable Object &#8212; with its own separate SQLite database.</p></li><li><p>The supervisor controls lifecycle, enforces limits, handles billing, adds observability &#8212; all while the agent's code runs in a secure sandbox with real persistent storage.</p></li></ol><p>This is purpose-built for platforms where AI generates small applications on the fly &#8212; think vibe-coded personal tools, custom UIs, or one-off data apps. Each generated app gets its own database, its own state, its own isolated execution &#8212; and the platform operator retains full control.</p><p>Storage access is local-disk fast (zero network latency) because Durable Object storage lives on the same machine as the instance.</p><div><hr></div><h2>The Through-Line</h2><p>All four announcements share the same architectural philosophy: <strong>give agents real computing primitives &#8212; shells, filesystems, databases, network access &#8212; but never give up control.</strong> Every layer has a trust boundary. Sandboxes isolate execution. Outbound Workers gate network access. Facets scope storage. The CLI enforces consistency for both humans and agents.</p><p>Cloudflare is betting that the agentic era needs infrastructure that's fundamentally different from the container-and-orchestrator model that dominated the smartphone era. 
Day 2 of Agents Week makes that bet concrete.</p><p>More announcements coming throughout the week &#8212; I'll keep covering them as they land.</p>]]></content:encoded></item><item><title><![CDATA[English Is the New Programming Language & Markdown is the New SQL]]></title><description><![CDATA[The interface between human and machine shifted from code to language. What that means for engineers, and why I started writing about it.]]></description><link>https://codifyingintelligence.substack.com/p/english-is-the-new-programming-language</link><guid isPermaLink="false">https://codifyingintelligence.substack.com/p/english-is-the-new-programming-language</guid><dc:creator><![CDATA[buooy]]></dc:creator><pubDate>Mon, 13 Apr 2026 12:59:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZRLT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75852fd4-e6d9-4211-8789-1310dbca8416_608x608.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>That sounds hyperbolic. But I think it captures something real about where product building and engineering are heading right now.</p><p>Since November 2025, everything has changed. I could reliably describe a feature to Claude in three paragraphs of plain English and watch it produce a working component that would have taken me a full afternoon to write by hand. No syntax errors to chase. No Stack Overflow tabs. Just explaining what I wanted, clearly enough that a model could build it. I sat there for a moment thinking: the bottleneck wasn't code. It was how well I communicated my intent.</p><p>For decades, that bottleneck was syntax. You had to translate intent into a language a machine could parse. Python, TypeScript, SQL. Precise grammars. Deterministic outputs. 
You wrote <code>SELECT * FROM users WHERE active = true</code> and you got back exactly what you asked for, every time.</p><p>Now you write "get me all the active users" and a model figures it out. The interface between human and machine shifted from formal language to natural language. The abstraction layer moved up. Way up.</p><p>What is glaringly ironic to me is that many of us chose engineering precisely because it was precise. There was a certain comfort in determinism. You write the code, you run the tests, you get the result. The system behaves the way you told it to. If it doesn't, there's a bug, and bugs have causes, and causes can be found.</p><p>Now we are communicating imprecise thoughts to try and achieve precise outcomes. We are prompting, not programming. We are negotiating with probability distributions instead of executing instruction sets.</p><p>That's a strange new world. And honestly? I find it exciting.</p><h2>The engineering job was never just writing code</h2><p>But here's the thing I keep coming back to.</p><p>The job of an engineer was never just writing code. Just like the job of a pilot was never just flying a plane. A pilot's job is to get people safely from point A to point B. Flying the aircraft is the mechanism, not the mission.</p><p>Engineering is the same. The job is to bring value to a business or mission through technical means. Code was the mechanism. If the mechanism evolves, then we evolve with it. That's not a threat to the profession. That's the profession working as intended.</p><h2>The two engineering skills that matter most now</h2><p>If you strip away the specific languages, frameworks, and tools, and you look at what actually separates engineers who build things that matter from engineers who just ship features, it comes down to two things:</p><p><strong>Systems thinking and design.</strong> The ability to see how parts interact. To understand that a change here creates pressure there. 
To model complexity before you build it, and to know which complexity is necessary and which is accidental. This is the skill that lets you architect something that holds together at scale, whether you wrote the code by hand or <a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview">prompted an AI to generate it</a>.</p><p><strong>Communication.</strong> The ability to articulate what you want, why it matters, and how it should work. This was always important. It's how you align a team, convince a stakeholder, write a clear spec. But now it is also how you interface with the tools themselves. The quality of your output is directly proportional to the quality of your input, and your input is words. Natural language. English.</p><p>Systems thinking and communication. If those are the two load-bearing skills, then the engineers who invest in both will build circles around the ones who only invest in one.</p><h2>What this shift means for me personally</h2><p>At heart, I'm an engineer, a product person, and a builder. I love systems. I'm good at systems. Give me a complex problem with multiple moving parts and I will happily spend hours mapping out how they fit together.</p><p>But through my career, one area I have consistently struggled with has been writing and communicating, especially in long form. Short Slack messages, fine. A quick technical spec, sure. But sitting down to articulate a broader idea, to develop a thought across multiple paragraphs, to write something that someone would actually want to read... that has always been the muscle I neglected.</p><p>And now, suddenly, that muscle is the one that matters most.</p><p>The good news? Muscles grow. And for the first time in my career, the incentives to grow this one are perfectly aligned. The shift towards natural language as the engineering interface means that getting better at writing makes me better at building. 
That feels like a gift, honestly.</p><p>So like any evolving engineer confronting a gap in their capabilities, I'm looking at this shift as a skills issue. Pun intended.</p><h2>Why I started this Substack</h2><p>I started this Substack to learn. To force myself to practice the thing I'm weakest at, in public, where the stakes are real.</p><p>I started it to grow. Writing is thinking. If I can't write clearly about an idea, I probably don't understand it as well as I think I do. The act of publishing is a forcing function for clarity.</p><p>I started it to share. There's a lot of noise right now about AI and the future of building. I want to add signal, not noise. Honest takes from someone who's in the middle of it, building products, shipping code, and figuring it out in real time.</p><p>But mostly, I started it to find other people who feel what I feel. That we are standing at the edge of what is arguably the most fundamental industrial and cultural shift we have yet to see. Not the biggest one so far. The biggest one <em>yet</em>. And the people who engage with it now, who build the muscle and the instinct and the community, are the ones who will shape what comes next.</p><p>We're not losing engineering. We're gaining a new version of it. One where the ability to think in systems and communicate with clarity unlocks more than any single programming language ever could.</p><p>If that sounds like something worth exploring together, I'm glad you're here. Let's build.</p>]]></content:encoded></item></channel></rss>