<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Butter Blog]]></title><description><![CDATA[Butter Blog]]></description><link>https://blog.butter.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1759297144070/538b4434-ea00-445a-bd30-09fdfd0b8ad7.png</url><title>Butter Blog</title><link>https://blog.butter.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 07:06:24 GMT</lastBuildDate><atom:link href="https://blog.butter.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[On Automatic Template Induction for Response Caching]]></title><description><![CDATA[As of last week, Butter’s proxy now offers automatic template induction for its response cache! We’ve prepared the following blog post to help explain its significance and potential to help you serve more LLM responses from cache. You can also read t...]]></description><link>https://blog.butter.dev/on-automatic-template-induction-for-response-caching</link><guid isPermaLink="true">https://blog.butter.dev/on-automatic-template-induction-for-response-caching</guid><category><![CDATA[templates]]></category><category><![CDATA[caching]]></category><category><![CDATA[agents]]></category><category><![CDATA[proxy]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Wed, 07 Jan 2026 04:51:07 GMT</pubDate><content:encoded><![CDATA[<p>As of last week, Butter’s proxy now offers automatic template induction for its response cache! We’ve prepared the following blog post to help explain its significance and potential to help you serve more LLM responses from cache. 
You can also read through our <a target="_blank" href="https://docs.butter.dev/concepts/template-induction">documentation on template-induction</a>.</p>
<hr />
<p><a target="_blank" href="https://butter.dev">Butter</a> is a cache for LLM responses, sitting as an HTTP proxy between clients and LLM inference endpoints.</p>
<p>One of Butter’s central goals is to develop a system of serving LLM responses from cache in a way that is:</p>
<ul>
<li><p><em>Fast</em> at the time of request,</p>
</li>
<li><p><em>Accurate</em> enough to avoid both false positives and false negatives, and</p>
</li>
<li><p><em>Powerful</em> enough to achieve a high cache hit rate.</p>
</li>
</ul>
<p>Our main strategy for doing so is via <em>template-aware response caching</em>, something we discussed in <a target="_blank" href="https://blog.butter.dev/template-aware-caching">an earlier post</a>. We’ll cover it again here, as well as talk about the challenges involved in automating it.</p>
<p>Rather than storing user-agent messages verbatim, a template-aware response cache stores templated messages, or <em>templates</em>. This allows messages in the cache to generalize, thanks to the introduction of variable placeholders.</p>
<p>A message is considered an instance of a template if that template could be populated (that is, <em>hydrated</em>) according to some <em>bindings</em> which specify the hard values to substitute for each template variable.</p>
<p>See the figure below for a simple example of inducing a template and bindings from a query.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760564756/261f4b0f-42e1-4c3f-97f3-599896ad64c1.png" alt class="image--center mx-auto" /></p>
<p>Templates lend themselves to expanding the reach of Butter’s cache. Currently, when Butter receives a new query, it attempts to match that query to all the available templates by direct, syntactic comparison (i.e., by comparing against an appropriate regex pattern).</p>
<p>Below, we illustrate how this syntactic comparison works on a followup query which matches the template we induced in the previous figure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760572796/fceed433-0d95-4475-b71f-bdec8c1c291f.png" alt class="image--center mx-auto" /></p>
<p>Importantly, this template-matching algorithm is both deterministic and syntactic, avoiding any extra language-model calls at request time. We continue to build powerful templating and data-transformation tools which ensure that users’ request-time hotpaths are served truly deterministically and LLM-free.</p>
<h2 id="heading-butters-cache-is-a-tree">Butter’s Cache is a Tree</h2>
<p>We should take a moment to clarify how Butter organizes its response cache (and hence, how it serves from cache). For simplicity, we’ll treat all messages in a user-agent interaction as if they come either from the user or from the agent, ignoring tool calls and system prompts.</p>
<p>Butter’s caching is tailored to the turn-based structure of user-agent interactions: user queries are met with assistant responses, constituting a single “turn” of the interaction. Models like GPT-4o or Sonnet 4.5 are technically <em>stateless</em> between requests. So, in order for second, third, or later turns to operate under the context of prior turns, the user must pass this prior context along with their latest request, all appended together in chronological order. We’ll call this style of managing context <em>append-only</em>.</p>
<p>Append-only context management is the de-facto default for facilitating messages between users, agents, and tools, and is expected in the now-standardized OpenAI Chat Completions format. Note that newer APIs such as Responses handle appending for you.</p>
<p>Under the append-only context management style, Butter’s cache may be thought of as a tree: each node in the tree is a new message in the thread; and distinct branches may spawn from the same node whenever Butter encounters distinct ways of continuing on from a shared context.</p>
<p>This helps us understand what happens when Butter “compares to cache”: it begins at the top level of the tree and looks for a template matching the first message in the context. It then seeks a child template of that node matching the second message, and so on. This process ends either when a matching child cannot be found or once the context has been exhausted. If Butter indeed has the full context stored in the tree, it is ready to serve the corresponding response from cache.</p>
<p>Below is an example of how Butter compares an incoming query with context to its own cache. In this case, we observe a full cache hit.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760584121/8f110c0a-9ac7-4d7c-a9ee-d28d533196b6.png" alt class="image--center mx-auto" /></p>
<p>Beyond the placeholders, the contents of a template essentially capture the structural content of a query: the tokens that inform the model <em>how</em> to respond to the query. Under template-aware response caching, differences in structural content are exactly what trigger branching in Butter’s tree. And so, diverging paths can really be interpreted as distinct workflows followed by agents. Extracting the structural content of a query is thus central to Butter’s caching process. Let us now see what stands in the way of doing this.</p>
<h2 id="heading-noise">Noise</h2>
<p>LLM-based agents operate in messy environments. Their context may get cluttered with extraneous information or artifacts which serve no purpose in performing tasks. We’ll call these artifacts <em>noise</em>.</p>
<p><strong>Noisy Query:</strong></p>
<pre><code class="lang-plaintext">&lt;?xml version="1.0"?&gt; &lt;div&gt;&lt;/div&gt;

&lt;!-- output --&gt;

[2024-10-17 14:32:01] User logged in

====================

[AD] Special offer! &lt;div&gt;&lt;/div&gt;
</code></pre>
<p>Noisy environments pose a real obstacle to cache-based responding. If Butter caches a noisy query, there is little chance that a later instance of that query will contain exactly the same noise as the former, impeding Butter’s ability to recognize it as a cache hit.</p>
<p>Ideally, Butter’s cache stores the “ideal” (i.e., not noisy) version of the query, and any incoming query gets “de-noised” before getting compared to Butter’s cache.</p>
<p><strong>Possible De-Noised Query:</strong></p>
<pre><code class="lang-plaintext">[2024-10-17 14:32:01] User logged in

Special offer!
</code></pre>
<p>One could identify certain punctuation, whitespace, or HTML tags as noisy and detect them by purely syntactic means, e.g., by constructing an appropriate regex pattern and filtering all instances of these tokens from the query. Such noise detectors let us <em>syntactically</em> filter out noise present in queries.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example: filter out any instance of the HTML tag &lt;!DOCTYPE ...&gt;</span>
<span class="hljs-keyword">import</span> re

noisy_text = <span class="hljs-string">"&lt;!DOCTYPE html&gt;&lt;html&gt;&lt;body&gt;Hello&lt;/body&gt;&lt;/html&gt;"</span>

regex_pattern = <span class="hljs-string">r'&lt;!DOCTYPE[^&gt;]*&gt;'</span>
denoised_text = re.sub(regex_pattern, <span class="hljs-string">""</span>, noisy_text)

<span class="hljs-keyword">assert</span> denoised_text == <span class="hljs-string">"&lt;html&gt;&lt;body&gt;Hello&lt;/body&gt;&lt;/html&gt;"</span>
</code></pre>
<p>Ultimately, the de-noised query is the form in which Butter prepares queries for comparison with (and storage into) its cache.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760642880/c7a17b51-f79f-43da-b9a2-b27d846c469e.png" alt class="image--center mx-auto" /></p>
<p>We can now show exactly how Butter serves from cache: once we find a template which matches the (de-noised) query, Butter syntactically deduces the bindings that would make this template hydrate to the query. Then, it may use those deduced bindings to hydrate the cached response template, yielding a full response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760648939/f33ae6fa-8e0b-42cc-a12e-faee8987cfcf.png" alt class="image--center mx-auto" /></p>
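<p>The final hydration step reduces to simple substitution. A minimal sketch, assuming the bindings have already been deduced as literal values (the <code>hydrate</code> helper is hypothetical):</p>

```python
import re

def hydrate(template: str, bindings: dict[str, str]) -> str:
    """Replace each {{var}} placeholder with its bound literal value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: bindings[m.group(1)], template)

# Bindings deduced by matching the de-noised query against its template:
bindings = {"name": "Erik", "email": "erik@butter.dev"}
response = hydrate("Done! I sent the message to {{name}} at {{email}}.", bindings)
assert response == "Done! I sent the message to Erik at erik@butter.dev."
```

<p>Because the same bindings hydrate both the query template and the response template, a single syntactic match is enough to produce a full response.</p>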
<p>Understanding noise to be any information which is not relevant to completing a task, we should point out that noise may be context-dependent. Exhibit A: <strong>timestamps</strong>.</p>
<p>Suppose a computer-use agent includes timestamps with every browser event or interaction it witnesses. Many of the workflows developed by the agent will not be time-dependent: the same “Submit” button that was clickable on <code>2025-11-10T11:34:00</code> should still be present on <code>2025-11-10T11:34:01</code>. So, these timestamps included in the response only muddy the context and prevent the corresponding cache entry from applying to future instances of the same workflow.</p>
<p>However, some other timestamps <em>could</em> indeed be relevant to the agent’s task. Any workflow that distinguishes between a weekday and the weekend, or between morning and night, will require some type of timestamp in order to proceed in its flow logic. Therefore, it is not appropriate to syntactically filter out timestamps indiscriminately.</p>
<p>Context-dependent noise (otherwise called <em>semantic noise</em>) requires more sophisticated methods to detect and discern. So, for now at Butter, we choose <em>not</em> to apply any syntactic de-noising, and defer the job of semantic de-noising to <a target="_blank" href="#heading-variable-induction"><strong>Variable Induction</strong></a>, below.</p>
<h2 id="heading-template-induction">Template Induction</h2>
<p>We imagine every user query as begging for an LLM response. A proper response should make use of any relevant information found in that query (as well as in any previous context). Surely, not <em>all</em> content in the query is guaranteed to be relevant to responding. Templates should be robust to any of this irrelevant content. Moreover, some content in the query might indeed be relevant to <em>answering</em> the query, but not relevant towards deciding <em>how</em> to produce an answer to the query.</p>
<p>With this in mind, before Butter may add a message to the cache, we ask that it split the message into structural content (the template) and dynamic content (the bound variables). We sometimes describe this as <em>separating data from code</em>, or more concretely as performing <em>template induction</em>. [See here for <a target="_blank" href="https://bramble-recess-9ca.notion.site/why-induction?source=copy_link">Why “Induction?”</a>]</p>
<p>Any information which may be abstracted out and bound to a variable (without affecting the workflow’s logic) acts like <em>data</em> in the message, whereas the rest comprises the <em>code</em> of the message.</p>
<ul>
<li><p><strong>Data (dynamic content)</strong>: tokens which are essential towards building a response but not essential towards deciding an algorithm for generating the response.</p>
</li>
<li><p><strong>Code (structural content)</strong>: tokens which are essential for fixing a method for responding to the query.</p>
</li>
</ul>
<p>Entirely structural messages might look like:</p>
<pre><code class="lang-plaintext">Find the topmost element on the page and interact with it.
</code></pre>
<p>Templating parts of the above query wouldn’t make sense, since any changes would likely impact the workflow chosen by the agent for completing the task. For example, with a few swaps, the above command could have instead looked like: <code>Anthropomorphize the largest icon on the page and argue with it.</code></p>
<p>Whereas highly dynamic messages might look like:</p>
<pre><code class="lang-plaintext">Send Erik at erik@butter.dev the message: "Hey, nice to see you!"
</code></pre>
<p>Here, portions like <code>Erik</code> and <code>erik@butter.dev</code> and <code>”Hey, nice to see you!”</code> are mostly safe to templatize: most replacements preserve the same response method, namely attempting to send an email containing some contents to some address and named recipient.</p>
<p>Structural messages require no extra templating before getting stored into Butter’s cache. It is the dynamic content which needs to be detected and get bound to variables.</p>
<p>Thus, template induction boils down to variable induction.</p>
<h3 id="heading-variable-induction">Variable Induction</h3>
<p>We’ve just mentioned how variables are useful as placeholders for dynamic content in a message.</p>
<p>But variables also serve a second purpose: as placeholders for semantic noise. As a result, only some of the variables specified in the bindings may be useful for generating responses; the rest “mask out” any irrelevant information that can’t be screened syntactically from the message.</p>
<p>It is also possible that what registers as semantic noise at one turn may become relevant for responding in subsequent turns. So, it’s good that we keep semantic noise around in the bindings even if it isn’t useful presently.</p>
<p>Luckily, unlike in the case of syntactic de-noising, we can afford to employ some (slower) semantic analyses when inferring the variables of a message. This is because the caching process happens asynchronously from request time:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767760925898/7ae085dc-b20d-48c8-8729-9c439320eabb.png" alt class="image--center mx-auto" /></p>
<p>Variable induction asks “What to include in the bindings?” We’ll see below how our desire to more generally respond from cache—as well as our need to hydrate templates into full responses—restrict how we may do this.</p>
<h3 id="heading-bindings">Bindings</h3>
<p>The bindings carry all of the variable assignments. Naïvely, we could hope that each unit of data be assigned to a variable via the form: <code>{{var}} ↦ literal</code>. But, consider the following example query:</p>
<pre><code class="lang-plaintext">Erik Dunteman works at Butter. Erik loves to code and to cook with butter.
</code></pre>
<p>The naïve approach might produce separate bindings for each piece of dynamic content:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><code>full_name</code></td><td>Erik Dunteman</td></tr>
</thead>
<tbody>
<tr>
<td><code>company</code></td><td>Butter</td></tr>
<tr>
<td><code>first_name</code></td><td>Erik</td></tr>
<tr>
<td><code>activity</code></td><td>code</td></tr>
<tr>
<td><code>ingredient</code></td><td>butter</td></tr>
</tbody>
</table>
</div><p>But we know <code>full_name</code> and <code>first_name</code> are not independent. We could even <em>derive</em> one’s first name from their full name:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><code>full_name</code></td><td>Erik Dunteman</td></tr>
</thead>
<tbody>
<tr>
<td><code>company</code></td><td>Butter</td></tr>
<tr>
<td><code>first_name</code></td><td><code>full_name.split()[0]</code></td></tr>
<tr>
<td><code>activity</code></td><td>code</td></tr>
<tr>
<td><code>ingredient</code></td><td>butter</td></tr>
</tbody>
</table>
</div><p>That is, variables can be (syntactically and/or semantically) related. (But it’s not always obvious! Consider that the company “Butter” and the ingredient “butter” are nearly identical strings but semantically independent).</p>
<p>It is vital that we track these interdependencies, especially when a user or agent quietly applies a transformation to existing data in their message. For example:</p>
<pre><code class="lang-plaintext">**User**: On what day of the week did the 1900s start?

**Agent**: The first day of the 20th century was a Monday.
</code></pre>
<p>While the user asked about “the 1900s,” the agent distinctly referenced the “20th century.” If we wished to treat the century as dynamic content in this exchange, we would have to explain how “20th” derives from “1900s.” Otherwise, the workflow would fail to properly generalize to other time periods.</p>
<p>So, our bindings might store not only literal assignments to named variables; they may also contain code specifying how to derive a variable’s value from other variables’ values. In practice, we make use of a coding agent to generate the code for all such derivations, and verify/sandbox that code appropriately. Importantly, a derivation must return a literal value given literal assignments for all its arguments.</p>
<p>For example, consider the following prompt, from which we have inferred some dynamic content.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767761053444/ef80c2eb-8221-43dd-ad64-df375c220c8a.png" alt class="image--center mx-auto" /></p>
<p>We might propose a few inter-variable derivations where appropriate. For simplicity, we show only the function signatures of the proposed derivations below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767761060363/e44cabea-b86b-4f6b-8245-f501fda62bda.png" alt class="image--center mx-auto" /></p>
<p>In order for this system of interdependent derivations to always be resolvable, the bindings must form a <em>DAG</em> (directed acyclic graph), since no variable should have a derivation implicitly depending on itself! The nodes of the bindings DAG consist of all the bound variables, and a node <code>x</code> connects to another node <code>y</code> whenever the variable corresponding to <code>x</code> is used in the derivation of the variable corresponding to <code>y</code>. In our example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767761072556/1f5875d7-6e7e-47b7-800a-83dfa2ab886a.png" alt class="image--center mx-auto" /></p>
<p>Any finite DAG will possess at least one root node (i.e., a node having no “incoming” arrows), meaning our bindings should always have at least one variable which does not depend on any other variables in order to be hydrated. So-called <em>independent variables</em> are necessarily bound to literals: <code>{{independent var}} ↦ literal</code>. The remaining <em>dependent variables</em> depend on other variables in order to be hydrated, and are thus bound to derivations treating those free variables as arguments.</p>
<p>Given a bindings DAG, it is straightforward to hydrate a template making use of variables from those bindings: we use topological sort to fix a <em>hydration order</em> for the variables in the bindings, starting with the independent variables, then any variables depending only on those independent ones, and so on. We hydrate all the variables to literal values, and then populate those values into the template.</p>
<h2 id="heading-automating-template-induction">Automating Template Induction</h2>
<p>Recall that whenever Butter observes novel contexts, it forwards the request along to the provider, and then asynchronously attempts to add the query-response pair as templates into the cache:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767761131910/37cb7a39-3c48-4630-b2ec-f31e353e1741.png" alt class="image--center mx-auto" /></p>
<p>Our job at Butter is to automate the template induction process so we may respond to a variety of requests from cache. What might this take?</p>
<p>Separating data from code appears infeasible by purely deterministic means. In general, queries specified in natural language require some level of intelligence or high-level reasoning in order to be templated. [See my <a target="_blank" href="https://blog.butter.dev/template-aware-caching">previous post</a> for some examples.]</p>
<p>That is, we expect automatic template induction to be a harder task than simply implementing a syntactic pattern matcher, grammar constructor, or text-embedding comparison.</p>
<p>In particular, we should have a robust system for extracting variable data from messages.</p>
<h3 id="heading-possible-induction-algorithm">Possible Induction Algorithm</h3>
<p>For instance, one could break this task down into the following algorithm:</p>
<p><strong>Algorithm for Inducing Variables</strong>:</p>
<ol>
<li><p><strong>Set up Bindings</strong> [<em>deterministic</em>]: Inherit any bindings that may have been inferred from messages earlier in the query’s context. Otherwise, start with empty bindings.</p>
</li>
<li><p><strong>Identify Dynamic Content</strong> [<em>classification task</em>]: Identify any substrings of this message which should qualify as dynamic content (either as data or as semantic noise).</p>
</li>
<li><p><strong>Label</strong> <strong>Dynamic Content</strong> [<em>naming task</em>]: Propose semantically-relevant variable names for these substrings (consistent with any inherited variable names).</p>
</li>
<li><p><strong>Arrange all Variables</strong> [<em>reasoning task</em>]: Fixing all inherited bindings as independent/literally bound, incorporate any newly-induced variables to arrange all variables into a DAG structure.</p>
</li>
<li><p><strong>Derive each Variable</strong> [<em>code-gen task</em>]: For each dependent variable in the DAG, propose the code relevant to deriving that variable from its arguments.</p>
</li>
</ol>
<p>Each of the above steps involving intelligence (i.e., Identify, Label, Arrange, and Derive) will require slightly different skills, and hence could be handled by distinct models and approaches. We might set up specialized agents called the <strong>Identifier</strong>, <strong>Namer</strong>, <strong>Arranger</strong>, and <strong>Deriver</strong>, to accomplish each step, respectively.</p>
<p>One of our priorities is ensuring that such a pipeline for performing variable induction is not prohibitively expensive: e.g., minimizing the cost and compute required for any LLM call we make.</p>
<h3 id="heading-worked-examples">Worked Examples</h3>
<p>In the following demos, I show off some cute examples of Butter’s automatic template induction at work. It cleanly performs one-shot generalization for messages involving arithmetic, string manipulation, and form parsing.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/bMppsIgOc8U">https://youtu.be/bMppsIgOc8U</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=ORDfPnk9rCA">https://www.youtube.com/watch?v=ORDfPnk9rCA</a></div>
<p> </p>
<h3 id="heading-possible-tweaks">Possible Tweaks</h3>
<p>The version of automatic template induction we’ve developed in this blog post insists on inducing templates <em>for every novel query-response pair</em>. Should this prove too expensive, we might still deploy a modified, few-shot version of template induction more cheaply. That is, Butter may begin by directly caching exact messages throughout its cache. Then, there might come a time at which Butter decides it is worthwhile to merge many children under a common node into one or more templates. Merges would attempt to induce generic templates from several examples of query-response pairs, and might be triggered by reaching a critical number of children under a single node, or by judging the similarity between the existing examples using lightweight language models or other inexpensive methods.</p>
<p>We may further save on costs by taking advantage of <a target="_blank" href="https://platform.openai.com/docs/guides/prompt-caching">Prompt Caching</a> when designing the prompts used by the various semantic agents described above. That would involve front-loading their contexts with all the instructions that appear consistently across runs, and leaving the rest until the end of their prompts.</p>
<hr />
<p>We continue to shed weight from and calibrate our template induction pipeline. Expect to see more from us as we observe how much it benefits our users’ cache hit rates.</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0009]]></title><description><![CDATA[Happy Friday!
Nothing user-facing to report in this week’s changelog, so hang tight.
We continue to invest in internal tooling: evals, infra rewrite, and prepping last week’s automatic template induction POC for production.
For fun, here’s a shout-ou...]]></description><link>https://blog.butter.dev/changelog-0009</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0009</guid><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[llm-agents]]></category><category><![CDATA[caching]]></category><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Sat, 13 Dec 2025 05:14:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765602703481/bc597a5d-13cb-432e-8127-2203195d501a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Happy Friday!</p>
<p>Nothing user-facing to report in this week’s changelog, so hang tight.</p>
<p>We continue to invest in internal tooling: evals, infra rewrite, and prepping last week’s <a target="_blank" href="https://blog.butter.dev/changelog-0008#heading-automatic-template-induction">automatic template induction POC</a> for production.</p>
<p>For fun, here’s a shout-out for some of the tools we’ve been using and loving:</p>
<ul>
<li><p><a target="_blank" href="http://bun.sh">Bun</a> for runtime and tests.</p>
</li>
<li><p><a target="_blank" href="http://e2b.dev">E2B</a> for TS sandboxing.</p>
</li>
<li><p><a target="_blank" href="http://braintrust.dev">Braintrust</a> for evals.</p>
</li>
<li><p><a target="_blank" href="https://ai-sdk.dev/docs/introduction">Vercel</a> AI SDK for structured generation.</p>
</li>
<li><p>And of course, <a target="_blank" href="http://butter.dev">Butter</a> for caching tests and evals - it’s really sped up our iteration cycles to have LLM requests return 20x faster.</p>
</li>
</ul>
<p>Stay tuned next week for more updates!</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0008]]></title><description><![CDATA[Welcome to Butter’s eighth changelog! Just like grandma on Thanksgiving morning, we’ve spent a whole lot of time cooking.
Starting with:
Gramma’s Recipe Book
As a Thanksgiving treat, we launched cookwithbutter.com, which uses Butter’s LLM response ca...]]></description><link>https://blog.butter.dev/changelog-0008</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0008</guid><category><![CDATA[General Programming]]></category><category><![CDATA[AI]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[llm]]></category><category><![CDATA[caching]]></category><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Sat, 06 Dec 2025 00:21:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764976339290/693097a6-517a-4d6d-82f2-c681c2f8b1f5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Butter’s eighth changelog! Just like grandma on Thanksgiving morning, we’ve spent a whole lot of time cooking.</p>
<p>Starting with:</p>
<h2 id="heading-grammas-recipe-book">Gramma’s Recipe Book</h2>
<p>As a Thanksgiving treat, we launched <a target="_blank" href="https://cookwithbutter.com">cookwithbutter.com</a>, which uses Butter’s LLM response caching to help generate, and cache, popular holiday cooking recipes.</p>
<p>Look up popular pre-computed recipes, or try your own! Gramma wants all your favorite recipes, from standard <a target="_blank" href="https://cookwithbutter.com/?input=stuffing">stuffing</a> to creative <a target="_blank" href="http://cookwithbutter.com/?input=butter%2C%20covered%20in%20butter%2C%20with%20a%20butter%20garnish">butter, covered in butter, with a butter garnish</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764978034633/49aee0b9-5ecf-4d7b-b10d-746a7d88ae6c.png" alt class="image--center mx-auto" /></p>
<p>The remainder of our time has been focused on internal tech improvements, which are still in-progress so we’ll only briefly highlight them below:</p>
<h2 id="heading-data-engine-rewrite">Data Engine Rewrite</h2>
<p>Part necessity and part premature optimization, we’ve been rewriting our data backend to structure the cache tree in a much more efficient, S3-centric way.</p>
<p>This project, once completed, will include:</p>
<ul>
<li><p>As low as 0 round trips to S3 for serving hot-path responses.</p>
</li>
<li><p>Truly stateless servers able to horizontally replicate and achieve high availability.</p>
</li>
<li><p>The ability to say we “rewrote in Rust” (meme).</p>
</li>
</ul>
<h2 id="heading-automatic-template-induction">Automatic Template Induction</h2>
<p>The goal of <a target="_blank" href="https://blog.butter.dev/template-aware-caching">template-aware caching</a> is to expand the generalizability of cache entries by converting literal text messages into a more powerful composition of templates and dynamic variables.</p>
<p>Currently, this is a manual process, where users must explicitly flag their dynamic content in the <a target="_blank" href="https://docs.butter.dev/concepts/bindings">butter-bindings</a> request headers in order for those variables to be stripped into template placeholders. Powerful, but cumbersome, especially when agent intent isn’t known at the time of programming.</p>
<p>Our goal is to make the process automatic, using hints such as attention values to map out which parts of the context window are noise (ignored), which are dynamic (templated), and which are structural (cached).</p>
<p>Two days ago, thanks to hard work from teammate Raymond, we’ve got our first proof-of-concept working end to end:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764979670492/88555229-d100-4977-ac13-5f2abbfecafa.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764979652084/5873d25b-cda5-4098-9a13-af2ad5c2dd21.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764979715689/36dcccd7-1852-4a5e-9edf-1b7b9b446cb4.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764979689783/817864d8-98c5-42ca-aa83-353a7be44559.png" alt class="image--center mx-auto" /></p>
<p>That’s it for this week! We’re excited to get the work above out in a more public form so you can see the magic for yourself.</p>
<p>Until then, keep on cooking!</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0007]]></title><description><![CDATA[Hi all, this week’s changelog is quick and simple, nothing user-visible to announce.
We’ve been making R&D progress towards better template-aware-caching, and infra progress rethinking the storage system to work with higher availability.
We’ve also p...]]></description><link>https://blog.butter.dev/changelog-0007</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0007</guid><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Sat, 22 Nov 2025 04:16:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763782480913/d2002b25-1393-4ba2-9e02-ab72f61d18df.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all, this week’s changelog is quick and simple, nothing user-visible to announce.</p>
<p>We’ve been making R&amp;D progress towards better <a target="_blank" href="https://blog.butter.dev/template-aware-caching">template-aware-caching</a>, and infra progress rethinking the storage system to work with higher availability.</p>
<p>We’ve also published a <a target="_blank" href="https://github.com/butter-dot-dev/legal/blob/main/TOS.md">Terms of Service</a> and <a target="_blank" href="https://github.com/butter-dot-dev/legal/blob/main/PrivacyPolicy.md">Privacy Policy</a>.</p>
<p>Stay tuned next week for more updates!</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0006]]></title><description><![CDATA[Welcome to Butter’s latest weekly changelog. Today’s log is light, as we strengthen our focus on the R&D-side of templated caching.
This week, we’ve also seen some more signups, which have brought about a higher volume of requests through Butter’s pr...]]></description><link>https://blog.butter.dev/changelog-0006</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0006</guid><category><![CDATA[proxy]]></category><category><![CDATA[app]]></category><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Sat, 15 Nov 2025 02:59:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763174459919/691a267a-b727-43f1-b318-d79bb5fdc34f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Butter’s latest weekly changelog. Today’s log is light, as we strengthen our focus on the R&amp;D-side of templated caching.</p>
<p>This week, we’ve also seen some more signups, which have brought about a higher volume of requests through Butter’s proxy and impressive cache hit rates!</p>
<h2 id="heading-unsupported-request-handling">Unsupported Request Handling</h2>
<ul>
<li><p>Butter now detects and tracks requests with unsupported content (e.g., images, audio, files)</p>
</li>
<li><p>Unsupported requests are forwarded directly to the provider (preventing any downtime)</p>
</li>
<li><p>Butter’s app transparently shows this with a “bypassed” notice, storing none of the original message content</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Changelog #0005]]></title><description><![CDATA[Welcome to Butter’s fifth weekly changelog. This week, we’ve seen a promising amount of new signups, as well as have tightened up our onboarding and helped users check Butter’s status.
Onboarding Experience

Cleaned up the onboarding experience we de...]]></description><link>https://blog.butter.dev/changelog-0005</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0005</guid><category><![CDATA[onboarding]]></category><category><![CDATA[Status]]></category><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Sat, 08 Nov 2025 00:32:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762557081862/57de0184-d578-4866-a3e4-dd61b9f3f5d7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Butter’s fifth weekly changelog. This week we’ve seen a promising number of new signups, tightened up our onboarding, and helped users check Butter’s status.</p>
<h2 id="heading-onboarding-experience">Onboarding Experience</h2>
<ul>
<li><p>Cleaned up the onboarding experience we designed last week to give better feedback.</p>
</li>
<li><p>Pointed more clearly to Butter’s documentation and app.</p>
</li>
</ul>
<h2 id="heading-status-page">Status Page</h2>
<ul>
<li>Set up a public-facing uptime monitor for Butter’s proxy at <a target="_blank" href="https://status.butter.dev">status.butter.dev</a>.</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Changelog #0004]]></title><description><![CDATA[Welcome to Butter’s fourth weekly changelog. This week we celebrate our launch, with a focus on the first time onboarding experience.
Better Onboarding Experience

Added an onboarding flow to guide first-time users through their first cache miss and ...]]></description><link>https://blog.butter.dev/changelog-0004</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0004</guid><category><![CDATA[onboarding]]></category><category><![CDATA[landing page]]></category><category><![CDATA[documentation]]></category><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Sat, 01 Nov 2025 01:09:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761943594347/9074e97b-35bb-42b9-8968-86ac797d7c91.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Butter’s fourth weekly changelog. This week we celebrate our launch, with a focus on the first time onboarding experience.</p>
<h2 id="heading-better-onboarding-experience">Better Onboarding Experience</h2>
<ul>
<li>Added an onboarding flow to guide first-time users through their first cache miss and cache hit</li>
</ul>
<h2 id="heading-new-landing-page">New Landing Page</h2>
<ul>
<li><p>Shipped our new landing page, live at <a target="_blank" href="http://butter.dev">butter.dev</a></p>
</li>
<li><p>Users will find their dashboard (previously at <a target="_blank" href="http://app.butter.dev">app.butter.dev</a>) now merged into the main <a target="_blank" href="http://butter.dev">butter.dev</a> site behind the <a target="_blank" href="http://butter.dev/auth">Login</a> button</p>
</li>
</ul>
<h2 id="heading-minor-ui-fixes">Minor UI Fixes</h2>
<ul>
<li><p>Made integration guides more obvious in our documentation</p>
</li>
<li><p>Improved scrolling through the Request preview</p>
</li>
<li><p>Fixed bug with Request pagination</p>
</li>
<li><p>Cleaned up pagination URL parameters</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Changelog #0003]]></title><description><![CDATA[Welcome to Butter’s third weekly changelog. This week we’ve been focused on content, documentation, and examples.
New Examples
Examples section added to Docs, highlighting how to repoint popular AI tools through the Butter proxy endpoint. These inclu...]]></description><link>https://blog.butter.dev/changelog-0003</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0003</guid><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Fri, 24 Oct 2025 19:00:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761330989332/1b88a163-5b62-48ca-992a-1082973cf008.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Butter’s third weekly changelog. This week we’ve been focused on content, documentation, and examples.</p>
<h2 id="heading-new-examples">New Examples</h2>
<p>Examples section added to <a target="_blank" href="https://docs.butter.dev">Docs</a>, highlighting how to repoint popular AI tools through the Butter proxy endpoint. These include:</p>
<ul>
<li><p>LLM Clients &amp; Agent Frameworks</p>
<ul>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/langchain">Langchain / Langgraph</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/vercel">Vercel AI SDK</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/mastra">Mastra</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/pydantic_ai">Pydantic AI</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/crew_ai">Crew AI</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/ai_suite">AI Suite</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/litellm#litellm-sdk">LiteLLM SDK</a></p>
</li>
</ul>
</li>
<li><p>Other HTTP proxies/gateways</p>
<ul>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/helicone">Helicone</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/litellm#litellm-proxy">LiteLLM</a></p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/martian">Martian</a></p>
</li>
</ul>
</li>
<li><p>Specialized Tools</p>
<ul>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/browser_use">Browser Use</a> (custom fork)</p>
</li>
<li><p><a target="_blank" href="https://docs.butter.dev/examples/dspy">DSPy</a></p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-bugfixes-amp-improvements">Bugfixes &amp; Improvements</h2>
<ul>
<li><p>Fixed a race condition that caused thrashing in the server’s disk cache</p>
</li>
<li><p>Added internal load tests, currently hitting ~120 rps</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Messy World of “Deterministic Agents”]]></title><description><![CDATA[With the daily use of “agentic” tools like Cursor and Claude Code, we’re all becoming deeply familiar with the feeling of working with autonomous agents, and increasingly aware of their shortcomings.
I’m here to talk about one of those shortcomings: ...]]></description><link>https://blog.butter.dev/the-messy-world-of-deterministic-agents</link><guid isPermaLink="true">https://blog.butter.dev/the-messy-world-of-deterministic-agents</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[agentic workflow]]></category><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Thu, 23 Oct 2025 20:26:26 GMT</pubDate><content:encoded><![CDATA[<p>With the daily use of “agentic” tools like <a target="_blank" href="https://cursor.com">Cursor</a> and <a target="_blank" href="https://www.claude.com/product/claude-code">Claude Code</a>, we’re all becoming deeply familiar with the feeling of working with autonomous agents, and increasingly aware of their shortcomings.</p>
<p>I’m here to talk about one of those shortcomings: <strong>non-determinism</strong>.</p>
<p>You’ve probably run into it yourself; it goes as follows:</p>
<ol>
<li><p>You have a relatively repeatable task to do.</p>
</li>
<li><p>The agent does it right the first time. Awesome!</p>
</li>
<li><p>You ask the agent to do that same task again, with slightly different input.</p>
</li>
<li><p>The agent chooses a dramatically different approach. Weird!</p>
</li>
</ol>
<p>This sense of randomness can be unsettling and leads to a <strong>lack of trust in agentic systems</strong>, a serious hurdle to clear before widespread adoption. We want to think of agents as employees we can delegate to, but <strong>employees learn skills</strong>, which builds trust, whereas agents do not. Yet.</p>
<h3 id="heading-what-well-be-covering">What we’ll be covering</h3>
<p>This blog is an overview of nine different attempts at introducing “determinism” (skills, trust, reliability, repeatability) into agents. These approaches range from nascent ideas and theoretical concepts to full products being used in the enterprise.</p>
<p>For each approach, I’ll highlight relevant products and research. We’ll walk through them from the highest abstraction down to the lowest.</p>
<p>Note that this coverage will include our own product <a target="_blank" href="https://butter.dev">Butter</a>, but it’s not my intention to shill our own thing or draw any “us-vs-them” comparisons. Those blogs always suck.</p>
<p>The goal is to inform, and paint a clear picture for the many ways this world could end up going.</p>
<p>Let’s start from the top!</p>
<h2 id="heading-workflow-builders">Workflow Builders</h2>
<p>Workflow Builders are Zapier-like canvases which allow technical and non-technical users to chain prebuilt integrations together, including LLM blocks for operations such as data transformation and classification-based routing.</p>
<p>They are <em>not</em> agents by the strict definition, but are embraced by the more cautious enterprise user due to their explainability and true determinism. No magic, just smarter data processing. If an agent block is included, it’s opt-in and generally well scoped.</p>
<p>These tools have exploded in usage, with <a target="_blank" href="https://n8n.io/">n8n</a> being a popular choice.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761249824044/c5190ba9-a77a-4266-9c19-b64ee61cef14.png" alt class="image--center mx-auto" /></p>
<p>And earlier this month, OpenAI launched their <a target="_blank" href="https://platform.openai.com/docs/guides/agents/agent-builder">Agent Builder</a> in the same category:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761249845666/deb03300-cf42-4299-98d7-bac9dbfaa0e1.png" alt class="image--center mx-auto" /></p>
<p>Following the OpenAI launch, many were disappointed to see a workflow builder use the “agent” name. Regardless of whether it’s truly an agent, users are happy, and workflow builders deserve their place in this blog as today’s most viable route to “determinism.”</p>
<p>You may have wondered why I said they’re not agents, which warrants a quick definition:</p>
<p><strong>An LLM agent runs tools in a loop to achieve a goal.</strong> – <a target="_blank" href="https://simonwillison.net/2025/Sep/18/agents/">Simon Willison</a></p>
<pre><code class="lang-python"><span class="hljs-keyword">while</span> task_incomplete:
    tool_choice = llm()
    do(tool_choice)
</code></pre>
<p>Under this definition, the control flow is dictated by the LLM, allowing agents to perform tasks they were never explicitly programmed to do.</p>
<p>All of the following approaches apply specifically to agents by this definition, exploring what it means to bring (pseudo)determinism into fundamentally random LLM branching decisions.</p>
<p>As we walk through them, remember:</p>
<ul>
<li><p>The architecture: there’s always a model, and it’s always choosing tools.</p>
</li>
<li><p>The goal: “determinism” is defined as the consistent reproduction of a trajectory of tool calls given a repeat task.</p>
</li>
</ul>
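<p>For concreteness, the definition’s loop can be fleshed out into a runnable sketch, where <code>llm()</code> and the tools are scripted stand-ins rather than real integrations:</p>

```python
# Minimal agent loop: the model picks the next tool until it signals done.
# scripted_llm and the tools are stand-ins, not real model/tool calls.
def run_agent(llm, tools, task):
    trajectory = []
    while True:
        tool_name, args = llm(task, trajectory)  # the LLM dictates control flow
        if tool_name == "done":
            return trajectory
        result = tools[tool_name](*args)         # execute the chosen tool
        trajectory.append((tool_name, args, result))


def scripted_llm(task, trajectory):
    # Stand-in "model": always takes the same two steps, then stops.
    plan = [("search", ("butter.dev",)),
            ("summarize", ("results",)),
            ("done", ())]
    return plan[len(trajectory)]


tools = {
    "search": lambda q: f"hits for {q}",
    "summarize": lambda x: f"summary of {x}",
}

trajectory = run_agent(scripted_llm, tools, "research task")
print(trajectory)
```

<p>Every branch decision in that loop is a fresh model call, which is where the randomness creeps in; the approaches below differ in where, and how aggressively, they replace those calls.</p>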
<h2 id="heading-context-engineering">Context Engineering</h2>
<p>Many view skills and repeatability as a <em>context</em> problem, focusing on the fact that an LLM performing an automation does not have the accumulated knowledge from prior runs.</p>
<p>So… some people simply put the successful run in the context.</p>
<p>Products in this camp still use LLMs at every turn, but they aim to increase reliability and skill retention by injecting additional content into the context window.</p>
<p>Dynamic context engineering is nothing new, tracing its lineage back to few-shot prompting (hardcoding examples into your prompt) and <a target="_blank" href="https://python.langchain.com/docs/concepts/rag/">RAG</a> (appending semantically relevant content into the prompt). Notable products in the space, dubbed “memory layers,” include <a target="_blank" href="https://github.com/mem0ai/mem0">mem0</a> and <a target="_blank" href="https://supermemory.ai/">Supermemory</a>.</p>
<p>The space has historically focused on user context, such as remembering birthdays, but applied to deterministic replay, a memory layer would be repurposed to store and retrieve content related to <strong>instructions &amp; behaviors,</strong> which may include:</p>
<ul>
<li><p>User preferences</p>
</li>
<li><p>User-written SOPs (standard operating procedures)</p>
</li>
<li><p>Recorded prior agent trajectories</p>
</li>
<li><p>An LLM-summarization of prior trajectories</p>
</li>
<li><p>Reasoning traces or summaries explaining why certain branches are taken.</p>
</li>
</ul>
<p>This does not force determinism by the strict definition, but it does <em>guide</em> the model.</p>
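<p>Mechanically, most of these approaches reduce to the same move: retrieve whatever stored material looks relevant and prepend it to the prompt. A toy sketch with naive keyword retrieval (illustrative only, not any particular product’s API):</p>

```python
# Naive memory layer: store prior run summaries, retrieve by keyword
# overlap, and inject the best match into the prompt as guidance.
class MemoryLayer:
    def __init__(self):
        self.memories = []  # (task, trajectory summary) pairs

    def store(self, task, summary):
        self.memories.append((task, summary))

    def retrieve(self, task):
        words = set(task.lower().split())
        scored = [(len(words.intersection(t.lower().split())), s)
                  for t, s in self.memories]
        best = max(scored, default=(0, None))
        return best[1] if best[0] else None


def build_prompt(memory, task):
    prior = memory.retrieve(task)
    guidance = f"On a similar task you did: {prior}\n" if prior else ""
    return guidance + f"Task: {task}"


mem = MemoryLayer()
mem.store("export weekly report", "opened dashboard, clicked Export, chose CSV")
print(build_prompt(mem, "export monthly report"))
```

<p>Swap the keyword overlap for embeddings, and the stored summaries for any of the bullet items above, and you have the rough shape of the memory-layer pattern.</p>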
<h3 id="heading-explicit-skills">Explicit Skills</h3>
<p>One pattern is to build up a knowledge-base in advance, similar to an employee onboarding document, which may be selectively referenced by the agent during operation.</p>
<p>Most recently, you can see this approach used by Anthropic’s newly released <a target="_blank" href="https://www.anthropic.com/news/skills">Claude Skills</a>, which seems to be RAG across SOPs and docs.</p>
<blockquote>
<p>While working on tasks, Claude scans available skills to find relevant matches. When one matches, it loads only the minimal information and files needed—keeping Claude fast while accessing specialized expertise.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761250093971/ae149a96-8a54-4365-ad37-5c9f115d45c5.png" alt class="image--center mx-auto" /></p>
<p>These skills are generated in advance by the user through interaction with a separate “skill-creator” agent. Doing so requires advance knowledge of the types of tasks your agents will perform.</p>
<h3 id="heading-learned-skills">Learned Skills</h3>
<p>The process of “skill creation” could be done after the fact, inferred from the message history.</p>
<p>You can see this in action with <a target="_blank" href="https://cursor.com/docs/context/memories">Cursor’s Memory</a> feature, which uses a special “save to memory” tool which it can invoke if it detects a behavior would be useful in the context of all future runs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761250124368/3fbfb045-ce01-4d04-bb09-a142efb2c512.png" alt class="image--center mx-auto" /></p>
<p>Another creative approach for learned skills is Letta’s <a target="_blank" href="https://docs.letta.com/guides/agents/sleep-time-agents">Sleep-Time Agents</a>, which use async agents to continuously overwrite earlier message history context with more compressed summaries, allowing agents to resume from their prior history rather than needing to start from fresh states.</p>
<h2 id="heading-code-generation">Code Generation</h2>
<p>With the goal of consistently reproducing tool calls, the most deterministic tool we could reach for is code itself.</p>
<p>Rather than using LLMs in the hot-loop, interpreting every case and choosing a tool to invoke, what if they were used more like compilers, generating optimized code in advance? LLMs are especially well suited for this, given strong programming language representation in their training sets.</p>
<p>In its most basic form, <strong>codegen could produce one-off disposable scripts</strong>, which when interpreted in our client environment would call tool functions directly.</p>
<p>This bypasses the indirection of the ToolCall response type, and allows a single LLM generation to invoke as many tools as it needs.</p>
<p>The team at <a target="_blank" href="https://www.cloudflare.com/">Cloudflare</a> recently launched “<a target="_blank" href="https://blog.cloudflare.com/code-mode/">Code Mode</a>” (great blog), and <a target="_blank" href="https://browser-use.com/">Browser Use</a> recently launched <a target="_blank" href="https://x.com/gregpr07/status/1981116900223701091">Code Use</a>, both of which implement exactly this concept. The scripts are ephemeral, but the concept of calling tools from code is the building block for the next topic.</p>
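<p>The execution side of this pattern is small enough to sketch. Here the “generated” script is hardcoded where a real system would take it from the model, and bare <code>exec()</code> stands in for a proper sandbox:</p>

```python
# Execute an LLM-generated script against an explicit tool namespace.
# Illustration only: `generated` would come from a model call, and a
# real system would sandbox it rather than use bare exec().
def run_generated_script(generated, tools):
    namespace = dict(tools)   # the script sees only the tools we hand it
    exec(generated, namespace)
    return namespace.get("result")


tools = {
    "fetch_orders": lambda user: [("widget", 2), ("gadget", 1)],
    "total_items": lambda orders: sum(n for _, n in orders),
}

# Stand-in for model output: one generation, several tool calls, and no
# ToolCall round-trips back through the LLM between them.
generated = """
orders = fetch_orders("alice")
result = total_items(orders)
"""

print(run_generated_script(generated, tools))   # 3
```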
<h3 id="heading-meta-tools">Meta-Tools</h3>
<p>Sometimes, the code generated to invoke multiple tools is worth storing as its own tool.</p>
<p>In doing this, we’re allowing the models not only to use tools, but to build their own abstractions, making each tool-call decision that much more powerful.</p>
<p>There’s no common name for this, so we’ll just call them “meta-tools.”</p>
<p>What’s most worth highlighting here is how well it fits into the “agents are loops with tools” architecture. The model keeps using tools; it’s just that those tools perform increasingly long (and deterministic) tasks.</p>
<p>The universally acknowledged pioneer of this concept is the <a target="_blank" href="https://arxiv.org/abs/2305.16291">Voyager paper</a>, which used on-the-fly tool generation to evolve primitive Minecraft bot APIs into higher-level tools:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761250210109/efcf9647-db4c-4478-853a-38a93879ba93.png" alt class="image--center mx-auto" /></p>
<p>Voyager was well ahead of its time, published in 2023, and there’s yet to be a clear follow-up paper or product that expands on it.</p>
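<p>Stripped of the Minecraft specifics, the mechanism can be sketched as tool composition: a helper that chains primitives gets registered back into the tool set, so one future tool call replays the whole sequence (illustrative, not the paper’s implementation):</p>

```python
# Meta-tools: compose primitive tools into a named higher-level tool
# and register it, so a single future tool call covers the sequence.
tools = {
    "move_to": lambda place: f"moved to {place}",
    "mine": lambda block: f"mined {block}",
    "craft": lambda item: f"crafted {item}",
}

def register_meta_tool(name, steps):
    """steps: (tool_name, arg) pairs captured from a successful run."""
    def meta_tool():
        return [tools[t](arg) for t, arg in steps]
    tools[name] = meta_tool

# Promote a successful trajectory into a reusable skill.
register_meta_tool("get_wooden_pickaxe", [
    ("move_to", "forest"),
    ("mine", "wood"),
    ("craft", "wooden_pickaxe"),
])

print(tools["get_wooden_pickaxe"]())
# ['moved to forest', 'mined wood', 'crafted wooden_pickaxe']
```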
<h3 id="heading-script-agent-fallback">Script-Agent Fallback</h3>
<p>Script-agent fallback refers to systems whose default operating mode is pure-software, with agent loops only being used for initial discovery and self-healing.</p>
<p>Script generation is usually done post hoc, after seeing multiple examples. In these cases, you employ an agent (or a human!) to perform a multi-step workflow, then use the tool-call trace from that run to generate reusable scripts. Note that “generate” is loosely defined here, ranging from full-blown LLM codegen to simple JSON runbooks.</p>
<p>This approach is especially popular in computer automation, where humans can be there to describe the task, perform or monitor the learning runs, leave comments, and iterate on the produced scripts.</p>
<p>Stars in this space include <a target="_blank" href="https://www.director.ai/">Browserbase’s Director</a> (congrats on the recent <a target="_blank" href="https://x.com/pk_iv/status/1980653648310071663">v2 launch</a>) and <a target="_blank" href="https://browser-use.com/">Browser Use’s</a> <a target="_blank" href="https://github.com/browser-use/workflow-use">Workflow Use</a>.</p>
<p>We also experimented at this level, with a tool tracing and replay SDK called <a target="_blank" href="https://github.com/pig-dot-dev/muscle-mem">Muscle Mem</a>. Read <a target="_blank" href="https://erikdunteman.com/blog/muscle-mem">here for why we started it</a>, and <a target="_blank" href="https://blog.butter.dev/muscle-mem-as-a-proxy">here for why we moved on</a>.</p>
<p>Similar to the workflow builder UIs, script-agent fallback systems require you to know in advance which workflow you’re about to run. The branching behavior does not need to be known in advance, but the task does need to be discrete and namable.</p>
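<p>The control flow shared by these systems fits in a few lines: replay a stored script when one exists for the named task, and fall back to the agent (re-recording along the way) on a miss or failure. A hedged sketch with a scripted stand-in agent:</p>

```python
# Script-agent fallback: deterministic replay when a script exists,
# agent discovery (and re-recording) when it doesn't or when it fails.
scripts = {}  # task name -> callable script

def run_task(task, agent):
    script = scripts.get(task)
    if script is not None:
        try:
            return script(), "replayed"
        except Exception:
            pass  # stale script: fall through to the agent to self-heal
    result, recorded = agent(task)   # agent performs and traces the run
    scripts[task] = recorded         # store the trace as a reusable script
    return result, "agent"


def demo_agent(task):
    # Stand-in agent: returns its result plus a recorded replay script.
    result = f"did {task} the slow, agentic way"
    return result, (lambda: f"did {task} from script")

print(run_task("file expense report", demo_agent))  # agent path, records script
print(run_task("file expense report", demo_agent))  # deterministic replay
```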
<h3 id="heading-script-generators">Script Generators</h3>
<p>This is “Lovable for Automations,” where technical or nontechnical users work with codegen agents to produce pure-software scripts. No agents at runtime.</p>
<p>This ahead-of-time generation can be quite tricky, as it’s hard to know in advance exactly which tools will get run in a particular workflow, but with enough iteration and end user feedback, you can get there.</p>
<p>Particularly creative approaches in this space even build DSLs to represent the automation, and force generation using custom grammars, reducing the surface area for error and hallucinations.</p>
<p>Notable teams in this space are <a target="_blank" href="https://withforge.com/">Forge</a> and <a target="_blank" href="https://www.sola.ai/">Sola</a>.</p>
<p>As with workflow builder UIs, it’s unclear if this fits under the strict definition of an agent, but worth shouting out, as these products tend to be happily adopted.</p>
<h2 id="heading-response-caching">Response Caching</h2>
<p>Response Caching means running an HTTP proxy in front of the LLM provider and caching responses as they flow through. On repeat requests, the cache can serve responses as if a model had generated them, resulting in deterministic behavior.</p>
<p>Because the cache spoofs the LLM layer, the agent loop remains simple, unaware that the endpoint is guiding it down a deterministic path.</p>
<p>This is coincidentally what we’re building at <a target="_blank" href="http://Butter.dev">Butter.dev</a>.</p>
<p>To have any meaningful cache-hit rate (more than the rare exact context matches) you’d need to figure out how to correctly group subtly different prompts, identify dynamic data, ignore noisy contexts, handle complex conditional control flows, etc. Much is yet to be solved, which we’ve <a target="_blank" href="https://blog.butter.dev/template-aware-caching">written more about here</a>.</p>
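<p>Stripped to its core, the proxy pattern is a lookup keyed on the request, falling through to the provider on a miss. This toy sketch (not Butter’s implementation) only hits on byte-identical message lists, which is exactly the limitation described above:</p>

```python
# Toy exact-match response cache in front of an LLM call. Real
# template-aware caching must match subtly different prompts; this
# sketch hits only when the request is identical.
import json

class CachingProxy:
    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}
        self.hits = 0

    def chat(self, messages):
        key = json.dumps(messages, sort_keys=True)  # canonical request key
        if key in self.cache:
            self.hits += 1
            return self.cache[key]              # served as if model-generated
        response = self.upstream(messages)      # cache miss: call provider
        self.cache[key] = response
        return response


calls = []
def fake_provider(messages):
    calls.append(messages)
    return {"role": "assistant", "content": "ok"}

proxy = CachingProxy(fake_provider)
msgs = [{"role": "user", "content": "hello"}]
proxy.chat(msgs)
proxy.chat(msgs)
print(len(calls), proxy.hits)   # 1 1
```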
<h2 id="heading-llm-layer-improvements">LLM-Layer Improvements</h2>
<p>The obvious VC question is “won’t the big labs just do this?”</p>
<p>Quite possibly! As taught in <a target="_blank" href="https://en.wikipedia.org/wiki/Bitter_lesson">The Bitter Lesson</a>, one must not ignore the consistent trend that one-size-fits-all LLMs have always ended up superseding incremental progress made outside of the models.</p>
<p>If the goal is to make models more deterministic, surely there’s a way it could happen in the architecture of the models themselves.</p>
<p>I’m in no way an expert in these topics, and there’s probably some secret other thing being cooked up in the labs or a paper I haven’t seen, but here are a few model-level shoutouts.</p>
<h3 id="heading-action-models">Action Models</h3>
<p>Action Models are special language models where the decoder is trained to emit tool calls rather than tokens, allowing them to map input stimuli to actions without text as an intermediate.</p>
<p>These models are used heavily in robotics, where the specialized domain allows them to be quite a bit smaller and runnable on-device.</p>
<p>Adjacent work has been done in computer use by the team at <a target="_blank" href="http://generalagents.com/ace/">General Agents</a>, resulting in shockingly fast fully-agentic computer automation:</p>
<blockquote>
<p>Ace leverages a new behavioral training paradigm. Unlike language and vision models which are trained on text and images, Ace is trained on behavior.</p>
</blockquote>
<h3 id="heading-reinforcement-learning">Reinforcement Learning</h3>
<p>Many process automation tasks have quick feedback for success/fail, which makes these domains optimal for using that feedback as a reward function in RL.</p>
<p>It’s quite early, but <a target="_blank" href="https://x.com/brendanh0gan/status/1923135962789364206">early tinkerers</a> are seeing promising results.</p>
<h1 id="heading-mapping-them-all-together">Mapping them all together</h1>
<p>Below are all of the topics we’ve discussed, highlighting how well each approach satisfies the important aspects of reliable “deterministic replay”:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761250415193/5bcb0f3c-27e1-4a5a-a6b5-73f21690e935.png" alt class="image--center mx-auto" /></p>
<p>And below is a map showing where in the stack the approaches sit, and how explicitly tasks must be known in advance:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761250430183/44a1782f-2440-49e7-bfe9-010f5ab406de.png" alt class="image--center mx-auto" /></p>
<p>I trust this overview helps untangle the mess of different approaches, breaking them down into clearer sub-groups, so we all can make more informed tradeoffs in our own building.</p>
<p>Cheers!</p>
<p>Erik</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0002]]></title><description><![CDATA[Hi there! Welcome to our second weekly changelog.
Performance Improvement
Average cache hit response times are now 8.9x faster due to data caching on the server.
Fixes
Fixed a bug related to traversing cache entries with 100+ children.
We’re Hiring
W...]]></description><link>https://blog.butter.dev/changelog-0002</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0002</guid><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Sat, 18 Oct 2025 01:03:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760748324789/bbe635fa-2b66-4898-8535-04b41d590ae6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi there! Welcome to our second weekly changelog.</p>
<h2 id="heading-performance-improvement">Performance Improvement</h2>
<p>Average cache hit response times are now 8.9x faster due to data caching on the server.</p>
<h2 id="heading-fixes">Fixes</h2>
<p>Fixed a bug related to traversing cache entries with 100+ children.</p>
<h2 id="heading-were-hiring">We’re Hiring</h2>
<p><a target="_blank" href="https://docs.butter.dev/careers">We’re hiring a Systems Engineer</a> to work on the performance problems you see above.</p>
]]></content:encoded></item><item><title><![CDATA[Changelog #0001]]></title><description><![CDATA[Hi everyone! Welcome to our first of many weekly changelogs, set to run every Friday. Follow along for weekly updates and improvements.
This week was primarily focused on ensuring compatibility with upstream providers (OpenAI, for the time being) and...]]></description><link>https://blog.butter.dev/changelog-0001</link><guid isPermaLink="true">https://blog.butter.dev/changelog-0001</guid><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Fri, 10 Oct 2025 07:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760410464906/784e4d33-32e7-4f2e-86f2-c1a73913d149.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone! Welcome to our first of many weekly changelogs, set to run every Friday. Follow along for weekly updates and improvements.</p>
<p>This week was primarily focused on ensuring compatibility with upstream providers (OpenAI, for the time being) and downstream agent frameworks.</p>
<h2 id="heading-features">Features</h2>
<h3 id="heading-brotli-compression-support">Brotli compression support</h3>
<p>Added support to handle large compressed responses from OpenAI.</p>
<h3 id="heading-transparent-forwarding">Transparent forwarding</h3>
<p>All requests to unsupported endpoints, such as Responses or Encodings, are now transparently forwarded without involving the cache. This allows popular agent libraries pointed at Butter proxy to operate as expected, with caching only on the Chat Completions requests.</p>
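<p>In routing terms, this is a small dispatcher: Chat Completions goes through the cache, and everything else passes straight upstream. A sketch with illustrative paths (not the proxy’s actual routing table):</p>

```python
# Minimal dispatcher for a caching proxy: Chat Completions requests go
# through the cache; everything else is transparently forwarded.
# Paths are illustrative, not the proxy's actual routing table.
def make_proxy(cached_handler, forward):
    cached_paths = {"/v1/chat/completions"}

    def handle(path, body):
        if path in cached_paths:
            return cached_handler(body)
        return forward(path, body)   # unsupported endpoint: pass through
    return handle


log = []
proxy = make_proxy(
    cached_handler=lambda body: log.append("cache") or {"cached": True},
    forward=lambda path, body: log.append("forward") or {"cached": False},
)
proxy("/v1/chat/completions", {})
proxy("/v1/responses", {})
print(log)   # ['cache', 'forward']
```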
<h3 id="heading-request-pagination">Request pagination</h3>
<p>Dashboard home-page loads are now much faster for users with moderate request traffic.</p>
<h2 id="heading-security-amp-uptime">Security &amp; Uptime</h2>
<ul>
<li><p>Data access controls improved on UI backend</p>
</li>
<li><p>Uptime monitors and alerts configured</p>
</li>
</ul>
<h2 id="heading-fixes">Fixes</h2>
<ul>
<li><p>Various UI rendering/flickering bugs</p>
</li>
<li><p>Stale-segfault bug in proxy</p>
</li>
<li><p>Broken links in cache graph UI</p>
</li>
<li><p>Mobile reactivity improved*</p>
<ul>
<li>*with a message that says “use desktop please”</li>
</ul>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Rethinking Muscle Mem as an LLM Proxy]]></title><description><![CDATA[I believe two things:

AI is an incredible technology,

AI will be used less* over the long term.


(*) proportionally speaking…
Over the last few months, I’ve been working on building “Muscle Memory for AI”: tooling that specifically removes AI from...]]></description><link>https://blog.butter.dev/muscle-mem-as-a-proxy</link><guid isPermaLink="true">https://blog.butter.dev/muscle-mem-as-a-proxy</guid><dc:creator><![CDATA[Erik Dunteman]]></dc:creator><pubDate>Wed, 01 Oct 2025 19:25:44 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-i-believe-two-things">I believe two things:</h3>
<ol>
<li><p><strong>AI is an incredible technology,</strong></p>
</li>
<li><p><strong>AI will be used <em>less</em>* over the long term.</strong></p>
</li>
</ol>
<p>(*) <em>proportionally speaking…</em></p>
<p>Over the last few months, I’ve been working on building “Muscle Memory for AI”: tooling that specifically removes AI from places where deterministic scripts would serve just fine.</p>
<p>In May, I launched a Python package called <code>Muscle Mem</code>.</p>
<p>Titled <em>A Behavior Cache for Agents</em>, <code>Muscle Mem</code> <strong>was an SDK for instrumenting and replaying sequences of tool calls</strong>.</p>
<p>It’d bolt on top of your existing agents as decorators on their tools, running snapshots every time the tool was invoked to record not only the action taken, but the environment in which it was taken (for safe cache validation). For more context such as its use cases, API design, and a demo, <a target="_blank" href="https://erikdunteman.com/blog/muscle-mem/">the original launch blog</a> is worth a read.</p>
<p><code>Muscle Mem</code> was well-received, gaining healthy intrigue on <a target="_blank" href="http://news.ycombinator.com/item?id=43988381">Hacker News</a> and <a target="_blank" href="https://github.com/pig-dot-dev/muscle-mem">GitHub</a>.</p>
<p>I sure felt clever!</p>
<p>Fast-forward a couple of days: we had more than 700 GitHub stars, but strangely… seemingly no users. While a few real users did eventually trickle in over the following months, it was clear something had to change.</p>
<p>Thankfully (or, not), it’s hard to talk me out of ideas that I consider to be inevitabilities.</p>
<p>I genuinely believe in the thesis that <strong>many classes of automation are best expressed with deterministic software, and we’re wasting intelligence and reducing trust by performing them with agents</strong>.</p>
<p>The road to AI is understandably tempting; there’s a long tail of cases to handle in automation software, and agents can confidently trailblaze across those edge cases into the unknown to figure it out. But I assert that <strong>for those repeat executions, we ought to be using software.</strong></p>
<h2 id="heading-the-flaws-of-muscle-mem">The Flaws of Muscle Mem</h2>
<p>(<em>You would not believe how many failed attempts I’ve put into writing on this topic</em>)</p>
<p><code>Muscle Mem</code> did not last long in the wild before it started hitting limitations. Its flaws fell into one of four buckets:</p>
<ol>
<li><p>Shit UX</p>
</li>
<li><p>Rigid definition of a workflow</p>
</li>
<li><p>Recording the wrong signals</p>
</li>
<li><p>Ignoring the dynamic nature of data</p>
</li>
</ol>
<h3 id="heading-shit-ux">Shit UX</h3>
<p>Conceptually cool, but a head scratcher to try to use.</p>
<p><code>Muscle Mem</code> focused heavily on tool-calling as the integration point, along with user-defined callbacks on generic types to guard the execution against edge cases.</p>
<p>You’d decorate tools with a pre-<code>Check</code>, a guard that would run a <code>capture()-&gt;T</code> callback to capture some data <code>T</code> that you’ve deemed matters for later determining cache validity, plus yet another <code>compare(T, T)</code> callback to pass or fail it against the current <code>T</code>. Simple!</p>
<p>Confusingly to many, <code>Muscle Mem</code> did not ask users to decorate the tools which <em>read</em> from the environment: it didn’t care about replaying those. Just decorate the <em>write</em> tools. And the capture function would greatly depend on <code>T</code>. Even for us, generating examples proved difficult in this setup, which is pretty apparent in an example where we make use of a literal timestamp in <code>T</code>, resorting to TTL-based cache invalidation.</p>
<p>If you’re struggling to follow, rest assured. For, in my excitement to make the system “bolt onto” your existing agents, I created a whole new set of hoops to jump through. Tools had to be manually instrumented, and you had to know in advance which ones mattered for replay. Cache validation was forced to be a user concern, and users didn’t seem terribly keen on spending their time writing such abstract callback handlers.</p>
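<p>To make that concrete, here’s a rough reconstruction of the shape of that API. The names and types below are illustrative guesses, not the actual <code>Muscle Mem</code> interface:</p>

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

# Hypothetical reconstruction of the pre-Check abstraction described above;
# names are illustrative, not the real Muscle Mem API.
@dataclass
class Check(Generic[T]):
    capture: Callable[[], T]         # snapshot the environment into some T
    compare: Callable[[T, T], bool]  # is the recorded T still valid now?

def is_cache_valid(check: "Check[T]", recorded: T) -> bool:
    # On replay, re-capture the environment and compare it against the
    # snapshot recorded when the trajectory was first observed.
    return check.compare(recorded, check.capture())

# The timestamp example mentioned above: T is a wall-clock time, and
# comparison degenerates into TTL-based cache invalidation.
ttl_check = Check(
    capture=lambda: time.time(),
    compare=lambda recorded, now: now - recorded < 60.0,  # valid for 60s
)
```

<p>Even this toy version shows the burden: the user has to invent <code>T</code>, <code>capture</code>, and <code>compare</code> for every write tool they decorate.</p>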
<h3 id="heading-rigid-definition-of-workflow">Rigid definition of workflow</h3>
<p>Following the assertion that <code>Muscle Mem</code> was automating “discrete” workflows, it became the responsibility of the user to group workflows into discrete buckets, such as:</p>
<ul>
<li><p>“Fill out the intake form,” or</p>
</li>
<li><p>“Process a refund.”</p>
</li>
</ul>
<p>This would require all workflows to be known in advance. It afforded no natural drift in workflows. It left no room for subagents and subtrajectories, which are usually powerful tools for abstraction in agents.</p>
<p>Moreover, entering a workflow was a black box. A supposed cache hit that broke mid-run had no way of letting the agent resume from partway through; its only option was to start from the very beginning.</p>
<p>This clashed with the callback structure as well. If a tool were to return a cache-breaking piece of data, that signal would certainly be present in the messages, but it’s possible the user couldn’t have anticipated having to read that data when designing <code>T</code>. Or, what if the new branch in behavior subtly involved an entirely different workflow? Well, it would all get baked into the cache under “Fill out the intake form.”</p>
<h3 id="heading-recording-the-wrong-signals">Recording the wrong signals</h3>
<p>The role of Muscle Memory is to accurately reproduce the same series of tool calls that an LLM would make in the same situation.</p>
<p>The artifact of an agent run in <code>Muscle Mem</code> was a linear list of function invocations with arguments, as well as any user-retrieved environment data which <code>T</code> would tag onto it. This differs from the content visible to the LLM: the LLM isn’t looking at tool traces and user data <code>T</code>. Instead, it only sees the context window to make branching logic decisions. Thus, <code>Muscle Mem</code> shared no overlap in signal with the models it was meant to proxy.</p>
<p>When reproducing tool calls in a model-like way, it is critical to have 100% overlap in signal; otherwise you succumb to the final point:</p>
<h3 id="heading-ignoring-the-dynamic-nature-of-data">Ignoring the dynamic nature of data</h3>
<p>This was, and is, the biggest challenge in this entire domain of Muscle Memory. No two LLM calls are exactly alike: they’re the result of structural templating, nondeterministic tool results, and chaotic user input.</p>
<p>If you require exact matches, as did <code>Muscle Mem</code>, your hit rate will be 0%.</p>
<p>We must be able to separate out “variables” from the code.</p>
<p>The easiest example of this is a form-filling bot navigating a browser. The button coordinates are static, as the buttons will almost always remain at their respective locations. But, typing into the “First Name” field would be dynamic: entering a unique string from run to run. Still, you could (and should) assert that form-filling with dynamic data constitutes the same workflow; it just involves variable data which the model is regurgitating into a tool call argument.</p>
<p>I will talk about this topic in great depth, so for now we’ll just say that <code>Muscle Mem</code> had nearly zero mechanics accounting for this.</p>
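<p>To illustrate the distinction, here’s a minimal sketch (not <code>Muscle Mem</code> code) of what the form-filling example looks like once static structure and dynamic data are separated:</p>

```python
# Illustrative sketch: click coordinates are static structure, while the
# typed name is a variable slot filled in at replay time.
recorded_steps = [
    {"tool": "click", "args": {"x": 412, "y": 230}},            # static
    {"tool": "type_text", "args": {"text": "{{first_name}}"}},  # dynamic
]

def replay(steps, variables):
    """Substitute run-specific variable data into a recorded trajectory."""
    out = []
    for step in steps:
        args = {}
        for key, val in step["args"].items():
            if isinstance(val, str):
                # Fill each {{variable}} slot with this run's data.
                for name, value in variables.items():
                    val = val.replace("{{" + name + "}}", value)
            args[key] = val
        out.append({"tool": step["tool"], "args": args})
    return out
```

<p>Under this framing, every run with a different <code>first_name</code> is still the same workflow; only the bound data changes.</p>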
<h2 id="heading-the-realization-everything-is-code">The Realization: Everything is Code</h2>
<p>All of these concepts map to software systems.</p>
<h3 id="heading-llms-are-pure-functions-tools-are-the-runtime">LLMs are pure functions, tools are the runtime</h3>
<p>The <a target="_blank" href="https://en.wikipedia.org/wiki/Chinese_room">Chinese Room Experiment</a> questions our ability to judge the intelligence of an agent locked away in a black box offering communication only by basic inputs and outputs. For us, we aren’t so much concerned with determining the intelligence of the agent inside. Instead, Muscle Memory attempts to learn from and emulate the apparent “intelligence” inside.</p>
<p>An interesting aspect of the Chinese Room setup is that it is <em>stateless</em>, with a text-in/text-out messaging interface. Peaceful, quiet, one request at a time, and a pure function even by a Haskell programmer’s standards. At least until LLM API providers pull tool calling behind their API wall, <strong>we may think of LLMs as pure functions</strong>, interpreting inputs and transforming them into outputs per instructions without side effects.</p>
<p><strong>Tools are the runtime</strong></p>
<p>A pure function on its own has no use without side-effects. Tools are the runtime, the “hands” for models to reach out and read/write into the stateful world. They’re also the means by which caches get broken.</p>
<p><strong>If it quacks like a duck…</strong></p>
<p>Poorly summarized in polymorphism terms, the lesson we take from the Chinese Room experiment is:</p>
<blockquote>
<p>If it looks like a duck and quacks like a duck, it calls into question if ducks are real.</p>
</blockquote>
<p>Whether or not ducks are real, the Chinese Room scenario says that as long as a black box is producing a valid sequence of outputs, it’s a viable system.</p>
<p>This system, executing tools into the external world, cares not for what’s in the black box, so long as that black box produces tool calls in the right order with the right arguments. That is, users of LLMs aren’t so much interested in the mechanics of the computation used to generate a response to their prompts. They’re mainly looking for responses they can trust.</p>
<h3 id="heading-data-vs-code-separation">Data-vs-code separation</h3>
<p>Programming languages (turning a blind eye to the lisps) generally draw a strict line between data and code. The code lays out a map of all possible branches a system can take, and the data slots into known shapes (types) to ultimately dictate which branches get taken.</p>
<p><strong>Context windows are effectively interwoven data + code</strong>. This is especially the case with the workflow agents we’re targeting, which are usually triggered by some software system without a human chat so it carries structurally consistent data.</p>
<p>The separation of data and code introduces a concept of variables, which we’ll call <em>symbols</em>. In programming languages, symbols map to data, functions, etc. In context windows, it’s more nebulous, but for simplicity: think of symbols as pointers to substrings within the context, like <code>first_name</code>.</p>
<h3 id="heading-the-exact-line-of-intelligence-is-variable-scope">The exact line of “intelligence” is variable scope</h3>
<p>In interpreted programming languages, executing a line depends on all the used symbols being in scope by the time of execution, else raising an <code>Undefined Symbol</code> error.</p>
<pre><code class="lang-python">a = <span class="hljs-number">1</span> <span class="hljs-comment"># brings a into scope</span>
b = a <span class="hljs-comment"># a already in scope, brings b into scope</span>
c = d <span class="hljs-comment"># Undefined Symbol: d!</span>
</code></pre>
<p>Imagine a perfect system for separating data and code from an LLM’s context window, one where you accumulate a vast trove of symbols and their respective data.</p>
<p>If that context is sent to a model, and it produces new symbols that have no computable origin from the context, we might call this the result of “intelligence.”</p>
<p>That is, the model pulled that variable out of thin air.</p>
<p>This is the equivalent of the <code>c = d</code> undefined symbol error above. Variables simply do not materialize out of thin air in deterministic programs. This helps us identify the intractable land of “intelligence” which we couldn’t possibly serve. Sadly, this is where we must draw the line for how helpful Butter can be.</p>
<h3 id="heading-can-it-be-muscle-memory">Can it be Muscle Memory?</h3>
<p>Given this, the cleanest heuristic for whether an automated workflow might benefit from Muscle Memory systems is this: could you, in theory, with infinite time, sit down and write heuristic-based software handling every case of the workflow?</p>
<p>If you could not write the workflow as software, if you couldn’t massage dynamic data of known shapes through conditional branching logic without hitting some data transformation that can only be described as “an LLM pulling something out of thin air,” then you would be working with a fundamentally intelligent workflow.</p>
<p>So long as every single token output of a model is either static (inherent to that run, including planning traces) or a regurgitated derivation of dynamic data from earlier in the context window, we consider the generation to be a candidate for Muscle Memory.</p>
<p>It’s that simple. <strong>Can it be code?</strong></p>
<h2 id="heading-can-it-be-code-so-were-doing-codegen">“Can it be code” – So we’re doing codegen?</h2>
<p>Not exactly. In addition to the proxy approach, which I’ll discuss in a moment, I believe a <a target="_blank" href="https://arxiv.org/abs/2305.16291">Voyager-style</a> codegen approach models the data-vs-code separation well.</p>
<p>Under this model, LLMs are used to write actual software which hooks into a low-level primitive set of functions, giving itself an on-the-fly generated tool it can call.</p>
<p>I have plenty of qualms with client-side codegen (which I’ve omitted from this blog for clarity), but all around I believe it can perform well, especially as codegen models improve. We’ve decided to keep focused on our current approach, but it’s worth acknowledging codegen as a hopeful alternative.</p>
<h2 id="heading-actually-were-building-an-llm-proxy">Actually, we’re building an LLM proxy</h2>
<p>LLM proxies, or “gateways,” are an increasingly common class of product that take advantage of the widespread support for <a target="_blank" href="https://platform.openai.com/docs/api-reference/chat">OpenAI’s Chat Completions format</a>, by even non-OpenAI model providers or self-hosted servers like vLLM.</p>
<p>These gateways are HTTP reverse proxies which sit between any LLM client and server supporting the chat completions format. Integration is a one-line change, repointing your LLM client to call the proxy instead of the provider’s default API: <code>BASE_URL=</code><a target="_blank" href="https://proxy.butter.dev"><code>https://proxy.butter.dev</code></a>.</p>
<p>Most gateways are used for analytics, or for intelligently routing across multiple providers.</p>
<p>The Butter proxy does something special: <strong>it spoofs LLM responses</strong> back to clients.</p>
<p>The proxy models all trajectories taken using a tree keyed at each level by the message content. When a response can be served deterministically, it is. If not, the request forwards to the LLM provider for an actual generation, and novel results are stored as a fresh branch in the tree, to be used as muscle memory in the future.</p>
<p>It’s not obvious at first, but a proxy cache serving back spoofed responses to a tool-calling agent achieves, in a heavy-handed way, the same thing that codegen does: we hold the branching logic of a system in the form of a tree-like cache, and guide the agent into invoking functions in the runtime (tools) to read and write data, which goes on to inform even more branching logic in the subsequent requests.</p>
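<p>A minimal sketch of that tree-shaped cache might look like the following (illustrative only, not Butter’s actual implementation):</p>

```python
# Toy trajectory tree: each level is keyed by message content, and a cached
# assistant response hangs off the node reached by the conversation so far.
class TrajectoryTree:
    def __init__(self):
        self.children = {}    # (role, content) -> subtree
        self.response = None  # cached assistant reply at this prefix

    def lookup(self, messages):
        """Walk the tree along the conversation. None means a novel prefix
        that must be forwarded to the real LLM provider."""
        node = self
        for msg in messages:
            key = (msg["role"], msg["content"])
            if key not in node.children:
                return None
            node = node.children[key]
        return node.response

    def insert(self, messages, response):
        """Store a novel trajectory as a fresh branch for future replay."""
        node = self
        for msg in messages:
            key = (msg["role"], msg["content"])
            node = node.children.setdefault(key, TrajectoryTree())
        node.response = response
```

<p>A miss at any depth is the off-ramp to the real model; the resulting generation is then inserted as a new branch for future runs.</p>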
<p>Again, <strong>as long as those tools get called, does it matter who called them</strong>?</p>
<p>Remember the Chinese Room.</p>
<p>As far as the agent client is concerned, it’s always in “AI” mode. The duck keeps quacking, the results come back in valid format, and the world keeps going round.</p>
<p>I’m obviously sold, but let’s walk through how this Proxy approach solves many of the issues with our first <code>Muscle Mem</code> approach:</p>
<h3 id="heading-not-shit-ux">Not-Shit UX</h3>
<p>It’s dead-simple to use: just repoint your <code>BASE_URL</code> to the proxy. It’s a drop-in endpoint for any client system that speaks the chat completions API, as most do.</p>
<p>Users need not concern themselves with special APIs to record this vague concept of a <code>T</code>.</p>
<p>Cache invalidation is built into the way the tree is traversed, on the messages themselves. Cache-breaking data from tool call results simply branches the tree and gracefully continues from there with the guidance of an LLM. This provides a completely smooth on/off-ramp between software systems (spoofing the agent down a proven route) and AI systems (generating novel responses).</p>
<h3 id="heading-fluid-definition-of-a-workflow">Fluid definition of a workflow</h3>
<p>Grouping of trajectories is automatic and structural: “Does it follow an existing branch in the tree? Boom, that’s an instance of a workflow.”</p>
<p>This way, workflows take on a form we can visualize: a path or set of adjacent paths down the tree. This also allows for a “natural drift” as the workflows evolve: new branches growing, and stale branches eventually being pruned.</p>
<h3 id="heading-recording-the-right-signals">Recording the right signals</h3>
<p>The proxy hooks in at the messages level, which contains all of the data needed for Muscle Memory:</p>
<ul>
<li><p>Message content for cache returns,</p>
</li>
<li><p>Tool call invocations,</p>
</li>
<li><p>Tool call results, which carry cache-branching external data, and</p>
</li>
<li><p>Any other context “symbols” or data that would push the trajectory down a specific branch.</p>
</li>
</ul>
<p>A 100% overlap with what the model sees.</p>
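<p>For concreteness, a single (hypothetical) trajectory slice in chat-completions form carries all of these signals at once:</p>

```python
# A representative, hypothetical trajectory slice in the chat-completions
# message format: user content, the model's tool call invocation, and the
# external tool result that can branch the cache.
trajectory = [
    {"role": "user", "content": "What's the weather in Boston?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"city": "Boston", "state": "MA"}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "72F, sunny"},
    {"role": "assistant", "content": "It's 72F and sunny in Boston."},
]
```
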
<h3 id="heading-tbh-still-struggling-with-the-dynamic-nature-of-data">(TBH still struggling with) The dynamic nature of data</h3>
<p><a target="_blank" href="https://blog.butter.dev/template-aware-caching">We’re working on this one</a>. At the very least, we have access to every piece of “scope” that the model does, plus more if we grow our API surface area. In addition, the centralized nature of the cache allows trajectory-vs-trajectory comparison. This gives us the most complete shot at deconstructing data and code from context windows and building a cache that’s more structural than literal.</p>
<h2 id="heading-conclusion-if-only-it-were-easier">Conclusion: If only it were Easier</h2>
<p>The pivot to the proxy form factor was not an obvious path to take, and it’s substantially harder to do this correctly than it was to build a Python client library.</p>
<p>Despite the challenges, I’m confident that it is the cleanest and most theoretically correct route to building Muscle Memory into AI systems, so we’ll be giving it an enthusiastic swing!</p>
<p>You can sign up and try Butter today <a target="_blank" href="https://docs.butter.dev">using our quickstart guide</a>.</p>
<p>Or reach out – <a target="_blank" href="mailto:erik@butter.dev">erik@butter.dev</a> – we’d love to hear your use-cases, and will try our best to prioritize the patterns that most quickly get you to an agentic system you can trust.</p>
<p>Cheers,</p>
<p><strong>Erik</strong></p>
]]></content:encoded></item><item><title><![CDATA[Template-Aware Caching]]></title><description><![CDATA[Motivation
Butter is designed not only to memorize static trajectories, but also workflows involving dynamic variables. Here, we think of dynamic variables as placeholders for the data which may change from run-to-run: information which might derive ...]]></description><link>https://blog.butter.dev/template-aware-caching</link><guid isPermaLink="true">https://blog.butter.dev/template-aware-caching</guid><category><![CDATA[caching]]></category><category><![CDATA[bindings]]></category><category><![CDATA[llm]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Raymond Tana]]></dc:creator><pubDate>Wed, 01 Oct 2025 01:12:56 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-motivation">Motivation</h2>
<p>Butter is designed not only to memorize static trajectories, but also workflows involving dynamic variables. Here, we think of dynamic variables as placeholders for the <em>data</em> which may change from run-to-run: information which might derive externally from tools, creativity, or the environment. Everything else in the workflow we consider to be the <em>code</em> or structure: how to use the data to complete the task.</p>
<p>Most real world automations include dynamic variables (e.g., names, dates, addresses, etc.), and are thus designed to handle a range of inputs. Workflows may make use of dynamic variables for a number of reasons:</p>
<ol>
<li><p><strong>Storing information about the specific run</strong>. E.g., storing the user’s name, or storing the root directory in the file system.</p>
</li>
<li><p><strong>Tracking information about the environment</strong>. E.g., checking the current time, or reading all the filenames in the current directory.</p>
</li>
<li><p><strong>Determining relevant tools to employ</strong>. E.g., choosing the appropriate parser to run on a given file.</p>
</li>
</ol>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text">We assert that in our domain of workflow automation, the desire for a repeatable workflow implies the notion of “code” in the context window; i.e., the instructions that influence the model’s branching logic decisions to complete a known task. This code is intermixed with runtime unique data. It is our job to distinguish the two.</div>
</div>

<p>Each chat between a user and a model represents a single trajectory through an abstract workflow. As Butter observes these trajectories, it is confronted with the problem of distinguishing between dynamic data and structural content within the messages. The better Butter can do at extracting “data from code,” the broader the set of future queries which will result in a cache hit. This way, Butter can more robustly serve from its cache and avoid another LLM call.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">❗</div>
<div data-node-type="callout-text">In this setup, the broader the set of queries a single cache entry is meant to handle, the greater the burden there is on Butter to correctly model how this data is derived and transformed.</div>
</div>

<h2 id="heading-obstacles-to-separating-data-from-code">Obstacles to Separating Data from Code</h2>
<p>The general problem of separating data from code involves many considerations.</p>
<h3 id="heading-how-to-detect-when-an-argument-to-a-tool-is-intended-to-change-run-to-run"><strong>How to detect when an argument to a tool is intended to change run-to-run?</strong></h3>
<p>As we observe LLMs making use of provided tools to complete tasks, we might expect the model to make use of some dynamic data when providing arguments to its tool calls. Indeed, it seems reasonable to assume that whenever an argument is passed into a tool call, it represents dynamic data.</p>
<p>For instance, suppose that the user specifies that they live in Boston, and asks about the weather there. With access to the tool <code>get_weather(city : str, state : str) -&gt; str</code>, the LLM could produce the tool call <code>get_weather('Boston', 'MA')</code>.</p>
<p>Here, both the arguments for city and state require some dynamic data, and it will be the job of Butter to figure out how to derive their values. As the user already explicitly wrote that they live in Boston, the substring <code>'Boston'</code> is already present in the model’s context, so we know from where in the prompt to source the city information. But nowhere did the user specify Massachusetts as their state; so while the string <code>'MA'</code> might be theoretically derivable from the city of Boston, Butter has no clear source from which to have derived it syntactically. Really, the argument <code>'MA'</code> was generated by the LLM, understanding Boston to be the city in Massachusetts. How is Butter to understand this relationship between <code>Boston</code> and <code>MA</code>?</p>
<p>On the other hand, some tool calls might involve variables which should not change run-to-run. For example, an encoding format like <code>png</code> may not depend on any dynamic data, and instead is a structural part of the workflow.</p>
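<p>A toy way to frame the sourcing question is a literal substring search over the context (illustrative only; the <code>syntactic_source</code> helper here is hypothetical):</p>

```python
# 'Boston' is syntactically derivable from the prompt, while 'MA' appears
# nowhere in it and must have been generated by the model's own knowledge.
prompt = "I live in Boston. What's the weather like today?"

def syntactic_source(context: str, arg: str):
    """Return the span where a tool-call argument literally appears in the
    context, or None when it has no clear syntactic source."""
    idx = context.find(arg)
    return (idx, idx + len(arg)) if idx != -1 else None

city_source = syntactic_source(prompt, "Boston")  # found in the prompt
state_source = syntactic_source(prompt, "MA")     # no syntactic source
```

<p>Arguments with no syntactic source are exactly the ones that force Butter to model a semantic relationship (Boston → MA) rather than a substring one.</p>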
<h3 id="heading-how-to-avoid-associating-identical-data-to-the-same-variable-when-theyre-truly-unrelated">How to avoid associating identical data to the same variable when they’re truly unrelated?</h3>
<p>Coincidences happen, but recognizing them is not so straightforward. For instance, which data in the following query are related?</p>
<p><em>Today is September 30, 2025. Find the third Python script in the directory</em> <code>/source</code> when sorted alphabetically, interpret it with <code>python3</code>, and save the output in <code>09/output_3.txt</code>.</p>
<p>A naïve approach to separating data from code might replace all instances of the number <code>3</code> with the same variable, ignoring the various, distinct roles played by the number three in this query. Any time we group together unrelated data, our cached template will fail to generalize to other situations.</p>
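<p>A deliberately broken sketch makes this failure concrete (the <code>naive_template</code> below is an anti-example, not Butter’s behavior):</p>

```python
# Binding n = '3' and substituting every occurrence conflates the
# unrelated roles the digit plays in this query.
query = ("Today is September 30, 2025. Find the third Python script in the "
         "directory /source when sorted alphabetically, interpret it with "
         "python3, and save the output in 09/output_3.txt.")

naive_template = query.replace("3", "{{n}}")
# The date, the interpreter name, and the filename are all mangled:
# 'September {{n}}0, 2025', 'python{{n}}', and '09/output_{{n}}.txt'.
```
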
<h3 id="heading-how-to-recognize-instances-of-the-same-underlying-variable-throughout-a-message"><strong>How to recognize instances of the same underlying variable throughout a message?</strong></h3>
<p>While the previous point might be described as a concern for false positives, we might also be concerned about false negatives. That is, we worry about failing to recognize how the same variable was used across a conversation.</p>
<p>Data may undergo transformations which are syntactically simple: e.g., turning a string into all lower-case (<code>Erik</code> → <code>erik</code>), rewriting a number (<code>1</code>→ <code>1.0</code>), or stringifying a JSON object.</p>
<p>Data may also be filtered: e.g., a variable corresponding to the user’s full name will necessarily contain the data required to obtain both that user’s first name and last name. Either of these partial names may be used throughout the context without writing out the full name.</p>
<p>Other transformations involve more intelligence to execute. For instance, we might view <code>'MA'</code> from the example above as the result of a transformation of the form <code>state_from_city('Boston')</code>. Plenty of other examples require the same level of insight, such as knowing the industry in which a given company operates, computing the date associated with a given holiday, expanding a given acronym, naming the artist behind a given song, producing antonyms for given words, etc.</p>
<p>No matter the transformations that may have been applied to some data, we still consider each of its representations as being associated to the same underlying variable. But recognizing that the variable <code>leader = 'Napoleon'</code> explains both <code>'France'</code> and <code>'1821'</code> in a given chat is not trivial. Just as before, failure to recognize relationships between data makes our cache less generalizable.</p>
<h2 id="heading-bindings">Bindings</h2>
<p>Butter performs a number of symbolic manipulations when augmenting, comparing to, and generating from its cache. This functionality is necessary for Butter to make use of variables in its stored workflows. Instead of filling the cache with observed messages written verbatim (e.g., <code>'Say hello to Erik'</code>), it stores a combination of <strong>bindings</strong> paired with a message <strong>template</strong>. The bindings specify how each variable maps to a corresponding value (e.g., <code>{'name': 'Erik'}</code>). The template is generated by substituting all instances of dynamic data with their corresponding variable’s name (e.g., <code>"Say hello to {{name}}"</code>). This way, applying the bindings to the template reproduces the original message.</p>
<p>We describe this approach as <strong>template-aware caching</strong>. Templates do not limit our ability to compare new queries to the cache: an incoming query is compared to an existing template using regex and exact matching. If the regex recognizes that the query follows the same structure as the template, it is straightforward to read off the values that each of the expected variables should take on in this query. Assuming no contradictory assignments arise, these fresh bindings are then used to populate the cached response to this query. This is how Butter adds determinism to LLM calls which follow a recognized structure and contain dynamic data.</p>
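<p>The round trip between templates and bindings can be sketched in a few lines of Python (a toy illustration; Butter’s production matcher additionally handles tool calls, delimiter edge cases, and conflict detection):</p>

```python
import re

def make_template(message: str, bindings: dict) -> str:
    """Replace every bound value with its {{variable}} placeholder."""
    for name, value in bindings.items():
        message = message.replace(value, "{{" + name + "}}")
    return message

def match_template(template: str, message: str):
    """Recover bindings from a new message, or None on a structural miss."""
    parts = re.split(r"\{\{(\w+)\}\}", template)  # literal, name, literal, ...
    pattern, seen = "", []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)          # structural text: exact match
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable must agree
        else:
            seen.append(part)
            pattern += f"(?P<{part}>.+?)"       # capture the variable's value
    m = re.fullmatch(pattern, message)
    return m.groupdict() if m else None
```

<p>Applying the recovered bindings to the cached response template then reproduces a response tailored to the new query.</p>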
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759294658832/cda3abde-f678-4de4-9221-bf869e3cff0f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-inferring-bindings">Inferring Bindings</h2>
<p>In practice, bindings must either be specified or inferred. Currently, Butter’s <code>Butter-Bindings</code> header allows users to specify bindings explicitly from the start. These so-called <em>top-level bindings</em> help to avoid guesswork, but are not always feasible to provide.</p>
<p>For our method to be effective, we should also be able to infer bindings from chats. We have discussed how arguments passed to tool calls are very likely to have made use of dynamic data. For instance, in the tool call: <code>read_latest_email_from(sender = '</code><a target="_blank" href="mailto:example@butter.dev"><code>example@butter.dev</code></a><code>')</code>, we find an email address which almost surely contains some dynamic data.</p>
<p>Bindings may also be derived deterministically via regex or substring matches. By comparing multiple observed trajectories, we might identify locations in which data was used in the same manner across each run.</p>
<p>Still, large language models may be best suited for the job of automatically detecting (in post) any relevant bindings that were not already detected via the other deterministic methods.</p>
<h2 id="heading-what-butter-already-does">What Butter Already Does</h2>
<p>As an LLM proxy, Butter forwards requests to inference providers and caches responses. On repeat requests, responses are served immediately, bypassing wasteful generations. In its current implementation, Butter performs template-aware caching, serving responses based on structural similarity rather than requiring exact matches.</p>
<p>You can simply modify your LLM client or your curl command to point at Butter’s custom base URL:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, json, httpx
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

client = OpenAI(
    base_url=<span class="hljs-string">"https://proxy.butter.dev/v1"</span>,
    http_client= httpx.Client(
        headers={<span class="hljs-string">"Butter-Auth"</span>: <span class="hljs-string">f"Bearer <span class="hljs-subst">{os.getenv(<span class="hljs-string">'BUTTER_API_KEY'</span>)}</span>"</span>},
    ),
)

<span class="hljs-comment"># Specify any bindings</span>
bindings = {
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"Erik"</span>
}

<span class="hljs-comment"># Create cache</span>
response = client.chat.completions.create(
    model=<span class="hljs-string">"gpt-4o"</span>,
    messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">"say hello to Erik"</span>}],
    extra_headers={<span class="hljs-string">"Butter-Bindings"</span>: json.dumps(bindings)},
)

print(response)
</code></pre>
<pre><code class="lang-bash">curl -X POST <span class="hljs-variable">$BASE_URL</span>/v1/chat/completions \
-H <span class="hljs-string">"Content-Type: application/json"</span> \
-H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">$OPENAI_API_KEY</span>"</span> \
-H <span class="hljs-string">"Butter-Auth: Bearer <span class="hljs-variable">$BUTTER_API_KEY</span>"</span> \
-H <span class="hljs-string">"Butter-Bindings: {\"name\": \"Erik\"}"</span> \
-d <span class="hljs-string">'{"messages":[{"content":"say hello to Erik","role":"user"}],"model":"gpt-4o"}'</span>
</code></pre>
<p>The code examples above show how to tell Butter to cache templates rather than exact messages by specifying top-level <em>Butter bindings</em>.</p>
<p>Whenever Butter caches a new message that involves bindings, it builds the template by replacing every instance of a bound value with its corresponding variable. These replacements can occur anywhere in a message, including in tool calls. Likewise, whenever Butter recognizes a match between a cached template and an incoming query, it uses regex substring matching to infer the bindings expected for that template.</p>
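<p>To make this concrete, here is a minimal standalone sketch of both halves of that process: building a template from bindings, and inferring bindings back out of a query with non-greedy regex groups. This illustrates the idea only; it is not Butter’s actual implementation.</p>

```python
import re

def build_template(message, bindings):
    # Replace every occurrence of each bound value with its variable.
    for name, value in bindings.items():
        message = message.replace(value, "{{" + name + "}}")
    return message

def infer_bindings(template, message):
    # Compile the template into a regex: the first occurrence of a variable
    # becomes a non-greedy named group, and repeats become backreferences.
    pattern, seen = re.escape(template), set()
    for name in re.findall(r"\{\{(\w+)\}\}", template):
        token = re.escape("{{" + name + "}}")
        group = f"(?P={name})" if name in seen else f"(?P<{name}>.+?)"
        pattern = pattern.replace(token, group, 1)
        seen.add(name)
    match = re.fullmatch(pattern, message)
    return match.groupdict() if match else None

print(build_template("say hello to Erik", {"name": "Erik"}))
# say hello to {{name}}
print(infer_bindings("say hello to {{name}}", "say hello to Ada"))
# {'name': 'Ada'}
```

<p>The backreference branch also captures the expectation that a repeated variable must take the same value everywhere it appears in the query.</p>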
<h2 id="heading-known-bugs">Known Bugs</h2>
<p>Let’s review a few bugs users should expect to run into with Butter’s current implementation.</p>
<ol>
<li><p>The exact string matching that Butter currently uses poses a few challenges.</p>
<ul>
<li><p><em>False negatives</em>: Even slight variations in letter case (<code>Erik</code> vs. <code>erik</code>) or numeric precision (<code>1</code> vs. <code>1.0</code>) will fail to match.</p>
</li>
<li><p><em>False positives</em>: Whenever building a template from a query, the matcher will replace every substring that matches a bound value, which is an error when those occurrences have no semantic relationship to the binding.</p>
</li>
<li><p><em>Naming conflicts</em>: The wrapper syntax used to name variables in templates could conflict with other agent frameworks’ own templating.</p>
</li>
</ul>
</li>
<li><p>Another issue follows from how exact matching is implemented with regex: <em>bound values should be separated by delimiters</em>. Otherwise, Butter’s cached responses may diverge from the expected behavior. For instance, suppose we ask the model to <code>say butterfly 3 times</code> while specifying the bindings <code>prefix = butter</code> and <code>suffix = fly</code>:</p>
<pre><code class="lang-bash"> curl -X POST <span class="hljs-variable">$BASE_URL</span>/v1/chat/completions \
 -H <span class="hljs-string">"Content-Type: application/json"</span> \
 -H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">$OPENAI_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Auth: Bearer <span class="hljs-variable">$BUTTER_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Bindings: {\"prefix\": \"butter\", \"suffix\": \"fly\"}"</span> \
 -d <span class="hljs-string">'{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'</span>

 <span class="hljs-comment"># response: "butterfly butterfly butterfly"</span>
</code></pre>
<p> In this case, Butter will add a node into its cache with the specified bindings <code>{prefix: butter, suffix: fly}</code>, and the corresponding template <code>{{prefix}}{{suffix}} {{prefix}}{{suffix}} {{prefix}}{{suffix}}</code>. Now, if we try running the command again:</p>
<pre><code class="lang-bash"> curl -X POST <span class="hljs-variable">$BASE_URL</span>/v1/chat/completions \
 -H <span class="hljs-string">"Content-Type: application/json"</span> \
 -H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">$OPENAI_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Auth: Bearer <span class="hljs-variable">$BUTTER_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Bindings: {\"prefix\": \"butter\", \"suffix\": \"fly\"}"</span> \
 -d <span class="hljs-string">'{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'</span>

 <span class="hljs-comment"># error: failed to query tree</span>
</code></pre>
<p> This error occurs because, when Butter compares the query <code>"say butterfly 3 times"</code> against the existing template, the regex must choose some way of decomposing <code>butterfly</code> into <code>{{prefix}}{{suffix}}</code> (the current implementation uses non-greedy regex, which chooses <code>prefix = b</code> and <code>suffix = utterfly</code>). These assignments disagree with the specified bindings of <code>prefix = butter</code> and <code>suffix = fly</code>, producing this error.</p>
<p> Instead, we could have run the above command again <em>sans</em> any bindings:</p>
<pre><code class="lang-bash"> curl -X POST <span class="hljs-variable">$BASE_URL</span>/v1/chat/completions \
 -H <span class="hljs-string">"Content-Type: application/json"</span> \
 -H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">$OPENAI_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Auth: Bearer <span class="hljs-variable">$BUTTER_API_KEY</span>"</span> \
 -d <span class="hljs-string">'{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'</span>

 <span class="hljs-comment"># response: "butterutterfly, butterfly, butterfly"</span>
</code></pre>
<p> This time, Butter succeeds in matching the query to the stored template, so it uses the inferred bindings <code>prefix = b</code> and <code>suffix = utterfly</code> to produce <code>"butterutterfly, butterfly, butterfly"</code>. This isn’t quite what we had in mind.</p>
</li>
<li><p>Butter may fail to recognize the underlying interdependencies between data, making it worse at generalizing to unseen trajectories. In the example described above about getting the weather in Boston, Butter would fail to recognize that the tool argument <code>MA</code> was generated by the LLM because the city was Boston. Consider what this means for the following command:</p>
<pre><code class="lang-bash"> curl -X POST <span class="hljs-variable">$BASE_URL</span>/v1/chat/completions \
 -H <span class="hljs-string">"Content-Type: application/json"</span> \
 -H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">$OPENAI_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Auth: Bearer <span class="hljs-variable">$BUTTER_API_KEY</span>"</span> \
 -H <span class="hljs-string">"Butter-Bindings: {\"city\": \"Boston\"}"</span> \
 -d <span class="hljs-string">'{"messages":[{"content":"Tell me the weather in Boston","role":"user"}],"model":"gpt-4o"}'</span>

 <span class="hljs-comment"># the model next chooses to call the tool: get_weather('Boston', 'MA')</span>
</code></pre>
<p> Butter would naively cache the template <code>get_weather({{city}}, 'MA')</code> for the tool call, which would fail to generalize to cities outside Massachusetts.</p>
</li>
</ol>
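<p>The ambiguity behind the second bug is easy to reproduce with plain Python regexes, independent of Butter: when two non-greedy groups sit back to back with no delimiter, the regex engine has to pick an arbitrary split.</p>

```python
import re

# Template for "say butterfly 3 times" with adjacent variables and no
# delimiter between them: back-to-back non-greedy groups.
joined = re.fullmatch(
    r"say (?P<prefix>.+?)(?P<suffix>.+?) 3 times",
    "say butterfly 3 times",
)
print(joined.group("prefix"), joined.group("suffix"))
# b utterfly

# With a delimiter between the bound values, the split is unambiguous.
delimited = re.fullmatch(
    r"say (?P<prefix>.+?)-(?P<suffix>.+?) 3 times",
    "say butter-fly 3 times",
)
print(delimited.group("prefix"), delimited.group("suffix"))
# butter fly
```

<p>Non-greedy quantifiers take the shortest match that lets the rest of the pattern succeed, which is why <code>prefix</code> collapses to a single character in the undelimited case.</p>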
<h2 id="heading-what-comes-next">What Comes Next</h2>
<p>There are many ways in which Butter could be improved. Here are some of the directions we will explore.</p>
<ol>
<li><p><em>Inferring dynamic data from tool calls</em>: One easy way to infer new dynamic variables is to read the arguments passed into tool calls. Any value not already specified in the Butter bindings could be bound to a new variable; any other instance of that value could then be associated with the same variable.</p>
</li>
<li><p><em>Revising the cache in light of new observations</em>: There is power in seeing more examples: extra trajectories can either reveal the different roles played by distinct variables or strengthen our confidence that a certain piece of data is used in several places throughout a workflow.</p>
</li>
<li><p><em>Intelligence in separating data from code</em>: Many of the issues cited above around separating data from code call for a more intelligent way of manipulating data. For this, we propose <em>code-generation for resolvers</em>: generated resolvers could handle many of the complex data transforms discussed above, such as formatting, combining, filtering, or associating data.</p>
</li>
<li><p><em>Building more sophisticated matchers</em>: Some of the limitations of exact matchers could be addressed with deterministic fuzzy matchers that more flexibly handle case, precision, punctuation, or whitespace. Still, we anticipate that some intelligence is required to match arbitrary chat messages to cached templates.</p>
<p> For example, in learning how to respond to the prompt <code>"Do I have any unread emails?"</code>, we would hope not to store a separate workflow for each possible value of <code>number_of_unread_emails</code> (<code>1</code> vs. <code>2</code> vs. <code>3</code>, etc.). Instead, an appropriate matcher would switch on the condition <code>number_of_unread_emails &gt; 0</code>.</p>
<p> So, building matchers for each query may in general require some creativity.</p>
</li>
<li><p><em>Sub-workflows</em>: Many agent workflows perform data transformation or planning operations that are impossible without intelligence or creativity, which disqualifies them from deterministic replay. Accurately detecting these operations would let us still cache the deterministic sub-workflows between them. Sub-workflows could be implemented in Butter by pointing to other entry points in the cache.</p>
</li>
</ol>
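<p>As a rough sketch of the first direction above, promoting unbound tool-call arguments to new variables might look like the following. The function and the variable-naming scheme here are hypothetical, not part of Butter’s API.</p>

```python
# Hypothetical sketch: promote tool-call arguments that are not already
# bound into new template variables.
def extend_bindings(tool_args, bindings):
    known_values = set(bindings.values())
    extended = dict(bindings)
    for arg_name, value in tool_args.items():
        if value not in known_values:
            # Name the new variable after the tool argument (hypothetical scheme).
            extended[f"tool_{arg_name}"] = value
            known_values.add(value)
    return extended

# get_weather('Boston', 'MA') observed with only {"city": "Boston"} bound:
print(extend_bindings({"city": "Boston", "state": "MA"}, {"city": "Boston"}))
# {'city': 'Boston', 'tool_state': 'MA'}
```

<p>With <code>MA</code> promoted to a variable, any later occurrence of that value in the trajectory could be tied back to the same variable rather than frozen into the template.</p>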
<p>We would love to hear any feedback, ideas, or experiences you have related to Butter!</p>
]]></content:encoded></item></channel></rss>