
Template-Aware Caching


Motivation

Butter is designed not only to memorize static trajectories, but also to capture workflows involving dynamic variables. Here, we think of dynamic variables as placeholders for data that may change from run to run: information that might derive from tools, the environment, or the model’s own creativity. Everything else in the workflow we consider to be the code or structure: how to use the data to complete the task.

Most real-world automations include dynamic variables (e.g., names, dates, addresses), and are thus designed to handle a range of inputs. Workflows may make use of dynamic variables for a number of reasons:

  1. Storing information about the specific run. E.g., storing the user’s name, or storing the root directory in the file system.

  2. Tracking information about the environment. E.g., checking the current time, or reading all the filenames in the current directory.

  3. Determining relevant tools to employ. E.g., choosing the appropriate parser to run on a given file.

We assert that in our domain of workflow automation, the desire for a repeatable workflow implies the notion of “code” in the context window; i.e., the instructions that influence the model’s branching logic decisions to complete a known task. This code is intermixed with runtime unique data. It is our job to distinguish the two.

Each chat between a user and a model represents a single trajectory through an abstract workflow. As Butter observes these trajectories, it is confronted with the problem of distinguishing between dynamic data and structural content within the messages. The better Butter is at extracting “data from code,” the broader the set of future queries that will result in a cache hit. This way, Butter can more robustly serve from its cache and avoid another LLM call.

In this setup, the broader the set of queries a single cache entry is meant to handle, the greater the burden there is on Butter to correctly model how this data is derived and transformed.

Obstacles to Separating Data from Code

The general problem of separating data from code involves many considerations.

How to detect when an argument to a tool is intended to change run-to-run?

As we observe LLMs making use of provided tools to complete tasks, we might expect the model to make use of some dynamic data when providing arguments to its tool calls. Indeed, it seems reasonable to assume that whenever an argument is passed into a tool call, it represents dynamic data.

For instance, suppose that the user specifies that they live in Boston, and asks about the weather there. With access to the tool get_weather(city: str, state: str) -> str, the LLM could produce the tool call get_weather('Boston', 'MA').

Here, both the city and state arguments require some dynamic data, and it will be the job of Butter to figure out how to derive their values. As the user explicitly wrote that they live in Boston, the substring 'Boston' is already present in the model’s context, so we know where in the prompt to source the city information. But nowhere did the user specify Massachusetts as their state; so while the string 'MA' might be theoretically derivable from the city of Boston, Butter has no clear source from which to derive it syntactically. In reality, the argument 'MA' was generated by the LLM, which understands Boston to be a city in Massachusetts. How is Butter to understand this relationship between Boston and MA?

On the other hand, some tool calls might involve variables which should not change run-to-run. For example, an encoding format like png may not depend on any dynamic data, and instead is a structural part of the workflow.

How to avoid associating identical data to the same variable when they’re truly unrelated?

Coincidences happen, but recognizing them is not so straightforward. For instance, which data in the following query are related?

Today is September 30, 2025. Find the third Python script in the directory /source when sorted alphabetically, interpret it with python3, and save the output in 09/output_3.txt.

A naïve approach to separating data from code might replace all instances of the number 3 with the same variable, ignoring the various, distinct roles played by the number three in this query. Any time we group together unrelated data, our cached template will fail to generalize to other situations.

How to recognize instances of the same underlying variable throughout a message?

While the previous point might be described as a concern for false positives, we might also be concerned about false negatives. That is, we worry about failing to recognize how the same variable was used across a conversation.

Data may undergo transformations which are syntactically simple: e.g., turning a string into all lower-case (Erik → erik), rewriting a number (1 → 1.0), or stringifying a JSON object.

Data may also be filtered: e.g., a variable corresponding to the user’s full name will necessarily contain the data required to obtain both that user’s first name and last name. Either of these partial names may be used throughout the context without writing out the full name.

Other transformations involve more intelligence to execute. For instance, we might view 'MA' from the example above as the result of a transformation of the form state_from_city('Boston'). Plenty of other examples require the same level of insight, such as knowing the industry in which a given company operates, computing the date associated with a given holiday, expanding a given acronym, naming the artist behind a given song, producing antonyms for given words, etc.

No matter the transformations that may have been applied to some data, we still consider each of its representations as being associated to the same underlying variable. But recognizing that the variable leader = 'Napoleon' explains both 'France' and '1821' in a given chat is not trivial. Just as before, failure to recognize relationships between data makes our cache less generalizable.
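To make the weather example concrete: 'MA' is not copied from the prompt, it is derived from 'Boston' using world knowledge. A resolver for this transform might look like the following sketch, where a lookup table stands in for real intelligence (the function name state_from_city comes from the example above; the table is purely illustrative):

```python
# 'MA' is derived from 'Boston' by world knowledge, not substring matching.
# The lookup table below stands in for an intelligent resolver (illustrative only).
def state_from_city(city):
    return {"Boston": "MA", "Austin": "TX", "Seattle": "WA"}[city]

print(state_from_city("Boston"))  # -> MA
```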

Bindings

Butter performs a number of symbolic manipulations when augmenting, comparing to, and generating from its cache. This functionality is necessary for Butter to make use of variables in its stored workflows. Instead of filling the cache with observed messages written verbatim (e.g., 'Say hello to Erik'), it stores a combination of bindings paired with a message template. The bindings specify how each variable maps to a corresponding value (e.g., {'name': 'Erik'}). The template is generated by substituting all instances of dynamic data with their corresponding variable’s name (e.g., "Say hello to {{name}}"). This way, applying the bindings to the template reproduces the original message.
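A minimal sketch of this round trip, assuming plain substring substitution and the {{name}} variable syntax shown above (illustrative only, not Butter's implementation):

```python
# Build a template by substituting bound values with their variable names,
# and reproduce the original message by applying the bindings back.
def build_template(message, bindings):
    for name, value in bindings.items():
        message = message.replace(str(value), "{{" + name + "}}")
    return message

def apply_bindings(template, bindings):
    for name, value in bindings.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template

bindings = {"name": "Erik"}
template = build_template("Say hello to Erik", bindings)
print(template)                            # Say hello to {{name}}
print(apply_bindings(template, bindings))  # Say hello to Erik
```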

We describe this approach as template-aware caching. Templates do not limit our ability to compare new queries to the cache: an incoming query is compared to an existing template using regex and exact-matching. If regex recognizes the query to follow the same structure as the template, it is straightforward to read off the values that each of the expected variables from the bindings should take on in this query. Assuming this process goes through without any contradictory assignments, it is now straightforward to use these bindings to populate the cached response to this query. This is how Butter adds determinism to LLM calls which follow a recognized structure and contain dynamic data.
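One way such a comparison could work is sketched below: compile the template into a regex with a named group per variable, use backreferences to reject contradictory assignments, and read the bindings off a successful match. This is an illustrative sketch, not Butter's actual matcher:

```python
import re

# Compare an incoming query to a cached template. Each variable becomes a
# named capture group; a repeated variable becomes a backreference, so
# contradictory assignments fail the match.
def match_template(template, query):
    pattern, seen = "", set()
    for part in re.split(r"(\{\{\w+\}\})", template):
        if part.startswith("{{"):
            name = part[2:-2]
            if name in seen:
                pattern += f"(?P={name})"      # repeated variable: values must agree
            else:
                seen.add(name)
                pattern += f"(?P<{name}>.+?)"  # first occurrence: non-greedy capture
        else:
            pattern += re.escape(part)
    m = re.fullmatch(pattern, query)
    return m.groupdict() if m else None

print(match_template("Say hello to {{name}}", "Say hello to Erik"))  # {'name': 'Erik'}
print(match_template("{{x}} and {{x}}", "A and B"))                  # None (contradiction)
```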

Inferring Bindings

In practice, bindings must either be specified or inferred. Currently, Butter’s Butter-Bindings header allows users to specify bindings explicitly from the start. These so-called top-level bindings help to avoid guesswork, but are not always feasible to provide.

For our method to be effective, we should also be able to infer bindings from chats. We have discussed how arguments passed to tool calls are very likely to have made use of dynamic data. For instance, in the tool call: read_latest_email_from(sender = 'example@butter.dev'), we find an email address which almost surely contains some dynamic data.
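A simple inference rule along these lines might treat any tool-call argument whose value is not already explained by the known bindings as a fresh variable. This is a hypothetical sketch; the function and argument names are illustrative:

```python
# Infer new dynamic variables from tool-call arguments. Any argument value
# not already explained by the known bindings becomes a fresh variable named
# after the argument. (Hypothetical sketch; not Butter's API.)
def infer_bindings_from_tool_call(args, known):
    inferred = dict(known)
    for arg_name, value in args.items():
        if value not in known.values():
            inferred[arg_name] = value
    return inferred

# The email address is almost surely dynamic data, so it gets bound:
print(infer_bindings_from_tool_call({"sender": "example@butter.dev"}, known={}))
# -> {'sender': 'example@butter.dev'}
```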

Bindings may also be derived deterministically via regex or substring matches. By comparing multiple observed trajectories, we might identify locations in which data was used in the same manner across each run.
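As a sketch of this idea, aligning two runs token by token and flagging the positions that differ already surfaces candidate variables (illustrative only; a real implementation would need proper sequence alignment to handle insertions and deletions):

```python
# Align two observed runs of the same workflow token by token; positions
# that differ are candidates for dynamic variables.
def diff_trajectories(run_a, run_b):
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(run_a.split(), run_b.split()))
            if a != b]

print(diff_trajectories("Say hello to Erik", "Say hello to Dana"))
# -> [(3, 'Erik', 'Dana')]
```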

Still, large language models may be best suited for the job of automatically detecting, after the fact, any relevant bindings that were not already caught by the deterministic methods above.

What Butter Already Does

As an LLM proxy, Butter forwards requests to inference providers and caches responses. On repeat requests, responses are served immediately, bypassing wasteful generations. In its current implementation, Butter performs template-aware caching, serving responses based on structural similarity rather than requiring exact matches.

You can simply point your LLM client or your curl command at Butter’s custom base URL:

import os, json, httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.butter.dev/v1",
    http_client=httpx.Client(
        headers={"Butter-Auth": f"Bearer {os.getenv('BUTTER_API_KEY')}"},
    ),
)

# Specify any bindings
bindings = {
    "name": "Erik"
}

# Create cache
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "say hello to Erik"}],
    extra_headers={"Butter-Bindings": json.dumps(bindings)},
)

print(response)

Or equivalently with curl:

curl -X POST $BASE_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Butter-Auth: Bearer $BUTTER_API_KEY" \
-H "Butter-Bindings: {\"name\": \"Erik\"}" \
-d '{"messages":[{"content":"say hello to Erik","role":"user"}],"model":"gpt-4o"}'

The above code examples show how the user can tell Butter to cache templates rather than exact messages by specifying top-level Butter bindings.

Whenever Butter caches a new message which involved some bindings, it builds the template by replacing all instances of bound values by their corresponding variables. These replacements can occur anywhere in a message, including in tool calls. Additionally, whenever Butter recognizes a match between a cached template and an incoming query, it uses regex substring matching to infer the proper bindings as expected for that template.

Known Bugs

Let’s review a few bugs users should expect to run into with Butter’s current implementation.

  1. The exact string matching that Butter currently uses poses a few challenges.

    • False negatives: Even slight modifications to a letter’s case (Erik vs. erik) or a number’s precision (1 vs 1.0) will fail to match.

    • False positives: Whenever building a template from a query, the matcher will replace any string that matches the bound value—an error if those values had no semantic relationship.

    • Naming conflicts: The {{...}} delimiters Butter uses to name variables in templates could conflict with other agent frameworks’ own template syntax.

  2. Another issue follows from how exact matching is implemented with regex: bound values should be separated by delimiters. Otherwise, Butter’s cached responses might diverge from the expected behavior. For instance, suppose we were to ask the model to say butterfly 3 times while specifying the bindings prefix = butter and suffix = fly:

     curl -X POST $BASE_URL/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     -H "Butter-Auth: Bearer $BUTTER_API_KEY" \
     -H "Butter-Bindings: {\"prefix\": \"butter\", \"suffix\": \"fly\"}" \
     -d '{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'
    
     # response: "butterfly butterfly butterfly"
    

    In this case, Butter will add a node into its cache with the specified bindings {prefix: butter, suffix: fly}, and the corresponding template {{prefix}}{{suffix}} {{prefix}}{{suffix}} {{prefix}}{{suffix}}. Now, if we try running the command again:

     curl -X POST $BASE_URL/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     -H "Butter-Auth: Bearer $BUTTER_API_KEY" \
     -H "Butter-Bindings: {\"prefix\": \"butter\", \"suffix\": \"fly\"}" \
     -d '{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'
    
     # error: failed to query tree
    

    This error occurs because, as Butter compares the query "say butterfly 3 times" to the existing template, regex must default to some way of decomposing butterfly into {{prefix}}{{suffix}} (the current implementation uses non-greedy regex, which chooses prefix = b and suffix = utterfly). These assignments then disagree with the specified bindings of prefix = butter and suffix = fly, leading to this error.

    Instead, we could have run the above command again sans any bindings:

     curl -X POST $BASE_URL/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     -H "Butter-Auth: Bearer $BUTTER_API_KEY" \
     -d '{"messages":[{"content":"say butterfly 3 times","role":"user"}],"model":"gpt-4o"}'
    
     # response: "butterutterfly, butterfly, butterfly"
    

    This time, Butter succeeds in matching this query to the stored template, and so it uses the inferred bindings prefix = b and suffix = utterfly to produce "butterutterfly, butterfly, butterfly". This isn’t quite what we had in mind.

  3. Butter may fail to recognize the underlying interdependencies between data, making it worse at generalizing to unseen trajectories. In the example described above about getting the weather in Boston, Butter would fail to recognize that the tool argument MA was generated by the LLM in light of the city being Boston. Consider what this means for the following command:

     curl -X POST $BASE_URL/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $OPENAI_API_KEY" \
     -H "Butter-Auth: Bearer $BUTTER_API_KEY" \
     -H "Butter-Bindings: {\"city\": \"Boston\"}" \
     -d '{"messages":[{"content":"Tell me the weather in Boston","role":"user"}],"model":"gpt-4o"}'
    
     # the model next chooses to call the tool: get_weather('Boston', 'MA')
    

    Butter would naively cache the template get_weather({{city}}, 'MA') for the tool call, which would fail to generalize to cities outside of Massachusetts.
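The non-greedy decomposition behind the butterfly example in bug 2 can be reproduced in a couple of lines of Python:

```python
import re

# With no delimiter between {{prefix}} and {{suffix}}, the regex must pick
# one of many valid splits of "butterfly"; non-greedy matching takes the
# shortest possible prefix.
m = re.fullmatch(r"(?P<prefix>.+?)(?P<suffix>.+?)", "butterfly")
print(m.group("prefix"), m.group("suffix"))  # -> b utterfly
```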

What Comes Next

There are many ways in which Butter could be improved. Here are some of the directions we will explore.

  1. Inferring dynamic data from tool calls: one easy way to infer new dynamic variables is by reading the arguments passed into tool calls. Any value not already specified in the Butter bindings could be bound to a new variable; any other instance of that value could then be associated to that same variable.

  2. Revising the cache in light of new observations: There is power in seeing more examples: extra trajectories can either reveal different roles played by distinct variables, or add to our confidence that a certain piece of data is being used in several places throughout a workflow.

  3. Intelligence in separating data from code: Many of the issues we’ve cited regarding separating data from code call for a more intelligent way of manipulating data. For this, we propose code-generation for resolvers. Resolvers could be generated to handle many of the complex data transforms discussed above such as formatting, combining, filtering, or associating data.

  4. Building more sophisticated matchers: Some of the limitations of exact matchers could be addressed with deterministic fuzzy matchers that more flexibly handle case, precision, punctuation, or whitespace. Still, we anticipate some intelligence is required to generally match messages in a chat to cached templates.

    For example, in learning how to respond to the prompt "Do I have any unread emails?", we might hope not to store separate workflows for each possible value of number_of_unread_emails (1 vs. 2 vs. 3, etc.). Instead, an appropriate matcher in this case would switch on the condition number_of_unread_emails > 0.

    So, building matchers for each query may in general require some creativity.

  5. Sub-workflows: Many agent workflows will perform data transformation or planning operations that are impossible without intelligence or creativity, which disqualifies them from deterministic replay. Accurately detecting these operations would allow us to still cache the deterministic sub-workflows between them. Sub-workflows could be implemented in Butter by pointing to other entry-points in the cache.
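As a sketch of the deterministic fuzzy matchers proposed in direction 4, normalizing case, numeric precision, and whitespace before comparing would already resolve the false negatives listed under Known Bugs (illustrative only):

```python
# A deterministic fuzzy matcher: normalize case, numeric precision, and
# whitespace before exact comparison.
def normalize(token):
    token = token.strip().lower()
    try:
        return repr(float(token))  # "1" and "1.0" both normalize to '1.0'
    except ValueError:
        return token

def fuzzy_equal(a, b):
    return [normalize(t) for t in a.split()] == [normalize(t) for t in b.split()]

print(fuzzy_equal("Erik", "erik"))  # True
print(fuzzy_equal("1", "1.0"))      # True
```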

We would love to hear any feedback, ideas, or experiences you have related to Butter!
