Prompt Caching

Enable caching for long prompts that are reused across requests. Later requests that hit the cache can reduce input cost by up to about 90% and latency by up to about 85%.

import anthropic

client = anthropic.Anthropic(api_key="your-api-key", base_url="https://aisupermarket.work")

# Load the long document that will be reused across requests.
with open("long_document.txt") as f:
    document = f.read()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a document analysis assistant."},
        {
            "type": "text",
            "text": document,
            # Marks the end of the reusable prefix; everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Please summarize the main points of this document"}]
)

print(response.content[0].text)
# Nonzero on the request that writes the cache.
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
# Nonzero on later requests that read from the cache.
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
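The two usage counters printed above show whether a request wrote to the cache or read from it. A minimal sketch of how they might be interpreted follows. The helper name `describe_cache_usage` is hypothetical (it is not part of the SDK), and the handling of `None` assumes that a counter may be absent on some responses.

```python
def describe_cache_usage(cache_creation_tokens, cache_read_tokens):
    """Classify a response by its cache counters (hypothetical helper, not an SDK call)."""
    # Treat a missing counter (None) as zero; this is an assumption about the response shape.
    created = cache_creation_tokens or 0
    read = cache_read_tokens or 0
    if read > 0 and created > 0:
        return "partial hit"   # part of the prefix was read, part was newly written
    if read > 0:
        return "hit"           # the cached prefix was reused
    if created > 0:
        return "write"         # the prefix was stored for later requests
    return "no caching"        # nothing was cached or read


print(describe_cache_usage(1800, 0))  # e.g. the first request with a long document
print(describe_cache_usage(0, 1800))  # e.g. a repeat request while the cache is alive
```

With a real response, the arguments would be `response.usage.cache_creation_input_tokens` and `response.usage.cache_read_input_tokens`.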

Cache Tool Definitions in Multi-Turn Chat

import anthropic

client = anthropic.Anthropic(api_key="your-api-key", base_url="https://aisupermarket.work")

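tools = [
    {
        "name": "search",
        "description": "Search the knowledge base",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        },
        # Placed on the last tool so the whole tool list is cached as one prefix.
        "cache_control": {"type": "ephemeral"}
    }
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Search for content about Python"}]
)

# The reply may contain a tool_use block instead of text, so check each block's type.
for block in response.content:
    if block.type == "text":
        print(block.text)
    elif block.type == "tool_use":
        print(f"Tool call: {block.name}({block.input})")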
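The request above covers a single turn. In a multi-turn chat, the same tool list would be sent unchanged on every request so that the cached prefix stays identical from turn to turn. A sketch of that idea follows, assuming a hypothetical `build_turn` helper that only assembles the keyword arguments for `client.messages.create` and never calls the API itself.

```python
def build_turn(tools, history, user_text, model="claude-opus-4-7", max_tokens=1024):
    """Assemble request arguments for one chat turn (hypothetical helper)."""
    # Copy the history so the caller's list is not mutated.
    messages = list(history) + [{"role": "user", "content": user_text}]
    # The tools list is passed through untouched: any change to it would alter the cached prefix.
    return {"model": model, "max_tokens": max_tokens, "tools": tools, "messages": messages}


tools = [
    {
        "name": "search",
        "description": "Search the knowledge base",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        "cache_control": {"type": "ephemeral"},
    }
]

first = build_turn(tools, [], "Search for content about Python")
# Illustrative history only; a real assistant reply would come from the API response.
history = first["messages"] + [{"role": "assistant", "content": "Here is what I found."}]
second = build_turn(tools, history, "Now narrow it to async features")

print(len(first["messages"]), len(second["messages"]))
print(first["tools"] == second["tools"])
```

Each result could then be passed as `client.messages.create(**first)`. Only the `messages` list grows between turns, while the tool definitions stay the same.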

Cache lifetime ranges from a minimum of 5 minutes to a maximum of 1 hour. The cached prefix must contain at least 1024 tokens (2048 for Haiku models); shorter prompts are not cached.
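The limits above can be checked before a block is marked for caching. The sketch below rests on several assumptions: the helper names are hypothetical, the token count is supplied by the caller, Haiku models are detected by the substring "haiku" in the model name, and the 1-hour lifetime is requested with a `"ttl": "1h"` field inside `cache_control`, which the endpoint in use may or may not accept.

```python
MIN_CACHEABLE_TOKENS = 1024        # from the text above
MIN_CACHEABLE_TOKENS_HAIKU = 2048  # from the text above


def min_cacheable_tokens(model):
    """Return the minimum cacheable length for a model (hypothetical helper)."""
    # Assumption: Haiku models can be recognized by their name.
    if "haiku" in model.lower():
        return MIN_CACHEABLE_TOKENS_HAIKU
    return MIN_CACHEABLE_TOKENS


def text_block(text, token_count, model, one_hour=False):
    """Build a text block, adding cache_control only when the prompt is long enough."""
    block = {"type": "text", "text": text}
    if token_count >= min_cacheable_tokens(model):
        cache_control = {"type": "ephemeral"}
        if one_hour:
            # Assumption: the longer lifetime is selected with a "ttl" field.
            cache_control["ttl"] = "1h"
        block["cache_control"] = cache_control
    return block


print(text_block("short prompt", 300, "claude-opus-4-7"))
print(text_block("long document", 5000, "claude-opus-4-7", one_hour=True))
```

The first call returns a plain block because 300 tokens is below the minimum, while the second carries a `cache_control` entry.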