Model caches
This notebook covers how to cache results of individual LLM calls using different caches.
First, let's install some dependencies:
%pip install -qU langchain-openai langchain-community
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass()
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
# To make the caching really obvious, let's use a slower and older model.
# Caching supports newer chat models as well.
llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)
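Each cache in this notebook follows the same basic pattern: look up the prompt, together with a string identifying the model configuration, and only call the model on a miss. The sketch below illustrates that pattern with a plain dictionary. It is a conceptual stand-in, not LangChain's implementation: ExactMatchCache, slow_model, and cached_invoke are hypothetical names, and slow_model merely sleeps to imitate network latency.

```python
import time


class ExactMatchCache:
    """Toy in-memory cache keyed on (prompt, llm_string). Hypothetical helper."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_string):
        # Returns None on a miss
        return self._store.get((prompt, llm_string))

    def update(self, prompt, llm_string, response):
        self._store[(prompt, llm_string)] = response


def slow_model(prompt):
    """Hypothetical stand-in for a real LLM call."""
    time.sleep(0.05)
    return f"response to: {prompt}"


def cached_invoke(cache, prompt, llm_string="demo-model"):
    """Return (response, was_cache_hit)."""
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit, True
    response = slow_model(prompt)
    cache.update(prompt, llm_string, response)
    return response, False


cache = ExactMatchCache()
first, first_was_hit = cached_invoke(cache, "Tell me a joke")
second, second_was_hit = cached_invoke(cache, "Tell me a joke")
other, other_was_hit = cached_invoke(cache, "Tell me a joke", llm_string="other-model")
```

The second call with the same prompt and model string is served from the dictionary, while changing the model string produces a miss.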
In Memory
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache())
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.57 ms, sys: 8.22 ms, total: 15.8 ms
Wall time: 649 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 551 µs, sys: 221 µs, total: 772 µs
Wall time: 1.23 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
SQLite
!rm -f .langchain.db
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 12.6 ms, sys: 3.51 ms, total: 16.1 ms
Wall time: 486 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.6 ms, sys: 57.7 ms, total: 110 ms
Wall time: 113 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
Upstash Redis
Standard cache
Use Upstash Redis to cache prompts and responses with a serverless HTTP API.
%pip install -qU upstash_redis
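from langchain_community.cache import UpstashRedisCache
from upstash_redis import Redis
# Placeholders: replace with the REST URL and token of your Upstash Redis database
URL = "<UPSTASH_REDIS_REST_URL>"
TOKEN = "<UPSTASH_REDIS_REST_TOKEN>"
set_llm_cache(UpstashRedisCache(redis_=Redis(url=URL, token=TOKEN)))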
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.56 ms, sys: 2.98 ms, total: 10.5 ms
Wall time: 1.14 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 2.78 ms, sys: 1.95 ms, total: 4.73 ms
Wall time: 82.9 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
Semantic cache
Use Upstash Vector to run a semantic similarity search and return the most similar cached response in the database. Vectorization is performed automatically by the embedding model selected when the Upstash Vector database was created.
%pip install upstash-semantic-cache
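from langchain.globals import set_llm_cache
from upstash_semantic_cache import SemanticCache
# Placeholder credentials. The url, token, and min_proximity arguments are assumed
# from the upstash-semantic-cache package; check its README before relying on them.
UPSTASH_VECTOR_REST_URL = "<UPSTASH_VECTOR_REST_URL>"
UPSTASH_VECTOR_REST_TOKEN = "<UPSTASH_VECTOR_REST_TOKEN>"
cache = SemanticCache(url=UPSTASH_VECTOR_REST_URL, token=UPSTASH_VECTOR_REST_TOKEN, min_proximity=0.7)
set_llm_cache(cache)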
llm.invoke("Which city is the most crowded city in the USA?")
CPU times: user 28.4 ms, sys: 3.93 ms, total: 32.3 ms
Wall time: 1.89 s
'\n\nNew York City is the most crowded city in the USA.'
llm.invoke("Which city has the highest population in the USA?")
CPU times: user 3.22 ms, sys: 940 μs, total: 4.16 ms
Wall time: 97.7 ms
'\n\nNew York City is the most crowded city in the USA.'
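A semantic cache returns a stored response when a new prompt is close enough to a cached one, rather than identical to it. The sketch below shows the idea with a toy bag-of-words vector and cosine similarity. It is an illustration under simplifying assumptions, not how Upstash Vector works: ToySemanticCache, embed, cosine, and the min_similarity threshold of 0.6 are all hypothetical, and a real cache would use a learned embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    def __init__(self, min_similarity=0.6):
        self.min_similarity = min_similarity
        self._entries = []  # list of (vector, response) pairs

    def update(self, prompt, response):
        self._entries.append((embed(prompt), response))

    def lookup(self, prompt):
        # Return the best-matching response if it clears the threshold, else None
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.min_similarity else None


cache = ToySemanticCache()
cache.update("Which city is the most crowded city in the USA?", "New York City")
near = cache.lookup("Which city has the highest population in the USA?")
unrelated = cache.lookup("Tell me a joke")
```

The reworded question shares enough words with the cached prompt to clear the threshold, so it is a hit; the unrelated prompt shares none and misses. Raising the threshold trades fewer false hits for more model calls.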
Redis
See the main Redis cache docs for details.
Standard cache
Use Redis to cache prompts and responses.
%pip install -qU redis
# We can do the same thing with a Redis cache
# (make sure your local Redis instance is running first before running this example)
from langchain_community.cache import RedisCache
from redis import Redis
set_llm_cache(RedisCache(redis_=Redis()))
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 6.88 ms, sys: 8.75 ms, total: 15.6 ms
Wall time: 1.04 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 1.59 ms, sys: 610 µs, total: 2.2 ms
Wall time: 5.58 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
Semantic cache
Use Redis to cache prompts and responses and evaluate hits based on semantic similarity.
%pip install -qU redis
from langchain_community.cache import RedisSemanticCache
from langchain_openai import OpenAIEmbeddings
set_llm_cache(
    RedisSemanticCache(redis_url="redis://localhost:6379", embedding=OpenAIEmbeddings())
)
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 351 ms, sys: 156 ms, total: 507 ms
Wall time: 3.37 s
"\n\nWhy don't scientists trust atoms?\nBecause they make up everything."
# The second time, while not a direct hit, the question is semantically similar to the original question,
# so it uses the cached result!
llm.invoke("Tell me one joke")
CPU times: user 6.25 ms, sys: 2.72 ms, total: 8.97 ms
Wall time: 262 ms
"\n\nWhy don't scientists trust atoms?\nBecause they make up everything."
GPTCache
We can use GPTCache for exact-match caching or to cache results based on semantic similarity.
Let's first start with an example of exact match:
%pip install -qU gptcache
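import hashlib
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from langchain_community.cache import GPTCache
def get_hashed_name(name):
    # Hash the LLM description so it is safe to use in a directory name
    return hashlib.sha256(name.encode()).hexdigest()
def init_gptcache(cache_obj: Cache, llm: str):
    # Give each LLM configuration its own exact-match ("map") cache directory
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )
set_llm_cache(GPTCache(init_gptcache))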
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 21.5 ms, sys: 21.3 ms, total: 42.8 ms
Wall time: 6.2 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 571 µs, sys: 43 µs, total: 614 µs
Wall time: 635 µs
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
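The init_gptcache function above hashes the string describing the LLM and uses the digest in the cache directory name, so each model configuration gets its own cache. A small sketch of that naming scheme follows; cache_dir_for and the two configuration strings are made up for illustration.

```python
import hashlib


def get_hashed_name(name):
    # SHA-256 hex digest: deterministic and safe to use in a file-system path
    return hashlib.sha256(name.encode()).hexdigest()


def cache_dir_for(llm_string):
    """Hypothetical helper: directory name for one LLM configuration."""
    return f"map_cache_{get_hashed_name(llm_string)}"


# Two made-up configuration strings that differ only in the model name
dir_a = cache_dir_for("model=a, n=2")
dir_b = cache_dir_for("model=b, n=2")
dir_a_again = cache_dir_for("model=a, n=2")
```

Because the digest is deterministic, restarting the process with the same configuration reuses the same directory, while any change to the configuration string points at a fresh one.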
Let's now show an example of similarity caching:
import hashlib
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain_community.cache import GPTCache
def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()
def init_gptcache(cache_obj: Cache, llm: str):
    # Give each LLM configuration its own similarity cache directory
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")
set_llm_cache(GPTCache(init_gptcache))
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 1.42 s, sys: 279 ms, total: 1.7 s
Wall time: 8.44 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
# This is an exact match, so it finds it in the cache
llm.invoke("Tell me a joke")
CPU times: user 866 ms, sys: 20 ms, total: 886 ms
Wall time: 226 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
# This is not an exact match, but semantically within distance so it hits!
llm.invoke("Tell me joke")
CPU times: user 853 ms, sys: 14.8 ms, total: 868 ms
Wall time: 224 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side.'
MongoDB Atlas
MongoDB Atlas is a fully-managed cloud database available in AWS, Azure, and GCP. It has native support for Vector Search on the MongoDB document data. Use MongoDB Atlas Vector Search to semantically cache prompts and responses.
Standard cache
The standard cache is a simple cache in MongoDB. It does not use semantic caching, nor does it require an index to be created on the collection before generation.
To import this cache, first install the required dependency:
%pip install -qU langchain-mongodb
from langchain_mongodb.cache import MongoDBCache
To use this cache with your LLMs:
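from langchain_core.globals import set_llm_cache
# Placeholders: replace with your own connection string, collection, and database names.
# The standard cache needs no embedding provider.
mongodb_atlas_uri = "<YOUR_CONNECTION_STRING>"
COLLECTION_NAME = "<YOUR_CACHE_COLLECTION_NAME>"
DATABASE_NAME = "<YOUR_DATABASE_NAME>"
set_llm_cache(MongoDBCache(connection_string=mongodb_atlas_uri, collection_name=COLLECTION_NAME, database_name=DATABASE_NAME))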
Semantic cache
Semantic caching allows retrieval of cached prompts based on semantic similarity between the user input and previously cached results. Under the hood, it uses MongoDB Atlas as both a cache and a vector store.
The MongoDBAtlasSemanticCache inherits from MongoDBAtlasVectorSearch and needs an Atlas Vector Search Index defined to work. Please look at the usage example on how to set up the index.
To import this cache:
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
To use this cache with your LLMs:
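from langchain_core.globals import set_llm_cache
# Use any embedding provider; OpenAIEmbeddings stands in here for the test-only FakeEmbeddings
from langchain_openai import OpenAIEmbeddings
mongodb_atlas_uri = "<YOUR_CONNECTION_STRING>"
COLLECTION_NAME = "<YOUR_CACHE_COLLECTION_NAME>"  # placeholder
DATABASE_NAME = "<YOUR_DATABASE_NAME>"  # placeholder
set_llm_cache(MongoDBAtlasSemanticCache(embedding=OpenAIEmbeddings(), connection_string=mongodb_atlas_uri,
    collection_name=COLLECTION_NAME, database_name=DATABASE_NAME))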
To find more resources about using MongoDBSemanticCache visit here
Momento
Use Momento to cache prompts and responses.
Requires installing the momento package:
%pip install -qU momento
You'll need a Momento auth token to use this class. You can pass it to a momento.CacheClient if you'd like to instantiate that directly, pass it as the named parameter auth_token to MomentoCache.from_client_params, or set it as the environment variable MOMENTO_AUTH_TOKEN.
from datetime import timedelta
from langchain_community.cache import MomentoCache
cache_name = "langchain"
ttl = timedelta(days=1)
set_llm_cache(MomentoCache.from_client_params(cache_name, ttl))
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 40.7 ms, sys: 16.5 ms, total: 57.2 ms
Wall time: 1.73 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
# The second time it is, so it goes faster
# When run in the same region as the cache, latencies are single digit ms
llm.invoke("Tell me a joke")
CPU times: user 3.16 ms, sys: 2.98 ms, total: 6.14 ms
Wall time: 57.9 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
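The ttl passed above gives each cached entry a limited lifetime. The sketch below illustrates what a time-to-live implies for lookups; it is a conceptual model only, not how Momento implements expiry. TTLCache is a hypothetical class, and the injectable clock exists just so the example can simulate time passing.

```python
from datetime import datetime, timedelta


class TTLCache:
    """Toy cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl, now=datetime.now):
        self._ttl = ttl
        self._now = now  # injectable clock, so tests need not sleep
        self._store = {}  # key -> (expires_at, value)

    def update(self, key, value):
        self._store[key] = (self._now() + self._ttl, value)

    def lookup(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._now() >= expires_at:
            # Entry is stale: drop it and report a miss
            del self._store[key]
            return None
        return value


# Simulated clock held in a one-element list so the lambda sees updates
current = [datetime(2024, 1, 1)]
cache = TTLCache(ttl=timedelta(days=1), now=lambda: current[0])
cache.update("Tell me a joke", "cached joke")
fresh = cache.lookup("Tell me a joke")

current[0] += timedelta(days=1, seconds=1)  # just past the one-day TTL
expired = cache.lookup("Tell me a joke")
```

A shorter TTL keeps responses fresher at the cost of more model calls; a longer one does the opposite.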
SQLAlchemy
You can use SQLAlchemyCache to cache with any SQL database supported by SQLAlchemy.
Standard cache
from langchain_community.cache import SQLAlchemyCache
from sqlalchemy import create_engine
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
set_llm_cache(SQLAlchemyCache(engine))
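To make the storage model concrete, here is a minimal sketch of a SQL-backed cache using the standard library's sqlite3 module instead of SQLAlchemy. It assumes one row per generation, keyed on (prompt, llm, idx), which echoes the columns of the custom schema in the next section but is not taken from SQLAlchemyCache itself. The functions open_cache, update, and lookup are hypothetical names.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache table with one row per generation."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache ("
        " prompt TEXT NOT NULL, llm TEXT NOT NULL, idx INTEGER NOT NULL,"
        " response TEXT, PRIMARY KEY (prompt, llm, idx))"
    )
    return conn


def update(conn, prompt, llm, responses):
    """Replace all cached generations for (prompt, llm)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM llm_cache WHERE prompt = ? AND llm = ?", (prompt, llm)
        )
        conn.executemany(
            "INSERT INTO llm_cache (prompt, llm, idx, response) VALUES (?, ?, ?, ?)",
            [(prompt, llm, i, r) for i, r in enumerate(responses)],
        )


def lookup(conn, prompt, llm):
    """Return cached generations in order, or None on a miss."""
    rows = conn.execute(
        "SELECT response FROM llm_cache WHERE prompt = ? AND llm = ? ORDER BY idx",
        (prompt, llm),
    ).fetchall()
    return [row[0] for row in rows] or None


conn = open_cache()
miss = lookup(conn, "Tell me a joke", "demo-model")
update(conn, "Tell me a joke", "demo-model", ["joke A", "joke B"])
hit = lookup(conn, "Tell me a joke", "demo-model")
```

Storing an idx per row lets a single prompt keep several generations, which matters when the model is configured to return more than one completion (as with n=2 at the top of this notebook).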
Custom SQLAlchemy schemas
You can define your own declarative SQLAlchemyCache child class to customize the schema used for caching. For example, to support high-speed fulltext prompt indexing with Postgres, use:
from langchain_community.cache import SQLAlchemyCache
from sqlalchemy import Column, Computed, Index, Integer, Sequence, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy_utils import TSVectorType
Base = declarative_base()
class FulltextLLMCache(Base):  # type: ignore
    """Postgres table for fulltext-indexed LLM Cache"""
    __tablename__ = "llm_cache_fulltext"
    id = Column(Integer, Sequence("cache_id"), primary_key=True)
    prompt = Column(String, nullable=False)
    llm = Column(String, nullable=False)
    idx = Column(Integer)
    response = Column(String)
    # Stored tsvector column computed from the LLM description and the prompt
    prompt_tsv = Column(
        TSVectorType(),
        Computed("to_tsvector('english', llm || ' ' || prompt)", persisted=True),
    )
    # GIN index over the tsvector column for fast fulltext search
    __table_args__ = (
        Index("idx_fulltext_prompt_tsv", prompt_tsv, postgresql_using="gin"),
    )
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
set_llm_cache(SQLAlchemyCache(engine, FulltextLLMCache))
Cassandra
Apache Cassandra® is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with vector search capabilities.
You can use Cassandra for caching LLM responses, choosing from the exact-match CassandraCache or the (vector-similarity-based) CassandraSemanticCache.
Let's see both in action. The next cells guide you through the (little) required setup, and the following cells showcase the two available cache classes.
Required dependency:
%pip install -qU "cassio>=0.1.4"
Connecting to the DB
The Cassandra caches shown in this page can be used with Cassandra as well as other derived databases, such as Astra DB, which use the CQL (Cassandra Query Language) protocol.
DataStax Astra DB is a managed serverless database built on Cassandra, offering the same interface and strengths.
Depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when instantiating the cache (through initialization of a CassIO connection).