Indexing API
The Vector Databases lesson loaded chunks with from_texts. index() is for the second and later runs: it checks a SQLite record manager, skips documents whose hashes match, and deletes rows that are no longer in the doc list. This lesson uses local Chroma; see PostgreSQL Indexing API for the same pattern with PGVector.
Before you run
Activate the venv from Project Setup. Install SQLAlchemy for the record manager:
pip install sqlalchemy
Keep OPENAI_API_KEY in .env from OpenAI Account Setup. Chroma still calls OpenAI when it embeds new lines.
from_texts on a re-run
Call from_texts twice on the same two HTML lines and Chroma stores four rows. Edit the file, drop the <title> line, run from_texts again — the old <title> vector is still there.
| Approach | What happens |
|---|---|
| from_texts each run | Duplicate rows in Chroma |
| Line removed from file | Old row still in Chroma |
| index() + SQLRecordManager | Skip match, delete missing |
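The first two rows of the table can be reproduced without Chroma or OpenAI. This is a minimal sketch, assuming a plain Python list stands in for the vector store and a SHA-256 of the text stands in for the content hash the record manager keeps. naive_add and hashed_add are hypothetical helpers for illustration, not LangChain functions.

```python
import hashlib

lines = [
    "The <a> tag creates a hyperlink.",
    "The <title> tag sets the browser tab title.",
]


def naive_add(store, texts):
    """Append every text, the way a repeated from_texts call adds rows."""
    store.extend(texts)


def hashed_add(store, texts):
    """Key each text by its content hash so a repeated text is skipped."""
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        store.setdefault(key, text)


naive_store = []
naive_add(naive_store, lines)
naive_add(naive_store, lines)       # second run duplicates both rows
print(len(naive_store))             # 4

naive_add(naive_store, lines[:1])   # <title> line dropped from the file
print(lines[1] in naive_store)      # True: the stale row is still stored

hashed_store = {}
hashed_add(hashed_store, lines)
hashed_add(hashed_store, lines)     # second run changes nothing
print(len(hashed_store))            # 2
```

Hashing alone removes the duplicates, but it does not delete the stale row. Deleting needs a record of what was indexed before, which is the job of the record manager.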
Wire up index()
SQLRecordManager writes indexing_record_manager.sql. Pass the doc list, manager, and Chroma store to index(). Each Document needs metadata["source"] set to the file name.
from langchain_community.indexes import SQLRecordManager
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.indexing import index
from langchain_openai import OpenAIEmbeddings

# Every document carries the file name in metadata["source"].
docs = [
    Document(
        page_content="The <a> tag creates a hyperlink.",
        metadata={"source": "html_notes.txt"},
    ),
    Document(
        page_content="The <title> tag sets the browser tab title.",
        metadata={"source": "html_notes.txt"},
    ),
]

# The record manager keeps its bookkeeping in a local SQLite file.
record_manager = SQLRecordManager(
    "indexing_demo",
    db_url="sqlite:///indexing_record_manager.sql",
)
record_manager.create_schema()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="indexing_demo",
    embedding_function=embeddings,
    persist_directory="./indexing_chroma_db",
)

# Skip unchanged documents and delete stale rows from the same source.
result = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
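print(result)
With cleanup="incremental" and source_id_key="source", index() deletes previously indexed rows from the same source file that are not part of this run. A source that disappears from the doc list entirely is not cleaned up in this mode. Print the returned num_added / num_skipped / num_deleted to see what happened.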
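To make the skip and delete behavior concrete, here is a toy sketch of incremental cleanup using only the standard library. It assumes documents are plain dicts, a dict stands in for Chroma, and an in-memory SQLite table stands in for the record manager. toy_index is a hypothetical function: it compares key sets per run, which is a simplification and not how LangChain implements index(), and it omits the num_updated count that the real result also carries.

```python
import hashlib
import sqlite3


def toy_index(docs, conn, store, source_id_key="source"):
    """Skip hashes already recorded; delete stale rows of sources seen in this run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, source TEXT)"
    )
    counts = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen = {}  # source -> set of keys present in this run

    for doc in docs:
        source = doc["metadata"][source_id_key]
        raw = source + "\n" + doc["page_content"]
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        seen.setdefault(source, set()).add(key)

        known = conn.execute(
            "SELECT 1 FROM records WHERE key = ?", (key,)
        ).fetchone()
        if known:
            counts["num_skipped"] += 1
        else:
            conn.execute("INSERT INTO records VALUES (?, ?)", (key, source))
            store[key] = doc["page_content"]
            counts["num_added"] += 1

    # Only sources that appear in this run are cleaned up.
    for source, keys in seen.items():
        rows = conn.execute(
            "SELECT key FROM records WHERE source = ?", (source,)
        ).fetchall()
        for (key,) in rows:
            if key not in keys:
                conn.execute("DELETE FROM records WHERE key = ?", (key,))
                del store[key]
                counts["num_deleted"] += 1

    conn.commit()
    return counts


conn = sqlite3.connect(":memory:")
store = {}
a_tag = {
    "page_content": "The <a> tag creates a hyperlink.",
    "metadata": {"source": "html_notes.txt"},
}
title_tag = {
    "page_content": "The <title> tag sets the browser tab title.",
    "metadata": {"source": "html_notes.txt"},
}

first = toy_index([a_tag, title_tag], conn, store)   # both rows added
second = toy_index([a_tag, title_tag], conn, store)  # both rows skipped
third = toy_index([a_tag], conn, store)              # <title> row deleted
print(first, second, third, sep="\n")
```

The three calls mirror the situations in the table: a first load, an unchanged re-run, and a re-run after the <title> line was removed from the file.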
Use in RAG
Load and split files from Document Loading, then call index() instead of from_texts when the file changes. The RAG chain in the next lesson points at the same Chroma folder.
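As a rough illustration of preparing input for index(), the sketch below turns a text file into document-shaped dicts with metadata["source"] set to the file name. It assumes one chunk per non-empty line, which may not match the splitter used in Document Loading. lines_to_docs is a hypothetical helper, and the temporary file exists only so the example runs on its own.

```python
import tempfile
from pathlib import Path


def lines_to_docs(path):
    """Turn each non-empty line of a text file into a document-shaped dict."""
    path = Path(path)
    docs = []
    for line in path.read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if text:
            docs.append(
                {"page_content": text, "metadata": {"source": path.name}}
            )
    return docs


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "html_notes.txt"
    notes.write_text(
        "The <a> tag creates a hyperlink.\n"
        "\n"
        "The <title> tag sets the browser tab title.\n",
        encoding="utf-8",
    )
    docs = lines_to_docs(notes)

for doc in docs:
    print(doc["metadata"]["source"], "|", doc["page_content"])
```

Each dict has the same two fields as a LangChain Document, so Document(**doc) should convert it before the list is passed to index().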
Run the demo
Download the script (indexing_api_demo.py), unzip if needed, and keep OPENAI_API_KEY in .env; Chroma only calls OpenAI on add. The script makes three index() calls and prints the counts each time. Run:
python indexing_api_demo.py
If it fails
- ModuleNotFoundError: sqlalchemy. Run pip install sqlalchemy.
- AuthenticationError. Check OPENAI_API_KEY in .env.
- ValueError: Source id key is required. Pass source_id_key="source" when using cleanup="incremental", and put source in each document's metadata.
- Vectorstore has not implemented the delete method. Import Chroma from langchain_community.vectorstores.
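The first two failures can be caught before the demo starts. This is a small preflight sketch using only the standard library; preflight is a hypothetical helper, and the module list is an assumption based on the imports shown in this lesson.

```python
import importlib.util
import os


def preflight(env=None):
    """Return a list of problems that would make the demo fail early."""
    env = os.environ if env is None else env
    problems = []
    # Module names assumed from the imports used in this lesson.
    for module in ("sqlalchemy", "langchain_community", "langchain_openai"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing module: {module}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in preflight():
    print(problem)
```

The check reads the process environment, so it only sees a key stored in .env after that file has been loaded into the environment.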
Docs: LangChain indexing how-to.
What's Next
Chroma indexing is in place. Next: the same pattern with PostgreSQL.