Embeddings, Vector Stores & RAG · Lesson 7 of 11

Indexing API

The Vector Databases lesson loaded chunks with from_texts. index() is for the second and later runs: it checks a SQLite record manager, skips documents whose hashes match, and deletes rows for documents that have left the doc list. This lesson uses local Chroma; see PostgreSQL Indexing API for the same pattern with PGVector.

Before you run

Activate the venv from Project Setup. Install SQLAlchemy for the record manager:

pip install sqlalchemy

Keep OPENAI_API_KEY in .env from OpenAI Account Setup. Chroma still calls OpenAI when it embeds new lines.

Demo flow:

2 Document objects, each with metadata["source"]
↓ index() with SQLRecordManager and the Chroma store
↓ print the result dict: num_added · num_skipped · num_deleted

SQLite file: indexing_record_manager.sql

from_texts on a re-run

Call from_texts twice on the same two HTML lines and Chroma stores four rows. Edit the file, drop the <title> line, run from_texts again — the old <title> vector is still there.

Approach                      What happens
from_texts each run           Duplicate rows in Chroma
Line removed from file        Old row still in Chroma
index() + SQLRecordManager    Skip match, delete missing
Row three is what the demo script does on each run. For PostgreSQL, see PostgreSQL Indexing API.
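The three approaches can be compared with a toy model that needs no Chroma and no API key. This is only a sketch of the idea, not how LangChain implements it: content_hash, append_every_run, and tracked_run are hypothetical names, a list and a dict stand in for the vector store, and the SHA-256 fingerprint is an assumption chosen for illustration.

```python
import hashlib


def content_hash(text: str, source: str) -> str:
    """Stable fingerprint of one chunk (toy stand-in for the record manager's hash)."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def append_every_run(store: list, lines: list) -> None:
    """Rows one and two of the comparison: every call appends, nothing is removed."""
    store.extend(lines)


def tracked_run(store: dict, lines: list, source: str) -> dict:
    """Row three: skip known hashes, add new ones, delete the ones that left."""
    seen = set()
    added = skipped = 0
    for text in lines:
        key = content_hash(text, source)
        seen.add(key)
        if key in store:
            skipped += 1
        else:
            store[key] = (source, text)
            added += 1
    # Delete rows from this source that were not part of this run.
    stale = [k for k, (src, _) in store.items() if src == source and k not in seen]
    for k in stale:
        del store[k]
    return {"num_added": added, "num_skipped": skipped, "num_deleted": len(stale)}


lines = [
    "The <a> tag creates a hyperlink.",
    "The <title> tag sets the browser tab title.",
]

# Appending on every run doubles the rows.
naive = []
append_every_run(naive, lines)
append_every_run(naive, lines)
print(len(naive))  # 4

# Hash tracking: same docs twice, then one doc removed.
tracked = {}
first = tracked_run(tracked, lines, "html_notes.txt")
second = tracked_run(tracked, lines, "html_notes.txt")
third = tracked_run(tracked, lines[:1], "html_notes.txt")
print(first)
print(second)
print(third)
```

The tracked store ends with one row, the <a> line, because the third run no longer contained the <title> line.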

Wire up index()

SQLRecordManager writes indexing_record_manager.sql. Pass the doc list, manager, and Chroma store to index(). Each Document needs metadata["source"] set to the file name.

from langchain_community.indexes import SQLRecordManager
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.indexing import index
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(
        page_content="The <a> tag creates a hyperlink.",
        metadata={"source": "html_notes.txt"},
    ),
    Document(
        page_content="The <title> tag sets the browser tab title.",
        metadata={"source": "html_notes.txt"},
    ),
]

record_manager = SQLRecordManager(
    "indexing_demo",
    db_url="sqlite:///indexing_record_manager.sql",
)
record_manager.create_schema()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="indexing_demo",
    embedding_function=embeddings,
    persist_directory="./indexing_chroma_db",
)

result = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

print(result)

With cleanup="incremental" and source_id_key="source", index() deletes rows that belong to a source file seen in this run but are missing from this run's doc list. Print the returned num_added / num_skipped / num_deleted counts to see what happened.
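The counts are easier to compare across runs when they are printed in a fixed layout. A minimal sketch, assuming the result is a plain dict: format_counts is a hypothetical helper, the sample values are made up for illustration, and num_updated is included on the assumption that your LangChain version reports it alongside the other three keys.

```python
def format_counts(label: str, result: dict) -> str:
    """Render an index() style result dict as a short text report."""
    rows = [f"=== {label} ==="]
    for key in ("num_added", "num_updated", "num_skipped", "num_deleted"):
        # .get() keeps the helper working if a key is absent from the dict.
        short_name = key.split("_", 1)[1]
        rows.append(f"{short_name}: {result.get(key, 0)}")
    return "\n".join(rows)


# Sample values only; a real dict would come from index().
sample = {"num_added": 2, "num_updated": 0, "num_skipped": 0, "num_deleted": 0}
report = format_counts("First run (2 docs)", sample)
print(report)
```

Calling the helper once per index() run gives one labelled block per run, which makes a change in the skipped or deleted counts easy to spot.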

Use in RAG

Load and split files from Document Loading, then call index() instead of from_texts when the file changes. The RAG chain in the next lesson points at the same Chroma folder.
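The point that matters for index() is that every chunk carries the file name in metadata["source"]. A rough stand-in for the load-and-split step, using only the standard library: lines_to_records is a hypothetical helper, and it returns plain dicts shaped like a Document rather than real Document objects. The loaders from Document Loading would replace it in practice.

```python
import tempfile
from pathlib import Path


def lines_to_records(path: str) -> list:
    """Turn each non-empty line of a text file into a Document-shaped dict."""
    file_path = Path(path)
    records = []
    for line in file_path.read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if text:
            records.append(
                {"page_content": text, "metadata": {"source": file_path.name}}
            )
    return records


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "html_notes.txt"
    notes.write_text(
        "The <a> tag creates a hyperlink.\n\n"
        "The <title> tag sets the browser tab title.\n",
        encoding="utf-8",
    )
    records = lines_to_records(str(notes))

print(len(records))  # 2
print(records[0]["metadata"])  # {'source': 'html_notes.txt'}
```

Each dict maps onto Document(page_content=..., metadata=...), so the list can be converted and passed to index() with source_id_key="source".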

Run the demo

Download the script, unzip if needed, then run:

indexing_api_demo.py
Three index() calls; prints the counts each time.

Keep OPENAI_API_KEY in .env. Chroma only calls OpenAI on add.

Script excerpt:

"""indexing_api_demo.py"""
from langchain_core.indexing import index
# SQLRecordManager + Chroma → 3 runs
result = index(docs, record_manager, vectorstore, …)

Command:

python indexing_api_demo.py

Output in PowerShell with (.venv) active:

(.venv) PS C:\projects\langchain-course> python indexing_api_demo.py
=== First run (2 docs) ===
added: 2
skipped: 0
deleted: 0
=== Second run (same 2 docs) ===
added: 0
skipped: 2
deleted: 0
=== Third run (1 doc removed) ===
added: 0
skipped: 1
deleted: 1
Second run: skipped 2. Third run: <title> line deleted from Chroma.
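To see what the record manager kept between runs, the SQLite file can be opened with Python's built-in sqlite3 module. The sketch below deliberately assumes nothing about the table names SQLRecordManager creates. It lists whatever tables the file contains and counts their rows. table_row_counts is a hypothetical helper, and the demo runs against a throwaway database, not the lesson's file.

```python
import os
import sqlite3
import tempfile


def table_row_counts(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come from sqlite_master itself; double quotes guard odd characters.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Demo on a throwaway database so the sketch runs without the lesson files.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.sql")
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE demo_records (key TEXT)")
    conn.executemany("INSERT INTO demo_records VALUES (?)", [("a",), ("b",)])
    conn.commit()
    conn.close()
    counts = table_row_counts(demo_db)

print(counts)  # {'demo_records': 2}
```

After running the demo script, call table_row_counts("indexing_record_manager.sql") from the same folder. Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result usually means the path is wrong.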

If it fails

  • ModuleNotFoundError: sqlalchemy — run pip install sqlalchemy.
  • AuthenticationError — check OPENAI_API_KEY in .env.
  • ValueError: Source id key is required — pass source_id_key="source" when using cleanup="incremental", and put source in each document's metadata.
  • Vectorstore has not implemented the delete method — import Chroma from langchain_community.vectorstores.
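The source-key error can be caught before index() runs by checking the doc list first. A small sketch, assuming each document exposes a metadata dict the way Document does: missing_source is a hypothetical helper, and SimpleNamespace objects stand in for real documents so the example runs without LangChain installed.

```python
from types import SimpleNamespace


def missing_source(docs, key="source"):
    """Return the positions of documents whose metadata lacks a non-empty key."""
    bad = []
    for i, doc in enumerate(docs):
        metadata = getattr(doc, "metadata", None) or {}
        if not metadata.get(key):
            bad.append(i)
    return bad


# Stand-ins for Document objects: the second one has no source.
docs = [
    SimpleNamespace(
        page_content="The <a> tag creates a hyperlink.",
        metadata={"source": "html_notes.txt"},
    ),
    SimpleNamespace(
        page_content="The <title> tag sets the browser tab title.",
        metadata={},
    ),
]

bad_positions = missing_source(docs)
print(bad_positions)  # [1]
```

An empty list means every document carries the key and the doc list is safe to pass to index() with cleanup="incremental".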

Docs: LangChain indexing how-to.

What's Next

Chroma indexing is in place. Next: the same pattern with PostgreSQL.