Embeddings, Vector Stores & RAG · Lesson 7 of 11

Indexing API

The Vector Databases lesson loaded chunks with from_texts. index() is for the second and later runs: it checks a SQLite-backed record manager, skips documents whose hashes match, and deletes rows for documents that are no longer in the doc list. This lesson uses local Chroma; see PostgreSQL Indexing API for the same pattern with PGVector.
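The hash check can be pictured with a small standard-library sketch. The fingerprint helper below is hypothetical and uses SHA-256 over JSON purely for illustration; it is not LangChain's actual hashing scheme, only a way to see why an unchanged chunk can be skipped and an edited one cannot.

```python
import hashlib
import json


def fingerprint(page_content: str, metadata: dict) -> str:
    """Return a stable hex digest for one chunk (hypothetical helper)."""
    # sort_keys keeps the digest stable regardless of dict insertion order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


meta = {"source": "html_notes.txt"}
a = fingerprint("The <a> tag creates a hyperlink.", meta)
b = fingerprint("The <a> tag creates a hyperlink.", meta)
c = fingerprint("The <a> tag creates a link.", meta)

print(a == b)  # True: unchanged chunk, so a later run could skip it
print(a == c)  # False: edited chunk, so it counts as new content
```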

Before you run

Activate the venv from Project Setup. Install SQLAlchemy for the record manager:

pip install sqlalchemy

Keep OPENAI_API_KEY in .env from OpenAI Account Setup. Chroma still calls OpenAI when it embeds new lines.

Demo flow:

2 Document objects with metadata["source"] → index() with SQLRecordManager and Chroma store → print result dict (num_added · num_skipped · num_deleted)

SQLite file: indexing_record_manager.sql

from_texts on a re-run

Call from_texts twice on the same two HTML lines and Chroma stores four rows. Edit the file, drop the <title> line, run from_texts again — the old <title> vector is still there.

Approach	What happens
from_texts each run	Duplicate rows in Chroma
Line removed from file	Old row still in Chroma
index() + SQLRecordManager	Skip match, delete missing

Row three is what the demo script does on each run. For PostgreSQL, see PostgreSQL Indexing API.
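The first two rows of the table can be reproduced with a plain Python list standing in for the vector store. This is a toy model, not Chroma: append_all is a hypothetical stand-in for an append-only load that never looks at what is already stored.

```python
def append_all(store: list, texts: list) -> None:
    """Append-only load: no lookup, so repeated runs create duplicate rows."""
    store.extend(texts)


lines = [
    "The <a> tag creates a hyperlink.",
    "The <title> tag sets the browser tab title.",
]

store = []
append_all(store, lines)
append_all(store, lines)      # second run on the same file
print(len(store))             # 4

append_all(store, lines[:1])  # file edited: the <title> line was dropped
stale = [row for row in store if "<title>" in row]
print(len(stale))             # 2: the old <title> rows are still there
```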

Wire up index()

SQLRecordManager writes indexing_record_manager.sql. Pass the doc list, manager, and Chroma store to index(). Each Document needs metadata["source"] set to the file name.

from langchain_community.indexes import SQLRecordManager
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.indexing import index
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(
        page_content="The <a> tag creates a hyperlink.",
        metadata={"source": "html_notes.txt"},
    ),
    Document(
        page_content="The <title> tag sets the browser tab title.",
        metadata={"source": "html_notes.txt"},
    ),
]

record_manager = SQLRecordManager(
    "indexing_demo",
    db_url="sqlite:///indexing_record_manager.sql",
)
record_manager.create_schema()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="indexing_demo",
    embedding_function=embeddings,
    persist_directory="./indexing_chroma_db",
)

result = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

print(result)

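With cleanup="incremental" and source_id_key="source", index() deletes previously indexed rows that share a source with the documents in this run but are no longer among them. Rows from sources that do not appear in this run are left alone. Print the returned num_added / num_skipped / num_deleted to see what happened.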
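To see how the source key scopes deletion, here is a toy model of incremental cleanup built on a plain dict. It is a simplified sketch and not LangChain's implementation: incremental_sync, the make_doc helper, and the css_notes.txt source are all hypothetical, and the real index() also talks to the vector store and tracks timestamps.

```python
import hashlib


def incremental_sync(records: dict, docs: list, source_id_key: str) -> dict:
    """Toy incremental cleanup; `records` maps content hash -> source."""
    counts = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_hashes = set()
    seen_sources = set()

    for d in docs:
        source = d["metadata"][source_id_key]
        digest = hashlib.sha256(
            (source + "\n" + d["page_content"]).encode("utf-8")
        ).hexdigest()
        seen_hashes.add(digest)
        seen_sources.add(source)
        if digest in records:
            counts["num_skipped"] += 1
        else:
            records[digest] = source
            counts["num_added"] += 1

    # Only rows whose source appeared in this run are eligible for deletion.
    for digest, source in list(records.items()):
        if source in seen_sources and digest not in seen_hashes:
            del records[digest]
            counts["num_deleted"] += 1
    return counts


def make_doc(text: str, source: str) -> dict:
    return {"page_content": text, "metadata": {"source": source}}


html = [
    make_doc("The <a> tag creates a hyperlink.", "html_notes.txt"),
    make_doc("The <title> tag sets the browser tab title.", "html_notes.txt"),
]
css = [make_doc("color sets the text color.", "css_notes.txt")]  # hypothetical second file

records = {}
first = incremental_sync(records, html + css, "source")
second = incremental_sync(records, html[:1], "source")
print(first)         # {'num_added': 3, 'num_skipped': 0, 'num_deleted': 0}
print(second)        # {'num_added': 0, 'num_skipped': 1, 'num_deleted': 1}
print(len(records))  # 2: the css_notes.txt row was not touched
```

In the second call only html_notes.txt is present, so the missing <title> row is removed while the css_notes.txt row stays.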

Use in RAG

Load and split files as shown in Document Loading, then call index() instead of from_texts whenever a file changes. The RAG chain in the next lesson points at the same Chroma folder.
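As a rough sketch of the loading step, the hypothetical helper below turns each non-empty line of a text file into a record with the file name under metadata["source"]. It uses plain dicts and a temporary file so it runs without LangChain; each dict has the same two fields as the Document objects in this lesson, so converting them is assumed to be a one-line step.

```python
import tempfile
from pathlib import Path


def file_to_records(path: Path) -> list:
    """One record per non-empty line, tagged with the file name as source."""
    records = []
    for line in path.read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if text:
            records.append(
                {"page_content": text, "metadata": {"source": path.name}}
            )
    return records


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "html_notes.txt"
    notes.write_text(
        "The <a> tag creates a hyperlink.\n\n"
        "The <title> tag sets the browser tab title.\n",
        encoding="utf-8",
    )
    records = file_to_records(notes)

print(len(records))            # 2
print(records[0]["metadata"])  # {'source': 'html_notes.txt'}
```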

Run the demo

Download indexing_api_demo.py, unzip if needed, then run it with the command below. The script makes three index() calls and prints the counts each time.

Keep OPENAI_API_KEY in .env — Chroma only calls OpenAI on add.

"""indexing_api_demo.py"""

from langchain_core.indexing import index

# SQLRecordManager + Chroma → 3 runs

result = index(docs, record_manager, vectorstore, …)

python indexing_api_demo.py

PowerShell — (.venv) active

(.venv) PS C:\projects\langchain-course> python indexing_api_demo.py

=== First run (2 docs) ===
added: 2
skipped: 0
deleted: 0

=== Second run (same 2 docs) ===
added: 0
skipped: 2
deleted: 0

=== Third run (1 doc removed) ===
added: 0
skipped: 1
deleted: 1

On the second run both documents are skipped. On the third run the <title> line is deleted from Chroma.
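The downloadable script is not reproduced here, but the added / skipped / deleted lines could be produced from the result dict with a helper along these lines. format_counts is a hypothetical name, and the sample dict is hand-written to mirror the second run rather than captured from a real index() call.

```python
def format_counts(result: dict) -> str:
    """Render the three counters in the same layout as the demo output."""
    labels = (
        ("added", "num_added"),
        ("skipped", "num_skipped"),
        ("deleted", "num_deleted"),
    )
    # .get(..., 0) keeps the helper tolerant of a missing key.
    return "\n".join(f"{label}: {result.get(key, 0)}" for label, key in labels)


sample = {"num_added": 0, "num_skipped": 2, "num_deleted": 0}
text = format_counts(sample)
print(text)
```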

If it fails

ModuleNotFoundError: sqlalchemy — run pip install sqlalchemy.
AuthenticationError — check OPENAI_API_KEY in .env.
ValueError: Source id key is required — pass source_id_key="source" when using cleanup="incremental", and put source in each document's metadata.
Vectorstore has not implemented the delete method — import Chroma from langchain_community.vectorstores.
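The first two failures can be caught before the script runs. The preflight function below is a hypothetical helper, not part of LangChain. It only inspects os.environ, so it assumes the .env file has already been loaded into the environment by whatever mechanism the project uses.

```python
import importlib.util
import os


def preflight(env=None, modules=("sqlalchemy",)) -> list:
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name} (pip install {name})")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (check .env)")
    return problems


for problem in preflight():
    print(problem)
```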

Docs: LangChain indexing how-to.

What's Next

Chroma indexing is in place. Next: the same pattern with PostgreSQL.
