Embeddings, Vector Stores & RAG · Lesson 2 of 7

Text Splitting

Document Loading gives you one long page_content string. Here you cut it into smaller pieces with CharacterTextSplitter and RecursiveCharacterTextSplitter, and set chunk_size plus chunk_overlap.

Before you run

Use the same venv from Project Setup. Install the text splitter package once:

pip install langchain-text-splitters

One file → many chunks:

html_notes.txt (one Document, ~735 chars)

↓ split_documents()

chunk 0 · chunk 1 · chunk 2 · …

↓

each chunk is its own Document (metadata kept from the source)

You load the file first, then pass the document list to a splitter.
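To make the "metadata kept from the source" idea concrete, here is a minimal sketch in plain Python. It is an illustration, not LangChain's code: Doc is a hypothetical stand-in for the Document class, split_docs is a hypothetical helper, and the 735-character string only mimics the length of the sample file.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, chunk_size):
    """Cut each document into fixed-size pieces (no overlap in this sketch).

    Every piece becomes its own Doc and gets a copy of the source metadata.
    """
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


# Assumed input: one 735-character document tagged with its file path.
source = Doc("x" * 735, {"source": "text_splitting_samples/html_notes.txt"})
chunks = split_docs([source], chunk_size=150)

print(len(chunks))         # 5
print(chunks[0].metadata)  # {'source': 'text_splitting_samples/html_notes.txt'}
```

Because every chunk carries its own copy of the metadata, you can still tell which file a chunk came from after splitting.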

Load the sample file

Load html_notes.txt with TextLoader — same as the previous lesson. The file has a title line and short paragraphs about HTML tags.

from langchain_community.document_loaders import TextLoader

docs = TextLoader("text_splitting_samples/html_notes.txt", encoding="utf-8").load()

print(len(docs[0].page_content), "characters")

chunk_size and chunk_overlap

chunk_size sets the maximum number of characters in each piece. chunk_overlap repeats the tail of one chunk at the head of the next, which is handy when a cut lands in the middle of a sentence.

chunk_size=150, chunk_overlap=30:

chunk 0 — 150 chars

chunk 1 — repeats last 30 chars, then new text

chunk 2 — same overlap pattern

The overlap is the last 30 characters of chunk 0, which show up again at the start of chunk 1.

chunk_size caps each piece. chunk_overlap copies characters from the end of one chunk into the start of the next.

The demo uses 150 and 30 so the printout stays short. Use larger numbers on your own files.
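To see the arithmetic behind these two settings, here is a minimal sliding-window sketch in plain Python. It is not LangChain's implementation: fixed_size_chunks is a hypothetical helper, and the 735-character string is an assumed stand-in for the sample file. Each new chunk starts chunk_size - chunk_overlap characters after the previous one.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Cut text into pieces of at most chunk_size characters.

    Each piece starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this piece already reached the end of the text
        start += step
    return chunks


# Assumed sample: 735 characters of repeating letters.
text = "".join(chr(ord("a") + i % 26) for i in range(735))
chunks = fixed_size_chunks(text, chunk_size=150, chunk_overlap=30)

print([len(c) for c in chunks])            # [150, 150, 150, 150, 150, 135]
print(chunks[0][-30:] == chunks[1][:30])   # True
```

With a step of 120 characters, a 735-character text needs six chunks, and the last one is shorter because it holds whatever is left.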

CharacterTextSplitter

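CharacterTextSplitter splits on a single separator (the default is "\n\n") and packs the pieces into chunks of at most chunk_size characters. With separator="", it cuts at a fixed character count and does not look for spaces or line breaks, so chunk boundaries can land in the middle of a word.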

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30,
    separator="",
)

chunks = splitter.split_documents(docs)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk.page_content)} chars")

RecursiveCharacterTextSplitter

Splits on blank lines first, then single newlines, then spaces. It only chops mid-word when nothing else fits under chunk_size.

Splitter	How it cuts	Best for
CharacterTextSplitter	Fixed character count (can cut mid-word)	Uniform size, simple logs
RecursiveCharacterTextSplitter	Tries paragraph → line → space first	Notes, articles, most text files

For notes and articles, pick recursive. Pick character (with separator="") when every chunk should be as close to chunk_size as possible; the last chunk is usually shorter.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30,
)

chunks = splitter.split_documents(docs)
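The "paragraph, then line, then space" order can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: recursive_split is a hypothetical helper, it ignores chunk_overlap, and the sample text is made up.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (no overlap).

    Split on the coarsest separator, pack neighbouring pieces together while
    they fit, and recurse with finer separators on pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut, which may land mid-word.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep packing
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


# Assumed sample: a title, one long paragraph (199 chars), one short paragraph.
paragraph = " ".join(["word"] * 40)
text = "Title\n\n" + paragraph + "\n\nShort paragraph."
chunks = recursive_split(text, chunk_size=50)

print(len(chunks))                # 6
print(chunks[0])                  # Title
print(max(len(c) for c in chunks))  # 49
```

The title ends up alone because adding the long paragraph to it would exceed chunk_size, and the long paragraph is then broken at spaces rather than mid-word.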

Run the demo

Download text_splitting_demo.zip (the script plus the sample notes file), unzip it into your project folder, then run the script.

Inside the archive

◇text_splitting_demo.py
◇text_splitting_samples/html_notes.txt

Unzip into your project folder. Keep text_splitting_samples/ next to the script.

◇text_splitting_demo.py

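"""text_splitting_demo.py"""

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

docs = TextLoader("text_splitting_samples/html_notes.txt", encoding="utf-8").load()

# chunk_size=150, chunk_overlap=30 for both splitters
for splitter in (
    CharacterTextSplitter(chunk_size=150, chunk_overlap=30, separator=""),
    RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=30),
):
    chunks = splitter.split_documents(docs)
    print(f"=== {type(splitter).__name__} ({len(chunks)} chunks) ===")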

python text_splitting_demo.py

PowerShell — (.venv) active

(.venv) PS C:\projects\langchain-course> python text_splitting_demo.py

Loaded 1 document — 735 characters total

=== CharacterTextSplitter (6 chunks) ===

--- chunk 0 (150 chars) ---

HTML Basics — Quick Notes The <a> tag creates a hyperlink. Use the href attribute…

=== RecursiveCharacterTextSplitter (7 chunks) ===

--- chunk 0 (25 chars) ---

HTML Basics — Quick Notes

--- chunk 1 (134 chars) ---

The <a> tag creates a hyperlink. Use the href attribute to set the destination URL…

Recursive splits on blank lines first — chunk 0 is just the title, chunk 1 starts the first paragraph.

If it fails

FileNotFoundError — unzip the ZIP so text_splitting_samples/html_notes.txt sits next to the script.
ModuleNotFoundError: langchain_text_splitters — run pip install langchain-text-splitters.
Only 1 chunk printed — your file may be shorter than chunk_size. Lower the value or use a longer file.

Other splitters are listed in the LangChain text splitters docs.

What's Next

Chunks are done. Next up: Embeddings.
