Text Splitting
Document Loading gives you one long page_content string. Here you cut it into smaller pieces with CharacterTextSplitter and RecursiveCharacterTextSplitter, and set chunk_size plus chunk_overlap.
Before you run
Use the same venv from Project Setup. Install the text splitter package once:
pip install langchain-text-splitters
Figure: one file → many chunks.
Load the sample file
Load html_notes.txt with TextLoader — same as the previous lesson. The file has a title line and short paragraphs about HTML tags.
from langchain_community.document_loaders import TextLoader
docs = TextLoader("text_splitting_samples/html_notes.txt", encoding="utf-8").load()
print(len(docs[0].page_content), "characters")
chunk_size and chunk_overlap
chunk_size sets how many characters go in each piece. chunk_overlap repeats the tail of one chunk at the head of the next — handy when a cut lands in the middle of a sentence.
Figure: chunk_size=150, chunk_overlap=30. The overlap is highlighted in amber: the last 30 characters of chunk 0 show up again at the start of chunk 1.
- chunk_size caps each piece.
- chunk_overlap copies characters from the end of one chunk into the start of the next.
- The demo uses 150 and 30 so the printout stays short. Use larger numbers on your own files.
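To make the overlap concrete, here is a minimal plain-Python sketch of fixed-width chunking. The helper name sliding_chunks and the sample string are illustrative assumptions, not part of LangChain. The real splitters also respect separators and may trim whitespace, so their output can differ.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: fixed-width chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(400))  # 400 made-up characters
pieces = sliding_chunks(sample, chunk_size=150, chunk_overlap=30)
for i, piece in enumerate(pieces):
    print(f"chunk {i}: {len(piece)} chars")

# The tail of chunk 0 is repeated at the head of chunk 1.
print(pieces[0][-30:] == pieces[1][:30])
```

With these numbers each chunk starts 120 characters after the previous one, which is why the last 30 characters of one chunk reappear at the start of the next.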
CharacterTextSplitter
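By default, CharacterTextSplitter splits on a single separator (a blank line, "\n\n") and merges the pieces up to chunk_size. With separator="", it does not look for spaces or line breaks and cuts at a fixed character count, so chunk boundaries can land in the middle of a word.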
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=150,    # maximum characters per chunk
    chunk_overlap=30,  # characters repeated between neighboring chunks
    separator="",      # no separator: cut purely by character count
)
chunks = splitter.split_documents(docs)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk.page_content)} chars")
RecursiveCharacterTextSplitter
Splits on blank lines first, then single newlines, then spaces. It only chops mid-word when nothing else fits under chunk_size.
| Splitter | How it cuts |
|---|---|
| CharacterTextSplitter | Fixed character count when separator="" (can cut mid-word) |
| RecursiveCharacterTextSplitter | Tries paragraph → line → space first |
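The paragraph → line → space fallback can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name recursive_split and the sample notes are made up, and overlap is left out to keep the idea visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Hypothetical sketch: try coarse separators first, finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters (may split a word).
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Too big for one chunk: flush what we have, then retry with a finer separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
        elif current and len(current) + len(sep) + len(part) <= chunk_size:
            current = current + sep + part  # still fits, keep growing this chunk
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


notes = (
    "HTML basics\n\n"
    "The p tag marks a paragraph.\n\n"
    "The a tag creates a link to another page."
)
for i, chunk in enumerate(recursive_split(notes, chunk_size=40)):
    print(f"chunk {i}: {chunk!r}")
```

In this sample only the last paragraph is longer than 40 characters, so it alone falls back to splitting on spaces. The shorter paragraphs stay whole.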
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,    # maximum characters per chunk
    chunk_overlap=30,  # characters repeated between neighboring chunks
)
chunks = splitter.split_documents(docs)
Run the demo
Download the ZIP and unzip it into your project folder:
text_splitting_demo.zip (script plus sample notes file)
Inside the archive:
- text_splitting_demo.py
- text_splitting_samples/html_notes.txt
Keep text_splitting_samples/ next to the script, then run:
python text_splitting_demo.py
If it fails
- FileNotFoundError: unzip the ZIP so text_splitting_samples/html_notes.txt sits next to the script.
- ModuleNotFoundError for langchain_text_splitters: run pip install langchain-text-splitters.
- Only 1 chunk printed: your file may be shorter than chunk_size. Lower the value or use a longer file.
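For the "only 1 chunk" case, a rough estimate of how many chunks to expect can help. The sketch below assumes plain fixed-width cutting. The function name estimated_chunk_count is hypothetical, and real splitter output can differ because splitters respect separators and may trim whitespace.

```python
import math


def estimated_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Hypothetical helper: rough chunk count for fixed-width cuts with overlap."""
    if n_chars == 0:
        return 0
    if n_chars <= chunk_size:
        return 1  # the whole text fits in a single chunk
    step = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return math.ceil((n_chars - chunk_overlap) / step)


# Made-up text lengths, checked against the demo settings (150 and 30).
for n_chars in (100, 151, 400):
    print(n_chars, "characters ->", estimated_chunk_count(n_chars, 150, 30), "chunk(s)")
```

If the estimate for your file is 1, the text is no longer than chunk_size, which matches the last item in the list above.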
Other splitters are listed in the LangChain text splitters docs.