Document Loading
LangChain reads files and web pages through document loaders. You pass a path or URL, call .load(), and get back a list of Document objects. We try four common sources: text, PDF, CSV, and a web page.
Before you run
Open a terminal in your langchain-course folder. Activate the venv from Project Setup.
langchain-community is already installed. PDF and web loaders need two more packages — install them once:
pip install pypdf beautifulsoup4
Same steps for every loader: pass a path or URL, call .load(), then inspect what came back.
The Document object
A loader always returns a list of Document objects. Two fields matter: page_content (the text) and metadata (source path, page number, and similar). See the LangChain docs on Documents if you want the full type definition.
One Document
page_content: The <a> tag creates a hyperlink.
metadata: { "source": "html_basics.txt" }
page_content holds the text. metadata usually includes the file path or page number.

from langchain_core.documents import Document
doc = Document(
    page_content="The <a> tag creates a hyperlink.",
    metadata={"source": "html_basics.txt"},
)
print(doc.page_content)
print(doc.metadata)

Which loader to use
Import from langchain_community.document_loaders. Match the loader to your source type:
| Loader | Source |
|---|---|
| TextLoader | .txt file |
| PyPDFLoader | .pdf file |
| CSVLoader | .csv file |
| WebBaseLoader | URL |
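As a rough illustration of the table, the sketch below maps a path or URL to the matching loader name. pick_loader_name and LOADER_BY_SUFFIX are hypothetical helpers, not part of LangChain, and the function returns the class name as a string so it runs without any extra packages.

```python
from pathlib import Path
from urllib.parse import urlparse

# Loader names from the table, keyed by file extension.
# This mapping is an illustration, not something LangChain provides.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of the loader class that matches a path or URL."""
    # Anything with an http(s) scheme is treated as a web page.
    if urlparse(source).scheme in ("http", "https"):
        return "WebBaseLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader in the table for {suffix!r}")
    return LOADER_BY_SUFFIX[suffix]


print(pick_loader_name("document_loading_samples/html_basics.pdf"))
print(pick_loader_name("https://www.google.com"))
```

In a real script you would import the chosen class from langchain_community.document_loaders instead of returning its name.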
All four are used the same way: create the loader, then call .load().
Text files
TextLoader reads a .txt file into one document. On Windows, pass encoding="utf-8" if accented characters print as garbage.
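To see why the encoding argument matters, here is a small standard-library sketch that does not use TextLoader at all. It saves a made-up string as UTF-8, then reads it back once with cp1252 (a common legacy Windows code page, chosen here as an assumption) and once with UTF-8.

```python
import tempfile
from pathlib import Path

# Made-up sample text with one accented character.
text = "café"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(text, encoding="utf-8")

    # Reading UTF-8 bytes with a legacy Windows code page garbles the accent.
    wrong = path.read_text(encoding="cp1252")
    # Reading with the encoding the file was saved in round-trips cleanly.
    right = path.read_text(encoding="utf-8")

print(wrong)
print(right)
```

The first print shows the garbled form, the second the original text. Passing encoding="utf-8" to the loader avoids the first case for files saved as UTF-8.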
from langchain_community.document_loaders import TextLoader
docs = TextLoader("document_loading_samples/html_basics.txt", encoding="utf-8").load()
print(docs[0].page_content)
print(docs[0].metadata)

PDF files
PyPDFLoader returns one document per page. A 10-page file gives you a list of 10 items. It reads embedded text only — scanned photos of pages come back empty unless you run OCR separately.
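One way to spot scanned pages is to look for empty text after loading. The sketch below is a hypothetical helper, blank_pages, that works on plain strings, so the made-up list stands in for the page_content of each loaded page.

```python
def blank_pages(page_texts: list[str]) -> list[int]:
    """Return the indexes of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Made-up stand-ins for the page_content of three loaded pages.
pages = ["HTML basics, page one.", "", "   \n"]
print(blank_pages(pages))
```

With real loader output you could call it as blank_pages([d.page_content for d in docs]). If every index comes back, the file is probably image-based and needs OCR.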
from langchain_community.document_loaders import PyPDFLoader
docs = PyPDFLoader("document_loading_samples/html_basics.pdf").load()
for page in docs:
    print(page.metadata)  # page number, source path
    print(page.page_content[:120])

CSV files
CSVLoader makes one document per row. The row text looks like tag: a on one line and description: Creates a hyperlink… on the next — column names stay attached to their values.
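The row format described above can be imitated with the standard csv module. This is only a sketch of the output shape, not CSVLoader's actual implementation. row_to_text is a hypothetical helper and the sample rows are made up, so the real html_tags.csv may differ.

```python
import csv
import io

# Made-up sample data; the real html_tags.csv may contain different rows.
raw = "tag,description\np,Marks a paragraph\nem,Marks emphasized text\n"


def row_to_text(row: dict[str, str]) -> str:
    """Join each column name to its value, one pair per line."""
    return "\n".join(f"{name}: {value}" for name, value in row.items())


# One text block per CSV row, mirroring "one document per row".
rows = [row_to_text(r) for r in csv.DictReader(io.StringIO(raw))]
print(rows[0])
```

Keeping the column name next to each value means a row still makes sense when it is read on its own, without the header line.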
from langchain_community.document_loaders import CSVLoader
docs = CSVLoader("document_loading_samples/html_tags.csv").load()
for row in docs:
    print(row.page_content)  # one CSV row as text
    print(row.metadata)

Web pages
WebBaseLoader downloads the HTML at a URL and keeps the visible text. You need a working internet connection. We use https://www.google.com in the demo — a real site most students already know.
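To give a feel for what "keeps the visible text" means, here is an offline sketch that strips tags from a made-up HTML string. It uses Python's built-in html.parser instead of the BeautifulSoup package the loader depends on, so treat TextExtractor as a hypothetical stand-in and expect the loader's real output to differ in detail.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


# Made-up HTML standing in for a downloaded page.
html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>HTML basics</h1>"
    "<p>The <a href='#'>a</a> tag creates a hyperlink.</p></body></html>"
)

parser = TextExtractor()
parser.feed(html)
parser.close()
print(" ".join(parser.parts))
```

The style rule is dropped and only the heading and paragraph text remain, which is roughly the kind of content that ends up in page_content.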
from langchain_community.document_loaders import WebBaseLoader
docs = WebBaseLoader("https://www.google.com").load()
print(docs[0].page_content[:300])
print(docs[0].metadata)

Run the demo
Download the ZIP below (script and sample files together) and unzip it into your project folder.
document_loading_demo.zip: script plus sample txt, csv, and pdf files
Inside the archive
- document_loading_demo.py
- document_loading_samples/html_basics.txt
- document_loading_samples/html_tags.csv
- document_loading_samples/html_basics.pdf
Keep document_loading_samples/ in the same folder as document_loading_demo.py; the zip already lays them out that way. Then run:
python document_loading_demo.py
If it fails
- FileNotFoundError: the sample folder must sit next to the script. Check the path printed in the error.
- ModuleNotFoundError for pypdf: run pip install pypdf beautifulsoup4.
- Connection error on WebBaseLoader: check your network, or comment out that block and run the file loaders only.
- Blank PDF output: the file is probably image-based. Try the included html_basics.pdf first.
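The first two checks in the list can be run before the demo with a short preflight script. This is a sketch under assumptions: missing_modules and missing_files are hypothetical helpers, and the relative path assumes you run it from the folder that contains the script.

```python
from importlib.util import find_spec
from pathlib import Path


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def missing_files(folder: Path, filenames: list[str]) -> list[str]:
    """Return the filenames that do not exist inside folder."""
    return [name for name in filenames if not (folder / name).is_file()]


# beautifulsoup4 is imported under the module name bs4.
print("Missing modules:", missing_modules(["pypdf", "bs4"]))

samples = Path("document_loading_samples")
sample_names = ["html_basics.txt", "html_tags.csv", "html_basics.pdf"]
print("Missing files:", missing_files(samples, sample_names))
```

Two empty lists mean the packages and sample files are in place, so any remaining failure is more likely a network or PDF content problem.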
More loaders are listed in the LangChain document loaders docs.
What's Next
You can load files into Document objects. Next up: split long text into smaller chunks.