Document Loading

LangChain reads files and web pages through document loaders. You pass a path or URL, call .load(), and get back a list of Document objects. We try four common sources: text, PDF, CSV, and a web page.

Before you run

Open a terminal in your langchain-course folder. Activate the venv from Project Setup.

langchain-community is already installed. PDF and web loaders need two more packages — install them once:

pip install pypdf beautifulsoup4

Same steps for every loader:

  • Source: html_basics.txt · html_tags.csv · html_basics.pdf · a URL
  • Loader: TextLoader / PyPDFLoader / CSVLoader / WebBaseLoader
  • Call: loader.load() → list of Document objects
  • Inspect: print doc.page_content and doc.metadata

Pick the loader that matches your file type, call .load(), then inspect what came back.

The Document object

A loader always returns a list of Document objects. Two fields matter: page_content (the text) and metadata (source path, page number, and similar). See the LangChain docs on Documents if you want the full type definition.

One Document

  • page_content: The <a> tag creates a hyperlink.
  • metadata: { "source": "html_basics.txt" }

page_content holds the text. metadata usually includes the file path or page number.

from langchain_core.documents import Document

doc = Document(
    page_content="The <a> tag creates a hyperlink.",
    metadata={"source": "html_basics.txt"},
)

print(doc.page_content)
print(doc.metadata)

Which loader to use

Import from langchain_community.document_loaders. Match the loader to your source type:

Loader          Source
TextLoader      .txt file
PyPDFLoader     .pdf file
CSVLoader       .csv file
WebBaseLoader   URL
Pass a path or URL to the loader, then call .load().
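
The table above can be turned into a small lookup. This is a minimal sketch under stated assumptions: loader_name_for and load_any are hypothetical helper names, and the mapping covers only the four sources in this lesson. load_any imports LangChain lazily, so the lookup part runs even where langchain-community is not installed.

```python
from pathlib import Path
from urllib.parse import urlparse

# Suffix -> loader class name, taken from the table above.
LOADER_BY_SUFFIX = {".txt": "TextLoader", ".pdf": "PyPDFLoader", ".csv": "CSVLoader"}


def loader_name_for(source):
    """Return the name of the loader class that matches a path or URL."""
    if urlparse(source).scheme in ("http", "https"):
        return "WebBaseLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader configured for {source!r}")
    return LOADER_BY_SUFFIX[suffix]


def load_any(source):
    """Instantiate the matching loader and call .load(). Needs langchain-community."""
    import langchain_community.document_loaders as loaders  # imported lazily

    return getattr(loaders, loader_name_for(source))(source).load()


print(loader_name_for("document_loading_samples/html_basics.pdf"))
print(loader_name_for("https://www.google.com"))
```

Note that load_any passes only the path or URL, so loader options such as encoding="utf-8" would still have to be added by hand.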

Text files

TextLoader reads a .txt file into one document. On Windows, pass encoding="utf-8" if accented characters print as garbage.

from langchain_community.document_loaders import TextLoader

docs = TextLoader("document_loading_samples/html_basics.txt", encoding="utf-8").load()

print(docs[0].page_content)
print(docs[0].metadata)
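
The garbage mentioned above is usually UTF-8 bytes decoded with a different code page. The sketch below reproduces that with the standard library only. The file name sample.txt is hypothetical, and cp1252 is an assumption about what a Windows default might be; the actual default depends on the machine.

```python
import tempfile
from pathlib import Path

text = "café"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file, not part of the course ZIP
    path.write_text(text, encoding="utf-8")

    right = path.read_text(encoding="utf-8")
    wrong = path.read_text(encoding="cp1252")  # assumed Windows-style default

print(right)  # café
print(wrong)  # cafÃ©
```

Passing encoding="utf-8" to TextLoader sidesteps the problem because the decoding no longer depends on the platform default.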

PDF files

PyPDFLoader returns one document per page. A 10-page file gives you a list of 10 items. It reads embedded text only — scanned photos of pages come back empty unless you run OCR separately.

from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("document_loading_samples/html_basics.pdf").load()

for page in docs:
    print(page.metadata)   # page number, source path
    print(page.page_content[:120])
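
To spot image-only pages quickly, you can look for documents whose text is empty. The sketch below uses SimpleNamespace stand-ins so it runs without a PDF; real Document objects expose the same two attributes. The helper name blank_pages and the file name scan.pdf are hypothetical, and the code assumes the loader stores a page key in the metadata.

```python
from types import SimpleNamespace


def blank_pages(docs):
    """Return the metadata 'page' value of every document with no extractable text."""
    return [doc.metadata.get("page") for doc in docs if not doc.page_content.strip()]


# Stand-ins for what a loader might return: the second page has no embedded text.
pages = [
    SimpleNamespace(page_content="HTML basics", metadata={"source": "scan.pdf", "page": 0}),
    SimpleNamespace(page_content="  \n", metadata={"source": "scan.pdf", "page": 1}),
]
print(blank_pages(pages))  # [1]
```

Any page reported here would need OCR before its text can be used.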

CSV files

CSVLoader makes one document per row. The row text looks like tag: a on one line and description: Creates a hyperlink… on the next — column names stay attached to their values.

from langchain_community.document_loaders import CSVLoader

docs = CSVLoader("document_loading_samples/html_tags.csv").load()

for row in docs:
    print(row.page_content)   # one CSV row as text
    print(row.metadata)
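
The row format described above can be approximated with the standard library. This is a sketch of the idea, not CSVLoader's actual source: the helper name rows_to_records is hypothetical, the second sample row is made up for illustration, and the row key in the metadata is an assumption about what the loader records.

```python
import csv
import io

SAMPLE = (
    "tag,description\n"
    "a,Creates a hyperlink to another page or resource\n"
    "p,Defines a paragraph\n"  # illustrative row, not taken from html_tags.csv
)


def rows_to_records(handle, source):
    """Yield (page_content, metadata) per CSV row, one 'column: value' pair per line."""
    for i, row in enumerate(csv.DictReader(handle)):
        content = "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())
        yield content, {"source": source, "row": i}


records = list(rows_to_records(io.StringIO(SAMPLE), "html_tags.csv"))
print(records[0][0])
print(records[0][1])
```

Keeping the column name next to each value means a row still makes sense when it is read on its own, away from the header line.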

Web pages

WebBaseLoader downloads the HTML at a URL and keeps the visible text. You need a working internet connection. We use https://www.google.com in the demo — a real site most students already know.

from langchain_community.document_loaders import WebBaseLoader

docs = WebBaseLoader("https://www.google.com").load()

print(docs[0].page_content[:300])
print(docs[0].metadata)

Run the demo

Download the ZIP below (script and sample files together), unzip into your project folder, then run:

document_loading_demo.zip

Script plus sample txt, csv, and pdf files

Inside the archive

  • document_loading_demo.py
  • document_loading_samples/html_basics.txt
  • document_loading_samples/html_tags.csv
  • document_loading_samples/html_basics.pdf
Keep document_loading_samples/ next to document_loading_demo.py; the zip already lays them out that way.
document_loading_demo.py
"""document_loading_demo.py"""
from langchain_community.document_loaders import TextLoader, PyPDFLoader, …
# load txt → csv → pdf → web URL
docs = TextLoader(…).load()
python document_loading_demo.py
PowerShell — (.venv) active
(.venv) PS C:\projects\langchain-course> python document_loading_demo.py
=== TextLoader (1 chunk(s)) ===
--- [0] metadata: {'source': '…/html_basics.txt'}
The HTML <a> tag creates a hyperlink to another page or file.…
=== CSVLoader (4 chunk(s)) ===
tag: a
description: Creates a hyperlink to another page or resource
=== PyPDFLoader (1 chunk(s)) ===
HTML basics - a tag creates a hyperlink
=== WebBaseLoader (fetching google.com) ===
Google About Store Gmail Images Sign in…
Your output should look close to this. Google's homepage text can vary by region and login state.

If it fails

  • FileNotFoundError — the sample folder must sit next to the script. Check the path printed in the error.
  • ModuleNotFoundError or ImportError naming pypdf or bs4 (the import name of beautifulsoup4): run pip install pypdf beautifulsoup4.
  • Connection error on WebBaseLoader — check your network, or comment out that block and run the file loaders only.
  • Blank PDF output — the file is probably image-based. Try the included html_basics.pdf first.
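
Instead of commenting out a failing block, each loader call can be wrapped so that one failure does not stop the rest. The sketch below shows that pattern with stand-in functions. load_or_skip and offline_loader are hypothetical names, and the simulated ConnectionError is only an example of what an unreachable URL might raise.

```python
def load_or_skip(label, load_fn):
    """Call load_fn(); on any error, report it and return an empty list instead."""
    try:
        docs = load_fn()
    except Exception as exc:  # broad on purpose: a demo guard, not production code
        print(f"{label}: skipped ({type(exc).__name__}: {exc})")
        return []
    print(f"{label}: loaded {len(docs)} document(s)")
    return docs


def offline_loader():
    """Hypothetical stand-in that fails the way an unreachable URL might."""
    raise ConnectionError("network unreachable")


web_docs = load_or_skip("web", offline_loader)
file_docs = load_or_skip("text", lambda: ["stand-in document"])
```

With real loaders, the second argument would be something like lambda: WebBaseLoader("https://www.google.com").load(), so the file loaders still run when the network is down.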

More loaders are listed in the LangChain document loaders docs.

What's Next

You can load files into Document objects. Next up: split long text into smaller chunks.