Embeddings, Vector Stores & RAG · Lesson 1 of 7

Document Loading

LangChain reads files and web pages through document loaders. You pass a path or URL, call .load(), and get back a list of Document objects. We try four common sources: text, PDF, CSV, and a web page.

Before you run

Open a terminal in your langchain-course folder. Activate the venv from Project Setup.

langchain-community is already installed. PDF and web loaders need two more packages — install them once:

pip install pypdf beautifulsoup4

Same steps for every loader:

html_basics.txt · html_tags.csv · html_basics.pdf · a URL

↓

TextLoader / PyPDFLoader / CSVLoader / WebBaseLoader

↓

loader.load() → list of Document objects

↓

print doc.page_content and doc.metadata

Pick the loader that matches your file type, call .load(), then inspect what came back.

The Document object

A loader always returns a list of Document objects. Two fields matter: page_content (the text) and metadata (source path, page number, and similar). See the LangChain docs on Documents if you want the full type definition.

One Document

page_content

The <a> tag creates a hyperlink.

metadata

{ "source": "html_basics.txt" }

page_content holds the text. metadata usually includes the file path or page number.

from langchain_core.documents import Document

doc = Document(
    page_content="The <a> tag creates a hyperlink.",
    metadata={"source": "html_basics.txt"},
)

print(doc.page_content)
print(doc.metadata)

Which loader to use

Import from langchain_community.document_loaders. Match the loader to your source type:

Loader	Source	Typical use
TextLoader	.txt file	Notes, logs, plain text
PyPDFLoader	.pdf file	One Document per page
CSVLoader	.csv file	One Document per row
WebBaseLoader	URL	Fetch and parse HTML

Pass a path or URL to the loader, then call .load().
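
The table above can be read as a lookup from source type to loader. The sketch below shows that idea with the standard library only. The helper pick_loader_name and the LOADER_BY_SUFFIX table are hypothetical names made up for this illustration, not part of LangChain, and the function returns the loader's name as a string rather than building a loader.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical lookup table: file suffix -> loader class name from the table above.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of the loader that matches a path or URL."""
    if urlparse(source).scheme in ("http", "https"):
        return "WebBaseLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader in this lesson handles {suffix!r} files")
    return LOADER_BY_SUFFIX[suffix]


for source in ["notes.txt", "report.PDF", "tags.csv", "https://example.com"]:
    print(source, "->", pick_loader_name(source))
```

Raising an error for unknown suffixes is a design choice for the sketch: it surfaces an unsupported file early instead of silently treating it as plain text.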

Text files

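TextLoader reads a whole .txt file into a single Document. If you leave out encoding, Python's platform default is used, and on Windows that is often not UTF-8. Pass encoding="utf-8" if accented characters print as garbage or the load fails with a decoding error.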

from langchain_community.document_loaders import TextLoader

docs = TextLoader("document_loading_samples/html_basics.txt", encoding="utf-8").load()

print(docs[0].page_content)
print(docs[0].metadata)

PDF files

PyPDFLoader returns one Document per page, so a 10-page file gives you a list of 10 items. It reads embedded text only: scanned pages, which are images of text, come back empty unless you run OCR separately.

from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("document_loading_samples/html_basics.pdf").load()

for page in docs:
    print(page.metadata)   # page number, source path
    print(page.page_content[:120])
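
Since image-only pages come back with no text, it can help to check for them before going further. The sketch below is one possible way to do that. find_blank_pages is a hypothetical helper, and the pages list uses stand-in objects with the same two fields as a Document so the example runs without a PDF on disk. It also assumes the loader stores a "page" key in metadata, falling back to list position when that key is absent.

```python
from types import SimpleNamespace


def find_blank_pages(docs):
    """Return the page numbers whose extracted text is empty or whitespace."""
    blank = []
    for index, doc in enumerate(docs):
        if not doc.page_content.strip():
            # Fall back to the list position if there is no "page" key.
            blank.append(doc.metadata.get("page", index))
    return blank


# Stand-in objects, not real Documents. With a real file you would pass
# the list returned by PyPDFLoader(path).load() instead.
pages = [
    SimpleNamespace(page_content="HTML basics", metadata={"page": 0}),
    SimpleNamespace(page_content="   \n", metadata={"page": 1}),
    SimpleNamespace(page_content="", metadata={"page": 2}),
]
print(find_blank_pages(pages))  # [1, 2]
```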

CSV files

CSVLoader makes one document per row. The row text looks like tag: a on one line and description: Creates a hyperlink… on the next — column names stay attached to their values.

from langchain_community.document_loaders import CSVLoader

docs = CSVLoader("document_loading_samples/html_tags.csv").load()

for row in docs:
    print(row.page_content)   # one CSV row as text
    print(row.metadata)
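
The "column: value" layout is easy to reproduce by hand, which makes it clearer what each row Document contains. This is a rough standard-library approximation, not CSVLoader's actual code. The two sample rows are made up and are not the contents of html_tags.csv, and the "row" metadata key is an assumption about what the loader records.

```python
import csv
import io

# Made-up sample data in the same shape as a two-column tag file.
sample = io.StringIO(
    "tag,description\n"
    "a,Creates a hyperlink\n"
    "p,Defines a paragraph\n"
)

rows = []
for index, row in enumerate(csv.DictReader(sample)):
    # Join "column: value" pairs, one pair per line.
    page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
    # Assumed metadata shape: the source file plus the row index.
    metadata = {"source": "html_tags.csv", "row": index}
    rows.append((page_content, metadata))

for page_content, metadata in rows:
    print(page_content)
    print(metadata)
```

Keeping the column name next to each value matters later: a chunk that says "description: Creates a hyperlink" carries more context than the bare value would.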

Web pages

WebBaseLoader downloads the HTML at a URL and keeps the page text, dropping the tags. It parses the page with BeautifulSoup, which is why you installed beautifulsoup4 above. You need a working internet connection. The demo uses https://www.google.com, a real site most students already know.

from langchain_community.document_loaders import WebBaseLoader

docs = WebBaseLoader("https://www.google.com").load()

print(docs[0].page_content[:300])
print(docs[0].metadata)
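
If "keeps the text, drops the tags" feels abstract, the sketch below shows the idea offline with Python's built-in html.parser. It is a simplified stand-in, not how WebBaseLoader is implemented, and the VisibleTextParser class and the sample HTML string are made up for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>HTML basics</h1><script>var x = 1;</script>"
    "<p>The &lt;a&gt; tag creates a hyperlink.</p></body></html>"
)

parser = VisibleTextParser()
parser.feed(html)
parser.close()
visible_text = " ".join(parser.parts)
print(visible_text)
```

Real pages are messier than this sample, so expect menus, footers, and cookie notices to appear in page_content alongside the main text.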

Run the demo

Download the ZIP below (script and sample files together), unzip into your project folder, then run:

document_loading_demo.zip

Script plus sample txt, csv, and pdf files


Inside the archive

◇document_loading_demo.py
◇document_loading_samples/html_basics.txt
◇document_loading_samples/html_tags.csv
◇document_loading_samples/html_basics.pdf

Unzip into your project folder. Keep document_loading_samples/ next to document_loading_demo.py — the zip already lays them out that way.

◇document_loading_demo.py

"""document_loading_demo.py"""

from langchain_community.document_loaders import TextLoader, PyPDFLoader, …

# load txt → csv → pdf → web URL

docs = TextLoader(…).load()


python document_loading_demo.py

PowerShell — (.venv) active

(.venv) PS C:\projects\langchain-course> python document_loading_demo.py

=== TextLoader (1 chunk(s)) ===

--- [0] metadata: {'source': '…/html_basics.txt'}

The HTML <a> tag creates a hyperlink to another page or file.…

=== CSVLoader (4 chunk(s)) ===

tag: a

description: Creates a hyperlink to another page or resource

=== PyPDFLoader (1 chunk(s)) ===

HTML basics - a tag creates a hyperlink

=== WebBaseLoader (fetching google.com) ===

Google About Store Gmail Images Sign in…

Your output should look close to this. Google's homepage text can vary by region and login state.

If it fails

FileNotFoundError: the sample folder must sit next to the script, and you must run the script from that folder. Check the path printed in the error.
ImportError or ModuleNotFoundError that mentions pypdf or bs4: run pip install pypdf beautifulsoup4 with the venv active.
Connection error on WebBaseLoader: check your network, or comment out that block and run the file loaders only.
Blank PDF output: the file is probably image-based. Try the included html_basics.pdf first.
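
The first two failures can be checked before running the demo. The sketch below is one way to do that with the standard library. missing_packages and missing_files are hypothetical helper names, and the check assumes you run it from the folder that contains document_loading_samples/.

```python
import importlib.util
from pathlib import Path


def missing_packages(import_names):
    """Return the import names that Python cannot find in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


def missing_files(folder, filenames):
    """Return the expected sample files that are absent from `folder`."""
    return [name for name in filenames if not (Path(folder) / name).is_file()]


# bs4 is the import name of the beautifulsoup4 package.
print("Missing packages:", missing_packages(["pypdf", "bs4"]))
print(
    "Missing sample files:",
    missing_files(
        "document_loading_samples",
        ["html_basics.txt", "html_tags.csv", "html_basics.pdf"],
    ),
)
```

Two empty lists mean the packages and sample files are in place, so any remaining failure is more likely a network or PDF content problem.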

More loaders are listed in the LangChain document loaders docs.

What's Next

You can load files into Document objects. Next up: split long text into smaller chunks.
