
How to Use LangChain Document Loaders

Document loaders are the first step in any RAG pipeline. They pull data from over 100 sources into a standard format that LLMs can work with.

Fastio Editorial Team · 6 min read
Document loaders transform raw data into standard formats for LLMs.

What Is a LangChain Document Loader?

Document loaders in LangChain pull data from a source and return Document objects that you can feed into vector stores or LLMs. A Document has two attributes:

  • page_content: The string containing the actual text of the document.
  • metadata: A dictionary containing information about the source, such as the filename, page number, or URL.

This standard structure means you can treat all text data the same way, whether it came from a PDF, a Notion database, or a CSV. The same splitting, embedding, and storage code works regardless of the original source.
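Concretely, every loader yields objects with this two-field shape. The snippet below is a plain-dict sketch of one loaded page (the real class is langchain_core.documents.Document; the file name, page number, and text are illustrative):

```python
# Plain-dict sketch of one loaded Document (values are illustrative)
doc = {
    "page_content": "Employees accrue 1.5 PTO days per month.",
    "metadata": {"source": "handbook.pdf", "page": 12},
}

# Downstream code only ever touches these two fields, so the original
# source format (PDF, Notion, CSV) no longer matters at this point
print(doc["metadata"]["source"])
```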

The BaseLoader Interface

All loaders inherit from the BaseLoader class, which exposes two key methods:

  • load(): Loads all documents into memory at once. Best for small datasets.
  • lazy_load(): Loads documents one by one in a stream. Use this for large datasets to avoid out-of-memory errors.
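The relationship between the two methods is easiest to see with a toy loader. This sketch mimics the BaseLoader contract using a minimal stand-in Document class so it runs without LangChain installed; in real code you would subclass BaseLoader from langchain_core, whose default load() is likewise just a list built from lazy_load():

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Document:  # minimal stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyLoader:
    """Mimics the BaseLoader contract: lazy_load() streams, load() materializes."""
    def __init__(self, lines: list[str]):
        self.lines = lines

    def lazy_load(self) -> Iterator[Document]:
        # Yields one Document at a time -- constant memory, suits huge inputs
        for i, line in enumerate(self.lines):
            yield Document(page_content=line, metadata={"line": i})

    def load(self) -> list[Document]:
        # Materializes everything at once -- fine for small datasets
        return list(self.lazy_load())

docs = ToyLoader(["alpha", "beta"]).load()
print(len(docs), docs[0].metadata)  # 2 {'line': 0}
```

Because load() is built on lazy_load(), writing the streaming method first gives you both behaviors for free.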


Top LangChain Loaders Compared

LangChain offers over 100 built-in integrations. Choosing the right one depends on your source data and performance requirements.

| Loader Type | Class Name | Best For | Key Feature |
| --- | --- | --- | --- |
| PDF | PyPDFLoader | Standard text-based PDFs | Simple, fast extraction of text and page numbers |
| PDF (complex) | UnstructuredPDFLoader | Layout-heavy PDFs | Uses the unstructured library to handle tables and images |
| Web | WebBaseLoader | Scraping public URLs | Extracts clean text from HTML, removing boilerplate |
| CSV | CSVLoader | Structured data | Converts each row into a separate document |
| Directory | DirectoryLoader | Folders of mixed files | Recursively loads all supported files in a path |

Most developers start with PyPDFLoader for RAG pipelines involving business documents, but complex formatting often requires upgrading to UnstructuredPDFLoader or commercial parsers.

How to Load PDFs in LangChain

Loading a PDF is the most common "Hello World" task for document loaders. Here is a step-by-step example using the standard PyPDFLoader.

1. Install Dependencies

You will need the langchain-community and pypdf packages:

pip install langchain-community pypdf

2. Load the Document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/handbook.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(f"Content of page 1: {pages[0].page_content[:100]}...")
print(f"Metadata: {pages[0].metadata}")

This code initializes the loader with a file path and calls .load(), which returns a list of Document objects, one per page. The metadata automatically includes the source path and page number, which you need for citing sources in RAG responses.

Visualization of data being parsed and indexed

The Problem with Local File Loaders for Agents

While loading local files works for scripts running on your laptop, it breaks down when building autonomous AI agents. Agents running in cloud environments (like serverless functions or containers) typically have ephemeral file systems. If your agent saves a file to ./tmp, that file vanishes when the session ends. And if you are building a multi-agent system, Agent A cannot easily "hand off" a local file to Agent B without writing upload/download plumbing.

Why agents need persistent storage loaders:

  • Persistence: Data must survive server restarts and environment re-provisioning.
  • Sharing: Files need to be accessible by multiple agents and humans.
  • Universal Access: Loaders should work via standard authenticated URLs, not just local file paths.
Fastio features

Give Your AI Agents Persistent Storage

Fastio gives your AI agents a unified, persistent file system with built-in RAG. Connect S3, Google Drive, and more to a single standard interface.

Building Custom Loaders for Cloud Storage

To solve the persistence issue, developers often build custom loaders that fetch data from cloud storage before processing.

Custom Loader Pattern

A custom loader typically steps through this logic:

  1. Authenticate: Connect to the storage provider (S3, Google Drive, Fastio).
  2. Fetch: Download the file stream into memory.
  3. Parse: Convert the stream into text.
  4. Wrap: Return Document objects.
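The four steps above can be sketched as a minimal loader. This is an illustrative stand-in (including a tiny Document class so it runs without LangChain installed, and an injected fetch function standing in for an authenticated client such as boto3); a production version would subclass BaseLoader and import Document from langchain_core:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator

@dataclass
class Document:  # stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

class CloudStorageLoader:
    """Sketch of the authenticate/fetch/parse/wrap pattern."""
    def __init__(self, url: str, fetch: Callable[[str], bytes]):
        self.url = url
        self.fetch = fetch  # 1. Authenticate: assumed handled inside `fetch`

    def lazy_load(self) -> Iterator[Document]:
        raw = self.fetch(self.url)            # 2. Fetch bytes into memory
        text = raw.decode("utf-8")            # 3. Parse bytes into text
        yield Document(                       # 4. Wrap as a Document
            page_content=text,
            metadata={"source": self.url},
        )

# Exercise the loader with a fake fetch function (no network needed)
fake_fetch = lambda url: b"employee handbook v2"
docs = list(CloudStorageLoader("https://example.com/handbook.txt", fake_fetch).lazy_load())
print(docs[0].metadata)  # {'source': 'https://example.com/handbook.txt'}
```

Keeping the fetch step pluggable means the same loader body works for any backend that can hand you bytes for a URL.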

Using Fastio as Your Storage Backend

Instead of writing custom loaders for every cloud provider, Fastio offers a single interface. Since Fastio provides a standard URL structure and an MCP (Model Context Protocol) server, agents can load documents without juggling multiple auth flows.

  • Direct URL Loading: Use WebBaseLoader with secure Fastio links.
  • MCP Integration: The Fastio MCP server allows agents to "read" files directly from your persistent cloud storage without needing a specific LangChain loader wrapper.
  • Intelligence Mode: Fastio's built-in RAG indexes your files automatically. Instead of building a loader pipeline, your agent can query the Intelligence API to get relevant chunks, skipping the "load -> split -> embed -> store" loop entirely.
AI agents accessing a shared cloud storage network

Advanced: Adding Metadata to Your Documents

Metadata makes the difference between a RAG system that returns vaguely relevant results and one that gives you exactly what you asked for. A generic loader might only give you a filename, but good retrieval often depends on filtering by date, author, or document type. You can inject custom metadata during the loading process:

# Example of adding custom metadata after loading
documents = loader.load()

for doc in documents:
    doc.metadata["department"] = "HR"
    doc.metadata["upload_date"] = "2026-02-09"
    doc.metadata["version"] = 1.0

When these documents are added to a vector store, you can then perform filtered searches (e.g., "Show me HR policies uploaded after 2025"). Fastio's Intelligence Mode handles this automatically, indexing file attributes alongside the vector embeddings.
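The filter semantics can be sketched without a vector store: retrieval backends typically apply an exact-match or range predicate over the metadata dict alongside the similarity search. A minimal illustration over plain dicts (field names and values are made up for the example):

```python
# Metadata records as a retriever might see them (values are illustrative)
docs = [
    {"text": "PTO policy",  "department": "HR",  "upload_date": "2026-02-09"},
    {"text": "Q3 roadmap",  "department": "Eng", "upload_date": "2025-11-01"},
    {"text": "Old PTO doc", "department": "HR",  "upload_date": "2024-05-20"},
]

def filtered(docs, *, department, after):
    # ISO-8601 date strings compare correctly as plain strings
    return [d for d in docs
            if d["department"] == department and d["upload_date"] > after]

print([d["text"] for d in filtered(docs, department="HR", after="2025-01-01")])
# ['PTO policy']
```

A real vector store runs the same kind of predicate, just combined with nearest-neighbor search, which is why rich metadata pays off at query time.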

Frequently Asked Questions

How do I load multiple files in LangChain?

Use the `DirectoryLoader`. It accepts a directory path and a glob pattern (e.g., `**/*.pdf`) to recursively find and load all matching files using the appropriate underlying loader for each file type.
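The glob matching itself is ordinary recursive pattern matching; here is a standard-library sketch of just the discovery step, run against a throwaway directory (DirectoryLoader then dispatches each matched path to an underlying loader):

```python
from pathlib import Path
import tempfile

# Build a small throwaway tree to match against
root = Path(tempfile.mkdtemp())
(root / "reports").mkdir()
(root / "reports" / "q1.pdf").touch()
(root / "notes.txt").touch()

# "**/*.pdf" matches PDFs at any depth, mirroring DirectoryLoader's glob argument
pdfs = [p.relative_to(root).as_posix() for p in sorted(root.glob("**/*.pdf"))]
print(pdfs)  # ['reports/q1.pdf']
```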

What is the difference between load and lazy_load?

`load()` processes all documents and keeps them in memory, returning a complete list. `lazy_load()` returns a generator that yields documents one by one, which is much more memory-efficient for large datasets or massive files.

Can LangChain load files directly from Google Drive?

Yes, LangChain has a `GoogleDriveLoader`. However, it requires setting up a Google Cloud Project, enabling the Drive API, and managing service account credentials. Using a unified storage layer like Fastio can simplify this by aggregating multiple cloud sources into a single standard interface.
