n8n PDF text extraction node

n8n PDF text extraction node turns PDF documents into chunked text ready for LLM pipelines. Configure chunk sizes and overlap to fit your model's context window. Connect it to n8n.

What data you can get with n8n PDF text extraction node

Get chunked text from any PDF with source URL, chunk index, and content per segment. Feed into LangChain, LlamaIndex, or custom AI frameworks for document QA or search.

Output

[
  {
    "url": "https://arxiv.org/pdf/2307.12856.pdf",
    "text": "Preprint\nA REAL-WORLD WEBAGENT WITH PLANNING,\nLONG CONTEXT UNDERSTANDING, AND\nPROGRAM SYNTHESIS\nIzzeddin Gur1∗ Hiroki Furuta1,2∗† Austin Huang1 Mustafa Safdari1 Yutaka Matsuo2\nDouglas Eck1 Aleksandra Faust1\n1Google DeepMind, 2The University of Tokyo\nizzeddin@google.com, furuta@weblab.t.u-tokyo.ac.jp\nABSTRACT\nPre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We introduce\nWebAgent, an LLM-driven agent that learns from self-experience to complete tasks\non real websites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via Python programs",
    "index": 0
  },
  {
    "url": "https://arxiv.org/pdf/2307.12856.pdf",
    "text": "generated from those. We design WebAgent with Flan-U-PaLM, for grounded code\ngeneration, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span denoising\nobjectives, for planning and summarization. We empirically demonstrate that our\nmodular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7%\nhigher success rate than the prior method on MiniWoB web automation benchmark,\nand SoTA performance on Mind2Web, an offline task planning evaluation.\n1 INTRODUCTION\nLarge language models (LLM) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023) can\nsolve variety of natural language tasks, such as arithmetic, commonsense, logical reasoning, question\nanswering, text generation (Brown et al., 2020; Kojima et al., 2022; Wei et al., 2022), and even",
    "index": 1
  },
  {
    "url": "https://arxiv.org/pdf/2307.12856.pdf",
    "text": "interactive decision making tasks (Ahn et al., 2022; Yao et al., 2022b). Recently, LLMs have also\ndemonstrated success in autonomous web navigation, where the agents control computers or browse\nthe internet to satisfy the given natural language instructions through the sequence of computer\nactions, by leveraging the capability of HTML comprehension and multi-step reasoning (Furuta et al.,\n2023; Gur et al., 2022; Kim et al., 2023).\nHowever, web automation on real-world websites has still suffered from (1) the lack of pre-defined\naction space, (2) much longer HTML observations than simulators, and (3) the absence of domain\nknowledge for HTML in LLMs (Figure 1). Considering the open-ended real-world websites and the\ncomplexity of instructions, defining appropriate action space in advance is challenging. In addition,\nalthough several works have argued that recent LLMs with instruction-finetuning or reinforcement",
    "index": 2
  }
]
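Each item in the output carries `url`, `text`, and `index`, so downstream nodes can reorder, filter, or reassemble the chunks. As a minimal sketch (outside n8n, using the same output shape as the sample above; the `overlap_chars` parameter is a hypothetical setting mirroring the Actor's overlap option):

```python
# Reassemble chunked Actor output into a single document string.
# Input items follow the shape shown above: url, text, index per chunk.

def reassemble(chunks, overlap_chars=0):
    """Sort chunks by index and join their text.

    If the Actor was configured with character overlap between chunks,
    trimming overlap_chars from each subsequent chunk avoids duplication.
    """
    ordered = sorted(chunks, key=lambda c: c["index"])
    parts = []
    for i, chunk in enumerate(ordered):
        text = chunk["text"]
        if i > 0 and overlap_chars:
            text = text[overlap_chars:]  # drop the overlap with the previous chunk
        parts.append(text)
    return "".join(parts)

chunks = [
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "text": "world", "index": 1},
    {"url": "https://arxiv.org/pdf/2307.12856.pdf", "text": "Hello ", "index": 0},
]
print(reassemble(chunks))  # Hello world
```

The same pattern works inside an n8n Code node or when feeding the chunks into LangChain or LlamaIndex document loaders.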

How to set up n8n PDF text extraction node with Apify

Pass PDF URLs along with chunk size and overlap settings. The Actor extracts the text and splits it into segments sized for your LLM's context window, with configurable overlap between consecutive chunks.
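The input is passed as JSON. The exact field names depend on the Actor's input schema (check it in Apify Console); the names below are illustrative assumptions, not the Actor's documented schema:

```json
{
  "urls": ["https://arxiv.org/pdf/2307.12856.pdf"],
  "chunkSize": 1500,
  "overlap": 200
}
```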

1. Sign up for an Apify account

Creating an account is quick and free. No credit card is required, and your account gives you access to 20,000+ scrapers and APIs.

2. Get your Apify API token

Go to Settings in Apify Console and navigate to the API & Integrations tab. There, create a new token and save it for later.
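With the token in hand, you can also call the Actor directly over Apify's REST API instead of (or before) wiring it into n8n. A hedged Python sketch, assuming Apify's API v2 run-sync endpoint; the Actor ID `username/pdf-text-extraction` is a placeholder you must replace with the real one:

```python
# Build the Apify API URL for running an Actor and fetching its dataset items.
# ACTOR_ID and the input fields are placeholders -- substitute your own
# Actor ID, token, and input per the Actor's input schema.

API_BASE = "https://api.apify.com/v2"

def build_run_url(actor_id: str) -> str:
    """Return the run-sync-get-dataset-items endpoint for an Actor.

    Slashes in a username/actor-name ID are replaced with '~',
    matching Apify's URL convention for Actor IDs.
    """
    return f"{API_BASE}/acts/{actor_id.replace('/', '~')}/run-sync-get-dataset-items"

url = build_run_url("username/pdf-text-extraction")
print(url)

# To actually run it (requires the `requests` package and a valid token):
# import requests
# resp = requests.post(
#     url,
#     params={"token": "YOUR_APIFY_TOKEN"},
#     json={"urls": ["https://arxiv.org/pdf/2307.12856.pdf"]},
#     timeout=300,
# )
# chunks = resp.json()  # list of {url, text, index} items as shown above
```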

3. Test-run PDF Text Extraction Node

Open PDF Text Extraction Node in Apify Console and configure your input parameters. Click Start to run the Actor and preview the data structure you receive in your n8n workflow.

4. Integrate PDF Text Extraction Node via n8n

Add the Apify node to your n8n workflow. Select Run Actor as the operation, choose your Actor, and pass your input configuration as JSON. Enable Wait for finish to retrieve results directly in subsequent nodes.
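With Wait for finish enabled, each dataset item becomes one n8n item, so downstream nodes receive JSON matching the Actor output shown earlier, for example (the `text` value here is elided for brevity):

```json
{
  "url": "https://arxiv.org/pdf/2307.12856.pdf",
  "text": "…first chunk of extracted text…",
  "index": 0
}
```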

Why use Apify?

Never get blocked

Every plan (free included) comes with Apify Proxy, which is great for avoiding blocking and giving you access to geo-specific content.

Customers love us

We truly care about the satisfaction of our users and thanks to that we're one of the best-rated data extraction platforms on both G2 and Capterra.

Monitor your runs

With our latest monitoring features, you always have immediate access to valuable insights on the status of your web scraping tasks.

Frequently Asked Questions

How do I integrate the PDF Text Extraction Node with n8n?

Add an HTTP Request node to your n8n workflow and point it to the Apify API. Use your API token for authentication and specify the PDF text extraction node Actor ID you want to run. The Actor executes and returns data directly to your workflow. You can also use n8n's dedicated Apify node if available in your version.

Is it free to use Apify with n8n?

Yes. Apify offers a free tier with prepaid platform usage. This is enough to test Actors with your n8n workflows and run small-scale extractions. No credit card required to start.

Do I need coding skills to use Apify with n8n?

No. You can configure Apify Actors through their web interface and connect them to n8n using the HTTP Request node - no coding required. For advanced use cases, you can customize Actor inputs or use the Apify SDK with JavaScript or Python.

Why use Apify Actors instead of building my own scraper?

Building and maintaining scrapers takes significant time. Websites change their structure, add bot detection, and block requests. Apify Actors handle all of this automatically - proxy rotation, anti-bot bypassing, error handling, and data parsing. You get reliable data without the maintenance burden.

Try n8n PDF Text Extraction Node now