
PDF Text Extractor

jirimoravcik/pdf-text-extractor


PDF Text Extractor extracts text from PDF files. It can also split the extracted text into chunks to prepare the data for use with large language models.


Input

URLs - URLs of the PDF files you want to extract text from.
Chunk size - the maximum size of a single chunk of text.
Chunk overlap - the number of characters shared between neighbouring chunks of text.
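
To illustrate how chunk size and chunk overlap interact, here is a minimal sketch of fixed-window chunking. It is not the Actor's actual implementation: the function name chunk_text is hypothetical, chunk size is assumed to be measured in characters, and the real splitter may behave differently (for example, by preferring to break at line boundaries).

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only. Each window starts
    chunk_size - chunk_overlap characters after the previous one, so
    neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.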

Output

Each item contains the URL of the source PDF, an index that identifies the chunk's position in the extracted text, and the extracted text itself.

Sample output

[{
  "url": "https://arxiv.org/pdf/2307.12856.pdf",
  "index": 0,
  "text": "Preprint\nA REAL-WORLD WEBAGENT WITH PLANNING,\nLONG CONTEXT UNDERSTANDING, AND\nPROGRAM SYNTHESIS\nIzzeddin Gur1∗ Hiroki Furuta1,2∗† Austin Huang1 Mustafa Safdari1 Yutaka Matsuo2\nDouglas Eck1 Aleksandra Faust1\n1Google DeepMind, 2The University of Tokyo\nizzeddin@google.com, furuta@weblab.t.u-tokyo.ac.jp\nABSTRACT\nPre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the\nperformance on real-world websites has still suffered from (1) open domainness,\n(2) limited context length, and (3) lack of inductive bias on HTML. We introduce\nWebAgent, an LLM-driven agent that learns from self-experience to complete tasks\non real websites following natural language instructions. WebAgent plans ahead by\ndecomposing instructions into canonical sub-instructions, summarizes long HTML\ndocuments into task-relevant snippets, and acts on websites via Python programs"
},
{
  "url": "https://arxiv.org/pdf/2307.12856.pdf",
  "index": 1,
  "text": "generated from those. We design WebAgent with Flan-U-PaLM, for grounded code\ngeneration, and HTML-T5, new pre-trained LLMs for long HTML documents\nusing local and global attention mechanisms and a mixture of long-span denoising\nobjectives, for planning and summarization. We empirically demonstrate that our\nmodular recipe improves the success on real websites by over 50%, and that HTMLT5 is the best model to solve various HTML understanding tasks; achieving 18.7%\nhigher success rate than the prior method on MiniWoB web automation benchmark,\nand SoTA performance on Mind2Web, an offline task planning evaluation.\n1 INTRODUCTION\nLarge language models (LLM) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023) can\nsolve variety of natural language tasks, such as arithmetic, commonsense, logical reasoning, question\nanswering, text generation (Brown et al., 2020; Kojima et al., 2022; Wei et al., 2022), and even"
},
{
  "url": "https://arxiv.org/pdf/2307.12856.pdf",
  "index": 2,
  "text": "interactive decision making tasks (Ahn et al., 2022; Yao et al., 2022b). Recently, LLMs have also\ndemonstrated success in autonomous web navigation, where the agents control computers or browse\nthe internet to satisfy the given natural language instructions through the sequence of computer\nactions, by leveraging the capability of HTML comprehension and multi-step reasoning (Furuta et al.,\n2023; Gur et al., 2022; Kim et al., 2023).\nHowever, web automation on real-world websites has still suffered from (1) the lack of pre-defined\naction space, (2) much longer HTML observations than simulators, and (3) the absence of domain\nknowledge for HTML in LLMs (Figure 1). Considering the open-ended real-world websites and the\ncomplexity of instructions, defining appropriate action space in advance is challenging. In addition,\nalthough several works have argued that recent LLMs with instruction-finetuning or reinforcement"
}]
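
Because a run may cover several PDFs, it can help to group the output items by source URL and order them by index before passing them on. The following is a rough sketch that relies only on the "url", "index", and "text" fields shown in the sample output; the function name group_chunks_by_url and the shortened example items are hypothetical.

```python
from collections import defaultdict


def group_chunks_by_url(items: list[dict]) -> dict[str, list[str]]:
    """Return, for each source URL, its chunk texts ordered by index.

    Hypothetical helper: assumes every item has the "url", "index",
    and "text" fields shown in the sample output.
    """
    grouped = defaultdict(list)
    for item in items:
        grouped[item["url"]].append(item)

    return {
        url: [chunk["text"] for chunk in sorted(chunks, key=lambda c: c["index"])]
        for url, chunks in grouped.items()
    }


if __name__ == "__main__":
    # Shortened, made-up items in the same shape as the sample output.
    example_items = [
        {"url": "https://example.com/a.pdf", "index": 1, "text": "second chunk"},
        {"url": "https://example.com/b.pdf", "index": 0, "text": "only chunk"},
        {"url": "https://example.com/a.pdf", "index": 0, "text": "first chunk"},
    ]
    for url, texts in group_chunks_by_url(example_items).items():
        print(url, texts)
```

If a chunk overlap was configured, neighbouring chunks may repeat some text, so simply concatenating them will not necessarily reproduce the original document.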

How to use PDF Text Extractor

Follow this tutorial to learn how to use PDF Text Extractor and combine it with LangChain to build an intelligent QA system that can extract answers from PDF documents.

Developer

Jiří Moravčík

Actor Metrics

43 monthly users
19 stars
>99% runs succeeded
Created in Oct 2023
Modified 4 months ago

Categories

Integrations

Automation
