Website Content Crawler

  • apify/website-content-crawler
  • Users 4.6k
  • Runs 368.8k
  • Created by Apify

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Website Content Crawler is an Apify Actor that can perform a deep crawl of one or more websites and extract text content from the web pages. It is useful to download data from websites such as documentation, knowledge bases, help sites, or blogs.

The Actor was specifically designed to extract data for feeding, fine-tuning, or training large language models (LLMs) such as GPT-4, ChatGPT, or LLaMA.

Website Content Crawler has a simple input configuration so that it can be easily integrated into customer-facing products, where customers enter just the URL of the website they want indexed by an AI application. You can retrieve the results via the API in formats such as JSON or CSV, which can be fed directly to your LLM, vector database, or ChatGPT.

Main features

Website Content Crawler is built upon Crawlee, Apify's state-of-the-art open-source library for web scraping and crawling. The Actor can:

  • Crawl JavaScript-enabled websites using headless Firefox or Chrome, or simple sites using raw HTTP.
  • Circumvent anti-scraping protections using browser fingerprinting and proxies.
  • Save web content in plain text, Markdown, or HTML.
  • Crawl pages behind a login.
  • Download files in PDF, DOC, DOCX, XLS, XLSX, or CSV formats.
  • Remove fluff from pages, such as navigation, headers, footers, ads, modals, or cookie warnings, to improve the accuracy of the data.
  • Load content of pages with infinite scroll.
  • Scale gracefully from tiny sites to sites with millions of pages by leveraging the Apify platform capabilities.
  • Integrate with 🦜🔗LangChain or LlamaIndex.
  • and much more...

Not sure if Website Content Crawler can handle your use case? Simply try it free of charge. You can also check out our Web Scraping Data for Generative AI video, which showcases Website Content Crawler.

Designed for generative AI and LLMs

The results of Website Content Crawler can help you feed, fine-tune or train your large language models (LLMs) or provide context for prompts for ChatGPT. In return, the model will answer questions based on your or your customer's websites and content.

Custom chatbots for customer support

Customer service chatbots personalized on customer websites, such as documentation or knowledge bases, are one of the most promising use cases of AI and LLMs. Let your customers easily onboard by typing the URL of their site, and thus give your chatbot detailed knowledge of their product or service. Learn more about this use case in our blog post.

Generate personalized content based on customer’s copy

ChatGPT and LLMs can write articles for you, but they won’t sound like you wrote them. Feed all your old blogs into your model to make it sound like you. Or train the model on your customers’ blogs and have it write in their tone of voice. Or help their technical writers with making first drafts of new documentation pages.

Summarization, translation, proofreading at scale

Got some old docs or blogs that need to be improved? Use Website Content Crawler to scrape the content, feed it to the ChatGPT API, and ask it to summarize, proofread, translate, or change the style of the content.

How does it work?

Website Content Crawler operates in three stages:

  1. Crawling - Finds and downloads the right web pages.
  2. HTML processing - Transforms the DOM of crawled pages to e.g. remove navigation, header, footer, cookie warnings, and other fluff.
  3. Output - Converts the resulting DOM to plain text or Markdown and saves downloaded files.

For clarity, the input settings of the Actor are organized according to the above stages. Note that input settings have reasonable defaults—the only mandatory setting is the Start URLs.

Crawling

Website Content Crawler only needs one or more Start URLs to run, typically the top-level URL of the documentation site, blog, or knowledge base that you want to scrape. The Actor crawls the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URL is under the start URL.

For example, if you enter the start URL https://example.com/blog/, the Actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
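The "under the start URL" rule can be approximated with a simple prefix check. The following is only a sketch of the idea, assuming that scheme and host must match and that the start URL's path acts as a directory prefix; the function name is hypothetical and the Actor's actual matching logic may differ.

```python
from urllib.parse import urlsplit


def is_under_start_url(url: str, start_url: str) -> bool:
    """Return True if `url` lives under `start_url` (same scheme, host, and path prefix)."""
    candidate, start = urlsplit(url), urlsplit(start_url)
    if (candidate.scheme, candidate.netloc) != (start.scheme, start.netloc):
        return False
    # Treat the start path as a directory, so "/blog" does not match "/blogger".
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    return candidate.path == start.path or candidate.path.startswith(prefix)


start = "https://example.com/blog/"
for link in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    print(link, "->", "crawl" if is_under_start_url(link, start) else "skip")
```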

You can also force the crawler to skip certain URLs using the Exclude URLs (globs) input setting, which specifies an array of glob patterns matching URLs of pages to be skipped. Note that this setting affects only links found on pages, not Start URLs, which are always crawled. For example, https://{store,docs}.example.com/** will exclude all URLs starting with https://store.example.com/ and https://docs.example.com/. Similarly, https://example.com/**/*\?*foo=* excludes all URLs that contain the foo query parameter with any value. You can learn more about globs and test them here.
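To get a feel for how the two example globs behave, here is a minimal, simplified glob matcher. It is an illustration only: it assumes `**` matches across path segments, `*` stays within one segment, `{a,b}` lists alternatives (no nesting), and `\` escapes the next character. The Actor's real glob engine may treat edge cases differently, and both function names are hypothetical.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a small subset of glob syntax into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(glob):
        ch = glob[i]
        if ch == "\\" and i + 1 < len(glob):
            parts.append(re.escape(glob[i + 1]))  # escaped literal, e.g. \?
            i += 2
        elif glob.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole path segments
            i += 3
        elif glob.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif ch == "*":
            parts.append("[^/]*")  # anything within one segment
            i += 1
        elif ch == "?":
            parts.append("[^/]")  # exactly one character within a segment
            i += 1
        elif ch == "{":
            end = glob.index("}", i)  # raises ValueError on an unclosed brace
            options = glob[i + 1:end].split(",")
            parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def is_excluded(url: str, globs: list) -> bool:
    """Return True if the whole URL matches at least one exclusion glob."""
    return any(glob_to_regex(g).fullmatch(url) for g in globs)


EXCLUDE_GLOBS = [
    "https://{store,docs}.example.com/**",
    r"https://example.com/**/*\?*foo=*",
]
print(is_excluded("https://store.example.com/cart", EXCLUDE_GLOBS))
print(is_excluded("https://example.com/a/b?bar=1", EXCLUDE_GLOBS))
```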

The Actor automatically skips duplicate pages identified by the same canonical URL; those pages are loaded and counted towards the Max pages limit but not saved to the results.
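The interaction between deduplication and the Max pages limit can be sketched as follows. This is a simplified model, not the Actor's code: it assumes each loaded page is a dict with hypothetical `url` and `canonicalUrl` keys, and it shows that a duplicate still uses up one slot of the page budget.

```python
def save_unique_pages(pages, max_pages):
    """Keep the first page per canonical URL; every loaded page counts towards max_pages."""
    seen_canonical = set()
    saved = []
    for loaded_count, page in enumerate(pages, start=1):
        if loaded_count > max_pages:
            break  # the limit applies to loaded pages, not to saved ones
        canonical = page.get("canonicalUrl") or page["url"]
        if canonical in seen_canonical:
            continue  # loaded and counted, but not saved
        seen_canonical.add(canonical)
        saved.append(page)
    return saved


pages = [
    {"url": "https://example.com/a", "canonicalUrl": "https://example.com/a"},
    {"url": "https://example.com/a?ref=nav", "canonicalUrl": "https://example.com/a"},
    {"url": "https://example.com/b", "canonicalUrl": "https://example.com/b"},
]
print(len(save_unique_pages(pages, max_pages=2)))
print(len(save_unique_pages(pages, max_pages=3)))
```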

Website Content Crawler provides various input settings to customize the crawling. For example, you can select the crawler type:

  • Headless web browser - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower.
  • Stealthy web browser (default) - Another headless web browser, but with anti-blocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use it with Apify Proxy.
  • Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.
  • Raw HTTP client with JS execution (JSDOM) [experimental] - A compromise between a browser and raw HTTP crawlers. Good performance and should work on almost all websites, including those with dynamic content. However, it is still experimental and might sometimes crash, so we don't recommend it in production settings yet.

You can also set additional input parameters such as a maximum number of pages, maximum crawling depth, maximum concurrency, proxy configuration, timeout, etc., to control the behavior and performance of the Actor.
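A run with some of these settings could be started from Python roughly as shown below. Treat it as a sketch under assumptions: only `startUrls` appears in this README, so the other input keys (`crawlerType`, `maxCrawlPages`, `maxCrawlDepth`, `maxConcurrency`, `proxyConfiguration`) and the crawler type value are assumed names that should be verified against the Actor's input schema. The helper functions are hypothetical.

```python
def build_run_input(start_url, crawler_type="playwright:firefox",
                    max_pages=100, max_depth=5, max_concurrency=10):
    """Build an Actor input dict; all keys except startUrls are assumptions."""
    return {
        "startUrls": [{"url": start_url}],
        # Assumed key names below: check the Actor's input schema before use.
        "crawlerType": crawler_type,
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        "maxConcurrency": max_concurrency,
        "proxyConfiguration": {"useApifyProxy": True},
    }


def run_crawler(token, run_input):
    """Run the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input("https://docs.apify.com/", max_pages=50)
print(run_input)
# items = run_crawler("<Your Apify API token>", run_input)
```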

HTML processing

The goal of the HTML processing step is to ensure each web page has the right content — neither less nor more.

If you're using a headless browser Crawler type, whenever a web page is loaded, the Actor can wait a certain time or scroll to a certain height to ensure all dynamic page content is loaded, using the Wait for dynamic content or Maximum scroll height input settings, respectively. If Expand clickable elements is enabled, the Actor tries to click various DOM elements to ensure their content is expanded and visible in the resulting text.

Once the web page is ready, the Actor transforms its DOM to remove irrelevant content, so that you feed your AI models only relevant data and keep them accurate.

First, the Actor removes DOM nodes matching the Remove HTML elements (CSS selector). The provided default value attempts to remove all common types of modals, navigation, headers, or footers, as well as scripts and inline images to reduce the output HTML size.

Then, if Remove cookie warnings is enabled, the Actor removes cookie warnings using the I don't care about cookies browser extension.

Finally, the Actor transforms the page using the selected HTML transformer, whose goal is to only keep the important content of the page and reduce its complexity before converting it to text. Basically, to keep just the "meat" of the article or a page.
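The effect of stripping navigation, headers, and footers before text extraction can be illustrated with Python's standard library alone. This is a deliberately crude sketch: it drops whole elements by tag name instead of by CSS selector, and it is not the Actor's HTML transformer. The tag list and all names are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumed set of "fluff" tags for this example only.
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright Example</footer></body></html>"
)
print(extract_main_text(page))
```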

Output

Once the web page HTML is processed, the Actor converts it to the desired output format: plain text, or Markdown to preserve rich formatting. It can also save the full HTML or a screenshot of the page, which is useful for debugging. The Actor also saves important metadata about the content, such as author, language, publishing date, etc.

The results of the Actor are stored in the default Dataset associated with the Actor run, from where you can access them via the API and export them to formats like JSON, XML, or CSV.
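As a small illustration of the export step, the helper below builds a download URL for a dataset in a chosen format. It assumes the Apify API v2 dataset items endpoint with its `format` and `token` query parameters; verify the details in the API reference. The function name and the dataset ID are placeholders.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def dataset_items_url(dataset_id, fmt="json", token=None):
    """Return a URL for downloading dataset items in the given format."""
    params = {"format": fmt}
    if token:
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


# "<Apify Dataset ID>" is a placeholder for the run's default dataset ID.
for fmt in ("json", "xml", "csv"):
    print(dataset_items_url("<Apify Dataset ID>", fmt))
```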

Example

This example shows how to scrape all pages from the Apify documentation at https://docs.apify.com/:

Input

[Screenshot: input-screenshot.png]

See full input with description.

Output

This is how one crawled page (https://docs.apify.com/academy/web-scraping-for-beginners) looks in a browser:

[Screenshot: page-screenshot.png]

And here is how the crawling result looks in JSON format (note that other formats like CSV or Excel are also supported). The main page content can be found in the text field, and it only contains the valuable content, without menus and other noise:

{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners", "crawl": { "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", "loadedTime": "2023-04-05T16:26:51.030Z", "referrerUrl": "https://docs.apify.com/academy", "depth": 0 }, "metadata": { "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", "title": "Web scraping for beginners | Apify Documentation", "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.", "author": null, "keywords": null, "languageCode": "en" }, "screenshotUrl": null, "text": "Skip to main content\nOn this page\nWeb scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying this tutorial instead.\nThis course is made by Apify, the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the Apify platform course, where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\nWhy learn scraper development?​\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? 
Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\nCourse Summary​\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\nThis is what you'll learn in the Web scraping for beginners course:\nWeb scraping for beginners\nBasics of data extraction\nBasics of crawling\nBest practices\nRequirements​\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. 
A seemingly insignificant thing like using [] instead of () can make a lot of difference.\nIf you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a JavaScript course and learning about CSS Selectors.\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\nIdeally, you should have at least a moderate understanding of the following concepts:\nJavaScript + Node.js​\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and async...await), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\nasync...await (YouTube)\nJavaScript loops (MDN)\nModularity in Node.js\nGeneral web development​\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be assumed (unless we're showing something out of the ordinary).\nHTML\nHTTP protocol\nDevTools\njQuery or Cheerio​\nWe'll be using the Cheerio package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\nNext up​\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. 
So let's get to it!\nIf you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the Basics of crawling section.\nWhy learn scraper development?\nCourse Summary\nRequirements\nJavaScript + Node.js\nGeneral web development\njQuery or Cheerio\nNext up", "html": null, "markdown": " Web scraping for beginners | Apify Documentation \n\n[Skip to main content](#docusaurus_skipToContent_fallback)\n\nOn this page\n\n# Web scraping for beginners\n\n**Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.**\n\n* * *\n\nWelcome to **Web scraping for beginners**, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead.\n\nThis course is made by [Apify](https://apify.com), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\n\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](/academy/apify-platform), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\n\n## Why learn scraper development?[​](#why-learn \"Direct link to Why learn scraper development?\")\n\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? 
Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\n\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\n\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\n\n## Course Summary[​](#summary \"Direct link to Course Summary\")\n\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\n\nThis is what you'll learn in the **Web scraping for beginners** course:\n\n* [Web scraping for beginners](/academy/web-scraping-for-beginners)\n * [Basics of data extraction](/academy/web-scraping-for-beginners/data-collection)\n * [Basics of crawling](/academy/web-scraping-for-beginners/crawling)\n * [Best practices](/academy/web-scraping-for-beginners/best-practices)\n\n## Requirements[​](#requirements \"Direct link to Requirements\")\n\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. 
Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference.\n\n> If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a [JavaScript course](https://www.codecademy.com/learn/introduction-to-javascript) and learning about [CSS Selectors](https://www.w3schools.com/css/css_selectors.asp).\n\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\n\nIdeally, you should have at least a moderate understanding of the following concepts:\n\n### JavaScript + Node.js[​](#javascript-and-node \"Direct link to JavaScript + Node.js\")\n\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. 
If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\n\n* [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)\n* [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)\n* [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)\n\n### General web development[​](#general-web-development \"Direct link to General web development\")\n\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).\n\n* [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)\n* [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)\n* [DevTools](/academy/web-scraping-for-beginners/data-collection/browser-devtools)\n\n### jQuery or Cheerio[​](#jquery-or-cheerio \"Direct link to jQuery or Cheerio\")\n\nWe'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\n\n## Next up[​](#next \"Direct link to Next up\")\n\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. 
So [let's get to it!](/academy/web-scraping-for-beginners/introduction)\n\n> If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](/academy/web-scraping-for-beginners/crawling) section.\n\n* [Why learn scraper development?](#why-learn)\n* [Course Summary](#summary)\n* [Requirements](#requirements)\n * [JavaScript + Node.js](#javascript-and-node)\n * [General web development](#general-web-development)\n * [jQuery or Cheerio](#jquery-or-cheerio)\n* [Next up](#next)" }
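A record like the one above can be reduced to the few fields an LLM pipeline typically needs. The sketch below uses a heavily shortened copy of the record (the `text` value is truncated for brevity), and the `to_document` helper and its output keys are hypothetical.

```python
import json


def to_document(item):
    """Map one crawled record to a minimal document: content plus source metadata."""
    metadata = item.get("metadata") or {}
    return {
        "page_content": item.get("text") or "",
        "source": item["url"],
        "title": metadata.get("title"),
        "language": metadata.get("languageCode"),
    }


# Shortened sample of the record shown above.
sample = json.loads("""
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "metadata": {
    "title": "Web scraping for beginners | Apify Documentation",
    "languageCode": "en"
  },
  "text": "Web scraping for beginners",
  "html": null
}
""")

doc = to_document(sample)
print(doc["source"], doc["language"], len(doc["page_content"]))
```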

Integration with the AI ecosystem

Thanks to the native Apify platform integrations, Website Content Crawler can seamlessly connect with various third-party systems and tools.

LangChain integration

LangChain is the most popular framework for developing applications powered by language models. It provides an integration for Apify, so you can feed Actor results directly to LangChain’s vector databases, enabling you to easily create ChatGPT-like query interfaces to websites with documentation, knowledge base, blog, etc.

Python example

First, install LangChain with common LLMs and Apify API client for Python:

pip install langchain[llms] apify-client

And then create a ChatGPT-powered answering machine:

from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
import os

# Set up your Apify API token and OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the Website Content Crawler on a website, wait for it to finish, and save
# its results into a LangChain document loader:
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# Initialize the vector database with the text documents:
index = VectorstoreIndexCreator().from_loaders([loader])

# Finally, query the vector database:
query = "What is Apify?"
result = index.query_with_sources(query)

print(result["answer"])
print(result["sources"])

The query produces an answer like this:

Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.

https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples

For details and Jupyter notebook, see Apify integration for LangChain.

Node.js example

See detailed example in LangChain for JavaScript.

LlamaIndex integration

LlamaIndex is a Python library that provides a central interface to connect LLMs with external data. The Apify integration makes it easy to feed LlamaIndex applications with data crawled from the web:

from llama_index import download_loader
from llama_index.readers.schema.base import Document

# Converts a single record from the Apify dataset to the LlamaIndex format
def transform_dataset_item(item):
    return Document(
        item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )

ApifyDataset = download_loader("ApifyDataset")

reader = ApifyDataset("<Your Apify API token>")
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=transform_dataset_item,
)

Pinecone integration

Pinecone is the most popular commercial vector database. Using the Pinecone integration Actor, you can easily feed the results of Website Content Crawler directly into a Pinecone database. Just set up the Pinecone integration Actor to run after Website Content Crawler succeeds.

How much does it cost?

Website Content Crawler is free to use—you only pay for the Apify platform usage consumed by the Actor. The exact price depends on the crawler type and settings, website complexity, network speed, and random circumstances.

The main cost driver of Website Content Crawler is compute power, measured in Actor compute units (CU): 1 CU corresponds to an Actor with 1 GB of memory running for 1 hour. At the baseline price of $0.25/CU, our tests show that Actor usage costs approximately:

  • $0.50 to $5 per 1,000 web pages with a headless browser, depending on the website
  • $0.20 per 1,000 web pages with the raw HTTP crawler
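The compute unit definition above translates directly into arithmetic: CUs equal memory in GB multiplied by run time in hours. The short sketch below works through one case. The 4 GB and 30 minute figures are hypothetical inputs chosen for illustration, not measured values, and the function names are made up for the example.

```python
PRICE_PER_CU_USD = 0.25  # baseline price quoted above


def compute_units(memory_gb, hours):
    """1 CU = 1 GB of memory running for 1 hour."""
    return memory_gb * hours


def run_cost_usd(memory_gb, hours, price_per_cu=PRICE_PER_CU_USD):
    """Compute cost of a run, ignoring proxy and storage usage."""
    return compute_units(memory_gb, hours) * price_per_cu


# Hypothetical run: 4 GB of memory for 30 minutes.
print(compute_units(4, 0.5))  # 2.0 CU
print(run_cost_usd(4, 0.5))   # 0.5 USD
```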

Note that Apify's free plan gives you $5 free credits every month and access to Apify Proxy, which is sufficient for testing and low-volume use cases.

Troubleshooting

  • If the extracted text doesn’t contain the expected page content, try to select another Crawler type. Generally, a headless browser will extract more text as it loads dynamic page content and is less likely to be blocked.
  • If the extracted text has more than expected page content (e.g. navigation or footer), try to select another HTML transformer, or use the Remove HTML elements setting to skip unwanted parts of the page.
  • If the crawler is too slow, try increasing the Actor memory and/or the Initial concurrency setting. Note that if you set the concurrency too high, the Actor will run out of memory and crash, or potentially overload the target site.
  • If the target website is blocking the crawler, make sure to use the Stealthy web browser (Firefox+Playwright) crawler type and residential proxies.
  • The crawler automatically restarts on crash, and continues where it left off. But if it crashes more than 3 times per minute, the system fails the Actor run.

Help & support

Website Content Crawler is under active development. If you have any feedback or feature ideas, please get in touch at ai@apify.com or submit an issue.

Web scraping is generally legal if you scrape publicly available non-personal data. What you do with the data is another question. Documentation, help articles, or blogs are typically protected by copyright, so you can't republish the content without the owner's permission.

Learn more about the legality of web scraping in this blog post. If you're not sure, please seek professional legal advice.