Website Content Crawler

  • apify/website-content-crawler
  • Users 1.6k
  • Runs 38.1k
  • Created by Apify

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Website Content Crawler is an Apify Actor that can perform a deep crawl of one or more websites to extract their content, such as documentation, knowledge bases, help articles, blog posts, or any other text content.

The actor was specifically designed to extract data for feeding, fine-tuning, or training large language models (LLMs) such as GPT-4, ChatGPT or LLaMA, and other AI models. It automatically removes headers, footers, menus, ads, and other noise from the web pages in order to return only the text content that can be directly fed to the models.

The actor has a simple input configuration so that it can be easily integrated into customer-facing products, where customers only need to enter the URL of the website that they want to have indexed by LLMs. The actor scales gracefully and can be used for small sites as well as sites with millions of pages. You can retrieve the results through the API in formats such as JSON or CSV, which can be fed directly to your LLM, vector database, or ChatGPT.
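As a rough illustration of retrieving results in different formats, the sketch below builds a download URL for a dataset. It assumes the Apify API's dataset items endpoint and its `format` query parameter; the helper name `dataset_items_url` and the dataset ID `abc123` are hypothetical, and authentication (needed for non-public datasets) is left out.

```python
from urllib.parse import urlencode

# Base URL of the Apify API (assumption: version 2 of the public API).
API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build a URL that downloads the items of a dataset in the given format."""
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# "abc123" is a made-up dataset ID used only for illustration.
print(dataset_items_url("abc123", "csv"))
```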

How does it work?

Website Content Crawler only needs one or more start URLs, typically the top-level URL of the documentation site, blog, or knowledge base that you want to scrape. The actor crawls the start URLs, finds links to other pages, and recursively crawls those pages too, as long as their URL is under the start URL.

For example, if you enter the start URL https://example.com/blog/, the actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
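The scoping rule above can be sketched as a simple prefix check. This is only an illustration of what "under the start URL" might mean, assuming a same-host comparison plus a path-prefix match; the function name `is_under_start_url` is hypothetical and the actor's real matching logic may differ.

```python
from urllib.parse import urlsplit


def is_under_start_url(start_url: str, candidate_url: str) -> bool:
    """Return True if the candidate shares the start URL's scheme and host
    and its path begins with the start URL's path (assumed rule)."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    same_origin = (start.scheme, start.netloc) == (candidate.scheme, candidate.netloc)
    return same_origin and candidate.path.startswith(start.path)


start = "https://example.com/blog/"
for url in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    # Prints each URL followed by whether it would be in scope.
    print(url, is_under_start_url(start, url))
```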

The actor also extracts important metadata about the content, such as author, language, publishing date, etc. It can also save the full HTML and screenshots of the pages, which is useful for debugging.

The actor automatically skips duplicate pages identified by the same canonical URL; those pages are loaded and counted towards the Max pages limit, but not saved to the results.
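The interaction between duplicate detection and the page limit can be illustrated with a small simulation. This is a sketch under stated assumptions, not the actor's code: the function `crawl_with_dedup` is hypothetical, and falling back to the loaded URL when a page declares no canonical URL is an assumption made for the example.

```python
def crawl_with_dedup(pages, max_pages):
    """Simulate the rule: duplicates are loaded and counted, but not saved.

    pages: iterable of (loaded_url, canonical_url) pairs in crawl order.
    Returns the loaded URLs that would end up in the results.
    """
    seen_canonicals = set()
    saved = []
    loaded_count = 0
    for loaded_url, canonical_url in pages:
        if loaded_count >= max_pages:
            break
        loaded_count += 1  # duplicates still consume the page budget
        key = canonical_url or loaded_url  # assumed fallback when no canonical URL exists
        if key in seen_canonicals:
            continue  # loaded and counted, but not saved
        seen_canonicals.add(key)
        saved.append(loaded_url)
    return saved


pages = [
    ("https://example.com/blog/a", "https://example.com/blog/a"),
    ("https://example.com/blog/a?utm=x", "https://example.com/blog/a"),
    ("https://example.com/blog/b", "https://example.com/blog/b"),
]
print(crawl_with_dedup(pages, max_pages=2))  # the duplicate uses up the second slot
print(crawl_with_dedup(pages, max_pages=3))
```

With a limit of 2, only the first page is saved, because the duplicate counts as the second loaded page.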

Website Content Crawler can be further configured for optimal performance. For example, you can select the crawler type:

  • Headless web browser (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower.
  • Stealthy web browser - Headless web browser with anti-blocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use it with Apify proxy servers.
  • Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.
  • Raw HTTP client with JS execution (JSDOM) (experimental) - A compromise between a browser and raw HTTP crawlers. Good performance and should work on almost all websites including those with dynamic content. However, it is still experimental and might sometimes crash so we don't recommend it in production settings yet.

You can also set additional input parameters such as a maximum number of pages, maximum crawling depth, maximum concurrency, proxy configuration, timeout, etc. to control the behavior and performance of the actor.
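A run with a few of these parameters might look like the sketch below, which uses the Apify API client for Python. Only `startUrls` is confirmed elsewhere in this README; the other field names (`maxCrawlPages`, `maxCrawlDepth`, `maxConcurrency`, `proxyConfiguration`) and their values are assumptions for illustration and should be checked against the actor's input schema.

```python
import os


def build_run_input(start_url: str) -> dict:
    """Build an illustrative run input.

    Only "startUrls" appears in this README; the remaining field names
    are assumptions to verify against the actor's input schema.
    """
    return {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": 100,  # assumed name for the maximum number of pages
        "maxCrawlDepth": 5,  # assumed name for the maximum crawling depth
        "maxConcurrency": 10,  # assumed name for the maximum concurrency
        "proxyConfiguration": {"useApifyProxy": True},  # assumed proxy settings shape
    }


def run_crawler(start_url: str) -> list:
    """Start the actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_API_TOKEN"])
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(start_url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# The crawl only starts when an API token is available in the environment.
if __name__ == "__main__" and os.environ.get("APIFY_API_TOKEN"):
    for item in run_crawler("https://docs.apify.com/"):
        print(item["url"])
```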

Designed for generative AI and LLMs

The results of the Website Content Crawler can help you feed, fine-tune or train your large language models (LLMs) or provide context for prompts for ChatGPT. In return, the model will answer questions based on your or your customer's websites and content.

Custom chatbots for customer support

Chatbots personalized on customer data such as documentation or knowledge bases are the next big thing for customer support and success teams. Let your customers simply type in the URL of their documentation or help center, and in minutes, your chatbot will have full knowledge about their product with zero integration costs.

Generate personalized content based on customer’s copy

ChatGPT and LLMs can write articles for you, but they won’t sound like you wrote them. Feed all your old blogs into your model to make it sound like you. Or train the model on your customers’ blogs and have it write in their tone of voice. Or help their technical writers with making first drafts of new documentation pages.

Summarization, translation, proofreading at scale

Got some old docs or blogs that need to be improved? Use Website Content Crawler to scrape the content, feed it to the ChatGPT API, and ask it to summarize, proofread, translate, or change the style of the content.

Example

This example shows how to scrape all pages from the Apify documentation at https://docs.apify.com/:

Input

[Image: input-screenshot.png]

See full input with description.

Output

This is how one crawled page (https://docs.apify.com/academy/web-scraping-for-beginners) looks in a browser:

[Image: page-screenshot.png]

And here is how the crawling result looks in JSON format (note that other formats like CSV or Excel are also supported). The main page content can be found in the text field, and it only contains the valuable content, without menus and other noise:

{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "crawl": {
    "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "loadedTime": "2023-04-05T16:26:51.030Z",
    "referrerUrl": "https://docs.apify.com/academy",
    "depth": 0
  },
  "metadata": {
    "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners | Apify Documentation",
    "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
    "author": null,
    "keywords": null,
    "languageCode": "en"
  },
  "screenshotUrl": null,
  "text": "Skip to main content\nOn this page\nWeb scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying this tutorial instead.\nThis course is made by Apify, the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the Apify platform course, where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\nWhy learn scraper development?​\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? 
Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\nCourse Summary​\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\nThis is what you'll learn in the Web scraping for beginners course:\nWeb scraping for beginners\nBasics of data extraction\nBasics of crawling\nBest practices\nRequirements​\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. 
A seemingly insignificant thing like using [] instead of () can make a lot of difference.\nIf you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a JavaScript course and learning about CSS Selectors.\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\nIdeally, you should have at least a moderate understanding of the following concepts:\nJavaScript + Node.js​\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and async...await), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\nasync...await (YouTube)\nJavaScript loops (MDN)\nModularity in Node.js\nGeneral web development​\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be assumed (unless we're showing something out of the ordinary).\nHTML\nHTTP protocol\nDevTools\njQuery or Cheerio​\nWe'll be using the Cheerio package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\nNext up​\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. 
So let's get to it!\nIf you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the Basics of crawling section.\nWhy learn scraper development?\nCourse Summary\nRequirements\nJavaScript + Node.js\nGeneral web development\njQuery or Cheerio\nNext up", "html": null, "markdown": " Web scraping for beginners | Apify Documentation \n\n[Skip to main content](#docusaurus_skipToContent_fallback)\n\nOn this page\n\n# Web scraping for beginners\n\n**Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.**\n\n* * *\n\nWelcome to **Web scraping for beginners**, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead.\n\nThis course is made by [Apify](https://apify.com), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed.\n\nIf you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](/academy/apify-platform), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more.\n\n## Why learn scraper development?[​](#why-learn \"Direct link to Why learn scraper development?\")\n\nWith so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? 
Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper.\n\nAny software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works.\n\nScraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technology like React or Next.js will be a piece of cake.\n\n## Course Summary[​](#summary \"Direct link to Course Summary\")\n\nWhen we set out to create the Academy, we wanted to build a complete guide to modern web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action.\n\nThis is what you'll learn in the **Web scraping for beginners** course:\n\n* [Web scraping for beginners](/academy/web-scraping-for-beginners)\n * [Basics of data extraction](/academy/web-scraping-for-beginners/data-collection)\n * [Basics of crawling](/academy/web-scraping-for-beginners/crawling)\n * [Best practices](/academy/web-scraping-for-beginners/best-practices)\n\n## Requirements[​](#requirements \"Direct link to Requirements\")\n\nYou don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. 
Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference.\n\n> If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend taking a [JavaScript course](https://www.codecademy.com/learn/introduction-to-javascript) and learning about [CSS Selectors](https://www.w3schools.com/css/css_selectors.asp).\n\nAs you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills.\n\nIdeally, you should have at least a moderate understanding of the following concepts:\n\n### JavaScript + Node.js[​](#javascript-and-node \"Direct link to JavaScript + Node.js\")\n\nIt is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. 
If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:\n\n* [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)\n* [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)\n* [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)\n\n### General web development[​](#general-web-development \"Direct link to General web development\")\n\nThroughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).\n\n* [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)\n* [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)\n* [DevTools](/academy/web-scraping-for-beginners/data-collection/browser-devtools)\n\n### jQuery or Cheerio[​](#jquery-or-cheerio \"Direct link to jQuery or Cheerio\")\n\nWe'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.\n\n## Next up[​](#next \"Direct link to Next up\")\n\nThe course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. 
So [let's get to it!](/academy/web-scraping-for-beginners/introduction)\n\n> If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](/academy/web-scraping-for-beginners/crawling) section.\n\n* [Why learn scraper development?](#why-learn)\n* [Course Summary](#summary)\n* [Requirements](#requirements)\n * [JavaScript + Node.js](#javascript-and-node)\n * [General web development](#general-web-development)\n * [jQuery or Cheerio](#jquery-or-cheerio)\n* [Next up](#next)" }
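To show how a result like the one above might be consumed, here is a minimal sketch that reduces one dataset item to the cleaned text plus a few metadata fields. The field names come from the sample output; the helper `to_document` and the shape of its return value are hypothetical, and the embedded sample is abbreviated for brevity.

```python
import json


def to_document(item: dict) -> dict:
    """Keep the cleaned text plus a few metadata fields from one crawled page."""
    metadata = item.get("metadata") or {}
    return {
        "text": item.get("text") or "",
        "source": metadata.get("canonicalUrl") or item["url"],
        "title": metadata.get("title"),
        "language": metadata.get("languageCode"),
    }


# Abbreviated stand-in for one dataset item; real items carry the full page text.
sample = json.loads("""
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "metadata": {
    "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners | Apify Documentation",
    "languageCode": "en"
  },
  "text": "Web scraping for beginners"
}
""")
print(to_document(sample))
```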

LangChain integration

LangChain is the most popular framework for developing applications powered by language models. It provides an integration for Apify, so that you can feed Actor results directly to LangChain’s vector databases, enabling you to easily create ChatGPT-like query interfaces to websites with documentation, knowledge base, blog, etc.

Python example

First, install LangChain with common LLMs and Apify API client for Python:

pip install langchain[llms] apify-client

And then create a ChatGPT-powered answering machine:

from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper
import os

# Set up your Apify API token and OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the Website Content Crawler on a website, wait for it to finish, and save
# its results into a LangChain document loader:
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

# Initialize the vector database with the text documents:
index = VectorstoreIndexCreator().from_loaders([loader])

# Finally, query the vector database:
query = "What is Apify?"
result = index.query_with_sources(query)

print(result["answer"])
print(result["sources"])

The query produces an answer like this:

Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.

https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples

For details and a Jupyter notebook, see Apify integration for LangChain.

Node.js example

See the detailed example in LangChain for JavaScript.

LlamaIndex integration

LlamaIndex is a "project that provides a central interface to connect your LLM’s with external data". The Apify integration makes it easy to feed LlamaIndex applications with data crawled from the web:

from llama_index import download_loader
from llama_index.readers.schema.base import Document

# Converts a single record from the Apify dataset to the LlamaIndex format
def transform_dataset_item(item):
    return Document(
        item.get("text"),
        extra_info={
            "url": item.get("url"),
        },
    )

ApifyDataset = download_loader("ApifyDataset")

# Load the dataset items and convert each one to a LlamaIndex document
reader = ApifyDataset("<Your Apify API token>")
documents = reader.load_data(
    dataset_id="<Apify Dataset ID>",
    dataset_mapping_function=transform_dataset_item,
)

How much does Website Content Crawler cost?

You pay only for the Apify platform usage required by the Actor to crawl the websites and extract the content. The exact price depends on the crawler type and settings, website complexity, network speed, and random circumstances.

The main cost driver of Website Content Crawler is the actor compute units (CU), where 1 CU corresponds to an actor with 1 GB of memory running for 1 hour. With the baseline price of $0.25/CU, from our tests, the actor usage costs approximately:

  • $0.5 - $5 per 1,000 web pages with a headless browser, depending on the website
  • $0.2 per 1,000 web pages with a raw HTTP crawler

Note that the Apify Free plan gives you $5 free credits every month and access to Apify Proxy, which is sufficient for testing and low-volume use cases.

Troubleshooting

  • If the extracted text doesn’t contain the expected page content, try selecting another Crawler type. Generally, a headless browser will extract more text as it loads dynamic page content and is less likely to be blocked.
  • If the extracted text contains more than the expected page content (e.g. navigation or footer), try selecting another HTML transformer, or use the Remove HTML elements setting to skip unwanted parts of the page.
  • If the crawler is too slow, try increasing the Actor memory and/or the Initial concurrency setting. Note that if you set the concurrency too high, the Actor will run out of memory and crash, or potentially overload the target site.
  • The crawler automatically restarts on crash, and continues where it left off. But if it crashes more than 3 times per minute, the system fails the Actor run.

Help & Support

Website Content Crawler is under active development. If you have any feedback or feature ideas, please get in touch at ai@apify.com or submit an issue.

Web scraping is generally legal if you scrape publicly available non-personal data. What you do with the data is another question. Documentation, help articles, or blogs are typically protected by copyright, so you can't republish the content without the owner's permission.

Learn more about the legality of web scraping in this blog post.

Changelog

Available on the Changelog tab.