Pricing

Pay per usage

Try for free

Go to Apify Store

Website Content Crawler

Try for free

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.3 (43)

Pricing

Pay per usage

Issues response

7.3 days

Last modified

13 days ago

Developer tools

Start URLs

startUrlsarrayRequired

One or more URLs of pages where the crawler will start.

By default, the Actor will also crawl sub-pages of these URLs. For example, for start URL https://example.com/blog, it will crawl also https://example.com/blog/post or https://example.com/blog/article. The Include URLs (globs) option overrides this automation behavior.

Load URLs from Sitemaps

useSitemapsbooleanOptional

If enabled, the crawler will look for Sitemaps at the domains of the provided Start URLs and enqueue matching URLs similarly as the links found on crawled pages. You can also reference a sitemap.xml file directly by adding it as another Start URL (e.g. https://www.example.com/sitemap.xml)

This feature makes the crawling more robust on websites that support Sitemaps, as it includes pages that might be not reachable from Start URLs. However, loading and processing Sitemaps can take a lot of time, especially for large sites. Note that if a page is found via Sitemaps, it will have depth of 1.

Default value of this property is false

Respect the robots.txt file

respectRobotsTxtFilebooleanOptional

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

Default value of this property is false

Crawler type

crawlerTypeEnumOptional

Select the crawling engine:

Headless web browser - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.
Stealthy web browser (default) - Another headless web browser with anti-blocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use with Apify Proxy residential IPs.
Adaptive switching between Chrome and raw HTTP client - The crawler automatically switches between raw HTTP for static pages and Chrome browser (via Playwright) for dynamic pages, to get the maximum performance wherever possible.
Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.

Beware that with the raw HTTP client or adaptive crawling mode, some features are not available, e.g. wait for dynamic content, maximum scroll height, or remove cookie warnings.

Value options:

"playwright:adaptive": string"playwright:firefox": string"cheerio": string"jsdom": string"playwright:chrome": string

Default value of this property is "playwright:firefox"

Include URLs (globs)

includeUrlGlobsarrayOptional

Glob patterns matching URLs of pages that will be included in crawling.

Setting this option will disable the default Start URLs based scoping and will allow you to customize the crawling scope yourself. Note that this affects only links found on pages, but not Start URLs - if you want to crawl a page, make sure to specify its URL in the Start URLs field.

For example https://{store,docs}.example.com/** lets the crawler to access all URLs starting with https://store.example.com/ or https://docs.example.com/, and https://example.com/**/*\?*foo=* allows the crawler to access all URLs that contain foo query parameter with any value.

Learn more about globs and test them here.

Default value of this property is []

Exclude URLs (globs)

excludeUrlGlobsarrayOptional

Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.

For example https://{store,docs}.example.com/** excludes all URLs starting with https://store.example.com/ or https://docs.example.com/, and https://example.com/**/*\?*foo=* excludes all URLs that contain foo query parameter with any value.

Learn more about globs and test them here.

Default value of this property is []

URL #fragments identify unique pages

keepUrlFragmentsbooleanOptional

Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such a case, this option should be enabled.

Default value of this property is false

Ignore canonical URLs

ignoreCanonicalUrlbooleanOptional

If enabled, the Actor will ignore the canonical URL reported by the page, and use the actual URL instead. You can use this feature for websites that report invalid canonical URLs, which causes the Actor to skip those pages in results.

Default value of this property is false

Ignore HTTPS errors

ignoreHttpsErrorsbooleanOptional

If enabled, the scraper will ignore HTTPS certificate errors. Use at your own risk.

Default value of this property is false

Max crawling depth

maxCrawlDepthintegerOptional

The maximum number of links starting from the start URL that the crawler will recursively follow. The start URLs have depth 0, the pages linked directly from the start URLs have depth 1, and so on.

This setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl the Start URLs.

Default value of this property is 20

Max pages

maxCrawlPagesintegerOptional

The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.

Default value of this property is 9999999

Initial concurrency

initialConcurrencyintegerOptional

The initial number of web browsers or HTTP clients running in parallel. The system scales the concurrency up and down based on the current CPU and memory load. If the value is set to 0 (default), the Actor uses the default setting for the specific crawler type.

Note that if you set this value too high, the Actor will run out of memory and crash. If too low, it will be slow at start before it scales the concurrency up.

Default value of this property is 0

Max concurrency

maxConcurrencyintegerOptional

The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.

Default value of this property is 200

Initial cookies

initialCookiesarrayOptional

Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with name and value properties. For example: [{"name": "cookieName", "value": "cookieValue"}].

You can use the EditThisCookie browser extension to copy browser cookies in this format, and paste it here.

Default value of this property is []

Proxy configuration

proxyConfigurationobjectRequired

Enables loading the websites from IP addresses in specific geographies and to circumvent blocking.

Default value of this property is {"useApifyProxy":true}

Maximum number of session rotations

maxSessionRotationsintegerOptional

The maximum number of times the crawler will rotate the session (IP address + browser configuration) on anti-scraping measures like CAPTCHAs. If the crawler rotates the session more than this number and the page is still blocked, it will finish with an error.

Default value of this property is 10

Maximum number of retries on network / server errors

maxRequestRetriesintegerOptional

The maximum number of times the crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.

Default value of this property is 3

Request timeout

requestTimeoutSecsintegerOptional

Timeout in seconds for making the request and processing its response. Defaults to 60s.

Default value of this property is 60

Minimum file download speed

minFileDownloadSpeedKBpsintegerOptional

The minimum viable file download speed in kilobytes per seconds. If the file download speed is lower than this value for a prolonged duration, the crawler will consider the file download as failing, abort it, and retry it again (up to "Maximum number of retries" times). This is useful to avoid your crawls being stuck on slow file downloads.

Default value of this property is 128

Wait for dynamic content

dynamicContentWaitSecsintegerOptional

The maximum time in seconds to wait for dynamic page content to load. By default, it is 10 seconds. The crawler will continue processing the page either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.

When using the Wait for selector option, the crawler will wait for the selector to appear for this amount of time. If the selector doesn't appear within this period, the request will fail and will be retried.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or loads any dynamic resources. Similarly, if the value is set to 0, the crawler doesn't wait for any dynamic to load and processes the HTML as provided on load.

Default value of this property is 10

Wait for selector

waitForSelectorstringOptional

If set, the crawler will wait for the specified CSS selector to appear in the page before proceeding with the content extraction. This is useful for pages for which the default content load recognition by idle network fails. Setting this option completely disables the default behavior, and the page will be processed only if the element specified by this selector appears. If the element doesn't appear within the Wait for dynamic content timeout, the request will fail and will be retried later. The value must be a valid CSS selector as accepted by the document.querySelectorAll() function.

With the raw HTTP client, this option checks for the presence of the selector in the HTML content and throws an error if it's not found.

Default value of this property is ""

Soft wait for selector

softWaitForSelectorstringOptional

If set, the crawler will wait for the specified CSS selector to appear in the page before proceeding with the content extraction. Unlike the waitForSelector option, this option doesn't fail the request if the selector doesn't appear within the timeout (the request processing will continue).

Default value of this property is ""

Maximum scroll height

maxScrollHeightPixelsintegerOptional

The crawler will scroll down the page until all content is loaded (and network becomes idle), or until this maximum scrolling height is reached. Setting this value to 0 disables scrolling altogether.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or loads any dynamic resources.

Default value of this property is 5000

Keep HTML elements (CSS selector)

keepElementsCssSelectorstringOptional

An optional CSS selector matching HTML elements that should be preserved in the DOM. If provided, all HTML elements which are not matching the CSS selectors or their descendants are removed from the DOM. This is useful to extract only relevant page content. The value must be a valid CSS selector as accepted by the document.querySelectorAll() function.

This option runs before the HTML transformer option. If you are missing content in the output despite using this option, try disabling the HTML transformer.

Default value of this property is ""

Remove HTML elements (CSS selector)

removeElementsCssSelectorstringOptional

A CSS selector matching HTML elements that will be removed from the DOM, before converting it to text, Markdown, or saving as HTML. This is useful to skip irrelevant page content. The value must be a valid CSS selector as accepted by the document.querySelectorAll() function.

By default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline image. You can disable the removal by setting this value to some non-existent CSS selector like dummy_keep_everything.

Default value of this property is "nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]"

Remove cookie warnings

removeCookieWarningsbooleanOptional

If enabled, the Actor will try to remove cookies consent dialogs or modals, using the I don't care about cookies browser extension, to improve the accuracy of the extracted text. Note that there is a small performance penalty if this feature is enabled.

This setting is ignored when using the raw HTTP crawler type.

Default value of this property is true

Block loading of images and videos

blockMediabooleanOptional

If the flag is enabled and the Actor is using a headless browser, it will not load images, fonts, stylesheets and videos to improve performance. It will load scripts as usual - that is after all the point of using a headless browser.

Default value of this property is false

Expand iframe elements

expandIframesbooleanOptional

By default, the Actor will extract content from iframe elements. If you want to specifically skip iframe processing, disable this option. Works only for the playwright:firefox crawler type.

Default value of this property is true

Expand clickable elements

clickElementsCssSelectorstringOptional

A CSS selector matching DOM elements that will be clicked. This is useful for expanding collapsed sections, in order to capture their text content. The value must be a valid CSS selector as accepted by the document.querySelectorAll() function.

Default value of this property is "[aria-expanded=\"false\"]"

Make containers sticky

stickyContainerCssSelectorstringOptional

This is an experimental feature. A CSS selector matching DOM elements that will be prevented from deleting any of their children. This is useful in conjunction with the "Expand clickable elements" option on pages where hidden content is actually removed from the DOM (i.e., some variants of the accordion pattern). Enabling this might corrupt the extracted content, which is why it is disabled by default. It is possible to enable the feature for the whole page with the * selector, or you can target specific elements if the former has unwanted side effects.

HTML transformer

htmlTransformerEnumOptional

Specify how to transform the HTML to extract meaningful content without any extra fluff, like navigation or modals. The HTML transformation happens after removing and clicking the DOM elements.

Readable text with fallback - Uses Mozilla's Readability to extract the main content but falls back to the original HTML if the page doesn't appear to be an article. This is useful for websites with mixed content types (articles, product pages, etc.) as it preserves more content on non-article pages.
Readable text (default) - Uses Mozilla's Readability to extract the main article content, removing navigation, headers, footers, and other non-essential elements. Works best for article-rich websites and blogs.
Extractus - Uses the Extractus article extraction library, which is an alternative content extraction algorithm. May work better than Readability on certain websites, particularly news sites or blogs with specific layouts.
None - Only applies basic cleaning (removing elements specified via 'Remove HTML elements' option) without any content extraction algorithm. Best when you want to preserve most of the original HTML structure with minimal processing.

You can examine output of all transformers by enabling the debug mode.

Value options:

"readableTextIfPossible": string"readableText": string"extractus": string"none": string

Default value of this property is "readableText"

Readable text extractor character threshold

readableTextCharThresholdintegerOptional

A configuration options for the "Readable text" HTML transformer. It contains the minimum number of characters an article must have in order to be considered relevant.

Default value of this property is 100

Remove duplicate text lines

aggressivePrunebooleanOptional

This is an experimental feature. If enabled, the crawler will prune content lines that are very similar to the ones already crawled on other pages, using the Count-Min Sketch algorithm. This is useful to strip repeating content in the scraped data like menus, headers, footers, etc. In some (not very likely) cases, it might remove relevant content from some pages.

Default value of this property is false

Debug mode (stores output of all HTML transformers)

debugModebooleanOptional

If enabled, the Actor will store the output of all types of HTML transformers, including the ones that are not used by default, and it will also store the HTML to Key-value Store with a link. All this data is stored under the debug field in the resulting Dataset.

Default value of this property is false

Debug log

debugLogbooleanOptional

If enabled, the actor log will include debug messages. Beware that this can be quite verbose.

Default value of this property is false

Save HTML to dataset (deprecated)

saveHtmlbooleanOptional

If enabled, the crawler stores full transformed HTML of all pages found to the output dataset under the html field. This option has been deprecated in favor of the saveHtmlAsFile option, because the dataset records have a size of approximately 10MB and it's harder to review the HTML for debugging.

Default value of this property is false

Save HTML to key-value store

saveHtmlAsFilebooleanOptional

If enabled, the crawler stores full transformed HTML of all pages found to the default key-value store and saves links to the files as htmlUrl field in the output dataset. Storing HTML in key-value store is preferred to storing it into the dataset with the saveHtml option, because there's no size limit and it's easier for debugging as you can easily view the HTML.

Default value of this property is false

Save Markdown

saveMarkdownbooleanOptional

If enabled, the crawler converts the transformed HTML of all pages found to Markdown, and stores it under the markdown field in the output dataset.

Default value of this property is true

Save files

saveFilesbooleanOptional

If enabled, the crawler downloads files linked from the web pages, as long as their URL has one of the following file extensions: PDF, DOC, DOCX, XLS, XLSX, and CSV. Note that unlike web pages, the files are downloaded regardless if they are under Start URLs or not. The files are stored to the default key-value store, and metadata about them to the output dataset, similarly as for web pages.

Default value of this property is false

Save screenshots (headless browser only)

saveScreenshotsbooleanOptional

If enabled, the crawler stores a screenshot for each article page to the default key-value store. The link to the screenshot is stored under the screenshotUrl field in the output dataset. It is useful for debugging, but reduces performance and increases storage costs.

Note that this feature only works with the playwright:firefox crawler type.

Default value of this property is false

Max results

maxResultsintegerOptional

The maximum number of resulting web pages to store. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway. If both Max pages and Max results are defined, then the crawler will finish when the first limit is reached. Note that the crawler skips pages with the canonical URL of a page that has already been crawled, hence it might crawl more pages than there are results.

Default value of this property is 9999999

Text extractor (deprecated)

textExtractorstringOptional

Removed in favor of the htmlTransformer option. Will be removed soon.

(Adaptive crawling only) Minimum client-side content change percentage

clientSideMinChangePercentageintegerOptional

The least amount of content (as a percentage) change after the initial load required to consider the pages client-side rendered

Default value of this property is 15

(Adaptive crawling only) How often should the crawler attempt to detect page rendering type

renderingTypeDetectionPercentageintegerOptional

How often should the adaptive attempt to detect page rendering type

Default value of this property is 10

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

652

4.6

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

114

3.8

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

5.0

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

1.6K

4.7

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. Json output includes: * text: Normalized page content * links: Extracted sub-URLs

wallnut.ai

110

🔥fireScraper AI Prompt Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-prompt-Website-Content-Markdown-Scraper

fireScrape AI is an advanced web scraper built with Crawlee and Puppeteer. It crawls websites, extracts meaningful content, converts it into Markdown, then runs your custom prompt on the extracted text—ideal for generating enriched datasets, summaries or analyses for LLMs and AI pipelines

mohamed el hadi msaid

5.0

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

512

4.7

RAG Web Browser

apify/rag-web-browser

Web browser for OpenAI Assistants, RAG pipelines, or AI agents, similar to a web browser in ChatGPT. It queries Google Search, scrapes the top N pages, and returns their content as Markdown for further processing by an LLM. It can also scrape individual URLs. Supports Model Context Protocol (MCP).

Apify

5.2K

4.4

Dynamic Markdown Scraper

louisdeconinck/dynamic-markdown-scraper

Effortlessly feed LLM AIs with clean Markdown using our advanced web scraper. Seamlessly scrape dynamic, JavaScript-rendered websites while preserving original formatting. Ideal for AI training, documentation, and content migration.

Louis Deconinck

5.0

Web Scraper

apify/web-scraper

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Apify

92K

4.4