Website Content Crawler
apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
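As a quick illustration of how the Actor is driven, its run input is a plain JSON object whose field names are the input options described in the changelog below. The `build_run_input` helper is hypothetical, a minimal sketch rather than official client usage; in practice you would pass the resulting dict to the Apify API or an Apify client library as the Actor input.

```python
def build_run_input(start_urls, crawler_type="playwright:firefox",
                    save_markdown=True, use_sitemaps=True):
    """Assemble a Website Content Crawler run input (illustrative only).

    The defaults mirror the changelog: `playwright:firefox` and
    `saveMarkdown` became defaults in 0.3.13, and `useSitemaps` has been
    pre-filled to true since 0.3.48.
    """
    return {
        "startUrls": [{"url": u} for u in start_urls],
        "crawlerType": crawler_type,
        "saveMarkdown": save_markdown,
        "useSitemaps": use_sitemaps,
    }

run_input = build_run_input(["https://docs.apify.com"])
```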
0.3.57 (2024-12-13)
- Documentation:
  - Updated README with the "Introducing Website Content Crawler" video.
0.3.56 (2024-11-25)
- Input:
  - Empty `includeUrlGlobs` are now filtered out with a warning log message. To enforce the old behavior (i.e. matching everything), use `**` instead.
- Behaviour:
  - The Actor now automatically drops the request queue associated with the file download.
  - Only the first `<title>` element on the page is prepended to the exported content.
  - The crawler now uses the correct scope when all `startUrls` are sitemaps.
  - Sitemaps are now processed in a separate thread. The 30-second limit per sitemap request is now strictly enforced.
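The empty-glob filtering introduced in 0.3.56 can be sketched as follows. This is a hypothetical re-implementation, not the Actor's actual source; the assumption that glob entries may be plain strings or `{"glob": ...}` objects is mine.

```python
import logging

def filter_url_globs(globs):
    """Drop empty glob patterns with a warning, keeping the rest.

    Illustrative sketch of the 0.3.56 behaviour: an empty pattern is
    ignored instead of silently matching everything; users who want the
    old match-everything behaviour should pass "**" explicitly.
    """
    kept = []
    for glob in globs:
        pattern = glob.get("glob", "") if isinstance(glob, dict) else glob
        if not pattern.strip():
            logging.warning("Ignoring empty glob pattern: %r", glob)
        else:
            kept.append(glob)
    return kept

# An empty glob is dropped; "**" deliberately matches everything.
assert filter_url_globs(["", "**"]) == ["**"]
```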
0.3.55 (2024-11-11)
- Behaviour:
  - The sitemap timeout warning is now only logged the first time.
0.3.54 (2024-11-07)
- Input:
  - The default `removeElementsCssSelector` now removes `img` elements with `data:` URLs to prevent cluttering the text output.
- Behaviour:
  - `expandIframes` now skips broken `iframe` elements instead of failing the whole request.
  - The Actor now parses formatted (indented or newline-separated) sitemaps correctly.
  - The sitemap discovery process is now parallelized. Logging is improved to show the progress of sitemap discovery.
  - Sitemap processing now has stricter per-URL time limits to prevent indefinite hangs.
0.3.53 (2024-10-22)
- Input:
  - The new `keepElementsCssSelector` option accepts a CSS selector targeting elements to extract from the page into the output.
- Behaviour:
  - The Actor optimizes request queue writes by following the `maxCrawlPages` limit more closely.
0.3.52 (2024-10-10)
- Behavior:
  - Handle sitemap-based requests correctly. This solves the indefinite `RequestQueue` hanging-read issue.
0.3.51 (2024-10-10)
- Behavior:
  - Reverted an internal library update to mitigate the indefinite `RequestQueue` hanging-read issue.
0.3.50 (2024-10-07)
- Behavior:
  - The Actor terminates sitemap loading prematurely in case of a timeout.
  - Sitemap loading now respects the `maxRequestRetries` option.
0.3.49 (2024-09-23)
- Behavior:
  - Use the correct proxy settings when loading sitemap files.
  - Mitigate sitemap persistence issues caused by premature stopping on `maxRequests` (`ERR_STREAM_PUSH_AFTER_EOF`).
0.3.48 (2024-09-10)
- Input:
  - The `useSitemaps` option is now pre-filled to `true` to automatically enable it for new users and in API examples.
0.3.47 (2024-09-04)
- Behaviour:
  - Use Crawlee 3.11.3, which should help with the crawler getting stuck because of stale requests in the queue.
0.3.46 (2024-08-30)
- Behaviour:
  - Process Markdown in a separate worker thread so it doesn't block the main process on very large pages.
  - Sitemap loading with `useSitemaps` no longer blocks indefinitely on large sitemaps.
0.3.45 (2024-08-20)
- Input:
  - New `keepFragmentUrls` ("URL #fragments identify unique pages") input option to treat fragment URLs as separate pages.
- Behaviour:
  - Ensure the canonical URL is only taken from the main page and not from pages embedded via `<iframe>`.
  - Dataset items that were too large used to fail; the Actor now retries with a trimmed payload (the `html`, `text`, and `markdown` fields are trimmed to the first three million characters each).
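The effect of `keepFragmentUrls` on deduplication can be sketched with the standard library's URL utilities. This is an illustrative model of the documented behaviour, not the Actor's actual implementation:

```python
from urllib.parse import urldefrag

def normalize_url(url, keep_fragment_urls=False):
    """Strip the #fragment before comparing URLs, unless the
    keepFragmentUrls-style option treats fragments as distinct pages."""
    return url if keep_fragment_urls else urldefrag(url)[0]

# Default behaviour: the two URLs collapse to one page.
assert normalize_url("https://example.com/docs#intro") == "https://example.com/docs"
# With the option on, the fragment URL stays distinct.
assert normalize_url("https://example.com/docs#intro", True) != "https://example.com/docs"
```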
0.3.44 (2024-07-30)
- Behaviour:
  - The `waitForSelector` option allows users to specify a CSS selector to wait for before extracting the page content. This is useful for pages that load content dynamically and break the automatic waiting mechanism.
0.3.43 (2024-07-24)
- Behaviour:
  - Change the shadow DOM expansion logic to handle edge cases better.
0.3.42 (2024-07-12)
- Behaviour:
  - Fix edge cases in the improved `startUrls` sanitization.
0.3.41 (2024-07-11)
- Behaviour:
  - Better sanitization of input URLs to prevent issues with the `startUrls` input.
0.3.40 (2024-07-10)
- Input:
  - New `expandIframes` (Expand iframe elements) option for extracting content from on-page `iframe` elements. Available only in `playwright:firefox`.
0.3.39 (2024-06-28)
- Behaviour:
  - Mitigate excessive request queue writes in some cases.
0.3.38 (2024-06-25)
- Behaviour:
  - The `saveScreenshots` option now correctly prints warnings with crawler types that don't support screenshots.
  - The screenshot key-value store key now contains the website hostname and a hash of the original URL to avoid collisions.
0.3.37 (2024-06-17)
- Behaviour:
  - The Actor now respects the advanced request configuration passed through the Start URLs input.
0.3.36 (2024-06-10)
- Input:
  - The new `saveHtmlAsFile` option stores the HTML in a key-value store and replaces the HTML values in the dataset with links, keeping dataset items smaller (there is a hard limit on their size).
  - Deprecated `saveHtml` in favor of `saveHtmlAsFile`.
- Output:
  - The new `saveHtmlAsFile` option saves the URL under a new `htmlUrl` key in the dataset.
- Behaviour:
  - HTML processors don't block the main thread and can safely time out.
  - Better fallback logic for the HTML processing pipeline.
  - When pushing to the dataset, the Actor now detects too-large payloads and skips retrying (while suggesting the new `saveHtmlAsFile` option as a workaround).
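The `saveHtmlAsFile` idea, offloading the bulky HTML to a key-value store and keeping only a link in the dataset item, can be sketched like this. The key format, the `kvs://` link scheme, and the dict-like store are my assumptions for illustration; the Actor's real storage layout differs.

```python
def prepare_dataset_item(item, kv_store, save_html_as_file=False):
    """Offload raw HTML to a key-value store, leaving a link behind.

    Illustrative sketch of the 0.3.36 saveHtmlAsFile behaviour:
    `kv_store` is any dict-like store, and the link format is made up.
    """
    if save_html_as_file and "html" in item:
        key = f"html-{abs(hash(item['url']))}"   # hypothetical key scheme
        kv_store[key] = item.pop("html")
        item["htmlUrl"] = f"kvs://{key}"         # hypothetical link format
    return item

store = {}
item = prepare_dataset_item(
    {"url": "https://example.com", "html": "<html>...</html>", "text": "..."},
    store, save_html_as_file=True)
assert "html" not in item and "htmlUrl" in item
```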
0.3.35 (2024-05-23)
- Behaviour:
  - Fixed a `RequestQueue` race condition.
- Output:
  - The "Readable text" extractor now correctly handles article titles.
0.3.34 (2024-05-17)
- Behaviour:
  - If any of the Start URLs leads to a sitemap file, it is processed and the links are enqueued.
  - Performance / QoL improvements (see the Crawlee 3.10.0 changelog for more details).
0.3.33 (2024-04-22)
- Input:
  - `AdaptiveCrawler` is the new default (prefill) crawler type.
- Behaviour:
  - The use of Chrome and Chromium browsers was deprecated. The Actor now uses only Firefox internally for the browser-based crawlers.
  - Reimplemented the file download feature for better stability and performance.
  - On smaller websites, the `AdaptiveCrawler` skips adaptive scanning to speed up the crawl.
0.3.32 (2024-03-28)
- Output:
  - The Actor now stores `metadata.headers` with the HTTP response headers of the crawled page.
0.3.31 (2024-03-14)
- Input:
  - Invalid `startUrls` are now filtered out and no longer cause the Actor to fail.
0.3.30 (2024-02-24)
- Behavior:
  - File download now respects the `excludeGlobs` and `includeGlobs` input options, stores filenames, and understands `Content-Disposition` HTTP headers (i.e. "forced download").
- Output:
  - Better `og:` meta tag coverage (`article:`, `movie:`, etc.).
  - Stores JSON-LD meta tags in the `metadata` field.
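Extracting a filename from a `Content-Disposition` header, the "forced download" case mentioned in 0.3.30, can be done with the standard library's email parser. This is a sketch of the idea, not the Actor's actual code:

```python
from email.message import EmailMessage

def filename_from_content_disposition(header_value):
    """Parse the filename parameter out of a Content-Disposition header.

    Uses the stdlib email machinery, which already knows the header's
    parameter syntax. Returns None when no filename is present.
    """
    msg = EmailMessage()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()

assert filename_from_content_disposition(
    'attachment; filename="report.pdf"') == "report.pdf"
assert filename_from_content_disposition("inline") is None
```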
0.3.29 (2024-02-05)
- Input:
  - New `useSitemaps` toggle for sitemap discovery. It leads to more consistent results and also scrapes otherwise unreachable webpages.
  - Do not fail on empty globs in input (ignore them instead).
  - Experimental `playwright:adaptive` crawling mode.
- Output:
  - Added a `metadata.openGraph` output field for the contents of `og:*` meta tags.
0.3.27 (2024-01-25)
- Input:
  - New `maxRequestRetries` input option for limiting the number of request retries on network, server, or parsing errors.
- Behavior:
  - Allow large lists of start URLs with deep crawling (`maxCrawlDepth > 0`), as the memory overflow issue from `0.3.18` is now fixed.
0.3.26 (2024-01-02)
- Output:
  - The Actor now stores `metadata.mimeType` for downloaded files (only applicable when `saveFiles` is enabled).
0.3.25 (2023-12-21)
- Input:
  - Add the `maxSessionRotations` input option for limiting the number of session rotations when the crawler is recognized as a bot.
- Behavior:
  - Fail on `401`, `403`, and `429` HTTP status codes.
0.3.24 (2023-12-07)
- Behavior:
  - Fix a bug in the `expandClickableElements` utility function.
0.3.23 (2023-12-06)
- Behavior:
  - Respect an empty `<body>` tag when extracting text from HTML.
  - Fix a bug with the `simplifiedBody === null` exception.
0.3.22 (2023-12-04)
- Input:
  - New `ignoreCanonicalUrl` toggle to deduplicate pages based on their actual URLs (useful when two different pages share the same canonical URL).
- Behavior:
  - Improve the large-content detection; this fixes a regression from `0.3.21`.
0.3.21 (2023-11-29)
- Output:
  - The `debug` mode now stores the results of all the extractors (plus the raw HTML) as key-value store objects.
  - The new "Readable text with fallback" extractor checks the results of the "Readable text" extractor and verifies the content integrity on the fly.
- Behavior:
  - Skip the text-cleaning and Markdown processing steps on large responses to avoid indefinite hangs.
0.3.20 (2023-11-08)
- Output:
  - The `debug` mode now stores the raw page HTML (without the page preprocessing) under the `rawHtml` key.
0.3.19 (2023-10-18)
- Input:
  - Add a default for the `proxyConfiguration` option (required since 0.3.18). This fixes Actor usage via the API, falling back to the default proxy settings when they are not explicitly provided.
0.3.18 (2023-10-17)
- Input:
  - Adds the `includeUrlGlobs` option to allow explicit control over the enqueuing logic (overrides the default scoping logic).
  - Adds the `requestTimeoutSecs` option to allow overriding the default request processing timeout.
- Behavior:
  - Disallow using a large list of start URLs (more than 100) with deep crawling (`maxCrawlDepth > 0`), as it can lead to memory overflow.
0.3.17 (2023-10-05)
- Input:
  - Adds the `debugLog` option to enable debug logging.
0.3.16 (2023-09-06)
- Behavior:
  - The raw HTTP client (Cheerio) now works correctly with proxies again.
0.3.15 (2023-08-30)
- Input:
  - `startUrls` is now a required input field.
  - Input tooltips now provide a more detailed description of crawler types and other input options.
0.3.14 (2023-07-19)
- Behavior:
  - When using the Cheerio-based crawlers, the Actor now processes links from removed elements correctly.
  - Crawlers now follow links in `<link>` tags (`rel=next, prev, help, search`).
  - Relative canonical URLs are now correctly resolved during the deduplication phase.
  - The Actor now automatically recognizes blocked websites and retries the crawl with a new proxy/fingerprint combination.
0.3.13 (2023-06-30)
- Input:
  - The Actor now uses new input defaults for a better user experience:
    - the default `crawlerType` is now `playwright:firefox`
    - `saveMarkdown` is `true`
    - `removeCookieWarnings` is `true`
0.3.12 (2023-06-14)
- Input:
  - Add `excludeUrlGlobs` for skipping certain URLs when enqueuing.
  - Add `maxScrollHeightPixels` for scrolling down the page (useful for dynamically loaded pages).
- Behavior:
  - By default, the crawler scrolls down on every page to trigger dynamic loading (disable this by setting `maxScrollHeightPixels` to 0).
  - The crawler now handles HTML processing errors gracefully.
  - The Actor now stays alive and restarts the crawl on certain known errors (Playwright assertion error).
- Output:
  - `crawl.httpStatusCode` now contains the HTTP response status code.
0.3.10 (2023-06-05)
- Input:
  - Move `processedHtml` under the `debug` object.
  - Add the `removeCookieWarnings` option to automatically remove cookie consent modals (via the "I don't care about cookies" browser extension).
- Behavior:
  - Consider the redirected start URL for prefix matching.
  - Make URL deduplication case-insensitive.
  - Wait at least 3 seconds in Playwright to let dynamic content load.
  - Retry the start URLs 10 times and regular URLs 5 times (to get around issues with retries on burnt proxies).
  - Ignore links not starting with `http`.
  - Skip parsing non-HTML files.
  - Support `startUrls` as a text file.
0.3.9 (2023-05-18)
- Input:
  - Updated README and input hints.
0.3.8 (2023-05-17)
- Input:
  - New `initialCookies` option for passing cookies to the crawler. Provide the cookies as a JSON array of objects with `"name"` and `"value"` keys. Example: `[{ "name": "token", "value": "123456" }]`.
- Behavior:
  - The `textExtractor` option has been removed in favour of `htmlTransformer`.
  - The `unfluff` extractor has been completely removed.
  - HTML is always simplified by removing some elements from it. These are configurable via the `removeElementsCssSelector` option, which now defaults to a larger set of elements, including `<nav>`, `<footer>`, `<svg>`, and elements with the `role` attribute set to one of `alert`, `banner`, `dialog`, `alertdialog`.
  - A new `htmlTransformer` option has been introduced, which allows configuring how the simplified HTML is further processed. Its output is still HTML, which can later be used to generate Markdown or plain text.
  - The crawler will now try to expand collapsible sections automatically. This works by clicking elements with the `aria-expanded="false"` attribute. You can configure this selector via the `clickElementsCssSelector` option.
  - When using Playwright-based crawlers, dynamic content is awaited based on network activity rather than webpage changes. This should improve the reliability of the crawler.
  - The Firefox `SEC_ERROR_UNKNOWN_ISSUER` error has been solved by preloading the recognized intermediate TLS certificates into the Docker image.
  - Crawled URLs are retried in case their processing times out.
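The `initialCookies` format documented above can be passed along inside the run input; this sketch just shows the documented JSON shape round-tripping through serialization. The second cookie and the surrounding run-input fields are hypothetical examples.

```python
import json

# Cookies in the documented shape: a JSON array of name/value objects.
initial_cookies = [
    {"name": "token", "value": "123456"},
    {"name": "sessionid", "value": "abc"},  # hypothetical second cookie
]

run_input = {
    "startUrls": [{"url": "https://example.com"}],
    "initialCookies": initial_cookies,
}

# The input must serialize to plain JSON to be sent to the Actor.
payload = json.dumps(run_input)
assert json.loads(payload)["initialCookies"][0]["name"] == "token"
```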
0.3.7 (2023-05-10)
- Behavior:
  - URLs with redirects are now enqueued based on the original (unredirected) URL. This should prevent the Actor from skipping relevant pages hidden behind redirects.
  - The Actor now considers all start URLs when enqueuing new links. This way, the user can specify multiple start URLs as a workaround for the Actor skipping some relevant pages on the website.
  - Fixed an error that prevented enqueuing URLs with certain query parameters.
- Output:
  - The `.url` field now contains the main resource URL without the fragment (`#`) part.
0.3.6 (2023-05-04)
- Input:
  - Made the `initialConcurrency` option visible in the input editor.
  - Added the `aggressivePruning` option. With this option set to `true`, the crawler will try to deduplicate the scraped content. This can be useful when crawling a website with a lot of duplicate content (header menus, footers, etc.).
- Behavior:
  - The Actor now stays alive and restarts the crawl on certain known errors (Playwright assertion error).
0.3.4 (2023-05-04)
- Input:
  - Added a new hidden option, `initialConcurrency`. It sets the initial number of web browsers or HTTP clients running in parallel during the Actor run. Increasing this number can speed up crawling. Bear in mind this option is hidden and can be changed only by editing the Actor input in the JSON editor.
0.3.3 (2023-04-28)
- Input:
  - Added a new option, `maxResults`, to limit the total number of results. If used with `maxCrawlPages`, the crawler stops when either of the limits is reached.
0.3.1 (2023-04-24)
- Input:
  - Added an option to download document files linked from the page: `saveFiles`. This is useful for downloading PDF, DOCX, XLSX, and similar files from the crawled pages. The files are saved to the default key-value store of the run, and links to them are added to the dataset.
  - Added a new crawler, "Stealthy web browser", which uses a Firefox browser with a stealthy profile. It is useful for crawling websites that block scraping.
0.0.13 (2023-04-18)
- Input:
  - Added a new `textExtractor` option, `readableText`. It is generally very accurate and has a good ratio of coverage to noise. It extracts only the main article body (similar to `unfluff`) but can work for more complex pages.
  - Added the `readableTextCharThreshold` option. It only applies to the `readableText` extractor and allows fine-tuning which part of the text should be focused on. That only matters for very complex pages where it is not obvious what should be extracted.
- Output:
  - Added a simplified output view, `Overview`, which has only `url` and `text` for a quick output check.
- Behavior:
  - Domains starting with `www.` are now considered equal to those without it. This means that the start URL `https://apify.com` can enqueue `https://www.apify.com` and vice versa.
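The `www.`-equivalence rule from 0.0.13 amounts to comparing hostnames with any leading `www.` stripped. A minimal sketch of that comparison (illustrative, not the Actor's source):

```python
from urllib.parse import urlparse

def same_site(url_a, url_b):
    """Compare two URLs' hostnames, treating `www.` hosts as equal to
    their bare counterparts, as described in the 0.0.13 entry."""
    def host(url):
        h = urlparse(url).hostname or ""
        return h[4:] if h.startswith("www.") else h
    return host(url_a) == host(url_b)

assert same_site("https://apify.com", "https://www.apify.com")
assert not same_site("https://apify.com", "https://example.com")
```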
0.0.10 (2023-04-05)
- Input:
  - Added a new `crawlerType` option, `jsdom`, for processing with JSDOM. It allows client-side script processing, trying to mimic browser behavior in Node.js but with much better performance. This is still experimental and may crash on some particular pages.
  - Added the `dynamicContentWaitSecs` option (defaults to 10 s), which is the maximum waiting time for dynamic content.
- Output (BREAKING CHANGE):
  - Renamed `crawl.date` to `crawl.loadedTime`.
  - Moved `crawl.screenshotUrl` to the top-level object.
  - The `markdown` field was made visible.
  - Renamed `metadata.language` to `metadata.languageCode`.
  - Removed `metadata.createdAt` (for now).
  - Added `metadata.keywords`.
- Behavior:
  - Added waiting for dynamically rendered content (supported in the headless browser and JSDOM crawlers). The crawler checks for content changes every half second. When there are no changes for 2 seconds, it proceeds to extraction.
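The wait-for-stable-content behaviour described above (poll every half second, proceed after 2 quiet seconds, give up after a `dynamicContentWaitSecs`-style timeout) can be sketched as a generic polling loop. This is an illustrative model; the Actor's real implementation lives in its crawler code, and the injectable `clock`/`sleep` parameters exist only to make the sketch testable.

```python
import time

def wait_for_stable_content(get_content, check_every=0.5, quiet_for=2.0,
                            timeout=10.0, clock=time.monotonic,
                            sleep=time.sleep):
    """Poll `get_content` until it stops changing for `quiet_for`
    seconds, or until `timeout` seconds have elapsed overall."""
    start = clock()
    last = get_content()
    last_change = start
    while clock() - start < timeout:
        if clock() - last_change >= quiet_for:
            return last          # content stable long enough; extract it
        sleep(check_every)
        current = get_content()
        if current != last:
            last, last_change = current, clock()
    return last                  # timeout reached; proceed with what we have
```

Injecting a fake clock makes the loop deterministic, which is how the assertions below avoid real sleeping.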
0.0.7 (2023-03-30)
- Input:
  - BREAKING CHANGE: Added the `textExtractor` input option to choose how strictly to parse the content. Swapped the previous `unfluff` for `CrawleeHtmlToText` as the default, which in general extracts more text. We chose to output more text rather than less by default.
  - Added `removeElementsCssSelector`, which allows passing extra CSS selectors to further strip down the HTML before it is converted to text. This can help with fine-tuning. By default, the Actor removes the page navigation bar, header, and footer.
- Output:
  - Added Markdown to the output if the `saveMarkdown` option is chosen.
  - All extractor outputs plus the HTML as a link can be obtained if `debugMode` is set.
  - Added `pageType` to the output (only as `debug` for now); it will be fine-tuned in the future.
- Behavior:
  - Added deduplication by `canonicalUrl`; e.g. if several different URLs point to the same canonical URL, they are skipped.
  - Skip pages that redirect outside the original start URL's domain.
  - Only run a single text extractor unless in debug mode. This improves performance.
Developer: Maintained by Apify

Actor metrics: 3.9k monthly users · 747 stars · >99% of runs succeeded · 1.8-day response time · Created in Mar 2023 · Modified 3 days ago