Website Content Scraper
Pricing
from $0.10 / 1,000 results
Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.
Developer
Muhammad Qaseem Iqbal
Maintained by Community
Website Content Scraper turns websites into clean, structured content that you can use in AI apps, search tools, knowledge bases, documentation workflows, and data exports.
Give it a website URL and it will crawl the site, remove common clutter, extract useful page text, create clean Markdown, download supported files, and prepare smaller text chunks that are easier for AI tools to search and understand.
It works well for:
- documentation sites
- help centers and knowledge bases
- blogs and article libraries
- product websites
- public portals
- authenticated pages when you provide cookies or headers
Why use Website Content Scraper?
Most websites are designed for people, not for AI systems or clean data exports. A page can include menus, banners, cookie popups, repeated footers, scripts, and links that are not useful for your final dataset.
This Actor helps by collecting the useful content and organizing it into records you can export as JSON, CSV, Excel, XML, or other Apify dataset formats.
Common use cases include:
- Build a chatbot that answers questions from your website or docs.
- Create a search index for internal or customer-facing support.
- Export documentation pages to Markdown or plain text.
- Feed website content into a vector database or AI workflow.
- Track changed, unchanged, or deleted pages across repeat crawls.
- Download and parse linked documents such as PDFs, spreadsheets, and JSON files.
Main features
- Crawl one page, one section, or a larger website.
- Extract clean text and Markdown from web pages.
- Create AI-ready chunks, which are smaller pieces of content for search and chatbot systems.
- Download and parse linked files, including PDF, DOCX, XLSX, CSV, TSV, Markdown, JSON, XML, and text files.
- Discover extra URLs from sitemaps and `llms.txt` files.
- Respect `robots.txt` by default.
- Use fast crawling for simple sites and browser crawling for JavaScript-heavy pages.
- Crawl pages behind login when you provide cookies or request headers.
- Save run summaries, skipped URL diagnostics, and sync manifests.
- Support incremental recrawls, so you can skip unchanged content in scheduled runs.
How it works
Website Content Scraper works in four simple steps:
1. Find pages
   The Actor starts from the URLs you provide. It follows links that are in scope, can read sitemaps, and can use `llms.txt` files when available.
2. Clean the page
   It removes common noise such as navigation, scripts, repeated layout content, and other page clutter where possible.
3. Extract content
   It saves the page as clean text, Markdown, and optionally cleaned HTML. It can also download and parse supported linked files.
4. Prepare results
   It writes page records, file records, and AI-ready chunks to the dataset. You can export the data or connect it to another workflow.
Quick start
For your first run, start small. You can increase the limits after you check the results.
    {
      "startUrls": [{ "url": "https://docs.apify.com/" }],
      "crawlerType": "cheerio",
      "crawlScope": "startUrlPath",
      "maxCrawlPages": 25,
      "maxResults": 25,
      "discoverSitemaps": false,
      "discoverLlmsTxt": false,
      "discoverLlmsFullTxt": false,
      "saveMarkdown": true,
      "saveText": false,
      "createChunks": false,
      "saveFiles": false,
      "parseFiles": false,
      "maxFiles": 0,
      "proxyConfiguration": { "useApifyProxy": false }
    }
For many documentation and help sites, cheerio is the best first choice because it is fast and cost-efficient. Turn on sitemap discovery, chunks, file parsing, or browser rendering only when the first small run shows that you need them.
For AI search or chatbot workflows, use the rag preset or enable createChunks. For linked PDFs, spreadsheets, or JSON files, enable saveFiles, parseFiles, and set maxFiles to a small number first.
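The quick-start input can also be assembled in code, which makes it easier to start small and switch on extras one at a time. The following is a minimal sketch: the field names come from the quick-start input above, while the helper functions (`build_first_run_input`, `with_chunks`, `with_linked_files`) are hypothetical names introduced here for illustration and are not part of the Actor.

```python
import json


def build_first_run_input(start_url: str, page_limit: int = 25) -> dict:
    """Build a small, low-cost input that mirrors the quick-start example."""
    return {
        "startUrls": [{"url": start_url}],
        "crawlerType": "cheerio",        # fast HTTP crawling, no browser
        "crawlScope": "startUrlPath",    # stay inside the starting section
        "maxCrawlPages": page_limit,
        "maxResults": page_limit,
        "discoverSitemaps": False,
        "discoverLlmsTxt": False,
        "discoverLlmsFullTxt": False,
        "saveMarkdown": True,
        "saveText": False,
        "createChunks": False,
        "saveFiles": False,
        "parseFiles": False,
        "maxFiles": 0,
        "proxyConfiguration": {"useApifyProxy": False},
    }


def with_chunks(run_input: dict) -> dict:
    """Return a copy with chunk records enabled for AI search or chatbots."""
    return {**run_input, "createChunks": True}


def with_linked_files(run_input: dict, max_files: int = 5) -> dict:
    """Return a copy that downloads and parses a small number of linked files."""
    return {**run_input, "saveFiles": True, "parseFiles": True, "maxFiles": max_files}


# Start from the cheap defaults, then opt in to chunks and a few linked files.
run_input = with_linked_files(with_chunks(build_first_run_input("https://docs.apify.com/")))
print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the Actor's input editor. Each helper returns a copy, so the cheap baseline input stays available for comparison runs.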
Example output
The dataset contains different types of records. The most important field is recordType.
Page record
A page record represents one crawled web page.
    {
      "recordType": "page",
      "url": "https://docs.example.com/getting-started",
      "title": "Getting started",
      "markdown": "# Getting started\n\nThis guide explains...",
      "text": "Getting started\n\nThis guide explains...",
      "contentQuality": {
        "confidence": 0.98,
        "wordCount": 1240,
        "isThin": false
      }
    }
Chunk record
A chunk record is a smaller piece of a page or file. These records are useful for AI search, chatbots, and retrieval workflows.
    {
      "recordType": "chunk",
      "url": "https://docs.example.com/getting-started",
      "title": "Getting started",
      "headingPath": ["Getting started", "Install"],
      "text": "Install the package and configure your project...",
      "tokenEstimate": 420
    }
File record
A file record represents a downloaded or parsed file linked from a page.
    {
      "recordType": "file",
      "url": "https://docs.example.com/api/openapi.json",
      "title": "JSON",
      "metadata": {
        "contentType": "application/json",
        "byteLength": 968704
      }
    }
Understanding the results
Use recordType to filter the dataset:
| Record type | What it means | When to use it |
|---|---|---|
| `page` | A full crawled web page | Markdown export, content review, documentation migration |
| `chunk` | A smaller text section | AI search, chatbots, vector databases, RAG workflows |
| `file` | A downloaded or parsed linked file | File archives, API specs, PDFs, spreadsheets |
| `skipped` | A URL skipped by the Actor | Debugging crawl limits or URL scope |
| `tombstone` | A previously seen item that disappeared | Incremental sync and delete handling |
Apify dataset views select useful columns, but they do not filter rows by type. For page-only, chunk-only, or file-only exports, filter by recordType.
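Because a single dataset mixes all record types, a small post-processing step is usually needed after export. The sketch below assumes the dataset has already been downloaded as a list of JSON objects; the sample records and the function name `group_by_record_type` are illustrative only.

```python
from collections import defaultdict


def group_by_record_type(items: list) -> dict:
    """Split exported dataset items into lists keyed by their recordType."""
    groups = defaultdict(list)
    for item in items:
        # Records without a recordType are kept under "unknown" instead of dropped.
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


# Hypothetical sample of an exported dataset.
items = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "file", "url": "https://docs.example.com/api/openapi.json"},
    {"recordType": "skipped", "url": "https://docs.example.com/login"},
]

groups = group_by_record_type(items)
for record_type, records in groups.items():
    print(record_type, len(records))
```

From here, `groups["page"]` can feed a Markdown export and `groups["chunk"]` a vector database, without either job seeing rows it does not understand.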
Input settings explained
| Setting | Plain-language description |
|---|---|
| `startUrls` | The page or website section where the crawl starts. |
| `crawlScope` | Controls which links are allowed. `startUrlPath` is safest for one docs section or blog section. |
| `maxCrawlPages` | Maximum number of page requests the crawler will process. |
| `maxResults` | Maximum number of page records saved to the dataset. |
| `crawlerType` | Choose fast crawling, adaptive crawling, or browser crawling. |
| `maxBrowserFallbacks` | Caps how many pages adaptive mode may retry in a browser. |
| `discoverSitemaps` | Finds more URLs from sitemap files. Leave off for the cheapest first run. |
| `discoverLlmsTxt` | Finds URLs from `llms.txt` files when a site provides them. Leave off unless you need extra discovery. |
| `discoverLlmsFullTxt` | Also reads `llms-full.txt`; keep off unless you want a larger crawl. |
| `saveMarkdown` | Saves page content in Markdown format. |
| `saveText` | Saves page content as plain text. Turn off when Markdown is enough. |
| `createChunks` | Splits content into smaller AI-friendly records. Useful for RAG, but creates more dataset rows. |
| `saveFiles` | Downloads supported linked files. Leave off unless you need file archives. |
| `parseFiles` | Extracts text from supported linked files. Leave off unless you need PDF, spreadsheet, or document text. |
| `maxFiles` | Limits how many linked files are processed. |
| `cookies` | Secret cookie string for logged-in pages. |
| `requestHeaders` | Secret custom headers for authenticated or special requests. |
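To get a feel for what a path-based scope such as `startUrlPath` is likely to include, it can help to test candidate URLs before a run. The Actor's exact matching rules are not documented here, so the sketch below rests on an assumption: a link is in scope when it is on the same host as the start URL and its path sits at or below the start URL's path. The function name `in_start_url_path` is hypothetical.

```python
from urllib.parse import urlsplit


def in_start_url_path(start_url: str, candidate_url: str) -> bool:
    """Assumed rule: same host, and the candidate path is at or below the start path."""
    start, candidate = urlsplit(start_url), urlsplit(candidate_url)
    if candidate.scheme not in ("http", "https"):
        return False
    if candidate.netloc.lower() != start.netloc.lower():
        return False
    # Compare with trailing slashes so "/guides" does not match "/guides-old".
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    path = candidate.path if candidate.path.endswith("/") else candidate.path + "/"
    return path.startswith(prefix)


start = "https://docs.example.com/guides"
for url in (
    "https://docs.example.com/guides/install",
    "https://docs.example.com/guides-old/install",
    "https://docs.example.com/blog/news",
    "https://other.example.com/guides/install",
):
    print(url, in_start_url_path(start, url))
```

If a small test crawl returns pages this check would reject, or misses pages it would accept, trust the Actor's actual output and adjust the start URL or exclusion patterns accordingly.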
Crawler types
| Crawler type | Best for |
|---|---|
| `cheerio` | Fast crawling of static pages, docs, blogs, and help centers. |
| `adaptive` | Starts fast and falls back to browser rendering when needed. |
| `playwright-firefox` | Pages that need a real browser, JavaScript, or login flows. |
| `playwright-chromium` | Browser crawling with Chromium. |
Browser crawling is more powerful, but usually slower and more expensive. Start with cheerio unless the website content does not appear in the results.
AI and chatbot use cases
This Actor is especially useful when you want AI to answer questions from website content.
Examples:
- Customer support chatbot trained on a help center.
- Internal assistant that searches company documentation.
- Product copilot that answers questions from API docs.
- Custom GPT knowledge files created from website pages.
- Vector database ingestion for tools such as Pinecone, Qdrant, Weaviate, or similar systems.
If you are not familiar with the term RAG, it simply means giving an AI model relevant information from your own content before it answers a question. The chunk records are designed for that kind of workflow.
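As an illustration of that retrieval step, the sketch below picks the chunk records most relevant to a question and places them in a prompt. It deliberately uses naive keyword overlap instead of embeddings so that it runs with the standard library alone; a real system would more likely use a vector database. The sample chunks, function names, and prompt wording are all assumptions made for this example.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question: str, chunks: list, k: int = 2) -> list:
    """Rank chunk records by how many words they share with the question."""
    question_words = tokenize(question)
    scored = []
    for chunk in chunks:
        overlap = len(question_words & tokenize(chunk["text"]))
        if overlap:
            scored.append((overlap, chunk))
    # The sort is stable, so ties keep their original dataset order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question: str, chunks: list) -> str:
    """Put the retrieved chunks in front of the question, with source URLs."""
    context = "\n\n".join(f"[{c['url']}] {c['text']}" for c in chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Hypothetical chunk records, shaped like the chunk example in this README.
chunks = [
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started",
     "text": "Install the package and configure your project."},
    {"recordType": "chunk", "url": "https://docs.example.com/billing",
     "text": "Billing is handled monthly through the dashboard."},
    {"recordType": "chunk", "url": "https://docs.example.com/contact",
     "text": "Contact support by email."},
]

question = "How do I install the package?"
print(build_prompt(question, top_chunks(question, chunks, k=1)))
```

Keeping the `url` next to each chunk lets the final answer cite the page it came from.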
Incremental crawling
If you run the Actor on a schedule, you may not want to process the same unchanged content every time.
Use incremental mode to track what changed:
    {
      "startUrls": [{ "url": "https://docs.example.com/" }],
      "incrementalMode": "readWriteState",
      "stateKey": "docs-production",
      "skipUnchanged": true,
      "emitDeletedRecords": true
    }
The Actor stores content hashes in the key-value store. On future runs, it can identify new, changed, unchanged, and deleted content.
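The general idea behind hash-based change detection can be shown in a few lines. This is a sketch of the technique, not the Actor's implementation: the choice of SHA-256, the shape of the stored state (a URL-to-hash mapping), and the function names are assumptions made for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the extracted content so that any change yields a different digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_crawl(previous_state: dict, current_pages: dict):
    """Compare this crawl with the stored hashes and return (changes, new_state)."""
    new_state = {url: content_hash(text) for url, text in current_pages.items()}
    changes = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for url, digest in new_state.items():
        if url not in previous_state:
            changes["new"].append(url)
        elif previous_state[url] != digest:
            changes["changed"].append(url)
        else:
            changes["unchanged"].append(url)
    # URLs seen last time but missing now are candidates for deletion records.
    changes["deleted"] = sorted(set(previous_state) - set(new_state))
    return changes, new_state


# First run: nothing is stored yet, so every page is new.
first_changes, state = diff_crawl({}, {"/a": "one", "/b": "two"})

# Second run: /a is untouched, /b was edited, /c appeared.
second_changes, state = diff_crawl(state, {"/a": "one", "/b": "two v2", "/c": "three"})

# Third run: /a disappeared from the site.
third_changes, state = diff_crawl(state, {"/b": "two v2", "/c": "three"})

print(first_changes)
print(second_changes)
print(third_changes)
```

In this model, unchanged URLs are the ones a scheduled run could skip, and deleted URLs correspond to the `tombstone` records described earlier.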
Authenticated websites
For private pages or customer portals, provide cookies or request headers in the input.
These fields are marked as secret inputs:
- `cookies`
- `requestHeaders`
They are not written to dataset records or logs. You can also provide loginValidationUrl to check that authentication works before the crawl continues.
How much does it cost?
The cost depends on:
- how many pages you crawl,
- how many files you download or parse,
- whether you use browser crawling,
- how much data is written to datasets and key-value stores.
Tips to control cost:
- Start with `maxCrawlPages` and `maxResults` set to 25.
- Keep `discoverLlmsFullTxt` off unless you need it.
- Keep `discoverSitemaps` and `discoverLlmsTxt` off for the first test run.
- Use `cheerio` for static sites.
- Use `createChunks` only when you need AI search or chatbot-ready records.
- Keep `saveFiles` and `parseFiles` off unless linked files matter.
- Turn off `saveHtml` and `saveScreenshots` unless you need them.
- Set `maxFiles` to a small number, such as 5 or 10, before processing many files.
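A rough lower-bound estimate can help when choosing limits. The sketch below rests on two assumptions that this page does not confirm: that every dataset row counts as one billable result, and that the listed starting price of $0.10 per 1,000 results applies to your run. It ignores browser, proxy, and storage costs entirely. The function names and the `avg_chunks_per_page` parameter are hypothetical.

```python
def estimate_dataset_rows(pages: int, avg_chunks_per_page: float = 0.0, files: int = 0) -> int:
    """Estimate dataset rows: one per page, plus chunk rows and file rows."""
    return pages + round(pages * avg_chunks_per_page) + files


def estimate_min_cost_usd(rows: int, usd_per_1000_results: float = 0.10) -> float:
    """Lower-bound cost, assuming each row is billed at the starting price."""
    return rows / 1000 * usd_per_1000_results


# A first test run: 25 pages, no chunks, no files.
small_rows = estimate_dataset_rows(pages=25)

# The same crawl with chunking on (assumed 4 chunks per page) and 5 linked files.
larger_rows = estimate_dataset_rows(pages=25, avg_chunks_per_page=4, files=5)

print(small_rows, round(estimate_min_cost_usd(small_rows), 4))
print(larger_rows, round(estimate_min_cost_usd(larger_rows), 4))
```

The point of the comparison is the ratio, not the absolute numbers: enabling chunks multiplies the number of rows, which is why the tips above suggest turning it on only when needed.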
Troubleshooting
I only got navigation or very little text
Try adaptive or a Playwright crawler. The page may need JavaScript rendering. You can also use keepElementsCssSelector to tell the Actor which part of the page to keep.
I got too many pages
Use a narrower start URL, keep crawlScope set to startUrlPath, or add patterns to excludeUrlGlobs.
I did not get enough pages
Increase maxCrawlPages, maxResults, and maxCrawlDepth. Also enable discoverSitemaps to find more URLs from sitemap files.
My files are missing
Make sure saveFiles and parseFiles are enabled, and increase maxFiles if the site links to many files.
Some pages have low confidence scores
Low scores are common for index pages, category pages, and navigation-heavy pages. For AI workflows, the detailed content pages and chunk records are usually more useful.
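To separate detailed content pages from thin ones after a run, the `contentQuality` fields shown in the page record example can be used as a filter. This is a minimal sketch: the 0.5 threshold is an arbitrary assumption to tune against your own results, and the function name and sample records are illustrative.

```python
def needs_review(page: dict, min_confidence: float = 0.5) -> bool:
    """Flag page records that are marked thin or fall below the confidence threshold."""
    quality = page.get("contentQuality") or {}
    # A record without quality data is treated as confidence 0.0 and gets flagged.
    return bool(quality.get("isThin")) or quality.get("confidence", 0.0) < min_confidence


# Hypothetical page records, shaped like the page example in this README.
pages = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started",
     "contentQuality": {"confidence": 0.98, "wordCount": 1240, "isThin": False}},
    {"recordType": "page", "url": "https://docs.example.com/",
     "contentQuality": {"confidence": 0.31, "wordCount": 85, "isThin": True}},
    {"recordType": "page", "url": "https://docs.example.com/category/guides",
     "contentQuality": {"confidence": 0.42, "wordCount": 300, "isThin": False}},
]

flagged = [p["url"] for p in pages if needs_review(p)]
kept = [p["url"] for p in pages if not needs_review(p)]
print("review:", flagged)
print("keep:", kept)
```

If most flagged URLs are index or category pages, the crawl is probably fine. If detailed content pages are flagged too, that suggests trying `adaptive` or a Playwright crawler.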
The website blocks the crawler
Try a browser crawler and configure proxies in Apify. Some sites require stronger crawling settings than simple HTTP crawling.
Limitations
- Legacy `.doc` files can be downloaded but are not text-extracted.
- Very large files may be skipped based on `fileMaxSizeMb`.
- Browser crawling is slower and may cost more than fast HTTP crawling.
- `llms.txt` and `llms-full.txt` are used for discovery, not saved as normal file records.
- Results depend on the structure and accessibility of the target website.
Best practices
- Test with a small crawl before running a large one.
- Review a few `page` records to confirm the extracted text looks right.
- Use `chunk` records for chatbot and vector database workflows.
- Use `page` records for full Markdown or text exports.
- Use `skipped` records to understand why URLs were not saved.
- Save a tested input as an Apify Task for repeat use.