Website Content Scraper

Pricing

from $0.10 / 1,000 results

Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.

Developer

Muhammad Qaseem Iqbal

Maintained by Community

Website Content Scraper turns websites into clean, structured content that you can use in AI apps, search tools, knowledge bases, documentation workflows, and data exports.

Give it a website URL and it will crawl the site, remove common clutter, extract useful page text, create clean Markdown, download supported files, and prepare smaller text chunks that are easier for AI tools to search and understand.

It works well for:

  • documentation sites
  • help centers and knowledge bases
  • blogs and article libraries
  • product websites
  • public portals
  • authenticated pages when you provide cookies or headers

Why use Website Content Scraper?

Most websites are designed for people, not for AI systems or clean data exports. A page can include menus, banners, cookie popups, repeated footers, scripts, and links that are not useful for your final dataset.

This Actor helps by collecting the useful content and organizing it into records you can export as JSON, CSV, Excel, XML, or other Apify dataset formats.

Common use cases include:

  • Build a chatbot that answers questions from your website or docs.
  • Create a search index for internal or customer-facing support.
  • Export documentation pages to Markdown or plain text.
  • Feed website content into a vector database or AI workflow.
  • Track changed, unchanged, or deleted pages across repeat crawls.
  • Download and parse linked documents such as PDFs, spreadsheets, and JSON files.

Main features

  • Crawl one page, one section, or a larger website.
  • Extract clean text and Markdown from web pages.
  • Create AI-ready chunks, which are smaller pieces of content for search and chatbot systems.
  • Download and parse linked files, including PDF, DOCX, XLSX, CSV, TSV, Markdown, JSON, XML, and text files.
  • Discover extra URLs from sitemaps and llms.txt files.
  • Respect robots.txt by default.
  • Use fast crawling for simple sites and browser crawling for JavaScript-heavy pages.
  • Crawl pages behind login when you provide cookies or request headers.
  • Save run summaries, skipped URL diagnostics, and sync manifests.
  • Support incremental recrawls, so you can skip unchanged content in scheduled runs.

How it works

Website Content Scraper works in four simple steps:

  1. Find pages

    The Actor starts from the URLs you provide. It follows links that are in scope, can read sitemaps, and can use llms.txt files when available.

  2. Clean the page

    It removes common noise such as navigation, scripts, repeated layout content, and other page clutter where possible.

  3. Extract content

    It saves the page as clean text, Markdown, and optionally cleaned HTML. It can also download and parse supported linked files.

  4. Prepare results

    It writes page records, file records, and AI-ready chunks to the dataset. You can export the data or connect it to another workflow.
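The "Find pages" step depends on deciding which links are in scope. As a minimal sketch, assuming startUrlPath means "same host, and a path under the start URL's path", a scope check could look like this. The function name and the exact rule are illustrative assumptions; the Actor's real scope logic may differ.

```python
from urllib.parse import urlsplit


def in_start_url_path(start_url: str, candidate: str) -> bool:
    """Return True if candidate is on the same host and under the start URL's path.

    Assumed interpretation of the startUrlPath scope, not the Actor's actual code.
    """
    start, cand = urlsplit(start_url), urlsplit(candidate)
    if cand.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    if cand.netloc.lower() != start.netloc.lower():
        return False  # a different host is out of scope
    # Compare as directory prefixes so "/guide" does not match "/guide-old".
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    path = cand.path if cand.path.endswith("/") else cand.path + "/"
    return path.startswith(prefix)
```

Running a check like this against a handful of links from your site is a quick way to predict whether a start URL is narrow enough before you spend a crawl on it.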

Quick start

For your first run, start small. You can increase the limits after you check the results.

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlerType": "cheerio",
  "crawlScope": "startUrlPath",
  "maxCrawlPages": 25,
  "maxResults": 25,
  "discoverSitemaps": false,
  "discoverLlmsTxt": false,
  "discoverLlmsFullTxt": false,
  "saveMarkdown": true,
  "saveText": false,
  "createChunks": false,
  "saveFiles": false,
  "parseFiles": false,
  "maxFiles": 0,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

For many documentation and help sites, cheerio is the best first choice because it is fast and cost-efficient. Turn on sitemap discovery, chunks, file parsing, or browser rendering only when the first small run shows that you need them.

For AI search or chatbot workflows, use the rag preset or enable createChunks. For linked PDFs, spreadsheets, or JSON files, enable saveFiles, parseFiles, and set maxFiles to a small number first.
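If you prefer to start runs from code, the following sketch builds the quick-start input, derives the chunk and file variants described above, and posts the input to the Apify API's synchronous run endpoint using only the Python standard library. The actor_id argument is a placeholder you must replace with the ID shown on the Actor's Store page (typically in the form username~actor-name), and the helper names are made up for this example.

```python
import json
import urllib.request

# The quick-start input from this README.
FIRST_RUN_INPUT = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    "crawlerType": "cheerio",
    "crawlScope": "startUrlPath",
    "maxCrawlPages": 25,
    "maxResults": 25,
    "discoverSitemaps": False,
    "discoverLlmsTxt": False,
    "discoverLlmsFullTxt": False,
    "saveMarkdown": True,
    "saveText": False,
    "createChunks": False,
    "saveFiles": False,
    "parseFiles": False,
    "maxFiles": 0,
    "proxyConfiguration": {"useApifyProxy": False},
}


def with_chunks(run_input: dict) -> dict:
    """Return a copy of the input with chunk records enabled."""
    return {**run_input, "createChunks": True}


def with_files(run_input: dict, max_files: int = 5) -> dict:
    """Return a copy of the input that downloads and parses a few linked files."""
    return {**run_input, "saveFiles": True, "parseFiles": True, "maxFiles": max_files}


def run_actor(actor_id: str, token: str, run_input: dict) -> list:
    """Start a run, wait for it to finish, and return the dataset items."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"
    request = urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The synchronous endpoint waits for the run to finish, so it suits small test crawls like this one. For larger crawls, start the run asynchronously and read the dataset afterwards.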

Example output

The dataset contains different types of records. The most important field is recordType.

Page record

A page record represents one crawled web page.

{
  "recordType": "page",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "markdown": "# Getting started\n\nThis guide explains...",
  "text": "Getting started\n\nThis guide explains...",
  "contentQuality": {
    "confidence": 0.98,
    "wordCount": 1240,
    "isThin": false
  }
}

Chunk record

A chunk record is a smaller piece of a page or file. These records are useful for AI search, chatbots, and retrieval workflows.

{
  "recordType": "chunk",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "headingPath": ["Getting started", "Install"],
  "text": "Install the package and configure your project...",
  "tokenEstimate": 420
}

File record

A file record represents a downloaded or parsed file linked from a page.

{
  "recordType": "file",
  "url": "https://docs.example.com/api/openapi.json",
  "title": "JSON",
  "metadata": {
    "contentType": "application/json",
    "byteLength": 968704
  }
}

Understanding the results

Use recordType to filter the dataset:

| Record type | What it means | When to use it |
| --- | --- | --- |
| page | A full crawled web page | Markdown export, content review, documentation migration |
| chunk | A smaller text section | AI search, chatbots, vector databases, RAG workflows |
| file | A downloaded or parsed linked file | File archives, API specs, PDFs, spreadsheets |
| skipped | A URL skipped by the Actor | Debugging crawl limits or URL scope |
| tombstone | A previously seen item that disappeared | Incremental sync and delete handling |

Apify dataset views select useful columns, but they do not filter rows by type. For page-only, chunk-only, or file-only exports, filter by recordType.
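Because dataset views do not filter rows, the split by recordType usually happens in your own code after export. A minimal sketch, assuming items is the list of records loaded from a JSON export of the dataset (the function name is illustrative):

```python
from collections import defaultdict


def group_by_record_type(items: list[dict]) -> dict[str, list[dict]]:
    """Group exported dataset items by their recordType field."""
    groups = defaultdict(list)
    for item in items:
        # Records without a recordType are kept under "unknown" rather than dropped.
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)
```

For a page-only Markdown export, you would then read group_by_record_type(items).get("page", []) and write each record's markdown field to a file.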

Input settings explained

| Setting | Plain-language description |
| --- | --- |
| startUrls | The page or website section where the crawl starts. |
| crawlScope | Controls which links are allowed. startUrlPath is safest for one docs section or blog section. |
| maxCrawlPages | Maximum number of page requests the crawler will process. |
| maxResults | Maximum number of page records saved to the dataset. |
| crawlerType | Choose fast crawling, adaptive crawling, or browser crawling. |
| maxBrowserFallbacks | Caps how many pages adaptive mode may retry in a browser. |
| discoverSitemaps | Finds more URLs from sitemap files. Leave off for the cheapest first run. |
| discoverLlmsTxt | Finds URLs from llms.txt files when a site provides them. Leave off unless you need extra discovery. |
| discoverLlmsFullTxt | Also reads llms-full.txt; keep off unless you want a larger crawl. |
| saveMarkdown | Saves page content in Markdown format. |
| saveText | Saves page content as plain text. Turn off when Markdown is enough. |
| createChunks | Splits content into smaller AI-friendly records. Useful for RAG, but creates more dataset rows. |
| saveFiles | Downloads supported linked files. Leave off unless you need file archives. |
| parseFiles | Extracts text from supported linked files. Leave off unless you need PDF, spreadsheet, or document text. |
| maxFiles | Limits how many linked files are processed. |
| cookies | Secret cookie string for logged-in pages. |
| requestHeaders | Secret custom headers for authenticated or special requests. |

Crawler types

| Crawler type | Best for |
| --- | --- |
| cheerio | Fast crawling of static pages, docs, blogs, and help centers. |
| adaptive | Starts fast and falls back to browser rendering when needed. |
| playwright-firefox | Pages that need a real browser, JavaScript, or login flows. |
| playwright-chromium | Browser crawling with Chromium. |

Browser crawling is more powerful, but usually slower and more expensive. Start with cheerio, and switch to adaptive or a Playwright crawler only if the page content is missing from the results.

AI and chatbot use cases

This Actor is especially useful when you want AI to answer questions from website content.

Examples:

  • Customer support chatbot trained on a help center.
  • Internal assistant that searches company documentation.
  • Product copilot that answers questions from API docs.
  • Custom GPT knowledge files created from website pages.
  • Vector database ingestion for tools such as Pinecone, Qdrant, Weaviate, or similar systems.

If you are not familiar with the term RAG (retrieval-augmented generation), it means giving an AI model relevant passages from your own content before it answers a question. The chunk records are designed for that kind of workflow.
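Most vector databases accept some form of ID, text, and metadata per entry. The sketch below maps chunk records to that generic shape, using the fields from the chunk example in this README. The output layout and the ID scheme are assumptions for illustration; adapt them to the client library of the database you use.

```python
import hashlib
from collections import defaultdict


def chunks_to_documents(items: list[dict]) -> list[dict]:
    """Convert chunk records into generic {id, text, metadata} documents."""
    position = defaultdict(int)  # running chunk index per source URL
    documents = []
    for item in items:
        if item.get("recordType") != "chunk":
            continue  # page, file, skipped, and tombstone records are ignored
        url = item["url"]
        index = position[url]
        position[url] += 1
        # Hypothetical ID scheme: stable as long as chunk order on a page is stable.
        doc_id = hashlib.sha256(f"{url}#{index}".encode("utf-8")).hexdigest()[:16]
        documents.append({
            "id": doc_id,
            "text": item["text"],
            "metadata": {
                "url": url,
                "title": item.get("title", ""),
                "section": " > ".join(item.get("headingPath", [])),
            },
        })
    return documents
```

Keeping the URL and heading path in the metadata lets a chatbot cite the page and section an answer came from.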

Incremental crawling

If you run the Actor on a schedule, you may not want to process the same unchanged content every time.

Use incremental mode to track what changed:

{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "incrementalMode": "readWriteState",
  "stateKey": "docs-production",
  "skipUnchanged": true,
  "emitDeletedRecords": true
}

The Actor stores content hashes in the key-value store. On future runs, it can identify new, changed, unchanged, and deleted content.
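To make the idea concrete, here is a minimal sketch of hash-based change detection. It illustrates the general technique only: the Actor's stored state format and hashing details are not documented here, so the function names, the SHA-256 choice, and the URL-to-hash mapping are all assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash page content so two crawls can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_crawl(previous: dict[str, str], pages: dict[str, str]):
    """Compare stored hashes with the current crawl.

    previous maps URL -> hash from the last run; pages maps URL -> current text.
    Returns (status, new_state), where new_state is what the next run compares against.
    """
    current = {url: content_hash(text) for url, text in pages.items()}
    status = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for url, digest in current.items():
        if url not in previous:
            status["new"].append(url)
        elif previous[url] != digest:
            status["changed"].append(url)
        else:
            status["unchanged"].append(url)
    # URLs seen last time but missing now correspond to tombstone records.
    status["deleted"] = sorted(set(previous) - set(current))
    return status, current
```

With skipUnchanged enabled, the "unchanged" group is what you would expect to be left out of the dataset, and with emitDeletedRecords enabled the "deleted" group is what you would expect to appear as tombstone records.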

Authenticated websites

For private pages or customer portals, provide cookies or request headers in the input.

These fields are marked as secret inputs:

  • cookies
  • requestHeaders

They are not written to dataset records or logs. You can also provide loginValidationUrl to check that authentication works before the crawl continues.
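When you build the input in code, keep the secrets out of your source files. The sketch below reads a session value from an environment variable and assembles an input with cookies and loginValidationUrl. The variable name PORTAL_SESSION, the cookie name session, and the assumption that the cookies field expects the usual "name=value; name2=value2" header format are all hypothetical; check the Actor's input schema for the exact format it expects.

```python
import os


def cookie_string(cookies: dict[str, str]) -> str:
    """Join cookies in the standard Cookie header format (assumed input format)."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def authenticated_input(start_url: str, validation_url: str, environ=os.environ) -> dict:
    """Build an Actor input for a logged-in crawl without hard-coding secrets."""
    session = environ.get("PORTAL_SESSION")  # hypothetical variable name
    if not session:
        raise ValueError("PORTAL_SESSION is not set")
    return {
        "startUrls": [{"url": start_url}],
        "cookies": cookie_string({"session": session}),  # hypothetical cookie name
        "loginValidationUrl": validation_url,
    }
```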

How much does it cost?

The cost depends on:

  • how many pages you crawl,
  • how many files you download or parse,
  • whether you use browser crawling,
  • how much data is written to datasets and key-value stores.

Tips to control cost:

  • Start with maxCrawlPages and maxResults set to 25.
  • Keep discoverLlmsFullTxt off unless you need it.
  • Keep discoverSitemaps and discoverLlmsTxt off for the first test run.
  • Use cheerio for static sites.
  • Use createChunks only when you need AI search or chatbot-ready records.
  • Keep saveFiles and parseFiles off unless linked files matter.
  • Turn off saveHtml and saveScreenshots unless you need them.
  • Set maxFiles to a small number, such as 5 or 10, before processing many files.
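For a rough lower bound before a run, you can multiply the expected number of results by the listed starting price of $0.10 per 1,000 results. This is only a sketch: the price is a "from" figure, which record types count as billable results is an assumption, and the factors listed above can add to the total.

```python
from decimal import Decimal

# Listed starting price; the actual price for your run may be higher.
PRICE_PER_1000_RESULTS = Decimal("0.10")


def estimate_result_cost(results: int, price_per_1000: Decimal = PRICE_PER_1000_RESULTS) -> Decimal:
    """Return the estimated result cost in US dollars for a number of dataset results."""
    if results < 0:
        raise ValueError("results must not be negative")
    return Decimal(results) / Decimal(1000) * price_per_1000
```

At this rate, a 25-result test run comes to a fraction of a cent, which is why starting small costs almost nothing.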

Troubleshooting

I only got navigation or very little text

Try adaptive or a Playwright crawler. The page may need JavaScript rendering. You can also use keepElementsCssSelector to tell the Actor which part of the page to keep.

I got too many pages

Use a narrower start URL in startUrls, keep crawlScope set to startUrlPath, or add patterns to excludeUrlGlobs.

I did not get enough pages

Increase maxCrawlPages, maxResults, and maxCrawlDepth. Also enable discoverSitemaps so the crawler can find more URLs from sitemap files.

My files are missing

Make sure saveFiles and parseFiles are enabled, and increase maxFiles if the site links to many files.

Some pages have low confidence scores

Low scores are common for index pages, category pages, and navigation-heavy pages. For AI workflows, the detailed content pages and chunk records are usually more useful.
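If low-confidence pages add noise to your export, you can filter them out afterwards using the contentQuality fields shown in the page record example. In this sketch the 0.5 threshold is an arbitrary assumption; pick a value after reviewing a few records from your own crawl.

```python
def usable_pages(items: list[dict], min_confidence: float = 0.5) -> list[dict]:
    """Keep page records that are not thin and meet a confidence threshold."""
    kept = []
    for item in items:
        if item.get("recordType") != "page":
            continue
        quality = item.get("contentQuality") or {}
        if quality.get("isThin"):
            continue  # index and navigation-heavy pages are often flagged as thin
        if quality.get("confidence", 0.0) < min_confidence:
            continue  # pages with no score are treated as low confidence
        kept.append(item)
    return kept
```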

The website blocks the crawler

Try a browser crawler and configure proxies in Apify. Some sites block plain HTTP requests and only respond to a real browser, a proxy, or both.

Limitations

  • Legacy .doc files can be downloaded but are not text-extracted.
  • Very large files may be skipped based on fileMaxSizeMb.
  • Browser crawling is slower and may cost more than fast HTTP crawling.
  • llms.txt and llms-full.txt are used for discovery, not saved as normal file records.
  • Results depend on the structure and accessibility of the target website.

Best practices

  • Test with a small crawl before running a large one.
  • Review a few page records to confirm the extracted text looks right.
  • Use chunk records for chatbot and vector database workflows.
  • Use page records for full Markdown or text exports.
  • Use skipped records to understand why URLs were not saved.
  • Save a tested input as an Apify Task for repeat use.