AI Website Content Extractor

Pricing

$5.00/month + usage

Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.

Rating

5.0

(2)

Developer

ScrapeAI

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 4
  • Monthly active users: 2
  • Last modified: 23 days ago

This Apify Actor crawls one or more website pages using Playwright, removes navigation, ads, and other noise, then converts the main content to clean Markdown that is ready for RAG pipelines, vector databases, and LLM training datasets.

Features

  • Crawl one or more public web pages
  • Automatically dismiss cookie / consent dialogs
  • Strip navigation bars, headers, footers, sidebars, ads, and modals
  • Detect the main content area using semantic HTML selectors (main, article, [role="main"], etc.)
  • Convert HTML to clean Markdown via turndown
  • Skip low-content pages (login walls, redirects) automatically
  • Output a structured dataset ready for AI use cases
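
The noise-stripping and main-content detection steps above can be illustrated with a small standard-library sketch. This is not the Actor's implementation (which runs on Playwright and turndown); the tag lists and the names `MainTextExtractor` and `extract_main_text` are assumptions made for illustration, and the sketch returns plain text rather than Markdown.

```python
from html.parser import HTMLParser

# Assumed noise and main-content tags; the Actor's real selector lists may differ.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"main", "article"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collect text that sits inside a main container and outside noise."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # (tag, kind) for every currently open element
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in NOISE_TAGS:
            kind = "noise"
        elif tag in MAIN_TAGS or dict(attrs).get("role") == "main":
            kind = "main"
        else:
            kind = None
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        kinds = {kind for _, kind in self.stack}
        text = data.strip()
        if text and "main" in kinds and "noise" not in kinds:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the main-content text blocks joined by blank lines."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)
```

One limitation of this sketch: a `<header>` nested inside an `<article>` is treated as noise, so a title placed there would be dropped. A production extractor would need finer-grained rules.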

Input

| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | List of {url} objects or plain URL strings to crawl | [{url: "https://example.com"}] |
| maxPages | Number | Maximum number of pages to process | 20 |
| proxyConfiguration | Object | Apify proxy settings (optional) | {} |

Example Input

{
  "startUrls": [
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" },
    { "url": "https://openai.com/blog" }
  ],
  "maxPages": 10
}
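
Because startUrls accepts both {url} objects and plain strings, it can help to normalize the input before submitting a run. The following is a minimal sketch, assuming a hypothetical helper named `normalize_input`; the validation rules (absolute http/https URLs, positive integer maxPages) are assumptions, not documented behavior of the Actor.

```python
from urllib.parse import urlparse

DEFAULT_MAX_PAGES = 20  # default listed in the Input table


def normalize_input(raw: dict) -> dict:
    """Return {"startUrls": [str, ...], "maxPages": int} from a raw input dict."""
    urls = []
    for entry in raw.get("startUrls", []):
        # Each entry may be a {"url": ...} object or a plain URL string.
        url = entry.get("url") if isinstance(entry, dict) else entry
        if not isinstance(url, str):
            raise ValueError(f"Invalid startUrls entry: {entry!r}")
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
        urls.append(url)

    max_pages = raw.get("maxPages", DEFAULT_MAX_PAGES)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")

    return {"startUrls": urls, "maxPages": max_pages}
```

Passing the example input above through this helper would yield a flat list of two URL strings and a maxPages of 10.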

Output

Each extracted page produces one dataset record:

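| Field | Type | Description |
|---|---|---|
| url | String | URL of the crawled page |
| title | String | Page <title> |
| markdown | String | Clean Markdown of the main content |
| text | String | Plain text of the main content |
| wordCount | Number | Approximate word count of the Markdown |
| extractedAt | String | ISO 8601 timestamp |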

Example Output

{
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "text": "Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "wordCount": 4312,
  "extractedAt": "2026-03-13T08:00:00.000Z"
}
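
To make the record shape concrete, here is a rough sketch of how such a record could be assembled. The function name `build_record` is hypothetical, and the whitespace-based word count is only one plausible reading of "approximate word count"; the Actor's actual counting method is not documented here.

```python
from datetime import datetime, timezone


def build_record(url, title, markdown, text, now=None):
    """Assemble one dataset record with the fields listed in the Output table."""
    now = now or datetime.now(timezone.utc)
    # ISO 8601 in UTC with millisecond precision and a trailing "Z".
    extracted_at = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "text": text,
        # Whitespace split; Markdown tokens such as "#" count as words here.
        "wordCount": len(markdown.split()),
        "extractedAt": extracted_at,
    }
```

Passing a fixed `now` keeps the timestamp deterministic, which is convenient when comparing records in tests.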

Use Cases

  • RAG pipelines — ingest Markdown directly into your vector store
  • LLM fine-tuning — build clean text corpora from any website
  • AI chatbots — feed domain knowledge to your assistant
  • Research — extract and archive article content at scale

Tips

  • Supply multiple startUrls to crawl several pages in one run
  • Increase maxPages to crawl an entire site (combine with Apify's link-following features)
  • For authenticated pages, configure a proxy or session in proxyConfiguration