AI Website Content Extractor

Pricing

$5.00/month + usage

Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.

Rating: 0.0 (0)

Developer: ScrapeAI (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

Apify Actor that crawls one or more website pages using Playwright, removes navigation, ads, and other noise, then converts the main content to clean Markdown — ready for RAG pipelines, vector databases, and LLM training datasets.

Features

  • Crawl any public website page(s)
  • Automatically dismiss cookie / consent dialogs
  • Strip navigation bars, headers, footers, sidebars, ads, and modals
  • Detect the main content area using semantic HTML selectors (main, article, [role="main"], etc.)
  • Convert HTML to clean Markdown via turndown
  • Skip low-content pages (login walls, redirects) automatically
  • Output a structured dataset ready for AI use cases
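
The stripping and main-content detection steps listed above can be illustrated with a small standard-library sketch. It is not the Actor's implementation (which runs in Playwright and uses turndown). The tag sets, the `MainTextExtractor` class, and the 50-word `min_words` threshold are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed noise containers; the Actor's real selector list is not documented here.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
# Semantic main-content containers named in the feature list.
MAIN_TAGS = {"main", "article"}
# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collect text inside main-content containers, skipping noise containers."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, kind) for each open element
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in NOISE_TAGS:
            kind = "noise"
        elif tag in MAIN_TAGS or dict(attrs).get("role") == "main":
            kind = "main"
        else:
            kind = "other"
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this also drops unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        kinds = [kind for _, kind in self.stack]
        if "main" in kinds and "noise" not in kinds and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the text found inside the main content area of an HTML string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def is_low_content(text: str, min_words: int = 50) -> bool:
    """Hypothetical low-content check: too few words to be worth keeping."""
    return len(text.split()) < min_words
```

A page whose extracted text falls below the threshold (a login wall, for example) would be skipped rather than written to the dataset.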

Input

| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | List of {url} objects or plain URL strings to crawl | [{url: "https://example.com"}] |
| maxPages | Number | Maximum number of pages to process | 20 |
| proxyConfiguration | Object | Apify proxy settings (optional) | {} |
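
Because startUrls accepts both {url} objects and plain strings, it can help to normalize the list before starting a run. The following is a minimal client-side sketch. The helper names `normalize_start_urls` and `build_input` are hypothetical, and the defaults are taken from the table above.

```python
from urllib.parse import urlparse


def normalize_start_urls(start_urls):
    """Accept {"url": ...} objects or plain strings; return unique absolute URLs."""
    urls = []
    for entry in start_urls:
        url = entry.get("url") if isinstance(entry, dict) else entry
        if not isinstance(url, str):
            raise ValueError(f"Unsupported startUrls entry: {entry!r}")
        url = url.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
        if url not in urls:  # keep first occurrence, preserve order
            urls.append(url)
    return urls


def build_input(start_urls, max_pages=20, proxy_configuration=None):
    """Build an input object using the field names and defaults from the table."""
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    return {
        "startUrls": [{"url": u} for u in normalize_start_urls(start_urls)],
        "maxPages": max_pages,
        "proxyConfiguration": proxy_configuration or {},
    }
```

The resulting dictionary can be serialized with `json.dumps` and pasted into the Actor's input editor.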

Example Input

{
  "startUrls": [
    { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" },
    { "url": "https://openai.com/blog" }
  ],
  "maxPages": 10
}

Output

Each extracted page produces one dataset record:

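| Field | Type | Description |
|---|---|---|
| url | String | URL of the crawled page |
| title | String | Page <title> |
| markdown | String | Clean Markdown of the main content |
| text | String | Plain-text version of the main content |
| wordCount | Number | Approximate word count of the Markdown |
| extractedAt | String | ISO 8601 timestamp |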

Example Output

{
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "text": "Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
  "wordCount": 4312,
  "extractedAt": "2026-03-13T08:00:00.000Z"
}
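
To make the record shape concrete, here is a rough sketch of how such a record could be assembled. How the Actor actually computes wordCount is not documented here, so the whitespace split below is an assumption, and `build_record` is a hypothetical helper.

```python
from datetime import datetime, timezone


def build_record(url, title, markdown, text, now=None):
    """Assemble one dataset record with the fields listed in the Output table."""
    now = now or datetime.now(timezone.utc)
    # ISO 8601 in UTC with millisecond precision and a trailing "Z".
    extracted_at = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "text": text,
        # Assumed approximation: whitespace-separated tokens, so Markdown
        # markers such as "#" are counted as words too.
        "wordCount": len(markdown.split()),
        "extractedAt": extracted_at,
    }
```

Passing a fixed `now` keeps the timestamp reproducible when testing downstream code.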

Use Cases

  • RAG pipelines — ingest Markdown directly into your vector store
  • LLM fine-tuning — build clean text corpora from any website
  • AI chatbots — feed domain knowledge to your assistant
  • Research — extract and archive article content at scale
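
For the RAG use case above, the markdown field usually needs to be split into chunks before embedding. The sketch below shows one possible approach: split at headings, then split long sections by paragraph. The `max_words` limit and both helper names are assumptions, and `#` lines inside code blocks would be mistaken for headings.

```python
import re

# A Markdown ATX heading: one to six "#" characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown, max_words=200):
    """Split Markdown at headings, then split long sections at paragraph breaks."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        buf, count = [], 0
        for para in re.split(r"\n\s*\n", section):
            words = len(para.split())
            # Start a new chunk when adding this paragraph would exceed the limit.
            if buf and count + words > max_words:
                chunks.append("\n\n".join(buf))
                buf, count = [], 0
            buf.append(para)
            count += words
        if buf:
            chunks.append("\n\n".join(buf))
    return chunks


def record_to_chunks(record, max_words=200):
    """Attach source metadata from a dataset record to each chunk."""
    return [
        {"url": record["url"], "title": record["title"], "chunk": i, "text": c}
        for i, c in enumerate(chunk_markdown(record["markdown"], max_words))
    ]
```

Each returned dictionary carries the source url and title, so retrieved chunks can be traced back to the page they came from.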

Tips

  • Supply multiple startUrls to crawl several pages in one run
  • Increase maxPages to crawl an entire site (combine with Apify's link-following features)
  • For authenticated pages, configure a proxy or session in proxyConfiguration