Website Content Miner
Pricing
$7.00/month + usage
Extract clean website content at scale: page titles, meta descriptions, H1-H3 headings, readable main text, and URLs. Includes smart noise removal, Readability fallback, optional internal crawling, and structured output for SEO audits, AI datasets, research, and automation.
Rating: 5.0 (1)
Developer: Techionik
Maintained by Community
Actor stats
- Bookmarked: 1
- Total users: 7
- Monthly active users: 3
- Last modified: 2 days ago
Website Content Miner
Extract clean, structured, and human-readable content from websites without writing custom selectors.
Website Content Miner is built for SEO audits, AI preprocessing, research, content analysis, website archiving, and automation workflows. It crawls standard HTML websites and returns organized page-level data including page titles, meta descriptions, headings, clean main text, and source URLs.
What This Actor Does
Website Content Miner helps you turn website pages into clean structured datasets.
It automatically:
- Extracts page titles
- Extracts meta descriptions
- Extracts H1, H2, and H3 headings
- Extracts readable main page text
- Removes common website noise such as navigation menus, footers, cookie banners, modals, newsletter blocks, and social/share sections
- Uses smart content detection with Mozilla Readability fallback
- Optionally follows internal links with crawl depth control
- Outputs clean dataset items ready for SEO, AI, research, or automation use
Best For
- SEO content audits
- Website content extraction
- AI dataset preparation
- LLM / RAG preprocessing
- Competitor research
- Content inventory creation
- Website text archiving
- Marketing and content analysis
- Automation workflows using Apify, Make, n8n, Zapier, or custom APIs
Data Extracted
Each scraped page returns the following fields:
| Field | Description |
|---|---|
| pageTitle | The page title, taken from the Open Graph title or the HTML title tag |
| metaDescription | The page meta description, taken from the standard meta description or the Open Graph description |
| headings | H1, H2, and H3 headings, each returned as an object with its level and text |
| mainText | Clean readable page text with common noise removed |
| pageUrl | Final scraped page URL |
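
If you consume the dataset from Python, the fields above can be mapped onto small typed records. The sketch below is only an illustration: the class names `Heading` and `PageItem` and the helper `parse_item` are hypothetical, and the field names and the `level`/`text` shape of each heading are taken from the Output Example in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heading:
    level: str  # "h1", "h2", or "h3"
    text: str


@dataclass
class PageItem:
    pageTitle: str
    metaDescription: str
    headings: List[Heading]
    mainText: str
    pageUrl: str


def parse_item(raw: dict) -> PageItem:
    """Convert one raw dataset item (a dict) into a typed PageItem.

    Missing fields fall back to empty values so that a partially
    filled item does not raise.
    """
    headings = [Heading(h["level"], h["text"]) for h in raw.get("headings", [])]
    return PageItem(
        pageTitle=raw.get("pageTitle", ""),
        metaDescription=raw.get("metaDescription", ""),
        headings=headings,
        mainText=raw.get("mainText", ""),
        pageUrl=raw.get("pageUrl", ""),
    )


# The sample item from the Output Example section.
sample = {
    "pageTitle": "Example Website",
    "metaDescription": "A sample website used for demonstration.",
    "headings": [{"level": "h1", "text": "Example Domain"}],
    "mainText": "This domain is for use in illustrative examples in documents...",
    "pageUrl": "https://example.com",
}

item = parse_item(sample)
```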
Input Options
Start URLs
Add one or more website URLs to scrape.
Example: https://example.com
Crawl Links
Enable this option if you want the Actor to follow links found on the provided pages.
Default: false
Max Enqueue Depth
Controls how deep the scraper should follow links.
Examples:
- 0 = scrape only the provided start URLs
- 1 = scrape start URLs and links found on those pages
- 2 = scrape links found on the next level as well
Default: 1
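To make the depth values concrete, here is a small simulation of a depth-limited crawl over a made-up link graph. It is a sketch of the general idea, not the Actor's code: the `LINKS` graph and the function `pages_within_depth` are hypothetical, and the `max_requests` cap stands in for the Max Requests per Crawl option.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post-1"],
    "/team": [],
    "/blog/post-1": ["/blog/post-1/comments"],
    "/blog/post-1/comments": [],
}


def pages_within_depth(start, max_depth, max_requests=100):
    """Breadth-first walk that does not enqueue links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_requests:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= max_depth:
            continue  # process this page, but do not follow its links
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


depth_0 = pages_within_depth("/", 0)  # only the start URL
depth_1 = pages_within_depth("/", 1)  # start URL plus pages it links to
depth_2 = pages_within_depth("/", 2)  # one level further
```

In this toy graph, depth 0 visits 1 page, depth 1 visits 3, and depth 2 visits 5. The page `/blog/post-1/comments` sits at depth 3 and is never reached.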
Same Domain Only
When enabled, the Actor only follows links from the same domain as the first start URL.
This is useful for keeping the crawl focused on one website.
Default: true
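A same-domain filter can be pictured as a hostname comparison. The sketch below is an assumption about the general technique, not the Actor's actual rule: this README does not say how subdomains such as `www.` or `blog.` are treated, so the hypothetical helper `same_domain_links` uses strict hostname equality.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(start_url, hrefs):
    """Resolve hrefs against start_url and keep those on the same hostname."""
    start_host = urlparse(start_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(start_url, href)  # handles relative links
        parsed = urlparse(absolute)
        # Skip mailto:, tel:, javascript: and links to other hosts.
        if parsed.scheme in ("http", "https") and parsed.hostname == start_host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://example.com/docs/",
    ["/pricing", "guide.html", "https://other.org/x", "mailto:hi@example.com"],
)
```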
Max Requests per Crawl
Sets the maximum number of pages processed in one run.
Default: 100
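The options above can be assembled into a run input programmatically. In the sketch below, the keys `crawlLinks`, `maxEnqueueDepth`, `sameDomainOnly`, and `maxRequestsPerCrawl` come from the Recommended Settings section of this README, and the defaults are the ones documented above. The `startUrls` key and its list-of-objects shape are assumptions based on a common Apify input convention, so check the Actor's input schema before relying on them. The helper `build_input` is hypothetical.

```python
# Defaults as documented in the Input Options section.
DEFAULTS = {
    "crawlLinks": False,
    "maxEnqueueDepth": 1,
    "sameDomainOnly": True,
    "maxRequestsPerCrawl": 100,
}


def build_input(start_urls, **overrides):
    """Build a run input dict, rejecting option names that are not documented."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"unknown option(s): {', '.join(unknown)}")
    # ASSUMPTION: key name and shape of the start URL list.
    run_input = {"startUrls": [{"url": url} for url in start_urls]}
    run_input.update(DEFAULTS)
    run_input.update(overrides)
    return run_input


# A small same-site audit: follow links one level deep, stop after 50 pages.
audit_input = build_input(
    ["https://example.com"], crawlLinks=True, maxRequestsPerCrawl=50
)
```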
Output Example
{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [
    { "level": "h1", "text": "Example Domain" }
  ],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
How It Works
- Website Content Miner starts from the URLs you provide.
- It loads each page using Crawlee and Cheerio.
- It detects the main content area using common content selectors such as main, article, #content, .content, and similar structures.
- It removes common noise elements like headers, navigation menus, footers, forms, scripts, cookie banners, modals, newsletter blocks, and social sharing sections.
- It extracts titles, descriptions, headings, and readable text.
- It uses Mozilla Readability first, then applies a stronger fallback strategy for pages where content is not structured like a standard article.
- It saves each result to the Apify dataset.
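The noise-removal and extraction steps above can be illustrated with a minimal sketch. This is not the Actor's implementation: the Actor runs on Crawlee, Cheerio, and Mozilla Readability in JavaScript, while the example below uses only Python's standard `html.parser` and skips a fixed, assumed set of noise tags. It does no selector-based content detection and no Readability pass. The names `ContentSketch`, `NOISE_TAGS`, and `extract` are hypothetical.

```python
from html.parser import HTMLParser

# ASSUMPTION: a simplified noise list; the Actor's real list is not published here.
NOISE_TAGS = {"nav", "header", "footer", "form", "script", "style", "aside"}
HEADING_TAGS = {"h1", "h2", "h3"}


class ContentSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # > 0 while inside a noise element
        self.current_heading = None  # heading being collected, if any
        self.in_title = False
        self.title = ""
        self.headings = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in HEADING_TAGS:
            self.current_heading = {"level": tag, "text": ""}
        elif self.skip_depth == 0 and tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS and self.current_heading is not None:
            heading = self.current_heading
            heading["text"] = " ".join(heading["text"].split())
            self.headings.append(heading)
            self.current_heading = None
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.skip_depth > 0:
            return  # text inside a noise element is dropped
        elif self.current_heading is not None:
            self.current_heading["text"] += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return title, headings, and body text with noise elements removed."""
    parser = ContentSketch()
    parser.feed(html)
    parser.close()
    return {
        "pageTitle": parser.title.strip(),
        "headings": parser.headings,
        "mainText": " ".join(parser.text_parts),
    }


SAMPLE_HTML = """
<html><head><title>Example Website</title></head>
<body>
  <nav><a href="/">Home</a></nav>
  <main><h1>Example Domain</h1><p>This domain is for use in examples.</p></main>
  <footer>Copyright notice</footer>
</body></html>
"""

result = extract(SAMPLE_HTML)
```

In this sketch, heading text is kept out of `mainText`. Whether the Actor includes headings in `mainText` is not stated in this README.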
Key Features
- Clean structured output
- No custom selectors required
- Smart main content detection
- Noise removal for cleaner text
- Optional internal link crawling
- Same-domain crawling option
- Crawl depth control
- Request limit control
- SEO and AI-ready dataset format
- Simple input configuration
- Easy integration through Apify API
Typical Use Cases
SEO Audits
Collect page titles, meta descriptions, headings, and page text from websites to review content structure and optimization quality.
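As an illustration, a basic audit pass over dataset items might look like the sketch below. The function `audit_page` is hypothetical, and the 60 and 160 character limits are common rules of thumb chosen for this example, not values defined by the Actor.

```python
def audit_page(item, max_title_len=60, max_description_len=160):
    """Return a list of human-readable issues for one dataset item."""
    issues = []
    title = item.get("pageTitle", "").strip()
    description = item.get("metaDescription", "").strip()
    h1_count = sum(1 for h in item.get("headings", []) if h.get("level") == "h1")

    if not title:
        issues.append("missing title")
    elif len(title) > max_title_len:
        issues.append("title too long")

    if not description:
        issues.append("missing meta description")
    elif len(description) > max_description_len:
        issues.append("meta description too long")

    if h1_count == 0:
        issues.append("no h1 heading")
    elif h1_count > 1:
        issues.append("multiple h1 headings")

    return issues


page = {
    "pageTitle": "Example Website",
    "metaDescription": "A sample website used for demonstration.",
    "headings": [{"level": "h1", "text": "Example Domain"}],
    "pageUrl": "https://example.com",
}
page_issues = audit_page(page)
```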
AI and LLM Preprocessing
Prepare clean website text for AI workflows, embeddings, semantic search, RAG systems, and knowledge base creation.
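For embedding or RAG pipelines, `mainText` is usually split into smaller pieces before indexing. The sketch below shows one simple approach, fixed-size character windows with overlap. The helpers `chunk_text` and `to_records`, the `chunkIndex` key, and the sizes are all assumptions for illustration. Many pipelines split on sentences or tokens instead.

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += max_chars - overlap
    return chunks


def to_records(item, max_chars=500, overlap=50):
    """Attach the source URL to every chunk so results stay traceable."""
    return [
        {"pageUrl": item.get("pageUrl", ""), "chunkIndex": i, "text": chunk}
        for i, chunk in enumerate(
            chunk_text(item.get("mainText", ""), max_chars, overlap)
        )
    ]


records = to_records({"pageUrl": "https://example.com", "mainText": "a" * 1200})
```

With a 1200-character text and the default sizes, this yields three chunks of 500, 500, and 300 characters.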
Website Research
Extract readable content from multiple pages for competitor research, market research, or content analysis.
Content Inventory
Create a structured inventory of website pages, including titles, URLs, headings, and body text.
Website Archiving
Save clean text versions of website pages for documentation, research, or long-term reference.
Automation Workflows
Use the output dataset in Apify integrations, Make, n8n, Zapier, Google Sheets, databases, or custom APIs.
Recommended Settings
For a Single Page
- crawlLinks: false
- maxRequestsPerCrawl: 1
For a Small Website Audit
- crawlLinks: true
- maxEnqueueDepth: 1
- sameDomainOnly: true
- maxRequestsPerCrawl: 50
For a Larger Website Crawl
- crawlLinks: true
- maxEnqueueDepth: 2
- sameDomainOnly: true
- maxRequestsPerCrawl: 100 or higher
Notes and Limitations
- Best suited for static and semi-static HTML websites
- Not designed for websites that require login
- Not ideal for heavily JavaScript-rendered applications
- Results depend on the quality and structure of the target website
- For websites with strict anti-bot protection, proxy configuration may be required
Output Access
After the run finishes, you can access the scraped data from:
- Apify Dataset
- Dataset API
- Overview table
- JSON, CSV, Excel, XML, or RSS exports
- Apify integrations and webhooks
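As a sketch of Dataset API access, the helper below builds the URL of the Apify dataset items endpoint and downloads the items as JSON with the standard library. The function names and the dataset ID `abc123` are placeholders. The dataset ID and API token come from your own run and account.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json", token=None, limit=None):
    """Build the dataset items URL for the given export format."""
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    if token:
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id, token=None, limit=None):
    """Download dataset items as a list of dicts (performs a network request)."""
    url = dataset_items_url(dataset_id, fmt="json", token=token, limit=limit)
    with urlopen(url) as response:
        return json.load(response)


# "abc123" is a placeholder dataset ID.
csv_url = dataset_items_url("abc123", fmt="csv", limit=10)
```

Changing `fmt` to `csv`, `xlsx`, `xml`, or `rss` selects the corresponding export format.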
Why Use Website Content Miner
Website Content Miner saves time by automatically extracting clean, structured website content without requiring custom scraping rules for every website.
It is useful for anyone who needs reliable page-level content data for SEO, AI, automation, research, reporting, or content intelligence workflows.
Technology
Built with:
- Apify SDK
- Crawlee
- CheerioCrawler
- Cheerio
- Mozilla Readability
Status
Production-ready for general website content extraction.