Website Content Crawler

Developed by Akash Kumar Naik

Powerful website content crawler that extracts, analyzes, and indexes web pages automatically. Streamline data collection with fast, accurate web scraping technology.

Extract website content with this advanced web crawler. Features stealth browsing with Camoufox, proxy rotation, and intelligent data extraction for content analysis.

Key Features

  • 🌐 Universal Crawling: Crawl any website with configurable data extraction
  • 🔒 Stealth Browsing: Camoufox integration for avoiding detection
  • 🚀 Advanced Proxy Management: Request-level proxy rotation with automatic failover
  • 🔗 Intelligent Link Following: Depth-limited crawling with domain restrictions
  • 📈 Rich Metadata: Extract titles, descriptions, images, and meta tags

Quick Start

1. Run on Apify Platform

apify login
apify push

Then run your actor in Apify Console with these settings (the equivalent JSON input is shown after the list):

  • Start URLs: https://yesintelligent.com
  • Max Pages: 100
  • Use Apify Proxy: true
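
Those Console settings correspond to this actor input JSON (field names follow the Configuration Options tables below):

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 100,
  "proxyConfig": {"useApifyProxy": true}
}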

2. Local Development

cd website-content-crawler
npm install
apify run
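
When running locally, the Apify CLI reads the actor input from ./storage/key_value_stores/default/INPUT.json. A minimal input file for a quick test might look like this (values are illustrative):

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 10,
  "crawlDepth": 1
}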

Configuration Options

Basic Settings

Parameter    Type     Default                          Description
---------    ----     -------                          -----------
startUrls    array    ["https://yesintelligent.com"]   URLs to start crawling from
maxPages     integer  100                              Maximum pages to crawl (1-1000)
crawlDepth   integer  2                                Maximum crawl depth (0 = current page only)

Proxy Configuration

Parameter                     Type     Default          Description
---------                     ----     -------          -----------
proxyConfig.useApifyProxy     boolean  true             Enable Apify Proxy for anonymous browsing
proxyConfig.apifyProxyGroups  array    ["RESIDENTIAL"]  Proxy groups: RESIDENTIAL, DATACENTER, or custom
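
For example, to route a large crawl of a less protected site through cheaper datacenter proxies, only the fields from the table above are needed:

{
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["DATACENTER"]
  }
}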

Usage Examples

Example 1: Basic Website Crawling

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 50,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Example 2: Deep Content Crawling

{
  "startUrls": [{"url": "https://example-blog.com"}],
  "maxPages": 100,
  "crawlDepth": 3,
  "followExternal": false,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
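
The actor can also be started programmatically. Below is a minimal sketch using the apify-client package for Node.js; the actor ID and API token are placeholders:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor run and wait for it to finish.
const run = await client.actor('<username>/website-content-crawler').call({
  startUrls: [{ url: 'https://yesintelligent.com' }],
  maxPages: 50,
  proxyConfig: { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] },
});

// Read the crawled pages from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Crawled ${items.length} pages`);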

Output Data Structure

Each crawled page produces the following data:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "content": "Main page content text",
  "meta": {
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "publishedTime": "2024-01-01T00:00:00Z",
    "wordCount": 1000,
    "charCount": 5000,
    "readingTime": 5
  },
  "images": [
    {
      "src": "/image1.jpg",
      "alt": "Image description",
      "url": "https://example.com/image1.jpg"
    }
  ],
  "scrapedAt": "2024-01-01T12:00:00Z",
  "statusCode": 200,
  "depth": 1
}
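
These records are stored in the run's default dataset and can be exported straight from the Apify API in several formats, for example (the dataset ID is a placeholder):

curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=json&clean=true"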

Best Practices

1. Respect Rate Limits

  • Use appropriate delays between requests
  • Start with conservative concurrency settings (see the sketch after this list)
  • Monitor server response times
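
If you fork the actor's source, concurrency and request rate can be capped in the crawler options. A minimal sketch, assuming the actor is built on Crawlee's PlaywrightCrawler (the usual stack for Apify + Playwright actors):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,          // keep only a few browser pages open in parallel
  maxRequestsPerMinute: 60,   // spread requests out to respect rate limits
  requestHandlerTimeoutSecs: 60,
  async requestHandler({ request, page }) {
    // extraction logic goes here
  },
});

await crawler.run(['https://yesintelligent.com']);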

2. Optimize Data Extraction

  • The crawler automatically extracts content from common selectors
  • Images are deduplicated and converted to absolute URLs
  • Metadata is extracted from meta tags

3. Handle Anti-Bot Measures

  • Enable proxy rotation for large crawls
  • Use residential proxies for sensitive sites
  • Monitor for rate limiting responses

Deployment

Apify Platform

apify login
apify push

Docker Deployment

docker build -t website-content-crawler .
docker run -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' website-content-crawler
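
When the container runs outside the Apify platform, Apify Proxy still needs credentials. These are conventionally supplied through the standard Apify environment variables APIFY_TOKEN and APIFY_PROXY_PASSWORD (values below are placeholders):

docker run \
  -e APIFY_TOKEN='<your-api-token>' \
  -e APIFY_PROXY_PASSWORD='<your-proxy-password>' \
  -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' \
  website-content-crawler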

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is licensed under the ISC License.

Support

For issues and questions:


Built with ❤️ using Apify, Playwright, and Camoufox