Website Content Crawler

Developed by Akash Kumar Naik

Powerful website content crawler that extracts, analyzes, and indexes web pages automatically. Streamline data collection with fast, accurate web scraping technology.

Extract website content with this advanced web crawler. Features stealth browsing with Camoufox, proxy rotation, and intelligent data extraction for content analysis.

Key Features

  • 🌐 Universal Crawling: Crawl any website with configurable data extraction
  • 🔒 Stealth Browsing: Camoufox integration for avoiding detection
  • 🚀 Advanced Proxy Management: Request-level proxy rotation with automatic failover
  • 🔗 Intelligent Link Following: Depth-limited crawling with domain restrictions
  • 📈 Rich Metadata: Extract titles, descriptions, images, and meta tags

Quick Start

1. Run on Apify Platform

apify login
apify push

Then run your actor in Apify Console with these settings (the equivalent JSON input is shown after the list):

  • Start URLs: https://yesintelligent.com
  • Max Pages: 100
  • Use Apify Proxy: true
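
Those Console settings correspond to this actor input JSON (field names follow the Configuration Options tables below):

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 100,
  "proxyConfig": {"useApifyProxy": true}
}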

2. Local Development

cd website-content-crawler
npm install
apify run
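
When running locally, the Apify CLI reads the actor input from ./storage/key_value_stores/default/INPUT.json. A minimal input file for a quick test might look like this (values are illustrative):

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 10,
  "crawlDepth": 1
}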

Configuration Options

Basic Settings

Parameter    Type     Default                          Description
---------    ----     -------                          -----------
startUrls    array    ["https://yesintelligent.com"]   URLs to start crawling from
maxPages     integer  100                              Maximum pages to crawl (1-1000)
crawlDepth   integer  2                                Maximum crawl depth (0 = current page only)

Proxy Configuration

Parameter                     Type     Default          Description
---------                     ----     -------          -----------
proxyConfig.useApifyProxy     boolean  true             Enable Apify Proxy for anonymous browsing
proxyConfig.apifyProxyGroups  array    ["RESIDENTIAL"]  Proxy groups: RESIDENTIAL, DATACENTER, or custom
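
For example, to route a large crawl of a less protected site through cheaper datacenter proxies, only the fields from the table above are needed:

{
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["DATACENTER"]
  }
}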

Usage Examples

Example 1: Basic Website Crawling

{
  "startUrls": [{"url": "https://yesintelligent.com"}],
  "maxPages": 50,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Example 2: Deep Content Crawling

{
  "startUrls": [{"url": "https://example-blog.com"}],
  "maxPages": 100,
  "crawlDepth": 3,
  "followExternal": false,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
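
The actor can also be started programmatically. Below is a minimal sketch using the apify-client package for Node.js; the actor ID and API token are placeholders:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor run and wait for it to finish.
const run = await client.actor('<username>/website-content-crawler').call({
  startUrls: [{ url: 'https://yesintelligent.com' }],
  maxPages: 50,
  proxyConfig: { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] },
});

// Read the crawled pages from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Crawled ${items.length} pages`);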

Output Data Structure

Each crawled page produces the following data:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "content": "Main page content text",
  "meta": {
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "publishedTime": "2024-01-01T00:00:00Z",
    "wordCount": 1000,
    "charCount": 5000,
    "readingTime": 5
  },
  "images": [
    {
      "src": "/image1.jpg",
      "alt": "Image description",
      "url": "https://example.com/image1.jpg"
    }
  ],
  "scrapedAt": "2024-01-01T12:00:00Z",
  "statusCode": 200,
  "depth": 1
}
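
These records are stored in the run's default dataset and can be exported straight from the Apify API in several formats, for example (the dataset ID is a placeholder):

curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=json&clean=true"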

Best Practices

1. Respect Rate Limits

  • Use appropriate delays between requests
  • Start with conservative concurrency settings (see the sketch after this list)
  • Monitor server response times
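
If you fork the actor's source, concurrency and request rate can be capped in the crawler options. A minimal sketch, assuming the actor is built on Crawlee's PlaywrightCrawler (the usual stack for Apify + Playwright actors):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,          // keep only a few browser pages open in parallel
  maxRequestsPerMinute: 60,   // spread requests out to respect rate limits
  requestHandlerTimeoutSecs: 60,
  async requestHandler({ request, page }) {
    // extraction logic goes here
  },
});

await crawler.run(['https://yesintelligent.com']);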

2. Optimize Data Extraction

  • The crawler automatically extracts content from common selectors
  • Images are deduplicated and converted to absolute URLs
  • Metadata is extracted from meta tags

3. Handle Anti-Bot Measures

  • Enable proxy rotation for large crawls
  • Use residential proxies for sensitive sites
  • Monitor for rate limiting responses

Deployment

Apify Platform

apify login
apify push

Docker Deployment

docker build -t website-content-crawler .
docker run -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' website-content-crawler
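
When the container runs outside the Apify platform, Apify Proxy still needs credentials. These are conventionally supplied through the standard Apify environment variables APIFY_TOKEN and APIFY_PROXY_PASSWORD (values below are placeholders):

docker run \
  -e APIFY_TOKEN='<your-api-token>' \
  -e APIFY_PROXY_PASSWORD='<your-proxy-password>' \
  -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' \
  website-content-crawler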

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

This project is licensed under the ISC License.

Support

For issues and questions:


Built with ❤️ using Apify, Playwright, and Camoufox