
# Website Content Crawler
A powerful website content crawler that extracts, analyzes, and indexes web pages automatically, streamlining data collection with fast, accurate web scraping.
Extract website content with this advanced web crawler. It features stealth browsing via Camoufox, proxy rotation, and intelligent data extraction for content analysis.
## Key Features
- 🌐 Universal Crawling: Crawl any website with configurable data extraction
- 🔒 Stealth Browsing: Camoufox integration for avoiding detection
- 🚀 Advanced Proxy Management: Request-level proxy rotation with automatic failover
- 🔗 Intelligent Link Following: Depth-limited crawling with domain restrictions
- 📈 Rich Metadata: Extract titles, descriptions, images, and meta tags
## Quick Start

### 1. Run on Apify Platform

```bash
apify login
apify push
```
Then run your Actor in Apify Console with these settings (the equivalent JSON input is shown below):
- Start URLs: `https://yesintelligent.com`
- Max Pages: `100`
- Use Apify Proxy: `true`
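A minimal sketch of that equivalent input, using only the parameters documented in the configuration tables below:

```json
{
  "startUrls": [{ "url": "https://yesintelligent.com" }],
  "maxPages": 100,
  "proxyConfig": { "useApifyProxy": true }
}
```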
### 2. Local Development

```bash
cd website-content-crawler
npm install
apify run
```
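When running locally, `apify run` reads the Actor input from the local key-value store. A minimal sketch, assuming the Apify CLI's default storage layout:

```bash
# Write a small test input to the default local key-value store
# (storage/key_value_stores/default/INPUT.json is the Apify CLI's default input location)
mkdir -p storage/key_value_stores/default
echo '{"startUrls":[{"url":"https://yesintelligent.com"}],"maxPages":10}' \
  > storage/key_value_stores/default/INPUT.json
apify run
```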
## Configuration Options

### Basic Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | `["https://yesintelligent.com"]` | URLs to start crawling from |
| `maxPages` | integer | `100` | Maximum pages to crawl (1-1000) |
| `crawlDepth` | integer | `2` | Maximum crawl depth (`0` = current page only) |
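For instance, to scrape only the start pages without following any links, set the depth to zero (a sketch using only the parameters above):

```json
{
  "startUrls": [{ "url": "https://yesintelligent.com" }],
  "maxPages": 1,
  "crawlDepth": 0
}
```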
### Proxy Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `proxyConfig.useApifyProxy` | boolean | `true` | Enable Apify Proxy for anonymous browsing |
| `proxyConfig.apifyProxyGroups` | array | `["RESIDENTIAL"]` | Proxy groups: `RESIDENTIAL`, `DATACENTER`, or custom |
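To switch from residential to datacenter proxies, for example, override the proxy group (a sketch based on the parameters above):

```json
{
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["DATACENTER"]
  }
}
```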
## Usage Examples

### Example 1: Basic Website Crawling

```json
{
  "startUrls": [{ "url": "https://yesintelligent.com" }],
  "maxPages": 50,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
### Example 2: Deep Content Crawling

```json
{
  "startUrls": [{ "url": "https://example-blog.com" }],
  "maxPages": 100,
  "crawlDepth": 3,
  "followExternal": false,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
## Output Data Structure

Each crawled page produces the following data:

```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "content": "Main page content text",
  "meta": {
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "publishedTime": "2024-01-01T00:00:00Z",
    "wordCount": 1000,
    "charCount": 5000,
    "readingTime": 5
  },
  "images": [
    {
      "src": "/image1.jpg",
      "alt": "Image description",
      "url": "https://example.com/image1.jpg"
    }
  ],
  "scrapedAt": "2024-01-01T12:00:00Z",
  "statusCode": 200,
  "depth": 1
}
```
## Best Practices

### 1. Respect Rate Limits
- Use appropriate delays between requests
- Start with conservative concurrency settings
- Monitor server response times
### 2. Optimize Data Extraction
- The crawler automatically extracts content from common selectors
- Images are deduplicated and converted to absolute URLs
- Metadata is extracted from meta tags
### 3. Handle Anti-Bot Measures
- Enable proxy rotation for large crawls
- Use residential proxies for sensitive sites
- Monitor for rate-limiting responses (see the sample configuration below)
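Putting these recommendations together, a conservative starting configuration for a well-protected site might look like this sketch, which uses only the documented parameters:

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxPages": 25,
  "crawlDepth": 1,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```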
## Deployment

### Apify Platform

```bash
apify login
apify push
```

### Docker Deployment

```bash
docker build -t website-content-crawler .
docker run -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' website-content-crawler
```
## Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
## License
This project is licensed under the ISC License.
## Support
For issues and questions:
- Check the Apify Documentation
- Visit the Apify Community Forum
- Create an issue in the project repository
Built with ❤️ using Apify, Playwright, and Camoufox