Website Content Crawler
A powerful website content crawler that extracts, analyzes, and indexes web pages automatically. Streamline data collection with fast, accurate web scraping.
Extract website content with this advanced web crawler. Features stealth browsing with Camoufox, proxy rotation, and intelligent data extraction for content analysis.
Key Features
- Universal Crawling: Crawl any website with configurable data extraction
- Stealth Browsing: Camoufox integration for avoiding detection
- Advanced Proxy Management: Request-level proxy rotation with automatic failover
- Intelligent Link Following: Depth-limited crawling with domain restrictions
- Rich Metadata: Extract titles, descriptions, images, and meta tags
Quick Start
1. Run on Apify Platform
```bash
apify login
apify push
```
Then run your actor in Apify Console with these settings:
- Start URLs: `https://yesintelligent.com`
- Max Pages: `100`
- Use Apify Proxy: `true`
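You can also trigger runs programmatically. A minimal sketch using the `apify-client` npm package; the actor ID `your-username/website-content-crawler` is a placeholder, and `APIFY_TOKEN` is assumed to be set in your environment:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start a run and wait for it to finish (actor ID is a placeholder).
const run = await client.actor('your-username/website-content-crawler').call({
    startUrls: [{ url: 'https://yesintelligent.com' }],
    maxPages: 100,
    proxyConfig: { useApifyProxy: true },
});

// Fetch the crawled pages from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Crawled ${items.length} pages`);
```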
2. Local Development
```bash
cd website-content-crawler
npm install
apify run
```
Configuration Options
Basic Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | `["https://yesintelligent.com"]` | URLs to start crawling from |
| `maxPages` | integer | `100` | Maximum pages to crawl (1-1000) |
| `crawlDepth` | integer | `2` | Maximum crawl depth (0 = current page only) |
Proxy Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `proxyConfig.useApifyProxy` | boolean | `true` | Enable Apify proxy for anonymous browsing |
| `proxyConfig.apifyProxyGroups` | array | `["RESIDENTIAL"]` | Proxy groups: RESIDENTIAL, DATACENTER, or custom |
Usage Examples
Example 1: Basic Website Crawling
{"startUrls": [{"url": "https://yesintelligent.com"}],"maxPages": 50,"proxyConfig": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Example 2: Deep Content Crawling
{"startUrls": [{"url": "https://example-blog.com"}],"maxPages": 100,"crawlDepth": 3,"followExternal": false,"proxyConfig": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Output Data Structure
Each crawled page produces the following data:
{"url": "https://example.com/page","title": "Page Title","description": "Page meta description","content": "Main page content text","meta": {"keywords": "keyword1, keyword2","author": "Author Name","publishedTime": "2024-01-01T00:00:00Z","wordCount": 1000,"charCount": 5000,"readingTime": 5},"images": [{"src": "/image1.jpg","alt": "Image description","url": "https://example.com/image1.jpg"}],"scrapedAt": "2024-01-01T12:00:00Z","statusCode": 200,"depth": 1}
Best Practices
1. Respect Rate Limits
- Use appropriate delays between requests
- Start with conservative concurrency settings (see the sketch after this list)
- Monitor server response times
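A minimal sketch of conservative throttling, assuming a Crawlee `PlaywrightCrawler` under the hood (this README does not confirm the internals, so treat the values as starting points):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,             // start low; raise once response times look healthy
    maxRequestsPerMinute: 60,      // roughly one request per second overall
    requestHandlerTimeoutSecs: 60, // give slow pages time before failing them
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```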
2. Optimize Data Extraction
- The crawler automatically extracts content from common selectors
- Images are deduplicated and converted to absolute URLs (illustrated below)
- Metadata is extracted from meta tags
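The extraction code itself is internal to the actor, but the image handling it describes boils down to URL resolution plus de-duplication. A hypothetical helper showing the idea:

```typescript
// Hypothetical helper: resolve image sources against the page URL
// and drop duplicates, mirroring the behavior described above.
function normalizeImages(srcs: string[], pageUrl: string): string[] {
    const seen = new Set<string>();
    const absolute: string[] = [];
    for (const src of srcs) {
        const url = new URL(src, pageUrl).href; // relative paths become absolute
        if (!seen.has(url)) {
            seen.add(url);
            absolute.push(url);
        }
    }
    return absolute;
}

// normalizeImages(['/image1.jpg', '/image1.jpg'], 'https://example.com/page')
// => ['https://example.com/image1.jpg']
```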
3. Handle Anti-Bot Measures
- Enable proxy rotation for large crawls (see the sketch below)
- Use residential proxies for sensitive sites
- Monitor for rate limiting responses
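With the Apify SDK, residential proxies with per-request rotation can be set up as sketched below; this shows the standard SDK calls, not necessarily the actor's exact internal wiring:

```typescript
import { Actor } from 'apify';

await Actor.init();

// Residential proxy group; matches the proxyConfig input shown earlier.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

// Each newUrl() call can hand out a different proxy session.
const proxyUrl = await proxyConfiguration?.newUrl();
console.log(`Next request will go through ${proxyUrl}`);

await Actor.exit();
```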
Deployment
Apify Platform
```bash
apify login
apify push
```
Docker Deployment
```bash
docker build -t website-content-crawler .
docker run -e APIFY_INPUT='{"startUrls":[{"url":"https://yesintelligent.com"}]}' website-content-crawler
```
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
This project is licensed under the ISC License.
Support
For issues and questions:
- Check the Apify Documentation
- Visit the Apify Community Forum
- Create an issue in the project repository
Built with ❤️ using Apify, Playwright, and Camoufox