Enhanced Deep Content Crawler
Pricing: $10.00 / 1,000 results
A fast, Python-powered web crawler with smart content extraction, JS support, metadata capture, and duplicate detection. Ideal for SEO, content migration, and e-commerce scraping. Reliable, scalable, and easy to customize.
A powerful, production-ready web crawler for comprehensive content extraction built with modern Python technologies. This enhanced crawler combines the efficiency of Crawlee for Python with advanced content extraction capabilities, intelligent duplicate detection, and robust error handling.
Key Features
- Dual Crawling Modes: Choose between Playwright (JavaScript support) or HTTP-only crawling for optimal performance
- Smart Content Extraction: Automatic main content detection with fallback strategies
- Comprehensive Metadata: Extracts titles, descriptions, Open Graph data, Twitter Cards, and JSON-LD structured data
- Duplicate Detection: Content-based deduplication using hashing algorithms (see the sketch after this list)
- Advanced Link Discovery: Intelligent internal link extraction with domain validation
- Real-time Statistics: Live progress tracking with performance metrics
- Robust Error Handling: Retry mechanisms with exponential backoff
- Flexible Configuration: Extensive customization options via input schema
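To make the hashing-based duplicate detection concrete, here is a minimal sketch of the general idea; the whitespace normalization, the SHA-256 choice, and the 12-character truncation are assumptions for illustration (the truncation simply mirrors the length of the `content_hash` field in the sample output further below), not the Actor's actual code.

```python
import hashlib

seen_hashes: set[str] = set()

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivial layout differences don't defeat deduplication.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

def is_duplicate(text: str) -> bool:
    """Return True if this content was already seen; otherwise remember its hash."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```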
Included features
- Apify SDK - a toolkit for building Apify Actors in Python.
- Crawlee for Python - a web scraping and browser automation library.
- Input schema - define and validate a schema for your Actor's input.
- Request queue - manage the URLs you want to scrape in a queue.
- Dataset - store and access structured data extracted from web pages.
- Beautiful Soup - a library for pulling data out of HTML and XML files.
- Playwright - modern browser automation for JavaScript-heavy sites.
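As a rough illustration of how these pieces fit together, below is a hypothetical, trimmed-down `main.py`: the Apify SDK provides the Actor lifecycle and input, Crawlee's `BeautifulSoupCrawler` drives the request queue, and each handler call pushes one item to the dataset. Import paths and option names vary between Crawlee versions, and the real crawler layers extraction, deduplication, Playwright support, and error handling on top of this skeleton.

```python
from apify import Actor
# Import path differs between Crawlee versions (older releases: crawlee.beautifulsoup_crawler).
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}  # validated against the Actor's input schema
        base_url = actor_input.get("baseUrl", "https://example.com")
        max_pages = actor_input.get("maxPages", 100)

        crawler = BeautifulSoupCrawler(max_requests_per_crawl=max_pages)

        @crawler.router.default_handler
        async def handle(context: BeautifulSoupCrawlingContext) -> None:
            title = context.soup.title.get_text(strip=True) if context.soup.title else ""
            # Dataset: each pushed item becomes one result row.
            await context.push_data({"url": context.request.url, "title": title})
            # Request queue: links discovered on the page are enqueued for crawling.
            await context.enqueue_links()

        await crawler.run([base_url])
```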
Data Extraction Capabilities
The enhanced crawler extracts comprehensive data from each page:
Basic Information
- URL: Complete page URL with timestamp
- Title: Page title from the `<title>` tag
- Description: Meta description and other metadata
- Word Count: Total words in main content
- Page Size: HTML document size in bytes
Advanced Metadata
- Open Graph Data: Complete OG tags for social sharing
- Twitter Card Data: Twitter-specific metadata
- Structured Data: JSON-LD schemas for rich snippets
- Canonical URLs: Preferred page URLs
- Author Information: When available in meta tags
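For readers who want to see how this kind of metadata is typically located in a page, here is a hedged BeautifulSoup sketch (not the Actor's exact implementation): Open Graph and Twitter Card values live in meta tags, JSON-LD in `application/ld+json` script blocks, and the canonical URL in a `link rel="canonical"` tag.

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_metadata(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Open Graph: <meta property="og:title" content="..."> -> {"title": "..."}
    og = {m["property"][3:]: m.get("content", "")
          for m in soup.find_all("meta", property=True)
          if m["property"].startswith("og:")}
    # Twitter Cards: <meta name="twitter:card" content="..."> -> {"card": "..."}
    twitter = {m["name"][8:]: m.get("content", "")
               for m in soup.find_all("meta", attrs={"name": True})
               if m["name"].startswith("twitter:")}
    # JSON-LD structured data blocks.
    structured = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            structured.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD
    canonical = soup.find("link", rel="canonical")
    return {
        "og_data": og,
        "twitter_data": twitter,
        "structured_data": structured,
        "canonical_url": canonical.get("href") if canonical else None,
    }
```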
Content Analysis
- Main Content: Intelligently extracted article/page content
- Custom Content: User-defined CSS selector extraction
- Content Hash: For duplicate detection
- Internal Links: All same-domain links with metadata
- Images: Up to 10 images with alt text and metadata
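The fallback strategy behind main-content extraction and the same-domain link discovery can be pictured roughly as follows; the selector list and the minimum-length threshold are assumptions for illustration, not the Actor's exact rules.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Candidate containers tried in order before falling back to the whole document (assumed list).
FALLBACK_SELECTORS = ["article", "main", ".post-content", ".entry-content", "body"]

def extract_main_content(soup: BeautifulSoup) -> str:
    for selector in FALLBACK_SELECTORS:
        node = soup.select_one(selector)
        if node and len(node.get_text(strip=True)) > 200:  # assumed minimum length
            return node.get_text(" ", strip=True)
    return soup.get_text(" ", strip=True)

def extract_internal_links(soup: BeautifulSoup, page_url: str) -> list[dict]:
    base_domain = urlparse(page_url).netloc
    links = []
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])
        if urlparse(absolute).netloc == base_domain:  # same-domain validation
            links.append({"url": absolute,
                          "text": a.get_text(strip=True),
                          "title": a.get("title", "")})
    return links
```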
Performance Features
- Dual Mode Operation:
  - Browser Mode (Playwright): Full JavaScript support, perfect for SPAs
  - HTTP Mode: Lightning-fast for static content
- Concurrent Processing: Asynchronous crawling with proper queue management (a conceptual sketch follows this list)
- Smart Deduplication: Avoids crawling identical content
- Progress Tracking: Real-time statistics and performance metrics
- Memory Efficient: Optimized for large-scale crawling
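Crawlee manages the queue and concurrency internally, but conceptually the asynchronous processing resembles this generic asyncio worker-pool sketch (illustrative only, not the Actor's code).

```python
import asyncio

async def crawl_worker(queue: asyncio.Queue, fetch) -> None:
    # Workers pull URLs until the queue is drained.
    while True:
        url = await queue.get()
        try:
            await fetch(url)
        finally:
            queue.task_done()

async def crawl_concurrently(start_urls: list[str], fetch, concurrency: int = 10) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for url in start_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(crawl_worker(queue, fetch)) for _ in range(concurrency)]
    await queue.join()   # wait until every queued URL has been processed
    for worker in workers:
        worker.cancel()  # shut the now-idle workers down
```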
Reliability & Error Handling
- Retry Logic: Configurable retry attempts with exponential backoff (see the sketch after this list)
- Timeout Management: Customizable request timeouts
- Error Categorization: Detailed error logging and reporting
- Graceful Degradation: Continues crawling despite individual page failures
- URL Validation: Prevents crawling invalid or dangerous URLs
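The retry behaviour can be pictured as exponential backoff wrapped around each request; the delay formula and jitter below are assumptions rather than the Actor's exact values.

```python
import asyncio
import random

async def fetch_with_retries(fetch, url: str, max_retries: int = 3, timeout: float = 30.0):
    """Try a request up to max_retries additional times, doubling the delay after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except Exception:
            if attempt == max_retries:
                raise  # give up and let the caller log the failure
            delay = (2 ** attempt) + random.uniform(0, 1)  # exponential backoff with jitter
            await asyncio.sleep(delay)
```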
Resources
- Video introduction to Python SDK
- Webinar introducing Crawlee for Python
- Apify Python SDK documentation
- Crawlee for Python documentation
- Python tutorials in Academy
- Integration with Make, GitHub, Zapier, Google Drive, and other apps
- Video guide on getting scraped data using Apify API
Configuration Options
Basic Configuration
{"baseUrl": "https://example.com","maxPages": 100,"maxDepth": 3}
Advanced Configuration
{"baseUrl": "https://example.com","maxPages": 500,"maxDepth": 4,"contentSelector": "article, .post-content, .entry-content","excludePatterns": [".*\\.pdf$","/admin/.*",".*\\?.*utm_.*"],"useBrowser": true,"maxRetries": 5,"requestTimeout": 45000}
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| baseUrl | string | required | Starting URL for crawling |
| maxPages | integer | 100 | Maximum pages to crawl (1-1000) |
| maxDepth | integer | 3 | Maximum link depth to follow (1-10) |
| contentSelector | string | "body" | CSS selector for content extraction |
| excludePatterns | array | [] | Regex patterns to exclude URLs |
| useBrowser | boolean | true | Enable JavaScript with Playwright |
| maxRetries | integer | 3 | Retry attempts for failed requests |
| requestTimeout | integer | 30000 | Request timeout in milliseconds |
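Inside the Actor, these parameters would typically be read from the validated input with the defaults listed above. The following is a minimal sketch (the field names match the table; everything else is illustrative and assumes it runs inside the Actor context):

```python
from apify import Actor

async def read_config() -> dict:
    """Read crawler settings from the Actor input; call this inside `async with Actor:`."""
    actor_input = await Actor.get_input() or {}
    return {
        "base_url": actor_input["baseUrl"],                            # required by the input schema
        "max_pages": actor_input.get("maxPages", 100),
        "max_depth": actor_input.get("maxDepth", 3),
        "content_selector": actor_input.get("contentSelector", "body"),
        "exclude_patterns": actor_input.get("excludePatterns", []),
        "use_browser": actor_input.get("useBrowser", True),
        "max_retries": actor_input.get("maxRetries", 3),
        "request_timeout": actor_input.get("requestTimeout", 30000),   # milliseconds
    }
```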
Usage Examples
E-commerce Site Crawling
{"baseUrl": "https://shop.example.com","maxPages": 200,"contentSelector": ".product-description, .product-specs","excludePatterns": ["/cart.*","/checkout.*","/account.*"]}
News Site Crawling
{"baseUrl": "https://news.example.com","maxPages": 1000,"maxDepth": 2,"contentSelector": "article, .post-content","useBrowser": false,"excludePatterns": ["/tag/.*","/author/.*",".*\\?.*utm_.*"]}
Documentation Site Crawling
{"baseUrl": "https://docs.example.com","maxPages": 300,"contentSelector": ".markdown-body, .content","excludePatterns": [".*\\.pdf$",".*\\.zip$"]}
Output Data Structure
Each crawled page produces a comprehensive data object:
{"url": "https://example.com/page","crawled_at": "2024-01-15T10:30:00Z","content_hash": "a1b2c3d4e5f6","metadata": {"title": "Page Title","description": "Page description","keywords": "keyword1, keyword2","author": "Author Name","canonical_url": "https://example.com/canonical","og_data": {"title": "Social Title","description": "Social Description","image": "https://example.com/image.jpg"},"structured_data": [{"@type": "Article","headline": "Article Title"}]},"main_content": "Extracted main content text...","specific_content": "Content from custom selector...","images": [{"src": "https://example.com/image.jpg","alt": "Image description","title": "Image title","width": "800","height": "600"}],"internal_links": [{"url": "https://example.com/linked-page","text": "Link text","title": "Link title"}],"word_count": 1250,"page_size": 45678,"load_time": 1.23}
- A short guide on how to build web scrapers using code templates:
Getting started
For complete information see this article. In short, you will:
- Build the Actor
- Run the Actor
Pull the Actor for local development
If you would like to develop locally, you can pull the existing Actor from the Apify Console using the Apify CLI:

1. Install `apify-cli`:

   Using Homebrew:

   ```bash
   brew install apify-cli
   ```

   Using NPM:

   ```bash
   npm -g install apify-cli
   ```

2. Pull the Actor by its unique `<ActorId>`, which is one of the following:

   - the unique name of the Actor to pull (e.g. "apify/hello-world")
   - or the ID of the Actor to pull (e.g. "E2jjCZBezvAZnX8Rb")

   You can find both by clicking on the Actor title at the top of the page, which will open a modal containing both the Actor unique name and the Actor ID.

   This command will copy the Actor into the current directory on your local machine:

   ```bash
   apify pull <ActorId>
   ```
Documentation reference
To learn more about Apify and Actors, take a look at the resources listed above, such as the Apify Python SDK documentation and the Crawlee for Python documentation.