Enhanced Deep Content Crawler

Developed by Gideon Nesh
Maintained by Community

A fast, Python-powered web crawler with smart content extraction, JS support, metadata capture, and duplicate detection. Ideal for SEO, content migration, and e-commerce scraping. Reliable, scalable, and easy to customize.

Rating: 0.0 (0)
Pricing: $10.00 / 1,000 results
Total users: 3
Monthly users: 3
Runs succeeded: >99%
Last modified: 8 days ago

Enhanced Deep Content Crawler

A powerful, production-ready web crawler for comprehensive content extraction built with modern Python technologies. This enhanced crawler combines the efficiency of Crawlee for Python with advanced content extraction capabilities, intelligent duplicate detection, and robust error handling.

πŸš€ Key Features

  • Dual Crawling Modes: Choose between Playwright (JavaScript support) and HTTP-only crawling for optimal performance
  • Smart Content Extraction: Automatic main content detection with fallback strategies
  • Comprehensive Metadata: Extracts titles, descriptions, Open Graph data, Twitter Cards, and JSON-LD structured data
  • Duplicate Detection: Content-based deduplication using hashing algorithms (see the sketch after this list)
  • Advanced Link Discovery: Intelligent internal link extraction with domain validation
  • Real-time Statistics: Live progress tracking with performance metrics
  • Robust Error Handling: Retry mechanisms with exponential backoff
  • Flexible Configuration: Extensive customization options via input schema
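
The duplicate detection listed above works on a hash of the extracted text rather than the raw HTML, so whitespace or boilerplate differences do not produce "new" pages. A minimal sketch of that idea; the function names and the shortened digest length are illustrative, not taken from this Actor's source:

import hashlib
import re

def content_hash(text: str) -> str:
    """Hash the normalized page text so trivial whitespace or casing
    differences do not defeat duplicate detection."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

seen_hashes: set[str] = set()

def is_duplicate(text: str) -> bool:
    """Return True if an equivalent page body has already been stored."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False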

Included features

  • Apify SDK - a toolkit for building Apify Actors in Python.
  • Crawlee for Python - a web scraping and browser automation library.
  • Input schema - define and validate a schema for your Actor's input.
  • Request queue - manage the URLs you want to scrape in a queue.
  • Dataset - store and access structured data extracted from web pages.
  • Beautiful Soup - a library for pulling data out of HTML and XML files.
  • Playwright - modern browser automation for JavaScript-heavy sites.
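
A rough sketch of how these components typically fit together in an Apify Python Actor. This is a generic, template-style example rather than this Actor's actual source, and the crawler import path differs between Crawlee for Python releases (older versions expose crawlee.beautifulsoup_crawler, newer ones crawlee.crawlers):

import asyncio

from apify import Actor
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        base_url = actor_input.get("baseUrl", "https://example.com")

        # Cap the crawl with the maxPages input, mirroring the schema below.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=actor_input.get("maxPages", 100))

        @crawler.router.default_handler
        async def handler(context: BeautifulSoupCrawlingContext) -> None:
            # Beautiful Soup parses the response; the extracted record goes to the Dataset.
            title = context.soup.title.get_text(strip=True) if context.soup.title else ""
            await context.push_data({"url": context.request.url, "title": title})
            # Discovered links are added to the request queue.
            await context.enqueue_links()

        await crawler.run([base_url])

if __name__ == "__main__":
    asyncio.run(main())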

πŸ“Š Data Extraction Capabilities

The enhanced crawler extracts comprehensive data from each page:

Basic Information

  • URL: Complete page URL with timestamp
  • Title: Page title from <title> tag
  • Description: Meta description and other metadata
  • Word Count: Total words in main content
  • Page Size: HTML document size in bytes

Advanced Metadata

  • Open Graph Data: Complete OG tags for social sharing
  • Twitter Card Data: Twitter-specific metadata
  • Structured Data: JSON-LD schemas for rich snippets
  • Canonical URLs: Preferred page URLs
  • Author Information: When available in meta tags

Content Analysis

  • Main Content: Intelligently extracted article/page content
  • Custom Content: User-defined CSS selector extraction
  • Content Hash: For duplicate detection
  • Internal Links: All same-domain links with metadata
  • Images: Up to 10 images with alt text and metadata
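
The metadata fields above are the kind of thing Beautiful Soup extracts in a few lines. The sketch below is not this Actor's exact code, but it shows the standard patterns for the title, meta description, Open Graph tags, and JSON-LD blocks:

import json
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull title, meta description, Open Graph tags and JSON-LD blocks from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": None,
        "og_data": {},
        "structured_data": [],
    }

    desc = soup.find("meta", attrs={"name": "description"})
    if desc and desc.get("content"):
        meta["description"] = desc["content"]

    # Open Graph tags use the `property` attribute, e.g. og:title, og:image.
    for tag in soup.find_all("meta", attrs={"property": True}):
        prop = tag["property"]
        if prop.startswith("og:") and tag.get("content"):
            meta["og_data"][prop[3:]] = tag["content"]

    # JSON-LD structured data lives in <script type="application/ld+json"> blocks.
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            meta["structured_data"].append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than failing the whole page

    return meta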

⚑ Performance Features

  • Dual Mode Operation:
    • Browser Mode (Playwright): Full JavaScript support, perfect for SPAs
    • HTTP Mode: Lightning-fast for static content
  • Concurrent Processing: Asynchronous crawling with proper queue management
  • Smart Deduplication: Avoids crawling identical content
  • Progress Tracking: Real-time statistics and performance metrics
  • Memory Efficient: Optimized for large-scale crawling
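
Under the hood, the two modes come down to how the HTML is obtained before extraction. A simplified sketch, assuming httpx for the HTTP-only path and Playwright for browser mode; the real implementation goes through Crawlee's crawler classes rather than a hand-rolled fetcher:

import httpx
from playwright.async_api import async_playwright

async def fetch_html(url: str, use_browser: bool) -> str:
    """Fetch a page either with a real browser (JavaScript rendered) or plain HTTP."""
    if use_browser:
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            html = await page.content()
            await browser.close()
            return html
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text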

πŸ›‘οΈ Reliability & Error Handling

  • Retry Logic: Configurable retry attempts with exponential backoff (sketched below)
  • Timeout Management: Customizable request timeouts
  • Error Categorization: Detailed error logging and reporting
  • Graceful Degradation: Continues crawling despite individual page failures
  • URL Validation: Prevents crawling invalid or dangerous URLs
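
A compact illustration of retries with exponential backoff and a per-request timeout, which is roughly the behavior the maxRetries and requestTimeout options control. The helper name and the jitter are illustrative:

import asyncio
import random

async def fetch_with_retries(fetch, url: str, max_retries: int = 3, timeout: float = 30.0):
    """Retry a coroutine-based fetch with exponential backoff and a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except Exception:
            if attempt == max_retries:
                raise  # give up after the final attempt; the caller logs and moves on
            delay = 2 ** attempt + random.uniform(0, 1)  # 1s, 2s, 4s, ... plus jitter
            await asyncio.sleep(delay)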


πŸ› οΈ Configuration Options

Basic Configuration

{
  "baseUrl": "https://example.com",
  "maxPages": 100,
  "maxDepth": 3
}

Advanced Configuration

{
  "baseUrl": "https://example.com",
  "maxPages": 500,
  "maxDepth": 4,
  "contentSelector": "article, .post-content, .entry-content",
  "excludePatterns": [
    ".*\\.pdf$",
    "/admin/.*",
    ".*\\?.*utm_.*"
  ],
  "useBrowser": true,
  "maxRetries": 5,
  "requestTimeout": 45000
}

Configuration Parameters

Parameter        Type     Default   Description
baseUrl          string   required  Starting URL for crawling
maxPages         integer  100       Maximum pages to crawl (1-1000)
maxDepth         integer  3         Maximum link depth to follow (1-10)
contentSelector  string   "body"    CSS selector for content extraction
excludePatterns  array    []        Regex patterns to exclude URLs
useBrowser       boolean  true      Enable JavaScript with Playwright
maxRetries       integer  3         Retry attempts for failed requests
requestTimeout   integer  30000     Request timeout in milliseconds
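
One way the excludePatterns and the same-domain rule can be combined when deciding whether to enqueue a discovered link. Whether the Actor matches with re.search or a full match is an assumption in this sketch:

import re
from urllib.parse import urlparse

base_domain = urlparse("https://example.com").netloc
exclude_patterns = [re.compile(p) for p in [r".*\.pdf$", r"/admin/.*", r".*\?.*utm_.*"]]

def should_crawl(url: str) -> bool:
    """Enqueue a link only if it stays on the start domain and hits no exclude pattern."""
    if urlparse(url).netloc != base_domain:
        return False
    return not any(pattern.search(url) for pattern in exclude_patterns)

# Example: should_crawl("https://example.com/admin/login") -> False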

πŸš€ Usage Examples

E-commerce Site Crawling

{
  "baseUrl": "https://shop.example.com",
  "maxPages": 200,
  "contentSelector": ".product-description, .product-specs",
  "excludePatterns": [
    "/cart.*",
    "/checkout.*",
    "/account.*"
  ]
}

News Site Crawling

{
  "baseUrl": "https://news.example.com",
  "maxPages": 1000,
  "maxDepth": 2,
  "contentSelector": "article, .post-content",
  "useBrowser": false,
  "excludePatterns": [
    "/tag/.*",
    "/author/.*",
    ".*\\?.*utm_.*"
  ]
}

Documentation Site Crawling

{
  "baseUrl": "https://docs.example.com",
  "maxPages": 300,
  "contentSelector": ".markdown-body, .content",
  "excludePatterns": [
    ".*\\.pdf$",
    ".*\\.zip$"
  ]
}
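
Inputs like the ones above can also be passed programmatically. A sketch using the Apify API client for Python; the API token and the Actor reference are placeholders, and you would substitute this Actor's real ID or its username/actor-name handle from the Apify Console:

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token

run_input = {
    "baseUrl": "https://docs.example.com",
    "maxPages": 300,
    "contentSelector": ".markdown-body, .content",
    "excludePatterns": [".*\\.pdf$", ".*\\.zip$"],
}

# Start the Actor run and wait for it to finish.
run = client.actor("<ActorId>").call(run_input=run_input)

# Stream the results from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item.get("word_count"))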

πŸ“Š Output Data Structure

Each crawled page produces a comprehensive data object:

{
  "url": "https://example.com/page",
  "crawled_at": "2024-01-15T10:30:00Z",
  "content_hash": "a1b2c3d4e5f6",
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "canonical_url": "https://example.com/canonical",
    "og_data": {
      "title": "Social Title",
      "description": "Social Description",
      "image": "https://example.com/image.jpg"
    },
    "structured_data": [
      {
        "@type": "Article",
        "headline": "Article Title"
      }
    ]
  },
  "main_content": "Extracted main content text...",
  "specific_content": "Content from custom selector...",
  "images": [
    {
      "src": "https://example.com/image.jpg",
      "alt": "Image description",
      "title": "Image title",
      "width": "800",
      "height": "600"
    }
  ],
  "internal_links": [
    {
      "url": "https://example.com/linked-page",
      "text": "Link text",
      "title": "Link title"
    }
  ],
  "word_count": 1250,
  "page_size": 45678,
  "load_time": 1.23
}
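
After a run, the dataset can be exported (for example as JSON) and post-processed using the fields shown above. A small sketch that drops duplicate hashes and thin pages; the items.json filename and the 300-word threshold are arbitrary choices for the example:

import json

# items.json is assumed to be a JSON export of the run's dataset.
with open("items.json", encoding="utf-8") as f:
    items = json.load(f)

seen = set()
unique_long_pages = []
for item in items:
    if item["content_hash"] in seen or item["word_count"] < 300:
        continue  # skip duplicates and thin pages
    seen.add(item["content_hash"])
    unique_long_pages.append(item)

print(f"{len(unique_long_pages)} substantial, unique pages out of {len(items)} crawled")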

Getting started

For complete information see this article. In short, you will:

  1. Build the Actor
  2. Run the Actor

Pull the Actor for local development

If you would like to develop locally, you can pull the existing Actor from the Apify Console using the Apify CLI:

  1. Install apify-cli

    Using Homebrew

    $ brew install apify-cli

    Using npm

    $ npm -g install apify-cli
  2. Pull the Actor by its unique <ActorId>, which is one of the following:

    • unique name of the Actor to pull (e.g. "apify/hello-world")
    • or ID of the Actor to pull (e.g. "E2jjCZBezvAZnX8Rb")

    You can find both by clicking on the Actor title at the top of the page, which opens a modal showing the Actor's unique name and its ID.

    This command will copy the Actor into the current directory on your local machine.

    $ apify pull <ActorId>

Documentation reference

To learn more about Apify and Actors, take a look at the following resources: