Enhanced Deep Content Crawler

Developed by Gideon Nesh
Maintained by Community

A fast, Python-powered web crawler with smart content extraction, JS support, metadata capture, and duplicate detection. Ideal for SEO, content migration, and e-commerce scraping. Reliable, scalable, and easy to customize.

Rating: 0.0 (0)
Pricing: $10.00 / 1,000 results
Total users: 3
Monthly users: 3
Runs succeeded: >99%
Last modified: 8 days ago

Enhanced Deep Content Crawler

A powerful, production-ready web crawler for comprehensive content extraction built with modern Python technologies. This enhanced crawler combines the efficiency of Crawlee for Python with advanced content extraction capabilities, intelligent duplicate detection, and robust error handling.

πŸš€ Key Features

  • Dual Crawling Modes: Choose between Playwright (JavaScript support) and HTTP-only crawling for optimal performance
  • Smart Content Extraction: Automatic main content detection with fallback strategies
  • Comprehensive Metadata: Extracts titles, descriptions, Open Graph data, Twitter Cards, and JSON-LD structured data
  • Duplicate Detection: Content-based deduplication using hashing algorithms (see the sketch after this list)
  • Advanced Link Discovery: Intelligent internal link extraction with domain validation
  • Real-time Statistics: Live progress tracking with performance metrics
  • Robust Error Handling: Retry mechanisms with exponential backoff
  • Flexible Configuration: Extensive customization options via input schema
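
The duplicate detection listed above works on a hash of the extracted text rather than the raw HTML, so whitespace or boilerplate differences do not produce "new" pages. A minimal sketch of that idea; the function names and the shortened digest length are illustrative, not taken from this Actor's source:

import hashlib
import re

def content_hash(text: str) -> str:
    """Hash the normalized page text so trivial whitespace or casing
    differences do not defeat duplicate detection."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

seen_hashes: set[str] = set()

def is_duplicate(text: str) -> bool:
    """Return True if an equivalent page body has already been stored."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False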

Included features

  • Apify SDK - a toolkit for building Apify Actors in Python.
  • Crawlee for Python - a web scraping and browser automation library.
  • Input schema - define and validate a schema for your Actor's input.
  • Request queue - manage the URLs you want to scrape in a queue.
  • Dataset - store and access structured data extracted from web pages.
  • Beautiful Soup - a library for pulling data out of HTML and XML files.
  • Playwright - modern browser automation for JavaScript-heavy sites.
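
A rough sketch of how these components typically fit together in an Apify Python Actor. This is a generic, template-style example rather than this Actor's actual source, and the crawler import path differs between Crawlee for Python releases (older versions expose crawlee.beautifulsoup_crawler, newer ones crawlee.crawlers):

import asyncio

from apify import Actor
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        base_url = actor_input.get("baseUrl", "https://example.com")

        # Cap the crawl with the maxPages input, mirroring the schema below.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=actor_input.get("maxPages", 100))

        @crawler.router.default_handler
        async def handler(context: BeautifulSoupCrawlingContext) -> None:
            # Beautiful Soup parses the response; the extracted record goes to the Dataset.
            title = context.soup.title.get_text(strip=True) if context.soup.title else ""
            await context.push_data({"url": context.request.url, "title": title})
            # Discovered links are added to the request queue.
            await context.enqueue_links()

        await crawler.run([base_url])

if __name__ == "__main__":
    asyncio.run(main())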

πŸ“Š Data Extraction Capabilities

The enhanced crawler extracts comprehensive data from each page:

Basic Information

  • URL: Complete page URL with timestamp
  • Title: Page title from <title> tag
  • Description: Meta description and other metadata
  • Word Count: Total words in main content
  • Page Size: HTML document size in bytes

Advanced Metadata

  • Open Graph Data: Complete OG tags for social sharing
  • Twitter Card Data: Twitter-specific metadata
  • Structured Data: JSON-LD schemas for rich snippets
  • Canonical URLs: Preferred page URLs
  • Author Information: When available in meta tags

Content Analysis

  • Main Content: Intelligently extracted article/page content
  • Custom Content: User-defined CSS selector extraction
  • Content Hash: For duplicate detection
  • Internal Links: All same-domain links with metadata
  • Images: Up to 10 images with alt text and metadata
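
The metadata fields above are the kind of thing Beautiful Soup extracts in a few lines. The sketch below is not this Actor's exact code, but it shows the standard patterns for the title, meta description, Open Graph tags, and JSON-LD blocks:

import json
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull title, meta description, Open Graph tags and JSON-LD blocks from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": None,
        "og_data": {},
        "structured_data": [],
    }

    desc = soup.find("meta", attrs={"name": "description"})
    if desc and desc.get("content"):
        meta["description"] = desc["content"]

    # Open Graph tags use the `property` attribute, e.g. og:title, og:image.
    for tag in soup.find_all("meta", attrs={"property": True}):
        prop = tag["property"]
        if prop.startswith("og:") and tag.get("content"):
            meta["og_data"][prop[3:]] = tag["content"]

    # JSON-LD structured data lives in <script type="application/ld+json"> blocks.
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            meta["structured_data"].append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            pass  # skip malformed blocks rather than failing the whole page

    return meta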

⚑ Performance Features

  • Dual Mode Operation:
    • Browser Mode (Playwright): Full JavaScript support, perfect for SPAs
    • HTTP Mode: Lightning-fast for static content
  • Concurrent Processing: Asynchronous crawling with proper queue management
  • Smart Deduplication: Avoids crawling identical content
  • Progress Tracking: Real-time statistics and performance metrics
  • Memory Efficient: Optimized for large-scale crawling
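
Under the hood, the two modes come down to how the HTML is obtained before extraction. A simplified sketch, assuming httpx for the HTTP-only path and Playwright for browser mode; the real implementation goes through Crawlee's crawler classes rather than a hand-rolled fetcher:

import httpx
from playwright.async_api import async_playwright

async def fetch_html(url: str, use_browser: bool) -> str:
    """Fetch a page either with a real browser (JavaScript rendered) or plain HTTP."""
    if use_browser:
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            html = await page.content()
            await browser.close()
            return html
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text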

πŸ›‘οΈ Reliability & Error Handling

  • Retry Logic: Configurable retry attempts with exponential backoff (sketched below)
  • Timeout Management: Customizable request timeouts
  • Error Categorization: Detailed error logging and reporting
  • Graceful Degradation: Continues crawling despite individual page failures
  • URL Validation: Prevents crawling invalid or dangerous URLs
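
A compact illustration of retries with exponential backoff and a per-request timeout, which is roughly the behavior the maxRetries and requestTimeout options control. The helper name and the jitter are illustrative:

import asyncio
import random

async def fetch_with_retries(fetch, url: str, max_retries: int = 3, timeout: float = 30.0):
    """Retry a coroutine-based fetch with exponential backoff and a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        except Exception:
            if attempt == max_retries:
                raise  # give up after the final attempt; the caller logs and moves on
            delay = 2 ** attempt + random.uniform(0, 1)  # 1s, 2s, 4s, ... plus jitter
            await asyncio.sleep(delay)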


πŸ› οΈ Configuration Options

Basic Configuration

{
  "baseUrl": "https://example.com",
  "maxPages": 100,
  "maxDepth": 3
}

Advanced Configuration

{
  "baseUrl": "https://example.com",
  "maxPages": 500,
  "maxDepth": 4,
  "contentSelector": "article, .post-content, .entry-content",
  "excludePatterns": [
    ".*\\.pdf$",
    "/admin/.*",
    ".*\\?.*utm_.*"
  ],
  "useBrowser": true,
  "maxRetries": 5,
  "requestTimeout": 45000
}

Configuration Parameters

Parameter        Type     Default   Description
baseUrl          string   required  Starting URL for crawling
maxPages         integer  100       Maximum pages to crawl (1-1000)
maxDepth         integer  3         Maximum link depth to follow (1-10)
contentSelector  string   "body"    CSS selector for content extraction
excludePatterns  array    []        Regex patterns to exclude URLs
useBrowser       boolean  true      Enable JavaScript with Playwright
maxRetries       integer  3         Retry attempts for failed requests
requestTimeout   integer  30000     Request timeout in milliseconds
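
One way the excludePatterns and the same-domain rule can be combined when deciding whether to enqueue a discovered link. Whether the Actor matches with re.search or a full match is an assumption in this sketch:

import re
from urllib.parse import urlparse

base_domain = urlparse("https://example.com").netloc
exclude_patterns = [re.compile(p) for p in [r".*\.pdf$", r"/admin/.*", r".*\?.*utm_.*"]]

def should_crawl(url: str) -> bool:
    """Enqueue a link only if it stays on the start domain and hits no exclude pattern."""
    if urlparse(url).netloc != base_domain:
        return False
    return not any(pattern.search(url) for pattern in exclude_patterns)

# Example: should_crawl("https://example.com/admin/login") -> False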

πŸš€ Usage Examples

E-commerce Site Crawling

{
  "baseUrl": "https://shop.example.com",
  "maxPages": 200,
  "contentSelector": ".product-description, .product-specs",
  "excludePatterns": [
    "/cart.*",
    "/checkout.*",
    "/account.*"
  ]
}

News Site Crawling

{
  "baseUrl": "https://news.example.com",
  "maxPages": 1000,
  "maxDepth": 2,
  "contentSelector": "article, .post-content",
  "useBrowser": false,
  "excludePatterns": [
    "/tag/.*",
    "/author/.*",
    ".*\\?.*utm_.*"
  ]
}

Documentation Site Crawling

{
  "baseUrl": "https://docs.example.com",
  "maxPages": 300,
  "contentSelector": ".markdown-body, .content",
  "excludePatterns": [
    ".*\\.pdf$",
    ".*\\.zip$"
  ]
}
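
Inputs like the ones above can also be passed programmatically. A sketch using the Apify API client for Python; the API token and the Actor reference are placeholders, and you would substitute this Actor's real ID or its username/actor-name handle from the Apify Console:

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token

run_input = {
    "baseUrl": "https://docs.example.com",
    "maxPages": 300,
    "contentSelector": ".markdown-body, .content",
    "excludePatterns": [".*\\.pdf$", ".*\\.zip$"],
}

# Start the Actor run and wait for it to finish.
run = client.actor("<ActorId>").call(run_input=run_input)

# Stream the results from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item.get("word_count"))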

πŸ“Š Output Data Structure

Each crawled page produces a comprehensive data object:

{
  "url": "https://example.com/page",
  "crawled_at": "2024-01-15T10:30:00Z",
  "content_hash": "a1b2c3d4e5f6",
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "canonical_url": "https://example.com/canonical",
    "og_data": {
      "title": "Social Title",
      "description": "Social Description",
      "image": "https://example.com/image.jpg"
    },
    "structured_data": [
      {
        "@type": "Article",
        "headline": "Article Title"
      }
    ]
  },
  "main_content": "Extracted main content text...",
  "specific_content": "Content from custom selector...",
  "images": [
    {
      "src": "https://example.com/image.jpg",
      "alt": "Image description",
      "title": "Image title",
      "width": "800",
      "height": "600"
    }
  ],
  "internal_links": [
    {
      "url": "https://example.com/linked-page",
      "text": "Link text",
      "title": "Link title"
    }
  ],
  "word_count": 1250,
  "page_size": 45678,
  "load_time": 1.23
}
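
After a run, the dataset can be exported (for example as JSON) and post-processed using the fields shown above. A small sketch that drops duplicate hashes and thin pages; the items.json filename and the 300-word threshold are arbitrary choices for the example:

import json

# items.json is assumed to be a JSON export of the run's dataset.
with open("items.json", encoding="utf-8") as f:
    items = json.load(f)

seen = set()
unique_long_pages = []
for item in items:
    if item["content_hash"] in seen or item["word_count"] < 300:
        continue  # skip duplicates and thin pages
    seen.add(item["content_hash"])
    unique_long_pages.append(item)

print(f"{len(unique_long_pages)} substantial, unique pages out of {len(items)} crawled")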

Getting started

For complete information see this article. In short, you will:

  1. Build the Actor
  2. Run the Actor

Pull the Actor for local development

If you would like to develop locally, you can pull the existing Actor from the Apify Console using the Apify CLI:

  1. Install apify-cli

    Using Homebrew

    $ brew install apify-cli

    Using npm

    $ npm -g install apify-cli
  2. Pull the Actor by its unique <ActorId>, which is one of the following:

    • unique name of the Actor to pull (e.g. "apify/hello-world")
    • or ID of the Actor to pull (e.g. "E2jjCZBezvAZnX8Rb")

    You can find both by clicking on the Actor title at the top of the page, which opens a modal showing the Actor's unique name and its ID.

    This command will copy the Actor into the current directory on your local machine.

    $ apify pull <ActorId>

Documentation reference

To learn more about Apify and Actors, take a look at the following resources: