WebLink Scraper

Pricing

from $10.00 / 1,000 results

This Actor extracts all links present on a single domain: it crawls every page and collects third-party links alongside the current domain's URL endpoints, revealing hidden links present in the domain.

Rating

0.0

(0)

Developer

Ruturaj Sharbidre

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 3
Monthly active users: 1
Last modified: 2 months ago

Deep Link Crawler Pro

A powerful Python-based web crawler designed for Apify. It uses Playwright to render JavaScript and extract links from HTML, scripts, CSS, and other sources.

Features

  • Deep Crawling: Follows links recursively up to a specified depth.
  • JavaScript Rendering: Uses Playwright to execute JS and find dynamic links.
  • Advanced Filtering: Include/Exclude by file extension, regex patterns, or domain.
  • Multiple Sources: Extracts from <a> tags, src attributes, plain text regex, and more.
  • Structured Output: Saves data in JSON, CSV, and TXT formats, organized by domain.
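As a rough illustration of the "Multiple Sources" feature, the sketch below pulls candidate links from `<a>` href attributes, `src` attributes, and plain-text URLs using regular expressions. The helper name and patterns are illustrative assumptions, not the Actor's actual code:

```python
import re

# Illustrative patterns for the three link sources mentioned above.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)
SRC_RE = re.compile(r'src=["\']([^"\']+)["\']', re.IGNORECASE)
URL_RE = re.compile(r'https?://[^\s"\'<>]+')  # bare URLs in plain text

def extract_links(html: str) -> set[str]:
    """Collect candidate links from hrefs, src attributes, and raw text."""
    links: set[str] = set()
    links.update(HREF_RE.findall(html))
    links.update(SRC_RE.findall(html))
    links.update(URL_RE.findall(html))
    return links

html = '<a href="/about">About</a><img src="logo.png">See https://example.org/page'
print(sorted(extract_links(html)))
# → ['/about', 'https://example.org/page', 'logo.png']
```

In the real Actor, Playwright first renders the page so that links injected by JavaScript are present in the HTML before extraction.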

Installation

  1. Clone the repository:

    git clone <repository_url>
    cd WebLinkScraper
  2. Install dependencies:

    pip install -r requirements.txt
    playwright install

Usage

Local Testing

You can run the crawler locally using the provided test script:

$ python tests/test_local_crawl.py

Or run the main module (requires mocking Apify input or setting environment variables):

$ python -m src.main

Apify Deployment

  1. Push to Apify: Use the Apify CLI:

    $ apify push
  2. Configuration (Input): The Actor accepts the following input:

    {
      "startUrls": ["https://example.com"],
      "maxDepth": 3,
      "maxPagesPerDomain": 1000,
      "includeExtensions": ".pdf,.doc",
      "excludeExtensions": ".png,.jpg,.css",
      "outputFormat": ["CSV", "JSON"]
    }
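To show how `maxDepth` and `maxPagesPerDomain` could bound the crawl, here is a minimal breadth-first traversal sketch. The real Actor fetches pages with Playwright; here a dictionary stands in for the web so the traversal logic is easy to follow, and all function names are assumptions for illustration:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_depth=3, max_pages_per_domain=1000):
    """BFS crawl bounded by depth and a per-domain page budget."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages_per_domain: dict[str, int] = {}
    visited = []
    while queue:
        url, depth = queue.popleft()
        domain = urlparse(url).netloc
        # Skip domains that have exhausted their page budget.
        if pages_per_domain.get(domain, 0) >= max_pages_per_domain:
            continue
        pages_per_domain[domain] = pages_per_domain.get(domain, 0) + 1
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond maxDepth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Tiny in-memory "site" for demonstration.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}
print(crawl("https://example.com/", lambda u: site.get(u, []), max_depth=1))
```

With `max_depth=1`, pages at depth 1 are visited but their links are not followed, so `/c` is never reached.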

Input Parameters

  • startUrls: List of URLs to start crawling.
  • maxDepth: Maximum recursion depth (default: 3).
  • maxPagesPerDomain: Limit pages per domain to avoid getting stuck.
  • includeExtensions: Comma-separated list of extensions to include (whitelist).
  • excludeExtensions: Comma-separated list of extensions to exclude (blacklist).
  • csvFile: Upload a CSV file containing URLs to crawl.
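One plausible way the comma-separated `includeExtensions` / `excludeExtensions` strings could be applied is sketched below. The parameter names follow the input schema above, but the parsing and matching logic is an assumption, not the Actor's actual implementation:

```python
from urllib.parse import urlparse

def parse_ext_list(raw: str) -> set[str]:
    """Turn a comma-separated string like '.pdf,.doc' into a set of extensions."""
    return {e.strip().lower() for e in raw.split(",") if e.strip()}

def keep_url(url: str, include: set[str], exclude: set[str]) -> bool:
    """Apply whitelist (include) first, then blacklist (exclude)."""
    path = urlparse(url).path.lower()
    last = path.rsplit("/", 1)[-1]
    ext = "." + last.rsplit(".", 1)[1] if "." in last else ""
    if include and ext not in include:
        return False  # whitelist is active and extension not listed
    if ext in exclude:
        return False  # blacklist match
    return True

include = parse_ext_list(".pdf,.doc")
exclude = parse_ext_list(".png,.jpg,.css")
print(keep_url("https://example.com/report.pdf", include, exclude))  # True
print(keep_url("https://example.com/logo.png", include, exclude))    # False
```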

Output

Results are saved in the results/ directory (locally) or the default Key-Value Store (Apify).

Structure:

results/
├── example.com/
│   ├── links.txt
│   ├── links.csv
│   └── links.json
└── ...
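The per-domain layout above can be sketched as follows; the file names match the tree, but the grouping and writing logic is an illustrative assumption, not the Actor's source:

```python
import csv
import json
from pathlib import Path
from urllib.parse import urlparse

def save_links(links: list[str], out_dir: str = "results") -> None:
    """Group links by domain and write TXT, CSV, and JSON per domain."""
    by_domain: dict[str, list[str]] = {}
    for url in links:
        by_domain.setdefault(urlparse(url).netloc, []).append(url)
    for domain, urls in by_domain.items():
        folder = Path(out_dir) / domain
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "links.txt").write_text("\n".join(urls))
        (folder / "links.json").write_text(json.dumps(urls, indent=2))
        with open(folder / "links.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url"])
            writer.writerows([u] for u in urls)

save_links(["https://example.com/a", "https://example.com/b.pdf"])
```

On the Apify platform, the same data would instead be pushed to the default Key-Value Store rather than written to a local directory.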

License

Author: Ruturaj Sharbidre