Site Files Exporter

Site Files Exporter deep-crawls the provided start URLs, filters out unwanted links, downloads matching documents to organized folders, and logs outcomes and metadata so you can quickly collect site files at scale.

Deep-crawl pages, find file links, and download them with session-aware browsing. The Actor shares cookies across requests, respects your limits, and produces both detailed logs and a zip of everything it saved.

What it does

  • Starts from your start_urls, opens each page in a controlled browser session, and keeps going until limits are met.
  • Filters links with your download_regex or a set of file extensions, then downloads through the same session to honor auth and cookies.
  • Deduplicates file URLs, enforces max_files, and can randomize per-download delays (see the sketch after this list).
  • Pushes every attempt to the dataset (success or failure) and zips the downloaded files into the default key-value store.
  • Writes a compact OUTPUT summary so you can check progress quickly.
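
The dedupe-and-cap behaviour is easiest to see in miniature. A rough Python sketch of the idea (illustrative only, not the Actor's internal code; the try_reserve helper is hypothetical):

    seen_urls: set[str] = set()   # unique file URLs claimed so far
    max_files = 100               # 0 would mean "unlimited"

    def try_reserve(url: str) -> bool:
        # Reserve a slot before downloading so concurrent work cannot overshoot max_files.
        if url in seen_urls:
            return False          # duplicate URL, skip
        if max_files and len(seen_urls) >= max_files:
            return False          # cap reached, stop enqueueing
        seen_urls.add(url)        # claim the slot up front
        return True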

Features

  • Session-aware browsing with cookies applied once per browser context.
  • URL filtering by regex or extensions (regex wins, as sketched after this list), configurable base ignores (login/logout/signout/javascript/mailto/tel/data/blob + wp-json + wp-admin admin-ajax/admin-post + asset/rss/xmlrpc), plus custom ignore patterns.
  • Optional download pacing via min/max delay; safe filename sanitization; per-run summary with counts and zip metadata.
  • Hard stop on max_files to prevent runaway crawls.
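
To make the filtering order concrete, here is a hedged Python sketch of how a link might be classified: ignore patterns first, then download_regex (which wins), then the extension fallback. The is_file_candidate helper is hypothetical and only mirrors the behaviour described above:

    import re
    from urllib.parse import urlparse

    def is_file_candidate(url, download_regex, file_extensions, ignore_patterns):
        if any(re.search(p, url, re.IGNORECASE) for p in ignore_patterns):
            return False                                  # matched an ignore rule
        if download_regex and re.search(download_regex, url, re.IGNORECASE):
            return True                                   # regex wins over extensions
        path = urlparse(url).path.lower()
        return any(path.endswith("." + ext) for ext in file_extensions)

    print(is_file_candidate("https://example.com/docs/report.pdf",
                            "", ["pdf", "zip"], [r"/wp-admin/"]))   # True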

Inputs

Pass these fields to the Actor (Apify Console form or apify run -p '{...}'). Defaults match the schema in .actor/input_schema.json.

  • start_urls (array, required): Objects with url strings. Non-objects or blank URLs are ignored. Start URLs that look like files are downloaded immediately; otherwise they seed the same-host crawl.
  • download_regex (string): Case-insensitive regex to mark a URL as a file candidate. Checked before extensions; when it matches, the URL is downloaded even if the extension does not look like a file. Default: empty (off).
  • file_extensions (array): Fallback file detection when download_regex does not match. Extensions are normalized (e.g., .PDF → pdf). Default: pdf, zip, doc, docx.
  • max_files (integer): Hard cap on unique file URLs. 0 means unlimited. Slots are reserved before download to prevent overshoot; once reached, the crawler stops enqueueing.
  • min_delay_ms / max_delay_ms (integers): Random per-download delay window. max is clamped to at least min. 0/0 disables delays.
  • download_timeout_ms (integer): Timeout per file download in milliseconds. Default: 60000 (60s). Minimum accepted: 1000.
  • ignore_base_patterns (array): Regexes skipped for every link. If you provide a list, it replaces the built-in defaults; if omitted, the defaults apply (login/logout/signout/javascript/mailto/tel/data/blob/wp-json/wp-admin admin-ajax/admin-post/xmlrpc/feed/rss/asset extensions).
  • ignore_patterns (array): Extra regexes appended after the base list for this run. Default: empty.
  • cookies (array): Applied once per browser context using Playwright add_cookies shape. Each cookie can include name, value, domain, path, expires (Unix seconds), httpOnly, secure, and optional sameSite.
  • headless (boolean): Whether to run Camoufox/Playwright headless. Non-boolean values default to true.
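
Combining the fields above, one way to start a run from Python is with the official apify-client package. This is a sketch: the token and Actor ID are placeholders, and the input values are examples, not recommendations:

    from apify_client import ApifyClient

    client = ApifyClient("<YOUR_APIFY_TOKEN>")

    run_input = {
        "start_urls": [{"url": "https://example.com/documents/"}],
        "download_regex": "",              # empty: fall back to file_extensions
        "file_extensions": ["pdf", "docx"],
        "max_files": 50,                   # hard cap; 0 = unlimited
        "min_delay_ms": 250,
        "max_delay_ms": 1000,
        "download_timeout_ms": 60000,
        "headless": True,
    }

    # Replace the placeholder with this Actor's ID or "username/actor-name" from the Store.
    run = client.actor("<ACTOR_ID>").call(run_input=run_input)
    print(run["defaultDatasetId"], run["defaultKeyValueStoreId"])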

Outputs

  • Dataset (see the overview view): one item per download attempt with fields fileUrl, sourcePage, ok, status, bytes, fileName, error. See .actor/dataset_schema.json for the full shape.
  • Key-value store:
    • files.zip: archive of all downloaded files, relative paths mirroring source host/path.
    • OUTPUT: summary object { downloadedCount, failedCount, uniqueDownloadedUrls, zipFileKey, zipEntryCount, zipSizeBytes } (zip keys are None if nothing was saved).
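
If you start runs through the API, the same apify-client session can read all three outputs. A sketch, assuming client and run come from the input example above and using the field names listed here:

    # Dataset: one record per download attempt.
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item["fileUrl"], item["ok"], item.get("error"))

    store = client.key_value_store(run["defaultKeyValueStoreId"])

    # OUTPUT: the compact summary object.
    summary = store.get_record("OUTPUT")["value"]
    print(summary["downloadedCount"], summary["failedCount"])

    # files.zip: archive of everything saved (the record may be missing if nothing was downloaded).
    record = store.get_record("files.zip")
    if record:
        with open("files.zip", "wb") as f:
            f.write(record["value"])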

How it works

  • Creates a browser-backed crawler.
  • Ensures cookies are set once per browser context, then extracts all links from each page, normalizes them, and filters by ignore rules + file detection.
  • Downloads files through the page context request API to share session state, sanitizes filenames, and writes them under storage/key_value_stores/default/files/<host>/<path>/ (see the sketch after this list).
  • Updates OUTPUT metadata throughout the run so you can monitor counts, then zips the files at the end and records zip metadata.
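
As a rough illustration of the session-aware download step (using plain Playwright Chromium here rather than the Actor's Camoufox setup, and skipping retries and pacing):

    import pathlib
    import re
    from playwright.sync_api import sync_playwright

    def sanitize(name: str) -> str:
        # Filesystem-safe names, similar in spirit to the Actor's sanitization.
        return re.sub(r"[^A-Za-z0-9._-]", "_", name) or "file"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        # context.add_cookies([...])        # cookies applied once per context
        page = context.new_page()
        page.goto("https://example.com/docs/")

        # page.request shares the context's cookie jar, so authenticated downloads keep working.
        resp = page.request.get("https://example.com/docs/report.pdf", timeout=60000)
        if resp.ok:
            target = pathlib.Path("downloads") / sanitize("report.pdf")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(resp.body())
        browser.close()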

Tips

  • Use download_regex for precise targeting (e.g., query parameters or folders); fall back to file_extensions for general coverage. An example pattern follows these tips.
  • Add small delays (min_delay_ms/max_delay_ms) if the target rate-limits downloads.
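
For example, a download_regex that matches PDFs under a /reports/ folder or links carrying a download_id query parameter could look like this (a hypothetical pattern; test it against your own URLs first):

    import re

    download_regex = r"/reports/.*\.pdf($|\?)|[?&]download_id="   # hypothetical pattern
    for url in [
        "https://example.com/reports/q3.pdf?v=2",          # True
        "https://example.com/download?download_id=42",      # True
        "https://example.com/blog/post.html",               # False
    ]:
        print(url, bool(re.search(download_regex, url, re.IGNORECASE)))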

Exporting cookies

  • You can export cookies from your browser with any cookie-export extension. One option: Export Cookie JSON File (Firefox/Chrome).
  • Save the exported JSON and pass it via the cookies input array (same shape as Playwright add_cookies).
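
A cookies entry in the expected shape might look like the following (values are placeholders; in the JSON input form, use true/false instead of Python's True/False):

    cookies = [
        {
            "name": "sessionid",                      # hypothetical cookie name
            "value": "<value copied from your browser>",
            "domain": "example.com",
            "path": "/",
            "expires": 1767225600,                    # Unix seconds
            "httpOnly": True,
            "secure": True,
            "sameSite": "Lax",                        # optional: "Strict", "Lax", or "None"
        }
    ]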