Site Files Exporter

Deep-crawl pages, find file links, and download them with session-aware browsing. The Actor shares cookies across requests, respects your limits, and produces both detailed logs and a zip of everything it saved.

What it does

  • Starts from your start_urls, opens each page in a controlled browser session, and keeps going until limits are met.
  • Filters links with your download_regex or a set of file extensions, then downloads through the same session to honor auth and cookies.
  • Deduplicates file URLs, enforces max_files, and can randomize per-download delays.
  • Pushes every download attempt (success or failure) to the dataset and zips the downloaded files into the default key-value store.
  • Writes a compact OUTPUT summary so you can check progress quickly.
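
For example, a minimal run only needs start_urls; every other field falls back to the defaults described under Inputs below (the URL here is illustrative):

    {
      "start_urls": [
        { "url": "https://example.com/downloads" }
      ]
    }

With just that input, the Actor crawls same-host pages and downloads files matching the default extensions (pdf, zip, doc, docx).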

Features

  • Session-aware browsing with cookies applied once per browser context.
  • URL filtering by regex or file extensions (regex takes precedence), configurable base ignore patterns (login/logout/signout, javascript/mailto/tel/data/blob links, wp-json, wp-admin admin-ajax/admin-post, asset/RSS/xmlrpc URLs), plus custom ignore patterns.
  • Optional download pacing via min/max delay; safe filename sanitization; per-run summary with counts and zip metadata.
  • Hard stop on max_files to prevent runaway crawls.

Inputs

Pass these fields to the Actor (Apify Console form or apify run -p '{...}'). Defaults match the schema in .actor/input_schema.json.

  • start_urls (array, required): Objects with url strings. Non-objects or blank URLs are ignored. Start URLs that look like files are downloaded immediately; otherwise they seed the same-host crawl.
  • download_regex (string): Case-insensitive regex to mark a URL as a file candidate. Checked before extensions; when it matches, the URL is downloaded even if the extension does not look like a file. Default: empty (off).
  • file_extensions (array): Fallback file detection when download_regex does not match. Extensions are normalized (e.g., .PDF → pdf). Default: pdf, zip, doc, docx.
  • max_files (integer): Hard cap on unique file URLs. 0 means unlimited. Slots are reserved before download to prevent overshoot; once reached, the crawler stops enqueueing.
  • min_delay_ms / max_delay_ms (integers): Random per-download delay window. max is clamped to at least min. 0/0 disables delays.
  • download_timeout_ms (integer): Timeout per file download in milliseconds. Default: 60000 (60s). Minimum accepted: 1000.
  • ignore_base_patterns (array): Regexes skipped for every link. If you provide a list, it replaces the built-in defaults; if omitted, the defaults apply (login/logout/signout/javascript/mailto/tel/data/blob/wp-json/wp-admin admin-ajax/admin-post/xmlrpc/feed/rss/asset extensions).
  • ignore_patterns (array): Extra regexes appended after the base list for this run. Default: empty.
  • cookies (array): Applied once per browser context using the Playwright add_cookies shape. Each cookie can include name, value, domain, path, expires (Unix seconds), httpOnly, secure, and optional sameSite.
  • headless (boolean): Whether to run Camoufox/Playwright headless. Non-boolean values default to true.
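
A fuller input might look like the following; all values are illustrative and every field except start_urls is optional:

    {
      "start_urls": [
        { "url": "https://example.com/resources" }
      ],
      "download_regex": "/wp-content/uploads/.*\\.pdf",
      "file_extensions": ["pdf", "docx"],
      "max_files": 100,
      "min_delay_ms": 500,
      "max_delay_ms": 2000,
      "download_timeout_ms": 60000,
      "ignore_patterns": ["\\?replytocom="],
      "headless": true
    }

You can paste this into the Apify Console input form or pass it on the command line with apify run -p '{...}'.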

Outputs

  • Dataset (overview view): one item per download attempt with fields fileUrl, sourcePage, ok, status, bytes, fileName, and error. See .actor/dataset_schema.json for the full shape.
  • Key-value store:
    • files.zip: archive of all downloaded files, relative paths mirroring source host/path.
    • OUTPUT: summary object { downloadedCount, failedCount, uniqueDownloadedUrls, zipFileKey, zipEntryCount, zipSizeBytes } (zip keys are None if nothing was saved).
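
As an illustration (all values made up), a single dataset item and the OUTPUT summary could look like this:

    {
      "fileUrl": "https://example.com/files/report.pdf",
      "sourcePage": "https://example.com/downloads",
      "ok": true,
      "status": 200,
      "bytes": 482133,
      "fileName": "report.pdf",
      "error": null
    }

    {
      "downloadedCount": 12,
      "failedCount": 1,
      "uniqueDownloadedUrls": 12,
      "zipFileKey": "files.zip",
      "zipEntryCount": 12,
      "zipSizeBytes": 5312940
    }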

How it works

  • Creates a browser-backed crawler.
  • Ensures cookies are set once per browser context, then extracts all links from each page, normalizes them, and filters by ignore rules + file detection.
  • Downloads files through the page context request API to share session state, sanitizes filenames, and writes them under storage/key_value_stores/default/files/<host>/<path>/.
  • Updates OUTPUT metadata throughout the run so you can monitor counts, then zips the files at the end and records zip metadata.
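
For example (host and file names illustrative), a run that downloaded two files from example.com would leave them under:

    storage/key_value_stores/default/files/
      example.com/
        downloads/
          report.pdf
          archive.zip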

Tips

  • Use download_regex for precise targeting (e.g., query parameters or folders); fall back to file_extensions for general coverage.
  • Add small delays (min_delay_ms/max_delay_ms) if the target rate-limits downloads.
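
For instance, a hypothetical pattern like the one below would limit downloads to PDFs under an uploads folder or to URLs carrying a download query parameter (matching is case-insensitive):

    { "download_regex": "/uploads/.*\\.pdf|[?&]download=" }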

Exporting cookies

  • You can export cookies from your browser with any cookie-export extension. One option: Export Cookie JSON File (Firefox/Chrome).
  • Save the exported JSON and pass it via the cookies input array (same shape as Playwright add_cookies).
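
A single cookie entry in that shape looks like this (all values illustrative):

    [
      {
        "name": "sessionid",
        "value": "abc123",
        "domain": ".example.com",
        "path": "/",
        "expires": 1767225600,
        "httpOnly": true,
        "secure": true,
        "sameSite": "Lax"
      }
    ]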