Site Files Exporter

Site Files Exporter deep-crawls the provided start URLs, filters out unwanted links, downloads matching documents to organized folders, and logs outcomes and metadata so you can quickly collect site files at scale.

Deep-crawl pages, find file links, and download them with session-aware browsing. The Actor shares cookies across requests, respects your limits, and produces both detailed logs and a zip of everything it saved.

What it does

  • Starts from your start_urls, opens each page in a controlled browser session, and keeps going until limits are met.
  • Filters links with your download_regex or a set of file extensions, then downloads through the same session to honor auth and cookies.
  • Deduplicates file URLs, enforces max_files, and can randomize per-download delays (see the sketch after this list).
  • Pushes every attempt to the dataset (success or failure) and zips the downloaded files into the default key-value store.
  • Writes a compact OUTPUT summary so you can check progress quickly.
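
The dedupe-and-cap behaviour is easiest to see in miniature. A rough Python sketch of the idea (illustrative only, not the Actor's internal code; the try_reserve helper is hypothetical):

    seen_urls: set[str] = set()   # unique file URLs claimed so far
    max_files = 100               # 0 would mean "unlimited"

    def try_reserve(url: str) -> bool:
        # Reserve a slot before downloading so concurrent work cannot overshoot max_files.
        if url in seen_urls:
            return False          # duplicate URL, skip
        if max_files and len(seen_urls) >= max_files:
            return False          # cap reached, stop enqueueing
        seen_urls.add(url)        # claim the slot up front
        return True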

Features

  • Session-aware browsing with cookies applied once per browser context.
  • URL filtering by regex or extensions (regex wins, as sketched after this list), configurable base ignores (login/logout/signout/javascript/mailto/tel/data/blob + wp-json + wp-admin admin-ajax/admin-post + asset/rss/xmlrpc), plus custom ignore patterns.
  • Optional download pacing via min/max delay; safe filename sanitization; per-run summary with counts and zip metadata.
  • Hard stop on max_files to prevent runaway crawls.
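
To make the filtering order concrete, here is a hedged Python sketch of how a link might be classified: ignore patterns first, then download_regex (which wins), then the extension fallback. The is_file_candidate helper is hypothetical and only mirrors the behaviour described above:

    import re
    from urllib.parse import urlparse

    def is_file_candidate(url, download_regex, file_extensions, ignore_patterns):
        if any(re.search(p, url, re.IGNORECASE) for p in ignore_patterns):
            return False                                  # matched an ignore rule
        if download_regex and re.search(download_regex, url, re.IGNORECASE):
            return True                                   # regex wins over extensions
        path = urlparse(url).path.lower()
        return any(path.endswith("." + ext) for ext in file_extensions)

    print(is_file_candidate("https://example.com/docs/report.pdf",
                            "", ["pdf", "zip"], [r"/wp-admin/"]))   # True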

Inputs

Pass these fields to the Actor (Apify Console form or apify run -p '{...}'). Defaults match the schema in .actor/input_schema.json.

  • start_urls (array, required): Objects with url strings. Non-objects or blank URLs are ignored. Start URLs that look like files are downloaded immediately; otherwise they seed the same-host crawl.
  • download_regex (string): Case-insensitive regex to mark a URL as a file candidate. Checked before extensions; when it matches, the URL is downloaded even if the extension does not look like a file. Default: empty (off).
  • file_extensions (array): Fallback file detection when download_regex does not match. Extensions are normalized (e.g., .PDF → pdf). Default: pdf, zip, doc, docx.
  • max_files (integer): Hard cap on unique file URLs. 0 means unlimited. Slots are reserved before download to prevent overshoot; once reached, the crawler stops enqueueing.
  • min_delay_ms / max_delay_ms (integers): Random per-download delay window. max is clamped to at least min. 0/0 disables delays.
  • download_timeout_ms (integer): Timeout per file download in milliseconds. Default: 60000 (60s). Minimum accepted: 1000.
  • ignore_base_patterns (array): Regexes skipped for every link. If you provide a list, it replaces the built-in defaults; if omitted, the defaults apply (login/logout/signout/javascript/mailto/tel/data/blob/wp-json/wp-admin admin-ajax/admin-post/xmlrpc/feed/rss/asset extensions).
  • ignore_patterns (array): Extra regexes appended after the base list for this run. Default: empty.
  • cookies (array): Applied once per browser context using Playwright add_cookies shape. Each cookie can include name, value, domain, path, expires (Unix seconds), httpOnly, secure, and optional sameSite.
  • headless (boolean): Whether to run Camoufox/Playwright headless. Non-boolean values default to true.
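
Combining the fields above, one way to start a run from Python is with the official apify-client package. This is a sketch: the token and Actor ID are placeholders, and the input values are examples, not recommendations:

    from apify_client import ApifyClient

    client = ApifyClient("<YOUR_APIFY_TOKEN>")

    run_input = {
        "start_urls": [{"url": "https://example.com/documents/"}],
        "download_regex": "",              # empty: fall back to file_extensions
        "file_extensions": ["pdf", "docx"],
        "max_files": 50,                   # hard cap; 0 = unlimited
        "min_delay_ms": 250,
        "max_delay_ms": 1000,
        "download_timeout_ms": 60000,
        "headless": True,
    }

    # Replace the placeholder with this Actor's ID or "username/actor-name" from the Store.
    run = client.actor("<ACTOR_ID>").call(run_input=run_input)
    print(run["defaultDatasetId"], run["defaultKeyValueStoreId"])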

Outputs

  • Dataset (see the overview view): one item per download attempt with fields fileUrl, sourcePage, ok, status, bytes, fileName, error. See .actor/dataset_schema.json for the full shape.
  • Key-value store:
    • files.zip: archive of all downloaded files, relative paths mirroring source host/path.
    • OUTPUT: summary object { downloadedCount, failedCount, uniqueDownloadedUrls, zipFileKey, zipEntryCount, zipSizeBytes } (zip keys are None if nothing was saved).
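
If you start runs through the API, the same apify-client session can read all three outputs. A sketch, assuming client and run come from the input example above and using the field names listed here:

    # Dataset: one record per download attempt.
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item["fileUrl"], item["ok"], item.get("error"))

    store = client.key_value_store(run["defaultKeyValueStoreId"])

    # OUTPUT: the compact summary object.
    summary = store.get_record("OUTPUT")["value"]
    print(summary["downloadedCount"], summary["failedCount"])

    # files.zip: archive of everything saved (the record may be missing if nothing was downloaded).
    record = store.get_record("files.zip")
    if record:
        with open("files.zip", "wb") as f:
            f.write(record["value"])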

How it works

  • Creates a browser-backed crawler.
  • Ensures cookies are set once per browser context, then extracts all links from each page, normalizes them, and filters by ignore rules + file detection.
  • Downloads files through the page context request API to share session state, sanitizes filenames, and writes them under storage/key_value_stores/default/files/<host>/<path>/ (see the sketch after this list).
  • Updates OUTPUT metadata throughout the run so you can monitor counts, then zips the files at the end and records zip metadata.
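
As a rough illustration of the session-aware download step (using plain Playwright Chromium here rather than the Actor's Camoufox setup, and skipping retries and pacing):

    import pathlib
    import re
    from playwright.sync_api import sync_playwright

    def sanitize(name: str) -> str:
        # Filesystem-safe names, similar in spirit to the Actor's sanitization.
        return re.sub(r"[^A-Za-z0-9._-]", "_", name) or "file"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        # context.add_cookies([...])        # cookies applied once per context
        page = context.new_page()
        page.goto("https://example.com/docs/")

        # page.request shares the context's cookie jar, so authenticated downloads keep working.
        resp = page.request.get("https://example.com/docs/report.pdf", timeout=60000)
        if resp.ok:
            target = pathlib.Path("downloads") / sanitize("report.pdf")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(resp.body())
        browser.close()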

Tips

  • Use download_regex for precise targeting (e.g., query parameters or folders); fall back to file_extensions for general coverage. An example pattern follows these tips.
  • Add small delays (min_delay_ms/max_delay_ms) if the target rate-limits downloads.
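
For example, a download_regex that matches PDFs under a /reports/ folder or links carrying a download_id query parameter could look like this (a hypothetical pattern; test it against your own URLs first):

    import re

    download_regex = r"/reports/.*\.pdf($|\?)|[?&]download_id="   # hypothetical pattern
    for url in [
        "https://example.com/reports/q3.pdf?v=2",          # True
        "https://example.com/download?download_id=42",      # True
        "https://example.com/blog/post.html",               # False
    ]:
        print(url, bool(re.search(download_regex, url, re.IGNORECASE)))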

Exporting cookies

  • You can export cookies from your browser with any cookie-export extension. One option: Export Cookie JSON File (Firefox/Chrome).
  • Save the exported JSON and pass it via the cookies input array (same shape as Playwright add_cookies).
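
A cookies entry in the expected shape might look like the following (values are placeholders; in the JSON input form, use true/false instead of Python's True/False):

    cookies = [
        {
            "name": "sessionid",                      # hypothetical cookie name
            "value": "<value copied from your browser>",
            "domain": "example.com",
            "path": "/",
            "expires": 1767225600,                    # Unix seconds
            "httpOnly": True,
            "secure": True,
            "sameSite": "Lax",                        # optional: "Strict", "Lax", or "None"
        }
    ]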