Site Files Exporter
Pricing: from $0.01 / 1,000 results
Developer: Youssef Benhammouda
Site Files Exporter
Deep-crawl pages, find file links, and download them with session-aware browsing. The Actor shares cookies across requests, respects your limits, and produces both detailed logs and a zip of everything it saved.
What it does
- Starts from your `start_urls`, opens each page in a controlled browser session, and keeps going until limits are met.
- Filters links with your `download_regex` or a set of file extensions, then downloads through the same session to honor auth and cookies.
- Deduplicates file URLs, enforces `max_files`, and can randomize per-download delays.
- Pushes every attempt to the dataset (success or failure) and zips the downloaded files into the default key-value store.
- Writes a compact `OUTPUT` summary so you can check progress quickly.
Features
- Session-aware browsing with cookies applied once per browser context.
- URL filtering by regex or extensions (regex wins), configurable base ignores (login/logout/signout/javascript/mailto/tel/data/blob + wp-json + wp-admin admin-ajax/admin-post + asset/rss/xmlrpc), plus custom ignore patterns.
- Optional download pacing via min/max delay; safe filename sanitization; per-run summary with counts and zip metadata.
- Hard stop on `max_files` to prevent runaway crawls.
Inputs
Pass these fields to the Actor (Apify Console form or `apify run -p '{...}'`). Defaults match the schema in `.actor/input_schema.json`.
- `start_urls` (array, required): Objects with `url` strings. Non-objects or blank URLs are ignored. Start URLs that look like files are downloaded immediately; otherwise they seed the same-host crawl.
- `download_regex` (string): Case-insensitive regex to mark a URL as a file candidate. Checked before extensions; when it matches, the URL is downloaded even if the extension does not look like a file. Default: empty (off).
- `file_extensions` (array): Fallback file detection when `download_regex` does not match. Extensions are normalized (e.g., `.PDF` → `pdf`). Default: `pdf, zip, doc, docx`.
- `max_files` (integer): Hard cap on unique file URLs. `0` means unlimited. Slots are reserved before download to prevent overshoot; once the cap is reached, the crawler stops enqueueing.
- `min_delay_ms` / `max_delay_ms` (integers): Random per-download delay window. `max` is clamped to at least `min`. `0`/`0` disables delays.
- `download_timeout_ms` (integer): Timeout per file download in milliseconds. Default: `60000` (60 s). Minimum accepted: `1000`.
- `ignore_base_patterns` (array): Regexes skipped for every link. If you provide a list, it replaces the built-in defaults; if omitted, the defaults apply (login/logout/signout/javascript/mailto/tel/data/blob/wp-json/wp-admin admin-ajax/admin-post/xmlrpc/feed/rss/asset extensions).
- `ignore_patterns` (array): Extra regexes appended after the base list for this run. Default: empty.
- `cookies` (array): Applied once per browser context using the Playwright `add_cookies` shape. Each cookie can include `name`, `value`, `domain`, `path`, `expires` (Unix seconds), `httpOnly`, `secure`, and optional `sameSite`.
- `headless` (boolean): Whether to run Camoufox/Playwright headless. Non-boolean values default to `true`.
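A minimal input sketch combining the fields above; the URL and values are illustrative, and defaults may be omitted:

```json
{
  "start_urls": [{ "url": "https://example.com/docs" }],
  "file_extensions": ["pdf", "zip", "doc", "docx"],
  "max_files": 100,
  "min_delay_ms": 250,
  "max_delay_ms": 1000,
  "download_timeout_ms": 60000,
  "headless": true
}
```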
Outputs
- Dataset (view `overview`): one item per download attempt with fields `fileUrl`, `sourcePage`, `ok`, `status`, `bytes`, `fileName`, `error`. See `.actor/dataset_schema.json` for the full shape.
- Key-value store:
  - `files.zip`: archive of all downloaded files, with relative paths mirroring the source host/path.
  - `OUTPUT`: summary object `{ downloadedCount, failedCount, uniqueDownloadedUrls, zipFileKey, zipEntryCount, zipSizeBytes }` (zip keys are `None` if nothing was saved).
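For orientation, an `OUTPUT` summary after a run might look like this (all values illustrative):

```json
{
  "downloadedCount": 42,
  "failedCount": 3,
  "uniqueDownloadedUrls": 45,
  "zipFileKey": "files.zip",
  "zipEntryCount": 42,
  "zipSizeBytes": 10485760
}
```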
How it works
- Creates a browser-backed crawler.
- Ensures cookies are set once per browser context, then extracts all links from each page, normalizes them, and filters by ignore rules + file detection.
- Downloads files through the page context request API to share session state, sanitizes filenames, and writes them under `storage/key_value_stores/default/files/<host>/<path>/`.
- Updates `OUTPUT` metadata throughout the run so you can monitor counts, then zips the files at the end and records zip metadata.
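The filtering step described above (ignore rules first, then `download_regex`, then the extension fallback) can be sketched roughly as follows. This is an illustration, not the Actor's actual internals; the function name and the trimmed ignore list are assumptions.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of the built-in base ignores described under Features.
BASE_IGNORES = [
    r"login", r"logout", r"signout", r"^javascript:", r"^mailto:", r"^tel:",
    r"^data:", r"^blob:", r"wp-json", r"admin-ajax", r"admin-post", r"xmlrpc",
]

def is_file_candidate(url, download_regex="",
                      extensions=("pdf", "zip", "doc", "docx"),
                      ignore_patterns=()):
    """Return True if the URL should be downloaded: ignores are applied
    first, then the regex (which wins), then the extension fallback."""
    patterns = list(BASE_IGNORES) + list(ignore_patterns)
    if any(re.search(p, url, re.IGNORECASE) for p in patterns):
        return False
    if download_regex and re.search(download_regex, url, re.IGNORECASE):
        return True  # regex match downloads even without a file-like extension
    path = urlparse(url).path.lower()
    return any(path.endswith("." + ext.lstrip(".").lower()) for ext in extensions)
```

Note how a `download_regex` match short-circuits the extension check, matching the precedence stated in the inputs section.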
Tips
- Use `download_regex` for precise targeting (e.g., query parameters or folders); fall back to `file_extensions` for general coverage.
- Add small delays (`min_delay_ms`/`max_delay_ms`) if the target rate-limits downloads.
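As an example of the first tip, a hypothetical `download_regex` that targets a downloads folder or a `file=` query parameter (the pattern is an assumption; adjust it to the target site's URL scheme):

```python
import re

# Hypothetical pattern: match either a /downloads/ path segment
# or a file= query parameter, case-insensitively.
pattern = re.compile(r"(/downloads/|[?&]file=)", re.IGNORECASE)

print(bool(pattern.search("https://example.com/downloads/report.pdf")))  # True
print(bool(pattern.search("https://example.com/page?file=42")))          # True
print(bool(pattern.search("https://example.com/about.html")))           # False
```

Because the Actor applies `download_regex` case-insensitively, the `re.IGNORECASE` flag here only mirrors that behavior for local testing.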
Exporting cookies
- You can export cookies from your browser with any cookie-export extension. One option: Export Cookie JSON File (Firefox/Chrome).
- Save the exported JSON and pass it via the `cookies` input array (same shape as Playwright `add_cookies`).
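Exported cookie files often use slightly different field names than Playwright (e.g., `expirationDate` instead of `expires`). A rough conversion sketch, assuming a typical extension export format; the function name is illustrative:

```python
# Keys accepted in the `cookies` input, per the inputs section above.
ALLOWED = {"name", "value", "domain", "path", "expires",
           "httpOnly", "secure", "sameSite"}

def to_playwright_cookies(exported):
    """Map a typical browser-extension cookie export to the
    Playwright add_cookies shape expected by the `cookies` input."""
    out = []
    for c in exported:
        cookie = {k: v for k, v in c.items() if k in ALLOWED}
        # Many exporters call the expiry 'expirationDate' (a float);
        # Playwright expects 'expires' as Unix seconds.
        if "expirationDate" in c:
            cookie["expires"] = int(c["expirationDate"])
        out.append(cookie)
    return out
```

Load the exported JSON file, run it through this function, and paste the result into the `cookies` input array.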


