Site Files Exporter
Site Files Exporter deep-crawls provided start URLs, filters out unwanted links, downloads matching documents to organized folders, and logs outcomes and metadata so you can quickly collect site files at scale.
Pricing: Pay per usage
Developer: Youssef Benhammouda (maintained by Community)
Last modified: 20 days ago
Site Files Exporter
Deep-crawl pages, find file links, and download them with session-aware browsing. The Actor shares cookies across requests, respects your limits, and produces both detailed logs and a zip of everything it saved.
What it does
- Starts from your `start_urls`, opens each page in a controlled browser session, and keeps going until limits are met.
- Filters links with your `download_regex` or a set of file extensions, then downloads through the same session to honor auth and cookies.
- Deduplicates file URLs, enforces `max_files`, and can randomize per-download delays.
- Pushes every attempt to the dataset (success or failure) and zips the downloaded files into the default key-value store.
- Writes a compact `OUTPUT` summary so you can check progress quickly.
Features
- Session-aware browsing with cookies applied once per browser context.
- URL filtering by regex or extensions (regex wins), configurable base ignores (login/logout/signout/javascript/mailto/tel/data/blob + wp-json + wp-admin admin-ajax/admin-post + asset/rss/xmlrpc), plus custom ignore patterns.
- Optional download pacing via min/max delay; safe filename sanitization; per-run summary with counts and zip metadata.
- Hard stop on `max_files` to prevent runaway crawls.
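As a rough illustration, the base-ignore filtering described above can be sketched like this. The pattern list below is a hypothetical re-creation of the documented defaults, not the Actor's actual source:

```python
import re

# Hypothetical re-creation of the documented base ignore list
# (login/logout/signout, pseudo-schemes, WordPress endpoints, feeds, assets).
# The Actor's real list lives in its source and may differ.
BASE_IGNORES = [
    r"/log(in|out)\b",
    r"/signout\b",
    r"^(javascript|mailto|tel|data|blob):",
    r"/wp-json/",
    r"/wp-admin/admin-(ajax|post)\.php",
    r"/xmlrpc\.php",
    r"/(feed|rss)\b",
    r"\.(css|js|png|jpg|svg|woff2?)(\?|$)",
]

def should_skip(url: str, extra_patterns=()) -> bool:
    """Return True if any base or custom ignore regex matches (case-insensitive)."""
    for pattern in list(BASE_IGNORES) + list(extra_patterns):
        if re.search(pattern, url, re.IGNORECASE):
            return True
    return False
```

Custom `ignore_patterns` behave like `extra_patterns` here: they are checked after the base list, and any single match skips the link.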
Inputs
Pass these fields to the Actor (Apify Console form or `apify run -p '{...}'`). Defaults match the schema in `.actor/input_schema.json`.
- `start_urls` (array, required): objects with `url` strings. Non-objects or blank URLs are ignored. Start URLs that look like files are downloaded immediately; otherwise they seed the same-host crawl.
- `download_regex` (string): case-insensitive regex that marks a URL as a file candidate. Checked before extensions; when it matches, the URL is downloaded even if the extension does not look like a file. Default: empty (off).
- `file_extensions` (array): fallback file detection when `download_regex` does not match. Extensions are normalized (e.g., `.PDF` → `pdf`). Default: `pdf`, `zip`, `doc`, `docx`.
- `max_files` (integer): hard cap on unique file URLs; `0` means unlimited. Slots are reserved before download to prevent overshoot; once the cap is reached, the crawler stops enqueueing.
- `min_delay_ms` / `max_delay_ms` (integers): random per-download delay window. `max` is clamped to at least `min`; `0`/`0` disables delays.
- `download_timeout_ms` (integer): timeout per file download in milliseconds. Default: `60000` (60 s). Minimum accepted: `1000`.
- `ignore_base_patterns` (array): regexes skipped for every link. If you provide a list, it replaces the built-in defaults; if omitted, the defaults apply (login/logout/signout/javascript/mailto/tel/data/blob/wp-json/wp-admin admin-ajax/admin-post/xmlrpc/feed/rss/asset extensions).
- `ignore_patterns` (array): extra regexes appended after the base list for this run. Default: empty.
- `cookies` (array): applied once per browser context using the Playwright `add_cookies` shape. Each cookie can include `name`, `value`, `domain`, `path`, `expires` (Unix seconds), `httpOnly`, `secure`, and an optional `sameSite`.
- `headless` (boolean): whether to run Camoufox/Playwright headless. Non-boolean values default to `true`.
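For illustration, a plausible input covering these fields might look like the following; the values are examples, not defaults:

```python
import json

# Illustrative run input for Site Files Exporter. Every value below is an
# example chosen for this sketch, not a schema default.
run_input = {
    "start_urls": [{"url": "https://example.com/reports/"}],
    "download_regex": r"/wp-content/uploads/.*\.pdf",
    "file_extensions": ["pdf", "zip", "doc", "docx"],
    "max_files": 200,
    "min_delay_ms": 500,
    "max_delay_ms": 1500,
    "download_timeout_ms": 60000,
    "ignore_patterns": [r"/tag/", r"\?replytocom="],
    "headless": True,
}

# Serialize for `apify run -p '<json>'`, or paste into the Console form.
payload = json.dumps(run_input)
```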
Outputs
- Dataset: one item per download attempt with the fields `fileUrl`, `sourcePage`, `ok`, `status`, `bytes`, `fileName`, `error`. See `.actor/dataset_schema.json` for the full shape.
- Key-value store:
  - `files.zip`: archive of all downloaded files, with relative paths mirroring the source host/path.
  - `OUTPUT`: summary object `{ downloadedCount, failedCount, uniqueDownloadedUrls, zipFileKey, zipEntryCount, zipSizeBytes }` (zip keys are `None` if nothing was saved).
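After a run, the dataset items can be post-processed using the documented fields. A minimal sketch, with inline sample rows standing in for items fetched from the run's dataset:

```python
# Sample rows shaped like the dataset items documented above; in a real
# workflow these would be fetched from the run's dataset instead.
items = [
    {"fileUrl": "https://example.com/a.pdf", "ok": True, "status": 200,
     "bytes": 10240, "fileName": "a.pdf"},
    {"fileUrl": "https://example.com/b.zip", "ok": False, "status": 403,
     "bytes": 0, "fileName": None, "error": "HTTP 403"},
]

# Split attempts by outcome and total the downloaded size.
downloaded = [it for it in items if it["ok"]]
failed = [it for it in items if not it["ok"]]
total_bytes = sum(it["bytes"] for it in downloaded)
```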
How it works
- Creates a browser-backed crawler.
- Ensures cookies are set once per browser context, then extracts all links from each page, normalizes them, and filters them by ignore rules and file detection.
- Downloads files through the page context's request API to share session state, sanitizes filenames, and writes them under `storage/key_value_stores/default/files/<host>/<path>/`.
- Updates `OUTPUT` metadata throughout the run so you can monitor counts, then zips the files at the end and records zip metadata.
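The host/path mirroring described above could be sketched roughly as follows; the sanitization rule here (replace anything outside a safe character set) is an assumption for illustration, and the Actor's real rules may differ:

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

def local_path(file_url: str) -> str:
    """Sketch: map a file URL to the documented storage layout
    storage/key_value_stores/default/files/<host>/<path>/."""
    parsed = urlparse(file_url)
    # Hypothetical sanitization: keep letters, digits, dot, underscore, hyphen.
    parts = [
        re.sub(r"[^A-Za-z0-9._-]", "_", p)
        for p in PurePosixPath(parsed.path).parts
        if p != "/"
    ]
    return "/".join(["storage/key_value_stores/default/files", parsed.netloc] + parts)
```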
Tips
- Use `download_regex` for precise targeting (e.g., query parameters or folders); fall back to `file_extensions` for general coverage.
- Add small delays (`min_delay_ms`/`max_delay_ms`) if the target rate-limits downloads.
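The detection order these tips rely on (regex checked first and winning, extensions only as fallback) can be sketched as:

```python
import re
from urllib.parse import urlparse

def is_file_candidate(url: str, download_regex: str = "",
                      extensions=("pdf", "zip", "doc", "docx")) -> bool:
    """Sketch of the documented order: download_regex is checked first and
    wins even if the URL has no file-like extension."""
    if download_regex and re.search(download_regex, url, re.IGNORECASE):
        return True
    # Fallback: compare the URL path's suffix against normalized extensions.
    path = urlparse(url).path.lower()
    return any(path.endswith("." + ext.lstrip(".").lower()) for ext in extensions)
```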
Exporting cookies
- You can export cookies from your browser with any cookie-export extension. One option: Export Cookie JSON File (Firefox/Chrome).
- Save the exported JSON and pass it via the `cookies` input array (same shape as Playwright `add_cookies`).
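Exported cookie JSON often uses slightly different field names than Playwright's `add_cookies` shape (for example, `expirationDate` instead of `expires`). A small conversion sketch, assuming one common export format — adjust the field names to whatever your extension actually emits:

```python
def to_playwright_cookie(exported: dict) -> dict:
    """Convert one exported cookie dict to the Playwright add_cookies shape.
    `expirationDate` is an assumed export-side field name."""
    cookie = {
        "name": exported["name"],
        "value": exported["value"],
        "domain": exported["domain"],
        "path": exported.get("path", "/"),
        "httpOnly": bool(exported.get("httpOnly", False)),
        "secure": bool(exported.get("secure", False)),
    }
    # Playwright expects `expires` as Unix seconds.
    if "expirationDate" in exported:
        cookie["expires"] = int(exported["expirationDate"])
    return cookie
```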