Website Media Link Scraper

Pricing

from $2.00 / 1,000 results

Quickly find links to videos, audio, documents, PDFs, images, and other media on websites using this fast, lightweight web crawler. No browser needed: just clean, efficient media extraction.

Rating

4.3 (5)

Developer

The Netaji (maintained by Community)

Actor stats

Bookmarked: 2
Total users: 250
Monthly active users: 6
Last modified: 3 days ago

🔍 Media Link Crawler

TL;DR: Extract videos, images, documents, and other media from any website. Automatically bypasses anti-bot protections with adaptive stealth escalation powered by Scrapling.

✅ Features

  • Extracts 17 media types: videos, audio, images, PDFs, documents, archives, eBooks, fonts, text, APKs, contacts, subtitles, 3D models, datasets, design assets, source code, disk images
  • Adaptive stealth: starts with a fast HTTP session, automatically escalates to a full stealth browser when blocked
  • Cloudflare bypass: optional solver for Turnstile and Interstitial challenges
  • XHR capture: intercepts background API calls to catch media that never appears in HTML
  • Crawls multiple pages with depth and URL count limits
  • Proxy support (Apify Proxy or custom proxy list)
  • Contact extraction from visible text (emails, phones, addresses)
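
The adaptive stealth behaviour can be pictured as a two-tier fetch loop. The sketch below is an illustration only, not the actor's actual code: `fetch_fast` and `fetch_stealth` are hypothetical callables standing in for the HTTP session and the stealth browser, and the set of status codes treated as "blocked" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: these status codes count as "blocked" for this illustration.
BLOCKED_STATUSES = {403, 429, 503}


@dataclass
class Response:
    status: int
    body: str


# Hypothetical fetcher signature: takes a URL, returns a Response.
Fetcher = Callable[[str], Response]


def fetch_with_escalation(
    url: str,
    fetch_fast: Fetcher,
    fetch_stealth: Fetcher,
    max_blocked_retries: int = 3,
) -> tuple[Response, str]:
    """Try the cheap fetcher first; switch to the stealth one when blocked.

    Returns the last response and the name of the tier that produced it.
    """
    tier = "fast"
    response = fetch_fast(url)
    attempts = 0
    # Escalate only while the response still looks blocked and retries remain.
    while response.status in BLOCKED_STATUSES and attempts < max_blocked_retries:
        tier = "stealth"
        response = fetch_stealth(url)
        attempts += 1
    return response, tier
```

The point of the design is cost: pages that answer normally never pay for a browser, and the `max_blocked_retries` parameter (named after the `maxBlockedRetries` input field) caps how long a stubborn block is retried.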

🎯 Supported Media Types

| Media Type | Formats |
| --- | --- |
| Video | mp4, webm, mkv, mov, avi, flv, m3u8, ts, 3gp… |
| Audio | mp3, wav, aac, flac, m4a, opus, wma, ogg… |
| Image | jpg, png, gif, webp, svg, avif, heic, tiff, ico… |
| PDF | pdf |
| Document | doc, docx, ppt, pptx, xls, xlsx, odt, ods, odp, rtf… |
| Archive | zip, rar, tar, gz, 7z, bz2, xz, zst… |
| eBook | epub, mobi, azw3, fb2, djvu… |
| Font | ttf, otf, woff, woff2, eot |
| Text | txt, md, json, xml, ndjson, jsonl… |
| APK | apk, xapk, apks |
| Contact | emails, phones, social profiles, addresses |
| Subtitle | srt, vtt, sub, ass, ssa, sbv… |
| 3D Model | obj, stl, gltf, glb, fbx, blend, dae, ply… |
| Dataset / DB | sql, sqlite, parquet, geojson, csv, feather… |
| Design Asset | psd, ai, sketch, xd, indd, afdesign… |
| Source Code | py, js, ts, java, cpp, go, rs, rb, php, sh… |
| Disk Image | iso, dmg, vmdk, vhd, img, qcow2… |
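
To make the extension-to-type mapping concrete, here is a minimal sketch of classifying a URL by its file extension. It is not the actor's implementation: the map covers only a few rows of the table, the type labels other than `video` are assumed, and the first-match rule is a simplification (note that `ts` is listed under both Video and Source Code, so a real classifier needs a tie-break).

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Partial map copied from the table above. Most rows in the table end with
# "…", so the actor recognizes more formats than are listed here.
# Labels other than "video" are assumed names, not confirmed output values.
EXTENSIONS = {
    "video": {"mp4", "webm", "mkv", "mov", "avi", "flv", "m3u8", "ts", "3gp"},
    "audio": {"mp3", "wav", "aac", "flac", "m4a", "opus", "wma", "ogg"},
    "image": {"jpg", "png", "gif", "webp", "svg", "avif", "heic", "tiff", "ico"},
    "pdf": {"pdf"},
    "archive": {"zip", "rar", "tar", "gz", "7z", "bz2", "xz", "zst"},
}


def classify(url: str) -> Optional[str]:
    """Return the media type for a URL based on its extension, or None."""
    # Use only the path so query strings and fragments do not hide the extension.
    path = urlparse(url).path
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    # First match wins, in dictionary order (a simplifying assumption).
    for media_type, extensions in EXTENSIONS.items():
        if ext in extensions:
            return media_type
    return None
```

For example, `classify("https://example.com/video.mp4?x=1")` looks at the path `/video.mp4`, ignores the query string, and finds `mp4` in the video set.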

⚙️ Input Configuration

{
  "startUrls": [{ "url": "https://example.com" }],
  "mediaType": "all",
  "maxCrawlDepth": 2,
  "maxUrlsToCrawl": 100,
  "concurrentRequests": 10,
  "maxRequestRetries": 3,
  "maxBlockedRetries": 3,
  "downloadDelay": 0,
  "stayOnDomain": true,
  "useStealth": true,
  "solveCloudflare": false,
  "captureXhr": false,
  "xhrPattern": ".*",
  "includeContactText": true,
  "useProxy": { "useApifyProxy": false }
}
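
One way to build this input programmatically and start a run is sketched below. It assumes the third-party `apify-client` package; `build_input` and `run_actor` are hypothetical helper names, and the actor ID is not stated in this document, so it is passed in as a parameter rather than guessed.

```python
import copy
import json
from typing import Any

# Values copied from the example input above.
DEFAULT_INPUT: dict[str, Any] = {
    "mediaType": "all",
    "maxCrawlDepth": 2,
    "maxUrlsToCrawl": 100,
    "concurrentRequests": 10,
    "maxRequestRetries": 3,
    "maxBlockedRetries": 3,
    "downloadDelay": 0,
    "stayOnDomain": True,
    "useStealth": True,
    "solveCloudflare": False,
    "captureXhr": False,
    "xhrPattern": ".*",
    "includeContactText": True,
    "useProxy": {"useApifyProxy": False},
}


def build_input(start_urls: list[str], **overrides: Any) -> dict[str, Any]:
    """Merge overrides into the example input and attach the start URLs."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    run_input = copy.deepcopy(DEFAULT_INPUT)  # avoid mutating shared defaults
    run_input.update(overrides)
    run_input["startUrls"] = [{"url": url} for url in start_urls]
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run and return its dataset items (requires network access)."""
    # Third-party dependency, imported lazily: pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Scan only the start page, with everything else left as in the example.
    print(json.dumps(build_input(["https://example.com"], maxCrawlDepth=0), indent=2))
```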

📊 Output Schema

Each result is one media item:

{
  "url": "https://example.com/video.mp4",
  "sourceUrl": "https://example.com/gallery",
  "type": "video",
  "subType": null,
  "title": "Gallery Page",
  "foundAt": "2026-03-30T12:00:00+00:00",
  "foundBy": "dom"
}

foundBy values: dom, link-scan, inline, text-scan, xhr-capture
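
Once results are downloaded, a quick tally by `type` and `foundBy` shows what a site offers and which discovery method found it. The sketch below is a minimal example; the `MediaItem` field types are inferred from the single sample item above, and `summarize` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, Optional, TypedDict


class MediaItem(TypedDict):
    # Field types inferred from the one sample result; treat them as assumptions.
    url: str
    sourceUrl: str
    type: str
    subType: Optional[str]
    title: str
    foundAt: str
    foundBy: str


def summarize(items: Iterable[MediaItem]) -> dict[str, Counter]:
    """Count results per media type and per discovery method."""
    by_type: Counter = Counter()
    by_method: Counter = Counter()
    for item in items:
        by_type[item["type"]] += 1
        by_method[item["foundBy"]] += 1
    return {"type": by_type, "foundBy": by_method}


def unique_urls(items: Iterable[MediaItem]) -> list[str]:
    """Return media URLs in first-seen order, dropping duplicates."""
    return list(dict.fromkeys(item["url"] for item in items))
```

Deduplicating by `url` is useful because the same file can be linked from several pages, each producing its own result with a different `sourceUrl`.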

💡 Pro Tips

  • Use mediaType: "all" first to discover what's available
  • Set useStealth: true (default) to handle sites with anti-bot protections
  • Enable captureXhr: true for streaming sites that serve media via API calls
  • Set solveCloudflare: true only for Cloudflare-protected sites (slower)
  • Use maxCrawlDepth: 0 to scan only the start pages (no link following)
  • Set crawlDir to enable pause/resume on long crawls

❓ FAQ

How deep should I crawl?

Start with maxCrawlDepth: 2 for most sites. Higher depths find more content but take longer and use more memory.
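
To see how the depth and URL limits interact, the sketch below walks a small in-memory link graph breadth-first. It is a model of the two limits only, not the actor's crawler: the `links` dictionary stands in for fetching pages, and both the breadth-first order and the choice to count start pages toward the URL limit are assumptions.

```python
from collections import deque


def crawl_order(
    start_urls: list[str],
    links: dict[str, list[str]],
    max_crawl_depth: int = 2,
    max_urls_to_crawl: int = 100,
) -> list[tuple[str, int]]:
    """Return (url, depth) pairs in the order a depth-limited BFS visits them.

    `links` maps a page URL to the URLs it links to (a stand-in for fetching).
    """
    starts = list(dict.fromkeys(start_urls))  # drop duplicate start URLs
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    visited: list[tuple[str, int]] = []

    while queue and len(visited) < max_urls_to_crawl:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_crawl_depth:
            continue  # scan this page, but do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With depth 0 only the start pages are visited, which matches the maxCrawlDepth: 0 tip. Each extra level can multiply the number of queued pages, which is why maxUrlsToCrawl is worth setting alongside the depth.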

When should I enable XHR capture?

Enable it for streaming sites, video platforms, or any site that loads media dynamically via JavaScript API calls.
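
The default xhrPattern of ".*" suggests the field is a regular expression applied to request URLs, although this document does not say so explicitly. Under that assumption, a candidate pattern can be tried against sample request URLs before a run. The helper name and sample URLs below are hypothetical, and whether the actor uses search or full-match semantics is unknown; this sketch uses `re.search`.

```python
import re


def matching_requests(pattern: str, request_urls: list[str]) -> list[str]:
    """Return the URLs that the pattern matches anywhere in the string."""
    compiled = re.compile(pattern)
    return [url for url in request_urls if compiled.search(url)]


# Hypothetical background requests a streaming page might make.
SAMPLE_REQUESTS = [
    "https://example.com/api/v1/playlist.m3u8",
    "https://example.com/api/v1/user",
    "https://cdn.example.com/segments/001.ts",
    "https://example.com/static/app.js",
]

# Match media extensions at the end of the path or just before a query string.
MEDIA_PATTERN = r"\.(m3u8|ts|mp4)(\?|$)"

if __name__ == "__main__":
    for url in matching_requests(MEDIA_PATTERN, SAMPLE_REQUESTS):
        print(url)
```

A narrower pattern keeps unrelated API calls out of the results, while ".*" captures every background request.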

Is Cloudflare bypass necessary?

Only for sites actively protected by Cloudflare challenges. Most sites work fine with useStealth: true alone.

⚠️ Limitations

  • Very complex JavaScript-rendered apps may require captureXhr: true
  • Cloudflare enterprise plans may still block the solver
  • Large crawls with high depth may take substantial time and memory

📮 Need Help?

Contact @thenetaji through the Apify platform for support, implementation questions, or feature requests.