Website Image & Media Crawler — Bulk Asset Extractor

Pricing

from $4.50 / 1,000 results

Crawl an entire website and extract every image, video and audio asset — with alt text, dimensions, source page and file type. Perfect for AI training datasets, image SEO audits, asset inventories and migrations. No login, no browser.

Rating: 0.0 (0)

Developer: Logiover (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 16 hours ago

Website Image & Media Crawler — Bulk Image & Asset Scraper 🖼️

Extract every image, video and audio file from a website. This image scraper / media extractor crawls an entire site and pulls out all media assets — together with alt text, dimensions, the source page and file type. Point it at one URL and it inventories the media across thousands of pages automatically. No login, no headless browser.

Need to scrape all images from a website, build an image dataset for AI, run an image SEO / alt-text audit, or inventory media before a migration? This actor delivers the full list of asset URLs and metadata.


✨ Key features

  • 🕷️ Full-site crawl — start from one URL and follow internal links across the whole domain.
  • 🖼️ Every media type: <img>, srcset, <picture>, lazy-loaded data-src, CSS background images, <video> + posters, <audio>, plus og:image, twitter:image and favicons.
  • 🔗 Absolute, de-duplicated URLs — clean asset URLs ready to download or analyze.
  • 🏷️ Rich metadata — alt text, title, width/height, loading attribute and where each asset was found.
  • Fast & cheap — pure HTTP, no browser, high concurrency.

💡 Use cases

  • AI / ML training datasets — collect large image sets with their alt-text captions for multimodal models.
  • Image SEO audits — find images missing alt text at scale and improve accessibility & rankings.
  • Asset inventories & migrations — list every media file on a site before a redesign or platform move.
  • E-commerce & competitor research — pull product imagery across a whole catalog.
  • Bulk image download lists — generate a clean URL list to fetch images in bulk.

📦 What you get

One row per media asset:

| Field | Description |
| --- | --- |
| pageUrl | The page the asset was found on |
| mediaUrl | Absolute URL of the asset |
| mediaType | image, video, audio or icon |
| foundIn | Source (img, img-srcset, picture-source, meta, css-background, video, …) |
| fileExtension | jpg, png, webp, mp4, svg, … |
| alt / title | Image alt and title text |
| width / height | Declared dimensions |
| loading | lazy / eager |
| poster | Video poster image |
| crawledAt | ISO 8601 timestamp |

Example output

{
  "pageUrl": "https://shop.example.com/product/123",
  "mediaUrl": "https://shop.example.com/img/123-main.jpg",
  "mediaType": "image",
  "foundIn": "img",
  "fileExtension": "jpg",
  "alt": "Blue running shoe, side view",
  "width": "800",
  "height": "800",
  "crawledAt": "2026-05-25T14:15:28.001Z"
}
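
In the example above, width and height are strings, because they are the dimensions declared in the HTML. The following is a minimal sketch of normalizing a row after export. It assumes the field names shown in the table, and it assumes a missing dimension is absent, empty, or non-numeric (the listing does not say how missing values are represented). The helper names `to_int` and `normalize_row` are made up for this illustration.

```python
import json
from typing import Optional


def to_int(value: object) -> Optional[int]:
    """Convert a declared dimension such as "800" to an int, else None."""
    text = str(value).strip() if value is not None else ""
    # Only plain ASCII digits are accepted; "100%" or "auto" become None.
    return int(text) if text.isascii() and text.isdigit() else None


def normalize_row(row: dict) -> dict:
    """Return a copy of the row with numeric width/height where parseable."""
    out = dict(row)
    out["width"] = to_int(row.get("width"))
    out["height"] = to_int(row.get("height"))
    return out


# The example row from this listing, as it would appear in a JSON export.
raw = """{
  "pageUrl": "https://shop.example.com/product/123",
  "mediaUrl": "https://shop.example.com/img/123-main.jpg",
  "mediaType": "image",
  "foundIn": "img",
  "fileExtension": "jpg",
  "alt": "Blue running shoe, side view",
  "width": "800",
  "height": "800",
  "crawledAt": "2026-05-25T14:15:28.001Z"
}"""

row = normalize_row(json.loads(raw))
print(row["width"] * row["height"])  # declared pixel area
```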

🚀 How to use it

  1. Click Try for free / Start.
  2. Paste one or more website URLs into Start URLs.
  3. (Optional) Set Max pages to crawl (0 for the whole site).
  4. (Optional) Toggle which media to include: images, video, audio, CSS backgrounds.
  5. Click Save & Start.
  6. Export the asset list as JSON, CSV, Excel or via API.
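
For the API route mentioned in step 6, here is a minimal sketch that builds (but does not send) a request to run an Actor and return its dataset items in one call. It assumes the public Apify API v2 endpoint `run-sync-get-dataset-items` and bearer-token authentication. The Actor ID and token are placeholders, since this listing does not state the Actor's ID, and `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    actor_id="USERNAME~ACTOR-NAME",  # placeholder: copy the real ID from the Actor page
    token="YOUR_APIFY_TOKEN",        # placeholder: your personal API token
    run_input={
        "startUrls": [{"url": "https://example.com"}],
        "maxPagesToCrawl": 50,
    },
)

# Sending the request needs network access and a valid token:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

A synchronous call like this suits small crawls. For large sites it is probably safer to start the run asynchronously and read the dataset once it finishes.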

⚙️ Input

| Option | Description | Default |
| --- | --- | --- |
| startUrls | Websites to crawl | – (required) |
| maxPagesToCrawl | Max pages per run (0 = whole site) | 1000 |
| includeImages | <img>, srcset, <picture>, og:image, favicons | true |
| includeVideo | <video> sources and posters | true |
| includeAudio | <audio> sources | true |
| includeBackgroundImages | CSS inline background images | true |
| maxConcurrency | Parallel requests | 10 |

Example input

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxPagesToCrawl": 2000,
  "includeImages": true
}
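
If you generate the input programmatically, a small builder can fill in the defaults from the table and catch obvious mistakes before a run starts. This is a sketch only: the option names and defaults come from the table above, while `build_input` and its validation rules are assumptions made for illustration, not the Actor's own input checks.

```python
from urllib.parse import urlparse


def build_input(
    start_urls: list[str],
    max_pages_to_crawl: int = 1000,
    include_images: bool = True,
    include_video: bool = True,
    include_audio: bool = True,
    include_background_images: bool = True,
    max_concurrency: int = 10,
) -> dict:
    """Build an input object using the option names and defaults listed above."""
    if not start_urls:
        raise ValueError("startUrls is required")
    for url in start_urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if max_pages_to_crawl < 0:
        raise ValueError("maxPagesToCrawl must be 0 (whole site) or positive")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPagesToCrawl": max_pages_to_crawl,
        "includeImages": include_images,
        "includeVideo": include_video,
        "includeAudio": include_audio,
        "includeBackgroundImages": include_background_images,
        "maxConcurrency": max_concurrency,
    }


# Images only, whole site.
run_input = build_input(
    ["https://example.com"],
    max_pages_to_crawl=0,
    include_video=False,
    include_audio=False,
)
```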

🔍 How it works

The crawler follows internal links within the same domain as your Start URLs. On each page it extracts media from <img> (including srcset and data-src), <picture>, inline CSS backgrounds, <video>/<audio> and their <source> children, plus og:image, twitter:image and favicons. All URLs are resolved to absolute and de-duplicated per page. Pure HTTP — fast and inexpensive.
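
To make the per-page extraction step concrete, here is a simplified sketch using only the Python standard library. It is an illustration of the idea, not the Actor's actual code: it covers only `<img>` tags (src, data-src and srcset), resolves URLs against the page URL and de-duplicates them within that page. The class name `MediaCollector` is hypothetical, and labelling data-src hits as `img` is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaCollector(HTMLParser):
    """Collect de-duplicated, absolute image URLs from one page's HTML."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.seen: set[str] = set()
        self.assets: list[dict] = []

    def _add(self, raw_url, found_in: str, alt: str) -> None:
        if not raw_url or not raw_url.strip():
            return
        absolute = urljoin(self.page_url, raw_url.strip())
        if absolute in self.seen:  # per-page de-duplication
            return
        self.seen.add(absolute)
        self.assets.append({
            "pageUrl": self.page_url,
            "mediaUrl": absolute,
            "mediaType": "image",
            "foundIn": found_in,
            "alt": alt,
        })

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        alt = attr.get("alt") or ""
        self._add(attr.get("src"), "img", alt)
        self._add(attr.get("data-src"), "img", alt)  # lazy-loaded source
        # Simplification: split srcset on commas and keep the URL part of each
        # candidate. URLs that themselves contain commas would need a real parser.
        for candidate in (attr.get("srcset") or "").split(","):
            parts = candidate.split()
            if parts:
                self._add(parts[0], "img-srcset", alt)


html = """
<img src="/img/a.jpg" alt="A">
<img data-src="img/b.png" srcset="/img/a.jpg 1x, /img/a@2x.jpg 2x">
"""
collector = MediaCollector("https://example.com/shop/")
collector.feed(html)
urls = [asset["mediaUrl"] for asset in collector.assets]
```

Because nothing executes JavaScript here, images injected by client-side scripts would not appear in `urls`.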

🧰 Tips & best practices

  • Set maxPagesToCrawl to 0 to inventory an entire catalog or media library.
  • Filter by mediaType or fileExtension after the run to get exactly the assets you need.
  • For an alt-text audit, filter for image rows where alt is empty: those are your image-SEO fixes.
  • To download the files, feed the mediaUrl list into a bulk downloader.
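
The filtering tips above can be sketched in a few lines of Python over the exported rows. The sample rows are invented for illustration and the helper names are hypothetical. Since URLs are de-duplicated per page, the same asset can appear on several pages, so the download list below also de-duplicates across pages.

```python
from collections import Counter
from typing import Optional

# Invented sample rows using the field names from "What you get".
rows = [
    {"pageUrl": "https://example.com/a", "mediaUrl": "https://example.com/img/hero.jpg",
     "mediaType": "image", "fileExtension": "jpg", "alt": "Hero banner"},
    {"pageUrl": "https://example.com/b", "mediaUrl": "https://example.com/img/hero.jpg",
     "mediaType": "image", "fileExtension": "jpg", "alt": "Hero banner"},
    {"pageUrl": "https://example.com/b", "mediaUrl": "https://example.com/img/logo.png",
     "mediaType": "image", "fileExtension": "png", "alt": ""},
    {"pageUrl": "https://example.com/b", "mediaUrl": "https://example.com/media/intro.mp4",
     "mediaType": "video", "fileExtension": "mp4"},
]


def images_missing_alt(rows: list[dict]) -> list[dict]:
    """Image rows whose alt text is absent or blank."""
    return [
        r for r in rows
        if r.get("mediaType") == "image" and not (r.get("alt") or "").strip()
    ]


def count_by_extension(rows: list[dict]) -> Counter:
    """Number of rows per file extension."""
    return Counter(r.get("fileExtension", "") for r in rows)


def unique_media_urls(rows: list[dict], media_type: Optional[str] = None) -> list[str]:
    """Asset URLs in first-seen order, de-duplicated across pages."""
    seen: set[str] = set()
    urls: list[str] = []
    for r in rows:
        if media_type is not None and r.get("mediaType") != media_type:
            continue
        url = r["mediaUrl"]
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


missing = images_missing_alt(rows)
by_ext = count_by_extension(rows)
image_urls = unique_media_urls(rows, media_type="image")
```

Writing `image_urls` to a text file, one URL per line, gives a list that most bulk downloaders accept.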

❓ FAQ

Does it download the image files? No — it extracts asset URLs and metadata. You can download them from the mediaUrl list afterwards with any bulk downloader.

Does it capture lazy-loaded images? Yes — it reads data-src, srcset and <picture> sources in addition to plain src.

Does it render JavaScript? No — it parses server-rendered HTML for speed and low cost.

How do I crawl the whole site? Set maxPagesToCrawl to 0.

What formats can I export? JSON, CSV, Excel and HTML. Results are also available through the REST API.

Related actors

  • Website to Markdown & Text Crawler — clean text + Markdown for AI / RAG.
  • Website SEO Audit Crawler — on-page SEO audit including image alt coverage.
  • Broken Link Checker — find dead links across a whole site.
  • Sitemap to URL Crawler — extract all URLs from any sitemap.xml.

Changelog

  • 2026-05-25 — Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.

Last reviewed: 2026-05-25.