🧪High-Volume Website Content & Media Scraper
Pricing
$1.25 / 1,000 units
🧪Crawling Done Right! Let me know what you think and where or how I can improve my actor. I am all for constructive criticism, so please message me if you have any questions. Enjoy and have a good day.
Rating: 5.0 (2)
Developer: Jeff Halverson
Maintained by Community
Actor stats
- Bookmarked: 6
- Total users: 138
- Monthly active users: 8
- Last modified: 3 days ago
ALL Social Media/WebScraper
Extract structured content from public social profile pages, article pages, landing pages, and other JavaScript-heavy websites. This actor focuses on turning a page into a clean record of text blocks, metadata, images, video references, and outgoing links.
What it does
- Opens each public URL in a browser session
- Extracts the page title and basic metadata
- Captures article-like text blocks from the page
- Collects image URLs, embedded video URLs, direct video source URLs, and outbound links
- Optionally filters Facebook links out of the outbound link list
- Stores diagnostic screenshots for failed pages
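The extraction steps above can be sketched with Python's standard library. This is only an illustration under stated assumptions, not the actor's actual implementation: the actor renders pages in a browser session, while this sketch parses a static HTML string. The names `PageExtractor` and `drop_facebook_links` are hypothetical, and the host-based Facebook filter is a guess at how such a filter might work.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects the title, image URLs, and link URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.images = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def drop_facebook_links(links):
    """Removes links whose host is facebook.com or one of its subdomains."""
    kept = []
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        if host == "facebook.com" or host.endswith(".facebook.com"):
            continue
        kept.append(link)
    return kept


# Made-up sample page used only for this sketch.
html = """<html><head><title>Example article</title></head><body>
<img src="/img/cover.png">
<a href="https://www.facebook.com/example">Facebook</a>
<a href="https://example.org/partner">Partner</a>
</body></html>"""

parser = PageExtractor("https://example.com/blog/example-article")
parser.feed(html)
outbound = drop_facebook_links(parser.links)
print(parser.title, parser.images, outbound)
```

A real run would also need to wait for JavaScript-rendered content, which a static parser like this cannot see.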
Good fit
- Public Instagram profile pages
- Blog articles and news pages
- Marketing sites and landing pages
- Content research and competitor monitoring
- Collecting media/link inventories from public pages
Not a good fit
- Logged-in or private content
- Full API-style social scraping for each platform
- Comments, followers, or hidden profile data
- Sites that require persistent authenticated sessions
Input example
{
  "startUrls": [
    { "url": "https://instagram.com/muddlemix_" },
    { "url": "https://example.com/blog/example-article" }
  ],
  "includeFacebookLinks": true,
  "headless": true,
  "maxConcurrency": 3,
  "requestHandlerTimeoutSecs": 90,
  "navigationTimeoutSecs": 90,
  "waitAfterLoadSecs": 0.5,
  "saveErrorScreenshots": true
}
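If the input is assembled programmatically, a small helper can keep the option names consistent. The sketch below is a hedged illustration: `build_actor_input` and `EXAMPLE_OPTIONS` are hypothetical names, and the option values are copied from the input example rather than taken from any documented defaults.

```python
import json

# Values copied from the input example; they are not confirmed defaults.
EXAMPLE_OPTIONS = {
    "includeFacebookLinks": True,
    "headless": True,
    "maxConcurrency": 3,
    "requestHandlerTimeoutSecs": 90,
    "navigationTimeoutSecs": 90,
    "waitAfterLoadSecs": 0.5,
    "saveErrorScreenshots": True,
}


def build_actor_input(urls, **overrides):
    """Builds an input dict shaped like the example, rejecting unknown options."""
    unknown = set(overrides) - set(EXAMPLE_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown input options: {sorted(unknown)}")
    actor_input = {"startUrls": [{"url": url} for url in urls]}
    actor_input.update(EXAMPLE_OPTIONS)
    actor_input.update(overrides)
    return actor_input


actor_input = build_actor_input(
    ["https://example.com/blog/example-article"],
    includeFacebookLinks=False,
    maxConcurrency=5,
)
print(json.dumps(actor_input, indent=2))
```

Rejecting unknown option names early catches typos such as `maxConcurency` before a run is started.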
Output fields
Each dataset item can include:
- url
- title
- meta
- articles
- images
- videos
- links
- scraped
- scrapeTime
- processingTimeMs
- contentType
- error
- diag
- status
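Because each item "can include" these fields, downstream code should not assume all of them are present. The following is a minimal sketch for checking an item against the field list; the helper names and the sample item are hypothetical, and the field types are not documented here, so the sketch only checks names.

```python
# Field names as listed above; value types are not documented, so none are assumed.
OUTPUT_FIELDS = (
    "url", "title", "meta", "articles", "images", "videos", "links",
    "scraped", "scrapeTime", "processingTimeMs", "contentType",
    "error", "diag", "status",
)


def present_fields(item):
    """Returns the listed fields that appear in the item, in listed order."""
    return [name for name in OUTPUT_FIELDS if name in item]


def unexpected_fields(item):
    """Returns keys in the item that are not in the listed fields."""
    return sorted(set(item) - set(OUTPUT_FIELDS))


# Made-up item used only to exercise the helpers.
sample_item = {
    "url": "https://example.com/blog/example-article",
    "title": "Example article",
    "images": [],
    "extra": 1,
}

print(present_fields(sample_item))
print(unexpected_fields(sample_item))
```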
Notes
- The default dataset is the main output.
- Failed pages are still pushed into the dataset with status, error, and an optional diagnostic screenshot URL so runs stay debuggable.
- This actor is best positioned as a public-page media and content extractor, not a full per-platform private-data scraper.
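Since failed pages land in the same dataset as successful ones, a consumer will usually want to separate the two. The sketch below assumes that a non-empty `error` field marks a failed page. The possible values of `status` are not documented here, so the sketch does not rely on them, and the sample items are made up.

```python
def split_by_outcome(items):
    """Splits dataset items into (succeeded, failed).

    Assumption: a failed page carries a non-empty `error` field.
    """
    succeeded, failed = [], []
    for item in items:
        if item.get("error"):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed


# Hypothetical items; the `status` strings here are illustrative only.
items = [
    {"url": "https://example.com/a", "title": "A", "status": "ok"},
    {"url": "https://example.com/b", "error": "Navigation timed out", "status": "failed"},
]

succeeded, failed = split_by_outcome(items)
for item in failed:
    print(item["url"], "->", item["error"])
```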