OpenAI Web Scraper
Pricing
$30.00 / 1,000 results
Developer
Tin
OpenAI Web Scraper is an Apify actor designed to crawl web pages and extract structured information using AI. The actor loads web pages, collects their content, and sends the extracted data to an AI model for processing. The AI then analyzes the page and returns structured information such as title, price, condition and other relevant data. It is built on top of the Apify SDK, and you can run it both on the Apify platform and locally.
Key Capabilities
The scraper can extract data from multiple types of content, including:
Text from screenshots (via OCR and vision models)
Tables from PDFs, including scanned documents
Data from charts and graphs, even when not available as raw text
Input
Input is a JSON object with the following properties:
{
  "startUrls": START_URLS,
  "question": QUESTION,
  "outputSchema": OUTPUT_SCHEMA,
  "outputFilter": OUTPUT_FILTER,
  "clickButtonSelector": CLICK_BUTTON_SELECTOR,
  "nextPageSelector": NEXT_PAGE_SELECTOR,
  "nextPageRegex": NEXT_PAGE_REGEX,
  "maxPages": MAX_PAGES,
  "countryCode": COUNTRY_CODE
}
Examples:
{
  "question": "Extract the title, price and condition of the ebay item.",
  "startUrls": [{"url": "https://www.ebay.com/p/3072579174?iid=186372216016&var=694422418597"}],
  "nextPageRegex": ["page=\\d+"],
  "nextPageSelector": "a[href*='page='],.product-title-link",
  "outputSchema": "(z) => { return z.object({ title: z.string(), price: z.string(), condition: z.string(), isProductDetailPage: z.boolean() }); }",
  "outputFilter": "(obj) => { return obj.isProductDetailPage; }",
  "maxPages": 1,
  "countryCode": "US"
}
{
  "countryCode": "US",
  "maxPages": 10,
  "nextPageRegex": ["pg=\\d+"],
  "nextPageSelector": ".a-list-item .a-link-normal[role=link]",
  "outputFilter": "(obj) => { return obj.isItemDetailPaqe; }",
  "outputSchema": "(z) => { return z.object({\r\n position: z.number().int().positive(),\r\n\r\n category: z.string(),\r\n categoryUrl: z.string().url(),\r\n\r\n name: z.string(),\r\n\r\n price: z.number().nonnegative().nullable().optional(),\r\n currency: z.string().min(1), // \"$\" allowed\r\n\r\n numberOfOffers: z.number().int().nonnegative().optional(),\r\n\r\n url: z.string().url(),\r\n\r\n thumbnail: z.string().url(),\r\n isItemDetailPaqe: z.boolean()\r\n}); }",
  "question": "if it is the product page, return the following fields:\n- isItemDetailPaqe (true/false)\n- position: index in list\n- category\n- categoryUrl\n- name\n- price\n- currency\n- numberOfOffers\n- url\n- thumbnail\n\nInstructions:\n- If a value is missing, return null.\n\nExample output:\n{\n \"position\": 1,\n \"category\": \"Amazon Best Sellers: Best Electronics\",\n \"categoryUrl\": \"https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/\",\n \"name\": \"Amazon Fire TV Stick 4K, brilliant 4K streaming quality, TV and smart home controls, free and live TV\",\n \"price\": 22.99,\n \"currency\": \"$\",\n \"numberOfOffers\": 1,\n \"url\": \"https://www.amazon.com/all-new-fire-tv-stick-4k-with-alexa-voice-remote/dp/B08XVYZ1Y5/ref=zg_bs_g_electronics_sccl_1/134-0062779-1101052?psc=1\",\n \"thumbnail\": \"https://images-na.ssl-images-amazon.com/images/I/41GYmjbeVSL._AC_UL600_SR600,400_.jpg\"\n}",
  "startUrls": [{"url": "https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0"}]
}
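The examples above pass outputFilter as a string containing a JavaScript function and nextPageRegex as a list of pattern strings. The following is a minimal sketch of how such values could be interpreted; it is not the actor's actual code. The helper names compileFunctionString and selectNextPageLinks, the example.com links, and the sample records are all hypothetical, and the assumption that a link is followed when it matches at least one pattern is mine.

```javascript
// Hypothetical helpers for illustration; the actor's real implementation may differ.

// Turn a function-valued input string, such as the outputFilter value,
// into a callable function. Only do this with strings you trust.
function compileFunctionString(source) {
  return new Function(`return (${source});`)();
}

// Keep only the links that match at least one nextPageRegex pattern
// (assumed semantics: any single match is enough).
function selectNextPageLinks(links, nextPageRegex) {
  const patterns = nextPageRegex.map((p) => new RegExp(p));
  return links.filter((link) => patterns.some((re) => re.test(link)));
}

// Values copied from the eBay example input.
const input = {
  nextPageRegex: ["page=\\d+"],
  outputFilter: "(obj) => { return obj.isProductDetailPage; }",
};

// Made-up candidate links found on a page.
const links = [
  "https://example.com/list?page=2",
  "https://example.com/about",
];
const nextLinks = selectNextPageLinks(links, input.nextPageRegex);

// Made-up records, shaped like what the AI model might return.
const keep = compileFunctionString(input.outputFilter);
const records = [
  { title: "Item A", isProductDetailPage: true },
  { title: "Listing page", isProductDetailPage: false },
];
const kept = records.filter((record) => keep(record));

console.log(nextLinks);
console.log(kept.map((record) => record.title));
```

Under these assumptions, a record only reaches the dataset when the filter returns a truthy value, which is why both example schemas include a boolean field that marks detail pages.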
Output
Output is stored in a dataset. Example:
{
  "url": "https://www.ebay.com/p/3072579174?iid=186372216016&var=694422418597",
  "title": "Samsung Galaxy S22 - 128 GB - Phantom Black (Unlocked)",
  "price": "$156.99",
  "condition": "Very Good - Refurbished"
}
{
  "url": "https://www.amazon.com/hOmeLabs-Portable-Machine-Stainless-Countertop/dp/B07Z733W6H/ref=zg_bs_10897729011_sccl_1/134-0062779-1101052?psc=1",
  "title": "Amazon Best Sellers: Best Appliances",
  "screenshotSentToOpenAiUrl": "https://api.apify.com/v2/key-value-stores/he7Sff76SHgGdFk0c/records/ff6c13b0-00d9-4e71-9a97-440809d6e9e6.jpg",
  "isItemDetailPaqe": false,
  "category": "Amazon Best Sellers: Best Appliances",
  "categoryUrl": "https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/",
  "name": "hOmeLabs Portable Ice Maker Machine, Stainless Steel, Clear Visual Window, 26 lbs (12kg) Ice Per Day, 6 Minutes Ice Cycle, Compact Countertop Frozen Maker for Kitchen, Bar, Party",
  "price": 234.97,
  "currency": "$",
  "numberOfOffers": 1,
  "thumbnail": "https://m.media-amazon.com/images/I/71HDNpd7whL._AC_UL320_.jpg"
}
Compute units consumption
Keep in mind that, because of the startup time, it is much more efficient to run one longer scrape (at least one minute) than several shorter ones.
The average consumption is about 1 compute unit per 1,000 pages scraped.
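As a rough illustration of the arithmetic, the sketch below estimates the compute units and the result-based price for a run. It assumes the averages quoted on this page (1 compute unit per 1,000 pages and $30.00 per 1,000 results) scale linearly; the function name estimateRun and the sample figures are hypothetical, and how the two amounts relate on an actual bill is not stated here.

```javascript
// Figures taken from this page; treat them as rough averages, not guarantees.
const COMPUTE_UNITS_PER_1000_PAGES = 1;
const PRICE_USD_PER_1000_RESULTS = 30;

// Hypothetical helper: linear estimate for a run that scrapes `pages` pages
// and stores `results` dataset items.
function estimateRun(pages, results) {
  return {
    computeUnits: (pages / 1000) * COMPUTE_UNITS_PER_1000_PAGES,
    priceUsd: (results / 1000) * PRICE_USD_PER_1000_RESULTS,
  };
}

const estimate = estimateRun(2500, 500);
console.log(estimate); // { computeUnits: 2.5, priceUsd: 15 }
```

Real runs can deviate from this, especially short ones, where startup time makes up a larger share of the total.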
Epilogue
Thank you for trying my actor. I would be very glad to receive your feedback, which you can send to my email: dtrungtin@gmail.com.