Ai Web Scraper - Extract Data With Ease
Pay $20.00 for 1,000 results
AI Web Scraper enables scraping for everyone, including non-techies! It uses Google's Gemini LLM to scrape websites with natural language commands. It extracts data dynamically with no selector input needed, handles dynamic content and cookie consent, applies standard measures against bot detection, and outputs JSON or other formats.
AI Web Scraper - Extract data With Ease - Natural Language & Vision Scraper
This AI Web Scraper is a powerful and flexible tool that leverages Google's Gemini LLM (Large Language Model) to intelligently extract data from websites. Unlike traditional scrapers that rely on pre-defined selectors or rigid rules, this Actor enables you to specify what data you want in plain natural language, and it dynamically adapts to find and extract it.
Note
This scraper is experimental and depends on AI-based vision capabilities. Results can sometimes be inconsistent or not exactly what you expect. Please consider it a beta feature. We appreciate your feedback!
Key Updates / What's New
- Single or Multiple URL Support
  - Single URL: If only one URL is provided (or a set of URLs with a unique domain), the scraper uses a no-overlap scrolling approach and processes screenshots entirely in an async manner, using the LLM's Vision capability only.
  - Multiple URLs: If more than one URL is provided, the scraper derives bounding boxes/CSS selectors on the first page of a domain and then reuses them for subsequent pages on the same domain to reduce LLM calls and save costs.
- Automated Cookie Consent Handling
  - The scraper attempts to detect and accept cookie consent banners using the LLM if present.
- Improved AI Prompts and Code Flow
  - Internally updated prompt instructions for bounding box extraction and refined candidate CSS selectors for more accurate results.
- Greater Resilience
  - If no results are found for a specific label, the scraper widens its search region.
  - Single-page extraction now handles screenshots with minimal overlap, ensuring more comprehensive coverage.
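The widening step for a label with no results could look roughly like the sketch below. The `Box` type, the `widen_box` name, and the 25% growth factor are assumptions made for illustration; the Actor's actual widening rule is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in page pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def widen_box(box: Box, page_width: float, page_height: float,
              factor: float = 0.25) -> Box:
    """Grow the box by `factor` of its size on every side, clamped to the page.

    The 0.25 default is an assumed value, not taken from the Actor.
    """
    pad_x = (box.right - box.left) * factor
    pad_y = (box.bottom - box.top) * factor
    return Box(
        left=max(0.0, box.left - pad_x),
        top=max(0.0, box.top - pad_y),
        right=min(page_width, box.right + pad_x),
        bottom=min(page_height, box.bottom + pad_y),
    )


# Example: a 200x80 region grows by 50px horizontally and 20px vertically.
region = Box(left=100, top=40, right=300, bottom=120)
wider = widen_box(region, page_width=1280, page_height=2000)
```

Clamping keeps the widened region inside the page, so repeated widening eventually covers the whole page instead of producing out-of-range coordinates.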
What Does This Actor Do?
- Dynamic Data Extraction: Simply tell it what items you want, in natural language. Example: "Give me the product title, product price, and the product description." The Actor uses AI to figure out the best approach to locate and extract this data.
- No Overlap Scrolling: For single URL runs, the scraper scrolls the page in distinct blocks and captures screenshots with minimal overlap, ensuring thorough coverage from the top to the bottom of the page.
- AI-Powered Element Identification: Uses Google's Gemini LLM to analyze screenshots, detect bounding boxes for relevant items, and map them to real DOM elements, even if the website's structure is unfamiliar.
- Automatic Cookie Consent Handling: Attempts to detect and click cookie consent banners automatically to help you avoid manual intervention or partial data capture.
- Output: Returns extracted data in JSON format, with labels derived from the instructions you provided (or from the bounding boxes if you rely on the AI’s naming).
- Bot Detection Avoidance: Utilizes standard best practices: modern user-agent strings, headless browser settings, etc.
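The block-by-block scrolling described above can be pictured as tiling the page height with viewport-sized segments. The following is a minimal sketch under that assumption; `scroll_segments` is a hypothetical helper, and the Actor's real scrolling logic may differ.

```python
def scroll_segments(page_height: int, viewport_height: int) -> list[tuple[int, int]]:
    """Return (top_offset, height) pairs that tile the page without overlap.

    The last segment is trimmed to the remaining page height. In a real
    browser the final scroll position is capped at the page bottom, which is
    one reason a capture may still overlap slightly with the previous one.
    """
    if page_height <= 0 or viewport_height <= 0:
        raise ValueError("page_height and viewport_height must be positive")

    segments = []
    for top in range(0, page_height, viewport_height):
        height = min(viewport_height, page_height - top)
        segments.append((top, height))
    return segments


# Example: a 2500px page captured with a 1000px viewport.
segments = scroll_segments(page_height=2500, viewport_height=1000)
```

Each `top_offset` would be the scroll position for one screenshot, and the heights always sum to the full page height.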
Important Notes
- Single URL or Multiple URLs:
  - Single URL: Will run a more robust, screenshot-based approach.
  - Multiple URLs: The first URL in the list is used to derive bounding boxes (hence CSS selectors). Subsequent URLs on the same domain reuse those cached selectors, saving on LLM calls.
- Variable Results: The AI approach can produce inconsistent or incomplete results, especially on highly dynamic sites. If you encounter issues, consider providing clearer instructions or verifying that the desired content is actually visible on the page.
- Cookies & Consent Banners: The Actor tries to detect and accept cookie banners, but this process is also AI-driven and may sometimes fail on complicated banners.
- Legal & Ethical Considerations: Only scrape data that you have the right to access and process. Ensure compliance with the website’s terms of service and relevant data protection regulations (GDPR, etc.).
How to Use the AI Web Scraper
- Create an Apify Account: Sign up for a free Apify account if you don’t already have one.
- Open the AI Web Scraper: Locate the Actor on the Apify Console or navigate to your own hosted version.
- Provide Instructions and URLs:
  - Instructions: A short description of what you want to extract (e.g., "product title, product price, product rating").
  - Start URLs: One or more URLs from the same or different domains.
- Run the Actor: Click Start. The Actor will spin up a headless browser, interact with the site, capture screenshots, ask the LLM to identify items, and store them in a dataset.
- Download Your Data: After the run is finished, you can view or download the results in various formats (JSON, CSV, XLSX, etc.).
Input Configuration
Example Input
{
  "instructions": "Product title, product price, product description",
  "start_urls": [
    "https://www.ikea.com/nl/nl/p/onsevig-vloerkleed-laagpolig-veelkleurig-60497078/",
    "https://www.ikea.com/nl/nl/p/vedbak-vloerkleed-laagpolig-lichtgrijs-40528900/"
  ]
}
How It Works
- Single URL: If your start_urls has only one link, the Actor performs multiple screenshot extractions (no overlap) and uses the LLM for each screenshot.
- Multiple URLs: The Actor first processes the first URL to identify bounding boxes via the LLM, then maps them to CSS selectors. It applies these cached selectors to the subsequent URLs from the same domain.
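One way to picture the choice between the two modes is to group the start URLs by domain and pick a strategy per group. The sketch below assumes the decision is made per domain; `plan_runs` and the strategy names `vision_only` and `cached_selectors` are hypothetical labels, not identifiers from the Actor.

```python
from urllib.parse import urlparse


def plan_runs(start_urls: list[str]) -> dict[str, dict]:
    """Group URLs by domain and choose a (hypothetical) strategy per group."""
    by_domain: dict[str, list[str]] = {}
    for url in start_urls:
        domain = urlparse(url).netloc.lower()
        by_domain.setdefault(domain, []).append(url)

    plan: dict[str, dict] = {}
    for domain, urls in by_domain.items():
        if len(urls) == 1:
            # A lone URL gets the screenshot-only treatment.
            plan[domain] = {"strategy": "vision_only", "urls": urls}
        else:
            # The first URL derives selectors; the rest reuse them.
            plan[domain] = {
                "strategy": "cached_selectors",
                "derive_from": urls[0],
                "reuse_on": urls[1:],
            }
    return plan


plan = plan_runs([
    "https://www.ikea.com/nl/nl/p/onsevig-vloerkleed-laagpolig-veelkleurig-60497078/",
    "https://www.ikea.com/nl/nl/p/vedbak-vloerkleed-laagpolig-lichtgrijs-40528900/",
])
```

With the two IKEA URLs from the example input, both land in one group, so only the first page would need bounding-box analysis by the LLM.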
Output
The default output is structured JSON stored in a dataset. For example:
[
  {
    "url": "https://www.ikea.com/nl/nl/p/onsevig-vloerkleed-laagpolig-veelkleurig-60497078/",
    "data": {
      "product title": "ONSEVIG Vloerkleed, laagpolig, veelkleurig",
      "product price": "€39,99",
      "product description": "Made with recycled materials..."
    }
  },
  {
    "url": "https://www.ikea.com/nl/nl/p/vedbak-vloerkleed-laagpolig-lichtgrijs-40528900/",
    "data": {
      "product title": "VEDBÄK Vloerkleed, laagpolig, lichtgrijs",
      "product price": "€59,99",
      "product description": "Durable, stain resistant..."
    }
  }
]
Tip: You can then download the dataset in JSON, CSV, XML, or Excel formats via the Apify platform.
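If you download the JSON and want a flat table locally, the nested `data` object can be spread into columns. This is a minimal sketch using only the Python standard library; `records_to_csv` is a hypothetical helper, and the sample records are shortened versions of the output shown above.

```python
import csv
import io


def records_to_csv(records: list[dict]) -> str:
    """Flatten {"url": ..., "data": {...}} records into CSV text."""
    # Collect labels in first-seen order so the column order is stable.
    labels: list[str] = []
    for record in records:
        for label in record.get("data", {}):
            if label not in labels:
                labels.append(label)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", *labels], lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {"url": record["url"]}
        row.update(record.get("data", {}))
        writer.writerow(row)  # labels missing from a record become empty cells
    return buffer.getvalue()


sample = [
    {"url": "https://example.com/a",
     "data": {"product title": "ONSEVIG", "product price": "€39,99"}},
    {"url": "https://example.com/b",
     "data": {"product title": "VEDBÄK"}},
]
csv_text = records_to_csv(sample)
```

Values containing commas, such as European-style prices, are quoted automatically by the `csv` module.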
How Does the AI Web Scraper Work?
- Launch Headless Chrome: We use Selenium with a headless Chrome browser. Common anti-bot detection measures are addressed using realistic user-agent strings and standard flags.
- Check for Cookie Consent: The Actor attempts to detect and click any cookie consent or GDPR banners by using a specialized AI prompt that evaluates all visible buttons. If none is detected or the click fails, it continues scraping anyway.
- Screenshot Analysis with Gemini:
  - For single URL: The page is scrolled multiple times in non-overlapping segments. Each segment is captured as a screenshot and sent to the LLM with instructions (e.g., "title, price, description"). The results are aggregated into a final JSON object.
  - For multiple URLs: The first URL for each domain is used to locate bounding boxes for each requested item. These bounding boxes are mapped back to CSS selectors. Subsequent pages on that domain reuse the same selectors for faster scraping.
- Refinement of Selectors:
  - If the bounding boxes are found, the code tries to identify the corresponding DOM elements and generate refined CSS selectors.
  - It caches these selectors by domain so that multiple URLs from the same site can be scraped quickly.
- Data Extraction:
  - Once the selectors are determined, the scraper extracts the text/HTML from the matching elements.
  - If the item is not found, it returns an empty string.
- Output:
  - The results are pushed to the Apify dataset with one record per URL, containing the url and the data.
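The step that maps an LLM-reported bounding box back to a DOM element is not spelled out above. One plausible approach, shown here purely as an illustration, is to compare the box against each candidate element's rectangle and keep the best intersection-over-union (IoU) score. The function names, the selectors, and the 0.3 threshold are all assumptions; in a browser, element rectangles would typically come from something like `getBoundingClientRect()`.

```python
from typing import Optional

# (left, top, right, bottom) in page pixels
Rect = tuple[float, float, float, float]


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def best_selector(box: Rect, elements: dict[str, Rect],
                  min_iou: float = 0.3) -> Optional[str]:
    """Return the selector whose rectangle best matches `box`, or None.

    `min_iou` is an assumed cut-off to reject weak matches.
    """
    best, best_score = None, min_iou
    for selector, rect in elements.items():
        score = iou(box, rect)
        if score >= best_score:
            best, best_score = selector, score
    return best


# Hypothetical element rectangles keyed by CSS selector.
elements = {"h1.title": (0, 0, 100, 20), "span.price": (0, 30, 50, 50)}
match = best_selector((0, 28, 52, 50), elements)
```

Returning `None` when nothing clears the threshold mirrors the documented behavior of yielding an empty value when an item cannot be found.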
Use Cases
- Market Research: Compare product prices, descriptions, or specifications across competitor websites.
- Content Aggregation: Collect articles, headlines, or metadata from news sites.
- Financial Data: Grab stock tickers, performance metrics, or fundamental ratios from finance portals.
- Real Estate: Scrape listings, prices, or agent contact info.
- Lead Generation: Collect B2B contact info from directories or e-commerce platforms.
Integrations
- Apify Platform: Schedule this Actor to run regularly and store results in datasets, or integrate with other Actors and tasks.
- Webhooks: Get notifications or forward data automatically upon completion.
- API: Control the Actor programmatically via the Apify API.
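As a rough illustration of the API route, the sketch below builds (but does not send) an HTTP request that starts a run through the Apify API v2 `acts/{actorId}/runs` endpoint. The Actor ID `username~ai-web-scraper` and the token value are placeholders, not real identifiers; check the Actor's API tab in the Apify Console for the actual values.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that would start an Actor run with `run_input`."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "instructions": "Product title, product price, product description",
    "start_urls": ["https://www.ikea.com/nl/nl/p/onsevig-vloerkleed-laagpolig-veelkleurig-60497078/"],
}
# Placeholder Actor ID and token; replace both before sending.
request = build_run_request("username~ai-web-scraper", "YOUR_APIFY_TOKEN", run_input)
# To actually start the run: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending any credits.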
FAQ
1. Do I need multiple URLs?
No. You can use just one URL if that’s all you need. The Actor has been updated to handle single-URL scraping using an advanced, no-overlap screenshot approach. Runs with multiple URLs from the same domain can be faster and cheaper after the first page, because the LLM-derived selectors are cached and reused.
2. How Much Does It Cost?
Apify has a free plan to get you started and uses pay-as-you-go for additional resources. LLM usage may incur costs depending on your usage and the pricing policies of the underlying AI service (Google Gemini).
3. Is It Legal to Scrape Data Using This Actor?
This Actor only extracts publicly available data. You must ensure compliance with each website’s terms of service and any relevant laws (like GDPR). You are responsible for how you use the data.
4. What If AI Extraction Fails or Is Incomplete?
AI can sometimes misinterpret or return incomplete data, especially if the site’s layout is highly dynamic or the instructions are vague. You can try:
- More explicit instructions (e.g., “product title text, brand name, price, SKU”).
- Providing consistent domain/page structure for best results.
- Reviewing logs for partial data and adjusting your approach accordingly.
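To spot incomplete extractions quickly, you can scan the downloaded records for labels that came back empty. This is a small sketch that assumes the record shape shown in the Output section; `missing_labels` is a hypothetical helper name.

```python
def missing_labels(records: list[dict]) -> dict[str, list[str]]:
    """Map each URL to the labels whose extracted value is empty or blank."""
    report: dict[str, list[str]] = {}
    for record in records:
        empty = [
            label
            for label, value in record.get("data", {}).items()
            if value is None or not str(value).strip()
        ]
        if empty:
            report[record["url"]] = empty
    return report


results = [
    {"url": "https://example.com/a",
     "data": {"product title": "ONSEVIG", "product price": ""}},
    {"url": "https://example.com/b",
     "data": {"product title": "VEDBÄK", "product price": "€59,99"}},
]
gaps = missing_labels(results)
```

URLs that appear in the report are good candidates for a rerun with more explicit instructions.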
Feedback & Issues
We welcome your feedback! If you have suggestions or questions, or encounter any issues, leave a bug report or suggestion in the Issues section for this Actor on the Apify Console.
Thank you for trying the AI Web Scraper! We hope it significantly reduces the complexity of data extraction for you and your team.
Actor Metrics
83 monthly users
5 bookmarks
73% runs succeeded
15 hours response time
Created in Dec 2024
Modified 12 days ago