The web data layer for machine learning

Web data is the fuel of AI, machine learning, and LLMs. Get the data you need for your ML projects.

Infinite web data to train your machine learning models

The more capable your AI, the more variety it needs in the data feeding it. Pulling that variety from across the web takes more than a one-off tool. Apify’s marketplace gives you 30,000+ ready-made Actors built for exactly this, plus the infrastructure to run them reliably so the data keeps flowing as fast as your models can learn.

Connect agents with Apify tools through MCP

Apify's MCP Server lets agents find, run, and fetch data from the right tool automatically. Agents can operate independently - fetching live data, reacting to real-world changes, and completing tasks without manual prompts.

See Apify's MCP Server

Data ingestion for LLMs

A model is only as good as the data it learns from, and that data has to match the task it’s being designed for. Apify has a range of tools built for specific data types, so you can grab exactly what your LLM needs without finding a new tool for every set.

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Apify

142K

4.5

(213)

Instagram Profile Scraper

apify/instagram-profile-scraper

Scrape all Instagram profile info. Just add Instagram usernames, IDs or URLs and extract name, join date, number of followers, location, bio, website, related profiles, video and post count, latest posts. Export scraped data, schedule scraper via API, and integrate with other tools or AI workflows.

Apify

180K

4.8

(148)

Google Search Results Scraper

apify/google-search-scraper

Scrape Google Search Engine Results Pages (SERPs). Select the country or language and extract organic and paid results, AI Mode, AI overviews, ads, queries, People Also Ask, prices, reviews, like a Google SERP API. Export data, run the scraper via API, schedule runs, or integrate with other tools.

Apify

156K

4.5

(159)
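
Crawled text usually needs to be split into smaller pieces before it goes into a vector database or RAG pipeline. Here is a minimal sketch of that step, assuming you already have a page's Markdown as a plain string. The function name `chunk_markdown` and the `max_chars` limit are hypothetical choices for illustration, not part of any Actor's output or API.

```python
def chunk_markdown(text: str, max_chars: int = 500) -> list[str]:
    """Split Markdown into chunks of at most max_chars, breaking on blank lines."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Try to append the paragraph to the chunk being built.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # Hard-split a single oversized paragraph so no chunk exceeds the limit.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


doc = "# Title\n\nFirst paragraph.\n\nSecond paragraph."
for chunk in chunk_markdown(doc, max_chars=30):
    print(repr(chunk))
```

Splitting on blank lines keeps headings and paragraphs intact where possible. Token-based limits or overlapping chunks may suit your embedding model better.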

Natural language processing

Pull online reviews to process and analyze large amounts of natural language data. Yelp Scraper grabs the latest restaurant reviews. Google Play Data Extractor brings back what people are saying about your favorite app.

Google Maps Scraper

compass/crawler-google-places

Extract data from thousands of Google Maps locations and businesses, including reviews, reviewer details, images, contact info (full name, email, and job title), opening hours, prices & more. Export data, run via API, schedule and monitor runs, or integrate with other tools.

Compass

526K

4.7

(1,660)

Tripadvisor Reviews Scraper

maxcopell/tripadvisor-reviews

Get and download reviews for chosen places on Tripadvisor. Extract the review text, URL, rating, date of travel, published date, basic reviewer info, owner's response, helpful votes, images, review language, place details. Download reviews in XML, JSON, CSV.

Max

5.0

(51)

Booking Reviews Scraper

voyager/booking-reviews-scraper

Scraper to get reviews from hotels, apartments and other accommodations listed on the Booking.com portal. Extract data using hotel URLs for review text, ratings, stars, basic reviewer info, length of stay, liked/disliked parts, room info, date of stay and more. Download in JSON, HTML, Excel, CSV.

Voyager

3.3K

4.3

(30)

Google Play Data Extractor

epctex/google-play-scraper

Get valuable info & reviews from Google Play! Access title, price, ratings, download rates, screenshots, released date, version number & developer details for any region or language. Unlimited & lightning-fast extraction. Export data in XML, JSON, CSV, Excel, or HTML formats.

epctex

1.4K

5.0

(10)

AI Text Analyzer for Google Reviews

geneea-analytics/reviews-text-nlp-analyzer

Quickly analyze customer reviews extracted by Google Maps Scraper. Find out what the most frequently used keywords are in each review. Learn how people view your staff and prices. Obtain structured information from unstructured text. Monitor changes in customers’ sentiment over time.

Geneea Analytics

627

5.0

(1)

Yelp Scraper

tri_angle/yelp-scraper

Free Yelp web scraper to extract data from Yelp. Fast Yelp review scraper, but also gets business details and ratings without using the Yelp API.

Tri⟁angle

6.6K

4.1

(9)

Image recognition

Deep learning methods currently deliver the strongest image recognition results, in both performance and flexibility. Pull images straight from the web to train custom models on your own dataset.

Bulk Image Downloader

trudax/bulk-image-downloader

Download all images from a website with this easy-to-use Bulk Image Downloader. Scrape all images from any website by URL to a zip file with a single click.

Trudax

3.4K

5.0

(1)

DALL-E 2 Image Generation

jirimoravcik/dalle-2-image-generation

This Actor enables you to generate images using OpenAI's DALL-E 2.

Jiří Moravčík

100

Product mapping for e-commerce

Apply machine learning to compare collected retail data. Identify, classify, and match products and prices across multiple e-commerce websites for competitive pricing intelligence.

Amazon Product Scraper

junglee/Amazon-crawler

Use this Amazon scraper to collect data based on URL and country from the Amazon website. Extract product information without using the Amazon API, including reviews, prices, descriptions, and Amazon Standard Identification Numbers (ASINs). Download data in various structured formats.

Junglee

20K

4.4

(53)

Google Shopping Insights

epctex/google-shopping-scraper

Unlock valuable insights from Google Shopping with our Data Extractor. Get reviews, descriptions, prices, merchant details, and affiliation links. Export data in JSON, XML, CSV, Excel, and HTML formats with no limits!

epctex

2.3K

4.9

(9)

AliExpress Scraper

epctex/aliexpress-scraper

Effortlessly extract descriptions, images, feedback, questions, prices, and shipping information from AliExpress. Customize country, language, and region preferences for enhanced data gathering.

epctex

5.0

(10)

News aggregation

Train and update your models with fresh data crawled from global news sources. Track public sentiment, identify relationships, spot fake news, and gather up-to-date intelligence.

Website Content Crawler

apify/website-content-crawler

Apify

142K

4.5

(213)

Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Lukáš Křivka

7.6K

4.1

(9)

Google News Scraper

lhotanova/google-news-scraper

Gets featured articles from Google News with title, link, source, publication date and image.

Kristýna Lhoťanová

3.1K

4.6

(13)

Product mapping with AI

Once you have your data, AI Product Matcher pairs products across your dataset so you can compare prices with competitors.

Keep everything inside Apify - integrate workflows, schedule tasks, and run the next step from the same place.

AI Product Matcher

equidem/ai-product-matcher

Match products across multiple e-commerce websites. Use this AI product matching Actor whenever you need to find matching pairs of products from different online shops for dynamic pricing, competitor analysis or market research.

Matěj Sochor

773
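
To give a feel for what product matching involves, here is a deliberately simple sketch that pairs products by the word overlap (Jaccard similarity) of their titles. This is not how AI Product Matcher works internally. The `title` and `price` fields, the sample products, and the `threshold` value are all assumptions made for illustration.

```python
import re


def tokens(title: str) -> set[str]:
    """Lowercase a title and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of tokens the two titles have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_products(shop_a: list[dict], shop_b: list[dict], threshold: float = 0.5):
    """For each product in shop_a, keep the best-scoring product in shop_b."""
    matches = []
    for item_a in shop_a:
        scored = [(jaccard(item_a["title"], item_b["title"]), item_b) for item_b in shop_b]
        if not scored:
            continue
        score, best = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            matches.append((item_a, best, score))
    return matches


# Hypothetical sample data from two shops.
shop_a = [{"title": "Acme Wireless Mouse M100", "price": 19.99}]
shop_b = [
    {"title": "ACME M100 wireless mouse (black)", "price": 17.49},
    {"title": "Acme USB Keyboard K200", "price": 29.00},
]
for ours, theirs, score in match_products(shop_a, shop_b):
    print(ours["title"], "<->", theirs["title"], round(score, 2))
```

Title overlap breaks down quickly on real catalogs (different languages, missing model numbers, bundles), which is where a trained matching model earns its keep.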

Let the machine learn

Website Content Crawler extracts web data from any website to feed AI models, LLM applications, or RAG pipelines. Watch this two-minute walkthrough to see it in action.

4 steps to get data for machine learning

Sign up

First, create an Apify account. It’s free, no credit card is required, and you get $5 of free prepaid platform usage every month!

Choose an Actor

Apify Store features more than 30,000 ready-made Actors. Browse the ones built for machine learning that could fit your use case.

Get your data

After everything’s set up, run the Actor. As soon as it’s successful, you’ll be able to download your data in Excel, JSON, HTML, and many other formats.

Schedule, integrate, monitor

Push results to Google Drive, trigger Gmail or Slack alerts, and schedule or monitor your Actor runs to keep everything running on its own.

We are scraping Facebook comments using Apify’s Facebook scraper for a machine translation academic project. It saved us a lot of time and enabled us to meet the project’s deadline.

Hashem S.

Research Assistant

Why Apify?

Never get blocked

Every plan (free included) comes with Apify Proxy, which is great for avoiding blocking and giving you access to geo-specific content.

Customers love us

Apify is one of the top-rated marketplaces for web data and automation, rated 4.7/5 across more than 900 reviews on G2 and Capterra.

Monitor your runs

Our monitoring features give you immediate access to insights into the status of your Actor runs.

Export to various formats

Your datasets can be exported to any format that suits your data workflow, including Excel, CSV, JSON, XML, HTML table, JSONL, and RSS.

Integrate Apify into your workflow

You can integrate your Apify runs with platforms such as Zapier, Make, Keboola, Google Drive, or GitHub. Connect with practically any cloud service or web app.

Large developer community

Apify is built by developers, so you'll be in good hands if you have any technical questions. Our Discord server is always here to help!

Frequently asked questions

What is web scraping and how does it relate to machine learning?

Web scraping is the automated process of extracting data from websites using software. Machine learning uses this data to train models for various applications such as sentiment analysis, recommender systems, and fraud detection.

How can you ensure the quality and accuracy of the data collected through web scraping?

It’s important to monitor your data for errors and to make sure it is representative of the population it’s meant to describe. Sampling techniques and data cleaning methods can help improve data quality.
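
As a rough illustration of data cleaning, the sketch below normalizes whitespace, drops records with missing or very short text, and removes case-insensitive duplicates. The `text` field name and the `min_chars` cutoff are assumptions; adjust both to whatever your scraped records actually contain.

```python
def clean_records(records: list[dict], text_field: str = "text", min_chars: int = 20) -> list[dict]:
    """Return records with normalized text, skipping empty, short, or duplicate entries."""
    seen: set[str] = set()
    cleaned: list[dict] = []
    for record in records:
        raw = record.get(text_field)
        if not isinstance(raw, str):
            continue  # missing or non-text value
        text = " ".join(raw.split())  # collapse runs of whitespace
        if len(text) < min_chars:
            continue  # too short to be useful
        key = text.casefold()
        if key in seen:
            continue  # duplicate, ignoring letter case
        seen.add(key)
        cleaned.append({**record, text_field: text})
    return cleaned


# Hypothetical scraped reviews.
reviews = [
    {"text": "Great   food and friendly staff!"},
    {"text": "great food and friendly staff!"},
    {"text": "ok"},
    {"text": None},
    {"rating": 5},
]
print(clean_records(reviews))
```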

How can web scraping be used for supervised and unsupervised machine learning?

In supervised learning, scraped data can be labeled for training classification or regression models. In unsupervised learning, it can be used for clustering or association analysis to uncover patterns and relationships in the data.
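
For the supervised case, star ratings that come with scraped reviews can serve as ready-made labels. The sketch below assumes each record has `text` and `rating` fields (hypothetical names) and uses arbitrary thresholds: 4 stars and up is positive, 2 and below is negative, and anything in between is skipped as ambiguous.

```python
from typing import Optional


def label_review(rating: float) -> Optional[str]:
    """Map a star rating to a sentiment label, or None if it is ambiguous."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


def build_training_set(reviews: list[dict]) -> list[tuple[str, str]]:
    """Turn scraped reviews into (text, label) pairs for a classifier."""
    examples = []
    for review in reviews:
        label = label_review(review["rating"])
        if label is not None:
            examples.append((review["text"], label))
    return examples


# Hypothetical scraped reviews.
sample = [
    {"text": "Loved it", "rating": 5},
    {"text": "Meh", "rating": 3},
    {"text": "Terrible", "rating": 1},
]
print(build_training_set(sample))
```

The resulting pairs can be fed to any text classifier. For the unsupervised case you would drop the labels and cluster the texts instead.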

Is it legal to scrape data for machine learning?

It is legal to scrape publicly available data such as product descriptions, prices, or ratings. On the other hand, certain types of data, such as personal data or copyrighted content, are under special legal protection and you should not scrape these without first making sure you follow the relevant laws and regulations. Read through our blog post on web scraping legality to learn more about the law and extracting data from the web. Web scraping for market research is specifically permitted in the European Union by the DSM Directive.

I couldn’t find a scraper for my specific website. Can I build it?

Knock yourself out! Our platform was built to host and run thousands of scrapers. You can customize a universal Web Scraper or start a new one with some of our ready-made templates in Python, JavaScript, or TypeScript. You can keep the scraper to yourself or make it public by adding it to Apify Store (and even make a little cash out of it). You can also integrate your scraper with other popular data processing services such as Keboola, Airbyte, or Zapier.

I don’t need to download scraped data. Is there an API I can use instead?

Yes, there is. You can have programmatic access to any scraper on the platform via Apify's web scraping API. It is organized around RESTful HTTP endpoints and can be accessed either through the Python or Node.js clients, or by calling the endpoints directly. This API will enable you to fetch results directly from any of your datasets. Check out the Apify API reference docs for full details.
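
If you call the HTTP endpoints directly, most of the work is building the right URL. The sketch below only constructs URLs and sends nothing over the network. The endpoint paths and the `format` and `limit` parameters reflect our reading of the public API, so treat them as assumptions and confirm them in the API reference. The helper names and the dataset ID `abc123` are hypothetical. Authentication (for example, an API token sent in an `Authorization: Bearer` header) is left out.

```python
from typing import Optional
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"
# Assumed values of the `format` parameter, mirroring the export formats listed on this page.
EXPORT_FORMATS = {"json", "jsonl", "csv", "xlsx", "xml", "html", "rss"}


def dataset_items_url(dataset_id: str, fmt: str = "json", limit: Optional[int] = None) -> str:
    """Build the URL for downloading a dataset's items in the given format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{urlencode(params)}"


def run_sync_url(actor_name: str) -> str:
    """Build the URL for running an Actor and getting its dataset items in one call.

    In URL paths, "username/actor-name" is written as "username~actor-name".
    """
    return f"{API_BASE}/acts/{actor_name.replace('/', '~')}/run-sync-get-dataset-items"


print(dataset_items_url("abc123", "csv", limit=10))
print(run_sync_url("apify/website-content-crawler"))
```

For anything beyond quick experiments, the official Python or Node.js clients handle authentication, retries, and pagination for you.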

I'm not a developer. Can you build a custom machine learning tool for me?

Sure! We can build you a custom web scraper or, if you're searching for a more affordable solution, get an external developer to create the scraper for you via our Apify freelancer program.

I don’t need scrapers for machine learning, but I know somebody who does. Can I refer them?

Yes. Our affiliate program offers up to 30% recurring commission for its participants. You can check out the terms & conditions and sign up for Apify Affiliate here.