Machine learning
Web data is the fuel of AI, machine learning, and LLMs. Get the data you need for your ML projects.
Infinite web data to power up your machine learning
Web scraping has made gathering large training datasets from the web much easier, but the more complex your AI, the greater the size of the dataset you need. To acquire diverse data from a wide range of sources, you need web scrapers that can scale. Apify has the tools and expertise to get the data you need fast.
Data ingestion for LLMs
Data ingestion is a process that begins with data collection. The data collected needs to be relevant to the task the LLM is being trained for. That means you need the right scraping tool for the right data type. Apify has a range of tools designed to get specific kinds of data. Automatically filter what you need to feed and train your large language models.
Website Content Crawler
apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
28.6k
711
Instagram Profile Scraper
apify/instagram-profile-scraper
Scrape all Instagram profile info. Just add one or more Instagram usernames and extract the number of followers and follows, URLs, bio, posts, likes, counts, related profiles, captions, and highlight reels. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools.
39.1k
232
Google Search Results Scraper
apify/google-search-scraper
Scrape Google Search Engine Results Pages (SERPs). Select the country or language and extract organic and paid results, AI overviews, ads, queries, People Also Ask, prices, reviews, like a Google SERP API. Export scraped data, run the scraper via API, schedule runs, or integrate with other tools.
50.9k
253
Twitter Scraper
quacker/twitter-scraper
Scrape tweets from any Twitter user profile. Top Twitter API alternative to scrape Twitter hashtags, threads, replies, followers, images, videos, statistics, and Twitter history. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.
29.9k
112
Natural language processing
Scrape online reviews to process and analyze large amounts of natural language data. For instance, Yelp Scraper checks the web for the latest reviews of selected restaurants. Or get reviews from the Google Play Store for your favorite app.
Google Maps Scraper
compass/crawler-google-places
Extract data from hundreds of Google Maps locations and businesses. Get Google Maps data including reviews, images, contact info, opening hours, location, popular times, prices & more. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools.
80.2k
607
Google Play Data Extractor
epctex/google-play-scraper
Get valuable info & reviews from Google Play! Access title, price, ratings, download rates, screenshots, release date, version number & developer details for any region or language. Unlimited & lightning-fast extraction. Export data in XML, JSON, CSV, Excel, or HTML formats.
925
9
AI Text Analyzer for Google Reviews
geneea-analytics/reviews-text-nlp-analyzer
Quickly analyze customer reviews extracted by Google Maps Scraper. Find out what the most frequently used keywords are in each review. Learn how people view your staff and prices. Obtain structured information from unstructured text. Monitor changes in customers’ sentiment over time.
342
12
Yelp Scraper
tri_angle/yelp-scraper
Free Yelp web scraper to extract data from Yelp. Fast Yelp review scraper, but also gets business details and ratings without using the Yelp API.
3.4k
26
Image recognition
Deep learning image recognition methods achieve the best results in terms of performance and flexibility. Scrape images from the web to create custom models on a specific dataset.
Bulk Image Downloader
trudax/bulk-image-downloader
Download all images from a website with this easy-to-use Bulk Image Downloader. Scrape all images from any website by URL to a zip file with a single click.
2.2k
10
DALL-E 2 Image Generation
jirimoravcik/dalle-2-image-generation
This Actor enables you to generate images using OpenAI's DALL-E 2.
61
2
Product mapping for e-commerce
Apply machine learning to compare collected retail data. Identify, classify, and match products and prices across multiple e-commerce websites for competitive pricing intelligence.
AliExpress Scraper
epctex/aliexpress-scraper
Effortlessly extract descriptions, images, feedback, questions, prices, and shipping information from AliExpress. Customize country, language, and region preferences for enhanced data gathering.
1.5k
21
Google Shopping Insights
epctex/google-shopping-scraper
Unlock valuable insights from Google Shopping with our Data Extractor. Get reviews, descriptions, prices, merchant details, and affiliation links. Export data in JSON, XML, CSV, Excel, and HTML formats with no limits!
1.7k
27
Amazon Product Scraper
junglee/Amazon-crawler
Use this Amazon scraper to collect data based on URL and country from the Amazon website. Extract product information without using the Amazon API, including reviews, prices, descriptions, and Amazon Standard Identification Numbers (ASINs). Download data in various structured formats.
7.5k
90
News aggregation
Upgrade and train your models by adding new data from crawling global news sources. Track public sentiment, identify relationships, spot fake news, and gather up-to-date intelligence.
Website Content Crawler
apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
28.6k
711
Smart Article Extractor
lukaskrivka/article-extractor-smart
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.
4.1k
68
Product mapping with AI
After extracting the data you need, use our AI Product Matcher to find product pairs in the provided datasets, for example to compare your prices with your competitors' prices.
Get a head start with further data manipulation, all while keeping your data safe within Apify's ecosystem, where you can integrate your workflow with other platforms and schedule your tasks to run on a regular basis.
Let the machine learn
GPT Scraper lets you extract data from any website and feed it into GPT. Watch our tutorial on how you can set it up to proofread content, summarize reviews, or extract contact details.
Sign up
First, create an Apify account. It’s free, no credit card is required, and you get $5 free prepaid platform usage every month!
Choose an Actor
Apify Store features hundreds of pre-built tools (we call them Actors) for extracting data from different websites. Check out the machine learning scrapers that could fit your use case.
Get your data
After everything’s set up, run the Actor. As soon as it’s successful, you’ll be able to download your data in Excel, JSON, HTML, and many other formats.
Schedule, integrate, monitor
You can further automate your workflow by saving the data to Google Drive, sending automated Gmail and Slack notifications, or monitoring and scheduling your Actor runs.
Every plan (free included) comes with Apify Proxy, which is great for avoiding blocking and giving you access to geo-specific content.
With our latest monitoring features, you always have immediate access to valuable insights on the status of your web scraping tasks.
Your datasets can be exported to any format that suits your data workflow, including Excel, CSV, JSON, XML, HTML table, JSONL, and RSS.
You can integrate your Apify runs with platforms such as Zapier, Make, Keboola, Google Drive, or GitHub. Connect with practically any cloud service or web app.
Apify is built by developers, so you'll be in good hands if you have any technical questions. Our Discord server is always here to help!
Web scraping is the automated process of extracting data from websites using software. Machine learning uses this data to train models for various applications such as sentiment analysis, recommender systems, and fraud detection.
It’s important to monitor and check for errors in your data and to make sure that the data is representative of the population it’s meant to represent. Sampling techniques and data cleaning methods can help improve data quality.
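As a rough illustration of the cleaning and sampling steps mentioned above, here is a minimal Python sketch. It assumes each scraped record is a dictionary with a `text` field; that field name, the `min_chars` threshold, and both helper functions are hypothetical and not part of any Apify tool.

```python
import random


def clean_records(records, min_chars=20):
    """Drop empty, very short, and duplicate texts.

    Assumes a hypothetical 'text' field on every record.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Collapse runs of whitespace; treat a missing text as empty.
        text = " ".join((record.get("text") or "").split())
        key = text.lower()  # case-insensitive duplicate check
        if len(text) < min_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append({**record, "text": text})
    return cleaned


def sample_records(records, k, seed=0):
    """Draw a reproducible random sample for manual spot checks."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


raw = [
    {"text": "Great   pizza, friendly staff, fair prices."},
    {"text": "great pizza, friendly staff, fair prices."},
    {"text": "ok"},
    {"text": None},
    {"text": "Slow service, but the pasta was worth the wait."},
]
cleaned = clean_records(raw)
spot_check = sample_records(cleaned, k=5)
```

Reviewing a small random sample by hand is one inexpensive way to notice whether the cleaned data still reflects the sources you intended to cover.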
In supervised learning, scraped data can be labeled for training classification or regression models. In unsupervised learning, it can be used for clustering or association analysis to uncover patterns and relationships in the data.
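One simple way to obtain labels for supervised learning is to derive them from a signal that is already in the scraped data. The sketch below turns review star ratings into sentiment labels. The `text` and `stars` field names and the rating thresholds are assumptions chosen for illustration, not the output schema of a specific scraper.

```python
def label_review(stars):
    """Map a star rating to a sentiment label, or None if ambiguous."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # 3-star reviews are treated as ambiguous and skipped


def build_labeled_dataset(reviews):
    """Return (text, label) pairs ready for a text classifier."""
    dataset = []
    for review in reviews:
        label = label_review(review["stars"])
        if label is not None:
            dataset.append((review["text"], label))
    return dataset


reviews = [
    {"text": "Loved the food and the service.", "stars": 5},
    {"text": "It was fine, nothing special.", "stars": 3},
    {"text": "Cold meal and a long wait.", "stars": 1},
]
labeled = build_labeled_dataset(reviews)
```

The resulting pairs can be passed to any classification library. For unsupervised work, the same texts can be used without the labels.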
It is legal to scrape publicly available data such as product descriptions, prices, or ratings. On the other hand, certain types of data, such as personal data or copyrighted content, are under special legal protection, and you should not scrape these without first making sure you follow the relevant laws and regulations. Read through our blog post on web scraping legality to learn more about the law and extracting data from the web. Web scraping for market research is specifically permitted in the European Union by the DSM directive.
Knock yourself out! Our platform was built to host and run thousands of scrapers. You can customize a universal Web Scraper or start a new one with some of our ready-made templates in Python, JavaScript, or TypeScript. You can keep the scraper to yourself or make it public by adding it to Apify Store (and even make a little cash out of it). You can also integrate your scraper with other popular data processing services such as Keboola, Airbyte, or Zapier.
Yes, there is. You can have programmatic access to any scraper on the platform via Apify's web scraping API. It is organized around RESTful HTTP endpoints and can be accessed either by using Python or Node.js clients, or manually. This API will enable you to fetch results directly from any of your datasets. Check out the Apify API reference docs for full details.
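For the manual route, fetching dataset results over HTTP might look like the following sketch, which uses only the Python standard library. The endpoint path and the `format` and `limit` query parameters reflect our understanding of version 2 of the Apify API and should be checked against the API reference. The helper names, the dataset ID, and the token are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"  # assumed base URL, verify in the docs


def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build the URL for listing the items of one dataset."""
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    query = urllib.parse.urlencode(params)
    safe_id = urllib.parse.quote(dataset_id, safe="")
    return f"{API_BASE}/datasets/{safe_id}/items?{query}"


def fetch_dataset_items(dataset_id, token, limit=None):
    """Download dataset items as parsed JSON (performs a network call)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id, "json", limit),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the URL needs no network access or token.
url = dataset_items_url("abc123", fmt="csv", limit=10)
```

Calling `fetch_dataset_items("<your dataset ID>", "<your API token>")` would perform the actual request. The official Python and Node.js clients wrap the same endpoints and are usually more convenient for larger projects.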
Sure! We can build you a custom web scraper or, if you're searching for a more affordable solution, get an external developer to create the scraper for you via our Apify freelancer program.
Yes. Our affiliate program offers up to 50% recurring commission for its participants. You can check out the terms & conditions and sign up for Apify Affiliate here.