Wired tech news extractor
Pricing
from $5.00 / 1,000 articles
Go to Apify Store
Pricing
from $5.00 / 1,000 articles
Rating
0.0
(0)
Developer

Jan Kirchner
Maintained by Community
Actor stats
0
Bookmarked
3
Total users
1
Monthly active users
18 days ago
Last modified
Categories
Share
Python Crawlee & BeautifulSoup Actor Template
This template example was built with Crawlee for Python to scrape data from a website using Beautiful Soup wrapped into BeautifulSoupCrawler.
Quick Start
Once you've installed the dependencies, start the Actor:
$apify run
Once your Actor is ready, you can push it to the Apify Console:
apify login # first, you need to log in if you haven't already done soapify push
Project Structure
.actor/├── actor.json # Actor config: name, version, env vars, runtime settings├── dataset_schena.json # Structure and representation of data produced by an Actor├── input_schema.json # Input validation & Console form definition└── output_schema.json # Specifies where an Actor stores its outputsrc/└── main.py # Actor entry point and orchestratorstorage/ # Local storage (mirrors Cloud during development)├── datasets/ # Output items (JSON objects)├── key_value_stores/ # Files, config, INPUT└── request_queues/ # Pending crawl requestsDockerfile # Container image definition
For more information, see the Actor definition documentation.
How it works
This code is a Python script that uses BeautifulSoup to scrape data from a website. It then stores the website titles in a dataset.
- The crawler starts with URLs provided from the input
startUrlsfield defined by the input schema. Number of scraped pages is limited bymaxPagesPerCrawlfield from the input schema. - The crawler uses
requestHandlerfor each URL to extract the data from the page with the BeautifulSoup library and to save the title and URL of each page to the dataset. It also logs out each result that is being saved.
What's included
- Apify SDK - toolkit for building Actors
- Crawlee for Python - web scraping and browser automation library
- Input schema - define and easily validate a schema for your Actor's input
- Dataset - store structured data where each object stored has the same attributes
- Beautiful Soup - a library for pulling data out of HTML and XML files
- Proxy configuration - rotate IP addresses to prevent blocking
Resources
- Quick Start guide for building your first Actor
- Video introduction to Python SDK
- Webinar introducing to Crawlee for Python
- Apify Python SDK documentation
- Crawlee for Python documentation
- Python tutorials in Academy
- Integration with Zapier, Make, Google Drive and others
- Video guide on getting data using Apify API