Python Scraper
Pricing
from $2.00 / 1,000 scraped pages
Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.
Rating
5.0
(4)
Developer
Sovanza
Maintained by Community
2 days ago
Last modified
Python Scraper – Extract Web Page Data with Requests & BeautifulSoup
Scrape any public web page at scale using Python (Requests + BeautifulSoup). Extract titles, meta tags, headings, links, Open Graph data, and custom CSS fields — then export results to JSON, CSV, Excel, XML, or HTML via Apify dataset and key-value store.
What is Python Scraper and How Does It Work?
Python Scraper is a flexible general-purpose web scraping actor on Apify. It fetches each URL you provide, parses HTML with BeautifulSoup, and returns structured data per page. It is designed for:
- Developers building custom scrapers from a Python template
- Data analysts collecting page metadata at scale
- SEO and content teams auditing titles, descriptions, and headings
- Automation engineers feeding pipelines with JSON/CSV exports
Each run processes your start URLs sequentially, pushes one dataset item per URL, and writes a combined export file to the default key-value store in the format you choose.
Why Use This Python Scraper?
Use this actor to:
- Scrape multiple URLs in one Apify run without writing boilerplate
- Extract standard page fields (title, meta, H1, links, images) out of the box
- Add custom CSS selectors for any extra text fields you need
- Export all results as JSON, CSV, Excel, XML, or HTML for downstream tools
- Integrate with Apify API, schedules, and webhooks for recurring jobs
➡️ Lightweight and fast — no browser required; ideal for static HTML and simple sites.
What Data Does Python Scraper Extract?
This actor outputs one dataset item per URL, including (when available on the page):
Core page data
- url: input URL
- finalUrl: URL after redirects
- status: HTTP status code
- contentType, charset, contentLength
- title: document title
- metaDescription, metaKeywords, robots
- canonicalUrl, language
- h1: primary heading
- headings: optional h1/h2/h3 lists
- text: visible text snippet (up to 4,000 characters)
Social & media
- openGraph: Open Graph meta properties
- twitterCard: Twitter card meta properties
- images: image URLs and alt text (up to 25)
Links (optional)
- links: extracted hrefs (when includeLinks is true)
- linkCount: number of links
- firstLinkText, firstLinkUrl: first anchor on the page
Custom fields
- Any keys you define in selectors (each key is an output field name mapped to a CSS selector)
Errors
- error: message when a URL fails to fetch or parse
Export (key-value store)
- Combined file: {exportKey}.json|.csv|.xlsx|.xml|.html (based on exportFormat)
➡️ Dataset rows are structured and exportable in JSON, CSV, or Excel via Apify. Raw HTML is included in the export only when includeHtml is enabled.
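The link and Open Graph fields listed above could be produced roughly as follows. This is a minimal sketch, not the actor's actual code: the function name `extract_links_and_og` is hypothetical, resolving relative hrefs against the page URL is an assumption, and the stdlib `html.parser` backend is used here only to keep the example dependency-light.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links_and_og(html: str, base_url: str) -> dict:
    """Collect links, first-link fields, and Open Graph properties (illustrative)."""
    soup = BeautifulSoup(html, "html.parser")

    # Only anchors that actually carry an href attribute are counted.
    anchors = soup.find_all("a", href=True)
    # Assumption: relative hrefs are resolved against the page URL.
    links = [urljoin(base_url, a["href"]) for a in anchors]

    item = {"links": links, "linkCount": len(links)}
    if anchors:
        item["firstLinkText"] = anchors[0].get_text(strip=True)
        item["firstLinkUrl"] = links[0]

    # Open Graph tags look like <meta property="og:title" content="...">.
    open_graph = {}
    for meta in soup.find_all("meta", attrs={"property": True}):
        prop = meta["property"]
        if prop.startswith("og:") and meta.get("content"):
            open_graph[prop[len("og:"):]] = meta["content"]
    if open_graph:
        item["openGraph"] = open_graph
    return item
```

Stripping the `og:` prefix mirrors the shape of the `openGraph` object in the illustrative dataset item; whether the actor does exactly this is not confirmed by the README.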
Features
- Multi-URL scraping: batch many startUrls in a single run
- Custom CSS selectors: map field names to selectors (supports ::text suffix)
- Rich metadata: title, meta, canonical, Open Graph, Twitter card
- Link extraction: optional full link list per page
- Multiple export formats: JSON, CSV, Excel, XML, HTML to KV store
- Configurable HTTP: method, headers, timeout
- Clean output: empty fields omitted from dataset items
- Automation-ready: Apify API, schedules, webhooks
How to Use Python Scraper on Apify
Using the Actor
- Go to Python Scraper on the Apify platform.
- Input Configuration:
  - Add one or more start URLs (public pages you are allowed to scrape).
  - Optionally set CSS selectors for extra fields.
  - Choose export format and export key for the combined KV file.
- Run the Actor: each URL produces one dataset row; the export file is written to the key-value store.
- Access Your Results: Dataset tab for per-URL items; Key-value store for the combined export; use API links from the Output schema.
- Schedule (optional): recurring runs for monitoring or refresh workflows.
Input Configuration
The actor accepts the following parameters:
    {
      "startUrls": [
        "https://example.com",
        "https://news.ycombinator.com/"
      ],
      "method": "GET",
      "headers": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Python Scraper (Apify)"
      },
      "timeoutSecs": 60,
      "selectors": {
        "headline": "h1::text",
        "tagline": "p.lead::text"
      },
      "includeLinks": true,
      "includeHtml": false,
      "exportFormat": "json",
      "exportKey": "EXPORT"
    }
| Field | Description |
|---|---|
| startUrls | Required. URLs to scrape (one per line in Console or JSON array). |
| method | HTTP method: GET or POST (default GET). |
| headers | Optional request headers (JSON object). |
| timeoutSecs | Per-request timeout in seconds (default 60, max 300). |
| selectors | Optional { "fieldName": "css selector" } map; the first matching text is returned per field. |
| includeLinks | If true, extract and include page links (default true). |
| includeHtml | If true, include raw HTML in export (can be large; default false). |
| exportFormat | json, csv, excel, xml, or html; sets the combined KV export format. |
| exportKey | KV store key prefix (default EXPORT); file extension added automatically. |
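As a rough illustration of the parameters in the table above, the sketch below assembles an input object and checks the documented constraints before a run is started. The helper names `build_input` and `export_filename` are hypothetical, and the lower bound of 1 second on `timeoutSecs` is an assumption (the table only states the default and the maximum).

```python
# Extension per exportFormat, as listed for the combined export file.
EXPORT_EXTENSIONS = {
    "json": "json", "csv": "csv", "excel": "xlsx", "xml": "xml", "html": "html",
}


def build_input(start_urls, *, method="GET", headers=None, timeout_secs=60,
                selectors=None, include_links=True, include_html=False,
                export_format="json", export_key="EXPORT") -> dict:
    """Build an actor input dict, rejecting values the table rules out."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if method not in ("GET", "POST"):
        raise ValueError("method must be GET or POST")
    # Assumption: values below 1 are invalid; only the max of 300 is documented.
    if not 0 < timeout_secs <= 300:
        raise ValueError("timeoutSecs must be between 1 and 300")
    if export_format not in EXPORT_EXTENSIONS:
        raise ValueError(f"unsupported exportFormat: {export_format}")
    return {
        "startUrls": list(start_urls),
        "method": method,
        "headers": headers or {},
        "timeoutSecs": timeout_secs,
        "selectors": selectors or {},
        "includeLinks": include_links,
        "includeHtml": include_html,
        "exportFormat": export_format,
        "exportKey": export_key,
    }


def export_filename(run_input: dict) -> str:
    """Name of the combined export file: {exportKey}.{ext}."""
    ext = EXPORT_EXTENSIONS[run_input["exportFormat"]]
    return f'{run_input["exportKey"]}.{ext}'
```

For example, `export_filename(build_input(["https://example.com"], export_format="excel"))` evaluates to `EXPORT.xlsx`.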
Output
Dataset — one item per URL (overview table in Console):
| Field | Description |
|---|---|
| url | Input URL |
| finalUrl | Final URL after redirects |
| status | HTTP status code |
| title | Page title |
| metaDescription | Meta description |
| h1 | Primary H1 text |
| firstLinkText / firstLinkUrl | First link on page |
| linkCount | Number of links (if includeLinks) |
| contentType | Response content type |
| openGraph / twitterCard | Social meta objects |
| images | Image list with src and optional alt |
| headings | Nested h1/h2/h3 when multiple headings exist |
| error | Error message on failed URLs |
| custom | Fields from your selectors map |
Key-value store — combined export at {exportKey}.{ext} (e.g. EXPORT.json).
Example dataset item (illustrative):
    {
      "url": "https://example.com",
      "finalUrl": "https://example.com/",
      "status": 200,
      "title": "Example Domain",
      "metaDescription": "Example description",
      "h1": "Example Domain",
      "text": "This domain is for use in documentation examples...",
      "openGraph": { "title": "Example", "type": "website" },
      "linkCount": 1,
      "contentType": "text/html"
    }
Use the actor Output schema in Apify Console for direct API links to dataset items and export files.
How the Scraper Works
- Requests fetches each startUrl with your headers and timeout.
- BeautifulSoup (lxml) parses HTML and extracts structured fields.
- Custom selectors run per field name you define in input.
- Each result is pushed to the default dataset (empty values omitted).
- All results are serialized to the key-value store in the chosen exportFormat.
Reliability & Best Practices
- Respect robots.txt, site terms, and rate limits for target sites.
- Use realistic User-Agent and headers for sites that block bots.
- Increase timeoutSecs for slow pages.
- Disable includeHtml unless you need raw HTML (large files).
- For JavaScript-heavy SPAs, consider a Playwright-based actor instead.
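One way to act on the robots.txt advice above is to filter start URLs before submitting them. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed` and the user-agent string are illustrative, and nothing in the README indicates the actor performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """User-agent: *
Disallow: /private/
"""

allowed = is_allowed(rules, "PythonScraper", "https://example.com/blog")
blocked = is_allowed(rules, "PythonScraper", "https://example.com/private/page")
```

In practice the robots.txt text would first be downloaded from each target site; that step is left out so the example stays free of network calls.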
Performance
- No browser overhead — fast for static HTML.
- Batch many URLs in one run; sequential fetching keeps memory predictable.
- Large includeHtml exports increase KV store size.
Use Cases
- SEO audits (titles, meta, H1, canonical)
- Content monitoring and change detection (scheduled runs)
- Lead / directory page extraction with custom selectors
- Research datasets and internal analytics pipelines
- Template for building custom Python scrapers on Apify
Integrations & API
- Full API via Apify (apify-client Python / Node.js)
- Zapier, Make, Google Sheets via dataset export
- Webhooks and scheduled runs
- Output schema links for dataset and KV export URLs
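For readers who prefer plain HTTP over the apify-client packages, a run can in principle be started through the Apify API v2 "run actor" endpoint. The sketch below only builds the request with the standard library and does not send it; the actor ID `username~python-scraper` and the token placeholder are hypothetical, so substitute the real values from the actor's API tab.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST request that would start an actor run with the given input."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical actor ID and token placeholder, for illustration only.
request = build_run_request(
    "username~python-scraper",
    "<APIFY_TOKEN>",
    {"startUrls": ["https://example.com"], "exportFormat": "json"},
)
# urllib.request.urlopen(request) would send it; omitted to avoid side effects.
```

The same input object can be passed when creating a schedule or webhook-triggered run, which is how updated `startUrls` would be supplied for recurring jobs.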
FAQ
Can I scrape any website with this actor?
Only pages you are legally permitted to access and process. You must comply with each site’s terms, robots.txt, and applicable privacy laws. This tool does not bypass paywalls or authentication by default.
Does this actor run JavaScript?
No. It uses HTTP + HTML parsing. Dynamic sites that render content only in the browser may need a Playwright/Puppeteer actor.
How do custom CSS selectors work?
Add a JSON object named selectors whose keys are output field names and whose values are CSS selectors. The ::text suffix is optional; the actor normalizes selectors and extracts the text of the first match.
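The selector normalization mentioned in this answer might look like the following. This is a guess at the behaviour based on the description, not the actor's code, and `normalize_selector` is a hypothetical name.

```python
def normalize_selector(selector: str) -> str:
    """Strip an optional '::text' suffix so the selector is plain CSS."""
    selector = selector.strip()
    suffix = "::text"
    if selector.endswith(suffix):
        selector = selector[: -len(suffix)].rstrip()
    return selector


selectors = {"headline": "h1::text", "tagline": "p.lead::text"}
# Field name -> plain CSS selector, ready for a CSS-aware HTML parser.
normalized = {field: normalize_selector(css) for field, css in selectors.items()}
```

Each normalized selector could then be passed to BeautifulSoup's `select_one`, taking the text of the returned element as the field value.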
Where is the full export file?
In the run’s default key-value store, named {exportKey}.{extension} (e.g. EXPORT.csv). Dataset items are always written per URL regardless of export format.
What happens if one URL fails?
That URL gets a dataset row with error; other URLs still process. The combined export includes both successful and failed rows.
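Because empty fields are omitted and failed URLs carry only an error, rows in the combined export do not all share the same keys. A minimal sketch of writing such rows to CSV with the standard library is shown below; `rows_to_csv` is a hypothetical helper, and JSON-encoding nested values is an assumption about how objects such as openGraph could be flattened.

```python
import csv
import io
import json


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows with differing keys into one CSV document."""
    # Union of all keys, in first-seen order, becomes the header.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills cells for keys a row does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for row in rows:
        # Assumption: nested dicts and lists are stored as JSON strings.
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in row.items()
        })
    return buffer.getvalue()


rows = [
    {"url": "https://example.com", "status": 200, "title": "Example Domain"},
    {"url": "https://bad.invalid", "error": "timeout"},
]
csv_text = rows_to_csv(rows)
```

Here the failed row simply leaves the status and title cells blank and fills the error column.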
Can I automate runs on a schedule?
Yes. Use Apify schedules and pass updated startUrls via API or integrations.
Why Choose This Actor?
- Popular Python stack (Requests + BeautifulSoup)
- Per-URL dataset + combined multi-format export
- Custom CSS selectors without code changes
- Clear overview table via dataset schema
- Ideal starter template for Apify Python projects
Limitations
- No JavaScript rendering — static HTML only.
- No built-in proxy — add headers or use Apify Proxy at platform level if needed for your deployment.
- Sequential requests — no built-in concurrency per URL.
- Some sites block datacenter IPs or require cookies — not included by default.
Running locally
- pip install -r requirements.txt
- Create INPUT.json in the actor folder, e.g.:

      {
        "startUrls": ["https://example.com"],
        "exportFormat": "json",
        "exportKey": "EXPORT"
      }

- Run with Apify CLI: apify run or execute main.py in an Apify-compatible environment.
Deploy to Apify
Use INPUT_SCHEMA.json, OUTPUT_SCHEMA.json, .actor/dataset_schema.json, and the provided Dockerfile. Push with Apify CLI or connect the Git repository.
Get Started
Add your URLs, define optional selectors, pick an export format, and start extracting structured web data with Python on Apify. 🚀