Python Scraper
Pricing
from $2.00 / 1,000 scraped pages
Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.
Rating
5.0
(4)
Developer
Sovanza
Maintained by Community
Python Scraper – Extract Web Page Data with Requests & BeautifulSoup
Python Scraper is a general-purpose Apify Actor that fetches public web pages with Requests, parses HTML with BeautifulSoup, and returns structured metadata per URL. Export combined results to JSON, CSV, Excel, XML, or HTML in the default key-value store.
Overview
Each run processes your startUrls sequentially, pushes one dataset item per URL, and writes a combined export file named {exportBasename}.{ext}.
- No browser — fast for static HTML and simple sites
- Custom CSS selectors — map field names to selectors without code changes
- Secret inputs — cookies, bearer tokens, and request headers are encrypted in Apify storage
- Automation-ready — API, schedules, and webhooks
Quick start
- Open Python Scraper on Apify and add one or more start URLs (e.g. https://example.com).
- Optionally set CSS selectors for extra fields and choose an export format.
- Click Run — inspect the Dataset tab for per-URL rows.
- Download the combined file from the run’s Key-value store (see Output schema links in Console).
Minimal input example
{"startUrls": ["https://example.com"],"exportFormat": "json","exportBasename": "EXPORT"}
Input configuration
| Field | Required | Description |
|---|---|---|
| startUrls | Yes | URLs to scrape (string list). |
| method | No | GET or POST (default GET). |
| cookies | No | Secret. Raw Cookie header for logged-in sessions. |
| bearerToken | No | Secret. Bearer token sent as Authorization header. |
| headers | No | Secret. Extra HTTP headers (JSON object). Default User-Agent applied if omitted. |
| timeoutSecs | No | Per-request timeout in seconds (default 60, max 300). |
| selectors | No | Map of fieldName → CSS selector (supports ::text suffix). |
| includeLinks | No | Extract page links (default true). |
| includeHtml | No | Include raw HTML in KV export only (default false; can be large). |
| exportFormat | No | json, csv, excel, xml, or html. |
| exportBasename | No | KV export basename (default EXPORT); extension added automatically. |
Full input example (with secrets and selectors)
{"startUrls": ["https://example.com","https://news.ycombinator.com/"],"method": "GET","cookies": "session=YOUR_SESSION; path=/","bearerToken": "YOUR_API_TOKEN","headers": {"Accept-Language": "en-US,en;q=0.9"},"timeoutSecs": 60,"selectors": {"headline": "h1::text","tagline": "p.lead::text"},"includeLinks": true,"includeHtml": false,"exportFormat": "csv","exportBasename": "EXPORT"}
Legacy input: exportKey is still accepted at runtime, but the schema now names this field exportBasename.
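The defaults and limits in the table above can be sketched as a small normalization step. This is an illustrative sketch, not the actor's actual code: the function name `normalize_input`, the choice to clamp an out-of-range `timeoutSecs` instead of rejecting it, the `json` fallback for `exportFormat`, and the rule that `exportBasename` wins over the legacy `exportKey` are all assumptions.

```python
from typing import Any

ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_FORMATS = {"json", "csv", "excel", "xml", "html"}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply the documented defaults and limits to a raw input dict (hypothetical helper)."""
    urls = raw.get("startUrls") or []
    if not urls:
        raise ValueError("startUrls is required and must not be empty")

    method = str(raw.get("method", "GET")).upper()
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be GET or POST, got {method!r}")

    # Assumption: json is the fallback; the table does not state a default format.
    export_format = str(raw.get("exportFormat", "json")).lower()
    if export_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported exportFormat: {export_format!r}")

    # Assumption: out-of-range timeouts are clamped to 1..300 rather than rejected.
    timeout = min(max(int(raw.get("timeoutSecs", 60)), 1), 300)

    # Assumption: exportBasename takes precedence over the legacy exportKey.
    basename = raw.get("exportBasename") or raw.get("exportKey") or "EXPORT"

    return {
        "startUrls": list(urls),
        "method": method,
        "timeoutSecs": timeout,
        "selectors": dict(raw.get("selectors") or {}),
        "includeLinks": bool(raw.get("includeLinks", True)),
        "includeHtml": bool(raw.get("includeHtml", False)),
        "exportFormat": export_format,
        "exportBasename": basename,
    }
```

Calling `normalize_input({"startUrls": ["https://example.com"]})` fills in every optional field, which makes the effective configuration easy to log or inspect before a run.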
Output
Dataset (one row per URL)
| Field | Description |
|---|---|
| url | Input URL |
| finalUrl | URL after redirects |
| status | HTTP status code |
| title | Document title |
| metaDescription | Meta description |
| h1 | Primary heading |
| openGraph / twitterCard | Social meta objects |
| links / linkCount | Extracted links when includeLinks is true |
| error | Error message when a URL fails |
| custom | Fields from your selectors map |
Example dataset item
{"url": "https://example.com","finalUrl": "https://example.com/","status": 200,"title": "Example Domain","metaDescription": "Example description","h1": "Example Domain","text": "This domain is for use in documentation examples...","linkCount": 1,"contentType": "text/html"}
Key-value store export
Combined file: {exportBasename}.json, .csv, .xlsx, .xml, or .html depending on exportFormat. Use the actor Output schema in Console for direct API links.
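As a quick reference, the mapping from exportFormat to the file name in the key-value store could look like the sketch below. The helper name `export_filename` is hypothetical; the extensions follow the list above, where excel is the only format whose extension differs from its name.

```python
# exportFormat value -> file extension, as listed in this section.
EXTENSIONS = {
    "json": "json",
    "csv": "csv",
    "excel": "xlsx",
    "xml": "xml",
    "html": "html",
}


def export_filename(export_format: str, basename: str = "EXPORT") -> str:
    """Build the key-value store record name for a given export format."""
    try:
        ext = EXTENSIONS[export_format.lower()]
    except KeyError:
        raise ValueError(f"unsupported exportFormat: {export_format!r}") from None
    return f"{basename}.{ext}"
```

For example, `export_filename("excel")` yields `EXPORT.xlsx`, which is the key to look for when downloading the combined file.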
How it works
- Requests fetches each URL with merged secret headers (cookies, bearer token, custom headers).
- BeautifulSoup (lxml) parses HTML and extracts standard fields.
- Custom selectors run per configured field name.
- Each page is pushed to the default dataset (empty values omitted).
- All rows are serialized to the key-value store in the chosen format.
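Two of the steps above, merging the secret inputs into request headers and omitting empty values from dataset items, can be sketched with plain dictionaries. This is a minimal sketch under stated assumptions: `build_headers`, `drop_empty`, and the `DEFAULT_USER_AGENT` string are hypothetical, and the precedence shown here (custom headers override the cookie and bearer token values) is a guess, since the README does not specify it.

```python
from typing import Any, Optional

# Hypothetical placeholder; the actor's real default User-Agent is not documented here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_headers(
    cookies: Optional[str] = None,
    bearer_token: Optional[str] = None,
    headers: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Merge cookies, bearer token, and custom headers into one header dict."""
    merged: dict[str, str] = {}
    if cookies:
        merged["Cookie"] = cookies
    if bearer_token:
        merged["Authorization"] = f"Bearer {bearer_token}"
    # Assumption: custom headers are applied last and may override the values above.
    merged.update(headers or {})
    # Apply the default User-Agent only if the caller supplied none (any letter case).
    if not any(name.lower() == "user-agent" for name in merged):
        merged["User-Agent"] = DEFAULT_USER_AGENT
    return merged


def drop_empty(item: dict[str, Any]) -> dict[str, Any]:
    """Remove keys whose value is None, an empty string, or an empty list/dict."""
    return {k: v for k, v in item.items() if v not in (None, "", [], {})}
```

The resulting dictionary can be passed as the `headers` argument of `requests.get` or `requests.post`, and `drop_empty` would run on each item just before it is pushed to the dataset.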
Authentication & sensitive input
Fields that can hold credentials use Apify secret input (isSecret: true):
- cookies: session cookies
- bearerToken: API or OAuth bearer tokens
- headers: JSON object for Authorization, API keys, or other sensitive headers
Secret values are encrypted in storage and are not copied into dataset rows or logs.
Best practices
- Respect robots.txt, site terms, and rate limits.
- Use realistic headers or cookies only when you are allowed to access the target pages.
- Increase timeoutSecs for slow sites.
- Keep includeHtml disabled unless you need raw HTML (large KV files).
- For JavaScript-heavy SPAs, use a Playwright-based actor instead.
Use cases
- SEO audits (title, meta, H1, canonical)
- Content monitoring with scheduled runs
- Directory or listing extraction with custom selectors
- Starter template for custom Python scrapers on Apify
Limitations
- No JavaScript rendering — static HTML only.
- Sequential requests — one URL at a time per run.
- No built-in Apify Proxy configuration in this actor (add headers/cookies or use platform proxy at account level if needed).
FAQ
Does this actor run JavaScript?
No. It uses HTTP + HTML parsing. Dynamic sites may need Playwright or Puppeteer.
How do custom CSS selectors work?
Add a JSON object selectors where keys are output field names and values are CSS selectors. The actor extracts the first match’s text.
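The `::text` suffix is not part of standard CSS, so one plausible way to handle it is to strip the suffix before the selector reaches the HTML parser. The sketch below illustrates that idea only; `split_selector` and `validate_selectors` are hypothetical names, and the actor may treat the suffix differently.

```python
TEXT_SUFFIX = "::text"


def split_selector(spec: str) -> tuple[str, bool]:
    """Return (css_selector, had_text_suffix) for one value of the selectors map."""
    spec = spec.strip()
    if spec.endswith(TEXT_SUFFIX):
        return spec[: -len(TEXT_SUFFIX)].strip(), True
    return spec, False


def validate_selectors(selectors: dict[str, str]) -> dict[str, str]:
    """Check field names and selectors, returning plain CSS selectors per field."""
    cleaned: dict[str, str] = {}
    for field, spec in selectors.items():
        css, _ = split_selector(spec)
        if not field.strip() or not css:
            raise ValueError(f"invalid selector entry: {field!r} -> {spec!r}")
        cleaned[field.strip()] = css
    return cleaned
```

With BeautifulSoup, each cleaned selector could then be passed to `soup.select_one(css)`, and the text of the first match read with `get_text(strip=True)`.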
What if one URL fails?
That URL gets a dataset row with error; other URLs still process.
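The per-URL error isolation described here can be sketched as a sequential loop that converts any exception into an error row. This is an assumption-laden illustration: `process_urls` is a hypothetical name, the `fetch` callable stands in for the actor's real fetch-and-parse step, and the exact format of the error message is invented.

```python
from typing import Any, Callable

Row = dict[str, Any]


def process_urls(urls: list[str], fetch: Callable[[str], Row]) -> list[Row]:
    """Process URLs one at a time; a failing URL yields an error row instead of aborting."""
    rows: list[Row] = []
    for url in urls:  # sequential: one URL at a time
        try:
            row = fetch(url)
        except Exception as exc:  # broad on purpose so one bad URL cannot stop the run
            row = {"url": url, "error": f"{type(exc).__name__}: {exc}"}
        rows.append(row)
    return rows
```

Because `fetch` is injected, the loop can be exercised offline with a stub function that raises for selected URLs, and the remaining URLs still produce normal rows.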
Where is the combined export file?
In the run’s default key-value store as {exportBasename}.{extension} (e.g. EXPORT.csv).
Run locally
- pip install -r requirements.txt
- Create INPUT.json in this folder (see examples above).
- apify run or run main.py in an Apify-compatible environment.
Deploy
Push with the Apify CLI or connect the Git repository. Project files: INPUT_SCHEMA.json, OUTPUT_SCHEMA.json, .actor/dataset_schema.json, and Dockerfile.
Get started
Add your URLs, optional selectors and export format, and start extracting structured web data with Python on Apify.


