Python Scraper

Pricing

from $2.00 / 1,000 scraped pages

Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.

Rating: 5.0 (4)

Developer: Sovanza (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Python Scraper – Extract Web Page Data with Requests & BeautifulSoup

Scrape any public web page at scale using Python (Requests + BeautifulSoup). Extract titles, meta tags, headings, links, Open Graph data, and custom CSS fields — then export results to JSON, CSV, Excel, XML, or HTML via Apify dataset and key-value store.

What is Python Scraper and How Does It Work?

Python Scraper is a flexible general-purpose web scraping actor on Apify. It fetches each URL you provide, parses HTML with BeautifulSoup, and returns structured data per page. It is designed for:

  • Developers building custom scrapers from a Python template
  • Data analysts collecting page metadata at scale
  • SEO and content teams auditing titles, descriptions, and headings
  • Automation engineers feeding pipelines with JSON/CSV exports

Each run processes your start URLs sequentially, pushes one dataset item per URL, and writes a combined export file to the default key-value store in the format you choose.

Why Use This Python Scraper?

Use this actor to:

  • Scrape multiple URLs in one Apify run without writing boilerplate
  • Extract standard page fields (title, meta, H1, links, images) out of the box
  • Add custom CSS selectors for any extra text fields you need
  • Export all results as JSON, CSV, Excel, XML, or HTML for downstream tools
  • Integrate with Apify API, schedules, and webhooks for recurring jobs

➡️ Lightweight and fast — no browser required; ideal for static HTML and simple sites.

What Data Does Python Scraper Extract?

This actor outputs one dataset item per URL, including (when available on the page):

Core page data

  • url — input URL
  • finalUrl — URL after redirects
  • status — HTTP status code
  • contentType, charset, contentLength
  • title — document title
  • metaDescription, metaKeywords, robots
  • canonicalUrl, language
  • h1 — primary heading
  • headings — optional h1 / h2 / h3 lists
  • text — visible text snippet (up to 4,000 characters)

Social & media

  • openGraph — Open Graph meta properties
  • twitterCard — Twitter card meta properties
  • images — image URLs and alt text (up to 25)
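
The metadata fields listed above can be illustrated with a small dependency-free sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without installing anything; the class name MetaExtractor and the function extract_metadata are hypothetical and are not taken from the actor's source. The cap of 25 images follows the field list above.

```python
from html.parser import HTMLParser

MAX_IMAGES = 25  # cap stated in the field list above


class MetaExtractor(HTMLParser):
    """Collects title, meta description, Open Graph, Twitter card, and images."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.open_graph = {}
        self.twitter_card = {}
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attr.get("content") or ""
            prop = attr.get("property") or ""
            name = attr.get("name") or ""
            if prop.startswith("og:"):
                self.open_graph[prop[len("og:"):]] = content
            elif name.startswith("twitter:"):
                self.twitter_card[name[len("twitter:"):]] = content
            elif name.lower() == "description":
                self.meta_description = content
        elif tag == "img" and attr.get("src") and len(self.images) < MAX_IMAGES:
            image = {"src": attr["src"]}
            if attr.get("alt"):
                image["alt"] = attr["alt"]  # alt text is optional
            self.images.append(image)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict shaped like the social and media fields of a dataset item."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "metaDescription": parser.meta_description,
        "openGraph": parser.open_graph,
        "twitterCard": parser.twitter_card,
        "images": parser.images,
    }
```

With BeautifulSoup the same fields would typically be read through soup.title, soup.find_all("meta"), and soup.find_all("img"); the sketch only shows the shape of the result.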

Links (optional)

  • links — extracted hrefs (when includeLinks is true)
  • linkCount — number of links
  • firstLinkText, firstLinkUrl — first anchor on the page
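
As a rough illustration of the link fields above, the following sketch collects anchors with the standard library's html.parser (again a stand-in for BeautifulSoup). Resolving relative hrefs against the page URL with urljoin is an assumption about the actor's behavior, and LinkCollector and extract_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects anchors that carry an href, together with their visible text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._current = None  # anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Assumption: relative links are resolved against the page URL.
                self._current = {"url": urljoin(self.base_url, href), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace inside the anchor text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.links.append(self._current)
            self._current = None


def extract_links(html, base_url):
    """Return links, linkCount, and the first anchor's text and URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    result = {
        "links": [link["url"] for link in collector.links],
        "linkCount": len(collector.links),
    }
    if collector.links:
        result["firstLinkText"] = collector.links[0]["text"]
        result["firstLinkUrl"] = collector.links[0]["url"]
    return result
```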

Custom fields

  • Any keys you define in selectors (CSS selector → field name)

Errors

  • error — message when a URL fails to fetch or parse

Export (key-value store)

  • Combined file: {exportKey}.json | .csv | .xlsx | .xml | .html (based on exportFormat)

➡️ Dataset rows are structured and exportable in JSON, CSV, or Excel via Apify. Raw HTML is included in the export only when includeHtml is enabled.

Features

  • Multi-URL scraping — batch many startUrls in a single run
  • Custom CSS selectors — map field names to selectors (supports ::text suffix)
  • Rich metadata — title, meta, canonical, Open Graph, Twitter card
  • Link extraction — optional full link list per page
  • Multiple export formats — JSON, CSV, Excel, XML, HTML to KV store
  • Configurable HTTP — method, headers, timeout
  • Clean output — empty fields omitted from dataset items
  • Automation-ready — Apify API, schedules, webhooks

How to Use Python Scraper on Apify

Using the Actor

  1. Go to Python Scraper on the Apify platform.

  2. Input Configuration:

    • Add one or more start URLs (public pages you are allowed to scrape).
    • Optionally set CSS selectors for extra fields.
    • Choose export format and export key for the combined KV file.

  3. Run the Actor: each URL produces one dataset row, and the export file is written to the key-value store.

  4. Access Your Results: open the Dataset tab for per-URL items and the key-value store for the combined export, or use the API links from the Output schema.

  5. Schedule (optional): set up recurring runs for monitoring or refresh workflows.

Input Configuration

The actor accepts the following parameters:

{
  "startUrls": [
    "https://example.com",
    "https://news.ycombinator.com/"
  ],
  "method": "GET",
  "headers": {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Python Scraper (Apify)"
  },
  "timeoutSecs": 60,
  "selectors": {
    "headline": "h1::text",
    "tagline": "p.lead::text"
  },
  "includeLinks": true,
  "includeHtml": false,
  "exportFormat": "json",
  "exportKey": "EXPORT"
}

| Field | Description |
| --- | --- |
| startUrls | Required. URLs to scrape (one per line in Console or JSON array). |
| method | HTTP method: GET or POST (default GET). |
| headers | Optional request headers (JSON object). |
| timeoutSecs | Per-request timeout in seconds (default 60, max 300). |
| selectors | Optional { "fieldName": "css selector" } map; the first matching text is returned per field. |
| includeLinks | If true, extract and include page links (default true). |
| includeHtml | If true, include raw HTML in export (can be large; default false). |
| exportFormat | json, csv, excel, xml, or html; sets the combined KV export format. |
| exportKey | KV store key prefix (default EXPORT); file extension added automatically. |
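
The defaults and limits described for the input fields can be expressed as a small validation helper. This is a sketch under stated assumptions, not the actor's own code: normalize_input is a hypothetical name, the json default for exportFormat is assumed from the example input, and raising on an out-of-range timeout (rather than clamping it) is a design choice made here for clarity.

```python
DEFAULTS = {
    "method": "GET",
    "headers": {},
    "timeoutSecs": 60,
    "selectors": {},
    "includeLinks": True,
    "includeHtml": False,
    "exportFormat": "json",  # assumption: taken from the example input
    "exportKey": "EXPORT",
}
ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_FORMATS = {"json", "csv", "excel", "xml", "html"}
MAX_TIMEOUT_SECS = 300


def normalize_input(raw):
    """Apply documented defaults and reject values outside the documented ranges."""
    if not raw.get("startUrls"):
        raise ValueError("startUrls is required and must not be empty")

    config = {**DEFAULTS, **raw}
    config["method"] = str(config["method"]).upper()

    if config["method"] not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}")
    if config["exportFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"exportFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0 < config["timeoutSecs"] <= MAX_TIMEOUT_SECS:
        raise ValueError(f"timeoutSecs must be between 1 and {MAX_TIMEOUT_SECS}")
    return config
```

Calling normalize_input({"startUrls": ["https://example.com"]}) would return the input with every optional field filled in from DEFAULTS.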

Output

Dataset — one item per URL (overview table in Console):

| Field | Description |
| --- | --- |
| url | Input URL |
| finalUrl | Final URL after redirects |
| status | HTTP status code |
| title | Page title |
| metaDescription | Meta description |
| h1 | Primary H1 text |
| firstLinkText / firstLinkUrl | First link on page |
| linkCount | Number of links (if includeLinks) |
| contentType | Response content type |
| openGraph / twitterCard | Social meta objects |
| images | Image list with src and optional alt |
| headings | Nested h1/h2/h3 when multiple headings exist |
| error | Error message on failed URLs |
| custom | Fields from your selectors map |

Key-value store — combined export at {exportKey}.{ext} (e.g. EXPORT.json).
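
To make the naming rule concrete, here is a minimal sketch of how a combined export could be named and serialized. The extension mapping follows the file names listed earlier ({exportKey}.json, .csv, .xlsx, .xml, .html); the helper names export_filename and serialize_rows are hypothetical, only the JSON and CSV branches are shown, and encoding nested values as JSON strings inside CSV cells is an assumption.

```python
import csv
import io
import json

EXTENSIONS = {"json": "json", "csv": "csv", "excel": "xlsx", "xml": "xml", "html": "html"}


def export_filename(export_key, export_format):
    """Build the key-value store name, e.g. EXPORT.json."""
    return f"{export_key}.{EXTENSIONS[export_format]}"


def _cell(value):
    # Assumption: nested objects and lists are stored as JSON strings in CSV.
    if isinstance(value, (dict, list)):
        return json.dumps(value, ensure_ascii=False)
    return value


def serialize_rows(rows, export_format):
    """Serialize dataset rows to a JSON or CSV string."""
    if export_format == "json":
        return json.dumps(rows, ensure_ascii=False, indent=2)
    if export_format == "csv":
        # Rows may have different keys, so use the union in first-seen order.
        fieldnames = []
        for row in rows:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in rows:
            writer.writerow({key: _cell(value) for key, value in row.items()})
        return buffer.getvalue()
    raise NotImplementedError(f"{export_format} is not covered by this sketch")
```

Because empty fields are omitted from dataset items, rows rarely share an identical key set; taking the union of keys keeps the CSV header stable across successful and failed rows.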

Example dataset item (illustrative):

{
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "status": 200,
  "title": "Example Domain",
  "metaDescription": "Example description",
  "h1": "Example Domain",
  "text": "This domain is for use in documentation examples...",
  "openGraph": { "title": "Example", "type": "website" },
  "linkCount": 1,
  "contentType": "text/html"
}

Use the actor Output schema in Apify Console for direct API links to dataset items and export files.

How the Scraper Works

  1. Requests fetches each startUrl with your headers and timeout.
  2. BeautifulSoup (lxml) parses HTML and extracts structured fields.
  3. Custom selectors run per field name you define in input.
  4. Each result is pushed to the default dataset (empty values omitted).
  5. All results are serialized to the key-value store in the chosen exportFormat.
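
The steps above can be sketched as a sequential loop. To keep the example runnable without network access, the fetch and parse steps are passed in as callables; in the actor they would be backed by Requests and BeautifulSoup. The names run_pipeline and drop_empty are hypothetical, and the exact rule for what counts as "empty" (here: None, empty string, empty list, empty dict) is an assumption.

```python
def drop_empty(item):
    """Remove keys whose value is None, '', [], or {}; 0 and False are kept."""
    return {key: value for key, value in item.items() if value not in (None, "", [], {})}


def run_pipeline(start_urls, fetch, parse):
    """Process URLs one at a time and return the list of dataset items.

    fetch(url) -> (status_code, html_text)
    parse(html_text) -> dict of extracted fields
    """
    dataset = []
    for url in start_urls:  # sequential: one request at a time
        status, html = fetch(url)
        item = {"url": url, "status": status, **parse(html)}
        dataset.append(drop_empty(item))
    return dataset
```

Injecting fetch and parse also makes the loop easy to test with canned HTML before pointing it at real pages.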

Reliability & Best Practices

  • Respect robots.txt, site terms, and rate limits for target sites.
  • Use realistic User-Agent and headers for sites that block bots.
  • Increase timeoutSecs for slow pages.
  • Disable includeHtml unless you need raw HTML (large files).
  • For JavaScript-heavy SPAs, consider a Playwright-based actor instead.

Performance

  • No browser overhead — fast for static HTML.
  • Batch many URLs in one run; sequential fetching keeps memory predictable.
  • Large includeHtml exports increase KV store size.

Use Cases

  • SEO audits (titles, meta, H1, canonical)
  • Content monitoring and change detection (scheduled runs)
  • Lead / directory page extraction with custom selectors
  • Research datasets and internal analytics pipelines
  • Template for building custom Python scrapers on Apify

Integrations & API

  • Full API via Apify (apify-client Python / Node.js)
  • Zapier, Make, Google Sheets via dataset export
  • Webhooks and scheduled runs
  • Output schema links for dataset and KV export URLs
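
As an illustration of the API route, the sketch below starts a run with the apify-client Python package and reads the resulting dataset items. The actor ID is a placeholder because this page does not state it, the token must be your own, and build_run_input and run_actor are hypothetical helper names. The input keys mirror the Input Configuration section.

```python
ACTOR_ID = "<username>/<actor-name>"  # placeholder: replace with the actor's real ID


def build_run_input(urls, export_format="json", export_key="EXPORT"):
    """Assemble an input object using the field names from the input table."""
    return {
        "startUrls": list(urls),
        "includeLinks": True,
        "includeHtml": False,
        "exportFormat": export_format,
        "exportKey": export_key,
    }


def run_actor(token, actor_id, run_input):
    """Start a run, wait for it to finish, and return its dataset items."""
    # Imported here so the helper above works without the package installed
    # (pip install apify-client).
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be run_actor(token, ACTOR_ID, build_run_input(["https://example.com"], "csv")), with the token read from an environment variable rather than hard-coded.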

FAQ

Can I scrape any website with this actor?

Only pages you are legally permitted to access and process. You must comply with each site’s terms, robots.txt, and applicable privacy laws. This tool does not bypass paywalls or authentication by default.

Does this actor run JavaScript?

No. It uses HTTP + HTML parsing. Dynamic sites that render content only in the browser may need a Playwright/Puppeteer actor.

How do custom CSS selectors work?

Add a JSON object named selectors whose keys are output field names and whose values are CSS selectors. The ::text suffix is optional; the actor normalizes selectors and extracts the text of the first match.
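
One plausible reading of "normalizes selectors" is that a trailing ::text is stripped before the selector is handed to the parser, since ::text is not standard CSS. The sketch below shows that interpretation only; normalize_selector and normalize_selectors are hypothetical names, and the actor's actual normalization may differ.

```python
TEXT_SUFFIX = "::text"


def normalize_selector(selector):
    """Strip surrounding whitespace and an optional trailing ::text suffix."""
    selector = selector.strip()
    if selector.endswith(TEXT_SUFFIX):
        selector = selector[: -len(TEXT_SUFFIX)].rstrip()
    return selector


def normalize_selectors(selectors):
    """Normalize every selector in a {field name: css selector} map."""
    return {field: normalize_selector(css) for field, css in selectors.items()}
```

With BeautifulSoup, each normalized selector could then be passed to soup.select_one(css), and the text of the match read with get_text(strip=True) when a match exists.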

Where is the full export file?

In the run’s default key-value store, named {exportKey}.{extension} (e.g. EXPORT.csv). Dataset items are always written per URL regardless of export format.

What happens if one URL fails?

That URL gets a dataset row with error; other URLs still process. The combined export includes both successful and failed rows.
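
The failure behavior described above amounts to catching errors per URL instead of per run. A minimal sketch, assuming a hypothetical scrape_one(url) callable that returns a row or raises; the format of the error message is an assumption.

```python
def scrape_with_errors(start_urls, scrape_one):
    """Scrape each URL in turn; a failure becomes an error row, not a crash."""
    rows = []
    for url in start_urls:
        try:
            rows.append(scrape_one(url))
        except Exception as exc:  # keep going so other URLs still process
            rows.append({"url": url, "error": f"{type(exc).__name__}: {exc}"})
    return rows
```

Since failed rows keep the url key, they can be filtered out of the combined export afterwards and retried in a follow-up run.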

Can I automate runs on a schedule?

Yes. Use Apify schedules and pass updated startUrls via API or integrations.

Why Choose This Actor?

  • Popular Python stack (Requests + BeautifulSoup)
  • Per-URL dataset + combined multi-format export
  • Custom CSS selectors without code changes
  • Clear overview table via dataset schema
  • Ideal starter template for Apify Python projects

Limitations

  • No JavaScript rendering: static HTML only.
  • No built-in proxy: set custom headers, or configure Apify Proxy at the platform level if your deployment needs it.
  • Sequential requests: URLs are fetched one at a time, with no built-in concurrency.
  • Some sites block datacenter IPs or require cookies; handling for these is not included by default.

Running locally

  1. pip install -r requirements.txt
  2. Create INPUT.json in the actor folder, e.g.:

     {
       "startUrls": ["https://example.com"],
       "exportFormat": "json",
       "exportKey": "EXPORT"
     }

  3. Run with the Apify CLI (apify run), or execute main.py in an Apify-compatible environment.

Deploy to Apify

Use INPUT_SCHEMA.json, OUTPUT_SCHEMA.json, .actor/dataset_schema.json, and the provided Dockerfile. Push with Apify CLI or connect the Git repository.

Get Started

Add your URLs, define optional selectors, pick an export format, and start extracting structured web data with Python on Apify. 🚀