Pricing

from $2.00 / 1,000 scraped pages

Python Scraper

Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.

Rating: 5.0 (4)

Developer: Sovanza

Last modified: 2 months ago

Python Scraper – Extract Web Page Data with Requests & BeautifulSoup

Python Scraper is a general-purpose Apify Actor that fetches public web pages with Requests, parses HTML with BeautifulSoup, and returns structured metadata per URL. Export combined results to JSON, CSV, Excel, XML, or HTML in the default key-value store.

Overview

Each run processes your startUrls sequentially, pushes one dataset item per URL, and writes a combined export file named {exportBasename}.{ext}.

No browser — fast for static HTML and simple sites
Custom CSS selectors — map field names to selectors without code changes
Secret inputs — cookies, bearer tokens, and request headers are encrypted in Apify storage
Automation-ready — API, schedules, and webhooks

Quick start

Open Python Scraper on Apify and add one or more start URLs (e.g. https://example.com).
Optionally set CSS selectors for extra fields and choose an export format.
Click Run, then inspect the Dataset tab for the per-URL rows.
Download the combined file from the run’s Key-value store (see Output schema links in Console).

Minimal input example

{
  "startUrls": ["https://example.com"],
  "exportFormat": "json",
  "exportBasename": "EXPORT"
}

Input configuration

Field	Required	Description
`startUrls`	Yes	URLs to scrape (string list).
`method`	No	`GET` or `POST` (default `GET`).
`cookies`	No	Secret. Raw `Cookie` header for logged-in sessions.
`bearerToken`	No	Secret. Bearer token sent as `Authorization` header.
`headers`	No	Secret. Extra HTTP headers (JSON object). Default User-Agent applied if omitted.
`timeoutSecs`	No	Per-request timeout (default `60`, max `300`).
`selectors`	No	Map of `fieldName → CSS selector` (supports `::text` suffix).
`includeLinks`	No	Extract page links (default `true`).
`includeHtml`	No	Include raw HTML in KV export only (default `false`; can be large).
`exportFormat`	No	`json`, `csv`, `excel`, `xml`, or `html`.
`exportBasename`	No	KV export basename (default `EXPORT`); extension added automatically.
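
The constraints in the table can be checked before a run is started. The following is a minimal sketch, not the Actor's own validation code: the helper name `validate_input` and its error messages are hypothetical, the defaults are copied from the table, and the lower bound on `timeoutSecs` is an assumption.

```python
ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_FORMATS = {"json", "csv", "excel", "xml", "html"}

# Defaults stated in the input table.
DOCUMENTED_DEFAULTS = {
    "method": "GET",
    "timeoutSecs": 60,
    "includeLinks": True,
    "includeHtml": False,
    "exportBasename": "EXPORT",
}


def validate_input(actor_input: dict) -> dict:
    """Hypothetical pre-flight check; returns the input with defaults applied."""
    config = {**DOCUMENTED_DEFAULTS, **actor_input}

    urls = config.get("startUrls")
    if not isinstance(urls, list) or not urls or not all(isinstance(u, str) for u in urls):
        raise ValueError("startUrls must be a non-empty list of strings")

    if config["method"] not in ALLOWED_METHODS:
        raise ValueError("method must be GET or POST")

    timeout = config["timeoutSecs"]
    # Assumption: only the maximum of 300 is documented; the minimum of 1 is a guess.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 1 <= timeout <= 300:
        raise ValueError("timeoutSecs must be an integer between 1 and 300")

    export_format = config.get("exportFormat")
    if export_format is not None and export_format not in ALLOWED_FORMATS:
        raise ValueError(f"exportFormat must be one of {sorted(ALLOWED_FORMATS)}")

    return config
```

Calling `validate_input({"startUrls": ["https://example.com"]})` returns the input with `method`, `timeoutSecs`, `includeLinks`, `includeHtml`, and `exportBasename` filled in from the documented defaults.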

Full input example (with secrets and selectors)

{
  "startUrls": [
    "https://example.com",
    "https://news.ycombinator.com/"
  ],
  "method": "GET",
  "cookies": "session=YOUR_SESSION",
  "bearerToken": "YOUR_API_TOKEN",
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  },
  "timeoutSecs": 60,
  "selectors": {
    "headline": "h1::text",
    "tagline": "p.lead::text"
  },
  "includeLinks": true,
  "includeHtml": false,
  "exportFormat": "csv",
  "exportBasename": "EXPORT"
}

Legacy input: exportKey is still accepted at runtime, but the schema now names this field exportBasename.
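
One way such a fallback could be handled is sketched below. This is illustrative only: the function name `resolve_export_basename` is hypothetical, and the rule that exportBasename wins when both keys are present is an assumption, not documented behavior.

```python
def resolve_export_basename(actor_input: dict, default: str = "EXPORT") -> str:
    """Pick the export basename, falling back to the legacy exportKey."""
    # Assumption: the current field name takes precedence over the legacy one.
    for key in ("exportBasename", "exportKey"):
        value = actor_input.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    # Neither key is set, so use the documented default.
    return default
```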

Output

Dataset (one row per URL)

Field	Description
`url`	Input URL
`finalUrl`	URL after redirects
`status`	HTTP status code
`contentType`	Response content type (e.g. `text/html`)
`title`	Document title
`metaDescription`	Meta description
`h1`	Primary heading
`text`	Text snippet from the page
`openGraph` / `twitterCard`	Social meta objects
`links` / `linkCount`	Extracted links when `includeLinks` is true
`error`	Error message when a URL fails
custom	Fields from your `selectors` map

Example dataset item

{
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "status": 200,
  "title": "Example Domain",
  "metaDescription": "Example description",
  "h1": "Example Domain",
  "text": "This domain is for use in documentation examples...",
  "linkCount": 1,
  "contentType": "text/html"
}

Key-value store export

Combined file: {exportBasename}.json, .csv, .xlsx, .xml, or .html depending on exportFormat. Use the actor Output schema in Console for direct API links.
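
The mapping from exportFormat to a file name, and a CSV serialization of the combined rows, might look like the sketch below. The helper names `export_filename` and `rows_to_csv` are hypothetical, and JSON-encoding nested values into a single cell is an assumption about how objects such as openGraph could be flattened.

```python
import csv
import io
import json

# exportFormat value -> file extension, as listed in the text above.
EXTENSIONS = {"json": "json", "csv": "csv", "excel": "xlsx", "xml": "xml", "html": "html"}


def export_filename(basename: str, export_format: str) -> str:
    """Build the key-value store record name, e.g. EXPORT.xlsx for excel."""
    return f"{basename}.{EXTENSIONS[export_format]}"


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize dataset rows to CSV text."""
    # Rows may omit empty fields, so use the union of keys in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Assumption: nested values are JSON-encoded so they fit in one cell.
        writer.writerow(
            {k: json.dumps(v) if isinstance(v, (dict, list)) else v for k, v in row.items()}
        )
    return buffer.getvalue()
```

Taking the union of keys keeps a failed URL (which has only `url` and `error`) in the same file as successful rows, with blank cells for the fields it lacks.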

How it works

Requests fetches each URL with merged secret headers (cookies, bearer token, custom headers).
BeautifulSoup (lxml) parses HTML and extracts standard fields.
Custom selectors run per configured field name.
Each page is pushed to the default dataset (empty values omitted).
All rows are serialized to the key-value store in the chosen format.
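
The header merge in the first step and the omission of empty values in the fourth can be sketched as follows. This is not the Actor's source: `build_headers`, `drop_empty`, and the placeholder User-Agent string are hypothetical, and letting the dedicated cookies and bearerToken fields override custom headers is an assumed precedence.

```python
from typing import Optional

# Placeholder value; the Actor's real default User-Agent is not documented here.
DEFAULT_USER_AGENT = "python-scraper-example/0.1"


def build_headers(
    cookies: Optional[str] = None,
    bearer_token: Optional[str] = None,
    headers: Optional[dict] = None,
) -> dict:
    """Merge the three secret inputs into one request header dict."""
    merged = {"User-Agent": DEFAULT_USER_AGENT}
    # Custom headers replace the default User-Agent when they use the same key.
    merged.update(headers or {})
    # Assumption: the dedicated fields win over same-named custom headers.
    if cookies:
        merged["Cookie"] = cookies
    if bearer_token:
        merged["Authorization"] = f"Bearer {bearer_token}"
    return merged


def drop_empty(item: dict) -> dict:
    """Remove keys whose value is None or an empty string, list, or dict."""
    return {k: v for k, v in item.items() if v not in (None, "", [], {})}
```

The merged dict could then be passed to Requests, for example `requests.get(url, headers=build_headers(...), timeout=60)`.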

Authentication & sensitive input

Fields that can hold credentials use Apify secret input (isSecret: true):

cookies — session cookies
bearerToken — API or OAuth bearer tokens
headers — JSON object for Authorization, API keys, or other sensitive headers

Secret values are encrypted in storage and are not copied into dataset rows or logs.

Best practices

Respect robots.txt, site terms, and rate limits.
Use realistic headers or cookies only when you are allowed to access the target pages.
Increase timeoutSecs for slow sites.
Keep includeHtml disabled unless you need raw HTML (large KV files).
For JavaScript-heavy SPAs, use a Playwright-based actor instead.

Use cases

SEO audits (title, meta, H1, canonical)
Content monitoring with scheduled runs
Directory or listing extraction with custom selectors
Starter template for custom Python scrapers on Apify

Limitations

No JavaScript rendering — static HTML only.
Sequential requests — one URL at a time per run.
No built-in Apify Proxy configuration in this actor (add headers/cookies or use platform proxy at account level if needed).

FAQ

Does this actor run JavaScript?

No. It fetches pages over HTTP and parses the returned HTML. Dynamic sites may need a Playwright- or Puppeteer-based actor.

How do custom CSS selectors work?

Add a JSON object named selectors whose keys are output field names and whose values are CSS selectors. For each field, the actor extracts the text of the first matching element.
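
A rough illustration of first-match extraction with BeautifulSoup follows. The function name `extract_custom_fields` is hypothetical, and treating `::text` as a suffix that is stripped before the query is an assumption about how the Actor interprets it. The sketch uses the built-in `html.parser` backend so it runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_custom_fields(html: str, selectors: dict) -> dict:
    """Return {field name: text of the first element matching its selector}."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for name, selector in selectors.items():
        # ::text is not standard CSS, so remove it before querying.
        css = selector.removesuffix("::text").strip()
        node = soup.select_one(css)  # first match, or None
        if node is not None:
            fields[name] = node.get_text(strip=True)
    return fields
```

Fields whose selector matches nothing are left out of the result, in line with the note that empty values are omitted from dataset rows.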

What if one URL fails?

That URL gets a dataset row with error; other URLs still process.
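
This per-URL isolation can be sketched as a loop that catches failures and records them as rows. The names `scrape_all` and `fetch_page` are hypothetical, and the format of the error string is an assumption.

```python
from typing import Callable


def scrape_all(urls: list[str], fetch_page: Callable[[str], dict]) -> list[dict]:
    """Process URLs one at a time; a failure becomes an error row."""
    rows = []
    for url in urls:
        try:
            rows.append({"url": url, **fetch_page(url)})
        except Exception as exc:
            # Record the failure and continue so one bad URL does not abort the run.
            rows.append({"url": url, "error": f"{type(exc).__name__}: {exc}"})
    return rows
```

Passing the fetch function as a parameter keeps the loop testable without network access.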

Where is the combined export file?

In the run’s default key-value store as {exportBasename}.{extension} (e.g. EXPORT.csv).

Run locally

pip install -r requirements.txt
Create INPUT.json in this folder (see examples above).
Run apify run with the Apify CLI, or run main.py directly in an Apify-compatible environment.

Deploy

Push with the Apify CLI or connect the Git repository. Key project files: INPUT_SCHEMA.json, OUTPUT_SCHEMA.json, .actor/dataset_schema.json, and Dockerfile.

Get started

Add your URLs, optional selectors and export format, and start extracting structured web data with Python on Apify.
