Python Scraper

Pricing

from $2.00 / 1,000 scraped pages

Python Scraper extracts web page data using Requests and BeautifulSoup. It collects titles, meta tags, headings, links, images, Open Graph data, text snippets, and custom CSS selector fields, with exports to JSON, CSV, Excel, XML, or HTML.

Rating: 5.0 (4)

Developer: Sovanza (maintained by Community)

Actor stats: 0 bookmarked, 7 total users, 1 monthly active user, last modified 15 days ago

Python Scraper – Extract Web Page Data with Requests & BeautifulSoup

Python Scraper is a general-purpose Apify Actor that fetches public web pages with Requests, parses HTML with BeautifulSoup, and returns structured metadata per URL. Export combined results to JSON, CSV, Excel, XML, or HTML in the default key-value store.

Overview

Each run processes your startUrls sequentially, pushes one dataset item per URL, and writes a combined export file named {exportBasename}.{ext}.

  • No browser — fast for static HTML and simple sites
  • Custom CSS selectors — map field names to selectors without code changes
  • Secret inputs — cookies, bearer tokens, and request headers are encrypted in Apify storage
  • Automation-ready — API, schedules, and webhooks

Quick start

  1. Open Python Scraper on Apify and add one or more start URLs (e.g. https://example.com).
  2. Optionally set CSS selectors for extra fields and choose an export format.
  3. Click Run — inspect the Dataset tab for per-URL rows.
  4. Download the combined file from the run’s Key-value store (see Output schema links in Console).

Minimal input example

{
  "startUrls": ["https://example.com"],
  "exportFormat": "json",
  "exportBasename": "EXPORT"
}

Input configuration

| Field | Required | Description |
| --- | --- | --- |
| startUrls | Yes | URLs to scrape (string list). |
| method | No | GET or POST (default GET). |
| cookies | No | Secret. Raw Cookie header for logged-in sessions. |
| bearerToken | No | Secret. Bearer token sent as Authorization header. |
| headers | No | Secret. Extra HTTP headers (JSON object). Default User-Agent applied if omitted. |
| timeoutSecs | No | Per-request timeout (default 60, max 300). |
| selectors | No | Map of fieldName → CSS selector (supports ::text suffix). |
| includeLinks | No | Extract page links (default true). |
| includeHtml | No | Include raw HTML in KV export only (default false; can be large). |
| exportFormat | No | json, csv, excel, xml, or html. |
| exportBasename | No | KV export basename (default EXPORT); extension added automatically. |
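
The constraints in the table can be checked before a run is started. The following is a minimal sketch of such a pre-flight check, not the Actor's own validation (that is defined by its input schema). The function name normalize_input is hypothetical, the defaults are the ones documented above, and treating an empty startUrls list as an error is an assumption.

```python
ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_FORMATS = {"json", "csv", "excel", "xml", "html"}
MAX_TIMEOUT_SECS = 300


def normalize_input(raw: dict) -> dict:
    """Return a copy of `raw` with documented defaults applied, or raise ValueError."""
    start_urls = raw.get("startUrls")
    # Assumption: an empty list is rejected, since the field is required.
    if not isinstance(start_urls, list) or not start_urls:
        raise ValueError("startUrls is required and must be a non-empty list")
    if not all(isinstance(url, str) for url in start_urls):
        raise ValueError("startUrls must contain only strings")

    method = str(raw.get("method", "GET")).upper()
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be GET or POST, got {method!r}")

    timeout = raw.get("timeoutSecs", 60)
    is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
    if not is_number or not 0 < timeout <= MAX_TIMEOUT_SECS:
        raise ValueError("timeoutSecs must be a positive number no larger than 300")

    export_format = raw.get("exportFormat")
    if export_format is not None and export_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported exportFormat: {export_format!r}")

    return {
        **raw,
        "method": method,
        "timeoutSecs": timeout,
        "includeLinks": raw.get("includeLinks", True),
        "includeHtml": raw.get("includeHtml", False),
        "exportBasename": raw.get("exportBasename", "EXPORT"),
    }
```

Calling normalize_input({"startUrls": ["https://example.com"]}) fills in GET, a 60-second timeout, and the EXPORT basename, while a timeoutSecs of 301 raises ValueError.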

Full input example (with secrets and selectors)

{
  "startUrls": [
    "https://example.com",
    "https://news.ycombinator.com/"
  ],
  "method": "GET",
  "cookies": "session=YOUR_SESSION; path=/",
  "bearerToken": "YOUR_API_TOKEN",
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  },
  "timeoutSecs": 60,
  "selectors": {
    "headline": "h1::text",
    "tagline": "p.lead::text"
  },
  "includeLinks": true,
  "includeHtml": false,
  "exportFormat": "csv",
  "exportBasename": "EXPORT"
}

Legacy input: the field was renamed from exportKey to exportBasename in the schema. exportKey is still accepted at runtime.
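
One way to handle such a rename is a small fallback lookup. This is only a sketch: the helper name resolve_export_basename is hypothetical, and the precedence when both keys are present (the new name wins) is an assumption rather than documented behavior.

```python
def resolve_export_basename(actor_input: dict, default: str = "EXPORT") -> str:
    """Prefer exportBasename, fall back to the legacy exportKey, then the default."""
    # Assumption: the current field name takes precedence over the legacy one.
    for key in ("exportBasename", "exportKey"):
        value = actor_input.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return default
```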

Output

Dataset (one row per URL)

| Field | Description |
| --- | --- |
| url | Input URL |
| finalUrl | URL after redirects |
| status | HTTP status code |
| title | Document title |
| metaDescription | Meta description |
| h1 | Primary heading |
| openGraph / twitterCard | Social meta objects |
| links / linkCount | Extracted links when includeLinks is true |
| error | Error message when a URL fails |
| custom | Fields from your selectors map |

Example dataset item

{
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "status": 200,
  "title": "Example Domain",
  "metaDescription": "Example description",
  "h1": "Example Domain",
  "text": "This domain is for use in documentation examples...",
  "linkCount": 1,
  "contentType": "text/html"
}

Key-value store export

Combined file: {exportBasename}.json, .csv, .xlsx, .xml, or .html depending on exportFormat. Use the actor Output schema in Console for direct API links.
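
The record key can be derived from the two export inputs, which is handy when fetching the file from a script. The sketch below maps each exportFormat to the extension listed above and builds a download URL using the Apify API v2 key-value store record endpoint. The helper names and the store ID used in the usage note are hypothetical.

```python
from urllib.parse import quote

# Extension per exportFormat, as listed in the section above.
EXTENSIONS = {"json": "json", "csv": "csv", "excel": "xlsx", "xml": "xml", "html": "html"}


def export_key(basename: str, export_format: str) -> str:
    """Return the key-value store record key, e.g. EXPORT.xlsx for excel."""
    if export_format not in EXTENSIONS:
        raise ValueError(f"unsupported exportFormat: {export_format!r}")
    return f"{basename}.{EXTENSIONS[export_format]}"


def record_url(store_id: str, key: str) -> str:
    """Build the API URL of one record in a key-value store."""
    return (
        "https://api.apify.com/v2/key-value-stores/"
        f"{quote(store_id, safe='')}/records/{quote(key, safe='')}"
    )
```

For example, record_url("abc123", export_key("EXPORT", "csv")) yields a URL ending in /key-value-stores/abc123/records/EXPORT.csv. A private store additionally needs an API token on the request.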

How it works

  1. Requests fetches each URL with merged secret headers (cookies, bearer token, custom headers).
  2. BeautifulSoup (lxml) parses HTML and extracts standard fields.
  3. Custom selectors run per configured field name.
  4. Each page is pushed to the default dataset (empty values omitted).
  5. All rows are serialized to the key-value store in the chosen format.
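
Steps 2 and 4 above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's source: the function name extract_standard_fields is hypothetical, the shape of the openGraph object is a guess, and Python's built-in "html.parser" stands in for lxml so the sketch needs only BeautifulSoup. Fetching is left out so the example runs offline.

```python
from bs4 import BeautifulSoup


def extract_standard_fields(html: str) -> dict:
    """Parse an HTML string and return standard fields, omitting empty values."""
    # The Actor is described as using lxml; "html.parser" avoids that dependency here.
    soup = BeautifulSoup(html, "html.parser")

    description = soup.find("meta", attrs={"name": "description"})
    h1 = soup.find("h1")
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Assumption: Open Graph tags are collected as {property: content}.
    open_graph = {
        tag["property"]: tag["content"]
        for tag in soup.find_all("meta", attrs={"property": True, "content": True})
        if tag["property"].startswith("og:")
    }

    row = {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "metaDescription": description.get("content") if description else None,
        "h1": h1.get_text(strip=True) if h1 else None,
        "openGraph": open_graph,
        "links": links,
        "linkCount": len(links),
    }
    # Drop empty values before the row would be pushed to the dataset.
    return {key: value for key, value in row.items() if value not in (None, "", [], {})}
```

In a full run, the html argument would come from the body of the HTTP response fetched in step 1.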

Authentication & sensitive input

Fields that can hold credentials use Apify secret input (isSecret: true):

  • cookies — session cookies
  • bearerToken — API or OAuth bearer tokens
  • headers — JSON object for Authorization, API keys, or other sensitive headers

Secret values are encrypted in storage and are not copied into dataset rows or logs.
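
As an illustration of how the three secret inputs might be merged into one set of request headers, consider the sketch below. The helper names, the default User-Agent string, and the precedence (cookies and bearerToken override same-named entries in headers) are all assumptions, and redact_headers only shows one way to keep credentials out of log output.

```python
# Hypothetical default; the Actor's real User-Agent string is not documented here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/1.0)"
SENSITIVE_HEADERS = {"authorization", "cookie"}


def _set_header(headers: dict, name: str, value: str) -> None:
    """Set a header, replacing any entry whose name differs only by case."""
    for existing in [h for h in headers if h.lower() == name.lower()]:
        del headers[existing]
    headers[name] = value


def build_headers(cookies=None, bearer_token=None, extra_headers=None) -> dict:
    """Merge cookies, bearer token, and custom headers into one dict."""
    headers = {"User-Agent": DEFAULT_USER_AGENT}
    for name, value in (extra_headers or {}).items():
        _set_header(headers, name, value)
    # Assumption: the dedicated fields win over generic custom headers.
    if cookies:
        _set_header(headers, "Cookie", cookies)
    if bearer_token:
        _set_header(headers, "Authorization", f"Bearer {bearer_token}")
    return headers


def redact_headers(headers: dict) -> dict:
    """Return a copy that is safe to log: credential values are masked."""
    return {
        name: ("***" if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

The resulting dict can be passed as the headers argument of requests.request together with the method, URL, and timeout.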

Best practices

  • Respect robots.txt, site terms, and rate limits.
  • Use realistic headers or cookies only when you are allowed to access the target pages.
  • Increase timeoutSecs for slow sites.
  • Keep includeHtml disabled unless you need raw HTML (large KV files).
  • For JavaScript-heavy SPAs, use a Playwright-based actor instead.
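
The robots.txt advice can be applied on the caller's side before URLs are submitted. The sketch below uses the standard library's urllib.robotparser on robots.txt text that has already been downloaded. The helper name make_robots_checker is hypothetical, and nothing here implies that the Actor performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that tells whether a URL may be fetched under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def is_allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return is_allowed


# Example: keep only the start URLs that the site permits.
robots = "User-agent: *\nDisallow: /private/\n"
is_allowed = make_robots_checker(robots)
candidates = ["https://example.com/", "https://example.com/private/page"]
start_urls = [url for url in candidates if is_allowed(url)]
```

To load a live file instead of a string, RobotFileParser also offers set_url and read.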

Use cases

  • SEO audits (title, meta, H1, canonical)
  • Content monitoring with scheduled runs
  • Directory or listing extraction with custom selectors
  • Starter template for custom Python scrapers on Apify

Limitations

  • No JavaScript rendering — static HTML only.
  • Sequential requests — one URL at a time per run.
  • No built-in Apify Proxy configuration in this actor (add headers/cookies or use platform proxy at account level if needed).

FAQ

Does this actor run JavaScript?

No. It uses HTTP + HTML parsing. Dynamic sites may need Playwright or Puppeteer.

How do custom CSS selectors work?

Add a JSON object selectors where keys are output field names and values are CSS selectors. The actor extracts the first match’s text.
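
A minimal sketch of that behavior with BeautifulSoup is shown below. It assumes the ::text suffix is simply stripped before the selector is evaluated (it is not standard CSS) and that fields with no match are left out. The function name apply_selectors is hypothetical, and "html.parser" is used to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

TEXT_SUFFIX = "::text"


def apply_selectors(html: str, selectors: dict) -> dict:
    """Return {fieldName: text of the first element matching its CSS selector}."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for field_name, selector in selectors.items():
        # Assumption: "::text" only signals text extraction, so drop it for matching.
        css = selector[: -len(TEXT_SUFFIX)] if selector.endswith(TEXT_SUFFIX) else selector
        match = soup.select_one(css.strip())
        if match is None:
            continue  # no match: omit the field
        text = match.get_text(" ", strip=True)
        if text:
            fields[field_name] = text
    return fields
```

With the selectors from the full input example, a page containing one h1 and several p.lead elements would yield the heading text as headline and the first lead paragraph as tagline.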

What if one URL fails?

That URL gets a dataset row with error; other URLs still process.
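
This per-URL isolation can be sketched as a loop that catches failures and records them instead of aborting. The names scrape_all and scrape_one are hypothetical, and the exact error message format is an assumption.

```python
from typing import Callable, Iterable


def scrape_all(urls: Iterable[str], scrape_one: Callable[[str], dict]) -> list[dict]:
    """Process URLs one at a time; a failing URL yields a row with an error field."""
    rows = []
    for url in urls:
        try:
            row = {"url": url, **scrape_one(url)}
        except Exception as exc:  # keep going so the remaining URLs still run
            row = {"url": url, "error": f"{type(exc).__name__}: {exc}"}
        rows.append(row)
    return rows


def fake_scrape(url: str) -> dict:
    """Stand-in for a real fetch-and-parse step, used only to demonstrate the loop."""
    if "bad" in url:
        raise TimeoutError("timed out")
    return {"status": 200}


rows = scrape_all(["https://a.example", "https://bad.example", "https://c.example"], fake_scrape)
```

Here the second row carries only url and error, while the first and third rows are produced normally.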

Where is the combined export file?

In the run’s default key-value store as {exportBasename}.{extension} (e.g. EXPORT.csv).

Run locally

  1. pip install -r requirements.txt
  2. Create INPUT.json in this folder (see examples above).
  3. Start the Actor with apify run, or run main.py directly in an Apify-compatible environment.

Deploy

Push with Apify CLI or connect the Git repository. Schema and build files: INPUT_SCHEMA.json, OUTPUT_SCHEMA.json, .actor/dataset_schema.json, Dockerfile.

Get started

Add your URLs, optional selectors and export format, and start extracting structured web data with Python on Apify.