Pricing

from $2.00 / 1,000 results

💎ESG Scraper: Sustainability Reports & PDF Disclosures

Powerful ESG scraper (Environmental, Social, and Governance) to automatically extract sustainability reports, PDF disclosures, articles, and content from any website. Get clean, AI-ready datasets with keyword filtering, metadata extraction, images, links, and full PDF support.

Rating

5.0

(1)

Developer

PrimeParse

Last modified

5 months ago

🌱 ESG Scraper: Sustainability Reports, Articles & PDF Disclosures Extractor

High-quality ESG & Sustainability Web Scraper for Investors, Analysts, and AI Teams

Automatically collects ESG articles, sustainability reports, corporate disclosures, climate news, and PDF reports from any website, and returns clean, structured data ready for investors, compliance teams, analysis, or AI training.

Built for:

Sustainable investors & analysts
Compliance and risk teams
AI/ML engineers building ESG models
Researchers and NGOs tracking climate & governance trends

✅ Smart ESG keyword filtering
✅ Full clean article text extraction
✅ PDF sustainability reports parsing
✅ Rich metadata (date, author, description)
✅ ESG-relevant images and related links
✅ AI-ready dataset splitting (overview / full-text / images)

👉 Runs on Apify • No code required • Pay only for compute used

🚀 Why This Scraper

✔ Purpose-Built for ESG Data
Intelligently filters pages using custom ESG keywords (climate, emissions, governance, CSR, net zero, etc.).

✔ Excellent PDF Handling
Full text extraction from sustainability and ESG reports (PDF) with metadata where available.

✔ Clean & Noise-Free Output
Removes ads, navigation, and scripts so that only meaningful content remains.

✔ Rich Structured Data
Title, publication date, author, description, ESG keywords, internal links, relevant images.

✔ AI & ML Ready
Optional splitting into specialized datasets for RAG, LLM fine-tuning, or training.

✔ Fast & Efficient
Powered by Crawlee + Cheerio — excellent for static and content-heavy sites (news, corporate pages, PDFs). For heavily JavaScript-rendered sites, results may vary.

✔ Safe & Controlled Crawling
Automatic domain restriction, depth limit (max 3 levels), request limits.
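
To make the keyword filtering above concrete, here is a minimal Python sketch. The Actor's real matching rules are not documented in this README, so the function names `match_esg_keywords` and `is_relevant`, the case-insensitive whole-word matching, and the `min_matches` threshold are all assumptions made for illustration.

```python
import re

# Hypothetical default list; the terms mirror the examples given in this README.
DEFAULT_KEYWORDS = [
    "ESG", "sustainability", "climate", "emissions",
    "net zero", "governance", "CSR",
]


def match_esg_keywords(text, keywords=DEFAULT_KEYWORDS):
    """Return the keywords that occur in text, in the order they were given.

    Assumption: matching is case-insensitive and respects word boundaries.
    """
    matched = []
    for keyword in keywords:
        # \b keeps "climate" from matching inside "acclimated", for example.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matched.append(keyword)
    return matched


def is_relevant(text, keywords=DEFAULT_KEYWORDS, min_matches=1):
    """Treat a page as ESG-relevant when enough distinct keywords match."""
    return len(match_esg_keywords(text, keywords)) >= min_matches
```

Raising `min_matches` in a filter like this trades recall for precision: fewer pages pass, but the ones that do mention several ESG terms.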

💼 Use Cases

ESG portfolio screening and risk monitoring
Training ESG-focused LLMs or RAG systems
Regulatory compliance and disclosure tracking
Competitive intelligence on corporate sustainability
Academic research on climate and governance trends

📊 Supported Sources

ESG news sections (Reuters, Bloomberg, FT, Guardian, etc.)
Corporate sustainability / ESG pages
Annual sustainability reports (PDF)
Climate, emissions, governance disclosures

⚙️ How It Works

Provide start URLs (news sections, corporate pages, PDF links)
Set custom ESG keywords and limits
Run the Actor
Download clean, structured ESG datasets

🧩 Input Configuration

Example JSON Input

{
  "startUrls": [
    { "url": "https://www.reuters.com/sustainability/" },
    { "url": "https://www.weforum.org/stories/technological-innovation/" }
  ],
  "allowedDomains": ["reuters.com", "weforum.org"],
  "useApifyProxy": false,
  "maxRequestsPerCrawl": 500,
  "esgKeywords": [
    "ESG",
    "sustainability",
    "climate",
    "emissions",
    "net zero",
    "governance"
  ],
  "extractContent": true,
  "extractMetadata": true,
  "followLinks": true,
  "useSeparateDatasets": true,
  "cleanDefaultDataset": true,
  "proxyUrls": [
    {
      "url": "http://user:pass@host:port"
    }
  ]
}
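
If you prefer to generate the input rather than write it by hand, a small helper can assemble it. The sketch below uses the field names from the example input above; the helper `build_run_input` and its validation rules are hypothetical and not part of the Actor.

```python
import json
from urllib.parse import urlparse


def build_run_input(start_urls, esg_keywords, max_requests=500, separate_datasets=True):
    """Assemble an input dict using the field names from the example input.

    The helper and its checks are illustrative assumptions, not Actor behavior.
    """
    if not start_urls:
        raise ValueError("startUrls is required: pass at least one URL")
    for url in start_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxRequestsPerCrawl": max_requests,
        "esgKeywords": list(esg_keywords),
        "extractContent": True,
        "extractMetadata": True,
        "followLinks": True,
        "useSeparateDatasets": separate_datasets,
    }


if __name__ == "__main__":
    run_input = build_run_input(
        ["https://www.reuters.com/sustainability/"],
        ["ESG", "climate", "net zero"],
    )
    # The printed JSON can be pasted into the Actor's input editor.
    print(json.dumps(run_input, indent=2))
```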

Key Options

startUrls — one or more starting pages or direct PDF links (required)
allowedDomains — restrict crawling to specific domains. If empty, automatically limited to domains from startUrls
maxRequestsPerCrawl — control cost and crawl size
esgKeywords — custom list for relevance filtering (default includes common ESG terms)
extractContent / extractMetadata — toggle full text or metadata extraction
followLinks — enable internal crawling (limited to depth 3 for safety)
useSeparateDatasets — recommended for large runs and AI workflows
cleanDefaultDataset — clear previous run data
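
The interaction between allowedDomains, startUrls, and the depth limit described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and stripping a leading "www." and accepting subdomains of an allowed domain are assumptions, since the README does not say how the Actor compares hostnames.

```python
from urllib.parse import urlparse

MAX_DEPTH = 3  # depth limit stated in this README


def hostname_of(url):
    """Lower-cased hostname without a leading "www." (an assumption of this sketch)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def effective_allowed_domains(start_urls, allowed_domains):
    """Fall back to the start URL hosts when allowedDomains is empty."""
    if allowed_domains:
        return {domain.lower() for domain in allowed_domains}
    return {hostname_of(url) for url in start_urls}


def should_follow(url, depth, allowed):
    """Follow a link only if it is within the depth limit and on an allowed domain."""
    if depth > MAX_DEPTH:
        return False
    host = hostname_of(url)
    # Assumption: subdomains of an allowed domain are accepted as well.
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Under these assumptions, a start URL whose host is missing from a non-empty allowedDomains list would never be crawled, so it is worth listing every start domain explicitly.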

📂 Output Datasets

When useSeparateDatasets: true (recommended):

esg-overview (primary) — lightweight metadata for fast analysis
esg-full-content — long articles (>5000 characters)
esg-images — ESG-relevant images with context
Default dataset — minimal preview records (for Apify UI visibility)

When useSeparateDatasets: false

Single dataset with full detailed records
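
As a rough picture of how one scraped record could be routed when useSeparateDatasets is enabled, consider the Python sketch below. The dataset names and the 5000-character threshold come from the list above; the function `split_record`, the `contentLength` and `pageUrl` fields, and the choice of which fields land in each dataset are assumptions.

```python
FULL_CONTENT_MIN_CHARS = 5000  # threshold quoted in the dataset list above


def split_record(record):
    """Route one scraped record into per-dataset items.

    Returns a dict mapping dataset name to a list of items. Which fields go
    into each dataset is an assumption of this sketch.
    """
    content = record.get("content") or ""
    overview = {
        "url": record["url"],
        "title": record.get("title"),
        "publishedDate": record.get("publishedDate"),
        "esgKeywords": record.get("esgKeywords", []),
        "contentLength": len(content),
    }
    routed = {"esg-overview": [overview], "esg-full-content": [], "esg-images": []}
    # Only long articles are copied to the full-content dataset.
    if len(content) > FULL_CONTENT_MIN_CHARS:
        routed["esg-full-content"].append({"url": record["url"], "content": content})
    for image in record.get("images", []):
        # Keep the source page URL so each image retains its context.
        routed["esg-images"].append({"pageUrl": record["url"], **image})
    return routed
```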

Example Output Record (Full Mode)

{
  "url": "https://www.reuters.com/sustainability/example",
  "title": "Companies strengthen climate commitments",
  "scrapedAt": "2025-12-15T10:30:45Z",
  "publishedDate": "2025-12-10",
  "author": "Jane Doe",
  "description": "Major firms enhance ESG targets...",
  "content": "Full clean article text...\n\nParagraphs preserved...",
  "esgKeywords": ["climate", "emissions", "sustainability"],
  "relatedLinks": [
    {
      "url": "https://www.reuters.com/sustainability/esg-guide",
      "text": "ESG Explained"
    }
  ],
  "images": [
    {
      "url": "https://reuters.com/chart-netzero.jpg",
      "alt": "Net zero emissions progress"
    }
  ]
}

PDF Example

{
  "url": "https://company.com/sustainability-2024.pdf",
  "title": "Annual Sustainability Report 2024",
  "content": "Full extracted report text...",
  "esgKeywords": ["sustainability", "carbon", "governance"],
  "type": "PDF",
  "author": "Corporate Sustainability Team",
  "publishedDate": "2024-03-15"
}
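
Downstream code has to cope with the fact that the two example records do not share all fields: the PDF example has a "type" but no "scrapedAt", "relatedLinks", or "images". The Python sketch below shows one way to load either shape into a typed object. The class `EsgRecord`, the function `parse_record`, the "HTML" default for records without a "type", and the assumption that "publishedDate" is always YYYY-MM-DD are illustrative choices, not documented Actor behavior.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class EsgRecord:
    url: str
    title: str
    content: str = ""
    record_type: str = "HTML"  # assumption: records without "type" are web pages
    author: Optional[str] = None
    published: Optional[date] = None
    scraped_at: Optional[datetime] = None
    keywords: list = field(default_factory=list)


def parse_record(raw):
    """Convert one output dict into an EsgRecord, tolerating missing fields."""
    published = raw.get("publishedDate")
    scraped = raw.get("scrapedAt")
    return EsgRecord(
        url=raw["url"],
        title=raw.get("title", ""),
        content=raw.get("content", ""),
        record_type=raw.get("type", "HTML"),
        author=raw.get("author"),
        published=date.fromisoformat(published) if published else None,
        # Older Python versions reject a trailing "Z", so normalise it first.
        scraped_at=(
            datetime.fromisoformat(scraped.replace("Z", "+00:00")) if scraped else None
        ),
        keywords=list(raw.get("esgKeywords", [])),
    )
```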

🏁 Getting Started

Click “Try for free” on Apify
Paste ESG/sustainability URLs or direct PDF links
Customize keywords and limits
Run and download your dataset

📧 Support

Email: kidaxxb@gmail.com
Response within 24 hours
Issues: Use Apify Issues tab

Tags: ESG, sustainability, web scraping, PDF extraction, climate data, corporate governance, RAG, LLM training, sustainable investing, compliance monitoring

Built with ❤️ on Apify
