Blog Scraper

Pricing

from $33.00 / 1,000 standard-fetches

Company Blog Scraper, Blog Post Scraper, Corporate Blog Crawler, Automatic Blog Discovery, Blog Content Extractor, Article Metadata Scraper, Multi-Domain Blog Scraper, Competitor Blog Analysis, Content Marketing Scraper, Blog Post Metadata Extraction, Company Announcements Scraper.

Developer

Wyald

Maintained by Community

A robust Apify Actor designed to scrape blog posts from company websites. Given a list of company domains and a maximum number of posts to fetch, this scraper automatically discovers blog sections, extracts blog posts, and collects comprehensive content and metadata.

Targeted Keywords

  • Primary: Blog Scraper, Content Extraction, Company Blog Crawler, Article Scraper
  • Secondary: Blog Post Metadata, Content Marketing Analysis, Blog Content Aggregation, Corporate Blog Mining

Features

  • Automatic Blog Discovery: Intelligently finds blog sections on company websites
  • Smart Content Extraction: Extracts comprehensive blog post data including:
    • Title
    • Author
    • Publication date
    • Full article content
    • Excerpt/summary
    • Tags
    • Category
    • URL
  • Configurable Limits: Set maximum number of posts per domain (up to 50)
  • Multiple Domain Support: Scrape from multiple company websites in a single run
  • Structured Output: Returns clean JSON data with all metadata
  • Fast & Lightweight: Uses Crawlee with BeautifulSoup for efficient HTTP-based scraping (no headless browser overhead)

Input

| Field | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| company_urls | Array | List of company domain URLs or homepage URLs to scrape (e.g., ["https://stripe.com", "shopify.com"]) | Yes | - |
| max_blogposts_to_fetch | Number | Maximum number of blog posts to fetch per domain (1-50) | No | 10 |
| max_concurrency | Number | Number of concurrent requests | No | 2 |

Input Example

{
    "company_urls": [
        "https://www.stripe.com",
        "https://shopify.com",
        "https://ai-bees.io"
    ],
    "max_blogposts_to_fetch": 10,
    "max_concurrency": 2
}
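
The field rules in the input table can be checked before starting a run. The following is a minimal sketch, not the Actor's own validation code: the function name `validate_input` is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption, since the listing does not say how the Actor handles them.

```python
from typing import Any

MAX_POSTS_LIMIT = 50      # documented upper bound per domain
DEFAULT_MAX_POSTS = 10    # documented default
DEFAULT_CONCURRENCY = 2   # documented default


def validate_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Check a run input against the documented fields and fill in defaults.

    Hypothetical helper: raises ValueError on bad input (an assumption;
    the Actor's real behavior is not documented).
    """
    urls = raw.get("company_urls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("company_urls is required and must be a non-empty list")
    if not all(isinstance(u, str) and u.strip() for u in urls):
        raise ValueError("every entry in company_urls must be a non-empty string")

    max_posts = raw.get("max_blogposts_to_fetch", DEFAULT_MAX_POSTS)
    if not isinstance(max_posts, int) or not 1 <= max_posts <= MAX_POSTS_LIMIT:
        raise ValueError("max_blogposts_to_fetch must be an integer from 1 to 50")

    concurrency = raw.get("max_concurrency", DEFAULT_CONCURRENCY)
    if not isinstance(concurrency, int) or concurrency < 1:
        raise ValueError("max_concurrency must be a positive integer")

    return {
        "company_urls": [u.strip() for u in urls],
        "max_blogposts_to_fetch": max_posts,
        "max_concurrency": concurrency,
    }
```

Calling `validate_input({"company_urls": ["shopify.com"]})` returns the same input with the two optional fields set to their documented defaults.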

Output Example

{
    "url": "https://www.stripe.com/blog/example-post",
    "domain": "www.stripe.com",
    "post_title": "How we scaled our payment infrastructure",
    "author": "Jane Doe",
    "published_date": "2024-01-15",
    "content": "Full article content here...",
    "excerpt": "Learn how we scaled our payment infrastructure to handle millions of transactions...",
    "tags": ["engineering", "infrastructure", "scaling"],
    "category": "Engineering",
    "scraped_at": "2024-01-20T10:30:00.000Z"
}
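
Records shaped like the example above are easy to post-process once downloaded. The sketch below groups posts by `domain` and sorts them newest first. It assumes `published_date` uses the `YYYY-MM-DD` form shown in the example and may be missing or unparseable on some sites; the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Any


def parse_scraped_at(value: str) -> datetime:
    """Parse a timestamp such as 2024-01-20T10:30:00.000Z."""
    # Swap the trailing Z for an explicit offset so this also works
    # on Python versions older than 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def _published(item: dict[str, Any]) -> date:
    """Return the post date, or date.min when it is absent or not ISO formatted."""
    try:
        return date.fromisoformat(item.get("published_date") or "")
    except (TypeError, ValueError):
        return date.min


def group_by_domain(items: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group output records by domain, newest post first within each domain."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        grouped[item["domain"]].append(item)
    for posts in grouped.values():
        posts.sort(key=_published, reverse=True)  # undated posts end up last
    return dict(grouped)
```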

How It Works

  1. Domain Analysis: The scraper starts by visiting each provided company domain
  2. Blog Detection: It automatically searches for blog sections using common patterns (/blog, /news, /articles, etc.)
  3. Post Discovery: Once in the blog section, it identifies individual blog post URLs
  4. Content Extraction: For each post, it extracts:
    • Structured metadata (title, author, date)
    • Full article content
    • Additional metadata (tags, categories)
  5. Limit Enforcement: Respects the max_blogposts_to_fetch limit per domain
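
Steps 2, 3, and 5 above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Actor's implementation: `BLOG_PATHS` lists only the paths named in the text (the full pattern list is not documented), and treating "same host, path below the blog section" as the test for a post URL is a guess at a reasonable heuristic.

```python
from urllib.parse import urljoin, urlparse

# Only the paths named in the text; the Actor's full list is not documented.
BLOG_PATHS = ("/blog", "/news", "/articles")


def candidate_blog_urls(homepage: str) -> list[str]:
    """Step 2: build the blog-section URLs to probe for one company homepage."""
    return [urljoin(homepage, path) for path in BLOG_PATHS]


def select_post_urls(blog_url: str, hrefs: list[str], max_posts: int) -> list[str]:
    """Steps 3 and 5: keep up to max_posts unique links below the blog section.

    Assumed heuristic: a post lives on the same host, under the blog path,
    and is not the blog index page itself.
    """
    blog = urlparse(blog_url)
    prefix = blog.path.rstrip("/") + "/"
    selected: list[str] = []
    for href in hrefs:
        # Resolve relative links and drop fragments such as #comments.
        url = urljoin(blog_url, href).split("#", 1)[0]
        parsed = urlparse(url)
        is_post = (
            parsed.netloc == blog.netloc
            and parsed.path.startswith(prefix)
            and parsed.path.rstrip("/") != blog.path.rstrip("/")
        )
        if is_post and url not in selected:
            selected.append(url)
        if len(selected) >= max_posts:
            break  # per-domain limit reached
    return selected
```

In a real crawl, `hrefs` would come from the anchor tags of the fetched blog index page.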

Usage Tips

  • URL Format: You can provide URLs with or without https://; the scraper normalizes them
  • Rate Limiting: The scraper includes automatic delays to be respectful to target websites
  • Post Limits: Maximum 50 posts per domain to prevent excessive scraping
  • Concurrency: Adjust max_concurrency based on target website capacity (default: 2)
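
The URL Format tip implies a normalization step along the following lines. The exact rules the Actor applies are not documented, so this is only a sketch: `normalize_company_url` is a hypothetical name, and defaulting to `https://`, lowercasing the host, and adding a trailing `/` for bare domains are assumptions.

```python
from urllib.parse import urlparse


def normalize_company_url(raw: str) -> str:
    """Return a full URL for a bare domain or an already complete URL."""
    value = raw.strip()
    if "://" not in value:
        # Assumption: bare domains are fetched over HTTPS.
        value = "https://" + value
    parsed = urlparse(value)
    if not parsed.netloc:
        raise ValueError(f"cannot find a host in {raw!r}")
    path = parsed.path or "/"
    return f"{parsed.scheme}://{parsed.netloc.lower()}{path}"


for entry in ["https://www.stripe.com", "shopify.com", "ai-bees.io"]:
    print(entry, "->", normalize_company_url(entry))
```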

Use Cases

  • Content Marketing Analysis: Analyze competitor blog strategies
  • Content Aggregation: Collect blog content for research or analysis
  • Market Intelligence: Monitor company announcements and thought leadership
  • SEO Research: Study content patterns and topics from successful blogs
  • Training Data: Collect blog content for ML/AI model training

Notes

  • The scraper respects robots.txt and includes reasonable delays between requests
  • Blog structure varies by website - extraction quality depends on site structure
  • Some blogs may require authentication or have anti-scraping measures
  • Always ensure you have permission to scrape the target websites
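
To check ahead of time whether a site's robots.txt permits fetching a given page, Python's standard `urllib.robotparser` module is enough. How the Actor itself evaluates robots.txt is not documented; the sketch below is an independent pre-check, and the `is_allowed` helper and the sample rules are illustrative only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

In practice the robots.txt text would be downloaded from the target domain before the check.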