WordPress Posts Scraper - Extract Articles & Metadata

Pricing

$10.00 / 1,000 results

Extract posts, articles, and metadata from any WordPress site using the REST API. 20+ filters: date ranges, categories, tags, authors, search keywords. Get title, content, author bio, featured images & more. No WordPress account needed. Fast, reliable data extraction for content aggregation & research.

Developer

DevnaZ

Maintained by Community

WordPress Posts Scraper

The WordPress Posts Scraper is an Apify actor that extracts posts and metadata from any WordPress website using the WordPress REST API. It automatically handles pagination and fetches additional information like author details, categories, tags, and featured images.

This actor is perfect for researchers, content aggregators, and developers who need structured data from WordPress sites.

How It Works

  1. You provide one or more WordPress site URLs.
  2. The actor checks if the WordPress REST API is available.
  3. It fetches posts with your specified filters (dates, categories, keywords, etc.).
  4. It handles pagination automatically until all matching posts (up to maxPosts) are retrieved.
  5. It extracts metadata such as author name, categories, tags, and featured images.
  6. It returns structured JSON output with all relevant post details.
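
As a rough illustration of steps 3 to 5, the pagination loop might look like the sketch below. This is not the actor's actual source: `iter_posts`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fetcher is injected so the loop can run without network access. A real fetcher would read the page count from the `X-WP-TotalPages` response header that the WordPress REST API sends with collection responses.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical fetcher type: given (page, per_page), return the posts on that
# page plus the total number of pages reported by the site.
FetchPage = Callable[[int, int], Tuple[List[Dict], int]]


def iter_posts(fetch_page: FetchPage, max_posts: int, per_page: int = 50) -> Iterator[Dict]:
    """Yield up to max_posts posts, requesting pages until none remain."""
    yielded = 0
    page = 1
    while yielded < max_posts:
        posts, total_pages = fetch_page(page, per_page)
        for post in posts:
            if yielded >= max_posts:
                return
            yield post
            yielded += 1
        # Stop on an empty page or once the last page has been consumed.
        if not posts or page >= total_pages:
            return
        page += 1


def fake_fetch(page: int, per_page: int) -> Tuple[List[Dict], int]:
    """Stand-in for an HTTP call: serves 120 invented posts page by page."""
    all_posts = [{"id": i} for i in range(1, 121)]
    start = (page - 1) * per_page
    total_pages = -(-len(all_posts) // per_page)  # ceiling division
    return all_posts[start:start + per_page], total_pages


first_70 = list(iter_posts(fake_fetch, max_posts=70, per_page=50))
print(len(first_70), first_70[-1]["id"])
```

With `max_posts=70` and `per_page=50`, the loop requests two pages and stops partway through the second, which is why a larger page size means fewer requests.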

Features

✅ Fetches posts from any WordPress site using REST API
✅ Supports pagination until all posts are retrieved
✅ 20+ advanced filters: date ranges, categories, tags, author, search keywords, status, and more
✅ Extracts metadata like author bio, categories, tags, and featured images
✅ Configurable sorting (by date, modified, title, author, relevance)
✅ Optional proxy support (not required for most sites)
✅ Clean and structured JSON output
✅ No WordPress account required

Getting Started

1. Input Parameters

To use the scraper, provide the following inputs:

| Parameter | Type | Required | Description |
|---|---|---|---|
| startUrls | Array | | List of WordPress site URLs to scrape (e.g., [{"url": "https://techcrunch.com"}]) |
| maxPosts | Integer | | Maximum total posts to extract per site (default: 5, max: 10000) |
| perPage | Integer | | [Advanced] Posts per API request (default: 50, max: 100). Higher = fewer requests = lower cost. Reduce to 10-20 if timeouts occur. |
| searchKeyword | String | | Filter posts by keyword search |
| after | String | | Posts published after this date (ISO8601: 2025-01-01T00:00:00) |
| before | String | | Posts published before this date (ISO8601: 2025-12-31T23:59:59) |
| modifiedAfter | String | | Posts modified after this date (ISO8601) |
| modifiedBefore | String | | Posts modified before this date (ISO8601) |
| categories | Array | | Filter by category IDs (e.g., ["1", "5", "12"]) |
| categoriesExclude | Array | | Exclude specific category IDs |
| tags | Array | | Filter by tag IDs |
| tagsExclude | Array | | Exclude specific tag IDs |
| author | Array | | Filter by author IDs |
| authorExclude | Array | | Exclude specific author IDs |
| status | String | | Post status: publish, draft, pending, private, future (default: publish) |
| orderBy | String | | Sort by: date, modified, title, author, id, relevance (default: date) |
| order | String | | Sort order: asc or desc (default: desc) |
| sticky | Boolean | | Include only sticky posts (default: false) |
| slug | String | | Filter by specific post slug |
| offset | Integer | | Skip a specific number of posts (default: 0) |
| proxyConfiguration | Object | | Proxy settings (optional - not needed for most WordPress sites) |
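
To make the filters more concrete, here is a sketch of how these inputs could translate into a WordPress REST API query string. The names on the right of `PARAM_MAP` are real query parameters of the `/wp/v2/posts` endpoint, but the mapping itself and the `build_query` helper are assumptions for illustration; the actor's internal translation is not documented here.

```python
from urllib.parse import urlencode

# Assumed mapping from the actor's camelCase inputs to WordPress REST API
# query parameters (illustrative only, not taken from the actor's source).
PARAM_MAP = {
    "searchKeyword": "search",
    "after": "after",
    "before": "before",
    "modifiedAfter": "modified_after",
    "modifiedBefore": "modified_before",
    "categories": "categories",
    "categoriesExclude": "categories_exclude",
    "tags": "tags",
    "tagsExclude": "tags_exclude",
    "author": "author",
    "authorExclude": "author_exclude",
    "status": "status",
    "orderBy": "orderby",
    "order": "order",
    "sticky": "sticky",
    "slug": "slug",
    "offset": "offset",
    "perPage": "per_page",
}


def build_query(actor_input: dict) -> str:
    """Turn actor-style input into a URL-encoded WordPress query string."""
    params = {}
    for key, wp_name in PARAM_MAP.items():
        value = actor_input.get(key)
        # Skip unset filters. A false `sticky` is skipped too, because
        # sending sticky=false would exclude sticky posts entirely.
        if value is None or value is False or value == "" or value == []:
            continue
        if value is True:
            value = "true"
        elif isinstance(value, list):
            value = ",".join(str(v) for v in value)  # ID lists become CSV
        params[wp_name] = value
    return urlencode(params)


query = build_query({
    "searchKeyword": "ai",
    "after": "2025-01-01T00:00:00",
    "categories": ["1", "5", "12"],
    "orderBy": "date",
    "perPage": 50,
})
print(query)
```

The resulting string would be appended to `https://yoursite.com/wp-json/wp/v2/posts?` to reproduce roughly the same filter by hand.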

2. Running the Actor

Using Apify Interface

  1. Navigate to the actor's Apify page.
  2. Enter the required parameters.
  3. Click Run and wait for the data to be scraped.

Using Apify API

curl -X POST -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://techcrunch.com"}],
    "maxPosts": 50,
    "after": "2025-01-01T00:00:00",
    "orderBy": "date",
    "order": "desc"
  }' \
  "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=YOUR_API_TOKEN"

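The same request can be prepared from Python with only the standard library. The sketch below mirrors the curl call above; `build_run_request` is a hypothetical helper name, and `YOUR_ACTOR_ID` and `YOUR_API_TOKEN` remain placeholders to fill in. The request is built but not sent, so the snippet has no side effects.

```python
import json
import urllib.request


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that starts an actor run."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs?token={token}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_ACTOR_ID",
    "YOUR_API_TOKEN",
    {
        "startUrls": [{"url": "https://techcrunch.com"}],
        "maxPosts": 50,
        "after": "2025-01-01T00:00:00",
        "orderBy": "date",
        "order": "desc",
    },
)
print(request.get_method(), request.full_url)

# To actually start the run, send the prepared request:
#   with urllib.request.urlopen(request) as response:
#       run = json.load(response)
```

Keeping construction separate from sending makes it easy to inspect or log the payload before spending any compute units.
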
Output Format

The output is a JSON dataset containing structured post details:

[
  {
    "id": 19263,
    "date": "2025-11-04T15:34:27",
    "modified": "2025-11-04T16:08:02",
    "slug": "wordpress-6-9-beta-3",
    "link": "https://wordpress.org/news/2025/11/wordpress-6-9-beta-3/",
    "title": "WordPress 6.9 Beta 3",
    "content": "<p>WordPress 6.9 Beta 3 is available for download and testing!</p>...",
    "excerpt": "<p>WordPress 6.9 Beta 3 is available for download and testing!...</p>",
    "author": "Amy Kamala",
    "categories": ["Development", "General", "Releases"],
    "tags": ["6.9", "development", "release"],
    "featured_image": "https://wordpress.org/wp-content/uploads/featured.jpg",
    "extra_metadata": {
      "author_bio": "Full Stack Dev, Artist, Masters from UCLA",
      "author_url": "https://kittenkamala.com/",
      "category_description": "Development news and updates"
    }
  }
]
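
Because `content` and `excerpt` arrive as HTML, downstream analysis often needs a plain-text version. The sketch below shows one way to post-process a dataset item using only the standard library. `html_to_text` and `summarize` are hypothetical helpers, and the sample item is a shortened copy of the output above with the elided content left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace (a space is added between nodes)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def summarize(item: dict) -> dict:
    """Reduce one dataset item to the fields a text analysis might need."""
    return {
        "title": item["title"],
        "author": item.get("author"),
        "published": item["date"][:10],  # keep only YYYY-MM-DD
        "categories": item.get("categories", []),
        "text": html_to_text(item.get("content", "")),
    }


sample = {
    "id": 19263,
    "date": "2025-11-04T15:34:27",
    "title": "WordPress 6.9 Beta 3",
    "content": "<p>WordPress 6.9 Beta 3 is available for download and testing!</p>",
    "author": "Amy Kamala",
    "categories": ["Development", "General", "Releases"],
}

summary = summarize(sample)
print(summary["published"], summary["text"])
```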

Use Cases

  • Content Aggregation – Collect and analyze posts from different WordPress sites.
  • SEO Research – Extract content and metadata for SEO analysis.
  • Data Science – Gather datasets for NLP or sentiment analysis.
  • Backup and Archiving – Store blog content for future reference.
  • Competitor Monitoring – Track competitor blog posts and content strategies.
  • Research & Analysis – Extract posts by date range, category, or keyword for academic or business research.

Performance & Cost Optimization

Speed & Reliability

  • Speed: ~2-5 seconds per 50 posts (using REST API)
  • Success rate: 99%+ on WordPress sites with REST API enabled
  • Concurrency: Supports multiple sites simultaneously
  • No proxy required: WordPress REST API is public and doesn't require proxies in most cases

Cost Optimization with perPage Parameter

The perPage parameter controls how many posts are fetched per API request, directly impacting cost and speed:

Example: Extracting 100 posts

| perPage | API Requests | Compute Units | Speed | Notes |
|---|---|---|---|---|
| 10 | 10 requests | Higher cost | Slower | Use if large sites timeout |
| 50 (default) | 2 requests | Lower cost | Faster | Recommended - best balance |
| 100 | 1 request | Lowest cost | Fastest | May timeout on large sites (TechCrunch, etc.) |
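
The request counts in this table follow from a ceiling division of the number of posts by the page size. A minimal sketch of that arithmetic is shown below; `api_requests` is a hypothetical helper, and it counts only the post-listing requests, not any extra lookups the actor may perform.

```python
import math


def api_requests(max_posts: int, per_page: int) -> int:
    """Number of paginated requests needed to fetch max_posts posts."""
    if max_posts < 0:
        raise ValueError("max_posts must not be negative")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    return math.ceil(max_posts / per_page)


for per_page in (10, 50, 100):
    print(per_page, api_requests(100, per_page))
```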

Recommendation:

  • Default (50): Works on most sites, good balance between cost and reliability
  • Large sites (TechCrunch, Wired, etc.): If timeouts occur, reduce to perPage: 20-30
  • Small sites: Increase to perPage: 100 for maximum speed and lowest cost

Notes

  • WordPress REST API required: This actor only works with sites that have the WordPress REST API enabled (enabled by default on most WordPress sites).
  • API not available: If a site has disabled the REST API, the actor returns an error message.
  • Category/Tag IDs: To filter by categories or tags, you need the numeric IDs (not names). You can find these in the WordPress admin or via the API endpoints:
    • Categories: https://yoursite.com/wp-json/wp/v2/categories
    • Tags: https://yoursite.com/wp-json/wp/v2/tags
  • Date format: Use ISO8601 format for date filters (e.g., 2025-01-01T00:00:00)

Support & Troubleshooting

Having issues? Check these common solutions:

  1. Timeout errors (large sites like TechCrunch): Reduce the perPage parameter to 20-30. This makes more API requests but prevents timeouts.
  2. WordPress REST API not available: The site may have disabled the REST API. Verify by visiting https://yoursite.com/wp-json/wp/v2/posts in your browser.
  3. No posts returned: Check your filters - they may be too restrictive (e.g., date range with no matching posts).
  4. Missing author data: Some WordPress sites may not include author information in the _embedded response.
  5. Category/Tag filtering not working: Ensure you're using numeric IDs, not names.
  6. High costs: Increase perPage to 80-100 for small/fast sites to reduce API requests and compute units.
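
The browser check in item 2 can also be scripted. The sketch below treats a site as usable when the posts endpoint answers with HTTP 200 and a JSON array. `posts_endpoint`, `http_get`, and `rest_api_available` are hypothetical helpers, the `get` argument is injectable so the logic can be exercised without network access, and connection failures (for example DNS errors) are left to propagate as exceptions.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Tuple


def posts_endpoint(site_url: str) -> str:
    """Build the posts URL for a site, asking for a single post only."""
    return site_url.rstrip("/") + "/wp-json/wp/v2/posts?per_page=1"


def http_get(url: str, timeout: float = 15.0) -> Tuple[int, bytes]:
    """Return (status, body); HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        return error.code, error.read()


def rest_api_available(
    site_url: str,
    get: Callable[[str], Tuple[int, bytes]] = http_get,
) -> bool:
    """True if the posts endpoint returns HTTP 200 with a JSON list."""
    status, body = get(posts_endpoint(site_url))
    if status != 200:
        return False
    try:
        return isinstance(json.loads(body), list)
    except ValueError:  # body was not valid JSON, e.g. an HTML error page
        return False


# Offline demonstration with stubbed responses instead of real HTTP calls.
print(rest_api_available("https://yoursite.com/", get=lambda url: (200, b"[]")))
print(rest_api_available("https://yoursite.com/", get=lambda url: (404, b"{}")))
```

Calling `rest_api_available("https://yoursite.com")` without the `get` argument performs a real request.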

For bugs or feature requests, feel free to contact support. Happy scraping! 🚀


No WordPress account or subscription required. Get started analyzing WordPress content today!