article-scrapper

An Apify Actor by RK K (Maintained by Community)

Pricing: $3.00 / 1,000 result_sets

Tech News Article Scraper

A flexible and powerful Apify Actor for scraping articles from tech news websites. This scraper can work with any tech news site - either from predefined presets or custom URLs provided by you.

Features

  • Universal Scraping: Generic scraper works with most tech news sites
  • Preset Sites: Pre-configured settings for popular tech news sources
  • Custom URLs: Scrape any tech news site by providing URLs
  • Smart Extraction: Automatically detects article content, titles, authors, dates, images, and tags
  • Advanced Filtering: Filter by keywords, title text, or description content
  • Flexible Configuration: Control article count, pagination, and content inclusion
  • Error Handling: Robust retry logic and graceful error handling
  • Rate Limiting: Built-in delays to respect website resources

Supported Preset Sites

The following sites are pre-configured for easy scraping:

  • The Verge - Tech news and media
  • CNET - Tech product reviews and news
  • Wired - Technology and culture
  • TechCrunch - Startup and technology news
  • Ars Technica - Technology news and analysis
  • Engadget - Consumer electronics and gadgets
  • The Guardian Tech - Technology news from The Guardian
  • The Next Web - International technology news

How It Works

This actor uses a generic scraper that adapts to different news site structures:

  1. Detects Article Links: Uses multiple strategies to find article URLs on listing pages

    • Looks for <article> tags
    • Searches headings (h1, h2, h3) for links
    • Finds elements with article/post/story classes
    • Identifies URLs with date patterns (/2024/, /2025/)
  2. Extracts Content: Uses fallback strategies for each field

    • Title: h1 tag → og:title → twitter:title → title tag
    • Author: rel="author" → author classes → itemprop="author" → meta tags
    • Date: time[datetime] → datePublished → meta tags
    • Content: article tag → articleBody → content classes
    • Summary: og:description → meta description → intro paragraph
    • Image: og:image → twitter:image → first article image
    • Tags: rel="tag" → category links → tag classes
  3. Handles Edge Cases: Normalizes URLs, filters non-articles, removes duplicates
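The link-detection and fallback-extraction ideas above can be sketched with Python's standard library. This is a simplified illustration of the strategy, not the Actor's actual code (the real scraper uses many more selectors and meta-tag lookups):

```python
import re
from html.parser import HTMLParser

# Heuristic from step 1: article URLs often contain a date segment like /2024/ or /2025/
DATE_PATTERN = re.compile(r"/20\d{2}/")

class LinkCollector(HTMLParser):
    """Collects hrefs that look like article URLs (date-pattern heuristic only)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if DATE_PATTERN.search(href):
                self.links.append(href)

def first_non_empty(*candidates):
    """Step 2's fallback chain: return the first non-empty candidate
    (e.g. h1 text -> og:title -> twitter:title -> <title>)."""
    return next((c for c in candidates if c), None)

html = '<a href="/2025/01/ai-news">AI news</a><a href="/about">About</a>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # only the date-patterned URL survives
print(first_non_empty("", None, "og:title value"))
```

In practice a production scraper combines several such heuristics (article tags, heading links, post/story classes) and unions the results before de-duplicating.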

Usage Examples

Example 1: Scrape Latest Articles from The Verge

{
  "usePresets": true,
  "presetSources": ["verge"],
  "maxArticlesPerSource": 20,
  "maxPages": 1,
  "includeContent": true
}

Example 2: Scrape Multiple Tech Sites (Light Mode)

Get summaries without full content to save time:

{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired", "arstechnica"],
  "maxArticlesPerSource": 10,
  "maxPages": 1,
  "includeContent": false
}

Example 3: Scrape Custom Tech Blogs

{
  "usePresets": false,
  "customUrls": [
    "https://www.theverge.com",
    "https://news.ycombinator.com",
    "https://9to5mac.com"
  ],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}

Example 4: Deep Scrape a Single Site

Get many articles from one source:

{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 50,
  "maxPages": 5,
  "includeContent": true
}

Example 5: Search for Specific Topics (AI Articles)

Filter articles by keywords in title or description:

{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "GPT-4"]
}

Example 6: Filter by Title

Only scrape articles with "iPhone" in the title:

{
  "usePresets": true,
  "presetSources": ["verge", "cnet"],
  "maxArticlesPerSource": 15,
  "titleContains": "iPhone"
}

Example 7: Filter by Description

Only scrape articles about a specific topic in the description:

{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 25,
  "descriptionContains": "startup funding"
}

Example 8: Combine Multiple Filters

Search for AI articles with "OpenAI" in the title:

{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "arstechnica"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["AI", "ChatGPT"],
  "titleContains": "OpenAI"
}

Input Parameters

Parameter            | Type    | Default                          | Description
-------------------- | ------- | -------------------------------- | -----------
usePresets           | boolean | true                             | Use predefined sites or custom URLs
presetSources        | array   | ["verge", "techcrunch", "wired"] | List of preset site keys
customUrls           | array   | []                               | Custom URLs to scrape (when usePresets is false)
maxArticlesPerSource | integer | 10                               | Max articles per source (1-100)
maxPages             | integer | 1                                | Max listing pages to check (1-10)
includeContent       | boolean | true                             | Extract full article text
searchKeywords       | array   | []                               | Filter articles by keywords (matches ANY keyword)
titleContains        | string  | ""                               | Only scrape articles with this text in the title
descriptionContains  | string  | ""                               | Only scrape articles with this text in the description

Filtering Behavior

  • searchKeywords: an article is included if ANY of the keywords appears in its title OR description
  • titleContains: only articles containing this text (case-insensitive) in the title are included
  • descriptionContains: only articles containing this text (case-insensitive) in the summary/description are included
  • Combined filters: all specified filters must match (AND logic)
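The AND/OR semantics above can be sketched as a small predicate. This is a hypothetical helper mirroring the documented behavior, not the Actor's internal implementation:

```python
def matches_filters(article, search_keywords=None, title_contains="", description_contains=""):
    """Apply the documented filter semantics to one article dict."""
    title = article.get("title", "").lower()
    summary = article.get("summary", "").lower()

    # searchKeywords: OR logic across keywords, matched in title OR description
    if search_keywords:
        if not any(k.lower() in title or k.lower() in summary for k in search_keywords):
            return False
    # titleContains / descriptionContains: case-insensitive substring match
    if title_contains and title_contains.lower() not in title:
        return False
    if description_contains and description_contains.lower() not in summary:
        return False
    return True  # all specified filters matched (AND logic)

article = {"title": "OpenAI ships a new model", "summary": "AI news roundup"}
print(matches_filters(article, search_keywords=["AI", "ChatGPT"], title_contains="OpenAI"))  # True
```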

Output Format

Articles are saved to the dataset in the following format:

{
  "title": "Article Title",
  "url": "https://example.com/article",
  "author": "Author Name",
  "published_date": "2025-11-08T12:00:00+00:00",
  "content": "Full article text...",
  "summary": "Brief summary or excerpt",
  "source": "Site Name",
  "tags": ["tag1", "tag2"],
  "image_url": "https://example.com/image.jpg",
  "scraped_at": "2025-11-08T13:45:00+00:00"
}

Viewing Results

After a successful run:

  1. Dataset Tab: View all scraped articles in a table
  2. Export Options: Download as JSON, CSV, XML, or Excel
  3. API Access: Access results programmatically via Apify API
  4. Schedule: Set up periodic scraping (hourly, daily, weekly)
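Programmatic access goes through Apify's dataset-items endpoint. A minimal sketch of building the request URL (replace YOUR_DATASET_ID with a real dataset ID and see the Apify API docs for authentication and the full parameter list):

```python
from urllib.parse import urlencode

def dataset_items_url(dataset_id, fmt="json", limit=100):
    """Build the Apify v2 dataset-items URL for a given dataset."""
    query = urlencode({"format": fmt, "limit": limit})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"

url = dataset_items_url("YOUR_DATASET_ID")
print(url)
# To actually download the items:
# import urllib.request
# data = urllib.request.urlopen(url).read()
```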

Best Practices

  1. Start Small: Test with a few articles before scaling up
  2. Respect Robots.txt: The scraper respects robots.txt automatically
  3. Rate Limiting: Built-in delays prevent overwhelming servers
  4. Monitor Logs: Check logs for warnings about missing content
  5. Adjust Parameters: If scraping fails, try reducing maxArticlesPerSource
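The polite-delay pattern referred to in points 2-3 can be sketched in a few lines (illustrative only; the Actor's actual delay values are not documented here):

```python
import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base seconds plus random jitter so requests don't hit
    the target host in a fixed rhythm. Returns the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

start = time.monotonic()
d = polite_sleep(base=0.2, jitter=0.1)
print(f"slept at least {time.monotonic() - start:.2f}s (requested {d:.2f}s)")
```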

Limitations

  • Some sites may use JavaScript-heavy rendering (not supported by this scraper)
  • Paywalled content cannot be extracted
  • Sites with anti-scraping measures may block requests
  • Content structure varies; generic extraction may miss some fields

Troubleshooting

No articles found

  • Check if the URL is correct and accessible in a browser
  • Verify the site allows scraping (check robots.txt)
  • Try increasing the maxPages parameter
  • Some sites require JavaScript; this scraper parses static HTML only
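Verifying robots.txt rules can be done offline with Python's stdlib robot parser; a sketch parsing an example rules file directly (hypothetical rules, no network call):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (assumption for illustration)
rules = """User-agent: *
Disallow: /private/
Allow: /""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://example.com/2025/01/story"))  # True
print(rp.can_fetch("*", "https://example.com/private/page"))   # False
```

For a live site, `rp.set_url("https://example.com/robots.txt"); rp.read()` fetches and parses the real file.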

Missing content fields

  • Some sites have unique structures that the generic scraper might miss
  • This is normal - not all sites have all fields
  • The scraper uses multiple fallback strategies, but some data may be unavailable

Connection timeouts

  • Check your internet connection
  • Some sites may be blocking automated requests
  • Try reducing maxArticlesPerSource to avoid overwhelming the target site

Rate limiting errors

  • Reduce maxArticlesPerSource (try 5-10 instead of 50+)
  • Scrape fewer sources per run
  • Wait between runs

FAQ

Q: Can I scrape paywalled content? A: No, this scraper only accesses publicly available content; it sees the same pages a logged-out browser would.

Q: How fast is the scraper? A: Approximately 1-2 articles per second, depending on site response time and content size. Built-in delays ensure polite scraping.

Q: Can I scrape sites not in the preset list? A: Yes! Use the customUrls option to scrape any tech news site.

Q: Will this work with JavaScript-heavy sites? A: This scraper uses static HTML parsing. For JavaScript-heavy sites (React, Vue, etc.), you may need a browser-based solution.

Q: How do I schedule automated scraping? A: Use Apify's scheduler feature to run hourly, daily, or weekly.

Q: Can I export data to a database? A: Yes, the scraper outputs JSON which can be easily imported into databases. On Apify, you can use integrations to automatically push data to various services.
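A database import can be a few lines; a sketch loading exported JSON into SQLite with Python's stdlib sqlite3 (field names follow the Output Format above, with a subset of columns for brevity):

```python
import json
import sqlite3

# Sample export in the Actor's output format (one illustrative record)
articles_json = """[
  {"title": "Sample", "url": "https://example.com/a", "author": "Jane Doe",
   "published_date": "2025-11-08T12:00:00+00:00", "source": "Example"}
]"""

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""CREATE TABLE articles (
    title TEXT, url TEXT UNIQUE, author TEXT, published_date TEXT, source TEXT)""")
for a in json.loads(articles_json):
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (a.get("title"), a.get("url"), a.get("author"),
         a.get("published_date"), a.get("source")),
    )
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])  # 1
```

The UNIQUE constraint on url plus INSERT OR IGNORE makes repeated imports of overlapping runs idempotent.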

Q: Is this legal? A: Web scraping legality depends on the website's terms of service and your jurisdiction. Always:

  • Check the site's robots.txt
  • Review their Terms of Service
  • Don't overload their servers
  • Use data responsibly

This tool is for educational and legitimate use cases only.

License

This project is provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.