Pricing

$10.00/month + usage

Smart Article Extractor

Smart Article Extractor is an Apify Actor that fetches article URLs and extracts structured content using Requests, Newspaper3k, and an HTML-parsing fallback. It collects the title, author, publish date, text, summary, keywords, images, and word count. It supports proxies and outputs clean JSON results.

Developer

Data Pilot

Last modified: 4 months ago

🔥 Features

Comprehensive Article Extraction – Extracts titles, authors, full text, and summaries from each article URL.
Dual Extraction Engine – Runs Newspaper3k first, with an HTML-parsing fallback for pages it cannot handle.
Homepage Detection – Automatically skips likely homepages so that only article pages are processed.
Proxy Support – Uses Apify's residential proxies to work around access restrictions and improve success rates.
Metadata Enrichment – Adds keywords, publication dates, and image URLs to each record.
Text Cleaning – Cleans and formats article text for readability and NLP processing.
Error Handling – Logs failures and falls back to the secondary engine when the primary extraction fails.
Dataset Integration – Pushes results to your Apify dataset for export and analysis.
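
The Actor's actual heuristics are not documented on this page. As a rough illustration only, homepage detection and text cleaning could be sketched with the Python standard library as follows. The function names (`looks_like_homepage`, `clean_text`, `word_count`) and the `SECTION_PATHS` rule are assumptions made for this example, not the Actor's code.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule: single-segment paths that usually are index pages, not articles.
SECTION_PATHS = {"news", "home", "index.html", "latest"}


def looks_like_homepage(url: str) -> bool:
    """Guess whether a URL points at a homepage or a section index."""
    path = urlparse(url).path.strip("/")
    if not path:
        return True  # bare domain
    segments = path.split("/")
    return len(segments) == 1 and segments[0].lower() in SECTION_PATHS


def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and keep at most one blank line in a row."""
    text = raw.replace("\r\n", "\n").replace("\xa0", " ")
    text = re.sub(r"[ \t]+", " ", text)     # squeeze horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # limit consecutive blank lines
    return text.strip()


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())
```

A URL-only check like this is cheap but coarse; a production implementation would probably also look at the page content, for example the number of outbound article links.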

⚙️ How It Works

The Smart Article Extractor takes a list of URLs as input and fetches each page's HTML with requests. It runs Newspaper3k first and falls back to plain HTML parsing for pages that Newspaper3k cannot handle. For each URL it returns either a structured article record or an error record describing the failure.
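
For orientation, the primary path (fetch with requests, parse with Newspaper3k) might look roughly like the sketch below. It is not the Actor's source. The helper names `build_record` and `extract_with_newspaper` are hypothetical, and setting `publisher` to the URL's host is an assumption based on the field description in the Output section.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def build_record(url, article):
    """Map a parsed article object onto the output fields listed under Output."""
    host = urlparse(url).netloc
    published = article.publish_date  # datetime or None in Newspaper3k
    return {
        "url": url,
        "domain": host,
        "title": article.title,
        "author": list(article.authors),
        "publisher": host,  # assumption: publisher is reported as the host name
        "published": published.isoformat() if published else None,
        "text": article.text,
        "summary": article.summary,
        "keywords": list(article.keywords),
        "word_count": len(article.text.split()),
        "image": article.top_image,
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }


def extract_with_newspaper(url, timeout=30):
    """Fetch a page with requests and parse it with Newspaper3k."""
    # Third-party imports stay local so build_record works without them installed.
    import requests
    from newspaper import Article

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    article = Article(url)
    article.download(input_html=response.text)  # reuse the HTML already fetched
    article.parse()
    article.nlp()  # fills summary and keywords; needs NLTK's "punkt" data
    return build_record(url, article)
```

Calling `extract_with_newspaper("https://example.com/article1")` requires the `requests` and `newspaper3k` packages plus network access, so treat it as a starting point rather than a drop-in replacement for the Actor.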

Key Processing Steps:

URL Validation – Parse and validate article URLs
HTML Fetching – Fetch page content using requests
Newspaper3k Extraction – Extract using NLP-based method
HTML Parsing Fallback – Use HTML parsing if the primary extraction fails
Homepage Detection – Skip non-article pages
Metadata Extraction – Get titles, authors, dates
Text Cleaning – Format and clean article text
Export – Push results to dataset in JSON format
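
The primary-then-fallback step in the list above can be sketched as follows. This example swaps in the standard library's `html.parser` for the fallback because the Actor's actual parsing library is not named on this page. `ParagraphParser`, `fallback_extract`, `extract`, the `min_words` threshold, and the `engine` key are all hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ParagraphParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self.flush()  # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer.append(data)

    def flush(self):
        """Store the buffered paragraph, if any, with whitespace normalized."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._in_p = False


def fallback_extract(html):
    """Plain HTML parsing: page title plus all paragraph text."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return {"title": parser.title.strip(), "text": "\n\n".join(parser.paragraphs)}


def extract(html, primary, min_words=50):
    """Try primary(html); fall back when it raises or returns too little text."""
    try:
        result = primary(html)
        if len(result.get("text", "").split()) >= min_words:
            return {**result, "engine": "primary"}
    except Exception:
        pass  # any primary failure routes to the fallback
    result = fallback_extract(html)
    if not result["text"]:
        raise ValueError("Article content not found or page is not an article")
    return {**result, "engine": "fallback"}
```

Passing the primary extractor in as a callable keeps the fallback logic independent of any particular library, which also makes the chain easy to exercise with stub extractors.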

Key benefits for article analysis:

Access full article text and metadata.
Prepare article content for NLP and sentiment analysis.
Build article databases for content research.
Extract structured data from unstructured web content.
Enable content aggregation and news monitoring.

📥 Input

The extractor accepts the following input parameters:

Field	Type	Default	Description
`urls`	string / array	required	URLs to extract article data from, given either as a comma-separated string or as an array (e.g., `"https://example.com/article1, https://example.com/article2"`).

Example input JSON:

{
  "urls": "https://www.bbc.com/news/live/cz0g2yg3579t, https://example.com/article2"
}

Alternative array format:

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://news.example.com/article3"
  ]
}
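
Because `urls` may arrive as either a string or an array, the input has to be normalized before processing. The sketch below shows one way to do that with the standard library. `normalize_urls` is a hypothetical helper, and the rule of accepting only `http` and `https` URLs is an assumption.

```python
from urllib.parse import urlparse


def normalize_urls(urls):
    """Accept a comma-separated string or a list; return (valid, rejected) URLs."""
    if isinstance(urls, str):
        candidates = urls.split(",")
    else:
        candidates = list(urls)

    seen, valid, rejected = set(), [], []
    for candidate in candidates:
        url = str(candidate).strip()
        if not url:
            continue  # ignore empty entries such as a trailing comma
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            if url not in seen:  # drop exact duplicates, keep first-seen order
                seen.add(url)
                valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected
```

One caveat of the comma-separated form: a URL that itself contains a comma would be split in two, so the array form is the safer choice for such URLs.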

📤 Output

The extractor outputs one JSON record per URL. Each successful record includes:

Field	Type	Description
`url`	string	Original URL of the article.
`domain`	string	Domain of the article source.
`title`	string	Title of the article.
`author`	array	Author(s) of the article.
`publisher`	string	Publisher domain of the article.
`published`	string	Publication date of the article.
`text`	string	Full text content of the article.
`summary`	string	Generated summary of the article.
`keywords`	array	Keywords extracted from the article.
`word_count`	integer	Word count of the article text.
`image`	string	URL of the article's top image.
`scrapedAt`	string	ISO timestamp of the scrape.

Example output for a successful extraction:

{
  "url": "https://example.com/article1",
  "domain": "example.com",
  "title": "Example Smart Article Title",
  "author": ["John Doe"],
  "publisher": "example.com",
  "published": "2025-02-14T12:00:00Z",
  "text": "This is the full text of the smart article...",
  "summary": "This is a summary of the smart article.",
  "keywords": ["technology", "news"],
  "word_count": 500,
  "image": "https://example.com/image.jpg",
  "scrapedAt": "2025-02-14T12:00:00Z"
}

Example error response:

{
  "url": "https://www.bbc.com/news",
  "status": "failed",
  "error": "Article content not found or page is not an article",
  "scrapedAt": "2025-02-14T12:00:00Z"
}

Example summary record:

{
  "summary": true,
  "total_urls": 10,
  "successful_extractions": 9,
  "failed_extractions": 1,
  "average_word_count": 650,
  "total_keywords": 45,
  "completed_at": "2025-02-14T12:35:00Z"
}
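
A summary record like the one above can be derived from the per-URL records. The following is a minimal sketch under stated assumptions: `build_summary` is a hypothetical helper, the average is rounded to a whole number, and `total_keywords` counts keywords across all successful records without de-duplication. The Actor may compute these values differently.

```python
from datetime import datetime, timezone


def build_summary(records):
    """Aggregate per-URL records into a run summary."""
    ok = [r for r in records if r.get("status") != "failed"]
    word_counts = [r.get("word_count", 0) for r in ok]
    return {
        "summary": True,
        "total_urls": len(records),
        "successful_extractions": len(ok),
        "failed_extractions": len(records) - len(ok),
        # Assumption: average over successful records only, rounded to an integer.
        "average_word_count": round(sum(word_counts) / len(ok)) if ok else 0,
        "total_keywords": sum(len(r.get("keywords", [])) for r in ok),
        "completed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Note that `summary` is a string in article records but the boolean `true` in the summary record, so a consumer can separate the two with a check such as `record.get("summary") is True`.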

🧰 Technical Stack

Article Extraction: Newspaper3k – Advanced NLP-based extraction
HTML Parsing: Robust fallback parsing
HTTP Client: requests – Web page fetching
NLP: NLTK – Natural language processing for summaries
Text Processing: Natural language tokenization and analysis
Platform: Apify Actor – serverless, scalable, integrated with Dataset
Deployment: One‑click run on Apify Console or via REST API

🎯 Use Cases

Content Aggregation – Aggregate article content from multiple sources.
News Monitoring – Extract and monitor news articles on specific topics.
Research & Analysis – Extract content for academic and market research.
Sentiment Analysis – Analyze article sentiment and tone.
Keyword Extraction – Extract and analyze keywords from articles.
Content Database Building – Build databases of article content.
Archive Creation – Create archives of important articles.
Competitor Intelligence – Monitor competitor-related news and articles.
SEO Analysis – Analyze article structure for SEO insights.
Content Recommendation – Extract data for content recommendation systems.
Machine Learning Datasets – Create datasets for ML and NLP models.
Summarization Research – Analyze article summaries and extraction quality.
Author Analysis – Track articles by specific authors.
Publication Analysis – Analyze content patterns by publication.

🚀 Quick Start

Open in Apify Console – visit the Actor page and click Try for free.
Enter article URLs – provide one or more article URLs (comma-separated or as array).
Click Start – the Actor will extract article content using dual extraction engines.
View Results – check the dataset for extracted article data.
Analyze Content – examine titles, authors, full text, summaries, and keywords.
Review Metadata – check publication dates, authors, and image URLs.
Monitor Quality – review extraction success and any failures.
Export – download the results as JSON, CSV, or Excel.

You can also call this Actor programmatically via Apify SDK or REST API – ideal for automated content extraction and news aggregation pipelines.
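
As a rough sketch of a programmatic call, the example below builds a request to the Apify REST API's synchronous run endpoint using only the standard library. The Actor ID is not stated on this page, so `"<username>/<actor-name>"` is a placeholder you must replace, and `build_run_request` and `run_actor` are hypothetical helper names.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, urls):
    """Build a POST request that runs the Actor and returns its dataset items."""
    # The REST API expects "username~actor-name" instead of "username/actor-name".
    endpoint = f"{API_BASE}/acts/{actor_id.replace('/', '~')}/run-sync-get-dataset-items"
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def run_actor(actor_id, token, urls, timeout=300):
    """Send the request and decode the returned dataset items (needs network)."""
    request = build_run_request(actor_id, token, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (placeholder Actor ID and token, not executed here):
# items = run_actor("<username>/<actor-name>", "YOUR_APIFY_TOKEN",
#                   ["https://example.com/article1"])
```

Synchronous endpoints are time-limited, so large URL lists may be better served by starting a run asynchronously and reading the dataset once it finishes.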

📦 Changelog

Initial release of Smart Article Extractor
Newspaper3k integration for NLP-based extraction
HTML-parsing fallback for complex pages
Title, author, and publisher extraction
Full text content extraction
Publication date parsing
Keyword extraction and analysis
Article summary generation
Image URL extraction
Word count calculation
Homepage detection and filtering
Text cleaning and formatting
Dual extraction engine support
Error handling with detailed logging
Automatic dataset integration
Full Apify Actor integration

🧑‍💻 Support & Feedback

Issues & Ideas: Open a ticket on the Apify Actor issue tracker
Contributions: Pull requests are welcome via the GitHub repository
Documentation: Visit Apify Docs for comprehensive platform guides
Community: Join the Apify community forum for discussions and support
Bug Reports: Submit detailed bug reports through the issue tracker
Feature Requests: Suggest new features to improve the extractor

💰 Pricing

Free for basic usage on Apify platform
Paid plans available for higher limits and priority support

Disclaimer: Smart Article Extractor is provided as-is for research and content analysis purposes. Users are responsible for ensuring their usage complies with website policies and applicable copyright laws. Always attribute content to original authors and publications.

🎉 Get Started Today

Begin extracting article content now!

Use Smart Article Extractor for:

📰 Content Aggregation
📊 Text Analysis
🔍 Research & Analysis
💡 NLP Tasks
📚 Database Building

Perfect for:

Content Strategists
Researchers
Data Scientists
Content Curators
Analysts

Last Updated: February 2025
Version: 1.0.0
Status: Active Development
Support: 24/7 Customer Support Available
Platform: Apify

For comprehensive content analysis and research, explore our full suite of tools:

Fast News Content Scraper
Google Search Results Scraper
Ranked Keywords Scraper with SEO Metrics
RAG Web Scraper
All-in-One Media Downloader
