Smart Article Extractor
Pricing
$10.00/month + usage
This Actor fetches article URLs and extracts structured content using Requests, Newspaper3k, and an HTML-parsing fallback. It collects the title, author, publish date, text, summary, keywords, images, and word count. It supports proxies and outputs clean JSON results.
Developer

Data Pilot
Smart Article Extractor is an Apify Actor that extracts article content from web pages using NLP and HTML parsing. For each URL it returns the title, authors, full text, a summary, and metadata. It suits article research, content aggregation, and NLP tasks.
With two extraction methods, Newspaper3k and an HTML-parsing fallback, the extractor can recover article content that simple HTML parsing alone may miss. It also reports word count, keywords, and publication dates, which makes the output useful for article analysis and content processing.
Features
- Comprehensive Article Extraction: Extracts titles, authors, full text, and summaries for any URL.
- Dual Extraction Engine: Uses Newspaper3k with an HTML-parsing fallback for robust content extraction.
- Homepage Detection: Automatically skips likely homepages so that only article content is processed.
- Proxy Support: Uses Apify's residential proxies to bypass restrictions and improve success rates.
- Metadata Enrichment: Provides keywords, publication dates, and image URLs.
- Text Cleaning: Cleans and formats article text for readability and NLP processing.
- Error Handling: Logs failures and falls back to the secondary method when extraction fails.
- Dataset Integration: Pushes results to your Apify dataset for export and analysis.
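The text-cleaning and word-count features could be approximated as follows. This is a minimal sketch, not the Actor's actual code: `clean_text` and `word_count` are hypothetical helper names, and the specific rules (NFKC normalization, whitespace collapsing, whitespace-based token counting) are assumptions.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, tidy whitespace, and squeeze runs of blank lines."""
    # Assumption: NFKC normalization is acceptable for the downstream NLP step.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of spaces and tabs inside each line, then trim the line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Rejoin and reduce three or more newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a simple stand-in for a real tokenizer)."""
    return len(text.split())
```

Counting whitespace-separated tokens is cheap but will differ slightly from a tokenizer-based count, so treat `word_count` values as approximate when comparing tools.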
How It Works
The Smart Article Extractor takes a list of URLs as input and uses requests to fetch the HTML content. It first tries Newspaper3k and falls back to HTML parsing for more complex pages. For each URL it returns structured article data on success or error details on failure.
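The primary-then-fallback flow described above can be sketched as below. This is an illustration under stated assumptions rather than the Actor's implementation: `extract_with_fallback` and the `Extractor` callable type are hypothetical, and the real extractors (Newspaper3k and the HTML parser) are passed in as plain functions so the pattern stays independent of either library.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical signature: an extractor takes (url, html) and returns a dict or None.
Extractor = Callable[[str, str], Optional[dict]]


def extract_with_fallback(url: str, html: str, primary: Extractor, fallback: Extractor) -> dict:
    """Try the primary extractor, then the fallback; return a success or error record."""
    scraped_at = datetime.now(timezone.utc).isoformat()
    for extractor in (primary, fallback):
        try:
            result = extractor(url, html)
        except Exception:
            result = None  # treat a crash like an empty result and try the next one
        # Assumption: a result only counts as a success if it has non-empty text.
        if result and result.get("text"):
            return {"url": url, **result, "scrapedAt": scraped_at}
    return {
        "url": url,
        "status": "failed",
        "error": "Article content not found or page is not an article",
        "scrapedAt": scraped_at,
    }
```

Passing the extractors as arguments also makes the flow easy to test with stub functions, without fetching any pages.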
Key Processing Steps:
- URL Validation: Parse and validate article URLs
- HTML Fetching: Fetch page content using requests
- Newspaper3k Extraction: Extract using the NLP-based method
- HTML Parsing Fallback: Use HTML parsing if the primary method fails
- Homepage Detection: Skip non-article pages
- Metadata Extraction: Get titles, authors, dates
- Text Cleaning: Format and clean article text
- Export: Push results to the dataset in JSON format
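The URL validation and homepage detection steps in the list above might look like this. The heuristic is an assumption made for illustration: the Actor's real rules are not documented here, and `is_valid_url`, `looks_like_homepage`, and the `SECTION_NAMES` set are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of single-segment paths that usually denote a landing page.
SECTION_NAMES = {"home", "index", "index.html", "index.php", "news", "latest"}


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def looks_like_homepage(url: str) -> bool:
    """Flag URLs with no path, or with a single well-known section segment."""
    segments = [s for s in urlparse(url.strip()).path.split("/") if s]
    if not segments:
        return True
    return len(segments) == 1 and segments[0].lower() in SECTION_NAMES
```

A path-only check like this is fast but coarse; a production version would probably also inspect the fetched page, for example the number of article links it contains.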
Key benefits for article analysis:
- Access full article text and metadata.
- Prepare article content for NLP and sentiment analysis.
- Build article databases for content research.
- Extract structured data from unstructured web content.
- Enable content aggregation and news monitoring.
Input
The extractor accepts the following input parameters:
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string / array | required | URLs to extract article data from, given as a comma-separated string or as an array (e.g., "https://example.com/article1, https://example.com/article2"). |
Example input JSON:
{"urls": "https://www.bbc.com/news/live/cz0g2yg3579t, https://example.com/article2"}
Alternative array format:
{"urls": ["https://example.com/article1","https://example.com/article2","https://news.example.com/article3"]}
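Because `urls` may arrive either as a comma-separated string or as an array, a consumer or wrapper script may want to normalize both forms into one list. The helper below is a hedged sketch; `normalize_urls` is a hypothetical name and the de-duplication step is an assumption, not documented Actor behavior.

```python
from typing import List, Union


def normalize_urls(urls: Union[str, List[str]]) -> List[str]:
    """Turn a comma-separated string or a list into a clean, de-duplicated list."""
    items = urls.split(",") if isinstance(urls, str) else list(urls)
    seen = set()
    result = []
    for item in items:
        url = str(item).strip()
        # Skip empty entries (e.g. from a trailing comma) and repeats.
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Note that splitting on commas breaks URLs that themselves contain a comma, so the array format is the safer choice for such inputs.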
Output
The extractor outputs one JSON record per URL. Each record includes:
| Field | Type | Description |
|---|---|---|
| url | string | Original URL of the article. |
| domain | string | Domain of the article source. |
| title | string | Title of the article. |
| author | array | Author(s) of the article. |
| publisher | string | Publisher domain of the article. |
| published | string | Publication date of the article. |
| text | string | Full text content of the article. |
| summary | string | Summary of the article. |
| keywords | array | Keywords associated with the article. |
| word_count | integer | Word count of the article. |
| image | string | Top image URL of the article. |
| scrapedAt | string | ISO timestamp of the scrape. |
Example output:
{"url": "https://example.com/article1","domain": "example.com","title": "Example Smart Article Title","author": ["John Doe"],"publisher": "example.com","published": "2025-02-14T12:00:00Z","text": "This is the full text of the smart article...","summary": "This is a summary of the smart article.","keywords": ["technology", "news"],"word_count": 500,"image": "https://example.com/image.jpg","scrapedAt": "2025-02-14T12:00:00Z"}
Example error response:
{"url": "https://www.bbc.com/news","status": "failed","error": "Article content not found or page is not an article","scrapedAt": "2025-02-14T12:00:00Z"}
Example summary record:
{"summary": true,"total_urls": 10,"successful_extractions": 9,"failed_extractions": 1,"average_word_count": 650,"total_keywords": 45,"completed_at": "2025-02-14T12:35:00Z"}
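A summary record with the fields shown above could be derived from the per-URL records roughly as follows. This is a sketch under assumptions: `build_summary` is a hypothetical helper, `total_keywords` is assumed to be the sum of keyword counts across successful records, and the average is assumed to be rounded to an integer.

```python
from datetime import datetime, timezone
from typing import List


def build_summary(records: List[dict]) -> dict:
    """Aggregate per-URL records into a single summary record."""
    # Failed records carry "status": "failed"; everything else counts as a success.
    ok = [r for r in records if r.get("status") != "failed"]
    counts = [r.get("word_count", 0) for r in ok]
    return {
        "summary": True,
        "total_urls": len(records),
        "successful_extractions": len(ok),
        "failed_extractions": len(records) - len(ok),
        "average_word_count": round(sum(counts) / len(counts)) if counts else 0,
        "total_keywords": sum(len(r.get("keywords", [])) for r in ok),
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
```

When reading the dataset, keep in mind that `summary` is a string in article records and `true` in the summary record, so checking `record.get("summary") is True` is one way to tell them apart.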
Technical Stack
- Article Extraction: Newspaper3k, for NLP-based extraction
- HTML Parsing: fallback parser for pages the primary method cannot handle
- HTTP Client: requests, for web page fetching
- NLP: NLTK, for natural language processing used in summaries
- Text Processing: Natural language tokenization and analysis
- Platform: Apify Actor (serverless, scalable, integrated with Dataset)
- Deployment: One-click run on Apify Console or via REST API
Use Cases
- Content Aggregation: Aggregate article content from multiple sources.
- News Monitoring: Extract and monitor news articles on specific topics.
- Research & Analysis: Extract content for academic and market research.
- Sentiment Analysis: Analyze article sentiment and tone.
- Keyword Extraction: Extract and analyze keywords from articles.
- Content Database Building: Build databases of article content.
- Archive Creation: Create archives of important articles.
- Competitor Intelligence: Monitor competitor-related news and articles.
- SEO Analysis: Analyze article structure for SEO insights.
- Content Recommendation: Extract data for content recommendation systems.
- Machine Learning Datasets: Create datasets for ML and NLP models.
- Summarization Research: Analyze article summaries and extraction quality.
- Author Analysis: Track articles by specific authors.
- Publication Analysis: Analyze content patterns by publication.
Quick Start
- Open in Apify Console: visit the Actor page and click Try for free.
- Enter article URLs: provide one or more article URLs (comma-separated or as an array).
- Click Start: the Actor extracts article content using both extraction engines.
- View Results: check the dataset for extracted article data.
- Analyze Content: examine titles, authors, full text, summaries, and keywords.
- Review Metadata: check publication dates, authors, and image URLs.
- Monitor Quality: review extraction success and any failures.
- Export: download the results as JSON, CSV, or Excel.
You can also call this Actor programmatically via the Apify SDK or REST API, which suits automated content extraction and news aggregation pipelines.
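A programmatic call through the REST API might look like the sketch below, which uses only the Python standard library and Apify's synchronous "run and get dataset items" endpoint. The Actor ID and token are placeholders (this page does not state the Actor's ID), and `build_run_request` and `run_actor` are hypothetical helper names.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, urls: list) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items",
        data=json.dumps({"urls": urls}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, urls: list) -> list:
    """Send the request and decode the JSON array of result records."""
    request = build_run_request(actor_id, token, urls)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholders: substitute the real Actor ID ("username~actor-name") and API token.
    items = run_actor("username~actor-name", "YOUR_APIFY_TOKEN",
                      ["https://example.com/article1"])
    print(len(items), "records")
```

The synchronous endpoint is convenient for short runs; for large URL lists, starting the run asynchronously and reading the dataset afterwards is likely the better fit.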
Why This Extractor?
| Feature | Benefit |
|---|---|
| Full text | Get complete article content, not just snippets. |
| NLP-powered | Smart extraction using natural language processing. |
| Metadata rich | Get titles, authors, dates, keywords, images. |
| Homepage detection | Skip non-article pages automatically. |
| Text cleaning | Clean, formatted text ready for analysis. |
| Error handling | Robust fallback mechanisms. |
| Apify ecosystem | Seamless integration with other Actors, triggers, and webhooks. |
Changelog
- Initial release of Smart Article Extractor
- Newspaper3k integration for NLP-based extraction
- HTML parsing fallback for complex pages
- Title, author, and publisher extraction
- Full text content extraction
- Publication date parsing
- Keyword extraction and analysis
- Article summary generation
- Image URL extraction
- Word count calculation
- Homepage detection and filtering
- Text cleaning and formatting
- Dual extraction engine support
- Error handling with detailed logging
- Automatic dataset integration
- Full Apify Actor integration
Support & Feedback
- Issues & Ideas: Open a ticket on the Apify Actor issue tracker
- Contributions: Pull requests are welcome via the GitHub repository
- Documentation: Visit Apify Docs for comprehensive platform guides
- Community: Join the Apify community forum for discussions and support
- Bug Reports: Submit detailed bug reports through the issue tracker
- Feature Requests: Suggest new features to improve the extractor
Pricing
- Free for basic usage on Apify platform
- Paid plans available for higher limits and priority support
Disclaimer: Smart Article Extractor is provided as-is for research and content analysis purposes. Users are responsible for ensuring their usage complies with website policies and applicable copyright laws. Always attribute content to original authors and publications.
Get Started Today
Begin extracting article content now!
Use Smart Article Extractor for:
- Content Aggregation
- Text Analysis
- Research & Analysis
- NLP Tasks
- Database Building
Perfect for:
- Content Strategists
- Researchers
- Data Scientists
- Content Curators
- Analysts
Last Updated: February 2025
Version: 1.0.0
Status: Active Development
Support: 24/7 Customer Support Available
Platform: Apify
Related Tools
For comprehensive content analysis and research, explore our full suite of tools:
- Fast News Content Scraper
- Google Search Results Scraper
- Ranked Keywords Scraper with SEO Metrics
- RAG Web Scraper
- All-in-One Media Downloader