Article to Text Extractor (for TTS/LLMs)
Pricing
from $1.00 / 1,000 articles extracted
Extract the core readable text of any article or blog post, stripping out boilerplate. Perfect for Text-to-Speech or AI summaries.
Developer
Andok
19 days ago
Last modified
Article Text Extractor for TTS & AI
Extract clean, readable article text from any web page, stripped of navigation, ads, and boilerplate. Feed the output directly into text-to-speech engines, summarization models, or LLM pipelines without wasting tokens on HTML noise. Bulk-process hundreds of URLs in parallel.
Features
- Readability engine — uses Mozilla Readability to isolate the main article content from page clutter
- Plain text output — returns clean text ready for TTS APIs like ElevenLabs or OpenAI TTS
- Bulk processing — extract articles from hundreds of URLs in a single run
- Metadata extraction — captures title, author byline, and excerpt alongside the article text
- Redirect tracking — follows HTTP redirects and records the final URL
- Configurable concurrency — process 1 to 50 URLs in parallel
- Backwards compatible — accepts both the `urls` array and the single `url` field
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | No | — | List of webpage URLs to extract article text from |
| `url` | string | No | — | Single URL for backwards compatibility (use `urls` for bulk) |
| `timeoutSeconds` | integer | No | 15 | Maximum seconds to wait for each URL response |
| `concurrency` | integer | No | 10 | Number of URLs to process in parallel (1-50) |
Input Example
    {
      "urls": [
        "https://crawlee.dev",
        "https://blog.apify.com/what-is-web-scraping/"
      ],
      "timeoutSeconds": 15,
      "concurrency": 10
    }
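
The input fields above can be assembled and sanity-checked before starting a run. The following is a minimal sketch, not part of the actor itself: `build_input` is a hypothetical helper, and the rule that at least one of `urls` or `url` must be supplied is an assumption, since the table marks both as optional.

```python
from typing import Any, Dict, List, Optional


def build_input(
    urls: Optional[List[str]] = None,
    url: Optional[str] = None,
    timeout_seconds: int = 15,
    concurrency: int = 10,
) -> Dict[str, Any]:
    """Assemble an input object using the field names from the Input table."""
    targets = list(urls or [])
    if url:
        # Fold the single-URL field into the list form used for bulk runs.
        targets.append(url)
    if not targets:
        # Assumption: a run with no URLs at all is not useful.
        raise ValueError("provide at least one URL via urls or url")
    if not 1 <= concurrency <= 50:
        # The documented range for concurrency is 1-50.
        raise ValueError("concurrency must be between 1 and 50")
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return {
        "urls": targets,
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }
```

With the official `apify-client` package, a dictionary like this could be passed as `run_input` to `ApifyClient(token).actor(actor_id).call(...)`. The actor ID is not stated on this page, so it stays a placeholder here.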
Output
Each URL produces one dataset item containing the extracted plain text and metadata.
Key output fields:
- `inputUrl` (string) — the original URL provided
- `finalUrl` (string) — the URL after following redirects
- `status` (number) — HTTP status code
- `pageTitle` (string) — extracted article title
- `byline` (string) — author name if available
- `excerpt` (string) — short summary of the article
- `textContent` (string) — the full article text, cleaned and ready for TTS or AI processing
- `error` (string) — error message if extraction failed, otherwise `null`
- `checkedAt` (string) — ISO 8601 timestamp of when the extraction was performed
Output Example
    {
      "inputUrl": "https://crawlee.dev",
      "finalUrl": "https://crawlee.dev/",
      "status": 200,
      "pageTitle": "Crawlee - Build reliable crawlers. Fast.",
      "byline": null,
      "excerpt": "Crawlee is a web scraping and browser automation library for Node.js.",
      "textContent": "Crawlee\n\nBuild reliable crawlers. Fast.\n\nCrawlee is a web scraping and browser automation library that helps you build reliable crawlers...",
      "error": null,
      "checkedAt": "2025-01-15T10:30:00.000Z"
    }
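
Downstream code usually needs to skip failed items and split long text before sending it to a TTS or LLM API. The sketch below shows one way to do that with the output fields listed above. The helper names and the 2000-character default are illustrative assumptions, not limits imposed by the actor or by any particular TTS provider.

```python
import re
from typing import Any, Dict, Iterable, List


def usable_items(items: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep items whose extraction succeeded and produced some text."""
    return [
        item
        for item in items
        if item.get("error") is None and (item.get("textContent") or "").strip()
    ]


def chunk_text(text: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so callers with a hard limit need an extra split step.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines fits the `textContent` format shown in the example, where paragraphs are separated by `\n\n`.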
Pricing
| Event | Cost |
|---|---|
| Article Extracted | Pay-per-event (see actor pricing page) |
The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
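
To choose a sensible max charge limit, it can help to estimate the cost of a batch up front. The sketch below assumes the listed starting price of $1.00 per 1,000 articles extracted; the actual rate may differ by plan, so treat `PRICE_PER_1000` as a placeholder to adjust. Both function names are hypothetical.

```python
from decimal import Decimal

# Assumption: the listed "from" price; check the actor pricing page for your rate.
PRICE_PER_1000 = Decimal("1.00")


def estimated_cost(article_count: int, price_per_1000: Decimal = PRICE_PER_1000) -> Decimal:
    """Estimated charge in USD for a number of extracted articles."""
    cost = Decimal(article_count) * price_per_1000 / 1000
    return cost.quantize(Decimal("0.01"))


def articles_within_cap(max_charge_usd: str, price_per_1000: Decimal = PRICE_PER_1000) -> int:
    """How many articles fit under a spending cap, rounded down."""
    per_article = price_per_1000 / 1000
    return int(Decimal(max_charge_usd) / per_article)
```

Under that assumed rate, 250 articles would cost about $0.25, and a $0.50 cap would cover roughly 500 articles before processing stops.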
Use Cases
- Podcast generation — turn blog posts and news articles into clean text payloads for TTS APIs
- LLM summarization — feed distraction-free article text into GPT, Claude, or other models
- Content monitoring — track article changes over time with clean text snapshots
- Accessibility tools — extract readable text for screen readers and assistive technology
- Newsletter curation — pull article text from multiple sources for digest generation
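
For the content monitoring use case, one simple approach is to store a hash of each article's text and compare it on the next run. This is a sketch under assumptions: the helper names are hypothetical, whitespace is normalized before hashing so layout-only changes are ignored, and URLs with no stored fingerprint are reported as changed.

```python
import hashlib
from typing import Any, Dict, List


def text_fingerprint(text: str) -> str:
    """SHA-256 of the text with runs of whitespace collapsed to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_urls(previous: Dict[str, str], items: List[Dict[str, Any]]) -> List[str]:
    """Return inputUrl values whose text differs from the stored fingerprint.

    `previous` maps inputUrl to the fingerprint saved from an earlier run.
    Failed or empty items are skipped rather than reported as changes.
    """
    changed: List[str] = []
    for item in items:
        if item.get("error") is not None or not item.get("textContent"):
            continue
        fingerprint = text_fingerprint(item["textContent"])
        if previous.get(item["inputUrl"]) != fingerprint:
            changed.append(item["inputUrl"])
    return changed
```

After each run, the stored mapping would be updated with the new fingerprints so that the next comparison is against the latest snapshot.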
Related Actors
| Actor | What it adds |
|---|---|
| Web Page to Markdown Converter for LLMs | Markdown-formatted output with heading structure preserved |
| PDF to Text Converter for AI & RAG | Extend text extraction to PDF documents |
| RSS Feed Parser & Reader | Discover article URLs automatically from RSS feeds |
