Article to Text Extractor (for TTS/LLMs)

Pricing

from $1.00 / 1,000 articles extracted

Extract the core readable text of any article or blog post, stripping out boilerplate. Perfect for Text-to-Speech or AI summaries.

Developer

Andok

Maintained by Community

Article Text Extractor for TTS & AI

Extract clean, readable article text from any web page, stripped of navigation, ads, and boilerplate. Feed the output directly into text-to-speech engines, summarization models, or LLM pipelines without wasting tokens on HTML noise. Bulk-process hundreds of URLs in parallel, with configurable concurrency.

Features

  • Readability engine — uses Mozilla Readability to isolate the main article content from page clutter
  • Plain text output — returns clean text ready for TTS APIs like ElevenLabs or OpenAI TTS
  • Bulk processing — extract articles from hundreds of URLs in a single run
  • Metadata extraction — captures title, author byline, and excerpt alongside the article text
  • Redirect tracking — follows HTTP redirects and records the final URL
  • Configurable concurrency — process 1 to 50 URLs in parallel
  • Backwards compatible — accepts both urls array and single url field

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array | No | | List of webpage URLs to extract article text from |
| url | string | No | | Single URL for backwards compatibility (use urls for bulk) |
| timeoutSeconds | integer | No | 15 | Maximum seconds to wait for each URL response |
| concurrency | integer | No | 10 | Number of URLs to process in parallel (1-50) |

Input Example

{
  "urls": [
    "https://crawlee.dev",
    "https://blog.apify.com/what-is-web-scraping/"
  ],
  "timeoutSeconds": 15,
  "concurrency": 10
}
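
The input rules above (a `urls` list or the legacy `url` field, a 15-second default timeout, and a concurrency between 1 and 50) can be checked locally before a run is started. The following is a minimal Python sketch, not part of the actor: `build_input` is a hypothetical helper, and both the clamping of out-of-range concurrency values and the error raised when no URL is given are assumptions about how a caller might want to handle bad input.

```python
from typing import Optional, Sequence


def build_input(
    urls: Optional[Sequence[str]] = None,
    url: Optional[str] = None,
    timeout_seconds: int = 15,
    concurrency: int = 10,
) -> dict:
    """Build an input payload using the field names from the Input table."""
    # Merge the legacy single `url` into the list, keeping first-seen order
    # and dropping exact duplicates.
    merged = list(urls or [])
    if url:
        merged.append(url)
    unique = list(dict.fromkeys(merged))

    # Assumption: an empty run is treated as a caller error.
    if not unique:
        raise ValueError("provide at least one URL via `urls` or `url`")
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")

    return {
        "urls": unique,
        "timeoutSeconds": timeout_seconds,
        # Assumption: clamp locally to the documented 1-50 range.
        "concurrency": max(1, min(50, concurrency)),
    }
```

The returned dictionary has the same shape as the input example and can be serialized with `json.dumps` or handed to whichever client starts the run.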

Output

Each URL produces one dataset item containing the extracted plain text and metadata.

Key output fields:

  • inputUrl (string) — the original URL provided
  • finalUrl (string) — the URL after following redirects
  • status (number) — HTTP status code
  • pageTitle (string) — extracted article title
  • byline (string) — author name if available
  • excerpt (string) — short summary of the article
  • textContent (string) — the full article text, cleaned and ready for TTS or AI processing
  • error (string) — error message if extraction failed, otherwise null
  • checkedAt (string) — ISO 8601 timestamp of when the extraction was performed

Output Example

{
  "inputUrl": "https://crawlee.dev",
  "finalUrl": "https://crawlee.dev/",
  "status": 200,
  "pageTitle": "Crawlee - Build reliable crawlers. Fast.",
  "byline": null,
  "excerpt": "Crawlee is a web scraping and browser automation library for Node.js.",
  "textContent": "Crawlee\n\nBuild reliable crawlers. Fast.\n\nCrawlee is a web scraping and browser automation library that helps you build reliable crawlers...",
  "error": null,
  "checkedAt": "2025-01-15T10:30:00.000Z"
}
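
Because every URL yields one item whether or not extraction succeeded, downstream code usually needs to separate usable items from failed ones before sending text to a TTS or LLM step. The sketch below assumes the items have already been loaded as Python dictionaries with the field names listed above; `split_results` is a hypothetical helper, and treating an empty `textContent` as a failure is an assumption rather than documented actor behavior.

```python
from typing import Iterable, List, Tuple


def split_results(items: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate items with usable text from failed or empty ones."""
    ok: List[dict] = []
    failed: List[dict] = []
    for item in items:
        text = (item.get("textContent") or "").strip()
        # Per the field list, `error` is null (None in Python) on success.
        if item.get("error") is None and text:
            ok.append(item)
        else:
            failed.append(item)
    return ok, failed
```

The `failed` list keeps the original `inputUrl` and `error` values, so those URLs can be logged or retried in a later run.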

Pricing

| Event | Cost |
| --- | --- |
| Article Extracted | Pay-per-event (see actor pricing page) |

The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
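
To get a rough sense of how many articles fit under a given spending cap, the listed starting price can be turned into a per-article figure. This is only an estimate sketched in Python: it assumes the "from $1.00 / 1,000" rate from the listing applies and that "Article Extracted" is the only charged event, neither of which is guaranteed by this page. `max_articles` is a hypothetical helper.

```python
from decimal import Decimal


def max_articles(max_charge_usd: str, price_per_1000_usd: str = "1.00") -> int:
    """Estimate how many extracted articles a spending cap covers."""
    # Decimal avoids binary floating-point rounding on currency values.
    price_per_article = Decimal(price_per_1000_usd) / Decimal(1000)
    # Only whole articles can be charged, so truncate the quotient.
    return int(Decimal(max_charge_usd) / price_per_article)
```

Under those assumptions, a cap of $0.25 covers about 250 articles; check the actor pricing page for the rate that actually applies to your plan.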

Use Cases

  • Podcast generation — turn blog posts and news articles into clean text payloads for TTS APIs
  • LLM summarization — feed distraction-free article text into GPT, Claude, or other models
  • Content monitoring — track article changes over time with clean text snapshots
  • Accessibility tools — extract readable text for screen readers and assistive technology
  • Newsletter curation — pull article text from multiple sources for digest generation
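
For the content monitoring use case, one simple approach is to store a hash of each article's `textContent` and compare it on the next run. The Python sketch below is an illustration only: `fingerprint` and `changed_urls` are hypothetical helpers, the in-memory dictionary stands in for whatever storage you use, and collapsing whitespace before hashing is an assumption meant to ignore trivial formatting differences.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so spacing changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_urls(previous: Dict[str, str], items: Iterable[dict]) -> List[str]:
    """Return inputUrls whose text differs from the stored fingerprint.

    `previous` maps inputUrl to the fingerprint from an earlier run and is
    updated in place with the new fingerprints.
    """
    changed: List[str] = []
    for item in items:
        # Skip failed extractions instead of reporting them as changes.
        if item.get("error") is not None or not item.get("textContent"):
            continue
        url = item["inputUrl"]
        digest = fingerprint(item["textContent"])
        # A URL seen for the first time is recorded but not flagged.
        if previous.get(url) not in (None, digest):
            changed.append(url)
        previous[url] = digest
    return changed
```

Persisting `previous` between runs (for example as a JSON file) is enough to detect edits to an article without storing its full text.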

Related Actors

| Actor | What it adds |
| --- | --- |
| Web Page to Markdown Converter for LLMs | Markdown-formatted output with heading structure preserved |
| PDF to Text Converter for AI & RAG | Extend text extraction to PDF documents |
| RSS Feed Parser & Reader | Discover article URLs automatically from RSS feeds |