Any Website URL to Article Summarizer

Pricing

from $4.99 / 1,000 results

Extract and summarize articles from any website URL. Returns title, author, publish date, word count, reading time, full text, and a concise AI-style summary using extractive summarization.

Rating

0.0

(0)

Developer

Coding Frontned

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 days ago

Extract and summarize article content from any website URL. Works on news sites, blogs, Wikipedia, Medium, documentation, and more.

Uses extractive summarization (selects the most important sentences from the article itself), so no external AI API key is required.

Features

  • 📰 Extract articles from any URL: news sites, blogs, Wikipedia, Medium, etc.
  • 📝 Automatic extractive summary generation (selects the most informative sentences)
  • 🔑 Key points: the top 5 most important sentences from the article
  • 👤 Extracts title, author, publish date, description, and hero image
  • 📊 Word count and reading time estimation
  • 🔒 No AI API key required: all summarization happens locally
  • 🌐 Supports multiple URLs per run
  • 💾 Optional: include full cleaned article text in output

How Summarization Works

This actor uses extractive summarization:

  1. Article text is cleaned and split into sentences
  2. Each sentence is scored by word frequency (TF-style scoring)
  3. A position bias boosts the scores of early sentences (intros are typically more important)
  4. The top-scoring sentences are selected and returned in their original reading order
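
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actor's actual code: the stop-word list, the per-sentence averaging of word frequencies, the linear `position_boost`, and the function names `split_sentences` and `summarize` are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "it", "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text: str, num_sentences: int = 5, position_boost: float = 0.2) -> list[str]:
    """Return the top-scoring sentences in their original reading order."""
    sentences = split_sentences(text)
    if not sentences:
        return []

    # Tokenize each sentence into lowercase words, dropping stop words.
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Document-wide word frequencies (the "TF-style" part).
    freq = Counter(w for words in tokenized for w in words)

    scored = []
    for index, words in enumerate(tokenized):
        if not words:
            continue
        # Average frequency, so long sentences do not win on length alone.
        base = sum(freq[w] for w in words) / len(words)
        # Assumed position bias: a boost that decays linearly with position.
        boost = 1.0 + position_boost * (1 - index / len(sentences))
        scored.append((base * boost, index))

    # Pick the best sentences, then restore reading order.
    top = sorted(scored, reverse=True)[:num_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda item: item[1])]
```

With this sketch, `summarize(article_text, num_sentences=3)` would play the role of a "short" summary. The regex-based sentence splitter is only a rough approximation and would mis-split abbreviations such as "e.g." in real articles.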

This approach works across all languages and domains without requiring an LLM or external API.

Input

Field            Type     Default   Description
urls             array    required  List of article URLs to summarize
summaryLength    string   "medium"  short (3 sentences), medium (5), long (8)
includeFullText  boolean  false     Include full cleaned article text in output
maxItems         integer  10        Maximum number of articles to process

Example Input

{
    "urls": [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://techcrunch.com/2024/01/01/sample-article/"
    ],
    "summaryLength": "medium",
    "includeFullText": false,
    "maxItems": 10
}
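
To make the defaults concrete, here is a hedged sketch of how an input object like the one above could be validated and normalized. The `RunConfig` class and `parse_input` function are hypothetical names introduced for illustration. Treating `maxItems` as a cap on the URL list is an assumption about the actor's behavior, not something the listing states.

```python
from dataclasses import dataclass

# Sentence counts per summaryLength value, as listed in the input table.
SENTENCES_BY_LENGTH = {"short": 3, "medium": 5, "long": 8}


@dataclass
class RunConfig:
    """Normalized run settings (hypothetical structure for this sketch)."""
    urls: list[str]
    sentence_count: int
    include_full_text: bool


def parse_input(raw: dict) -> RunConfig:
    """Validate raw input and fill in the documented defaults."""
    urls = raw.get("urls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("'urls' is required and must be a non-empty list")

    length = raw.get("summaryLength", "medium")
    if length not in SENTENCES_BY_LENGTH:
        raise ValueError(f"unknown summaryLength: {length!r}")

    max_items = int(raw.get("maxItems", 10))
    if max_items < 1:
        raise ValueError("'maxItems' must be at least 1")

    return RunConfig(
        urls=urls[:max_items],  # assumption: maxItems truncates the URL list
        sentence_count=SENTENCES_BY_LENGTH[length],
        include_full_text=bool(raw.get("includeFullText", False)),
    )
```

For the example input, `parse_input` would yield two URLs, a sentence count of 5, and `include_full_text=False`.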

Output

Each dataset record represents one summarized article:

Field        Type         Description
position     integer      Position in results
url          string       Article URL
domain       string       Website domain (e.g. "techcrunch.com")
title        string       Article title
author       string|null  Article author name
publishDate  string|null  Publish date (ISO format or raw string)
description  string|null  Meta description or excerpt
summary      string       Extractive summary of the article
keyPoints    array        Top 5 key sentences from the article
wordCount    integer      Total word count
readingTime  string       Estimated reading time (e.g. "5 min read")
image        string|null  Hero image URL (og:image)
siteName     string|null  Website name (og:site_name)
language     string|null  Document language code
fullText     string       Full cleaned article text (if includeFullText=true)
scrapedAt    string       ISO 8601 scrape timestamp

Example output

{
    "position": 1,
    "url": "https://en.wikipedia.org/wiki/Machine_learning",
    "domain": "en.wikipedia.org",
    "title": "Machine learning - Wikipedia",
    "author": null,
    "publishDate": null,
    "description": "Machine learning (ML) is a field of study...",
    "summary": "Machine learning is a subset of artificial intelligence...",
    "keyPoints": ["Machine learning models are often vulnerable to...", "..."],
    "wordCount": 9653,
    "readingTime": "48 min read",
    "image": null,
    "siteName": null,
    "language": "en",
    "scrapedAt": "2025-08-01T12:00:00.000Z"
}
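
Some of the fields in the record above are derived values. The sketch below shows one plausible way to compute `domain`, `wordCount`, and `readingTime`. The 200 words-per-minute rate and the rounding rule are assumptions. They happen to reproduce the example (9653 words, "48 min read"), but the listing does not confirm them. The function name `derived_fields` is hypothetical.

```python
from urllib.parse import urlparse

# Assumed reading speed; the actor's actual rate is not documented.
WORDS_PER_MINUTE = 200


def derived_fields(url: str, text: str) -> dict:
    """Compute domain, word count, and reading time for one article."""
    # Whitespace-separated tokens as a simple word count.
    word_count = len(text.split())
    # Round to the nearest minute, but never report less than one minute.
    minutes = max(1, round(word_count / WORDS_PER_MINUTE))
    return {
        "domain": urlparse(url).hostname or "",
        "wordCount": word_count,
        "readingTime": f"{minutes} min read",
    }
```

Whitespace splitting undercounts words in languages written without spaces, so a production implementation would likely need a language-aware tokenizer.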

Dataset Views

  • Articles Overview: table with title, author, date, word count, reading time, URL, and summary
  • Summaries: focused view showing title, summary, key points, URL, and domain

Technical Notes

  • Uses a real Google Chrome browser (via Playwright) to handle JavaScript-rendered pages
  • Injects browser fingerprints for natural browser behavior
  • Article content is extracted using a multi-selector heuristic that prioritizes <article>, [itemprop="articleBody"], and common blog/CMS CSS classes
  • Wikipedia [edit] and footnote [1] markers are automatically removed
  • Reference sections (.reflist, .references) are removed from Wikipedia pages
  • For paywalled articles, only publicly visible content is extracted
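
As an illustration of the Wikipedia cleanup mentioned above, the following sketch strips `[edit]` and numeric footnote markers from already-extracted text. The regular expressions and the function name `strip_wiki_markers` are assumptions for this example. The actor may well do this at the DOM level instead.

```python
import re

# Matches section "[edit]" links, case-insensitively.
EDIT_MARKER = re.compile(r"\[\s*edit\s*\]", re.IGNORECASE)
# Matches numeric footnote references such as [1] or [23].
FOOTNOTE_MARKER = re.compile(r"\[\d+\]")


def strip_wiki_markers(text: str) -> str:
    """Remove [edit] and [n] markers, then tidy leftover spacing."""
    text = EDIT_MARKER.sub("", text)
    text = FOOTNOTE_MARKER.sub("", text)
    # Collapse runs of spaces or tabs left behind by removed markers.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```

Non-numeric markers such as `[citation needed]` or `[a]` are left untouched by these patterns and would need their own rules.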

License

Apache-2.0