Pricing

Pay per event

Website Content Text Extractor

Extract visible text content from websites as structured JSON blocks. Supports multi-URL batch processing, header/footer/cookie exclusion, and optional form extraction. Perfect for content analysis and translation workflows.

Rating: 5.0 (1)

Developer: My Smart Digital

Last modified: 4 months ago

Description

Extract clean, visible text from websites as structured blocks for content migration, translation workflows, and data analysis. The actor extracts only text that is actually visible to users, with options to exclude headers, footers, and cookie banners, and to extract form content.

Features

✅ Multi-URL Batch Processing: Process multiple URLs in a single run
✅ Viewport Presets: Choose between Desktop (1920x1080), Mobile (375x667), Tablet (768x1024), or custom dimensions
✅ Header/Footer/Cookie Exclusion: Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms
✅ Form Content Extraction: Optional extraction of form content (labels, placeholders, values, dropdown options)
✅ DOM Order Preservation: Text blocks extracted in the order they appear on the page
✅ Code Filtering: Automatically filters out JavaScript, CSS, and code snippets
✅ Deduplication: Removes duplicate text blocks
✅ Configurable Selectors: Customize which elements to include/exclude
✅ Clean JSON Output: Structured output perfect for content analysis, translation, and data migration

Input

{
  "startUrls": [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "viewportType": "desktop",
  "viewportWidth": 1920,
  "viewportHeight": 1080,
  "excludeHeader": false,
  "excludeFooter": false,
  "excludeCookies": false,
  "excludeSelectors": [],
  "includeSelectors": [],
  "minTextLength": 3,
  "deduplicate": true,
  "waitForSelector": "",
  "waitTimeout": 30000,
  "removeEmptyBlocks": true,
  "extractForms": false
}
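
A minimal sketch of how an input like the one above could be assembled in Python before starting a run. The helper `build_run_input` is hypothetical and not part of the actor; the preset dimensions are the ones listed under Features, and cleaning the URL list up front is optional.

```python
from typing import Iterable, Optional, Tuple

# Preset dimensions as listed under Features. The actor resolves presets
# itself; this mapping only mirrors them for illustration.
VIEWPORT_PRESETS = {
    "desktop": (1920, 1080),
    "mobile": (375, 667),
    "tablet": (768, 1024),
}


def build_run_input(
    urls: Iterable[str],
    viewport_type: str = "desktop",
    custom_size: Optional[Tuple[int, int]] = None,
    **options,
) -> dict:
    """Hypothetical helper: build an input dict shaped like the example above."""
    # Drop empty lines and duplicates while keeping the original order.
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        if url and url not in seen:
            seen.add(url)
            start_urls.append(url)

    if viewport_type == "custom":
        if custom_size is None:
            raise ValueError("custom viewport needs custom_size=(width, height)")
        width, height = custom_size
    else:
        width, height = VIEWPORT_PRESETS[viewport_type]

    run_input = {
        "startUrls": start_urls,
        "viewportType": viewport_type,
        "viewportWidth": width,
        "viewportHeight": height,
    }
    # Any other documented option (excludeHeader, minTextLength, ...) passes through.
    run_input.update(options)
    return run_input


mobile_input = build_run_input(
    ["https://example.com", "", "https://example.com/about", "https://example.com"],
    viewport_type="mobile",
    excludeCookies=True,
)
```

Options that are not passed keep the actor's documented defaults, so only the fields that differ need to be set.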

Parameters

startUrl (optional): Single URL to extract text from (useful for quick tests). Leave empty if you only use startUrls.
startUrls (optional): List of additional URLs to process in bulk in a single run. Duplicates and empty lines are ignored. Use this field to process multiple pages in one execution
viewportType (optional, default: "desktop"): Choose a predefined viewport size (desktop, mobile, tablet) or use custom dimensions
viewportWidth (optional, default: 1920): Custom viewport width in pixels (only used when viewportType is "custom")
viewportHeight (optional, default: 1080): Custom viewport height in pixels (only used when viewportType is "custom")
excludeHeader (optional, default: false): Exclude header and navigation elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
excludeFooter (optional, default: false): Exclude footer elements. Compatible with WordPress, Shopify, Webflow, Drupal, Joomla and most CMS platforms. If this doesn't work for your site, use Exclude Selectors manually
excludeCookies (optional, default: false): Exclude cookie consent banners and GDPR notices. Compatible with Cookiebot, OneTrust, Iubenda and most cookie consent platforms
extractForms (optional, default: false): Extract form content. When disabled, form elements (form, input, textarea, select, label) are excluded from extraction. When enabled, labels, placeholders, input values, and dropdown options are extracted, with each form field and dropdown option returned as a separate block
excludeSelectors (optional, default: []): Additional CSS selectors for custom elements to exclude. Use this if Exclude Header/Footer/Cookies options don't work for your site structure
includeSelectors (optional): CSS selectors to specifically include. If empty, all visible text is extracted
minTextLength (optional, default: 3): Minimum character length for text blocks
deduplicate (optional, default: true): Remove duplicate text blocks
waitForSelector (optional): CSS selector to wait for before extraction
waitTimeout (optional, default: 30000): Timeout in milliseconds for waiting for selector
removeEmptyBlocks (optional, default: true): Remove empty or whitespace-only text blocks

Output

The actor returns a JSON object with the following structure:

{
  "url": "https://example.com",
  "title": "Page Title",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "textBlocks": [
    {
      "id": "block-1",
      "text": "First visible text block",
      "order": 1,
      "tagName": "h1",
      "selector": null
    },
    {
      "id": "block-2",
      "text": "Second text block",
      "order": 2,
      "tagName": "p",
      "selector": null
    }
  ],
  "statistics": {
    "totalBlocks": 25,
    "totalCharacters": 5432,
    "uniqueBlocks": 23,
    "excludedElements": 8
  }
}

Output Fields

url: The URL that was processed
title: Page title
viewport: Viewport dimensions used
textBlocks: Array of extracted text blocks, each with:
- id: Unique identifier (block-1, block-2, etc.)
- text: The extracted text content
- order: Order of appearance (1, 2, 3, etc.)
- tagName: HTML tag name (h1, p, li, etc.)
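- selector: CSS selector the block was extracted from, if a specific selector was used; otherwise null
statistics: Summary statistics (totalBlocks, totalCharacters, uniqueBlocks, excludedElements)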
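
As an illustration of how these fields can be consumed, the sketch below joins the text blocks of one result back into plain text, optionally keeping only certain tags. The function `blocks_to_text` is a hypothetical helper, and `sample` is a shortened copy of the example output, not real run data.

```python
def blocks_to_text(result: dict, tags=None) -> str:
    """Join textBlocks in their reported order, optionally filtered by tagName."""
    blocks = sorted(result.get("textBlocks", []), key=lambda b: b["order"])
    if tags is not None:
        wanted = {t.lower() for t in tags}
        blocks = [b for b in blocks if b["tagName"].lower() in wanted]
    return "\n".join(b["text"] for b in blocks)


# Shortened copy of the example output shown in the Output section.
sample = {
    "url": "https://example.com",
    "title": "Page Title",
    "textBlocks": [
        {"id": "block-2", "text": "Second text block", "order": 2,
         "tagName": "p", "selector": None},
        {"id": "block-1", "text": "First visible text block", "order": 1,
         "tagName": "h1", "selector": None},
    ],
}

full_text = blocks_to_text(sample)                   # all blocks, sorted by order
headings_only = blocks_to_text(sample, tags=["h1"])  # only <h1> blocks
```

Sorting by `order` rather than trusting list position keeps the result stable if the blocks were reshuffled during intermediate processing.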

Use Cases

Content Translation: Extract clean text blocks for translation workflows
Content Analysis: Analyze visible content without navigation/header/footer noise
SEO Content Extraction: Get only the main content for SEO analysis
Content Migration: Extract content for migration to new platforms
Form Data Extraction: Extract form labels, placeholders, and dropdown options for documentation or analysis

Header and Footer Exclusion

By default, headers and footers are not excluded, so their visible text is extracted along with the rest of the page. You can optionally exclude headers and/or footers using the excludeHeader and excludeFooter options.

When enabled, these options apply a broad set of CSS selectors that cover:

WordPress: .site-header, .elementor-location-header, .wp-block-navigation, etc.
Shopify: .shopify-section-header, .shopify-section-footer, etc.
Webflow: [class*='header'], [id*='header'], etc.
Drupal: .region-header, .region-footer, etc.
Joomla: .header, .footer, .moduletable-menu, etc.
Generic: header, footer, nav, [role='banner'], [role='contentinfo'], etc.

If these options don't work for your specific site structure, you can use the excludeSelectors parameter to manually specify CSS selectors.
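
A minimal sketch of that manual fallback, assuming a site whose header and legal strip use custom class names that the built-in options miss. The selectors `.top-bar` and `#legal-strip` and the helper `with_manual_exclusions` are hypothetical.

```python
import json


def with_manual_exclusions(run_input: dict, selectors) -> dict:
    """Return a copy of run_input with extra CSS selectors in excludeSelectors."""
    merged = list(run_input.get("excludeSelectors", []))
    for selector in selectors:
        selector = selector.strip()
        # Skip blanks and selectors that are already present.
        if selector and selector not in merged:
            merged.append(selector)
    return {**run_input, "excludeSelectors": merged}


base_input = {
    "startUrls": ["https://example.com"],
    "excludeHeader": True,
    "excludeFooter": True,
    "excludeSelectors": [],
}

# Hypothetical site-specific selectors; replace with the ones from your markup.
custom_input = with_manual_exclusions(base_input, [".top-bar", "#legal-strip", ".top-bar"])
print(json.dumps(custom_input, indent=2))
```

The built-in options can stay enabled alongside the manual selectors, since excludeSelectors is described as an addition to them.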

Technical Details

Uses Playwright for rendering with configurable viewport sizes
Waits for networkidle to ensure all content is loaded
Automatically scrolls pages to load lazy-loaded content
Checks element visibility using getBoundingClientRect() and computed styles
Filters out elements with display: none, visibility: hidden, opacity: 0
Filters out JavaScript, CSS, and code snippets automatically
Only includes elements within the viewport bounds
Deduplicates text blocks by normalized (lowercase) content
Maintains DOM order for accurate content structure
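
The visibility and deduplication rules above can be sketched in Python on plain data. This is not the actor's code, which runs inside the browser; `is_visible` and `dedupe_blocks` are hypothetical names, and two details are assumptions: an element counts as inside the viewport if it overlaps it at all, and whitespace is collapsed before comparing text.

```python
def is_visible(style: dict, rect: dict, viewport: dict) -> bool:
    """style: computed CSS values; rect: a getBoundingClientRect()-like dict."""
    if style.get("display") == "none":
        return False
    if style.get("visibility") == "hidden":
        return False
    if float(style.get("opacity", 1)) == 0:
        return False
    if rect["width"] <= 0 or rect["height"] <= 0:
        return False
    # Assumption: any overlap with the viewport rectangle counts as "within bounds".
    return (
        rect["x"] < viewport["width"]
        and rect["x"] + rect["width"] > 0
        and rect["y"] < viewport["height"]
        and rect["y"] + rect["height"] > 0
    )


def dedupe_blocks(texts, min_length: int = 3) -> list:
    """Drop short and duplicate blocks, keeping the first occurrence in order."""
    seen = set()
    kept = []
    for text in texts:
        cleaned = " ".join(text.split())  # assumption: collapse whitespace
        if len(cleaned) < min_length:
            continue  # covers empty blocks and blocks below minTextLength
        key = cleaned.lower()  # normalized (lowercase) comparison key
        if key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


viewport = {"width": 1920, "height": 1080}
on_screen = is_visible({"display": "block"},
                       {"x": 0, "y": 100, "width": 300, "height": 40}, viewport)
unique = dedupe_blocks(["Hello World", "hello   world", "", "ok", "Next block"])
```

Keeping the first occurrence of each normalized text is what preserves DOM order after deduplication.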
