SEO Data Extractor

Pricing

from $2.00 / 1,000 results

Extract comprehensive SEO metadata, headings, links, images, Open Graph tags, Twitter Cards, and technical data from websites. Well suited to SEO audits, competitor analysis, and content optimization. Runs on the Apify platform and returns structured JSON output.

Developer: No-Code Venture (Maintained by Community)

A comprehensive SEO data extraction tool that runs on the Apify platform.

Features

Extract comprehensive SEO data from any webpage, including:

  • Meta Information: Title, description, keywords, robots directives, canonical URLs, author, and generator tags with length counts
  • Heading Structure: All H1-H6 tags with text content and counts for each level
  • Content Analysis: Word count, link analysis (total/internal/external), and image audit (total/without alt text)
  • Open Graph Tags: Complete Open Graph metadata (title, description, image, URL, type, site name)
  • Twitter Cards: Twitter Card metadata for social sharing
  • Technical SEO: Status codes, response time, charset, language, viewport settings
  • Structured Data: JSON-LD detection and schema type identification
  • Branding Assets: Favicon, Apple touch icon, and theme color detection
  • Sitemap Extraction: Optionally fetch and include all URLs from each domain's sitemap.xml
  • Error Handling: Graceful handling of HTTP errors (404, 500, etc.) with proper error codes and messages

Use Cases

  • SEO Monitoring: Track SEO data for your websites or competitors over time
  • Content Analysis: Analyze meta tags to optimize webpage content for search engines
  • SEO Audits: Collect data for comprehensive SEO audits across multiple pages
  • Competitor Analysis: Compare competitors' meta tags, heading structure, and link counts against your own pages
  • Bulk Data Extraction: Process 1 to 100,000+ pages efficiently

Input Configuration

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| startUrls | Array | List of URLs to extract SEO data from | https://nocodeventure.com |
| extractSitemapUrls | Boolean | Fetch and include sitemap data for each domain | false |
| sitemapUrl | String | Custom sitemap path (e.g., sitemap_index.xml or /sitemaps/main.xml) | /sitemap.xml |
| maxRequestsPerCrawl | Integer | Maximum pages to scrape (0 = unlimited) | 100 |
| requestTimeout | Integer | Request timeout in seconds (3-10) | 5 |
| maxConcurrency | Integer | Parallel requests (1-50) | 10 |
| maxRequestRetries | Integer | Max retries for failed requests (0-5) | 1 |
| proxyConfiguration | Object | Proxy settings for anti-blocking | Apify Proxy disabled |
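
To make the ranges in the table concrete, here is a minimal Python sketch that builds an input object with the documented defaults and rejects out-of-range values before a run is started. The helper name build_input is hypothetical, and wrapping each start URL as {"url": ...} is an assumption about the expected array format, not something the table states. proxyConfiguration is left out because its object shape is not specified here.

```python
from typing import Any

# Ranges copied from the Input Configuration table.
LIMITS = {
    "requestTimeout": (3, 10),
    "maxConcurrency": (1, 50),
    "maxRequestRetries": (0, 5),
}


def build_input(urls: list[str], **overrides: Any) -> dict[str, Any]:
    """Return an input dict with the documented defaults, validating ranged fields."""
    run_input: dict[str, Any] = {
        # Assumption: each start URL is wrapped as {"url": ...}.
        "startUrls": [{"url": u} for u in urls],
        "extractSitemapUrls": False,
        "sitemapUrl": "/sitemap.xml",
        "maxRequestsPerCrawl": 100,
        "requestTimeout": 5,
        "maxConcurrency": 10,
        "maxRequestRetries": 1,
    }
    unknown = set(overrides) - set(run_input)
    if unknown:
        raise KeyError(f"unknown input fields: {sorted(unknown)}")
    run_input.update(overrides)
    # Reject values outside the documented ranges.
    for field, (low, high) in LIMITS.items():
        value = run_input[field]
        if not low <= value <= high:
            raise ValueError(f"{field}={value} is outside {low}-{high}")
    return run_input
```

For example, build_input(["https://example.com"], maxConcurrency=20) returns a complete input dict, while requestTimeout=30 raises a ValueError.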

Output Schema

The Actor returns structured JSON data with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| url | String | The URL that was scraped |
| scrapedAt | String | ISO 8601 timestamp of when the page was scraped |
| error | String (optional) | Error code if scraping failed (e.g., "404", "500", "REQUEST_FAILED") |
| errorMessage | String (optional) | Human-readable error message |

Meta Information (meta)

| Field | Type | Description |
| --- | --- | --- |
| title | String | Page title from <title> tag |
| titleLength | Number | Character count of the title |
| description | String | Meta description content |
| descriptionLength | Number | Character count of the description |
| keywords | String | Meta keywords content |
| robots | String | Robots meta directive (e.g., "index, follow") |
| canonical | String | Canonical URL declared by the page (typically a <link rel="canonical"> tag) |
| author | String | Author meta tag content |
| generator | String | Generator meta tag content |

Headings (headings)

| Field | Type | Description |
| --- | --- | --- |
| h1.text | String | Combined text content of all H1 tags |
| h1.count | Number | Number of H1 tags found |
| h2.text | String | Combined text content of all H2 tags |
| h2.count | Number | Number of H2 tags found |
| h3.text | String | Combined text content of all H3 tags |
| h3.count | Number | Number of H3 tags found |
| h4.text | String | Combined text content of all H4 tags |
| h4.count | Number | Number of H4 tags found |
| h5.text | String | Combined text content of all H5 tags |
| h5.count | Number | Number of H5 tags found |
| h6.text | String | Combined text content of all H6 tags |
| h6.count | Number | Number of H6 tags found |

Open Graph Tags (openGraph)

| Field | Type | Description |
| --- | --- | --- |
| title | String | Open Graph title |
| description | String | Open Graph description |
| image | String | Open Graph image URL |
| url | String | Open Graph URL |
| type | String | Open Graph type (e.g., "website", "article") |
| siteName | String | Open Graph site name |

Twitter Cards (twitterCard)

| Field | Type | Description |
| --- | --- | --- |
| card | String | Twitter card type (e.g., "summary", "summary_large_image") |
| title | String | Twitter card title |
| description | String | Twitter card description |
| image | String | Twitter card image URL |
| site | String | Twitter site handle |

Content Analysis (content)

| Field | Type | Description |
| --- | --- | --- |
| wordCount | Number | Total word count in page body |
| links.total | Number | Total number of links found |
| links.internal | Number | Number of internal links (same domain) |
| links.external | Number | Number of external links (different domain) |
| images.total | Number | Total number of images found |
| images.withoutAlt | Number | Number of images missing alt text |
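
As an illustration of how the meta, headings, and content fields might feed an audit, the following Python sketch flags a few common on-page issues for one dataset item. It assumes dotted field names such as h1.count are nested objects (headings.h1.count), and the length thresholds are common rules of thumb chosen for the example, not values defined by the Actor. The function name audit_item is hypothetical.

```python
from typing import Any

# Assumed thresholds: rules of thumb for this example, not Actor settings.
MAX_TITLE_LENGTH = 60
MAX_DESCRIPTION_LENGTH = 160


def audit_item(item: dict[str, Any]) -> list[str]:
    """Return a list of human-readable issues for one dataset item."""
    issues: list[str] = []
    meta = item.get("meta", {})
    headings = item.get("headings", {})
    content = item.get("content", {})

    title_length = meta.get("titleLength", 0)
    if title_length == 0:
        issues.append("missing title")
    elif title_length > MAX_TITLE_LENGTH:
        issues.append("title too long")

    description_length = meta.get("descriptionLength", 0)
    if description_length == 0:
        issues.append("missing description")
    elif description_length > MAX_DESCRIPTION_LENGTH:
        issues.append("description too long")

    # Assumption: exactly one H1 per page is the target.
    h1_count = headings.get("h1", {}).get("count", 0)
    if h1_count != 1:
        issues.append(f"expected one H1, found {h1_count}")

    missing_alt = content.get("images", {}).get("withoutAlt", 0)
    if missing_alt:
        issues.append(f"{missing_alt} image(s) without alt text")
    return issues
```

An item with no issues yields an empty list, so the function can double as a filter when scanning a large dataset.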

Technical SEO (technical)

| Field | Type | Description |
| --- | --- | --- |
| statusCode | Number | HTTP response status code |
| responseTime | Number | Response time in milliseconds |
| charset | String | Character encoding (e.g., "UTF-8") |
| language | String | Page language from HTML lang attribute |
| viewport | String | Viewport meta tag content |
| structuredData.hasStructuredData | Boolean | Whether JSON-LD structured data was found |
| structuredData.types | Array | Array of structured data schema types found |

Branding Assets (branding)

| Field | Type | Description |
| --- | --- | --- |
| favicon | String | Favicon URL |
| appleTouchIcon | String | Apple touch icon URL |
| themeColor | String | Theme color meta tag content |

Sitemap Data (sitemap) - Optional

Note: This field is only included when extractSitemapUrls is enabled. If the page scrape fails (HTTP error or request failure), the sitemap object will not be included in the output.

| Field | Type | Description |
| --- | --- | --- |
| found | Boolean | Whether a sitemap was found and parsed |
| sitemapUrl | String | The sitemap URL that was fetched |
| isKnownPath | Boolean | Whether a known/custom sitemap path was used (see below) |
| urlCount | Number | Total number of URLs found in the sitemap |
| urls | Array | List of all URLs from the sitemap |
| error | String (optional) | Error message if sitemap fetch failed |

Example output with sitemap enabled:

{
  "url": "https://example.com",
  "meta": { ... },
  "sitemap": {
    "found": true,
    "sitemapUrl": "https://example.com/sitemap.xml",
    "isKnownPath": false,
    "urlCount": 156,
    "urls": [
      "https://example.com/",
      "https://example.com/about",
      "https://example.com/contact",
      ...
    ]
  },
  "scrapedAt": "2025-12-12T10:00:00.000Z"
}

Sitemap caching: If you have multiple URLs from the same domain, the sitemap is only fetched once and reused for all pages from that domain.
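
The caching behavior described above can be pictured with a small Python sketch. This is only an illustration of per-domain caching, not the Actor's actual implementation: the class name SitemapCache and the injected fetch function are hypothetical, and the cache key (lowercased host) is an assumption.

```python
from typing import Callable
from urllib.parse import urlsplit


class SitemapCache:
    """Fetch each domain's sitemap at most once and reuse the result."""

    def __init__(self, fetch: Callable[[str], list[str]]) -> None:
        # Hypothetical fetcher: takes a sitemap URL, returns the page URLs it lists.
        self._fetch = fetch
        self._by_host: dict[str, list[str]] = {}

    def urls_for(self, page_url: str, sitemap_path: str = "/sitemap.xml") -> list[str]:
        parts = urlsplit(page_url)
        host = parts.netloc.lower()
        if host not in self._by_host:
            # First page seen for this host: fetch and remember the sitemap.
            sitemap_url = f"{parts.scheme}://{host}{sitemap_path}"
            self._by_host[host] = self._fetch(sitemap_url)
        return self._by_host[host]
```

With this structure, two pages from the same host trigger a single fetch, while a page from a different host triggers its own.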

Known Sitemap Paths

Some websites don't use the standard /sitemap.xml location. The Actor has built-in sitemap locations for these sites and reports isKnownPath: true in the output when one is used.

| Domain | Sitemap Location |
| --- | --- |
| amazon.com, www.amazon.com, aws.amazon.com | https://aws.amazon.com/ar/sitemaps/index/ |

When a known path is used, you'll see it in the logs:

Using known sitemap path for www.amazon.com: https://aws.amazon.com/ar/sitemaps/index/

To add support for more domains, edit src/utils/sitemap-paths.ts.
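
The lookup can be thought of as a host-to-URL map with a fallback to the default path. The Python sketch below mirrors that idea using only the mapping listed in this section. It is not the contents of src/utils/sitemap-paths.ts, the function name resolve_sitemap is hypothetical, and it ignores the custom sitemapUrl input for simplicity.

```python
from urllib.parse import urlsplit

# The mapping listed in this section; additional domains would be added here.
KNOWN_SITEMAP_PATHS = {
    "amazon.com": "https://aws.amazon.com/ar/sitemaps/index/",
    "www.amazon.com": "https://aws.amazon.com/ar/sitemaps/index/",
    "aws.amazon.com": "https://aws.amazon.com/ar/sitemaps/index/",
}


def resolve_sitemap(page_url: str, default_path: str = "/sitemap.xml") -> tuple[str, bool]:
    """Return (sitemap_url, is_known_path) for the domain of page_url."""
    parts = urlsplit(page_url)
    host = parts.netloc.lower()
    if host in KNOWN_SITEMAP_PATHS:
        return KNOWN_SITEMAP_PATHS[host], True
    # Unknown host: fall back to the default path on the same origin.
    return f"{parts.scheme}://{host}{default_path}", False
```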

Error Output Example

When a URL returns an HTTP error (like 404), the Actor returns an error item instead of failing:

{
  "url": "https://example.com/broken-link",
  "meta": {
    "title": "",
    "titleLength": 0,
    "description": "",
    "descriptionLength": 0,
    "keywords": "",
    "robots": "",
    "canonical": "",
    "author": "",
    "generator": ""
  },
  "technical": {
    "statusCode": 404,
    "responseTime": 150
  },
  "error": "404",
  "errorMessage": "Page not found",
  "scrapedAt": "2025-12-11T20:23:04.317Z"
}

This allows you to:

  • Continue processing other URLs without failing the entire run
  • Identify broken links and problematic URLs in your dataset
  • Filter error results using the dedicated "Errors" view
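
If you work with a downloaded dataset rather than the "Errors" view, the same separation can be done locally. The Python sketch below assumes that error items are exactly those carrying the optional error field, as in the example above. The function names are hypothetical.

```python
from collections import Counter
from typing import Any

Item = dict[str, Any]


def split_results(items: list[Item]) -> tuple[list[Item], list[Item]]:
    """Separate successfully scraped items from error items."""
    ok = [item for item in items if "error" not in item]
    failed = [item for item in items if "error" in item]
    return ok, failed


def error_summary(failed: list[Item]) -> Counter:
    """Count error items per error code, e.g. how many 404s were found."""
    return Counter(item["error"] for item in failed)
```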

Output Views

The Actor provides multiple dataset views for different analysis needs:

  • SEO Overview: Quick summary with URL, error status, title, description, canonical, robots, H1 count, and links
  • Errors: Dedicated view for URLs that returned HTTP errors (404, 500, etc.) with error codes and messages
  • Heading Structure: H1-H6 tags with text content and counts for each level
  • Open Graph: Complete Open Graph metadata for social sharing
  • Twitter Cards: Twitter Card metadata for social sharing
  • Content Analysis: Word count, link breakdown (internal/external), and image audit data
  • Technical SEO: HTTP status, response time, charset, language, viewport, and structured data
  • Branding Assets: Favicon, Apple touch icon, and theme color information
  • Sitemap Data: URLs found in each domain's sitemap (when sitemap extraction is enabled)

How to Export

  1. Access Results: After running, view collected data in Apify's interface
  2. Select Export Option: Download as CSV, JSON, Excel, or XML
  3. Open in Tools: Import into Excel, Google Sheets, or your analysis tool
  4. API Access: Use the Apify API to integrate with your workflows
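
If you download the results as JSON and want a spreadsheet-friendly file of your own, nested objects need to be flattened first. The following Python sketch shows one way to do that with the standard library. It is independent of Apify's built-in CSV export, and the dotted column names (such as meta.title) and the "; " separator for arrays are choices made for this example.

```python
import csv
import io
from typing import Any


def flatten(item: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested objects into dotted column names, e.g. meta.title."""
    flat: dict[str, Any] = {}
    for key, value in item.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Arrays (such as sitemap URLs) are joined into one cell.
            flat[name] = "; ".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def to_csv(items: list[dict[str, Any]]) -> str:
    """Render dataset items as CSV text with one column per flattened field."""
    rows = [flatten(item) for item in items]
    # Union of all columns in first-seen order, so error items with fewer fields still fit.
    columns = list(dict.fromkeys(col for row in rows for col in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```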

Pricing Model

This Actor uses Pay-Per-Event (PPE) pricing with automatic charging via Apify's synthetic events:

  • Actor Start: Charged automatically when the Actor starts
  • Dataset Item: Charged automatically for each result pushed to the dataset

Error Handling & Billing

URLs that return HTTP errors (404, 500, etc.) are still charged because:

  • The Actor had to make a request to discover the error
  • Error items are returned with proper error codes and messages
  • This allows you to identify broken links without failing the entire run

You can set a maximum spending limit in the Apify Console to control costs.
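
For rough budgeting, the arithmetic can be sketched in Python as below. The default rate uses the listed starting price of $2.00 per 1,000 results. The Actor Start charge is not stated on this page, so start_fee defaults to 0.0 and is purely a placeholder. Both function names are hypothetical.

```python
import math


def estimate_cost(result_count: int, price_per_1000: float = 2.00, start_fee: float = 0.0) -> float:
    """Estimate the run cost in USD for a number of dataset items."""
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    # Every dataset item is charged, including error items and duplicates.
    return round(start_fee + result_count * price_per_1000 / 1000, 2)


def max_results_for_budget(budget: float, price_per_1000: float = 2.00, start_fee: float = 0.0) -> int:
    """Return how many results fit within a spending limit."""
    remaining = max(0.0, budget - start_fee)
    return math.floor(remaining * 1000 / price_per_1000)
```

Under these assumptions, 2,500 results cost about $5.00, and a $5.00 limit covers about 2,500 results.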

Limitations

⚠️ JavaScript-Heavy Sites: This tool primarily extracts data from static HTML. It may not capture content that loads dynamically via JavaScript, potentially resulting in incomplete data extraction.

FAQ

Are duplicate URLs processed multiple times?

Yes. The Actor processes every URL in your input list, including duplicates. If you submit the same URL multiple times, it will be processed and charged each time.

Tip: Remove duplicates from your input list before running to save costs:

https://example.com/page1 ← processed, charged
https://example.com/page1 ← processed again, charged again
https://example.com/page2 ← processed, charged
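
A short Python sketch of that deduplication step, keeping the first occurrence of each URL in its original order. It only removes exact string duplicates after trimming whitespace, so variants such as a trailing slash are treated as different URLs. The function name is hypothetical.

```python
def dedupe_urls(urls: list[str]) -> list[str]:
    """Drop exact duplicates and blank entries, keeping first-seen order."""
    cleaned = (u.strip() for u in urls)
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(u for u in cleaned if u))
```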

Am I charged for failed requests?

Yes. URLs that return HTTP errors (404, 500, etc.) or fail after retries are still charged because the Actor had to make a request to discover the error. However, you receive an error item in your dataset with the error code and message, so you know exactly what happened.

How can I control costs?

  • Set a maximum spending limit in the Apify Console before running
  • Use the maxRequestsPerCrawl input to limit the number of pages processed
  • Remove duplicate URLs from your input list before running
  • Set maxRequestRetries to 0 if you don't want failed requests to be retried

This tool is provided for educational and research purposes only. By using this SEO Data Extractor, you agree to:

  • Comply with all applicable laws: You are solely responsible for ensuring your use of this tool complies with local, national, and international laws, including copyright laws, data protection regulations (such as GDPR, CCPA), and terms of service of target websites.

  • Respect website terms of service: Many websites prohibit automated scraping in their terms of service. You must review and comply with each website's terms before using this tool.

  • Respect robots.txt: This tool does not automatically check or respect robots.txt files. You are responsible for checking and honoring robots.txt directives.

  • Rate limiting and ethical use: Use reasonable request rates and respect website operators. Excessive requests may constitute a denial-of-service attack.

  • Data privacy compliance: Ensure your data collection and processing activities comply with privacy laws. Do not collect personal data without proper consent and legal basis.

  • No warranties: This tool is provided "as is" without warranties of any kind. The authors are not responsible for any damages or legal consequences arising from its use.

  • Use at your own risk: You assume all risks associated with using this tool. The authors disclaim all liability for any direct, indirect, incidental, or consequential damages.
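
Because the tool does not check robots.txt itself, one option is to filter your URL list beforehand. The Python sketch below uses the standard library's urllib.robotparser on robots.txt text you have already obtained. The function name allowed_urls and the default user agent "*" are assumptions for the example, and passing this check does not by itself establish that scraping a site is permitted.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt text allows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```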

Before using this tool, consult with legal counsel to ensure compliance with applicable laws and regulations.