AI-Enhanced Website Metadata

Extracts complete website metadata, including SEO tags, OpenGraph data, social media links, and contact information, and performs link analysis. Features AI-powered content summarization with multilingual support and structured data extraction. Perfect for gathering deep insights from any URL.

URL Summary Scraper with AI

A powerful Apify actor that extracts essential website information with optional AI-powered summaries and key facts extraction. Supports LLM analysis in 30+ languages.

Features

Core Scraping

  • Comprehensive metadata extraction - SEO, OpenGraph, Twitter Card data
  • Social media links - Facebook, X (Twitter), LinkedIn, Instagram, YouTube, TikTok, Pinterest, Trustpilot, GitHub, Discord, Telegram, WhatsApp, Medium, Reddit, Threads, Mastodon, Twitch, Vimeo, Spotify, Snapchat
  • Contact information - Email, phone numbers, addresses
  • Link analysis - Internal/external links with domain categorization
  • Media assets - Favicons, logos, featured images
  • Structured data - JSON-LD extraction
  • Robots.txt compliance - Respects crawling rules (can be bypassed)
  • Batch processing - Process single URL or multiple URLs in one run

AI-Powered Analysis (Optional)

  • Intelligent summaries - Short (50 words), Medium (150 words), Long (300 words)
  • Semantic keywords - AI-extracted keywords from content (works for any page type)
  • Multilingual support - 30+ languages, including English, Italian, Spanish, French, German, and Portuguese (full list below)
  • Key facts extraction - Company name, industry, services, target audience, business model
  • Graceful degradation - Returns metadata even if AI analysis fails

Input Parameters

Parameter | Type | Required | Default | Description
url | array | Yes | - | Array of URLs to scrape (use a single-element array for one URL)
language | string | No | en, en-US;q=0.9, en-GB;q=0.8 | Accept-Language header
ignoreRobots | boolean | No | false | Bypass robots.txt rules
ignoreExternalLinks | boolean | No | false | Skip external links extraction
ignoreInteralLinks | boolean | No | false | Skip internal links extraction
generateSummary | boolean | No | false | Enable AI-powered summaries (opt-in)
summaryLength | string | No | - | Summary length: short, medium, or long. Leave empty for all three.
summaryLanguage | string | No | auto-detect | Target language code (e.g., en, it, es)
extractKeyFacts | boolean | No | false | Extract structured business information
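
For orientation, the sketch below combines every documented parameter into one run input, written as a Python dict (the same shape you would paste into the Apify Console as JSON). The values are illustrative assumptions, not recommendations; only url is required.

# A hypothetical run input that exercises every documented parameter.
# Only "url" is required; all other fields fall back to the defaults listed above.
run_input = {
    "url": ["https://example.com", "https://example.org"],  # batch of URLs
    "language": "en, en-US;q=0.9, en-GB;q=0.8",              # Accept-Language header
    "ignoreRobots": False,          # keep robots.txt compliance on
    "ignoreExternalLinks": False,   # still extract external links
    "ignoreInteralLinks": False,    # still extract internal links (name as documented)
    "generateSummary": True,        # opt in to AI summaries
    "summaryLength": "short",       # "short", "medium", or "long"; omit for all three
    "summaryLanguage": "it",        # target language; omit to auto-detect
    "extractKeyFacts": True,        # structured business information
}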

Usage Examples

Single URL - Basic Scraping

{
  "url": ["https://apify.com"]
}

Multiple URLs - Batch Processing

{
  "url": [
    "https://example.com",
    "https://example.org",
    "https://example.net"
  ]
}

AI-Powered Analysis

{
  "url": ["https://apify.com"],
  "generateSummary": true,
  "extractKeyFacts": true
}

Multilingual Summary

{
  "url": ["https://example.it"],
  "generateSummary": true,
  "summaryLanguage": "it"
}
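
The same inputs can also be submitted programmatically. Below is a minimal sketch using the official apify-client package for Python; it assumes the actor ID njoylab/ai-enhanced-website-metadata, which is inferred from this page and should be verified against the Store listing before use.

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start a run with the "AI-Powered Analysis" input from above.
# The actor ID below is an assumption; confirm it in the Apify Store.
run = client.actor("njoylab/ai-enhanced-website-metadata").call(
    run_input={
        "url": ["https://apify.com"],
        "generateSummary": True,
        "extractKeyFacts": True,
    }
)

# Each processed URL becomes one record in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["seo"]["title"])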

Output Schema

The actor returns a hierarchical JSON structure for each URL:

{
  "url": "string",
  "seo": {
    "title": "string",
    "description": "string",
    "keywords": ["string"],
    "canonical": "string",
    "robots": "string",
    "language": "string",
    "viewport": "string"
  },
  "openGraph": {
    "title": "string",
    "description": "string",
    "image": "string",
    "url": "string",
    "type": "string",
    "siteName": "string"
  },
  "twitterCard": {
    "card": "string",
    "site": "string",
    "creator": "string",
    "title": "string",
    "description": "string",
    "image": "string"
  },
  "social": {
    "facebook": "string",
    "x": "string",
    "linkedin": "string",
    "instagram": "string",
    "youtube": "string",
    "tiktok": "string",
    "pinterest": "string",
    "trustpilot": "string",
    "github": "string",
    "discord": "string",
    "telegram": "string",
    "whatsapp": "string",
    "medium": "string",
    "reddit": "string",
    "threads": "string",
    "mastodon": "string",
    "twitch": "string",
    "vimeo": "string",
    "spotify": "string",
    "snapchat": "string"
  },
  "contact": {
    "email": "string",
    "phone": "string",
    "address": "string"
  },
  "technical": {
    "statusCode": 200,
    "finalUrl": "string",
    "originalUrl": "string",
    "robotsAllowed": true,
    "loadTime": 1234,
    "isSecure": true,
    "contentType": "text/html"
  },
  "media": {
    "favicon": "string",
    "appleTouchIcon": "string",
    "featuredImage": "string",
    "logo": "string",
    "screenshots": ["string"]
  },
  "links": {
    "internal": {
      "total": 42,
      "urls": ["string"]
    },
    "external": {
      "total": 15,
      "urls": ["string"],
      "domains": ["string"]
    },
    "mailto": ["string"],
    "tel": ["string"]
  },
  "structuredData": [{}],
  "ai": {
    "summary": {
      "short": "string",
      "medium": "string",
      "long": "string",
      "contentLength": 5000,
      "truncated": false
    },
    "keywords": ["string"],
    "keyFacts": {
      "companyName": "string",
      "companyType": "B2B SaaS",
      "industry": "Technology",
      "services": ["string"],
      "targetAudience": "string",
      "headquarters": "San Francisco, USA",
      "foundedYear": 2020,
      "keyFeatures": ["string"],
      "businessModel": "Subscription"
    },
    "processingTime": 2340,
    "error": "string"
  }
}

Note: When processing multiple URLs, one record per URL will be added to the dataset.
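
Several nested objects are only populated when the corresponding data exists on the page or when the AI options are enabled, so downstream code should read them defensively. A small sketch in Python, assuming items is a list of dataset records shaped like the schema above:

# "items" is assumed to hold dataset records matching the schema above.
for item in items:
    title = item.get("seo", {}).get("title")
    og_image = item.get("openGraph", {}).get("image")
    internal_total = item.get("links", {}).get("internal", {}).get("total", 0)
    short_summary = item.get("ai", {}).get("summary", {}).get("short")  # only with generateSummary
    print(title, og_image, internal_total, short_summary)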

Supported Languages for AI Summaries

English, Italian, Spanish, French, German, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Polish, Turkish, Swedish, Norwegian, Danish, Finnish, Greek, Czech, Romanian, Hungarian, Thai, Vietnamese, Indonesian, Malay, Ukrainian, Bulgarian, Croatian, Slovak, Slovenian, Lithuanian, Latvian, Estonian.

Performance

  • Basic scraping: < 5 seconds per URL
  • With AI analysis: < 30 seconds per URL
  • Memory: Recommended 2048 MB
  • Timeout: Recommended 300 seconds (5 minutes)
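
When starting runs through the API, the recommended memory and timeout can be passed directly with the call. A hedged sketch with the Python client (the actor ID is an assumption; the numbers match the recommendations above):

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("njoylab/ai-enhanced-website-metadata").call(  # hypothetical actor ID
    run_input={"url": ["https://example.com"], "generateSummary": True},
    memory_mbytes=2048,  # recommended memory
    timeout_secs=300,    # recommended timeout (5 minutes)
)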

Error Handling

The actor implements graceful degradation:

  • AI failures → Returns metadata with ai.error field
  • Network errors → Retries with different URL variants (http/https, www/non-www)
  • Robots.txt blocking → Can be bypassed with ignoreRobots: true
  • Partial failures → When processing multiple URLs, failed URLs return error objects while successful ones return full data
  • Individual URL errors → Each URL is processed independently; one failure doesn't stop the batch
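
Because of this graceful degradation, a batch run can mix complete records with partial ones. One possible way to separate them downstream (the exact shape of failed-URL records is not specified here, so adjust the checks to what your runs actually return):

# Split dataset records into fully successful and degraded/failed ones.
succeeded, degraded = [], []
for item in items:  # one record per processed URL
    ai_error = item.get("ai", {}).get("error")
    status_ok = item.get("technical", {}).get("statusCode") == 200
    if status_ok and not ai_error:
        succeeded.append(item)
    else:
        degraded.append(item)  # metadata may still be usable even if AI analysis failed

print(f"{len(succeeded)} complete results, {len(degraded)} partial or failed")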

Example Response

Here's a real example of the actor output for a single URL:

{
  "url": "https://apify.com/",
  "seo": {
    "title": "Apify: Full-stack web scraping and data extraction platform",
    "description": "Extract data from any website with Apify's scraping tools and ready-made scrapers. No coding needed.",
    "keywords": ["web scraping", "data extraction", "automation"],
    "canonical": "https://apify.com/",
    "language": "en",
    "viewport": "width=device-width, initial-scale=1"
  },
  "openGraph": {
    "title": "Apify: Full-stack web scraping and data extraction platform",
    "description": "Extract data from any website with Apify's scraping tools",
    "image": "https://apify.com/og-image.png",
    "url": "https://apify.com/",
    "type": "website",
    "siteName": "Apify"
  },
  "twitterCard": {
    "card": "summary_large_image",
    "site": "@apify",
    "title": "Apify: Web scraping platform",
    "image": "https://apify.com/twitter-card.png"
  },
  "social": {
    "x": "https://x.com/apify",
    "linkedin": "https://linkedin.com/company/apifytech",
    "youtube": "https://youtube.com/c/apify",
    "github": "https://github.com/apify",
    "discord": "https://discord.com/invite/apify",
    "medium": "https://medium.com/@apify"
  },
  "contact": {
    "email": "support@apify.com"
  },
  "technical": {
    "statusCode": 200,
    "finalUrl": "https://apify.com/",
    "originalUrl": "https://apify.com",
    "robotsAllowed": true,
    "loadTime": 1247,
    "isSecure": true,
    "contentType": "text/html; charset=utf-8"
  },
  "media": {
    "favicon": "https://apify.com/favicon.ico",
    "logo": "https://apify.com/logo.svg",
    "featuredImage": "https://apify.com/og-image.png"
  },
  "links": {
    "internal": {
      "total": 127,
      "urls": ["https://apify.com/pricing", "https://apify.com/about", "..."]
    },
    "external": {
      "total": 8,
      "urls": ["https://docs.apify.com", "..."],
      "domains": ["docs.apify.com", "blog.apify.com"]
    },
    "mailto": ["support@apify.com"],
    "tel": []
  },
  "ai": {
    "summary": {
      "short": "Apify is a web scraping and automation platform that allows users to extract data from websites without coding.",
      "medium": "Apify is a comprehensive web scraping and data extraction platform designed for both developers and non-technical users. It offers ready-made scrapers, custom scraping tools, and a cloud infrastructure to extract data from any website at scale. The platform features an extensive library of pre-built actors, proxy management, and scheduling capabilities.",
      "contentLength": 15420,
      "truncated": false
    },
    "keywords": ["web scraping", "data extraction", "automation", "B2B SaaS", "cloud platform", "API"],
    "keyFacts": {
      "companyName": "Apify",
      "companyType": "B2B SaaS",
      "industry": "Web Scraping & Data Extraction",
      "services": ["Web scraping tools", "Ready-made scrapers", "Cloud infrastructure", "Proxy services"],
      "targetAudience": "Developers, Data Scientists, Business Analysts",
      "businessModel": "Subscription",
      "keyFeatures": ["Actor marketplace", "Serverless computing", "Proxy management", "Scheduling"]
    },
    "processingTime": 3421
  }
}

Tips for Best Results

  1. Batch Processing - Use arrays for multiple URLs to process them efficiently
  2. AI Costs - Enable generateSummary only when needed to avoid AI costs
  3. Language Detection - Leave summaryLanguage empty to auto-detect from page content
  4. Specific Summaries - Use summaryLength to get only the length you need
  5. Robots.txt - Respect robots.txt by default; only use ignoreRobots: true when legally permitted

Disclaimer

This actor is provided for legitimate web scraping and data extraction purposes. Users are responsible for:

  • Compliance with Terms of Service - Ensure you have permission to scrape target websites
  • Respect for robots.txt - Follow website crawling guidelines unless legally permitted to override
  • Rate limiting - Implement appropriate delays to avoid overloading target servers
  • Data privacy - Comply with GDPR, CCPA, and other data protection regulations
  • Intellectual property - Respect copyright and trademark rights of scraped content

The developers of this actor are not responsible for misuse or violations of applicable laws and terms of service.