Pricing

$20.00/month + usage

Metadata Scraper

Developed by

Louis Deconinck

Automatically scrapes metadata such as the title, description, heading, and article content from websites. It crawls the start URLs, follows the pagination, and scrapes the metadata from each detail page.

5.0 (3)

Runs succeeded

>99%

Last modified

9 months ago

Automation

Real estate

Features

Scrapes metadata from specified websites
Handles pagination and detail pages
Extracts title, description, heading, and article content
Configurable start URLs and maximum requests per crawl
Ignores specified URLs, so repeated runs do not produce duplicates

Input

Be sure to use JSON mode for the input and not Manual mode. Here's an overview of the input parameters:

startUrls: An array of objects containing:
- url: The starting URL for the scrape
- scrapeUrlGlobs: An array of URL patterns for detail pages to scrape
- paginationUrlGlobs: An array of URL patterns for pagination pages (optional)
maxRequestsPerCrawl: Maximum number of requests per crawl (default: 100)
urlsToIgnore: An array of URLs to ignore when processing (optional)

Here's an example of the input data structure:

{
  "startUrls": [
    {
      "url": "https://roger-hannah.co.uk/property-search/?search_properties=1&tenure=&property_type%5B%5D=Development&property_type%5B%5D=Industrial&size_min=0&size_max=1000000",
      "scrapeUrlGlobs": ["https://roger-hannah.co.uk/properties/*"],
      "paginationUrlGlobs": []
    }
  ],
  "maxRequestsPerCrawl": 100,
  "urlsToIgnore": [
    "https://roger-hannah.co.uk/properties/development-site-with-potential-for-10-houses-planning-permission/",
    "https://roger-hannah.co.uk/properties/lower-mill-mill-street/"
  ]
}
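
If you generate the input programmatically, a small helper can keep the structure consistent. The following Python sketch assembles an object with the fields described above and prints it as JSON for pasting into JSON mode. The helper name build_input and the example.com URLs are hypothetical, and the check that scrapeUrlGlobs is non-empty is an assumption rather than a documented rule of the Actor.

```python
import json

def build_input(start_url, scrape_globs, pagination_globs=None,
                max_requests=100, urls_to_ignore=None):
    """Assemble an input object with the documented fields."""
    if not scrape_globs:
        # Assumption: without a detail-page pattern there is nothing to scrape.
        raise ValueError("scrapeUrlGlobs needs at least one pattern")
    return {
        "startUrls": [
            {
                "url": start_url,
                "scrapeUrlGlobs": list(scrape_globs),
                # Optional field: default to an empty list.
                "paginationUrlGlobs": list(pagination_globs or []),
            }
        ],
        "maxRequestsPerCrawl": max_requests,   # documented default is 100
        "urlsToIgnore": list(urls_to_ignore or []),
    }

actor_input = build_input(
    "https://example.com/listings/",
    ["https://example.com/listings/*"],
)
print(json.dumps(actor_input, indent=2))
```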

Using Glob Patterns

Glob patterns are used to match URLs. They are similar to regular expressions but simpler and less expressive. Here they tell the Actor which URLs are detail pages to scrape and which are pagination pages to follow.

Here are some common glob patterns used in URL matching:

*: Matches any number of characters (except /). Example: https://example.com/*.html matches all HTML files in the root directory.
**: Matches any number of characters (including /). Example: https://example.com/**/*.jpg matches all JPG files in any subdirectory.
?: Matches exactly one character. Example: https://example.com/page?.html matches page1.html, pageA.html, etc.
[...]: Matches any one character in the brackets. Example: https://example.com/file[123].txt matches file1.txt, file2.txt, file3.txt.
[!...]: Matches any one character not in the brackets. Example: https://example.com/img[!0-9].png matches imgA.png but not img1.png.
{...}: Matches any of the comma-separated patterns. Example: https://example.com/{blog,news}/*.html matches both blog and news HTML files.
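
To make the rules above concrete, here is a minimal Python sketch that translates a glob into a regular expression following exactly the semantics listed. It is an illustration only, not the Actor's own matcher: the function name glob_to_regex is hypothetical, brace alternatives are treated as literal text, and ** is simply mapped to "any characters", which may differ in edge cases from the glob library the Actor actually uses.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL glob into a compiled regex (illustrative sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")            # ** crosses "/" boundaries
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")         # * stays within one path segment
        elif ch == "?":
            out.append("[^/]")          # ? is exactly one character
        elif ch == "[" and "]" in pattern[i + 1:]:
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end]
            if body.startswith("!"):    # [!...] negates the set
                body = "^" + body[1:]
            out.append("[" + body + "]")
            i = end + 1
            continue
        elif ch == "{" and "}" in pattern[i + 1:]:
            end = pattern.index("}", i + 1)
            # Assumption: alternatives are plain text, not nested globs.
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end + 1
            continue
        else:
            out.append(re.escape(ch))   # everything else is literal
        i += 1
    return re.compile("".join(out))

def matches(pattern: str, url: str) -> bool:
    """True when the whole URL matches the glob."""
    return glob_to_regex(pattern).fullmatch(url) is not None

print(matches("https://example.com/*.html", "https://example.com/index.html"))
print(matches("https://example.com/*.html", "https://example.com/a/index.html"))
print(matches("https://example.com/img[!0-9].png", "https://example.com/imgA.png"))
```

Checking a handful of real URLs from the target site against your patterns this way, before starting a run, can help catch a pattern that is too narrow or too broad.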

Examples in the context of web scraping:

https://example.com/products/*.html: Matches all product detail pages
https://example.com/category/*/page-*.html: Matches pagination pages in all categories
https://example.com/{2021,2022,2023}/**: Matches all pages from specific years
https://example.com/page/*: Matches pages directly under /page/ (one level deep)
https://example.com/page/**: Matches all pages under /page/, at any depth

When using glob patterns in scrapeUrlGlobs and paginationUrlGlobs, make sure they accurately represent the structure of the website you're scraping so that all relevant pages are captured.

Output

The Actor outputs the following data for each scraped detail page:

url: The URL of the scraped page
title: The title of a detail page
description: The description of a detail page
heading: The main heading of a detail page
article: The content of a detail page

Here's an example of the output data structure:

{
  "url": "https://roger-hannah.co.uk/properties/bolton-street/",
  "title": "Bolton Street - Roger Hannah",
  "description": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl...",
  "heading": "Bolton Street",
  "article": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl..."
}
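
Because each output item carries the url of the page it came from, the results of one run can feed the urlsToIgnore input of the next run to avoid scraping the same detail pages again. Here is a minimal Python sketch of that idea. It assumes you already have the previous items as a list of dictionaries shaped like the example above; the function name and the example.com URLs are hypothetical.

```python
def urls_to_ignore_from(items, existing=()):
    """Collect URLs of already-scraped items, keeping first-seen order."""
    seen = dict.fromkeys(existing)      # dict preserves insertion order
    for item in items:
        url = item.get("url")
        if url:                          # skip items without a URL
            seen.setdefault(url, None)
    return list(seen)

previous_items = [
    {"url": "https://example.com/properties/a/", "title": "A"},
    {"url": "https://example.com/properties/b/", "title": "B"},
    {"url": "https://example.com/properties/a/", "title": "A"},
]

next_ignore = urls_to_ignore_from(
    previous_items,
    existing=["https://example.com/properties/z/"],
)
print(next_ignore)
```

The resulting list can be placed in the urlsToIgnore field of the next run's input.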

URL to Metadata

njoylab/url-summary-scraper

A powerful Apify actor that extracts essential website information, including title, description, images, and social media links. Perfect for quick data gathering and insights from any URL.

njoylab

5.0

Get Metadata

maged120/get-metadata

The actor extracts comprehensive metadata including image previews, titles, descriptions, author, time of publish, fav icon, and a lot more

Maged

5.0

Metadata Extractor

jancurn/extract-metadata

A small, efficient actor that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <HEAD> tag, such as the page title, description, and author.

Jan Čurn

1.3K

Metadata Scraper

autofacts/metadata-scraper

A powerful web scraper that extracts various types of structured metadata from web pages, including JSON-LD, Microdata, Open Graph, Twitter Cards, and more. Perfect for SEO analysis, content aggregation, and research purposes.

Autofactor

5.0

URL Metadata Crawler

easyapi/url-metadata-crawler

Extracts comprehensive metadata from web pages. Gather vital information like meta tags, favicons, Open Graph tags, and more, all while enjoying flexible options for customization. Perfect for SEO specialists, developers, and content creators looking to enhance their web presence! 🌐

EasyApi

HTML Scraper pro

scrapingxpert/html-scraper-pro

HTML Scraper Pro is a tool designed to extract the HTML source code and metadata from websites. It retrieves the full HTML content of web pages, the page title, and the HTTP status code. This tool is ideal for data extraction, website analysis, and archiving.

scrapingxpert

100

Website Metadata Extractor (meta tags, sitemap, robots) 🔎

powerful_bachelor/website-metadata-extractor

🔍 Website Metadata Extractor 🌐 Extract essential website data: meta tags, robots.txt, and sitemap.xml in one scan. 📊 Analyze SEO elements, crawler directives, and site structure. ✅ Perfect for SEO audits, 🔎 competitor research, and 🚀 understanding how search engines view your website.

Powerful Bachelor

Envato Scraper

dcwhale/envato-scraper

Scrape Envato products in any category

Amsyari

HTTP Status Codes and URL Checker

antonio_espresso/website-status-code-crawler

A HTTP Status Codes Crawler is a tool that scans a website and retrieves HTTP status codes for each page. This helps in diagnosing errors and optimizing technical SEO.

Antonio Blago

ImageFX API

ib4ngz/imageFX-api

This actor uses the ImageFX API to generate images from a list of text prompts. It supports multiple authentication tokens, configurable image count, choice of file extension (JPEG or PNG), a seed value for reproducibility, and optional ZIP archive creation for the generated images.