LD+JSON Schema scraper

  • pocesar/json-ld-schema
  • Users 82
  • Runs 740
  • Created by Paulo Cesar

Extract all LD+JSON tags from the given URLs.
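
For reference, an LD+JSON tag is a script element embedding schema.org
structured data, which is exactly what the page function below selects. A
typical (hypothetical) example of what the actor picks up:

<script type="application/ld+json">
{
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",
    "author": { "@type": "Person", "name": "Jane Doe" }
}
</script>

The actor outputs the parsed JSON body of every such tag it finds.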

.editorconfig

root = true

[*]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
end_of_line = lf

.eslintrc

{
    "extends": "@apify"
}

.gitignore

# This file tells Git which files shouldn't be added to source control

.idea
node_modules

Dockerfile

# First, specify the base Docker image. You can read more about
# the available images at https://sdk.apify.com/docs/guides/docker-images
# You can also use any other image from Docker Hub.
FROM apify/actor-node:16

# Second, copy just package.json and package-lock.json since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --only=prod --no-optional --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, rebuilds will be really fast
# for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the Node.js source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
#
# CMD npm start

INPUT_SCHEMA.json

{
    "title": "LD JSON Schema scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start Urls",
            "type": "array",
            "description": "The URLs to extract all LD+JSON data",
            "default": [],
            "prefill": [{
                "url": "https://blog.apify.com/"
            }],
            "editor": "requestListSources"
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "description": "A proxy required for scraping",
            "type": "object",
            "default": { "useApifyProxy": true },
            "prefill": { "useApifyProxy": true },
            "editor": "proxy"
        },
        "customData": {
            "title": "Custom data",
            "description": "Provide some custom data to output",
            "type": "object",
            "default": {},
            "prefill": {},
            "editor": "json"
        }
    },
    "required": [
        "startUrls",
        "proxyConfiguration"
    ]
}
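
For example, a run input conforming to this schema, built from the prefill
and default values above (the custom data value is illustrative):

{
    "startUrls": [{ "url": "https://blog.apify.com/" }],
    "proxyConfiguration": { "useApifyProxy": true },
    "customData": { "label": "example-run" }
}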

README.md

# LD+JSON scraper

Extract LD+JSON data from websites
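
The actor pushes one dataset item per parsed LD+JSON tag, shaped as
{ data, url, customData } by the page function in main.js. A sample item
(values are illustrative):

{
    "data": { "@context": "https://schema.org", "@type": "Article" },
    "url": "https://blog.apify.com/",
    "customData": {}
}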

apify.json

{
    "env": { "npm_config_loglevel": "silent" }
}

main.js

const Apify = require('apify');

const pageFunction = async (context) => {
    const { request, $, log, customData } = context;

    const { url } = request;

    // Collect all LD+JSON script tags on the page
    const lds = $('script[type="application/ld+json"]');

    if (!lds.length) {
        log.warning('No LD+JSON found on page', { url });
        return {
            data: {},
            url,
            customData,
        };
    }

    // Parse each tag's contents, dropping invalid JSON, and emit one
    // dataset item per successfully parsed tag
    return lds
        .map((_, el) => $(el).html().trim())
        .get()
        .map((html) => {
            try {
                return JSON.parse(html);
            } catch (e) {
                log.exception(e, 'Invalid JSON', { url });
            }
        })
        .filter(Boolean)
        .map((data) => {
            return {
                data,
                url,
                customData,
            };
        });
};

Apify.main(async () => {
    const { proxyConfiguration, startUrls, customData } = await Apify.getInput();

    if (!proxyConfiguration) {
        throw new Error('A proxy configuration is required to run this actor');
    }

    // Validate the proxy configuration before metamorphing
    const proxy = await Apify.createProxyConfiguration(proxyConfiguration);

    if (!proxy) {
        throw new Error('Invalid proxy configuration');
    }

    if (!startUrls?.length) {
        throw new Error('Provide a RequestList sources array on "startUrls" input');
    }

    // Hand the run over to the generic Cheerio Scraper, passing the
    // serialized page function above
    await Apify.metamorph('apify/cheerio-scraper', {
        startUrls,
        pageFunction: pageFunction.toString(),
        proxyConfiguration,
        customData,
        ignoreSslErrors: true,
    });
});
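
Because the page function only touches request, $, log and customData on the
context, it can be smoke-tested locally without metamorphing. A minimal
sketch, assuming cheerio is installed and using a made-up HTML fixture:

const cheerio = require('cheerio');

// Fixture with one valid and one broken LD+JSON tag (hypothetical data)
const html = `
    <script type="application/ld+json">{"@type": "Article"}</script>
    <script type="application/ld+json">not json</script>
`;

const context = {
    request: { url: 'https://example.com/' },
    $: cheerio.load(html),
    log: { warning: console.warn, exception: console.error },
    customData: {},
};

// Only the valid tag survives JSON.parse + filter(Boolean)
pageFunction(context).then((items) => console.log(items));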

package.json

{
    "name": "project-empty",
    "version": "0.0.1",
    "description": "This is a boilerplate of an Apify actor.",
    "dependencies": {
        "apify": "^2.2.1"
    },
    "scripts": {
        "start": "node main.js",
        "lint": "./node_modules/.bin/eslint ./src --ext .js,.jsx",
        "lint:fix": "./node_modules/.bin/eslint ./src --ext .js,.jsx --fix",
        "test": "echo \"Error: oops, the actor has no tests yet, sad!\" && exit 1"
    },
    "author": "It's not you it's me",
    "license": "ISC"
}