
API / JSON scraper
Scrape any API / JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output. Enables you to follow pagination recursively from the payload without the need to visit the HTML page.
Pricing
$5.00/month + usage
Total users: 520
Monthly users: 29
Runs succeeded: 98%
Last modified: a year ago
Download and format JSON endpoint data
Download any JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output.
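
As a rough sketch, a minimal input for this actor might contain only `startUrls` (the URL here is a placeholder; other inputs such as `filterMap` and `handleError` are optional functions):

```json
{
  "startUrls": [
    { "url": "https://example.com/api/items.json" }
  ]
}
```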
Features
- Optimized, fast and lightweight
- Small memory requirement
- Works only with JSON payloads
- Easy recursion
- Filter and map complex JSON structures
- Comes enabled with helper libraries: lodash, moment
- Full access to your account resources through the `Apify` variable
- The run fails if all requests failed
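
To illustrate the "filter and map" idea from the feature list, here is a standalone sketch in plain Node.js (no Apify dependency; the `payload` shape and field names are invented for the example):

```javascript
// Hypothetical payload, shaped like a typical JSON API response
const payload = {
  hits: [
    { title: 'a', stats: { runs: 10 } },
    { title: 'b', stats: { runs: 0 } },
  ],
};

// The kind of transform a filterMap function performs:
// filter items out, then map the remaining ones to a new shape.
function transform(data) {
  return data.hits
    .filter((hit) => hit.stats.runs > 0) // drop items with no runs
    .map((hit) => ({ name: hit.title, runs: hit.stats.runs }));
}

console.log(transform(payload));
// [ { name: 'a', runs: 10 } ]
```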
Handling errors
Unlike cheerio-scraper, this scraper lets you handle errors before the `handlePageFunction` fails.
Using the `handleError` input, you can enqueue extra requests before failing, allowing you to recover or try a different URL.
```js
{
  handleError: async ({ addRequest, request, response, error }) => {
    request.noRetry = error.message.includes('Unexpected') || response.statusCode === 404;

    addRequest({
      url: `${request.url}?retry=true`,
    });
  }
}
```
Filter Map function
This function can filter, map and enqueue requests at the same time. A key difference is that the `userData` from the current request is passed on to the next request.
```js
const startUrls = [{
  url: "https://example.com",
  userData: {
    firstValue: 0,
  }
}];

// assuming the INPUT url above
await Apify.call('pocesar/json-downloader', {
  filterMap: async ({ request, addRequest, data }) => {
    if (request.userData.isPost) {
      // userData is inherited from the previous request,
      // so request.userData.firstValue is still 0 here.
      // Return the data only after the POST request
      return data;
    } else {
      // add the same request, but as a POST
      addRequest({
        url: `${request.url}/?method=post`,
        method: 'POST',
        payload: {
          username: 'username',
          password: 'password',
        },
        headers: {
          'Content-Type': 'application/json',
        },
        userData: {
          isPost: true
        }
      });
      // omitting the return (or returning a falsy value) skips the output
    }
  },
})
```
Examples
Flatten an object
```js
{
  filterMap: async ({ flattenObjectKeys, data }) => {
    return flattenObjectKeys(data);
  }
}

/**
 * an object like
 * {
 *   "deep": {
 *     "nested": ["state", "state1"]
 *   }
 * }
 *
 * becomes
 * {
 *   "deep.nested.0": "state",
 *   "deep.nested.1": "state1"
 * }
 */
```
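
For reference, a `flattenObjectKeys`-style helper could be implemented roughly like this (a hypothetical standalone sketch; the actor ships its own helper, and this is only meant to illustrate the behavior shown above):

```javascript
// Recursively flatten nested objects/arrays into dot-separated key paths.
function flattenObjectKeys(obj, prefix = '') {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object') {
      // recurse into nested objects and arrays (array indexes become keys)
      Object.assign(out, flattenObjectKeys(value, path));
    } else {
      out[path] = value;
    }
  }
  return out;
}

console.log(flattenObjectKeys({ deep: { nested: ['state', 'state1'] } }));
// { 'deep.nested.0': 'state', 'deep.nested.1': 'state1' }
```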
Submit a JSON API with POST
```json
{
  "startUrls": [{
    "url": "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    "method": "POST",
    "payload": "{\"query\":\"instagram\",\"page\":0,\"hitsPerPage\":24,\"restrictSearchableAttributes\":[],\"attributesToHighlight\":[],\"attributesToRetrieve\":[\"title\",\"name\",\"username\",\"userFullName\",\"stats\",\"description\",\"pictureUrl\",\"userPictureUrl\",\"notice\",\"currentPricingInfo\"]}",
    "headers": {
      "content-type": "application/x-www-form-urlencoded"
    }
  }]
}
```
Follow pagination from payload
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.nbPages > 1 && data.page < data.nbPages) {
      // get the current payload from the input
      const payload = JSON.parse(request.payload);

      // change the page number (the payload must remain a JSON string)
      request.payload = JSON.stringify({ ...payload, page: data.page + 1 });

      // add the request for parsing the next page
      addRequest(request);
    }

    return data;
  }
}
```
Omit output if condition is met
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.hits.length < 10) {
      return; // returning nothing omits the output
    }

    return data;
  }
}
```
Unwind array of results, each item from the array in a separate dataset item
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    return data.hits; // just return an array from here
  }
}
```