
API / JSON scraper

pocesar/json-downloader

Scrape any API / JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output. Enables you to follow pagination recursively from the payload without the need to visit the HTML page.

Free trial for 14 days

Then $25/month

No credit card required now

Author: Paulo Cesar
  • Users: 114
  • Runs: 3,867

Features

  • Optimized, fast and lightweight
  • Small memory requirement
  • Works only with JSON payloads
  • Easy recursion
  • Filter and map complex JSON structures
  • Comes enabled with helper libraries: lodash, moment
  • Full access to your account resources through the Apify variable
  • The run fails only if all requests fail

Handling errors

Unlike cheerio-scraper, this scraper lets you handle errors before the handlePageFunction fails. Using the handleError input, you can enqueue extra requests before failing, allowing you to recover or try a different URL.

{
  handleError: async ({ addRequest, request, response, error }) => {
    // skip retries for JSON parse errors ("Unexpected ...") and missing pages
    request.noRetry = error.message.includes('Unexpected') || response.statusCode === 404;

    // enqueue a fallback request before the original one fails
    addRequest({
      url: `${request.url}?retry=true`,
    });
  }
}
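The retry decision above can be isolated into a plain predicate. This is a hypothetical helper for illustration, not part of the actor's API:

```javascript
// Hypothetical helper: decide whether a failed request should NOT be
// retried. Mirrors the handleError logic above: JSON parse errors
// ("Unexpected ...") and 404 responses are not worth retrying.
function shouldSkipRetry(errorMessage, statusCode) {
  return errorMessage.includes('Unexpected') || statusCode === 404;
}

console.log(shouldSkipRetry('Unexpected token < in JSON', 200)); // true
console.log(shouldSkipRetry('socket hang up', 404)); // true
console.log(shouldSkipRetry('socket hang up', 500)); // false
```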

Filter Map function

This function can filter, map, and enqueue requests at the same time. The difference from a plain filter is that the userData from the current request is passed along to the requests it enqueues.

const startUrls = [{
  url: "https://example.com",
  userData: {
    firstValue: 0,
  }
}];

// assuming the INPUT url above
await Apify.call('pocesar/json-downloader', {
  filterMap: async ({ request, addRequest, data }) => {

    if (request.userData.isPost) {
      // userData is inherited from the previous request,
      // so request.userData.firstValue is still 0 here

      // return the data only after the POST request
      return data;
    } else {
      // add the same request, but as a POST
      addRequest({
        url: `${request.url}/?method=post`,
        method: 'POST',
        payload: {
          username: 'username',
          password: 'password',
        },
        headers: {
          'Content-Type': 'application/json',
        },
        userData: {
          isPost: true
        }
      });
      // omitting the return (or returning a falsy value) skips the output
    } 
  },
})

Examples

Flatten an object

{
   filterMap: async ({ flattenObjectKeys, data }) => {
     return flattenObjectKeys(data);
   }
}
/**
 * an object like 
 * {
 *    "deep": {
 *       "nested": ["state", "state1"]
 *    } 
 * }
 * 
 * becomes
 * {
 *    "deep.nested.0": "state",
 *    "deep.nested.1": "state1"
 * }
 */
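For reference, the flattening shown above can be reproduced in plain JavaScript. This is a hypothetical reimplementation for illustration only; the actor already provides flattenObjectKeys in the filterMap context:

```javascript
// Hypothetical reimplementation of the flattening behavior shown above.
// Recursively walks objects and arrays, joining keys with dots.
function flattenObjectKeys(value, prefix = '', out = {}) {
  if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      flattenObjectKeys(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

console.log(flattenObjectKeys({ deep: { nested: ['state', 'state1'] } }));
// { 'deep.nested.0': 'state', 'deep.nested.1': 'state1' }
```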

Submit a JSON API with POST

{
  "startUrls": [
    {
      "url": "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
      "method": "POST",
      "payload": "{\"query\":\"instagram\",\"page\":0,\"hitsPerPage\":24,\"restrictSearchableAttributes\":[],\"attributesToHighlight\":[],\"attributesToRetrieve\":[\"title\",\"name\",\"username\",\"userFullName\",\"stats\",\"description\",\"pictureUrl\",\"userPictureUrl\",\"notice\",\"currentPricingInfo\"]}",
      "headers": {
        "content-type": "application/x-www-form-urlencoded"
      }
    }
  ]
}
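Rather than hand-escaping the payload string, it can be built with JSON.stringify when constructing the input programmatically. The URL and query fields below are illustrative placeholders, not taken from the actor:

```javascript
// Build the escaped "payload" string for a POST start URL with
// JSON.stringify instead of escaping quotes by hand.
const payload = JSON.stringify({
  query: 'instagram',
  page: 0,
  hitsPerPage: 24,
});

const startUrl = {
  url: 'https://example.com/search', // placeholder endpoint
  method: 'POST',
  payload,
  headers: { 'content-type': 'application/x-www-form-urlencoded' },
};

console.log(startUrl.payload);
// {"query":"instagram","page":0,"hitsPerPage":24}
```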

Follow pagination from payload

{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.nbPages > 1 && data.page + 1 < data.nbPages) {
      // get the current payload from the input
      const payload = JSON.parse(request.payload);

      // change the page number; the payload must be re-serialized to a string
      request.payload = JSON.stringify({ ...payload, page: data.page + 1 });
      // add the request for parsing the next page
      addRequest(request);
    }

    return data;
  }
}
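The next-page computation can be sketched as a small standalone helper (hypothetical, for illustration): given the previous payload string and the response data, it returns the payload for the next page, or null on the last page. Pages are assumed zero-based, as in the Algolia example above:

```javascript
// Hypothetical helper: compute the payload for the next page, or null
// when the current page is the last one (pages are zero-based).
function nextPagePayload(payloadString, data) {
  if (data.nbPages > 1 && data.page + 1 < data.nbPages) {
    const payload = JSON.parse(payloadString);
    return JSON.stringify({ ...payload, page: data.page + 1 });
  }
  return null;
}

console.log(nextPagePayload('{"query":"a","page":0}', { page: 0, nbPages: 3 }));
// {"query":"a","page":1}
console.log(nextPagePayload('{"query":"a","page":2}', { page: 2, nbPages: 3 }));
// null
```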

Omit output if condition is met

{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.hits.length < 10) {
      return;
    }

    return data;
  }
}

Unwind an array of results, storing each item of the array as a separate dataset item

{
  filterMap: async ({ addRequest, request, data }) => {
    return data.hits; // just return an array from here
  }
}
