
API / JSON scraper
Scrape any API / JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output. Enables you to follow pagination recursively from the payload without the need to visit the HTML page.
Pricing
$5.00/month + usage
Total users: 520
Monthly users: 29
Runs succeeded: 98%
Last modified: a year ago
Download and format JSON endpoint data
Download any JSON URLs directly to the dataset, and return them in CSV, XML, HTML, or Excel formats. Transform and filter the output.
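
As a rough sketch, a minimal input for this actor might contain only `startUrls` (the URL here is a placeholder; other inputs such as `filterMap` and `handleError` are optional functions):

```json
{
  "startUrls": [
    { "url": "https://example.com/api/items.json" }
  ]
}
```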
Features
- Optimized, fast and lightweight
- Small memory requirement
- Works only with JSON payloads
- Easy recursion
- Filter and map complex JSON structures
- Comes enabled with helper libraries: lodash, moment
- Full access to your account resources through the `Apify` variable
- The run fails if all requests failed
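
To illustrate the "filter and map" idea from the feature list, here is a standalone sketch in plain Node.js (no Apify dependency; the `payload` shape and field names are invented for the example):

```javascript
// Hypothetical payload, shaped like a typical JSON API response
const payload = {
  hits: [
    { title: 'a', stats: { runs: 10 } },
    { title: 'b', stats: { runs: 0 } },
  ],
};

// The kind of transform a filterMap function performs:
// filter items out, then map the remaining ones to a new shape.
function transform(data) {
  return data.hits
    .filter((hit) => hit.stats.runs > 0) // drop items with no runs
    .map((hit) => ({ name: hit.title, runs: hit.stats.runs }));
}

console.log(transform(payload));
// [ { name: 'a', runs: 10 } ]
```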
Handling errors
Unlike cheerio-scraper, this scraper lets you handle errors before the `handlePageFunction` fails.
Using the `handleError` input, you can enqueue extra requests before failing, allowing you to recover or try a different URL.
```js
{
  handleError: async ({ addRequest, request, response, error }) => {
    request.noRetry = error.message.includes('Unexpected') || response.statusCode === 404;

    addRequest({
      url: `${request.url}?retry=true`,
    });
  }
}
```
Filter Map function
This function can filter, map and enqueue requests at the same time. A key difference is that the `userData` from the current request is passed on to the next request.
```js
const startUrls = [{
  url: "https://example.com",
  userData: {
    firstValue: 0,
  }
}];

// assuming the INPUT url above
await Apify.call('pocesar/json-downloader', {
  filterMap: async ({ request, addRequest, data }) => {
    if (request.userData.isPost) {
      // userData is inherited from the previous request,
      // so request.userData.firstValue is still 0 here.
      // Return the data only after the POST request
      return data;
    } else {
      // add the same request, but as a POST
      addRequest({
        url: `${request.url}/?method=post`,
        method: 'POST',
        payload: {
          username: 'username',
          password: 'password',
        },
        headers: {
          'Content-Type': 'application/json',
        },
        userData: {
          isPost: true
        }
      });
      // omitting the return (or returning a falsy value) skips the output
    }
  },
})
```
Examples
Flatten an object
```js
{
  filterMap: async ({ flattenObjectKeys, data }) => {
    return flattenObjectKeys(data);
  }
}

/**
 * an object like
 * {
 *   "deep": {
 *     "nested": ["state", "state1"]
 *   }
 * }
 *
 * becomes
 * {
 *   "deep.nested.0": "state",
 *   "deep.nested.1": "state1"
 * }
 */
```
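
For reference, a `flattenObjectKeys`-style helper could be implemented roughly like this (a hypothetical standalone sketch; the actor ships its own helper, and this is only meant to illustrate the behavior shown above):

```javascript
// Recursively flatten nested objects/arrays into dot-separated key paths.
function flattenObjectKeys(obj, prefix = '') {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object') {
      // recurse into nested objects and arrays (array indexes become keys)
      Object.assign(out, flattenObjectKeys(value, path));
    } else {
      out[path] = value;
    }
  }
  return out;
}

console.log(flattenObjectKeys({ deep: { nested: ['state', 'state1'] } }));
// { 'deep.nested.0': 'state', 'deep.nested.1': 'state1' }
```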
Submit a JSON API with POST
```json
{
  "startUrls": [{
    "url": "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
    "method": "POST",
    "payload": "{\"query\":\"instagram\",\"page\":0,\"hitsPerPage\":24,\"restrictSearchableAttributes\":[],\"attributesToHighlight\":[],\"attributesToRetrieve\":[\"title\",\"name\",\"username\",\"userFullName\",\"stats\",\"description\",\"pictureUrl\",\"userPictureUrl\",\"notice\",\"currentPricingInfo\"]}",
    "headers": {
      "content-type": "application/x-www-form-urlencoded"
    }
  }]
}
```
Follow pagination from payload
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.nbPages > 1 && data.page < data.nbPages) {
      // get the current payload from the input
      const payload = JSON.parse(request.payload);

      // change the page number (the payload must remain a JSON string)
      request.payload = JSON.stringify({ ...payload, page: data.page + 1 });

      // add the request for parsing the next page
      addRequest(request);
    }

    return data;
  }
}
```
Omit output if condition is met
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    if (data.hits.length < 10) {
      return; // returning nothing omits the output
    }

    return data;
  }
}
```
Unwind array of results, each item from the array in a separate dataset item
```js
{
  filterMap: async ({ addRequest, request, data }) => {
    return data.hits; // just return an array from here
  }
}
```