Sitemap To Request Queue
pocesar/sitemap-to-request-queue
Download sitemap XMLs and put them in a RequestQueue
Sitemap to RequestQueue
Downloads sitemap.xml files and appends the URLs they list to a RequestQueue of your choice.
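To make the idea concrete, here is a minimal sketch of turning a sitemap XML string into request objects. It is not the actor's actual implementation: `extractSitemapUrls` and `toRequests` are hypothetical helper names, and the regex-based parsing assumes plain `<loc>` entries (no CDATA sections or nested sitemap indexes).

```javascript
// Hypothetical helper: pull every <loc> URL out of a sitemap XML string.
// Assumes plain-text <loc> values; CDATA-wrapped entries are not handled.
function extractSitemapUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
    let match;
    while ((match = locPattern.exec(xml)) !== null) {
        // Sitemaps escape "&" as "&amp;", so undo that for the final URL.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Hypothetical helper: build request-like objects, copying the userData
// of the sitemap entry onto every URL found in it.
function toRequests(xml, userData = {}) {
    return extractSitemapUrls(xml).map((url) => ({ url, userData: { ...userData } }));
}

const sampleXml = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/item/1</loc></url>
  <url><loc> http://example.com/search?a=1&amp;b=2 </loc></url>
</urlset>`;

const requests = toRequests(sampleXml, { label: 'DETAILS' });
```

Each resulting object has the `{ url, userData }` shape that a RequestQueue accepts, which is why the `userData` given with a sitemap URL can pass through to every request.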
Example
const Apify = require('apify');

// this is your actor
Apify.main(async () => {
    const { proxyConfig } = await Apify.getInput();
    const requestQueue = await Apify.openRequestQueue();

    // persisted state, needed so the sitemap actor isn't called again every time there's a migration
    const run = (await Apify.getValue('SITEMAP-CALL')) || { runId: '', actorId: '' };

    if (!run || !run.runId) {
        // this might take a while!
        const runCall = await Apify.call('pocesar/sitemap-to-request-queue', {
            // required proxy configuration, like { useApifyProxy: true, apifyProxyGroups: ['SHADER'] }
            proxyConfig,
            // use this for this run's RequestQueue, but can be a named one, or if you
            // leave it empty, it will be placed on the remote run RQ
            targetRQ: requestQueue.queueId,
            // required sitemaps
            startUrls: [{
                url: "http://example.com/sitemap1.xml",
                userData: {
                    label: "DETAILS" // userData will pass through
                }
            }, {
                url: "http://example.com/sitemap2.xml",
            }],
            // Provide your own transform callback to filter or alter the request before adding it to the queue.
            // It is sent as a string, so it cannot use variables from the surrounding scope.
            transform: ((request) => {
                if (!request.url.includes('detail')) {
                    return null; // returning null skips the request
                }

                request.userData.label = request.url.includes('/item/') ? 'DETAILS' : 'CATEGORY';

                return request;
            }).toString()
        }, { waitSecs: 0 }); // don't block here, the wait happens below

        run.runId = runCall.id;
        run.actorId = runCall.actId;

        await Apify.setValue('SITEMAP-CALL', run);
    }

    // wait until the sitemap actor has finished filling the queue
    await Apify.utils.waitForRunToFinish(run);

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue, // ready to use!
        //...
    });

    await crawler.run();
});
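Because `transform` is sent as a string, the receiving side has to turn it back into a function before applying it. The sketch below shows one way that could work. It is an assumption for illustration, not the actor's confirmed code: `reviveTransform` and `applyTransform` are hypothetical names, and `new Function` is just one possible way to evaluate the source.

```javascript
// Hypothetical helper: rebuild a callable from the stringified callback.
// Note that this executes arbitrary code, so only do it with trusted input.
function reviveTransform(source) {
    return new Function(`return (${source});`)();
}

// Hypothetical helper: run every request through the transform and
// drop the ones for which the callback returns null.
function applyTransform(requests, transformSource) {
    const transform = reviveTransform(transformSource);
    const kept = [];
    for (const request of requests) {
        // Work on a copy so the callback always has a userData object to modify.
        const candidate = { ...request, userData: { ...(request.userData || {}) } };
        const result = transform(candidate);
        if (result) {
            kept.push(result);
        }
    }
    return kept;
}

// Same callback as in the example above, serialized with toString().
const transformSource = ((request) => {
    if (!request.url.includes('detail')) {
        return null;
    }
    request.userData.label = request.url.includes('/item/') ? 'DETAILS' : 'CATEGORY';
    return request;
}).toString();

const sample = [
    { url: 'http://example.com/item/detail-1' },
    { url: 'http://example.com/about' },
    { url: 'http://example.com/category/detail-list' },
];

const filtered = applyTransform(sample, transformSource);
```

With this sample input, the `/about` URL is dropped and the other two are labeled `DETAILS` and `CATEGORY` respectively. A callback revived this way loses access to any variables of the scope it was written in, so it should rely only on its `request` argument.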
License
Apache 2.0
Developer
Maintained by Community
Created in Sep 2020