Podcasts
This Actor is unavailable because the developer has deprecated it.
zyberg/podcasts
Gets the URL of the first podcast feed found on a Google Podcasts page.
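Judging from main.js below, the Actor's input is an object with a links array of Google Podcasts page URLs to crawl. A minimal input sketch (the URL below is made up for illustration):

{
    "links": [
        "https://podcasts.google.com/search/example%20show"
    ]
}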
Dockerfile
# This is a template for a Dockerfile used to run Actors on the Apify platform.
# The base image name below is set during the Actor build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node-basic

# Copy just package.json and package-lock.json first, since they are the only
# files that affect "npm install" in the next step; this speeds up rebuilds.
COPY package*.json ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy the source code to the container.
# Do this in the last step, so that rebuilds are fast when only the source code changed.
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local Node.js inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
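The apify/actor-node-basic base image is the slim, Node.js-only variant with no bundled browser, which is sufficient here: CheerioCrawler in main.js fetches pages with plain HTTP requests and parses them with cheerio rather than driving a browser.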
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "0.22.4"
    },
    "scripts": {
        "start": "node main.js"
    }
}
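Note that the apify dependency is pinned to 0.22.4, a pre-1.0 release of the Apify SDK; the APIs used in main.js (Apify.main, Apify.openRequestQueue, CheerioCrawler with handlePageFunction) belong to that generation of the SDK and were reorganized in later releases.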
main.js
const Apify = require('apify');

Apify.main(async () => {
    const input = await Apify.getInput();
    const requestQueue = await Apify.openRequestQueue('google-podcasts');
    const dataset = await Apify.openDataset('google-podcasts');
    const output = [];

    // Enqueue every input link for crawling.
    for (const link of input.links) {
        await requestQueue.addRequest({
            url: link,
            uniqueKey: link + (new Date()).toString(),
        });
    }

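    // Note on the enqueue step above: the 'google-podcasts' request queue is named,
    // so it persists between runs; suffixing each uniqueKey with the current date
    // defeats deduplication and lets the same link be enqueued again on a later run.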
    const crawler = new Apify.CheerioCrawler({
        requestQueue,

        // The crawler downloads and processes the web pages in parallel, with concurrency
        // managed automatically based on the available system memory and CPU (see the AutoscaledPool class).
        // Here we define some hard limits for the concurrency.
        minConcurrency: 10,
        maxConcurrency: 50,

        // On error, retry each page at most once.
        maxRequestRetries: 1,

        // Timeout for the processing of each page.
        handlePageTimeoutSecs: 30,

        // Limit to 10 requests per crawl.
        maxRequestsPerCrawl: 10,

        // This function is called for each URL to crawl. It accepts a single
        // parameter, an object with options described at:
        // https://sdk.apify.com/docs/typedefs/cheerio-crawler-options#handlepagefunction
        // For demonstration, we use only 2 of them:
        // - request: an instance of the Request class with information such as the URL and HTTP method
        // - $: the cheerio object containing the parsed HTML
        handlePageFunction: async ({ request, $ }) => {
            console.log('Handling ' + request.url);

            // Match the first /feed/... occurrence in the page HTML, up to and
            // including the quote that closes the attribute it appears in.
            const pattern = /\/feed\/(\w|\/|\?|\=|\&|;)+"/;
            let url_podcast = $.html().match(pattern); // first match, or null

            if (url_podcast != null) {
                // Take the full match and strip the trailing quote character.
                url_podcast = url_podcast[0].replace('"', '');

                const out = {
                    url: request.url,
                    url_podcast,
                };

                await dataset.pushData(out);
                output.push(out);
            }
        },
    });

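    // run() resolves once the queue is drained or maxRequestsPerCrawl is reached.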
    await crawler.run();
    console.log(output);

    // Persist the collected records to the default key-value store under the key OUTPUT.
    await Apify.setValue('OUTPUT', output);
});
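For reference, here is a standalone sketch of what the /feed/ extraction in handlePageFunction does; the HTML fragment is made up for illustration, and the snippet runs on its own under Node.js:

const pattern = /\/feed\/(\w|\/|\?|\=|\&|;)+"/;

// Hypothetical fragment of a Google Podcasts page.
const html = '<a href="/feed/aHR0cHM6Ly9leGFtcGxlLmNvbS9mZWVk?sa=X">Example show</a>';

const match = html.match(pattern); // first /feed/... occurrence, including the closing quote
const url_podcast = match && match[0].replace('"', ''); // strip the trailing quote
console.log(url_podcast); // -> /feed/aHR0cHM6Ly9leGFtcGxlLmNvbS9mZWVk?sa=X

Note that the extracted value is a relative path, not an absolute URL, since the pattern matches starting at the /feed/ segment.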
Developer
Maintained by Community