CSFD Scraper

1 day trial then $25.00/month - No credit card required now

danielherman/csfd-scraper

CSFD Scraper is a blazingly fast web scraper powered by Rust on the backend. It allows you to scrape csfd.cz.

What is CSFD Scraper?

CSFD Scraper is a blazingly fast web scraper powered by Rust on the backend. It lets you scrape static csfd.cz HTML pages extremely quickly, without rendering, while using only 128 MB of memory. With this scraper, you can maximize the efficiency of your credits on Apify. 🚀🚀🚀

For benchmarks, see https://apify.com/danielherman/fast-scraper.

Explanation of the input

The actor has some global parameters, described in the Input tab, and a list of requests. Each request has the following structure:

{
    "request_type": string, // required
    "url": string, // optional
    "id": string, // optional
    "headers": object, // optional
    "user-agent": string // optional
}

Only request_type is required. With request_type=Sitemap the url is ignored; with any other request_type the url must be provided, otherwise the actor will panic. The id key is optional and is copied to the results, so you can use it to track requests by something other than the url. The order of the scraped data in the results will most likely differ from the order of the requests. Both headers and user-agent are optional, and you can also set the user agent directly in headers. Request-level headers and user-agent override the global headers and user agent. Let's see an example:

{
    "requests": [
        {
            "request_type": "View",
            "url": "https://www.csfd.cz/film/68990-star-trek-hluboky-vesmir-devet/494608-serie-6/prehled/",
            "headers": {
                "dnt": "0",
                "priority": "u=0, i",
                "referer": "https://www.csfd.cz/"
            }
        }
    ],
    "headers": {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "en-US,en;q=0.9",
        "dnt": "1",
        "priority": "u=0, i",
        "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"macOS\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1"
    },
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}

Here the request is sent with all the global headers and the global user_agent, except that the "dnt" (Do Not Track) header is overridden from 1 to 0 and a "referer" header is added. The request-level "priority" header repeats the global value, so it changes nothing. Once you set a global header you cannot delete it at the request level, only override it.
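The override rule can be pictured as a plain dictionary merge. The following Python sketch is only an illustration of that rule, not the actor's Rust implementation; the function name merge_headers is hypothetical, and the global headers are abbreviated to three entries.

```python
from typing import Optional


def merge_headers(global_headers: dict, request_headers: Optional[dict] = None) -> dict:
    """Return the effective headers for one request.

    Hypothetical helper: request-level values replace global values with the
    same name, and global values can never be removed, only overridden.
    """
    merged = dict(global_headers)         # start from a copy of the global headers
    merged.update(request_headers or {})  # request-level values win on conflict
    return merged


# Abbreviated version of the example above.
global_headers = {"accept-language": "en-US,en;q=0.9", "dnt": "1", "priority": "u=0, i"}
request_headers = {"dnt": "0", "priority": "u=0, i", "referer": "https://www.csfd.cz/"}

effective = merge_headers(global_headers, request_headers)
print(effective)
```

In this sketch "dnt" ends up as "0", "referer" is added, and "accept-language" is inherited unchanged from the global headers.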

Supported page types

The following request_type values exist:

  • Sitemap request_type=Sitemap: scrapes all URLs in the exposed csfd.cz sitemap. This can take a while and is done with a single request specification.
  • View request_type=View: returns information from the overview pages (/prehled) of movies, serials, series and episodes.
  • View reviews request_type=ViewReviews: returns the comments for a specific movie, serial, series or episode.
  • User request_type=User: returns information about the user.
  • User reviews request_type=UserReviews: returns the reviews written by a specific user.
  • Ratings request_type=Ratings: returns all the user ratings for a specific movie, serial, series or episode. PLANNED
  • Creator request_type=Creator: returns information about the creator. PLANNED
  • Program request_type=Program: returns the parsed TV program from https://www.csfd.cz/televize/program/. PLANNED

Types marked PLANNED are not available yet. The sections below give input and output examples for Sitemap, View, ViewReviews, ViewRatings and UserReviews; note that the ratings example uses request_type=ViewRatings.
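Because a missing url makes the actor panic, it can be worth checking requests before starting a run. The Python sketch below is a hypothetical pre-flight check based only on the rules stated in this README; the set DOCUMENTED_TYPES and the function validate_request are assumptions made for illustration and are not part of the actor.

```python
# Assumption: the request types that have input/output examples in this README.
DOCUMENTED_TYPES = {"Sitemap", "View", "ViewReviews", "ViewRatings", "UserReviews"}


def validate_request(request: dict) -> list:
    """Return a list of problems found in one request (empty list means OK)."""
    problems = []
    request_type = request.get("request_type")
    if not request_type:
        problems.append("request_type is required")
    elif request_type not in DOCUMENTED_TYPES:
        problems.append(f"request_type {request_type!r} has no documented example")
    # Only Sitemap may omit the url; every other type needs one.
    if request_type != "Sitemap" and not request.get("url"):
        problems.append("url is required unless request_type is Sitemap")
    return problems


print(validate_request({"request_type": "Sitemap"}))
print(validate_request({"request_type": "View"}))
```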

Sitemap

Make sure that the actor timeout is long enough (e.g. 3600 s). The sitemap is not scraped in parallel.

Input example:

{
    "requests": [
        {
            "request_type": "Sitemap"
        }
    ],
    "user_agent": "ApifyFastScraper/1.0",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}

It fetches the whole published sitemap of csfd.cz, which includes, among others, URLs under:

  • https://www.csfd.cz/film/
  • https://www.csfd.cz/tvurce/
  • https://www.csfd.cz/uzivatel/
  • https://www.csfd.cz/diskuze/
  • https://www.csfd.cz/akce/
  • https://www.csfd.cz/festival/
  • https://www.csfd.cz/kino/
  • https://www.csfd.cz/novinky/
  • https://www.csfd.cz/zanry/

Output example:

[
    {
        "id": "06de9c9d-b17f-44aa-a9f3-e87a6769fffd",
        "request_type": "Sitemap",
        "url": "https://www.csfd.cz/sitemap.xml",
        "data": {
            "Sitemap": [
                "https://www.csfd.cz/film/16-zurov/231-zurov-2/prehled/",
                "https://www.csfd.cz/film/16-zurov/703683-zurov/prehled/",
                "https://www.csfd.cz/film/16-zurov/703684-teorema-lobacevskogo/prehled/"
            ]
        }
    }
]
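A common next step is to feed the film URLs from the sitemap back into the actor as View requests. The Python sketch below assumes the output shape shown above; the function name sitemap_to_view_requests and the sample item are hypothetical.

```python
FILM_PREFIX = "https://www.csfd.cz/film/"


def sitemap_to_view_requests(items: list) -> list:
    """Build View requests from Sitemap results, keeping only /film/ URLs."""
    requests = []
    for item in items:
        if item.get("request_type") != "Sitemap":
            continue  # ignore results of other request types
        for url in item["data"]["Sitemap"]:
            if url.startswith(FILM_PREFIX):
                requests.append({"request_type": "View", "url": url})
    return requests


# Hypothetical sample in the shape of the output example above.
sample_items = [
    {
        "request_type": "Sitemap",
        "data": {
            "Sitemap": [
                "https://www.csfd.cz/film/16-zurov/231-zurov-2/prehled/",
                "https://www.csfd.cz/tvurce/4060-mike-newell/",
            ]
        },
    }
]
view_requests = sitemap_to_view_requests(sample_items)
print(view_requests)
```

The resulting list can be used as the "requests" value of a new actor input.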

Views (film + prehled)

Pages of the type https://www.csfd.cz/film/<movie-id>/, https://www.csfd.cz/film/<movie-id>/<movie-id2>/, https://www.csfd.cz/film/<movie-id>/prehled/ or https://www.csfd.cz/film/<movie-id>/<movie-id2>/prehled/.

Set request_type=View. Input example:

{
    "requests": [
        {
            "request_type": "View",
            "url": "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"
        }
    ],
    "user_agent": "ApifyFastScraper/1.0",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}

The following requests are all equivalent:

"requests": [
    {
        "request_type": "View",
        "url": "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"
    },
    {
        "request_type": "View",
        "url": "https://www.csfd.cz/film/17592/prehled/"
    }
]
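Since the text slug after the number is optional, two URLs can be compared by their numeric IDs alone. The Python sketch below assumes that the leading number of each path segment after /film/ is the ID, as the examples in this README suggest; the function name film_ids is hypothetical.

```python
import re
from urllib.parse import urlsplit


def film_ids(url: str) -> tuple:
    """Return the numeric IDs found in the path segments that follow /film/."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments or segments[0] != "film":
        raise ValueError(f"not a film URL: {url}")
    ids = []
    for segment in segments[1:]:
        # A segment is either "17592" or "17592-some-slug".
        match = re.match(r"(\d+)(?:-|$)", segment)
        if not match:
            break  # reached a non-ID segment such as "prehled" or "recenze"
        ids.append(int(match.group(1)))
    return tuple(ids)


long_form = "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"
short_form = "https://www.csfd.cz/film/17592/prehled/"
print(film_ids(long_form) == film_ids(short_form))
```

This can be used to deduplicate requests before a run, so that the same page is not scraped twice under two spellings of its URL.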

Output example:

{
  "View": {
    "header_name": "Čtyři svatby a jeden pohřeb",
    "header_name_langs": [
      {
        "country": "Velká Británie",
        "title": "Four Weddings and a Funeral(více)"
      },
      {
        "country": "USA",
        "title": "Four Weddings and a Funeral"
      },
      {
        "country": "Slovensko",
        "title": "Štyri svadby a jeden pohreb(méně)"
      }
    ],
    "rating": "72%",
    "rating_votes_count": 14484,
    "rating_fanklub_count": 45,
    "origin": "Velká Británie / USA, 1994, 117 min(Alternativní 113 min)",
    "plot_full": "Snímek vypráví příběh Charlese (Hugh Grant), vtipného a okouzlujícího muže, který ve svých dvaatřiceti letech stále střídá partnerky jako na běžícím pásu. Jeho životem prošla spousta žen, které zbožňoval, ale s žádnou z nich nedokázal navázat hlubší vztah. Rezervovaný Angličan vystavěl kolem vlastního nitra tak nepropustnou zeď, že nyní nedokáže projevit své city. A čím více svateb společně se svými kamarády navštíví, tím méně se sám hrne do ženění. Až do oné osudné soboty, kdy v jednom kostele spatří Carrie (Andie MacDowellová) – tu nejzajímavější, nejkrásnější, nejdůvtipnější a také nejnedostupnější Američanku, jakou kdy v životě potkal. Charles se ze všech sil snaží, aby ji příliš neuháněl a hlavně se do ní nezamiloval - během jednoho pohřbu a tří dalších svateb…(Cinemax)",
    "genres": [
      "Komedie",
      "Romantický",
      "Drama"
    ],
    "creators": [
      {
        "name": "režie",
        "people": [
          {
            "name": "Mike Newell",
            "url": "/tvurce/4060-mike-newell/"
          }
        ]
      },
      {
        "name": "scénář",
        "people": [
          {
            "name": "Richard Curtis",
            "url": "/tvurce/6726-richard-curtis/"
          }
        ]
      },
      {
        "name": "kamera",
        "people": [
          {
            "name": "Michael Coulter",
            "url": "/tvurce/75908-michael-coulter/"
          }
        ]
      },
      {
        "name": "hudba",
        "people": [
          {
            "name": "Richard Rodney Bennett",
            "url": "/tvurce/63995-richard-rodney-bennett/"
          }
        ]
      },
      {
        "name": "hrají",
        "people": [
          {
            "name": "Hugh Grant",
            "url": "/tvurce/332-hugh-grant/"
          },
          {
            "name": "Andie MacDowell",
            "url": "/tvurce/130-andie-macdowell/"
          },
          {
            "name": "James Fleet",
            "url": "/tvurce/17860-james-fleet/"
          },
          {
            "name": "Simon Callow",
            "url": "/tvurce/12966-simon-callow/"
          },
          {
            "name": "John Hannah",
            "url": "/tvurce/803-john-hannah/"
          },
          {
            "name": "Kristin Scott Thomas",
            "url": "/tvurce/164-kristin-scott-thomas/"
          },
          {
            "name": "Elspet Gray",
            "url": "/tvurce/35549-elspet-gray/"
          },
          {
            "name": "Rowan Atkinson",
            "url": "/tvurce/349-rowan-atkinson/"
          },
          {
            "name": "Corin Redgrave",
            "url": "/tvurce/16584-corin-redgrave/"
          },
          {
            "name": "Anna Chancellor",
            "url": "/tvurce/12166-anna-chancellor/"
          },
          {
            "name": "Hannah Taylor-Gordon",
            "url": "/tvurce/23562-hannah-taylor-gordon/"
          },
          {
            "name": "Bernice Stegers",
            "url": "/tvurce/11078-bernice-stegers/"
          },
          {
            "name": "Jeremy Kemp",
            "url": "/tvurce/53343-jeremy-kemp/"
          },
          {
            "name": "Sophie Thompson",
            "url": "/tvurce/55128-sophie-thompson/"
          },
          {
            "name": "Charlotte Coleman",
            "url": "/tvurce/76910-charlotte-coleman/"
          },
          {
            "name": "David Haig",
            "url": "/tvurce/78156-david-haig/"
          },
          {
            "name": "Nicola Walker",
            "url": "/tvurce/111089-nicola-walker/"
          },
          {
            "name": "Struan Rodger",
            "url": "/tvurce/115678-struan-rodger/"
          },
          {
            "name": "Simon Kunz",
            "url": "/tvurce/145261-simon-kunz/"
          },
          {
            "name": "Duncan Kenworthy",
            "url": "/tvurce/205006-duncan-kenworthy/"
          },
          {
            "name": "Rosalie Crutchley",
            "url": "/tvurce/214678-rosalie-crutchley/"
          },
          {
            "name": "Rupert Vansittart",
            "url": "/tvurce/219917-rupert-vansittart/"
          },
          {
            "name": "Kenneth Griffith",
            "url": "/tvurce/277724-kenneth-griffith/"
          },
          {
            "name": "Philip Voss",
            "url": "/tvurce/298970-philip-voss/"
          },
          {
            "name": "Randall Paul",
            "url": "/tvurce/308830-randall-paul/"
          },
          {
            "name": "Sara Crowe",
            "url": "/tvurce/157205-sara-crowe/"
          },
          {
            "name": "Richard Butler",
            "url": "/tvurce/348163-richard-butler/"
          },
          {
            "name": "Nigel Hastings",
            "url": "/tvurce/368537-nigel-hastings/"
          },
          {
            "name": "Juliette James",
            "url": "/tvurce/529994-juliette-james/"
          },
          {
            "name": "Amanda Mealing",
            "url": "/tvurce/875966-amanda-mealing/"
          }
        ]
      },
      {
        "name": "produkce",
        "people": [
          {
            "name": "Duncan Kenworthy",
            "url": "/tvurce/205006-duncan-kenworthy/"
          },
          {
            "name": "Eric Fellner",
            "url": "/tvurce/150112-eric-fellner/"
          }
        ]
      },
      {
        "name": "střih",
        "people": [
          {
            "name": "Jon Gregory",
            "url": "/tvurce/241299-jon-gregory/"
          }
        ]
      },
      {
        "name": "scénografie",
        "people": [
          {
            "name": "Anna Pinnock",
            "url": "/tvurce/787919-anna-pinnock/"
          }
        ]
      },
      {
        "name": "masky",
        "people": [
          {
            "name": "Ann Buchanan",
            "url": "/tvurce/630463-ann-buchanan/"
          }
        ]
      },
      {
        "name": "kostýmy",
        "people": [
          {
            "name": "Lindy Hemming",
            "url": "/tvurce/254644-lindy-hemming/"
          }
        ]
      }
    ],
    "vod_content": [
      {
        "name": "Apple TV+",
        "ga_name": "vod-service-apple-tv|film|vod",
        "url": "https://tv.apple.com/cz/movie/four-weddings-and-a-funeral/umc.cmc.50uemm7f92zctyyjp8z6x1upu"
      },
      {
        "name": "Google Play",
        "ga_name": "vod-service-google-play|film|vod",
        "url": "https://play.google.com/store/movies/details/Four_Weddings_And_A_Funeral?id=ZnSmxlAWj4s&hl=cs&gl=cz"
      }
    ]
  }
}
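The View output is fairly deep, so a small post-processing step can make it easier to work with. The Python sketch below assumes the structure shown in the example above (a "View" object with "header_name", "rating" and "creators"); the function name summarize_view is hypothetical, and a rating that is missing or not of the form "72%" is mapped to None.

```python
def summarize_view(item: dict) -> dict:
    """Flatten a View result into title, numeric rating and names per role."""
    view = item["View"]
    rating = view.get("rating") or ""
    try:
        rating_percent = int(rating.rstrip("%"))  # "72%" -> 72
    except ValueError:
        rating_percent = None  # assumption: unrated titles have no usable value
    creators = {
        group["name"]: [person["name"] for person in group["people"]]
        for group in view.get("creators", [])
    }
    return {
        "title": view["header_name"],
        "rating_percent": rating_percent,
        "creators": creators,
    }


# Abbreviated sample taken from the output example above.
sample_view = {
    "View": {
        "header_name": "Čtyři svatby a jeden pohřeb",
        "rating": "72%",
        "creators": [
            {"name": "režie", "people": [{"name": "Mike Newell", "url": "/tvurce/4060-mike-newell/"}]}
        ],
    }
}
summary = summarize_view(sample_view)
print(summary)
```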

View reviews (film + recenze)

Pages of the type https://www.csfd.cz/film/<movie-id>/recenze/?page=<N> and https://www.csfd.cz/film/<movie-id>/<movie-id2>/recenze/?page=<N>.

For request_type=ViewReviews, make sure that the url contains film and recenze. You don't have to put ?page=<N> at the end of the url, because it will be replaced with page=1. The scraper then retrieves the number of pages to scan and scrapes all of them one at a time. Input example:

{
    "requests": [
        {
            "request_type": "ViewReviews",
            "url": "https://www.csfd.cz/film/1490468-survivor-cesko-slovensko/1486957-serie-3/recenze/"
        }
    ],
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}
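The page handling described above (any existing page parameter is replaced and the scan starts from page 1) can be sketched in Python as follows. This only illustrates the URL rewriting, not the actor's internal code; the function name with_page is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int, param: str = "page") -> str:
    """Return url with the paging parameter set to the given page number."""
    parts = urlsplit(url)
    # Drop any existing value of the paging parameter, keep other parameters.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    query.append((param, str(page)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))


first_page = with_page("https://www.csfd.cz/film/2294-vykoupeni-z-veznice-shawshank/recenze/?page=7", 1)
print(first_page)
```

The same helper covers the ratings pages by passing param="pageRating".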

The results are split into parts of roughly 1 MB or less, so that each part can be uploaded to Apify. The id stays the same across parts, and the part key indicates the order of the results. Output example:

[
    {
        "id": "191f5d99-e101-455d-b833-9554e7b102e8",
        "request_type": "ViewReviews",
        "url": "https://www.csfd.cz/film/2294-vykoupeni-z-veznice-shawshank/recenze/",
        "data": {
            "part": 1,
            "content": [
                {
                    "user_name": "golfista",
                    "user_url": "/uzivatel/95-golfista/",
                    "star_rating": "5",
                    "comment": "\n Na velmi ošemetnou a těžko zodpověditelnou otázku \"který film je podle vás nejlepší\", mi dal do úst tímhle dílem Frank Darabont odpověď, za kterou se opravdu nebudu stydět. Pokud bych měl jenom jednu (možná dvě :) možnost, pak právě sem patří 6*. Bohužel jsem nestihl tenhle film v kině, ale vydáním na DVD jsem si ho konečně vychutnal i v originále a je to fakt nádhera (tím nechci hanět český dabing, který je mimochodem vynikající).\n",
                    "comment_html": "Na velmi ošemetnou a těžko zodpověditelnou otázku \"který film je podle vás nejlepší\", mi dal do úst tímhle dílem Frank Darabont odpověď, za kterou se opravdu nebudu stydět. Pokud bych měl jenom jednu (možná dvě :) možnost, pak právě sem patří 6*. Bohužel jsem nestihl tenhle film v kině, ale vydáním na DVD jsem si ho konečně vychutnal i v originále a je to fakt nádhera (tím nechci hanět český dabing, který je mimochodem vynikající).",
                    "date": "14.02.2003"
                },
                ...
            ]
        }
    }
]
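Because one request can produce several result items, it is usually necessary to stitch the parts back together. The Python sketch below assumes the output shape shown above (shared "id", numeric "data.part", list in "data.content"); the function name merge_parts and the sample items are hypothetical.

```python
from collections import defaultdict


def merge_parts(items: list) -> dict:
    """Group result items by id and concatenate their content in part order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["id"]].append(item)
    merged = {}
    for request_id, parts in grouped.items():
        parts.sort(key=lambda item: item["data"]["part"])  # part 1, 2, 3, ...
        merged[request_id] = [entry for part in parts for entry in part["data"]["content"]]
    return merged


# Hypothetical sample with the parts arriving out of order.
sample_parts = [
    {"id": "a", "data": {"part": 2, "content": [{"user_name": "second"}]}},
    {"id": "a", "data": {"part": 1, "content": [{"user_name": "first"}]}},
]
merged_reviews = merge_parts(sample_parts)
print(merged_reviews)
```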

View ratings (film + prehled)

Pages of the type https://www.csfd.cz/film/<movie-id>/, https://www.csfd.cz/film/<movie-id>/<movie-id2>/, https://www.csfd.cz/film/<movie-id>/prehled/ or https://www.csfd.cz/film/<movie-id>/<movie-id2>/prehled/.

For request_type=ViewRatings, make sure that the url contains film. You don't have to put ?pageRating=<N> at the end of the url, because it will be replaced with pageRating=1. The scraper then retrieves the number of pages to scan and scrapes all of them one at a time. Input example:

{
    "requests": [
        {
            "request_type": "ViewRatings",
            "url": "https://www.csfd.cz/film/425904-mizerove-na-zivot-a-na-smrt/prehled/"
        }
    ],
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}

The results are split into parts of roughly 1 MB or less, so that each part can be uploaded to Apify. The id stays the same across parts, and the part key indicates the order of the results. Output example:

[
    {
        "id": "7749baa2-b364-4920-afbb-88907fa2f194",
        "request_type": "ViewRatings",
        "url": "https://www.csfd.cz/film/425904-mizerove-na-zivot-a-na-smrt/prehled/",
        "data": {
            "part": 1,
            "content": [
                {
                    "user_name": "POMO",
                    "user_url": "/uzivatel/1-pomo/",
                    "date": "Vloženo v 05.06.2024",
                    "star_rating": "3"
                },
                {
                    "user_name": "kleopatra",
                    "user_url": "/uzivatel/1263-kleopatra/",
                    "date": "Vloženo v 07.06.2024",
                    "star_rating": "4"
                },
                ...
            ]
        }
    }
]
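The rating entries come back as strings, including a date with a Czech prefix ("Vloženo v" means "Added on"). The Python sketch below converts one entry into typed values. It assumes that the date is always the last whitespace-separated token in DD.MM.YYYY form and that star_rating is always a numeric string, as in the example above; the function name parse_rating is hypothetical.

```python
from datetime import date, datetime


def parse_rating(entry: dict) -> dict:
    """Convert one rating entry into an integer star count and a date object."""
    raw_date = entry["date"].split()[-1]  # "Vloženo v 05.06.2024" -> "05.06.2024"
    return {
        "user_name": entry["user_name"],
        "stars": int(entry["star_rating"]),
        "date": datetime.strptime(raw_date, "%d.%m.%Y").date(),
    }


parsed = parse_rating(
    {
        "user_name": "POMO",
        "user_url": "/uzivatel/1-pomo/",
        "date": "Vloženo v 05.06.2024",
        "star_rating": "3",
    }
)
print(parsed)
```

Because only the last token is parsed, the same helper also accepts the bare dates used in review entries, such as "14.02.2003".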

User reviews (uzivatel + recenze)

Pages of the type https://www.csfd.cz/uzivatel/<user-id>/recenze/?page=<N>.

For request_type=UserReviews, make sure that the url contains uzivatel and recenze. You don't have to put ?page=<N> at the end of the url, because it will be replaced with page=1. The scraper then retrieves the number of pages to scan and scrapes all of them one at a time. Input example:

{
    "requests": [
        {
            "request_type": "UserReviews",
            "url": "https://www.csfd.cz/uzivatel/195357-verbal/recenze/"
        }
    ],
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "force_cloud": false,
    "push_data_size": 500,
    "max_concurrency": 10,
    "max_request_retries": 3,
    "max_request_retry_timeout_ms": 10000,
    "request_retry_wait_ms": 5000
}

The results are split into parts of roughly 1 MB or less, so that each part can be uploaded to Apify. The id stays the same across parts, and the part key indicates the order of the results. Output example:

[
    {
        "id": "e5475be7-a70c-4749-bfff-60ad68bdc38e",
        "request_type": "UserReviews",
        "url": "https://www.csfd.cz/uzivatel/195357-verbal/recenze/",
        "data": {
            "part": 1,
            "content": [
                {
                    "movie_name": "Mizerové: Na život a na smrt",
                    "movie_url": "/film/425904-mizerove-na-zivot-a-na-smrt/",
                    "star_rating": "5",
                    "comment": "\n Jak repujeme my, sportovně založení bílí Dolní Slezané „ Bembajs, bembajs, jak pro tebe du, narobiš pyču!!! “… A oni zas přišli, co nadělám! Šup do kina! Navíc se belgičtí uzenáči od minula o dost zlepšili, scénáristé zavzpomínali na staroškolské fláky, a Špatňáci jsou tak na zase plné kule tím, čím bývali za časů Míši Záliva ve starých dobrých devadesátkách. Tedy pořád docela freš Wilík a ubohý starý trapák Lórenc proti bandě konečně charismatických a bezskrupulózních zlolidí ve vinně potěšující akčně kokotmediální taškařici a lá Smrtonosné smrti. Míša si v tom zase štěknul a docela nechápu, proč si svou vypiplanou značku rovnou nezmáknul sám. Patrně má nahrabáno tolik, že už jen rybaří v Zálivu. Ale i tak furt klasicka blažena oddychovka jak cyp.\n",
                    "comment_html": "Jak repujeme my, sportovně založení bílí Dolní Slezané „<em>Bembajs, bembajs, jak pro tebe du, narobiš pyču!!!</em>“… A oni zas přišli, co nadělám! Šup do kina! Navíc se belgičtí uzenáči od minula o dost zlepšili, scénáristé zavzpomínali na staroškolské fláky, a Špatňáci jsou tak na zase plné kule tím, čím bývali za časů Míši Záliva ve starých dobrých devadesátkách. Tedy pořád docela freš Wilík a ubohý starý trapák Lórenc proti bandě konečně charismatických a bezskrupulózních zlolidí ve vinně potěšující akčně kokotmediální taškařici a lá Smrtonosné smrti. Míša si v tom zase štěknul a docela nechápu, proč si svou vypiplanou značku rovnou nezmáknul sám. Patrně má nahrabáno tolik, že už jen rybaří v Zálivu. Ale i tak furt klasicka blažena oddychovka jak cyp.",
                    "date": "14.06.2024"
                },
                ...
            ]
        }
    }
]
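The movie_url values in the output are relative paths, so they need the site origin before they can be used as the url of a follow-up request. The Python sketch below resolves them with the standard library and builds View requests from a user's reviews; the function name reviews_to_view_requests is hypothetical, and the sample entry is abbreviated from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.csfd.cz/"


def reviews_to_view_requests(content: list) -> list:
    """Turn UserReviews entries into View requests with absolute URLs."""
    requests = []
    for entry in content:
        absolute_url = urljoin(BASE_URL, entry["movie_url"])
        requests.append({"request_type": "View", "url": absolute_url})
    return requests


follow_up = reviews_to_view_requests(
    [{"movie_name": "Mizerové: Na život a na smrt", "movie_url": "/film/425904-mizerove-na-zivot-a-na-smrt/"}]
)
print(follow_up)
```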

Your feedback

I am always working on improving the performance of my Actors. So if you’ve got any technical feedback for CSFD Scraper or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.

Developer
Maintained by Community

Actor Metrics

  • 1 monthly user

  • 0 No stars yet

  • 94% runs succeeded

  • 71 days response time

  • Created in Jun 2024

  • Modified 5 months ago