CSFD Scraper
30 minutes trial then $25.00/month - No credit card required now
CSFD Scraper is a blazingly fast web scraper powered by Rust on the backend. It allows you to scrape csfd.cz
What is CSFD Scraper?
CSFD Scraper is a blazingly fast web scraper powered by Rust on the backend. It scrapes static csfd.cz HTML pages extremely quickly, without rendering, while using only 128 MB of memory, so you can get the most out of your Apify credits. 🚀🚀🚀
For benchmarks, see https://apify.com/danielherman/fast-scraper.
Explanation of the input
The Actor has some global parameters, which are described in the Input tab, and a list of requests. Each request has the following structure:
    {
      "request_type": string, // required
      "url": string,          // optional
      "id": string,           // optional
      "headers": object,      // optional
      "user-agent": string    // optional
    }
Only request_type is required. With request_type=Sitemap the url is ignored; with any other request_type the url must be provided, otherwise the Actor will panic. The id key is optional and is copied to the results, so you can use it to track requests by something other than the url. The order of the scraped data in the results will most likely differ from the order of the requests. Both headers and user-agent are optional; you can also set the user agent directly in headers. Request-level headers and user-agent override the global headers and user-agent. Let's see an example:
    {
      "requests": [
        {
          "request_type": "View",
          "url": "https://www.csfd.cz/film/68990-star-trek-hluboky-vesmir-devet/494608-serie-6/prehled/",
          "headers": {
            "dnt": "0",
            "priority": "u=0, i",
            "referer": "https://www.csfd.cz/"
          }
        }
      ],
      "headers": {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-language": "en-US,en;q=0.9",
        "dnt": "1",
        "priority": "u=0, i",
        "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"macOS\"",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1"
      },
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
Here the request is sent with all the global headers and the global user_agent, with these differences: the "dnt" (Do Not Track) header is changed from 1 to 0, the "priority" header is set again (to the same value as the global one) and the "referer" header is added. Once you set global headers, you cannot delete them at the request level, only override them.
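The override rule can be pictured as a plain dictionary merge. The following Python sketch is only an illustration of the behaviour described here, not the Actor's Rust implementation; the function name `effective_headers` is hypothetical, and it assumes header names are compared exactly as written (lowercase in the example).

```python
def effective_headers(global_headers, request_headers=None):
    """Merge headers as described in the text: request-level values win,
    and global headers are never removed (assumption: exact-match keys)."""
    merged = dict(global_headers)          # start from the global headers
    merged.update(request_headers or {})   # request-level values override or add
    return merged


global_headers = {"accept-language": "en-US,en;q=0.9", "dnt": "1", "priority": "u=0, i"}
request_headers = {"dnt": "0", "priority": "u=0, i", "referer": "https://www.csfd.cz/"}

merged = effective_headers(global_headers, request_headers)
print(merged)
```

In this sketch "dnt" ends up as "0", "referer" is added, and "accept-language" is kept from the global headers.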
Supported page types
There are different request_types:
- Sitemap (request_type=Sitemap): scrapes all URLs in the exposed csfd.cz sitemap. This can take a while and is done with a single request specification.
- View (request_type=View): returns information from the overview pages (/prehled) of movies, serials, series and episodes.
- View reviews (request_type=ViewReviews): returns the comments for a specific movie, serial, series or episode.
- User (request_type=User): returns information about the user.
- User reviews (request_type=UserReviews): returns the reviews written by the user.
- Ratings (request_type=Ratings): returns all the user ratings for a specific movie, serial, series or episode. PLANNED
- Creator (request_type=Creator): returns information about the creator. PLANNED
- Program (request_type=Program): returns the parsed TV program from https://www.csfd.cz/televize/program/. PLANNED
Request types marked PLANNED are not available yet. The whole sitemap can already be scraped with this scraper.
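Because a request without a url makes the Actor panic for every type except Sitemap, it can be worth checking the input before starting a run. Below is a minimal Python sketch of such a client-side check. The helper `validate_request` and the `KNOWN_TYPES` set are hypothetical; the set is only assembled from the request types named in this document and is an assumption about what the Actor accepts.

```python
# Request types collected from the list and the input examples in this document
# (assumption: the Actor accepts exactly these spellings).
KNOWN_TYPES = {"Sitemap", "View", "ViewReviews", "ViewRatings", "User", "UserReviews"}


def validate_request(request):
    """Return a list of problems found in one request dict (empty list = looks fine)."""
    problems = []
    request_type = request.get("request_type")
    if request_type is None:
        problems.append("request_type is required")
    elif request_type not in KNOWN_TYPES:
        problems.append(f"unknown request_type: {request_type}")
    # Only Sitemap requests may omit the url.
    if request_type != "Sitemap" and not request.get("url"):
        problems.append("url is required unless request_type is Sitemap")
    return problems


print(validate_request({"request_type": "Sitemap"}))
print(validate_request({"request_type": "View"}))
```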
Sitemap
Make sure that the Actor timeout is long enough (e.g. 3600 s). The sitemap is not scraped in parallel.
Input example:
    {
      "requests": [
        {
          "request_type": "Sitemap"
        }
      ],
      "user_agent": "ApifyFastScraper/1.0",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
It fetches the whole published sitemap of csfd.cz, which also contains URLs under:
https://www.csfd.cz/film/
https://www.csfd.cz/tvurce/
https://www.csfd.cz/uzivatel/
https://www.csfd.cz/diskuze/
https://www.csfd.cz/akce/
https://www.csfd.cz/festival/
https://www.csfd.cz/kino/
https://www.csfd.cz/novinky/
https://www.csfd.cz/zanry/
Output example:
    [
      {
        "id": "06de9c9d-b17f-44aa-a9f3-e87a6769fffd",
        "request_type": "Sitemap",
        "url": "https://www.csfd.cz/sitemap.xml",
        "data": {
          "Sitemap": [
            "https://www.csfd.cz/film/16-zurov/231-zurov-2/prehled/",
            "https://www.csfd.cz/film/16-zurov/703683-zurov/prehled/",
            "https://www.csfd.cz/film/16-zurov/703684-teorema-lobacevskogo/prehled/"
          ]
        }
      }
    ]
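A Sitemap result can be turned into input for a follow-up run. The Python sketch below assumes the output shape shown in the example above and builds View requests for the /film/ URLs only. The function name `view_requests_from_sitemap` is hypothetical and the sample data is shortened from the example.

```python
def view_requests_from_sitemap(results, prefix="https://www.csfd.cz/film/"):
    """Build View requests from Sitemap results, keeping only URLs under `prefix`."""
    requests = []
    for item in results:
        if item.get("request_type") != "Sitemap":
            continue  # ignore results of other request types
        for url in item["data"]["Sitemap"]:
            if url.startswith(prefix):
                requests.append({"request_type": "View", "url": url})
    return requests


sample_results = [
    {
        "id": "06de9c9d-b17f-44aa-a9f3-e87a6769fffd",
        "request_type": "Sitemap",
        "url": "https://www.csfd.cz/sitemap.xml",
        "data": {
            "Sitemap": [
                "https://www.csfd.cz/film/16-zurov/231-zurov-2/prehled/",
                "https://www.csfd.cz/film/16-zurov/703683-zurov/prehled/",
            ]
        },
    }
]

view_requests = view_requests_from_sitemap(sample_results)
print(view_requests)
```

The resulting list can be used as the "requests" value of a new input.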
Views (film + prehled)
Pages of the type https://www.csfd.cz/film/<movie-id>/, https://www.csfd.cz/film/<movie-id>/<movie-id2>/, https://www.csfd.cz/film/<movie-id>/prehled/ or https://www.csfd.cz/film/<movie-id>/<movie-id2>/prehled/.
Set request_type=View. Here is an example of input and output.
Input example:
    {
      "requests": [
        {
          "request_type": "View",
          "url": "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"
        }
      ],
      "user_agent": "ApifyFastScraper/1.0",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
The following requests are all equivalent:
    "requests": [
      {
        "request_type": "View",
        "url": "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"
      },
      {
        "request_type": "View",
        "url": "https://www.csfd.cz/film/17592/prehled/"
      }
    ]
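Since the slug after the numeric id does not change which page is requested, equivalent requests can be detected by comparing the numeric ids only. The Python sketch below shows one way to do that on the client side. The helpers `film_ids` and `dedupe_requests` are hypothetical, and the sketch assumes every id segment starts with digits, as in the URLs in this document.

```python
import re
from urllib.parse import urlparse


def film_ids(url):
    """Return the numeric ids of a /film/ URL as a tuple, ignoring the slugs."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments or segments[0] != "film":
        return ()
    ids = []
    for segment in segments[1:]:
        match = re.match(r"\d+", segment)  # leading digits, e.g. "17592" in "17592-ctyri-..."
        if match:
            ids.append(match.group())
    return tuple(ids)


def dedupe_requests(requests):
    """Keep the first request of every group of equivalent requests."""
    seen, unique = set(), []
    for request in requests:
        url = request.get("url", "")
        key = (request["request_type"], film_ids(url) or url)
        if key not in seen:
            seen.add(key)
            unique.append(request)
    return unique


requests = [
    {"request_type": "View", "url": "https://www.csfd.cz/film/17592-ctyri-svatby-a-jeden-pohreb/prehled/"},
    {"request_type": "View", "url": "https://www.csfd.cz/film/17592/prehled/"},
]
unique_requests = dedupe_requests(requests)
print(unique_requests)
```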
Output example:
    {
      "View": {
        "header_name": "Čtyři svatby a jeden pohřeb",
        "header_name_langs": [
          { "country": "Velká Británie", "title": "Four Weddings and a Funeral(více)" },
          { "country": "USA", "title": "Four Weddings and a Funeral" },
          { "country": "Slovensko", "title": "Štyri svadby a jeden pohreb(méně)" }
        ],
        "rating": "72%",
        "rating_votes_count": 14484,
        "rating_fanklub_count": 45,
        "origin": "Velká Británie / USA, 1994, 117 min(Alternativní 113 min)",
        "plot_full": "Snímek vypráví příběh Charlese (Hugh Grant), vtipného a okouzlujícího muže, který ve svých dvaatřiceti letech stále střídá partnerky jako na běžícím pásu. Jeho životem prošla spousta žen, které zbožňoval, ale s žádnou z nich nedokázal navázat hlubší vztah. Rezervovaný Angličan vystavěl kolem vlastního nitra tak nepropustnou zeď, že nyní nedokáže projevit své city. A čím více svateb společně se svými kamarády navštíví, tím méně se sám hrne do ženění. Až do oné osudné soboty, kdy v jednom kostele spatří Carrie (Andie MacDowellová) – tu nejzajímavější, nejkrásnější, nejdůvtipnější a také nejnedostupnější Američanku, jakou kdy v životě potkal. \nCharles se ze všech sil snaží, aby ji příliš neuháněl a hlavně se do ní nezamiloval - během jednoho pohřbu a tří dalších svateb…(Cinemax)",
        "genres": ["Komedie", "Romantický", "Drama"],
        "creators": [
          {
            "name": "režie",
            "people": [
              { "name": "Mike Newell", "url": "/tvurce/4060-mike-newell/" }
            ]
          },
          {
            "name": "scénář",
            "people": [
              { "name": "Richard Curtis", "url": "/tvurce/6726-richard-curtis/" }
            ]
          },
          {
            "name": "kamera",
            "people": [
              { "name": "Michael Coulter", "url": "/tvurce/75908-michael-coulter/" }
            ]
          },
          {
            "name": "hudba",
            "people": [
              { "name": "Richard Rodney Bennett", "url": "/tvurce/63995-richard-rodney-bennett/" }
            ]
          },
          {
            "name": "hrají",
            "people": [
              { "name": "Hugh Grant", "url": "/tvurce/332-hugh-grant/" },
              { "name": "Andie MacDowell", "url": "/tvurce/130-andie-macdowell/" },
              { "name": "James Fleet", "url": "/tvurce/17860-james-fleet/" },
              { "name": "Simon Callow", "url": "/tvurce/12966-simon-callow/" },
              { "name": "John Hannah", "url": "/tvurce/803-john-hannah/" },
              { "name": "Kristin Scott Thomas", "url": "/tvurce/164-kristin-scott-thomas/" },
              { "name": "Elspet Gray", "url": "/tvurce/35549-elspet-gray/" },
              { "name": "Rowan Atkinson", "url": "/tvurce/349-rowan-atkinson/" },
              { "name": "Corin Redgrave", "url": "/tvurce/16584-corin-redgrave/" },
              { "name": "Anna Chancellor", "url": "/tvurce/12166-anna-chancellor/" },
              { "name": "Hannah Taylor-Gordon", "url": "/tvurce/23562-hannah-taylor-gordon/" },
              { "name": "Bernice Stegers", "url": "/tvurce/11078-bernice-stegers/" },
              { "name": "Jeremy Kemp", "url": "/tvurce/53343-jeremy-kemp/" },
              { "name": "Sophie Thompson", "url": "/tvurce/55128-sophie-thompson/" },
              { "name": "Charlotte Coleman", "url": "/tvurce/76910-charlotte-coleman/" },
              { "name": "David Haig", "url": "/tvurce/78156-david-haig/" },
              { "name": "Nicola Walker", "url": "/tvurce/111089-nicola-walker/" },
              { "name": "Struan Rodger", "url": "/tvurce/115678-struan-rodger/" },
              { "name": "Simon Kunz", "url": "/tvurce/145261-simon-kunz/" },
              { "name": "Duncan Kenworthy", "url": "/tvurce/205006-duncan-kenworthy/" },
              { "name": "Rosalie Crutchley", "url": "/tvurce/214678-rosalie-crutchley/" },
              { "name": "Rupert Vansittart", "url": "/tvurce/219917-rupert-vansittart/" },
              { "name": "Kenneth Griffith", "url": "/tvurce/277724-kenneth-griffith/" },
              { "name": "Philip Voss", "url": "/tvurce/298970-philip-voss/" },
              { "name": "Randall Paul", "url": "/tvurce/308830-randall-paul/" },
              { "name": "Sara Crowe", "url": "/tvurce/157205-sara-crowe/" },
              { "name": "Richard Butler", "url": "/tvurce/348163-richard-butler/" },
              { "name": "Nigel Hastings", "url": "/tvurce/368537-nigel-hastings/" },
              { "name": "Juliette James", "url": "/tvurce/529994-juliette-james/" },
              { "name": "Amanda Mealing", "url": "/tvurce/875966-amanda-mealing/" }
            ]
          },
          {
            "name": "produkce",
            "people": [
              { "name": "Duncan Kenworthy", "url": "/tvurce/205006-duncan-kenworthy/" },
              { "name": "Eric Fellner", "url": "/tvurce/150112-eric-fellner/" }
            ]
          },
          {
            "name": "střih",
            "people": [
              { "name": "Jon Gregory", "url": "/tvurce/241299-jon-gregory/" }
            ]
          },
          {
            "name": "scénografie",
            "people": [
              { "name": "Anna Pinnock", "url": "/tvurce/787919-anna-pinnock/" }
            ]
          },
          {
            "name": "masky",
            "people": [
              { "name": "Ann Buchanan", "url": "/tvurce/630463-ann-buchanan/" }
            ]
          },
          {
            "name": "kostýmy",
            "people": [
              { "name": "Lindy Hemming", "url": "/tvurce/254644-lindy-hemming/" }
            ]
          }
        ],
        "vod_content": [
          {
            "name": "Apple TV+",
            "ga_name": "vod-service-apple-tv|film|vod",
            "url": "https://tv.apple.com/cz/movie/four-weddings-and-a-funeral/umc.cmc.50uemm7f92zctyyjp8z6x1upu"
          },
          {
            "name": "Google Play",
            "ga_name": "vod-service-google-play|film|vod",
            "url": "https://play.google.com/store/movies/details/Four_Weddings_And_A_Funeral?id=ZnSmxlAWj4s&hl=cs&gl=cz"
          }
        ]
      }
    }
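In the View output the rating is a string such as "72%" and the creator URLs are relative to the site root. The Python sketch below shows one possible way to post-process a View record. The helper `normalize_view` is hypothetical, and the sketch assumes that a missing rating should become None and that relative URLs resolve against https://www.csfd.cz.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.csfd.cz"  # assumption: relative URLs resolve against the site root


def normalize_view(view):
    """Convert the rating to an integer and make the creator URLs absolute."""
    rating = view.get("rating")
    creators = {}
    for group in view.get("creators", []):
        creators[group["name"]] = [
            {"name": person["name"], "url": urljoin(BASE_URL, person["url"])}
            for person in group["people"]
        ]
    return {
        "title": view["header_name"],
        "rating_percent": int(rating.rstrip("%")) if rating else None,
        "creators": creators,
    }


sample_view = {
    "header_name": "Čtyři svatby a jeden pohřeb",
    "rating": "72%",
    "creators": [
        {"name": "režie", "people": [{"name": "Mike Newell", "url": "/tvurce/4060-mike-newell/"}]}
    ],
}
normalized = normalize_view(sample_view)
print(normalized["rating_percent"], normalized["creators"]["režie"][0]["url"])
```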
View reviews (film + recenze)
Pages of the type https://www.csfd.cz/film/<movie-id>/recenze/?page=<N> and https://www.csfd.cz/film/<movie-id>/<movie-id2>/recenze/?page=<N>.
For request_type=ViewReviews, make sure that the url contains film and recenze. You don't have to append ?page=<N> to the url: the scraper replaces it with page=1, retrieves the number of pages to scan and then scrapes all of them one at a time.
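The paging behaviour can be illustrated with a few lines of Python. This is only a sketch of the idea described above, not the Actor's Rust code: the helper `page_urls` is hypothetical, and the page count is passed in as an argument here, whereas the scraper reads it from the first page.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def page_urls(url, page_count, param="page"):
    """Return the URLs of pages 1..page_count, replacing any existing page parameter."""
    parts = urlparse(url)
    # Drop an existing page parameter so that numbering always starts at 1.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != param]
    urls = []
    for number in range(1, page_count + 1):
        new_query = urlencode(query + [(param, str(number))])
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls


urls = page_urls("https://www.csfd.cz/film/2294-vykoupeni-z-veznice-shawshank/recenze/?page=7", 2)
print(urls)
```

The same helper covers rating pages by passing param="pageRating".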
Input example:
    {
      "requests": [
        {
          "request_type": "ViewReviews",
          "url": "https://www.csfd.cz/film/1490468-survivor-cesko-slovensko/1486957-serie-3/recenze/"
        }
      ],
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
The results are split into parts of roughly 1 MB or less, to make sure that they can be uploaded to the Apify store. The id stays the same across the parts, and the part key indicates their order.
Output example:
    [
      {
        "id": "191f5d99-e101-455d-b833-9554e7b102e8",
        "request_type": "ViewReviews",
        "url": "https://www.csfd.cz/film/2294-vykoupeni-z-veznice-shawshank/recenze/",
        "data": {
          "part": 1,
          "content": [
            {
              "user_name": "golfista",
              "user_url": "/uzivatel/95-golfista/",
              "star_rating": "5",
              "comment": "\n Na velmi ošemetnou a těžko zodpověditelnou otázku \"který film je podle vás nejlepší\", mi dal do úst tímhle dílem Frank Darabont odpověď, za kterou se opravdu nebudu stydět. Pokud bych měl jenom jednu (možná dvě :) možnost, pak právě sem patří 6*. Bohužel jsem nestihl tenhle film v kině, ale vydáním na DVD jsem si ho konečně vychutnal i v originále a je to fakt nádhera (tím nechci hanět český dabing, který je mimochodem vynikající).\n",
              "comment_html": "Na velmi ošemetnou a těžko zodpověditelnou otázku \"který film je podle vás nejlepší\", mi dal do úst tímhle dílem Frank Darabont odpověď, za kterou se opravdu nebudu stydět. Pokud bych měl jenom jednu (možná dvě :) možnost, pak právě sem patří 6*. Bohužel jsem nestihl tenhle film v kině, ale vydáním na DVD jsem si ho konečně vychutnal i v originále a je to fakt nádhera (tím nechci hanět český dabing, který je mimochodem vynikající).",
              "date": "14.02.2003"
            },
            ...
          ]
        }
      }
    ]
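Because one request can produce several result items that share an id, the parts usually need to be put back together after download. The Python sketch below does that for results shaped like the example above. The helper name `merge_parts` is hypothetical and the sample items are shortened.

```python
from collections import defaultdict


def merge_parts(results):
    """Group result items by id and concatenate their content in part order."""
    grouped = defaultdict(list)
    for item in results:
        grouped[item["id"]].append(item)
    merged = {}
    for request_id, items in grouped.items():
        items.sort(key=lambda item: item["data"]["part"])  # parts may arrive in any order
        merged[request_id] = [entry for item in items for entry in item["data"]["content"]]
    return merged


sample_results = [
    {"id": "a", "data": {"part": 2, "content": [{"user_name": "kleopatra"}]}},
    {"id": "a", "data": {"part": 1, "content": [{"user_name": "golfista"}]}},
]
merged_reviews = merge_parts(sample_results)
print([entry["user_name"] for entry in merged_reviews["a"]])
```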
User (uzivatel)
View ratings (film + prehled)
Pages of the type https://www.csfd.cz/film/<movie-id>/, https://www.csfd.cz/film/<movie-id>/<movie-id2>/, https://www.csfd.cz/film/<movie-id>/prehled/ or https://www.csfd.cz/film/<movie-id>/<movie-id2>/prehled/.
For request_type=ViewRatings, make sure that the url contains film. You don't have to append ?pageRating=<N> to the url: the scraper replaces it with pageRating=1, retrieves the number of pages to scan and then scrapes all of them one at a time.
Input example:
    {
      "requests": [
        {
          "request_type": "ViewRatings",
          "url": "https://www.csfd.cz/film/425904-mizerove-na-zivot-a-na-smrt/prehled/"
        }
      ],
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
The results are split into parts of roughly 1 MB or less, to make sure that they can be uploaded to the Apify store. The id stays the same across the parts, and the part key indicates their order.
Output example:
    [
      {
        "id": "7749baa2-b364-4920-afbb-88907fa2f194",
        "request_type": "ViewRatings",
        "url": "https://www.csfd.cz/film/425904-mizerove-na-zivot-a-na-smrt/prehled/",
        "data": {
          "part": 1,
          "content": [
            {
              "user_name": "POMO",
              "user_url": "/uzivatel/1-pomo/",
              "date": "Vloženo v 05.06.2024",
              "star_rating": "3"
            },
            {
              "user_name": "kleopatra",
              "user_url": "/uzivatel/1263-kleopatra/",
              "date": "Vloženo v 07.06.2024",
              "star_rating": "4"
            },
            ...
          ]
        }
      }
    ]
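The rating entries carry the date as text with a Czech prefix ("Vloženo v", roughly "added on") and the stars as a string. The Python sketch below converts both into typed values and computes an average. The helper `parse_rating` is hypothetical, and the sketch assumes that the date is always the last whitespace-separated token in DD.MM.YYYY form and that star_rating is always a numeric string, as in the example.

```python
from datetime import datetime


def parse_rating(entry):
    """Turn one rating entry into typed values (date object, integer stars)."""
    date_text = entry["date"].split()[-1]  # "Vloženo v 05.06.2024" -> "05.06.2024"
    return {
        "user_name": entry["user_name"],
        "stars": int(entry["star_rating"]),
        "date": datetime.strptime(date_text, "%d.%m.%Y").date(),
    }


content = [
    {"user_name": "POMO", "user_url": "/uzivatel/1-pomo/", "date": "Vloženo v 05.06.2024", "star_rating": "3"},
    {"user_name": "kleopatra", "user_url": "/uzivatel/1263-kleopatra/", "date": "Vloženo v 07.06.2024", "star_rating": "4"},
]
ratings = [parse_rating(entry) for entry in content]
average_stars = sum(rating["stars"] for rating in ratings) / len(ratings)
print(ratings[0]["date"].isoformat(), average_stars)
```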
User reviews (uzivatel + recenze)
Pages of the type https://www.csfd.cz/uzivatel/<user-id>/recenze/?page=<N>.
For request_type=UserReviews, make sure that the url contains uzivatel and recenze. You don't have to append ?page=<N> to the url: the scraper replaces it with page=1, retrieves the number of pages to scan and then scrapes all of them one at a time.
Input example:
    {
      "requests": [
        {
          "request_type": "UserReviews",
          "url": "https://www.csfd.cz/uzivatel/195357-verbal/recenze/"
        }
      ],
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
      "force_cloud": false,
      "push_data_size": 500,
      "max_concurrency": 10,
      "max_request_retries": 3,
      "max_request_retry_timeout_ms": 10000,
      "request_retry_wait_ms": 5000
    }
The results are split into parts of roughly 1 MB or less, to make sure that they can be uploaded to the Apify store. The id stays the same across the parts, and the part key indicates their order.
Output example:
    [
      {
        "id": "e5475be7-a70c-4749-bfff-60ad68bdc38e",
        "request_type": "UserReviews",
        "url": "https://www.csfd.cz/uzivatel/195357-verbal/recenze/",
        "data": {
          "part": 1,
          "content": [
            {
              "movie_name": "Mizerové: Na život a na smrt",
              "movie_url": "/film/425904-mizerove-na-zivot-a-na-smrt/",
              "star_rating": "5",
              "comment": "\n Jak repujeme my, sportovně založení bílí Dolní Slezané „ Bembajs, bembajs, jak pro tebe du, narobiš pyču!!! “… A oni zas přišli, co nadělám! Šup do kina! Navíc se belgičtí uzenáči od minula o dost zlepšili, scénáristé zavzpomínali na staroškolské fláky, a Špatňáci jsou tak na zase plné kule tím, čím bývali za časů Míši Záliva ve starých dobrých devadesátkách. Tedy pořád docela freš Wilík a ubohý starý trapák Lórenc proti bandě konečně charismatických a bezskrupulózních zlolidí ve vinně potěšující akčně kokotmediální taškařici a lá Smrtonosné smrti. Míša si v tom zase štěknul a docela nechápu, proč si svou vypiplanou značku rovnou nezmáknul sám. Patrně má nahrabáno tolik, že už jen rybaří v Zálivu. Ale i tak furt klasicka blažena oddychovka jak cyp.\n",
              "comment_html": "Jak repujeme my, sportovně založení bílí Dolní Slezané „<em>Bembajs, bembajs, jak pro tebe du, narobiš pyču!!!</em>“… A oni zas přišli, co nadělám! Šup do kina! Navíc se belgičtí uzenáči od minula o dost zlepšili, scénáristé zavzpomínali na staroškolské fláky, a Špatňáci jsou tak na zase plné kule tím, čím bývali za časů Míši Záliva ve starých dobrých devadesátkách. Tedy pořád docela freš Wilík a ubohý starý trapák Lórenc proti bandě konečně charismatických a bezskrupulózních zlolidí ve vinně potěšující akčně kokotmediální taškařici a lá Smrtonosné smrti. Míša si v tom zase štěknul a docela nechápu, proč si svou vypiplanou značku rovnou nezmáknul sám. Patrně má nahrabáno tolik, že už jen rybaří v Zálivu. Ale i tak furt klasicka blažena oddychovka jak cyp.",
              "date": "14.06.2024"
            },
            ...
          ]
        }
      }
    ]
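The comment field keeps the surrounding whitespace of the page, and movie_url is relative. The Python sketch below shows one possible clean-up step for a user review entry. The helper `clean_review` is hypothetical, the sample comment is a shortened placeholder, and the sketch assumes dates in DD.MM.YYYY form and URLs relative to https://www.csfd.cz.

```python
from datetime import datetime
from urllib.parse import urljoin

BASE_URL = "https://www.csfd.cz"  # assumption: relative URLs resolve against the site root


def clean_review(entry):
    """Strip whitespace from the comment, make the URL absolute, use an ISO date."""
    return {
        "movie_name": entry["movie_name"],
        "movie_url": urljoin(BASE_URL, entry["movie_url"]),
        "stars": int(entry["star_rating"]),
        "comment": entry["comment"].strip(),
        "date": datetime.strptime(entry["date"], "%d.%m.%Y").date().isoformat(),
    }


sample_entry = {
    "movie_name": "Mizerové: Na život a na smrt",
    "movie_url": "/film/425904-mizerove-na-zivot-a-na-smrt/",
    "star_rating": "5",
    "comment": "\n Placeholder comment text.\n",  # shortened stand-in for the real comment
    "date": "14.06.2024",
}
cleaned = clean_review(sample_entry)
print(cleaned)
```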
Your feedback
I am always working on improving the performance of my Actors. So if you’ve got any technical feedback for CSFD Scraper or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.
Actor Metrics
Created in Jun 2024