Booking Scraper
Pay $5.00 for 1,000 results

voyager/booking-scraper
Scrape Booking.com with this hotel scraper and get data about accommodation listings. You can crawl by keywords or URLs for hotel prices, ratings, addresses, number of reviews, and stars. You can also download all that room and hotel data from Booking.com in a few clicks as CSV, JSON, HTML, or Excel.


Invalid address for some hotels

Closed

2tunnels opened this issue
2 months ago

Just to be clear, Booking.com has an odd way of displaying addresses. For example: https://www.booking.com/hotel/jp/shangri-la-tokyo.html?selected_currency=EUR&lang=en-us&group_adults=2&group_children=0&no_rooms=1

Address: 100-8283 Tokyo-to, Chiyoda-ku, Marunouchi Trust Tower Main, 1-8-3 Marunouchi,, Japan

One part of it is empty; that part is usually a city or district. Could you also scrape the address hierarchy from the breadcrumbs: Japan > Tokyo-to > Tokyo > Chiyoda? The URL structure is actually very helpful for telling what each link is: country, region, city, district, etc.
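To make the empty part concrete, here is a minimal sketch that splits the address above on commas and reports which positions are blank. The helper names `split_address` and `empty_positions` are made up for illustration and are not part of the Actor.

```python
def split_address(address: str) -> list:
    """Split a comma-separated address, keeping empty parts so gaps stay visible."""
    return [part.strip() for part in address.split(",")]


def empty_positions(parts: list) -> list:
    """Return the indexes of address parts that are empty strings."""
    return [index for index, part in enumerate(parts) if not part]


# Address string copied from the Booking.com page quoted above.
address = (
    "100-8283 Tokyo-to, Chiyoda-ku, Marunouchi Trust Tower Main, "
    "1-8-3 Marunouchi,, Japan"
)
parts = split_address(address)
missing = empty_positions(parts)
print(parts)
print(missing)
```

For this address the blank sits between the street part and the country, which is where a city or district would normally appear.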

Returning a list (or, even better, a dictionary) would be super helpful for saving the location hierarchy.

Thank you!

lhotanok

Hello, thanks for this suggestion! We can add breadcrumbs for sure 👌 Getting the names such as Chiyoda or Tokyo will be straightforward, but adding breadcrumb URLs will be a bit challenging. We'll try our best to add those too 🙂

2tunnels

2 months ago

Thanks for the quick response! I was considering an additional dictionary like this:

{
  "country": "Japan",
  "region": "Tokyo-to",
  ...
}

However, having the entire breadcrumb trail would be even better:

[
  {
    "url": "https://www.booking.com/country/jp.html...",
    "title": "Japan"
  },
  ...
]

With the breadcrumb structure, I can deduce the hierarchy based on the URL structure or other heuristics.
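As a rough illustration of that URL heuristic, the sketch below reads the first path segment of each breadcrumb URL (for example `country` or `district`) and uses it as the hierarchy level. It assumes the URL shape seen in this thread, `/<level>/<country-code>/...`, which may not hold for every Booking.com link; `location_level` and `build_hierarchy` are hypothetical helper names.

```python
from typing import Optional
from urllib.parse import urlparse


def location_level(url: str) -> Optional[str]:
    """Guess the hierarchy level from the first path segment of a breadcrumb URL.

    Assumes paths shaped like /district/jp/tokyo/chiyoda.html; returns None
    when the path has a single segment (e.g. a search results page).
    """
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return segments[0] if len(segments) >= 2 else None


def build_hierarchy(breadcrumbs: list) -> dict:
    """Map each recognised level to the breadcrumb title."""
    hierarchy = {}
    for crumb in breadcrumbs:
        level = location_level(crumb["url"])
        if level is not None:
            hierarchy[level] = crumb["title"]
    return hierarchy


# Only URLs that appear in this thread are used as sample data.
sample = [
    {"url": "https://www.booking.com/country/jp.html", "title": "Japan"},
    {"url": "https://www.booking.com/district/jp/tokyo/chiyoda.html", "title": "Chiyoda"},
]
print(build_hierarchy(sample))
```

A plain list of titles cannot support this kind of lookup, which is why the URLs matter.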

A plain list of titles wouldn't be very useful:

["Japan", "Tokyo-to", "Tokyo"]

It's difficult to determine the exact hierarchy from just the titles, especially since different countries have varying location structures.

Including raw breadcrumbs would be a fantastic addition. It offers flexibility, allowing users to decide how to utilize that information best.

lhotanok

> However, having the entire breadcrumb trail would be even better:

Yeah, I was thinking of adding breadcrumbs in basically the same format:

{
  "breadcrumbs": [
    {
      "name": "Chiyoda",
      "fullName": "Hotels in Chiyoda",
      "link": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
    }
  ]
}

Personally, I think it's a little easier to work with compared to the following dictionary format:

{
  "breadcrumbs": {
    "Chiyoda": {
      "name": "Hotels in Chiyoda",
      "link": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
    }
  }
}
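If the dictionary form is ever needed, it can be derived from the list form on the consumer side, so nothing is lost by shipping the list. The sketch below shows one way to do that; `index_by_name` is a hypothetical helper, and note that the conversion would silently drop entries if two breadcrumbs shared the same name.

```python
def index_by_name(breadcrumbs: list) -> dict:
    """Convert the list format into the name-keyed dictionary format."""
    return {
        crumb["name"]: {"name": crumb["fullName"], "link": crumb["link"]}
        for crumb in breadcrumbs
    }


# Breadcrumb copied from the list-format example above.
breadcrumbs = [
    {
        "name": "Chiyoda",
        "fullName": "Hotels in Chiyoda",
        "link": "https://www.booking.com/district/jp/tokyo/chiyoda.html",
    }
]
by_name = index_by_name(breadcrumbs)
print(by_name["Chiyoda"]["link"])
```

The list also keeps the breadcrumb order explicit, which a consumer would otherwise have to infer.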

> A plain list of titles wouldn't be very useful

I suppose we'll manage to extract the links as well. We just need to build them from parameters such as dest_type (district), the search string (chiyoda), and the country code (jp). That's because full URLs such as https://www.booking.com/district/jp/tokyo/chiyoda.html are not available directly in the HTML data our Actor works with.
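The Actor's actual link-building logic is not shown in this thread, so the following is only a guess at the general idea: join dest_type, the country code, and one or more slugs into a path. `build_alt_link` and `BASE_URL` are hypothetical names. Note that the example URL also contains a parent city slug (`tokyo`), so the three parameters alone would not be enough in this case, and lowercasing a name is a naive stand-in for real slug rules.

```python
BASE_URL = "https://www.booking.com"


def build_alt_link(dest_type: str, country_code: str, *slugs: str) -> str:
    """Assemble a structured location URL from its parts (naive sketch).

    Assumes slugs are simply lowercased names, which will not hold for
    multi-word or non-ASCII place names.
    """
    parts = [dest_type, country_code.lower()]
    parts.extend(slug.lower() for slug in slugs)
    return f"{BASE_URL}/{'/'.join(parts)}.html"


alt_link = build_alt_link("district", "jp", "tokyo", "Chiyoda")
print(alt_link)
```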

Anyway, the issue is in our backlog and we'll let you know once this new feature is published!

lhotanok

Hello, we have just published a new version of the Actor with breadcrumb extraction 🙂

There are basically two modes, depending on whether you run the Actor with checkIn and checkOut info or without it.

Run with checkIn + checkOut

The Actor collects two types of links: primary (link) and alternative (altLink). The primary link includes the searchresults substring and also contains search parameters such as the checkIn and checkOut specified in the input. The alternative link is built by the Actor from the parameters of the primary link and is a bit experimental (many edge cases are being handled). It is something extra that is not available on Booking when you browse hotels with checkIn and checkOut specified (you can test this in your web browser).

Example run: https://console.apify.com/view/runs/tBue6u9UMVLuhVlQB

Example breadcrumb:

{
  "name": "Chiyoda",
  "fullName": "Hotels in Chiyoda",
  "link": "https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1FCAsodUIQc2hhbmdyaS1sYS10b2t5b0gJWARotAKIAQGYAQm4AQfIAQzYAQHoAQH4AQOIAgGoAgO4Asvg7bYGwAIB0gIkZTUyZTZkZGEtZjIwMC00NzI4LTgxNjAtMmI0MGViZmMxNTIz2AIF4AIB&sid=9ce2dab9d2dcc077d60360b65be7fcf2&checkin=2024-09-19&checkout=2024-09-20&dest_id=308&dest_type=district&ss=Chiyoda&",
  "altLink": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
}
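For anyone who wants the search parameters out of a primary link like the one above, a small sketch with Python's standard `urllib.parse` might look like this. `search_params` is a hypothetical helper, and the sample link is the one above with the long label and sid values left out for brevity.

```python
from urllib.parse import parse_qs, urlparse

DEFAULT_KEYS = ("dest_type", "dest_id", "ss", "checkin", "checkout")


def search_params(link: str, keys: tuple = DEFAULT_KEYS) -> dict:
    """Pick selected query parameters out of a searchresults link."""
    query = parse_qs(urlparse(link).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: query[key][0] for key in keys if key in query}


# Primary link from the example above, minus the label and sid parameters.
link = (
    "https://www.booking.com/searchresults.en-gb.html"
    "?checkin=2024-09-19&checkout=2024-09-20"
    "&dest_id=308&dest_type=district&ss=Chiyoda&"
)
params = search_params(link)
print(params)
```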

Run without checkIn + checkOut

The Actor collects only a single primary link (link), and the alternative link is always null. This is because Booking doesn't return the link with the searchresults substring in this case and provides the more structured link directly. As a result, the extracted link is more reliable than the altLink from the previous example: it is constructed by Booking, not by our Actor.

Example run: https://console.apify.com/view/runs/2ZN4EKwPs6hoCsYhP

Example breadcrumb:

{
  "name": "Chiyoda",
  "fullName": "Hotels in Chiyoda",
  "link": "https://www.booking.com/district/jp/tokyo/chiyoda.en-gb.html?label=gen173nr-1FCAsodUIQc2hhbmdyaS1sYS10b2t5b0gJWARosgKIAQGYAQm4ARfIAQzYAQHoAQH4AQOIAgGoAgO4AsPf7bYGwAIB0gIkMDhmZTYxZDktOTkwMC00OTgyLTg5OWUtZThjNmIzODE1MWU42AIF4AIB&sid=35653b42a7ee5220836aca11d3e33fb9&breadcrumb=hotel&",
  "altLink": null
}
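Since the structured URL lives in altLink in one mode and in link in the other, a consumer may want one accessor that covers both. The sketch below prefers altLink when it is set and otherwise falls back to link with its query string dropped. `structured_link` is a hypothetical helper, and whether Booking serves the same page without the query parameters is an untested assumption.

```python
from typing import Optional
from urllib.parse import urlsplit


def structured_link(breadcrumb: dict) -> Optional[str]:
    """Return the structured location URL for a breadcrumb in either mode."""
    alt_link = breadcrumb.get("altLink")
    if alt_link:
        return alt_link
    link = breadcrumb.get("link")
    if not link:
        return None
    # Drop the query string (label, sid, ...) and any fragment.
    return urlsplit(link)._replace(query="", fragment="").geturl()


# Shortened versions of the two example breadcrumbs from this thread.
with_dates = {
    "link": "https://www.booking.com/searchresults.en-gb.html?dest_type=district&ss=Chiyoda&",
    "altLink": "https://www.booking.com/district/jp/tokyo/chiyoda.html",
}
without_dates = {
    "link": "https://www.booking.com/district/jp/tokyo/chiyoda.en-gb.html?breadcrumb=hotel&",
    "altLink": None,
}
print(structured_link(with_dates))
print(structured_link(without_dates))
```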
Developer
Maintained by Apify
Actor metrics
  • 134 monthly users
  • 27 stars
  • 99.6% runs succeeded
  • 1.8 days response time
  • Created in Aug 2023
  • Modified 6 days ago