Booking Scraper

Pay $5.00 for 1,000 results

voyager/booking-scraper

Scrape Booking with this hotels scraper and get data about accommodation on Booking.com. You can crawl by keywords or URLs for hotel prices, ratings, addresses, number of reviews, stars. You can also download all that room and hotel data from Booking.com with a few clicks: CSV, JSON, HTML, and Excel

Invalid address for some hotels

Closed

2tunnels opened this issue
6 months ago

Just to be clear, Booking.com does have an odd way of showing addresses. For example: https://www.booking.com/hotel/jp/shangri-la-tokyo.html?selected_currency=EUR&lang=en-us&group_adults=2&group_children=0&no_rooms=1

Address: 100-8283 Tokyo-to, Chiyoda-ku, Marunouchi Trust Tower Main, 1-8-3 Marunouchi,, Japan

The address has an empty part (note the double comma), which is usually a city or district. Can you also scrape the address hierarchy from the breadcrumbs: Japan > Tokyo-to > Tokyo > Chiyoda? The URL structure is actually very helpful for working out what each link is: country, region, city, district, etc.

Returning a list (or, even better, a dictionary) would be super helpful for saving the location hierarchy.

Thank you!

lhotanok

Hello, thanks for this suggestion! We can add breadcrumbs for sure 👌 Getting the names such as Chiyoda or Tokyo will be straightforward, but adding breadcrumb URLs will be a bit challenging. We'll try our best to add those too 🙂

2tunnels

6 months ago

Thanks for the quick response! I was considering an additional dictionary like this:

{
  "country": "Japan",
  "region": "Tokyo-to",
  ...
}

However, having the entire breadcrumb trail would be even better:

[
  {
    "url": "https://www.booking.com/country/jp.html...",
    "title": "Japan"
  },
  ...
]

With the breadcrumb structure, I can deduce the hierarchy based on the URL structure or other heuristics.
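
A minimal sketch of that idea, assuming the first path segment of each breadcrumb URL names its level. Only "country" and "district" appear in URLs in this thread; "region" and "city" are guesses, and the helper names `breadcrumb_level` and `to_hierarchy` are hypothetical. The input uses the "url"/"title" shape proposed above, not a confirmed output format of the Actor.

```python
from urllib.parse import urlparse

# Assumption: the first path segment encodes the location level.
# "country" and "district" are seen in this thread; the others are guesses.
KNOWN_LEVELS = {"country", "region", "city", "district"}


def breadcrumb_level(url):
    """Return the level named by the URL's first path segment, or "unknown"."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[0] in KNOWN_LEVELS:
        return segments[0]
    return "unknown"


def to_hierarchy(breadcrumbs):
    """Map each recognised level to the title of its breadcrumb."""
    hierarchy = {}
    for crumb in breadcrumbs:
        level = breadcrumb_level(crumb["url"])
        if level != "unknown":
            hierarchy[level] = crumb["title"]
    return hierarchy


crumbs = [
    {"url": "https://www.booking.com/country/jp.html", "title": "Japan"},
    {"url": "https://www.booking.com/district/jp/tokyo/chiyoda.html", "title": "Chiyoda"},
]
print(to_hierarchy(crumbs))
```

Breadcrumbs whose URL does not match a known prefix are skipped rather than guessed, so the raw list stays the source of truth.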

A plain list of titles wouldn't be very useful:

["Japan", "Tokyo-to", "Tokyo"]

It's difficult to determine the exact hierarchy from just the titles, especially since different countries have varying location structures.

Including raw breadcrumbs would be a fantastic addition. It offers flexibility, allowing users to decide how to utilize that information best.

lhotanok

> However, having the entire breadcrumb trail would be even better:

Yeah I was thinking of adding breadcrumbs basically in the same format:

{
  "breadcrumbs": [
    {
      "name": "Chiyoda",
      "fullName": "Hotels in Chiyoda",
      "link": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
    }
  ]
}

Personally, I think it's a little easier to work with compared to the following dictionary format:

{
  "breadcrumbs": {
    "Chiyoda": {
      "name": "Hotels in Chiyoda",
      "link": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
    }
  }
}

> A plain list of titles wouldn't be very useful

I suppose we'll manage to extract the links as well. We just need to build them from parameters such as dest_type (district), the search string (chiyoda) and the country code (jp), because the full URLs such as https://www.booking.com/district/jp/tokyo/chiyoda.html are not available directly in the HTML data our Actor works with.
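
As a rough illustration only, link building of this kind might look like the sketch below. It is not the Actor's implementation: `build_alt_link` and `slugify` are hypothetical names, the slug rule is an assumption, and the thread does not say where the parent city segment (tokyo) comes from, so the sketch simply takes all place names as an input list.

```python
import re

BASE_URL = "https://www.booking.com"


def slugify(text):
    """Lowercase and hyphenate a place name (assumed slug rule)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_alt_link(dest_type, country_code, places):
    """Join dest_type, country code and place slugs into a landing-page URL.

    `places` is ordered from broadest to narrowest, e.g. ["Tokyo", "Chiyoda"].
    """
    parts = [dest_type, country_code.lower()] + [slugify(p) for p in places]
    return f"{BASE_URL}/{'/'.join(parts)}.html"


print(build_alt_link("district", "jp", ["Tokyo", "Chiyoda"]))
# https://www.booking.com/district/jp/tokyo/chiyoda.html
```

Real place names with diacritics or unusual slugs would need extra handling, which is presumably part of why the feature is described as challenging.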

Anyway, the issue is ready in our backlog and we'll let you know once this new feature gets published!

lhotanok

Hello, we have just published the new version of the Actor with breadcrumbs extraction 🙂

There are basically two modes, depending on whether you run the Actor with checkIn and checkOut info or without it.

Run with checkIn + checkOut

The Actor collects two types of links: primary (link) and alternative (altLink). The primary link includes the searchresults substring and carries search parameters such as the checkIn and checkOut specified in the input. The alternative link is built by the Actor from the primary link's parameters and is a bit experimental (many edge cases are being handled). It is an extra that is not available on Booking when you browse hotels with checkIn and checkOut specified (you can test this in your web browser).

Example run: https://console.apify.com/view/runs/tBue6u9UMVLuhVlQB

Example breadcrumb:

{
  "name": "Chiyoda",
  "fullName": "Hotels in Chiyoda",
  "link": "https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1FCAsodUIQc2hhbmdyaS1sYS10b2t5b0gJWARotAKIAQGYAQm4AQfIAQzYAQHoAQH4AQOIAgGoAgO4Asvg7bYGwAIB0gIkZTUyZTZkZGEtZjIwMC00NzI4LTgxNjAtMmI0MGViZmMxNTIz2AIF4AIB&sid=9ce2dab9d2dcc077d60360b65be7fcf2&checkin=2024-09-19&checkout=2024-09-20&dest_id=308&dest_type=district&ss=Chiyoda&",
  "altLink": "https://www.booking.com/district/jp/tokyo/chiyoda.html"
}
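
For anyone consuming the primary link, the search parameters can be read back with the standard library. This is a small sketch, not part of the Actor; `search_params` is a hypothetical helper, and the URL is shortened here by dropping the label and sid parameters from the example.

```python
from urllib.parse import parse_qs, urlparse

# Shortened copy of the example primary link (label and sid omitted).
link = (
    "https://www.booking.com/searchresults.en-gb.html"
    "?checkin=2024-09-19&checkout=2024-09-20"
    "&dest_id=308&dest_type=district&ss=Chiyoda&"
)


def search_params(url):
    """Return the query parameters as a flat dict (first value per key)."""
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items()}


params = search_params(link)
print(params["dest_type"], params["ss"], params["checkin"], params["checkout"])
# district Chiyoda 2024-09-19 2024-09-20
```

The trailing "&" in the link is harmless because `parse_qs` skips empty pairs by default.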

Run without checkIn + checkOut

The Actor only collects a single primary link (link) and t... [trimmed]

Developer
Maintained by Apify
