Dun & Bradstreet Scraper

epctex/dnb-scraper

Effortlessly extract valuable company information, financial projections, industry insights, and more from the extensive Dun & Bradstreet commercial database. Dive deep into the D&B Data Cloud, Business Directory, articles, companies, and industries with customized search terms.

Actor doesn't return all requested companies and also returns companies which were not requested

Closed

scrape_crawl opened this issue
2 years ago

scrape_crawl

2 years ago

How do I tell the scraper to only crawl the submitted initial urls and not follow other urls found in the source?

tugkan

2 years ago

Hey Martin,

Thank you very much for reaching out. If you only want to retrieve the first page of the listing URLs, you can set the endPage property to 1. This applies to all sub-pages and limits the crawl to the initial URLs. As for the problem above, it would be better if we could investigate how the actor behaved during your run. If possible, could you please share your Run ID? You can find it under "Share" at the top right of the screen while you are on the Run Detail page.
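A minimal sketch of what such an input might look like, built in Python. Only endPage is named in this reply; the `startUrls` field name and its `{"url": ...}` item shape are assumptions here and should be checked against the actor's input schema.

```python
import json

def build_input(urls, end_page=1):
    """Build a hypothetical actor input that stops after the first page."""
    return {
        # Assumed field name and item shape; not confirmed in this thread.
        "startUrls": [{"url": u} for u in urls],
        # Property named in the reply above: 1 limits listings to their first page.
        "endPage": end_page,
    }

actor_input = build_input([
    "https://www.dnb.com/business-directory/company-profiles."
    "mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html",
])
print(json.dumps(actor_input, indent=2))
```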

Best

scrape_crawl

2 years ago

Thanks for explaining. The endPage property isn't documented at https://apify.com/epctex/dnb-scraper. Are there any other undocumented properties? As for your question, this is the Run ID of a similar run where not all requested pages were in the response, and the response also included pages that were not requested: 49fFAwGLzfVCfTHKt

tugkan

2 years ago

Hey there,

Thank you very much for sharing the Run ID and pointing out the missing property. We will update the documentation immediately. About the Run, we investigated what might have gone wrong there.

In such cases, the actor enqueues all the URLs and starts scraping. Once it reaches 50 rows of output, it won't fetch the remaining start URLs. In your case, the actor filled up this limit of 50 with the companies found via the listing URLs. I am not sure that is what you intended. What exactly would you like to retrieve from the actor? Maybe we can put together an Input that achieves your goal.
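The effect described above can be illustrated with a toy model. This is not the actor's code: the function name, the `expand` callback, and the processing order are all hypothetical, and the sketch only shows how a row cap can be used up by listing results before the remaining start URLs are reached.

```python
from collections import deque

def run_with_cap(start_urls, expand, max_items):
    """Toy model: process queued URLs until max_items rows are written."""
    queue = deque(start_urls)
    rows, skipped = [], []
    while queue:
        url = queue.popleft()
        if len(rows) >= max_items:
            # Cap already reached: this start URL is never fetched.
            skipped.append(url)
            continue
        found = expand(url)
        if found:
            # Listing page: the companies it links to count toward the cap.
            for company in found:
                if len(rows) < max_items:
                    rows.append(company)
        else:
            # Profile page: yields a single row for itself.
            rows.append(url)
    return rows, skipped

# Hypothetical data: one listing page followed by two requested profiles.
listing_contents = {"listing": ["found-1", "found-2", "found-3"]}
rows, skipped = run_with_cap(
    ["listing", "profile-a", "profile-b"],
    lambda url: listing_contents.get(url, []),
    max_items=3,
)
print(rows)     # companies that were never requested
print(skipped)  # requested profiles that were never fetched
```

In this model the output contains only companies discovered through the listing page, while both requested profiles are skipped, which matches the symptom reported in this issue.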

Best

scrape_crawl-owner

2 years ago

Hi Tuğkan Cengiz, thanks for responding. In this case we only want to receive the profiles provided in the start URLs. As we were requesting 50 profile URLs, we set the maxItems property to 50; any lower number would not follow all start URLs. The crawler should (in this case) not follow further links found on the profile pages but only scrape the provided profiles.

tugkan_epctex

2 years ago

Hey Martin,

Thank you very much for explaining.

As mentioned, the main problem is the links included in the Start URLs. If you provide company profiles only, the actor will return exactly that many rows and will not touch any internal links. So in this case, you don't even need to use Max Items.

However, when a Start URL begins with https://www.dnb.com/business-directory/company-information (for example, https://www.dnb.com/business-directory/company-information.household_appliances_and_electrical_and_electronic_goods_merchant_wholesalers.in.tamil_nadu.kanchipuram.html), the actor treats it as a listing page and scrapes through it. Therefore, if a listing page is included, the actor keeps going through it until it hits the Max Items parameter. That is the core logic of the actor and, unfortunately, it cannot be changed.

My suggestion for achieving your goal is to provide only links that start with https://www.dnb.com/business-directory/company-profiles (for example, https://www.dnb.com/business-directory/company-profiles.mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html). These links are treated as DnB company profiles, as they should be. A couple of the links in your Input are listing pages that contain many companies. That is what is getting in the way of your goal.
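One way to apply this before starting a run is to split the start URLs by prefix and pass only the profile links. The sketch below uses the two URL prefixes quoted in this reply; the function and variable names are illustrative, and the prefix check is a simplification of however the actor classifies URLs internally.

```python
PROFILE_PREFIX = "https://www.dnb.com/business-directory/company-profiles"
LISTING_PREFIX = "https://www.dnb.com/business-directory/company-information"

def split_start_urls(urls):
    """Separate company profile links from listing links by URL prefix."""
    profiles = [u for u in urls if u.startswith(PROFILE_PREFIX)]
    listings = [u for u in urls if u.startswith(LISTING_PREFIX)]
    return profiles, listings

candidate_urls = [
    "https://www.dnb.com/business-directory/company-profiles."
    "mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html",
    "https://www.dnb.com/business-directory/company-information."
    "household_appliances_and_electrical_and_electronic_goods_merchant_wholesalers"
    ".in.tamil_nadu.kanchipuram.html",
]

profiles, listings = split_start_urls(candidate_urls)
print(f"{len(profiles)} profile URL(s) to submit")
print(f"{len(listings)} listing URL(s) to leave out")
```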

Please let me know if I can help you with anything else.
