Dun & Bradstreet Scraper
3 days trial then $25.00/month - No credit card required now
Effortlessly extract valuable company information, financial projections, industry insights, and more from the extensive Dun & Bradstreet commercial database. Dive deep into the D&B Data Cloud, Business Directory, articles, companies, and industries with customized search terms.
These companies were requested:
https://www.dnb.com/business-directory/company-profiles.mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html https://www.dnb.com/business-directory/company-profiles.bayerische_motoren_werke_aktiengesellschaft.b2f60216a8986dd0132b03ae1d926227.html https://www.dnb.com/business-directory/company-profiles.abb_india_limited.5111bc20d5f711deed549f6f3055025a.html https://www.dnb.com/business-directory/company-profiles.apple_india_private_limited.4e84ec79d60a2d86b2e8bf8a954d3430.html https://www.dnb.com/business-directory/company-profiles.skoda_auto_volkswagen_india_private_limited.3860c3970e9396c6ee0015e531a1b641.html https://www.dnb.com/business-directory/company-profiles.fossil_india_private_limited.f8b75c7f8c10d0c21d4c2a51f79348a3.html https://www.dnb.com/business-directory/company-profiles.shahi_exports_private_limited.cdff6adae29ea68d4320f2a5775afe4f.html https://www.dnb.com/business-directory/company-profiles.Chiquita_Brands_International_S%C3%A0rl.68c45c0b4a91d33d59f615effa8a46cf.html https://www.dnb.com/business-directory/company-profiles.nokia_solutions_and_networks_india_private_limited.4d2ee87fde4fa249803ab560525bfd09.html https://www.dnb.com/business-directory/company-profiles.dixon_technologies_(india)_limited.0133777077b385c6bcb3d06e416cafc0.html https://www.dnb.com/business-directory/company-profiles.wipro_ge_healthcare_private_limited.194f91f15ce842e1c2555112ceaa4d76.html https://www.dnb.com/business-directory/compan... [trimmed]
How do I tell the scraper to only crawl the submitted initial urls and not follow other urls found in the source?
Hey Martin,
Thank you very much for reaching out. If you only want to retrieve the first page of the listing URLs, you can set the endPage property to 1. This applies to all sub-pages and lets you crawl only the initial URLs. As for the problem above, it would be best if we could investigate how the actor behaved while you were trying it out. If possible, could you please share your Run ID? You can find it via the "Share" button at the top right of the Run Detail page.
Best
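For illustration, the suggestion above could look roughly like the following input. This is only a sketch: `endPage` and `maxItems` are the property names used in this thread, while the `startUrls` key and its shape (a plain list of strings) are assumptions that may not match the actor's real input schema.

```python
import json

# Hypothetical actor input. `endPage` and `maxItems` are named in this thread;
# the `startUrls` key and its list-of-strings shape are assumptions.
actor_input = {
    "startUrls": [
        "https://www.dnb.com/business-directory/company-profiles.mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html",
    ],
    "endPage": 1,    # only retrieve the first page of any listing URL
    "maxItems": 50,  # upper bound on the number of rows in the output
}

# Serialize to JSON, the format in which actor inputs are usually edited.
input_json = json.dumps(actor_input, indent=2)
print(input_json)
```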
Thanks for explaining. The endPage property isn't documented here: https://apify.com/epctex/dnb-scraper. Are there any other undocumented properties? As for your question, this is the ID of a similar run where not all requested pages were in the response, while the response also included pages that were not requested: 49fFAwGLzfVCfTHKt
Hey there,
Thank you very much for sharing the Run ID and pointing out the missing property. We will update the documentation immediately. As for the run, we investigated what might have gone wrong there.
- We found that there are a couple of listing pages, such as https://www.dnb.com/business-directory/company-information.household_appliances_and_electrical_and_electronic_goods_merchant_wholesalers.in.tamil_nadu.kanchipuram.html
- The maxItems property is set to 50 items.
In such cases, the actor enqueues all the URLs and starts scraping. Once it reaches 50 rows in the output, it won't fetch the remaining Start URLs. In your case, the actor filled up this limit of 50 with the profiles found via the listing URLs. I am not quite sure if that is the intended approach here. What exactly would you like to retrieve from the actor? Maybe we can put together an input that achieves your goal.
Best
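The effect described above can be pictured with a toy model. This is not the actor's code: the function, the placeholder URL names, and in particular the processing order (profiles found on a listing page are handled before the remaining Start URLs) are assumptions chosen only to reproduce the reported symptom.

```python
from collections import deque

def simulate_run(start_urls, listing_contents, max_items):
    """Toy model of a crawl that stops once max_items rows are in the output.

    listing_contents maps a listing URL to the profile URLs found on it.
    The queue order used here is an assumption, not the actor's real logic.
    """
    queue = deque(start_urls)
    output = []
    while queue and len(output) < max_items:
        url = queue.popleft()
        if url in listing_contents:
            # Listing page: put the profiles it links to at the front of the queue.
            queue.extendleft(reversed(listing_contents[url]))
        else:
            # Profile page: one row in the output.
            output.append(url)
    return output

# Placeholder names, not real URLs.
rows = simulate_run(
    start_urls=["listing", "profile-A", "profile-B"],
    listing_contents={"listing": ["found-1", "found-2", "found-3"]},
    max_items=3,
)
print(rows)
```

In this model the limit is used up by profiles that were never requested, and `profile-A` and `profile-B` are not fetched at all.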
Hi Tuğkan Cengiz, thanks for responding. In this case we only want to receive the profiles provided in the start URLs. As we were requesting 50 Profile Urls we set the maxItems property to 50 - any lower number would not follow all start urls. The Crawler should (in this case) not follow further links found on the profile pages but only scrape the provided profile.
Hey Martin,
Thank you very much for explaining.
As mentioned, the main problem is the listing links that have been included in the Start URLs. If you provide company profiles only, the actor will return exactly that many rows and not touch any internal links. So in this case, you don't even need to use Max Items.
However, when a Start URL begins with https://www.dnb.com/business-directory/company-information, for example https://www.dnb.com/business-directory/company-information.household_appliances_and_electrical_and_electronic_goods_merchant_wholesalers.in.tamil_nadu.kanchipuram.html, the actor counts it as a listing page and starts scraping through it. Therefore, if a listing page is included, the actor will keep going through it until it hits the Max Items parameter. That is the main logic of the actor and unfortunately, it cannot be changed.
My suggestion for achieving your goal would be to provide only links that start with https://www.dnb.com/business-directory/company-profiles, such as https://www.dnb.com/business-directory/company-profiles.mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html. These links are counted as DnB Company Profiles, as they should be. The couple of listing links in your Input are the ones that contain many companies. That is what is getting in the way of your goal.
Please let me know if I can help you with anything else.
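As a rough sketch of that suggestion, the Start URLs could be filtered before the run so that only company profile links remain. The two prefixes are the ones quoted in this thread; the helper name `split_start_urls` is hypothetical and not part of the actor.

```python
# Prefixes quoted in this thread.
PROFILE_PREFIX = "https://www.dnb.com/business-directory/company-profiles"
LISTING_PREFIX = "https://www.dnb.com/business-directory/company-information"

def split_start_urls(urls):
    """Separate company profile links from everything else (hypothetical helper)."""
    profiles = [u for u in urls if u.startswith(PROFILE_PREFIX)]
    others = [u for u in urls if not u.startswith(PROFILE_PREFIX)]
    return profiles, others

urls = [
    "https://www.dnb.com/business-directory/company-profiles.mobis_india_limited.5ed64f1f15ee79861b7e876f46215921.html",
    "https://www.dnb.com/business-directory/company-information.household_appliances_and_electrical_and_electronic_goods_merchant_wholesalers.in.tamil_nadu.kanchipuram.html",
]

profiles, others = split_start_urls(urls)
print("submit as Start URLs:", profiles)
print("left out:", others)
```

Reviewing the left-out list before a run makes it easy to spot listing pages that would otherwise use up the Max Items limit.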