
Contact Details Scraper

Try for free

Pay $3.00 for 1,000 pages

vdrmota/contact-info-scraper
Free email extractor to extract and download emails, phone numbers, Facebook, Twitter, LinkedIn, and Instagram profiles from any website. Extract contact information at scale from lists of URLs and download the data as Excel, CSV, JSON, HTML, and XML.

Does not find any e-mail, even though I find it manually

Closed

beautiful_jumbo opened this issue
2 months ago

Hi,

I have a list of URLs and it doesn't find any e-mails at all.

Let's take the first example:

https://mi-tech.de

I can find an e-mail here (https://www.mi-tech.de/impressum/) and here (https://www.mi-tech.de/kontakt/)

It is in the code, even as a href link. Are my settings wrong?

Here is my input:

{
  "considerChildFrames": false,
  "maxDepth": 1000,
  "maxRequests": 9999999,
  "maxRequestsPerStartUrl": 1000,
  "sameDomain": true,
  "startUrls": [
    { "requestsFromUrl": "https://apify-uploads-prod.s3.us-east-1.amazonaws.com/J04srFO9aYIUdph82-no-email.txt" }
  ],
  "waitUntil": "domcontentloaded",
  "proxyConfig": { "useApifyProxy": true }
}

milunnn

Hi,

The https://mi-tech.de page was the only page from that domain that had been scraped at that point in time. Your config is okay; the only thing I would change is maxDepth, which seems too high for ordinary use. The recommended value for this field is below 10.
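For illustration, here is the input from above with only maxDepth lowered per that recommendation; all other fields are kept exactly as in the original input, and this is a sketch rather than canonical actor settings:

```json
{
  "considerChildFrames": false,
  "maxDepth": 5,
  "maxRequests": 9999999,
  "maxRequestsPerStartUrl": 1000,
  "sameDomain": true,
  "startUrls": [
    { "requestsFromUrl": "https://apify-uploads-prod.s3.us-east-1.amazonaws.com/J04srFO9aYIUdph82-no-email.txt" }
  ],
  "waitUntil": "domcontentloaded",
  "proxyConfig": { "useApifyProxy": true }
}
```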

If you don't know what it means: maxDepth is like a "distance" from the first page, i.e. how many link hops you can take away from the original one. If you put a high number there, you could theoretically crawl the whole website, including a lot of unimportant and rarely referenced pages.
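That "distance" idea can be sketched as a breadth-first crawl where each page's depth is its parent's depth plus one, and pages beyond maxDepth are never enqueued. This is a simplified illustration with a made-up in-memory site, not the actor's actual code:

```python
from collections import deque

def crawl(start_url, get_links, max_depth):
    """Breadth-first crawl where depth = number of link hops from start_url.

    get_links(url) -> list of URLs found on that page (supplied by the caller).
    Links are not followed from pages already at max_depth.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links any further from here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Tiny hypothetical site: index -> kontakt -> impressum
site = {
    "https://example.com/": ["https://example.com/kontakt"],
    "https://example.com/kontakt": ["https://example.com/impressum"],
    "https://example.com/impressum": [],
}
# With max_depth=1, only the start page and pages one hop away are visited.
pages = crawl("https://example.com/", lambda u: site.get(u, []), max_depth=1)
```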

I did a test run of the actor and it works as you would expect. The problem is that it concentrated on the other websites from the list first. To see the results for this page more quickly, try isolating the input to just https://mi-tech.de. You should see that it returns emails on the right pages.
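The reason results for one domain arrive slowly when many start URLs are given: the crawler pulls requests from a single shared queue, so pages from all domains interleave rather than one domain being finished before the next starts. A minimal sketch of that queue behavior (not the actor's real scheduler), with hypothetical domains:

```python
from collections import deque

def interleaved_order(start_urls, links_per_page):
    """Single shared FIFO queue: newly discovered links join the back of the
    queue, behind the other domains' start URLs, so no single domain's
    subpages are reached before every start URL has been processed."""
    queue = deque((url, 0) for url in start_urls)  # (url, depth)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < 1:  # follow one hop, for illustration
            for link in links_per_page.get(url, []):
                queue.append((link, depth + 1))
    return order

starts = ["https://a.example", "https://b.example"]
links = {
    "https://a.example": ["https://a.example/kontakt"],
    "https://b.example": ["https://b.example/kontakt"],
}
# a.example's subpage is crawled only after b.example's start page:
order = interleaved_order(starts, links)
```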

Also, this actor returns one row of data per webpage, not per the whole domain. Maybe it seemed like it did not find anything on the whole https://mi-tech.de website, but it only searched through the initial (index) page.
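So one domain can yield several dataset rows, one per crawled URL; hypothetically something like this (field names illustrative, not the actor's exact output schema):

```json
[
  { "url": "https://www.mi-tech.de/", "emails": [] },
  { "url": "https://www.mi-tech.de/kontakt/", "emails": ["…"] }
]
```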

beautiful_jumbo

2 months ago

Hi,

thanks a lot for your response. It is a bit confusing to me. I would expect the actor to take each domain from the list, crawl it, and if it finds an email, move on to the next domain; if it doesn't, follow the referenced pages and look there, moving on as soon as it finds an email. But it seems it's not working like that. What do you mean by "concentrating on other domains from the list"? What logic is behind this; why do other domains get higher priority? I will try reducing the maxDepth parameter and let you know.

beautiful_jumbo

2 months ago

Interesting. As you mentioned, if the domains are added manually instead of via the .txt file, it works as expected: it outputs up to the maxDepth-determined number of rows per domain, which is logical to me. Can you explain why, when using the remote .txt file, it only crawls the start domain and does not go further, despite the maxDepth setting being > 1? Thanks.

beautiful_jumbo

2 months ago

My bad, the requests are being queued and run asynchronously. Sorry!

Developer
Maintained by Apify

Actor Metrics

  • 1.3k monthly users

  • 190 stars

  • >99% runs succeeded

  • 4.1 days response time

  • Created in May 2019

  • Modified a day ago