Contact Details Scraper
Pay $3.00 for 1,000 pages
Free email extractor to extract and download emails, phone numbers, Facebook, Twitter, LinkedIn, and Instagram profiles from any website. Extract contact information at scale from lists of URLs and download the data as Excel, CSV, JSON, HTML, and XML.
Hi,
I have a list of URLs and it doesn't find any emails at all.
Let's take the first example:
I can find an email here (https://www.mi-tech.de/impressum/) and here (https://www.mi-tech.de/kontakt/).
It is in the code, even as an <a href> link. Are my settings wrong?
Here is my input:

{
  "considerChildFrames": false,
  "maxDepth": 1000,
  "maxRequests": 9999999,
  "maxRequestsPerStartUrl": 1000,
  "sameDomain": true,
  "startUrls": [
    { "requestsFromUrl": "https://apify-uploads-prod.s3.us-east-1.amazonaws.com/J04srFO9aYIUdph82-no-email.txt" }
  ],
  "waitUntil": "domcontentloaded",
  "proxyConfig": { "useApifyProxy": true }
}
Hi,
the https://mi-tech.de page was the only page from that domain that had been scraped at that point in time. Your config is okay; the only thing I would change is maxDepth, which seems too high for ordinary use. The recommended value for this field is below 10.
If you are not sure what it means, think of it as a "distance" from the first page: it describes how many link hops you can take away from the original page. If you put a high number there, you could theoretically crawl the whole website, which would also include a lot of unimportant and rarely referenced pages.
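For illustration, here is a sketch of the same input with maxDepth lowered to a conservative value. The value 3 is just an illustrative choice, not an official recommendation; anything below 10 should behave reasonably for a contact-page crawl:

{
  "considerChildFrames": false,
  "maxDepth": 3,
  "maxRequests": 9999999,
  "maxRequestsPerStartUrl": 1000,
  "sameDomain": true,
  "startUrls": [
    { "requestsFromUrl": "https://apify-uploads-prod.s3.us-east-1.amazonaws.com/J04srFO9aYIUdph82-no-email.txt" }
  ],
  "waitUntil": "domcontentloaded",
  "proxyConfig": { "useApifyProxy": true }
}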
I did a test run of the Actor and it works as you would expect. The problem is that it concentrated on other websites from the list first. To see the results for this page more quickly, try to isolate the input to just https://mi-tech.de (see the example input below); you should see that it returns emails on the right pages.
Also, this Actor returns one row of data per webpage, not per whole domain. It may have seemed like it did not find anything on the whole https://mi-tech.de website when it had only searched the initial (index) page.
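As a sketch, a minimal input for testing just that one domain could look like this. The maxDepth value here is illustrative, and the other fields from your original input are omitted for brevity:

{
  "startUrls": [
    { "url": "https://mi-tech.de" }
  ],
  "maxDepth": 2,
  "sameDomain": true,
  "proxyConfig": { "useApifyProxy": true }
}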
Hi,
thanks a lot for your response. It is a bit confusing to me. I would expect the Actor to take every domain from the list and crawl it: if it finds an email, fine, move on to the next domain from the list; if it doesn't find one, go to the referenced pages and look there, and as soon as it finds an email, move on to the next domain. But it seems it's not working like that. What do you mean by "concentrating on other domains from the list"? What logic is behind this, and why do other domains have higher priority? I will try to reduce the maxDepth parameter and let you know.
Interesting: as you mentioned, if the domains are added manually instead of via the .txt file, it works as expected. It outputs a number of rows per domain that corresponds to the maxDepth setting, which is logical to me. Can you explain why, when using the remote .txt file, it only crawls the start domain and does not go further, despite the maxDepth setting being > 1? Thanks.
My bad, they are being queued and run asynchronously... sorry!
Actor Metrics
1.3k monthly users
190 stars
>99% runs succeeded
4.1 days response time
Created in May 2019
Modified a day ago