Macy's Scraper
Pay $9.00 for 1,000 results
Macy's web scraper that crawls product information, including price, sale price, color, and images. Extracts all data into a dataset in structured formats.
Hi Gustavo,
https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid
There are 1,463 pages in the link above. For some reason, the run completed with only 5,214 results, and it has been like this for the last few runs. Why isn't it scraping all the pages?
Can you share the run ID with me so I can see the logs?
Run IDs: nLNIWvB7Evws63TC5, LoeNOUKGLc2jfXh55, DeLcKpIfzCmEAAdth
There seems to be an issue with requests that fail and exceed the retry limit. I have made some improvements to the logic to fix this, but this new version will not be compatible with the previous runs, unfortunately.
"This new version will not be compatible with the previous runs, unfortunately." Does that mean that if I proceed with the new version, it will re-scrape all the products I already scraped with the previous version?
Gustavo, could you please kindly answer the question above?
Additionally, I ran a new task today with the new version and it failed after 1,861 results. Run ID: hzQfpfBlYcZvmfc5V
Yes, it will scrape them again. Future runs will not have that problem. Also, even if you have failed requests, it will retry them the next time.
I have also fixed the issue you had in your last run.
"Also, even if you have failed requests, it will retry them the next time." Are you saying that if it runs into failed requests, it will automatically rerun the task?
Additionally, is there any way to avoid scraping duplicate data with the new version? I've already spent approximately $130 on scraping so far.
Should I still use "macy-products" as the product request queue name?
You don't need to change the name. But it is not using a request queue anymore; it is using a key-value store. This way I have more control over checking whether a product was scraped correctly. If you have already run the new version once, use the same name so the previously scraped products will be skipped. I will update the field name on the input to reflect my changes.
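For illustration, here is a minimal sketch of how that kind of check could look in an actor built on the Apify SDK; the store name and the one-record-per-product key scheme are assumptions about the actor's internals, not its actual code:

```ts
import { Actor } from 'apify';

await Actor.init();

// Named key-value store used to remember which products were already
// scraped. The store name and the one-record-per-product key scheme
// are illustrative assumptions, not the actor's actual internals.
const store = await Actor.openKeyValueStore('macy-products');

const productId = 'example-product-id'; // hypothetical product ID
const alreadyDone = (await store.getValue(productId)) !== null;

if (!alreadyDone) {
    // ... scrape the product here, then mark it as done:
    await store.setValue(productId, { scrapedAt: new Date().toISOString() });
}

await Actor.exit();
```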
Hi Gustavo! Thank you so much! By the way, since the new version will re-scrape data from the previous version, is it possible to ask Apify whether they can provide credit for the amount I spent (roughly $120) on the previous version?
I don't know what Apify's policy is for something like this, but I don't think so, since it was related to a feature of the actor and not an issue with the platform. What you can try in order to avoid scraping all the products again is to provide a paginated URL as the starting point. So if your runs already went up to the 20th page, for example, you can pass the URL for the 21st page as the starting one.
The logs should indicate the last page in a message like this:
Adding next page to queue. URL: ...
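As a sketch, passing that later page as the starting point via the Apify JavaScript client could look like the following; the actor ID, the startUrls input field, and the pageIndex query parameter are assumptions, so check the actor's input schema and the exact URL shown in the logs:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Resume from page 21 instead of page 1. The actor ID, the `startUrls`
// input field, and the `pageIndex` parameter are illustrative assumptions.
const run = await client.actor('<username>/macys-scraper').call({
    startUrls: [{
        url: 'https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid&pageIndex=21',
    }],
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```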
Also, since the actor is now using a named key-value store and previously used a named request queue, you need to delete these manually once you no longer want them. Apify charges for storing this data.
"The logs should indicate the last page..." - ah okay, that was going to be my question. Thanks for answering it ahead of time.
"Also, since the actor is now using a named key-value store..." - I'm not fully understanding this. So after the task is complete, should I delete everything under the Queue ID? I started a new task this morning with the new version. Once it's complete, should I delete it? The page in the attached screenshot?
Also, Gustavo, I'm trying to build an external application with Apify (currently inquiring), as you mentioned before. Will I be able to transfer all the data I've already scraped into the application? If so, what file format should it be stored in?
Sorry, I'm very new to this and it's not my area of expertise.
Gustavo, the new run succeeded and stopped at 2,876 results. Why did it stop again? Queue ID: DQggPJ9i4cV4GhGy4
You should be able to see the named request queue and key-value stores under the Storage menu: https://console.apify.com/storage?tab=keyValueStores. Do not delete them if you want to keep the previous products. You can keep the data there forever, but it has a cost; Apify automatically deletes data after 30 days only when it is not named. To transfer the data to your application, you just need to export the already scraped data. You can export it as a JSON or CSV file as a last resort to make a backup, but it will not be automatically deleted anyway, since the key-value store is named.
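For reference, a minimal sketch of exporting a run's dataset with the Apify JavaScript client; the dataset ID is a placeholder you can copy from the run's Storage tab:

```ts
import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const DATASET_ID = '<your-dataset-id>'; // copy from the run's Storage tab

// Download all items as JSON for the external application.
// For very large datasets, page through with offset/limit instead.
const { items } = await client.dataset(DATASET_ID).listItems();
writeFileSync('macys-products.json', JSON.stringify(items, null, 2));

// Or download a CSV export of the same data as a backup.
const csv = await client.dataset(DATASET_ID).downloadItems('csv');
writeFileSync('macys-products.csv', csv);
```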
I need the run ID to be able to see what is happening.
Okay, cool. When will you be able to check the run and find out? I'm just trying to scrape the data before some products go out of stock. Once a product goes out of stock, Macy's doesn't display its information.
You shared the Queue ID; I need the run ID.
Hi Gustavo, have you had a chance to check it?
I am waiting for you to share the run ID with me.
Run ID W8Y1h53f9NUSeyD9b
Thank you. It does not seem to have any obvious errors. What happens if you run it again? It should skip the already stored products and get new ones.
I haven't resurrected it since it stopped. It stopped at 2,876 results as Succeeded, but clearly it wasn't done.
I'll resurrect it again.
From the logs, all requests were either completed or failed too many times. I think resurrecting will not work, but running it again will, since that adds the 46 failed requests back to the queue. The pagination requests are probably among those 46 failed requests. I can try to improve the pagination scraping so it will at least always try more pages.
Gustavo, I tried resurrecting it, but it stopped right away as Succeeded.
Try a new run with the same input. This will pick up the failed requests from the previous run and add more pages. The dataset will be a different one, so you will need to merge the results. I can also make the results be stored in a named dataset so that all runs add to it and you have all the data in one dataset (sketched below).
These failed requests are normal and happen for different reasons, including the page sometimes taking too long to load, or the actor getting blocked until it starts using a new session to bypass the block.
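For illustration, storing results in a named dataset inside an actor built on the Apify SDK is roughly this; the dataset name is an assumption:

```ts
import { Actor } from 'apify';

await Actor.init();

// A named dataset survives across runs, so every run appends to the same
// storage instead of creating a fresh unnamed dataset per run, and no
// manual merging is needed. The name here is illustrative.
const dataset = await Actor.openDataset('macys-products-results');

// Inside the scraping logic, each product would be pushed like this:
await dataset.pushData({ title: 'Example product', price: 19.99 });

await Actor.exit();
```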
Okay, thanks Gustavo! Yes, can you make it so that all the results are added to the same dataset? Also, if I change the task name, will that affect the dataset being merged?
Once you confirm the above, I'll run a new task.
Changing the task name will have no effect on the dataset.
Hi Gustavo, the tasks stopped again as Succeeded. The first one stopped at 119 results (Run ID: GASjg4gDlbQMVKC4Y). The second one stopped at 8,690 results (Run ID: SS2RsC5CTG1Mv85Ph). I tried resurrecting the second one twice, but it keeps stopping as Succeeded with no additional results.
Why does this keep happening?
Once there are no requests left to process in the queue, the actor stops running. If the request that handles the pagination fails to load more than 6 times, it is considered processed and further pages are not added to the queue. I need to find out whether Macy's is blocking those requests and bypass it.
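A sketch of the mechanism being described, using Crawlee (which Apify actors are commonly built on) as an assumption about the actor's internals; the selector is hypothetical:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // After the initial attempt plus this many retries, a request is marked
    // as handled and dropped; a dropped pagination request means no further
    // pages get enqueued, so the run finishes as Succeeded early.
    maxRequestRetries: 6,
    requestHandler: async ({ enqueueLinks }) => {
        // Each listing page enqueues the next one; the selector is hypothetical.
        await enqueueLinks({ selector: 'a.next-page' });
    },
    failedRequestHandler: async ({ request }) => {
        console.warn(`Dropped after too many failures: ${request.url}`);
    },
});

await crawler.run([
    'https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid',
]);
```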
So should I wait?
It says it's under maintenance: "This actor may be unreliable while under maintenance. Would you like to try a similar actor instead?" Should I just wait?
Gustavo, when will you be available to build an external application? Apify is very slow to respond and not really helpful.
The actor is still working normally; it was set under maintenance automatically because it took too long to scrape the example input. If you just run the task again with the same input, it should pick up the failed requests and continue scraping, but the results of each run end up in a different dataset. I am thinking I will need to add an option to save to the same dataset and also create a script to merge the old datasets into a single one (a sketch of such a script is below). I am working on another project without a time limit, so I really don't know when I will be available to create external applications.
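A minimal sketch of what such a merge script could look like with the Apify JavaScript client; the source dataset IDs, the merged dataset name, and using the product url as the deduplication key are all assumptions:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Datasets produced by the individual runs (placeholders).
const SOURCE_DATASET_IDS = ['<dataset-id-1>', '<dataset-id-2>'];

// Named dataset that will hold the merged, deduplicated results.
const merged = await client.datasets().getOrCreate('macys-products-merged');

const seen = new Set<string>();
for (const datasetId of SOURCE_DATASET_IDS) {
    // For very large datasets, page through with offset/limit instead.
    const { items } = await client.dataset(datasetId).listItems();
    // Using `url` as the unique key is an assumption; pick whatever field
    // uniquely identifies a product in this actor's output.
    const fresh = items.filter((item) => {
        const key = String(item.url);
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
    if (fresh.length > 0) await client.dataset(merged.id).pushItems(fresh);
}
```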
Just found a free actor that already solves this issue: https://apify.com/lukaskrivka/dedup-datasets
Wait, so currently the failed requests aren't being re-attempted?
Also, is Macy's blocking the request that handles the pagination? It keeps stopping as Succeeded.
Thanks for recommending the free actor. I just contacted its developer for help on how to use it in combination with yours.
How much does it usually cost to build an application like that, like the external application you described? Could we get on a quick phone call today? I'd really appreciate it.
We can schedule a call in here so I can explain it to you: https://calendly.com/gustavorudiger/15min
I just checked the result files, and it appears that it is scraping duplicates.
Hi Gustavo, I tried re-running a task and it stopped as Succeeded after 7,342 results (Run ID: iwWXV6aiE4q9RphEy). It's also scraping duplicates. My understanding is that the other actor you recommended can get rid of the duplicates.
Gustavo, there are two ongoing and important issues right now that are correlated with each other and point to the same cause:
- It keeps stopping as Succeeded.
- Every time I run a new task, it scrapes duplicates, and I'm being billed for those duplicates.
The problem with the actor you recommended is that I need to provide a dataset ID, which, I think, means I have to run the task with your actor first and then feed the dataset to the other actor. So basically, I still have to pay for the duplicates, right?
Can you fix it over the weekend? I can't continue with this actor if it can't be fixed. My time is running low because product data becomes unavailable on Macy's every day as items go out of stock, so every day counts. Could you please, please fix it?
By the way, I'm not trying to offend you or threaten to leave. You've been nothing but helpful to me. It's just that I'm quite frustrated with the situation because I'm very new to this industry, I need this to function well for my business operation, and I plan to use it for a long time. Right now, though, time is my enemy, as product data becomes unavailable on Macy's day by day. Please help.
Sure, Andrew, I appreciate the feedback so I can improve my actor. The actor should not be saving duplicates anymore; I will take a look and try to fix it.
I think I have managed to fix it. I found a bug that was preventing some results from being saved.
Hi Gustavo,
Run IDs: Cf19jwErPefnfBaLP, gXIDq6UWcMug6UYQy
I'm trying to resurrect these two runs, but they stop as Succeeded.
Why does this keep happening? Is there a way to fix it?
Actor Metrics
2 monthly users
4 stars
88% runs succeeded
Created in Dec 2019
Modified 4 days ago