Macy's Scraper
trudax/actor-macys-scraper

Macy's web scraper to crawl product information including price and sale price, color, and images. Extracts all data into a dataset in structured formats.

Not scraping everything

Closed

82society opened this issue · 2 years ago

Hi Gustavo,

https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid

There are 1463 pages in the link above, but for some reason the run completed with only 5214 results, and it has been like this for the last few runs. Why isn't it scraping all the pages?

Gustavo (developer)

Can you share the run ID with me so I can see the logs?

82society · 2 years ago

Run IDs: nLNIWvB7Evws63TC5, LoeNOUKGLc2jfXh55, DeLcKpIfzCmEAAdth

Gustavo (developer)

There seems to be an issue with requests that fail and exceed the retry limit. I have made some improvements to the logic to fix this, but unfortunately the new version will not be compatible with previous runs.

82society · 2 years ago

"This new version will not be compatible with the previous runs unfortunately" Does that mean that if I proceed with the new version, it will scrape all the ones that I already scraped from previous version?

82society · 2 years ago

Gustavo, could you please kindly answer the question above.

Additionally, I ran a new task today with the new version and it Failed after 1861 results - Run ID hzQfpfBlYcZvmfc5V

Gustavo (developer)

Yeah, it will scrape them again; the next runs will not have that problem. Also, even if you have failed requests, it will try them again the next time.

Gustavo (developer)

Also, I have fixed the issue you had in your last run.

82society · 2 years ago

"Also even if you have failed requests it will try them again the next time" Are you saying - if it runs into failed request, it will automatically rerun the task?

Additionally, is there any way to avoid scraping duplicate data with the new version? I've already spent approx. $130 on scraping so far.

82society · 2 years ago

Should I still use "macy-products" as the product request queue name?

Gustavo (developer)

You don't need to change the name, but it is not using a request queue anymore; it is using a key-value store. This way I have more control over how to check whether a product was scraped correctly. If you have already run the new version once, use the same name so the previously scraped products will be skipped. I will update the name on the input to reflect my changes.
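
For reference, a minimal sketch of the deduplication pattern described above, assuming the actor is built on the Apify SDK for Node.js; the store name matches the thread, but the key format and record shape are illustrative, not the actor's real internals:

```typescript
import { Actor } from 'apify';

await Actor.init();

// Named key-value stores persist across runs, so earlier products survive.
const store = await Actor.openKeyValueStore('macy-products');

// Hypothetical product ID parsed from a product page.
const productId = '12345678';

// Skip products that a previous run already stored correctly.
if ((await store.getValue(productId)) === null) {
    const product = { id: productId, price: 19.99 }; // placeholder scraped data
    await store.setValue(productId, product);        // mark as scraped
    await Actor.pushData(product);                   // save to the run's dataset
}

await Actor.exit();
```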

82society · 2 years ago

Hi Gustavo! Thank you so much! By the way, since the new version will re-scrape data already collected by the previous version, is it possible to ask Apify whether they can provide credit for the amount that I spent (roughly $120) on the previous version?

Gustavo (developer)

I don't know what Apify's policy is for something like this, but I don't think so, since it was related to a feature of the actor and not an issue with the platform. What you can try, to avoid scraping all the products again, is to provide a paginated URL as the starting point. So if your runs already went up to the 20th page, for example, you can pass the URL for the 21st page as the starting one.

Gustavo (developer)

The logs should indicate the last page in a message like this: "Adding next page to queue. URL: ..."
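
As a concrete illustration, a follow-up run's input might look like the sketch below. This assumes the actor accepts a standard startUrls array (an assumption about its input schema); the URL is a placeholder you would copy verbatim from the "Adding next page to queue" log line of your previous run:

```json
{
  "startUrls": [
    { "url": "<paginated URL copied from the previous run's log>" }
  ]
}
```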

Gustavo (developer)

Also, since the actor is now using a named key-value store, and previously was using a named request queue, you need to manually delete these once you no longer want to use them. Apify will charge for storing this data.
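
For reference, named storages can be removed through the Apify console's Storage page or with the Apify API client; a minimal sketch assuming an API token in APIFY_TOKEN, with placeholder storage IDs:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder IDs: find them under Storage in the Apify console.
// Named storages are never auto-deleted, so delete them to stop the storage charges.
await client.keyValueStore('<STORE_ID>').delete();
await client.requestQueue('<QUEUE_ID>').delete();
```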

82society · 2 years ago

- "The log should have the indication of the last page..." - ahh okay, that was going to be my question. Thanks for answering ahead.
- "Also, since the actor is using named KeyValue store..." - I'm not fully understanding this. So after the task is complete, I should delete everything that's under the Queue ID? I started a new task this morning with the new version. Once it's complete, should I delete it? The page in the attached screenshot?

82society · 2 years ago

Also, Gustavo, I'm trying to build an external application with Apify (currently inquiring), as you mentioned before. Will I be able to transfer all the data that I already scraped into the application? If so, what file should it be stored under?

Sorry, I'm very new to this and it's not my area of expertise.

82society · 2 years ago

Gustavo, the new run succeeded but stopped at 2876 results. Why did it stop again? Queue ID: DQggPJ9i4cV4GhGy4

Gustavo (developer)

You should be able to see the named request queues and key-value stores under the Storage menu: https://console.apify.com/storage?tab=keyValueStores You should not delete them if you want to keep the previous products. You can keep the data there forever, but it will have a cost; Apify only deletes data automatically after 30 days when it is not named. To transfer the data to your application, you just need to export the already scraped data. You can export it as a JSON or CSV file to make a backup as a last resort, but it will not be deleted automatically, since the key-value store is named.
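
For reference, a minimal sketch of exporting a run's dataset with the Apify API client for JavaScript, assuming an API token in the APIFY_TOKEN environment variable and using the client's downloadItems helper; the dataset ID is a placeholder:

```typescript
import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder ID: copy the dataset ID from the run's Storage tab.
const csv = await client.dataset('<DATASET_ID>').downloadItems('csv');

// Write the exported items to a local backup file.
writeFileSync('macys-products.csv', csv);
```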

Gustavo (developer)

I need the run ID to be able to see what is happening.

82society · 2 years ago

Okay, cool. When will you be able to check the run ID and find out? I'm just trying to scrape the data before some products go out of stock. Once a product goes out of stock, Macy's doesn't display its product information.

Gustavo (developer)

You shared the Queue ID; I need the run ID.

82society · 2 years ago

Hi Gustavo, have you had a chance to check the run ID?

Gustavo (developer)

I am waiting for you to share the run ID with me.

82society · 2 years ago

Run ID W8Y1h53f9NUSeyD9b

Gustavo (developer)

Thank you. It does not seem to have any obvious errors. What happens if you run it again? It should skip the already stored products and get new ones.

82society · 2 years ago

I haven't resurrected it since it stopped. It stopped at 2876 results as Succeeded, but clearly it wasn't done.

82society · 2 years ago

I'll resurrect it again.

Gustavo (developer)

From the logs, all requests were either completed or failed too many times. I think that resurrecting will not work, but running it again will, since it will add the 46 failed requests back to the queue. The pagination requests are probably among those 46 failed requests. I can try to improve the scraping of the pagination so it will at least always try more pages.

82society · 2 years ago

Gustavo, I tried resurrecting it, but it stopped right away as Succeeded.

Gustavo (developer)

Try a new run with the same input. This will pick up the failed requests from the previous run and add more pages. The dataset will be a different one, so you will need to merge the results. I can also make the results be stored in a named dataset, so all runs will add to it and you will have all the data in one dataset.
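
For reference, merging the per-run datasets into one named dataset can be done with the Apify API client; a minimal sketch assuming an API token in APIFY_TOKEN, with placeholder dataset IDs and a made-up name for the merged dataset:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder IDs: the default dataset ID of each run you want to merge.
const runDatasetIds = ['<DATASET_ID_1>', '<DATASET_ID_2>'];

// Named datasets persist across runs and can be appended to.
const merged = await client.datasets().getOrCreate('macys-merged');

for (const id of runDatasetIds) {
    let offset = 0;
    for (;;) {
        // listItems is paginated, so page through the whole dataset.
        const { items, total } = await client.dataset(id).listItems({ offset, limit: 1000 });
        if (items.length === 0) break;
        await client.dataset(merged.id).pushItems(items);
        offset += items.length;
        if (offset >= total) break;
    }
}
```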

Gustavo (developer)

These failed requests are normal and happen for different reasons, including the page sometimes taking too long to load, or the actor being blocked until it starts using a new session to bypass the block.

82society · 2 years ago

Okay, thanks Gustavo! Yeah, can you make it so that all the results will be added to the same dataset? Also, if I change the task name, will that affect the dataset being merged?

82society · 2 years ago

Once you answer the question above, I'll run a new task.

Gustavo (developer)

Changing the task name will have no effect on the dataset.

82society · 2 years ago

Hi Gustavo, the tasks stopped again as Succeeded. The first one stopped at 119 results - Run ID GASjg4gDlbQMVKC4Y. The second one stopped at 8690 results - Run ID SS2RsC5CTG1Mv85Ph. I tried to resurrect the second one twice, but it keeps stopping as Succeeded with no additional results.

Why does this keep happening?

Gustavo (developer)

Once there are no requests left to process in the queue, the actor stops running. If the request that handles the pagination fails to load more than 6 times, it is considered processed and further pages are not added to the queue. I need to find out whether Macy's is blocking those requests and bypass it.
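
For context, this retry-and-drop behavior matches how Crawlee-based actors generally work: each request is retried up to a configured limit, after which it is marked as failed and never requeued. A minimal sketch under the assumption that the actor uses Crawlee's CheerioCrawler (the selector and limit here are illustrative, not the actor's real settings):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Once a request fails this many times it is marked as failed and
    // dropped, so any pagination links it would have added are lost.
    maxRequestRetries: 6,
    requestHandler: async ({ enqueueLinks }) => {
        // Hypothetical handler: enqueue the next page, then scrape products.
        await enqueueLinks({ selector: 'a.next-page' });
    },
    failedRequestHandler: async ({ request }) => {
        // Called only after the retry limit is exhausted.
        console.log(`Request ${request.url} failed too many times.`);
    },
});

await crawler.run(['https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid']);
```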

82society · 2 years ago

So should I wait?


82society · 2 years ago

It says it's under maintenance: "This actor may be unreliable while under maintenance. Would you like to try a similar actor instead?" Should I just wait?

Gustavo, when will you become available to build an external application? Apify is very slow with their responses and not really helpful.

Gustavo (developer)

The actor is still working normally; it was set under maintenance automatically because it took too long to scrape the example input. If you just run the task again with the same input, it should pick up the failed requests and continue scraping, but the results of each run go into a different dataset. I am thinking that I will need to add the option to save into the same dataset, and also create a script to merge the old datasets into a single one. I am working on another project with no defined timeline, so I really don't know when I will be available to create external applications.

Gustavo (developer)

Just found a free actor that already solves this issue: https://apify.com/lukaskrivka/dedup-datasets
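
For reference, a public actor like this can also be started programmatically with the Apify API client. A minimal sketch assuming an API token in APIFY_TOKEN; note that the input keys below (datasetIds, fields, outputDatasetId) are guesses at the dedup actor's schema, so check its README for the real field names:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// All input keys below are assumed; verify them against the actor's README.
const run = await client.actor('lukaskrivka/dedup-datasets').call({
    datasetIds: ['<DATASET_ID_1>', '<DATASET_ID_2>'], // placeholder run dataset IDs
    fields: ['url'],                                  // assumed dedup key field
    outputDatasetId: 'macys-deduped',                 // assumed name for the merged output
});

console.log(`Dedup run finished with status: ${run.status}`);
```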

82society · 2 years ago

Wait, so currently, the failed requests aren't re-attempting the scrape?

Also, is Macy's blocking the request that handles the pagination? It keeps stopping as Succeeded.

Thanks for the recommendation of the free actor. I just contacted its developer to ask how to use it together with yours.

Usually, how much would it cost to build an application like that? Like the external application you described? Can we get on a quick phone call today? I'd really appreciate it.

Gustavo (developer)

We can schedule a call in here so I can explain it to you: https://calendly.com/gustavorudiger/15min

82society · 2 years ago

I just checked the result files, and it appears that it is scraping duplicates.

82society · 2 years ago

Hi Gustavo, I tried re-running a task and it stopped as Succeeded after 7342 results. Run ID iwWXV6aiE4q9RphEy. Also, it's scraping duplicates. My understanding is that the other actor you recommended can get rid of the duplicates.

Gustavo, there are two ongoing and important issues right now that are correlated with each other, pointing to the same cause:

  1. It keeps stopping as Succeeded.
  2. Every time I run a new task, it scrapes duplicates, and I'm being billed for the duplicates.

The problem with the actor you recommended is that I need to provide a dataset ID. Which, I think, means that I have to run the task with your actor first and then provide the dataset to the other actor. So basically, I still have to pay for the duplicates, right?

Can you fix it over the weekend? I can't continue with this actor if it can't be fixed. My time is running low, because every day product data becomes unavailable on Macy's as items go out of stock. So every day counts. Could you please, please fix it.

82society · 2 years ago

Btw, I'm not trying to offend you or threaten that I'm going to leave. You've been nothing but very helpful to me. It's just that I'm quite frustrated with the situation, because I'm very new to this industry and I just need this to function well for my business operations. And I plan to use this for a long time. However, right now time is my enemy, as product data is becoming unavailable day by day on Macy's. Please help.

Gustavo (developer)

Sure, Andrew, I appreciate the feedback so I can improve my actor. The actor should not be saving duplicates anymore; I will take a look and try to fix it.

Gustavo (developer)

I think I have managed to fix it. I found a bug that was preventing some results from being saved.

82society · 2 years ago

Hi Gustavo,

Run IDs: Cf19jwErPefnfBaLP, gXIDq6UWcMug6UYQy

I'm trying to resurrect these two runs, but they stop as Succeeded.

Why does this keep happening? Is there a way to fix this?
