Macy's Scraper
trudax/actor-macys-scraper

Macy's web scraper to crawl product information including price and sale price, color, and images. Extracts all data into a dataset in structured formats.

Not scraping everything

Closed

82society opened this issue · 2 years ago

Hi Gustavo,

https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid

There are 1463 pages in the link above, but for some reason the run completed with only 5214 results, and it has been like this for the last few runs. Why isn't it scraping all the pages?

Gustavo (developer)

Can you share the run ID with me so I can see the logs?

82society · 2 years ago

Run IDs: nLNIWvB7Evws63TC5, LoeNOUKGLc2jfXh55, DeLcKpIfzCmEAAdth

Gustavo (developer)

There seems to be an issue with requests that fail and exceed the retry limit. I have made some improvements to the logic to fix this, but unfortunately the new version will not be compatible with previous runs.

82society · 2 years ago

"This new version will not be compatible with the previous runs unfortunately" Does that mean that if I proceed with the new version, it will scrape all the ones that I already scraped from previous version?

82society · 2 years ago

Gustavo, could you please kindly answer the question above.

Additionally, I ran a new task today with the new version and it Failed after 1861 results - Run ID hzQfpfBlYcZvmfc5V

Gustavo (developer)

Yeah, it will scrape them again; the next runs will not have that problem. Also, even if you have failed requests, it will try them again the next time.

Gustavo (developer)

Also, I have fixed the issue you had in your last run.

82society · 2 years ago

"Also even if you have failed requests it will try them again the next time" Are you saying - if it runs into failed request, it will automatically rerun the task?

Additionally, is there any way to avoid scraping duplicate data with the new version? I've already spent approx. $130 on scraping so far.

82society · 2 years ago

Should I still use "macy-products" as the product request queue name?

Gustavo (developer)

You don't need to change the name, but it is not using a request queue anymore; it is using a key-value store. This way I have more control over how to check whether a product was scraped correctly. If you have already run the new version once, use the same name so the previously scraped products will be skipped. I will update the name on the input to reflect my changes.
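
For reference, a minimal sketch of the deduplication pattern described above, assuming the actor is built on the Apify SDK for Node.js; the store name matches the thread, but the key format and record shape are illustrative, not the actor's real internals:

```typescript
import { Actor } from 'apify';

await Actor.init();

// Named key-value stores persist across runs, so earlier products survive.
const store = await Actor.openKeyValueStore('macy-products');

// Hypothetical product ID parsed from a product page.
const productId = '12345678';

// Skip products that a previous run already stored correctly.
if ((await store.getValue(productId)) === null) {
    const product = { id: productId, price: 19.99 }; // placeholder scraped data
    await store.setValue(productId, product);        // mark as scraped
    await Actor.pushData(product);                   // save to the run's dataset
}

await Actor.exit();
```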

82society · 2 years ago

Hi Gustavo! Thank you so much! By the way, since the new version will re-scrape data already collected by the previous version, is it possible to ask Apify whether they can provide credit for the amount that I spent (roughly $120) on the previous version?

Gustavo (developer)

I don't know what Apify's policy is for something like this, but I don't think so, since it was related to a feature of the actor and not an issue with the platform. What you can try, to avoid scraping all the products again, is to provide a paginated URL as the starting point. So if your runs already went up to the 20th page, for example, you can pass the URL for the 21st page as the starting one.

Gustavo (developer)

The logs should indicate the last page in a message like this: "Adding next page to queue. URL: ..."
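
As a concrete illustration, a follow-up run's input might look like the sketch below. This assumes the actor accepts a standard startUrls array (an assumption about its input schema); the URL is a placeholder you would copy verbatim from the "Adding next page to queue" log line of your previous run:

```json
{
  "startUrls": [
    { "url": "<paginated URL copied from the previous run's log>" }
  ]
}
```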

Gustavo (developer)

Also, since the actor is now using a named key-value store, and previously was using a named request queue, you need to manually delete these once you no longer want to use them. Apify will charge for storing this data.
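
For reference, named storages can be removed through the Apify console's Storage page or with the Apify API client; a minimal sketch assuming an API token in APIFY_TOKEN, with placeholder storage IDs:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder IDs: find them under Storage in the Apify console.
// Named storages are never auto-deleted, so delete them to stop the storage charges.
await client.keyValueStore('<STORE_ID>').delete();
await client.requestQueue('<QUEUE_ID>').delete();
```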

82society · 2 years ago

- "The log should have the indication of the last page..." - ahh okay, that was going to be my question. Thanks for answering ahead.
- "Also, since the actor is using named KeyValue store..." - I'm not fully understanding this. So after the task is complete, I should delete everything that's under the Queue ID? I started a new task this morning with the new version. Once it's complete, should I delete it? The page in the attached screenshot?

82society · 2 years ago

Also, Gustavo, I'm trying to build an external application with Apify (currently inquiring), as you mentioned before. Will I be able to transfer all the data that I already scraped into the application? If so, what file should it be stored under?

Sorry, I'm very new to this and it's not my area of expertise.

82society · 2 years ago

Gustavo, the new run succeeded but stopped at 2876 results. Why did it stop again? Queue ID: DQggPJ9i4cV4GhGy4

Gustavo (developer)

You should be able to see the named request queues and key-value stores under the Storage menu: https://console.apify.com/storage?tab=keyValueStores You should not delete them if you want to keep the previous products. You can keep the data there forever, but it will have a cost; Apify only deletes data automatically after 30 days when it is not named. To transfer the data to your application, you just need to export the already scraped data. You can export it as a JSON or CSV file to make a backup as a last resort, but it will not be deleted automatically, since the key-value store is named.
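
For reference, a minimal sketch of exporting a run's dataset with the Apify API client for JavaScript, assuming an API token in the APIFY_TOKEN environment variable and using the client's downloadItems helper; the dataset ID is a placeholder:

```typescript
import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder ID: copy the dataset ID from the run's Storage tab.
const csv = await client.dataset('<DATASET_ID>').downloadItems('csv');

// Write the exported items to a local backup file.
writeFileSync('macys-products.csv', csv);
```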

Gustavo (developer)

I need the run ID to be able to see what is happening.

82society · 2 years ago

Okay, cool. When will you be able to check the run ID and find out? I'm just trying to scrape the data before some products go out of stock. Once a product goes out of stock, Macy's doesn't display its product information.

Gustavo (developer)

You shared the Queue ID; I need the run ID.

82society · 2 years ago

Hi Gustavo, have you had a chance to check the run ID?

Gustavo (developer)

I am waiting for you to share the run ID with me.

82society · 2 years ago

Run ID W8Y1h53f9NUSeyD9b

Gustavo (developer)

Thank you. It does not seem to have any obvious errors. What happens if you run it again? It should skip the already stored products and get new ones.

82society · 2 years ago

I haven't resurrected it since it stopped. It stopped at 2876 results as Succeeded, but clearly it wasn't done.

82society · 2 years ago

I'll resurrect it again.

Gustavo (developer)

From the logs, all requests were either completed or failed too many times. I think that resurrecting will not work, but running it again will, since it will add the 46 failed requests back to the queue. The pagination requests are probably among those 46 failed requests. I can try to improve the scraping of the pagination so it will at least always try more pages.

82society · 2 years ago

Gustavo, I tried resurrecting it, but it stopped right away as Succeeded.

Gustavo (developer)

Try a new run with the same input. This will pick up the failed requests from the previous run and add more pages. The dataset will be a different one, so you will need to merge the results. I can also make the results be stored in a named dataset, so all runs will add to it and you will have all the data in one dataset.
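
For reference, merging the per-run datasets into one named dataset can be done with the Apify API client; a minimal sketch assuming an API token in APIFY_TOKEN, with placeholder dataset IDs and a made-up name for the merged dataset:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder IDs: the default dataset ID of each run you want to merge.
const runDatasetIds = ['<DATASET_ID_1>', '<DATASET_ID_2>'];

// Named datasets persist across runs and can be appended to.
const merged = await client.datasets().getOrCreate('macys-merged');

for (const id of runDatasetIds) {
    let offset = 0;
    for (;;) {
        // listItems is paginated, so page through the whole dataset.
        const { items, total } = await client.dataset(id).listItems({ offset, limit: 1000 });
        if (items.length === 0) break;
        await client.dataset(merged.id).pushItems(items);
        offset += items.length;
        if (offset >= total) break;
    }
}
```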

Gustavo (developer)

These failed requests are normal and happen for different reasons, including the page sometimes taking too long to load, or the actor being blocked until it starts using a new session to bypass the block.

82society · 2 years ago

Okay, thanks Gustavo! Yeah, can you make it so that all the results will be added to the same dataset? Also, if I change the task name, will that affect the dataset being merged?

82society · 2 years ago

Once you answer the question above, I'll run a new task.

Gustavo (developer)

Changing the task name will have no effect on the dataset.

82society · 2 years ago

Hi Gustavo, the tasks stopped again as Succeeded. The first one stopped at 119 results - Run ID GASjg4gDlbQMVKC4Y. The second one stopped at 8690 results - Run ID SS2RsC5CTG1Mv85Ph. I tried to resurrect the second one twice, but it keeps stopping as Succeeded with no additional results.

Why does this keep happening?

Gustavo (developer)

Once there are no requests left to process in the queue, the actor stops running. If the request that handles the pagination fails to load more than 6 times, it is considered processed and further pages are not added to the queue. I need to find out whether Macy's is blocking those requests and bypass it.
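
For context, this retry-and-drop behavior matches how Crawlee-based actors generally work: each request is retried up to a configured limit, after which it is marked as failed and never requeued. A minimal sketch under the assumption that the actor uses Crawlee's CheerioCrawler (the selector and limit here are illustrative, not the actor's real settings):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Once a request fails this many times it is marked as failed and
    // dropped, so any pagination links it would have added are lost.
    maxRequestRetries: 6,
    requestHandler: async ({ enqueueLinks }) => {
        // Hypothetical handler: enqueue the next page, then scrape products.
        await enqueueLinks({ selector: 'a.next-page' });
    },
    failedRequestHandler: async ({ request }) => {
        // Called only after the retry limit is exhausted.
        console.log(`Request ${request.url} failed too many times.`);
    },
});

await crawler.run(['https://www.macys.com/shop/mens-clothing/all-mens-clothing?id=197651&edge=hybrid']);
```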

82society · 2 years ago

So should I wait?


82society · 2 years ago

It says it's under maintenance: "This actor may be unreliable while under maintenance. Would you like to try a similar actor instead?" Should I just wait?

Gustavo, when will you become available to build an external application? Apify is very slow with their responses and not really helpful.

Gustavo (developer)

The actor is still working normally; it was set under maintenance automatically because it took too long to scrape the example input. If you just run the task again with the same input, it should pick up the failed requests and continue scraping, but the results of each run go into a different dataset. I am thinking that I will need to add the option to save into the same dataset, and also create a script to merge the old datasets into a single one. I am working on another project with no defined timeline, so I really don't know when I will be available to create external applications.

Gustavo (developer)

Just found a free actor that already solves this issue: https://apify.com/lukaskrivka/dedup-datasets
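
For reference, a public actor like this can also be started programmatically with the Apify API client. A minimal sketch assuming an API token in APIFY_TOKEN; note that the input keys below (datasetIds, fields, outputDatasetId) are guesses at the dedup actor's schema, so check its README for the real field names:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// All input keys below are assumed; verify them against the actor's README.
const run = await client.actor('lukaskrivka/dedup-datasets').call({
    datasetIds: ['<DATASET_ID_1>', '<DATASET_ID_2>'], // placeholder run dataset IDs
    fields: ['url'],                                  // assumed dedup key field
    outputDatasetId: 'macys-deduped',                 // assumed name for the merged output
});

console.log(`Dedup run finished with status: ${run.status}`);
```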

82society · 2 years ago

Wait, so currently, the failed requests aren't re-attempting the scrape?

Also, is Macy's blocking the request that handles the pagination? It keeps stopping as Succeeded.

Thanks for the recommendation of the free actor. I just contacted its developer to ask how to use it together with yours.

Usually, how much would it cost to build an application like that? Like the external application you described? Can we get on a quick phone call today? I'd really appreciate it.

Gustavo (developer)

We can schedule a call in here so I can explain it to you: https://calendly.com/gustavorudiger/15min

82society · 2 years ago

I just checked the result files, and it appears that it is scraping duplicates.

82society · 2 years ago

Hi Gustavo, I tried re-running a task and it stopped as Succeeded after 7342 results. Run ID iwWXV6aiE4q9RphEy. Also, it's scraping duplicates. My understanding is that the other actor you recommended can get rid of the duplicates.

Gustavo, there are two ongoing and important issues right now that are correlated with each other, pointing to the same cause:

  1. It keeps stopping as Succeeded.
  2. Every time I run a new task, it scrapes duplicates, and I'm being billed for the duplicates.

The problem with the actor you recommended is that I need to provide a dataset ID. Which, I think, means that I have to run the task with your actor first and then provide the dataset to the other actor. So basically, I still have to pay for the duplicates, right?

Can you fix it over the weekend? I can't continue with this actor if it can't be fixed. My time is running low, because every day product data becomes unavailable on Macy's as items go out of stock. So every day counts. Could you please, please fix it.

82society · 2 years ago

Btw, I'm not trying to offend you or threaten that I'm going to leave. You've been nothing but very helpful to me. It's just that I'm quite frustrated with the situation, because I'm very new to this industry and I just need this to function well for my business operations. And I plan to use this for a long time. However, right now time is my enemy, as product data is becoming unavailable day by day on Macy's. Please help.

Gustavo (developer)

Sure, Andrew, I appreciate the feedback so I can improve my actor. The actor should not be saving duplicates anymore; I will take a look and try to fix it.

Gustavo (developer)

I think I have managed to fix it. I found a bug that was preventing some results from being saved.

82society · 2 years ago

Hi Gustavo,

Run IDs: Cf19jwErPefnfBaLP, gXIDq6UWcMug6UYQy

I'm trying to resurrect these two runs, but they stop as Succeeded.

Why does this keep happening? Is there a way to fix this?
