Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

How to scrape all FAQs on a page that requires each FAQ to be clicked separately?

Closed

ollieiq opened this issue
11 days ago

I am trying to scrape the FAQs (both the questions and the answers) from https://www.philips-hue.com/en-us/support/product/philips-hue-system/100005 and similar pages. I can grab all the questions, but only the last FAQ answer on the page is captured, even with the .collapse__header selector set in the Expand clickable elements field. Please advise on how I can remedy this.

Hello and thank you for your interest in this Actor!

Unfortunately, JS navigation and dropdowns are one of the weak spots of Website Content Crawler. No standard defines how these should be implemented, so different pages handle them differently. As a result, the Actor has fairly limited support for interacting with on-page elements.

The good news is that it's pretty simple to download the data from this page using Cheerio Crawler: before the JavaScript is executed on the page, the questions and answers are stored in the :info attribute of the <faq> element. You can simply parse that attribute and store the results in a Dataset. Check out my example run here - feel free to copy the input and experiment with it. And definitely let me know if you have any other questions (regarding this website or others).
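To illustrate the idea of reading the :info attribute from the raw HTML, here is a minimal sketch. It is not the Cheerio Crawler code from the example run: it uses only Python's standard library instead. It assumes the attribute value is JSON, which the thread does not confirm, so it falls back to the raw string when decoding fails. The names FaqInfoParser, extract_faq_info and SAMPLE_HTML, as well as the sample markup and its keys, are hypothetical.

```python
import json
from html.parser import HTMLParser


class FaqInfoParser(HTMLParser):
    """Collects the raw value of the `:info` attribute from every <faq> tag."""

    def __init__(self):
        super().__init__()
        self.raw_values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and unescapes
        # HTML entities in attribute values before calling this method.
        if tag != "faq":
            return
        for name, value in attrs:
            if name == ":info" and value is not None:
                self.raw_values.append(value)


def extract_faq_info(html):
    """Return one item per <faq :info="..."> element found in `html`.

    ASSUMPTION: the attribute holds JSON. If it does not decode,
    the raw string is returned unchanged so nothing is lost.
    """
    parser = FaqInfoParser()
    parser.feed(html)
    parser.close()
    results = []
    for raw in parser.raw_values:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            results.append(raw)
    return results


# Hypothetical markup; the real page's structure and keys may differ.
SAMPLE_HTML = """
<div>
  <faq :info="{&quot;question&quot;: &quot;Example question?&quot;, &quot;answer&quot;: &quot;Example answer.&quot;}"></faq>
</div>
"""

if __name__ == "__main__":
    for item in extract_faq_info(SAMPLE_HTML):
        print(item)
```

In a real run you would pass the downloaded page source to extract_faq_info and inspect the decoded structure first, since the actual layout of the data inside :info is not described in this thread.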

I'll close this issue now (but as I said, still feel free to ping me :)) Cheers!

tm_oiq

11 days ago

Hello and thank you for your quick and helpful response to my inquiry! My ultimate goal is to scrape all the data in the Support section of the Philips Hue portal and the support portals of similar IoT device manufacturers. Is there an Actor you would recommend for that? The URL I am starting from is https://www.philips-hue.com/en-us/support/faq

I would like to grab all the FAQs and articles (both text and PDF) that are linked from the main support page, as well as all pages linked from those (some of the images link to setup guides, etc.). I'm just getting started with scraping, so any direction you can point me in is greatly appreciated!

Developer
Maintained by Apify
Actor metrics
  • 2k monthly users
  • 99.9% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 3 days ago