Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Text field has HTML content error

Closed

motivated_leaflet opened this issue
a month ago

For the site https://pay.flywire.com/, the scraper output the full HTML in the text column instead of only the text. The text for the other URLs in the run was fine.

Hello and thank you for your interest in this Actor!

This is expected behaviour; you can see the reason in the Actor logs:

2024-04-12T19:01:02.916Z WARN  Text processing disabled on https://pay.flywire.com/ as it is too large (2111338 chars), returning raw HTML instead...

This is a feature that prevents the Actor from timing out while processing large HTML content.
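
To illustrate the kind of guard the log message describes, here is a minimal sketch. It is not the Actor's actual code: the threshold value, the function names, and the simplified tag-stripping step are all assumptions made for illustration.

```python
import re

# Hypothetical limit; the Actor's real threshold is not stated in this thread.
MAX_HTML_CHARS = 2_000_000


def naive_html_to_text(html: str) -> str:
    """Rough stand-in for real text extraction: strip tags, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def extract_text(html: str, max_chars: int = MAX_HTML_CHARS) -> str:
    """Return extracted text, or the raw HTML unchanged when the page is too large."""
    if len(html) > max_chars:
        # Skip the expensive processing step so the run does not time out.
        return html
    return naive_html_to_text(html)
```

With a guard like this, a page of 2111338 characters would exceed the assumed limit and come back as raw HTML, which matches what the warning in the log reports.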

Looking at the target website (https://pay.flywire.com/), I see that the largest chunk of the content is in the <select> element. If you don't need the data from the dropdown list, you can remove it by adding the select.institutions__dropdown selector to the Remove CSS selectors input option. Note that the whole option is a single CSS selector, so you have to append the new selector to the current value, separated by a comma (,).
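
As a rough sketch of that concatenation, the helper below appends one selector to an existing comma-separated selector string. The helper name, the placeholder value of current_selectors, and the input keys startUrls and removeElementsCssSelector are assumptions here; check them against the Actor's input schema before relying on them.

```python
def append_selector(current: str, extra: str) -> str:
    """Append one CSS selector to an existing comma-separated selector string."""
    current = current.strip().rstrip(",").rstrip()
    extra = extra.strip()
    if not current:
        return extra
    if not extra:
        return current
    return f"{current}, {extra}"


# Placeholder for whatever the "Remove CSS selectors" option already contains.
current_selectors = "nav, footer, script, style"

# Assumed input shape; the key names should be verified against the Actor's input schema.
run_input = {
    "startUrls": [{"url": "https://pay.flywire.com/"}],
    "removeElementsCssSelector": append_selector(
        current_selectors, "select.institutions__dropdown"
    ),
}
```

The resulting value is still one selector string, so the existing removals keep working and the dropdown is removed in addition to them.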

You can see this working in my example run - feel free to copy my input from there.

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

Developer
Maintained by Apify
Actor metrics
  • 1.9k monthly users
  • 99.9% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 3 days ago