Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.6 (39)

Total users: 54K
Monthly users: 8K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 3 days ago

Crawler cannot get all the text content

Closed

hhhhhz opened this issue a year ago

I tried to use the Apify Python client to crawl the URL "https://roundrockisd.org/graduation/top-10/whs-2023-top-10/", but I only get information for 8 students, and the text is cut off.

I checked other issues mentioning similar problems. All of the suggested solutions are to set the "htmlTransformer" parameter to "none", but that does not work for me.

Here is the end of the scraping results:

"Jyotsna Arunkumar – No. 8
What campuses in Round Rock ISD did you attend?
Canyon Vista Middle School, and Westwood High School.
What’s the best memory you have of your time in Round Rock ISD? Why?
The FBLA State trip to Galveston my senior year. I loved spending time with friends – walking an hour after we missed our bus by 30 seconds, sharing a tub of melted ice cream, seeing jellyfish at the aquarium, and watching movies into the night stand out among many memorable moments.
Who has been your most influential teacher in Round Rock ISD? Why?
So many teachers have had a positive influence on me. Thank you Mrs. Key for making IB such a welcoming and positive experience. To Mrs. Howi"

My parameters are:

run_input = {
    "startUrls": [{"url": url}],
    "useSitemaps": False,
    "crawlerType": "playwright:firefox",
    "includeUrlGlobs": [],
    "excludeUrlGlobs": [],
    "ignoreCanonicalUrl": False,
    "maxCrawlDepth": 0,
    "maxCrawlPages": 1,
    "initialConcurrency": 0,
    "maxConcurrency": 200,
    "initialCookies": [],
    ... [trimmed]

jindrich.bar

Hello and thank you for your interest in this Actor!

This is curious - I tried running the Actor with htmlTransformer: "none" and got the data for all the students. Check out my example run - perhaps there might be some issue in the rest of your code?
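
For anyone reproducing this, here is a minimal sketch of such a single-page run with the Apify Python client. It is not the script from the example run. The input only mirrors the parameters quoted above plus htmlTransformer set to "none", the helper names `build_run_input` and `crawl_page` are made up for illustration, and the API token is something you have to supply yourself.

```python
from typing import Any


def build_run_input(url: str) -> dict[str, Any]:
    """Single-page input mirroring the parameters quoted in the issue."""
    return {
        "startUrls": [{"url": url}],
        "useSitemaps": False,
        "crawlerType": "playwright:firefox",
        "maxCrawlDepth": 0,
        "maxCrawlPages": 1,
        # The setting the reply reports using for the successful run.
        "htmlTransformer": "none",
    }


def crawl_page(token: str, url: str) -> list[dict[str, Any]]:
    """Run the Actor once and return its dataset items.

    Needs the `apify-client` package; it is imported lazily so that
    build_run_input() can be used (and tested) without it.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(url)
    )
    if run is None:
        raise RuntimeError("The Actor run did not return a result")
    # Each dataset item describes one crawled page.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `crawl_page(token, url)` should return one item for the page. Inspecting that item's text directly, before any further processing, shows whether the content is already short when it leaves the Actor.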

I'll close this issue now, but feel free to ask additional questions if you have any. We'd be more than happy to see the whole script - maybe there are some issues in the Apify Python Client we don't know about yet! :)
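
To narrow down where the text gets lost, it can help to run a quick check on the raw text of the dataset item and again on whatever the rest of the script produces. The sketch below is only a heuristic. It assumes each student heading looks like the "Jyotsna Arunkumar – No. 8" line quoted in the issue and that the page lists ten students, as its URL suggests. The function names are hypothetical.

```python
import re

# Headings in the quoted output look like "Jyotsna Arunkumar – No. 8".
RANK_PATTERN = re.compile(r"–\s*No\.\s*(\d+)")


def found_ranks(text: str) -> list[int]:
    """Return the sorted student ranks that appear in the text."""
    return sorted({int(n) for n in RANK_PATTERN.findall(text)})


def missing_ranks(text: str, expected: int = 10) -> list[int]:
    """Return the ranks from 1..expected that do not appear in the text."""
    present = set(found_ranks(text))
    return [n for n in range(1, expected + 1) if n not in present]


def looks_cut_off(text: str) -> bool:
    """Heuristic: True if the text does not end with sentence punctuation."""
    stripped = text.rstrip().rstrip("\"'”)")
    return not stripped.endswith((".", "!", "?"))
```

If the item's text passes both checks but the final output of the script does not, the truncation happens after the crawl, for example in how the result is stored or printed.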

Cheers!