Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Require a field to filter the results

Closed

sai_sampath opened this issue a month ago

For a huge sitemap of 30,000 sites, it's getting difficult for us to go through the results, since I want the data site by site.

Can we get a field to filter the results based on the URL?

Hello and thank you for your interest in this Actor!

You don't need to use the sitemap feature to scrape a website. You can instead simply put the page URL into the Start URLs option. By default, the Actor only crawls the descendants of the start URL, so if I set the start URL to https://docs.apify.com/api/client/js/, the crawler will only crawl the pages under this URL (e.g. https://docs.apify.com/api/client/js/docs/features, but not https://docs.apify.com/api/client/python).
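
For instance, with the apify-client Python package you can start such a run and read back what was crawled. This is a minimal sketch: the token is a placeholder, and the `url` field on the output items reflects my understanding of this Actor's output shape.

```python
from apify_client import ApifyClient

# Placeholder token - use your own Apify API token.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Crawl https://docs.apify.com/api/client/js/ and its descendants only
# (the Actor's default behavior described above).
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.apify.com/api/client/js/"}],
    }
)

# Every dataset item corresponds to one crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"))
```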

You can also use the Include / Exclude globs to change this behavior, e.g. if you only want to process pages under certain paths or with given query parameters.
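
For example, a run input along these lines would limit the crawl to the API docs while skipping the Python client section. The `includeUrlGlobs` / `excludeUrlGlobs` field names and value shape are my assumption; check the Actor's current input schema before relying on them.

```python
# Hypothetical input sketch - verify the glob field names and value format
# against the Actor's input schema.
run_input = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    "includeUrlGlobs": [{"glob": "https://docs.apify.com/api/**"}],
    "excludeUrlGlobs": [{"glob": "https://docs.apify.com/api/client/python/**"}],
}
```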

If you want to scrape everything and only filter the results afterwards, you can do that with third-party tools - Apify allows you to download the data as a CSV, JSON, or XML file, all of which you can process with streaming algorithms, keeping only the relevant data points.
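
As a rough sketch of that streaming approach with the Python client (the dataset ID is a placeholder, and the `url` / `text` item fields are assumptions about this Actor's output shape):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

DATASET_ID = "<DATASET_ID_OF_A_PREVIOUS_RUN>"  # placeholder
PREFIX = "https://docs.apify.com/api/client/js/"

# iterate_items() pages through the dataset lazily, so the full result
# set never has to be loaded into memory at once.
for item in client.dataset(DATASET_ID).iterate_items():
    if item.get("url", "").startswith(PREFIX):
        print(item["url"], len(item.get("text") or ""))
```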

Does this answer your question? If not, let us know about your use case a bit more, so we can offer more relevant tips. Cheers!

sai_sampath

a month ago

Yes, I was talking about the results.

Can we get a result for a specific URL from the list of all URLs without changing the input?

Let me explain my use case a bit better: I crawl all the URLs once, and I sometimes need the data for a single URL. Right now I'm downloading everything and filtering it myself. This change would let me pass a specific URL and get the data related to it, rather than re-crawling it.

Hope this clarifies!

Alright - I understand now! For storage, Apify offers only Datasets and Key-Value Stores. While you could theoretically use a Key-Value Store for this (the key being the URL, the value being the content), this Actor stores its results in a Dataset, which has no indexing.

From my point of view, you'd be much better off with an external database. You can easily download the dataset content as CSV or JSON and import it into an SQL database like PostgreSQL or SQLite. Then you can query the data much faster without generating network traffic, run full-text searches on it, and so on.
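
A minimal sketch of that workflow, assuming you exported the dataset as a `dataset.json` file whose items carry `url` and `text` fields:

```python
import json
import sqlite3

# Load the dataset export downloaded from Apify as JSON.
with open("dataset.json", encoding="utf-8") as f:
    items = json.load(f)

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT OR REPLACE INTO pages (url, text) VALUES (?, ?)",
    [(item.get("url"), item.get("text", "")) for item in items],
)
conn.commit()

# Later: look up a single page by URL without re-crawling anything.
row = conn.execute(
    "SELECT text FROM pages WHERE url = ?",
    ("https://docs.apify.com/api/client/js/docs/features",),
).fetchone()
if row:
    print(row[0][:200])
```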

I'll let our backend devs know about this idea of an Apify-hosted database, but it's definitely a long-term goal (if anything).

Closing this issue for now, but feel free to share any other ideas (or ask additional questions). Cheers!

Maintained by Apify
Actor metrics
  • 2k monthly users
  • 99.9% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 3 days ago