Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Timeout setting does not work

Closed

manuel3 opened this issue

[Need Help]

I set a timeout, but it does not work. Even with the timeout setting on, the run sometimes keeps going and uses a lot of my balance...

What could I do?

It happens frequently.

Jindřich Bär (jindrich.bar)

Hello and thank you for your interest in this Actor!

Would you mind sharing a specific run where this happens? (Sharing only the Run ID is fine.) The run linked to this issue has the limits (maxResults and maxCrawlPages) set to default and behaves as expected.

Note that the requestTimeoutSecs option sets the timeout per request (crawling one page), not the entire Actor Run. requestTimeoutSecs mostly protects the Actor from malformed (or too large) pages that take too long to parse and process.

You can also set a run-wide timeout limit in the bottom-most section of the input schema (Run Options > Timeout). Note that this means that the Apify Platform kills the Actor once it exceeds the given time - this is great for keeping tabs on the Platform usage ($$$), but also might result in incomplete results in your dataset (the Actor might not be able to finish its job).
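
The difference between the two limits can be sketched in Python. This is only an illustration, assuming the run is started through the Apify Python client, whose call() method takes a run-wide timeout_secs keyword argument. The helper name build_call_kwargs and the example URL are hypothetical.

```python
def build_call_kwargs(start_url, request_timeout_secs, run_timeout_secs):
    """Build keyword arguments for an Actor call, keeping the two timeouts apart.

    build_call_kwargs is a hypothetical helper, not part of any Apify library.
    """
    run_input = {
        "startUrls": [{"url": start_url}],
        # Per-request limit: part of the Actor input, applies to each page.
        "requestTimeoutSecs": request_timeout_secs,
    }
    return {
        "run_input": run_input,
        # Run-wide limit: a platform option, passed outside the Actor input.
        "timeout_secs": run_timeout_secs,
    }


kwargs = build_call_kwargs("https://example.com/docs", 60, 180)
# With a real client this would be used roughly as:
#   client.actor('apify/website-content-crawler').call(**kwargs)
```

Here each page gets at most 60 seconds, while the platform stops the whole run after 180 seconds regardless of how many pages are left.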

But again, if you have encountered something that doesn't feel right, please share the run id here. Thanks!

manuel3

Thank you for replying.

As you say, I want to set it to kill the Actor once it exceeds the time. I set it to 180 seconds, but, for example, this run went on for 4 hours until I cancelled it.

yxKmXu6URp8v7g7Nt

manuel3

What could I do?

Jindřich Bär (jindrich.bar)

Thank you for the additional information!

Looking at the run you linked, I can see that the (hard) timeout for this run was the default 360 000 seconds. I can also see that you didn't start this Run from the web console, but via the API (from a Python script perhaps?).

Note that the timeout set in the Input schema (in the web console) applies only to runs that you start from the web console. If you want to start a run with a hard timeout from your Python script, you need to pass the timeout option from there (e.g. see the documentation for the ActorClient.start() method in our Python client; you can pass the named argument timeout_secs there).

If you are making the API calls yourself in your script, you can pass the query parameter timeout (see documentation). However, we strongly recommend you use the Apify Client for Python - it provides much nicer DX.
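
For the raw API route, a minimal sketch of where the timeout query parameter goes might look like this. Only the timeout parameter comes from this thread. The base URL, the `~` separator in the Actor ID, and the Bearer token header are assumptions about the Apify API that should be checked against the documentation. The function only builds the request and does not send it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def build_run_request(actor_id, run_input, timeout_secs, token):
    """Build (but do not send) a POST request that starts an Actor run.

    The run-wide limit travels as the `timeout` query parameter,
    while the Actor input travels as the JSON body.
    """
    # Assumption: the API expects 'username~actor-name' instead of 'username/actor-name'.
    actor_path = quote(actor_id.replace("/", "~"), safe="~")
    query = urlencode({"timeout": timeout_secs})
    url = f"{API_BASE}/acts/{actor_path}/runs?{query}"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is accepted as a Bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "apify/website-content-crawler",
    {"startUrls": [{"url": "https://example.com/docs"}]},
    timeout_secs=180,
    token="MY_API_KEY",  # placeholder
)
# urllib.request.urlopen(request) would send it.
```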

TLDR: pass the timeout option with every (Actor start) API call you make. Let me know how it went!
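
One way to make sure the timeout is never forgotten is a thin wrapper that refuses to start a run without one. This is a sketch, not part of the Apify client: start_with_timeout is a hypothetical helper, and actor_client stands for any object with a start(run_input=..., timeout_secs=...) method, such as the ActorClient mentioned above.

```python
def start_with_timeout(actor_client, run_input, *, timeout_secs):
    """Start an Actor run, requiring an explicit run-wide timeout.

    timeout_secs is keyword-only and has no default, so every call site
    has to state the limit it wants.
    """
    if timeout_secs is None or timeout_secs <= 0:
        raise ValueError("timeout_secs must be a positive number of seconds")
    # The timeout is passed next to run_input, not inside it.
    return actor_client.start(run_input=run_input, timeout_secs=timeout_secs)


# With a real client this would be used roughly as:
#   actor = ApifyClient('MY_API_KEY').actor('apify/website-content-crawler')
#   run = start_with_timeout(actor, {'startUrls': [...]}, timeout_secs=180)
```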

manuel3

Thank you for your reply. Following your advice, I wrote the code below with "timeout_secs", but the duration is still uncontrollable...

from apify_client import ApifyClient

apify_client = ApifyClient('MY_API_KEY')

actor_call = apify_client.actor('apify/website-content-crawler').call(run_input={
    'startUrls': [{ 'url': 'https://www.sakataseed.co.jp/special/korotan/howto/' }],
    'maxRequestsPerCrawl': 1,
    'maxCrawlingDepth': 1,
    'timeout_secs': 30
})

dataset_items = apify_client.dataset(actor_call['defaultDatasetId']).list_items().items

for item in dataset_items:
    print(item['url'])
    print(item['text'])
    print('---')

manuel3

I will attach the code file too here.

Jindřich Bär (jindrich.bar)

Hello again!

Note that timeout_secs is not a part of the Actor input (run_input), it's a separate keyword argument to the call method (see docs). The following code should work as expected:

from apify_client import ApifyClient

apify_client = ApifyClient('MY-API-KEY')

actor_call = apify_client.actor('apify/website-content-crawler').call(
    run_input={
        'startUrls': [{
            'url': 'https://www.sakataseed.co.jp/special/korotan/howto/'
        }],
        'maxRequestsPerCrawl': 1,
        'maxCrawlingDepth': 1,
    },
    timeout_secs=30  # timeout_secs is a separate keyword argument
)
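
A small check like the following could catch this mistake before a run is started. It is a hypothetical helper, not something the Apify client provides: RUN_OPTION_KEYS and find_misplaced_options are illustrative names, and the set lists only the one option discussed in this thread.

```python
# Options that belong next to run_input, not inside it.
# Only timeout_secs is covered here; extend the set as needed.
RUN_OPTION_KEYS = {"timeout_secs"}


def find_misplaced_options(run_input):
    """Return run options that were put into the Actor input by mistake."""
    return sorted(RUN_OPTION_KEYS & run_input.keys())


broken_input = {
    "startUrls": [{"url": "https://www.sakataseed.co.jp/special/korotan/howto/"}],
    "maxRequestsPerCrawl": 1,
    "maxCrawlingDepth": 1,
    "timeout_secs": 30,  # misplaced: this key sits inside the Actor input
}
fixed_input = {k: v for k, v in broken_input.items() if k not in RUN_OPTION_KEYS}

print(find_misplaced_options(broken_input))
print(find_misplaced_options(fixed_input))
```

The first call reports timeout_secs as misplaced, and the second reports nothing once the key has been moved out of the input.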

Once again, thank you for your patience! This kind of feedback is very important to us - we'll look at what we can do regarding the documentation - we'd love to make it more approachable. Thanks again!

Developer

Apify

Actor metrics

2k monthly users
99.9% runs succeeded
2.9 days response time
Created in Mar 2023
Modified 3 days ago

Categories

Developer tools

Business
