Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Hey, can you please have a look at this table: https://app.clay.com/workspaces/349620/workbooks/wb_2UMetbkw9uuk/tables/t_YsGt2h2AbpgC/views/gv_Z4aJukiDFRio
I am trying to pass a domain as input to the Apify Actor. This is what I tested, and it worked:
{
  "aggressivePrune": false,
  "clickElementsCssSelector": "[aria-expanded=\"false\"]",
  "clientSideMinChangePercentage": 15,
  "crawlerType": "cheerio",
  "debugLog": false,
  "debugMode": false,
  "expandIframes": true,
  "ignoreCanonicalUrl": false,
  "includeUrlGlobs": [ { "glob": "" } ],
  "keepUrlFragments": true,
  "maxCrawlPages": 500,
  "proxyConfiguration": { "useApifyProxy": true },
  "readableTextCharThreshold": 100,
  "removeCookieWarnings": true,
  "removeElementsCssSelector": "nav, footer, script, style, noscript, svg,\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]",
  "renderingTypeDetectionPercentage": 10,
  "saveFiles": false,
  "saveHtml": false,
  "saveHtmlAsFile": false,
  "saveMarkdown": true,
  "saveScreenshots": false,
  "startUrls": [
    {
      "url": "https://onepilot.co/", // Not dynamic: tested and sent from Clay like this, and it worked.
      "method": "GET"
    }
  ],
  "useSitemaps": true
}
When I use a dynamic domain instead, I keep getting this error: { "error": { "type": "invalid-i... [trimmed]
I thought this would help you understand the issue better.
Hi, thank you for using Website Content Crawler!
I’m not entirely sure if I understand the issue fully.
You’re using Clay to call the Website Content Crawler. When you specify {"startUrls": [{"url": "https://onepilot.co/"}]}, it works fine. However, when you use the variable domainInput, you get an error: “Cannot parse input JSON body. Bad control character.”
For example: {"startUrls": [{"url": "${domainInput}", "method": "GET"}]}
From what I understand, the issue originates in Clay itself, and the call from Clay to Apify never happens at all. Is that correct?
I’m not very familiar with Clay’s integration, so in the meantime I’ll raise your question internally to get more insight.
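As background on the error message: a "Bad control character" complaint is what JSON parsers typically report when a raw newline or tab ends up inside a string literal. The Python sketch below reproduces that situation under an assumption the thread does not confirm, namely that the substituted cell value carries a trailing newline. The helper names and the sample cell value are hypothetical; only the startUrls shape comes from the messages above.

```python
import json


def build_input_by_template(domain_input: str) -> str:
    # Naive string substitution, similar in spirit to "${domainInput}"
    # being pasted into a JSON template without any escaping.
    return '{"startUrls": [{"url": "' + domain_input + '", "method": "GET"}]}'


def build_input_safely(domain_input: str) -> str:
    # Strip surrounding whitespace, then let json.dumps escape the rest.
    payload = {"startUrls": [{"url": domain_input.strip(), "method": "GET"}]}
    return json.dumps(payload)


def is_valid_json(body: str) -> bool:
    # True if the body parses as JSON, False otherwise.
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical cell value with a trailing newline (assumption).
cell_value = "https://onepilot.co/\n"

print(is_valid_json(build_input_by_template(cell_value)))  # False
print(is_valid_json(build_input_safely(cell_value)))       # True
```

If the cell value turns out to be clean, the cause lies elsewhere, so this is only one possible explanation to rule out.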
Hi, I was able to reproduce your issue in Clay. I set up the Website Content Crawler to enrich a table in Clay.
When integrating this Actor, you need to provide input in JSON format. I’m not sure how to specify a variable in the Input Data, but I was able to insert a column using a forward slash /. Please see the attached screenshot, and note the double quotes around the column name!
Another potential issue is that the Website Content Crawler accepts URLs, not domain names. So, for a domain like "clay.com", you’ll need to find a specific URL. You can automate this with, for example, the Google Search Actor.
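A simpler alternative to a search step, shown here only as a sketch, is to prefix bare domains with a scheme before sending them as start URLs. This is not the Google Search Actor approach mentioned above, and it assumes the site is reachable over HTTPS at the bare domain, which may not hold for every entry. The function name to_start_url is hypothetical.

```python
from urllib.parse import urlparse


def to_start_url(value: str) -> str:
    # Remove whitespace and newlines that may come from a table cell.
    value = value.strip()
    parsed = urlparse(value)
    # Leave values that already look like full HTTP(S) URLs untouched.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return value
    # Assumption: the bare domain is served over HTTPS.
    return "https://" + value.lstrip("/")


# Build the startUrls part of the Actor input from a bare domain.
run_input = {"startUrls": [{"url": to_start_url("clay.com"), "method": "GET"}]}
print(run_input["startUrls"][0]["url"])  # https://clay.com
```

Values that are already full URLs, such as https://onepilot.co/, pass through unchanged.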
Hope this helps! I’ll close this issue now, but feel free to reopen it or ask any additional questions.
- 3.8k monthly users
- 636 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago