Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

I can't send data from Clay to Apify.

Closed

romeoman opened this issue
19 days ago

Hey, can you please have a look at this table: https://app.clay.com/workspaces/349620/workbooks/wb_2UMetbkw9uuk/tables/t_YsGt2h2AbpgC/views/gv_Z4aJukiDFRio

I am trying to pass a domain as input to the Apify Actor. This is what I tested, and it worked:

{ "aggressivePrune": false, "clickElementsCssSelector": "[aria-expanded="false"]", "clientSideMinChangePercentage": 15, "crawlerType": "cheerio", "debugLog": false, "debugMode": false, "expandIframes": true, "ignoreCanonicalUrl": false, "includeUrlGlobs": [ { "glob": "" } ], "keepUrlFragments": true, "maxCrawlPages": 500, "proxyConfiguration": { "useApifyProxy": true }, "readableTextCharThreshold": 100, "removeCookieWarnings": true, "removeElementsCssSelector": "nav, footer, script, style, noscript, svg,\n[role="alert"],\n[role="banner"],\n[role="dialog"],\n[role="alertdialog"],\n[role="region"][aria-label*="skip" i],\n[aria-modal="true"]", "renderingTypeDetectionPercentage": 10, "saveFiles": false, "saveHtml": false, "saveHtmlAsFile": false, "saveMarkdown": true, "saveScreenshots": false, "startUrls": [ { "url": "https://onepilot.co/", // NOT using dynamic, Tested and send it from clay like this and it worked. "method": "GET" } ], "useSitemaps": true }

When I use a dynamic domain instead, I keep getting this error:

{ "error": { "type": "invalid-i... [trimmed]

romeoman

19 days ago

I thought this would help you understand it better.

jiri.spilka

Hi, thank you for using Website Content Crawler!

I’m not entirely sure if I understand the issue fully.

You’re using Clay to call the Website Content Crawler. When you specify {"startUrls": [{"url": "https://onepilot.co/"}]}, it works fine. However, when you use the variable domainInput, you get an error: “Cannot parse input JSON body. Bad control character.”

For example: {"startUrls": [{"url": "${domainInput}", "method": "GET"}]}

From what I understand, this issue originates in Clay itself and a call from Clay to Apify is not happening at all. Is that correct?

I’m not very familiar with Clay’s integration, so in the meantime I’ll raise your question internally to get more insight.
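
To illustrate one way a “Bad control character” error can arise, here is a minimal sketch in Python. It assumes the dynamic value carries a stray newline (the actual cause inside Clay is not confirmed), and the helper names `build_input_naive` and `build_input_safe` are hypothetical. Pasting the raw value into the JSON text produces an unparseable body, whereas serializing the structure escapes it correctly.

```python
import json


def build_input_naive(domain_input: str) -> str:
    # Plain string substitution: the value is pasted into the JSON text unescaped.
    return '{"startUrls": [{"url": "' + domain_input + '", "method": "GET"}]}'


def build_input_safe(domain_input: str) -> str:
    # Strip stray whitespace, then let the serializer handle quoting and escaping.
    payload = {"startUrls": [{"url": domain_input.strip(), "method": "GET"}]}
    return json.dumps(payload)


# Hypothetical cell value with a trailing newline (an assumption, not taken from the thread).
cell_value = "https://onepilot.co/\n"

try:
    json.loads(build_input_naive(cell_value))
    naive_ok = True
except json.JSONDecodeError:
    # A raw control character inside a JSON string is rejected by strict parsers.
    naive_ok = False

safe_input = json.loads(build_input_safe(cell_value))
print(naive_ok)
print(safe_input["startUrls"][0]["url"])
```

If the platform only supports templating into a text field, trimming whitespace and line breaks from the source column before substitution should have a similar effect.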

jiri.spilka

Hi, I was able to reproduce your issue in Clay. I set up the Website Content Crawler to enrich a table in Clay.

When integrating this Actor, you need to provide input in JSON format. I’m not sure how to specify a variable in the Input Data, but I was able to insert a column using a forward slash /. Please see the attached screenshot - note the double quotes around the column name!

Another potential issue is that the Website Content Crawler accepts URLs, not domain names. So, you’ll need to find a specific URL for a domain like "clay.com". You can automate this with, for example, the Google Search Actor.
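
As a simpler alternative to a search step, a bare domain can be turned into a URL by adding a scheme. The sketch below is only an illustration: `to_start_url` is a hypothetical helper, and it assumes the site is reachable over HTTPS at the bare domain, which it does not verify. It also drops any query string or fragment.

```python
from urllib.parse import urlparse


def to_start_url(value: str) -> str:
    """Turn a bare domain or a URL into a URL with a scheme and a path."""
    value = value.strip()
    if not value:
        raise ValueError("empty domain or URL")
    # urlparse only recognises the host when a scheme is present.
    if "://" not in value:
        value = "https://" + value  # assumption: the site is served over HTTPS
    parsed = urlparse(value)
    if not parsed.netloc:
        raise ValueError(f"cannot find a host in {value!r}")
    path = parsed.path or "/"
    return f"{parsed.scheme}://{parsed.netloc}{path}"


start_urls = [{"url": to_start_url("clay.com"), "method": "GET"}]
print(start_urls)
```

The resulting list can be used as the `startUrls` value in the Actor input. If the bare domain redirects elsewhere or does not serve HTTPS, a lookup step such as the Google Search Actor mentioned above is still needed.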

Hope this helps! I’ll close this issue now, but feel free to reopen it or ask any additional questions.

Developer
Maintained by Apify
Actor metrics
  • 3.8k monthly users
  • 636 stars
  • 100.0% runs succeeded
  • 2.7 days response time
  • Created in Mar 2023
  • Modified 7 days ago