contextractor - Trafilatura based
Pricing
Pay per usage
Extract clean, readable content from web pages. Uses Trafilatura, the top-rated extraction library, to strip away navigation, ads, and boilerplate, leaving just the text you need.
You can access contextractor - Trafilatura based programmatically from your own applications using the Apify API. To use the API, you'll need an Apify account and your API token, which you can find under Integrations settings in Apify Console.
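As a sketch of what such a call looks like, the snippet below POSTs an input to the Actor's synchronous run endpoint, as defined in the OpenAPI definition further down. The token value is a placeholder, and the helper names are illustrative, not part of any official client:

```python
import json
import urllib.request

# Placeholder -- substitute your own token from Apify Console (Integrations).
APIFY_TOKEN = "<YOUR_APIFY_TOKEN>"

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "shortc~contextractor"

def build_run_url(token: str) -> str:
    """URL of the run-sync-get-dataset-items endpoint from the OpenAPI definition."""
    return f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?token={token}"

def build_input(urls):
    """startUrls is the only required input field per the Actor's input schema."""
    return {"startUrls": [{"url": u} for u in urls]}

def run_actor(urls, token=APIFY_TOKEN):
    """POST the input and return the Actor's dataset items (makes a network call)."""
    req = urllib.request.Request(
        build_run_url(token),
        data=json.dumps(build_input(urls)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires a valid token; returns the extracted content for one page.
    print(run_actor(["https://example.com"]))
```

Using the official `apify-client` package would work equally well; the raw-HTTP form above is shown only because it maps one-to-one onto the OpenAPI paths below.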
{ "openapi": "3.0.1", "info": { "version": "0.1", "x-build-id": "OjRzQC4PSQsdE11aN" }, "servers": [ { "url": "https://api.apify.com/v2" } ], "paths": { "/acts/shortc~contextractor/run-sync-get-dataset-items": { "post": { "operationId": "run-sync-get-dataset-items-shortc-contextractor", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } }, "/acts/shortc~contextractor/runs": { "post": { "operationId": "runs-sync-shortc-contextractor", "x-openai-isConsequential": false, "summary": "Executes an Actor and returns information about the initiated run in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/runsResponseSchema" } } } } } } }, "/acts/shortc~contextractor/run-sync": { "post": { "operationId": "run-sync-shortc-contextractor", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": 
"Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } } }, "components": { "schemas": { "inputSchema": { "type": "object", "required": [ "startUrls" ], "properties": { "startUrls": { "title": "Start URLs", "type": "array", "description": "URLs to extract content from", "items": { "type": "object", "required": [ "url" ], "properties": { "url": { "type": "string", "title": "URL of a web page", "format": "uri" } } } }, "globs": { "title": "Include URLs (globs)", "type": "array", "description": "Glob patterns matching URLs of pages that will be included in crawling. Setting this option allows you to customize the crawling scope. For example `https://{store,docs}.example.com/**` lets the crawler access all URLs starting with `https://store.example.com/` or `https://docs.example.com/`.", "default": [], "items": { "type": "object", "required": [ "glob" ], "properties": { "glob": { "type": "string", "title": "Glob of a web page" } } } }, "excludes": { "title": "Exclude URLs (globs)", "type": "array", "description": "Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.", "default": [], "items": { "type": "object", "required": [ "glob" ], "properties": { "glob": { "type": "string", "title": "Glob of a web page" } } } }, "pseudoUrls": { "title": "Pseudo-URLs", "type": "array", "description": "Pseudo-URLs to match links in the page that you want to enqueue. Alternative to glob patterns. Combine with Link selector to tell the scraper where to find links.", "default": [], "items": { "type": "object", "required": [ "purl" ], "properties": { "purl": { "type": "string", "title": "Pseudo-URL of a web page" } } } }, "linkSelector": { "title": "Link Selector", "type": "string", "description": "CSS selector for links to enqueue. 
Leave empty to disable link enqueueing.", "default": "" }, "keepUrlFragments": { "title": "Keep URL fragments", "type": "boolean", "description": "URL fragments (the parts of URL after a #) are not considered when the scraper determines whether a URL has already been visited. Turn this on to treat URLs with different fragments as different pages.", "default": false }, "respectRobotsTxtFile": { "title": "Respect robots.txt", "type": "boolean", "description": "If enabled, the crawler will consult the robots.txt file for each domain before crawling pages.", "default": false }, "initialCookies": { "title": "Initial cookies", "type": "array", "description": "Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with `name` and `value` properties. For example: \n\n```json\n[\n {\n \"name\": \"cookieName\",\n \"value\": \"cookieValue\",\n \"path\": \"/\",\n \"domain\": \".example.com\"\n }\n]\n```\n\nYou can use the [EditThisCookie](https://docs.apify.com/academy/tools/edit-this-cookie) browser extension to copy browser cookies in this format, and paste it here.\n\nNote that the value is secret and encrypted to protect your login cookies." }, "customHttpHeaders": { "title": "Custom HTTP headers", "type": "object", "description": "HTTP headers that will be added to all requests made by the crawler. This is useful for setting custom authentication headers or other headers required by the target website. The value is expected to be a JSON object with header names as keys and header values as values. For example: `{ \"Authorization\": \"Bearer token123\", \"X-Custom-Header\": \"value\" }`." }, "maxPagesPerCrawl": { "title": "Max pages", "minimum": 0, "type": "integer", "description": "Maximum pages to crawl. Includes start URLs and pagination pages. The crawler will automatically finish after reaching this number. 
0 means unlimited.", "default": 0 }, "maxResultsPerCrawl": { "title": "Max results", "minimum": 0, "type": "integer", "description": "Maximum number of results that will be saved to dataset. The scraper will terminate after reaching this number. 0 means unlimited.", "default": 0 }, "maxCrawlingDepth": { "title": "Max crawling depth", "minimum": 0, "type": "integer", "description": "Maximum link depth from Start URLs. Pages discovered further from start URLs than this limit will not be crawled. 0 means unlimited.", "default": 0 }, "maxConcurrency": { "title": "Max concurrency", "minimum": 1, "type": "integer", "description": "Maximum number of browser pages running in parallel. This setting is useful to avoid overloading target websites and getting blocked.", "default": 50 }, "maxRequestRetries": { "title": "Max request retries", "minimum": 0, "type": "integer", "description": "Maximum number of retries for failed requests on network, proxy, or server errors.", "default": 3 }, "trafilaturaConfig": { "title": "Trafilatura options", "type": "object", "description": "Trafilatura library extraction settings. Leave empty for balanced defaults. Keys: fast, favorPrecision, favorRecall, includeComments, includeTables, includeImages, includeFormatting, includeLinks, deduplicate, targetLanguage, withMetadata, onlyWithMetadata, teiValidation, pruneXpath." 
}, "saveRawHtmlToKeyValueStore": { "title": "Save raw HTML to key-value store", "type": "boolean", "description": "If enabled, the crawler saves the raw HTML of all pages to the default key-value store and includes the URL link in the dataset output.", "default": false }, "saveExtractedTextToKeyValueStore": { "title": "Save extracted text to key-value store", "type": "boolean", "description": "If enabled, the crawler extracts plain text from all pages, saves it to the key-value store, and includes the URL link in the dataset output.", "default": false }, "saveExtractedJsonToKeyValueStore": { "title": "Save extracted JSON to key-value store", "type": "boolean", "description": "If enabled, the crawler extracts JSON with metadata from all pages, saves it to the key-value store, and includes the URL link in the dataset output.", "default": false }, "saveExtractedMarkdownToKeyValueStore": { "title": "Save extracted Markdown to key-value store", "type": "boolean", "description": "If enabled, the crawler extracts Markdown from all pages, saves it to the key-value store, and includes the URL link in the dataset output.", "default": true }, "saveExtractedXmlToKeyValueStore": { "title": "Save extracted XML to key-value store", "type": "boolean", "description": "If enabled, the crawler extracts XML from all pages, saves it to the key-value store, and includes the URL link in the dataset output.", "default": false }, "saveExtractedXmlTeiToKeyValueStore": { "title": "Save extracted XML-TEI to key-value store", "type": "boolean", "description": "If enabled, the crawler extracts XML-TEI (scholarly format) from all pages, saves it to the key-value store, and includes the URL link in the dataset output.", "default": false }, "datasetName": { "title": "Dataset name", "type": "string", "description": "Name or ID of the dataset for storing results. Leave empty to use the default run dataset." 
}, "keyValueStoreName": { "title": "Key-value store name", "type": "string", "description": "Name or ID of the key-value store for content files. Leave empty to use the default store." }, "requestQueueName": { "title": "Request queue name", "type": "string", "description": "Name of the request queue for pending URLs. Leave empty to use the default queue." }, "proxyConfiguration": { "title": "Proxy configuration", "type": "object", "description": "Enables loading websites from IP addresses in specific geographies and to circumvent blocking." }, "proxyRotation": { "title": "Proxy rotation", "enum": [ "RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE" ], "type": "string", "description": "Proxy rotation strategy. RECOMMENDED automatically picks the best proxies. PER_REQUEST uses a new proxy for each request. UNTIL_FAILURE uses one proxy until it fails.", "default": "RECOMMENDED" }, "pageLoadTimeoutSecs": { "title": "Page load timeout", "minimum": 1, "type": "integer", "description": "Maximum time to wait for page load in seconds", "default": 60 }, "waitUntil": { "title": "Navigation wait until", "enum": [ "NETWORKIDLE", "LOAD", "DOMCONTENTLOADED" ], "type": "string", "description": "When to consider navigation finished", "default": "NETWORKIDLE" }, "launcher": { "title": "Browser type", "enum": [ "CHROMIUM", "FIREFOX" ], "type": "string", "description": "Browser to use for crawling", "default": "CHROMIUM" }, "headless": { "title": "Headless mode", "type": "boolean", "description": "Run browser in headless mode", "default": true }, "ignoreCorsAndCsp": { "title": "Ignore CORS and CSP", "type": "boolean", "description": "Ignore Content Security Policy and Cross-Origin Resource Sharing restrictions. 
Enables free XHR/Fetch requests from pages.", "default": false }, "closeCookieModals": { "title": "Close cookie modals", "type": "boolean", "description": "Automatically dismiss cookie consent modals", "default": false }, "maxScrollHeightPixels": { "title": "Max scroll height", "minimum": 0, "type": "integer", "description": "Maximum pixels to scroll down the page until all content is loaded. Setting to 0 disables scrolling.", "default": 5000 }, "ignoreSslErrors": { "title": "Ignore SSL errors", "type": "boolean", "description": "Ignore SSL certificate errors. Use at your own risk.", "default": false }, "debugLog": { "title": "Debug log", "type": "boolean", "description": "Include debug messages in the log output.", "default": false }, "browserLog": { "title": "Browser log", "type": "boolean", "description": "Include browser console messages in the log. May flood logs with errors at high concurrency.", "default": false } } }, "runsResponseSchema": { "type": "object", "properties": { "data": { "type": "object", "properties": { "id": { "type": "string" }, "actId": { "type": "string" }, "userId": { "type": "string" }, "startedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "finishedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "status": { "type": "string", "example": "READY" }, "meta": { "type": "object", "properties": { "origin": { "type": "string", "example": "API" }, "userAgent": { "type": "string" } } }, "stats": { "type": "object", "properties": { "inputBodyLen": { "type": "integer", "example": 2000 }, "rebootCount": { "type": "integer", "example": 0 }, "restartCount": { "type": "integer", "example": 0 }, "resurrectCount": { "type": "integer", "example": 0 }, "computeUnits": { "type": "integer", "example": 0 } } }, "options": { "type": "object", "properties": { "build": { "type": "string", "example": "latest" }, "timeoutSecs": { "type": "integer", "example": 300 }, "memoryMbytes": 
{ "type": "integer", "example": 1024 }, "diskMbytes": { "type": "integer", "example": 2048 } } }, "buildId": { "type": "string" }, "defaultKeyValueStoreId": { "type": "string" }, "defaultDatasetId": { "type": "string" }, "defaultRequestQueueId": { "type": "string" }, "buildNumber": { "type": "string", "example": "1.0.0" }, "containerUrl": { "type": "string" }, "usage": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "integer", "example": 1 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } }, "usageTotalUsd": { "type": "number", "example": 0.00005 }, "usageUsd": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "number", "example": 0.00005 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", 
"example": 0 } } } } } } } } }}

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.
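As a concrete illustration of the inputSchema above, here is one possible input payload. The field names come from the schema; the URLs and glob are placeholders, and every field except `startUrls` is optional:

```python
import json

# Example input assembled from the Actor's inputSchema.
# startUrls is the only required field; everything else is optional.
actor_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],   # required
    "globs": [{"glob": "https://docs.example.com/**"}],    # keep the crawl on one subdomain
    "maxPagesPerCrawl": 20,                                # 0 would mean unlimited
    "saveExtractedMarkdownToKeyValueStore": True,          # the default output format
    "trafilaturaConfig": {                                 # passed through to Trafilatura
        "favorPrecision": True,
        "includeTables": True,
    },
}

print(json.dumps(actor_input, indent=2))
```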
OpenAPI works well with AI agents and GPTs because it standardizes how these systems interact with APIs, enabling reliable integrations and efficient communication.
By defining machine-readable API specifications, OpenAPI allows AI models like GPTs to understand and use varied data sources, improving accuracy. This accelerates development, reduces errors, and provides context-aware responses, making OpenAPI a core component for AI applications.
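This machine-readability can be demonstrated directly: the sketch below walks a trimmed copy of the definition above and enumerates its operations, which is essentially what an agent does when deciding which endpoint to call:

```python
import json

# A trimmed fragment of the Actor's OpenAPI definition shown above.
SPEC = json.loads("""
{
  "openapi": "3.0.1",
  "servers": [{"url": "https://api.apify.com/v2"}],
  "paths": {
    "/acts/shortc~contextractor/run-sync-get-dataset-items": {
      "post": {"operationId": "run-sync-get-dataset-items-shortc-contextractor"}
    },
    "/acts/shortc~contextractor/runs": {
      "post": {"operationId": "runs-sync-shortc-contextractor"}
    },
    "/acts/shortc~contextractor/run-sync": {
      "post": {"operationId": "run-sync-shortc-contextractor"}
    }
  }
}
""")

def list_operations(spec):
    """Enumerate (HTTP method, full URL, operationId) triples from an OpenAPI document."""
    base = spec["servers"][0]["url"]
    return [
        (method.upper(), base + path, op["operationId"])
        for path, methods in spec["paths"].items()
        for method, op in methods.items()
    ]

for method, url, op_id in list_operations(SPEC):
    print(method, url, op_id)
```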
You can download the OpenAPI definitions for contextractor - Trafilatura based from the options below.
If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.
You can also check out our other API clients.