Pricing

Pay per usage

Try for free

Go to Apify Store

Smart Article Extractor

Try for free

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Pricing

Pay per usage

Rating

5.0

(8)

Developer

Lukáš Křivka

Actor stats

183

Bookmarked

7.1K

Total users

422

Monthly active users

12 days ago

Last modified

Categories

News

You can access the Smart Article Extractor programmatically from your own applications by using the Apify API. You can also choose the language preference from below. To use the Apify API, you’ll need an Apify account and your API token, found in Integrations settings in Apify Console.

Python

JavaScript

CLI

OpenAPI

HTTP

MCP

{
  "openapi": "3.0.1",
  "info": {
    "version": "1.0",
    "x-build-id": "CEjSuzFP2f2spLgiq"
  },
  "servers": [
    {
      "url": "https://api.apify.com/v2"
    }
  ],
  "paths": {
    "/acts/lukaskrivka~article-extractor-smart/run-sync-get-dataset-items": {
      "post": {
        "operationId": "run-sync-get-dataset-items-lukaskrivka-article-extractor-smart",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    },
    "/acts/lukaskrivka~article-extractor-smart/runs": {
      "post": {
        "operationId": "runs-sync-lukaskrivka-article-extractor-smart",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor and returns information about the initiated run in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/runsResponseSchema"
                }
              }
            }
          }
        }
      }
    },
    "/acts/lukaskrivka~article-extractor-smart/run-sync": {
      "post": {
        "operationId": "run-sync-lukaskrivka-article-extractor-smart",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "inputSchema": {
        "type": "object",
        "properties": {
          "startUrls": {
            "title": "Website/category URLs",
            "type": "array",
            "description": "These could be the main page URL or any category/subpage URL, e.g. https://www.bbc.com/. Article pages are detected and crawled from these. If you prefer to use direct article URLs, use `articleUrls` input instead",
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "articleUrls": {
            "title": "Article URLs",
            "type": "array",
            "description": "These are direct URLs for the articles to be extracted, e.g. https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.",
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "onlyNewArticles": {
            "title": "Only new articles (only for small runs)",
            "type": "boolean",
            "description": "This option is only viable for smaller runs. If you plan to use this on a large scale, use the 'Only new articles (saved per domain)' option below instead. If this function is selected, the extractor will only scrape new articles each time you run it. (Scraped URLs are saved in a dataset named `articles-state`, and are compared with new ones.)",
            "default": false
          },
          "onlyNewArticlesPerDomain": {
            "title": "Only new articles (saved per domain, preferable)",
            "type": "boolean",
            "description": "If this function is selected, the extractor will only scrape only new articles each time you run it. (Scraped articles are saved in one dataset, named 'ARTICLES-SCRAPED-domain', per each domain, and compared with new ones.)",
            "default": false
          },
          "onlyInsideArticles": {
            "title": "Only inside domain articles",
            "type": "boolean",
            "description": "If this function is selected, the extractor will only scrape articles that are on the domain from where they are linked. If the domain presents links to articles on different domains, those articles will not be scraped, e.g. https://www.bbc.com/ vs. https://www.bbc.co.uk/.",
            "default": true
          },
          "enqueueFromArticles": {
            "title": "Enqueue articles from articles",
            "type": "boolean",
            "description": "Normally, the scraper only extracts articles from category pages. This option allows the scraper to also extract articles linked within articles.",
            "default": false
          },
          "crawlWholeSubdomain": {
            "title": "Crawl whole subdomain (same base as Start URL)",
            "type": "boolean",
            "description": "Automatically enqueue categories and articles from whole subdomain with the same path. E.g. if Start URL is https://apify.com/store, it will enqueue all pages starting with https://apify.com/store",
            "default": false
          },
          "onlySubdomainArticles": {
            "title": "Limit articles to only from subdomain",
            "type": "boolean",
            "description": "Only loads articles which URL begins with the same path as Start URL. E.g. if Start URL is https://apify.com/store, it will only load articles starting with https://apify.com/store",
            "default": false
          },
          "scanSitemaps": {
            "title": "Find articles in sitemaps (caution)",
            "type": "boolean",
            "description": "We recommend using `Sitemap URLs` instead. \n If this function is selected, the extractor will scan different sitemaps from the initial article URL. Keep in mind that this option can lead to the loading of a huge amount of (sometimes old) articles, in which case the time and cost of the scrape will increase.",
            "default": false
          },
          "sitemapUrls": {
            "title": "Sitemap URLs (safer)",
            "type": "array",
            "description": "You can provide selected sitemap URLs that include the articles you need to extract.",
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "saveHtml": {
            "title": "Save full HTML",
            "type": "boolean",
            "description": "If this function is selected, the scraper will save the full HTML of the article page, but this will make the data less readable."
          },
          "saveHtmlAsLink": {
            "title": "Save full HTML (only as link to it)",
            "type": "boolean",
            "description": "If this function is selected, the scraper will save the full HTML of the article page as a URL to keep the dataset clean and small."
          },
          "saveSnapshots": {
            "title": "Save screenshots of article pages (browser only)",
            "type": "boolean",
            "description": "Stores a screenshot for each article page to Key-Value Store and provides that as screenshotUrl. Useful for debugging.",
            "default": false
          },
          "useGoogleBotHeaders": {
            "title": "Use Googlebot headers",
            "type": "boolean",
            "description": "This option will allow you to bypass protection and paywalls on some websites. Use with caution as it might lead to getting blocked.",
            "default": false
          },
          "minWords": {
            "title": "Minimum words",
            "type": "integer",
            "description": "The article needs to contain at least this number of words to be extracted",
            "default": 150
          },
          "dateFrom": {
            "title": "Extract articles from [date]",
            "type": "string",
            "description": "Only articles from this day on will be scraped. If empty, all articles will be scraped. Format is YYYY-MM-DD, e.g. 2019-12-31, or number type e.g. 1 week or 20 days"
          },
          "onlyArticlesForLastDays": {
            "title": "Only articles for last X days",
            "type": "integer",
            "description": "Only get posts that were published in the last X days from time the scraping starts. Use either this or the absolute date."
          },
          "mustHaveDate": {
            "title": "Must have date",
            "type": "boolean",
            "description": "If checked, the article must have a date of release to be extracted.",
            "default": true
          },
          "isUrlArticleDefinition": {
            "title": "Is the URL an article?",
            "type": "object",
            "description": "Here you can input JSON settings to define what URLs should be considered articles by the scraper. If any of them is `true`, then the link will be opened and the article extracted."
          },
          "pseudoUrls": {
            "title": "Pseudo URLs",
            "type": "array",
            "description": "This function can be used to enqueue more pages, i.e. include more links like pagination or categories. This doesn't work for articles, as they are recognized by the recognition system.",
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "linkSelector": {
            "title": "Link selector",
            "type": "string",
            "description": "You can limit the <a> tags whose links will be enqueued. This field is empty by default. Add `a.some-class` to activate it"
          },
          "maxDepth": {
            "title": "Max depth",
            "type": "integer",
            "description": "Maximum depth of crawling, i.e. how many times the scraper picks up a link to other webpages. Level 0 refers to the start URLs, 1 are the first level links, and so on. This is only valid for pseudo URLs"
          },
          "maxPagesPerCrawl": {
            "title": "Max pages per crawl",
            "type": "integer",
            "description": "Maximum number of total pages crawled. It includes the home page, pagination pages, invalid articles, and so on. The crawler will stop automatically after reaching this number."
          },
          "maxArticlesPerCrawl": {
            "title": "Max articles per crawl",
            "type": "integer",
            "description": "Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number."
          },
          "maxArticlesPerStartUrl": {
            "title": "Max articles per start URL",
            "type": "integer",
            "description": "Maximum number of articles scraped per start URL."
          },
          "maxConcurrency": {
            "title": "Max concurrency",
            "type": "integer",
            "description": "You can limit the speed of the scraper to avoid getting blocked."
          },
          "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Proxy configuration"
          },
          "overrideProxyGroup": {
            "title": "Override proxy group",
            "type": "string",
            "description": "If you want to override the default proxy group, you can specify it here. This is useful if you want to use a different proxy group for each crawler."
          },
          "useBrowser": {
            "title": "Use browser (Puppeteer)",
            "type": "boolean",
            "description": "This option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.",
            "default": false
          },
          "pageWaitMs": {
            "title": "Wait on each page (ms)",
            "type": "integer",
            "description": "How many milliseconds to wait on each page before extracting data"
          },
          "navigationWaitUntil": {
            "title": "Wait until navigation event is finished",
            "enum": [
              "load",
              "domcontentloaded",
              "networkidle0",
              "networkidle2"
            ],
            "type": "string",
            "description": "What to wait until the navigation is finished. `domcontentloaded` happens when initial HTML loads and is fastest. `load` happens when JS is executed and it is default. `networkidle0`, `networkidle2` wait for background network but cannot cause infinite loading.",
            "default": "load"
          },
          "pageWaitSelectorCategory": {
            "title": "Wait for selector on each category page",
            "type": "string",
            "description": "For what selector to wait on each page before extracting data"
          },
          "pageWaitSelectorArticle": {
            "title": "Wait for selector on each article page",
            "type": "string",
            "description": "For what selector to wait on each page before extracting data"
          },
          "scrollToBottom": {
            "title": "Scroll to bottom of the page (infinite scroll)",
            "type": "boolean",
            "description": "Scroll to the botton of the page, loading dynamic articles."
          },
          "scrollToBottomButtonSelector": {
            "title": "Scroll to bottom button selector",
            "type": "string",
            "description": "CSS selector for a button to load more articles"
          },
          "scrollToBottomMaxSecs": {
            "title": "Scroll to bottom max seconds",
            "type": "integer",
            "description": "Limit for how long the scrolling can run so it does no go infinite."
          },
          "extendOutputFunction": {
            "title": "Extend output function",
            "type": "string",
            "description": "This function allows you to merge your custom extraction with the default one. You can only return an object from this function. This object will be merged/overwritten with the default output for each article."
          },
          "stopAfterCUs": {
            "title": "Limit CU consumption",
            "type": "integer",
            "description": "The scraper will stop running after reaching this number of compute units."
          },
          "notificationEmails": {
            "title": "Emails address for notifications",
            "type": "array",
            "description": "Notifications will be sent to these email addresses.",
            "items": {
              "type": "string"
            }
          },
          "notifyAfterCUs": {
            "title": "Notify after [number] CUs",
            "type": "integer",
            "description": "The scraper will send notifications to the provided email when it reaches this number of CUs."
          },
          "notifyAfterCUsPeriodically": {
            "title": "Notify every [number] CUs",
            "type": "integer",
            "description": "The scraper will send notifications to the provided email every time this number of CUs is reached since the last notification."
          }
        }
      },
      "runsResponseSchema": {
        "type": "object",
        "properties": {
          "data": {
            "type": "object",
            "properties": {
              "id": {
                "type": "string"
              },
              "actId": {
                "type": "string"
              },
              "userId": {
                "type": "string"
              },
              "startedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "finishedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "status": {
                "type": "string",
                "example": "READY"
              },
              "meta": {
                "type": "object",
                "properties": {
                  "origin": {
                    "type": "string",
                    "example": "API"
                  },
                  "userAgent": {
                    "type": "string"
                  }
                }
              },
              "stats": {
                "type": "object",
                "properties": {
                  "inputBodyLen": {
                    "type": "integer",
                    "example": 2000
                  },
                  "rebootCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "restartCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "resurrectCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "computeUnits": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "options": {
                "type": "object",
                "properties": {
                  "build": {
                    "type": "string",
                    "example": "latest"
                  },
                  "timeoutSecs": {
                    "type": "integer",
                    "example": 300
                  },
                  "memoryMbytes": {
                    "type": "integer",
                    "example": 1024
                  },
                  "diskMbytes": {
                    "type": "integer",
                    "example": 2048
                  }
                }
              },
              "buildId": {
                "type": "string"
              },
              "defaultKeyValueStoreId": {
                "type": "string"
              },
              "defaultDatasetId": {
                "type": "string"
              },
              "defaultRequestQueueId": {
                "type": "string"
              },
              "buildNumber": {
                "type": "string",
                "example": "1.0.0"
              },
              "containerUrl": {
                "type": "string"
              },
              "usage": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "integer",
                    "example": 1
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "usageTotalUsd": {
                "type": "number",
                "example": 0.00005
              },
              "usageUsd": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "number",
                    "example": 0.00005
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Scrape and download articles and news OpenAPI definition

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.

OpenAPI is effective when used with AI agents and GPTs by standardizing how these systems interact with various APIs, for reliable integrations and efficient communication.

By defining machine-readable API specifications, OpenAPI allows AI models like GPTs to understand and use varied data sources, improving accuracy. This accelerates development, reduces errors, and provides context-aware responses, making OpenAPI a core component for AI applications.

You can download the OpenAPI definitions for Smart Article Extractor from the options below:

OpenAPI.json

If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.

You can also check out our other API clients:

Smart Article Extractor API in Python

Smart Article Extractor API in JavaScript

Smart Article Extractor API through CLI

Smart Article Extractor API

Articles Extractor

web.harvester/articles-extractor

The Article Extractor is an enterprise-grade web scraping solution designed specifically for extracting structured data from news articles, blog posts, and online publications. Our advanced HTML parsing engine delivers unmatched accuracy in content extraction across thousands of websites.

Web Harvester

712

5.0

News Website Crawler & Article Extractor

xtech/news-source-crawler

Scrape all articles from any news website. Extract full text, metadata, keywords, and summaries. Ideal for content analysis, research, and news aggregation.

Xtech

347

4.9

Article Content Extractor 📄

easyapi/article-content-extractor

Extract clean article content, metadata and structured information from any web page. Supports multiple URLs and returns well-formatted JSON with title, description, content, author, publish date and more. 🔍📄

EasyApi

Web Article Content Extractor

vulnv/web-article-content-extractor

Extract clean, readable content from news articles, blog posts, and web pages. Batch process multiple URLs, download images, bypass bot protection with proxy support. Perfect for content curation, research, and data analysis.

VulnV

News Article Scraper for Feeding LLM

proscraper/newsarticlescraper

Scrape news articles metadata to feed into LLM models. Returns article body, published date, article title, author etc.

Owais Nazir

152

Extract Website With URL

mrahil/extract-website-with-url

The Extract Website with URL API allows users to extract structured data from any webpage by providing a URL. It retrieves HTML, metadata, tables, and images, returning data in JSON format. Ideal for web scraping, SEO analysis, and content extraction. Use it for e-commerce data, news scraping

Mohammed Rahil

211

Smart Article Scraper - Text, Data & Insights

xtech/article-extractor

𝗔𝗿𝘁𝗶𝗰𝗹𝗲 𝗦𝗰𝗿𝗮𝗽𝗲𝗿 & 𝗖𝗼𝗻𝘁𝗲𝗻𝘁 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗼𝗿 - Extract clean text, metadata, keywords & summaries from any web article or blog post. Perfect for 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵, 𝗰𝗼𝗺𝗽𝗲𝘁𝗶𝘁𝗶𝘃𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 & 𝗰𝗼𝗻𝘁𝗲𝗻𝘁 𝗺𝗮𝗿𝗸𝗲𝘁𝗶𝗻𝗴.

Xtech

1.0

Article Text Extractor

mtrunkat/article-text-extractor

Simply extracts article texts and other meta info from the given URL. Uses https://github.com/ageitgey/node-unfluff which is a NodeJS implementation of https://github.com/grangier/python-goose.

Marek Trunkát

950

5.0

Advanced News Scraper

dorcy/advanced-news-scraper

Extract the latest news articles with custom search queries, providing all the information, including article titles, sources, publication dates, full article text, and an AI-generated summary.