Scrapie - Web Scraping API

Scrapie is a production-ready microservice that scrapes public web pages (including JS-heavy sites) and returns structured JSON with:

  • title, description
  • headings
  • features
  • pricing
  • testimonials
  • links, images
  • contact info (emails, phones)
  • raw text

Runs on Node.js with Playwright for full JS rendering. Ships with OpenAPI docs, rate limiting, caching, and API key auth.

Contents

  • Features
  • Endpoints & Auth
  • Request/Response schema
  • Error codes
  • Environment variables
  • Rate limiting & caching
  • Usage examples
    • curl
    • Node (fetch/axios)
    • Next.js (App Router & Pages Router)
    • React (client) via backend proxy
    • Simple Node/Express proxy
  • Minimal SDK-style client
  • Swagger/OpenAPI
  • Deployment (Render, Docker)
  • Best practices & limitations

Features

  • Playwright-powered scraping for JS-heavy websites
  • Heuristic extraction for common landing page sections (pricing, testimonials, features)
  • API key auth via header x-api-key
  • In-memory caching (per-instance) with configurable TTL
  • Rate limiting (per IP)
  • Swagger UI at /docs and OpenAPI spec in openapi.yaml

Endpoints & Auth

  • Base URL (local): http://localhost:3000
  • Health: GET /health returns { status: "ok" }
  • Docs: GET /docs
  • Scrape: POST /scrape (requires API key)

Auth header:

  • x-api-key: <your-api-key>

Never include your API key in public client-side code. Use a server-side proxy (examples below).

Request schema

POST /scrape

{
  "url": "https://example.com",
  "waitUntil": "networkidle",
  "timeoutMs": 45000,
  "userAgent": "<optional-ua>"
}
  • waitUntil: one of load | domcontentloaded | networkidle | commit (default networkidle)
  • timeoutMs: 1000..120000 (default 45000)
  • userAgent: optional custom UA string
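
For TypeScript consumers, the request body maps to a small interface. This is a sketch derived from the schema above; the name ScrapeRequest is illustrative:

export interface ScrapeRequest {
  url: string; // required, absolute URL
  waitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit'; // default 'networkidle'
  timeoutMs?: number; // 1000..120000, default 45000
  userAgent?: string; // optional custom UA string
}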

Response schema (200)

{
  "cached": false,
  "data": {
    "url": "https://example.com",
    "title": "...",
    "description": "...",
    "headings": [{ "tag": "h1", "text": "..." }],
    "features": ["..."],
    "pricing": ["..."],
    "testimonials": ["..."],
    "links": [{ "href": "/contact", "text": "Contact" }],
    "images": [{ "src": "/logo.png", "alt": "Logo" }],
    "contacts": { "emails": ["..."], "phones": ["..."] },
    "rawText": "..."
  }
}
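
The corresponding response shape as a TypeScript interface, a sketch inferred from the example above (treating all fields as present is an assumption):

export interface ScrapeResponse {
  cached: boolean;
  data: {
    url: string;
    title: string;
    description: string;
    headings: { tag: string; text: string }[];
    features: string[];
    pricing: string[];
    testimonials: string[];
    links: { href: string; text: string }[];
    images: { src: string; alt: string }[];
    contacts: { emails: string[]; phones: string[] };
    rawText: string;
  };
}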

Error codes

  • 400: invalid request body
  • 401: missing/invalid API key
  • 500: scrape failure (navigation timeout, blocked, etc.)
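
A sketch of how a caller might branch on these codes. The error body shape is not specified here, so the snippet only reads the raw response text:

const res = await fetch('https://your-service/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'x-api-key': process.env.SCRAPIE_API_KEY! },
  body: JSON.stringify({ url: 'https://example.com' }),
});
if (!res.ok) {
  const detail = await res.text();
  if (res.status === 400) throw new Error(`Bad request: ${detail}`); // fix the body before retrying
  if (res.status === 401) throw new Error('Missing or invalid x-api-key header');
  if (res.status === 500) throw new Error(`Scrape failed: ${detail}`); // may be transient; consider retrying
  throw new Error(`Unexpected status ${res.status}`);
}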

Environment variables

Set these in .env (local) or provider dashboard (Render):

  • API_KEY (required): your secret key
  • API_KEYS (optional): comma-separated list of allowed keys
  • PORT (default 3000)
  • RATE_LIMIT_PER_MIN (default 30)
  • CACHE_TTL_SECONDS (default 300)
  • EXTRA_WAIT_MS (default 1000): extra wait after load for late content
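
A sample .env using the defaults listed above (replace the key with your own secret):

API_KEY=change-me
# API_KEYS=key-one,key-two
PORT=3000
RATE_LIMIT_PER_MIN=30
CACHE_TTL_SECONDS=300
EXTRA_WAIT_MS=1000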

Rate limiting & caching

  • Rate limit applies per IP per minute.
  • Cache: responses per URL are cached in-memory for CACHE_TTL_SECONDS on a single instance.
    • Horizontally scaled instances won't share the in-memory cache; use a distributed cache (e.g., Redis) if needed. A sketch follows below.
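
If you do scale horizontally, a shared cache can sit in front of the service. A minimal sketch using ioredis (an assumption; any Redis client works), keyed by URL and reusing the TTL idea; REDIS_URL is a hypothetical env var:

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
const TTL = Number(process.env.CACHE_TTL_SECONDS ?? 300);

export async function scrapeWithSharedCache(url: string) {
  // Serve from the shared cache when possible
  const hit = await redis.get(`scrapie:${url}`);
  if (hit) {
    const parsed = JSON.parse(hit);
    return { ...parsed, cached: true };
  }

  const res = await fetch(process.env.SCRAPIE_BASE_URL + '/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.SCRAPIE_API_KEY!,
    },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`Scrape failed: ${res.status}`);

  const json = await res.json();
  await redis.set(`scrapie:${url}`, JSON.stringify(json), 'EX', TTL); // expire after TTL seconds
  return json;
}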

Usage examples

curl

curl -X POST https://your-service/scrape \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{"url":"https://example.com","waitUntil":"networkidle"}'

Node (native fetch)

const res = await fetch('https://your-service/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.SCRAPIE_API_KEY,
  },
  body: JSON.stringify({ url: 'https://example.com' }),
});
const json = await res.json();

Node (axios)

const axios = require('axios');

const { data } = await axios.post(
  'https://your-service/scrape',
  { url: 'https://example.com', waitUntil: 'networkidle' },
  { headers: { 'x-api-key': process.env.SCRAPIE_API_KEY } }
);

Next.js (App Router: route handler, server-side)

Never call the Scrapie service from a client component with your secret key. Create a route handler as a proxy.

app/api/scrape/route.ts

import { NextResponse } from 'next/server';

export async function POST(req: Request) {
  const body = await req.json(); // { url, waitUntil, timeoutMs, userAgent }
  const res = await fetch(process.env.SCRAPIE_BASE_URL + '/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.SCRAPIE_API_KEY!,
    },
    body: JSON.stringify(body),
    // Optionally set a timeout with AbortController
  });
  const data = await res.json();
  return NextResponse.json(data, { status: res.status });
}
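
The timeout hinted at in the comment can be wired with AbortSignal.timeout (available in Node 18+); a sketch of the relevant fetch options, using an illustrative 60-second budget:

const res = await fetch(process.env.SCRAPIE_BASE_URL + '/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'x-api-key': process.env.SCRAPIE_API_KEY! },
  body: JSON.stringify(body),
  signal: AbortSignal.timeout(60_000), // abort the upstream call after 60s
});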

Usage in a Client Component:

// in a client component
const res = await fetch('/api/scrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://example.com' }),
});
const data = await res.json();

Next.js (Pages Router: API route, server-side)

pages/api/scrape.ts

import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  if (req.method !== 'POST') return res.status(405).end();
  const upstream = await fetch(process.env.SCRAPIE_BASE_URL + '/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.SCRAPIE_API_KEY as string,
    },
    body: JSON.stringify(req.body),
  });
  const json = await upstream.json();
  res.status(upstream.status).json(json);
}

React (client) via backend proxy

Do not embed your API key in the browser. Use your backend to proxy requests as shown above (Next.js route or Express server). Client calls your backend endpoint, not Scrapie directly.
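
For example, a small client-side hook that talks only to your proxy (the name useScrape is illustrative):

import { useState } from 'react';

export function useScrape() {
  const [data, setData] = useState<unknown>(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);

  async function scrape(url: string) {
    setLoading(true);
    setError(null);
    try {
      // Calls your own backend route, never the Scrapie service directly
      const res = await fetch('/api/scrape', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url }),
      });
      if (!res.ok) throw new Error(`Proxy returned ${res.status}`);
      setData(await res.json());
    } catch (e) {
      setError(e instanceof Error ? e.message : String(e));
    } finally {
      setLoading(false);
    }
  }

  return { data, loading, error, scrape };
}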

Simple Node/Express proxy

const express = require('express');
const fetch = require('node-fetch'); // node-fetch@2 (CommonJS); on Node 18+ drop this line and use the global fetch
const app = express();

app.use(express.json());

app.post('/api/scrape', async (req, res) => {
  const r = await fetch(process.env.SCRAPIE_BASE_URL + '/scrape', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.SCRAPIE_API_KEY,
    },
    body: JSON.stringify(req.body),
  });
  const json = await r.json();
  res.status(r.status).json(json);
});

app.listen(4000, () => console.log('Proxy on :4000'));

Minimal SDK-style client

export class ScrapieClient {
  constructor(private baseUrl: string, private apiKey: string) {}

  async scrape(input: {
    url: string;
    waitUntil?: 'load' | 'domcontentloaded' | 'networkidle' | 'commit';
    timeoutMs?: number;
    userAgent?: string;
  }) {
    const res = await fetch(`${this.baseUrl}/scrape`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-api-key': this.apiKey,
      },
      body: JSON.stringify(input),
    });
    if (!res.ok) {
      const err = await res.text();
      throw new Error(`Scrapie error ${res.status}: ${err}`);
    }
    return res.json();
  }
}
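
Usage:

const client = new ScrapieClient('https://your-service', process.env.SCRAPIE_API_KEY!);
const result = await client.scrape({ url: 'https://example.com', waitUntil: 'networkidle' });
console.log(result.data.title);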

Swagger/OpenAPI

  • Live docs: /docs
  • Spec: openapi.yaml
  • You can import the spec into Postman or generate clients with openapi-generator-cli.
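
For example, generating a TypeScript client from the spec (the typescript-fetch generator is one choice among many):

npx @openapitools/openapi-generator-cli generate -i openapi.yaml -g typescript-fetch -o ./generated-client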

Deployment

  • Render.com: use included render.yaml (Docker); set API_KEY in env vars
  • Docker:
    docker build -t scrapie .
    docker run -p 3000:3000 --env-file .env scrapie

Best practices & limitations

  • Respect robots.txt and website terms. Obtain consent where required.
  • Avoid scraping authenticated or private content.
  • Heuristic extraction may miss content on atypical layouts; review data.rawText as a fallback.
  • For large-scale use, add a queue, retries, rotating IPs/UAs, and distributed caching.