SEO Keyword Extractor

Finds keyword phrases from a list of websites 🌐, groups similar ones into clear themes 🧩, and ranks them. Also suggests strong primary keywords ⭐ and candidate negative keywords 🚫 so you can plan SEO and ad campaigns with sharper focus 📈.

Pricing: from $20.00 / 1,000 results

Developer: Chris Xavier (Maintained by Community)

🔍 SEO Keyword Theme & Negative Keyword Analyzer 🚀

📘 Overview

This actor takes one or more URLs, extracts high-value multi-word SEO keyphrases, and then:

  • Clusters common cross-site keyword families (semantic variants across multiple domains).
  • Computes n-gram stats (e.g., “real estate lawyer”, “fort lauderdale real estate lawyer”) only for phrases that show up on at least three sites.
  • Builds keyword themes (ranked topics with all their variants and sites).
  • Suggests candidate negative keywords (likely competitor names / one-off phrases that only appear on a single site).

It’s built for serious competitive research, PPC planning, and semantic SEO clustering across your niche 🌐✨

🌟 Use Cases

  • 🔎 Competitor keyword intelligence: See which phrases multiple competitors converge on (strong themes) vs. one-off phrases (weak or brand-specific).
  • 🧩 Local + practice-area SEO: Quickly surface geo + service combos like “fort lauderdale real estate lawyer” or “west palm beach probate attorney.”
  • 🧠 Semantic clustering & topic planning: Get “keyword themes” with a primary phrase, all variants, and which sites use them.
  • 🎯 PPC campaign & ad group design: Use themes as ad groups and variants as match types; use single-site phrases as negative keyword candidates.
  • 🧹 Keyword cleanup & noise reduction: Filters out junky code-like phrases, numeric strings, and odd technical terms by default.

🧪 Output Structure

Results are written as flat dataset rows so they’re easy to export to CSV, Sheets, or BI tools. Each row has a record_type that tells you what kind of entity it is.
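
For example, a minimal export sketch (assuming the run's dataset has been downloaded to a local dataset_items.json file and pandas is installed; neither is required by the actor itself):

import json
import pandas as pd

# Load the exported dataset items (one dict per row, as shown in the sections below).
with open("dataset_items.json", "r", encoding="utf-8") as f:
    rows = json.load(f)

# Split the flat rows into one CSV per record_type for Sheets / BI imports.
for record_type, group in pd.DataFrame(rows).groupby("record_type"):
    group.dropna(axis=1, how="all").to_csv(f"{record_type}.csv", index=False)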

1️⃣ Per-page keywords

One row per URL:

{
  "record_type": "page_keywords",
  "page_url": "https://example.com",
  "top_keywords": [
    "west palm beach real estate attorney",
    "florida real estate lawyers",
    "business litigation fort lauderdale"
  ]
}

2️⃣ Common cross-site keyword families

Clusters of similar phrases that show up on more than one site, with similarity metrics:

{
  "record_type": "common_cross_site_keywords",
  "group_representative": "florida real estate attorney",
  "group_keywords": [
    "florida real estate attorney",
    "florida real estate lawyers",
    "florida real estate law",
    "law florida real estate",
    "real estate litigation attorneys"
  ],
  "keyword_count": 5,
  "site_count": 3,
  "sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com"
  ],
  "levenshtein_avg_distance": 0.31,
  "levenshtein_max_distance": 0.53
}

Use these rows to see:

  • Which concepts recur across domains (site_count).
  • How tight the wording cluster is (lower Levenshtein distances = more similar).
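
The actor's exact distance math isn't exposed in the output, but the metric can be illustrated with RapidFuzz's normalized Levenshtein distance, where 0 means identical strings and 1 means completely different:

from rapidfuzz.distance import Levenshtein

a = "florida real estate attorney"
b = "florida real estate lawyers"

# Normalized edit distance in [0, 1]; lower values mean the two phrasings are closer.
print(round(Levenshtein.normalized_distance(a, b), 2))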

3️⃣ N-gram stats (cross-site phrases)

For each n (2, 3, …), the actor aggregates n-grams that appear on at least 3 different sites (strong cross-site themes):

{
  "record_type": "ngram_3",
  "ngram": "fort lauderdale real",
  "n": 3,
  "count": 8,
  "site_count": 4,
  "sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com",
    "https://d.com"
  ],
  "sample_keywords": [
    "fort lauderdale real estate",
    "lauderdale real estate lawyer",
    "lauderdale real estate attorneys"
  ]
}

This is great for spotting standard phrases in the market (“real estate lawyer”, “west palm beach”, etc.).
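
The aggregation itself is easy to reason about. A minimal sketch of the idea (not the actor's actual code), counting 3-grams and the sites they appear on:

from collections import defaultdict

# Keyphrases per site, e.g. taken from the page_keywords rows above.
keywords_by_site = {
    "https://a.com": ["fort lauderdale real estate lawyer"],
    "https://b.com": ["fort lauderdale real estate attorneys"],
    "https://c.com": ["best fort lauderdale real estate firm"],
}

counts = defaultdict(int)
sites = defaultdict(set)

for site, phrases in keywords_by_site.items():
    for phrase in phrases:
        tokens = phrase.split()
        for i in range(len(tokens) - 2):  # sliding window of length 3
            ngram = " ".join(tokens[i:i + 3])
            counts[ngram] += 1
            sites[ngram].add(site)

# Keep only n-grams seen on at least 3 different sites (the "strong theme" rule).
strong = {g: counts[g] for g in counts if len(sites[g]) >= 3}
print(strong)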

4️⃣ Group-to-group similarity (Jaccard)

When two cross-site keyword families heavily overlap in their token sets, they’re connected with a Jaccard score:

{
  "record_type": "group_similarity",
  "group_a": "florida real estate attorney",
  "group_b": "real estate lawyer",
  "similarity": 0.63
}

These tell you which keyword families are basically talking about the same thing and should probably be treated as one theme in your planning.
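
Jaccard similarity is just the overlap of two token sets divided by their union. A toy illustration (the exact token sets the actor compares are not part of the output):

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|: 1.0 = identical token sets, 0.0 = no overlap."""
    return len(a & b) / len(a | b)

tokens_a = {"florida", "real", "estate", "attorney", "lawyers", "law"}
tokens_b = {"real", "estate", "lawyer", "attorney"}

print(round(jaccard(tokens_a, tokens_b), 2))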

5️⃣ Keyword themes (the “use this in campaigns” layer)

Themes merge similar groups into higher-level topics and rank them:

{
  "record_type": "keyword_theme",
  "primary_keyword": "florida real estate attorney",
  "score": 0.95,
  "site_count": 3,
  "groups_in_theme": 2,
  "all_variants": [
    "florida real estate attorney",
    "florida real estate law",
    "florida real estate lawyers",
    "law florida real estate",
    "real estate litigation attorneys"
  ],
  "all_sites": [
    "https://a.com",
    "https://b.com",
    "https://c.com"
  ]
}

How to use these:

  • Treat each keyword_theme as:
    • A core SEO topic / pillar page, or
    • A PPC ad group (primary = ad group name, variants = match types / ad copy phrases).

Higher score = stronger candidate.
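
As a quick illustration (a sketch that assumes you've already filtered the dataset down to keyword_theme rows), turning themes into a simple ad-group plan can be as small as:

# keyword_theme rows filtered from the dataset (trimmed example).
theme_rows = [
    {
        "primary_keyword": "florida real estate attorney",
        "score": 0.95,
        "all_variants": [
            "florida real estate attorney",
            "florida real estate lawyers",
        ],
    },
]

ad_groups = [
    {
        "ad_group": theme["primary_keyword"],  # primary phrase names the ad group
        "keywords": theme["all_variants"],     # variants become the keyword list
    }
    for theme in sorted(theme_rows, key=lambda t: t["score"], reverse=True)
]
print(ad_groups)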

6️⃣ Candidate negative keywords

The actor also flags n-grams that only appear on one site as negative keyword candidates (often brand names or very specific, non-generic terms):

{
  "record_type": "negative_keyword_candidate",
  "phrase": "ryan shipp",
  "n": 2,
  "count": 3,
  "site_count": 1,
  "sites": [
    "https://competitor.com"
  ],
  "reason": "single_site_ngram"
}

These are not auto-applied negatives. They’re suggestions that you should manually review before adding to a PPC negative list (especially competitor names or hyper-specific phrases you don’t want to pay for).
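
A small review helper (a sketch over exported dataset rows, not something the actor does for you) that surfaces the most frequent single-site phrases first:

# A few exported dataset items (normally loaded from the run's default dataset).
rows = [
    {"record_type": "negative_keyword_candidate", "phrase": "ryan shipp",
     "count": 3, "site_count": 1, "sites": ["https://competitor.com"]},
    {"record_type": "keyword_theme", "primary_keyword": "florida real estate attorney"},
]

negatives = [r for r in rows if r.get("record_type") == "negative_keyword_candidate"]

# Most frequent candidates first; nothing here is applied automatically.
for row in sorted(negatives, key=lambda r: r["count"], reverse=True):
    print(f"{row['phrase']:<30} count={row['count']}  site={row['sites'][0]}")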

⚙️ Input

Required fields

{
  "urls": [
    { "url": "https://example.com" },
    { "url": "https://another-site.com" }
  ],
  "min_ngram_n": 2
}
  • urls (array)

    • Uses the requestListSources editor in Apify.
    • Accepts either { "url": "..." } objects or plain strings "https://...".
  • min_ngram_n (integer, optional, default 2)

    • The minimum n-gram length to analyze.
    • 2 = start at bigrams (“real estate”), 3 = only 3+ word phrases (“real estate lawyer”, “fort lauderdale real estate”).
    • Unigrams (single words) are never computed to keep noise down.

Internally, the actor analyzes n-grams from min_ngram_n up to a safe cap (currently 6) to avoid combinatorial blow-ups on very long phrases.
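
If you drive the actor from Python, a run with this input might look like the sketch below (the actor ID is a placeholder; install the client with pip install apify-client):

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Replace the placeholder with the actor ID shown on the store page.
run = client.actor("<username>/seo-keyword-extractor").call(
    run_input={
        "urls": [
            {"url": "https://example.com"},
            {"url": "https://another-site.com"},
        ],
        "min_ngram_n": 2,
    }
)

# Iterate the flat dataset rows described in the Output Structure section.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["record_type"], item)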

🔄 How it works (under the hood)

  1. Fetch & clean

    • Fetches each URL via HTTP.
    • Strips scripts, styles, and other noise and extracts visible text.
  2. Keyword extraction

    • Uses a transformer-based model (all-MiniLM-L6-v2 via KeyBERT) to extract multi-word keyphrases from the page content (a simplified sketch of this step follows the list).
    • Filters out:
      • Numeric strings
      • Code-y / technical junk
      • Blacklisted tokens (e.g., obvious non-SEO boilerplate)
    • Keeps the most relevant 2–4 word keyphrases per page.
  3. Cross-site aggregation

    • Clusters similar phrases across sites using RapidFuzz (token-set similarity).
    • Keeps only clusters seen on multiple domains.
    • Computes Levenshtein distances inside each cluster to quantify how tight/loose the variants are.
  4. N-gram analysis

    • Builds n-gram stats across pages:
      • Only n in [min_ngram_n, 6].
      • Only n-grams seen on ≥ 3 sites are kept as strong cross-site themes.
  5. Theme building

    • Builds a graph of keyword groups connected by high Jaccard similarity.
    • Collapses connected components into themes.
    • Scores each theme by:
      • Cross-site importance (how many sites use it).
      • Cohesion (Levenshtein-based).
      • Phrase length (favoring 2–4 word phrases).
  6. Negative keyword suggestions

    • Separately scans all phrases for n-grams that appear on exactly one site.
    • Emits them as negative_keyword_candidate rows for manual review.
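
The extraction and clustering steps can be approximated with off-the-shelf pieces. This is a simplified sketch of steps 2–3 above (not the actor's exact code, thresholds, or parameters):

from keybert import KeyBERT
from rapidfuzz import fuzz

kw_model = KeyBERT("all-MiniLM-L6-v2")

# Step 2: extract 2-4 word keyphrases from a page's visible text.
page_text = (
    "We are Fort Lauderdale real estate lawyers handling closings, "
    "title disputes, and business litigation across South Florida."
)
keyphrases = [
    phrase
    for phrase, _score in kw_model.extract_keywords(
        page_text,
        keyphrase_ngram_range=(2, 4),
        stop_words="english",
        top_n=10,
    )
]

# Step 3: greedily group phrases whose token sets are highly similar.
groups: list[list[str]] = []
for phrase in keyphrases:
    for group in groups:
        if fuzz.token_set_ratio(phrase, group[0]) >= 85:  # similarity threshold (assumed)
            group.append(phrase)
            break
    else:
        groups.append([phrase])

print(groups)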

💰 Monetization & Scaling

This actor is designed to work cleanly with Apify Pay-Per-Event (PPE):

  • One event per run (apify-actor-start)
    Charged per actor start (each run).

  • One event per result row (apify-default-dataset-item)
    Every Actor.push_data(...) call creates a dataset item, which can be billed as a per-item event.

That means:

  • Small runs with a few URLs → a handful of items → lower cost.
  • Large competitive sweeps (many domains) → more items (pages, cross-site keywords, themes, negatives) → higher cost but also richer insight.

You can control cost by:

  • Limiting the number of input URLs.
  • Truncating or filtering which record types you care about (e.g., only page_keywords + keyword_theme).
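
As a rough back-of-the-envelope illustration of that scaling (using the listed $20.00 per 1,000 results and ignoring the per-run start event, whose price depends on the actor's PPE setup; the row counts are hypothetical):

price_per_item = 20.00 / 1000  # $0.02 per dataset row at the listed rate

urls = 10           # input URLs in the run
rows_per_url = 25   # assumed average rows per URL across all record types
total_rows = urls * rows_per_url

print(f"~{total_rows} rows -> ~${total_rows * price_per_item:.2f}")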

🔄 Workflow Examples

This actor is workflow-ready and plays nicely with other Apify tools:

  • serp-scraper: Scrape top-ranking Google results for a query, then feed the URLs here to see the shared themes across the SERP.
  • map-scraper: Collect local business websites from Google Maps, then compare cross-site phrasing for local SEO campaigns.
  • Other actors: Build end-to-end automations: harvest → extract → cluster → export to Sheets/Data Studio.

🚀 Ready to Launch?

Use this actor when you want more than just a list of keywords:

  • See which phrases truly define your niche (themes & n-grams).
  • Separate generic market language from brand-specific noise.
  • Build better SEO topics, tighter PPC ad groups, and smarter negative lists.

Perfect for:

  • SEO agencies
  • Performance marketers
  • Local law firms & service businesses
  • Content strategists and SERP analysts

Happy crawling & clustering! 🚀🌐