SEO Keyword Extractor
Pricing
from $20.00 / 1,000 results
Finds keyword phrases from a list of websites 🌐, groups similar ones into clear themes 🧩, and ranks them. Also suggests good main keywords ⭐ and possible negative keywords 🚫 so you can plan SEO and ad campaigns in a smarter, more focused way 📈.
Developer

Chris Xavier
7 days ago
Last modified
🔍 SEO Keyword Theme & Negative Keyword Analyzer 🚀
📘 Overview
This actor takes one or more URLs, extracts high-value multi-word SEO keyphrases, and then:
- Clusters common cross-site keyword families (semantic variants across multiple domains).
- Computes n-gram stats (e.g. “real estate lawyer”, “fort lauderdale real estate lawyer”) only for phrases that appear on at least 3 different sites.
- Builds keyword themes (ranked topics with all their variants and sites).
- Suggests candidate negative keywords (likely competitor names / one-off phrases that only appear on a single site).
It’s built for serious competitive research, PPC planning, and semantic SEO clustering across your niche 🌐✨
🌟 Use Cases
| 💼 Scenario | 📈 Benefit |
|---|---|
| 🔎 Competitor keyword intelligence | See which phrases multiple competitors converge on (strong themes) vs. one-off phrases (weak or brand-specific). |
| 🧩 Local + practice-area SEO | Quickly surface geo + service combos like “fort lauderdale real estate lawyer” or “west palm beach probate attorney.” |
| 🧠 Semantic clustering & topic planning | Get “keyword themes” with a primary phrase, all variants, and which sites use them. |
| 🎯 PPC campaign & ad group design | Use themes as ad groups and variants as match types; use single-site phrases as negative keyword candidates. |
| 🧹 Keyword cleanup & noise reduction | Filters out junky code-like phrases, numeric strings, and odd technical terms by default. |
🧪 Output Structure
Results are written as flat dataset rows so they’re easy to export to CSV, Sheets, or BI tools. Each row has a record_type that tells you what kind of entity it is.
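Because every row carries a record_type, a downstream script can split an exported dataset into one list per entity before loading it into Sheets or a BI tool. The snippet below is a minimal sketch of that idea, not part of the actor; the helper name `split_by_record_type` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def split_by_record_type(rows):
    """Group flat dataset rows into {record_type: [rows]} (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        # Rows without a record_type are collected under "unknown".
        grouped[row.get("record_type", "unknown")].append(row)
    return dict(grouped)


# Illustrative rows shaped like the examples in this README.
rows = [
    {"record_type": "page_keywords", "page_url": "https://example.com",
     "top_keywords": ["florida real estate lawyers"]},
    {"record_type": "keyword_theme",
     "primary_keyword": "florida real estate attorney", "score": 0.95},
    {"record_type": "page_keywords", "page_url": "https://another-site.com",
     "top_keywords": []},
]

grouped = split_by_record_type(rows)
for record_type, items in grouped.items():
    print(record_type, len(items))
```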
1️⃣ Per-page keywords
One row per URL:
{"record_type": "page_keywords","page_url": "https://example.com","top_keywords": ["west palm beach real estate attorney","florida real estate lawyers","business litigation fort lauderdale"]}
2️⃣ Common cross-site keyword families
Clusters of similar phrases that show up on more than one site, with similarity metrics:
{"record_type": "common_cross_site_keywords","group_representative": "florida real estate attorney","group_keywords": ["florida real estate attorney","florida real estate lawyers","florida real estate law","law florida real estate","real estate litigation attorneys"],"keyword_count": 5,"site_count": 3,"sites": ["https://a.com","https://b.com","https://c.com"],"levenshtein_avg_distance": 0.31,"levenshtein_max_distance": 0.53}
Use these rows to see:
- Which concepts recur across domains (site_count).
- How tight the wording cluster is (lower Levenshtein distances = more similar).
3️⃣ N-gram stats (cross-site phrases)
For each n (2, 3, …), the actor aggregates n-grams that appear on at least 3 different sites (strong cross-site themes):
{"record_type": "ngram_3","ngram": "fort lauderdale real","n": 3,"count": 8,"site_count": 4,"sites": ["https://a.com","https://b.com","https://c.com","https://d.com"],"sample_keywords": ["fort lauderdale real estate","lauderdale real estate lawyer","lauderdale real estate attorneys"]}
This is great for spotting standard phrases in the market (“real estate lawyer”, “west palm beach”, etc.).
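To make the cross-site rule concrete, here is a rough sketch of how n-grams could be counted per site and kept only when they appear on at least 3 sites. It is an illustration under assumptions (whitespace tokenization, lowercase matching, hypothetical function names and sample data), not the actor's actual code.

```python
from collections import defaultdict


def ngrams(phrase, n):
    """Return all contiguous n-word windows of a phrase (whitespace tokens)."""
    tokens = phrase.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cross_site_ngrams(keywords_by_site, n, min_sites=3):
    """Count n-grams across sites and keep those seen on >= min_sites sites."""
    counts = defaultdict(int)
    sites = defaultdict(set)
    for site, keywords in keywords_by_site.items():
        for keyword in keywords:
            for gram in ngrams(keyword, n):
                counts[gram] += 1
                sites[gram].add(site)
    return [
        {"record_type": f"ngram_{n}", "ngram": gram, "n": n,
         "count": counts[gram], "site_count": len(seen), "sites": sorted(seen)}
        for gram, seen in sorted(sites.items())
        if len(seen) >= min_sites
    ]


# Made-up input: keyphrases already extracted per site.
keywords_by_site = {
    "https://a.com": ["fort lauderdale real estate lawyer"],
    "https://b.com": ["real estate lawyer"],
    "https://c.com": ["florida real estate lawyer", "real estate law"],
}

result = cross_site_ngrams(keywords_by_site, n=2)
for row in result:
    print(row["ngram"], row["count"], row["site_count"])
```

With this sample data only “estate lawyer” and “real estate” survive, because every other bigram is confined to one site.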
4️⃣ Group-to-group similarity (Jaccard)
When two cross-site keyword families heavily overlap in their token sets, they’re connected with a Jaccard score:
{"record_type": "group_similarity","group_a": "florida real estate attorney","group_b": "real estate lawyer","similarity": 0.63}
These tell you which keyword families are basically talking about the same thing and should probably be treated as one theme in your planning.
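For reference, Jaccard similarity on token sets is the size of the intersection divided by the size of the union. The sketch below assumes each group's token set is simply the lowercase, whitespace-split words of all its keywords; the actor's own tokenization may differ, and the function names are illustrative.

```python
def token_set(keywords):
    """Collect the distinct lowercase words used by a keyword group."""
    return {token for keyword in keywords for token in keyword.lower().split()}


def jaccard(group_a, group_b):
    """Jaccard similarity of two keyword groups: |A & B| / |A | B|."""
    tokens_a, tokens_b = token_set(group_a), token_set(group_b)
    if not tokens_a and not tokens_b:
        return 0.0  # two empty groups are treated as unrelated
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up groups for illustration.
group_a = ["florida real estate attorney", "florida real estate lawyers"]
group_b = ["real estate lawyer", "real estate lawyers"]

similarity = jaccard(group_a, group_b)
print(similarity)  # 3 shared tokens out of 6 distinct tokens -> 0.5
```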
5️⃣ Keyword themes (the “use this in campaigns” layer)
Themes merge similar groups into higher-level topics and rank them:
{"record_type": "keyword_theme","primary_keyword": "florida real estate attorney","score": 0.95,"site_count": 3,"groups_in_theme": 2,"all_variants": ["florida real estate attorney","florida real estate law","florida real estate lawyers","law florida real estate","real estate litigation attorneys"],"all_sites": ["https://a.com","https://b.com","https://c.com"]}
How to use these:
- Treat each keyword_theme as:
  - A core SEO topic / pillar page, or
  - A PPC ad group (primary = ad group name, variants = match types / ad copy phrases).
Higher score = stronger candidate.
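One way to turn theme rows into a first draft of ad groups is sketched below. The `min_score` cutoff and the output shape (`ad_group`, `keywords`) are assumptions for illustration only; the actor does not emit ad groups itself.

```python
def themes_to_ad_groups(rows, min_score=0.5):
    """Build draft ad groups from keyword_theme rows, strongest theme first.

    min_score is a hypothetical cutoff, not a setting of the actor.
    """
    themes = [
        row for row in rows
        if row.get("record_type") == "keyword_theme"
        and row.get("score", 0) >= min_score
    ]
    themes.sort(key=lambda row: row["score"], reverse=True)
    return [
        {"ad_group": theme["primary_keyword"],
         "keywords": list(theme.get("all_variants", []))}
        for theme in themes
    ]


# Made-up rows shaped like the keyword_theme example above.
rows = [
    {"record_type": "keyword_theme", "primary_keyword": "probate attorney",
     "score": 0.4, "all_variants": ["probate attorney"]},
    {"record_type": "keyword_theme", "primary_keyword": "real estate lawyer",
     "score": 0.7, "all_variants": ["real estate lawyer", "real estate lawyers"]},
    {"record_type": "keyword_theme", "primary_keyword": "florida real estate attorney",
     "score": 0.95, "all_variants": ["florida real estate attorney"]},
    {"record_type": "page_keywords", "page_url": "https://example.com"},
]

ad_groups = themes_to_ad_groups(rows)
for group in ad_groups:
    print(group["ad_group"], group["keywords"])
```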
6️⃣ Candidate negative keywords
The actor also flags n-grams that only appear on one site as negative keyword candidates (often brand names or very specific, non-generic terms):
{"record_type": "negative_keyword_candidate","phrase": "ryan shipp","n": 2,"count": 3,"site_count": 1,"sites": ["https://competitor.com"],"reason": "single_site_ngram"}
These are not auto-applied negatives. They’re suggestions that you should manually review before adding to a PPC negative list (especially competitor names or hyper-specific phrases you don’t want to pay for).
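As a possible first pass for that manual review, the sketch below drops any candidate that occurs as a whole-word phrase inside a theme variant, so you do not accidentally negate wording you are targeting. This safeguard is an assumption added for illustration, not something the actor does, and the function name is hypothetical.

```python
def negatives_for_review(rows):
    """List negative candidates that do not overlap any theme variant."""
    variants = [
        variant.lower()
        for row in rows if row.get("record_type") == "keyword_theme"
        for variant in row.get("all_variants", [])
    ]
    review = []
    for row in rows:
        if row.get("record_type") != "negative_keyword_candidate":
            continue
        phrase = row["phrase"].lower()
        # Pad with spaces so "law" does not match inside "lawyer".
        if any(f" {phrase} " in f" {variant} " for variant in variants):
            continue
        review.append(phrase)
    return sorted(set(review))


# Made-up rows for illustration.
rows = [
    {"record_type": "keyword_theme",
     "all_variants": ["florida real estate attorney"]},
    {"record_type": "negative_keyword_candidate", "phrase": "ryan shipp",
     "site_count": 1, "reason": "single_site_ngram"},
    {"record_type": "negative_keyword_candidate", "phrase": "real estate",
     "site_count": 1, "reason": "single_site_ngram"},
]

to_review = negatives_for_review(rows)
print(to_review)
```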
⚙️ Input
Input fields
{"urls": [{ "url": "https://example.com" },{ "url": "https://another-site.com" }],"min_ngram_n": 2}
- urls (array)
  - Uses the requestListSources editor in Apify.
  - Accepts either { "url": "..." } objects or plain strings "https://...".
- min_ngram_n (integer, optional, default 2)
  - The minimum n-gram length to analyze.
  - 2 = start at bigrams (“real estate”), 3 = only 3+ word phrases (“real estate lawyer”, “fort lauderdale real estate”).
  - Unigrams (single words) are never computed to keep noise down.
Internally, the actor analyzes n-grams from min_ngram_n up to a safe cap (currently 6) to avoid combinatorial blow-ups on very long phrases.
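The input rules above can be sketched as a small normalization step: accept both URL shapes, default min_ngram_n to 2, and build the range of n values up to the cap of 6. The function name and the clamping of out-of-range values are assumptions; the actor may validate its input differently.

```python
MAX_NGRAM_N = 6  # the "safe cap" mentioned above


def normalize_input(actor_input):
    """Return (urls, n_values) from a raw input dict (hypothetical helper)."""
    urls = []
    for item in actor_input.get("urls", []):
        # Accept {"url": "..."} objects as well as plain strings.
        url = item.get("url") if isinstance(item, dict) else item
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())

    min_n = int(actor_input.get("min_ngram_n", 2))
    # Assumption: clamp into [2, MAX_NGRAM_N]; unigrams are never computed.
    min_n = max(2, min(min_n, MAX_NGRAM_N))
    return urls, list(range(min_n, MAX_NGRAM_N + 1))


urls, n_values = normalize_input({
    "urls": [{"url": "https://example.com"}, "https://another-site.com"],
    "min_ngram_n": 3,
})
print(urls)
print(n_values)
```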
🔄 How it works (under the hood)
- Fetch & clean
  - Fetches each URL via HTTP.
  - Strips scripts, styles, and other noise and extracts visible text.
- Keyword extraction
  - Uses a transformer-based model (all-MiniLM-L6-v2 via KeyBERT) to extract multi-word keyphrases from the page content.
  - Filters out:
    - Numeric strings
    - Code-y / technical junk
    - Blacklisted tokens (e.g., obvious non-SEO boilerplate)
  - Keeps the most relevant 2–4 word keyphrases per page.
- Cross-site aggregation
  - Clusters similar phrases across sites using RapidFuzz (token-set similarity).
  - Keeps only clusters seen on multiple domains.
  - Computes Levenshtein distances inside each cluster to quantify how tight/loose the variants are.
- N-gram analysis
  - Builds n-gram stats across pages:
    - Only n in [min_ngram_n, 6].
    - Only n-grams seen on ≥ 3 sites are kept as strong cross-site themes.
- Theme building
  - Builds a graph of keyword groups connected by high Jaccard similarity.
  - Collapses connected components into themes.
  - Scores each theme by:
    - Cross-site importance (how many sites use it).
    - Cohesion (Levenshtein-based).
    - Phrase length (favoring 2–4 word phrases).
- Negative keyword suggestions
  - Separately scans all phrases for n-grams that appear on exactly one site.
  - Emits them as negative_keyword_candidate rows for manual review.
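The theme-building step (a similarity graph collapsed into connected components) can be illustrated with a small union-find sketch. The 0.5 threshold, the function name, and the sample data are assumptions; the README does not state which threshold or graph routine the actor uses.

```python
def build_themes(groups, similarities, threshold=0.5):
    """Merge keyword groups linked by similarity >= threshold into themes.

    groups: list of group names; similarities: (group_a, group_b, score) tuples.
    threshold is a hypothetical value chosen for this example.
    """
    parent = {group: group for group in groups}

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for group_a, group_b, score in similarities:
        if score >= threshold and group_a in parent and group_b in parent:
            parent[find(group_a)] = find(group_b)

    themes = {}
    for group in groups:
        themes.setdefault(find(group), []).append(group)
    return sorted(sorted(members) for members in themes.values())


groups = ["florida real estate attorney", "real estate lawyer", "probate attorney"]
similarities = [
    ("florida real estate attorney", "real estate lawyer", 0.63),
    ("real estate lawyer", "probate attorney", 0.2),
]

themes = build_themes(groups, similarities)
for members in themes:
    print(members)
```

Here the first two groups merge into one theme because their score clears the threshold, while “probate attorney” stays on its own.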
💰 Monetization & Scaling
This actor is designed to work cleanly with Apify Pay-Per-Event (PPE):
- One event per run: apify-actor-start
  Charge per actor start (each run).
- One event per result row: apify-default-dataset-item
  Every Actor.push_data(...) call creates a dataset item, which can be billed as a per-item event.
That means:
- Small runs with a few URLs → a handful of items → lower cost.
- Large competitive sweeps (many domains) → more items (pages, cross-site keywords, themes, negatives) → higher cost but also richer insight.
You can control cost by:
- Limiting the number of input URLs.
- Truncating or filtering which record types you care about (e.g., only page_keywords + keyword_theme).
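For a rough budget check before a large sweep, per-item charges can be estimated from the expected number of dataset rows. The sketch below assumes the listed "from $20.00 / 1,000 results" rate applies to every item and treats the per-run start charge as an unknown parameter defaulting to 0; check the actor's pricing page for the real figures.

```python
def estimate_cost(item_count, price_per_1000=20.0, start_fee=0.0):
    """Estimate a run's cost in USD from its dataset item count.

    price_per_1000 mirrors the listing's "from $20.00 / 1,000 results";
    start_fee is a placeholder because the start-event price is not stated here.
    """
    return round(start_fee + item_count * price_per_1000 / 1000, 2)


small_run = estimate_cost(150)
large_run = estimate_cost(2500)
print(small_run, large_run)
```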
🔄 Workflow Examples
This actor is workflow-ready and plays nicely with other Apify tools:
| 🔗 Integration | 🔍 Description |
|---|---|
| serp-scraper | Scrape top-ranking Google results for a query, then feed the URLs here to see the shared themes across the SERP. |
| map-scraper | Collect local business websites from Google Maps, then compare cross-site phrasing for local SEO campaigns. |
| Other actors | Build end-to-end automations: harvest → extract → cluster → export to Sheets/Data Studio. |
🚀 Ready to Launch?
Use this actor when you want more than just a list of keywords:
- See which phrases truly define your niche (themes & n-grams).
- Separate generic market language from brand-specific noise.
- Build better SEO topics, tighter PPC ad groups, and smarter negative lists.
Perfect for:
- SEO agencies
- Performance marketers
- Local law firms & service businesses
- Content strategists and SERP analysts
Happy crawling & clustering! 🚀🌐