AI & PhD Researcher Dataset Filter — recruiting, GTM, research

Pricing

from $10.00 / 1,000 delivered records

Turn a raw JSON export of AI / PhD / researcher profiles into a precise, deduplicated, deliverable-grade shortlist in seconds. Built for recruiting teams, B2B growth/SDR teams, and research panels who need clean, targeted lists instead of raw scraping noise. 🚀 22.5k records filtered in <6s.

Rating: 5.0 (1)

Developer: CrystalBytes (maintained by Community)

Actor stats: 0 bookmarks, 2 total users, 1 monthly active user, last modified 6 days ago

🎓 AI & PhD Researcher Dataset Filter — recruiting, GTM, research

You do not need extra engineering to get a useful first run. This Actor does not browse the web or pull live profiles. You bring your own JSON file (a single array of profile objects). The Actor filters, deduplicates, and shapes the rows you choose, then writes them to an Apify Dataset you can download as JSON or CSV.
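
Before a production run it can help to confirm that your export really is a single array of profile objects. The check below is a minimal sketch that runs on your side, not part of the Actor; `load_profiles` and the sample field names (`full_name`, `country`) are hypothetical.

```python
import json
from typing import Any


def load_profiles(text: str) -> list[dict[str, Any]]:
    """Parse JSON text and confirm it is one array of profile objects."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of profiles")
    bad = [i for i, row in enumerate(data) if not isinstance(row, dict)]
    if bad:
        raise ValueError(f"non-object entries at positions {bad[:5]}")
    return data


# Hypothetical two-row export; field names are illustrative only.
sample = (
    '[{"full_name": "A. Researcher", "country": "US"},'
    ' {"full_name": "B. Scientist", "country": "UK"}]'
)
profiles = load_profiles(sample)
print(len(profiles))  # 2
```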


Who it is for

  • Hiring and talent teams shortlisting PhD-level AI, ML, or research profiles from an existing export.
  • B2B GTM, SDR, and growth teams who need a clean, ICP-matched list instead of a noisy raw dump.
  • Research, policy, and panel coordinators who need specific countries, languages, or seniority without manual spreadsheet work.
  • Data and ops teams that already have profile JSON and want repeatable, versioned “audience” runs.

Get started in three steps

  1. Try the built-in sample — Open the Actor and run. The bundled demo loads automatically so you can see how filters work; results appear on the Dataset tab.
  2. Your own profile file — For production runs, whoever manages your workspace connects the JSON source your organization uses. If you need a different file than the default, ask them to point the Actor at it.
  3. Download results — Open the run’s Dataset for the rows. For a step-by-step breakdown, read RUN_SUMMARY in the run’s default Key-value store.

Note: The Input form is for filters and export limits only. Which JSON file a run uses is chosen outside the public form (by your workspace setup).


Find the right people (practical playbooks)

Use the matching sections in the Input form. Leave a field empty to turn that filter off.

| I want to… | Start here in the form |
| --- | --- |
| US or UK candidates only | Location: countries include (and add excludes for regions you do not want). |
| Europe-based PhD+ researchers | Location (continent or country) + Education: minimum level, schools, or degrees. |
| Senior AI / product / legal in software | Career: industry, job title, job level; optionally Company for size or employer name. |
| Quality contacts (work email, fewer bad domains) | Contact quality: require work email, allow / block email domains. |
| A tight shortlist, not the whole file | Volume, sampling & pagination: see How many rows you export below. |
| No duplicate people | Deduplication: pick a primary key (e.g. LinkedIn username) and optional backup key. |
| Safer sharing or demos (masked email / phone) | Output shaping & privacy: redact PII, trim fields, or flatten nested fields for CSV. |

Narrow with AND (every enabled group must match) or explore more broadly with OR (at least one group). Exclusion lists (countries you block, bad domains, title excludes) are always applied, even in OR mode, so you do not “leak” blocked rows by accident.


How filters work (short version)

  • Each enabled field is a condition. Match mode (AND / OR) controls how groups of conditions combine; values inside one list are OR’d (e.g. any of several countries).
  • Empty = that filter is off.
  • Excludes (countries, companies, keywords, etc.) are always enforced for safety.
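
The rules above can be sketched in a few lines. This is an illustration of the described behavior, not the Actor's source: `matches`, the group functions, and the field names (`country`, `education_level`, `email`) are all hypothetical.

```python
from typing import Any, Callable

Profile = dict[str, Any]
Predicate = Callable[[Profile], bool]


def matches(profile: Profile, groups: list[Predicate],
            excludes: list[Predicate], mode: str = "AND") -> bool:
    """Combine enabled filter groups; excludes win in both modes."""
    if any(is_blocked(profile) for is_blocked in excludes):
        return False
    if not groups:  # every filter left empty means nothing is filtered out
        return True
    hits = [group(profile) for group in groups]
    return all(hits) if mode == "AND" else any(hits)


# Hypothetical groups; values inside one list are OR'd via set membership.
def in_location(p: Profile) -> bool:
    return p.get("country") in {"US", "UK"}


def has_phd(p: Profile) -> bool:
    return p.get("education_level") == "PhD"


def blocked_domain(p: Profile) -> bool:
    return p.get("email", "").endswith("@blocked.example")


rows = [
    {"id": "a", "country": "US", "education_level": "PhD", "email": "a@lab.example"},
    {"id": "b", "country": "DE", "education_level": "PhD", "email": "b@lab.example"},
    {"id": "c", "country": "US", "education_level": "MSc", "email": "c@blocked.example"},
]
groups, excludes = [in_location, has_phd], [blocked_domain]
and_ids = [r["id"] for r in rows if matches(r, groups, excludes, "AND")]
or_ids = [r["id"] for r in rows if matches(r, groups, excludes, "OR")]
print(and_ids, or_ids)  # ['a'] ['a', 'b']
```

Row `c` matches the location group, so OR mode alone would admit it; the exclude check runs first and keeps it out.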

The Console form is grouped into sections: Optional listing (if used) → Filter logic → Volume → Location through Output shaping. Every field has examples and tips inline.


How many rows you export

The Actor filters the entire file first, then deduplicates (if you set dedupe), then optionally takes a random sample, and only then applies row limits. So limits always apply to the qualified list.
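
A minimal sketch of that ordering, under stated assumptions: `run_pipeline` and `dedupe` are hypothetical helpers, the `linkedin` and `country` fields are illustrative, and keeping rows that have no usable dedupe key is a choice made here, not documented Actor behavior.

```python
import random
from typing import Any, Callable, Optional

Profile = dict[str, Any]


def dedupe(rows: list[Profile], primary: str,
           backup: Optional[str] = None) -> list[Profile]:
    """Keep the first row per key; use the backup field when primary is empty."""
    seen: set[tuple[str, Any]] = set()
    kept = []
    for row in rows:
        field = primary if row.get(primary) else backup
        if field is None or not row.get(field):
            kept.append(row)  # assumption: rows with no usable key are kept
            continue
        key = (field, row[field])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def run_pipeline(rows: list[Profile], keep: Callable[[Profile], bool],
                 primary: Optional[str] = None, backup: Optional[str] = None,
                 sample_size: Optional[int] = None, seed: Optional[int] = None,
                 max_records: Optional[int] = None) -> list[Profile]:
    qualified = [r for r in rows if keep(r)]            # 1. filter the whole file
    if primary:
        qualified = dedupe(qualified, primary, backup)  # 2. deduplicate
    if sample_size is not None:                         # 3. optional random sample
        rng = random.Random(seed)
        qualified = rng.sample(qualified, min(sample_size, len(qualified)))
    if max_records is not None:                         # 4. row limit comes last
        qualified = qualified[:max_records]
    return qualified


rows = [
    {"linkedin": "x", "country": "DE"},
    {"linkedin": "a", "country": "US"},
    {"linkedin": "a", "country": "US"},  # duplicate of the previous row
    {"linkedin": "b", "country": "US"},
    {"linkedin": "c", "country": "US"},
]
out = run_pipeline(rows, keep=lambda r: r["country"] == "US",
                   primary="linkedin", max_records=2)
print([r["linkedin"] for r in out])  # ['a', 'b']
```

Because the limit is applied last, a cap of 2 yields two qualified, unique rows rather than the first two rows of the raw file.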

Use one style or the other, not both: if a run sets fields from both styles, it stops with a clear error.

A) “Start at row” and “Stop before row” (range)

Good when you want a single slice without doing math (e.g. “rows 0–999” or “100 to 1000”).

  • Start at row: 0 = first row in the matched list (after filters, dedupe, and optional sample).
  • Stop before row: exclusive end, so valid rows are [Start, Stop). Example: start 0, stop 1000 = the first 1,000 rows; start 100, stop 1000 = 900 rows (indices 100 through 999).

Rows in this export ≈ Stop − Start. Paid plans support starting after row 0 (pagination). On the free tier, starting after the first row is not supported — use the first slice only, or upgrade for offset / pagination.
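
The half-open range behaves like an ordinary Python slice, which is one way to sanity-check the arithmetic. In this sketch `slice_range` is a hypothetical helper and the 2,500-row list stands in for the qualified list.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def slice_range(qualified: Sequence[T], start: int, stop: int) -> Sequence[T]:
    """Return rows in [start, stop); stop is exclusive."""
    if start < 0 or stop < start:
        raise ValueError("need 0 <= start <= stop")
    return qualified[start:stop]


qualified = list(range(2500))  # stand-in for 2,500 qualified rows
first = slice_range(qualified, 0, 1000)
middle = slice_range(qualified, 100, 1000)
tail = slice_range(qualified, 2000, 3000)
print(len(first), len(middle), middle[0], middle[-1], len(tail))
# 1000 900 100 999 500
```

The last case shows why the row count is only approximately Stop − Start: when the qualified list ends before the stop index, the slice is shorter.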

B) “Skip first N” and “Max records” (classic)

  • Skip first N: offset into the qualified list. Example with 1,000-row pages: page 2 is skip 1000, max 1000.
  • Max records to output: 0 means “up to the limit allowed by your plan and the monthly allowance,” not “zero rows.”

Random sample (optional) shuffles the qualified list before skip / cap — use it for A/B tests or training splits, not for stable paging unless you know what you are doing.
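
The two volume styles can be reduced to one (offset, limit) pair, which also makes the "not both" rule easy to see. This is a sketch under assumptions: `resolve_window` and `plan_cap` are hypothetical names, unset fields are modeled as `None`, and capping the range style at the plan limit is assumed rather than documented.

```python
from typing import Optional


def resolve_window(*, plan_cap: int,
                   start_at: Optional[int] = None,
                   stop_before: Optional[int] = None,
                   skip: Optional[int] = None,
                   max_records: Optional[int] = None) -> tuple[int, int]:
    """Return (offset, limit) for one volume style; reject a mix of both."""
    uses_range = start_at is not None or stop_before is not None
    uses_classic = skip is not None or max_records is not None
    if uses_range and uses_classic:
        raise ValueError("use either start/stop or skip/max, not both")
    if uses_range:
        offset = start_at or 0
        limit = plan_cap if stop_before is None else stop_before - offset
        if limit < 0:
            raise ValueError("stop must be >= start")
    else:
        offset = skip or 0
        limit = max_records or plan_cap  # 0 (or unset) means "up to the plan cap"
    return offset, min(limit, plan_cap)


page2 = resolve_window(plan_cap=4000, skip=1000, max_records=1000)
uncapped = resolve_window(plan_cap=4000, skip=0, max_records=0)
ranged = resolve_window(plan_cap=4000, start_at=100, stop_before=1000)
print(page2, uncapped, ranged)  # (1000, 1000) (0, 4000) (100, 900)
```

Paging like this is only stable when the qualified list keeps the same order between runs, which is why a random sample and page-by-page export do not mix well.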

Billing reminder: the platform may charge by delivered rows; your plan also enforces per-run and per-month caps. See the Actor’s Pricing tab in Apify and RUN_SUMMARY → monetization.


Output and transparency

  • Dataset — one JSON object per row; download as JSON, CSV, or Excel from the run.
  • RUN_SUMMARY (in the run’s default Key-value store) — how many records were loaded, filtered, deduplicated, sampled, skipped, and exported, plus monetization and timing. Use it when results look empty, too small, or when reconciling usage.

Set Flatten nested fields for wider CSV columns. Use Redact PII when you need shareable samples without full email or phone.
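
To picture what these two options do to a row, here is a rough sketch. The dotted column names and the masking format are assumptions made for illustration; the Actor's actual column naming and redaction output may differ, and `flatten` and `mask_email` are hypothetical helpers.

```python
from typing import Any


def flatten(obj: dict[str, Any], prefix: str = "", sep: str = ".") -> dict[str, Any]:
    """Turn nested objects into one level of dotted column names."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return f"{local[:1]}***@{domain}"


row = {"name": "Ada", "contact": {"email": "ada@lab.example", "phone": "+1 555 0100"}}
flat = flatten(row)
flat["contact.email"] = mask_email(flat["contact.email"])
print(flat)
# {'name': 'Ada', 'contact.email': 'a***@lab.example', 'contact.phone': '+1 555 0100'}
```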


Pricing and plans (summary)

Exact unit prices, events, and any platform fees are on this Actor’s Pricing tab in the Apify Console. The table below is the Actor-side policy (from our tier file), so you can see run and monthly caps; it is not a substitute for the Console invoice.

| Tier | Max / run | Max / month | Runs / day | Free tier field limits |
| --- | --- | --- | --- | --- |
| free | 50 | 120 | 1 | Yes (basic fields only) |
| starter | 4 000 | 15 000 | no hard daily cap in Actor | |
| pro | 4 000 | 25 000 | no hard daily cap in Actor | |
| agency | 10 000 | 100 000 | no hard daily cap in Actor | |
| development | (high) | (high) | (high) | For local / owner tests only |

  • Free strips sensitive columns (e.g. work email, phones, some addresses) so you can evaluate fit before upgrading.
  • Paid tiers unlock the full record, offset pagination (skip / start-after-first-row), and overage past the monthly cap where configured — see the Console for overage event names and prices.
  • After each run, check RUN_SUMMARY → monetization and compare to your Apify billing view.
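
As a rough way to estimate how many rows a paid-tier run can deliver, the per-run and per-month caps from the table can be combined as below. This is a sketch, not the Actor's billing logic: `deliverable` is a hypothetical helper, overage past the monthly cap is ignored, and the Console remains the source of truth.

```python
# Caps copied from the tier table above: (max rows per run, max rows per month).
TIER_CAPS = {
    "starter": (4_000, 15_000),
    "pro": (4_000, 25_000),
    "agency": (10_000, 100_000),
}


def deliverable(tier: str, requested: int, used_this_month: int) -> int:
    """Rows a run could deliver; requested == 0 means 'as many as the plan allows'."""
    per_run, per_month = TIER_CAPS[tier]
    remaining = max(per_month - used_this_month, 0)
    wanted = per_run if requested == 0 else requested
    return min(wanted, per_run, remaining)


print(deliverable("starter", 0, 0))          # 4000
print(deliverable("starter", 5_000, 13_000)) # 2000
print(deliverable("agency", 2_500, 0))       # 2500
```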

Trust, data, and compliance

  • You supply the JSON; this run does not crawl third-party sites or “discover” new profiles from the open web.
  • You are responsible for lawful use, consent, and platform terms that apply to your source data (e.g. privacy rules, email outreach laws).
  • Use redaction and field allow / deny lists for demos, contractors, or external sharing.
  • Who can see a run’s full Input is controlled in Apify (organization permissions). Do not put passwords or private keys in task input.

For performance and large files, see the run’s Options (memory, timeout). As a rough guide, a 22k-row file has been processed in a few seconds at 2 GB memory in development tests. Very large single files may need more memory, a longer timeout, or a split source file. Ask your workspace admin if a run times out or runs out of memory.


Reliability and support

  • Invalid inputs (e.g. bad regex patterns, over-claimed advertised counts, or conflicting volume settings) fail fast with a readable error.
  • 0 results after filters: widen one group at a time, try OR match mode, or check RUN_SUMMARY → pipeline to see where the list went to zero.
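
Finding where the list went to zero amounts to walking the stage counts in order. The key names below (`loaded`, `after_filters`, and so on) are hypothetical placeholders; check a real RUN_SUMMARY for the names your run actually reports before adapting this sketch.

```python
from typing import Optional

# Hypothetical stage names, listed in pipeline order.
STAGES = ["loaded", "after_filters", "after_dedupe", "after_sample", "exported"]


def first_empty_stage(pipeline: dict[str, int]) -> Optional[str]:
    """Return the first reported stage whose row count is zero, if any."""
    for stage in STAGES:
        if stage in pipeline and pipeline[stage] == 0:
            return stage
    return None


# Illustrative counts only, not output from a real run.
summary_pipeline = {"loaded": 22500, "after_filters": 0, "after_dedupe": 0, "exported": 0}
print(first_empty_stage(summary_pipeline))  # after_filters
```

If the first empty stage is the filter stage, the filters are too narrow; if the load count itself is zero, the problem is the source file rather than the filters.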

Support and feedback: crystalbytes@proton.me — usually within one business day.


Ready to build a clean, plan-aware shortlist from your own researcher JSON? Start a run and refine filters using RUN_SUMMARY until the numbers match your goal.