Group Trip Data Extractor
A production-ready Apify Actor that extracts structured trip information from multiple group trip URLs and enriches data using AI.
Project Structure
```
AI Group trip extractor/
├── .actor/
│   ├── actor.json            # Actor configuration
│   ├── input_schema.json     # Input parameter definitions
│   ├── output_schema.json    # Output schema for API/Console
│   └── dataset_schema.json   # Dataset field definitions
├── src/
│   ├── main.js               # Core actor logic
│   ├── scraper.js            # Web scraping (Cheerio/Playwright)
│   ├── ai-enricher.js        # OpenAI integration
│   └── schema.js             # Schema validation utilities
├── Dockerfile                # Docker build configuration
├── package.json              # Dependencies & scripts
└── INPUT.json                # Sample input for testing
```
Features
- Multi-URL Processing: Process multiple trip URLs in a single run
- Intelligent Scraping: Uses Cheerio for fast parsing of static HTML, or Playwright for JavaScript-heavy pages
- AI Enrichment: Uses OpenAI to infer missing data (coordinates, country, city, trip type)
- Strict Schema: Fixed 17-field output schema for Excel compatibility
- Error Handling: Never breaks schema - returns empty structured rows on errors
- Test Mode: Quick testing with mock data
Output Schema
Every output item contains exactly these 17 fields:
| Field | Description |
|---|---|
| `title` | Trip/tour name |
| `destination` | Main destination name |
| `country` | Country name (AI-inferred if missing) |
| `state` | State/province/region (AI-inferred if missing) |
| `city` | City name (AI-inferred if missing) |
| `latitude` | Latitude coordinate (AI-inferred from destination) |
| `longitude` | Longitude coordinate (AI-inferred from destination) |
| `provider` | Travel company/organizer name |
| `price` | Numeric price value |
| `currency` | Currency code (INR, USD, EUR, etc.) |
| `start_date` | Start date (YYYY-MM-DD format) |
| `end_date` | End date (YYYY-MM-DD format) |
| `trip_type` | Trip category (trek, backpacking, weekend, etc.) |
| `description` | Brief trip description |
| `images` | Comma-separated image URLs |
| `inclusions` | What's included (comma-separated) |
| `booking_url` | URL to book the trip |
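
The fixed schema can be illustrated with a small helper that coerces any partial record to exactly these 17 keys. This is a minimal sketch, not the Actor's actual `src/schema.js`; the names `OUTPUT_FIELDS` and `toSchemaRow` are hypothetical.

```javascript
// Hypothetical names for illustration; the real src/schema.js may differ.
const OUTPUT_FIELDS = [
  'title', 'destination', 'country', 'state', 'city',
  'latitude', 'longitude', 'provider', 'price', 'currency',
  'start_date', 'end_date', 'trip_type', 'description',
  'images', 'inclusions', 'booking_url',
];

// Return an object with exactly the 17 schema keys, in table order.
// Unknown keys are dropped; missing, null, or undefined values become ''.
function toSchemaRow(partial) {
  const source = partial ?? {};
  const row = {};
  for (const field of OUTPUT_FIELDS) {
    const value = source[field];
    row[field] = value === undefined || value === null ? '' : value;
  }
  return row;
}

const row = toSchemaRow({ title: 'Himalayan Trek', price: 12000, extra: 'dropped' });
console.log(Object.keys(row).length); // 17
```

Keeping the key order fixed is what makes the rows line up as stable columns when the dataset is exported to a spreadsheet.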
Input Configuration
```json
{
  "tripUrls": [
    "https://example-travel.com/trip/himalayan-trek",
    "https://example-travel.com/trip/goa-weekend"
  ],
  "openaiApiKey": "sk-...",
  "model": "gpt-4o-mini",
  "testMode": false,
  "maxConcurrency": 5,
  "requestTimeout": 60000,
  "usePlaywright": false
}
```
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `tripUrls` | array | Yes | - | List of trip URLs to extract |
| `openaiApiKey` | string | Yes | - | OpenAI API key for enrichment |
| `model` | string | No | `gpt-4o-mini` | OpenAI model to use |
| `testMode` | boolean | No | `false` | Return mock data without scraping |
| `maxConcurrency` | integer | No | `5` | Max concurrent page loads |
| `requestTimeout` | integer | No | `60000` | Page load timeout (ms) |
| `usePlaywright` | boolean | No | `false` | Use Playwright for JS-heavy pages |
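
As a rough sketch of how these parameters could be resolved at startup, the snippet below merges the documented defaults into the raw input and checks the two required fields. `INPUT_DEFAULTS` and `resolveInput` are hypothetical names, and the Actor's own validation (for example through `input_schema.json`) may behave differently.

```javascript
// Defaults copied from the parameter table above.
const INPUT_DEFAULTS = {
  model: 'gpt-4o-mini',
  testMode: false,
  maxConcurrency: 5,
  requestTimeout: 60000,
  usePlaywright: false,
};

// Hypothetical helper: apply defaults, then enforce the required parameters.
function resolveInput(input) {
  const resolved = { ...INPUT_DEFAULTS, ...(input ?? {}) };
  if (!Array.isArray(resolved.tripUrls) || resolved.tripUrls.length === 0) {
    throw new Error('tripUrls must be a non-empty array of URLs');
  }
  if (typeof resolved.openaiApiKey !== 'string' || resolved.openaiApiKey === '') {
    throw new Error('openaiApiKey is required');
  }
  return resolved;
}

const resolved = resolveInput({
  tripUrls: ['https://example-travel.com/trip/goa-weekend'],
  openaiApiKey: 'sk-...',
  usePlaywright: true,
});
console.log(resolved.maxConcurrency, resolved.usePlaywright); // 5 true
```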
How It Works
Step 1: Fetch Data
For each URL, the actor loads the page and extracts:
- Title, description, destination
- Price, dates, duration
- Itinerary, inclusions
- Provider name, images, booking link
Step 2: Clean Data
- Remove HTML tags
- Normalize whitespace
- Keep meaningful content only
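
The cleaning step might look roughly like the function below. It is a dependency-free sketch that uses regular expressions as a stand-in; the name `cleanText` is hypothetical, and the Actor itself may instead take text directly from the DOM parsed by Cheerio or Playwright.

```javascript
// Regex-based tag stripping is an assumption made for illustration only.
function cleanText(html) {
  return String(html ?? '')
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style bodies
    .replace(/<[^>]+>/g, ' ')  // remove the remaining tags
    .replace(/&nbsp;/gi, ' ')  // a common entity that would otherwise survive
    .replace(/\s+/g, ' ')      // collapse runs of whitespace
    .trim();
}

console.log(cleanText('<p>Goa   <b>Weekend</b>\n Trip</p>')); // "Goa Weekend Trip"
```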
Step 3: AI Enrichment
Send extracted content to OpenAI to:
- Map data to required schema
- Infer missing fields (country, state, city, trip_type)
- Generate coordinates from destination
- Normalize currency and dates
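
One way to picture the enrichment step is as three small functions: build a request body for OpenAI's Chat Completions API, parse the reply defensively, and merge the result into the scraped data. The sketch below assumes that approach; the prompt wording, the function names, and the rule that scraped values are never overwritten are all assumptions rather than the Actor's documented behavior.

```javascript
// Fields the list above says may be inferred when scraping leaves them empty.
const INFERRED_FIELDS = ['country', 'state', 'city', 'trip_type', 'latitude', 'longitude'];

// Build a Chat Completions request body. The prompt wording is illustrative only.
function buildEnrichmentRequest(model, cleanedText) {
  return {
    model,
    response_format: { type: 'json_object' }, // ask for a JSON-only reply
    messages: [
      {
        role: 'system',
        content:
          'Extract group trip data and reply with one JSON object. ' +
          `Infer ${INFERRED_FIELDS.join(', ')} when absent. ` +
          'Use YYYY-MM-DD dates, a currency code such as INR or USD, ' +
          'and "" for anything unknown.',
      },
      { role: 'user', content: cleanedText },
    ],
  };
}

// Parse the reply defensively: anything that is not a JSON object becomes {}.
function parseEnrichment(replyText) {
  try {
    const parsed = JSON.parse(replyText);
    return parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed) ? parsed : {};
  } catch {
    return {};
  }
}

// Fill gaps only: a non-empty scraped value is never overwritten (an assumption).
function mergeEnrichment(scraped, enriched) {
  const merged = { ...scraped };
  for (const [key, value] of Object.entries(enriched)) {
    const current = merged[key];
    if (current === undefined || current === null || current === '') {
      merged[key] = value;
    }
  }
  return merged;
}
```

The request body would be sent to the Chat Completions endpoint using the `openaiApiKey` input; that network call is left out so the sketch stays runnable offline. Returning `{}` on a malformed reply lets the caller fall back to the scraped data alone.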
Step 4: Output
Each dataset item contains all 17 fields with missing values as empty strings.
Error Handling
- Scraping fails: Returns empty structured row with booking_url
- AI fails: Returns partially mapped data
- Invalid URL: Returns empty structured row
- Schema never breaks: All outputs have exactly 17 fields
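
The failure modes above can be sketched as a wrapper around the per-URL pipeline. In this illustration `extract` is a hypothetical stand-in for the scrape and enrich steps, and `FIELDS`, `toRow`, and `processSafely` are likewise invented names; the Actor's real control flow may differ.

```javascript
const FIELDS = [
  'title', 'destination', 'country', 'state', 'city',
  'latitude', 'longitude', 'provider', 'price', 'currency',
  'start_date', 'end_date', 'trip_type', 'description',
  'images', 'inclusions', 'booking_url',
];

// Build a 17-field row from whatever data is available.
function toRow(data = {}) {
  return Object.fromEntries(FIELDS.map((field) => [field, data[field] ?? '']));
}

// Never throws: every outcome is a row with exactly the 17 schema fields.
async function processSafely(url, extract) {
  try {
    new URL(url); // throws a TypeError when the URL is invalid
  } catch {
    return toRow(); // invalid URL: empty structured row
  }
  try {
    return toRow(await extract(url)); // success, or partially mapped data
  } catch {
    return toRow({ booking_url: url }); // scraping failed: keep only the URL
  }
}
```

A caller can then push every result to the dataset without checking for errors first, since a failed URL still produces a well-formed row.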
Performance
- Handles multiple URLs efficiently
- Configurable concurrency
- Each URL finishes or times out in under 5 minutes
- Test mode for quick validation
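
To show what a setting like `maxConcurrency` bounds, here is a minimal promise-based worker pool. It is only an illustration with a hypothetical name, `mapWithConcurrency`; the Actor may well delegate concurrency to its crawling library instead.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim an item before awaiting
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

If `worker` can reject, wrap it in error handling first; otherwise one failure rejects the whole batch.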
Local Development
```bash
# Install dependencies
npm install

# Run locally
npm start

# Or with Apify CLI
apify run
```
Deployment to Apify
```bash
# Login to Apify
apify login

# Push to Apify
apify push
```
Cost Estimation
- Scraping: ~$0.01 per URL (Apify compute)
- AI Enrichment: ~$0.001-0.01 per URL (depends on model)
  - gpt-4o-mini: Most cost-effective
  - gpt-4o: Higher quality, higher cost
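
As a worked example of the figures above, the snippet below turns the per-URL estimates into a low and high bound for a whole run. The numbers are the approximate values from this list, not billing data, and `estimateRunCost` is a hypothetical helper.

```javascript
// Approximate per-URL costs in USD, taken from the list above.
const SCRAPE_COST = 0.01;
const AI_COST_RANGE = [0.001, 0.01];

// Return a rough low/high cost bound for a run over `urlCount` URLs.
function estimateRunCost(urlCount) {
  const [aiLow, aiHigh] = AI_COST_RANGE;
  return {
    low: urlCount * (SCRAPE_COST + aiLow),
    high: urlCount * (SCRAPE_COST + aiHigh),
  };
}

const { low, high } = estimateRunCost(100);
console.log(`$${low.toFixed(2)} to $${high.toFixed(2)}`); // $1.10 to $2.00
```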
License
ISC