Group Trip Data Extractor

Pricing

from $0.05 / result

Extract structured trip information from multiple group trip URLs and enrich data using AI

Developer

Aadhithya

Maintained by Community

A production-ready Apify Actor that extracts structured trip information from multiple group trip URLs and enriches data using AI.

Project Structure

AI Group trip extractor/
├── .actor/
│ ├── actor.json # Actor configuration
│ ├── input_schema.json # Input parameter definitions
│ ├── output_schema.json # Output schema for API/Console
│ └── dataset_schema.json # Dataset field definitions
├── src/
│ ├── main.js # Core actor logic
│ ├── scraper.js # Web scraping (Cheerio/Playwright)
│ ├── ai-enricher.js # OpenAI integration
│ └── schema.js # Schema validation utilities
├── Dockerfile # Docker build configuration
├── package.json # Dependencies & scripts
└── INPUT.json # Sample input for testing

Features

  • Multi-URL Processing: Process multiple trip URLs in a single run
  • Intelligent Scraping: Uses Cheerio for fast parsing of static pages, or Playwright for JavaScript-heavy pages (see usePlaywright)
  • AI Enrichment: Uses OpenAI to infer missing data (coordinates, country, state, city, trip type)
  • Strict Schema: Fixed 17-field output schema for Excel compatibility
  • Error Handling: Never breaks the schema; returns empty structured rows on errors
  • Test Mode: Quick testing with mock data

Output Schema

Every output item contains exactly these 17 fields:

| Field | Description |
|---|---|
| title | Trip/tour name |
| destination | Main destination name |
| country | Country name (AI-inferred if missing) |
| state | State/province/region (AI-inferred if missing) |
| city | City name (AI-inferred if missing) |
| latitude | Latitude coordinate (AI-inferred from destination) |
| longitude | Longitude coordinate (AI-inferred from destination) |
| provider | Travel company/organizer name |
| price | Numeric price value |
| currency | Currency code (INR, USD, EUR, etc.) |
| start_date | Start date (YYYY-MM-DD format) |
| end_date | End date (YYYY-MM-DD format) |
| trip_type | Trip category (trek, backpacking, weekend, etc.) |
| description | Brief trip description |
| images | Comma-separated image URLs |
| inclusions | What's included (comma-separated) |
| booking_url | URL to book the trip |
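
One way to guarantee this fixed shape is to pass every record through a small normalizer before it is written to the dataset. The sketch below is only an illustration of that idea and is not taken from the actor's `schema.js`; the field names come from the table above, while `FIELDS` and `toSchemaRow` are hypothetical names.

```javascript
// Field names are from the output schema table; FIELDS and toSchemaRow
// are illustrative names, not the actor's actual API.
const FIELDS = [
  'title', 'destination', 'country', 'state', 'city',
  'latitude', 'longitude', 'provider', 'price', 'currency',
  'start_date', 'end_date', 'trip_type', 'description',
  'images', 'inclusions', 'booking_url',
];

// Build a row with exactly the 17 fields, always in the same order.
// Unknown keys are dropped, missing or null values become empty strings,
// and arrays (for example image URLs) are joined into one comma-separated string.
function toSchemaRow(partial = {}) {
  const row = {};
  for (const field of FIELDS) {
    const value = partial[field];
    if (value === undefined || value === null) {
      row[field] = '';
    } else if (Array.isArray(value)) {
      row[field] = value.join(', ');
    } else {
      row[field] = value;
    }
  }
  return row;
}

const row = toSchemaRow({
  title: 'Himalayan Trek',
  price: 12000,
  images: ['https://example-travel.com/a.jpg', 'https://example-travel.com/b.jpg'],
  rating: 4.8, // not part of the schema, so it is dropped
});
console.log(Object.keys(row).length); // 17
```

Keeping the column order fixed is what makes the rows line up when the dataset is exported to a spreadsheet.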

Input Configuration

{
  "tripUrls": [
    "https://example-travel.com/trip/himalayan-trek",
    "https://example-travel.com/trip/goa-weekend"
  ],
  "openaiApiKey": "sk-...",
  "model": "gpt-4o-mini",
  "testMode": false,
  "maxConcurrency": 5,
  "requestTimeout": 60000,
  "usePlaywright": false
}

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| tripUrls | array | Yes | - | List of trip URLs to extract |
| openaiApiKey | string | Yes | - | OpenAI API key for enrichment |
| model | string | No | gpt-4o-mini | OpenAI model to use |
| testMode | boolean | No | false | Return mock data without scraping |
| maxConcurrency | integer | No | 5 | Max concurrent page loads |
| requestTimeout | integer | No | 60000 | Page load timeout (ms) |
| usePlaywright | boolean | No | false | Use Playwright for JS-heavy pages |
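
To make the defaults concrete, here is a minimal sketch of how the input could be resolved before a run. It assumes the defaults listed in the table; `DEFAULTS` and `resolveInput` are hypothetical names, and the actor's real validation (driven by `input_schema.json`) may behave differently.

```javascript
// Defaults copied from the parameter table; resolveInput is a hypothetical helper.
const DEFAULTS = {
  model: 'gpt-4o-mini',
  testMode: false,
  maxConcurrency: 5,
  requestTimeout: 60000, // milliseconds
  usePlaywright: false,
};

// Merge the user's input over the defaults and check the two required parameters.
function resolveInput(input = {}) {
  const resolved = { ...DEFAULTS, ...input };
  if (!Array.isArray(resolved.tripUrls) || resolved.tripUrls.length === 0) {
    throw new Error('tripUrls must be a non-empty array of URLs');
  }
  if (typeof resolved.openaiApiKey !== 'string' || resolved.openaiApiKey === '') {
    throw new Error('openaiApiKey is required');
  }
  return resolved;
}

const config = resolveInput({
  tripUrls: ['https://example-travel.com/trip/goa-weekend'],
  openaiApiKey: 'sk-...',
  usePlaywright: true,
});
console.log(config.maxConcurrency, config.usePlaywright); // 5 true
```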

How It Works

Step 1: Fetch Data

For each URL, the actor loads the page and extracts:

  • Title, description, destination
  • Price, dates, duration
  • Itinerary, inclusions
  • Provider name, images, booking link

Step 2: Clean Data

  • Remove HTML tags
  • Normalize whitespace
  • Keep meaningful content only
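
The cleaning step can be approximated with a few string replacements. This is a rough sketch, not the actor's code: `cleanText` is a hypothetical helper, and treating script and style blocks as the "non-meaningful" content is an assumption. Regex-based stripping is crude, so a real HTML parser such as Cheerio is the safer choice on messy pages.

```javascript
// Hypothetical helper illustrating the three cleaning steps listed above.
function cleanText(html) {
  return String(html)
    // Drop script/style blocks together with their contents (assumed non-meaningful).
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Remove any remaining HTML tags.
    .replace(/<[^>]+>/g, ' ')
    // Treat non-breaking-space entities as ordinary spaces.
    .replace(/&nbsp;/gi, ' ')
    // Collapse runs of whitespace, including newlines, into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

const cleaned = cleanText(
  '<h1>Goa  Weekend</h1>\n<p>3 days,&nbsp;2 nights</p><script>track()</script>'
);
console.log(cleaned); // Goa Weekend 3 days, 2 nights
```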

Step 3: AI Enrichment

Send extracted content to OpenAI to:

  • Map data to required schema
  • Infer missing fields (country, state, city, trip_type)
  • Generate coordinates from destination
  • Normalize currency and dates
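
The sketch below shows one way the model's reply could be merged back into the scraped record. Everything here is an assumption made for illustration: the reply is taken to be a flat JSON object, scraped values are taken to win over inferred ones, and `mergeEnrichment`, `normalizeCurrency`, and `normalizeDate` are hypothetical names. Mapping `$` to USD is a simplification, since other currencies use the same symbol.

```javascript
// Hypothetical helpers; not the actor's ai-enricher.js.
const SYMBOL_TO_CODE = new Map([['₹', 'INR'], ['$', 'USD'], ['€', 'EUR']]);

// Return a three-letter upper-case code, or '' when the value is not recognized.
function normalizeCurrency(value) {
  const text = String(value ?? '').trim();
  if (SYMBOL_TO_CODE.has(text)) return SYMBOL_TO_CODE.get(text);
  return /^[A-Za-z]{3}$/.test(text) ? text.toUpperCase() : '';
}

// Keep the YYYY-MM-DD prefix of an ISO-style date string, or '' otherwise.
function normalizeDate(value) {
  const match = /^\d{4}-\d{2}-\d{2}/.exec(String(value ?? ''));
  return match ? match[0] : '';
}

// Fill gaps in the scraped record from the model's JSON reply.
function mergeEnrichment(scraped, aiReplyText) {
  let inferred = {};
  try {
    const parsed = JSON.parse(aiReplyText);
    if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
      inferred = parsed;
    }
  } catch {
    // Unparseable reply: continue with the scraped values only.
  }
  const merged = { ...scraped };
  for (const [key, value] of Object.entries(inferred)) {
    // Scraped values win; the model only fills fields that are empty.
    if (merged[key] === undefined || merged[key] === null || merged[key] === '') {
      merged[key] = value;
    }
  }
  merged.currency = normalizeCurrency(merged.currency);
  merged.start_date = normalizeDate(merged.start_date);
  merged.end_date = normalizeDate(merged.end_date);
  return merged;
}

const enriched = mergeEnrichment(
  { title: 'Himalayan Trek', destination: 'Manali', country: '', currency: '₹', start_date: '2025-03-14T00:00:00Z' },
  '{"country": "India", "state": "Himachal Pradesh", "trip_type": "trek", "title": "ignored"}'
);
console.log(enriched.country, enriched.currency, enriched.start_date); // India INR 2025-03-14
```

Falling back to the scraped values when the reply cannot be parsed matches the "partially mapped data" behavior described under Error Handling.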

Step 4: Output

Each dataset item contains all 17 fields with missing values as empty strings.

Error Handling

  • Scraping fails: Returns empty structured row with booking_url
  • AI fails: Returns partially mapped data
  • Invalid URL: Returns empty structured row
  • Schema never breaks: All outputs have exactly 17 fields
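
These rules can be expressed as a wrapper around the per-URL pipeline. The sketch is hedged on several points: `safeExtract` and `emptyRow` are hypothetical names, `extractFn` stands in for whatever scrape-and-enrich function the actor really uses, and a shortened field list replaces the full 17 fields to keep the example small.

```javascript
// Shortened stand-in for the 17-field schema, used only in this sketch.
const SKETCH_FIELDS = ['title', 'destination', 'price', 'booking_url'];

// An empty structured row, optionally carrying the URL that was requested.
function emptyRow(bookingUrl = '') {
  const row = Object.fromEntries(SKETCH_FIELDS.map((field) => [field, '']));
  row.booking_url = bookingUrl;
  return row;
}

function isValidHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false;
  }
}

// Always resolves to a row with the same keys, whatever goes wrong.
async function safeExtract(url, extractFn) {
  if (!isValidHttpUrl(url)) return emptyRow(); // invalid URL: empty structured row
  try {
    const data = await extractFn(url);
    const row = emptyRow(url);
    for (const field of SKETCH_FIELDS) {
      if (data[field] !== undefined && data[field] !== null) row[field] = data[field];
    }
    return row;
  } catch (error) {
    console.warn(`Extraction failed for ${url}: ${error.message}`);
    return emptyRow(url); // scraping failed: empty row that keeps booking_url
  }
}

// A failing extractor still yields a well-formed row.
safeExtract('https://example-travel.com/trip/goa-weekend', async () => {
  throw new Error('page load timed out');
}).then((row) => console.log(Object.keys(row).length, row.booking_url));
```

Because the wrapper never rejects, one bad URL cannot abort the rest of a multi-URL run.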

Performance

  • Handles multiple URLs efficiently
  • Configurable concurrency
  • Processing time capped at under 5 minutes per URL; page loads time out after requestTimeout (default 60000 ms)
  • Test mode for quick validation
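
Configurable concurrency and per-URL timeouts can be illustrated with a small worker pool. This is a generic sketch rather than the actor's implementation, which may delegate both concerns to its crawling library; `mapWithConcurrency` and `withTimeout` are hypothetical names.

```javascript
// Reject if the promise has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run `task` over `items` with at most `limit` tasks in flight at once.
// Results keep the order of the input array. A rejecting task rejects the
// whole call, so tasks should handle their own errors.
async function mapWithConcurrency(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++; // no race: nothing else runs between the check and the increment
      results[index] = await task(items[index], index);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: process six placeholder URLs, two at a time, each with a one-second budget.
const urls = ['a', 'b', 'c', 'd', 'e', 'f'].map((id) => `https://example-travel.com/trip/${id}`);
mapWithConcurrency(urls, 2, (url) =>
  withTimeout(new Promise((resolve) => setTimeout(() => resolve(url.length), 10)), 1000)
).then((lengths) => console.log(lengths.length)); // 6
```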

Local Development

# Install dependencies
npm install
# Run locally
npm start
# Or with Apify CLI
apify run

Deployment to Apify

# Login to Apify
apify login
# Push to Apify
apify push

Cost Estimation

  • Scraping: ~$0.01 per URL (Apify compute)
  • AI Enrichment: ~$0.001-0.01 per URL (depends on model)
    • gpt-4o-mini: Most cost-effective
    • gpt-4o: Higher quality, higher cost
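
As a worked example of these figures, the sketch below multiplies the approximate per-URL costs by the number of URLs. It uses only the rough ranges quoted above and does not include the per-result price shown on the Store listing; `estimateCost` is a hypothetical helper.

```javascript
// Approximate per-URL figures from the list above, in USD.
const SCRAPE_COST_PER_URL = 0.01;
const AI_COST_RANGE_PER_URL = { low: 0.001, high: 0.01 };

// Return a low/high estimate for processing `urlCount` URLs.
function estimateCost(urlCount) {
  return {
    low: urlCount * (SCRAPE_COST_PER_URL + AI_COST_RANGE_PER_URL.low),
    high: urlCount * (SCRAPE_COST_PER_URL + AI_COST_RANGE_PER_URL.high),
  };
}

const { low, high } = estimateCost(200);
console.log(`$${low.toFixed(2)} to $${high.toFixed(2)}`); // $2.20 to $4.00
```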

License

ISC