1"""
2Unified Adapters for Facebook and Amazon Scrapers
3"""
4
5import asyncio
6from typing import List, Dict
7from backend.scrapers.facebook import FacebookScraper
8from backend.scrapers.amazon import AmazonScraper
9
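# Illustrative sketch of the record shape every adapter below returns. The
# field names are taken from the dicts built in this module; the TypedDict is
# documentation only and is not enforced or imported by the adapters.
from typing import Any, TypedDict


class UnifiedProduct(TypedDict):
    source: str                # e.g. 'Facebook', 'Amazon', 'Reddit'
    original_id: str           # scraper-specific id / canonical key
    title: str
    url: str
    price: float
    engagement_metric: float   # ads, sales, upvotes, views, ... per source
    engagement_label: str      # human-readable label for engagement_metric
    opportunity_score: int     # 0-100 heuristic, computed per source
    metadata: Dict[str, Any]   # free-form per-source extras
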
async def scraper_facebook_unified(config: Dict) -> List[Dict]:
    """Adapter for Facebook Scraper"""
    fb_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'facebook'), {})

    scraper = FacebookScraper()

    products = await scraper.scrape(
        query=fb_config.get('query'),
        country=fb_config.get('country', 'ALL'),
        max_ads=fb_config.get('max_ads', 10),
        active_status='active'
    )

    unified_products = []
    for p in products:
        unified_products.append({
            'source': 'Facebook',
            'original_id': p['product_id'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p['price'],
            'engagement_metric': p['evidence']['momentum']['active_ads_count'],
            'engagement_label': 'Active Ads',
            'opportunity_score': 0,
            'metadata': {
                'description': p['evidence']['ad_body'],
                'image_url': p['evidence']['image_url'],
                'cta': p['evidence']['offer']['cta_text']
            }
        })

    print(f"✅ [Facebook] Scraped {len(unified_products)} items", flush=True)
    return unified_products

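# Illustrative helper for the per-scraper config lookup that every adapter in
# this module repeats; shown only as a possible refactor, the adapters below
# keep their inline next(...) lookups unchanged.
def _get_scraper_config(config: Dict, name: str) -> Dict:
    """Return the 'config' block of the scraper entry named `name`, or {}."""
    return next((s['config'] for s in config['scrapers'] if s['name'] == name), {})
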
async def scraper_amazon_unified(config: Dict) -> List[Dict]:
    """Adapter for Amazon Scraper (Enhanced with Reviews)"""
    amz_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'amazon'), {})
    scraper = AmazonScraper()
    queries = amz_config.get('keywords', ['kindle passive income'])
    all_products = []

    for query in queries:
        products = await scraper.scrape(
            query=query,
            limit=amz_config.get('limit', 10),
            scrape_reviews=True
        )
        all_products.extend(products)

    unified_products = []
    for p in all_products:
        unified_products.append({
            'source': 'Amazon',
            'original_id': p['original_id'],
            'title': p['title'],
            'url': p['url'],
            'price': p['price'],
            'engagement_metric': p['engagement_metric'],
            'engagement_label': 'ratings',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                **p['metadata'],
                'pain_signals': p.get('pain_signals', []),
                'review_sentiment': p.get('review_sentiment', 'neutral'),
                'review_count': len(p.get('reviews', []))
            }
        })

    print(f"✅ [Amazon] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_gumroad_unified(config: Dict) -> List[Dict]:
    """Adapter for Gumroad Scraper"""
    from backend.scrapers.gumroad import GumroadScraper

    gr_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'gumroad'), {})
    scraper = GumroadScraper()

    keywords = gr_config.get('keywords', ['notion template'])
    all_products = []

    for kw in keywords:
        products = await scraper.scrape(keyword=kw, max_pages=gr_config.get('max_pages', 1))
        all_products.extend(products)

    unified_products = []
    for p in all_products:
        unified_products.append({
            'source': 'Gumroad',
            'original_id': p['product_id'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p['price'],
            'engagement_metric': 0,
            'engagement_label': 'sales/views',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'creator': p['creator_name'],
                'category': p['category']
            }
        })

    print(f"✅ [Gumroad] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_creative_market_unified(config: Dict) -> List[Dict]:
    """Adapter for Creative Market Scraper"""
    from backend.scrapers.creative_market import CreativeMarketScraper

    cm_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'creative_market'), {})
    scraper = CreativeMarketScraper()

    keywords = cm_config.get('keywords', ['notion template'])
    all_products = []

    for kw in keywords:
        products = await scraper.scrape(keyword=kw, max_pages=cm_config.get('max_pages', 1))
        all_products.extend(products)

    unified_products = []
    for p in all_products:
        unified_products.append({
            'source': 'CreativeMarket',
            'original_id': p['product_id'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p['price'],
            'engagement_metric': p['sales_count'],
            'engagement_label': 'Sales',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'creator': p['creator_name'],
                'category': 'digital'
            }
        })

    print(f"✅ [CreativeMarket] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_envato_unified(config: Dict) -> List[Dict]:
    """Adapter for Envato Scraper"""
    from backend.scrapers.envato import EnvatoScraper

    en_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'envato'), {})
    scraper = EnvatoScraper()

    products = await scraper.scrape(
        marketplace=en_config.get('marketplace', 'codecanyon'),
        category=en_config.get('category', 'javascript'),
        max_pages=en_config.get('max_pages', 1)
    )

    unified_products = []
    for p in products:
        unified_products.append({
            'source': 'Envato',
            'original_id': p['product_id'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p['price'],
            'engagement_metric': p['sales_count'],
            'engagement_label': 'Sales',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'creator': p['author_name'],
                'category': 'software'
            }
        })

    print(f"✅ [Envato] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_instagram_unified(config: Dict) -> List[Dict]:
    """Adapter for Instagram Scraper"""
    from backend.scrapers.instagram import InstagramScraper

    ig_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'instagram'), {})
    scraper = InstagramScraper()

    hashtags = ig_config.get('hashtags', ['business'])
    all_products = []

    for tag in hashtags:
        products = scraper.scrape(hashtag=tag, limit=ig_config.get('limit', 10))
        all_products.extend(products)

    unified_products = []
    for p in all_products:
        unified_products.append({
            'source': 'Instagram',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': p['evidence']['engagement_rate'],
            'engagement_label': 'Eng. Rate',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'description': p['description'],
                'pain_severity': p['evidence']['pain_severity'],
                'image_url': p['product_url']
            }
        })

    print(f"✅ [Instagram] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_youtube_unified(config: Dict) -> List[Dict]:
    """Adapter for YouTube Scraper"""
    from backend.scrapers.youtube import YouTubeScraper

    yt_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'youtube'), {})
    scraper = YouTubeScraper()

    queries = yt_config.get('queries', [None])
    all_products = []

    for q in queries:
        products = scraper.scrape(query=q, limit=yt_config.get('limit', 10))
        all_products.extend(products)

    unified_products = []
    for p in all_products:
        unified_products.append({
            'source': 'YouTube',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': p['evidence']['views'],
            'engagement_label': 'Views',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'description': p['description'],
                'pain_score': p['evidence']['pain_score']
            }
        })

    print(f"✅ [YouTube] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_producthunt_unified(config: Dict) -> List[Dict]:
    """Adapter for Product Hunt Scraper"""
    from backend.scrapers.producthunt import ProductHuntScraper

    ph_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'producthunt'), {})
    scraper = ProductHuntScraper()

    products = scraper.scrape(
        order=ph_config.get('order', 'VOTES'),
        limit=ph_config.get('limit', 10)
    )

    unified_products = []
    for p in products:
        unified_products.append({
            'source': 'ProductHunt',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': p['evidence']['upvotes'],
            'engagement_label': 'Upvotes',
            'opportunity_score': p.get('opportunity_score', 50),
            'metadata': {
                'tagline': p['description'],
                'maker': p['evidence']['maker']
            }
        })

    print(f"✅ [ProductHunt] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_google_trends_unified(config: Dict) -> List[Dict]:
    """Adapter for Google Trends"""
    from backend.scrapers.trends import TrendsScraper

    gt_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'google_trends'), {})
    scraper = TrendsScraper()

    # The trends scraper is queried with a single keyword, so only the first
    # configured keyword is used here.
    results = scraper.scrape(
        keyword=gt_config.get('keywords', [None])[0],
        limit=gt_config.get('limit', 5)
    )

    unified_products = []
    for p in results:
        unified_products.append({
            'source': 'GoogleTrends',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': p['evidence']['current_interest'],
            'engagement_label': 'Interest',
            'opportunity_score': int(p['evidence']['timing_score'] * 10),
            'metadata': p['evidence']
        })

    print(f"✅ [GoogleTrends] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_notion_unified(config: Dict) -> List[Dict]:
    """Adapter for Notion Scraper"""
    from backend.scrapers.notion import NotionScraper

    # The Notion scraper currently takes no per-run configuration.
    scraper = NotionScraper()
    products = await scraper.scrape(max_pages=1)

    unified_products = []
    for p in products:
        unified_products.append({
            'source': 'Notion',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p['price'],
            'engagement_metric': 0,
            'engagement_label': 'N/A',
            'opportunity_score': 50,
            'metadata': {
                'creator': p['creator_name'],
                'price_type': p['evidence']['price_raw']
            }
        })

    print(f"✅ [Notion] Scraped {len(unified_products)} items", flush=True)
    return unified_products

async def scraper_medium_unified(config: Dict) -> List[Dict]:
    """Mock Adapter for Medium (Go)"""
    results = [
        {"title": "How to Build a SaaS in 2026", "url": "https://medium.com/swlh/saas-2026", "claps": 1500},
        {"title": "The Rise of AI Agents", "url": "https://medium.com/ai/agents", "claps": 3200},
        {"title": "Passive Income with Notion", "url": "https://medium.com/startups/notion-income", "claps": 850}
    ]
    query = next((s['config'] for s in config['scrapers'] if s['name'] == 'medium'), {}).get('query', 'tech')

    unified = []
    for i, r in enumerate(results):
        unified.append({
            'source': 'Medium',
            'original_id': f"medium_{i}",
            'title': f"{r['title']} ({query})",
            'url': r['url'],
            'price': 0,
            'engagement_metric': r['claps'],
            'engagement_label': 'Claps',
            'opportunity_score': 70,
            'metadata': {'author': 'MediumWriter'}
        })

    print(f"✅ [Medium] Scraped {len(unified)} items (Simulated Go Wrapper)", flush=True)
    return unified

async def scraper_canva_unified(config: Dict) -> List[Dict]:
    """Mock Adapter for Canva (Go)"""
    results = [
        {"title": "Social Media Kit", "usage": 5000},
        {"title": "Business Presentation", "usage": 12000}
    ]
    unified = []
    for i, r in enumerate(results):
        unified.append({
            'source': 'Canva',
            'original_id': f"canva_{i}",
            'title': r['title'],
            'url': "https://canva.com/templates/example",
            'price': 0,
            'engagement_metric': r['usage'],
            'engagement_label': 'Uses',
            'opportunity_score': 65,
            'metadata': {'format': 'Template'}
        })

    print(f"✅ [Canva] Scraped {len(unified)} items (Simulated Go Wrapper)", flush=True)
    return unified

async def scraper_lemon_squeezy_unified(config: Dict) -> List[Dict]:
    """Mock Adapter for Lemon Squeezy (Go)"""
    unified = [{
        'source': 'LemonSqueezy',
        'original_id': 'ls_1',
        'title': 'SaaS Starter Kit',
        'url': 'https://lemonsqueezy.com/store/example',
        'price': 49.00,
        'engagement_metric': 120,
        'engagement_label': 'Sales',
        'opportunity_score': 80,
        'metadata': {'revenue': '$5000+'}
    }]
    print(f"✅ [Lemon Squeezy] Scraped {len(unified)} items (Simulated Go Wrapper)", flush=True)
    return unified

async def scraper_wp_plugins_unified(config: Dict) -> List[Dict]:
    """Mock Adapter for WP Plugins (Go)"""
    unified = [{
        'source': 'WP Plugins',
        'original_id': 'wp_1',
        'title': 'AI Content Generator for WP',
        'url': 'https://wordpress.org/plugins/ai-gen',
        'price': 0,
        'engagement_metric': 50000,
        'engagement_label': 'Installs',
        'opportunity_score': 85,
        'metadata': {'rating': '4.8/5'}
    }]
    print(f"✅ [WP Plugins] Scraped {len(unified)} items (Simulated Go Wrapper)", flush=True)
    return unified

async def scraper_shopify_apps_unified(config: Dict) -> List[Dict]:
    """Adapter for Shopify App Store"""
    from backend.scrapers.shopify_apps import ShopifyAppsScraper

    sa_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'shopify_apps'), {})
    scraper = ShopifyAppsScraper()

    keyword = sa_config.get('keywords', [None])[0] if sa_config.get('keywords') else None
    products = await scraper.scrape(
        keyword=keyword,
        max_apps=sa_config.get('limit', 10)
    )

    unified_products = []
    for p in products:
        unified_products.append({
            'source': 'ShopifyApps',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p.get('price', 0.0),
            'engagement_metric': p.get('install_count', 0),
            'engagement_label': 'Installs',
            'opportunity_score': min(70 + len(p.get('pain_signals', [])) * 5, 100),
            'metadata': {
                'rating': p.get('rating', 0.0),
                'review_count': p.get('review_count', 0),
                'developer': p.get('developer', 'Unknown'),
                'pain_signals': p.get('pain_signals', [])
            }
        })

    print(f"✅ [Shopify Apps] Scraped {len(unified_products)} items", flush=True)
    return unified_products


async def scraper_shopify_products_unified(config: Dict) -> List[Dict]:
    """Adapter for Shopify Products Scraper (Individual Store Products)"""
    from backend.scrapers.shopify_products import ShopifyProductsScraper

    sp_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'shopify_products'), {})
    scraper = ShopifyProductsScraper()

    stores = sp_config.get('stores', None)
    max_products = sp_config.get('limit', 20)

    products = await scraper.scrape(
        stores=stores,
        max_products_per_store=max_products
    )

    unified_products = []
    for p in products:
        # Heuristic opportunity score, starting from a neutral baseline.
        score = 50

        # Boost digital products.
        if p.get('is_digital', False):
            score += 20

        # Boost discounted items (compare-at price above current price).
        if p.get('compare_at_price') and p.get('compare_at_price') > p.get('price', 0):
            score += 10

        # Boost products priced in the $10-$100 range.
        price = p.get('price', 0)
        if 10 <= price <= 100:
            score += 15

        # Penalize products that are currently unavailable.
        if not p.get('available', True):
            score -= 20

        unified_products.append({
            'source': 'ShopifyProducts',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': p.get('price', 0.0),
            'engagement_metric': p.get('inventory_quantity', 0),
            'engagement_label': 'Inventory',
            'opportunity_score': max(0, min(score, 100)),
            'metadata': {
                'store': p.get('store_domain', ''),
                'is_digital': p.get('is_digital', False),
                'product_type': p.get('product_type', ''),
                'vendor': p.get('vendor', ''),
                'tags': p.get('tags', []),
                'available': p.get('available', True),
                'compare_at_price': p.get('compare_at_price'),
                'images': p.get('images', [])[:3]
            }
        })

    print(f"✅ [Shopify Products] Scraped {len(unified_products)} items", flush=True)
    return unified_products


async def scraper_linkedin_unified(config: Dict) -> List[Dict]:
    """Adapter for LinkedIn Scraper (B2B Pain Points)"""
    from backend.scrapers.linkedin_search import LinkedInSearchScraper

    li_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'linkedin'), {})
    scraper = LinkedInSearchScraper()

    query = li_config.get('query', None)
    limit = li_config.get('limit', 10)

    posts = await scraper.scrape(query=query, limit=limit)

    unified_products = []
    for p in posts:
        # Heuristic opportunity score with a B2B-leaning baseline.
        score = 60

        # Boost posts that mention B2B/SaaS vocabulary.
        desc_lower = p.get('description', '').lower()
        b2b_keywords = ['saas', 'b2b', 'enterprise', 'crm', 'tool', 'platform']
        if any(kw in desc_lower for kw in b2b_keywords):
            score += 15

        # Boost posts that contain explicit pain language.
        pain_keywords = ['struggling', 'frustrated', 'need', 'looking for']
        if any(kw in desc_lower for kw in pain_keywords):
            score += 20

        unified_products.append({
            'source': 'LinkedIn',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': 0,
            'engagement_label': 'Professional Network',
            'opportunity_score': min(score, 100),
            'metadata': {
                'description': p.get('description', ''),
                'platform': 'linkedin',
                'type': 'b2b_pain_point'
            }
        })

    print(f"✅ [LinkedIn] Scraped {len(unified_products)} items", flush=True)
    return unified_products


async def scraper_reddit_unified(config: Dict) -> List[Dict]:
    """Adapter for Reddit Scraper (Community Pain Points)"""
    from backend.scrapers.reddit import RedditScraper

    reddit_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'reddit'), {})
    scraper = RedditScraper()

    subreddits = reddit_config.get('subreddits', None)
    limit = reddit_config.get('limit', 10)
    time_filter = reddit_config.get('time_filter', 'week')

    posts = scraper.scrape(subreddits=subreddits, limit=limit, time_filter=time_filter)

    unified_products = []
    for p in posts:
        evidence = p.get('evidence', {})
        pain_severity = evidence.get('pain_severity', 0)
        validation_score = evidence.get('validation_score', 0)
        monetization_potential = evidence.get('monetization_potential', 0)

        # Weighted combination of the evidence signals, capped at 100.
        opportunity_score = min(
            (pain_severity * 3) + (validation_score * 2) + (monetization_potential * 5),
            100
        )

        unified_products.append({
            'source': 'Reddit',
            'original_id': p['canonical_key'],
            'title': p['product_name'],
            'url': p['product_url'],
            'price': 0,
            'engagement_metric': evidence.get('upvotes', 0),
            'engagement_label': 'Upvotes',
            'opportunity_score': int(opportunity_score),
            'metadata': {
                'description': p.get('description', ''),
                'subreddit': evidence.get('subreddit', ''),
                'comments': evidence.get('comments', 0),
                'pain_severity': pain_severity,
                'validation_score': validation_score,
                'monetization_potential': monetization_potential
            }
        })

    print(f"✅ [Reddit] Scraped {len(unified_products)} items", flush=True)
    return unified_products


async def scraper_twitter_unified(config: Dict) -> List[Dict]:
    """Adapter for Twitter/X Scraper (Real-time Pain Signals)"""
    from backend.scrapers.twitter import TwitterScraper

    twitter_config = next((s['config'] for s in config['scrapers'] if s['name'] == 'twitter'), {})
    scraper = TwitterScraper()

    query = twitter_config.get('query', None)
    limit = twitter_config.get('limit', 10)

    tweets = await scraper.scrape(query=query, limit=limit)

    unified_products = []
    for t in tweets:
        evidence = t.get('evidence', {})
        pain_count = evidence.get('pain_count', 0)

        # Heuristic opportunity score starting from a neutral baseline.
        score = 50

        # Up to +40 for detected pain signals.
        score += min(pain_count * 15, 40)

        # Small boost for language that suggests the topic is spreading.
        desc_lower = t.get('description', '').lower()
        trending_keywords = ['viral', 'trending', 'everyone', 'thousands']
        if any(kw in desc_lower for kw in trending_keywords):
            score += 10

        unified_products.append({
            'source': 'Twitter',
            'original_id': t['canonical_key'],
            'title': t['product_name'],
            'url': t['product_url'],
            'price': 0,
            'engagement_metric': pain_count,
            'engagement_label': 'Pain Signals',
            'opportunity_score': min(score, 100),
            'metadata': {
                'description': t.get('description', ''),
                'pain_signals': t.get('pain_signals', []),
                'platform': 'twitter'
            }
        })

    print(f"✅ [Twitter] Scraped {len(unified_products)} items", flush=True)
    return unified_products
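

# ---------------------------------------------------------------------------
# Illustrative wiring sketch (an assumption, not part of the pipeline): maps
# the scraper names used in the config lookups above to their adapters and
# runs two of the mock adapters concurrently. Keys for adapters that never
# read the config ('canva', 'lemon_squeezy', 'wp_plugins', 'notion') are
# guesses at the names an orchestrator might use.
# ---------------------------------------------------------------------------
SCRAPER_ADAPTERS = {
    'facebook': scraper_facebook_unified,
    'amazon': scraper_amazon_unified,
    'gumroad': scraper_gumroad_unified,
    'creative_market': scraper_creative_market_unified,
    'envato': scraper_envato_unified,
    'instagram': scraper_instagram_unified,
    'youtube': scraper_youtube_unified,
    'producthunt': scraper_producthunt_unified,
    'google_trends': scraper_google_trends_unified,
    'notion': scraper_notion_unified,
    'medium': scraper_medium_unified,
    'canva': scraper_canva_unified,
    'lemon_squeezy': scraper_lemon_squeezy_unified,
    'wp_plugins': scraper_wp_plugins_unified,
    'shopify_apps': scraper_shopify_apps_unified,
    'shopify_products': scraper_shopify_products_unified,
    'linkedin': scraper_linkedin_unified,
    'reddit': scraper_reddit_unified,
    'twitter': scraper_twitter_unified,
}


if __name__ == "__main__":
    # Minimal smoke test using only the mock (simulated Go) adapters so that
    # nothing hits the network; 'scrapers' mirrors the list-of-dicts shape
    # every adapter expects from `config`.
    sample_config = {
        'scrapers': [
            {'name': 'medium', 'config': {'query': 'tech'}},
            {'name': 'canva', 'config': {}},
        ]
    }

    async def _demo() -> None:
        batches = await asyncio.gather(
            SCRAPER_ADAPTERS['medium'](sample_config),
            SCRAPER_ADAPTERS['canva'](sample_config),
        )
        for batch in batches:
            for item in batch:
                print(f"{item['source']}: {item['title']} ({item['engagement_metric']} {item['engagement_label']})")

    asyncio.run(_demo())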