Albertsons Crawler for Foodgraph

A professional-grade web scraper built with Crawlee and Playwright for extracting product data from Albertsons.com. This scraper is designed to meet Foodgraph's specific requirements for grocery product data collection.

Features

  • Full Category Coverage: Scrapes all specified product categories with inclusion/exclusion rules
  • Browser-based Scraping: Uses real browser automation for reliable data extraction
  • API Interception: Captures and reuses Albertsons API calls for efficient data collection (see the sketch after this list)
  • Session Management: Automatic session refresh and token management
  • GTIN/UPC Validation: Ensures all products have valid GTIN/UPC codes
  • Structured Output: Produces data in Foodgraph's required format with rid, sourcePdpUrl, and product fields
  • Proxy Support: Compatible with Bright Data, Apify Proxy, and custom proxy solutions
  • Health Monitoring: Built-in health checker for daily validation
  • Error Handling: Robust retry logic and exponential backoff
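
For illustration, a minimal TypeScript sketch of the interception pattern behind the API Interception feature, using Playwright's response listener. The '/abs/pub/xapi/' path filter is an assumption for illustration, not the Actor's actual endpoint list:

import { firefox } from 'playwright';

// Sketch only: collects JSON payloads from responses whose URL matches an
// assumed product-API path while a category page loads.
async function captureProductApiPayloads(categoryUrl: string): Promise<unknown[]> {
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();
  const payloads: unknown[] = [];

  page.on('response', async (response) => {
    // '/abs/pub/xapi/' is a placeholder pattern; inspect network traffic for the real one.
    if (response.ok() && response.url().includes('/abs/pub/xapi/')) {
      try {
        payloads.push(await response.json());
      } catch {
        // Ignore non-JSON responses.
      }
    }
  });

  await page.goto(categoryUrl, { waitUntil: 'networkidle' });
  await browser.close();
  return payloads;
}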

Quick Start

  1. Installation

     $ npm install

  2. Basic Usage

     $ npm start

  3. Development Mode

     $ npm run start:dev

Configuration

Default Categories (Foodgraph Test Project)

The scraper is pre-configured with the exact categories specified in the Foodgraph RFP:

Include All Categories:

  • Beverages
  • Breakfast & Cereal
  • Canned Goods & Soups
  • Condiments, Spice & Bake
  • Cookies, Snacks & Candy
  • Dairy, Eggs & Cheese
  • Frozen Foods
  • Fruits & Vegetables
  • Grains, Pasta & Sides
  • International Cuisine
  • Meat & Seafood

Include Specific Subcategories Only:

  • Baby Care → Formula & Baby Food only
  • Wine, Beer & Spirits → Non-Alcoholic Beer and Cocktail Mixes only

Exclude Specific Subcategories:

  • Bread & Bakery → Exclude Bakery Beverages & Snacks, Bakery Catering Trays
  • Deli → Exclude Deli Bar & Food Service, Deli Sandwiches and Wraps, Sushi

Input Parameters

{
  "startUrls": ["https://www.albertsons.com/shop/aisles/beverages.html"],
  "storeIds": [177, 154, 1680],
  "maxRequestsPerCrawl": 1000,
  "headless": true
}
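
If you run the Actor programmatically rather than from the Apify Console, the same input can be passed through the apify-client package. A minimal sketch; the Actor ID below is a placeholder, so use the one shown on this Actor's page:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// 'getdataforme/albertsons-product-scraper' is a guessed Actor ID; check the Actor page.
const run = await client.actor('getdataforme/albertsons-product-scraper').call({
  startUrls: ['https://www.albertsons.com/shop/aisles/beverages.html'],
  storeIds: [177, 154, 1680],
  maxRequestsPerCrawl: 1000,
  headless: true,
});

// Read the scraped products from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} products`);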

Proxy Configuration

Bright Data (Recommended):

{
  "proxyConfiguration": {
    "proxyUrls": ["wss://brd-customer-<CUSTOMER_ID>-zone-<ZONE_NAME>:<PASSWORD>@brd.superproxy.io:9222"]
  }
}

Apify Proxy:

{
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
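
Inside an Actor, both input shapes are typically resolved through the Apify SDK. A minimal sketch, assuming the standard Actor.createProxyConfiguration() path (this project's real wiring lives in src/main.ts):

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

interface Input {
  startUrls?: string[];
  proxyConfiguration?: { useApifyProxy?: boolean; proxyUrls?: string[] };
}
const input = (await Actor.getInput<Input>()) ?? {};

// createProxyConfiguration accepts both the { useApifyProxy: true } and
// { proxyUrls: [...] } shapes shown above.
const proxyConfiguration = await Actor.createProxyConfiguration(input.proxyConfiguration);

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  requestHandler: async ({ page }) => {
    // Extraction logic omitted; in this project it lives in src/routes.ts.
  },
});

await crawler.run(input.startUrls ?? []);
await Actor.exit();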

Output Format

The scraper produces data in the exact format required by Foodgraph:

{
  "rid": "550e8400-e29b-41d4-a716-446655440000",
  "sourcePdpUrl": "https://www.albertsons.com/product-detail/...",
  "product": {
    "fullCategoryTaxonomy": ["Beverages", "Water & Sparkling Water"],
    "id": "123456",
    "name": "Product Name",
    "upc": "123456789012",
    "brand": "Brand Name",
    "ingredients": "...",
    "nutrition": {...},
    "images": ["https://..."]
  }
}
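
For consumers of the dataset, the record can be described with a TypeScript interface. The field types below are inferred from the example above; the nutrition shape is elided in the sample, so it is left loosely typed:

// Type sketch inferred from the sample record; not an official schema.
interface FoodgraphRecord {
  rid: string;                          // unique record id (UUID)
  sourcePdpUrl: string;                 // product detail page URL
  product: {
    fullCategoryTaxonomy: string[];     // e.g. ["Beverages", "Water & Sparkling Water"]
    id: string;
    name: string;
    upc: string;
    brand: string;
    ingredients: string;
    nutrition: Record<string, unknown>; // raw payload; shape not shown in the sample
    images: string[];
  };
}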

Key Requirements Compliance

✅ Technology Stack

  • JavaScript: ✓ Built with Node.js and TypeScript
  • Playwright: ✓ Browser automation with Firefox support
  • Crawlee: ✓ Built on the current 3.x framework

✅ Scraping Approach

  • API First: ✓ Intercepts and uses Albertsons internal APIs
  • Browser Fallback: ✓ Uses browser automation when needed
  • Session Management: ✓ Handles token refresh and session expiry

✅ Data Requirements

  • Raw Data: ✓ No transformations, preserves original structure
  • Required Fields: ✓ Includes rid, sourcePdpUrl, product, fullCategoryTaxonomy
  • GTIN/UPC: ✓ Validates presence of product identifiers (a check-digit sketch follows this list)
  • No Deduplication: ✓ Captures all product instances
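
The README does not show how identifier validation is implemented; for reference, the standard GS1 check-digit test for 12-digit UPC-A codes looks like this minimal sketch:

// Sketch only: validates a 12-digit UPC-A code via the GS1 check-digit rule
// (odd positions weighted 3, even positions weighted 1, total must end in 0).
function isValidUpcA(upc: string): boolean {
  if (!/^\d{12}$/.test(upc)) return false;
  const digits = upc.split('').map(Number);
  const sum = digits
    .slice(0, 11)
    .reduce((acc, d, i) => acc + d * (i % 2 === 0 ? 3 : 1), 0);
  const checkDigit = (10 - (sum % 10)) % 10;
  return checkDigit === digits[11];
}

// Example: isValidUpcA('036000291452') === true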

✅ Exclusions Implemented

  • Reviews and ratings
  • Pickup/delivery options
  • Price and promotions (not required by Foodgraph, though captured incidentally)
  • Related/similar products
  • Marketplace sellers

✅ Category Management

  • Full inclusion/exclusion rule support
  • Configurable category targeting
  • Automatic subcategory discovery

Health Monitoring

Run health check manually:

$ npm run healthcheck

The health checker validates:

  • Category page navigation
  • API connection functionality
  • Product data extraction
  • GTIN validation
  • Proxy connectivity
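
As an illustration of the first item, a category-page check could look like this minimal sketch. The product-card selector is an assumption for illustration, not the Actor's actual code:

import { firefox } from 'playwright';

// Sketch only: opens a category page and confirms that at least one product
// tile rendered. '[data-testid="product-card"]' is a guessed selector.
export async function checkCategoryNavigation(url: string): Promise<boolean> {
  const browser = await firefox.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const productCount = await page.locator('[data-testid="product-card"]').count();
    return productCount > 0;
  } finally {
    await browser.close();
  }
}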

Development

Project Structure

src/
├── main.ts # Main entry point
├── routes.ts # Request routing logic
├── categories.ts # Category configuration
├── types.ts # TypeScript definitions
├── utils.ts # Utility functions
└── healthcheck.ts # Health monitoring

Adding New Categories

Update src/categories.ts:

export const DEFAULT_CATEGORY_CONFIG = {
  includeAll: [
    'https://www.albertsons.com/shop/aisles/new-category.html'
  ]
};
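
The rules described under Configuration also involve subcategory-only inclusion and exclusion. The fuller sketch below is an assumption about the config shape: the includeOnly and exclude field names and the category slugs are illustrative, so confirm the actual structure in src/categories.ts.

export const DEFAULT_CATEGORY_CONFIG = {
  includeAll: [
    'https://www.albertsons.com/shop/aisles/new-category.html',
  ],
  // Hypothetical shape: crawl only the listed subcategories.
  includeOnly: {
    'baby-care': ['formula-baby-food'],
  },
  // Hypothetical shape: skip the listed subcategories.
  exclude: {
    'bread-bakery': ['bakery-beverages-snacks', 'bakery-catering-trays'],
  },
};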

Debugging

Enable debug mode:

{
  "debugMode": true,
  "headless": false
}

Production Deployment

Apify Platform

  1. Upload project to Apify
  2. Configure input schema
  3. Set up scheduling (every 4-6 weeks)
  4. Monitor via health checker

Environment Variables

BRIGHT_DATA_ENDPOINT=wss://brd-customer-...
APIFY_PROXY_PASSWORD=your-password

Performance

  • Concurrency: Default 1 (recommended for stability)
  • Request Rate: ~2-3 seconds between requests
  • Session Lifetime: ~100 requests per session
  • Error Recovery: 3 retries with exponential backoff (see the sketch below)
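
For reference, a generic retry helper matching those numbers might look like this minimal sketch; the Actor's actual implementation is internal:

// Sketch only: retries a failing async operation with exponential backoff.
// maxRetries = 3 gives one initial attempt plus three retries, per the list above.
async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Wait 1s, 2s, 4s, ... between attempts.
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}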

Troubleshooting

Common Issues

No products found:

  • Check store ID validity (try 177, 154, 1680)
  • Verify category URLs are accessible
  • Check if session tokens are being captured

Session expired errors:

  • Automatic session refresh is implemented
  • Monitor for rate limiting (429 errors)
  • Consider reducing concurrency

Proxy issues:

  • Verify Bright Data credentials
  • Test connection with health checker
  • Check proxy endpoint accessibility

Support

For technical issues:

  1. Check health checker output
  2. Review error logs in Actor platform
  3. Verify category URLs are current
  4. Test with single category first

License

This scraper is designed for legitimate business use in compliance with website terms of service and applicable laws.