# Web2json Agent

Pricing: Pay per usage
Developer: 国强 杨
## 📖 What is web2json-agent?
An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples — no manual XPath/CSS selector writing required.
## 📋 Demo
https://github.com/user-attachments/assets/c82e8e13-fc42-4d1f-a81a-4cec6e3f434b
## 📊 SWDE Benchmark Results

The SWDE dataset covers 8 vertical fields, 80 websites, and 124,291 pages.
## 🚀 Quick Start
### Install via pip

```bash
# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup
```
### Install for Developers

```bash
# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup
```
## 📚 Complete User Guide
For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:
docs/Web2JsonAgent%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97.md
This guide includes:
- Detailed installation steps
- Configuration methods (interactive wizard, config file, environment variables)
- Layout clustering for mixed HTML types
- Complete API examples and use cases
- FAQ and troubleshooting
## 🐍 API Usage

Web2JSON provides five simple APIs that can be called directly from Python, making it easy to integrate with databases, web services, and real-time pipelines.
### API 1: `extract_data` - Complete Workflow
Extract structured data from HTML in one step (schema + parser + data).
> ⚠️ **Important**: The `extract_data` API assumes all HTML files in the input directory share the same layout type. If your HTML files have different layouts (e.g., list pages vs. detail pages), use `classify_html_dir` first to group them by layout similarity. See ./demo.py for a complete example.
**Auto Mode** - Let AI automatically discover and extract fields:
```python
from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",          # Custom output directory (default: "output")
)
result = extract_data(config)

# Results are always returned in memory
print(result.final_schema)    # Dict: extracted schema
print(result.parser_code)     # str: generated parser code
print(result.parsed_data[0])  # List[Dict]: parsed JSON data
```
**Predefined Mode** - Extract only specific fields:
```python
from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string",
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",          # Custom output directory
)
result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
```
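If you want to persist the returned records yourself instead of using `save`, the standard library is enough. A minimal sketch, assuming `parsed_data` is a list of plain dicts shaped like the schema above (the sample records and file name here are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

def save_jsonl(records, path):
    """Write a list of dicts to a JSON Lines file, one record per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hand-written records standing in for result.parsed_data
records = [
    {"title": "Hello", "author": "Ann", "date": "2024-01-01", "content": "..."},
    {"title": "World", "author": "Bob", "date": "2024-01-02", "content": "..."},
]
out = Path(tempfile.mkdtemp()) / "articles.jsonl"
save_jsonl(records, out)
assert load_jsonl(out) == records
```

JSON Lines keeps each record independent, which is convenient for streaming into databases or downstream pipelines.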
### API 2: `extract_schema` - Extract Schema Only
Generate a JSON schema describing the data structure in HTML.
```python
from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],          # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)
result = extract_schema(config)
print(result.final_schema)          # Dict: final schema
print(result.intermediate_schemas)  # List[Dict]: iteration history
```
### API 3: `infer_code` - Generate Parser Code
Generate parser code from a schema (Dict or from previous step).
```python
from web2json import Web2JsonConfig, infer_code

# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string",
}
config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],            # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)
result = infer_code(config)
print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)       # Dict: schema used
```
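Because `parser_code` comes back as a plain string, you can also write it to disk yourself and later reference the saved file by path (API 4 takes a path to a parser `.py` file). A minimal sketch using only the standard library; the directory layout and file naming are arbitrary choices for illustration, not something the library requires:

```python
import tempfile
from pathlib import Path

def save_parser(parser_code: str, out_dir: str, name: str) -> Path:
    """Persist generated parser source so it can be reused by file path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}_parser.py"
    path.write_text(parser_code, encoding="utf-8")
    return path

# e.g. save_parser(result.parser_code, "parsers", "my_parser")
path = save_parser("# generated parser placeholder\n", tempfile.mkdtemp(), "demo")
print(path.name)  # demo_parser.py
```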
### API 4: `extract_data_with_code` - Parse with Code
Use parser code to extract data from HTML files.
```python
from web2json import Web2JsonConfig, extract_data_with_code

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],                  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)
result = extract_data_with_code(config)
print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
```
### API 5: `classify_html_dir` - Classify HTML by Layout
Group HTML files by layout similarity (for mixed-layout datasets).
```python
from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],          # Save cluster report and copy files to subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)
result = classify_html_dir(config)
print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")
for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f"  - {file}")
```
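For intuition, layout clustering can be pictured as grouping pages whose tag structure matches. The toy sketch below groups documents by their exact opening-tag sequence; it is not web2json's actual algorithm, just an illustration of the idea:

```python
import re
from collections import defaultdict

def tag_signature(html: str) -> tuple:
    """Reduce a document to its sequence of opening tag names."""
    return tuple(re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html))

def group_by_layout(docs: dict) -> dict:
    """Group documents whose tag sequences are identical."""
    clusters = defaultdict(list)
    for name, html in docs.items():
        clusters[tag_signature(html)].append(name)
    return dict(clusters)

docs = {
    "list_1.html": "<html><body><ul><li>a</li><li>b</li></ul></body></html>",
    "list_2.html": "<html><body><ul><li>x</li><li>y</li></ul></body></html>",
    "detail.html": "<html><body><h1>t</h1><p>text</p></body></html>",
}
clusters = group_by_layout(docs)
print(len(clusters))  # 2: the two list pages group together, the detail page stands alone
```

A real clusterer tolerates near-matches rather than requiring identical signatures, but the grouping principle is the same.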
## Configuration Reference

`Web2JsonConfig` parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | Required | Project name (for identification) |
| `html_path` | str | Required | HTML directory or file path |
| `output_path` | str | `"output"` | Output directory (used when `save` is specified) |
| `iteration_rounds` | int | 3 | Number of samples for learning |
| `schema` | Dict | None | Predefined schema (None = auto mode) |
| `enable_schema_edit` | bool | False | Enable manual schema editing |
| `parser_code` | str | None | Parser code (for `extract_data_with_code`) |
| `save` | List[str] | None | Items to save locally (e.g., `['schema', 'code', 'data']`); None = memory only |
Standalone API parameters:

| API | Parameters | Returns |
|---|---|---|
| `extract_data` | `config: Web2JsonConfig` | `ExtractDataResult` |
| `extract_schema` | `config: Web2JsonConfig` | `ExtractSchemaResult` |
| `infer_code` | `config: Web2JsonConfig` | `InferCodeResult` |
| `extract_data_with_code` | `config: Web2JsonConfig` | `ParseResult` |
| `classify_html_dir` | `config: Web2JsonConfig` | `ClusterResult` |
All result objects provide:

- Direct access to data via object attributes
- `.to_dict()` method for serialization
- `.get_summary()` method for quick stats
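As a mental model, a result object with this interface might look like the sketch below; the class and field names are illustrative stand-ins, not the library's real definitions:

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class SketchResult:
    """Illustrative stand-in for a Web2JSON result object (not the real class)."""
    final_schema: Dict
    parsed_data: List[Dict] = field(default_factory=list)

    def to_dict(self) -> Dict:
        # Serialize all attributes to a plain dict
        return asdict(self)

    def get_summary(self) -> Dict:
        # Quick stats without dumping the full payload
        return {"fields": len(self.final_schema), "records": len(self.parsed_data)}

r = SketchResult(final_schema={"title": "string"}, parsed_data=[{"title": "Hi"}])
print(r.get_summary())  # {'fields': 1, 'records': 1}
```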
### Which API Should I Use?

```python
# Need data immediately? → extract_data
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)
# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema,
)
code_result = infer_code(config)
# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code,
)
data_result = extract_data_with_code(config)

# Have parser code, need to parse more files? → extract_data_with_code
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code,
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)
```
## 📄 License
Apache-2.0 License