Web2json Agent

Pricing: Pay per usage
Developer: 国强 杨 (Maintained by Community)
Last modified: 2 days ago

📖 What is web2json-agent?

An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples — no manual XPath/CSS selector writing required.
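As a concrete picture of the output, the agent infers a schema (field names mapped to types) and returns parsed records matching it. The snippet below is purely illustrative; every field name and value is made up, not real agent output:

```python
# Illustrative only: a schema the agent might infer from blog pages
# and one matching parsed record (values are invented for this example).
schema = {"title": "string", "author": "string", "date": "string"}
record = {"title": "Hello World", "author": "Ann", "date": "2024-01-01"}

# Each parsed record carries exactly the fields declared in the schema.
assert set(record) == set(schema)
```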


📋 Demo

https://github.com/user-attachments/assets/c82e8e13-fc42-4d1f-a81a-4cec6e3f434b


📊 SWDE Benchmark Results

The SWDE dataset covers 8 verticals, 80 websites, and 124,291 pages.


🚀 Quick Start

Install via pip

```shell
# 1. Install the package
pip install web2json-agent

# 2. Initialize configuration
web2json setup
```

Install for Developers

```shell
# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup
```

📚 Complete User Guide

For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:

docs/Web2JsonAgent%E4%BD%BF%E7%94%A8%E6%8C%87%E5%8D%97.md (the path decodes to "Web2JsonAgent使用指南.md", i.e., the user guide, written in Chinese)

This guide includes:

  • Detailed installation steps
  • Configuration methods (interactive wizard, config file, environment variables)
  • Layout clustering for mixed HTML types
  • Complete API examples and use cases
  • FAQ and troubleshooting

🐍 API Usage

Web2JSON provides five simple APIs, making it easy to feed databases, downstream APIs, and real-time processing pipelines.

API 1: extract_data - Complete Workflow

Extract structured data from HTML in one step (schema + parser + data).

⚠️ Important: The extract_data API assumes all HTML files in the input directory have the same layout type. If your HTML files have different layouts (e.g., list pages vs detail pages), use classify_html_dir first to group them by layout similarity. See ./demo.py for a complete example.

Auto Mode - Let AI automatically discover and extract fields:

```python
from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory (default: "output")
)
result = extract_data(config)

# Results are always returned in memory
print(result.final_schema)    # Dict: extracted schema
print(result.parser_code)     # str: generated parser code
print(result.parsed_data[0])  # Dict: first parsed record (parsed_data is List[Dict])
```

Predefined Mode - Extract only specific fields:

```python
from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string",
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory
)
result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
```

API 2: extract_schema - Extract Schema Only

Generate a JSON schema describing the data structure in HTML.

```python
from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],  # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)
result = extract_schema(config)
print(result.final_schema)          # Dict: final schema
print(result.intermediate_schemas)  # List[Dict]: iteration history
```

API 3: infer_code - Generate Parser Code

Generate parser code from a schema (Dict or from previous step).

```python
from web2json import Web2JsonConfig, infer_code

# Use a schema from the previous step or define one manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string",
}
config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],  # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)
result = infer_code(config)
print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)       # Dict: schema used
```

API 4: extract_data_with_code - Parse with Code

Use parser code to extract data from HTML files.

```python
from web2json import Web2JsonConfig, extract_data_with_code

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)
result = extract_data_with_code(config)
print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
```
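For readers curious how a generated parser file can be applied outside the library, here is a stdlib-only sketch of the load-and-run step. The `parse(html)` entry point and the regex-based body below are assumptions for illustration; the agent actually generates BeautifulSoup-based parsers, and the real entry point may differ.

```python
# Hypothetical sketch: load parser code from a string (it could equally be
# read from a saved .py file) and apply it to new HTML.
parser_code = '''
import re

def parse(html):
    # A trivial stand-in parser: pull <h1> as title, <p> as content.
    title = re.search(r"<h1>(.*?)</h1>", html, re.S)
    content = re.search(r"<p>(.*?)</p>", html, re.S)
    return {
        "title": title.group(1) if title else None,
        "content": content.group(1) if content else None,
    }
'''

namespace = {}
exec(parser_code, namespace)  # load the generated code into a fresh namespace
parse = namespace["parse"]

html = "<html><h1>Hello</h1><p>World</p></html>"
print(parse(html))  # {'title': 'Hello', 'content': 'World'}
```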

API 5: classify_html_dir - Classify HTML by Layout

Group HTML files by layout similarity (for mixed-layout datasets).

```python
from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],  # Save cluster report and copy files into subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)
result = classify_html_dir(config)
print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")
for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f"  - {file}")
```
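To build intuition for what "layout similarity" means, the toy sketch below clusters pages by their tag sequences using only the standard library. This is not web2json-agent's actual algorithm; the greedy strategy and the 0.8 threshold are assumptions chosen for illustration.

```python
# Toy layout clustering: represent each page by its sequence of start tags
# and group pages whose sequences are similar.
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the sequence of start tags in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_signature(html: str) -> list:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags

def cluster_by_layout(pages: dict, threshold: float = 0.8) -> list:
    # Greedy clustering: a page joins the first cluster whose
    # representative has a sufficiently similar tag sequence.
    clusters = []  # each item: (representative_signature, [filenames])
    for name, html in pages.items():
        sig = tag_signature(html)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, sig).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((sig, [name]))
    return [members for _, members in clusters]

pages = {
    "list1.html": "<ul><li>a</li><li>b</li></ul>",
    "list2.html": "<ul><li>x</li><li>y</li></ul>",
    "detail.html": "<article><h1>t</h1><p>body</p></article>",
}
print(cluster_by_layout(pages))  # [['list1.html', 'list2.html'], ['detail.html']]
```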

Configuration Reference

Web2JsonConfig Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | Required | Project name (for identification) |
| `html_path` | `str` | Required | HTML directory or file path |
| `output_path` | `str` | `"output"` | Output directory (used when `save` is specified) |
| `iteration_rounds` | `int` | `3` | Number of samples used for learning |
| `schema` | `Dict` | `None` | Predefined schema (`None` = auto mode) |
| `enable_schema_edit` | `bool` | `False` | Enable manual schema editing |
| `parser_code` | `str` | `None` | Parser code (for `extract_data_with_code`) |
| `save` | `List[str]` | `None` | Items to save locally (e.g., `['schema', 'code', 'data']`); `None` = memory only |

Standalone API Parameters:

| API | Parameters | Returns |
|---|---|---|
| `extract_data` | `config: Web2JsonConfig` | `ExtractDataResult` |
| `extract_schema` | `config: Web2JsonConfig` | `ExtractSchemaResult` |
| `infer_code` | `config: Web2JsonConfig` | `InferCodeResult` |
| `extract_data_with_code` | `config: Web2JsonConfig` | `ParseResult` |
| `classify_html_dir` | `config: Web2JsonConfig` | `ClusterResult` |

All result objects provide:

  • Direct access to data via object attributes
  • .to_dict() method for serialization
  • .get_summary() method for quick stats
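As a sketch of what such a result object might look like, here is a hypothetical `ExtractDataResult` with the two documented helpers. The field names follow this README, but the real classes in `web2json` may differ in attributes and details:

```python
# Hypothetical result object with the documented to_dict() / get_summary()
# helpers; not the library's actual class definition.
from dataclasses import dataclass, field

@dataclass
class ExtractDataResult:
    final_schema: dict
    parser_code: str
    parsed_data: list = field(default_factory=list)

    def to_dict(self) -> dict:
        # Serialize every attribute, e.g., for JSON storage or transport.
        return {
            "final_schema": self.final_schema,
            "parser_code": self.parser_code,
            "parsed_data": self.parsed_data,
        }

    def get_summary(self) -> dict:
        # Quick stats: number of schema fields and parsed records.
        return {
            "fields": len(self.final_schema),
            "records": len(self.parsed_data),
        }

result = ExtractDataResult(
    final_schema={"title": "string", "author": "string"},
    parser_code="def parse(html): ...",
    parsed_data=[{"title": "Hello", "author": "Ann"}],
)
print(result.get_summary())  # {'fields': 2, 'records': 1}
```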

Which API Should I Use?

```python
from web2json import (
    Web2JsonConfig,
    classify_html_dir,
    extract_data,
    extract_data_with_code,
    extract_schema,
    infer_code,
)

# Need data immediately? → extract_data
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit the schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)

# Edit the schema if needed, then generate parser code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema,
)
code_result = infer_code(config)

# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code,
)
data_result = extract_data_with_code(config)

# Already have parser code and need to parse more files? → extract_data_with_code
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code,
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)
```

📄 License

Apache-2.0 License