
LAP Benchmark

View Full Benchmark Report – 500 runs, 50 specs, 5 formats

Measures how well LAP-compressed API documentation performs compared to original formats when given to AI coding agents.

Benchmark Matrix

Documentation Tiers

Tier            Description
None            No documentation provided (prior-knowledge baseline)
Pretty          Original spec, properly formatted
Minified        Whitespace and comments stripped
LAP Standard    Full LAP format with descriptions
LAP Lean        LAP format, types only (maximum compression)
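As a rough illustration of the Minified tier, stripping layout whitespace from a JSON spec takes only the standard library. This is a sketch, not harness code: `minify_spec` is a hypothetical helper, and YAML or Protobuf sources would need format-specific handling to also drop comments.

```python
import json

def minify_spec(pretty_text: str) -> str:
    # Parse and re-serialize with no indentation or separator spaces,
    # which removes all layout whitespace. (JSON itself has no comments;
    # comment stripping only applies to formats like YAML or .proto.)
    return json.dumps(json.loads(pretty_text), separators=(",", ":"))

pretty = json.dumps(
    {"openapi": "3.0.0", "paths": {"/users": {"get": {"summary": "List users"}}}},
    indent=2,
)
minified = minify_spec(pretty)
assert len(minified) < len(pretty)  # same content, fewer tokens
```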

Spec Coverage

OpenAPI (10)

Figma, Stripe, Twilio, GitHub REST, DigitalOcean, Slack, Spotify, Box, Plaid, Resend

AsyncAPI (10)

Streetlights, Slack RTM, Adeo Kafka, Social Media, Gitter Streaming, Gemini WebSocket, Kraken WebSocket, Correlation ID, Operation Security, RPC Server

GraphQL (10)

GitHub, SWAPI, Yelp, Shopify, Artsy, Linear, Saleor, Elasticsearch, Coral, Unraid

Postman (10)

Twilio, Postman Echo, Adobe, SAP, Stripe, Azure DevOps, Auth0, Braintree, InfluxDB, Akeneo

Protobuf / gRPC (10)

Google Storage, Pub/Sub, Vision, Data Catalog, Translate, Spanner, Firestore, Talent, Language, Billing

Project Structure

registry/       Spec definitions + task manifests with ground truth
sources/        Raw specs fetched from GitHub
compiled/       200 doc variants (50 specs x 4 tiers)
harness/        Benchmark runner, executor, scorer
prompts/        Agent prompt template
scripts/        Compilation, validation, analysis
results/        Benchmark results and reports
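The 200 variants in compiled/ follow from the matrix above: each spec is compiled once per documentation tier, and the None tier ships no doc file. A quick sanity check of that arithmetic (the spec and tier names below are placeholders, not the registry's real identifiers):

```python
from itertools import product

specs = [f"spec_{i:02d}" for i in range(50)]                # 50 registered specs
tiers = ["pretty", "minified", "lap_standard", "lap_lean"]  # "none" compiles nothing
variants = list(product(specs, tiers))                      # one doc file per pair
assert len(variants) == 200
```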

Quick Start

pip install tiktoken pyyaml   # dependencies

# Compile all doc variants
python scripts/compile_variants.py

# Validate registry + manifests
python scripts/validate_registry.py

# Pilot run (7 specs, all tiers)
python -m harness.runner --pilot

# Full run (50 specs, all tiers)
python -m harness.runner --full

Scoring

Each run produces a score from 0.0 to 1.0, aggregated from three components.
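The component names and weights are not spelled out here, so the following is only a sketch of one plausible aggregation, a weighted average clamped to [0.0, 1.0]; every identifier in it is hypothetical, not the harness's actual scorer.

```python
def score_run(components: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-component scores (each in [0.0, 1.0]) into one run score."""
    total = sum(weights.values())
    raw = sum(components[name] * weights[name] for name in weights) / total
    return max(0.0, min(1.0, raw))  # clamp against rounding drift

# Hypothetical component names and weights, for illustration only.
score = score_run(
    {"correctness": 1.0, "completeness": 0.5, "efficiency": 0.8},
    {"correctness": 0.5, "completeness": 0.3, "efficiency": 0.2},
)
# score is approximately 0.81
```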

Confounding Variable Controls

Status

Pilot completed: 70 runs across 7 specs, all 5 tiers. Full benchmark run pending.