View Full Benchmark Report – 500 runs, 50 specs, 5 formats
This benchmark measures how well AI coding agents perform when given LAP-compressed API documentation versus the original formats.
| Tier | Description |
|---|---|
| None | No documentation provided (prior-knowledge baseline) |
| Pretty | Original spec, properly formatted |
| Minified | Whitespace and comments stripped |
| LAP Standard | Full LAP format with descriptions |
| LAP Lean | LAP format, types only (maximum compression) |
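To illustrate what the Minified tier does to a JSON spec, the document can simply be re-serialized without indentation or separator padding. This is a simplified sketch, not the project's actual compiler (which lives in `scripts/compile_variants.py`):

```python
import json

def minify_json_spec(pretty_text: str) -> str:
    # Drop indentation and separator padding; keys and values are untouched.
    return json.dumps(json.loads(pretty_text), separators=(",", ":"))

pretty = json.dumps({"openapi": "3.1.0", "paths": {}}, indent=2)
minified = minify_json_spec(pretty)
```

The minified variant parses back to the same structure as the pretty one; only the token/byte footprint shrinks.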
Fifty specs across five documentation formats, ten per format:

- **OpenAPI:** Figma, Stripe, Twilio, GitHub REST, DigitalOcean, Slack, Spotify, Box, Plaid, Resend
- **AsyncAPI:** Streetlights, Slack RTM, Adeo Kafka, Social Media, Gitter Streaming, Gemini WebSocket, Kraken WebSocket, Correlation ID, Operation Security, RPC Server
- **GraphQL:** GitHub, SWAPI, Yelp, Shopify, Artsy, Linear, Saleor, Elasticsearch, Coral, Unraid
- **Postman Collection:** Twilio, Postman Echo, Adobe, SAP, Stripe, Azure DevOps, Auth0, Braintree, InfluxDB, Akeneo
- **Google Discovery:** Google Storage, Pub/Sub, Vision, Data Catalog, Translate, Spanner, Firestore, Talent, Language, Billing
```
registry/   Spec definitions + task manifests with ground truth
sources/    Raw specs fetched from GitHub
compiled/   200 doc variants (50 specs x 4 tiers)
harness/    Benchmark runner, executor, scorer
prompts/    Agent prompt template
scripts/    Compilation, validation, analysis
results/    Benchmark results and reports
```
```bash
pip install tiktoken pyyaml   # dependencies

# Compile all doc variants
python scripts/compile_variants.py

# Validate registry + manifests
python scripts/validate_registry.py

# Pilot run (7 specs, all tiers)
python -m harness.runner --pilot

# Full run (50 specs, all tiers)
python -m harness.runner --full
```
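The headline totals (500 full runs, 70 pilot runs) are consistent with two runs per spec × tier cell. A quick sanity check under that assumed repetition factor:

```python
TIERS = ["none", "pretty", "minified", "lap-standard", "lap-lean"]
RUNS_PER_CELL = 2  # assumption: inferred from 500 / (50 specs * 5 tiers)

full_runs = 50 * len(TIERS) * RUNS_PER_CELL   # 50 specs, all tiers
pilot_runs = 7 * len(TIERS) * RUNS_PER_CELL   # 7-spec pilot subset
```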
Each run produces a score from 0.0 to 1.0 aggregated across three components.

- The `none` tier measures what the model knows without any documentation.
- Documentation is always presented to the agent as `api_docs.txt`, so no tier or format information leaks into the prompt.
- Status: pilot completed (70 runs across 7 specs, all 5 tiers); the full benchmark run is pending.
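A three-component score aggregated into [0.0, 1.0] could look like the sketch below. The component names and weights here are hypothetical stand-ins, not the benchmark's actual rubric (that lives in `harness/`):

```python
# Illustrative scorer: component names and weights are assumptions.
WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "efficiency": 0.2}

def score_run(components: dict[str, float]) -> float:
    """Weighted mean of per-component scores, each clamped to [0.0, 1.0]."""
    total = sum(WEIGHTS[name] * min(max(components[name], 0.0), 1.0)
                for name in WEIGHTS)
    return round(total, 3)
```

Clamping keeps a misbehaving component from pushing the aggregate outside the reported 0.0–1.0 range.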