Quality benchmark

This is our regression harness for the agents. We run the same packaged customer-feedback corpus through Agent 1 (feedback processor) and Agent 2 (insight synthesizer), then grade their outputs against a hand-curated ground truth. Use it after prompt or model changes to confirm that quality hasn't regressed: scores should stay steady or improve. Each run makes real LLM calls and consumes roughly one full synthesis batch's worth of tokens.
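
As a rough illustration of the "steady or improve" check, the sketch below compares per-rubric scores from a baseline run against a candidate run and lists any rubric that dropped. It is not the harness's actual code: the function name `find_regressions`, the rubric names, the score values, and the `tolerance` parameter are all assumptions made for the example.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return rubric names whose candidate score dropped by more than `tolerance`.

    A rubric missing from the candidate run is also reported, since a score
    that was never produced cannot be shown to have held steady.
    """
    regressed = []
    for rubric, old_score in baseline.items():
        new_score = candidate.get(rubric)
        if new_score is None or new_score < old_score - tolerance:
            regressed.append(rubric)
    return sorted(regressed)


# Hypothetical rubric names and scores, for illustration only.
baseline = {"agent1_classification": 0.90, "agent2_synthesis": 0.80}
candidate = {"agent1_classification": 0.92, "agent2_synthesis": 0.75}

regressions = find_regressions(baseline, candidate, tolerance=0.02)
print(regressions)  # ['agent2_synthesis']
```

A small tolerance may be worth allowing because LLM outputs can vary between runs, so an exact-match comparison could flag noise as a regression. The right value depends on how stable the grading is in practice.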

Learn about this benchmark
Three short sections so results are interpretable, not just numbers.
Recent runs
Your runs only. Expand a row for full rubric breakdown.