Quality benchmark
This is our regression harness for the agents. Each run feeds the same packaged customer-feedback corpus through Agent 1 (feedback processor) and Agent 2 (insight synthesizer), then grades their outputs against a hand-curated ground truth. Run it after prompt or model changes to confirm quality hasn't regressed: scores should hold steady or improve. Each run makes real LLM calls and consumes one full synthesis batch's worth of tokens.
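To make the "steady or improve" check concrete, here is a minimal sketch of how graded outputs could be compared against a stored baseline. It is not the harness's actual code: the `GradedItem` type, the `accuracy` and `regressions` helpers, the metric name `agent1_accuracy`, and the example labels are all hypothetical, and exact-match accuracy is only one possible grading rule. The agent outputs are hard-coded, so the sketch makes no LLM calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedItem:
    """One corpus item paired with its ground-truth label (hypothetical shape)."""
    item_id: str
    expected: str  # label from the hand-curated ground truth
    actual: str    # label produced by the agent under test


def accuracy(items: list[GradedItem]) -> float:
    """Fraction of items whose agent output exactly matches the ground truth."""
    if not items:
        raise ValueError("cannot grade an empty corpus")
    matches = sum(1 for item in items if item.actual == item.expected)
    return matches / len(items)


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {metric: (baseline, current)} for every metric that dropped.

    A metric counts as regressed when its current score is below the
    baseline by more than `tolerance`. A metric missing from `current`
    is scored as 0.0, so it is flagged rather than silently skipped.
    """
    flagged = {}
    for name, base_score in baseline.items():
        score = current.get(name, 0.0)
        if score < base_score - tolerance:
            flagged[name] = (base_score, score)
    return flagged


if __name__ == "__main__":
    # Stand-in for graded Agent 1 output; a real run would call the model.
    graded = [
        GradedItem("fb-1", expected="bug", actual="bug"),
        GradedItem("fb-2", expected="feature_request", actual="feature_request"),
        GradedItem("fb-3", expected="praise", actual="bug"),
        GradedItem("fb-4", expected="bug", actual="bug"),
    ]
    current = {"agent1_accuracy": accuracy(graded)}
    baseline = {"agent1_accuracy": 0.75}  # score recorded before the change
    print(regressions(baseline, current, tolerance=0.0))
```

An empty result means no metric fell below its baseline. A small non-zero `tolerance` may be worth considering, since LLM outputs can vary from run to run even when nothing has changed.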