Agentic Evals

Task completion metrics, trajectory evaluation, sandboxed environments, and failure modes.
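
As a rough illustration of the first two topics, here is a minimal sketch of an outcome-level metric (completion rate) next to a trajectory-level check (whether a run's actions contain a reference sequence in order). The Episode record, its field names, and the tool names are hypothetical, chosen only for this example. Real harnesses typically log far more detail per step.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    succeeded: bool  # did the final state pass the task's checker?
    actions: list[str] = field(default_factory=list)  # tool names, in call order


def completion_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes whose outcome passed; 0.0 for an empty list."""
    if not episodes:
        return 0.0
    return sum(e.succeeded for e in episodes) / len(episodes)


def follows_reference(actions: list[str], reference: list[str]) -> bool:
    """True if `reference` occurs in `actions` as an in-order subsequence."""
    remaining = iter(actions)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in reference)


# Illustrative data only, not real results.
episodes = [
    Episode("t1", True, ["search", "open_file", "edit", "run_tests"]),
    Episode("t2", False, ["edit", "run_tests"]),
]
reference = ["search", "edit", "run_tests"]

print(completion_rate(episodes))                          # 0.5
print(follows_reference(episodes[0].actions, reference))  # True
print(follows_reference(episodes[1].actions, reference))  # False
```

The two functions answer different questions: completion rate looks only at the final outcome, while the subsequence check looks at how the agent got there, so a run can pass one and fail the other.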