Agentic Evals
Task completion metrics, trajectory evaluation, sandboxed environments, and failure modes.