AI agents are increasingly proficient at executing business tasks autonomously, but IT leaders remain cautious about granting them permission to access enterprise systems. Part of the challenge lies in how AI reliability is measured. Industry standards often rely on eval scores, which provide a static snapshot of performance rather than a measure of overall reliability. These metrics can fail to capture predictability across prompts, environments, and input types, said Bryan Silverthorn, director of the AGI Autonomy research lab at Amazon.

Amazon’s AGI Autonomy research lab is moving beyond raw performance benchmarks, focusing instead on a structured framework centered on consistency, robustness, predictability, and safety, Silverthorn told VentureBeat during an interview ahead of his session at VB Transform 2026.

Rather than assuming that models can be harnessed into safety, Amazon’s approach emphasizes decoupled systems, such as sandboxed environments where agents propose changes that are reviewed by humans before implementation. This strategy aims to bridge the trust gap by prioritizing verifiable interactions, even in highly sensitive domains like finance, where the potential damage an agent can cause is significant.

In VentureBeat’s Q2 Pulse Research survey of over 100 senior technology leaders and buyers, just 4% said they are comfortable relying on model guardrails alone. When asked what worries them most about model guardrails, 40% cited unauthorized access to tools or data and 27% cited prompt manipulation or injection.

At VB Transform, Silverthorn will share details of Amazon’s approach to trustworthy agentic AI, and how companies can move from single-agent wrappers to multi-tool architectures that can self-correct mid-execution, during his session titled “Closing the capability-reliability gap: Inside Amazon’s framework for engineering trustworthy agents.”

Another agentic ops and evals-focused session at VentureBeat’s flagship conference, happening July …