Right now, today, you can spend $14,000 and buy a humanoid robot. The purchase comes with no reviewed safety certification and no verified standardized test protocol. You get a machine capable of physical force and real-time autonomous decision-making, while the frameworks for validating its behavior are still catching up to what it can do. That’s not a criticism of the engineers building these systems.
The intelligence side of robotics is advancing at a pace that genuinely deserves the excitement it gets: better perception, more robust locomotion, faster inference, and tighter control loops. But here’s the question I keep coming back to: As the control architecture of these systems evolves from simple teleoperation all the way to fully autonomous reinforcement learning, are our testing methodologies and safety validation processes evolving with them? I don’t think they are.
Not yet. And I think that gap is worth talking about, not to slow the industry down, but to help it scale responsibly.

Two research papers I’ve worked on recently have shaped how I think about this. One proposes a framework for classifying robot intelligence by its underlying control architecture. The other examines how software safety risk analysis needs to evolve for AI-driven systems.
Together, they point toward something the industry increasingly needs: a testing philosophy that scales alongside autonomy. One where formal safety guarantees replace test-case enumeration at the highest levels, and where adversarial robustness evaluation becomes as routine as functional testing.

First, a map of where we are

Before we can talk about how to test autonomous systems, it helps to be precise about what kind of system we’re actually testing.
In a recently published paper, I proposed a five-level taxonomy that classifies robots by their cognitive and control architecture. Unlike the SAE driving levels, which are organized around how attentive the human operator is, it is organized around how the machine itself is processing information and generating behavior.

Levels 0 and 1: Teleoperation and imitation.
At Level 0, a human is doing all the thinking. The robot executes intent directly via teleoperation. At Level 1, it has learned to imitate from recorded demonstrations through behavior cloning and can operate without a live operator, but only within the bounds of what it’s seen. The brittleness here is well-documented: robots trained on clean, structured demonstrations struggle when real-world conditions drift even slightly from the training data, such as a different floor texture or an object placed at an unfamiliar angle. Testing at these levels is relatively tractable, and the tooling is mature.

Level 2: Supervised real-time learning. The robot can detect its own uncertainty, pause safely, request correction, and integrate that correction into its future behavior using inverse reinforcement learning.
Testing becomes a two-part challenge: validating the uncertainty detection mechanism itself, and validating the integrity of the learning update triggered by each corrective intervention.

Level 3: Self-supervised learning. The robot generates its own training signals through trial and error, annotating its own successes and failures without human input.
Here, the test engineer’s job fundamentally changes. You’re no longer just testing fixed behavior. You’re validating a system that is continuously rewriting its own policy. Testing needs to assess not just current performance, but also the safety of the learning process itself.

Level 4: Reinforcement learning. Full autonomy. The robot frames every task as an optimization problem and solves it through continuous interaction with its environment, often discovering solutions a human couldn’t demonstrate.
At this level, traditional test case enumeration breaks down. The behavior space is too large, too dynamic, and too emergent to enumerate exhaustively.

Each level up this ladder doesn’t just add capability. It also adds a fundamentally different type of failure mode and demands a fundamentally different approach to validation.

Where current safety frameworks fall short

The go-to risk analysis tool in automotive and robotics software development is FMEA (failure mode and effects analysis).
In a co-authored paper, we examined the specific limitations of software design FMEA when applied to AI-driven systems, and what a more robust approach looks like. The core issue is the risk priority number, or RPN, which is FMEA’s standard scoring mechanism. It multiplies Severity, Occurrence, and Detection into a single score. The problem becomes obvious the moment you put numbers to it: a catastrophic failure rated Severity 10, Occurrence 1, Detection 1 scores 10.
So does a minor failure rated Severity 1, Occurrence 1, Detection 10. Same number. Completely different threat. In a traditional deterministic software system, experienced engineers work around this with judgment. In a neural network-driven system where failure modes are emergent and context-dependent, that judgment is much harder to apply reliably.
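To make the collision concrete, here is a minimal Python sketch of the arithmetic. The function name `rpn`, the 1-to-10 range check, and the severity-first sort key are assumptions for illustration only; they are not taken from the paper or from any FMEA standard.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: the product of the three FMEA ratings."""
    ratings = (("severity", severity),
               ("occurrence", occurrence),
               ("detection", detection))
    for name, value in ratings:
        # Assumed rating scale of 1 to 10, matching the example in the text.
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# The two failure modes from the text, as (severity, occurrence, detection).
catastrophic = (10, 1, 1)
low_severity = (1, 1, 10)

print(rpn(*catastrophic))  # 10
print(rpn(*low_severity))  # 10

# Hypothetical tie-break (an assumption, not a standard): rank by severity
# first and use the RPN only to order failure modes of equal severity.
ranked = sorted([low_severity, catastrophic],
                key=lambda r: (r[0], rpn(*r)),
                reverse=True)
print(ranked[0])  # (10, 1, 1)
```

Sorting on the RPN alone cannot separate the two entries. Any scheme that keeps severity as its own dimension at least puts the catastrophic case first.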
The consequences of getting it wrong aren’t just a failed test. They’re deployment delays, liability exposure, and in the worst cases, incidents that set back public trust in an entire product category. The paper proposes integrating a risk priority matrix alongside HAZOP (hazard and operability study) analysis, methods that evaluate risk through richer contextual lenses rather than collapsing everything into a single number.
Grounded in ISO 26262 for functional safety and ISO/SAE 21434 for automotive cybersecurity, this combined approach gives engineers a more nuanced vocabulary for reasoning about AI-specific failure modes.

The regulatory backdrop reinforces why this matters. ISO 25785-1, the first international safety standard for bipedal robots, was published in May 2025 and covers industrial workplace deployment only.
ISO 13482, addressing personal-care robots, was updated in 2025 but predates modern foundation models. The 2025 revision of ISO 10218-1 for industrial robotics made meaningful progress, but safety researchers are already identifying gaps in AI-driven humanoids and mobile manipulation that the update doesn’t fully close. These standards are essential foundations.
They need practitioner input to evolve faster.

A testing philosophy that scales with autonomy

So what does a more appropriate testing approach look like across these control levels? Here’s how I think about it.

For Levels 0 and 1, conventional verification and validation methods apply reasonably well. Hardware-in-the-loop (HiL) testing, structured test suites, and systematic boundary testing of the training data distribution are achievable and effective.
The key addition for Level 1 is deliberate out-of-distribution (OOD) testing, probing the edges of the training corpus intentionally rather than assuming coverage.

For Level 2, the test strategy needs to expand to cover the learning loop itself. Two things need validation separately:

The uncertainty quantification mechanism: Does the robot correctly identify when it doesn’t know something?

The policy update mechanism: Does the corrective input get integrated safely and accurately?

Logging and replay infrastructure becomes critical.
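The first of those two checks, whether the robot flags what it doesn’t know, can be exercised offline against labeled cases. The sketch below assumes uncertainty is estimated as disagreement across an ensemble of policy outputs, which is one common choice and not necessarily what any given platform uses. The names `should_pause` and `evaluate_gate`, the threshold, and the sample values are all hypothetical.

```python
from statistics import pstdev


def should_pause(predictions, threshold):
    """Pause when ensemble members disagree by more than the threshold.

    Disagreement is the population standard deviation of the members'
    scalar outputs (an assumed uncertainty measure for this sketch).
    """
    return pstdev(predictions) > threshold


def evaluate_gate(cases, threshold):
    """Count the two ways the uncertainty gate can fail.

    cases: list of (predictions, is_ood) pairs, where is_ood marks inputs
    deliberately drawn from outside the training distribution.
    Returns (missed_ood, false_pauses).
    """
    missed_ood = sum(1 for preds, is_ood in cases
                     if is_ood and not should_pause(preds, threshold))
    false_pauses = sum(1 for preds, is_ood in cases
                       if not is_ood and should_pause(preds, threshold))
    return missed_ood, false_pauses


# Made-up ensemble outputs for illustration only.
cases = [
    ([0.50, 0.52, 0.49], False),  # familiar input, members agree
    ([0.10, 0.90, 0.50], True),   # OOD input, members disagree
    ([0.80, 0.81, 0.80], True),   # OOD input, members confidently agree
]

missed_ood, false_pauses = evaluate_gate(cases, threshold=0.1)
print(missed_ood, false_pauses)  # 1 0
```

The third case is the one a test suite most needs to surface: an out-of-distribution input on which the ensemble agrees, so the robot never asks for help.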
Every human intervention should be recorded, tagged, and reviewed as a potential signal about where the policy is weak.

For Level 3, formal methods start becoming genuinely necessary rather than optional. When a system is rewriting its own policy through self-supervised learning, the safety constraints on that learning process need to be mathematically specified and verified, not just empirically tested.
In practice, the hardest part of Level 3 validation isn’t the tooling; it’s getting alignment on what “safe exploration” actually means for your specific platform before testing begins. Approaches like constrained reinforcement learning and safe exploration algorithms are worth building into the architecture from the start, not retrofitting later.
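One way to make “safe exploration” concrete early is to write the agreed constraints down as explicit predicates that every exploratory action must pass before it runs. The sketch below is a minimal runtime shield under assumed limits. `SafetyLimits`, `Action`, `shield`, and the numeric values are hypothetical, and a runtime check like this complements formal verification of the learning process rather than replacing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyLimits:
    """Platform-specific limits (illustrative fields, not a standard)."""
    max_speed_mps: float
    max_force_n: float


@dataclass(frozen=True)
class Action:
    speed_mps: float
    force_n: float


# Fallback executed whenever a proposed action violates a limit.
SAFE_STOP = Action(speed_mps=0.0, force_n=0.0)


def is_safe(action: Action, limits: SafetyLimits) -> bool:
    """The written-down definition of a permissible action."""
    return (abs(action.speed_mps) <= limits.max_speed_mps
            and abs(action.force_n) <= limits.max_force_n)


def shield(proposed: Action, limits: SafetyLimits):
    """Return (action_to_execute, was_overridden)."""
    if is_safe(proposed, limits):
        return proposed, False
    return SAFE_STOP, True


limits = SafetyLimits(max_speed_mps=1.0, max_force_n=50.0)

# An exploratory action inside the limits passes through unchanged.
print(shield(Action(speed_mps=0.5, force_n=20.0), limits)[1])  # False

# An action that exceeds the speed limit is replaced by the safe stop.
print(shield(Action(speed_mps=2.0, force_n=20.0), limits)[1])  # True
```

Because the predicate is small and explicit, it can be reviewed and agreed on before testing begins, and each override can be logged as evidence of where the learner pushes against its limits.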
Sim-to-real