When a production automated driving system loses steering assist at highway speed, the fallback plan can't be 'pull over when safe.' The system must continue operating—or at minimum, execute a controlled stop—without relying on any single component. That's the job of redundancy, and it's far harder than adding a second brake line. For teams working on Level 4 and Level 5 stacks, the question isn't whether to build redundancy, but how to distribute it across sensing, compute, actuation, and software without creating a system that's too heavy, too expensive, or too complex to validate.
This guide is written for engineers and architects who already understand the basics of sensor fusion and planning. We'll skip the primer on why redundancy matters and go straight to the trade-offs that determine whether your redundant architecture actually improves safety or just adds cost and failure surface.
Who Needs This and What Goes Wrong Without It
Any automated driving system that operates without a human fallback needs a redundant architecture. That includes robotaxis, autonomous trucking, and shuttle services where the safety driver is removed. Without deliberate redundancy, a single sensor failure, compute node crash, or actuator jam can leave the vehicle with no way to complete the minimal risk maneuver. The consequences range from a stranded vehicle blocking traffic to a collision that the system could have avoided.
What goes wrong most often is not the obvious single-point failure. Teams typically find that their redundant design works on paper but fails in edge cases where two supposedly independent channels share a common root cause. For example, two cameras built with lenses from the same supplier may both fail in heavy rain because of a shared condensation issue. Or two ECUs running the same software stack may both crash from the same memory leak. These correlated failures are the real threat, and they require redundancy at the architectural level, not just the component level.
Another common failure mode is the 'silent disagreement' problem. Two redundant perception channels may both produce plausible but different outputs, and the arbitration logic must decide which one to trust. If the voting logic itself has a bug or is biased toward one channel, the system can commit to a dangerous action even though the other channel had the correct data. Without careful design of the decision fusion layer, redundancy can actually reduce safety by introducing new failure modes.
We've also seen projects where redundancy was added late, as an afterthought, resulting in a system that is more complex but not more robust. The extra wiring, power draw, and thermal load from redundant components can introduce new failure points—connectors that vibrate loose, cooling fans that fail, or power supplies that sag under the additional load. A well-intentioned redundant design that is not integrated from the start often ends up being less reliable than a simpler, non-redundant system.
Who Should Prioritize Redundancy Now
If your system is targeting deployment without a safety driver, redundancy is non-negotiable. For advanced driver-assistance systems (ADAS) where the human is still responsible, limited redundancy (e.g., dual power supplies for steering) can improve availability but is not required for safety. The line is clear: if the vehicle must handle all failures autonomously, you need full architectural redundancy.
The Cost of Getting It Wrong
Beyond safety, the business cost of a non-redundant system is downtime. A single compute failure in a fleet of robotaxis can take a vehicle off the road for hours, and if the failure mode is unrecoverable, it may require a tow. For a fleet of hundreds of vehicles, reliability directly affects utilization and revenue. Redundancy that ensures graceful degradation—even if it means reduced functionality—keeps the vehicle operational until it can return to base.
Prerequisites and Context Readers Should Settle First
Before diving into redundancy architecture, your team needs a clear fault model. What are the single-point failures you are designing against? This is typically captured in a hazard analysis and risk assessment (HARA) as defined in ISO 26262, with the resulting argument documented in a safety case of the kind UL 4600 calls for. Without a systematic fault catalog, you risk over-engineering redundancy for low-probability failures while missing critical ones. The fault model should include sensor failures (e.g., camera occlusion, radar interference, LiDAR spinning motor stall), compute failures (e.g., GPU hang, memory corruption, network switch failure), actuation failures (e.g., brake booster loss, steering motor failure), and communication failures (e.g., CAN bus short, Ethernet link drop).
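One way to keep the fault catalog systematic is to hold it as structured data that can be queried for gaps. The sketch below is a hypothetical format, not something prescribed by ISO 26262 or UL 4600; the field names, fault IDs, and mitigation entries are assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Subsystem(Enum):
    SENSING = "sensing"
    COMPUTE = "compute"
    ACTUATION = "actuation"
    COMMUNICATION = "communication"


@dataclass(frozen=True)
class Fault:
    """One entry in a hypothetical fault catalog."""
    fault_id: str
    subsystem: Subsystem
    description: str
    mitigated_by: tuple = ()  # redundancy measures; empty if none assigned yet


# Illustrative entries only; a real catalog comes out of the HARA.
CATALOG = [
    Fault("S-01", Subsystem.SENSING, "camera occlusion", ("radar", "lidar")),
    Fault("S-02", Subsystem.SENSING, "LiDAR motor stall", ("camera", "radar")),
    Fault("C-01", Subsystem.COMPUTE, "GPU hang", ("secondary compute node",)),
    Fault("A-01", Subsystem.ACTUATION, "brake booster loss"),
    Fault("N-01", Subsystem.COMMUNICATION, "CAN bus short", ("second CAN bus",)),
]


def unmitigated(catalog):
    """Return IDs of faults that have no redundancy measure assigned."""
    return [f.fault_id for f in catalog if not f.mitigated_by]


def coverage_by_subsystem(catalog):
    """Map each subsystem to a (mitigated, total) pair of fault counts."""
    counts = defaultdict(lambda: [0, 0])
    for f in catalog:
        counts[f.subsystem][1] += 1
        if f.mitigated_by:
            counts[f.subsystem][0] += 1
    return {subsystem: tuple(c) for subsystem, c in counts.items()}
```

Running `unmitigated(CATALOG)` on these sample entries flags the brake booster fault as the one item with no measure assigned, which is the kind of gap the catalog is meant to surface early.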
You also need to define the operational design domain (ODD) and the minimal risk condition (MRC). For example, if the ODD is highway driving, the MRC might be pulling onto the shoulder and stopping. If the ODD is urban streets, the MRC might be pulling into a parking lot or stopping in a safe lane. The redundancy architecture must guarantee that the MRC can be executed even after any single failure (and sometimes after multiple failures, depending on the safety case).
Understanding Failure Modes and Effects
Each component in the system should have a documented failure mode and effects analysis (FMEA). This is not a one-time exercise; it must be updated as the architecture evolves. For example, a LiDAR unit may fail in a way that produces no data, or it may produce corrupted data that appears valid. The redundancy design must handle both cases, and the detection mechanism must be able to distinguish between them. Similarly, a compute node may fail silently (no heartbeat) or may produce incorrect results that pass sanity checks. The architecture needs diverse monitoring, not just a watchdog timer.
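The difference between "no data" and "data that looks valid but is wrong" can be made concrete with a small classifier. This is a minimal sketch under stated assumptions: the `RangeReading` type, the thresholds, and the idea of checking against a reference range from an independent sensor are all hypothetical choices for illustration, not tuned values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Health(Enum):
    OK = "ok"
    NO_DATA = "no_data"          # sensor stopped publishing or data is stale
    IMPLAUSIBLE = "implausible"  # data arrives but fails a plausibility check


@dataclass
class RangeReading:
    timestamp_s: float
    range_m: float


def classify(reading: Optional[RangeReading], now_s: float,
             max_age_s: float = 0.2, max_range_m: float = 250.0,
             reference_range_m: Optional[float] = None,
             max_deviation_m: float = 5.0) -> Health:
    """Classify one reading from one channel (thresholds are assumptions)."""
    # Case 1: nothing received, or the last sample is too old.
    if reading is None or now_s - reading.timestamp_s > max_age_s:
        return Health.NO_DATA
    # Case 2a: value is outside the physically possible range (also catches NaN).
    if not (0.0 <= reading.range_m <= max_range_m):
        return Health.IMPLAUSIBLE
    # Case 2b: value looks valid in isolation but disagrees with an
    # independent reference, for example a radar range to the same object.
    if (reference_range_m is not None
            and abs(reading.range_m - reference_range_m) > max_deviation_m):
        return Health.IMPLAUSIBLE
    return Health.OK
```

A watchdog timer alone only covers the first case. The third check is the one that needs a diverse second source, which is why monitoring and sensor diversity tend to be designed together.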
Safety Standards and Certification Context
While this guide does not replace a formal safety process, it's important to understand that redundancy alone does not guarantee safety. The system must be designed to meet the required Automotive Safety Integrity Level (for example, ASIL D for the most critical functions). Redundancy is a means to achieve that level, but it must be complemented by diversity (different implementations) and independence (no shared resources). If both redundant channels share a power supply, a single power failure takes both down, and the redundancy is worthless. Independence must be verified at the physical, electrical, and software levels.
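The effect of imperfect independence can be estimated with the beta-factor model, a common simplification in reliability engineering that attributes a fraction beta of each channel's failure probability to causes shared with the other channel. The sketch below is illustrative only: the function name is an assumption, and the numbers in the note are made up rather than taken from any real system.

```python
def dual_channel_failure_prob(p: float, beta: float) -> float:
    """Probability that both channels of a 1+1 pair are failed.

    p    : failure probability of one channel over the interval of interest
    beta : fraction of that probability attributed to common causes
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must be in [0, 1]")
    p_common = beta * p               # one event takes out both channels
    p_independent = (1.0 - beta) * p  # each channel fails on its own
    # Either the common cause occurs, or it does not and both channels
    # fail independently.
    return p_common + (1.0 - p_common) * p_independent ** 2
```

With p = 1e-3 and beta = 0, the pair fails with probability 1e-6. With the same p and beta = 0.05, the result is about 5.1e-5, roughly 50 times worse, and almost all of it comes from the shared cause. That is the quantitative reason a shared power supply or shared library can erase most of the benefit of a second channel.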
Core Workflow: Designing a Redundant Architecture
The process of architecting redundancy follows a structured workflow that balances safety requirements with practical constraints. We break it into six steps that should be iterated as the system matures.
Step 1: Identify Critical Functions
Not every function needs redundancy. Start by listing all functions that are safety-critical: perception, localization, planning, control, and actuation. For each, define the minimum acceptable performance after a failure. For example, after a primary camera failure, the system may still operate at reduced speed using radar and LiDAR only. The goal is to identify which functions must be fail-operational (continue full function) versus fail-degraded (continue with reduced capability) versus fail-safe (stop safely).
Step 2: Choose Redundancy Topologies
There are three common topologies: 1+1 active/standby, N+1 (one spare for N active units), and N+M (multiple spares). For sensing, a common approach is to use a diverse sensor suite where each sensor type covers the weaknesses of others. For compute, a dual-redundant architecture with two independent compute nodes that cross-check each other is typical. For actuation, dual-wound motors or dual hydraulic circuits are used. The topology must ensure that no single failure disables both channels.
Step 3: Design the Voting and Arbitration Logic
When two channels produce different outputs, the system must decide which one to use. Simple majority voting works for three or more channels, but for dual-channel systems, you need a more sophisticated approach. Common strategies include: (a) compare outputs and if they disagree, use the channel that is consistent with the last known good state; (b) use a third, independent monitor channel to break ties; (c) degrade to a safe state if disagreement persists. The arbitration logic itself must be verified to be free of single-point failures—often implemented on a separate safety microcontroller.
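The three strategies above can be combined in one small arbiter. The following is a sketch under stated assumptions, not a verified safety component: the class name, the tolerance, the three-cycle persistence limit, and the choice to average agreeing channels are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    AGREE = "agree"
    TIE_BROKEN_BY_MONITOR = "tie_broken_by_monitor"
    HELD_NEAR_LAST_GOOD = "held_near_last_good"
    SAFE_STATE = "safe_state"


@dataclass
class Arbiter:
    tolerance: float = 0.5        # max allowed difference between channels
    max_disagree_cycles: int = 3  # persistence limit before degrading
    last_good: float = 0.0        # last value both channels agreed on
    disagree_count: int = 0

    def step(self, a: float, b: float, monitor: Optional[float] = None):
        """Return (decision, selected command or None) for one cycle."""
        if abs(a - b) <= self.tolerance:
            self.disagree_count = 0
            self.last_good = (a + b) / 2.0
            return Decision.AGREE, self.last_good

        self.disagree_count += 1
        # Strategy (c): persistent disagreement, stop trusting either channel.
        if self.disagree_count >= self.max_disagree_cycles:
            return Decision.SAFE_STATE, None
        # Strategy (b): an independent monitor channel breaks the tie.
        if monitor is not None:
            chosen = a if abs(a - monitor) <= abs(b - monitor) else b
            return Decision.TIE_BROKEN_BY_MONITOR, chosen
        # Strategy (a): prefer the channel closest to the last agreed value.
        chosen = a if abs(a - self.last_good) <= abs(b - self.last_good) else b
        return Decision.HELD_NEAR_LAST_GOOD, chosen
```

Note that `last_good` is only updated when the channels agree, so a faulty channel cannot drag the reference toward itself during a disagreement. Returning the decision alongside the command also gives the logging layer something to record.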
Step 4: Implement Independent Power and Communication
Redundant channels must have separate power supplies, ground paths, and communication buses. A single short circuit on one channel should not affect the other. This means physical separation of wiring harnesses, separate fuses or circuit breakers, and galvanic isolation where signals cross between domains. Communication between channels should use redundant networks (e.g., two CAN buses or two Ethernet links) with independent switches.
Step 5: Build Detection and Switching Mechanisms
The system must detect failures quickly enough to switch to the backup before the vehicle enters an unsafe state. Detection can be based on heartbeats, sanity checks, cross-channel comparison, or built-in self-tests. The detection time plus the switching time must be less than the time to the critical event, a budget that ISO 26262 calls the fault tolerant time interval. For example, if a steering actuator fails, the backup must engage within a budget that is usually measured in milliseconds to maintain lane keeping. This requires fast failure detection and low-latency activation of the redundant channel.
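A quick way to sanity-check a heartbeat-based design is to add up the worst-case detection and switching delays and compare the total against the time to the critical event. The sketch below assumes a timeout-style monitor that polls at a fixed period; the class name, the parameters, and the example figures in the note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchoverBudget:
    heartbeat_period_s: float     # how often the primary sends a heartbeat
    missed_beats_to_trip: int     # consecutive missed beats before tripping
    monitor_poll_period_s: float  # how often the monitor evaluates the timeout
    switch_latency_s: float       # time for the backup to take control

    def worst_case_s(self) -> float:
        """Worst-case time from fault to the backup being in control."""
        timeout_s = self.heartbeat_period_s * self.missed_beats_to_trip
        # The timeout may expire just after a poll, so add one poll period.
        return timeout_s + self.monitor_poll_period_s + self.switch_latency_s

    def fits(self, time_to_critical_event_s: float) -> bool:
        """True if the worst case is strictly inside the allowed time."""
        return self.worst_case_s() < time_to_critical_event_s
```

For instance, a 10 ms heartbeat with a three-beat timeout, a 5 ms poll period, and a 20 ms switch latency gives a worst case of about 55 ms. That fits a 100 ms budget but not a 50 ms one, even though each individual figure looks small.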
Step 6: Validate with Fault Injection Testing
Redundancy is only as good as its validation. Fault injection testing—where you deliberately induce failures in each component—is essential to verify that the system behaves as expected. This includes injecting sensor faults, compute crashes, network drops, and power interruptions. The tests should cover single failures, dual failures (if the safety case requires it), and common-cause failures (e.g., temperature extremes that affect both channels). Automated regression testing with fault injection should be part of the continuous integration pipeline.
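The structure of a fault injection sweep can be shown on a toy model before it is wired to a simulator or HIL rig. In the sketch below, the architecture tables and the `can_reach_mrc` check are hypothetical stand-ins for the real system under test; in a CI pipeline that function would instead drive the actual stack and observe whether the MRC was reached.

```python
from itertools import combinations

# Hypothetical architecture: function -> components that can provide it.
PROVIDERS = {
    "perception": {"camera", "radar", "lidar"},
    "compute": {"ecu_a", "ecu_b"},
    "braking": {"brake_circuit_1", "brake_circuit_2"},
    "steering": {"steer_winding_1", "steer_winding_2"},
}
# Hypothetical shared resources: failing the key also fails the values.
DEPENDENTS = {
    "psu_a": {"ecu_a", "camera", "brake_circuit_1", "steer_winding_1"},
    "psu_b": {"ecu_b", "radar", "lidar", "brake_circuit_2", "steer_winding_2"},
}


def can_reach_mrc(failed):
    """True if every function still has at least one healthy provider."""
    down = set(failed)
    for resource in failed:
        down |= DEPENDENTS.get(resource, set())
    return all(providers - down for providers in PROVIDERS.values())


def injectable_faults():
    """Every component or shared resource that can be failed, sorted."""
    return sorted(set().union(*PROVIDERS.values()) | set(DEPENDENTS))


def failing_combinations(order):
    """Return all fault sets of the given size that prevent the MRC."""
    return [combo for combo in combinations(injectable_faults(), order)
            if not can_reach_mrc(combo)]
```

In this toy model `failing_combinations(1)` is empty, so no single fault prevents the MRC. `failing_combinations(2)` lists the dual faults that do, such as losing both ECUs or both power supplies, which is the list the safety case then has to argue about.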
Tools, Setup, and Environment Realities
Implementing redundancy requires a development environment that supports both simulation and hardware-in-the-loop (HIL) testing. Simulation is critical for early validation of fault scenarios that are hard to reproduce in the real world, such as simultaneous sensor failures. However, simulation must accurately model the failure modes—many simulators assume perfect sensors and actuators, which defeats the purpose. Use fault injection libraries that can corrupt data, delay messages, or simulate hardware faults at the software level.
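A software-level fault injector can be as simple as a wrapper around a message stream. The sketch below is a minimal, hypothetical example and not the API of any existing fault injection library; the message format (a timestamp and a value) and the parameter names are assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Tuple

Message = Tuple[float, float]  # (timestamp_s, value)


@dataclass
class FaultInjector:
    drop_prob: float = 0.0     # probability a message is silently dropped
    delay_s: float = 0.0       # fixed delay added to every timestamp
    corrupt_prob: float = 0.0  # probability the value is replaced with noise
    seed: int = 0              # fixed seed keeps test runs reproducible
    _rng: random.Random = field(init=False, repr=False)

    def __post_init__(self):
        self._rng = random.Random(self.seed)

    def apply(self, stream: Iterable[Message]) -> Iterator[Message]:
        """Yield the stream with drops, delays, and corruption applied."""
        for timestamp_s, value in stream:
            if self._rng.random() < self.drop_prob:
                continue  # simulate a lost message
            if self._rng.random() < self.corrupt_prob:
                value = self._rng.uniform(-1e3, 1e3)  # simulate bad data
            yield (timestamp_s + self.delay_s, value)
```

Seeding the random generator matters more than it looks: a fault scenario that exposed a bug has to be replayable exactly, otherwise the regression test built from it is not trustworthy.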
For HIL testing, you need a setup that can inject real electrical faults: short circuits, open circuits, voltage sags, and communication errors. This typically requires a programmable power supply, a fault insertion unit (FIU), and a real-time target that runs the full software stack. The HIL environment should also include a vehicle dynamics model so that the system's response to failures can be evaluated in realistic driving scenarios.
Common Tooling Choices
Many teams use ROS 2 or DDS for communication, but these middleware layers do not inherently provide redundancy. You need to build a custom layer that handles duplicate messages, channel selection, and health monitoring. For compute redundancy, hypervisors like Jailhouse or ACRN can partition hardware resources and provide spatial isolation between redundant software stacks. For actuation, redundant ECUs with separate firmware are common, and they communicate via a dedicated safety bus like FlexRay or TTP.
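The duplicate-handling and health-monitoring part of such a layer can be sketched independently of the middleware. The example below deliberately uses no ROS 2 or DDS API; the class, the link names, and the timeout are hypothetical, and sequence-number wraparound is ignored for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RedundantReceiver:
    """Accepts the same sequence-numbered stream from several links."""
    link_timeout_s: float = 0.1
    _last_seq: int = -1
    _last_seen_s: Dict[str, float] = field(default_factory=dict)

    def on_message(self, link: str, seq: int, payload: bytes,
                   now_s: float) -> Optional[bytes]:
        """Return the payload the first time a sequence number arrives."""
        self._last_seen_s[link] = now_s  # any traffic counts as link health
        if seq <= self._last_seq:
            return None  # duplicate or stale copy from the slower link
        self._last_seq = seq
        return payload

    def healthy_links(self, now_s: float) -> List[str]:
        """Links that delivered something within the timeout, sorted."""
        return sorted(link for link, seen in self._last_seen_s.items()
                      if now_s - seen <= self.link_timeout_s)
```

Forwarding whichever copy arrives first means a link failure causes no switchover delay at all. The health list exists so that the loss of one link is still reported, since the system is running without a spare from that point on.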
Power and Thermal Constraints
Redundancy adds power draw, which is a major constraint in electric vehicles. Each additional compute node or sensor consumes power and generates heat. The thermal design must ensure that redundant components do not overheat when the primary is also running, especially in hot climates. Some architectures use a 'cold standby' where the backup is powered off until needed, but this introduces a wake-up latency that must be accounted for. Hot standby, where both channels are powered and running, is simpler and avoids that latency but consumes more energy.
Variations for Different Constraints
Not every project can afford full hardware redundancy. The trade-offs depend on the ODD, the acceptable risk level, and the business model. Here are three common variations.
Sensor Diversity as Lightweight Redundancy
If budget or space prevents dual sensors of the same type, you can achieve partial redundancy through sensor diversity. For example, a system that relies primarily on cameras can use radar as a backup for object detection, even though radar has lower resolution. The key is that the backup sensor must be able to detect the same critical objects (e.g., pedestrians, vehicles) with sufficient accuracy for the MRC. This approach reduces cost but requires careful analysis of the coverage gaps—radar may not detect stopped vehicles or pedestrians in certain orientations.
Software Diversity for Compute Failures
Instead of duplicating hardware, some teams implement software diversity on a single compute platform. Two independent perception algorithms run on the same GPU, and their outputs are compared. If they disagree, the system falls back to a simpler, more conservative planner. This approach saves hardware cost but is vulnerable to common-cause failures like a GPU driver bug that affects both algorithms. It also requires that the two algorithms are truly independent—not just different parameter sets of the same neural network.
Degraded Mode as a Practical Alternative
For systems with limited redundancy, a well-designed degraded mode can be safer than no redundancy at all. For example, if the primary compute fails, the system can reduce speed to 30 km/h and use a minimal sensor set (e.g., only radar) to navigate to a safe stop. The degraded mode must be designed and tested as thoroughly as the nominal mode. This approach is common in early deployments where full redundancy is not yet feasible.
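Because the degraded mode has to be tested as thoroughly as the nominal one, it helps to express the mode logic as an explicit state machine. The sketch below is a hypothetical three-mode example: the mode names, the inputs, and the 110 km/h nominal cap are assumptions, while the 30 km/h degraded cap is the example figure from the paragraph above.

```python
from enum import Enum


class Mode(Enum):
    NOMINAL = "nominal"
    DEGRADED = "degraded"  # reduced speed, minimal sensor set
    STOP = "stop"          # bring the vehicle to a halt and stay there


# Speed caps per mode in km/h; the nominal value is an assumed placeholder.
SPEED_CAP_KMH = {Mode.NOMINAL: 110.0, Mode.DEGRADED: 30.0, Mode.STOP: 0.0}


def next_mode(mode, primary_compute_ok, radar_ok, at_safe_stop):
    """Compute one transition of the mode state machine."""
    if mode is Mode.STOP:
        return Mode.STOP  # latched: leaving STOP needs an explicit reset
    if mode is Mode.NOMINAL and primary_compute_ok:
        return Mode.NOMINAL
    # Either already degraded, or the primary compute has just failed.
    if not radar_ok or at_safe_stop:
        return Mode.STOP
    return Mode.DEGRADED
```

Two design choices are worth noting. The machine never returns from DEGRADED to NOMINAL on its own, so a flapping primary cannot cause repeated mode changes, and losing the radar while degraded goes straight to STOP because no sensor is left to navigate with.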
Pitfalls, Debugging, and What to Check When It Fails
Even with careful design, redundant systems fail in unexpected ways. Here are the most common pitfalls and how to debug them.
Correlated Failures That Bypass Redundancy
The most dangerous failure is one that takes out both channels simultaneously. Common causes include: shared power supply, shared cooling system, shared software library, or shared calibration. To detect these, perform a common-cause analysis (CCA) that identifies all shared resources. Then test by failing each shared resource and checking whether both channels are affected. If they are, the redundancy is compromised and that resource needs to be separated or duplicated.
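The first pass of a common-cause analysis is mechanical: list what each channel depends on and intersect the lists. The sketch below assumes such a dependency listing already exists; the channel and resource names are hypothetical.

```python
from collections import defaultdict

# Hypothetical dependency listing: channel -> resources it relies on.
DEPENDS_ON = {
    "channel_a": {"psu_1", "coolant_loop", "perception_lib_v2", "calib_set_1"},
    "channel_b": {"psu_2", "coolant_loop", "perception_lib_v2", "calib_set_2"},
}


def shared_resources(depends_on):
    """Return {resource: sorted channels} for resources used by >1 channel."""
    users = defaultdict(set)
    for channel, resources in depends_on.items():
        for resource in resources:
            users[resource].add(channel)
    return {resource: sorted(channels)
            for resource, channels in users.items() if len(channels) > 1}
```

For this listing the function reports the coolant loop and the perception library as shared. Each reported resource is then a candidate for a fault injection test, since a listing like this only finds dependencies someone remembered to write down.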
Voting Logic Bugs
Arbitration logic is notoriously hard to get right. A common bug is that the voting algorithm assumes both channels are equally reliable, but in practice one channel may be more accurate in certain conditions. For example, a camera-based channel may be better in daylight, while a LiDAR-based channel is better at night. The voting logic should be context-aware, using confidence estimates from each channel. Debugging voting logic requires logging the raw outputs of each channel and the arbitration decision, then replaying scenarios where the system made a poor choice.
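One simple form of context-aware arbitration is a confidence-weighted mean, logged together with its raw inputs so that poor decisions can be replayed later. This is a sketch under assumptions: it only makes sense for continuous quantities such as a range estimate, it presumes each channel's confidence is calibrated, and the function names and log format are hypothetical.

```python
import json


def fuse(estimates):
    """Confidence-weighted mean of (value, confidence) pairs.

    Each confidence is a non-negative weight supplied by its channel.
    Returns None if no channel reports any confidence.
    """
    total = sum(conf for _, conf in estimates)
    if total <= 0.0:
        return None
    return sum(value * conf for value, conf in estimates) / total


def arbitrate_and_log(channels, log):
    """channels: {name: (value, confidence)}; appends one JSON line to log."""
    fused = fuse(list(channels.values()))
    log.append(json.dumps({"inputs": channels, "fused": fused}, sort_keys=True))
    return fused
```

With a camera estimate of 10.0 m and a LiDAR estimate of 12.0 m, daylight confidences of 0.75 and 0.25 give 10.5 m, while swapping the confidences at night gives 11.5 m. Because every decision is logged with both raw inputs, a replay tool can later re-run the same inputs through a modified arbiter and compare outcomes.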
Testing Blind Spots
Many teams test redundancy only in simulation or only in the lab, missing failures that occur in the field. For example, a redundant compute node may work fine in a climate-controlled lab but overheat in a parked car on a summer day. Environmental testing—temperature, humidity, vibration, and electromagnetic interference—must include the redundant system in its full configuration. Also, test the switching mechanism under realistic loads: a cold standby that takes 500 ms to boot may be too slow for a highway lane departure.
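To see why a 500 ms boot time matters, it helps to convert the latency into distance travelled with no channel in control. The helper below is a trivial sketch; the function name and the 108 km/h highway speed used in the note are assumptions chosen for round numbers.

```python
def distance_during_latency_m(speed_kmh: float, latency_s: float) -> float:
    """Distance covered while no channel is in control."""
    speed_m_s = speed_kmh * 1000.0 / 3600.0
    return speed_m_s * latency_s
```

At an assumed 108 km/h (30 m/s), a 500 ms boot corresponds to 15 m of travel. At 30 km/h the same latency is about 4.2 m, which is one reason a cold standby can be acceptable in a low-speed ODD and not on a highway.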
What to Check When a Redundancy Failure Occurs
When a redundant system fails in testing or in the field, start by examining the fault logs from both channels. Look for patterns: did both channels fail at the same time? Did one channel fail first and the other later? Was there a common event (e.g., a voltage dip) that preceded the failure? Check the health monitoring messages: did the system detect the failure correctly? If detection was missed, the monitoring logic needs improvement. If detection worked but switching failed, examine the switching mechanism's state machine and timing.
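The first triage questions can be automated once the fault timestamps from both channels are extracted from the logs. The sketch below is a hypothetical helper: the 50 ms window for treating two failures as simultaneous and the returned labels are assumptions, and a real tool would read the timestamps from whatever log format the system uses.

```python
def classify_failure(t_fail_a, t_fail_b, t_common_event=None, window_s=0.05):
    """Classify a dual-channel failure from fault-log timestamps in seconds.

    t_fail_a and t_fail_b are None if that channel never logged a failure.
    t_common_event is the time of a suspected shared trigger, such as a
    voltage dip, or None if no such event was logged.
    """
    if t_fail_a is None and t_fail_b is None:
        return "no_failure_logged"
    if t_fail_a is None or t_fail_b is None:
        return "single_channel"
    if abs(t_fail_a - t_fail_b) <= window_s:
        first = min(t_fail_a, t_fail_b)
        # A shared trigger shortly before the first failure points to a
        # common cause rather than coincidence.
        if (t_common_event is not None
                and 0.0 <= first - t_common_event <= window_s):
            return "simultaneous_after_common_event"
        return "simultaneous"
    return "a_then_b" if t_fail_a < t_fail_b else "b_then_a"
```

A "simultaneous" result should send the investigation toward shared resources, while "a_then_b" or "b_then_a" suggests a cascading failure or a backup that was already degraded when it was needed.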
Finally, remember that redundancy is a means, not an end. The goal is a safe and reliable automated driving system. Every redundant component adds complexity, and complexity is the enemy of safety. The best architectures are those that achieve the required integrity with the minimum necessary redundancy, validated through rigorous testing and real-world operation.