A Functional Safety Primer for FPGA – and the Rest of Us

by Bernard Murphy on 07-27-2017 at 7:00 am

Once in a while I come across a vendor-developed webinar which is so generally useful it deserves to be shared beyond the confines of sponsored sites. I don’t consider this spamming – if you choose, you can ignore the vendor-specific part of the webinar and still derive significant value from the rest. In this instance, the topic is design for functional safety, particularly as applied to FPGA design, and how these techniques can be implemented using Synopsys’ Synplify Premier. In fact, in this very detailed review I saw little that isn’t of equal relevance in ASIC design, though there it would typically be implemented using different tools and methodologies.


The webinar kicks off with a review of standards requiring functional safety, refreshingly going beyond ISO 26262 to mention IEC 61508 for industrial applications, IEC 60601 for medical applications and DO-254 for avionics, and even applications in datacenters which must support very high up-times (up to 99.999%). The webinar speaker (Paul Owens, Sr. TME in the Synopsys Verification Group) then sets up the context for the discussion by noting that each of these standards measures functional safety through assessment of risks with or without mitigation or detection.

You probably think “yeah, yeah, triplication or ECC and stuff like that”. In fact, ISO 26262 doesn’t specify what safety mechanisms to implement – you are free to invent custom methods if you want, but there are widely-accepted approaches which Paul spells out in detail. He mentions as one example the commonly used dual-core lock-step computing where two CPUs perform the same calculation in parallel and results are compared to detect faults.
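To make the lock-step idea concrete, here is a minimal software sketch, not anything shown in the webinar. The function names and the stuck-at fault model are hypothetical, chosen only for illustration: two "cores" run the same computation and a comparator flags any disagreement.

```python
from typing import Callable, Tuple


def lockstep(core_a: Callable[[int], int],
             core_b: Callable[[int], int],
             x: int) -> Tuple[int, bool]:
    """Run the same computation on two cores; return (result, fault_detected)."""
    result_a = core_a(x)
    result_b = core_b(x)
    # The comparator can only detect a fault, it cannot tell which core is wrong.
    return result_a, result_a != result_b


def healthy_core(x: int) -> int:
    # Arbitrary 8-bit computation standing in for real CPU work.
    return (x * 3 + 1) & 0xFF


def faulty_core(x: int) -> int:
    # Hypothetical fault model: bit 2 of the result is stuck at 1.
    return healthy_core(x) | 0b100


print(lockstep(healthy_core, healthy_core, 5))
print(lockstep(healthy_core, faulty_core, 5))
```

Note that a stuck-at fault is only caught when it actually changes the output: for an input whose correct result already has bit 2 set, the two cores agree and nothing is flagged.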

Paul addresses design principally for two classes of fault in this webinar: stuck-at faults caused by physical damage (perhaps through electromigration), and soft errors initiated, say, by radiation or EMI. A soft error starts as a transient on a signal (a single event transient), which may in turn flip a register into an unexpected state (a single event upset). Each mechanism Paul describes is intended to mitigate, or at least detect, logic errors caused by these types of problems.


He starts with safety techniques for finite-state machines (FSMs). FSMs are particularly sensitive to soft errors, since an incorrect bit flip can send the FSM in an unexpected direction, causing all kinds of issues downstream. Paul describes two recovery mechanisms: “safe” recovery, where a detected error takes the state machine back to the reset state, and “safe case”, where detection takes the FSM back to a default state. (In Synplify Premier, this also ensures the default state is not optimized out in synthesis – you would need to guide similar behavior in other tools.)
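As a rough behavioral model of this kind of recovery (a sketch under my own assumptions, not Synplify Premier output), consider a hypothetical three-state one-hot FSM. Any state register value that is not a valid one-hot code is treated as a detected error and the machine is sent to a recovery state; passing the reset state models “safe” recovery, and passing some other designated state models the “safe case” default.

```python
# Hypothetical one-hot state codes, for illustration only.
IDLE, RUN, DONE = 0b001, 0b010, 0b100
VALID_STATES = {IDLE, RUN, DONE}


def next_state(state: int, go: bool, recover_to: int = IDLE) -> int:
    """Return the next state, recovering if the current state is invalid."""
    if state not in VALID_STATES:
        # An upset produced an illegal code (no hot bit, or more than one):
        # go to the recovery state instead of wandering unpredictably.
        return recover_to
    if state == IDLE:
        return RUN if go else IDLE
    if state == RUN:
        return DONE
    return IDLE  # DONE returns to IDLE


upset_state = RUN ^ 0b001          # single bit flip gives 0b011, not a valid code
print(bin(next_state(upset_state, go=False)))
```

With one-hot encoding a single bit flip always lands on an invalid code, which is what makes detection simple here; denser encodings need extra checking logic.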

It is also possible to support error correction in FSMs when the state encoding is chosen so that the Hamming distance between state codes is 3, meaning any two codes differ in at least three bit positions. In this case single-bit errors can be detected and corrected without needing to restart the FSM or recover from a default state.
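A small sketch shows why a distance of 3 is enough for correction. The four 5-bit codes below are my own illustrative choice (not taken from the webinar); every pair differs in at least three bits, so a word with one flipped bit is still closer to its original code than to any other.

```python
from typing import Optional

# Hypothetical state codes with pairwise Hamming distance >= 3.
CODEWORDS = {
    "IDLE": 0b00000,
    "LOAD": 0b00111,
    "RUN":  0b11001,
    "DONE": 0b11110,
}


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def correct(word: int) -> Optional[str]:
    """Map a possibly corrupted state word back to a state name.

    Returns the state whose code is within one bit of `word`, or None
    if no code is that close.
    """
    for name, code in CODEWORDS.items():
        if hamming(word, code) <= 1:
            return name
    return None


corrupted = CODEWORDS["RUN"] ^ 0b00100   # flip one bit of RUN
print(correct(corrupted))
```

The guarantee covers single-bit errors only. A double-bit error can land within one bit of a different code and be silently "corrected" to the wrong state.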


Now for memories. FPGA and IP vendors provide ECC (error-correcting code) RAMs and of course you can use those IPs. In some cases you may choose to use triple-modular redundancy (TMR) for RAMs that do not support error correction in the configurations you need. (TMR triplicates a function and follows it with majority voter logic; this allows two correctly operating functions to override one function with an error.) Also, something that was new to me, you can use error detection to trigger “RAM scrubbing”, a technique commonly used on configuration RAMs to force a re-program of that memory.
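The majority vote at the heart of TMR is simple enough to sketch in a few lines. This is a generic bitwise voter, not vendor-generated logic, and the mismatch flag is my own addition to show where an error-detection signal (for example, one that could trigger scrubbing) might come from.

```python
from typing import Tuple


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Return (voted_value, mismatch_detected) for three redundant copies."""
    return majority(a, b, c), not (a == b == c)


good = 0xA5
upset = good ^ 0x10                 # one copy with a single flipped bit

print(vote(good, upset, good))      # the two good copies outvote the bad one
print(vote(upset, upset, good))     # two bad copies outvote the good one
```

The second call illustrates the limit of the scheme: TMR tolerates one faulty copy, not two failing in the same way.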

IOs are as prone to faults as any other part of the circuit and mitigation in some cases requires triplication. This is implemented through Distributed TMR (DTMR). TMR comes in multiple flavors – local, block, distributed and global. (Here’s one useful reference.)


TMR can be used on individual registers and register banks and it can also be applied to blocks of logic in the design, but here there are wrinkles. This is again viewed as distributed TMR but usage differs for cyclic, non-cyclic and some other approaches. In non-cyclic cases, there’s no feedback path from internal registers in the block; in this case triplication is straightforward. In cyclic cases, where there are internal feedback loops, those loops can (optionally) be broken to insert voter logic to limit accumulation of errors, in addition to triplicating the blocks and following those structures with majority voter logic.
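The value of voting inside a feedback loop can be seen in a toy simulation. This is a sketch under my own assumptions: a triplicated 8-bit counter stands in for a cyclic block, upsets are injected at chosen cycles, and a single shared voter is used for brevity (real implementations typically triplicate the voters as well).

```python
from typing import Dict, List, Tuple


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def run(cycles: int,
        upsets: Dict[int, Tuple[int, int]],
        vote_feedback: bool) -> List[int]:
    """Simulate a triplicated 8-bit counter and return the voted outputs.

    upsets maps a cycle number to (copy_index, bit_mask) to flip.
    With vote_feedback=True each copy reloads from the voted value;
    otherwise each copy feeds back its own, possibly corrupted, value.
    """
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        if cycle in upsets:
            index, mask = upsets[cycle]
            regs[index] ^= mask                 # inject a single event upset
        voted = majority(*regs)
        outputs.append(voted)
        if vote_feedback:
            regs = [(voted + 1) & 0xFF] * 3     # voter inside the loop
        else:
            regs = [(r + 1) & 0xFF for r in regs]
    return outputs


# Two upsets, in different copies, three cycles apart.
upsets = {2: (0, 0x10), 5: (1, 0x10)}
print(run(8, upsets, vote_feedback=True))
print(run(8, upsets, vote_feedback=False))
```

Without the voter in the loop, the first upset stays latched in its copy, so the second upset leaves two copies wrong and the output voter is outvoted. With the voter in the loop, the first error is flushed on the next cycle and the second is handled as an isolated fault.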

Finally, there’s a physical constraint you may not have considered in TMR. Radiation-induced soft errors can trigger not just the initial error but also secondary ionization (a neutron bumps an ion out of the lattice, which in turn bumps more ions out of place, potentially causing multiple distributed transients). If you neatly place your TMR devices or blocks together in the floorplan, you increase the likelihood that two or more of them may be upset by the same event, completely undermining all that careful effort. So triplicates should be physically separated in the floorplan. How far depends on free-ion path lengths in secondary ionization.

Naturally many of these methods can be implemented in FPGA designs using Synplify Premier; Paul calls out commands and shows generated logic examples during the course of the webinar to illustrate each case. But whether or not you are an FPGA designer, I recommend you set aside some time for personal skills improvement by watching the webinar HERE.
