Aaron's rationale for NAII requested FPGA capabilities.
From: alarson
Subject: Aaron's rationale for NAII requested FPGA capabilities.
Date: Fri, 23 Dec 2022 10:11:42 -0700
At the December 19th coordination meeting, Archit asked me to provide a brief (sorry, I failed on that) rationale for the capabilities I requested of the FPGA provided by NAI for sync. To be clear, this is not intended to be a specification or even a suggestion for the behavior requested by Korry. It is based on what I wished we had had on numerous past systems that needed to implement sync.
Sync is rarely easy, and often has systemic effects and dependencies. This note therefore does not say how sync should be done; rather, it covers issues to consider and how the NAI capabilities can be used to facilitate reasonably simple solutions to some common problems.
The first issue for sync is that any solution that involves event latency (e.g., responding to an interrupt) is going to be complex and fragile due to inherent variability in response times. The NAI FPGA capabilities help substantially here, so I won't describe the latency-associated problems the FPGA avoids.
The NAI FPGA provides a sequence of pulses, where each pulse contains a "high" portion and a "low" portion. The pulse duration is the sum of those two times. Since the FPGA timestamps the rising and falling edges separately, a receiver can know the duration of each part of the pulse from each sender. That allows the sender to provide not only the timing of the pulse start, but also to encode some information in the relative size of the high/low times. For example, a 10 ms pulse could be composed of 1 ms high + 9 ms low, or 2 ms high + 8 ms low, etc.
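As a minimal sketch of the receiving side, the following shows how the high/low split could be recovered from edge timestamps. The function name, the state table, and the tolerance are hypothetical and are not part of the NAI interface; timestamps are assumed to be already converted to microseconds and unwrapped.

```python
# Hypothetical encoding table: nominal high time (us) -> sender state.
STATE_BY_HIGH_US = {1_000: "not_master", 2_000: "master"}
TOLERANCE_US = 200  # assumed decode tolerance, not an NAI parameter


def decode_pulse(rise_us, fall_us, next_rise_us):
    """Return (state, period_us) for one pulse from three edge timestamps.

    state is None when the high time matches no known encoding.
    """
    high_us = fall_us - rise_us
    low_us = next_rise_us - fall_us
    period_us = high_us + low_us
    for nominal_us, state in STATE_BY_HIGH_US.items():
        if abs(high_us - nominal_us) <= TOLERANCE_US:
            return state, period_us
    return None, period_us
```

For instance, edges at 0, 1000, and 10000 us decode as a 10 ms pulse carrying the "not_master" state.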
The FPGA auto-reloads the next pulse based on current (at the time of the pulse start) high/low register values. There is no need for the PAL to do anything unless an adjustment is required. Since the FPGA samples the registers at the pulse start, the PAL can adjust the "next" pulse durations at any time during the pulse without needing to worry about race conditions (see the WAT vs pulse offset discussion below).
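The register behavior described above can be modeled in a few lines. This is only an illustrative software model with hypothetical names, not the FPGA's register map: writes land in the registers at any time, and the values are latched only when a pulse starts.

```python
class PulseGeneratorModel:
    """Toy model: high/low registers are sampled at each pulse start."""

    def __init__(self, high_us, low_us):
        self.high_reg = high_us
        self.low_reg = low_us
        self.active = (high_us, low_us)  # values the current pulse uses

    def write(self, high_us, low_us):
        # May be called at any point during a pulse; affects only the
        # next pulse, so there is no race with the pulse in progress.
        self.high_reg = high_us
        self.low_reg = low_us

    def start_next_pulse(self):
        # Auto-reload: latch whatever the registers hold right now.
        self.active = (self.high_reg, self.low_reg)
        return self.active
```

If the software never calls `write`, every pulse repeats the same high/low times, which matches the "no need for the PAL to do anything" case.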
Sync is inherently modal, and I'm going to assume for the purpose of this discussion that there is one chosen "master" and that all other channels slave to the master. This is most definitely an area where KISS is important. When a channel starts (e.g., from power on, or from a health-monitor-initiated reset), it doesn't know the state of the other channels, e.g., is there a master already? By encoding the local channel's mode in the signal it transmits to the other channels, the local channel can help all the channels select the master. When a channel starts, let's assume it starts in state "I don't think I'm the master". The channel can immediately start transmitting a pulse containing its state (e.g., 1 ms high + 9 ms low), and the other channels can then make group decisions relatively easily. If there is already a master, the master ignores the state of the others and the newly started channel syncs to it. If there is no master then (as an arbitrary example) the group assumes the lowest-numbered active channel will become master, and all will sync to that. Once a channel believes it is (or should be) the master, it encodes that state in its pulse signal (e.g., 2 ms high + 8 ms low). Note there are race conditions that need to be resolved, so it's not quite that easy, but nothing a little more waiting can't resolve.
A channel that is starting can wait until it has identified a master and then (if it is not the master) set its pulse time to approximate the master's and then start up normally. Note that in this scenario the master might not be consistently selected. If that is important, more stateful logic is necessary.
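The selection rule in the example above can be sketched as follows. The function name and state strings are hypothetical, the tie-break between multiple claimants is an added assumption, and the sketch deliberately does not address the race conditions (for example, it does not wait for the observed states to be stable over several pulses).

```python
def select_master(states):
    """Pick the master channel from decoded per-channel states.

    states maps channel number -> "master" or "not_master" for every
    active channel, including the local one.
    """
    if not states:
        raise ValueError("at least the local channel must be present")
    claimants = [ch for ch, state in states.items() if state == "master"]
    if claimants:
        # An existing master wins; lowest number breaks ties (assumption).
        return min(claimants)
    # No master yet: the lowest-numbered active channel becomes master.
    return min(states)
```

A channel would compare the result with its own channel number to decide which state to encode in its next pulse.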
To simplify the discussion I'm going to assume that Deos' "sense of time" (i.e., the WAT start) is driven by a clock and not by the FPGA's pulse train, that there is no backup clock source, and that all channels have consistent WAT durations.
Note that in this configuration there is another synchronization required, namely between the timer driving the WAT and the FPGA's generated pulse. I was hoping that a common timer (e.g., the GIC) could be used by the FPGA, but that was not possible. This is where the ability to synchronously read the FPGA's timestamp is likely to be needed. I do not know what requirements there are for the synchronization of the channel WAT starts, so I'll leave this topic for now and focus on FPGA pulse synchronization.
During runtime, if the pulse start is offset from the OS's WAT start by a time larger than the anticipated worst-case channel skew (e.g., 1 ms), then when the PAL is activated by the WAT timer, the timestamps for the most recent pulse from every channel will be available. The offset between the local channel's pulse start and the master's is then a simple subtraction of timestamps. If there is an offset, the duration of the pulse (and the WAT clock) can be adjusted (shortened or lengthened) to make the current channel's pulse start migrate toward matching the master's. Normal ramping and stability considerations apply when selecting the adjustments. The fine-grained FPGA clock frequency helps.
Nominally, the PAL would provide an interface to adjust the WAT timer duration, and an application or PAL extension would be responsible for computing the adjustment, calling the PAL interface with the adjustment, and updating the FPGA pulse duration. There is substantial variability in this from platform to platform, so the exact allocation of responsibilities will need to be worked out.
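A minimal sketch of the offset computation and a ramped correction follows. The function names, the gain, and the per-pulse step limit are hypothetical tuning choices, and timestamps are assumed to be in microseconds from a common notion of time.

```python
def pulse_offset_us(local_rise_us, master_rise_us, period_us):
    """Signed offset of the local pulse start relative to the master's.

    The result is wrapped into [-period/2, period/2); positive means
    the local pulse started later than the master's.
    """
    half_us = period_us // 2
    return (local_rise_us - master_rise_us + half_us) % period_us - half_us


def next_period_us(nominal_us, offset_us, gain=0.25, max_step_us=50):
    """Period for the next local pulse, nudged toward the master.

    A late local pulse (positive offset) gets a shorter next period.
    Only a fraction of the offset is corrected per pulse, and the step
    is limited, so the correction ramps in rather than jumping.
    """
    step_us = round(-offset_us * gain)
    step_us = max(-max_step_us, min(max_step_us, step_us))
    return nominal_us + step_us
```

With a 10 ms nominal period and a local pulse 100 us late, the next period would be 9975 us under these assumed settings.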
Note that since Deos is a real-time OS, the WAT duration can't be adjusted arbitrarily; otherwise Deos couldn't provide time partitioning. In Deos there is an overhead that can be specified which permits the WAT duration to be shortened or lengthened by a small amount. Any shortening has to be within that amount. Sometimes lengthening can be less constrained. The selection of the amount affects total system processor availability and is a platform integrator responsibility.
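The constraint can be illustrated with a small helper that limits a requested WAT adjustment to an asymmetric budget and reports what is left over for later windows. The name and parameters are hypothetical and do not correspond to a Deos or PAL interface; the limits stand in for whatever the platform integrator specifies.

```python
def apply_wat_budget(requested_us, max_shorten_us, max_lengthen_us):
    """Limit a WAT duration adjustment to the allowed budget.

    requested_us is negative to shorten and positive to lengthen.
    Returns (applied_us, residual_us); the residual is the part of the
    request that still has to be worked off in later windows.
    """
    applied_us = max(-max_shorten_us, min(max_lengthen_us, requested_us))
    return applied_us, requested_us - applied_us
```

For example, a request to shorten by 80 us against a 50 us shortening limit applies 50 us now and carries 30 us forward.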
The above completely ignores detecting, diagnosing, and responding to the various faults that can occur, most notably what to do about a misbehaving master. In situations where the multi-channel capability is providing increased availability (rather than integrity), such conditions can complicate the logic considerably. I am by no means an expert in such systems, but I believe the signalling capability embodied by the selectable high/low pulse width becomes even more important in those situations. It is also why I suggested having three inputs, so that if a system safety analysis indicated a need for a wrap-around monitor, the FPGA would already support it. The solution proposed at the Dec 19th interchange meeting seems better, since it simplifies the interconnects.