Thursday, May 25, 2006

Bandwidth: the Final Frontier

Bharath said I should write about why I think FPGAs are being seen in more new networking devices. Not having much exposure to this use of FPGAs, I will instead attempt to address another area where these amazing devices are seeing ever more use: high speed DSP.

Traditional DSP manufacturers (like ADI and TI) have long since realized the limitations of the usual von Neumann architecture that's so common in commodity computer systems, the most important one (from a DSP point of view) being memory bandwidth. The single bus over which both instructions and data must be fetched is the stumbling block when it comes to systems designed to crunch vast amounts of data down to manageable chunks. One way this is attacked is with small blocks of high-speed cache memory; however, this approach does not help DSP applications where the data must be fetched from a high-speed sensor and constantly floods the cache, negating its benefits. These factors resulted in the familiar DSP chips of the 90s, the ADI SHARC and the TI C6xx series. The ADI design, in particular, is interesting in that it relies on all the main memory being a high-speed type residing on-die; typical SHARC designs did not use external RAM. This guarantees single-cycle execution and the benefits of the Harvard architecture (simultaneous data/program access), without the associated explosion in pin count (back in the days when BGAs were as exotic as leprechauns). However, once the data is inside one of these chips, options are limited by the number of ALUs and multiplier units, usually to not more than 5-10 operations per clock cycle.
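To make the bottleneck concrete, here is a minimal sketch (plain C, with purely illustrative names) of the multiply-accumulate loop at the heart of an FIR filter. Every output sample costs two data loads per multiply, plus the instruction fetches, all contending for the same bus on a von Neumann machine; the arithmetic itself is the cheap part.

/* Minimal FIR multiply-accumulate sketch; NTAPS and fir_sample are
   illustrative names, not from any real codebase. */
#include <stddef.h>

#define NTAPS 64

float fir_sample(const float coeff[NTAPS], const float delay[NTAPS])
{
    float acc = 0.0f;
    for (size_t i = 0; i < NTAPS; i++)
        acc += coeff[i] * delay[i];   /* two loads per multiply-add */
    return acc;
}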

Now that we're used to desktop CPU speeds in the several-GHz range, a 40 MHz chip seems quite puny. However, it's still enough to perform quite complex DSP tasks like FFTs, and this is indeed what DSPs spent most of their lives doing. One thing they could not do well, though, was very simple processing (e.g. an FIR filter) at extremely high sample rates (e.g. 50-100 MHz). In the few cases where it could be done, the memory bandwidth would be strained to near breaking point, and the DSP's cost could not justify its application to such a 'trivial' task. The gap got filled by ASSPs like the famous Graychip GC4016, a digital downconverter. These take the vast amounts of data spewing from ADCs and crunch them down for DSPs. The price you pay, though, is flexibility. Replace the ASSP with an FPGA and you now have a device that can handle both the reduction of vast quantities of raw data and the complex algorithms required to further process the reduced data. What's more, you are now free to reprogram the data reduction filters as your design changes (not always a good thing, though). These design changes can include simply slapping an extra bank of memory onto some unused I/Os if you find you don't have enough memory bandwidth. This is simply not an option with most traditional CPUs and DSPs, which are limited to one (or at most two) memory buses. Exceptions exist, as we'll soon see.
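For the curious, here's a rough sketch (plain C, names and parameters purely illustrative) of what a downconverter conceptually does: mix the incoming ADC samples against a local oscillator, then low-pass and decimate, turning a torrent of real samples into a trickle of complex baseband ones. Real parts like the GC4016 use an NCO plus CIC and FIR decimation stages rather than this naive per-sample sin/cos, but the data-reduction idea is the same.

/* Conceptual digital downconverter sketch: mix, crude low-pass, decimate. */
#include <math.h>
#include <stddef.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void ddc_block(const short *adc, size_t n, double f_ratio, int decim,
               double *i_out, double *q_out)
{
    size_t o = 0;
    for (size_t k = 0; k + (size_t)decim <= n; k += (size_t)decim) {
        double i_acc = 0.0, q_acc = 0.0;
        for (int j = 0; j < decim; j++) {
            double ph = 2.0 * M_PI * f_ratio * (double)(k + (size_t)j);
            i_acc += adc[k + j] * cos(ph);   /* mix down to baseband */
            q_acc -= adc[k + j] * sin(ph);
        }
        i_out[o] = i_acc / decim;            /* crude boxcar low-pass,   */
        q_out[o] = q_acc / decim;            /* then decimate by 'decim' */
        o++;
    }
}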

Desktop chips have attacked the DSP problem in several ways. The most obvious is the inclusion of vector instructions like MMX, SSE and AltiVec, which address the ALU/multiplier restriction. Cache-control instructions can be used to set up "fences" in memory; the fenced-off portions are not cached. The idea is that you can set up your sensor to DMA data into a fenced region of memory. The CPU will then not bother trying to cache it, since it will only be accessed once anyway, and this prevents flushing of more important program variables. Even so, in very high-speed applications a CPU will spend most of its time reading data in and out, not performing calculations. ASICs and FPGAs, however, are designed to stream data through, avoiding this problem. Simple example: a radar receiver FPGA I designed runs at a rather tame 80 MHz, but performs 1920 million multiplies/sec since it runs 24 multipliers in parallel. Few CPUs would be capable of sustaining this performance (peak performance doesn't cut it) or getting the memory bandwidth required to do it. Even if they could, it would be a tremendous waste of a desktop CPU to have it do something as simple as digital downconversion.
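As a hedged illustration of both tricks, here is a minimal sketch assuming an x86 CPU with SSE (the function names are my own): four multiply-adds per instruction to attack the ALU restriction, and a non-temporal store so once-only output data streams past the cache instead of evicting useful lines.

/* SSE sketch: vectorized MAC plus cache-bypassing store. */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stddef.h>

float dot4(const float *coeff, const float *delay, size_t ntaps)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i + 4 <= ntaps; i += 4) {
        __m128 c = _mm_loadu_ps(&coeff[i]);
        __m128 d = _mm_loadu_ps(&delay[i]);
        acc = _mm_add_ps(acc, _mm_mul_ps(c, d));   /* 4 MACs per iteration */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* dst must be 16-byte aligned; _mm_stream_ps writes past the cache. */
void write_stream(float *dst, const __m128 *src, size_t nvec)
{
    for (size_t i = 0; i < nvec; i++)
        _mm_stream_ps(&dst[4 * i], src[i]);
}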

One promising technique desktop CPUs have started to adopt is on-die memory controllers. With the memory controller on the die, memory bandwidth scales with the number of CPUs in the system, giving the system designer an additional degree of freedom when more memory bandwidth is needed.

How does all this tie in with networking applications? Well, like I said, I'm not the best person to comment, but I can see some parallels in the requirement for vast bandwidth in and out of the CPU in line-speed applications. FPGA costs have been getting ever smaller (thanks to our good friend, Gordon Moore :-), and the inclusion of hard IP like Xilinx's Rocket-I/O is going to make FPGAs a more viable alternative to network processors (NPs), especially for multi-gigabit applications and switch fabrics.
