Small Logo

SummaryArchitectureDynamic RangePerformanceScalingPrecisionScalable ASICLTEReferences

 

Fast Fourier Transform

Summary

Past fast Fourier transform (FFT) circuit architectures have remained relatively unchanged, being based essentially on different permutations of the signal flow graph and mappings thereof.  Consequently, the inherent irregularities of the signal flow graph are reflected in the complex commutators or permutation circuits, large butterfly units, global interconnections, and stage-to-stage differences that result in an inherent inflexibility and lack of performance.

A radically different approach to parallel FFT implementation is described here based on a new matrix formulation of the discreet Fourier transform (DFT) which decomposes it into structured sets of multiplication-free 4-point (“base-4”) DFTs.  As a result, (1) implementations are simple, locally connected and structured, thereby allowing lower power, higher clock speeds and better mappings to modern FPGAs and ASICs; (2) fewer total clock cycles are needed per transform;  (3) significant added functionality and flexibility accrues from the inherent matrix-based structure; and (4) good arithmetic efficiency is retained by using a form of strength reduction.  In particular:

  • FFT lengths are nominally multiples of 16, 64 or 256.  Traditional FFTs are limited to powers of two, which severely reduces the number of reachable transform values and results in a very non-uniform distribution.  This is important for future wireless protocols such as 3GPP LTE (1536-points) and IEEE 802.22 (6144-points).
  • Small DFTs can be performed as well as FFTs, a useful feature where both are required such as the 3GPP LTE uplink SC-FDMA standard.
  • The circuit is programmable, a major difference when compared to traditional high performance parallel FFT designs.
  • For “sparse” FFTs, where there are few (perhaps randomly) placed transform inputs, the number of clock cycles per DFT can be reduced in proportion to the number of data points in real time (dynamically).  For example, in the limiting case of just one transform input to a 1024-point FFT, the number of clock cycles per DFT can be reduced ~70%.  This feature can be put to good use in wireless handsets to reduce power dissipation and increase throughputs without compromising the ability to perform complete high performance FFT computations when necessary.
    This same feature works in an inverse manner when only a few (perhaps random) coefficient outputs are desired from a fully populated transform input.  For example, a wireless handset often needs to compute coefficients associated with a small set of subcarriers, resulting in considerable savings in computation time and power dissipation.
  • The structural regularity of the base-4 matrix representation provides options for easily partitioning the DFT computation.  The benefits are that partitioning
    • Can be applied to large transform sizes so that an FFT can run on circuit arrays that are smaller than the nominal size.  In this way any base-4 array can perform any allowed transform size given enough memory resources.  This is useful in the context of recent 3 ½ and 4G wireless protocols such as WiMax 802.16m, 3GPP LTE and UMB that require “scalable” transform sizes from 128-points to 2048-points.
    • Is useful for matching circuit throughput to desired system throughput.  This is effected by adding/subtracting identical blocks of array hardware.
  • The base-4 FFT array is capable of achieving very high DFT performance levels because it can run at very high clock speeds and the number of clock cycles per DFT is less than N, the transform size, by breaking up the computation in a way that it can be run on an architecture consisting of a large number of small processing elements (PEs) which typically only contain a few registers, an adder, some multiplexers, and some memory (“systolic” architecture).  The high degree of parallelism then leads to very fast computing times. For example, a block floating point 16-bit 1024-point streaming I/O DFT can be computed in ~1µsec using (65nm FPGA technology), a speed superior to any of the latest generation of pipelined streaming FFT architectures.
  • Dynamic ranges and signal-to-quantization-noise ratios are higher than other fixed point or block floating point FFT designs for a given word size because much of the computation is done using a combined block floating point and floating point architecture.   For example a 1024-point 16-bit FFT has a signal-to-quantization-noise ratio of 87db.
  • The base-4 array is built from synchronous, fine-grained, locally connected, regular arrays.  Therefore
    • They are inherently scalable.  Bigger transforms can be implemented by increasing the array size along one dimension, yet performance isn’t affected.
    • The entire structure is relatively easy to design, test and maintain.  There are only three PE types, two of which only contain nominally two registers, an adder, and memory.  Consequently, an ASIC implementation would be very straightforward.
    • Interconnects are entirely local, reducing parasitic routing capacitances to keep power dissipation low.    
    • Many smaller memories are used rather than a few larger ones.  This increases speed and reduces power dissipation.
    • The inherently linear structure of the base-4 architecture is an excellent architectural match for the recent generation of high density FPGAs like the Vertex series from Xilinx and the Stratix series from Altera.  Both of these chip types feature architectures also characterized by linear arrays of hardwired elements such as memories and multipliers.
  • The computational latency of the base-4 design is much less than in modern pipelined FFTs.  For non-streaming designs the latency can be equal to the inverse throughput rate.  In other words the time it takes to compute the first DFT in a series of DFT computations is about equal to the time it takes to compute each successive DFT. This is significant because most fast pipelined FFT engines have "startup" times of ~N cycles, where N is the transform size, making the first DFT computation twice as long.
  • The same circuit architecture can perform 2-D DFTs as well as 1-D DFTs.