Fast Fourier Transform
Summary
Past fast Fourier transform (FFT) circuit architectures have remained relatively unchanged, being based essentially on different permutations of the signal flow graph and mappings thereof. Consequently, the inherent irregularities of the signal flow graph are reflected in the complex commutators or permutation circuits, large butterfly units, global interconnections, and stage-to-stage differences that result in an inherent inflexibility and lack of performance.
A radically different approach to parallel FFT implementation is described here based on a new matrix formulation of the discreet Fourier transform (DFT) which decomposes it into structured sets of multiplication-free 4-point (“base-4”) DFTs. As a result, (1) implementations are simple, locally connected and structured, thereby allowing lower power, higher clock speeds and better mappings to modern FPGAs and ASICs; (2) fewer total clock cycles are needed per transform; (3) significant added functionality and flexibility accrues from the inherent matrix-based structure; and (4) good arithmetic efficiency is retained by using a form of strength reduction. In particular:
- FFT lengths are nominally multiples of 16, 64 or 256. Traditional FFTs are limited to powers of two, which severely reduces the number of reachable transform values and results in a very non-uniform distribution. This is important for future wireless protocols such as 3GPP LTE (1536-points) and IEEE 802.22 (6144-points).
- Small DFTs can be performed as well as FFTs, a useful feature where both are required such as the 3GPP LTE uplink SC-FDMA standard.
- The circuit is programmable, a major difference when
compared to traditional high performance parallel FFT designs.
- For “sparse” FFTs, where there are few (perhaps randomly)
placed transform inputs, the number of clock cycles per DFT can
be reduced in proportion to the number of data points in real
time (dynamically). For example, in the limiting case of
just one transform input to a 1024-point FFT, the number of
clock cycles per DFT can be reduced ~70%. This feature can
be put to good use in wireless handsets to reduce power
dissipation and increase throughputs without compromising the
ability to perform complete high performance FFT computations
when necessary.
This same feature works in an inverse manner when only a few (perhaps random) coefficient outputs are desired from a fully populated transform input. For example, a wireless handset often needs to compute coefficients associated with a small set of subcarriers, resulting in considerable savings in computation time and power dissipation.
- The structural regularity of the base-4 matrix
representation provides options for easily partitioning the DFT
computation. The benefits are that partitioning
- Can be applied to large transform sizes so that an FFT
can run on circuit arrays that are smaller than the nominal
size. In this way any base-4 array can perform any
allowed transform size given enough memory resources. This
is useful in the context of recent 3 ½ and 4G wireless
protocols such as WiMax 802.16m, 3GPP LTE and UMB that
require “scalable” transform sizes from 128-points to
2048-points.
- Is useful for matching circuit throughput to desired
system throughput. This is effected by
adding/subtracting identical blocks of array hardware.
- Can be applied to large transform sizes so that an FFT
can run on circuit arrays that are smaller than the nominal
size. In this way any base-4 array can perform any
allowed transform size given enough memory resources. This
is useful in the context of recent 3 ½ and 4G wireless
protocols such as WiMax 802.16m, 3GPP LTE and UMB that
require “scalable” transform sizes from 128-points to
2048-points.
- The base-4 FFT array is capable of achieving very high DFT
performance levels because it can run at very high clock speeds
and the number of clock cycles per DFT is less than N,
the transform size, by breaking up the computation in a way that
it can be run on an architecture consisting of a large number of
small processing elements (PEs) which typically only contain a
few registers, an adder, some multiplexers, and some memory
(“systolic” architecture). The high degree of parallelism
then leads to very fast computing times. For example, a block
floating point 16-bit 1024-point streaming I/O DFT can be
computed in ~1µsec using (65nm FPGA technology), a speed
superior to any of the latest generation of pipelined streaming
FFT architectures.
- Dynamic ranges and signal-to-quantization-noise ratios are
higher than other fixed point or block floating point FFT
designs for a given word size because much of the computation is
done using a combined block floating point and floating point
architecture. For example a 1024-point 16-bit FFT has a
signal-to-quantization-noise ratio of 87db.
- The base-4 array is built from synchronous, fine-grained,
locally connected, regular arrays. Therefore
- They are inherently scalable. Bigger transforms can be implemented by increasing the array size along one dimension, yet performance isn’t affected.
- The entire structure is relatively easy to design, test and maintain. There are only three PE types, two of which only contain nominally two registers, an adder, and memory. Consequently, an ASIC implementation would be very straightforward.
- Interconnects are entirely local, reducing parasitic routing capacitances to keep power dissipation low.
- Many smaller memories are used rather than a few larger ones. This increases speed and reduces power dissipation.
- The inherently linear structure of the base-4
architecture is an excellent architectural match for the
recent generation of high density FPGAs like the Vertex
series from Xilinx and the Stratix series from Altera. Both
of these chip types feature architectures also characterized
by linear arrays of hardwired elements such as memories and
multipliers.
-
The computational latency of the base-4 design is
much less than in modern pipelined FFTs.
For non-streaming designs the latency can be equal to the
inverse throughput rate.
In other words the time it takes to compute the first DFT in a
series of DFT computations is about equal to the time it takes
to compute each successive DFT. This is significant because most
fast pipelined FFT engines have "startup" times of ~N
cycles, where N is the transform size, making the first DFT computation twice as long.
- The same circuit architecture can perform 2-D DFTs as well
as 1-D DFTs.
