Fast Fourier Transform
Dynamic FFT (“run-time” transform size choice) and ASIC Characterization
Different base-4 array sizes can be used to provide different levels of performance for a single transform size. Similarly, the same array size can be used to perform different transform sizes, since the processing is the same for all transform sizes. This important functionality is used to support Orthogonal Frequency Division Multiple Access (OFDMA) in the present WiMax and the future 3GPP LTE and 3GPP2 UMB wireless protocols. For this reason, a dynamic FFT spanning transform sizes from 128 to 2048 was evaluated, because this range meets the latest WiMax 802.16e protocol standard. Using an array size identical to that of a fixed-size 1024-point FFT design with Nr=Nc=32 yields the circuit performance and resource estimates shown below in Table 1.
| Circuit | ROM (Kbits) | RAM (Kbits) | Multipliers (18-bits) | ALMs | ASIC gates (Kgates) | fmax (MHz) |
|---|---|---|---|---|---|---|
| Altera (16-bit in, 30-bits out) | 11 | 197 | 40 | 5008 | 499 | 285 |
| Centar (16-bit in, 16-bit out) | 77.3 | 238 | 32 | 8000 | 620 | 387 |
Table 1. Resource and performance comparison of Altera variable FFT and base-4 equivalent with normal data order for input and output. An ALM is an "arithmetic logic module" with 4 LUTs and 4 registers.
Table 1 compares the base-4 design with Altera’s “variable” FFT (Megacore v7.2), whose architecture is completely different from that of Altera’s fixed-size FFT circuits. Altera’s variable FFT is based on a radix-2² single delay feedback architecture and does not provide a block floating point (BFP) capability or scaling option, so the 16-bit input becomes 30 bits at the output. Results are based on analysis of Quartus II v 7.1 compilations to the same FPGA.
Table 1 provides a breakdown of the resources used and the maximum frequencies, as well as an estimated equivalent ASIC gate count, since such a circuit is best suited for wireless applications. The gate count is based on the estimates shown in Table 2.
| Estimate | Value |
|---|---|
| gates/ALM | 20 |
| gates/multiplier | 2500 |
| gates/ROM | 0.3 |
| gates/RAM | 1.5 |
Table 2. ASIC gate estimates for FPGA elements. An ALM contains 2 LUTs at 4 gates/LUT and 2 registers at 6 gates/register.
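The ASIC gate column of Table 1 can be reproduced from the Table 2 factors with a short calculation. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that the ROM and RAM factors are gates per bit and that 1 Kbit is 1000 bits, since those assumptions match the published totals.

```python
# Per-element ASIC gate estimates taken from Table 2.
# Assumption: the ROM and RAM factors are gates per bit of memory.
GATES_PER_ALM = 20
GATES_PER_MULTIPLIER = 2500
GATES_PER_ROM_BIT = 0.3
GATES_PER_RAM_BIT = 1.5

# Assumption: 1 Kbit = 1000 bits (this reproduces the Table 1 totals).
BITS_PER_KBIT = 1000


def asic_kgates(rom_kbits, ram_kbits, multipliers, alms):
    """Return the estimated equivalent ASIC gate count in Kgates."""
    gates = (
        alms * GATES_PER_ALM
        + multipliers * GATES_PER_MULTIPLIER
        + rom_kbits * BITS_PER_KBIT * GATES_PER_ROM_BIT
        + ram_kbits * BITS_PER_KBIT * GATES_PER_RAM_BIT
    )
    return gates / 1000


# Resource figures from Table 1.
altera_kgates = asic_kgates(rom_kbits=11, ram_kbits=197, multipliers=40, alms=5008)
centar_kgates = asic_kgates(rom_kbits=77.3, ram_kbits=238, multipliers=32, alms=8000)

print(f"Altera: {altera_kgates:.0f} Kgates")
print(f"Centar: {centar_kgates:.0f} Kgates")
```

Under these assumptions the memory terms dominate both totals, which is why the larger RAM of the base-4 design accounts for most of its higher gate count.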
The base-4 design is more efficient because it uses the resources of the entire array for each FFT size and nominally uses fewer clock cycles per DFT. To gauge the overall performance of the base-4 dynamic FFT, a figure of merit is calculated for each transform size, equal to the product of the total ASIC gate count and cycles/DFT divided by the clock frequency, so that a lower value is better. Table 3 shows that this figure of merit is an average of 72% better for the base-4 design across all FFT sizes. The Altera circuit's 30-bit output will also lead to greater memory requirements in downstream circuit functions.
| FOM by transform size | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Altera 16-bit | 0.224 | 0.448 | 0.896 | 1.793 | 3.586 |
| Centar 16-bit | 0.144 | 0.205 | 0.455 | 1.071 | 2.904 |
| Improvement (%) | 55.4 | 118.5 | 97.0 | 67.5 | 23.5 |
Table 3. Figure of merit (FOM) comparisons based on 16-bit (input) variable length FFTs
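The figure of merit and the improvement percentages can be checked numerically. The sketch below uses hypothetical function names and two assumptions that are not stated in the text: that the improvement is the ratio of the two figures of merit minus one, and that the Altera circuit takes one clock cycle per point (an inference from the fact that this reproduces the Altera row). Because the inputs are the rounded table values, the recomputed percentages differ slightly from the published ones.

```python
# Figures of merit from Table 3 (lower is better).
sizes = [128, 256, 512, 1024, 2048]
altera_fom = [0.224, 0.448, 0.896, 1.793, 3.586]
centar_fom = [0.144, 0.205, 0.455, 1.071, 2.904]


def figure_of_merit(kgates, cycles_per_dft, fmax_mhz):
    """Gate count times cycles per DFT, divided by clock frequency."""
    return (kgates * 1e3) * cycles_per_dft / (fmax_mhz * 1e6)


def improvement_percent(reference, candidate):
    """Assumed definition: how much larger the reference FOM is, in percent."""
    return (reference / candidate - 1) * 100


improvements = [improvement_percent(a, c) for a, c in zip(altera_fom, centar_fom)]
average_improvement = sum(improvements) / len(improvements)

for n, pct in zip(sizes, improvements):
    print(f"N={n}: {pct:.1f}%")
print(f"average: {average_improvement:.1f}%")

# Assumption: one clock cycle per point for the Altera circuit, using the
# gate count and fmax from Table 1.
altera_fom_128 = figure_of_merit(kgates=499, cycles_per_dft=128, fmax_mhz=285)
print(f"Altera FOM at N=128: {altera_fom_128:.3f}")
```

The average of the recomputed percentages lands close to the 72% quoted in the text.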
The power dissipation of this version of the dynamic base-4 FFT can be estimated by comparing a complex multiplier built using FPGA ALMs with a logically equivalent hardwired multiplier embedded in Altera’s DSP block. It was found that an ALM-based multiplier ran at a maximum frequency of 208 MHz, compared with the 550 MHz capability of the hardwired version (65 nm Stratix III technology), and used 9.3 times more power. Consequently, an ASIC version of the base-4 dynamic FFT should run at 550 MHz or more, leading to the performance results shown in Table 4.
| Transform Size | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Throughput (µsec) | 0.16 | 0.23 | 0.52 | 1.21 | 3.29 |
| Complex sample rates (MHz) | 782 | 1100 | 992 | 843 | 622 |
| Energy (pJ) | 173 | 246 | 546 | 1285 | 3485 |
Table 4. Estimated performance numbers for a dynamic FFT capable of doing any of these five transform sizes (65nm ASIC technology). Here throughput is the time to do one FFT in a series.
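The throughput and sample-rate rows of Table 4 are related by the transform size. As a rough consistency check, the sketch below assumes that the complex sample rate is the transform size divided by the time per FFT, and that the cycle count per FFT is that time multiplied by the 550 MHz clock; both relations and all names are assumptions for illustration. The published times are rounded to two decimals, so the results agree with the table only to within a few percent.

```python
# Assumed ASIC clock frequency, taken from the 550 MHz estimate in the text.
ASIC_CLOCK_MHZ = 550

# Time per FFT in microseconds and published sample rates (MHz) from Table 4.
fft_time_us = {128: 0.16, 256: 0.23, 512: 0.52, 1024: 1.21, 2048: 3.29}
published_rate_mhz = {128: 782, 256: 1100, 512: 992, 1024: 843, 2048: 622}


def sample_rate_mhz(transform_size, time_us):
    """Complex samples per microsecond, which is numerically MHz."""
    return transform_size / time_us


def cycles_per_fft(time_us, clock_mhz=ASIC_CLOCK_MHZ):
    """Approximate clock cycles spent on one FFT at the given clock."""
    return time_us * clock_mhz


computed_rate_mhz = {
    n: sample_rate_mhz(n, t) for n, t in fft_time_us.items()
}

for n, t in fft_time_us.items():
    print(
        f"N={n}: ~{computed_rate_mhz[n]:.0f} MHz "
        f"(table: {published_rate_mhz[n]}), "
        f"~{cycles_per_fft(t):.0f} cycles"
    )
```

Working from the cycle count rather than the rounded time would tighten the agreement, but the cycle counts are not given in the text.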
The performance levels shown in Table 4 are far higher than present 3½G wireless needs. For example, symbol durations for WiMax protocols are ~100 µsec. However, 4G bandwidths will be higher, and data throughputs will be multiplied by the number of MIMO channels, requiring much higher performance levels. In any case, the base-4 architecture is completely scalable, so performance and resources can be reduced in direct proportion to the hardware.
