Fast Fourier Transform
Performance
The best way to assess the capabilities of a new algorithm/architecture/circuit approach is benchmark it with respect to the “best” available standard. A good nominal basis for this comparison is a “streaming” (continuous data in and out) type architecture in two FFT sizes, 256 and 1024-points. After an investigation of different parallel FFT implementations, it was concluded that Altera’s streaming block floating point (BFP) FFTs (Megacore v7.2) were the best in terms of dynamic range, throughput and device utilization. For example, in 90nm technology Xilinx’s (Virtex-4) and Altera’s (Stratix II) best 16-bit 1024-point FFT times are 3.16 and 3.1 µsec, respectively, yet Altera’s BFP capability provides a higher dynamic range (Xilinx doesn’t offer a BFP streaming FFT).
To facilitate accurate comparisons Altera and base-4 FFT circuits were designed and targeted to the same hardware, specifically to the Altera Stratix III EP3S50F (speed grade -2) for both designs. Altera Quartus II tools (v7.1, Classic Timing Analyzer) were used to design, synthesize, compile and evaluate the two FFT circuits. The Quartus II compilation report includes information about resource usage and routing, the Quartus native timing analyzer finds the critical path that determines the maximum clock frequencies and the Quartus PowerPlay Analyzer estimates power dissipation. (The timing analyzer uses worst case estimates of timing parameters, so that it is likely actual implementations would run at higher speeds than listed in the table.)
Performance results for 256-point and 1024-point streaming base-4 FFT circuits are shown in Table 1 and compared to Altera BFP circuits that have a comparable signal-to-quantization-noise. The higher throughputs in the base-4 circuits are due to a reduced number of clock cycles per FFT and higher clock speeds associated with the simple implementation. For the Altera design the clock speed listed in Table 2 is also the complex sample rate. However, the base-4 design incorporates a special I/O clock and associated circuitry that runs at 389*256/240=415Mhz in the case of the 256-point circuit. This same approach doesn’t work for the 1024-point circuit because the I/O rates approach 600MHz, which is at the limits of Stratix III internal clock speeds and beyond its I/O speeds. Instead the circuit accepts and generates two complex data streams, one containing even data and the other odd data. In this way the I/O clock and associated circuitry only have to run at 387*1024/(689*2)=288MHz.
|
Category
|
Altera
(256) |
Base-4
(256) |
Altera
(1024) |
Base-4
(1024) |
|
SQN |
86.9 |
89 |
84.4 |
86.9 |
|
Throughput (cycles/DFT) |
256 |
240 |
1024 |
689 |
|
Clock speed (MHz) |
308 |
389 |
303 |
387 |
|
Throughput (µsec) |
0.83 |
0.62 |
3.38 |
1.78 |
|
ALMs |
4267 |
4395 |
5005 |
7795 |
|
Memory (Kbits) |
49.0 |
37.9 |
194.6 |
155.6 |
|
Multipliers |
24 |
16 |
24 |
32 |
|
Energy per DFT (nJ) |
1366 |
1372 |
6264 |
6624 |
|
Latency (cycles) |
793 |
687 |
3132 |
1923 |
|
Latency (µsec) |
2.57 |
1.77 |
10.33 |
4.69 |
Table 1.
Comparison of Altera BFP streaming FFTs to base-4
implementations for 256 and 1024-point transforms obtained using
Quartus II v7.1 tools.
A different comparison can be made by combining throughput, area (ALMs) and memory in one figure of merit (FOM) and is shown in Table 2. Here the base-4 FOM is >50% lower for both FFT sizes so it can be seen that the added performance is obtained with reduced resource usage.
|
Transform
Size |
Altera
FOM |
Base-4
FOM |
Improvement
(%) |
|
256 |
174 |
103 |
69.1 |
|
1024 |
3292 |
2159 |
52.5 |
Table 2.
Comparison of base-4 and Altera streaming FFT FOM = Area
(ALMs) x Throughput (cycles/DFT) x Memory (bits)/Clock speed (Hz).
Consequently, higher throughput can be obtained with reduced overall usage of resources and with comparable power dissipation.
More recent data obtained using the same underlying hardware, but with slightly newer compilation and synthesis tools (Quartus II v7.2) showed markedly increased performance levels as shown in Table 3.
|
Category
|
Base-4
(256) |
Base-4
(1024) |
|
Throughput (cycles/DFT) |
240 |
689 |
|
Clock speed (MHz) |
426 |
416 |
|
Throughput (µsec) |
0.56 |
1.66 |
|
Throughput (complex M samples/sec) |
457 |
617 |
|
ALMs |
4370 |
7850 |
|
Memory (Kbits) |
37.9 |
155.6 |
|
Multipliers |
16 |
32 |
|
Latency (cycles) |
687 |
1923 |
|
Latency (µsec) |
1.61 |
4.62 |
Table 3. Base-4 design FFT results after compilation using Quartus II v7.2 tools
