 Cooley–Tukey FFT algorithm

The Cooley–Tukey algorithm, named after J.W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It reexpresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N_{1}N_{2} in terms of smaller DFTs of sizes N_{1} and N_{2}, recursively, in order to reduce the computation time to O(N log N) for highlycomposite N (smooth numbers). Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.
Because the CooleyTukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT. For example, Rader's or Bluestein's algorithm can be used to handle large prime factors that cannot be decomposed by Cooley–Tukey, or the primefactor algorithm can be exploited for greater efficiency in separating out relatively prime factors.
See also the fast Fourier transform for information on other FFT algorithms, specializations for real and/or symmetric data, and accuracy in the face of finite floatingpoint precision.
Contents
History
This algorithm, including its recursive application, was invented around 1805 by Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized (being published only posthumously and in neoLatin).^{[1]}^{[2]} Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries.^{[2]} FFTs became popular after J. W. Cooley of IBM and John W. Tukey of Princeton published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer.^{[3]}
Tukey reportedly came up with the idea during a meeting of a US presidential advisory committee discussing ways to detect nuclearweapon tests in the Soviet Union.^{[4]}^{[5]} Another participant at that meeting, Richard Garwin of IBM, recognized the potential of the method and put Tukey in touch with Cooley, who implemented it for a different (and lessclassified) problem: analyzing 3d crystallographic data (see also: multidimensional FFTs). Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed.
The fact that Gauss had described the same algorithm (albeit without analyzing its asymptotic cost) was not realized until several years after Cooley and Tukey's 1965 paper.^{[2]} Their paper cited as inspiration only work by I. J. Good on what is now called the primefactor FFT algorithm (PFA);^{[3]} although Good's algorithm was initially mistakenly thought to be equivalent to the Cooley–Tukey algorithm, it was quickly realized that PFA is a quite different algorithm (only working for sizes that have relatively prime factors and relying on the Chinese Remainder Theorem, unlike the support for any composite size in Cooley–Tukey).^{[6]}
The radix2 DIT case
A radix2 decimationintime (DIT) FFT is the simplest and most common form of the Cooley–Tukey algorithm, although highly optimized Cooley–Tukey implementations typically use other forms of the algorithm as described below. Radix2 DIT divides a DFT of size N into two interleaved DFTs (hence the name "radix2") of size N/2 with each recursive stage.
The discrete Fourier transform (DFT) is defined by the formula:
where k is an integer ranging from 0 to N − 1.
Radix2 DIT first computes the DFTs of the evenindexed inputs () and of the oddindexed inputs (), and then combines those two results to produce the DFT of the whole sequence. This idea can then be performed recursively to reduce the overall runtime to O(N log N). This simplified form assumes that N is a power of two; since the number of sample points N can usually be chosen freely by the application, this is often not an important restriction.
The Radix2 DIT algorithm rearranges the DFT of the function x_{n} into two parts: a sum over the evennumbered indices n = 2m and a sum over the oddnumbered indices n = 2m + 1:
One can factor a common multiplier out of the second sum, as shown in the equation below. It is then clear that the two sums are the DFT of the evenindexed part x_{2m} and the DFT of oddindexed part x_{2m + 1} of the function x_{n}. Denote the DFT of the Evenindexed inputs x_{2m} by E_{k} and the DFT of the Oddindexed inputs x_{2m + 1} by O_{k} and we obtain:
However, these smaller DFTs have a length of N/2, so we need compute only N/2 outputs: thanks to the periodicity properties of the DFT, the outputs for from a DFT of length N/2 are identical to the outputs for . That is, E_{k + N / 2} = E_{k} and O_{k + N / 2} = O_{k}. The phase factor exp[ − 2πik / N] (called a twiddle factor) obeys the relation: exp[ − 2πi(k + N / 2) / N] = e ^{− πi}exp[ − 2πik / N] = − exp[ − 2πik / N], flipping the sign of the O_{k + N / 2} terms. Thus, the whole DFT can be calculated as follows:
This result, expressing the DFT of length N recursively in terms of two DFTs of size N/2, is the core of the radix2 DIT fast Fourier transform. The algorithm gains its speed by reusing the results of intermediate computations to compute multiple DFT outputs. Note that final outputs are obtained by a +/− combination of E_{k} and O_{k}exp( − 2πik / N), which is simply a size2 DFT (sometimes called a butterfly in this context); when this is generalized to larger radices below, the size2 DFT is replaced by a larger DFT (which itself can be evaluated with an FFT).
This process is an example of the general technique of divide and conquer algorithms; in many traditional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadthfirst fashion.
The above reexpression of a sizeN DFT as two sizeN/2 DFTs is sometimes called the Danielson–Lanczos lemma, since the identity was noted by those two authors in 1942^{[7]} (influenced by Runge's 1903 work^{[2]}). They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged (although they apparently didn't realize the linearithmic asymptotic complexity they had achieved). The Danielson–Lanczos work predated widespread availability of computers and required hand calculation (possibly with mechanical aids such as adding machines); they reported a computation time of 140 minutes for a size64 DFT operating on real inputs to 3–5 significant digits. Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size2048 complex DFT on an IBM 7094 (probably in 36bit single precision, ~8 digits).^{[3]} Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000. (To put the time for the hand calculation in perspective, 140 minutes for size 64 corresponds to an average of at most 16 seconds per floatingpoint operation, around 20% of which are multiplications.)
Pseudocode
In pseudocode, the above process could be written:^{[8]}
X_{0,...,N−1} ← ditfft2(x, N, s): DFT of (x_{0}, x_{s}, x_{2s}, ..., x_{(N1)s}): if N = 1 then X_{0} ← x_{0} trivial size1 DFT base case else X_{0,...,N/2−1} ← ditfft2(x, N/2, 2s) DFT of (x_{0}, x_{2s}, x_{4s}, ...) X_{N/2,...,N−1} ← ditfft2(x+s, N/2, 2s) DFT of (x_{s}, x_{s+2s}, x_{s+4s}, ...) for k = 0 to N/2−1 combine DFTs of two halves into full DFT: t ← X_{k} X_{k} ← t + exp(−2πi k/N) X_{k+N/2} X_{k+N/2} ← t − exp(−2πi k/N) X_{k+N/2} endfor endif
Here,
ditfft2
(x,N,1), computes X=DFT(x) outofplace by a radix2 DIT FFT, where N is an integer power of 2 and s=1 is the stride of the input x array. x+s denotes the array starting with x_{s}.(The results are in the correct order in X and no further bitreversal permutation is required; the oftenmentioned necessity of a separate bitreversal stage only arises for certain inplace algorithms, as described below.)
Highperformance FFT implementations make many modifications to the implementation of such an algorithm compared to this simple pseudocode. For example, one can use a larger base case than N=1 to amortize the overhead of recursion, the twiddle factors exp(−2πi k/N) can be precomputed, and larger radices are often used for cache reasons; these and other optimizations together can improve the performance by an order of magnitude or more.^{[8]} (In many textbook implementations the depthfirst recursion is eliminated entirely in favor of a nonrecursive breadthfirst approach, although depthfirst recursion has been argued to have better memory locality.^{[8]}^{[9]}) Several of these ideas are described in further detail below.
General factorizations
More generally, Cooley–Tukey algorithms recursively reexpress a DFT of a composite size N = N_{1}N_{2} as:^{[10]}
 Perform N_{1} DFTs of size N_{2}.
 Multiply by complex roots of unity called twiddle factors.
 Perform N_{2} DFTs of size N_{1}.
Typically, either N_{1} or N_{2} is a small factor (not necessarily prime), called the radix (which can differ between stages of the recursion). If N_{1} is the radix, it is called a decimation in time (DIT) algorithm, whereas if N_{2} is the radix, it is decimation in frequency (DIF, also called the SandeTukey algorithm). The version presented above was a radix2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/ combination (butterfly) of the even and odd transforms is a size2 DFT. (The radix's small DFT is sometimes known as a butterfly, socalled because of the shape of the dataflow diagram for the radix2 case.)
There are many other variations on the Cooley–Tukey algorithm. Mixedradix implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(N^{2}) algorithm for the prime base cases of the recursion [it is also possible to employ an N log N algorithm for the prime base cases, such as Rader's or Bluestein's algorithm]. Split radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve what was long the lowest known arithmetic operation count for poweroftwo sizes,^{[10]} although recent variations achieve an even lower count.^{[11]}^{[12]} (On presentday computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; welloptimized FFT implementations often employ larger radices and/or hardcoded basecase transforms of significant size.^{[13]}) Another way of looking at the Cooley–Tukey algorithm is that it reexpresses a size N onedimensional DFT as an N_{1} by N_{2} twodimensional DFT (plus twiddles), where the output matrix is transposed. The net result of all of these transpositions, for a radix2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices. If, instead of using a small radix, one employs a radix of roughly √N and explicit input/output matrix transpositions, it is called a fourstep algorithm (or sixstep, depending on the number of transpositions), initially proposed to improve memory locality,^{[14]}^{[15]} e.g. for cache optimization or outofcore operation, and was later shown to be an optimal cacheoblivious algorithm.^{[16]}
The general Cooley–Tukey factorization rewrites the indices k and n as k = N_{2}k_{1} + k_{2} and n = N_{1}n_{2} + n_{1}, respectively, where the indices k_{a} and n_{a} run from 0..N_{a}1 (for a of 1 or 2). That is, it reindexes the input (n) and output (k) as N_{1} by N_{2} twodimensional arrays in columnmajor and rowmajor order, respectively; the difference between these indexings is a transposition, as mentioned above. When this reindexing is substituted into the DFT formula for nk, the N_{1}n_{2}N_{2}k_{1} cross term vanishes (its exponential is unity), and the remaining terms give
where each inner sum is a DFT of size N_{2}, each outer sum is a DFT of size N_{1}, and the [...] bracketed term is the twiddle factor.
An arbitrary radix r (as well as mixed radices) can be employed, as was shown by both Cooley and Tukey^{[3]} as well as Gauss (who gave examples of radix3 and radix6 steps).^{[2]} Cooley and Tukey originally assumed that the radix butterfly required O(r^{2}) work and hence reckoned the complexity for a radix r to be O(r^{2} N/r log_{r}N) = O(N log_{2}(N) r/log_{2}r); from calculation of values of r/log_{2}r for integer values of r from 2 to 12 the optimal radix is found to be 3 (the closest integer to e, which minimizes r/log_{2}r).^{[3]}^{[17]} This analysis was erroneous, however: the radixbutterfly is also a DFT and can be performed via an FFT algorithm in O(r log r) operations, hence the radix r actually cancels in the complexity O(r log(r) N/r log_{r}N), and the optimal r is determined by more complicated considerations. In practice, quite large r (32 or 64) are important in order to effectively exploit e.g. the large number of processor registers on modern processors,^{[13]} and even an unbounded radix r=√N also achieves O(N log N) complexity and has theoretical and practical advantages for large N as mentioned above.^{[14]}^{[15]}^{[16]}
Data reordering, bit reversal, and inplace algorithms
Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an inplace algorithm that overwrites its input with its output data using only O(1) auxiliary storage.
The most wellknown reordering technique involves explicit bit reversal for inplace radix2 algorithms. Bit reversal is the permutation where the data at an index n, written in binary with digits b_{4}b_{3}b_{2}b_{1}b_{0} (e.g. 5 digits for N=32 inputs), is transferred to the index with reversed digits b_{0}b_{1}b_{2}b_{3}b_{4} . Consider the last stage of a radix2 DIT algorithm like the one presented above, where the output is written inplace over the input: when E_{k} and O_{k} are combined with a size2 DFT, those two values are overwritten by the outputs. However, the two output values should go in the first and second halves of the output array, corresponding to the most significant bit b_{4} (for N=32); whereas the two inputs E_{k} and O_{k} are interleaved in the even and odd elements, corresponding to the least significant bit b_{0}. Thus, in order to get the output in the correct place, these two bits must be swapped in the input. If you include all of the recursive stages of a radix2 DIT algorithm, all the bits must be swapped and thus one must preprocess the input with a bit reversal to get inorder output. Correspondingly, the reversed (dual) algorithm is radix2 DIF, and this takes inorder input and produces bitreversed output, requiring a bitreversal postprocessing step. Alternatively, some applications (such as convolution) work equally well on bitreversed data, so one can do radix2 DIF without bit reversal, followed by processing, followed by the radix2 DIT inverse DFT without bit reversal to produce final results in the natural order.
Many FFT users, however, prefer naturalorder outputs, and a separate, explicit bitreversal stage can have a nonnegligible impact on the computation time,^{[13]} even though bit reversal can be done in O(N) time and has been the subject of much research.^{[18]}^{[19]}^{[20]} Also, while the permutation is a bit reversal in the radix2 case, it is more generally an arbitrary (mixedbase) digit reversal for the mixedradix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to reorder intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.
The problem is greatly simplified if it is outofplace: the output array is distinct from the input array or, equivalently, an equalsize auxiliary array is available. The Stockham autosort algorithm^{[21]} performs every stage of the FFT outofplace, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on SIMD architectures.^{[22]} Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease algorithm,^{[23]} which also reorders outofplace with each stage, but this method requires separate bit/digit reversal and O(N log N) storage. One can also directly apply the Cooley–Tukey factorization definition with explicit (depthfirst) recursion and small radices, which produces naturalorder outofplace output with no separate permutation step (as in the pseudocode above) and can be argued to have cacheoblivious locality benefits on systems with hierarchical memory.^{[9]}^{[13]}^{[24]}
A typical strategy for inplace algorithms without auxiliary storage and without separate digitreversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data.^{[13]}^{[25]}^{[26]}^{[27]}^{[28]}
References
 ^ Gauss, Carl Friedrich, "Nachlass: Theoria interpolationis methodo nova tractata", Werke, Band 3, 265–327 (Königliche Gesellschaft der Wissenschaften, Göttingen, 1866)
 ^ ^{a} ^{b} ^{c} ^{d} ^{e} Heideman, M. T., D. H. Johnson, and C. S. Burrus, "Gauss and the history of the fast Fourier transform," IEEE ASSP Magazine, 1, (4), 14–21 (1984)
 ^ ^{a} ^{b} ^{c} ^{d} ^{e} Cooley, James W., and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965). doi:10.2307/2003354
 ^ Cooley, James W., Peter A. W. Lewis, and Peter D. Welch, "Historical notes on the fast Fourier transform," IEEE Trans. on Audio and Electroacoustics 15 (2), 76–79 (1967).
 ^ Rockmore, Daniel N. , Comput. Sci. Eng. 2 (1), 60 (2000). The FFT — an algorithm the whole family can use Special issue on "top ten algorithms of the century "[1]
 ^ James W. Cooley, Peter A. W. Lewis, and Peter W. Welch, "Historical notes on the fast Fourier transform," Proc. IEEE, vol. 55 (no. 10), p. 1675–1677 (1967).
 ^ Danielson, G. C., and C. Lanczos, "Some improvements in practical Fourier analysis and their application to Xray scattering from liquids," J. Franklin Inst. 233, 365–380 and 435–452 (1942).
 ^ ^{a} ^{b} ^{c} S. G. Johnson and M. Frigo, “Implementing FFTs in practice,” in Fast Fourier Transforms (C. S. Burrus, ed.), ch. 11, Rice University, Houston TX: Connexions, September 2008.
 ^ ^{a} ^{b} Singleton, Richard C. (1967). "On computing the fast Fourier transform". Commun. of the ACM 10 (10): 647–654. doi:10.1145/363717.363771.
 ^ ^{a} ^{b} Duhamel, P., and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing 19, 259–299 (1990)
 ^ Lundy, T., and J. Van Buskirk, "A new matrix approach to real FFTs and convolutions of length 2^{k}," Computing 80, 2345 (2007).
 ^ Johnson, S. G., and M. Frigo, "A modified splitradix FFT with fewer arithmetic operations," IEEE Trans. Signal Processing 55 (1), 111–119 (2007).
 ^ ^{a} ^{b} ^{c} ^{d} ^{e} Frigo, M.; Johnson, S. G. (2005). "The Design and Implementation of FFTW3". Proceedings of the IEEE 93 (2): 216–231. doi:10.1109/JPROC.2004.840301. http://fftw.org/fftwpaperieee.pdf.
 ^ ^{a} ^{b} Gentleman W. M., and G. Sande, "Fast Fourier transforms—for fun and profit," Proc. AFIPS 29, 563–578 (1966).
 ^ ^{a} ^{b} Bailey, David H., "FFTs in external or hierarchical memory," J. Supercomputing 4 (1), 23–35 (1990)
 ^ ^{a} ^{b} M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cacheoblivious algorithms. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (FOCS 99), p.285297. 1999. Extended abstract at IEEE, at Citeseer.
 ^ Cooley, J. W., P. Lewis and P. Welch, "The Fast Fourier Transform and its Applications", IEEE Trans on Education 12, 1, 2834 (1969)
 ^ Karp, Alan H. (1996). "Bit reversal on uniprocessors". SIAM Review 38 (1): 1–26. doi:10.1137/1038001. JSTOR 2132972.
 ^ Carter, Larry; Gatlin, Kang Su (1998). "Towards an optimal bitreversal permutation program". Proc. 39th Ann. Symp. on Found. of Comp. Sci. (FOCS): 544–553. doi:10.1109/SFCS.1998.743505.
 ^ Rubio, M.; Gómez, P.; Drouiche, K. (2002). "A new superfast bit reversal algorithm". Intl. J. Adaptive Control and Signal Processing 16 (10): 703–707. doi:10.1002/acs.718.
 ^ Stockham, T. G. (1966). "High speed convolution and correlation". Spring Joint Computer Conference, Proc. AFIPS 28: 229–233.
 ^ Swarztrauber, P. N. (1982). "Vectorizing the FFTs". In Rodrigue, G.. Parallel Computations. New York: Academic Press. pp. 51–83. ISBN 0125921012.
 ^ Pease, M. C. (1968). "An adaptation of the fast Fourier transform for parallel processing". J. ACM 15 (2): 252–264. doi:10.1145/321450.321457.
 ^ Frigo, Matteo; Johnson, Steven G.. "FFTW". http://www.fftw.org/. A free (GPL) C library for computing discrete Fourier transforms in one or more dimensions, of arbitrary size, using the Cooley–Tukey algorithm
 ^ Johnson, H. W.; Burrus, C. S. (1984). "An inplace inorder radix2 FFT". Proc. ICASSP: 28A.2.1–28A.2.4.
 ^ Temperton, C. (1991). "Selfsorting inplace fast Fourier transform". SIAM J. Sci. Stat. Comput. 12 (4): 808–823. doi:10.1137/0912043.
 ^ Qian, Z.; Lu, C.; An, M.; Tolimieri, R. (1994). "Selfsorting inplace FFT algorithm with minimum working space". IEEE Trans. ASSP 52 (10): 2835–2836. doi:10.1109/78.324749.
 ^ Hegland, M. (1994). "A selfsorting inplace fast Fourier transform algorithm suitable for vector and parallel processing". Numerische Mathematik 68 (4): 507–547. doi:10.1007/s002110050074.
External links
 a simple, pedagogical radix2 Cooley–Tukey FFT algorithm in C++.
 KISSFFT: a simple mixedradix Cooley–Tukey implementation in C (open source)
Categories: FFT algorithms
Wikimedia Foundation. 2010.
Look at other dictionaries:
CooleyTukey FFT algorithm — The Cooley Tukey algorithm, named after J.W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It re expresses the discrete Fourier transform (DFT) of an arbitrary composite size N = N 1 N 2 in terms of smaller… … Wikipedia
Bruun's FFT algorithm — Bruun s algorithm is a fast Fourier transform (FFT) algorithm based on an unusual recursive polynomial factorization approach, proposed for powers of two by G. Bruun in 1978 and generalized to arbitrary even composite sizes by H. Murakami in 1996 … Wikipedia
Splitradix FFT algorithm — The split radix FFT is a fast Fourier transform (FFT) algorithm for computing the discrete Fourier transform (DFT), and was first described in an obscure paper by R. Yavne (1968) and subsequently rediscovered simultaneously by various authors in… … Wikipedia
Rader's FFT algorithm — Rader s algorithm (1968) is a fast Fourier transform (FFT) algorithm that computes the discrete Fourier transform (DFT) of prime sizes by re expressing the DFT as a cyclic convolution. (The other algorithm for FFTs of prime sizes, Bluestein s… … Wikipedia
Primefactor FFT algorithm — The Prime factor algorithm (PFA), also called the Good Thomas algorithm (1958/1963), is a fast Fourier transform (FFT) algorithm that re expresses the discrete Fourier transform (DFT) of a size N = N 1 N 2 as a two dimensional N 1 times; N 2 DFT … Wikipedia
Bluestein's FFT algorithm — (1968), commonly called the chirp z transform algorithm (1969), is a fast Fourier transform (FFT) algorithm that computes the discrete Fourier transform (DFT) of arbitrary sizes (including prime sizes) by re expressing the DFT as a convolution.… … Wikipedia
Tukey — John Wilder Tukey (* 16. Juni 1915 in New Bedford, Massachusetts; † 26. Juli 2000 in New Brunswick (New Jersey)) war ein US amerikanischer Statistiker. Der Begriff Bit wurde von Tukey vermutlich 1946, nach anderen Quellen schon 1943 vorgeschlagen … Deutsch Wikipedia
John Tukey — Infobox Scientist name = John Tukey caption = John Wilder Tukey birth date = birth date1915616 birth place = New Bedford, Massachusetts, USA death date = death date and age20007261915616 death place = New Brunswick, New Jersey residence … Wikipedia
James Cooley — Dr. James Cooley (born 1926) is an American mathematician. James William Cooley received a B.A. degree in 1949 from Manhattan College, Bronx, NY, an M.A. degree in 1951 from Columbia University, New York, NY, and a Ph.D. degree in 1961 in applied … Wikipedia
Goertzel algorithm — The Goertzel algorithm is a digital signal processing (DSP) technique for identifying frequency components of a signal, published by Dr. Gerald Goertzel in 1958. While the general Fast Fourier transform (FFT) algorithm computes evenly across the… … Wikipedia