Guy D. Covert
Senior Systems Engineer
TRW LSI Products
La Jolla, CA 92038
The Discrete Fourier Transform (DFT) is used in a
wide variety of digital signal processing applications.
The algorithms used to implement this transform require
intensive arithmetic computation as well as complex
control and sequence functions. The designer of VLSI
components is faced with the problem of identifying
requirements and architectures for chips which directly
support the DFT. Design goals of these chips include
minimum chip count to implement an entire transform,
very high speed and low power dissipation. This paper
discusses a monolithic CMOS device that was fabricated
to perform 32 point Fast Fourier Transforms at very high
data rates. All data memory and arithmetic and control
circuitry is contained on this single low power chip.
The TMC 2032 is a monolithic, completely self
contained Fourier Transform processor which is capable
of computing both forward and inverse Discrete Fourier
Transforms (DFT) on 32 complex valued data samples.
The device has been fabricated using a TRW proprietary
2-micron bulk CMOS process technology that offers very
high circuit density and low power dissipation plus the
extremely high speed operation that has previously been
associated only with bipolar devices. Approximately
27,000 FET devices were used on a 236 x 248 die and the
device dissipates about 900 milliwatts from a single five
volt power supply. Total time required to perform a 32
point complex-to-complex DFT is 47.0 microseconds
when the maximum master clock frequency of 50 MHz is
used.

The algorithm implemented by the TMC 2032 to
compute the DFT is an in-place decimation-in-time (DIT)
FFT using radix-2 butterflies. Five passes with sixteen
butterflies per pass are then required for one complete 32
point transform. All input/output and arithmetic
operations are performed with a sixteen bit, fractional
two's complement fixed point numeric format that is
common to many existing high speed digital signal
processing systems. This format offers approximately
96 dB of overall dynamic range.

A block diagram of the FFT processor is shown in
Figure 1.

Figure 1. Block Diagram of the TMC 2032

As can be seen, the arithmetic unit of the
TMC 2032 consists of a 16 x 16 bit Multiplier Accumulator
(MAC) and a separate 17 bit carry-lookahead adder.
Together, these form a one-quarter butterfly circuit that,
under microprogram control, is sequenced through
twenty-four cycles of the master clock to complete one
full complex FFT butterfly operation every 480
nanoseconds. The MAC circuit uses a Booth coded
multiply scheme. One input is connected to a sine-cosine
ROM that provides the complex twiddle factors required
by both the forward and inverse transforms. The
factors are stored in Booth coded form so that they can
be used directly by the MAC. This resulted in a
significant saving of devices in the MAC circuitry at the
lesser cost of requiring a 24 bit wide ROM look up table
rather than a 16 bit width. Output from the quarter
butterfly circuit may be right shifted up to one bit under
control of an external signal. This allows scaling of data
as required to prevent arithmetic overflows. Arithmetic
rounding is applied to the final butterfly output by adding
0.5 to the least significant bit of each output word.

All input/output data as well as interim results are
stored in a 64 word by 16 bit RAM. This memory may
read from one address while writing to another in a single
memory cycle. A memory cycle corresponds to four
cycles of the master clock.
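The pass structure and fixed point arithmetic described above can be
illustrated with a short software model. This is only a sketch under
stated assumptions, not a description of the actual microprogram: the
function and variable names are hypothetical, the input is reordered by
bit reversal so that results come out in natural order, twiddle factors
are quantised to 16 bits rather than held in the 24 bit Booth coded form,
and a result that does not fit in 16 bits is clamped and flagged (the
text does not say what the hardware does with an overflowed word).

```python
import math

N, PASSES = 32, 5   # 32 points need log2(32) = 5 passes of 16 butterflies
Q = 15              # fractional bits of the 16 bit two's complement format

def q15(v):
    """Quantise a float to a Q15 integer, clamping +1.0 to 32767."""
    return max(-32768, min(32767, round(v * 32768)))

def bit_reverse(i, bits=PASSES):
    r = 0
    for _ in range(bits):
        r, i = (r << 1) | (i & 1), i >> 1
    return r

def round_shift(acc, shift):
    """Reduce a Q30 accumulator to Q15, halving the value when shift is 1.
    Half of the final least significant bit is added before truncation."""
    n = Q + shift
    return (acc + (1 << (n - 1))) >> n

def fft32_q15(xr, xi, shifts):
    """In-place 32 point radix-2 DIT FFT on lists of Q15 integers.
    shifts[p] is 0 or 1 and selects the one bit right shift for pass p.
    Returns one overflow flag per pass."""
    for i in range(N):                       # assumed bit-reversed ordering
        j = bit_reverse(i)
        if i < j:
            xr[i], xr[j] = xr[j], xr[i]
            xi[i], xi[j] = xi[j], xi[i]
    overflow = []
    half = 1
    for p in range(PASSES):
        flag = False
        for start in range(0, N, 2 * half):
            for k in range(half):
                ang = -2 * math.pi * k / (2 * half)
                wr, wi = q15(math.cos(ang)), q15(math.sin(ang))
                i0, i1 = start + k, start + k + half
                # Complex product b * w, kept at full Q30 precision.
                pr = xr[i1] * wr - xi[i1] * wi
                pim = xr[i1] * wi + xi[i1] * wr
                ar, ai = xr[i0] << Q, xi[i0] << Q
                outs = [round_shift(v, shifts[p])
                        for v in (ar + pr, ai + pim, ar - pr, ai - pim)]
                clipped = [max(-32768, min(32767, v)) for v in outs]
                flag |= clipped != outs      # a word did not fit in 16 bits
                xr[i0], xi[i0], xr[i1], xi[i1] = clipped
        overflow.append(flag)
        half *= 2
    return overflow
```

The loops execute 5 x 16 = 80 butterflies. At 24 master clock cycles of
20 nanoseconds each, these account for 80 x 480 ns = 38.4 microseconds
of the 47.0 microsecond transform time; the text does not itemize the
remainder.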
17467/82I0OOO 1081.$ 00.75 © 1982 IEEE
All control and sequence functions in the TMC 2032
are performed by a PLA based microprogrammed control
unit. This unit is easily programmed by a final mask
step. It cycles at the master clock rate and generates all
the signals required to step through the 80 butterflies
required by a 32 point transform. These signals include
twiddle factor ROM addresses, RAM read/write
addresses and butterfly unit states. An instruction
decoder circuit allows the PLA to receive and process
macro level instructions via the off-chip interface.

The off-chip interface includes separate 16 bit
input and output ports, an instruction input and a status
register output. All outputs have three-state buffers to
give added flexibility when interfacing to bus oriented
systems. Instructions to the chip include:

1. Load complex data samples over the input/output
port sequentially into the dual port RAM, then
perform a 32 point FFT.

2. Output complex data in bit reversed addressing
order.

3. Output complex data in natural sequential
order.

4. Load complex data and perform 32 point
inverse FFT.

5. Right shift all data values by one bit during the
next sequential pass.

6. Return status.

The status register consists of five bits. Three of
these indicate which of the five FFT passes is currently
in progress. The fourth bit indicates that the chip is busy
and the fifth bit indicates that an arithmetic overflow
has occurred during the current FFT pass.

In the implementation of any fixed point FFT,
provisions must be made for scaling data points to prevent
arithmetic overflows which may be caused by normal
word growth within the algorithm. The TMC 2032
accomplishes this scaling by use of two external signals as
follows:

1. A bit is available in the status register which
indicates that an arithmetic overflow occurred on the
current one of the five FFT passes. This signal is reset at
the beginning of each pass and latched whenever an
overflow occurs.

2. An instruction may be input which causes the
TMC 2032 to right shift all data points by one bit before
they are output from the next sequential pass. This
signal is latched at the start of each pass.

The simplest application using these two signals to
prevent arithmetic overflows is a fixed scaling procedure
wherein an external circuit monitors the pass counter and
asserts the right shift instruction in a predetermined
fixed sequence. Here, the overflow bit becomes an error
flag. In order to use this method effectively, some a
priori knowledge about the structure of the input signal is
required. For example, if the input signal is essentially
Gaussian noise, the optimum fixed scaling is usually a
right shift on every even numbered pass.

A more flexible approach to data scaling requires
an external circuit to monitor the state of the overflow
bit and determine which passes of the FFT must be right
shifted in order to prevent overflows. Non-valid data will
come out of the first few FFTs, while the appropriate
right-shift pattern is developed. As long as the input
signal characteristics do not change significantly, the
minimum shift, no overflow scaling case will soon be
found and valid output data will result from that point
on.

The TMC 2032 performs a complex to complex
Fourier transform. However, in many potential
applications for this chip, real data only is being
processed and 32 points of real data must be transformed
into sixteen complex valued frequencies. Here, the
TMC 2032 may be used to compute two real-to-complex
transforms in the same amount of time required to
compute a single complex-to-complex transform. The
following computational procedure applies (1):

1. Load the first 32 real valued data points into
the real array of the TMC 2032. Load the second 32 real
valued data points into the imaginary array of the
TMC 2032.

2. Execute the 32 point complex-to-complex FFT
macro instruction.

At this point, we must realize that the Fourier
transform of real only data will have a real part that is
an even function of frequency and an imaginary part that
is an odd function of frequency. Correspondingly, the
transform of imaginary only data will have an even
imaginary part and an odd real part. Therefore, the two
sets of sixteen complex frequencies may be generated by
simple additions and subtractions required to sort out the
even and odd parts.

Using the above procedure, the effective processing
bandwidth of the FFT chip may be doubled when
processing real data by the addition of fairly slow add and
subtract elements. Therefore, we can now transform real
data with an input sample rate of up to 1.36 MHz.

The TMC 2032 was designed to be used as a
building block for the construction of larger size
transforms. A 1024 point FFT may be constructed using
the following computational method (2):

1. First of all, we must take the 1024 input
complex time samples and arrange them into a two
dimensional matrix with the following format:
Further, we will define M (M=0 through 31) as the
column index and L (L=0 through 31) as the row index.
Generation of the final 1024 point transform will now be
performed by using 32 point FFT's on the rows and
columns of this matrix then reformatting back to 1024
complex frequencies.

2. Using the TMC 2032, perform a 32 point FFT on
each of the 32 columns.

3. Every element must now be multiplied by a
complex twiddle factor depending on its location in the
matrix. This factor is:

$W^{ML}$

where M and L are the column and row indices and W is:

$W = e^{-j 2\pi / 1024}$

4. Using the TMC 2032, compute the 32 point FFT's
of each of the 32 rows.

5. We now have completed the 1024 point
transform computation and, in the process, transposed
the original matrix. Therefore, we must now read out our
frequencies with F(0) being located at position (0,0), F(1)
at (0,1), F(2) at (0,2), F(32) at (1,0) etc.

As can be seen, the above procedure requires the
computation of 64 different 32 point FFT's as well as
1024 complex multiplies. The system designer has the
flexibility of selecting combinations of parallel and serial
structures which implement the required processing
within his speed constraints. For example, maximum
speed will be attained using 64 TMC 2032's and the
minimum hardware system will use a single chip
sequenced through all 64 FFT's.

A block diagram of one possible implementation of
the 1024 point transform is given in Figure 2. Here, 16
TMC 2032's are each sequenced through four FFT's to
compute a single 1024 point transform. Complex
multiplication is performed using two multiplier-accumulator
chips. Each chip is sequenced twice to
generate a single complex product. Finally, a total of
three frame store memories are used to store
intermediate results and read them out in row or column
order as required. These memories are double buffered
to allow sustained rate processing. This system is
capable of producing a new 1024 point FFT every 188
microseconds, subject to a latency time of 752
microseconds.

(1) L. D. Enochson and R. K. Otnes, "Digital Time Series
Analysis," 1972.

(2) L. R. Rabiner and B. Gold, "Theory and Application of
Digital Signal Processing," Prentice-Hall, 1975; pp. 371-

Figure 2. 1024 Point FFT Implementation
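The five step 1024 point procedure can be checked numerically with a
small model. This is a sketch under several assumptions that are not
confirmed by the text: sample x(32L + M) is placed at row L, column M;
the twiddle factor is W raised to the power M times L with a negative
exponent in W; positions in step 5 are read as (column, row); and a
direct floating point DFT stands in for each 32 point transform performed
by a TMC 2032. The function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct DFT, standing in for one 32 point transform on the chip."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def fft1024_by_32x32(x):
    """1024 point DFT built from 64 transforms of 32 points each."""
    R = 32
    # Step 1: row L, column M holds sample x[32*L + M] (assumed layout).
    m = [[x[R * L + M] for M in range(R)] for L in range(R)]
    # Step 2: 32 point transform down each of the 32 columns.
    for M in range(R):
        col = dft([m[L][M] for L in range(R)])
        for L in range(R):
            m[L][M] = col[L]
    # Step 3: multiply each element by W**(M*L), W = exp(-j*2*pi/1024).
    for L in range(R):
        for M in range(R):
            m[L][M] *= cmath.exp(-2j * cmath.pi * M * L / 1024)
    # Step 4: 32 point transform along each of the 32 rows.
    m = [dft(row) for row in m]
    # Step 5: transposed read-out; F(k) sits at row k % 32, column k // 32,
    # so F(1) is at column 0, row 1 and F(32) is at column 1, row 0.
    return [m[k % R][k // R] for k in range(1024)]
```

Under these assumptions the result agrees with a direct 1024 point DFT.
The quoted timing is also consistent with the single chip figure: four
transforms of 47.0 microseconds on each of the 16 devices give
4 x 47.0 = 188 microseconds per frame.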