Sum addressed decoder

In CPU design, a sum-addressed decoder or sum-addressed memory (SAM) decoder is a method of reducing the latency of CPU cache access. This is achieved by fusing the address-generation sum operation with the decode operation in the cache SRAM.

Overview

The L1 data cache is usually in the most critical recurrence on the CPU: few things improve instructions per cycle (IPC) as directly as a larger data cache, yet a larger data cache takes longer to access, and pipelining the data cache makes IPC worse. This article explains a way of reducing the latency of L1 data cache access by fusing the address-generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused add/decode described here.

The most profitable recurrence to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. Assuming that load results are bypassed with the same priority as integer results, it's possible to summarize this recurrence as a load followed by another load, as if the program were following a linked list.

The rest of this page assumes an instruction set architecture (ISA) with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. In ISAs such as the Intel x86, three or four inputs are summed to generate the virtual address. Multiple-input additions can be reduced to two-input additions with carry-save adders, and the remaining problem is as described below. The critical recurrence, then, is an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.

For this example, a direct-mapped 16 KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3]. The sum-addressed SRAM idea applies equally well to set-associative caches.

Sum-addressed cache: collapse the adder and decoder

The SRAM decoder for this example has an 11-bit input, Addr[13:3], and 2048 outputs, the decoded word lines. One word line is driven high in response to each unique Addr[13:3] value.

In the simplest form of decoder, each of the 2048 lines is logically an AND gate. The 11 bits (call them A[13:3]) and their complements (call them B[13:3]) are driven up the decoder. For each line, 11 bits or complements are fed into an 11-input AND gate. For instance, 1026 decimal is equal to 10000000010 binary. The function for line 1026 would be:

wordline[1026] = A[13] & B[12] & B[11] & B[10] & B[9] & B[8] & B[7] & B[6] & B[5] & A[4] & B[3]
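As a sanity check, the simple decoder can be sketched behaviorally in Python (a model of the logic, not the gate-level circuit; the function name is for illustration only): line L ANDs together the true bit A[i] where L's bit i is 1 and the complement B[i] where it is 0, which amounts to comparing every bit of the address against the line number.

```python
# Behavioral sketch of the simple AND-gate decoder: exactly one of the
# 2048 word lines is driven high for any given Addr[13:3] value.
def decode(addr_13_3):
    wordlines = []
    for line in range(2048):
        # Bit i of the address must equal bit i of the line number
        # (A[i] feeds the AND where the bit is 1, B[i] where it is 0).
        bits_match = all(((addr_13_3 >> i) & 1) == ((line >> i) & 1)
                         for i in range(11))
        wordlines.append(1 if bits_match else 0)
    return wordlines

wl = decode(1026)   # 1026 decimal = 10000000010 binary
assert wl[1026] == 1 and sum(wl) == 1   # exactly one line high
```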

Both the carry chain of the adder and the decoder combine information from the entire width of the index portion of the address. Combining information across the entire width twice is redundant. A sum-addressed SRAM combines the information just once by implementing the adder and decoder together in one structure.

Recall that the SRAM is indexed with the result of an add. Call the summands R (for register) and O (for the offset to that register). The sum-addressed decoder is going to decode R+O. For each decoder line, call the line number L.

Suppose that our decoder drove both R and O over each decoder line, and each decoder line implemented:

wordline[L] = (R+O)=L

(R+O)=L <=> R+O-L=0 <=> R+O+~L+1=0 <=> R+O+~L = -1 = 11..1

We can use a set of full adders to reduce R+O+~L to S+C (this is carry-save addition). S+C = 11..1 <=> S = ~C. There will be no carries in the final add. Note that since C is a row of carries, it's shifted up one bit, so that R[13:3] + O[13:3] + ~L[13:3] = {0,S[13:3]} + {C[14:4],0}
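The identity above can be checked numerically. The following is a minimal sketch, assuming an 11-bit index with arithmetic taken mod 2**11 (so the carry out of the top bit is discarded, as in the decoder): full adders reduce R + O + ~L to carry-save form (S, C), and the row matches, meaning S == ~C bitwise, exactly when (R + O) mod 2**11 == L.

```python
import random

MASK = (1 << 11) - 1  # 11-bit index, Addr[13:3]

def carry_save(r, o, nl):
    # One full adder per bit: sum bit, and carry bit shifted up one position.
    s = (r ^ o ^ nl) & MASK
    c = (((r & o) | (r & nl) | (o & nl)) << 1) & MASK  # carry out discarded
    return s, c

# The match condition S == ~C holds iff R + O + ~L == 11..1,
# i.e. iff (R + O) mod 2**11 == L.
for _ in range(1000):
    r, o, l = (random.getrandbits(11) for _ in range(3))
    s, c = carry_save(r, o, (~l) & MASK)
    matches = ((s ^ c) & MASK) == MASK   # S == ~C on all 11 bits
    assert matches == (((r + o) & MASK) == l)
```

Because C's LSB is always 0 (carries are shifted up), S = ~C really does imply there are no carries in the notional final add: the two values sum to all ones with no bit overlap.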

With this formulation, each row in the decoder is a set of full adders, which reduce the base register, the offset, and the row number to a carry-save format, and a comparator. We'll find below that most of this hardware is redundant, but for now it's simpler to think of it all existing in each row.

Ignoring the LSBs: Late select on carry

The formulation above checks the entire result of an add. However, in a CPU cache decoder, the entire result of the add is a byte address, and the cache is usually indexed with a larger address, in our example, that of an 8-byte block. Ideally, we'd like to ignore a few of the LSBs of the address. We can't ignore the LSBs of the two addends, however, because they may produce a carry-out which would change the doubleword addressed.

Suppose that we add R[13:3] and O[13:3] to get some index I[13:3]. The actual address Addr[13:3] is equal to either I[13:3] or I[13:3]+1, depending on whether R[2:0]+O[2:0] generates a carry-out. We can fetch both I and I+1 if we have two banks of SRAM, one with even addresses and one with odd. The even bank holds addresses 000xxx, 010xxx, 100xxx, 110xxx, etc., and the odd bank holds addresses 001xxx, 011xxx, 101xxx, 111xxx, etc. We can then use the carry-out from R[2:0]+O[2:0] to select the even or odd doubleword fetched later.

Note that fetching from two half-size banks of SRAM will dissipate more power than fetching from one full-size bank, since we are switching more sense amps and data steering logic.
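The two-bank fetch and late select can be sketched behaviorally. This is a minimal model of the example cache (16 KB direct-mapped, 8-byte lines, 2048 lines split into two 1024-line banks); the function and bank names are illustrative, not from a real design. Each bank is indexed by line-number/2, so both candidates of {I, I+1} are fetched, and the carry-out of R[2:0]+O[2:0] picks the winner.

```python
def fetch(even_bank, odd_bank, r, o):
    i = ((r >> 3) + (o >> 3)) & 0x7FF   # I[13:3]: index sum WITHOUT the low carry
    carry = ((r & 7) + (o & 7)) >> 3    # carry-out of R[2:0] + O[2:0]
    # The even candidate of {I, I+1} sits at position (I+1)//2 in the even
    # bank (mask handles wraparound); the odd candidate sits at I//2.
    even_cand = even_bank[((i + 1) >> 1) & 0x3FF]
    odd_cand = odd_bank[i >> 1]
    # Late select: the true index is I + carry; its LSB picks the bank.
    return odd_cand if (i + carry) & 1 else even_cand

# Store each line's own index as its "data" so results are easy to check.
even_bank = [2 * n for n in range(1024)]
odd_bank = [2 * n + 1 for n in range(1024)]

assert fetch(even_bank, odd_bank, 0x10, 0x18) == 5  # 0x10+0x18 = 0x28, index 5
assert fetch(even_bank, odd_bank, 7, 1) == 1        # low-bit carry bumps I=0 to 1
```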

Match generation

 I[13:3]   even bank      odd bank
           fetches line   fetches line
 ----------------------------------------
   100        100            101
   101        110            101
   110        110            111

Referring to the table above, we can see that the even bank will fetch line 110 when I[13:3]=101 or I[13:3]=110. The odd bank will fetch line 101 when I[13:3]=100 or I[13:3]=101.

In general, the odd SRAM bank should fetch line Lo=2N+1 when either I[13:3]=2N or I[13:3]=2N+1. We can write these two conditions as:

I[13:3] = Lo-1  =>  R[13:3] + O[13:3] + ~Lo + 1 = 11..11  =>  R[13:3] + O[13:3] + ~Lo = 11..10
I[13:3] = Lo    =>  R[13:3] + O[13:3] + ~Lo = 11..11

We ignore the last digit of the compare: (S+C)[13:4] = 11..1

Similarly, the even SRAM bank fetches line Le=2N when either I[13:3]=2N or I[13:3]=2N-1. We can write these two conditions as follows, and once again ignore the last digit of the compare.

I[13:3] = Le-1  =>  R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le    =>  R[13:3] + O[13:3] + ~Le = 11..11
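Ignoring the last digit of the compare can also be checked numerically. In this sketch (11-bit values as before, helper names illustrative), dropping the LSB of the carry-save compare makes a bank's match fire for both I = L and I = L-1, which is exactly the behavior each half-size bank needs.

```python
MASK = (1 << 11) - 1  # 11-bit index, Addr[13:3]

def carry_save(r, o, nl):
    s = (r ^ o ^ nl) & MASK
    c = (((r & o) | (r & nl) | (o & nl)) << 1) & MASK
    return s, c

def bank_match(r, o, line):
    s, c = carry_save(r, o, (~line) & MASK)
    # Compare bits [13:4] only: OR-ing in bit 0 ignores the last digit.
    return ((s ^ c) | 1) & MASK == MASK

# With R=2, O=3 the index sum I is 5, so:
assert bank_match(2, 3, 5)       # odd bank's line Lo=5 fires (I = Lo)
assert bank_match(2, 3, 6)       # even bank's line Le=6 fires (I = Le-1)
assert not bank_match(2, 3, 7)   # no other line fires
assert not bank_match(2, 3, 4)
```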

Gate-level implementation

      R13 ... R6 R5 R4 R3
      O13 ... O6 O5 O4 O3
      L13 ... L6 L5 L4 L3
  -------------------------
      S13 ... S6 S5 S4 S3
  C14 C13 ... C6 C5 C4

Before we begin collapsing redundancy between rows, let's review: each row of each decoder for each of two banks implements a set of full adders which reduce the three numbers to be added (R[13:3], O[13:3], and ~L) to two numbers (S[13:3] and C[14:4]). The LSB (=S[3]) is discarded. Carry out (=C[14]) is also discarded. The row matches if S[13:4] = ~C[13:4], which is &(xor(S[13:4], C[13:4])).

We can partially specialize the full adders to 2-input and, or, xor, and xnor because the L input is constant. The resulting expressions are common to all lines of the decoder and can be collected at the bottom.

S0;i   = S(Ri, Oi, 0) = Ri xor Oi
S1;i   = S(Ri, Oi, 1) = Ri xnor Oi
C0;i+1 = C(Ri, Oi, 0) = Ri and Oi
C1;i+1 = C(Ri, Oi, 1) = Ri or Oi

At each digit position, there are only two possible Si, two possible Ci, and four possible xors between them:

Li=0 and Li-1=0: X0;0;i = S0;i xor C0;i = Ri xor Oi xor (Ri-1 and Oi-1)
Li=0 and Li-1=1: X0;1;i = S0;i xor C1;i = Ri xor Oi xor (Ri-1 or Oi-1)
Li=1 and Li-1=0: X1;0;i = S1;i xor C0;i = Ri xnor Oi xor (Ri-1 and Oi-1) = !X0;0;i
Li=1 and Li-1=1: X1;1;i = S1;i xor C1;i = Ri xnor Oi xor (Ri-1 or Oi-1) = !X0;1;i

One possible decoder for our example might calculate these four expressions for each of bits 4..13 and drive all 40 wires up the decoder. Each line of the decoder would select one of the four wires for each bit, and consist of a 10-input AND.
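The four per-bit expressions can be verified exhaustively. This sketch (function name illustrative) treats Li and Li-1 as the per-row constants they are, computes the corresponding specialized sum and carry terms, and confirms the complement identities X1;0 = !X0;0 and X1;1 = !X0;1 claimed above, which is why only the two Li=0 forms need to be generated and routed.

```python
from itertools import product

def X(li, li_1, ri, oi, ri_1, oi_1):
    # Specialized sum term: Ri xor Oi when li=0, Ri xnor Oi when li=1.
    s = (ri ^ oi) ^ li
    # Specialized carry term: Ri and Oi when li_1=0, Ri or Oi when li_1=1.
    c = (ri_1 & oi_1) if li_1 == 0 else (ri_1 | oi_1)
    return s ^ c

# The Li=1 expressions are exact complements of the Li=0 expressions.
for ri, oi, ri_1, oi_1 in product((0, 1), repeat=4):
    assert X(1, 0, ri, oi, ri_1, oi_1) == 1 - X(0, 0, ri, oi, ri_1, oi_1)
    assert X(1, 1, ri, oi, ri_1, oi_1) == 1 - X(0, 1, ri, oi, ri_1, oi_1)
```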

What has been saved?

A simpler data cache path would have an adder followed by a traditional decoder. For our example cache subsystem, the critical path would be a 14-bit adder, producing true and complement values, followed by an 11-bit AND gate for each row of the decoder.

In the sum-addressed design, the final AND gate in the decoder remains, although 10 bits wide instead of 11. The adder has been replaced by a four-input logical expression at each bit. The latency savings comes from the speed difference between the adder and that four-input expression, a savings of perhaps three simple CMOS gates.

If the reader feels that this was an inordinate amount of brain-twisting work for a three-gate improvement in a multi-cycle critical path, then the reader has a better appreciation for the level to which modern CPUs are optimized.

Further optimizations: predecode

Many decoder designs avoid high-fan-in AND gates in the decode line itself by employing a predecode stage. For instance, an 11-bit decoder might be predecoded into three groups of 4, 4, and 3 bits each. Each 3-bit group would drive 8 wires up the main decode array; each 4-bit group would drive 16 wires. The decoder line then becomes a 3-input AND gate. This reorganization can save significant implementation area and some power.
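A behavioral sketch of that conventional predecode scheme (group sizes 4+4+3 as in the example; function names illustrative): each group one-hot decodes its own field, 16+16+8 = 40 wires run up the array, and every decode line reduces to a 3-input AND of one wire from each group.

```python
def predecode(addr11):
    # One-hot decode each field of the 11-bit address.
    g0 = [int((addr11 & 0xF) == v) for v in range(16)]        # bits [3:0]
    g1 = [int(((addr11 >> 4) & 0xF) == v) for v in range(16)]  # bits [7:4]
    g2 = [int(((addr11 >> 8) & 0x7) == v) for v in range(8)]   # bits [10:8]
    return g0, g1, g2   # 16 + 16 + 8 = 40 predecoded wires

def wordline(line, g0, g1, g2):
    # Each decode line is now a 3-input AND, one wire per predecode group.
    return g0[line & 0xF] & g1[(line >> 4) & 0xF] & g2[(line >> 8) & 0x7]

g = predecode(1026)
assert wordline(1026, *g) == 1
assert sum(wordline(n, *g) for n in range(2048)) == 1   # still one-hot
```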

This same reorganization can be applied to the sum-addressed decoder. Each bit in the non-predecoded formulation above can be viewed as a local two-bit add. With predecoding, each predecode group is a local three-, four-, or even five-bit add, with the predecode groups overlapping by one bit.

Predecoding generally increases the number of wires traversing the decoder, and sum-addressed decoders generally have about twice as many wires as the equivalent simple decoder. These wires can be the limiting factor on the amount of feasible predecoding.

References

* Paul Demone has an explanation of sum-addressed caches in a RealWorldTech [http://web.archive.org/web/20060109081916/http://www.realworldtech.com/page.cfm?ArticleID=RWT091000000000 article] (archived).
* Heald et al. have a [http://www.ra.informatik.uni-stuttgart.de/~rainer/Literatur/Online/RA/4/2.pdf paper] in ISSCC 1998 that explains what may be the original sum-addressed cache, in the UltraSPARC III.
* Sum-addressed memory is described in United States Patent 5,754,819, May 19, 1998, "Low-latency memory indexing method and structure". Inventors: Lynch, William L. (Palo Alto, CA) and Lauterbach, Gary R. (Los Altos, CA); Assignee: Sun Microsystems, Inc. (Mountain View, CA); Filed: July 28, 1994.
* At least one of the inventors named on a patent related to carry-free address decoding credits the following publication: Jordi Cortadella and Jose M. Llaberia, "Evaluation of A + B = K Conditions without Carry Propagation", IEEE Transactions on Computers, 1992. [http://citeseer.ist.psu.edu/565049.html]
* The following patent extends this work to use redundant-form arithmetic throughout the processor, and so avoid carry-propagation overhead even in ALU operations, or when an ALU operation is bypassed into a memory address: United States Patent 5,619,664, "Processor with architecture for improved pipelining of arithmetic instructions by forwarding redundant intermediate data forms", awarded April 18, 1997. Inventor: Glew, Andrew F. (Hillsboro, OR); Assignee: Intel Corporation (Santa Clara, CA); Appl. No. 08/402,322; Filed: March 10, 1995.


