%% This is an example first chapter.  You should put each chapter/appendix
%% that you write into a separate file, and add a line \include{yourfilename}
%% to main.tex, where `yourfilename.tex' is the name of the chapter/appendix
%% file.  You can process specific files by typing their names in at the
%% \files=
%% prompt when you run the file main.tex through LaTeX.
\chapter{Introduction}

Micro-optimization is a technique for reducing the overall operation count
of a floating point computation.  In a standard floating point unit, the
floating point operations are fairly high level, such as ``multiply'' and
``add''; in a micro floating point unit ($\mu$FPU), these have been broken
down into their constituent low-level operations on the mantissas and
exponents of the floating point numbers.

Chapter two describes the architecture of the $\mu$FPU and the motivations
for the design decisions made.

Chapter three describes the design of the compiler, as well as how the
optimizations discussed in section~\ref{ch1:opts} were implemented.

Chapter four describes the purpose of the test code that was compiled, and
which statistics were gathered by running it through the simulator.  The
purpose is to measure what effect the micro-optimizations had, compared to
unoptimized code.  Possible future expansions to the project are also
discussed.

\section{Motivations for micro-optimization}

The idea of micro-optimization is motivated by the recent trends in computer
architecture towards low-level parallelism and small, pipelineable
instruction sets \cite{patterson:risc,rad83}.  By eliminating more complex
instructions and concentrating on optimizing frequently used instructions,
these architectures realized substantial increases in performance.

Another important motivation was the trend towards placing more of the
burden of performance on the compiler.  Many of the new architectures depend
on an intelligent, optimizing compiler in order to realize anywhere near
their peak performance
\cite{ellis:bulldog,pet87,coutant:precision-compilers}.  In these cases, the
compiler is not only responsible for faithfully generating native code to
match the source language, but must also be aware of instruction latencies,
delayed branches, pipeline stages, and a multitude of other factors in order
to generate fast code \cite{gib86}.

Taking these ideas one step further, it seems that the floating point
operations that are normally single, large instructions can be further broken
down into smaller, simpler, faster instructions, with more control in the
compiler and less in the hardware.  This is the idea behind a
micro-optimizing FPU: break the floating point instructions down into their
basic components and use a small, fast implementation, with a large part of
the burden of hardware allocation and optimization shifted towards
compile-time.

Along with the hardware speedups possible by using a $\mu$FPU, there are
also optimizations that the compiler can perform on the code that is
generated.  In a normal sequence of floating point operations, there are
many hidden redundancies that can be eliminated by allowing the compiler to
control the floating point operations down to their lowest level.  These
optimizations are described in detail in section~\ref{ch1:opts}.

\section{Description of micro-optimization}\label{ch1:opts}

In the process of carrying out a sequence of floating point operations, a
normal FPU performs many redundant internal shifts and normalizations.
However, if a compiler can decompose the floating point operations it needs
down to the lowest level, it can then optimize away many of these redundant
operations.

If there is some additional hardware support specifically for
micro-optimization, there are additional optimizations that can be
performed.  This hardware support entails extra ``guard bits'' on the
standard floating point formats, to allow several unnormalized operations
to be performed in a row without the loss of information\footnote{A
description of the floating point format used is shown in
figures~\ref{exponent-format} and~\ref{mantissa-format}.}.  A discussion of
the mathematics behind unnormalized arithmetic is in
appendix~\ref{unnorm-math}.

The optimizations that the compiler can perform fall into several categories:

\subsection{Post Multiply Normalization}

When more than two multiplications are performed in a row, the intermediate
normalization of the results between multiplications can be eliminated.
This is because with each multiplication, the mantissa can become
denormalized by at most one bit.  If there are guard bits on the mantissas
to prevent bits from ``falling off'' the end during multiplications, the
normalization can be postponed until after a sequence of several
multiplies\footnote{Using unnormalized numbers for math is not a new idea; a
good example of it is the Control Data CDC 6600, designed by Seymour
Cray~\cite{thornton:cdc6600}.  The CDC 6600 had all of its instructions
perform unnormalized arithmetic, with a separate {\tt NORMALIZE}
instruction.}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}

The intermediate results can thus be multiplied together, with no need for
intermediate normalizations thanks to the guard bits.  It is only at the end
of the sequence that a normalization must be performed, in order to get the
result into a format suitable for storing in memory\footnote{Note that for
purposes of clarity, the pipeline delays were considered to be 0, and the
branches were not delayed.}.

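As a rough sketch of this idea in C (using a deliberately simplified
mantissa/exponent pair rather than the $\mu$FPU's actual formats, with a
hypothetical {\tt normalize} helper), a chain of multiplies needs only a
single normalization at the end:
\begin{verbatim}
#include <stdint.h>

/* Illustrative format only: value = (mant / 2^32) * 2^exp, with the
 * mantissa kept in [1/2, 1) when normalized.  The low-order bits act
 * as guard bits that absorb denormalization between multiplies.      */
typedef struct { uint32_t mant; int exp; } ufloat;

/* Multiply without renormalizing; each multiply can denormalize the
 * result by at most one bit.                                          */
static ufloat mul_unnorm(ufloat a, ufloat b)
{
    ufloat r;
    r.mant = (uint32_t)(((uint64_t)a.mant * b.mant) >> 32);
    r.exp  = a.exp + b.exp;
    return r;
}

/* A single normalization at the end of the whole sequence.            */
static ufloat normalize(ufloat x)
{
    while (x.mant != 0 && (x.mant & 0x80000000u) == 0) {
        x.mant <<= 1;
        x.exp  -= 1;
    }
    return x;
}

/* d = a * b * c: two unnormalized multiplies, one final normalization,
 * instead of normalizing after every multiply.                         */
ufloat mul3(ufloat a, ufloat b, ufloat c)
{
    return normalize(mul_unnorm(mul_unnorm(a, b), c));
}
\end{verbatim}
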
\subsection{Block Exponent}

In an unoptimized sequence of additions, the sequence of operations is as
follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$):
\begin{enumerate}
\item Compare $e_1$ and $e_2$.
\item Shift the mantissa associated with the smaller exponent $|e_1-e_2|$
places to the right.
\item Add $m_1$ and $m_2$.
\item Find the first one in the resulting mantissa.
\item Shift the resulting mantissa so that it is normalized.
\item Adjust the exponent accordingly.
\end{enumerate}

Out of 6 steps, only one is the actual addition; the rest are involved in
aligning the mantissas prior to the add, and then normalizing the result
afterward.  In the block exponent optimization, the largest exponent is
found to start with, and all the mantissas are shifted before any additions
take place.  Once the mantissas have been shifted, the additions can take
place one after another\footnote{This requires that for $n$ consecutive
additions, there are $\log_{2}n$ high guard bits to prevent overflow.  In
the $\mu$FPU, there are 3 guard bits, making up to 8 consecutive additions
possible.}.  An example of the Block Exponent optimization on the expression
$X = A + B + C$ is given in figure~\ref{opt:be}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}

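As with the previous example, the following C fragment is only a sketch of
the idea: the representation is simplified, the names are illustrative, and
the mantissas are assumed to carry enough high guard bits that the
additions cannot overflow.
\begin{verbatim}
#include <stdint.h>

/* Illustrative format only: value = mant * 2^exp. */
typedef struct { int32_t mant; int exp; } ufloat;

ufloat block_exponent_sum(const ufloat *x, int n)
{
    /* Find the largest exponent first. */
    int emax = x[0].exp;
    for (int i = 1; i < n; i++)
        if (x[i].exp > emax) emax = x[i].exp;

    /* Align every mantissa to emax once, then add them back to back;
     * no per-addition compare/shift/normalize is needed (an arithmetic
     * right shift is assumed for negative mantissas).                  */
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        int d = emax - x[i].exp;
        acc += (d < 32) ? (x[i].mant >> d) : 0;
    }

    /* A single normalization of the final result. */
    ufloat r = { acc, emax };
    while (r.mant != 0 && r.mant > -(1 << 30) && r.mant < (1 << 30)) {
        r.mant *= 2;      /* shift one place left */
        r.exp  -= 1;
    }
    return r;
}
\end{verbatim}
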
\section{Integer optimizations}

As well as the floating point optimizations described above, there are
also integer optimizations that can be used in the $\mu$FPU.  In concert
with the floating point optimizations, these can provide a significant
speedup.

\subsection{Conversion to fixed point}

Integer operations are much faster than floating point operations; if it is
possible to replace floating point operations with fixed point operations,
this provides a significant increase in speed.

This conversion can either take place automatically or based on a specific
request from the programmer.  To do this automatically, the compiler must
either be very smart, or play fast and loose with the accuracy and
precision of the programmer's variables.  To be ``smart'', the compiler
must track the ranges of all the floating point variables through the
program, and then see if there are any potential candidates for conversion
to fixed point.  This technique is discussed further in
section~\ref{range-tracking}, where its implementation is described.

The other way to do this is to rely on specific hints from the programmer
that a certain value will only assume a specific range, and that only a
specific precision is desired.  This is somewhat more taxing on the
programmer, in that he has to know the ranges that his values will take at
declaration time (something normally abstracted away), but it does provide
the opportunity for fine-tuning already working code.

Potential applications of this would be simulation programs, where the
variable represents some physical quantity; the constraints of the physical
system may provide bounds on the range the variable can take.

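As a small illustration (not the $\mu$FPU compiler's actual mechanism, and
with an arbitrarily chosen scale factor), a bounded physical quantity can be
carried in an ordinary integer with an implied binary point:
\begin{verbatim}
#include <stdint.h>

/* A quantity known to stay within, say, [-100, +100] can be held in a
 * 32-bit integer scaled by 2^16 (16 fractional bits), so additions and
 * comparisons become plain integer operations.                         */
typedef int32_t fix16;                       /* value = raw / 2^16      */

#define FIX16(x)        ((fix16)((x) * 65536.0))
#define FIX16_TO_DBL(x) ((double)(x) / 65536.0)

static fix16 fix16_add(fix16 a, fix16 b) { return a + b; }

static fix16 fix16_mul(fix16 a, fix16 b)
{
    /* Widen to 64 bits so the intermediate product cannot overflow. */
    return (fix16)(((int64_t)a * b) >> 16);
}
\end{verbatim}
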
\subsection{Small Constant Multiplications}

One other class of optimizations that can be done is to replace
multiplications by small integer constants with some combination of
additions and shifts.  Addition and shifting can be significantly faster
than multiplication.  This is done by using some combination of
\begin{eqnarray*}
a_i & = & a_j + a_k \\
a_i & = & 2a_j + a_k \\
a_i & = & 4a_j + a_k \\
a_i & = & 8a_j + a_k \\
a_i & = & a_j - a_k \\
a_i & = & a_j \ll m \quad \mbox{(left shift by $m$)}
\end{eqnarray*}
instead of the multiplication.  For example, to multiply $s$ by 10 and store
the result in $r$, you could use:
\begin{eqnarray*}
r & = & 4s + s\\
r & = & r + r
\end{eqnarray*}
Or by 59:
\begin{eqnarray*}
t & = & 2s + s \\
r & = & 2t + s \\
r & = & 8r + t
\end{eqnarray*}
Similar combinations can be found for almost all of the smaller
integers~\cite{magenheimer:precision}\footnote{This optimization is only an
``optimization'', of course, when the amount of time spent on the shifts and
adds is less than the time that would be spent doing the multiplication.
Since the time costs of these operations are known to the compiler in order
for it to do scheduling, it is easy for the compiler to determine when this
optimization is worth using.}.

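Written out in C, the two decompositions above become (the function names
are only for illustration):
\begin{verbatim}
#include <assert.h>
#include <stdint.h>

/* Multiply by 10: r = 4s + s (= 5s), then r = r + r (= 10s). */
static int32_t mul10(int32_t s)
{
    int32_t r = (s << 2) + s;     /* 4s + s          */
    return r + r;                 /* 5s + 5s         */
}

/* Multiply by 59: t = 2s + s (= 3s), r = 2t + s (= 7s), r = 8r + t. */
static int32_t mul59(int32_t s)
{
    int32_t t = (s << 1) + s;     /* 3s              */
    int32_t r = (t << 1) + s;     /* 7s              */
    return (r << 3) + t;          /* 56s + 3s = 59s  */
}

int main(void)
{
    assert(mul10(7) == 70);
    assert(mul59(7) == 413);
    return 0;
}
\end{verbatim}
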
\section{Other optimizations}

\subsection{Low-level parallelism}

The current trend is towards duplicating hardware at the lowest level to
provide parallelism\footnote{This can be seen in the i860; floating point
additions and multiplications can proceed at the same time, and the RISC
core can be moving data in and out of the floating point registers and
providing flow control at the same time the floating point units are
active~\cite{byte:i860}.}.

Conceptually, it is easy to take advantage of low-level parallelism in the
instruction stream by simply adding more functional units to the $\mu$FPU,
widening the instruction word to control them, and then scheduling as many
operations to take place at one time as possible.

However, simply adding more functional units can only be done so many times;
there is only a limited amount of parallelism directly available in the
instruction stream, and without enough of it, much of the extra hardware
will go to waste.  One process used to make more instructions potentially
schedulable at any given time is ``trace scheduling''.  This technique
originated in the Bulldog compiler for the original VLIW machine, the
ELI-512~\cite{ellis:bulldog,colwell:vliw}.  In trace scheduling, code can be
scheduled through many basic blocks at one time, following a single
potential ``trace'' of program execution.  In this way, instructions that
{\em might\/} be executed depending on a conditional branch further down in
the instruction stream are scheduled, allowing an increase in the potential
parallelism.  To account for the cases where the expected branch wasn't
taken, correction code is inserted after the branches to undo the effects of
any prematurely executed instructions.

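A source-level analogue of this pattern is sketched below; trace scheduling
itself operates on machine instructions across basic blocks, so this C
fragment only illustrates the speculate-then-correct idea.
\begin{verbatim}
/* Before: the add must wait for the branch to resolve. */
int before(int a, int c, int p)
{
    int y;
    if (p) y = a + c;
    else   y = a - c;
    return y;
}

/* After scheduling along the likely trace (p almost always true):
 * the add is hoisted above the branch and executed speculatively;
 * the off-trace path gets correction code that undoes its effect.  */
int after(int a, int c, int p)
{
    int y = a + c;               /* executed early, from the trace   */
    if (!p)
        y = y - 2 * c;           /* correction: (a + c) - 2c = a - c */
    return y;
}
\end{verbatim}
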
\subsection{Pipeline optimizations}

In addition to having operations going on in parallel across functional
units, it is also typical to have several operations in various stages of
completion in each unit.  This pipelining allows the throughput of the
functional units to be increased, with no increase in latency.

There are several ways pipelined operations can be optimized.  On the
hardware side, support can be added to allow data to be recirculated back
into the beginning of the pipeline from the end, saving a trip through the
registers.  On the software side, the compiler can utilize several tricks to
try to fill up as many of the pipeline delay slots as possible, as described
by Gibbons~\cite{gib86}.