cortex: thesis/aux/mitthesis/chap1.tex annotate

annotate thesis/aux/mitthesis/chap1.tex @ 485:ac953b562eab

completed first draft.

author	Robert McIntyre <rlm@mit.edu>
date	Sat, 29 Mar 2014 16:22:49 -0400
parents	6b0f77df0e53
children

rev	line source
rlm@421	1 %% This is an example first chapter. You should put chapter/appendix that you
rlm@421	2 %% write into a separate file, and add a line \include{yourfilename} to
rlm@421	3 %% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
rlm@421	4 %% You can process specific files by typing their names in at the
rlm@421	5 %% \files=
rlm@421	6 %% prompt when you run the file main.tex through LaTeX.
rlm@421	7 \chapter{Introduction}
rlm@421	8
rlm@421	9 Micro-optimization is a technique to reduce the overall operation count of
rlm@421	10 floating point operations. In a standard floating point unit, floating
rlm@421	11 point operations are fairly high level, such as ``multiply'' and ``add'';
rlm@421	12 in a micro floating point unit ($\mu$FPU), these have been broken down into
rlm@421	13 their constituent low-level floating point operations on the mantissas and
rlm@421	14 exponents of the floating point numbers.
rlm@421	15
rlm@421	16 Chapter two describes the architecture of the $\mu$FPU unit, and the
rlm@421	17 motivations for the design decisions made.
rlm@421	18
rlm@421	19 Chapter three describes the design of the compiler, as well as how the
rlm@421	20 optimizations discussed in section~\ref{ch1:opts} were implemented.
rlm@421	21
rlm@421	22 Chapter four describes the purpose of test code that was compiled, and which
rlm@421	23 statistics were gathered by running it through the simulator. The purpose
rlm@421	24 is to measure what effect the micro-optimizations had, compared to
rlm@421	25 unoptimized code. Possible future expansions to the project are also
rlm@421	26 discussed.
rlm@421	27
rlm@421	28 \section{Motivations for micro-optimization}
rlm@421	29
rlm@421	30 The idea of micro-optimization is motivated by the recent trends in computer
rlm@421	31 architecture towards low-level parallelism and small, pipelineable
rlm@421	32 instruction sets \cite{patterson:risc,rad83}. By getting rid of more
rlm@421	33 complex instructions and concentrating on optimizing frequently used
rlm@421	34 instructions, substantial increases in performance were realized.
rlm@421	35
rlm@421	36 Another important motivation was the trend towards placing more of the
rlm@421	37 burden of performance on the compiler. Many of the new architectures depend
rlm@421	38 on an intelligent, optimizing compiler in order to realize anywhere near
rlm@421	39 their peak performance
rlm@421	40 \cite{ellis:bulldog,pet87,coutant:precision-compilers}. In these cases, the
rlm@421	41 compiler not only is responsible for faithfully generating native code to
rlm@421	42 match the source language, but also must be aware of instruction latencies,
rlm@421	43 delayed branches, pipeline stages, and a multitude of other factors in order
rlm@421	44 to generate fast code \cite{gib86}.
rlm@421	45
rlm@421	46 Taking these ideas one step further, it seems that the floating point
rlm@421	47 operations that are normally single, large instructions can be further broken
rlm@421	48 down into smaller, simpler, faster instructions, with more control in the
rlm@421	49 compiler and less in the hardware. This is the idea behind a
rlm@421	50 micro-optimizing FPU; break the floating point instructions down into their
rlm@421	51 basic components and use a small, fast implementation, with a large part of
rlm@421	52 the burden of hardware allocation and optimization shifted towards
rlm@421	53 compile-time.
rlm@421	54
rlm@421	55 Along with the hardware speedups possible by using a $\mu$FPU, there are
rlm@421	56 also optimizations that the compiler can perform on the code that is
rlm@421	57 generated. In a normal sequence of floating point operations, there are
rlm@421	58 many hidden redundancies that can be eliminated by allowing the compiler to
rlm@421	59 control the floating point operations down to their lowest level. These
rlm@421	60 optimizations are described in detail in section~\ref{ch1:opts}.
rlm@421	61
rlm@421	62 \section{Description of micro-optimization}\label{ch1:opts}
rlm@421	63
rlm@421	64 In order to perform a sequence of floating point operations, a normal FPU
rlm@421	65 performs many redundant internal shifts and normalizations in the process of
rlm@421	66 performing a sequence of operations. However, if a compiler can
rlm@421	67 decompose the floating point operations it needs down to the lowest level,
rlm@421	68 it then can optimize away many of these redundant operations.
rlm@421	69
rlm@421	70 If there is some additional hardware support specifically for
rlm@421	71 micro-optimization, there are additional optimizations that can be
rlm@421	72 performed. This hardware support entails extra ``guard bits'' on the
rlm@421	73 standard floating point formats, to allow several unnormalized operations to
rlm@421	74 be performed in a row without the loss information\footnote{A description of
rlm@421	75 the floating point format used is shown in figures~\ref{exponent-format}
rlm@421	76 and~\ref{mantissa-format}.}. A discussion of the mathematics behind
rlm@421	77 unnormalized arithmetic is in appendix~\ref{unnorm-math}.
rlm@421	78
rlm@421	79 The optimizations that the compiler can perform fall into several categories:
rlm@421	80
rlm@421	81 \subsection{Post Multiply Normalization}
rlm@421	82
rlm@421	83 When more than two multiplications are performed in a row, the intermediate
rlm@421	84 normalization of the results between multiplications can be eliminated.
rlm@421	85 This is because with each multiplication, the mantissa can become
rlm@421	86 denormalized by at most one bit. If there are guard bits on the mantissas
rlm@421	87 to prevent bits from ``falling off'' the end during multiplications, the
rlm@421	88 normalization can be postponed until after a sequence of several
rlm@421	89 multiplies\footnote{Using unnormalized numbers for math is not a new idea; a
rlm@421	90 good example of it is the Control Data CDC 6600, designed by Seymour Cray.
rlm@421	91 \cite{thornton:cdc6600} The CDC 6600 had all of its instructions performing
rlm@421	92 unnormalized arithmetic, with a separate {\tt NORMALIZE} instruction.}.
rlm@421	93
rlm@421	94 % This is an example of how you would use tgrind to include an example
rlm@421	95 % of source code; it is commented out in this template since the code
rlm@421	96 % example file does not exist. To use it, you need to remove the '%' on the
rlm@421	97 % beginning of the line, and insert your own information in the call.
rlm@421	98 %
rlm@421	99 %\tagrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}
rlm@421	100
rlm@421	101 As you can see, the intermediate results can be multiplied together, with no
rlm@421	102 need for intermediate normalizations due to the guard bit. It is only at
rlm@421	103 the end of the operation that the normalization must be performed, in order
rlm@421	104 to get it into a format suitable for storing in memory\footnote{Note that
rlm@421	105 for purposed of clarity, the pipeline delays were considered to be 0, and
rlm@421	106 the branches were not delayed.}.
rlm@421	107
rlm@421	108 \subsection{Block Exponent}
rlm@421	109
rlm@421	110 In a unoptimized sequence of additions, the sequence of operations is as
rlm@421	111 follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$).
rlm@421	112 \begin{enumerate}
rlm@421	113 \item Compare $e_1$ and $e_2$.
rlm@421	114 \item Shift the mantissa associated with the smaller exponent $\|e_1-e_2\|$
rlm@421	115 places to the right.
rlm@421	116 \item Add $m_1$ and $m_2$.
rlm@421	117 \item Find the first one in the resulting mantissa.
rlm@421	118 \item Shift the resulting mantissa so that normalized
rlm@421	119 \item Adjust the exponent accordingly.
rlm@421	120 \end{enumerate}
rlm@421	121
rlm@421	122 Out of 6 steps, only one is the actual addition, and the rest are involved
rlm@421	123 in aligning the mantissas prior to the add, and then normalizing the result
rlm@421	124 afterward. In the block exponent optimization, the largest mantissa is
rlm@421	125 found to start with, and all the mantissa's shifted before any additions
rlm@421	126 take place. Once the mantissas have been shifted, the additions can take
rlm@421	127 place one after another\footnote{This requires that for n consecutive
rlm@421	128 additions, there are $\log_{2}n$ high guard bits to prevent overflow. In
rlm@421	129 the $\mu$FPU, there are 3 guard bits, making up to 8 consecutive additions
rlm@421	130 possible.}. An example of the Block Exponent optimization on the expression
rlm@421	131 X = A + B + C is given in figure~\ref{opt:be}.
rlm@421	132
rlm@421	133 % This is an example of how you would use tgrind to include an example
rlm@421	134 % of source code; it is commented out in this template since the code
rlm@421	135 % example file does not exist. To use it, you need to remove the '%' on the
rlm@421	136 % beginning of the line, and insert your own information in the call.
rlm@421	137 %
rlm@421	138 %\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}
rlm@421	139
rlm@421	140 \section{Integer optimizations}
rlm@421	141
rlm@421	142 As well as the floating point optimizations described above, there are
rlm@421	143 also integer optimizations that can be used in the $\mu$FPU. In concert
rlm@421	144 with the floating point optimizations, these can provide a significant
rlm@421	145 speedup.
rlm@421	146
rlm@421	147 \subsection{Conversion to fixed point}
rlm@421	148
rlm@421	149 Integer operations are much faster than floating point operations; if it is
rlm@421	150 possible to replace floating point operations with fixed point operations,
rlm@421	151 this would provide a significant increase in speed.
rlm@421	152
rlm@421	153 This conversion can either take place automatically or or based on a
rlm@421	154 specific request from the programmer. To do this automatically, the
rlm@421	155 compiler must either be very smart, or play fast and loose with the accuracy
rlm@421	156 and precision of the programmer's variables. To be ``smart'', the computer
rlm@421	157 must track the ranges of all the floating point variables through the
rlm@421	158 program, and then see if there are any potential candidates for conversion
rlm@421	159 to floating point. This technique is discussed further in
rlm@421	160 section~\ref{range-tracking}, where it was implemented.
rlm@421	161
rlm@421	162 The other way to do this is to rely on specific hints from the programmer
rlm@421	163 that a certain value will only assume a specific range, and that only a
rlm@421	164 specific precision is desired. This is somewhat more taxing on the
rlm@421	165 programmer, in that he has to know the ranges that his values will take at
rlm@421	166 declaration time (something normally abstracted away), but it does provide
rlm@421	167 the opportunity for fine-tuning already working code.
rlm@421	168
rlm@421	169 Potential applications of this would be simulation programs, where the
rlm@421	170 variable represents some physical quantity; the constraints of the physical
rlm@421	171 system may provide bounds on the range the variable can take.
rlm@421	172 \subsection{Small Constant Multiplications}
rlm@421	173
rlm@421	174 One other class of optimizations that can be done is to replace
rlm@421	175 multiplications by small integer constants into some combination of
rlm@421	176 additions and shifts. Addition and shifting can be significantly faster
rlm@421	177 than multiplication. This is done by using some combination of
rlm@421	178 \begin{eqnarray*}
rlm@421	179 a_i & = & a_j + a_k \\
rlm@421	180 a_i & = & 2a_j + a_k \\
rlm@421	181 a_i & = & 4a_j + a_k \\
rlm@421	182 a_i & = & 8a_j + a_k \\
rlm@421	183 a_i & = & a_j - a_k \\
rlm@421	184 a_i & = & a_j \ll m \mbox{shift}
rlm@421	185 \end{eqnarray*}
rlm@421	186 instead of the multiplication. For example, to multiply $s$ by 10 and store
rlm@421	187 the result in $r$, you could use:
rlm@421	188 \begin{eqnarray*}
rlm@421	189 r & = & 4s + s\\
rlm@421	190 r & = & r + r
rlm@421	191 \end{eqnarray*}
rlm@421	192 Or by 59:
rlm@421	193 \begin{eqnarray*}
rlm@421	194 t & = & 2s + s \\
rlm@421	195 r & = & 2t + s \\
rlm@421	196 r & = & 8r + t
rlm@421	197 \end{eqnarray*}
rlm@421	198 Similar combinations can be found for almost all of the smaller
rlm@421	199 integers\footnote{This optimization is only an ``optimization'', of course,
rlm@421	200 when the amount of time spent on the shifts and adds is less than the time
rlm@421	201 that would be spent doing the multiplication. Since the time costs of these
rlm@421	202 operations are known to the compiler in order for it to do scheduling, it is
rlm@421	203 easy for the compiler to determine when this optimization is worth using.}.
rlm@421	204 \cite{magenheimer:precision}
rlm@421	205
rlm@421	206 \section{Other optimizations}
rlm@421	207
rlm@421	208 \subsection{Low-level parallelism}
rlm@421	209
rlm@421	210 The current trend is towards duplicating hardware at the lowest level to
rlm@421	211 provide parallelism\footnote{This can been seen in the i860; floating point
rlm@421	212 additions and multiplications can proceed at the same time, and the RISC
rlm@421	213 core be moving data in and out of the floating point registers and providing
rlm@421	214 flow control at the same time the floating point units are active. \cite{byte:i860}}
rlm@421	215
rlm@421	216 Conceptually, it is easy to take advantage to low-level parallelism in the
rlm@421	217 instruction stream by simply adding more functional units to the $\mu$FPU,
rlm@421	218 widening the instruction word to control them, and then scheduling as many
rlm@421	219 operations to take place at one time as possible.
rlm@421	220
rlm@421	221 However, simply adding more functional units can only be done so many times;
rlm@421	222 there is only a limited amount of parallelism directly available in the
rlm@421	223 instruction stream, and without it, much of the extra resources will go to
rlm@421	224 waste. One process used to make more instructions potentially schedulable
rlm@421	225 at any given time is ``trace scheduling''. This technique originated in the
rlm@421	226 Bulldog compiler for the original VLIW machine, the ELI-512.
rlm@421	227 \cite{ellis:bulldog,colwell:vliw} In trace scheduling, code can be
rlm@421	228 scheduled through many basic blocks at one time, following a single
rlm@421	229 potential ``trace'' of program execution. In this way, instructions that
rlm@421	230 {\em might\/} be executed depending on a conditional branch further down in
rlm@421	231 the instruction stream are scheduled, allowing an increase in the potential
rlm@421	232 parallelism. To account for the cases where the expected branch wasn't
rlm@421	233 taken, correction code is inserted after the branches to undo the effects of
rlm@421	234 any prematurely executed instructions.
rlm@421	235
rlm@421	236 \subsection{Pipeline optimizations}
rlm@421	237
rlm@421	238 In addition to having operations going on in parallel across functional
rlm@421	239 units, it is also typical to have several operations in various stages of
rlm@421	240 completion in each unit. This pipelining allows the throughput of the
rlm@421	241 functional units to be increased, with no increase in latency.
rlm@421	242
rlm@421	243 There are several ways pipelined operations can be optimized. On the
rlm@421	244 hardware side, support can be added to allow data to be recirculated back
rlm@421	245 into the beginning of the pipeline from the end, saving a trip through the
rlm@421	246 registers. On the software side, the compiler can utilize several tricks to
rlm@421	247 try to fill up as many of the pipeline delay slots as possible, as
rlm@421	248 seendescribed by Gibbons. \cite{gib86}
rlm@421	249
rlm@421	250

Mercurial > cortex

annotate thesis/aux/mitthesis/chap1.tex @ 485:ac953b562eab