diff thesis/aux/mitthesis/chap1.tex @ 460:763d13f77e03

merge in laptop changes.
author Robert McIntyre <rlm@mit.edu>
date Thu, 27 Mar 2014 17:57:01 -0400
parents 6b0f77df0e53
children
line wrap: on
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/thesis/aux/mitthesis/chap1.tex	Thu Mar 27 17:57:01 2014 -0400
     1.3 @@ -0,0 +1,250 @@
     1.4 +%% This is an example first chapter.  You should put chapter/appendix that you
     1.5 +%% write into a separate file, and add a line \include{yourfilename} to
     1.6 +%% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
     1.7 +%% You can process specific files by typing their names in at the 
     1.8 +%% \files=
     1.9 +%% prompt when you run the file main.tex through LaTeX.
    1.10 +\chapter{Introduction}
    1.11 +
    1.12 +Micro-optimization is a technique to reduce the overall operation count of
    1.13 +floating point operations.  In a standard floating point unit, floating
    1.14 +point operations are fairly high level, such as ``multiply'' and ``add'';
    1.15 +in a micro floating point unit ($\mu$FPU), these have been broken down into
    1.16 +their constituent low-level floating point operations on the mantissas and
    1.17 +exponents of the floating point numbers.
    1.18 +
    1.19 +Chapter two describes the architecture of the $\mu$FPU unit, and the
    1.20 +motivations for the design decisions made.
    1.21 +
    1.22 +Chapter three describes the design of the compiler, as well as how the
    1.23 +optimizations discussed in section~\ref{ch1:opts} were implemented.
    1.24 +
    1.25 +Chapter four describes the purpose of test code that was compiled, and which
    1.26 +statistics were gathered by running it through the simulator.  The purpose
    1.27 +is to measure what effect the micro-optimizations had, compared to
    1.28 +unoptimized code.  Possible future expansions to the project are also
    1.29 +discussed.
    1.30 +
    1.31 +\section{Motivations for micro-optimization}
    1.32 +
    1.33 +The idea of micro-optimization is motivated by the recent trends in computer
    1.34 +architecture towards low-level parallelism and small, pipelineable
    1.35 +instruction sets \cite{patterson:risc,rad83}.  By getting rid of more
    1.36 +complex instructions and concentrating on optimizing frequently used
    1.37 +instructions, substantial increases in performance were realized.
    1.38 +
    1.39 +Another important motivation was the trend towards placing more of the
    1.40 +burden of performance on the compiler.  Many of the new architectures depend
    1.41 +on an intelligent, optimizing compiler in order to realize anywhere near
    1.42 +their peak performance
    1.43 +\cite{ellis:bulldog,pet87,coutant:precision-compilers}.  In these cases, the
    1.44 +compiler not only is responsible for faithfully generating native code to
    1.45 +match the source language, but also must be aware of instruction latencies,
    1.46 +delayed branches, pipeline stages, and a multitude of other factors in order
    1.47 +to generate fast code \cite{gib86}.
    1.48 +
    1.49 +Taking these ideas one step further, it seems that the floating point
    1.50 +operations that are normally single, large instructions can be further broken
    1.51 +down into smaller, simpler, faster instructions, with more control in the
    1.52 +compiler and less in the hardware.  This is the idea behind a
    1.53 +micro-optimizing FPU; break the floating point instructions down into their
    1.54 +basic components and use a small, fast implementation, with a large part of
    1.55 +the burden of hardware allocation and optimization shifted towards
    1.56 +compile-time.
    1.57 +
    1.58 +Along with the hardware speedups possible by using a $\mu$FPU, there are
    1.59 +also optimizations that the compiler can perform on the code that is
    1.60 +generated.  In a normal sequence of floating point operations, there are
    1.61 +many hidden redundancies that can be eliminated by allowing the compiler to
    1.62 +control the floating point operations down to their lowest level.  These
    1.63 +optimizations are described in detail in section~\ref{ch1:opts}.
    1.64 +
    1.65 +\section{Description of micro-optimization}\label{ch1:opts}
    1.66 +
    1.67 +In order to perform a sequence of floating point operations, a normal FPU
    1.68 +performs many redundant internal shifts and normalizations in the process of
    1.69 +performing a sequence of operations.  However, if a compiler can
    1.70 +decompose the floating point operations it needs down to the lowest level,
    1.71 +it then can optimize away many of these redundant operations.  
    1.72 +
    1.73 +If there is some additional hardware support specifically for
    1.74 +micro-optimization, there are additional optimizations that can be
    1.75 +performed.  This hardware support entails extra ``guard bits'' on the
    1.76 +standard floating point formats, to allow several unnormalized operations to
    1.77 +be performed in a row without the loss information\footnote{A description of
    1.78 +the floating point format used is shown in figures~\ref{exponent-format}
    1.79 +and~\ref{mantissa-format}.}.  A discussion of the mathematics behind
    1.80 +unnormalized arithmetic is in appendix~\ref{unnorm-math}.
    1.81 +
    1.82 +The optimizations that the compiler can perform fall into several categories:
    1.83 +
    1.84 +\subsection{Post Multiply Normalization}
    1.85 +
    1.86 +When more than two multiplications are performed in a row, the intermediate
    1.87 +normalization of the results between multiplications can be eliminated.
    1.88 +This is because with each multiplication, the mantissa can become
    1.89 +denormalized by at most one bit.  If there are guard bits on the mantissas
    1.90 +to prevent bits from ``falling off'' the end during multiplications, the
    1.91 +normalization can be postponed until after a sequence of several
    1.92 +multiplies\footnote{Using unnormalized numbers for math is not a new idea; a
    1.93 +good example of it is the Control Data CDC 6600, designed by Seymour Cray.
    1.94 +\cite{thornton:cdc6600} The CDC 6600 had all of its instructions performing
    1.95 +unnormalized arithmetic, with a separate {\tt NORMALIZE} instruction.}.
    1.96 +
    1.97 +% This is an example of how you would use tgrind to include an example
    1.98 +% of source code; it is commented out in this template since the code
    1.99 +% example file does not exist.  To use it, you need to remove the '%' on the
   1.100 +% beginning of the line, and insert your own information in the call.
   1.101 +%
   1.102 +%\tagrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}
   1.103 +
   1.104 +As you can see, the intermediate results can be multiplied together, with no
   1.105 +need for intermediate normalizations due to the guard bit.  It is only at
   1.106 +the end of the operation that the normalization must be performed, in order
   1.107 +to get it into a format suitable for storing in memory\footnote{Note that
   1.108 +for purposed of clarity, the pipeline delays were considered to be 0, and
   1.109 +the branches were not delayed.}.
   1.110 +
   1.111 +\subsection{Block Exponent}
   1.112 +
   1.113 +In a unoptimized sequence of additions, the sequence of operations is as
   1.114 +follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$).
   1.115 +\begin{enumerate}
   1.116 +  \item Compare $e_1$ and $e_2$.
   1.117 +  \item Shift the mantissa associated with the smaller exponent $|e_1-e_2|$
   1.118 +        places to the right.
   1.119 +  \item Add $m_1$ and $m_2$.
   1.120 +  \item Find the first one in the resulting mantissa.
   1.121 +  \item Shift the resulting mantissa so that normalized
   1.122 +  \item Adjust the exponent accordingly.
   1.123 +\end{enumerate}
   1.124 +
   1.125 +Out of 6 steps, only one is the actual addition, and the rest are involved
   1.126 +in aligning the mantissas prior to the add, and then normalizing the result
   1.127 +afterward.  In the block exponent optimization, the largest mantissa is
   1.128 +found to start with, and all the mantissa's shifted before any additions
   1.129 +take place.  Once the mantissas have been shifted, the additions can take
   1.130 +place one after another\footnote{This requires that for n consecutive
   1.131 +additions, there are $\log_{2}n$ high guard bits to prevent overflow.  In
   1.132 +the $\mu$FPU, there are 3 guard bits, making up to 8 consecutive additions
   1.133 +possible.}.  An example of the Block Exponent optimization on the expression
   1.134 +X = A + B + C is given in figure~\ref{opt:be}.
   1.135 +
   1.136 +% This is an example of how you would use tgrind to include an example
   1.137 +% of source code; it is commented out in this template since the code
   1.138 +% example file does not exist.  To use it, you need to remove the '%' on the
   1.139 +% beginning of the line, and insert your own information in the call.
   1.140 +%
   1.141 +%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}
   1.142 +
   1.143 +\section{Integer optimizations}
   1.144 +
   1.145 +As well as the floating point optimizations described above, there are
   1.146 +also integer optimizations that can be used in the $\mu$FPU.  In concert
   1.147 +with the floating point optimizations, these can provide a significant
   1.148 +speedup.  
   1.149 +
   1.150 +\subsection{Conversion to fixed point}
   1.151 +
   1.152 +Integer operations are much faster than floating point operations; if it is
   1.153 +possible to replace floating point operations with fixed point operations,
   1.154 +this would provide a significant increase in speed.
   1.155 +
   1.156 +This conversion can either take place automatically or or based on a
   1.157 +specific request from the programmer.  To do this automatically, the
   1.158 +compiler must either be very smart, or play fast and loose with the accuracy
   1.159 +and precision of the programmer's variables.  To be ``smart'', the computer
   1.160 +must track the ranges of all the floating point variables through the
   1.161 +program, and then see if there are any potential candidates for conversion
   1.162 +to floating point.  This technique is discussed further in
   1.163 +section~\ref{range-tracking}, where it was implemented.
   1.164 +
   1.165 +The other way to do this is to rely on specific hints from the programmer
   1.166 +that a certain value will only assume a specific range, and that only a
   1.167 +specific precision is desired.  This is somewhat more taxing on the
   1.168 +programmer, in that he has to know the ranges that his values will take at
   1.169 +declaration time (something normally abstracted away), but it does provide
   1.170 +the opportunity for fine-tuning already working code.
   1.171 +
   1.172 +Potential applications of this would be simulation programs, where the
   1.173 +variable represents some physical quantity; the constraints of the physical
   1.174 +system may provide bounds on the range the variable can take.
   1.175 +\subsection{Small Constant Multiplications}
   1.176 +
   1.177 +One other class of optimizations that can be done is to replace
   1.178 +multiplications by small integer constants into some combination of
   1.179 +additions and shifts.  Addition and shifting can be significantly faster
   1.180 +than multiplication.  This is done by using some combination of
   1.181 +\begin{eqnarray*}
   1.182 +a_i & = & a_j + a_k \\
   1.183 +a_i & = & 2a_j + a_k \\
   1.184 +a_i & = & 4a_j + a_k \\
   1.185 +a_i & = & 8a_j + a_k \\
   1.186 +a_i & = & a_j - a_k \\
   1.187 +a_i & = & a_j \ll m \mbox{shift}
   1.188 +\end{eqnarray*}
   1.189 +instead of the multiplication.  For example, to multiply $s$ by 10 and store
   1.190 +the result in $r$, you could use:
   1.191 +\begin{eqnarray*}
   1.192 +r & = & 4s + s\\
   1.193 +r & = & r + r
   1.194 +\end{eqnarray*}
   1.195 +Or by 59:
   1.196 +\begin{eqnarray*}
   1.197 +t & = & 2s + s \\
   1.198 +r & = & 2t + s \\
   1.199 +r & = & 8r + t
   1.200 +\end{eqnarray*}
   1.201 +Similar combinations can be found for almost all of the smaller
   1.202 +integers\footnote{This optimization is only an ``optimization'', of course,
   1.203 +when the amount of time spent on the shifts and adds is less than the time
   1.204 +that would be spent doing the multiplication.  Since the time costs of these
   1.205 +operations are known to the compiler in order for it to do scheduling, it is
   1.206 +easy for the compiler to determine when this optimization is worth using.}.
   1.207 +\cite{magenheimer:precision}
   1.208 +
   1.209 +\section{Other optimizations}
   1.210 +
   1.211 +\subsection{Low-level parallelism}
   1.212 +
   1.213 +The current trend is towards duplicating hardware at the lowest level to
   1.214 +provide parallelism\footnote{This can been seen in the i860; floating point
   1.215 +additions and multiplications can proceed at the same time, and the RISC
   1.216 +core be moving data in and out of the floating point registers and providing
   1.217 +flow control at the same time the floating point units are active. \cite{byte:i860}}
   1.218 +
   1.219 +Conceptually, it is easy to take advantage to low-level parallelism in the
   1.220 +instruction stream by simply adding more functional units to the $\mu$FPU,
   1.221 +widening the instruction word to control them, and then scheduling as many
   1.222 +operations to take place at one time as possible.
   1.223 +
   1.224 +However, simply adding more functional units can only be done so many times;
   1.225 +there is only a limited amount of parallelism directly available in the
   1.226 +instruction stream, and without it, much of the extra resources will go to
   1.227 +waste.  One process used to make more instructions potentially schedulable
   1.228 +at any given time is ``trace scheduling''.  This technique originated in the
   1.229 +Bulldog compiler for the original VLIW machine, the ELI-512.
   1.230 +\cite{ellis:bulldog,colwell:vliw}  In trace scheduling, code can be
   1.231 +scheduled through many basic blocks at one time, following a single
   1.232 +potential ``trace'' of program execution.  In this way, instructions that
   1.233 +{\em might\/} be executed depending on a conditional branch further down in
   1.234 +the instruction stream are scheduled, allowing an increase in the potential
   1.235 +parallelism.  To account for the cases where the expected branch wasn't
   1.236 +taken, correction code is inserted after the branches to undo the effects of
   1.237 +any prematurely executed instructions.
   1.238 +
   1.239 +\subsection{Pipeline optimizations}
   1.240 +
   1.241 +In addition to having operations going on in parallel across functional
   1.242 +units, it is also typical to have several operations in various stages of
   1.243 +completion in each unit.  This pipelining allows the throughput of the
   1.244 +functional units to be increased, with no increase in latency.
   1.245 +
   1.246 +There are several ways pipelined operations can be optimized.  On the
   1.247 +hardware side, support can be added to allow data to be recirculated back
   1.248 +into the beginning of the pipeline from the end, saving a trip through the
   1.249 +registers.  On the software side, the compiler can utilize several tricks to
   1.250 +try to fill up as many of the pipeline delay slots as possible, as
   1.251 +seendescribed by Gibbons. \cite{gib86}
   1.252 +
   1.253 +