diff thesis/aux/mitthesis/chap1.tex @ 422:6b0f77df0e53
building latex scaffolding for thesis.
| author   | Robert McIntyre <rlm@mit.edu>            |
|----------|------------------------------------------|
| date     | Fri, 21 Mar 2014 01:17:41 -0400          |
| parents  | thesis/mitthesis/chap1.tex@c2c28c3e27c4  |
| children |                                          |
--- /dev/null    Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/aux/mitthesis/chap1.tex    Fri Mar 21 01:17:41 2014 -0400
@@ -0,0 +1,250 @@

%% This is an example first chapter.  You should put each chapter/appendix that you
%% write into a separate file, and add a line \include{yourfilename} to
%% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
%% You can process specific files by typing their names in at the
%% \files=
%% prompt when you run the file main.tex through LaTeX.
\chapter{Introduction}

Micro-optimization is a technique to reduce the overall operation count of
floating point operations.  In a standard floating point unit, floating
point operations are fairly high level, such as ``multiply'' and ``add'';
in a micro floating point unit ($\mu$FPU), these have been broken down into
their constituent low-level floating point operations on the mantissas and
exponents of the floating point numbers.

Chapter two describes the architecture of the $\mu$FPU unit, and the
motivations for the design decisions made.

Chapter three describes the design of the compiler, as well as how the
optimizations discussed in section~\ref{ch1:opts} were implemented.

Chapter four describes the purpose of the test code that was compiled, and which
statistics were gathered by running it through the simulator.  The purpose
is to measure what effect the micro-optimizations had, compared to
unoptimized code.  Possible future expansions to the project are also
discussed.

\section{Motivations for micro-optimization}

The idea of micro-optimization is motivated by the recent trends in computer
architecture towards low-level parallelism and small, pipelineable
instruction sets \cite{patterson:risc,rad83}.  By getting rid of more
complex instructions and concentrating on optimizing frequently used
instructions, substantial increases in performance were realized.

Another important motivation was the trend towards placing more of the
burden of performance on the compiler.  Many of the new architectures depend
on an intelligent, optimizing compiler in order to realize anywhere near
their peak performance
\cite{ellis:bulldog,pet87,coutant:precision-compilers}.  In these cases, the
compiler is not only responsible for faithfully generating native code to
match the source language, but must also be aware of instruction latencies,
delayed branches, pipeline stages, and a multitude of other factors in order
to generate fast code \cite{gib86}.

Taking these ideas one step further, it seems that the floating point
operations that are normally single, large instructions can be further broken
down into smaller, simpler, faster instructions, with more control in the
compiler and less in the hardware.  This is the idea behind a
micro-optimizing FPU: break the floating point instructions down into their
basic components and use a small, fast implementation, with a large part of
the burden of hardware allocation and optimization shifted towards
compile-time.

Along with the hardware speedups possible by using a $\mu$FPU, there are
also optimizations that the compiler can perform on the code that is
generated.  In a normal sequence of floating point operations, there are
many hidden redundancies that can be eliminated by allowing the compiler to
control the floating point operations down to their lowest level.  These
optimizations are described in detail in section~\ref{ch1:opts}.

\section{Description of micro-optimization}\label{ch1:opts}

In order to perform a sequence of floating point operations, a normal FPU
performs many redundant internal shifts and normalizations along the way.
However, if a compiler can decompose the floating point operations it needs
down to the lowest level, it can then optimize away many of these redundant
operations.

If there is some additional hardware support specifically for
micro-optimization, further optimizations become possible.  This hardware
support entails extra ``guard bits'' on the standard floating point formats,
to allow several unnormalized operations to be performed in a row without
loss of information\footnote{A description of the floating point format used
is shown in figures~\ref{exponent-format} and~\ref{mantissa-format}.}.  A
discussion of the mathematics behind unnormalized arithmetic is in
appendix~\ref{unnorm-math}.

The optimizations that the compiler can perform fall into several categories:

\subsection{Post Multiply Normalization}

When more than two multiplications are performed in a row, the intermediate
normalization of the results between multiplications can be eliminated.
This is because with each multiplication, the mantissa can become
denormalized by at most one bit.  If there are guard bits on the mantissas
to prevent bits from ``falling off'' the end during multiplications, the
normalization can be postponed until after a sequence of several
multiplies\footnote{Using unnormalized numbers for math is not a new idea; a
good example of it is the Control Data CDC 6600, designed by Seymour Cray
\cite{thornton:cdc6600}.  The CDC 6600 had all of its instructions perform
unnormalized arithmetic, with a separate {\tt NORMALIZE} instruction.}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}

As you can see, the intermediate results can be multiplied together, with no
need for intermediate normalizations due to the guard bits.  It is only at
the end of the operation that a normalization must be performed, in order
to get the result into a format suitable for storing in memory\footnote{Note
that for purposes of clarity, the pipeline delays were considered to be 0, and
the branches were not delayed.}.
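As a purely illustrative numeric sketch (the operand values, and the
assumption of mantissas kept in $[1,2)$ with guard bits above the leading
position, are chosen here for concreteness and are not the actual $\mu$FPU
format), consider computing $a \cdot b \cdot c$ with $a = 1.75 \times 2^{1}$,
$b = 1.5 \times 2^{2}$, and $c = 1.25 \times 2^{0}$:
\begin{eqnarray*}
a \cdot b & = & (1.75 \times 1.5) \times 2^{1+2} \; = \; 2.625 \times 2^{3} \\
(a \cdot b) \cdot c & = & (2.625 \times 1.25) \times 2^{3+0} \; = \; 3.28125 \times 2^{3} \\
a \cdot b \cdot c & = & 1.640625 \times 2^{4}
\end{eqnarray*}
The two intermediate mantissas ($2.625$ and $3.28125$) are left unnormalized,
with the overflow held in the guard bits; only the last line performs a
normalization, shifting the mantissa right one place and adjusting the
exponent.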
\subsection{Block Exponent}

In an unoptimized sequence of additions, the sequence of operations is as
follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$).
\begin{enumerate}
  \item Compare $e_1$ and $e_2$.
  \item Shift the mantissa associated with the smaller exponent $|e_1-e_2|$
        places to the right.
  \item Add $m_1$ and $m_2$.
  \item Find the first one bit in the resulting mantissa.
  \item Shift the resulting mantissa so that it is normalized.
  \item Adjust the exponent accordingly.
\end{enumerate}

Out of 6 steps, only one is the actual addition; the rest are involved
in aligning the mantissas prior to the add, and then normalizing the result
afterward.  In the block exponent optimization, the largest exponent is
found first, and all the mantissas are shifted before any additions
take place.  Once the mantissas have been shifted, the additions can take
place one after another\footnote{This requires that for $n$ consecutive
additions, there are $\log_{2}n$ high guard bits to prevent overflow.  In
the $\mu$FPU, there are 3 guard bits, making up to 8 consecutive additions
possible.}.  An example of the Block Exponent optimization on the expression
X = A + B + C is given in figure~\ref{opt:be}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}
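Since figure~\ref{opt:be} is commented out in this template, a small numeric
sketch of the same expression (with values chosen here purely for
illustration) conveys the idea.  Let $A = 1.5 \times 2^{4}$,
$B = 1.25 \times 2^{2}$, and $C = 1.75 \times 2^{3}$.  The largest exponent,
$2^{4}$, is found first, and every mantissa is aligned to it before any
addition takes place:
\begin{eqnarray*}
A \rightarrow 1.5, & \quad & B \rightarrow 0.3125, \quad C \rightarrow 0.875 \\
1.5 + 0.3125 & = & 1.8125 \\
1.8125 + 0.875 & = & 2.6875 \\
X \; = \; 2.6875 \times 2^{4} & = & 1.34375 \times 2^{5}
\end{eqnarray*}
The two additions proceed back to back with no comparisons or shifts between
them; the unnormalized sum $2.6875$ is held in a high guard bit, and a single
normalization at the end produces the final result.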
\section{Integer optimizations}

As well as the floating point optimizations described above, there are
also integer optimizations that can be used in the $\mu$FPU.  In concert
with the floating point optimizations, these can provide a significant
speedup.

\subsection{Conversion to fixed point}

Integer operations are much faster than floating point operations; if it is
possible to replace floating point operations with fixed point operations,
this would provide a significant increase in speed.

This conversion can take place either automatically or based on a specific
request from the programmer.  To do this automatically, the compiler must
either be very smart, or play fast and loose with the accuracy and precision
of the programmer's variables.  To be ``smart'', the compiler must track the
ranges of all the floating point variables through the program, and then see
if there are any potential candidates for conversion to fixed point.  This
technique is discussed further in section~\ref{range-tracking}, where it was
implemented.

The other way to do this is to rely on specific hints from the programmer
that a certain value will only assume a specific range, and that only a
specific precision is desired.  This is somewhat more taxing on the
programmer, in that he has to know the ranges that his values will take at
declaration time (something normally abstracted away), but it does provide
the opportunity for fine-tuning already working code.

Potential applications of this would be simulation programs, where the
variable represents some physical quantity; the constraints of the physical
system may provide bounds on the range the variable can take.
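As a sketch of the kind of conversion involved (the variable, its range, its
precision, and the resulting 16-bit width are all chosen here purely for
illustration), a value $v$ known to satisfy $0 \le v < 1000$ and to need no
more than $2^{-6}$ of precision can be carried as a scaled integer
$\hat{v} = \mbox{round}(v \times 2^{6})$:
\begin{eqnarray*}
0 \le v < 1000,\; \mbox{precision } 2^{-6} & \Rightarrow &
  10 \mbox{ integer bits} + 6 \mbox{ fraction bits} = 16 \mbox{ bits} \\
\hat{v} & = & \mbox{round}(v \times 2^{6}) \\
v_1 + v_2 & \rightarrow & \hat{v}_1 + \hat{v}_2 \\
v_1 \times v_2 & \rightarrow & (\hat{v}_1 \times \hat{v}_2) \gg 6
\end{eqnarray*}
Additions carry over directly, and each multiplication needs only a single
extra right shift to restore the binary point; both become integer
operations.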
\subsection{Small Constant Multiplications}

One other class of optimizations that can be done is to replace
multiplications by small integer constants with some combination of
additions and shifts.  Addition and shifting can be significantly faster
than multiplication.  This is done by using some combination of
\begin{eqnarray*}
a_i & = & a_j + a_k \\
a_i & = & 2a_j + a_k \\
a_i & = & 4a_j + a_k \\
a_i & = & 8a_j + a_k \\
a_i & = & a_j - a_k \\
a_i & = & a_j \ll m \quad \mbox{(shift)}
\end{eqnarray*}
instead of the multiplication.  For example, to multiply $s$ by 10 and store
the result in $r$, you could use:
\begin{eqnarray*}
r & = & 4s + s \\
r & = & r + r
\end{eqnarray*}
Or by 59:
\begin{eqnarray*}
t & = & 2s + s \\
r & = & 2t + s \\
r & = & 8r + t
\end{eqnarray*}
Similar combinations can be found for almost all of the smaller
integers\footnote{This optimization is only an ``optimization'', of course,
when the amount of time spent on the shifts and adds is less than the time
that would be spent doing the multiplication.  Since the time costs of these
operations are known to the compiler in order for it to do scheduling, it is
easy for the compiler to determine when this optimization is worth
using.} \cite{magenheimer:precision}.

\section{Other optimizations}

\subsection{Low-level parallelism}

The current trend is towards duplicating hardware at the lowest level to
provide parallelism\footnote{This can be seen in the i860: floating point
additions and multiplications can proceed at the same time, and the RISC
core can be moving data in and out of the floating point registers and
providing flow control at the same time the floating point units are active
\cite{byte:i860}.}.

Conceptually, it is easy to take advantage of low-level parallelism in the
instruction stream by simply adding more functional units to the $\mu$FPU,
widening the instruction word to control them, and then scheduling as many
operations to take place at one time as possible.

However, simply adding more functional units can only be done so many times;
there is only a limited amount of parallelism directly available in the
instruction stream, and without it, much of the extra resources will go to
waste.  One process used to make more instructions potentially schedulable
at any given time is ``trace scheduling''.  This technique originated in the
Bulldog compiler for the original VLIW machine, the ELI-512
\cite{ellis:bulldog,colwell:vliw}.  In trace scheduling, code can be
scheduled through many basic blocks at one time, following a single
potential ``trace'' of program execution.  In this way, instructions that
{\em might\/} be executed depending on a conditional branch further down in
the instruction stream are scheduled, allowing an increase in the potential
parallelism.  To account for the cases where the expected branch wasn't
taken, correction code is inserted after the branches to undo the effects of
any prematurely executed instructions.

\subsection{Pipeline optimizations}

In addition to having operations going on in parallel across functional
units, it is also typical to have several operations in various stages of
completion in each unit.  This pipelining allows the throughput of the
functional units to be increased, with no increase in latency.

There are several ways pipelined operations can be optimized.  On the
hardware side, support can be added to allow data to be recirculated back
into the beginning of the pipeline from the end, saving a trip through the
registers.  On the software side, the compiler can utilize several tricks to
try to fill as many of the pipeline delay slots as possible, as described by
Gibbons \cite{gib86}.
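To make the throughput/latency point above concrete, the sketch below shows
two back-to-back operations flowing through a generic four-stage pipeline
(the stage names are illustrative only, not the actual $\mu$FPU pipeline):
each operation still takes four cycles from start to finish, but once the
pipeline is full a result is completed every cycle.
\begin{center}
\begin{tabular}{l|ccccc}
            & cycle 1 & cycle 2 & cycle 3 & cycle 4   & cycle 5   \\ \hline
operation 1 & fetch   & align   & add     & normalize &           \\
operation 2 &         & fetch   & align   & add       & normalize \\
\end{tabular}
\end{center}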