%% This is an example first chapter.  You should put chapter/appendix that you
%% write into a separate file, and add a line \include{yourfilename} to
%% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
%% You can process specific files by typing their names in at the
%% \files=
%% prompt when you run the file main.tex through LaTeX.
\chapter{Introduction}

Micro-optimization is a technique for reducing the overall operation count
of floating point computations.  In a standard floating point unit, floating
point operations are fairly high level, such as ``multiply'' and ``add'';
in a micro floating point unit ($\mu$FPU), these have been broken down into
their constituent low-level operations on the mantissas and exponents of the
floating point numbers.

Chapter two describes the architecture of the $\mu$FPU unit, and the
motivations for the design decisions made.

Chapter three describes the design of the compiler, as well as how the
optimizations discussed in section~\ref{ch1:opts} were implemented.

Chapter four describes the purpose of the test code that was compiled, and
which statistics were gathered by running it through the simulator.  The
purpose is to measure what effect the micro-optimizations had, compared to
unoptimized code.  Possible future expansions to the project are also
discussed.

\section{Motivations for micro-optimization}

The idea of micro-optimization is motivated by the recent trends in computer
architecture towards low-level parallelism and small, pipelineable
instruction sets \cite{patterson:risc,rad83}.  By getting rid of more
complex instructions and concentrating on optimizing frequently used
instructions, substantial increases in performance were realized.

Another important motivation was the trend towards placing more of the
burden of performance on the compiler.  Many of the new architectures depend
on an intelligent, optimizing compiler in order to realize anywhere near
their peak performance
\cite{ellis:bulldog,pet87,coutant:precision-compilers}.  In these cases, the
compiler is not only responsible for faithfully generating native code to
match the source language, but must also be aware of instruction latencies,
delayed branches, pipeline stages, and a multitude of other factors in order
to generate fast code \cite{gib86}.

Taking these ideas one step further, it seems that the floating point
operations that are normally single, large instructions can be further
broken down into smaller, simpler, faster instructions, with more control in
the compiler and less in the hardware.  This is the idea behind a
micro-optimizing FPU: break the floating point instructions down into their
basic components and use a small, fast implementation, with a large part of
the burden of hardware allocation and optimization shifted towards
compile-time.
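To make the decomposition concrete, the following sketch (written in C
rather than in the $\mu$FPU's actual instruction set, and using a simplified
software floating point format with hypothetical names such as {\tt sf\_mul})
separates the mantissa multiply, exponent add, and normalization steps that a
conventional FPU hides inside a single multiply instruction.

\begin{verbatim}
#include <stdint.h>

/* Illustrative format only: value = mant * 2^exp, with the mantissa
   normalized so that its top bit (bit 31) is set. */
typedef struct { uint32_t mant; int32_t exp; } sfloat;

/* One "multiply" decomposed into its low-level steps. */
sfloat sf_mul(sfloat a, sfloat b)
{
    sfloat r;
    /* 1. multiply the mantissas */
    uint64_t p = (uint64_t)a.mant * b.mant;
    /* 2. add the exponents, and 3. renormalize: the product of two
       normalized mantissas is denormalized by at most one bit. */
    if (p & 0x8000000000000000ull) {
        r.mant = (uint32_t)(p >> 32);
        r.exp  = a.exp + b.exp + 32;
    } else {
        r.mant = (uint32_t)(p >> 31);
        r.exp  = a.exp + b.exp + 31;
    }
    return r;
}
\end{verbatim}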
Along with the hardware speedups possible by using a $\mu$FPU, there are
also optimizations that the compiler can perform on the code that is
generated.  In a normal sequence of floating point operations, there are
many hidden redundancies that can be eliminated by allowing the compiler to
control the floating point operations down to their lowest level.  These
optimizations are described in detail in section~\ref{ch1:opts}.

\section{Description of micro-optimization}\label{ch1:opts}

In performing a sequence of floating point operations, a normal FPU carries
out many redundant internal shifts and normalizations.  However, if a
compiler can decompose the floating point operations it needs down to the
lowest level, it can then optimize away many of these redundant operations.

If there is some additional hardware support specifically for
micro-optimization, further optimizations can be performed.  This hardware
support entails extra ``guard bits'' on the standard floating point formats,
to allow several unnormalized operations to be performed in a row without
loss of information\footnote{A description of the floating point format used
is shown in figures~\ref{exponent-format} and~\ref{mantissa-format}.}.  A
discussion of the mathematics behind unnormalized arithmetic is in
appendix~\ref{unnorm-math}.

The optimizations that the compiler can perform fall into several categories:

\subsection{Post Multiply Normalization}

When more than two multiplications are performed in a row, the intermediate
normalization of the results between multiplications can be eliminated.
This is because with each multiplication, the mantissa can become
denormalized by at most one bit.  If there are guard bits on the mantissas
to prevent bits from ``falling off'' the end during multiplications, the
normalization can be postponed until after a sequence of several
multiplies\footnote{Using unnormalized numbers for math is not a new idea; a
good example of it is the Control Data CDC 6600, designed by Seymour Cray
\cite{thornton:cdc6600}.  The CDC 6600 had all of its instructions performing
unnormalized arithmetic, with a separate {\tt NORMALIZE} instruction.}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}

As you can see, the intermediate results can be multiplied together, with no
need for intermediate normalizations due to the guard bits.  It is only at
the end of the operation that the normalization must be performed, in order
to get the result into a format suitable for storing in memory\footnote{Note
that for purposes of clarity, the pipeline delays were considered to be 0,
and the branches were not delayed.}.
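The assembly listing referred to above is not included in this template, so
the following C sketch stands in for it; the format and the micro-operation
names ({\tt umul}, {\tt unorm}) are illustrative assumptions rather than the
$\mu$FPU's actual instructions.  It shows a chain of multiplies with a single
normalization at the end instead of one after every multiply.

\begin{verbatim}
#include <stdint.h>

/* Illustrative format: value = mant * 2^exp; "normalized" means bit 31
   of mant is set.  The low mantissa bits act as guard bits, so a short
   chain of unnormalized multiplies does not lose significant bits. */
typedef struct { uint32_t mant; int32_t exp; } ufp;

/* Raw multiply micro-op: the result is not renormalized, so each
   multiply in a chain can add at most one more bit of denormalization. */
static ufp umul(ufp a, ufp b)
{
    ufp r;
    r.mant = (uint32_t)(((uint64_t)a.mant * b.mant) >> 32);
    r.exp  = a.exp + b.exp + 32;
    return r;
}

/* Normalization micro-op, issued once at the end of the chain. */
static ufp unorm(ufp a)
{
    while (a.mant && !(a.mant & 0x80000000u)) {
        a.mant <<= 1;
        a.exp  -= 1;
    }
    return a;
}

/* r = a*b*c*d with a single normalization instead of three. */
ufp chain_mul(ufp a, ufp b, ufp c, ufp d)
{
    return unorm(umul(umul(umul(a, b), c), d));
}
\end{verbatim}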
\subsection{Block Exponent}

In an unoptimized sequence of additions, the sequence of operations is as
follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$):
\begin{enumerate}
\item Compare $e_1$ and $e_2$.
\item Shift the mantissa associated with the smaller exponent $|e_1-e_2|$
places to the right.
\item Add $m_1$ and $m_2$.
\item Find the first one in the resulting mantissa.
\item Shift the resulting mantissa so that it is normalized.
\item Adjust the exponent accordingly.
\end{enumerate}

Out of these six steps, only one is the actual addition; the rest are
involved in aligning the mantissas prior to the add, and then normalizing
the result afterward.  In the block exponent optimization, the largest
exponent is found to start with, and all the mantissas are shifted before
any additions take place.  Once the mantissas have been shifted, the
additions can take place one after another\footnote{This requires that for
$n$ consecutive additions, there are $\log_{2}n$ high guard bits to prevent
overflow.  In the $\mu$FPU, there are 3 guard bits, making up to 8
consecutive additions possible.}.  An example of the block exponent
optimization on the expression $X = A + B + C$ is given in
figure~\ref{opt:be}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}
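Since the listing for figure~\ref{opt:be} is not included in this template,
the following C sketch shows the same idea on an array of positive operands.
The format, the 3-bit alignment margin standing in for the high guard bits,
and the name {\tt block\_sum} are illustrative assumptions, not the
$\mu$FPU's actual code.

\begin{verbatim}
#include <stdint.h>

/* Illustrative format: value = mant * 2^exp, positive operands only;
   "normalized" means bit 31 of mant is set. */
typedef struct { uint32_t mant; int32_t exp; } ufp;

/* Block exponent sum: one alignment pass, one raw add per operand, and
   a single normalization at the end.  Shifting an extra 3 bits to the
   right models the 3 high guard bits, so up to 8 additions cannot
   overflow the mantissa width being modeled. */
ufp block_sum(const ufp *x, int n)
{
    int32_t emax = x[0].exp;              /* find the largest exponent */
    for (int i = 1; i < n; i++)
        if (x[i].exp > emax) emax = x[i].exp;

    uint64_t acc = 0;                     /* align, then add back to back */
    for (int i = 0; i < n; i++) {
        int s = emax - x[i].exp + 3;
        acc += (s < 32) ? (x[i].mant >> s) : 0;
    }

    ufp r = { 0, emax + 3 };              /* normalize once at the end */
    while (acc > 0xFFFFFFFFull) { acc >>= 1; r.exp++; }
    r.mant = (uint32_t)acc;
    while (r.mant && !(r.mant & 0x80000000u)) { r.mant <<= 1; r.exp--; }
    return r;
}
\end{verbatim}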
\section{Integer optimizations}

As well as the floating point optimizations described above, there are
also integer optimizations that can be used in the $\mu$FPU.  In concert
with the floating point optimizations, these can provide a significant
speedup.

\subsection{Conversion to fixed point}

Integer operations are much faster than floating point operations; if it is
possible to replace floating point operations with fixed point operations,
this would provide a significant increase in speed.

This conversion can either take place automatically or based on a specific
request from the programmer.  To do this automatically, the compiler must
either be very smart, or play fast and loose with the accuracy and precision
of the programmer's variables.  To be ``smart'', the compiler must track the
ranges of all the floating point variables through the program, and then see
if there are any potential candidates for conversion to fixed point.  This
technique is discussed further in section~\ref{range-tracking}, where its
implementation is described.

The other way to do this is to rely on specific hints from the programmer
that a certain value will only assume a specific range, and that only a
specific precision is desired.  This is somewhat more taxing on the
programmer, in that he has to know the ranges that his values will take at
declaration time (something normally abstracted away), but it does provide
the opportunity for fine-tuning already working code.

Potential applications of this would be simulation programs, where the
variable represents some physical quantity; the constraints of the physical
system may provide bounds on the range the variable can take.
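As a sketch of what such a programmer-directed conversion might look like,
the following C fragment uses an assumed 16.16 fixed point layout for a
bounded quantity; the type and helper names ({\tt fix16}, {\tt fix16\_mul})
and the choice of 16 fraction bits are illustrative only, not part of the
$\mu$FPU tool chain.

\begin{verbatim}
#include <stdint.h>

/* Illustrative Q16.16 fixed point: a value known to stay within about
   +/-32768 is stored as value * 2^16 in a 32-bit integer. */
typedef int32_t fix16;

#define FIX16_ONE (1 << 16)

static fix16  fix16_from_double(double x) { return (fix16)(x * FIX16_ONE); }
static double fix16_to_double(fix16 x)    { return (double)x / FIX16_ONE; }

/* Multiply uses a 64-bit intermediate, then rescales: no exponent
   alignment or normalization is needed, only integer ops and a shift. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Example: position update for a bounded physical quantity. */
fix16 step(fix16 pos, fix16 vel, fix16 dt)
{
    return pos + fix16_mul(vel, dt);   /* add and multiply are integer ops */
}
\end{verbatim}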
\subsection{Small Constant Multiplications}

One other class of optimizations that can be done is to replace
multiplications by small integer constants with some combination of
additions and shifts.  Addition and shifting can be significantly faster
than multiplication.  This is done by using some combination of
\begin{eqnarray*}
a_i & = & a_j + a_k \\
a_i & = & 2a_j + a_k \\
a_i & = & 4a_j + a_k \\
a_i & = & 8a_j + a_k \\
a_i & = & a_j - a_k \\
a_i & = & a_j \ll m \quad \mbox{(shift left by $m$)}
\end{eqnarray*}
instead of the multiplication.  For example, to multiply $s$ by 10 and store
the result in $r$, you could use:
\begin{eqnarray*}
r & = & 4s + s\\
r & = & r + r
\end{eqnarray*}
Or by 59:
\begin{eqnarray*}
t & = & 2s + s \\
r & = & 2t + s \\
r & = & 8r + t
\end{eqnarray*}
Similar combinations can be found for almost all of the smaller integers
\cite{magenheimer:precision}\footnote{This optimization is only an
``optimization'', of course, when the amount of time spent on the shifts and
adds is less than the time that would be spent doing the multiplication.
Since the time costs of these operations are known to the compiler in order
for it to do scheduling, it is easy for the compiler to determine when this
optimization is worth using.}.

\section{Other optimizations}

\subsection{Low-level parallelism}

The current trend is towards duplicating hardware at the lowest level to
provide parallelism\footnote{This can be seen in the i860: floating point
additions and multiplications can proceed at the same time, and the RISC
core can be moving data in and out of the floating point registers and
providing flow control while the floating point units are active
\cite{byte:i860}.}.

Conceptually, it is easy to take advantage of low-level parallelism in the
instruction stream by simply adding more functional units to the $\mu$FPU,
widening the instruction word to control them, and then scheduling as many
operations to take place at one time as possible.

However, simply adding more functional units can only be done so many times;
there is only a limited amount of parallelism directly available in the
instruction stream, and without it, much of the extra hardware will go to
waste.  One process used to make more instructions potentially schedulable
at any given time is ``trace scheduling''.  This technique originated in the
Bulldog compiler for the original VLIW machine, the ELI-512
\cite{ellis:bulldog,colwell:vliw}.  In trace scheduling, code can be
scheduled through many basic blocks at one time, following a single
potential ``trace'' of program execution.  In this way, instructions that
{\em might\/} be executed depending on a conditional branch further down in
the instruction stream are scheduled, allowing an increase in the potential
parallelism.  To account for the cases where the expected branch wasn't
taken, correction code is inserted after the branches to undo the effects of
any prematurely executed instructions.

\subsection{Pipeline optimizations}

In addition to having operations going on in parallel across functional
units, it is also typical to have several operations in various stages of
completion in each unit.  This pipelining allows the throughput of the
functional units to be increased, with no increase in latency.

There are several ways pipelined operations can be optimized.  On the
hardware side, support can be added to allow data to be recirculated back
into the beginning of the pipeline from the end, saving a trip through the
registers.  On the software side, the compiler can utilize several tricks to
try to fill up as many of the pipeline delay slots as possible, as described
by Gibbons \cite{gib86}.
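As one illustration of the software side, the following C sketch (a common
scheduling trick, not taken from the thesis compiler) accumulates a sum into
several independent partial sums, so that a pipelined adder can accept a new
operation before the previous one completes; the unrolling factor of four is
an assumed pipeline depth, not a measured property of the $\mu$FPU.

\begin{verbatim}
/* Illustrative only: the four adds in the loop body are independent, so
   they can occupy successive pipeline stages instead of each one stalling
   on the previous add's result. */
double pipelined_sum(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)        /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
\end{verbatim}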