%% thesis/aux/mitthesis/chap1.tex, revision 538:cfbcd2b11087
%% Last change: spellcheck, Robert McIntyre <rlm@mit.edu>, Sun, 27 Apr 2014
%% This is an example first chapter. You should put each chapter/appendix that
%% you write into a separate file, and add a line \include{yourfilename} to
%% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
%% You can process specific files by typing their names in at the
%% \files=
%% prompt when you run the file main.tex through LaTeX.
\chapter{Introduction}
Micro-optimization is a technique for reducing the overall number of
floating point operations a program performs. In a standard floating point
unit, floating point operations are fairly high level, such as ``multiply''
and ``add''; in a micro floating point unit ($\mu$FPU), these are broken
down into their constituent low-level operations on the mantissas and
exponents of the floating point numbers.

Chapter two describes the architecture of the $\mu$FPU and the motivations
for the design decisions made.

Chapter three describes the design of the compiler, as well as how the
optimizations discussed in section~\ref{ch1:opts} were implemented.

Chapter four describes the purpose of the test code that was compiled, and
the statistics gathered by running it through the simulator. The purpose is
to measure the effect the micro-optimizations had compared to unoptimized
code. Possible future expansions of the project are also discussed.
\section{Motivations for micro-optimization}

The idea of micro-optimization is motivated by the recent trends in computer
architecture towards low-level parallelism and small, pipelineable
instruction sets \cite{patterson:risc,rad83}. By discarding more complex
instructions and concentrating on optimizing frequently used ones, these
architectures realized substantial increases in performance.

Another important motivation is the trend towards placing more of the
burden of performance on the compiler. Many of the new architectures depend
on an intelligent, optimizing compiler to realize anywhere near their peak
performance \cite{ellis:bulldog,pet87,coutant:precision-compilers}. In
these cases, the compiler is not only responsible for faithfully generating
native code to match the source language, but must also be aware of
instruction latencies, delayed branches, pipeline stages, and a multitude of
other factors in order to generate fast code \cite{gib86}.
Taking these ideas one step further, the floating point operations that are
normally single, large instructions can themselves be broken down into
smaller, simpler, faster instructions, with more control in the compiler
and less in the hardware. This is the idea behind a micro-optimizing FPU:
break the floating point instructions down into their basic components and
use a small, fast implementation, with a large part of the burden of
hardware allocation and optimization shifted towards compile time.

Along with the hardware speedups made possible by a $\mu$FPU, there are
also optimizations that the compiler can perform on the generated code. A
normal sequence of floating point operations contains many hidden
redundancies that can be eliminated by allowing the compiler to control the
floating point operations down to their lowest level. These optimizations
are described in detail in section~\ref{ch1:opts}.
\section{Description of micro-optimization}\label{ch1:opts}

A normal FPU performs many redundant internal shifts and normalizations in
the course of executing a sequence of floating point operations. If a
compiler can instead decompose the floating point operations it needs down
to the lowest level, it can optimize away many of these redundant
operations.

If there is additional hardware support specifically for
micro-optimization, further optimizations become possible. This hardware
support entails extra ``guard bits'' on the standard floating point
formats, which allow several unnormalized operations to be performed in a
row without loss of information\footnote{A description of the floating
point format used is shown in figures~\ref{exponent-format}
and~\ref{mantissa-format}.}. A discussion of the mathematics behind
unnormalized arithmetic is in appendix~\ref{unnorm-math}.

The optimizations that the compiler can perform fall into several categories:
\subsection{Post Multiply Normalization}

When more than two multiplications are performed in a row, the intermediate
normalization of the results between multiplications can be eliminated.
This is because with each multiplication, the mantissa can become
denormalized by at most one bit. If there are guard bits on the mantissas
to prevent bits from ``falling off'' the end during multiplications, the
normalization can be postponed until after a sequence of several
multiplies\footnote{Using unnormalized numbers for arithmetic is not a new
idea; a good example is the Control Data CDC 6600, designed by Seymour Cray
\cite{thornton:cdc6600}. All of the CDC 6600's instructions performed
unnormalized arithmetic, with a separate {\tt NORMALIZE} instruction.}.
% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist. To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}

As the example shows, the intermediate results can be multiplied together
with no need for intermediate normalizations, thanks to the guard bits. It
is only at the end of the sequence that the normalization must be
performed, in order to put the result into a format suitable for storing in
memory\footnote{Note that for purposes of clarity, the pipeline delays were
considered to be 0, and the branches were not delayed.}.
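The claim that each multiply denormalizes the result by at most one bit, so
that a single final normalization suffices, can be illustrated with a toy
model. This is a sketch only: plain Python integers stand in for the
$\mu$FPU's registers, the 23-bit fractional width is borrowed from IEEE
single precision for concreteness, and the function names are illustrative,
not part of the thesis hardware or compiler.

```python
# Toy model: a value is a (mantissa, exponent) pair, with the mantissa an
# integer carrying FRAC fractional bits; it is normalized when the value
# it encodes lies in [1.0, 2.0).
FRAC = 23  # fractional bits (IEEE single precision width, for illustration)

def mul_unnormalized(a, b):
    """Multiply WITHOUT renormalizing.  The product of two values in
    [1, 2) lies in [1, 4), so each multiply leaves the mantissa
    denormalized by at most one bit; guard bits absorb the drift."""
    (ma, ea), (mb, eb) = a, b
    return ((ma * mb) >> FRAC, ea + eb)

def normalize(x):
    """One final normalization: shift until 1.0 <= mantissa < 2.0."""
    m, e = x
    while m >= (2 << FRAC):
        m >>= 1
        e += 1
    while 0 < m < (1 << FRAC):
        m <<= 1
        e -= 1
    return (m, e)

def to_float(x):
    """Decode a (mantissa, exponent) pair back to a Python float."""
    m, e = x
    return (m / (1 << FRAC)) * 2.0 ** e
```

For example, cubing 1.5 with two unnormalized multiplies and one final
normalization recovers $1.5^3 = 3.375$ exactly.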
\subsection{Block Exponent}

In an unoptimized sequence of additions, the sequence of operations is as
follows for each pair of numbers ($m_1$,$e_1$) and ($m_2$,$e_2$).
\begin{enumerate}
\item Compare $e_1$ and $e_2$.
\item Shift the mantissa associated with the smaller exponent $|e_1-e_2|$
places to the right.
\item Add $m_1$ and $m_2$.
\item Find the first one in the resulting mantissa.
\item Shift the resulting mantissa so that it is normalized.
\item Adjust the exponent accordingly.
\end{enumerate}
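The six steps above can be sketched in Python. As before this is a toy
model on (mantissa, exponent) integer pairs, with a 23-bit fractional width
chosen for illustration; it is not the $\mu$FPU's actual datapath.

```python
def fp_add(a, b, frac=23):
    """Unoptimized floating point addition of (mantissa, exponent)
    pairs, following the six steps in the text."""
    (m1, e1), (m2, e2) = a, b
    # Steps 1-2: compare exponents; shift the mantissa with the
    # smaller exponent right by the difference.
    if e1 < e2:
        m1 >>= (e2 - e1)
        e = e2
    else:
        m2 >>= (e1 - e2)
        e = e1
    # Step 3: the one actual addition.
    m = m1 + m2
    # Steps 4-6: locate the leading one, shift to normalize
    # (1.0 <= mantissa < 2.0), and adjust the exponent.
    while m >= (2 << frac):
        m >>= 1
        e += 1
    while 0 < m < (1 << frac):
        m <<= 1
        e -= 1
    return (m, e)
```

Adding $1.5 + 1.5$, for instance, produces mantissa $1.5$ with the exponent
bumped to 1, i.e.\ $3.0$.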
Out of the six steps, only one is the actual addition; the rest are
involved in aligning the mantissas prior to the add, and then normalizing
the result afterward. In the block exponent optimization, the largest
exponent is found to start with, and all the mantissas are shifted before
any additions take place. Once the mantissas have been shifted, the
additions can take place one after another\footnote{This requires that for
$n$ consecutive additions, there are $\log_{2}n$ high guard bits to prevent
overflow. In the $\mu$FPU, there are 3 guard bits, making up to 8
consecutive additions possible.}. An example of the Block Exponent
optimization on the expression X = A + B + C is given in
figure~\ref{opt:be}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist. To use it, you need to remove the '%' on the
% beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}
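The block exponent idea can be sketched in the same toy (mantissa,
exponent) model: align every mantissa to the largest exponent once, add
them back to back with no per-addition normalization, and normalize a
single time at the end. The function name and bit widths are illustrative
assumptions, not the thesis implementation.

```python
def block_exponent_sum(values, frac=23):
    """Sum a list of (mantissa, exponent) pairs with the block
    exponent optimization: one alignment pass, n-1 bare additions,
    one final normalization.  Guard bits above the mantissa must
    absorb the carries (log2(n) bits for n consecutive additions)."""
    # Find the largest exponent and align all mantissas to it.
    e_max = max(e for _, e in values)
    total = sum(m >> (e_max - e) for m, e in values)
    # Single normalization at the very end.
    e = e_max
    while total >= (2 << frac):
        total >>= 1
        e += 1
    return (total, e)
```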
\section{Integer optimizations}

As well as the floating point optimizations described above, there are
also integer optimizations that can be used in the $\mu$FPU. In concert
with the floating point optimizations, these can provide a significant
speedup.
\subsection{Conversion to fixed point}

Integer operations are much faster than floating point operations; if it is
possible to replace floating point operations with fixed point operations,
this provides a significant increase in speed.

This conversion can either take place automatically or be based on a
specific request from the programmer. To do this automatically, the
compiler must either be very smart, or play fast and loose with the
accuracy and precision of the programmer's variables. To be ``smart'', the
compiler must track the ranges of all the floating point variables through
the program, and then see if there are any potential candidates for
conversion to fixed point. This technique is discussed further in
section~\ref{range-tracking}, where it was implemented.
The other way to do this is to rely on specific hints from the programmer
that a certain value will only assume a specific range, and that only a
specific precision is desired. This is somewhat more taxing on the
programmer, in that he has to know the ranges his values will take at
declaration time (something normally abstracted away), but it does provide
the opportunity for fine-tuning already working code.

Potential applications of this are simulation programs, where a variable
represents some physical quantity; the constraints of the physical system
may provide bounds on the range the variable can take.
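What such a conversion produces can be sketched with a hypothetical Q16.16
fixed point format (16 integer bits, 16 fractional bits); the format choice
and helper names are assumptions for illustration, not the representation
used by the $\mu$FPU compiler.

```python
# Hypothetical Q16.16 fixed point: the programmer promises the value
# fits the range, and floating point arithmetic is replaced by plain
# integer arithmetic on scaled values.
Q = 16  # fractional bits

def to_fixed(x):
    """Encode a real value as a scaled integer."""
    return int(round(x * (1 << Q)))

def fixed_mul(a, b):
    """Fixed point multiply: integer multiply, then discard the
    extra Q fractional bits of the double-width product."""
    return (a * b) >> Q

def to_real(a):
    """Decode a scaled integer back to a real value."""
    return a / (1 << Q)
```

The multiply is a single integer multiply and shift, with no exponent
comparison or normalization at all.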
\subsection{Small Constant Multiplications}

One other class of optimizations is to replace multiplications by small
integer constants with some combination of additions and shifts. Addition
and shifting can be significantly faster than multiplication. This is done
by using some combination of
\begin{eqnarray*}
a_i & = & a_j + a_k \\
a_i & = & 2a_j + a_k \\
a_i & = & 4a_j + a_k \\
a_i & = & 8a_j + a_k \\
a_i & = & a_j - a_k \\
a_i & = & a_j \ll m \quad \mbox{(left shift by $m$)}
\end{eqnarray*}
instead of the multiplication. For example, to multiply $s$ by 10 and store
the result in $r$, you could use:
\begin{eqnarray*}
r & = & 4s + s\\
r & = & r + r
\end{eqnarray*}
Or, to multiply by 59:
\begin{eqnarray*}
t & = & 2s + s \\
r & = & 2t + s \\
r & = & 8r + t
\end{eqnarray*}
Similar combinations can be found for almost all of the smaller integers
\cite{magenheimer:precision}\footnote{This optimization is only an
``optimization'', of course, when the amount of time spent on the shifts
and adds is less than the time that would be spent doing the
multiplication. Since the time costs of these operations are known to the
compiler in order for it to do scheduling, it is easy for the compiler to
determine when this optimization is worth using.}.
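The two decompositions above can be checked directly; the sketch below
transcribes them into shift-and-add form (the function names are
illustrative, not compiler-generated code).

```python
def times10(s):
    """Multiply by 10 using the sequence from the text:
    r = 4s + s (= 5s), then r = r + r (= 10s)."""
    r = (s << 2) + s
    return r + r

def times59(s):
    """Multiply by 59 using the sequence from the text:
    t = 2s + s (= 3s); r = 2t + s (= 7s); r = 8r + t (= 56s + 3s)."""
    t = (s << 1) + s
    r = (t << 1) + s
    return (r << 3) + t
```

Each sequence uses only shifts and additions, so it wins whenever three
such operations are cheaper than one multiply on the target hardware.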
\section{Other optimizations}

\subsection{Low-level parallelism}

The current trend is towards duplicating hardware at the lowest level to
provide parallelism\footnote{This can be seen in the i860: floating point
additions and multiplications can proceed at the same time, and the RISC
core can be moving data in and out of the floating point registers and
providing flow control while the floating point units are active
\cite{byte:i860}.}.

Conceptually, it is easy to take advantage of low-level parallelism in the
instruction stream by simply adding more functional units to the $\mu$FPU,
widening the instruction word to control them, and then scheduling as many
operations as possible to take place at one time.
However, simply adding more functional units can only be done so many
times; there is only a limited amount of parallelism directly available in
the instruction stream, and beyond that, much of the extra resources will
go to waste. One technique used to make more instructions potentially
schedulable at any given time is ``trace scheduling'', which originated in
the Bulldog compiler for the original VLIW machine, the ELI-512
\cite{ellis:bulldog,colwell:vliw}. In trace scheduling, code can be
scheduled through many basic blocks at one time, following a single
potential ``trace'' of program execution. In this way, instructions that
{\em might\/} be executed, depending on a conditional branch further down
in the instruction stream, are scheduled, allowing an increase in the
potential parallelism. To account for the cases where the expected branch
was not taken, correction code is inserted after the branches to undo the
effects of any prematurely executed instructions.
\subsection{Pipeline optimizations}

In addition to having operations going on in parallel across functional
units, it is also typical to have several operations in various stages of
completion in each unit. This pipelining allows the throughput of the
functional units to be increased, with no increase in latency.

There are several ways pipelined operations can be optimized. On the
hardware side, support can be added to allow data to be recirculated from
the end of the pipeline back to its beginning, saving a trip through the
registers. On the software side, the compiler can employ several tricks to
fill as many of the pipeline delay slots as possible, as described by
Gibbons \cite{gib86}.