Trace Driven Simulation of Cache Memories

This thesis evaluates an innovative cache design called the prime-mapped cache. Performance analysis on various applications and programs shows that the prime-mapped cache performs better than conventional cache organizations, and the performance gain increases as the speed gap between processors and memories widens. The exact cache behavior of numerical applications, namely matrix multiplication and the SPEC benchmarks, is studied by varying cache parameters such as cache size, line size and associativity. Traces are collected from these programs, and miss ratios for instruction and data accesses are compared. Based on the experimental results, and depending on the algorithm used, the miss ratios of the prime-mapped cache are found to be 50% to 100% lower than for conventional caches. Depending on the speed difference between processors and memories, these algorithms can run 30% to 2 times faster on the prime-mapped cache than they do on conventional caches.


Performance Analysis
Characterizing machines by studying how programs use their architectural and organizational features is an essential part of the design process. To evaluate the performance potential of any design, performance analyses of the various architectural approaches have to be carried out. Evaluating the performance of cache-based computer systems is a difficult task because of the complexity of program behavior: the locality property of an application and its reuse factors have to be considered.
Traditional performance evaluation can be broadly classified into three categories: Analytical Modeling, Simulation and Measurement.

Analytical Modeling
Analytical models provide a quick and insightful performance estimate of a given design [18]. By varying different input parameters over a wide range, an analytical model is a good approach for a comparative study of the performance of different alternatives of any design, particularly cache design [19]. But the analysis and numerical results of analytical models are to some extent hypothetical and are not meant to predict the performance of any realistic computer system. In most cases, analytical models are used in the initial development of a new design, whereas event-driven and trace-driven simulations are used to validate the design.
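The kind of quick estimate such models provide can be illustrated with the standard average memory access time formula for a single-level cache (the parameter values in this sketch are illustrative, not measurements from this thesis):

```python
def amat(hit_time, miss_ratio, miss_penalty):
    """Average memory access time (cycles) for a single-level cache."""
    return hit_time + miss_ratio * miss_penalty

# Example: 1-cycle hit, 5% miss ratio, 20-cycle miss penalty.
print(amat(1, 0.05, 20))  # -> 2.0
```

Varying the miss ratio or penalty over a range gives exactly the comparative curves such models are used for, without simulating a single reference.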

Measurement
Measurement is the only fully accurate and realistic form of performance evaluation. However, since the cache design space is incredibly diverse, with tens of independent design variables per level in a memory hierarchy, it is impossible to explore the entire design space in one study. The flexibility of measurement is also limited, which constrains the design methodology.
To be able to measure, the system must already exist; measurement is therefore unsuitable for evaluating new designs.

Event-Driven Simulation
Event-driven simulation simulates the activities of a system by generating random events according to a given distribution. It can be carried out in various modes, of which time-driven and execution-driven simulation are the most widely used [7]. Time-driven simulation is synchronous in the sense that all system activities occur at discrete time intervals, which are processor cycles.
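A time-driven simulator of this kind reduces to a loop in which the simulated clock advances one processor cycle at a time and every component is updated each cycle; the sketch below is a minimal illustration of that structure, not of any particular simulator:

```python
# A minimal time-driven simulation loop: the simulated clock advances in
# fixed processor-cycle steps and every component is updated each cycle.
def run(components, cycles):
    for cycle in range(cycles):
        for update in components:
            update(cycle)  # each component is a callable invoked every cycle

events = []
run([events.append], 3)   # a trivial "component" that logs its cycle number
print(events)  # -> [0, 1, 2]
```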

Trace-Driven Simulation
In trace-driven simulation, one or more application programs are executed, usually interpretively, and a complete trace is collected from each. The trace typically contains all of the memory addresses referenced, as well as opcodes and possibly timing information. Traces can either be used directly, for example to evaluate instruction set characteristics [21], or as input to an architectural simulator to predict the performance of different architectural variants. Such trace-driven simulation is most frequently used to study the behavior of cache memories [10].
The validity of trace-driven simulation relies on a crucial assumption: that perturbations to the trace data caused by the tracing process do not affect the simulation results. Unfortunately, it is nearly impossible to collect traces without perturbing program execution in some way; the most common perturbation is execution dilation. The use of trace-driven simulation for multiprocessors was validated in [24], where the variability due to dilation and multiple runs was found to be small.
Trace-driven simulation is based on actual traces of programs running on a system [4]. It therefore provides the most reliable and accurate performance estimates for given programs on a given system.
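In its simplest form, trace-driven cache evaluation just replays the recorded addresses through a cache model and counts misses. The sketch below uses a toy model (an unbounded cache with a hypothetical 16-byte line size) purely to illustrate the replay loop:

```python
# Replaying a recorded address trace through a toy cache model
# (unbounded cache of 16-byte lines) to count cold misses.
def replay(trace, line_size=16):
    cached, misses = set(), 0
    for addr in trace:
        line = addr // line_size      # the line this address falls in
        if line not in cached:
            misses += 1               # first touch of this line: a miss
            cached.add(line)
    return misses

trace = [0x100, 0x104, 0x108, 0x200, 0x100]
print(replay(trace))  # -> 2  (two distinct lines touched)
```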

Outline
The thesis is organized as follows:

Background
The fast computation of numerically intensive programs presents a challenge to memory system designers. Numerical program execution can be accelerated by pipelined arithmetic units [2], but to be effective these must be supported by high speed memory access. A cache memory is a well known hardware mechanism used to reduce the average memory access latency [6].

Cache Memory
Cache memories are high speed buffers inserted between the processor and main memory.
• Direct Mapping When a physical memory address is generated for a memory reference, the block address field is used to address the corresponding block frame. The tag field of the address is compared with the tag in the cache block frame. If there is a match, the information in the block frame is accessed using the word address field. Figure 1.a illustrates this organization.
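The decomposition of a physical address into tag, block-frame index and word offset for a direct-mapped cache can be sketched as follows (the line size and frame count here are illustrative, not taken from this thesis):

```python
# Splitting a physical address into (tag, index, offset) for a
# direct-mapped cache. Sizes are illustrative powers of two.
LINE_SIZE = 32        # bytes per block  -> 5 offset bits
NUM_FRAMES = 128      # block frames     -> 7 index bits

def split(addr):
    offset = addr % LINE_SIZE                    # byte within the line
    index = (addr // LINE_SIZE) % NUM_FRAMES     # which block frame
    tag = addr // (LINE_SIZE * NUM_FRAMES)       # remaining high bits
    return tag, index, offset

print(split(0x1234))  # -> (1, 17, 20)
```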
• Fully Associative In this mapping, any block in memory can reside in any block frame. When a request for a block is presented to the cache, all the map entries are compared simultaneously (associatively) with the request to determine if the requested block is present in the cache [17]. Although the fully associative cache eliminates high block contention, it incurs a longer access time because of the associative comparison over a large number of blocks.
• Sector Mapping In this scheme, the memory is partitioned into a number of sectors, each composed of a number of blocks [15]. Similarly, the cache is divided into sector frames, each composed of a set of block frames. Memory requests are for blocks, and if a request is made for a block not in the cache, the sector to which this block belongs is brought into the buffer.
The limitations are that the mapping of blocks within a sector is congruent. Also, only the block that caused the fault is brought into the cache, and the remaining block frames in the sector frame are marked invalid, thus wasting bandwidth. Figure 1.d illustrates the sector-cache organization.

• Prime Mapping
In the prime-mapped scheme [12], each memory address, as in a conventional cache-based computer system [5], is partitioned into three fields: w = log2(line size) bits of word address within a line (offset); c = log2(number of sets + 1) bits of index; and the remaining tag bits. The access logic of the prime-mapped cache consists of three components: data memory, tag memory and matching logic.
As in a set-associative cache, the data memory contains a set of address decoders and the cached data; the tag memory stores tags corresponding to the cached lines; and the matching logic checks whether the tag in an issued address matches the tag in the cache. The cache lookup process is exactly the same as in a set-associative cache. However, the index field used to access the data memory is not just a subfield of the original address word issued by the processor, since the modulus for cache mapping is no longer a power of 2 [16]: it is the residue of the line address modulo a Mersenne number.
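A sketch of this index computation, assuming a hypothetical c = 7 (so 2^7 - 1 = 127 sets), is shown below; the `mersenne_mod` helper illustrates the shift-and-add folding that makes a Mersenne modulus cheap to compute in hardware, which is what makes the scheme practical:

```python
# Prime-mapped index: the number of sets is a Mersenne number 2**c - 1,
# and the index is the line address modulo that number.
# (c and LINE_SIZE are illustrative values, not taken from the thesis.)
C = 7
SETS = (1 << C) - 1        # 127 sets, a Mersenne prime
LINE_SIZE = 32

def prime_index(addr):
    line = addr // LINE_SIZE
    return line % SETS

def mersenne_mod(x, c):
    """x mod (2**c - 1) using only masks, shifts and adds: fold the high
    bits back onto the low bits until the value fits in c bits."""
    m = (1 << c) - 1
    while x > m:
        x = (x & m) + (x >> c)
    return 0 if x == m else x

# The folded result agrees with the ordinary modulus.
assert mersenne_mod(0x1234, C) == 0x1234 % SETS
```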

Chapter 3 Cache Simulator
This chapter describes the cache simulator used to simulate the two cache designs: set-associative-mapped and prime-mapped. The simulator can be broadly divided into two parts: the first part, the XSIM [10] trace generator, takes the pixiefied output of any executable code and generates traces for the second part, the DINERO [21] cache simulator, which performs the actual simulation and reports the results. The basic flow diagram is given in Figure 2.
Pixie, the trace generator and the cache simulator are described below.

Pixie
Generating traces for executable codes running into megabytes imposes severe constraints on the operating system. The trace counts for numerical algorithms typically run into billions, which cannot be stored in a hard copy. Hence, for trace-driven simulation to perform accurately, the input code must be divided into several smaller sub-codes which can be accessed individually. The traces generated from these codes are smaller and hence can be recorded. The division of a bigger code into smaller codes is done by Pixie (a DECstation system utility).

Operation of Pixie
Pixie takes an executable program from a DECstation compiler, partitions it into basic blocks, each of size 64K bytes, and writes an equivalent program containing additional code that counts the execution of each basic block. A basic block is a region of the program that can be entered only at the beginning and exited only at the end. The input executable code is identified by its magic number; Pixie exits on an undefined input magic number. The internal division of the code is based on dynamic stack allocation. Each block has a unique starting address with which it is identified, and correspondence can be established within each block. To optimize performance, Pixie groups those instructions which fit into the range of a 16-bit displacement. An error is generated if the offset exceeds 16 bits (signed).
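The notion of a basic block can be made concrete with a small sketch: a block starts at the program entry, at every branch target, and immediately after every branch. The `(op, target)` instruction representation below is a simplified assumption for illustration, not Pixie's actual format:

```python
# Finding basic-block boundaries in a toy instruction list.
# Each instruction is (op, branch_target_or_None).
def basic_block_starts(instrs):
    starts = {0}                        # program entry starts a block
    for i, (op, target) in enumerate(instrs):
        if op == "branch":
            starts.add(target)          # a branch target starts a block
            if i + 1 < len(instrs):
                starts.add(i + 1)       # the fall-through after a branch
    return sorted(starts)

prog = [("add", None), ("branch", 3), ("sub", None), ("mul", None)]
print(basic_block_starts(prog))  # -> [0, 2, 3]
```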
Pixie writes this output code to a file with the default .pixie extension. Since the code is divided and information about the block addresses must be stored, the pixiefied code is considerably larger than the input code. The branch instructions in the program determine the number of times each basic block in the program text is executed and the sequence in which the blocks are executed. Pixie also supports thirty-two 32-bit general purpose registers which are used for data movement. All operations are register-to-register, except for load and store operations, which are memory-to-register operations.

Options
In addition to generating address counts, Pixie also supports certain important features:
• Pixie defines a MIPS instruction set which is compatible with DEC-based RISC machines.
• To allow for the individual blocks to be accessed, pixie maintains a file which gives the starting addresses for each of the blocks.
• Pixie supports the fortran 77 format statements by putting the original text into the translated output.
• To account for the trace references (used by the simulator), Pixie enables the issue of memory references, which enlarges the code considerably. Care must be taken in using this option, as the branch offset may exceed the 16-bit range on a subroutine call.
• To reduce the number of references generated, Pixie can issue only one memory reference for every N memory references, where N is a user-defined number greater than 1.
Pixie does not work on programs that receive signals, as the handler address for the system calls is not translated. Also, since the pixiefied code is considerably larger than the original code, conditional branches that used to fit in the 16-bit branch displacement field may no longer fit, which generates a Pixie error. This drawback is exposed by the Perfect Club benchmark programs, wherein the offsets are on the order of 18 bits.

Trace Generator
The pixiefied code is fed to a trace generator. The XSIM trace generator reads each basic block of the pixiefied code and converts it into assembly language code based on the MIPS instruction set. This code is then assembled and converted into a trace output file. This is an ASCII file with one LABEL and one ADDRESS per line; the rest of each line is ignored, so it can be used for comments.
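A reader for this trace format only needs the first two fields of each line and can ignore the rest. The sketch below assumes hexadecimal addresses and treats the label as an opaque string; the sample labels follow Dinero's convention (0 = data read, 2 = instruction fetch), which is an assumption here:

```python
# Parsing the ASCII trace format: one LABEL and one ADDRESS per line;
# anything after the first two fields is treated as a comment.
def parse_trace(lines):
    refs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            label, addr = fields[0], int(fields[1], 16)
            refs.append((label, addr))
    return refs

sample = ["2 00401000 fetched from block 1",
          "0 10008000 data read"]
print(parse_trace(sample))  # -> [('2', 4198400), ('0', 268468224)]
```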

Features
• XSIM has provision to generate an entire trace file, which can subsequently be fed to any cache simulator that accepts the same kind of input file. This facility is rarely used, as the trace file may run into gigabytes; instead, as explained previously, traces are fed one at a time without generating a trace file.
• XSIM has the ability to suppress tracing and just generate data files. This is equivalent to an assembler.
• To stop trace generation after a fixed amount of work, XSIM has an option to stop the generator after N serial cycles are traced.
• To give the user more visibility, XSIM can generate traces starting from some fixed address.
• For comparative studies, it may be necessary to act on only a fixed amount of trace. This can be done by specifying both the starting and ending addresses for trace generation, thereby fixing the amount of trace generated and also avoiding access to undefined addresses.
• To make it more user friendly, XSIM has a 9-level debugger which can trace the addresses accessed. The debugging includes acting on new basic blocks, analyzing basic blocks, producing results for each of the basic blocks, etc.

Dinero Cache Simulator
The traces generated by XSIM are fed into the Dinero cache simulator. The simulator takes as its inputs the organization of the cache: the unified cache size, instruction cache size, data cache size, block and sub-block sizes, associativity, write-back policy, etc. Once the cache parameters are fed in, the simulator checks for discrepancies in the set-up, such as a block size which is not equal to 2^c for some positive integer c, or an undefined write-back policy.
When the simulator recognizes that valid input parameters have been specified, it initializes the address stack and starts fetching the traces. The address part is decoded and the tag and index fields are determined. The simulator then looks for the data in the cache. On a miss (address tag not found in the cache), the main memory is accessed and the cache is updated. The data trace following the instruction trace is then acted upon. All data and instruction references are recorded, including read, write and miscellaneous accesses, and the misses corresponding to each of these cases are also filed. The address stack is continuously updated based on input specifications such as the prefetching mode, flushing of the cache, etc. The results are written to an output file, an example of which is given below.
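The lookup loop described above can be sketched as a tiny set-associative model with LRU replacement (the cache dimensions are illustrative, and the real Dinero supports many more policies and statistics than this):

```python
# A minimal set-associative lookup loop: decode tag and index, search
# the set, count a miss when the tag is absent, maintain LRU order.
from collections import OrderedDict

LINE_SIZE, SETS, WAYS = 16, 4, 2

def simulate(trace):
    cache = [OrderedDict() for _ in range(SETS)]  # per-set tags in LRU order
    misses = 0
    for addr in trace:
        line = addr // LINE_SIZE
        index, tag = line % SETS, line // SETS
        s = cache[index]
        if tag in s:
            s.move_to_end(tag)            # hit: mark most recently used
        else:
            misses += 1                   # miss: fetch from main memory
            if len(s) == WAYS:
                s.popitem(last=False)     # evict the least recently used
            s[tag] = None
    return misses

# Four lines that all map to set 0 in a 2-way cache: every access misses.
print(simulate([0x00, 0x40, 0x80, 0x00]))  # -> 4
```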
Once the cache simulation is done, the output file is updated with the recorded values of instruction and data references. As can be seen from Figure 4, the misses are calculated as a percentage of the total number of references for each category. The number of main memory references is also given, which, when large, degrades the cache performance. It is found that the simulator spends 35% to 50% of its

Performance Evaluation

Matrix Multiplication
Matrix multiplication has been used to evaluate architecture designs for a considerable time, and hence is included to analyze the prime-mapped design.
Square matrices of long integers are used, ranging from 64 x 64 to 256 x 256. Each matrix is divided into blocks (submatrices) of size B1 x B2, and the algorithm is run on these blocks. An exhaustive study has been done by varying the block size, cache size, data size and the blocking factor, and the results are plotted below.
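The blocking referred to here is the classic tiled matrix multiplication: the loops are restructured so that a B1 x B2 submatrix is reused while it is resident in the cache. The sketch below is a minimal plain-Python illustration of that loop structure; the exact loop order and tile roles used in the thesis may differ:

```python
# Blocked (tiled) matrix multiplication: process the matrices in
# B1 x B2 submatrices so each tile is reused while it is cached.
def blocked_matmul(A, B, n, B1, B2):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, B1):
        for jj in range(0, n, B2):
            for kk in range(0, n, B2):
                for i in range(ii, min(ii + B1, n)):
                    for j in range(jj, min(jj + B2, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + B2, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

n = 4
I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]   # identity
M = [[i * n + j for j in range(n)] for i in range(n)]
assert blocked_matmul(I, M, n, 2, 2) == M                        # I * M == M
```

Smaller tiles mean fewer capacity evictions inside a tile but more tile-boundary overhead, which is exactly the B1/B2 trade-off the figures below explore.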
Blocksize represents the number of bytes of data moved between cache and main memory in a single access. B1 represents the number of times the innermost loop (once loaded) will be executed. Since a B2 of 16 gives the best performance improvement for any B2, Figure 9 shows the variation in miss percentage with B1 at that value of B2. As for B2, the misses increase with increasing B1 (more replacements, more misses) and decrease with increasing line size. The performance gain plotted against B2 is given in Figure 10: the performance degrades for increasing B2 due to a greater number of line interferences, and increasing block sizes reduce the gain due to fewer misses. Figure 11 gives the corresponding performance increase for B1. As for B2, the performance degrades for larger B1 (more replacements) and improves for smaller B1. Figure 12 shows the effect of B1 on B2 as a factor in the performance gain: decreasing B1 improves the performance, as this reduces the number of times the blocks have to be replaced. A miss here is propagated throughout the matrix multiplication.
The peak gain, at B1 = 8 and B2 = 8, is substantiated in Figure 13, which illustrates the effect of B2 on B1.
This discussion shows that the prime version performs better than the un-primed version over various cache parameters, but the peak performance depends on the organization of the matrices as well. This brings out an intricacy of cache behavior: no one particular architecture can ensure the best performance over the entire range of problem sizes.

GCC
GCC is the GNU C compiler, a program written in C (and not vectorizable) which converts preprocessed files into optimized Sun-3 assembly code. Figures 14 and 15 plot the miss percentages for GCC for varying cache size at associativity 1 and associativity 2, respectively. The misses decrease with increasing cache size due to the availability of more data inside the cache, which increases the probability of a hit.
The plots for the different associativities indicate that the miss percentage decreases for larger associativity, as was shown for the matrix multiplication algorithm.

COMPRESS
COMPRESS is a C program which performs data compression on a 1 MB file using adaptive Lempel-Ziv coding. The variation in miss percentage as the cache size is varied is shown in Figures 17 and 18 for different associativities. As for GCC, the misses decrease with increasing cache size. The performance gain plotted in Figure 19 indicates that a cache size of 8K bytes gives the lowest gain. For cache sizes below 8K bytes, the misses vary significantly (reducing more appreciably for the un-primed than for the primed version as the cache size approaches 8K bytes). For cache sizes of 8K bytes and above, the misses stabilize and hence produce more improvement. Also, as the cache size is increased, associativity plays a major role, as can be seen from the curvature of the plots. The benchmark exhibits very high code locality and is not very sensitive to instruction cache size: a 32 KB direct-mapped cache had a miss ratio of less than one half of one percent. On the other hand, the benchmark is quite sensitive to data cache size.

Hydro2d
Hydro2d is an astrophysics application program which solves hydrodynamical Navier-Stokes equations to compute galactic jets. The input file is modified to change the number of timesteps from 400 to 100. This is the only number in the input file supplied with the benchmark; other input data are generated by the program itself.
The output file specifies the time step, the GRID spacing, the viscosity factor and the execution time. The input file specifies the points to be computed. Changing this parameter affected the number of GRID points generated (from 400 to 100) and the viscosity factor. But the GRID points obtained up to 100 steps match the result file from the benchmark, and hence the program application is not changed. The misses of the prime version are about half of those of the un-primed version.
The plot of performance improvement vs. cache size (Figure 29) shows that the rate of improvement decreases as the cache size is increased. This is due to the decrease in misses as the cache size is increased; the misses vary very little for larger cache sizes, and hence the graph stabilizes for cache sizes of 128K bytes and greater. The trace counts run into billions. The miss percentage reduces as the cache size is increased, due to fewer line conflicts for larger cache sizes. Figure 31 shows the performance improvement for the same cache size variation. The performance shows an increasing gain as the cache size is made large, due to fewer line conflicts in the prime version compared with the un-primed version.
The SPEC results show that the prime version gives better performance gains, which tend to stabilize above an 8K byte cache size, and that varying the associativity, while reducing the miss percentages, has little impact on the overall cache performance.

Conclusions
In this thesis, a new cache organization, the prime-mapped cache design, was evaluated. An existing conventional cache simulator was modified to reflect the new design.
The design is evaluated using traces generated through the XSIM trace generator, Pixie and the Dinero cache simulator. Numerous programs and algorithms are compiled and used to generate the traces, including matrix multiplication and the SPEC benchmarks. The SPEC benchmarks provide a valuable source, as they have been used to evaluate similar designs. Simulation is done for different cache sizes by varying the block size and the associativity.
Results are obtained for both the un-primed and the primed versions. A comparison shows that the prime version gives a performance improvement in the miss ratios ranging from 50% to 150%, depending on the algorithm used to evaluate it. Even though the prime-version cache shows considerable improvement, the large variation in performance reflects the complexity and unpredictable nature of cache design. An architecture which gives good performance for one algorithm may perform poorly when used for a different algorithm. The wide range of programs used to evaluate the new design takes this into consideration, and the results show that the prime-mapped cache design always gives better performance than the un-prime-mapped design for scalar processors.