ANALYZING THE PERFORMANCE IMPACT OF PARALLEL LATENCY IN THE PIPELINE

This work introduces a concept coined overlap latency, which is shown to severely limit performance in several types of benchmarks. Overlap latency is only completely removed when branch mispredictions and cache misses are removed in tandem, rather than improved in isolation. Since most current research investigates improvements to either branch prediction or cache behavior, but not both, proposed techniques are unable to unlock this extra performance gain. To demonstrate this concept, benchmarks are evaluated using four configurations: baseline, which uses current state-of-the-art branch prediction and cache prefetching; perfect-bp, which emulates perfect branch prediction direction; perfect-cache, which emulates a perfect L1 data cache; and perfect, which combines perfect-bp and perfect-cache. In addition, detailed analysis on select benchmarks is conducted to show the cause of overlap latency as well as its effect on an out-of-order execution CPU. Benchmarks were found to have the potential for up to an additional 229% IPC compared to that expected from the individual performance gains of branch prediction and cache.


Introduction
Since the invention of the multi-stage pipeline [1], through the widely adopted Out-of-Order (O3) execution model [1], and continuing today with active research in branch prediction [2] and data cache prefetching [3], among many other areas, computer architecture has consistently focused on improving single core performance. Even with promising contributions from newer research topics, such as multi-core architectures [4] and hardware accelerators [5], which provide opportunities for new and creative innovations in computer architecture, single core performance remains among the most important sources of improvement [6]. This can be seen in industry, where microprocessor vendors have been devoting significant hardware resources, including deeper pipelines and wider issue and commit widths, to push the limits of single core performance. The trend is also seen in academia, where competitions are regularly organized to find and refine branch prediction [7] and cache prefetching [8] techniques, and many papers are proposed every year in this area even outside of competitions.
This high level of interest and activity has led to great improvements in single core performance. However, continuing to produce ever more effective branch prediction and data cache prefetching is becoming much more difficult. For example, two main sources of branch mispredictions and cache misses in today's state-of-the-art processors can be attributed to hard to predict (H2P) branches [9] and irregular data accesses [10,11], respectively. Since these do not follow the repetitive patterns that current mechanisms rely on to predict future behavior, new and innovative solutions will be needed to predict them with any useful degree of accuracy and coverage. In addition, H2P branches and irregular data accesses can often be interrelated. For example, a pointer-based data structure traversal (irregular data accesses) cannot be efficiently prefetched without accurate branch prediction.
Conversely, branches which depend on irregular data accesses take much longer to resolve on a cache miss, which can result in more squashed instructions.
In this work, the co-dependence between H2P branches and irregular data accesses is evaluated in a variety of benchmark suites to determine the degree to which this behavior affects single core performance. In addition, a detailed analysis is carried out to determine the causes (software implementations) and effects (architectural strains) of the load-branch relationship. This study makes the following contributions:
• Introduce the concept of overlap latency, a consequence of load-branch dependencies in frequently executed loops.
• Provide an upper bound performance limit on a variety of benchmarks to show the limitations in improving branch prediction and cache performance in isolation due to overlap latency.
• Categorize selected benchmarks into groups of algorithms which are vulnerable to overlap latency.
• Investigate the cause of overlap latency and its effects on a modern processor.

Background
This section provides background on important topics related to this research, as well as a survey of related works.

Pipelining
A pipelined computer architecture takes advantage of parallelism inherent in a set of instructions by overlapping their execution. A pipeline breaks the processing of an instruction into several stages. Each of these stages takes less time to complete than processing the entire instruction would, allowing for a higher clock frequency. In addition, pipelining allows several instructions to be processed simultaneously, since the instructions can be in different pipeline stages.
While this technique has led to substantial performance gains, it also creates a more complex environment in which to process instructions. For example, if an instruction depends on the result of a previous instruction, it must wait to begin execution until that result is available. In the event the previous instruction's result depends on a memory access, this wait could be many CPU cycles long. This waiting is referred to as a pipeline stall, and results in underutilization of CPU resources.
Another complication introduced by the pipelined architecture is referred to as a control hazard. Control hazards are the result of branch instructions, in which the next instruction to execute depends on the result of the current instruction. The simplest solution to a control hazard is to stall the pipeline until it is known what the next instruction should be. Since branches represent about one-third of all instructions, this stalling can result in significant performance loss.

Out-of-Order Execution
Out-of-Order (O3) execution is used in almost all modern architectures. It allows instructions to be executed out of program order while still committing them in the correct order. This is useful for several reasons, but mainly to hide memory access latency. As mentioned in the previous section, if an instruction depends on a previous instruction's result, it must wait for that result. However, there may be other instructions, independent of these two, which could occupy the pipeline instead. O3 execution allows instructions to be executed as soon as they are ready and then reorders the executed instructions back into program order to ensure correct program execution. This reordering is accomplished through the use of a reorder buffer (ROB), which stores all executed instructions until they are able to be committed in program order.

Data Prefetching
While O3 execution is effective at hiding memory latency by allowing other instructions to execute while one waits, data prefetching is a technique used to actually remove these memory latencies. This is done by attempting to predict near-future memory accesses and storing the data in the cache before the program requires it. When predicted correctly, prefetching removes what would otherwise be a cache miss. There are many very effective methods of data cache prefetching, including Stride Prefetching [1], Signature Path Prefetching [2], Best Offset Prefetching [3], and Indirect Memory Prefetching [4].
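To make the idea concrete, the following is a minimal sketch of the table-based logic behind stride prefetching, assuming a simple per-PC table with a small confidence counter; real designs (including [1]) use set-associative tables and more careful training policies.

    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>

    // Minimal stride-prefetcher sketch: one table entry per load PC.
    struct StrideEntry {
        uint64_t last_addr = 0;    // address of the previous access by this PC
        int64_t  stride = 0;       // last observed delta between accesses
        int      confidence = 0;   // how many times the stride has repeated
    };

    class StridePrefetcher {
        std::unordered_map<uint64_t, StrideEntry> table;
    public:
        // Called on every demand access; returns a prefetch address or 0.
        uint64_t observe(uint64_t pc, uint64_t addr) {
            StrideEntry &e = table[pc];
            int64_t delta = static_cast<int64_t>(addr - e.last_addr);
            if (delta != 0 && delta == e.stride) {
                e.confidence = std::min(e.confidence + 1, 3);
            } else {
                e.stride = delta;
                e.confidence = 0;
            }
            e.last_addr = addr;
            // Issue a prefetch only once the stride has repeated.
            return (e.confidence >= 2) ? addr + e.stride : 0;
        }
    };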

Irregular Data Accesses
While data cache prefetching has provided significant improvement to cache performance, and therefore program performance, there remain memory accesses which even state-of-the-art designs cannot predict. For example, an irregular memory access can occur when traversing a linked data structure in which the values fetched are pointers to arbitrary memory locations on the heap. Since almost all current data prefetchers rely on some history or repetitive pattern to make predictions, this type of traversal is very difficult to predict. Several designs have been proposed to handle the irregular memory access problem [5,6,7].
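As a concrete illustration of such an access pattern (hypothetical code, not taken from any of the cited works), consider a linked-list traversal: the address of each node is only known once the previous node's data returns from memory, so there is no stride or history for a prefetcher to learn.

    // Pointer chasing: the address of list->next cannot be computed
    // until the load of the current node completes, so consecutive
    // accesses land at effectively arbitrary heap addresses.
    struct Node {
        int   value;
        Node *next;   // points to an arbitrary heap location
    };

    int sum_list(const Node *list) {
        int sum = 0;
        while (list != nullptr) {   // likely one cache miss per iteration
            sum += list->value;
            list = list->next;
        }
        return sum;
    }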

Branch Prediction
As mentioned in the pipelining section, stalls due to control hazards can significantly limit a program's performance. Branch prediction (BP) is a technique, used in all modern CPUs, which attempts to predict whether a branch will be taken rather than waiting for the actual outcome. The CPU then proceeds to execute instructions assuming the prediction is correct. If the prediction turns out to be wrong, the state of the pipeline must be reverted back to the mispredicted instruction and execution resumed down the correct path. Many effective branch prediction techniques have been proposed; most CPUs use either the TAGE predictor [8] or Perceptron-based prediction [9]. While these branch predictors are able to achieve over 90% accuracy in many cases, there remain branches which cannot be accurately predicted. This problem has been investigated, and such branches are referred to as hard to predict (H2P) branches.
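For intuition, the sketch below shows the classic two-bit saturating-counter scheme that underlies most history-based prediction; TAGE and perceptron predictors are far more sophisticated, but share the principle of predicting future behavior from recorded past behavior.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Toy bimodal predictor: 2-bit saturating counters indexed by PC.
    // Counter values 0,1 predict not taken; 2,3 predict taken.
    class BimodalPredictor {
        std::array<uint8_t, 4096> counters{};
        static std::size_t index(uint64_t pc) { return (pc >> 2) % 4096; }
    public:
        bool predict(uint64_t pc) const { return counters[index(pc)] >= 2; }
        void update(uint64_t pc, bool taken) {
            uint8_t &c = counters[index(pc)];
            if (taken && c < 3) ++c;    // strengthen the "taken" state
            if (!taken && c > 0) --c;   // strengthen the "not taken" state
        }
    };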

Microarchitecture Simulation
In order to test new microarchitectural ideas and implementations, it is common to use software to emulate the functionality of the proposed hardware. This allows changes to the architecture to be made quickly and enables recording of fine-grained details about the implementation. However, software emulating hardware runs orders of magnitude slower than the actual hardware [10]. With today's ever increasing complexity in both the microarchitecture and the software run on it, simulation times can take days, weeks, or even months.

Functional vs Detailed Simulations
The effort to reduce the required simulation time while maintaining accurate and detailed results continues to be an active research area. A fundamental tradeoff is between functional and detailed simulations. A functional simulation will emulate the essential functionality of the hardware while optimizing or removing other parts of the simulation in order to reduce simulation time. A detailed simulation will accurately emulate the hardware at a very low level, providing useful and insightful statistics at the cost of much longer simulation times.
A common technique used is to carry out a functional simulation of some workload on the proposed hardware and take a checkpoint at some region of interest.
This checkpoint will store essential information about architectural state and memory contents so that future simulations can begin at this point. This checkpoint is then restored using a detailed simulation which will simulate some predetermined number of instructions. This provides a compromise, assuming there is a region of interest that is known before simulation. In cases where this assumption does not hold true, other checkpointing methodologies have been proposed.

Checkpointing Methods
The two most commonly used checkpointing methodologies are Simpoint [11] and SMARTS [12]. Simpoint requires a functional simulation to be conducted from start to finish, during which a basic block vector analysis is performed. The workload is then broken into sections and weights are assigned to each section. Detailed simulations can then be carried out on each section, with the overall results merged according to each section's assigned weight. Simpoint is effective because it is able to remove redundant or similarly behaving areas of the workload, thus reducing the number of instructions which must be simulated in detail. A disadvantage of Simpoint is that once the sections (assigned by instruction number) are determined in the first functional simulation pass, a second full functional simulation is required to take checkpoints at the specific instructions.
The SMARTS methodology breaks the workload into some number of even intervals, with each interval assigned an equal weight. In this case, only one functional simulation needs to be run to take checkpoints at each interval. A disadvantage of this method is that it can take many trials to find an interval length such that the averaged results approach the actual results. In addition, SMARTS typically requires a large number of checkpoints, which can lead to significant storage requirements.

Related Works
While a majority of active research in branch prediction and data prefetching relies on some kind of history to predict future accesses, there have been novel approaches which attempt to resolve H2P branches and irregular data accesses.
Control Flow Decoupling (CFD) [13] proposes separating a loop into two loops: one containing all predicate instructions necessary to determine the branch outcome, and another containing all control-dependent instructions. The outcome of each set of predicate instructions is recorded and pushed into a queue, which the CPU consults to determine whether the control-dependent instructions should be executed for each iteration. While this transformation is only possible on a subset of loops with certain characteristics, it completely removes branch mispredictions, including those from H2P branches, for the loops it applies to.
Slipstream [14] is a novel architecture proposal in which two versions of the same program are executed simultaneously. In one version, instructions unimportant to the program outcome are identified at runtime and skipped, while the other version executes all instructions. The two instruction streams communicate with each other, allowing the shorter version, which runs ahead of the full version because it skips instructions, to convey necessary information such as branch outcomes and memory information to the full version.
Runahead execution [15] is another method of executing instructions ahead of the actual instruction stream. In runahead execution, a thread, either in hardware or software, is used to execute future instructions in an effort to prefetch useful data which is otherwise difficult to prefetch.
These works have the potential to effectively remove H2P branches and cache misses from irregular memory accesses. They do not, however, specifically target the load-branch dependency explored in this study. When traversing a data structure, for example, runahead execution is not able to execute far enough into the future due to pointer chasing. In addition, if any type of work is done on each node, this type of loop cannot be transformed as proposed in CFD.

Overlap Latency
As mentioned in the introduction, H2P branches and irregular data accesses constitute a large share of branch mispredictions and cache misses in state-of-the-art CPUs. They also tend to be interrelated, adding complexity when attempting to improve performance. In this section, the concept of overlap latency is introduced, which captures the problems that arise due to this relationship. An example of overlap latency is illustrated in figure 1. In this example, a while loop is iterated many times throughout execution of the program.
The line labeled B1 frequently results in branch mispredictions while the line labeled L1 produces a significant number of cache misses. If data cache prefetching is improved so that line L1 no longer produces many cache misses, instructions will be processed much faster, since instructions in the reorder buffer (ROB) no longer need to wait for long memory accesses. While faster instruction processing will improve performance, the presence of line B1 limits how much: faster processing also results in more frequent branch mispredictions from line B1, so while more instructions can now be fetched, many of them will later be flushed from the pipeline. Conversely, if branch prediction is improved so that B1 no longer results in branch mispredictions, more useful instructions will be fetched. While this makes more efficient use of the pipeline, since all processed instructions will later be committed, the long memory accesses caused by cache misses from line L1 lead to the ROB holding more instructions. As the ROB fills, the pipeline must delay execution of further instructions, which limits the potential performance improvement made possible by better branch prediction.
The load-branch dependency created by lines L1 and B1 leads to what is referred to in this paper as overlap latency: the parallel latencies caused by cache misses (i.e. L1) and branch misprediction penalties (i.e. B1). Since both sources of latency occur frequently as the loop executes, removing only one will not provide significant performance improvement due to the presence of the other. As will be shown in the results section, applications with significant overlap latency can be severely limited in potential performance gain unless both the load and the branch are handled in tandem.
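A loop of the general shape described above looks like the following sketch (illustrative code, not the exact loop from figure 1): the branch B1 cannot resolve until the irregular load L1 returns, and the address of the next load itself depends on the loaded value.

    // Illustrative load-branch dependency. L1 is an irregular load,
    // B1 a branch whose outcome depends on the loaded value.
    int find(const int *data, int start, int target) {
        int idx = start;
        while (true) {
            int value = data[idx];   // L1: irregular load, frequent misses
            if (value == target)     // B1: H2P branch on the loaded value
                return idx;
            idx = value;             // the next address comes from the load
        }
    }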

Motivating Example
As an example of this type of behavior, consider the benchmark MST from the Olden benchmark suite, which computes the minimum spanning tree of a graph. A bottleneck within this benchmark is the while loop shown in figure 3, which performs a hash table key lookup. In this case, a cache miss occurs frequently when retrieving ent = hash->array[j] from memory. In addition, the check of whether ent is valid (not NULL) results in frequent branch mispredictions. This is a load-branch dependency in which the load is an irregular data access (linked list) and the branch is an H2P branch, since its result depends on that irregular access.
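The loop has roughly the following shape (a reconstruction from the description of figure 3; the field and function names are assumptions, not the exact Olden source):

    // Sketch of the MST hash-lookup loop described above.
    struct HashEntry { void *key; void *entry; HashEntry *next; };
    struct Hash { HashEntry **array; };

    void *lookup(Hash *hash, void *key, unsigned j) {
        HashEntry *ent = hash->array[j];   // irregular load: frequent cache miss
        while (ent != nullptr) {           // H2P branch: depends on loaded value
            if (ent->key == key)
                return ent->entry;
            ent = ent->next;               // pointer chasing through the chain
        }
        return nullptr;
    }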
If branch prediction were improved, this would lead to increased ROB occupancy, since many instructions would be waiting on memory accesses.
Conversely, if data cache prefetching is improved, this leads to underutilization of the ROB given the increased frequency of squash events, i.e. branch mispredictions. This is shown in figure 4, where data from a detailed simulation was used to plot the average ROB occupancy and the frequency of squash events for four different configurations: baseline (current state-of-the-art), perfect branch prediction (perfect-bp), perfect cache (perfect-cache), and perfect branch prediction with perfect cache (perfect). If there were no co-dependency between lines B1 and L1, one would expect the contributions from perfect branch prediction to be independent of the contributions from perfect cache performance. However, as can be seen from the IPC values in figure 4, the actual performance gain of the perfect configuration is much higher than this expectation. The additional speedup made possible by removing both latencies together is referred to in this work as overlap speedup.
To explain this overlap speedup, figure 4 also shows results obtained from tracking the loop during execution. These results were obtained by periodically taking a snapshot of the instructions within the ROB and monitoring them until they were all either committed or squashed. Iterations were identified by a single predetermined instruction which executes on each loop iteration. The iteration commit ratio shows the average number of iterations actually committed compared to the average number of iterations found in the ROB in each snapshot.
As the results indicate, the perfect-bp configuration commits every iteration found in the ROB, at the cost of increased time to process the instructions due to memory latency. Perfect-cache, on the other hand, processes instructions in the ROB very quickly; however, it only commits about half of the iterations processed. The perfect configuration experiences both a high iteration commit ratio and fast instruction processing. Therefore, while perfect-bp improvements are limited by the presence of long latency operations and perfect-cache improvements are limited by frequent ROB flushes, the perfect configuration is able to unlock a significant amount of additional performance gain by eliminating both of these overlapping latencies.

Benchmark Selection
For this study, it was of interest to investigate a wide range of coding styles to determine the set of algorithms, traversals, or other attributes which cause parallel latency in the pipeline. To this end, the following benchmark suites were selected. Olden [1] is a computation-intensive benchmark suite which makes extensive use of linked data structures and has been used in many studies aiming to mitigate the effect of pointer chasing. These benchmarks were chosen because they are well researched and target applications with a large amount of irregular data accesses.
The Problem Based Benchmark Suite (PBBS) [2] is a C++ based benchmark suite designed to study applications which utilize an algorithm to solve some problem; for example, it includes benchmarks such as comparison sort and the travelling salesman problem. This suite was chosen since it represents many types of useful algorithms found in real-world applications. The Graph Algorithm Platform Benchmark Suite (GAPBS) [4] is a C++ framework whose goal is to accelerate graph processing research by offering standardized baseline graph processing implementations. GAPBS provides very high performance implementations of each kernel. This benchmark suite was chosen since it provides evaluations of many popular graph algorithms and has been used in many papers.
CRONO [5] is a C multi-threaded graph analytics benchmark suite developed at the University of Connecticut, Storrs. Released at roughly the same time as GAPBS, CRONO aims to become a standard multi-threaded graph algorithm benchmark suite. CRONO benchmarks leverage C array data structures and use a combination of direct and indirect access patterns. The two suites thus share the goal of being a standard graph benchmark suite, with different implementations (based in C and C++, respectively) and access patterns. For this study, the multi-threaded functionality was set to run on a single thread.

The Standard Performance Evaluation Corporation CPU 2017 benchmark suite (SPEC CPU 2017) [6] is the de facto computer architecture research benchmark suite, consisting of a collection of computationally intense benchmarks that stress the processor, memory hierarchy, and compilers. This benchmark suite also offers the opportunity to evaluate real-world applications, rather than algorithms which are incorporated into applications. For this study, only the speed integer benchmarks were selected: since the focus of this research is on single core performance, the rate benchmarks are not of interest, and recent characterizations of the floating point benchmarks have shown small amounts of branch misprediction.

Simulation Configuration
To obtain detailed measurements from the collection of benchmark suites, the gem5 simulator [7] was used. All simulations were run in syscall emulation (SE) mode on the x86 ISA. The baseline configuration for the simulated CPU is shown in the accompanying configuration table. Simulations were run using four different configurations: state-of-the-art, perfect-bp, perfect-cache, and perfect. The state-of-the-art configuration used the TAGE-SC-L branch predictor [8]; results for this configuration were obtained using four different prefetchers: Stride, SPP, BOP, and IMP. The perfect-bp configuration emulated a perfect branch predictor (direction only) while still using a state-of-the-art prefetcher. The perfect-cache configuration emulated an L1 data cache which never misses while using the TAGE-SC-L branch predictor. The perfect configuration emulated both perfect branch prediction and a perfect L1 data cache. The implementation details of perfect-bp and perfect-cache can be found in the following section.

Implementations

Perfect Branch Prediction
In order to evaluate the potential contribution of branch prediction, a method for ensuring perfectly accurate branch prediction in simulation was needed. In this study, branch prediction was considered to be the prediction of taken or not taken only. This means branch mispredictions are still possible in the event of a BTB miss or an incorrect BTB prediction. The frequency of these mispredictions varies depending on workload; however, for the vast majority of benchmarks considered these events were very infrequent. To achieve perfect branch prediction, two modifications were made to the gem5 simulator.
First, the correct path for a particular workload must be recorded. Therefore, a preliminary simulation is run in which the taken/not-taken outcome of every committed control instruction is recorded in an external text file, in commit order. Since the information recorded is just a boolean value per control instruction, this does not require much storage space.
Once the correct path was recorded for the workload, it was then run in the perfect branch prediction configuration. The external file is loaded into the simulator as an array. On every prediction, the correct outcome is extracted from the array and the branch predictor's decision is overridden by this value. As mentioned earlier, mispredictions due to the BTB are still possible. Because of this, squashes can still occur, so the array pointer must be able to return to a previous state when this happens. This is handled using a small circular buffer the size of the ROB, which tracks the dynamic instructions that have caused the pointer to increment. When a tracked instruction is squashed, the pointer is decremented accordingly.
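A simplified sketch of this replay mechanism is given below (a paraphrase of the approach described above, not the actual gem5 modification; the class and method names are invented for illustration).

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Replay of recorded branch outcomes with rollback on squash.
    class OutcomeReplay {
        std::vector<bool> outcomes;   // taken/not-taken trace, in commit order
        std::size_t ptr = 0;          // next outcome to hand out
    public:
        explicit OutcomeReplay(std::vector<bool> trace)
            : outcomes(std::move(trace)) {}

        // Override the predictor's decision with the recorded outcome.
        bool predict() { return outcomes[ptr++]; }

        // On a squash (e.g. after a BTB-induced misprediction), rewind by
        // the number of squashed control instructions; a ROB-sized circular
        // buffer of tracked instructions supplies this count.
        void rollback(std::size_t squashed_branches) { ptr -= squashed_branches; }
    };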

Perfect Cache
To evaluate the effect of memory access latency on performance, a perfect cache was needed to emulate an L1 data cache with a 100% hit rate. With such a cache, all memory accesses incur the penalty of an L1 cache access only, which is the goal of all data cache prefetchers. To implement this, the different timing modes available in gem5 were taken advantage of. To simulate memory access times, gem5 uses a timing mode which emulates the detailed interactions between the CPU and all levels of the memory hierarchy. gem5 also has a functional mode, traditionally used to load workload information into simulator memory at the beginning of a simulation, or when running a functional simulation which does not track memory details. The perfect cache implemented for this study uses the timing mode to simulate interactions between the CPU and the L1 data cache.
The interactions between the L1 and L2 caches, however, were overridden so that the functional mode retrieves the appropriate data from the lower levels without incurring any penalty. This implementation essentially creates an L1 data cache the size of main memory with minimal changes to existing gem5 code.

Metrics Used
For this study, several common metrics were used, as well as metrics developed specifically to describe behavior related to overlap latency. The following common metrics were collected from simulations:
• BP MPKI - branch mispredictions per kilo-instruction
• L1D MPKI - L1D cache misses per kilo-instruction
• LLC MPKI - LLC cache misses per kilo-instruction
• IPC - instructions per cycle
• Hot PC - an instruction which produces a relatively large number of misses or mispredictions compared to other instructions in the benchmark

Load-Branch MPKI
Overlap latency is a consequence of a load-branch dependency which is executed many times throughout the execution of an application. Moreover, only cache misses (i.e. access patterns which are not prefetched well by state-of-the-art mechanisms) and H2P branches contribute to overlap latency. Therefore, only benchmarks which exhibit a high number of both branch mispredictions and cache misses should be considered for this work. To identify such benchmarks, a metric referred to as Load-Branch MPKI was used.
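The exact definition is not reproduced here; a form consistent with the description below, and with the requirement that the metric vanish when either miss source is absent, combines the two miss rates the way parallel components combine (stated as an assumption, not the author's exact formula):

    Load-Branch MPKI = (BP MPKI × L1D MPKI) / (BP MPKI + L1D MPKI)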
This equation makes use of the observation that a branch miss and a cache miss can be viewed as parallel processes in this work: both must be present for overlap latency to exist, so if either one is absent there is no opportunity for overlap latency.

Expected Speedup
The aim of this work is to show that additional performance gain is made possible by removing both branch mispredictions and cache misses together. This additional gain cannot be achieved by either branch prediction or cache prefetching alone, since the other source of latency remains in either case. Therefore, the actual speedup observed via simulation must be compared to the speedup that would be expected based on the performance of perfect-bp and perfect-cache. Under the assumption that the performance gains from branch prediction and cache are independent of one another, the expected speedup is defined as the product of the individual speedups:

    Expected Speedup = Speedup(perfect-bp) × Speedup(perfect-cache)

Overlap Speedup
Overlap speedup is the metric used to describe the additional speedup a benchmark can achieve by removing the latencies caused by both the load and the branch together; this is speedup that is not possible by removing just one source. Overlap speedup is defined as the ratio of the actual speedup of the perfect configuration to the expected speedup:

    Overlap Speedup = Speedup(perfect) / Expected Speedup

Since expected speedup is the speedup expected from removing all branch mispredictions and cache misses based on the simulation results of perfect-cache and perfect-bp, overlap speedup captures any additional speedup gained by completely removing the load-branch dependency.
The Load-Branch MPKI, as well as the overlap speedup, for each benchmark simulated is shown in figure 5. All benchmarks with a high Load-Branch MPKI exhibit a significant amount of overlap speedup, and all benchmarks with a low Load-Branch MPKI exhibit negligible overlap speedup.
Two benchmarks, bisort and health, have significant overlap speedup, though significantly lower than expected. The reason is that the perfect-cache configuration actually helps branch prediction, which violates the assumption that perfect-bp and perfect-cache are independent. In these cases, the benchmarks still mispredict on the same instructions; however, far fewer predictions are made, which lowers the opportunity for mispredictions. This is because improved cache performance allows branch mispredictions to be caught much faster.

CHAPTER 5 Analysis
In this chapter, benchmarks selected based on Load-Branch MPKI are analyzed. First, a software analysis demonstrates the algorithms which are vulnerable to overlap latency. Then, the architectural strain caused by overlap latency is shown, explaining why the load-branch dependency created in software limits the performance of a benchmark.

Software-Level Analysis
After careful analysis of a large range of benchmarks, those which showed a large amount of overlap latency were grouped into categories. In this section, these categories are discussed along with one or more examples providing concrete evidence of how the overlap is created in software. The categorization of the selected benchmarks is:

Hash Table Lookups: BST, Hash, Skiplist, MST, Dict, RandAcc
Linked Data Structure Traversal: Treeadd, TSP, Health, Perimeter
Data Dependent Modifications: Sort, CSR-List, Bisort

1. Neighboring Node Access. This class of algorithm typically traverses all nodes (e.g. vertices) in the data structure. In each iteration, these algorithms access or modify other nodes in the structure. Since these other nodes are not necessarily sequential, this creates an irregular access pattern that is hard to predict, so a frequent cache miss is created by loading the other nodes within each iteration. In addition, any decision which must be made based on these other nodes results in an H2P branch, causing frequent branch mispredictions. This combination results in a load-branch dependency, leading to significant overlap latency when the loop is iterated many times throughout execution. In the example benchmark for this category, one hot instruction causes 28% of its misses, and this leads to an overlap speedup of 1.7 (i.e. an additional 70% speedup is possible due to overlap latency).
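A hypothetical loop of this kind, for a graph stored in compressed sparse row (CSR) form, might look like the following sketch (illustrative code, not taken from any of the suites above):

    // Neighboring-node access: neighbor IDs are data dependent, so the
    // loads of dist[v] are irregular, and the comparison branch depends
    // on those loads.
    void relax(const int *offsets, const int *neighbors, int *dist, int u) {
        for (int i = offsets[u]; i < offsets[u + 1]; ++i) {
            int v = neighbors[i];   // neighbor ID read from memory
            int cand = dist[u] + 1;
            if (cand < dist[v])     // H2P branch: depends on the irregular load
                dist[v] = cand;
        }
    }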

2. Hash Table Lookups. Hash table lookups in which an entry matching some key must be found. Two types of implementations were seen in the benchmarks studied for this work. The first implementation uses an array of linked lists to store entries. A hashing function determines the index at which the matching key would be found, and a traversal of the linked list at that index is performed, checking each entry for a possible match. In this case, a load-branch dependency is created by traversing the linked list until the key is found. Pointer chasing is responsible for the irregular memory access, and the unpredictability associated with the length of each linked list, as well as the value of each entry, creates the H2P branch. An example of this is seen in MST, which was shown as the motivating example.
The other implementation, which is used by the AMAC benchmarks, utilizes a FIFO (linked list) to hold some number of keys. These FIFOs are stored in some data structure which must then be traversed to find matching keys.
In this implementation, the load-branch dependency is produced similarly to the first. As an example, consider BST. This benchmark probes an unbalanced binary search tree for a set of keys; each node of the BST contains a FIFO (linked list) holding some number of keys. Whenever a key is found, the BST pointer is reset to the root and the search for the next key begins. The unpredictability of when a key will be found (data dependent), as well as the differing sizes of each node's FIFO, create an H2P branch at B1. In addition, irregular data access patterns are created by accessing different pointer fields on every iteration. In this example, there are several instructions which could create a load-branch dependency; however, the if-else control flow guarantees that exactly one load-branch dependency is created on every iteration. In figure 7, these instructions are identified as B1 and C1. The two B1 instructions (combined here since they are cascaded) make up 95% of all branch mispredictions in the simulation, and the two C1 instructions account for 99% of all cache misses. The large number of load-branch dependencies caused by this loop leads to an overlap speedup of 1.66.
3. Linked Data Structure Traversal. Traversals in which each node of some structure, such as a linked list or binary tree, is accessed according to some order. In this case, pointer chasing to reach the next node results in irregular memory accesses, while the composition of the linked data structure (i.e. whether a node is a leaf or not) produces an H2P branch. Since in many cases only a small amount of work is done on each node, these traversals usually form very small loops which are iterated many times. Since the load-branch dependency makes up the majority of such a loop, these traversals can produce significant overlap latency.
The benchmark treeadd will be used as an example. This benchmark traverses a binary tree postorder, accumulating a running sum of the node values. The conditional checking whether the current node is valid (not NULL) results in a large number of branch mispredictions. In addition, the pointer chasing resulting from traversing the tree produces the irregular memory accesses that complete the load-branch dependency.
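The loop in question resembles the following sketch (reconstructed from the description; treeadd's actual source differs in its details):

    // Postorder sum over a binary tree, in the style of treeadd.
    struct TreeNode { int value; TreeNode *left, *right; };

    int tree_sum(const TreeNode *node) {
        if (node == nullptr)                // H2P branch: the tree shape
            return 0;                       // is data dependent
        return tree_sum(node->left)         // pointer chasing: children live
             + tree_sum(node->right)        // at irregular heap addresses
             + node->value;
    }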

4. Data Dependent Modifications. Data dependent traversals such as sorting algorithms can lead to overlap latency. In these cases an H2P branch is produced by the dependence of the branch outcome on the input data, and cache misses occur due to the large amount of data being accessed and modified at once. As an example, the benchmark Comparison Sort will be used. This benchmark uses the Standard Template Library (STL) function sort to sort a random set of float data. The sort function utilizes an algorithm called introsort, which partitions the array around some pivot; the function used to create this partition is shown in figure 9. Within this while loop, values of the array are accessed from front to back as well as from back to front, and the front and back pointers are incremented and decremented within the loop. This creates a memory access pattern that is very difficult to predict, leading to cache misses. In addition, the number of iterations traversing front to back (as well as back to front) is highly dependent on the input data, creating an H2P branch. In this case two load-branch dependencies are created, both executed on every iteration of the outer loop. The instructions B1 and B2 cause 77% of branch mispredictions, and the instructions C1 and C2 account for 86% of all cache misses. This results in an overlap speedup of 1.29.
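The partition loop has the following general shape (a sketch of Hoare-style partitioning as used by introsort, simplified from the actual STL code; B1/C1 and B2/C2 mark the data-dependent branches and loads):

    #include <utility>

    // Hoare-style partition sketch. Assumes, as the STL's unguarded
    // variant does, that the pivot bounds both scans within the range.
    float *partition(float *first, float *last, float pivot) {
        while (true) {
            while (*first < pivot) ++first;   // B1/C1: front-to-back scan
            --last;
            while (pivot < *last) --last;     // B2/C2: back-to-front scan
            if (!(first < last))
                return first;
            std::swap(*first, *last);         // modify both ends of the range
            ++first;
        }
    }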

Hardware-Level Analysis
This section explores the effects overlap latency can have on CPU resources. First, the consequences of improved branch prediction on benchmarks with significant overlap latency are explained. Next, the consequences of improved cache performance are shown. Finally, the effects of the combination are shown, highlighting the strain put on the CPU by improving branch prediction or cache performance in isolation in these benchmarks.

Perfect Branch Prediction Consequences
The perfect-bp simulation results provide the upper bound performance for each benchmark when the branch prediction direction is always correct. While improved branch prediction is generally a good thing in any benchmark, it can add strain to the CPU, especially in benchmarks with significant overlap latency. This strain is due to the increase in useful instructions (i.e. instructions which will not be squashed) which must be processed. This is quantified in figure 14, which graphs the average ROB occupancy of each benchmark for the base and perfect-bp configurations. An increase in ROB occupancy from perfect branch prediction is expected, since perfect prediction drastically reduces the number of instructions squashed in the ROB. However, due to overlap latency, this increase can be even greater, since more instructions in the ROB will be waiting for memory accesses to return. This over-utilization of the ROB can prohibit further instructions from being executed, reducing the possible performance gain.

Perfect Cache Consequences
The perfect-cache simulation results show the upper bound of performance gain possible by improving cache performance. In benchmarks with overlap latency, improved cache performance alone is not enough to achieve the best possible performance. The reason for this is the branch misprediction in the load-branch dependency which creates overlap latency. While a perfect cache allows for faster processing of instructions, H2P branches remain to limit performance. In a benchmark with overlap latency, improved cache means more frequent mispredictions and therefore more frequent squashes. The increased frequency of branch mispredictions, shown in figure 14, results in an underutilized ROB. This limits the possible performance gain, as there will be cycles in which no instructions are present in the ROB.

Idle Commit Cycles
The combination of the previous two observations leads to this metric. In perfect-bp, there is still a relatively large number of cycles in which no instructions are committed; in many cases the head instruction is waiting for a memory access. In perfect-cache, there are still significant numbers of cycles in which no instructions are committed, due to frequent ROB flushes leaving no instructions ready to commit. In the perfect case, however, instructions can be processed faster due to improved memory behavior, and the ROB is utilized well, so instructions are usually ready to be committed. This observation was quantified using the ratio of cycles in which no instructions are committed to total cycles simulated. This metric does not track the number of instructions committed each cycle, which would result in IPC; rather, it tracks the latency of the head instruction (or lack of a head instruction) in each configuration. This is useful because it makes the size of the ROB irrelevant. The results of this metric for each benchmark are shown in figure 14.
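Expressed as a ratio (notation ours):

    Idle Commit Cycles = (cycles in which no instruction commits) / (total simulated cycles)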
From the geometric mean of these benchmarks, it can be seen that in both perfect-bp and perfect-cache over half of all cycles commit no instructions on average. In the perfect case, however, the average is reduced to only about a quarter of cycles with no committed instructions.

Iteration Throughput
In order to provide examples of the aforementioned architectural strain placed on the CPU in the cases of both improved branch prediction and improved cache, this section examines the throughput of each benchmark analyzed in the software analysis section. This was done by identifying the loop containing overlap latency (shown in the software analysis section) and tracking the performance of the loop during simulation, measured using three metrics. The results for the loops identified in the software analysis section can be seen in figure 15. As expected, perfect-bp leads to 100% of snapshot iterations being committed, while perfect-cache has a commit ratio similar to base. The increased number of useful instructions under perfect-bp, however, requires more CPU cycles to commit due to memory latency. Perfect-cache, on the other hand, is able to process each snapshot very quickly, although many processed instructions are not actually needed since they are later squashed.

Figure 14: Effects of overlap latency in the pipeline. As branch prediction is improved, the load latency which remains leads to an increase in instructions waiting to commit. As the ROB fills, the number of instructions which can be fetched decreases. As cache misses are reduced, the branch latency which remains leads to more frequent ROB flushes. This leads to the ROB being under-utilized such that there may be no instructions in the ROB ready to commit. The average idle commit cycles tracks the latency of the head instruction from the ROB. This combines latency from perfect-bp and perfect-cache to show the added benefit in the perfect configuration.

Effect of the ROB Size
The ROB serves many purposes; one of its main purposes is to hide long memory access latency by allowing other instructions to execute while the CPU waits for data to return from memory. By increasing the size of the ROB, the CPU is afforded more opportunity to hide these memory latencies, reducing the effect of cache misses on overall performance (i.e. IPC). Because of this, the ROB plays a significant role in the impact overlap latency has on a benchmark's performance. While the base configuration for this study utilizes a ROB around the size of those in typical modern CPUs, it is also of interest to examine how larger ROBs will affect overlap latency in the future.
To do this, simulations were run on all selected benchmarks while varying the size of the ROB from 256 entries up to 1024 entries. This also involved increasing the number of physical registers, instruction queue entries, and load/store queue entries by the same ratio as the ROB. In the perfect-bp configuration, a large majority of benchmarks saw either small changes or significant IPC improvements as the ROB size increased; this is to be expected, since a larger ROB reduces the impact of cache misses, the remaining limiter when overlap latency is present. Also as expected, the perfect-cache configuration saw very minimal changes in almost all benchmarks. Finally, the perfect configuration saw no change in IPC, which is also expected: in the perfect configuration the ROB does not need to hide memory latency, so increasing its size provides no additional benefit.

Results
In the previous chapter, the causes of overlap latency, as well as the effects it can have on the processor, were analyzed. In this chapter, the impact on actual benchmark performance is demonstrated. To show this, the upper bound IPC values for all four configurations, obtained via detailed simulation, are presented. A deeper analysis of these results then provides better insight into just how limiting overlap latency can be for certain applications.

Upper Limit IPC
In this section, the impact overlap latency can have on a benchmark's performance is demonstrated. This was investigated by simulating each benchmark using four configurations: base (TAGE-SC-L + SPP), which represents current state-of-the-art implementations; perfect-bp, which emulates perfect branch prediction direction while maintaining SPP prefetching; perfect-cache, which emulates a perfect L1 data cache while maintaining TAGE-SC-L branch prediction; and perfect, which emulates both perfect branch prediction direction and a perfect L1 data cache. Figure 17 shows the IPC values of all four configurations for the selected benchmarks.
As can be seen by the IPC values, these benchmarks see a much higher performance improvement in the perfect configuration compared to the other three.
While this may be expected, given that perfect contains both perfect-bp and perfect-cache benefits, this work argues that there are additional, and less obvious, benefits to improving both branch prediction and cache together which unlock much greater potential for performance improvement in benchmarks with overlap latency; the benchmark MST (discussed in the motivation section) is one such example. Only one benchmark showed no additional speedup (this was the RandAcc benchmark), with an average of 16% across all benchmarks. Table 3 shows the average additional speedup found by category (as defined in the previous chapter).
While these results show that the full potential for performance improvement cannot be achieved without removing both sources of latency together in these benchmarks, this work also aims to show that as one source of latency is removed, the removal of the other grows in importance. As can be seen in the simulation results, branch prediction is more effective (produces a higher speedup) when cache misses are removed, for almost all benchmarks simulated. Similarly, cache improvements produce higher speedups when branch mispredictions are removed. It is for this reason that these benchmarks achieve a significant amount of additional speedup in the perfect configuration compared to the expected speedup.

Figure 17: Upper bound limit of IPC for each configuration simulated. Benchmarks such as BST and TC see a much larger potential in the perfect configuration compared to perfect-bp and perfect-cache.

Figure 18: Overlap speedup. This is the extra speedup obtained by removing both load and branch latency compared to the expected speedup based on perfect-cache and perfect-bp results.

While the simulation framework used in this study was able to accurately measure the overlap latency found within simple benchmarks which evaluate a single graph traversal or algorithm, there are some limitations when this method is applied to more complex benchmarks such as SPEC CPU2017.

Benchmark Complexity
Since SPEC CPU2017 benchmarks are meant to represent real-world applications, as well as the complexity associated with real-world implementations, they cannot be accurately evaluated by simulating just one section of each application.
Unlike the benchmarks explored previously, there is no single region of interest on which to focus the detailed simulations. To handle this, as mentioned in the background section, the SMARTS methodology [1] was utilized to take checkpoints throughout the lifecycle of each benchmark. These checkpoints were taken using the Lapidary tool developed by a group of researchers at the University of Michigan. This tool allows the checkpoints to be generated using GDB running on native hardware, as opposed to running the benchmark in the simulator. By generating checkpoints on native hardware, a significant amount of time was saved (i.e. several hours compared to several weeks). To customize this tool for this particular work, wrapper code was written to automate the process of determining the correct interval at which to take checkpoints and to store the generated checkpoints in the correct location. Based on prior work [2], it was determined that approximately 100 checkpoints per benchmark would be sufficient to obtain accurate simulation results for SPEC CPU2017. In order to reconcile these multi-checkpoint benchmarks with the other single-checkpoint benchmarks, an average of the statistics collected from all checkpoints was used to represent the performance of each benchmark. This gives a good representation of, for example, the unpredictability of branches found in a benchmark (i.e. branch MPKI). The averaging, however, tends to hide other attributes of the benchmark, such as areas where overlap latency limits performance.
Therefore, in order to conduct detailed analysis on the SPEC benchmarks, checkpoints of interest were selected from each benchmark and each of these checkpoints were treated as individual simulations. This will be discussed further in the analysis section.

Indirect Branches
Another limitation of the framework used in this study is the increased use of indirect branches and calls within the SPEC CPU2017 benchmarks. Indirect jumps can cause branch mispredictions even when the direction is correctly predicted, if the target of the taken branch is predicted wrong. While the other benchmarks do not make frequent use of indirect jumps, the SPEC benchmarks contain a significant number of them. Since perfect branch prediction was defined as perfect direction only, this limits the upper bound estimate for perfect-bp.

Impact of Overlap Latency
Although it is important to point out the limitations of the methodology used, useful analysis was still conducted on the SPEC benchmarks. In this section, the overall impact of overlap latency is explored, and then a detailed analysis of select areas of the MCF benchmark is provided showing the presence of overlap latency. Figure 20 shows the overlap latency indicator function values for the SPEC benchmarks. Applying the same threshold to these benchmarks as was applied to the previous benchmarks, one can see that only MCF has significant opportunity for overlap speedup. This observation was confirmed by simulation; the overlap speedups for each benchmark are shown in figure 22. In the next section, a detailed analysis of MCF is given.

MCF
In order to examine the presence and impact of overlap latency in MCF, results from three separate checkpoint simulations will be shown. These checkpoints represent different stages of the execution lifecycle of MCF. While it is true that different parts of a program should be weighted by how often they execute, that weighting is provided by the averaging of all checkpoints: if one part of a benchmark executes often, more than one checkpoint will cover it. In this section, the interest lies in how different parts of MCF behave rather than in overall performance impact.
The first checkpoint examined, referred to as CPT 9, does not contain overlap latency. While both branch mispredictions and cache misses occur frequently in this checkpoint, they occur at different stages of execution, thus avoiding overlap. The effect of this can be seen in figure 23, where performance improvement is almost completely dominated by cache behavior. For this checkpoint, the speedup results are very close to expected (also shown in figure 23), meaning there is little additional benefit to improving cache and branch prediction together.
In contrast to CPT 9, two other checkpoints were chosen which do contain overlap latency. The load-branch dependencies in these checkpoints come from different execution stages, as shown in figure 24. Because of the load-branch dependency in these checkpoints, neither perfect-bp nor perfect-cache is able to achieve a speedup approaching that of perfect. This impact is shown again in figure 23.
As these examples demonstrate, overlap latency can still have an impact on long, complex benchmarks; however, this impact is not as dramatic overall as that seen in previous examples. This is to be expected, as the previous benchmarks are meant to expose a particular algorithm to find bottlenecks, while the SPEC benchmarks are meant to evaluate real world applications.

Impact of cmov
An important consideration when analyzing overlap latency within a benchmark is the use of cmov instructions. Modern compilers use cmov instructions when deemed more efficient than relying on branch prediction. If a compiler chooses to use cmov rather than a branch and load, it can drastically reduce the number of branch predictions made. While in many cases the compiler does a good job of deciding when and where to place cmov instructions, future improvements to branch prediction could change this trade-off. Therefore, the use of cmov instructions was monitored during this study.
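The transformation in question replaces a conditional branch with a conditional move. As an illustration (hypothetical code; compilers typically if-convert the second form into a cmov on x86):

    // Branchy form: the comparison feeds a conditional branch that the
    // CPU must predict.
    int max_branch(int a, int b) {
        if (a > b) return a;
        return b;
    }

    // Branchless form: with if-conversion the comparison feeds a cmov, so
    // no prediction is made, but dependent instructions must now always
    // wait for both inputs to be ready.
    int max_cmov(int a, int b) {
        return (a > b) ? a : b;   // typically compiled to cmov on x86
    }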
To demonstrate the effect of cmov, three benchmarks simulated in this study will be used. In figure 25, the potential speedup of these benchmarks is shown when compiled with cmov instructions as well as without. The cmov instructions were disabled using the flags -fno-ssa-phiopt -fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores. As can be seen, the use of cmov severely limits the amount of overlap latency in a benchmark.
This can be an advantage, especially on current state-of-the-art CPUs; however, as branch prediction and cache performance improve, the use of cmov does not allow for as much IPC improvement. As an example, the potential IPC values for the RandAcc benchmark are shown. In the base configuration, the use of cmov results in a higher IPC. In addition, improvements to cache significantly increase IPC when cmov is used compared to without cmov. However, when both branch mispredictions and cache misses are reduced, the use of cmov limits the potential IPC by more than 25%.

Figure 25: Effect of cmov on specific benchmarks. Panel (c) shows the upper bound IPC for the RandAcc benchmark both with and without cmov. While cmov can be beneficial in current CPU architectures, it limits potential performance improvements made possible by removing overlap latency.

CHAPTER 8 Conclusion
In this study, it was shown that the load-branch dependency formed by H2P branches and irregular data accesses can significantly impact the potential performance gain of some types of benchmarks. First, an indicator function relating branch MPKI and cache MPKI to overlap latency opportunity was provided. This function was used to narrow down the set of benchmarks analyzed further. These benchmarks of interest were then simulated to show the upper bound speedup made possible by perfect branch prediction alone, perfect L1 cache alone, and both together. In all selected benchmarks, some amount of additional performance gain was unlocked by removing both sources of latency in tandem; this additional speedup was termed overlap speedup. The cause of overlap speedup in different categories of benchmarks was then shown from a software perspective. Finally, the effects of the load-branch dependency on the CPU were examined to explain the additional speedup, by showing the increased importance of branch prediction as cache improves (and vice versa) in benchmarks with this load-branch dependency.
This work provides a foundation for future research into practical implementations that attempt to reduce both sources of latency together. By providing upper bound limits on performance and showing the additional performance made possible, this work provides motivation for more active research in this area. In addition, by identifying categories of algorithms which are vulnerable to overlap latency, possible starting points for this new research have been given.