SOFTWARE SUPPORT FOR HARDWARE PREDICTORS

Hardware predictors are widely used to improve the performance of modern processors. These predictors are mostly used in data or instruction prefetching mechanisms and in branch predictors. Hardware-based prefetchers and branch predictors can adapt dynamically to the program's run-time behavior. However, most hardware-based predictors depend on detecting patterns (data access patterns, branch patterns, etc.) and require very complex mechanisms to capture irregular patterns. Software techniques, such as software prefetching, can help improve the performance of applications whose behavior is difficult to capture with hardware mechanisms. On the other hand, they mostly rely on repeatedly executing special instructions during execution, which is likely to create instruction overhead, and they cannot respond to run-time behavior the way hardware mechanisms can. To overcome the weaknesses of both kinds of prediction mechanisms, we proposed mechanisms that combine the strengths of hardware and software. In this thesis, we examined ways to use the knowledge we can extract from the software to inform hardware mechanisms. We enable hardware-based systems to capture complex software behavior using only the information they receive from the software, instead of relying on large history tables and buffers to make predictions. First, we proposed our hardware-based prefetching mechanism, the Sequential Prefetcher with Adaptive Distance (SPAD). SPAD uses a simple method of hardware prefetching that integrates timeliness into sequential prefetching. It can outperform recently proposed complex prefetchers with simpler and smaller hardware. Second, we proposed a software-supported hardware prefetching mechanism called the Array Tracking Prefetcher (ATP). This mechanism targets irregular memory access patterns and relies on the compiler/programmer to configure the hardware prefetching mechanism. By combining the strengths of software and hardware methods, ATP outperforms previously proposed software-only and hardware-only solutions. Finally, we proposed another software-assisted prefetcher for pointer-intensive in-memory database applications, the Node Tracker (NT). Although NT is proposed as a prefetcher, it can use the knowledge it extracts from the prefetched data to help the CPU pipeline in other ways to increase throughput. Tightly integrated with the CPU, NT achieves up to 19x speedup for the targeted applications.

To overcome the memory wall, designers have resorted to a hierarchy of cache memory levels, which rely on the principle of memory access locality to reduce the observed memory access time and the performance gap between processors and memory. Unfortunately, important workload classes exhibit adverse memory access patterns that baffle the simple policies built into modern cache hierarchies to move instructions and data across cache levels. As such, processors often spend much time idling upon a demand fetch of memory blocks that miss in higher cache levels.
Prefetching, i.e., predicting future memory accesses and issuing requests for the corresponding memory blocks in advance of explicit accesses, is an effective approach to hide memory access latency. There has been a myriad of proposed prefetching techniques, and nearly every modern processor includes some hardware prefetching mechanism targeting simple and regular memory access patterns. In this chapter, we first provide background on the memory wall and on prefetching.

Memory Wall
Innovations in microarchitecture, circuits, and fabrication technologies have led to an exponential increase in processor performance over the past four decades.
Meanwhile, DRAM has primarily benefited from increases in density, and DRAM speeds have improved only nominally. While future projections indicate that processor performance improvement may not continue at the same rate, the current gap in performance will necessitate techniques to mitigate long memory access latencies for years to come.
Computer architects have historically attempted to bridge this performance gap using a hierarchy of cache memories. Figure 1.1 depicts the anatomy of a modern computer's cache hierarchy. The hierarchy consists of cache memories that trade off capacity for lower latency at each level. The purpose of the hierarchy is to improve the apparent average memory access time by frequently handling a memory request at the cache, avoiding the comparatively long access latency of DRAM. The cache levels closer to the cores are smaller but faster. Each level provides a temporary repository for recently accessed memory blocks to reduce the effective memory access latency. The more frequently memory blocks are found in levels closer to the cores, the lower the access latency. We refer to the cache(s) closest to the core as the L1 caches and then number cache levels successively, referring to the final cache as the last-level cache (LLC).
The hierarchy relies on two types of memory reference locality. Temporal locality refers to memory that has been recently accessed and is likely to be accessed again. Spatial locality refers to memory in physical proximity that is likely to be accessed because near-neighbor instructions and data are often related.
While locality is extremely powerful as a concept to exploit and reduce the effective memory access latency, it relies on two basic premises that do not necessarily hold for all workloads, particularly as the cache hierarchies grow deeper.
The first premise is that one cache size fits all workloads and access patterns. The capacity demands of modern workloads vary drastically, and differing workloads benefit from different trade-offs in the capacity and speed of cache hierarchy levels. The second premise is that a single strategy for allocating and replacing cache entries (typically allocating on-demand and replacing entries that have not been recently used) is suitable for all workloads. However, again, there is enormous variation in memory access patterns for which a simple strategy for deciding which blocks to cache may fare poorly.
There is a myriad of techniques that have been proposed to overcome the Memory Wall, from the algorithmic, compiler, and system software levels all the way down to hardware. These techniques range from cache-oblivious algorithms and compiler-level code and data layout optimizations to hardware-centric approaches. Moreover, many software-based techniques have been proposed for prefetching.

Prefetching
One way to hide memory access latency is to prefetch. Prefetching refers to the act of predicting a subsequent memory access and fetching the required values ahead of the memory access to hide any potential long latency. In the limit, a memory access does not incur any additional overhead and memory appears to have a performance equal to a processor register. In practice, however, prefetching may not always be timely or accurate. Late or inaccurate prefetches waste energy and, in the worst case, can hurt performance.
To hide latency effectively, a prefetching mechanism must: (1) predict the address of a memory access (i.e., be accurate), (2) predict when to issue a prefetch (i.e., be timely), and (3) choose where to place prefetched data (and, potentially, which other data to replace).

Predicting Addresses
Predicting the correct memory addresses is a key challenge for prefetching mechanisms. If addresses are predicted correctly, the prefetching mechanism will have the opportunity to fetch them in advance and hide the memory access latency.
If addresses are not predicted accurately, prefetching may cause pollution in the cache hierarchy (i.e., prefetched cache blocks would evict potentially useful cache blocks) and generate excessive traffic and contention in the memory system.
Predicting memory addresses may not be so simple. A data reference may be an access to a standalone variable or an element of a data structure, and the nature of the reference depends on what the program is doing at a particular point in its execution. There are algorithms and data structure traversals that lend themselves well to repetitive and predictable patterns (e.g., reading every element of an array sequentially). There are also a number of ways in which memory addresses can be hard to predict. These include, but are not limited to, interleaved accesses to multiple variables and data structures, and control-flow-dependent traversals (e.g., searching a binary tree).
Similarly, an instruction reference will depend on whether the program is executing sequentially or it is taking a branch (i.e., following a discontinuity).
While sequential instruction fetch is straightforward, the control-flow behavior and its predictability in the program can impact how effective instruction prefetching can be.
Predicting addresses accurately also depends on the level of the cache hierarchy at which the prefetching is performed. At the highest level, the interface between the processor and the level-one cache (Figure 1.1) contains all memory reference information that could enable highly accurate prefetch, but could also lead to a waste of resources recording prefetch information for accesses that will hit in the first-level cache anyway, and thus do not require prefetch. Conversely, at lower hierarchy levels, the access sequence is filtered, observing only the misses from higher levels. Thus, otherwise effective prefetching algorithms may be confused by access-sequence perturbations from effects like cache placement and replacement policy.
Finally, there is typically a trade-off between the aggressiveness of a prefetch strategy and its accuracy; more aggressive prefetching will predict a higher fraction of the addresses actually requested by the processor at the cost of also fetching many more addresses erroneously. For this reason, many evaluation studies of prefetchers report two key metrics that jointly characterize a prefetcher's effectiveness at predicting addresses. Coverage measures the fraction of explicit processor requests for which a prefetch is successful (i.e., the fraction of demand misses eliminated by prefetching). Accuracy measures the fraction of accesses issued by the prefetcher that turn out to be useful (i.e., the fraction of correct prefetches over all prefetches). Many simple prefetchers can improve coverage at the expense of accuracy, whereas an ideal prefetcher provides both high accuracy and coverage.
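As a concrete illustration of these two metrics, the following sketch shows how coverage and accuracy can be computed from event counts that a cache simulator might collect; the counter names are illustrative and not taken from any particular tool.

#include <cstdint>

// Sketch: computing prefetcher coverage and accuracy from simple event counters.
struct PrefetchStats {
    uint64_t covered_misses;    // demand accesses that hit on a prefetched block
    uint64_t uncovered_misses;  // demand misses not eliminated by prefetching
    uint64_t useful_prefetches; // prefetched blocks referenced before eviction
    uint64_t total_prefetches;  // all prefetches issued
};

double coverage(const PrefetchStats& s) {
    // Fraction of would-be demand misses eliminated by prefetching.
    return double(s.covered_misses) / double(s.covered_misses + s.uncovered_misses);
}

double accuracy(const PrefetchStats& s) {
    // Fraction of issued prefetches that turned out to be useful.
    return double(s.useful_prefetches) / double(s.total_prefetches);
}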

Prefetch Lookahead
Ideally, a prefetching mechanism issues a prefetch well in advance and provides enough storage for prefetched data so as to hide all memory access latency. Predicting precisely when to prefetch in practice, however, is a major challenge. Even if addresses are predicted correctly, a prefetcher that issues prefetches too early may not be able to hold all prefetched memory close to the processor long enough prior to access. In the best case, prefetching too early will be useless because the prefetched information will be evicted away from the processor prior to use. In the worst case, it may evict other useful information (e.g., other prefetched memory or useful blocks in higher-level caches). If memory is prefetched late, then it will diminish the effectiveness of prefetching by exposing the memory access latency upon the memory access. In the limit, late prefetches may lead to performance degradation due to additional memory system traffic and poor interaction with mechanisms designed to prioritize time-critical demand accesses.

Placing Prefetched Values
The simplest and perhaps oldest software strategy for prefetching data is to load it into a processor register much like any other explicit load operation. Many architectures, in particular, modern out-of-order processors, do not stall execution when a load is issued, but rather stall dependent instructions only when the value of a load is consumed by another instruction. Such a prefetch strategy is often called a binding prefetch because the value of subsequent uses of the data is bound at the time the prefetch is issued. This approach comes with a number of drawbacks: (1) it consumes precious processor registers, (2) it obligates the hardware to perform the prefetch, even if the memory system is heavily loaded, (3) it leads to semantic difficulties in the case the prefetch address is erroneous (e.g., should a prefetch of an invalid address result in a memory protection fault?), and (4) it is unclear how to apply this strategy to instructions.
Instead, most hardware prefetching techniques place prefetched values either directly into the cache hierarchy or into supplemental buffers that augment the cache hierarchy and are accessed concurrently. In multicore and multiprocessor systems, these caches and buffers participate in the cache coherence protocol, and hence the value of a prefetched memory location may change during the interval between the prefetch and a subsequent access; it is the hardwares responsibility to ensure the access sees the up-to-date value. Such prefetching strategies are referred to as non-binding. In these schemes, prefetching is purely a performance optimization and does not affect the semantics of a program.

Instruction Prefetching
Instruction fetch stalls are detrimental to performance for workloads with large instruction working sets; when the instruction supply slows down, the processor pipeline's execution resources (no matter how abundant) will be wasted. Techniques such as out-of-order execution are often effective in hiding some or all of the stalls due to data accesses and other long-latency instructions. However, out-of-order execution generally cannot hide instruction fetch latency. As such, instruction stalls often account for a large fraction of overall memory stalls in servers.
Next-line prefetching [1] is the simplest form of instruction prefetching, which is prevalent in most modern processor designs. Because code is laid out sequentially in memory at consecutive memory addresses, often over half of the lookups in the instruction cache are for sequential addresses. The logic needed to generate sequential addresses and fetch them is minimal and fairly easy to incorporate into a processor and cache hierarchy.

Data Prefetching
Data miss patterns arise from the inherent structure that algorithms and high-level programming constructs impose to organize and traverse data in memory.
Whereas instruction miss patterns in conventional von Neumann computer systems tend to be quite simple, following either sequential patterns or repetitive control transfers in a well-structured control flow graph, data access patterns can be far more diverse, particularly in pointer-linked data structures that enable multiple traversals.
Strided prefetchers use a simple mechanism to identify the strides that separate addresses in a memory stream, based either on the PC of the instructions that access them or on global order [2]. Pointer-chasing prefetchers, which aim to predict the addresses pointed to by pointers, try to predict future accesses using hardware and/or software approaches. This can be achieved by inserting prefetch instructions via the programmer/compiler [3] or by correlating the data in the data cache with its address and predicting its likelihood of being a pointer [4]. Prefetchers for irregular memory access patterns target harder-to-prefetch addresses by identifying certain key characteristics of the memory stream [5,6,7,8].
Markov prefetchers [6,9,10] predict, from an observed memory sequence, the sets of unique addresses that are likely to occur in the future.

Stream and Stride Prefetchers
The first category of data prefetchers is stride and stream prefetchers, which are a direct evolution of the next-line and stream prefetching mechanisms that were mainly developed for instructions. These prefetchers capture access patterns for data that are either laid out contiguously in the virtual address space or are separated by a constant stride. This class of prefetcher tends to be highly effective for dense matrix and array access patterns, but generally provides little benefit for pointer-based data structures. Strided data prefetchers are widely deployed in industrial processor designs, from systems as old as the IBM System/370 series through modern high-performance processors. Until recently, it was believed that this class of hardware data prefetcher was the only class to be commercially deployed.
Sequential data prefetcher implementations, which are restricted to prefetching only blocks at consecutive addresses, were described as early as 1978 [1]. By the early 1990s such prefetchers were extended to detect and prefetch sequences of accesses separated by a non-constant stride [11]. Such strided access patterns arise frequently when traversing multi-dimensional arrays or when aggregate data types (e.g., structs in C) are stored in arrays. Strided accesses can also arise by happenstance even in pointer-based data structures when dynamic memory allocators lay out constant-sized objects consecutively in memory, a common case due to pool allocators. Dahlgren and Stenstrom study the relative merits and effectiveness of sequential and stride prefetching mechanisms in detail [12].
A key challenge in stride prefetcher implementations is to distinguish among multiple interleaved strided sequences, for example, as may arise in a matrix-vector product. Baer and Chen address this challenge by tracking strides on a per-load-instruction basis. Their reference prediction table is a tagged, set-associative structure that uses the load instruction PC as the lookup key. Each entry holds the last address referenced by that load and the difference in address (i.e., stride) between the last two preceding references. Whenever the same stride is observed twice consecutively, the last reference address and stride are used to compute one or more additional addresses for prefetch. Subsequent accesses that continue to match the recorded stride will trigger additional prefetches. A long sequence of such strided accesses is referred to as a stream, analogous to instruction stream prefetchers. Ishii and co-authors describe more sophisticated hardware structures that can compactly represent multiple strides [13], while Sair and co-authors extend stream prefetching to more irregular patterns by predicting stride lengths [14].
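A minimal sketch of such a per-load-instruction stride detector follows; the table organization and the rule of confirming a stride only after it is observed twice come from the description above, while the field widths, the direct-mapped indexing, and the single-prefetch-ahead policy are simplifications of ours.

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a PC-indexed reference prediction table for stride detection.
struct RptEntry {
    uint64_t tag = 0;           // load instruction PC
    uint64_t last_addr = 0;     // last address referenced by this load
    int64_t  stride = 0;        // last observed stride
    bool     confirmed = false; // same stride seen twice in a row?
    bool     valid = false;
};

class ReferencePredictionTable {
public:
    explicit ReferencePredictionTable(std::size_t entries) : table_(entries) {}

    // Record an access by load `pc` to address `addr`; returns true and sets
    // `prefetch_addr` when a confirmed stride predicts the next address.
    bool access(uint64_t pc, uint64_t addr, uint64_t& prefetch_addr) {
        RptEntry& e = table_[pc % table_.size()];
        if (!e.valid || e.tag != pc) {          // allocate on first sight of this PC
            e.valid = true; e.tag = pc; e.last_addr = addr;
            e.stride = 0; e.confirmed = false;
            return false;
        }
        const int64_t new_stride = int64_t(addr) - int64_t(e.last_addr);
        e.confirmed = (new_stride != 0 && new_stride == e.stride);
        e.stride = new_stride;
        e.last_addr = addr;
        if (e.confirmed) {
            prefetch_addr = addr + uint64_t(e.stride);  // one stride ahead
            return true;
        }
        return false;
    }
private:
    std::vector<RptEntry> table_;
};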
A second key implementation issue is to decide how many blocks to prefetch when a strided stream is detected. This parameter, often referred to as the prefetch degree or prefetch depth, is ideally large enough that the prefetched data arrive before being referenced by the processor, but not so large that blocks are replaced before access or cause undue pollution for short streams. Hur and Lin propose simple state machines that track histograms of recent stream lengths and can adaptively determine the appropriate prefetch depth for each distinct stream, enabling stream prefetchers to be effective even for short streams of only a few addresses [15]. Prefetched blocks belonging to a stream are often held in dedicated stream buffers that are searched along with the cache. When the stride detection mechanism observes a new stream, an entire stream buffer is cleared and re-allocated (discarding any unreferenced blocks from a stale stream), typically according to a round-robin or least-recently-used scheme.

Address-Correlating Prefetchers
Whereas stride prefetchers are typically ineffective for pointer-based data structures, such as linked lists, the second class of prefetcher we consider is specifically designed to target the pointer-chasing access patterns of such data structures. Instead of relying on regularity in the layout of data in memory, this class of prefetcher exploits the fact that algorithms tend to traverse data structures in the same way repeatedly, leading to recurring cache miss sequences.
Correlation between accesses to pairs of memory locations was suggested as early as 1976 [17]. Charney and Reeves first described hardware prefetchers that seek to exploit such pair-wise correlation relationships, coining the term "correlation-based prefetcher" [18,19]. Later work generalizes the notion of address correlation from pairs to groups or sequences of accesses [20,21]. Wenisch and co-authors introduce the term "temporal correlation" [21] to refer to the phenomenon that two addresses accessed near one another in time tend to be accessed together again in the future. Temporal correlation is an analog of "temporal locality," the property that a recently accessed address is likely to be accessed again in the near future. Whereas caches exploit temporal locality, address-correlating prefetchers exploit temporal correlation.

Jump Pointers
Correlating prefetchers are a generalization of hardware and software mechanisms that specifically targeted pointer-chasing access patterns. These earlier mechanisms rely on the concept of a jump pointer [22,23,24,25], a pointer that enables a large forward jump in a data structure traversal. For example, a node in a linked list may be augmented with a pointer ten nodes forward in the list; the prefetcher can follow the jump pointer to gain lookahead over the main traversal being carried out by the CPU, enabling timely prefetch. Prefetchers relying on jump pointers often require software or compiler support to annotate pointers. Content directed prefetchers [26,27] eschew annotation and attempt instead to dereference and prefetch any load value that appears to form a valid virtual address. While jump-pointer mechanisms can be quite effective for specific data structure traversals (e.g., linked list traversals), their key shortcoming is that the distance the jump pointer advances the traversal must be carefully balanced to provide sufficient lookahead without jumping over too many elements. Jump pointer distances are difficult to tune and the pointers themselves can be expensive to store.
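To make the jump-pointer idea concrete, the sketch below shows a linked-list node augmented with a pointer several hops ahead and a traversal that issues a non-binding software prefetch through it; the ten-node distance and the use of the GCC/Clang __builtin_prefetch intrinsic are illustrative assumptions, not part of the original proposals.

// Sketch: linked-list traversal using a jump pointer for prefetch lookahead.
struct Node {
    int   payload;
    Node* next;
    Node* jump;   // set up by software to point, e.g., ten nodes forward
};

long traverse(Node* head) {
    long sum = 0;
    for (Node* n = head; n != nullptr; n = n->next) {
        if (n->jump != nullptr)
            __builtin_prefetch(n->jump, 0, 1);  // non-binding prefetch of a node far ahead
        sum += n->payload;                      // work on the current node
    }
    return sum;
}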

Pair-wise Correlation
In essence, a correlation-based hardware prefetcher is a lookup table that maps from one address to another address that is likely to follow it in the access sequence.
While such an association can capture sequential and stride relationships, it is far more general, capturing, for example, the relationship between the address of a pointer and the address to which it points. It is the ability to capture pointer traversals that affords address-correlating prefetchers a far greater opportunity for performance improvement than stride prefetchers, as pointer-chasing access patterns are disproportionately slow on modern processors. However, address-correlating prefetchers rely on repetition; they are unable to prefetch addresses that have never previously been referenced (in contrast to stride prefetchers). Moreover, address correlation prefetchers require enormous state, as they need to store the successor for every address. Hence, their storage requirement grows proportionally to the working set of the application. Much of the innovation in address-correlating prefetcher design centers on managing this enormous state.
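In software terms, the essence of such a table is a map from a miss address to the address that followed it last time, as in the sketch below; the unbounded map is deliberate, to make the point about state growing with the working set, whereas a real prefetcher would bound and manage this storage.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Sketch of a pair-wise address-correlating predictor.
class PairCorrelationTable {
public:
    // Observe a miss; optionally return a prefetch candidate (the successor
    // seen after this address the last time it missed).
    std::optional<uint64_t> on_miss(uint64_t miss_addr) {
        if (have_last_) successor_[last_miss_] = miss_addr;  // learn: last -> current
        last_miss_ = miss_addr;
        have_last_ = true;
        auto it = successor_.find(miss_addr);
        if (it != successor_.end()) return it->second;       // predicted successor
        return std::nullopt;
    }
private:
    std::unordered_map<uint64_t, uint64_t> successor_;       // grows with the working set
    uint64_t last_miss_ = 0;
    bool have_last_ = false;
};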

Markov Prefetcher
The Markov prefetcher [28,29] approximates a Markov model of the miss stream: a correlation table maps each observed miss address to a set of successor addresses seen after it in the past, several of which can be prefetched on the next occurrence. Subsequent studies characterize repetitive temporally correlated streams in commercial server applications and demonstrate that stream lengths vary from two to many thousands of cache blocks [30,31]. The most common stream length is only two misses, implying that a wide Markov table entry is storage inefficient. However, when weighted by the number of misses in the stream (i.e., the potential coverage that can be obtained by prefetching the stream), the median stream length is about ten cache blocks.
A key advance, introduced by Nesbit and Smith in their global history buffer [32], is to split the correlation table into two structures: a history buffer, which logs the sequence of misses in a circular buffer in the order they occurred, and an index table, which provides a mapping from an address (or other prefetch trigger) to a location in the history buffer. The history buffer allows a single prefetch trigger to point to a stream of arbitrary length.  The index table retains a set-associative storage organization similar to the original Markov prefetcher. However, rather than storing cache block addresses, the index table now stores pointers into the history buffer. When a miss occurs, the GHB references the index table to see if any information is associated with the miss address. If an entry is found, the pointer is followed and the history buffer entry is checked to see if it still contains the miss address (the entry may since have been overwritten). If so, the next few entries in the history buffer contain the predicted stream. History buffer entries can also be augmented with link pointers to other history buffer locations, to enable history traversal according to more than one ordering (e.g., each link pointer may indicate a preceding occurrence of the same miss address, enabling increases to prefetch width as well as depth).
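A behavioral sketch of this two-structure organization is shown below: a circular history buffer logs the miss stream, and an index table maps an address to its most recent position in the log. The buffer size, the returned stream depth, and the unbounded index map are illustrative simplifications.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of a global history buffer: circular miss log plus an index table.
class GlobalHistoryBuffer {
public:
    explicit GlobalHistoryBuffer(std::size_t size) : buf_(size, 0), valid_(size, false) {}

    // Log a miss and return up to `depth` addresses predicted to follow it.
    std::vector<uint64_t> on_miss(uint64_t addr, std::size_t depth) {
        std::vector<uint64_t> stream;
        auto it = index_.find(addr);
        if (it != index_.end() && valid_[it->second] && buf_[it->second] == addr) {
            // The entry still holds this address: the entries after it form
            // the stream that followed the previous occurrence.
            for (std::size_t i = 1; i <= depth; ++i) {
                std::size_t p = (it->second + i) % buf_.size();
                if (p == head_ || !valid_[p]) break;
                stream.push_back(buf_[p]);
            }
        }
        buf_[head_] = addr;          // append the new miss to the circular log
        valid_[head_] = true;
        index_[addr] = head_;        // remember its position for future lookups
        head_ = (head_ + 1) % buf_.size();
        return stream;
    }
private:
    std::vector<uint64_t> buf_;
    std::vector<bool> valid_;
    std::unordered_map<uint64_t, std::size_t> index_; // a real design bounds this table
    std::size_t head_ = 0;
};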

Execution-Based Prefetching
Another category of data prefetcher relies neither on repetition in miss sequences nor on data layouts; rather, execution-based prefetchers seek to explore the program's instruction sequence ahead of instruction execution and retirement to discover address calculations and dereference pointers. The key objective of such prefetchers is to run faster than instruction execution itself, to get ahead of the processor core, while still using the actual address calculation algorithm to identify prefetch candidates. As such, these mechanisms do not rely on repetition at all.
Instead, they rely on mechanisms that either summarize address calculation while omitting other aspects of the computation, guess at values directly, or leverage stall cycles and idle processor resources to explore ahead of instruction retirement.

Algorithm Summarization
Several prefetching techniques summarize the instruction sequence that traverses a data structure, such that the traversal pattern can be executed faster than the main thread to prefetch data structure elements. Roth and co-authors [3,24] propose a mechanism that summarizes traversals entirely in hardware by identifying pointer loads (load instructions that dereference a pointer) and the dependent chain of instructions that connect them. These dependence relationships are then encoded by hardware into a compact state machine, which can iterate through the sequence of dependent loads faster than instruction execution. Annavaram, Patel, and Davidson propose a general mechanism for extracting program dependence graphs (a subset of instructions that lead to missing loads) in hardware and then executing these graphs in dedicated precomputation engines [33].

Helper-Thread and Helper-Core Approaches
Thread-based data prefetching techniques [34,35,36,37,38,39,40,41,42] use idle contexts on a multithreaded or multicore processor to run helper threads that overlap misses with speculative execution. Individual techniques vary in whether they are automatic or require compiler/software support, whether they rely on simultaneous multithreading hardware and specific thread coordination mechanisms, whether they rely on additional cores, and whether they require additional mechanisms to insert blocks into remote caches. In nearly all cases, these techniques re-purpose spare execution contexts to execute the prefetching code. However, the spare resources the helper threads require (e.g., idle cores or thread contexts; fetch and execution bandwidth) may not be available when the processor executes an application exhibiting high thread-level parallelism. The benefit of these techniques must be weighed against scaling up the number of application threads.
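As a simplified illustration of the helper-thread idea, the sketch below runs a second thread that walks the same pointer array some distance ahead of the main computation and issues software prefetches for the records it will need. The lookahead distance and the minimal coordination through an atomic index are illustrative; real proposals add mechanisms to keep the helper neither behind nor uselessly far ahead.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: a helper thread prefetching records ahead of the main thread.
struct Record { double value; };

double process_all(const std::vector<Record*>& recs) {
    std::atomic<std::size_t> main_pos{0};
    std::atomic<bool> done{false};
    constexpr std::size_t LOOKAHEAD = 32;   // illustrative lookahead distance

    std::thread helper([&] {
        while (!done.load(std::memory_order_relaxed)) {
            std::size_t i = main_pos.load(std::memory_order_relaxed) + LOOKAHEAD;
            if (i < recs.size())
                __builtin_prefetch(recs[i], 0, 1);  // warm the cache for a future record
        }
    });

    double total = 0.0;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        main_pos.store(i, std::memory_order_relaxed);
        total += recs[i]->value;                    // the main thread's real work
    }
    done.store(true, std::memory_order_relaxed);
    helper.join();
    return total;
}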

Run-Ahead Execution
Run-ahead execution uses the execution resources of a core that would otherwise be stalled on a long-latency event (e.g., an off-chip cache miss) to explore ahead of the stalled execution in an effort to discover additional load misses and warm branch predictors. The idea in run-ahead is to capture a snapshot of execution state when the core would otherwise stall, then proceed past stalled instructions to continue to fetch and execute the predicted instruction stream. Instructions that are data-dependent on an incomplete instruction are not executed (e.g., a poison token is propagated through the register renaming mechanism).
When the long-latency event resolves (e.g., the original miss returns), the execution state is recovered from the snapshot and the original execution continues, re-crossing the instructions that were explored during run-ahead. The primary benefit of this scheme is the prefetching effect for long-latency loads. Run-ahead was originally proposed in the context of in-order cores by Dundas and Mudge [43].
Mutlu and co-authors explore efficient implementations in the context of out-of-order processors [44,45,46,47]. More recently, authors have explored non-blocking pipeline microarchitectures that speculate past long-latency loads without discarding speculative execution results when the loads return, instead re-executing only the dependent instructions [48,49].

Software prefetching
With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. Code Snippet 1.1 demonstrates a simple example code with software prefetching.
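Code Snippet 1.1 is not reproduced here; the following is a minimal sketch of the same idea, using the GCC/Clang __builtin_prefetch intrinsic and an illustrative prefetch distance.

// Sketch of software prefetching in a simple array-sum loop. The prefetch
// distance must be tuned to the loop body and memory latency; 16 is illustrative.
#define PREFETCH_DISTANCE 16

double sum_array(const double* a, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        // Request the line holding a[i + PREFETCH_DISTANCE] now so that it is
        // (ideally) resident by the time the loop reaches it; the prefetch
        // does not stall and does not fault on addresses past the array.
        __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}

Note that the prefetches issued near the end of the array target data that will never be used; as discussed below, this is usually cheaper than adding a bounds check.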
Processors that have multiple levels of caches often have different prefetch instructions for prefetching data into different cache levels. This can be used, for example, to prefetch data from main memory to the L2 cache far ahead of the use with an L2 prefetch instruction, and then prefetch data from the L2 cache to the L1 cache just before the use with an L1 prefetch instruction.
There is a cost for executing a prefetch instruction. The instruction has to be decoded and it uses some execution resources. A prefetch instruction that always prefetches cache lines that are already in the cache will consume execution resources without providing any benefit. It is therefore important to verify that prefetch instructions target data that is not already in the cache.
The cache miss ratio needed for a prefetch instruction to be useful depends on its purpose. A prefetch instruction that fetches data from main memory only needs a very low miss ratio to be useful because of the high main memory access latency. For example, if a prefetch instruction costs on the order of one cycle to execute while a main-memory access costs a couple of hundred cycles, the prefetch pays for itself even if only around one in two hundred of its targets actually misses. A prefetch instruction that fetches cache lines from a cache further from the processor to a cache closer to the processor may need a miss ratio of a few percent to do any good.
Commonly, software prefetching fetches slightly more data than is actually used. For example, when iterating over a large array it is common to prefetch data some distance ahead of the loop. When the loop approaches the end of the array, the software prefetching should ideally stop. However, it is often cheaper to continue to prefetch data beyond the end of the array than to insert additional code that checks when the end of the array is reached. This means that some unneeded data beyond the end of the array (on the order of 1 kilobyte for a typical prefetch distance) is fetched.

Recent Work in Prefetching

Sandbox Prefetching
Sandbox Prefetching (SBP) [50] works by testing several aggressive sequential prefetchers in a sandboxed environment outside the real memory hierarchy in order to determine which prefetchers should be used in the real memory hierarchy. Rather than issuing real prefetches, SBP evaluates prefetchers by placing prefetch addresses in a Bloom filter, a data structure designed to test quickly whether an element is present in a set. Demand cache accesses check the Bloom filter to see if the address could have been prefetched by the prefetcher currently being evaluated. Hits in the Bloom filter give confidence that the evaluated prefetcher would be accurate if it were deployed in the real memory hierarchy.
Several prefetchers are evaluated in a round-robin fashion, and the prefetchers with the most Bloom filter hits are used to issue real prefetches. SBP evaluates aggressive sequential prefetchers that immediately prefetch addresses with a fixed offset from the current demand access, like a next-line prefetcher. Once deployed in the real memory hierarchy, the chosen prefetchers perform no additional warm-up or confirmation before issuing prefetches. A pure hardware approach has also been proposed that detects indirect accesses automatically and issues prefetches for them.

Software Prefetching for Indirect Memory Accesses
Ainsworth proposed a compiler pass to automatically generate software prefetch instructions for indirect memory accesses [52]. Within the compiler, the pass finds loads that reference a loop induction variable and uses a depth-first search to identify the set of instructions that must be duplicated to load data for future iterations. Across the different workloads they evaluated, they achieved an average performance improvement of 1.3x on an Intel Haswell machine, 1.1x on an ARM Cortex-A57, 2.1x on an ARM Cortex-A53, and 2.7x on a Xeon Phi.
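A hedged sketch of the kind of code such a pass might emit for an indirect access a[b[i]] is shown below: the index load for a future iteration is duplicated, and both the future index and the future data element are prefetched. The two distances are illustrative, not the values chosen by the compiler pass.

// Sketch: software prefetching for an indirect access pattern a[b[i]].
double gather_sum(const double* a, const int* b, long n) {
    const long DIST = 32;                                 // illustrative distance
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + 2 * DIST < n)
            __builtin_prefetch(&b[i + 2 * DIST], 0, 1);   // future index element
        if (i + DIST < n)
            __builtin_prefetch(&a[b[i + DIST]], 0, 1);    // future data element (duplicated index load)
        sum += a[b[i]];
    }
    return sum;
}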

An Event-Triggered Programmable Prefetcher for Irregular Workloads
Ainsworth [53] proposed a software-assisted hardware prefetching mechanism to prefetch irregular access patterns. It employs low-power RISC cores to execute subprograms that compute future addresses and prefetch them. It relies on the programmer or compiler to generate these subprograms; offloading the work to dedicated cores keeps the main core from slowing down due to the extra instructions related to prefetching. In this study, we developed a software-assisted hardware mechanism, the Array Tracking Prefetcher, to prefetch indirect memory accesses.

Informed Pre-Execution for In-Memory Database Applications
Pointer-chasing access behaviors are common in in-memory database applications. These algorithms commonly include multiple lookups over a pointer-intensive data structure, where each lookup iterates over a set of nodes and each node has one or more pointers that point to the next node to be accessed.
As well as the huge number of cache misses created by these applications, long dependency chains also create an important performance bottleneck. Processing multiple lookups in parallel improves the performance of these applications significantly. However, hiding memory access latencies alone still does not maximize their throughput. Due to long dependency chains and high branch misprediction rates, prefetching-only solutions have limited potential. In this work, we proposed Node Tracker, a software/hardware system that prefetches/pre-executes future lookups in parallel and further improves CPU performance by using the knowledge extracted from these pre-executions.
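To illustrate the access pattern Node Tracker targets, the sketch below shows a batch of independent index probes over a pointer-linked tree, typical of in-memory databases: each step of a lookup is both a likely cache miss and a hard-to-predict branch, while different lookups are independent of one another, which is what makes prefetching and pre-executing future lookups profitable. The data structure and code are illustrative, not the evaluated workloads.

// Sketch: independent lookups over a pointer-linked search tree.
struct TreeNode { long key; void* payload; TreeNode* left; TreeNode* right; };

void* lookup(TreeNode* root, long key) {
    TreeNode* n = root;
    while (n != nullptr) {
        if (key == n->key) return n->payload;
        n = (key < n->key) ? n->left : n->right;   // data-dependent pointer chase and branch
    }
    return nullptr;
}

void batch_lookup(TreeNode* root, const long* keys, void** results, long count) {
    for (long i = 0; i < count; i++)
        results[i] = lookup(root, keys[i]);        // lookups are independent of each other
}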

Abstract
Current processors employ aggressive hardware prefetching mechanisms to improve performance and reduce power. Sequential prefetching is a widely employed and successful technique that exploits spatio-temporal memory access patterns in applications. However, it does not take prefetch timeliness into account. We propose a simple method that integrates timeliness into sequential prefetching. Our results show that a 139-byte direct-mapped mechanism can significantly improve the performance of an L2 sequential prefetcher and can match or outperform recently proposed complex prefetchers with simpler and smaller hardware.

Introduction
Modern processors employ prefetchers to hide long memory latencies for demand cache misses. Prefetchers predict data or instruction addresses that are likely to be used in the near future. When successful, they facilitate faster retrieval of data/instructions for demand requests. Next-line or sequential prefetching has been shown to provide significant performance benefits for applications with good spatial locality. However, such prefetchers prefetch rather blindly because they do not employ confidence mechanisms. This is problematic for two reasons: 1) they use cache and bandwidth resources rather blindly, which can reduce their benefit or even hurt performance and power; 2) even for applications with good spatial locality, they may not deliver the potential benefits because they are not timely in issuing prefetches. To address the first problem, Pugsley et al. [1] proposed the sandbox prefetching (SBP) method. In their method, a set of predetermined sequential offset prefetchers (hence it is also called offset prefetching) is tested by recording their predicted prefetch addresses in a sandbox on each demand access and counting the number of demand-access hits on the recorded potential prefetch addresses in the sandbox. After the evaluation interval, only the offsets with sandbox scores above a threshold are allowed to perform prefetching in the next interval. The sandbox proves to be a powerful idea, eliminating many unnecessary and potentially harmful prefetches: only after a prefetcher has been proven useful is it activated.
The second problem, although equally important in designing successful prefetchers, is not sufficiently addressed by SBP. If a prefetch is not timely, there is no benefit. The Best Offset Prefetcher (BOP) [2], which was the winner of the 2015 Data Prefetching Competition [3], develops a method to add prefetch timeliness to SBP. Similar to SBP, various offsets compete in a history table (e.g., a sandbox) and the best performing offset is chosen to perform prefetching in the next interval. However, the decision for the best offset is based not only on the number of correct predictions but also on their timeliness. In order to track timeliness, BOP records the time, i.e., the cycle, at which a prefetched cache line is placed in the cache. That is, it records the time of the cache line refill. This requires BOP to employ one bit per L2 cache line to track prefetched lines in the cache and to observe their refill times, stored in an auxiliary table, to determine timeliness.
In this work, we focus on both timeliness and accuracy, as in the BOP. Instead of offset-testing via a sandbox approach, however, we focus on the most popular offset, +1 (i.e., next-line), that occurs in most applications, and we propose a simple mechanism to guide the sequential prefetcher for timeliness. By dynamically adapting distance (hence, we call it Sequential Prefetcher with Adaptive Distance (SPAD)) [4] for the sequential prefetcher, we show that our proposed mechanism outperforms the BOP, with less hardware and lower complexity.
SPAD uses a testing queue, the TQ, in the same spirit as the SBP method, but its operation and purpose are quite different (as described in Section 2.5). After each evaluation period, SPAD's decision engine increments or decrements a distance counter to guide the sequential prefetcher in how far ahead a prefetch must be issued in the next interval to be useful. The decision on incrementing or decrementing the distance is based on several factors, such as the number of demand hits in the TQ, the number of L2 misses, and the number and ratio of demand misses that hit in the TQ. It is important to note that SPAD actively issues prefetches and gets evaluated at the same time using only a single testing buffer.
In addition, SPAD does not need to keep track of the prefetched lines and their refill times in the cache and therefore, despite providing better performance than BOP, it uses much simpler logic and much less hardware storage.
This chapter makes the following contributions: 1. It presents a detailed analysis of offset prefetching and provides insights into the understanding of offset prefetching performance.
2. It shows that although the best performing offset values are larger than 1, these offset values are rarely observed as delta values in SPEC CPU2006 benchmarks.
Most performance benefit, in fact, comes from the most frequently observed address delta value 1, but prefetching is more timely with offsets larger than 1.
3. It categorizes SPEC CPU2006 benchmarks based on their offset prefetching and delta pattern behaviors.
4. It proposes a simple and highly effective algorithm to track prefetch timeliness focusing on delta 1 prefetching. The proposed SPAD prefetcher outperforms recent offset prefetchers, SBP and BOP, with significantly lower hardware budget.
The rest of the chapter is organized as follows. Section 2.3 discusses related work. Section 2.4 presents a detailed analysis of offset prefetching and motivates our work. Section 2.5 describes the proposed SPAD prefetcher. Methodology is given in Section 2.6. In Section 2.7, we present the results. Finally, Section 2.8 concludes.
Many earlier prefetchers that focus on regular patterns deserve a mention. The earliest of all is next-line prefetching [8]. To avoid issuing useless prefetches, a prefetch bit can be added to each cache line [8], or the cache replacement status [9] can be used instead. Stride prefetchers are confidence-based prefetchers that exploit constant strides among memory-accessing instructions [10,11,12]. To exploit sequential streams, the stream buffer was introduced by Jouppi [13]. A stream buffer that can also track non-unit-stride accesses was later proposed by Palacharla and Kessler [14]. Other work [15,16,17] studied more aggressive stream buffers that exploit adaptive degree and distance values. Finally, some prefetchers used history tables to record and monitor past memory accesses to predict future addresses [6,5,18,19,20,21,22,23,24]. In this chapter, we compare our proposal with more recent work that outperforms these earlier prefetchers.
In this section, we describe them in detail.
In our evaluation, all prefetchers live at the L2 cache level. The information available to the prefetcher at this level of cache hierarchy is limited. In most current processors, the program counter (PC) is unavailable at this level. This makes PC-based patterns harder to track. Furthermore, a prefetcher at this level must deal with physical addresses directly without the TLB or other page table information. Therefore, many prefetchers track addresses on a per physical page basis, discovering patterns and prefetching within multiple simultaneously tracked physical pages. This complicates the effective design for the prefetcher, especially in terms of efficiently tracking multiple simultaneous physical pages. All prefetchers that we evaluated in this work were originally proposed as L2 prefetchers, except for GHB [6].

Sandbox Prefetching
Sandbox Prefetching (SBP) is the first full-fledged offset prefetcher. It is cost-effective and was shown to slightly outperform the more complex AMPM prefetcher [5] that won the 2009 Data Prefetching Competition [25]. SBP works by evaluating and scoring candidate prefetchers without issuing actual prefetch requests.
Instead, these candidate requests are recorded in a Bloom filter structure. The accuracy of these predicted prefetch addresses is evaluated by checking if subsequent demand accesses hit in the Bloom filter. Pugsley et al. proposed offset prefetchers as candidate prefetchers. A number of fixed-offset prefetchers (that prefetch X + O, where X is the block address requested and O is the non-zero offset) are evaluated using a sandbox and the ones with a score above a threshold are allowed to issue prefetch requests until the next evaluation period has been completed and new scores have been obtained. The score for an offset is simply the number of hits in the sandbox (i.e., the Bloom filter) during the interval where that offset has been tested.
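The sandbox evaluation of a single candidate offset can be sketched as follows: predicted prefetch addresses are inserted into a small Bloom filter, and later demand accesses that hit in the filter increment that offset's score. The filter size, hash functions, and the single-offset interface are illustrative simplifications of the scheme described above.

#include <bitset>
#include <cstddef>
#include <cstdint>

// Sketch: sandbox scoring of one candidate offset using a Bloom filter.
class Sandbox {
public:
    void start_test(int64_t offset) { offset_ = offset; score_ = 0; filter_.reset(); }

    // Called on every demand access to cache-line address `line`.
    void on_demand_access(uint64_t line) {
        if (test(line)) ++score_;         // this line would have been prefetched by the offset under test
        insert(line + offset_);           // record what the offset under test would prefetch
    }

    int score() const { return score_; }  // compared against a threshold after the interval
private:
    static constexpr std::size_t BITS = 2048;
    std::size_t h1(uint64_t a) const { return a % BITS; }
    std::size_t h2(uint64_t a) const { return (a * 0x9E3779B97F4A7C15ULL >> 32) % BITS; }
    void insert(uint64_t a) { filter_.set(h1(a)); filter_.set(h2(a)); }
    bool test(uint64_t a) const { return filter_.test(h1(a)) && filter_.test(h2(a)); }

    std::bitset<BITS> filter_;
    int64_t offset_ = 1;
    int score_ = 0;
};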
SPAD and SBP are both sequential prefetchers. The SBP does not take into account prefetch timeliness when evaluating offsets and therefore its scoring mechanism is suboptimal. Offsets are simply tested for accuracy by counting the number of hits in the sandbox. SPAD exploits spatio-temporal patterns of one single offset and uses a sandbox-like table to test both prefetch timeliness and accuracy.

Best Offset Prefetcher
The Best Offset Prefetcher (BOP) is the winner of the 2015 Data Prefetching Competition [3]. BOP tests a list of candidate offsets in rounds: an offset gains score when a demand access indicates that a prefetch issued with that offset in the past would have been both accurate and timely (i.e., the corresponding line would already have been filled). Once the rounds of a learning phase have been completed, the BOP chooses the best offset to issue prefetches with, i.e., the offset with the highest score. Then, the scores and the round counter are reset and a new learning phase starts.
Similar to SPAD, BOP considers both accuracy and timeliness of prefetches.
However, SPAD is not an offset prefetcher in the same sense, because it only considers one single offset. Our results show that by focusing on the most commonly observed offset, 1 (i.e., the resulting prefetcher is a next-line prefetcher), and on the timeliness of the issued prefetches, SPAD outperforms BOP with less hardware and lower complexity.
Access Map Pattern Matching

Access Map Pattern Matching (AMPM) [5] tracks, for each memory region (e.g., page), which cache lines have been accessed using a bit-vector access map and issues prefetches when it detects stride patterns within the map. However, per-page tracking of each cache line requires much larger hardware storage than SPAD needs to perform similarly. Another disadvantage of an AMPM-like prefetcher is its training time. AMPM requires three accesses along a stride pattern within a region before prefetching starts. This warm-up is needed for each region independently. SPAD, being a sequential prefetcher, does not need this warm-up or confirmation before issuing prefetches. Finally, unlike SPAD, AMPM does not consider the timeliness of the issued prefetches.

Global History Buffer
Global History Buffer (GHB) [6] prefetchers decouple the prefetch history into two structures: a circular history buffer that logs the miss stream in order, and an index table that maps an address (or PC) to its most recent position in the buffer. This organization can support several prefetching methods, including PC-localized delta correlation. Unlike the other prefetchers we evaluate, GHB was not originally proposed as an L2 prefetcher.

Motivation
Offset prefetching is a generalization of next-line prefetching. In next-line prefetching, if a cache line A is demand requested, the prefetcher issues a prefetch request for line A + 1, i.e., the next sequential line. Recently proposed SBP [1] and BOP [2] (details of which are explained in Section 2.3) show that offset prefetching is relatively simple and outperforms more complicated prefetchers. These are the only two offset prefetchers that we are aware of at the time of this writing. In the following, we analyze offset prefetching and motivate our work.

Performance Potential with Offset Prefetchers
In this section, we analyze fixed-offset sequential prefetching to find the best-performing fixed offset for each SPEC CPU2006 benchmark (all evaluations in this work use L2-level prefetchers). We simulated sequential prefetching for offsets from -16 to 16 (32 fixed offsets) for each benchmark. Many prefetchers employ a single bit per cache line to track whether the cache line was placed into the cache by a prefetch or by a demand access. There are multiple reasons. First, one can evaluate the success of prefetching. Second, if prefetches were only issued on cache misses to preserve bandwidth and increase accuracy, prefetching would negatively impact the updates and predictions on prefetcher tables because it changes the miss patterns. Therefore, most prefetchers that issue prefetches on cache misses also issue prefetches when a hit on a previously prefetched cache line occurs.
Seven benchmarks do not benefit from sequential prefetching (they are placed under offset 0 (no prefetching) in the table): the best performing fixed offset provides less than 0.5% speedup at best. We obtained similar results with the other prefetchers we tested; therefore, we do not consider these seven benchmarks further in this study.
Since the best performing offset varies from benchmark to benchmark, a prefetcher that can automatically find the best offset per benchmark would perform the best (hence the adaptivity of the recently proposed SBP and BOP). If we were to pick a single fixed offset across all benchmarks, we would pick offset 1, as the most commonly observed global and/or local address delta is 1; that is, the resulting prefetcher would be a next-line prefetcher. Figure 2.1 shows, however, that this would be a bad decision. In Figure 2.1, we compare how well various fixed-offset prefetchers perform relative to the +1 offset prefetcher (i.e., the next-line prefetcher). We used fixed offsets ranging from 2 to 16; negative fixed offsets perform very poorly when the same negative offset is used across all SPEC CPU2006 benchmarks. Figure 2.1 shows that offset 1 is clearly not the best fixed offset on the SPEC CPU2006 benchmarks; the best fixed offset is 4. Figure 2.1 also shows the performance when the Best Fixed Offset for each Benchmark (BFOB) is used. BFOB performs significantly better, suggesting that methods exploiting this behavior are worth pursuing. We also present the results for an oracle offset prefetcher, Oracle, which has prior knowledge of the best performing offset for each interval (every 512 L2 accesses) within an application and thus perfectly adapts to changing program behavior. Surprisingly, Oracle only marginally (about 0.5%) outperforms the best fixed offset. Figure 2.1 also shows the performance results for the recently proposed SBP and BOP (the original BOP uses 46 offsets, 23 positive and 23 negative, between -40 and +40; in our implementation, SBP performed best with 16 offsets, 8 positive and 8 negative, i.e., +/-8). SBP performs better than the fixed offset 4 prefetcher. BOP outperforms SBP by incorporating timeliness in choosing the best offset. There is, however, significant headroom for improvement.
BFOB and Oracle significantly outperform both SBP and BOP.

Understanding the Performance of Offset Prefetchers
To understand where the performance really comes from, we try to correlate the most frequently observed global and local deltas within a benchmark with its offset prefetching performance. (To reduce interference from interleaved memory accesses, we compute global deltas as the differences between the current address and the last three accessed addresses and take the smallest of the three.) We recorded the most frequently observed global and local (i.e., per-PC) deltas for each benchmark. As expected, +1 is the most frequently observed delta both globally and locally. 470.lbm and 481.wrf are the only two benchmarks that do not observe delta 1 significantly, either globally or locally.
Although delta 1 is the most frequently observed delta globally and/or locally in
most of the benchmarks, only two benchmarks achieve their best performance with offset 1, as shown in Table I. When we analyze the benchmarks individually based on their offset prefetching behavior, we see that for most benchmarks whose best performing offset is larger than 1, the most frequently seen delta is 1, and the best performing offset does not appear as a frequent delta. Most of these benchmarks see significant speedup with offset 1, and the speedup increases as the offset increases, peaking at the best offset value. This behavior suggests that offset 1 often issues late prefetches and that increasing the prefetch distance improves the performance of delta 1 prefetching. That is, offset 1 is the most important offset, but it is usually not timely in issuing prefetches.
Table 2.3 categorizes the benchmarks based on their offset prefetching behavior. There are 4 categories. Category 1: 13 out of 21 benchmarks are in this category. The best offset is larger than 1. The most frequently observed global and/or local delta is 1. Offset 1 prefetching provides significant speedup, but prefetches are often issued too late to fully hide memory latency; therefore, a prefetch distance is beneficial. The best offset is equal to delta 1 plus the best prefetch distance. For one representative benchmark in this category, the offset 1 prefetcher (i.e., the next-line prefetcher) has a speedup (over the no-prefetching baseline) of 23%. The speedup increases as the offset increases, reaching 43% with the offset 7 prefetcher (i.e., the best offset). SBP provides a speedup similar to the offset 1 prefetcher. BOP achieves 39%, a little lower than the best fixed offset. The result for 437.leslie3d is very similar: 28% speedup with the offset 1 prefetcher, which increases to 41% with the best fixed offset (BFOB). Again, BOP (36%) outperforms SBP (30%), demonstrating that it is successful in integrating timeliness into SBP. However, there is still room for improvement, as its performance is lower than BFOB. On average, BOP outperforms SBP on category 1 benchmarks, as expected. However, BOP does not always perform well in this category. For example, BFOB provides a 10% speedup over the offset 1 prefetcher (i.e., the baseline is the offset 1 prefetcher) for 456.hmmer with offset 3. BOP is only 1.2% better than the offset 1 prefetcher, while SBP outperforms the offset 1 prefetcher by 5%; SBP is still about 5% worse than BFOB but about 3% better than BOP. Thus, BOP is not able to adapt to the behavior of 456.hmmer. 462.libquantum is another benchmark where BOP (and also SBP) underperforms BFOB significantly.
Almost all benchmarks in this category behave similarly except for 429.mcf.
429.mcf has mostly irregular memory accesses. BFOB (with offset 5) provides only about 6% speedup over no prefetching. This speedup is still due to delta value 1, which accounts for 8% of the observed global deltas. With a prefetch distance of 4, i.e., offset 5, the speedup is 3.6% better than next-line prefetching (i.e., offset 1). Finally, 401.bzip2 performs best with offset -2, achieving a 7.5% speedup over the no-prefetching baseline. However, its behavior is rather unusual (see Figure 2.2a). For offset -1, there is no speedup but a slight slowdown, and for offsets smaller than -2, the speedup drops to less than 2%. Considering that the most frequently observed delta is -1, this benchmark's behavior is hard to capture for prefetch timeliness.

Our Motivation
We can draw important conclusions from the analysis of offset prefetching: 1. The best offset varies for each benchmark. Most best offsets are small, between 1 and 6.
BOP and SBP are recent proposals that aim to find the best offset. However, although BOP and SBP perform better than recent prefetchers that exploit regular memory access patterns, there is a significant room for improvement.
BFOB outperforms BOP by 3%, as shown in Figure 2.1. Based on our findings, we propose a simple yet effective sequential prefetching mechanism with adaptive distance, SPAD. Since most of the performance benefit in offset prefetching comes from delta value 1, but with better prefetch timeliness than the next-line prefetcher (i.e., the offset 1 prefetcher), SPAD employs delta 1 prediction with a feedback mechanism to predict the best prefetch timing for delta 1. SPAD tracks issued prefetches using a test queue: on a demand access to line A, a prefetch request for line A + Distance is sent to lower-level memory. SPAD increments or decrements the Distance after an evaluation period, trying to adapt to the application behavior by finding the best prefetch distance for delta 1. If prefetching is determined to be harmful, SPAD turns off the prefetching by resetting the Distance to zero.

Test Queue (TQ)
In order to estimate prefetch timeliness with the current Distance, SPAD records predicted prefetch addresses (shown as test address in Figure 2.3) in a table called the Test Queue (TQ). Several implementations are possible for the TQ. In this study, we choose the simplest implementation: the TQ is direct-mapped, accessed through a simple index function, with each entry holding a tag. The tag does not need to be the full address; a partial tag is sufficient. In our implementation, for a 128-entry TQ, we use the 7 least significant line address bits to index the table.
For 8-bit tags, we skip the 7 least significant line address bits and extract the next 8 bits.
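To make the indexing and tag extraction concrete, the following sketch shows one straightforward realization of this scheme (our illustration, not the simulator code; it assumes 64-byte cache lines and treats the TQ purely as a filter, so a false positive merely suppresses a prefetch):

#include <cstdint>

// Sketch of the 128-entry, direct-mapped Test Queue with 8-bit partial tags.
constexpr unsigned kIndexBits = 7;   // 128 entries
constexpr unsigned kTagBits   = 8;

struct TestQueue {
    uint8_t tags[1u << kIndexBits] = {};

    static unsigned index(uint64_t lineAddr) {          // 7 least significant line-address bits
        return lineAddr & ((1u << kIndexBits) - 1);
    }
    static uint8_t partialTag(uint64_t lineAddr) {      // skip the index bits, take the next 8 bits
        return (lineAddr >> kIndexBits) & ((1u << kTagBits) - 1);
    }
    bool contains(uint64_t lineAddr) const { return tags[index(lineAddr)] == partialTag(lineAddr); }
    void record(uint64_t lineAddr)          { tags[index(lineAddr)] = partialTag(lineAddr); }
};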
Each L2 cache demand access (e.g., to line A) triggers a prediction for a candidate prefetch address. The line address A + Distance is predicted as a prefetch candidate; if it is not already in the TQ, it is recorded in the TQ and a prefetch for A + Distance is issued. If it is found in the TQ, no prefetch is issued.
That is, the TQ also acts as a prefetch filter in order to filter redundant prefetch requests going to the memory system.
When Distance is zero, no prefetching occurs. However, the TQ continues to record A + 1 (next-line) as if an offset 1 prefetcher were active. This is needed to continue evaluating the prefetcher while it is off, so that it can be reactivated later when prefetching becomes useful again.

Interval
Distance is updated at the end of each evaluation period, called an Interval. When prefetching is off (Distance is zero), no prefetch is issued during an interval, but the next-line prefetcher (i.e., Distance = 1) continues to be evaluated.

Decision Engine (DE)
After each interval, the DE updates the Distance as necessary based on three counters: l2miss, tqhits, and tqmhits. l2miss tracks the total number of L2 cache misses in an interval, tqhits is the number of L2 demand accesses that hit in the TQ in an interval, and tqmhits tracks the number of L2 demand misses that hit in the TQ in an interval. After each interval, the DE checks several conditions to make the distance update decision as follows (a sketch of this logic follows the list).
1. If tqhits < TQTHLD (64 in our evaluation) and Distance > 1, Distance is decremented. The intuition behind this action is that tqhits is low either because the predicted addresses are not accurate or because prefetches are issued so early that they are no longer in the TQ (replaced by other predictions).
2. If tqhits < TQTHLD for three consecutive intervals, prefetching is considered useless and Distance is reset to zero, disabling prefetching.
3. If tqhits >= TQTHLD, the update decision for Distance is made as follows (in this order): (a) If l2miss < MISSTHLD (8 in our evaluation), no update is made, assuming the current Distance is successful.
(b) Finally, if tqmhits > l2miss/2 for more than two consecutive intervals, Distance is incremented. No division is necessary for this check; a simple shift operation is sufficient. The intuition behind this decision is that when most L2 misses are found in the TQ, prediction accuracy is high but prefetches are likely issued too late.
4. Since the prefetcher continues to record predicted addresses (i.e., A + 1) in the TQ when prefetching is off (i.e., Distance is zero), prefetching can be turned back on if it proves successful. In our implementation, this is measured by the following condition: tqhits >= TQTHLD for two consecutive intervals.
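These four rules can be summarized as a small end-of-interval state update. The sketch below is our reading of the rules; the consecutive-interval counters and the re-enable distance of 1 are bookkeeping assumptions, not a description of the exact circuit:

// Hedged sketch of the Decision Engine's end-of-interval update (TQTHLD = 64, MISSTHLD = 8).
struct DecisionEngine {
    int distance = 1;            // current prefetch Distance (0 = prefetching off)
    int lowTqhitIntervals = 0;   // consecutive intervals with tqhits < TQTHLD
    int highTqhitIntervals = 0;  // consecutive intervals with tqhits >= TQTHLD
    int lateIntervals = 0;       // consecutive intervals with tqmhits > l2miss/2

    static constexpr int TQTHLD = 64;
    static constexpr int MISSTHLD = 8;

    void endOfInterval(int l2miss, int tqhits, int tqmhits) {
        if (tqhits < TQTHLD) {
            ++lowTqhitIntervals;
            highTqhitIntervals = 0;
            if (distance > 1) --distance;                    // rule 1: shrink the distance
            if (lowTqhitIntervals >= 3) distance = 0;        // rule 2: turn prefetching off
        } else {
            lowTqhitIntervals = 0;
            ++highTqhitIntervals;
            if (distance == 0) {
                if (highTqhitIntervals >= 2) { distance = 1; highTqhitIntervals = 0; } // rule 4
            } else if (l2miss >= MISSTHLD) {                 // rule 3(a): few misses -> keep Distance
                if (tqmhits > (l2miss >> 1)) {               // rule 3(b): most misses hit in the TQ
                    if (++lateIntervals > 2) { ++distance; lateIntervals = 0; }
                } else {
                    lateIntervals = 0;
                }
            }
        }
    }
};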

Integrating Negative Distance Prediction into SPAD
For a few benchmarks, negative deltas are most important for performance.

For the competing prefetchers (including Slim AMPM [26], an improved version submitted to DPC-2), we used their source code available at the DPC-2 website. Hence, these prefetchers are the authors' versions optimized for the DPC-2 framework. This is one of the reasons why we used the DPC-2 framework in our evaluation. Our SPAD implementation, also optimized for the DPC-2 framework, allows a fair comparison.

Simulator Parameters
The DPC-2 framework models a 6-wide issue out-of-order core with the parameters summarized in Table 2. SPAD uses the TQ and a few counters to keep track of several events, as described in Section 2.5.3. In addition, SPAD needs a 9-bit interval register, two 6-bit Distance registers (one for delta 1 and one for delta -1), one 8-bit register to hold TQTHLD, and one 4-bit register for MISSTHLD. The overall hardware budget is 1105 bits, or 139 bytes.
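One plausible decomposition of this 1105-bit budget (our reconstruction, assuming tag-only TQ entries and three 16-bit event counters for l2miss, tqhits, and tqmhits) is:

128 x 8 (TQ partial tags) + 3 x 16 (event counters) + 9 (interval register) + 2 x 6 (Distance registers) + 8 (TQTHLD) + 4 (MISSTHLD) = 1024 + 48 + 9 + 12 + 8 + 4 = 1105 bits.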

Benchmarks
In our evaluations we used the SPEC CPU2006 [27] benchmark suite. We used SimPoint 2.0 [28] to generate representative 100M-instruction traces. Measurements from the early part of the cycle-accurate simulations are discarded to account for warm-up effects.

Results
We simulate SPAD with a 128-entry direct-mapped TQ and an interval size of 512. We compare our results to five state-of-the-art prefetchers that exploit regular memory access patterns; Section 2.3 contains their descriptions. SPAD also focuses on regular patterns. Our work is most closely related to the recently proposed offset prefetchers SBP [1] and BOP [2], which outperform prior work. A comparison of hardware budgets shows that SPAD uses much less hardware than the competing prefetchers: its budget is only one-fourth of BOP's, the best performing competitor, and half of SBP's.
In the following, we first explore SPAD's design space to find the best performing interval length and TQ size. We then present the performance evaluation of SPAD and a comparison with prior methods.

Conclusion
Variants of next-line sequential prefetchers have commonly been employed in current processors due to their good performance and simplicity. In next-line prefetching, when a line A is demand accessed, a prefetch is issued for line A + 1. In this chapter, we analyzed offset prefetching and observed that benchmarks often perform best with offsets larger than 1 even though they exhibit regular delta 1 memory access patterns. A specific offset works best not because memory access sequences exhibit a delta value equal to that offset, but because that offset provides a prefetch distance for delta value 1, making the prefetches more timely. We proposed the Sequential Prefetcher with Adaptive Distance (SPAD) to exploit this behavior. SPAD focuses on delta value 1 and prefetch timeliness. Instead of testing many offset prefetchers, it tests only one prefetcher (delta value 1) and tracks the best prefetch distance for timeliness. Our results show that SPAD outperforms SBP and BOP by 1.5% and 1.1%, respectively, with a simpler mechanism and a much lower hardware budget.

List of References
[2] P. Michaud, "A best-offset prefetcher," in 2nd Data Prefetching Championship (DPC-2), 2015.

Introduction
Traversing sparse matrices and graphs frequently results in indirect memory accesses, which have irregular access patterns and thus poor cache spatial locality. See, e.g., [2] (describing a compiler-based system to generate software prefetches for indirect memory accesses). Figure 3.1 shows the percentage increase in the number of instructions in a loop iteration due to software prefetching for various benchmarks. For some benchmarks, e.g., Integer Sort (is) and Conjugate Gradient (cg), the overhead is very high (approximately 100%) because those benchmarks have relatively few instructions in each iteration. Graph500 (g500) has a high instruction overhead due to border checking instructions in its frequently executed loop. While the benefit of prefetching may offset the overhead for some benchmarks (e.g., is and g500), for others (e.g., cg, Hashjoin ph 2 (hj2), and Hashjoin ph 8 (hj8)), the reverse is true.
In addition to instruction overhead, the benefit of software prefetching is further limited by (1) dependences related to prefetch address calculation, which may reduce the prefetch distance (i.e., how far in advance of the memory access the prefetch instruction is issued), and (2) the lack of run-time information, which is required to optimally place the prefetch instructions. Figure 3.2 depicts the effect of prefetch distance on speedup for two benchmarks, Histogram (histo) and PageRank (pr). For histo, the highest speedup occurs at the largest prefetch distance (128), while the speedup is lower for smaller distances, especially for prefetch distances of 1, 2, 4, and 8. The opposite is true for pr: the highest speedups occur at the smallest prefetch distances. For a set of memory-bound benchmarks with indirect memory accesses, the prefetch distance has a very significant effect on the speedup of software prefetching; more specifically, the average speedup across all benchmarks ranges from 1.19 (worst) to 1.84 (best). Finally, it is also important to remember that the optimal prefetch distance for a given application may change when the application runs on a different underlying architecture [2], which further underscores the need for run-time information to optimally place the prefetch instructions.
Hardware prefetchers, by contrast, do not require executing additional instructions in order to compute and issue prefetches. However, capturing irregular access patterns in hardware requires complex mechanisms; an example is IMP [4], which was designed to capture a few different indirect memory access patterns.

Code Snippets 3.1-3.3 below illustrate the limitations of hardware prefetching and, concomitantly, the advantages of software prefetching. We use IMP as an exemplary hardware prefetcher given its efficacy and relatively low complexity. By contrast, software prefetching can accurately prefetch the memory accesses depicted in these code snippets, but only with substantial programmer effort.
Also, prefetching these memory accesses incurs significant overhead because the arithmetic/logical operations must be performed for every prefetch. Lastly, the best prefetch distance is hard to predict due to the lack of run-time information.
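To illustrate where this overhead comes from, the sketch below (a hypothetical kernel, not one of the code snippets referenced above) adds a software prefetch for an indirect access A[B[i]]; every iteration pays for an extra load of B[i + dist], the address arithmetic, and the prefetch instruction itself, and the distance dist must be fixed without run-time feedback:

// Hypothetical indirect-access loop with a software prefetch inserted.
void indirect_sum(const int *A, const int *B, long n, long dist, long &sum) {
    for (long i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&A[B[i + dist]]);   // extra load of B[i+dist] plus address math
        sum += A[B[i]];
    }
}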
Given that software and hardware prefetchers have different strengths and weaknesses, in this chapter we propose a prefetch mechanism that attempts to combine the strengths of each. More specifically, we propose the Array Tracking Prefetcher (ATP).

Array Tracking Prefetcher (ATP)
ATP is an integrated software/hardware approach to prefetching indirect data access patterns. The remainder of this section describes the software and hardware components of ATP.

ATP's Software Component
The software component of ATP extracts information related to indirect memory accesses within a loop and passes this information to the hardware component. The programmer can manually mark this loop, as shown in Figure 3.3a, or the compiler can automatically identify it using an approach similar to that described in [2]. The software component passes the extracted indirect-access information to the hardware component through special metadata instructions called Array Tracking Instructions (ATIs), as shown in Figure 3.3c.
Array Tracking Instructions (ATIs): Each ATI is a single 6-byte instruction (two bytes for the opcode, 2 bits to specify the type of the ATI, and the remainder for the operands). When a core detects an ATI, it removes it from the pipeline and forwards it to the ATP. The ATP also includes a mechanism to dynamically change the prefetch distance in order to adapt to run-time behavior and achieve better timeliness and performance.
An overview of the ATP hardware mechanism is shown in Figure 3.5. ATP operates in four stages: A) The ATIs are processed to initialize the ATP tables. B) The strides, element sizes, and base addresses needed for prefetch calculation are determined. C) Then, whenever the ATU observes a demand access to a trigger array, it notifies the PCU to begin the prefetch calculation process and issue prefetches. D) Finally, the prefetch distance is dynamically adjusted based on feedback from the DS, which finds the best performing prefetch distance using a simple mechanism. Next, we explain the operation of ATP in these four stages.

Processing of ATI instructions and ATP Initialization
Each valid ATI in the ATQ is processed in order, from the ATQ's head to its tail.
ATIs are used to initialize/program the ATU tables. The ATU consists of three important tables: the Array Table (AT), the Indirect Relation Table (IRT), and the Operation Table (OT). An AT CL instruction resets all the ATU tables; namely, the valid bits are set to zero in the AT, IRT, and OT. We explain how ATI instructions initialize or program the ATU tables using the example in Figure 3. In this example, the index array has the highest depth, and a target array that is not an index into another target array has the lowest depth, 1. Depth is used for prefetch address calculation, as described in Section 3.3.2.
Finally, AT RL is followed by AT OP instructions, since the base array is not directly used as an index for the target array. The first AT OP is an AND with data 0x7f and the second AT OP is a MUL with 14 as its data. AT OP instructions can only follow an AT RL or another AT OP instruction. AT OP sets the op bit of the corresponding IRT entry to 1, denoting that the base array value must undergo an operation before being used as an index into the target array. The op idx field specifies the index of the OT entry that corresponds to the operation specified by AT OP. If an AT OP is followed by another AT OP, the next bit field of the earlier AT OP's OT entry is set to 1. Once the last AT OP in the ATQ has been processed, the ATU has completed initialization and ATP is ready to move to the size and base calculation stage. An illustrative sketch of such a sequence appears below.
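The exact binary encoding of these instructions is given in Figure 3.3c, which is not reproduced here; purely as an illustration, a sequence of the kind described above might look as follows for an access of the form A[(B[i] & 0x7f) * 14] (mnemonics and operand order are our assumptions):

// The kind of access pattern being programmed (illustrative):
long indirect_masked_sum(const long *A, const unsigned char *B, long n) {
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += A[(B[i] & 0x7f) * 14];
    return sum;
}
// Illustrative ATI sequence for this loop (encoding assumed):
//   AT CL              ; reset the AT, IRT, and OT
//   ATAR B, ...        ; declare the index (trigger) array B
//   ATAR A, ...        ; declare the target array A
//   AT RL B -> A       ; values read from B index into A (direct relation)
//   AT OP AND, 0x7f    ; first operation applied to the value read from B
//   AT OP MUL, 14      ; chained second operation (previous AT OP's next bit set)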

Size and Base Address Calculation
Before prefetching can start for indirect accesses, the stride of the trigger arrays and the element sizes and base addresses of the target arrays must be known. The relevant IRT and OT fields are summarized below.

IRT.type
Specifies the indirect relation type. Two types are supported: direct and pointer. In a direct relation, the value read from the base (index) array is used as an index, or used in the calculation of the index, for the target array. Figure 3.3 demonstrates a direct type. Pointer types are used to connect type 0 trigger entries of base arrays to type 1 trigger entries of destination arrays. In a pointer type, the value read from the base array is used as the root address of the destination access. Root addresses are used to calculate prefetch addresses for incoming dimensions in a two-dimensional array.

IRT.op bit
Set if operations need to be performed on the index array value for target array index calculation (see Figure 3.6 for an example).

IRT.op idx
The index of the OT entry that specifies the operation to perform.

OT.op
Operation to perform on index array values.
OT.data
Data needed for the operation. The first operand is the index array value if the previous entry's next bit is zero; otherwise, the first operand is the result of the previous operation.

OT.next bit
Set if another operation (specified in the next OT entry) needs to be performed after the current operation.
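Read together, the OT fields describe a simple operation chain: starting from the value loaded from the index array, each entry applies its operation and, if its next bit is set, the following entry continues with the result. A minimal sketch of walking such a chain (entry layout and operation encoding are our assumptions):

// Hedged sketch of evaluating an OT operation chain.
struct OTEntry { int op; long data; bool next; };   // op encoding assumed: 0 = AND, 1 = MUL

long applyOps(const OTEntry *ot, int opIdx, long indexValue) {
    long result = indexValue;                        // first operand: value read from the index array
    for (;;) {
        const OTEntry &e = ot[opIdx++];
        switch (e.op) {
            case 0: result &= e.data; break;         // AND
            case 1: result *= e.data; break;         // MUL
            default: break;                          // other operations omitted in this sketch
        }
        if (!e.next) return result;                  // no chained operation follows
    }
}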
ATP employs a single mechanism to compute sizes and base addresses. The PCU represents each indirect structure as a graph of arrays, and prefetch address calculation is performed for each node of this graph. Here, we explain how prefetching is performed for each indirect structure shown in Figure 3.7.

In a multi-level A[B[C[i]]] structure, the index array C represents the root node of the graph, as shown in Figure 3.7a. When the ATU signals the PCU on an access to the trigger array C, an entry for C is allocated in the PCB. Since C is the trigger array, by following the indirect map fields in the AT, its target array B and then A are also inserted into the PCB, and they are linked to their sources in the PCB entries. After this initialization is done, the PCU starts calculating the prefetch addresses for each entry. To calculate the prefetch address for any non-trigger array, the PCU needs to read a value from the source array; for the path C → B → A, it reads C[i + 1 * dist].

(Table: prefetch calculation steps for each path of Figure 3.7.)
For node C: Step 1. Compute the address for C[i + 1 * distance] using Equation 3.3 and read its value from the L1 cache. Step 2. Compute the address for B[C[i + 1 * distance]] using Equation 3.2 and read its value from the L1 cache. Step 3. Compute the prefetch address for A from the value read in Step 2 and issue the prefetch.
Finally, in two-dimensional arrays, prefetching can be triggered by two different load accesses (B1 and B2). Also, the edge from B1 to B2 is represented as a pointer-type relation instead of a regular type, as for the structures discussed above (Figures 3.7a-3.7c). A sketch of the address calculation for the C → B → A path is given below.
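As a concrete illustration of these steps, the following sketch walks the C → B → A path at a given distance. The base addresses and element sizes come from the size and base address calculation stage, the address form is the usual base plus scaled index (Equations 3.2 and 3.3 are not reproduced here), and readFromL1/issuePrefetch stand in for simulator hooks; all names are our assumptions:

#include <cstdint>

struct ArrayInfo { uint64_t base; uint64_t elemSize; };

// Hypothetical hooks standing in for the PCU's L1 read port and prefetch queue.
bool readFromL1(uint64_t addr, int64_t &value);   // returns false on an L1 miss
void issuePrefetch(uint64_t addr);

uint64_t elemAddr(const ArrayInfo &a, uint64_t index) { return a.base + index * a.elemSize; }

// Prefetch-address calculation for A[B[C[i]]] at distance 'dist'.
// If a needed source value is not in the L1, the prefetch is dropped, as discussed later.
bool computePrefetch(uint64_t i, uint64_t dist,
                     const ArrayInfo &C, const ArrayInfo &B, const ArrayInfo &A) {
    int64_t cVal, bVal;
    if (!readFromL1(elemAddr(C, i + dist), cVal)) return false;  // Step 1: read C[i + dist]
    if (!readFromL1(elemAddr(B, cVal), bVal))     return false;  // Step 2: read B[C[i + dist]]
    issuePrefetch(elemAddr(A, bVal));                            // Step 3: prefetch A[B[C[i + dist]]]
    return true;
}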

Adaptive Prefetching Distance Selection
The Distance Selector (DS) enables ATP to adjust the prefetch distance (in terms of how many array elements ahead) dynamically so that prefetches are timely across different applications and configurations.
Each power-of-2 prefetch distance from 2 to 16 competes during a testing period; at the end of this period, the distance that takes the smallest number of cycles to complete the same number of loop iterations is picked as the distance for the acting period that follows. The acting period is fixed at 50 times the length of the testing period. After the acting period, another testing period follows.
In the testing period, each prefetch distance is run for a fixed number of loop iterations (64 in our experiments, of which the first 32 are used for warm-up and the next 32 for performance measurement) in a round-robin fashion, and the number of cycles is counted. The DS employs two 32-bit cycle counters: min count holds the smallest cycle count and run count holds the cycle count for the currently tested distance. After each distance has completed its test, if run count < min count, min count is set to run count and a 3-bit best dist register is updated. After all of the distances have been tested, best dist indexes a table of 2-bit confidence counters and the count for that distance is incremented. The DS repeats this process until any distance's confidence reaches a threshold, which is 2 in our implementation.
Using a threshold less than 2 decreases the performance of some applications due to overly aggressive decisions. Using a threshold above 2 proved not to be useful and increases the duration of the testing phase, which also decreases performance. Once the decision is made, the chosen distance is used for the acting period and the testing-period cycle counters are reset to 0. A sketch of this selection logic is given below.
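The following sketch captures the testing-period bookkeeping described above. The tested distances, register widths, and confidence threshold of 2 follow the text; the surrounding control flow and the reset of the confidence counter are our assumptions:

#include <cstdint>

// Hedged sketch of the Distance Selector's testing-period logic.
struct DistanceSelector {
    static constexpr int kDistances[4] = {2, 4, 8, 16};
    uint32_t minCount = 0xFFFFFFFFu;   // "min count" register
    int      bestDist = 0;             // 3-bit "best dist" register (index into kDistances)
    uint8_t  confidence[4] = {};       // 2-bit confidence counters

    // Called after a distance finishes its 64-iteration test (last 32 iterations measured).
    // Returns the chosen distance when a confidence counter reaches 2, otherwise -1.
    int distanceTested(int distIdx, uint32_t runCount) {
        if (runCount < minCount) { minCount = runCount; bestDist = distIdx; }
        if (distIdx == 3) {                       // all four distances tested this round
            minCount = 0xFFFFFFFFu;
            if (++confidence[bestDist] >= 2) {
                confidence[bestDist] = 0;         // assumption: reset before the next testing period
                return kDistances[bestDist];      // used during the acting period (50x longer)
            }
        }
        return -1;                                // keep testing
    }
};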
In multi-core architectures, distance selection is performed separately on each core. We observe that the best distances vary across cores running a multi-threaded application due to the sharing of the last-level cache and memory bandwidth.

Extending ATP for Prefetching Linked Data Structures
ATP is very successful in prefetching for indirect access structures, as we show in the results section. However, for one of our workloads, HJ8, a significant performance opportunity was lost because indirect accesses were followed by linked-list traversals, and ATP was not able to capture this behavior. In HJ8, each element of the destination array is a linked-list data structure. We extended ATP to support linked lists. We add three additional fields to the AT: a node bit, node offset, and number of nodes. An additional instruction called atnod is also added to inform the hardware of this behavior and set the fields in the AT. This instruction always follows an ATAR instruction and has two operands: an offset that represents the next field in a linked node and an immediate value that represents the number of nodes. For example, for an A[B[i]]->next->next structure, where each element of array A is the head node of a linked list with 3 nodes (a head node and two additional nodes), the generated ATI sequence is given in Table 3.7.
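Table 3.7 is not reproduced here; purely as an illustration of the shape of such a sequence for the A[B[i]]->next->next example, a sketch follows (mnemonics, operand order, and the concrete next-field offset are our assumptions):

// Access pattern with an indirect array access followed by a fixed-length list walk (illustrative).
struct Node { long key; long payload; Node *next; };   // 'next' at byte offset 16 on a typical 64-bit layout

long third_node_key(Node **A, const int *B, long i) {
    return A[B[i]]->next->next->key;                    // each A element heads a 3-node list
}
// Illustrative ATI sequence with linked-list support (encoding assumed):
//   AT CL
//   ATAR B, ...        ; index (trigger) array
//   ATAR A, ...        ; array of head-node pointers
//   atnod 16, 3        ; next-pointer offset and number of nodes per list
//   AT RL B -> A       ; B[i] is used as the index into A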
We observe a significant performance improvement in HJ8 with an ATP that supports linked-list traversals. Without this support, ATP achieves a speedup of 1.44 for this benchmark by successfully prefetching the indirect accesses. With the added linked-list support, ATP boosts the speedup to 3.32. By contrast, software prefetching (SWPF) that prefetches for both the indirect accesses and the linked list in HJ8 achieves a speedup of only 1.65, which is much lower than ATP's. It is important to note, however, that our implementation is limited to cases where the number of nodes is known. HJ8 has 3 nodes in all of its linked lists, so the prefetch depth is constant (e.g., the depth is 4 in the example in Table 3.7) and ATP does not need to predict the depth (the distance, however, is dynamically evaluated as before). When the number of nodes is not known, ATP must predict the depth of the structure, which complicates the hardware mechanism. This scenario is not evaluated and is left as future work, since our focus in this chapter is indirect access structures.

Experimental Setup
We now discuss details of the simulation infrastructure, the workloads and the configurations that we used for our evaluation.

Simulation Environment
We implemented the ATP on the gem5 simulator [5] using System Emulation mode and generated the results using the x86 out-of-order CPU model. Table 3.8 shows the configuration of each core while Table 3.9 shows the ATP configuration.
We inserted ATI instructions at the beginning of the loop and implemented ATP to prefetch into the L1 cache in order to provide a direct comparison with IMP [4], which prefetches into the L1 cache. Each L1 is equipped with an 8-entry prefetch request queue (PRQ) in our evaluation; in all methods tested, computed prefetch requests are inserted into this queue. We faithfully implemented IMP (attached to each L1 cache) on our baseline architecture. As in [4], our IMP implementation used a 16-entry Prefetch Table along with the structures needed to learn shift values. The total hardware budget for our IMP implementation was 8032 bits (1004 bytes) per core, almost four times the size of ATP's (see Table 3.3). To evaluate the performance of software prefetching, we inserted software prefetch instructions inside the loops containing the indirect memory accesses. For both IMP and software prefetching, we measured the speedup for various prefetch distances but only report the results for the best performing distance.
For each benchmark, we fast-forward to the beginning of the loop containing the indirect memory accesses and then simulate 100M instructions; for multi-core simulations, each core simulates at least 100M instructions.
We use the number of cycles per loop iteration as the performance metric, since it is not distorted by the additional instructions introduced by software prefetching. As such, it provides an apples-to-apples comparison between hardware and software prefetching.

Benchmarks
We used seven benchmarks to evaluate the performance of ATP. Each benchmark contains indirect memory accesses inside their performance-critical loop.
Integer Sort (IS) and Conjugate Gradient (CG) are from the NAS Parallel Benchmarks suite [6]. IS represents computational fluid dynamics programs and uses a bucket-sort algorithm to sort integer values, while CG represents unstructured grid computations and uses eigenvalue estimation on sparse matrices. IS and CG have simple A[B[i]] access behavior.
Both PageRank (PR) and Triangle Counting (TC) are from the CRONO benchmark suite [7]. PR is a graph algorithm that ranks a website based on the rank of the websites that link to it [8], while TC counts the number of triangles in a graph and is used by graph algorithms such as clustering coefficients [9]. Histogram (Histo) calculates the distribution of numerical data and is from the Parboil benchmark suite [12].

Results
This section presents the performance of ATP, software prefetching (SWPF), and IMP, a pure hardware prefetching mechanism. We measure the performance of ATP, SWPF, and IMP for single-core and multi-core architectures. It is important to note that the results are biased in favor of SWPF, because we report the best speedup achieved by SWPF after carefully inserting prefetches and performing many profiling runs to obtain the best performing prefetch distances. For IS, ATP and SWPF outperform IMP, while for CG, ATP and IMP outperform SWPF. As described in Section 3.2, the overhead due to SWPF is extremely high for both IS and CG. This overhead has little effect in IS because SWPF hides it by significantly reducing the latencies of the indirect memory accesses.

Single-Core Performance of ATP
CG is one of the benchmarks most sensitive to the instruction overhead of software prefetching. Using software prefetching for CG decreases its performance by 33%, while the hardware prefetching mechanisms achieve better performance (speedups of 1.40 and 1.60 for ATP and IMP, respectively). For CG, ATP has a lower speedup than IMP. Our evaluations show that, for the same fixed distance value, ATP and IMP achieve similar results for CG; however, ATP's distance selector does not always pick the best performing distance for CG, which in turn negatively impacts the overall performance. For g500, the benefit from prefetching increases on multi-core architectures.

Multi-Core Performance of ATP
SWPF slightly outperforms ATP for g500 because ATP loses prefetch opportunities due to dropped prefetches (as we discuss in Section 3.5.6). Generally, the speedups due to prefetching on 4- and 8-core architectures are lower than on a single-core architecture due to increased resource utilization. For example, for IS, because the main loop is not long enough to hide memory latency, the speedup decreases due to an increased number of memory accesses and the concomitant increase in their latency. Therefore, while resource contention has some effect on the efficacy of each prefetching method, ATP and SWPF still provide significant speedup.

Efficacy of Adaptive Distance on ATP Speedup
As described above, the Distance Selection Unit allows ATP to dynamically adjust the prefetch distance for different applications and configurations. Figure 3.11 compares the speedup of ATP when using an adaptive prefetch distance versus ATP with various fixed distances (2, 4, 8, 16, and 32). The results in Figure 3.11 show that dynamically adjusting the prefetch distance yields an average speedup of 2.17, while the highest performing fixed distance (distance = 8) yields an average speedup of 1.88. Therefore, even though periodically testing each distance for a certain number of iterations to choose the best distance for the next period may slightly degrade the speedup for some benchmarks, on average, dynamically adjusting the prefetch distance yields a higher speedup.
IS and HJ2 benefit from longer prefetch distances due to the small number of instructions in their main loops. As such, in order to be timely, prefetches must be issued further ahead.
By contrast, PR and TC achieve higher performance with shorter prefetch distances, as the same figure shows.

Prefetch Coverage and Accuracy
A prefetcher needs to be accurate, or it will prefetch memory blocks that are never used and thus pollute the cache. If a prefetcher is not timely, it will either not fully hide the memory latency of the cache miss or, even worse, the prefetched cache line will be evicted before it is used. SWPF computes exact prefetch addresses and guarantees that the prefetched cache line will be accessed. The prefetch accuracies of ATP and IMP are lower (99% and 80%, respectively). ATP is more accurate than IMP because the software mechanism specifies and limits the prefetches, thus reducing the number of useless prefetches.
We chose the best-fixed distances for SWPF and IMP in our evaluations.
The average timeliness for SWPF and IMP is 70% and 74%, respectively. By contrast, ATP dynamically adjusts the distance, and its overall timeliness is significantly higher (88%). Although SWPF has the best coverage at 80%, ATP also has high coverage, at 73%, since it calculates the prefetch addresses based on software hints. IMP fails to cover most of the potential prefetches, as discussed in Section 3.2 (because it covers a relatively small number of indirect access patterns), and therefore it has much lower coverage (19%).

Effects of Number of MSHRs, L1 Cache Size, and L1 Cache Access Latency
In this section, we evaluate the performance of ATP, SWPF, and IMP as we vary the number of MSHRs, the L1 cache size, and the L1 cache access latency. Increasing the number of MSHRs improves the performance of the no-prefetching baseline by 13% and 16%, respectively (results not shown). Prefetching benefits from the higher number of MSHRs more significantly than the no-prefetching baseline, as shown in Figure 3.12.

The size of the L1 cache has a very limited effect on the speedups of either the no-prefetching baseline or the prefetching methods. Both single-core and multi-core results show that prefetching with a smaller L1 cache yields much better performance than a larger L1 cache without prefetching. For all cache sizes that we evaluated, ATP outperforms SWPF and IMP.

We also measure dropped prefetches relative to the total number of prefetches that were expected to be calculated (this does not include the prefetches for index arrays). We observe that most of the dropped prefetches in is, cg, g500, and histo are due to late prefetching of index values. In g500, some of the prefetches are dropped because their dependent source data is not yet in the cache. Prefetch drops could be eliminated completely if prefetch calculation were triggered when the source data is placed in the cache; however, this requires significant changes to ATP and is left as future work.

Related Work
Data prefetching is a well-known technique that helps alleviate the memory wall problem [13] by increasing Memory-Level Parallelism (MLP) [14,15]. Many general-purpose microprocessors rely on data prefetching to improve performance for memory-intensive workloads. Most of the early prefetchers [16,17,18] were sequential prefetchers, which prefetch sequential memory blocks, relying on the fact that many applications exhibit spatial locality. Although sequential prefetchers work effectively in many cases, applications with non-sequential data access patterns do not benefit from sequential prefetching. That motivated research on more complex prefetchers that try to capture the non-sequential nature of these applications [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]. Table 3.1 summarizes a variety of data access patterns and the software and hardware prefetching methods targeting them. Prefetching techniques targeting pointer-based applications have been studied in [19,20,21,33,35,37]. Indirect array references cannot be captured with those methods, however, since the desired addresses are computed rather than stored in memory as pointers. Guided Region Prefetching (GRP) [37] is a hardware prefetching scheme that uses compiler hints encoded in load instructions to regulate an aggressive hardware prefetching engine.
GRP targets a broad range of behaviors, from arrays and pointers to basic indirect accesses. Software prefetching [42,43,44,41,40,45] provides a way for programmers to insert prefetch instructions into a program targeting various simple and complex patterns. Manual insertion is flexible but requires significant programmer effort. Automatic insertion requires the compiler to recognize the access pattern.
Ainsworth [2] developed an algorithm that automates the insertion of software prefetches for indirect memory accesses. Although this approach eliminates the need for programmer effort, it cannot guarantee that the instructions are inserted in a way that is optimal for the specific architecture. Furthermore, significant instruction overhead may offset its benefits. On the other hand, software prefetching can target more complex patterns than its hardware counterparts, especially if the hardware budget is limited. In contrast to prior work, we propose a hybrid software-hardware approach that uses the strengths of each for prefetching indirect memory accesses.
Finally, Lee et al. [45] studied the interaction between software and hardware prefetching and found that inserting software prefetch instructions in the presence of hardware prefetchers may hurt overall prefetching performance due to incorrect training of the hardware prefetchers. ATP does not have this problem because prefetching is only initiated by hardware, not by software prefetch instructions; coarse-grain metadata instructions are used to guide the hardware prefetcher.

Conclusion and Future Work
We propose and implement the Array Tracking Prefetcher (ATP) to obtain the benefits of both software and hardware prefetching for indirect memory accesses. ATP inserts prefetch metadata instructions outside the loop and uses them to pass information to the hardware mechanism. The hardware mechanism uses this information to determine which indirect memory accesses to prefetch and when to do so.
To increase prefetch timeliness (and performance), ATP dynamically adjusts the prefetch distance. By relying on software hints, ATP avoids an expensive hardware budget.
Our results show that ATP yields an average speedup of 2.17x, 1.85x, and 1.41x for single-core, 4-core, and 8-core architectures, respectively. ATP also outperforms the state-of-the-art software-based (SWPF) and hardware-based (IMP) prefetching methods.
In future work, we plan to improve ATP's capability by targeting more diverse data structures. In this work, we showed that ATP can be extended to target linked-list traversals; however, our extension was based on a specific case where the number of nodes is known. In the future, we plan to improve our ATP software/hardware interface to enable prefetching for more complex data structure traversals. In addition, ATP currently misses a significant opportunity by dropping prefetches when the source data for address calculation is not present in the cache at the time of calculation. In future work, we plan to modify ATP so that prefetch address calculations can be triggered by cache fills of source data due to a prefetched index array.
List of References

Abstract
Pointer-intensive data structures are commonly used in database applications.
Traversals on these data structures mostly cause memory stalls due to their dependent pointer references. Improving memory-level parallelism by accessing the memory simultaneously for separate lookups is beneficial for such data structures.
Existing techniques focus on improving performance by creating overlapping memory accesses for distinct lookups. Although data prefetching is very beneficial for such structures, it alone is not enough to maximize performance on modern CPUs.
In this work, we propose Node Tracker (NT), a software-supported hardware prefetching mechanism that is tightly integrated with the CPU. Additionally, NT can use the knowledge extracted from the prefetched data to inform the out-of-order CPU about future matching nodes and conditional branch targets. In our evaluations, NT achieved up to a 19x speedup over the no-prefetching baseline.

Introduction
Pointer-intensive data structures, such as linked lists and trees, are used by many in-memory database applications. These applications have unpredictable access patterns due to frequent pointer chasing, leaving the CPU idle during long memory latencies. Although modern CPUs exploit memory-level parallelism (MLP), the benefit of MLP comes from the number of independent in-flight accesses. In pointer-chasing lookups, accessing the next hop requires data from previous pointers, which prevents the CPU from servicing these accesses in parallel. Moreover, many database operations consist of multiple lookups that could be serviced in parallel, but the CPU has a fixed instruction window size, which limits the number of simultaneous lookups.
Data prefetching is intended to hide memory access latencies in single-core and multi-core systems, effectively reducing the gap between memory access time and processor speed. Previous software-based solutions [1,2] exploit inter-lookup parallelism to overlap memory access latencies. However, they require re-designing the algorithm and still have limited benefits due to dependences and hardware limitations. Helper threads [3] can use a separate thread to issue prefetches; however, they eventually tend to stall and struggle to stay ahead of the main thread due to the load-miss chains created by pointer-intensive applications.
Ainsworth [4] proposed a system with programmable cores to issue prefetches for future accesses. Although it provides an ideal solution for prefetching, the pre-executions it performs can only help reduce demand cache misses.
In this work, we propose Node Tracker (NT), a software-assisted hardware prefetching mechanism that is tightly integrated with the CPU pipeline. NT relies on the programmer/compiler to configure the hardware. NT focuses on pre-executing multiple future lookup operations on pointer-intensive data structures asynchronously using simple in-order programmable cores. In addition to prefetching, NT assists CPU execution by eliminating unnecessary node visits and providing future branch targets to the CPU pipeline. NT is designed to be integrated with another prefetcher, ATP [5], which handles prefetching for sequential and indirect accesses. ATP also provides the information NT needs to start pre-execution of future lookups.

Related Work and Motivation
Code Snippet 4.1 illustrates an example of a simple hash-table probe (a sketch of such a probe loop appears below). In this kind of pointer-chasing behavior, a lookup on a linked list creates dependent memory accesses, as the nodes of the linked list must be accessed sequentially. Other pointer-intensive workloads, such as binary search tree (BST) lookups, have similar dependent memory accesses, where the performance of the application depends on the number of nodes that must be traversed for a single lookup. Several proposed prefetching techniques are able to prefetch pointer-chasing accesses across distinct lookups. Kocberber [2] presented a software-based method called AMAC that exploits parallel lookups for pointer-intensive database applications. AMAC proposes a way to implement the algorithm so that multiple key lookups are served asynchronously, hiding long memory access latencies by improving MLP. Although this method aims to parallelize node accesses across different lookups, its prefetches need to be timely to achieve the best performance. Late prefetches may cause stalls, while early prefetches may leave a node's pointer in the cache yet fail to issue the next pointer's request on time.
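A minimal reconstruction of the kind of probe loop that Code Snippet 4.1 describes is sketched below (field names, the hash function, and the surrounding code are our assumptions; only the break and n = n->next lines survive in the extracted snippet):

// Hedged reconstruction of a chained-bucket hash-table probe (in the spirit of Code Snippet 4.1).
struct Node { long key; long payload; Node *next; };

static long hashKey(long key, long nbuckets) { return key % nbuckets; }  // placeholder hash

long probe(Node **buckets, long nbuckets, const long *keys, long nkeys) {
    long matches = 0;
    for (long i = 0; i < nkeys; ++i) {                 // each outer iteration is one lookup
        long key = keys[i];
        Node *n = buckets[hashKey(key, nbuckets)];     // head node of the bucket
        while (n) {                                    // dependent pointer chasing
            if (key == n->key) { ++matches; break; }
            n = n->next;
        }
    }
    return matches;
}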
Amir [6] proposed a robust technique named Jump Pointer Prefetching (JPP) that prefetches jump pointers to tolerate linked-list access latency. JPP explicitly stores jump pointers to nodes located a number of hops away. It can be beneficial for long pointer chains, which are not common in the database applications we target.
Helper threads [3] can run a different context on a separate thread to issue load accesses ahead of the main thread without any extra hardware. However, the additional thread, which runs on a high-performance core, can consume a significant amount of energy. They are also unable to calculate prefetches ahead of the demand execution when load-miss chains are common.
Ainsworth [7] proposed a design with programmable cores to compute prefetch addresses. These cores can accurately prefetch irregular access patterns without modifying the original code. However, they do not provide any support for CPU execution other than hiding memory access latencies through prefetching.
We designed a prefetching/pre-execution technique, Node Tracker (NT), targeting pointer-intensive data structures with multiple lookup operations, which are common in in-memory database applications. Beyond prior work, NT uses the knowledge gained from pre-executed lookups to provide further support to the out-of-order CPU. Node Tracker is a programmable unit that is configured using special instructions inserted by the programmer/compiler before the outer loop, as marked in Code Snippet 4.1.

The lookup operations are executed by a simple state machine using the pre-configured settings of NT. Figure 4.2 demonstrates a simple execution flow of a TB entry. The NTP switches to the next entry in two cases: either when a cache miss occurs (states 1 and 4) or when the task is completed (state 5). When the NTP switches to a new entry and activates it, it continues executing from that entry's previous state; to remember the latest state, each TB entry has a state field. If no task is assigned (either the previous task of the entry has completed or the entry is activated for the first time), the entry fetches a task from the head of the WQ. Each TB entry keeps the lookup ID, node pointer, and key value carried from the WQ.

Node Tracker
To perform the comparison, the comparison operations and corresponding actions must be passed to the hardware. Comparison operations may be any of the comparison operators, such as "==" and "<", as well as "no comparison", which means that the corresponding action is performed without any comparison. There are two types of actions in NT: "EXIT", which finishes the lookup if the comparison is true, and "NEXT", which reads and sets the next node pointer and continues. These comparison operations and corresponding actions are stored in a small table called the Comparison Table (CT). CT entries also have a field that holds an offset value used with the "NEXT" action; it refers to the offset from the node address at which the pointer to the next node is read. Each comparison execution starts with the first entry, and for each false comparison it moves to the next entry. In Code Snippet 4.1, there is one comparison operation ("key == n->key"), and its corresponding action is "EXIT", since the matching node has been found if the comparison is true. If the first comparison is false, the lookup needs to move to the next node; in this case, the second entry is inserted as "no comparison" with the action "NEXT", and the offset is set accordingly. A sketch of these two CT entries is given below.
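For the probe loop of Code Snippet 4.1, the CT would then hold two entries; one possible representation (field layout, enum encodings, and the concrete next-pointer offset are our assumptions) is:

// Hedged sketch of the two Comparison Table entries for Code Snippet 4.1.
enum Cmp { CMP_EQ, CMP_LT, CMP_NONE };                 // "==", "<", "no comparison"
enum Act { ACT_EXIT, ACT_NEXT };                       // finish the lookup / follow the next pointer

struct CTEntry { Cmp cmp; Act action; int offset; };   // offset is only used with ACT_NEXT

const CTEntry ct[2] = {
    { CMP_EQ,   ACT_EXIT, 0  },   // entry 0: if (key == n->key), the lookup is done
    { CMP_NONE, ACT_NEXT, 16 },   // entry 1: otherwise read n->next (16 = illustrative offset of 'next')
};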

Result Buffer and Node-Update
As in the example shown in Code Snippet 4.1, a lookup operation consists of several node visits that compare keys until the matching node is found. Since NT already pre-executes these node visits, the CPU pipeline does not need to revisit the non-matching nodes in most cases. Thus, a direct-mapped buffer called the Result Buffer (RB) stores the matching node addresses for the completed lookups. The CPU pipeline can read the node addresses from the RB and, instead of starting the inner loop from the head node, start from the matching node (the inner loop then executes for a single iteration). To achieve this, using the special instructions, we inform the hardware of the pc (hold pc) of the instruction that writes the pointer address into a register.
The CPU uses a lookup ID to read matching node addresses from the RB. Lookup IDs are assigned to instructions by the lookup counter in the fetch stage. The lookup counter keeps the lookup ID of the last fetched instruction; its value is incremented each time a new lookup (outer-loop iteration) begins, so every instruction within the same outer-loop iteration has the same unique lookup ID. If any instructions are discarded due to a branch misprediction, the counter is restored to the lookup ID of the latest valid instruction. For future lookups, IDs are generated by ATP and inserted into the WQ along with the key and the pointer to the head node; these IDs are later used to index the RB. Whenever a new task is assigned to a TB entry, a result entry in the RB is allocated for the corresponding lookup. Each RB entry keeps a ready flag, which is initially false; once the lookup is completed and the matching node address is written to the entry's pointer field, the flag is set to true, meaning the entry is ready to be read by the CPU.
The CPU does not let the instructions with hold pc (hold instructions) issue directly. First, it accesses the RB using the lookup ID of the instruction. If the corresponding entry exists and the ready flag is true, it reads the matching node address and replaces the result of the instruction with that address. If the RB entry exists but its value is not ready (the lookup is still being processed in the TB), the hold instruction is not issued until NT finishes the lookup and writes the matching node address into the RB entry. In the later sections of this chapter, we refer to this process as node-update (NU).

Branch Buffer and Branch-Fix
With NT prefetching, we observe that we can eliminate almost all of the demand cache misses in the applications we evaluated. However, even though node-update significantly reduces the number of branches per lookup, we still observe a bottleneck due to branch mispredictions. Since the number of branches per lookup is very low with node-update (due to the single iteration of the inner loop), the actual branch outcomes can be extracted from the pre-executed lookups and used to set the branch targets in the CPU pipeline.
To achieve this, branch patterns for the different scenarios are generated as bit vectors and passed to the hardware using special instructions. Each branch pattern is stored in an entry of the Branch Pattern Table (BPT), and each entry is mapped to a unique scenario (node matched, node not matched, etc.). In Code Snippet 4.1, there are two branches per inner-loop iteration ("while (n)" and "if (key == n->key)"). If the node n is a matching node, both branches are expected to be taken (represented as "11"). Whenever NT completes a lookup and writes its output to the RB, it also writes the corresponding branch pattern to the Branch Buffer (BB). In our applications, the number of unique scenarios we observe for a single inner-loop iteration does not exceed 4, which is the maximum number of BPT entries we need.
In the later sections of this chapter, we refer to this method as branch-fix (BFX).
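For the two branches of Code Snippet 4.1, each scenario therefore reduces to a 2-bit vector. The sketch below shows the matching-node pattern stated above; the bit ordering and the not-matched pattern are our assumptions:

// Hedged sketch of Branch Pattern Table entries for Code Snippet 4.1 (two branches per iteration).
// Bit 1: "while (n)" taken, bit 0: "if (key == n->key)" taken (bit order assumed).
enum Scenario { NODE_MATCHED = 0, NODE_NOT_MATCHED = 1 };

const unsigned char bpt[2] = {
    0b11,   // node matched: both branches taken, as stated in the text
    0b10,   // node not matched: loop branch taken, comparison not taken (assumed)
};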

Methodology
We implemented NT on the gem5 simulator and evaluated it using the x86 out-of-order CPU model in System Emulation mode. Table 4.1 shows our configurations for the simulator, NT, and ATP.
For the hash-join and binary-search-tree workloads (including the baseline algorithms and the AMAC implementations), we used the implementation of Kocberber et al. [2]. The performance of NT is evaluated using the probe hash-table algorithm with 4 nodes per bucket (PHT-B4), 8 nodes per bucket (PHT-B8), and a random node distribution (PHT-RND). Additionally, a binary search tree (BST) algorithm is used for the experiments.

Results
We compared our results with the no-prefetching baseline, ATP, and AMAC.
First, we discuss how NT performs with prefetching only, and then we discuss the effect of node-update and branch-fix. We use the number of million keys per second (MKPS) as the metric to measure the throughput of the different methods. PHT-B8 benefits from prefetching more than PHT-B4 and has a higher speedup with NT-PF.
Even though BST benefits considerably from NT-PF, it has a lower speedup than PHT-B4 and PHT-B8. BST has very long traversals compared to PHT-B4 and PHT-B8; due to its long dependency chains, the CPU pipeline mostly processes a single lookup at a time (see Figure 4.3). So, even though NT-PF improves its performance by eliminating cache misses, its performance is limited by the instruction window.
NT-PF achieves speedups of 4.3x, 6.6x, and 3.1x for PHT-B4, PHT-B8, and BST, respectively, which is better than ATP and AMAC on all benchmarks. ATP can only benefit from prefetching sequential and indirect array accesses; it has low speedups (on PHT-B4 and PHT-B8) or no speedup (on BST) over the baseline, which means that most of the speedup of NT-PF comes from prefetching the nodes.
AMAC has speedups of 2.3x, 3.6x, and 2.4x over the no-prefetching baseline on PHT-B4, PHT-B8, and BST, respectively, but its performance is also very limited by the dependent loads causing stalls; NT-PF achieves almost two times higher throughput. To study the effect of branch mispredictions, we evaluated NT-PF using a perfect branch predictor (NT-PERF) to see the performance benefit of eliminating cache misses and branch mispredictions at the same time. NT-PERF improves throughput by 3.3x and 2.6x over NT-PF for PHT-B4 and PHT-B8, respectively.
By eliminating all of the branch mispredictions, NT-PERF also increases the number of lookups in flight in the instruction window, as seen in Figure 4.3. BST uses conditional move instructions, which perform better than branches for its tree structure; this leads to long dependency chains instead of high branch misprediction rates, but it also prevents BST from benefiting from NT-PERF. AMAC, by contrast, does not use conditional move instructions for BST, and its misprediction rate is significantly higher than the baseline (refer to Figure 4.6). Using NT-BFX eliminates most of the branch mispredictions and improves the overall performance even beyond NT-PERF, since it also has the advantage of node-update. NT-BFX improves the throughput over NT-NU by 2.5x and 1.3x for PHT-B4 and PHT-B8, respectively. Due to the longer traversals of PHT-B8, some of the NTUs are too late to save their branch outcomes to the BB; Figure 4.5 shows that NT-BFX has an MPKI of 10 for PHT-B8, which means that it is not able to cover all of the branches. We could increase the prefetch distance to compute branch outcomes further ahead of the execution, but this requires larger buffers and leads to more cache misses due to early prefetches, which might decrease the overall throughput. So, we kept the distance at 32 in our evaluations to achieve the best overall performance across all benchmarks. On BST, we do not see any benefit from branch-fix, since it already has almost zero MPKI with NT-NU (refer to Figure 4.6).

Impact of Node Distribution in Hash-Join Probe
We also examined the effect of a random distribution of nodes by simulating another benchmark, PHT-RND, which has a random number of nodes per bucket (distributed using a Zipfian distribution with factor 0.5) and whose total number of nodes is equal to that of PHT-B4.

Impact of Number of MSHRs
We simulated all methods with different numbers of MSHRs, as seen in Figure 4.8. We observe that NT-BFX benefits from the increased number of MSHRs on all benchmarks. NT-NU also benefits from a higher number of MSHRs on PHT-B8 and BST, but more MSHRs do not impact the performance of NT-NU on PHT-B4, since its performance is limited by branch mispredictions. Based on these experiments, we decided to use 24 MSHRs for single- and 2-core architectures and 16 MSHRs for architectures with 4 or more cores.

Multicore Scalability
We also studied how NT and the other methods scale as we increase the number of cores. Figure 4.9 shows how throughput scales with the number of cores for each method and benchmark. Up to 12 cores, NT-NU and NT-BFX scale almost perfectly on every benchmark. Beyond 12 cores, the scaling of NT-NU and NT-BFX on PHT-B8 and BST starts to slow down, but throughput still increases well up to 24 cores. On PHT-B4, NT-NU scales very well up to 24 cores, while NT-BFX starts to lose its advantage after 16 cores but still performs better than NT-NU.

Additional Discussions
We proposed NT as a prefetcher unit that pre-executes future lookup operations on pointer-chasing traversals, but a significant part of NT's benefit comes from its tight integration with the CPU. Another approach for these applications would be to design an accelerator that performs the lookups independently of the CPU, as proposed by [8]. Although such an accelerator could do the job as efficiently as NT, it would have to be able to perform everything the CPU does. NT, instead, still lets the CPU execute the application, so it does not need to support the complex operations an accelerator would.

Conclusion
We proposed Node Tracker, a prefetcher unit that pre-executes future lookup operations on pointer-chasing traversals; a significant part of NT's benefit comes from its tight integration with the CPU.
NT with prefetching only achieves up to a 6.6x speedup. NT that also informs the CPU of the matching node pointer achieves a maximum speedup of 18x, and when NT additionally provides branch outcomes using the knowledge it receives from the software and the prefetched data, it achieves up to a 19x speedup over the no-prefetching baseline in our evaluations.
List of References

CHAPTER 5

Conclusion
Hardware prefetching has been a subject of academic research and industrial development for over 40 years. Nevertheless, because scaling trends continue to widen the gap between processor performance and memory access latency, the importance of hardware prefetching and the need to hide memory system latency have only grown; further innovation remains critical.
Sequential prefetchers are useful for many workloads, but it is critical for them to issue prefetches in a timely manner. Late prefetches cannot sufficiently hide the latency.
In this case, when the prefetched data is accessed by the demand execution, the data has not arrived at the cache in time, so a cache miss still occurs. If the prefetch is issued too early, the prefetched data might be replaced by other data before it is accessed by the demand execution; the demand execution then incurs another cache miss for the previously prefetched data, since it is no longer in the cache. Also, some applications do not benefit from sequential prefetchers, and such prefetchers can even decrease performance by polluting caches with unnecessary data and wasting bandwidth. In Chapter 2 we proposed the Sequential Prefetcher with Adaptive Distance (SPAD), which monitors whether prefetches are late or early and adjusts the distance dynamically. It also monitors whether prefetching is useful and turns it off when it is not beneficial. SPAD achieves a 20% speedup on average on the benchmarks we evaluated and outperforms the most recent sequential prefetching methods.
Although sequential prefetching is beneficial for a wide variety of workloads, many other applications have irregular memory access patterns that cannot be captured by sequential prefetchers, and their performance is critical in certain fields. An important portion of these applications involves indirect memory accesses, which are widely used in data structures such as graphs and sparse matrices.
Software prefetching is very useful for indirect memory accesses, but the insertion of software prefetch instructions increases the number of instructions to execute, which can create overhead. Also, software prefetching requires programmer knowledge and effort to tune for the best performance on the target microarchitecture. Hardware mechanisms have been developed to capture indirect memory accesses, but without knowledge from the software they are unable to detect complex indirect access behaviors. In Chapter 3 we proposed the Array Tracking Prefetcher (ATP), a hybrid mechanism that uses software-provided hints to drive hardware prefetching of indirect memory accesses.

Pointer-chasing memory accesses are also hard to predict with pure hardware mechanisms, and dependent access chains make it very hard for both hardware and software mechanisms to stay ahead of the demand execution. Fortunately, many database applications involve multiple lookup operations on linked data structures. This brings the opportunity to service the cache misses of different lookup operations in parallel. Software methods must modify the algorithm to benefit from inter-lookup parallelism, and doing so increases the number of instructions per lookup significantly, which limits their potential. In Chapter 4, we proposed a software-supported hardware mechanism, Node Tracker (NT), which is designed to pre-execute future lookup operations. It prefetches the data of future lookups, but it also helps the demand execution in other ways to boost performance even further. Since NT pre-executes each lookup and finds the matching node of the linked data structure, it informs the demand execution of the matching node pointer, so the demand execution does not need to iterate over all the nodes again to find it. Branch mispredictions are also an important factor limiting the performance of the workloads we evaluated; NT records the expected branch outcomes in a buffer and provides them to the branch predictor later. Using all of these features, NT can be considered a helper unit for the CPU: it not only improves performance by reducing cache misses as a prefetcher, it also helps CPU execution by providing the knowledge it collects while pre-executing future lookup operations.
NT achieves up to a 19x speedup by using all of its features to boost CPU execution performance.

Future Work
In future work, we will develop hardware-aware compilers. Although we have introduced special instructions to inform the hardware mechanisms, compilers still optimize programs without any knowledge of the hardware we designed. As a result, the hardware mechanism must support a wide variety of assembly code structures, which brings extra hardware overhead and limitations. If the compiler knows how the hardware is designed and which code structures benefit it most, it can apply its optimizations carefully and even restructure the code to help our mechanisms work more efficiently.