BOOTSTRAP AGGREGATING BRANCH PREDICTORS

After more than two decades of extensive research on branch prediction, branch mispredictions remain an important performance and power bottleneck for today's aggressive processors. This research has produced very sophisticated and accurate branch predictor designs, with the TAGE predictor being the current state of the art. In this work, instead of directly improving an individual predictor's accuracy, I focus on an orthogonal statistical method called bootstrap aggregating, or bagging. Bagging improves overall accuracy by using an ensemble of predictors trained with slightly different data sets. Each predictor (the predictors may be identical or different) is trained using a training set resampled with replacement (bootstrapping). The final prediction is then obtained by weighting or majority voting (aggregating). This work shows that applying bagging improves performance more than simply increasing the size of the predictor.

For conditional branches, the machine must wait for the resolution of the branch condition, and if the branch is taken, it must further wait until the target address is available. Branch instructions are executed by the branch functional unit. For a conditional branch, only after it exits the branch unit, when both the branch condition and the branch target address are known, can the fetch stage correctly fetch the next instruction.
This delay in processing conditional branches incurs a cycle penalty in fetching the next instruction, corresponding to the conditional branch's traversal of the decode, dispatch, and execute stages. The actual lost-opportunity cost is not just the number of stalled cycles: the number of empty instruction slots is that cycle count multiplied by the width of the machine.
Maximizing the volume of the instruction flow path is equivalent to maximizing the sustained instruction fetch bandwidth. To do this, the number of stall cycles in the fetch stage must be minimized. For an n-wide machine, each stalled cycle is equivalent to fetching n no-op instructions. The primary aim of instruction flow techniques is to minimize the number of such fetch stall cycles and/or to use these cycles to do potentially useful work. The dominant approach to accomplishing this is branch prediction.
The TAGE predictor [1] is widely accepted as the current state of the art in branch prediction. In this work, instead of directly improving the predictor's accuracy, I focus on an orthogonal statistical method called bootstrap aggregating, or bagging.
Bagging improves overall accuracy by using an ensemble of predictors trained with slightly different data sets. Each predictor (the predictors may be identical or different) is trained using a training set resampled with replacement (bootstrapping). The final prediction is then obtained by weighting or majority voting (aggregating). This work shows that applying bagging improves performance more than simply increasing the size of the predictor.

BRANCH PREDICTOR
Experimental studies have shown that the behavior of branch instructions is highly predictable. A key approach to minimizing branch penalty and maximizing instruction flow throughput is to speculate on both the branch target address and the branch condition of a branch instruction. As a static branch instruction is repeatedly executed at run time, its dynamic behavior can be tracked. Based on its past behavior, its future behavior can be effectively predicted. The two fundamental components of branch prediction are branch target speculation and branch condition speculation. As with any speculative technique, there must be mechanisms to validate the prediction and to safely recover from any misprediction.
Branch target speculation involves the use of a branch target buffer (BTB) to store previous branch target addresses. The BTB is a small cache memory accessed during the instruction fetch stage using the instruction fetch address (PC). Each BTB entry contains two fields: the branch instruction address (BIA) and the branch target address (BTA). When a static branch instruction is executed for the first time, an entry in the BTB is allocated for it. Its instruction address is stored in the BIA field, and its target address is stored in the BTA field. Assuming the BTB is a fully associative cache, the BIA field is used for the associative access of the BTB. The BTB is accessed concurrently with the I-cache. When the current PC matches the BIA of an entry in the BTB, a hit results. This implies that the instruction currently being fetched from the I-cache has been executed before and is a branch instruction. As shown in Figure 1, when a BTB hit occurs, the BTA field of the hit entry is retrieved and used as the speculative target of the branch, which is predicted to be taken.

Figure 1: Branch Target Speculation Using a Branch Target Buffer
By accessing the BTB with the branch instruction address and retrieving the branch target address, all during the fetch stage, the speculative target address is ready to be used in the next machine cycle as the new instruction fetch address if the branch is predicted taken. If the branch is predicted taken and the prediction turns out to be correct, then the branch is effectively executed in the fetch stage, incurring no branch penalty. The nonspeculative execution of the branch instruction is still performed to validate the speculative execution: the branch instruction is still fetched from the I-cache and executed, and the resultant target address and branch condition are compared with the speculative versions. If they agree, a correct prediction was made; otherwise, a misprediction has occurred and recovery must be initiated. The result of nonspeculative execution is also used to update the contents, i.e., the BTA field, of the BTB.
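The allocate-on-first-execution and lookup-at-fetch behavior described above can be sketched as follows (a minimal Python model; the class and field names are illustrative, not from any real design):

```python
# Minimal sketch of a branch target buffer (BTB), modeled as a fully
# associative cache: a dict mapping branch instruction address (BIA)
# to branch target address (BTA).

class BTB:
    def __init__(self):
        self.entries = {}               # BIA -> BTA

    def lookup(self, pc):
        """Fetch-stage access: a hit returns the speculative target."""
        return self.entries.get(pc)     # None on a miss

    def update(self, pc, target):
        """Allocate or refresh an entry once the branch actually executes."""
        self.entries[pc] = target

btb = BTB()
btb.update(0x400, 0x480)                # branch at 0x400 executed, target 0x480
assert btb.lookup(0x400) == 0x480       # hit: predict taken to 0x480
assert btb.lookup(0x404) is None        # miss: not a known branch
```

On a hit, the returned BTA becomes the next fetch address; on a miss, fetch simply falls through.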

STATIC BRANCH PREDICTOR
There are a number of ways to do branch condition speculation. The simplest form is to design the fetch hardware to be biased for not taken, i.e., to always predict not taken. When a branch instruction is encountered, prior to its resolution, the fetch stage continues fetching down the fall-through path without stalling. This form of minimal branch prediction is easy to implement but not very effective; for example, many branches are loop-closing instructions, which are mostly taken except when exiting the loop. Another form of prediction employs software support and can require ISA changes. For example, an extra bit can be allocated in the branch instruction format and set by the compiler. This bit serves as a hint to the hardware to predict either taken or not taken depending on its value. The compiler can use the branch instruction type and profiling information to determine the most appropriate value for this bit. This allows each static branch instruction to carry its own specified prediction; however, the prediction is static in the sense that the same prediction is used for all dynamic executions of the branch. A more aggressive and dynamic form of prediction is based on the branch target address offset. It first determines the relative offset between the address of the branch instruction and the address of the target instruction. A positive offset triggers the hardware to predict not taken, whereas a negative offset, most likely indicating a loop-closing branch, triggers the hardware to predict taken. The most common branch condition speculation technique in contemporary superscalar machines, however, is based on the history of previous branch executions.
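The offset-based heuristic in the last paragraph (often called backward-taken/forward-not-taken) can be expressed compactly; the function name is my own:

```python
def predict_taken_btfn(branch_pc, target_pc):
    """Backward-taken/forward-not-taken heuristic: a negative offset
    (target before the branch) usually closes a loop, so predict taken;
    a positive offset (forward target) is predicted not taken."""
    return target_pc - branch_pc < 0

assert predict_taken_btfn(0x1000, 0x0FF0)       # backward: predict taken
assert not predict_taken_btfn(0x1000, 0x1040)   # forward: predict not taken
```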

DYNAMIC BRANCH PREDICTOR
History-based branch prediction makes a prediction of branch direction, whether taken (T) or not taken (N), based on previously observed branch directions.
The assumption is that historical information on the directions a static branch has taken in previous executions gives helpful hints on the direction it is likely to take in future executions. This history can be tracked by a small finite state machine (FSM); the predictor of Figure 3 uses four states, each encoding the outcomes of the two previous executions. An output value, either T or N, is associated with each of the four states, representing the prediction made when the predictor is in that state. When a branch is executed, the actual direction taken is used as an input to the FSM, and a state transition occurs to update the branch history used for the next prediction. The particular algorithm implemented in the predictor of Figure 3 is biased toward predicting branches to be taken; note that three of the four states predict taken. It anticipates either long runs of N's (in the NN state) or long runs of T's (in the TT state). As long as at least one of the two previous executions was a taken branch, it will predict the next execution to be taken. The prediction switches to not taken only after two consecutive N's.
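A minimal model of this taken-biased FSM, tracking only the two most recent outcomes, might look like this (names are illustrative):

```python
class TakenBiasedFSM:
    """Two-bit-history FSM as described in the text: predict not taken
    only after two consecutive not-taken outcomes."""
    def __init__(self):
        self.hist = ('T', 'T')          # start in the TT state

    def predict(self):
        return 'N' if self.hist == ('N', 'N') else 'T'

    def update(self, outcome):          # outcome is 'T' or 'N'
        self.hist = (self.hist[1], outcome)

p = TakenBiasedFSM()
for actual in ['T', 'N', 'T', 'N', 'N', 'N']:
    p.update(actual)
assert p.predict() == 'N'   # two consecutive N's: switch to not taken
```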
This represents one particular branch prediction algorithm; clearly there are many possible designs for such history-based predictors, and many have been evaluated by researchers.
To support history-based branch direction predictors, the BTB can be augmented to include a history field for each of its entries. The width of this field, in bits, is determined by the number of history bits being tracked. When a PC address hits in the BTB, the history bits are retrieved along with the speculative target address. These history bits are fed to the logic that implements the next-state and output functions of the branch predictor FSM. From the retrieved history bits, the output logic produces the 1-bit output that indicates the predicted direction. If the prediction is taken, the speculative target address is used as the new instruction fetch address in the next machine cycle. If the prediction turns out to be correct, then the branch instruction has effectively been executed in the fetch stage without incurring any penalty or stalled cycle.
A classic experimental study on branch prediction was done by Lee and Smith [2]. In this study, 26 programs from six different types of workloads for three different machines were used. Averaged across all the benchmarks, 67.6% of the branches were taken while 32.4% were not taken; branches tend to be taken rather than not taken by a ratio of about 2 to 1. With static branch prediction based on the op-code type, the prediction accuracy ranged from 55% to 80% for the six workloads. Another experimental study was done in 1992 at IBM by Ravi Nair using Systems Performance Evaluation Cooperative (SPEC) benchmarks [3]. This was a very comprehensive study of possible branch prediction algorithms. The goal of branch prediction is to overlap the execution of branches or to accomplish branch folding, i.e., to fold branches out of the critical latency path of instruction execution. This study performed an exhaustive search for optimal 2-bit predictors.
There are 2^20 possible FSMs for 2-bit predictors. Nair determined that many of these machines are uninteresting and pruned the entire design space down to 5248 machines.
Extensive simulations were performed to determine the optimal 2-bit predictor (the one achieving the best prediction accuracy) for each of the benchmarks. The list of SPEC benchmarks, their prediction accuracies, and the associated optimal predictors are shown in Figure 4. In Figure 4, states denoted with bold circles are states in which the branch is predicted taken; the nonbold circles represent states that predict not taken.
Similarly, the bold edges represent state transitions taken when the branch is actually taken; the nonbold edges represent transitions corresponding to the branch actually not being taken. The state denoted with an asterisk indicates the initial state. The prediction accuracies of the optimal predictors for these six benchmarks range from 87.1% to 97.2%.
Notice that the optimal predictors for doduc, gcc, and espresso are identical (disregarding the different initial state of the gcc predictor) and exhibit the behavior of a 2-bit up/down saturating counter. We can label the four states from left to right as 0, 1, 2, and 3, representing the four count values of a 2-bit counter. Whenever a branch is resolved taken, the count is incremented; otherwise it is decremented. The two lower-count states predict a branch to be not taken, while the two higher-count states predict a branch to be taken. Figure 4 also provides the prediction accuracies for the six benchmarks when this 2-bit saturating counter predictor is used for all of them. The accuracies for spice2g6, li, and eqntott decrease only minimally from their optimal values; this robustness is one reason the 2-bit saturating counter, originally invented by Jim Smith, has become a popular prediction algorithm in real and experimental designs.
The same study by Nair also investigated the effectiveness of counter-based predictors. With a 1-bit counter as the predictor, i.e., remembering the direction taken last time and predicting the same direction next time, prediction accuracies range from 82.5% to 96.2%. As shown in Figure 5, a 2-bit counter yields an accuracy range of 86.8% to 97.0%. If a 3-bit counter is used, the increase in accuracy is minimal; accuracies range from 88.3% to 97.0%. Based on this study, the 2-bit saturating counter appears to be a very good choice for a history-based predictor.
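The 2-bit up/down saturating counter described above is easy to model; this sketch assumes a weakly-taken initial state, which is one common but not universal choice:

```python
class TwoBitCounter:
    """2-bit up/down saturating counter: states 0..3, predict taken for
    counts 2 and 3; increment on taken, decrement on not taken."""
    def __init__(self, count=2):        # assumed weakly-taken initial state
        self.count = count

    def predict(self):
        return self.count >= 2          # True = taken

    def update(self, taken):
        if taken:
            self.count = min(3, self.count + 1)
        else:
            self.count = max(0, self.count - 1)

c = TwoBitCounter()
c.update(False)                 # one not-taken outcome: 2 -> 1
assert not c.predict()          # now predicts not taken
c.update(True); c.update(True)  # 1 -> 2 -> 3
assert c.predict()              # back to predicting taken
```

The hysteresis is the point: a single anomalous outcome (such as a loop exit) does not immediately flip a strongly biased prediction.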
Direct-mapped branch history tables are assumed in this study. While some programs, such as gcc, have more than 7000 conditional branches, for most programs the branch penalty due to aliasing in finite-sized branch history tables levels out at a table size of about 1024 entries.

BRANCH MISPREDICTION RECOVERY
Branch prediction is a speculative technique, and any speculative technique requires mechanisms for validating the speculation. Dynamic branch prediction can be viewed as consisting of two interacting engines: the leading engine performs speculation in the front-end stages of the pipeline, while the trailing engine performs validation in the later stages and, in the case of a misprediction, also performs recovery. These two aspects of branch prediction are illustrated in Figure 5. Branch speculation involves predicting the direction of a branch and then proceeding to fetch along the predicted path of control flow. While fetching from the predicted path, additional branch instructions may be encountered. Predictions of these additional branches can be similarly performed, potentially resulting in speculating past multiple conditional branches before the first speculated branch is resolved. Figure 5 illustrates speculating past three branches, with the first and third branches predicted taken and the second predicted not taken. When this occurs, instructions from three speculative basic blocks are resident in the machine and must be appropriately identified. Instructions from each speculative basic block are given the same identifying tag; in the example of Figure 5, three distinct tags identify the instructions from the three speculative basic blocks. A tag indicates that an instruction is speculative, and its value identifies which basic block the instruction belongs to. As a speculative instruction advances down the pipeline stages, its tag is carried along. When speculating, the instruction addresses of all the speculated branch instructions (or of the next sequential instructions) must be buffered in case recovery is required.
Branch validation occurs when the branch is executed and its actual direction is resolved. The correctness of the earlier prediction can then be determined. If the prediction turns out to be correct, the speculation tag is deallocated and all instructions associated with that tag become nonspeculative and are allowed to complete. If a misprediction is detected, two actions are required: the incorrect path must be terminated, and fetching from the new correct path must be initiated. To initiate the new path, the PC must be updated with a new instruction fetch address. If the incorrect prediction was not taken, the PC is updated with the computed branch target address. If the incorrect prediction was taken, the PC is updated with the sequential (fall-through) instruction address, obtained from the address buffered when the branch was predicted taken. Once the PC has been updated, fetching resumes along the new path, and branch prediction begins anew. To terminate the incorrect path, the speculation tags are used: all tags associated with the mispredicted branch identify the instructions that must be eliminated. All such instructions still in the decode and dispatch buffers, as well as those in reservation station entries, are invalidated, and the reorder buffer entries they occupy are deallocated. Figure 5 illustrates this validation/recovery task when the second of the three predictions is incorrect. The first branch is correctly predicted, so the instructions with Tag 1 become nonspeculative and are allowed to complete. The second prediction is incorrect, so all instructions with Tag 2 and Tag 3 must be invalidated and their buffer entries deallocated. After fetching resumes down the correct path and branch prediction begins once again, Tag 1 is reused to denote the instructions in the first new speculative basic block.
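The tag-based squashing described above can be illustrated with a toy model; real hardware invalidates buffer entries in place rather than rebuilding a list, so this is purely conceptual:

```python
# Instructions from each speculative basic block carry the same tag.
# When prediction i is found to be wrong, every instruction tagged i
# or later belongs to the wrong path and must be eliminated.

def squash_from(in_flight, bad_tag):
    """Keep instructions from basic blocks before the mispredicted
    branch; drop the mispredicted block and everything after it."""
    return [(tag, op) for (tag, op) in in_flight if tag < bad_tag]

pipeline = [(1, 'add'), (1, 'ld'), (2, 'mul'), (3, 'st')]
pipeline = squash_from(pipeline, bad_tag=2)   # second prediction was wrong
assert pipeline == [(1, 'add'), (1, 'ld')]    # Tag 1 work survives
```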
During branch validation, the associated BTB entry is also updated.

TWO LEVEL BRANCH PREDICTOR
The dynamic branch prediction schemes discussed thus far have a number of limitations. The prediction for a branch is made based on the limited history of only that particular static branch instruction. The prediction algorithm does not take into account the dynamic context within which the branch is being executed; for example, it does not use any information on the particular control flow path taken in arriving at that branch. Furthermore, the same fixed algorithm is used to make the prediction regardless of the dynamic context. It has been observed experimentally that the behavior of certain branches is strongly correlated with the behavior of other branches that precede them during execution. Consequently, more accurate branch prediction can be achieved with algorithms that take into account the branch history of other correlated branches and that can adapt the prediction algorithm to the dynamic branching context.
In 1991, Yeh and Patt proposed a two-level adaptive branch prediction technique that can potentially achieve better than 95% prediction accuracy by having a highly flexible prediction algorithm that can adapt to changing dynamic contexts [4].
In previous schemes, a single branch history table is used and indexed by the branch address.

OH-SNAP
Most proposals for neural branch predictors derive from the perceptron branch predictor [9]. A perceptron is a vector of h + 1 small integer weights, where h is the history length of the predictor. As shown in Figure 11, the branch history shift register is updated speculatively (ahead pipelining) and rolled back on a misprediction.
When the branch outcome becomes known, the perceptron that provided the prediction may be updated. The perceptron is trained on a misprediction or when the magnitude of the perceptron output is below a specified threshold value. Upon training, both the bias weight and the h correlating weights are updated. The bias weight is incremented if the branch is taken and decremented if it is not taken.
Each correlating weight in the perceptron is incremented if the predicted branch has the same outcome as the corresponding bit in the history register and decremented otherwise, using saturating arithmetic. If there is no correlation between the predicted branch and a branch in the history register, the latter's corresponding weight will tend toward 0; if there is high positive or negative correlation, the weight will have a large magnitude. Figure 11 illustrates the concept of a perceptron producing a prediction and being trained. A hash function based on the PC accesses the weights table to obtain a perceptron weight vector, which is then multiplied by the branch history and summed with the bias weight to form the perceptron output. In this example, the perceptron incorrectly predicts that the branch is taken. The microarchitecture adjusts the weights when it discovers the misprediction. With the adjusted weights, assuming the history is the same the next time this branch is predicted, the perceptron output is negative, so the branch will be predicted not taken.
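The predict/train loop described above can be sketched as follows; the threshold value and the 8-bit weight clamp are illustrative assumptions, not parameters taken from [9]:

```python
# Sketch of perceptron prediction and training. History bits are
# encoded as +1 (taken) / -1 (not taken); weights[0] is the bias
# weight. THRESHOLD and the weight range are assumed values.

THRESHOLD = 32
W_MAX, W_MIN = 127, -128                 # saturating 8-bit weights

def predict(weights, history):
    y = weights[0] + sum(w * h for w, h in zip(weights[1:], history))
    return y, y >= 0                     # raw output and taken/not-taken guess

def train(weights, history, taken):
    y, guess = predict(weights, history)
    # train on a misprediction or a low-confidence correct prediction
    if guess != taken or abs(y) <= THRESHOLD:
        t = 1 if taken else -1
        weights[0] = max(W_MIN, min(W_MAX, weights[0] + t))
        for i, h in enumerate(history, start=1):
            # increment on agreement with the history bit, else decrement
            weights[i] = max(W_MIN, min(W_MAX, weights[i] + t * h))

w = [0] * 5                              # h = 4 history bits
hist = [1, -1, 1, 1]
for _ in range(3):
    train(w, hist, taken=True)           # branch correlates with this history
_, guess = predict(w, hist)
assert guess                             # now predicted taken
```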
Improvements in accuracy result from accessing the weights using a function of the PC and the path, breaking the weights into a number of independently accessible tables, scaling the weights by coefficients based on their position in the branch history register, and taking the dot product of a modified global branch history vector and the scaled weights. Figure 12 shows a high-level diagram of the prediction algorithm and data path.

BAGGING
Bootstrap aggregating (a.k.a. bagging), introduced by Breiman [11] in 1996, is a meta-algorithm to improve the stability and accuracy of learning algorithms. It has been shown to be very effective in improving generalization performance compared to individual base models [12]. The basic idea is to combine many weak learners to produce a strong learner. Bagging can be seen as a special case of a hybrid predictor, in which predictions from multiple predictors are aggregated, for example by meta-predictors or adder trees. In this work, I applied bagging to branch prediction. Because the original bagging method is offline, that is, the entire training data set must already be available, I needed to develop an online version of bagging. Previous work by Oza and Russell [13] modeled the sequential arrival of the data by a Poisson(1) distribution and proved the convergence of this method to offline bagging as N→∞. I first used their method in my implementation, which improved performance most of the time. However, I observed that a multinomial distribution worked better, and hence that method was used in later simulations. The situation is more complicated for branch prediction data because bootstrapping must be carried out in a way that suitably captures the dependence structure of the data. Oza and Russell's [13] method assumed that samples were independent of each other, and thus it does not produce good bootstrapping for branch prediction data. There are studies that developed methods for bootstrapping time series [14], which are a better fit for branch prediction. Further research is needed to develop better online bootstrapping methods for branch prediction, or to adopt methods from previous work on bootstrapping for time series data; this is left as future work.
In my bagging implementation, each predictor is updated on each sample k times in a row, where k is a random number generated from a multinomial distribution. I illustrate online bagging in Figure 14. In general, bagging can be applied to any predictor; a group of identical predictors (e.g., a number of TAGE predictors) as well as a mix of different predictors may be used.
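A sketch of this online-bagging update rule, using the 20%/60%/20% trinomial distribution over k = 0, 1, or 2 updates from the RandUpd simulations; the sub-predictor interface (`update(pc, taken)`) is an assumed stand-in:

```python
import random

# Online bagging sketch: on each resolved branch, every sub-predictor
# is updated k times in a row, with k drawn independently per
# sub-predictor. Each sub-predictor therefore sees a slightly
# different resampled training stream.

def update_count(rng):
    # trinomial: k = 0 (20%), 1 (60%), or 2 (20%)
    return rng.choices([0, 1, 2], weights=[0.2, 0.6, 0.2])[0]

def bagged_update(sub_predictors, pc, taken, rng):
    for p in sub_predictors:
        for _ in range(update_count(rng)):
            p.update(pc, taken)          # the sub-predictor's normal update

class CountingStub:
    """Stand-in sub-predictor that just counts how often it is updated."""
    def __init__(self):
        self.n = 0
    def update(self, pc, taken):
        self.n += 1

rng = random.Random(42)
subs = [CountingStub() for _ in range(4)]
for _ in range(1000):                    # 1000 resolved branches
    bagged_update(subs, pc=0x400, taken=True, rng=rng)
# each sub-predictor averages about one update per branch
assert all(700 < s.n < 1300 for s in subs)
```

Drawing k from Poisson(1), as in Oza and Russell's scheme, only requires swapping out `update_count`.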

TAGE bagging (T-BAG) uses a number of TAGE predictors of approximately the same size as sub-predictors. Each sub-predictor provides a prediction for the current branch independently of the others. Online bagging is performed by deciding whether or not a sub-predictor is updated with the current branch's outcome; based on the generated random number, this update may occur multiple times for the current branch. The branch history, however, is always updated as usual.
For the final prediction computation, each sub-predictor remembers the success of its last 16 predictions using a sliding window. The number of correct predictions is used as the weight of the sub-predictor. For a not-taken prediction the weight is taken as negative, and for a taken prediction it is positive. The overall T-BAG prediction is the sign of the sum of the weights: negative means not taken, and it is taken otherwise. This method was slightly better than using a majority vote for the final prediction.
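The weighted vote can be sketched as follows; the warm start of the 16-entry window (initialized as all correct) and breaking ties toward taken are my assumptions:

```python
from collections import deque

# Sketch of the T-BAG final vote: each sub-predictor's weight is its
# number of correct predictions over a 16-entry sliding window; the
# weight counts negative for a not-taken vote, positive for taken.

class SubWeight:
    def __init__(self):
        self.window = deque([True] * 16, maxlen=16)  # assumed warm start

    def weight(self):
        return sum(self.window)                      # correct count, 0..16

    def record(self, was_correct):
        self.window.append(was_correct)              # oldest entry drops out

def final_prediction(votes):
    """votes: list of (predicted_taken, SubWeight) pairs."""
    total = sum(s.weight() if taken else -s.weight() for taken, s in votes)
    return total >= 0                                # negative means not taken

a, b, c = SubWeight(), SubWeight(), SubWeight()
for _ in range(8):
    c.record(False)              # sub-predictor c has been unreliable lately
# a and b vote taken (weight 16 each); c votes not taken (weight 8)
assert final_prediction([(True, a), (True, b), (False, c)])
```

A recently accurate sub-predictor thus dominates the vote, which is the effect the sliding window is after.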
In all random-update (RandUpd) simulations, updates are performed 0, 1, or 2 times in a row for 20%, 60%, and 20% of the time, respectively, using a trinomial distribution. That is, 60% of the time the update is done as usual, 20% of the time no update is performed, and 20% of the time the update is done twice in a row.
The original TAGE also uses the PC when forming the hashed index for its tagged components. However, because of its operation and its ability to exploit very long history lengths, the PC does not significantly affect performance: in my experiments, the best TAGE configuration that uses the PC in table indexing and the one that does not achieve similar performance. Therefore, to further increase variability among sub-predictors, some sub-predictors do not use the PC when indexing the tagged tables. To the best of my knowledge, no previous work has studied the effects of not using the PC in table indexing. (See Table 1.)

RESULTS
My goal is to find a branch predictor that outperforms the TAGE predictor. To that end, I first tuned the TAGE predictor for the CBP-4 traces, to establish the peak performance TAGE can achieve; this way, any further improvement can be attributed to bagging. Second, I applied bagging to the TAGE predictor alone.
Third, I used a different type of branch predictor, OH-SNAP.
For reference purposes, Table 2 shows the results for simple gshare, TAGE, and OH-SNAP. All predictors shown in Table 2 are 64 KB in size. CBP-4 evaluates success based on the arithmetic mean over all traces (AMEAN).
While tuning the TAGE predictor, I focused on multiple parameters: the size of the predictor, the number of tables, the counter width in the tables, and the history length. Table 3 shows the effect of increasing the size of the TAGE predictor. As seen from the results, increasing the size has a significant effect on performance, but increasing it beyond 1 MB has minimal effect; therefore, a size of 1 MB is used as the base configuration.

Figure 15: Comparison of Different Bagging Configurations (AllSame_RandUpd, AllDifferent, and AllDifferent_RandUpd, versus the number of sub-predictors)
Figure 15 shows the overall effect of bagging. As mentioned above, one could use the same configuration for all the sub-predictors; this configuration is called AllSame, and in that case the only variability among the sub-predictor predictions comes from the random updates. The sub-predictor parameters used in this configuration are: counter width = 3, number of tagged tables = 38, and minimum and maximum history lengths = 7 and 100,000, respectively. AllDifferent refers to varying the configuration across sub-predictors; for this setting, both random updates (AllDifferent_RandUpd) and regular updates were simulated.
Two conclusions can be drawn from this figure. First, AllDifferent is always better than AllSame. Second, random updates require a sufficient number of sub-predictors to pay off; in this case, the breaking point is 8: using random updates with 8 or more sub-predictors gives better results.
Lastly, I used OH-SNAP to see the effect of bagging on a different type of predictor.
I did not tune OH-SNAP, and every predictor used is 64 KB. The individual results can be seen in Table 6. The base configuration achieved 2.613 misp/KI. Increasing the size by 2 and 4 times had minimal effect, achieving 2.611 misp/KI and 2.608 misp/KI, respectively. Using 2 OH-SNAP predictors with bagging resulted in 2.616 misp/KI, which is worse than doubling the size; the reason is that a sufficient number of sub-predictors was not used. At 4 sub-predictors, bagging outperforms the pure size increase with 2.602 misp/KI. Bagging shows promise as a future research direction. Although the online bagging method used in this work provides a way to apply bagging to branch prediction, it assumes independent samples, which is not the case for branch history.
Different online bagging methods may prove better and are a subject for future research.
Finally, my analysis mostly used TAGE as the base predictor; I looked into OH-SNAP only briefly. A wider variety of predictors using different prediction methods could also be employed.
As a closing note, I entered the CBP-4 competition with this idea and took fourth place in the unlimited size category [15].