Multithreaded Pattern Searching of Large Files Using Limited Memory

Pattern searching and discovery in large files is prohibitively slow and requires large amounts of memory. As the number of patterns to process increases, the amount of memory needed grows exponentially, exceeding the resources of a traditional computer system. The solution to this problem is to use the hard drive to save pattern information. A program called Pattern Finder was created which saves patterns, keeps track of how much memory it uses, and dumps the information to the hard drive when that threshold is reached. The other problem inherent in pattern searching, besides limited resources, is the amount of processing time it takes to complete. To speed up processing, we implement a multithreaded suffix-tree pattern finding algorithm that utilizes multiple processing cores. The goal is to approach the speedup predicted by Amdahl's law: adding more cores increases throughput.


Background
The main objective is finding microprocessor trace patterns to better understand program flow. The goal of this project is to develop a framework for quantitatively analyzing a program's behavior and thereby provide insights into the design of next-generation hardware prediction mechanisms. Our focus is to discover patterns, according to predetermined pattern scoring rules, that occur frequently in input sequences or are characteristic of certain subsets of the input data. Frequently recurring patterns are often indicative of underlying structure and function. Once all possible patterns are found, one can analyze the results to uncover program behavior.
A pattern at its most granular level is a byte, which can be represented by an unsigned char in the C++ language. A byte can represent 256 different values, and we number levels based on how many unsigned chars are in the pattern. Level one has one unsigned char as the pattern, which can potentially give us 256 different patterns.
The number of patterns that can exist at each level is 256^N, where N is the current level being processed.
The Pattern Finder flow is quite involved, but the high-level design focuses on whether each level can be processed efficiently with the available hardware resources. The obvious goal is to use only memory (specifically, Dynamic Random Access Memory (DRAM)) because it is considerably faster than the Hard Drive (HD). Pattern Finder first predicts how much memory the next level will use, compares that against how much memory is currently in use, and decides whether to use only DRAM or both DRAM and HD. The user can specify the DRAM memory limit; if not specified, the program decides for itself.
If the prediction allows DRAM alone, processing will run much faster; if the drive must be accessed, there is a timing penalty.

Objective and Scope
The pattern searching algorithm relies on a tree search methodology. If the tree is completely populated, then each parent node will have 256 child nodes. The relationship between parent and child is that the child's level size is always the parent's size plus one. The tree methodology allows complete parallelism because each grouping of patterns does not need to know about patterns found in other threads. Neighboring groupings of pattern nodes cannot share similar patterns because they live on different leaves of the tree. Another advantage of tree searching is that Pattern Finder never has to keep a copy of the pattern itself: if the size of the pattern and the location of its last unsigned char are known, the program can always reference it later in the searched file. Figure 1 below shows the pseudocode of how the suffix-tree pattern search works. The tree grows in levels until no more patterns can be found or the pattern length has reached its maximum size. Each thread grows its individual tree in parallel without any shared memory. The tree algorithm is used because of its parallel-friendly nature, which allows full utilization of multi-core technology.
Figure 1: Pseudocode for growing the suffix tree level by level while keeping track of patterns of length N.
Another advantage of the suffix-tree algorithm is that only the previous level's data needs to be kept in memory. The previous level's node data contains all of the information starting from the first level.
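Because the pseudocode figure did not survive extraction, the following is a minimal C++ sketch of one level-growth step under the representation described above (a pattern is identified by the position of its last byte). The names PatternNode and growLevel are illustrative, not the actual Pattern Finder types.

    // Minimal sketch of one level of suffix-tree pattern growth.
    #include <cstdint>
    #include <vector>

    // A node stores the file positions of the *last* byte of its pattern,
    // so the pattern itself never has to be copied anywhere.
    struct PatternNode {
        std::vector<uint64_t> positions;
    };

    // Grow a level-N node into up to 256 level-(N+1) children by bucketing
    // each occurrence on the byte that follows it in the file.
    std::vector<PatternNode> growLevel(const PatternNode& parent,
                                       const std::vector<uint8_t>& file) {
        std::vector<PatternNode> children(256);
        for (uint64_t pos : parent.positions) {
            if (pos + 1 < file.size())
                children[file[pos + 1]].positions.push_back(pos + 1);
        }
        // Children with fewer than two occurrences are no longer patterns
        // and would be pruned before the next level is processed.
        return children;
    }

Because a child is built only from its parent's occurrence list, each subtree can be grown by a separate thread with no shared state.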

DRAM Processing Bottlenecks
Many bottlenecks were overcome while programming the parallel tree pattern processor. The first and most important was memory management and access latency. To keep the processing footprint small, careful consideration must go into how each pattern is stored and, if need be, when data is offloaded to the hard drive. Memory access latency is critical because the problem at hand involves intensive dynamic memory allocation and access. To use memory properly and efficiently, one needs to understand the CPU (Central Processing Unit) architecture's limitations, and more specifically how the cache is structured. Another key ingredient in battling bottlenecks is how the operating system manages thread allocation and how threads manage memory. Tackling these issues requires a proper understanding of how the computer hardware is laid out.

Computer Architecture
The ADANA1 server is the main computer that processes large data files in parallel. ADANA1 has four CPU sockets, each containing a 12-core Xeon E7-4860. The speed and core count of the CPUs provide huge processing force, but the major focus should be on the cache layout. All that processing power will sit waiting on memory access latency if the program does not adhere to proper cache management. Each chip contains three levels of cache, and each cache level has a different access latency and size.
Level 1 cache is the fastest, with a 4-cycle latency and 64 KB per core.
Level 2 cache is mid-range, with a 10-cycle latency and 256 KB per core. Level 3 cache is the slowest, at roughly a 100-cycle latency, but differs in that all 30 MB is shared among all the cores on the chip. The figure below shows how the cache is structured in relation to each core. To summarize, the 12 cores on each chip share 30 MB of L3 cache, unlike the L1 and L2 caches, which are private to each core. The key will be keeping the file data within the 30 MB cache to prevent cache misses and main memory accesses, which are even slower than L3 accesses. One other thing to keep in mind is the 64-byte cache line that is loaded from memory into the cache. Creating data objects that align with the cache line size is another important aspect of keeping memory in the cache and retrieving bytes efficiently.
Accessing data that crosses cache line boundaries can confuse the caching algorithms and force unnecessary cache evictions.
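As a hedged illustration of cache-line-sized objects, the sketch below pads a per-thread bookkeeping struct to exactly one 64-byte line so an object never straddles two lines and two threads' counters never share one (which would otherwise cause false sharing). The struct and its fields are hypothetical.

    #include <cstdint>

    struct alignas(64) ThreadLocalCounters {
        uint64_t patternsFound;    // written by exactly one thread
        uint64_t bytesProcessed;
        char     pad[48];          // fill the struct out to one cache line
    };
    static_assert(sizeof(ThreadLocalCounters) == 64,
                  "object must occupy exactly one cache line");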

Operating System Memory Management
Another key discussion point is how memory is managed by the OS. Some operating systems, such as Linux, are better suited to parallel memory management, while Windows tends not to do as well as some of the Linux distributions. Our research uses Windows tools to profile memory bottlenecks while running speed tests on the Ubuntu Linux ADANA1 machine. Most operating systems do not allocate memory in parallel, which causes memory access bottlenecks in programs that dynamically allocate memory in a threaded environment.
Using fine-tuned memory allocators designed with threading in mind increases the speed of dynamic allocation and is explored further in this paper.

STL Vector Memory Approach
Understanding how STL vectors manage memory is essential to making Pattern Finder's memory access fast. Vectors store their elements contiguously, which exploits the CPU's ability to keep blocks of memory in the cache while accessing them. Memory must be accessed linearly to take advantage of data already loaded into the cache. Later there will be a discussion of the spatial locality problem of randomly accessing a large vector of pattern data and how to overcome it. Matthias Kretz, in his paper "Efficient Use of Multi- and Many-Core Systems with Vectorization and Multithreading" [1], suggests partitioning data to minimize random access cache miss penalties when accessing a large vector set.
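The contrast can be sketched in a few lines (function names are illustrative): a linear walk uses every byte of each fetched 64-byte line and lets the hardware prefetcher stream lines ahead, while a long stride touches one element per fetched line and defeats the cache.

    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    uint64_t sumSequential(const std::vector<uint64_t>& v) {
        // Linear walk: the prefetcher streams cache lines ahead of us.
        return std::accumulate(v.begin(), v.end(), uint64_t{0});
    }

    uint64_t sumStrided(const std::vector<uint64_t>& v, std::size_t stride) {
        // Long strides (stride >= 8 here) use one element per line fetched.
        uint64_t sum = 0;
        for (std::size_t i = 0; i < v.size(); i += stride) sum += v[i];
        return sum;
    }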

Memory Allocation Bottleneck
The memory allocation bottleneck stems from how Pattern Finder requests memory from the operating system. The various flavors of memory allocators include Glibc Malloc, Hoard Malloc [13], JE Malloc [12], MT Malloc [11] and TC Malloc. The major issue is the need to constantly allocate memory while growing patterns.
When multithreading is introduced, the memory allocators must request memory from the memory manager in parallel. Operating system requests for memory are processed serially, which adds blocking wait time when several threads request memory at the same time.
One way to circumvent this bottleneck is to override the new and delete implementations that govern operating system memory allocation. The memory overhead with TC Malloc is much larger, but the synchronization cost of threaded memory allocation is greatly reduced. The larger overhead can lead to over-allocation because there is a minimum size that each thread will have access to, and when the memory allocated per thread passes a certain threshold, throughput drops significantly. Careful memory management must therefore be a priority when implementing a program that constantly strains the system for memory resources [6].
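A minimal sketch of the override idea follows; std::malloc stands in for whichever scalable allocator (for example TC Malloc) a real build would route to, and the array forms of new/delete would be overridden the same way.

    #include <cstdlib>
    #include <new>

    void* operator new(std::size_t size) {
        // A real build would call the scalable allocator here instead.
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc{};
    }

    void operator delete(void* p) noexcept {
        std::free(p);   // and the matching scalable free here
    }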

TC Malloc vs Glibc Malloc Allocation
The memory allocation bottleneck described above can be partially bypassed by using other heap memory manager implementations. Glibc Malloc, the standard C/C++ allocator, is the tried and true default. TC Malloc, otherwise known as Thread Cached Malloc, implements a different style of memory allocator [4]. The Thread Cached implementation utilizes per-thread mini heaps to decouple allocation from the single coherent heap that the Glibc version adheres to. Each spawned thread has its own independent miniature heap on which it attempts to allocate memory. If the memory size of the mini heap exceeds a certain threshold, the mini heap must reallocate, which takes more time and overhead. To efficiently harness the power of TC Malloc, the programmer must keep each thread relatively lightweight in memory allocation. This opens the door to thread pooling, which spawns many lightweight pattern finding tasks. The trend lines show that allocations between 2^16 bytes and 64 MB slow down significantly; evidently there is a sweet spot in memory allocation size.
Current simulations pre-allocate STL vectors with 2^10 (1024) to 2^14 (16384) byte allocations. The issue with this pre-allocation scheme is that memory overhead can increase dramatically and must be factored into the memory predictions behind the hard drive versus DRAM decision. The current implementation using TC Malloc might be inefficient when the thread count is greater than 10 because of how STL (Standard Template Library) vectors are handled. Vectors can grow without bound, and because they are contiguous they must constantly reallocate as they grow. The thread cached version only works well with small allocation sizes, so storing PLISTS (pattern index lists) in linked lists would be an implementation worth trying. A linked list of small vectors that never need to be reallocated could show great improvement. This technique would reduce the spatial locality of the memory but would improve throughput if the allocation sizes were somewhere between one and eight kilobytes per vector.
Using 8-16 cache blocks per vector allocation would be preferable, but this still requires testing.
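A sketch of that idea with hypothetical names: each chunk is reserved once at an 8 KB size that the thread-cached allocator handles well, and no chunk is ever reallocated.

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <vector>

    class ChunkedPlist {
        static constexpr std::size_t kChunk = 1024;  // 1024 x 8 bytes = 8 KB
        std::list<std::vector<uint64_t>> chunks_;
    public:
        void push(uint64_t index) {
            if (chunks_.empty() || chunks_.back().size() == kChunk) {
                chunks_.emplace_back();
                chunks_.back().reserve(kChunk);      // one allocation per chunk
            }
            chunks_.back().push_back(index);
        }
    };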

Thread Building Block Allocation
Threading Building Blocks, known as TBB, is an Intel API that provides scalable and cache-friendly memory allocation [3]. The two main custom vector allocators to take advantage of are the scalable allocator and the cache aligned allocator. The main advantage of the scalable allocator is that it was created specifically for threaded allocation. Testing an implementation that overrides the Glibc Malloc allocator with TBB produced mixed results: CPU usage went from 60% to 80%, but processing actually slowed down. Overall the TBB implementation is faster than Glibc but slower than TC Malloc.
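Adopting the scalable allocator requires only swapping the vector's allocator parameter; a minimal sketch (the pre-size value is illustrative):

    #include <cstdint>
    #include <vector>
    #include <tbb/scalable_allocator.h>

    // Index storage that allocates from TBB's per-thread memory pools.
    using PlistVector = std::vector<uint64_t, tbb::scalable_allocator<uint64_t>>;

    PlistVector makePlist() {
        PlistVector v;
        v.reserve(1024);   // pre-size once to avoid regrowth churn
        return v;
    }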

Amdahl's Law Bottleneck
Amdahl's law states that a program which is 95% parallel should get a 6x speedup using 8 threads [2]. Pattern Finder achieves a 5.2x speedup per previous estimates, which is close to the target but not quite there. The problem is that scaling the thread count to 18 only yields a 6.8x speedup, whereas Amdahl's law predicts 9.5x at 18 threads for a 95% parallel program. As one can see in Figure 3, Amdahl's law produces a different trend line for each parallel fraction of a program. Our results follow a program running at a 90% parallel portion: at the 8-core mark that predicts a 5x speedup and at the 18-core mark a 6.5x speedup. This means Pattern Finder is between 90 and 95 percent parallel. If Pattern Finder can become 95% parallel, there is potential for better throughput.
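For reference, Amdahl's law gives the speedup S(n) for a program with parallel fraction p running on n cores:

    S(n) = 1 / ((1 - p) + p / n)

With p = 0.95 this yields S(8) = 1 / (0.05 + 0.95/8) ≈ 5.9 and S(18) = 1 / (0.05 + 0.95/18) ≈ 9.7, matching the targets above; with p = 0.90 it yields S(8) ≈ 4.7 and S(18) ≈ 6.7.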

First Level Thread Improvements
One difficult hurdle to overcome was implementing the first level so that it is processed in a completely parallel fashion. The issue is that the first level's job is to process the entire file and then partition patterns into pattern buckets. Partitioning the patterns into one main bucket store involves bringing each thread's data together and copying it into one unit. To prevent this serialization, duplicate buckets are created and labeled as containing the same pattern. They are not contained in common vectors but are tagged as the same. Tagging first-level buckets with their pattern avoids the sequential amalgamation of data. Implementing this solution gives roughly a 99% parallel Pattern Finder, as discussed in the previous section.

Random File Access
The memory bottleneck of randomly accessing file information from a large file is displayed in the code sample below. The Visual Studio profiler shows that line 3387 takes the longest time in the application. That line is our show-stopper because Pattern Finder randomly accesses a very large file that the L3 cache cannot hold onto. This causes cache misses and memory access latency penalties of 100 CPU cycles. One improvement on this design is to preload all the string information needed for the entire level's processing. Preloading removes the other variables that would be stored in cache during the main processing loop and frees up more cache for our random access.
Preloading in no way solves our problem, but it still manages to give Pattern Finder a nice boost in speed.
The preloading algorithm utilizes Memory Level Parallelism (MLP), which is the capability of a program to have multiple pending memory accesses in parallel. MLP in a 5-stage pipeline gives one core the capability to process 4 memory accesses in parallel because there are no dependencies between the accesses.
Pattern Finder essentially gained a 4x throughput for the single-threaded case from MLP due to the 5-stage pipeline.
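The preloading loop can be sketched as follows under the representation described earlier (a pattern is identified by the file offset of its last byte; names are illustrative). Each iteration's load is independent of the previous one, so the core can keep several cache misses in flight simultaneously.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Gather every pattern string needed for a level into one dense buffer
    // before the main processing loop runs.
    std::vector<std::string> preloadPatterns(const std::vector<uint8_t>& file,
                                             const std::vector<uint64_t>& lastBytes,
                                             std::size_t length) {
        std::vector<std::string> out;
        out.reserve(lastBytes.size());
        for (uint64_t end : lastBytes) {
            // Independent random reads: no iteration depends on the last,
            // so the memory accesses overlap (MLP). Assumes end + 1 >= length.
            out.emplace_back(file.begin() + (end + 1 - length),
                             file.begin() + (end + 1));
        }
        return out;
    }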
The random string preloading operation creates a major pitfall for the threading side of Pattern Finder. The sequential version of the program eliminates random string access latency, giving a very fast sequential Pattern Finder, while the multithreaded version still struggles to keep memory in the cache, so throughput does not scale as well with thread count. The problem is that each thread accesses the string randomly at the beginning of each level and therefore introduces cache misses everywhere. Say, for example, 8 threads access the same file but the memory spans between the threads are very large. The constant memory jumping creates a situation where the cache continually pulls and evicts memory from its stores; this phenomenon is called thrashing the cache.
Figure 9: Code snippet highlighting the area where the cache is continually missed after a subsequent fetch

Thread Pooling Management
The threading scheme used for this project utilizes STL threads, which are an improvement over other threading libraries such as pthreads (POSIX Threads). STL threading adds support for polling whether a thread has finished its execution. Polling a thread's status is vital to implementing a thread pooling hierarchy: once a thread is finished with its job it can be dispatched right away to do more work. The thread pool ensures that all threads are kept at work, increasing the throughput of Pattern Finder.
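One way to realize this polling with the standard library is to hold each job's std::future and test it with wait_for; the sketch below is illustrative rather than the actual Pattern Finder pool.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <future>
    #include <vector>

    // Dispatch jobs into a fixed number of slots and poll their futures;
    // a finished slot is refilled immediately so no core sits idle.
    void runPool(const std::vector<std::function<void()>>& jobs,
                 unsigned nThreads) {
        std::vector<std::future<void>> slots;
        std::size_t next = 0;
        while (next < jobs.size() || !slots.empty()) {
            while (slots.size() < nThreads && next < jobs.size())
                slots.push_back(std::async(std::launch::async, jobs[next++]));
            for (std::size_t i = 0; i < slots.size(); ) {
                if (slots[i].wait_for(std::chrono::milliseconds(1)) ==
                    std::future_status::ready) {
                    slots[i].get();                  // propagate any exception
                    slots.erase(slots.begin() + i);  // slot is free again
                } else {
                    ++i;
                }
            }
        }
    }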

Thread Recursion
One issue with the design improvement of dispatching pattern searching threads is balancing the workload. After the initial first level is processed, threads are dispatched with workloads that will almost never be equal. Sometimes the workloads can be quite lopsided because some pattern nodes are negligible. In the previous implementation, threads were dispatched with balanced nodes and would terminate processing when all the patterns were obtained. To achieve the designed throughput, there needs to be a way to put threads back to work when they complete their task.
The solution is to have each thread poll whether any cores are available to help with a job: if a thread has finished working, any other thread can grab it and utilize its processing power. A smarter implementation would use a thread priority queue in which the thread that has completed the least work is given more thread resources to finish. Larger files usually fall victim to this issue of unequal workloads. The recursive thread pool approach improved throughput from 13x to 14.8x using 32 threads on a 2.8 GB media file.

Balanced Workload Approach
One major bottleneck of Pattern Finder, as previously described, is managing each thread's workload evenly. The simplest approach is to count each pattern node's size and distribute the nodes across the threads. If Pattern Finder keeps each thread's workload spread as evenly as possible, it will not have to spawn threads as frequently. There will always be a need to dispatch threads within threads because of the inherent nature of pattern finding: pattern searching is random, so there is no way to know in advance how many patterns will be found on a certain tree node. For larger files, which tend to have inconsistent workloads, the balanced approach yields much better throughput. The previous unbalanced Pattern Finder processed a 2.8 GB file in 110 milliseconds, while the latest balanced approach processed it in 89 milliseconds, a 1.24x speedup.
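The balancing pass can be sketched as the classic greedy longest-processing-time heuristic: sort the nodes by occurrence count, then always hand the next-largest node to the currently least-loaded thread. Names are illustrative.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Returns, for each thread, the list of node ids it should process.
    std::vector<std::vector<int>> balance(const std::vector<uint64_t>& nodeSizes,
                                          unsigned nThreads) {
        std::vector<int> order(nodeSizes.size());
        for (int i = 0; i < int(order.size()); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return nodeSizes[a] > nodeSizes[b]; });

        using Load = std::pair<uint64_t, unsigned>;  // (assigned work, thread)
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least;
        for (unsigned t = 0; t < nThreads; ++t) least.push({0, t});

        std::vector<std::vector<int>> buckets(nThreads);
        for (int node : order) {
            Load l = least.top(); least.pop();       // least-loaded thread
            buckets[l.second].push_back(node);
            least.push({l.first + nodeSizes[node], l.second});
        }
        return buckets;
    }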

Limited Memory Hard Drive Processing
A computer will always have a limited amount of DRAM, so any program needs to know its limitations while processing pattern data. At startup, Pattern Finder queries the system's information and assesses whether a file can be processed completely in DRAM or whether it needs help from the Hard Drive. The user can explicitly input a DRAM memory limit, but most users should let the program decide whether the system can handle the file. Every time Pattern Finder finishes processing a pattern level, it must make a conservative judgment call on whether the memory resources will allow processing in DRAM alone. The following topics introduce ways to process patterns efficiently on the Hard Drive. Techniques borrowed from "Better External Memory Suffix Array Construction" [15] prove very beneficial in understanding and deconstructing the problem.

Hard Drive Processing Design
If a file is larger than the computer's DRAM capacity, a processing strategy is required. Utilizing the Hard Drive to offload memory gives us the ability to process the data without resorting to the Hard Drive as virtual memory. The three main memory components in processing each pattern level are the previous level's nodes, the input file being processed, and the new level's nodes generated on the fly. The pattern data is held in a map keyed on pattern data strings, pointing to the pattern's indexes in a corresponding vector [7]. The first step in dividing memory resources is taking a thread's memory limit and dividing that value by three for the three main processing components. Once the partial pattern data has been processed in memory and written to the Hard Drive, Pattern Finder takes the stored files and merges them into a coherent pattern tree. One major caveat in processing these partial pattern files is the existence of patterns occurring across multiple files.

Memory Mapping
Memory mapping is used to increase the throughput of reading from and writing to the Hard Drive. Instead of accessing a file through standard OS read/write system calls, memory mapping creates a file handle that a program accesses as if it were a pointer to heap memory. Specifically, the memory being mapped is the OS kernel's page cache, meaning that no middle-man copies of the data have to be created in user space.

Writing to Hard Drive with Memory Mapping
Writing pattern data, or any type of data, to the Hard Drive is extremely slow. One way to make writing faster is to treat the data as a contiguous block of memory. Memory mapping essentially cordons off a region of memory on the disk and returns a pointer of the data type to be written. The memory size requested can be dynamic, but the most efficient way to write to the disk is to grab 2 MB blocks on Windows and 4 KB blocks on Linux.
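A minimal POSIX sketch of the write path (Linux-style calls; error handling is omitted, and a Windows build would use CreateFileMapping/MapViewOfFile instead):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Write a block of pattern indexes through a memory mapping instead of
    // repeated write() calls; the kernel's page cache does the flushing.
    void mapWrite(const char* path, const uint64_t* data, std::size_t count) {
        const std::size_t bytes = count * sizeof(uint64_t);
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        ftruncate(fd, static_cast<off_t>(bytes));   // size the file up front
        void* p = mmap(nullptr, bytes, PROT_WRITE, MAP_SHARED, fd, 0);
        std::memcpy(p, data, bytes);                // looks like a heap write
        munmap(p, bytes);
        close(fd);
    }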

Reading from Hard Drive Example
Reading from the complete pattern files involves extracting segments of the data based on the memory constraints given. In one scenario, Pattern Finder is run using 8 threads with a memory constraint of 8 GB; each thread is then given a 1 GB share of the budget for reading its pattern file segments. Dedicated writer threads alleviated the problem of processing halting while waiting for a memory flush to be fully committed.

Dynamic Level Processing
If the user does not request that pattern processing be done solely in DRAM or on the HD, Pattern Finder must dynamically manage the system's memory. At the beginning of every level, Pattern Finder decides whether the system has enough memory to process the level without using the HD. Certain levels take very large amounts of memory while others take very little. The dynamic decision making exploits this by avoiding potential HD processing slowdowns when the memory needed for a level is negligible compared to the available DRAM.

HD or DRAM Decision Algorithm
The algorithm that determines how a level is processed factors in the previous level's pattern information. Each pattern from the previous level can potentially produce 256 new patterns, because the next level appends a new 8-bit value that maps to 256 unique values. Pattern Finder must always incorporate the worst-case scenario in its memory prediction, so the predicted pattern count for the next level is the previous level's count multiplied by 256. The other key factor is producing an actual memory count for the next level of processing. Each pattern is stored as a vector of 64-bit unsigned integers containing the locations of each individual pattern instance. The algorithm cannot predict how each pattern will be stored, or whether there will even be an instance of one, but it can compute the remaining potential patterns.
Each pattern has a certain shelf life, and the maximum size a pattern can reach is the size of the file minus one. The algorithm takes advantage of this fact by keeping a tally of eliminated pattern positions and using that value to bound the potential new patterns that can be generated, as sketched below.
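The bound can be expressed in a few lines (names are illustrative; each pattern instance is a 64-bit position, per the storage format above):

    #include <algorithm>
    #include <cstdint>

    // Worst case: 256 children per surviving pattern, capped by the file
    // positions that have not yet been eliminated.
    uint64_t predictNextLevelPatterns(uint64_t prevLevelCount,
                                      uint64_t fileSize,
                                      uint64_t eliminatedPositions) {
        const uint64_t remaining = fileSize - eliminatedPositions;
        return std::min(prevLevelCount * 256, remaining);
    }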
For example, if the file size is 20 MB and 15 MB worth of positions have been eliminated, Pattern Finder can deduce that the number of next-level patterns can be at most 5 million. No matter how many patterns were counted in the previous level, the potential count is always capped at 5 million if the previous level count multiplied by 256 exceeds it.

Hard Drive processing can also leave Pattern Finder with too many file handles open. One way to prevent this from ever occurring is to increase the limit on file handles that can be opened at once. In Windows, the file descriptor limit can easily be changed from within a program using a system call, but Linux requires changes to system files, which makes this problem harder to manage. To make sure Pattern Finder works regardless of file handle limitations, it must understand the limit and adjust appropriately by closing files and reopening them later.

Reducing File I/O Slowdowns
Offloading file deletion to another thread is also beneficial in speeding up processing. An OS delete-file call takes a long time and blocks further processing until the file is completely removed. Instead, the processing thread notifies a file removal thread that a file needs to be deleted, allowing processing to continue while the delete call runs elsewhere.
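A hedged sketch of such a deletion offload, using a worker thread and a condition variable (names are illustrative; the real implementation may differ):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class FileReaper {
        std::queue<std::string> pending_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
        std::thread worker_{[this] {
            std::unique_lock<std::mutex> lk(m_);
            while (!done_ || !pending_.empty()) {
                cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
                while (!pending_.empty()) {
                    std::string f = std::move(pending_.front());
                    pending_.pop();
                    lk.unlock();
                    std::remove(f.c_str());  // slow OS call, off the hot path
                    lk.lock();
                }
            }
        }};
    public:
        void scheduleDelete(std::string path) {
            { std::lock_guard<std::mutex> g(m_); pending_.push(std::move(path)); }
            cv_.notify_one();
        }
        ~FileReaper() {
            { std::lock_guard<std::mutex> g(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }
    };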

Memory Watchdog
Even though Pattern Finder's dynamic memory predictor handles most memory predictions, there remains the unreliable nature of how the OS (Operating System) allocates and deallocates memory. In C++, memory is nominally handled by the programmer, which is why C++ is called an unmanaged language. In practice, the programmer has less control over memory than one would think. When memory is allocated in C++, the OS will typically bring in more memory than necessary: memory resides at certain boundaries, and the OS anticipates that nearby memory will most likely be requested soon after.

Another unpredictable aspect of OS memory management is how memory gets returned. A C++ delete call indicates that a certain block of memory no longer needs to be managed by Pattern Finder and can be released to the OS. The catch is that the runtime can leave released memory cached for the process instead of returning it to the OS. This tradeoff is usually efficient because caching allows memory to be reused quickly without requesting more from the OS, but it means memory is not always returned as expected. The main driver deciding how memory is managed between the OS and Pattern Finder is the Virtual Memory Manager (VMM). The VMM determines when memory gets returned and how memory gets fetched for Pattern Finder, and because VMM implementations differ, there can be many discrepancies between the memory controllers of various OSes.

The memory watchdog determines whether the memory used by Pattern Finder is nearing capacity and, if so, makes an executive decision to force a flush of all memory resources back to the OS. The flush resets all memory resources and resumes processing with a blank memory slate. This must be done to prevent virtual memory page thrashing, which occurs when pages of Pattern Finder's memory must constantly be read from disk to accommodate the program's entire memory needs. Paging prevents an OS from crashing when memory is exceeded because virtual page memory is usually on the order of twice the size of DRAM.

Code portability
Pattern Finder is cross-platform for Linux and Windows, giving the user the ability to run it on a personal Windows box or scale up to a Linux supercomputer cluster.

Multiprocessor Scalability
Overall performance improved significantly by applying knowledge of how the cache performs and understanding the CPU architecture. The figure below shows throughput versus thread count for a selection of files, illustrating the potential variance in throughput. The throughput achieved models a program that is 95% parallel according to Amdahl's law.

Cache Latency
Pattern searching is inherently a memory-bound application that relies heavily on accessing memory. To properly understand how Pattern Finder behaves in relation to the hardware, one can use profiling tools such as the Intel profiling suite. The major optimization strategy for a memory-heavy application is to access data in a fashion that takes advantage of the memory latency hierarchy. Prefetching cache lines from DRAM significantly improves memory access speed when memory is accessed sequentially and spatially. If Pattern Finder accesses memory in long strides, the caching system will never yield L1 cache-speed access. A conveniently cache-sized file will not always be the case, so the user cannot rely on a program that only processes small files extremely efficiently. The more likely scenario is a file larger than the CPU cache, so the program must avoid random, i.e. non-sequential, memory access. The pattern searching algorithm inherently involves non-sequential access because each pattern node only processes locations where similar patterns were found: pattern nodes are processed one at a time, so one instance of a pattern could be at the beginning of a file and the next instance could be 100 MB down the line in memory. Caching will be of little help in that scenario.

Hard Disk versus DRAM Processing
The graph below shows that Hard Disk processing is 65 times slower than DRAM processing in the non-threaded version. As multiprocessing is introduced and threads are scaled up, the Hard Disk processor fails to scale the way the DRAM processor does: every added thread costs the Hard Disk processor roughly 3x in throughput relative to the DRAM processor. This scaling issue has to do with threads opening/closing and reading/writing to a file system that is not designed for multithreading.

Figure 14: HD processing versus DRAM processing throughput

Overlapping vs Non-Overlapping Patterns
Files containing large patterns benefit greatly in speed from the non-overlapping search. An mp3 music file containing a segment of silence was processed 180 times faster using the non-overlapping search, and a blank png image file was processed 772 times faster. Processing time is greatly reduced while maintaining nearly 100% pattern accuracy. The graph below shows the processing time of the two previously mentioned files and illustrates that when processing patterns at level three, overlapping patterns get discarded and expose gaps in coverage, but pattern integrity remains intact and observable in the reflected data. The transition from patterns of length 2 to 3 shows the coverage dip.

APPENDIX B Pattern Finder Documentation and Usage
PatternFinder is a tool that finds non-overlapping or overlapping patterns in any input sequence.

PatternFinder Input Files:
PatternFinder accepts any type of input file because it processes at the byte level.

PatternFinder Output Files:
Nine outputs are available. One is a general logger using ASCII text format, another is the Output file which generates patterns based on -pnoname, -plevel and -ptop, and the remaining seven are Comma Separated Variable (CSV) files used for post-processing in MATLAB.
1) Logger file: records general information including the most common patterns, the number of times a pattern occurs, and the pattern's coverage at every level until the last pattern is found. Simple text file.
2) Output file: generates patterns based on -pnoname, -plevel and -ptop.
3) Collective Pattern Data file: records each level's most common pattern and the number of times the pattern occurs, in CSV format.
4) File Processing Time: records each file's processing time in CSV format. Used for processing large data sets with many files.
5) File Coverage: records the most common pattern's coverage of the file in CSV format.
6) File Size Processing Time file: records each file's processing time and corresponding size in CSV format. Used primarily to isolate files in a large dataset that contain large patterns.
7) Thread Throughput: records the processing throughput improvement while incrementing the number of processing threads, in CSV format. Typically used with the -c option, which tests thread counts in multiples of 2 starting at 1 until the machine's core count is reached.
8) Thread Speed: records the processing time taken while incrementing the number of processing threads, in CSV format. Typically used with the -c option, which tests thread counts in multiples of 2 starting at 1 until the machine's core count is reached.