In-Storage Processing: The Next Generation of Storage Systems

In conventional computer systems, software relies on the CPU to run applications and to assign computation tasks to heterogeneous accelerators such as GPUs, TPUs and FPGAs. The CPU must fetch data out of the storage device and move it to the heterogeneous accelerators; after the accelerators complete their computation tasks, the results are flushed to the main memory of the host server for the software applications. In this architecture, the heterogeneous accelerators are located far away from the storage device, and the data movements on the system bus (NVM-express/PCI-express) consume substantial transmission time and bus bandwidth. Moving data back and forth on the storage data bus decreases the overall performance of the storage system.

This dissertation presents an in-storage processing (ISP) architecture that offloads computation tasks into the storage device. The proposed ISP architecture eliminates the back-and-forth data movements on the system bus: it delivers only the computation results to the host memory, saving storage bus bandwidth. ISP uses an FPGA as the data processing unit to process computation tasks in real time. Due to the parallel and pipelined architecture of the FPGA implementation, the ISP architecture processes data with short latency and has minimal effect on the data flow of the original storage system.

In this dissertation, we propose four ISP applications. The first is Hardware Object Deserialization in SSD (HODS), tailored to high-speed data conversion inside the storage device. HODS visibly reduces application execution time compared to software object deserialization when running Matlab, 3D modeling, and other scientific computations. The second ISP application is CISC: Coordinating Intelligent SSD and CPU. It speeds up Minimum Spanning Tree (MST) applications in graph processing by coordinating the computing power inside the SSD with the host CPU cores, and it outperforms the traditional software MST by 35%. The third application speeds up data fingerprint computation inside the storage device. By pipelining multiple data computation units, the proposed architecture performs the Rabin fingerprint computation at the wire speed of the storage data bus. The scheme is extensible to other types of fingerprint/CRC computations and readily applicable to primary storage and to caches in hybrid storage systems. The fourth application is data deduplication. It eliminates duplicate data inside the storage and provides at least a six-fold speedup in throughput over software. The proposed ISP applications in this dissertation prove the concept of computational storage. In the future, more compute-intensive tasks can be deployed into the storage device instead of being processed in the CPU or heterogeneous accelerators (GPU, TPU/FPGA). ISP is extensible to primary storage and applicable to the next generation of storage systems.


Abstract
The rapid development of nonvolatile memory technologies such as flash, PCM, and Memristor has made processing in storage (PIS) a viable approach. We present an FPGA module augmented to an SSD storage controller that provides wire-speed object deserialization, referred to as HODS, for hardware object deserialization in SSD. A pipelined circuit structure was designed specifically for high-speed data conversion. HODS is capable of conducting deserialization while data is being transferred from flash to the host, so the conversion is done concurrently with the bus transfer.

Introduction
Object deserialization is the process of creating data structures suitable for applications. On average, it can consume 64% of an application's total execution time when the traditional deserialization process is used [1]. It typically takes three steps: (1) raw data is read out of the storage device and buffered in the host memory; (2) the host CPU transforms the raw data into binaries; (3) the application computation executes using the binary results of object deserialization. This CPU-centric approach is inefficient for several reasons. First of all, step 2 cannot take full advantage of modern CPUs, because the scanning access of a large amount of data has poor data locality, making the deep cache hierarchy useless. Secondly, it suffers from considerable overhead in the host system because of frequent context switching caused by the significant number of storage I/Os [2]. This paper presents a hardware approach to providing wire-speed object deserialization, referred to as HODS, hardware object deserialization in SSD storage.
We have designed and implemented an FPGA module that is augmented in an SSD controller. The rest of this paper is organized as follows: Section II describes the motivation for hardware deserialization and the corresponding performance issues. Section III provides the detailed design of the FPGA object deserialization module, including the hardware PIS storage architecture, the FPGA object deserialization module, and the host programming API. Section IV describes the experimental prototype implementation. Section V reports performance results. We conclude the paper in Section VI.

Motivation of hardware deserialization
Most non-database applications such as scientific data analytics, 3D modeling, or spreadsheet applications use interchangeable data formats such as ASCII code.
Such serialized memory objects make it easy to collect, exchange, transmit, or store data [2], because text-based encodings (e.g., CSV [6], txt) allow machines with different architectures (e.g., little endian vs. big endian) to exchange data with each other. They do not require users to understand the memory layout of machines, and it is often easy to manage text-based encoding files without special editing tools. Figure 1 shows an example of a standard ASCII file chunk. Meaningful ASCII strings are stored between special characters such as space, line-feed, and comma.
Before any computation can be done on the data, such text-based encoding strings must be converted into machine binaries readable by applications [7]. To understand how such data conversion affects the overall application performance, we ran a set of benchmarks on a Lenovo server with a quad-core Intel i7-4770 CPU. The benchmark datasets are stored in an Intel 750 series NVM-e SSD. In this experiment, each benchmark application reads the data file from the SSD, converts the file from text to binary in the system RAM, and then processes the data. Figure 2 shows the breakdown of the execution time of the benchmark applications [7,19,20]. It can be seen from the figure that object deserialization (data conversion) takes a significant proportion of the total execution time of applications, ranging from 32% to 85%.
To minimize the overhead on the host CPU, object deserialization in PIS has been proposed in flash memory SSDs [1]. Figure 3 illustrates the general data flow inside current PIS storage [8∼16]. First, the SSD controller loads data from flash to D-Cache using DMA (step 1); next, the embedded processors (such as ARM cores) fetch data from D-Cache and execute PIS functions (step 2); after that, the embedded cores write the results back to D-Cache, from where another DMA delivers them to the host (step 3). In a conventional PCI-e or NVM-e SSD, storage data can move directly from flash to host main memory in one DMA operation [1,9,10]. This PIS architecture breaks the single DMA data movement into two sub DMA operations, one into D-Cache and the other out of D-Cache. This modification blocks the I/O path and slows down the storage read speed [17,18].
PIS using embedded cores in SSD is not efficient enough. To verify the actual efficiency of using embedded cores for object deserialization, we experimented with an ARM Cortex-A9 processor with two different clock settings. As shown in Figure 4, a single ARM core at an 877MHz clock can provide only 42∼53MB/s of throughput on both integer and floating-point benchmarks [19,20], far below the wire speed of the storage bus. Figure 5 depicts the time slices of the FPGA-based HODS design. Compared to the previous architecture in Figure 3, there is no superfluous memory access to store and fetch intermediate results [23,24]. We build a direct I/O path from storage to host main memory, and PIS is done concurrently with data transfer on the bus.

Hardware Deserialization SSD Architecture
In the following paragraphs, we will describe the system architecture, the hardware object deserialization module, and the host driver program in detail. As shown in Figure 6, the deserialization module sits on the internal data bus, which is a bridge for data movement among the host, flash and DDR3. As the flash controller processors, three embedded cores are responsible for the standard storage control workflow. They do not get involved in PIS processing, but only direct the storage data flow to go through the FPGA object deserialization module.

FPGA object deserialization module
To extract meaningful data structures from ASCII files, we designed and implemented a hardware deserialization module. As shown in Figure 7, the hardware object deserialization module is a four-stage pipeline. The first pipeline stage searches for special characters such as space, line-feed, and comma along an n-byte parallel data stream. The special characters' location information is passed down to the shingle length detectors of the next stage, which track m×n shingles, where m is an integer such that m×n is the maximum object length we can detect between every two special characters. The output of pipeline stage one indicates which shingle's first byte hits a special character such as space, line-feed or comma. If a shingle's first byte hits a special character, or the current shingle is the start shingle of a data file, its length detector is enabled to search for the next nearest special character. Otherwise, the corresponding shingle length detector is disabled.
Because every first byte of each shingle is used to enable/disable a shingle length detector, m×n-1 comparators are required to work in parallel on the remaining bytes of the shingle content. All comparators' results are assembled by a lowest-address arbiter to find the object length from the start byte of the shingle. If a shingle's first byte is not a special character, the object length detector disables the current shingle output. The m×n-1 comparators also detect the location of the decimal point. According to the binary values of the shingle content, the object length detector identifies the type of the shingle data (integer, floating point or ASCII string) and passes the shingle type down to the object converter along with the object length, shingle content and decimal location.
Object converter: n shingle converters work in parallel, and each of them handles three types of shingle data: a floating-point shingle is converted to floating-point data by the FPU [27]; an ASCII string shingle bypasses conversion; an integer shingle goes to the multiplexer matrix. As shown in the middle part of Figure 8, each integer shingle converter is composed of a multiplexer matrix. Each column of multiplexers shares the same weight of multiplicand, such as times one thousand or times one hundred.
The object length drives a MUX that chooses a row of the multiplexer matrix; the selected row assigns each digit its decimal weight, and the weighted digits are summed into the binary result. The converter keeps up with the bus transfer rate, which guarantees that all the converted results can be flushed into the NVM-e interface without halt.
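For illustration, a software model of the integer conversion (our sketch; the hardware evaluates every digit-weight product in parallel inside the multiplexer matrix rather than looping):

    #include <cstdint>
    #include <cstdio>

    // Software model of the integer shingle converter: digit i of an
    // object of length L carries the weight 10^(L-1-i). The hardware
    // computes all products concurrently; this loop serializes them.
    uint32_t convert_integer(const char *shingle, int object_length) {
        static const uint32_t weight[8] = {1, 10, 100, 1000, 10000,
                                           100000, 1000000, 10000000};
        uint32_t result = 0;
        for (int i = 0; i < object_length; ++i)
            result += (shingle[i] - '0') * weight[object_length - 1 - i];
        return result;
    }

    int main() {
        printf("%u\n", convert_integer("8765", 4));  // prints 8765
        return 0;
    }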

Host Driver Program
To allow an application to use the hardware object deserialization module, we have developed a programming framework including libraries and NVM-e driver modifications using C/C++ programming languages. This section will briefly introduce our driver program and show how our driver interacts with hardware object deserialization module.
On the driver side, NVM-e is a scalable host controller interface developed specifically for accessing non-volatile memory attached via the PCI-e bus. It supports parallel operation with up to 64K commands within a single I/O queue to the device. NVM-e encodes commands into 64-byte packets and uses a one-byte command opcode [4]. We modified the original host NVM-e module by adding a one-bit flag to the NVM-e read command; other commands remain unchanged. The newly added flag bit is a switch that determines whether the storage internal data flow bypasses or goes through the object deserialization module. If the flag bit is not set, the SSD controller initiates a DMA to move data from flash to host main memory. Otherwise, the SSD controller directs flash data through the hardware deserialization module and sends the results to the NVM-e interface. Our NVM-e driver does not touch the original submission and completion queue strategy, and the modification effort is minimal.
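A sketch of how the driver-side flag might look (our illustration: the struct layout follows the 64-byte NVM-e command format described above, but the HODS flag bit position is an assumption, not part of the NVM-e specification):

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Hypothetical encoding of the HODS switch inside an NVM-e read command.
    constexpr uint32_t NVME_CMD_READ = 0x02;       // NVM command set read opcode
    constexpr uint32_t HODS_FLAG     = 1u << 13;   // assumed reserved bit

    struct NvmeCommand {
        uint32_t cdw0;        // opcode in bits [7:0]
        uint32_t dwords[15];  // rest of the 64-byte command packet
    };

    NvmeCommand make_read(bool deserialize) {
        NvmeCommand cmd;
        std::memset(&cmd, 0, sizeof(cmd));
        cmd.cdw0 = NVME_CMD_READ;
        if (deserialize)
            cmd.cdw0 |= HODS_FLAG;  // route flash data through the HODS module
        return cmd;
    }

    int main() {
        NvmeCommand c = make_read(true);
        std::printf("cdw0 = 0x%08x\n", c.cdw0);  // 0x00002002
        return 0;
    }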
In the host application, the original C/C++ object deserialization functions such as fscanf or sscanf are replaced by our application function HODS_scanf. HODS converts all variable-sized ASCII strings into fixed-size binaries and stores these binaries sequentially in the host main memory. Our application function HODS_scanf then sequentially accesses the host main memory to fetch the results directly, which substantially offloads the host CPU's workload.
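A usage sketch from the application's point of view (the exact HODS_scanf signature is our assumption based on the fscanf-style interface described above; a stub stands in for the real library so the sketch is self-contained):

    #include <cstdio>

    // The real HODS_scanf comes from the HODS host library; this stub only
    // makes the sketch compile. The real call copies fixed-size binaries
    // sequentially out of the result buffer in host memory.
    int HODS_scanf(void* stream, const char* fmt, ...) {
        (void)stream; (void)fmt;
        return 0;
    }

    int main() {
        void* stream = nullptr;  // would come from the HODS open call
        float x, y, z;
        // Drop-in replacement for fscanf(f, "%f %f %f", &x, &y, &z): the SSD
        // already converted the ASCII text in-flight, so this only fetches
        // fixed-size results from host memory.
        if (HODS_scanf(stream, "%f %f %f", &x, &y, &z) == 3)
            std::printf("%f %f %f\n", x, y, z);
        return 0;
    }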

Experimental Methodology
We have built an NVM-e SSD prototype that supports hardware object deserialization and carried out performance evaluation using several standard benchmarks. This section discusses the prototype setup and benchmark selection.

Experimental platform
The experimental platform uses a Lenovo server with a quad-core Intel i7-4770 running at 3.4 GHz. The system DRAM size is 32 Gbyte. The host runs Linux Ubuntu 16.04 with kernel version 4.4. Our prototype NVM-e SSD card plugs into the host server through a PCI-e Gen3 ×4 interconnect.
We use a Xilinx Ultra-scale VU9P as the flash controller chip on the prototype storage card [5]. All storage logic fits into a single FPGA chip, including the embedded processors, DRAM/flash controller logic, NVM-e module, DMA/cache engine and the hardware deserialization function. The prototype card contains 8Gbyte of DDR3 and 1TB of flash memory. To evaluate HODS, we store the benchmark datasets on the 1TB flash before the host starts the applications. The following paragraphs describe the benchmarks used in this paper.

Benchmarks
We selected benchmarks from BigDataBench [20], JASPA [7] and Rodinia [19] with the following criteria: (1) The input data of the application are text files. (2) Large and meaningful input data can be generated from benchmark tools for our evaluation. (3) The application contains many floating-point values, so that we can evaluate our prototype comprehensively. (4) The application is open source in C and compatible with our prototype. Benchmark applications may apply MPI [25] or OpenMP [26] to parallelize host computation. Some applications provide data generators, such as LU-decomposition (LUD), Breadth-First Search (BFS), K-means and B-tree. Other datasets are generated by duplicating benchmark input data. We also provide a 3D plot application to demonstrate the user experience of HODS as compared to existing systems [5]. All benchmark program codes are written in C/C++, and we use Verilog to generate the RTL for the FPGA.

Evaluation results
For the purpose of comparative analysis, we consider the baseline as running applications on the server machine with HODS disabled. Using the same server machine, we enable HODS and run the same set of applications to evaluate performance.
Figure 9. Normalized size variation after hardware deserialization.
1.6.1 Transfer size variation
Figure 9 shows the data size changes after FPGA object deserialization. The transfer size shrinks because a text-based encoding usually requires more bytes than the binary representation. For example, the ASCII string "87654321" requires 8 bytes to represent a single object value, but only 4 bytes in binary. The longer the object is, the smaller the converted data size will be. We also eliminate special characters such as space, line-feed and comma, which are unneeded by the benchmark applications. Taking the PageRank application as an example, a 600K IOPS SSD with HODS would perform the same as a 1 million IOPS SSD without HODS. Figure 10 plots the data conversion throughput. The HODS accelerator achieves 935MB/s∼1.13GB/s object deserialization throughput at a 100MHz FPGA clock, while the host CPU achieves 58MB/s∼93MB/s at a 3.5GHz clock speed.

Throughput speedup
For integer benchmarks such as PageRank, memplus, B-tree and BFS, we observed 8∼12× speedup. These performance gains can be attributed to two facts. First, the FPGA pipeline converts objects at wire speed while the data is in flight on the bus, instead of consuming host CPU cycles afterwards. Second, the converted binaries are smaller than their text encodings, which reduces the storage traffic overhead.
Because the host CPU takes a much longer time to convert floating-point numbers from ASCII code, HODS' speedup is even higher for floating-point benchmarks such as LU-decomposition, 3D-cat and line plot. Furthermore, the data sizes of floating-point benchmarks are also reduced by 51%∼60%, giving rise to more speedup.
Figure 11. Normalized hardware deserialization speedup.
From our experiments, we observed speedups between 17× and 21×. For both integer and floating-point object deserialization, HODS runs faster than the existing state of the art [1], which reported 1.66× on the same benchmarks.

Speedup of Application Execution Time
The overall speedup of an application depends on the fraction of data conversion time within the benchmark application's total running time; our work focuses on object deserialization itself. If a benchmark application is computation intensive, data conversion becomes a small fraction of the total time, and the overall performance improvement is limited. Figure 11 plots HODS' speedup for whole applications. Benchmarks such as RDB and memplus gain only 10% to 30% because they contain matrix multiplication, which is computation intensive. The other benchmark applications showed 2.4∼4.3× speedup. The current benchmark applications apply the MPI or OpenMP parallel model with quad cores. We expect higher speedup when using more cores or GPUs, which run the computation part in parallel but can hardly do anything about the data conversion part. Quantitative investigation of such parallel computer architectures is beyond the scope of this paper.
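This behavior follows a simple Amdahl-style bound (our formulation, stated here for clarity; f and s are not symbols used elsewhere in this paper): if object deserialization accounts for a fraction f of the baseline execution time and is accelerated by a factor s, then

    overall speedup = 1 / ((1 - f) + f / s)

For example, with f = 0.6 and the conversion term f/s made negligible by hardware conversion at wire speed, the bound approaches 1/0.4 = 2.5×, consistent with the 2.4∼4.3× range observed above.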

Conclusion
This paper presents hardware object deserialization in SSD (HODS), which offloads data-intensive computation to the storage where data is stored. Compared to the existing state of the art [1], HODS eliminates the SSD controller's overhead and buffer limitations. It processes storage data at wire speed and does not interfere with the SSD controller's firmware resources. Our host driver program provides a user-friendly application interface to replace the fscanf or sscanf functions in C/C++, Matlab, Python or any other programming language.

Introduction
Processing of graph-structured data has become increasingly important and has moved to the forefront of computational challenges. Graphs with up to billions of vertices and trillions of edges are commonplace in today's big data era [1]. Minimum Spanning Tree (MST) is a fundamental problem in graph processing: computing a subset of a graph's edges that connects all vertices with the minimum total edge weight. It is pervasive throughout science, broadly appearing in fields such as social network analysis. Subramanian et al. [4] presented FRACTAL to speed up MST using multi-cores.
Their work is based on a cycle-accurate, event-driven simulator to model a parallel system with 256 cores [7]. To avoid data dependencies in MST, they modified the task scheduler and used timestamps to determine which tasks execute with high priority. Their simulation shows a 40× speedup when configured with 256 cores.
Manoochehri et al. [6] proposed an MST implementation on GPUs. Sorting data in the host main memory is computation intensive and consumes enormous CPU resources. We observed in our experiments that edge sorting takes a significant portion of total MST execution time, ranging from 36% to 75%. We therefore believe that there is great potential for further performance improvement by leveraging the intelligence available inside the SSD where the huge amount of graph edges is stored.
In this paper, we present a new approach to the MST computation by means of CISC (Coordinating Intelligent SSD and CPU). The idea is to exploit the controller logic inside the SSD to preprocess graph edges while being loaded to the main memory of the host. CISC divides the large amount of graph edges into chunks and sorts each chunk of edges in order using hardware. In this way, the edges loaded into the internal memory of the host consist of multiple sorted chunks.
To allow the software MST to use sorted chunks efficiently, we developed two host software programs, for serial and parallel MST respectively. The serial MST forms a B-tree holding the smallest edge of each chunk and merges the smaller edges into the MST with high priority. For the multicore system, we optimized the classical sample sort algorithm [8] for parallel MST, so that the remaining computation can be effectively parallelized on multicores. Such an efficient data distribution in the host main memory ensures that smaller-weight edges can be processed very efficiently.
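The serial idea can be sketched compactly in software (our code; a binary min-heap stands in for the B-tree described above, and Kruskal-style union-find does the merging):

    #include <queue>
    #include <vector>
    #include <numeric>
    #include <functional>
    #include <cstdio>

    struct Edge { int u, v, w; };

    // Union-find for cycle detection in Kruskal's algorithm.
    struct DSU {
        std::vector<int> p;
        explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
        int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
        bool unite(int a, int b) {
            a = find(a); b = find(b);
            if (a == b) return false;
            p[a] = b; return true;
        }
    };

    // Serial CISC sketch: the SSD delivers `chunks` already sorted by
    // weight; a heap over the chunk heads always yields the globally
    // smallest remaining edge, so no full sort happens on the host.
    std::vector<Edge> cisc_mst(const std::vector<std::vector<Edge>>& chunks, int n) {
        using Head = std::pair<int, std::pair<int, int>>;  // (weight, (chunk, pos))
        std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
        for (int c = 0; c < (int)chunks.size(); ++c)
            if (!chunks[c].empty()) heap.push({chunks[c][0].w, {c, 0}});
        DSU dsu(n);
        std::vector<Edge> mst;
        while (!heap.empty() && (int)mst.size() < n - 1) {
            auto [w, loc] = heap.top(); heap.pop();
            auto [c, i] = loc;
            const Edge& e = chunks[c][i];
            if (dsu.unite(e.u, e.v)) mst.push_back(e);
            if (i + 1 < (int)chunks[c].size())
                heap.push({chunks[c][i + 1].w, {c, i + 1}});
        }
        return mst;
    }

    int main() {
        std::vector<std::vector<Edge>> chunks = {
            {{0,1,1},{1,2,4}}, {{2,3,2},{0,2,3}} };  // two sorted chunks
        for (const Edge& e : cisc_mst(chunks, 4))
            std::printf("%d-%d (%d)\n", e.u, e.v, e.w);
        return 0;
    }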
To demonstrate the feasibility and performance potential of CISC, we have implemented CISC using FPGA inside an SSD. A working prototype has been built that consists of both software running on the host and hardware circuit inside SSD. Using the CISC prototype, we run standard graph benchmarks to measure performances. Experimental results show that CISC outperforms pure Software MST substantially.
This paper makes the following contributions:
• A pipeline structure for an FPGA sort module has been presented that can provide wire-speed hardware sort of multiple edge chunks. We have designed and implemented the FPGA module alongside the I/O bus inside a PCI-e SSD, realizing true processing in storage (PIS) for graph processing. It is also extensible to other sort-based software applications.
• A B-tree based selection algorithm and an optimized sample sort algorithm have been proposed that run on single-core and multicore systems, respectively. CISC coordinates the chunk sorting inside the SSD and the selection/merging of minimum-weight edges on CPU cores efficiently. The software and hardware co-design framework is the first of its kind for graph processing.
• A working CISC prototype has been built that works as expected. The prototype has been used to carry out extensive experiments for performance measurements. Our experimental results demonstrated the superb performance and effectiveness of CISC for MST over existing approaches.
The rest of this paper is organized as follows: In section II, we discuss the related work. Section III provides detailed design for in-storage sort module. Section IV describes the two MST software modules of CISC that run on single-core and multicores, respectively. Section V presents experimental results and discussions.
Section VI concludes the paper.

As noted above, the fraction of total MST execution time spent for edge sorting ranges from 36% to 75%. In addition to execution time, edge sorting consumes computation resources that could otherwise be used for other computation tasks. Examining the experimental results, we believe such expensive edge preprocessing can be offloaded to the data storage device where the large amount of edges is stored.

Previous Work on Near-Data Processing
In many computer systems for data mining, big data, and databases, data movement becomes a bottleneck that causes performance degradation and wasted power [34]. Data processing is swiftly moving from computing-centric to data-centric. Inspired by these trends, the concept of NDP (Near-Data Processing) [10] has recently attracted considerable interest: placing the processing power near the data, rather than shipping the data to the processor. The NDP computation might execute in memory or in the storage device where the input data reside [11], and it can be divided into two main categories: PIM and PIS.
PIM aims at performing computation inside main memory. Various PIM approaches have been proposed since the pioneering work by Gokhale et al. [12].
Ahn et al. [16] proposed a scalable PIM architecture for graph processing with five workloads including average teenage follower, conductance, PageRank, single-source shortest path, and vertex cover. They verified the graph processing performance by simulation.
PIS aims at performing computation inside the storage device. Early PIS approaches include the Active Disks architecture proposed by Acharya et al. [17], which performs scan, select, and image conversion inside the storage system and provides a potential reduction of the data movement between disk and CPU. Patterson et al. [18] proposed an architecture (IDISK) which integrates embedded processors into the disk and pushes computation closer to the data. Their results suggest that a PIS-based architecture can be significantly faster than a high-end symmetric multiprocessing (SMP) based server. Choi et al. [19] implemented algorithms for linear regression, k-means, and string matching in the flash memory controller (FMC). BlueDBM builds a flash appliance with in-storage accelerators for big data analytics, and Morpheus [33] frees up scarce CPU resources by using the embedded processor inside the SSD to carry out object deserialization. Recently, Biscuit [21], equipped with FMCs, runs pattern-matching logic in storage, which speeds up MySQL requests.
Lee et al. [35] proposed ExtraV, a framework for near-storage graph processing covering workloads such as average teenage follower, PageRank, breadth-first search and connected components. It efficiently utilizes a hardware accelerator at the storage side to achieve performance and flexibility at the same time.
Our focus in this paper is on speeding up graph processing, which has become increasingly important in today's big data era. As will be evidenced shortly, the benefit is great to preprocess a huge amount of graph data inside the SSD where the data is stored. As shown in Figure 13, PIS augments a special functional logic to perform the desired function inside a storage device, in this case, an SSD. All the storage control functions remain unchanged. The augmented logic provides a sort function that is activated by an NVM-e command and is performed while data is being read from the storage to the host. In order to eliminate off-chip memory accesses in the FPGA sort, CISC takes a divide-and-conquer approach: instead of sorting the entire edge list, which is huge, we divide the edges into chunks and sort each chunk in hardware. The in-storage sort pipeline is composed of linear-time sorters [25] and several stages of FIFO mergers [22][24]. We designed this architecture especially for in-storage graph processing with minimal PIS latency and hardware cost.

In-storage sort module
As the first stage of the pipeline, the linear-time sorter uses n buffers to hold sorted graph edges. It compares each incoming edge's weight in parallel with all the already sorted edges in the buffers and inserts the new graph edge into the appropriate location in the buffers to maintain the existing sorted order [25]. Each FIFO merger stage doubles the segment size of the previous pipeline stage [24]. For example, the sorted segment size grows from 4 to 16 when the data stream passes through two stages of the FIFO mergers. As shown in Figure 15, each FIFO merger emits data in sorted order to the next pipeline stage; that is, we always pick the smaller datum at the heads of the two FIFOs to be flushed to the next stage [24]. In this way, the current stage merges two segments of the previous stage and doubles the sort size.
The sort size of the last stage is the chunk size, which depends on the FPGA's internal resources (the number of FIFO merger stages). After passing through the in-storage sort module, the graph edges are loaded into the host main memory in the form of multiple sorted chunks.
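One merger stage can be modeled in a few lines of software (our sketch; the RTL streams one element per clock instead of buffering whole vectors):

    #include <deque>
    #include <vector>
    #include <cstdio>

    // Model of one FIFO merger stage: two FIFOs each hold a sorted segment
    // of length `seg`; the stage repeatedly flushes the smaller head, so
    // the output consists of sorted segments of length 2*seg.
    std::vector<int> fifo_merge_stage(const std::vector<int>& in, int seg) {
        std::vector<int> out;
        for (size_t base = 0; base + 2 * seg <= in.size(); base += 2 * seg) {
            std::deque<int> a(in.begin() + base, in.begin() + base + seg);
            std::deque<int> b(in.begin() + base + seg, in.begin() + base + 2 * seg);
            while (!a.empty() || !b.empty()) {
                if (b.empty() || (!a.empty() && a.front() <= b.front())) {
                    out.push_back(a.front()); a.pop_front();
                } else {
                    out.push_back(b.front()); b.pop_front();
                }
            }
        }
        return out;
    }

    int main() {
        // Two sorted segments of length 4, as produced by the previous stage
        std::vector<int> v = {2, 5, 7, 9, 1, 3, 8, 10};
        for (int x : fifo_merge_stage(v, 4)) std::printf("%d ", x);
        std::printf("\n");  // 1 2 3 5 7 8 9 10
        return 0;
    }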
The startup time of such a pipeline of FIFO mergers is determined by the data transfer delay of the last FIFO merger stage [24], which equals the transfer time of the first chunk of the graph data. Therefore, the PIS latency is only the pipeline's startup time when the host server reads the first of a large number of sorted chunks from the storage.

Software Design of CISC
To allow the MST application to use the in-storage sort module, we developed two CISC software modules running on the host: one for a single-core CPU and the other for parallel MST running on multicores. The following paragraphs describe the software design of CISC.
Figure 16. MST software of CISC on the single-core system.

Serial CISC software
As shown in Figure 16, the serial CISC software maintains a B-tree that holds the smallest unconsumed edge of every sorted chunk. The root of the tree is always the minimum-weight edge among all the chunks: it is removed and merged into the MST first, and the next edge from the same chunk is then inserted into the tree. In this way, the single core always consumes the globally smallest edge without ever sorting the whole edge list.

Parallel CISC software
In order to speed up MST in the multicore system, we optimized the classical sample sort algorithm [8] for parallel MST. The concept of the sample sort is to divide the dataset into segments, where the data values within each segment fall into a range and the ranges among segments are non-overlapping. CPU cores sort these segments in parallel and complete the sample sort after combining all of the sorted segments. However, in most cases, the unsorted data does not follow the above segments' data distribution. The sample sort algorithm needs to reshuffle the dataset by selecting samples and partitioning segments. Figure 17 shows a sample sort example of n_total = 24 sorting elements with p = 3 parallel tasks. There are four major steps in the sample sort algorithm: (1) Local sort: multiple tasks divide the n_total elements into p chunks of size n_total/p each and sort these chunks in parallel.
(2) Select & sort samples: the sample sort algorithm chooses m=2 samples evenly from each sorted chunk and then sorts all the selected samples, m×p in total.
(3) Segment partition: from the above m×p samples, the sample sort algorithm evenly selects p-1 samples as splitters. These splitters partition the dataset into p segments with non-overlapping ranges. (4) Segment reorganization: each parallel task gathers the elements falling within its segment's range and produces that segment in sorted order. The sample sort algorithm is suitable for the multicore system because the local sort (step 1) and the segment reorganization (step 4) can be executed in parallel. However, each parallel task still sorts a large number of graph edges, which is computation intensive and time-consuming. It also has a synchronization problem among multiple tasks, because the sample sort waits for all the parallel tasks to finish before the next step of processing.
The parallel CISC software optimizes the sample sort algorithm by skipping the local sort (step 1). The in-storage sort circuit divides the large amount of graph edges into chunks of size n_total/p each and sorts each chunk of edges in hardware. As shown in Figure 17, CISC thus provides an efficient data distribution for the rest of the sample sort's steps and avoids the local sort of parallel tasks in the host main memory.
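The optimization can be sketched in software as follows (our code, sequential for clarity; each segment's scan would run as one parallel task, and the sample positions are illustrative):

    #include <algorithm>
    #include <climits>
    #include <vector>
    #include <cstdio>

    // Sketch of the optimized sample sort used by parallel CISC. The SSD
    // already delivers p sorted chunks, so step (1), the local sort, is
    // skipped; only splitter selection and segment reorganization remain.
    std::vector<int> cisc_sample_sort(const std::vector<std::vector<int>>& chunks) {
        int p = (int)chunks.size();
        const int m = 2;                      // samples per chunk, as in the text
        std::vector<int> samples;
        for (const auto& c : chunks)          // step 2: select & sort samples
            for (int j = 1; j <= m; ++j)
                samples.push_back(c[j * c.size() / (m + 1)]);
        std::sort(samples.begin(), samples.end());

        std::vector<int> splitters;           // step 3: p-1 evenly spaced splitters
        for (int i = 1; i < p; ++i)
            splitters.push_back(samples[i * samples.size() / p]);

        std::vector<int> out;                 // step 4: segment reorganization
        for (int s = 0; s < p; ++s) {
            int lo = (s == 0) ? INT_MIN : splitters[s - 1];
            int hi = (s == p - 1) ? INT_MAX : splitters[s];
            std::vector<int> seg;
            for (const auto& c : chunks)      // chunks are sorted: a range scan
                for (int x : c)
                    if (x > lo && x <= hi) seg.push_back(x);
            std::sort(seg.begin(), seg.end());  // cheap: segments are small
            out.insert(out.end(), seg.begin(), seg.end());
        }
        return out;
    }

    int main() {
        std::vector<std::vector<int>> chunks = {
            {1, 4, 9, 12}, {2, 3, 10, 11}, {5, 6, 7, 8}};  // sorted by the SSD
        for (int x : cisc_sample_sort(chunks)) std::printf("%d ", x);
        std::printf("\n");  // 1 2 3 ... 12
        return 0;
    }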
In the parallel CISC software of MST, we did not change the original design of the graph merge. Following the benchmark baseline [9], the parallel MST starts to merge after the graph sort (sample sort) is completed. It merges sorted edges into graph subsets with multiple tasks and grows several sub-trees in parallel. The parallel MST computation finishes when all the sub-trees join together and the MST traverses all the graph vertices. As will be shown later in our experiments, CISC offers an overall speedup of MST due to the optimized sample sort.

Evaluation
In order to evaluate how CISC performs in comparison with traditional approaches, we have built an NVM-e SSD prototype that implements CISC. The hardware chunk sort module is augmented inside the FPGA controller of the PCI-e SSD. The PCI-e SSD card is inserted into a multi-core server to carry out a performance evaluation of CISC. This section discusses the prototype setup and evaluation results.

Experimental Platform and Benchmark Selection
We set up the experimental environment on an Intel Xeon server with 96 cores. It runs at 2.5 GHz and hosts a Linux system with kernel version 4.14. The system provides a PCI-e Gen3 ×4 interface that connects our CISC storage and other peripherals.
We built our CISC prototype on top of the Open-SSD platform [27]. All storage logic fits into a Xilinx Zynq-7000 series FPGA, including a dual-core ARM processor, DRAM/flash controller logic, the NVM-e interface and CISC's in-storage sort module. The ARM processor runs at a 1GHz clock speed, and the platform contains 1GB DDR2 and 256GB flash memory. To evaluate CISC, we store the MST benchmark files on the SSD before the host starts the MST application. The in-storage sort module is set to sort 128K edges per chunk.
Three benchmark datasets are chosen from [28∼30], covering transportation, Internet data analysis and graph mining, as listed in Table 2. The PBBS benchmark [9] source code is used in our design as the baseline to evaluate the performance difference between CISC and the traditional software. We composed the CISC software to replace the sample sort and the serial MST in the baseline. The parallel software code uses OpenMP configured for multicores.

Numerical Results and Discussions
Since edge sort is the main part where CISC offers performance advantages for MST computation, we first carried out experiments to measure the execution times of edge sort using CISC and the traditional software approach. The speedup of parallel software sort increases with the number of cores on the host server. Compared with a single core, the speedup increases to 22∼27× as the number of cores increases to 96, as illustrated by the blue line plots in Figure 18. For the same number of cores, our parallel CISC outperforms the traditional software sort: for all the benchmarks considered, we observed a 2∼2.81× speedup compared to the traditional software sort with the same number of cores.
Figure 18. The sort speedup of CISC; the baseline is serial software sort running on a single core.
These speedups can mainly be attributed to the elimination of the parallel local sort tasks and to partially offloading the computation from the multi-cores to the SSD. As shown in Figure 18, the parallel CISC sort on 96 cores shows 55× to 62× speedup compared to the traditional software sort on a single core.
The overall speedup of the MST application depends on the fraction of sort time over total execution time. For a comparative analysis, we consider the baseline as running MST on a single core with the in-storage sort module disabled. Figure 19 shows the measured results for the benchmarks considered. We observed speedups of 2.2∼2.7× on a single core and a 1.3× speedup on multicores on average. The speedup ratio on a single core is more significant than on multicores because of the difference in the time fraction of edge sorting: the larger the fraction of time the graph sort takes, the more speedup CISC can obtain. As shown in Figure 12, the sort execution time on a single core consumes 65% to 75% of the overall MST execution, while parallel MST spends 31% to 46% of its execution time on the graph sort. Thus, the speedup ratio of multicores' MST is less significant than for a single core.
The speedup of parallel MST increases when using more CPU resources of the host server. As shown in Figure 19, CISC always runs faster than the traditional software with the same number of cores. It outperforms purely multicore systems because CISC obtains performance gains from both the multicores and the in-storage sort. Compared to a single-core MST baseline, CISC outperforms the traditional software by 11.47 to 17.2 times on 96-core systems.
Figure 19. The MST speedup of CISC; the baseline is serial MST running on a single core.

Hardware Cost Analysis
CISC partially offloads the expensive computation from the host server to the SSD. The additional hardware cost of implementing CISC inside an SSD controller includes logic cells, LUTs, flip-flops, and RAM.

Introduction
Identifying and reducing redundancies in data storage and transmission has become increasingly important [1]. One of the common techniques used in locating redundant data is comparing sketches of data chunks to find duplication or similarity. A sketch typically consists of a few fingerprints representing a data chunk [2]. The Rabin fingerprint has proved to be very effective and is widely used in forming such a sketch [2]. To derive a sketch, a data chunk is scanned shingle by shingle, with a fixed-size window (e.g. 8 bytes long) that shifts forward one byte every step. A Rabin fingerprint is calculated for each shingle. A random sampling technique, such as the Minwise theory [3], is then used to select a few among all Rabin fingerprints as a sketch for the data chunk.
Deriving such sketches is computationally intensive. For example, to obtain a sketch of a 4KB data chunk with a shingle size of 8 bytes, 4K-7 Rabin fingerprints need to be calculated, and the sampling process is also time consuming. Existing software programs typically take around 30 microseconds to generate a sketch for each 4KB data chunk on a commercial CPU [4]. For data deduplication in data backup and archive applications, such a delay might be tolerable. However, with today's storage devices approaching gigabytes per second in throughput and sub-millisecond latency [5], this delay is too long for real-time data processing in primary storage and storage caches.
This paper presents a hardware approach to Rabin fingerprint computation and sampling to produce a sketch for a data chunk. By means of effective pipelining and the split fresh technique, our hardware implementation achieves one order of magnitude speedup over the existing software implementation [6]. Moreover, the design consumes 2∼10 times less hardware resources than a comparable configuration of the existing hardware solution [7]. Our design also overcomes the drawback of [7], whose latency increases linearly with the input data size. A working prototype of our new design has been successfully implemented on an FPGA and tested to work properly at clock rates above 300 MHz. The architecture is configurable according to the characteristics of the input data, and a single unit of the design can be replicated to work in parallel to accommodate higher throughput demands.
The paper is organized as follows. Section II provides the background and a preview of the pipeline architecture. Section III explains the overall design as well as the optimizations. Implementation experiences on an FPGA board are shared in Section IV along with its performance evaluation. Section V concludes the paper with future plans.
Figure 20. The diagram of shingles.

Background and architectural overview
An n-bit message m is viewed as f(x) in (1), a degree n-1 polynomial over GF(2). A random polynomial p(x), not necessarily irreducible, is picked over the same field with degree k-1, as in (2). The remainder r(x) of dividing f(x) by p(x) over GF(2), a k-bit number, is returned as the fingerprint of the message m. This process is shown in (3).

    f(x) = m_0*x^(n-1) + m_1*x^(n-2) + ... + m_(n-1)    (1)
    p(x) = c_0*x^(k-1) + c_1*x^(k-2) + ... + c_(k-1)    (2)
    r(x) = f(x) mod p(x)                                (3)

In the formal algebra system, a single modulo operation can be turned into multiple calculations, each of which is responsible for one bit in the result. Such a scheme, normally involving just XORs, is suitable for hardware implementations. We group these bit-wise calculations to form a computational module for Rabin fingerprints, and call it the fresh function. The fresh function can be split into two sub-modules: fresh1 covering the terms (a_0, a_1, ..., a_38) of the fresh, and fresh2 the terms (a_39, a_40, ..., a_63) in the 64-bit example above. Table 4 lists the complexity of the individual split fresh modules, the combination of the two, and that of the original single fresh function, given the polynomial p(x) = x^16 + x^13 + x^12 + x^11 + 1. While the resource consumption may not change much at the end, the clock rate should improve for the case of the split fresh due to more and simpler pipeline stages.

Sampling of Fingerprints
The total number of fingerprints generated for a w-byte data chunk in our application will be w-b+1, where b is the size of the shingles. After all Rabin fingerprints are computed for a block, a number of fingerprints are chosen as a sketch to represent the block. Udi Manber [8] provided two methods to decide which fingerprints to select. One is selecting fingerprints that have their last n bits being all zeros. The other is selecting fingerprints according to some keyword because keywords are in a sense universal and they are selected truly at random.
Broder showed a scheme based on Minwise theory [3]. Following the principle of random sampling, selecting Rabin fingerprints whose upper N bits match a specific pattern presents a fairly good approximation, because these upper bits in each fingerprint can be considered randomly distributed. We choose this scheme for its processing speed and similarity detection qualities [1], as will be discussed in Section III.B.
Figure 22. Design with fingerprint pipeline and signature selection.

Design and optimization
Our design is illustrated in Figure 22 with three major function modules: the Rabin fingerprint pipeline, the channel sampling units, and the final selection logic.

Rabin Fingerprint Pipeline Design
The Rabin fingerprint pipeline in Figure 23 has two split fresh stages followed by seven shift stages. The two fresh modules compute the fingerprint FP_0 for the eight bytes of data from the preceding clock. The seven shift stages then slide the shingle window forward one byte at a time into the bytes of the current clock, each producing one more fingerprint. Compared to a pipeline with a single fresh unit, this design introduces one more cycle of latency to the final result, which is not detrimental to the system performance. If needed for a higher clock rate, the fresh, as well as the shift, can be further split into more stages.

Channel Sampling and Final Selection
During sampling, each computed fingerprint is divided into two parts: index and signature, where the index is a few MSBs and the signature the remaining LSBs. Say the index has m bits; then the signatures can be categorized into 2^m bins. Within a bin, the signatures are selected as candidates for the final sketch. For a channel sampling unit, there can be up to 2^m candidates for the final selection, one per bin, each kept in a channel buffer entry addressed by the index. The comparator, generating the write-enable signal, decides whether the minimum or the maximum value is sampled into the buffer. To avoid RAW hazards, a data-forwarding function is adopted to control which value to compare with the incoming signature. The XNOR gate checks whether the read address and the write address clash. If they do, and the write enable is active at that moment, the current write value is forwarded to the comparator. This forwarding is done by the MUX controlled by the output of the AND gate.
When all signatures are processed and the candidates settle in the channel buffers, the final selection unit activates the index counter to fetch the candidates according to a pre-defined index sequence, such as 0, 1, 3, 5, 7, 11, 13, and 15 in our design. Taking advantage of the concurrently available buffers, and with pipeline registers between the comparators, the final selection in Figure 24(b) conducts a binary-tree reduction over the candidates.
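For illustration, a behavioral model of one channel sampling unit plus the final fetch (our code; the parameters m = 4 and a 16-bit fingerprint are assumptions for the sketch, and the RTL adds forwarding to sustain one signature per clock):

    #include <cstdint>
    #include <vector>
    #include <algorithm>
    #include <cstdio>

    // One channel sampling unit with m = 4 index bits: a 16-bit fingerprint
    // splits into a 4-bit index (MSBs) and a 12-bit signature (LSBs);
    // each of the 2^m bins keeps its minimum signature.
    constexpr int M = 4;
    constexpr uint16_t SIG_MASK = (1u << 12) - 1;

    std::vector<uint16_t> sample_channel(const std::vector<uint16_t>& fps) {
        std::vector<uint16_t> bins(1 << M, SIG_MASK);   // init to max signature
        for (uint16_t fp : fps) {
            unsigned idx = fp >> 12;                    // index = upper 4 bits
            bins[idx] = std::min<uint16_t>(bins[idx], fp & SIG_MASK);
        }
        // Final selection: fetch candidates along the pre-defined sequence.
        static const unsigned seq[8] = {0, 1, 3, 5, 7, 11, 13, 15};
        std::vector<uint16_t> sketch;
        for (unsigned i : seq) sketch.push_back(bins[i]);
        return sketch;
    }

    int main() {
        std::vector<uint16_t> fps = {0x1234, 0x1ABC, 0x0FFF, 0xBEEF, 0xB123};
        for (uint16_t s : sample_channel(fps)) std::printf("%03x ", s);
        std::printf("\n");
        return 0;
    }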

Parallel Pipelines
The pipeline design can be duplicated to accommodate a data bus wider than the defined shingle size. Suppose the input data comes in at 16 bytes per clock, and the shingle size remains 8 bytes. The data can be divided into a low 8-byte half and a high 8-byte half, with the low half going through the lower pipeline in the current clock and the high half going through the upper pipeline in the following clock. In this fashion, each stage produces two fingerprints during a clock cycle.

Implementation and evaluation
Our fingerprint design is a part of a primary storage prototype that is implemented on a Xilinx ML605 board. As seen in Figure 26, a host PC reads from and writes to the storage media via an NVMe interface [9][10].

Hardware Implementation Evaluation
Using the example polynomial, we implemented three designs: a pipeline with a single-stage fresh, one with a two-stage split fresh, and one with eight replicated parallel pipelines. The split fresh design uses fewer LUTs and slightly more registers compared to the single fresh design. However, the implementation does run at a higher clock rate because the delays are more uniform across all stages in the pipeline. This improvement is consistent with our analysis in Section II.A, and the scheme offers a promise of possibly higher clock speeds.

Software Comparison
We further implemented the software design in [6] on a 2.8 GHz Intel Core i5 processor with 2GB DRAM. The computation utilizes a sliding-window based Rabin fingerprint library to process the same sets of data used in our hardware experiments. We constrained the program to run on one core only and compared the results with those of our hardware module implemented on the FPGA. The comparison also shows a ∼5× improvement of the hardware over the software solution.

Conclusion and future works
The proposed hardware approach for fingerprinting large data objects can operate at wire speed. The major techniques include fresh/shift pipelining, the split fresh optimization, online channel sampling, and pipelined final selection. Demonstrated on an FPGA using the Rabin fingerprint, the whole computation adds just a few clocks of latency to the data stream. The measured throughput satisfies the requirements of primary storage. The architecture is extensible to other types of CRC and fingerprint computations, and can be adapted to large shingle sizes and wide data buses.
Further optimization can be achieved by streamlining the final selection to reduce latency; alternatively, by shingling more than one byte and interleaving the shingled bytes, we should be able to make the single pipeline itself a parallel one.

Abstract
Deduplication has proven essential in backup storage systems, where a large amount of identical or similar data chunks exists. Recent studies have shown the great potential of data deduplication in primary storage and storage caches [1]. For such application scenarios, the processing speed for similar data chunks becomes more important to the system's success. This paper presents an FPGA accelerator for similarity based data deduplication. It implements three hardware kernel modules to improve the throughput and latency of a dedupe system: block sketch computation, reference block indexing and similar block delta compression. The accelerator connects to the host system through a PCI-e Gen2 ×4 interface. By means of pipelining and parallel data lookup across multiple hardware modules, our new hardware design is capable of processing multiple data units, say 8 bytes long, in parallel every clock cycle and therefore provides line-speed similar-block dedupe. Our experiments have shown that similarity based data dedupe performs 30% better in data reduction than conventional dedupe techniques that only look at identical blocks. Comparing our hardware implementation with its software counterpart, the experimental results show that our preliminary FPGA implementation, at a clock speed of 250 MHz, provides at least 6 times speedup in throughput or latency over the software implementation running on state-of-the-art servers.

Introduction
Data deduplication has become increasingly important due to explosive data growth in the Internet world. It has been highly successful in enterprise backup environments [2]. Typically, companies execute daily incremental backups and weekly full backups to protect their data. The great amount of duplicate data drives widespread use of deduplication in enterprise backups.
The success of data deduplication in backup systems inspired a large amount of effort in primary storage deduplication. Unlike the backup system, primary deduplication is used in a production environment [3][4], which brings multiple challenges. Firstly, primary storage does not have as much duplicate data as backup systems do. Data sent to primary storage comes from user-level applications, such as databases and MS-Office. The main operations of these kinds of applications are modify, add and delete. These operations generate a lot of similar data blocks as opposed to duplicated blocks, making it more sensible to look at deduplication at the sub-block level. The second challenge is the performance requirement: backup storage deduplication is throughput sensitive, while primary storage is mainly used in production environments and is latency sensitive.
The required response time for each data unit is much shorter than in backup dedupe systems. The last challenge is the limitation of resources. A primary storage deduplication system often shares the production environment's resources, while a backup deduplication system has its own. Taking server resources such as CPU and RAM to perform deduplication may drag down the performance of applications running on the server, which is undesirable.
Files or data blocks are frequently modified and reassembled in different contexts and packages. By deriving the differences between near-duplicate data blocks, delta compression can effectively dedupe data at both the file and block levels. The central task of delta compression is to find the content differences between two data chunks and try to keep only them. Philip Shilane et al. built a delta compression and dedupe storage [2]. The extra deduplication benefit gained owing to delta compression is 1.4 times compared to conventional dedupe techniques. However, the throughput of the system ranges only from 30MB/s to 100MB/s, which is not suitable for primary storage or cache systems that demand close to a gigabyte per second in throughput and sub-millisecond latency.
In order to make similarity based dedupe applicable to primary storages or caches, hardware acceleration should be explored. A hardware implementation not only can offer high speed dedupe, but also offload dedupe functions from servers so that application performance is not negatively affected. In this paper, we present the first hardware design, to our knowledge, for similarity based dedupe for primary storages and storage caches. By means of pipelining and parallel structures, our design provides high throughput and fast response time. The proposed architecture was implemented on a Xilinx Virtex-6 FPGA development board. Three major hardware modules for the dedupe system were fully tested to be functional.
Extensive experiments have been carried out to evaluate their performance and compression ratio as compared to software implementations. Our experimental results show that the hardware implementation provides at least 6 times speedup over its software counterpart, while the compression ratio is comparable. We also show that similarity based dedupe offers a 30% better data reduction ratio than typical dedupe techniques.
This paper makes the following contributions: 1) Design and implementation of hardware solutions for three major modules of similarity based data deduplication: fingerprint computation to derive the sketch of a data block; indexing structure and search logic for finding reference blocks that are used as bases for delta compressions; and hardware delta compression logic.
2) Integration of the hardware modules into software dedupe platform [5]. The integrated system is shown to function correctly and efficiently.
3) Performance evaluations have been carried out using real world data sets.
We conducted extensive experiments to show the achievable speedup and data reduction ratios as compared to existing solutions.
The rest of this paper is organized as follows. In Section 2, the related background work is presented. Section 3 presents our design of the three hardware modules.
The FPGA implementation, the test setup, and the experimental results are detailed and discussed in Section 4. We conclude our paper in Section 5.

Background

4.3.1 Standard dedupe
A typical process of data deduplication involves the following steps. First, it splits files into multiple chunks and generates a fingerprint for each chunk. The fingerprint usually is a strong hash digest of the chunk; if two fingerprints match, their contents are duplicates. When a new incoming chunk's fingerprint matches an existing one in the deduplication system, only the chunk's metadata, such as the file name or LBA, and a reference to the existing content are stored [6].

Similarity based dedupe
It is often the case that data chunks are frequently modified by cutting, inserting, deleting, or updating a part of the content. Though a slightly changed chunk generates a different strong hash and cannot be indexed by standard dedupe, the sketch of the chunk may stay the same if a weaker hash function is used [7]. Such weaker hash sketches typically consist of several Rabin fingerprints and have the property that if two chunks share a same sketch then they have a lot of content in common, i.e., they are likely near-duplicates. Note that we will use the terms "chunk" and "block" interchangeably in this paper to refer to the basic unit of data deduplication.
In similarity based deduplication, a new block searches for a near-duplicate block in a set of reference blocks by comparing their sketches. If a matched sketch is found in the list of reference blocks, a delta compression is performed against the found reference block, and only the delta is stored, with a pointer to the reference block. Therefore, similarity based dedupe requires three key functions: 1) computing the sketch of a block; 2) selecting and storing reference blocks against which the delta compression will be performed once a matched sketch is found; 3) delta compression.

Delta compression
For two near-duplicate files f_old and f_new, delta compression computes a minimal-size f_delta such that f_new can be reconstructed from f_old and f_delta [8]. Delta compression constructs a dictionary of observed sequences and looks for repetitions as it goes. It writes the number of the dictionary entry when a repetition is encountered, and stores the unique token when no match happens. The output thus consists of appropriately labeled unique content of f_new and references to repetitions in f_old.
Though extensive work has been done on hardware compression, none of it was designed specifically for delta compression in dedupe systems [9,10]. Also, current hardware-based delta compression has to compress the data chunk f_new byte by byte: it takes 4K loop iterations to compute f_delta, which may form a performance bottleneck for high-throughput storage systems. I/O buses are usually more than one byte in width, and a compression unit whose latency increases linearly with the input width is not acceptable for modern data storage applications. Inspired by the WK algorithms for compressed caching in virtual memory systems [11], we choose multiple bytes as the token size so that tokens can be processed in parallel hardware for delta compression.
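The token-level scheme can be sketched as follows (our code; a hash map stands in for the hardware dictionary, and the 8-byte token width matches the bus width discussed above):

    #include <cstdint>
    #include <cstring>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Delta encoding with 8-byte tokens: the reference block builds a
    // dictionary from token value to position; the new block then emits
    // either a copy reference (match) or the literal token (miss).
    struct DeltaOp { bool match; uint32_t ref_pos; uint64_t literal; };

    std::vector<DeltaOp> delta_compress(const uint8_t* ref, size_t ref_len,
                                        const uint8_t* blk, size_t blk_len) {
        std::unordered_map<uint64_t, uint32_t> dict;
        for (size_t i = 0; i + 8 <= ref_len; i += 8) {   // reference fed first
            uint64_t tok; std::memcpy(&tok, ref + i, 8);
            dict.emplace(tok, (uint32_t)i);
        }
        std::vector<DeltaOp> delta;
        for (size_t i = 0; i + 8 <= blk_len; i += 8) {   // then the new block
            uint64_t tok; std::memcpy(&tok, blk + i, 8);
            auto it = dict.find(tok);
            if (it != dict.end()) delta.push_back({true, it->second, 0});
            else                  delta.push_back({false, 0, tok});
        }
        return delta;
    }

    int main() {
        uint8_t ref[16] = {1,2,3,4,5,6,7,8, 9,10,11,12,13,14,15,16};
        uint8_t blk[16] = {9,10,11,12,13,14,15,16, 42,42,42,42,42,42,42,42};
        for (const DeltaOp& op : delta_compress(ref, 16, blk, 16)) {
            if (op.match) std::printf("copy@%u ", op.ref_pos);
            else          std::printf("literal ");
        }
        std::printf("\n");  // copy@8 literal
        return 0;
    }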

Design and optimization

4.4.1 Compute Sketches
To derive a sketch, a data chunk is scanned shingle by shingle, with a fixed-size window (e.g. 8 bytes long) that shifts forward one byte every step, as shown in Figure 28. A Rabin fingerprint is calculated for each shingle scanned. In the formal algebra system, the Rabin fingerprint computation can be turned into multiple calculations, each of which is responsible for one bit in the result. Such a scheme, normally involving just XORs, is suitable for hardware implementations [12]. We group these bitwise calculations to form a computational module for Rabin fingerprints, and call it the "fresh" function.
Within the two consecutive shingles shown in Figure 28, the windows overlap in all but one byte, so the second fingerprint can be computed incrementally. As shown in Equation (1), the fingerprint of the second shingle B(x) can be obtained from the fingerprint of the first shingle A(x), the first byte U(x) of the prior shingle, and the last byte W(x) of the current shingle [13]:

    B(x) mod p(x) = (A(x)*x^8 + U(x)*x^64 + W(x)) mod p(x)    (1)

We call this formula the "shift" function, which generally leads to a simpler design than the fresh function and should consume fewer resources when implemented in hardware. Further optimization is possible by directly splitting the fresh function, "split fresh", into multiple sub-functions, and hence multiple stages in the pipeline.
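The shift function of Equation (1) can be rendered in software as follows (our sketch, reusing the example polynomial from the previous chapter; the loop-based multiplications become fixed XOR networks in the RTL):

    #include <cstdint>

    // p(x) = x^16 + x^13 + x^12 + x^11 + 1; low 16 bits below the x^16 term.
    constexpr uint32_t P = (1u << 13) | (1u << 12) | (1u << 11) | 1u;

    uint16_t mulx8(uint16_t r) {          // r(x) * x^8 mod p(x)
        uint32_t v = r;
        for (int i = 0; i < 8; ++i) {
            v <<= 1;
            if (v & (1u << 16)) v ^= (1u << 16) | P;
        }
        return (uint16_t)v;
    }

    // "Shift" of Equation (1): derive the fingerprint of shingle B from
    // that of the previous shingle A, the outgoing byte U and the incoming
    // byte W. All GF(2) additions are XORs.
    uint16_t shift(uint16_t fpA, uint8_t U, uint8_t W) {
        uint16_t t = U;                   // U(x), degree < 8
        for (int i = 0; i < 8; ++i) t = mulx8(t);  // U(x)*x^64 mod p(x)
        return mulx8(fpA) ^ t ^ W;        // (A*x^8 + U*x^64 + W) mod p
    }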
Based on the properties of "fresh", "shift" and "split fresh", we designed a Rabin fingerprint pipeline which provides line-speed sketch computation. As shown in Figure 29, it has two split fresh stages followed by seven shift stages. The two fresh modules compute the fingerprint FP_0 for the eight bytes of data from the preceding clock, and the seven shift stages slide the window byte by byte into the data of the current clock to produce the remaining fingerprints. Each clock takes its turn as the preceding clock, and its data goes through the fresh units during that turn.
A random sampling technique, such as the Minwise theory [14], is then used to select a few among all Rabin fingerprints as the sketch for the data chunk. As shown in Figure 30, the sketch is generated as fingerprints are produced at every pipeline stage and sent rightward to the corresponding channel sampling units.
As the data chunk runs through the pipeline, the fingerprints are sampled on the fly and the final selection assembles the sketch.

Reference block index
After the sketch of each block is calculated, we use the sketch to represent the data block and keep track of the I/O access patterns of all sketches. Based on content locality, i.e., the access frequency and recency of data contents [15], we select and cache the two thousand most popular blocks as reference blocks. These reference blocks and their sketches are stored in a reference list. Every newly generated block sketch is used as a key to search the reference list for a match.
Figure 30. Block diagram of hardware design for sketch computation with fingerprint pipeline and sketch selection.
The new block is then delta compressed against the matched reference block. The compressed delta and a pointer are stored in the primary storage or cache rather than the original 4 KB block.
In our design, we assume that a sketch contains 8 fingerprints, each of which is one byte long. If two data blocks have n matched fingerprints between their respective sketches (n from 4 to 8), we consider them near-duplicate blocks; n is referred to as the similarity threshold. Once such a near-duplicate block is found in the reference index, the corresponding reference block is read out and delta compression against it is performed. Each n-byte permutation of a sketch is hashed into a key using a CRC implementation with a 13-bit polynomial. In the cuckoo hash index table, each record forms a pair composed of a hash key and an index into the reference list. Subsequently, the input sketch is compared with the reference sketch that shares the same n-byte permutation. The cuckoo hash splits the n-byte permutations into multiple tables such that each unique key appears in only a single place at a time.
If none of the n-byte permutations matches a reference sketch, the found flag is cleared; otherwise it is set and a similar block has been found.
Taking advantage of the FPGA's parallel computation, our hardware design for the reference index module searches all C(8, n) candidate paths in parallel at the same time. Once a match is found, the reference index can locate the corresponding entry in the reference list, as illustrated in Figure 32.
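The lookup can be modeled as follows: every n-byte combination of the 8 sketch bytes is hashed with a 13-bit CRC and probed against the index. The CRC polynomial (0x1CF5 here), the threshold value, and the plain dict standing in for the two cuckoo tables are all assumptions, since the excerpt does not specify them.

```python
from itertools import combinations

N = 5   # similarity threshold n (4 to 8); 5 chosen for illustration

def crc13(data, poly=0x1CF5):
    """MSB-first CRC over a 13-bit register; the polynomial is illustrative."""
    reg = 0
    for b in data:
        reg ^= b << 5                     # align the byte to bits 12..5
        for _ in range(8):
            top = reg & 0x1000            # bit about to leave the register
            reg = (reg << 1) & 0x1FFF
            if top:
                reg ^= poly & 0x1FFF
    return reg

def index_keys(sketch):
    """Hash every C(8, n) n-byte combination of the sketch's bytes."""
    return [crc13(bytes(sketch[i] for i in combo))
            for combo in combinations(range(len(sketch)), N)]

def probe(sketch, index):
    """Hardware checks all C(8, n) paths in parallel; modeled sequentially."""
    for key in index_keys(sketch):
        ref = index.get(key)              # key -> index into reference list
        if ref is not None:
            return ref                    # found flag set
    return None                           # found flag cleared
```

Populating the index mirrors the lookup: each reference block's sketch contributes the same C(8, n) keys, each mapped to the block's position in the reference list.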

Delta compression
The PCI-e bus connecting our hardware platform to the host is 8 bytes wide. In order to provide line-speed compression for similarity based data deduplication, we look for 8-byte repetitions between near duplicate blocks. The reference block is fed into the delta compressor first; Blk_new, the associated block to be compressed, follows it. While the two blocks stream into the compressor, repetitions between them are searched for in a reference dictionary, as shown in Figure 35.
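A behavioral model of the 8-byte-granularity compressor is sketched below. The token format and the dictionary of aligned reference words are assumptions made for illustration, not the actual channel design of Figure 35.

```python
WORD = 8   # match granularity: one PCI-e bus beat is 8 bytes wide

def delta_compress(ref, new):
    """Emit ('copy', offset) for each 8-byte word of `new` found in `ref`,
    otherwise ('lit', word); `ref` streams in first to fill the dictionary."""
    dictionary = {}
    for off in range(0, len(ref) - WORD + 1, WORD):
        dictionary.setdefault(ref[off:off + WORD], off)
    tokens = []
    for off in range(0, len(new), WORD):
        word = new[off:off + WORD]
        tokens.append(('copy', dictionary[word]) if word in dictionary
                      else ('lit', word))
    return tokens

def delta_decompress(ref, tokens):
    """Rebuild the associated block from the reference block and tokens."""
    out = bytearray()
    for kind, val in tokens:
        out += ref[val:val + WORD] if kind == 'copy' else val
    return bytes(out)
```

Because each word is either an 8-byte literal or a short copy token, the compressor can keep pace with the bus: one word enters and at most one token leaves per beat.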

Implementation and evaluation

Experimental setup
The three major hardware modules for similarity based dedupe discussed above are built on a Xilinx ML605 development board with a V6-240T FPGA; our maximum clock speed is 250 MHz. As a hardware coprocessor, it connects to the host through a PCI-e 2.0 x4 bridge. A data deduplication software simulator [5] runs on the host PC with an Intel(R) Core(TM) 2 Duo CPU E7500 at 2.93 GHz and 4 GB DRAM. Figure 36 shows the block diagram of how the hardware modules are connected to the host system. For the purpose of performance evaluation and comparison, we installed the standard dedupe software downloaded from [18]. By standard dedupe, we mean the dedupe function that performs data reduction only on identical data chunks. An open source software package [15] that does similarity based dedupe was also installed in order to evaluate the efficiency and effectiveness of our newly designed hardware modules. Therefore, in the following discussions we compare three dedupe systems: the standard dedupe, the software similarity based dedupe module, and our hardware modules.

Figure 36. Experiment platform for hardware accelerating similarity based data deduplication.

Latency
Since data dedupe for primary storage and caches is on the critical path of production I/Os, minimizing dedupe latency is essential to storage I/O performance. We first evaluate the latencies of the dedupe functions.
The first function of similarity based dedupe is fingerprint computation to derive sketches for data blocks. Our first experiment measures the time taken to compute the Rabin fingerprints and derive a sketch for each data block. For the Linux kernel data set, using software fingerprint computation to derive sketches of all data blocks takes over 18.9 seconds, while our hardware implementation takes only 1.37 seconds. The average delay for computing the sketch of a 4 KB block is about 2.5 us using the hardware module, but over 30 us using the software module. For high performance storage such as SSDs, this difference can have a significant impact on the production performance of disk I/Os. Not only does the hardware implementation greatly speed up fingerprint computation, it also offloads the computation to the accelerator, allowing the server CPU to concentrate on application performance.
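As a rough consistency check on these numbers, assuming the pipeline consumes one 8-byte bus word per 250 MHz clock cycle, the minimum time to stream a 4 KB block through it is about 2 us, in line with the measured 2.5 us once transfer and setup overheads are included:

```python
CLOCK_HZ = 250e6          # FPGA clock reported in the setup section
BYTES_PER_CYCLE = 8       # assumed bus word consumed per cycle
BLOCK_BYTES = 4096        # one 4 KB data block

cycles = BLOCK_BYTES / BYTES_PER_CYCLE   # 512 pipeline cycles
print(cycles / CLOCK_HZ * 1e6, "us")     # -> 2.048 us
```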
The second function is reference block search, which finds the best reference block for delta compression. We measured the time it takes to search for a matched reference block upon each newly arriving block. Figure 38 shows the measured results for different similarity threshold values. The similarity threshold determines how many fingerprints in a sketch must match before a delta compression is performed. The lower the threshold value, the higher the chance of finding similar blocks. However, a low threshold also increases the chance of false positives, i.e., two blocks are considered similar based on a few matched fingerprints but are not actually delta compressible. A higher threshold value, on the other hand, gives a better chance that two blocks are delta compressible because they have more matched fingerprints in their respective sketches. But some compressible blocks may be missed if the threshold value is too high.

Figure 38. Average latencies of reference block search for different similarity thresholds.
From Figure 38 we can see that our hardware implementation of reference block search takes less time than its software counterpart. However, the latency reduction is not as substantial as for the fingerprint computation. From our experiments, we observed two reasons for this. First, our hardware design for this part is still preliminary, and there is room for optimization given more time; the software implementation, on the other hand, is quite mature with many built-in optimizations. Second, the latency is on the order of tens of nanoseconds, leaving little space for hardware to do much better.

The third function is delta compression. We measured the delta compression time of each 4 KB associated block against a 4 KB reference block using our hardware compressor. We also measured the same compression time using the software delta compressor MiniLZO [19]. The performance comparison is shown in Figure 39.

Data Reduction Ratio
In order to validate the dedupe capability of our hardware design, we carried out experiments to measure the data reduction ratio of the hardware dedupe system.
We compare this ratio with mature software dedupe systems; the purpose is to make sure the high speed hardware achieves the expected data reduction. Figure 40 shows the data reduction ratios of similarity based dedupe for both the software package and the hardware implementation on the Linux kernels data set. It can be seen in this figure that the data reduction ratios of the two systems are comparable for all similarity threshold values considered. We noticed that for lower similarity threshold values such as 4, the software package does a slightly better job than the hardware implementation. Our analysis of the hardware design and the software package suggests the following reasons. First, with the software compressor, data compression can be done both within a block and between the reference block and the associated block; in the hardware implementation, on the other hand, only inter-block compression is performed.
Further improvement of the hardware design is possible. Second, for smaller threshold values, software can work much harder, with more iterations, to find string matches within and between blocks; the hardware implementation performs just one pass and may miss some substring matches.

We have also carried out experiments to compare the data reduction ratios of standard dedupe and similarity based dedupe. The measured results are shown in Figure 41. From this figure, one can see that similarity based dedupe achieves better data reduction than standard dedupe because of the existence of similar data blocks. From our experiments, we observed about 30% better data reduction with similarity based dedupe than with standard dedupe. We believe the improvement would be even bigger in real world environments and production systems.
More similarity exists in real world data such as databases, big data, large files, sensor data, and data being processed by servers. As a result, similarity based data dedupe should perform much better in terms of data reduction.

Conclusion and future work
We have proposed a hardware accelerator to speed up similarity based deduplication.