Architecture and performance evaluation of data storage systems
With the advancement of cloud computing technologies, both personal and business users tend to store more and more data in consolidated data centers, which can be accessed from anywhere using computers or smart devices. Multiple users may upload identical or similar content, which results in a large amount of duplicated data in the data center. Besides cloud services, emerging virtualization technologies allow hundreds of virtual machines to run on one physical machine, which must then store many copies of similar operating systems and applications. Traditional data storage systems are not able to fully exploit such data redundancy. This dissertation presents a new approach that identifies similar data blocks and stores them in compact formats to improve the performance of the storage system. A histogram-based signature is proposed to capture the similarity between data blocks whose contents are similar or shifted. Similar data blocks are clustered into the same group based on their signatures. Furthermore, a heatmap algorithm is designed to find the most popular block among similar blocks, considering both the temporal and content locality of data blocks. Finally, a high-speed delta coding algorithm is developed to compress similar blocks into small deltas. The proposed approach leverages a flash-memory-based solid-state disk (SSD) to store a single copy, the reference, for many redundant data blocks. Other similar blocks are stored as small deltas that refer to the reference block on the SSD. Compared to conventional magnetic hard disks, the flash-based SSD is orders of magnitude faster in terms of latency. Thus the reference block stored on the SSD can be retrieved quickly, and I/O requests to other similar blocks can be served by combining the corresponding deltas with the reference block, avoiding slow hard disk accesses.
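The reference-plus-delta idea described above can be illustrated with a minimal sketch. The abstract does not specify the actual signature or delta coding algorithms, so everything here is an assumption made for illustration: the 4096-byte block size, the function names `histogram_signature`, `encode_delta`, and `apply_delta`, the use of the most frequent byte values as the signature, and a simple run-based byte diff standing in for the dissertation's high-speed delta coder.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size, not taken from the dissertation


def histogram_signature(block: bytes, top_k: int = 8) -> tuple:
    """Return the top_k most frequent byte values of a block.

    A byte histogram ignores byte positions, so a block and a shifted
    (rotated) copy of it produce the same signature.
    """
    counts = Counter(block)
    # Sort by descending count, then by byte value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(value for value, _ in ranked[:top_k])


def encode_delta(reference: bytes, block: bytes) -> list:
    """Encode block as (offset, bytes) runs where it differs from reference."""
    if len(reference) != len(block):
        raise ValueError("reference and block must have the same length")
    delta = []
    i, n = 0, len(block)
    while i < n:
        if block[i] != reference[i]:
            start = i
            while i < n and block[i] != reference[i]:
                i += 1
            delta.append((start, block[start:i]))
        else:
            i += 1
    return delta


def apply_delta(reference: bytes, delta: list) -> bytes:
    """Rebuild a block by patching the reference with its delta runs."""
    out = bytearray(reference)
    for offset, data in delta:
        out[offset:offset + len(data)] = data
    return bytes(out)


# Usage: a reference block and a near-identical block that differs in 3 bytes.
reference = bytes(i % 7 for i in range(BLOCK_SIZE))
similar = bytearray(reference)
similar[10:13] = b"\xff\xfe\xfd"
similar = bytes(similar)

delta = encode_delta(reference, similar)
restored = apply_delta(reference, delta)
```

In this sketch a read of the similar block needs only the reference (assumed to reside on the SSD) and a delta of a few bytes, which is the property the proposed design relies on to avoid hard disk accesses. The positional diff does not compress shifted content; a production delta coder would need to handle that case.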
Two prototypes of the proposed data storage system have been implemented, one as part of the Linux kernel virtual machine monitor and the other as a Linux device driver. Numerical results on standard benchmarks show an order-of-magnitude improvement for the new storage system over existing disk I/O architectures such as RAID and the SSD/HDD storage hierarchy. The last part of this dissertation presents a block-level versioning system that can recover data to any point in time in the past. The versioning system is independent of the operating system because it uses a network storage protocol. Version creation, log maintenance, and version recovery are performed at the storage target to offload the versioning overhead from application servers. Experiments on Linux, Windows, and Solaris have demonstrated that the new versioning system allows users to recover selected files with much smaller metadata cost than existing file system versioning systems.
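Point-in-time recovery at the block level can be sketched with an append-only write log. This is only an illustration under stated assumptions: the abstract does not describe the log format or recovery procedure, and the class name `VersionedBlockStore`, its methods, and the use of integer timestamps are hypothetical.

```python
class VersionedBlockStore:
    """Toy block store that keeps every write in a time-ordered log."""

    def __init__(self):
        # Each entry is (timestamp, block_no, data); appended in time order.
        self._log = []

    def write(self, timestamp: int, block_no: int, data: bytes) -> None:
        """Record a block write; timestamps must not go backwards."""
        if self._log and timestamp < self._log[-1][0]:
            raise ValueError("writes must arrive in timestamp order")
        self._log.append((timestamp, block_no, data))

    def read_at(self, block_no: int, timestamp: int):
        """Return the block contents as of timestamp, or None if unwritten."""
        # Scan backwards for the latest write at or before the requested time.
        for ts, no, data in reversed(self._log):
            if no == block_no and ts <= timestamp:
                return data
        return None


# Usage: block 5 is overwritten at t=20; the t=10 version stays recoverable.
store = VersionedBlockStore()
store.write(10, 5, b"old contents")
store.write(20, 5, b"new contents")
store.write(30, 6, b"other block")

before = store.read_at(5, 15)
after = store.read_at(5, 25)
```

Because the log lives entirely in the store, a client only issues ordinary block reads and writes, which mirrors the idea of keeping versioning work at the storage target rather than on the application server.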
Computer Engineering|Solid State Physics|Computer science
Dissertations and Master's Theses (Campus Access).