Research Into Computer Hardware Acceleration of Data Reduction and SVMS

TABLE OF CONTENTS

ACKNOWLEDGMENTS
PREFACE
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 Compression Speed Enhancements to LZO for Multi-Core Systems
   Abstract
   1.1 Introduction
   1.2 Analysis Of LZO 1X-1-15
   1.3 Algorithm Enhancements
      1.3.1 Parallelization of Block Compression
      1.3.2 Optimize Copying of Literal Data
      1.3.3 Search for Matches Every 32-Bits
      1.3.4 Force Cache-Line-Aligned Reads
      1.3.5 Utilize Hardware CRC-32 Instruction
   1.4 Performance Analysis
      1.4.1 Test Setup
      1.4.2 Benchmark Test Results
   1.5 Conclusions


Abstract
This paper examines several promising throughput enhancements to the Lempel-Ziv-Oberhumer (LZO) 1x-1-15 data compression algorithm. Of the many algorithm variants present in the current library version, 2.06, LZO 1x-1-15 is considered the fastest, geared toward speed rather than compression ratio. We present several algorithm modifications tailored to modern multi-core architectures that are intended to increase compression speed while minimizing any loss in compression ratio. Experimental results show that, on a modern quad-core system, an average 3.9x speedup in compression time is achieved over the baseline algorithm with no loss in compression ratio. Allowing for a 25% loss in compression ratio, up to a 5.4x speedup in compression time was observed.

Introduction
Real-time systems are increasingly required to process growing volumes of data. The amount of data involved can push such systems against interface throughput and storage bottlenecks. Data compression can help alleviate these problems by reducing the amount of data injected into a pipeline. For the purposes of this paper, we consider a theoretical real-time system that requires a lossless compression stage to pass data between two interfaces. Compression may be required in such a situation due to bandwidth limitations or space constraints on the destination interface. We assume a constant input stream of data is available to the compression device and that the final compressed output can be passed to the secondary interface with zero delay. The target lossless compression algorithm for this system is Lempel-Ziv-Oberhumer (LZO) variant 1x-1-15.
The LZO compression library is a collection of lossless dictionary-based data compression algorithms that favor speed over compression ratio. The LZO library was first released in 1996 and has received periodic updates since. The library has seen widespread use, appearing in a variety of technologies, including NASA's Mars rovers [1] and Oracle Corporation's Btrfs (B-tree file system) for Linux.
A comparison of the LZO 1x-1 algorithm's speed against common compression formats such as GZIP can be found in [2]. Of the available LZO algorithms, 1x-1-15 is considered by the author, Oberhumer, to have the fastest compression speed [1], exceeding that of LZO 1x-1 at the cost of compression ratio.
This paper investigates possible enhancements to the 1x-1-15 algorithm to improve data compression speeds for use in real-time systems.
Utilizing the special architectural features of modern multi-core processors, we examine the effects of optimizing LZO in the following ways: parallelizing block compression, using Intel SSE (Streaming SIMD Extensions) vector instructions to perform data copy operations, modifying the search algorithm, enforcing cache-aligned reads, and calculating CRC-32 checksums in hardware. All five enhancements have been implemented on the LZO open source code. Performance evaluation and comparison have been carried out using real-world data sets.
Experimental results show significant improvements in compression time. At the same compression ratio, a speedup of more than a factor of three was observed. When a slightly lower compression ratio is acceptable and all enhancements are combined, a speedup of more than a factor of five was observed.
The remainder of this paper is organized as follows. First, an analysis of the existing LZO 1x-1-15 algorithm is conducted, revealing the overall structure and identifying unique characteristics. Next the proposed enhancements to the algorithm are discussed in detail. After this, experimental data is given to compare against baseline performance. Finally, conclusions regarding the obtained results are presented.

Analysis Of LZO 1X-1-15
LZO 1x-1-15 is a variation of the Lempel-Ziv 1977 (LZ77) compression algorithm described in [3]. LZ77 achieves data compression via a sliding window mechanism: bytes from a look-ahead buffer are shifted one by one into a search buffer. When matches are found between the look-ahead buffer and locations in the search buffer, tokens are output on the compression stream rather than literals, resulting in compression. Major differences in Oberhumer's LZO variation include the optimization of operations for integer computer hardware, a quick hash lookup table for match data, and better-optimized output tokens.
The LZO 1x-1-15 algorithm is structured to take advantage of the fact that most computers are optimized for integer operations. Instruction latency/throughput tables illustrating this on Intel Atom architecture CPUs can be found in [4]. No time-consuming floating point operations are used anywhere in the algorithm. The most complex instruction performed is an integer multiply, which takes roughly five or fewer CPU clock cycles on the latest Intel architecture processors [5]. When pipelining and out-of-order execution are taken into account, the delay introduced by this multiply instruction becomes even less significant.
Fetches to and from cache are also optimized. The algorithm is small when compiled into binary format, likely fitting within a modern level 1 instruction cache: compiling for an i386 target was found to produce roughly 600 instructions occupying roughly 1.84 kB in total. This low instruction count is consistent with that reported in [6] for the slightly larger, more complex cousin algorithm LZO 1x-1. Since the algorithm operates on data block by block, successive iterations do not require frequent re-fetching of instructions from main memory.
Data cache misses are also minimized, as a maximum of 48 kB of data is compressed at a time, regardless of the user-defined input block size. On a modern data cache of 16 kB or larger, this results in few level 1 cache misses; a 48 kB sub-block of input data should easily fit within level 2 cache, if not most of level 1 cache. Intel processors in particular predict data fetch patterns and automatically pre-fetch sequential data from a detected input stream, providing further performance gains during block compression [4] [7].
As seen in Fig. 1, the search pointer increment jumps exponentially until either a match is found or the end of the current 48 kB sub-block is reached. The search algorithm appears to assume that data matches are related by spatial locality. When a match is found, the search algorithm assumes nearby matches will also occur, incrementing the search pointer, ip, by only one byte. As match detections repeatedly fail, ip is incremented in an exponentially increasing fashion.
Consider the following simplified example: a 100-byte data set with 4-byte long matches located at byte offsets 30, 34, 38, and 84. Misses will occur at offsets 0 through 29 and the search pointer ip will be incremented by one byte offset each time.
Keep in mind that every time a miss occurs, the dictionary is still populated. Then three consecutive matches will occur, with ip being updated to the next unmatched location for successive search iterations. From offsets 42 through 73 ip is incremented by one byte. From offsets 74 through 82, the difference between the last match, ii, and the search pointer has grown beyond a distance of 32, and as such a byte offset of 2 is added to ip upon each detected miss. At byte offset 84 a match is found once again.
As the match is only 4 bytes in length, the search will resume at byte offset 88 where a miss will occur. Since a match was detected, ii was updated and the difference between ip and ii is once again less than 32. This results in ip being incremented by one byte once again rather than two. Miss detections and single offset increments will continue to the end of the dataset.
An interesting aspect of the search code is that it involves only integer operations and no comparison instructions. As a result, the compiled code contains no conditional branches in the computation of the exponential advance. Consequently, no part of the CPU pipeline needs to be reserved for branch prediction when determining how far to advance into the input data stream on the next iteration, resulting in overall faster code.
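The advance can be captured in a single branch-free expression. The sketch below is a minimal illustration under the naming used above (ip for the search pointer, ii for the position just past the last match); it is an assumed form for illustration, not the library's verbatim code.

```c
#include <stddef.h>

/* Minimal sketch of the branch-free search advance. Once ip - ii reaches
 * 32, the shift term becomes nonzero and the step grows with the gap since
 * the last match, producing the exponential skip described above without
 * any conditional branch. */
static const unsigned char *next_search_pos(const unsigned char *ip,
                                            const unsigned char *ii)
{
    return ip + 1 + ((ip - ii) >> 5);
}
```

Tracing the worked example: while ip - ii is below 32 the step is 1 byte; once the gap reaches 32 (offset 74 in the example) the step becomes 2, and it keeps growing as misses accumulate.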
The second major portion of the LZO 1x-1-15 algorithm performs the copying of unmatched literal data to the output stream. First, the number of bytes of uncompressed literal data is written in an encoded manner to the output stream. This is followed by the literal data. The algorithm assumes a 32-bit data bus, and, when possible, copies data to the output stream in groups of 32-bits in an attempt to optimize any writes performed. After copying the literal data, if the end of the data stream was reached, a special block marker is written to the output stream. Otherwise, the match length is calculated.
The algorithm calculates match length by performing 32-bit XOR operations on sequential sets of data pointed to by the input stream and the dictionary. If a 32-bit comparison is successful, the result of the XOR operation is zero. When the comparison fails for the first time, 8-bit byte comparisons are performed to determine whether any remaining partial-word bytes matched. Thus, similar to the literal copy operation, the algorithm assumes the computer's hardware is used optimally when matches are performed primarily on 32-bit integers.
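The following sketch illustrates this word-then-byte scan. It is a hypothetical helper written for clarity, not the library's code; memcpy is used to avoid undefined behavior on unaligned reads.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 32-bit-at-a-time match length scan described above.
 * Words are compared with XOR; a zero result means all four bytes
 * matched. On the first nonzero word, the tail is compared byte by
 * byte. */
static size_t match_length(const uint8_t *a, const uint8_t *b, size_t max_len)
{
    size_t len = 0;
    while (len + 4 <= max_len) {
        uint32_t wa, wb;
        memcpy(&wa, a + len, 4);
        memcpy(&wb, b + len, 4);
        if (wa ^ wb)
            break;                 /* mismatch somewhere in this word */
        len += 4;
    }
    while (len < max_len && a[len] == b[len])   /* leftover bytes */
        len++;
    return len;
}
```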
Following the match length calculation, the algorithm encodes the following tokens on the output stream: a marker denoting the type of match that occurred, the offset to the location of the data that matched the current set, and the length of the match. To minimize the number of bytes used to store this information and improve the compression ratio, five different encodings exist. Each of the five encodings offers a unique variation of offset/length encoding. After outputting the match information, the algorithm returns to searching for the next 32-bit set of matching data.

Algorithm Enhancements
The five enhancements that follow in this section have been applied to the existing LZO 1x-1-15 algorithm to determine the impact on compression performance.

Parallelization of Block Compression
The original LZO algorithm is serial in nature. Data is compressed block by block, but there is no explicit support for multiple CPU cores. By implementing a divide-and-conquer approach to individually compress and reassemble blocks of data in the input stream, a performance gain should be attained, directly proportional to the number of CPU cores utilized.
To investigate this improvement, a thread-based variation of the 1x-1-15 algorithm was constructed as follows. As seen in Fig. 1.3, three main types of threads are created: control, compression, and reconstruction. Communication between the three types of threads is accomplished through semaphore-protected shared memory. The main thread controls the flow of input data, manages memory utilization, and initiates compression threads. N user-defined threads are created to perform the actual compression on input data blocks when commanded.
The compression threads operate independently from one another, each with its own set of temporary resources. The software and operating system ensure the compression threads are distributed across all available processor cores in the system. The main control thread initiates a compression thread when the following conditions hold: available input data exists, temporary output compression buffers exist, and an idle compression thread exists. Finally, a reconstruction thread is created to accept the compressed block data output by the individual compression threads. It is the job of the reconstruction thread to ensure that the resultant output stream is created in order. Since data is compressed in parallel on separate CPUs, there is no guarantee as to when a particular block will finish compression; Input Block 1 could be compressed first, followed by Block 3, followed by Block 2. Without the reconstruction thread, there would be no mechanism to reconstruct the compressed output data in order, and unless special headers were added to identify the out-of-order data, existing LZO decompression routines would be incompatible with the produced output format. A skeleton of this organization is sketched below.
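All structure and helper names in this skeleton are hypothetical illustrations of the three-thread-type design, not the implementation from this work. The control thread (not shown) would initialize the semaphores, fill job entries, and post jobs_available; the reconstruction thread re-orders finished blocks by sequence number before output.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t seq;        /* input-order index used for in-order reassembly */
    const uint8_t *in;   /* uncompressed input block                       */
    size_t in_len;
    uint8_t *out;        /* temporary per-block output buffer              */
    size_t out_len;
} block_job_t;

/* Hypothetical lock-protected queue and compression helpers. */
extern block_job_t *take_next_job(void);
extern void publish_result(block_job_t *job);
extern block_job_t *claim_result(uint32_t seq);
extern size_t lzo_compress_block(const uint8_t *in, size_t n, uint8_t *out);
extern void write_output(const uint8_t *buf, size_t n);

static sem_t jobs_available, results_available;   /* init by control thread */

static void *compression_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&jobs_available);
        block_job_t *job = take_next_job();
        job->out_len = lzo_compress_block(job->in, job->in_len, job->out);
        publish_result(job);
        sem_post(&results_available);
    }
    return NULL;
}

static void *reconstruction_thread(void *arg)
{
    uint32_t next_seq = 0;
    (void)arg;
    for (;;) {
        sem_wait(&results_available);
        /* Emit only the block whose sequence number comes next, leaving
         * blocks that finished out of order buffered until their turn.  */
        block_job_t *job;
        while ((job = claim_result(next_seq)) != NULL) {
            write_output(job->out, job->out_len);
            next_seq++;
        }
    }
    return NULL;
}
```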

Optimize Copying of Literal Data
In the current LZO algorithm, uncompressed literal data is copied to the output stream on a 32-bit word basis. Copying this data is one of the more time-intensive operations. Instead of copying on a 32-bit integer basis, data can be copied faster using available vector instructions; on current-generation Intel processors, up to 128 bits of data can be copied at a time, potentially quadrupling performance in this portion of the code. To investigate this optimization, vectorized memory-copy source code built on Intel SSE (Streaming SIMD, or Single Instruction Multiple Data, Extensions) from Agner Fog's freely available asmlib library [8] was utilized.
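The sketch below shows the general shape of a 128-bit literal copy loop. It is an assumption of the approach for illustration; the work itself used Agner Fog's asmlib routines rather than this code.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Each iteration moves 16 bytes instead of the 4 bytes moved by the
 * baseline 32-bit copy; leftover bytes are copied individually. */
static void copy_literals_sse(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    for (; i < n; i++)
        dst[i] = src[i];
}
```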

Search for Matches Every 32-Bits
LZO 1x-1-15 shifts data from a look-ahead buffer into the search buffer one byte at a time, similar to LZ77. This algorithm variant was created to advance the input stream by 32 bits (one integer) each time a match detection fails. The input stream is thus traversed faster, with the side effect that the dictionary is populated less frequently. A compression ratio loss is expected; its magnitude must be determined experimentally.
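In terms of the earlier advance sketch, the change amounts to making the base step one word instead of one byte. Whether the exponential term is retained unchanged is an assumption of this sketch.

```c
/* Sketch of the 32-bit search variant's advance: one whole word per miss. */
static const unsigned char *next_search_pos_32(const unsigned char *ip,
                                               const unsigned char *ii)
{
    return ip + 4 + ((ip - ii) >> 5);
}
```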

Force Cache-Line-Aligned Reads
Most current-generation Intel processors suffer a penalty when reading data that straddles a cache line boundary [7]. By constraining the search pointer to aligned addresses, accesses to memory can be ensured to be cache-line-aligned. Assuming the cache line boundaries in a system lie on addresses divisible by 4 bytes, and knowing that at most 4 bytes of data are read from memory at a time, reads that cross a cache line boundary can be avoided entirely. To accomplish this, four bytes are added to the current input pointer instead of one so that the input position always advances by 32 bits, and the exponential portion of the advance equation is bitwise-ANDed so that its result is 32-bit aligned as well.
Match length calculation also required modification to keep reads from the input stream on cache line boundaries. Since the existing algorithm determined match lengths on a byte basis, this variation was altered to perform matches only on a 32-bit basis, guaranteeing that locations straddling cache line boundaries will not be read the next time the search equation is executed.
This modification is in essence an extension of the previously described 32-bit search variation. In addition to searching every 32 bits for a match, the search pointer is verified to lie on a 4-byte boundary, and the lengths of matching runs are terminated early so that they are multiples of 32 bits. As with the 32-bit search algorithm, a compression ratio loss is expected, and it will likely be greater because matches will be shorter. A speed enhancement may result from avoiding the cache line read penalty.
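Continuing the earlier sketches, the aligned variant can be expressed as follows; the exact placement of the mask is an assumption, but it matches the description above of ANDing the exponential term.

```c
#include <stddef.h>

/* Sketch of the cache-line-aligned advance: the base step is 4 bytes and
 * the exponential term is masked to a multiple of 4, so ip stays 32-bit
 * aligned and a 4-byte read never straddles a cache line. */
static const unsigned char *next_search_pos_aligned(const unsigned char *ip,
                                                    const unsigned char *ii)
{
    return ip + 4 + (((ip - ii) >> 5) & ~(ptrdiff_t)3);
}
```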

Utilize Hardware CRC-32 Instruction
The existing LZO library incorporates a CRC-32 calculation routine derived from the freely available zlib library [9]. This routine implements a tabular method using the 0x04C11DB7 polynomial and requires considerable processor time to generate checksums. The approach taken for this paper was to observe the performance of the relatively new Intel SSE 4.2 assembly instruction CRC32.
It should be noted that the CRC-32 library function is not explicitly utilized by the LZO 1x-1-15 algorithm, or by any other of the provided LZO library algorithms. The function is provided by Oberhumer so that the writer of the final compression executable may calculate CRCs on an as-needed basis, adjusting the balance between speed and error-checking capability. The author of the LZO library has written and maintains the executable lzop. The current version of lzop, 1.03, incorporates a small subset of the LZO library for its compression, including the 1x-1, 1x-1-15, and 1x-999 algorithms. This executable calls the CRC-32 library function twice for every block of input data compressed: once to determine the CRC of the original uncompressed data block and once to determine the CRC of the compressed output block. By default, the application sets the user-defined input block size to 256 kB.
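A minimal sketch of a block checksum built on the SSE 4.2 intrinsics is shown below. One caveat worth noting: the CRC32 instruction implements CRC-32C (the Castagnoli polynomial), not the zlib 0x04C11DB7 polynomial, so its checksums are not interchangeable with zlib-derived values.

```c
#include <nmmintrin.h>   /* SSE4.2 intrinsics; build with -msse4.2 */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Process the bulk of the buffer one 32-bit word at a time, then finish
 * any tail bytes individually. */
static uint32_t crc32c_hw(const uint8_t *data, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, data + i, 4);
        crc = _mm_crc32_u32(crc, w);
    }
    for (; i < n; i++)
        crc = _mm_crc32_u8(crc, data[i]);
    return crc ^ 0xFFFFFFFFu;
}
```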

Performance Analysis
To determine compression performance, first a dataset of sample files to be compressed was constructed. The particular set was chosen to demonstrate performance over a variety of file types that could represent potential data streams in a real-time system. A brief description of the files used can be seen in Table 1.1.
Text files such as the Wikipedia backups and the Google Books 1-gram corpus were chosen to demonstrate performance on highly redundant data. Since LZO is a dictionary-based method, compression time and ratio were expected to be fairly good for these files. The files already containing various degrees of compressed data (GZIP and BZIP2) were chosen to show compression performance when file expansion is highly likely to occur. An uncompressed tarball compilation of extended TIFF image data was picked to demonstrate the image compression characteristics of the algorithm. A rather large uncompressed tarball compilation of HDF biological E. coli data was chosen to show performance on binary data. Finally, genome data in FASTQ format was chosen, as it is considered the standard for storing the output of high-throughput sequencing instruments; the FASTQ data should therefore provide a somewhat realistic real-time data stream.
As stated in Section 1.1, performance was evaluated with respect to a theoretical real-time compression system consisting of two interfaces: a constant input stream and an output stream that can be written with zero delay. A software command-line executable was created to simulate such a system and facilitate batch compression of the files identified in Table 1.1. The executable loads input data and compresses it in memory to reduce the impact of slow file I/O subsystems. Time spent reading data from the hard disk is not counted against compression time. This serves two purposes: first, the algorithms can be evaluated more fairly, independent of hardware overhead limitations; second, the input data is presented as a continuous stream, as desired by the real-time system simulation. Parameters were passed to the executable dictating which algorithm to utilize and how many iterations of compression to perform on a per-file basis. After testing a particular algorithm's performance on a file from the dataset, the program outputs the average throughput, the average compression time, the average compression time per input data block, the compression ratio, and the final compressed file size.

Prior to the start of testing, an optimal value was determined for the input block size parameter by running the first Google Books 1-gram dataset with several input block sizes. Performance with one thread in the multi-core version of LZO was found to be slightly worse than the baseline, most likely due to the overhead involved in coordinating the multi-core algorithm.

Test Setup
Performance for both multi-threaded implementations increased up to a maximum at four compression threads. It was originally expected that performance would increase up to eight threads, as the system under test has eight logical cores. Further testing showed that, due to the specific software implementation, the CPU never reached maximum utilization of all eight cores simultaneously. In order to accommodate large file sizes, only 64 MB at a time were read from disk into physical memory and compressed in a loop.
The stall time imposed by the disk reads was measured and subtracted from the total compression time. However, apparently because of the frequency of the stalls, the operating system decided that delegating work to the logical cores was unnecessary, and the physical cores remained slightly underutilized. A modified version of the compression benchmark program was created to iterate on the first 512 MB of file data in memory. This software exercised all eight cores, showing the expected performance scaling from one to eight threads.
For the purposes of this testing, it was decided to continue utilizing the original version of the software with the maximum thread parameter set to 4 or greater. The memory-only software variation would have produced inaccurate compression ratio measurements, as larger file streams would not fit within available memory.

Benchmark Test Results
Testing was conducted in the following manner. First, the performance of the baseline algorithm and each LZO 1x-1-15 variant was measured against the dataset. To determine why the cache-line-aligned modification may experience issues with highly redundant data, a special version of the compression software was created to gather more in-depth timing information. This version recorded execution time spent in the four main sections of the compression code described in Section 1.2. Compression of the file "image_en.nq" was re-run using the 32-bit search and cache-line-aligned algorithms. The results can be found in Table 1.2.
It was found that the cache-line-aligned algorithm executed roughly 1.28 times the number of loop iterations overall, resulting in a total compression time 1.40 times longer than that of the 32-bit search variation. Supplemental testing on an Intel Core 2 Duo system (not shown), which according to [7] should benefit more from aligned reads than the i7 processor used for benchmarking, showed the algorithm still performing 1.30 times slower than the 32-bit search variation. Examining Table 1.2, fewer match length determination loops were executed, implying that fewer dictionary matches were found. This makes sense: during dictionary searches, the 32-bit alignment of the input pointer may skip up to three consecutive bytes, leading to a more sparsely populated dictionary. Those matches that were found would also have been shorter due to the imposed 32-bit boundary restriction, leading to more loop iterations per sub-block. The algorithm is also hindered by the high level of data redundancy: it periodically finds short matches, preventing it from skipping forward to the end of a sub-block as it would with data of lower redundancy. These combined factors appear to have contributed to the poor speed performance on highly redundant data.
Compression timing performance for highly non-redundant data is also reported in the benchmark results. In instances where compression ratio is not of primary importance, the combined optimization algorithm demonstrated the ability to increase compression speed by 5.4x.
In general, the results suggest that LZO and other token-based block compression algorithms similar to it can benefit from recent CPU hardware optimizations.
Increased processor bus widths allow for larger block memory copies when storing uncompressed literal data, and the introduction of multiple cores allows multiple blocks to be compressed simultaneously and independently. These optimizations should port to embedded multi-core architectures with data buses wider than 32 bits.
Future work may be explored by investigating potential optimizations from the Intel AVX (Advanced Vector Extensions) and AVX2 vector instruction sets [10]. Another area of future work may be to implement an adaptive version of the algorithm similar to that described in [11]. Such a system would attempt to maintain a minimum compression ratio, switching back and forth between the different algorithms for speed gains as need allows.

Abstract
This paper presents a novel architecture for a lower limb neural machine interface (NMI) for determining user intent. Our new design and implementation paves the way for future bionic legs, which require high-speed deterministic real-time response, high accuracy, easy portability, and low power consumption. A working FPGA-based prototype has been built, and experiments have shown that it achieves average performance gains of around 8x over the equivalent software algorithm running on an Intel Core i7 2670QM, and 24x over an Intel Atom Z530, with no perceivable loss in accuracy. Furthermore, our fully pipelined and parallel non-linear support vector machine-based FPGA implementation led to a 6.4x speedup over an equivalent GPU-based design. In this paper, we also characterize our achieved timing margin to show that our design is capable of supporting real-time wireless communications. With additional refinement, such a wireless personal area network (PAN) system will provide improved flexibility for electromyography (EMG) sensor placement on an individual basis.

Introduction
Until recently, commercially available lower limb prosthetics have mainly focused on the use of simple passive devices. Surface electromyography (EMG) is one sensory interface that has been used in the past to successfully predict active user intent to control upper limb devices [1]. The main issue with applying this pattern recognition (PR) technology to patients with transfemoral (TF) amputations is that the detected lower limb EMG signals are typically non-stationary during locomotion.
These EMG signals are, however, quasi-cyclic with respect to locomotion gait and somewhat stationary within short windows. In [2], it was shown that a phase-dependent EMG pattern classifier could be constructed to accurately predict intended user movements. When the EMG data was fused with moments and ground reaction forces measured from the artificial limb, a more accurate algorithm was obtained, achieving over 95% accuracy for locomotion detection [3] [4] [5]. The algorithm in [3] and [4] used linear discriminant analysis (LDA) for classification and was later realized in a field-programmable gate array (FPGA) to obtain real-time performance [6]. To achieve higher accuracy, a support vector machine (SVM) [7] based classifier has been studied [5] and shown to provide greater accuracy than the LDA approach.
Because the ultimate goal is to control an artificial limb in a real-time and wearable environment, a portable embedded system is desirable with low form factor and low power consumption.
In this paper, we present a new lightweight design of the more accurate SVM-based algorithm in an FPGA. Our FPGA implementation differs greatly from the LDA approach presented in [6], as it utilizes a non-linear SVM-based detection algorithm. While this increases design complexity, it has been shown to provide higher accuracy in determining user intent than the LDA approach [5]. Furthermore, in an effort to increase performance via pipelining and parallelism, our FPGA implementation was written and optimized in VHDL; in contrast, the previously created LDA-based FPGA was auto-generated using Impulse C C-to-HDL CoDeveloper [8].
This paper makes the following contributions. By comparison with an existing real-time wireless protocol, we show that sufficient prediction-time slack exists for our design to communicate with sensory devices in a time-critical manner. Wireless sensors are desired for EMG readings because, depending on a TF amputee's residual nerves, some muscle locations may be preferred over others for sensor placement. Our eventual wireless system should provide the flexibility to target those regions on an individual basis.
The remainder of this paper is organized as follows. In the next section we briefly introduce our proposed real-time system and describe our existing NMI algorithm. Section 2.3 details our prototype hardware implementation. In Sections 2.4 and 2.5 we introduce our testing methodology and discuss our results. We then present our conclusions and discuss future work in this area.

System and Algorithm Design
The proposed final state of our prosthetic system can be seen in Fig. 2.1. A detailed description of data flow in the system is summarized in [9]. Capabilities and limitations of the implemented prototype design are discussed in Section 2.3.
To obtain force and moment measurements, the artificial lower limb contains an integrated six-degrees-of-freedom (6-DOF) load cell manufactured by Bertec. The wireless data is received by an input/output (I/O) core, which forwards the raw ADC data to the FPGA's feature extractor. Once the feature extractor has obtained enough sensor data to occupy one window, a set of features is computed and forwarded to the multiclass SVM module. The SVM module takes these features and performs the current classification of user intent, which is then forwarded to a ten-point majority vote algorithm. The result of the majority vote is the final predicted user intention, which is sent through the wireless link back to a device on the artificial limb that uses the information to adjust limb impedance and position via its own separate state machine. Fig. 2.2 depicts a timeline of the major events, as seen from the FPGA, that take place during operation. Details of notable algorithm events are described in [9]. A block diagram of the hardware implementation can be seen in Fig. 2.3. In the subsections that follow, the three major components of the prototype design are discussed.

Wireless and Inter-Module Communications
A dual-core implementation of a superscalar MIPS-like CPU was synthesized to act as a transmitter/receiver between incoming raw EMG and load cell data and outgoing predictions. Due to its small footprint, USB device compatibility, and built-in support for several standard wireless communication protocols, a real-time embedded build of Linux was chosen as the operating system. One core is dedicated to handling kernel tasking; the other is responsible for receiving wireless data and transmitting it to the Feature Extractor/SVM module over a fast internal interface.

Feature Extraction and Phase Detection
The feature extraction unit is depicted in detail in Fig. 2. The output of the filter is used as input to a simple state machine to predict the current gait phase. The prediction is stored in a ten-entry history buffer to aid in future predictions.

Multiclass SVM and Majority Vote
The Multiclass SVM unit is designed to handle three classes. The SVM unit's block RAM is primed with model data prior to run-time, including support vectors, combined coefficient and class labels, and bias values. The RAM is designed to accept and store up to four different models, corresponding to the four gait phases.
When the SVM unit is signaled to begin a classification, the stored SVM model data is fed through the kernel function hardware. After kernel function execution, the SVM decision function [7] is evaluated. The hardware achieves high throughput by maximizing the number of parallel weighting operations. The final value for each SVM comparison is delivered in a pipelined fashion to a vote counting unit, which determines the winner of each binary classification and tallies votes. When it detects that the target number of votes has been placed, the final classification is sent to the majority vote unit.
The majority vote hardware maintains a sliding window of the last 10 predictions. Prior to receiving a new prediction, it adjusts its window and tallies the votes for each stored classification. When the new classification is available, only a few clock cycles are required to update the vote tally and output the final result.

Experimental Evaluation Methodology
This study was conducted with Institutional Review Board (IRB) approval at the University of Rhode Island and informed consent of subjects. Testing occurred in two phases: feature extraction and SVM performance evaluation, and real-time wireless system performance evaluation. For all testing, 3-Class offline data from [9] was utilized to benchmark the performance of the new system.
Two sets of 3-Class data were obtained from a male able-bodied subject. The first set of data was used to predict stair ascent and consisted of 6905 individual classification tests. The second set of data was used to predict walking motion and consisted of 7391 classification tests.
A software program was created to transmit required input data wirelessly to the FPGA module under test and receive responses. Cycle-accurate counters embedded within the FPGA hardware determined the amount of time to independently perform both feature extraction and multiclass SVM. This timing information was reported back to the software program along with the resulting prediction. Accuracy and timing benchmarks were achieved by comparing against a current generation Intel i7 CPU running a previously published C software-based implementation [9]. The Intel i7 system under test consisted of the following major components: an Intel Core i7 2670QM CPU, 6 GB of RAM, and Windows 7 64-bit operating system.

Performance Evaluation
After design compilation, the Altera Quartus II tool predicted that our total thermal design power (TDP) is 2.3 Watts, roughly matching that of the Intel Atom Z530 mobile CPU that ran the comparable NMI algorithm in [9]. Our complete system should require less power than this mobile device, as all memory and most communications resources are contained within the Stratix V.
During the first phase of testing, the offline A/D sample data from the 20 ms 3-Class trial was sent to the feature extractor in real-time. The three classes predicted by this model are stair ascent, standing, and level-ground walking. The system reported back the class prediction and the time taken to perform feature extraction and multiclass SVM prediction. The mean timing, and the overall speedup the FPGA achieved over the Intel i7 in completing both feature extraction and multiclass SVM prediction, can be seen in Table 2.1.
In [9], the Intel Atom Z530 achieved a mean prediction time of 0.721 ms over the same set of trials, and in [12], a highly parallelized GPU-based version of our algorithm run on a 35 W GeForce GT 540m with 96 CUDA cores yielded an average prediction time of 0.192 ms. Compared with our results in the table, we have achieved speedups of 24x and 6.4x, respectively, over these platforms. In the case of the GPU, a 15.2x power advantage is also achieved. Another interesting comparison can be made between the SVM and LDA-based FPGAs. In an equivalent 3-Class trial, [6] reports an average FPGA prediction time of 0.23 ms. As seen in Table 2.1, the SVM-based FPGA takes only 0.03 ms on average, a 7.6x performance increase. While LDA generally requires fewer computations than SVM, the hardware optimizations present in our design, combined with the fact that the LDA FPGA was created with a software-to-HDL converter, both contribute to this unexpected difference in performance.
Unlike the CPU implementation, our feature extraction consistently occurs within a fixed number of clock cycles, and since the amount of time taken for a classification decision is proportional to the selected gait phase, predictions can be expected to occur within a precise, guaranteed timeframe. Further analysis showed that with the exception of one prediction, results were identical to those produced by the Intel i7, and met the high level of accuracy reported in [5] and [9]. The one mismatch was determined to be a near-boundary case.
In the second phase of testing, the software program used previously for data injection was modified to record the round-trip timing of the system. Timing was recorded for both wired Ethernet and 802.11g WiFi using TCP/IP with the Nagle algorithm disabled for speed. The wired Ethernet implementation was chosen as a control to ensure that the soft-core CPU and SVM-based limb algorithm hardware were capable of handling the latency and throughput required for real-time communications. The required processing slack time was derived using data from [10]. In their paper, the developers of RT-WiFi found that in a noisy environment, a payload of 460 bytes incurs a transmission delay of at most 4.2 ms. Extrapolating linearly to the 520 bytes required for each 20 ms window prediction, our system must provide a guaranteed slack time of at least 4.75 ms per window.
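The extrapolation is straight proportional scaling of the published worst-case delay:

\[
4.2\ \mathrm{ms} \times \frac{520\ \text{bytes}}{460\ \text{bytes}} \approx 4.75\ \mathrm{ms}
\]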
The same trial subsets of 3-Class data used during the first phase of testing were again used to determine performance. In the wired implementation, all prediction responses were received within the desired 20ms prediction timeframe. Most arrived within 4ms, with a few taking as long as 14ms. This yields a minimum slack time of 6ms, which satisfies our derived timing requirement. As expected, during 802.11g WiFi testing, the system was incapable of meeting real-time demand, with responses of up to 60 ms. Clearly, a solution similar to RT-WiFi is needed to achieve the desired response.

Conclusions
This paper explored using FPGAs to improve the real-time performance of an existing NMI algorithm for lower limb control. It was found that by implementing the real-time limb control algorithm in an FPGA we are able to meet required accuracy, while completing all computations within a much smaller and bounded timeframe when compared to the previous software-based implementations. In addition, the FPGA provides a wearable footprint and power rating required by limb control.
Analyzing the published data from the authors of [10] along with our wired Ethernet results, we found that the prototype system should be capable of real-time communications if a proper custom data link layer is substituted. Future work may be performed on the topic of further power reduction. ASIC implementations have been known to yield considerably better power ratings than FPGAs and require significantly less volume. A study will need to be undertaken to quantify the advantages an ASIC design would have over an FPGA.

Introduction
Recently, classification processing has become a growing field of interest for many embedded systems. In the past, pattern recognition techniques have been employed to achieve good classification accuracy in various areas, including data mining [1], artificial limb control [2], and security systems [3]. One popular subset of machine learning classifiers, support vector machines (SVMs) [4], has been shown to achieve high accuracy without the complex parameter tuning found in some neural networks [5]. While they yield strong results, the classification computation suffers from a large number of iterative mathematical operations and an overall complex algorithmic structure. Traditional processors with limited parallelization in their data pipelines have exhibited poor real-time performance, challenging computer architects to devise a classification engine suitable for embedded systems that is optimized for both performance and energy.
In this paper, we present an accelerator-based reconfigurable real-time architecture for the feed forward phase of SVM classification, R²SVM. Accelerator-based architectures [6] are an area of potential growth for embedded systems, yielding solutions to the issues of power, space, and timing constraints. They improve energy efficiency and performance by optimizing logic to perform specific tasks. To date, only a limited number of such architectures have been studied; they either target a specific SVM task or trade off important factors including precision, accuracy, speed, and general-purpose use. Our paper makes the following contributions to the community: (C1) A real-time, energy-efficient, accurate, general-purpose, run-time-reconfigurable accelerator for multiclass SVM classification. Unlike all existing works, our architecture is designed to work with four of the most commonly used SVM kernels and any number of features and classes, up to maximums specified at synthesis. This allows multiple diverse SVM workloads to be targeted.
(C2) We have developed a fully functional prototype of the R²SVM architecture in an FPGA, which will soon be made publicly available. Our unique design uses model input data identical to that provided by libSVM [7], allowing direct performance comparisons.
(C3) We present data from several well-known machine intelligence benchmarks to demonstrate the benefits of the R²SVM architecture over a variety of scenarios. To the best of the authors' knowledge, this is the first work to provide comparison results for multiclass SVM classification among FPGA, CPU, and GPU hardware.
The remainder of this paper is organized as follows. In the next section we briefly discuss the background and related work in the area of SVM classification and SVM hardware optimization. Section 3.3 describes the system architecture of R²SVM in detail. In Sections 3.4 and 3.5 we introduce our testing methodology and discuss our performance results. Finally, in Section 3.6, we present our conclusions and discuss future work in this area.

Overview of SVM Classification Process
Support vector machines are a form of supervised machine learning, whose current soft margin incarnation was pioneered by Vapnik and Cortes [4].
Classification is divided into two phases, a training phase and a test phase, also known as the feed forward phase. During the training phase, a set of data is supplied to a learning algorithm in the form of real vector-class pairs to create a model, which is later required for use by the feed forward phase. Given an accurate, ideal set of labeled input data, model creation in traditional SVMs is generally a one-time process that can be performed prior to run-time [2] [3]. The feed forward phase, however, is often run periodically or as needed to perform classifications requiring high performance and real-time processing. We therefore concentrate on optimizations of this phase of the classification process in this paper.
The feed forward phase can be viewed as a two-step process. First, data of interest must be collected by some sensory means and relevant attributes/features extracted. Feature extraction is a task often highly tailored to the type of data being collected [8]. Features presented to classification may change over time as better indicators are discovered or redundant/misleading elements are removed, and thus extraction is best left to general purpose processing. In contrast, SVM classification is a repetitive operation from which an embedded system could benefit if it were accelerated in hardware. Following feature extraction, the SVM classification algorithm, known as the decision function (3.1), is run, using stored model data to characterize the set of gathered features into classes. Kernel functions [9] are special mapping functions that efficiently map non-linear datasets into a high-dimensional linear feature space. They allow support vector machines, which were originally developed for linear classification, to also perform non-linear classifications. The particular kernel to employ depends entirely on the characteristics of the data in use and is beyond the scope of this paper. SVMs have also been extended to solve problems requiring classification among more than two classes of data. Like libSVM, our design implements the one-against-one approach [10]: multiple back-to-back binary classifications are performed, the winner of each classification receives a vote, and the class with the greatest number of votes is presented as the resultant classification.
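For reference, the binary soft-margin SVM decision function takes the standard form below, presumably the form referenced as (3.1), written here in the usual notation with support vectors x_i, labels y_i, dual coefficients alpha_i, kernel K, and bias b:

\[
f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i=1}^{N_{SV}} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right)
\]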

Related Work
Researchers have repeatedly employed computer systems to accelerate the processing of the SVM decision function. Implementations have targeted CPUs [7] and GPUs [11] [12] [13], as well as custom accelerator hardware. Many existing custom hardware implementations focus on accelerating the decision function for a specific task [14]; these designs cannot easily be reused for other purposes. Oftentimes they support a small, fixed number of classes [15] or use low-precision calculations [16] for speed. Most existing designs attempt to limit or avoid more precise floating point hardware for several reasons, including space and design complexity [17]. Unlike existing works, our reconfigurable architecture provides the ability to accurately compute multiple diverse workloads in succession. This is desirable in any real-time system that must make numerous machine learning decisions within a given timeframe.

The Role of Precision in SVM Classification
While researchers have exploited fixed point and other reduced-precision numerical systems to implement efficient, task-specific SVM implementations, the exact relationship between SVM classification accuracy and arithmetic precision has only been vaguely addressed in the existing literature. Rather than resort to trial-and-error in determining the required precision, [18] delves into the theory behind this topic. Through several experiments, the authors conclude that SVM parameter quantization and rounding error are the two main factors determining the minimal precision required for SVM computations, both of which are largely driven by the data set. For the limited data sets studied in the paper, the authors found that a minimum of 15 bits of floating point precision was required to ensure no loss in accuracy. In devising an architecture meant to tackle a wide variety of classification problems, great flexibility in precision is required to ensure high accuracy under all conditions. For this reason, while not the most resource- or power-efficient approach, we have elected to use the IEEE-754 floating point number system when performing classification computations in our architecture.

Architecture Design
In this section, we describe the design decisions made in the construction of our architecture. We envision the R²SVM architecture as a critical processing section in a variety of real-time embedded devices. Classification is often a very resource- and time-intensive process for general purpose computing; by relieving this burden in a power- and area-constrained environment, our architecture provides an attractive alternative. Our main intention in designing R²SVM was to develop an accurate, fast, highly efficient SVM classifier capable of meeting the demands of the diverse workloads that real-time systems may encounter.
To achieve a level of accuracy on par with existing software packages, all classification arithmetic is performed in IEEE-754 floating point, as discussed above. Meeting hard real-time speeds is achieved through extensive pipelining and parallelism. We designed R²SVM such that a given model will always take a known number of clock cycles to compute a result. We will show in the subsections that follow that access delays to both parameter and model data are kept to a minimum to avoid pipeline stalls. In Section 3.5, our results clearly indicate the ability of the architecture to maintain real-time response.
The final goal of our architecture is to provide seamless support for a diverse set of classification workloads through run-time reconfiguration. Header information from the model is used at run-time to reconfigure the number of classes for the current classification. We will demonstrate in the subsections that follow that while our hardware may be synthesized for a maximum of K classes, models ranging anywhere from 2 to K classes may be used without any performance degradation. Upon completing a classification, the unit informs the higher-level system of its completion and waits to receive the next set of parameters.

Kernel Evaluation
The first step in the evaluation of (3.1) is to use an appropriate kernel function to map an input vector of features to a scalar value, which is later used by the coefficient weighting process. Because of the latency involved in floating point calculations, our kernel is fully pipelined, so that once populated the hardware produces results continuously. The supported kernels all involve some specialized input operation, followed by a summation, which is in turn followed by a specialized output operation. This is reflected in our design in Fig. 3.2. Here, the design uses an adder tree to perform the summation, removing the need for any iterative stalling when producing a kernel output.
Our input block to the kernel pipeline operates based on the kernel selected. One of two operations will occur: the multiplication of two vectors, or the subtraction and squaring of those vectors. The first is required by the linear, polynomial, and sigmoid kernels; the latter is used by the Gaussian radial basis function. With the sigmoid kernel, alternative mathematical operations are once again performed. To calculate the sigmoid we use the hyperbolic identity

tanh(z) = (e^{2z} - 1) / (e^{2z} + 1).    (3.7)

The data path chosen for sigmoid initially follows that of the Gaussian radial basis function, except that the sigmoid kernel's parameters are applied. The output of the natural exponent unit is fed into additional logic to complete the determination of the sigmoid kernel value.
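For concreteness, the four kernels named above have the following standard forms in libSVM notation, where gamma, r, and d are kernel parameters; the first three of these share the dot-product input stage, while the RBF kernel uses the subtract-and-square stage:

\[
\begin{aligned}
K_{\mathrm{linear}}(\mathbf{u},\mathbf{v}) &= \mathbf{u}^{\mathsf{T}}\mathbf{v}, \\
K_{\mathrm{poly}}(\mathbf{u},\mathbf{v}) &= (\gamma\,\mathbf{u}^{\mathsf{T}}\mathbf{v} + r)^{d}, \\
K_{\mathrm{RBF}}(\mathbf{u},\mathbf{v}) &= \exp\!\left(-\gamma\,\lVert\mathbf{u}-\mathbf{v}\rVert^{2}\right), \\
K_{\mathrm{sigmoid}}(\mathbf{u},\mathbf{v}) &= \tanh(\gamma\,\mathbf{u}^{\mathsf{T}}\mathbf{v} + r).
\end{aligned}
\]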

Coefficient Weighting
Once scalar kernel values are available, the coefficient weighting phase may commence (Fig. 3.4). Given a model with k classes, each support vector has (k - 1) associated coefficients. These coefficients must be multiplied by their associated kernel value. By instantiating (k - 1) multiply-accumulate (MAC) blocks, each kernel value need be presented to the hardware only once. This reduces circuit complexity as well as iterative behavior and redundant memory accesses. While a particular model may have k classes, our hardware is designed to handle a maximum of K classes, where 2 <= k <= K. MAC blocks are enabled or disabled so that only those required for the current classification task are used.
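In libSVM's one-against-one formulation, which our design mirrors, the decision value for the class pair (i, j) combines the kernel values of the support vectors belonging to the two classes with their stored dual coefficients; this standard form is shown here for reference:

\[
d_{ij}(\mathbf{x}) = \sum_{t \in SV_i} \alpha_t^{(j)}\, K(\mathbf{x}_t, \mathbf{x}) + \sum_{t \in SV_j} \alpha_t^{(i)}\, K(\mathbf{x}_t, \mathbf{x}) + b_{ij}
\]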

Voting
The voting unit, as seen in Fig. 3.5, maintains separate counters for each class.
When a comparison value arrives from the coefficient weighting engine and the valid flag is set, it is compared with a floating point value of zero. As mentioned earlier, two class labels are forwarded to the voting unit to identify the counters involved in the current comparison. If the value is greater than zero, the lower-numbered class receives a vote; otherwise the higher-numbered class's counter is incremented. On each iteration, the elected class is updated by finding the lowest-labeled class holding the maximum number of total votes. When the unit receives the "last comparison" flag, the final vote is cast and a signal is pulsed to inform external hardware that the classification has completed and the elected class is valid.
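The tallying behavior can be summarized in software form. The helper below is a hypothetical illustration of the one-against-one vote, not the hardware's code; it assumes the pairwise decision values arrive in the order (0,1), (0,2), ..., (1,2), ... as in libSVM.

```c
#include <stddef.h>

/* A positive decision value votes for the lower-numbered class of the
 * pair; ties go to the lowest-labeled class, matching the unit above. */
static int vote_winner(const float *decisions, int k)
{
    int votes[64] = {0};                  /* assumes k <= 64 */
    int idx = 0;
    for (int i = 0; i < k; i++)
        for (int j = i + 1; j < k; j++, idx++)
            votes[decisions[idx] > 0.0f ? i : j]++;
    int best = 0;
    for (int c = 1; c < k; c++)
        if (votes[c] > votes[best])
            best = c;
    return best;
}
```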

Prototype Implementation Details
Our prototype was implemented in an Altera 5SGSMD5N Stratix V FPGA using a combination of Altera Megafunction IP blocks [19] and VHDL. The 5SGSMD5N was selected due to its abundance of DSP resources, which are required to efficiently implement floating point hardware logic. Altera IP blocks were used to provide floating point calculations in most instances due to their tested and proven accuracy, level of optimization, and relatively low latency.
Since Altera IP blocks support only single- and double-precision IEEE-754 calculations, reduced-precision floating point operations like those in [18] are not possible in our implementation. Of the two available formats, we chose single precision due to its lower latency, lower resource use, and the finding in [18] that it should provide enough bits of precision to accurately perform most classification tasks.
In designing our VHDL codebase, we created five top-level generics to control synthesis of the design. This allowed us to characterize the optimal performance of each synthesized configuration; tailoring the generics to a given workload should further reduce the area and power required by the design.

Evaluation Methodology
In evaluating the performance of our design, we wanted to demonstrate scalability as well as accuracy and efficiency in solving complex classification tasks. We have selected six workloads and one embedded system case study to prove the ability of our architecture to meet the demands of real-time systems.

Workload Characteristics for Benchmarking
We first selected a diverse set of existing real-world datasets obtained from both the UCI Machine Learning Repository [20] and the Statlog collection [21]. Details regarding the number of classes and features in the datasets can be seen in Table 3.1 in the Results section.
Adult. This is a modified version of the UCI Adult dataset, obtained from the libSVM authors [7]. We selected this two-class dataset as a means to baseline performance against the publicly available KMLib GPU library.
DNA. This workload was chosen to demonstrate performance with a large number of features (180). The classification problem is related to molecular biology and could represent a portion of a medical system: given a DNA sequence, the task is to determine the boundaries between introns and exons.
Letter. The purpose of this classification task is to determine the English alphabet letter corresponding to a given set of 16 attributes. 20 different fonts were used in the creation of the dataset to add to the complexity. This dataset could be viewed as an example of a potential embedded robotics application: automating the process of text recognition as part of a control loop in understanding the operating environment.
Shuttle. This dataset was supplied by NASA and consists of shuttle control data with regard to the position of radiators, essentially an embedded real-time defense system. It was chosen to aid in characterizing the error rate of our system, as a large number of test cases were available.

Satimage. This workload is used to identify different types of terrain present in a Landsat satellite image. Classes include red soil, cotton crop, vegetation stubble, grey soil, mixture, damp grey soil, and very damp grey soil. If the data were gathered and analyzed in real-time, the dataset could be representative of an embedded communications system.

Vowel. This dataset consists of features gathered from spoken British English vowels from several speakers. Speech recognition could be a critical task for a real-time embedded system functioning as an assisted living device. The Vowel dataset was selected in particular to examine performance with a small number of support vectors.

Case Study: A Human-Computer Control Interface
To fully evaluate the utility of R²SVM, we also examine its use in a real-world embedded system. The selected system is a human-computer wireless interface that classifies lower-arm electromyography (EMG) signals to perform user-intended cursor actions on a personal computer. The intent of this medical device is to allow amputees to seamlessly control a computer much like an able-bodied subject.
The overall setup of our system is as follows: EMG sensors gather data from several lower-arm muscles, and the resulting classifications are relayed over the wireless interface to a nearby computer where the intended operation is executed.
Our prototype system under development conservatively allocates 60% of the 20 ms window to classification. In our testing, we run three trials from collected data traces to show that R²SVM is capable of meeting the real-time demands of the classification portion of such an embedded system while operating at a very low power level.

Evaluation Platforms
Models were constructed from the datasets using libSVM. On all tested platforms, models were loaded prior to run-time so that only classification performance would be examined, and dedicated timers were used to record the time taken for each classification. For our testing purposes, we define accuracy as computing the same resulting class as a CPU running the well-known libSVM software package. The extent to which the SVM decision function correctly predicts with a given model depends on several factors beyond this paper's scope; we therefore omit details of the classification accuracy of the models themselves, as this data would add little value to our SVM performance discussion.
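A minimal sketch of this agreement metric follows, assuming a hypothetical read-back interface (fpga_classes) for the accelerator's reported labels; scikit-learn's SVC is used here purely as a convenient libSVM wrapper, not as the software actually benchmarked:

# Hypothetical harness for the agreement metric used here: accuracy is
# the fraction of vectors for which the accelerator reports the same
# class as libSVM on the CPU. `fpga_classes` stands in for whatever
# interface returns the accelerator's results.
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps libSVM

def agreement(model: SVC, X_test: np.ndarray, fpga_classes: np.ndarray) -> float:
    cpu_classes = model.predict(X_test)   # libSVM reference labels
    return float(np.mean(cpu_classes == fpga_classes))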

Hardware Synthesis
The full results of optimal hardware synthesis for our test cases are omitted for space, but are briefly described here. As anticipated, the number of DSP blocks consumed grows with the amount of floating-point parallelism instantiated in the design.

Workload and Case Study Performance Results
The results of testing for all kernels and scenarios can be viewed in Table 3.1.
We display both the average time taken for the FPGA to compute the classification and the speedup offered by R²SVM. In all cases, R²SVM predicted the same final class as libSVM on the CPU with 99.95% or higher agreement. The few encountered failures were examined individually to determine why the incorrect class was reported. In all instances, failure was determined to be due to floating-point rounding error, which accumulates faster on the single-precision FPGA. This indicates a near-boundary classification case, which in a real setting could be remedied through improvements to either the model features or the kernel parameters.
Additionally, we found that for our case study we were able to perform each prediction with 100% accuracy and within 14.1 µs. The time required for classification occupies only 0.07% of the budgeted 20 ms mentioned earlier, leaving ample time for accompanying hardware to perform data collection, feature extraction, and control. Leftover time could even be used to send alternative models or kernels to the classification module to strengthen the results.
Average FPGA speedups of up to 53.74x over the CPU and 23.33x over the GPU can be observed in Table 3.1. The GPU library used for testing proved to be a better-optimized implementation than the publicly available KMLib, which it outperformed by 418.53x on the Adult dataset. Examining both the CPU and GPU results in detail, a few local maxima exceeding several dozen milliseconds were discovered, reaffirming the need for a dedicated, well-bounded accelerator to support real-time tasking. Since the R²SVM design is entirely hardware driven with no facility for preemption, the FPGA classification times given in the table are guaranteed to be the actual times required to evaluate the decision function for a given model. To determine why R²SVM outperforms the CPU and GPU, we analyzed their codebases. On the CPU, when libSVM begins a classification, it first computes all kernel values before beginning the weighting process. This incurs a large performance penalty compared with the FPGA, which can begin coefficient weighting while kernel evaluation is still ongoing.
The CPU is limited in parallelization in this regard and must iterate, relying on cache and a higher maximum clock rate to gain performance. A similar issue occurs in the coefficient weighting section of the libSVM software, where the weighting coefficients must be applied to all support vectors in the system and summed; the CPU evaluates each class comparison separately, in a linear fashion. This can be a hugely expensive task for any processor. Meanwhile, R²SVM is constructed to perform these weightings and subsequent accumulations in parallel. R²SVM outperforms the GPU for similar reasons. Unlike the FPGA, the GPU must also perform all kernel operations first, followed by weighting. During kernel calculations, the GPU takes a large performance hit when a low number of features is presented; indeed, in many low-feature instances the CPU outperforms the GPU.
This is because the GPU incurs a time penalty with each kernel invocation, and with a small number of features the time needed to copy results back to the accompanying CPU cancels any advantage. As evident in the DNA dataset, the FPGA design still outperforms the GPU when a large number of features is presented. Thus, hardware optimizations to both the kernel and the coefficient weighting process allow our design to achieve performance levels that rival today's desktop processors and GPUs.
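The contrast can be made concrete in software. The sketch below is a simplified analogue, not the actual VHDL or libSVM code: it evaluates a single pairwise decision value f(x) = sum_i y_i*a_i*K(s_i, x) + b two ways, materializing every kernel value first (the CPU/GPU order) versus folding each kernel result into the accumulator as it arrives (the FPGA's streaming order).

# A software analogue of the two evaluation orders for one pairwise
# decision value; RBF kernel assumed for illustration.
import numpy as np

def rbf(s, x, gamma):
    return np.exp(-gamma * np.sum((s - x) ** 2))

def batch_decision(SVs, ya, b, x, gamma):
    """libSVM-style: compute every kernel value, then weight and sum."""
    kernels = [rbf(s, x, gamma) for s in SVs]            # pass 1: all kernels
    return sum(c * k for c, k in zip(ya, kernels)) + b   # pass 2: weighting

def streamed_decision(SVs, ya, b, x, gamma):
    """FPGA-style: weight and accumulate as each kernel value arrives,
    so no intermediate kernel buffer is needed."""
    acc = b
    for s, c in zip(SVs, ya):
        acc += c * rbf(s, x, gamma)   # kernel and weighting overlap
    return acc

In software the two orders perform the same arithmetic; the benefit arises only in hardware, where the streaming form eliminates the kernel buffer and lets the weighting stage begin before kernel evaluation finishes.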

Conclusion and Future Work
This paper presented a novel design and implementation of R²SVM, a reconfigurable, real-time, high-performance SVM classification architecture. While providing accuracy comparable to that of available software packages, our hardware design is scalable in terms of both features and classes. Using real-world datasets, we implemented our hardware and concretely measured its performance against two of today's popular general-purpose processing methods. In all cases, R²SVM was faster than both the CPU and the GPU used for benchmarking, with an average speedup as high as 53x and estimated energy savings of at least 12x. In future work we will explore alternatives to the kernel hardware. We envision that derivatives of our design will be deployed in many diverse settings, and we intend to release our design publicly under an open source license to promote further innovation in this exciting area.

Abstract
We propose the Reflex-Tree, a hierarchical parallel sensing and computing architecture for the management of future smart cities. At its base, a massively distributed fiber-optic sensing network feeds low-power edge computing nodes that perform pattern recognition on the raw measurements. The next layer up consists of servers that provide accurate control decisions via multilayer adaptive learning and spatial-temporal association, before connecting to the top-level cloud where complex system behavior analysis is performed. Our multilayered architecture mimics human neural circuits to achieve the high levels of parallelization and scalability required for efficient city-wide monitoring and feedback. To demonstrate the utility of our architecture, we present the design, implementation, and experimental evaluation of a prototype Reflex-Tree, driven by city power supply network and gas pipeline management scenarios as case studies. We show the effectiveness of several levels of the architecture and discuss the feasibility of implementation.

Introduction
Urbanization, the demographic transition from small, rural communities to large urbanized cities, is associated with shifts from an agriculture-based economy to one grounded in mass industry, technology, and service delivery. The combined population of the world's urban areas is predicted to surpass six billion by 2050 [1][2][3].
With this accelerated growth will come unprecedented increases in the consumption of resources and services, leading to material and energy shortages, which will in turn ultimately drive climate change [4][5][6][7]. The "Smart City" is an emerging concept aimed at dramatically enhancing the efficiency, sustainability, and safety of these urban communities. Integrating infrastructure and services into a cohesive whole allows them to be both monitored and managed using intelligent devices and systems [8]. Smart cities encompass enhancements in energy, building, mobility, healthcare, infrastructure, technology, governance, and citizenry (Figure 4.1). The technologies driving these enhancements have a predicted collective market worth of $3.3 trillion, and realizing them will require intelligent infrastructures to process the vast amounts of data collected [10]. Such infrastructure enables both appropriate and efficient resource allocation during normal operation and quick real-time response to storms, earthquakes, or other natural disasters [11][12][13].
Realizing the "intelligent" infrastructure at the foundation of future smart cities presents significant parallel processing challenges. Firstly, widely distributed, real-time, precise, and massively parallel sensor networks are essential to the success of smart city systems. Most of the existing proposed sensing network architectures [14][15] follow a centralized approach to data gathering and processing, where information is gathered through various sensors and routed directly to the cloud. They lack the sensing capacity, high spatial and temporal resolution, heterogeneous sensing capability, and reliability necessary to meet the critical requirements of future smart cities. Alternatively, a decentralized, layered approach could offer much in terms of fault tolerance for data gathering: a failure in one area of the network will not impact adjacent sites, resulting in improved up-time. In addition, filtering the data as it moves through a layered hierarchy should minimize the amount of data that must be sent to the highest layer, reducing the overall bandwidth required for operation as well as the associated costs.
Secondly, parallel computing nodes must be deployed at various geographical locations in order to process the massive volume of data generated by distributed sensors in real-time. Often the dataset is not only immense in volume, but also heterogeneous in nature, representing the status information of public infrastructure, healthcare systems, transportation networks, energy distribution, and other critical systems. Currently, no computing platform based on massively parallel and geographically distributed nodes exists that is capable of delivering the processing performance consistent with the low power envelope demanded by smart cities to reduce their operational costs.
Thirdly, in order to provide distributed, real-time control and decision making in complicated urban environments, advanced machine intelligence with spatial-temporal association and complex system behavior analysis is essential. The computational complexity of such machine intelligence increases from broadly distributed local nodes (neighborhoods, communities, and districts) up to a central node that requires city-wide control and decision-making over data volumes representing the entire urban area.
Finally, and most importantly, management and control functions must be implemented at each distinct level of an urban environment, from individual elements of infrastructure to blocks, neighborhoods, districts, and the entire city. As an example, if sufficient computing intelligence is partitioned to lower levels, a local gas leak could be quickly and efficiently handled by an edge-computing node, eliminating the time required for centralized city-wide control decisions to be made and communicated.
To tackle these challenges, we propose a transformative parallel computing and communication architecture specifically suitable for smart cities, the Reflex-Tree, which will be described in detail in Section 4.3.
This paper makes the following contributions: • Introduction of a novel deployment of a distributed fiber-optic sensing network (FOSN), able to provide timely measurements of both temperature and strain at millimeter-level spatial resolution.
• The creation of a unique hierarchical four layer, decentralized, large scale, and application specific parallel computing and communication structure capable of carrying out sensor-based decision-making processes.
• Detailed simulations of Reflex-Tree in two real-world problems: city power supply network and natural gas pipeline management. We describe how our architecture efficiently detects and handles problems that arise.

Smart City Infrastructure
A number of existing works examine potential applications and control infrastructure for the smart cities of the future. For example, the authors of [16] propose a storage system application for city-wide video surveillance. Data from high definition real-time security cameras is sent to a cloud storage system and split into a database based on metadata information. Dividing the streams was shown to allow for fast retrieval when on-demand classification analysis is required by the cloud.
Another suitable application for smart cities is smart lighting [17]. Here the authors discuss an IP-based approach to lighting control, intended to implement an efficient method of automated lighting control that ultimately results in cost savings. Unlike these two single-purpose applications, we present a high-level, broad architecture capable of handling a multitude of city tasks.
More related to the work we are pursuing, the authors of [14] propose a framework for effective disaster management. They suggest using crowd sourcing as a method for reporting on environmental and other conditions. Mobile or other wireless devices are used to relay information to the cloud, where it is assessed, filtered, and correlated with data from known reliable infrastructure sensors installed on city buildings or public transportation vehicles. Similarly, the work in [15] proposes another unique implementation of a cloud based management system used to gather sensor data from both citizens and places of infrastructure. The information undergoes sorting and classification at the cloud, where the appropriate public agency is contacted in the event a response is required.
To the best of our knowledge, our multi-tiered approach to city management differs from all existing works. In Section 4.3 we describe the framework of our parallel architecture, which gives lower-level nodes a degree of responsibility for decision making instead of relying solely on the cloud. This difference should increase critical response speed as well as scalability and fault tolerance. Not only can our architecture be relied on for disaster management, but it can also be extended to the day-to-day management of city infrastructure.

Sensor Networks for Smart City Applications
Two general types of sensor networks hold significant promise for wide adoption by future smart cities: active wireless sensor networks (WSN) and passive FOSNs. Currently, the WSN, in which each sensor node is both a transducer and a radio frequency (RF) transceiver, has been the most widely utilized network type [13,[18][19][20]. Examples include ad hoc networks and the more recent radio-frequency identification (RFID) systems [21][22][23][24][25][26][27][28][29]. The most distinct advantage of a WSN is that the network can be formed in-situ, making it favorable for use with mobile devices. However, as an active network, each sensor node must be powered locally, requiring frequent maintenance (e.g., battery recharging or replacement). Additionally, wireless communication is limited by the ambient environment, as RF waves have very short penetration depths in water or wet soil [30,31]. In a municipal gas distribution system, a large portion of the pipeline network is embedded underground, making WSN impractical or impossible.
In contrast, a FOSN is directly connected through optical fibers, allowing the passive network to be completely embedded within an infrastructure element and to remain maintenance-free during its entire lifetime [32][33][34][35][36]. These features, along with FOSN's unique advantages of compactness, high spatial and temporal resolution, resistance to chemical corrosion, immunity to electromagnetic interference (EMI), large multiplexing capacity, and remote operation with ultralow loss (0.2 dB/km), make FOSN the most promising technology to serve as a sensing layer for city-wide management [33,[37][38][39][40]. Passive FOSNs have been demonstrated to reliably measure temperature, strain, and pressure [41]. The optical nature of the sensors allows them to operate without electricity or mechanical parts, which both provides a cost advantage over WSN and underlies the EMI immunity noted above.

Overview Of Reflex-Tree Architecture
The Reflex-Tree architectural approach is inspired by the human nervous system, which uses several distinct hierarchical layers to process and react to millions of data streams of biological sensory information in real-time [42][43][44]. The key element of the reflex-tree concept is the inclusion of automated "reflex" circuits in the sensing and distributed computing architecture.
Physiologically, the myotatic (or stretch) reflex acts as a direct neural circuit that maintains muscle position without the need for centralized control input from the brain. We present Figure 4.2 as an example of this phenomenon: while the brain is the central controller of body activities, a direct neural circuit allows an individual to react to a stimulus before the brain becomes involved. At the top of the Reflex-Tree, the layer 1 cloud will use the inputs from the second layer to perform complex system behavior analysis and execute any required dynamic decision-making algorithms.
This allows for a city-wide response in the event of a natural disaster or other potential cause of service outage. The end result of our architecture is a new computing platform with massive parallelism across all four layers, providing the necessary computing power and intelligence demanded by smart cities of the future.

Case Studies in City Management
In this section we present two simplified case studies to demonstrate the potential utility of our architecture. First we describe the lower two layers (layers 4 and 3), which are common to the two case studies. In a real-world setting a combination of shared and unique sensing devices will be deployed for each city management task.
After describing these two common layers, we examine the unique specific use of the gathered information in layers 2 and 1 for both a natural gas distribution system and a city power supply network.

Common Layer 4: Fiber-Optic Sensor Network
To gather detailed information regarding the environment, we employ a new fiber-optic sensor network. At its foundation, molecular-level "finger-print" Rayleigh backscatter, extracted by optical frequency domain reflectometry (OFDR), is used as the sensing mechanism. A small-scale proof-of-concept fiber sensing system has been constructed and tested; these tests used a 1 m section of optical fiber mounted on a Swagelok tube, into which "hot", "cold", and "normal" conditions were injected. Machine learning techniques allow us to identify the three situations with much improved accuracy, which will be elaborated in the following section. The data collected from these experiments was used to construct an extensive set of raw frequency-shift model data, serving as the layer 4 output sensory information in our gas pipeline and electric utility simulations.

Common Layer 3: SVM Pattern Recognition
Recall that layer 3 of the architecture consists of edge devices connected to multiple sensing nodes. The main purpose of these edge devices is to perform one or more pattern classifications on the incoming raw data from layer 4, acting as a low-cost, low-power embedded solution for pre-filtering an enormous amount of raw measurement data. The results of pattern classification are transmitted to the next layer up (layer 2) for further analysis and decision-making. Transmission would again occur through optical fiber, both to reduce the power required to maintain the system and to allow data to propagate uncorrupted through potentially adverse environments.
For our case study, support vector machine (SVM) classification [45] is used to determine one of the three main temperature conditions (hot, cold, or normal) from the vast amount of gathered raw sensor network data. We selected SVM for supervised machine learning in part due to its ability to achieve high accuracy without the complex parameter tuning often required by neural networks [46]. While SVM may not be the most efficient classification method for our simplified case study, it should prove beneficial in a deployed environment where the edge sensor is relied on for processing a multitude of diverse workloads, many of which may require dozens or hundreds of feature dimensions. The modest size of the temperature task, in turn, allows data from multiple sensing nodes to be efficiently processed by the same edge device.
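As a rough illustration of the layer 3 task, the following sketch trains a three-class SVM on synthetic frequency-shift windows. The window length, noise level, and class means are placeholder assumptions standing in for the collected OFDR traces; scikit-learn's SVC, which wraps libSVM, is used for convenience:

# Hedged sketch of layer 3 classification: a 3-class SVM over windows of
# Rayleigh-backscatter frequency-shift samples. All data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
WINDOW = 16                                 # samples per sensing window (assumed)

def windows(mean_shift, n):                 # crude stand-in for OFDR traces
    return mean_shift + 0.2 * rng.standard_normal((n, WINDOW))

X = np.vstack([windows(-1.0, 200), windows(0.0, 200), windows(1.0, 200)])
y = np.array([0] * 200 + [1] * 200 + [2] * 200)   # 0=cold, 1=normal, 2=hot

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict(windows(0.9, 3)))         # expect three "hot" (2) labels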
While we mentioned earlier that an advantage of our architecture over a WSN is that the passive FOSN does not require power, the layer 3 processing nodes still require some. We nevertheless believe our design offers substantial power savings over a WSN approach, because an order of magnitude fewer layer 3 nodes should exist: each node is responsible for processing a multitude of individual optical sensors. To further minimize power consumption, we implemented several power modes that allow an edge device to be placed in a sleep mode when activity is low, decreasing the required power and therefore the cost of long-term operation.
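A back-of-the-envelope duty-cycle model makes the benefit of the sleep modes concrete. All figures below are illustrative assumptions, not measurements from our prototype:

# Duty-cycle power model (illustrative numbers only) for the edge-node
# sleep modes: average power falls in proportion to the active fraction.
P_ACTIVE_W = 2.0    # assumed power while classifying
P_SLEEP_W = 0.05    # assumed power in sleep mode
duty = 0.10         # fraction of time spent active

avg_w = P_ACTIVE_W * duty + P_SLEEP_W * (1.0 - duty)
print(f"average power: {avg_w:.3f} W")   # 0.245 W at a 10% duty cycle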

Case Study 1: Gas Pipeline Distribution System
We now describe our first case study in future city management: the control and monitoring of a natural gas pipeline. Gas pipeline systems play an essential role in supplying energy within cities. Several threats can affect pipeline integrity, including ground movements (landslides, seismic activity), harsh environments (sudden temperature changes), third-party intrusion (construction work), corrosion, and aging. These hazards significantly hinder and endanger pipeline function, leading to damage, leakage, and pipeline failure, each of which entails serious economic and ecologic consequences [49][50][51]. A smart energy infrastructure providing accurate, widely distributed, real-time, in-situ monitoring and control should significantly improve pipeline management and safety [52][53][54][55]. The autonomous, scalable reflex feedback control loops at the core of the Reflex-Tree architecture are ideally suited to monitor and control these dynamic and critical components of municipal infrastructure.
Three possible levels of emergency detection for gas pipeline events and their responses, together with the corresponding hierarchical layers and the intelligence algorithms employed at each layer, are listed in Table 4.1. At the lower layers, reflex responses act to isolate damaged pipelines and prevent further damage to the grid, while the cloud uses complex behavior analysis to detect system-wide cascaded overloading effects across the entire gas distribution network, which have the potential to continue to propagate and evolve within the system.
For the purposes of this case study, we seek to show that the sensor data gathered at layer 4 can be used to detect several potentially hazardous conditions. It is known that in a gas pipeline a local temperature drop is an indication of leakage, owing to the Joule-Thomson effect: local pressure release induces a cold spot in a compressed gas line, allowing the precise location of a leak to be determined quickly. Likewise, detection of heat can indicate that an explosion has occurred. Thus, although we are limited to sensing "hot", "cold", and "normal" at layer 3 of the Reflex-Tree, we should be able to correlate these detections with events occurring in the pipeline. To this end, we have constructed a small-scale simulation of such a pipeline. At layer 2, an intermediate computing node runs the DBSCAN clustering algorithm [56] to detect significant-sized sections of damaged pipe (several adjacent kilometers). In an actual system, the layer 2 nodes would run multiple detection algorithms, each tailored to the particular portion of infrastructure being monitored. The local grid covered by this computing node is, along with multiple other layer 2 nodes, interfaced directly to the cloud via optical fiber.
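The layer 2 step can be sketched as follows. The eps and min_samples values are assumptions for illustration; as noted in the Results, roughly four adjacent same-class detections were required before a cluster was reported:

# Illustrative layer 2 step: cluster the positions (km marks along the
# pipeline) of "hot" detections with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

hot_positions = np.array([[3.0], [4.0], [5.0], [6.0], [7.0], [42.0]])
labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(hot_positions)
print(labels)   # -> [ 0  0  0  0  0 -1]

Here the isolated detection at the 42 km mark is labeled -1 (noise) and ignored, while the run of adjacent detections forms a single reported cluster, which would be forwarded to the cloud.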
The cloud layer is not simulated in our case study due to its complexity. The cloud, which would consist of a large, powerful server cluster, would piece together the detections from the layer 2 nodes and look for data patterns and trends that may reveal a wide-scale or systematic issue using highly parallel techniques such as MapReduce [57].
For the case study we run a simulation of both a fire and ground tremor that increase in area over time. The fire results in a pipeline explosion, and the tremor is intended to produce pipeline leakage over time. "Reflex-arc" feedback is disabled in the simulation such that detections at higher layers can be clearly observed. We discuss our findings in detail in Section 4.5.

Case Study 2: City Power Supply Network
In the second case study, we focus our investigation on the feasibility of applying the Reflex-Tree architecture to a city power supply network modeled after a portion of a real-world power transmission network located in the San Francisco Bay area.
Power lines are another common manner in which cities supply and distribute energy.
Natural and other hazards can damage power grids, requiring re-routing or other manual intervention. We believe that the feedback loops present in Reflex-Tree are again an efficient and viable solution for management. Table 4.2 presents several potential emergency scenarios for the power supply network along with the Reflex-Tree layers/algorithms responsible for their detection and remediation.
For this simplified case study, we simulate a FOSN deployed along power lines together with multiple layer 3 edge devices running SVM to produce one of three temperature detections: "normal", "hot", or "cold". We consider a "hot" detection to correlate with an overcurrent condition, as the lines should heat up under such a condition, while a "cold" detection correlates with ice accumulating on the lines. In our simulation, we deployed a multitude of edge devices spaced one kilometer apart along the power lines. As with the gas pipeline, all edge devices are attached to a single layer 2 intermediate computing node running DBSCAN to detect clusters of issues, and the local grid covered by this node is, along with multiple other layer 2 nodes, interfaced directly to the cloud with optical fiber. Again, the cloud layer itself is not simulated. For this case study, we run a simulation of both an earthquake and an ice storm event over a period of time. The earthquake causes a ground fault in the power line network, resulting in an overcurrent condition. The ice storm results in an accumulation of ice on the power lines, a detectable but potentially catastrophic condition that should be attended to immediately. Our findings for this case study are described in the next section.

Results
A visualization of our full simulations can be found in the video files located at [59,60]. For both simulations, SVM classification in the layer 3 edge devices was found to be extremely accurate, predicting over 98% of all simulated pipeline states correctly over the duration of the simulation. In an actual deployment some measurement noise will likely be present in the system; however, the great multitude of FOSN sensors and their close proximity should aid in filtering it out. We leave this investigation to future work, as the objective of this initial paper is to present our proposed parallel architecture. In both case studies, the clustering algorithm at layer 2 was able to accurately detect portions of the pipeline/power line that were at risk once at least four individual, adjacent layer 3 classifications became available (this resolution is the result of a DBSCAN input parameter). To illustrate this, we present Figure 4.8, which contains four panels. From the top left in clockwise order, they are: the current status of the simulation, the current status of layer 3 classification, the "cold" clusters detected, and the "hot" clusters detected. For both the simulation and classification diagrams, a pipeline color of blue indicates "normal" status, red indicates "hot", and white indicates "cold". External heat events are displayed as a gradient from red (high intensity) to yellow (low intensity), and external cold events as a gradient from dark blue (high intensity) to teal (low intensity). In the displayed frame of the simulation, a fire has broken out on the left side of the pipeline, causing an explosion in a portion of the surrounding pipeline.
Meanwhile, the top right portion of the pipeline is experiencing a ground tremor, resulting in major leakage in two separate sections of the pipeline located within the vicinity of the disturbance. Moving to the diagram of detected classifications, it can be seen that the layer 3 edge devices produced classifications that closely matched the state of the simulated pipeline. In an actual deployed Reflex-Tree system, warnings would be relayed to city workers as the defects in the pipeline continued to mount, enabling quick intervention to prevent further damage. The layer 3 data is collected by a layer 2 node, and the individual hot and cold detections are separated for DBSCAN clustering, producing the bottom two hot/cold cluster diagrams in Figure 4.8.
Note that the colors chosen in these cluster diagrams have no inherent significance: in each cluster figure, a different color simply denotes membership in a different cluster. One hot cluster is detected, corresponding to the large section of pipeline experiencing an explosion, and two cold clusters are detected, corresponding to the two separate sections of pipeline affected by the ground tremor. Although the "reflex arc" feedback loop was not completed in this preliminary simulation, layer 2 nodes would be expected to immediately and intelligently shut down selected sections of the pipeline to prevent further damage. The cloud at layer 1 would then collate all data gathered by layer 2 to determine whether additional action should be taken.
The higher level behavior of the Reflex-Tree is currently under both theoretical and experimental investigation, and will be reported in publications to follow.
A very similar situation for the power line case study can be seen in Figure 4.9.
Here a simulated earthquake occurs in the upper left portion of the power supply network, while an ice storm is gradually taking place in the lower right half. The earthquake causes multiple ground faults due to downed wires, resulting in an overcurrent condition where the nearby lines heat up. The ice accumulation causes the temperature of the power lines themselves to drop considerably. Like the previous case study, SVM classifications yielded high accuracy (greater than 98%), and hot/cold DBSCAN clustering is seen to successfully distinguish sections of the power network that are in need of immediate attention.

Conclusions and Future Work
To the best of the authors' knowledge, we have presented the first hierarchical, parallel approach to city infrastructure management modeled after the human nervous system. Through simulation of realistic case studies, we have shown the concept to be both feasible and highly reliable in detecting potential issues at multiple stages of the hierarchy for the chosen scenarios. In an actual deployed system, the "reflex-arc" feedback should aid considerably in performing timely adjustments to city infrastructure before cloud-level intervention is required.
While we consider this initial simulation successful, an extensive amount of further work exists before such a system could be viably deployed. First, more complex simulations should be undertaken to simulate not only detectable problematic situations, but also to apply feedback and model the resulting response of the system.
In addition to energy distribution, the simulation should incorporate a number of other concurrently monitored domains, including traffic, weather, and lighting. An extensive simulation of the complex cloud layer will also need to be fully defined and implemented. In improving simulation quality, numerous efficient and parallel algorithms will need to be explored and tailored to specific tasks.
Certain tasks may require unique algorithms, while others may share algorithms in common. A final area of future research is merging FOSN, WSN, and other versatile sensing technologies for situations in which FOSN may not be the best solution for acquiring data.