Program Acceleration in a Heterogeneous Computing Environment Using OpenCL, FPGA, and CPU

Reaching the so-called “performance wall” in 2004 inspired innovative approaches to performance improvement. Parallel programming, distributed computing, and System on a Chip (SoC) design drove change. Hardware acceleration in mainstream computing systems brought significant improvement in the performance of applications targeted directly to a specific hardware platform. Targeting a single hardware platform, however, typically requires learning vendor- and hardware-specific languages that can be very complex. Additionally, Heterogeneous Computing Environments (HCE) consist of multiple SoC hardware platforms, so why not use them all instead of just one? How do we communicate with all platforms while maximizing performance, decreasing memory latency, and conserving power? Enter the Open Computing Language (OpenCL), which was developed to harness the power and performance of multiple SoC devices in an HCE. OpenCL offers an alternative to learning vendor- and hardware-specific languages while still harnessing the power of each device. Thus far, OpenCL programming has been directed mostly at CPU and GPU hardware devices. The genesis of this thesis is to examine the connections between parallel computing in an HCE using OpenCL with CPU and FPGA hardware devices. Underscoring the industry trend to favor FPGAs in both computationally intensive and embedded systems, this research will also highlight the FPGA specifically, demonstrating performance ratings comparable to CPU and GPU at a fraction of the power consumption. OpenCL benchmark suites run on an FPGA will show the importance of performance per watt and how it can be measured. Running traditional parallel programs will demonstrate the power and portability of the OpenCL language and how it can maximize the performance of FPGA, CPU, and GPU.
Results will show that OpenCL is a solid approach to exploiting all the computing power of an HCE and that performance per watt matters in mainstream computing systems, making a strong case for further research into using OpenCL with FPGAs in an HCE.


Program acceleration is the idea of achieving the best program performance by utilizing all tools available, both hardware and software. Programmers create code that is clean, efficient, and built to run in parallel wherever possible in any computing environment. This idea is the motivation behind the Heterogeneous Computing Environment (HCE) concept and the creation of a language to maximize its capabilities, OpenCL. This thesis will briefly discuss the connection between parallel computing and heterogeneity in mainstream computing as a precursor to understanding the need for OpenCL and why it is somewhat revolutionary. This brief background in parallel programming and heterogeneity will be followed by program and data analysis using OpenCL in an HCE consisting of a CPU, GPU, and FPGA.

Motivation for Research
GPUs have cornered the lion's share of developmental research due to their ability to process large amounts of data as well as their physical presence in nearly every mainstream computing system. Up until the release of OpenCL, developers shied away from FPGAs due to the complexity of the programming needed just to perform simple functions. Within the past few years, some research has been conducted using OpenCL and FPGA acceleration, such as information filtering [1], fractal video compression [2], and finite impulse response filters [3]. More research is warranted, however, due to the increased utilization of FPGAs and because they are the best choice for executing highly parallel programs while consuming the least amount of power possible. Measuring program execution speed is one measure of performance, but the amount of wattage consumed to achieve that speed is also critical; this is referred to as performance per watt. This research will demonstrate how it has never been easier to program hardware thanks to OpenCL, and will demonstrate the performance enhancement and power savings of FPGA over CPU and GPU in an OpenCL-controlled environment. OpenCL in a heterogeneous computing environment enables computer programmers and engineers alike to maximize acceleration and performance across all hardware platforms. This research will substantiate why OpenCL should be used for program and hardware acceleration in an HCE.

Chapter Overview
Chapter 2 provides the background information and research on the significance of OpenCL as a computing language. It reviews the main literature used in this thesis, reviews parallel programming and how it ties into OpenCL, analyzes speed and performance gains as they apply to hardware, and highlights the FPGA as the de facto standard for power savings in an HCE.
In Chapter 3, a detailed explanation of the methodology is presented. The OpenCL architecture is introduced in detail, along with how it binds the components of the HCE together. As supporting evidence for this research, existing OpenCL programs will be altered to achieve maximum performance results and to demonstrate the portability of OpenCL programs and kernels. To illustrate the simplicity of hardware programming using OpenCL, some existing OpenCL programs will be compiled and run on the FPGA.
A feature program will demonstrate the speed and performance gains of using the FPGA as a hardware platform. The Altera Software Development Kit (SDK) will also be introduced, since understanding it is crucial to visualizing how the hardware and software communicate in the OpenCL-controlled HCE.
Chapter 4 will validate the benefits of using OpenCL and FPGA in an HCE.
First, comparisons will be made between the results of running a traditional C program

Parallel Programming
The need to focus more on software than hardware to accelerate program execution and efficiency came to fruition after processor manufacturers reached the "performance wall" around 2004. Figure 2.1 shows how the performance increase of single processors and memory configurations flatlined. Sequential programming techniques such as the "divide and conquer" method, which divided a single program into smaller subsets or groups of code executed separately, were precursors to the many classes of parallelism and parallel architectures that exist today. Object-oriented computing languages such as Java and C++ were also designed initially to speed up sequential program execution, but suffered from too many data dependencies between function calls to exploit the essence of true parallelism. Instruction-level parallelism was used on superscalar and out-of-order execution uniprocessors, but it only increased the execution speed of traditional sequential programs. [4]

Figure 2.1 Processor performance growth.
Around 2006, chip developers began placing multiple processors, or cores, onto one chip. Though the idea of having multiple computers work together on one program was not a new concept, the era of multiple cores had begun. This opened the door for parallel processing programs (running one job on multiple processors) and job-level parallelism (running multiple jobs on multiple processors). [5] Within the realm of the personal computer, however, these multi-core machines are typically confined to sharing the same memory or physical address space. Clusters and grid computing, now commonplace among online databases, were also introduced. There are physical limitations on how many cores can be placed on a board, and even though cores designed today have their own data memory, they still compete with other cores for the same physical address space, which is controlled by the CPU.
Additionally, Single Program Multiple Data (SPMD) models work well for running tasks in parallel, but matters become much more complicated when data needs to be shared and synchronized between the tasks. With the release of OpenCL 2.0, shared virtual memory is used between program and device, which also supports access to data across tasks being executed on separate devices.

Speed and Performance
Before moving on to discuss the significance of running OpenCL in an HCE, we need to review some basic concepts of speed and performance. In the past decade, considerable strides have been made in increasing the overall speed and performance of programs executed on hardware accelerators such as GPU and FPGA boards. Where development based upon Amdahl's law, the speedup in latency of task execution on a fixed workload, stops, Gustafson's law continues and proves to be a more realistic formula for calculating parallel performance, especially as it relates to the basis of this research.
Gustafson's law is formulated as:

S_latency(s) = 1 - p + s * p
where:
• S_latency is the theoretical speedup in latency of the execution of the whole task,
• s is the speedup in latency of the execution of the part of the task that benefits from the improved resources of the system, and
• p is the percentage of the execution workload of the whole task that benefits from the improved resources of the system, measured before the improvement.

Gustafson's law shows that even within a constant time interval, the complexity of a program can increase as long as the quantity or quality of the resources computing it increases, such as the number or performance capability of processors. Power consumption must also be kept at a minimum as the tech industry designs more and more applications of embedded computers that run off of battery power.

Heterogeneous Computing Environment
Now that we have reached the undeniable conclusion that more processing power is better, let's discuss how that processing power can or should be arrayed.
Heterogeneous computing is the concept of using multiple "devices," each with its own processor and memory capability, to execute programs, tasks, or functions independently or concurrently with other devices. These devices include multi-core CPUs, GPUs, FPGAs, and digital signal processors. [9] As an example, GPUs have been used for decades as independent processing units to enhance computer-generated graphics for gaming. Graphics generation and rendering require complicated floating-point calculations that continually place a high demand on computing resources.
Since GPUs are equipped with their own on-chip memory and processor, they opened the door for speedups in graphics processing during hardcore gaming, especially 3D rendering. Modern GPUs have anywhere from 32 to 64 individual cores placed directly on the chip. NVIDIA Corporation developed its own API, called CUDA, to bind high-level computing languages such as C, C++, and FORTRAN with the massively parallel computing power of its GPUs.
The need for heterogeneous computing has increased exponentially over the past decade. Instead of relying on one processor to perform at the highest possible speed, multiple processors are utilized to achieve faster program execution. Parallelism in programming requires additional processors to gain a true advantage over the traditional sequential programming that has dominated software development for decades. As shown in Figure 2.3, these additional processors can be found within the multiple cores of all mainstream computing systems. Besides multiple cores, hardware components of computers today typically have their own memory and processor(s) located directly on the circuit board. SoC design takes advantage of spatial and temporal locality to reduce memory latency and achieve maximal speedup. As mentioned earlier, manufacturers such as NVIDIA offer GPU cards for laptops and desktop computers that have their own CPU and memory on the same die, which greatly reduces memory latency, data dependencies, and data conflicts. The DE1-SoC board used for this research has a dual-core ARM Cortex-A9 MPCore processor coupled with 1 GB of DDR3 SDRAM as part of the Hard Processor System (HPS). All of these processors or cores located in the various hardware components, along with any external boards that can be connected via PCIe or USB UART connections, provide additional computational power that can be utilized to support data and thread parallelism. These "self-contained" constructs of processor and memory constitute the "devices" of a heterogeneous computing environment.
Each device in a heterogeneous environment handles the processing of complex applications differently. Complex applications can be broadly categorized by the workload that they place on the device being used. Control-intensive applications include searching, parsing, and sorting operations; data-intensive applications focus more on image processing, simulation and modeling, and data mining; and compute-intensive applications involve iterative methods, numerical methods, and financial modeling. [10] The complexity of the task at hand will determine which device is best suited for executing it, as seen in Figure 2.4, which illustrates how a computer may handle data-parallel versus serial and task-parallel workloads. CPUs are typically best suited for control-intensive applications, while GPUs excel at processing imagery due to their massively parallel design for handling large amounts of data.
FPGAs are also inherently parallel and handle complex parallelism with a lower consumption of power than both CPU and GPU platforms. An example of this will be examined exclusively in Chapter 4. Workload diversification has opened the door for improving performance and lowering overall power consumption for applications such as HD video conferencing and real-time language translation. Farming out the parallel processing chores to the GPU while keeping the operating system requirements and serial tasks with the CPU enables maximization of both devices, which is the essence of heterogeneous computing.
Now that the layout of an HCE is clear, let's look at accelerating the hardware devices. In order to tap into the acceleration power of hardware such as multi-core CPUs, GPUs, or FPGAs, one must acquire an in-depth knowledge of high-level (C, C++, etc.), hardware (VHDL for FPGA), or even vendor-specific (CUDA) languages, which can be very complex and are applicable only to one specific piece of hardware. If only there were a way to accelerate any hardware device using one programming language. This is precisely what OpenCL offers, as will be illustrated in Chapter 3.
The use of a heterogeneous computing environment, coupled with the OpenCL language and the FPGA, offers an alternative framework for maximizing performance, decreasing power consumption, and working around the aforementioned challenges.

FPGA Programming and Design
FPGA design has evolved significantly over the past few decades. In the past, using an FPGA in a development environment required extensive programming just to get the FPGA to perform some simple functions. As a result, FPGAs have been by and large avoided by program developers. [11] FPGAs were the forerunners of the gaming industry, back when games such as Space Invaders and Pac-Man ruled the arcade. Designed as simple two-dimensional arrays of field-programmable logic gates connected in parallel, they can perform complex computations at a fraction of the performance cost of their hardware brethren. The large number of logic elements typically found on an FPGA provides the means for multi-threaded parallelism. A modern FPGA design such as the DE1-SoC board from Altera used in this research contains over a million logic elements and thousands of memory blocks in a parallel design that enables multiple OpenCL work-groups to be processed concurrently.
Prior to OpenCL, programming an FPGA required learning complex hardware description languages such as HDL or VHDL, along with using Electronic Design Automation (EDA) tools, in order to properly convert a design idea into the complex logic circuit that programmed the FPGA board. With the advent of the OpenCL language, discussed in more detail in Chapter 3, FPGAs can now be programmed without having to learn HDL or VHDL. The SDK supplied by the vendor of each OpenCL-compatible FPGA board handles the chore of creating the complex logic circuit needed to program the board. The SDK utilizes a software program in the background, such as Quartus II for the FPGA, to create the hardware configuration file.
As mentioned earlier, the FPGA is becoming the hardware processor of choice in embedded applications where power is at a premium. The FPGA has a fine-grained parallel architecture, and because OpenCL generates only the logic you need, it can deliver results at one-fifth the power of the hardware alternatives. [12] There is a solid connection between OpenCL and FPGA that may lead to a sharp increase in FPGA adoption in environments beyond the embedded market.

METHODOLOGY
Program acceleration can be achieved in many ways, and no one way will work all of the time. Sometimes data-level parallelism is more critical than job-level parallelism, or image processing enhancement is more crucial than the speed at which linear algebra problems are calculated. But there may be alternate ways to accelerate hardware in the interest of achieving maximum results with minimal power consumption. The goal here is to shed light on a new paradigm that attempts to tackle complex programs in the most advantageous way possible instead of being constrained to one device. This research is rooted in four key reasons that substantiate the argument for using OpenCL to program hardware, especially FPGA, in an HCE:
1. Heterogeneity in program design provides access to more processing power.
2. OpenCL makes programming hardware easier.
3. OpenCL provides program portability along with speed and performance advantages.
4. FPGAs can perform just as fast as GPUs and CPUs at a fraction of the power consumption.

Parallel programming exploitation using OpenCL
Reason one was covered with a basic explanation of a heterogeneous computing environment and how it correlates to OpenCL. A "Hello World" example will be used not only to demonstrate the basic functions of the OpenCL API, but also to show just how simple it is to program hardware such as the FPGA. Portability, speed, performance, and power savings will be demonstrated using a vector addition program run in the HCE. Wrapping up the supporting arguments will be a Black-Scholes financial options pricing program converted from C to OpenCL to illustrate how the parallelization of existing software and hardware can be exploited.

Research Setup
The components of my heterogeneous computing environment and the interconnectivity of the hardware and software components are diagrammed in Figure 3.1. The hardware and software used consisted of the following: Hardware

Altera SDK and DE1-SoC Development Board
In order to build and run programs within an OpenCL environment, you need to install an SDK specific to the vendor of the hardware in your HCE. Since the DE1-SoC board from Altera is used in this research, it was necessary to install the Altera SDK for OpenCL to build and run kernels on that board. The Altera SDK

OpenCL Architecture
Open Computing Language (OpenCL) is designed not only to communicate with all devices in a heterogeneous computing environment, but also to direct the execution of programs by assigning programming tasks to the best device available. [14] It bridges the gap between the computing power of multiple devices by serving as one computing language that software programmers can learn fairly quickly in order to interact directly with hardware. Created by Apple, it is an open standard that was first released in 2008 by the Khronos Group, which still manages the libraries and releases. Khronos developed the OpenCL standard so that an application can offload parallel computation to accelerators in a common way, regardless of their underlying architecture or programming model. [15] Not only is OpenCL designed to support the heterogeneous computing environment, it also affords cross-platform portability. A program written in OpenCL can be run on virtually any machine that has an OpenCL SDK and the applicable libraries installed.
The OpenCL C language is a restricted version of the C99 language, but it also has wrappers that support C++, Java, Python, and .NET. [16] The kernel programming model builds kernels using constructs such as work-items and work-groups to obtain maximum parallelization based upon the targeted device. The host is responsible for compiling the main source code using a standard C compiler and loading the host binaries into memory. The main source code contains the OpenCL commands and function calls, contains any constant data, and controls the execution of a kernel on a device. Runtime compilation prepares the kernel to be run on the best device based on computational workload, which maximizes both concurrency and parallelism. [19] After the kernels are created, they are compiled by an offline compiler to create the hardware configuration files (.aoco and .aocx). The .aocx file is used to program the FPGA, enabling it to run the kernel. It takes the place of the complicated VHDL programs that were needed to program an FPGA in the past. The memory model consists of buffers, images, and pipes that handle memory allocation, temporary storage of data, and prioritization of how data items are stored.
OpenCL defines memory regions as either host or device specific. Global memory (DDR and QDR) is visible to all work-items executing a kernel, constant memory stores data that remains constant, local memory (on-chip memory) shares data between work-items, and private memory (on-chip registers) is unique to an individual work-item. [20] Figure 3.5 shows the layout of host and device memory regions within OpenCL.

Figure 3.5 OpenCL memory model.

OpenCL Runtime and FPGA Programming
When learning a new computer language, most authors and instructors start off with a "Hello World" version of the code that shows a few basic commands, function calls, include files, and the like. OpenCL is no exception. In this case, the hello world program is used to demonstrate how the OpenCL architecture interacts with the associated hardware in your HCE. A full listing of the "Hello World" code is provided in Appendix A. We will examine this program by breaking it down into smaller sections of OpenCL code that align with the application steps that all OpenCL programs follow. The final step will be to convert it to run on the DE1-SoC FPGA, demonstrating just how easy it is to program hardware with OpenCL.
Since OpenCL is basically a derivative of the C language, it follows the same principles of utilizing source and header files. In the source file, there is one main function that controls the order of program execution along with the program-specific function definitions and declarations. It also provides space to declare host-side memory functions and operations, variables, and constants. The bulk of the standard OpenCL function definitions and declarations are located in a cluster of header files that are continually updated as newer versions of OpenCL are released.
All OpenCL header files are available for download from the Khronos Group website as well as popular developer websites such as GitHub.
As stated earlier, all OpenCL programs follow ten main steps that set up the OpenCL environment and run kernels on the available devices. [21] Below is a detailed breakdown of each step along with the associated code.
1. Discovering the platform and devices. In order for a kernel to run on a device, the host must first determine if an OpenCL platform is present and how many devices are associated with it. Figure 3.6 shows the standard function calls for finding the OpenCL platform and devices.
2. Creating a context. After the platform and devices have been discovered, the host program creates a context that includes all devices. Figure 3.7 shows the standard function call for context creation along with error checking.
// Create the context.
context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
checkError(status, "Failed to create context");

4. Creating memory objects (buffers) to hold data. Memory objects enable the transfer of data between the host and the device. The buffer is associated with the context on the host side, making it accessible to all devices in that context. Flags can also be used here to specify whether data is read-only, write-only, or read-write. The hello world program has no need to hold data in memory, so Figure 3.9 is a code snippet from a vector addition program that shows the creation of buffer objects.

5. Copying the input data onto the device. Data is copied from a host pointer to a buffer, which is ultimately transferred to the device when needed. Figure 3.10 shows the standard function call from the vector addition program.

6. Creating and compiling a program from OpenCL C source code. The hello world kernel is stored in a character array named "helloWorld", which is used to create a program object that is then compiled. During compilation, the information for each targeted device is provided as needed.

7. Extracting the kernel from the program. Since the hello world kernel is embedded in main.cpp as the character array "helloWorld", it must be extracted in order to be executed independently on a device. Figure 3.12 shows the standard function call for kernel extraction.
// Create the hello world kernel
kernel = clCreateKernel(program, "helloWorld", &status);
checkError(status, "Failed to create kernel");

9. Copying output data back to the host. This step reads data back to a pointer on the host. Since the hello world program does not use buffer objects, Figure 3.14 is a code snippet from the vector addition program that copies output data back to the host.

Figure 3.14 Copy data back to host.
10. Releasing the OpenCL resources. The OpenCL resources that were allocated for kernel execution are released. This is similar to C or C++ programs where memory allocations, for example, are freed at the end of program execution. Figure 3.15 shows the standard function calls to release resources.

Converting the hello world program (or any program for that matter) to run on the DE1-SoC FPGA is as easy as creating a hardware configuration file (.aocx) and an executable file for the Linux-based environment used on the DE1-SoC board. As shown in Figure 3.1, the SoC EDS cross-compiler is used to "make" a Linux executable file that runs the host code on the FPGA using the onboard processor. This executable file is built from the main.cpp file and includes linkages to any libraries and additional source code as needed. The hardware configuration file is built from the kernel.cl file using the Altera Offline Compiler (AOC), which communicates with the Quartus II software. This creates a VHDL-like version of the kernel that can be run directly on the FPGA. It is only necessary to have Quartus II installed for the AOC to use; there is no need to understand how to use that IDE or even to understand VHDL programming. The SDK does all the heavy lifting of hardware configuration for you.

Maximizing Speed, Performance, and Portability
The advantages of using OpenCL are best highlighted with example code that is straightforward and easy to manipulate in order to achieve easily recognizable results. It should be obvious to all software developers and hardware users that there is no one-stop solution to solving computational problems. Some programs are designed to maximize the advantages offered by a specific piece of hardware, and vice versa.
But OpenCL differs in that the program is created to be used across multiple hardware configurations. This opens up the possibility of giving the OpenCL runtime the flexibility to choose the hardware device that is best suited for the task at hand. To emphasize the favorable attributes of OpenCL, I used a vector addition program that can be quickly altered to illustrate different results. The complete code listing for vector addition is provided in Appendix B.

Converting from C to OpenCL
To best demonstrate the effects of converting an existing program into OpenCL, I selected an existing code example that could be optimized to reap the benefits of running in an HCE using OpenCL. The financial market uses what is referred to as vanilla option pricing to determine call and put options for assets. As shown in Figure 3.17, the randomization of x and y within the do…while loop, which itself only runs a few times to arrive at the final value of x, was consuming the bulk of the effort from the CPU. The challenge becomes how to convert this C-based program into OpenCL while optimizing program execution. The OpenCL approach is to look for functions or blocks of code, as in Figure 3.17, that can be converted to a kernel for faster execution on a specific device or on multiple devices. This is akin to data and thread-level parallelization, but with the caveat of having more flexibility to tweak the number of work-items in a work-group and achieve even greater effects on the target hardware. The results and analysis gathered from this program conversion are discussed under the Black-Scholes sub-heading in Chapter 4.

FINDINGS
In order to quantify the results and analysis of using OpenCL in an HCE on multiple devices, this research focuses on three overarching use cases. The first use case focuses on a commonly used vector addition algorithm to compare and contrast results across different hardware devices. The second use case involves converting an existing program from a traditional high-level language such as C to OpenCL to demonstrate the benefits of such a conversion. The third use case focuses on running benchmarks that currently exist for the GPU platform on the FPGA; it also involves converting code, but demonstrates the portability of the OpenCL language. Common to all cases is the demonstration that programming hardware such as the FPGA has never been easier.

Vector Addition Program
The vector addition program takes two vectors and adds them together, creating a third resultant vector. This particular OpenCL rendition of the program is a good example of how pipeline parallelism can be exploited on an FPGA to achieve results that are similar to a GPU. Figure 4.1 depicts a representation of the load and store operations that occur within each logic element of the FPGA. The DE1-SoC board has 84,000 logic elements that can be utilized simultaneously by segregating the input vectors into work-groups. The FPGA handles each work-group as if it were a thread during parallel execution. When the loads of thread ID 0, for example, are passed to the ALU for addition, the next two thread IDs are fetched from the host-side memory buffer.

Figure 4.1 Vector addition pipeline
The program is set up to run on all available devices, so it is an excellent example of why multiple devices within an HCE are important and how OpenCL can best leverage those devices. It can be adjusted to run the entire kernel on one device only, to run mirrored vector addition on multiple devices simultaneously, or to divide the vector workload across the devices. Buffer objects are created for each device to facilitate the transfer of data between host and device memory. In my setup, the DE1-SoC has its own memory and processor, so running the main program and kernel directly on the board exploits spatial locality and avoids the memory latency that would be present when running the source program from the host computer.
Here is the method used to conduct a comparative analysis of the vector addition kernel run on three different devices.

Devices:
• CPU (4 compute units at 1.7 GHz) (Intel Core: 8 SP GFLOPS/cycle)

Program alteration: altered the data bandwidth by changing the total number of elements to be processed:
• N = 500,000
• N = 1,000,000
• N = 10,000,000

Captured the system clock time averaged over 5 runs for each value of N.

Results:
• Figure 4.2: straight system clock time for kernel execution
• Results 4-6: times adjusted to account for core-count advantage (FPGA baseline with 1 CU)
• Results 7-9: times adjusted for clock rate advantage (CPU baseline)

With a level playing field for each processor (excluding any data conflicts caused by the memory shared between the CPU and GPU), the adjustments listed in Table 4.1 yield the results in Figure 4.3. This is a prime example of how the advantage of an FPGA versus a CPU or GPU can be overlooked. Given the same clock rate and equal program workload distribution, the FPGA outperforms the CPU, comes very close to the GPU, and consumes multitudes less power than both.

Black-Scholes Program
The Black-Scholes program served as an entry-level program that offered all the challenges of converting a standard C-based program into OpenCL along with all the benefits of doing so. Achieving program optimization without simply creating redundant code is the goal of any developer. The best method is to work from the inside out, looking for functions and blocks of code that could benefit from parallelization. If the number of simulations for the Black-Scholes program is set at 100,000, for example, the pricing function is called 100,000 times by each of the put and call functions. Effective resource and memory management enables the best balance between CPU and FPGA for overall program execution.

Rodinia Benchmarks
The Rodinia Benchmark Suite, version 3.1, offers a wide range of benchmarks targeted at CUDA, OpenMP, and OpenCL applications on GPU hardware. Because no OpenCL benchmark suites exist for FPGAs, the Rodinia benchmarks were chosen: they already contain OpenCL code and would serve as an opportunity to maximize the benefits of running OpenCL in an HCE. Since the benchmarks were designed for optimal performance on GPU, not all of them are good candidates for conversion and execution on FPGA. The benchmarks chosen for this research were algorithms that could benefit from vectorization, were heavy in data transfer between local, global, and host memory, and utilized multiple kernels.
The first benchmark chosen was the K-Means data-mining algorithm, which is heavy in data parallelization. The algorithm takes an initial set of input data and sorts it into clusters based upon the data's features. Within each cluster, one item is determined to be the centroid for that particular cluster. The program uses the Euclidean distance metric to calculate distances between data elements. The Rodinia version of the K-Means program stores all the data points, features, cluster centers, and data-to-cluster assignments in separate arrays. Pointers are used to reference these arrays while performing the I/O, clustering, centroid computation, and cluster reassignment based upon the proximity of the data elements to the centroids. The program is divided into separate .c files, with the main execution residing within the kmeans.cpp file.
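The heart of the assignment step described above can be sketched in a few lines of C. This is an illustrative reimplementation, not Rodinia's code; the array layout (centroids flattened as clusters × features) and function name are assumptions. Each data point runs this independently, which is what makes the step so amenable to data parallelism.

```c
#include <assert.h>
#include <float.h>

/* For one data point with `nfeatures` features, return the index of
 * the nearest of `nclusters` centroids.  Centroids are stored
 * row-major: centroids[c * nfeatures + f].  Squared Euclidean
 * distance is used, since it has the same argmin as the distance. */
int nearest_cluster(const float *point, const float *centroids,
                    int nclusters, int nfeatures) {
    int best = 0;
    float best_d = FLT_MAX;
    for (int c = 0; c < nclusters; ++c) {
        float d = 0.0f;
        for (int f = 0; f < nfeatures; ++f) {
            float diff = point[f] - centroids[c * nfeatures + f];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```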
As mentioned earlier, the Rodinia Benchmarks are optimized to run on GPU.
The K-Means algorithm provides multiple points of manipulation for achieving different results, which benefits experiments with both software and hardware. Running the K-Means benchmark as downloaded produced results that favored the GPU over the FPGA. The OpenCL kernels in this benchmark are quite simple and don't provide much opportunity for FPGA optimization. After a thorough examination of the code, I decided to manipulate the constant number of clusters while adjusting the work-group size to best match the FPGA's capabilities. Notably, the combination of cluster count and work-group size that was best for the FPGA produced the worst performance for the GPU. Figure 4.8 shows the optimal results obtained when working on a 100,000-element data set with 10 clusters and a work-group size of 1024.

The next Rodinia benchmark selected for analysis was the hybrid sort program, which combines two popular sorting algorithms: bucket sort and merge sort.
The program takes a list of floats in random order and runs a bucket sort algorithm to separate the input data into groups, or buckets. This step of the sorting process can be configured to run across all devices in the HCE, tracking the execution time on each device. Next, the buckets are stored in a vector that is fed into the merge sort portion of the program. The merge sort can also be configured to run on all devices, with execution time tracked the same way. As is a common theme in OpenCL optimization, examining the code's execution and determining the proper work-group size for the targeted hardware device can often be all that is required to achieve the best results. Figure 4.9 shows the result of the merge sort and how the DE1-SoC board achieved a slight advantage over the GPU when the work-group size was set at 1024. Also, the hierarchy of program execution across the NDRange, work-groups, and work-items underlines one of the major benefits of OpenCL: its portability and scalability.
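The bucket phase can be sketched with a simple range-based mapping. This is an illustrative plain-C sketch under the assumption that the input floats lie in [0, 1); the helper names are mine, and qsort stands in for the per-bucket sorting that the benchmark performs with its merge-sort kernels.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison function for qsort over floats. */
static int cmp_float(const void *a, const void *b) {
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Map a value in [0,1) to one of nbuckets equal-width buckets,
 * clamping the top edge so v == 1.0 still lands in the last bucket. */
int bucket_of(float v, int nbuckets) {
    int b = (int)(v * (float)nbuckets);
    return b < nbuckets ? b : nbuckets - 1;
}

/* Sort one bucket's contents in place (stand-in for the merge phase). */
void sort_bucket(float *data, size_t n) {
    qsort(data, n, sizeof(float), cmp_float);
}
```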

CONCLUSION
This research was dedicated to determining whether program acceleration with OpenCL in a Heterogeneous Computing Environment consisting of FPGA, CPU, and GPU is a credible approach to increasing speed and performance while minimizing power consumption. Today's computing industry demands more than ever before: maximal computational performance with minimal power consumption. We have moved well beyond the era of processor overclocking and on to multi-pronged approaches to enhancing CPU performance such as multi-core processors, distributed computing, and now OpenCL. Being able to maximize the performance of all hardware platforms simultaneously, independently, or sequentially is what OpenCL is all about.
Learning a new computer language is always a challenge. Time and repetition are paramount, as is tapping into the best resources available.
Compared to other high-level languages such as Java, OpenCL may be somewhat intimidating for developers without C or C++ experience. I found the basics of the language straightforward to learn, and there are multiple online training sessions offered at no cost from Intel that will get even a novice programmer up and running in relatively short order. The more detailed and in-depth programming lessons will cost money, however. Since the Khronos Group manages the official OpenCL API, their website is naturally an excellent repository of source code and libraries. GitHub can also be used, but be cautious not to mix up header files for different versions of OpenCL.
The software development kit used was provided by Altera Corporation and required me to obtain a student license. Altera has since been acquired by Intel, so all software downloads can now be obtained through Intel's website as well. Intel directs you to use Microsoft Visual Studio 2010 for compiling your host code for use with their SDK. Learning the API of the SDK also takes some time, but the instructions are very well written and come with simple code examples that help in understanding the interoperability between all devices in your environment.
The results gathered from the vector addition and Black-Scholes programs underline how quickly one can begin to code in OpenCL and obtain immediate performance improvements that are manageable and scalable.
Experimenting with work-item quantities and work-group sizes gives you the power to parallelize your applications even further than standard thread- and data-level parallelization techniques. Utilizing the up to three NDRange dimensions offered by the OpenCL platform provides additional scalability options, though they increase the need for synchronization between work-items, work-groups, multiple kernels (if used), and host- and device-side memory. All of this is controllable by either the host or the device.
Programming hardware has never been easier. OpenCL eliminates the need to learn complex hardware description languages such as Verilog or VHDL to program FPGAs. Kernels can be built to run on multiple devices or tailored to maximize the efficiency of a specific hardware platform.
Having device-side and host-side memory models that are scalable gives additional programming flexibility for larger and more data-intensive applications.