System for Video Game Enhancement Using Console Emulator State Information and Scene Recognition

Video game console emulators have long needed to improve the quality of their graphical output to accommodate higher video resolutions and increased consumer expectations. Previous attempts at providing improved visual quality have relied on interpolation-based filters and, more recently, full-screen raster-to-vector graphics conversion. Interpolation filters produce low-quality results at low computational cost. Conversion to vector graphics can produce high-quality results, but at a high computational cost. In all cases, these algorithms work on the full frame of graphical data at once, which can introduce errors in the result due to the inability of these algorithms to separate foreground graphics from background graphics. This thesis investigates the modification of an existing emulator to achieve high performance and higher-quality enhancement than has previously been available. Two approaches were implemented and compared. The first uses traditional image processing techniques to perform object detection. The second uses a modified emulator to access virtual hardware state information to perform object detection. The results of object detection are used to replace all graphics with high-resolution replacements and to implement a form of scene recognition that adds additional graphical and audio effects. The system developed as part of this research uses a modified version of the FCEUX emulator for the Nintendo Entertainment System. Using virtual hardware state information as a means of object detection and localization has been found to be more efficient and accurate than using image processing algorithms. Using this system, a substantial audio-visual upgrade was successfully developed for a commercially-released game.


Introduction
As faster, network-enabled gaming systems and portable devices have become available and widely adopted by the public, there has been a resurgence of interest in the republishing of older video game titles. This has been accomplished by porting the existing game code (or rewriting the game entirely) to run on newer hardware platforms, or by allowing the original code to run in an emulated environment.
Creating a port of older games to run on modern game consoles is an expensive task. The current average annual salaries for employees involved in professional game development range from fifty thousand dollars for quality assurance testers to eighty-six thousand dollars for programmers [1], and writing a port would require a team of developers. The ideal porting project would start from a well-written source code base that has been thoroughly commented and documented. For many older games, this is not available. Games were written in hardware-specific assembler code, often using undocumented hacks to push the hardware beyond its intended limits. In many cases, the source code or documentation for games has disappeared over the years [2].
Despite the obvious costs and difficulties involved with writing a port of a game, there is a very strong advantage in that the developers have total control of the game behavior.
The developers may add new features, modify existing features, correct bugs, and change large portions of the game mechanics. This flexibility comes with a cost, as these changes may significantly alter the feel of a game, whether by design or as a result of not accurately recreating the behaviors of the original code.
As an alternative to porting game code, several game console manufacturers have begun to provide software-based emulators of earlier game consoles on their newer consoles. These emulators load a binary file of the Read-Only Memory (ROM) data of the game, and simulate how it would behave on the original hardware. This has the advantage of exactly preserving the look and feel of the original game [3]. It is very cost effective, as the only costs involved are marketing and distributing the binary over some medium.
As the emulator has been written prior to republishing the game, there are no new development costs. Because it is emulating the original hardware to play the ROM, the primary disadvantage of emulation is clear: the developers are limited to the capabilities of the hardware for which the game was originally designed. For many older game consoles, these limitations are substantial, with low-resolution, limited-color graphics and primitive sound generators.

Motivation

Commercial Applications
There are numerous opportunities for commercial application of the system being proposed. Several major manufacturers of game consoles provide fee-based services that allow consumers to purchase and play older games through software emulation. Examples of these services are Nintendo's Virtual Console on WiiWare [4], and Sony's PSOne Classics on the PlayStation Network [5]. These emulators provide a strict emulation service. They do not make attempts at altering the visuals of a game, with the sole exception being sprite alterations to deal with expired trademark licensing. Prices are generally set based upon which console the games were originally released for, and follow the rule of "older is cheaper." An obvious reason for this is that the earlier game consoles had substantially more limited graphical and audio capabilities, which potentially limits enjoyment of the game [6], and so it is difficult to ask the consumer to pay a premium price.
By significantly reducing development and testing expenses, this system will allow these publishers to cost-effectively upgrade the way their games look and sound while maintaining existing electronic publishing channels. Publishers will be able to ask for prices that are comparable to those of games that were written for more recent generation game consoles, increasing revenue.
As consumers are now playing games on more varied devices, such as traditional consoles, phones and tablets, there may be a need to customize the experience for each device. Emulators for many game systems are available for these devices, and the configurable mapping of the proposed system will allow publishers to customize the output for a number of devices and needs. Potential uses include providing higher-contrast graphics, graphics designed to optimize battery life, graphics designed for better visibility on smaller screens, and customizations for users with visual disabilities.
While publisher-based emulation services provide a low-cost way of republishing older games, it is not always possible due to licensing agreements that may have expired. At present, the only options in this situation are to relicense the characters, trademarks, or other elements in question, attempt to patch the sprite data in the binaries to avoid the licensing issues, or to decide to not republish the game. The proposed system allows the publisher to easily perform a replacement of problematic visuals in the original game, making the licensing a non-issue.

Image Processing and Computer Vision Algorithms
After establishing the feasibility of developing such a system, one must begin the task of developing a functioning implementation. Conventional wisdom suggests that algorithms from the fields of image processing and computer vision may be used to implement portions of the system.

Optimal Detection
Games generally must provide real-time feedback to the player. Any system that processes each image frame must be able to complete its work before the next frame is requested. Further, emulators are increasingly popular on battery-powered devices, which are limited in computational power. High CPU utilization can quickly drain the battery on such devices. Therefore, it is important that any algorithm implemented in the system be highly efficient. There are very efficient algorithms that have been developed, often with a primary focus on face detection, such as the Viola-Jones object detection system [7].
While these algorithms are faster than what was previously available, they can still be computationally intensive. Further, many game consoles were supplied with hardware that could apply several forms of geometric transformations to images within the game.
Algorithms that are invariant to these transformations exist, but these are also computationally intensive. Therefore, it is important to look for alternative means of obtaining this information to reduce or eliminate the additional overhead.

Vision Analysis vs. State Analysis
The problem being studied involves the enhancement of a video game that is being run within emulation software. A unique property of this problem is that we have access to a perfect model of the behavior we are trying to analyze, in the form of virtual hardware state information. Without implementing any image processing or vision algorithms whatsoever, all of the goals of the system are achievable by accessing this state information.
It is possible that there may be a high labor cost in modifying an emulator to support this functionality, due to each game's unique ways of interacting with the hardware. Retrieving the required information in this manner is possible, but may not be practical. A computer vision system can also obtain all of the required information through analysis of the display, but at a much higher runtime computation cost. One of the goals of this research is to either find a balance between the two methods that allows for a high accuracy detection system with low computational cost, or to develop a recommendation of one method over the other.

Scene Awareness and Event Recognition
In the process of analyzing each frame of the game's visual output, there exists more than just a simple opportunity to replace individual graphics with a higher quality substitute. It should be possible for an algorithm to look at a set of frames and determine the game's current environment. The environment will consist of state information regarding different 'scenes' of the game, along with a mechanism for recognizing what is happening on-screen. This information can be used to further inform the renderer's algorithm.
Scene recognition algorithms have been developed for analyzing image content for the purposes of labeling or annotation in static scenes [8] [9]. Examples of such algorithms also exist for scene recognition in video [10] [11]. Further, event recognition systems have been developed that utilize video of the player as a method of controlling the game, such as in Microsoft's Kinect.
Games often have a number of events, in which some action occurs due to the event or timer being triggered. Often, the action is limited to a sprite appearing or disappearing, or a single sound effect being played. The proposed system can utilize an event tracking and recognition system that will allow for additional events to be tracked, along with more visual and audio feedback to the player.
The goal of this portion of the system is not to simply label the scenes, but to use that labeling to provide partial or complete enhancement of the original scene.

Cultural Preservation
There is a trend towards video games being considered not just as temporary entertainment, but as cultural artifacts worthy of preservation [3], much as is the case for music and film. Unfortunately, similar to the music and film industries, much of the early history of video games has been poorly preserved and often lost or destroyed [2].
Emulation has proven successful as an avenue for preserving older games and associated hardware [3]. Much of the demand has been driven by nostalgia. With each passing generation, the graphical standards by which games are judged have steadily risen.
As the consumers who obtain the games for nostalgic reasons gradually leave the marketplace, it is likely that this newer generation of gamers may reject the older games, partially based upon the poor game visuals. Greater visual quality in games has been shown to increase sensations of telepresence, leading to higher levels of enjoyment [6]. The proposed system addresses this by providing updated visuals for these games, which may be more palatable to modern gamers, thus increasing their distribution.

Problem Definition
The problem shall be defined as that of developing a system for the purpose of enhancing the audio-visual output of a video game that is running in an emulator, utilizing state information in the emulator to increase efficiency and accuracy. The problem can be divided into the following sub-problems:

1. Viability: Demonstrating that such a system can be developed, and that it can provide better results than methods that have been previously developed.

2. Object detection: A method for detecting on-screen game elements reliably and efficiently.

3. Event detection and scene recognition: A method for determining when specific events occur, and using this knowledge to begin an action in response to the event.

Structure of the Thesis
Chapter 2 will provide an overview of the work that has been previously performed in this field. Topics include upscaling filters, object detection algorithms, and algorithms for comparing images for similarity. For each area of related work, a discussion of the disadvantages of that work will be provided.
Chapter 3 will briefly discuss the design and limitations of the original console hardware of the Nintendo Entertainment System, as well as provide an introduction to the software emulator FCEUX. This will be followed by a detailed discussion of the types of information that could be extracted from the emulator's state information, its suitability for use in enhancement, and the advantages and disadvantages of this approach.
Chapter 4 gives an overview of the system developed as part of this research. It includes discussion of the individual software components required by the system, as well as a breakdown of the different human roles involved in producing an enhanced game.
In Chapter 5, object detection techniques will be presented for efficiently detecting on-screen game elements. This includes discussion on image processing techniques as well as using virtual hardware state information as a means of object detection.
The focus of Chapter 6 will be on the development of an event recognition system that can be used to recognize game scenes and events as they occur. This will include discussion of the types of events that can be detected, as well as the kinds of interference-free actions that can be taken in response to those events.
The results obtained in testing an implementation of this system on a commercial game are discussed in Chapter 7. Finally, a summary of the work and suggestions for future work are presented in Chapter 8.

Related Work
This chapter surveys the work that has been done previously in the area of enhancing emulated video games that were originally designed to run on hardware sprite engines. This is followed by a discussion on algorithms in the fields of image processing and computer vision that can be used to extract feature information from the game's visual output.

Filter Algorithms
Emulators are frequently used on home computers to emulate commercial game console hardware. The screen resolutions of the display devices used on these computers are typically much higher than the resolution available in the original console hardware. For example, it is very common now for computer monitors to support 1920 × 1080 display resolutions, but early game consoles often had resolutions closer to 320 × 240. While early emulators were typically full-screen applications that were launched from a command line, and used a custom screen resolution at runtime, emulators now commonly run in a windowed environment. To provide a larger view of the game screen, emulators implement algorithms that upscale the game's visual output to a larger size.

Definition 2.1.1 Digital Filter
A digital filter is a process or algorithm that removes an undesired component from a digital signal. In this thesis, the signal component of interest is aliasing, which can produce jagged edges in the signal.

The sections below discuss filters that simulate analog video artifacts and filters that upscale images in the pixel domain. Finally, a filter that transforms images from the pixel domain to the vector domain is discussed.

Analog Simulation Filters
Most older game console hardware provided analog video output compatible with the standards set by the National Television System Committee (NTSC). Some game consoles were modified to use Phase Alternating Line (PAL) instead of NTSC, primarily in Europe.
The video signal would be sent to a standard CRT television. Unlike today's digital display and delivery mechanisms, these analog systems produced a number of visual artifacts (errors in the video signal). While these artifacts are generally considered undesirable, many games took advantage of some of these distortions to produce special effects, such as pseudo-transparency. Some emulators provide filters that attempt to simulate some of these artifacts.
The most well-known NTSC-CRT emulation filter is the algorithm provided in the Blargg NTSC Library [12]. This filter simulates many of the visual artifacts of the analog chain, including CRT barrel distortion, scanlines, phosphor masks, blurring, bloom, and color fringing. Recent revisions of the algorithm have brought improvements in performance, but the algorithm remains computationally expensive. It is most often implemented as a pixel shader program, and requires a Graphical Processing Unit (GPU) to perform the processing.
NTSC filters do not generally perform any form of scaling on their own. They may be used in conjunction with filters specifically designed for upscaling. Additionally, the results produced by these types of filters may not be considered an enhancement of the image, as they degrade the output; however, they degrade it in a way that enhances the game play experience.

Traditional Scaling Filters
The filters discussed here are designed to scale an image by some integral amount.
Images begin and end as a rectangular region of pixels. The algorithms used in these filters are all forms of interpolation.

Definition 2.1.2 Interpolation
Interpolation is a method of reconstructing missing data points by computing an approximate value based on nearby, known data points.
Source-unaware scaling filters are found in many software packages, including image editing software. These filters, which are referred to here as general-purpose scaling filters, do not take any information about the nature of the source image into consideration. Some algorithms, such as those designed to process pixel art, have some awareness of the characteristics of those types of images. The algorithms take advantage of that information to produce better scaling results for those images than could have been obtained otherwise.

General-Purpose Scaling Filters
A number of common scaling filters have been used in emulators for enlarging the display size. Nearest-neighbor interpolation filters are present in most emulators. This algorithm is very fast and can be used to double or quadruple the size of the image without introducing new colors or artifacts into the image. The color value of a pixel in the scaled image is dependent upon a single pixel in the original image. This algorithm is preferred by some game players who want to see the original image preserved as much as possible, but rendered larger than the existing source resolution. As nearest-neighbor interpolation does not make any attempt to smooth the image, the resulting image can have a very blocky or jagged appearance due to aliasing, as can be seen in Figure 1(a).
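To make the preceding description concrete, the following is a minimal sketch of nearest-neighbor upscaling. The flat, row-major buffer of 32-bit pixels is an assumption of the sketch, not any emulator's actual frame format.

```cpp
#include <cstdint>
#include <vector>

// Upscale a row-major image of 32-bit pixels by an integer factor using
// nearest-neighbor interpolation. Each destination pixel copies exactly one
// source pixel, so no new colors are introduced into the image.
std::vector<uint32_t> NearestNeighborScale(const std::vector<uint32_t>& src,
                                           int width, int height, int factor) {
    std::vector<uint32_t> dst(static_cast<size_t>(width) * factor *
                              height * factor);
    for (int y = 0; y < height * factor; ++y) {
        for (int x = 0; x < width * factor; ++x) {
            // Map the destination coordinate back to its single source pixel.
            dst[static_cast<size_t>(y) * width * factor + x] =
                src[static_cast<size_t>(y / factor) * width + (x / factor)];
        }
    }
    return dst;
}
```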
Bilinear and bicubic filters attempt to resolve this problem by interpolating the color values between pixels in the source image. This produces an enlarged image with smooth color gradients approximating the missing pixel colors. These filters will work well on images in which the source data represents continuous color shifts. For pixel art, or any other type of image that has hard lines or contours, the smoothing effect of these filters may produce an image that appears blurry [14].

Figure 1: (a) Example of aliasing in pixel art. This image has had a 3× nearest-neighbor upscaling filter applied. Notice that diagonal lines have a step-like appearance, which is undesired. (b) The same image scaled using the hq3x filter to reduce aliasing [13].

Pixel Art Scaling Filters
Game systems that used hardware-based sprite engines typically featured low-resolution graphics of limited color depth, and so it was important to develop scaling algorithms that worked well with these types of images. The individual graphical components used in these games are known as pixel art, a term coined by Adele Goldberg and Robert Flegal at the Xerox Palo Alto Research Center in 1982 [15], although the use of this type of digital art dates back nearly a decade earlier (1973) to Richard Shoup's SuperPaint software package [16].

Definition 2.1.3 Pixel Art
Pixel art is a raster-based digital image created by editing each individual pixel without the use of automatic tools.
The efficiency of algorithms for scaling pixel art was important, as the scaling would need to be performed approximately sixty times per second [17]. To achieve this goal, the filters did not generally allow for arbitrary scaling. The filters were most commonly written for 2×, 3×, and 4× scaling.
Eric's Pixel Expansion (EPX) was developed by Eric Johnston at LucasArts in 1992. He created it to port a game engine, which ran on an IBM PC at 320 × 200 with up to 256 colors, to the Apple Macintosh Color Classic, which was running at nearly double the resolution [18]. This algorithm worked by looking at the four non-diagonal neighbors of the source pixel. The four destination pixels would be computed based on the similarity of the color values of the neighbor pixel pairs in that direction. If three or more of the neighbors were identical, the original pixel color value would be used for all four destination pixels [19].

The AdvMAME2×/Scale2× filter appeared later; it was functionally identical to EPX, but had a more efficient implementation. The AdvMAME3×/Scale3× filter works by the same principles as EPX, but achieves 3× scaling by using the color values of all eight neighbors of the source pixel to compute a 3 × 3 block of destination pixels [20]. The AdvMAME4×/Scale4× filter is simply the EPX filter applied twice to achieve a 4× upscaled result [20]. Using this family of scaling filters results in an excessive amount of rounding of sharp edges, as can be seen in Figure 2.
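The following is a minimal sketch of the Scale2×/EPX rule described above, using the conventional neighbor tests from the published algorithm. The buffer layout and the clamping of out-of-range neighbors at the image border are assumptions of the sketch.

```cpp
#include <cstdint>
#include <vector>

// Scale2x (functionally equivalent to EPX). For each source pixel E with
// non-diagonal neighbors B (above), D (left), F (right), and H (below), the
// four destination pixels default to E and are replaced only when the
// neighbor pair toward that corner matches and the opposing pairs do not.
std::vector<uint32_t> Scale2x(const std::vector<uint32_t>& src, int w, int h) {
    std::vector<uint32_t> dst(static_cast<size_t>(w) * h * 4);
    auto at = [&](int x, int y) {
        // Clamp to the border so edge pixels have well-defined neighbors.
        x = x < 0 ? 0 : (x >= w ? w - 1 : x);
        y = y < 0 ? 0 : (y >= h ? h - 1 : y);
        return src[static_cast<size_t>(y) * w + x];
    };
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            uint32_t E = at(x, y), B = at(x, y - 1), D = at(x - 1, y),
                     F = at(x + 1, y), H = at(x, y + 1);
            uint32_t e0 = E, e1 = E, e2 = E, e3 = E;
            if (D == B && B != F && D != H) e0 = D;  // top-left
            if (B == F && B != D && F != H) e1 = F;  // top-right
            if (D == H && D != B && H != F) e2 = D;  // bottom-left
            if (H == F && D != H && B != F) e3 = F;  // bottom-right
            size_t row = static_cast<size_t>(2 * y) * (2 * w) + 2 * x;
            dst[row] = e0;
            dst[row + 1] = e1;
            dst[row + 2 * w] = e2;
            dst[row + 2 * w + 1] = e3;
        }
    }
    return dst;
}
```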
The Eagle filter was created by Dirk Stevens, and it provided 2× scaling. The first step of the algorithm was to enlarge the image 2× using a nearest-neighbor interpolation filter. Each pixel is then processed by looking at the neighbors in each diagonal direction, along with the two pixels that are adjacent to both the neighbor and the original pixel. If the three pixel colors match, the new pixel is assigned that color. A disadvantage of this filter is that isolated pixels on a solid background color will disappear. This issue can be seen in Figure 3(e) on Page 15.

Figure 2: Examples of how the Scale2×, Scale3×, and Scale4× filters affect a piece of black and white pixel art [20]. Examples are scaled to uniform dimensions for comparison.
The 2×SaI (2× Scale and Interpolation) filter is an upscaling filter developed by Derek Liauw Kie Fa in 1998. It was first used in Snes9x, which was a popular emulator for the Super Nintendo Entertainment System (SNES) at the time. The goal of this algorithm was to provide 2× scaling while preserving the smooth areas and the brightness of the original source. The filter matrix was designed to make use of lines and edges of the same color. Of the four pixels produced for each input pixel, one pixel will be identical to the source, and the other three are approximated from the patterns that match the filter matrix [21].
Further work on this algorithm has led to a modified version of the filter. The modified filter is often referred to by the names Super 2×SaI and Super Eagle. These algorithms work in a manner similar to 2×SaI, but perform more blending of the pixel colors to produce a smoother appearance in the output.
The hqx family of filters was developed by Maxim Stepin in 2001. Here, "hq" stands for "high quality," and "x" stands for magnification. There are three sets of filters: hq2x, hq3x, and hq4x, which provide 2×, 3×, and 4× scaling, respectively [13,22,23]. These filters remain very popular, and have been implemented in several emulators. The hqx filters look at the eight neighbors of the source pixel and attempt to detect shapes by looking for pixels of similar color that match predefined patterns. The patterns are then used to perform a lookup in a table which provides the upscaled, interpolated patterns. The hqx family of filters work very well on images that have geometric patterns, and can produce smooth, high-quality results.

Vectorization Filters
The current state-of-the-art in scaling pixel art is the Kopf-Lischinski algorithm, which converts pixel art into vector graphics.

Definition 2.1.4 Vector Graphic
A vector graphic is a type of digital graphic that is based on primitive mathematical objects, such as lines, polygons, circles, and Bézier curves. Unlike pixel-based raster images, vector graphics can be moved, scaled, or rotated without any loss of quality, as no interpolation is needed.
Earlier algorithms for producing vector graphics from raster images were not well-suited for pixel art, as small details in the original image were often lost, as can be seen in Figure 3(h-i). In pixel art, these small details (often a single pixel of a color distinct from its surroundings) are very important to the meaning of the image [25]. The Kopf-Lischinski algorithm was designed specifically for pixel art. It uses a small set of weighted heuristics to eliminate edge crossings in a similarity graph, where each vertex represents a pixel and contains edges to the vertices that represent its eight neighboring pixels. Edges between vertices with dissimilar colors are removed. A heuristic algorithm to remove edge crossings is applied, resulting in a planar similarity graph. The edge information is used to create a Voronoi diagram which loosely represents the reshaped pixels.
Connected sequences of visible edges are converted into quadratic B-spline curves, which are subjected to an optimization process to remove block-shaped regions [25]. This can be used to produce vector graphic output.
This filter produces very good results for certain types of inputs. In particular, input images that approximate curves appear to vectorize nicely with this algorithm (see Figure 4), but sharp edges can suffer from excessive rounding. An example of how this algorithm performs on an image with sharp edges can be seen in Figure 3(j). The algorithm also does not work well for anti-aliased images, which are commonly found in pixel art from 16-bit generation consoles [25]. A disadvantage of the Kopf-Lischinski algorithm is that it is computationally expensive and currently unsuitable for real-time use during game play. There exists a video of a game developed for the Super Nintendo Entertainment System being played while this filter converts the output to vector graphics. This video is not real-time; it was prerecorded and time-corrected for playback [25]. One issue demonstrated by the video is that it processes the entire game screen to generate the vectors, as opposed to individual pieces of pixel art as discussed in the paper. Better results could be obtained by processing the sprite layer and the several layers of parallax scrolling independently. To do this would require the modification of an emulator to extract the different layers prior to screen composition.
If the emulator could be modified in this way, it would open up a number of opportunities to make other changes that can lead to greater game enhancement capabilities.

The Limits of Interpolation
There is a fundamental problem with all of these filters in that they provide a form of interpolation. That is, they are attempting to reconstruct information that is not present in the original data. While algorithms such as Kopf-Lischinski provide very good results on many data sets, it is not possible for these algorithms to reliably introduce meaningful new detail into the upscaled images.
It is for this reason that the system presented in this thesis will not attempt to interpolate a game's visual output, but will instead replace it entirely. The goal is to detect what type of object is being observed and provide a suitable higher-resolution image for that object. The new replacement image can be constructed to have much more true detail than what could have been obtained from an interpolated resampling of the original data.
To enable this functionality, the system will need algorithms or methods for detecting and recognizing different types of objects within a game.

Object Detection
To enable the system to segment the screen output into distinct units for processing, the system must have some way of analyzing the state of the game's visuals. One way to do this is to perform object detection on the video output of the emulator. A number of approaches have been developed in the fields of image processing and computer vision for tackling this problem.

Definition 2.2.1 Object Detection
Object detection refers to the detection of one or more instances of an object within an image. It is a form of binary classification in that the object at the region of interest either is an instance of the object, or it is not an instance of the object. Detection also provides localization of the object.

Definition 2.2.2 Multi-Class Object Detection
If more than one class of objects need to be detected and distinguished from one another, then it becomes a problem of multi-class object detection.
Object detection, as a single-class problem, is commonly encountered in the form of face detection. In this problem, an area of the image being processed is labeled as either containing a face, or not containing a face. It is not concerned with knowing whose face it is (which is facial recognition, as opposed to detection), just that it appears to be face-like in appearance. If there were two classes of objects being detected (e.g., faces and stop signs), then this would be an example of a multi-class object detection problem.
In general, the system requires a solution for the multi-class object detection problem.
In limited circumstances, it can be a single-class detection problem, but this will only occur in situations where only a specific object needs to be detected and modified. An example of a situation meeting this condition is when a single graphical object in a game needs to be replaced for licensing or other reasons.
Early attempts at object detection systems used features that were computationally expensive, such as raw pixel data or edge detection output. For example, edge detection using a Canny filter first requires applying a Gaussian filter to minimize noise in the image, and then applies up to four directional gradient filters, depending on the types of edges that must be detected [27].
A pedestrian detection system (1997) and a general purpose object detection system (1998) based on Haar wavelets were developed by Papageorgiou, Oren, and Poggio [28,29]. An example of a Haar wavelet is shown in Figure 5(a). A Haar wavelet transform works by producing a matrix such that the first n/2 rows produce a weighted average of the input, and the last n/2 rows produce a weighted difference of the input.
These are low-pass and high-pass filters, respectively. Haar wavelet transformations can be used for compressing information about the image into less data. Papageorgiou et al.
found that pixel and edge-based representations of images were generally inadequate for tracking the objects of interest.

Definition 2.2.3 Haar-like Feature
A Haar-like feature is a computationally inexpensive image feature that is based on looking at adjacent rectangular regions and comparing the sums of the pixel intensities of each region [7].
Most state-of-the-art frontal face detection systems are based on the Viola-Jones object detection framework [7]. The Viola-Jones algorithm achieves accuracy comparable to previous algorithms, but is much more computationally efficient. The detection is performed by evaluating Haar-like features over an integral image, in which the value at each location (x, y) is the sum of the pixel intensities above and to the left of (x, y). The integral image can be efficiently computed in a single pass by making use of the following pair of recurrences:

s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)

where i(x, y) is the original image, s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0 [7].
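A one-pass implementation of these recurrences might look like the following sketch; the image layout is an assumption, and any rectangular sum can then be read back with four lookups into the returned table.

```cpp
#include <cstdint>
#include <vector>

// Build an integral image in one pass using the recurrences above:
// rowSum plays the role of s(x, y), and each entry adds the total from
// the previous row, ii(x, y - 1), implicitly folded into ii(x - 1, y)'s
// column-wise accumulation here.
std::vector<uint64_t> IntegralImage(const std::vector<uint8_t>& img,
                                    int w, int h) {
    std::vector<uint64_t> ii(static_cast<size_t>(w) * h, 0);
    for (int y = 0; y < h; ++y) {
        uint64_t rowSum = 0;  // s(x, y): cumulative sum of the current row
        for (int x = 0; x < w; ++x) {
            rowSum += img[static_cast<size_t>(y) * w + x];
            ii[static_cast<size_t>(y) * w + x] =
                rowSum + (y > 0 ? ii[static_cast<size_t>(y - 1) * w + x] : 0);
        }
    }
    return ii;
}
```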

Definition 2.2.4 Classifier
A classifier is a form of supervised machine learning that attempts to label (classify) an object as belonging to one of a set of classes.
The Viola-Jones algorithm uses a cascaded set of classifiers to perform detection. These classifiers rely on supervised learning, which involves training the classifier on a training data set. The expectation is that good performance on the training data set will lead to a model that generalizes well on instances outside of the training set. AdaBoost is used during training for feature selection. Boosting systems, such as AdaBoost, were developed for building accurate classifiers by combining the results of several low-accuracy classifiers [30].
The idea is that the cascade allows extremely fast rejection of regions which do not contain face candidates by starting off with the weak classifiers. Areas of interest (those rectangular regions that were labeled as potentially face-like by the weak classifiers) are subjected to progressively more computationally expensive strong classifiers.

There are several issues with using the existing object detection frameworks in this system. The first issue is the need to train the classifiers. To make the system as simple as possible for the developers that are using it, it should not require any more technical tasks than are absolutely necessary. Correctly training the classifiers requires some amount of experience in the field. Poor choices in the training sets, or overtraining the classifier (resulting in a model that fits the noise of the training set), can lead to unintentionally poor detection results. AdaBoost is known to be susceptible to overfitting in the presence of noise, but there has been some research into minimizing its effects on classification accuracy [31].
Another issue with object detection algorithms is the amount of computation required. While the Viola-Jones algorithm represents a substantial improvement in high-speed object detection, it is still a form of single-class object detection that uses multiple classifiers over several different scales. The system will be required to detect many objects, typically one object for each background tile and sprite that might appear on-screen, in each frame of video output. Further, many of the objects that require identification have a known position, and merely require labeling. Therefore, the localization of these objects is not a required step.
A final issue with these algorithms is that of accuracy with respect to locality of the detected object. The algorithms can detect occurrences of the features within some radius around the actual object, which can result in several false positive detections in the neighborhood nearby the true location. False positives are a problem for the system, as object detection leads directly to visual output to the player, and the player relies on this information to make decisions. There are various solutions for reducing the number of false positives, including algorithms that look for overlapping bounding regions or assigning confidence weights to each labeling. All of these solutions add additional processing time and still provide no guarantee of the location being correct.
Object detection concerns itself with the question of "Is this the object?" Because of the nature of visual output in video games, and because the system is interested in everything on the screen, this question does not need to be asked. It is the object. Perhaps the better question is "What is this specific object?" Several types of algorithms have been developed to answer that specific question.
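As an illustration of the cascade structure discussed above, the sketch below shows the early-rejection control flow. The Stage type and its Evaluate method are hypothetical placeholders, not the Viola-Jones framework's actual interface.

```cpp
#include <vector>

// Schematic cascade evaluation (not Viola-Jones' actual implementation).
// Stages are ordered from cheapest to most expensive; a region must pass
// every stage to be accepted, and most regions are rejected early by the
// cheap stages, so the expensive stages rarely run.
struct Region { int x, y, w, h; };

template <typename Stage>
bool CascadeAccepts(const std::vector<Stage>& stages, const Region& r) {
    for (const Stage& stage : stages) {
        if (!stage.Evaluate(r)) {
            return false;  // early rejection: later stages never execute
        }
    }
    return true;  // survived every stage: labeled as a detection
}
```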

Image Hashing
Images, even small ones, require a large amount of data. It is not practical to compare images by using the pixel data directly. For example, comparing two n × m images would require as many as nm pixel comparisons. Computing an identifying label for each image, and then performing comparisons on the labels, can make data lookup considerably faster.
That is the basic idea behind a hash table. For an image, a hash value can be computed and used to lookup information about that image using only the hash value.

Definition 2.3.1 Hash Function
A hash function is a function h(x) that takes a (possibly) large set of data and maps it to a smaller set of data, commonly a set of integers.

Definition 2.3.2 Collision
Some hash functions make no guarantee that two dissimilar inputs will not produce the same hash value. When this occurs, it is known as a collision.

General Hash Functions
A number of well known hash functions have been created and used for a variety of purposes, such as cryptography and error correction. While these hash functions are frequently and effectively used in their respective domains, their use in image processing is limited. This is due to the fact that small changes in the image (even a single bit) can produce large changes in the computed hash. If the purpose of the hashing is to detect equality at the bit level, this may be desired behavior. Good image hashing should have some resilience against small changes, as it is common to require detection of variations of the same image.
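To illustrate this bit-level sensitivity, the following sketch computes a 64-bit FNV-1a hash over a raw byte buffer (for example, one tile's pattern data). The use of tile data here is purely illustrative, not a prescription from this chapter.

```cpp
#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a raw byte buffer. A single flipped bit in the input
// yields a completely different hash value, which is why general hash
// functions can only detect exact, bit-level matches between images.
uint64_t Fnv1a64(const uint8_t* data, size_t len) {
    uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}
```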

Transformation-Invariant Image Hashing Algorithms
It is often the case that several different images represent the "same" image. This can occur when the image data undergoes some form of transformation, whether intentional or not. For example, JPEG compression uses a lossy algorithm which can produce an image that looks nearly identical to the original image at normal viewing distance, but will have a large number of different bits. Images may also be rotated, cropped, or have subtle color alterations. A hashing algorithm that is resistant to these types of changes is called a robust hashing algorithm. An example of such an algorithm, developed by Venkatesan, Jakubowski, Koon, and Moulin, borrows a number of ideas from the field of cryptography to develop a robust image hashing algorithm that can be used as a substitute for image watermarking. The output of the hashing algorithm is an n-bit string that can be directly compared for equality between the images [32].
The Venkatesan, Jakubowski et al. algorithm is robust against a number of deformations. Specifically, their algorithm produced identical hashes for images that had been altered by rotation of up to 2 degrees, scaling by up to 10%, cropping of 10%, random line deletion, JPEG compression, and 4 × 4 median filtering [32]. However, a pair of "identical" images, with one scaled to twice the original size, would produce different hash strings, and therefore would not be detected as the same image. A solution to this problem is to use geometric hashing. The goal of geometric hashing is to be able to locate an object based on common substructures in such a way that the object would be detected regardless of scale, rotation, translation, or occlusion [33].
The earliest work on geometric hashing was done by Schwartz and Sharir in 1986, using boundary-curve matching techniques [34]. Later Schwartz, Wolfson, and Lamdan developed a new technique using point sets that was invariant under affine transformations [35].

Definition 2.3.3 Affine Transformation
An affine transformation is a transformation that preserves straight lines and parallelism, so that the relationships between points are maintained. For affine transformations, a minimum of three points is required for the basis [33].
Geometric hashing provides a relatively efficient method for labeling an object.
However, the precomputation stage is very expensive, and at runtime requires very large hash tables to be stored in memory.

Histograms
The previously discussed algorithms can be used to identify objects within an image when the object has undergone translation, rotation, or other deformations. It is also possible to use color intensity as a feature, allowing two objects to be compared for similarity on that feature alone. A histogram is a table of frequencies, and can be used to visualize how often something occurs. In the context of image processing, histograms can be used to visualize how features of an image (typically, colors) are distributed. Histograms can be used on both color and grayscale images.

Definition 2.4.1 Intensity Histogram
An intensity histogram represents the distribution of grayscale pixel values in an image.

Definition 2.4.2 Color Histogram
A color histogram represents the distribution of colors in an image.
A distance metric between two histograms can be computed by finding the maximum of the absolute differences between the two cumulative distribution functions [36].
Histograms are a computationally efficient way of obtaining identifying information about an image. Histograms are generally insensitive to scaling, translation, and rotation.
However, when applied globally, they provide no spatial information about the distribution of colors or intensities in an image. There are several solutions to this problem, which include using sets of histograms over smaller subregions of an image to identify object-specific features. Selection of these subregions typically involves training classifiers.
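A sketch of the distance computation mentioned above follows, for grayscale images: each histogram is normalized into a cumulative distribution function, and the maximum absolute difference between the two CDFs is returned. The 256-level quantization is an assumption of the sketch.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Build a normalized cumulative distribution function over 256 gray levels,
// so images of different sizes can be compared directly.
std::array<double, 256> Cdf(const std::vector<uint8_t>& img) {
    std::array<double, 256> h{};
    for (uint8_t p : img) h[p] += 1.0;
    double total = static_cast<double>(img.size());
    double running = 0.0;
    for (double& bin : h) {
        running += bin / total;
        bin = running;  // bin now holds the cumulative fraction of pixels
    }
    return h;
}

// Distance metric: the maximum absolute difference between the two CDFs.
double CdfDistance(const std::vector<uint8_t>& a,
                   const std::vector<uint8_t>& b) {
    std::array<double, 256> ca = Cdf(a), cb = Cdf(b);
    double maxDiff = 0.0;
    for (int i = 0; i < 256; ++i) {
        maxDiff = std::max(maxDiff, std::fabs(ca[i] - cb[i]));
    }
    return maxDiff;  // 0 means the intensity distributions are identical
}
```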

Color Coherence Vectors
Computation of histogram features is very fast. Comparing histograms for similarity is also fast. Unfortunately, spatial information about the distribution of pixel data is lost.
There is a histogram-based method for comparing images that maintains some spatial information, known as a color coherence vector (CCV).
In a CCV, pixels in the image are divided into coherent and incoherent pixels.
Coherent pixels are those pixels that can be recognized as part of a larger, connected region.
Incoherent pixels do not belong to such a region. In a traditional color histogram, instances of a single color that occur in several, distinct parts of an image contribute to the same bucket. In a CCV, these different parts of an image will not be recognized as a coherent region (there are incoherent pixels between them), and so the two instances of that region of color are labeled differently. This preserves spatial information about the image [37].
Computing the connected components can be done in linear time. The size of a component determines whether its pixels are coherent, using a threshold τ, where τ ≈ 1% of the image area. In Figure 9, a small example of the connected component labeling is shown using a low-intensity, 6 × 6 grayscale image with τ = 4. Pixel intensities were split into buckets that held ten values. For example, the first bucket holds values from 10 to 19, inclusive. For each bucket, some pixels will be part of a connected region of more than τ pixels (coherent pixels), and some will not (incoherent pixels).

Figure 9: Computing a labeling of the connected components for a color coherence vector. The grayscale intensity values are placed into three buckets that each contain a range of ten values. The 1's bucket is made up of pixels from two disconnected regions, as is the 2's bucket. These are given distinct labelings, and so from these three buckets, labels A-E are obtained [37].

Let α_i and β_i represent the number of coherent and incoherent pixels for the i-th bucket, respectively [37]. Then (α_i, β_i) is a coherence pair, and the color coherence vector is ⟨(α_1, β_1), (α_2, β_2), ..., (α_n, β_n)⟩. For the image in Figure 9, the component sums are shown in Table 1, and the computed CCV for this image is shown in Table 2; for each color, α represents the number of coherent pixels, and β represents the number of incoherent pixels [37].
Equation 5 shows how the difference between two CCVs, I and I′, can be computed:

Δ(I, I′) = Σ_i ( |α_i − α′_i| + |β_i − β′_i| )    (5)

It is efficient to compute and provides a value that is typically larger than that produced by taking the difference of two standard color histograms. Therefore, it serves as a better differentiator between two source images [37].
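A sketch of the CCV computation follows. It assumes the pixels have already been quantized into bucket indices, uses 8-connectivity for the components, and applies Equation 5 for the distance; these choices follow the description above, but the code itself is illustrative rather than taken from [37].

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Coherent (alpha) and incoherent (beta) pixel counts per intensity bucket.
struct Ccv {
    std::vector<long> alpha;
    std::vector<long> beta;
};

// bucketOf holds one bucket index in [0, buckets) per pixel, row-major.
Ccv ComputeCcv(const std::vector<int>& bucketOf, int w, int h,
               int buckets, int tau) {
    Ccv ccv{std::vector<long>(buckets, 0), std::vector<long>(buckets, 0)};
    std::vector<char> seen(bucketOf.size(), 0);
    std::vector<int> stack;
    for (int start = 0; start < w * h; ++start) {
        if (seen[start]) continue;
        // Flood-fill one connected component of same-bucket pixels.
        long componentSize = 0;
        std::vector<int> component;
        stack.push_back(start);
        seen[start] = 1;
        while (!stack.empty()) {
            int p = stack.back();
            stack.pop_back();
            ++componentSize;
            int px = p % w, py = p / w;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = px + dx, ny = py + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    int n = ny * w + nx;
                    if (!seen[n] && bucketOf[n] == bucketOf[start]) {
                        seen[n] = 1;
                        stack.push_back(n);
                    }
                }
            }
        }
        // Components of more than tau pixels contribute coherent pixels.
        if (componentSize > tau) ccv.alpha[bucketOf[start]] += componentSize;
        else                     ccv.beta[bucketOf[start]] += componentSize;
    }
    return ccv;
}

// Equation 5: sum of per-bucket differences in coherence pairs.
long CcvDistance(const Ccv& a, const Ccv& b) {
    long d = 0;
    for (size_t i = 0; i < a.alpha.size(); ++i) {
        d += std::labs(a.alpha[i] - b.alpha[i]) +
             std::labs(a.beta[i] - b.beta[i]);
    }
    return d;
}
```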

Related Work Summary
There are several algorithms that have been developed for the purpose of enhancing the visual qualities of scaled pixel art, both in the raster image and vector graphics domains.
There has also been a significant amount of research performed in the area of object detection. These object detection algorithms have several drawbacks, including computational expense and inaccuracy in object locality.
The next chapter will investigate the hardware for one of the best-known and most frequently emulated video game consoles. It will examine the architecture and explore features of the sprite engine that can be used to perform object detection and enhancement.

Hardware and Emulation
This chapter will provide an introduction to the Nintendo Entertainment System, along with an overview of its hardware implementation. It will also introduce the FCEUX software emulator.

History
The Nintendo Entertainment System (commonly referred to as the NES) was an 8-bit video game console developed by Nintendo Co., Ltd. It was initially launched in Japan in 1983, and was released in the US by Nintendo of America in 1985. The NES was officially discontinued in 1995. The NES was very popular, selling approximately sixty-two million units, and is often credited for revitalizing the video game industry in the USA following the "video game crash" of 1983. Games that were developed for this system continue to be very popular, and have resulted in the development of many sequels for later generation game consoles [38].

Description of the Hardware
The NES contained an 8-bit central processing unit (CPU), the Ricoh 2A03, which was based on the MOS Technology 6502. This CPU ran at 1.79 MHz. The console itself contained two kilobytes of general-purpose random access memory (RAM), along with an additional two kilobytes of video RAM that was used by the Picture Processing Unit (PPU) for background graphics (known as the "name tables"). A small amount of RAM was available for other graphical elements, including 256 bytes for the Object Attribute Memory (OAM) and twenty-eight bytes to store the color palettes.
Like many consoles of that era, the NES used a hardware sprite engine. The OAM provided attribute information for sixty-four hardware sprites [39].

Definition 3.2.1 Sprite
A sprite is a 2D digital image that can be rendered onto a larger image. For the purposes of this thesis, sprites are small pieces of 2D pixel art that are stored and rendered separately from the background bitmap.
Game contents were stored in Read Only Memory (ROM) contained in the game cartridges. The NES was capable of addressing up to thirty-two kilobytes of program ROM.
This could be expanded through the use of bank switching to allow the CPU to access additional chunks of ROM memory [39].

Color Palettes
The NES allowed for the use of any color selected from a palette of forty-eight colors, and six shades of gray.

Definition 3.2.2 Palette
A palette is a finite set of colors used in the creation of a digital image.
A sample palette is shown in Figure 10. The NES palettes were based on NTSC, and so emulators use an RGB approximation of the original colors. Colors in the palette are referenced by index, and are arranged such that the first four bits control the hue (horizontal axis in Figure 10) and the second four bits control the intensity (vertical axis in Figure 10).
A background color for the entire screen could be specified. The tiles and sprites each were able to select from eight palettes (four for tiles, and four for sprites), each of which contained three colors, plus one value reserved for transparency. This allowed the NES to display a maximum of twenty-five colors on the screen at any one time [39].
A rarely used feature of the original hardware was the ability to tint the palette through the use of the palette's color emphasis bits. Three bits were used for tinting the palette towards red, green, or blue hues. The bits could be used individually, or combined.
For example, setting the red bit would de-emphasize the blues and greens. This feature is not well supported in many NES emulators.

Display
The display resolution of the NES was officially 256 × 240 pixels. Due to the limitations of NTSC-based television sets of the time, the eight pixels along the top and bottom margins were often not viewable, making the effective resolution 256 × 224 pixels. Game screens were stored in a 2 × 2 arrangement, with a particular screen being mirrored depending on whether the game was configured for horizontal or vertical scrolling. Parallax scrolling behavior was not provided by the hardware, but it was possible to change the scrolling bits between scanlines to simulate two independently scrolling layers. The screens were arranged in tiles of 8 × 8 pixels [39].

Sprites
The OAM provided storage of attribute information for up to sixty-four sprites. Each sprite was allocated four bytes of memory in this table, and the information stored included the index of the sprite, x and y coordinates, bits for controlling horizontal and vertical flipping, a priority bit, and a palette index. Several of the bits were not used. A sprite could be 8 × 8 or 8 × 16, but all sprites were required to use the same dimensions. Each sprite was limited to using only three colors, in addition to a transparent color index that allowed the background to show through. To work around this limitation, some games used a composite sprite, in which two or more sprites are rendered in the same location to give the illusion of a single sprite with more than just three colors. An example of such a sprite, extracted from Capcom's game "Mega Man," is shown in Figure 11 [40]. The use of composite sprites was common in NES games, but at a great cost, as there could never be more than sixty-four sprites on-screen at the same time and composite sprites could quickly use up the available OAM slots. A further limitation of the hardware sprite engine was the requirement that there be no more than eight sprites rendered as part of any single scanline.

Figure 11: An example of a composite sprite for use with the NES hardware sprite engine. Due to limitations of the engine, this sprite was made of several sprites that have been overlaid on top of each other. The body and head were each made of several 8×16 sprites that shared one palette. To draw the face, two additional colors were needed (tan and white), and so the face was a separate set of two 8×16 sprites, using a different palette [40].
The default behavior of the hardware was to simply not render any sprite pixels that occur after the eighth sprite for that scanline. Some game developers attempted to detect when this occurs, and would cycle the order of the sprites in the OAM, producing a flickering sprite and ensuring that all of the sprites are seen at least some of the time. This limitation could also be used to the advantage of game developers, as hidden sprites also contribute to this behavior. This allowed for partial masking of sprites, which was a feature not provided by the hardware itself [39].
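The four-byte OAM record described above can be pictured as the following structure. The field order and the bit positions within the attribute byte follow commonly documented NES behavior rather than anything stated in this chapter.

```cpp
#include <cstdint>

// One four-byte OAM entry, following the commonly documented NES layout.
// The attribute byte packs the palette index, priority, and flip bits;
// bits 2-4 are among the unused bits mentioned above.
struct OamEntry {
    uint8_t y;           // byte 0: Y position of the sprite's top edge
    uint8_t tileIndex;   // byte 1: index into the sprite pattern data
    uint8_t attributes;  // byte 2: palette (bits 0-1), priority (bit 5),
                         //         horizontal flip (bit 6), vertical flip (bit 7)
    uint8_t x;           // byte 3: X position of the sprite's left edge

    int Palette() const { return attributes & 0x03; }
    bool BehindBackground() const { return attributes & 0x20; }
    bool FlipHorizontal() const { return attributes & 0x40; }
    bool FlipVertical() const { return attributes & 0x80; }
};

// The full table holds sixty-four such entries (256 bytes in total).
static_assert(sizeof(OamEntry) == 4, "OAM entries are four bytes each");
```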

Background Graphics
The NES stored background graphics in an area of memory known as the "name tables." Each name table was a 16 × 15 grid of tiles (each of which was 16 × 16 pixels), covering the full 256 × 240 display.

Audio Channels
The NES produced music and sound effects using five audio channels. There were two pulse wave channels with variable volume, a triangle wave channel with fixed volume, and a two-mode white noise channel with variable volume. The final channel was a differential pulse-code modulation channel. The sounds were generated by the CPU, and were output on two pins: one for the two pulse channels, and the other for everything else [38].
Emulators can access this information directly. There exist a number of hardware hacks that tap these two pins to allow for discrete two-channel output, providing a form of pseudo-stereo sound. In an unmodified NES, these channels were mixed together prior to the audio output jacks, making the output effectively monaural.

Description of the Emulator
The NES emulator being used in this system is FCEUX, version 2.1.4a [41]. This emulator has been in development since 2008, and is based on the earlier work of Family Computer Emulator (FCE). The source code consists of approximately 168,000 lines of C++, and is supported on many compilers and platforms.
FCEUX successfully emulates the NES hardware, as well as much of the expansion hardware provided on various game cartridges. It also provides a number of features to enhance the game experience, some of which remove limitations in the original hardware.

Visual Enhancements
FCEUX does not allow a game to use more than the number of colors allowed by the original hardware, but does allow the palettes to be configured to custom colors. The system does not require a specific palette, but once a palette has been chosen it should not be altered. Custom video modes can be configured to exceed the original resolution of 256 × 240. A number of options are provided for controlling the method of upscaling, and the emulator supports hq2x, hq3x, Scale2×, and Scale3× upscaling algorithms [42].
The allowable drawing area can be configured to emulate traditional CRT televisions, which could not render the top and bottom eight pixels. Further, it allows clipping of the left and right eight pixels, which can reduce color artifacts on screen edges when palette swaps occur during scrolling [42].
The final graphical enhancement provided by FCEUX is the option of removing the limitation of having only eight sprites per scanline [42]. For some games, this can greatly improve the visual quality of the game by reducing flickering, but it can break the functionality of games that depend on this behavior.

Audio Enhancements
FCEUX allows for additional control over the sound channels. It features a built-in resampler for upsampling, individual volume controls for each channel, as well as an independent master volume for the mixed sound output [42]. While not explicitly provided as an option, little effort would be required to map each channel to a distinct location in the stereo image.
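As a sketch of that mapping, the following pans a set of mono channel buffers into a stereo image using constant per-channel weights. The buffer representation and the linear pan law are assumptions of the sketch; this is not FCEUX's actual mixer code.

```cpp
#include <cstddef>
#include <vector>

// One output sample with independent left and right amplitudes.
struct StereoSample { float left, right; };

// Mix N mono channel buffers (all the same length) into a stereo buffer.
// pan[c] places channel c in the stereo image: 0.0 = hard left,
// 0.5 = center, 1.0 = hard right.
std::vector<StereoSample> PanChannels(
    const std::vector<std::vector<float>>& channels,
    const std::vector<float>& pan) {
    std::vector<StereoSample> out(channels[0].size(), {0.0f, 0.0f});
    for (size_t c = 0; c < channels.size(); ++c) {
        for (size_t i = 0; i < out.size(); ++i) {
            out[i].left  += channels[c][i] * (1.0f - pan[c]);
            out[i].right += channels[c][i] * pan[c];
        }
    }
    return out;
}
```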

Debugging Facilities
FCEUX provides a substantial amount of debugging functionality built-in to the product. The background and foreground graphics layers can be independently enabled or disabled [42]. This is critically important for any object detection algorithms used in the system, as it allows for the use of more efficient algorithms which do not take occlusion into account.
The PPU Viewer tool displays the contents of the image data banks. The data on the left is generally used for sprites. In the screenshot shown in Figure 13(a), 8 × 16 sprites are loaded, and these are stored in such a way that the upper 8 × 8 half is immediately followed on the right by the lower 8 × 8 half. The data on the right generally represents the images used in the background layer. In this example, it includes elements for text display.
All images are rendered in three colors on a black background. This is because the PPU viewer does not have any information about which palettes are being used for the sprites or tiles at any given time.
The Name Table Viewer tool, also shown in Figure 13, displays the layout of the background layer in memory [42].

Random Access Memory
Games maintain their state information by storing state data in Random Access Memory (RAM) during gameplay. Each game is responsible for managing this data itself, and so there are no standard memory addresses for specific types of game state information. For example, the memory address at which one game stores its score might hold unrelated information in a different game.
FCEUX allows debugging hooks to be installed that fire whenever a particular set of memory addresses is read or written. If the system were to be expanded to include game-specific memory checks, FCEUX provides the necessary tools for accessing the required information.

Summary
The NES hardware maintains a large amount of useful information about the different types of game assets that are currently loaded. The FCEUX emulator provides a virtual emulation of this hardware, allowing software to obtain access to this information.
Additionally, the FCEUX emulator provides a number of debugging aids that make development and testing easier.

CHAPTER 4
Description of the System

Definition 4.1 The System
For the purposes of this thesis, the system is composed of several pieces of computer software and digital data, along with a set of prescribed algorithms for use with that data, that have been developed as part of this research to implement the game enhancement process described in Chapters 4-7.
The goal of this system is to successfully provide an enhanced user experience to the player of a video game by providing a semi-automated replacement of the sounds and graphics in the game. Elements may be enhanced by increasing the number of colors or the resolution of the game graphics, although this is not strictly necessary. The system targets emulators for hardware sprite engine-based game consoles. These types of game consoles maintained a video frame buffer for the background, and sprites were combined with the background during scanline rendering (one row of pixels at a time).
The visual output that is displayed to the game player represents the state of the game at a given moment in time. When enhancing the game's visual output, it is important that what the game player sees remains an accurate representation of the game's internal state.
Therefore, a restriction is placed on the system such that any enhancement made to a game does not alter the game player's ability to perceive what is happening in the game.

Software Components
Several software components are required for the system. Executable software components include an emulator for a video game console, and a ROM image of a game to run in that emulator. Non-executable software components include various configuration files, such as mapping data, digital images, and digital audio files.

Emulator
The term emulator was coined by developers at IBM in 1963, and at that time referred to the emulation of one piece of hardware on another by implementing microcode on the new machine that performed the tasks of the old machine. This is in contrast to simulation, which referred to software-based emulation [43].

Definition 4.1.1 Emulator
An emulator is a piece of computer software that simulates most or all of the functionality of another computer system. This enables the computer running the emulator to execute programs that could only be run on the computer system being emulated.
An emulator for the game console is a required part of the system. The emulator software can be written specifically for this system, or it may be part of an existing emulation package. During the initial development phase, the source code of the emulator must be available for modification. The specific modifications that must be made will vary depending on the emulation target. After the modifications have been completed, only the binaries of the emulator (along with any supporting files) are required for the system.
Excluding ongoing maintenance, the development of the emulator is a fixed cost of the system. Development of the features required by the system (or of the emulator itself, if not modifying an existing codebase) is a one-time event, and does not need to be repeated for each game to be enhanced.

Game Binary Image

Definition 4.1.2 ROM Image
The executable code and data for a console game were typically stored in one or more Read Only Memory (ROM) chips inside the game cartridge. A digital file that contains a copy of the data in the ROM chips is known as a ROM image.
The system requires a ROM image of a game. This will be loaded and executed by the emulator. The purpose of the ROM image is to provide all of the game logic, and to be used as a tool for performing analysis on the game's visual output.

Configuration Data
The system must have a way to configure how the new assets are loaded and recognized. This information can be stored in one or more data files. The files can then be loaded by the modified emulator at runtime.
Several pieces of information should be user-configurable. At a minimum, a list of all replacement graphics, along with a unique label that maps some piece of state information from the original game to the replacement graphic, is required.
There are several common file formats used for specifying configuration data for an application. Examples include INI (common in older applications running on Microsoft Windows) and Extensible Markup Language (XML). A custom file format may be used, and it is not necessarily restricted to being a text-only file. The only restriction is that the modified emulator must support reading and writing data in the specified format.

Art Assets
One of the stated goals of the system is to alter the visual output of the game. Alteration, for the purposes of this system, is defined as providing a substitute image for one or more images in the original game's display. The substitutes can be higher-quality upscalings, modified images, or complete redrawings of the game's content.
Artwork must be provided in the form of digital files that can be loaded and rendered by the modified emulator. The number of image files required depends on how much of the original game imagery is marked for replacement in the configuration data. While the system can be used for total replacement of the game's visual output, it is also conceivable that only single tiles or sprites may be replaced. The digital image files are, in general, unique to each game ROM being enhanced.

Audio Assets
The system does not require specific audio assets to be provided. If an emulator supports sound output, it may play the original sounds specified in the ROM image.
The system should be able to use an alternative set of sounds for the game, including sound effects and background music. Any suitable digital audio format may be used. If implemented, the software libraries used for sound playback should support a minimum of two simultaneous streams of audio to allow for the concurrent playback of background music and at least one sound effect.

Redistributable Package
Once the system has been deployed, there must be a mechanism for loading the configuration data and modified art and sound assets into the emulator. The system does not propose any specific requirements for how this may be performed.

Development Roles
Each component of the above system requires some amount of human interaction. A number of distinct roles have been identified in this system, and they can be categorized by the amount of ongoing effort required by each role in support of the system. At an absolute minimum, the system requires one person, in the event that one person performs all of these tasks alone. A more likely scenario is that each role is filled by one or more people.

Zero-Effort Roles
The role of original developer is defined as the set of people who were responsible for developing the game in its originally published state, including all design, writing, programming, and creation of game assets. The contribution of the original developer is the game ROM image. It is assumed that the game ROM image, being a binary copy of an already existing game, requires no further development effort in the context of this system.

Fixed-Effort Roles
The system requires the use of an emulator software package that has been modified to support the functionality required by the system. The role of emulator developer is defined as the set of software developers who are responsible for writing the programming code for the emulator, or making the required changes in the source code of an existing emulator.
Upon successful completion of the modified emulator, the emulator developer's contribution to the system ends.

Variable-Effort Roles
The role of upgrader is defined as a set of one or more people who will use the software provided by the emulator developer in conjunction with a game created by the original developer to produce an enhanced game. It is the responsibility of the upgrader to analyze the game, determine what is required for enhancement, and produce artifacts to satisfy those requirements. It is possible for the original developer and the upgrader to be the same entity, but they need not be.
The role of asset developer is defined as the set of people responsible for taking the art and sound requirements from the upgrader and producing the digital image and sound files for use in the enhanced game. The upgraders may be the same people as the asset developers, but need not be. The asset development role requires a variable amount of effort, as each game may require a different amount of asset creation, and the assets are (generally) unique to each game.
The final role is that of the player. The player is defined as the end user of the enhanced game. The player is not required to perform any function in the development process, but may be used to obtain feedback about the enhanced game. This feedback can be used to make further refinements to the game prior to final distribution.

Workflow
When a decision to enhance a game is made, several steps must take place. First, the upgrader must acquire a copy of the game ROM from the current copyright holder, who may or may not be the original developer. The game ROM is loaded into the modified emulator, and the upgrader uses the emulator to create the configuration data and to document what assets need to be created. This information is passed to the asset developers, who produce the digital art and sound files for use in the enhanced game.
These files are provided to the upgrader for testing in the system. If changes are required, the asset developers are informed, and this process repeats until asset development is complete. The upgrader can then package the game for redistribution. A diagram of the expected workflow is shown in Figure 14.

CHAPTER 5
Object Detection Implementation
This chapter will discuss the implementation of the object detection portion of the system. Object detection in the system consists of detecting sprites and background tile graphics.

Modifying the Emulator for Game Enhancement
The emulator must be modified to provide support for object detection, event detection, and asset replacement. In the process of doing so, there will be opportunities to access some of the emulator's knowledge about the state of the virtual hardware. This section of the thesis discusses the process of determining what pieces of information might be useful, the implementation of new emulator features to extract and use this information, and the difficulties encountered.

Initial Analysis of the Emulator
The source code base of FCEUX is very large, consisting of approximately 168,000 lines of C++ code. Building the source code was not an easy task.

Information Available to the System
The emulator has access to all of the information about the state of the virtual hardware, including the contents of CPU registers, RAM, video RAM, video ROM, and game ROM. The fact that it is accessible does not necessarily mean that it is easily obtainable. Further, there must be some discussion about how much of this information is actually beneficial to the system. The previously mentioned pieces of data can be processed in a blind manner, in that the same techniques can apply for any game ROM being used in the emulator. System RAM can be tapped to determine the value stored at a specific location in memory, but the contents are entirely dependent upon the specific coding of the game. For example, many games keep track of a numeric score for the player. The NES hardware does not provide a register for this information, and there is no standardized memory location or format.
One game may store the score data at a specific memory address, but another game may use that same address to store completely unrelated information.

User Configurable Behavior
The system must expose an interface for configuring how game assets in a game ROM are mapped to new replacement assets. The modified emulator must support reading and writing from a data source that contains this information, such as a file or network connection.
At a minimum, the system requires the ability to map a specific identifier to a replacement game asset. The format of the identifier may be unique to each type of asset. For example, the current implementation of the system uses an unsigned integer to represent the identifier for both sprites and tiles. Sprite identifiers are chosen from the set of numbers from 0 to 255, which identify their index values in the OAM table. Tile identifiers may be any number between 0 and 2³² − 1, but may be split into smaller, easier-to-manage subcomponents, as the 32-bit data represents four 8-bit numbers packed into a single word. An example of a partial configuration file in Windows INI format is shown in Figure 16.
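As a rough illustration (not the actual file shown in Figure 16), the mapping could be loaded as sketched below. The section names, keys, and file layout in the comment are hypothetical:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    // A minimal sketch of loading identifier-to-asset mappings from an
    // INI-style file. Hypothetical layout:
    //
    //   [sprites]
    //   ; OAM index -> replacement image
    //   12=link_walk_down.png
    //   [tiles]
    //   ; packed 32-bit tile label -> replacement image
    //   2155905152=rock.png
    struct AssetMaps {
        std::unordered_map<uint8_t, std::string>  sprites; // 0..255
        std::unordered_map<uint32_t, std::string> tiles;   // 0..2^32-1
    };

    bool LoadAssetMaps(const std::string& path, AssetMaps& out)
    {
        std::ifstream in(path);
        if (!in) return false;
        std::string line, section;
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == ';') continue;     // blank or comment
            if (line.front() == '[' && line.back() == ']') {  // section header
                section = line.substr(1, line.size() - 2);
                continue;
            }
            std::size_t eq = line.find('=');
            if (eq == std::string::npos) continue;
            std::string key   = line.substr(0, eq);
            std::string value = line.substr(eq + 1);
            if (section == "sprites")
                out.sprites[static_cast<uint8_t>(std::stoul(key))] = value;
            else if (section == "tiles")
                out.tiles[static_cast<uint32_t>(std::stoul(key))] = value;
        }
        return true;
    }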

Sprite Detection
Detection of multiple moving objects is a difficult problem due to the need to separate the foreground from the background, and because multiple objects may be occluding each other. A number of temporal algorithms have been developed to eliminate the foreground objects from a set of image frames [44]. This assumes that there are enough frames to achieve satisfactory results, and that the foreground objects are actually moving. Occlusion can make object labeling and localization difficult by obscuring object features that are important for detection. By utilizing information contained within the OAM, we can obtain a complete listing of all in-use sprites and information about their location and orientation without having to perform any background subtraction pre-processing and without having to use an occlusion-invariant detection algorithm. The layout of an OAM entry is described in Table 3. Object localization can be obtained by examining the zeroth and third bytes to obtain the x and y coordinates of the sprite.

Definition 5.2.1 Unordered Map
An unordered map is a template-based associative container in C++ (unordered_map) that provides an iterable hash table implementation [45].
Object classification can be obtained by examining the first byte. The priority bit is used to control whether the sprite is rendered behind or in front of the background layer. The three unimplemented bits were non-existent in the hardware, and so OAM entries were effectively only twenty-nine bits wide. Of the data stored in the second OAM byte, the current implementation of the system makes use of the bits for horizontal and vertical flipping to handle on-screen objects that use a single half-sprite that is horizontally or vertically flipped to produce the full image. A more complete implementation could use the palette information to select from a different unordered map, allowing games that use the same sprites with different palettes to preserve this difference in upgraded versions.
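Assuming the modified emulator exposes a pointer to its 256-byte OAM copy, sprite extraction following the byte layout described above might be sketched as follows. The structure name and the off-screen test are assumptions (parking sprites at or below the bottom scanline is a common convention for hiding them):

    #include <cstdint>
    #include <vector>

    // One OAM entry is four bytes: bytes 0 and 3 give the coordinates,
    // byte 1 identifies the sprite image, and byte 2 carries the attributes
    // (palette, priority, and flip bits), per Table 3.
    struct DetectedSprite {
        uint8_t x, y;
        uint8_t tileIndex;
        bool    flipH, flipV;
        bool    behindBackground;
        uint8_t palette;
    };

    // 'oam' is assumed to point at the emulator's 256-byte OAM copy.
    std::vector<DetectedSprite> ReadOam(const uint8_t* oam)
    {
        std::vector<DetectedSprite> sprites;
        for (int i = 0; i < 64; ++i) {
            const uint8_t* e = oam + i * 4;
            DetectedSprite s;
            s.y                = e[0];
            s.tileIndex        = e[1];
            s.palette          =  e[2] & 0x03;
            s.behindBackground = (e[2] & 0x20) != 0;  // priority bit
            s.flipH            = (e[2] & 0x40) != 0;
            s.flipV            = (e[2] & 0x80) != 0;
            s.x                = e[3];
            if (s.y < 240)   // entries parked below the screen are unused
                sprites.push_back(s);
        }
        return sprites;
    }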

Background Detection
For sprites, the system directly accesses information in the OAM to identify and localize individual sprite objects. Similar information about the background graphics layer is available, but this implementation will use an image processing approach to detect background tile graphics in order to allow for the comparison of the two implementations.

Video Output Buffer
All image processing and computer vision systems require one or more frames of graphical data to analyze. FCEUX provides several opportunities to access information that could be used as an input for the image processing component. To get the most accurate representation of the data that is being sent to the screen, the system accesses the image data after all scanline rendering to the memory bitmap has been completed, but before any post-processing is applied. The emulator has several code branches responsible for performing this task, as the specific function calls and data structures vary depending on user-configurable options, such as applying video filters for upscaling. The system is not interested in processing the output of the scaling filters, and so those branches can be skipped. The emulator can also be configured to generate output at a certain color depth [42]. We restrict the color depth to 8-bit color output to ensure that the expected code path is always followed. A complete listing of the configuration restrictions for this implementation is shown in Table 4.
The system now has a single point of access to the image data for each video frame.
The data is in the form of a device independent bitmap (DIB), with the image stored upside-down. It is not uncommon for DIB images to be stored in this manner on platforms that maintain backwards compatibility with IBM OS/2's Presentation Manager [46]. This image data is passed to the image processing component for storage once per frame, and is vertically flipped at that time. Additionally, a flipped and 4× nearest-neighbor upscaled version of the image is rendered to the video output surface using the Win32 StretchDIBits function. This serves as a background layer upon which all other rendering is performed. It also serves as a visual guide for the developer during the upgrading process.

Preparation
The original data is a stream of bytes that represent a palette-indexed image. The first step is to perform the palette index lookups and use this information to convert the data to a 32-bit RGBA raster bitmap image. The dimensions of this image are 256 pixels × 224 pixels.
The bytes of the image are obtained inside the output stage of the emulator after all scan lines have been rasterized, but prior to actually displaying the image on-screen.
The contents of the image represent the background layer of the game's visual output as seen by the Picture Processing Unit (PPU). As the contents of the sprite layer can be efficiently detected through the emulator's hardware state information, eliminating the sprite layer from the rendering leaves a clean tile map image. This removes the constraint that the tile detection algorithm be invariant to occlusion.
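The palette-index conversion described above can be sketched as follows. This is a minimal sketch, not FCEUX's actual code; the palette lookup table (paletteRgb) and its packed 0x00RRGGBB format are assumptions:

    #include <cstdint>
    #include <vector>

    // Convert the emulator's 8-bit palette-indexed frame into a 32-bit RGBA
    // image. 'paletteRgb' is assumed to map each palette index to a packed
    // 0x00RRGGBB value; the emulator maintains an equivalent lookup internally.
    std::vector<uint32_t> IndexedToRgba(const uint8_t* indexed,
                                        const uint32_t* paletteRgb,
                                        int width = 256, int height = 224)
    {
        std::vector<uint32_t> rgba(static_cast<std::size_t>(width) * height);
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                uint32_t rgb = paletteRgb[indexed[y * width + x]];
                rgba[y * width + x] = 0xFF000000u | rgb;  // opaque alpha
            }
        }
        return rgba;  // still upside-down; flipped later, as described above
    }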

Tile Recognition
Many games divide the screen content into two distinct regions. One region contains the actual game play field. This region is composed of a series of square graphics known as tiles. The second region is reserved for scores and player feedback information. This region will be referred to as the banner region. The banner region is primarily static, with only a small amount of content changing in response to some event in the game. It most commonly appears along the top or bottom of the game play region (see Figure 17). While the game play region requires tile-by-tile detection, the banner region may be specified manually through the configuration file if needed.

Figure 17: The game screen is commonly divided into two regions: a game play region and a player feedback region. In (a), the player feedback region appears below the play region [47]. In (b), it appears above the play region [48].
In the tile-by-tile detection phase, each tile is assigned a labeling. Tile localization is not required, as the location is known from its position on the tile grid. The labeling is based on a feature composed of spatial color information. Color can be an important cue in object detection, especially when other features are unreliable or non-existent in the image [49]. For the NES, up to four colors may be used per-tile, and information about the spatial distribution of these colors can be used to differentiate between two or more tiles.
Color coherence vectors (CCVs) were considered as a way of labeling each tile. A CCV is a way of computing a color histogram of an image that treats color regions that are spatially separate (incoherent) as distinct "colors." CCVs generally consider a region of pixels to be coherent if the size of the region exceeds approximately 1% of the size of the image [37].
The system is working with tiles of size 16 × 16, and so a coherent region could consist of a portion of the image as small as three pixels. Dithered images, which may alternate two colors in a 1 × 1 checkerboard pattern, are not uncommon in pixel art. CCVs are therefore a poor fit for these types of images, as the dithering patterns result in large quantities of incoherent regions. As an alternative, this system uses a labeling based on the average colors of the four quadrants of each tile.
Each 16 × 16 tile is divided into four 8 × 8 quadrants, Q_1, Q_2, Q_3, and Q_4, each of which contains sixty-four pixels. Each quadrant Q_k is processed to find the average pixel color A_RGB for that quadrant, with 0 ≤ R, G, B ≤ 255:

$$A_{RGB}(Q_k) = \left( \frac{1}{64} \sum_{p \in Q_k} R(p),\ \frac{1}{64} \sum_{p \in Q_k} G(p),\ \frac{1}{64} \sum_{p \in Q_k} B(p) \right).$$
Each color channel has an intensity range of 0–255, requiring eight bits (one byte) for storage. A_RGB requires a minimum of three bytes to store the red, green, and blue components, and thus twelve bytes are required to store the information for all four quadrants. This is not a large amount of memory, but requiring the upgrader to input 3-tuples for every tile is potentially inconvenient. Further, it can be difficult to visualize this information in a usable form on the tile map display.
To resolve this problem, the data that represents the tile image will be converted to grayscale. The algorithm for this must not lose any of the spatial information about the color distribution, and it should avoid producing duplicate grayscale values for different colors.
Let A_R, A_G, and A_B be the red, green, and blue components of A_RGB, respectively. Then the grayscale value A_gray is

$$A_{gray} = G(A_{RGB}) = 0.299\,A_R + 0.587\,A_G + 0.114\,A_B, \qquad (7)$$

where G is a grayscaling function. The function G shown here is a perceptual grayscaling formula known as the Luminance formula [50].
It is common for color images to be converted to grayscale for object detection tasks. A number of grayscaling formulas were considered. Kanan and Cottrell found that perceptual grayscaling formulas were not as effective as linear formulas for general object recognition. However, one of the exceptions that they found was that perceptual formulas had an advantage in the recognition of textures [50]. The data contained in each tile image represents a low-resolution texture. Further, the palettes in use are of very limited color depth, and so equal values in the color subcomponents would not be uncommon. Perceptual grayscale models use a weighted combination of the color subcomponents, minimizing the likelihood of two colors generating the same grayscale value. Figure 18 shows the results of several grayscaling formulas applied to a set of fully saturated RGB colors. The Luminance formula (Equation 7) provides a good balance of color discrimination (see Figure 18e) and low computational cost, and so it was chosen for the function G.

Figure 18: A panel of fully saturated colors, including individual RGB channels, combinations of two channels, and a gradient, processed using several different grayscaling formulas [50].
A_gray can be stored using only eight bits, as 0 ≤ A_gray ≤ 255. This means that the average grayscale value for each of the four quadrants can be packed into a single 32-bit unsigned integer (see Figure 20). This integer contains information about the color distribution of the tile and preserves some information about the localization of those pixels within the tile. In Figure 19, two 16 × 16 tiles with a similar color distribution are shown. The tile in Figure 19f has most of its red component in the upper right corner, whereas the tile in Figure 19a has the red component more or less equally distributed across all four corners. Performing color averaging per-quadrant and then converting to grayscale preserves this spatial color information; full-tile averaging does not.
By reducing the amount of information used by the feature to four small numbers, the upgrader can more easily work with the information. Within the emulator's display, all four numbers can easily fit within a tile square on-screen, allowing for easy visualization of the labeling data. The numbers also allow for easier data entry into the configuration files. This is implementation specific, and a more thoroughly-developed, user-friendly development tool could hide much of the complexity from the upgrader, allowing for more data per tile if needed. The dimensions of the color image I that the system is processing are 256 × 224 pixels.
The NES uses a tile size of 16 × 16 pixels, and thus the background layer is a 14 × 16 grid of tiles. Let $T \in \mathbb{N}_{\le 2^{32}-1}^{14 \times 16}$ be the matrix used to store the computed tile identifiers, with $T_{i,j}$ being the tile detection result for the tile whose upper-left corner is at position (i, j) in I.
Then T_{i,j} can be computed as

$$T_{i,j} = 2^{24} A_{gray}(Q_1) + 2^{16} A_{gray}(Q_2) + 2^{8} A_{gray}(Q_3) + A_{gray}(Q_4),$$

where Q_1, Q_2, Q_3, and Q_4 are the four quadrants of the tile whose upper-left corner is at position (i, j) in I, packed in the byte order shown in Figure 20.
Computation of T_{i,j} relies only on arithmetic operations and uses no branching other than the highly optimized branching in the generated loop code. This allows for very fast computation of the features, and thus efficiently labels the tile.
The computed 32-bit word (Figure 20) is used as a hash key in an unordered_map. The key represents the image content of the tile, and can be used to obtain a handle to a replacement image. During each frame of game play, all tiles on the screen are processed to compute their hash keys. The system then queries the table to obtain an image handle. If a bitmap handle is found, the system renders the replacement image onto the display. If the key cannot be found in the table, nothing is rendered onto the screen, and the original image shows through. This serves as a visual guide during development, but can also be useful if the developer only wants to replace a small number of the graphical assets. If no changes in the tile matrix have been detected since the previous frame, the most recently rendered bitmap is reused.
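A minimal sketch of the labeling and lookup steps is shown below. The quadrant-to-byte packing order and the image-handle type are assumptions; the Luminance weights follow Equation 7:

    #include <cstdint>
    #include <unordered_map>

    // Average one 8x8 quadrant of the RGBA image and reduce it to an 8-bit
    // grayscale value using the Luminance formula (Equation 7).
    static uint8_t QuadrantGray(const uint32_t* rgba, int stride, int x0, int y0)
    {
        unsigned r = 0, g = 0, b = 0;
        for (int y = y0; y < y0 + 8; ++y) {
            for (int x = x0; x < x0 + 8; ++x) {
                uint32_t p = rgba[y * stride + x];
                r += (p >> 16) & 0xFF;
                g += (p >> 8) & 0xFF;
                b += p & 0xFF;
            }
        }
        // Average over the 64 pixels, then apply the perceptual weights.
        float gray = 0.299f * (r / 64.0f) + 0.587f * (g / 64.0f) + 0.114f * (b / 64.0f);
        return static_cast<uint8_t>(gray + 0.5f);
    }

    // Compute T(i, j) for the 16x16 tile with upper-left pixel (px, py).
    // The quadrant-to-byte ordering here is an assumption (see Figure 20).
    uint32_t TileLabel(const uint32_t* rgba, int stride, int px, int py)
    {
        return (uint32_t(QuadrantGray(rgba, stride, px,     py))     << 24) |
               (uint32_t(QuadrantGray(rgba, stride, px + 8, py))     << 16) |
               (uint32_t(QuadrantGray(rgba, stride, px,     py + 8)) <<  8) |
                uint32_t(QuadrantGray(rgba, stride, px + 8, py + 8));
    }

    // Query the replacement table; a null handle means "let the original
    // tile show through," as described above. The handle type is assumed.
    using ImageHandle = void*;
    ImageHandle LookupReplacement(const std::unordered_map<uint32_t, ImageHandle>& map,
                                  uint32_t label)
    {
        auto it = map.find(label);
        return it == map.end() ? nullptr : it->second;
    }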
Prior to performing the hash key lookup, the upgrader may want to perform additional processing on the tile detection matrix. For this, the system provides postdetection filtering.

Post-detection Filtering
The tile detection algorithm attempts to determine what is on the screen. By itself, the algorithm only provides enough information for a single candidate replacement tile. This is because the tile labeling is used as a key for looking up an image in a hash table. A second algorithm is used for post-filtering the tile detection matrix.

Definition 5.4.1 Post-Detection Filter
The post-detection filter is an algorithm that is applied after tile labeling has occurred. For each tile, it examines the labelings of adjacent tiles to determine if the current tile's label should be altered.
Several reasons have been identified for requiring post-filtering of the tile detection matrix. The detection algorithm produces an integer labeling, and the labeling may not be unique. An example of two tiles that produce the same labeling is given in Figure 21.
A combination of the tile labeling, along with contextual information from neighboring tiles can be used to correctly locate an acceptable candidate label for replacing the tile.
The detection algorithm will generate an identical tile labeling for identical tiles. While this is the desired behavior for determining what is currently on-screen, the upgrader may want to provide alternative graphics depending on contextual clues from neighboring tiles.
As an example, large areas filled with a single tile may utilize several different replacement graphics to create a less repetitive or more context-sensitive appearance. This feature has no dependence on the specific method by which T was obtained, and can also be applied if T was populated from virtual hardware state information. An example of how post-detection filtering may be applied is shown in Figure 23. Correction of a tile labeling resulting from non-unique labels is a subset of this problem, and both can be treated identically.
The tile detection matrix T stores the initial labelings for each tile. The post-detection matrix T_P maintains a second set of labelings that are initially undefined. Let i and j be the row and column indices of the tile currently being processed in the post-detection filter. Then the filter can access T(i′, j′) and T_P(i′, j′), with i′ ∈ {i − 1, i, i + 1} and j′ ∈ {j − 1, j, j + 1}. Figure 22 illustrates the set of neighbor tiles that can be evaluated as part of the filter.
If post-detection filtering is requested, a filter is initialized with data read from the user configuration data. The filter is applied to all elements of T. For each element, the filter determines whether the current element should be modified. If it should be, the replacement label is stored at the same position in T_P. At rendering time, a modified version of T, referred to here as T′, is used to provide the renderer with a set of labels to use for image replacement. T′ is T with each element of T that has a non-empty equivalent in T_P replaced with the labeling in T_P.

Figure 22: The post-detection filter grid. The current tile is located at position (i, j). The filter is swept across every position in T, checking for matches at (i, j) and one or more of the neighbor elements. If a match is found, the value at (i, j) is replaced.
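A sketch of this sweep is shown below. The rule structure (one matching label, one neighbor label, one replacement) is an assumption about how the configuration data might express the filter; the real rules are read from the user configuration:

    #include <array>
    #include <cstdint>
    #include <vector>

    constexpr int kRows = 14, kCols = 16;
    constexpr uint32_t kUndefined = 0xFFFFFFFFu;  // sentinel: no entry in T_P
    using TileMatrix = std::array<std::array<uint32_t, kCols>, kRows>;

    // A single user-configured rule: if the tile at (i, j) carries 'match'
    // and any of its eight neighbors carries 'neighborMatch', the label at
    // (i, j) is replaced.
    struct FilterRule {
        uint32_t match;
        uint32_t neighborMatch;
        uint32_t replacement;
    };

    void ApplyPostDetectionFilter(const TileMatrix& T, TileMatrix& TP,
                                  const std::vector<FilterRule>& rules)
    {
        for (auto& row : TP) row.fill(kUndefined);
        for (int i = 0; i < kRows; ++i) {
            for (int j = 0; j < kCols; ++j) {
                for (const FilterRule& rule : rules) {
                    if (T[i][j] != rule.match) continue;
                    // Scan the neighbor positions (i', j') around (i, j).
                    for (int di = -1; di <= 1 && TP[i][j] == kUndefined; ++di) {
                        for (int dj = -1; dj <= 1; ++dj) {
                            if (di == 0 && dj == 0) continue;
                            int ni = i + di, nj = j + dj;
                            if (ni < 0 || ni >= kRows || nj < 0 || nj >= kCols) continue;
                            if (T[ni][nj] == rule.neighborMatch) {
                                TP[i][j] = rule.replacement;  // replace at (i, j)
                                break;
                            }
                        }
                    }
                    if (TP[i][j] != kUndefined) break;  // first matching rule wins
                }
            }
        }
    }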

Scene Detection
The 14 × 16 tile detection matrix T can be viewed as a 28 × 32 grayscale image, as each element in T can be broken into four independent components (two rows and two columns). Low-frequency energy is preserved during the production of T through the averaging process, as the labeling of each element in T is effectively the result of low-pass filtering.
It has been shown that color cues are a primary scene recognition cue and that colored scenes can be identified by humans even at very low resolutions [51]. Further, it has been suggested that 32 × 32 is the minimum resolution for a color image on which scene recognition can be reliably performed [52]. For grayscale images, the required image dimensions grow to approximately 64 × 64 [52]. These dimensions were recommended in the context of querying a large database of natural images (i.e., naturally occurring images, captured by photography or other means). Color information has been discarded in T, and the image represented by T cannot be considered a natural image, as it is algorithmically derived from an artificial image which may or may not represent a naturally occurring scene.
The image represented by T is used as a query to determine if T matches a scene of interest. Unlike many content-based image retrieval systems, the goal is not to obtain a list of all images known to be similar to the query. The goal is to find a single image that is similar (beyond a specified scoring threshold) to the image represented by T. If one does not exist, then T does not represent a scene.
The image represented by T lacks noise, due to both the nature of the source of the data as well as the results of the low pass filtering. There is also no need to perform any form of object detection or segmentation on the image. Therefore, the 28 × 32 grayscale image should be sufficient for this purpose.

CHAPTER 6
Scene Recognition and Event Detection
Thus far, the components of the system that have been described in earlier chapters are capable of detecting the locations and orientation of sprites, and of detecting the presence of specific tiles on the game screen. This provides enough information to the system to enable one-to-one replacement of game images. Through the use of post-detection filtering, individual tiles can be mapped to one of several context-sensitive replacement images. While this is sufficient to provide an enhanced game experience, it is possible to further enhance the output by providing the system with a higher-level awareness of what is happening within the game. In this chapter, the methods used by the system for detecting larger-scale events are discussed.

Events
To perform event processing, the system must have a list of situations to detect, along with a set of responses to those situations. These two elements shall be called triggers and events, respectively.

Definition 6.1.1 Trigger
A trigger is a user-defined detectable situation that the system may take notice of in order to produce some change in the system state.

Definition 6.1.2 Event
An event is an action performed in response to a trigger.
To provide an example of how triggers and events work, consider a scenario where the system has detected that the player has entered a specific room in a game. This serves as a trigger. In response, the system may request that the background music change. The request to change the background music is an example of an event.

Triggers
Triggers can be broken into several sub-categories. An appearance trigger recognizes when an object with a specific labeling has been newly detected over a series of video frames. The objects of interest may include background objects located on the tile map, or foreground sprite objects. Objects that are on-screen may also disappear, and so the absence of an object that was previously known to be on-screen can be treated as a disappearance trigger.
A movement trigger produces an event when objects are found to be moving according to some user-defined pattern, or when they are found to be nearby a specific location.
Examples of movement triggers include the detection of specific objects changing direction or speed, as well as the detection of a collision or adjacency of two or more objects.
Whereas the previously described triggers are relatively simple to detect, the system should also be able to process events of greater complexity. This may include boolean combinations of simple triggers, or it may extend to full-scale scene processing. The current version of the system used in the modified emulator provides implementations for appearance and disappearance triggers.

Scenes
A game can often be divided into several distinct parts. Common examples of this include title screens, configuration screens, and changes in the game play environment.
Specific map locations or combinations of on-screen elements may also be of interest. In the context of object detection, a scene often refers to a complex image that contains more than one object [53]. In this system, the definition is narrowed such that a scene represents a particular set of objects in a specific, fixed spatial relationship.

Definition 6.2.1 Scene
A scene is a complex trigger, defined by a set of state information that, when taken together, uniquely identify a specific moment in the game.
In a typical game, there exist many scenes that can be detected. Of these, relatively few have much meaning to the player. A procedure for marking interesting scenes is provided in the modified emulator.

Marking Procedure
The upgrader plays the game in the modified emulator and presses a specific key sequence (in this implementation, it is the  key) when presented with a moment in the game that should be marked as a scene. The upgrader must then enter an identifying tag for the scene, which will be stored on-disk along with a vector representation of the tile detection matrix for the current screen. An example of the user interface for this process is shown in Figure 24.
In the mapping data, the identifying tag entered during marking is entered as an event trigger. This process is repeated for each scene that the upgrader wants to identify with an event. For each scene, multiple events can be generated.

Representation and Recognition
The dimensions of the game screen are 256 × 224 pixels, divided into a grid of tiles that are each 16 × 16 pixels. This results in a grid of tiles that is 14 × 16 in size. Let Σ be the set of all scenes that have been marked for recognition. To perform scene recognition, we again let $T \in \mathbb{N}_{\le 2^{32}-1}^{14 \times 16}$ be the tile detection matrix for the current frame, and we let $S \in \mathbb{N}_{\le 2^{32}-1}^{14 \times 16}$, with S ∈ Σ, be the data matrix associated with a scene to be recognized.
S and T can be viewed as small grayscale images. One method that is commonly used to test for image similarity is to compute the Euclidean distance between the two images:

$$d_E(S, T) = \sqrt{\sum_{i,j} \left( S_{i,j} - T_{i,j} \right)^2}.$$

Figure 24: In this set of screenshots, (a) the upgrader has identified a part of the game that represents a scene. (b) A key is pressed, and the upgrader is presented with a file dialog to save the scene tag information to disk. In this case, the vector ⃗t, representing the scene displayed in (a), is associated with the tag "InTheCave." Each ⃗t_i is a packed 32-bit word built from four 8-bit numbers, each of which represents the intensity of the average color in that quadrant.
An alternative distance measure is the Manhattan distance between the two images:

$$d_M(S, T) = \sum_{i,j} \left| S_{i,j} - T_{i,j} \right|.$$

There is a problem with both of these distance metrics when used with the data in T and S. A large distance between any two elements of T and S does not indicate that the tiles are "more different" than a smaller distance would; any difference at all indicates only that the tiles were labeled differently. Thus, a Manhattan distance of nine can be misleading: was it one tile whose label differed by nine, or nine tiles whose labels each differed by one? (The Euclidean distance suffers from the analogous ambiguity.) There is no way to obtain this information directly through these measurements.
Color coherence vectors (CCV) were again considered for this task. Color information was stripped during the tile labeling phase, leaving a set of grayscale values in T . Treating the grayscale values as "colors," the image represented by T can be broken up into sets of coherent and incoherent regions. This preserves some spatial information, but not enough to uniquely identify a particular scene.
A simple method for identifying if the current scene T is known in Σ is through testing for matrix element equality. For every S ∈ Σ, compare S with T . If S is strictly equal to T , then T has been recognized as a scene.
This method does not allow for any differences between T and S (i.e., the query scene and the stored scene must be identical). If T is supposed to represent a scene, but has had a minor alteration (for example, a moved block or a text overlay), then T will no longer be considered a recognizable match for the scene. Another simple solution is to use matrix subtraction. Let $Z \in \mathbb{N}_{\le 2^{32}-1}^{14 \times 16}$ be a matrix with Z = |S − T| (element-wise). The number of non-zero elements in Z represents a crude form of similarity scoring, being equal to the number of tiles that were not exact matches. This method has the advantage of being computationally fast and easy to implement. This distance measure is defined as

$$d(S, T) = \sum_{i,j} \varphi(S_{i,j} - T_{i,j}),$$

where φ(x) is the function

$$\varphi(x) = \begin{cases} 0 & \text{if } x = 0, \\ 1 & \text{otherwise.} \end{cases}$$

When a scene is saved to disk, the contents of the file are the matrix S, written as a set of vectors ⃗s_i. When the system loads the list of scenes at startup, this data is read back to form the set of scenes Σ.
During gameplay, the tile detection matrix T is created during every frame. T is used as a query to determine whether there is any S ∈ Σ that matches T within a certain dissimilarity tolerance. There are 224 elements in every S, as the matrices are 14 × 16. Therefore, a mismatch in one element (i.e., a single tile was labeled differently in T and S) means that there is a difference of ≈ 0.446% between the two scenes.
Tolerances may be specified per-scene in the configuration data. This allows the upgrader to specify that a scene S can only be matched by a query whose similarity score exceeds some threshold. For example, if the scene and the query differ by three tiles (at a penalty of ≈ 0.446% per tile), then the threshold must be set below 98.662% for the query to register as a match.
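A minimal sketch of this matching procedure, assuming the per-scene threshold is expressed as a fraction of matching tiles, is shown below:

    #include <array>
    #include <cstdint>
    #include <string>
    #include <vector>

    constexpr int kRows = 14, kCols = 16;
    using TileMatrix = std::array<std::array<uint32_t, kCols>, kRows>;

    struct Scene {
        std::string tag;        // e.g. "InTheCave"
        TileMatrix  s;          // stored tile labels for the marked scene
        double      threshold;  // per-scene similarity threshold, e.g. 0.98662
    };

    // Count mismatched tiles: the sum of phi(S(i,j) - T(i,j)) defined above.
    int MismatchCount(const TileMatrix& S, const TileMatrix& T)
    {
        int d = 0;
        for (int i = 0; i < kRows; ++i)
            for (int j = 0; j < kCols; ++j)
                if (S[i][j] != T[i][j]) ++d;
        return d;
    }

    // Return the first marked scene whose similarity meets its threshold,
    // or nullptr if T does not represent any known scene.
    const Scene* RecognizeScene(const std::vector<Scene>& sigma, const TileMatrix& T)
    {
        for (const Scene& scene : sigma) {
            double similarity = 1.0 - MismatchCount(scene.s, T) / 224.0;
            if (similarity >= scene.threshold)
                return &scene;
        }
        return nullptr;
    }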

Non-interfering Reaction
Through event detection and scene recognition, the system can have awareness of complex situations within a game. However, the system must be restricted in the ways that it can respond to these events, because the game's logic, as stored in the ROM, cannot be altered. For example, an event fired in response to a trigger could not open a hidden passage through a wall where one would not already have existed.
Responses to an event can be either visual or audible. Visual responses can include replacement or alteration of individual graphics, or could extend to replacing the entire background layer with one or more pre-rendered bitmap images.
Audible event responses may include the playing of digital audio files when a specific enemy or item appears on-screen. Scene detection may be used to generate an audible response in the form of changing the background music when a specific scene has been recognized.

Sound Events
The sound output of the NES was rather primitive. There were no "instruments" to be heard; instead, sounds were synthesized by a pair of pulse wave oscillators, a triangle wave oscillator, a noise generator, and a differential pulse-code modulation (DPCM) channel.
Pitch and duration were specified by the game coding for each channel. This information can be accessed from within the emulator [39].
The Musical Instrument Digital Interface (MIDI) was developed in the early 1980s as a way for electronic musical instruments to communicate with each other. It supports a number of different types of event data, including pitch and duration [54].
It is possible to silence the output of the emulator's APU code (see Table 4 on page 52), and instead translate the input data for each channel into a set of MIDI events. The MIDI event data can then be sent to any MIDI-compatible device, including external hardware or software-based synthesizers. This allows for enhancement of the sound data by directly utilizing the original audio coding information as a data source for the MIDI synthesizers.
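Although this path is bypassed in the current implementation (see below), a sketch of the per-channel translation step might look like the following. The emulator-side inputs (a channel's frequency and a note-on/off gate) are assumptions about what the APU code could expose; the MIDI message framing itself is standard [54]:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Translate a channel state change into raw MIDI bytes.
    std::vector<uint8_t> ToMidiEvent(int channel, double freqHz, bool noteOn,
                                     uint8_t velocity = 96)
    {
        // Convert frequency to the nearest MIDI note number (A4 = 440 Hz = 69).
        int note = static_cast<int>(std::lround(69.0 + 12.0 * std::log2(freqHz / 440.0)));
        if (note < 0)   note = 0;
        if (note > 127) note = 127;

        // Status byte: note-on (0x90) or note-off (0x80), low nibble = channel.
        uint8_t status = (noteOn ? 0x90 : 0x80) | (channel & 0x0F);
        return { status, static_cast<uint8_t>(note),
                 noteOn ? velocity : static_cast<uint8_t>(0) };
    }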
For this implementation of the system, a decision was made to bypass this mechanism entirely, and instead to use the scene recognition engine for loading and playing external digital music files. This enables the system to use a greater number of polyphonic lines, and allows greater flexibility in sound choices. The current implementation of the system maintains audio stream handles for background music, one non-looping sound effect, and one looping sound effect. The handles are updated based upon the triggered events, including scene recognition.

CHAPTER 7
Discussion of the Results
I have developed an implementation of the system by modifying the source code of the FCEUX emulator to support the required features. To test the system, a commercial game, The Legend of Zelda, developed by Nintendo, was chosen for the game ROM image [48].

System Feasibility
The system was developed and tested using a pre-existing emulation software package (FCEUX) for the emulator portion of the system. This emulator is capable of successfully emulating the original game, with all in-game behaviors unchanged. A set of upgraded graphics and sound assets were used for replacing the original in-game data.

Modifying an Emulator
The FCEUX emulator software used in this system has been in development since 2008. The developers of FCEUX had no awareness of this system, and so the requirements of the system had no impact on the technical design of the emulator. Despite this, I was able to modify the source code of FCEUX to support the requirements of the system without needing to make any substantial architectural changes. Therefore, it is clear that an emulator does not need to be designed in a particular way to support this system. An off-the-shelf emulator package can be successfully modified to use this system, as I have done in this implementation.
The difficulty of making those modifications depends greatly on the understandability of the emulator's source code, as well as on the quantity and accuracy of its documentation. In this particular implementation, the source code for FCEUX was difficult to understand and lacked sufficient documentation. This is not an uncommon scenario in many of the open-source emulators that I have examined. It did not, however, prevent me from understanding enough of the code to make the changes that were required. If an emulator were to be newly developed, support for the functionality required by this system would require very little additional effort to build into the design.
This research has shown that using a modified game console emulator to provide enhanced versions of sprite-based games is an approach that shows promise, and that enhanced games can be developed at a substantially lower cost than developing a traditional port of the game.

Specific Features Implemented
In the current build, the event detection implementation processes appearance and disappearance triggers, in addition to scene recognition triggers. There are nine defined scenes, four of which produce a complete replacement of the background layer with a pre-rendered bitmap image.
The system provides a digital audio soundtrack that uses several scene-dependent MP3 audio files. Sound effects are also provided in the form of MP3 audio files.

Object Detection
The system provides two approaches for processing on-screen game elements. This implementation performs sprite detection by extracting data from the OAM tables within the emulator, and tile detection by capturing the video buffer and applying the tile detection algorithm. Both approaches provide successful detection.

Sprite Detection Results
The

Applicability to Other Emulators
All emulators for the NES will have the same virtual hardware state information available. Sprite detection, as described above, can be implemented in those emulators without changing the general logic. Table 5 provides information about the hardware sprite engines in a number of popular video game systems. The Game Boy Advance (GBA) and the Neo Geo hardware sprite systems supported additional features not present in older game consoles. For example, the Neo Geo could shrink sprites by scaling them, and the GBA supported arbitrary scaling and rotation of sprites.
Detection of these sprites by traditional object detection algorithms would require the use of an algorithm that is invariant to scale and rotation, such as geometric hashing.
Geometric hashing is computationally expensive, but easily parallelizable. The devices that may implement this system might not provide multiple cores or GPU processing, and may be limited in the amount of system memory.
For emulators of game consoles that have scaling and rotation capabilities, using the state information inside of an emulator to locate and label sprites will provide a substantial improvement in performance and accuracy. Specifically, the system will maintain O(1) running time for per-sprite detection, and thus O(n) for all n sprites, even if there are 128 sprites with arbitrary sizes, rotation, and occlusion.

Tile Detection Results
There were forty-eight tiles shared between the two sets.
In the current implementation of this system using FCEUX, I have chosen a relatively simple algorithm that assigns an integer labeling to each tile. The integer is a composite value composed of four 8-bit numbers, each ranging in value from 0 to 255. These numbers represent a grayscale intensity level for a quadrant of the tile. A color average is computed for each quadrant, and this average color is converted to grayscale using the Luminance grayscaling formula. Dividing the tile into quadrants maintains some spatial information about the color distribution in the tile, and using a perceptual grayscaling formula preserves some information about the difference in individual color channels.
Accurate tile detection is considered as finding a unique labeling for each graphical tile seen by the system such that the labeling can be used as a hash key. As shown in Figure 21 on Page 60, the labeling algorithm used in the current implementation of the system does not produce a unique value for all inputs. These examples were artificially constructed to exploit this weakness, and I do not anticipate such collisions happening in real-world inputs.
For the specific game being tested, tile detection was 100% accurate when not scrolling.
Scrolling across name tables (in particular, vertical scrolling) produced graphical glitches as the algorithm was not able to detect the partial tiles reliably. This is due to the tile no longer aligning to the tile detection grid, thus providing different labelings than what is expected.
At some computational expense, a sliding grid can be used to align the grid with the scrolled image data for better results. Localization was not a consideration in tile detection as it was inferred from the tile's position within the tile grid.

Weaknesses of the Image Processing Approach
The NES hardware supported only one tile-based background layer, with limited scrolling functionality (illustrated in Figure 25a). A listing of hardware-supported background modes for commonly emulated game systems is given in Table 6. Later game consoles had hardware support for multiple background layers, which were not necessarily tile-based. Additionally, these layers could be scrolled (Figure 25b) at independent rates.
On some systems, layers could be rotated or scaled independently (Figure 25c). The introduction of these features poses a problem for the system's current tile detection algorithm, as the algorithm expects an image of one background layer as an input and uses a computed hash for each tile. Of the game consoles that I examined, the SNES presents the most challenges to the current implementation of the system. The SNES had several background modes, each with distinct features. One background mode of the SNES, Mode 7, is especially problematic, as it allows the use of an affine transformation matrix to provide a sense of perspective. Mode 7 allowed scaling and rotation to be changed for each scanline by changing the values in the transformation matrix during H-blank. Developers were particularly creative with this mode, often using it in unexpected ways. Figure 26 shows two examples where Mode 7 was used in ways that would present problems for the present image-processing-based approach to tile detection. Figure 26a presents a screenshot from a game in which the background (in Mode 7) was used to represent an enemy. This enemy was much larger than typical sprite-based enemies, and could be freely transformed through rotation and scaling. Screen elements that appear to be the "background," such as the gray bricks, were actually sprites rendered over the background. Figure 26b shows an example of Mode 7 being used to present a pseudo-3D perspective. In this example, two different views of the same background layer are presented in a split screen by changing the transformation per scanline.
(a) (b) Figure 26: Two SNES games using Mode 7 in an unexpected way. In (a), the large enemy in the air is actually a Mode 7 background, with support for scaling, translation, and rotation. Everything else on the screen is rendered as a hardware sprite, including the castle platform [26]. In (b), Mode 7 is used for both halves of the screen, with different transformation matrices applied per scanline [55].
Some games for the SNES made use of additional on-cartridge hardware. Examples of this hardware include the SuperFX and SuperFX2 chips. This hardware enabled the SNES to play games that used a limited number of 3D polygons, and provided additional 2D scaling and stretching effects. Figure 27 shows an example of a game that combines Mode 7 with hardware sprite rendering, along with 3D polygons overlaid onto the screen bitmap.
Detecting the background layers of these later game consoles can be computationally expensive, due to the requirement for transformation invariance in the detection algorithms. For this implementation of the system, image processing was shown to be a successful technique for identifying background elements. It is clear that this approach could work if implemented in an emulator for any of the other game consoles in Table 6, but at a significant increase in both development costs and per-frame computation. The modified emulator for any of these systems could be further altered to allow each layer to be independently accessed as either an indexed tile map or a bitmap image. Layer-specific information, such as scroll values, ordering, scaling, and rotation, can be read directly from the emulated hardware registers. Games that use co-processors to perform graphics work would require additional modifications to the emulator to extract the relevant data [56]. For SNES games that use an on-cartridge co-processor, such as the SuperFX, additional changes may need to be made to the emulator source code. Emulators generally provide an implementation of these co-processors, and so it should be possible to extract information from them, much as the current implementation of the system does for sprites in the OAM. For relatively simple hardware, such as the NES, image processing is slower than extracting the state information, but it still achieves acceptable performance and accuracy in the most common scenarios. For more complex hardware, this research clearly demonstrates that an image processing approach has significant disadvantages when compared to using virtual hardware state information.

Event Detection Results
The current implementation of the system detects appearance and disappearance triggers. The system maintains a boolean flag for each object that the upgrader configures as a trigger. The object can be a sprite or a tile. In each frame, the detection of any object that is in the trigger list sets the corresponding flag to true. The absence of an object that was previously known to be on-screen causes its flag to be reset.
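A sketch of this per-frame flag update is shown below. The container types and the event dispatch hook are assumptions:

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    // Per-object trigger state, updated once per frame. An appearance trigger
    // fires on a false -> true transition; a disappearance trigger fires on a
    // true -> false transition.
    struct TriggerState {
        bool onScreen = false;
    };

    void UpdateTriggers(std::unordered_map<uint32_t, TriggerState>& triggers,
                        const std::unordered_set<uint32_t>& detectedThisFrame)
    {
        for (auto& [label, state] : triggers) {
            bool nowVisible = detectedThisFrame.count(label) != 0;
            if (nowVisible && !state.onScreen) {
                // Appearance trigger: dispatch the configured events for 'label'.
                // FireEvents(label, Appearance);  // hypothetical dispatch hook
            } else if (!nowVisible && state.onScreen) {
                // Disappearance trigger.
                // FireEvents(label, Disappearance);
            }
            state.onScreen = nowVisible;
        }
    }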
Detection of these event triggers occurs after all recognition has been performed, regardless of whether the object detection results were created by using the emulated state information or through image processing. Therefore, the specific method used does not impact the accuracy of the event detection, assuming that objects were detected with 100% accuracy.
There were no errors in the detection of either appearance or disappearance triggers.
The difficulty comes in the form of determining what labeling values to use for triggers.
Some graphical elements that may be suitable for use as event triggers sometimes occur in unexpected ways, as developers would often reuse parts of these graphics for other visual elements. This will vary from one game to the next, and so it requires some trial and error from the upgrader to ensure that good triggers are chosen.
An image-processing implementation of this feature in the emulators for later-generation game consoles may be problematic due to the reduced guarantees of accuracy when labeling objects. This form of error would result in the unexpected presence or absence of an event. Using the emulator state information eliminates this possibility and ensures consistent event detection results.

Scene Recognition Results
Scene recognition has been implemented by way of querying the current scene against a list of known scenes. As with the event detection system, scene recognition occurs after all object detection has been performed, regardless of the method used. Likewise, the choice of method will have no effect on the performance of the scene recognition system assuming that the query scene is an accurate representation of what is actually displayed on-screen.
This guarantee cannot be reliably made if an image processing approach to object detection is used, particularly on game consoles that can apply a variety of transformations to the background graphics layer. Using the emulator's hardware state information provides the guarantee required for this feature to work reliably every time.
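The following C++ sketch illustrates this query process under simplifying assumptions: scenes are represented as vectors of packed per-tile values compared position by position, and the KnownScene field names are illustrative rather than the system's actual data layout.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // A tagged scene loaded from the upgrader's configuration.
    struct KnownScene {
        std::string tag;                   // e.g., "InTheCave"
        std::vector<std::uint32_t> tiles;  // packed per-tile values
        double threshold;                  // minimum fraction of matching tiles
    };

    // Fraction of tile positions that match between query and reference.
    double similarity(const std::vector<std::uint32_t>& query,
                      const std::vector<std::uint32_t>& ref) {
        std::size_t n = std::min(query.size(), ref.size()), hits = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (query[i] == ref[i]) ++hits;
        return n ? double(hits) / double(n) : 0.0;
    }

    // Return the first known scene the query matches, or nullptr if none.
    const KnownScene* matchScene(const std::vector<std::uint32_t>& query,
                                 const std::vector<KnownScene>& known) {
        for (const auto& scene : known)
            if (similarity(query, scene.tiles) >= scene.threshold)
                return &scene;
        return nullptr;
    }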
Fortunately, the hardware of the NES is simple enough to allow the system to successfully use an image processing approach. Several scenes were marked using the user interface. Similarity thresholds were specified in the configuration data for each scene, and the system correctly detected the scene events when presented with matching scenes during gameplay. This implementation uses scene recognition to replace background graphics, add graphical overlays, replace background music, and add new music and sound effects where none were previously used.

Figure 30: (a) Original first playable game screen. This is the first time in which the player has control of the protagonist character. (b) Background layer graphics have been rendered using the image processing system for tile detection. The scene recognition engine has been used to start the game music and provide subtle graphical overlays above the sprite layer.

Figure 31: This screen makes extensive use of the scene recognition system. Upon scene detection, the system adds music and a looping fire sound effect (the original game had neither of these), replaces all background graphics with a pre-rendered background, and plays a one-time sound effect of a man speaking the words that appear in the original game as text. In addition, after the system has rendered the sprite layer, scene recognition is used to provide a fog overlay and a world overlay, allowing a new sense of depth to the cave.
Figure 32: (a), (b) In this screen, scene recognition has been used to provide a world overlay (the tops of the trees), allowing the player character to walk behind the tree tops. The scene recognition system also detects the presence of the rock-throwing enemies and plays appropriate sound effects.

In other scenes, the scene recognition engine is used to provide graphical replacement of much of the background layer and a graphical world overlay, with appropriate sound effects provided in response to on-screen action. One such scene has been configured to change the music to the overworld theme if it is not already playing. Elsewhere, scene recognition performs full replacement of the background graphics layer while providing a graphical overlay and fog, and changes the music and looping sound effects in response to the change in environment.

Discussion Summary
This implementation of the system has proven to be an effective method of enhancing the audio-visual content of games that are running in an emulator for a game console that utilizes a hardware sprite engine. Further, it has shown that using the virtual hardware state information inside the emulator is an efficient and reliable means of performing object detection within the system and that this method has significant advantages over image processing algorithms that perform the same task.

CHAPTER 8
Conclusions and Future Work

Summary
For over twenty years, emulators have been used to play video games on hardware other than that for which they were originally designed. Today, even mobile devices such as cell phones are equipped with far greater capabilities for audio and video playback than what was available on the original game console hardware. The development of network-based services has provided game publishers with an avenue for republishing older game titles as digital downloads for use on an emulator. These emulators are limited in the amount of enhancement that can be applied to the game, and are generally restricted to upscaling filters. Previously, game developers would have to rewrite a game for a new game console to take advantage of increases in graphics technology. This thesis has suggested an alternative approach that increases the graphics and sound quality of these games without requiring an expensive rewrite of the game code.
Prior work on emulator-based game enhancement has relied on interpolation of pixel data to produce upscaled raster images or vector graphics. The algorithms used to obtain these earlier results did not use any form of object detection and processed the original video image as an atomic unit. Decomposing the image into several distinctly labeled parts can aid enhancement by allowing specific portions of the image to be replaced with substitute images. Finding a meaningful subdivision of the original image requires object detection, and multi-class object detection becomes computationally inefficient when requirements for invariance against several transformations are introduced.
In this research, the system takes a novel approach to the problem by relying on an emulator that has been specifically modified to allow access to portions of the virtual hardware's state information. Access to the state information provides all of the benefits of object detection (decomposing the original image into several independent units) without any of the disadvantages (high computational cost and no guarantee of 100% accuracy in either labeling or localization). The results of this research show that this system is superior to previous methods that depended on processing full video frames.

Suggestions for Future Work
Now that a working implementation of this system has been developed, there are a number of avenues for further development. This includes additional work on the modified FCEUX emulator implementation, as well as implementing the system in emulators for other game consoles.

Further Development on FCEUX
This implementation was tested using a single commercial game. Approximately eight hundred games were released for the NES. While many of these games do nothing out of the ordinary, several used various hacks and undocumented features of the NES hardware. Many games also included additional hardware inside the game cartridge itself.
NES emulation is rather mature at this point, and most of the tricks used by these games are well known. Even so, nearly two decades after the release of the last official title for the NES, some emulators still have difficulty emulating every corner case. It would therefore not be surprising if some of these games did not work well with the system as currently designed. It may be worthwhile to test the system with these games to determine what changes, if any, need to be made to accommodate their behavior.
The current implementation uses the virtual state information for the sprite layer and an image processing approach for the background layer. This decision was made only for the purpose of comparing the two methods; this research has shown that replacing the background layer code with code that uses the virtual state information would provide better performance and accuracy.
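A sketch of what such background extraction could look like for the NES, whose name tables each hold 32 by 30 tile indices. The readNameTableByte accessor is a hypothetical stand-in for direct access to the modified emulator's name table memory:

    #include <cstdint>
    #include <vector>

    // Provided by the modified emulator (hypothetical accessor).
    std::uint8_t readNameTableByte(int nameTable, int offset);

    // Enumerate background tiles straight from the emulated PPU's name
    // table, rather than detecting them in the rendered frame.
    std::vector<std::uint8_t> readBackgroundTiles(int nameTable) {
        const int kTilesWide = 32, kTilesHigh = 30;  // NES name table size
        std::vector<std::uint8_t> tiles(kTilesWide * kTilesHigh);
        for (int row = 0; row < kTilesHigh; ++row)
            for (int col = 0; col < kTilesWide; ++col)
                tiles[row * kTilesWide + col] =
                    readNameTableByte(nameTable, row * kTilesWide + col);
        return tiles;
    }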

Other Hardware Sprite Engine Emulators
The NES is the earliest of the hardware sprite-based game consoles that still enjoys widespread popularity. Later game consoles, such as the SNES, Game Boy, and Game Boy Advance, all use hardware sprite engines, so much of the design of the system can be applied to modify emulators for those platforms. As with the NES implementation, image processing should be abandoned in favor of using the virtual hardware state information. This change is especially important for these game consoles because they are the first generation to support multiple simultaneous background layers and increased numbers of sprites, all with varying amounts of transformation applied. These are precisely the cases where the existing object detection algorithms have been identified as potential performance bottlenecks.

Semi-Automated Image Retrieval
The system currently relies on the upgrader and the asset developer to work together to produce suitable digital image files for use by each game. This requires an effort proportional to the number of unique files that must be created for each game. However, many games feature graphics that look different on-screen, but represent similar real-world objects.
It may be possible to produce a digital art library of generic tile and sprite replacement graphics for use with an image retrieval system. In such a system, graphics extracted from the game are used as a query into the image retrieval system to find a list of candidate replacement images. As an example, many games have levels that feature various square blocks for the player to walk on. These blocks may look like low-resolution versions of bricks, stones, or other materials. A 'red brick block' search would retrieve a list of images from the library that may be suitable replacements for those tiles.
The large set of square images assembled by Torralba, Fergus, and Freeman [52] would make a good starting point. There are approximately eighty million square images in the database, each labeled with exactly one English noun. During enhancement, the database can be queried by noun for each tile. Results can then be filtered by image similarity, with the final choice made by the upgrader or asset developer.
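One possible shape for such a retrieval query, sketched in C++. The ImageLibrary type, queryByNoun, and visualSimilarity are all assumptions for illustration; no particular image database API is implied:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Image { /* pixel data elided */ };

    // Hypothetical library interface and similarity metric.
    struct ImageLibrary {
        std::vector<Image> queryByNoun(const std::string& noun) const;
    };
    double visualSimilarity(const Image& a, const Image& b);

    struct Candidate {
        Image  image;
        double score;  // visual similarity to the original tile
    };

    // Query the labeled library by noun, then rank results by visual
    // similarity; the upgrader or asset developer makes the final choice.
    std::vector<Candidate> findReplacements(const Image& originalTile,
                                            const std::string& noun,
                                            const ImageLibrary& library) {
        std::vector<Candidate> results;
        for (const Image& img : library.queryByNoun(noun))  // e.g., "brick"
            results.push_back({img, visualSimilarity(originalTile, img)});
        std::sort(results.begin(), results.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.score > b.score;
                  });
        return results;
    }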

Non-Raster Image Replacement
Prior to this research, the last major development in enhancing emulated game graphics was the Kopf-Lischinski algorithm for converting pixel art to vector graphics. One of the goals of this research was to overcome some of the limitations encountered by Kopf-Lischinski, such as high computational complexity and an inability to separate foreground and background objects.
This system is free from those limitations. At the present time, the system only considers raster bitmap images for replacing on-screen game elements. This decision was made primarily for convenience, as it was considerably easier to work with graphics in this format. With the system in place, it is now possible to take a further step to perform Kopf-Lischinski vectorization at design-time using isolated graphical elements, allowing for the use of vector graphics at runtime without any of the penalties that would be imposed by previous implementations.
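A sketch of this proposed design-time pass. Both vectorizeKopfLischinski and saveSvg are hypothetical placeholders for a vectorizer and an SVG writer; the point is only that vectorization happens once, offline, per isolated graphic:

    #include <string>
    #include <vector>

    struct Image { /* raster pixel data elided */ };
    struct VectorArt { /* vector path data elided */ };

    // Assumptions for this sketch: a vectorizer in the style of
    // Kopf-Lischinski and an SVG writer.
    VectorArt vectorizeKopfLischinski(const Image& pixelArt);
    void saveSvg(const VectorArt& art, const std::string& path);

    // Design-time pass: vectorize each isolated graphic once, so the
    // runtime can draw cached vector art with no per-frame vectorization.
    void vectorizeAssets(const std::vector<Image>& isolatedGraphics,
                         const std::string& outputDir) {
        for (std::size_t i = 0; i < isolatedGraphics.size(); ++i) {
            VectorArt art = vectorizeKopfLischinski(isolatedGraphics[i]);
            saveSvg(art, outputDir + "/element_" + std::to_string(i) + ".svg");
        }
    }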