Hybrid Optical Acoustic Seafloor Mapping

Abstract The oceanographic research and industrial communities have a persistent demand for detailed three dimensional sea floor maps which convey both shape and texture. Such data products are used for archeology, geology, ship inspection, biology, and habitat classification. There are a variety of sensing modalities and processing techniques available to produce these maps and each have their own potential benefits and related challenges. Multibeam sonar and stereo vision are such two sensors with complementary strengths making them ideally suited for data fusion. Data fusion approaches however, have seen only limited application to underwater mapping and there are no established methods for creating hybrid, 3D reconstructions from two underwater sensing modalities. This thesis develops a processing pipeline to synthesize hybrid maps from multi-modal survey data. It is helpful to think of this processing pipeline as having two distinct phases: Navigation Refinement and Map Construction. This thesis extends existing work in underwater navigation refinement by incorporating methods which increase measurement consistency between both multibeam and camera. The result is a self consistent 3D point cloud comprised of camera and multibeam measurements. In map construction phase, a subset of the multi-modal point cloud retaining the best characteristics of each sensor is selected to be part of the final map. To quantify the desired traits of of a map several characteristics of a useful map are distilled into specific criteria. The different ways that hybrid maps can address these criteria provides justification for producing them as an alternative to current methodologies. The processing pipeline implements multi-modal data fusion and outlier rejection with emphasis on different aspects of map fidelity. The resulting point cloud is evaluated in terms of how well it addresses the map criteria. The final hybrid maps retain the strengths of both sensors and show significant improvement over the single modality maps and naively assembled multi-modalThe oceanographic research and industrial communities have a persistent demand for detailed three dimensional sea floor maps which convey both shape and texture. Such data products are used for archeology, geology, ship inspection, biology, and habitat classification. There are a variety of sensing modalities and processing techniques available to produce these maps and each have their own potential benefits and related challenges. Multibeam sonar and stereo vision are such two sensors with complementary strengths making them ideally suited for data fusion. Data fusion approaches however, have seen only limited application to underwater mapping and there are no established methods for creating hybrid, 3D reconstructions from two underwater sensing modalities. This thesis develops a processing pipeline to synthesize hybrid maps from multi-modal survey data. It is helpful to think of this processing pipeline as having two distinct phases: Navigation Refinement and Map Construction. This thesis extends existing work in underwater navigation refinement by incorporating methods which increase measurement consistency between both multibeam and camera. The result is a self consistent 3D point cloud comprised of camera and multibeam measurements. In map construction phase, a subset of the multi-modal point cloud retaining the best characteristics of each sensor is selected to be part of the final map. To quantify the desired traits of of a map several characteristics of a useful map are distilled into specific criteria. The different ways that hybrid maps can address these criteria provides justification for producing them as an alternative to current methodologies. The processing pipeline implements multi-modal data fusion and outlier rejection with emphasis on different aspects of map fidelity. The resulting point cloud is evaluated in terms of how well it addresses the map criteria. The final hybrid maps retain the strengths of both sensors and show significant improvement over the single modality maps and naively assembled multi-modal

processing techniques available to produce these maps and each have their own potential benefits and related challenges. Multibeam sonar and stereo vision are such two sensors with complementary strengths making them ideally suited for data fusion. Data fusion approaches however, have seen only limited application to underwater mapping and there are no established methods for creating hybrid, 3D reconstructions from two underwater sensing modalities. This thesis develops a processing pipeline to synthesize hybrid maps from multi-modal survey data. It is helpful to think of this processing pipeline as having two distinct phases: Navigation Refinement and Map Construction. This thesis extends existing work in underwater navigation refinement by incorporating methods which increase measurement consistency between both multibeam and camera. The result is a self consistent 3D point cloud comprised of camera and multibeam measurements. In map construction phase, a subset of the multi-modal point cloud retaining the best characteristics of each sensor is selected to be part of the final map. To quantify the desired traits of of a map several characteristics of a useful map are distilled into specific criteria. The different ways that hybrid maps can address these criteria provides justification for producing them as an alternative to current methodologies. The processing pipeline implements multi-modal data fusion and outlier rejection with emphasis on different aspects of map fidelity. The resulting point cloud is evaluated in terms of how well it addresses the map criteria.
The final hybrid maps retain the strengths of both sensors and show significant improvement over the single modality maps and naively assembled multi-modal maps.

ACKNOWLEDGMENTS
The document that follows is the last piece of my grad school puzzle and I want to thank everyone who helped me put it together. We went some amazing places together, had some great laughs, broke things, fixed things, worked hard, played hard, learned more than we thought we could and generally had an incredible time.
Without such great freinds and colleagues this research could not have taken shape, not to mention reached its final form.
I have to thank my advisor Chris Roman for many things but first and foremost, for taking a chance on me. He provided me with so many opportunities to learn and get involved in amazing projects in exciting parts of the world. His Kat helped me see the way through those first few cruises and is still a great friend and inspiration. Katy's good judgement and cool head kept things going smoothly and her amazing, upredictable sense of humor kept it light. I also don't think iv I have ever beat her at backgammon. As the backbone of the operation Bob's enthusiasm for our mission has been contagious and has stuck with me through the years.
My amazing labmates ... there is no question that this thesis was possible in large part due to their inspiration, moral support and incredible skills. Without Ian's amazing software skills, and patient teaching, this project may well have taken years longer. It was great fun going to sea with Dave and I am pretty sure he can fix anything, well except one thing. Try jelly donuts next time, Dave.
Clara was always there to help put things in perspective and talk about the bigger picture, or bikes. Bryan made sure I wasn't the only one walking into seminar 5 minutes late covered in mud and still wearing my mountain bike shoes, and still he managed to set the work-ethic bar higher than I ever thought possible.
There are so many other people that made my grad school life a memorable time. John was always up for every adventure, Jeannie's enthusiasm and sense of humor were balm for any crisis. I have never met more supportive women than Leigh, Kelly and Kelsey; I couldn't have asked for better people to have on my side down the home stretch.
Lastly I want to thank the friends and family far from Rhode Island who were supportive all through this project. Thanks especially to Allison and Natalie for sticking by me. I certainly owe my grandmother a nod: she did her PhD first, even when no other women did things like that. All my love and thanks goes my parents for taking midnight calls when I was sure I was stuck and everything was broken. They always knew I would get it working again and they always had my back. And Russ... Russ, thank you for trusting me and believing in me and being part of all the adventures of the last few years.    The oceanographic research and industrial communities have a persistent need for high resolution three dimensional sea floor maps which convey both shape and texture. Specialized sea floor mapping techniques are used for marine archeology [1][2][3][4], marine geology [5][6][7], ship inspection [8,9], and ecological monitoring [10][11][12][13].
Stereo cameras and multibeam sonars are among the instruments which can be used to make these maps. Both have their respective advantages and drawbacks. In the land robotics community, it is common to use complimentary mapping sensors and combine their measurements using data fusion techniques [14,15]. This data fusion approach has seen only limited application to underwater mapping [16][17][18], and there are no established methods for creating hybrid, 3D reconstructions from two underwater sensing modalities. The goal of this thesis is to develop a method that integrates multibeam sonar and stereo vision data into a common navigation and mapping system to create hybrid maps. These hybrid maps serve the purpose of reducing multi-modality mapping data to an easily interpreted form while preserving as much detail as possible.

Background
Maps of marine environments provide vital insight for scientists and engineers.
The last couple of decades have seen a boom in technologies which provide means to create increasingly accurate and detailed maps. These methods are based on a variety of sensing platforms including ships, tow-sleds, divers and increasingly underwater robots. A good overview of modern mapping platforms is given in [19].
Robotic platforms are of particular interest because they are able to make detailed observations of the underwater environment using a variety of sensors.
1 Figure 1. Remotely operated vehicle Hercules. Images of Hercules being deployed using a crane (left) and surveying a shipwreck (right).
They can travel to areas too deep, hot, or otherwise extreme for human divers and carry out more detailed and precise measurements than ship based platforms.
They can acquire optical imagery of the sea floor which can't obtained from the surface due high rate of light attenuation in water.
These robots can be either Autonomous Underwater Vehicles (AUVs), which typically execute pre-programmed missions themselves, or ROVs which are controlled directly by a user throughout the course of the mission (Fig. 1). Mapping specific robots carry a suite of navigation sensors which measure depth, attitude, speed and relative position, as well as mapping sensors which make acoustic or optical range and backscatter measurements. They can also have sophisticated positioning and control systems in order to hold position or carry out structured surveys. Sites of interest can be traversed according to specific instructions to guarantee a certain amount of coverage (Fig. 2).
Robotic platforms are often equipped with several mapping modalities. This lends flexibility to the mapping system since each modality has its own strengths.
The application of these sensors to navigation and mapping have been researched thoroughly by researchers in land robotics, computer vision, and photogrammetry.
They are also well researched and understood in the underwater environment. Figure 2. Survey tracklines. A typical underwater vehicle survey follows a back and forth "mowing the lawn" with trackline spacing that provides overlapping coverage for mapping sensors. The figure shows the tracklines in red, overlapping image footprints in red and a multibeam sonar map as the underlay. Notice the final trackline running orthogonal to the rest of the survey. This "loop closure", ensures that some terrain will be observed more than once in order to constrain drift in navigation.
Generally, a single survey will be executed using one instrument. Fusion of data from these sensors is just beginning to be explored in the underwater context. Kunz developed a system for navigation refinement using both sonar and camera [17]. Hurtos addressed the issue of finding the offsets between a camera system and a multibeam sonar which is critical to fusing their data [16]. To date however, there is no established method for created hybrid maps comprised of 3D structure from fused stereo and multibeam sonar range data.
This thesis develops a processing pipeline to synthesize hybrid maps from multi-modal survey data. It is helpful to think of this processing pipeline as having two distinct phases. The first is navigation refinement, where data from navigation sensors are corrected to enforce consistency between mapping measurements. The second phase is map construction where a selected subset of mapping data is projected into a common coordinate frame to construct a map. 3

Underwater Navigation
Sensor data is assembled into maps using robot position data. The most basic requirement of navigation data is that it is at least as good as the mapping sensors so that it does not become the dominant source of error in the map. Therefore, a lot of research has gone into improving underwater vehicle navigation so that it keeps pace with the improvements in mapping sensor resolution.

Dead Reckoning
Vehicle navigation data is acquired using a number of sensors to measure attitude, depth and velocity. The vector velocity can be is integrated over time to compute the distance traveled by the vehicle relative to its starting point. This relative position measurement can be combined with attitude and depth measurements can be used to form a 6 Degrees of Freedom (DOF) "dead reckoned" position estimate for the vehicle at any time. Because the random error in the measurements is also integrated over time, it accumulates and causes the trajectory estimate to drift from the true trajectory. Over the course of a survey, this type of error begins to dominate mapping sensor error. When the navigation data places the vehicle in an incorrect location, mapping measurements are also incorrectly localized. If the localization error is greater than the sensor precision, the measurements will appear misaligned.

Direct Measurement
Absolute position measurements such as GPS position are not available. So other methods have been developed to directly measure robot position and constrain navigation drift underwater. Long Baseline (LBL) acoustic systems are the underwater analog to GPS. A number of acoustic beacons are placed around a survey site. The vehicle is able to range to the beacons and determine its absolute position on the site. This approach has been used for underwater mapping in the detection of hydrothermal vent plumes as well as microbathymetric mapping of an ancient shipwreck [20][21][22]. While high frequency LBL systems can localize with   centimeter level accuracy without drift, the systems are expensive, cumbersome to   deploy and not practical for all survey locations. A less cumbersome alternative to LBL systems is Ultra Short Baseline (USBL) [23]. This method computes a position using range and bearing measurements from the ship to vehicle. While there is no drift, this method is susceptible to noise an typically offers accuracies between 0.1% and 1.5% of water depth.  [24,25]. Eustice and Mahon both filter the navigation data using a pose based information filter and visual data to constrain 5 the robot pose [26,27]. Barkby uses the principle of particle filtering to estimate robot pose and assemble a multibeam sonar based map [28,29].
The SLAM examples mentioned above fuse data from multiple navigation sensors but only one mapping sensor to arrive at a final refined navigation solution which produces a self consistent map. In order to create a trajectory that maximizes consistency from two mapping sensors, both types of measurements must be used in the constraint network. Several examples of this have been investigated.
Fallon fuses sidescan sonar and acoustic ranging within common navigation constraint network [30]. Hover et al and Kunz both solve for vehicle trajectories using constraints from multibeam sonar and monocular cameras [9,17]. This concept is vital in aligning 3D structure from multibeam sonar and stereo cameras to create the proposed hybrid maps.

Mapping
The purpose of a map is to provide a meaningful reduction of the survey data for the end user. A scientist is interested in both structure and texture of the scene and makes interpretations based on colors, shapes, sizes, positions, so the map should be as metrically accurate as possible and easily texture mapped. In general, maps are composed of scene measurements projected into a common spatial frame of reference. There are two basic ways of producing maps for our purposes.
The first is to optimize the map in SLAM or Structure from Motion (SFM) [31].
This means that navigation and mapping are simultaneously refined and the map is comprised of strictly the measurements used for navigation refinement. The second is to refine the vehicle poses using SLAM and then use the poses only to project environmental measurements into the common reference frame, this is known as mapping with known poses [32]. For the purposes of multi-modal mapping, the latter technique allows more flexibility to use instruments which don't lend themselves to probabilistic mapping easily and to create a map where two sensors can mutually reinforce each other [15]. 6

Probabilistic Mapping
A straight forward approach to SLAM based mapping is to simply use SLAM solution's map. This map can be an occupancy grid where the map is a grid and cells are populated with the probability that something exists there [32] such as a hydrothermal vent plume [33]. Or it can be a sparse set of landmarks such as trees with compact descriptors [34]. These maps contain well localized information, but are highly abstracted. The level of abstraction is ideal for a autonomous mission planning but can be too abstract for detailed scientific inquiry.
A more informative map should be comprised of a 3D mesh or 2.5D gridded height map to convey the structure and a photo-realistic overlay to supply textural information. Barkby creates a probabilistic height map from multibeam sonar data alone using a non feature based particle filter [35]. SFM [31,36,37] approaches maintain a large enough number of sparse features that a detailed 3D mesh which can be easily texture mapped is a direct result. However, not all types of mapping data are suited to feature-based estimation frameworks. In particular sonar range data is more suited for pose based SLAM techniques. To ensure mutual alignment of the camera and the multibeam measurements, they must be incorporated into the same navigation refinement framework for estimating vehicle poses. This requires a navigation refinement system which is flexible enough to incorporate the feature based stereo constraints and pose based multibeam constraints. After the poses have been estimated, then it is possible to construct the map from the known poses.

Mapping with Known Poses
Mapping with known poses gives particular control over the map characteristics such as point density and blending techniques since the map is created independently of the navigation solution. When mapping with known poses, the map making process is distinct from the robot pose estimation process. There are a few instances of this technique being used in underwater mapping where the primary goal is photo-realistic scene models. Johnson-Roberson creates photorealistic scene models with high point density, using the poses found during a view-based SLAM solution [38,39]. [40] estimates the camera trajectory and then computes a dense stereo correspondence map to create a dense scene reconstruction. to the differing indices of refraction between air and water. This modeling error is particularly apparent in highly structured scenes or at ranges that the calibration wasn't intended for [41]. This introduces warping to the final reconstruction.
While turbidity and calibration are not a problem for the multibeam, the acoustic data can suffer from speckle noise, as well as artifacts introduced by the transducer geometry. These errors obscure fine details in the final map or cause holes in the mesh.
Using these two complimentary sensors together will lead to a more flexible survey apparatus, able to produce maps the highest possible quality available under any given set of conditions by leveraging each sensors respective strengths.

Contributions
The contributions of this project will be the following: • A multi-modal navigation framework for simultaneously refining camera and multibeam poses throughout the survey using SLAM which emphasizes alignment between the two sensors.
• A mapping methodology which selects the best sensor data for each map location from a redundant data set while respecting the inherent characteristics of each sensor. 9

Assumptions
A number of assumptions are necessary to process the data used in this thesis.
• The approach utilizes navigation data that is adequate to constrain vehicle motion. Navigation data comes from a suite of on-board sensors which provide information regarding vehicle depth, attitude and velocity. This is enough to constrain the six degrees of freedom vehicle motion. Moreover, these or analogous sensors are present on nearly all underwater vehicles since they are required for basic functionality. Therefore, it is reasonable to assume that the navigation data for the surveys presented here and the vast majority of underwater surveys will be adequate to constrain vehicle motion.
• A constant velocity model is adequate to predict vehicle motion between navigation sensor measurements. The constant velocity assumption has been used previously with good results for filtering underwater vehicle motion [42].
This is because sea floor survey type missions are executed slowly and without abrupt changes to vehicle speed or attitude which might disrupt the mapping data quality. Additionally, vehicles capable of imaging surveys are generally passively stable in pitch and roll, making the platform unlikely to violate the constant velocity assumption by abrupt attitude changes. Measurements are also made very frequently so relative motion between measurements is small.
• Estimates of stereo camera calibration and sensor offsets are available. Camera calibrations must be done in order to obtain any metric information from a camera and can be obtained using a variety of methods before the sensors are taken into the field. The sensor offsets are straightforward to obtain by hand measuring relative to the vehicle pose.
• An ideal pinhole camera model is valid over a given survey. While standard cameras don't precisely conform to a pinhole model underwater, radial distortion parameters computed during camera calibration closely approximate the effect. This assumption holds approximately if the camera is positioned close to the viewport glass of its housing and the camera altitude stays relatively close to its calibrated altitude [41]. The cameras used for this survey were mounted near the viewport glass with the former constraint in mind. The latter constraint is addressed by the nature of underwater optical surveys.
Surveys typically have a constant moderate altitude since high altitudes preclude imaging due to rapid light attenuation and low altitudes make only a small amount of terrain observable at once. A small amount of error may result from this assumption, but the advantages gained in efficiency by leveraging pinhole camera geometry and constraints are substantial enough to outweigh it.
• Overlap exists for some of the mapping sensor measurements to provide constraints on vehicle pose and sensor offset estimation. Surveys can generally be designed to incorporate as much overlap as desired. The survey design trades off available time with size of area covered and instrument field of view. Sometimes a survey mission will be aborted before a final loop can be closed, but this can be mitigated by building overlap into the body of the survey and not just relying on a single overlapping trackline at the end.

Layout
The process of map making in this thesis has two parts. The first part is navigation refinement where navigation and mapping sensor data are combined to  The Extended Kalman Filter (EKF) has been a common tool for navigation refinement since Smith Self and Cheeseman advocated its use for building probabilistic maps in the 1980's [1]. This filtering approach was applied to underwater mapping by Roman to assemble multibeam bathymetric maps [2]. The key aspect to Roman's implementation is assembling adjacent sonar pings into submaps in which navigation drift contributes less error than sensor resolution, and can therefore be neglected. Navigation data and uncertainties are accumulated in the EKF which augments the filtered vehicle state vector with delayed states corresponding to locations of the submap origins. Links between overlapping submaps are made when the structure of the two submaps can be registered. Relative poses between submap origins added to the filter as additional measurements between the delayed states to produce a well constrained vehicle trajectory that corresponds to a self consistent map. The utility of this method is limited however by its O(n 3 ) complexity where n is the size of state space. As a result it is impractical for refining trajectories with many unknown submap origins or image poses. [2].
The Sparse Extended Information Filter (SEIF) has been used as an alternative to the EKF because it scales well in state space. The information matrix of the filter is maintained instead of the covariance matrix so that the update step does not require an O(n 3 ) inversion. Additionally, the information matrix is exactly sparse when the variables to be estimated consist of prior poses alone, a characteristic which can be exploited for efficient state recovery with O(n) complexity [3].
This method has been applied to underwater mapping with both monocular and stereo vision by Eustice [4] and Mahon [5] respectively.
Filtering leaves a few issues unresolved. EKFs and SEIFs estimate the current robot pose by applying a recursive filter to the previous pose, current measurements and control inputs. Because of this sensor updates only propagate forward so at updates with large measurement innovations the trajectory can become less smooth than might naturally be expected. Furthermore, linearization error can accumulate over the course of a trajectory.

Smoothing versus filtering
Recently, another approach known as Smoothing and Mapping (SAM) has been applied to address the lingering problems associated with filtering. In smoothing, the robot trajectory is not marginalized out and inference is done on the entire trajectory [6]. Since the full non-linear problem can be solved over the entire trajectory, error is evenly distributed around the graph. This produces a trajectory that is consistent with all of the constraints. It also produces smoother trajectories than filtering methods leading to more appealing maps [7].
Smoothing treats the SLAM problem as a large non-linear system which is which are a function of measurements such as range and bearing to a landmark or vehicle velocity. This problem can be posed as as a factor graph [6]. This graphical model is an intuitive way to look at the system, breaking it down into variable nodes (variables to be estimated) and factor nodes (measurement functions) (Fig. 3).
The goal is to find the Maxiumum a Posteriori (MAP) estimate for the unknowns.
Ultimately the graph or non-linear system can be solved using a variety of inference methods. For most practical situations, the sparsity of the underlying structure allows for a solution using sparse matrix techniques which are highly efficient. bine an imaging sonar and monocular camera constraints incrementally within a factor graph to navigate and build a map during ship hull inspection [8]. Kunz uses multibeam sonar and stereo cameras on an AUV to build a map of a coral reef and refine sensor offsets for biological monitoring [9]. The results in both cases are robot trajectories which obey constraints imposed by mapping sensors and navigation sensors and maps which appear self-consistent..
A recent underwater mapping method combines both filtering and smoothing techniques. The EKF is useful for creating submaps for data association and map assembly, but scaling limitations make it impractical for refining long trajectories [10]. To mitigate the scaling issue, Vaughn leverages EKF assembled submaps for data association, but uses submap origins as nodes in a factor graph. The factor graph is then solved using the iSAM software package. This avoids scaling issues and efficiently improves navigation for map making [7].

Methods
The following navigation refinement methods are designed to estimate vehicle poses using data from both the on-board navigation and mapping sensors. They extend the state of the art in navigation refinement to enforce consistency between multiple modalities. The estimated poses will be used in the mapping phase (Chapter 3) to project measurements into a common reference frame and assemble a map from two sensors. Camera data is incorporated by aligning overlapping sets of images. For multibeam, it is common to aggregate sets of pings into submaps which can be aligned using point cloud registration techniques.
The process detailed in the following section is founded on the work of Kunz [9].
That procedure begins by aligning overlapping sets of images and incorporating them into a bundle adjustment style navigation solution where consistency is enforced using multiple views of the same landmark [11]. The resulting navigation solution is used to assemble multibeam submaps and establish links between those that overlap. A final navigation solution is then estimated using all of the available constraints, both camera and multibeam. This thesis adds a new step where overlapping camera and multibeam submaps are co-registered to refine their relative pose and enforce mutual consistency. Additionally, this approach is able to estimate the offsets of the sensors with respect to the vehicle frame as part of the navigation solution.
The process is outlined in Figure 4. This figure explicitly breaks the process 20 into two phases. The first phase refines camera offset and vehicle navigation.
From this phase, multibeam constraints can be computed and further refinement of navigation data along with multibeam sensor offset. The specifics elements of this chart will be further explained in the following sections.

Instrumentation and platform
The data for this work was gathered during surveys using the ROV Hercules    while the ROV maintains a minimum safe survey altitude of 2m. This allowes us to take advantage of the greater resolution available at reduced range.
Images were taken using a rigid stereo rig fitted with two Prosilica GC1380 cameras. These were mounted within pressure housings with flat glass viewports 300mm apart. Their optical axes are parallel. The color and black and white images were acquired as 12 bit grayscale and 48 bit Bayer respectively by 1024 by 1360 pixel CCDs.
The lighting was supplied using two Ocean Imaging Systems model M3831 flashbulbs hardware triggered off the master camera. These were mounted on the forward half of the vehicle to minimize the common volume of water imaged by the cameras and strobes. The maximum framerate that could be achieved was limited by the strobe recharge time to about 0.125Hz which translates to 1.25 frames per meter of travel along track.
Hercules's navigation instrumentation includes an Ixsea Octans fiber optic gyro, a Paro Scientific depth sensor, and a Teledyne RDI doppler velocity log.
The navigation data has several applications. The measurements are processed in real-time to drive the vehicle's autopilot allowing precise survey patterns. It is also visualized and logged using DVLNav software [12] for use in post processing.

Notation
It is helpful to specify a notation system for coordinate system transforms which indicate how constraints are integrated into the factor graph. This notation helps to articulate the spatial relationships formed by the network of constraints.

Coordinate systems
There are several relevant coordinate reference frames which will be referred to frequently (Fig. 6). The local level coordinate system, ℓ is an absolute frame.
Its origin is in one fixed location. For convenience the origin of ℓ is assigned as the The odometry between vehicle poses at time i and j is written as x i,j which is a transform that can be computed using the operations described in the following section. Note that all frames have the z axis pointing down and vary based on the orientation of x and y

Spatial relationships between coordinate systems
A robot or sensor's position and attitude with respect to any coordinate system can be described in terms of the spatial variables x, y, z, θ, φ, ψ. The first three are translational variables and the last three are the attitude variables indicating roll, pitch and yaw respectively.
Operations on these variables allow a given pose to be expressed in other coordinate frame. The notation adopted for coordinate transforms is fully explained in Smith Self and Cheeseman [1]. However, the relevant transforms are summarized here. The compounding operation takes two relationships x i,j and x j,k and lays them head to tail to arrive at the compound relationship x i,k It is known as the head-to-tail operation and is expressed as ⊕. The inverse relationship is useful as well. This might be used to reverse a spatial relationship that has been applied and is expressed as ⊖. A composite relationship known as tail-to-tail is useful for finding the relative pose between two forward relationships. The tail-to-tail is expressed as x j,k = ⊖x i,j ⊕ x i,k . These operations offer a way to express the changes in spatial relationships between coordinate systems which occur due to 25 vehicle motion and sensor measurements.

Factor graph assembly and structure
A factor graph is a graphical model which expresses a large function in terms of its factors. It is intuitive to look at and can be solved using a variety of Bayesian inference methods. The goal of the navigation refinement using factor graphs is to determine the position of the mapping sensors at the time of measurement.
A factor graph is used to structure the network of constraints from which poses will be inferred. One group of constraints consists of the dead reckoned navigation between these poses. Images can be abstracted into features and aligned to provide additional visual constraints. Multibeam pings can be assembled into submaps and aligned with each other to provide further constraints. Finally, alignments between stereo pair reconstruction and multibeam submaps enforce alignment between the two modalities using a third type of constraint..
The factor graph is assembled and solved in two phases (Fig. 4). In Phase I the feature based links and navigation links are used to solve for the vehicle positions and the offset of the cameras. Then the offset of the camera is held fixed and a graph containing the feature based constraints between cameras, multibeam constraints, and cross modality constraints is solved to find the vehicle poses and multibeam offset.

Computing navigation constraints between sequential mapping sensor measurements
The navigation data from the depth, attitude and velocity sensors provides constraints between sequential mapping sensor measurements (Fig. 7). However, the navigation data is asynchronous with the mapping sensor measurements, and must be resampled. The resampling is done using an EKF. Each successive measurement from a navigation sensor is incorporated into the filter using an update step. A prediction step is run when a multibeam ping or image capture step occurs in order to recover the vehicle state and state covariance at that time. The relative poses and covariances between sequential mapping measurements are retained as constraints for the factor graph. These constraints only link temporally adjacent vehicle poses and will contain dead reckoning error which accumulates over time, necessitating the other forms of constraint ( Fig. 7).

Data association
Data association is the process of recognizing that two separate observations relate to the same terrain and deriving a spatial constraint from their relative alignment. Depending on the sensor, there are two possible approaches to data association.
The first approach is for creating links between stereo image pairs. Linking stereo camera poses requires abstracting images into matchable features and recognizing a link between two poses when a unique feature is viewed in both poses.
The second is for creating links between 3D terrain patches, generated using either camera or multibeam. Establishing links between 3D patches is done by aligning the structure of two overlapping patches using point cloud registration techniques.
This approach is appropriate for both multibeam-multibeam links and multibeamstereo cross modality links (Fig. 8).
Generally when SLAM algorithms are performed online, links are sequentially hypothesized when a measurement from an adjacent pose lies within the a confidence ellipse related to the covariance of the current pose or current measurement.
The covariance is kept small because the navigation is continually being refined.
This results in a small search area for potential links and a robustness to false Image features The first factor graph is set up as a bundle adjustment problem [11]. It incorporates the odometry constraints with feature based constraints between stereo image pairs. The graph is solved to obtain the vehicle positions and the camera offset.
There are a number of ways to use images to constrain robot trajectories.
Hover et al uses a 5DOF pose based image constraint [8]. Kunz uses a landmark based reprojection error minimization with a single camera [9]. Here however, a stereo system is available. A calibrated stereo system allows for a 6 DOF motion constraint unlike a monocular system, which can only provide 5 DOF constraints on motion due to loss of scale. Camera and multibeam sensor offsets on the vehicle must be well aligned relative to each other so that their measurements can be properly aligned. These offsets are measured by hand and it is difficult to achieve the required precision.
To avoid the guesswork, Kunz added an additional variable node to the graph: the camera offset node (Fig. 10). This additional variable accounts for the constant transform between the vehicle and sensor coordinates. The initial hand measurement of the camera offset serves as a prior on the offset node and the covariance of the prior encodes how well the offset was measured. It is worth mentioning that this node is often poorly constrained in the z direction and tends to float vertically.
The vehicle is very stable in the pitch and roll directions which is the motion necessary to constrain z. This could make the map more difficult to geo-reference but has little impact on its self-consistency. The prior on the camera offset it the final constraint needed in the assembly of the bundle adjustment factor graph before it can be solved. 30 The location of the 3D landmarks projected on their respective images, along with the camera offset prior and the resampled navigation data are assembled into a graph. When the graph is solved, the result is a refined vehicle position at the time of every mapping sensor measurement, as well as an estimated camera offset.

Adding multibeam and cross modality constraints (Phase II)
After camera constraints have been used to refine the navigation and camera offsets, the data can be used to establish multibeam links. While these links have less influence on the over all navigation solution than the camera constraints, they are important for constraining the multibeam sensor offset ensuring proper alignment with the camera.
• Submap assembly The multibeam submaps are constructed by aligning adjacent pings in the  These three dimensional submaps can be aligned with overlapping submaps to produce constraints on the relative position of the vehicle. Any relative pose constraints formed between submaps act on the vehicle pose which serves as the submap origin. This process is described in detail in [10] and refined for factor graph applications in [7].

• Submap link alignment and verification
Sonar submaps are aligned to constrain adjacent vehicle poses. In general, a relative pose constraint up to 6 DOFs found by aligning the submaps in x, y, z, roll, pitch and heading and computing the rigid transformation between their origins using point cloud registration techniques.
To establish link hypotheses, the submap boundaries are plotted in ℓ using the bundle adjusted vehicle navigation and the hand measured sonar offset. First link hypotheses are generated between potentially overlapping submaps then the overlapping regions are gridded (Fig. 11). The gridded data is aligned in the x and y directions by minimizing the Some of Squared Differences (SSD).
∆x, ∆y = min ∆x,∆y where z i,x,y is the depth of grid cell x, y in submap i, and S ∆x,∆y is the set of all indices x, y is in submap i and x + ∆x, y + ∆y is in submap j [10]. The minimum gives can be used to correct the initial estimate for the x and y components of the relative pose transform. If correlation is successful, a full 3D alignment is attempted with the SSD based alignment as an initial guess.
Point cloud registration has been widely researched for applications in robotics and scene reconstruction. Iterative Closest Point (ICP) in particular has become a common way to bring two point clouds into alignment by computing the rigid transformation between them [14]. ICP works by taking a random sample of points from one cloud, finding their nearest match in the other cloud and the computing the transform which pulls these points into the best alignment. This processes is iterated over for a pre-specified number of iterations.
ICP gives a full 6 DOF alignment between point clouds, however it is susceptible to local minima and sometimes converges to the wrong answer. After alignment, the ICP result is assessed to make sure it actually produces an alignment improvement when compared with the SSD. Each point in one submap is linked to the nearest point in the other submap to determine the point-to-point error. This is done for the SSD as well as the ICP alignment results. The error histograms are summed and if the ICP error is mainly higher than that of the SSD, the ICP transform is rejected in favor of the 2 DOF SSD result. This method was developed by Roman [2].
The error surface of the SSD function is useful for determining the uncer- to a 0.4m radius. If a larger region is approximated, the quadratic will tend to be not as steep even for good alignments, therefore the Hessian threshold has to be lowered.
When maps are only aligned in x and y, the factor node only constrains the graph in 2DOF. The link is given essentially zero information gain for all of the unconstrained degrees of freedom. In this case, the diagonal Hessian components are the information gain for x and y. When the ICP alignment is used, the Hessian is also used for x and y information gain and non-zero values are found empirically for the remaining degrees of freedom since the method in Roman, 2007 was found to underestimate information gain for these links to the point where the have no influence on the graph [15]. Instead the x and y information gain was taken from the 2 DOF information gain. Roll roll pitch and heading information gain was gradually increased until resulting map was at its most consistent and the links had moderate effect on the multibeam sensor offset.
These links are also used to constrain the multibeam sensor offset. Unlike the camera offset, this offset can only be estimated in x, y, z and roll. Changing the offset in pitch or heading would change the shape of the submap which we assume is rigid. For a 6 DOF offset estimation to be valid, the submap would have to mutable and be re-aggregated for each new iteration of the navigation solver. Therefore, the offset was not allowed to vary in pitch and roll as the graph was optimized.

• Cross modality links
Aligning the point clouds from the two sensors is critical to making a multimodal map. To accomplish this, another constraint is introduced into the graph. This constraint connects the multibeam data to the camera data via a relative pose constraint between two respective submap origins. This is similar to multibeam data association (Fig. 12).
The first step is to create stereo based submaps. This is done by perform- Another way to apply such a constraint is to use a 6 DOF constraint much like the one used to link two multibeam submaps. For cross modality links, a set of 15 camera and multibeam links containing reasonable amounts of structure were selected as link hypotheses. Then with the initial alignment provided by the bundle adjusted navigation solution, the sum of squared differences was used to refine the alignment in x and y directions. Then ICP is performed to ascertain the full relative pose constraint between the camera and the multibeam submaps. This relative pose constraint is used to enforce the mutual alignment of the camera and multibeam point clouds.
For instance, say that an stereo pair is acquired at time i and a multibeam origin corresponds to time j. The cross modality link between vehicle pose at i and j is written as x i,j . This relative pose measurement between the two vehicle poses is the constraint which will be applied to the graph. It is a function of the relative pose between submap origins (x c i ,m j ), found using point cloud registration, and the sensor offsets: Navigation data between two poses close together in time has a very low covariance because there has been little opportunity for drift. Therefore and cross modality constraint between those two poses will tend to have more impact on the sensor offsets than they do on the navigation data. This prevents the multibeam sensor offset from floating away from the fixed camera offset when Phase II is solved. The Phase II graph containing the camera constraints, navigation constraints, multibeam constraints, multibeam offset prior, and cross modality constraints is solved to finally estimate the vehicle poses and multibeam sensor offset (Fig. 12).

Factor nodes: Error functions
The graph solution is inferred using a non-linear least squares solver to min-

Reprojection error function
The error between two stereo poses is computed using reprojection error. Reprojection error is a metric to simultaneously evaluate the correctness of camera poses and scene reconstruction. This is done by comparing the location of a feature based on the reprojection induced by estimated pose and scene to the same feature's actual location in an image (Fig. 13). Here K is the camera matrix for the left camera, image point U = [u, v] T and the 3D point in the camera reference f c = [X, Y, Z] T can be expressed in the camera coordinate frame by image is the reprojection error. The graph solver however minimizes over a basic error vector containing the error r = [u error , v error ] T .

Error metrics
The quality of the map assembled from optimized navigation can be evaluated using several types of error metric. The first type are the error metrics over which the graph was optimized. The second type are error metrics which arise from constructing a multi-modal map and evaluating its characteristics directly. It is important to distinguish between these two types of error metrics.

Map based error metrics
The error metrics for reprojection error and relative pose constraints are practical approximations for 3D structure alignment, which are proxies for map alignment error. Since this is the case, its is valuable to examine map quality directly to ensure that it is sufficiently improved by navigation refinement proceed with 42 the mapping steps.

• Map alignment
A composite map is composed of a number of camera and multibeam submaps projected into a common coordinate frame. This error metric concerns the quality of submap alignment across the entire composite map. The error metric is based on the Map-to-Map error developed by Roman [2], that has been modified to accommodate comparisons between submaps acquired by different modalities.
The metric is derived from the idea of the Hausdorff distance [16]. Multiple points X i can be selected to produce multiple d ij and the average is taken. This process is repeated ∀i ∈ M and the mean d ij is taken to be the map-to-map error for that cell (Fig. 14).
In the previous implementation, a plane was not fit to the set of points in M j , instead the distance between X i and the nearest point in M j was used.
This is a reasonable approach when sampling densities are consistent and greater than grid cell size. submaps made from stereo cameras in particular When this cell by cell error metric is evaluated across all of the submaps (including stereo and multibeam) , the result is a gridded representation of the map-to-map error. This is a good illustration of the quality of alignment between the submaps which will ultimately comprise the final composite map of the surface.
Another way to use this metric is to assign all multibeam submaps to one submap number and all camera submaps to another map label. Evaluating the map to map error over these two 'submaps' gives a sense for how well the two modalities are aligned. This is an important thing to examine since it is well established that the individual modalities can form self consistent maps, but no one has ever investigated their how consistent the are with each other.
• Texture alignment Texture alignment refers to how well we can expect images projected onto the map structure to line up with each other and it can be evaluated using reprojection error. 3D locations of features appears in two camera poses are backprojected into the opposing viewpoint and the backprojected point is compared to the known feature location in that image to get reprojection error. In the previous section, this error was evaluated using the feature locations optimized during navigation refinement, however this is not truly reflective of the alignment of texture maps when projected on the mesh since the texture maps are not warped to match the unrefined feature locations.
Instead textures will map to the mesh of the unrefined feature locations .
Therefore, reprojection error will best evaluate texture alignment if done with the initial feature locations reprojected into the refined camera poses.
While no texture mapping is done in this thesis, that would be a logical and straightforward way to extend the utility of the work. Reprojection error is also a useful approximation for how well the camera meshes align with each other.

Results
The results of navigation refinement dictate the quality of the ultimate map so it is important to understand the characteristics and breakdown points of this process. The various data association techniques contribute an important set of constraints to the navigation refinement solution. In particular, the use of cross modality registration has been introduced as a new constraint and the results presented here. Overall, the navigation refinement results can be evaluated in terms of the error metrics summarized in the previous section. These error metrics give an 45 indication of expected mapping performance as well as give insight into particular considerations which should be made in developing the mapping methods.

Data association
Links between poses constrain the vehicle to locations which will provide self consistent maps. This section summarizes the utility and breakdown points of each of the data association techniques used to constrain the vehicle poses.

Stereo
Stereo data association is based on two stereo poses viewing the same feature. Stereo links fail in poor imaging conditions. If light is poor, turbidity is high, scene texture is lacking, or the viewpoint between images is too different, there are several points where the algorithm will catch bad links: 1. If turbidity is high or the vehicle is too far from the bottom, few or no stereo matches will be made between images in the stereo pairs, thus no SIFT descriptors will be available for matching with hypothesized links pairs.
2. If the scene has changed between one measurement and the next due to silt kick up or moving fish, few common SIFT descriptors will be found between hypothesized link image pairs. If any are found, they may not be consistent with a 6 DOF rigid motion so the features matches between stereo pairs will all be rejected and no link will be verified.
One portion of many surveys where stereo links tend to fail is along diagonal  Figure 16(b). The images contain enough distinct and common features that a link ought to be easily obtained. However, the 11 sift matches overlaid on the two images are incorrect except for two. There are not enough correct matches to meet the threshold so this link is rejected. Even so, six links were successfully established between the survey and the crossing line.
One way to asses bad links is using reprojection error. Any feature with a much higher final average than the others is likely to be an outlier. Figure 19 shows no

Multibeam
Multibeam data association is executed after the bundle adjustment step. As a result, most of the drift has been removed from the navigation data. This gives good initial alignment for the multibeam relative pose estimates. Figure 17 shows the distribution of verified links established between multibeam submaps assembled using the bundle adjusted navigation data. There are 76 total links distributed throughout the survey.
Submap links are based on the alignment of scene structure, therefore if there is little scene structure, alignment is less likely to be successful. Figure 17 shows that there is more concentration of links in the center of the survey where there is more structure from the debris of the shipwreck. There aren't as many multibeam links as there are stereo links, and they appear to not alter the navigation data very much from the bundle adjusted solution, however they are useful because they enforce self consistency between multibeam submaps as the cross modality links enforce consistency between multibeam submaps and camera submaps.

Cross Modality registration
The cross modality registration uses stereo and multibeam sonar range data and aligns them. Two different methods with different levels of constraint were 49 used. The first method constraining only the z direction between aligned submaps is the only option when there is no scene structure. The second uses the well known ICP point cloud registration algorithm to compute 6 DOF constraint between the two overlapping submaps. The results of ICP alignment process are presented here, and the impact that both methods have on the navigation solution is demonstrated in Section 2.4.2.
The ICP alignment with selected submaps had a good success rate. 15 multibeam-stereo pairs that were selected. SSD gave a distinct results for 12 of them and all 12 converged consistently to reasonable solutions. It was helpful to select submaps covering areas with structure. An advantage of using dense stereo reconstructions is the very high point density available which provides flexibility.
All of the stereo points could be used, but at a drastically increased processing time. Instead camera reconstructions were down-sampled to 1.5 points/cm 2 density to match the multibeam's natural point density of 1.5 points/cm 2 . It was also useful to remove outliers from the dense stereo by gridding both point clouds and removing stereo data more than three standard deviations from the mean. ICP convergence generally occurred between 4 and 10 iterations. Figure 18 shows the typical results of aligning camera and multibeam submaps from the area shown in 18(a). Final alignments showed very little error when evaluated using the map-to-map error metric (Fig. 18(e)), however there are gaps around the edge of the objects due to occlusions. These gaps do not prevent ICP from converging but they contribute ambiguity to the alignment.
ICP has been a successful method for registering stereo submaps to multibeam submaps because these submaps have achieve the required sampling density, and have good alignment. Dense stereo techniques allow flexibility in selecting sampling density which results in convergence of the ICP algorithm. Additionally, since the navigation data has already gone through one round of refinement, the initial alignment between submaps is good.

Factor graph results evaluated using error metrics
The output of the factor graph inference and the impacts of the various constraints are evaluated using the error metrics outlined previously. These error metrics reveal information about the improvement in the submap alignment as well as remaining artifacts. In addition they illustrate the utility of cross modality links.

Reprojection error
The navigation solution is computed by minimizing error over a number of functions, one of which is reprojection error. Reprojection error is a good metric for comparing various navigation solutions because it reflects approximately how well images will line up when they are projected on the 3D map structure. Kunz shows that adding camera constraints and camera offset estimation improves reprojection error [9]. Additionally it was shown that multibeam relative pose constraints do not worsen the reprojection error and those results have been reproduced here. Figure 19 shows that the addition of cross modality links also does not negatively impact the reprojection error of the solution. Next it will be shown that cross modality links also improve the mutual consistency of data from the two sensors.

Map-to-map error
To evaluate mutual alignment of the two sensors, map-to-map error is used.
The dense stereo reconstructions are counted as one map and the multibeam submaps are counted as another. To evaluate the overall error characteristics of the map, each submap is treated separately in the map-to-map error calculation.
First map-to-map error is used to show the impact of cross modality links on the alignment between the two sensors. A histogram of map-to-map errors is useful when the amount of error is great enough that a spatial distribution plot becomes difficult to interpret. In this histogram the error for no cross modality links is large (Fig. 20). Adding either z links or ICP links substantially reduces

Histogram of Reprojection Error from Various Navigation Solutions
Dead−Reckoned Navigation Stereo and MB Constraints Stereo, MB and Cross−Mod. Constraints Figure 19. Reprojection error results. Adding cross modality constraints doesn't degrade the reprojection error.
this error. Note that the distribution between the a z links and the ICP links are quite similar.
Examining the spatial distribution of error will give some indication of whether these cross links have an impact on the alignment of the two sensors in x and y. In fact, it appears that between the 1 DOF and the 6 DOF alignment, there is very little difference in the spatial distribution of error (Fig. 21).
If each dense stereo reconstruction and multibeam submap is labeled as a different map and then map-to-map error is computed, it is an indicator of overall point cloud thickness (Fig. 22). The most obvious error is at the edges of objects where slight misalignments between submaps are apparent and error is often as big the object is tall. The error appears at the edge of every object in the map and tends to be a consistent width. This indicates either a constant bias in the relative offset between the two sensors, or that the sensors resolve edges differently than one another.
Another error type of error appears as a gradual increase in error across the width of each trackline (Fig. 22). This is a slight roll bias in the camera offset.
This type of error is usually apparent in the map as well and indicates that the offset wasn't fully corrected during the navigation refinement step.

Discussion
This chapter has focused on the necessary steps for aligning multibeam and stereo in the same coordinate system by refining navigation data. The motivation for this is to align the two modalities well enough that a single map can be constructed from the fused data sets. Additionally, several error metrics for evaluating the results have been reviewed.

Data association
One problem with the presented approach to link hypothesis generation for data association is that as maps get larger, more navigation drift occurs and the search radius for potential links must be wider. This adds computation time but this is not a large concern when the solutions are computed in post processing. Another problem is that with so many links being compared, there is more room for bad matches. To cope with this, we use aggressive outlier rejection thresholds during link verification. This we can reliably reject bad links and have found that a large number of good links remain.

Stereo based data association
Matching stereo based measurements with each other to form links is a strong way to constrain the navigation data. It performs well, providing constraints on most images even in areas of the sea floor with few structural features since the textural composition of sea floor tends to be rich enough for unique feature matches.
The outlier rejection threshold requiring six matched features between poses is somewhat aggressive but it doesn't cause too many problems because good matches are so prolific. However, it may be possible to avoid this during the navigation solution by incorporating outlier rejection into the navigation solution and rejecting features which low marginal probability at each iteration. Another option may be to use a robust error function, though early experiments show that this approach often fails to converge to a solution.
In spite of the general success with stereo data association, there are much fewer links associated with images on the crossing line. This is effect is particularly evident in high relief scenes where lighting and parallax create different effects as viewpoint changes. There are ways to avoid this problem. First, the vehicle can close loops over texturally but not structurally rich areas where lighting and parallax will cause fewer differences between viewpoints, but available texture still provides substance for good data associations. Another option is to close loops with the vehicle at the same heading as was maintained during the survey to achieve similar lighting and projective characteristics. The problem here is that profiling sensors such as multibeam sonar or structured light require heading and course over ground to be the same for proper data acquisition and coverage, so 56 this approach isn't practical if a loop closure is necessary with those instruments.
Finally, more distributed lighting systems (not feasible on Hercules due to space constraints) are begin used on other vehicles to mitigate the problem of shadows. That said, ICP accomplishes alignment for the places where there is data from two sensors (Fig. 18(e)). The amount of occlusion present implies a corresponding amount of uncertainty in the alignment between the sensors.
The process of generating link hypotheses between multiple modalities is currently done by hand. It would not be a stretch to automate, however. Link hypotheses could be generated based on areas where there is significant overlap between multibeam and stereo. To reduce the over all number of hypotheses and improve performance, a metric related to the normals of the surface could be used to determine which submaps have enough structure to be worth matching. If the 57 survey contains little structure z links can be used instead.
It is interesting that even with only z links, the alignment between the two sensors is good. This indicates that the navigation is constrained well enough that the submaps tend towards good x, y alignment even without cross modality constraint in those directions. However, the full cross modality links have a lot of value because they are a step towards aligning maps from two different sensors taken during two different surveys.
The necessity of hand tuning the information gain for cross modality links indicates that there is some unresolved issue in determining information gain for point cloud alignment which undervalues the link. Another possibility is the odometry or stereo constraints are being over valued. Resolving this issue requires further investigation since the relative importance of the constraints is important to the quality of the result.

Map-to-map error and implications for mapping results
It is important to know if the navigation is good enough for map construction. Assuming that this navigation solution is the best available, the next chapter undertakes the goal of creating a map that combines the strengths of each of the sensors.

List of References
[1] R. Smith, M. Self, and P. Cheeseman, "Estimating uncertain spatial relationships in robotics," Proceedings of the Second Annual Conference on Uncertainty in Artificial Intelligence, pp. 167-193, 1986.

Introduction
This chapter focuses on producing a bathymetric map using data from two mapping sensors. It is assumed that the sensor poses have been established already during the navigation refinement step. In general, the fidelity of a map is evaluated by the end user using somewhat abstract characteristics related to the map's specific purpose. Several such characteristics of a useful map are distilled into concrete criteria. The ways that hybrid maps can address these criteria provide justification for producing them as an alternative to the current methodologies. Two possible methods for combining multi-modal 3D point cloud data will be presented. The first method serves as a basis for comparison and the second method addresses issues of multi-modal data fusion and outlier rejection with emphasis on different aspects of map fidelity. Finally, the resulting point cloud will be evaluated in terms of how well it addresses the map fidelity criteria.

Background
A number of methods have been developed for producing reconstructions of the sea floor from images and acoustics.

Photomosaics
Photomosaics are maps comprised of images which are registered and warped to bring them into alignment and blended together. This is a fairly simple problem for a few images but underwater surveys are often made up of hundreds to thousands of images [1]. If many images are warped and aligned naively, major distortions can occur. Resolving this type of error requires a global solution to determine the projective transformations for each individual image which distribute warping evenly across the map. It done properly no section of the map is subject 61 to more distortion than the others [2]. This approach to mapping produces flat maps which convey a large amount of information about shape and texture of the scene. Such mosaics have been a very useful for archeology since they provide information about the relative position of artifacts and allow scientists to visualize an entire underwater scene at once [3].
The underlying assumption of this type of mapping is that each image is of a planar scene. In practice this is regularly violated. The result is that the map is not a scale accurate representation of the scene. In spite of this photomosaicking is still widely used because well automated solutions are available which makes such maps easy to produce and ultimately, they very informative in spite of their drawbacks.

2.5D and 3D maps
The next level of complexity in mapping is creating a model which conveys shape in 2.5D or 3D. A wide variety of methods have been developed to accomplish this, using both acoustics and optics.
Typical approaches such as SFM have been employed using sparse features [4,5]. In feature rich areas the resulting mesh is very accurate and quite dense.
They also rely on sparse feature extraction, which can be tailored to focus on high relief areas and areas of geometric importance, so complicated terrain can be efficiently represented. However, since the sparse points are optimized during the structure from motion solution, this method does not lend itself to arbitrarily high point densities. It is ultimately limited by the feature extractor's ability to extract and match features and the computational burden to optimize feature locations.
Other approaches which use acoustic data such as CUBE and BP-SLAM build height maps using a Bayesian filter approach. Depth measurements are added to a graph or grid structure and redundant measurements are fused with a filter [6,7]. These approaches are successful but have never been adapted for multiple modalities. One problem with applying them to multiple modalities is that naively 62 fusing two modalities together may reduce the quality of the more precise sensor.
The previously mentioned approaches (except CUBE) refine navigation data while building the map. Another option is mapping from known poses [8]. Roberson assumes known poses and reconstructs a surface using stereo vision data [9].
When a mesh is built from known feature points and poses, producing a seamless reconstruction becomes an issue of mesh and texture blending [10,11]. This type of rendering produces very appealing maps with good local and global accuracy.
Blending textures and meshes however can disguise alignment issues.
Mapping with known poses has some characteristics which are useful for multimodal mapping, mainly the idea of splitting navigation and mapping into two different steps. However, blending two modalities together without taking into account the characteristics of each sensor may reduce the detail portrayed in the final surface reconstruction.

Multi-modal mapping
Multi-modal mapping requires specific considerations for the characteristics of each sensor. Previous attempts have been limited to computing scene structure with multibeam and overlaying texture with the images [12,13]. However, stereo vision range data has some appealing characteristics that can compliment multibeam scene reconstruction. A next logical step in multi-modal sea floor mapping is to synthesize a sea floor reconstruction from both multibeam sonar and stereo vision.
Microbathymetric mapping at a scale of O(5cm) surface reconstructions of the sea floor can benefit from merged data [14]. A final surface at this scale can be overlaid with image data from a camera to create map which conveys detailed shape and texture of the sea floor [9,12,13] Such surface reconstructions can be thought of as 2.5D where a regular grid is populated with height data. This is also called a height map or relief map. This type of map only represents structure that is visible in a plan view. The mapping data naturally takes this form when range measurements are made looking down from an altitude much greater than the relief of the scene. The result of this mapping pipeline is a full 3D point cloud which can be gridded any number of ways or displayed as a triangulated mesh. However, a 2.5D grid representation is a convenient framework for dividing up a point cloud for operations in this pipeline.
It is also an intuitive way to view height information on a flat page so that is how the data will be presented in the results section.

Evaluation of Map Quality
Decisions on how to construct a map are informed by the map's ultimate application. For instance, producing maps for navigation requires conservative depth estimation biased towards the shoalest depth to comply with regulations [7]. The maps produced in this chapter are intended for quantitative scientific investigations of the sea floor.
A number of criteria related to the map application are important to the design of a mapping algorithm. The qualities that are considered in this design are as follows: • Grid resolution. Higher point densities are important for resolving detail, so long as each point is contributing additional information. Greater point density allows a higher grid resolution if the points are accurately localized.
• Gaps. Gaps in the data make it difficult to interpret. If possible they should be filled with real data, even if it is at a lower resolution. Interpolation can also be used to fill the gaps, but interpolated data is generally of less value than real data. In an interpolated map, it can be difficult to distinguish between the two which can cause the user to be over confident.
• Artifact reduction. The user must be able to make precise measurements of individual features in the map. Artifacts such as distortion and 'ghosting', where obviously identical features are mapped in multiple nearby locations, 64 need to be avoided. Such misalignments are related to sensor calibration errors and navigation errors. A logical threshold for concern is when those errors are greater than the grid size allowable by the instrument's resolution.
• Preserving discontinuities and detail. Discontinuities in the terrain such as those related to man made artifacts or hydro thermal vent spires must be preserved. If sensors with two different sampling frequencies measure an area, its preferable to represent the scene with only data that has the higher sampling frequency. This avoids low pass filtering and a loss of information in areas of high relief area.
• Outliers. Both sensors produce outliers in the range data. A good mapping algorithm rejects these outliers without rejecting good data.
These criteria can be used to qualitatively asses the fidelity of a map. They address the more abstract side of map quality which directly contributes to how effective the map is for its specific application. The mapping algorithm described in this chapter was developed with these specific criteria in mind.

Methods
This section describes an algorithm to merge data from two sensors into a map of the sea floor. Specifically it focuses on combining range data from stereo cameras and multibeam sonar assuming known vehicle poses. The following are steps in the process which effect one or more of the above criteria. The way the steps are independently parametrized is used to maximize the map's quality according to the criteria.

Stereo
There are a number of ways to produce range data for mapping from two cameras. The previous chapter used sparse features to match and triangulate three dimensional feature points. This is a good approach for navigation refinement because it reduces an image to its most unique features. There is no need to keep a record of the 3D position for every pixel of the image because only unique features are useful in data association.
A high point density is often desirable for mapping applications, and sparse feature matching cannot be used to determine a depth measurement for every pixel.
A different approach to stereo matching called dense stereo correspondence is more capable of computing a depth for each pixel. Dense techniques are more suited to this task because the feature correspondence search is limited to a set of putative correspondences which lie on the epipolar line in the conjugate image. The way that matches are established using this constraint can vary greatly. A review and classification of current methods can be found in Scharstein and Szeliski [15].
The simplest of the dense methods is the Block Matching Algorithm implemented in OpenCV [16]. A window around a given pixel in the key image is compared using the sum of squared differences to likely pixel matches in the conjugate image. The correspondence search region is constrained by user input of the minimum and maximum pixel disparity range. Then correspondences are established within this range. Stereo correspondences found between pairs of images can be triangulated to for a 3D point cloud.
The Block Matching dense stereo algorithm is used here. It is fast and its various filters reliably reject outliers without a lot of tuning. Additionally, since no smoothing constraint is used, edges and textures are largely preserved. This consideration is important because the majority of dense methods make assumptions about the shape or smoothness of the environment which are appropriate for urban and indoor settings but are violated in natural terrain of the sea floor.

Multibeam
Multibeam sonar data requires somewhat less processing than stereo cameras to formulate 3D points. Much of the signal processing necessary for beamforming is done by the sonar's data acquisition software. After the data has been gathered, the maximum intensity along a given beam is chosen as the distance to the sea floor along each of the 512 beams in a single ping. The ranges for each ping are naturally 3D point clouds which can be assembled into a map by projecting them into the map coordinate system via the navigation data.
During the processing we reject the outer 15 beams on each side as well as the center 10 beams. Both locations contain a large number of outliers. Rejecting a large number of beams is generally an aggressive an approach which rejects too much good data to be worth its simplicity. In this case however, setting an aggressive outlier rejection criteria is preferable for two reasons. First, since the survey geometry was designed to provide 200% overlap for the 45 • Field of View (FOV) camera measurements, there is twice as much overlap for the 90 • FOV multibeam sonar. This means that outlier rejection is unlikely to open up gaps in the map and at worst will simply cause a slight reduction in sampling frequency in areas where good data was incorrectly rejected. Second, this type of aggressive rejection makes the multibeam range data nearly free of outliers. This is very difficult to do for stereo range data making the multibeam data a good tool for rejecting bad stereo data. Ultimately, purging the multibeam data of outliers at the cost of losing some correct data appears to be worthwhile due to survey design and the difficult characteristics of stereo outliers.

Hybridization
Hybrid maps are created by selecting from the available visual or acoustic data to fill each grid cell of the map. The concept is that a map can be constructed by selecting the best data from a redundant data set by accounting for the specific characteristics of the sensors. The methods used here combine modalities with specific attention to the map qualities presented previously. The previous chapter focused on finding the sensor pose for each mapping measurement. This chapter follows by focusing on projecting them into a common frame using criteria to select the best data for each location of the map (Fig. 23). At this point, no additional  Figure 23. Mapping concept. Sensor poses are known at this point so the next step is to decide which data to use to populate each grid cell of the map. This chapter focuses on how to select the best data from each sensor with which to build the map.
navigation refinement will be done.

Simple Averaging
A simplistic method used for combining data from multiple sensors is to bin and average the data with no considerations made for outliers, misalignment or sensor characteristics. The area of the map is divided up into grid cells, 2 × 2cm.
All points which fall into the cell are averaged to get the depth for that cell. This map is created for comparison (Fig. 24). It gives an idea of how the modalities might compliment each other, as well as demonstrating the specific problems which need to be addressed when combining their data.

Mapping based on local criteria
Averaging illustrates the dominant issues which arise from combining multiple modalities into a single map. Another approach is to cope with each of these issues individually, and select the best data for map assembly using criteria which are  Figure 24. Averaging all sensors to construct multi-modal map. This approach makes no accommodation for outliers, different modalities, or misalignments. The hybrid map shows artifacts related to each of these issues. While we are able to fill in the holes usually seen in stereo, the precision of the stereo is degraded by averaging in the multibeam data. Additionally, the stereo outliers persist in the final map degrading the relatively outlier free multibeam range data.
remaining navigation errors.

Grid size
Grid size is selected to trade off between accuracy and resolution. There is no one grid size which is perfect for a given map. Instead trade offs must be considered as the map is being constructed. The minimum grid size is related to either the smallest grid size which still consistently contains one or two data points, or the smallest grid size which doesn't show obvious artifacts or errors. The smallest grid size that contains real information will have the highest spatial resolution.
However, due to inherently noisy measurements, accuracy is improved when you can average over more measurements. This is achieved by having larger grid cells containing many points.
Properly trading off accuracy and resolution requires a bit of tuning. First we select a grid cell size that tends to contain result in as many grid cells occupied with range data as possible. We then grid the point cloud to the chosen size and compute the gridding confidence. An appropriate gridding confidence threshold is set and then the map is assembled. If too much of the map is cropped out, either decrease the gridding confidence (which will increase the likelihood of ghosting) or increase the grid cell size, depending on which is more valuable.
It is important to note that much of this discussion makes the assumption that most error in x and y is from navigation and most error in z is due to intrinsic sensor errors. Sensor calibration errors can also manifest in x and y but these simplifications are still a reasonable tool for selecting a grid size and pruning maps associated with poorly constrained vehicle poses.

Egregious outliers
Each sensor has outliers with particular characteristics. Taking these into account, its possible to reduce their effect on the final map. In particular, stereo outliers can be very difficult to eliminate without manually adjusting rejection There are many fewer multibeam outliers than stereo outliers and those that exist are rejected later in the process. As a result, it is not necessary to explicit reject them here. However, when they are present, the use of the median keeps them from having undue influence on stereo outlier rejection (Fig. 25).

Subtle errors
Once any egregious errors have been rejected, errors due to navigation, calibration and fundamental differences between sensors must be dealt with. Computing the gridding confidence starts with determining the marginal covariance Σ ij of the x, y components of the transform x i,j between two poses x i and x j . x i,j = [x, y] T is a Gaussian random variable described by the ellipse k 2 is a χ 2 2 random variable which parametrizes the ellipse. Setting k 2 , to a value corresponding to the required level probability (α) gives the equation of the ellipse defining the error circle for that level of confidence. For instance,k 2 = 5.99 has a 95% probability or α = .95 according to the χ 2 2 probability distribution function. If the entire ellipse falls within a square the size of a grid cell, that indicates that the the two points come from maps are correlated enough that there is α confidence that they truly exist within the same square (Fig. 26).

• Sensor Selection
Attempts to fuse the two data sources together can result in reduced precision.
The edges of objects which are sharp in stereo reconstruction become blurred, and surfaces which are smooth become rough. To address this, only a single sensor is used to compute the depth at a given cell. In general stereo range data is preferred, and sonar data is rejected whenever it shares a cell with good stereo range data.

Results
This processing pipeline is designed to aggregate data from two sensors to produce a map which best addresses the map quality criteria introduced in Section 3.3. These criteria are more abstract than the quantitative error metrics used in µ x grid size ellipse of desired % confidence Figure 26. Confidence that maps appearing in the same grid cell are actually in same grid cell. This gridding confidence is computed using the marginal covariance of the two submap poses in X and Y. If the confidence of 95% confidence falls within a square the size and orientation of a grid cell, then those maps have an acceptable amount of relative uncertainty and both will be used in the grid cell. Otherwise, both will be flagged and the map associated with the pose related to the most bad flags will be rejected.

Multibeam
Multibeam maps created directly from iSAM refined navigation have been investigated by both Kunz and Vaughn [13,17]. The multibeam map in Figure 27 is the result of binning the point cloud of multibeam range data into grid cells and averaging the z values of the points in each grid cell. This map illustrates the type of artifacts which arise in multibeam maps and need to be addressed through multi-modal mapping. The artifact shown in the left inset commonly occurs when the vehicle stops briefly and many noisy multibeam pings are averaged together. In this case, the vehicle stopped and the navigation data dropped out for 6 seconds. However, it may be advantageous to maintain this smaller grid size at the expense of small gaps in order to take advantage of the finer resolution available in regions with more overlapping coverage.
In spite of these issues, the multibeam map has some very favorable characteristics in terms of map quality. While there is some blurring of details and discontinuities, the data has few large misalignments. Small gaps are only evident where there is no overlapping coverage and this occurs mainly around the edges of the map and there are no large areas where the sensor fails to provide data. There are also few large outliers.

Stereo
A map assembled from dense stereo matching is shown in Figure 28. The most apparent feature is the number of gaps in the data. In this case, the gaps are caused by a poor calibration which prevents adequate alignment between the images during stereo matching. As a result, portions of the image could not be matched so range data could not be computed. Calibration issues are not the only cause of stereo ranging failures however. A number of other failures common to    Another notable feature of dense stereo matching is that has trouble resolving the edges of objects. This issue can be observed in the inset of Figure 28 where there is no data at the edges of many amphorae. There are two reasons for this.
Some pixels are not matched simply because they are occluded, not visible in both the left and the right image as mentioned in Section 2.5.1. The other reason is that pixel matching is based on correlation between patches surrounding the pixels of interest. This assumes that the area around the pixel being matched is flat enough that parallax will not effect the content of the patch. This local flatness assumption is violated at the edges of objects. This is a common problem which applies to a greater or lesser extent to most stereo algorithms. Similarly, multibeam sonar has a limited ability to resolve edges due to having a relatively large beam footprint and as well as occlusions.
The stereo range data has much higher data density than the multibeam sonar, so it is better able to represent small details. However, the high point densities quickly become impractical for processing in Matlab. To deal with this, the dense stereo reconstructions were subsampled to a sample density slightly higher than the multibeam data density. Even so, the stereo cameras appear to provide better measurement fidelity. Figure 28 illustrates this where amphorae appear smoother and more distinct in the stereo map than in the multibeam map ( Figure 27).
Sharp sherds are also reconstructed faithfully in the stereo whereas they tend to be blurred out by the large beam width of the multibeam. The higher precision data available from stereo is good for representing detail but also makes ghosting more apparent. This is visible in in the inset in Figure 28 at the arrow.

Determining Grid Resolution
Grid size is the first parameter which must be set because then the point cloud can be gridded and all subsequent steps can be executed cellwise. This step helps decide on a reasonable minimum resolution at which to operate, however, the final result is a point cloud which can be re-gridded using any algorithm at any resolution.
Deciding on an appropriate grid resolution begins with choosing a minimum grid size that generally guarantees each grid cell will contain data. By plotting the percentage of occupied grid cells over a number of grid sizes, the minimum grid size becomes apparent. This is the point where increasing the grid cell size no longer increases the percentage of occupied grid cells. Figure 31 shows the percent occupancy of the grid for each sensor. It is necessary to examine both sensors. The sensor indicating the largest grid size should dictate the overall minimum grid size.
By examining this plot, a reasonable minimum grid size of 1.5cm can be selected.   (a) shows that some navigation error is still visible in the map at 1.5cm grid size. By using gridding confidence elimination and finding that it effectively removes the ghosting (b), and reaveraging, we can demonstrate that the ghosting was due to a few poorly constrained poses instead of a pervasive high level of navigation error across the map.
Grid size trades off several map attributes. A larger grid size results in lower resolution but fewer apparent artifacts and fewer gaps in the data. Our current choice of grid size is strictly based on gaps. It also must be validated in terms of navigation error artifacts. If navigation error is widespread and obvious at the current grid size, then the grid size should be increased. If there only a few localized errors, these might be due to isolated poor navigation data and can be eliminated in an outlier rejection step without having to increase grid size.
Errors related to navigation data can be observed by gridding and averaging in z (Fig. 32(a)). When this point cloud is gridded, there is clearly some ghosting in the area of the map indicated by the arrow. The question is whether this error is related to a small area of poorly constrained poses which can be eliminated as outliers, or if it is the dominant error magnitude for the whole map. This can be determined by eliminating points from different maps which don't have 95% confidence of being in the same grid square. In this case the gridding confidence elimination rejects the misaligned maps without reducing overall map quality ( Fig.   32(b)). The resulting increase in clarity confirms this grid size is reasonable. If the resulting map had still contained misalignments or become noisier due to gridding confidence rejection, it would have been likely that the navigation error was too large for the chosen grid size. In the latter case, the grid size must be increased until it is on the order of the navigation error. There is no further trade off associated with a larger grid size other than the loss of resolution due to subsampling.

Outlier rejection
Outlier rejection is the next step in mapping once the appropriate grid size has been determined and verified. The main purpose of this step is to identify and reject egregious outliers. Misalignments and more subtle errors will be addressed during subsequent steps. Having observed that there are very few egregious outliers in the multibeam data, its reasonable to use this data to reject stereo data which contains far more large outliers. There is more widespread camera rejection when only one grid cell is used, and this corresponds to a bumpier looking map as observed in (c) relative to (d). No other types of outlier rejection were used at this stage, these plots are strictly the result of rejecting camera data based on cellwise statistics.
Initially, outlier rejection was done simply by eliminating any stereo points lying more than three standard deviations from the median of the multibeam range for a given cell. This resulted in some valid stereo data being rejected simply because the multibeam data was noisier than the stereo or slightly misaligned.
This appears as clipped off data creating jagged edges on objects, or widespread elimination of stereo data. This can be corrected by computing the rejection statistics using a two grid cell radius around the cell where rejection is being performed.
When using a two cell radius there is no evidence of rejecting valid data which would necessitate a larger radius. Meanwhile, the stereo outliers appearing as horizontal lines are consistently rejected ( Figure 28, red arrows). This radius can be tuned for each new map. When alignment between the two modalities is nearly perfect, the rejection threshold can be computed from a one grid cell radius. Computing statistics from zero radius (or a single grid cell) isn't advisable since a single grid cell runs the risk of containing very few points and can produce statistics which poorly represent the area due to the noisiness of multibeam data. Figure 33(a) shows that when the rejection statistics are computed from only one grid cell, many sporadic camera points are rejected. These same isolated areas are not rejected when a two grid cell radius is used in Figure 33(b) resulting in smoother object surfaces in Figure 33(d).

Gridding Confidence Rejection
After rejecting large outliers, the more subtle misalignment errors can be addressed. Gridding confidence rejection can be used to eliminate points related to poorly constrained navigation data from the outlier rejected point cloud. While this step was run previously as a way to verify the grid size selection, now it is used as part of the map making pipeline.
Reducing ghosting can be accomplished by removing maps which are projected from poorly localized poses. In the example map, the crossing line is poorly con-Submaps in crossing line    submaps leads to only a 65% or lower confidence that points measured during the crossing line are being projected to the correct grid cell (Fig. 34). These gridding confidences are a good predictor of ghosting. The 95% confidence interval is both reasonable in theory and gives good results in practice (Fig. 35).
Generally, eliminating points from the grid cell average would tend to reduce accuracy because there are now fewer noisy measurements to average over. However, by eliminating only those points which are poorly localized, the map will be improved since points which are not representative of the surface in that grid cell have been removed. The exception to this is if only one point remains in a grid cell after rejection and that one point poorly localized. In this case, artifacts may continue to appear. At this point, the map the grid size should be increased, however this issue would have been apparent during grid size validation, and addressed with a larger grid size at that time.

Sensor selection
A single sensor is assigned to each grid cell during the sensor selection step.
This step occurs after all other outliers have been rejected. By saving this step for last, any gaps in the an individual sensor's data created during outlier rejection can be filled with the remaining sensor. This step assumes that all the data which remains in a grid cell is accurate but that having only one sensor per grid cell is preferable. The stereo camera range data is higher resolution and more precise so it is selected whenever two sensors occupy the same grid cell.
There is some error apparent between the two modalities which manifests itself at transitions between multibeam and stereo coverage. It appears to be a bias in z of 1 − 2cm with the multibeam ranges being slightly longer. The appearance of the bias is accentuated by the transition from the relatively smooth stereo measurements to the noisier multibeam measurements. A broader issue with this mapping pipeline is that there isn't perfect alignment between the camera and the multibeam data. Where there are many adjacent cells containing different sensors, the map appears bumpy even if the terrain is smooth (Fig. 30).
The sensor selection step allows the map to retain the favorable characteristics of the stereo data. The stereo data remains undistorted by multibeam data because the latter is only used in places where there is no stereo data (Fig. 36). Textures are also preserved which is not possible when using multibeam only or multibeam and camera data averaged together.

Comparison between single and multiple modalities
Comparing the single and multi-modality maps serves as a measure of performance for the mapping pipeline. Grid resolution was maintained at 1.5cm for the final map product. At this grid size, the multi-modal map is generally better than either single modality map.
The multi-modal map effectively incorporates the strengths of the single modality maps. The gaps in the stereo data are corrected using the multibeam data. However stereo data is the dominant data source which results in a highly detailed map. Outliers have been successfully rejected with minimal user intervention. The remaining artifacts in the final map are largely those which were also in the single modality maps. The area mentioned in Section 3.5.1 for having some dropped navigation data still contains artifacts, there is also a remaining artifact due to roll bias in the stereo cameras. It is present in both the camera only and the multi-modal map and is a flaw in the navigation refinement as opposed to the mapping.   Figure 37(b) shows a slice of the naively merged multi-modal point cloud.

Point cloud profile
The black line shows a profile of the surface created by gridding and vertically averaging. The initial point cloud is several centimeters thick and appears to have a multi-modal distribution in z, particularly in areas where there is more structure.
A close up shows that even in flatter regions, the vertical distribution of points in a grid cell tends to be composed of clusters ( Figure 38). Here submaps which are adjacent in time are also adjacent in the colormap, so warm submaps were acquired at the beginning of the survey and cool colors at the end. Notice that maps with similar colors tend to lie closer to each other than they do to other submaps in the same x, y location. This occurs regardless of modality. This indicates that navigation error is still a significant source of error in the map.
As mentioned previously, vertical averaging across such a clustered distribution in z will not give a good scene representation. The appearance of bumpiness in the initial estimated surface is mainly due to averaging across the unevenly distributed ranges in each grid cell. The final point cloud is much thinner, and it is able to faithfully render smoothness where the images show the bottom is probably smooth.

Discussion
This pipeline produces maps synthesized from multibeam and camera data.
The design of the pipeline was intended to address the map fidelity criteria laid out in Section 3.3. The steps above specifically address the issues of grid resolution, outliers, artifacts, detail preservation and gaps. This results in an improved map relative to the single modality maps. However, a few issues remain unresolved by this process. This section discusses the successful aspects of the pipeline as well as its limitations and potential remedies.

Gaps
There appear to be two types of gaps in the gridded data. One consists of large missing segments of data. These can be due to sensor malfunction or data elimination during outlier rejection steps. The solution to this problem is to substitute in other data. This pipeline is effective for coping with these types of gaps. The holes in Figure 28 are filled using multibeam data as shown in Figure   36.
The other type are small dispersed gaps. These occur when the grid size is too small for the sensor's measurement density leaving grid cells that are not populated with range data. If small holes are pervasive, it means that measurement density is too low for the chosen grid size and it should be increased. However, if increasing the grid size is not desired, or only small sections of the map have this problem, there are two ways these gaps could be addressed. First you could try to extract more range data from existing sources. Lowering filter threshold on dense stereo matching will result in more stereo matches, however these tend to be mainly bad matches. Therefore, this option impractical. Another option is to interpolate existing data. Interpolation should be handled with caution since it creates data where there was none by making assumptions about the characteristics of the surface. This is inadvisable because interpolated data can be confused with real data and make the user over confident. However, when the gaps are the size of a grid cell, a local interpolation technique such as linear interpolation would help make the map more readable without running the risk of creating fictitious structures or flat ground where there is texture.

Accuracy
Evaluating the map accuracy is very difficult with sea floor data. To evaluate the proposed method for true accuracy would require a ground truth data set.
Lacking this data set, we can note that the method requires no assumptions such a scene flatness, frequently used in photomosaicking, which introduces systemic distortions. Additionally, the map-to-map error (Fig. 22) shows that the map is relatively self consistent which also increases our confidence in the map's accuracy.

Artifacts
Many of the artifacts initially present in the naively combined multi-modal map have been eliminated, however a few biases remain. The roll bias in the camera data is still apparent after mapping, which can be expected since no specific effort was made to eliminate it in the mapping processes. This is clearly a shortcoming in the previous navigation refinement step. However, it may be possible to reduce the effect of a roll bias during mapping without adversely effecting the map. Roll bias is most apparent at the edges of submaps. In areas where multiple maps overlap, points which are farthest from the center of the submap can be eliminated. If a sophisticated gridding process is used to process the ultimate point cloud, blending techniques could be used to reduce the appearance of such artifacts [11].
There is a small difference in z value between the camera and the multibeam data. This is apparent in areas where there is no camera data and the multibeam data fills in the gap. In those places, there is a small depth discontinuity. While its possible to reduce the appearance of problems like this with blending, that is strictly a cosmetic solution. This issue requires better agreement between the two sensors to be established during navigation refinement.

Sequential steps versus a unified approach to mapping
The presented approach is a series of steps that deal with individual mapping issues specifically and directly as they arise. A series of individual steps has the advantage that it is easy to add and subtract steps and tailor the algorithm to data sets with unique issues. More unified techniques could be harder to adapt in cases where they fail. Another advantage is that tuning the algorithm is fairly straightforward since available dials correspond directly with physical parameters.
However, this design can be an issue because the approach doesn't solve problems that haven't been explicitly modeled. Instead specific techniques need to be developed to deal with the characteristics of some data sets. For instance, the outlier rejection technique used here rejects bad points based on their distance from the mean value of the grid cell. Over terrain with very large discontinuities, this method may not be able to distinguish outliers from large discontinuities so a new outlier rejection technique might be necessary. Even so, this method has addressed some common issues of mapping which can be expected to reoccur thus these techniques will generally transfer to other mapping problems.
The alternative would be to develop a more unified global method for surface reconstruction. Such a method was devised for this data set where multiple depth hypothesis were generated based on clustering algorithm for each grid square.
Depth hypothesis were resolved using a Markov Random Field to minimize a cost metric which weights hypotheses based on factors influencing map quality. The challenging aspect is that tuning the cost function is very difficult. The tunable parameters tend to be highly abstracted from their physical effect on the map.
Additionally, the smoothness constraint implicit in a Markov Random Field made over-smoothing a pervasive issue. The steps in the processing pipeline presented here appear better suited because the tunable parameters aren't relative weights, they are parameters with physical interpretations.
This result shouldn't eliminate the use of tools such as Gaussian Processs (GPs) or B-Splines. These unified methods could be very powerful for representing the scene. GPs naturally lend themselves to adaptive grid sizes to reflect actual resolution of the available sensors and have been used successfully for multi-sensor mapping on land [18]. One issue with such approaches is that they require a very strong understanding of the error characteristics of each point in the map which are derived from good modeling of the sensors and navigation data. Having a limited understanding of interplay of these uncertainties will lead to results such as over valuing one sensor's data, failure to recognize noise or biases, over-smoothing or over-fitting. These approaches also present a significant computational burden but this will be less of a problem as methods and hardware improve. In spite of these issues, such unified methods are likely to be the next evolution in multi-modal surface reconstruction, now that we have a more thorough understanding of how the two sensors interact in a single map. Even so, the outlier rejection steps listed will still be needed in a unified approach to surface reconstruction.

Introduction
This thesis presents a method for producing sea floor maps from multiple modalities. This is motivated by the recognition that the two sensors commonly available on underwater mapping platforms have complementary strengths. With this in mind we present two part system. This system has identified some of the challenging areas of the multi-modal mapping problem and addressed them by breaking the problem into navigation and mapping components. First mapping measurements from both sensors are localized in a common reference frame while enforcing consistency between their maps using navigation refinement. Considering the relative strengths and weaknesses of each sensor, data is then drawn selectively from the two sensors to populate a map. This map exceeds either individual modality's map on a number of map quality criteria while coping with outliers and remaining navigation error.

Summary of contributions
The system as a whole was successful at producing a multi-modal map. Several developments were necessary in accomplishing this: • Navigation framework with cross modality registration. A navigation framework was developed using the iSAM smoother. This framework built on the multi-modal navigation refinement system developed by Kunz by incorporating cross modality links between stereo reconstructions and multibeam submaps. The cross modality links emphasized consistency between the stereo camera and multibeam sonar maps.
• Summary of relevant error metrics. Several methods were used to evaluate the quality of the navigation refinement. These metrics included an 99 alteration to Roman's map-to-map error metric for quantifying the degree to which overlapping maps agree with one another.
• Multi-modal map fidelity criteria. Evaluating a map's utility requires some more abstract criteria than error metrics presented in the navigation chapter. To evaluate the fidelity of the maps produced here, several characteristics of a useful map are distilled into concrete criteria. The ways that hybrid maps can address these criteria provide justification for producing them as an alternative to the current methodologies.
• Multi-modal mapping processing pipeline A mapping methodology was developed which selects the best sensor data for each map location from a redundant data set while respecting the inherent characteristics of each sensor. The individual steps in this processing pipeline were designed to address the map quality criteria.

Limitations and Future Work 4.3.1 Navigation
There are several lingering problems with navigation processing. Some remaining navigation error was apparent after refinement which indicates that the refinement process needs further improvement. The error was mainly obvious at the edges of objects. This could be the result of poorly understood covariances between submap registration. These covariances encode the relative weights of the various constraints acting on each node, and if one constraint is overvalued, it can pull the corresponding node and mapping sensor measurement out of alignment.
More work is needed to determine why existing methods over-predict point cloud registration covariance. With noise in the constraints more accurately models, we can expect between alignment between maps. Processing more data sets will certainly improve results for the mapping portion of the algorithm. Each new data set will reveal issues associated with its particular environment such a turbid water or high altitude. Methods to cope with each of these issues can be incorporated in turn into the mapping process without reformulating the existing algorithm.

Local versus global mapping
The proposed mapping method selects appropriate data based on local and neighborhood criteria. This is one class of solutions to this problem. A strength of this is that the tunable parameters have very obvious physical meanings and effects. Another class of solutions to the multi-modal mapping problem would be more unified or global solutions such as representing the entire surface with a spline fit or GP. These algorithms require good understanding of the error characteristics of the point clouds, and may in fact benefit from some of the proposed mapping techniques presented in this thesis. These unified approaches are powerful tools that offer an alternative to the proposed method.

Ground truth for navigation and mapping
Ground truth is absent from this thesis because it is difficult to obtain. Ground truth navigation data can be obtained using an LBL system. This data will available during the summer of 2013 so that the navigation algorithm can be compared against an accurate ground truth. Ground truth mapping data requires construction of a synthetic sea floor which can be measured using sensors other than multibeam and sonar (such as a kinect or laser scanner) with accurate navigation data. This is the ultimate ground truth data set to help evaluate and improve this algorithm.