Deriving inherent optical properties from decomposition of Deriving inherent optical properties from decomposition of hyperspectral non-water absorption hyperspectral non-water absorption

25 Semi-analytical algorithms (SAAs) developed for multispectral ocean color sensors have 26 benefited from a variety of approaches for retrieving the magnitude and spectral shape of inherent 27 optical properties (IOPs). SAAs generally follow two approaches: 1) simultaneous retrieval of all 28 IOPs, resulting in pre-defined bio-optical models and spectral dependence between IOPs and 2) 29 retrieval of bulk IOPs (absorption and backscattering) first followed by decomposition into 30 separate components, allowing for independent retrievals of some components. Current algorithms 31 used to decompose hyperspectral remotely-sensed reflectance into IOPs follow the first strategy. 32 Here, a spectral deconvolution algorithm for incorporation into the second strategy is presented 33 that decomposes a t-w (  ) from in situ measurements and estimates absorption due to phytoplankton 34 (a ph (  )) and colored detrital material (a dg (  )) free of explicit assumptions. The algorithm described 35 here, Derivative Analysis and Iterative Spectral Evaluation of Absorption (DAISEA), provides 36 estimates of a ph (  ) and a dg (  ) over a spectral range from 350-700 nm. Estimated a ph (  ) and a dg (  ) 37 showed an average normalized root mean square difference of <30% and <20%, respectively, from 38 350-650 nm for the majority of optically distinct environments considered. Estimated S dg median 39 difference was less than 20% for all environments considered, while distribution of S dg uncertainty 40 suggests that biogeochemical variability represented by S dg can be estimated free of bias. DAISEA 41 results suggest that hyperspectral satellite ocean color data will improve our ability to track 42 biogeochemical processes affiliated with variability in a dg (  ) and S dg free of explicit assumptions.


Introduction
Dissolved organic matter (DOM) comprises the largest pool of fixed carbon in the ocean, roughly equivalent to the reservoir of atmospheric CO2 (~670 Pg C; Hansell et al. 2009;Ogawa et al. 2001).Yet, sources and cycling of DOM in the global ocean remain poorly constrained due to difficulty in assigning origin and tracking changes in composition to a complex mixture of organic compounds composed of up to ~20,000 molecular formulas in a sample (Andrew et al. 2013;Mentges et al. 2017;Riedel and Dittmar 2014).A portion of DOM is optically active, colored dissolved organic matter (CDOM), and displays distinct spectral variability between uniquely sourced material, namely terrestrial and marine-derived, and different degradation pathways, such as microbial or photodegradation (Catalá et al. 2016;Danhiez et al. 2017;Helms et al. 2013;Helms et al. 2008;Zhao et al. 2017).Due to its interaction with light, CDOM can be rapidly characterized using optical sensors and is observable from autonomous and satellite platforms (e.g., Siegel et al. 2005;Xing et al. 2012).These observations are crucial to adequately model ocean biogeochemical and physical processes due to the influence of CDOM on distribution and spectral quality of light in the water column and heating of the surface ocean (Chang and Dickey 2004;Dutkiewicz et al. 2015;Kim et al. 2016).
CDOM absorption (ag(), m -1 ;  denotes wavelength) at visible wavelengths also tracks the spectral shape of ag (Sg) and dissolved organic carbon concentration ([DOC], mgL -1 ) in coastal waters where a strong gradient of relatively degraded, terrestrial-derived material and conservative mixing produce a clear, observable signal across unique pools of CDOM (Cory and Kling 2018;Fichot and Benner 2011;Mannino et al. 2014;Stedmon and Markager 2003).This continuous dilution of ag() in coastal waters presents predictive capability of CDOM molecular weight, degradation state and terrestrial biomarkers (e.g., lignin) using ag() due to unique spectral features present in terrestrial material relative to CDOM of marine origin (Fichot et al. 2016;Fichot et al. 2013;Helms et al. 2008;Vantrepotte et al. 2015).While these relationships are strong in coastal waters, open ocean waters do not display a consistent relationship between ag(), Sg and [DOC] due to relatively low production rates and strong photodegradation in surface ocean waters (Helms et al. 2013;Nelson et al. 2010).This disconnect between single wavelength estimates of ag() and Sg currently limits our ability to accurately track production and degradation of CDOM across broad spatial scales while also introducing significant uncertainty in estimates of aph() and derived products (e.g., chlorophyll-a concentration).Additionally, increasing observations of ag() have shown that Sg displays significant variability and is capable of characterizing CDOM of unique source, environmental conditions and degradation state (Asmala et al. 2018;Danhiez et al. 2017;Grunert et al. 2018;Helms et al. 2008Helms et al. , 2013)).Considering this, it is likely that this parameter contains very useful information regarding food web processes and marine carbon cycling relevant to understanding the balance of the marine DOM carbon reservoir.
Hyperspectral ocean color observations from in situ measurements including flow-through systems and proposed satellite sensors such as the German Aerospace Center's Environmental Mapping and Analysis Program sensor and NASA's Plankton, Aerosol, Cloud and ocean Ecosystem (PACE) sensor provide the potential to observe inherent optical properties (IOP's), including phytoplankton absorption (aph(), m -1 ), non-algal particulate (NAP) absorption (ad(), m -1 ) and ag(), with greater accuracy across the global ocean.Hyperspectral satellite observations have the proven ability to characterize unique phytoplankton functional groups (Bracher et al. 2009;Sadeghi et al. 2012) while flow-through systems have provided an unprecedented view of phytoplankton productivity and physiology at a global scale (Chase et al. 2013;Werdell et al. 2013).Additional work including derivative analysis has also shown potential for estimating pigment concentrations, ag(), Sg, ad() and the spectral shape of ad() (Sd; Wang et al. 2016;Chase et al. 2017;Vandermeulen et al. 2017;Wang et al. 2018).To date, satellite algorithms use an assumed value or starting point for Sdg, the combined spectral slope term for ad() and ag(), based on global or regional observations and/or constrain solutions within a pre-defined space (Lee et al. 2002;Werdell et al. 2013;Dong et al. 2013;Zhang et al. 2015).These approaches are all made possible by a variety of existing inversion approaches developed for multispectral data outlined by Werdell et al. (2018).
Hyperspectral approaches are still scarce but apply bottom-up strategies on in situ Rrs() capable of estimating pigment concentrations and separating ag()/Sg and ad()/Sd using assumed starting points and lower/upper bounds on variables.Bottom-up strategies provide accurate solutions but result in IOP retrievals that are spectrally dependent on each other (Mouw et al. 2015).Here, we provide a top-down approach that independently estimates Sdg, adg() and aph() free of explicit assumptions from total non-water absorption (at-w()) using derivative analysis, iterative spectral evaluation and Gaussian decomposition of total non-water absorption spectra.
Beyond estimation of Sdg and more accurate spectral retrievals of aph(), such a method provides clearer spectral features for the derivation of phytoplankton functional types, including Gaussian fitting and second or fourth derivative analysis of phytoplankton pigments (Chase et al. 2017;Vandermeulen et al. 2017;Wang et al. 2017).We focus on accurate retrieval of Sdg and adg() to represent biogeochemical variability in NAP and CDOM absorption represented by the spectral shape and magnitude of adg().Our results suggest the algorithm, Derivative Analysis and Iterative Spectral Evaluation of Absorption (DAISEA), will work well with future top-down hyperspectral inversion approaches.

Data
In situ data were accessed from NASA's SeaWiFS Bio-optical Archive and Storage System (SeaBASS, https://seabass.gsfc.nasa.gov/) on January 12, 2018 (Werdell et al. 2003).We focused our collection on data where aph(), ad() and ag() were all measured coincidentally on a benchtop spectrophotometer within 10 m of the surface (Fig. 1).We initially quality controlled each set of absorption spectra by considering if any values were below zero for individual spectra.If the minimum value was more negative than -0.1, the spectra was discarded; if the value was greater than -0.1, an offset for the most negative value was applied to the entire spectrum.In doing so, spectral shape was retained while removing poorly defined absorption values that resulted in negative algorithm solutions.We removed any spectra where Sdg was less than 0.004 nm -1 , values unrealistic with historic observations and estimates (e.g., Siegel et al. 2002;Wang et al. 2005).
Additionally, spectra that had been sampled at a resolution less than 2 nm were not considered to ensure spectral shape was maintained when downsampling.After removing poor quality spectra, a total of 4,787 spectra remained.These spectra were randomly split into training (n=3,434; Fig. 1a) and test datasets (n=1,353; Fig. 1b) so that training spectra accounted for ~75% of total spectra.All absorption spectra were subsampled to 5 nm either through direct sub-sampling or linear interpolation to avoid introducing artificial curvature, with the spectral range from 350-700 nm used (71 data points).Some spectra were not sampled down to exactly 350 nm but were measured at or below 355 nm (e.g., 350.7; n=79); for these spectra, we extrapolated to 350 nm using a discretized partial differential equation with an enhanced plate metaphor (D'Errico 2005).Typical uncertainty estimates for spectrophotometer measurements, assessed as differences among triplicate samples, ranged from ~5-10% relative difference (Mouw, unpublished data).We focused on 5 nm spectral resolution here for an assessment of performance relative to the anticipated resolution of PACE.

DAISEA Algorithm Development
Our approach for decomposing at-w() focused on estimating adg() first through derivative analysis, optimizing the fit of adg() through iterative spectral evaluation, then estimating aph() using Gaussian decomposition.Steps described in this section are summarized in a schematic and accompanied by figures illustrating the primary components of each step (Fig. 2).Steps 1-7 evaluate at-w() to optimize estimates of adg() and aph() and Step 8 is a Gaussian decomposition of at-w() using estimated adg() and aph() with constraints defined below.For a detailed discussion on general algorithm framework and empirical relationships used, we refer the reader to section 4.1.2.
Step 1 To first parameterize adg(), the second derivative of at-w() was calculated as: (1) where ∆ indicates the wavelength resolution used to measure at-w() (here, 5 nm as described above) and ∆=k-j=j-i, j=i+1, k=i+2 and i is the current wavelength (Tsai and Philpot 1998).
Points where the second derivative equals 0 indicate inflection points of the spectrum (Fig. 2a; Lee et al. 2007).In theory, for at-w(), these are points where individual phytoplankton pigments least impact the underlying exponential signal and thus are considered as the observed signal most likely representative of adg() spectral shape.These points were defined as d0 and were found by identifying where d 2 at-w() was approximately 0. These points were identified by rounding second derivative values to the median magnitude of the second derivative, which is a function of the magnitude of observed absorption.For example, if the median second derivative was 0.005, a value of 0.0008 at 440 nm would be considered not zero and not included (rounded to 0.001).A value of 0.0004 at 450 nm would be considered zero (rounded to 0.000), 450 nm would be classified as a d0 wavelength and the corresponding absorption would be used in Step 2.
Step 2 Using wavelengths identified in Step 1, an initial exponential expression was fitted following where 0 is the minimum wavelength in d0 (Fig. 2b).S derived from Eq. 2 was used as the initial estimate of Sdg and adg(0) was estimated at 440 nm by estimating the relative contribution of adg(440) to at-w(440) using a piece-wise exponential relationship derived from the training dataset as follows: %  ℎ (440) = 1.038 Eq. 3 was developed on the entire dataset, and outliers were determined as residuals outside 1.5 times the interquartile range (25 th and 75 th quantiles).After determining outliers, a moving window of 10% aph(440) contribution to at-w(440) was used to assess for a significant bias in residuals derived from this relationship.Bias was defined as a median residual value an order of magnitude different (positive or negative) than median residual bias between the relationship and all data points.This threshold indicated a bias for aph(440) contributions greater than 60%, corresponding to an at-w(555)/ at-w(680) ratio of 0.685.From this, a new relationship (Eq.4) was developed to estimate aph(440) percent contribution above 60% without bias.These equations are discussed further in Section 4.1.2and figures referenced therein.
From the previous steps, the spectra for adg() was then estimated (Fig. 2b) as follows   () = ( − (440) • %  (440))  −  (−440) (6) Step 3 To determine if the adg() estimate was acceptable, we compared it to at-w(): If aresidual() was always positive, the previous variables -0, adg(0), Sdgwere maintained at the current estimated values (e.g., 0=440 nm; Fig. 2c).If aresidual() was negative at any point, the wavelength corresponding to the most negative residual was used as 0, and adg(0)=at-w(0) to recalculate adg() from Resulting aresidual() was re-calculated again following Eq.7 for the new estimated adg().This step was repeated until all aresidual() values were positive, with Sdg incrementally adjusted by +0.0001 nm -1 to a maximum adjustment of +0.011 nm -1 .If a potential solution was not found, Sdg was then incrementally adjusted by -0.0001 nm -1 to a minimum adjustment of -0.004 nm -1 .The difference in adjustment and focus on positive adjustment values first is discussed further in Section 4.1.2.If no valid solution was found through this routine, the initial estimate of adg() was used; if a valid solution was found, that was the new adg() estimate (e.g., Fig. 2c).At this step, negative residual values were allowed, and accounted for in Step 5.This occurred in 91 of the 1,353 spectra evaluated (6.7% of the time).
Step 4 Using the new or initial adg() estimate, aph() was estimated (Fig. 2d) following Step 5 To determine if adg() was estimated reasonably well, we considered the ratio of aph(350):aph(440), where a value greater than 1.5 was used to indicate whether a significant portion of the adg() signal was still present in the residuals.While some waters with a significant pigment contribution below 400 nm (e.g., mycosporine-like amino acids) may have violated this rule, it was generally applicable following discussion in Section 4.1.2.
If aph(350):aph(440) was greater than 1.5, a blended estimate of adg() was produced by fitting residuals from 350-400 nm with an exponential model (Fig. 2e) following: A new estimate of adg(), denoted as adg2(), was created from: A new Sdg was re-calculated for adg2() and the next iteration of adg() was estimated from: The adg() estimated from Eq. 12 was then iteratively evaluated by adjusting Sdg and assessing whether adg() > at-w() at any wavelength, within each iteration.If adg() > at-w(), an offset was calculated by finding the wavelength where adg() was most overestimated following where ind corresponds to the wavelength where adg() was most overestimated (maximum positive value from adg()-at-w()).Eq. 8 was then used to re-calculate adg(), with 0=ind and adg(0) equivalent to adg(ind) from Eq. 13.The offset corrects for overestimations, but the application of a new 0 with the current Sdg can allow for overestimation at a different .If this step was performed, adg(0) was no longer set to the empirically-derived estimate of adg(440), rather, and adg(0) and Sdg were altered simultaneously to find a solution (i.e., adg(0) set to at-w() minus an offset, and Sdg to the next iterative slope value).These steps were performed in a stepwise manner until aph(350):aph(440) was less than 1.5 or until the maximum number of allowable iterations, currently set to 20, was reached (Fig. 2f,g).If 20 iterations were reached without a solution, the final calculated model (from the 20 th iteration) was used.This allowed for negative residuals to be included in the subsequent estimate of aph(), following Eq.9.However, negative values did not impact the spectral analysis used to identify pigments for fitting in Step 7 and negative values were removed through simultaneous fitting of adg() and aph() in Step 8, described below, where a constraint of non-negative values on solutions was imposed.
Step 6 In Step 6, we identified locations and widths of Gaussian curves in aph() derived from Eq. 9 or its equivalent derived from Steps 10-13.For this, we utilized a generic version of Eq. 1 to calculate the second derivative of estimated aph() as spectral features were accentuated in the second derivative relative to aph() (Fig. 2h).The second derivative was smoothed with a linear Savitzky-Golay filter using a 10 nm smoothing window.This smoothing reduced the number of features identified that correspond to signal noise.The smoothed second derivative was inverted to allow for identification of local maxima (equivalent to where the first derivative equals 0) and direct estimation of Gaussian curves on these spectral featuresthis is a key distinction between our methodology and published bottom-up approaches where Gaussian width and height are constrained as initial conditions or within a fitting window.Identified peaks were then used as an initial estimate of the number of peaks and each peak's location and width (Fig. 2h,i) with each Gaussian curve modeled following where σ (nm) is the width of the curve, φ (m -1 ) is the height of the Gaussian curve defined as  = 1 √2 , consistent with Gaussian curve height defined as full width at the half maximum, and μ (nm) is the peak center position.Any Gaussian curves with a σ less than 5 nm were removed at this stage, as these features were fit to noise and not pigments when using a 5 nm spectral resolution.
At this stage, φ was scaled to the second derivative requiring re-parameterization of Gaussian curve heights relative to aph().Additionally, noisy data where σ > 5 nm can result in more identified peaks than was realistic.These issues were addressed in Step 7.
Step 7 For Step 7, Gaussian curves identified in Step 6 were scaled to aph().This was done by prioritizing peaks based on their relative prominence, identified as the φ determined for each identified peak in Step 6 (scaled to the second derivative).When identified in this manner, pigments that did not overlap, or overlap little, were fitted to aph() first, following the assumption that the majority of the absorption signal in that spectral region belongs to that Gaussian component (e.g., chlorophyll-a peak at 676 nm was typically prioritized for fitting first due to little overlap with other pigments).From this, aph() was iteratively fit with each Gaussian curve, the signal from that curve was removed, and the next Gaussian curve was fit to the remaining aph() signal to get a best approximation of φ for each Gaussian curve following where n indicates the number of peaks identified for fitting from Steps 6d and 6e (Fig. 2) and μ and σ identified in Step 6 were used for each peak.Due to the additive nature of fitting Gaussian curves, there was potential for some peaks to have a negative height.After estimating an appropriate φ for each curve, we filtered out peaks with negative heights and we limited the total possible number of peaks to 16, although fewer peaks were typically identified (mean=7.7 peaks, s.d.=2.2 peaks).Most Gaussian decomposition schemes assume the presence of ~12 peaks (e.g., Hoepffner and Sathyendranath 1993; Wang et al. 2016;Chase et al. 2017).These studies have considered similar peak locations with minor differences accounting for a total of 16 unique peak locations in the literature.From this, we assumed if more than 16 peaks were present and all had a positive peak height, some identified peaks were noise or signals not affiliated with phytoplankton pigments that had not been removed in earlier steps.We sorted for likely pigment signals by prominence, using the same method described for peak height previously, and selected the 16 most prominent identified peaks if more than 16 peaks were identified.Next, we used the σ, φ and μ values identified for each Gaussian curve as input into a least squares Gaussian decomposition model that best fit our initial aph() estimate (Eq.10) with the initial Gaussian curve estimates and fitting constraints described in Step 8 to define an updated set of Gaussian curves (Fig. 2j) following the expression: Step 8 Results from Steps 1-7 provided the start point for a combined retrieval of adg() and aph() from at-w().Using the estimate of adg() from Steps 1-7 and an estimate for each identified Gaussian curve fitted to aph(), a least squares fitting approach was performed using the following expression: Analogous to methods used for identifying poorly constrained features that deviate from an underlying exponential signal presented elsewhere (e.g., Massicotte and Markager 2016), the model decomposed at-w() by utilizing a baseline exponential (Eq.8) accompanied by a predefined number of Gaussian components based on previous steps (Eq.16).This method differs from other Gaussian decomposition methods applied to particulate absorption (ap), in that those methods typically have a pre-defined number of Gaussian components based on analysis of separate aph() for the respective system (e.g., Chase et al. 2013;Wang et al. 2016).This methodology fits primary pigments with width estimated from spectral features identified in the second derivative of estimated aph(), allowing for a constrained solution to decomposing at-w() while not assuming the presence of any specific types of phytoplankton.Parameters in Eq. 17 were constrained utilizing results from Steps 1-7: adg(0) can vary from 0 m -1 to at-w(0), Sdg can vary by -0.002 nm -1 to +0.003 nm -1 from the input estimate, Gaussian peak width can vary from input width to 3 times the input width, Gaussian peak height can vary by 0.25 times input height to 3 times input height and μ is fixed at the identified location due to high confidence in the second derivative analysis.DAISEA output was as follows: adg() was that estimated in Eq. 17, while aph() was the difference between observed at-w() and adg() from Eq. 17 (Fig. 3).
Step 8 ensures coherence between the exponential signal and overlying deviations due to aph() as constrained through Steps 1-7 in a flexible manner, while not assuming that aph() can be best parameterized by 6-8 Gaussian curves.Fitting of secondary features was possible but also increases the probability of over-constraining a solution (i.e. less flexibility in adjustments to adg()).

Low aph() waters
We found that waters dominated by adg() were best decomposed by fitting an initial exponential function and adjusting to a realistic solution following Eq.8, 9 and 13.These cases were identified after Eq. 3 and 4; waters were considered dominated by adg() where the ratio of at-w(555):at-w(680) > 2.528 (the empirical value indicating aph(440) < 10% of at-w( 440)).For these situations, the algorithm opted out of the Gaussian decomposition routine and followed a simplified routine analogous to Steps 2-4, where Sdg was considered equivalent to S calculated for at-w() (Eq.2), and magnitude was adjusted so that adg() ≤ at-w().We chose this threshold as forcing Eq. 17 to fit all cases resulted in significantly more error in Sdg estimates when aph(440) contributed < 10% of at-w(440).Above this threshold, using Eq. 17 to fit for aph() improved estimates of adg() and Sdg while also providing an estimate of aph().The exact value of 10% may not be an ideal threshold for all datasets but worked well as a threshold here and fit within our presentation scheme.Eq. 3 and 4 are empirical and follow band-ratio techniques used for fitting Sdg in current semi-analytical schemes (Lee et al. 2009;Matsuoka et al. 2013).Noise in this relationship was explained by variability in the exact shape of aph() due to varying phytoplankton composition, physiology and pigment packaging effects (Bricaud and Morel 1986;Bricaud et al. 1983;Ciotti et al. 2002;Johnsen et al. 1994) as well as variability in the spectral shape and features of ag() and ad() (Grunert et al. 2018).As the algorithm is currently optimized for a global approach, users may find that adjusting the empirical values used to initially estimate adg(440) and adjusting the value of 1.5 for the ratio of aph(350):aph(440) (Step 5) for a value more representative of their study region results in better algorithm performance.

Functions
To develop DAISEA, we focused on creating a primary, custom Matlab functiondaisea, an approach that utilizes derivative analysis and iterative fitting to optimize input spectra used in a least squares Gaussian decomposition scheme fitting an exponential signal and a pre-defined number of constrained Gaussian peaks.DAISEA uses a package of custom sub-functions.This package is freely available via GitHub (https://github.com/bricegrunert/daisea/tree/v1.0.0;DOI: 10.5281/zenodo.1306817).Version updates will follow Github conventions.Users are encouraged to use the most recent version for application.

Data Analysis
To assess the performance of DAISEA across a variety of water conditions, we present results as eight different categories based on the percent contribution of aph( 440) relative to atw(440), with the distribution of spectra within these classes shown in Fig. 1.Classes were defined as the percent contribution aph has to the overall absorption budget at 440 nm (%aph(440)) of 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, and >70, with n=286, 257, 303, 210, 146, 89, 34 and 28 spectra, respectively.This classification scheme emphasizes the relative, not the absolute, contribution of phytoplankton to the overall absorption signal.Thus, waters where aph(440) is the dominant contributor to total absorption are not limited to highly productive waters.In this sense, algorithm performance was not assessed across classic definitions of Case 1 or Case 2 waters (Morel and Prieur 1977).Rather, the only group dominated by coastal and inland waters was 0-10 %aph(440).It should be emphasized that these categories are only used to present the data within the context of relative contribution of aph() and adg().Beyond separating 0-10 %aph(440) spectra for fitting using Eq. 2, the algorithm does not analyze spectra differently based on these categories.
To determine whether aph() or adg() was retrievable, we calculated the absolute difference in the opposing metric and compared it to the observed value.For example, if aph_obs() > |adg_obs()-adg_est()|, we consider it retrievable at that wavelength.Within each %aph(440) group, we summed the total number of instances at each wavelength where aph() or adg() was greater than the absolute difference in the opposing metric and divided by the total number of spectra to get a percent retrievable metric for that %aph(440) group.In addition to percent retrievable metrics, we calculated Bayes factors (BF10, unitless) to assess fit significance (Wetzels and Wagenmakers 2012).Bayes factors represent the likelihood that the fitted model adequately represents the data relative to an alternative model.Bayes factors can be interpreted literally, so that BF10=2 means the data are twice as likely to be explained by the fitted model than an alternative model.Here, we used a BF10  3 as the threshold for significance (Wetzels and Wagenmakers 2012).We also calculated root mean square difference (RMSD), normalized RMSD (NRMSD), bias, mean absolute difference (MAD) and unbiased absolute percent difference (UAPD) using the following expressions: × 100 (22)

DAISEA Performance
Here, we present the results of DAISEA performance on the test dataset.Across all groups, adg() was retrievable >80% of the time for wavelengths < 450 nm (Fig. 4a).For waters where adg() contributed greater than 60%, it was retrievable at a rate of >80% for all wavelengths up to 650 nm.For aph(), local maxima in retrieval corresponded to chlorophyll-a absorption peaks (~440 and 680 nm within DAISEA), with these wavelengths displaying >80% retrievability for waters with %aph(440) > 10 (Fig. 4b).Relative difference for aph() and adg() was parameterized as NRMSD and displayed excellent performance for both parameters across most wavelengths and environments.For all conditions except %aph(440) > 70, adg() had a mean difference less than 20% for wavelengths from 350-650 nm (Fig. 4c).Mean aph() difference was generally less than 20% from 350-650 nm when %aph(440) was > 10 (Fig. 4d).As seen in Fig. 4 and 5, aph() was biased to greater than observed values when it was a non-dominant contributor at 440 nm and was biased towards values less than observed when it was a dominant contributor at 440 nm, and vice versa for adg() (Fig. 4e).Mean absolute difference generally decreased as the contribution of aph( 440) increased (Fig. 4f).
The threshold for estimating adg() with DAISEA appears to be %aph(440) < 70; for these conditions, adg() is estimated with NRMSD < 20% from 350-650 nm.NRMSD for aph() was < 20% for the majority of wavelengths between 400-650 nm when %aph(440) was > 10.This was also consistent when considering the retrievability of aph( 440) under different conditions and can be considered as the threshold for estimating aph().Sdg uncertainty increased with increasing contribution of aph(440); however, performance was reasonable across all water conditions and estimates (Table 1).This was also confirmed when considering Bayes factors for fitted models.
Overall, BF10 were quite high with 94.0% of collective model retrievals showing a BF10 > 3, our cutoff for significance, and 92.9% with a BF10 > 10, demonstrating strong confidence in our approach.The model retrieved adg() with slightly better success than aph(), with BF10 > 3 for 99.5% and 88.5% of the dataset, respectively.Of models that did not fit observed aph() adequately (n=156), the majority of poor model fits occurred in waters where %aph(440) < 20 (n=129).Only 6 model fits were not adequate for adg() and distribution across %aph(440) groups was random.
Using the 2.528 threshold applied to at-w(440) ratios to separate low %aph(440) did present issues, particularly in the %aph(440) of 10-20 category.This threshold miscategorized spectra from this category as having %aph(440) < 10 in 175 of 257 spectra, resulting in poorly resolved aph() for these.This highlights the primary drawback of utilizing empirical relationships and a weakness in our approach.Due to using the 2.528 threshold to separate %aph(440) contribution, more than half of the aph() estimates in the 10-20 %aph(440) group had negative values at wavelengths greater than 650 nm, outside of the chlorophyll-a (Chl) absorption peak at 676 nm (typically assigned to 680 nm within the algorithm framework; Fig. 5).While we attempted to account for this by using an offset from Eq. 13, moving the location of 0 and maintaining Sdg can result in negative values at a different spectral location.However, the benefit of using this scheme was evident in improved estimates of Sdg for correctly classified spectra (i.e., spectra where %aph(440) was <10).Without a threshold to separate these spectra, attempting to fit aph() when it contributed very little resulted in poor algorithm performance, with more spectra poorly fit than with the current approach including a threshold.
One of the primary motivators for developing DAISEA was to accurately retrieve Sdg without any assumptions regarding spectral shape while also independently estimating aph().Our results suggest that this is possible across a variety of optical conditions with a reasonable to excellent degree of accuracy, depending on the relative contribution of adg().Across the different groups of varying aph(440) contribution, median difference in Sdg varied from 0.9-17.7%,with third quartile differences ranging from 2.4-39.2%(Fig. 6a; Table 1).Mean Sdg observed across all spectra in the test dataset was 0.0147 nm -1 compared to a mean estimated value of 0.0150 nm -1 , while median observed and estimated Sdg was 0.0152 and 0.0153 nm -1 , respectively.Across individual groups, we evaluated the differences and present anticipated accuracy for Sdg (Table 1).
For most groups, median difference was << 0.001 nm -1 and absolute differences affiliated with the 1 st and 3 rd quantiles ranged up to -0.0037 and 0.0021 nm -1 , respectively, but were typically much smaller.We also considered distribution of differences in Sdg across all groups and it followed a predominantly normal distribution (data not shown), without an obvious bias between observed and estimated Sdg regardless of %aph(440) contribution (Fig. 6b).

Consistency in Gaussian features
We considered the accuracy of our Gaussian component locations within DAISEA by comparing to Gaussian component locations identified on observed aph() using the same Gaussian decomposition approach (Fig. 7).Overall, peak locations were quite similar, although DAISEA fitted more peaks (total peaks=10,394; 7.7 peaks/spectra) than for observed aph() (total peaks=8,794; 6.5 peaks/spectra).Since aph() estimated from the algorithm was derived from the smoothed residuals of at-w() -adg_est(),the additional noise in the spectra was derived from deviations in ad() and ag() not accounted for by a strictly exponential fit.We discuss potential reasons for an increase in fitted peaks in DAISEA output over observed aph() in Section 4.2, as well as fitting significantly fewer peaks under our approach than other Gaussian decomposition approaches (e.g., Chase et al. 2013).

Application
As evidenced here and elsewhere, hyperspectral ocean color data provides a means for estimating more variables in a less constrained manner (Bracher et al. 2009;Dierssen et al. 2015;Uitz et al. 2015;Vandermeulen et al. 2017;Wang et al. 2017).Global variability in water optical properties is significant yet the non-uniqueness of Rrs() hampers consistent interpretation across both empirical and semi-analytical methods (Werdell et al. 2018 and references therein).Previous concepts for working around this issue, particularly in light of multispectral limitations, have included screening Rrs() to most likely cases based on optical water types, non-linear spectral optimization, linear matrix inversion, bulk inversion, ensemble inversion and spectral deconvolution (Brando et al. 2012;Hieronymi et al. 2017;Mélin and Vantrepotte 2015;Trochta et al. 2015;Werdell et al. 2018).These approaches are broadly defined as bottom-up and top-down strategies (Mouw et al. 2015), where bottom-up strategies simultaneously solve for all parameters (350-800 nm) (Twardowski and Tonizzo 2018).While it is not possible at this time to fully assess the compatibility of DAISEA with these developing approaches, early indications suggest that spectral accuracy of aph() and adg() could be quite reasonable and in-line with error attached to current approaches (Chase et al. 2017;Wang et al. 2018).Additionally, we did not pursue separation of ad() and ag() as current methods for separating within a top-down scheme rely on empirical approaches.Independent approaches for separating these features are currently being considered for future work.Considering the accuracy of estimating Sdg with DAISEA (Table 1) and minimal additional uncertainty in separating adg() into ad() and ag() in future work (5-10%), Sg could feasibly be retrieved with a median difference of 0.001-0.002nm -1 across most optical conditions.This would provide an adequate resolution for estimating CDOM source, production and degradation processes as characterized in a variety of in situ studies.

General framework and empirical relationships
The general premise of DAISEA is that adg() can be accurately modeled using an exponential model and that deviations from this exponential model are solely due to aph().There are alternate explanations for both of these assumptions (e.g., Cael and Boss 2017;Catalá et al. 2016); however, there is biogeochemical significance in Sdg, while phytoplankton would presumably produce the largest deviation from an exponential signal as observable from satellite ocean color data.Beyond these basic assumptions, we also considered the relationship between adg() and aph() within a theoretical framework (Fig. 8).Based on this framework, it is important to recognize how varying contributions of each component will inherently lead to specific biases.
For example, S350:400 fitted to at-w() where aph() contributes will result in lower slope values due to the higher contribution of aph() at 400 nm relative to 350 nm.This is why we increased estimates of Sdg first, then alternated to decreasing Sdg, as an exponential fit of at-w() will produce lower S values when aph() contributes to the signal.Finding where the second derivative of atw() equals 0 and fitting an exponential at these points minimizes this impact (essentially "cutting through" primary pigment features for a least squares fit); however, there was still a consistent bias towards lower Sdg values as aph(440) contribution increased, as expected.The general framework illustrated in Fig. 8 is also the justification for setting a ratio of 1.5 to aph(350):aph(440); when the residual used to estimate aph() had a ratio higher than this, it was almost always indicative of a significant portion of the adg() signal remaining in the residual used to calculate aph().
In short of independent variables to validate each component of interest, some explicit assumptions are required within any algorithm framework.Here, we chose to limit our solutions by constraining initial adg(440) estimates by the empirical relationship between at-w(555)/at-w(680) and %aph(440) from the training dataset (Fig. 9a) and a theoretical ratio of 1.5 for aresidual(350)/aresidual(440) (Eq.7) to determine whether the contribution of adg() to at-w() from 350-400 nm had been reasonably estimated and removed.These relationships do not explicitly dictate the final product, but guide the algorithm to reasonable estimates, at which point fitting is not constrained by these specific values.They do, however, leave an impact on how results are constrained.As we discussed previously, empirical relationships can often fall short of their intended accuracy.Despite a similar optical and geographical distribution between the training and test datasets (Fig. 1), the piece-wise exponential relationship derived from the training dataset to predict %aph(440) (r 2 =0.91,RMSD=0.068for fitted points) did not predict the same relationship nearly as well for the test dataset (r 2 =0.58,RMSD=0.110;Fig. 9b).We considered sensitivity to this empirical relationship on the test dataset.By using a single exponential expression fitted to the test dataset (data points in Fig. 9b), with a value of 0.779 instead of 1.038 and -0.5834 instead of -0.9257 in Eq. 3, NRMSD fell below 6% for all wavelengths, with higher values in the UV and lower at longer wavelengths (data not shown).However, the number of Gaussian curves fitted within the algorithm were different for 17% of spectra, with nearly all instances fit with one fewer Gaussian curves.This suggests that the algorithm is relatively robust across datasets but does exhibit significant sensitivity to empirical values.
We adjusted the theoretical value of 1.5 to lower values as a stricter threshold for removing a residual adg() signal from the estimate of aph() derived from Eq. 9. Algorithm results did not significantly change with values less than 1.5; however, spectra that contain pigments below 400 nm (e.g., mycosporine-like amino acids) required a value of 1.5 to adequately identify and fit these pigments.For spectra that did not contain a significant absorption signal below 400 nm, the shape of the spectra here is predominantly exponential.If a Gaussian component is erroneously assigned in this spectral region, as in the example spectra in Figs. 2 and 3, the curve can be minimized in Step 8 by fitting an adjusted exponential to at-w().This adjustment is allowable within the constraints provided and provides for consistent and stable solutions, since no Gaussian components are dropped.This approach and the empirical values used best fit our global dataset, but adjusting empirical values to a regional value is quite easy within the code available online (Section 2.2.2).

Gaussian Decomposition Approaches
Gaussian component location was consistent between observed aph() and DAISEA output (Fig. 7).We did not utilize Gaussian components to estimate the final aph() output, as we found that a smoothed residual of at-w()-adg_est() more accurately represented observed aph().This is likely due to fitting fewer Gaussian components than needed to accurately model aph(), as DAISEA fits fewer peaks than alternate Gaussian decomposition schemes due to a difference in methodologies (Hoepffner and Sathyendranath 1993;Chase et al. 2013).These algorithms typically identify pigments from first and second derivative analysis of an existing database of phytoplankton spectra then assign windows around these points (typically 12 peaks).Our approach focuses on identifying primary pigment features to best fit observed at-w() without assuming the locations of pigments, resulting in fewer identified peaks (~7 peaks).There is potential to increase the sensitivity of the peak finding step.Our focus was to retrieve adg() and aph() accurately, including spectral shape, rather than individually parameterizing phytoplankton pigments.It is possible to utilize the aph() output in a separate Gaussian decomposition scheme, or other approach that identifies phytoplankton pigments.However, it should be noted that derivative analysis of the final aph() output, even after smoothing, resulted in more identified peaks than the observed aph() using our scheme.This is very likely due to the inclusion of chromophores in ag() and ad() that result in deviations from the typical exponential expression used to model these parameters, features visibly apparent in many of the adg() spectra.While often overlooked, these features have been recognized for some time (Babin et al. 2003;Schwarz et al. 2002) and a recent methodology for fitting these peaks provides a means of both quantifying them and more accurately modeling the underlying exponential signal (Catalá et al. 2016;Massicotte and Markager 2016;Grunert et al. 2018).This approach is useful for in situ data, but not practical for our proposed methodology and likely a non-factor when considering at-w() derived from satellite Rrs().

Conclusions
We show that across most optical conditions considered, DAISEA can accurately estimate adg(), Sdg and aph() magnitude and spectral features for all water types where %aph(440) > 10.
We parameterized the percent of adg() and aph() estimates that were retrievable by comparing the signal observed for one IOP to the difference between estimated and observed values obtained for the other IOP.Consistent with the general accuracy of DAISEA, primary features (i.e., chlorophyll-a absorption peaks) of aph() were retrievable for greater than 80% of spectra across environments where %aph(440) > 10; adg() was retrievable for at least 80% of spectra from 350-650 nm when %aph(440) < 70.NRMSD metrics suggest strong algorithm performance across most optical variability from 350-650 nm.Algorithm bias shows a tendency to overestimate aph() when %aph(440) < 40 and to underestimate aph() when %aph(440) > 60.
Currently, coincident hyperspectral measurements of Rrs(), bbp(), aph(), ad() and ag() observed down to a minimum wavelength of 350 nm, the proposed lower wavelength limit of PACE, are quite uncommon relative to coincident measurements at wavelengths ≥ 400 nm.This, along with limited hyperspectral inversion approaches free of spectral bias, limited our ability to fully assess how well DAISEA will perform in the context of a top-down spectral deconvolution approach.Despite empirical schemes for separation of adg() and Sdg into the component parts 2. Schematic and figures illustrating primary steps for the Gaussian decomposition algorithm.
This schematic is provided to aid in visualizing and organizing the steps detailed in the accompanying text (Section 2.2).Each figure illustrates the step as indicated for an example spectra.Not all spectra require all the steps depicted, while some spectra walk through all the steps (e.g., Fig. 2c shows a successful first guess, while some spectra required an iteration at this step).
3. Algorithm output for the example spectra depicted in Fig.
2. Gray dashed lines indicate the estimated (a) adg() and (b) aph() used as input into the least squares Gaussian decomposition of observed at-w() and black dashed lines indicate the respective observed IOP.For (a) and (b), respective colored lines display algorithm output.For (c), the brown line represents adg() algorithm output, the green line represents adg() + aph() algorithm output and the black line with circles indicates observed at-w().This example shows how a Gaussian component can be fitted to the residuals derived from Step 5 (Fig. 2), but is minimized due to a better fit of observed at-w() with an exponential curve.4. Performance metrics for each group delineated by aph(440) percent contribution to total non-water absorption (indicated by color, from tan to dark green).Each plot corresponds to (a) percent retrievable aph(), (b) percent retrievable adg(), (c) aph() %NRMSD, (d) adg() %NRMSD, (e) aph() bias (adg() bias represented as inverse of each line) and (f) mean absolute difference for both aph() and adg() (equivalent value by nature of the metric).5. Mean performance of the algorithm for all test spectra within each group of spectra delineated by aph(440) percent contribution to total non-water absorption relative to mean observed (a,c,e,g,i,k,m,o) aph() and (b,d,f,h,j,l,n,p) adg().The number of spectra within each group was: (a,b) n=286; (c,d) n=257; (e,f) n=303; (g,h) n=210; (i,j) n=146; (k,l) n=89; (m,n) n=34; (o,p) n=28.6.(a) Unbiased absolute percent difference of Sdg for each grouping delineated by %aph(440), indicated by the color (see legend) and (b) distribution and relationship between observed and estimated Sdg, with marker color indicating %aph(440) and the dashed black line (--) representing the 1:1 line.7. Distribution of identified peak locations for (a) observed aph() and (b) aph() estimated from at-w().Overall, identified peaks were quite consistent between the two signals displaying the strength of the scheme for initial estimates and constraints used for the Gaussian decomposition model.8.A theoretical representation of varying spectral shape of at-w() under varying contributions of adg() and aph().The base adg() and aph() spectra used for each curve are taken from measured spectra.We utilized this theoretical framework to develop the algorithm, namely understanding how changes in aph() percent contribution will inherently impact estimates of Sdg, how this inherent bias is impacted by wavelengths used and how to assess whether adg() has been accurately retrieved from at-w() free of an empirical relationship.9. Relationship between at-w(555)/at-w(680) and aph(440) contribution for the (a) training dataset, where the piecewise exponential relationship from Eq. 3 and 4 is represented by the red line, blue points indicate fitted data and gray points indicate values excluded from model fitting (r 2 =0.91,RMSD=0.068).Outliers were defined as 1.5•1 st /3 rd quartile and were used to remove the influence of the large spread in data points with %aph(440) < 10, as these points represented nearly 25% of the dataset.(b) Test dataset points relative to the piecewise exponential relationship derived from the training dataset, displaying the primary weakness in empirical relationships (r 2 =0.58,RMSD=0.110).

Table 1 .
Median and distribution of observed Sdg (1 st and 3 rd quartile) delineated by percent aph(440) contribution.Relative accuracy of estimated Sdg is presented as the median and distribution of absolute difference (estimated Sdgobserved Sdg).Locations of spectra utilized in the (a) training dataset and (b) test dataset where color and size represent spectra grouped by varying aph(440) percent contribution to total non-water absorption.
(NAP and CDOM; e.g., Dong et al. 2013), we did not pursue separation here.Considering current algorithm performance, we anticipate that a well-performing scheme to separate adg() into its component parts will allow for appropriate resolution in Sg to estimate source and degradation state of CDOM in the surface ocean.