ROBUST ESTIMATION FOR NOISY DATA

When a data set is corrupted by noise, the model for the data generating process is misspecified, which can cause parameter estimation problems. In the case of a Gaussian autoregressive (AR) process corrupted by noise, the data is more accurately modeled as an autoregressive moving average (ARMA) process rather than an AR process. This misspecification leads to bias and, hence, low resolution in AR spectral estimation. However, a new parametric estimator, the realizable information theoretic estimator (RITE), based on a nonhomogeneous Poisson spectral representation, is shown by simulation to be more robust to noise than the asymptotic maximum likelihood estimator (MLE). We therefore conducted an in-depth investigation and analyzed the statistics of RITE and the asymptotic MLE for the misspecified model. For large data records, RITE and the asymptotic MLE are both asymptotically normally distributed. The asymptotic MLE has a slightly lower variance, but RITE exhibits much less bias, and this advantage of RITE increases as the SNR decreases. Simulation examples of a noise corrupted AR process are provided to support the theoretical properties.
A second topic of interest is data fusion for estimation, the problem of using information from multiple data sets to estimate an unknown parameter or vector. These data sets usually come from multiple sources, for example, multiple sensors, in which case the problem is called distributed estimation: data from multiple sensors and a fusion center (FC), or central processor, are combined to achieve a more accurate estimate than a single sensor observation can provide. In this paper, we propose two estimators for the data fusion estimation problem. The Fisher information and the observed Fisher information are used to reduce the negative effects of poor local estimates and thereby improve the new estimators' performance in terms of mean square error.
We also found a relationship between our new estimators and the second order Taylor expansion of l, the log likelihood function of the data from all sensors: the maximizer of the second order Taylor expansion of l turns out to be our new estimator that uses the observed Fisher information. Our simulation results show that the proposed estimators have clear advantages in both the low and intermediate SNR regions, especially when one or more sensors have much lower SNRs than the others.


Introduction
The modeling approach to spectral estimation produces less biased and lower variance spectral estimators if the model chosen is an accurate representation of the power spectral density (PSD). The most popular modeling approach is the AR spectral estimator, since it can be found by solving a set of linear equations, the Yule-Walker equations [1]. In contrast, autoregressive moving average (ARMA) or moving average (MA) estimation requires one to solve a set of nonlinear equations. When the AR modeling assumption is correct, spectral estimators such as the covariance method, Burg method, and recursive MLE method are approximately maximum likelihood estimators (MLEs) [2] and attain the Cramer-Rao lower bound (CRLB) [3] [4]. However, modeling errors are always present to some extent. Such modeling error is termed misspecification and is difficult to avoid.
When the observations are corrupted by noise, the various AR estimators mentioned above are no longer MLEs but quasi-MLEs [5], which produce smoothed AR spectral estimates and are unable to resolve the peaks in the power spectral density (PSD). Numerous studies indicate that the resolution of estimated AR spectra decreases as the SNR decreases [6] [7] [8]. The sensitivity to noise results in a severe bias, therefore limiting the utility of AR spectrum estimation. Several suboptimum approaches have been proposed for the misspecified model. One method is to recognize the true model as an ARMA process and use the modified Yule-Walker equations [9]. A second solution is to increase the order of the AR model to reduce the bias due to the misspecification. A third approach is to compensate the AR parameters or reflection coefficients for the biasing effect of noise [10]. One more option is to filter the data with a Wiener filter to enhance the signal. However, none of these existing approaches have met with great success. A realizable information theoretic estimator (RITE) [11] is proposed and is shown by theory and simulation to be more robust to noise than the asymptotic MLE.
Unlike the general spectral representation, which is a sum of sinusoids at fixed frequencies with random amplitudes and random phases, RITE models the frequencies as random events distributed according to a nonhomogeneous Poisson process. Once this representation is adopted, we can derive its likelihood function. Since the random frequency events are not observable, we propose a realizable approximation to the likelihood function. Both RITE and the MLE are obtained as maximizers of likelihood functions. They are special cases of M-estimators, which were introduced, and whose asymptotic properties were analyzed, by Huber [12]. The properties of a correctly specified model have been well studied. Misspecified models have also been investigated, but to a lesser extent [5] [13]. In our study, the detailed statistical theorems for the quasi-MLE and RITE are based on the theory of M-estimators. More detailed information about M-estimators can be found in "The calculus of M-estimators" [14].
The paper is organized as follows. Section 1.2 gives a brief introduction to RITE. In Section 1.3, the theoretical properties of quasi-MLE and RITE are given.
In Section 1.4, simulation examples to verify the theory in Section 1.3 and a simple explanation of the robustness of RITE are provided. Section 1.5 summarizes our results and discusses future work.

Realizable Information Theoretic Estimator
The background for this section can be found in references [11] [15]. A real discrete-time wide sense stationary (WSS) random process can be written in the spectral form using random frequencies as

x[n] = Σ_{k=1}^{N_p} A_k cos(2π F_k n + Φ_k).

The representation can be viewed as a marked Poisson process. In this study, instead of using uniformly spaced frequencies with a fixed number, we take N_p as the number of events of a nonhomogeneous Poisson random process in frequency with intensity λ(f) = λ_0 p(f). F_k is the k-th point event in the frequency interval 0 ≤ f ≤ 0.5 with "marks" (A_k, Φ_k). The A_k's are independent and identically distributed (IID) positive amplitude random variables, and the Φ_k's are phase random variables uniformly distributed on [0, 2π).
The amplitude, phase, and frequency random variables are independent of each other.
The PSD of x[n] can be shown to be P(f) = (E(A²)/2) p(|f|) for −0.5 ≤ f ≤ 0.5 [11]. Here, we are only interested in the case that the total power is 1, i.e. E(A²) = 1.
It can be shown that the part of the log-likelihood that depends on λ(f) is [16]

l = ∫_0^0.5 ln λ(f) N(df) − ∫_0^0.5 λ(f) df,

where N(df) is the random variable indicating the number of frequency events in the interval (f, f + df). Since the frequency events are in general not observable, but only x[n] is observed, we proceed by replacing N(df) with a realizable approximation:

N(df) ≈ λ_0 Ī(f) df,

where Ī(f) is the normalized periodogram, which is given by

Ī(f) = I(f) / ∫_0^0.5 I(ν) dν,

and I(f) is the unnormalized periodogram

I(f) = (1/N) |Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|².

In accordance with λ(f) = λ_0 p(f) and p(f) = 2P(f) for 0 ≤ f ≤ 0.5, and ignoring the terms that do not depend on the PSD and the scaling parameter λ_0, we have the realizable likelihood function

l_R = ∫_0^0.5 Ī(f) ln P(f) df.     (3)

The function ∫_0^0.5 Ī(f) ln P(f) df achieves its maximum when P(f) is identical to Ī(f). When we assume that the PSD depends on a set of parameters, the estimate of those parameters is chosen to maximize l_R. Since ∫_0^0.5 Ī(f) df = 1 by construction, the maximization result does not depend on the normalization term, and we finally have the RITE likelihood function as given in (3).
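As a concrete illustration, the realizable likelihood can be evaluated numerically from the periodogram. The sketch below is ours (the function names are not from the paper) and assumes l_R = ∫_0^0.5 Ī(f) ln P(f) df, approximated by a midpoint rule on a uniform frequency grid.

```python
import numpy as np

def periodogram(x, freqs):
    # Unnormalized periodogram I(f) = (1/N) |sum_n x[n] exp(-j 2 pi f n)|^2.
    N = len(x)
    n = np.arange(N)
    return np.abs(np.exp(-2j * np.pi * np.outer(freqs, n)) @ x) ** 2 / N

def rite_log_likelihood(x, psd_model, n_grid=512):
    # l_R = integral over (0, 0.5) of Ibar(f) ln P(f) df (midpoint rule).
    freqs = (np.arange(n_grid) + 0.5) * 0.5 / n_grid
    df = 0.5 / n_grid
    I = periodogram(x, freqs)
    Ibar = I / (np.sum(I) * df)   # normalized so that the integral of Ibar is 1
    return np.sum(Ibar * np.log(psd_model(freqs))) * df

# For a flat unit-power model, ln P(f) = 0 everywhere, so l_R is exactly zero.
x = np.random.default_rng(0).standard_normal(256)
l_flat = rite_log_likelihood(x, lambda f: np.ones_like(f))
```

Because Ī(f) integrates to 1, any constant PSD model gives a likelihood equal to its log value, which is a convenient sanity check on the normalization.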

The Statistical Properties of quasi-MLE and RITE
Let the real signal s[n] be a wide sense stationary (WSS) Gaussian random process whose power equals 1, and let N be the data record length. The observed data {x[0], x[1], · · · , x[N − 1]} is generated from the noise corrupted signal, with PSD function Q(f; θ*). We propose P(f; θ) to be the PSD model of the signal, where θ is a p × 1 vector parameter. Assume P(f; θ) is suitably smooth in θ. In accordance with the fact that the signal power equals 1, we constrain ∫_{−0.5}^{0.5} P(f; θ) df = 1 or, equivalently, the autocorrelation satisfies r[0] = 1.

The Statistical Properties of Misspecified MLE
For large data records, the asymptotic Gaussian log likelihood function is [17]

l_M = −(N/2) ∫_{−0.5}^{0.5} [ ln P(f; θ) + I(f)/P(f; θ) ] df.     (4)

If E_{θ*}(l_M) exists, where E_{θ*} represents the expected value with respect to the true model, we define θ_0 to be the value that maximizes E_{θ*}(l_M).
Theorem 1: The estimator θ̂ that maximizes (4) is asymptotically normally distributed with mean θ_0 and covariance matrix

C_θ̂ = A(θ_0)^{−1} B(θ_0) A(θ_0)^{−1},

where A(θ_0) = −E_{θ*}(∂²l_M/∂θ∂θᵀ)|_{θ=θ_0} and B(θ_0) = E_{θ*}[(∂l_M/∂θ)(∂l_M/∂θ)ᵀ]|_{θ=θ_0}.

Corollary 1: In the case of a scalar parameter, for large data records, the quasi-MLE θ̂ is asymptotically normally distributed with mean θ_0 and variance σ² = B(θ_0)/A²(θ_0).

The Statistical Properties of RITE
The RITE log likelihood function l_R is given by (3). Assume E_{θ*}(l_R) exists; we define θ_0 to be the value that maximizes E_{θ*}(l_R).

Theorem 2:
The estimator θ̂ that maximizes (3) is asymptotically normally distributed with mean θ_0 and covariance matrix

C_θ̂ = A(θ_0)^{−1} B(θ_0) A(θ_0)^{−1},

where A(θ_0) = −E_{θ*}(∂²l_R/∂θ∂θᵀ)|_{θ=θ_0} and B(θ_0) = E_{θ*}[(∂l_R/∂θ)(∂l_R/∂θ)ᵀ]|_{θ=θ_0}.

Corollary 2: In the case of a scalar parameter, RITE θ̂ is asymptotically normally distributed with mean θ_0 and variance σ² = B(θ_0)/A²(θ_0).

Simulation Examples
Assume an AR Gaussian process s[n],

s[n] = −Σ_{k=1}^{p} a[k] s[n − k] + u[n],

where u[n] is white Gaussian driving noise with variance σ²_u, p is the order of the AR process, and a[k] is the k-th autoregressive coefficient. The PSD of s[n] is

P(f) = σ²_u / |1 + Σ_{k=1}^{p} a[k] exp(−j2πfk)|².

As we stated in Section 1.3, the PSD is restricted to yield a power of 1, which makes σ²_u a function of a[1], a[2], · · · , a[p] rather than a free parameter. The observed process is the signal plus white observation noise, so that Q(f; θ*) = P(f; θ*) + σ²_w, where σ²_w is the variance of the observation noise. In our simulations, we assume the AR model order is known. Thus a[1], a[2], · · · , a[p] are the p parameters to be estimated. No analytical expression is available for the RITE estimator.
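The unit-power constraint can be imposed numerically: given the a[k], σ²_u is whatever value makes the PSD integrate to r[0] = 1. A minimal sketch (our function names, midpoint-rule integration):

```python
import numpy as np

def ar_psd_unit_power(a, freqs, n_grid=4096):
    # P(f) = sigma_u^2 / |A(f)|^2 with A(f) = 1 + sum_k a[k] exp(-j 2 pi f k);
    # sigma_u^2 is chosen so that the integral of P over (-1/2, 1/2) equals 1.
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    def inv_A2(f):
        A = 1.0 + np.exp(-2j * np.pi * np.outer(np.asarray(f, dtype=float), k)) @ a
        return 1.0 / np.abs(A) ** 2
    grid = np.linspace(-0.5, 0.5, n_grid, endpoint=False) + 0.5 / n_grid
    sigma_u2 = 1.0 / np.mean(inv_A2(grid))   # the interval has length 1
    return sigma_u2 * inv_A2(freqs)

# AR(1) check: a[1] = -0.9 gives sigma_u^2 = 1 - 0.81 = 0.19, so P(0) = 0.19/0.01 = 19.
p0 = ar_psd_unit_power([-0.9], [0.0])[0]
```

The midpoint rule converges very quickly here because the integrand is smooth and periodic, so the numerical σ²_u matches the analytic value 1 − a²[1] for the AR(1) case.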

AR(1) Process Example
The PSD of an AR(1) process with the restriction r[0] = 1 is

P(f) = (1 − a²[1]) / |1 + a[1] exp(−j2πf)|².

We generate a Gaussian AR(1) process with a[1] = −0.9. We use a grid search on k_1 to find RITE, since the reflection coefficient k_1 is restricted to (−1, 1) to guarantee a stable AR process (for an AR(1) process, k_1 = a[1]). To be fair, the quasi-MLE is also calculated using a grid search.
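The grid-search comparison can be sketched as follows. This is our own minimal implementation, assuming the Whittle form −Σ[ln P + I/P] for the asymptotic Gaussian likelihood and Σ Ī ln P for RITE, both evaluated on a discrete frequency grid; names and grid sizes are ours.

```python
import numpy as np

def ar1_psd(a1, freqs):
    # Unit-power AR(1) PSD: sigma_u^2 = 1 - a1^2 enforces r[0] = 1.
    return (1.0 - a1 ** 2) / np.abs(1.0 + a1 * np.exp(-2j * np.pi * np.asarray(freqs))) ** 2

def grid_estimates(x):
    # Grid search over k1 = a[1] in (-1, 1) for both RITE and the quasi-MLE.
    N = len(x)
    freqs = (np.arange(N // 2) + 0.5) / N          # frequency grid on (0, 0.5)
    n = np.arange(N)
    I = np.abs(np.exp(-2j * np.pi * np.outer(freqs, n)) @ x) ** 2 / N
    Ibar = I / np.sum(I)                           # normalized periodogram weights
    grid = np.linspace(-0.99, 0.99, 199)
    rite = [np.sum(Ibar * np.log(ar1_psd(a1, freqs))) for a1 in grid]
    mle = [-np.sum(np.log(ar1_psd(a1, freqs)) + I / ar1_psd(a1, freqs)) for a1 in grid]
    return grid[np.argmax(rite)], grid[np.argmax(mle)]

# Noiseless AR(1) data with a[1] = -0.9, i.e. x[n] = 0.9 x[n-1] + u[n].
rng = np.random.default_rng(1)
u = np.sqrt(0.19) * rng.standard_normal(612)
s = np.zeros(612)
for t in range(1, 612):
    s[t] = 0.9 * s[t - 1] + u[t]
x = s[100:]
k_rite, k_mle = grid_estimates(x)
```

With noiseless data both estimates land near the true coefficient; adding white noise to x is where the two begin to diverge, as the following discussion describes.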
When the observed data is embedded in noise, the quasi-MLE and RITE converge to means other than the true value. Thus, to compare the performance of the two estimators, we evaluate the mean square error (MSE), which equals the variance plus the squared bias. The theoretical variance is computed using Corollary 1 and Corollary 2, where θ* = −0.9. By definition, θ_0 is the value that maximizes the expected value of the likelihood function. As illustrated in Figure 6, the MSE of RITE grows more slowly than that of the quasi-MLE, and this advantage increases as the SNR decreases. As shown in Figure 7, although the quasi-MLE has a smaller variance, the bias weakens its performance in the low SNR range. Therefore, when the squared bias exceeds the variance, RITE exhibits more noise robustness than the quasi-MLE.

Noise Sensitivity of Likelihood Function
An important problem in AR spectral estimation is its sensitivity to observation noise. The effect of noise flattens the estimated PSD and reduces the resolution, a consequence of the severe bias of the misspecified MLE. However, RITE is shown to have less bias than the MLE when observation noise is present, which results in improved resolution. The derivative of the expected likelihood function with respect to the noise power is a measure of noise sensitivity, and it can be shown to be far larger in magnitude for the Gaussian likelihood than for the RITE likelihood. Consider a narrow-band process, for which the k_i's may be close to 1. The closer they are to 1, the larger this derivative is, and the more effect the noise has. The MLE likelihood function is therefore more severely affected by noise when the AR random process is narrow-band. For the AR(4) process employed in our simulations, the two sensitivities differ by several orders of magnitude.

Burg and RITE AR Spectral Estimator
For a higher order AR process, we need to find the global maximum of l_R, but an efficient algorithm is still to be found. Instead of performing a grid search, which incurs a high computational cost, we use the Matlab optimization function fmincon with a proper initial point generated by the Burg method and the constraints −1 < k_i < 1, in the hope of finding the global maximum as the RITE solution. Once k̂_1, k̂_2, · · · , k̂_p are available, the AR parameters are estimated by the step-up recursion: â_1[1] = k̂_1 and, for j = 2, 3, · · · , p,

â_j[i] = â_{j−1}[i] + k̂_j â_{j−1}[j − i],   i = 1, · · · , j − 1,
â_j[j] = k̂_j.

We denote this approach as the modified Levinson algorithm since we constrain r[0] = 1.
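In Python, an analogous approach uses scipy.optimize.minimize with box constraints on the reflection coefficients and the step-up recursion to recover the AR parameters. This is our sketch of the idea, not the authors' code; the Burg initialization is replaced by a user-supplied starting point k0.

```python
import numpy as np
from scipy.optimize import minimize

def step_up(k):
    # Levinson step-up: reflection coefficients k_1..k_p -> AR parameters a[1..p].
    a = np.array([])
    for kj in k:
        a = np.concatenate([a + kj * a[::-1], [kj]])
    return a

def rite_fit(x, p, k0):
    # Maximize l_R over reflection coefficients in (-1, 1); r[0] = 1 fixes
    # sigma_u^2 = prod(1 - k_j^2), the modified-Levinson constraint.
    N = len(x)
    freqs = (np.arange(N // 2) + 0.5) / N
    n = np.arange(N)
    I = np.abs(np.exp(-2j * np.pi * np.outer(freqs, n)) @ x) ** 2 / N
    Ibar = I / np.sum(I)
    E = np.exp(-2j * np.pi * np.outer(freqs, np.arange(1, p + 1)))
    def neg_lr(k):
        a = step_up(k)
        A2 = np.abs(1.0 + E @ a) ** 2
        sigma_u2 = np.prod(1.0 - np.asarray(k) ** 2)
        return -np.sum(Ibar * np.log(sigma_u2 / A2))
    res = minimize(neg_lr, np.asarray(k0, dtype=float), bounds=[(-0.999, 0.999)] * p)
    return step_up(res.x)

# AR(1) smoke test: data generated with a[1] = -0.9, optimizer started at -0.5.
rng = np.random.default_rng(2)
u = np.sqrt(0.19) * rng.standard_normal(612)
s = np.zeros(612)
for t in range(1, 612):
    s[t] = 0.9 * s[t - 1] + u[t]
a_hat = rite_fit(s[100:], 1, [-0.5])
```

As with fmincon, the bounded local optimizer only guarantees a local maximum; a good initial point (e.g. from the Burg method) is what makes the local solution likely to be the global one.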

Abstract
When a data set is corrupted by noise, the model for the data generating process is misspecified and can cause parameter estimation problems. As an example, in the case of a Gaussian autoregressive (AR) process corrupted by noise, the data is more accurately modeled as an autoregressive moving average (ARMA) process rather than an AR process. This misspecification leads to bias and, hence, low resolution in AR spectral estimation. However, a new parametric spectral estimator, the realizable information theoretic estimator (RITE), based on a nonhomogeneous Poisson spectral representation, is shown by simulation to be more robust to white noise than the asymptotic maximum likelihood estimator (MLE). We therefore conducted an in-depth investigation and analyzed the statistics of RITE and the asymptotic MLE for the misspecified model. For large data records, RITE and the asymptotic MLE are both asymptotically normally distributed. The asymptotic MLE has a slightly lower variance, but RITE exhibits much less bias. Simulation examples of a white noise corrupted AR process are provided to support the theoretical properties. This advantage of RITE increases as the signal-to-noise ratio (SNR) decreases.

Introduction
The spectral representation for a wide sense stationary (WSS) random process expresses the process as a sum of sinusoids with fixed frequencies, random phases, and random amplitudes [1]. It forms the basis for spectral estimation. Another, less well known, representation models the frequencies as random point events distributed according to a nonhomogeneous Poisson process. The likelihood function can be derived for this spectral representation. Since the frequency events are usually not observable, some modifications are applied to the likelihood function. The estimator that maximizes the approximated likelihood function is called the realizable information theoretic estimator (RITE) [2]. It can be used in model-based spectral estimation.
The autoregressive (AR) model is widely used in spectral estimation. The maximum likelihood estimator (MLE) is usually employed to obtain a good estimate of the AR parameters. If we assume a real Gaussian random process, the autocorrelation method, which requires solving the Yule-Walker equations with a suitable autocorrelation function (ACF) estimate, can be computed efficiently and is equivalent to the approximate MLE [3]. Many other methods for AR parameter estimation that produce the same numerical values for large data records, like the Burg method and the covariance method, are also approximate MLEs [4]. Hence those methods share the desirable properties of the MLE: for large data records they are consistent, asymptotically Gaussian, unbiased, and attain the Cramer-Rao lower bound (CRLB) [5] [6]. However, when the observations are corrupted by additive noise, the various AR parameter estimation methods mentioned above produce severe biases. This sensitivity to the added noise results in a smoothed AR spectral estimate. Numerous studies indicate that the resolution of estimated AR spectra decreases as the signal-to-noise ratio (SNR) decreases [7] [8] [9]. This is because the additive noise changes the true model from an AR process to an autoregressive moving average (ARMA) process in which the AR and moving average (MA) parameters are linked. Hence the methods above are no longer true MLEs. To get better resolution, one option is to use an ARMA model estimated by the least squares modified Yule-Walker equations (LSMYWE) [10], but the model order of the MA part depends on the noise type (see supplementary material A.5), which limits its utility in practice.
The MLE for a misspecified model is called a quasi-MLE [11]. A misspecified model has been investigated but to a lesser extent in [11] [12]. Hence more analysis on misspecification is necessary. Compared with the asymptotic Gaussian MLE, RITE shows a robustness to white noise in PSD classification problems [2].
Therefore it would be of interest to investigate how RITE performs in AR spectral estimation.
In this paper, the asymptotic statistical properties of the quasi-MLE and RITE are derived and verified by simulation examples. Both estimators are asymptotically Gaussian distributed but with different means and covariance matrices. An application to spectral estimation using the AR model is provided in the paper.
For an AR PSD, we prove that the asymptotic Gaussian likelihood function is more sensitive to white noise than the RITE likelihood function. Our experiments show that, in comparison to the quasi-MLE, RITE has smaller bias when white noise is present in the AR process.
The paper is organized as follows. Section 2.2 gives a brief introduction to RITE. In Section 2.3, the theoretical properties of MLE and RITE are given.
Section 2.4 reviews the AR model and gives a simple explanation of the white noise robustness of RITE. In Section 2.5, a spectral estimation application using the AR model is provided to verify the theory in Section 2.3 and to show the robustness of RITE for white noise corrupted data. Section 2.6 summarizes our results and discusses future work.

Realizable Information Theoretic Estimator
The background for this section can be found in [2] and [13]. A real discrete-time WSS random process can be represented in the spectral form as a sum of sinusoids with random frequencies, amplitudes, and phases:

x[n] = Σ_{k=1}^{M} A_k cos(2π F_k n + Φ_k).

A similar representation that uses two independent Poisson point processes can be found in [14]. The representation herein can be viewed as a marked Poisson process. If the number of events M is fixed, then the model reduces to that in [15]. In this study we take M as the number of events of a nonhomogeneous Poisson random process in frequency with intensity λ(f) on the interval [0, 0.5]. F_k is the k-th point event on the frequency interval 0 ≤ f ≤ 0.5 with "marks" (A_k, Φ_k).
The A_k's are IID positive amplitude random variables, the Φ_k's are phases uniformly distributed on [0, 2π), and the amplitude, phase, and frequency random variables are mutually independent. The PSD of x[n] is P(f) = (E(A²)/2) p(|f|) for −0.5 ≤ f ≤ 0.5, where λ(f) = λ_0 p(f). Here, we are only interested in the case that the total power is 1, i.e. E(A²) = 1 or, equivalently, ∫_{−0.5}^{0.5} P(f) df = 1. It can be shown that the part of the log-likelihood that depends on λ(f) is

l = ∫_0^0.5 ln λ(f) N(df) − ∫_0^0.5 λ(f) df,

with N(df) the random variable indicating the number of frequency events on the interval [f, f + df). Since we cannot observe the frequency events but only x[n] in general, we proceed by replacing N(df) with its approximate mean:

N(df) ≈ λ_0 Ī(f) df,

where Ī(f) is the normalized periodogram, which is given by

Ī(f) = I(f) / ∫_0^0.5 I(ν) dν,

and I(f) is the unnormalized periodogram

I(f) = (1/N) |Σ_{n=0}^{N−1} x[n] exp(−j2πfn)|².

We now have the approximated likelihood function. Ignoring the terms that do not depend on the PSD and the scaling λ_0, we have the realizable likelihood function

l_R = ∫_0^0.5 Ī(f) ln P(f) df.     (3)

The function ∫_0^0.5 Ī(f) ln P(f) df achieves its maximum when P(f) is identical to Ī(f) (this is proved in Appendix A, but not included in the original paper). If we assume that the PSD depends on a set of parameters, then the estimate of those parameters is chosen to maximize l_R. Since ∫_0^0.5 Ī(f) df = 1 by construction, the maximization result does not depend on the normalization term, and we finally have the RITE likelihood function as given in (3).

The Statistical Properties of MLE and RITE
The MLE and RITE are both obtained by maximizing their likelihood functions. They are both special cases of M-estimators. Huber introduced M-estimators and analyzed their asymptotic properties [16]. The derivations of the statistical properties of the MLE and RITE are based on the theory of M-estimators.
More detailed information about M-estimators can be found in [17].
Let the signal s[n] be a wide sense stationary (WSS) Gaussian random process whose power equals 1. Let {x[0], x[1], · · · , x[N − 1]} be an observed data set generated from the noise corrupted signal, with PSD function Q(f; θ*), where θ* is the true value of a q × 1 vector parameter. We propose P(f; θ) to be the PSD model of the signal, where θ is a p × 1 vector parameter. Assume P(f; θ) is suitably smooth in θ, i.e., P(f; θ) has continuous derivatives with respect to θ up to some desired order. In accordance with the fact that the signal power equals 1, we constrain ∫_{−0.5}^{0.5} P(f; θ) df = 1 or, equivalently, the autocorrelation satisfies r[0] = 1.

The Statistical Properties of Misspecified MLE
For large data records, the asymptotic Gaussian log likelihood function is [18]

l_M = −(N/2) ∫_{−0.5}^{0.5} [ ln P(f; θ) + I(f)/P(f; θ) ] df.     (4)

If E_{θ*}(l_M) exists, where E_{θ*} represents the expected value with respect to the true model, then we define θ_0 to be the value that maximizes E_{θ*}(l_M). The following theorem and corollaries are valid under the assumption that {cᵀ ∂l_M/∂θ |_{θ=θ_0}} satisfies the Lyapunov condition for any p × 1 vector c at any frequency. The theorem below applies to a vector parameter for misspecified problems; the corollaries are specialized to a scalar parameter and to the correct model.

Theorem 1. The estimator θ̂ that maximizes (4) is asymptotically normally distributed with mean θ_0 and covariance matrix

C_θ̂ = A(θ_0)^{−1} B(θ_0) A(θ_0)^{−1},

where [A(θ_0)]_{ul} = −E_{θ*}(∂²l_M/∂θ_u∂θ_l)|_{θ=θ_0}, [B(θ_0)]_{ul} = E_{θ*}[(∂l_M/∂θ_u)(∂l_M/∂θ_l)]|_{θ=θ_0}, and [·]_{ul} denotes the element at row u, column l. The proof is given in the supplementary material.

Corollary 1.1. When the proposed model is correct, A(θ_0) = B(θ_0), and it follows that the covariance matrix reduces to C_θ̂ = A(θ_0)^{−1}. This result agrees with the asymptotic CRLB [5] [19] and implies that for large data records the MLE has the minimum variance among all unbiased estimators.
Corollary 1.2. In the case of a scalar parameter, the quasi-MLE θ̂ is asymptotically normally distributed with mean θ_0 and variance σ² = B(θ_0)/A²(θ_0).
Corollary 1.3. In the case of a scalar parameter, if the proposed model is correct, i.e., P(f; θ*) = Q(f; θ*), then the MLE θ̂ is asymptotically normally distributed with mean θ* and variance σ² = 1/A(θ*), which attains the asymptotic CRLB.

The Statistical Properties of RITE
The RITE likelihood function l_R is given in (3). Assume E_{θ*}(l_R) exists; we define θ_0 to be the value that maximizes E_{θ*}(l_R). The following theorem is valid under the assumption that {cᵀ ∂l_R/∂θ |_{θ=θ_0}} satisfies the Lyapunov condition for any p × 1 vector c at any frequency.
Theorem 2. The estimator θ̂ that maximizes (3) is asymptotically normally distributed with mean θ_0 and covariance matrix

C_θ̂ = A(θ_0)^{−1} B(θ_0) A(θ_0)^{−1},

where A(θ_0) = −E_{θ*}(∂²l_R/∂θ∂θᵀ)|_{θ=θ_0} and B(θ_0) = E_{θ*}[(∂l_R/∂θ)(∂l_R/∂θ)ᵀ]|_{θ=θ_0}. The proof is given in supplementary material A.2. When the model is correct, unlike the MLE case, the equality A(θ_0) = B(θ_0) does not hold. Therefore, the expressions do not simplify for a correct model.
Corollary 2.1. In the case of a scalar parameter, RITE θ̂ is asymptotically normally distributed with mean θ_0 and variance σ² = B(θ_0)/A²(θ_0).

Spectral Estimation Application with AR Model
We assume an AR Gaussian process s[n],

s[n] = −Σ_{k=1}^{p} a[k] s[n − k] + u[n],

where u[n] is the driving noise of the model with variance σ²_u, p is the order of the AR process, and a[k] is the k-th AR coefficient. The AR PSD P(f) is [6]

P(f) = σ²_u / |1 + Σ_{k=1}^{p} a[k] exp(−j2πfk)|².

As we stated in Section 2.3, the PSD is restricted to yield a power of 1, so σ²_u is not actually a parameter, but a function of a[1], a[2], · · · , a[p].
Alternatively, an AR process can be expressed by r[0], which in our case equals 1, and the reflection coefficients k_1, k_2, · · · , k_p, which are restricted to (−1, 1) to guarantee a stable process. The Levinson algorithm transforms the reflection coefficients into the AR parameters [6]. It recursively computes the parameter sets {a_j[1], · · · , a_j[j], ρ_j} for j = 1, 2, · · · , p, where ρ_p is σ²_u. The algorithm is initialized by

a_1[1] = k_1,   ρ_1 = r[0](1 − k_1²).

The recursion for j = 2, 3, · · · , p is

a_j[i] = a_{j−1}[i] + k_j a_{j−1}[j − i],   i = 1, · · · , j − 1,
a_j[j] = k_j,
ρ_j = ρ_{j−1}(1 − k_j²).

In our case, since r[0] = 1, the initialization becomes a_1[1] = k_1, ρ_1 = 1 − k_1². We denote the Levinson recursion with this initialization as the modified Levinson algorithm. Note that the general Levinson algorithm takes r[0], and hence σ²_u, from the data, whereas here σ²_u is fixed by the unit-power constraint.
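The recursion above can be written out directly; a minimal sketch (our naming) with the r[0] = 1 initialization:

```python
import numpy as np

def modified_levinson(k):
    # Reflection coefficients k_1..k_p -> AR parameters and rho_p = sigma_u^2,
    # using rho_0 = r[0] = 1 (the unit-power constraint).
    a, rho = [], 1.0
    for kj in k:
        a = [ai + kj * aj for ai, aj in zip(a, reversed(a))] + [kj]
        rho *= 1.0 - kj ** 2          # rho_j = rho_{j-1} (1 - k_j^2)
    return np.array(a), rho

# AR(2) example: k = [0.5, -0.3] gives a = [0.35, -0.3] and sigma_u^2 = 0.6825.
a, s2 = modified_levinson([0.5, -0.3])
```

Because every |k_j| < 1 keeps each ρ_j positive and shrinking, the recursion also makes explicit why the constraint −1 < k_i < 1 guarantees a stable, positive-power model.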

White Noise Sensitivity of Likelihood Function
Let the observed data be x[n] = s[n] + w[n], where σ²_w is the variance of the observation noise w[n]. To analyze how robust an estimator is, we examine how the white noise affects its likelihood function; the derivative of the expected likelihood with respect to σ²_w serves as the measure of sensitivity. Taking the expected value of the RITE likelihood function, and using the fact that ∫_0^0.5 ln |A(f)|² df = 0 for the minimum-phase polynomial A(f) = 1 + Σ_{k=1}^{p} a[k] exp(−j2πfk), gives an expression for E_{θ*}(l_R) in terms of the true reflection coefficients k_i, whose derivative with respect to σ²_w can then be taken. Doing the same for l_M in (4) and applying Parseval's theorem shows that the derivative for the Gaussian likelihood is far larger in magnitude. Consider a narrow-band process, for which the k_i's may be close to 1. The closer they are to 1, the larger these derivatives are, and the more effect the noise has. The MLE likelihood function is more severely affected by noise when the AR random process is narrow-band. For the AR(4) process employed in the next section, the two sensitivities differ by several orders of magnitude.

Simulation Examples
We consider spectral estimation of an AR process in noise to test our estimators. For a low order process, a grid search over the reflection coefficients is feasible. For a higher order AR process, a numerical optimizer is more efficient, but it only finds a local maximum; hopefully, with a proper initial value, this local solution will also yield the true global solution.
Next we use a grid search for an AR(1) example and fmincon for an AR(4) example.

AR(1) Process Example
The PSD of an AR(1) process is

P(f; θ) = σ²_u / |1 + θ exp(−j2πf)|²,   θ = a[1].

Note that our restriction r[0] = 1 leads to σ²_u = 1 − a²[1]. Hence

P(f; θ) = (1 − θ²) / (1 + θ² + 2θ cos(2πf)).

We generated a Gaussian AR(1) process with a[1] = −0.9. If the observed process is not corrupted by noise, then the MLE (which stands for the asymptotic MLE in this section) and RITE are unbiased estimators but with different variances. In this case, the MLE is the optimal estimator since it has a smaller variance and attains the CRLB; it is proved in supplementary material A.3 that the RITE variance is larger than the MLE variance. By Corollary 1.3, the MLE variance is σ² = (1 − θ²)/N, and the RITE variance follows from Corollary 2.1 and is evaluated numerically, with the required quantities listed in (6) and (7) and Q(f; θ*) = P(f; θ*) + σ²_w. To be fair, RITE and the MLE are both calculated using a fine grid search. As illustrated in Fig. 6, the MSE of RITE grows more slowly than that of the quasi-MLE, and this advantage increases as the SNR decreases. As shown in Fig. 7, although the quasi-MLE has a smaller variance, the bias weakens its performance in the low SNR range. Therefore, when the squared bias exceeds the variance, RITE exhibits more noise robustness than the quasi-MLE. This example verifies Theorems 1 and 2, at least for an AR(1) process in WGN.

AR(4) Process Example

Burg and RITE AR Spectral Estimator
The Burg method is approximately an MLE and therefore is asymptotically unbiased with a variance that attains the CRLB. It has been shown to have good resolution for a narrow-band PSD when the data is not noise corrupted [6].

Conclusion
We have introduced RITE as a new method for PSD estimation. RITE and the quasi-MLE are compared analytically when the data model is misspecified. In particular, the misspecification example of additive noise is studied in detail. Theoretical results have been verified via simulation for a Gaussian AR(1) process in white noise. These studies have established a solid foundation to further our goal of searching for the global maximum, which is the true RITE and which may have an even better performance. Therefore, an efficient method of computing RITE will be explored in future work. It should be emphasized that RITE is a general approach to model-based spectral estimation in the presence of model inaccuracies. Its robustness properties need to be explored for other scenarios in which data models are inaccurate, which is the "rule rather than the exception".

Introduction
Data fusion for estimation is the problem of using information from multiple data sets to estimate an unknown parameter or vector. These data sets usually come from multiple sources, for example, multiple sensors. In this case, the problem is called distributed estimation. It uses data from multiple sensors and a fusion center (FC), or central processor, to achieve a more accurate estimate than a single sensor observation can provide [1]. It has been actively researched for decades. In the late 1970s, researchers started with optimal fusion to reconstruct the global estimate [2]. Around 2000, information processing in wireless ad hoc sensor networks became a very active area [3] [4] [5]. Communication is expensive in these networks, so distributed estimation methods that require only local network information and minimal communication are preferred. More recently, interest has focused on distributed estimation algorithms that handle process noise [6] [7] [8]. Most of those studies assumed that the local estimates are unbiased. However, this is not always the case in the real world. For example, when the noise or interference is located close to one sensor but far away from the others, that sensor may generate outliers due to its low signal-to-noise ratio (SNR), which causes bias in the estimation. Currently, there are few investigations that assume the local estimates are not ideal.
This paper will use multi-sensor estimation as an example to illustrate our method. However, our method is not limited to the distributed estimation problem.
It can also be used for data from the same source at different time intervals. Here we focus on the algorithm at the central processor, which merges all the local estimates, and on its robustness against noise. An obvious way to integrate the data is simply to average the estimates from all sensors. This method doesn't require any additional information and is easy to implement. However, it is not a robust estimator: when one or some of the sensors have low SNRs, this affects not only the local estimates but also the final estimate at the central processor.
A robust estimator should be able to reduce the effects of poor local estimates. In the real world, poor SNR conditions are very likely to occur and may have multiple causes; for example, some sensors, compared to others, may suffer larger noise interference or be farther from the target. The SNR information is ignored in the averaged estimator, but it is important and can be used to weight the local estimates. Two experiments are carried out to test our estimators. Note that, below a certain SNR threshold, the mean square error (MSE) no longer follows the Cramer-Rao lower bound (CRLB), and outliers show up with some probability [12]. Those outliers are the main factor that causes the rapid increase in MSE.
Based on our simulation results, the final estimator that uses the Fisher information matrix (FIM) and the one that uses the observed FIM do not differ noticeably. However, they both reduce the number of outliers and show improved performance over the averaged estimator in terms of MSE.
The paper is structured as follows. Section 2 gives a detailed description of the problem and our method, and reveals the relationship between the overall log likelihood function and our new estimators. Section 3 presents simulation results showing that our estimators outperform the averaged one. Section 4 concludes and discusses future work.

Problem Statement
Consider M sensors linked with a fusion center. Each sensor observes a real-valued vector x_i[n], n = 0, 1, · · · , N − 1, that consists of a deterministic signal s[n; θ] and white Gaussian noise w_i[n] ∼ N(0, σ²_i). Here we assume that the σ²_i's are all known, θ is the unknown parameter to be estimated, and the w_i's are independent across the sensors.

Maximum likelihood estimation (MLE) is widely used in practice, since it is asymptotically optimal for large data records [11]. If we put aside the communication cost and consider only the estimation performance, then for large data records, the overall MLE that uses the data from all the sensors attains the CRLB and gives an efficient estimate. Denote the overall log likelihood as l and the log likelihood at the i-th sensor as l_i. We have l = Σ_{i=1}^{M} l_i and hence ∂l/∂θ = Σ_{i=1}^{M} ∂l_i/∂θ.

Taylor Expansion of the Log Likelihood Function
By doing a Taylor expansion about the point θ = θ̂_i, which is the MLE at the i-th sensor, l_i can be written as

l_i(θ) = ln p(x_i; θ̂_i) + (1/2) ∂²ln p(x_i; θ)/∂θ² |_{θ=θ̂_i} (θ − θ̂_i)² + · · · ,

where the first order term vanishes because θ̂_i maximizes l_i. Note that l = Σ_{i=1}^{M} l_i, so the overall log likelihood function can be expressed as

l(θ) = Σ_{i=1}^{M} ln p(x_i; θ̂_i) + (1/2) Σ_{i=1}^{M} ∂²ln p(x_i; θ)/∂θ² |_{θ=θ̂_i} (θ − θ̂_i)² + · · · .

Since the first term Σ_{i=1}^{M} ln p(x_i; θ̂_i) involves only the data, finding the maximum of l is equivalent to finding the maximum of the remaining terms.

Second Order Taylor Expansion
With the second-order Taylor approximation of the log likelihood function, the "approximated" MLE is the one that maximizes

\frac{1}{2}\sum_{i=1}^{M} \frac{\partial^2 \ln p(x_i;\theta)}{\partial\theta^2}\Big|_{\theta=\hat\theta_i}(\theta-\hat\theta_i)^2.

Setting the derivative with respect to \theta to zero, the maximizer of the approximated log likelihood function is

\hat\theta = \frac{\sum_{i=1}^{M}\beta_i\hat\theta_i}{\sum_{i=1}^{M}\beta_i}, \qquad \beta_i = -\frac{\partial^2 \ln p(x_i;\theta)}{\partial\theta^2}\Big|_{\theta=\hat\theta_i}.

In this case, the second derivatives \partial^2 \ln p(x_i;\theta)/\partial\theta^2|_{\theta=\hat\theta_i} and the estimates \hat\theta_i are enough to reconstruct an approximated log likelihood function at the central processor. The approximation may sacrifice some performance, but the communication cost can be greatly reduced.

Higher Order Taylor Expansion
By using a third- or higher-order expansion to represent the log likelihood function, we may approximate l better in a region around the \hat\theta_i. Unlike the second-order expansion, however, a higher-order expansion does not guarantee a concave function of \theta, and the maximum may be difficult to find. When this happens, the higher-order expansion can produce erroneous estimates. Hence, in practice, the second-order expansion should be more useful than higher-order expansions. A more detailed discussion can be found in Appendix B.

Definition of Proposed Estimators
Without considering the communication cost, for a large data record the optimal estimator is the overall maximum likelihood estimator, which maximizes l. A simple fusion rule is the averaged estimator

\hat\theta_A = \frac{1}{M}\sum_{i=1}^{M}\hat\theta_i,

whose weighting factor is a constant in \theta and x. This estimator is easy to compute. However, when one or more sensors generate poor estimates, the performance of the final estimator \hat\theta_A degrades. To improve the performance, we introduce the new estimator \hat\theta_O, the maximizer of the overall log likelihood approximated by the second-order Taylor expansion at the points \hat\theta_i:

\hat\theta_O = \frac{\sum_{i=1}^{M}\beta_i\hat\theta_i}{\sum_{i=1}^{M}\beta_i}, \qquad \beta_i = -\frac{\partial^2 \ln p(x_i;\theta)}{\partial\theta^2}\Big|_{\theta=\hat\theta_i},

where \beta_i is the observed Fisher information at the i-th sensor. If the Fisher information I(\theta) is available in closed form, then we do not need to compute \beta_i for each sensor, since \beta_i depends on the data x_i. In this case, the weighting factor \beta_i is replaced by \alpha_i = I(\hat\theta_i), the Fisher information evaluated at \hat\theta_i. We define this estimator as

\hat\theta_F = \frac{\sum_{i=1}^{M}\alpha_i\hat\theta_i}{\sum_{i=1}^{M}\alpha_i}, \qquad \alpha_i = I(\hat\theta_i);

the F in \hat\theta_F is short for Fisher information. The Cramer-Rao lower bound shows that the minimum variance of any unbiased \hat\theta is I^{-1}(\theta), which indicates the accuracy of the optimal estimator [11]. The smaller the variance, the more reliable the estimator. Therefore it is reasonable to use I(\hat\theta_i) as the weighting factor in front of \hat\theta_i. It has not been proven whether the FIM or the observed FIM is better. The observed FIM is used in many studies [13][14]; it may be more practical since the expectation of \partial^2 \ln p(x;\theta)/\partial\theta^2 is not needed.
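All three fusion rules reduce to weighted averages of the per-sensor MLEs. A minimal sketch (the names are ours; in practice the weights for \hat\theta_O would be the per-sensor curvatures \beta_i and those for \hat\theta_F the closed-form Fisher informations \alpha_i):

```python
import numpy as np

def fuse_average(theta_hats):
    """theta_A: every sensor gets the same weight 1/M."""
    return float(np.mean(theta_hats))

def fuse_weighted(theta_hats, weights):
    """theta_O (weights = observed FIM, beta_i) or
    theta_F (weights = FIM, alpha_i = I(theta_hat_i))."""
    w = np.asarray(weights, dtype=float)
    t = np.asarray(theta_hats, dtype=float)
    return float(np.sum(w * t) / np.sum(w))

theta_w = fuse_weighted([0.1, 0.5], [99.0, 1.0])  # dominated by sensor 1
```

With equal weights the weighted rule reduces to the plain average, and a high-information sensor pulls the fused estimate toward its own value, which is exactly the mechanism that suppresses poor estimates.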

Simulation Results
In this section, we use the sinusoid frequency estimation problem as an example. Assume we have data x_1, x_2, x_3 from 3 sensors. At each sensor

x_i[n] = s[n; f_0] + w_i[n], \qquad s[n; f_0] = \sin(2\pi f_0 n + \phi),

where w_i[n] is white Gaussian noise (WGN) with variance \sigma_i^2, independent from sensor to sensor. Here \phi is known, and f_0 is the unknown parameter, assumed not to be near 0 or 0.5. To calculate the FIM and the observed FIM, we need the log likelihood function and its second derivative with respect to f_0. Using the standard large-N approximation, the Fisher information at the i-th sensor is

I_i(f_0) \approx \frac{2\pi^2 N^3}{3\sigma_i^2},

which varies linearly with the SNR, or 1/\sigma_i^2. Note that, since the data from different sensors are independent, the overall FIM is the sum of the per-sensor FIMs. Together with the two new estimators, we have four estimators to compare: the overall MLE \hat f_M, the averaged estimator \hat f_A, and the weighted estimators \hat f_O and \hat f_F. Here \hat f_0^{(i)} is the MLE at the i-th sensor, which maximizes the per-sensor log likelihood, and \alpha_i and \beta_i are the weighting factors of \hat f_0^{(i)}. Since 2\pi^2 N^3/3 is constant from sensor to sensor, we simplify \alpha_i to 1/\sigma_i^2. We test the estimators in two cases. In Case 1, one of the sensors has an extremely low SNR compared to the other two; in the real world this represents one sensor (or several) being heavily corrupted by noise while the rest work properly. In Case 2, the SNRs of all the sensors decrease at the same rate; in practice this may be due to a weakening signal or a departing target. The true parameter values used in this section are f_0 = 0.1 and \phi = 0.1\pi.
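A per-sensor frequency MLE can be obtained by minimizing the squared error between the data and the known-phase sinusoid over a grid of frequencies, which is equivalent to maximizing the Gaussian log likelihood. The grid below is our own illustrative choice; the paper's exact search method is not specified:

```python
import numpy as np

def freq_mle(x, phi, grid):
    """Grid-search MLE of f0 for x[n] = sin(2*pi*f0*n + phi) + WGN."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    sse = [np.sum((x - np.sin(2 * np.pi * f * n + phi)) ** 2) for f in grid]
    return float(grid[int(np.argmin(sse))])

# Noise-free sanity check: the search recovers the true frequency.
N, f0, phi = 64, 0.1, 0.1 * np.pi
x = np.sin(2 * np.pi * f0 * np.arange(N) + phi)
grid = np.linspace(0.01, 0.49, 481)  # avoids the endpoints 0 and 0.5
f_hat = freq_mle(x, phi, grid)
```

The grid deliberately excludes the neighborhoods of 0 and 0.5, matching the assumption on f_0 above.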

Case 1
Let SNR_i = 1/(2\sigma_i^2) denote the SNR at the i-th sensor. We fix the SNRs of the first two sensors at 1 dB and 0 dB, and let SNR_3 vary from -10 dB to -1 dB. For each SNR_3, we run 10,000 simulations. Fig. 13 shows how the MSE (dB) changes with SNR_3. \hat f_M is uniformly the best in the simulation. \hat f_A has a larger MSE than the other two estimators over the whole range, not only below the threshold SNR_3 = -2 dB but also above it. The MSEs of \hat f_O and \hat f_F are smaller than that of \hat f_A. \hat f_F has a slight advantage over \hat f_O when SNR_3 is below -9 dB, whereas for -9 dB < SNR_3 < -2 dB, \hat f_O shows a smaller MSE than \hat f_F. We cannot conclude from one experiment which of the two is better, but both outperform \hat f_A regardless of the value of SNR_3. The improvement in MSE results from the reduction of outliers for \hat f_O and \hat f_F. Note that in Fig. 13(b), the gap between \hat f_A and the other estimators narrows at SNR_3 = 0, 1 dB. This is because the SNR_i are then close to each other, so the FIM and observed FIM weights are close to 1/3, the weighting factor of \hat f_A.
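The qualitative behavior of Case 1 can be reproduced with a scaled-down Monte Carlo sketch (trial count, grid resolution, and data length are far smaller than the paper's 10,000-run experiment, and the FIM weights are simplified to 1/\sigma_i^2 as above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, f0, phi = 64, 0.1, 0.1 * np.pi
n = np.arange(N)
s = np.sin(2 * np.pi * f0 * n + phi)
grid = np.linspace(0.01, 0.49, 241)
snr_db = np.array([1.0, 0.0, -10.0])           # sensor 3 heavily corrupted
sigma2 = 1.0 / (2.0 * 10.0 ** (snr_db / 10))   # SNR_i = 1/(2 sigma_i^2)

def freq_mle(x):
    sse = [np.sum((x - np.sin(2 * np.pi * f * n + phi)) ** 2) for f in grid]
    return grid[int(np.argmin(sse))]

err_avg, err_fim = [], []
for _ in range(200):
    f_hats = np.array([freq_mle(s + rng.normal(0, np.sqrt(v), N))
                       for v in sigma2])
    w = 1.0 / sigma2                  # alpha_i up to a common constant
    err_avg.append(np.mean(f_hats) - f0)
    err_fim.append(np.sum(w * f_hats) / np.sum(w) - f0)

mse_avg = float(np.mean(np.square(err_avg)))
mse_fim = float(np.mean(np.square(err_fim)))
```

Because the low-SNR sensor produces outliers that are nearly uniform over the grid, downweighting it by 1/\sigma_3^2 removes most of their contribution to the fused MSE, which is the effect reported above.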
For more detailed analysis of the estimations at the third sensor and the MSE curve, please refer to the supplementary material.

Case 2
We fix the initial SNRs at the three sensors at {0, 10, 20} dB. The advantage of \hat f_O and \hat f_F can be seen in Fig. 14(b), since the number of outliers is greatly reduced when the SNR is above a certain threshold. It is not obvious whether \hat f_O or \hat f_F is better, but \hat f_O extends the threshold to -4 dB rather than -3 dB. Both, however, work better than \hat f_A.

Conclusion and Future Work
We have proposed two estimators, \hat\theta_O and \hat\theta_F, for the distributed estimation problem. \hat\theta_O turns out to be the maximizer of the second-order Taylor expansion of the overall log likelihood function. In Section 3, we compared the new estimators with the averaged estimator \hat\theta_A in terms of MSE. Based on the simulation results, our estimators have clear advantages over \hat\theta_A, especially when the SNR at one or more sensors is much lower than at the others. This improvement is due to the use of the FIM or the observed FIM: since the weighting factors \alpha_i or \beta_i are smaller for low-SNR sensors than for high-SNR sensors, the effect of poor estimates is reduced in \hat\theta_O and \hat\theta_F, yielding better performance in terms of MSE.
We defined and simulated the scalar case only; the approach can potentially be extended to the vector case. In this paper, we assume the data come from independent sensors. This assumption can be relaxed to, for example, independent data sets that come from the same sensor but at different time intervals. Also, the \sigma_i^2 are assumed known here; in the future, we may study the case where the \sigma_i^2 are unknown.

CHAPTER 4 Future Work
Some of the assumptions and methods used in this research can be extended in the future: • An efficient way of finding the solution of the RITE estimator should be explored, given its robustness against additive noise in the spectral estimation problem.
• The AR model is the one we used when applying RITE to spectral estimation, but the estimator is not limited to AR models. Other models can be tried as well; for example, the exponential model, which reduces the solution-finding problem to a convex optimization problem.
• As for the data fusion estimation problem studied in Chapter 3, we defined and simulated the scalar case only. It can potentially be extended to the vector case.
• Also, in Chapter 3, the \sigma_i^2 are assumed to be known. In the future, we may study the case where the \sigma_i^2 are unknown, which might be more practical in real-world applications.
Define \theta_0 to be the maximizer of E_{\theta^*}(l_M). We make two assumptions: 1) l_M is differentiable with respect to \theta; 2) {c \cdot \partial l_M/\partial\theta |_{\theta=\theta_0}} satisfies the Lyapunov condition for any constant vector c at any frequency.
If the first assumption holds, then \hat\theta is the solution of \partial l_M/\partial\theta = 0. For a large data record, \partial l_M/\partial\theta can be approximated by its first-order Taylor expansion around \theta_0, and \partial l_M/\partial\theta |_{\theta=\theta_0} is multivariate normally distributed. The proof is as follows. By the second assumption, the Lyapunov ratio tends to 0 for some positive exponent, so the Lyapunov central limit theorem applies. Since c^T \partial l_{M_i}/\partial\theta |_{\theta=\theta_0} is asymptotically normally distributed for any c, it is equivalent to say that the suitably normalized sum of the \partial l_{M_i}/\partial\theta |_{\theta=\theta_0} has a multivariate normal distribution.
The (u, l)-th element of B(\theta) can be expressed accordingly; by the weak law of large numbers, it converges in probability to its expectation. By (A.1), (A.2), and Slutsky's theorem, the asymptotic normality of \hat\theta follows.

A.2 Proof of Theorem 2
Similarly to the asymptotic MLE, the RITE estimator \hat\theta is asymptotically multivariate normally distributed as N goes to infinity. Here the likelihood function to be maximized is l_R. Denote by \theta_0 the maximizer of E_{\theta^*}(l_R). Assume the two assumptions hold: 1) l_R is differentiable with respect to \theta; 2) {c \cdot \partial l_R/\partial\theta |_{\theta=\theta_0}} satisfies the Lyapunov condition for any constant vector c at any frequency.
If the first assumption holds, i.e., l_R is differentiable with respect to \theta, then \hat\theta is the solution of \partial l_R/\partial\theta = 0. When N is large enough, \partial l_R/\partial\theta can be approximated by expanding \partial l_R/\partial\theta |_{\theta=\hat\theta} around \theta_0; the resulting score term is multivariate normally distributed, with a proof similar to that in A.1. By the weak law of large numbers, the Hessian term converges in probability to its expectation.

By (A.3), (A.4), and Slutsky's theorem, we obtain the asymptotic distribution of the RITE estimator.
For the univariate, correctly specified model, the variance of RITE and the variance of the MLE can be compared using the Cauchy-Schwarz inequality,

\Bigl(\int f(x)g(x)\,dx\Bigr)^2 \le \int f^2(x)\,dx \int g^2(x)\,dx.

If we let f(x) = P(f;\theta_0)\,\frac{\partial \ln P(f;\theta)}{\partial\theta}\big|_{\theta=\theta_0} and g(x) = \frac{\partial \ln P(f;\theta)}{\partial\theta}\big|_{\theta=\theta_0}, the inequality becomes

\frac{\int P^2(f;\theta_0)\bigl(\frac{\partial \ln P(f;\theta)}{\partial\theta}\big|_{\theta_0}\bigr)^2 df}{\Bigl(\int P(f;\theta_0)\bigl(\frac{\partial \ln P(f;\theta)}{\partial\theta}\big|_{\theta_0}\bigr)^2 df\Bigr)^2} \;\ge\; \frac{1}{\int \bigl(\frac{\partial \ln P(f;\theta)}{\partial\theta}\big|_{\theta_0}\bigr)^2 df}.

The left side is proportional to the variance of RITE, and the right side to the variance of the MLE. So the MLE always has a variance no larger than that of RITE.

A.4 Spectral Estimation for AR Process in Noise Modeled by AR
The AR process s[n] is corrupted by noise such that the observed data x[n] contain an additive noise term scaled by \alpha; the scalar \alpha reflects the SNR. In Sections A.4 and A.5, the data length N is 350.

A.4.A Mixture Gaussian White Noise
s[n] is an AR(4) process with the same parameters as in the main paper, so the variance of the noise is 51.5. Simulation results for the AR process corrupted by mixture Gaussian white noise are shown in Figs. A1 to A4.

A.4.B White Laplacian Noise
This section presents the simulation results for the AR(4) process corrupted by white Laplacian noise (shown in Figs. A5 to A8).

A.4.C α-Stable Modeled Impulsive Noise
The \alpha-stable distribution can model impulsive noise. It is denoted by S(\alpha, \beta, \gamma, \delta), where \alpha is the characteristic exponent, \beta the skewness, \gamma the scale or dispersion parameter, and \delta the location. Since the power of an \alpha-stable process is not defined, the conventional SNR cannot be used. We define the modified SNR to be:

A.5 Spectral Estimation for AR process in Noise Modeled by ARMA
Instead of using an AR model for an AR process in noise, we can use an ARMA model for spectral estimation; this model represents the data better.
However, unlike the AR model, ARMA parameter estimation via the maximum likelihood criterion requires minimizing a nonlinear function and is therefore not computationally efficient. The least-squares modified Yule-Walker equations (LSMYWE) method is a suboptimal estimator, but it can be easily implemented.
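The LSMYWE idea can be sketched as follows: for an ARMA(p, q) process the autocorrelation satisfies the Yule-Walker recursion for lags k > q, so the AR coefficients can be found by least squares over an overdetermined set of those equations. A minimal sketch (the lag counts and the biased autocorrelation estimator are our choices):

```python
import numpy as np

def lsmywe_ar(x, p, q, num_eqs=None):
    """Least-squares modified Yule-Walker estimate of the AR part a_1..a_p
    of an ARMA(p, q), from r[k] + sum_i a_i r[k-i] = 0 for k > q."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    N = len(x)
    M = num_eqs if num_eqs is not None else 2 * p   # extra equations
    maxlag = q + M
    # biased autocorrelation estimates r[0..maxlag]
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(maxlag + 1)])
    A = np.array([[r[abs(k - i)] for i in range(1, p + 1)]
                  for k in range(q + 1, q + M + 1)])
    b = -r[q + 1: q + M + 1]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# Sanity check on a pure AR(2): x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n],
# i.e. a_1 = -1.5, a_2 = 0.7 in the convention above (q = 0).
rng = np.random.default_rng(0)
x = np.zeros(20000)
w = rng.normal(size=20000)
for t in range(2, 20000):
    x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + w[t]
a = lsmywe_ar(x, p=2, q=0)
```

Using more than the minimum number of equations (here 2p) and solving in the least-squares sense is what distinguishes LSMYWE from the plain modified Yule-Walker solution.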

A.5.A White Gaussian Noise
If the noise is white noise, then the true model for the noise corrupted AR(p) process is ARMA(p,p). In this section, we use ARMA(4,4) via LSMYWE for spectral estimation of the AR(4) process in WGN. Due to less error in modeling, results are better than using AR(4) model via Burg method (Fig. 4, 5 in main paper), and are shown in Fig.A13, A14.

A.5.B α-Stable Modeled Impulsive Noise
Since the \alpha-stable noise is not white Gaussian noise, it is not clear how to choose a proper order for the MA part. One could use a model order selection criterion; here we simply use ARMA(4,4). Results are shown in Figs. A15, A16. They are not as good as in A.5.A because the order of the MA part is incorrect. We also provide RITE estimates using the AR(4) model and LSMYWE as the initial value for fmincon (Figs. A17, A18). As we can see, even at SNR = 15 dB, when the LSMYWE PSDs are all flattened, RITE is able to resolve the peaks occasionally. Since the fmincon solution of RITE depends heavily on the initial value, the flattened RITE PSDs are probably local solutions rather than the true RITE; the true RITE may give even more promising results.

A.6 Proof of the Statement on Page 23
The Kullback-Leibler divergence is always non-negative.

B.1 Estimations at the Third Sensor
At the third sensor, when the SNR is below a certain threshold, \hat f^{(3)} produces outliers. The outliers lead to a large increase in MSE. Unlike \hat f_A, the other two estimators \hat f_F and \hat f_O have their outliers centered around the true value f = 0.1. This is because \alpha_3 and \beta_3 are much smaller than \alpha_1, \alpha_2 and \beta_1, \beta_2, so the poor estimates of \hat f^{(3)} do not affect the overall \hat f_F and \hat f_O much, but do affect \hat f_A. Besides, in Fig. B1, the good estimates of \hat f_A have larger variance than those of \hat f_O and \hat f_F, which also makes a difference in the MSE.

B.2 Higher Order Taylor Expansion
According to the simulation results, the 3rd- and 4th-order Taylor expansions are not as good as the 2nd-order expansion, because the higher-order expansions generate some wrong estimates at 0.5 (the true value is 0.1).
The Taylor expansion approximates a function well in a neighborhood of the given point, but might be quite far from the true function in other regions. In our case, we want to find the maximum of the overall log likelihood. If we take the second-order Taylor expansion at the given point (the MLE), it is guaranteed to be a concave quadratic with a single maximum (at the given point, the MLE). If we take a higher-order expansion instead, the function may be approximated more accurately near the given point, but at the endpoints 0 or 0.5 the expansion can be far from the true value and may attain its maximum there; using that maximum gives a wrong solution.
Below are figures of the true log likelihood and its Taylor expansions for one realization. In Fig. 3, we can see that the higher orders approximate the nearby region more accurately. However, in Figs. 1 and 2, the 3rd- and 4th-order expansions both have their maximum at 0.5, which results in a wrong estimate. This does not happen in all realizations, but when it does, it degrades the performance of the higher-order expansion.
So among the 2nd-, 3rd-, and 4th-order expansions, the best order is 2. If the expansion order were high enough to represent the true log likelihood function exactly, performance might improve, but that is not possible in practice.
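The concavity argument can be checked numerically: at a discrete maximizer of the sampled log likelihood the second difference is non-positive, so the quadratic expansion built from it is concave with its unique maximum at the MLE itself. A small sketch on one noisy realization (the grid, seed, and SNR are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, f0, phi, sigma2 = 64, 0.1, 0.1 * np.pi, 0.5   # SNR = 1/(2 sigma2) = 0 dB
n = np.arange(N)
x = np.sin(2 * np.pi * f0 * n + phi) + rng.normal(0, np.sqrt(sigma2), N)

grid = np.linspace(0.01, 0.49, 481)
h = grid[1] - grid[0]
# log likelihood up to an additive constant: -SSE / (2 sigma2)
ll = np.array([-np.sum((x - np.sin(2 * np.pi * f * n + phi)) ** 2)
               for f in grid]) / (2 * sigma2)

k = int(np.argmax(ll))                                 # grid MLE
curv = (ll[k - 1] - 2 * ll[k] + ll[k + 1]) / h ** 2    # second difference <= 0
# Quadratic (2nd-order) expansion about the MLE: concave, single maximum.
q = ll[k] + 0.5 * curv * (grid - grid[k]) ** 2
```

The quadratic q can only peak at the MLE, whereas adding 3rd- or 4th-order terms can push the approximation above q near the endpoints and create the spurious boundary maxima described above.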

B.3 More Simulation Examples
In this section we use more examples to show the performance of the proposed estimators. Since the observed FIM and the FIM perform similarly, we focus on the difference between the averaged estimator \hat f_A and the FIM-weighted one \hat f_F. The SNRs of the three sensors are [-6.94, 0.45, -0.86] dB. \hat f_A is shown in red and \hat f_F in blue. Over all tested frequencies f_0 (Fig. B6), \hat f_F has a smaller MSE than \hat f_A. The MSE of both estimators decreases as f_0 moves closer to 0.25 (Figs. B6, B9). As we can see from Figs. B7 and B8, \hat f_F has smaller variance and smaller bias compared to \hat f_A.