RELIABILITY AND PHYSIOLOGICAL VALIDITY: BAYESIAN PRINCIPAL COMPONENTS ANALYSIS AND GROUP COMPARISON

Statement of the problem: Recent years have seen increasing study of subjective happiness, often assessed by self-reported questionnaires (Gable & Haidt, 2005). The underlying physiological mechanisms of happiness are less well understood. This analysis examines the reliability and validity of scales of self-reported subjective well-being in relation to biomarkers of stress responsiveness, to identify items that relate to an underlying physiology of happiness. Methods: A principal components model is employed. Bayesian model estimation is used to address the small sample size (N = 20). Additionally, sampling was clustered, including a subset of undergraduate students (n = 10) and a subset of graduate students (n = 10). To account for this structure in the data, group mean differences are estimated and compared. This improves model estimation and allows for a mean difference comparison that offers insight into possible sources of confounding. Three scales are included, measuring self-reported life events, life satisfaction, and happiness. Three biomarkers are included: cortisol (arousal signaler), interleukin-6 (inflammation cytokine), and global DNA methylation (regulator of protein expression including cortisol and interleukin-6). Summary of results: All scales demonstrated internal structure validity and internal consistency reliability. Biomarkers demonstrated meaningful association with all scales; however, in one case the direction of association was counter to expectations (life satisfaction was positively associated with interleukin-6). Most group difference tests were nonsignificant, indicating a lack of evidence of heterogeneity of measurement between groups. Some scale items did demonstrate a significant difference. Implications for scale use, and of this procedure for measurement theory, are discussed.

There has been some research into physiological correlates of happiness. Associations have been identified between depression and heightened stress responsiveness (Schneider et al., 2018; Weinstein et al., 2010; Young, Lopez, Murphy-Weinberg, Watson, & Akil, 2000). Participants with a more positive outlook and coping skills tended to release less cortisol when exposed to an experimental stress condition than participants with less positive attitudes (Taylor, Lerner, Sherman, Sage, & McDowell, 2003). Within the central nervous system, research has suggested that dopamine and oxytocin may be molecular signalers of subjective experience (Fredrickson, 2013; Tanzer, 2019b; Weems, 2011). Ratings on questionnaires for life satisfaction have been associated with reduced activity in the amygdala (Waldinger, Kensinger, & Schulz, 2011). A meta-analysis of neuroimaging studies of happiness also identified activity in the amygdala, as well as the insula, as discriminating between subjective phenomena of happiness. A more pleasurable sense of happiness tended to co-occur with activity in the amygdala, whereas a satisfying experience of happiness tended to co-occur with reduced activity in the amygdala and increased activity in the insula (Tanzer & Weyandt, 2019). The amygdala tends to activate during fear and anger and transduces the release of stress response biomarkers, while the insula tends to activate during disgust (Rosenzweig, Leiman, & Breedlove, 1996; Sapolsky, 2017). It was suggested that differing aspects of subjective happiness may incorporate differing patterns of physiological stress responsiveness (Tanzer & Weyandt, 2019).
To explore and enable further research into this stress responsiveness hypothesis, this analysis considers scales of subjective well-being in relation to indicators of physiological arousal.
The Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985), Hassles and Uplifts Scale (DeLongis, Folkman, & Lazarus, 1988; Bolt, 2001), and the Personal Fulfillment Inventory (Tanzer, 2019a) are examined using principal components analysis (Harlow, 2014). In addition to scale items, three biomarkers of physiological stress responsiveness are included: cortisol (glucocorticoid stress response signaler; Malven, 1993), interleukin-6 (inflammation cytokine; Malven, 1993), and global DNA methylation (epigenetic regulator of the expression of cortisol and interleukin-6; Koonin, 2012; Wang et al., 2012). Based on the principles of psychometrics, this will allow for an evaluation of the internal structure of the scales, and of whether the items are associated with differences in personal physiologies (Furr, 2017). Additionally, internal consistency reliability can be estimated.
Recommendations for scale use and interpretation are provided.
Principal components analysis is a multivariate analysis to examine shared variation between a set of variables (Harlow, 2014). The correlations are estimated between each observed variable and an estimated latent variable, representing an underlying trait that explains shared variation. There can be as many latent variables estimated as there are variables included in the analysis. The function that defines the latent variable is based on the eigenvalues and eigenvectors of the variance covariance matrix of the variables in the analysis. This is a commonly used method of psychometric research because it summarizes common variation across a set of scale items with much greater parsimony than a variance covariance matrix (Furr, 2017). The relationship between each item and the underlying latent trait being measured can be used as a diagnostic check of scale functionality, and to identify items that do not contribute meaningful variance to the scale. Including biomarkers in a principal components analysis of questionnaires allows for the identification of the items that share variation with physiological stress responsiveness. This provides insight into the construct validity for the scales, demonstrating which scale items may correspond to biological aspects of scale results. Additionally, based on the values of the weights, scale internal consistency can be estimated, to further evaluate how reliable they are.
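As an illustration of the eigenvalue and eigenvector computation described above, the following is a minimal sketch in Python with NumPy. It is not the analysis code used in this study: the function name `pca_loadings` and the simulated items are hypothetical, and the sketch assumes the variables are standardized so that the decomposition is applied to the correlation matrix.

```python
import numpy as np

def pca_loadings(data):
    """Return eigenvalues and loadings from a PCA of the correlation matrix.

    data: array of shape (n_observations, n_variables).
    """
    # Standardize each column to Z scores (mean 0, SD 1).
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Covariance of standardized variables is the correlation matrix.
    r = np.cov(z, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(r)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings: correlation of each variable with each component.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, loadings

rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))                # hypothetical shared trait
items = trait + 0.5 * rng.normal(size=(200, 4))  # four hypothetical items
eigvals, loadings = pca_loadings(items)
```

In this simulated example the first column of `loadings` plays the role of the item weights on the primary latent dimension; the sign of each column is arbitrary, as with any eigenvector.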
Recommendations are often provided for behavioral research using principal components analysis, suggesting that the number of observations in a sample should be much larger than the number of variables in the analysis (Harlow, 2014; Stevens, 2012). While larger samples are more reliable, the time and expense of data collection may be unrealistic for some fields of research, such as with the current analysis of saliva samples (Shaukat, Rao, & Khan, 2016; van de Schoot & Miocević, 2020). Simulation has demonstrated that principal components analysis can provide valid inferences when samples are small so long as the effect size is sufficient (Shaukat, Rao, & Khan, 2016; van de Schoot & Miocević, 2020).
While this analysis may provide some insight into the physiological and psychometric properties of measures used in psychology research, a primary goal of this analysis is methodological.
Due to limitations in funding, this pilot study is small (N = 20), a common problem when data collection is time consuming and expensive (van de Schoot & Miocević, 2020). Additionally, to ensure that there was some range of ages within the dataset, sampling was stratified, including a group of undergraduate students (n = 10) and a group of graduate students (n = 10). This improved the generalizability of the dataset given the sampling limitations; however, it risks violating the assumption of uncorrelated errors for the principal components model.
To address these concerns, a Bayesian model is used (Hoff, 2009; van de Schoot & Miocević, 2020). A Bayesian model is a natural fit for these data. It improves on the standard frequentist methods because Bayesian methods are more robust to small sample bias and can incorporate information from historical, larger samples. Further, because Bayesian methods are hierarchical by design, dependencies within the data induced by clustered sampling are easier to address statistically. Whereas frequentist methods assume that parameters are fixed to a single value, Bayesian methods assume that parameters are random, like samples of data (Hoff, 2009). Because of this, statistical significance of parameters can easily be assessed by examining credible intervals.
The posterior distribution, a function which accounts for variation in the observed dataset and the specified prior probability distribution, can be sampled from to estimate the range of probable values parameters may take. This makes estimation from small samples much more robust (Hoff, 2009; van de Schoot & Miocević, 2020).
In accounting for the clustered sampling, previous research has explored this topic, especially within confirmatory factor analysis methodology (Ansari & Jedidi, 2000; Can, van de Schoot, & Hox, 2015; Geldhof, Preacher, & Zyphur, 2014; Goldstein & Browne, 2002). This analysis seeks to extend this to a principal components analysis approach. This improves model estimation by addressing the known structure to the data, and allows for inference of between groups differences (e.g., undergraduate and graduate samples). Because Bayesian inference is based on sampling from a posterior distribution, the results between these two groups can be directly compared by subtracting paired samples. This provides a further check of construct validity by identifying items that have unmodeled differences that may contribute to confounding.
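The comparison by subtracting paired posterior samples can be sketched as follows. This is a hedged illustration rather than the study's code: the helper `difference_interval` is hypothetical, and the two arrays of draws are simulated stand-ins for posterior samples that would in practice come from the fitted model.

```python
import numpy as np

def difference_interval(draws_a, draws_b, level=0.95):
    """Quantile-based credible interval for a - b from paired posterior draws."""
    diff = np.asarray(draws_a) - np.asarray(draws_b)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(diff, [tail, 1.0 - tail])
    return diff.mean(), lower, upper

rng = np.random.default_rng(1)
# Simulated stand-ins for posterior draws of one parameter in each group.
grad = rng.normal(loc=1.5, scale=0.3, size=2500)
undergrad = rng.normal(loc=0.0, scale=0.3, size=2500)

mean_diff, lower, upper = difference_interval(grad, undergrad)
# The difference is treated as credible when the interval excludes zero.
excludes_zero = bool(lower > 0.0 or upper < 0.0)
```

Subtracting draw by draw keeps the full posterior of the difference, so any summary (mean, quantiles, probability of a positive difference) can be read from the same array.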

Participants
A total sample of 20 students was collected for this pilot study. Sample characteristics are reported in Table 1. Most participants identified as cisgender female (17 participants, 3 cisgender male) and white (16 white, 3 African American, 1 Asian). Psychology students were recruited and compensated with extra course credit. Specifically, participants were 10 undergraduate students and 10 graduate students. Sampling was stratified to ensure that there was variation in ages, which ranged from 19 to 26 years (Mean = 22.89 years, SD = 1.05 years). A student sample was selected because of the large personal developments occurring at that time of life. There are large neurological changes that occur in late adolescence through ages 20 to 30 (Rosenzweig, Leiman, & Breedlove, 1996), in addition to the social and professional changes during early adulthood (Robinson, 2013).
This allowed for an analysis of subjective well-being in participants across the range of early adult biopsychosocial development.

Materials
Scales included were the Hassles and Uplifts Scale (40 items), the Satisfaction with Life Scale (5 items), and the Personal Fulfillment Inventory (36 items). The Hassles and Uplifts Scale (DeLongis, Folkman, & Lazarus, 1988; Bolt, 2001) full form includes a total of 108 experiences that may be viewed as a hassle or an uplift; this analysis only included the 20 experiences most commonly experienced by students. Each item is rated as both a hassle and an uplift on a scale from 0 (representing "none or not applicable") to 3 (representing "a great deal" of a hassle or an uplift). Example items include "troubling thoughts about the future," which would likely be rated higher as a hassle and lower as an uplift. Some items are more ambiguous, such as "wasting time," which may be a hassle or an uplift depending on individual personality type. Possible scores range from 0 (representing an uneventful experience) to 120 (representing an extremely eventful experience). Appropriate internal consistency was estimated for the scale in previous research (coefficient ω = 0.88; Tanzer, 2019a).
The Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985) was used to estimate life satisfaction. This scale includes five questions rated on a Likert scale from 1 (representing "completely disagree") to 7 (representing "completely agree"). Final scores ranged from 5 (representing very low satisfaction) to 35 (representing high satisfaction). Previous research has estimated strong internal consistency for this scale (coefficient ω = 0.93; Tanzer, 2019a).
Lastly, the Personal Fulfillment Inventory (Tanzer, 2019a) is a composite scale compiled from question items in other published scales (Forsyth, 1980; Schueller & Seligman, 2010; Tangney, Baumeister, & Boone, 2004; Taormina & Gao, 2013; Walters, 2000). The final scale includes 36 items corresponding to three subscales describing different subjective phenomena of happiness. This was done to allow for a more theoretically informed summary of personal happiness while accounting for multiple sources of subjective well-being. Interpretation of this scale is based on three subscales: the pleasant life, the good life, and the meaningful life. The pleasant life represents a conceptualization of happiness as fun and the absence of discomfort. The good life represents a conceptualization of happiness as active engagement with activities, a task-oriented view of contentment. The meaningful life represents a conceptualization of happiness as connection to something broader than the self. These three aspects were originally based on theory by Seligman (2010), but were expanded upon by Winston (2016). They represent a developmental maturation of subjective experience from emphasizing tactile enjoyment toward peaceful contentment. Each of these subscales incorporates theoretical aspects of broader psychological development, specifically moral reasoning, existential fears, and humanistic need satisfaction (Winston, 2016). As this is a developmental construct, item scores on the pleasure subscale are expected to be negatively associated with scores on the meaning subscale, as these are the theoretical extremes of the developmental teleology (Winston, 2016). Previous research has supported this empirically, while also indicating a bifactor model may be a more appropriate characterization (Tanzer, 2019a, 2019b).
Common positive association was identified among all items, likely representing general happiness; secondary variation among subscales was also identified distinguishing the expected developmental trajectory.
For scoring the measure, individual item responses are transformed into a percentage of the maximum possible scale value. This allows for all items to be directly compared despite differing response types. Twenty-one questions are answered on a Likert scale, where a low response value represents "strongly disagree" and a high response value represents "strongly agree." Example items include "in choosing what to do, I always take into account whether it will be pleasurable," and "I am rarely distracted by what is going on around me." Six questions are rated on a Likert-like scale in terms of how much the respondent feels satisfied with their personal situation.
Example items include "the quality of sleep I get to feel fully refreshed," and "the emotional support I receive from my friends." Lastly, nine items are rated as checklist items for existential fears. An answer of "yes" is scored as 0 and an answer of "no" is scored as 1. Example items include "I fear inadequacy," and "I fear other people's opinions." Scores range from 0 (representing no experience of happiness) to 36 (representing experience of all forms of happiness). Previous use of the scale demonstrated appropriate internal consistency (coefficient ω = 0.83; Tanzer, 2019a).
The biomarkers included were cortisol, interleukin-6, and global DNA methylation. Cortisol and interleukin-6 were assayed first. Cortisol estimates were extracted from saliva samples using the Expanded Range High Sensitivity Salivary Cortisol Enzyme Immunoassay Kit from Salimetrics (catalog number 1-3002). Multiple design procedures were put in place to limit possible confounding.
Cortisol is released on a circadian cycle, so all samples were provided during the late morning (Malven, 1993). Recent trauma may also act as a confound, so participants were screened for traumatic experiences in the past month; no participants reported recent trauma (Malven, 1993).
Global DNA methylation was assayed second, using the QIAamp DNA Mini Kit (catalog number 51304) and the MethylFlash Global DNA Methylation (5-mC) ELISA Easy Kit (Colorimetric) by EpigenTek (catalog number P-1030-48). Free-floating DNA in saliva samples was processed to estimate the percentage of methyl groups present across the entire genome. Due to financial limitations, samples had to be processed on two occasions, one year apart. At the first occasion, only 9 of 20 samples were analyzed; all samples were analyzed on the second occasion. Samples were assayed in triplicate, meaning that some samples had three estimates of methylation rates while others had six estimates. Because this may have induced a method bias, the conditional means were used as the final estimates in the analysis to control for variance within individuals and methods.

Procedure
First, study procedures were described to participants, and then they were provided with an IRB approved informed consent form. Next, participants were screened for trauma in the past month using the Brief Trauma Questionnaire (Schnurr, Vielhauer, Weathers, & Findler, 1999).
After that, saliva samples were collected using the salivary drip method (Salimetrics, 2018) and questionnaire answers were provided by pen and paper. Saliva samples were immediately placed in a freezer at -20 °C until they were chemically analyzed.

Conventional Model
Drawing from methods in artificial intelligence, a simple Bayesian principal components analysis has been suggested by Bishop (1998). Let $D$ denote the observed data set, $D = \{t_i,\ i = 1, \ldots, n\}$, where $t_i$ is a $d \times 1$ column vector. It is proposed that a standard normal latent variable, $x_i$, can be used to model common variation among the items, as follows:

$$t_i = W x_i + \tau + \varepsilon_i$$

where $W$ is a $d \times q$ matrix of variable weights for the $q \times 1$ zero mean latent variable $x_i$, $\tau$ is a $d \times 1$ vector of item means, and $\varepsilon_i$ is a zero mean, normally distributed vector of dimension $d \times 1$ with covariance matrix $\sigma^2 I_d$. The values of the weights can be estimated using eigenvalue decomposition of the $d \times d$ covariance matrix. The covariance matrix, $C$, can be defined as follows:

$$C = W W^{T} + \sigma^2 I_d$$

which allows for maximum likelihood estimation of the weight terms, $W_{ML}$, given the eigenvectors of the sample covariance matrix, $U_q$, and their corresponding eigenvalues, $\Lambda_q$:

$$W_{ML} = U_q \left(\Lambda_q - \sigma^2 I_q\right)^{1/2}$$

While this provides a detailed assessment of internal structure validity, internal consistency reliability will also be estimated for the scales. Specifically, coefficient ω will be calculated, based on the equation:

$$\omega = \frac{\left(\sum_{j=1}^{d} W_j\right)^2}{\left(\sum_{j=1}^{d} W_j\right)^2 + \sum_{j=1}^{d} \Psi_j}$$

where $W_j$ is the weight for item $j$ and $\Psi_j$ is the error variance for each item, $\sigma^2$ (Furr, 2017). This can be conceptualized as a binomial distributed variable with shape parameters a = the squared sum of the item weights and b = the sum of the item error variances.
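The maximum likelihood weights and coefficient ω described above can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the estimation code used here: the function names and the example weights are hypothetical, the noise variance is taken as the mean of the discarded eigenvalues (the closed-form maximum likelihood estimate from the probabilistic principal components literature), and the arbitrary rotation of the weights is taken to be the identity.

```python
import numpy as np

def ppca_ml(cov, q):
    """Closed-form ML weights and noise variance for probabilistic PCA."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # ML noise variance: mean of the d - q discarded eigenvalues.
    sigma2 = eigvals[q:].mean()
    # W_ML = U_q (Lambda_q - sigma^2 I)^(1/2), rotation taken as identity.
    scale = np.sqrt(np.clip(eigvals[:q] - sigma2, 0.0, None))
    return eigvecs[:, :q] * scale, sigma2

def coefficient_omega(weights, error_variances):
    """omega = (sum of weights)^2 / ((sum of weights)^2 + sum of error variances)."""
    a = np.sum(weights) ** 2
    b = np.sum(error_variances)
    return a / (a + b)

# Hypothetical one-component example with five items.
w_true = np.array([0.8, 0.7, 0.9, 0.6, 0.75])
cov = np.outer(w_true, w_true) + 0.3 * np.eye(5)   # C = W W^T + sigma^2 I

w_ml, sigma2 = ppca_ml(cov, q=1)
omega = coefficient_omega(w_ml[:, 0], np.full(5, sigma2))
```

Because eigenvectors are only defined up to sign, the recovered weights may be the negative of `w_true`; ω is unaffected because the summed weights are squared.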
Each questionnaire will be examined in a separate model. In each case, the number of principal components extracted will be based on D-1, the number of question items minus one. For the Hassles and Uplifts scale, this means 39 components extracted, 4 for the Satisfaction with Life Scale, and 35 for the Personal Fulfilment Inventory.

Group Comparison Model Assuming Equal Variances
A goal of this analysis is to demonstrate the feasibility of a principal components analysis with clustered sampling to make group comparisons. While the outlined analysis can be used for standard model estimation, it is proposed to estimate the models with a between groups difference by age group as follows:

$$t_i = W x_i + \tau + \mu z_i + \varepsilon_i$$

such that $t_i$ is a $d \times 1$ vector of observed measures for individual $i$, $W$ are the principal component item weights, $x_i$ are the individual latent variables, $\tau$ is the grand mean across all participants, $\mu$ is the mean difference attributable to the subgroup of graduate students indicated by $z_i$, and $\varepsilon_i$ are the individual error terms. In order to conduct group comparisons in this framework, the estimates of the weights, $W$, are constrained to be identical across groups. The latent variable, $x_i$, is assumed to be standard normal,

$$x_i \sim N(0, I_q)$$

Comparisons are made by examining the quantile based credible interval around the parameter $\mu$. A credible interval that excludes zero would be an indication of item level differences between groups. This improves model estimation by allowing for possible heterogeneity of distributions between the two groups while identifying common variance among the individual items.
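To make the structure of the group comparison model concrete, the following sketch simulates data from it. All values here are hypothetical (weights, group differences, and a sample far larger than the one in this study), and the sketch only illustrates the generative form of the model; it does not perform the Bayesian estimation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_group, d, q = 2000, 6, 2   # hypothetical sizes, much larger than N = 20

# Hypothetical weights, constrained to be identical across groups.
w = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.2],
              [0.1, 0.7], [0.0, 0.8], [0.2, 0.6]])
tau = np.zeros(d)                                  # grand mean
mu = np.array([0.0, 0.0, 1.0, 0.0, -1.0, 0.0])     # hypothetical group differences
z = np.repeat([0, 1], n_per_group)                 # 0 = undergraduate, 1 = graduate

x = rng.normal(size=(z.size, q))                   # standard normal latent variables
eps = rng.normal(scale=1.0, size=(z.size, d))      # equal error variances
t = x @ w.T + tau + np.outer(z, mu) + eps          # observed measures

# With equal weights across groups, mu shows up as a shift in item means.
observed_diff = t[z == 1].mean(axis=0) - t[z == 0].mean(axis=0)
```

In the simulated data the difference in item means between the two groups approximates the chosen values of µ, which is the quantity the credible interval around µ is meant to capture.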

Group Comparison Model Assuming Heterogeneous Variances
While this model allows for group mean comparisons, there may also be differences among the variances. To allow for this possibility, a further model is estimated, where the model is adjusted to

$$t_i = W x_i + \tau + \mu z_i + (1 - z_i)\,\varepsilon_{i,1} + z_i\,\varepsilon_{i,2}$$

Here, the model allows for distinct error terms, $\varepsilon_{i,1}$ and $\varepsilon_{i,2}$, based on group identity, $z_i$. In both cases, these error terms are specified as zero mean normal distributions with variances $\sigma_1^2$ and $\sigma_2^2$.
Differences between variance estimates can be assessed by examining the credible interval around the ratio of the corresponding error variances, $\sigma_1^2 / \sigma_2^2$. If a value of one can be excluded, then this is evidence that variances are significantly different between groups.
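Lastly, to compare model fit between the two approaches, the deviance information criterion (DIC) is used as a measure of model fit (Li, Zeng, & Yu, 2013). The DIC can be defined as:

$$DIC = \bar{D}(\theta) + P_D$$

where $\bar{D}(\theta)$ is the posterior expectation of the deviance:

$$\bar{D}(\theta) = E_{\theta \mid t}\left[-2 \ln p(t \mid \theta)\right]$$

and $P_D$ is the effective number of parameters, defined as

$$P_D = \bar{D}(\theta) - D(\bar{\theta})$$

where $D(\bar{\theta})$ is the deviance evaluated at the posterior mean of the parameters. When two different models are compared, a smaller value indicates better fit to the data. This is used for comparing the two group comparison models, to assess whether it is more appropriate to assume heterogeneity of variances between groups.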
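The DIC calculation can be illustrated with a deliberately simple case. The sketch below assumes a normal model with known variance and a single unknown mean, with posterior draws simulated under a flat prior; the function names are hypothetical and this toy model stands in for the much larger principal components models compared in this analysis.

```python
import numpy as np

def deviance(y, mu, sigma=1.0):
    """Deviance: -2 times the log-likelihood of y under Normal(mu, sigma^2)."""
    resid = (y - mu) / sigma
    return np.sum(resid ** 2) + y.size * np.log(2.0 * np.pi * sigma ** 2)

def dic(y, mu_draws, sigma=1.0):
    """Return (DIC, p_D) from posterior draws of the mean."""
    d_bar = np.mean([deviance(y, m, sigma) for m in mu_draws])  # mean deviance
    d_at_mean = deviance(y, np.mean(mu_draws), sigma)   # deviance at posterior mean
    p_d = d_bar - d_at_mean                             # effective parameters
    return d_bar + p_d, p_d

rng = np.random.default_rng(3)
y = rng.normal(loc=0.5, scale=1.0, size=20)
# Posterior of the mean under a flat prior with known sigma = 1: Normal(ybar, 1/n).
mu_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(y.size), size=4000)

dic_value, p_d = dic(y, mu_draws)
```

With one free parameter the effective number of parameters comes out close to one, which is a useful sanity check before applying the same calculation to draws from a more complex model.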

Priors
There was a concern about bias due to the small sample size. To address this, an informative prior was selected using relevant historical data. A previous sample of 237 individuals from the same psychology courses at the same public university was available to the researchers.
The sampling frame was similar; however, there were some demographic differences between the samples. The historical sample was younger on average, by two years. This was likely because of the clustered sampling, which specifically targeted a subset of older students in the second sample.
Proportions of gender identities were similar, and so too was the proportion of white participants (i.e., predominately white, cisgender, female). The historical sample did have more racial diversity among nonwhite students. This was likely because of the larger sample, which probabilistically had more opportunity to identify more ethnic minority groups.
The biomaterial was not available for this historical dataset, although all questionnaire variables were included. For fitting the principal components analysis, prior estimates for the values and error variances of the weights were based on a maximum likelihood solution for the same scales.
The same model was used for prior estimates of the squared sum of the weights and sum of error variances in estimating internal consistency. Priors drawn from this dataset were highly informative; however, this was seen as appropriate given the limited sample of biomaterial. The focus was on the relationships with biomarkers, to identify which subsets of items may have concurrent validity within the stress response hypothesis. Because priors and observations were both from the same sampling frame, this was a meaningfully informative prior to adjust for the bias that is expected in small samples.
Specifically, the priors used in this analysis were as follows. The prior for the weights was specified in terms of $W_0$, a matrix of prior item weights, and $\alpha_0$, a diagonal matrix for the item weights with diagonal elements $\alpha_{0,i}$, $i = 1, \ldots, D - 1$. An inverse gamma prior was used for $\alpha_0$. This was done to be conservative: the weights were estimated with wide variance to avoid restricting the posterior distribution of the weights, given that the mean was informative. Priors for the grand mean, $\tau$, and group mean difference, $\mu$, were assumed to have zero mean and wide variance, to be vague. The error term, $\varepsilon$, is assumed to be mean zero with variance $\sigma^2$. When equality of variance was assumed, the variance for the error term was set to one, because all variables were standardized before the analysis was performed. When equality of variance was not assumed, variances were sampled with an inverse gamma prior, to avoid being overly informative.
Specific values and summary descriptions for all priors are provided in Table 2 and Appendix 1. Prior values for the individual weights were taken from the larger previous sample of questionnaires without biomarkers; specifically, the maximum likelihood weight estimate for each item on each latent dimension was used. The number of Q principal components extracted was equal to D-1, the number of questions on the questionnaire minus one. Because it was unclear what prior weight estimates to provide for the biomarkers, the average weight value for each corresponding latent dimension was used.
Lastly, a central focus of this analysis was on the group mean differences, so a follow up sensitivity analysis was included to consider alternative specifications of the group mean difference parameters. First, the variance was reduced to fixed values that were much smaller: 10, and then 1.
Next, the group means variance was estimated separately between groups as $D^{T}D\,b_1$ and $D^{T}D\,b_2$.
By definition, $D^{T}D$ will be greater than the true variance, so $b_1$ and $b_2$ can calibrate the variance to the values manifest in the data. The priors for these were uniform, from 0 to 1, to be noninformative.
Lastly, to consider the importance in specification of the prior mean, the sensitivity analysis allowed group mean differences to be small (0.2) and large (0.8; from Cohen, 1992). This will provide some insights into how dependent results were on the priors used.

Sampling diagnostics
Stan software was used to estimate the model parameters in RStudio (Carpenter et al., 2017; RStudio Team, 2015). All variables were standardized as Z scores so that their weights and variances could be directly compared. Global DNA methylation, which represents the percentage of methyl groups attached to DNA, was reverse coded to represent percentage of unmethylated DNA.
This was done because greater methylation prevents protein expression. By using the percentage of unmethylated DNA, all biomarkers with true parallel associations can be expected to have the same direction of weight estimates.
Univariate distributions were examined to ensure that variables were reasonably normally distributed, as is assumed for principal components methodology (Harlow, 2014; see Table 1). Most variables were fairly normal (i.e., |skewness| < 1.0, kurtosis < 3.0); the one exception was DNA methylation. The model was also estimated with this variable log transformed; however, this did not change the results, so it was left untransformed for the reported analysis. Previous research has suggested that this methodology may be robust to skewness and kurtosis of individual indicators (Muthén & Kaplan, 1985).
A total of 5,000 samples were taken for each scale, of which the first 2,500 were discarded as burn-in. Trace plots and autocorrelation functions (ACFs) were visually examined for the posterior samples, generally appearing as white noise with rapid decay of autocorrelation. The effective sample size and Geweke diagnostic test were also used, and generally indicated convergence (i.e., ESS > 1,500 in most cases, |Z| < 1.96 in most cases). At this point, the posterior distributions were examined in detail.
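The burn-in and autocorrelation check can be sketched as follows. This is a hedged illustration only: the chain is simulated white noise standing in for real sampler output, the helper `autocorrelation` is hypothetical, and the effective sample size and Geweke statistics reported above are not reproduced here.

```python
import numpy as np

def autocorrelation(chain, max_lag=20):
    """Sample autocorrelation of a 1-D chain for lags 0 through max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(4)
draws = rng.normal(size=5000)   # stand-in for 5,000 sampler iterations
kept = draws[2500:]             # discard the first 2,500 as burn-in
acf = autocorrelation(kept)

# Rapid decay: autocorrelation near zero at every lag beyond zero.
decays_quickly = bool(np.all(np.abs(acf[1:]) < 0.1))
```

For a well-mixing chain the values in `acf` beyond lag zero stay close to zero, matching the white noise pattern described above; slowly decaying values would suggest that more iterations or thinning are needed.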

Hassles and Uplifts Scale
The internal structure and reliability estimates are provided in Appendix 2, based on the model assuming equality of variances. The first latent dimension of the hassles and uplifts scale clearly represented common variation among all hassles and uplifts, but hassles in particular.
All principal components weights were estimated to be positive, and all hassles were significantly different from zero. Of the uplifts for which a 95% credible interval excluded a zero relationship, many were the items that were primarily hassles (e.g., "Inconsiderate smokers," W = 0. This did not change the significance of a mean parameter estimate. This further demonstrates measurement invariance between groups.

Satisfaction with Life Scale
As before, Appendix 2 describes the internal structure estimates assuming equality of variances. Only the first latent dimension included significant and meaningful item weights. All five items had significant and large positive weights (i.e., W > 0.75). Of the salivary biomarkers, interleukin-6 was significant and positive (W = 0.74, 95% CI [0.13, 1.16]). Coefficient ω was estimated to indicate strong reliability (ω = 0.95, see Figure 1), though there was a wider range of probable values (95% CI [0.76, 0.99]). Regardless, even the lower limit of this credible interval is sufficient by standards in psychology research, supporting the reliability of the scale. Comparing between graduate and undergraduate students, in Table 4, individual item mean differences were estimated to be positive in some instances and negative in others; however, these were not significant. This lack of differences was robust to changes in prior specification (i.e., increasing and decreasing the prior mean and variance).

Personal Fulfilment Inventory
As many as eight latent dimensions showed significant and theoretically meaningful posterior weight estimates (see Appendix 2). All weights for the first dimension were estimated to be positive, including all of those on the pleasant life subscale being significantly different from zero.
Many items on the good life subscale were also significant (e.g., "I fear loss Group mean differences were considered between the subset of undergraduate and graduate students, reported in Table 5. Assuming equality of variances, most item level differences were not significant. The one exception was "I am satisfied with my financial security," for which graduate students tended to rate significantly less agreement (µ = - Allowing for heterogeneous variances between groups did identify more item group mean differences. As before, "I am satisfied with my financial security," was rated significantly lower by graduate students (µ = -1.49, 95% CI [-2.52, -0.69]). Another item related to financial security was now rated significantly lower by graduate students as well ("I am satisfied with my ability to get money whenever I need it," µ = -1.24, 95% CI [-1.95, -0.17]). Graduate students also tended to rate "I do things that feel good in the moment but regret later on" significantly lower (µ = -1.12, 95% CI [-2.28, -0.39]). This item was reverse scored, so this result indicates graduate students tended to be more impulsive than undergraduate students. Lastly, the item "I am satisfied with how highly other people think of me" was rated higher among graduate students (µ = 1.02, 95% CI [0.01, 2.05]).

Discussion
The goal of this analysis was to evaluate internal structure and reliability of three scales of subjective wellbeing. There was specific interest in identifying aspects that covary in relation to physiological arousal. Generally speaking, results supported validity and reliability of all measures.
While the Hassles and Uplifts Scale demonstrated an internal structure distinguishing between hassles and uplifts, the most salient trait accounted for high ratings of all events as hassles and uplifts both. Furthermore, this primary trait showed significant association with all three biomarkers of physiological arousal. No evidence was found to suspect heterogeneity of factor structure between age groups, and internal consistency reliability was estimated to be excellent for the measure. When the variances were allowed to be different between groups, graduate students reported having fun as slightly less of an uplift, however most items showed similar means. All in all, this supports the use of the Hassles and Uplifts Scale for measuring a unidimensional and physiologically underpinned trait of sensational subjective experience.
The Satisfaction with Life Scale demonstrated the strongest evidence of unidimensionality, with large positive weights for all questions. Reliability was appropriate despite wide credible limits. This was likely because this was the shortest scale and the sample was small. Previous simulation work has suggested that short scales with small samples can tend toward wider-ranging reliability estimates (Yang & Xia, 2019). There was no evidence of heterogeneity of item responses between age groups.
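The effect of scale length and sample size on the spread of reliability estimates can be illustrated with a small simulation. The sketch below is not the Bayesian estimation used in this analysis. It assumes coefficient alpha as the reliability index, a five-item scale, a single latent trait with equal loadings of 0.7, and percentile (not HPD) intervals; all of these are illustrative choices.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_vars / total_var)

def simulate_alphas(n, k, loading=0.7, reps=2000, seed=0):
    """Sampling distribution of alpha under a one-factor model (assumed)."""
    rng = np.random.default_rng(seed)
    alphas = np.empty(reps)
    for r in range(reps):
        trait = rng.normal(size=(n, 1))   # latent trait scores
        noise = rng.normal(size=(n, k))   # item-specific error
        scores = loading * trait + np.sqrt(1 - loading**2) * noise
        alphas[r] = cronbach_alpha(scores)
    return alphas

def spread(alphas):
    """Width of the central 95% of the simulated estimates."""
    lo, hi = np.percentile(alphas, [2.5, 97.5])
    return hi - lo

small = simulate_alphas(n=20, k=5)    # sample size of this analysis
large = simulate_alphas(n=200, k=5)   # larger comparison sample

print(f"N = 20:  95% spread of alpha = {spread(small):.2f}")
print(f"N = 200: 95% spread of alpha = {spread(large):.2f}")
```

Under these assumptions the estimates from the smaller sample vary over a visibly wider range, which is consistent with the wide credible limits observed for the shortest scale.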
While this supports the use of the Satisfaction with Life Scale, one counterintuitive finding was that one biomarker, interleukin-6, was positively associated with the scale. It had been hypothesized that inflammation response would be negatively associated with subjective well-being. One possible explanation is that interleukin-6 has different effects within the central nervous system and the circulatory system. While circulating interleukin-6 facilitates arousal and inflammation response, it is also involved in a feedback loop that regulates long-term stress responsiveness. In the central nervous system, interleukin-6 is part of a pathway that induces epigenetic quieting of stress response signaling (Erta, Quintana, & Hidalgo, 2012; Gruol, 2014).
However, these were subtle differences that were less prominent than overall happiness. Importantly, cortisol and interleukin-6 were both positively associated with the primary factor dimension. This supports the stress response hypothesis. Participants with less mature happiness scores as measured by the personal fulfilment inventory tended toward higher rates of physiological arousal biomarkers.
Beyond this evidence of internal structure validity, internal consistency reliability was also strong. What was concerning, however, was the evidence of heterogeneity between age groups. Graduate students showed greater concern about financial security. They also reported more regrets in general, and higher satisfaction with how they are perceived by others. One possible explanation is that graduate students were more conscious of the consequences of their actions. Graduating and moving beyond college student life may provide a sense of social prestige, but also more personal and financial responsibilities.
A secondary goal was to explore the implications of estimating Bayesian principal components models that account for group mean comparisons. One benefit of this approach is that it is robust to repeated group mean comparisons. It is common to want to compare every item-level difference; in the frequentist paradigm, however, this is cautioned against. As the number of hypotheses increases, the critical value for determining significance must be made more stringent to avoid Type I error inflation, which in turn limits statistical power. This Bayesian approach does not require the adjustment to the critical value because the variance of the group mean difference was specified to be vague (Gelman, Hill, & Yajima, 2012).
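The frequentist trade-off described above can be made concrete with a short calculation. This sketch assumes independent tests, which item-level comparisons on the same respondents generally are not, and uses the Bonferroni correction as one common adjustment; the numbers of comparisons shown are arbitrary and are not taken from this analysis.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance threshold that keeps the familywise rate <= alpha."""
    return alpha / m

alpha = 0.05
for m in (1, 5, 20, 50):  # hypothetical numbers of item-level comparisons
    fwer = familywise_error_rate(alpha, m)
    per_test = bonferroni_threshold(alpha, m)
    print(f"m = {m:2d}: unadjusted familywise rate = {fwer:.3f}, "
          f"Bonferroni per-test threshold = {per_test:.4f}")
```

As the number of comparisons grows, the unadjusted familywise error rate rises quickly, while the adjusted per-test threshold shrinks, which is the loss of power noted in the text.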
Another benefit of this approach is in interpreting the group mean difference. These significant differences demonstrate an important insight into measurement bias that can be obtained using this procedure. The group mean difference was estimated after accounting for variation attributable to latent trait scores. Therefore, this difference indicates remaining group mean differences beyond what is accounted for by the trait being measured. The significant items identified in this analysis were both associated as expected with the other questions. By decomposing group mean differences in this way, the items with residual mean differences beyond common measurement can be identified. In the case of these results, although the intended measurement was of subjective happiness, aspects of contentment and stress about life circumstances stood out. This demonstrates how unique items can be identified. If they are in conflict with the theoretical assumptions of the construct being measured, they may be considered for removal. Specifying the model in this way allows for precise decomposition of these differences.
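One way to write the decomposition described above is sketched below. The notation is an assumption introduced here for illustration and is not the model statement used in this analysis: $y_{ij}$ is the response of participant $i$ to item $j$, $g_i$ is a group indicator (0 for undergraduate, 1 for graduate), $\theta_{ik}$ are the latent trait scores on $K$ components, and $w_{jk}$ are the component weights.

```latex
y_{ij} = \mu_j + \delta_j \, g_i + \sum_{k=1}^{K} w_{jk} \, \theta_{ik} + \varepsilon_{ij},
\qquad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma_j^{2}\right)
```

Under this sketch, $\delta_j$ is the residual group mean difference for item $j$: the part of the difference between groups that remains after the trait scores $\theta_{ik}$ have absorbed the variation common to the items. A heterogeneous-variance version would let the residual variance depend on the group as well as the item.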
While this procedure identified some item level differences, a limitation is that the sample sizes were quite small. Lack of evidence of measurement heterogeneity does not necessarily mean measurement homogeneity. While Bayesian measurement model estimation may be more robust to small sample sizes, a larger sample is still preferred. Future research should attempt to replicate these results.
Another limitation of this analysis was that the biomaterial selected may not be precise enough. As previously mentioned, interleukin-6 demonstrated some unexpected relationships to the Satisfaction with Life Scale. This could be because of the differing physiological mechanisms it enables, or it could be evidence against the stress response hypothesis. Future research could compare multiple ways of estimating interleukin-6, such as in saliva versus in serum. Additionally, global DNA methylation is known to have some measurement inconsistencies. Only half of the samples could be assayed when they were first collected, so the delay in assaying the remainder could have induced bias. Another concern is that global DNA methylation is a very general biomarker, representing methyl groups across the entire genome. Examining methylation of specifically relevant promoter regions may be more insightful.
Lastly, specification of the priors could be changed. Prior specification was based on a larger sample of questionnaires only. A prior that is more sensitive to the biomaterial may have been more appropriate. While multiple priors for the group mean differences were tested, it was unclear what values to use. Finding relevant historical data to use as a prior for group mean differences could help identify differences more accurately. Informative priors were used to correct for likely small sample bias. Using a less informative prior could provide further insights, as could incorporating the power prior, which weights the influence of informative priors on the data based on the concordance between distributions (Ibrahim, Chen, Gwon, & Chen, 2015).
All in all, the goal of this analysis was to review measures of subjective well-being and to investigate a novel approach to identifying unreliable items. Results supported the use of all three scales in measuring aspects of subjective experience. Some items were identified as possibly containing residual group mean differences. Regardless, most scales were reliable and valid, and this result demonstrates how future research can use this procedure to identify at-risk questions.
Group mean difference variance identification parameter
Note: Multiple priors were used to test the sensitivity of results to prior specification.
*Informative prior was based on a historical sample of 237 previous questionnaire responses from the same data frame.
1: Prior used for reported principal components weight estimates and mean differences with equal variances assumed (Appendix 2 and Tables 2, 3, and 4).
2: Prior used for reported mean differences with heterogeneous variances assumed (Tables 2, 3, and 4).
Note: Credible intervals are estimated as HPD based.
Note: A total of 35 principal components dimensions were extracted; only the first eight are presented because nearly all weight estimates were nonsignificant and near zero after the first seven dimensions. Results are based on the assumption of equal variances.
*95% HPD based CI excludes the possibility of zero
A: Item was intended to operationalize humanistic need satisfaction
B: Item was intended to operationalize happiness
C: Item was intended to operationalize existential fear
D: Item was intended to operationalize moral reasoning