Assessing Reading Grade Level of Online Mental Health Materials: Practical and Methodological Considerations

The Internet is a useful tool for providing people with a vast array of mental health information at the click of a button. Despite this plethora of available knowledge, the information presented on popular physical and mental health websites is often written for an audience with a reading grade level higher than the national 6th-8th grade average. Although the CDC has developed guidelines for creating online patient health materials that account for disparities in health literacy across socio-demographic groups, adherence to these guidelines is largely poor and minimally monitored. This discrepancy can have broad public health implications given the suggested relationship between low health literacy and poor health outcomes. The present study systematically examines grade level readability scores for online information describing sixteen different mental health disorders, extracted from six highly utilized mental health websites, using a generalized estimating equations (GEE) approach. To best understand this problem, two manuscripts are presented herein. The first manuscript focuses on the public health concerns associated with higher-than-average reading grade level estimates of online mental health materials, whereas the second focuses on the methodology used to make these determinations. Results suggest that reading grade level estimates of publicly available online mental health information are much higher than the 6th-8th grade levels suggested by the CDC, such that the average reader will not be able to effectively understand the selected text. This finding has broad implications from a public health perspective and suggests that current materials may maintain existing health disparities.


Introduction
In light of the massive expansion of the Internet over the last decade, a plethora of information is now available on almost any topic imaginable. Given existing socioeconomic and geographic health disparities in the United States, this trove of knowledge can help inform decision-making on important physical and mental health topics, ranging from signs and symptoms of heart disease to mental health concerns such as substance use and anxiety. Indeed, in a few short keystrokes, people can now access a myriad of information from multiple sources via popular online search engines such as Google, Bing, Yahoo, and Ask.com.
However, despite the popularity of online health materials as a vital source of information from which to make important health-related decisions, little attention has been paid to the readability of these materials, where readability refers to a systematic measure of the ease with which a passage of text can be read (Albright, de Guzman, Acebo, Paiva, Faulkner, & Swanson, 1996; McInnes & Haglund, 2011). This lack of attention to the readability of online public health information is particularly problematic considering that approximately 35% of US citizens have basic or below basic health literacy, 53% have intermediate health literacy, and only 12% have proficient health literacy. Here, health literacy is defined as the ability to search for, comprehend, and utilize written health education materials to make educated healthcare decisions (Berkman, Sheridan, Donahue, Halpern, Viera, & Crotty, 2011; Kutner, Greenberg, Jin, & Paulsen, 2006; Committee on Health Literacy, 2004).
Although there is some research examining readability scores for materials on a range of physical conditions (see, for example, Brigo, Otte, Igwe, Tezzon, & Nardone, 2015; Colaco, Svider, Agarwal, Eloy, & Jackson, 2013; Misra, Agarwal, Kasabwala, Hansberry, Setzen, & Eloy, 2013; or Svider, Agarwal, Choudhry, Hajart, Baredes, & Liu, 2013), little if any attention has been paid to assessing the readability of online mental health materials. Furthermore, only one study to date has explored the topic of readability using a mixed modeling approach (McEnteggart, Naeem, Skierkowski, Baird, Ahn, & Soares, 2015). Hence, this study is novel in that it is the first of its kind to explore the readability of information on 16 of the most prevalent mental health disorders, using data extracted from several of the most popular mental health websites and evaluated with multiple readability indices.

Who uses the Internet in the United States?
According to a recent study by the Pew Research Center (Perrin & Duggan, 2015), 84% of all Americans use the Internet. Given the heterogeneity of the U.S. population, as well as vast differences in access to technological resources across various socioeconomic spheres, it is important to further examine rates of use by level of education, income, race/ethnicity, gender, and age. For instance, 95% of college-educated Americans use the Internet, as compared with 90% of those with some college education, 76% of those with a high school degree, and 66% of those with less than a high school diploma (Perrin & Duggan, 2015).
Likewise, 95-97% of those earning more than $50,000 per year are Internet users, as compared with 85% of individuals earning between $30,000 and $49,999, and 74% of those earning less than $30,000 annually. Despite these gaps, Internet use has grown considerably over the past 15 years among those in lower-income households and those with lower levels of educational attainment, such that class differences have shrunk somewhat and many people are now able to regularly access this resource from a range of technological platforms (Perrin & Duggan, 2015).
Examination of Internet use by race/ethnicity reveals that 97% of English-speaking Asian individuals use the Internet regularly, as compared with 85% of non-Hispanic Whites, 81% of Hispanics, and 78% of non-Hispanic Blacks. Similar rates of use are evidenced across genders, with 85% of men and 84% of women indicating Internet use (Perrin & Duggan, 2015). Lastly, a breakdown of use by age indicates that 96% of adults ages 18-29, 93% of adults ages 30-49, 81% of adults ages 50-64, and 58% of adults ages 65 or older are Internet users. Although older adults have traditionally been the slowest age group to adopt this technology, a majority of senior citizens now indicate regular Internet use (Perrin & Duggan, 2015).
Despite some differences in rates of adoption among these heterogeneous groups, it is fair to state that a majority of Americans are using the Internet on a regular basis. Hence, there is much potential to utilize this tool to empower people to make more informed choices about their mental health care needs. However, in order to make specific recommendations and develop an action plan for increasing access to, and comprehension of, online mental health materials, it is first important to examine how people are currently seeking health information on the Internet, as well as how these behaviors are related to users' general sense of health literacy.
How are people using the Internet to acquire health-related information?
According to a 2012 study conducted by the Pew Research Center, approximately 72% of Internet users reported seeking health information online within the past 12 months (Fox & Duggan, 2013). Likewise, 77% of online health seekers reported beginning their search at a general search engine website such as Google, Bing, or Yahoo, whereas approximately 13% reported beginning at a more specialized medical website such as WebMd.com. Furthermore, 55% of users reported searching for a specific disease or medical problem, and 43% reported searching for a certain medical treatment or procedure. Approximately half of users reported searching for a close family member or friend (Fox & Duggan, 2013).
In addition, 35% of U.S. adults indicated that they have specifically gone online to find out what condition they or someone else might have, and 46% of these 'online diagnosers' reported that the information obtained led them to think they needed medical intervention (Fox & Duggan, 2013). The remaining 38% reported saying that they could take care of the issue themselves at home, with 11% being ambivalent about the decision to seek additional medical care. Participants also reported on the accuracy of their initial diagnosis, with 43% indicating that a medical professional confirmed or partially confirmed their hypothesis, 35% indicating they did not visit a professional, and 18% indicating that a medical professional either disagreed with the initial diagnosis or offered an alternate medical opinion (Fox & Duggan, 2013).

Health literacy
These statistics are important when considering the potential gravity of misdiagnosing or ignoring a serious medical problem based on written information obtained online, particularly when this information is readable by only a small fraction of the population. Indeed, given that approximately 77 million Americans have basic to below basic health literacy, defined as the ability to read, understand, locate, and interpret health-related information correctly in text (America's Health Literacy, 2008), and that the average reading level across the United States is no higher than the 6th to 8th grade (Kutner, Greenberg, Jin, & Paulsen, 2006; Paasche-Orlow, Parker, Gazmararian, Nielsen-Bohlman, & Rudd, 2005), it is important that health information be written at a level that is accessible to the majority of consumers.
According to the U.S. Department of Health and Human Services' Office of Disease Prevention and Health Promotion report on Health Communication Activities (2008), results from the National Assessment of Adult Literacy survey suggest that health literacy is an issue for all racial and ethnic groups, with 28% of Whites, 57% of Blacks, 65% of Hispanics, and 34% of Others (including Asians, Native Americans, and multi-racial adults) in the basic to below basic health literacy groups. Within the scope of this study, health literacy was defined as the ability to successfully: read a set of short instructions and identify what is permissible to drink before a medical test (below basic health literacy); read a pamphlet and give two reasons why a person with no symptoms should be tested for a disease (basic health literacy); read instructions on a prescription label and determine at what time a person can take the medication (intermediate health literacy); and, using a table, calculate an employee's share of health insurance costs for one year (proficient health literacy).
Results from this study also indicated that lower health literacy is associated with less education: 76% of individuals with less than a high school degree, 44% of those with a high school diploma, 21% of those who had completed some education beyond high school, and 12% of those with a Bachelor's degree or higher, were at the below basic or basic levels for health literacy. Likewise, uninsured adults (53%) and those enrolled in Medicare (57%) and Medicaid (60%) were more likely to be at the below basic or basic levels than those who received insurance from an employer (24%). Interestingly, only 15% of adults with below basic health literacy indicated using the Internet "some" or "a lot" of the time for obtaining health information, as compared with 31% of those with basic health literacy, 49% with intermediate health literacy, and 62% of those with proficient health literacy (America's Health Literacy, 2008). Clearly, marketing online health information for the 12% of users who possess proficient health literacy only serves to perpetuate existing health disparities and limits access to valuable resources to a thin and privileged slice of the population. Policy implications from the Office of Disease Prevention and Health Promotion report (2008) suggest that there is an urgent need to address the gap between publicly available health information and existing realities in health literacy levels across various socio-demographic spheres.
The importance of access to comprehensible text becomes even more apparent considering that individuals with low health literacy have poorer access to care, experience poorer health outcomes, and have higher hospitalization rates than individuals with high health literacy (McInnes & Haglund, 2011). According to a number of reports (Baker, Parker, Williams, & Clark, 1998; Baker et al., 2002; Gordon, Hampson, Capell, & Madhok, 2002; Scott, Gazmararian, Williams, & Baker, 2002), individuals with low health literacy make greater use of treatment services, as compared with services designed to prevent the onset of disease or lessen serious complications. This results in an estimated $50 to $73 billion in additional health care costs annually in the United States. One way to attenuate these costs might be to match the readability of written healthcare information to national reading grade level averages, or below. Although this is clearly not a catch-all strategy for reducing the financial burden associated with poor health outcomes, it is an important first step in addressing existing disparities in health literacy and providing consumers with usable information from which they can make more informed decisions about their own, or their loved ones', mental healthcare needs.

Readability
In accordance with this theme, several national organizations, including the Centers for Disease Control and Prevention (CDC) and the American Medical Association (AMA), recommend that health information be written at a 6th-8th grade reading level (Weis, 2003; Friedman & Hoffman-Goetz, 2006). The readability indices used in this study generate reading level scores based on unique formulas or algorithms, increasing the probability that scores obtained from each index will exhibit marked variability. Given that these five indices were used to assess the same sample of text for each disorder, a modeling approach that accounts for clustering within the data was necessary in order to examine the relationship between website (source) and topic area (subject) while accounting for variability in reading grade level scores by index. This approach provides a robust method for assessing differences in readability scores between, and within, websites and content areas, respectively.
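By way of illustration, two of the indices discussed later in this report, the Gunning Fog index and SMOG, have simple published formulas that can be computed from basic text counts. The sketch below takes pre-computed counts as input (syllable counting, the hard part in practice, is assumed to be done elsewhere); the example passage figures are hypothetical.

```python
import math

def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index: 0.4 * (average sentence length + percent
    complex words), where 'complex' means three or more syllables."""
    return 0.4 * (words / sentences + 100.0 * complex_words / words)

def smog(polysyllables, sentences):
    """SMOG grade, computed from the count of 3+ syllable words,
    scaled to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30.0 / sentences)) + 3.1291

# A hypothetical 300-word, 15-sentence passage with 36 polysyllabic words:
fog = gunning_fog(300, 15, 36)   # 0.4 * (20 + 12) = 12.8, i.e. grade ~13
smog_grade = smog(36, 15)        # roughly a 12th-grade estimate
```

Because each formula weights sentence length and word complexity differently, the same passage routinely receives different grade estimates from different indices, which is exactly the rater variability the clustering approach described above is meant to absorb.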
In summary, the purpose of this project was to systematically examine reading grade levels for 16 common mental health disorders from the top 6 websites common to all disorders. A significant source by content area interaction was hypothesized when accounting for the variability in reading level estimates generated by the various indices, such that grade level estimates for the disorders examined were expected to vary based on the website from which the text was derived. It was also hypothesized that written text for some of the more serious mental illnesses examined, such as schizophrenia, bipolar disorder, and borderline personality disorder, would have the highest reading level estimates, as compared with text from other disorders. Given the dearth of attention bestowed upon the readability of patient mental health materials in the past, it was expected that text from all websites would exceed the recommended 6th to 8th grade guidelines suggested by the CDC and the AMA.


Materials
Unlike the other sites examined, Wikipedia.com is owned by the non-profit Wikimedia Foundation and is described on the site as "a free Internet encyclopedia that allows its users to edit almost any article accessible. Wikipedia is the largest and most popular general reference work on the Internet and is ranked among the ten most popular websites". Clearly, Wikipedia.com is not managed by a board of mental health professionals, and its users generate and edit most of the mental health content posted on the site. However, given its popularity, Internet users searching for medical and mental health conditions are often directed to this site for key information.
A selection of text from each website, for each disorder, was extracted into a Word document and saved as a simple text file during the last two weeks of October 2015. All commas, quotation marks, apostrophes, hyperlinks, references, and headings were removed from the text, as specified by common guidelines for readability analysis. All bulleted lists and sentence fragments followed by a colon or semicolon were also removed.
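A sketch of this cleaning step in Python follows. The study performed this preparation by hand; the function name, the regular expressions, and the heading heuristic below are illustrative only, not the procedure actually used.

```python
import re

def clean_for_readability(text):
    """Illustrative cleaning pass: strip commas, quotation marks, and
    apostrophes; drop hyperlinks; and discard lines ending in a colon
    or semicolon (headings and list-introducing fragments)."""
    text = re.sub(r"[,\"'\u2018\u2019\u201c\u201d]", "", text)  # punctuation
    text = re.sub(r"https?://\S+", "", text)                    # hyperlinks
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.rstrip().endswith((":", ";"))]
    return " ".join(ln.strip() for ln in lines)

cleaned = clean_for_readability('Symptoms include:\nanxiety, and "worry".')
# the heading line is dropped and punctuation removed
```

Consistent cleaning matters because most readability formulas infer sentence boundaries from punctuation; stray headings and list fragments would otherwise be counted as very short "sentences" and deflate the grade estimates.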

Statistical analyses
For the purposes of this analysis, each of the selected reading level indices served as a separate rater of the same excerpt of text. Hence, reading level scores were clustered by rater (index), with each rater examining a total of 96 excerpts of text, for sixteen disorders, from six different websites. Because we were not interested in exploring differences in reading level scores between raters, and the raters chosen were conceptualized as a representative selection of the entire body of available raters (reading level indices), a population-averaged or generalized estimating equations (GEE) approach was utilized to explore systematic differences between websites, content areas, and website by content area interactions on population averaged reading grade level scores.
GEEs are typically used to estimate population-average or marginal models that describe changes in the population mean of a given variable in relation to other important covariates, while also taking into account subject-specific non-independence among observations (Hubbard et al., 2010). Although the authors considered using the mean score across all raters for each disorder to explore differences in reading level scores across disorders and websites, this approach reduces the number of measurements in each subject cluster to one data point, which may reduce power.
Hence, specific statistical methodologies, such as GEE, that accommodate correlations within clusters were considered more appropriate for the questions explored in this study.

Results
Preliminary Analyses. Data were analyzed using SAS Version 9.3 (SAS Institute, Cary, NC) and SPSS Version 21 (IBM, 2012). In order to determine the need for more complicated methodological techniques, the Intraclass Correlation Coefficient (ICC) was first calculated for rater (index). The ICC can be conceptualized as a general measurement of agreement or consistency between two or more raters or measuring methods, where a value of '1' represents perfect agreement, and a value of '0' represents no agreement at all. The purpose of this preliminary analysis was to determine the extent of variability in reading level scores attributable to differences in the rating algorithms utilized by each index selected. Because we were primarily interested in exploring how reading level scores vary by website and content area, it was important to take this variability into careful consideration; evidence for variability by index would suggest a clustering effect in the data that would need to be accounted for in all subsequent analyses.
A two-way random effects model was specified for rater in order to assess variability in reading level scores between raters. This model was selected because the same indices were used to assess all selections of text, and the indices were chosen from a population of available indices used to calculate grade reading level scores. The ICC(2) assumes that the variance of the raters adds noise to any ratings obtained, and that the mean of rater error is zero. Results indicated that the estimated reliability between indices was 82.1%, 95% CI [76.9, 86.6], using a consistency definition. As can be seen in Table 1 below, the mean of reading level scores generated by the Gunning Fog index was highest and had the largest variability, whereas the mean of reading level scores generated by the SMOG index was lowest and had the smallest variability of the indices selected.
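The consistency form of the two-way ICC reported above can be computed from a complete passages-by-raters score matrix via a standard ANOVA decomposition. The function below is a generic sketch of ICC(C,1), not the SPSS/SAS procedure the authors ran:

```python
import numpy as np

def icc_consistency(x):
    """ICC(C,1): two-way model, consistency definition, for a complete
    n_subjects x k_raters matrix of scores."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-passage means
    col_means = x.mean(axis=0)   # per-rater means
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Raters that differ only by a constant offset agree perfectly
# under the consistency definition:
perfect = np.arange(10.0)[:, None] + np.array([0.0, 1.0, 2.0, 0.5, 1.5])
icc = icc_consistency(perfect)  # approximately 1.0
```

Note that the consistency definition deliberately ignores constant rater offsets, which is why indices with systematically higher or lower means (e.g., Gunning Fog vs. SMOG) can still show the high reliability reported here.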
Overall, the indices selected were largely consistent in their ratings of readability across disorders and websites. Hence, it could be concluded that the indices chosen demonstrated sufficient consistency for further analysis. Given that the researchers were: 1) not interested in examining specific differences between raters (indices) across websites and disorders; and 2) wanted to increase power by retaining as much information as possible from the original dataset (collapsing the data by calculating a mean score for each disorder from each website would reduce the number of available data points from 480, with all raters considered separately, to 96 when scores are averaged), a GEE approach was utilized to account for any natural variation in outcomes attributable to rater-specific effects. Significance tests for all reported pairwise comparisons were adjusted using the Holm-Bonferroni method (Holm, 1979). See Figure 1.

With only two exceptions (text related to ADHD and Agoraphobia from WebMd.com, for which reading grade level estimates were much lower and consistent with a 6th and an 8th grade reading level, respectively), no reading grade level estimates from the websites examined approached the suggested 6th-8th grade guidelines. The two exceptions suggest that an individual who completed the 6th to 8th grade could effectively read the selected text. However, all other estimates obtained were markedly higher, with a minimum average high school reading level required to effectively read the selected text.
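The Holm-Bonferroni step-down adjustment used for the pairwise comparisons is simple to state in code. The sketch below is a generic implementation, not tied to this study's data: p-values are ranked from smallest to largest, the i-th smallest is tested against alpha/(m - i + 1), and testing stops at the first non-rejection.

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down procedure. Returns a list of booleans
    (reject / do not reject) in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break  # all larger p-values are also not rejected
    return reject

decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
# rejects the two smallest p-values (0.005 and 0.01) at alpha = .05
```

Holm's procedure controls the familywise error rate like classical Bonferroni but is uniformly more powerful, which matters here given the large number of website-by-disorder pairwise comparisons.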

Figure 1: Readability estimates by website and disorder
Interestingly, text related to borderline personality disorder demonstrated the highest reading grade level estimate, followed by text related to bipolar disorder, social phobia, schizophrenia, MDD, and GAD, in descending order of grade level.
Examination of estimates for these disorders generally suggests that an individual with an average post-high school reading level could effectively read the segments of text selected for analysis. Given the severity of impairment often associated with these disorders (particularly borderline personality disorder, bipolar disorder, and schizophrenia), it could be surmised that the information available online from the websites surveyed is not only relatively inaccessible to most healthy consumers, but particularly to those struggling with serious mental illness. Indeed, as noted by Revheim et al. (2014), individuals with schizophrenia commonly display severe deficits in reading ability. Likewise, given impairments in reading ability among individuals with serious mental illness, Rotondi et al. (2007) suggest that most online sources of mental health information are not well-suited to the needs of this population.
Not surprisingly, little difference was noted in reading grade level estimates between MDD and bipolar disorder, as these disorders may share a common language regarding general symptoms of depression. Likewise, given similarities in language, symptom presentation, and etiology, there was no notable difference in reading level scores for alcoholism and substance abuse, as well as social phobia and specific phobia. However, this rationale could not be extended to text describing the two predominant eating disorders examined in this study: reading level estimates for bulimia nervosa were significantly higher than those for anorexia nervosa. It is possible that further exploration of text content may reveal emphasis on different features, symptoms, or etiology of each disorder, hence contributing to differences in reading level estimates.
Indeed, the reader is encouraged to recall that this study only examined the readability of online public mental health materials, and did not explore the content (or meaning) of the text extracted from the sites selected. Based on national statistics suggesting that the reading grade level of the average American citizen is between the 6th and 8th grade (Kutner et al., 2006; Paasche-Orlow et al., 2005), readability is an important first component in understanding whether the structure and form of written material is largely digestible by the average reader.

Lastly, it is important to consider the practical and methodological limitations of this study before making sweeping conclusions about the content of online public mental health materials. Clearly, individuals have a multitude of ways of arriving at the websites and disorders examined within the scope of this study. In many cases, searching for mental health information may begin by entering key words related to symptoms, rather than names of formal diagnoses. This study did not assess the mechanism by which people arrive at the websites selected, with the implied understanding that, based on common search terms, people will eventually be funneled to a web page describing a disorder whose symptoms are consistent with their initial search terms.
Furthermore, this study is in no way a comprehensive review of all mental health diagnoses, nor does it sample all websites with available online mental health materials. The websites selected for analysis were chosen, in part, because they contain information specific to each disorder under investigation. Some prominent mental health websites, such as that of the National Institute of Mental Health (NIMH.NIH.gov), were not selected because they did not provide information specific to substance abuse disorders or alcoholism. Likewise, given the speed at which technology changes, it is possible that the search engines selected in October 2015 to conduct the initial investigation are no longer the most popular engines available.
From a methodological perspective, it may have been more robust to assess each block of text using additional readability indices, as well as to have multiple researchers select, clean, and process each block of text for enhanced inter-rater reliability. Although the researcher attempted to employ rigorous standards in selecting text for each disorder, it is possible that the selections may exhibit some bias.
However, despite these limitations, this study provides some initial evidence that current readability estimates for 16 of the most prevalent mental health disorders common to all sites surveyed are well above the recommended 6th to 8th grade reading level.

Introduction
The readability of written patient health materials is a topic of great importance for public health researchers (see McInnes and Haglund (2011) and Weis (2003) for a full discussion of how the construct of readability is defined and related to reading comprehension). In the United States, recent estimates suggest that the average adult reads at the 6th to 8th grade reading level (Kutner, Greenberg, Jin, & Paulsen, 2006). This has profound implications for writers of public health information, as information presented at a reading level much higher than the national average has the potential to maintain or exacerbate existing health disparities by catering only to those consumers with a high reading and educational status. These effects may be particularly egregious for individuals experiencing stress or mental health concerns.
Luckily, much as the Internet is a valuable resource for consumers of health information, it is also a vast repository of publicly available data for researchers interested in evaluating the readability of online health and mental health information.
This study demonstrates how these data can be analyzed using methods that account for natural clustering by website, subject area, and/or readability index utilized to rate the text. In all cases, it is important to bear in mind how the research question of interest and interpretability of results may change in response to alternate conceptualizations for how data are structured within clusters. The purpose of this paper is thus two-fold: 1) to provide an overview of various methods for analyzing clustered data, including a discussion of the utility of the Intraclass Correlation Coefficient (ICC) and differences between fixed and random effects; and 2) to demonstrate how results may vary across two possible approaches to analyzing nested readability data from 6 different websites related to 16 different mental health disorders, using five separate readability rater indices.

Overview of methods for analyzing nested data. Numerous techniques for analyzing nested data are currently available using common computer programs such as SPSS (IBM Corp., Armonk, NY) or SAS (SAS Institute, Cary, NC), among others. These methods range from more straightforward methods such as multilevel analysis (or hierarchical linear modeling) for cross-sectional data using single indicator and outcome variables, to multi-level mediation models involving multiple mediators and moderators. More complex analyses often include categorical or non-normal response data and modeling of longitudinal effects over multiple time-points.
Given this wide range in methods, choosing the appropriate analysis may seem like a daunting task. However, it is important to remember that study design and an emphasis on addressing key questions of interest are of primary concern in developing an appropriate data analytic plan. The methods described herein are presented as a sampling of the multitude of techniques available for the analysis of nested data, and are discussed in order from the most 'simple' (in comparison with the other techniques discussed) to the most complex.
Fixed and random effects. Throughout this report, reference will be made to 'fixed' and 'random' effects in the context of multi-level modeling. Slight variations in the definition of fixed and random effects appear in the literature on mixed modeling depending on author orientation and intended message. According to Hamilton (2012), fixed effects typically refer to intercepts and slopes that are meant to describe the population as a whole, whereas random effects refer to intercepts and slopes that vary across subgroups within the sample. Within the Hierarchical Linear Modeling (HLM) framework, Warne et al. (2012) describe fixed effects as the average impact that an explanatory variable has on a dependent variable across all clusters or groups, and random effects as the degree of variation between clusters.
Likewise, Hayes (2006) describes random effects as effects that are allowed to vary between Level 2 (higher order) units, whereas fixed effects are those that have only a single value in the model for each Level 1 (lowest level) unit, regardless of the Level 2 unit under which they are nested. Under the umbrella of ordinary regression analyses, the intercept and slope are both considered fixed effects, and the residual is considered random. In contrast, when accounting for nested data, it is possible to specify an intercept and slope for each Level 2 unit of the same predictor by setting some of the coefficients as random effects (Hayes, 2006). One conceptualization of the present data would average scores across raters and treat disorders as nested within websites. However, given that this approach reduces the size of the sample from 480 units of analysis to 96, and that the data are limited in the number of Level 2 clusters (6 websites), power to detect an effect, if one is present, would likely be limited. Furthermore, without significant theoretical rationale for conceptualizing how nesting occurs within the data, it is equally possible to conceptualize that websites are clustered within disorders.
Another possible conceptualization of the data posits that each readability score for each disorder, from each website, is independent of all other scores, regardless of the rater (index) it was derived from. Although this iteration retains all data, it assumes that scores generated using the same readability index are not related, and ignores possible correlations within the data attributable to rater-specific effects.
Rater contributions are ignored, and only the fixed effects of website and subject area are explored in the regression analysis.
This second conceptualization represents a naïve approach because it ignores the possibility that each index can be thought of as a unique rater of the same block of text that uses a distinct formula to generate readability scores. It could hence be argued that scores within raters are more similar to each other than scores between raters, and that there is thus a need to account for these dependencies in the modeling process.
In the examples noted above, the similarities within raters are not accounted for either because an average readability score is calculated for each disorder by website combination (N decreases from 480 to 96), or because each rating is treated as independent of all others.
These conceptualizations are potentially problematic because either the total sample size is reduced, the number of Level 2 groups is small, and/or any interdependencies in the data are not explicitly accounted for. Alternatively, it is possible to retain all of the data by treating indices as 'individuals' who are making multiple ratings on various passages of text. Here, it is possible to explicitly account for similarities in rating strategies within individuals by conceptualizing that readability scores from distinct websites and content areas are nested within the five individual raters selected for this study. Within this framework, it is possible to not only retain all of the data, but also to account for interdependencies within the scores generated from the same raters. This can be accomplished in a number of ways.
First, using mixed modeling, an intercept-only model can be specified that includes a random intercept for raters and no predictors. This preliminary approach allows researchers to calculate the ICC, the ratio of group-level variance to total variance, and to determine the need for further nested modeling approaches. Here, the ICC represents the proportion of variance in the dependent variable that is explained by the grouping structure of the hierarchical model (Castro, 2002; Wears, 2008).
Although some statistical references suggest that an ICC close to zero negates the need for multi-level or clustered-data approaches, Hayes (2006) argues that ICC values as small as .05 can invalidate hypothesis tests and confidence intervals when clustering is not considered. In this context, a value of .05 would indicate that approximately 5% of the total variation in readability scores could be accounted for by which rater made the rating, and thus that raters should be taken into account. The ICC is discussed in more detail below.
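To make the variance-ratio interpretation of the ICC concrete, the following minimal Python sketch computes ICC(1) from a one-way random-effects ANOVA decomposition. The toy data are entirely hypothetical (three "raters" each scoring two passages) and are chosen only to make the arithmetic easy to follow:

```python
import numpy as np

# Hypothetical toy data: 3 raters each score 2 passages.
# Each row holds one rater's scores.
scores = np.array([
    [4.0, 6.0],    # rater 1
    [8.0, 10.0],   # rater 2
    [12.0, 14.0],  # rater 3
])

def icc1(groups):
    """ICC(1) from a one-way random-effects ANOVA:
    (MSB - MSW) / (MSB + (k - 1) * MSW), with k observations per group."""
    n, k = groups.shape                      # n groups, k ratings per group
    grand = groups.mean()
    # between-group mean square
    msb = k * np.sum((groups.mean(axis=1) - grand) ** 2) / (n - 1)
    # within-group mean square
    msw = np.sum((groups - groups.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

print(round(icc1(scores), 3))
```

Here most of the variation lies between the three raters, so the resulting ICC is high (about .88), signaling that the clustering should not be ignored.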
The researcher may then choose to add predictor variables to the model and explore how the ICC changes with each new addition. For instance, the researcher could include a random component for the rater variable, and specify the calculation of fixed effects for website and content area. This approach allows raters to vary on the mean of readability scores, but assumes that the degree of association between explanatory and outcome variables is the same for all raters. In this example, it is possible to determine the degree of variation in scores between raters, and account for this variation, if necessary. Likewise, the researcher is able to flexibly decide which coefficients are to be fixed, and which coefficients are allowed to vary based on theory and research design.
The mixed modeling approach described above might be particularly useful if we were interested in assessing differences between raters and had some theory or hypothesis concerning how scores might vary between raters, but assumed that the degree of association between the predictor variables and the dependent variable was the same for all raters included in the analysis. Likewise, if we hypothesized that the degree of association between our predictor(s) and readability scores varied between raters, we might specify a random component for the predictor(s) of interest. It is important to remember that the more coefficients specified, the greater the cost in degrees of freedom. Because our sample size and number of groups are relatively small, we may be limited to simpler methodological designs.
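A random-intercept specification of this kind can be sketched as follows, assuming Python with statsmodels and pandas. The data below are simulated, and all variable names and effect sizes are illustrative stand-ins for the study's design (5 raters, each scoring 96 website-by-disorder passages), not its actual scores:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Simulated design: 5 raters x (6 websites x 16 disorders) = 480 scores.
rater = np.repeat(np.arange(5), 96)
website = np.tile(np.repeat(np.arange(6), 16), 5)
rater_shift = np.repeat(rng.normal(0.0, 2.0, 5), 96)   # rater-specific mean shift
score = 11.0 + 0.4 * website + rater_shift + rng.normal(0.0, 1.0, 480)
df = pd.DataFrame({"score": score, "website": website, "rater": rater})

# Random intercept for rater; website enters as a fixed effect only, so
# raters may differ in mean score but share a single website effect.
model = smf.mixedlm("score ~ C(website)", df, groups=df["rater"])
result = model.fit()
print(result.summary())
```

The estimated group variance (reported in the summary) quantifies how much rater means differ; specifying random slopes instead would allow the website effect itself to vary by rater, at a cost in degrees of freedom.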
In contrast, if we were not interested in exploring differences between raters, but still wanted to account for the variability in readability scores due to rater effects, a population-averaged approach might be an appealing alternative. Generalized Estimating Equations (GEE) provide one such flexible regression-based strategy.
These models are appealing because: 1) they can handle a variety of correlated-measures designs, as well as a variety of outcome data (i.e., continuous, count, binary); and 2) they are more flexible in handling missing data than comparable models (Zeger, Liang & Albert, 1988).
Although both approaches take variation in rater scores into consideration, there can be marked differences in how output from these analyses is interpreted, particularly when outcome data are binary or counts. For linear data, coefficients derived from mixed modeling procedures represent the change in the mean outcome for a unit change in the associated predictor, holding the random effect fixed, whereas coefficients derived from GEE represent the change in the mean outcome for a unit change in the associated predictor, averaged across all levels of the grouping variable observed (Hubbard, et al., 2010).
Furthermore, whereas random-coefficient models typically explicitly address variation at both unit-specific and higher-order levels, GEE models assume simple random sampling of subjects representing a population, as opposed to a set of higher order groups. Hence, GEE models provide "population average" results and model the marginal expectation of the outcome variable as a function of the predictors specified.
Interestingly, intercept-only random-coefficients linear mixed models generally produce the same estimates as those obtained from the exchangeable working correlation model in GEE, albeit with a difference in degrees of freedom. Here, equal variances for all observations and equal covariance of all possible paired observations within the statistical unit are assumed, as well as no correlation of observations made on different units (Hubbard et al., 2010).
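The exchangeable working correlation referenced above assumes one common correlation $\rho$ between any two observations in the same cluster. As a generic sketch (for a cluster of four observations), it can be written as:

```latex
R_{\text{exch}} =
\begin{pmatrix}
1 & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho \\
\rho & \rho & 1 & \rho \\
\rho & \rho & \rho & 1
\end{pmatrix},
\qquad
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \rho \quad \text{for all } j \neq k .
```

This single off-diagonal parameter is exactly the equal-covariance assumption described above: all within-cluster pairs are treated as equally correlated, and observations from different clusters as uncorrelated.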
Although each modeling strategy has costs and benefits, the decision to employ GEE over mixed modeling (or vice versa) fundamentally comes down to the researcher's primary question(s) of interest. If the objective is to make comparisons between levels of the grouping variable with respect to the outcome of interest, a mixed modeling approach might be better suited. However, if the goal is to account for variation in the outcome variable due to clustering within the data, but not to make direct comparisons between clusters, a GEE approach might be more applicable. In the latter instance, the researcher models the marginal expectation of the outcome of interest across all clusters, and assumes that subjects are drawn from a sample representing the population. For a more detailed technical explanation, including a thorough discussion of the assumptions relevant to both modeling strategies, the reader is referred to Hubbard et al. (2010).
Intraclass Correlation. A discussion of clustered data analysis is not complete without detailed consideration of the ICC. One of the potential risks of using traditional statistical methods to analyze clustered data is that estimated standard errors may be smaller than appropriate (Warne et al., 2012); this can inflate the probability of Type I error (Hox, 2010). The ICC is a quantitative measure of the degree of dependence in the data, making it possible to assess how similar subjects are to each other within clusters (Kenny, Kashy & Bolger, 1998; Peugh, 2010). The value of the ICC ranges from 0.0 (perfect independence) to 1.0 (all subjects within a cluster are identical) (Warne et al., 2012).
Traditionally, the ICC has been conceptualized as a measure of rater reliability, which is particularly relevant considering the conceptualization of the data used for the running example in this text (i.e., various readability indices as individual 'raters' of the same passage of text). In a seminal article on intraclass correlations, Shrout and Fleiss (1979) provide several examples of different uses for the ICC in the context of a reliability study of the ratings of several judges. The authors make the point that assessing whether judgments made by multiple observers are reliable is critical to knowing whether these measurements are meaningful. However, multiple forms of the ICC exist, and each is appropriate under a limited set of circumstances.
The ICC is typically conceptualized in one of two ways: ICC (1) is a measure of the amount of variance in individual-level responses attributable to group-level properties, as described above, whereas ICC (2) is a measure of the reliability of group means (Castro, 2012). ICC (1) values are typically not affected by group size or the number of groups. However, because of differences in the formulas used to calculate these coefficients, ICC (2) is influenced by group size.
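One common ANOVA-based formulation of these two coefficients (a generic sketch in standard variance-partitioning notation, not taken verbatim from the sources cited) is:

```latex
\text{ICC}(1) = \frac{MS_{B} - MS_{W}}{MS_{B} + (k - 1)\,MS_{W}},
\qquad
\text{ICC}(2) = \frac{MS_{B} - MS_{W}}{MS_{B}}
             = \frac{k \cdot \text{ICC}(1)}{1 + (k - 1)\,\text{ICC}(1)},
```

where $MS_B$ and $MS_W$ are the between- and within-group mean squares and $k$ is the group size. The second equality is the Spearman-Brown step-up, which makes explicit why ICC (2), unlike ICC (1), depends on group size.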
Because ICCs are based on variance partitioning, they are subject to the same assumptions as analysis of variance (ANOVA), including homogeneity of variance, normality, and independence (Castro, 2002). In summary, the ICC provides an omnibus measure of dependency in the data, and can be used to determine the need for hierarchical or nested modeling procedures.
Hierarchical Linear Modeling (HLM). For multilevel analyses involving two levels (i.e., individuals nested within groups), HLM can generally be thought of as a two-step approach. The first step, or Level 1, typically involves estimating a separate regression for each group of interest using individual-level predictors and outcomes. At Level 2, the variance in the Level 1 slopes and intercepts is modeled using the group-level variable. These equations are evaluated simultaneously (Castro, 2002; Diez-Roux, 2000; Luke, 2004). By treating clustered groups both as their own level of data and as combinations of individual scores, it is possible to examine the cross-level influence of variables, thus developing a more nuanced and ecologically valid approach to examining real-world phenomena, when theoretically applicable (Luke, 2004). HLM is a statistical procedure that uses maximum likelihood to estimate the variance components of Level 2 models. This technique assumes multivariate normality. Other assumptions of HLM include that: Level 1 residuals are independent and normally distributed with a mean of zero and equal variances across groups; Level 1 predictors are independent of Level 1 residuals; random errors at Level 2 are multivariate normal and independent among Level 2 units; the set of Level 2 predictors is independent of Level 2 residuals; and Level 1 and Level 2 residuals are independent (Hofmann, 1997).
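The two-step structure just described can be written out explicitly. In standard two-level notation (a generic sketch of the model family, not the specific model fitted in this study):

```latex
\text{Level 1:}\quad Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}
```

```latex
\text{Level 2:}\quad
\beta_{0j} = \gamma_{00} + \gamma_{01} W_{j} + u_{0j},
\qquad
\beta_{1j} = \gamma_{10} + \gamma_{11} W_{j} + u_{1j}
```

Substituting the Level 2 equations into Level 1 yields the combined model
$Y_{ij} = \gamma_{00} + \gamma_{10} X_{ij} + \gamma_{01} W_{j} + \gamma_{11} W_{j} X_{ij} + u_{0j} + u_{1j} X_{ij} + r_{ij}$,
where $i$ indexes individuals, $j$ indexes groups, $W_j$ is a group-level predictor, and $u_{0j}$ and $u_{1j}$ are the Level 2 random effects whose variance components HLM estimates.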
Model building in HLM is a multi-stage process, in which the researcher may consider three broad classes of models, starting with a null model with no Level 1 or Level 2 predictors (Luke, 2004). As noted above, this model may be useful for calculating the ICC and guiding further decision-making, and generally produces estimates equivalent to those obtained from the exchangeable working correlation model in GEE. Next, depending on the primary question of interest, the researcher might begin to add predictor variables into the model, allowing the intercept to vary for each identified cluster. The last class of models assumes variation in slopes and/or intercepts across Level 2 units, and can include interactions between individuals and group-level constructs.
As discussed, although the benefits of using HLM to model real-world phenomena are plentiful, there are some important limitations of this approach that warrant further explication. The most notable of these limitations include: potential violations of the assumption of multivariate normality when considering cross-level interactions; the restriction that the dependent variable be operationalized at the lowest level of analysis; and the need for fairly large sample sizes to obtain a sufficient level of power (Castro, 2002; Hofmann, 1997). In our example using readability data derived online, a multi-level or HLM approach using all of the data may not be the best option given our conceptualization of the data as readability scores, related to different disorders from different websites, nested within raters selected from a population of all possible raters.

Generalized Linear Mixed Modeling (GLMM). HLM is a powerful technique for analyzing continuous outcome data. However, the assumptions of HLM do not hold when the response format is binary, multinomial, a proportion, or a count. For instance, if we were interested in whether websites passed or failed a reading grade level standard, or in the influence of various factors on the number of websites that scored at the average reading level (rather than a continuous readability outcome), other statistical methods that accommodate non-normal response formats would be necessary. GLMM is an extension of linear mixed modeling procedures that can readily handle non-normal data. This is particularly important when considering that much of the data collected online, in hospitals, schools, or other naturalistic community settings may follow a variety of alternative distributions (e.g., Poisson, binomial, negative binomial), such that the assumptions of linearity, normality, and constant variance may not be applicable. As such, GLMM is appropriate for estimating Level 1 and Level 2 effects for non-normal or non-linear data, allowing for multi-level analysis of binary, count, ordinal, and multinomial outcomes. Kaplan (2004) notes that additional steps must be taken when estimating generalized linear mixed models. First, a sampling model and link function must be specified. The link function transforms the expected value of the outcome into a predicted value that can be estimated with a linear equation. In the case of linear mixed modeling, the sampling model is a normal distribution with a mean and variance, and the link function is the identity (because no transformation is required). For binary outcomes (Y = 1, N = 0), the sampling model is a Bernoulli distribution and the link function is the logit (log-odds).
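The two link functions just mentioned can be illustrated directly. The following minimal Python sketch shows the identity link (continuous outcomes, no transformation) and the logit link (binary outcomes), together with the inverse logit that maps a linear predictor back to a probability:

```python
import numpy as np

def identity_link(mu):
    # Identity link for Gaussian outcomes: the mean is modeled directly.
    return mu

def logit_link(p):
    # Logit (log-odds) link for Bernoulli/binomial outcomes.
    return np.log(p / (1 - p))

def inv_logit(eta):
    # Inverse link: maps the linear predictor back to a probability.
    return 1 / (1 + np.exp(-eta))

print(logit_link(0.5))   # 0.0: even odds map to a linear predictor of zero
print(inv_logit(0.0))    # 0.5
```

Because the logit transform is monotone and unbounded, the linear predictor can range over the whole real line while fitted probabilities stay within (0, 1).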
Next, the researcher must specify a linear structural model to estimate the transformed expected value. Conditional models may be specified, such that the researcher has the option of including relevant Level 1 or Level 2 predictors, and including fixed and random effects, as needed (Kaplan, 2004).
Furthermore, when considering generalized linear mixed models, a distinction should be made between unit-specific and population average models (Raudenbush & Bryk, 2002). For instance, the unit-specific model (hierarchically structured model) describes processes that are occurring in each Level 2 cluster, where processes are captured by the beta-coefficients of the Level 1 model. Here, the primary question of interest may be how the processes differ over a population of Level 2 units. It may be possible that these processes differ in their intercept alone, slope, or both.
Furthermore, the Level 2 model may also assess how differences in the Level 2 explanatory variables influence Level 1 processes in each Level 1 unit. Hence, unit-specific models provide information about how the effects of predictors vary across groups (Kaplan, 2004; Raudenbush & Bryk, 2002). Raudenbush (2000) describes these questions as 'unit-specific', and contrasts them with a population-average (or Generalized Estimating Equations) approach (Zeger, Liang & Albert, 1988), in which the primary interest is in estimating average probabilities for population-level effects. Given the complexity and flexibility of this approach, one limitation may be that GLMM requires researchers to be explicit, a priori, about their research questions and the type of data available for analysis. Interestingly, this could be conceptualized as both a weakness and a strength of the approach, largely because it forces the researcher to devote considerable time and effort to clearly delineating specific research hypotheses or intended intervention effects.

Structural Equation Modeling.
Over the past number of years, structural equation modeling (SEM) has been studied and applied as a valid methodology for the analysis of multilevel or clustered data (Tomarken & Waller, 2005). Indeed, one of the primary strengths of SEM is the ability to specify latent variable models that provide estimates of the associations between latent constructs and their indicators (otherwise known as the measurement model), as well as between important constructs themselves (the structural model).
Using this framework, it is possible to account for biases that are attributable to random error and variation that is not better explained by the constructs of interest.
Other general strengths of SEM include the ability to evaluate complex models with a large number of linear equations against less complex models, as well as the ability to specify recursive relationships between constructs (and error terms), hence accounting for dependencies in data that are nested or collected repeatedly on the same individuals over time.
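The measurement and structural components described above can be written compactly in conventional LISREL-style notation (a generic sketch of the framework, not the models considered here):

```latex
\text{Measurement model:}\quad
\mathbf{y} = \Lambda_{y}\,\boldsymbol{\eta} + \boldsymbol{\varepsilon},
\qquad
\mathbf{x} = \Lambda_{x}\,\boldsymbol{\xi} + \boldsymbol{\delta}
```

```latex
\text{Structural model:}\quad
\boldsymbol{\eta} = B\,\boldsymbol{\eta} + \Gamma\,\boldsymbol{\xi} + \boldsymbol{\zeta}
```

Here $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ are the endogenous and exogenous latent constructs, the $\Lambda$ matrices link constructs to their observed indicators, and $B$ and $\Gamma$ carry the relationships among the constructs themselves; the error terms $\boldsymbol{\varepsilon}$, $\boldsymbol{\delta}$, and $\boldsymbol{\zeta}$ are what allow random measurement error to be separated from substantive variance.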
Raudenbush and Bryk (2002) compare HLM with SEM and note important trade-offs between the two frameworks. One downfall is that SEM typically requires balanced data within groups, such that each individual is required to have the same number of, and spacing between, time points. Furthermore, Level 1 predictors with random effects are required to have the same distribution across all cases within each group. Unlike SEM, the HLM framework allows for unequal group sizes and spacing of time points, and does not require the distributions of Level 1 random effects to be identical (Raudenbush & Bryk, 2002). In recent years, the SEM framework has been extended beyond the latent growth curve format, such that it is now possible to use SEM to examine clustered data in situations that do not involve repeated measurements (Heck & Thomas, 2015; Hox & Maas, 2001; Tomarken & Waller, 2005).
Some attention has also been focused on extending the assumptions of multilevel models within the SEM framework. Given these capabilities, it has become increasingly apparent that the boundaries between HLM, GLMM, and SEM have become somewhat blurred, and that researchers are now faced with the important task of deciding which framework is best suited to their data and their most relevant research hypotheses (Tomarken & Waller, 2005).
Indeed, a return to fundamental questions of interest in any research design can be a guiding beacon of light for those who find themselves bogged down in the murky waters of 'analysis paralysis' in search of the 'best' analytic method. It is important for researchers to remember that the 'best' modeling strategy is that which is most suited to their research design, and that no strategy can ultimately save those who fail to thoroughly plan for their journey into unexplored research lands.
Analysis of clustered data and issues related to sample size. A discussion of the analysis of clustered data using the techniques described above also warrants some mention of concerns related to sample size. There is some consensus that group-level sample size is more important than total sample size, with some compensation for a small number of groups in large individual-level samples (Maas & Hox, 2005). In a simulation study of sufficient sample sizes for multi-level modeling, Maas & Hox (2005) indicate that a small sample size at Level 2 (less than 50 groups) can lead to biased estimates of the Level 2 standard errors. Hence, the researchers strongly suggest using caution when applying multi-level methods with a limited number of groups, and call for bootstrapping or other simulation methods to account for these concerns when analyzing small-sample data.
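One simple variant of the bootstrapping suggested above is the cluster (or "block") bootstrap, sketched below in Python with NumPy. The five-cluster toy data and all numbers are hypothetical; the key idea is that whole clusters are resampled with replacement, so within-cluster dependence is preserved in every bootstrap replicate:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy clustered data: 5 clusters ("raters"), 96 simulated scores each.
clusters = [rng.normal(loc=m, scale=1.0, size=96) for m in (9, 10, 11, 12, 13)]

def cluster_bootstrap_mean(clusters, n_boot=1000, rng=rng):
    """Resample whole clusters with replacement and recompute the grand mean."""
    k = len(clusters)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, k, size=k)                  # pick k clusters
        resampled = np.concatenate([clusters[i] for i in idx])
        stats.append(resampled.mean())
    return np.array(stats)

boot = cluster_bootstrap_mean(clusters)
lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile 95% CI for the mean
print(lo, hi)
```

Because only five clusters exist, the resulting interval is wide; this honestly reflects the limited group-level information that standard model-based standard errors can understate.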
In light of these concerns, and the high probability that modeling real-world phenomena often involves a small or limited number of Level 2 groups, Hoyle and Gottfredson (2015) make several recommendations for maximizing the yield of multilevel modeling or SEM efforts when Ns are small. These recommendations include: retaining all cases in the analysis sample where possible, such that no data are left unmodeled; optimizing the observed data to achieve normality and using reliable measures; and fixing or constraining parameters where possible, using knowledge from previous research, to decrease the number of parameters that need to be estimated.
Summary. After careful consideration of the key points discussed above, two modeling strategies for assessing the readability of online mental health materials using the full dataset described herein stand out as distinct possibilities. First, the data could be conceptualized as following a 2-level hierarchy, with scores from various disorders and websites nested within the five raters selected for this analysis.
However, because of the small number of higher-order groups, as well as the relatively small size of our sample, it is hypothesized that utilizing a 2-level multilevel modeling approach may not be advisable.
Second, we could conceptualize the selected raters as a random sample of all possible raters of online material; although we are not interested in addressing differences between raters, we are interested in accounting for clustering within the data. Given this design consideration, a generalized estimating equations (GEE) approach could likewise be considered, because it is better suited to our primary question of interest (i.e., assessing differences in reading level scores between websites and disorders across the population of possible raters). Results from these approaches are discussed herein, with an emphasis on demonstrating that GEE may be better suited to the structure of these data, as well as to the underlying research question of interest.
However, because the response format is linear, it is likewise expected that results will not vary widely between approaches, and that the fundamental consideration for researchers selecting an appropriate methodology for analyzing these types of data will be conceptual in nature. Results were also expected to be largely consistent across indices, because the selected indices all measure the same construct.

ICC.
A two-way random effects model was specified for rater in order to assess variability in reading level scores between raters. A two-way random effects model was selected because the same indices were used to assess all selections of text, and the indices selected were chosen from a population of available indices used to calculate grade reading level scores. The ICC (2) assumes that the variance of the raters serves to add noise to any ratings obtained, and that the mean of rater error is zero. Results indicated that the estimated reliability between indices was 82.1%, with 95% CI [76.9, 86.6], using a consistency definition. The mean for reading level scores generated by the Gunning Fog index was highest and had the largest variability, whereas the mean for reading level scores generated by the SMOG index was lowest and had the smallest variability of the indices selected.
Overall, the indices selected were largely consistent in their ratings of readability scores across disorders and/or websites. Calculation of the ICC using an absolute agreement definition revealed that although the various raters selected were consistent in their scoring, and could be thought of as reliable raters of reading grade level, they were not in absolute agreement on readability scores, ICC (2) = .483, 95% CI [.156, .700]. The distinction between consistency and absolute agreement is best explained with an example: the score sets (2,4), (4,6), and (6,8) are perfectly consistent (ICC = 1.0), yet they are not in perfect absolute agreement. For our purposes, measuring the consistency of reading level scores across raters is important because it tells us that raters largely agree on how scores are ordered. Here, we can be confident that although there are differences in the scores generated by the raters selected, as a whole they are largely consistent in how they measure grade level readability for the disorders and websites selected.
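The consistency-versus-agreement distinction can be verified numerically. The sketch below (Python with NumPy) computes single-rater two-way ICCs from the ANOVA decomposition, following the usual McGraw-and-Wong-style formulation, for the (2,4), (4,6), (6,8) example from the text:

```python
import numpy as np

# The example from the text: three passages scored by two raters, where
# rater 2 is always exactly 2 points above rater 1.
ratings = np.array([
    [2.0, 4.0],
    [4.0, 6.0],
    [6.0, 8.0],
])  # rows = passages (targets), columns = raters

def icc_two_way(x):
    """Single-rater consistency ICC(C,1) and agreement ICC(A,1)
    from a two-way ANOVA decomposition."""
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # targets
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((x - x.mean(axis=1, keepdims=True)
                    - x.mean(axis=0, keepdims=True) + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))
    return consistency, agreement

c, a = icc_two_way(ratings)
print(c, a)
```

For these data the consistency ICC is exactly 1.0 while the agreement ICC is about .67: the constant 2-point offset between raters leaves their relative ordering perfect but counts against absolute agreement, mirroring the pattern reported for the readability indices.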
Alternatively, we could also use the ICC to determine the percentage of total variance in the outcome (readability score) that can be explained by the grouping variable (rater/index). Results from the unconditional intercept-only model (ICC = .409) suggest that approximately 41% of the total variation in reading level scores can be attributed to rater effects (i.e., which rater or index makes the rating). These results suggest that, overall, consideration of rater effects is important in the modeling process. Results are displayed in Table 2 below.

Discussion
The purpose of this paper was to provide a brief sampling of the analytic strategies available for analyzing nested reading grade-level data extracted from six different websites, for sixteen different mental health disorders and/or conditions, rated by five different readability indices. A discussion of various interpretations of the ICC was provided, as well as specific results from 1) a 2-level multi-level model with a random effect included to account for differences between raters/indices on average reading-level scores, and 2) a population-averaged GEE approach in which reading-level estimates were nested within a sample of all possible raters/indices.
In our example, data were conceptualized as clustered within the various indices used to rate text extracted from online sources. Because we were not interested in exploring differences between raters, theorized that the readability indices selected were a random sample of all possible indices used to rate written text, and wished to retain data from all raters for each website and disorder combination in the modeling process, a marginal models or GEE approach was selected as the best analytic strategy from a conceptual perspective. This approach was also selected because the number of units of the grouping variable was small (k = 5 indices/raters), and some researchers suggest that multi-level modeling with a small number of groups may be inadvisable due to issues related to power and Type I and Type II error (Hoyle & Gottfredson, 2015).
In this analysis, the variables website and disorder were treated as fixed effects, and an interaction term was included to account for differences in reading level scores across website and content area combinations. When comparing results from the GEE approach and a 2-level multi-level model, it is apparent that these two strategies provided similar results, as expected. These results are displayed below. The similarities in results generated from both modeling strategies may be due to a number of factors, including the continuous nature of the outcome variable and the limited number of factors included as fixed effects in the model. Although there are some differences in the interpretation of outcomes between multi-level models and GEE when the outcome variable is binary or nonlinear, interpretation is largely consistent across models for continuous data.
Likewise, although the number of groups included to account for clustering within the data was small (scores nested within five raters), only the variables website and disorder were included as explanatory variables in both models. In this case, the decision to utilize GEE over a 2-level multi-level model hence lies in the fundamental question of interest to the researcher. Given that the primary research objective of this study was to evaluate the relationship between websites, disorders, and their interaction on reading grade-level scores across a population of possible raters (indices), a GEE or marginal models approach was hypothesized to be the best conceptual fit for this specific question. However, because the outcome data are linear and normally distributed, multi-level modeling may also be an appropriate alternative strategy.
Overall, aside from a few key instances, the reading grade level for all disorders across the various websites explored far exceeded the suggested 6th to 8th grade reading level guidelines established by the CDC and other similar organizations.
In some cases (e.g., text related to borderline personality disorder from MedicineNet.com), the estimated reading grade level reached as high as 17.9. This estimate suggests that, on average, only an individual with an advanced graduate education (grade 17.9) would be able to read the selected text effectively. In other instances (e.g., text related to ADHD and agoraphobia from WebMD.com), reading grade level estimates were much lower and consistent with a 6th to 8th grade reading level. These estimates suggest that an individual who completed the 6th to 8th grade could effectively read the selected text. However, all other estimates obtained were markedly higher, with at least an average high school reading level required to adequately read the selected text.
Interestingly, text related to borderline personality disorder demonstrated the highest reading grade level estimate, followed by text related to bipolar disorder, social phobia, schizophrenia, MDD, and GAD, in descending order of grade level.
Examination of estimates for these disorders generally suggests that an individual with an average post-high school reading level could effectively read the segments of text selected for analysis. Given the severity of impairment often associated with these disorders (particularly borderline personality disorder, bipolar disorder, and schizophrenia), it could be surmised that the information available online from the websites surveyed is relatively inaccessible not only to most healthy consumers, but especially to those struggling with serious mental illness.
Not surprisingly, little difference was noted in reading grade level estimates between MDD and bipolar disorder, as these disorders may share a common language regarding general symptoms of depression. Likewise, given similarities in language, symptom presentation, and etiology, there was no notable difference in reading level scores for alcoholism and substance abuse, as well as social phobia and specific phobia. However, this rationale could not be extended to text describing the two predominant eating disorders examined in this study: reading level estimates for bulimia nervosa were significantly higher than those for anorexia nervosa. It is possible that further exploration of text content may reveal emphasis on different features, symptoms, or etiology of each disorder, hence contributing to differences in reading level estimates.
Future research may focus on: 1) increasing the number of clusters of the grouping variable by including ratings from additional indices; 2) re-conceptualizing the data as being nested within various websites, or within disorders (instead of within raters) to expand the number of groups; 3) further investigating inter-rater reliability by asking multiple individuals to extract text from the websites selected for the study; 4) investigating how the construct of reading comprehension is related to the readability of selected text using human subjects; and 5) exploring how readability and comprehension are related to utilization of health services. These ideas for future investigation may address some of the key limitations of this study, which include a small number of groups of the clustering variable, and the absence of any information regarding how reading comprehension might be related to reading-grade level of selected text. Furthermore, only information from disorders that were available on all web platforms was selected for this analysis. Expanding the number of websites and disorders for analysis may provide a more comprehensive picture of the readability of online mental health materials, and may reveal additional or alternative associations not demonstrated in this analysis.
Overall, despite some differences in the width of confidence intervals, results from the multi-level modeling and GEE approaches are consistent: although some website and disorder combinations had higher readability scores than others, scores from all websites and for all disorders exceeded the recommended 6th to 8th grade standard. This result is important because it demonstrates that much of the material obtained online is not written at a level that is comprehensible to the majority of consumers in the United States. To prevent the perpetuation of existing health disparities associated with low health literacy, writers of public online mental health materials are advised to take great care in ensuring that the information they post is accessible to as many individuals as possible. Readers are also encouraged to explore alternative modeling strategies for more complicated data, depending on their primary research aim.