Comparing Visual and Statistical Analysis in Single-Case Studies Using Published Studies

Little is known about the extent to which interrupted time series analysis (ITSA) can be applied to short, single-case study designs and whether those applications produce results consistent with visual analysis (VA). This article examines the extent to which ITSA can be applied to single-case study designs and compares the results based on two methods: ITSA and VA, using papers published in the Journal of Applied Behavior Analysis in 2010. The study was made possible by the development of software called UnGraph®, which facilitates the recovery of raw data from the graphs. ITSA was successfully applied to 94% of the examined graphs with the number of observations ranging from 8 to 136. Moderate to high lag-1 autocorrelations (>.50) were found for 46% of the data series. Effect sizes similar to group-level Cohen's d were identified based on the tertile distribution. Effects ranging from 0.00 to 0.99 were classified as small, those ranging from 1.00 to 2.49 as medium, and large effect sizes were defined as 2.50 or greater. Comparison of the conclusions from VA and ITSA had a low level of agreement (Kappa =.14, accounting for the agreement expected by chance). The results demonstrate that ITSA can be broadly implemented in applied behavior analysis research. These two methods should be viewed as complementary and used concurrently.

Group-level and single-case research designs are two methodological models employed for analyzing longitudinal research. The first model is based on data obtained from a large number of individuals and provides average estimates of longitudinal trajectories of behavior change based on group-level data, emphasizing between-subject variability. A significant limitation of group-level designs, also known as nomothetic designs, is the inability to capture high levels of variability and heterogeneity within the studied populations (Molenaar, 2004). Further, group-level designs emphasize central tendencies of the population and consequently obscure natural patterns of behavior change, their multidimensionality and unique variability within each individual (Molenaar & Campbell, 2009).
The second methodological approach employed in longitudinal research is based on data obtained from one individual or unit (N = 1) through intensive data collection over time. Single-case designs, also known as idiographic designs, Correspondence concerning this article should be addressed to Wayne F. Velicer, Ph.D., University of Rhode Island Cancer Prevention Research Center, 130 Flagg Road, Kingston, RI 02881. E-mail: velicer@uri.edu Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/hmbr. examine individual-level data, which allows for highly accurate estimates of within-subject variability and longitudinal trajectories of each individual's behavior. Idiographic methodology characterizes highly heterogeneous processes, which consequently allow for more accurate inferences about the nature of behavior change specific to an individual (Velicer & Molenaar, 2013). Single-case designs address the limitations of group-level designs and present several advantages. They allow for a highly accurate assessment of the impact of the intervention for each individual while grouplevel designs provide information about the effectiveness of the intervention for an "average" person, rather than any person in particular (Velicer & Molenaar, 2013).
In addition, single-case research allows studying longitudinal processes of change with much better precision than group-level designs, due to a higher number of data points and better controlled variability of the data. Also, it can be applied to populations that are otherwise difficult to recruit in numbers large enough to allow for a group-level design (Barlow, Nock, & Hersen, 2009;Kazdin, 2011).

Ergodic Theorems
The discrepancies between results from cross-sectional nomothetic data and those from longitudinal idiographic data can be understood through the ergodic theorems (Choe, 2005;Molenaar, 2008). Equivalent results will only occur if the two conditions specified by the ergodic theorems are met: (1) Each individual trajectory has to obey the same dynamic laws, and (2) Each individual trajectory must have equal mean levels and serial dependencies. If these conditions are not met, then results from nomothetic analyses will not capture the processes of the individuals that make up a sample. Inappropriately inferring from a group to an individual is known as an ecological fallacy, and is a common issue with nomothetic methods.
The ergodic theorems are based on a general theory about the relationships between intersystem and intra-system variability (where a system can be any unit: person, family, school, etc.). Ergodic theory addresses the relationships between individual units and groups of those units in the most general setting possible, namely for all measurable processes and using different metrics. A very special case is the set of Gaussian processes. Ergodicity for Gaussian processes is associated with the two conditions specified by the ergodic theorems. While these two criteria are sufficient for Gaussian processes, they are necessary (but not sufficient) for non-Gaussian processes. These fundamental issues are rarely evaluated by researchers, often due to too few data points (2-3 per unit is common and 8-10 in single-subject research).

Visual Analysis
Visual analysis (VA) is a descriptive method, widely used in applied behavior analysis research. The most basic experimental model used in single-case design is an AB model with a well-defined target behavior that is examined before and after the intervention. The first phase (A) of the design consists of multiple baseline observations that assess the pre-intervention characteristics of the behavior. In the second phase (B) of the design, the treatment component of the experiment is introduced and changes in behavior are examined (Barlow et al., 2009;Kazdin, 2011;Parsonson, & Baer, 1978). The most common form of AB design is a multiple baseline design (Shadish & Sullivan, 2011) where the timing of the intervention is staggered across cases, across dependent variables, or across settings.
The VA of the graph, performed by a judge or a rater, is based on a set of criteria that evaluate and compare the characteristics of Phase A and B and examine whether behavior changes observed in Phase B are caused by the intervention. The baseline (A) phase provides information about the descriptive and predictive aspects of the target behavior, such as stability and variability. Stable behavior, characterized by the absence of a trend or slope in the data, indicates that the targeted behavior neither increases nor decreases on average over time during the baseline phase (Kazdin, 2011). Variability of the data is characterized by the changes in the behavior within the range of possible low and high levels, and it is widely acknowledged that substantial variability of the behavior in the baseline phase can significantly impair the conclusions regarding the effects of the intervention (Barlow et al., 2009). Single-case experiments are evaluated based on magnitude and rate of change between Phase A and B. The magnitude of change is based on variability in level and slope of the data. Changes in level refer to average changes in the frequency of target behavior, whereas changes in slope refer to shifts in direction of the behavior across different phases. The mean is the average for all data in a particular phase. If the series is stable, the level will equal the slope. Changes in level and slope are independent from each other. Rate of change is based on changes in trend or slope of the data and latency of change. Trend analysis provides information on systematic increases or decreases in the behavior across phases, whereas latency of change refers to the amount of time between the termination of one phase and changes in behavior (Kazdin, 2011).
Although the above criteria are well established in the literature, they are rarely used in practice. Often, conclusions regarding the intervention effects instead of being based on a systematic and criterion based review, they are driven by a researcher's subjective evaluation. Applied behavior analysis researchers argue that large intervention effects are evident and provide unequivocal conclusions that can be easily observed by independent judges. Further, they state that the subjective evaluation of intervention effects has a minimal impact on reliability and validity of the conclusions drawn from the graphs presenting large and therefore easily observable treatment effects, because only those are considered to have significant clinical implications (Baer, 1977;Kazdin, 2011).
Proponents of VA acknowledge that certain characteristics of the data can significantly impair the ability to accurately evaluate intervention effects. The presence of slope in the baseline phase of the experiment may negatively affect the evaluation of the experiment, especially when the trend of the targeted behavior is moving in the same direction as potential treatment effects. High variability of the data may also interfere with the validity of the conclusions. However, advocates of this method state that the conservative approach to evaluating intervention effects guarantees highly accurate and consistent conclusions across independent judges, as well as reduces unknown probability of Type I error rate and consequently increases the probability of Type II error rate (Baer, 1977;Kazdin, 2011).
In the recent literature, some of the VA supporters have discussed the problem of the lack of effect size estimation, which results in an inability to perform meta-analytic reviews of single-case experiments. As stated by Kazdin (2011), the single-case research field would benefit from the ability to integrate a large number of studies in a systematic way that would allow drawing broader conclusions. However, to date there is no consensus regarding guidelines for interpreting effect sizes calculated based on methods that supplement VA. Brossart, Parker, Olson, and Mahadevan (2006) compared five analytic techniques frequently used in single-case research by applying them to the same data. They concluded that each analytical approach was strongly influenced by serial dependency, and the obtained results based on each method varied so much that it prohibited the development of any reliable effect size interpretation guidelines. A noteworthy study by Hedges, Pustejovsky, and Shadish (2012) proposed a new effect size that is comparable to Cohen's d, frequently used in group-level designs. It assumes normality and no trend in the data, and it can be applied across studies with at least three independent cases. This new approach can be applied in meta-analytic research and warrants further examination.
Several studies examined agreement rates among judges and showed that VA led to inconsistent conclusions about the intervention effects across different raters. The inter-rater agreement among judges who reviewed the same graphs was relatively poor, ranging on average from .39 to .61 (Jones, Weinrott, & Vaught, 1978;DeProspero & Cohen, 1979;Ottenbacher, 1990). Higher complexity of the data and experimental design resulted in less consistent conclusions. Factors like high variability of the data, inconsistent patterns of behavior over time, changes in slope, and small changes in level of the data were associated with lower agreement rates across judges (DeProsper & Cohen, 1979;Ottenbacher, 1990). One study by Jones et al. (1978) showed that the highest level of agreement between the two methods was found when there were non-statistically significant changes in the behavior, and the lowest agreement occurred when there were significant effects of the intervention. In addition, a number of studies demonstrated that higher levels of serial dependency in the data lead to higher rates of disagreement between visual and statistical analysis (Bengali & Ottenbacher, 1998;Jones et al., 1978;Matyas & Greenwood, 1990). Particularly, Matyas and Greenwood (1990) showed that positive autocorrelation and high variability in the data tend to increase Type I error rates. Overall, the above findings suggest that advantages of the conservative approach of VA are overstated and do not guarantee the reduction of Type I error rate. In addition, the effects of high autocorrelation on single-case data have been shown to negatively impact other analytical techniques such as inferential precision (Smith, Borckardt, & Nash, 2012) and effect size estimation (Manolov & Solanas, 2008).

Interrupted Time Series Analysis
Interrupted time series analysis (ITSA) 1 is a statistical method used to examine intervention effects of single-case study designs. It was initially developed by Box and Tiao (1965;Box & Jenkins, 1976) and introduced to the behavioral sciences by Glass, Willson, andGottman (1975/2008). 1 ITSA is a term descriptive of a method of analyzing idiographic data widely used in many disciplines. It should not be confused with a computer program ITSE developed by Williams and Gottman (1982) which was shown to be inaccurate (Harrop & Velicer, 1990) or the later version ITSACORR (Crosbie, 1993) which is also fatally flawed (Huitema, Bradley, McKean, & Laraway, 2007).
Although ITSA is widely used in areas such as econometrics, it has not reached saturation in the behavioral and social sciences to the same degree where there is little consensus on the appropriate method. Other methods that have been proposed for the same task, including multiple regression (e.g., Huitema, 2011;Maggin et al., 2011), multilevel modeling (e.g., Van den Noortgate & Onghena, 2003a, 2003b, 2007, 2008, and the overlap statistics proposed by Parker and others (e.g., . However, the autocorrelation structure of the data is sometimes ignored. For example, multiple regression is a special case of time series analysis when the autocorrelations are all equal to zero. As noted below, having all autocorrelations equal to zero is unlikely to occur, and ignoring the dependency in the data can lead to very inaccurate parameter estimates. An important general problem is that these methods do not directly address violations of the Ergodic Theorems (Molenaar, 2007;Molenaar, 2008;Velicer, Babbin, & Palumbo, 2014). The Ergodic Theorems represent a critical distinction between nomothetic and idiographic approaches and must be addressed before combining multiple idiographic studies.
An inherent property of time series data is serial dependency that reflects the impact of previous observations on the current observation and violates the assumption of independence of errors, which can significantly affect the validity of the statistical test. Serial dependency, examined by the magnitude and direction of autocorrelations between observations spaced at different time intervals (lags), directly impacts error term estimation and validity of the statistical test. Negative autocorrelations produce an overestimation of the error variance, which leads to conservative bias and increases Type II error rate, whereas positive autocorrelations lead to underestimation of the error variance, and cause liberal bias and increase Type I error rates (see Velicer & Molenaar, 2013, for an illustration).
Time series analysis may be expressed as a generalized least squares problem, i.e., where the parameters of interest are contained in the vector b, X is a design matrix, Z is the vector of observed data, and T is a lower triangular transformation matrix where the dependency is removed from the data. For an interrupted time series analysis, there are typically four parameters of interest in the b vector, the Level of the series (L), the Slope of the series (S), the Change in Level (DL), and the Change in Slope (DS). The slope parameters represent one of the other unique characteristics of a longitudinal design, the pattern of change over time. Investigating the pattern of change over time represents one of the real advantages of employing a longitudinal design. If the transformation matrix T = I, the identity matrix, there is no dependency in the data, and the parameter estimates are provided by the standard general linear model.
The ITSA method is able to measure the degree of the serial dependency in the data and statistically remove it from the series, allowing for an unbiased estimate of the changes in level and trend across different phases of the experiment (Glass et al., 1975(Glass et al., /2008. In addition, after accounting for serial dependency in the data, ITSA facilitates an estimate of the single-case effect size, by accounting for a withinindividual variance and evaluating the differences between experimental phases of the study. This type of effect size is similar to group-level Cohen's d effect size (Cohen, 1988), which is the most commonly used measure of intervention effects in behavioral sciences research with widely implemented interpretative guidelines.

ITSA Model Identification
Identification of the correct autoregressive moving average model (ARIMA), i.e., determining the specific transformation matrix T, is an essential element of ITSA, because model identification, as well as sample size, directly impact the accuracy of the parameter estimation. Proposed by Glass et al. (1975Glass et al. ( /2008 method for ARIMA estimation is computationally very complex, therefore not accessible to the average researcher, and it requires a large number of observation (minimum 100 data points). Nevertheless, Velicer and Harrop (1983) showed that identifying the correct ARIMA is often unreliable, even with the recommended number of data points, leading to model misidentification. To address the limitations of the Glass et al. (1975Glass et al. ( /2008 method, the general transformation model that does not require specification of a particular model, was proposed (Velicer & McDonald, 1984). While the Glass et al. (1975Glass et al. ( /2008) method requires a two-step approach: (1) identification of the ARIMA (p, d, q), which requires large number of data points and has been shown to be unreliable, and (2) estimation of the parameters after correct transformation of the data, the general transformation method replaces the specific transformation matrix by a generalized transformation matrix and avoids the problematic model identification step (Velicer & McDonald, 1984). Harrop and Velicer (1985) compared the results of ITSA using: (1) the model developed by Glass et al. (1975Glass et al. ( /2008 (2) a priori specified (1, 0, 0) model proposed by Simonton (1977); and (3) an assumed (3, 0, 0) model as an approximation to the recommended (5, 0, 0) general transformation model. The findings led to the conclusion that the model identification step can be eliminated and replaced with the assumed (1, 0, 0) and general transformation model, even for time series data with as few as 40 data points. However, the general transformation model instead of the assumed (1, 0, 0) model is recommended for higher order models.

ITSA Limitations
Although the accuracy of the assumed (1, 0, 0) model and the general transformation model has been shown for data with at least 40 observation, it has not been tested on a very short time series with less than 40 observations, which are very common in single-case study designs. Shadish and Sullivan (2011) found that among single-case studies, the median number of observations is 20, and 90.6% have less than 50 data points. Therefore, the accuracy of the parameter estimation based on time series with less than 40 observations should be considered cautiously. Another limitation is that information about the outcome distribution of the measures is typically lacking, and there may be more appropriate methods for alternative distributions. For example, a Poisson distribution is usually used to model counts, and that type of data is common in JABA reports. Alternative analytic methods based on the Poisson distribution are under development (Jazi, Jones & Lai, 2012) and represent an alternative choice for future analysis of these types of data.

Study Aims
The aim of this study is to examine the level of agreement between the conclusions drawn from VA of graphically presented data with the findings based on ITSA of the same data. The study uses graphical data based on single-case studies published in the Journal of Applied Behavior Analysis in 2010. This journal was selected because it is a leading journal on the topic used by applied researchers, and it strongly promotes the use of VA rather than quantitative analysis methods (Shadish & Sullivan, 2011;Smith, 2012). In a related study, all the studies published in a leading textbook (Kazdin, 2011) were evaluated in the same way (Harrington & Velicer, 2015). The study will also provide estimates of the degree of autocorrelation and estimate the effect size for each study.

Sample
Graphical data was obtained from the research papers published in the Journal of Applied Behavior Analysis (JABA) in 2010. For a graph to be included in this study, it was required to meet the following inclusion criteria: (1) present actual data (not simulated), (2) present interrupted time series data, (3) present a minimum of three observations in each phase of the design in order to estimate a full four-parameter model, (4) present baseline and treatment phases of an experimental design, (5) include corresponding description of the conclusions drawn from the VA of the graph, and (6) present welldefined data points (observations) in the graph. Any study design (e.g., ABAB, ABCA, etc.) presenting graphs that met the above eligibility criteria was included in the study. Graphs presenting cumulative data or alternating-treatment designs were not eligible.

Procedure
Eligible graphs were scanned and electronically imported into UnGraph R software version 5.0 (Biosoft, 2004), and the data was extracted using the UnGraph R software's function of a coordinate system. Then, sequentially ordered data recorded in a time series data format was exported into a Microsoft Excel R spreadsheet.

Validity and Reliability of UnGraph R Software
UnGraph R software has been previously examined for its validity and reliability when extracting data from graphs representing single-case designs (Shadish et al., 2009). Results of this study indicated high validity and reliability of the extracted data from graphs, with .96 as an average correlation coefficient between two raters.
Once the best fitting ARIMA was identified, parameters such as trend, change in trend, level, change in level, and mean and variability of the data series were evaluated. Intervention effects were examined based on changes in slope and level across the experimental phases of the design. An effect size similar to Cohen's d (d = Level /s), where the numerator represents change in level at the interruption point and the denominator represents within-case standard deviation, was calculated to examine the magnitude of the behavior change due to the intervention. The measure of effect size based on within-case standard deviation is expected to be inflated relative to between groups Cohen's d. An effect size was calculated only for studies where no significant slope or change in slope was present. Analyses were performed in SAS version 9.2.
Description of the VA of the graphs presented in the publications published in JABA was used to perform the comparison of the findings. These comparisons were based on conclusions made in regards to trend, change in trend, variability, and change in level across different experimental phases of the experiment.

Illustration of ITSE Application and VA Conclusions
To illustrate the application of the ITSA method in the analysis of single-case studies and comparison with the conclusions drawn on VA, three examples were selected from the experiments presented in Table 1.

Example 1
The first example is based on a study that examined the effects of providing praise and preferred edible items based on variable-time schedule in order to reduce problem behavior. In addition, effects of variable-time schedule on compliance were also evaluated. The study was based on a reversal design (ABAB) and included three participants (Lomas, Fisher, & Kelly, 2010). In the current example, data for one of the participants is provided. Sam was an 8-year-old boy diagnosed with Asperger syndrome and attention deficit hyperactivity disorder. Data displaying frequency of problem behavior and percentage of compliance in each phase of the design are presented in Figure 4 and Figure 5, respectively. Conclusions based on VA of the data suggested that the variable-   Figure 6. Top panel (ABCDEFBFEDC) First sequence of conditions "DRA lost its efficacy when implemented at less than 50% integrity with combined omission and commission errors" (p. 60).
On task (B (EF)) 9 17 (1, 0, 0   . . the third panel shows data from a driver for whom there was no effect when the delay was introduced or increased from 8 to 16 s." (p. 377).
Scott ( Figure 6. George (ABAB) "In summary, results of the combined analyses indicate that for these participants the relative rates of problem behavior and appropriate behavior were sensitive to a combination of the quality, delay, and duration of reinforcement following each alternative" (p. 584). Problem behavior (ABAB) 7/6 7/10 (5, 0, 0   Figure 1. Ricky (ABACABAC) "For Ricky, compliance improved when the guided compliance procedure was conducted" (p. 606).
Compliance (AACAAC) 3/3/3/3 6/8 (1, 0, 0) .58 * 6.91 23.11 −0.77 1.87 3.23 * 1.99 Ian (ABACABADAD) "For Ian, contingent access to preferred edible items initially appeared to be effective in increasing compliance, but compliance decreased toward the end of this phase" (p. 606). Compliance (AC) 5 6 (1, 0, 0  Note. The following information is included in the first column: authors of the publication, figure label as it is presented in the publication, experimental design presented using capital letters in the parenthesis. Unless otherwise indicated with the superscript ( †), each ITSA model was determined based on four parameters: level, slope, change in slope, and change in level. N BL = number of observations in the baseline or reference phase; N TX = number of observations in the treatment phase; ARIMA = autoregressive moving average model; AR 1 = autoregressive term 1; Level = intercept; Error σ = standard error estimate; Slope = t test statistic for linear trend of the time series; Slope = t test statistic for change in slope at the interruption point; Level = t test statistic for change in level at the interruption point; d = Cohen's d effect size; Cohen's d effect size is not available for time series with significant slope or change in slope. The quotes in the Table are the interpretation of a significant effect as presented in the original paper. The comparison that the quote refers to is indicated by the bolding below the quote. a significant AR 2. b significant AR 2 and AR 3. c significant AR 2, AR 3, and AR 4. d significant AR 2, AR 3, AR 4, and AR 5. † ITSA model estimated separately for slope and change in slope due to small number of observation that affected model's stability. * p < .05 time schedule reduced problem behavior, but did not increase compliance for Sam. Lomas et al. (2010) stated that "levels of compliance were only slightly higher during treatment with VT food and praise for Sam. .." and that "variable-time delivery of food and praise superimposed on a demand baseline (in which problem behavior continued to produce escape) greatly reduced problem behavior. . ." (p. 431). ITSA was implemented to evaluate the effect of variabletime delivery on problem behavior and compliance. The ARIMA (1, 0, 0) was applied to both behaviors to estimate 4 parameters: level, change in level, slope, and change in slope.
For problem behavior, lag-1 autocorrelation was .40. The analysis for slope and change in slope yielded nonsignificant findings, whereas change in level in the variable-time delivery phase indicated significant decrease in problem behavior (t (18) = −2.39, p < .05) with medium effect size (d = 1.85) based on tertile distribution. The findings based on statistical analysis confirm conclusions drawn from VA, indicating decrease in problem behavior due to variable-time delivery of preferred food and praise.
For compliance, lag-1 autocorrelation was .13. The analysis for slope and change in slope yielded nonsignificant findings, whereas change in level in the variable-time delivery phase indicated significant increase in compliance (t (18) = 2.43, p < .05) with medium effect size (d = 1.76). The findings based on statistical analysis did not confirm the conclusions drawn from VA, which indicated only slight increases in compliance, while statistical findings show significant increases with large effect sizes. ITSA details are presented in Table 1.

Example 2
The second example is based on a study that examined the effectiveness of a device that prevents drivers from changing gears for up to 8 seconds unless the seatbelt is buckled. The study was based on an ABA reversal design and included 101 commercial drivers (Van Houten et al., 2010). Data for one driver is displayed in Figure 6. Based on the VA of the data presented in the top panel, Van Houten et al. (2010) concluded ". . . an increase in seat belt use following the 8-s delay and a decline when the delay was removed" (p. 377).
ITSA was implemented to evaluate the effect of the 8-s gearshift delay on seatbelt use. Two ARIMAs (5, 0, 0) were applied to test increases in seatbelt use following the 8-s delay (AB) and to test a decline in seatbelt use when the delay was removed (BA). Each model estimated 4 parameters: level, change in level, slope, and change in slope. For AB phase of the design, lag-1 autocorrelation was significant (AR 1 = .56). The analysis for slope and change in slope yielded nonsignificant findings, whereas change in level in the 8s delay phase indicated significant increase in seatbelt use (t (79) = 8.59, p < .05) with large effect size (d = 2.78).
The findings based on statistical analysis confirm conclusions drawn from VA, indicating an increase in seatbelt use due to 8-s gearshift delay. For the BA phase of the design, lag-1 autocorrelation was significant (AR 1 = .76) . The analysis for slope yielded nonsignificant findings; however, change in slope and change in level were significant and indicated a decrease in seatbelt use due to removal of the gearshift delay (t (87) = −2.19, p < .05; t (87) = −8.58, p < .05 for change in slope and change in level respectively). The findings based on statistical analysis confirm conclusions drawn from VA, indicating a decrease in seatbelt use following removal of the 8-s gearshift delay.

Example 3
The third example is based on a study that performed several experiments, one of which examined the effects of delivery of higher quality reinforcement following appropriate behavior and lower quality reinforcement following problem behavior on changes in behavior (Athens & Vollmer, 2010). The study participant reported in this example was a 7-year-old boy diagnosed with attention deficit hyperactivity disorder, and the experiment was based on ABCAC design. Based on the VA of data presented in Figures 7 and 8, Athens and Vollmer (2010) made several conclusions such as "in the 1 HQ/ 1 LQ condition, rates of problem behavior decreased, and appropriate behavior increased" (p. 579); "problem behavior decreased, and appropriate behavior increased to high levels during the return to the 3 HQ/ 1 LQ condition" (p. 580); and "in summary, results of the quality analyses indicated that. . . the relative rates of both problem behavior and appropriate behavior were sensitive to the quality of reinforcement available for each alternative" (p. 581).
ITSA was implemented to evaluate the effect of the quality reinforcement on problem behavior and appropriate behavior. Three ARIMAs, estimating 4 parameters (slope, change in slope, level, and change in level) were applied to test each of the conclusions made based on VA.
First, an ARIMA (1, 0, 0) was implemented to evaluate the effects of 1 HQ/ 1 LQ on problem behavior and appropriate behavior (AB phase of the experiment). The lag-1 autocorrelations were −.05 and .13, for problem behavior and compliance, respectively. For problem behavior, ITSA revealed nonsignificant slope, significant change in slope (t (15) = 2.18, p < .05), and nonsignificant change in level. These findings indicated an increase in problem behavior in the quality reinforcement phase and did not confirm conclusions based on VA that found a decrease in problem behavior. For appropriate behavior, ITSA indicated significant slope (t (15) = −2.22, p < .05), nonsignificant change in slope, and significant change in level (t (15) = 4.24, p < .05). These findings indicated an initial decreasing trend in baseline phase (A) followed by an increase in compliance as an effect of 1 HQ/ 1 LQ quality reinforcement. The statistical results are consistent with VA conclusions. Second, an ARIMA (1, 0, 0) was applied to examine the effect of the return to 3 HQ/ 1 LQ phase on problem and appropriate behavior (AC phase of the experiment). The lag-1 autocorrelations were −.29 and .06, for problem behavior and compliance, respectively. For problem behavior, ITSA revealed significant slope (t (12) = −3.46, p < .05), a nonsignificant change in slope, and significant change in level (t (12) = 2.21, p < .05). These findings indicate an initial decreasing trend in problem behavior; however, the change in level indicate an increase in problem behavior during the 3 HQ/ 1 LQ experimental phase. The statistical results are not consistent with VA that concluded a decrease in problem behavior during the return to the quality reinforcement phase. For appropriate behavior, ITSA revealed nonsignificant slope, change in slope, and change in level. These findings indicate that no significant change in compliance occurred as a result of the 3 HQ/ 1 LQ experimental phase. The statistical results are not consistent with VA that concluded a high in-crease in compliance as a result of quality reinforcement phase.
Third, an ARIMA (1, 0, 0) and (5, 0, 0), for problem and appropriate behavior, respectively, was applied to examine the overall effect of the quality reinforcement (ABCAC experimental design). The lag-1 autocorrelations were −.07 for problem behavior and significant .44, for compliance. For problem behavior, ITSA revealed significant slope (t (41) = −2.88, p < .05), a nonsignificant change in slope, and significant change in level (t (41) = −2.91, p < .05). These findings indicate an initial decreasing trend, as well as decrease in problem behavior during the quality reinforcement phases. These results are consistent with VA. For appropriate behavior, ITSA revealed an initial significant increase in trend (t (41) = 3.49, p < .05), a nonsignificant change in slope, and change in level, indicating that quality of reinforcement did not have an effect on compliance. These results are not consistent with VA that concluded effectiveness of experimental treatment on increasing appropriate behavior.
FIGURE 4 Graphical presentation of the data illustrated in the first example of interrupted time series analysis (ITSA) application. Note. Figure reproduced from the data extracted using UnGraph R software from Lomas, Fisher, and Kelly, 2010 (p. 430).

FIGURE 5
Graphical presentation of the data illustrated in the first example of interrupted time series analysis (ITSA) application. Note. Figure reproduced from the data extracted using UnGraph R software from Lomas, Fisher, and Kelly, 2010 (p. 430).

Sample
A total of 75 research papers were published in the JABA in 2010. After reviewing the content of the publications, 25 papers met eligibility criteria and were included in the study. Excluded publications did not present interrupted time series data (27), presented fewer than 3 observations in at least one phase of the design (4), presented cumulative data (3), or alternating-treatment designs (9). One study presented gen-erated, hypothetical data, and one study presented a graph with insufficiently defined observations, which prevented data point extraction. Five studies were ineligible because presented descriptions of the findings based on the VA of the graphs were not possible to verify using ITSA (e.g., findings were generalized across all conducted experiments, rather than reported for each experiment separately). The eligible publications included one or more graphs. A total of 99 graphs presenting interrupted time series data with corresponding conclusions based on VA were included in the  study. The graphs displayed a diversified range of single-case designs, such as AB design and its variations (e.g., ABA, ABAB), ABC design and its variations (e.g., ABCA, AB-CACA, ABABACBC), and designs that included more than two different interventions (e.g., ABCD, ABCDEFBFEDC) (see Table 1 for details).
Based on 99 graphs, a total of 163 ITSA were performed, either because some graphs presented more than one inter-rupted time series data (e.g., two independent behaviors were plotted on a single graph) or multiple conclusions were made based on VA (e.g., conclusions were made based on different phases of the study). ITSA was applied to the data with the corresponding description of the findings formulated in a way that could be validated using statistical methods. To be certain that specific conclusions based on VA are directly comparable to findings based on ITSA, the key conclusions FIGURE 8 Graphical presentation of the data illustrated in the third example of interrupted time series analysis (ITSA) application. Note. Figure reproduced from the data extracted using UnGraph R software from Athens and Vollmer, 2010 (p. 580). were identified and matched with specific study phases, so that ITSA can be computed only for those phases. To illustrate the comparison process, the first study presented in Table 1 (St. Peter Pipkin, Vollmer, & Sloman, 2010) is used as an example. The complete study design is presented in parenthesis, next to the figure number ( Figure 6. Top panel (ABCDEFBFEDC)). The bolded and underlined phases are those comparisons made in the paper. In the row below, a conclusion based on VA of selected phases is cited: "DRA lost its efficacy when implemented at less than 50% integrity with combined omission and commission errors" (p. 60). In order to directly compare findings based on VA and ITSA, statistical analyses are performed only on data obtained from selected phases. The analyzed phases identified by letters are reported on the left side of the table in the same row as corresponding ITSA results, e.g. (B (EF)).

Descriptive Statistics
The number of observations in the analyzed experiments ranged from 8 to 136, with a minimum of 3 and maximum of 90 observations per phase. For 9 (5.52%) analyzed experiments, the interrupted-time series ARIMA did not converge. Six of those experiments came from one study that had multiple single-case data series characterized by low number of observations (<12) and low variability across observations; two experiments had higher number of observations (43 and 136) but low variability across observations; one experiment had high variability across 22 observations. The majority of the examples where the model did not converge typically did not meet the What Works Clearinghouse standards (Kratochwill & Levin, 2010;Smith, 2012).
An assumed ARIMA (1, 0, 0) (Simonton, 1977) was applied to 120 data series (77.92%). The general transformation ARIMA (5, 0, 0) (Velicer & McDonald, 1984;Harrop & Velicer, 1985) was applied to 32 data series (20.78%), all of which had 30 or more observations. ARIMAs (3, 0, 0) and (2, 0, 0) were applied to two experiments, after an assumed ARIMA (1, 0, 0) indicated correlated residuals and the general transformation ARIMA (5, 0, 0) (Velicer & Mc-Donald, 1984;Harrop & Velicer, 1985) did not converge due to an insufficient number of observations (for details see Table 1). According to classification outlined by Jones et al. (1978), low lag-1 autocorrelations ranging from .00 to .50 were found for 75 (46.01%) time series data, moderate lag-1 autocorrelations ranging from .51 to .75 were found for 67 (41.10%) time series data, and high lag-1 autocorrelations over .75 were found for 8 (4.91%) time series data. Lag-1 autocorrelation less than .00 were found for 13 (7.98%) time series data and ranged from −.32 to −.05. Lag-1 autocorrelations were significant for 93 time series data, 28 of those time series data also had significant lag-2 autocorrelations. The findings show a high heterogeneity of the lag-1 autocorrelations that could be largely related to a small number of observations in some experiments as well as different study designs. It has been shown that for short data series autocorrelations are negatively biased and may underestimate the true autocorrelation, and correcting for small sample bias is suggested (Huitema & McKean, 2000;Shadish & Sullivan, 2011). Figure 1 presents the distribution of the lag-1 autocorrelations for the eligible studies, and details are presented in Table 1.
Changes in the target behavior such as changes in level and decreasing or increasing trend were evaluated for all 154 data series for which an ARIMA was established. Twenty-three experiments (14.94%) had significant slope, indicating that the target behavior was either decreasing or increasing in the baseline phase; 15 (9.74%) had significant change in slope due to experimental design, indicating that the target behavior was either decreasing or increasing in the intervention phase of the experiment; 18 (11.69%) had significant slope and change in slope, indicating that target behavior was either decreasing or increasing in both phases of the experiment. The nonlinearity of the slopes was not examined. Over 50% of the examined time series data (k = 79) had significant changes in level as a result of the experiment due to examined study design phase change.
An effect size estimate, based on a similar formula used to evaluate Cohen's d, was estimated for all experiments that did not have significant slope or change in slope, a total of 98 (63.64%). The effect sizes ranged from 0.00 to 22.74 (see Table 1 for details). Figure 2 presents the distribution of the effect size estimates for the eligible studies. Cohen's (1988) traditional classification of effect sizes cannot be used, as effects calculated for single-case designs are expected to be inflated relatively to Cohen's standards for betweengroup studies. Therefore, in this study we propose alternative classification based on the tertile distribution of the effect sizes, where effects ranging from 0.00 to 0.99 are classified as small, those ranging from 1.00 to 2.49 as medium, and large effect size are defined as 2.50 or greater.

ITSA and VA Comparison
Comparison of the findings based on VA and ITSA was performed for 154 data series. Consistent results were found for 94 (61.04%) data series, with most conclusions (k = 79, 84.04%) referring to significant changes between different phases of the experiment, and 15 (15.96%) referring to nonsignificant changes such as reversal to baseline. For the remaining 60 experiments (38.96%), the findings based on statistical analysis did not confirm the conclusions based on VA (bolded data in Table 1). Among the experiments that led to inconsistent findings between the two methods, 30% had significant slope, change in slope or both, and 53% had lag-1 autoregressive term greater than .40.
Fifty-two, out of 60 data series that were identifies ad inconsistent, were identified as significant based on the VA method, while ITSA did not confirm these results. For 49 of those experiments, VA indicated significant changes between different phases of the study design, while statistical analysis did not reveal significant differences. No significant slope parameter or change in slope parameter was found for 38 of those experiments; 10 experiments had significant slope in the first phase with the trend of the targeted behavior moving in the same direction as hypothesized in the reference phase, and 1 experiment had significant slope in the treatment phase with the trend of the targeted behavior moving in the opposite direction than hypothesized based on VA analysis. For three experiments statistical analysis revealed significant findings. However those findings were in the opposite direction than conclusions reported based on VA. Two out of three experiments had significant slope in the first phase with the trend of the targeted behavior moving in the same direction as hypothesized in the reference phase. All three experiments were reported in the same research paper and are bolded and italicized in Table 1.
For eight experiments, nonsignificant findings based on VA were not confirmed by statistical analysis. Three of those experiments had no significant slope or change in slope parameter, four had significant slope in the first phase with the trend of the targeted behavior moving in the same direction as hypothesized in the reference phase, and one experiment had significant slope in the treatment phase with the trend of the targeted behavior moving in the opposite direction then hypothesized based on VA analysis.
The level of agreement between VA and ITSA was calculated based on the difference between the observed agreement and the expected agreement that would be present by chance alone. Kappa coefficient is a measure of this difference, ranging from -1 to 1, where 1 is a perfect agreement, 0 is an agreement by chance, and value < 0 would indicate an agreement less than expected by chance (Cohen, 1960). Figure 3 provides a summary of the agreement and disagreement between the two methods as well as the percent of cases with lag-1 autocorrelations greater than .40 for each cell. The overall level of agreement was low (Cohen's Kappa = .14) (Cohen, 1960). The VA results identified as significant had a very high percent of cases with lag-1 autocorrelations greater than .40. This is consistent with the potential bias that the positive autocorrelation can create the illusion of significance but decrease the apparent variability of the series. Figure 3 also presents the mean effect size estimate for each cell. As would be expected, the average effect size was higher for the significant ITSA results.

DISCUSSION
This study applied ITSA to 75 studies published in the JABA in 2010 and compared the conclusions authors reported based on VA with those obtained through ITSA. Issues such as autocorrelation, effect size estimation, and level of agreement between statistical and VA were addressed. Evaluated studies covered a wide range of single-case experiments that included different study designs, such as multiple-baseline, reversal, and multiple intervention designs. The experiments also differed in total number of observations in each study as well as within each phase of the design. ITSA model was estimated for all but nine of the eligible studies, indicating that this statistical method can be applied to a wide range of single-case experimental designs.

Agreement between Visual and Statistical Analysis
Comparison of the conclusions drawn from VA and ITSA revealed an overall low level of agreement (Kappa = .14). When graphical presentation of the intervention effects presents ideal or almost ideal data patterns, such as low variability, no trend, and large effect size, ITSA was in agreement with VA for 94 data series, including those with small numbers of observations. However, in 60 (38.96%) of the evaluated data series, the conclusions drawn based on VA did not agree with the statistical analysis. VA was more likely to imply significant effects when ITSA indicated nonsignificant findings. This is the opposite state of affairs expected by Baer (1977), who argued that visual analysis should be less likely to report significant findings than statistical analysis. Only for eight experiments, nonsignificant findings based on VA were not confirmed by statistical analysis, and for 3 experiments ITSA resulted in significant findings but in the opposite direction than indicated by VA. Among the experiments that led to inconsistent findings between the two methods, 30% had significant slope, change in slope or both, and 53% had lag-1 autoregressive term greater than .40.
If we view statistical analysis as a necessary but not sufficient condition for clinical significance, this result is discouraging. Moderate to high autocorrelation, present in most examples, is one potential explanation for the low agreement. Also, trend in the data, closely related to the autocorrelation and not easily observable, particularly in short series, may impact the accuracy of the conclusions based on VA. ITSA is able to account for trend in the data when examining intervention effects, as well as evaluate quantitatively trend and change in trend that may occur across different phases of the design.
Although the failure to detect a statistically significant effect occurred at a much smaller rate (5%), these errors have the potential to prematurely terminate the investigation of a potentially effective intervention. Initial studies of an intervention in a real world study typically represent an attempt to detect an effect in a very noisy environment, and effect sizes that are initially small can become much more important with additional controls.

Autocorrelation
Overall findings based on ITSA revealed high lag-1 autocorrelations for most of the evaluated data, including short time series of less than 20 observations. These results confirm findings based on earlier studies showing that serial dependency is a common property of single-case data (Jones, Vaught, & Weinrott, 1977;Jones et al., 1978;Matyas & Greenwood, 1990;Barlow et al., 2009). With over 60% of the lag-1 autocorrelations at either moderate or high level, the assumption that autocorrelations can be ignored (Huitema & McKeon, 1998) seems to be indefensible. The effect of a positive autocorrelation is to decrease the apparent degree of variability. This would potentially affect both graphical analysis and any statistical analysis that ignores dependency in the data.
The autocorrelations can also help address another important research question, i.e., what is the nature of the generating function for the observed data. The autocorrelations also provide information about the extent to which the ergodic theorems are satisfied, a critical question for combining data across individuals (Molenaar, 2008;Velicer & Molenaar, 2013). In order to draw valid inferences from group-level data to the individual level, two ergodic theorem conditions must be met: (1) the individual trajectories must obey the same dynamic laws, and (2) must have equivalent mean levels and serial dependencies (Molenaar, 2008;Velicer, Babbin, & Palumbo, 2014). However, the small sample sizes available in the studies reviewed here do not permit these questions to be addressed.

Effect Size Estimation
The effect size estimates were predominately large with some very large effect sizes such as d = 22.74, an extremely large effect size for the behavioral sciences. The term "clinical significance" is largely undefined but can be viewed as analogous to a large effect size. (Statistical significance is typically viewed as a necessary but not sufficient condition for clinical significance.) Based on this interpretation, the effect size estimates observed in this set of studies support the contention that graphical methods focus on clinically significant effect sizes.

Advantages of Statistical Analysis
ITSA provides supplementary quantitative information such as degree of the serial dependency, trend, changes in trend and level across phases, and variability of the data, that are not available through visual inspection of the graphs. Evaluation of the serial dependency could provide information about the generating function of the examined behavior, such as the strength of relationships of the observations or cyclic patterns in the behavior that are not observable by visual inspection of the graph. Unbiased statistical evaluation of the graphs facilitates comparison of the intervention effects across different individuals within the same experiment or across different studies. This information is particularly useful when experiments are executed across multiple subjects or settings, allowing for a better understanding of the unique variability of the behavior across different subjects or settings.
ITSA facilitates an estimate of effect size similar to Cohen's d that enables systematic meta-analytic review of single-case experiments, as well as evaluation of the intervention effects for experiments with small numbers of observations. In this study, we used the effect size to examine the magnitude of the intervention effects within single-cases; for the application of Cohen's d effect size to between-cases see work by Hedges et al. (2012). Statistical significance tests are largely dependent on the sample size. For small sample sizes, the results may be insignificant due to insufficient statistical power. However effect size is independent of sample size, and meta-analysis can provide more accurate estimates of effect size based on multiple replications. The development of the new software such as UnGraph R (Biosoft, 2004), DataThief (Tummers, 2006), and a new function in R (Bulté & Onghena, 2012) permits extraction of the data from published graphs and reanalysis using ITSA. This would permit the inclusion of historical data based on single-case studies in meta-analytical studies.

Limitations
The results of this study have limited representativeness. The collected data is based on a set of single-case studies published in JABA in 2010. The characteristics of these studies may influence the findings, particularly the large effect sizes and high autocorrelations, which are likely a product of the design and interventions published in JABA and the journal's preference of publishing studies that are likely to show the "clinically meaningful" threshold.
In addition, the autocorrelations were not corrected for small sample bias, which could underestimate the true autocorrelation. Therefore replication of these results in other samples of the published studies within the applied behavior analysis field is needed. Another potential limitation of the analysis was that nonlinearity was not examined. There is some reason to believe nonlinearity is present in this type of data.
The sample size, defined in single-case study designs as number of observations in each phase rather than number of different individuals, is another limitation of the study. For the set of studies reviewed here, the numbers of observations was generally very small compared to idiographic studies reported in other disciplines or even other areas of behavioral science. The average number of data points was 28 (median = 19) for 163 data series. These findings are similar to those presented by Shadish and Sullivan (2011) in a comprehensive review of 21 journals that report on single-case studies. They found that the median number of data points was equal to 20, whereas average number of data points in JABA in the year of 2008 was 29.
Large effect sizes are necessary for any type of significance, given the small sample sizes. However, a power analysis was seldom performed to guide the choice of the number of observations. Given that these studies focus on four parameters (slope, change in slope, level, and change in level), the lack of statistical power produces very poor estimates of the parameters of interest. Increasing the number of observations by even a small amount would greatly improve the quality of the research. There are times when obtaining additional observations is very difficult and expensive, but at other times a larger number of observations were collapsed for the graphical presentation of the data.
The number of observations is also related to the time between observations. Time is a core concept for idiographic studies, and we presently have very little information to guide researchers on how frequently observations should be taken. Advances from the information sciences are producing new measures that can greatly improve the quality and number of observations. A review of these methods, often labeled telemetrics, is provided by Goodwin, Velicer, and Intille (2008). Indeed, advances in telemetrics may shift the issue from not having many observations to having too many observations.

CONCLUSIONS
In conclusion, ITSA models can be applied to a large number of the published applied examples of single-case study designs. Moderate to high lag-1 autocorrelations (>.50) were found for 46% of the data series, and the majority of first order autocorrelations (more than 60%) were positive and at the moderate to high level (.41-.60 or >.60). Comparison of the conclusions drawn from VA and ITSA revealed an overall low level of agreement (Kappa = .14), and the results of the study support the conclusion that VA is prone to bias and should not be used as a stand-alone analytical method. When both methods produce discrepant results, the researcher should determine the basis for the discrepancy. Finally, ITSA provides important additional information such as effect size estimates, which permits the application of meta-analysis and the accumulation of knowledge.