Comparing Visual and Statistical Analysis in Single-Subject Studies

Objective. There has been an ongoing scientific debate in applied behavior analysis regarding the most reliable and valid method of evaluating single-subject data, between advocates of visual analysis and proponents of interrupted time-series analysis (ITSA). To address this debate, a head-to-head comparison of both methods was performed, together with an overview of serial dependency, effect sizes, and sample sizes. Method. The comparison of the two methods was conducted in two independent studies. In the first study, conclusions drawn from visual analysis of the graphs published in the Journal of Applied Behavior Analysis (2010) were compared with the findings based on ITSA of the same data; in the second study, conclusions drawn from visual analysis of the graphs presented in the textbook by Alan E. Kazdin (2011) were used. These comparisons were made possible by the development of software called UnGraph®, which permits the recovery of the raw data from published graphs and thus allows the application of ITSA. Results. In both studies, ITSA was successfully applied to over 90% of the examined time-series data, with numbers of observations ranging from 8 to 136. Over 60% of the data had moderate to high first-order autocorrelations (> .40). A large effect size (≥ .80) was found for over 70% of the eligible studies. Comparison of the conclusions drawn from visual analysis and ITSA revealed a low level of agreement (Kappa = .14) in the first study and a moderate level of agreement (Kappa = .44) in the second study. Conclusions. These findings show that ITSA can be broadly implemented in applied behavior analysis research and can facilitate evaluation of intervention effects, particularly when specific characteristics of single-subject data limit the reliability and validity of visual analysis. Comparison of the two methods revealed low to moderate agreement between visual analysis and ITSA. Overall, the two methods should be viewed as complementary and used concurrently.


Visual analysis, although guided by a set of criteria, is driven mostly by a subjective evaluation of intervention effects. Advocates of this approach state that large intervention effects are evident and yield unequivocal conclusions that are easily observed by independent judges. It is argued that the rationale for using visual analysis is to highlight large (i.e., easily observable) intervention effects and to disregard small (i.e., not easily observable) effects, on the assumption that visually undetected intervention effects have little clinical impact. Proponents of visual analysis further state that this conservative approach guarantees highly accurate and consistent conclusions across independent judges and reduces the (unknown) probability of Type I errors, at the cost of an increased probability of Type II errors.
Several studies have examined agreement rates among judges and showed that visual analysis led to inconsistent conclusions about intervention effects across raters. The inter-rater agreement among judges who reviewed the same graphs was relatively poor, suggesting that visual analysis is not a reliable method for assessing intervention effects in single-subject data. Factors such as high complexity of the data and experimental design, high variability of the data, changes in slope, and serial dependency of the single-subject data were associated with lower agreement rates among judges and with increased Type I error rates.
On the other hand, advocates of visual analysis call attention to several drawbacks of ITSA, such as the difficulty of accurately estimating an ARIMA model, the requirement of a large number of observations, and the inability to apply this statistical method to complex single-subject experimental designs.
To address this debate, a head-to-head comparison of both methods was performed in two independent studies.
The first study used graphical data based on the single-subject studies published in the Journal of Applied Behavior Analysis (2010). The journal was selected because it is a leading journal on the topic used by applied researchers and because it strongly promotes the use of visual analysis rather than statistical methods.
In the second study, graphical data was obtained from the book titled "Single-Case Research Designs: Methods for Clinical and Applied Settings" by Alan E. Kazdin (2011), who is currently the leading advocate of visual analysis of single-subject studies. The book is a widely used textbook in applied psychology and provides numerous examples of graphs presenting single-subject experimental data with corresponding evaluations of intervention effects based on visual analysis.

Journal of Applied Behavior Analysis Examples
Group-level and single-subject research designs are two methodological models employed for analyzing longitudinal research. The first model is based on data obtained from a large number of individuals and provides average estimates of longitudinal trajectories of behavior change based on group-level data, emphasizing between-subject variability. A significant limitation of group-level designs, also known as nomothetic designs, is the inability to capture high levels of variability and heterogeneity within the studied populations (Molenaar, 2004). Further, group-level designs emphasize central tendencies of the population and consequently obscure natural patterns of behavior change, their multidimensionality and unique variability within each individual (Molenaar & Campbell, 2009).
The second methodological approach employed in longitudinal research is based on data obtained from one individual or unit (n = 1) through intensive data collection over time. Single-subject designs, also known as idiographic designs, examine individual-level data, which allows for highly accurate estimates of within-subject variability and of the longitudinal trajectory of each individual's behavior. Idiographic methodology characterizes highly heterogeneous processes and consequently allows for more accurate inferences about the nature of behavior change specific to an individual (Velicer & Molenaar, 2013). Single-subject designs address the limitations of group-level designs and present several advantages. They allow for a highly accurate assessment of the impact of the intervention for each individual, while group-level designs provide information about the effectiveness of the intervention for an "average" person, rather than any person in particular (Velicer & Molenaar, 2013).
In addition, single-subject research allows studying longitudinal processes of change with much better precision than group-level designs, due to a higher number of data points and better controlled variability of the data. It can also be applied to populations that are otherwise difficult to recruit in numbers large enough to allow for a group-level design (Kazdin, 2011).

Methods of evaluating single-subject studies
Currently, there are two widely used methods for evaluating intervention effects based on single-subject designs. Visual analysis of graphs presenting experimental data is the commonly used approach in applied behavior analysis research, while interrupted time-series analysis (ITSA) is a statistical method used in research fields such as electrical engineering, economics, business, and other areas of psychology, to name just a few. The use of visual analysis preceded the development of quantitative methods like time-series analysis, which required high-speed computers to implement.

Visual analysis
The most basic experimental model used in single-subject research is an AB design with a well-defined target behavior that is examined before and after the intervention. The first phase (A) of the design consists of multiple baseline observations that assess the pre-intervention characteristics of the behavior. In the second phase (B) of the design, the treatment component of the experiment is introduced and changes in behavior are examined (Kazdin, 2011).
The visual analysis of the graph, performed by a judge or rater, is based on a set of criteria that evaluate and compare the characteristics of phases A and B and examine whether behavior changes in phase B are a result of the intervention. The baseline (A) phase provides information about the descriptive and predictive aspects of the target behavior, such as stability and variability. Stable behavior, characterized by the absence of a trend or slope in the data, indicates that the targeted behavior neither increases nor decreases on average over time during the baseline phase (Kazdin, 2011). Variability of the data is characterized by the changes in the behavior within the range of possible low and high levels. Single-subject experiments are evaluated based on the magnitude and rate of change between phases A and B. The magnitude of change is assessed through changes in the level and slope of the data.
Changes in level refer to average changes in the frequency of the target behavior, whereas changes in slope refer to shifts in the direction of the behavior across different phases. The mean is the average of all data in a particular phase; if the series is stable (i.e., has no slope), the level will equal the mean. Changes in level and slope are independent of each other. Rate of change is based on changes in the trend or slope of the data and on the latency of change.
Trend analysis provides information on systematic increases or decreases in the behavior across phases, whereas latency of change refers to the amount of time between the termination of one phase and changes in behavior (Kazdin, 2011).
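To make the relation among level, slope, and mean concrete, a single phase can be written, as a simple illustration rather than the authors' own notation, as a linear trend model:
\[
y_t = \beta_0 + \beta_1 t + \varepsilon_t, \qquad t = 1, \dots, n,
\]
where \(\beta_0\) is the level (intercept) and \(\beta_1\) is the slope. The phase mean is approximately \(\beta_0 + \beta_1 \bar{t}\), so when the phase is stable (\(\beta_1 = 0\)) the level and the mean coincide; this is the sense in which the level equals the mean for a stable series.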
Visual analysis, although guided by the set of criteria described above, is not based on any specific decision-making rules and is mostly driven by subjective evaluation of the intervention effects. Advocates of this approach argue that large intervention effects are evident and provide unequivocal conclusions that can be easily observed by independent judges. Further, it is argued that the subjective evaluation of intervention effects has a minimal impact on the reliability and validity of the conclusions drawn from graphs presenting large, and therefore easily observable, treatment effects, since only those are considered to have significant clinical implications (Kazdin, 2011). This view is particularly promoted in the field of applied behavior analysis.
Proponents of visual analysis acknowledge that certain characteristics of single-subject data can significantly impair the ability to accurately evaluate intervention effects. The presence of slope in the baseline phase of the experiment may negatively affect the evaluation of the experiment, especially when the trend of the targeted behavior is moving in the same direction as would be expected due to treatment effects. High variability of the data may also interfere with the validity of the conclusions. For example, the accuracy of the evaluation of intervention effects on disruptive behavior can be significantly affected by a pattern of behavior that is decreasing (getting better) over time. Likewise, high variability of the behavior, such as extreme fluctuations from no disruptive behavior to a high frequency of disruptive behavior, can limit the ability to draw valid conclusions about intervention effects (Kazdin, 2011). However, it is argued that the rationale for using visual analysis is to highlight large (i.e., easily observable) intervention effects and disregard small (i.e., not easily observable) effects.
It is concluded that visually undetected intervention effects have insignificant clinical impact. Proponents of visual analysis state that this conservative approach to evaluating intervention effects guarantees highly accurate and consistent conclusions across independent judges and reduces the (unknown) probability of Type I errors, at the cost of an increased probability of Type II errors (Kazdin, 2011).
In the recent literature, some advocates of visual analysis have discussed the lack of effect size estimation, which results in an inability to perform meta-analytic reviews of single-subject experiments. As stated by Kazdin (2011), the single-subject research field would benefit from the ability to integrate a large number of studies in a systematic way that would allow drawing broader conclusions regarding intervention effects that generalize beyond single experiments.
However, to date there is no consensus regarding guidelines for interpreting effect sizes calculated with the supplementary analytic methods commonly used by single-case researchers to augment visual analysis. One study compared five analytic techniques frequently used in single-subject research, applied to the same data, and concluded that each analytical approach was strongly influenced by serial dependency and that the results obtained with each method varied so much that it prohibited the development of any reliable effect-size interpretation guidelines.
The inability to estimate effect sizes with the currently used analytical methods leaves meta-analytic approaches out of reach in the field of single-subject research. A noteworthy study by Hedges, Pustejovsky, and Shadish (2012) proposed a new effect size that is comparable to the Cohen's d frequently used in group-level designs. It is estimated across single-subject cases and can be used in studies with at least three independent cases. This new approach can be applied in meta-analytic research and warrants further examination.
Several studies examined agreement rates among judges and showed that visual analysis led to inconsistent conclusions about the intervention effects across different raters. The inter-rater agreement among judges who reviewed the same graphs was relatively poor, ranging on average from .39 to .61 (Jones, Weinrott, & Vaught, 1978; DeProspero & Cohen, 1979; Ottenbacher, 1990), suggesting that visual analysis is not a reliable method for assessing intervention effects of single-subject data. Higher complexity of the data and experimental design resulted in less consistent conclusions.
Factors like high variability of the data, inconsistent patterns of behavior over time, changes in slope, and small changes in the level of the data were associated with lower agreement rates across judges (DeProspero & Cohen, 1979; Ottenbacher, 1990).
In addition, Matyas and Greenwood (1990) showed that a positive autocorrelation and high variability in the data tend to increase Type I error rates. These findings suggest that the claimed advantage of visual analysis in reducing Type I error rates is overstated.
Several studies demonstrated that higher levels of serial dependency in single-subject data lead to higher rates of disagreement between visual and statistical analysis (Jones et al., 1978; Matyas & Greenwood, 1990). One study by Jones et al. (1978) showed that the highest level of agreement between the two methods was found when there were no statistically significant changes in the behavior, and the lowest agreement occurred when there were significant effects of the intervention. These findings suggest that statistically significant results may be overlooked by visual analysis more often than non-significant results and that the highest agreement between these two methods occurs when there is no serial dependency in the data and intervention effects are insignificant.

Interrupted time-series analysis
Interrupted time-series analysis (ITSA) is a statistical method used to examine intervention effects in single-subject designs based on chronologically ordered observations. The most widely used model for examining serial dependency in the data is the autoregressive integrated moving average (ARIMA) model. It consists of three elements to be evaluated. The autoregressive term (p) estimates the extent to which the current observation is predictable from preceding observations and the number of past observations that impact the current observation. The moving average term (q) estimates the effects of preceding random shocks on the current observation. The integrated term (d) refers to the stationarity of the series. Stationarity of time-series data requires the structure and parameters of the data, such as the mean, variance, and pattern of autocorrelations, to remain the same across time. Nonstationary data require differencing in order to keep the series at a constant mean level; otherwise, the reliability of the assessed intervention effects can be compromised (Glass, Willson, & Gottman, 2008).
The ITSA method is able to measure the degree of the serial dependency in the data and statistically remove it from the series, allowing for an unbiased estimate of the changes in level and trend across different phases of the experiment (Glass et al., 2008). In addition, after accounting for serial dependency in the data, ITSA facilitates an estimate of Cohen's d effect size (Cohen, 1988), which is the most commonly used measure of intervention effects in behavioral sciences research with widely implemented interpretative guidelines.
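As a concrete illustration, one common formulation (not necessarily the exact parameterization used in the studies reviewed here) models an AB interruption with the four parameters discussed throughout this document, level, slope, change in level, and change in slope, together with first-order autoregressive errors:
\[
y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - t_I) D_t + u_t, \qquad u_t = \phi_1 u_{t-1} + \varepsilon_t,
\]
where \(D_t\) equals 0 during baseline and 1 after the interruption at time \(t_I\); \(\beta_0\) is the level, \(\beta_1\) the slope, \(\beta_2\) the change in level, \(\beta_3\) the change in slope, and \(\phi_1\) the lag-1 autoregressive parameter. Under this formulation, a Cohen's d type effect size can be obtained, for example, by dividing the estimated change in level by the standard deviation of the dependency-corrected residuals, \(d = \hat{\beta}_2 / \hat{\sigma}_\varepsilon\); this scaling is an illustrative convention, not a statement of how d was computed in the studies analyzed below.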

ITSA limitations
Although the most widely recommended method for removing serial dependency from single-subject data is to implement an ARIMA model (Glass et al., 2008), some researchers call attention to drawbacks of ITSA related to accurate ARIMA model estimation and to its limited utility in applied behavior analysis studies (Ottenbacher, 1992; Kazdin, 2011). Identifying the correct ARIMA model has been shown to be unreliable, often leading to model misidentification (Velicer & Harrop, 1983).
However, this issue has been addressed through the general transformation method, which uses the ARIMA model for lag-5 autocorrelation (5, 0, 0) and was shown to be simpler and more accurate than other model specification methods (Velicer & McDonald, 1984). Alternatively, the ARIMA model for lag-1 autocorrelation (1, 0, 0) is sufficient when applied to data that do not require forecasting (Simonton, 1977). Simulation studies have shown that these procedures are very accurate (Harrop & Velicer, 1985; Harrop & Velicer, 1990).
Another disadvantage of the ARIMA procedure has been associated with the requirement of a large number of observations. A minimum of 35-40 observations, or even as many as 25 observations per phase, has been recommended in order to correctly identify an ARIMA model (Glass et al., 2008; Ottenbacher, 1992). However, application of a predetermined ARIMA model allows for reliable evaluation of shorter data series. In addition, proponents of visual analysis argue that ITSA may not be a suitable analytical approach for experimental designs that reach beyond the basic AB model, such as alternating-treatment designs or multiple-baseline designs (Kazdin, 2011).

Study Aims
This study will perform a head-to-head comparison of the conclusions drawn from visual analysis of graphically presented data with the findings based on interrupted time-series analysis of the same data. The study will use graphical data based on single-subject studies published in the Journal of Applied Behavior Analysis (2010). This journal was selected because it is a leading journal on the topic used by applied researchers and it strongly promotes the use of visual analysis rather than quantitative analysis methods (Shadish & Sullivan, 2011;Smith, 2012). In a related study, all the studies published in a leading textbook (Kazdin, 2011) were evaluated in the same way (Harrington & Velicer, 2013).
The aim of this study is to examine the level of agreement between these two methods, as well as the degree of serial dependency in single-subject data, and to estimate the effect size for each study. This comparison is made possible by the development of UnGraph® software (version 5.0), which permits the recovery of raw data from published graphs and thus the application of interrupted time-series analysis.

Sample
Graphical data was obtained from the research papers published in the Journal of Applied Behavior Analysis (JABA) in 2010. For a graph to be included in this study, it was required to meet the following inclusion criteria: (1) present actual data (not simulated); (2) present interrupted time-series data; (3) present a minimum of three observations in each phase of the design in order to estimate a full four parameter model; (4) present baseline and treatment phases of an experimental design; (5) include corresponding description of the conclusions drawn from the visual analysis of the graph; and (6) present well defined data points (observations) in the graph. Graphs presenting cumulative data or alternating-treatment designs were not eligible.

Procedure
Eligible graphs were scanned and electronically imported into UnGraph® software (version 5.0). Next, the data presented in each graph was extracted using the UnGraph® coordinate-system function, which defined each graph's structure and scale. The sequentially ordered data, recorded in a time-series format, was then exported into a Microsoft Excel® spreadsheet.

Validity and reliability of UnGraph ® software
UnGraph® software has previously been examined for validity and reliability when extracting data from graphs representing single-case designs (Shadish et al., 2009). The results of that study indicated high validity and reliability of the data extracted from graphs, with an average correlation coefficient of .96 between two raters.

Analysis
Interrupted time-series analyses (ITSA) were used to evaluate intervention effects of each single-subject study based on the data collected using UnGraph ® software.
Identification of the ARIMA model was performed in a series of steps. First, the level of autocorrelation in the data was evaluated based on the autocorrelation function (ACF) and the partial autocorrelation function (PACF). These two functions refer to the autoregressive and moving average parameters and indicate whether negative or positive correlation was present in the data series, as well as at how many lags the correlation was present.
Also, the stationarity parameter (d) was evaluated, and if required, differencing of the data was performed.
Second, the values of each parameter were estimated and the fit of the ARIMA model was evaluated. The best fitting model resulted in uncorrelated residuals. In cases where the residuals were correlated, the model identification process was repeated and a new model was evaluated (Glass et al., 2008). Once a correctly identified ARIMA model was applied to the single-subject data, parameters such as trend, change in trend, level, change in level, as well as the mean and variability of the series, were evaluated. Intervention effects were examined based on changes in slope and level across the experimental phases of the design. In addition, for studies where no significant slope or change in slope was present, Cohen's d effect size was calculated to examine the magnitude of the behavior change due to the intervention. Analyses were performed in SAS version 9.2. This study was approved by the University of Rhode Island Institutional Review Board.
Descriptions of the visual analysis of the graphs presented in the JABA publications were used to perform a head-to-head comparison of the findings based on each method. These comparisons were based on conclusions made regarding trend, change in trend, variability of the data, and change in level of the data across the experimental phases of each design.
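The ARIMA identification and checking steps described above can be sketched as follows. This is an illustrative outline in Python with simulated data, not the SAS code used for the analyses; the series, phase lengths, and intervention effect are hypothetical.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf, pacf, adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox

# Simulate a baseline (A) and a treatment (B) phase with AR(1) errors and a level shift.
rng = np.random.default_rng(1)
n_base, n_tx = 15, 15
u = np.zeros(n_base + n_tx)
for i in range(1, len(u)):
    u[i] = 0.5 * u[i - 1] + rng.normal(0, 1)
y = pd.Series(10 + u + np.r_[np.zeros(n_base), 5 * np.ones(n_tx)])

# Step 1: inspect serial dependency and stationarity (ACF, PACF, unit-root test).
print(acf(y, nlags=5))
print(pacf(y, nlags=5))
print("ADF p-value:", adfuller(y)[1])

# Step 2: build the intervention regressors; the level is the model intercept,
# 't' carries the slope, 'phase' the change in level, 't_since' the change in slope.
t = np.arange(len(y))
phase = (t >= n_base).astype(float)
t_since = np.where(t >= n_base, t - n_base, 0.0)
X = np.column_stack([t, phase, t_since])

# Step 3: fit an ARIMA (1, 0, 0) error structure with the intervention regressors
# and check that the residuals are uncorrelated (Ljung-Box test).
fit = SARIMAX(y, exog=X, order=(1, 0, 0), trend="c").fit(disp=False)
print(fit.tvalues)   # t-tests for level (intercept), slope, change in level, change in slope, AR1
print(acorr_ljungbox(fit.resid, lags=[5]))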

Sample
A total of 75 research papers were published in the JABA in 2010. After reviewing the content of the publications, 25 papers met eligibility criteria and were included in the study. Excluded publications did not present interrupted time-series data (k = 27), presented less than 3 observations in at least one phase of the design (k = 4), presented cumulative data (k = 3), or alternating-treatment designs (k = 9). One study presented generated, hypothetical data, and one study presented a graph with insufficiently defined observations, which prevented data point extraction. Five studies were ineligible because the presented description of the findings based on the visual analysis of the graph was not possible to verify using ITSA.
The eligible publications included one or more graphs. A total of 99 graphs presenting interrupted time-series data with corresponding conclusions based on visual analysis were included in the study. The graphs displayed a diversified range of single-subject designs, such as AB design and its variations (e.g. ABA, ABAB), ABC design and its variations (e.g. ABCA, ABCACA, ABABACBC), and designs that included more than two different interventions (e.g. ABCD, ABCDEFBFEDC) (see Table 1 for details).
Each graph presented one or more interrupted time-series data (e.g., data points presenting two independent behaviors were plotted on one graph). Conclusions based on visual analysis were applied to either the full study design or to one or more sections of the design. ITSA was applied to the data with the corresponding description of the findings formulated in a way that could be validated using statistical methods. A total of 163 ITSA were performed.

Descriptive statistics
The number of observations in the analyzed experiments ranged from 8 to 136, with minimum of 3 and maximum of 90 observations per phase. For 9 (5.52%) analyzed experiments, the interrupted-time series ARIMA model did not converge.
Six of those experiments came from one study that had multiple single-subject data series characterized by a low number of observations (< 12) and low variability across observations; two experiments had a higher number of observations (43 and 136) but low variability across observations; one experiment had high variability across 22 observations.
For the remaining 154 time-series data, 23 (14.94%) had a significant slope, 15 (9.74%) had a significant change in slope due to the experimental design, and 18 (11.69%) had both a significant slope and a significant change in slope. The nonlinearity of the slopes was not examined.
Over 50% of the examined time-series data (k = 79) had significant changes in level due to the phase changes of the study design.
Small lag 1 autocorrelations, ranging from .00 to .20, were found for 15 time-series data; small-medium lag 1 autocorrelations, ranging from .21 to .40, were found for 34 time-series data; medium lag 1 autocorrelations, ranging from .41 to .60, were found for 61 time-series data; and large lag 1 autocorrelations of .61 or larger were found for 40 time-series data. Lag 1 autocorrelations less than .00 were found for 13 time-series data and ranged from -.32 to -.05. Lag 1 autocorrelations were significant for 93 time-series data, and 28 of those time-series data also had significant lag 2 autocorrelations. The autocorrelations were not corrected for small sample bias (Shadish & Sullivan, 2011). Figure 1 presents the distribution of lag-1 autocorrelations for the eligible studies, and details are provided in Table 1.
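For reference, the lag-1 autocorrelation summarized above can be computed and binned as follows; this is a minimal sketch with a hypothetical series, using the ordinary (uncorrected) estimator and the cut points given in the text.

import numpy as np

def lag1_autocorrelation(y):
    """Ordinary (uncorrected) lag-1 autocorrelation of a one-dimensional series."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.sum(d[1:] * d[:-1]) / np.sum(d * d)

def classify_r1(r1):
    """Bins used in the text: negative, small, small-medium, medium, large."""
    if r1 < 0.0:
        return "negative"
    if r1 <= 0.20:
        return "small (.00-.20)"
    if r1 <= 0.40:
        return "small-medium (.21-.40)"
    if r1 <= 0.60:
        return "medium (.41-.60)"
    return "large (>= .61)"

series = [3, 4, 4, 5, 6, 6, 7, 9, 9, 10, 11, 12]   # hypothetical observations
r1 = lag1_autocorrelation(series)
print(round(r1, 2), classify_r1(r1))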
Cohen's d effect size was estimated for all experiments that did not have a significant slope or change in slope, a total of 98 (63.64%). The effect sizes ranged from 0.00 to 22.74. Figure 2 presents the distribution of the effect size estimates for the eligible studies. Based on Cohen's (1988) classification, small effect sizes, ranging from .20 to .49, were found for 8 time-series data; medium effect sizes, ranging from .50 to .79, were found for 8 time-series data; and large effect sizes of .80 or greater were found for 72 time-series data (73.47%). Details are provided in Table 1.
Conclusions drawn from visual analysis disagreed with the findings based on ITSA for 60 experiments (see Table 1). For 52 of those experiments, visual analysis indicated significant changes between different phases of the study design when statistical analysis did not reveal significant differences. For 8 experiments, non-significant findings based on visual analysis were not confirmed by statistical analysis. See Figure 3 for a summary of the agreement and disagreement between the two methods. The overall level of agreement was low (Cohen's Kappa = .14) (Cohen, 1960). Among the experiments that led to inconsistent findings between the two methods, 30% had a significant slope, change in slope, or both, and 53% had a lag-1 autoregressive term greater than .40.
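The agreement statistic itself is straightforward to reproduce once each experiment is coded as showing or not showing a significant intervention effect under each method. The sketch below uses hypothetical codings, not the actual study-by-study classifications, simply to show how Cohen's Kappa is obtained.

from sklearn.metrics import cohen_kappa_score

# 1 = an intervention effect is reported, 0 = no effect; hypothetical codings.
visual      = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
statistical = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(cohen_kappa_score(visual, statistical))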

ITSA and visual analysis comparison
To illustrate the application of the ITSA method to the analysis of single-subject studies and its comparison with conclusions drawn from visual analysis, three examples were selected from the experiments presented in Table 1.

Example 1
The first example is based on a study that examined the effects of providing praise and preferred edible items on a variable-time schedule in order to reduce problem behavior. In addition, the effects of the variable-time schedule on compliance were also evaluated. The study was based on a reversal design (ABAB) and included three participants (Lomas, Fisher, & Kelly, 2010). In the current example, data for one of the participants is provided. Sam was an 8-year-old boy diagnosed with Asperger syndrome and attention deficit hyperactivity disorder. Data displaying the frequency of problem behavior and the percentage of compliance in each phase of the design are presented in Figure 4 and Figure 5; details are provided in Table 1.

Example 2
The second example is based on a study that examined the effectiveness of a device that prevents drivers from changing gears for up to 8 seconds unless the seatbelt is buckled. The study was based on an ABA reversal design and included 101 commercial drivers (Van Houten et al., 2010). Data for one driver is displayed in Figure 6.

Example 3
The third example is based on a study that performed several experiments, one of which examined the effects of delivering higher quality reinforcement following appropriate behavior and lower quality reinforcement following problem behavior on changes in behavior (Athens & Vollmer, 2010). The study participant reported in this example was a 7-year-old boy diagnosed with attention deficit hyperactivity disorder, and the experiment was based on an ABCAC design. Based on the visual analysis of the data presented in Figures 7 and 8, Athens and Vollmer (2010) made several conclusions, such as "in the 1 HQ/ 1 LQ condition, rates of problem behavior decreased, and appropriate behavior increased" (p. 579); "problem behavior decreased, and appropriate behavior increased to high levels during the return to the 3 HQ/ 1 LQ condition" (p. 580); and "in summary, results of the quality analyses indicated that . . . the relative rates of both problem behavior and appropriate behavior were sensitive to the quality of reinforcement available for each alternative" (p. 581).
ITSA was implemented to evaluate the effect of the quality reinforcement on problem behavior and appropriate behavior. Three ARIMA models, estimating 4 parameters (slope, change in slope, level and change in level) were applied to test each of the conclusions made based on visual analysis.
First, an ARIMA (1, 0, 0) was implemented to evaluate the effects of 1 HQ/ 1 LQ on problem behavior and appropriate behavior (the AB phase of the experiment). The lag 1 autocorrelations were -.05 and .13 for problem behavior and appropriate behavior, respectively. For problem behavior, ITSA revealed a non-significant slope, a significant change in slope (t(15) = 2.18, p < .05), and a non-significant change in level. These findings indicated an increase in problem behavior in the quality reinforcement phase and did not confirm the conclusion based on visual analysis that problem behavior decreased. For appropriate behavior, ITSA indicated a significant slope (t(15) = -2.22, p < .05), a non-significant change in slope, and a significant change in level (t(15) = 4.24, p < .05). These findings indicated an initial decreasing trend in the baseline phase (A) followed by an increase in appropriate behavior as an effect of the 1 HQ/ 1 LQ quality reinforcement. The statistical results are consistent with the conclusions from visual analysis.
Second, an ARIMA (1, 0, 0) was applied to examine the effect of the return to 3 HQ/ 1 LQ phase on problem and appropriate behavior (AC phase of the experiment).
The lag 1 autocorrelations were -.29 and .06 for problem behavior and appropriate behavior, respectively. For problem behavior, ITSA revealed a significant slope (t(12) = -3.46, p < .05), a non-significant change in slope, and a significant change in level.

Evaluated studies covered a wide range of single-subject experiments that included different study designs, such as multiple-baseline, reversal, and multiple-intervention designs. The experiments also differed in the total number of observations in each study as well as within each phase of the design. ITSA was successfully applied to all but nine of the eligible studies, indicating that this statistical method can be applied to a wide range of single-subject experimental designs frequently occurring in applied behavior analysis research.
These findings directly refute the claim, commonly voiced by proponents of visual analysis, that ARIMA models cannot be applied to data obtained from a wide range of single-subject studies.

Serial Dependency
Overall findings based on ITSA revealed high lag-1 autocorrelations for most of the evaluated data, including short time-series of less than 20 observations. These results confirm findings based on earlier studies showing that serial dependency is a common property of single-subject data (Jones, Vaught, & Weinrott, 1977; Jones et al., 1978; Matyas & Greenwood, 1990).
The majority of first order autocorrelations (more than 60%) were positive and at the moderate to high level (.41-.60 or >.60). Given the sample size limitations, it is difficult to form more precise conclusions. However, the assumption that autocorrelations can be ignored (Huitema & McKeon, 1998) seems to be indefensible.
The effect of a positive autocorrelation is to decrease the apparent degree of variability. This would potentially affect both graphical analysis and any statistical analysis that ignores dependency in the data. Velicer and Molenaar (2013) provide a visual illustration of how positive autocorrelation smooths a series.
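A small simulation makes this point concrete. Assuming nothing beyond a standard AR(1) process, a positively autocorrelated series standardized to the same overall variance shows smaller point-to-point fluctuations than an independent series, which is why it looks smoother and less variable to the eye.

import numpy as np

rng = np.random.default_rng(0)
n, phi = 500, 0.6

white = rng.normal(0, 1, n)                       # independent observations
ar1 = np.zeros(n)
for t in range(1, n):                             # AR(1) with positive autocorrelation
    ar1[t] = phi * ar1[t - 1] + rng.normal(0, np.sqrt(1 - phi ** 2))

for name, series in [("independent", white), ("AR(1), phi = 0.6", ar1)]:
    s = (series - series.mean()) / series.std()   # put both on the same overall variance
    print(name, "mean absolute successive difference:", round(np.mean(np.abs(np.diff(s))), 2))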
The autocorrelations can also help address another important research question, namely the nature of the generating function for the observed data. The autocorrelations also provide information about the extent to which the ergodic theorems are satisfied, a critical question for combining data across individuals (Molenaar, 2008; Velicer & Molenaar, 2013). In order to draw valid inferences from group-level data to the individual level, two ergodic theorem conditions must be met: (1) the individual trajectories must obey the same dynamic laws, and (2) they must have equivalent mean levels and serial dependencies (Molenaar, 2008; Velicer, Babbin, & Palumbo, in press). However, the small sample sizes available in the studies reviewed here do not permit these questions to be addressed.

Effect Size Estimation
The effect size estimates were predominantly large (73%), with some very large values such as d = 22.74, an extremely large effect size for the behavioral sciences. The term 'clinical significance' is largely undefined but can be viewed as analogous to a large effect size. (Statistical significance is typically viewed as a necessary but not sufficient condition for clinical significance.) Based on this interpretation, the effect size estimates observed in this set of studies support the contention that graphical methods focus on clinically significant effect sizes.

Sample Size Issues
The number of observations is also related to the time between observations.
Time is a core concept for idiographic studies and we presently have very little information to guide researchers on how frequently observations should be taken.
Advances from the information sciences are producing new measures that can greatly improve the quality and number of observations. A review of these methods, often labeled telemetrics, is provided by Goodwin, Velicer, and Intille (2008). Indeed, advances in telemetrics may shift the issue from not having enough observations to having too many observations.

Agreement between Visual and Statistical Analysis
Comparison of the conclusions drawn from visual analysis and ITSA revealed an overall low level of agreement (Kappa = .14). When the graphical presentation of the intervention effects showed ideal or nearly ideal data patterns, such as low variability of the data, no trend in the data, and evident effects of the intervention, ITSA was in agreement with visual analysis, even for studies with small numbers of observations or experimental designs with more than the two (AB) phases.
However, in 34% of the evaluated studies, the conclusions drawn based on visual analysis were not supported by statistical analysis. This means that the reported significant results from visual analysis could have been due to chance. If we view statistical analysis as a necessary but not sufficient condition for clinical significance, this result is discouraging. Once the data diverge from the ideal pattern, visual analysis and ITSA can lead to contrary findings. Serial dependency in time-series data is one potential explanation. Moderate to high serial dependence was present in most examples. It is well known that this can impact reliability and validity of the conclusions based on visual analysis.
Another basis for disagreement is the presence of trend. Trend is not easily observable through visual analysis, especially in short series, and therefore may not be accounted for when evaluating changes in level across phases of the experiment. ITSA is able to account for trend in the data when examining intervention effects, as well as to quantitatively evaluate trend and changes in trend that may occur across different phases of the design.
Although the failure to detect a statistically significant effect occurred at a much smaller rate (5%), these errors have the potential to prematurely terminate the investigation of a potentially effective intervention. Initial studies of an intervention in a real world study typically represent an attempt to detect an effect in a very noisy environment and effect sizes that are initially small can become much more important with additional controls.

Advantages of Statistical Analysis
In addition, for all single-subject studies, ITSA provided supplementary quantitative information, such as the degree of serial dependency, trend, changes in trend and level across phases, and variability of the data, that is not available through visual analysis. Furthermore, the development of new software such as UnGraph®, and of new functions in R packages, creates the possibility of extracting the values from published graphs and reanalyzing the available data using ITSA.
This opens up a unique opportunity to use historical data based on single-subject studies and perform far-reaching meta-analytical studies.

Limitations
The findings based on this study have limited representativeness; the graphs analyzed were drawn from a single year of a single journal.
(Table 1 appears here. For each experiment it reports the figure label and design from the original publication, the numbers of baseline and treatment observations, the ARIMA model, the AR1 estimate, level, error, slope, change in slope, change in level, Cohen's d, and the corresponding conclusions quoted from the visual analysis in each publication.)

Kazdin (2011) Textbook Examples
The second methodological approach employed in longitudinal research is based on data obtained from one individual or unit (n = 1) through intensive data collection over time. Single-subject designs, also known as idiographic designs, examine individual-level data, which allows for highly accurate estimates of within-subject variability and of the longitudinal trajectory of each individual's behavior. Idiographic methodology characterizes highly heterogeneous processes and consequently allows for more accurate inferences about the nature of behavior change specific to an individual (Velicer & Molenaar, 2013). Single-subject designs address the limitations of group-level designs and present several advantages. They allow for a highly accurate assessment of the impact of the intervention for each individual, while group-level designs provide information about the effectiveness of the intervention for an "average" person, rather than any person in particular (Velicer & Molenaar, 2013).
In addition, single-subject research allows studying longitudinal processes of change with much better precision than group-level designs, due to a higher number of data points and better controlled variability of the data. It can also be applied to populations that are otherwise difficult to recruit in numbers large enough to allow for a group-level design (Kazdin, 2011).

Methods of evaluating single-subject studies
Currently, there are two widely used methods for evaluating intervention effects based on single-subject designs. Visual analysis of graphs presenting experimental data is the commonly used approach in applied behavior analysis research, while interrupted time-series analysis (ITSA) is a statistical method used in research fields such as electrical engineering, economics, business, and other areas of psychology, to name just a few. The use of visual analysis preceded the development of quantitative methods like time-series analysis, which required high-speed computers to implement.

Visual analysis
The most basic experimental model used in single-subject research is an AB design with a well-defined target behavior that is examined before and after the intervention. The first phase (A) of the design consists of multiple baseline observations that assess the pre-intervention characteristics of the behavior. In the second phase (B) of the design, the treatment component of the experiment is introduced and changes in behavior are examined (Kazdin, 2011).
The visual analysis of the graph, performed by a judge or rater, is based on a set of criteria that evaluate and compare the characteristics of phases A and B and examine whether behavior changes in phase B are a result of the intervention. The baseline (A) phase provides information about the descriptive and predictive aspects of the target behavior, such as stability and variability. Stable behavior, characterized by the absence of a trend or slope in the data, indicates that the targeted behavior neither increases nor decreases on average over time during the baseline phase (Kazdin, 2011). Variability of the data is characterized by the changes in the behavior within the range of possible low and high levels. Single-subject experiments are evaluated based on the magnitude and rate of change between phases A and B. The magnitude of change is assessed through changes in the level and slope of the data.
Changes in level refer to average changes in the frequency of the target behavior, whereas changes in slope refer to shifts in the direction of the behavior across different phases. The mean is the average of all data in a particular phase; if the series is stable (i.e., has no slope), the level will equal the mean. Changes in level and slope are independent of each other. Rate of change is based on changes in the trend or slope of the data and on the latency of change.
Trend analysis provides information on systematic increases or decreases in the behavior across phases, whereas latency of change refers to the amount of time between the termination of one phase and changes in behavior (Kazdin, 2011).
Visual analysis, although guided by the set of criteria described above, is not based on any specific decision-making rules and is mostly driven by subjective evaluation of the intervention effects. Advocates of this approach argue that large intervention effects are evident and provide unequivocal conclusions that can be easily observed by independent judges. Further, it is argued that the subjective evaluation of intervention effects has a minimal impact on the reliability and validity of the conclusions drawn from graphs presenting large, and therefore easily observable, treatment effects, since only those are considered to have significant clinical implications (Kazdin, 2011). This view is particularly promoted in the field of applied behavior analysis.
Proponents of visual analysis acknowledge that certain characteristics of single-subject data can significantly impair the ability to accurately evaluate intervention effects. The presence of slope in the baseline phase of the experiment may negatively affect the evaluation of the experiment, especially when the trend of the targeted behavior is moving in the same direction as would be expected due to treatment effects. High variability of the data may also interfere with the validity of the conclusions. For example, the accuracy of the evaluation of intervention effects on disruptive behavior can be significantly affected by a pattern of behavior that is decreasing (getting better) over time. Likewise, high variability of the behavior, such as extreme fluctuations from no disruptive behavior to a high frequency of disruptive behavior, can limit the ability to draw valid conclusions about intervention effects (Kazdin, 2011). However, it is argued that the rationale for using visual analysis is to highlight large (i.e., easily observable) intervention effects and disregard small (i.e., not easily observable) effects.

It is concluded that visually undetected intervention effects have insignificant clinical impact. Proponents of visual analysis state that this conservative approach to evaluating intervention effects guarantees highly accurate and consistent conclusions across independent judges and reduces the (unknown) probability of Type I errors, at the cost of an increased probability of Type II errors (Kazdin, 2011).
In the recent literature, some advocates of visual analysis have discussed the lack of effect size estimation, which results in an inability to perform meta-analytic reviews of single-subject experiments. As stated by Kazdin (2011), the single-subject research field would benefit from the ability to integrate a large number of studies in a systematic way that would allow drawing broader conclusions regarding intervention effects that generalize beyond single experiments.
However, to date there is no consensus regarding guidelines for interpreting effect sizes calculated with the supplementary analytic methods commonly used by single-case researchers to augment visual analysis. One study compared five analytic techniques frequently used in single-subject research, applied to the same data, and concluded that each analytical approach was strongly influenced by serial dependency and that the results obtained with each method varied so much that it prohibited the development of any reliable effect-size interpretation guidelines.
The inability to estimate effect sizes with the currently used analytical methods leaves meta-analytic approaches out of reach in the field of single-subject research. A noteworthy study by Hedges, Pustejovsky, and Shadish (2012) proposed a new effect size that is comparable to the Cohen's d frequently used in group-level designs. It is estimated across single-subject cases and can be used in studies with at least three independent cases. This new approach can be applied in meta-analytic research and warrants further examination.
Several studies examined agreement rates among judges and showed that visual analysis led to inconsistent conclusions about the intervention effects across different raters. The inter-rater agreement among judges who reviewed the same graphs was relatively poor, ranging on average from .39 to .61 (Jones, Weinrott, & Vaught, 1978; DeProspero & Cohen, 1979; Ottenbacher, 1990), suggesting that visual analysis is not a reliable method for assessing intervention effects of single-subject data. Higher complexity of the data and experimental design resulted in less consistent conclusions.
Factors like high variability of the data, inconsistent patterns of behavior over time, changes in slope, and small changes in the level of the data were associated with lower agreement rates across judges (DeProspero & Cohen, 1979; Ottenbacher, 1990).
In addition, Matyas and Greenwood (1990) showed that a positive autocorrelation and high variability in the data tend to increase Type I error rates. These findings suggest that the claimed advantage of visual analysis in reducing Type I error rates is highly overstated.
Several studies demonstrated that higher levels of serial dependency in single-subject data lead to higher rates of disagreement between visual and statistical analysis (Jones et al., 1978; Matyas & Greenwood, 1990). One study by Jones et al. (1978) showed that the highest level of agreement between the two methods was found when there were no statistically significant changes in the behavior, and the lowest agreement occurred when there were significant effects of the intervention. These findings suggest that statistically significant results may be overlooked by visual analysis more often than non-significant results and that the highest agreement between these two methods occurs when there is no serial dependency in the data and intervention effects are insignificant.

Interrupted time-series analysis
Interrupted time-series analysis (ITSA) is a statistical method used to examine intervention effects of single-subject study designs. It is based on chronologically ordered observations. The most widely used model for examining serial dependency in the data is the autoregressive integrated moving average (ARIMA) model. It consists of three elements to be evaluated. The autoregressive term (p) estimates the extent to which the current observation is predictable from preceding observations and the number of past observations that impact the current observation. The moving average term (q) estimates the effects of preceding random shocks on the current observation. The integrated term (d) refers to the stationarity of the series. Stationarity of time-series data requires the structure and the parameters of the data, such as the mean, variance, and patterns of the autocorrelations, to remain the same across time for the series. Nonstationary data requires differencing in order to keep the series at a constant mean level; otherwise, the reliability of the assessed intervention effects can be compromised (Glass, Willson, & Gottman, 2008).
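In standard Box-Jenkins notation (a textbook formulation, given here only as a reference for the terms just described), an ARIMA(p, d, q) model can be written as
\[
(1 - \phi_1 B - \cdots - \phi_p B^p)\,(1 - B)^d\, y_t = (1 - \theta_1 B - \cdots - \theta_q B^q)\,\varepsilon_t,
\]
where \(B\) is the backshift operator (\(B y_t = y_{t-1}\)), the \(\phi\) terms are the autoregressive parameters, the \(\theta\) terms are the moving average parameters, \(d\) is the order of differencing, and \(\varepsilon_t\) is a random shock. The lag-1 model (1, 0, 0) used throughout this document reduces to \(y_t = c + \phi_1 y_{t-1} + \varepsilon_t\).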
The ITSA method is able to measure the degree of the serial dependency in the data and statistically remove it from the series, allowing for an unbiased estimate of the changes in level and trend across different phases of the experiment (Glass et al., 2008). In addition, after accounting for serial dependency in the data, ITSA facilitates an estimate of Cohen's d effect size (Cohen, 1988), which is the most commonly used measure of intervention effects in behavioral sciences research with widely implemented interpretative guidelines.

ITSA limitations
Although the most widely recommended method for removing serial dependency from single-subject data is to implement an ARIMA model (Glass et al., 2008), some researchers call attention to drawbacks of ITSA related to accurate ARIMA model estimation and to its limited utility in applied behavior analysis studies (Ottenbacher, 1992; Kazdin, 2011). Identifying the correct ARIMA model has been shown to be unreliable, often leading to model misidentification (Velicer & Harrop, 1983).
However, this issue has been addressed through the general transformation method, which uses the ARIMA model for lag-5 autocorrelation (5, 0, 0) and was shown to be simpler and more accurate than other model specification methods (Velicer & McDonald, 1984). Alternatively, the ARIMA model for lag-1 autocorrelation (1, 0, 0) is sufficient when applied to data that do not require forecasting (Simonton, 1977). Simulation studies have shown that these procedures are very accurate (Harrop & Velicer, 1985; Harrop & Velicer, 1990).
Another disadvantage of the ARIMA procedure has been associated with the requirement of a large number of observations. A minimum of 35-40 observations, or even as many as 25 observations per phase, has been recommended in order to correctly identify an ARIMA model (Glass et al., 2008; Ottenbacher, 1992). However, application of a predetermined ARIMA model allows for reliable evaluation of shorter data series. In addition, proponents of visual analysis argue that ITSA may not be a suitable analytical approach for experimental designs that reach beyond the basic AB model, such as alternating-treatment designs or multiple-baseline designs (Kazdin, 2011).

Study Aims
This study will perform a head-to-head comparison of the conclusions drawn from visual analysis of graphically presented data with the findings based on interrupted time-series analysis of the same data. The study will use graphical data based on published single-subject studies included in the textbook by Kazdin (2011).
The text was selected because it is a leading text on the topic used by applied researchers and the author strongly promotes the use of visual analysis rather than quantitative analysis methods. In a related study, all the studies published in a leading journal (Journal of Applied Behavior Analysis, 2010) were evaluated in the same way (Harrington & Velicer, 2013).
The aim of this study is to examine the level of agreement between these two methods, as well as the degree of serial dependency in single-subject data, and to estimate the effect size for each study. This comparison is made possible by the development of UnGraph® software (version 5.0), which permits the recovery of raw data from published graphs and thus the application of interrupted time-series analysis.

Sample
Graphical data was obtained from the book titled "Single-Case Research Designs: Methods for Clinical and Applied Settings" by Alan E. Kazdin (2011), who is currently the leading advocate of visual analysis of single-subject studies. For a graph to be included in this study, it was required to meet inclusion criteria analogous to those used in the first study, including that it (6) include a corresponding description of the conclusions drawn from the visual analysis of the graph and (7) present well defined data points (observations) in the graph. Graphs presenting cumulative data or alternating-treatment designs were not eligible.

Procedure
Eligible graphs were scanned and electronically imported into UnGraph® software (version 5.0). Next, the data presented in each graph was extracted using the UnGraph® coordinate-system function, which defined each graph's structure and scale. The sequentially ordered data, recorded in a time-series format, was then exported into a Microsoft Excel® spreadsheet.

Validity and reliability of UnGraph ® software
UnGraph® software has previously been examined for validity and reliability when extracting data from graphs representing single-case designs (Shadish et al., 2009). The results of that study indicated high validity and reliability of the data extracted from graphs, with an average correlation coefficient of .96 between two raters.

Analysis
Interrupted time-series analyses (ITSA) were used to evaluate intervention effects of each single-subject study based on the data collected using UnGraph ® software.
Identification of the ARIMA model was performed in a series of steps. First, the level of autocorrelation in the data was evaluated based on the autocorrelation function (ACF) and the partial autocorrelation function (PACF). These two functions refer to the autoregressive and moving average parameters and indicate whether negative or positive correlation was present in the data series, as well as at how many lags the correlation was present.
Also, the stationarity parameter (d) was evaluated, and if required, differencing of the data was performed.
Second, the values of each parameter were estimated and the fit of the ARIMA model was evaluated. The best fitting model resulted in uncorrelated residuals. In cases where the residuals were correlated, the model identification process was repeated and a new model was evaluated (Glass et al., 2008). Once a correctly identified ARIMA model was applied to the single-subject data, parameters such as trend, change in trend, level, change in level, as well as the mean and variability of the series, were evaluated. Intervention effects were examined based on changes in slope and level across the experimental phases of the design. In addition, for studies where no significant slope or change in slope was present, Cohen's d effect size was calculated to examine the magnitude of the behavior change due to the intervention. Analyses were performed in SAS version 9.2. This study was approved by the University of Rhode Island Institutional Review Board.
The descriptions of the visual analysis of the graphs presented in the textbook by Kazdin (2011) were used to perform a head-to-head comparison of the findings based on each method. These comparisons were based on conclusions made regarding trend, change in trend, variability of the data, and change in level of the data across the different phases of the experimental design.

Sample
A total of 134 graphs presenting time-series data based on published studies were reported in the textbook. After reviewing the content of the graphs, 60 met the eligibility criteria and were included in this study. Excluded graphs presented fewer than 3 observations in at least one phase of the design (k = 26), presented cumulative data or alternating-treatment designs (k = 17), did not present interrupted time-series data (k = 14), or did not meet one of the other eligibility criteria (e.g., included an aggregated or truncated number of observations, data points were not well defined, the design was not a single-subject study, or the visual analysis of the graph could not be verified using ITSA) (k = 17).
Almost all eligible graphs displayed AB single-subject study designs or their variations (e.g., ABA, ABAB, BABA); only two graphs presented ABCBC and ABC study designs. Each graph presented one or more interrupted time series (e.g., data points for two independent behaviors plotted on one graph).
Conclusions based on visual analysis were applied either to the full study design or to one or more sections of the design. ITSA was applied to data for which the corresponding description of the findings was formulated in a way that could be validated using statistical methods. A total of 75 ITSAs were performed (see Table 2 for details).

Descriptive statistics
The numbers of observations in the analyzed experiments ranged from 10 to 68, with a minimum of 3 and a maximum of 61 observations per phase. For 2 (2.67%) of the analyzed experiments, the interrupted time-series ARIMA model did not converge.
For the remaining 73 time-series datasets, 6 (8.22%) had a significant slope, 12 (16.44%) had a significant change in slope due to the experimental design, and 11 (15.07%) had both a significant slope and a significant change in slope. The nonlinearity of the slopes was not examined.
Over 60% of the examined time-series datasets (k = 45) had significant changes in level due to the phase change in the study design.
Small lag-1 autocorrelations (.00 to .20) were found for 8 time-series datasets, small-to-medium lag-1 autocorrelations (.21 to .40) for 10, medium lag-1 autocorrelations (.41 to .60) for 17, and large lag-1 autocorrelations (.61 or larger) for 33. Negative lag-1 autocorrelations, ranging from -.21 to -.01, were found for 7 time-series datasets. Lag-1 autocorrelations were significant for 49 time-series datasets; 22 of these also had significant lag-2 autocorrelations. The autocorrelations were not corrected for small-sample bias (Shadish & Sullivan, 2011). Figure 9 presents the distribution of lag-1 autocorrelations for the eligible studies; details are presented in Table 2.
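For readers unfamiliar with the statistic, the short sketch below shows one standard way a lag-1 autocorrelation can be computed for a series; the toy data are hypothetical, and the coefficients reported here come from the fitted ARIMA models rather than from this formula.

```python
# A standard sample lag-1 autocorrelation: the correlation of the series with itself
# shifted by one observation. Toy data only; not taken from the dissertation.
import numpy as np

def lag1_autocorrelation(series):
    y = np.asarray(series, dtype=float) - np.mean(series)
    return float(np.sum(y[1:] * y[:-1]) / np.sum(y ** 2))

toy = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]          # hypothetical upward-trending counts
print(round(lag1_autocorrelation(toy), 2))        # 0.59 for this toy series
```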
Cohen's d effect size was calculated for all experiments that did not have a significant slope or change in slope, a total of 44 (60.27%). The effect sizes ranged from 0.21 to 12.29. Figure 10 presents the distribution of effect sizes for the eligible studies. Based on Cohen's (1988) classification, small effect sizes (.20 to .49) were found for 5 time-series datasets, medium effect sizes (.50 to .79) for 6, and large effect sizes (.80 or greater) for 33 (75.00%). Details are provided in Table 2.
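As an illustration of how a between-phase effect size of this kind can be obtained, the sketch below computes a conventional Cohen's d (difference in phase means divided by the pooled standard deviation) for hypothetical AB data; the dissertation's exact computation may differ in detail, for example in how variability is estimated in the presence of autocorrelation.

```python
# Conventional between-phase Cohen's d for hypothetical AB data (illustrative only).
import numpy as np

def cohens_d(baseline, treatment):
    a, b = np.asarray(baseline, dtype=float), np.asarray(treatment, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (b.mean() - a.mean()) / pooled_sd

print(round(cohens_d([8, 9, 7, 8, 9], [3, 2, 4, 3, 2]), 2))   # -6.45: a large reduction
```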

ITSA and visual analysis comparison
Comparison of the findings based on visual analysis and ITSA demonstrated consistent findings for 58 analyzed time-series datasets. Most of these consistent findings (k = 48) referred to significant changes between different phases of the experiment, while 10 referred to non-significant changes such as reversal to baseline.
For the remaining 15 experiments (20.55%), the findings based on statistical analysis did not confirm the conclusions based on visual analysis (bolded data in Table 2). For 8 of those experiments, visual analysis indicated significant changes between different phases of the study design, whereas statistical analysis did not reveal significant differences. For 7 experiments, non-significant findings based on visual analysis were not confirmed by statistical analysis. See Figure 11 for a summary of the agreement and disagreement between the two methods. The overall level of agreement was moderate (Cohen's Kappa = .44) (Cohen, 1960). Among the experiments that led to inconsistent findings between the two methods, 26.67% had a significant slope, a significant change in slope, or both, and 33.33% had lag-1 autoregressive terms greater than .40.
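The reported agreement can be reproduced from the counts given above (48 series judged significant by both methods, 10 judged non-significant by both, 8 significant by visual analysis only, and 7 significant by ITSA only); the sketch below computes Cohen's Kappa for this 2 × 2 table and yields approximately .44.

```python
# Cohen's Kappa for the visual-analysis vs. ITSA agreement table reported in the text.
import numpy as np

table = np.array([[48, 8],    # rows: visual analysis (significant, non-significant)
                  [7, 10]])   # cols: ITSA            (significant, non-significant)
n = table.sum()
p_observed = np.trace(table) / n
p_expected = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))        # 0.44
```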
To illustrate the application of the ITSA method to the analysis of single-subject studies and comparison with the conclusions drawn based on visual analysis, three examples were selected from the experiments presented in Table 2.

Example 1
The first example is based on a study that intended to reduce vocal stereotypy and increase appropriate vocalization among children diagnosed with autism spectrum disorder. The study was based on an ABAB design, and the data for one of the children, a 3-year-old boy, are presented in Figure 12 and Figure 13. Conclusions based on visual analysis of the data suggested that the intervention decreased vocal stereotypy and increased appropriate vocalization. Kazdin (2011) stated that "as evident in both graphs, whenever the response interruption and redirection intervention was implemented there was a dramatic reduction in stereotypic statements and an increase in appropriate vocalization" (p. 133). Comparison of the conclusions drawn from visual analysis and ITSA revealed consistent findings (see Table 2).

Example 2
The second example is based on a study that examined the intervention effects on the reduction of disruptive behavior during the dental treatment among five children, ages 4 to 7 years. The study was based on AB design, and the data for each child is presented in Figures 14 through 18. Kazdin (2011) provided the following description of the findings based on visual analysis of the graphs: "as for changes in level (discontinuity at point of intervention for each child), possibly two (Elaine and George) show this effect. As for changes in trend, perhaps all but one (George) show a different slope from baseline through intervention phases" (p. 294).

Melissa
For the data obtained from Melissa (Figure 14), the ITSA model was based on an ARIMA(5, 0, 0) and included 4 parameters: level, change in level, slope, and change in slope.
The lag-1 autocorrelation coefficient was significant (AR1 = .86). The analyses for trend and change in trend yielded non-significant results. The analysis for change in level indicated a significant reduction in disruptive behavior with a large effect size (t(28) = -2.11, p < .05, d = 2.21). Details are presented in Table 2.
Comparison of the visual analysis and ITSA findings revealed inconsistencies (see Table 2): ITSA did not confirm the conclusions based on visual analysis regarding significant changes in level.

Example 3
The third example is based on a multiple-baseline study that intended to reduce depression among five patients with physical illness. The data for participant 1 is shown in Figure 19. Conclusions based on visual analysis of the data suggested that intervention decreased level of depression. Kazdin (2011) stated that ". . . individual data show the effects of the intervention . . ." (p. 395).
ITSA was implemented to evaluate the effects of the intervention on depression.
The ITSA model was based on an ARIMA(1, 0, 0) and included 4 parameters: level, change in level, slope, and change in slope. The lag-1 autocorrelation coefficient was not significant (AR1 = .53). The analysis for trend yielded significant results (t(7) = -2.54, p < .05), indicating a gradual decrease in depression. The analyses for change in trend and change in level yielded non-significant findings. Details are presented in Table 2.
Comparison of the visual analysis and ITSA findings revealed inconsistencies.
ITSA did not confirm the conclusions based on visual analysis regarding changes in level: statistical analysis indicated non-significant changes in level due to the intervention, whereas visual analysis came to the opposite conclusion.

Discussion
This study performed a statistical analysis of data presented only in graphic form in order to examine the properties of published single-subject data presented in a widely used textbook in the applied psychology area and to evaluate how findings based on ITSA compare with conclusions drawn from visual analysis. Issues such as serial dependency, measures of effect size, and the level of agreement between statistical and visual analysis were addressed.
The evaluated studies were based mostly on basic single-subject study designs (e.g., AB, ABAB), with differing numbers of observations within each study and across the different phases of the designs. ITSA was successfully applied to all but two of the eligible studies, indicating that this statistical method can be applied to single-subject experimental designs with the wide range of observation counts that frequently occur in applied behavior analysis research.
These findings directly refute the claim, commonly voiced by proponents of visual analysis, that ARIMA models cannot be applied to data obtained from a wide range of single-subject studies.

Serial Dependency
Overall findings based on ITSA revealed high lag-1 autocorrelations for most of the evaluated data, including short time series of less than 20 observations. These results confirm findings from earlier studies showing that serial dependency is a common property of single-subject data (Jones, Vaught, & Weinrott, 1977; Jones et al., 1978; Matyas & Greenwood, 1990). The majority of first-order autocorrelations (more than 65%) were positive and at a moderate to high level (.41-.60 or > .60). Given the sample size limitations, it is difficult to form more precise conclusions. However, the assumption that autocorrelations can be ignored (Huitema & McKeon, 1998) seems indefensible.
The effect of a positive autocorrelation is to decrease the apparent degree of variability. This potentially affects both graphical analysis and any statistical analysis that ignores dependency in the data. Velicer and Molenaar (2013) provide a visual illustration of this smoothing of the series.
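A small simulation, not taken from the dissertation, illustrates this point: an AR(1) series and a white-noise series with the same marginal variance differ markedly in their point-to-point variability, which is the property a viewer of a graph responds to.

```python
# Illustrative simulation: positive autocorrelation smooths a series even when the
# marginal variance is held constant (all values here are arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
phi, n = 0.7, 500
white = rng.normal(0.0, 1.0, n)                          # independent observations
innovations = rng.normal(0.0, np.sqrt(1 - phi ** 2), n)  # scaled so Var(AR1) is about 1
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + innovations[t]

print("marginal SDs       :", round(white.std(), 2), round(ar1.std(), 2))
print("mean |step|, white :", round(np.mean(np.abs(np.diff(white))), 2))
print("mean |step|, AR(1) :", round(np.mean(np.abs(np.diff(ar1))), 2))
```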
The autocorrelations can also help address another important research question, namely, the nature of the generating function for the observed data. The autocorrelations also provide information about the extent to which the ergodic theorems are satisfied, a critical question for combining data across individuals (Molenaar, 2008; Velicer & Molenaar, 2013). In order to draw valid inferences from group-level data to the individual level, two ergodic theorem conditions must be met: (1) the individual trajectories must obey the same dynamic laws, and (2) they must have equivalent mean levels and serial dependencies (Molenaar, 2008; Velicer, Babbin, & Palumbo, in press). However, the small sample sizes available in the studies reviewed here do not permit these questions to be addressed.

Effect Size Estimation
The effect size estimates were predominantly large (75%), with some very large values, such as d = 12.29, an extremely large effect size for the behavioral sciences. The term 'clinical significance' is largely undefined but can be viewed as analogous to a large effect size. (Statistical significance is typically viewed as a necessary but not sufficient condition for clinical significance.) Based on this interpretation, the effect size estimates observed in this set of studies support the contention that graphical methods focus on clinically significant effect sizes.

Sample Size Issues
Time is a core concept for idiographic studies, and we presently have very little information to guide researchers on how frequently observations should be taken.
Advances from the information sciences are producing new measures that can greatly improve the quality and number of observations. A review of these methods, often labeled telemetrics, is provided by Goodwin, Velicer, and Intille (2008). Indeed, advances in telemetrics may shift the issue from having too few observations to having too many observations.

Agreement between Visual and Statistical Analysis
Comparison of the conclusions drawn from visual analysis and ITSA revealed an overall moderate level of agreement (Kappa = .44). When the graphical presentation of the intervention effects showed ideal or nearly ideal data patterns, such as low variability, no trend, and evident intervention effects, ITSA was in agreement with visual analysis, even for studies with small numbers of observations. Although these findings are encouraging, in 10.96% of the evaluated studies the conclusions drawn from visual analysis were not supported by statistical analysis, which means that the reported significant results from visual analysis could have been due to chance. Once the data diverge from the ideal pattern, visual analysis and ITSA can lead to contrary findings. Serial dependency of the time-series data is one potential explanation: moderate to high serial dependence was present in most examples, and it is well known that this can impact the reliability and validity of conclusions based on visual analysis.
Another basis for disagreement is the presence of trend. Trend is not easily observable through visual analysis, especially in short series, and therefore may not be accounted for when evaluating changes in level across phases of the experiment. ITSA is able to account for trend in the data when examining intervention effects, as well as to quantitatively evaluate trend and changes in trend that may occur across different phases of the design.
The opposite pattern, in which visual analysis did not detect an effect that was statistically significant, occurred at a similar rate (9.59%). These errors have the potential to prematurely terminate the investigation of a potentially effective intervention: initial studies of an intervention in a real-world setting typically represent an attempt to detect an effect in a very noisy environment, and effect sizes that are initially small can become much more important with additional controls.

Advantages of Statistical Analysis
Statistical analysis also offers a framework for effect size estimation and meta-analysis of single-subject data (see, e.g., Hedges et al., 2012). Statistical significance tests are largely dependent on sample size; therefore, for data with limited numbers of observations, the results may be non-significant due to insufficient statistical power. Effect size, however, is independent of sample size, and meta-analysis can provide more accurate estimates of effect size based on multiple replications.
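As a purely hypothetical illustration of the aggregation such methods allow, the sketch below pools effect sizes from several small replications using generic inverse-variance (fixed-effect) weights; the numbers are invented and the weighting scheme is a common textbook choice, not the procedure of any study cited here.

```python
# Hypothetical fixed-effect pooling of effect sizes from four small replications.
import numpy as np

d = np.array([0.9, 1.4, 0.6, 1.1])             # effect size estimates (invented)
var_d = np.array([0.30, 0.25, 0.40, 0.35])     # their sampling variances (invented)
w = 1.0 / var_d                                # inverse-variance weights

d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))
print(f"pooled d = {d_pooled:.2f} (SE = {se_pooled:.2f})")
```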
In addition, the development of new software such as UnGraph® and new functions in R packages makes it possible to extract values from published graphs and reanalyze the available data using ITSA.
This opens up a unique opportunity to use historical data based on single-subject studies and perform far-reaching meta-analytical studies.

Limitations
The findings of this study have limited representativeness. The graphs presented in the textbook (Kazdin, 2011) are not representative of all published single-subject studies; they were selected to serve as instructional examples for training in the visual analysis method. Therefore, the presented graphs are largely based on the most basic single-subject study designs and show easily observable intervention effects. Replication of these findings in more representative samples of published studies within the applied behavior analysis field is needed.

Conclusions
ITSA can be successfully applied to a number of single-subject study designs. It provides important additional information, such as effect size, and aids the evaluation of intervention effects, particularly when the experiment lacks striking changes in behavior. Characteristics of single-subject data such as serial dependency, trend, and high variability limit the reliability and validity of visual analysis. At a minimum, the situation should no longer be viewed as a competition between the two approaches. Both methods should be performed concurrently to assure valid conclusions about treatment effects, particularly when a limited number of observations is available or when the characteristics of the time-series data are not optimal.
Note to Table 2. Graph identifiers refer to the chapter and figure number in Kazdin's textbook; the experimental design is indicated with capital letters in parentheses. Unless otherwise indicated with the superscript (†), each ITSA model was estimated with four parameters: level, slope, change in slope, and change in level. N BL = number of observations in the baseline or reference phase; N TX = number of observations in the treatment phase; ARIMA = autoregressive integrated moving average model; AR 1 = lag-1 autoregressive term; Level = intercept; Error σ = standard error estimate; Slope = t statistic for the linear trend of the series; Δ Slope = t statistic for the change in slope at the interruption point; Δ Level = t statistic for the change in level at the interruption point; d = Cohen's d effect size (not available for series with a significant slope or change in slope). a = significant AR 2; b = significant AR 2 and AR 3; c = significant AR 2, AR 3, and AR 4.
Figure note. Figure reproduced from data extracted using UnGraph® software from Kazdin (2011), p. 396.

CONCLUSIONS
The goal of this dissertation was to address the scientific debate regarding the most reliable and valid method of single-subject data evaluation in the applied behavior analysis research field. This research conducted a head-to-head comparison of the conclusions based on visual analysis and interrupted time-series analysis (ITSA) using the same single-subject data.
Two independent studies were conducted to examine the level of agreement between visual analysis and ITSA. The first study was based on graphical data published in the Journal of Applied Behavior Analysis (2010). The second study was based on graphical data obtained from the book "Single-Case Research Designs: Methods for Clinical and Applied Settings" by Alan E. Kazdin (2011).
In addition to a comparison of the conclusions based on the visual and statistical methods, the serial dependency of the data was evaluated, as well as additional statistical characteristics of single-subject experimental designs such as effect size and sample size.
Overall findings for both studies show that ITSA can be successfully applied to at least 95% of single-subject experiments with different sample sizes ranging from 8 to 136 observations and highly diverse study designs, ranging from the most basic AB designs to designs with multiple phases and two or more types of interventions (e.g. ABACABAC, ABCDEFBFEDC).
Evaluation of serial dependency revealed high lag-1 autocorrelations for most of the evaluated data in both studies, including short time series of less than 20 observations. These results confirm findings from earlier studies showing that serial dependency is a common property of single-subject data. The majority of first-order autocorrelations (62% and 67% for study 1 and study 2, respectively) were positive and at a moderate to high level (.41-.60 or > .60). The effect size estimates were predominantly large, with Cohen's d ≥ 0.80 for 73% of time-series data in study 1 and 75% in study 2. The term 'clinical significance' can be viewed as analogous to a large effect size; statistical significance is typically viewed as a necessary but not sufficient condition for clinical significance. Based on this interpretation, the effect size estimates observed in this set of studies support the contention that graphical methods focus mostly on clinically significant effect sizes.
Comparison of the conclusions drawn from visual analysis and ITSA revealed an overall low level of agreement (Kappa = .14) in study 1 and moderate level of agreement (Kappa = .44) in study 2.
The difference in the levels of agreement between the two studies could be driven by the type of single-subject experimental designs presented in each source. The Journal of Applied Behavior Analysis (2010) can be expected to present a more representative sample of the single-subject studies recently conducted in the field of applied behavior analysis. In contrast, the graphical data presented in the textbook by Kazdin (2011) are based on a non-representative sample of published single-subject studies that were selected as instructional examples for teaching the visual analysis method. The chosen examples are therefore largely based on the most basic single-subject study designs with easily observable intervention effects. Consequently, when the data presented ideal or nearly ideal patterns, such as low variability, no trend, and evident intervention effects, ITSA was more likely to agree with visual analysis, even for studies with small numbers of observations or for experimental designs with more than two phases (i.e., beyond the basic AB design).
In conclusion, these findings show that ITSA can be broadly implemented in applied behavior analysis research (it was successfully applied to at least 95% of the examined experiments) and can facilitate the evaluation of intervention effects and additional statistical characteristics of the data, particularly when specific characteristics of the single-subject data (e.g., slope, change in slope) limit the reliability and validity of visual analysis. Overall, the two methods can be viewed as complementary and can be used concurrently, retaining the benefits of both, to advance the field and accumulate an evidence base over time.