The Stability of WISC-III Scores: For Whom are Triennial Re-Evaluations Necessary?

This study investigated the long-term stability of Verbal, Performance and Full Scale IQ scores for a sample of exceptional children evaluated with the Wechsler Intelligence Scale for Children-Third Edition (WISC-III) as part of the mandated three-year re-evaluation process. Archival data were collected from the special education files of 592 children from twenty-seven school districts in a small New England state who were administered the WISC-III on two separate occasions between September 1992 and June 1996. Several variables (special education classification, Full Scale IQ score at initial administration, age of participant at initial administration, and the ethnicity and gender of the participants) were examined to detect which variables, if any, would influence the stability of test scores over the three-year interval. Three statistical analyses for measuring stability (correlational method, test of mean difference and test of intraindividual variability) were conducted. Results indicated that for this sample of children in special education, examined as a group, Verbal IQ, Performance IQ and Full Scale IQ scores remained stable over time. However, certain populations demonstrated significant instability in scores over the three-year interval. Results suggested that children classified as mentally retarded or behavior disordered fluctuate significantly in their performance on the WISC-III over a three-year interval. Similarly, children who receive an initial Full Scale IQ score above 109 demonstrate significant instability in scores between administrations. Therefore, for these populations, three-year re-evaluations appear necessary. However, routine administration of the WISC-III to all children involved in special education is of questionable value.
Further research must be done to confirm the findings of this study in order to assist policy makers in distinguishing which children would benefit from a three-year re-administration of the WISC-III and for which children such a re-administration would yield no further information.


Rationale for Study
Criteria for determining eligibility for special education services under federal and state legislation require the administration of an intelligence test to measure a child's cognitive functioning. This has generally been interpreted as an individually administered, standardized intelligence test. Such a measure is re-administered every three years to determine if the child is still eligible for special education services. Therefore, the results of the individual intelligence test are a viable component in decisions about the continuation of services.
The Wechsler Intelligence Scale for Children-Third Edition (WISC-III) is the current cognitive test of choice for most practicing school psychologists. There have been no studies to date, however, that address the stability of WISC-III IQ scores over the mandated three-year period. Therefore, it is necessary to investigate the stability of the WISC-III in order to examine the reliability of the scores that are frequently used in the decision-making process regarding a child's classification and eligibility for services. Such an investigation will assist in determining whether there are benefits in re-administering the same test to the same child during the three-year re-evaluation process.
If one accepts the assumption that IQ scores remain relatively stable over time, then re-evaluations appear unjustified. However, if it is determined that instability of scores is expected for certain populations, then it is imperative to determine for which children scores will remain stable over time and for which children score fluctuations will be expected.
The purpose of this study is to investigate the long-term reliability and stability of the Verbal, Performance and Full Scale IQ scores for a sample of exceptional children. In addition, several variables (special education classification, Full Scale IQ score at initial administration, age of the participants at initial administration, ethnicity and gender) will be examined to detect which variables may influence the stability of test scores over the mandated three-year re-evaluation interval.
Research addressing this area is lacking, and its findings will greatly influence the decision-making practices of school personnel in determining the classification and placement of children as a result of a three-year re-evaluation.

Review of the Literature
The Nature of Intelligence
In order to understand the role of intelligence testing in the special education community, it is important to first define the nature of intelligence. Currently, there is no single comprehensive definition of the construct of intelligence that is accepted throughout the scientific community (Sattler, 1988). This is partially due to the development of various definitions of intelligence derived from several theoretical orientations.
An early and influential theory developed to define the nature of intelligence was proposed by Charles Spearman. In his 1927 publication, The Abilities of Man, Spearman set out his two-factor theory of intelligence. According to Spearman, intelligence is comprised of two principal components. One component of intelligence is designated as the general factor and denoted by the letter g ...
it is so named because, although varying freely from individual to individual, it remains the same for any one individual in respect of all the correlated abilities. The second factor has been called the 'specific factor' and denoted by the letter s. It not only varies from individual to individual, but even for any one individual from each ability to another (p. 75).
According to Spearman's theory, the g factor is related to perceiving and manipulating relationships, thus representing abstract reasoning. Because performance on various tasks did not correlate perfectly with g, Spearman suggested that each task includes a specific ability, or s ability. Spearman's theory greatly influenced David Wechsler's conceptualization of intelligence and the development of the Wechsler scales. Wechsler subscribed to the concept of g and defined intelligence as the "overall capacity of the individual to understand and cope with the world around him" (Wechsler, 1974). Although Wechsler's scales have been dichotomized into verbal and performance scales, Wechsler placed a strong emphasis on the assessment of g, or general intelligence. In this manner, the Performance and Verbal scales are "primarily a way of identifying two principal modes by which human abilities express themselves" (Wechsler, 1974).
Intelligence, according to Wechsler (1974), is a multidimensional, global entity rather than an independently defined psychological trait. Thus, intelligent behavior is the combination of qualitatively different abilities for the purpose of thinking rationally and logically and dealing effectively with the external world.
A second influential theory of intelligence was proposed by L.L. Thurstone (1938).
Thurstone suggested that intelligence consists of independent abilities, not a single general factor as proposed by Spearman. These primary mental abilities include verbal comprehension, word fluency, perceptual speed and reasoning. According to Thurstone's theory, individual differences exist among abilities, whereby individuals demonstrate more proficiency in some areas than in others.
A recent theory of intelligence proposed by Robert Sternberg (1985) stresses the importance of the manner in which individuals solve problems instead of the answers that are derived. This triarchic theory of intelligence emphasizes the examination of the mental activities individuals engage in while solving problems.
Although there are several theories that attempt to explain the nature of intelligence, in practice the construct has become synonymous with the results obtained from intelligence measures. Therefore, the development of the intelligence test is of great importance in understanding the nature of intelligence.

The History of Intelligence Testing
Measuring intelligence is an abstract procedure. Currently, the most effective way to measure intellectual abilities is indirectly, by evaluating and observing the intellectual behaviors of an individual as demonstrated through tasks on an intelligence test.
The development of intelligence testing has been made possible by the contributions of early experimental psychologists (Anastasi, 1988). Early work by Wilhelm Wundt in 1879 resulted in the discovery of individual differences in sensory abilities and reaction time. These measures were later instituted as part of formal cognitive measures.
Sir Francis Galton, the father of mental tests, developed the first assessment designed to measure intelligence in the late nineteenth century (Boring, 1950). Galton believed that one's understanding of the environment is a result of how information is received through the senses. Therefore, individuals with high intelligence would have exceptional sensory discrimination abilities. As a result, Galton developed a battery of tests to examine sensory discrimination and motor coordination in order to assess mental functioning. However, none of the traits measured correlated with his theory of intelligence.
Similar work was conducted in the United States during this period, namely by James McKeen Cattell. Like Galton, Cattell believed that simple sensory, perceptual and motor responses were the key dimensions of intelligence (Boring, 1950). Cattell and Galton reached few conclusions about intelligence. However, their investigations raised many questions about the nature of intelligence that are still being examined today.
The first standardized intelligence test was devised by two French psychologists, Alfred Binet and Theophile Simon (Doll, 1962). In 1904, the French Ministry of Education commissioned Binet to develop a formal assessment procedure to measure the cognitive performance of children. The underlying purpose was to determine which children would be placed in special programs for the 'mentally deficient.' Binet and Simon produced an objective test to measure the skills necessary for academic success, such as mathematical skills, memory, language abilities and the ability to follow directions. It was believed that by using such an array of items, the test would inevitably tap enough abilities to assess a child's intellectual potential. Binet and Simon's work has had a lasting effect on special education eligibility. Although test development and standardization procedures have improved over time, "Binet's early work has had considerable influence on adoption of the 'diagnostic/prescriptive' paradigm that has guided special education for much of its history in the United States" (Meyen, 1995). The purpose of administering measures of intelligence has generally been interpreted as not only to assist in eligibility determination, but also to play a key role in guiding appropriate placement and assessing the individual academic needs of the child (Galvin & Elliott, 1985).
According to PL 94-142, schools are not only required to assess eligibility for special education services, but are also mandated to re-evaluate children involved in special education every three years. The implicit rationale of this mandate is to confirm the original diagnosis and placement as well as to reassess the academic needs of the child (Galvin & Elliott, 1985).
The re-evaluation process typically includes a repetition of the initial standardized test battery (Elliott, Piersel & Galvin, 1983; Hartshorne & Hoyt, 1985). Thus, for many school psychologists the re-evaluation is nearly indistinguishable from an initial evaluation in both the purpose of the assessment and the procedure. Although three-year re-evaluations are mandated at this time, neither PL 94-142 nor State guidelines require that a routine battery of standardized tests be administered (Safer & Hobbs, 1980; Hartshorne & Hoyt, 1985). Therefore, many school psychologists question the need for the re-administration of an entire battery of individualized standardized assessments. Because federal legislation does not require the re-administration of an individualized cognitive assessment, it is essential for school psychologists to determine the appropriate procedures and tools that would yield the most useful information in determining the needs of the child concerning placement and services received.
According to Anderson, Cronin and Kazmierski (1989), "The re-evaluation information that is presently available has failed to provide direction for diagnosticians and special education administrators who are charged with the responsibility of meeting students' assessment needs and complying with federal evaluation requirements" (p. 941).
In this manner, one must question the necessity of re-evaluations, the effectiveness of re-evaluations and the efficacy of the process. Elliott, Piersel and Galvin (1983) conducted a study whose results suggest that school psychologists spend 15% of their time each year performing re-evaluations, and that 80% of school psychologists spend more than three hours per evaluation. Therefore, the time involved in conducting mandated re-evaluations is quite extensive. In terms of financial strain, testing costs consume almost one-third of the amount spent in educating a child in special education for one academic year (Smith, 1982).
As one questions the necessity of psychological re-evaluations, it is essential to be cognizant of the research that has addressed this issue. According to a survey conducted by Galvin and Elliott (1985), there is generally a relatively low incidence of perceived and actual change in classification and placement for children in special education as a result of a three-year re-evaluation. As a result, the need for an automatic re-administration of an intelligence assessment for every child receiving special education services is questionable.
While the purpose of re-evaluations appears quite logical, studies have called automatic three-year re-evaluations into question in terms of the efficiency, cost effectiveness and overall value of re-testing, especially the re-administration of an individual intelligence measure.

Reliability of Intelligence Test Scores
Test reliability refers to "the consistency of scores obtained by the same persons when reexamined with the same test on different occasions" (Anastasi, 1988).
Psychometric theory holds that an individual's obtained score is composed of a true score and an error score (Sattler, 1988). The true score estimates the amount of the trait or characteristic the child actually possesses, while the error score refers to the extent to which individual differences are due to chance factors. Therefore, measures of test reliability estimate what proportion of the variance is due to true differences and what proportion is due to error factors.
Reliability can be measured in a variety of ways. The most common computation that determines the degree of consistency in test scores is the reliability coefficient. The reliability coefficient represents the ratio of the true score variance to the observed score variance (Sattler, 1988). Reliability information is readily obtained through a test-retest procedure. This procedure requires the administration of the same test to the same group of participants on two occasions and correlates the resultant test scores. This method has been shown to provide the most reasonable estimate of test reliability (Bauman, 1980).
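The decomposition described above can be written compactly. The notation below is the conventional one from classical test theory and is supplied here for illustration (the symbols X, T and E do not appear in the sources cited); it assumes that error scores are uncorrelated with true scores.

```latex
% Observed score X as the sum of a true score T and an error score E
X = T + E
% With T and E uncorrelated, the observed variance splits into two parts
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
% Reliability coefficient: the ratio of true score variance to observed score variance
r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Under this reading, a reliability coefficient of .80 would mean that 80% of the observed score variance is attributed to true differences and 20% to error.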
The correlation coefficient that is most often used in test-retest reliability studies is the Pearson product-moment correlation coefficient. This correlation coefficient represents the degree to which the obtained score is consistent or stable over time. Other methods that have been implemented in test-retest reliability studies are the examination of group mean differences and the examination of intra-individual variability.
Test-retest reliability represents the extent to which one's scores on a given test can be generalized over different times. Thus, the higher the reliability coefficient, the less susceptible one's scores are to random changes in the condition of the testing environment or the condition of the test taker (Anastasi, 1988). "For most tests of cognitive and special abilities, a reliability coefficient of .80 or higher is generally considered to be acceptable" (Sattler, 1988).
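As a rough illustration of the three approaches named above (correlation, group mean difference and intra-individual variability), the following sketch computes each on a small set of scores. The scores are invented for illustration and are not data from any study cited here. The function names and the 5-point change threshold are assumptions of this sketch, not definitions taken from the literature.

```python
from math import sqrt


def pearson_r(first, retest):
    """Pearson product-moment correlation between two administrations."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(retest) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, retest))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in retest)
    return sxy / sqrt(sxx * syy)


def paired_t(first, retest):
    """Paired-samples t statistic for the mean retest-minus-first difference."""
    diffs = [y - x for x, y in zip(first, retest)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / sqrt(n))


def share_changed(first, retest, threshold):
    """Proportion of children whose score moved by at least `threshold` points."""
    moved = sum(abs(y - x) >= threshold for x, y in zip(first, retest))
    return moved / len(first)


# Hypothetical Full Scale IQ scores for eight children (illustrative only).
first_fsiq = [85, 92, 78, 101, 110, 67, 95, 88]
retest_fsiq = [83, 95, 80, 99, 104, 70, 96, 85]

r = pearson_r(first_fsiq, retest_fsiq)
t = paired_t(first_fsiq, retest_fsiq)
changed = share_changed(first_fsiq, retest_fsiq, threshold=5)  # threshold is an assumption

print(f"test-retest r = {r:.2f}")
print(f"paired t = {t:.2f} (df = {len(first_fsiq) - 1})")
print(f"share changing by 5+ points = {changed:.3f}")
```

The three statistics answer different questions. A high correlation shows that children keep their rank order, the paired t statistic shows whether the group mean has shifted, and the share of large changes shows how many individual children moved even when the group as a whole did not.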
The investigation of test-retest reliability of intelligence measures is essential because the procedures of special education eligibility and placement rest heavily on the implicit assumption that scores are stable over time (Webster, 1988). For this reason, exploring the test-retest reliability of the WISC-III, which is the current test of choice, would be beneficial in understanding the meaningfulness of scores when compared to scores achieved three years earlier. The original WISC was published in 1949 as a means to assess the cognitive functioning of children aged six years, zero months through sixteen years, eleven months.
Since its publication, numerous studies have been conducted with "normal" children as well as with exceptional children to investigate the reliability and stability of the Verbal, Performance and Full Scale IQ scores over time. Several variables have been considered in these investigations to determine whether score stability differs in certain populations.

Special education classification
Non-special education population. Before examining the stability of scores in a special education population, it is important first to investigate the stability of scores in samples of non-special education children. Gehman and Matyas (1956) examined the stability of WISC scores for a group of sixty children not receiving special education services over a four-year period. Test-retest reliability coefficients were .77, .74 and .77 for Verbal IQ (VIQ), Performance IQ (PIQ) and Full Scale IQ (FSIQ) scores, respectively. Conklin and Dockrell (1967) Throne et al. (1962). With a longer test-retest interval, Walker and Gross (1970) examined the stability of WISC scores with forty-nine mentally retarded children with a retest interval ranging from two to three and one half years. With respect to the effects of special education classification on the stability of WISC scores, these studies conclude that VIQ, PIQ and FSIQ scores remain relatively stable for children classified as mentally retarded. However, for children classified as learning disabled and emotionally disturbed, score instability was prevalent.
Full Scale IQ at Initial Administration
Whatley and Plant (1957) examined the stability of WISC IQ scores for a group of seventy children referred for testing who received an initial FSIQ score below 90. After a mean test-retest interval of seventeen months, no significant mean differences were found between IQ scores. Thus, the researchers concluded that FSIQ, VIQ and PIQ scores are relatively stable over time for children with initial FSIQ scores of 90 or below.

Age at Initial Administration
Irwin (1966) investigated the reliability of WISC scores at two selected age levels (Group I = five years, seven months to six years, six months and Group II = ten years, seven months to eleven years, seven months) over a three to five week test-retest interval.
Results indicated that younger children are more variable in their ability over time than older children. Similarly, Klonoff (1972) reported in his sample of 173 children ranging in age from five to thirteen that the youngest group of children (aged five) exhibited the most instability over a year period. Therefore, it appears that younger children experience greater instability in WISC scores over time than older children.

Gender
In terms of the relationship between gender and test stability, Klonoff (1972) reports no gender differences in the patterning of IQ change.

Summary of WISC Stability Studies
Review of the literature investigating the test-retest stability of WISC scores suggests that VIQ, PIQ and FSIQ scores remain moderately stable over time. Specifically, results indicate that non-special education children, children classified as mentally retarded, and children who received an initial FSIQ score below 90 have stable WISC IQ scores over a test-retest interval. However, results also indicated that for certain populations, WISC IQ scores demonstrate significant instability over time. Children classified as learning disabled or emotionally disturbed fluctuate significantly in scores. In addition, younger age groups systematically show more instability in scores over time than older children.
Therefore, it appears that certain variables (such as special education classification and age) may contribute to test-retest score instability on the WISC.

Comparison of WISC/WISC-R Scores
The Wechsler Intelligence Scale for Children-Revised (WISC-R) was published in 1974. The revised manual stated that "the revision of the WISC represents a synthesis of two somewhat opposing aims: (a) the retention of as much of the 1949 WISC as possible because of its widespread use and acceptance, and (b) the modification or elimination of items felt by some test users to be ambiguous, obsolete, or differentially unfair to particular groups of children" (Wechsler, 1974, p. 10). Such elimination and alteration of items were due to reports in the literature that certain items were unfair or culturally weighted (Wechsler, 1974; Davis, 1977).
The manual for the revised Wechsler Intelligence Scale for Children (WISC-R) did not include any empirical data regarding the comparability of scores from the WISC-R with scores from the original WISC (Wechsler, 1974). With no information provided in the manual, researchers were left to conduct studies comparing the stability of WISC and WISC-R scores over time. Once the WISC-R had become widely utilized, interpretation of change scores was the sole responsibility of practicing psychologists.
Information comparing previously administered WISC scores and currently administered WISC-R scores was very important in the re-evaluation process of children receiving special education services.
Because performance on intelligence measures is a major criterion in determining eligibility for special education services, sudden changes in scores could seriously affect the special education placement of children in need of services. For example, a decrease in scores from the WISC to the WISC-R will decrease the discrepancy between ability and achievement, thus decreasing the number of children eligible as learning disabled and increasing the number of children classified as mentally retarded.
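The effect described above can be made concrete with a small worked sketch. The decision rule, the 15-point discrepancy criterion, the cutoff of 70 and all scores below are hypothetical values chosen for illustration. Actual eligibility criteria vary by state and district and are not specified in the sources cited here.

```python
def classify(fsiq, achievement, ld_discrepancy=15, mr_cutoff=70):
    """Toy eligibility rule (hypothetical thresholds, not an actual regulation).

    - FSIQ below `mr_cutoff` -> "mentally retarded"
    - ability-achievement gap of at least `ld_discrepancy` -> "learning disabled"
    - otherwise -> "not eligible"
    """
    if fsiq < mr_cutoff:
        return "mentally retarded"
    if fsiq - achievement >= ld_discrepancy:
        return "learning disabled"
    return "not eligible"


DROP = 6  # illustrative WISC-to-WISC-R decrease in FSIQ points (assumed)

# Child A: achievement standard score 85, WISC FSIQ 100.
child_a_wisc = classify(100, 85)           # gap of 15 points
child_a_wisc_r = classify(100 - DROP, 85)  # gap shrinks to 9 points

# Child B: achievement standard score 70, WISC FSIQ 72.
child_b_wisc = classify(72, 70)
child_b_wisc_r = classify(72 - DROP, 70)   # FSIQ falls below the cutoff

print("Child A:", child_a_wisc, "->", child_a_wisc_r)
print("Child B:", child_b_wisc, "->", child_b_wisc_r)
```

In this sketch the same drop in measured IQ removes one child from the learning disabled category and moves another into the mentally retarded category, although neither child's underlying ability has changed.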
There have been a number of studies that have examined the difference between WISC and WISC-R scores in a variety of populations over variable test-retest intervals.
However, because this research has examined the comparability of two distinct instruments, a counterbalanced design has typically been utilized to control for error, such as practice effects. Studies employing a counterbalanced design are more methodologically sound. In these studies, children are randomly assigned to two groups. Half of the children are administered one edition of the WISC followed by the alternative edition, while the other half is administered the alternative edition followed by its counterpart. Many studies examining the comparability of the two WISC scales have indicated a differential practice effect. Such a practice effect results in a significant difference in IQ scores when the WISC is administered subsequent to the WISC-R (Davis, 1977; Hamm, Wheeler, McCallum, Herrin, Hunter, & Catoe, 1976; Swerdlik, 1978; Tuma, Appelbaum, & Boe, 1978).
Questions have been raised regarding the reasons for this significant difference in scores based upon the order of administration. Rynn (1984) suggests that administration procedures are quite dissimilar between editions, especially in the use of direct teaching: (a) on the WISC-R the examiner explains the correct answer to the child if he/she fails the first item of a subtest; (b) the WISC-R supports an administration whereby the examiner can probe a child's answer; (c) on questions whereby a correct solution requires two answers, the examiner is required to specifically ask the child for a second response if only one is given. These practices are not allowed in the administration of the WISC. It appears that when the WISC is administered first to a child, a practice effect operates when the WISC-R is administered because of the similarity in tasks. This practice effect would essentially cancel out because of the different test norms. However, when the WISC-R is administered first to the same child, the child's score benefits not only from the practice effects of the similar tasks but also from direct teaching. Such benefits of practice and direct teaching may lead to an artificially inflated discrepancy score. Therefore, in order to determine the comparability of the WISC and WISC-R, it is essential to utilize a counterbalanced procedure to control for discrepancies due to this order effect.
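The counterbalanced procedure described above amounts to a random split of the sample into two administration orders. A minimal sketch follows, assuming a simple shuffle-and-halve scheme. The function name, the child identifiers and the fixed seed are illustrative choices and are not drawn from any of the studies cited.

```python
import random


def counterbalance(child_ids, seed=None):
    """Randomly assign each child to one of two administration orders.

    Half of the children take the WISC first and the WISC-R second;
    the other half take the two editions in the reverse order.
    """
    rng = random.Random(seed)   # seeded so the assignment can be reproduced
    shuffled = list(child_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for cid in shuffled[:half]:
        orders[cid] = ("WISC", "WISC-R")
    for cid in shuffled[half:]:
        orders[cid] = ("WISC-R", "WISC")
    return orders


# Twelve hypothetical participants, identified only by number.
orders = counterbalance(range(1, 13), seed=0)
wisc_first = sorted(cid for cid, order in orders.items() if order[0] == "WISC")
wisc_r_first = sorted(cid for cid, order in orders.items() if order[0] == "WISC-R")

print("WISC first:  ", wisc_first)
print("WISC-R first:", wisc_r_first)
```

Because each order is taken by half of the sample, gains attributable to practice or to direct teaching on the first test are spread across both editions, and any remaining difference between editions is less likely to be an artifact of order.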
Reviews of the literature investigating the reliability between WISC and WISC-R scores have been conducted by employing a meta-analysis procedure. Kaufman (1979)  Caucasian ten-year-old children, Tuma, Appelbaum and Boe (1978) examined the stability of WISC and WISC-R scores. In a counterbalanced design and a mean interval of twenty days, results also indicated significantly lower WISC-R Full Scale IQ scores. Thus, it appears that a trend exists whereby normal school-aged children will score significantly lower on the WISC-R than on the original edition of the WISC.
Several studies have examined the comparability of the WISC and WISC-R by way of special education classification.
Mentally retarded population. With short test-retest intervals ranging from three days to thirty-nine days, studies reported WISC-R FSIQ scores to be significantly lower than WISC FSIQ scores for mentally retarded children. For example, Solly (1977) administered the WISC and WISC-R to twelve mentally retarded children in counterbalanced order over a three-day interval. Results indicated that children scored on average 10.8 points lower on the WISC-R FSIQ. Catron and Catron (1977) examined the WISC and WISC-R scores of sixty-two mentally retarded children, each administered the WISC and WISC-R in counterbalanced order over an interval of three weeks. All three WISC-R IQ scores were significantly lower than their WISC counterparts (5 points between VIQ scores, 6 points between PIQ scores and 5.5 points between FSIQ scores). Berry and Sherrets (1975). Results of these studies for the mentally retarded population are in agreement with Hamm et al.'s (1976, p. 140) observation that "... the author was aware of the earlier maturation, greater test sophistication, and the increasing availability of manipulative materials similar to the subtest tasks present at the WISC-R standardization in 1974. Therefore, today's subjects who have this maturity and experience should tend, in general, to obtain higher scores on the WISC than on the WISC-R when exposed to both scales." As a result, Hamm et al. (1976) concluded that when the WISC-R is used as a main criterion for classification and placement of mentally retarded children, there will be an increase in the number of students classified as mentally retarded because of the significant decrease in IQ scores from the WISC to the WISC-R.
Learning disabled population. Paal, Hesterly & Wepfer (1979) administered the WISC and WISC-R in counterbalanced order to a group of forty students classified as learning disabled, ranging in age from six to ten. After a test-retest interval of sixty to sixty-seven days, reports indicate that the WISC-R VIQ and FSIQ were both significantly lower than the WISC VIQ and FSIQ. There were no significant differences between the PIQ scores. In a study comprising a larger sample of 186 learning disabled children and a mean test-retest interval of two years, results indicated no significant differences between FSIQ scores (Covin, 1977). Reliability coefficients for FSIQ ranged from .85 at age nine to .96 at ages eight, fourteen and fifteen. Covin (1976) also examined the comparability of the WISC and WISC-R with a group of 101 elementary school aged children with learning difficulties, with the two tests administered two years apart. The correlation coefficient for the FSIQ between the WISC and WISC-R was reported to be .95. However, the mean FSIQ score for the WISC-R was significantly lower than the WISC mean FSIQ score.
Several special education populations. Zimmerman and Woo-Sam (1972) examined the comparability of WISC and WISC-R IQ scores utilizing a sample of eighty-six children diagnosed as either mentally retarded or emotionally disturbed. Results indicated that for the emotionally disturbed sample, WISC scores were significantly higher in VIQ (4.9 points), PIQ (3 points) and FSIQ (4.1 points) scores as compared to WISC-R scores. For the mentally retarded sample, WISC scores were also significantly higher in VIQ (3.3 points), PIQ (2.2 points) and FSIQ (2.1 points) scores as compared to WISC-R scores.
McGonagle (1977) examined the comparability of WISC and WISC-R scores for fifty-eight children classified as either mentally retarded, learning disabled, emotionally disturbed or non-special education eligible. The mean test-retest interval for this group was three years, seven months. Results indicated that VIQ, PIQ and FSIQ scores on the WISC-R were significantly lower than the WISC IQ scores. However, this was not the case for the sample of thirteen mentally retarded children. Although the WISC-R IQs were lower than the WISC IQs for each of the three scales, the differences were not statistically significant.
Age at Initial Administration
Hamm, Wheeler, McCallum, Herrin, Hunter, and Catoe (1976) examined the comparability of WISC and WISC-R scores for forty-eight mentally retarded children based upon the age of participants. Subjects were divided into two groups. Group I consisted of children aged 9.6 to 10.6. Group II was comprised of children aged 12.6 to 13.6. After an average interval of thirty-nine days, results indicated that no significant mean difference exists based upon age. However, this study examined only children classified as mentally retarded. Therefore, these results cannot be generalized to all children in special education.
However, Doppelt and Kaufman (1977) examined the relationship between age and score stability with a homogeneous group of children in special education. Results indicated that the WISC and WISC-R discrepancies were a function of the age of the examinee. The mean FSIQ discrepancy between the two editions of the WISC was about six points for children below age eleven but only two points for children older than eleven.
These results were supported by Berry and Sherrets (1975) and Udziela and Barclay (1983) utilizing mentally retarded populations, and by Klinge, Rodziewicz, and Schwartz (1976) utilizing an adolescent psychiatric population. Specifically, children classified as learning disabled, mentally retarded or emotionally disturbed, as well as children in "regular education," tend to score significantly lower on the WISC-R across the three IQ measures as compared with the WISC. With respect to the effect of age on the stability of scores, results suggest that children ten years of age and younger experience more score fluctuation than children over ten. Ethnicity and gender of participants do not play a role in score stability. Therefore, the results of this extensive literature review suggest that WISC-R IQ scores are significantly and systematically lower than WISC IQ scores. Such significant differences in scores between test editions suggest that cognitive assessment materials must be kept up to date. "Reasonably contemporary normative tables are essential for making estimates of a child's level of intellectual functioning if the goal is to compare him meaningfully with his peers" (Swerdlik, 1977).

Stability of the Wechsler Intelligence Scale for Children-Revised (WISC-R)
Since the publication of the WISC-R in 1974, studies have been conducted to investigate the stability of WISC-R scores over time. Statistical data reported in the WISC-R manual for the standardization sample suggest that the Verbal, Performance and Full Scale IQ scores have reliability coefficients of .94, .90 and .96, respectively, across age groups (Wechsler, 1974). Such results were gathered from a subgroup of 303 children from the representative sample who were retested within a three-month interval. When comparing the group mean test-retest differences in Verbal, Performance and Full Scale IQ scores, there was a gain of 3.5 points on the Verbal Scale, 9.5 points on the Performance Scale and 7 points on the Full Scale. Because of the short test-retest interval reported in the manual, several studies were conducted to measure the stability of WISC-R IQ scores over an extended period of time, namely the mandated three-year re-evaluation period.

Special Education Classification
Non-special education population. Results similar to those presented in the WISC-R manual were reported by Tuma and Appelbaum (1980). These researchers examined the degree of stability of the WISC-R when administered to a sample of "normal" children within a six-month interval. Each of the forty-five non-special education children was administered the WISC-R twice, with a mean test-retest interval of 5.84 months. Results indicated significant increases of 7.82 points on the PIQ and 4.73 points on the FSIQ.
Verbal IQ mean differences were not significant. Test-retest correlation coefficients were .95, .89 and .95 on Verbal, Performance and Full Scale IQ, respectively. As a result, Tuma and Appelbaum (1980) concluded that practice effects are significant for the Performance items among normal children when administered the WISC-R twice.
"Although there is considerable evidence to support the conclusion that WISC-R IQ scores are stable over time in the normal population, one should not infer that the same holds true for exceptional children. Because exceptional children are the ones who are being referred for testing and re-evaluation, evidence concerning the stability of scores in this group is more pertinent to the question of the need for re-evaluation" (Bauman, 1991, p. 96).
Mentally retarded population. Spitz (1983) examined the constancy of the Full Scale IQ scores for a group of mentally retarded children. Results indicated that after an average interval of two years, there was a significant mean increase of more than three points on the WISC-R FSIQ and a stability correlation of .84.
Learning disabled population. Anderson, Cronin and Kazmierski (1989)  These results are similar to those of Kaye and Baron (1987), who examined the WISC-R test-retest stability of ninety-nine learning disabled children over a three-year period.
Results suggested significant VIQ decreases over time, as well as significant PIQ increases over the three-year period. With a very small sample of nineteen learning disabled children, Saklofske, Schmidt and Yackulic (1984) discovered the same trend of significant decreases in VIQ and FSIQ between the two administrations conducted three years apart. Higher correlation coefficients were reported in a study conducted by Lally, Lloyd and Kulberg (1987)  Results of these studies suggest that children classified as learning disabled tend to demonstrate significant instability in WISC-R scores between test-retest administrations.
More specifically, it appears that PIQ scores tend to increase over time, whereas VIQ and FSIQ scores tend to decrease over time. Schmidt et al. (1989)  However, decreases of two points on the VIQ and increases of two points on the PIQ were found.
Full Scale IQ Score at Initial Administration
In order to investigate the effects of initial IQ score on IQ stability, Bauman (1991) divided his 130 elementary school subjects into three groups on the basis of initial FSIQ score. Children who received an initial FSIQ below 90 were identified as the "below average" group, those scoring 90-110 as the "average" group, and those scoring above 110 as the "above average" group. Results indicated that while the mean IQ scores for the entire sample of 130 children significantly declined between testings for VIQ and FSIQ scores, the "above average" group suffered a significantly greater loss on all three scales than the "below average" group. Bauman (1991) explained that regression toward the mean is not responsible for the loss in scores, because this explanation would have predicted that below average groups would make significant gains in IQ scores, which did not occur.
Naglieri and Pfeiffer (1983) examined the stability of WISC-R scores with a group of children who scored below 90 on initial administration. Fifty-three children (fifteen classified as mentally retarded, twenty-three classified as borderline intelligence, and fifteen classified as low average) were administered the WISC-R on two separate occasions after a mean interval of two years, ten months. Results indicated no significant mean differences on VIQ, PIQ or FSIQ scores between administrations. In addition, correlation coefficients suggested that the WISC-R scores have good test-retest reliability over a long period of time (VIQ = .79, PIQ = .75, FSIQ = .85).

Age at Initial Administration
Similar to the studies examining WISC/WISC-R comparability, research has been conducted to determine whether age is related to the degree of change in WISC-R scores upon retest. Bauman (1991) examined the relationship of age to IQ stability in children with learning disabilities. With a sample of 130 children, Bauman divided the sample into two groups: below eight at initial administration and over eight at initial administration.
The stated reasoning for this division is that children below age eight complete Section A of the Coding subtest of the WISC-R, and children over eight complete Section B. After a mean test-retest interval of two years, eight months, results indicated that children under eight experience significant decreases in VIQ, PIQ and FSIQ, suggesting that age is a significant factor in WISC-R change scores over time.
Elliott, Piersel, Witt, Argulewicz, Gutkin, and Galvin (1985) examined the stability coefficients of test-retest VIQ, PIQ, and FSIQ scores. Three hundred and eighty-two children (105 classified as learning disabled, fourteen classified as behavior disordered, and 247 classified as mentally retarded) were divided into three groups according to their age (6.0 years to 9.0 years; 9.1 years to 13.75 years; and 13.75 years to 17.0 years). Results indicated that the stability coefficients of these three age groups did not vary significantly.
Ethnicity
Elliott and Boeve (1987) examined the relationship of ethnicity to the three-year stability of the performance of handicapped children on the WISC-R. Subjects consisted of 168 males equally distributed as Caucasian, Mexican-American and African-American and were classified as either learning disabled, mentally retarded or "other." Results indicated that the variable, ethnicity, did not significantly influence test stability. As a whole, the sample decreased by an average of two points on the mean VIQ and increased by an average of three points on the mean PIQ. "Thus, although statistically significant the influence of three years time on the intelligence test performances of the sampled handicapped children was pragmatically insignificant" (p. 464). Elliott et al. (1985) also examined the reliability of WISC-R scores with a larger sample consisting of 175 Caucasian, 67 Mexican-American, and 140 African-American children. Of the 382 subjects, 105 were classified as learning disabled, fourteen were classified as behavior disordered and 247 were classified as mentally retarded. Results indicated that with a multiracial group of handicapped children, WISC-R IQ scores were quite stable over a three-year test-retest interval.
Gender
The results of Elliott et al. (1985) suggest that the variable, gender, has a minimal influence on test score stability. However, it was reported that female VIQ scores yielded significantly larger stability coefficients than scores obtained by the males in the sample. It was suggested that one possible reason for this finding is that females tend to develop verbal skills earlier and more rapidly than their male counterparts.

Summary of WISC-R Stability Studies
Numerous studies have been conducted that examine the stability of WISC-R scores over time. Because these studies utilized different methodologies, statistical analyses, sample sizes and test-retest intervals, many of the results obtained are inconsistent across investigations. Therefore, while interpreting the results, it is important to understand the manner in which the studies were conducted in order to construct appropriate generalizations. However, the general trend for each variable under investigation is summarized below.
Studies have investigated the relationship of special education classification to WISC-R score stability. Results suggest that children in "regular education" generally experience stability in WISC-R scores over time. However, significant increases in PIQ and FSIQ scores can be expected. Children classified as mentally retarded experience moderate stability in WISC-R scores across the three IQ measures over time. A significant trend is prevalent for children classified as learning disabled. Generally, Verbal IQ scores decrease and Performance IQ scores increase significantly at retest over a three-year interval. Regarding the effects of initial FSIQ scores on test-retest stability, results suggest that children who receive an initial FSIQ below 90 experience more stability over retest than children who score above 90. Studies investigating the relationship between age and score stability are inconsistent. One study suggests that younger children experience significantly more instability over time than older children, while another study suggests that no difference in stability across age groups exists. Studies examining the relationship between ethnicity and score stability indicate no difference across ethnic groups.
Statistically significant yet pragmatically insignificant gender differences were found, indicating that female VIQ is more stable than male VIQ. Therefore, the instability of WISC-R IQ scores over time can be attributed to several variables.

The Wechsler Intelligence Scale for Children-Third Edition (WISC-III)
The third edition of the Wechsler Intelligence Scale for Children was published in 1991, seventeen years after the second edition. The test is designed for children aged six years through sixteen years, eleven months. The WISC-III contains thirteen subtests (six in the Verbal Scale and seven in the Performance Scale). Ten subtests are required and three subtests are supplementary and not computed in the IQ scores.
The WISC-III was standardized with 2,200 children across the United States. One hundred boys and one hundred girls in each of eleven age groups from six years to sixteen years, eleven months comprised the standardization sample. The sample was stratified on age, ethnicity, geographic region and parental education based upon the 1980 U.S. Census. The WISC-III standardization sampling procedure has been regarded as superior to that of the WISC-R. The WISC-III matched the census quite adequately across variables, whereas the WISC-R procedure stratified race by white vs. non-white. WISC-III standardization was also superior to the WISC standardization, which only used white subjects in the standardization sample (Sattler, 1988).

Special Education Classification
Non-special education population. The manual of the WISC-III reports a test-retest reliability study conducted with 206 children aged six through sixteen, who were administered the WISC-R and WISC-III in counterbalanced order after a median test interval of 21 days (Wechsler, 1991).
Several special education populations. The WISC-III manual (Wechsler, 1991) reported a test-retest study consisting of 104 children enrolled in special education, examined as a homogeneous group. Each child was administered the WISC-R and WISC-III. Results suggest that mean VIQ, PIQ and FSIQ scores significantly decreased by five to six points on the WISC-III as compared with the WISC-R.
Graf and Hinton (1994)  With a group of twenty-seven learning disabled, eighteen emotionally disturbed and twenty-three mentally retarded children, Post (1992) examined the comparability of the WISC-R and WISC-III scores over a mean test-retest interval of three years. Results suggested that WISC-III IQ scores were significantly lower than WISC-R IQ scores for the learning disabled, emotionally disturbed and mentally retarded samples. Mean differences between scores were 6.4 points, 6.5 points and 7.0 points for VIQ, PIQ and FSIQ, respectively.
Learning disabled population. Lyon (1995) Doll and Boren (1993) examined the performance of seventeen children classified as severely language impaired on the WISC-R and WISC-III.
Results indicated that the mean WISC-III VIQ, PIQ, and FSIQ scores were six points lower than respective scores on the WISC-R. Newby, Recht, Caldwell and Schaefer (1993) examined the comparability of WISC-R and WISC-III scores of twenty-six children with dyslexia administered between one and five years apart. Mean VIQ scores were 4.9 points lower and FSIQ scores were 4.8 points lower on the WISC-III than on the WISC-R. Mean PIQ differences of 3.4 points were not significant.
Mentally retarded population. With a group of ninety-three children classified as mentally retarded, Nagle and Daley (1994) discovered that WISC-III scores were five to eight points lower than comparable WISC-R scores.
Full Scale IQ at Initial Administration
Graf and Hinton (1994) examined the records of eighty-four children who were administered the WISC-R and WISC-III three years apart. In order to examine the effect of initial FSIQ, the entire sample was divided by IQ range. Findings suggest that at the lower IQ range (60-90), the WISC-III yielded higher scores for VIQ, PIQ and FSIQ.
However, for the 91-105 and the 106-120 subgroups, the WISC-R consistently yielded higher scores across all three IQ measures. These results may be attributed to the phenomenon of regression to the mean.

Summary of WISC-R/WISC-III Stability Studies
Extensive research has been conducted to investigate the comparability of WISC-R and WISC-III IQ scores for a variety of populations over variable test-retest intervals.
Generally, results suggest that the WISC-III yields lower Verbal IQ, Performance IQ and Full Scale IQ scores as compared to the previously administered WISC-R. Specifically, children classified as learning disabled, emotionally disturbed or mentally retarded, as well as children in "regular education," tend to score significantly lower across the three IQ measures on the WISC-III as compared with the WISC-R. Regarding the relationship between initial Full Scale IQ score and score stability, results suggested that children who scored between 60 and 90 on the WISC-R scored significantly higher on the WISC-III for VIQ, PIQ and FSIQ. However, for the children who scored above 90, the WISC-III scores were significantly lower across IQ measures.
The WISC-III tends to yield significantly lower IQ scores than the WISC-R. This trend is comparable to that found in the studies comparing WISC and WISC-R scores. Such results suggest that revisions of the Wechsler scales tend to produce lower scores than previous editions.

Stability of the Wechsler Intelligence Scale for Children-Third Edition (WISC-III)
A study described in the WISC-III manual (Wechsler, 1991)
(a) The correlation coefficients of Verbal, Performance and Full Scale IQ scores for Administration #1 and Administration #2 of the WISC-III will not differ significantly from the median test-retest reliability estimates of the standardization sample.
(b) Group mean differences of test-retest Verbal, Performance and Full Scale IQ scores will not be significant.
(c) There will be significant intra-individual variability between test-retest Verbal, Performance and Full Scale IQ scores.
Hypothesis #2: Certain variables may influence test-retest Verbal, Performance and Full Scale IQ scores within specified populations.
Special education classification
(a) The learning disabled sample will elicit less stable correlation coefficients for test-retest Verbal, Performance and Full Scale IQ scores than the mentally retarded sample.
(b) The learning disabled sample will experience significant group mean increases in Performance IQ scores and significant group mean decreases in Verbal IQ scores between test-retest administrations.
(c) The mentally retarded population will not experience significant group mean differences in Verbal, Performance and Full Scale IQ scores between test-retest administrations.
Initial Full Scale IQ Score
(d) Students with initial Full Scale IQ scores falling in the mentally deficient or below average category will elicit more stable correlation coefficients for test-retest Verbal, Performance and Full Scale IQ scores than subjects in the above average category.
(e) Group mean differences for test-retest Verbal, Performance and Full Scale IQ scores will not be significant for students with initial Full Scale IQ scores below 90.
(f) Group mean differences for test-retest Verbal, Performance and Full Scale IQ scores will be significant for students with initial Full Scale IQs above 110.
Age at initial administration
(g) Test-retest correlation coefficients for Verbal, Performance and Full Scale IQ scores will be more stable for older children than for children under the age of eight at the time of the initial test administration.
(h) Group mean differences for test-retest Verbal, Performance and Full Scale IQ scores for children under eight will be significantly higher than the group mean differences for children over the age of eight during the initial test administration.
Ethnicity
(i) There will be no significant differences between the correlation coefficients for test-retest Verbal, Performance and Full Scale IQ scores based upon the ethnicity of the child.
(j) Group mean differences for test-retest Verbal, Performance and Full Scale IQ scores will not be significantly different among ethnic groups.

Gender
(k) There will be no significant gender differences between the correlation coefficients of test-retest Verbal, Performance and Full Scale IQ scores.
(l) Group mean differences for test-retest Verbal, Performance and Full Scale IQ scores for males will not be significantly different from the group mean differences for females.

Participants
Longitudinal archival data were collected from the special education files of 592 Hispanic (10.8%), African-American (7.3%), Asian-American (3.3%) and Native-American (.5%). These demographics are comparable to those reported in this study.
As a result of the initial administration, students were classified as learning disabled (78.9%), not eligible for services (6.1%), behavior disordered (4.7%), speech and language disordered (4.1%), mentally retarded ( (Wechsler, 1991). Concurrent validity studies reported in the WISC-III manual suggest high correlations between the Full Scale IQ score of the WISC-

Procedure
Members of the State organization for school psychologists were asked by mail to participate in the data collection process of the study. The school psychologists were asked to conduct a file review for each child in their school of employment who was evaluated with the WISC-III on two separate occasions as part of the mandated re-evaluation procedure. A standardized worksheet was provided (see Appendix B).
Requested information included demographic information for each student (gender and ethnic background), as well as the age, special education classification, and Verbal, Performance and Full Scale IQ scores for both initial evaluation and re-evaluation.
Neither the name nor other identifiable information was recorded on the worksheets, which were forwarded to the principal investigator upon completion. Twenty-five school psychologists from twenty-two school districts participated in the data collection process.
Each participating school psychologist received a $50.00 stipend which was funded by the  Table 1 for means and standard deviations). Scale IQ for this sample of 592 children in special education was not significant.
Results of the three statistical analyses employed (test of correlations, test of mean differences and test of individual variability) suggest that Verbal IQ, Performance IQ and Full Scale IQ scores are stable for this sample of children in special education who were administered the WISC-III twice over the three-year mandated re-evaluation period.
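The correlational and individual-variability analyses named above can be illustrated with a minimal sketch. It is not the procedure used in this study: the scores are hypothetical, and the five-point threshold for counting a child's score as changed is an assumption chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def share_changed(x, y, threshold):
    """Proportion of children whose score moved by at least `threshold` points."""
    changed = sum(1 for a, b in zip(x, y) if abs(b - a) >= threshold)
    return changed / len(x)

# Hypothetical FSIQ scores for five children (not data from the study).
first = [85, 92, 100, 110, 78]
second = [88, 90, 97, 104, 80]

r = pearson_r(first, second)                        # test-retest stability coefficient
share = share_changed(first, second, threshold=5)   # threshold is an assumption
```

A high group correlation can coexist with sizable changes for individual children, which is why the two quantities are examined separately.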

Hypothesis #2: Stability of IQ Scores Within Specified Populations
The above-mentioned results indicated that WISC-III Verbal, Performance and Full Scale IQ scores are stable over the three-year re-evaluation period. However, in order to determine whether specified populations will elicit less stable scores, several subgroups were examined (special education classification, Full Scale IQ score at initial administration, age of the participants at initial administration, as well as the ethnicity and gender of the participants).

Special Education Classification
The special education files of 592 children were examined to investigate the stability of WISC-III IQ scores over a three-year re-evaluation period. As a result of the initial administration, these children were classified according to State regulations as learning disabled (n = 467), mentally retarded (n = 15), behavior disordered (n = 28), not eligible for services (n = 36), Attention-Deficit/Hyperactivity Disorder (n = 10), Otherwise Health Impaired (n = 10), and speech/language disordered (n = 24). The classifications of two children were not reported. (See Table 2 Table 3 for means, standard deviations, PPM correlation coefficients and t-values).
It was predicted that students with an initial Full Scale IQ score below 90 would have more stable PPM correlation coefficients than subjects scoring above 109 (Hypothesis 2d).
(Refer to for students with initial FSIQ scores below 90. The mean VIQ score was significantly higher at retest, t(256) = -2.44, p < .05; mean PIQ score was significantly higher at retest, t(256) = -2.89, p < .05; and mean FSIQ score was significantly higher at retest, t(251) = -3.26, p < .05. Therefore, results suggested that group mean differences exist between test-retest VIQ, PIQ and FSIQ for students with initial FSIQ below 90. These students will score significantly higher on the second administration approximately three years later.

Age at Initial Administration
Subjects were divided into two groups: children under the age of eight (Group 1) and children eight years of age and older (Group 2) at initial administration. (See Table 4 for means, standard deviations, and correlation coefficients).
It was predicted that test-retest PPM correlation coefficients for VIQ, PIQ and FSIQ would be more stable for children over the age of eight than for children under the age of eight at the time of the initial administration (Hypothesis 2g). (See Table 4 It was predicted that group mean differences for test-retest Verbal, Performance and Full Scale IQ scores for children under the age of eight would be significantly higher than the group mean differences for children eight years of age and older (Hypothesis 2h).

Ethnicity of Participants
It was predicted that there would be no significant difference between the PPM correlation coefficients from test-retest Verbal, Performance and Full Scale IQ scores based upon the ethnicity of the child (Hypothesis 2i). In addition, it was predicted that group mean differences for test-retest Verbal, Performance and Full Scale IQ scores would not be statistically significant among ethnic groups (Hypothesis 2j). Because of the small percentage of participants that were reported as other than Caucasian (17.2%), the analysis divided ethnicity into two categories, "Caucasian" and "non-Caucasian." (See Table 5 for means, standard deviations, and PPM correlation coefficients).
A two-tailed Fisher's Z test was utilized for each pair of test-retest PPM correlation coefficients for the three IQ measures for Caucasians and non-Caucasians. Results suggested that no significant differences exist between correlation coefficients for Verbal IQ scores, z = -.73, p > .05; Performance IQ scores, z = -1.36, p > .05; or Full Scale IQ scores, z = -.54, p > .05, based on the ethnicity of the subjects.
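For readers unfamiliar with the Fisher Z comparison of two independent correlation coefficients, a minimal sketch follows. It shows the standard textbook form of the test; the correlations and group sizes in the example are hypothetical and are not values from this study, and the function name is introduced here only for illustration.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-tailed test of the difference between two independent correlations."""
    z1 = math.atanh(r1)  # Fisher r-to-z transformation
    z2 = math.atanh(r2)
    # Standard error of the difference between two transformed correlations.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical stability coefficients and group sizes (not study data).
z, p = fisher_z_test(r1=0.85, n1=460, r2=0.88, n2=100)
```

With these illustrative inputs the difference is not significant at the .05 level, which is the same pattern of conclusion reported above for the two ethnic groups.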
In order to determine whether significant group mean differences exist for test-retest VIQ, PIQ and FSIQ scores between the Caucasian and non-Caucasian groups, a multivariate repeated measures analysis of variance (MANOVA) was conducted.
Independent variables were ethnicity and time, and the dependent variables were the three IQ measures. Utilizing Wilks' test of significance, results suggested no significant interaction between ethnicity and time: Caucasians and non-Caucasians did not differ in the amount of change between administrations across IQ measures, F(3, 555) = 1.08, p > .05. Therefore, no significant test-retest differences were found in the amount of change across VIQ, PIQ and FSIQ between Caucasians and non-Caucasians.

Gender of Participants
It was predicted that gender differences would not exist in the stability of IQ scores over time. More specifically, it was predicted that there would be no significant differences between the correlation coefficients of test-retest Verbal, Performance and Full Scale IQ scores based upon the gender of the child (Hypothesis 2k). It was also predicted that group mean differences for test-retest Verbal, Performance and Full Scale IQ scores for males would not be significantly different from the group mean differences for females (Hypothesis 2l). (See Table 6  Analysis of the stability coefficients revealed that females' PIQ scores are significantly more stable than male Performance IQ scores over a three-year test-retest interval.

Chapter 4

DISCUSSION
The purpose of the current study was to investigate the long-term stability of Verbal, Performance and Full Scale IQ scores for a sample of exceptional children evaluated with the Wechsler Intelligence Scale for Children-Third Edition over a three-year interval. Several variables (special education classification, Full Scale IQ score at initial administration, age of the participant at initial administration, ethnicity and gender of the participants) were examined to detect which variables, if any, would influence the stability of test scores over the mandated three-year re-evaluation period.

Stability of IQ Scores for the Entire Sample
Results of the current investigation indicated strong reliability between test-retest Verbal, Performance, and Full Scale IQ scores over a three-year interval for 592 exceptional children. However, the Pearson product-moment (PPM) correlation coefficients were significantly lower than those reported in the WISC-III manual for the standardization sample, consisting of 353 non-exceptional children. There may be two reasons for the discrepancy in reliability coefficients. First, the length of the test-retest interval may be a factor. The mean test-retest interval of the current investigation was three years, whereas the test-retest interval of the standardization sample was a median of twenty-three days. Because stability coefficients tend to decrease over time, such discrepancy in correlation coefficients may simply be due to the different time intervals of the Wechsler study and the current study. A second explanation for the discrepancy in reliability coefficients between this investigation and the Wechsler reliability study is that the current sample consisted of children in special education, whereas the Wechsler sample consisted of a random selection of the entire school-aged population, mainly children not receiving special education services.
Therefore, comparisons are difficult to interpret because the samples themselves are intrinsically different.
Based on previous research investigating the long-term stability of the WISC-R, it was predicted that no significant mean differences would exist between test-retest VIQ, PIQ and FSIQ scores. Results confirmed this hypothesis: no significant mean differences were found between test-retest VIQ, PIQ and FSIQ scores for this sample of 592 children treated as a homogeneous group. Results of this investigation support previous studies examining the stability of IQ scores within a homogeneous special education population (Whorton, 1985; Vance, Hankins & Brown, 1987).
Previous research investigating the stability of the WISC-R with a homogeneous group of children in special education suggests that significant intra-individual variability exists across the three IQ measures (Elliott & Boeve, 1987; Truscott, Narrett & Smith, 1994; Vance, Blixt, Ellis & Debell, 1981; Webster, 1988 (Coleman, 1963). Similar trends were reported investigating the stability of the WISC-R over a three-year interval (Anderson, Cronin, & Kazmierski, 1988; Elliott & Boeve, 1987; Kaye & Baron, 1987; Haddad, Juliano & Vaughan, 1984). It was believed that increased Performance IQ scores in a learning disabled population were a result of practice effects (Covin, 1977). It was also suggested that decreased Verbal IQ scores were a result of the increased difficulty that handicapped children experience with verbal conceptualization and abstract verbal thinking as grade level advances (Naglieri & Pfeiffer, 1983 Based on previous research, it was predicted that the mentally retarded sample would not experience significant group mean differences in VIQ, PIQ and FSIQ scores between test-retest administrations. Results confirmed this hypothesis: no significant mean differences were found between administrations for the sample of fifteen children classified as mentally retarded. These results are comparable to those reported investigating the stability of test-retest WISC scores (Throne, Schuman & Kaspar, 1962; Friedman, 1970) and test-retest WISC-R scores (Whorton, 1985) for this population. should utilize a larger sample to further investigate the reliability of WISC-III scores for this population, specifically to explain the significant decreases in PIQ scores.

Not eligible for services
This group of thirty-six children was referred for an evaluation, administered the WISC-III and considered not eligible for special education services. Approximately three years later, these children were again referred and re-administered the WISC-III.
Children categorized as "not eligible" for services as a result of the initial administration had strong test-retest correlations for VIQ, PIQ and FSIQ scores. In addition, no significant differences were found between test-retest scores across the three IQ measures. Future studies should investigate this population with larger sample sizes. One factor that may influence the reliability of WISC-III scores, and which would be interesting to investigate, is the relationship between speech/language services received and the stability of VIQ scores over time.

Stability of IQ Scores Based on Initial Full Scale IQ Score
In order to investigate the effects of initial FSIQ score on the stability of test-retest scores, the sample was divided into three groups: children receiving an initial FSIQ below 90, children receiving an FSIQ between 90 and 109, and children receiving an FSIQ above 109.
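The three-way split described above can be expressed as a short sketch. The band labels, function names and score pairs are hypothetical and introduced only for illustration; only the cut points (below 90, 90 to 109, above 109) come from the text.

```python
from collections import defaultdict

def fsiq_band(initial_fsiq):
    """Assign an initial Full Scale IQ score to one of the three bands."""
    if initial_fsiq < 90:
        return "below 90"
    if initial_fsiq <= 109:
        return "90-109"
    return "above 109"

def group_by_band(records):
    """Group (initial FSIQ, retest FSIQ) pairs by the band of the initial score."""
    groups = defaultdict(list)
    for initial, retest in records:
        groups[fsiq_band(initial)].append((initial, retest))
    return dict(groups)

# Hypothetical score pairs, not data from the study.
records = [(84, 86), (95, 94), (109, 108), (112, 106), (89, 91)]
groups = group_by_band(records)
```

Grouping on the initial score only, rather than on both scores, keeps band membership fixed across administrations, so that any change at retest is measured within the original group.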
Based on previous research investigating the stability of scores with the WISC and WISC-R, it was predicted that students receiving an initial FSIQ below 90 would have higher test-retest correlation coefficients than subjects scoring above 109. Results were consistent with these previous studies: children with an initial FSIQ below 90 demonstrated significantly higher test-retest correlation coefficients for VIQ, PIQ and FSIQ scores than children scoring above 109 (Naglieri & Pfeiffer, 1983; Bauman, 1991; Klonoff, 1972).
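One common way to test whether two independent test-retest correlations differ is Fisher's r-to-z transformation. The sketch below is illustrative only: the text above does not state which procedure was used to compare the subgroup coefficients, and the function name `fisher_z_compare`, the correlations and the sample sizes are all hypothetical values chosen for the example, not results from this study.

```python
import math


def fisher_z_compare(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent
    Pearson correlations, using Fisher's r-to-z transformation."""
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    # Standard error of the difference between the transformed values
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical inputs (NOT values from this study): a lower-scoring group
# with r = .90 (n = 200) and a higher-scoring group with r = .70 (n = 60).
z, p = fisher_z_compare(0.90, 200, 0.70, 60)
print(f"z = {z:.2f}, p = {p:.5f}")
```

A positive z here means the first group's coefficient is the larger one. The test assumes the two groups are independent samples, which holds for subgroups formed by splitting participants on initial FSIQ.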
It was also predicted that group mean differences for test-retest VIQ, PIQ and FSIQ scores would not be significant for students with initial FSIQ scores below 90. Results of this investigation disconfirmed this hypothesis: significant mean differences exist between test-retest administrations for this group. Children who had an initial FSIQ score below 90 scored significantly higher on the second administration approximately three years later, by a mean of 1.24, 1.65 and 1.54 points on VIQ, PIQ and FSIQ, respectively. Such increases in scores are inconsistent with previous research on the WISC and WISC-R (Whatley & Plant, 1957; Naglieri & Pfeiffer, 1983).
It was also hypothesized that group mean differences for test-retest VIQ, PIQ and FSIQ would be significant for students with initial FSIQ scores above 109. Results of this study indicated that mean VIQ, PIQ and FSIQ scores significantly decreased between administrations by means of 3.12, 6.36 and 5.32 points, respectively, over the three-year period. Such results are consistent with previous research investigating the stability of the WISC-R (Bauman, 1991).
Children receiving an initial FSIQ below 90 tended to score significantly higher on the second administration, whereas children scoring above 109 tended to score significantly lower on the second administration three years later. Such results may be attributed to the phenomenon of regression to the mean. However, the discrepancy in scores between administrations for the below 90 group appears to be pragmatically insignificant. An increase in scores of less than two points will not have a significant impact on the decision-making process for a change in classification or eligibility for services. In contrast, the discrepancy in scores for the children who scored above 109 on FSIQ at the time of the initial administration has significant practical implications. Decreases of more than three, six and five points on VIQ, PIQ and FSIQ, respectively, may play a significant role in the decision-making process for a change in classification or eligibility for services.
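The direction of the changes described above is what regression to the mean predicts whenever groups are selected on an imperfectly reliable first score. The following simulation is a minimal sketch of that mechanism under stated assumptions: a classical true-score model with a population mean of 100, a standard deviation of 15 and a reliability of .90, none of which are taken from this study. The function names are hypothetical, and the sketch illustrates only the direction of the effect, not the magnitudes observed in this sample.

```python
import random
import statistics


def simulate_retest(n=20000, mean=100.0, sd=15.0, reliability=0.9, seed=0):
    """Simulate two administrations under a classical true-score model:
    observed score = stable true score + independent measurement error.
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    true_sd = sd * reliability ** 0.5            # spread of true scores
    error_sd = sd * (1.0 - reliability) ** 0.5   # spread of measurement error
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(mean, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)
        second = true_score + rng.gauss(0.0, error_sd)
        pairs.append((first, second))
    return pairs


def mean_change(pairs, keep):
    """Mean retest-minus-test change for cases whose first score passes `keep`."""
    changes = [second - first for first, second in pairs if keep(first)]
    return statistics.mean(changes)


pairs = simulate_retest()
low_change = mean_change(pairs, lambda score: score < 90)    # initial FSIQ below 90
high_change = mean_change(pairs, lambda score: score > 109)  # initial FSIQ above 109
print(f"below 90: {low_change:+.2f}   above 109: {high_change:+.2f}")
```

In this model no simulated child's true ability changes, yet the group selected for low first scores gains on average and the group selected for high first scores loses on average. A symmetric model of this kind does not by itself account for the larger decreases reported for the above 109 group.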

Stability of IQ Scores Based on Age at Initial Administration
The present study compared differences and similarities between test-retest performance on the WISC-III at two age levels: below eight years of age, and eight years and older.
Based on previous research by Bauman (1991) ...
In order to determine whether initial Full Scale IQ scores affect test-retest reliability, the sample was divided into below average, average and above average subgroups.
Results of the current investigation suggest that children scoring an FSIQ below 90 at initial administration tend to have greater stability than children scoring above 109.
Interestingly, it was also determined that children scoring below 90 tend to score significantly higher on the second administration, and children scoring above 109 tend to score significantly lower on the second administration. Such results can be attributed to the phenomenon of regression to the mean. However, the increases in VIQ, PIQ and FSIQ scores in the subgroup that scored below 90 on FSIQ at initial administration were very small. On average, increases in scores were less than two points across the three IQ measures. On the other hand, for the children who scored an FSIQ above 109 at initial administration, VIQ, PIQ and FSIQ scores decreased by more than three, six and five points, respectively. Therefore, children who score an initial FSIQ above 109 will tend to have less stable scores than children who score in the average or below average range.
Age does not play a significant role in the test stability of the WISC-III. Although results suggest that children eight and older at the time of the initial administration demonstrate more stable PIQ correlation coefficients than children under eight at the time of the initial administration, such differences are relatively small and not diagnostically useful. In addition, significant but minimal mean differences exist between test-retest VIQ and PIQ scores based upon the age of the participant. However, these results also appear pragmatically insignificant.
No significant differences were discovered between differential change scores based on the ethnicity of the participants. Similarly, no significant interaction exists between the gender of the participant and the amount of change between administrations. However, females demonstrated more stable PIQ scores than male participants. This significant but small difference in stability also appears to be pragmatically insignificant.

Limitations
Although this investigation comprises a very large sample of 592 children, several limitations are evident and affect the generalizability of the study. First, the sample of subjects was limited to a small New England state. This sample is not representative of the United States population, and caution should be taken when generalizing these results. Second, because the criteria for classification differ among states, the results obtained may not be representative of children outside the state, because their membership in the classification categories may differ. Third, small sample sizes were evident in examining the relationship between special education classification and the reliability of IQ scores. Larger samples are necessary in future research to interpret such results with confidence. Fourth, the ethnicity of the subjects does not appear to represent the national percentages of individuals in special education.

Implications
The current study has several implications for the practicing school psychologist with respect to the necessity of conducting cognitive assessments every three years. The present study concluded that WISC-III Verbal IQ, Performance IQ and Full Scale IQ scores are generally stable for children in special education over a three-year re-evaluation period. Therefore, the practice of conducting three-year re-evaluations utilizing the same cognitive assessment for every child is of limited usefulness. This study investigated the stability of scores for several populations within the special education community to determine whether certain children will fluctuate in VIQ, PIQ and FSIQ scores over the re-evaluation period. Results of this investigation suggest that scores generally remain stable over time. Children classified as learning disabled, not eligible for services, ADHD, Otherwise Health Impaired or speech and language disordered will tend to have stable scores, and re-evaluations do not appear necessary.
However, children classified as mentally retarded tend to experience instability in VIQ scores over the three-year re-evaluation period. Similarly, children classified as behavior disordered experience instability in PIQ scores over a three-year interval. In addition, children who received an initial FSIQ score above 109 will tend to fluctuate significantly between administrations. Therefore, for these populations, re-evaluations appear justified.
Because significant differences are not expected for the rest of the population of children in special education, retesting the same child after three years will not yield new information. Once an accurate assessment of a child's intellectual ability has been obtained during the initial administration, further administrations of the same cognitive tool will contribute limited information to educational planning and program assessment for children in special education. Therefore, routine administrations of individually administered IQ tests, namely the WISC-III, are of questionable value and should no longer be required for all children.
The re-evaluation process consumes an extensive amount of school psychologists' time as well as financial resources of school districts. This study has concluded that routine re-evaluations are not necessary for all children and thus should be eliminated.
The elimination of unnecessary re-evaluation assessments will result in financial relief for school districts, which will no longer be responsible for financing the needless practice of conducting individual cognitive assessments. As a result, school districts will be able to disburse those funds more effectively. Such funding would aid in developing more appropriate programs for children in special education to serve their educational and psychological needs. The elimination of routine re-evaluations will also have dramatic implications for the roles and functions of school psychologists. School psychologists will have more opportunity to perform the roles they are trained in, such as counseling and teacher consultation, and to have a direct influence on the design and implementation of educational and psychological services for children that are desperately needed in the schools today.
It is imperative that caution be taken when interpreting the results of this study. Policy makers are strictly warned not to make careless judgments regarding the necessity of school psychologists in the districts. The elimination of the re-evaluation process should not eliminate the employment of school psychologists in the schools. On the contrary, the release of such testing constraints will significantly expand the role of school psychologists. The elimination of testing will provide the opportunity for school psychologists to perform the jobs for which they are intended.
It is inevitable that questions will be raised by school psychologists, school administrators and policy makers regarding the future role of school psychologists.
Before decisions are made on a state or district level, it is imperative that school psychologists unite to develop a professional philosophy regarding the future of their profession. After such discussions, the State school psychology association must collaborate with school administrators and policy makers to discuss the practical implications of eliminating three-year re-evaluations and the effect such policies will have on the expanding role of the school psychologist.

IRB ACTION REPORT
The activity indicated below has been reviewed by the University of Rhode Island Institutional Review Board (IRB) in accordance with the requirements of Title 45, Part 46 of the Code of Federal Regulations (Protection of Human Subjects), or other federal regulations as required such as 21 CFR 50. The University has an approved assurance of compliance on file with the Department of Health and Human Services which covers this activity. Our assurance number is M 1457. Any changes which may alter the investigational situation must be reported promptly to the IRB. Any questions concerning this action can be directed ...

I am writing as a follow up to our meeting with Linda Cassidy on this date. Specifically, I will outline the RISPA support for Ms. Cassidy's research project, and detail my understanding of the agreement between RISPA and the Office of Special Needs.
At our May 1996 RISPA Executive Board meeting we voted in support of Ms. Cassidy's project "Stability of WISC-III Scores: For Whom are Re-evaluations Necessary?". We understand that these data may be used to inform how IDEA reauthorization is enacted in Rhode Island, and the potential impact on the school psychologist's role. In addition, I understand that this study is being supported by the Rhode Island Department of Education, Office of Special Needs in the amount of S; 500 .00.
RISPA will provide Ms. Cassidy with the following in support of her project:
1. A letter of support to encourage participation of school psychologists in data collection in their school districts.
2. The RISPA mailing labels.
Funding from the Office of Special Needs will be made payable to RISPA. In turn we will compensate in the amount of $50 participating school psychologists and practicum/intern school psychology students (working under direct supervision of district school psychologists and subject to the same confidentiality standards) following Ms. Cassidy's verification of individual completion of data collection.
It is a pleasure to collaborate with you on behalf of Rhode Island children, their schools, and our profession. To facilitate this effort you may contact Judy Zeyl, RISPA Treasurer, at the address above. Following elections to the RISPA Executive Board, you will be notified of the contact person for continuing collaboration on IDEA reauthorization. Until then, I may be reached by phone or voice mail at 401-874-4221, or e-mail AGL 1 0t@URIACC.URI.EDU.

I would like to personally invite you to participate in the study entitled, Stability of WISC-III Scores: For Whom are Triennial Re-evaluations Necessary? By assisting in the data collection process in your school district, you are helping to generate very important information that may ultimately assist school psychologists all over the state in determining which children would benefit from a psychological re-evaluation and which children might not need such an evaluation.
Endorsement and funding for this study has been provided by the Rhode Island Department of Education, Office of Special Needs. Bob Pryhcxia, Director, is providing a letter to each of the Directors of Special Education in Rhode Island encouraging that each district participate in this study. The funding that has been provided has been allocated to you, the school psychologist, in assisting in the data collection process. I understand the school year is coming to a close. However, I'm asking each of you to take some time within the next couple weeks, or during the summer, to participate in this very important study (and make some money while you are at it!). A sample data sheet is enclosed with this mailing. As you can see, we are seeking minimal information on the stability of WISC-III scores over time and therefore, collecting such information will not take you much time.
Thank you for taking the time to read this information. Any assistance you can give me at all would be greatly appreciated. If I can explain the research further to you or if you have any questions, please contact me at (401) 782-3709 or you can reach my major professor, Janet Kulberg, Ph.D., at (401) 874~228. I look forward to hearing from you soon. Thank you. Sincerely,