CONFIRMATORY FACTOR ANALYSIS OF THE WISC-IV WITH A TRINIDAD REFERRED SAMPLE

There has been an extensive amount of research in the intelligence-assessment literature on the structure of the Wechsler Intelligence Scale for Children, fourth edition (WISC-IV; Wechsler, 2003a). Numerous studies show that the test's general factor structure replicates across normative and referred groups, in the US and globally. Thus far, few studies have examined the factor structure of this and other intelligence tests with Caribbean samples. The current study adds to this body of literature by examining the factor structure of the WISC-IV with a referred sample from Trinidad. The study utilized archival data from a sample accessed through private practices and a public clinic located in the northeast region of the island of Trinidad, within the Republic of Trinidad and Tobago (N = 261). Data were extracted from client files and included age (M = 11.13, SD = 2.76), gender (males n = 182), DSM diagnosis, WISC-IV subtest scaled scores and composite standard scores, and other variables that were not used in this study due to incomplete data. An examination of subtest and composite mean scores showed that measures of visual-spatial processing speed (Coding and Symbol Search) and the overall processing-speed standard score fell almost one and one-half standard deviations below the normative mean, and lower compared with other cognitive domain scores in this sample. Confirmatory factor analysis procedures were completed examining six different configurations: one-, two-, three-, and four-factor models, and two hierarchical (direct and indirect) models that account for the influence of four factors plus a general intelligence factor (g). The four-factor model, which excluded a g factor, yielded superior fit with the data based on an examination of several fit indices (χ², χ²/df ratio, comparative fit index [CFI], root mean square error of approximation [RMSEA], standardized root mean-square residual [SRMR], Akaike information criterion [AIC]). The indirect-hierarchical model, which represents the WISC-IV interpretive model, was not considered the most appropriate for the sample in this study. Reasons for these results are postulated, study limitations are explored, and areas for future research are considered.

Since the inception of the field, assessment of intelligence has been a core practice of clinical and school psychologists (Vasquez-Nuttall et al., 2007). Initially developed over a century ago in response to social and economic changes spurred by the Industrial Revolution (Oakland, 2004), intelligence tests continue to be revised, adapted, and empirically evaluated. Of the available measures, the Wechsler intelligence scales are among the most widely used and validated measures for assessing cognitive ability in the US and worldwide (Ambreen & Kamal, 2014; Bowden, Saklofske, & Weiss, 2011; Dang, Weiss, Pollack, & Nguyen, 2012; Saklofske, Weiss, Beal, & Coalson, 2003). The Wechsler scales are normed and standardized for use with various populations (e.g., US, Canada, United Kingdom, Australia, Germany, Austria and Switzerland, France, Mexico, India, Sweden, China, and Japan; Grégoire et al., 2008). Although versions of the Wechsler scales have been adapted for use in various countries and with various cultural groups, well-validated intelligence tests are often more difficult to find in developing countries. As such, it is not uncommon to find cross-cultural applications of a test with groups for whom the test was not standardized. This is the case in Trinidad and Tobago (T&T) and other Caribbean nations.

Statement of the Problem
In T&T, the US versions of the Wechsler scales (Wechsler, 2003a, 2008, 2012, 2014) are typically used in practice, as there is no comparable test of cognitive or intellectual ability that has been normed and validated with persons from this population. Without empirical support, the use of tests normed on one population with another may produce biased results and inaccurate interpretations. This practice raises the question of whether a test developed on one population can accurately measure intellectual potential for persons from another population with unshared cultural experiences. Louison (2016) investigated the factor structure of the WISC-IV by completing factor-analytic procedures with referred and normative samples from T&T. The results of that study showed that for the referred group, a four-factor structure was recovered; however, with the normative sample, other factor configurations showed superior fit compared with the WISC-IV recommended model.

Justification for and Significance of the Study
In T&T, population demographics and ethnic composition differ markedly from those of the US; in the most recent census, some ethnic groups each accounted for small proportions of the population (e.g., 1.4%), and a fairly large percentage (6.2%) did not state their ethnic group membership (MPSDCSO, 2012). In addition to ethnic group differences, there are stark differences between T&T and the US with regard to economic development. Additionally, although the national language is English, it can be argued that differences in expression and use of language exist. These factors raise concerns about the cultural appropriateness of the US versions of these tests for measuring intellectual functioning with persons from T&T. To address these concerns, the current study aimed to gain a better understanding of the validity of the Wechsler Intelligence Scale for Children, fourth edition (WISC-IV; Wechsler, 2003a), for use with children in T&T. The construct of intelligence, how it is typically measured, and the importance of the cultural appropriateness of test usage were reviewed, and planned analyses were conducted.

REVIEW OF THE LITERATURE
Scientific inquiry into defining and measuring intelligence began taking root in the mid-19th century, whereas the widespread application and perceived importance of examining this construct became cemented in the early 20th century (Gottfredson & Saklofske, 2009). In the US and other industrialized nations, economic, political, and social changes occurring at the turn of the 20th century led to increasing needs to educate more children and youth at higher levels, to meet the special learning needs of students, and to help ensure that children and other individuals with severe disorders were provided appropriate care (Farrell, Jimerson, & Oakland, 2007; Oakland, 2004).
Assessment methods were developed to measure and identify those needs and guide decision making for developing appropriate programs and supports for children with special needs (Oakland, 2004), with intelligence tests playing a role in that process.
Intelligence test use has a long and contentious history in the field, both in terms of socio-political factors (e.g., issues of cultural bias) and in terms of defining and conceptualizing the intelligence construct. First, there is no agreed-upon definition of the construct of intelligence (Sternberg, 1997); second, reliance on theory is a relatively recent advancement in the measurement of human intelligence, as earlier versions of intelligence test batteries were developed without a clear and well-established theoretical framework (Keith & Reynolds, 2010; Schneider & McGrew, 2012). The following review outlines the conceptualization of intelligence as a construct, the development of a guiding theoretical framework and model for understanding and measuring intelligence, and issues of validity and reliability, and presents literature on the use of the Wechsler Intelligence Scale for Children (WISC) with various populations.

Defining Intelligence
Intelligence is a latent trait, abstract and difficult to define. Sternberg (1997) presents a review of the various definitions that have been applied. He comments that the literature on intelligence has generated various definitions over time, and he provides examples of common elements found among them. Intelligence has been defined as higher-level abilities related to executive functions (e.g., abstract reasoning, problem solving, decision making), as the ability to learn, as adaptation to environmental demands, and in terms of cultural values (Sternberg, 1997). The manual for the newest version of the WISC incorporates these themes and defines intelligence as an individual's capacity to understand the world and the resourcefulness to cope with its challenges (Wechsler, 2014). Despite the lack of a coherent and established definition across fields, researchers agree that intellectual thinking is critical to daily human functioning (Dang et al., 2011) and has been shown to predict success in academic and occupational settings (Brody, 1997). Sternberg's (1997) review of definitions of intelligence highlights an interaction whereby human beings do not just adapt to their environments but actively shape them. He offered the following definition: "Intelligence comprises the mental abilities necessary for adaptation to, as well as shaping and selection of, any environmental context…a process of lifelong learning, one that starts in infancy and continues throughout the life span" (Sternberg, 1997, p. 1030). Sternberg's definition suggests that intelligence is a fluid concept, shaped by the interaction of the individual with the environment, and that it changes with time. The idea of viewing intelligence as a transactional person-environment concept relates to Bronfenbrenner's (1977, 1994) social-ecological theory of human development and Vygotsky's (1978) sociocultural theory of cognitive development. At the same time, intelligence measures have been criticized for being applied in schools and clinics without guidance from a coherent, evidence-based theory. As such, over the years, conceptual theories of measuring intelligence have been extensively researched, mainly using factor-analytic methods, and test developers have placed increasing emphasis on incorporating theory into instruments for measuring intelligence.

Conceptualizing Intelligence
The Cattell-Horn-Carroll (CHC) theory provides a taxonomy of human cognitive abilities that organizes over 100 years of research into a systematic theoretical framework for understanding and measuring intelligence and related variables (Schneider & McGrew, 2012). The CHC model is a synthesis of the Cattell-Horn fluid-crystallized (Gf-Gc) model of intelligence (1966) with the Carroll three-stratum model (1993), which were both influenced by Spearman's (1927) conceptualization of general intellectual functioning (Keith & Reynolds, 2012; McGrew, 2009; Schneider & McGrew, 2012). Developed through factor-analytic methods, CHC theory is a multidimensional, hierarchical model that includes an overarching general intellectual ability factor, broad interrelated ability factors, and an array of narrow sub-skill variables.
The Binet-Simon test (1905) is credited as the first practical test of intelligence applied to measure intellectual differences. This and other early intelligence tests conceptualized and measured intelligence as a unidimensional construct (Newton & McGrew, 2010). Spearman, one of the earliest intelligence theorists, expanded on this concept of a general intelligence factor, symbolized as g, and included sub-skills of g, termed s, which he considered specific abilities related to g (Spearman, 1927).
Research by Spearman and early theorists such as Thurstone (1938) applied factor-analytic methods to expand the idea of a general intelligence factor to include several broad, highly correlated but distinct factors (Alfonso, Flanagan, & Radwan, 2005; Horn & Blankson, 2012). It was Cattell and Horn's Gf-Gc theory, however, that provided the basis for the modern CHC model (Schneider & McGrew, 2012). Cattell (1943) posited that Spearman's g was better explained by the inclusion of two factors: general fluid (Gf) and general crystallized (Gc) intelligence.
According to Cattell, Gf includes inductive and deductive reasoning abilities that are influenced by biological and neurological factors, and by incidental learning through interaction with the environment (Alfonso, Flanagan, & Radwan, 2012). In contrast, Gc includes acquired-knowledge abilities that largely reflect acculturation (Alfonso et al., 2005); it represents the degree to which an individual has learned practical, useful knowledge and mastered valued skills relevant to the culture (Keith & Reynolds, 2010). Cattell (1943) postulated that Gf increases until adolescence and then slowly declines, and that it incorporates the function of the whole cortex (Schneider & McGrew, 2012). Gc, in contrast, consists of knowledge previously learned, initially through the operation of fluid ability, that no longer requires insightful perception or novel problem solving (Schneider & McGrew, 2012). According to Cattell (1943), most learning occurs through effort and several other non-ability-related variables, such as the availability and quality of education, family resources and expectations, and individual interests and goals (Schneider & McGrew, 2012). These collective differences in the time, resources, and effort spent on learning were termed investment (Cattell, 1963; Horn & Blankson, 2012; Schneider & McGrew, 2012). As with other early theorists, Cattell observed a high correlation between Gf and Gc and hypothesized that Gf supports the development of Gc via investment (Schneider & McGrew, 2012). Spearman also noted the high correlation among sub-skill factors, as well as among varying measures of ability, a phenomenon he termed the positive manifold and saw as evidence for the existence of a g factor (Horn & Blankson, 2012; Schneider & McGrew, 2012).

Carroll's (1993) seminal work, Human Cognitive Abilities: A Survey of Factor-Analytic Studies, examined 460 datasets found in the factor-analytic literature at the time and re-analyzed the data using exploratory factor-analytic (EFA) methods. Based on analyses of the large body of work since Spearman, Carroll synthesized and organized an empirically based taxonomy of human cognitive abilities into a systematic, coherent framework (McGrew, 2009; Schneider & McGrew, 2012). Carroll (1993) proposed a three-tiered model of cognitive abilities: stratum III, the broadest level, is a general intelligence factor; stratum II comprises broad ability factors; and stratum I contains numerous narrow abilities, which are subsumed by the stratum II abilities, which, in turn, are subsumed by the stratum III g factor (see Figure 1; Schneider & McGrew, 2012). Newton and McGrew (2010) and McGrew (2009) present an organized summary of CHC broad and narrow abilities. Carroll's aim was to provide a "map of all known cognitive abilities" to aid in interpreting intelligence test scores in applied settings (Carroll, 1997, p. 887).

There was a clear need for the classification and organization of the large body of intelligence test research and theory. The CHC model and its systematic taxonomy of cognitive abilities have become popular with contemporary researchers, test developers, and practitioners. Since the development of the CHC model, many new and revised intelligence batteries have incorporated CHC theory (Alfonso et al., 2012). Keith and Reynolds (2010) reviewed the factor-analytic research on several different intelligence batteries and found that most contemporary batteries were either explicitly grounded in CHC theory or strongly influenced by it. The Woodcock-Johnson Psychoeducational Battery, Revised (WJ-R; Woodcock & Johnson, 1989) was the first published test to officially apply Gf-Gc theory to assessment practice, particularly in educational settings (Schneider & McGrew, 2012; Keith & Reynolds, 2010). Since then, the CHC model has continued to shape test development and interpretation.

Factor-analytic methods traditionally have been used by intelligence theorists and test developers to formulate and conceptualize intelligence and to determine its measurement. In fact, the study of cognitive abilities is closely tied to historical developments in exploratory (EFA) and confirmatory (CFA) factor analysis (Schneider & McGrew, 2012), and early intelligence theories and factor-analytic methods were developed in tandem (Keith & Reynolds, 2012). The psychometric evidence provided for the CHC structural framework in Carroll's (1993) book, and the body of research since then, make it difficult to refute that the model measures related variables of an underlying latent construct. Robust psychometric support for the CHC model has been shown in the related literature: findings across a multitude of studies employing EFA, CFA, and multi-group factor-analytic methods support the model's validity, and factorial invariance of the CHC structure of intelligence has been observed in a large majority of studies.
The structure of CHC-based tests such as the Wechsler intelligence scales has been tested and generally replicated within and across clinical/referred samples, cross-cultural samples (US ethnic groups, international), and age and gender groups, suggesting that the constructs measured by these intelligence tests are largely invariant across groups. The WISC-IV, for example, has been adapted and standardized in numerous countries (Grégoire et al., 2008). The structure of the CHC model has been replicated across referred samples, including children with ADHD (Styck & Watkins, 2017), specific learning disabilities (Styck & Watkins, 2016), and other clinical groups (e.g., Canivez, 2014; Devena, Gay, & Watkins, 2013; Watkins et al., 2013). Factorial invariance of the CHC model has also been observed across age groups (Chen, Keith, Chen, & Chang, 2009; Bickley, Keith, & Wolfle, 1995; Keith, Fine, Taub, Reynolds, & Kranzler, 2006).
Although the CHC model is currently the most widely accepted and applied theoretical framework for describing the structure of human intelligence, several issues need to be considered (Keith & Reynolds, 2010). The CHC model currently does not provide a definition of intelligence that can be applied across contexts. Evidence for the validity of the CHC model has mainly focused on construct validation through the use of CFA. Keith and Reynolds (2010) suggest a more rigorous approach that tests both the measurement structure of a test and the theory behind it: cross-battery CFA (CB-CFA), which analyzes subtests from one battery together with subtests from other intelligence test batteries. Similar to discriminant validation procedures, instruments drawn from different orientations may offer a better opportunity to confirm or disconfirm each instrument's structure (Keith & Reynolds, 2010). Compared to other abilities, Gc is more easily influenced by factors such as experience, education, and cultural opportunities (Schneider & McGrew, 2012). This raises two major issues: Gc is theoretically broader than what current intelligence tests measure, and no test of Gc can be culture-free (Keith & Reynolds, 2010). Relating to the second point in particular, the cultural validity of intelligence theories and tests has been a source of debate from the very beginning. Gf, too, is a measure of fluid reasoning within context and depends on culturally relevant environmental demands. Issues of cultural bias, and methods to address them, are reviewed subsequently.

Wechsler Intelligence Scale for Children
Since the development of the Wechsler-Bellevue Intelligence Scale in 1939 (Boake, 2002), the Wechsler intelligence scales have reflected over 70 years of intelligence test research and development (Wechsler, 2003b). Intelligence tests typically produce scores traditionally described as intelligence quotients, abbreviated as IQ. Historically, IQ referred to the score achieved by dividing measured mental age by chronological age, a ratio process that is no longer in use (Neisser et al., 1996). Though the term is antiquated, IQ remains in modern use. Contemporary intelligence tests like the WISC use statistical procedures to derive standardized, deviation (versus ratio) IQ scores, which are considered global estimates of intellectual functioning.
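The distinction can be made concrete with a brief formalization (a sketch; MA and CA denote mental and chronological age, and the 15-point standard deviation is the Wechsler convention):

$$\text{IQ}_{\text{ratio}} = \frac{\text{MA}}{\text{CA}} \times 100, \qquad \text{IQ}_{\text{deviation}} = 100 + 15\left(\frac{x - \mu}{\sigma}\right),$$

where x is an examinee's composite score and μ and σ are the mean and standard deviation of the examinee's normative age group.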
The Wechsler test batteries are differentiated by age group and comprise tests for preschoolers, children and teens aged 6 to 16, and adults aged 16 to 90 years old.
The scales have been updated and revised over time to incorporate new norms and changes in intelligence theory such as the CHC model. The WISC is currently in its fifth revision; however, for this study, the fourth edition of the WISC was used. At the time this research was conducted on the island of Trinidad, the newest edition of the test was not yet commonly used in public agencies; as such, data for the WISC-IV were more accessible.
Reflecting the CHC hierarchical model, the WISC-IV has 15 subtests measuring various sub-skills, and the scores derived from these subtests follow the three-stratum structure of the CHC model. Figure 2 illustrates the WISC-IV's four-factor higher-order, or indirect hierarchical, model, which was influenced by CHC theory and derived using factor-analytic methods. In stratum I are the subtests; 10 of these subtests are core or compulsory, and 5 are supplemental (not included in Figure 2). At stratum II, the subtests are grouped into four theoretical, factor-based index scores: Verbal Comprehension Index (VCI, cf. Gc), Perceptual Reasoning Index (PRI, cf. Gv and Gf), Working Memory Index (WMI, cf. Gsm), and Processing Speed Index (PSI, cf. Gs). At the third stratum, the Full Scale IQ (FSIQ) is based on the sum of the 10 core subtests (three VC, three PR, two WM, and two PS) and is considered the most reliable measure of g (Wechsler, 2003b). An alternative configuration is the direct hierarchical (or bifactor) model, in which a general factor directly influences all of the measured variables alongside broad group factors, which also affect a subset of the measured variables (Styck & Watkins, 2016). Direct hierarchical models are also considered nested-factor models, where all subtests load directly both on a g factor and on the other broad factors, with the factors generally orthogonal, or uncorrelated (Gignac, 2008; Keith & Reynolds, 2012).
Although both models indicate that the subtests are affected both by g and by one or more broad abilities, the nature of that influence differs (Keith & Reynolds, 2012). The higher-order model assumes that g influences individual subtests only through the broad abilities, whereas the direct-hierarchical model does not specify the relation between g and the broad (first-order) factors, instead specifying only that the subtests measure both g and the broad abilities.
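The contrast can be written schematically in two measurement equations (a sketch, with x_j denoting subtest j, F_k its broad factor, and ε and u denoting residual terms):

$$\text{Higher-order: } x_j = \lambda_j F_k + \varepsilon_j, \quad F_k = \gamma_k g + u_k,$$
$$\text{Direct hierarchical: } x_j = \lambda_{jg}\, g + \lambda_{jk} F_k + \varepsilon_j, \quad g \perp F_k.$$

In the higher-order form, g reaches subtest j only through the product λ_j γ_k; in the direct hierarchical form, g and the group factors compete directly for subtest variance.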
These findings maintain the underlying conceptual importance of g but stray from the traditional three-stratum hierarchical CHC model, in which g is assumed to mediate the relationship between the secondary and primary abilities; instead, g is modeled as directly related to the primary abilities. Although CHC-based measures appear to tap underlying cognitive abilities, a re-evaluation of the structure of the CHC model may be necessary in light of these findings.
Results of intelligence tests directly influence examinees' outcomes. IQ test scores are combined with other measures of academic, emotional, adaptive, and neurological functioning to determine access to supports and services.
Thus, inaccurate test results can have detrimental effects on individuals, their families, and the systems within which they operate. As such, it is imperative that test results be reliable and valid, and that the inferences made from them reflect an accurate estimate of the construct being measured. Accurate, unbiased testing leads to more accurate predictions and decisions.

Reliability and Validity
Standardized tests offer two important features. First, an examinee's score can be compared with a normative group consisting of others who share important characteristics (e.g., age, gender, language, cultural background).
Additionally, standardized tests aim to ensure consistency of format and procedures in use and administration to reduce the influence of extraneous variables on the construct being measured. Reducing external influences minimizes error and ensures that the results garnered produce reliable and valid information about the test taker.
Reliability refers to the consistency and precision of results (Urbina, 2004).
Reliability measures target consistency of measurement over time, across forms of a test, or within an instrument (internal consistency), and are evaluated to quantify measurement error, because reliability is inversely related to measurement error.
Although some level of random error is expected, systematic and consistent error in measurement represents a source of bias and limits the validity of test results (Urbina, 2004).
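In classical test theory terms, this inverse relation can be formalized as follows (a standard decomposition, not specific to this study):

$$X = T + E, \qquad r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2},$$

where an observed score X is the sum of a true score T and error E, and reliability r_xx is the proportion of observed-score variance attributable to true-score variance; as error variance grows, reliability shrinks.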
Validity is concerned with how accurately a test measures a construct or latent trait of interest. If a test is a valid measure of a specific construct, it should ideally have strong reliability; however, if a test consistently produces similar results, this does not guarantee that it is a valid measure of the intended construct. Concerns about bias can arise when the validity of a standardized test is questioned. There are various types of validity evidence that test developers examine to reduce bias, such as content validity, criterion-related validity, and construct validity (Wechsler, 2003b). Construct validity is relevant to understanding the underlying psychological processes tests measure (Brown, Reynolds, & Whitaker, 1999).
Generally, construct validity examines whether the pattern of relationships among measures of a trait, related or unrelated to other traits, is consistent with theoretical expectations (Barker, Pistrang, & Elliott, 2002). One way to establish construct validity is by showing that a measure exhibits a pattern of high correlations with related measures (convergent validity) and low correlations with measures of unrelated constructs (discriminant validity; van de Vijver & Tanzer, 2004). Construct validation procedures can also be applied early in test development.
Traditionally, factor-analytic methods have been used in the development of intelligence theory and intelligence tests (Keith & Reynolds, 2012). CFA is a structural equation modeling (SEM) method applied to assess the relationships among sets of measures or items and their respective hypothesized latent factors (Harlow, 2014). CFA is a theory-driven approach used to test how well a set of items fits a predetermined theoretical model. The Wechsler scales, like other contemporary tests of intelligence, commonly use CFA to examine fit with the CHC model, and CFA of the WISC-IV with the US standardization sample has yielded strong factor loadings that fit the structure of intelligence hypothesized by CHC theory. CFA was used in this project to determine whether a similar fit with the CHC model would be replicated with a sample from the island of Trinidad. Observed differences in factor structure may indicate a number of possibilities: test items may be interpreted differently by the two groups, the nature of the construct may vary due to cultural differences, the test may measure entirely different constructs for the two groups, or the groups may apply different cognitive processes when responding to items (Warne, Yoon, & Price, 2014). Moreover, differential factor structures would raise concerns about bias.

Cultural Adaptation of Intelligence Tests
The WISC-IV has been adapted and standardized in Canada (both English and French versions), the United Kingdom, Australia, Germany, Austria and Switzerland, France, Mexico, India, Sweden, China, and Japan (Grégoire et al., 2008). As this list shows, the WISC has been culturally adapted for several developed countries; for many developing countries, however, there are few well-adapted intelligence tests (Dang et al., 2011).
Culture implies shared values, knowledge, communication (Greenfield, 1997), and meaning (Serafica & Vargas, 2006). For a test to be applied cross-culturally, these domains should be shared between the normative group and the group being tested (Greenfield, 1997). Grégoire et al. (2008) argue that intelligence cannot be assessed independently of cultural influence; there is no culture-free test, and thus cross-cultural applications may lead to biased interpretations. Cross-cultural adaptation goes beyond linguistics: even in the UK, an English-speaking country, some items on verbal subtests of the US WISC were modified during adaptation (Grégoire et al., 2008). Whether verbal or non-verbal, all tests include information relevant to the culture in which the test was developed and contain items reflecting what is considered intelligent within that particular culture.
In cross-cultural adaptations of the Wechsler scales, the verbal subtests are the most frequently modified across languages and cultures (Grégoire et al., 2008). That observation does not suggest that other subtests are less culturally loaded: non-verbal tests are not culture-free (Ortiz, Ochoa, & Dynda, 2012), and cultural experiences provide a framework through which we perceive, analyze, and process non-verbal stimuli (Grégoire et al., 2008). Pérez-Arce (1999) discusses the concept of an "ecological brain" and posits that cultural knowledge and experience provide an interpretive framework that guides reasoning and problem solving. Cultural environment has a significant impact on intellectual skills (Gopaul-McNicol & Armour-Thomas, 2002); to be considered intelligent or adaptive means to excel in the skills valued by one's own group (Neisser et al., 1996). All tests are culturally loaded and contain items reflecting what is considered intelligent within the originating culture (Suzuki, Prevost, & Short, 2008). As such, when a test developed for one culture is applied in another, as is the case with the WISC-IV in T&T, scores may not accurately reflect the underlying latent trait the test was designed to measure.

Research Objective
The objective of this study was to determine whether the factor structure of the WISC-IV could be replicated with a Trinidadian sample. This objective was examined using CFA to determine the model fit of the four-factor hierarchical model of the WISC-IV. The results have implications for the construct validity and applicability of this tool for measuring intellectual functioning with this population. This study is similar in objective, method, and scope to Watkins et al. (2013) and Louison (2016). Watkins et al. (2013) completed a factor-analytic study of the WISC-IV with a referred sample in Ireland using the UK version of the test; the factor structure was replicated and model fit established with a sample of 794 Irish children. In Louison (2016), hierarchical models were not tested with the referred sample, but only with the normative sample, for which a direct-hierarchical model was determined to have superior fit. This study sought to examine whether those results would be replicated with a different sample from the island of Trinidad. Results were compared with other global studies that have used samples from various countries, as well as studies that have used clinical samples.

Participants
Data were extracted from client records for children and adolescents who had been referred for evaluation of learning difficulties and other disabilities (N = 261).
Records were sourced from private practices and one public agency in Trinidad. Of note, the psychologists who agreed to participate and provide authorization had private practices mainly located in the north-west and north-central regions of Trinidad. Data were not collected on the island of Tobago, mainly due to constraints of time and available resources. The sample consisted of children and adolescents aged 6 to 16 years, with an average age of about 11 years (M = 11.13, SD = 2.76). There were more males (n = 182, 69.7%) in the sample than females (n = 79, 30.3%).
Clinical diagnoses were included in client records for most participants, though for 19.5% of participants no diagnosis was recorded. About a fifth (20.7%) of the records indicated that the participant met criteria for at least two diagnoses; 1.9% reported three or more diagnoses. Among cases with more than one diagnosis, Attention-Deficit/Hyperactivity Disorder (ADHD) was often a co-morbid diagnosis. Table 1 lists the diagnostic categories for the participants.

Measures
The most recent version of the Wechsler Intelligence Scale for Children is the fifth edition, the WISC-V (Wechsler, 2014). In this study, however, the fourth edition (WISC-IV) was used. Reliability coefficients reported in technical manuals are usually high for Wechsler scales, typically above .70. For the WISC-IV, internal consistency reliability was estimated using the split-half method for all subtests except the processing-speed subtests; test-retest reliability was obtained for these speeded subtests. Exploratory (EFA) and confirmatory (CFA) factor-analytic studies with the WISC-IV indicated strong evidence for construct validity. Table 2 shows the loadings obtained from the standardization sample based on EFA of the WISC-IV. Simple structure is observed, with subtests loading highly (greater than or equal to .40) on expected factors and very low on non-respective factors. Loadings ranged from .45 to .84, providing evidence of simple structure (Harlow, 2014; Gorsuch, 1983). Factor-analytic procedures with the US normative sample demonstrated that the four-factor model fit the data best compared with alternative models (Wechsler, 2003b).

Data extracted from client records were entered into a database and securely stored. An identification number was assigned to each client in the database. Practitioners also were assigned an identification number; no identifying information was recorded or stored for the practitioners. IQ test scores were recorded for each client, as well as academic scores when available. Reading and Math composite scores were mostly from the Wechsler Individual Achievement Test, second (WIAT-II) or third (WIAT-III) edition, but were not used in this study's analyses due to inconsistent reporting of academic test results in client files. It was initially intended to explore mean differences or factor invariance based on school type: government (public), government-assisted (e.g., religious charter schools), or private. However, there was considerable difficulty sourcing information on which schools fell into the three categories, and this variable was not explored further.
In Trinidad and Tobago there is no established research that indicates expected score differences based on ethnicity or other demographic characteristics. As such, researchers purposefully did not sample to create stratified groups based on race/ethnicity. Within this population, poverty, socioeconomic status (SES) and related factors (e.g., access to education, nutrition, chronic stress) were seen as more important to consider as potentially contributing to any observed group differences.
Thus, private/public school attendance was considered as a possible proxy for SES, to be used to assess possible differences in scores had this variable been available in the collected data.
Several guidelines for appropriate sample sizes in factor-analytic studies are suggested in the literature. Most guidelines propose that fairly large samples are required, typically at least 100 to 200 participants (for reviews see Guadagnoli & Velicer, 1988; Harlow, 2014; MacCallum, Widaman, Zhang, & Hong, 1999). Compiling larger samples is ideal, though not always feasible. MacCallum et al. (1999) suggest that sample sizes below 100 may be appropriate when communalities (estimates of the shared variance among subtests) are high and factors are well determined. Factor-analytic models may require fewer participants than common guidelines suggest if the model yields high estimates of shared variance among variables (greater than or equal to .30), factors show high loadings on at least three or four variables, and good simple structure is observed (Guadagnoli & Velicer, 1988; Harlow, 2014). Nevertheless, a larger sample reduces the impact of sampling error on factor-analytic models, and generalizations or inferences from a sample are strengthened as sample size increases (Harlow, 2014). Considering these various criteria, a sample of 261 participants was determined to be adequate, although a larger sample would be preferable.
Descriptive statistics and correlation tables were computed using SPSS version 21. CFA models were computed using the lavaan (latent variable analysis; Rosseel, 2012) package in R, which estimates parameters via maximum likelihood. The semPlot package in R (via its semPaths function) was used to create CFA diagrams for the models tested. Six models were tested, based on the four WISC-IV factor structures examined in the test manual (Wechsler, 2003b) as well as models that have been tested in CFA studies with referred and non-clinical samples (e.g., Canivez, 2014; Watkins, 2010; Watkins et al., 2013). The first was a one-factor structure with all ten subtests loading on a single g factor. The second was a two-factor model in which five subtests with higher language demands (verbal expression and oral listening skills: SI, VC, CO, DS, LNS) loaded on a verbal factor, and five subtests requiring visual-spatial abilities (BD, PC, MR, CD, SS) loaded on a non-verbal factor. The third model contained the verbal comprehension and perceptual reasoning factors with their respective subtests, plus a third, cognitive proficiency factor that combined the processing-speed and working-memory subtests. The fourth was a correlated-factor model that included the four WISC-IV factors with their respective subtests, without accounting for the effect of g. The fifth model examined the WISC-IV four-factor model with the inclusion of the higher-order/hierarchical g factor recommended in the test manual (Bodin, Pardini, Burns, & Stevens, 2009; Wechsler, 2003b). A higher-order model implies full mediation, whereby the association between a higher-order factor (g) and the observed variables (subtests) is mediated fully by the lower-order factors (composites; Yung, Thissen, & McLeod, 1999). The sixth model examined a direct hierarchical (Gignac, 2008) or bifactor model, in which only direct effects are estimated; each observed variable (subtest) is free to contribute variance directly to the g factor as well as directly to the individual factor onto which it is intended to load. Results of the descriptive statistics, correlations, and CFAs are outlined in the next chapter.
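To illustrate how these competing structures can be specified, the sketch below shows the fourth, fifth, and sixth models in lavaan syntax. It assumes a data frame named wisc with one column per core subtest, named with the abbreviations used above; the object and column names are illustrative, not the study's actual code.

```r
# Minimal lavaan sketch of three of the six models (illustrative names).
library(lavaan)

# Model 4: correlated four-factor (FF) model, no g factor
ff_model <- '
  VCI =~ SI + VC + CO
  PRI =~ BD + PC + MR
  WMI =~ DS + LNS
  PSI =~ CD + SS
'

# Model 5: indirect hierarchical / higher-order (IH) model; g influences
# the subtests only through the four first-order factors
ih_model <- '
  VCI =~ SI + VC + CO
  PRI =~ BD + PC + MR
  WMI =~ DS + LNS
  PSI =~ CD + SS
  g   =~ VCI + PRI + WMI + PSI
'

# Model 6: direct hierarchical (DH) / bifactor model; each subtest loads on
# g and on its own group factor, with all factors kept uncorrelated
dh_model <- '
  g   =~ SI + VC + CO + BD + PC + MR + DS + LNS + CD + SS
  VCI =~ SI + VC + CO
  PRI =~ BD + PC + MR
  WMI =~ DS + LNS
  PSI =~ CD + SS
'

fit_ff <- cfa(ff_model, data = wisc, estimator = "ML")
fit_ih <- cfa(ih_model, data = wisc, estimator = "ML")
fit_dh <- cfa(dh_model, data = wisc, estimator = "ML", orthogonal = TRUE)
```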

Descriptive Statistics
In total, 265 cases were collected from public and private agencies in Trinidad.
Four cases had missing subtest scores and were removed from the final sample (N = 261). Descriptive statistics, including indicators of skewness and kurtosis for WISC-IV subtest and composite scores, are presented in Table 3. To examine the effect of very low scores on sample means, the data for the four composite scores were sorted to highlight standard scores that fell more than two standard deviations below the mean (< 70). Cases in which three or all four domain composite scores fell below 70 were removed, and means were recalculated. Cases with composite scores more than two standard deviations above the mean were also examined, but none had more than one score above a standard score of 130. When recalculated, all composite score means were higher and closer to the average range, with the exception of the PSI, which remained one standard deviation below the standard-score mean of 100 (new M = 82.48, SD = 12.78). The recalculated Coding (new M = 6.40, SD = 2.67) and Symbol Search (new M = 7.21, SD = 2.88) subtest means were also higher, but still approximately one standard deviation below the subtest scaled-score mean of 10. Table 5 indicates that all ten subtests were positively correlated at the α = 0.01 level (2-tailed), with moderate to strong correlations generally observed. Similar findings were observed for the index scores (Table 6), which showed moderate to strong correlations, all significant at the α = 0.01 level.
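A sketch of this sensitivity check is shown below; it assumes the illustrative data frame wisc from the previous chapter also holds the four index scores, with hypothetical column names.

```r
# Re-compute means after dropping cases with three or four index scores < 70.
idx <- c("VCI", "PRI", "WMI", "PSI")
n_low <- rowSums(wisc[, idx] < 70)                    # low-index count per case
wisc_trim <- wisc[n_low < 3, ]                        # keep cases with 0-2 low indices
round(colMeans(wisc_trim[, c(idx, "CD", "SS")]), 2)   # recalculated means
```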

Confirmatory Factor Analysis
As with other general linear model methods, the assumptions to be met for CFA include independence, normality (minimal skewness or kurtosis), homoscedasticity (equal variance on one variable across all levels of another variable), and linearity (relationships among variables that do not change direction after a certain point; Harlow, 2014). As displayed in Table 3, skewness values were within an acceptable range (−1.0 to +1.0), as were kurtosis values (below 1.0), indicating relatively symmetric and homogeneously spread univariate distributions.
The moderate to large correlations among the subtests in Table 5 and the index scores in Table 6 suggest that the relationships among the scores are relatively linear.
Similarly, there does not appear to be evidence of multicollinearity among the subtest scores or among the four index scores, as all correlations were less than .90 (Harlow, 2014). There were correlations of .929 and .916 between the FSIQ score and the VCI and PRI scores, respectively; however, that is to be expected, as the FSIQ is a composite derived from the subtests. Additionally, Myers (1990) states that a variance inflation factor corresponding to an R-squared below .90 (i.e., a correlation below about .95) suggests that collinearity is not present, which is consistent with these data.
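These univariate and collinearity checks can be reproduced along the following lines (a sketch; the study's descriptives were computed in SPSS, and the psych package and column names here are assumptions):

```r
# Univariate distribution and collinearity checks for the ten core subtests.
library(psych)
subtests <- c("SI", "VC", "CO", "BD", "PC", "MR", "DS", "LNS", "CD", "SS")
describe(wisc[, subtests])            # means, SDs, skewness, kurtosis
round(cor(wisc[, subtests]), 3)       # flag any correlation >= .90
```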
CFA is a multivariate method that delineates the underlying dimensions in a set of variables, or in this case subtests, to determine fit with a theoretical model (Kline, 2016). Factor-analytic methods can be used to test theory about the conceptual nature of the underlying dimensions within a set of variables by assessing the common-factor variance, or shared variance among variables, while acknowledging the presence of error variance within the variables (Harlow, 2014). An examination of model fit determines the degree to which a structural-equation model fits the sample data; however, there is no single statistical significance test that identifies a correct model given the sample data, so multiple criteria should be considered when evaluating model fit (Schermelleh-Engel, Moosbrugger, & Müller, 2003).
CFA utilizes the χ² test as a macro-level significance test of whether there is a good fit between the hypothesized model and the data. For this study, six models were tested; model fit statistics are presented in Table 7. Models that do not adequately explain the data yield a large chi-square (χ²) with a significant p value. The χ² test, however, is directly affected by sample size (Schermelleh-Engel et al., 2003); as such, a large sample like the one required for this dissertation is likely to produce significant χ² values, and other indices are suggested to assess fit. One of them, considered a macro-level effect size, is the root mean square error of approximation (RMSEA), which should be relatively small, with values of .05, .08, and .10 or less representing good, fair, and acceptable fit, respectively (Steiger & Lind, 1980); a 90% confidence interval is reported for the RMSEA. The standardized root mean-square residual (SRMR) should also be small, with values of .05, .06, and .08 indicating excellent, good, and acceptable fit. The comparative fit index (CFI; incremental fit between a hypothesized model and an independent model that specifies only variances among the constructs) is also reported (Bentler, 1990), with .95 or more indicating better fit. The χ²/df ratio was also considered as a parsimony index that favors a smaller value (Cangur & Ercan, 2015; Schermelleh-Engel et al., 2003).

At the micro level of interpretation, the pattern of factor loadings (i.e., correlations between the subtests and the factors) was examined, with values of .1, .3, and .5 or more representing small, medium, and large effect sizes, respectively (Cohen, 1988). High loadings indicate strong correlations between a variable and the underlying dimension. In addition to loading highly on expected factors, subtests should not load highly on non-expected factors (i.e., should not have loadings of .30 or more on more than one dimension; Harlow, 2014); such a pattern reflects simple structure. Comparison of Akaike information criterion (AIC; Akaike, 1974) values was also considered, whereby the best model has the lowest value (Keith et al., 2006; Lecerf, Rossier, Favez, Reverte, & Coleaux, 2010; Watkins et al., 2013). Of the models tested, the four-factor model showed the best fit with the data, and the four-factor and indirect models both showed high loadings for the relationships between the indicators and their respective factors.
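The macro-level indices named above can be pulled from fitted lavaan models roughly as follows (a sketch continuing the earlier illustrative code):

```r
# Extract the fit indices reported in Table 7 for each fitted model.
fits <- list(FF = fit_ff, IH = fit_ih, DH = fit_dh)
sapply(fits, fitMeasures,
       fit.measures = c("chisq", "df", "pvalue", "cfi",
                        "rmsea", "srmr", "aic"))
# The chi-square/df parsimony ratio is computed directly, e.g.:
fitMeasures(fit_ff, "chisq") / fitMeasures(fit_ff, "df")
```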

Model Fit Analyses
As seen in Table 7, the first three models did not fit the data as well as the last three models examined. Overall, model fit improved as the number of factors increased. Compared to the other models examined, the one-factor model provided the least appropriate fit to the data: χ²(35, N = 261) = 149.752 was relatively large with a significant p value (p < .001), the CFI value was lower than .95, the RMSEA was greater than .10 (p < .001), SRMR = .044, and the AIC value was higher compared with the other models.

Four-factor correlated model. For the correlated four-factor/first-order (FF) model, χ²(29, N = 261) = 56.524 was relatively small with a significant p value (p = .002). The χ²/df ratio was smaller than for the other models (1.95), the CFI value was large (.985), RMSEA = .06, and SRMR = .03, all indicating excellent fit. The AIC value was also the lowest compared with the other models. Factor loadings for the four-factor model, as shown in Table 8, were strong, with values ranging from .743 to .919, indicating large effect sizes; all loadings were significant at the p < .001 level. This model is illustrated in Figure 3.

Indirect higher-order model. The indirect hierarchical/higher-order (IH) factor model also showed relatively good fit to the data: χ²(31, N = 261) = 68.165 was relatively small with a significant p value (p < .001), the χ²/df ratio was small (2.20), the CFI value was large (.979), RMSEA = .068 (p = .086), and SRMR = .031, indicating excellent fit. This model is illustrated in Figure 4.

Multivariate Analysis of Variance
In addition to the CFA analyses reported above, multivariate analysis of variance (MANOVA) was completed to explore possible group differences among the six practitioner groups across the five composite/factor scores. As mentioned previously, data for this study were collected from six agencies, one of which was a public clinic; the other five were private practices. The dependent variables in the MANOVA were the five composite scores or WISC-IV factors: VCI, PRI, WMI, PSI, and FSIQ. In the dataset, the agency or practitioner groups were numbered 1 through 6; for the purpose of interpretation, they are presented in the results section labelled P1 through P6. P4 represented the public clinic.
Preliminary analyses were conducted to test the assumptions of MANOVA before the main analyses were completed. Skewness and kurtosis values for all dependent variables were acceptable, indicating that the variables are approximately normally distributed, and, as seen in Table 6, correlations among the dependent variables did not suggest problematic multicollinearity. Results of the main MANOVA show statistically significant differences on the five composites based on agency grouping, F(25, 1275) = 7.09, p < .001; Pillai's trace = .610, partial η² = .122, a small to medium effect size. This suggests that about 12% of the variance is shared between the agency grouping and a linear combination of the dependent variables. Follow-up ANOVAs revealed significant group differences (p < .001) across all the IQ composites, with small to medium effect sizes. Table 9 displays means and standard deviations, in addition to follow-up ANOVA results, for the five composite scores for each agency.
Post-hoc analyses were completed using the Scheffé test because of unequal group sizes and the test's conservative nature. Means for each composite score were compared across the individual groups. P4, the public agency group, showed significant mean differences on each composite score compared with the other private practitioner groups, with the exception of P1. Unlike the other composites, the PSI did not show significant group differences, except between P4 and P5.
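These analyses were run in SPSS; an equivalent sketch in R, with illustrative variable names (and the DescTools package assumed for the Scheffé test), would look like this:

```r
# Omnibus MANOVA across the six agency groups, then follow-ups.
dvs <- as.matrix(wisc[, c("VCI", "PRI", "WMI", "PSI", "FSIQ")])
fit_mv <- manova(dvs ~ agency, data = wisc)  # agency: factor with levels P1-P6
summary(fit_mv, test = "Pillai")             # multivariate omnibus test
summary.aov(fit_mv)                          # follow-up univariate ANOVAs
DescTools::ScheffeTest(aov(PSI ~ agency, data = wisc))  # post-hoc contrasts
```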

Sample Characteristics and Mean Scores
The sample used in this dissertation consisted of children and adolescents who had been referred for a psychoeducational evaluation. As in other clinical samples (e.g., Bodin et al., 2009; Canivez, 2014; San Miguel Montes et al., 2010; Watkins et al., 2013), there were more boys in this sample, with girls representing only about a third of the sample. Over 20% of the sample had received diagnoses of two or more neurodevelopmental disabilities or psychological disorders, and about the same proportion had no diagnosis reported in their records. Eighteen percent of the sample had received a diagnosis of an Intellectual Disability, and 16.5% a diagnosis of ADHD. About 70% of the data came from private psychological practices. Participant ages ranged from 6 to 16, with an average age of 11 years.
Comparable to studies using referred samples, means for the current sample were generally lower than those of the US standardization sample. In particular, the processing-speed subtests Coding and Symbol Search, and the PSI, were more than one standard deviation below the normative sample mean. Scores on the perceptual reasoning subtests and the PRI were somewhat higher, approaching normative means; higher PRI scores have also been observed in other studies with clinical samples (e.g., San Miguel Montes et al., 2010). Median scores for all subtests were within one standard deviation of the US normative sample, with the exception of the processing-speed scores. Unlike in other studies using referred samples, the Coding, Symbol Search, and overall PSI means for the Trinidad sample were particularly low. This observation appears specific to Trinidad samples: the dissertation by Louison (2016) reported similar findings, with means for Coding and the PSI lower than other scores in both a referred and a non-clinical sample (Coding M = 7.0; PSI M = 87.3).
Because the sample included a large proportion of participants with a diagnosis of Intellectual Disability, it is possible that these cases affected overall mean scores, as the cognitive profile of persons with that diagnosis typically involves scores falling two standard deviations or more below the standard-score mean (Wechsler, 2003b). To determine whether these cases significantly affected processing-speed scores, cases with all, or three, of their composite scores falling below 70 were removed and means were recalculated. Although processing-speed means became higher, these scores were still lower compared with other domains and still about one standard deviation below the scaled-score and standard-score means.

Contextual factors may also be relevant. In T&T, placement into secondary school is determined by performance on the Secondary Entrance Assessment (SEA), a high-stakes examination taken at the end of primary school. This system is similar to historic, and in some cases current, practices in developing countries of sorting students into schools based on performance on a standardized assessment, a practice long established to lead to inequitable outcomes for less advantaged students. The use of SEA examinations to determine educational placement has largely retained its importance in the educational system in T&T despite the little work that has been done evaluating the validity and usefulness of such examination systems in the Caribbean (De Lisle, 2012). Another high-stakes examination is the Caribbean Secondary Education Certificate (CSEC), which is similar to the Scholastic Aptitude Test (SAT) in the US and occurs in Form 5 (Grade 11). Students are often referred for psychoeducational evaluation in order to qualify for accommodations on these high-stakes exams. As part of the psychoeducational evaluation, students are required to complete IQ testing, in addition to other forms of cognitive, academic, and social/emotional testing. The results of the evaluation determine whether students are provided with testing accommodations, with the IQ test results weighing heavily on the decision. Anxiety can have an inverse relationship with scores on intelligence tests (Meijer & Oostdam, 2007); thus, psychological evaluation and high-stakes testing in general may present a source of anxiety unique to students in T&T. If so, the cognitive load that accompanies this anxiety could lead to slower and less efficient working speeds, particularly if a student was referred because they were already struggling to perform academically in school.
In addition to test anxiety, Petty and Harrell (1977) and Grégoire et al. (2008) indicate that test-wiseness and motivation are important sources of error variance in educational testing and psychological measurement. Test-wiseness, or familiarity with test stimuli, can explain differences in performance across subtests: the more familiar one is with the structure of a test or with testing conditions, the more likely one is to perform well on that test. Test-wiseness can be related to country affluence, with examinees in more affluent countries likely to be more acquainted with psychological evaluation (Grégoire et al., 2008). Additionally, the motivation to display one's skills or abilities may depend on the amount of previous exposure to psychological tests, the freedom to participate or not (Grégoire et al., 2008), or high levels of pressure to perform. At this time, the reasons for the lower processing-speed scores observed in this study and in Louison (2016) remain speculative; they are not empirically established in the literature and were outside the scope of the current research (e.g., anxiety was not directly measured).

CFA: Model Fit Analyses
The aim of this study was to determine whether the WISC-IV factor structure replicated with a Trinidad sample. Data were extracted from archival records for a clinical sample of 261 children and adolescents between the ages of 6 and 16. Confirmatory factor analysis was applied to test whether the indirect hierarchical/higher-order (IH) WISC-IV structure recommended in the test manual would emerge from the data; five additional models were also tested to determine whether another model would provide better fit for the Trinidad sample. Results of CFA with the US normative sample (outlined in the WISC-IV test manual) demonstrated that the IH four-factor WISC-IV model fit those data best compared with alternative models, and thus this model is suggested by the test developer as the best basis for interpreting general IQ as a construct along with its related cognitive domains (Wechsler, 2003b). Results of this study indicated that although the WISC-IV IH structure showed acceptable fit with the Trinidad referred sample, a first-order four-factor model provided better fit to the data.
Six models were tested using CFA methods, and fit indices were examined.
The six models included: (a) a one-factor model with all ten subtests loading directly onto one g factor; (b) a two-factor model consisting of a verbal factor (subtests that demand English language expression and listening) and a non-verbal factor (subtests measuring visual-motor or visual-perceptual abilities); (c) a three-factor model with a verbal comprehension factor, a visual-perceptual factor, and a cognitive proficiency factor (working-memory and processing-speed skills); (d) a four-factor/first-order (FF) model with verbal-comprehension, perceptual-reasoning, working-memory, and processing-speed factors, without the influence of g; (e) the higher-order/indirect hierarchical (IH) WISC-IV model suggested in the test manual, that is, four correlated factors that load onto g and act as mediators between g and the subtests; and (f) a direct hierarchical (DH) model in which the ten subtests load directly onto g as well as onto their respective four factors.
Although the fifth model is more reflective of a CHC framework, the first and sixth more closely align with Spearman's conceptualization of g.
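To make the competing specifications concrete, the sketch below shows how the FF and IH models might be written in lavaan-style syntax with the Python semopy package. This is illustrative only; the subtest column names (SI, VO, CO, BD, PCn, MR, DS, LN, CD, SS) are hypothetical placeholders for the ten core subtests, not variable names from the study's database.

```python
# Illustrative sketch (not the analysis actually run in this study):
# specifying the first-order four-factor (FF) and indirect hierarchical
# (IH) models with semopy. Subtest column names are hypothetical.
import pandas as pd
from semopy import Model, calc_stats

# (d) First-order four-factor model: four correlated factors, no g
FF_DESC = """
VC =~ SI + VO + CO
PR =~ BD + PCn + MR
WM =~ DS + LN
PS =~ CD + SS
"""

# (e) Indirect hierarchical (higher-order) model: the four factors load on g
IH_DESC = FF_DESC + "g =~ VC + PR + WM + PS\n"

def fit_and_summarize(description: str, data: pd.DataFrame):
    """Fit one CFA specification and return its fit statistics."""
    model = Model(description)
    model.fit(data)           # maximum-likelihood estimation by default
    return calc_stats(model)  # chi-square, df, CFI, RMSEA, AIC, etc.
```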
CFA procedures were completed and several fit indices were examined (i.e., χ2, χ2/df ratio, CFI, RMSEA, SRMR, AIC). Fit improved with the addition of factors: the one-, two-, and three-factor models did not represent the data as well as the latter three models. The FF, IH, and DH models provided better empirical fit, with the FF model being the most parsimonious and offering the best overall fit indices. This pattern of better fit with increasing factors, specifically with four-factor models, has been shown consistently in the literature (Canivez, 2014; Louison, 2016; Rowe, Dandridge, Pawlish, Thompson, & Ferrier, 2014; Watkins, 2010; Watkins et al., 2013; Wechsler, 2003b). Generally, when tested, one of the FF, IH, or DH models is selected as best representing the data with normative and referred samples. Of the three, however, there has been some variability in which model is selected as most appropriate based on an examination of fit indices, parsimony, and theory.
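For readers less familiar with these indices, the sketch below shows how two of them can be computed from the model and baseline chi-square statistics. The numeric inputs are hypothetical placeholders, not values from this study.

```python
# Illustrative computation of CFI and RMSEA from chi-square statistics.
# All numeric values below are hypothetical, for demonstration only.
import math

def cfi(chi2_m: float, df_m: float, chi2_b: float, df_b: float) -> float:
    """Comparative fit index: model misfit relative to the baseline model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b

def rmsea(chi2_m: float, df_m: float, n: int) -> float:
    """Root mean square error of approximation for a sample of size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

print(cfi(85.2, 29, 1450.7, 45))   # hypothetical model vs. baseline values
print(rmsea(85.2, 29, 261))        # N = 261, as in this study
```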
In this study, three models emerged as providing good fit to the data: the FF, DH, and IH (based on the WISC-IV structure) models. The χ2diff values indicated that the FF and DH models were not significantly different from one another; however, the IH model differed significantly from the other two. The DH model was dropped as the least parsimonious of the three, and some of its factor loadings were non-significant or smaller than the accepted threshold for statistical effect sizes. Of the six models, the FF and IH models provided the best fit to the data and showed almost identical patterns of strong factor loadings, all significant at the p < .001 level. Further comparison of the FF and IH models indicated that although the IH model fit better with the existing CHC three-stratum theory and had more degrees of freedom, the FF model had a lower χ2/df ratio, higher CFI, and lower RMSEA. The FF model was also the more parsimonious of the two and was selected as best representing the data in this study.
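The nested-model comparison described here can be illustrated with a simple likelihood-ratio (chi-square difference) test. The fit values below are hypothetical placeholders, not the study's actual statistics.

```python
# Chi-square difference test for nested CFA models (hypothetical values).
from scipy.stats import chi2

def chi2_difference(chi2_restricted: float, df_restricted: int,
                    chi2_full: float, df_full: int):
    """Return delta chi-square, delta df, and the p-value for nested models."""
    d_chi2 = chi2_restricted - chi2_full
    d_df = df_restricted - df_full
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)

# e.g., comparing a more constrained model against a fuller nested model
print(chi2_difference(98.4, 31, 85.2, 29))
```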
This finding has implications for interpreting the WISC-IV with referred samples in Trinidad. Combining subtest or composite scores into a single overarching IQ score may not be the best representation of intellectual functioning with referred Trinidad samples. Rather, examining the four index scores independently may be a more appropriate conceptualization of intelligence with this sample and others like it. Rowe et al. (2014) found similar results with a sample of students tested for gifted and talented (GT) programs, and their sample and the one used in this study have some similarities. Students considered and tested for GT programs, or eligible for them, tend to have higher scores compared with population means (Rowe, Miller, Ebenstein, & Thompson, 2012; Winner, 2000). An examination of subtest and index score means in Rowe et al. (2012) shows a pattern of deviation from the mean similar to what was observed in this study, except that the GT sample's scores were significantly above the mean. Another parallel between the two samples is a pattern of lower processing-speed subtest and index scores relative to other domain scores: although the GT sample's Coding (CD), Symbol Search (SS), and PSI scores were in the average range, they were lower than the other domain scores, and in this study processing-speed scores were likewise lower than those of the other cognitive domains.

Louison (2016) completed a multi-aim study examining the factor structure of several models with both referred and normative samples from T&T. Like this study, for the referred group, the author found that a four-factor model fit the data better than one-, two-, or three-factor models; hierarchical models, however, were not examined with the referred sample. The second aim of Louison (2016) examined several measurement models, including hierarchical models, with a T&T normative sample. Similar to the current study, the FF, IH, and DH models showed good fit with those data; however, the DH model provided superior fit compared with the other models. It may be that with a normative T&T sample, g has more importance in explaining the relationships among subtests.
The WISC-IV manual recommends that the FSIQ, or general intelligence score, not be interpreted as the best estimate of overall intellectual ability when there are significant discrepancies among subtest or index scores. For referred or non-normal samples, variability in scores is expected, as individuals are often referred due to observed impairment or difference in one or more areas of cognitive or academic functioning. The results of the current study, as well as those of Rowe et al. (2012) and Louison (2016; with a T&T referred sample), support this recommendation: the FF model suggests that keeping interpretation at the index-score level is likely more appropriate for samples that differ diagnostically from the norm.
That being said, several studies using referred samples have shown different results, whereby hierarchical models, either the IH or DH, were chosen as best fitting the data. The traditional WISC-IV factor structure (IH model) has been replicated with samples of referred children in Bodin et al. (2009), Styck and Watkins (2017), and Nakano and Watkins (2014). In Bodin et al. (2009), CFA examined the higher-order factor structure of the WISC-IV with a large hospital sample; the study did not examine a DH model but included FF and IH models, and results favored the IH model (Bodin et al., 2009). Styck and Watkins (2017) showed that the IH model recommended by the WISC-IV replicated with an ADHD sample. Similar findings were observed by Nakano and Watkins (2014) with a referred Native American sample; the authors examined the same six models outlined in this study and found that the IH model best represented their data. In general, these studies found good fit for both the FF and IH models, though the IH models were chosen on the basis of one or two fit indices. Watkins (2006) suggested that the WISC-IV IH factor structure was not the best model for interpreting performance on the intelligence test and recommended transforming the four first-order factors so that they are orthogonal to each other and to the second-order g factor. According to Watkins (2006), interpreting a second-level factor on the basis of first-level factors can be misleading because performance on the subtests reflects a mixture of both first-order factors and g (McClain, 1996). This recommendation implies the application of a DH model to examine the relationships between the subtests and factors and to interpret intelligence test results. Watkins (2010) examined the same six models analyzed in this study and found that the DH model produced fit indices that best represented the data. Canivez (2014) found very similar results with CFA procedures examining six models in a referred sample; the DH model was judged superior to the FF and IH models. Gomez, Vance, and Watson (2017) found that although the IH and DH models both showed good fit, the DH model was superior based on an examination of fit indices with normative and low-IQ samples.
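The orthogonalization Watkins (2006) describes is, in essence, a Schmid-Leiman-type transformation. The sketch below illustrates the arithmetic with hypothetical loadings rather than values from any cited study.

```python
# Schmid-Leiman-type orthogonalization sketch with hypothetical loadings.
import numpy as np

# First-order pattern loadings: 10 subtests x 4 factors (hypothetical)
lam1 = np.zeros((10, 4))
lam1[0:3, 0] = [.80, .78, .72]   # verbal comprehension subtests
lam1[3:6, 1] = [.70, .55, .68]   # perceptual reasoning subtests
lam1[6:8, 2] = [.66, .64]        # working memory subtests
lam1[8:10, 3] = [.60, .62]       # processing speed subtests

# Second-order loadings of the four factors on g (hypothetical)
gamma = np.array([.90, .85, .80, .70])

# Direct subtest loadings on g, and residualized group-factor loadings
g_loadings = lam1 @ gamma
group_loadings = lam1 * np.sqrt(1.0 - gamma**2)
```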
Interestingly, Devena et al. (2013) reported findings similar to the current study when examining the same six models. Although fit indices and factor loadings were better for the FF model, the authors reported that no model showed superior fit over the others, suggesting that the differences among models were marginal. The authors chose the DH model based on "ease of interpretation and breadth of influence" (Devena et al., 2013, p. 596). According to Keith and Reynolds (2012), measurement and theory are intertwined, and it is important to select an approach to interpretation on theoretical grounds as well as practical ones. In Devena et al. (2013), however, the FF model appears to show the best fit based on an examination of fit indices and may have been the superior model.
In Watkins et al. (2013), CFA analyses strongly replicated previous examinations of the internal structure of the WISC-IV with an Irish sample (N = 794). Watkins et al. (2013) recruited participants who had been referred to an educational psychologist in the Republic of Ireland; these participants would have been tested using the United Kingdom version of the WISC-IV, which has the same factor structure as the US version. Watkins et al. (2013) tested the same six models as the current study. Similarly, the FF, IH, and DH (constrained) models showed adequate fit with their data compared with the one-, two-, and three-factor models. Although the FF model showed better overall fit and appeared parsimonious compared with the DH model, the investigators found the DH model to be superior; factor loadings ranged from .61 to .93.
Additionally, the researchers found that the higher-order g factor accounted for substantially greater portions of WISC-IV UK common and total variance relative to the factor index scores. According to the authors, although the FF model yielded better fit to the data, meaningful differences in fit statistics were not observed among the FF, IH, and DH models. Moreover, Watkins et al. (2013) suggested that because the latent factors were highly correlated, a higher-order structure is implied; as such, the FF model was seen as an inadequate explanation of the WISC-IV factor structure.
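The variance-partitioning argument can be illustrated numerically: given orthogonalized loadings like those sketched earlier, the share of common variance attributable to g follows directly. The loadings below are hypothetical values, not figures from Watkins et al. (2013).

```python
# Decomposing common variance between g and the group factors
# (hypothetical orthogonalized loadings for ten standardized subtests).
import numpy as np

g = np.array([.72, .66, .61, .63, .47, .58, .53, .51, .42, .43])
grp = np.array([.35, .34, .31, .36, .28, .35, .51, .50, .43, .44])

var_g = np.sum(g**2)      # common variance explained by g
var_grp = np.sum(grp**2)  # common variance explained by the group factors
common = var_g + var_grp

print(var_g / common)     # proportion of common variance due to g
print(var_g / len(g))     # proportion of total variance due to g
```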
Studies that have identified the DH model as superior to the IH and FF models have recommended, based on overall model fit and on findings that the g factor accounts for more variance than the individual domains, that interpretation of intelligence test scores be focused at the FSIQ level or, if examined at the factor level, be done with extreme caution (Canivez, 2014). However, neuropsychologists and school and clinical psychologists routinely go beyond the FSIQ to look for strengths and weaknesses among a client's cognitive skills (Fiorello, Hale, McGrath, Ryan, & Quinn, 2002).
In the current Trinidad sample, there was a discrepancy among cognitive domains whereby the PSI score was significantly lower than the others. For both clinical and typical populations, as subtest or factor variability increases, there is less shared variance among the underlying domains/abilities when predicting the FSIQ (Fiorello et al., 2002; Fiorello et al., 2007; Hale et al., 2001). Although domain variability is expected in both clinical and non-clinical populations, it may be more likely in clinical populations, as individuals are often referred for specific neurocognitive weaknesses that yield an uneven IQ profile. Some studies have found evidence to support idiographic (individual) index interpretation over nomothetic (general) interpretation of a global FSIQ score for Specific Learning Disability (SLD), Attention-Deficit/Hyperactivity Disorder (ADHD), and Traumatic Brain Injury (TBI) populations (Fiorello et al., 2001; Fiorello et al., 2007; Hale et al., 2001). Hale et al. (2007) recommend that practitioners move beyond global IQ interpretation to methods of objective idiographic interpretation.
Compared with US normative and clinical samples with mixed diagnoses, lower processing-speed scores on the WISC-IV were observed in this sample and in the referred and non-referred samples utilized in Louison (2016). The FSIQ, or g construct, is a composite of all four domain scores, including processing speed; thus, if PSI scores in the Trinidad sample were low, FSIQ scores would be expected to be lower as well. Based on the work of Fiorello et al. (2001), Fiorello et al. (2007), and Hale et al. (2001), and on what is recommended in the WISC-IV manual, if domain scores are discrepant, interpretation should remain at the domain or composite level; as such, a FF model is likely more clinically relevant.
Additionally, the lower processing speed scores on the WISC-IV in this sample and in the referred and non-referred samples utilized in Louison (2016) may suggest that processing speed, at least as measured by this test, is not a good predictor of intellectual ability for Trinidadian children. This raises the question: are there other cognitive or intellectual strengths characteristic of Trinidadian children that the WISC-IV is not measuring? Are foreign-based tests adequately representing intellectual functioning? Examination of these questions is outside the scope of this study but is important for future research.
Although the WISC-IV four-factor model presented in the manual (Wechsler, 2003b) attempted to align more closely with modern CHC theory (Keith et al., 2006), it is only partially in accordance with the mainstream CHC model of intelligence (Golay et al., 2012; Lecerf et al., 2010). Some studies have examined the WISC-IV factor structure by testing five- and six-factor models that more closely align with CHC theory. Typically these studies have had access to the full set of 10 core plus 5 supplemental subtests of the WISC-IV and have mainly used normative samples. Only the 10 core subtests were used in the current study because, as with most clinical samples, only the mandatory tests are administered when the WISC-IV battery is used in practice. Among the studies that tested alternative models, Weiss et al. (2013) and Keith et al. (2006) tested the validity of a four- versus five-factor structure using the WISC-IV standardization sample and examined several models allowing for different cross-loadings. Keith et al. (2006) compared the four-factor IH WISC-IV structure (VCI, PRI, WMI, PSI) with a CHC five-factor model that split the PRI into two factors representing visual-spatial processing (Gv: Block Design and the supplemental subtest Picture Completion) and fluid reasoning (Gf: Matrix Reasoning, Picture Concepts, and the supplemental subtest Arithmetic). The authors argued that although the four-factor IH model fit the data well, the five-factor model showed better fit (Keith et al., 2006). Weiss et al. (2013) likewise split the PRI into Gv and Gf; the authors found that both models showed good fit to the data and were invariant across normative and clinical samples (Weiss et al., 2013). Lecerf et al. (2010) also examined alternative CHC-based configurations.

The most recent version of the WISC, the fifth edition (WISC-V; Wechsler, 2014), favors a five-factor IH model. Based on studies like those reviewed previously and on contemporary research on the utility of the CHC framework for conceptualizing intellectual abilities, the WISC-V splits the WISC-IV PRI into two domains, the Fluid Reasoning Index (Gf) and the Visual-Spatial Index (Gv; Wechsler, 2014). Additionally, new subtests were added in the revision for both of these indices.
Five- or six-factor models were not examined in this study, but may be considered for future research; a specification sketch follows below.
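Although not tested here, the five-factor split described above could be written in the same lavaan-style syntax used earlier. PCm (Picture Completion) and AR (Arithmetic) stand in for the supplemental subtests that were unavailable in this dataset; all names are hypothetical placeholders.

```python
# Hypothetical five-factor CHC specification (not tested in this study):
# the PRI is split into visual-spatial (Gv) and fluid reasoning (Gf)
# factors, with a higher-order g, following the split described above.
FIVE_FACTOR_DESC = """
Gc =~ SI + VO + CO
Gv =~ BD + PCm
Gf =~ MR + PCn + AR
Gsm =~ DS + LN
Gs =~ CD + SS
g =~ Gc + Gv + Gf + Gsm + Gs
"""
```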

MANOVA Results
Results of the MANOVA analyses indicated statistically significant mean differences on the five composite scores across the agency/practitioner groups.
Follow-up ANOVA results showed statistically significant group differences for each of the five composite scores. Additionally, post-hoc analyses indicated that the public agency group showed significantly lower composite-score means than all but one of the private practitioner groups. Differences between the public agency group and the private practice groups were not surprising: in Trinidad, persons who access public clinics for psychological services typically come from lower income households. Research has highlighted that SES can be related to IQ test performance; specifically, trends show that lower IQ scores can be linked with lower SES and vice versa (Weiss & Saklofske, 2020). Among the many possible reasons for these findings, one likely explanation is that parents with fewer financial means may access psychological services only when a child's needs are significant. An interesting finding from the post-hoc analyses was that for the PSI, or processing speed score, significant group differences emerged for only two of the six groups. This finding supports the observation that the PSI score is generally lower across groups compared with the other composite scores.
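As an illustration of this analytic sequence, the sketch below runs a MANOVA with follow-up Tukey comparisons in statsmodels on synthetic data; the column and group names are hypothetical stand-ins for the study's variables.

```python
# Illustrative MANOVA with follow-up post-hoc comparisons (synthetic data).
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
n = 261
df = pd.DataFrame({'agency': rng.choice(['public', 'privateA', 'privateB'], n)})
for score in ['FSIQ', 'VCI', 'PRI', 'WMI', 'PSI']:
    df[score] = rng.normal(85, 15, n).round()

# Omnibus MANOVA across the five composite scores
mv = MANOVA.from_formula('FSIQ + VCI + PRI + WMI + PSI ~ agency', data=df)
print(mv.mv_test())  # Wilks' lambda, Pillai's trace, etc.

# Follow-up: Tukey HSD pairwise group comparisons for each composite
for score in ['FSIQ', 'VCI', 'PRI', 'WMI', 'PSI']:
    print(score)
    print(pairwise_tukeyhsd(df[score], df['agency']))
```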

Limitations
Several limitations of this study should be considered. A random sample was not used. For most practitioners who provided data or allowed access to clinical files, the full list of eligible cases was selected for data extraction.
Once client data met the study requirements, cases were included in the database. Additionally, this was a clinical sample referred for a range of academic and other difficulties, and it is not guaranteed to represent non-clinical children between the ages of 6 and 16 in Trinidad. Cases were also sampled from private practices and one public agency located mainly in the north-west and north-central regions of Trinidad; sampling did not extend to other regions and did not include Tobago. Moreover, the majority of the sample came from private practices, which likely increases the percentage of the sample coming from homes where parents can afford to pay for services, which are often expensive in T&T. Without a non-clinical comparison group, it is uncertain whether the results of this study generalize to the larger T&T population. With more time and resources, a larger, more representative sample, involving data from clients across Trinidad and including Tobago, as well as both clinical and non-clinical groups, would provide more generalizable results. Access to a more representative clinical and non-clinical sample would also give researchers opportunities to develop norms specific to T&T.
Measures of socio-economic status (SES) were not readily available to be examined in this study; as such, there was no appropriate means to determine the impact of SES on the scores obtained. About 70% of the sample came from private practices.
Clients who accessed services from private practices were more likely to come from higher SES backgrounds. Even if the private practices offered pro bono or voluntary services, or clients sought services through employee benefits (e.g., Employee Assistance Programs), a large proportion of their clients were paying clients. Data sourced from the public agency were taken from clients who were not required to pay. As such, the sample and findings may not be representative of persons from lower SES backgrounds. If a larger sample were generated from public agencies and clinics, it would be interesting to explore whether the findings in this study replicate across SES groups. The implications of those results could inform diagnostic frameworks and intervention planning for more vulnerable subpopulations within T&T.
This study used archival data from various sources; therefore, the accuracy of administration and scoring procedures is assumed. Analyses were limited to the ten core subtests of the WISC-IV because data from the five supplemental subtests were not available, as is typical for referred samples. Including more subtests would have allowed more flexibility in the models tested; with all 15 subtests, a wider variety of model configurations could have been examined. Although the WISC-IV basic factor structure was replicated in this study, more research is needed to explore other configurations (e.g., Golay et al., 2013) that could better represent T&T WISC-IV data. The newest version of the WISC, the WISC-V, recommends interpretation based on a five-factor IH structure that aligns more strongly with the CHC theoretical framework. In this study, data from the older version of the WISC were examined because, at the time of data collection, public agencies in T&T were still widely using the fourth edition. Further research with the newer WISC-V is needed to explore whether the results of this study hold for other versions of the test.

Summary and Conclusion
US-based standardized tests of intelligence are commonly used in assessment in T&T. Wechsler scales are frequently used; however, published work on their psychometric properties and appropriateness for use with a T&T population is limited. The results of this study have significant implications for supporting the continued use of the WISC-IV or other Wechsler scales with this population, or for discerning whether different assessment approaches (e.g., response-to-intervention models) should be considered in practice and policy.
With the current Trinidad sample, although the factor structure recommended in the WISC-IV manual was acceptably replicated, a first-order four-factor configuration provided superior fit to the data. These findings suggest that with referred samples in Trinidad, interpretation of the cognitive or intellectual abilities measured by the WISC-IV might best occur at the index/composite score level.
Models that included a general factor showed adequate fit but were not the best based on fit indices and expectations of parsimony. The WISC-IV first-order factor structure may provide the best interpretive model for this sample due to observed variability among subtest and composite score means. In the current sample, processing-speed mean scores were significantly lower than the other scores. The WISC-IV manual suggests that the FSIQ/g factor is a less reliable estimate of intelligence when subtest and composite scores are discrepant, which may account for the CFA results observed in this study. Lower processing speed scores were also found in the dissertation by Louison (2016) with both clinical and non-clinical T&T samples. Together these findings may indicate a trend of lower processing-speed scores on the WISC-IV among individuals from T&T, which can depress overall FSIQ scores and lead to underestimation of overall intellectual functioning. More research into the WISC-IV processing-speed scores, and processing speed in general, may be warranted with a T&T population.
Factor analysis is a useful tool for informing how best to interpret relationships among subtests and for exploring the theoretical structure of an instrument; however, clinical utility should also be considered (Prifitera, Weiss, Saklofske, & Rolfhus, 2005; Weiss et al., 2013). Although the FF model does not fit closely with the three-stratum CHC model, that configuration may fit best with a referred Trinidad sample. This may be due to composite-score mean discrepancies, or because the WISC-IV may not accurately measure processing speed in the current sample of referred individuals.

Future Directions
Assessment involves a comprehensive, integrative process of data collection and information gathering. Results of assessment inform diagnosis, and are used to tailor intervention and appropriate supports to individual psychological, emotional, cognitive, and physical needs. Intelligence testing has remained an essential part of the assessment process in health and educational settings. Schools and health facilities use intelligence tests as part of the assessment process to gather information about individuals' cognitive functioning, and the results inform diagnostic and educational placement decisions. Thus, ongoing research is needed in the field of diagnostic testing to ensure that methods and procedures are accurate, valid, and culturally appropriate.
There is great need for continued research in intelligence testing in T&T.
Although it is important to consider alternative methods, these are beyond the scope of the current research and should be considered for future work. Future studies could employ more sophisticated procedures such as latent variable modeling, which combines a measurement model with a predictive model. One main goal of psychological testing, ideally, is to predict functioning or important outcomes.
Predictors to consider for a latent variable model include SEA performance and other academic outcomes. This study also highlighted that research on the impact of SES on IQ test performance is warranted.
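A minimal sketch of such a model is shown below, assuming semopy and a hypothetical observed SEA_score variable alongside the hypothetical subtest column names used earlier.

```python
# Latent variable (structural) model sketch: the four-factor measurement
# model plus a regression predicting an observed academic outcome.
# SEA_score and the subtest column names are hypothetical placeholders.
from semopy import Model

PREDICTION_DESC = """
VC =~ SI + VO + CO
PR =~ BD + PCn + MR
WM =~ DS + LN
PS =~ CD + SS
SEA_score ~ VC + PR + WM + PS
"""

model = Model(PREDICTION_DESC)
# model.fit(df)  # df would contain subtest scores plus observed SEA scores
```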
Additionally, qualitative research on what it means to be intelligent within this population and context is needed to provide evidence for content validity, and more research is needed on the content validity of the WISC with a T&T population. As with the development of the WISC and other intelligence tests, content validity could be examined through meaningful exploration of the relationship between the test content and the construct it is intended to measure. Differential item functioning or item-response theory analyses could be conducted in future studies to determine whether bias exists at the level of individual items rather than the whole test. It would be interesting to examine how a Trinidad sample would fare on individual items compared with a US sample. In addition to CFA, an alternative method would be to examine group mean differences using a matched US referred comparison sample; for this, factorial invariance methods could be applied.
Overall, there has been a paucity of research on the Wechsler scales and on intelligence test use and interpretation in T&T, and there is a clear need for more work in this area, particularly as special education and psychological practice in T&T continue to grow and develop.