RELATIONSHIPS BETWEEN GENDER, SOCIOECONOMIC STATUS, MATH ATTITUDES, AND MATH ACHIEVEMENT: AN INTERNATIONAL INVESTIGATION

Investigations into the factors related to math achievement have traditionally been conducted within individual countries, despite the existence of large international data sets available for analysis. This dissertation investigated the relationships among gender, socioeconomic status, math attitudes, and math achievement based on information from 50 participating countries in the Trends in International Mathematics and Science Study (TIMSS). Countries were grouped into clusters using hierarchical cluster analysis. Six cluster solutions were investigated based on average mathematics scores, average science scores, and average math attitude. The clusters were then validated on a separate sample using discriminant function analysis. The validation process utilized several country-level indicator variables, such as the Human Development Index, to ascertain the external validity of the cluster solutions. Multiple-group latent variable modeling was employed between clusters and within clusters to assess the nature and strength of the relationships between gender, socioeconomic status, math attitudes, and math achievement. The findings suggest that math self-confidence has a particularly strong relationship with math achievement, and that value of math has a particularly weak relationship with math achievement. Additionally, gender differences in math achievement appear to have disappeared or now favor female students, but male students report generally higher levels of math self-confidence. Among the implications discussed is the need to promote math self-confidence in education curricula and in teacher education.


Identifying and Validating Clusters of Countries in the TIMSS 2007 Data Set
In 2006 the Bush Administration released the American Competitiveness Initiative (Domestic Policy Council, 2006), in which a call was made for a renewed push to promote education in the subjects of science, technology, engineering, and mathematics (commonly referred to as STEM disciplines). According to data from international studies on academic performance, the United States has fallen behind in math and science (Koretz, 2009; Schmidt & McKnight, 1998), which is concerning for a country that has long self-identified as a world leader in education.
Education is one of the key indicators of a society's development and stability.
Education can reduce social and economic inequality at the individual level (Lott & Bullock, 2007), and lay a foundation for a country's social and economic development. As the world's workforce is increasingly globalized due to technological advances, education plays a key role in developing and/or maintaining a competitive advantage. In many discussions in the U.S., mathematics and science receive special emphasis for being particularly important to our country's future well-being (National Research Council, 2007).
One major influential factor in mathematics education is math attitude; in the United States, this is generally expressed as a positive correlation, with higher math attitude being indicative of higher math achievement (Harlow, Burkholder, & Morrow, 2002; Schreiber, 2002). This relationship has been investigated for decades (Aiken & Dreger, 1961; Anttonen, 1969), and continues to be of interest because attitude is a much more malleable variable than cognitive ability or background variables such as SES (Singh, Granville, & Dika, 2002).
As could be inferred from Bandura's self-efficacy theory (Bandura, 1997), differences in self-efficacy between boys and girls in math-related topics may explain much of the sex-based differences seen in math performance. As an example, Ethington (1992) demonstrated, for the 746 eighth-grade participants of the Second International Mathematics Study (SIMS), that a student's attitude toward math is a key component of the sex-based difference in performance: the value placed on math was more influential for boys, whereas indirect psychological influences such as math affect were more influential for girls. Additionally, Casey, Nuttall, and Pezaris (2001) demonstrated that math self-confidence significantly mediated the relationship between gender and performance in their sample of 187 eighth-grade students. Further, one study of 602 participants found that students who are disinterested in math lack the motivation to learn the subject, whereas those who are highly interested often challenge themselves by selecting more advanced math courses, which in turn leads to higher learning rates and a deeper understanding of concepts.
Historically, the majority of the research on the relationship between math attitude and math achievement has been conducted from an ethnocentric perspective; researchers may investigate the relationship between these two variables in individual countries (e.g., Ma, 1997; Papanastasiou & Zembylas, 2002), but little research has investigated the relationship from a cross-national perspective, even considering the wealth of data available for such analyses.
An example of such data is the Trends in International Mathematics and Science Study (TIMSS).
TIMSS is a recurring assessment of mathematics and science achievement for 4th and 8th grade students in participating countries. Initially conducted in 1995, the study has released four waves of data as of this writing; a fifth wave, TIMSS 2011, will become available for secondary analysis in January 2013. The purpose of the TIMSS is to provide an international view of mathematics and science achievement, which can then be used by educators and policy makers as a foundation for policy-relevant decisions. Prior to the public release of each wave of TIMSS data, a thorough report of the study's summary statistics is published, in which the performance of participating countries is discussed in broad terms of mean comparisons and benchmarking ratios (see Martin, Mullis, & Foy, 2008, for the TIMSS 2007 summary report). These summary reports are often used by mass media and policy makers to compare the performance of one country with that of other countries, or to illustrate how a country compares with the international mean.
As Koretz (2009) pointed out, such comparisons should be interpreted with caution. International averages are not constants, and tend to vary from assessment cycle to assessment cycle. Koretz argues that we should instead make comparisons with countries that are most similar to our own (i.e., Australia, Canada, and the U.S.) and with countries that consistently outperform our own (i.e., Japan and Singapore). Furthermore, certain advanced analysis methods such as multiple-group latent variable modeling (LVM), which can illustrate group differences in complex statistical models, can accommodate only a limited number of groups, and using all of the available countries from a data set like TIMSS as a grouping variable for such analyses would yield too many groups. However, if countries could be grouped according to similarity, or broken into separate clusters where each country was similar to other countries in its cluster and different from countries in other clusters, procedures such as multiple-group LVM could be applied.
The purpose of this study was to identify meaningful clusters of countries in the TIMSS 2007 8th grade data set. Because attitude is a strong predictor of achievement, attitude was used as a clustering variable in addition to achievement. The resulting clusters were validated using several external variables, which are discussed in detail in the methods section below.

Participants
The sample for this analysis included the 8th grade students from 48 of the 50 participating countries and territories in TIMSS 2007; sample characteristics in terms of sample size, average math and science achievement scores, and average math attitude scores are presented in Table 1.1. Only 48 countries were investigated because Mongolia and Morocco were excluded from the analysis due to sampling violations reported in Mullis, Martin, and Foy (2008).

INSERT TABLE 1.1 APPROXIMATELY HERE
The complex sampling design of studies like TIMSS mandates special consideration in analyses, specifically the use of what are known as sampling weights included in the data set. These weights account for the sampling design, take into account stratification and disproportionate sampling of subgroups, and include adjustments for non-response. The TOTWGT sample weight in TIMSS 2007 is the weighting variable used to calculate student population estimates within countries, and use of this variable ensures that subgroups are properly and proportionally represented in population estimates; using the TOTWGT variable inflates the sample size in the analysis to reflect the approximate size of the population (i.e., the total weight).
However, when making cross-national comparisons, TOTWGT may not be applicable because larger countries will be overrepresented in the analysis. For analyses in which countries should be weighted equally, the SENWGT sample weight is preferred. SENWGT, presumed to be an acronym for senate weight, is a transformation of TOTWGT which produces a weighted sample of 500 for each country; in this way, the SENWGT variable forces each country to have equal representation, hence the name senate weight. Because the current analysis is concerned with making cross-national comparisons, the SENWGT weighting variable was used.
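The senate-weight idea can be sketched in a few lines. The study's analyses were conducted in R and SPSS; the Python sketch below uses made-up weights, not actual TIMSS data, to show how a total weight can be rescaled so that every country's weights sum to 500:

```python
import pandas as pd

# Hypothetical student records; the TOTWGT values are made up for illustration
# and are not actual TIMSS weights.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "totwgt": [200.0, 300.0, 50.0, 150.0, 300.0],
})

# SENWGT-style rescaling: within each country, scale TOTWGT so that the
# weights sum to 500, giving every country equal representation.
df["senwgt"] = df.groupby("country")["totwgt"].transform(lambda w: 500 * w / w.sum())

print(df.groupby("country")["senwgt"].sum())
```

With real TIMSS data, the same transformation would be applied to the TOTWGT column of the merged student file; the actual SENWGT variable is already provided in the data set.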

Measures
Achievement variables.
TIMSS achievement variables are standardized with a mean of 500 and a standard deviation of 100. TIMSS provides scores for sub-topics within each subject: for math, sub-topics include algebra, geometry, and statistics; for science, sub-topics include physics, chemistry, and biology. The achievement variables used in this study are the overall math and science achievement scores provided by TIMSS. Measuring achievement in TIMSS is a complex undertaking: the study attempts to measure achievement over a broad variety of math and science topics, and in order to reduce time demands on each student, a complex matrix-sampling booklet design is implemented (Williams et al., 2009). This design requires that individual students respond to a relatively small number of items from the overall battery of assessment items. Item responses are then aggregated across all students to provide coverage of a wide range of content.
Because each student responds to only a selection of possible items, TIMSS utilizes an item response theory (IRT) scaling approach based on multiple imputation techniques to create a set of plausible values (Williams et al., 2009). Plausible values are essentially imputed scores based on a student's item responses in conjunction with background variables. Imputed scores based on limited information certainly contain some amount of error, and to account for this error, scores should be imputed multiple times; the result of each of these imputations is considered a plausible value, or a score that a given student could have received had the student answered all items in the TIMSS assessment. The TIMSS data set contains five plausible values per student for each achievement-related variable.
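The logic of working with plausible values can be sketched with simulated data. The Python snippet below is an illustration, not TIMSS code: it runs a simple analysis (estimating a mean) once per plausible value and then pools the results using Rubin's multiple-imputation rules, which add between-imputation variance to the average sampling variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5  # students, plausible values per student

# Simulated plausible values: five imputed achievement scores per student,
# drawn around the TIMSS scale mean of 500 (SD 100). These are made-up data.
pv = rng.normal(500, 100, size=(n, m))

# Run the analysis once per plausible value; here the "analysis" is simply
# estimating the mean and its sampling variance.
estimates = pv.mean(axis=0)
within = pv.var(axis=0, ddof=1) / n  # sampling variance of each estimate

# Pool with Rubin's multiple-imputation rules: the point estimate is the mean
# of the five estimates, and the total variance adds between-imputation
# variance (inflated by 1 + 1/m) to the average within-imputation variance.
pooled = estimates.mean()
between = estimates.var(ddof=1)
total_var = within.mean() + (1 + 1 / m) * between
pooled_se = total_var ** 0.5
print(pooled, pooled_se)
```

The pooled standard error is never smaller than the average sampling-only standard error, which is how the procedure accounts for imputation uncertainty.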
To accommodate the use of plausible values in any analysis, the analysis must be conducted once for each plausible value. The results of the separate analyses are then combined into a single result which includes parameter estimates and standard errors incorporating both sampling and imputation error.

Education Index
Also found in the Human Development Report, the Education Index is a composite of adult literacy rates and enrollment ratios at the primary, secondary, and tertiary levels of education. The Education Index is also on a rating scale between 0 and 1, with numbers closer to 1 indicating a higher level of education for the country.
For 2011, scores ranged between 0.627 (Ghana) and 0.993 (Australia), with one case of missing data (Georgia).
Gender Equality Index
The Gender Equality Index comes from the Gender Inequality Index within the Human Development Report. The Gender Inequality Index is a composite measure which represents a country's gender-based inequality in reproductive health, empowerment, and the labor market. The Gender Inequality Index is on a rating scale between 0 and 1, with numbers closer to 1 representing high levels of inequality. In order to have the scale coincide with the other scales in the analysis (i.e., high scores are more positive), this scale was rescaled by subtracting the provided score from 1, yielding scores such that numbers closer to 1 represent a high level of equality. Thus, the scale was renamed the Gender Equality Index for the purpose of this study.
In 2011, scores on the Gender Equality scale ranged from 0.354 (Saudi Arabia) to 0.951 (Sweden). Six territories lack a score on the Gender Equality Index: Taiwan, Hong Kong, Serbia, Bosnia and Herzegovina, Palestine, and Egypt.

Democracy Index
The Democracy Index "provides a snapshot of the state of democracy worldwide for 165 independent states and two territories…The overall Democracy index is based on five categories: electoral process and pluralism; civil liberties; the functioning of government; political participation; and political culture" (Economist Intelligence Unit, 2011, p. 1). The original Democracy Index scores are on a scale from 0 to 10, with higher scores indicating higher levels of democracy. These scores were divided by 10 to make the scale comparable with the other scales in the analysis (i.e., ranging from 0 to 1, with numbers closer to 1 indicating higher levels of democracy). Final scores on the Democracy Index ranged between 0.177 (Saudi Arabia) and 0.980 (Norway).

Economic Freedom Index
The Index of Economic Freedom is a joint venture between The Heritage Foundation and the Wall Street Journal. The index is a composite of ten components of economic freedom: property rights, freedom from corruption, fiscal freedom, government spending, business freedom, labor freedom, monetary freedom, trade freedom, investment freedom, and financial freedom. Each of these ten components is rated from 1 to 100, and the overall economic freedom score for a country is the average of these ten components. For the current study, the reported value for the Economic Freedom Index was divided by 100 to yield scores ranging from 0 to 1, with numbers closer to 1 indicating higher levels of economic freedom. Final scores on the Economic Freedom Index ranged between 0.421 (Iran) and 0.897 (Hong Kong), with one case of missing data (Palestine).

Freedom of the Press Index
The Press Freedom Index is a report measuring the treatment of journalists and media in countries (Reporters without Borders, 2012). The report is based upon a 40-item questionnaire which assesses the state of press freedom in each country. Scores on the Press Freedom Index range between -10.00 and 142.00, with smaller numbers indicating greater press-related freedom. For the purposes of the current study, the scores for this index were first subtracted from 150, yielding scores ranging from 8 to 160. These values were then divided by 160, yielding scores on a 0 to 1 scale, with numbers closer to 1 indicating higher levels of press-related freedom. Final scores for the Freedom of the Press Index ranged between 0.075 (Syria) and 1.00 (Norway).
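The rescaling steps described for each index can be collected in one place. The following Python sketch applies the transformations described above to made-up raw index values for a single hypothetical country:

```python
# Made-up raw index values for one hypothetical country; the transformations
# mirror those described in the text, putting every index on a 0-to-1 scale
# where higher values are more positive.
raw = {
    "education": 0.85,          # already on a 0-1 scale
    "gender_inequality": 0.25,  # 0-1, higher = more inequality
    "democracy": 7.9,           # 0-10 scale
    "economic_freedom": 72.0,   # 0-100 scale
    "press_freedom": 30.0,      # -10 to 142, lower = freer press
}

rescaled = {
    "education": raw["education"],                        # no change needed
    "gender_equality": 1 - raw["gender_inequality"],      # reverse the scale
    "democracy": raw["democracy"] / 10,
    "economic_freedom": raw["economic_freedom"] / 100,
    "press_freedom": (150 - raw["press_freedom"]) / 160,  # reverse, then rescale
}

assert all(0 <= v <= 1 for v in rescaled.values())
print(rescaled)
```

After these transformations, every validation variable lies on a common 0-to-1 scale with higher values more positive, which is what allows them to be used together in the discriminant analyses.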

Procedures
Hierarchical cluster analysis was conducted to identify groups of countries similar to each other yet different from other groups of countries. Punj and Stewart (1983) reviewed several methods of cluster analysis, concluding that Ward's (1963) clustering algorithm consistently performed best among hierarchical clustering techniques. Ward's algorithm forms mutually exclusive groups or clusters, starting with n clusters (i.e., one for each participant in the sample) and iteratively reducing the number of clusters by 1. At each stage a given unit is determined to either fit into an existing cluster or to form a new cluster with another unit, ultimately resulting in a single cluster.
Cluster analysis is inherently an exploratory procedure. It is for this reason that a key component of cluster analysis is the evaluation of the reliability (the degree to which the cluster solutions are consistent) and validity (the degree to which the cluster solutions are meaningful) of the resulting clusters. Evidence of reliability can be demonstrated by testing the structure of the cluster solutions on a separate sample, known as cross-validation (Sherman & Sheth, 1977). Validity can be demonstrated by assessing the identified clusters on variables other than those used for the cluster analysis (Punj & Stewart, 1983). In order to create a cross-validation sample, the TIMSS sample was split into two approximately equal halves using the random selection feature in SPSS 18.0. The initial cluster analysis was performed on an initial (model building) sample, and the subsequent validation analyses were performed on a second (cross-validation) sample. Descriptive statistics for the model building sample can be seen in Table 1. Because cluster analysis is exploratory and no single number of clusters is definitively correct, I decided to begin with the 4-cluster solution and proceed until the clustering solution produced multiple clusters that were substantially smaller than the others (i.e., one or two countries).
The initial cluster analysis for this study was conducted using the hclust function in R with its default settings. The cluster analysis was performed using Ward's clustering algorithm with squared Euclidean distance as the distance measure. The cluster analysis routine was performed six times, with the number of clusters set to a specific value between 4 and 9 for each analysis (i.e., once to obtain a 4-cluster solution, once for a 5-cluster solution, and so on). This provided six initial cluster solutions to explore during the initial model-testing phase.
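The general procedure can be reproduced in any environment with hierarchical clustering. The Python sketch below uses simulated country profiles (not real TIMSS values) and scipy's Ward method; note that scipy's implementation takes Euclidean input and minimizes within-cluster variance, which corresponds to the Ward criterion described above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Simulated country-level profiles (standardized math, science, and attitude
# scores); three loose groups are built in so the clustering has structure
# to recover. These are not real TIMSS values.
X = np.vstack([
    rng.normal([1.0, 1.0, -0.5], 0.2, size=(16, 3)),
    rng.normal([0.0, 0.0, 0.0], 0.2, size=(16, 3)),
    rng.normal([-1.0, -1.0, 0.5], 0.2, size=(16, 3)),
])

# Ward's clustering algorithm on the 48 simulated "countries".
Z = linkage(X, method="ward")

# Cut the tree at each cluster count from 4 to 9, paralleling the six
# cluster solutions examined in the study.
solutions = {k: fcluster(Z, t=k, criterion="maxclust") for k in range(4, 10)}
print({k: len(set(labels)) for k, labels in solutions.items()})
```

Because the tree is built once and then cut at different heights, the six solutions are nested: moving from the 9-cluster to the 4-cluster solution only ever merges existing clusters.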
In order to investigate the reliability and validity of the cluster solutions that resulted from the initial cluster analysis, the following processes were performed once for each cluster solution. First, countries were assigned a group identifier based on cluster membership from the initial cluster analysis. This identifier was used as the grouping variable for a discriminant function analysis (DFA) and a series of ANOVAs (using the default manova function in R and the Anova function from the car package, respectively).
DFA is a procedure in which several continuous independent variables are used to predict membership in a categorical grouping variable. For the DFA, cluster membership was included as the dependent variable and the validation variables discussed previously were included as independent variables. A common procedure associated with discriminant function analysis is the assessment of the predictive accuracy of a classification system.
Applying that purpose to this study, the cluster solutions identified in the initial cluster analysis could be considered reasonably accurate if countries are correctly classified at a rate that is greater than what could be expected by chance (Harlow, 2005, pp. 141-142).
Because cluster membership for the validation analyses was based on the results of the initial cluster analysis, a highly accurate comparison of predicted and actual cluster membership using the cross-validation sample and the validation variables outlined previously would provide strong evidence for the reliability and validity of the clusters. This analysis was carried out for each of the initial cluster solutions using the lda function in the MASS package in R.
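A minimal version of this classification check can be sketched as follows. The study used R's MASS::lda; the Python sketch below uses scikit-learn and simulated validation variables, and also computes a proportional chance criterion against which the observed accuracy can be judged.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)

# Simulated validation variables (e.g., two 0-1 country indices) for three
# well-separated clusters of 16 countries each; all values are made up.
means = np.array([[0.9, 0.8], [0.6, 0.5], [0.3, 0.2]])
X = np.vstack([rng.normal(m, 0.07, size=(16, 2)) for m in means])
y = np.repeat([0, 1, 2], 16)

# Predict cluster membership from the validation variables.
lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)

# Proportional chance criterion: the accuracy expected by chance alone is
# the sum of squared group proportions (1/3 here, with equal groups).
props = np.bincount(y) / len(y)
chance = float((props ** 2).sum())
print(accuracy, chance)
```

An accuracy well above the chance criterion is the kind of evidence the validation step looks for; in the real analysis the model would be fit on one half-sample and scored on the other.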

RESULTS
As previously described, the TIMSS 2007 data set was split into two roughly equally sized subsamples, a model building sample and a cross-validation sample. The hierarchical cluster analysis using Ward's (1963) clustering algorithm was conducted six times on the model building sample, with the number of clusters to investigate set between 4 and 9; this yielded six different cluster solutions. A dendrogram, a commonly used diagram for displaying clustering patterns from a cluster analysis, can be seen in Figure 1.1; the six initial cluster solutions are shown in Table 1.

General Cluster Solution Discussion
Cluster analysis cannot be considered complete until the resulting clusters have been assigned meaning (Punj & Stewart, 1983; Ward, 1963). As Tukey (1977) implies, a picture is worth a thousand words, and the following figures begin to illustrate some of the patterns in the clusters of countries. Figure 1.2 shows the countries arranged by cluster according to average math achievement score.

To begin the process of assigning meaning to the cluster solutions, we first consider 3 of the clusters that were most stable across all 6 of the cluster solutions. One of these stable clusters contains countries that are predominantly Christian, and several of them share cultural history (i.e., the U.S. and Australia were both once English colonies, while many of the remaining countries were once part of the Soviet Union). These countries are all political democracies, have high ratings for human development and education, and the citizens of most of these countries enjoy the highest ratings for gender equality, democracy, economic freedom, and freedom of the press.

The third consistently stable cluster contains Thailand, Jordan, Tunisia, Turkey, Bahrain, Iran, and Syria. Although this cluster eventually ended up being merged with another cluster in the 4-cluster solution, these countries represented their own independent cluster up until that point. These 7 countries are predominantly Islamic in religion, and most of them share geographic proximity, the exception being Thailand. This cluster of countries generally has very low ratings for democracy, human development, gender equality, and freedom of the press.

Now we will turn to the remaining clusters, which were not as consistent across cluster solutions. This is to be expected, since hierarchical cluster analysis is a process of combining similar groups with each other to the point of maximum inclusion. What this means for the current analysis is that we ended up with two mega-clusters.
These mega-clusters are clusters which, by the 5-cluster solution, had absorbed several other clusters over the course of the analysis.

In addition to the findings of the cluster analysis and the validation analysis, an interesting pattern in the relationship between math attitude and math achievement emerged. As previously stated, this relationship is generally expressed as a positive relationship at the individual level, with higher math attitude generally indicative of higher math achievement. However, when observed at the country level, the relationship is less clear; in fact, it becomes counterintuitive. As Figure 1.4 illustrates, at the country level, the relationship between math attitude and math achievement appears to be a fairly strong, negative correlation.

DISCUSSION
The purpose of this study was to investigate the TIMSS 2007 8th grade data set with the intent to identify meaningful clusters of countries. Six different cluster solutions (i.e., solutions containing between 4 and 9 clusters) were identified via cluster analysis, cross-validated using discriminant function analysis, and then externally validated on several additional variables of sociocultural and political interest. The findings of these analyses suggest guidelines for cluster membership, allowing future researchers to make international comparisons among countries that are considered similar to their own based on cluster membership.
Although the classification analyses lent support to each of the six initially identified cluster solutions in terms of the percentage of correctly classified countries, some considerations should be made. First, the cluster containing only Lebanon in the 9-cluster solution immediately disappeared and was absorbed by another cluster. Depending on the researcher's purpose, it may be advisable to simply begin with the 8-cluster solution rather than including a cluster consisting of only a single country. Additionally, the 4-cluster solution may be the point at which the clusters begin to be less meaningful, as the resulting 21-country cluster in this solution may be too large for any meaningful between-cluster comparisons. Based on this information, we would recommend the use of the 8-cluster, 7-cluster, 6-cluster, or 5-cluster solutions, each of which had high reliability in our analyses.
Which of these cluster solutions would be best for any given analysis depends on the nature of the research question, but it may also be influenced by the limits of technology. Although they had high predictive reliability in our analysis, the 7-cluster solution through the 4-cluster solution all contain at least one cluster that is too large for current technology to handle when conducting multiple-group LVM analyses, which may make the 8-cluster solution ideal for researchers seeking to employ those methods. This may not seem intuitive, since the 8-cluster solution had slightly lower predictive accuracy (75%) than the other cluster solutions, but considering the ratio of expected accurate predictions by chance alone (13.8%), the 8-cluster solution can still be considered quite accurate. However, if cluster size is not a concern for the researcher, the 5-cluster solution, having the highest predictive accuracy among the cluster solutions at 90%, may be a desirable choice for future investigations.
As was previously discussed, cluster analysis is an exploratory procedure. The purpose of the analysis is to identify any patterns of relationships among group members based on similarities within members of the same group and differences between members of other groups. The resulting clusters are largely based on the variables included in the cluster analysis and the variables used to validate the clusters.
As such, it is to be expected that other random samples taken from the TIMSS data set could yield slightly different cluster solutions. This could lead to several interesting additional investigations. Of particular interest would be a cluster analysis using the soon-to-be-available TIMSS 2011 data to see if the cluster solutions investigated here replicate in that data set. Similar analyses could be done on the previous versions of TIMSS as well, although earlier administrations of TIMSS had fewer participating countries which would influence the resulting cluster solutions.
Finally, it is worth discussing the negative relationship between math achievement and math attitude at the country level. First, it is important to remember that this does not mean that math attitude has a negative impact on math achievement at the individual level. It could, however, indicate significant cultural differences in the emphasis placed on math achievement, math attitude, or both. These cultural differences may be manifesting in the TIMSS 2007 data as Extreme Response Style (ERS), a confound in which groups differ not only in the attitude itself but also in their patterns of responding (Eid, Langeheine, & Diener, 2003; Morren, Gelissen, & Vermunt, 2012; Poortinga & van de Vijver, 1987). Another possible explanation is that this is an example of the Yule-Simpson Paradox, wherein a correlation evidenced in a number of groups disappears or reverses direction when the groups are combined. The international level of the comparison may be masking important subgroups within the sample, creating a situation in which the gestalt of the correlation is something different from the sum of its parts.
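A toy numeric example, with entirely made-up values, shows how the Yule-Simpson pattern can arise:

```python
import numpy as np

# Made-up (attitude, achievement) pairs for two hypothetical countries.
# Within each country the correlation is perfectly positive, but country B
# reports higher attitude alongside lower achievement than country A.
country_a = np.array([[1.0, 6.0], [2.0, 7.0], [3.0, 8.0]])
country_b = np.array([[7.0, 1.0], [8.0, 2.0], [9.0, 3.0]])

def corr(data):
    """Pearson correlation between the two columns."""
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

pooled = np.vstack([country_a, country_b])
print(corr(country_a), corr(country_b), corr(pooled))
```

Within each country the correlation is +1.0, yet the pooled correlation is negative, because the between-country pattern (higher attitude, lower achievement) dominates the pooled data, just as the country-level pattern in Figure 1.4 can coexist with positive individual-level correlations.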
At least part of the push for increased focus on STEM education has been driven by the findings of multinational assessments like the TIMSS (Trends in International Mathematics and Science Study) and PISA (Program for International Student Assessment), on which the United States has performed sluggishly for decades (Koretz, 2009; Schmidt & McKnight, 1998). Most of the findings from these studies are based on mean differences or regression analyses, and more complex evaluations have rarely been reported, possibly due to the complexity of the data. In the increasingly global communities and economies that are being developed, rigorous multivariate research needs to be conducted to examine these relationships on an international scale in order to understand them from a global and more overarching perspective.
Additionally, the underrepresentation of women in STEM disciplines has garnered considerable attention (Beede, Julian, Langdon, McKittrick, Khan, & Doms, 2011). One study found that attitude toward mathematics had an effect on course selection, and that female students were less enthusiastic about and less likely to enroll in advanced mathematics courses than their male counterparts.
Because STEM disciplines require an understanding of upper-level mathematics, female students who have shied away from increasingly challenging mathematics courses are less likely to have had the prerequisites for upper-level math.

The Relationship between Math Attitude and Math Achievement
The relationship between attitude toward math and math achievement is generally expressed as a positive correlation: the higher a student's attitude toward math, the better the student will do in math classes and on math assessments (Harlow, Burkholder, & Morrow, 2002; Schreiber, 2002). This relationship has been investigated for decades (see Aiken & Dreger, 1961, and Anttonen, 1969, for some early work), and continues to be important to consider because cognitive ability and background variables (such as SES) are difficult to change, whereas affective variables can be more easily targeted for intervention (Singh, Granville, & Dika, 2002).

The Relationships between Sex, Math Achievement and Math Attitude
Some early investigations into sex-based differences in math performance painted a stark picture; "huge sex differences" were reported (Benbow & Stanley, 1980), with females on the losing end of the comparison. However, as Rossi (1983) pointed out, effect sizes that account for less than 5% of the explained variance should not be characterized in terms such as "huge", "large", "substantial", or "sizeable". Indeed, even before 1980 researchers were characterizing sex differences in math performance as small (e.g., Fennema & Sherman, 1977), a position supported by many researchers then and now (Felson & Trudeau, 1991; Fennema, Peterson, Carpenter, & Lubinski, 1990; Joffe & Foxman, 1984; Lieu & Wilson, 2009; Lindberg, Hyde, Petersen, & Linn, 2010; Linn & Hyde, 1989; Rossi, 1983).
Sex-based differences in math achievement are likely not attributable to differences in ability. Rather, as could be inferred from Bandura's self-efficacy theory (Bandura, 1997), the differences in self-efficacy levels between boys and girls in math-related topics may explain much of the sex-based differences seen in math achievement, and there are examples in the literature that support this interpretation. Ethington (1992) demonstrated the importance of the student's attitude toward math as a key component in the sex-based difference in performance, finding that the value placed on math was more influential for boys, but that indirect psychological influences such as math affect were more influential for girls. Additionally, Casey, Nuttall, and Pezaris (2001) demonstrated that math self-confidence significantly mediated the relationship between gender and performance. Further, researchers have found that students who are disinterested in math lack the motivation to learn the subject, whereas those who are highly interested often challenge themselves by selecting more advanced math courses, which in turn leads to higher learning rates and a deeper understanding of concepts.
A recently conducted study (Duerr & Harlow, 2011) investigated the differences between sex, SES, math attitude, and math achievement using the 8th grade United States participant data from TIMSS 2007. The findings from that study were: 1) sex-based mean differences in the math achievement latent construct were not statistically significant; 2) sex-based path coefficient differences between the SES and math achievement constructs, and between the math attitude and math achievement constructs, were statistically significant but small in magnitude; 3) factor loadings for the math attitude construct differed depending on participant sex, with math affect and value of math loading more highly for female participants and math self-confidence loading more highly for male participants.

The Current Study
That sex-based differences in math achievement have receded in the United States is widely accepted by researchers. However, this is not necessarily the case for all countries. Furthermore, the relationship between math attitudes and achievement remains unclear, specifically whether these relationships differ across diverse cultures. As the world becomes more of a global community, it is time to examine topics like this from a multinational perspective.
The purpose of this study was to extend the work of Duerr and Harlow (2011) to a multinational scale. Using the TIMSS 2007 data and multiple-group latent variable modeling, relationships between gender, SES, math attitude, and math achievement were investigated between and within clusters of 44 participating countries from TIMSS 2007. It was hypothesized that the math attitude variables would have strong, positive relationships with math achievement across countries, but that the strength of the relationships would change depending on cluster membership and on the participant's country of origin.
Additionally, it was hypothesized that sex-based differences in math achievement would be small in magnitude, but that gender differences would be seen among the math attitude variables. Specifically, based on the findings of Ethington (1992) it was hypothesized that self-confidence and math affect would have a stronger relationship with math achievement for female participants than for male participants, while the value of math would have a stronger relationship with math achievement for male participants than for female participants. It was further hypothesized that the magnitude of these differences would change depending on the participant's country of origin.

Participants
The sample for this study comprises students from countries that participated in TIMSS 2007 at the 8th grade level; the sample sizes for each country, broken down by gender, are reported in the accompanying table.

Socioeconomic Status
In the TIMSS data sets, the number of books in the home is measured on a 5-point response scale, with the lowest response being "None or very few (0 to 10 books)" and the highest being "Three or more bookcases (over 200 books)." Parent education level is assessed on the International Standard Classification of Education (ISCED) scale, which ranges from 0 (no education) to 6 (doctoral or professional degree). One difficulty when using the ISCED scale to measure SES in the TIMSS data is that a number of participants either did not respond to the questions or selected the 8th option, "I don't know." This option was treated as nonresponse for the purposes of this analysis and is discussed in more detail below.

Gender
Gender is a dichotomous variable, with female participants coded as 1 and male participants coded as 2.

Procedure
This study investigates differences between and among clusters of countries through a multiple-group structural equation modeling (SEM) framework. As such, there are three components to the analysis that need to be discussed: 1) the latent variable model being investigated, 2) how the countries were separated into clusters, and 3) how the clusters were used to conduct the multiple-group analyses.
The Latent Variable Model.
The path diagram for the latent variable model under investigation is shown in the accompanying figure.

Identifying Clusters of Countries.
Multiple-group structural equation modeling is unable to incorporate a large number of groups, and could certainly not handle the number of countries included in this study. In order to make the multiple-group procedure possible, the countries in the TIMSS data set needed to be separated into smaller groups or clusters. To accomplish this, a hierarchical cluster analysis was conducted in a previous study (Duerr & Harlow, under review), which allowed for the identification of several clusters of countries. Clustering was based on country-level mean TIMSS math and science achievement scores and math attitude values.
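The clustering step can be illustrated with a minimal, self-contained sketch. The country names, profile values, and single-linkage merge rule below are all hypothetical simplifications; Duerr and Harlow's actual analysis used country-level TIMSS means and may have used a different linkage criterion.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two country profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(clusters, profiles, i, j):
    """Distance between clusters i and j: the closest pair of members."""
    return min(euclidean(profiles[a], profiles[b])
               for a in clusters[i] for b in clusters[j])

def hierarchical_cluster(profiles, n_clusters):
    """Agglomerative clustering: start with every country in its own
    cluster and repeatedly merge the two closest clusters."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters, profiles, i, j)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles: (mean math, mean science, mean math attitude)
profiles = {
    "A": (600, 590, 2.1), "B": (598, 585, 2.2),
    "C": (450, 460, 3.0), "D": (455, 465, 2.9),
}
print(hierarchical_cluster(profiles, 2))  # two high scorers, two low scorers
```

The high-performing pair and the low-performing pair end up in separate clusters, mirroring how countries with similar achievement and attitude profiles group together.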

Of the cluster solutions identified in Duerr and Harlow (under review), a variant of the 8-cluster solution was chosen for the present study for two reasons: each of this solution's clusters contains fewer than 10 countries, the point beyond which multiple-group analysis becomes infeasible (L. K. Muthén, 2011), and the clusters appeared conceptually or geographically cohesive. The cluster membership for this solution can be seen in Table 2.4. It should be noted that although this has been referred to as the 8-cluster solution in Duerr and Harlow (under review), the cluster comprised of Qatar and Ghana has been excluded due to the previously discussed missing data, yielding 7 clusters for the present analyses.

Between- and Within-Cluster Analyses.
To investigate the performance of the latent variable model across countries, a series of multiple-group comparisons was conducted. The first analysis utilized the data for all 44 countries in the study, with cluster membership acting as the grouping variable. This analysis provides an overall assessment of the model's performance at the international level, comparing each cluster of internally similar countries against the other clusters.
Following this initial multiple-group analysis, a series of additional multiple-group analyses was conducted for the three clusters deemed most salient for U.S. policymakers, providing an assessment of the model's performance within clusters. For these analyses, the grouping variable is country, so that countries identified as similar to one another are compared directly with each other.

Multiple-Group Analyses.
This study employs multiple-group structural equation modeling methodology and follows the procedures outlined in Kline (2011). Which fit indicators should be used in a structural modeling analysis has been the subject of much research (Bentler, 1990; Hu & Bentler, 1999; Steiger, 2000; Steiger & Lind, 1980), but there is some consensus regarding the chi-square test of model fit: due to chi-square's susceptibility to sample size and SEM's reliance on large samples, chi-square alone is generally accepted as a poor test of model fit. Several alternative fit indices have been proposed over the years, with some of the more popular choices being the Root Mean Squared Error of Approximation (RMSEA; Steiger & Lind, 1980) and the Comparative Fit Index (CFI; Bentler, 1990). When using RMSEA as a measure of model fit, values greater than 0.10 are considered to indicate poor fit, while RMSEA values of 0.08 and 0.05 represent acceptable and good fit, respectively (Browne & Cudeck, 1993). CFI values less than 0.90 are generally considered an indication of poor fit, whereas values of 0.95 and higher represent good fit (Hu & Bentler, 1999).

An additional complexity associated with large-scale assessments like TIMSS comes in the form of the sampling design, which employs stratified cluster sampling.
A thorough discussion of the sampling procedure for TIMSS assessments can be found in the TIMSS technical documentation, and a detailed summary can be found in Appendix 1.
A brief overview of the procedure is that schools are randomly selected according to certain regional characteristics, and then two classrooms are randomly selected from each chosen school. The students within these classrooms are then assessed, forming the sample. This sampling design introduces another source of error that must be acknowledged and accommodated in any subsequent analysis.
Accommodating the error associated with the sampling design is accomplished through the use of jackknife repeated replication (JRR). Using JRR, several replicates of the original sample are created and compared with the original sample; the variation between the estimate from the original sample and the estimates from the replicates becomes the jackknife estimate of sampling error for the statistic. These jackknife estimates allow for the creation of the replicate weights found in the TIMSS data set. These replicate weights must be used when estimating a given parameter, with one estimate for each of the 75 replicate weights and one estimate for the original sample, for a total of 76 estimates of the parameter (a detailed explanation of the analysis considerations, including a discussion of the source of the 76 estimates, can be found in Appendix 1).
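The JRR computation can be sketched as follows. The function and toy values are illustrative only, but the core formula (summing the squared deviations of each replicate estimate from the full-sample estimate) matches the form commonly used with TIMSS replicate weights; other surveys scale this sum differently.

```python
import math

def jrr_standard_error(full_estimate, replicate_estimates):
    """Jackknife repeated replication standard error: the square root of
    the summed squared deviations of the replicate estimates from the
    full-sample estimate."""
    variance = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    return math.sqrt(variance)

# Toy illustration: a full-sample mean plus a few replicate means.
# (A real TIMSS analysis would have 75 replicate estimates.)
full = 500.0
replicates = [500.4, 499.7, 500.1, 499.9]
print(round(jrr_standard_error(full, replicates), 3))
```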
A final methodological accommodation associated with using large-scale data is the use of sample weights. Weighting variables force a given country's sample to be representative of the country's population in terms of the background variables used for participant selection. Sampling weights for the TIMSS are values assigned to each participant based upon that participant's actual probability of having been selected for participation; this probability is based on the sample selection characteristics associated with the participant's school and classroom, which are known values due to the methodology used for sample selection.
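As a minimal illustration of how sampling weights enter an estimate, the sketch below computes a population mean from hypothetical student scores and weights; in practice the weight is the inverse of each student's selection probability.

```python
def weighted_mean(values, weights):
    """Population estimate from a sample where each student carries a
    sampling weight (the inverse of their selection probability)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical: two students from an oversampled stratum (small weights)
# and one student from an undersampled stratum (large weight).
scores = [520, 540, 460]
weights = [10, 10, 80]
print(weighted_mean(scores, weights))  # pulled toward the high-weight student
```

An unweighted mean of these scores would be about 507; the weighted estimate is much lower because the undersampled student stands in for many unsampled peers.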

Missing Data.
One issue encountered when using large-scale data sets is the inevitable existence of missing data, and the TIMSS data set is no exception to this. Missing data can take many forms, and the source of missingness will determine how it should be addressed.
In some cases, the responses may be missing because it was not asked of the participants; this is the case for parental education information for England and Scotland in TIMSS 2007. In this situation it is difficult to justify the use of any imputation procedure to reconcile the missing data, and the observations for those countries must be excluded from the analysis.
Another source of missing data for the TIMSS is that caused by the balanced incomplete block assessment, which asks each participant to complete only a selection of the items from the test bank. This is systematically missing data, which is accommodated through the jackknife repeated replication procedure discussed previously.
Finally, there is participant nonresponse, where the participant simply does not answer the question. This is not problematic for TIMSS achievement score data, because the calculation of plausible values accommodates this type of missingness.
However, this can be problematic for background variables, such as the questions regarding math attitude and socioeconomic status. Additionally, the "I don't know" response option for parental education level offers no more information than does leaving the question unanswered.
Although many different approaches for handling missing data have been advocated over the years, including mean substitution, nearest neighbor substitution, and listwise deletion, it is generally agreed that these methods are inadequate and lead to inaccurate parameter estimates. Fortunately, the increase in available computer processing power has led to new methods for accommodating missing data in an analysis; for structural modeling analyses involving categorical indicators, weighted least squares is the preferred method (B. Muthén, 1984; B. Muthén & Satorra, 1995; L. K. Muthén & Muthén, 2009), and it is the method employed in the current study.
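The weighted least squares principle can be sketched for the simple one-predictor case. Note that the WLSMV estimator Mplus actually applies to categorical indicators is considerably more elaborate (it weights by an estimated asymptotic covariance matrix), so this is only a conceptual illustration.

```python
def wls_slope(x, y, w):
    """Weighted least squares slope for a single predictor: minimize the
    weighted sum of squared residuals. With equal weights this reduces
    to ordinary least squares."""
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    num = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
    return num / den

# Equal weights on a perfectly linear toy series recovers the OLS slope.
print(wls_slope([1, 2, 3, 4], [2, 4, 6, 8], [1, 1, 1, 1]))  # -> 2.0
```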

Overview
The bulk of the present study is interpreted in terms of regression coefficients between latent variables. Regression coefficients provide a researcher with an effect size for a given relationship between two variables and, when interpreted in conjunction with their accompanying t-statistic, answer two specific questions: is the effect meaningful in terms of magnitude, and is it meaningful in terms of statistical significance? I have attempted to interpret the findings in this study from within this paradigm, but in order to do so I should clarify my decision-making process.
In terms of magnitude, it is often difficult to determine what qualifies as an important finding. Taking the inferred advice of Rossi (1983), effect sizes that explain less than 5% of the variance in math achievement research will be reported but dismissed as being inconsequential to the discussion. In terms of statistical significance, it should be acknowledged that the present study relies on very large sample sizes; even at the country level the samples consist of thousands of participants of both genders, in which case the sample sizes may promote statistical significance of even very small effects. For this reason, statistical significance will be reported at the p < 0.05, p < 0.01, and p < 0.001 levels, but p-values greater than 0.01 will not be discussed in great detail.
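The interplay between magnitude and significance can be made concrete: for a correlation r in a sample of size n, the test statistic is t = r * sqrt((n - 2) / (1 - r^2)), so with samples in the thousands even a correlation explaining under 0.5% of the variance clears conventional significance thresholds. The values below are hypothetical.

```python
import math

def t_from_r(r, n):
    """t statistic for testing whether a correlation r differs from
    zero in a sample of size n."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = 0.07
print(round(r ** 2, 4))             # variance explained: under 0.5%
print(round(t_from_r(r, 4000), 2))  # yet t is roughly 4.4 with n = 4000
```

This is exactly why the present study reports significance at several alpha levels but leans on variance explained when judging whether an effect is worth discussing.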
A final note on the interpretation of regression coefficients has to do with which coefficients to report. Several sources, including Kline (2011) and Pedhazur (1997), state that unstandardized coefficients are preferable to standardized coefficients when making comparisons across samples. However, Montgomery, Peck, and Vining (2006) state that standardized regression coefficients can be used, cautiously, to judge the magnitude of a given variable's effect relative to other variables in the model; the primary hazard of doing so manifests when subsequent samples have a different range of responses, which will affect the associated coefficients.
Based on this information, the present study reports both standardized and unstandardized regression coefficients. Unstandardized regression coefficients are discussed when effects are compared across groups (i.e., when assessing differences in the magnitude of the effect of self-confidence on math achievement between Australia and the United States), while standardized regression coefficients are interpreted when the magnitude of a variable's effect is being compared with the effects of other variables within the same group (i.e., when assessing differences in the magnitude of the effects for self-confidence and value of math in the United States).
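The relationship between the two kinds of coefficients is simple to state: a standardized coefficient is the unstandardized coefficient rescaled by the ratio of the predictor's and outcome's standard deviations, which is precisely why it shifts whenever a new sample has different spreads. A sketch with hypothetical values:

```python
def standardize(b_unstd, sd_x, sd_y):
    """Convert an unstandardized regression coefficient to a
    standardized one; the reverse conversion divides instead."""
    return b_unstd * sd_x / sd_y

# Hypothetical: b = 25 score points per unit of self-confidence,
# sd(self-confidence) = 0.8, sd(math score) = 100.
print(standardize(25, 0.8, 100))  # -> 0.2
```

A second sample with a wider spread of self-confidence responses would yield a different standardized value even if the raw 25-point effect were identical, which is why cross-group comparisons use the unstandardized form.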

Between-Clusters Analysis
The between-clusters analysis began by testing the model with all parameters freely estimated. The model was assessed for goodness of fit at this step by examining RMSEA and CFI values; the RMSEA of 0.056 and CFI of 0.94 both fall within generally accepted bounds for adequate model fit.
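The cutoff conventions described earlier can be expressed as a small decision function. The verdict labels, and the band between 0.08 and 0.10, are my shorthand; the cutoffs themselves follow Browne and Cudeck (1993) and Hu and Bentler (1999).

```python
def assess_fit(rmsea, cfi):
    """Classify model fit using conventional RMSEA and CFI cutoffs."""
    if rmsea <= 0.05:
        rmsea_verdict = "good"
    elif rmsea <= 0.08:
        rmsea_verdict = "acceptable"
    elif rmsea <= 0.10:
        rmsea_verdict = "mediocre"
    else:
        rmsea_verdict = "poor"

    if cfi >= 0.95:
        cfi_verdict = "good"
    elif cfi >= 0.90:
        cfi_verdict = "acceptable"
    else:
        cfi_verdict = "poor"
    return rmsea_verdict, cfi_verdict

# The freely estimated between-clusters model in this study:
print(assess_fit(0.056, 0.94))
```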

INSERT TABLE 2.5 APPROXIMATELY HERE
Constraints were then placed on factor loadings and regression coefficients, with the change in the CFI index (ΔCFI) reviewed to determine whether or not the models were invariant. Following recommendations in Cheung and Rensvold (2002), a ΔCFI value greater than 0.01 was considered to be evidence of a lack of invariance. The ΔCFI for this first step (0.014) is greater than the 0.01 threshold advocated by Cheung and Rensvold, suggesting a lack of invariance between the clusters of countries in this comparison; this leads to the conclusion that the model does not perform equally across the clusters. Because the only constraints on the model at this step were those placed on the factor loadings and regression weights, it can be concluded that the differences in model performance at the cluster level are attributable to one or more of these loadings and weights; that is, the regressions and loadings differ between the clusters of countries. The standardized factor loadings for the between-clusters analysis can be seen in Table 2.6, and the unstandardized regression weights are reported in a subsequent table.

One unexpected finding concerns positive math affect, which showed a negative relationship with math achievement in this analysis; previous research has shown that this relationship is positive in direction, and that students who enjoy learning about mathematics outperform those who do not (see Ryan & Deci, 2000; Lepper, Corpus, & Iyengar, 2005). There are statistical anomalies that could influence the directionality of this variable in the model. For instance, given the correlations between the latent variables (shown in Table 2.7), the argument could be made that multicollinearity among the latent constructs is the principal culprit; the correlation between Positive Math Affect and Math Self-Confidence is quite large at 0.878, which could cause the magnitude and directionality of relationships involving these variables to be unstable.
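The ΔCFI decision rule reduces to a one-line comparison, sketched here with the values from this study's first constrained step (the constrained-model CFI of 0.926 is implied by the reported baseline CFI of 0.94 and ΔCFI of 0.014).

```python
def invariance_lost(cfi_unconstrained, cfi_constrained, threshold=0.01):
    """Cheung & Rensvold (2002): a drop in CFI greater than 0.01 after
    constraining parameters is taken as evidence of non-invariance."""
    return (cfi_unconstrained - cfi_constrained) > threshold

# This study's between-clusters step: CFI dropped by 0.014.
print(invariance_lost(0.940, 0.926))  # -> True (invariance rejected)
```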

INSERT TABLE 2.7 APPROXIMATELY HERE
Gender's relationship with math achievement, while statistically significant across clusters, was consistently small in magnitude and explained less than 1% of the variance in most clusters. This finding is consistent with much of the contemporary research on gender differences in mathematics achievement, including a comprehensive meta-analysis on the topic by Lindberg, Hyde, Petersen, and Linn (2010). Potential international differences regarding gender and math attitude, however, have not been studied in great detail. In the present study's between-clusters analysis, the relationship between gender and math attitude is minute in most cases, although there are a few notable exceptions. For example, the relationship is considerably larger for Cluster 4's countries, accounting for 7.5% of the variance in value of math and 24% of the variance in math self-confidence. Gender also appears more strongly linked with math attitude in Cluster 1 countries (Asia), accounting for 3% (value of math) to 12% (math self-confidence) of the variance.

Within-Clusters Analyses
The within-clusters analyses consist of three separate multiple-group latent variable analyses. Once again based on the unstandardized regression coefficients (see Tables 2.8 through 2.13 for the standardized and unstandardized coefficients for these three clusters of countries), the conclusion can be drawn that, of the variables included in the model, math self-confidence had the largest relationship with math achievement for nearly all countries within these clusters; the magnitude of this relationship is considerably larger than that for nearly any other variable in the model, including gender and SES. Also of note is that Value of Math is significantly related to math achievement at the α = 0.01 level for only two countries (Czech Republic and South Korea) across all three analyses. Finally, positive math affect again shows a negative relationship with math achievement, but this appears to be the result of multicollinearity in the model; the correlation between the math self-confidence and positive math affect latent constructs is very high (see Tables 2.14 through 2.16).
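The instability introduced by such a high correlation can be quantified with the variance inflation factor implied by a correlation between two predictors, 1 / (1 - r^2). This two-predictor formula is a simplification of the latent-variable case, but it conveys the scale of the problem.

```python
def variance_inflation_factor(r):
    """VIF implied by a correlation r between two predictors: the factor
    by which each coefficient's variance is inflated by their overlap."""
    return 1 / (1 - r ** 2)

# The 0.878 correlation between Math Self-Confidence and Positive Math
# Affect inflates coefficient variance by a factor of roughly 4.4,
# making sign flips like the one observed here plausible.
print(round(variance_inflation_factor(0.878), 2))
```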
INSERT TABLES 2.8 THROUGH 2.16 APPROXIMATELY HERE

There were exceptions, however: in some countries the link between Math Self-Confidence and Math Achievement was much smaller (though still statistically significant at the α < 0.001 level) than in the other countries in their respective clusters.
The role of gender in the model differed by country within each cluster. For many countries, gender was not statistically significantly related to Math Achievement, and for most of those where gender was statistically significantly related to Math Achievement, the magnitude of the effect was trivial and accounted for less than 1% of the variance. There were some countries wherein this was not the case, however. In Australia, for example, the relationship between gender and Math Achievement was both statistically significant and large in magnitude (b = -0.841, p < 0.001) in favor of female participants; a similar relationship can be seen in Armenia (b = -0.643, p < 0.001).
Another important finding for this analysis can be seen in the relationship between gender and the math attitude variables. Although the strength and directionality of the relationship does differ depending on country, the relationship is either not statistically significant or the magnitude of this relationship is very small, accounting for less than 1% of the variance, in all countries except Chinese Taipei, Hong Kong, and Australia. In each of these three countries gender has a statistically significant and relatively strong relationship with all of the math attitude variables, and the relationship favors male participants.

DISCUSSION
The purpose of this study has been to investigate the relationships between gender, socioeconomic status, math attitudes, and math achievement from an international perspective by applying multiple-group latent variable modeling techniques to clusters of countries within the TIMSS 2007 data set. The following sections contain some general conclusions that can tentatively be made based on the results of these analyses.

The Relationships between Math Attitude and Math Achievement
First, the role of self-confidence is clearly shown to be an important one when it comes to math achievement; the self-confidence latent variable was the most influential variable in the model for all of the clusters in the between-clusters analysis, and for almost all of the countries in the within-clusters analyses. Simply stated, students who performed highly relative to their peers on the TIMSS 2007 math assessment were also consistently more likely to endorse feeling confident in their ability to perform well on math-oriented tasks. This was the case between clusters as well as between countries.
The importance of self-confidence for math achievement has been well established (Ethington, 1992;Ethington & Wolfle, 1984;Ganley & Vasilyeva, 2011;Lloyd, Walsh, & Yailagh, 2005;Nosek & Smyth, 2011). The findings of the present study, in addition to what has been found in previous literature for nearly 30 years, suggest that educators should be placing a considerable amount of emphasis on building math self-confidence as math is being taught in the classroom. The question is whether or not this information has been integrated into the primary and secondary education systems through pre-service teacher education programs and through published curricula in math education; this is a topic that should be investigated by future research. Research on math achievement has tended to focus on sex-based differences, and through the course of those efforts several content analyses were conducted and used to make the case that math curricula were potentially unfairly biased against girls (Sadker & Sadker, 1994;Spender & Sarah, 1980). It would be beneficial to the field of math education to make similar inquiries into the content of current curricula to determine whether or not proper attention is being paid to developing self-confidence in learning math for both boys and girls.
A second conclusion that can be drawn from the present analyses has to do with the relationship between the value students place on math and math achievement. In this study, student value of math generally did not have a statistically significant relationship with math achievement, and even where the relationship was statistically significant it was very small. This raises several questions. For instance, how much of the student response for value of math reflects the way students themselves value math, as opposed to the values placed on math by significant adults in their lives (e.g., relatives and/or teachers)?
Finally, the relationship between positive math affect and math achievement was difficult to interpret in this study, largely due to the high correlation between positive math affect and math self-confidence; the correlation was large enough that the two latent variables may well be measuring the same construct in this data set. Although the measurement items used to represent math attitude all loaded onto separate factors during a preliminary factor analysis, their use in the current model had unforeseen effects on the covariance matrix.

The Impact of Gender on Math Attitude and Math Achievement
Gender's link with math achievement has been investigated for decades, with several researchers (e.g., Lieu & Wilson, 2009; Lindberg, Hyde, Petersen, & Linn, 2010) having reached the conclusion that any meaningful gender differences have disappeared, at least in the United States. The findings from this study support such conclusions and extend them to most of the countries that were investigated; either gender's relationship with math achievement is not statistically significant, or it is of negligible practical importance. Of particular interest is the finding that, in the TIMSS 2007 data, the gender gap favors female participants in those countries with large and statistically significant effects. Whether this role-reversal in the math achievement gender gap will result in a similar reversal in the emphasis of research and political discussion remains to be seen, though some preliminary evidence would support such a claim; researchers in a handful of countries have already begun investigating the "boy turn" (Weaver-Hightower, 2003). The "boy turn" is a phrase referencing a shift of attention in gender-effects research to boys (i.e., "now it's the boys' turn to be looked at", or "now it's time to turn our attention to the boys").
Gender's relationship with math attitude is similar to its relationship with math achievement: the relationship is most commonly small in magnitude and is often not statistically significant. What is particularly interesting is the apparent reversal of direction in the relationship: in countries where gender is a significant and potentially meaningful predictor of math attitude, male participants display more positive attitudes toward math than female participants. However, the number of countries for which this is the case (i.e., the relationship is both statistically significant and has a large magnitude of effect) is limited to three: Chinese Taipei, Hong Kong, and Australia. Why gender has a stronger relationship with math attitude for only these territories is unknown, though the status of both Hong Kong and Australia as former British colonies, and of Chinese Taipei as a dissenting territory within the People's Republic of China could hold some information as to the reason for the relationship; perhaps the strength of this relationship is in some way a residual effect in these communities as they search for their sense of national identity.

Limitations of the Current Study and Suggestions for Future Research
There are naturally limitations to this study, many of which are tied to the nature of the data. Because this is a secondary analysis of public-use data, the variables included in the model are necessarily limited to those available in the data. The present analysis has suffered from this in at least one major way: the apparent overlap (represented by very high correlations) between the indicator variables used to create the math self-confidence and positive math affect variables. As pointed out earlier, these two constructs are supposed to be different from each other, but they may very well be close to measuring the same thing.
There are currently no mathematical or methodological solutions to the presence of multicollinearity in structural models, and conventional wisdom would suggest that the variables should either be merged into a single variable or one of the two should be dropped from the analysis. This may be the impetus for a follow-up study, wherein only the effects of math self-confidence are analyzed in the model, since the other two attitude variables either suffered from multicollinearity (math affect) or were not consistently significantly influential in the model (value of math). Alternatively, the model could be tested using each of the three attitude variables separately, though this would have the negative side effect of not being able to compare the strength of the resulting regression coefficients with each other because the models would not be accounting for the shared variance among the attitude variables in their relationships with math achievement.
Another limitation for these analyses is the use of categorical indicators for the majority of the independent latent constructs. Although Mplus can accommodate the use of categorical variables through the application of the weighted least squares estimation method, the resulting interpretations are not entirely clear. For example, it was observed that math self-confidence is a significant predictor for math achievement, but the regression coefficients associated with that relationship cannot be interpreted as though they are continuous. Indeed, the only thing that can really be said about the relationship is that higher self-confidence is associated with higher achievement, and the magnitude of the relationship becomes difficult to interpret in a quantitative sense.
Each of the previous limitations shares a common theme: the measurement of the attitude variables has impacted the quality of the analysis and the ability to interpret the results. It would be in the best interests of education researchers, therefore, to develop methods of attitude measurement that do not suffer from these issues. There is some evidence that the Implicit Association Test, or IAT (Greenwald, McGhee, & Schwartz, 1998) can be used to fill this need. The IAT has successfully been used by Nosek and Smyth (2011) as an accurate measure of the cognitive processes associated with math attitudes. Use of the IAT would circumvent many of the biases associated with self-report assessments; in the case of math attitudes, it may help to remove the salience of parental and/or teacher expectation from the participant's responses, yielding a more accurate reporting of positive math affect and value of math in particular.
Yet another limitation of this study is that, despite its methodological complexity, the TIMSS data set is still cross-sectional in nature. Although four TIMSS assessment points are currently available (with a fifth available in January 2013), using these data sets in conjunction with one another would only qualify as longitudinal at the country level, because the students assessed differ from cycle to cycle. Therefore, causal inferences cannot be included in the final conclusions of this or any current TIMSS-related study. Instead, the findings of this study should be used to inspire future research, ideally longitudinal in nature, to fully investigate the effects of math attitude on math achievement.
The final limitation I would like to acknowledge is that this analysis is contingent upon the model being tested. Regardless of the relatively good fit between the model and the TIMSS 2007 data, in the end it is still simply a model of the relationships among the latent constructs, and it is not necessarily the best model. Other models, those that include other variables from the TIMSS data set, or perhaps exclude variables included here, may perform equally well or even better. It is also reasonable to expect that the presented model is incomplete owing to variables of interest that are not available within the TIMSS data, such as the value parents place on math and/or education. Researchers who utilize statistics, and particularly those who employ modeling methods, should remember the fundamental lesson of Magritte's La Trahison des Images: the model is simply a conceptual representation, not the physical manifestation of any kind of Truth.

TIMSS Data Collection Procedures
As an international assessment of mathematics and science achievement with dozens of participating countries and hundreds of thousands of student participants, TIMSS data collection is a very complicated process. The following is a general overview of those procedures; significantly more detailed information on the TIMSS data collection process can be found in Williams et al. (2009).
The populations of interest for the TIMSS are all fourth and eighth grade students in the participating countries. The sampling techniques TIMSS employs aim to produce country-level samples that are representative of the country's population.
Because my investigations all focus on the eighth grade data, I will discuss the sampling procedures only for the eighth grade, noting that the procedures are generally the same at the fourth grade level. Additionally, for the sake of simplicity, I will discuss the sampling procedure as it relates to the United States, with the acknowledgement that this is the IEA-mandated sampling procedure used by all countries.

Sample Selection
TIMSS data collection begins with a two-stage sampling process: first, a sample of schools within the country is selected, and then a sample of classrooms within each selected school. To select participating schools, schools were randomly selected from each stratum. As schools were identified, first and second alternatives were selected as well, based on the primary selection's position in the sampling frame; the school below the selected school was identified as the first alternative, and the school above it as the second alternative. Schools were selected within each stratum using probability proportional to size (PPS) sampling until the target measure of size (MOS) for the stratum was met or exceeded. Alternative schools were invited to participate only if the primary selection declined.
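The school-selection stage described above can be sketched as follows. This is an illustrative simplification under stated assumptions (a frame of schools each carrying a `mos` field, systematic PPS selection, neighbors in the frame as alternatives), not the operational TIMSS sampling software:

```python
import random

def pps_systematic_sample(schools, n_sample):
    """Select n_sample schools with probability proportional to size (PPS)
    via systematic sampling over the stratum's sampling frame.

    `schools` is a list of dicts, each with a "mos" (measure of size) entry.
    The school after a primary selection in the frame serves as its first
    alternative; the school before it serves as its second alternative.
    """
    total_mos = sum(s["mos"] for s in schools)
    interval = total_mos / n_sample            # sampling interval on the MOS scale
    next_hit = random.uniform(0, interval)     # random start within first interval
    selections = []
    cumulative = 0.0
    for i, school in enumerate(schools):
        cumulative += school["mos"]
        # a school is hit whenever its cumulative MOS passes the next sample point
        while cumulative >= next_hit and len(selections) < n_sample:
            selections.append({
                "primary": school,
                "first_alt": schools[i + 1] if i + 1 < len(schools) else None,
                "second_alt": schools[i - 1] if i > 0 else None,
            })
            next_hit += interval
    return selections
```

Because the sample points advance by a fixed MOS interval, larger schools span more of the cumulative scale and are proportionally more likely to be hit, which is the defining property of PPS selection.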
Once schools were identified for participation and agreed to participate, a second sampling frame was created, consisting of all eligible classrooms within the school. All classrooms within the school had an equal probability of being selected for participation, and two classrooms were selected within each school. In return for participating in the study, schools were given an all-in-one printer, and students were given a clock-compass carabiner.

Assessment Administration Procedures
The TIMSS administration procedures are also complex. In order to reduce the time required to complete the TIMSS assessment while still adequately covering a breadth of subject matter, balanced incomplete block (BIB) spiraling of assessment items was employed; the BIB design is manifested in the assessment booklets used. There are notable benefits to using BIB assessment and plausible values. First, the burden of assessment is reduced; students are asked only a sample of questions rather than being subjected to a battery of all possible questions, which translates into less time for the participant as well as a cost reduction for the overall assessment. Second, the use of plausible values to estimate scale scores provides more accurate population-level estimates of average performance and variability than procedures that utilize a single score for each student (Beaton & González, 1995; Olson, Martin, & Mullis, 2008). There is, however, a tradeoff to the use of plausible values: because they are drawn from the estimated distribution of a student's achievement, plausible values are not an accurate or valid measure of any specific individual student's achievement.
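The rotation idea behind booklet spiraling can be illustrated with a minimal linked-booklet scheme. This is a deliberately simplified sketch (hypothetical function name, two blocks per booklet), not the actual TIMSS booklet layout; it shows only the core property that each item block appears in the same number of booklets and that booklets share blocks, which is what allows all items to be placed on a common scale:

```python
def rotated_booklets(n_blocks, blocks_per_booklet=2):
    """Build a cyclic rotated-booklet design: booklet i contains blocks
    i, i+1, ... (mod n_blocks), so every block appears in exactly
    blocks_per_booklet booklets and adjacent booklets overlap."""
    return [
        [(i + j) % n_blocks for j in range(blocks_per_booklet)]
        for i in range(n_blocks)
    ]
```

With this design, a student answers only the items in one booklet, yet every pair of adjacent blocks is observed together in some booklet, linking the whole item pool.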

TIMSS Analysis Considerations
The complexity of the TIMSS sampling design necessitates added complexity for any analysis using the data set. This complexity comes in two forms: adequately adjusting the sample such that it is representative of the population, and accurately estimating the error introduced by the sampling design.
As was previously discussed, TIMSS assessments utilize a two-stage sampling procedure, with the result being a stratified sampling frame based on several categorical variables and the participating school's enrollment. The construction of the sampling frame allows TIMSS researchers to create weighting variables which, when properly employed, force the data to be representative of the country's population. The TIMSS weighting variables are calculated for each individual participant based on their sampling frame characteristics. As detailed in Foy and Olson (2009, pp. 102-105), the probability of any given student within a country being selected for the sample is known because students were selected using the probability sampling method previously described. Therefore, sampling weights can be constructed for each student by taking the inverse of the probability of selection. The use of sampling weights accommodates the complex sampling design by accounting for stratification and any disproportional subgroup sampling; the TIMSS sampling weights also include adjustments for non-response.
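The inverse-probability construction can be written in a few lines. This is a minimal sketch with hypothetical names, assuming the two-stage design described above (school selection, then classroom selection within school) and an optional non-response adjustment factor:

```python
def sampling_weight(p_school, p_class, nonresponse_adj=1.0):
    """Base weight for a student: the inverse of the joint probability of
    selection through both sampling stages, times a non-response adjustment."""
    return (1.0 / (p_school * p_class)) * nonresponse_adj
```

For example, a student whose school had a 1-in-10 chance of selection and whose classroom had a 1-in-2 chance represents roughly 20 students in the population.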
The TIMSS data set includes several sample weight variables, each of which has a different purpose. Of primary interest are the TOTWGT (total weight) and SENWGT (senate weight) variables. TOTWGT is the variable name assigned for the sample weight previously described; when used, TOTWGT will ensure that the subgroups within the stratified sample are proportionally represented in population estimates. TOTWGT should be used whenever student-level population estimates within a country are desired.
When properly applied, the TOTWGT variable will inflate the sample size for a given country to approximately the size of the grade-appropriate population for that country (i.e., TOTWGT would inflate the 7,377 U.S. eighth grade participants to approximately the size of the entire eighth grade population). This creates a problem when comparisons between countries are desired; countries with larger populations have more students than countries with smaller populations. This difficulty is accommodated by using the SENWGT weighting variable. SENWGT is a transformation of TOTWGT that forces each country to have a weighted sample size of 500. For analyses where comparisons are being made between countries, the SENWGT variable should be used rather than TOTWGT to allow for an equitable assessment.
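The TOTWGT-to-SENWGT transformation is a simple rescaling, sketched below with a hypothetical function name; each country's weights are multiplied by a constant so that they sum to 500:

```python
def senate_weights(totwgt):
    """Rescale a country's TOTWGT values so the weighted sample size is 500
    (the SENWGT transformation described in the text)."""
    total = sum(totwgt)
    return [w * 500.0 / total for w in totwgt]
```

Because every country is rescaled to the same weighted total, no country dominates a cross-country analysis simply by virtue of its population size.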
The other analysis consideration when using TIMSS data is the proper estimation of the error introduced by the sampling design. There are two forms of error to consider: sampling error caused by the stratified sampling procedure, and imputation error caused by the use of plausible values. To accommodate the sampling error, the jackknife repeated replication (JRR) technique is employed. In JRR, pairs of schools are systematically assigned to sampling zones, creating pseudo-replicates of the original sample; for TIMSS, 75 pseudo-replicates were created. The statistic of interest is calculated once for the overall sample and again for each pseudo-replicate. The variation between the estimate for the original sample and the estimates for the jackknife replicates is the jackknife estimate of the sampling error for the statistic. The 75 jackknife estimates of sampling error were then used to create 75 replicate weights. Adequately accommodating the error introduced by the stratified sampling procedure, therefore, involves estimating a parameter 76 times: once for the original sample and once for each replicate weight.
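The JRR computation can be sketched as follows. This is a simplified illustration with hypothetical names: `estimate_fn` stands for whatever statistic is being computed (a weighted mean, a regression coefficient, etc.), and the sampling variance is accumulated as the sum of squared deviations of the replicate estimates from the full-sample estimate:

```python
def jrr_sampling_variance(estimate_fn, data, replicate_weights, full_weight):
    """Jackknife repeated replication: compute the statistic once with the
    full sample weights and once per replicate weight (76 runs for TIMSS);
    the summed squared deviations estimate the sampling variance."""
    theta_full = estimate_fn(data, full_weight)
    variance = sum(
        (estimate_fn(data, rw) - theta_full) ** 2 for rw in replicate_weights
    )
    return theta_full, variance
```

In each TIMSS replicate weight, one school of a sampling-zone pair is dropped and its partner's weight doubled, so each replicate perturbs exactly one zone.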
This brings up the point of accommodating the plausible values in the analysis. Because plausible values are imputations rather than actual observed scores, there is error associated with the imputation. As stated by Williams et al. (2009), averaging the plausible values and using the resulting mean to calculate a parameter estimate would underestimate the standard error associated with the subsequent analysis. Therefore, the imputation error is accommodated by calculating a given statistic once for each plausible value and then averaging the results across the five analyses.
The accommodation of both sampling error and imputation error would ideally result in the calculation of a given parameter estimate 76 times for each of the five plausible values (once for the overall sample and once for each jackknife replicate weight), yielding 380 analyses to be averaged for an accurate parameter estimate.
However, a shortcut is available: accommodate the sampling error by estimating the parameter once for each of the 75 replicate weights using only the first plausible value, and then accommodate the imputation error by estimating the parameter once for each of the five plausible values, computing the parameter estimate a total of 80 times.
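The 80-run shortcut can be sketched as a single routine. This is an illustrative simplification with hypothetical names, assuming the standard combination rule for plausible values (imputation variance from the spread of the five plausible-value estimates, with a (1 + 1/M) finite-imputation correction); consult the TIMSS technical documentation for the exact operational formulas:

```python
def timss_estimate(estimate_fn, data, pvs, replicate_weights, full_weight):
    """Shortcut estimation (80 runs): sampling variance from the first
    plausible value across the replicate weights, imputation variance from
    the spread of the plausible-value estimates."""
    # 5 runs: one estimate per plausible value, full sample weight
    pv_estimates = [estimate_fn(data, pv, full_weight) for pv in pvs]
    point = sum(pv_estimates) / len(pv_estimates)

    # 75 runs: sampling variance using only the first plausible value
    theta1 = pv_estimates[0]
    sampling_var = sum(
        (estimate_fn(data, pvs[0], rw) - theta1) ** 2 for rw in replicate_weights
    )

    # imputation variance across the M plausible values, with (1 + 1/M) correction
    m = len(pvs)
    imput_var = sum((e - point) ** 2 for e in pv_estimates) / (m - 1)
    total_var = sampling_var + (1 + 1 / m) * imput_var
    return point, total_var ** 0.5  # point estimate and standard error
```

With 75 replicate weights and 5 plausible values, the function above calls `estimate_fn` exactly 80 times, matching the count described in the text.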

Analysis Software
As can be seen in the previous sections, analyses of the TIMSS data have some inherently complex considerations associated with them which are not common to most data analysis endeavors. Fortunately, data analysts have some support in overcoming these complexities in the form of the analysis software available. First and foremost, the IEA has created and made available a database analysis package called the International Database Analyzer (IDB Analyzer). The IDB Analyzer is a stand-alone application that generates SPSS syntax, which can then be used to analyze data from the IEA assessments, including TIMSS. The SPSS syntax generated by the IDB analyzer accounts for the complex sampling design associated with these studies by properly implementing the use of the weighting variables and plausible values.
Although the IDB Analyzer makes some analyses simpler to perform, the range of options it offers is limited. An analyst can calculate percentages, means, correlations, and percentiles, and can perform regression analyses, but that is all. Even within these options, there are limitations to what the IDB Analyzer will do. For example, only one variable with plausible values can be used in any given analysis, so analyses comparing two plausible values-based variables, such as math achievement with science achievement, are not possible. Additionally, multivariate methods more advanced than multiple linear regression are also not possible through the IDB Analyzer.
Fortunately, the developers of certain advanced analysis software have begun to incorporate methods for analyzing complex data; both LISREL and Mplus have accommodated the use of sampling weights and imputation through plausible values for SEM techniques, and HLM and Mplus can both accommodate these complexities when conducting hierarchical linear modeling (HLM).

APPENDIX 2: MODEL BUILDING PROCEDURE
In this appendix I will briefly discuss the model building procedure used to create the model tested in this analysis. The decisions were based on not only model fit, but also on the goals of the study.
As was stated in the body of this paper, the model used in this study is based on a model that was previously tested with only the U.S. data. However, that model used a single math attitude latent construct, which comprised several of the math attitude indicator variables in this study. The findings of that study indicated that the math attitude indicator variables did not contribute equally to the relationship with math achievement, and so the model for the present study was created to investigate those inequalities.
Because the primary goals of the study were to investigate differences among the math affect variables' impact on math achievement, three separate math affect latent variables were used in the model. This is also why a second-order math affect variable was not included; a second-order math affect variable would have removed the paths between the individual math affect variables and math achievement, thereby defeating the goal of the analyses.
The model represented in Figure 1 was originally conceptualized with a regression path between SES and math achievement. However, the final model, which replaces that path with a covariance, displayed superior fit (ΔCFI = 0.039, ΔRMSEA = 0.012). In this way, SES is included in the model as a covariate; we are not especially interested in the relationships between SES and the other variables in the model, but we want to account for the role that SES plays in the model.

General Discussion
The purpose of this study was to investigate the relationships between gender, socioeconomic status, math attitude, and math achievement from an international perspective. Whereas there has been much research on the individual contributions of gender, SES, and math attitude to math achievement, little research has investigated these relationships within a multivariate framework and from an international perspective. Using publicly available data from a large, international data set (TIMSS 2007), the relationships among these variables were investigated through multiple-group latent variable modeling.
The cluster solutions from the cluster analysis study were used to inform decisions for the multiple-group latent variable model analyses. Because multiple-group latent variable modeling works best with fewer than 10 groups, the eight-cluster solution was chosen; this is the only cluster solution in which all of the resulting clusters comprised fewer than 10 countries. It is important to note that the cluster solutions appear to have some contextual cohesion, based on several geopolitical indicators, in addition to the mathematical parsimony of the cluster solutions (see Study 1).
The multiple-group latent variable analyses were used to investigate the ways in which gender, math attitudes, and math achievement are related to one another, and how these relationships differ on an international scale. The findings suggest that math self-confidence has a consistently strong relationship with math achievement across countries, and that value of math has a consistently weak relationship with math achievement across countries. Furthermore, the relationship between gender and math achievement, and the relationship between gender and math attitude, is quite small in most cases.
Policy implications from the findings of this study include the need to promote math self-confidence in math curricula. The relationship between math self-confidence and math achievement is stronger than the relationship between math achievement and any other variable in the model, including SES and gender. Because math self-confidence is a more malleable construct than SES, gender, or cognitive ability, and because it has such a strong relationship with math achievement, it is imperative that self-confidence be a major consideration for math educators at all levels. Future research on this subject should begin to focus on the development of interventions to increase math self-confidence.
In addition to the need to focus more attention in the math education process on math self-confidence, the current study suggests a need to develop better measures of math attitude. Suggestions include the use of the Implicit Association Test (Greenwald et al., 1998) as demonstrated in Nosek and Smyth (2011), although whether this would be feasible for studies as large as TIMSS remains open to discussion.
A final consideration for this study has to do with a more qualitative interpretation of the findings. That certain countries consistently outperform the United States in math and science achievement is a given; South Korea, Singapore, Hong Kong, Taiwan, and Japan have long been leaders in international math and science assessments. However, there are important cultural differences between these countries and countries like ours (which are defined in the present study as being those countries in the same cluster as ours, such as Australia and Sweden).
A prime example of these cultural differences can be seen in how countries view tutoring, although other examples certainly exist. In the United States, students typically receive tutoring only after they are perceived as being at risk for failure, and the expectation is that this supplemental assistance will be funded by the already taxed budget of the school system. Conversely, in South Korea there is an expectation that most students will receive additional instruction, and this supplemental instruction commonly comes at an additional expense to parents. In essence, students in South Korea spend nearly twice as much time in school as their American counterparts, and their parents are willing to pay large amounts of money to afford their children this opportunity.
The question becomes whether or not we are willing as a society to adopt the tactics employed by other countries and other cultures to achieve their level of math achievement on international assessments. Furthermore, whether international assessments of math achievement at the fourth and eighth grade levels are an important predictor for a country's well-being has not been established, only assumed.
Given the economic situation that has persisted in the United States since 2008, and the general trend toward decreasing funding for education, it seems unlikely that we will be seeing any kind of massive paradigm shift in which the U.S. is willing to spend more money on education, either at the national, state, or individual levels, which is likely what would be necessary to increase math achievement scores by large amounts.
Because it is unlikely that we will be spending more money on education, it is important that we spend the money we do allocate for education efficiently and in a way that will have the most positive impact. By increasing the attention paid to self-confidence in learning math during the teaching of mathematics, we may be able to decrease the gaps between ourselves and the countries we desire to emulate in terms of math achievement, without committing tremendous additional resources to the endeavor.