EARLY VOCABULARY ASSESSMENT WITHIN A RESPONSE TO INTERVENTION FRAMEWORK

The current study examined the predictive and social validity of two weekly vocabulary assessments embedded within a Tier I Kindergarten vocabulary curriculum. Participants (N = 250 Kindergarten students) received ongoing vocabulary instruction, and their target word knowledge was monitored weekly over the course of 24 weeks using two target word assessments (a Yes/No assessment and a Receptive Picture assessment). Data from the weekly vocabulary assessments were examined at multiple time points with various cut scores. Predictive validity was examined in terms of correct classification of student risk for poor vocabulary outcomes, and results were compared with standardized measures of general receptive and expressive vocabulary knowledge. Teacher judgments regarding the efficiency and effectiveness of the two weekly vocabulary assessments were also examined. Considerations for vocabulary assessment within a multi-tiered or Response to Intervention framework are discussed.

Given that students with poor early language and literacy skills are at risk for poor reading achievement, researchers and educators have recognized the urgency of identifying students at risk for low achievement and intervening early with evidence-based instruction (Coyne, Capozzoli, Ware, & Loftus, 2010; Dickinson & Tabors, 2002; Scarborough, 2001; Snow, Burns, & Griffin, 1998). While many factors can cause children to enter school with poor early language and literacy skills, educators have an opportunity to alter the trajectory of at-risk students' achievement through instruction and intervention. A wealth of knowledge has been established regarding the development, instruction, and assessment of many early language and literacy skills (Dickinson & Neuman, 2006; Hosp, Hosp, & Howell, 2007; Moats, 2010; NRP, 2000; Scarborough, 2001). However, more research is needed to aid educators in accurately identifying children at risk for language and literacy difficulties, particularly in the area of vocabulary (Loftus & Coyne, 2013; NRP, 2000).

Early Language and Literacy Skills
Reading researchers have indicated that word recognition abilities and language comprehension abilities each play a foundational role in promoting skilled reading. Word recognition skills include the use of phonological awareness, decoding, and sight word recognition, while language comprehension skills include the use of background knowledge, language structures, verbal reasoning, literacy knowledge, and vocabulary (Scarborough, 2001). A report by the National Reading Panel (2000) concluded that the five "pillars" of proficient reading achievement are phonemic awareness, phonics, fluency, vocabulary, and comprehension.
Research has shown that in the early grades, struggling readers often experience difficulty with word recognition skills, especially phonemic awareness (Scarborough, 2001; Torgesen, 2002). Given these findings, much attention has been devoted to bolstering word recognition skills in the early elementary grades. However, a misconception held by many educators is the belief that word recognition skills must be established prior to teaching language comprehension skills (Biemiller, 2001).
Although word recognition skills tend to be the focus of reading instruction in early elementary grades, a more effective approach entails simultaneously supporting word recognition skills and language skills through high quality, systematic, and explicit instruction beginning in Kindergarten (Biemiller, 2001). A comprehensive approach to promote reading success includes explicit and direct vocabulary instruction in the early elementary grades.

Causes and Consequences of Poor Early Language and Literacy Skills
For many reasons, children enter school with considerably different levels of early language and pre-reading skills. One reason for this variability is that children from families of low socioeconomic status have far less exposure to rich oral language compared to children from families of high socioeconomic status. In a longitudinal study by Hart and Risley (1995), the researchers visited 42 families monthly over the course of two years, and recorded the language (e.g., the number and nature of utterances) that one- and two-year-old children were exposed to through communications at home. The findings revealed that children from families of low socioeconomic status (SES) were exposed to substantially less oral language at home, in comparison to children from families of middle and high SES. The researchers extrapolated that by age three, the differences in word exposure amounted to a 30 million word gap between children from families of high SES and low SES. As a consequence, the children from low SES families were at a substantial disadvantage in terms of their vocabulary knowledge prior to entering Kindergarten. A follow-up study indicated that the children's vocabulary knowledge at age three strongly predicted their vocabulary knowledge at ages nine and ten (Hart & Risley, 1995). The follow-up findings provide evidence that the gap in word knowledge persisted over time, and initially disadvantaged children were not able to "catch up" to their advantaged peers when they began school. Replication studies (e.g., Dickinson & Tabors, 2002) with similar findings have reinforced the need for high quality early intervention for disadvantaged children.
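The scale of Hart and Risley's extrapolation can be reproduced with simple arithmetic. The sketch below uses the approximate hourly exposure rates commonly cited from their study (roughly 2,100 words per hour in professional families versus roughly 600 words per hour in low-SES families) and an assumed 14 waking hours per day; these rounded figures are illustrative assumptions, not data from the present study.

```python
# Back-of-the-envelope reproduction of the "30 million word gap" estimate.
# Hourly rates are rounded approximations of figures commonly cited from
# Hart & Risley (1995); waking hours per day is an assumption.
WORDS_PER_HOUR_HIGH_SES = 2100  # approx. exposure in professional families
WORDS_PER_HOUR_LOW_SES = 600    # approx. exposure in low-SES families
WAKING_HOURS_PER_DAY = 14       # assumed waking hours
YEARS = 4                       # birth through roughly age three to four

hours = WAKING_HOURS_PER_DAY * 365 * YEARS
gap = (WORDS_PER_HOUR_HIGH_SES - WORDS_PER_HOUR_LOW_SES) * hours
print(f"cumulative exposure gap: {gap:,} words")  # on the order of 30 million
```

Even with generous rounding of the inputs, the cumulative difference lands on the order of tens of millions of words, which is why the estimate is robust to the exact assumptions used.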
Recent data indicate a large gap in reading achievement between advantaged and disadvantaged children. Findings from the 2012 National Assessment of Educational Progress indicate that 80% of children from lower income families scored below proficiency in fourth grade reading achievement, while 49% of children from higher income families scored below proficiency (National Center for Education Statistics [NCES], 2013). While differences in exposure to rich oral language play a role in this discrepancy, it is also necessary to acknowledge the many risk factors associated with childhood poverty, including higher rates of violence, lead poisoning, air and noise pollution, family stress, and health problems (Evans, 2004). In society today, proficient language and literacy skills promote opportunities for school success and increased control over career opportunities and life outcomes. Children with disadvantaged backgrounds often begin formal education lacking prerequisite skills for school success (Biemiller, 2001; Hart & Risley, 1995). Without early intervention, many children will continue to struggle with language and literacy.
Researchers and educators have recognized the need to close the gap by providing at-risk students with early interventions to build foundational skills.
Intervening early is essential in order to minimize the problem of "Matthew Effects" (Stanovich, 1986), in which the "rich get richer and the poor get poorer" over time, increasing the achievement gap between advantaged and disadvantaged children. For example, research has demonstrated that one way children bolster their vocabulary knowledge is by frequently engaging in reading. Skilled readers tend to read widely, encountering many novel vocabulary words in texts, further bolstering their language and reading skills. However, individuals who lack the skills to read advanced texts are not exposed to rich vocabulary through texts (Stanovich, 1986). Furthermore, individuals with poor reading skills are less likely to engage in frequent reading compared to their peers with proficient reading skills (Morgan, Fuchs, Compton, Cordray, & Fuchs, 2008). Findings from the most recent National Assessment of Educational Progress report show that students who read frequently for enjoyment (almost daily, or once or twice a week) had higher levels of reading proficiency compared to students who reported reading for fun infrequently (a few times a year or less) (NCES, 2013).
Many reciprocal interactions between initial skills and learning demands cause initially disadvantaged students to fall further behind their peers over time. Scarborough (2001) reported that of the children who experience early language and literacy difficulties, 65%-75% continue to experience difficulties in subsequent years.
Conversely, of children who do not experience early language and literacy difficulties, only 5%-10% have difficulties in subsequent years. Research has indicated that individuals with limited vocabulary tend to learn new words at a slower rate compared to their peers with larger vocabularies (Coyne, Simmons, Kame'enui, & Stoolmiller, 2004). Over time, the achievement gap between students with underdeveloped early language and literacy skills and their advantaged peers tends to increase unless interventions are put in place to close it (Hart & Risley, 1995; Snow, Burns, & Griffin, 1998; Torgesen, 2002).

A Multi-Tiered Approach for Promoting Language and Literacy Skills
Researchers have emphasized the need for instructional practices that aim to prevent language and literacy difficulties, and to intervene as early as possible when students do not make adequate progress towards important outcomes (Bradley, Danielson, & Doolittle, 2005; Cunningham & Stanovich, 1997; Wanzek & Vaughn, 2007). Such initiatives have been guided by a public health model approach to education, based on the idea that preventing academic problems is more effective and efficient than remediating problems (Gutkin, 2012; Torgesen, 2002). A proactive approach towards language and literacy development is particularly important, considering the evidence that early reading skills strongly predict future reading acquisition (Cunningham & Stanovich, 1997; Scarborough, 2001).
Response to Intervention (RtI) is a framework for providing multi-tiered, differentiated instruction and supports to all students (National Center on Response to Intervention [NCRTI], 2010). Schools using an RtI framework recognize that students vary in terms of the level of instructional supports they need to learn and succeed academically. As such, schools that implement an RtI framework regularly and systematically identify students in need of additional support, and provide appropriate support as needed. While researchers, educators, and school psychologists have long recognized within-child factors that can affect student learning (e.g., intrinsic learning or attention problems), it is important to note that ecological factors (e.g., the quality of previous instruction, parent support) also play an important role in promoting or hindering student learning (Gutkin, 2012). With multiple tiers of support in place, students with diverse learning needs are supported, regardless of the underlying cause of learning difficulties. As Brown-Chidsey and Steege (2010) emphasized, "…the nature of the interventions provided to help students overcome school difficulties is more important than the etiology or symptoms" (p. 27).
Key components of an RtI framework include the use of evidence-based, differentiated instruction and the use of a comprehensive assessment plan that includes screening, progress monitoring, and diagnostic assessment (NCRTI, 2010). Evidence-based instruction refers to instructional methods or curricula that have empirical support for promoting learning for most students. Differentiated instruction refers to instruction that continuously targets the specific needs of individual students. The universal level of support, or Tier I support, is high quality instruction in the classroom. In an RtI model, the instructional practices provided through Tier I meet the learning needs of most students (approximately 80% of students in the classroom).
For various reasons, some students (approximately 15%) will need Tier II support (e.g., more instructional time, more opportunities to practice, more feedback, small group instruction, etc.), in addition to Tier I instruction, to reach their learning goals.
A few students (approximately 5%) will require additional intensive Tier III supports (e.g., increased instructional time, more explicit instruction, more opportunities to practice skills, more feedback, and one-to-one or small group instruction) to reach their learning goals (Burns & Gibbons, 2008).
Through data-based decision-making, educators identify students who need additional support, determine the specific skills that need to be targeted for interventions, and monitor how effective the interventions are in promoting learning (Brown-Chidsey & Steege, 2010). Universal screenings, diagnostic assessments, and progress monitoring are RtI assessment methods that promote timely and efficient instructional decision-making. Universal screening is typically done three times throughout an academic year within an RtI framework (Hosp et al., 2007). The purpose of universal screening is to identify all students who are low performing and in need of additional support. Screening tools should accurately identify students who are at risk for learning difficulties and would therefore benefit from additional support.
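The accuracy of a screening tool is typically summarized by its sensitivity (the proportion of truly at-risk students the screener flags) and its specificity (the proportion of not-at-risk students it correctly clears). As a hedged illustration, the sketch below computes both indices from entirely hypothetical counts; the function name and the numbers are assumptions for illustration and do not come from the present study.

```python
# Illustrative only: hypothetical results for 100 screened students,
# cross-tabulated against a later criterion measure of reading difficulty.
def classification_accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for one screening cut score.

    tp: at-risk students correctly flagged by the screener
    fp: not-at-risk students incorrectly flagged
    fn: at-risk students the screener missed
    tn: not-at-risk students correctly cleared
    """
    sensitivity = tp / (tp + fn)  # proportion of true risk cases caught
    specificity = tn / (tn + fp)  # proportion of non-risk cases cleared
    return sensitivity, specificity

# Hypothetical counts: 20 students truly at risk, 80 not at risk.
sens, spec = classification_accuracy(tp=18, fp=12, fn=2, tn=68)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# prints: sensitivity = 0.90, specificity = 0.85
```

Lowering a cut score generally raises sensitivity (fewer missed students) at the cost of specificity (more false alarms); weighing this trade-off is exactly what the comparison of cut scores in screening research addresses.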
In circumstances when the majority of students in a classroom are identified as being at risk, modifications should be made in Tier I instruction (Brown-Chidsey & Steege, 2010; Burns & Gibbons, 2008). In an RtI framework, individual student progress is monitored to guide instructional decision-making and bolster language and literacy development. It is important to continually monitor individual students' progress towards proficient reading using efficient and technically adequate measures. Doing so allows educators to adapt their instruction and determine whether or not a particular intervention is effective (Fuchs, Fuchs, & Vaughn, 2008).

Bloom, Hastings, and Madaus (1971) described the need for classroom teachers to differentiate instruction to facilitate learning for all children. Many assessments in schools today measure differences in student aptitudes for learning in a given area. Bloom et al. (1971) argued that the use of such aptitude tests leads many teachers and students to believe that high levels of achievement are only possible for initially high-performing students. Carroll (1963) reasoned that "aptitude is the amount of time required by the learner to attain mastery of a learning task" (as cited in Bloom et al., 1971, p. 46). In Carroll's view, most students can become successful learners if given appropriate time and instruction. Formative evaluations are valuable for effectively gauging students' instructional needs.
In a formative evaluation, a course or subject is broken up into smaller units of learning, and assessments are administered after the end of each unit (Bloom et al., 1971). The data obtained from formative assessments are used to determine which students have mastered the learning objectives, and which students have not. For the students who have not yet mastered a given skill, teachers can use formative assessment data to determine the specific area(s) of difficulty and provide appropriate instruction. Importantly, such assessments are not intended to grade or judge students, but rather they are intended to be used as a tool to guide instruction and improve student learning (Stiggins, 2001). Summative assessments, on the other hand, are intended for grading and evaluating the outcome of instruction and learning (Bloom et al., 1971).
Formative assessment data are used in schools today to identify student instructional needs in a timely manner (Wiliam, 2006; Burns & Gibbons, 2008).
Research has demonstrated that formative assessments are powerful tools for improving student learning (Black & Wiliam, 2009). In fact, a synthesis of over 800 meta-analyses found the use of frequent formative assessment to be among the most powerful teaching variables affecting student learning (Hattie, 2009). The ongoing use of formative assessments allows educators to allocate appropriate resources within a multi-tiered service delivery framework, such as Response to Intervention (Burns & Gibbons, 2008).
Curriculum Based Assessments are widely used tools for formative assessment and evaluation. Curriculum Based Assessments are measurements that use "direct observation and recording of a student's performance in the local curriculum as a basis for gathering information to make instructional decisions" (Deno, 1987, p. 41).
Curriculum Based Assessment (CBA) is considered a broad "umbrella" term, and there are many forms, including Curriculum Based Measurement (CBM), Curriculum Based Evaluation (CBE), Criterion-Referenced Curriculum Based Assessment (CR-CBA), and Curriculum Based Assessment for Instructional Design (CBA-ID) (Hintze, Christ, & Methe, 2006). Curriculum-based assessments can be divided into two major forms: specific sub-skill mastery measurements (CBE, CR-CBA, and CBA-ID) or general outcome measurements (CBM). Each form of CBA addresses different questions regarding instructional decision-making, and no single form provides comprehensive information regarding the evaluation of and intervention for academic problems (Hintze, Christ, & Methe, 2006). Therefore, it is helpful to understand each form of CBA independently in order to select the most appropriate measure to use in a given context.
In the area of specific sub-skill mastery measurement, a global curriculum is sequenced into short-term sub-skills, and mastery of each unique sub-skill is measured. Mastery measures are typically not standardized, and the format of measures can shift depending on the skill that is assessed. For example, within the domain of reading, decoding skills are typically sequenced beginning with relatively simple decoding skills (e.g., decoding CVC words). Once mastery measures indicate that a student has mastered a specific skill, the student receives instruction for the next short-term skill in the curriculum sequence (Hintze, Christ, & Methe, 2006). The mastery measures are closely aligned with the specific curriculum, and therefore are likely to have high content validity and social validity (i.e., the assessments measure what was taught).
With Curriculum Based Assessment for Instructional Design (CBA-ID; Gickling & Havertape, 1981), the goal is to determine a student's current instructional needs by aligning the content of the assessment with the current content of instruction.
With CBA-ID, excessive amounts of unknown information are not included in the assessment, but instead the content is closely aligned with current instructional skill areas (Hintze, Christ, & Methe, 2006). Teachers use CBA-ID data to control the timing at which new instructional topics (e.g., sub-skills) are introduced to individual students (Gickling & Havertape, 1981). For example, a teacher might monitor a student's progress towards mastery of decoding CVC words before moving on to teaching and assessing CVCe decoding skills.
With Criterion Referenced Curriculum Based Assessment (CR-CBA; Idol & Paolucci-Whitcomb, 1999), the goal again is to determine a student's current instructional needs. However, within a CR-CBA, several levels of the curriculum are assessed at once. With CR-CBAs, the content consists of skills that have already been taught and skills that have not yet been taught. A student's performance is compared with mastery criteria (e.g., using local norms to determine acceptable performance levels) (Idol, Nevin, & Paolucci-Whitcomb, 1999). CR-CBAs can be used to monitor long-term growth of skills from a sequenced curriculum.
Curriculum Based Evaluation (CBE; Howell, 1986) is a process in which survey-level assessments are used to sample from a wide range of skills within a particular domain, such as reading (Hintze, Christ, & Methe, 2006). For example, oral reading fluency probes are often used as a survey-level assessment of a student's current level of reading proficiency (Hosp et al., 2007). Using the results of a survey-level assessment, follow-up diagnostic assessments are administered to examine mastery levels for specific sub-skills and to determine the specific areas in which more instruction is needed (e.g., silent-e endings, digraph patterns, etc.). CBE is a systematic process for determining a student's current instructional needs, in terms of the specific skills that have or have not been mastered (Hosp et al., 2007).
In the area of general outcome measurements, global indicators of basic skills are measured repeatedly to monitor long-term growth in a particular domain.
Curriculum-Based Measurements (CBM; Deno, 1987) are general outcome measures, or standard measures of basic skills such as reading, spelling, writing, or mathematics.
In contrast to mastery measurements, CBMs are not aligned precisely with the specific content taught in the curriculum. CBMs are used as dynamic indicators of basic skills (DIBS) to guide formative evaluation (Deno, 1987). CBMs are dynamic in the sense that they are sensitive to differences between individuals and to change within individuals over time. The measures also serve as evidence-based indicators of basic skills, such as reading (Shinn, 1998).
While CBMs are not as closely aligned with the instructional curriculum as mastery measurements are, they are standardized, efficient to administer, sensitive to short-term and long-term improvement, and have established acceptable psychometric properties (Hosp et al., 2007). As Shinn (1998) described, CBMs can be regarded as "academic thermometers," used to monitor indicators of overall academic health in a particular domain (e.g., reading). However, CBMs are not useful for identifying specific areas of weakness (Shinn, 1998).
In summary, the two forms of curriculum-based assessment can be compared as follows. How are the data used? General outcome measurement is used to monitor progress toward long-term achievement in broad domain areas (e.g., reading) and to identify students at risk for low achievement in those areas; mastery measurement is used to monitor progress toward short-term achievement in specific skill areas (e.g., CVC word decoding) and to document mastery of specific skills. How often is it administered? General outcome measures are typically administered weekly for progress monitoring and triannually for universal screening; mastery measures are administered at the end of each unit (frequency may vary). What are the benefits? General outcome measurement allows for continuous assessment of retention and generalization in broad domain areas, and the method of assessment is consistent over time; mastery measurement, by contrast, relies on multiple unique measures that may vary in difficulty as the unit or objectives change.

While there is evidence that CBAs are useful as screening, progress monitoring, and diagnostic instructional decision-making tools in areas such as phonemic awareness, phonics, fluency, and comprehension, there is currently insufficient research regarding useful vocabulary assessments within an RtI framework (Loftus & Coyne, 2013). Other reading skills work well within a general outcome or mastery measurement system (e.g., oral reading fluency); however, the measurement of vocabulary poses unique challenges. For example, given the vast number of vocabulary words (over 500,000 distinct word types; Nagy & Anderson, 1984), it is difficult to select a representative sample of words to assess.

A wealth of research has been conducted to explore best practices in promoting code-based skills (e.g., phonemic awareness, phonics) within a multi-tiered or RtI framework (Hosp et al., 2007). For example, the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; University of Oregon Center on Teaching and Learning, 2014) include widely used general outcome measures of skills such as phonological awareness, the alphabetic principle, phonics, oral reading fluency, and comprehension. Far less attention has been devoted to instructional strategies and assessment tools for early vocabulary acquisition (Biemiller, 2001; Loftus & Coyne, 2013; NRP, 2000). The tools that have been developed to monitor vocabulary progress have not established adequate sensitivity to short-term gains in vocabulary knowledge, and are therefore of limited use. Tools measuring general vocabulary knowledge (i.e., items reflect a sampling of words that were not necessarily targeted for direct instruction) are not likely to be effective in capturing ongoing gains in word knowledge (NRP, 2000; Paris, 2005; Stahl & Bravo, 2010).
Researchers have agreed that it is a challenge to measure vocabulary knowledge within an RtI framework (Beck, McKeown, & Kucan, 2002; Loftus & Coyne, 2013; NRP, 2000; Paris, 2005). One of the challenges of measuring word knowledge is determining what it means to know a word (Beck et al., 2002). Another challenge is determining the most effective methods for measuring word knowledge (NRP, 2000). Before discussing vocabulary assessment methods, it is first helpful to provide an overview of the nature of vocabulary development and evidence-based instructional strategies.

Early Vocabulary Development and Instruction
Although vocabulary knowledge and growth vary from one child to the next, most children's lexicons grow substantially during the second year of life (Bates et al., 1988, as cited in Snow, Burns, & Griffin, 1998) and continue to grow rapidly through preschool and subsequent school years. Researchers distinguish between multiple forms of vocabulary, including receptive vocabulary and productive vocabulary (NRP, 2000). Receptive vocabulary refers to words that an individual is able to recognize (e.g., words that are understood when presented through speech or writing).
Productive vocabulary refers to words that an individual is able to produce (e.g., words that an individual can produce through speech or through writing). Receptive and productive vocabularies can be further sorted into categories of oral vocabulary (words that are understood or produced through speech or oral language) or reading vocabulary (words that are understood or produced through text or writing) (NRP, 2000).
Researchers have attempted to estimate vocabulary size and rate of growth; however, this task is difficult for two reasons. First, there are challenges in defining what it means to know a word. Additionally, different procedures and measures have been used to capture vocabulary knowledge (Beck et al., 2002), leading to inconsistencies in estimations of vocabulary knowledge. Researchers have estimated that the average school-age child learns (or becomes aware of) approximately seven new words a day (Just & Carpenter, 1987; Nagy & Herman, 1987; Smith, 1941; as cited in Snow, Burns, & Griffin, 1998). However, the number of words learned per day can vary substantially from one student to the next. While some students learn well over seven new words per day, some students learn two new words a day or fewer (Beck et al., 2002). Research has indicated that children who enter school with limited vocabularies learn new words at a lower rate compared to students who enter school with rich vocabularies (Baker, Kame'enui, Simmons, & Simonsen, 2007; Baker, Simmons, & Kame'enui, 1997; Hart & Risley, 1995).
Language and literacy researchers have asked the question, what does it mean to know a word? Carey (1978) explained that initially, a "fast mapping" process of word learning takes place. During this process, the individual has a very basic sense of the meaning of the word. According to Carey (1978), it is not until the individual has used and understood the word in multiple contexts that "extended mapping," or a more advanced knowledge of the word, can occur. Several other perspectives of word learning have been put forth by researchers (see Table 2). Each perspective recognizes that word knowledge is not an all-or-nothing phenomenon (Beck et al., 2002). Instead, word knowledge deepens incrementally as an individual uses and understands words in multiple contexts (Stahl, 2003; Beck et al., 2002). Determining an individual's word knowledge is a difficult and nuanced task.
One of the most important components of effective vocabulary instruction is selecting appropriate words to teach. Nagy and Anderson (1984) analyzed words in printed school materials for Grades 3-9 and identified over 88,500 distinct word families (e.g., motivate, motivated, motivates, motivating, motivation, motivations, motives, motivational, and unmotivated are categorized as one distinct word family).
Given that there are thousands of words to choose from, researchers have categorized the most important types of words for educators to teach directly. Beck et al. (2002) encourage careful selection of target words that are useful and likely to bolster language comprehension.

Beck et al. (2002) describe a continuum of word knowledge:
1. No knowledge of the word.
2. A general sense of the word, such as knowing that mendacious has a negative connotation.
3. Narrow, context-bound knowledge, such as knowing that a radiant bride is a beautifully smiling happy one, but being unable to describe an individual in a different context as radiant.
4. Having knowledge of a word but not being able to recall it readily enough to use it in appropriate situations.
5. Rich, decontextualized knowledge of a word's meaning, its relationship to other words, and its extension to metaphorical uses, such as understanding what someone is doing when they are devouring a book.

Cronbach (1942) described dimensions of word knowledge:
Generalization: The ability to define a word.
Application: The ability to select or recognize situations appropriate to a word.
Precision: The ability to apply a term correctly to all situations and to recognize inappropriate use.
Availability: The actual use of a word in thinking and discourse.

Note: The information provided in this table was obtained from Beck et al. (2002, pp. 9-10).

Beck et al. (2002) distinguish between three tiers of words (unrelated to the tiers of support referenced in an RtI framework). Tier One words are common, everyday words such as clock, chair, and hand. Tier One words are relatively simple to conceptualize, and most individuals learn these words quickly and easily through everyday interactions and experiences. Tier Two words (e.g., operate, maintain, and previous) are less common, more abstract terms that are used across many different content areas. Tier Three words (e.g., peninsula, abolitionist, and isotope) are uncommon, specialized, and limited to specific academic domains (Beck et al., 2002).
Tier Two and Tier Three words (Beck et al., 2002) align with what Blachowicz, Fisher, Ogle, and Taffe (2013) referred to as academic vocabulary.
Academic vocabulary refers to content-area words that are often unfamiliar to students until they are presented in academic contexts (e.g., by teachers, in texts, or other academic resources). Unlike Tier One words, Tier Two and Three words are difficult to learn through incidental exposure, because they are more abstract. Vocabulary researchers suggest that Tier Two words or general academic vocabulary terms are especially useful to teach, because they are found across disciplines and content areas, and do not require domain-specific knowledge (Beck et al., 2002).
Given the large number of words in the English language, researchers and educators have debated over the merits of a breadth versus depth approach to early vocabulary instruction. In other words, in the allotted time available for vocabulary instruction, should educators provide extensive, direct instruction for a few words, or should they aim to cover many words through brief, incidental vocabulary instruction?
Research has demonstrated that direct vocabulary instruction of Tier Two words has more powerful long-term effects than incidental exposure approaches to vocabulary instruction (Coyne, McCoach, Loftus, Zipoli, & Kapp, 2009; Maynard, Pullen, & Coyne, 2010), particularly for students with underdeveloped vocabulary knowledge.
Evidence-based practices for promoting vocabulary knowledge include selecting appropriate target words, teaching words directly, using student-friendly definitions, reinforcing the definition in multiple contexts, providing rich and varied language experiences, storybook reading, fostering word consciousness, teaching word learning strategies (such as looking for prefixes and root words), and providing students with multiple opportunities for practice and feedback (Beck et al., 2002).
Vocabulary researchers (Beck et al., 2002; Biemiller, 2001; Coyne et al., 2009) have cautioned educators against relying on incidental vocabulary learning to build students' vocabulary for Tier Two words. Research has indicated that relying on contextual clues to learn new Tier Two words can provide inaccurate understandings of novel words, especially for individuals with low levels of reading achievement and vocabulary knowledge (Beck et al., 2002).
Studies have shown that repeated readings of storybooks paired with explicit, rich explanations of Tier Two words are an effective method for bolstering the vocabulary of children at risk of reading difficulty (Coyne, Simmons, Kame'enui, & Stoolmiller, 2004; Loftus, Coyne, McCoach, Zipoli, & Pullen, 2010; Maynard, Pullen, & Coyne, 2010). Vocabulary growth through shared storybook readings has also been documented with children who are English Learners (Collins, 2010; Hickman, Pollard-Durodola, & Vaughn, 2004; Silverman, 2007). Importantly, the most effective approach for promoting vocabulary growth through shared storybook reading includes purposeful selection of Tier Two words, providing student-friendly definitions, and planning lessons and activities to promote target word use in rich contexts (Coyne et al., 2004). Incidental exposure to words through storybook reading is less effective for promoting vocabulary knowledge, particularly for students with limited vocabulary or students who are English Language Learners (Collins, 2010; Coyne et al., 2005; Coyne, McCoach, & Kapp, 2007; Maynard et al., 2010). Research also indicates that use of a structured vocabulary curriculum promotes vocabulary growth for young children (Apthorp et al., 2012; Resendez & Azin, 2007).
Even with the use of evidence-based vocabulary curricula, a major challenge to effective instruction is the heterogeneity of student vocabulary knowledge in a given classroom. Research has documented that children enter formal schooling with widely differing levels of language and literacy skills (Hart & Risley, 1995; Dickinson & Tabors, 2002). Given these findings, it is important that educators not only use evidence-based instructional practices in the classroom (Tier I), but also that the instruction is differentiated depending on the instructional needs of individual children. The most effective and appropriate method for differentiating instruction is to use technically adequate formative assessments to guide instructional decision-making.

Early Vocabulary Assessment within a Multi-Tiered Framework
Research has shown that direct assessment of early language and literacy skills provides stronger predictive validity compared to teacher judgments, in terms of correctly identifying students who are at risk for poor literacy achievement (Cabell, Justice, Zucker, & Kilday, 2009). While technically adequate curriculum-based assessments have been developed for early literacy skills such as phonemic awareness, grapheme-phoneme knowledge, phonics, and fluency (Hosp et al., 2007) there is a need for valid and efficient assessments of vocabulary knowledge and growth (Loftus & Coyne, 2013). As Paris (2005) pointed out, "there has been increased assessment and instruction on alphabet knowledge, phonemic awareness, and oral reading fluency as the main enabling skills and significant predictors of later reading achievement.
There has been relatively less research and classroom emphasis on vocabulary and comprehension to date, perhaps because of the difficulty of assessing and teaching these skills to children who are beginning to read." (p. 187).
While vocabulary is considered one of the five "pillars" of reading acquisition (NRP, 2000), there are fundamental differences between vocabulary and the other pillars of reading acquisition. Paris (2005) described phonemic awareness, phonics and fluency as linear, constrained skills. For example, within a few years of instruction, most students are able to demonstrate complete mastery of skills such as letter naming, letter-sound knowledge, phonemic awareness, and decoding. However, the same is not true for vocabulary knowledge. Unlike constrained skills, vocabulary development has no ceiling for mastery. Vocabulary acquisition is an unconstrained skill that continues to develop across a lifetime (Paris, 2005).
Different methods have been developed to aid in measuring an individual's word knowledge. Some methods are intended to measure "shallow" word knowledge, while other methods aim to measure "deep" word knowledge (Beck et al., 2002). In a review of the research on vocabulary instruction and assessment, the National Reading Panel found, …most of the researchers [use] their own instruments to evaluate vocabulary, suggesting the need for this to be adopted in pedagogical practice. That is, the more closely the assessment matches the instructional context, the more appropriate the conclusions about the instruction will be… instruments that match the instruction will provide better information about the specific learning of the students related directly to that instruction. (NRP, 2000, Chapter 4, pp. 26-27).
In other words, tools that aim to measure vocabulary knowledge and growth should be closely aligned with the vocabulary instruction or curriculum. Curriculum-based assessments have received a great deal of attention and use for instructional decision-making in constrained areas of reading acquisition (e.g., letter-sound knowledge, phoneme awareness, phonics). With CBAs, a student's progress toward mastery of constrained skills can be monitored over time, and instruction can be differentiated based on a student's progress (or lack of progress) towards short or long-term outcomes. In order for vocabulary assessments to be useful to teachers, the content of the assessment and vocabulary curriculum must be closely aligned. However, many educators do not use a curriculum for direct vocabulary instruction, and instead rely on indirect or incidental vocabulary instruction. An unstructured, incidental approach to vocabulary instruction limits the availability and use of vocabulary assessments that are aligned with target words. In other words, curriculum-based vocabulary assessment is only possible with a vocabulary curriculum in place. The words that are taught directly should be the same words that are assessed (NRP, 2000).
With a high quality vocabulary curriculum in place, educators can identify a "ceiling" for mastering target vocabulary words over a long period of time (e.g., over the course of an academic year, or multiple years). For example, if a teacher uses a vocabulary curriculum to directly teach 100 new Tier Two words throughout the school year, the "ceiling" could be defined as mastery of the 100 target words. In this context, teachers could have an opportunity to measure the specific words that were taught directly throughout the year, and to make decisions regarding individual student learning. Using a Tier I (whole-class) vocabulary curriculum provides educators with an opportunity to use curriculum-based vocabulary assessments to make decisions regarding the effectiveness of instruction for individual students. A variety of approaches, tools, and procedures exist for measuring vocabulary knowledge.
However, research is needed to explore and identify best practices for measuring vocabulary knowledge within a multi-tiered framework (Loftus & Coyne, 2013). Technically adequate (reliable, valid) and useful indicators of student learning are essential in a proactive and preventative model for instructional decision-making (Deno & Mirkin, 1977).
A disadvantage of many mastery measurement curriculum-based assessments is that technical properties such as reliability and validity are often not established (Shinn, 1998). Reliability is a test property that reflects the degree to which differences in observed scores are aligned with differences in true scores (Furr & Bacharach, 2008). Adequate reliability is necessary but not sufficient for validity. The conceptualization of validity has evolved over time (Furr & Bacharach, 2008). Researchers have pointed out the need to use precise language when referring to the concept of validity (Furr & Bacharach, 2008; Scriven, 2002). Scriven suggested that "there are no valid tests of future affairs, only indicators… the use of test results may be a valid or invalid indicator of future performance" (2002, p. 258). In other words, the actual question is whether the inferences we make using test results are valid for a given purpose. Messick suggested that "the essence of unified validity is that the appropriateness, meaningfulness, and usefulness of score-based inferences are inseparable and that the unifying force behind this integration is the trustworthiness of empirically grounded score interpretation" (1989, p. 5). Data-based decision making within a Response to Intervention framework requires the use of tools that can efficiently and accurately predict student risk for poor outcomes in important domains.
Therefore, it is appropriate to examine the predictive validity of screening assessments, or the degree to which assessments accurately classify students at risk or students not at risk for poor outcomes.
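The classification logic behind this kind of predictive validity analysis can be illustrated with a short sketch. The example below is hypothetical rather than the study's actual analysis: the function name, cut score, and data are invented for illustration, and a student is flagged as at risk when a screening score falls at or below the cut score.

```python
# Illustrative sketch: evaluating a screening cut score's classification accuracy.
# All values below are invented for demonstration, not taken from the study.

def classification_stats(screen_scores, true_at_risk, cut_score):
    """Flag students as at risk if their screening score falls at or below
    the cut score, then compare the flags against the outcome criterion."""
    tp = fp = tn = fn = 0
    for score, at_risk in zip(screen_scores, true_at_risk):
        flagged = score <= cut_score
        if flagged and at_risk:
            tp += 1          # correctly identified as at risk
        elif flagged and not at_risk:
            fp += 1          # false alarm
        elif not flagged and at_risk:
            fn += 1          # missed at-risk student
        else:
            tn += 1          # correctly identified as not at risk
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    correct = (tp + tn) / len(screen_scores)
    return sensitivity, specificity, correct

# Hypothetical weekly-average scores (0-5 scale) and end-of-year risk status.
scores = [4.8, 2.1, 3.9, 1.8, 4.5, 2.4, 5.0, 2.2]
risk = [False, True, False, True, False, False, False, True]
sens, spec, acc = classification_stats(scores, risk, cut_score=2.5)
# sens = 1.0, spec = 0.8, acc = 0.875 for this illustrative data
```

In practice, a cut score is chosen to balance sensitivity (catching truly at-risk students) against specificity (avoiding false alarms), which is the trade-off examined with the various cut scores in the current study.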

Curriculum-Based Vocabulary Assessments
In using formative assessments and screening assessments, it is important to determine whether the assessment data have predictive validity. That is, do assessment results correlate highly with future learning outcomes? It is expected that through direct vocabulary instruction, learning outcomes will include expressive or productive knowledge of target words (i.e., ability to generate definitions of words) and receptive or discriminate knowledge of target words (i.e., ability to select an accurate representation of a word, or the ability to discriminate between examples and nonexamples of target words). These learning expectations are based on research on early vocabulary instruction (Coyne et al., 2009; McKeown & Curtis, 1987; NRP, 2000).
In the classroom, teachers benefit from using assessments that are efficient to administer and that will guide instructional decision-making (Hosp et al., 2007). While multiple choice measures are convenient and efficient to use, disadvantages to this approach can include the availability of context clues and the possibility that the student will guess correctly. However, ongoing results from well-designed multiple choice assessments could provide a general indication regarding a student's understanding of target words. A primary advantage of receptive or discriminative methods of vocabulary assessment is the efficiency of administration; an entire classroom could be assessed in minutes using multiple choice assessments.
Classrooms that use a multi-tiered service delivery model require ongoing assessments to inform the teacher of student progress or lack of progress (Burns & Gibbons, 2008; Coyne, Kame'enui, & Simmons, 2001). As vocabulary instruction is just beginning to be emphasized in early elementary school, there is a need for vocabulary assessments that are accurate indicators of student learning. Two curriculum-based assessments are currently available in one of the most widely used commercially available Kindergarten vocabulary programs, Elements of Reading: Vocabulary (Beck & McKeown, 2004). In the program, five new vocabulary words are taught to Kindergarten classes each week through storybook reading and a variety of other language and literacy activities. The two curriculum-specific vocabulary assessments are administered at the end of each week (i.e., the end of each unit).
While teachers are encouraged to use these assessments, it is unclear whether they are technically adequate assessments of student vocabulary development, and whether the assessments are efficient and useful for teachers to administer. Research is needed to determine the practical and predictive validity of the measures, and to inform best practice in the use of these vocabulary assessments.
In the current study, data are examined from two curriculum-based vocabulary assessments completed weekly by 250 Kindergarten students over the course of an academic year. The study examines the extent to which these measures are appropriate for gauging Kindergarten students' understanding of target vocabulary words that have been taught through multi-tiered instruction. While ongoing formative assessment is essential for supporting differentiated instruction, it is difficult to select appropriate tools unless a core vocabulary curriculum is in place. Considering the vast number of words available to teach, it can be a challenge to select a brief, formative assessment that will capture short-term growth in vocabulary knowledge. Inadequate sensitivity can be a major barrier to measuring short-term vocabulary growth unless the assessment is aligned with words that have been taught (i.e., aligned with the curriculum or curriculum-based). With this in mind, it is evident that vocabulary assessments must be closely aligned with a curriculum or structured framework for direct vocabulary instruction. In the current study, the utility of two curriculum-based vocabulary assessments is examined within a multi-tiered vocabulary instructional framework.

Research Questions
The current study examined the predictive validity and social validity of two weekly curriculum-based vocabulary assessments included in an evidence-based Kindergarten vocabulary program (Beck & McKeown, 2004). Correlations were examined between each of the 24 weekly probes, for both of the vocabulary assessments.

Tier I vs. Tier II Group Differences on Curriculum-Based Vocabulary Assessment Performance
The current study examined whether the curriculum-based vocabulary assessments included in the Elements of Reading: Vocabulary curriculum captured group differences in target word vocabulary knowledge between at risk students receiving Tier I instruction and at risk students receiving Tier I and Tier II instruction. Tier I and Tier II group differences were also examined using end-of-year proximal and distal vocabulary outcome measures. All participating classrooms used the Elements of Reading: Vocabulary curriculum by Beck and McKeown (2004). Kindergarten teachers were trained to use this evidence-based curriculum to deliver direct, whole-class vocabulary instruction.
Five new target vocabulary words were taught each week in Tier I instruction, through a variety of lessons and activities in the Elements of Reading: Vocabulary curriculum.
The target vocabulary words taught to all participants are listed in Table 3. Through Project EVI, a group of typically achieving students was identified as "reference" students. The "remaining" students (N=247) did not complete additional testing for the purposes of Project EVI, but were included in the current study. A summary of Project EVI groups and instruction received is provided in Table 4. "At risk" students with PPVT-4 scores between the 5th and 30th percentile (N=79) were randomly assigned to either a control group (N=36) or a treatment group (N=43). The control group received only Tier I (whole class) vocabulary instruction throughout the year, as did the "reference" and "remaining" groups. The treatment group received both Tier I (whole class) and Tier II (small group) vocabulary instruction throughout the year. The primary goal of Project EVI was to examine the effects of Tier II vocabulary instruction for at risk students, compared to the control group and reference group.
Interventionists (school-based reading specialists, paraprofessionals, teaching assistants, etc.) were trained through Project EVI to provide Tier II instruction to the treatment students. Tier II instruction was delivered four days per week for 20 minutes each day to groups of 2-4 students. Students in the treatment group received 80 additional minutes of small group vocabulary instruction a week, compared to the control, reference, and remaining students. Tier II interventionists reviewed and reinforced three out of the five target words that were taught each week in Tier I instruction. For example, in Week One of Tier I instruction, five words were taught directly in a whole-class lesson (comforting, fleet, glimmer, expression, and lively).
Only three of the five words were reviewed throughout the week in the Tier II intervention (comforting, fleet, and glimmer). Within the Tier II instruction, treatment students had extended opportunities to use and interact with the target words through various activities. For example, in one activity students discriminated between examples and non-examples of target word meanings using picture cards. In other activities, students used target words in sentences or used word webs to make connections between target words and other words. In the Tier II instruction, students received scaffolded instruction and immediate corrective feedback. Given the extended instruction and increased support, students in the treatment group were expected to develop higher levels of target word knowledge compared to the control group.
Trained Project EVI researchers collected pre-intervention and post-intervention data from the treatment, control, and reference groups related to early language and literacy skills. The current study collected weekly vocabulary assessment data over the course of the academic year (each week for 24 weeks), in addition to the pre- and posttest data collected for Project EVI. A summary of the groups and data collected from each group is provided in Table 5. The design of the current study is described next, building from the context of Project EVI.
Note:  indicates data that were collected in the current study;  indicates data that were collected through Project EVI and used in the current study.
Current study design. In the current study, Kindergarten teachers participating in Project EVI were trained to administer two target vocabulary assessments at the end of each weekly lesson: a Yes/No assessment (see Appendix A), and a Receptive Picture assessment (see Appendix B). The weekly vocabulary assessments are both embedded in the Elements of Reading: Vocabulary curriculum by Beck and McKeown (2004). Detailed information about these measures is provided in the Measures section. At the end of each week, students were instructed to complete the weekly vocabulary assessments independently, without help from teachers or peers. Kindergarten teachers read each item aloud to students in a whole-group setting, and monitored independent completion of each assessment. The degree to which students actually worked independently was examined using two data sources: classroom fidelity observations and teacher reports on a questionnaire. The three classrooms that were not observed were excluded from analyses in the current study. Of the 16 classrooms observed, two were eliminated from further analyses due to low fidelity ratings (i.e., fidelity scores below six), leaving 14 Kindergarten classrooms with high fidelity observation ratings.
Teacher questionnaires. A teacher questionnaire was completed by 18 of the Kindergarten teachers at the end of the school year (see Appendices D and E). The questionnaire included teacher reports regarding the ease of administering the two target vocabulary assessments, perceived strengths and weaknesses of the assessments, and other information. One of the items asked teachers to report the degree to which students in their classrooms completed each target vocabulary assessment independently. Teachers provided a rating from 1-10, with 10 representing the highest level of independent work from students. Classrooms with ratings lower than six on this item were excluded from further analyses. Of the 18 teachers who completed the questionnaire, four teachers reported low levels of independent student work. Three of these classrooms had already been eliminated from analyses due to low observation fidelity levels. After eliminating classrooms with either low observation fidelity scores or low teacher ratings for independent work, 13 classrooms remained for further analyses.

Participants
The participants in the current study initially included teachers and students from 19 Kindergarten classrooms in Rhode Island and Connecticut. Participants were recruited from four elementary schools through their participation in Project EVI. The initial number of Kindergarten student participants was 374 (M age = 5 years 5 months; age range: 4 years 8 months to 6 years 8 months).
Through Project EVI, all of the initial 374 student participants were screened at the beginning of the academic year with the Peabody Picture Vocabulary Test-4 (Dunn & Dunn, 2007) to determine their initial level of risk for language and literacy outcomes.
Using screening results, 127 of the 374 students were assigned to one of three groups: treatment (n=43), control (n=36), or reference (n=48). The remaining students (n=244) were not selected for follow-up testing for Project EVI, but were included in analyses for the current study.

Yes/No curriculum-based vocabulary assessment. In this assessment, the teacher reads a yes or no question out loud to the class and students respond by circling "Yes" or "No" on their response probe. The yes/no format requires students to use word knowledge and comprehension of contextual clues to determine the correct response.
There are five yes/no questions each week, one for each of the target words. For example, the question for the target word gorgeous is, "Can a sunset be gorgeous?" (yes). For the target word peculiar, the question is, "Is it peculiar to see a giraffe in the zoo?" (no). This format allows the test to be administered to an entire classroom at once, rather than testing students individually. While the PPVT-4 measures general receptive vocabulary, this task assesses words specifically targeted in the classroom vocabulary instruction.

Receptive curriculum-based vocabulary assessment. In the Receptive Picture assessment, students demonstrate receptive knowledge of each weekly target word by selecting the picture that represents the word (see Appendix B).
Expressive Vocabulary Test-2 (Williams, 2007). In the current study, the control group, treatment group, and reference group (N=86) completed the Expressive Vocabulary Test-2 (EVT-2) at the beginning of the year and again at the end of the year. The EVT-2 is a standardized assessment of expressive vocabulary. In this assessment, the student is shown a picture and asked to provide a one-word response to a stimulus question related to the picture. For example, a child is shown a picture of a dog and asked, "What do you see?" The test-retest reliability is .95 for the EVT-2 (Williams, 2007). On the expressive measure of target word knowledge, responses were scored as incorrect (0), partially correct (1), or completely correct (2).

In the current study, lesson numbers were added to the bottom of each week's Yes/No assessment. A unique picture was included at the bottom of each page, next to the lesson number. This was done to ensure that students responded on the correct probe, assuming that some students might have difficulty locating page numbers alone.
Teachers were trained to instruct students to turn to the correct page in the Yes/No response booklet by referring to the lesson number and a description of the picture at the bottom of the probe (e.g., "Turn to Lesson 7 with picture of a squirrel at the bottom of the page"). The Receptive Picture assessments were included in the Elements of Reading: Vocabulary student workbooks. Each student had a workbook with his or her name written on the cover. Given that the participants were in Kindergarten, teachers were trained to take steps to ensure that students were responding on the correct page, and also to ensure that students were responding independently.
Teacher feedback midway through the study indicated that some students had difficulty circling their intended responses on the Yes/No assessment. For example, some students circled both "Yes" and "No" as a response for the same item. The original version of the response probe did not include lines separating each item, which seemed to create visual-spatial confusion for some students. For this reason, lines separating each item were added to the Yes/No response probes for the remainder of the study.

After all raw data were entered into a spreadsheet, formulas were created in Excel to automatically score student responses. Automated scoring was done to minimize human error in scoring. Unique formulas were created to score each item of the Yes/No assessment (120 items) and each item of the Receptive Picture assessment (120 items). For each item, the formula coded a score of "1" for a correct response and "0" for an incorrect response. Missing data ("Absent", "Both", "Unclear", or "No Answer" responses) were coded as "missing". If any items were "missing" in a given week, the student's score for that week was eliminated from analyses. This was done to prevent artificial deflation of scores for students with missing data. Rather than scoring a missing response as "0", the entire weekly test was considered invalid for interpretation, and the student's score for the week was coded as "missing".
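The scoring rule implemented in the Excel formulas can be expressed compactly in code. The sketch below is illustrative only (the answer key, function name, and sample responses are hypothetical), but it follows the same logic: one point per correct item, with the whole week invalidated if any item is missing.

```python
# Illustrative sketch of the weekly scoring rule; the answer key and
# responses are invented for demonstration.

ANSWER_KEY = ["Yes", "No", "Yes", "Yes", "No"]  # hypothetical 5-item key
MISSING = {"Absent", "Both", "Unclear", "No Answer"}

def score_week(responses, key):
    """Return the weekly total (0-5), or None if any item is missing,
    mirroring the rule that a week with any missing item is invalid."""
    if any(r in MISSING for r in responses):
        return None
    return sum(1 if r == k else 0 for r, k in zip(responses, key))

print(score_week(["Yes", "No", "No", "Yes", "No"], ANSWER_KEY))    # 4
print(score_week(["Yes", "Both", "Yes", "Yes", "No"], ANSWER_KEY))  # None
```

Returning `None` for the whole week, rather than scoring an ambiguous item as 0, matches the conservative rationale described above: a partially missing probe is treated as uninterpretable rather than as evidence of low performance.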
The conservative approach taken to address missing data from absences and ambiguity of item responses resulted in a relatively high incidence of missing data.
After student absences and unclear responses to items were considered, 18% of weekly scores were coded as "missing" for the Yes/No assessment (1,078 missing out of 6,000). For the Receptive Picture assessment, 16% of weekly scores were coded as "missing" (965 missing out of 6,000). An individual student's score for a weekly assessment was coded as "missing" or invalid for interpretation if one or more of the five items contained a missing score due to ambiguity of item responses or absences.

RESULTS
After the data entry and coding process was complete, the Microsoft Excel spreadsheet was uploaded into the statistical analysis program SPSS Version 20.
Descriptive statistics, graphs, and inferential statistics were examined to assess the utility of the weekly Yes/No and Receptive Picture assessments. In all inferential analyses, missing data were excluded pairwise. In other words, a participant's score was excluded from a given analysis only if the data required for that specific analysis were missing. If the same participant had the necessary data to be included in other analyses, those results were included.
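Pairwise exclusion can be sketched as follows. The helper below is a hypothetical illustration (the function name and data are invented): it computes a Pearson correlation using only the cases in which both scores are present, so a participant missing one week's score still contributes to analyses involving other weeks.

```python
# Illustrative sketch of pairwise deletion for a correlation analysis;
# data are invented, with None marking a missing weekly score.

def pairwise_pearson(xs, ys):
    """Pearson r using only the cases where both values are present,
    i.e., pairwise rather than listwise exclusion of missing data."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)

week1 = [5, 4, None, 3, 2]
week2 = [5, 3, 4, None, 1]
# Only the three complete pairs (5,5), (4,3), and (2,1) enter the calculation.
r = pairwise_pearson(week1, week2)  # r is approximately 0.98 here
```

Under listwise exclusion, by contrast, a participant missing any week would be dropped from every analysis, discarding far more data.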
The Yes/No and Receptive Picture assessment each consist of five items per week, consistent with the target vocabulary words taught on a given week. Therefore, the lowest score possible for each measure was a score of "0" and the highest score possible for each measure was a "5". Given that the assessments were administered weekly over the course of 24 weeks, many options were possible for data analysis. For example, scores could be examined separately for individual weeks, or scores could be averaged across a number of weeks, among many other options.
In the current study, weekly vocabulary data were analyzed using two methods.
First, data were examined separately for each of the 24 weeks. In other words, Yes/No and Receptive scores from Week 1, Week 2, Week 3, and so on were examined independently. Next, participants' scores for each week were averaged with scores from previous weeks (e.g., Weeks 1-2 averaged, Weeks 1-3 averaged, Weeks 1-4 averaged, etc.). An example of each approach to analyzing weekly data is presented in Table 7. The use of incrementally averaged scores allows for a quick and simple method of examining student performance over multiple weeks, and for using the most recent averaged score as an indicator of student risk level. It was reasoned that including multiple weeks of data should increase the accuracy of decisions regarding student level of risk. The incremental averaging method was also used to examine the earliest point in time at which averaged scores accurately predicted end-of-year outcomes. Averaging the scores incrementally over time allows decision makers to take multiple weeks of data into consideration.
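The incremental averaging method amounts to a running (cumulative) mean of the weekly scores. A minimal sketch, with invented scores and with missing weeks represented as None:

```python
# Illustrative sketch of incremental averaging: each week's running value
# is the mean of all valid weekly scores observed so far.

def incremental_averages(weekly_scores):
    """Return the cumulative mean after each week, skipping None entries."""
    averages, total, count = [], 0.0, 0
    for score in weekly_scores:
        if score is not None:
            total += score
            count += 1
        averages.append(total / count if count else None)
    return averages

scores = [4, 5, None, 3, 5]  # hypothetical weekly totals (0-5)
print(incremental_averages(scores))  # [4.0, 4.5, 4.5, 4.0, 4.25]
```

Because each running value pools all prior weeks, a single unusually low or high week has less influence on the most recent average, which is the rationale given above for basing risk decisions on multiple weeks of data.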

Prior to conducting inferential data analyses, the assumption of normality was examined separately for each of the 24 weeks, for the Yes/No assessment and for the Receptive Picture assessment. Means, standard deviations, skewness, and kurtosis were examined separately for Week 1, Week 2, Week 3, and so on for each of the weekly measures. Next, the assumption of normality was examined for incrementally averaged sets of data (Weeks 1-2, Weeks 1-3, Weeks 1-4, etc.). The assumption of normality was examined for the total sample (N=250), and again for the Project EVI sub-sample (N=86).
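As a rough illustration of the skewness screening, the sketch below computes the unadjusted Fisher-Pearson skewness coefficient for a hypothetical week of 0-5 scores. Note that SPSS reports an adjusted (sample-size-corrected) coefficient, so its values would differ slightly from this simplified version.

```python
# Illustrative computation of (unadjusted) Fisher-Pearson skewness,
# g1 = m3 / m2^(3/2), for a hypothetical week of 0-5 probe scores.

def skewness(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# A ceiling-heavy score pattern, like many of the weekly probes described
# in the results, produces a negative skewness value.
week = [5, 5, 5, 5, 4, 4, 3, 2]
g1 = skewness(week)  # approximately -0.89, i.e., negatively skewed
```

Negative values like this one reflect a pile-up of scores at the top of the 0-5 scale with a tail of lower scores, the pattern reported for most weekly distributions below.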

Assumption of Normality for Weekly Scores
The normality of distributions was first examined for the entire sample (N=250). The assumption of normality was examined for independent Yes/No weekly scores. The distribution of scores varied from week to week, with skewness ranging from -2.29 to -.01. The majority of distributions were negatively skewed but greater than -1.00 on the Yes/No assessment (i.e., the skewness was closer to zero than -1.00). Figure 2 shows sample histograms from Lessons 3, 9, 13, and 17 of the Yes/No data. Most of the Receptive Picture assessment distributions had skewness between -1.00 and -3.00. Figure 3 shows sample histograms from Lessons 3, 9, 13, and 17 of the Receptive data. The skewness of the Yes/No and Receptive distributions and violations of normality were somewhat expected, given that the total sample (N=250) included a majority of participants who scored at or above "Average" on the screener (78% scored above 92 on the PPVT-4). Next, the normality of distributions was examined for the sub-sample of Project EVI participants in the treatment, control, or reference groups (N=86). It was expected that these distributions would be closer to normal, given that only 34% of participants (the reference group) scored in the "Average" range on the PPVT-4 screener.

Assumption of normality for the sub-sample (N=86).
The assumption of normality was examined for the Yes/No scores each week, in the sub-sample of 86 participants in the Project EVI control, treatment or reference groups. The distribution of scores varied from week to week. The skewness of distributions ranged from -1.23 to .20. The majority of distributions were negatively skewed but greater than -.80 on the Yes/No assessment for the sub-sample (i.e., the skewness of distributions was closer to zero than -.80). As expected, the distribution of scores for the sub-sample of treatment, control and reference participants was more normal than the distribution of the total sample. However, the assumption of normality was not met for the Yes/No assessments when scores were examined for individual weeks (Kolmogorov-Smirnov=.00 for each week) for the sub-sample (N=86).
Next, the assumption of normality was examined for the sub-sample (N=86) distributions of Receptive scores each week. The distribution of scores varied from week to week. The skewness of distributions ranged from -2.87 to -.69 on the Receptive Picture assessment. The majority of distributions were negatively skewed but greater than -2.00 on the Receptive Picture assessments (i.e., the skewness of distributions was closer to zero than -2.00). The assumption of normality was not met for the Receptive Picture assessments when scores were examined for individual weeks (Kolmogorov-Smirnov=.00 for each week) for the sub-sample (N=86).

Assumption of Normality for Incrementally Averaged Scores
The normality assumption for incrementally averaged sets of data from the Yes/No and Receptive Picture assessments was examined for the entire sample (N=250). The skewness of distributions ranged from -1.09 to -.48. The majority of distributions were negatively skewed but greater than -1.00 (i.e., the skewness of distributions was closer to zero than -1.00). The assumption of normality was not met for the Yes/No assessments when incrementally averaged weekly scores were examined (Kolmogorov-Smirnov=.00 for each week) for the total sample (N=250). Similarly, the assumption of normality was not met for the Receptive Picture assessments when incrementally averaged weekly scores were examined (Kolmogorov-Smirnov=.00 for each week) for the total sample (N=250).
Next, normality of incrementally averaged distributions was examined for the sub-sample (N=86). The distribution of scores varied from week to week. The skewness of distributions ranged from -1.20 to .17. From Lesson 6 on, the majority of distributions' skewness values fell between -.10 and .17, indicating more normal distributions. Figure 4 shows sample histograms from Lessons 6, 9, 12 and 18 of the Yes/No incrementally averaged data. The assumption of normality was met for Weeks 1-6, Weeks 1-7, Weeks 1-8, Weeks 1-9, Weeks 1-10, Weeks 1-11, and Weeks 1-12 (Kolmogorov-Smirnov>.05). The assumption of normality was not met for the remaining incrementally averaged weeks (Kolmogorov-Smirnov<.05).
Finally, the normality of incrementally averaged Receptive Picture assessment distributions was examined. The assumption of normality was not met for any of the incrementally averaged Receptive Picture assessments (Kolmogorov-Smirnov=.00 for each week). The distribution was negatively skewed, with most participants achieving high scores on the incrementally averaged weekly Receptive Picture assessments.
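As a concrete illustration of the two computations described above, the sketch below derives incrementally averaged scores (the running mean through each week) and sample skewness from one student's weekly scores. The function names and the example data are illustrative only; the 0-5 score range is an assumption, not a detail confirmed by the study.

```python
from statistics import mean

def incremental_averages(weekly_scores):
    """Running mean after each week: Weeks 1, 1-2, 1-3, and so on."""
    return [mean(weekly_scores[:k]) for k in range(1, len(weekly_scores) + 1)]

def skewness(xs):
    """Adjusted Fisher-Pearson sample skewness (the statistic most
    software reports); negative values indicate a left-skewed pile-up
    of high scores, as seen in the Receptive Picture distributions."""
    n = len(xs)
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    return g1 * ((n * (n - 1)) ** 0.5) / (n - 2)

# Hypothetical weekly Yes/No scores for one student (0-5 scale assumed)
weekly = [3, 4, 5, 4, 5, 5]
running = incremental_averages(weekly)
```

Averaging incrementally smooths week-to-week noise, which is why the averaged distributions above become more normal as weeks accumulate.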

Stability of Yes/No and Receptive Picture Scores from Week to Week
To assess the stability of scores on the Yes/No assessment from week to week, individual student data were examined. To provide a more accurate (or typical) representation of individual student data, sample weekly Yes/No scores are presented in Figure 6, with graphs from sample Control, Treatment, and Reference students. These graphs illustrate how weekly results can inform instructional decisions for individual students. For example, the sample Control student's graph in Figure 6 indicates that he or she may have struggled to learn some of the words that were taught during Lessons 1 and 2.
The information provided in the graph could prompt a teacher to provide additional instruction and support to the student to bolster his or her understanding of the target words. Additionally, teachers can use individualized graphs to make decisions about the effectiveness of instruction for individual students, by examining overall patterns of achievement over time.

Descriptive Information for Each Group
Descriptive information was examined for each of the groups (Control, Treatment, Reference, and Remaining) on each of the pre-intervention and post-intervention measures (PPVT-4, EVT-2, Target Expressive, and Target Receptive Picture assessments). As indicated in Table 8, the Remaining group (students who were not followed for the purposes of Project EVI but participated in the current study) obtained the highest scores on the PPVT-4 pre-intervention, but did not complete other pre-intervention or post-intervention measures. As expected, the Reference group (typically achieving students) obtained the highest average scores on most of the pre-intervention and post-intervention measures. However, the Treatment group (at risk students who received Tier II supports) obtained the highest scores on the two target word post-intervention measures.
Descriptive data were also examined regarding the average Yes/No and Receptive scores (from Weeks 1-24) for each group. As predicted, the control group's average scores were the lowest of the Project EVI groups across the twenty-four (24) weeks.

Inferential Findings
Next, inferential statistics were calculated to examine the following: correlations between the weekly vocabulary assessments and outcome measures, between-group differences on the weekly assessments, classification accuracy of the weekly assessments, and ROC curve analyses.

Correlations between Weekly Vocabulary Assessments and Outcomes
Pearson product-moment correlation coefficients were calculated to examine the relationship between scores on the incrementally averaged weekly measures and post-intervention outcome scores on the Target Receptive, Target Expressive, PPVT-4 and EVT-2 measures. The analyses were only conducted with the Project EVI subgroup (N=86), given that outcome data were not available for the "remaining" group.
As indicated in Table 9, there were medium to large, positive correlations between Yes/No incrementally averaged scores each week and each outcome measure, with higher Yes/No scores associated with higher scores on outcome measures. The incrementally averaged weekly Receptive Picture assessments did not correlate significantly with any of the outcome measures. This finding is somewhat expected, given the ceiling effect that was found in the distribution of the weekly Receptive Picture assessments, with most participants demonstrating high scores. The correlation between the pre-intervention measures (PPVT-4 and EVT-2) and the post-intervention outcome measures was positive, and ranged from medium to large (see Table 9). Note: The Pearson correlation coefficients can be interpreted using Cohen's (1988) guidelines (r=.10 to .29 is small; r=.30 to .49 is medium; .50 to 1.0 is large).
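The correlation coefficients and the Cohen (1988) interpretation bands cited above can be computed as sketched below; the helper names are illustrative and the formula is the standard Pearson product-moment definition.

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cohen_label(r):
    """Cohen's (1988) guidelines as applied in the study:
    .10-.29 small, .30-.49 medium, .50-1.0 large."""
    r = abs(r)
    if r >= 0.50:
        return "large"
    if r >= 0.30:
        return "medium"
    if r >= 0.10:
        return "small"
    return "negligible"
```

For instance, an incrementally averaged Yes/No score vector and a Target Expressive outcome vector would be passed as `xs` and `ys`; a returned value of .45 would be labeled "medium" under these bands.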

Between-Group Differences in the Weekly Vocabulary Assessments
Group differences on the Yes/No assessment were explored by conducting Mann-Whitney U tests. The Mann-Whitney U test is a non-parametric test of differences between two independent groups. This method was used given that not all of the distributions of incrementally averaged data conformed to the assumption of normality. The Mann-Whitney U test was conducted to examine differences in Yes/No incrementally averaged scores between the treatment and control groups. It was expected that the treatment group scores would be significantly higher than the control group scores on the Yes/No assessment. This finding was expected given that the treatment group received supplementary (Tier II) instruction throughout the year that the control group did not receive. Group differences were explored separately for each incrementally averaged week (e.g., Weeks 1-2, Weeks 1-3, Weeks 1-4, etc.). This approach allowed the researcher to explore the earliest point in time at which group differences emerge between the treatment, control, and reference groups.
First, differences were explored between the treatment and control group performance on the Yes/No assessment. The treatment group scores were higher than the control group scores on each incrementally averaged week. A series of Mann-Whitney U tests revealed that from Weeks 9 to 24, there were significant group differences in scores on the Yes/No incrementally averaged measure between the treatment group and the control groups (p<.03), with small to medium effect sizes (see Table 10). Next, differences between the reference and control group scores on the Yes/No assessment were explored. The reference group scores were higher than the control group scores on each incrementally averaged week. Mann-Whitney U tests were conducted and effect sizes were calculated to determine the magnitude of the group differences. It was expected that the reference group scores would be significantly higher than the control group. A series of Mann-Whitney U tests revealed that from Weeks 4 to 24, there were significant group differences in scores on the Yes/No incrementally averaged measure between the reference group and the control group (p=.00), with medium effect sizes (see Table 10).
Group differences were also examined between the treatment and reference groups. The reference group scores were higher than the treatment group scores on each incrementally averaged week. Mann-Whitney U tests were conducted and effect sizes were calculated to determine the magnitude of the group differences. It was expected that the reference group scores would be significantly higher than the treatment group. A series of Mann-Whitney U tests revealed significant differences between the reference and treatment group scores on incrementally averaged Yes/No scores for the majority of weeks (p<.05), with small to medium effect sizes. No significant group differences were found between the treatment and reference group incrementally averaged scores on Week 2 and Week 10 (p>.05).
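A minimal sketch of the Mann-Whitney U test with the normal approximation is given below, including an effect size computed as r = |z| / sqrt(N). The z-based effect size is the common convention for this test; the study does not specify its exact computation, so this is an assumption. Tie and continuity corrections are omitted for brevity.

```python
from math import sqrt, erf

def _ranks(values):
    """Midranks (ties share the average of their ranks), 1-based."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + j + 1) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            r[order[k]] = avg
        i = j
    return r

def mann_whitney_u(group_a, group_b):
    """Return (U, z, two-sided p, effect size r = |z|/sqrt(N)) using the
    normal approximation without tie or continuity corrections."""
    n1, n2 = len(group_a), len(group_b)
    all_ranks = _ranks(list(group_a) + list(group_b))
    r1 = sum(all_ranks[:n1])
    u = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return u, z, p, abs(z) / sqrt(n1 + n2)
```

In practice (and in the study's sample sizes) a statistics package with exact tie handling would be preferable; this sketch only makes the mechanics of the reported U, z, p, and r values concrete.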
In summary, the Yes/No assessment captured statistically significant differences in scores between the treatment, control, and reference groups in Project EVI, with small to medium effect sizes. As expected, the Yes/No incrementally averaged scores were higher for the treatment group compared to the control group.
However, statistically significant group differences did not emerge until Week 6 of instruction. Statistically significant differences were seen between the treatment and reference groups by Week 4 (p<.02, r=.25), and statistically significant differences were found between the control and reference groups by Week 2 (p<.01, r=.28). On the Receptive Picture assessment, in contrast, no statistically significant differences were found between the treatment and control groups (U=339.50, z=-.83, p=.41). This finding provides more evidence that the Receptive Picture assessment did not distinguish between varying levels of target word knowledge. As indicated in the preliminary findings, the Receptive Picture assessment had a ceiling effect (most participants earning high scores), which limits the utility of the measure for accurately gauging word learning. However, the results provide initial evidence that the Yes/No assessment did differentiate between varying levels of word knowledge.

Classification Accuracy of the Weekly Vocabulary Assessments
Analyses were conducted to examine the sensitivity (SE), specificity (SP), positive predictive power (PPP), and negative predictive power (NPP) of the weekly vocabulary assessments. To calculate classification accuracy results, multiple pass/fail cut-scores were selected for each of the predictor measures (the Yes/No weekly measure, the pre-intervention PPVT-4, and the pre-intervention EVT-2). Cut-scores were also selected to dichotomize "passing" and "failing" for each of the target word outcome measures (post-intervention Target Receptive and post-intervention Target Expressive).
Multiple cut-scores were examined for each of the predictor measures (Yes/No assessment, pre-intervention PPVT-4, and pre-intervention EVT-2). For the Yes/No incrementally averaged measures, the cut-scores examined were scores below 3.25, 3.50, and 3.75 (see Table 11). The goal in examining classification accuracy using multiple pass/fail cut-scores was to find the most appropriate cut-scores to maximize sensitivity and specificity. For example, setting a very high pass/fail predictor cut-score would likely result in high levels of sensitivity, but low levels of specificity. Setting a very low pass/fail predictor cut-score would likely result in high levels of specificity but low levels of sensitivity. Conducting multiple classification analyses using a range of cut-scores aided decision-making regarding the most appropriate cut-score for predictor measures.
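The four classification indices can be computed from dichotomized predictor and outcome scores as sketched below. Following the study's convention, a predictor score below the cut-score flags a student as at risk, and an outcome score below its cut-score counts as "failing"; the function name and example values are illustrative.

```python
def classification_accuracy(pred_scores, outcome_scores, pred_cut, outcome_cut):
    """Dichotomize at the cut-scores (below cut = at risk / failing) and
    return sensitivity, specificity, PPP, and NPP."""
    tp = fp = tn = fn = 0
    for p, o in zip(pred_scores, outcome_scores):
        flagged = p < pred_cut      # predictor flags the student as at risk
        failed = o < outcome_cut    # outcome confirms poor performance
        if flagged and failed:
            tp += 1                 # true positive: correctly flagged
        elif flagged:
            fp += 1                 # false positive: flagged but passed
        elif failed:
            fn += 1                 # false negative: missed an at risk student
        else:
            tn += 1                 # true negative: correctly not flagged
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    ppp = tp / (tp + fp) if tp + fp else float("nan")
    npp = tn / (tn + fn) if tn + fn else float("nan")
    return se, sp, ppp, npp
```

Raising `pred_cut` flags more students, which tends to raise SE and lower SP, which is exactly the trade-off the multiple cut-score analyses above were designed to balance.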

The cut-scores for the post-intervention Target Receptive and post-intervention Target Expressive measures were determined by examining base rates of "failing" participants under various cut-scores. Cut-scores on the Target Expressive and Target Receptive Picture assessments that categorized the lowest 30% of scores in the sample as "failing" were used for the classification analyses. Participants scoring below the 30th percentile on the Target Receptive Picture assessment achieved scores under 12; therefore, 12 was used as the pass/fail cut-score for the classification analyses. Similarly, participants scoring below the 30th percentile on the Target Expressive measure achieved scores under 10; therefore, 10 was used as the pass/fail cut-score for the classification analyses.
Using the formulas presented in Figure 6, the sensitivity (SE), specificity (SP), positive predictive power (PPP), and negative predictive power (NPP) of the weekly Yes/No assessment, the pre-intervention PPVT-4, and the pre-intervention EVT-2 were calculated for each of the target word outcome measures. Table 11 presents the classification accuracy of the Yes/No incrementally averaged data sets, using the post-intervention Target Receptive outcome measure. Table 12 presents classification accuracy results for the pre-intervention PPVT-4 and EVT-2, also using the post-intervention Target Receptive outcome measure. The purpose of conducting these analyses was to examine the predictive validity of the Yes/No incrementally averaged measure in comparison to other methods (the PPVT-4 and the EVT-2).
As shown in Table 11, incrementally averaged Yes/No data sets with adequate classification accuracy are highlighted in bold font.
In Table 12, classification accuracy data are displayed for the pre-intervention PPVT-4 and EVT-2; again, the outcome measure used was the post-intervention Target Receptive assessment. Next, classification accuracy was examined using the post-intervention Target Expressive measure as the outcome. The classification accuracy of the Yes/No incrementally averaged data (Table 13) was again compared with the pre-intervention PPVT-4 and EVT-2 measures (Table 14). Comparing the classification accuracy findings between the Yes/No assessment and the PPVT-4 and EVT-2 measures, there is evidence that incrementally averaged Yes/No data were more useful for accurately predicting which students were at risk for low performance on an end-of-year target word outcome measure (Target Expressive outcome). Comparing classification accuracy data from Tables 13 and 14, the Yes/No assessment was more accurate than the pre-intervention PPVT-4 or EVT-2 in predicting performance on the Target Expressive outcome measure, beginning at Week 4 (SE=.78; SP=.68; K=.39).
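The kappa (K) values reported alongside SE and SP can be reproduced from a 2x2 classification table with Cohen's kappa, which corrects observed agreement between predicted risk and observed outcome for chance agreement. The sketch below assumes the standard formula; the cell labels follow the usual screening convention.

```python
def kappa_2x2(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of predicted risk vs. observed
    outcome: (observed agreement - chance agreement) / (1 - chance)."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement
    # chance agreement from the row and column marginals
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (po - pe) / (1 - pe)
```

Kappa is 1.0 for perfect agreement and 0.0 when the predictor does no better than chance, so a value such as K=.39 indicates modest agreement beyond chance.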

Receiver Operating Characteristic Curve (ROC Curve) Analyses
Next, Receiver Operating Characteristic (ROC) curve analyses were conducted with each of the predictors (Yes/No assessment, pre-intervention PPVT-4, and pre-intervention EVT-2). ROC curves plot the true-positive rate against the false-positive rate for varying cutoff scores on a predictor measure (Compton, Fuchs, Fuchs, & Bryant, 2006). The ROC curve analysis allows for the examination of the combinations of sensitivity and specificity that are possible for a given predictor and a given outcome. The Area Under the Curve (AUC) is an indicator of the overall classification accuracy of a predictor. In the current analysis, the AUC indicates the degree to which a predictor measure correctly classifies students according to end-of-year outcomes. According to Compton et al. (2006), an AUC below .70 is poor; .70 to .80 is fair; .80 to .90 is good; and .90 and above is considered excellent.
Cut-scores for "passing" or "failing" outcome measures were again selected based on scores that yielded a "failing" base rate of less than 30% of participants. This method categorized the lowest 30% of scores as "failing" the outcome vocabulary assessments. Cut-scores for the post-intervention PPVT-4 and EVT-2 were selected using nationally normed base rates for standard scores (Dunn & Dunn, 2007, for the PPVT-4; Williams, 2007, for the EVT-2). With the PPVT-4 and EVT-2, scores that fell below the 30th percentile (standard scores under 92) were categorized as "failing" scores for the purposes of the classification analyses. Table 15 presents the AUC values for each predictor and outcome measure.

PPVT-4 and EVT-2 ROC curves. As indicated in Table 15, the AUC for the pre-intervention EVT-2 measure was examined for each of the outcome measures. On the Target Receptive and Target Expressive outcomes, the pre-intervention EVT-2 AUC was considered "fair" (AUC=.72). On the post-intervention PPVT-4, the pre-intervention EVT-2 AUC was considered "good" (AUC=.82). On the post-intervention EVT-2, the pre-intervention EVT-2 AUC was also considered "good" (AUC=.80). Overall, the results indicate that the pre-intervention EVT-2 measure provided fair classification accuracy for the target word outcomes, and good classification accuracy for general or distal vocabulary outcomes.
Yes/No assessment ROC curves. The AUC for the incrementally averaged Yes/No assessments varied across outcome measures and number of weeks (see Table 15). In summary, ROC curve analyses indicated that the Yes/No assessment provided greater classification accuracy for target word outcomes than the pre-intervention PPVT-4 and EVT-2 measures. The AUC indicated that the Yes/No assessment was "good" by Week 8 for predicting Target Receptive outcomes, and "good" by Week 11 for predicting Target Expressive outcomes. The findings also indicate that the pre-intervention PPVT-4 measure did not provide "good" classification accuracy for the target word outcomes (AUC=.73). Similarly, the pre-intervention EVT-2 measure did not provide "good" classification accuracy for the target word outcomes (AUC=.72). On the other hand, the pre-intervention PPVT-4 and EVT-2 measures did provide "good" classification accuracy for the general vocabulary outcome measures (post-intervention PPVT-4 and EVT-2), while the Yes/No assessments did not. Overall, the findings indicate that the Yes/No assessments were stronger in predicting target vocabulary word outcomes, while the PPVT-4 and EVT-2 measures were stronger in predicting general vocabulary outcomes.

Teacher Questionnaire Results
Eighteen Kindergarten teachers from Project EVI completed brief questionnaires (see Appendices E, F) regarding their experiences administering and using the weekly vocabulary assessments (Yes/No and Receptive). Responses from the 13 teachers who participated in the current project are presented in Table 16.
As summarized in Table 16, teachers generally found the weekly assessments efficient and easy to administer.

DISCUSSION
The implementation of multi-tiered systems of support in schools holds promise for addressing low achievement for disadvantaged and struggling students.
Research has demonstrated that Kindergarten interventions are particularly effective in preventing reading difficulties for at risk students (Cavanaugh, Kim, Wanzek, & Vaughn, 2004). While there is extensive research informing instruction and assessment for word recognition skills within a multi-tiered context, less attention has been focused on promoting early vocabulary growth (Biemiller, 2001; Loftus & Coyne, 2013; Paris, 2005). Recent research (Beck & McKeown, 2007; Blachowicz et al., 2013; Coyne et al., 2004, 2007, 2009; Loftus et al., 2010) has contributed greatly to informing educators of best practices regarding early vocabulary instruction and intervention. However, within a multi-tiered framework, educators must have adequate and useful tools for identifying students who are at risk. While many curriculum-based assessments and tools have been developed to identify student risk level for word recognition skills and reading comprehension skills (e.g., DIBELS; University of Oregon, 2014), more research is needed to examine methods of assessing early vocabulary knowledge within an RtI framework (Loftus & Coyne, 2013; NRP, 2000). The purpose of the current study was to examine the utility of two curriculum-based assessments of vocabulary that are embedded within the Elements of Reading: Vocabulary curriculum (Beck & McKeown, 2004). Specifically, the current study examined teacher perceptions of the assessments (social validity) and the extent to which either or both of the assessments accurately identified students who were at risk for poor end-of-year vocabulary outcomes (predictive validity).

Summary of Results
The findings of the current study provided evidence that the Yes/No assessment embedded within the Elements of Reading: Vocabulary curriculum (Beck & McKeown, 2004) accurately identified students who were at risk for poor end-of-year outcomes in target word knowledge. In other words, the results support the use of the Yes/No assessment for identifying students who are at risk and therefore would benefit from additional instructional support (i.e., Tier II or Tier III vocabulary interventions). Furthermore, the findings indicate that averaged Yes/No assessment data from Weeks 1-8 provided greater classification accuracy for end-of-year target word outcomes than the PPVT-4 or EVT-2 screening measures. Additionally, the Yes/No assessment data captured statistically significant differences in target word vocabulary knowledge between at risk students who received Tier I support only (Project EVI control group, lowest scores), at risk students who received Tier I and Tier II support (Project EVI treatment group), and typically achieving students who received Tier I supports (Project EVI reference group, highest scores).
On the Receptive Picture assessment embedded within the Elements of Reading: Vocabulary curriculum, a ceiling effect (i.e., most students achieved high scores) limited the ability to use the assessment to identify students who were at risk.
While there may be advantages to administering the Receptive Picture assessment (e.g., providing students with an additional opportunity to practice using target words), the current findings indicate that the Receptive Picture assessments in the Elements of Reading: Vocabulary curriculum are not useful for predicting end-of-year vocabulary outcomes, or for differentiating between students receiving Tier I versus Tier II support.
Results from teacher questionnaires indicate that Kindergarten teachers found the assessments to be very easy to administer and they did not believe that the administration of the assessments was particularly time-consuming. While some teachers chose to examine the results of student assessments often, other teachers chose not to examine student responses at all. The majority of the teachers indicated that they would be likely to use the assessments in the future (85% would use the Yes/No assessment; 77% would use the Receptive Picture assessment). However, teachers noted areas for improvement on the assessments. Recommendations for improving the Receptive Picture assessment include increasing the difficulty of items, and selecting pictures that were less ambiguous for interpretation. Recommendations for the Yes/No assessment included changing the visual-spatial organization of probes to clearly separate each item, and to include explicit visuals next to each item (e.g., a thumbs up picture paired with each "Yes" and a thumbs down picture paired with each "No").

Limitations
While the findings of the current study provide initial evidence of the utility of the Yes/No assessment for predicting end-of-year target word vocabulary outcomes, many limitations must be noted. First, strong evidence for classification accuracy was not established until Week 8 of the Yes/No assessment. Assessment methods with good classification accuracy that could identify at risk students earlier than eight weeks would be preferable. Additionally, the classification accuracy of the Yes/No measure was "good" by Week 8 (AUC>.80), but not "excellent" (AUC>.90).
Educators using this assessment to identify students at risk should be mindful that the measure does not have perfect classification accuracy. Some students who are truly at risk might perform well on the Yes/No assessments, and some students who are not at risk might perform poorly on the assessments. While the current study demonstrates that the Yes/No assessment had good classification accuracy for identifying student risk on target word assessments, the assessment was not accurate in classifying student risk for end-of-year general vocabulary outcomes, as measured by the PPVT-4 and the EVT-2.
In the current study, teachers administered the weekly assessments, rather than researchers. While the teachers were trained and fidelity observations were conducted, it is possible that the teachers did not always administer the assessments in a standardized method. Additionally, the assessments were administered in a whole-class format, which increases the possibility that students did not always complete the assessments completely independently. Teachers were trained to take steps to ensure that students completed the assessments independently, and fidelity observations noted a few instances where students did not complete the assessments independently (i.e., 'peeking' at neighbors' responses); however, it was not possible for the researcher to comprehensively monitor the degree to which assessments were completed independently. To minimize error, six of the initial 19 classrooms (31.5%) were eliminated from analyses in the current study due to low levels of independent work on the assessments (either during fidelity observations, or as reported in the teacher questionnaire). The relatively high percentage of classes that were not able to complete the assessments independently brings into question the social validity of whole-group test administration in early Kindergarten.
Finally, some students struggled with visual-spatial orientation for the Yes/No assessments in early Kindergarten, and some student responses were ambiguous (e.g., both "Yes" and "No" were circled for the same item). For each item with an ambiguous response, the student's score for the entire week was omitted from analyses, leading to the problem of occasional missing data. Additionally, because the current study took place in elementary schools rather than a controlled environment, student absences also led to missing data. While these limitations pose challenges for research purposes, they accurately reflect the day to day considerations for assessment practices at the early elementary level.

Considerations for Early Vocabulary Assessment
Given that vocabulary is an "unconstrained" skill (Paris, 2005), the method used to identify at risk students in the current study differs from the conventional methods used to screen for poor word recognition skills. For example, most screening tools do not align exactly with the content of the curriculum, yet contain at least some material that is known to the student, even if the amount of known material is minimal. It is expected that the target words selected for direct vocabulary instruction will be unknown to students prior to instruction; therefore, it would not be appropriate or useful to screen students prior to instruction using target words. Many researchers have relied on measures of general vocabulary knowledge, which sample both known and unknown words, to identify students at risk for poor vocabulary outcomes.
However, these methods have substantial limitations for use in a classroom context.
The current study examined the utility of standard, ongoing curriculum-based vocabulary assessments to identify students at risk for poor vocabulary outcomes. In other words, ongoing or formative curriculum-based vocabulary assessment results were used to identify students who did not respond to Tier I instruction.
In selecting tools to identify students at risk, it is important to specify the outcome. In other words, it is necessary to specify exactly what a student is or is not at risk for. Within a Tier I direct vocabulary instruction curriculum such as the Elements of Reading: Vocabulary, the primary instructional goal is for students to learn the target words or proximal words that were taught directly. A secondary goal for instruction is to expand students' transfer or distal word learning and language comprehension. Although a handful of studies have demonstrated initial evidence for distal vocabulary gains through short term vocabulary instruction and intervention (e.g., Coyne et al., 2010; Elleman, Lindo, Morphy, & Compton, 2009), there is strong research supporting increases in target word learning through direct vocabulary instruction (Beck et al., 2002; Biemiller & Boote, 2006; Loftus et al., 2010).
The use of standardized measures of general vocabulary knowledge as a universal screener or outcome assessment for early vocabulary instruction is problematic for several reasons. First, such measures lack sensitivity to capture knowledge of the specific target words taught. For example, imagine a Kindergarten student who learned over 100 "Tier Two" vocabulary words over the course of an academic year through direct vocabulary instruction. An outcome assessment that measured target word knowledge is more sensitive and appropriate for capturing the student's gains, compared to standardized measures of general vocabulary knowledge.
Within an RtI context, screeners typically provide teachers with two levels of important information. First, screening results identify individual students who are at risk for poor outcomes and are in need of additional support. Additionally, screening results provide teachers with an overall conceptualization of Tier I instructional effectiveness, by examining the number of students who are not responding to Tier I instruction. A limitation to relying on standardized measures of general vocabulary knowledge as screeners is that such measures do not allow teachers to examine the overall effectiveness of their direct vocabulary instruction. For example, imagine that most students in a Kindergarten class were not responding to Tier I vocabulary instruction. The use of curriculum-based vocabulary assessments could inform the teacher that there is a need to change Tier I instruction to increase the percentage of students who respond positively. Unlike curriculum-based assessments, measures of general vocabulary knowledge do not provide specific information regarding the effectiveness of the local instruction.
Another limitation to using standardized measures of general vocabulary knowledge is that the scores are typically interpreted in terms of percentile ranks and compared with national norms. This means that even if a student's performance improves (raw score increases), the student's relative ranking (standard score) is not likely to indicate an improvement unless the gain is substantial. Additionally, general measures such as the PPVT-4 and EVT-2 are not designed to be administered repeatedly within a short period of time. In schools, these measures are commonly administered by specialists for the purposes of individual evaluations.
Using these tools too frequently can result in practice effects and invalidate the use of the data for multiple purposes.
A practical limitation to using standardized measures of general vocabulary knowledge is the amount of time and training needed to administer the measures and score the protocols. Such measures require individual administration, and can take 20 to 30 minutes to complete. In a classroom of 20 Kindergarten students, it would take over six hours to complete testing using a measure such as the PPVT-4 or EVT-2, with an additional two to three hours dedicated to scoring and interpreting results. In the current study, the average time spent administering the Yes/No assessments was 3.82 minutes, and weekly results for an entire class could be calculated within several minutes.
In early vocabulary intervention studies, researchers typically conduct pre-tests of target word knowledge. Doing so allows researchers to account for initial target word knowledge, and make accurate claims regarding growth in target word knowledge at the time of the posttest. In practice, it may not be appropriate to administer such pretests of target word knowledge, particularly if "Tier Two" words are selected for instruction and it is not likely that students have prior word knowledge (Beck & McKeown, 2002). Instead, assessment of target word knowledge can provide valuable information for teachers when administered after direct instruction has occurred. Collecting multiple weeks of data can aid teachers in identifying students who are not responding to Tier I instruction, and are in need of additional support.
In the current study, assessments were administered on a weekly basis, following direct vocabulary instruction. The weekly scores were averaged incrementally over time for each participant, and interpreted as students' response to the vocabulary instruction. Students with higher averaged scores (i.e., scores above 3.50) were considered to be responding well to the Tier I instruction, with low levels of risk for poor end-of-year vocabulary outcomes. Students with lower averaged scores (i.e., scores below 3.50) were considered to be struggling to learn with Tier I instruction alone, with high levels of risk for poor end-of-year vocabulary outcomes.
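The decision rule described above can be sketched for a single student. The sketch below flags each week at which the incrementally averaged Yes/No score falls below the cut (3.50 in the current study); the function name and the example scores are hypothetical.

```python
def weeks_at_risk(weekly_scores, cut=3.50):
    """For one student, return the 1-based week numbers at which the
    incrementally averaged Yes/No score falls below the cut-score."""
    flagged = []
    total = 0.0
    for week, score in enumerate(weekly_scores, start=1):
        total += score
        if total / week < cut:   # running mean through this week
            flagged.append(week)
    return flagged
```

A student flagged repeatedly in consecutive weeks would be considered to be struggling with Tier I instruction alone and a candidate for additional support.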
In the current study, at risk students who did not receive Tier II supports demonstrated an average score of 3.29 on the Yes/No assessment, and at risk students who did receive Tier II supports demonstrated an average score of 3.67. Students who were identified as low risk earned an average score of 4.02 on the Yes/No assessment.

Jenkins, Hudson, and Johnson (2007) reviewed considerations to be made when selecting appropriate screening tools, emphasizing the importance of efficiency and classification accuracy. Criterion validity is often used by researchers to evaluate the utility of measures. While criterion validity provides useful information regarding the relationship between two measures, that information is insufficient for establishing the utility of a screening or predictive measure. Effective screening measures not only correlate with important and relevant outcomes, but also accurately classify students as being at risk or not at risk for poor outcomes (Jenkins et al., 2007).
The National Center on Response to Intervention (2011) provides a review of technical information regarding commonly used screening tools, including evidence of classification accuracy for each measure.

A challenge of identifying adequate vocabulary assessments in an RtI framework is the necessity of a Tier I vocabulary curriculum. With a whole-class vocabulary curriculum in place, educators have the opportunity to test the same words that they teach, aligning the assessment with the curriculum. At the secondary level, many educators use vocabulary curriculum-based assessments to monitor student learning; however, few assessment practices are currently available for early vocabulary instruction. A common method of curriculum-based assessment for vocabulary at the secondary level is the use of vocabulary matching CBAs (Espin, Shin, & Busch, 2005). However, at the Kindergarten level such methods are unavailable because students have not yet learned to read and write to demonstrate their vocabulary knowledge.

Experimenter-developed target word measures used across early vocabulary intervention studies include the following:

Contextual Word Knowledge: Expressive
Child answers a contextual question about the target word orally.

Expressive Word Knowledge
Child is shown a picture of a target word or is given a verbal definition of the word, and produces the target word orally (i.e., says the word).

Contextual Word Knowledge: Yes/No
Child answers a contextual question about a target word, with a response of "Yes" or "No". 12.5% (n = 4)

Story Retell
Child listens to a story and retells the story immediately following.

Metalinguistic Awareness
Child demonstrates the ability to reflect on and manipulate language.

Language Samples
Child's use of general vocabulary is observed and recorded by the researcher. 6.25% (n = 2)

Spelling Target Words
Child listens to target words read aloud and writes the words.

Categorical Word Knowledge
Child demonstrates the ability to sort words into appropriate categories. 3.1% (n = 1)

Note: This table was adapted from Hardy, Furey, & Loftus (2013). Twenty-six early vocabulary intervention studies were examined, and some of the studies used more than one experimenter-developed target word measure.

Future Directions
Early vocabulary assessments that are aligned with Tier I instruction can provide useful information regarding the effectiveness of Tier I instruction and individual students' level of risk for poor target word outcomes. However, the current study explored only two methods of vocabulary measurement (Yes/No and Receptive Picture assessments). More research is needed to examine other forms of vocabulary assessment (e.g., expressive assessments), as well as other methods of administering vocabulary assessments (e.g., one-to-one, peer, and computer-based assessments).
A promising area of research on early vocabulary assessment involves the use of technology (computers, tablets, etc.) to administer assessments and provide teachers and students with immediate feedback. In the current study, only 56% of the teachers took the time to examine assessment results. It is essential that teachers are able to access assessment results in a timely manner, in order to make appropriate instructional decisions (Fuchs & Fuchs, 1986;Stecker, Fuchs & Fuchs, 2005).
Technology-based assessments have the potential to provide teachers with immediate feedback and store information regarding classroom outcomes and district outcomes.
Educators would benefit from easily accessible data regarding student progress and level of risk. Additionally, researchers and practitioners are encouraged to collect local screening data and conduct classification analyses using relevant "gold standard" outcomes (National Center on Response to Intervention, 2010).
Another consideration for future research involves examining the trade-off between using vocabulary assessments that provide comprehensive information and using vocabulary assessments that are efficient and manageable to administer. For example, most of the vocabulary assessments used at the Kindergarten level are administered to students individually, given that Kindergarten students are not yet able to read or write to express their word knowledge. Assessments that can be administered individually have benefits in terms of the type of information that can be gathered at the Kindergarten level, and individual administration ensures that students respond independently. However, individually administered assessments are more time consuming. Assessments that can be administered in a whole class or small group setting have important benefits in terms of efficiency. Maximizing the quality of the vocabulary assessments (psychometric and predictive properties) and the efficiency of administering the assessment (time and ease of administration) is crucial for promoting data-based instructional decision making for vocabulary development.

Conclusions
Findings from the current study suggest that the ongoing use and interpretation of curriculum-based vocabulary assessments within a Response to Intervention framework can provide useful and accurate information regarding student response to instruction and level of risk. In fact, the findings demonstrated that curriculum-based assessments of vocabulary knowledge can provide more useful information regarding risk for poor target word outcomes than standardized measures of general vocabulary knowledge. Previous research on early vocabulary instruction and intervention has largely used proximal or direct, experimenter-developed assessments of target words as the gold standard outcome measures (NRP, 2000). A primary reason for developing or selecting measures that assess target word knowledge directly is that such "proximal" measures have higher levels of sensitivity to growth in vocabulary, compared with standardized measures of general or "distal" word knowledge. In short, researchers agree that the most direct method of capturing student learning within a multi-tiered vocabulary curriculum is to assess the same words that were taught, or to use curriculum-based assessments (NRP, 2000).
In the current study, multiple weeks of Yes/No assessment data were incrementally averaged over time, with the goal of examining how well individual students respond to Tier I vocabulary instruction, which students are at risk for poor end-of-year target word outcomes, and how many data points are necessary for accurate classification. It is not typical practice to use ongoing assessment results as a universal screener to identify students at risk for poor outcomes. More typically, researchers have used standardized measures of general vocabulary knowledge to screen students and identify those who are likely to be at risk for poor vocabulary outcomes (Coyne et al., 2009). Indeed, standardized measures of general vocabulary knowledge such as the PPVT-4 or EVT-2 are more useful than curriculum-based assessments for identifying at risk students prior to instruction. It stands to reason that most students would achieve low scores on vocabulary curriculum-based assessments administered prior to direct vocabulary instruction, because the words assessed would not yet have been taught. However, the use of standard, ongoing curriculum-based vocabulary assessments can allow educators and researchers to assess individual student response to instruction or intervention.
If the intended use of assessment data is to accurately identify students who are at risk for a given outcome, it is important to examine classification accuracy of the assessment using relevant outcomes. Surprisingly, many of the widely used screening measures in the domain of reading have not demonstrated adequate levels of classification accuracy (National Center on Response to Intervention, 2011).
Considering that it is nearly impossible for an assessment to have perfect classification accuracy, researchers have emphasized a need for balancing levels of sensitivity and specificity. In recognition of the inherent measurement error that is associated with assessments, researchers and educators must consider trade-offs between selecting cut scores that yield high sensitivity (i.e., the screener detects almost all of the at risk students) yet sacrifice specificity (i.e., some of the students identified as being at risk are not actually at risk) (Petscher, Kim, & Foorman, 2011). While it is desirable to provide every at risk student with additional supports, screeners with high sensitivity and poor specificity will over-identify the number of students at risk. Given the limited resources for Tier II and III (supplemental) instructional supports, it is in the best interest of schools to use measures with adequate sensitivity and specificity for important outcomes. However, in an RtI framework, researchers have emphasized the need to maximize sensitivity in order to provide timely services for students who are at risk (Jenkins et al., 2007;Petscher et al., 2011).
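The sensitivity and specificity trade-off discussed above can be made concrete with a short computation. This is an illustrative sketch, not an analysis from the study: the function name and the small sample of predicted/actual labels are hypothetical assumptions introduced only to show how the two rates are derived from a screener's classifications.

```python
# Illustrative sketch (hypothetical data, not study results): computing
# sensitivity and specificity for a screener's risk classifications.
# 1 = at risk, 0 = not at risk.


def sensitivity_specificity(predicted_at_risk, actual_at_risk):
    """Return (sensitivity, specificity) from paired 0/1 labels."""
    tp = fn = tn = fp = 0
    for pred, actual in zip(predicted_at_risk, actual_at_risk):
        if actual:
            if pred:
                tp += 1  # correctly flagged as at risk
            else:
                fn += 1  # missed at-risk student
        else:
            if pred:
                fp += 1  # over-identified as at risk
            else:
                tn += 1  # correctly cleared
    sensitivity = tp / (tp + fn)  # share of truly at-risk students flagged
    specificity = tn / (tn + fp)  # share of not-at-risk students cleared
    return sensitivity, specificity


# Hypothetical screener results for eight students
predicted = [1, 1, 1, 0, 0, 1, 0, 0]
actual = [1, 1, 0, 0, 0, 1, 1, 0]
sens, spec = sensitivity_specificity(predicted, actual)
```

Lowering a cut score raises sensitivity (fewer at-risk students are missed) but tends to lower specificity (more students are flagged unnecessarily), which is the resource trade-off described above.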
The current study provides a framework for examining the predictive validity of curriculum-based vocabulary assessments within a multi-tiered system of instruction. Importantly, vocabulary assessments that are technically adequate and efficient to use will be the most useful in a classroom context. While more research is being conducted regarding best practices for early vocabulary instruction and intervention (e.g., Biemiller & Boote, 2006; Coyne et al., 2010), less research is focused on early vocabulary assessments within an RtI framework. Perhaps one of the greatest challenges to examining vocabulary assessment within an RtI framework is the necessity of a Tier I vocabulary curriculum. With increased attention to early vocabulary instruction and intervention, there are increased opportunities to simultaneously evaluate the utility of vocabulary assessments. Researchers are encouraged to examine the utility of vocabulary assessments within the context of multi-tiered early vocabulary instruction and intervention. Appropriate and efficient tools for identifying students at risk for poor vocabulary outcomes will permit educators to intervene early and support learning outcomes for all students.

10. Teacher asks students to turn to the appropriate page in the Receptive Workbook (or takes steps to ensure that students are responding on the correct page of the workbook).
11. Teacher reads each question loudly and clearly, and gives students enough time to respond.
12. Teacher ensures that each student completes the assessment independently; no guidance is given related to correct or incorrect answers (until after responses have been recorded).