A New Direction in Testing with English Learners


Given the nature and extent of the psychometric problems inherent in the evaluation of English learners, any effort in test development will likely form some sort of compromise relative to a wide range of issues that may include measurement concerns, practical considerations, or economic limitations. The Ortiz Picture Vocabulary Acquisition Test™ (Ortiz PVAT™) is no exception.

At the outset of development of the Ortiz PVAT, several such considerations were examined and various goals were adopted as the basis and foundation for the inherent structure and design of the test.

Goal 1.	*Vocabulary acquisition* was selected as the central construct, due to its relevance, salience, and importance in a wide range of language and language-related tasks, learning processes, and academic skills development. The importance of vocabulary acquisition was discussed at the outset of this chapter and reinforces the value inherent in its measurement.
Goal 2.	The test should be *appropriate for both native English speakers and English learners.* This degree of inclusivity provides ultimate flexibility in being able to administer the test to any individual, regardless of their monolingual or bilingual status, because it uses English as an invariant standard across both groups.
Goal 3.	Given the increasing linguistic diversity of the U.S., a deliberate choice was made to ensure that the test could be *administered to and provide valid results for all English learners, irrespective of their first or native language.* Again, such a goal ensures that the test is not limited in its application to only those from a specific linguistic background.
Goal 4.	The test should be usable by any evaluator by permitting administration in a manner that *does not require the evaluator to be bilingual.* Note that for individuals who have extremely limited exposure to English (e.g., those who have less than three months of exposure or who are suspected of having severe developmental delays in their native language), task instructions may be more readily understood if they are presented in the examinee’s native language. As described in more detail in chapter 3, Administration and Scoring, the Ortiz PVAT provides translated task instructions for five of the most common non-English languages according to the recent American Community Survey (ACS; United States Census Bureau, 2014): Arabic, Chinese, Russian, Spanish, and Vietnamese. Only the instructions have been translated; the target words for each test item must always be presented in English.

Based on these practical considerations and development objectives, it quickly became evident that none of the customary approaches to evaluation would be able to satisfy all requirements. Rather, the goals for the test made it clear that it was necessary to return to a previous perspective in measurement that had, admittedly and correctly, been viewed as exceptionally problematic—testing in English. Although testing in English poses no issue for monolingual, native English speakers, the same cannot be said for English learners. Failure to evaluate in an English learner’s native language might not reveal the full extent of the individual’s language abilities and development. Even an individual who is well-educated in their native language, but introduced only recently to the English language as a new language, may not perform well if tested only in English.

Nevertheless, testing in English provides some significant advantages in both practical and psychometric terms that make it not only a viable approach, but also one that can effectively and fully respond to the main requirements for the test.

Evaluation and measurement of vocabulary acquisition in English has applicability for both English speakers and English learners as both groups have some degree of English language development.

For the purposes of creating a test appropriate to individuals in the U.S., evaluation of vocabulary acquisition in English provides a common metric (or yardstick) by which comparisons of development, growth, and attainment can be made, irrespective of the individual’s native language.

Testing in English requires no special skill or bilingual ability on the examiner’s part if the examinee demonstrates sufficient comprehension with English task instructions (e.g., has more than three months of exposure to English).

Although the practical considerations are rather easily accomplished via testing in English, the psychometric considerations are not so easily resolved. As has already been discussed, it is neither possible nor valid to compare the development of English vocabulary of a monolingual English speaker to that of a bilingual English learner via traditional, single norm group tests. Instead, it is necessary to create a test that has two distinct normative samples: one for native English speakers, and one for English learners. Only in this way is it possible to begin addressing the psychometric issues that have long plagued tests designed to evaluate the language abilities of English learners.

Dual Norms

The concept of having more than one set of norms for a particular test is not in itself a new idea. Tests sometimes provide data from clinical subsamples that may guide interpretation in cases of individuals with known disorders who are tested with a particular instrument. However, the idea of developing two distinct normative samples for use with a single test is more of a departure from standards in test development, particularly when applied to differences in language development. In this regard, the Ortiz PVAT represents the first effort to create a test that is, in all respects, two tests because it contains two sets of norms (one for English speakers and one for English learners), but with the same administration procedure for both. The Ortiz PVAT is uniquely suited for assessing receptive vocabulary acquisition of both English speakers and English learners, akin to two separate tests combined into one. This idea of dual-norming is quite different from the notion of parallel tests in different languages.

Given the serious limitations in trying to create 200+ language versions of the same test, dual-norming has many advantages and provides an exceptionally elegant alternative. This is not to say that establishing a representative normative sample for English learners is a simple task, but only that it represents perhaps the most reasonable and appropriate option for addressing the inherent psychometric considerations. To meet and address these issues, the limitations that were discussed previously (i.e., the lack of appropriate representation with respect to language) must still be overcome. That is, differences in language development must be accounted for in some fashion within the construction and stratification of the normative sample. This factor is not a concern for monolingual native English speakers as age already provides the necessary and sufficient degree of comparability in terms of English-language development. This is not the case, however, for English learners. By limiting the test to English vocabulary acquisition, the normative sample for English learners can focus on controlling primarily for development in (i.e., exposure to) English without regard to development in the native language.

However, native language development is still of great importance; factors such as age, formal education in the native language, and parental socioeconomic status are critical in terms of determining native language development and its potential impact on learning English. Should an individual possess significant native language vocabulary (which might have formed from prior formal education in the native language), the trajectory and growth of vocabulary acquisition in English is likely to be far more rapid than for an individual without such formal education in the native language due to the transfer of knowledge regarding language in general. This process is described by Cummins (1984) in his Developmental Interdependence Hypothesis; he uses the concepts of BICS and CALP to explain the process of linguistic transfer and the factors that facilitate the learning of a new language, and differentiates conversational language from that learned in school. In addition, a series of meta-analyses regarding the effectiveness of native language instruction in producing academic achievement in English continues to demonstrate that the longer an individual is taught in their native language, the better they perform academically in English (Goldenberg, 2008, 2013). Evaluation in the native language can therefore become a secondary aim, since gains in cognitive ability and maturation facilitate this linguistic transfer from one’s native language to English. Evaluating growth in English can also reflect native language development: individuals who are developing both English and a native language ought to show greater vocabulary growth than individuals who did not receive formal education in the native language (for whom no such transfer would have occurred).

Dual norms provide another important advantage that is largely absent in other psychometric frameworks that rely on single group normative samples—separation and distinction regarding the purpose of evaluation (i.e., diagnosis versus intervention). When only a single normative sample is available, examiners are forced to address the separate questions of disability and of instructional need (for school-age children) with the same score compared to the same group. This limitation does not exist when dual norms are constructed and made available. For example, when the purpose or reason for evaluation is to assess disability in English learners, it is never appropriate to use a normative sample based primarily or exclusively on monolingual native English speakers. Such comparisons, as discussed, will be inherently biased and discriminatory, and the resulting inferences and conclusions will lack validity. Diagnosis of disability or disorder is based solely on developmental delay (either in rate of progress or magnitude of attainment) and must, therefore, only occur relative to other individuals of the same age and with similar exposure to English.

Conversely, when the purpose or reason for an evaluation is to identify the instructional need or intensity of intervention that may be required for an English learner, it is appropriate to compare performance against the typical grade- or age-level standard, even when that standard is based on the performance of monolingual, native English speakers. Note that this action upholds the standard used for all students in U.S. schools and does not lessen or diminish the expectations of achievement for an English learner. In addition, it accurately reflects that any individual with a minimal level of English vocabulary is likely to have a very high need for instruction and intervention. Of course, the rate of growth and acquisition of English vocabulary may be variable depending on factors such as prior formal education, parental socioeconomic status, or even differences in individual ability, but this would not alter or disrupt the validity of the meaning of the obtained measurement or score. To an astute observer who is knowledgeable in cultural issues in testing, the current standards established in the U.S. school systems regarding grade-level performance, as well as expectations regarding academic achievement in English, can be somewhat ethnocentric. Admittedly, the current standards do not take into account the different types of learning across cultures or languages when the focus is tied largely to age- or grade-based expectations derived from and appropriate for native English speakers. Nevertheless, to help bilingual students succeed in schools, it is important to evaluate them based on standards that reflect at least average age- or grade-level performance. 
To determine the exact level of instructional modification or intensity of intervention services needed for an examinee who is not performing up to grade-level standards, one must determine the gap between the examinee’s current vocabulary level and the expected standard (i.e., the amount of growth required to attain the expected standard). The same is also true for an examinee who is performing at or above grade-level expectation; the evaluator needs to determine whether the examinee would need help to sustain their current performance in order to maintain progress commensurate with their age-matched peers. Therefore, the ability to generate information that accurately gauges instructional needs relative to established performance standards may be valuable in terms of educational planning, decision-making, and progress monitoring.
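The distinction between the two purposes of evaluation can be sketched in a few lines of Python. This is an illustration only and not the Ortiz PVAT scoring procedure: the names `Purpose`, `reference_group`, and `instructional_gap` are hypothetical, and the expected standard of 100 simply assumes a conventional standard-score scale.

```python
from enum import Enum


class Purpose(Enum):
    DIAGNOSIS = "diagnosis"      # question of disability or disorder
    INSTRUCTION = "instruction"  # question of instructional need


def reference_group(purpose: Purpose, is_english_learner: bool) -> str:
    """Pick the normative sample that matches the purpose of evaluation."""
    if is_english_learner and purpose is Purpose.DIAGNOSIS:
        # Diagnosis: compare only with same-age English learners
        # who have had similar exposure to English.
        return "english_learner_norms"
    # Instructional need: compare with the age- or grade-level standard.
    return "english_speaker_norms"


def instructional_gap(score: float, expected: float = 100.0) -> float:
    """Growth needed to reach the expected standard (0 if already met).

    The default of 100 is an assumed scale midpoint, not a published value.
    """
    return max(0.0, expected - score)
```

Under these assumptions, the same English learner would be compared with `english_learner_norms` when the question is diagnostic and with `english_speaker_norms` when the question concerns instructional need; a hypothetical score of 82 against an expected standard of 100 would leave a gap of 18 points.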

Exposure to English

Of all of the psychometric considerations discussed thus far (e.g., using English as a common metric, creation of dual norms, inclusion of speakers from any language, and distinguishing BICS level vocabulary acquisition from CALP level), perhaps the most important issue to be addressed involves the concept of English exposure. With the exception of the BESA (Peña et al., 2013), the construction of normative samples for use with English learners is often accomplished without any regard to the varying levels of development in either the native language or English (Ortiz, 2014). This omission is, of course, a significant error because English learners cannot be considered to be a monolithic group wherein it is permissible to compare, for example, the performance of a 17-year-old with 12 years of formal education in English to another 17-year-old with only two years of formal education in English. Compounding the problem is that differences may exist not only in their English vocabulary development, but in their respective native language development as well. For example, individuals with solid, formal education in their native language (whether received in the U.S. or their native country) will, given enough time and opportunity, eventually outperform their same-age English learner peers without formal education in their native language. In some cases, they will also outperform their same-age English-speaking peers in English academics, due to the linguistic transfer from their native language to English. Unfortunately, attempts to consider variation in language development in both English and an individual’s native language would constitute an enormously complex hurdle and would require much time and effort to closely match individuals in terms of age, development in English, development in their native language, and across all languages.

On the surface, accounting for the aforementioned factors (i.e., age, exposure to English, development in English, and development in the wide variety of possible native languages) within the normative samples seems inherently prohibitive and likely insurmountable. However, when considered in terms of the stated objectives and structural goals for the Ortiz PVAT, it becomes clear that accounting for all of these factors is not strictly necessary. This is because native language development, irrespective of the language, influences English language development in terms of the degree and rate of linguistic transfer (i.e., more native language education means easier and more rapid acquisition of English) but does not alter the pattern of English vocabulary acquisition. That is, because language acquisition and learning are predicated on vocabulary word frequency, there is no disruption of the process of such acquisition, irrespective of the individual’s education in their own language or what that language happens to be (Milton, 2009). In other words, regardless of one’s level of education in their native language or how long they have been learning English, high-frequency words are learned first and low-frequency words are learned later.

Because the process of vocabulary acquisition remains invariant across both native English speakers and non-native English speakers (e.g., learning high-frequency words first and rarer words later), it stands to reason that English vocabulary acquisition for each group can be evaluated using the same test, in much the same way; however, length of exposure to English must be factored in, such that individuals in each group are compared to other individuals with the same exposure and opportunity to learn English.

On the basis of the aforementioned considerations and given the parameters within which the test must operate, it became very clear that a viable test could be constructed with dual norms appropriate for both native English speakers and English learners, as long as both normative samples control for differences in exposure to English. With respect to native English speakers, there is no problem in accounting for English exposure because it is already tied strictly to age level, assuming that adequate conditions for language development exist. That is, when English is the only language being heard, modeled, taught, and learned, an individual’s age provides an accurate approximation regarding their length of exposure, much like age provides a reasonable measure of an individual’s cognitive maturation. Because age is the primary stratification variable in virtually all normative samples of tests that measure developmental processes, this factor is already controlled in any age-based sample of monolingual English speakers. However, with respect to English learners, who by definition have been exposed to a language other than English and who started learning English at some point in their lives, which may or may not have been right after birth, the construction of a representative normative sample cannot rely simply on age to determine an individual’s level of experience with and exposure to the English language. Instead, development of an adequate normative sample for English learners must directly ascertain and account for differences in length of exposure to English. This point cannot be overstated. For valid comparisons of English vocabulary acquisition for the purposes of disability identification, an individual with a certain amount of exposure to English must be compared only to other individuals with equivalent exposure to English.
Only in this way can the English vocabulary acquisition test performance of English learners meet the requirements for fairness (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; see also chapter 8, Test Standards: Reliability, Validity, and Fairness). This feature alone (the addition of a new stratification variable based on English exposure in the normative sample constructed for English learners) distinguishes the Ortiz PVAT from all other current tests.

It is important to understand that same-aged English learners may have had a little, some, or a great deal of exposure to English. To evaluate them fairly, with respect to measurement of their vocabulary acquisition in English, an English learner must be evaluated against other English learners of the same age and of comparable amounts of English exposure. By sampling individuals of the same age and range of exposure to English (i.e., from no exposure to high exposure) and from a wide range of languages, a suitable, appropriate, and fair comparison can be made by determining the expected receptive vocabulary ability of individuals across the entire span of exposure for any given age. Traditional norms based on age and other demographic factors represent a rather flat, two-dimensional construction that has been quite sufficient for creating tests that work well for speakers of a single language. This standard, however, is not suitable for tests that seek to measure all abilities, especially those related to language, of individuals who have experiences in two or more languages. By accounting for developmental differences due to differential exposure in a common language (in this case, English), the norms become three-dimensional and provide the necessary degree of comparability that yields the necessary adequacy in representation required for drawing valid diagnostic inferences of test performance. When coupled with the ability to accurately measure growth in English vocabulary by incorporating a progression of high-frequency BICS to lower-frequency CALP words applicable to language development of both native and non-native English speakers, dual norms represent a new and innovative approach to test development that can provide a much broader, more valid, and less discriminatory framework for evaluation than what has been available previously. 
They also represent an important, and perhaps valuable, departure from previous conceptualizations of test construction for use with linguistically diverse populations.
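The effect of adding exposure as a stratification variable can be illustrated with a small Python sketch. It is not the norming procedure used for the Ortiz PVAT: the records are invented, exposure is expressed in years purely for convenience, and the helper names `peers` and `z_score` and the one-year matching tolerance are assumptions made for the example.

```python
from statistics import mean, stdev


def peers(sample, age, exposure_years, tolerance=1.0):
    """Raw scores of same-age individuals with comparable English exposure."""
    return [
        raw
        for (a, exposure, raw) in sample
        if a == age and abs(exposure - exposure_years) <= tolerance
    ]


def z_score(raw, peer_scores):
    """Standing of a raw score relative to the matched peer group."""
    return (raw - mean(peer_scores)) / stdev(peer_scores)


# Invented records: (age in years, years of English exposure, raw score).
sample = [
    (10, 1, 20), (10, 1, 24), (10, 2, 28),
    (10, 8, 58), (10, 9, 60), (10, 9, 62),
]

low_exposure = peers(sample, age=10, exposure_years=1)
high_exposure = peers(sample, age=10, exposure_years=9)

# The same raw score of 24 is average among low-exposure peers
# but far below average among high-exposure peers of the same age.
matched = z_score(24, low_exposure)
mismatched = z_score(24, high_exposure)
```

With these invented numbers, a 10-year-old with one year of exposure who obtains a raw score of 24 falls exactly at the mean of the matched group, whereas the same score compared against same-age peers with roughly nine years of exposure would appear severely delayed. The sketch shows why age alone is an insufficient basis for comparison among English learners.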
