Psychometric Issues in the Evaluation of English Learners
The concern with measurement issues relevant to non-native English speakers has an extensive history, tracing its roots to the very advent of psychometrics and the development of the Army Mental Test (Yerkes, 1921). Once it became clear that both illiteracy and limited English comprehension were producing scores at or near zero, the test was split into two versions: the Alpha version for those literate in English, and the Beta version for those illiterate in English or with insufficient English comprehension. The Army Beta comprised “performance items” that did not require the use of language, a framework later adopted by Wechsler in his initial Wechsler-Bellevue Intelligence Scale, most notably in his use of the “Performance IQ” (PIQ; Wechsler, 1939). This approach eventually led to clinical practice that recommended administering the PIQ subtests to English learners while avoiding those of the “Verbal IQ” (VIQ; Kaufman, 1994; Ortiz, 2014). Ultimately, the concept was extended into the development of discrete “nonverbal” tests. In this way, measurement proceeded under the assumption that simply eliminating language from a test would be sufficient to address the problems inherent in evaluating non-native English speakers. The premise is reasonable, and tasks that assess the ability to analyze complex patterns and relationships with reduced language demands (i.e., nonverbal tests) are indeed less influenced by test-takers’ linguistic and cultural differences than verbal tests. Nevertheless, limiting evaluation primarily to visual tasks and nonverbal abilities introduces a different set of challenges. Language, as noted previously, is an important aspect of intelligence and cognitive abilities and an important predictor of academic success (Schneider & McGrew, 2012). Thus, excluding language from an evaluation may significantly reduce the predictive validity of the resulting measures (Lohman, Korb, & Lakin, 2008) and limit their applicability in evaluating potential causes of reading and writing difficulties, as well as difficulties related to speech and language functioning.
Other approaches have been suggested as alternative ways of addressing language differences in individuals undergoing evaluation, with results as unsatisfactory as those of nonverbal testing. For example, some clinical recommendations for working with English learners involve the use of modified or altered methods of assessment, sometimes referred to as “testing the limits” (Sattler, 2001; Shaughnessy, 2006). This strategy may include repeating instructions, mediating concepts before administering items, extending time limits or eliminating them altogether, accepting responses in a language different from the test’s language, or using a translator or interpreter. Such alterations, however, may effectively violate the standardized protocol of the test and thus invalidate the results by undermining the test’s psychometric properties and reducing the appropriateness and comparability of the normative sample. Likewise, the lack of validity evidence for any of these modifications makes them problematic as viable solutions for addressing developmentally based language differences.
One of the more recent attempts to manage language differences in testing involves the development of instruments in a language other than English. On the surface, there is some validity to the notion, but it quickly runs into a variety of challenges. For example, a test created in a language other than English can be used only with speakers of that language. Given the predominance of Spanish in the U.S., it is no surprise that this language has received the most attention in terms of actual development and publication (Schlueter, Carlson, Geisinger, & Murphy, 2013); however, over 350 languages are spoken in U.S. public schools alone (United States Census Bureau, 2015), making it wholly impractical, not to mention prohibitively expensive, to attempt the creation of tests in more than one or two languages other than English.
Tests in a language other than English face larger difficulties, however. When these tests are used in the U.S., the examinee is no longer a monolingual speaker of either language, especially in the case of school-age children who have become circumstantial bilinguals, as noted previously. Because English learners (by definition) do not possess linguistic experiences comparable to those of either monolingual speakers of their native language or monolingual English speakers, neither age nor grade comparison groups remain suitable standards for evaluating language development in general or vocabulary acquisition specifically, even when other typical factors are controlled via stratification in the normative sample (Ortiz, 2014). The very purpose of a normative sample is to ensure sufficient comparability in the developmental experiences of all individuals so that performance can be compared fairly and equitably. This objective is generally accomplished by stratifying normative samples on a number of variables, including age, along with variables that correspond to typical population demographics (e.g., gender, geographic region, socioeconomic status, and race/ethnicity). Although the inclusion of race/ethnicity gives the impression that cultural issues have been effectively addressed, it is insufficient in this regard and does little to establish any type of equivalency in an individual’s exposure to another language. A psychometric framework for evaluating individuals who differ specifically in their language development in two or more languages cannot rely on the standard variables commonly used to create a representative normative sample (in particular, race and ethnicity) unless there is direct attention to experience and exposure in one language, the other, or both. Tests in languages other than English continue to be lacking in this respect.
In an attempt to overcome some of these limitations, true “bilingual” tests (e.g., the Bilingual English-Spanish Assessment [BESA]; Peña, Gutiérrez-Clellen, Iglesias, Goldstein, & Bedore, 2014) have been developed that provide an opportunity for individuals to demonstrate the full extent of their language repertoire by assessing proficiency in both English and the native language. Although the BESA employs a procedure common to other similar tests (i.e., giving credit for correct responses in either language), it is distinctive in that its normative sample is based entirely on bilingual English-Spanish speakers. Moreover, it provides guidelines, based on language use, for determining whether evaluation should be completed in both languages or in only one. Although the BESA represents the first viable, standardized testing framework for the evaluation of English learners from Spanish-speaking homes, the methodology comes at a price. For example, apart from covering only a very narrow age range (4 years 0 months to 6 years 11 months), the test still requires either that the evaluator be bilingual or that a trained interpreter conduct the Spanish administration, which removes the evaluator somewhat from the clinical aspect of the testing. Moreover, the test is designed only for English learners whose native language is Spanish, making it of limited value to speakers of any other language. Nevertheless, norms based on non-native English speakers and explicit consideration of variation in language development and proficiency are two crucial concepts upon which the development of tests for English learners must be based. Accounting for these design considerations would address concerns related to adequate normative sample representation and expectations of performance.