The Role of Measurement in Promoting Diversity

Sep 23

Prompted by a series of brutal murders of Black citizens, the nation is examining mechanisms that promote structural racism. At the same time, there is a loss of public confidence in standardized tests, as evidenced by the California university system’s suspension of the SAT and ACT for admissions, furor over cheating on college admissions tests, and the success of students admitted to test-optional schools without scores. Anti-racism and test skepticism are related in that one of the most cited reasons for abandoning standardized tests is that they preserve systems of privilege (Public Council, 2019).

Good Intentions

This is certainly not the intent of the measurement community, which works hard to ensure that tests are fair, accessible, and free of content that puts any group at an advantage or disadvantage. Procedures for eliminating bias are implemented at every phase of development.

Although differences in test performance are often characterized as reflective of societal differences beyond the influence of tests, we need to recognize measurement’s role in creating those differences and reinforcing existing power structures.

Historically, assessment standardization has begun with egalitarian intent.

· 605-1905 Chinese Imperial Examination System – wanted to choose bureaucrats by merit rather than birth and unify a diverse empire.

· Around 1600 – examinations for the Jesuit Ratio Studiorum – wanted to expand the pool of candidates for the priesthood and broaden exposure to Jesuit philosophy.

· Around 1845 – Horace Mann institutes standard written exams for Common Schools – wanted an educated citizenry to strengthen economic and civic engagement.

· 1934 – James Bryant Conant promoted the SAT – wanted to diversify the Harvard student body beyond Eastern prep schools.

In all these cases, the objective was to provide wider opportunity by disseminating information needed for success. Indeed, there are many examples of individuals who moved into elite positions by doing well on assessments. Another well-intentioned purpose for tests has been to foster participation in a common culture or to unify disparate elements. Tests hold out the possibility of advancement for those who show ability, so their content is studied by people of all groups. These virtuous purposes, however, have the effect of perpetuating the existing social order.

Consequences

Codifying and disseminating the skills needed for success is intended to provide access and diversity but has the effect of further reinforcing mainstream culture. It says, “In order to become part of society, you need to become like us.” You need to know what we know, think like we think and write like we write. The rewards of society are given to those who can do this. Scores tend to identify the most privileged members of society as the most able, because excellence is defined by skills possessed by elite people.

Those who are part of the dominant culture have a head start. They enter school already speaking the right dialect, knowing the right manners, and having the same world view as their teachers. As school progresses, their skills are buttressed by a home and social life that confirms their fit to the society. They learn academics in full expectation that the fruits of society will be theirs.

For those who aren’t part of the dominant culture, there is less certainty. From the outset there is an extra burden of trying to figure out the dominant code. There is the feeling that one’s speech is wrong or problematic. In what these students have observed of life, people like them do not enjoy the fruits of society despite hard work.

None of this is the fault of standardized tests, but it is confirmed and supported by test results. Testing begins early in the lives of students, giving students of color and economically disadvantaged students a message of inadequacy, creating doubt and anxiety.

The content of tests at any given time reflects the ideal of the dominant group. It may cover Confucian philosophy, classical Greek, or rhetoric, but it always contains some skill that is more prevalent in the dominant group than in other groups. Tests are perceived as objective and scientific, so demographic differences in scores are seen as proof that the already fortunate deserve their status and that others do not.

The criticism of standardized testing has been consistent, from the 1300s forward:

· Test prep businesses crop up, available only to the wealthy

· Standardization of content limits creativity and diversity of thought

· Preparing for narrow test content displaces true scholarly activity

· Test anxiety masks true ability

· Cheating

Once established, tests can be deliberately used to protect existing high status and limit opportunity. Although Binet cautioned that intelligence is changeable and too complex to measure with a single number, his work was quickly expanded and used as evidence supporting hereditary views of intelligence, including eugenics. One of the appeals of the early SAT to admissions officers was that it was developed from an IQ test. Since there was a prevalent belief that race and IQ were connected, they believed it would allow them to exclude non-whites without explicit discriminatory rules (Manhattan Review, 2019). The recent SAT cheating scandal showed how wealthy families try to tip the scale for their children.

Tests as Gatekeepers

Most public criticism is about using scores to allocate scarce resources. Tests that focus on learning rather than screening or ranking are more widely accepted. There is far more objection to tests for certification and admissions, which are used when something is in limited supply. This includes entrance slots in good universities and the number of people who can enter a profession. The thinking is that these limited positions should be reserved for the most able or those most likely to succeed. However, there is increasing skepticism about the ability of tests to do that. Do standardized tests select the best students or only the best test-takers?

Relevance of Content

In a limited test session, examinees can’t do the complex, time-consuming thinking demanded in actual situations. Tests have been dominated by multiple-choice questions atypical of real problems. This year Oregon, Washington and Utah are allowing law school graduates to become licensed without passing the bar. Darleen Ortega, a judge on the Oregon Court of Appeals, argues that the bar exam’s passing criterion is based on artificially difficult content irrelevant to the practice of law.

“The test, while difficult, does not screen for the skills actually needed to demonstrate minimum competence. The exam requires them to answer questions under timed conditions that do not parallel the realities of actual practice. Indeed, answering a client's questions quickly from memory would in most cases constitute malpractice. And a major portion of the examination consists of multiple-choice questions that aim to trick the test-taker, requiring them to choose the best among several slightly wrong answers.” (Ortega, 2020)

There has been an ongoing charge that test content is narrow and irrelevant. Cornell psychologist Robert Sternberg has long argued that college admissions tests are inadequate for identifying the best thinkers. He advocates looking for creative practical skills that indicate ability to live in a complex and changing world: “in the end, intelligence is about adaptation to the environment, not solving trivial or even meaningless problems.” (Sternberg, 2020)

Relevance of Criteria

Some tests don’t seem to do a good job of gatekeeping. Hiss and Franks (2014) examined 33 institutions with test-optional policies. They found few significant differences in cumulative GPAs and graduation rates between those who submitted scores and those who did not, despite significant differences in SAT/ACT scores. Non-submitters were more likely to be underrepresented minorities, first-generation college attendees and Pell grant recipients.

Ibram X. Kendi finds similar evidence about the predictive power of tests:

“The biggest irony and tragedy of the Regents v. Bakke case—and the affirmative action cases that followed—was that no one was challenging the admissions factors being used: the standardized tests and GPA scores that had created and reinforced the racial disparities in admissions in the first place. The fact that UC Davis’s nonwhite medical students had much lower MCAT scores and college GPAs than their fellow white medical students but still nearly equaled their graduation and licensing exam passage rates exposed the futility of the school’s admissions criteria.” (Kendi, 2016a)

Somehow, for all the attention paid to making tests fair, they continue to exclude some capable people and discourage diversity.

What Should We Do?

What would it take for assessments to be an agent for mobility and diversity?

Evaluate How Scores Are Used in Screening

Test scores used rigidly or in isolation are a blunt instrument. They are too simplistic and imprecise. Assessment professionals have always recommended that they be used in concert with other information as part of a thoughtful selection process, and many universities and employers already do this. This year, we’re going without standardized college admissions tests. For some, this will be permanent. For others, it will be a test of how well other indicators perform to increase diversity. The same will be true for professions that suspend licensure or certification tests. This crop of students and professionals is a rich source of information about how people who would have been screened out by tests succeed. Studies of their characteristics can inform test content and policy for test score use.

Achievement Gap

Few mourn the deferral of K-12 statewide testing this year. Accountability tests required by successive versions of the Elementary and Secondary Education Act (NCLB, ESEA, ESSA) have been praised for mandating achievement gap reports. But Kendi (2016b) questions this function:

“What if different environments actually cause different kinds of achievement rather than different levels of achievement? What if the intellect of a poor, low testing Black child in a poor Black school is different—and not inferior—to the intellect of a rich, high-testing White child in a rich White school? What if the way we measure intelligence shows not only our racism but our elitism?” (Kendi, 2016b)

The achievement gap concept implies a belief that acquiring skills determined by the dominant culture is a worthy goal. If we look at the skills of existing successful people as a way to predict future success, we get the same people we have now. As Sternberg notes, this leads to less robust selection. Skills and habits of other groups may differ but are needed in society. How can we view diversity as a resource rather than a problem to be solved?

An interesting approach to this is the game Evoke, created by Jane McGonigal for the World Bank at the request of the South African government (World Bank, 2012). The target players are South African high school and university students who are cast as experts to solve various world crises. The premise is that the world is asking the player for help because Africans have the most experience coping with big problems in conditions of scarcity. The qualities usually used to characterize the player as needing help are used to characterize them as powerful experts. How can we harness the creativity and practicality of those who cope with balancing multiple jobs, raising families on limited budgets and negotiating multiple cultural identities? How can we create a research agenda around this?

Marty McCall

Marty McCall is the lead psychometrician for Imbellus, a company that builds simulation-based assessments for education and industry. She has previously worked in measurement for state, consortium and interim tests.
