PART I: Unexpected, Unexplained Score Drops
Critics of standardized tests correctly assert that even the best-built tests are imperfect and incomplete. However, high-stakes tests like the SAT and ACT have remained credible because they have traditionally produced consistent results. That is, because very large testing pools don’t change much from one year to the next, a well-constructed test should yield fairly similar results each time it is offered. The exact difficulty of any given test form will vary slightly, but that’s where scaling is used to bring scores into alignment.
And that is where something seemed amiss when 2019 PSAT scores were released last month.
College counselors across the country noticed significant drops and were understandably concerned that their current juniors and sophomores had underperformed on this year’s exam. So we looked into the figures College Board provided to schools and found the following:
- The number of juniors scoring 1400+ dropped 30%, from 71,000 to fewer than 50,000.
- The number of sophomores scoring 1400+ dropped 36%.
- High-performing students scored as much as 30 points lower than in previous years. The student who would have scored 1400 last year was more likely to score 1370 this year.
- There were far fewer students in the typical National Merit ranges. We now project that National Merit Semifinalist cutoffs will decline 1–4 points.
- The primary PSAT form (Wednesday, October 16th, taken by 86% of students) may have been the most skewed, resulting in inequities based on when students tested.
The unexplained results affected more than just the top scorers:
- Students scoring 1200+ dropped by 15% for juniors and 21% for sophomores.
- The average PSAT scores for all 1.7 million juniors dropped by 10 points, an unusual change considering that the average usually moves by 1–2 points on a well-constructed exam with a stable group of testers.
- The drop in Math scores was almost twice the drop in Evidence-Based Reading and Writing (ERW) scores. Even the number of students meeting the baseline Math benchmark for college readiness dropped 10% this year.
Unless there was a national math crisis that put high-scoring students in the classes of 2021 and 2022 a half-year behind their class of 2020 peers, something appears to have gone wrong with the PSAT. Until an explanation is provided, we’ll assume this was a test construction problem and not a change in student achievement. This could have ramifications for how students and schools interpret and use PSAT scores and for the amount of trust College Board is given when creating and reporting on PSATs and SATs.
Compass has divided the rest of this report into sections so that readers can find the parts most relevant to their concerns:
PART II: How conclusions were drawn
PART III: Why discrepancies matter
PART IV: The need for an independent auditor
PART V: An FAQ for parents, students, and counselors
PART VI: A deeper dive into the data
PART II: How conclusions were drawn
College Board makes almost no information about PSAT results public and provides no self-assessment. This is our attempt to partially fill that void.
The PSAT is not a single exam. A set of forms is used for different dates and for different situations. Because the primary form on October 16th was the dominant exam—taken by 86% of students—our analysis is most applicable to that exam, 2019 PSAT Form A. We believe that much of the problem lies with this form. We hope to gather more data for other forms such as the October 19th and October 30th administrations.
Counselors can contact the author at [email protected] for more details about how to help get more useful comparisons for themselves and their peers.
Disappearing High Scorers
Twenty thousand top scorers don’t simply disappear. This should have been a red flag that something had gone wrong. Instead, this information was buried in a report available only via a portal for college counselors. To push 20,000 students below 1400, scores would have been off by an estimated 20–30 points. A curve that normally aligns year to year has been deformed.
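As a rough sanity check, the implied shift can be estimated from the counts above under an assumed normal score distribution. This is our own back-of-the-envelope sketch, not College Board’s methodology, and the standard deviation used is an illustrative assumption:

```python
# Back-of-the-envelope estimate (not College Board's method): how far
# would scores near 1400 have to shift to cut the 1400+ junior count
# from 71,000 to 50,000? Assumes a roughly normal score distribution;
# the standard deviation below is an illustrative assumption.
from scipy.stats import norm

N = 1_700_000   # approximate number of junior test takers
sd = 190        # assumed SD of PSAT total scores (illustrative)

z_2018 = norm.ppf(1 - 71_000 / N)   # z-score of the 1400 threshold in 2018
z_2019 = norm.ppf(1 - 50_000 / N)   # z-score matching the 2019 count

print(f"Implied shift near 1400: ~{(z_2019 - z_2018) * sd:.0f} points")
# -> roughly 30 points, consistent with the 20-30 point estimate above
```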
Unusual Changes in Average Scores
When testing similar groups of students, a well-constructed exam produces similar distributions of scores. At the very least, it allows us to trust that a 1200 on one exam means the same as a 1200 on another. That trust is what keeps these tests alive. With 3.5 million students taking the PSAT/NMSQT each fall, results should be stable. In 2019, they were not. These declines were well outside the norm and should have been another indication that something was wrong. Achievement changes that large rarely happen in a single year. The last time average SAT scores declined by 10 points was in 1975.
Changes in average scores are less volatile than score changes at a specific portion of the scale. For example, unchanged scores for students scoring below 1000 offset discrepancies at the high end, and several forms are averaged together when computing mean scores. The estimated decline of 30 points among high-scoring students on the October 16th test was almost triple that observed across all PSAT takers.
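As a minimal illustration of that dilution, with hypothetical shares and drops chosen only to show the arithmetic:

```python
# Hypothetical arithmetic: a 30-point drop concentrated among high
# scorers shows up as only ~10 points in the overall mean when most
# takers' scores barely move. All numbers here are assumptions.
share_high, drop_high = 0.25, 30   # assumed share of affected takers and their drop
share_rest, drop_rest = 0.75, 3    # assumed drift for everyone else

print(share_high * drop_high + share_rest * drop_rest)  # -> 9.75 points
```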
PART III: Why discrepancies matter
PSAT scores are used in a number of situations where accuracy matters:
- The PSAT determines more than $100 million in financial aid through National Merit and other college scholarships.
- College Board urges schools to use the PSAT to place students into AP classes.
- The PSAT provides an important comparison to students’ performance on the SAT or ACT.
- The PSAT is used by schools, districts, and states to track student performance over time.
Scoring anomalies also call into question College Board’s ability to accurately construct PSATs and SATs or to objectively critique the tests’ performance. No third party provides oversight of test integrity. (It’s perhaps ironic that a test prep company is trying to do that.) College Board and ACT grade their own work and have a habit of making erasures after time is called. We are concerned that the same discrepancies can crop up on the SAT, but that they are easier for College Board to keep hidden, especially since it does not publicly release the data.
How will this impact National Merit cutoffs?
When aligning their class of 2021 students to the class of 2020 Semifinalist cutoffs, many counselors will be alarmed. Our analysis, however, indicates that most state cutoffs will decline. It is more difficult to predict exactly where each dividing line will fall. Compass maintains the most thorough reporting of National Merit results and analysis here. Our current forecast is that most cutoffs will decline by 1 to 4 points. Even large states that typically move no more than a point from year to year may see significant drops. The Commended cutoff could fall as much as 4 points, from 212 to 208, although 209 is Compass’s current estimate. [This is another place counselors can help through crowdsourcing of data. Contact the author for details.]
Does this mean that my school will see far fewer National Merit Semifinalists?
Possibly not. Ultimately, 16,000 students will be named Semifinalists, just as they were in last year’s class. If each test code—and we know of at least 8 of them for this year’s PSAT—had performed similarly, then one test date versus another wouldn’t matter. We believe, however, that the forms did not perform equivalently. We are particularly concerned that October 16th test takers were disadvantaged. Conversely, students who took one of the other forms may have a higher chance of qualifying thanks to a lower bar. The discrepancies also impacted ERW and Math scores unequally. This may have consequences for the mix of National Merit Semifinalists given the way the Selection Index is calculated. Score anomalies are potentially damaging when hard cutoffs are used to make decisions or where scores are used programmatically.
Why are there different forms at all?
For test security, exams given on different dates should be unique. Although 86% of students took Form A on October 16th, another 7% took Form H on October 30th. We estimate that 5–6% of students took the Saturday October 19th exam. (College Board keeps that exam almost entirely under wraps. Students are not given back their booklets or access to questions after the exam. No scaling is reported.) The remaining students took a form used for those with special accommodations or were in a group of schools that took one of the top-secret forms used by College Board to help create, test, and scale future exams.
Top secret sounds ominous. What does that mean?
Even on October 16th, not all students took the same exam. On many PSAT and SAT administration dates, secondary forms are used for a variety of test development and monitoring purposes. Typically, a school is expected to distribute these exams in a prescribed fashion that, when repeated across a number of schools, allows College Board to gather what is known as “equivalent group” data useful for scaling new tests. The significant downside from a student’s perspective is that these secondary forms are not released. Even test administrators must go through special training to use them. Other than scores, these students and schools receive no useful feedback, which largely defeats the intended purpose of the PSAT. We know of at least 6 secondary forms (Forms B through G) used on October 16th. We do not have sufficient data to report their results, but our preliminary comparison of the forms is not favorable. It appears that some forms resulted in higher overall scores than others.
PART IV: The need for an independent auditor
Why has College Board remained silent?
There are multiple places where alarms could and should have been sounded and scoring deficiencies addressed. Tens of thousands of high scoring students “disappearing” should have set lights flashing at College Board. The unusual drop in average scores should have provoked a reevaluation of scaling. If—as we suspect—different forms produced different results, interventions could have been made. Are there proper checks and balances in place? Are auditors not speaking up? Are they speaking up but not being heard by management? Are there any auditors at all?
Beware of false claims
In the past, College Board has shifted attention from its own mistakes by wrapping them in misleading narratives. On the 2016 PSAT, scores for juniors improved by 9 points and were up across all demographics. College Board explained this as a win:
“It is both rare and encouraging to see this kind of positive improvement across the board. I’m inspired by the swift embrace of the PSAT-related assessments, and even more by the progress we are seeing,” — Cyndie Schmeiser, senior advisor to the president, the College Board.
That account four years ago suggested that millions of students improved their performance simultaneously. The simpler reading was that the 2015 PSAT—the first edition of the completely overhauled exam, administered a year earlier—was a lemon. It appears the same failure to produce consistency from exam to exam has recurred. We hope, with this report, to preclude any false narratives. This is not about students losing the ability to do algebra or read critically. This is more likely about a miscalculation of scores.
Are problems limited to the PSAT, or do they also apply to the SAT?
We fear that the same sort of miscalculation arises on the SAT, although it is better obscured by the high number of dates and forms used throughout the year. The PSAT provides a useful Petri dish because entire cohorts take the test at the same time each year. Almost 3 million students took the October 16th administration, 5 to 10 times the number taking the typical SAT form. Moreover, the cohort of PSAT takers is stable from year to year. On the SAT, by contrast, normal variation in student behavior can mask test construction problems. College Board should have a transparent audit process and report card for both the PSAT and SAT to demonstrate its commitment to accurate reporting of student scores. Obfuscation is incompatible with its mission.
PART V: An FAQ for parents, students, and counselors
Compass has prepared this FAQ to further explain how schools and students can interpret the 2019 PSAT results.
Don’t scores fluctuate? Isn’t the standard error of measurement more than 20–30 points?
The overall drop in PSAT scores should not be confused with the standard error of measurement (SEM) inherent to any standardized test. SEM is an estimate of how much an individual student’s score may vary from form to form. Rather than viewing a 1340 as a perfectly accurate measure of ability, students are encouraged to think of a range of 1300–1380, reflecting the 40-point SEM for total scores on the PSAT.
SEM would be like a single golfer who averages 200-yard drives hitting the ball 190 yards sometimes and 210 yards other times. In testing jargon, there is a difference between his observed drives and his true driving ability. The 2019 PSAT was like a bad batch of golf balls that left nearly every drive about 30 yards short. Lower scores impacted hundreds of thousands of students. The standard error of measurement still exists—it’s just now layered on top of a flawed scale. A shift of this magnitude would not happen on a well-designed exam.
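A quick simulation makes the distinction concrete. This is a sketch under assumed numbers; the score distribution and the 30-point error are illustrative, not College Board figures:

```python
# Sketch: random measurement error (SEM) vs. a systematic scale error.
# Individual noise averages out across students; a miscalibrated scale
# shifts everyone. The distribution parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(1010, 190, size=100_000)  # assumed "true" abilities

sem = rng.normal(0, 40, size=true_scores.size)     # ~40-point SEM
normal_year = true_scores + sem                    # well-scaled exam
flawed_year = true_scores + sem - 30               # hypothetical 30-point scale error

print(f"Mean error, normal year: {np.mean(normal_year - true_scores):+.1f}")  # ~ +0.0
print(f"Mean error, flawed year: {np.mean(flawed_year - true_scores):+.1f}")  # ~ -30.0
```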
Are percentiles still correct? Don’t they reflect actual performance?
No and no. The User Percentiles reported by College Board are not based on scores from the 2019 PSAT/NMSQT at all. Instead, the figures are aggregates from the 2016, 2017, and 2018 administrations. The User Percentiles for above-average scores therefore under-report students’ actual relative standing on the 2019 exam. Once more, we can’t quantify this exactly without more data from College Board. Also, percentiles are never differentiated by form.
A rough rule-of-thumb for October 16th PSAT takers at 1200 and above would be to add 20–30 points to the total score and use the percentile tables from PSAT/NMSQT Understanding Scores 2019.
Should high schools treat scores differently?
First, schools should not be surprised to find that they have fewer high scorers this year. In our audit, we found a decrease of almost 40% in students scoring 680 or higher on the PSAT Math test. The charts in Part VI below give a rough sense of how things may have shifted. We will update our guidance if we are able to gather additional information. We now expect that most—if not all—states will see lower National Merit Semifinalist cutoffs. Based on our analysis, many states will see cutoffs 2–3 points lower than those seen for the class of 2020.
As a rule of thumb, schools offering the October 16th exam may want to add 20 points to Math scores above 600 and 10 points to ERW scores above 600 to bring them into closer agreement with previous years’ results.
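For schools that want to apply this mechanically, here is a minimal sketch of that rule of thumb. The helper name is ours; the thresholds and offsets come from the guidance above, and scores are capped at the PSAT section maximum of 760:

```python
# Rule-of-thumb adjustment for October 16th (Form A) scores only,
# per the guidance above. The function name is our own invention.
def adjust_form_a(erw: int, math: int) -> tuple[int, int]:
    adj_erw = erw + 10 if erw > 600 else erw
    adj_math = math + 20 if math > 600 else math
    # PSAT section scores top out at 760
    return min(adj_erw, 760), min(adj_math, 760)

print(adjust_form_a(650, 680))  # -> (660, 700)
```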
Does this change how a student should choose between the SAT and ACT?
In most cases, the lower PSAT scores will not change a student’s decision. Any given test administration produces only an approximation of a student’s “true score.” Many factors come into play when deciding between the SAT and ACT. We recommend that, when using a tool such as Compass’s PSAT/ACT Score Comparison or Concordance Tool, October 16th students add 20–30 points to their scores if they are at 1200 or higher. Unfortunately, we don’t have enough data to make a recommendation for students who took other administrations. A student who feels the PSAT results are unreliable should take an official practice SAT.
If the 2018 and 2019 PSATs are so different, could it be the 2018 test that was off?
The results from the 2018 PSAT were in line with those seen in 2016 and 2017. This seems to be a problem with the 2019 PSAT.
What lessons can students learn?
The job of College Board (and ACT) is to provide a fair testing environment by using parallel forms. One exam should not give students an advantage over students taking another. Unfortunately, this goal is not always achieved. While students have no way of predicting when a “bad” test will be offered, they can diminish its impact by spreading their risk. Relying on a single exam is poor planning. A student may perform poorly because of lack of sleep, a noisy room, an untrained proctor, an odd question or passage that trips up timing, a mismatch between the skills studied and the problems appearing on that form, or many other reasons. The 2019 PSAT shows that students can’t even necessarily trust that the exams themselves are accurately scaled. An effective test plan must allow for the unexpected.
An alternative plan is to look more seriously at test optional or test flexible colleges. It’s unfortunate that students should plan for more testing to compensate for the vagaries of the exams.
Could the change be intentional? Were scores getting too high?
College Board has not announced any adjustments to the PSAT scale. Changing a scale without a specific rationale and without notice would be unprecedented. College Board would have little interest in lowering scores. We hope that College Board will clarify what went wrong with this year’s exam, but we don’t see any validity to this particular conspiracy theory.
Is any of this related to the introduction of experimental sections on the PSAT?
At schools not willing to pay additional fees, students were given “experimental” sections that are typically used to help scale and develop future tests. The use of these experimental sections may be evidence that College Board is taking the production and scaling of the PSAT more seriously. Any added diligence has not yet shown up in PSAT quality. The experimental sections are unlikely to have caused the scoring problem.
So what did cause the scoring mistake?
Only College Board can properly answer this question, and they declined to respond to this report. The most likely cause is poor sampling when developing the exam’s scale. If the sample group differed in unexpected ways from the reference group—differing in expected ways can be modeled—then the scale would have been incorrect. This is more likely to happen at the high end of the scale, where there are fewer test takers and fewer students in a sample group. Mistakes such as this one would spur most organizations to re-examine their procedures. Organizations can use a “red team” to intentionally probe for weaknesses or errors. College Board should have high standards of accuracy or at least be candid with students, schools, and colleges about what standards it does maintain.
PART VI: A deeper dive into the data
Is there another explanation?
When analyzing data, it is important to consider alternate hypotheses. Scores sometimes move lower when the pool of test takers shifts significantly. For example, if several high-scoring states abandoned the PSAT, then the numbers might have been skewed. There is no evidence that this happened, and the number of PSAT test takers is little changed. The problems also seem to have affected both the sophomore and junior classes, which makes an alternate explanation far less credible.
To further rule out the possibility that overall test taker achievement explains the score drops, we worked with a set of schools to compile data comparing class performance on the 2018 and 2019 PSATs. We observed the same issues with scores at every school. For those offering the October 16th exam, there was a 46% drop in the number of students scoring at or above 1400. We did not gather enough data from the October 19th or the October 30th dates. [Compass’s data does not come from our own students, and the effects of test prep should be irrelevant to the results.]
Was this a result of harsh scales students have complained about?
What first drew our attention to the PSAT scores were complaints from students about harsh scales on this year’s exam. Indeed, on the Reading and Writing tests, scores dropped precipitously after only 1 or 2 mistakes. The most significant scoring problems, though, were on the Math section—the only test, as it were, that had a “typical” scale.
A harsh scale exists to offset an easier test. The scaling done on the PSAT is different from what a classroom teacher might do to determine that a certain percentage of students will receive As, a certain percentage will receive Bs, and so on. PSAT scaling is designed to take into account the small differences in difficulty between test forms. We have found that the Reading and Writing & Language tests really were easier than is typical.
How do we know a test was easier?
One way of judging the difficulty of a test is to look at how many students achieved a perfect score. Zero wrong is the only score that can’t be impacted by a scale because, by design, a perfect test must receive the highest available scaled score. On an easier test form, we should see more students achieving a perfect score. While College Board doesn’t release this level of detail, our sample of school data shows the expected result—the 2019 PSAT was in fact an easier test.
- Perfect Reading scores increased 703%
- Perfect Writing scores increased 61%
- Perfect Math scores increased 160%
However, when we look at scores below perfect, the 2018 and 2019 tests come into closer alignment on the Reading test and the Writing & Language test. We expect this behavior if the test is properly scaled.
Scores on Reading and Writing are reported on a scale of 8–38. They are added together and multiplied by 10 to produce an ERW score. For example, a 35 Reading and 34 Writing produces a 690 ERW score. For Math scores, we use the more familiar 160–760 scaled score range.
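Expressed as a short sketch (the helper name is ours):

```python
# The ERW composite described above: Reading and Writing test scores
# (each on an 8-38 scale) are summed and multiplied by 10.
def erw_score(reading: int, writing: int) -> int:
    assert 8 <= reading <= 38 and 8 <= writing <= 38
    return (reading + writing) * 10

print(erw_score(35, 34))  # -> 690, matching the example above
```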
On Reading and Writing, there are some differences between 2018 and 2019 results, but nothing like what we see on the Math chart below. For example, the chart shows that almost twice as many students earned a 660 or higher on the 2018 PSAT than on the 2019 PSAT.
These sample school data (exclusively from the October 16th Form A) are in line with the overall performance results reported by College Board—while none of the three subjects is perfectly aligned, the discrepancies are far more prominent on Math than on Evidence-Based Reading and Writing.
Compass strives to report data honestly. If this report includes any errors, we will be quick to admit and fix them. That is why we granted College Board a preview of this report. Our goal is to give complete and accurate information and analysis to everyone who has a vested interest in high-stakes college admissions testing.