Soon after June SAT results were released last week, clients from across the country were calling the Compass offices, confounded by the news they had to share. Parents were understandably bewildered by the seemingly illogical fact that their students, as one mom put it, “did better, but worse!” On the Math portion of the June SAT, many students correctly answered a much higher percentage of questions than on previous test dates yet still ended up with lower 200-800 scaled scores. Meanwhile, College Board maintains that the results are accurate.
Some students, parents, counselors, and tutors are feeling not merely disappointed but outright disillusioned by this anomalous outcome. Among the questions we are hearing are “How did this happen?,” “Why did this happen?,” “Can it be correct?,” “What should we do next?,” “Might this happen again?,” “How will colleges view June 2018 results?,” and “What if my score actually went up in June?”
Through and Under the Looking Glass
Test makers and test takers are alike in that they both see each test item as a precious opportunity to gain ground. Students can’t afford to waste a question with a careless mistake and test producers can’t afford to include a question that doesn’t contribute enough to the sorting of students. A 58-question SAT Math test is inefficient, for example, if every student gets the first 30 questions correct. It will be an inaccurate exam if all of the questions are far too hard to make useful distinctions.
The fundamental promise of standardized tests — that scores are consistent and interchangeable over time — is also their predicament. In a test-maker’s utopia, there would be an endless supply of equivalent test forms that would each contain unique questions but would be otherwise identical in scope, composition, and difficulty. While certain parameters like item count, section sequence, and timing can be held consistent, other aspects of the tests have unavoidable variance.
The most intuitive way of scoring a test is based on rights and wrongs. Students can immediately understand what it means to get 40 out of 58 questions right. This method breaks down, however, when you consider that tests are far too complex to produce the exact same results every time.
A Simplified Example of Equating
Imagine a reference group of students who take every SAT before it is widely offered. The reference students who got 42 questions right on Form A got 43 questions right on Form B. On Form C the same students got 41 questions right. It would not seem fair to treat 41 correct answers as equivalent on all three forms. Form C was the hardest of the exams; Form B was the easiest. The equating process might say that 42 -> 650 on Form A, 43 -> 650 on Form B, and 41 -> 650 on Form C. That level of difference is what we expect to find when equating exams. In fact, looking at the raw scores that produced a 650 on the old SAT, for which we have almost 50 released exams over 10 years, the lowest was 40 and the highest was 44. The curves never varied by more than 40 or 50 scaled points for the same raw score anywhere on the exams.
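For readers who prefer to see the arithmetic, the simplified example can be sketched in a few lines of Python. This is only an illustration of the idea, not College Board's actual equating procedure; the names `RAW_FOR_ANCHOR`, `difficulty_offset`, and `hardest_and_easiest` are invented here, and the numbers are the hypothetical ones from the example.

```python
# Illustrative only: these anchors come from the simplified example,
# not from any real College Board conversion table.
ANCHOR_SCALED = 650
RAW_FOR_ANCHOR = {"A": 42, "B": 43, "C": 41}  # raw score that maps to 650 on each form


def difficulty_offset(form: str, baseline: str = "A") -> int:
    """Raw points by which `form` is harder (+) or easier (-) than `baseline`."""
    return RAW_FOR_ANCHOR[baseline] - RAW_FOR_ANCHOR[form]


def hardest_and_easiest(raw_for_anchor: dict) -> tuple:
    """The hardest form needs the fewest correct answers to reach the anchor score."""
    hardest = min(raw_for_anchor, key=raw_for_anchor.get)
    easiest = max(raw_for_anchor, key=raw_for_anchor.get)
    return hardest, easiest


hardest, easiest = hardest_and_easiest(RAW_FOR_ANCHOR)
print(hardest, easiest)  # C B

# Old SAT range quoted in the text: raw scores from 40 to 44 produced a 650.
old_sat_raw_spread = 44 - 40
print(old_sat_raw_spread)  # 4
```

The point of the sketch is simply that a harder form needs fewer correct answers to reach the same scaled score, and that on the old SAT the spread was only a few raw points.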
How the June SAT Falls Short
Compare this to how the June SAT 2018 Math fits in among its fellow new SATs. A 650 could be achieved with 50 correct answers. That’s the lowest scaled score the new SAT has ever produced for 50 correct answers. The highest score it has produced for 50 correct answers on an actual, released exam is 740 points, a 90-point swing! So in its first two years, the new SAT has approximately doubled the extremes seen on the old SAT over 10 years and four times as many exams. In terms of standard deviation, the June 2018 test was a full 2 SD further away from the mean than any other exam. When a 100-year flood occurs after two years, you have to be highly suspicious of the weather forecasting.
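The "approximately doubled" claim is easy to check by hand. A minimal sketch, using only the figures quoted in this post (the variable names are ours):

```python
# Scaled Math scores reported for 50 correct answers on the new SAT.
lowest_scaled = 650    # June 2018
highest_scaled = 740   # highest on another released new SAT
new_sat_swing = highest_scaled - lowest_scaled
print(new_sat_swing)  # 90

# On the old SAT, the same raw score never varied by more than 40 or 50 points.
old_sat_max_swing = (40, 50)
ratios = tuple(new_sat_swing / swing for swing in old_sat_max_swing)
print(ratios)  # (2.25, 1.8)
```

Depending on whether you take 40 or 50 points as the old SAT's extreme, the new SAT's swing is roughly 1.8 to 2.25 times as large.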
The Difference Between Accurate and Fair
College Board keeps coming back to the fact that the items were developed according to standards, the equating was computed correctly, the items were scored correctly, and the resultant scale was correct. There is no direct evidence that College Board misscored the exams. There is, however, evidence that College Board issued an exam that it would have known was well out of spec. It is expensive to throw away an exam, but that’s exactly what College Board should have done.
Perhaps College Board thought that the new SAT questions on previous forms were too hard and it was trying to shift to an easier set of questions. Such things are done by testing organizations, but typically through a lengthy evolutionary process. To make such a jump so early in a program would be highly irresponsible. We believe that June was an outlier rather than being the new normal, but only College Board knows for sure.
Is there a way to externally verify the results of the June SAT? How did it impact students as a group? The detailed data used to construct and equate new exams are considered trade secrets that the testing organizations do not disclose. College Board does periodically release the range of reliability coefficients for exams, so we may eventually see if June fell short on this measure.
We can, at least, look at the end result. The pool of students taking the SAT on a particular test date is fairly consistent from year to year. If the June 2018 test were entirely out of whack, then the miscalibration should show up in a comparison to June 2017 results. Despite having very different raw-to-scaled score conversion tables, the math test this June ended up with identical average scores and similar score distributions to the test in June 2017. Some students may have missed out on high scores because of the test structure, but this means that an approximately equal number of students benefitted.
A “Fair” Result for the Whole May Still Be Unfair for Many
The slight variations in difficulty among forms usually go unnoted. At a certain point, however, the difficulty level changes the testing experience. A student is justified in saying, “This is not what I signed up for!” College Board would not, unannounced, start putting calculus problems on the SAT. Those questions might not impact the reliability of the exam, but the change would violate the implicit compact College Board has with its students and its member institutions. The SAT is not meant to be a “gotcha” exam.
In June, the esoteric world of career psychometricians setting what amounts to an odd scale on one of many tests constructed over the years for millions of students applying to thousands of colleges came into conflict with the practical world of one unlucky teenager drawing that oddly scaled test before applying to a few colleges, once in her life. The 340,000 students taking the June SAT may have ended up with a similar score distribution to those taking the June 2017 SAT, but the reliability — how well did a student’s score line up with previous performance — is still very much in question. Compass has seen many students who performed well on the June SAT, but this does not negate the concerns about its fairness.
Score Gaps and Imprecision
The other major anomaly with the Math exam was the set of gaps between scaled scores — especially at the high end. On the June SAT, no one received a 710, a 730, a 740, a 760, a 780, or a 790; over half of the possible scores at or above 700 went unused. A single wrong answer could have dropped a student by as many as 30 points. There were surely test takers who would have normally filled those gaps, but the June SAT was incapable of sorting them properly. The over-reliance on easy questions did not give the SAT enough power to make fine distinctions among high-scoring students. And while admissions officers are cautioned not to make meaningful distinctions between small score differences, highly selective colleges are flooded with scores in this range and may perceive contrast that the tests themselves can’t see.
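The gap figures can be verified from the list of unused scores. A short sketch, assuming only that SAT section scores are reported in 10-point steps (the variable names are ours):

```python
# Possible Math scores at or above 700, in 10-point steps.
possible = list(range(700, 801, 10))

# Scores that no one received on the June 2018 Math section, per the list in this post.
unused = {710, 730, 740, 760, 780, 790}

used = [score for score in possible if score not in unused]
print(used)  # [700, 720, 750, 770, 800]

unused_share = len(unused) / len(possible)
print(len(unused), len(possible))  # 6 11

# Largest jump between adjacent scores that were actually awarded.
largest_drop = max(high - low for low, high in zip(used, used[1:]))
print(largest_drop)  # 30
```

Six of the eleven possible scores went unused, which is the "over half" cited, and the largest jump between neighboring awarded scores is the 30 points a single wrong answer could cost.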
The incongruity of the June scale is a valid criticism. Indeed, College Board includes “minimized gaps” among its own goals of test construction in its SAT Technical Manual. No other new SAT has anywhere close to that many gaps. Some have none at the high end.
Such aberrant figures draw skepticism toward the integrity of these exams that ostensibly reflect math (and verbal) ability and little else. Did the June version adequately do that or did it perhaps exaggerate the effect of a careless mistake? Will college application readers apply any contextual interpretation to a June 2018 SAT score when advised by test makers to disregard administration dates? Students should assume not. Devaluing the scores would disadvantage the students who did well on the exam.
Students should also not count on College Board caving to the pressure of petitions or other demands to revisit the June outcomes. It was a poorly constructed exam, and College Board had to have known that going in, but there was no wholesale error that would cause results to be revised or annulled. College Board is, perhaps sheepishly, standing by its reported scores. It feels good . . . enough . . . about the values it has assigned to every possible raw point total.
While this post has focused on the shortcomings on the Math side of the June SAT, it’s also worth noting that four “verbal” questions (2 Reading, 2 Writing) were deemed unscorable, effectively reducing the denominator of every student’s ERW raw point ratio. College Board insists this had no impact on scaled scores (what about time spent answering these scrapped items?) but it’s another reason why the June 2018 SAT will not be remembered as one of College Board’s finest pieces of work.
Instead, concerns grow as stakeholders (students, parents, counselors, and even colleges) question whether the SAT is losing its way. Compass is quick to defend the SAT when it is wrongly blamed, but we’ve also written about its owner’s missteps, which have seemed to increase since 2012, when current leadership assumed power. Significant internal changes have occurred in recent years, including the loss of key members of the SAT team and the reclaiming of test construction from ETS, which had long been subcontracted to create tests. It seems College Board may be learning the hard way how difficult the job really is.
As for advice for students unsure about their next steps (Score Choice, standing pat, re-testing, additional preparation), it should be a case-by-case discussion with an expert who can carefully examine a list of individual factors. We encourage you to schedule a call with a Compass director to discuss your specific plans for the fall.
I think the same phenomenon, possibly even worse, happened with the October SAT test. My daughter’s raw score on the October SAT test was 94 correct out of 96 questions on EBRW sections and 56 correct out of 58 questions on the Math section, and yet she received a 770 scaled score on each. Can it really be that missing only 2 questions out of 96 on the EBRW gets a 770 and missing only 2 questions out of 58 in math also gets a 770? That grading scale seems much tougher than even the June scale. What do others think?
You are correct. The Math scale wasn’t quite as harsh, but the Reading and Writing scales were. The Writing scale, in particular, may have been the most extreme that I have seen. This means that the set of questions used was very easy. If you haven’t seen our other post about this, you can read What Happens When an SAT is Too Easy. It’s unclear whether College Board has just had some poorly constructed exams (i.e. too easy) or whether there is a method to the madness.
Thank you for this very informative article. The topic is of particular interest to me this month.
Our eldest child took his first official SAT exam this June. While I feel that he did well, I can’t help but wonder if results could or should have been different. Are there multiple versions of the test administered on a particular date – or did every June test taker receive the same one?
On a side note, our son did receive a math score of 780 (missed one) on his June 2018 SAT – so at least some students are being given that score this go around.