I commonly write about the topic of error in assigning grades in a classroom setting, but it always fills me with a certain measure of anxiety.

Assigned Grade = True Grade + Error

This is a fundamental relationship that we have to deal with when trying to measure anything interesting. Yet I often wonder how students reading this blog might feel to be so frequently reminded that the grades they receive in their classes always contain some error. Though error is never completely avoidable, it is easy to understand how someone might feel disaffected or angry to learn that the measures which hold sway over their future career prospects contain some degree of error. The degree of error varies from class to class and student to student, but it’s easy to imagine that in almost every class there are some students whose grades sit in the grey area between, say, a B and a C, for whom even a small amount of error could be enough to produce an assigned letter grade that is different from what the ‘true’ grade would be if the error had not existed.

How important is the difference between a B and a C? In the case of Megan Thode, a student at Lehigh University, the difference amounted to a $1.3 million lawsuit.


To quote the NY Daily News:

Getting a grade you don’t deserve in school is worth about $1 million in damages, according to a lawsuit filed in Pennsylvania.

A former student at Lehigh University is so unhappy with the C+ she received in a course in 2009 that she has decided to sue the school for $1.3 million, claiming the unfair grade has ruined her future earning potential.

In this case the C+ grade was enough to disqualify Ms. Thode from continuing in her graduate program. Her lawsuit claims that the grade she received was inaccurate at least in part because of bias the instructor carried towards her due to certain political statements she had made and complaints she had made about taking part in an internship as part of the class. The university denies any bias.

For the sake of argument, let us assume that the instructor did not make a conscious decision to alter her grade, but that annoyance at Ms. Thode’s behavior subconsciously contributed to the error present in her assigned grade. It would not outrage me to learn that this was true. I personally do not consider myself immune to small biases based on student interactions – a point that I try to make clear to all my students when discussing my professional behavior standards and one of the reasons I employ blind grading for most assignments. This hypothetical situation provokes a few interesting questions:

  • How much error is enough to warrant a lawsuit?
  • If we hold instructors legally liable for grade error, does it matter whether the largest source of the error is unconscious bias, sampling, grader variation, or any of the many other factors that can contribute to error?
  • Is it even possible to determine how much of an assigned grade is error or what caused the error?

I’ll pass on trying to answer the first two questions as they are mostly philosophical. The question of determining the magnitude and cause of error is a practical question. In theory, the answer to the question is ‘yes’, but with several caveats so large as to make any attempt clearly impractical.

Classical testing theory posits that while the sampling error in each assessment is random, the randomness is not uniform, but rather forms a normal distribution centered on zero. From a statistical standpoint, all that you would need to do to reduce the error is to continue sampling, provided – and here comes the big catch – that you could ensure that the thing you are sampling (i.e. the student’s knowledge) has not changed at all, something that would require access to a time machine or alternate dimensions. As for rooting out bias-driven error, it probably isn’t normally distributed, so you might be able to discover it if you were able to fit multiple independent graders into your time machine so you could re-test each student in the class multiple times and then try to correlate the scores between graders.
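A quick simulation can illustrate the first half of that argument. Assuming a hypothetical student whose (conveniently frozen-in-time) true score is 82, with normally distributed error centered on zero, averaging more and more assessments drives the mean assigned score toward the true score:

```python
import random

random.seed(42)

TRUE_SCORE = 82.0  # hypothetical fixed 'true' score (this is where the time machine comes in)
ERROR_SD = 6.0     # spread of the zero-centered, normally distributed error term

def assigned_score():
    """One observed score: true score plus random sampling error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

for n in (1, 10, 100, 10000):
    mean = sum(assigned_score() for _ in range(n)) / n
    print(f"mean of {n:>5} assessments: {mean:.2f}")
```

With one assessment the observed score can easily be a full letter grade off; by 10,000 assessments the mean sits within a fraction of a point of 82. The catch, as noted above, is that no real student’s knowledge holds still for 10,000 re-tests.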

In other words, short of some kind of smoking gun, it is not practical to sue based on standard assessment error. Frankly I’m glad, since the thought of the chilling effect that would have on grade assignments is enough to make me shiver.


Last week, I had the privilege of attending UW-Madison’s Teaching Academy Fall Kickoff Event, which this year focused on issues of grading and assessment. Keynote speaker Dr. James Wollack presented challenges with grading and described potential solutions. One issue he addressed was the promotion of positive behaviors (for example, attending office hours, coming to class, participating in discussion) and the challenge of assessing these behaviors. I’ve long argued that awarding points for attendance systematically discriminates against at-risk and non-traditional students, who may face more challenges to class attendance than other students. But at the same time, it is important to recognize the motivating nature of class point structures; to wit, more students show up for class when they get points for doing so.

Dr. Wollack suggested that instructors could create a separate category of assessment items that act as gatekeepers for certain grades. For example, to receive an A, a student must earn at least 90% of available course points and not miss more than 2 class sessions. Failing to meet this attendance criterion results in a lower letter grade, no matter what percentage of points a student earned in the class.
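A minimal sketch of how such a gatekeeper rule might be encoded. The 90% example comes from above, but the specific grade ladder, cutoffs, and two-absence limit here are my own illustrative assumptions, not Dr. Wollack’s exact scheme:

```python
def letter_grade(points_pct, absences, max_absences=2):
    """Map a course percentage to a letter grade, then apply an attendance
    gatekeeper: exceeding the absence limit drops the grade one step."""
    ladder = ["F", "D", "C", "B", "A"]
    cutoffs = [60, 70, 80, 90]  # minimum percentage for D, C, B, A
    rank = sum(points_pct >= c for c in cutoffs)
    if absences > max_absences and rank > 0:
        rank -= 1  # gatekeeper: knock the grade down one letter
    return ladder[rank]

print(letter_grade(94, absences=1))  # meets both criteria -> A
print(letter_grade(94, absences=4))  # same points, fails the gatekeeper -> B
```

The point of the structure is visible in the last two lines: identical point totals, different letter grades, with attendance acting as a gate rather than as scored points.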

This idea is very appealing because it allows instructors to build in assessment components that are traditionally hard to score. For example, consider the issue of participation. Let’s say that, in each class session, students can earn up to 5 points for participation. This should motivate students to participate more in class, but it also means that instructors must somehow both conduct class and evaluate student participation. This daunting task reduces reliability of grading and confidence in scores for both instructors and students. But a checkbox system and a gatekeeper point value can help solve this problem.

But what is the theoretical justification behind gatekeeper-style grading? One issue comes in the correlation between gatekeeper items and other normally assessed items. No matter how attendance, for example, is scored, it is still related to overall classroom performance. Students who attend class more frequently do better on exams. This may seem like a great reason to do everything possible to compel students to attend class, but whether attendance is scored or a gatekeeper, it still serves as a barrier to higher performance for a motivated student who, for whatever reason, cannot attend class as regularly as the instructor would like.

Think of it like this.

Attendance = Learning + error
Test Score = Learning + error

Ideally, attendance and test scores would be a perfect reflection of what the student has learned in the class. But of course, we recognize that some error takes place, as some students might attend class but fall asleep or be distracted; others might skip class but study extra hard on their own. For this system to work, we need to assume that the two error terms are not correlated with each other. But we are assessing the same construct (Learning, the student’s knowledge of the course material), and thus we know that these error terms in fact SHOULD be correlated.

Furthermore, we want the error terms to be random, such that error in measurement does not systematically help or hurt students. But because both Test Scores and Attendance are measures of Learning, when we kick in points for Attendance, it is like awarding automatic extra credit points on the test for students who were in class. The error term isn’t random any more, because the relationship between Learning and Test Scores is influenced by the relationship between Attendance and Test Scores.
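A small simulation (with made-up effect sizes) shows why the independence assumption fails: when Attendance and Test Score both derive from the same Learning construct, and share any common disturbance (illness, a family crisis, a bad week), their error terms end up correlated rather than independent:

```python
import random

random.seed(0)

n = 5000
learning = [random.gauss(75, 10) for _ in range(n)]

# Both measures carry the Learning signal plus their own noise, plus a
# shared disturbance (e.g., illness hurts attendance AND test performance).
shared = [random.gauss(0, 5) for _ in range(n)]
attendance = [l + s + random.gauss(0, 5) for l, s in zip(learning, shared)]
test_score = [l + s + random.gauss(0, 5) for l, s in zip(learning, shared)]

def corr(x, y):
    """Pearson correlation, computed by hand to stay dependency-free."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Each error term is the measure minus the Learning it was meant to capture.
err_att = [a - l for a, l in zip(attendance, learning)]
err_test = [t - l for t, l in zip(test_score, learning)]
print(f"correlation of the two error terms: {corr(err_att, err_test):.2f}")
```

With these (invented) variances the two error terms correlate at roughly 0.5, which is exactly the situation classical test theory asks us to rule out.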

In short, what results is a big assessment mess that is difficult to straighten out. There is little theoretical justification for awarding points for attendance, whether as a scored item or as a gatekeeper. There are other views on this issue, however. I hope to feature some of those in upcoming posts.

Some people argue that a free market economy is the best economic system for getting the most good to the most people. Given that some form of the free market is the only economic system that has been shown to work on a large scale, it can be hard to argue against this claim. But in the 2012 presidential election, much of the debate centers around just how much governmental involvement in the economy is the right amount. While neither side favors complete laissez-faire or communistic approaches, there are stark contrasts between President Barack Obama and Republican presidential candidate Mitt Romney.

How does this relate to measurement and error? We can understand the debate between the two candidates as one connecting our personal values with the values of economics. The free market functions based on a profit motive, but humans are much more complex. And thus we can represent the debate with an equation.

Personal Values = Economic Values + error

In a perfect free market, all of our values would be served by free market economics. Sick people need health care! Don’t worry, there’s a free market solution to the problem. Poor children need to be fed! Look no further than economics for help. City streets must be cleaned, in poor and rich parts of town! The free market is there to help.

Unfortunately, not all our values–health care, feeding children, clean streets–have a free market solution. The most salient example is that of providing health insurance to individuals with “pre-existing conditions.” When insurance companies provide insurance for these people, it is much more expensive. Before health care reform legislation was passed in early 2010, insurance companies dealt with this issue by charging higher rates or denying coverage altogether for these people. That is a free market solution, but it did not reflect the values of many people.

New policy forbids insurance companies from these practices and in turn requires all Americans to obtain health insurance, thus ostensibly increasing insurance company revenues. This is anti-free market policy. It restricts the behavior of both corporations and individuals. But it also fits with the values of many people. And proponents argue that it is a case where the free market had failed and intervention was needed.

The error in the equation arises from situations in which there appears to be no free market solution to a problem. For Republican Mitt Romney, the error term is very small; there aren’t many problems that cannot be solved by the free market. For Democrat Barack Obama, the error term is larger; government has a vital role to play in the lives of all Americans. Your own view of this equation should be a big guiding factor in how you vote, as the candidates’ disagreement on the size of the error term shapes their distinct visions for the country. In this case, error, and your view of it, will help determine who wins the election in November.

A few days ago, on my professional blog, I wrote about how the IRB can promote best practices. The IRB (or Institutional Review Board) must approve all research at an institution and works to ensure that research complies with federal, state, local, and university laws and policies. Though this suggests a standard process with carefully constructed guidelines for review, many researchers note frequent discrepancies between institutions and even between protocols. A procedure requiring modifications in one proposed project may go unquestioned in another. This naturally results in much frustration among researchers, no matter their level of concern with research ethics. Indeed, it should be those most concerned about practicing research ethically who protest loudest against IRB discrepancy.

We can view the whole process as an equation and investigate further how the error term leads to frustration.

IRB Practice = Best Practice + Error

Let’s first agree that for most research practices, there is a “right answer” regarding the protection of participants. I use the example of data storage in the blog post linked above. “Consider, for example, handling collected data (the surveys and spreadsheets that contain participant responses). This data must be stored in a secure location. On paper, it should be in a locked room that is accessible to very few people. Digitally, it should be stored on a computer and backed up on a secure server.”

If there were no error, then any deviation from this best practice would be flagged by the IRB, and all protocols asserting that they will follow these practices would be accepted. Where there is no error (that is, no difference between best and IRB practices), there is no frustration. But this isn’t the case. Some IRBs may flag protocols that have not described backup locations as secure, while others may not worry about this detail. Error (the inconsistency) leads to frustration because a researcher cannot accurately predict what issue the IRB will flag.

The IRB, however, has some power over this error term. Right now, researchers must answer a variety of questions about items like data security. The application contains empty boxes into which the researcher must detail her plans and hope that they pass muster, with little guidance about what a best practice might be. This creates error because the researcher can easily forget or misstate an important piece of information. But if the IRB were to simply state the best practice and instruct researchers to either agree to follow it or describe their alternative plan, then the error term would be eliminated for most applicants. The IRB would save time as well.

Thus, inasmuch as the IRB wants to reduce researcher frustration, they should promote these best practices. By understanding the process of researcher approval in a classical test theory manner, we can see direct guidance as to how the IRB can improve their application procedure. In these cases, the error term is nothing but trouble for both IRB credibility and researcher sanity. All steps that reduce that error, while protecting participant safety, can and should be taken.

Over the past few days I’ve been working on editing a series of interviews on individual views about the classroom assessment process that will be presented at the upcoming Teaching Academy fall kickoff event at the University of Wisconsin.

One of the first things we asked our interviewees to do was to provide their definition of a ‘C’. At the risk of spoiling the result, I’ll just go ahead and say that there wasn’t a clear consensus (shocking, I know). Those who chose to give criterion-based responses tended to describe C-level as “adequate”, “acceptable”, or “OK”. These responses track well with the meaning of the grade in most academic environments where the C is the minimum grade that a student can receive and still continue on to the next class in a sequence. Interestingly, none of them made reference to any rubric or institutionally standardized level of achievement in determining what qualified as ‘adequate’, giving a sense that this was still a pretty subjective judgment.

A large segment of the interviewees gave a normative response, though they seemed to have a tough time agreeing on exactly where a C fell within the norm. Only one instructor described the C as ‘average’. The rest all described a C as worse than average, varying from ‘slightly below average’ to ‘the bottom 30%’ to ‘basically seen as failing’, with the harshest assessment coming from an undergraduate student. One professor clearly felt that it was proper for there to be multiple norms for a C, saying that because students at Wisconsin were above average, he considered a BC to be average.

A few people gave answers that were not linked to achievement in learning, instead categorizing a C as representing a lack of effort or hard work on the part of the student.

Regardless of the view you take on the validity of each of these positions, I think the sheer variety of answers is itself the strongest indictment against the utility of letter grade systems. Why, after all, do so many institutions adopt the familiar A, B, C, D, F system if not, well, for its familiarity? The value of standardization is clear; take languages as an example. I can author a post on this blog in English and be confident that even accounting for regional vernacular, a reader in any other English-speaking region will be able to reasonably understand most of the meaning. This is the same basic goal that underpins a standardized grade system. I don’t assign a student a grade of ‘7.5 walruses’ because I can just imagine the work a college recruiter would have to go through to decipher the meaning of a transcript if every school used a proprietary system.

ABCDF has become our lingua franca, but what good is a ‘common language’ if everyone interprets the meaning of the words differently? Such a system is actually worse than a fragmented system because it gives the illusion that grades from different instructors mean the same thing, when in fact they don’t. If I give the grade of 7.5 walruses, at least the person trying to interpret the grade will recognize that they need to get more information to know what it means. Experienced recruiters and admissions agents may be able to learn to adjust for institutions with reputations for strict or lax grading, but it’s unrealistic to expect them to be able to account for significant differences between instructors within the same institution.

We’re left with a situation in which everyone loses. Instructors are stuck with a stiflingly reductive grade system, without the value of its simplicity and standardization actually being leveraged. Recruiters are given a signal that is ostensibly accurate, but which contains noise from not only the inherent error of assessment but also from disagreement over the meaning of the grades themselves. Students are subtly penalized if they don’t seek out instructors who have a lower standard for the meaning of the grades. I suppose the advantage of such a system is that it fits with the long-held tradition of giving university professors autonomy within their classrooms, but I’ve personally never met an instructor who relished the process of having to come up with their own definition of what differentiates a C from a B (or B- or BC).

There are a few potential solutions to the problem of our language barrier when it comes to letter grades. One would be to tie grades to a relatively objective quantitative measurement, such as the student’s rank within a class; i.e., curved grade distributions, with all the hand-wringing that entails. Another would be to set out clear rubrics within departments and seek out collaboration with other major universities to develop standards, giving up some freedom in the process. Still another would be to broaden the use of university-level standardized tests, like the MCAT, LSAT, GRE, or FE. Whatever the solution, it’s about time we start doing something.

After the conviction of Jerry Sandusky, former Penn State assistant football coach, on charges of child molestation and rape, and the release of the Freeh Report, which found that former Penn State football coach Joe Paterno and other leaders at Penn State did not act responsibly when informed of accusations against Mr. Sandusky, Penn State is faced with many difficult choices. One of those choices is whether to remove the statue of Mr. Paterno that currently sits outside Penn State’s stadium.

Paterno Statue

On the face of it, this is a choice about how Mr. Paterno’s legacy should be treated. Should he be honored for his commitment to the university and his longevity as coach? Or should he be admonished for his lack of moral character after he recommended not reporting Mr. Sandusky to the police? Choose the former and the statue should stay; choose the latter and it should go.

To help clarify this debate, it is valuable to consider the statue as a representation of Mr. Paterno’s time at Penn State.

Statue Representation of Mr. Paterno = Mr. Paterno in Actuality + error

Prior to the scandal, we might have concluded there was little error in the portrayal of Mr. Paterno in the statue. He was a treasured figure on campus, and the criticisms that might have been levied against him paled in comparison to his contributions to Penn State, in both athletics and academics.

After the scandal, it turns out that Mr. Paterno, in actuality, was not quite the same as the statue representing him. Indeed, there were many complexities in Mr. Paterno’s leadership. Mr. Paterno led football players onto the field to engage in sport; he also led university officials to ignore a moral imperative to protect young boys from a serial child rapist. With Mr. Sandusky exposed and convicted for his crimes, we see Mr. Paterno in an entirely different light.

The result is an error term that taints the representation in Mr. Paterno’s statue. And given that the statue is designed to represent the success of Mr. Paterno at Penn State, we can see the representation as a flawed portrayal of a man who, we can now conclude, was far from a saint (which is how one mural on campus presented him–with a halo–before the artist removed it). If university officials believe that Mr. Paterno’s reputation is now sullied (and they should), then they need to ask themselves what the statue is meant to represent and what it actually represents. A mismatch between the two (the error term) suggests a solid rationale for why the statue should be removed.

The error in the representation, an indication of Mr. Paterno’s own errors, means the statue’s intention is subverted. It now stands as a bitter, ironic representation of Mr. Paterno’s time at Penn State, not a celebration of football success. Because its intention now falls far short of reality, Penn State officials should opt to remove the statue.

When I was in Paris in 2008, the funniest movie posters I saw were for the American film Step Up 2: The Streets. Dispensing with any pretense, the movie had been retitled Sexy Dance 2. This title, I suppose, told French audiences everything they needed to know about the film. According to Box Office Mojo, the film grossed over $4 million in France, or 7% of the film’s take globally, so we can conclude the new title was successful.

But other movie titles don’t seem as aptly translated. Some of the most perplexing and amusing:
  • If You Leave Me, I Delete You, instead of Eternal Sunshine of the Spotless Mind (Italy)
  • The Jungle Died Laughing, instead of George of the Jungle (Israel)
  • Urban Neurotic, instead of Annie Hall (Germany)
  • His Great Device Makes Him Famous, instead of Boogie Nights (China)
  • Six Naked Pigs, instead of The Full Monty (China)

Something seems missing from these titles. Some lack nuance. Others lack cleverness, though perhaps that is lost in the back-translation. In any case, the translations are hardly a faithful representation of the original title.

We can put this into a simple equation:

Actual Translation = Exact Translation + error

But is that error actually random? Likely not, as we can assume that the film’s international distributors worked very hard to come up with a title that would attract an audience in that country. Audiences in Germany, the re-title suggests, are much more interested in seeing a film with a descriptive (and accurate!) title than one bearing a person’s name. In this case, marketing appears to be the force that is adding error to the relationship between the actual title and an exact translation of the original title.

We see this same kind of process in education. When constructing an assessment tool, an instructor wants to make the tool capture student learning as faithfully as possible. But the instructor may also want the assessment tool to provide some pleasure (or at least lack of pain) for students. This could include paper assignments that the instructor thinks will be fun for students or test questions with amusing aspects.

Ideally, these fun tools would measure learning just as well as a more straightforward method of assessment. But just like movie titles, marketing efforts (making the tool fun) may add error to the process. In attempting to translate student learning into an assessment tool, the instructor can easily become distracted with other goals. Marketing can make translations into a mess.

Instructors should be careful. We laugh at funny translated movie titles, but poor test items are no laughing matter. Just consider the uproar about a story question featuring a rabbit racing a pineapple that made news this spring. It was a funny, entertaining question that made a lot of people mad. It was just another error in translation.

Violet and Zoe Michener didn’t have a good day at school in late June. Both girls returned home with severe sunburns after spending most of the day outside as part of school activities. The burns were so bad that their mother Jesse Michener rushed them to the hospital. Why didn’t their school apply sunscreen to the girls, one of whom has a skin condition that makes her especially vulnerable to burns? The reason is that their school’s rules (backed up by state law) forbid administration (including self-administration) of medicine without a doctor’s note. The sunburns here are a direct result of measurement error in differences between the spirit and the letter of the law.

Think of it this way. Each law is crafted with a purpose in mind. This particular law is designed to keep kids safe (and school districts protected) when regulated substances are involved. A doctor’s note must be provided in order for the child to take medication at school. But the law is written such that it forbids application of sunscreen (classified as an over-the-counter drug by the FDA), even application by the child with no help from a teacher. In this case, the letter of the law falls short of the spirit of the law by actually putting children in more danger.

This equation shows exactly what is going on:

Letter of the Law = Spirit of the Law + error

So what is a school district to do? Though courts try to uphold the spirit of the law, the regulation in this particular case is problematic. ABC News (in the article linked above) quotes Dan Voelpel, Tacoma school district spokesperson, as saying this: “Because so many additives in lotions and sunscreens cause allergic reaction in children, you have to really monitor that.” In other words, the letter of the law may not be so far from the spirit, if allergic reactions to lotions are something that threaten children.

Instead, we must look at the spirit of the law in the broadest possible way. It suggests the need to protect children from harmful substances. And in this case, excessive exposure to sunlight should be considered as another substance that requires regulation. After all, the measurement error can swing both ways. It can be overly restrictive, banning safe products like sunscreen. But it can also fall short, failing to regulate sun exposure. We often call the letter falling short of the spirit a “loophole” that allows for activities that a law intended to ban but was not written well enough to prevent.

In this case, the children were outside for five hours. It is not safe to be outside, exposed to the sun, for that length of time without sunscreen. And as doctors recommend reapplying sunscreen every 2-3 hours, the school failed to live up to their duties to restrict sun exposure (because sunscreen applied by parents before school would not have been effective for the entire time in the sun). Falling short of the spirit of the law, even if abiding by the letter, is an example of the harmful effects of measurement error. PlusError.com wishes Violet and Zoe a speedy recovery from their sunburns!

When you think about it, playing catch is not much exercise. When you play the game ideally, each throw goes directly to your partner, who need not even move her feet to catch the ball. The body does get some workout in the act of catching and throwing, but without any running around, there’s relatively little exertion. The same is true of a friendly rally in tennis: both participants are trying to return the ball so that their partner can easily hit it. We can represent this in an equation. (You knew I was going to say that, didn’t you?)

Playing Catch = Perfect Play + error

We have to include some error because, invariably, a throw goes off target, a frisbee gets caught in a gust of wind, or a tennis shot goes wide. Error, then, can be seen as exertion. The error in a game of catch causes the people playing to get exercise, running to chase down the object being thrown or hit back and forth.

Now consider games of sport where an object is moved from one area of the playing space to another. For example, in football, the goal is for one team to throw and carry the ball down the field and into their goal area; failing that, they can kick the ball into a certain zone. The team gets four tries to move the ball a certain distance or score before they must return the ball to the other team. In these types of sport, if there was no error in the above equation, then there wouldn’t actually be any competition, except against the clock.

Error comes into play as one team tries to play catch and the other team attempts to stop them. This necessitates exertion, as the thrower must frequently throw the ball to where a catcher will be, rather than where she is currently. Or the thrower and catcher must carefully plan so that a throw to the catcher is perfectly timed. The process is similar in games like baseball and tennis. In baseball, the hitter tries to hit the ball in a way that keeps the other team from catching it. In tennis, the hitter tries to keep the ball in the bounds of the court but to hit it in such a way that the other player cannot return it before the ball has hit the ground twice.

What does this all add up to? It’s a great example of a time in which error is actually what makes the equation (representing a real-world phenomenon) interesting and meaningful. We don’t see this pattern very often in the realm of testing. The difference is in how useful measurement error is and what it can tell us about real life. In sports, that error term represents, in part, how good the other team is at playing defense (and, in part, how unskilled the offensive team is). In testing, error represents the ways in which our measurement instrument failed.

The difference between playing catch and playing sport is thus another application of classical test theory to understand the world. We might bemoan error when putting together a test, but in this particular case, error is something we should celebrate. Without it, we’d just be watching games of catch each Sunday.

As I put recyclable items into a paper bag today, I was struck by the “recycler’s terror”: what if the items I put in the recycling bin are, in fact, not recyclable? And what if some item I’ve cast into the trash can actually be saved? Even in towns with liberal recycling programs, would-be stewards of the earth are still stuck wondering what exactly can be recycled. The problem, as I see it, is one of measurement.

Items in the Recycling Bin = Recyclable Items + error

All that error comes in the form of mistakes on both sides of the recycler’s terror. Some recyclable items end up in the trash, and some trash ends up headed for the recycling center. We might conclude that, if the error is random, there’s little that can be done to make sure refuse is categorized appropriately. But what if careful analysis of that error term could actually help guide recycling policy?

Two types of policy could skew the error term one way or the other. The first is a policy of over-encouragement. This policy demonizes trash and lionizes recycling. It may include widespread campaigns to encourage careful sorting of refuse, scrutiny of trash bins (yes, some cities do look through trash to spot recyclable material), and recycling centers that accept a wide variety of papers, plastics, metals, and other consumer waste (sometimes including compost and electronics). The policy says, at its heart, think twice before throwing something in the trash. And this policy is likely to lead to more trash in the recycling bin.

The second policy is one of restriction, in which few items are accepted for recycling. This might involve telling citizens to check carefully for the number on the bottom of plastic containers because the recycling center can only handle certain types of plastic. The same restrictions might extend to types of papers and metals. The policy might also instruct people to separate out their different types of recyclables, rather than accepting all types at the same facility. The difficulty of determining what is recyclable means that the error term will skew the opposite way, causing recyclable material to end up in the trash.
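The push and pull between these two policies can be sketched as a toy simulation. Everything below is invented for illustration (the base rate, the noisy "perceived recyclability" signal, and the thresholds): residents recycle an item when their confidence exceeds a threshold, and each policy simply moves that threshold.

```python
import random

random.seed(1)

def sort_refuse(threshold, n_items=10000):
    """Simulate residents binning items by perceived recyclability.

    Each item is truly recyclable with probability 0.5; the resident sees a
    noisy signal of recyclability and puts the item in the recycling bin when
    the signal exceeds `threshold`. All numbers here are made up.
    """
    trash_in_bin = 0         # non-recyclable items sent to the recycling center
    recyclable_in_trash = 0  # recyclable items sent to the landfill
    for _ in range(n_items):
        recyclable = random.random() < 0.5
        signal = (0.7 if recyclable else 0.3) + random.gauss(0, 0.2)
        if signal > threshold:
            if not recyclable:
                trash_in_bin += 1
        else:
            if recyclable:
                recyclable_in_trash += 1
    return trash_in_bin, recyclable_in_trash

# "Think twice before trashing it" lowers the bar for recycling an item;
# "only recycle if you're sure" raises it.
encourage = sort_refuse(threshold=0.35)
restrict = sort_refuse(threshold=0.65)
```

Under the encouragement threshold, far more trash ends up in the bin; under the restriction threshold, far more recyclables end up in the trash. The total error never disappears; policy only chooses which side of the ledger it lands on.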

With this balance in mind, what’s a city to do? The answer has everything to do with what the city does with the error term. If a trash item being recycled causes kinks in the whole process, then greater restriction is necessary. But if the city is worried about excess trash (perhaps because it costs the city a lot of money to ship the trash elsewhere), then more recycling should be encouraged. The real problem for ordinary citizens is that we don’t know what happens to a non-recyclable item that is miscategorized. What happens if I accidentally toss the wrong type of plastic into the recycling bin? My biggest fear is that it will cause a whole fleet of actual recyclables to be tossed into the landfill. But that’s obviously not what happens (I hope).

Scrutiny of the error term here offers valuable guidance for any city looking to properly manage their recycling policy. By focusing on the goal of the recycling program, cities can choose to err on the right side and hopefully, ultimately, improve the above equation, so that all refuse items are properly categorized.

Red Sox manager Bobby Valentine has entered into the debate about uniform rule enforcement in sports by suggesting, vaguely, that baseball umpires need help calling balls and strikes. This relates very directly to a post here on PlusError where we considered how replay systems might change measurement error. The technology to call balls and strikes quickly through the use of a computer system does exist; television broadcasts employ it frequently. And Mr. Valentine cited evidence that the human eye is not capable of following a 90 MPH fastball in the last 5 or 6 feet of its path, exactly the space in which a ball or strike is determined. But if such a system were to be put in place, would baseball lose something valuable? Is the human element a key component of the true game?

Think of it like an equation. Here’s what a classical test theory model of the game would look like.

Real-World Baseball = True Baseball + error

The way that baseball actually gets played is just an approximation of what the true game should be like. Error comes in the form of rule deviations. These deviations could include something simple like a player disrespecting the game by smashing a bat against the dugout wall after a failed at-bat, a manager getting tossed out for spewing profanity at an umpire, a pitcher intentionally hitting an opposing team’s batter, bench-clearing brawls, or, as Mr. Valentine identifies above, bad calls by umpires. Based on this logic, the improvement in calls would reduce the error and make real-world baseball more exactly like what baseball is supposed to be.

But there’s a problem in that logic. Many of the sources of error identified above are things that people love about baseball! They collectively make up the “human element” of the game. Given that it is indeed a game played by humans, possible perfections of the game (perhaps played by robots with complex odds-systems for determining their actions) would actually take it farther from baseball’s true nature. Take, for example, intentionally hitting a batter to make up for a perceived slight that occurred earlier in the game, earlier in a series, or even far back in the season. Just like fighting in hockey, many sports commentators say that this is necessary to maintain basic standards of the sport.

This presents a problem for modeling the effect of better balls and strikes, because our equation must be updated to include a term that describes how people expect baseball to be played.

Real-World Baseball = True Baseball + Subjective Baseball + error

In other words, there are certain elements of how baseball is played that are socially-constructed and not part of the official rulebook, but without these elements people would see real-world baseball as somehow diminished and less than what they expect.

This makes Mr. Valentine’s argument harder to maintain. What makes the human element of batter-bonking different from the human element of a catcher working an umpire? Given that umpires frequently get calls right, and given that Major League Baseball does monitor umpires for their calls (with the help of television technology), the decrease in error may also be associated with a decrease in what people expect to see in a baseball game. Unless those two pieces (error and expectations) can be separated out when making the argument, Mr. Valentine’s reasoning falls short. A decrease in measurement error can also be seen, in part, as a detriment to the game itself.

In systems analysis, there is a concept called ‘state’. State is any detail internal to a system that might change the way its output responds to its input.

The human brain is an exemplar of this concept of state. As a newborn, you experience everything in a raw sensory way. As you interact with more things you store these experiences and begin to recognize patterns. After a year, my nephew could point at passing vehicles and say ‘car’, rather than just seeing a red blur accompanied by a rumbling noise. In this sense, your brain’s state may seem like a purely good thing, a way to make more meaningful connections between pieces of data, but it’s worth remembering that such state can also impact our interpretations of data in negative ways. A great example of this is the phenomenon of confirmation bias, the observation that individuals tend to assign greater importance to evidence that confirms their existing beliefs than they do to evidence that challenges them.
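The distinction can be sketched in a few lines of Python. Everything here is invented for illustration (the weights, the update rule, the notion of an "impression"): a stateless system maps the same input to the same output every time, while a stateful one lets past inputs leak into the current output.

```python
# A stateless system: output depends only on the current input.
def stateless_grade(answer_quality):
    return answer_quality * 10

# A stateful system: output also depends on what the grader has seen before.
class StatefulGrader:
    def __init__(self):
        self.impression = 0.0  # internal state built up from past inputs

    def grade(self, answer_quality):
        # A vague answer is nudged up or down by the grader's impression.
        score = (answer_quality + 0.2 * self.impression) * 10
        # The current answer updates the state for next time.
        self.impression = 0.7 * self.impression + 0.3 * (answer_quality - 0.5)
        return score

grader = StatefulGrader()
first = grader.grade(0.9)   # a strong answer builds a positive impression
second = grader.grade(0.5)  # the same vague answer now scores higher...
fresh = StatefulGrader().grade(0.5)  # ...than it would from a fresh grader
```

The same 0.5-quality answer earns different scores depending on what came before it, which is exactly the behavior the examples below describe.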

Before you start crafting an opinion on whether state is an asset or detriment in making judgments, it’s interesting to think about all of the state that might exist in a common assessment scenario, such as an in-class written exam for a college course. To narrow the scope, let’s only consider a single question on this hypothetical exam. Let’s consider the system to be the combination of the prompt (the question) and the grading process. I’ll define the input to this system as the student’s knowledge and the output as the number of points awarded by the grader. Now let’s consider the state in this system – that is, the things that might cause a different grade output with the same knowledge input. Since I started explaining state with the example of the brain, it’s natural to think about the brains of the grader (or graders). Here are a few examples of past input that might bias their judgment of a response:

  • How has the student done on previous questions of the exam? If the student has done well, the grader may be disposed to grade vague answers with greater leniency, while the reverse is true if the student gave poor answers on previous questions.
  • How is the student doing in the class as a whole? Again this information may cause the grader to make inferences about the student’s response that aren’t evident from the response itself.
  • Does the grader have a personal stake in the student’s success or failure? If the student has been working diligently and making use of office hours, an instructor may have a desire to see them succeed. On the other hand, if the student skips class and sends discourteous emails, there may be a subconscious desire for failure.
  • How have students whose exams have already been graded scored on the same question? If the question is a particularly nuanced one, instructors may notice errors they had not anticipated while grading earlier exams. This may cause later exams to be graded more critically, as the grader is on the lookout for the errors they have already seen committed.
  • How long has the grader been working? Grading exams can be very mentally fatiguing. This may lead graders to rush the grading of later exams and miss errors they would have caught if they were being more thorough. It may also cause them to deduct larger amounts for wrong answers on later tests.
  • What does the grade distribution look like for the exams that have already been graded? Instructors often have a certain grade distribution in mind when designing the difficulty of a given exam. Sometimes this desired distribution may be motivated to counter the impact of a previous exam that awarded grades that were too high or too low or to meet the anti-grade inflation course GPA targets set by the institution. In any of these cases, the grader may subconsciously get more or less lenient in their grading to try to hit the target distribution.

These are just a few potential examples of how state can impact the grade that is assigned. The later examples in the list would probably be universally recognized as inappropriate bias and bad practice; however, I expect that many people think that using additional information you have about a student to decode vague answers can actually lead to greater accuracy. I think this is a dangerous viewpoint. The problem with this, and with reliance on previous behavior in general, is that there is a risk of amplifying errors made in earlier assessments.

As fans of PlusError are sure to know, any measurement = real value + error. This is fine if we are talking about error that is random and uniform across a large number of assessments, because the error will tend to average out. In contrast, when we consider a student’s previous performance in making judgments about a current response, the error is no longer uniformly distributed. For example, if an error on Assignment One causes the instructor to judge that Timmy is a great student, future errors will tend more towards Timmy’s favor than against it, as the grader is more likely to assume a vague answer is correct based on Timmy’s reputation. To put this in mathematical terms,

measurement1 = real score1 + error1

measurement2 = real score2 + error2(measurement1) = real score2 + error2(real score1 + error1)

The error in the second measurement has become a function of the error in the first measurement and therefore no longer meets the random, uniform requirement.
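A rough simulation makes the amplification visible. All of the numbers here are assumptions (true scores, error spreads, and the strength of the “halo” term); the point is only that making error2 a function of measurement1 inflates the variance of the second measurement.

```python
import random

random.seed(42)

def simulate(n_students=10000, halo=0.5):
    """Simulate two assignments per student.

    `halo` (an assumed parameter) is how strongly the grader's impression
    from measurement1 leaks into error2. With halo=0 the errors are
    independent, as classical test theory requires.
    """
    gaps = []
    for _ in range(n_students):
        true1, true2 = 75.0, 75.0
        error1 = random.gauss(0, 5)
        measurement1 = true1 + error1
        # error2 has an independent part plus a part that is a function
        # of measurement1 -- that is, a function of error1.
        error2 = random.gauss(0, 5) + halo * (measurement1 - true1)
        measurement2 = true2 + error2
        gaps.append(measurement2 - true2)
    mean = sum(gaps) / len(gaps)
    variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return mean, variance

_, var_independent = simulate(halo=0.0)  # roughly 5**2 = 25
_, var_halo = simulate(halo=0.5)         # roughly 25 + 0.5**2 * 25, about 31
```

Timmy’s reputation, in other words, doesn’t just fail to average out; it systematically widens the spread of errors on everything he turns in afterward.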

In many ways, ‘stateless’ may sound like a synonym for ‘objective’, but that’s not necessarily the case. Since state is merely information or conditions created in the past, it can be completely rational and dispassionate, and this is where it is most perilous: it will have the appearance of making a grading decision more accurate.

Methods of eliminating the presence or influence of state in the previous examples are left as an exercise for the reader, although a few tips are fairly self-evident, such as removing identifying information prior to grading, grading individual questions rather than entire exams, and randomizing the sequence in which students’ exams are graded.

As a last aside, you may have noticed that there is another important system that was not mentioned previously: the system that takes the question prompt and knowledge and produces the response – the student herself. When it comes to the student, rather than trying to eliminate the non-knowledge-related state, we instead try to manipulate it. Often we put easier questions early in the exam to move a student’s nerves from on-edge to relaxed and confident. We may also worry about issues such as face validity – the idea that an assessment appears to the test-taker to be valid – to avoid angering or frustrating the student. We may even schedule assessments at certain times of day to avoid student fatigue and maximize wakefulness.

We here at PlusError hope we have manipulated the state of your brain sufficiently that you will strive for stateless assessment in the future. Of course, time will tell.


Two major sporting events going on right now help illustrate the differences in refereeing philosophy. The French Open allows players to challenge calls and uses a sophisticated system to replay the hit and judge it in or out. Players are given a limited number of challenges to avoid excessive use of the system. Conspiracy theorists can seize upon the fact that few calls are overturned (obviously because the system is rigged), but for the rest of us, it offers reassurance that a bad call can be fixed. The opposite is true in the NBA playoffs, where the conference finals are under way. Basketball offers no review of calls, allowing fans frequent fodder for complaints. Even in cases where the call is easily reviewable (for example, goal-tending), coaches have no recourse against a perceived missed call.

Differences in the two sports may explain this. Fouls are very frequent in basketball and the rules of the game are not enforced such that every foul is called. Indeed, fans who complain about bad calls also complain about missed calls–flagrant fouls that are ignored–and too frequent calls that interrupt the pace of play. But even with the differences between the two sports, replay reminds us that the role of the referee is a subjective one. In sports like basketball and baseball, referees and umpires play an outsized role in determining the game outcome.

We can think of the process as an equation.

Called Play = Actual Play + error

Conspiracy theories aside, we can embrace referees as relatively unbiased arbiters of the game’s rules. And thus, as calls are made on the multitude of plays throughout the game, series, and season, the record of a ref is a solid one with few errors relative to right calls. Think of a season of baseball, in which each team plays over 160 games not counting the playoffs; in that time, a team is likely to have bad calls made against it and for it, resulting in an error term that is uncorrelated with season outcome.
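That cancellation over a long season is easy to simulate. In this sketch the rates are made up: each game brings at most one bad call, equally likely to go for or against the team.

```python
import random

random.seed(0)

GAMES = 162          # one regular season, not counting playoffs
BAD_CALL_RATE = 0.5  # assumed: chance of one bad call in a given game

def season_net_bad_calls():
    """Net bad calls (calls for the team minus calls against) in one season."""
    net = 0
    for _ in range(GAMES):
        if random.random() < BAD_CALL_RATE:
            net += random.choice([1, -1])  # equally likely to help or hurt
    return net

# Over many simulated seasons, the net effect hovers near zero: bad calls
# for and against a team largely cancel out over a long schedule.
nets = [season_net_bad_calls() for _ in range(2000)]
average_net = sum(nets) / len(nets)
```

Any single season can still end up a few calls to the good or bad, which is why the error term matters far more in a short playoff series than over 162 games.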

But what happens when the stakes are higher? For example, a basketball game, score tied, one minute remaining – any clumsy call can cause calamity for the team who is wronged. No replay means no recourse, and the team may head home hearing “better luck next year.” In other words, the error term has greater stakes in some games than in others and often there aren’t chances to reconcile that through the course of the series. In basketball, teams get seven games, which is better than just one, but is it good enough?

The question now becomes one for the instant replay system itself. Can the system be deployed deftly and decrease bad calls? Can the system be integrated instantly and not interrupt the flow of the game? Can the game adjust to after-call allegations of error without allowing all calls to be examined? In tennis, such implementation is easy: caps on number of challenges and constraints on calls that are reviewable. But in basketball, with the increased complexity of calls, can a replay system do enough?

We can consider an additional equation to account for the replay system.

Called Play = Actual Play + Reviewed Play + error

And we can then assess, after adding the additional term, if the amount of variance explained is significantly greater. If all plays are reviewable, then despite any damage to play, we should see a significantly reduced amount of error. But, given likely limits on calls allowed for review in sports like basketball, perhaps the additional term won’t improve the refereeing. If, for example, missed calls cannot be reviewed but made calls can be, then referees may be inclined to ignore tough calls to reduce their reviewed percentage. This could increase the error term and negate the benefits of a replay system.
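The referee-incentive worry can be made concrete with a toy model. All of the numbers below are assumptions, but they show how a review system limited to made calls could increase total error if referees start swallowing their whistles on tough plays.

```python
import random

random.seed(7)

TOUGH_CALLS = 10000
ACCURACY = 0.9  # assumed: a referee gets a tough call right 90% of the time

def wrong_outcomes(swallow_rate, made_calls_reviewable):
    """Count tough calls that end up wrong.

    `swallow_rate` is the (assumed) share of tough calls a referee declines
    to make when only made calls can be reviewed. A swallowed call that
    should have been made is an error no replay can fix.
    """
    wrong = 0
    for _ in range(TOUGH_CALLS):
        if random.random() < swallow_rate:
            wrong += 1  # missed call: not reviewable, stays wrong
        elif random.random() > ACCURACY:
            if not made_calls_reviewable:
                wrong += 1  # bad made call stands
            # otherwise the bad made call is overturned on review
    return wrong

no_replay = wrong_outcomes(swallow_rate=0.0, made_calls_reviewable=False)
replay_made_only = wrong_outcomes(swallow_rate=0.3, made_calls_reviewable=True)
```

Under these assumptions the asymmetric review system produces more wrong outcomes than no replay at all, even though every reviewed call gets corrected: the errors simply migrate to the unreviewable side of the ledger.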

In short, though a replay system as an isolated construct would likely improve the quality of calls, the real world implications of such a system become more complex. For those clamoring for a basketball review system, consider the boons and banes of such a change. For those opposed, I’ll let the conspiracy theorists continue their campaign!

It seems likely that shortly after the “American Dream” entered the world’s lexicon, there were people claiming that such a dream was no longer attainable for the average person. Of course, for those who are succeeding, the American Dream seems very much alive. For those on the bottom, it seems, at best, an absurd prank and, at worst, a blame-the-victim mantra. But if anything has changed in today’s economy, perhaps it is just that the measurement equation we use to predict achievement of the Dream has changed.

Let’s consider the most classic formulation of the American Dream.

Achievement of American Dream = Hard Work + error

(Can’t forget that + error term! How else will we account for times where hard work doesn’t pay off?)

This equation is the reason that “Protestant work ethic” is credited with the greatness of America. Because settlers coming to the New World were willing to work hard and toil under awful conditions, the United States was built into a strong and prosperous nation. Those who worked hard were rewarded with success; they were living the American Dream and making something out of nothing. Note that in this equation, there is no additional predictor. The equation isn’t something like this.

Achievement of American Dream = Hard Work + Family Wealth + error

But what we hear now is that the American Dream is no longer attainable, at least not for some groups of people. But what if this is not because hard work doesn’t pay off but instead because our error term is actually correlated with our outcome, such that there is much less unexplained variance for people who have achieved the American Dream (most of them are hard workers) and much more unexplained variance among people who have not achieved the Dream (many of them are hard workers, but still aren’t achieving)? This explanation seems plausible when you include another term in the equation.

Achievement of American Dream = Hard Work + Education + error

We might refine this further by making Education a binary variable (1 for those with a college education, 0 otherwise) and interacting it with Hard Work.

Achievement of American Dream = Hard Work + Hard Work*Education + error

In this case, the predictive power of Hard Work counts double for people who have a college degree. And, if we think education is really a key predictor, then we could refine the equation even further.

Achievement of American Dream = Hard Work*Education + error

Now, the equation says that if you don’t have a college degree, you can work as hard as you want and it still has no effect on achievement. But for those people with a college degree, hard work is still a significant predictor of the American Dream.
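The interaction-only model is easy to express directly. Here is a small Python sketch; the coefficient is purely illustrative, not estimated from any real data.

```python
def predicted_achievement(hard_work, college_degree, error=0.0):
    """Achievement of American Dream = Hard Work * Education + error.

    Education is binary (1 with a college degree, 0 without); the
    coefficient of 1.0 is an illustration of the functional form only.
    """
    return 1.0 * hard_work * college_degree + error

# Without a degree, extra hard work has no effect on predicted achievement:
low = predicted_achievement(10, college_degree=0)
high = predicted_achievement(90, college_degree=0)
# With a degree, hard work is a strong predictor again:
low_grad = predicted_achievement(10, college_degree=1)
high_grad = predicted_achievement(90, college_degree=1)
```

Multiplying the two predictors, rather than adding them, is what encodes the claim that education acts as a gate on hard work rather than a separate path to the Dream.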

What does this all mean for our image of the American Dream? The initial equation is a beautiful one and belief in it requires simple faith in a country where anything is possible for those willing to work for it. But should we be surprised that during any time in our nation’s history, the error term was awfully large and the amount of variance explained by hard work was significant but small? News stories that say the American Dream is no longer attainable are simply seeing the glass as half empty and focusing on the variance that is unexplained by hard work. During good times, they focus on the reverse.

From a measurement perspective, however, the story is the same. In good economic times, the achievement of prosperity is likely associated with the same amount of error, but other factors (robust economic growth and plentiful job opportunities) help explain error variance. Without those factors, we are left with the same amount of variance but with no easy ways to explain it. Taking an incomplete equation and reporting it as fact is disingenuous. Still, the cynic’s perspective (that hard work is not predictive of attainment) is also wrong. And, as always, the most interesting part of the whole story is just what is contained in the “+ error” term.

The University of Wisconsin’s annual Teaching and Learning Symposium is going on this Wednesday and Thursday. Aaron Brower, professor of social work and vice provost for teaching and learning here at UW, started off the event yesterday morning with a thought-provoking introduction to the event’s main speaker, Nancy Cantor, chancellor of Syracuse University. In his introduction, Dr. Brower talked about the changes occurring in higher education and what they mean for instructors. One particular change he highlighted was an increased focus on measuring progress by competencies gained rather than credits earned. This is an interesting change because it highlights issues of measurement in college learning.

When we think of earning credits in classes, we believe that those credits are earned because of learning that the student demonstrated. If the assessment instruments for judging that learning were properly designed, then the grade earned in the class is equivalent to the amount of learning that occurred. Of course, this measurement isn’t perfect. The error in measurement could come through in a variety of ways. Students may “cram” before the test, thus demonstrating recall of material they have memorized for a short period of time rather than actually learned and internalized. Additionally, not all assessment tools measure learning equally well, and improper weighting of these instruments could prioritize elements other than learning. For example, rewarding attendance or participation, even when these items conflict with test scores, does not actually assess learning.

How does rewarding competencies (perhaps in the form of skills-based “badges”) change the issues facing the current credits-based system? First, it places the focus on evaluating skills rather than evaluating classroom performance. Second, it allows for easier communication of what a student has learned. Even the class title is often not enough to tell an employer what a student should have gotten out of a class, but specific competency badges can be accompanied by a list of associated skills. Third, it helps students understand what they are supposed to be learning so that their attention in a class is not on figuring out what they need to know for a test and instead is turned toward learning new skills.

But do competency badges change these issues of measurement? In short, no. There is no reason that each individual course cannot shift its grading focus to competencies. This is often just a change in how the instructor communicates the point of the class to students. The instructor can say, “In this lecture, I expect you to learn these three things. And on this assignment, I expect you to demonstrate those three things in two ways.” Any instructor can make this adjustment. The instructor may also need to change her tests or other assessment tools, but this kind of reduction of error in measurement is something all instructors should pursue, whether students are given credits or badges.

Competencies also do nothing to reduce the challenge in measuring more nebulous concepts of learning. For example, one primary skill that employers say they seek in a new hire is an ability to communicate effectively. Awarding a badge in this skill is no easier than awarding credit for courses that focus on written and verbal communication. And while seeing that the student has his “effective communicator” badge stamped on his transcript might give an employer some degree of confidence, no employer should assume that this means the person will be a superb communicator from day one. After all, among other factors, one challenge for communication is knowing how to speak the language, and there is no chance that this student learned that company’s parlance in college.

If we want to think about what the university will be like in the future, it’s important to do two things. One is to think about what changes will help solve problems that exist today. The other is to think about what new problems will be created. If a college were to switch to a competencies-based approach, the first complaint from employers would be that the competency badges are inaccurate, awarded too easily, or tied to skills that employers don’t care about. “We hired a student with four badges related to analytical thinking,” the employer will say. “But when we asked him on his first day to write a report analyzing the economic impact of R&D spending increases on our production efficiency, he was totally lost! These colleges are awarding badges too easily!”

In other words, when we increase the specificity of what we claim students know, we also increase expectations. And those expectations may not be met in a wide array of situations. But when badges don’t fix complaints from employers, the challenge will be what to do next. And this too is a measurement issue. Employers want a college degree (and perhaps an additional metric like GPA or number of badges earned) to be a perfect measurement of job performance. The only way to reduce measurement error in this equation is to make college more like job training. Of course, college is not job training. And therefore, each step toward reducing measurement error in using classes taken to measure skills gained undermines the true value of a college education, even if that education doesn’t represent a perfect metric for employers.