Fidelity to a program, policy, or practice means that the people carrying out its procedures follow them faithfully. They do not deviate in systematic or random ways. They do not decide to do something else because it seems better to them. Instead, they stick to whatever the procedure says and earnestly carry out the work.

Organizations may check for fidelity by conducting a fidelity review. These reviews can take many forms, so I’m going to describe the one I’m most familiar with and use that example throughout this post. This type of review comes from social work, and it’s called a case review. In a case review, an expert reviewer goes through a selection of cases and attempts to determine whether the caseworker was faithful to the procedures she was supposed to follow. The reviewer documents the overall rate of fidelity to a set of procedures and then reports the results in aggregate.

Let’s be clear from the start: This is really hard to do. It’s hard to figure out what fidelity means. It’s hard to figure out what to look at. It’s hard to figure out what rate of fidelity (or level of fidelity) is good and what is bad. Professionals in this area have a hard time doing this work well, so it’s no surprise that anyone trying to do it right finds it challenging.

But that doesn’t mean there is any excuse for doing it poorly and then drawing conclusions based on bad data. A fidelity review done incorrectly IS NOT TRUSTWORTHY! It tells us nothing! It actually makes us know LESS than we knew before, because now we have inaccurate, unreliable data flopping around our heads, and we have to consciously tell ourselves, “Ignore those results! Forget them! They mean nothing!”

Let’s consider the steps that need to be followed so that a fidelity review can go right.

1. What are we trying to measure? That is, what does fidelity mean?

Let’s consider a simple example. A procedure instructs that a caseworker must confirm the safety of every child in a home each time she visits the home. So, what does “confirm” mean? Does it mean the caseworker should ask the adult in the home if every child is safe? Is that confirming safety? Does it mean the caseworker must see every child? What if a child is not at home when the caseworker visits? Does she then have to travel to where that child is (school, a part-time job, visiting family or friends elsewhere) to confirm the child is safe? Does confirming safety require asking questions of each child in private? Does it involve some kind of examination of each child? These are the questions that must be clarified before the fidelity review can be planned and performed.

Furthermore, these questions cannot be answered unilaterally by the person performing the review, by a person in a position of authority, or by an individual worker. Instead, they must be defined in advance and disseminated to the workforce. If not, then the reviewer may conclude workers failed to meet a standard that workers had no idea they were supposed to meet. In that case, there wouldn’t be a lack of fidelity; there would be a lack of implementation.

2. What does the performance of what we are trying to measure look like?

We are discussing a case review, in which the reviewer is looking over the case notes and other documents the worker provided. It is not possible for the worker to capture everything that went on during her home visit, and thus the reviewer will always be missing information. The worker could have conducted expansive safety checks for each child but not documented any of it. In that case, there would be no evidence that the worker was following the safety check procedure. That’s the kind of error that we accept as appropriate, because it is reasonable to consider undocumented actions as actions that did not occur.

But what if the worker writes “all children okay” and makes no other mention of safety checks? Is that considered evidence that the worker indeed confirmed the safety of all children? The reviewer needs to specify what evidence will be accepted to judge whether the requirement was met. And if the reviewer wishes to judge with greater nuance than that binary standard, then there need to be more detailed specifications for each level.

3. Where will the evidence be located?

Fitting with the “if it isn’t documented, then it didn’t happen” standard, the reviewer also must specify where the evidence will be located. Does it have to be in one specific place? And if it isn’t in that place, but is somewhere else, then will the reviewer find that the requirement was not met? Just like question 1, this particular standard must be disseminated throughout the entire organization and understood by everyone. Otherwise, a worker may be performing a thorough safety check but putting the information on the wrong form; she believes (and likely her supervisor too) that she is doing everything correctly, but the reviewer will reach a different conclusion.

4. What sample of cases will be reviewed? And related, what kind of conclusions do we want to reach with the review?

If 10 “best cases” are selected and reviewed, then what kinds of conclusions can we draw about fidelity? We can conclude something about the fidelity of the cases selected as “best” by the people who selected them. Is that a useful conclusion? Can the conclusion tell us anything about cases more broadly? No, and no.

Sampling strategy is complicated and cannot be summarized in a paragraph, blog post, or even a series of posts. But the basic idea is that larger samples are better than smaller samples (and larger means dozens or scores of cases), conclusions between case types require large samples of every case type considered, and random selection of cases is better than any other selection method.
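To make the random-selection idea concrete, here is a minimal sketch in Python. The case IDs, population size, and sample size are all hypothetical:

```python
import random

# Hypothetical population: 500 case IDs eligible for review.
all_cases = [f"case-{i:04d}" for i in range(1, 501)]

# A simple random sample gives every case an equal chance of selection,
# unlike hand-picking "best cases." A fixed seed makes the draw
# reproducible, so the selection can be audited later.
rng = random.Random(2016)
sample = rng.sample(all_cases, k=60)  # "dozens or scores" of cases

print(len(sample))  # 60 distinct cases
```

If conclusions are needed for each case type, the same draw would be repeated within each type (a stratified sample), with a large enough sample in every stratum.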

5. What evidence do we have to suggest that Reviewer A’s judgment on fidelity is the same as another reviewer’s?

This question is all about “interrater reliability,” a measure of the likelihood that, given the same materials, Reviewer A will reach the same conclusion as Reviewer B. In my experience, questions 1 through 3 are usually answered by those conducting fidelity reviews, but this question is more often ignored or answered with inadequate evidence.

Just like sampling in Question 4, there are entire books written on establishing and measuring interrater reliability. Go read those to fully understand this question. The basic procedure is that reviewers should be trained together and then given a set of cases to review and rate independently. If interrater reliability is established, then two or more reviewers, given the same materials and working independently, should reach the same conclusion. If they do not, it is because A) they aren’t using the same standards, B) the standards aren’t realistic (for example, too complicated) and can never be used to produce the same results, or C) some combination of A and B. (Reason C is the most likely explanation.)
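One common statistic for quantifying interrater reliability on categorical judgments is Cohen’s kappa, which measures how much two reviewers agree beyond what chance alone would produce. Here is a minimal sketch, using hypothetical “met”/“not met” ratings from two reviewers on the same ten cases:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of cases rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal rates.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings: the reviewers disagree on one case out of ten.
a = ["met", "met", "not met", "met", "not met", "met", "met", "not met", "met", "met"]
b = ["met", "met", "not met", "met", "met",     "met", "met", "not met", "met", "met"]
print(round(cohens_kappa(a, b), 2))  # → 0.74
```

Values near 1 indicate strong agreement; values near 0 mean the reviewers agree no more often than chance would predict, a sign that the standards need rework.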

If reviewers disagree, then the standards should be reviewed, rewritten as necessary, the reviewers re-trained, and the exercise repeated. If the reviewers continue to disagree, there are three options: A) once again revise the standards, retrain, and retest; B) create a method by which the reviewers’ differing judgments can be combined (say, through averaging, or through mutual discussion to reach consensus); or C) scrap the entire effort and judge the particular procedure used as too subjective to be evaluated for fidelity.

Under no circumstance should a fidelity review be conducted without diligent, earnest efforts to establish interrater reliability beforehand. For example, perhaps an agency has one person passionate about fidelity reviews. No one else has time to help or interest in the topic, so this reviewer proceeds on his own and writes up the results of his fidelity review. What good are these results? THEY ARE USELESS AND SHOULD IMMEDIATELY BE DISCARDED. Furthermore, the employee should be disciplined for wasting time. At best, the employee has produced a subjective report on his own views of what should be happening. At worst, he has produced a biased report that makes people who read it stupider because they use his subjective, unverified judgment to answer questions about fidelity, when in fact his report provides no answers at all.

The best thing about establishing interrater reliability is that we do not need multiple reviewers for each case. Instead, because we know the judgment of Reviewer A is similar to the judgment of Reviewer B, we can have cases looked at by only one reviewer. This means we can increase our number of cases reviewed more efficiently.

6. How will conclusions be written?

Congratulations to you if you managed to complete a properly sized case review after establishing interrater reliability! Now, what kinds of conclusions can we reach? The best advice is to write limited conclusions with plenty of caveats and limitations. For example, let’s say our review finds that 60% of cases have safety checks documented, and 40% do not. It is not reasonable to conclude that 40% of children served by our agency are unsafe. Instead, we can write that 40% of children served did not have safety checks documented in the specific places reviewers looked. Anything other than this conclusion is an abuse of the fidelity review.

Fidelity reviews are hard to do. They take practice, continued reading on best practices, and a willingness to do a lot of work for limited return. But when fidelity reviews do not follow the above recommendations, they are a waste of time that makes people know LESS than they did before the fidelity review was conducted. So proceed with caution. A quality fidelity review is a thing of beauty, but you should have no faith in unreliable fidelity reviews.

In last week’s post, I described a promising practice in child protection work: Differential Response, or DR. DR is an attempt to rewrite the traditional “one size fits all” approach to child maltreatment investigations. Rather than treating all family members as child abuse perpetrators until evidence proves them innocent, child protection workers approach some families (reported for less-serious maltreatment allegations like inadequate supervision or having a dirty home) and attempt to encourage them to make the changes needed to protect their children. For example, if mom is often called into work unexpectedly and leaves her 9-year-old daughter to watch her 2-year-old son, then the worker will help mom understand why that’s not okay and find resources that will help provide childcare when needed. In some cases, a traditional investigation is still needed, and any case where children aren’t safe at home results in a traditional investigation no matter what initial approach was tried. But the basic idea is this: Some families can do better to protect their own children with help from the child welfare system, but not when they are treated like potential criminals.

Here’s the dilemma for evaluators, then: It appears that the implementation of DR results in an overhaul of the culture of the CPS department. Evaluations done by my research center find that workers like DR in part because it aligns with their overall values of how to treat people.

It’s kind of like working at a customer service desk. Your first manager tells you that all customers are liars trying to steal money from your company with phony claims. As such, the manager instructs, you should treat them all with contempt and suspicion. That sounds like a difficult thing to do, as many of your interactions are likely to turn hostile. But then a new manager comes in and says, yes, there are some fraudulent claims, but unless there is evidence right away that the claim is fraudulent, we should assume all customers are telling the truth. Would you be surprised if the customer service workers liked that change? And would you be surprised if even customers with fraudulent claims were treated with more respect than before?

These are exactly the changes we see in DR. The process of “family engagement” (which is essentially building rapport and encouraging change through positivity) is one that takes hold whether a report of maltreatment was assigned to focus on engagement or assigned to receive a traditional investigation. Workers realize that their interactions with families—no matter the purpose of the interaction—go better when the worker treats everyone with respect.

Thus the evaluator must ask, what is the specific mechanism by which we expect DR to cause change in outcomes? Previously, we said it was a greater emphasis on family engagement. But if we find that engagement practices are used no matter what type of investigation we assign the report to, then engagement cannot be the mechanism we count on to cause differences.

This problem ends up undermining the gold standard method to determine causal effects, random assignment. In most cases, we would want to evaluate DR by taking all reports eligible for the family engagement track and then randomly assigning them to receive either a family engagement approach or a traditional approach. Then we know that, with enough cases, the only difference between the two groups is the approach the worker took with the family.

But if workers are going to bring family engagement practices to all types of families, then we can no longer count on random assignment to produce results that show cause and effect. Instead, we may need to turn to another method, something like propensity score matching. In this method, we take regions that have implemented DR and look at the cases they assigned to the family engagement track. Then, we take another region that has not implemented DR, find similar families, and then compare the outcomes, under the presumption that families in non-DR counties are not receiving the increased engagement work. (Even this is not a fair assumption, as the focus on engagement may be added to trainings even when DR is not practiced. Family engagement is the social work practice du jour.)
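To illustrate just the matching step of propensity score matching, here is a toy sketch in Python. The family labels and scores are invented; in a real evaluation, the scores would come from a model (for example, a logistic regression predicting family-engagement assignment from family characteristics):

```python
# Hypothetical propensity scores: the estimated probability that a
# family would be assigned to the family engagement track.
dr_families = {"A": 0.62, "B": 0.35, "C": 0.81}            # DR region
comparison_families = {"X": 0.60, "Y": 0.33, "Z": 0.79, "W": 0.50}  # non-DR region

# Greedy 1:1 nearest-neighbor matching without replacement: pair each
# DR family with the comparison family whose score is closest.
matches = {}
available = dict(comparison_families)
for fam, score in sorted(dr_families.items(), key=lambda kv: kv[1]):
    best = min(available, key=lambda c: abs(available[c] - score))
    matches[fam] = best
    del available[best]

print(matches)  # → {'B': 'Y', 'A': 'X', 'C': 'Z'}
```

Outcomes are then compared across the matched pairs, under the presumption (noted above) that the comparison families are not receiving the increased engagement work.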

Under these circumstances, is it surprising that much DR research finds small or no differences between similar families assigned to different tracks? According to critics of DR, the lack of differences is not just unsurprising; it is damning. If you can’t find any evidence that assigning reports to different approaches increases child safety in the long term, then why bother doing it at all?

What this criticism misses is the evaluator’s dilemma. How can you effectively evaluate a program when the program does not have a clear mechanism by which it will create change? If we report evidence that shows no difference, but we don’t expect there to be a difference in the first place because of confounding factors, then should we be reporting the results at all? And if “no difference” (or an absence of evidence) is used to indicate that there is no support for this practice change (that is, evidence of absence), then our reporting of the results has only armed critics who will twist facts to support the conclusion they decided on before seeing any evidence.

That’s not to say the evaluator was only looking for reasons to support the program and trying to fight off critics. It’s instead to say that any evidence resulting from an evaluation like this can only be used to draw limited conclusions about the program’s effectiveness. We probably cannot look at long-term outcomes, given the similarities of both approaches in a DR system. But we can still look at other questions: Do workers like one approach over another? Do families? These pieces still matter.

I don’t have any more of a specific solution to this evaluator’s dilemma, but I hope to explore this question in future posts. Thanks for reading!

Evaluation work is challenging. It requires teasing out the effects of an intervention from the effects of daily life and business as usual. Finding these effects can sometimes be done using methods like randomized control groups and other methods that come from research labs. But even with these methods in place, real life cannot be held at bay.

Let’s consider an example from social work and child welfare, the areas I evaluate. Child protective services (CPS) is a reactive body. It reacts to reports of child maltreatment. It is not a preventative body (though some reaction naturally involves prevention; more on that in a bit). And as such, it must design its interventions to focus on what can be done after child maltreatment is alleged to have occurred.

The traditional way that CPS has responded is as a criminal investigation unit, gathering evidence to legally justify the suspension or termination of parental rights in order to keep a child safe from future harm. But this evidence is not often needed. Why is this the case? First, the CPS worker often cannot find evidence to conclude that harm occurred. For example, imagine a report alleges abuse of a baby because a neighbor claims the mother was rough when putting her baby in a car seat. If this report is investigated (and many reports are not) and there is no evidence to suggest harm to the baby (no bruising, etc.), then the report will be dismissed. For another example, imagine a teacher hears an 8th grade student say his parents let him drink beer whenever he wants. If the investigation of this claim finds the parents saying that is not true and the student saying he was just bragging to impress his friends, then there is no evidence to suggest harm.

Evidence also often isn’t needed because, whether harm occurred or not, the CPS worker may conclude the child is safe at home. That is, the worker concludes that the child’s parent or parents are capable of protecting the child from harm, despite what may have happened in the past. For example, imagine police respond to a house for a report of domestic violence. Mom is in the kitchen crying and being comforted by her 10-year-old daughter. Bad boyfriend is in the yard pacing and yelling. Bad boyfriend gets arrested, and the CPS worker visits. Mom says she and bad boyfriend are done and promises he won’t be allowed back in her house again. Whether mom’s claims are true or not, the CPS worker is likely to conclude that mom is capable of protecting her child. There is unlikely to be any justification for removing the child from the home.

What do these factors add up to? We have a CPS system that acts like the police, but often has no need for the evidence it collects. Instead, CPS workers are confronting the challenges of daily living that may occasionally make lives difficult for children. Kids may go to bed without having enough to eat. Kids may wear shoes until they start to fall apart or not wear a warm coat because they don’t have one that fits. Kids may interact with a parent who is an alcoholic or drug addict. Kids may deal with a parent who is overly critical or not affectionate. Kids themselves may be reactive or hostile at times and faced with parents who feel overwhelmed and unable to guide their child’s behavior. None of these situations are likely to be enough to justify removing the child from the home and placing them into the care of the state. Instead, they are situations in which the family needs some help. These types of situations are present in a majority of CPS cases, and a majority of children alleged to be harmed are left at home in the care of a parent, exactly what happened before CPS showed up.

So what can CPS do? There are still circumstances in which a criminal investigation-type approach is needed. There are horrific cases of child abuse that require the child be removed from the home as soon as possible and the perpetrators prosecuted. But thankfully, these cases are rare. When the outcomes are criminal prosecution or nothing, the latter will apply in most cases. And thus, CPS has innovated and found an alternative way to approach families.

This new approach (which is hardly new anymore) is called differential response, and it attempts to differentiate cases in which a criminal investigation response is needed (often called “traditional response”) from those where an alternative is needed (often called “alternative response”). Traditional response is a stereotypical CPS approach, but alternative response attempts to do more with the family, including understanding the family’s strengths and needs and attempting to find ways for the family to build on their strengths to address their needs.

For example, consider the example above with mom and bad boyfriend. Perhaps mom could benefit from some counseling to help her deal with the effects of domestic violence and address potentially abusive behaviors from partners. Perhaps mom could also use some job help that would reduce the appeal of living with a partner who can help with bills. Perhaps her daughter could benefit from some therapy to help her deal with the trauma of being exposed to domestic violence. Perhaps all of these things will reduce the chances that CPS will be called again because mom has entered into another violent relationship.

Differential response begins with the premise that a family will be more likely to accept this additional help if the CPS worker acts like a caring social worker focused on helping the family and does not act like a criminal investigator determined to collect the facts and document exactly what the family did wrong. Differential response also holds the premise that the specter of possible criminal prosecution reduces the parents’ ability to honestly and earnestly address their problematic behavior. Reports assigned to the alternative response track thus see less use of criminal investigation-type tactics; instead, workers try to engage families in making changes that will help protect their children from future harm.

But what’s good for the goose is good for the gander, and the challenge of separating out alternative response tactics from traditional response tactics gets complicated. I’ll consider that challenge later this week in Part 2 of this series.

Why did Mitt Romney lose the 2012 election? There are lots of reasons one could offer. When I think back on his candidacy, I think of the ways his positions shifted (from relatively liberal in some areas, useful when running for governor of Massachusetts) to pretty conservative (useful when running in the Republican presidential primaries). And I think of some of his bizarre comments, like saying he had advisors bring him binders full of women and how the trees in Michigan are the right height. Immediately after the 2012 election, conservatives offered their own explanation: Mitt Romney wasn’t a true conservative. If he had been, then he would have attracted mass support from true conservatives, enough support to easily defeat President Obama.

This same argument is used today by those who support Republican presidential candidate Senator Ted Cruz. An essay at The Federalist argues that Ted Cruz is the only true conservative in the race. Marco Rubio has said the same thing.

Meanwhile, the same thing is happening on the Democratic side, with candidates Hillary Clinton and Bernie Sanders arguing over who the “true” progressive is in the race. Each wishes to hold the title, and each has an argument against the other using the title.

It strikes me that the logic used by candidates from both parties can be expressed as a simple equation:

Political Prosperity = Ideological Purity + error

All candidates are arguing that their nomination will lead to party success, while nominating their opponent would lead to failure. But, of course, this is only argumentation, as we have to hold the election to truly see who wins. As such, their arguments for their own ideological purity are merely an imprecise measure of political prosperity. Every other factor has to get lumped in the error term.

And you know who looms out of that error term? Donald Trump, who currently leads Ted Cruz in Republican delegates. Mr. Trump would not call himself ideologically pure or a true conservative. (At some point in time, we must acknowledge that there is so much bluster in any political campaign, mixed with unending commentary, that a search for a candidate’s name and the phrase “true [conservative, progressive, etc.]” turns up plenty of results. I am trying to make an argument, while also respecting all the people who indeed have called Mr. Trump a “true conservative.” There are some.) Mr. Trump has called himself a “commonsense conservative,” which has a pragmatic ring to it.

We could also look to President Obama as another person who fits in that error term. Mr. Obama has famously eschewed politicking. He has argued that his positions are commonsense and that there are many areas that both Republicans and Democrats can agree on. Indeed, his favored legislation to increase the number of Americans with health insurance is based on a plan from a conservative think tank (the Heritage Foundation) and was previously implemented by Mitt Romney in Massachusetts. I think Mr. Obama would be happy to call himself progressive, but he has shown continued disinterest in ideological purity.

Perhaps this is simply the difference between running for office and actual governance. Ms. Clinton, Mr. Cruz, and Mr. Sanders have legislative records that they must present to the public. (Ms. Clinton also has leadership experience, especially her time as Secretary of State.) Mr. Trump has his business experience, which involves deals and dictates more than policy. Mr. Obama, in seeking reelection in 2012, had his governance and leadership. Naturally, leading is more pragmatic than advocacy, and a president must actually govern, while a senator is more an advocate for the people of her state.

In any case, the argument of ideological purity as a determinant of political prosperity seems an odd one to me. Even if purity could help win an election (and I find no evidence that it can), it is surely less useful in governance. A candidate can declare she will push to lower taxes until economists crunch the numbers and suggest it is impossible without massive spending cuts. A candidate can decry drone strikes and state he will ban their use until a military leader has a dangerous terrorist in their sights and asks to pull the trigger.

In the case of this equation, ideological purity might be a useful rhetorical tool, but the error term is too large for it to offer much prediction of political success. Purity makes perfect bluster, but the realities of politics often bust the blustering candidate.

Many people, experts and non-, believe this equation describes a lot of what we need to know about college education:

College Costs = College Value + error

Yes, the outlay to pay for college can be high. Even taking out a few thousand dollars in loans each year can add up to substantial debt by the time the student finishes his education. Leaving college with $20,000 in debt might not be uncommon, but it is a burden.

That burden, though, has high value over a lifetime of earnings. And when students spend more, it can mean more value. A community college two-year degree costs less and has less value than a four-year degree from a state school, which costs less and has less value than a degree from Harvard. We expect that the cost is worth the value, and that those who pay more also make more. College education is a good investment.

So what about that error term? How does the equation mislead us? Let’s examine two examples.

The first comes from Senator Bernie Sanders, who is currently running to win the Democratic Party’s presidential nomination. Sen. Sanders has released a plan to eliminate tuition from state colleges and universities. This plan is a more dramatic version of President Obama’s proposal to make two years of community college free for everyone.

This plan makes several very interesting changes to the equation. First, it changes the makeup of college costs. The student now faces a cost of $0, but the plan doesn’t change the costs of providing that education. Instead, it just shifts the bill from the individual to society. (Actually, if you read the plan closely, the cost isn’t really shifted to society. “The cost of this $75 billion a year plan is fully paid for by imposing a tax of a fraction of a percent on Wall Street speculators who nearly destroyed the economy seven years ago.” No word on whether the tax will apply to all Wall Street investors or just those that Sen. Sanders believes played a role in “destroy[ing] the economy.”)

So are educations that cost more to provide equal to educations that are higher in value? That’s difficult to assess. Tenured professors often make more money than nontenured professors, and in theory, these professors are better at teaching students. But if a university is expanding (suggesting it provides a high quality of education, thus attracting many students), then it will have more nontenured professors. A university with only tenured professors hasn’t hired professors in quite some time and thus isn’t growing. In any case, Sen. Sanders offers no corresponding plan about reducing college costs overall.

The result is that students will be forced to judge a different kind of “cost = value” equation, one in which they choose between the value provided for $0 and the value provided for, say, $40,000, the tuition at a private college or university. Because this erases the variance between different public schools (that is, tuition at the best public school is now equal to tuition at the worst), quality will explain less of the variance in cost, increasing the error term.

Free college education could also lower the value of a college education because it could increase the number of people with college degrees. The college degree could become more like the high school diploma: It’s only noteworthy if you didn’t earn one. As such, we could see college costs remain the same but value decrease. I don’t think this is the intention of Sen. Sanders when putting forward his plan, but he offers no other details on increasing enrollment in trade programs or creating programs that would guide students along a professional path early in their post-secondary education.

The second example of how this equation could mislead us comes from the relationship between federal student aid money and college tuition. The New York Federal Reserve released a report last summer that linked increases in federal financial aid with increases in college tuition. Essentially, because so many students use federal aid to pay for college (either through grants or loans), colleges know that additional money is out there. Thus when schools decide to raise tuition, they know the burden will not immediately fall on students. And when tuition rises, there is more push for greater levels of student aid. The result is a cycle of ever-increasing costs.

Do the additional costs add up to greater value? It’s hard to draw a conclusive answer. Let’s say the additional tuition money is put toward new lab space, better computer equipment, or upgraded study spaces. All this should help students learn more, thus increasing the value of their education. But it could also be put toward increasing the size of college administration, paying the very high salary of a college coach, or building fancier dorms.

It’s worth noting that making more money available to students, especially in the form of loans, while doing nothing to control costs is quite different from how the federal government handles other entitlement programs. Take Medicare, for example. The federal government uses its leverage to keep healthcare costs down, and the expansion of coverage under Obamacare increased these efforts. As such, hospitals that wish to thrive must find ways to deliver quality care for less. This leads to innovations like focusing on patients who cost the most and offering them greater preventative care.

If the same thing were done with colleges, then we might see the equation protected. But as it stands now, increasing federal student aid seems to lead to increased cost with no corresponding increase in value.

For a high school student trying to decide if college is a good investment, does the equation provide guidance? Or is the unexplained variance too large? At a rough level, I think all people should pay attention to the equation’s general principles: If you are investing money, investing it in your own education will pay a fantastic rate of return over your lifetime. But be mindful of the error term. Some education is overpriced for its value, and this occurs at both ends of the cost scale. A university education can be overpriced relative to its value at $40,000 a year or at $4,000 a year. Only careful consideration of the relationship, and vigilance about the error term, will lead to the best outcome.

Let’s say you have two possible routes you could take to work, differentiated by the exit you use to get off the highway and the corresponding different roads through town. The distance is roughly equal, so you wonder which route is faster. To decide, you randomly pick which route to take every day for a month, resulting in 11 trips to work on Route A and 13 trips on Route B. You average the times and compare. Route B takes, on average, 21:13, compared to Route A at 22:15. Are you safe to conclude that Route B is the faster route?

Statistics has a method to deal with questions like this. First, it notes that these 24 total trips are just a sample of all the days you will spend driving these routes. Any sample can contain quirks that won’t persist. Perhaps, for example, the timing of a stoplight along Route A was off due to some road construction, resulting in a 20% greater likelihood of getting stopped at that light. Once the light’s timing is corrected, this delay will disappear. So we have to set a bar for “different” that is higher than the differences caused by picking a sample of days instead of looking at all days.

Second, the statistical test we would use to see if the two times are different takes into account the variance in the data. The times above are averages, and what those numbers don’t tell us is how widely the data fluctuated. Let’s say that Route B had times ranging from 18:00 to 25:00, while Route A’s times ranged from 21:00 to 23:00. Route A may be slightly slower on average, but Route B could leave you sitting in traffic for a longer period of time. We need to factor this in if we are to truly judge if one route is faster than the other.

We can run the test and have it spit out a “p-value”: the probability of observing a difference at least as large as the one we saw (21:13 compared to 22:15) if the two routes were, in truth, equally fast. In the social sciences, we generally use a standard of 95% or greater confidence that the difference observed between the two numbers is not random chance but instead is because of some differentiating factor (in this case, the two routes).
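The underlying test can be sketched with nothing but the Python standard library. The trip times below are made up for illustration (chosen to roughly match the averages in the example); a real analysis would convert the resulting t statistic and degrees of freedom into a p-value using the t distribution, for instance via a stats package.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two
    independent samples with possibly unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical trip times in minutes: 11 trips on Route A, 13 on Route B
route_a = [22.5, 21.8, 23.0, 22.1, 21.5, 22.9, 22.3, 21.9, 22.6, 22.0, 22.4]
route_b = [21.0, 18.5, 24.8, 19.2, 22.4, 20.1, 23.7,
           19.9, 21.6, 22.9, 18.9, 23.3, 20.4]
t, df = welch_t(route_a, route_b)
```

Note that Route B’s made-up times swing much more widely than Route A’s, which is exactly the variance issue raised above: a larger spread inflates the denominator of the t statistic and makes the same average gap less convincing.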

That’s useful, but only to a point. The difference may be “significant” but does significance tell us anything about meaningfulness? The difference between the two route times is a little over a minute. Even if statistical tests show that this difference is significant, it can’t factor in questions like “How enjoyable is each route to drive?” If you prefer highway driving to city driving, then Route B, with its later highway exit, may be better. If Route A takes you past a scenic lakeside park, then perhaps it’s the better route for you.

All this presents a challenge to social science researchers. And we can represent that in this equation:

Significant Differences = Meaningful Differences + error

Why is this equation a challenge? It has much to do with the current wave of skepticism toward the idea of “significance.” In short, there are many ways in which data can be prepared and analyzed to increase the chances of a “significant” result. And because significance is probabilistic, the more tests we run, the higher the chance that one or more “significant” results reflects random chance rather than a real difference.

For example, let’s imagine we wanted to craft a model that could predict support for a political candidate. What could influence this? We measure gender, age, race, religiosity, income, and where a person lives. For political support, we measure passion for the candidate, voting, donating, and talking to friends and family. That’s six predictors for four different outcomes, a total of 24 different tests. The chance of finding no “significant” differences at all among these is .95^24, or about 29%. In other words, just by random chance and because we’ve captured so much data, we will probably find some “significant” results we can report even if in reality none of these things predicts candidate support. That is, we find significant differences even if, in actuality, it’s all error.
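The arithmetic is simple enough to verify directly. The alpha level and test count come straight from the example above; the Bonferroni line is one standard guard against this problem that I’m adding for illustration.

```python
# Probability of at least one false positive (a "significant" result by
# chance alone) when running many independent tests at alpha = .05.
alpha = 0.05
n_tests = 24  # 6 predictors x 4 outcomes

p_no_false_positive = (1 - alpha) ** n_tests  # ~0.29
p_at_least_one = 1 - p_no_false_positive      # ~0.71

# A common (conservative) guard is the Bonferroni correction,
# which evaluates each comparison at alpha / n_tests instead.
bonferroni_alpha = alpha / n_tests            # ~0.002
```

So with 24 tests there is roughly a 71% chance of at least one spurious “significant” finding even when nothing real is going on.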

Because of these types of realizations and social science’s current “replication crisis,” scholars are dedicating far more attention to the error term instead of to the times in which significant does concord with meaningful. And that means that scholars can no longer declare “significant” results to have merit. Let’s say we find that reading a biography of a political candidate in which the candidate describes her hardscrabble childhood increases passion for the candidate by .4 on a scale from 1 to 7. This increase might be “significant” by standard statistical tests. If we use significant and meaningful as synonyms, that might be enough to declare the result matters. But if we give scrutiny to the method (the participant has to read a whole book?) and the outcome (and this raised passion by an average of only .4?), we should land squarely in the error, rather than the concurrence, between the two terms.

Any social scientist worth her salt will tell you that statistics without theory is nonsense. A statistical test cannot tell us what results mean or whether they are meaningful; only theory and argumentation can do that. When we highlight the times in which significant and meaningful don’t align, we reinforce this vital lesson. As more findings in social science (and other fields too) are questioned as non-meaningful despite “significance,” we improve scientific pursuit. And that is most certainly a meaningful outcome.

Tell me if you disagree with this argument. Some college freshmen struggle; it can be hard to make the transition to a new living situation, new people, and new responsibilities. This presents a challenge for college professors: How much leeway should they give these struggling students? Professors need to accept that some students will fail. This helps the university uphold high academic standards, it provides valuable lessons for struggling students, and it may help some of those students find a new university where they can be successful.

Do you agree? Regardless, do you think the argument is controversial? I’m ambivalent about the argument itself, as some struggling students actually don’t know what they are supposed to do. As such, they need information, not “tough love.” But I don’t think the argument is controversial. That is… unless you phrase it like this:

“[We need to find struggling students and remove them from the university in order to improve our retention rate.] This is hard for you because you think of the students as cuddly bunnies, but you can’t. You just have to drown the bunnies… put a Glock to their heads.”

This is how Mount St. Mary’s now-former president Simon Newman phrased his desire to increase university retention rates by kicking out students quickly rather than waiting for them to drop out and count against the university. Mr. Newman resigned earlier this week.

This quote isn’t all that led to Mr. Newman’s resignation. His words emerged from an email exchange about a survey Mr. Newman wanted to give to students; the email exchange was published in Mount St. Mary’s student newspaper. Though the survey’s instructions said the survey would provide the students with useful information and had no right or wrong answers, Mr. Newman intended to use the survey’s results (which were not confidential) to identify students deemed more likely to fail and then to encourage those students to leave the university. The email exchange contained many dissenting voices who were horrified by Mr. Newman’s plan. And Mr. Newman’s actions after the email exchange was published were wrong-headed as well. For example, he fired the newspaper’s faculty advisor for being “disloyal” and punished two other faculty members. But it’s hard not to see this as a question about a poorly chosen metaphor and its backlash.

Metaphors are powerful things, able to make an argument more vivid or connect an argument more closely to someone’s own experience. For example, in my own work, I often compare foster youth living in group homes to a dysfunctional high school. Foster youth are treated as irresponsible children who must be kept under control by draconian rules. If the same thing happened in a high school, what would result? Classes wouldn’t be productive, students would disrupt things often, and everyone would do whatever they could to not be in school. Is it any wonder, then, that the same things happen in group homes? In my opinion, this helps shift our thinking away from the trauma that the foster youth have experienced and onto how any teenager would react to such a setting. It humanizes foster youth and turns our attention to the real problem: the restrictive group homes.

But with the power of metaphors comes real risk. Mr. Newman used a metaphor that highlighted the warm and fuzzy feelings professors have toward their students and then contrasted that with the importance of putting the university first. His metaphor is stupid in many ways. Drowning a bunny does not improve that bunny’s life, while helping a student find a university where he can succeed does help him, certainly more than waiting for him to fail out later in the school year. But the metaphor is also a good one if it helps professors recognize that coddling students, in many cases, does nothing to help them. Indeed, Mr. Newman’s attitude is exactly opposite to what I have seen other universities do: string students along, right on the edge of failure, in order to keep collecting their tuition money. Kudos to Mr. Newman for not proposing something like that, a far more controversial idea.

Let’s try to describe Mr. Newman’s problem with metaphor in an equation.

A Metaphorical Point = A Literal Point + error

I don’t think that error term is particularly large, because humans seem wired to understand and appreciate metaphors. So what did Mr. Newman do wrong? Why was his metaphorical point not immediately translated into a literal point? Perhaps it is the graphic nature of his metaphor. Drowning and shooting bunnies is dark territory, especially here in the United States, where most people have generally positive feelings about rabbits. (Mr. Newman was born in the UK, so he may have different feelings about the animal.) And when bunnies are a stand-in for students, a half-reading of the metaphor turns even darker. (Drowning and shooting students? Obviously a terrible image.) Combine all that with the context (Mr. Newman’s controversial proposal to weed out students) and you have a metaphor that fails to convey a literal point.

But should we defend Mr. Newman’s use of a clumsy metaphor? We’ve likely all deployed clumsy metaphors in the past, or heard others do the same and been willing to go along with them. I remember a friend from high school describing her crush on a boy as if she were a dock and he a boat, drifting closer to and farther from her. I nodded along politely even as seagulls, tides, and sandbars got added into the metaphor (okay, maybe I’m exaggerating a little, but it did get very complex!). So should we have greater sympathy when someone else struggles to use metaphorical language to advance a point?

In this case, I’m going to say no. Mr. Newman might not have gotten as much attention for his plans if he hadn’t used a metaphor, but the controversy would have remained. And if he had used the metaphor while encouraging professors to assign tough grades because students can’t be coddled, then he could have apologized for his phrasing and ended the uproar. Certainly people are condemned when their metaphors don’t translate well into literal points, and I’ll rush to their defense the next time it happens. In this case, though, Mr. Newman’s plans were the big problem; his language simply encouraged more people to pay attention to those plans, and his actions afterwards only hurt him. So, Mr. Newman aside, others who use inapt metaphors should be defended. A metaphor that doesn’t lead to a literal point is mostly just clumsy language, not the sign of anything nefarious.

If being a graduate student was pitched like a job, it would sound something like this. “Are you looking for a job that pays you to work 20 hours a week, but requires you to work 80? Are you looking to be berated for your failings, but never given instruction on how to improve? Do you wish to be supervised by people who are experts in their field but cannot functionally interact with other human beings? If so, apply to be a graduate student at a research university!”

I’m dramatizing somewhat, obviously. This is not the experience of all graduate students; indeed, it wasn’t my experience. But from being in graduate school and interacting with many graduate students after I finished my degree, this description isn’t that far off from what they go through every day.

And in some ways, we shouldn’t be surprised by this. Professors aren’t trained as managers, just as they aren’t trained as teachers. We only assume they can help others learn because we know they are experts in their fields. And graduate school is school, so the tuition waivers that come with an assistantship really are the main compensation; no graduate student will ever get paid based on the number of hours she works. Many parts of graduate school are closer to working an unpaid internship than working a job, including the hope that working for free now will lead to a real (and good) job in the future.

So perhaps we have to forgive everyone involved in graduate education for seeing the system as this equation:

All Work, No Play = Good Graduate Student + error

This equation fits very well with American work values, in which working hard means getting ahead and pain now is expected to produce rewards in the end. And it also fits people whose work truly is their life. Being in the lab at 11 PM doesn’t feel like work when there isn’t anything more appealing to do.

The trouble comes with that error term. In my estimation, there are very few people (indeed, maybe no one at all) who can actually live with all work and no play. That’s not because some people aren’t very hard working. Of course some people are! And some people seem to need less sleep than others. And some people really do seem to live their lives at work. It is, after all, where a lot of people do the majority of their socializing.

What I mean is that graduate students and professors may be too quick to accept this equation without enough attention to the error term. For example, let’s consider the part of being a good graduate student that revolves around being a willing and able collaborator on other people’s projects. If that graduate student is the first to leave the lab at the end of the day, does that mean he isn’t a good graduate student? Our equation says yes, but intellectual work is different from working on an assembly line. When one person steps away from the line, it stops and everyone else must stop too. But if someone leaves the lab to go home for the evening (or, heck, just out to take a walk before coming back), it doesn’t mean their mind stops working. Furthermore, when that person is out of the lab, most everyone else in the lab can keep working just as they were before. As such, we have to consider a broader conception of that person’s willing and able collaboration. If he returns to the lab right away the next morning feeling refreshed and ready to work, then his contributions will be strong even though he left the lab before others the night before. Indeed, his contributions may be stronger if everyone else comes into the lab bleary eyed at 11 AM.

Let’s consider another example: the idea that being a good graduate student means complete immersion in a field of study. In order to succeed, the graduate student needs to live and breathe all aspects of her field. This translates into behaviors like never reading anything that doesn’t relate to her academic study, rejecting hobbies and activities that don’t advance her research, constantly coming up with new ideas, and keeping a notepad next to her bed so she can immediately write down any thought that comes to her. Here too, we need a broader conception of how productive immersion may be. For some graduate students, the only way to work at all is to maintain a proper balance between life and work. It’s kind of like how exercise makes you tired but gives you more energy overall. That is the nourishing part of taking time to remove yourself from your field of study and have other experiences.

So what are we to do with this equation? In the end, two things need to happen. The first is that the notion of being a good graduate student needs to have more predictors than simply an exhausting level of effort. Rather than looking at how much time someone puts in, professors and fellow graduate students should pay attention to output. Sending emails at 3 AM doesn’t mean anything if those emails are rubbish (as many emails sent at that time of night are). Meetings that stretch for hours don’t do anyone any good if that meeting is using each person at just a fraction of their overall capacity. All work and no play is a convenient measure, but convenience isn’t really that useful in this case. Professors are actively mentoring graduate students, and if they are using just a convenient measure to assess how that student is doing, it means they are failing as mentors. Being a good graduate student should be reframed as being engaged in research, able to improve work based on feedback, demonstrating growing knowledge and expertise of a field, and yes, some measure of productivity.

The second thing that needs to happen is a refusal to accept the equation above. Too many people take it as truth and assess themselves against it as a normative standard. In reality, people overestimate how many hours they work. And it’s too easy to forget about the consequences of, say, a late night of work. But think back to times in which you (or someone you knew) decided to “pull an all-nighter.” In hindsight, it is almost never a good idea. Perhaps in the short-term it worked to meet a deadline, but in terms of overall productivity, it’s a losing solution. All work and no play doesn’t make a good graduate student; it makes an under-productive, stressed out, and unhappy graduate student. The sooner everyone is willing to say this, the better off graduate students, professors, and academia as a whole will be.

In my previous post, I described the challenge of defining a group of children in order to calculate how long those children have stayed in care. The goal is to provide timely, useful measurement of this indicator of child permanence, while minimizing the children who aren’t counted, the error in our measurement effort. Using entry cohorts (or all children who entered care in a given period) means using a measure of central tendency like median or waiting a really long time (theoretically up to 18 years after entry into care) to be able to calculate a mean. Clearly, entry cohort is not a perfect solution, and in this post, I’ll consider two other methods.

If entry cohort is all children who entered care in a given period of time, exit cohort is the opposite. With all children who exited care in a given period of time, we can calculate all kinds of useful information. And we don’t have to rely on measures of central tendency like median; we can calculate means, modes, or anything else we like. We can get a very detailed picture of these children’s experiences in care. How long were they in care? How many different living settings were they placed in? How is it that they are exiting care; are they headed back to their family of origin, or being adopted, or entering into the permanent care of a relative? Because exit cohort isn’t dependent upon when the child entered care, there’s no lag for these calculations. We can, in theory, calculate it as soon as the period of time that defines the cohort ends.

But exit cohort also has an error term. We are only selecting the children who are exiting care, and thus ignoring all the children who remain in care. It’s great news that Susie is exiting care after just 8 months, but what about Sam who came into care at the same time? He’s still in a foster home instead of a permanent living situation. Using exit cohort, we forget about Sam until he too (eventually) exits from care. We can discuss Sam in other ways, of course. For example, he would get counted in the total number of kids currently in foster care. But we cannot calculate his length of stay or his number of placements because we are truncating the number based on when we are doing our calculations. Sam has been in care for 8 months, just like Susie, but it’s not correct to use 8 months in our calculations; Sam will be in care for longer.
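The problem above is what statisticians call censoring, and a minimal Python sketch makes it concrete. The children and dates below are made up for illustration; the point is that an exit-cohort calculation silently drops every child still in care, which is exactly the error term.

```python
from datetime import date

# Hypothetical children (names and dates are made up for illustration).
# exit_date is None for children still in care: their stays are
# "censored" -- we know a minimum length, not the true length.
children = [
    ("Susie", date(2015, 1, 10), date(2015, 9, 12)),  # exited after ~8 months
    ("Sam",   date(2015, 1, 10), None),               # still in care
    ("Ana",   date(2014, 6, 1),  date(2015, 3, 15)),
]

def exit_cohort_stays(children, year):
    """Lengths of stay (in days) for children who exited in `year`.
    Children still in care are silently dropped -- the error term."""
    return [
        (exit_date - entry).days
        for _, entry, exit_date in children
        if exit_date is not None and exit_date.year == year
    ]

stays = exit_cohort_stays(children, 2015)  # Sam is not counted at all
```

Sam entered care on the same day as Susie and has been in care just as long, but no number we could write down for him today would be his true length of stay.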

So what are we supposed to do? Both entry and exit cohort have significant error; neither provides a full picture of children in care. The other option to consider is a cross-sectional, rather than longitudinal, approach. In this method, we don’t track individual kids in care. Instead, we look at the system itself over a defined period of time. Who are the kids in care? What are their statuses?

Let’s imagine we look at kids who spent time in care in calendar year 2015. All kids will fall into one of four groups. First, there are kids who both entered and exited care during the year. My previous post offered a good example of this: the child staying with her grandmother while her parents took a cruise. She entered care after her grandmother was hospitalized and exited care a day later when her parents got back from their cruise. Second, there are all other kids who exited care in 2015. These kids are similar to the exit cohort method discussed above. Third, there are kids who entered, but did not exit, care during the year. And finally, there are kids who were in care before the year started and remain in care when the year ends.
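The four groups above can be sketched as a small classification function. This is only one possible way to code the grouping; the function name, labels, and edge-case handling are my assumptions, not an official definition, and the sketch assumes every child passed in did spend time in care during 2015.

```python
from datetime import date

YEAR_START, YEAR_END = date(2015, 1, 1), date(2015, 12, 31)

def classify(entry, exit_date):
    """Place a child who spent time in care during 2015 into one of the
    four cross-sectional groups. exit_date is None if the child is
    still in care (or exited after the year ended)."""
    entered_this_year = entry >= YEAR_START
    exited_this_year = exit_date is not None and exit_date <= YEAR_END
    if entered_this_year and exited_this_year:
        return "entered and exited in 2015"
    if exited_this_year:
        return "exited in 2015"        # entered in an earlier year
    if entered_this_year:
        return "entered in 2015"       # still in care at year's end
    return "in care the entire year"
```

For example, the girl staying one extra day while her parents rushed back from their cruise lands in the first group, while a child who entered in 2013 and remains in care lands in the fourth.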

The overall outcome of this method is that no child is missing in our counting. We describe all children who spent any time in care during the year. We will take those children exiting care and calculate length of stay. But in doing so, we won’t be ignoring any children. If 25% of children in care at any time in the year exit care, then we are clear that our calculations of length of stay are based on just this 25%. The reader is presented with full information about children in care.

By paying attention to the children left out of other calculations (the error in our measurement) and not merely focusing on those children who allow us to make easy calculations, we find a better solution. These numbers aren’t going to help us figure out how to remove children from care quickly and safely, but at least they provide us a look at all children who had experience with the child welfare system. When we ignore no one, we help illuminate the wide array of child experiences with the child welfare system. And hopefully this, over time, can push us toward achieving better outcomes for all children in care.

I work in child welfare research, and one of the most important reports my research center publishes each year describes outcomes for children in state care. We have a set of indicators that measure safety, permanency, and well-being for children. Many of these indicators are relatively straightforward to calculate. We know how the data is structured, so we pull it out, run some syntax, and report the output. But other indicators require some debate, often centered around who we are leaving out of the calculation. These children who get left out are a sort of “error term” that keeps us up at night.

Lately, we’ve been discussing the question of how to measure “length of stay” in care. The definition of this indicator is simple: How many months did the child stay in care before exiting? No one disputes that this definition is correct. The challenge comes from measuring it. The minimum length of stay would be a matter of hours or days. Let’s imagine a four year old child is in the care of a grandparent because the child’s parents are on a cruise. The grandparent takes the child to the playground, tries to go down the slide, but falls and breaks her leg. She must go to the hospital, cannot return home, and because there isn’t someone else immediately available to care for the child (for whatever reason), the child is taken into care until the parents can rush back from their vacation. That’s the minimum length of stay, the time it takes the parents to return. The maximum length of stay would be 18 years. This could occur if a child was taken into state care upon birth and never achieves permanent placement. The child would stay in care until he “ages out,” turns 18, and is removed from the state’s care. (Some states allow youth to stay in care until 21, but the idea is the same.) This outcome is extraordinarily unlikely, perhaps even unprecedented, but it nevertheless represents the longest amount of time a youth could stay in care.

Because of this wide variability, calculating length of stay for the system as a whole becomes complicated. Let’s say we want to calculate length of stay for children who entered care in 2015. Some children will exit in the same year they enter, like the girl in the temporary care of her grandmother. But other children will stay in care for years. So what should we do in this case? Our current solution is to carefully select a measure of central tendency and to allow enough time to pass so that we can report the number. We use median months in care, the number of months it takes for half of children entering care in that year to have exited. The number is around 30 months and has stayed at this level for the past several years. We report this number with a lag of about three years, necessary because it takes that long for half of the children in a given entry cohort to exit care.

This would be a perfectly acceptable solution if length of stay were normally distributed. The trouble is that most kids are in care for a relatively brief period of time and some kids stay in care for an extraordinarily long time. The data is right-skewed: the bulk of the data sits at the left side of the chart, with a long tail stretching to the right. Median tells us some “good” news about half the kids in care: They stayed in care for 30 months or less. But it also tells us bad news about the other half of kids: They stayed in care for more than 30 months, and we won’t know for years more how much longer they remained in care!
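A tiny Python illustration, with made-up months-in-care numbers, shows why median and mean diverge when a few very long stays stretch out the tail:

```python
from statistics import mean, median

# Hypothetical months in care for one entry cohort (illustrative only):
# most children exit fairly quickly, but a handful stay for many years,
# giving the distribution a long right tail.
months_in_care = [2, 4, 6, 9, 12, 18, 24, 30, 36, 48, 72, 120, 180]

med = median(months_in_care)  # half the cohort has exited by this point
avg = mean(months_in_care)    # pulled upward by the few very long stays
```

In this made-up cohort the median is 24 months while the mean is over 40, because the children in the tail dominate the average even though they are few.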

That’s the challenge with using entry cohorts to calculate this info. If we have to wait for all children who entered care in 2014 to exit from care, we might not know the length of stay for this cohort until 10 or more years later. In theory, we might not be able to calculate the number until 2032. At that point in time, the information would only be useful to historians. The children who don’t get “caught” in our calculation of length of stay (because of the necessities of using median) represent “error” in our measurement. Entry cohorts, the distribution of length of stay, and the need to get the data out in a timely manner mean we lose these kids. We can note that 50% of kids stay in care longer than the median, but that isn’t nearly as useful as actually stating how long they stayed in care.

So that’s the problem with how we currently do things. But what can we do to fix it, or augment what we currently do? I’ll consider that question tomorrow.

The field of psychology and social science in general is facing a “replication crisis.” (Other fields are too, including medicine.) The most cited cause of this “crisis” is one-off experiments that produced a “significant” result, got published, and were rapidly accepted as fact by the scientific community. Significance, in this case, refers to a statistical test designed to indicate the likelihood that the result occurred due to chance; a small enough percentage leads to a result being declared “significant.” Readers should be cautious to not use that word as a synonym for “meaningful” when talking about research.

Many solutions have been proposed for this crisis, and none is better than an increased effort to replicate studies. This solution is ideal because it can be added as a requirement for future studies (for example, requiring that at least two different samples be used in an experiment) and can be used to review past findings.

One trouble with replication, however, is that it isn’t “sexy.” People who go into research fields (whether in academia or not) aren’t entering the field because of a chance to redo the work of others. Instead, they seek opportunities to test and advance their own lines of thought. The entire tenure process at research universities is built around making new, meaningful contributions to a field of study. Replication, by design, doesn’t do that. And though it may be helpful for our scientific understanding (thus advancing human knowledge overall), it doesn’t tell us much at all about the researcher conducting the replications. Is the replicating researcher a good scholar? We don’t have enough evidence to answer the question if all the researcher has done is try to replicate other people’s findings.

So who is supposed to do replications? The answer may come to us from an unlikely source: the arts. For an example, let’s look to painting, an artistic pursuit that, for many artists, requires years of training. How does one begin to learn how to paint? The exercises to learn painting have been the same for centuries and center around copying the work of others. It means learning to mix paints to see how other artists have created depth in a painting (for example, making more distant objects bluer). It means painting still lifes of fruit bowls, just like the great masters did. And sometimes, it means copying past paintings to try to understand how they were created. All this is done BEFORE we expect the student to produce any great, original work. The same is true for writing. Students are expected to read widely and to attempt to write in the style of other writers. Indeed, consider “copywork,” an old education method that called for students to directly rewrite the works of others to learn penmanship, spelling, grammar, and perhaps even style.

Let’s do the same thing in social science research and ask graduate students, those learning to do research, to undertake replication efforts as the first step in their research education. There is much to be learned from trying to replicate the work of others. First, graduate students learn the necessity of writing a quality Methods section, because these sections will guide replication. Too many Methods sections are vague on key details. This may be the result of bad writing; most academic writing is of poor quality. It may also be because the author wishes to obfuscate some details that didn’t go the way the researcher intended. For example, perhaps the researcher should have used random assignment but did not because of how the experiment was planned; the researcher may not be specific about this detail because it could undermine faith in the study.

Second, by replicating past work, graduate students get into the muck and learn about the real work that goes into conducting experiments and other methods of gathering data. No amount of prior planning will resolve every issue researchers encounter when running an experiment. I recall a personal example: I ran an experiment in which one participant had an opportunity to lie to another for financial gain. Having no procedure to keep the two participants from leaving the experiment at the same time meant occasional awkward encounters after the experiment was done. I had to adopt a new process in which participants signaled when they had finished a survey rather than simply leaving, which gave me greater control to make sure the participants left separately. It is best to learn how to create a good research process as early as possible.

Third, graduate students can conduct analysis of the data with a step-by-step guide from past studies. This reduces the time and worry of analysis, and it also offers an opportunity to see both what an analysis explores in the data and what it leaves unexplored. Learning to think about data, from both good and bad perspectives, is key to social science research. The graduate student can ask why the published paper conducted each test. Was there an effort to avoid some variables and use others? Was this done to emphasize significant results and downplay non-significant ones?

Finally, graduate students contribute to scientific knowledge right away, in ways that their “original” ideas likely would not. And this contribution means more experienced researchers can spend time advancing their own research agenda (and contributing future results which may or may not be replicable).

Will this “solve” the replication crisis in psychology? Certainly not in the short term. Instead, it will immediately exacerbate the problem by surfacing even more results that don’t replicate. But in the long term, it may. Graduate students will learn the importance of producing good work and will see that statistical significance alone cannot tell us which results are meaningful. The best studies tell us something no matter what the results show, but such studies must be carefully constructed. Learning by copying may help budding researchers recognize what those research questions look like.

It may still be a tough sell to graduate students. But when replicated research is respected by the academic community at large (as it should be) and when graduate students get to explore some of their favorite research, their attitudes should change. This new method will allow graduate students to emerge as better researchers. Let graduate students be apprentices, building their own tools first, before we turn them out into the world.

I commonly write about the topic of error in assigning grades in a classroom setting, but it always fills me with a certain measure of anxiety.

Assigned Grade = True Grade + Error

This is a fundamental relationship that we have to deal with when trying to measure anything interesting. Yet I often wonder how students reading this blog might feel to be so frequently reminded that the grades they receive in their classes always contain some error. Though error is never completely avoidable, it is easy to understand how someone might feel disaffected or angry to learn that the measures which hold sway over their future career prospects contain some degree of error. The degree of error varies from class to class and student to student, but it’s easy to imagine that in almost every class there are some students whose grades sit in the grey area between, say, a B and a C, for whom even a small amount of error could be enough to produce an assigned letter grade that is different from what the ‘true’ grade would have been if the error had not existed.
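The boundary problem is easy to see in a short simulation. This is a sketch in Python; the letter cutoffs, true grade, and error size are all hypothetical numbers chosen for illustration, not figures from any real class.

```python
import random

random.seed(1)

def letter(score):
    # Hypothetical cutoffs: B is 80 and above, C below 80
    return "B" if score >= 80 else "C"

true_grade = 81.0  # a student just above the B/C line
error_sd = 2.0     # assumed spread of the grading error

trials = 10_000
flips = 0
for _ in range(trials):
    # Assigned Grade = True Grade + Error
    assigned = true_grade + random.gauss(0, error_sd)
    if letter(assigned) != letter(true_grade):
        flips += 1

print(f"Letter grade flipped in {flips / trials:.0%} of trials")
```

For a true grade just above the cutoff, even this modest error flips the assigned letter in a substantial fraction of trials; a student sitting safely in the middle of a grade band would almost never be affected.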

How important is the difference between a B and a C? In the case of Megan Thode, a student at Lehigh University, the difference amounted to a $1.3 million lawsuit.

To quote the NY Daily News:

Getting a grade you don’t deserve in school is worth about $1 million in damages, according to a lawsuit filed in Pennsylvania.

A former student at Lehigh University is so unhappy with the C+ she received in a course in 2009 that she has decided to sue the school for $1.3 million, claiming the unfair grade has ruined her future earning potential.

In this case the C+ grade was enough to disqualify Ms. Thode from continuing in her graduate program. Her lawsuit claims that the grade she received was inaccurate at least in part because of bias the instructor carried toward her due to certain political statements she had made and complaints she had made about taking part in an internship as part of the class. The university denies any bias.

For the sake of argument, let us assume that the instructor did not make a conscious decision to alter her grade, but that annoyance at Ms. Thode’s behavior subconsciously contributed to the error present in her assigned grade. It would not outrage me to learn that this was true. I personally do not consider myself immune to small biases based on student interactions – a point that I try to make clear to all my students when discussing my professional behavior standards and one of the reasons I employ blind grading for most assignments. This hypothetical situation provokes a few interesting questions:

  • How much error is enough to warrant a lawsuit?
  • If we hold instructors legally liable for grade error, does it matter whether the largest source of the error is unconscious bias, sampling, grader variation, or any of the many other factors that can contribute to error?
  • Is it even possible to determine how much of an assigned grade is error or what caused the error?

I’ll pass on trying to answer the first two questions as they are mostly philosophical. The question of determining the magnitude and cause of error is a practical question. In theory, the answer to the question is ‘yes’, but with several caveats so large as to make any attempt clearly impractical.

Classical testing theory posits that while the sampling error in each assessment is random, the randomness is not uniform, but rather forms a normal distribution centered on zero. From a statistical standpoint, all that you would need to do to reduce the error is to continue sampling, provided – and here comes the big catch – that you could ensure that the thing you are sampling (i.e. the student’s knowledge) has not changed at all, something that would require access to a time machine or alternate dimensions. As for rooting out bias-driven error, it probably isn’t normally distributed, so you might be able to discover it if you were able to fit multiple independent graders into your time machine so you could re-test each student in the class multiple times and then try to correlate the scores between graders.
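The repeated-sampling claim can be sketched in a few lines of Python. The true score and error spread here are assumptions for illustration; the point is only that averaging more assessments shrinks the expected error, provided the underlying knowledge never changes.

```python
import random

random.seed(0)

true_score = 75.0  # the student's unchanging "true" knowledge (assumed)
error_sd = 5.0     # assumed spread of per-assessment sampling error

def observed():
    # Classical test theory: Observed = True + Error, Error ~ N(0, error_sd)
    return true_score + random.gauss(0, error_sd)

def avg_abs_error(n, reps=2000):
    # Average distance between the mean of n assessments and the true score
    total = 0.0
    for _ in range(reps):
        mean_score = sum(observed() for _ in range(n)) / n
        total += abs(mean_score - true_score)
    return total / reps

print(avg_abs_error(1))   # one assessment: error stays large
print(avg_abs_error(25))  # twenty-five assessments: error shrinks sharply
```

Averaging n assessments cuts the expected error by roughly the square root of n, which is exactly why the thought experiment requires the student’s knowledge to stay frozen across every re-test.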

In other words, short of some kind of smoking gun, it is not practical to sue based on standard assessment error. Frankly I’m glad, since the thought of the chilling effect that would have on grade assignments is enough to make me shiver.


Last week, I had the privilege of attending UW-Madison’s Teaching Academy Fall Kickoff Event, which this year focused on issues of grading and assessment. The keynote speaker, Dr. James Wollack, presented challenges with grading and described potential solutions. One issue he addressed was the promotion of positive behaviors (for example, attending office hours, coming to class, participating in discussion) and the challenge of assessing these behaviors. I’ve long argued that awarding points for attendance systematically discriminates against at-risk and non-traditional students, who may face more challenges to class attendance than other students. But at the same time, it is important to recognize the motivating nature of class point structures; to wit, more students show up for class when they get points for doing so.

Dr. Wollack suggested that instructors could create a separate category of assessment items that act as gatekeepers for certain grades. For example, to receive an A, a student must earn at least 90% of available course points and not miss more than 2 class sessions. Failing to meet this attendance criterion results in a lower letter grade, no matter what percentage of points a student earned in the class.
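As a sketch, the gatekeeper rule might be coded like this (in Python). Only the A criterion comes from the example above; the other cutoffs, and extending the one-letter demotion to every grade, are my assumptions for illustration.

```python
def course_grade(points_pct, absences):
    """Hypothetical gatekeeper scheme: points set the letter grade,
    but missing more than 2 sessions caps it one letter lower."""
    if points_pct >= 90:
        letter = "A"
    elif points_pct >= 80:
        letter = "B"
    elif points_pct >= 70:
        letter = "C"
    else:
        letter = "D"
    if absences > 2 and letter != "D":
        # Gatekeeper: failing the attendance criterion lowers the letter
        order = ["A", "B", "C", "D"]
        letter = order[order.index(letter) + 1]
    return letter

print(course_grade(94, 1))  # meets both criteria
print(course_grade(94, 4))  # enough points for an A, too many absences
```

Note that the attendance check operates entirely outside the point total: a student with 94% of the points and four absences ends up with the same letter as a student with 85% and perfect attendance.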

This idea is very appealing because it allows instructors to build in assessment components that are traditionally hard to score. For example, consider the issue of participation. Let’s say that, in each class session, students can earn up to 5 points for participation. This should motivate students to participate more in class, but it also means that instructors must somehow both conduct class and evaluate student participation. This daunting task reduces reliability of grading and confidence in scores for both instructors and students. But a checkbox system and a gatekeeper point value can help solve this problem.

But what is the theoretical justification behind gatekeeper-style grading? One issue comes in the correlation between gatekeeper items and other normally assessed items. No matter how attendance, for example, is scored, it is still related to overall classroom performance. Students who attend class more frequently do better on exams. This may seem like a great reason to do everything possible to compel students to attend class, but whether attendance is a scored item or a gatekeeper, it still serves as a barrier to higher performance for a motivated student who, for whatever reason, cannot attend class as regularly as the instructor would like.

Think of it like this.

Attendance = Learning + error
Test Score = Learning + error

Ideally, attendance and test scores would be a perfect reflection of what the student has learned in the class. But of course, we recognize that some error takes place, as some students might attend class but fall asleep or be distracted; others might skip class but study extra hard on their own. For this system to work, we need to assume that the two error terms are not correlated with each other. But we are assessing the same construct (Learning, the student’s knowledge of the course material), and thus we know that these error terms in fact SHOULD be correlated.

Furthermore, we want the error terms to be random, such that error in measurement does not systematically help or hurt students. But because both Test Scores and Attendance are measures of Learning, when we kick in points for Attendance, it is like awarding automatic extra credit points on the test for students who were in class. The error term isn’t random any more, because the relationship between Learning and Test Scores is influenced by the relationship between Attendance and Test Scores.
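A small simulation makes the correlated-error point concrete. This is a Python sketch; the shared “motivation” factor and every number in it are assumptions for illustration, standing in for anything outside Learning that influences both attendance and test scores.

```python
import random

random.seed(2)

n = 5000
error_pairs = []
for _ in range(n):
    learning = random.gauss(75, 10)    # the construct both measures target
    motivation = random.gauss(0, 5)    # assumed shared factor outside Learning
    attendance = learning + motivation + random.gauss(0, 3)
    test_score = learning + motivation + random.gauss(0, 3)
    # Each error term is whatever the measure captures beyond Learning
    error_pairs.append((attendance - learning, test_score - learning))

def corr(pairs):
    # Pearson correlation between the two error terms
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return cov / (sx * sy)

print(corr(error_pairs))
```

Because something outside Learning drives both measures, the two error terms move together rather than canceling out, which is precisely why stacking attendance points on top of test scores compounds the error instead of averaging it away.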

In short, what results is a big assessment mess that is difficult to straighten out. There is little theoretical justification for awarding points for attendance, whether as a scored item or as a gatekeeper. There are other views on this issue, however. I hope to feature some of those in upcoming posts.

Some people argue that a free market economy is the best economic system for getting the most good to the most people. Given that some form of the free market is the only economic system that has been shown to work on a large scale, it can be hard to argue against this claim. But in the 2012 presidential election, much of the debate centers around just how much governmental involvement in the economy is the right amount. While neither side favors complete laissez-faire or communistic approaches, there are stark contrasts between President Barack Obama and Republican presidential candidate Mitt Romney.

How does this relate to measurement and error? We can understand the debate between the two candidates as one connecting our personal values with the values of economics. The free market functions based on a profit motive, but humans are much more complex. And thus we can represent the debate with an equation.

Personal Values = Economic Values + error

In a perfect free market, all of our own values would be solved with free market economics. Sick people need health care! Don’t worry, there’s a free market solution to the problem. Poor children need to be fed! Look no further than economics for help. City streets must be cleaned, in poor and rich parts of town! The free market is there to help.

Unfortunately, not all of our values (health care, feeding children, clean streets) have a free market solution. The most salient example is that of providing health insurance to individuals with “pre-existing conditions.” Insuring these people is much more expensive. Before health care reform legislation was passed in early 2010, insurance companies dealt with this issue by charging them higher rates or denying coverage altogether. That is a free market solution, but it did not reflect the values of many people.

New policy forbids insurance companies from these practices and in turn requires all Americans to obtain health insurance, thus ostensibly increasing insurance company revenues. This is anti-free market policy. It restricts the behavior of both corporations and individuals. But it also fits with the values of many people. And proponents argue that it is a case where the free market had failed and intervention was needed.

The error in the equation arises from situations in which there appears to be no free market solution to a problem. For Republican Mitt Romney, the error term in the equation is very small; there aren’t many problems that cannot be solved by the free market. For Democrat Barack Obama, the error term is larger; government has a vital role to play in the lives of all Americans. Your own view of this equation should be a big guiding factor in how you vote, and both candidates point out that their disagreement on the size of the error term shapes their distinct visions for the country. In this case, error, and your view of it, will help determine who wins the election in November.

A few days ago, on my professional blog, I wrote about how the IRB can promote best practices. The IRB (or Institutional Review Board) must approve all research at an institution and works to ensure that research complies with federal, state, local, and university laws and policies. Though this suggests a standard process with carefully constructed guidelines for review, many researchers note frequent discrepancies between institutions and even between protocols. A procedure requiring modifications in one proposed project may go unquestioned in another. This naturally results in much frustration from researchers, no matter their level of concern with research ethics. Indeed, it should be those most concerned about practicing research ethically who protest loudest against IRB discrepancy.

We can view the whole process as an equation and investigate further how the error term leads to frustration.

IRB Practice = Best Practice + Error

Let’s first agree that for most research practices, there is a “right answer” regarding the protection of participants. I use the example of data storage in the blog post linked above. “Consider, for example, handling collected data (the surveys and spreadsheets that contain participant responses). This data must be stored in a secure location. On paper, it should be in a locked room that is accessible to very few people. Digitally, it should be stored on a computer and backed up on a secure server.”

If there were no error, then any deviation from this best practice would be flagged by the IRB, and every protocol asserting that it will follow these practices would be accepted. Where there is no error (that is, no difference between best and IRB practices), there is no frustration. But this isn’t the case. Some IRBs may flag protocols that have not described backup locations as secure, while others may not worry about this detail. Error (the inconsistency) leads to frustration because a researcher cannot accurately predict which issues the IRB will flag.

The IRB, however, has some power over this error term. Right now, researchers must answer a variety of questions about items like data security. The application contains empty boxes into which the researcher must detail her plans and hope that they pass muster, with little guidance about what a best practice might be. This creates error because the researcher can easily forget or misstate an important piece of information. But if the IRB were to simply state the best practice and instruct researchers to either agree to follow it or describe their alternative plan, then the error term would be eliminated for most applicants. The IRB would save time as well.

Thus, inasmuch as the IRB wants to reduce researcher frustration, they should promote these best practices. By understanding the process of researcher approval in a classical test theory manner, we can see direct guidance as to how the IRB can improve their application procedure. In these cases, the error term is nothing but trouble for both IRB credibility and researcher sanity. All steps that reduce that error, while protecting participant safety, can and should be taken.