Why did Mitt Romney lose the 2012 election? There are lots of reasons one could offer. When I think back on his candidacy, I think of the ways his positions shifted from relatively liberal in some areas (useful when running for governor of Massachusetts) to pretty conservative (useful when running in the Republican presidential primaries). And I think of some of his bizarre comments, like saying he had advisors bring him binders full of women and that the trees in Michigan are the right height. Immediately after the 2012 election, conservatives offered their own explanation: Mitt Romney wasn’t a true conservative. If he had been, then he would have attracted mass support from true conservatives, enough support to easily defeat President Obama.

This same argument is used today by those who support Republican presidential candidate Senator Ted Cruz. An essay at The Federalist argues that Ted Cruz is the only true conservative in the race. Marco Rubio has said the same thing.

Meanwhile, the same thing is happening on the Democratic side, with candidates Hillary Clinton and Bernie Sanders arguing over who the “true” progressive is in the race. Each wishes to hold the title, and each has an argument against the other using the title.

It strikes me that the logic used by candidates from both parties can be expressed as a simple equation:

Political Prosperity = Ideological Purity + error

All candidates are arguing that their nomination will lead to party success, while nominating their opponent would lead to failure. But, of course, this is only argumentation, as we have to hold the election to truly see who wins. As such, their arguments for their own ideological purity are merely an imprecise measure of political prosperity. Every other factor has to get lumped in the error term.

And you know who looms out of that error term? Donald Trump, who currently leads Ted Cruz in Republican delegates. Mr. Trump would not call himself ideologically pure or a true conservative. (At some point in time, we must acknowledge that there is so much bluster in any political campaign, mixed with unending commentary, that a search for a candidate’s name and the phrase “true [conservative, progressive, etc.]” turns up plenty of results. I am trying to make an argument, while also respecting all the people who indeed have called Mr. Trump a “true conservative.” There are some.) Mr. Trump has called himself a “commonsense conservative,” which has a pragmatic ring to it.

We could also look to President Obama as another person who fits in that error term. Mr. Obama has famously eschewed politicking. He has argued that his positions are commonsense and that there are many areas that both Republicans and Democrats can agree on. Indeed, his favored legislation to increase the number of Americans with health insurance is based on a plan from a conservative think tank (the Heritage Foundation) and was previously implemented by Mitt Romney in Massachusetts. I think Mr. Obama would be happy to call himself progressive, but he has shown a continued lack of interest in ideological purity.

Perhaps this is simply the difference between running for office and actual governance. Ms. Clinton, Mr. Cruz, and Mr. Sanders have legislative records that they must present to the public. (Ms. Clinton also has leadership experience, especially her time as Secretary of State.) Mr. Trump has his business experience, which involves deals and dictates more than policy. Mr. Obama, in seeking reelection in 2012, had his governance and leadership. Naturally, leading is more pragmatic than advocating, and a president must actually govern, while a senator is more an advocate for the people of her state.

In any case, the argument of ideological purity as a determinant of political prosperity seems an odd one to me. Even if purity could help win an election (and I find no evidence that it can), it is surely less useful in governance. A candidate can declare she will push to lower taxes until economists crunch the numbers and suggest it is impossible without massive spending cuts. A candidate can decry drone strikes and state he will ban their use until a military leader has a dangerous terrorist in their sights and asks to pull the trigger.

In the case of this equation, ideological purity might be a useful rhetorical tool, but the error term is too large for it to offer much prediction of political success. Purity makes perfect bluster, but the realities of politics often bust the blustering candidate.

Many people, experts and non-, believe this equation describes a lot of what we need to know about college education:

College Costs = College Value + error

Yes, the outlay to pay for college can be high. Even taking out a few thousand dollars in loans each year can add up to substantial debt by the time the student finishes his education. Leaving college with $20,000 in debt might not be uncommon, but it is a burden.
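
To put a rough number on that burden, here is a minimal amortization sketch in Python. The 5% interest rate and 10-year term are my own assumptions for illustration, not figures from any actual loan program.

```python
# Back-of-the-envelope: the monthly payment on $20,000 of student debt,
# using the standard loan amortization formula. Rate and term are hypothetical.
principal = 20_000
annual_rate = 0.05          # assumed
months = 10 * 12            # assumed

r = annual_rate / 12        # monthly interest rate
payment = principal * r * (1 + r) ** months / ((1 + r) ** months - 1)

print(f"Monthly payment: ${payment:,.2f}")           # about $212
print(f"Total repaid:    ${payment * months:,.2f}")  # about $25,456
```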

That burden, though, has high value over a lifetime of earnings. And when students spend more, it can mean more value. A community college two-year degree costs less and has less value than a four-year degree from a state school, which costs less and has less value than a degree from Harvard. We expect that the cost is worth the value, and that those who pay more also make more. College education is a good investment.

So what about that error term? How does the equation mislead us? Let’s examine two examples.

The first comes from Senator Bernie Sanders, who is currently running to win the Democratic Party’s presidential nomination. Sen. Sanders has released a plan to eliminate tuition from state colleges and universities. This plan is a more dramatic version of President Obama’s proposal to make two years of community college free for everyone.

This plan makes several very interesting changes to the equation. First, it changes the makeup of college costs. The student now faces a cost of $0, but the plan doesn’t change the costs of providing that education. Instead, it just shifts the bill from the individual to society. (Actually, if you read the plan closely, the cost isn’t really shifted to society. “The cost of this $75 billion a year plan is fully paid for by imposing a tax of a fraction of a percent on Wall Street speculators who nearly destroyed the economy seven years ago.” No word on whether the tax will apply to all Wall Street investors or just those that Sen. Sanders believes played a role in “destroy[ing] the economy.”)

So are educations that cost more to provide equal to educations that are higher in value? That’s difficult to assess. Tenured professors often make more money than nontenured professors, and in theory, these professors are better at teaching students. But if a university is expanding (suggesting it provides a high quality of education, thus attracting many students), then it will have more nontenured professors. A university with only tenured professors hasn’t hired professors in quite some time and thus isn’t growing. In any case, Sen. Sanders offers no corresponding plan about reducing college costs overall.

The result is that students will be forced to judge a different kind of “cost = value” equation, one in which they choose between the value provided for $0 and the value provided for, say, $40,000, the tuition at a private college or university. Because this erases the variance between different public schools (that is, tuition at the best public school is now equal to tuition at the worst), quality will explain less of the variance in cost, increasing the error term.
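
A small simulation can illustrate that claim. Everything in this sketch is invented (the quality scores, the tuition formula, the 70% public-school share); the point is only the direction of the change.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools = 1_000

quality = rng.normal(50, 10, n_schools)     # hypothetical quality score
is_public = rng.random(n_schools) < 0.7     # assume 70% of schools are public

# Tuition loosely tracks quality, plus noise (the error term).
cost = np.clip(800 * quality + rng.normal(0, 5_000, n_schools), 0, None)

def r_squared(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

before = r_squared(quality, cost)

cost_free = cost.copy()
cost_free[is_public] = 0                    # tuition-free public schools

after = r_squared(quality, cost_free)
print(f"Variance in cost explained by quality: {before:.2f} -> {after:.2f}")
```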

Free college education could also lower the value of a college education because it could increase the number of people with college degrees. The college degree could become more like the high school diploma: It’s only noteworthy if you didn’t earn one. As such, we could see college costs remain the same but value decrease. I don’t think this is the intention of Sen. Sanders when putting forward his plan, but he offers no other details on increasing enrollment in trade programs or creating programs that would guide students along a professional path early in their post-secondary education.

The second example of how this equation could mislead us comes from the relationship between federal student aid money and college tuition. The New York Federal Reserve released a report last summer that linked increases in federal financial aid with increases in college tuition. Essentially, because so many students use federal aid to pay for college (either through grants or loans), colleges know that additional money is out there. Thus when schools decide to raise tuition, they know the burden will not immediately fall on students. And when tuition rises, there is more push for greater levels of student aid. The result is a cycle of ever increasing costs.

Do the additional costs add up to greater value? It’s hard to draw a conclusive answer. Let’s say the additional tuition money is put toward new lab space, better computer equipment, or upgraded study spaces. All this should help students learn more, thus increasing the value of their education. But it could also be put toward increasing the size of college administration, paying the very high salary of a college coach, or building fancier dorms.

It’s worth noting that making more money available to students, especially in the form of loans, while doing nothing to control costs is quite different from how the federal government handles other entitlement programs. Take Medicare, for example. The federal government uses its leverage to keep healthcare costs down, and the expansion of coverage under Obamacare increased these efforts. As such, hospitals that wish to thrive must find ways to deliver quality care for less. This leads to innovations like focusing on patients who cost the most and offering them greater preventative care.

If the same thing was done with colleges, then we might see the equation protected. But as it stands now, increasing federal student aid seems to lead to increased cost with no corresponding increase in value.

For a high school student trying to decide if college is a good investment, does the equation provide guidance? Or is the unexplained variance too large? At a rough level, I think all people should pay attention to the equation’s general principles: If you are investing money, investing it in your own education will pay a fantastic rate of return over your lifetime. But be mindful of the error term. Some education is over-priced for its value, and this occurs at both ends of the cost scale. A university education can be overpriced relative to value at $40,000 a year or at $4,000 a year. Only careful consideration of the relationship, and vigilance to the error term, will lead to the best outcome.

Let’s say you have two possible routes you could take to work, differentiated by the exit you use to get off the highway and the corresponding different roads through town. The distance is roughly equal, so you wonder which route is faster. To decide, you randomly pick which route to take every day for a month, resulting in 11 trips to work on Route A and 13 trips on Route B. You average the times and compare. Route B takes, on average, 21:13, compared to Route A at 22:15. Are you safe to conclude that Route B is the faster route?

Statistics has a method to deal with questions like this. First, it notes that these 24 total trips are just a sample of all the days you will spend driving that route. In any sample, bias can be present. Perhaps, for example, the timing of a stoplight was off due to some road construction, resulting in a 20% greater likelihood of getting stopped at that light along Route A. Once the light’s timing is corrected, this delay will disappear. So we have to set a bar of difference that is higher than a difference that is caused by picking a sample of days instead of looking at all days.

Second, the statistical test we would use to see if the two times are different takes into account the variance in the data. The times above are averages, and what those numbers don’t tell us is how widely the data fluctuated. Let’s say that Route B had times ranging from 18:00 to 25:00, while Route A’s times ranged from 21:00 to 23:00. Route A may be slightly slower on average, but Route B could leave you sitting in traffic for a longer period of time. We need to factor this in if we are to truly judge if one route is faster than the other.

We can do the test and have it spit out a “p-value,” the probability of observing a difference this large between the two averages (21:13 compared to 22:15) if there were no real difference between the routes. In social sciences, we generally use a standard of 95% or greater confidence (a p-value below .05) to conclude that the difference observed between the two numbers is not random chance but instead is because of some differentiating factor (in this case, the two routes).
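
Here is what that test might look like in code, a minimal sketch using a two-sample t-test. The individual trip times are invented to roughly match the averages in the example (Route A about 22:15 over 11 trips, Route B about 21:13 over 13), and the spreads are chosen to echo the variance point above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
route_a = rng.normal(loc=22.25, scale=0.6, size=11)  # minutes; 22:15 is 22.25
route_b = rng.normal(loc=21.22, scale=2.1, size=13)  # minutes; 21:13 is about 21.22

# Welch's t-test: does not assume equal variance, which matters
# given Route B's wider spread.
t_stat, p_value = stats.ttest_ind(route_a, route_b, equal_var=False)

print(f"Route A mean: {route_a.mean():.2f} min; Route B mean: {route_b.mean():.2f} min")
print(f"p-value: {p_value:.3f}")  # below .05 would count as 'significant'
```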

That’s useful, but only to a point. The difference may be “significant” but does significance tell us anything about meaningfulness? The difference between the two route times is a little over a minute. Even if statistical tests show that this difference is significant, it can’t factor in questions like “How enjoyable is each route to drive?” If you prefer highway driving to city driving, then Route B, with its later highway exit, may be better. If Route A takes you past a scenic lakeside park, then perhaps it’s the better route for you.

All this presents a challenge to social science researchers. And we can represent that in this equation:

Significant Differences = Meaningful Differences + error

Why is this equation a challenge? It has much to do with the current wave of skepticism toward the idea of “significance.” In short, there are many ways in which data can be prepared and analyzed to increase the chances of a “significant” result. And because significance is probabilistic, the chance that one or more of those tests reflects random noise rather than a real difference grows with each additional test.

For example, let’s imagine we wanted to craft a model that could predict support of a political candidate. What could influence this? We measure gender, age, race, religiosity, income, and where a person lives. For political support, we measure passion for the candidate, voting, donating, and talking to friends and family. That’s six predictors of difference for four different outcomes, a total of 24 different tests. The chance of finding no “significant” difference among these is .95^24, or 29%; in other words, there is a 71% chance of at least one false positive. Just by random chance and because we’ve captured so much data, we find some “significant” results we can report even if in reality none of these things predicts candidate support. That is, we find significant differences even if, in actuality, it’s all error.
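
The arithmetic, plus a quick simulation of the same setup, looks like this. Nothing here is real data; under the null hypothesis each test's p-value is uniform on [0, 1], which is all the simulation assumes.

```python
import numpy as np

n_tests = 6 * 4                      # six predictors x four outcomes
p_all_clean = 0.95 ** n_tests
print(f"P(no false positives in {n_tests} tests): {p_all_clean:.2f}")   # ~0.29
print(f"P(at least one false positive):        {1 - p_all_clean:.2f}")  # ~0.71

rng = np.random.default_rng(42)
n_studies = 10_000
p_values = rng.random((n_studies, n_tests))          # p-values under the null
at_least_one = (p_values < 0.05).any(axis=1).mean()
print(f"Simulated rate of >= 1 'significant' result: {at_least_one:.2f}")  # ~0.71
```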

Because of these types of realizations and social science’s current “replication crisis,” scholars are dedicating far more attention to the error term instead of to the times in which significant does concord with meaningful. And that means that scholars can no longer declare “significant” results to have merit. Let’s say we find that reading a biography of a political candidate in which the candidate describes her hardscrabble childhood increases passion for the candidate by .4 on a scale from 1 to 7. This increase might be “significant” by standard statistical tests. If we use significant and meaningful as synonyms, that might be enough to declare the result matters. But if we give scrutiny to the method (the participant has to read a whole book?) and the outcome (and this raised passion by an average of only .4?), we should land squarely in the error, rather than the concurrence, between the two terms.

Any social scientist worth her salt will tell you that statistics without theory is nonsense. A statistical test cannot tell us what results mean or if they are meaningful; only theory and argumentation can do that. When we highlight the times in which significant and meaningful don’t align, we reinforce this vital lesson. As more findings in social science (and other fields too) are questioned as non-meaningful despite “significance,” we improve scientific pursuit. And that is most certainly a meaningful outcome.

Tell me if you disagree with this argument. Some college freshmen struggle; it can be hard to make the transition to a new living situation, new people, and new responsibilities. This presents a challenge for college professors: How much leeway should they give these struggling students? Professors need to accept that some students will fail. This helps the university uphold high academic standards, it provides valuable lessons for struggling students, and it may help some of those students find a new university where they can be successful.

Do you agree? Regardless, do you think the argument is controversial? I’m ambivalent about the argument itself, as some struggling students actually don’t know what they are supposed to do. As such, they need information, not “tough love.” But I don’t think the argument is controversial. That is… unless you phrase it like this:

“[We need to find struggling students and remove them from the university in order to improve our retention rate.] This is hard for you because you think of the students as cuddly bunnies, but you can’t. You just have to drown the bunnies… put a Glock to their heads.”

This is how Mount St. Mary’s now-former president Simon Newman phrased his desire to increase university retention rates by kicking out students quickly rather than waiting for them to drop out and count against the university. Mr. Newman resigned earlier this week.

This quote isn’t all that led to Mr. Newman’s resignation. His words emerged from an email exchange about a survey Mr. Newman wanted to give to students; the email exchange was published in Mount St. Mary’s student newspaper. Though the survey’s instructions said the survey would provide the students with useful information and had no right or wrong answers, Mr. Newman intended to use the survey’s results (which were not confidential) to identify students deemed more likely to fail and then to encourage those students to leave the university. The email exchange (linked above) contained many dissenting voices who were horrified by Mr. Newman’s plan. And Mr. Newman’s actions after the email exchange was published were wrong-headed as well. For example, he fired the newspaper’s faculty advisor for being “disloyal” and punished two other faculty members. But it’s hard not to see this as a question about a poorly-chosen metaphor and its backlash.

Metaphors are powerful things, able to make an argument more vivid or connect an argument more closely to someone’s own experience. For example, in my own work, I often compare group homes for foster youth to a dysfunctional high school. Foster youth are treated as irresponsible children who must be kept under control by draconian rules. If the same thing happened in a high school, what would result? Classes wouldn’t be productive, students would disrupt things often, and everyone would do whatever they could to not be in school. Is it any wonder, then, that the same things happen in group homes? In my opinion, this metaphor helps shift our thinking away from the trauma that the foster youth have experienced and onto how any teenager would react to such a setting. It humanizes foster youth and turns our attention to the real problem: the restrictive group homes.

But with the power of metaphors comes real risk. Mr. Newman used a metaphor that highlighted the warm and fuzzy feelings professors have toward their students and then contrasted that with the importance of putting the university first. His metaphor is stupid in many ways. Drowning a bunny does not improve that bunny’s life, while helping a student find a university where he can succeed does help him, certainly more than waiting for him to fail out later in the school year. But the metaphor is also a good one if it helps professors recognize that coddling students, in many cases, does nothing to help them. Indeed, Mr. Newman’s attitude is exactly opposite to what I have seen other universities do: string students along, right on the edge of failure, in order to keep collecting their tuition money. Kudos to Mr. Newman for not proposing something like that, a far more controversial idea.

Let’s try to describe Mr. Newman’s problem with metaphor in an equation.

A Metaphorical Point = A Literal Point + error

I don’t think that error term is particularly large, because humans seem wired to understand and appreciate metaphors. So what did Mr. Newman do wrong? Why was his metaphorical point not immediately translated into a literal point? Perhaps it is the graphic nature of his metaphor. Drowning and shooting bunnies is dark territory, especially here in the United States where most people have generally positive feelings about rabbits. (Mr. Newman was born in the UK, so he may have different feelings about the animal.) And when bunnies are a stand-in for students, a half-reading of the metaphor turns even darker. (Drowning and shooting students? Obviously a terrible image.) Combine all that with the context (Mr. Newman’s controversial proposal to weed out students) and you have a metaphor that fails to convey a literal point.

But should we defend Mr. Newman’s use of a clumsy metaphor? We’ve likely all deployed clumsy metaphors in the past or heard others do the same and been willing to go along with them. I remember a friend from high school describing her crush on a boy as if she were a dock and he a boat, drifting closer then farther from her. I nodded along politely even as seagulls, tides, and sandbars got added into the metaphor (okay, maybe I’m exaggerating a little, but it did get very complex!). So should we have greater sympathy when someone else struggles to use metaphorical language to advance a point?

In this case, I’m going to say no. Mr. Newman might not have gotten as much attention for his plans if he hadn’t used a metaphor, but the controversy would have remained. And if he had used the metaphor while encouraging professors to assign tough grades because students can’t be coddled, then he could have apologized for his phrasing and ended the uproar. Certainly people are condemned when their metaphors don’t translate well into literal points. And I’ll rush to their defense the next time it happens. In this case, Mr. Newman’s plans seemed to be the big problem, and his language simply encouraged more people to pay attention to those plans. His actions afterwards only hurt him. So Mr. Newman aside, others who use inapt metaphors should be defended. A metaphor that doesn’t lead to a literal point is mostly just clumsy language, not the sign of anything nefarious.

If being a graduate student was pitched like a job, it would sound something like this. “Are you looking for a job that pays you to work 20 hours a week, but requires you to work 80? Are you looking to be berated for your failings, but never given instruction on how to improve? Do you wish to be supervised by people who are experts in their field but cannot functionally interact with other human beings? If so, apply to be a graduate student at a research university!”

I’m dramatizing somewhat, obviously. This is not the experience of all graduate students; indeed, it wasn’t my experience. But from being in graduate school and interacting with many graduate students after I finished my degree, this description isn’t that far off from what they go through every day.

And in some ways, we shouldn’t be surprised by this. Professors aren’t trained as managers, just as they aren’t trained as teachers. We only assume they can help others learn because we know they are experts in their fields. And graduate school is school, so the tuition waivers that come with an assistantship really are the main compensation; no graduate student will ever get paid based on the number of hours she works. Many parts of graduate school are closer to working an unpaid internship than working a job, including the hope that working for free now will lead to a real (and good) job in the future.

So perhaps we have to forgive everyone involved in graduate education for seeing the system as this equation:

All Work, No Play = Good Graduate Student + error

This equation fits very well with American work values, in which working hard means getting ahead and pain now is expected to produce rewards in the end. And it also fits people whose work truly is their life. Being in the lab at 11 PM doesn’t feel like work when there isn’t anything more appealing to do.

The trouble comes with that error term. In my estimation, there are very few people (indeed, maybe no one at all) who can actually live with all work and no play. That’s not because some people aren’t very hard working. Of course some people are! And some people seem to need less sleep than others. And some people really do seem to live their lives at work. It is, after all, where a lot of people do the majority of their socializing.

What I mean is that graduate students and professors may be too quick to accept this equation without enough attention to the error term. For example, let’s consider the part of being a good graduate student that revolves around being a willing and able collaborator on other people’s projects. If that graduate student is the first to leave the lab at the end of the day, does that mean he isn’t a good graduate student? Our equation says yes, but intellectual work is different from working on an assembly line. When one person steps away from the line, it stops and everyone else must stop too. But if someone leaves the lab to go home for the evening (or, heck, just out to take a walk before coming back), it doesn’t mean their mind stops working. Furthermore, when that person is out of the lab, most everyone else in the lab can keep working just as they were before. As such, we have to consider a broader conception of that person’s willing and able collaboration. If he returns to the lab right away the next morning feeling refreshed and ready to work, then his contributions will be strong even though he left the lab before others the night before. Indeed, his contributions may be stronger if everyone else comes into the lab bleary eyed at 11 AM.

Let’s consider another example, the idea that being a good graduate student means a complete immersion in a field of study. In order to succeed, the graduate student needs to live and breathe all aspects of her field. This translates into behaviors like never reading anything that doesn’t relate to her academic study, rejecting hobbies and activities that don’t advance her research, constantly coming up with new ideas, and keeping a notepad next to her bed so she can immediately write down any thought that comes to her. Here too, we need a broader conception of how productive immersion may be. For some graduate students, the only way to work at all is to maintain a proper balance between life and work. It’s kind of like how exercise makes you tired but gives you more energy overall. That is the nourishing part of taking time to remove yourself from your field of study and have other experiences.

So what are we to do with this equation? In the end, two things need to happen. The first is that the notion of being a good graduate student needs to have more predictors than simply an exhausting level of effort. Rather than looking at how much time someone puts in, professors and fellow graduate students should pay attention to output. Sending emails at 3 AM doesn’t mean anything if those emails are rubbish (as many emails sent at that time of night are). Meetings that stretch for hours don’t do anyone any good if that meeting is using each person at just a fraction of their overall capacity. All work and no play is a convenient measure, but convenience isn’t really that useful in this case. Professors are actively mentoring graduate students, and if they are using just a convenient measure to assess how a student is doing, they are failing as mentors. Being a good graduate student should be reframed as being engaged in research, improving work based on feedback, demonstrating growing knowledge and expertise in a field, and, yes, maintaining some measure of productivity.

The second thing that needs to happen is a refusal to accept the equation above. Too many people take it as truth and assess themselves against it as a normative standard. In reality, people overestimate how many hours they work. And it’s too easy to forget about the consequences of, say, a late night of work. But think back to times in which you (or someone you knew) decided to “pull an all-nighter.” In hindsight, it is almost never a good idea. Perhaps in the short-term it worked to meet a deadline, but in terms of overall productivity, it’s a losing solution. All work and no play doesn’t make a good graduate student; it makes an under-productive, stressed out, and unhappy graduate student. The sooner everyone is willing to say this, the better off graduate students, professors, and academia as a whole will be.

In my previous post, I described the challenge of defining a group of children in order to calculate how long those children have stayed in care. The goal is to provide timely, useful measurement of this indicator of child permanence, while minimizing the children who aren’t counted, the error in our measurement effort. Using entry cohorts (or all children who entered care in a given period) means using a measure of central tendency like median or waiting a really long time (theoretically up to 18 years after entry into care) to be able to calculate a mean. Clearly, entry cohort is not a perfect solution, and in this post, I’ll consider two other methods.

If entry cohort is all children who entered care in a given period of time, exit cohort is the opposite. With all children who exited care in a given period of time, we can calculate all kinds of useful information. And we don’t have to rely on measures of central tendency like median; we can calculate means, modes, or anything else we like. We can get a very detailed picture of these children’s experiences in care. How long were they in care? How many different living settings were they placed in? How is it that they are exiting care? Are they headed back to their family of origin, being adopted, or entering into the permanent care of a relative? Because exit cohort isn’t dependent upon when the child entered care, there’s no lag for these calculations. We can, in theory, calculate it as soon as the period of time that defines the cohort ends.

But exit cohort also has an error term. We are only selecting the children who are exiting care, and thus ignoring all the children who remain in care. It’s great news that Susie is exiting care after just 8 months, but what about Sam who came into care at the same time? He’s still in a foster home instead of a permanent living situation. Using exit cohort, we forget about Sam until he too (eventually) exits from care. We can discuss Sam in other ways, of course. For example, he would get counted in the total number of kids currently in foster care. But we cannot calculate his length of stay or his number of placements because we are truncating the number based on when we are doing our calculations. Sam has been in care for 8 months, just like Susie, but it’s not correct to use 8 months in our calculations; Sam will be in care for longer.

So what are we supposed to do? Both entry and exit cohort have significant error; neither provides a full picture of children in care. The other option to consider is a cross-sectional, rather than longitudinal, approach. In this method, we don’t track individual kids in care. Instead, we look at the system itself over a defined period of time. Who are the kids in care? What are their statuses?

Let’s imagine we look at kids who spent time in care in calendar year 2015. All kids will fall into one of four groups. First, there are kids who both entered and exited care during the year. My previous post offered a good example of this: the child staying with her grandmother while her parents took a cruise. She entered care after her grandmother was hospitalized and exited care a day later when her parents got back from their cruise. Second, there are all other kids who exited care in 2015. These kids are similar to the exit cohort method discussed above. Third, there are kids who entered, but did not exit, care during the year. And finally, there are kids who were in care before the year started and remain in care when the year ends.
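
A sketch of that four-group classification, with invented entry and exit dates (the code assumes the table only contains children with some time in care during 2015):

```python
import pandas as pd

kids = pd.DataFrame({
    "name": ["Ana", "Ben", "Cora", "Dev"],
    "entered": pd.to_datetime(["2015-03-01", "2014-06-15", "2015-09-01", "2013-01-10"]),
    "exited": pd.to_datetime(["2015-03-02", "2015-08-20", None, None]),  # None = still in care
})

start, end = pd.Timestamp("2015-01-01"), pd.Timestamp("2015-12-31")

def classify(row):
    entered_in_year = start <= row["entered"] <= end
    exited_in_year = pd.notna(row["exited"]) and start <= row["exited"] <= end
    if entered_in_year and exited_in_year:
        return "entered and exited in 2015"
    if exited_in_year:
        return "exited in 2015 (entered earlier)"
    if entered_in_year:
        return "entered in 2015, still in care"
    return "in care the entire year"

kids["group"] = kids.apply(classify, axis=1)
print(kids[["name", "group"]])
```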

The overall outcome of this method is that no child is missing in our counting. We describe all children who spent any time in care during the year. We will take those children exiting care and calculate length of stay. But in doing so, we won’t be ignoring any children. If 25% of children in care at any time in the year exit care, then we are clear that our calculations of length of stay are based on just this 25%. The reader is presented with full information about children in care.

By paying attention to the children left out of other calculations (the error in our measurement) and not merely focusing on those children who allow us to make easy calculations, we find a better solution. These numbers aren’t going to help us figure out how to remove children from care quickly and safely, but at least they provide us a look at all children who had experience with the child welfare system. When we ignore no one, we help illuminate the wide array of child experiences with the child welfare system. And hopefully this, over time, can push us toward achieving better outcomes for all children in care.

I work in child welfare research, and one of the most important reports my research center publishes each year describes outcomes for children in state care. We have a set of indicators that measure safety, permanency, and well-being for children. Many of these indicators are relatively straightforward to calculate. We know how the data is structured, so we pull it out, run some syntax, and report the output. But other indicators require some debate, often centered around who we are leaving out of the calculation. These children who get left out are a sort of “error term” that keeps us up at night.

Lately, we’ve been discussing the question of how to measure “length of stay” in care. The definition of this indicator is simple: How many months did the child stay in care before exiting? No one disputes that this definition is correct. The challenge comes from measuring it. The minimum length of stay would be a matter of hours or days. Let’s imagine a four year old child is in the care of a grandparent because the child’s parents are on a cruise. The grandparent takes the child to the playground, tries to go down the slide, but falls and breaks her leg. She must go to the hospital, cannot return home, and because there isn’t someone else immediately available to care for the child (for whatever reason), the child is taken into care until the parents can rush back from their vacation. That’s the minimum length of stay, the time it takes the parents to return. The maximum length of stay would be 18 years. This could occur if a child were taken into state care upon birth and never achieved permanent placement. The child would stay in care until he “ages out,” turns 18, and is removed from the state’s care. (Some states allow youth to stay in care until 21, but the idea is the same.) This outcome is extraordinarily unlikely, perhaps even unprecedented, but it nevertheless represents the longest amount of time a youth could stay in care.

Because of this wide variability, calculating length of stay for the system as a whole becomes complicated. Let’s say we want to calculate length of stay for children who entered care in 2015. Some children will exit in the same year they enter, like the girl in the temporary care of her grandmother. But other children will stay in care for years. So what should we do in this case? Our current solution is to carefully select a measure of central tendency and to allow enough time to pass so that we can report the number. We use median months in care, the number of months it takes for half of children entering care in that year to have exited. The number is around 30 months and has stayed at this level for the past several years. We report this number with a lag of about three years, necessary because it takes that long for half of the children in a given entry cohort to exit care.
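
A minimal sketch of that median calculation, using invented lengths of stay (NaN marks a child still in care):

```python
import numpy as np

# Months in care for a hypothetical entry cohort; NaN = still in care.
months_in_care = np.array([1, 3, 5, 8, 12, 18, 24, 29, 31, 40, np.nan, np.nan])

n = len(months_in_care)
exited = np.sort(months_in_care[~np.isnan(months_in_care)])

# The median is known as soon as half the cohort has exited, even though
# the eventual stays of the still-in-care children remain unknown.
if len(exited) >= n / 2:
    median_months = exited[int(np.ceil(n / 2)) - 1]
    print(f"Median length of stay: {median_months:.0f} months")
else:
    print("Too early: fewer than half the cohort has exited.")
```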

This would be a perfectly acceptable solution if length of stay were normally distributed. The trouble is that most kids are in care for a relatively brief period of time and some kids stay in care for an extraordinarily long amount of time. The data is right skewed: most of the data piles up on the left side of the chart, with a long tail stretching out to the right. Median tells us some “good” news about half the kids in care: They stayed in care for 30 months or less. But it also tells us bad news about the other half of kids: They stayed in care for more than 30 months, and we won’t know for years more how much longer they remained in care!

That’s the challenge with using entry cohorts to calculate this info. If we have to wait for all children who entered care in 2014 to exit from care, we might not know the length of stay for this cohort until 10 or more years later. In theory, we might not be able to calculate the number until 2032. At that point in time, the information would only be useful to historians. The children who don’t get “caught” in our calculation of length of stay (because of the necessities of using median) represent “error” in our measurement. Entry cohorts, the distribution of length of stay, and the need to get the data out in a timely manner mean we lose these kids. We can note that 50% of kids stay in care longer than the median, but that isn’t nearly as useful as actually stating how long they stayed in care.

So that’s the problem with how we currently do things. But what can we do to fix it, or augment what we currently do? I’ll consider that question tomorrow.

The field of psychology and social science in general is facing a “replication crisis.” (Other fields are too, including medicine.) The most cited cause of this “crisis” is one-off experiments that produced a “significant” result, got published, and were rapidly accepted as fact by the scientific community. Significance, in this case, refers to a statistical test designed to indicate the likelihood that the result occurred due to chance; a small enough percentage leads to a result being declared “significant.” Readers should be cautious to not use that word as a synonym for “meaningful” when talking about research.

Many solutions have been proposed for this crisis, and no solution is better than an increased effort to replicate the studies. This solution is ideal because it can be added as a requirement for future studies (for example, that at least two different samples be used in an experiment) and can be used to review past findings.

One trouble with replication, however, is that it isn’t “sexy.” People who go into research fields (whether in academia or not) aren’t entering the field because of a chance to redo the work of others. Instead, they seek opportunities to test and advance their own lines of thought. The entire tenure process at research universities is built around making new, meaningful contributions to a field of study. Replication, by design, doesn’t do that. And though it may be helpful for our scientific understanding (thus advancing human knowledge overall), it doesn’t tell us much at all about the researcher conducting the replications. Is the replicating researcher a good scholar? We don’t have enough evidence to answer the question if all the researcher has done is try to replicate other people’s findings.

So who is supposed to do replications? The answer may come to us from an unlikely source: the arts. For an example, let’s look to painting, an artistic pursuit that, for many artists, requires years of training. How does one begin to learn how to paint? The exercises to learn painting have been the same for centuries and center on copying the work of others. This could mean learning to mix paints to see how other artists have created depth in painting (for example, making more distant objects bluer). It could mean painting still-lifes of fruit bowls, just like the great masters did. And sometimes, it means copying past paintings to try to understand how they were created. All this is done BEFORE we expect the student to produce any great, original work. The same is true for writing. Students are expected to read widely and to attempt to write in the style of other writers. Indeed, consider “copywork,” an old education method that called for students to directly rewrite the works of others to learn penmanship, spelling, grammar, and perhaps even style.

Let’s do the same thing in social science research and ask graduate students, those learning to do research, to undertake replication efforts as the fundamental step to begin their research education. There is much to be learned from trying to replicate the work of others. First, graduate students learn about the necessity of writing a quality Methods section because these sections will guide replication. Too many methods sections are vague on key details. This may be the result of bad writing; most academic writing is of poor quality. It may also be because the author wishes to obfuscate some details that didn’t go the way the researcher intended. For example, perhaps the researcher should have used random assignment but did not because of how the experiment was planned; the researcher may not be specific about this detail because it could undermine faith in the study.

Second, by replicating past work, graduate students will be able to get into the muck and learn about the real work that goes into conducting experiments or other methods of gathering data. No amount of prior planning will resolve all possible issues researchers encounter when running an experiment. I recall a personal example, running an experiment in which one participant had an opportunity to lie to another for financial gain. Not having a procedure to keep the two participants from leaving the experiment at the same time meant occasional awkward encounters after the experiment was done. I had to use a new process for participants, in which they were asked to signal when done completing a survey rather than just leaving. This allowed me greater control to make sure the participants left separately. It is best to learn how to create a good research process as early as possible.

Third, graduate students can conduct analysis of the data with a step-by-step guide from past studies. This reduces the time and worry of analysis, and it also offers an opportunity to see both what the analysis explores in the data and what it leaves unexplored. Learning to think about data–from both good and bad perspectives–is key to social science research. The graduate student can ask why the published paper conducted each test. Was there an effort to avoid some variables and use others? Was this done to emphasize significant results and downplay non-significant ones?

Finally, graduate students contribute to scientific knowledge right away, in ways that their “original” ideas likely would not. And this contribution means more experienced researchers can spend time advancing their own research agenda (and contributing future results which may or may not be replicable).

Will this “solve” the replication crisis in psychology? Certainly not in the short term. Instead, it will immediately exacerbate the problem by showing even more results that aren’t replicated. But in the long term, it may. Graduate students will learn the importance of producing good work and the uselessness of significance to tell us what results are meaningful. The best studies tell us something no matter what the results show, but these studies must be carefully constructed. Learning by copying may help budding researchers learn what those research questions look like.

It may still be a tough sell to graduate students. But when replicated research is respected by the academic community at large (as it should be) and when graduate students get to explore some of their favorite research, their attitudes should change. This new method will allow graduate students to emerge as better researchers. Let graduate students be apprentices, building their own tools first, before we turn them out into the world.

I commonly write about the topic of error in assigning grades in a classroom setting, but it always fills me with a certain measure of anxiety.

Assigned Grade = True Grade + Error

This is a fundamental relationship that we have to deal with when trying to measure anything interesting. Yet I often wonder how students reading this blog might feel to be so frequently reminded that the grades they receive in their classes always contain some error. Though error is never completely avoidable, it is easy to understand how someone might feel disaffected or angry to learn that the measures which hold sway over their future career prospects contain some degree of error. The degree of error varies from class to class and student to student, but it’s easy to imagine that in almost every class there are some students whose grades are in the grey area between, say, a B and a C, for whom even a small amount of error could be enough to produce an assigned letter grade that is different from what the ‘true’ grade would be if the error had not existed.
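
To make that grey-area worry concrete, here is a small simulation. The 80% B/C cutoff and the 2-point error standard deviation are hypothetical numbers, not anything from a real class.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

true_score = 79.0   # a 'true' grade one point below a hypothetical B cutoff of 80
error_sd = 2.0      # assumed size of grading error, in percentage points

assigned = true_score + rng.normal(0, error_sd, n)
flipped = (assigned >= 80).mean()
print(f"True score {true_score}: assigned a B about {flipped:.0%} of the time")
# With these numbers, roughly 31% of such students get a letter grade
# different from their 'true' grade.
```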

How important is the difference between a B and a C? In the case of Megan Thode, a student at Lehigh University, the difference amounted to a $1.3 million lawsuit.

http://www.nydailynews.com/news/national/ex-student-sues-school-1-3m-grade-article-1.1261784

To quote the NY Daily News:

Getting a grade you don’t deserve in school is worth about $1 million in damages, according to a lawsuit filed in Pennsylvania.

A former student at Lehigh University is so unhappy with the C+ she received in a course in 2009 [that she] has decided to sue the school for $1.3 million, claiming the unfair grade has ruined her future earning potential.

In this case the C+ grade was enough to disqualify Ms. Thode from continuing in her graduate program. Her lawsuit claims that the grade she received was inaccurate at least in part because of bias the instructor carried toward her due to certain political statements she had made and complaints she had made about taking part in an internship as part of the class. The university denies any bias.

For the sake of argument, let us assume that the instructor did not make a conscious decision to alter her grade, but that annoyance at Ms. Thode’s behavior subconsciously contributed to the error present in her assigned grade. It would not outrage me to learn that this was true. I personally do not consider myself immune to small biases based on student interactions – a point that I try to make clear to all my students when discussing my professional behavior standards and one of the reasons I employ blind grading for most assignments. This hypothetical situation provokes a few interesting questions:

  • How much error is enough to warrant a lawsuit?
  • If we hold instructors legally liable for grade error, does it matter whether the largest source of the error is unconscious bias, sampling, grader variation, or any of the many other factors that can contribute to error?
  • Is it even possible to determine how much of an assigned grade is error or what caused the error?

I’ll pass on trying to answer the first two questions as they are mostly philosophical. The question of determining the magnitude and cause of error is a practical question. In theory, the answer to the question is ‘yes’, but with several caveats so large as to make any attempt clearly impractical.

Classical test theory posits that while the sampling error in each assessment is random, the randomness is not uniform, but rather forms a normal distribution centered on zero. From a statistical standpoint, all you would need to do to reduce the error is to continue sampling, provided – and here comes the big catch – that you could ensure that the thing you are sampling (i.e. the student’s knowledge) has not changed at all, something that would require access to a time machine or alternate dimensions. As for rooting out bias-driven error, it probably isn’t normally distributed, so you might be able to discover it if you were able to fit multiple independent graders into your time machine so you could re-test each student in the class multiple times and then try to correlate the scores between graders.
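
A quick simulation of that claim, under the time-machine assumption that the student's knowledge is frozen while we re-test (all the numbers here are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
true_score = 83.0   # frozen by assumption: the student's knowledge never changes
error_sd = 5.0      # random error, normally distributed and centered on zero

for n_assessments in [1, 4, 16, 64, 256]:
    observed = true_score + rng.normal(0, error_sd, n_assessments)
    sem = error_sd / np.sqrt(n_assessments)   # expected error of the mean
    print(f"{n_assessments:>4} assessments -> mean {observed.mean():6.2f} "
          f"(typical error ~{sem:.2f})")
```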

In other words, short of some kind of smoking gun, it is not practical to sue based on standard assessment error. Frankly I’m glad, since the thought of the chilling effect that would have on grade assignments is enough to make me shiver.


Last week, I had the privilege of attending UW-Madison’s Teaching Academy Fall Kickoff Event, which this year focused on issues of grading and assessment. The keynote speaker, Dr. James Wollack, presented challenges with grading and described potential solutions. One issue he addressed was the promotion of positive behaviors (for example, attending office hours, coming to class, participating in discussion) and the challenge of assessing these behaviors. I’ve long argued that awarding points for attendance systematically discriminates against at-risk and non-traditional students, who may face more challenges to class attendance than other students. But at the same time, it is important to recognize the motivating nature of class point structures; to wit, more students show up for class when they get points for doing so.

Dr. Wollack suggested that instructors could create a separate category of assessment items that act as gatekeepers for certain grades. For example, to receive an A, a student must earn at least 90% of available course points and not miss more than 2 class sessions. Failing to meet this attendance criterion results in a lower letter grade, no matter what percentage of points a student earned in the class.
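
A minimal sketch of that gatekeeper rule. The 90% and 2-absence thresholds come from the example above; the rest of the grade scale is my own assumption.

```python
def assign_grade(points_pct: float, absences: int) -> str:
    """Letter grade with an attendance gatekeeper on the A."""
    if points_pct >= 90 and absences <= 2:
        return "A"
    if points_pct >= 80:
        return "B"   # includes A-level point totals blocked by absences
    if points_pct >= 70:
        return "C"
    return "F"

print(assign_grade(94, absences=1))  # A
print(assign_grade(94, absences=5))  # B: the gatekeeper kicks in
print(assign_grade(85, absences=0))  # B: good attendance can't raise a grade
```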

This idea is very appealing because it allows instructors to build in assessment components that are traditionally hard to score. For example, consider the issue of participation. Let’s say that, in each class session, students can earn up to 5 points for participation. This should motivate students to participate more in class, but it also means that instructors must somehow both conduct class and evaluate student participation. This daunting task reduces reliability of grading and confidence in scores for both instructors and students. But a checkbox system and a gatekeeper point value can help solve this problem.

But what is the theoretical justification behind gatekeeper-style grading? One issue is the correlation between gatekeeper items and other normally assessed items. No matter how attendance, for example, is scored, it is still related to overall classroom performance. Students who attend class more frequently do better on exams. This may seem like a great reason to do everything possible to compel students to attend class, but whether attendance is scored or a gatekeeper, it still serves as a barrier to higher performance for a motivated student who, for whatever reason, cannot attend class as regularly as the instructor would like.

Think of it like this.

Attendance = Learning + error
Test Score = Learning + error

Ideally, attendance and test scores would be a perfect reflection of what the student has learned in the class. But of course, we recognize that some error takes place, as some students might attend class but fall asleep or be distracted; others might skip class but study extra hard on their own. For this system to work, we need to assume that the two error terms are not correlated with each other. But we are assessing the same construct (Learning, the student’s knowledge of the course material), and thus we know that these error terms in fact SHOULD be correlated.

Furthermore, we want the error terms to be random, such that error in measurement does not systematically help or hurt students. But because both Test Scores and Attendance are measures of Learning, when we kick in points for Attendance, it is like awarding automatic extra credit points on the test for students who were in class. The error term isn’t random any more, because the relationship between Learning and Test Scores is influenced by the relationship between Attendance and Test Scores.
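
A small simulation can show the problem. All the distributions and weights below are invented; the only structure taken from the equations above is that attendance and test scores are each Learning plus error.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000

learning = rng.normal(0, 1, n)
attendance = learning + rng.normal(0, 1, n)   # Attendance = Learning + error
test_score = learning + rng.normal(0, 1, n)   # Test Score = Learning + error

grade_plain = test_score                      # grade from the test alone
grade_mixed = test_score + 0.5 * attendance   # fold in points for showing up

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Error = what each grade says beyond the student's actual learning.
# (grade_mixed weights learning 1.5x, so subtract 1.5 * learning.)
err_plain = grade_plain - learning
err_mixed = grade_mixed - 1.5 * learning

print(f"corr(error, attendance), test-only grade:  {corr(err_plain, attendance):.2f}")  # ~0
print(f"corr(error, attendance), attendance added: {corr(err_mixed, attendance):.2f}")  # ~0.3
```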

In short, what results is a big assessment mess that is difficult to straighten out. There is little theoretical justification for awarding points for attendance, whether as a scored item or as a gatekeeper. There are other views on this issue, however. I hope to feature some of those in upcoming posts.

Some people argue that a free market economy is the best economic system for getting the most good to the most people. Given that some form of the free market is the only economic system that has been shown to work on a large scale, it can be hard to argue against this claim. But in the 2012 presidential election, much of the debate centers around just how much governmental involvement in the economy is the right amount. While neither side favors complete laissez-faire or communistic approaches, there are stark contrasts between President Barack Obama and Republican presidential candidate Mitt Romney.

How does this relate to measurement and error? We can understand the debate between the two candidates as one connecting our personal values with the values of economics. The free market functions based on a profit motive, but humans are much more complex. And thus we can represent the debate with an equation.

Personal Values = Economic Values + error

In a perfect free market, all of our own values would be solved with free market economics. Sick people need health care! Don’t worry, there’s a free market solution to the problem. Poor children need to be fed! Look no further than economics for help. City streets must be cleaned, in poor and rich parts of town! The free market is there to help.

Unfortunately, not all of our values–health care, feeding children, clean streets–have a free market solution. The most salient example is that of providing health insurance to individuals with “pre-existing conditions.” When insurance companies provide insurance for these people, it is much more expensive. Before health care reform legislation was passed in early 2010, insurance companies dealt with this issue by charging higher rates or denying coverage altogether for these people. That is a free market solution, but it did not reflect the values of many people.

New policy forbids insurance companies from these practices and in turn requires all Americans to obtain health insurance, thus ostensibly increasing insurance company revenues. This is anti-free market policy. It restricts the behavior of both corporations and individuals. But it also fits with the values of many people. And proponents argue that it is a case where the free market had failed and intervention was needed.

The error in the equation comes from situations in which there appears to be no free market solution to a problem. For Republican Mitt Romney, the error term in the equation is very small; there aren’t many problems that cannot be solved by the free market. For Democrat Barack Obama, the error term is larger; government has a vital role to play in the lives of all Americans. Your own view of this equation should be a big guiding factor in how you vote, and both candidates point out that their disagreement on the size of the error term shapes their unique visions for the country. In this case, error and your view of it will help determine who wins the election in November.

A few days ago, on my professional blog, I wrote about how the IRB can promote best practices. The IRB (or Institutional Review Board) must approve all research at an institution and works to ensure that research complies with federal, state, local, and university laws and policies. Though this suggests a standard process with carefully constructed guidelines for review, many researchers note frequent discrepancies between institutions and even between protocols. A procedure requiring modifications in one proposed project may go unquestioned in another. This naturally results in much frustration for researchers, no matter their level of concern with research ethics. Indeed, it should be those most concerned about practicing research ethically who protest loudest against these IRB discrepancies.

We can view the whole process as an equation and investigate further how the error term leads to frustration.

IRB Practice = Best Practice + error

Let’s first agree that for most research practices, there is a “right answer” regarding the protection of participants. I use the example of data storage in the blog post linked above. “Consider, for example, handling collected data (the surveys and spreadsheets that contain participant responses). This data must be stored in a secure location. On paper, it should be in a locked room that is accessible to very few people. Digitally, it should be stored on a computer and backed up on a secure server.”

If there were no error, then any deviation from this best practice would be flagged by the IRB, and all protocols asserting that they will follow these practices would be accepted by the IRB. Where there is no error (that is, no difference between best and IRB practices), there is no frustration. But this isn't the case. Some IRBs may flag protocols that have not described backup locations as secure, while others may not worry about this detail. Error (the inconsistency) leads to frustration because a researcher cannot accurately predict what issue the IRB will flag.

The IRB, however, has some power over this error term. Right now, researchers must answer a variety of questions about items like data security. The application contains empty boxes in which the researcher must detail her plans and hope that they pass muster, with little guidance about what a best practice might be. This creates error because the researcher can easily forget or misstate an important piece of information. But if the IRB were to simply state the best practice and instruct researchers to either agree to follow it or describe their alternative plan, then the error term would be eliminated for most applicants. The IRB would save time as well.
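To make that claim concrete, here is a minimal sketch in Python of the two application designs. Everything here is hypothetical: the function names and the error rates are invented for illustration, not drawn from any real IRB data.

```python
import random

random.seed(7)

# Both rates are invented for illustration only.
P_OMIT_DETAIL = 0.25        # chance a free-text answer forgets or misstates a key detail
P_NEEDS_ALTERNATIVE = 0.05  # chance a project truly cannot follow the stated best practice

def free_text_flagged():
    """Empty-box design: the researcher writes a plan from scratch and may
    omit a detail, which the IRB then flags. Returns True if flagged."""
    return random.random() < P_OMIT_DETAIL

def checklist_flagged():
    """Checklist design: the researcher agrees to the stated best practice or
    describes an alternative; only alternatives need review. Returns True if flagged."""
    return random.random() < P_NEEDS_ALTERNATIVE

n = 10_000
print(f"Free-text applications flagged: {sum(free_text_flagged() for _ in range(n)) / n:.0%}")
print(f"Checklist applications flagged: {sum(checklist_flagged() for _ in range(n)) / n:.0%}")
```

Under these made-up numbers, the checklist form gets flagged far less often, because agreeing to a stated best practice leaves the researcher nothing to forget or misstate.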

Thus, inasmuch as the IRB wants to reduce researcher frustration, it should promote these best practices. By understanding the process of protocol approval in a classical test theory manner, we can see direct guidance as to how the IRB can improve its application procedure. In these cases, the error term is nothing but trouble for both IRB credibility and researcher sanity. All steps that reduce that error, while protecting participant safety, can and should be taken.

Over the past few days I've been editing a series of interviews on individual views of the classroom assessment process, to be presented at the upcoming Teaching Academy fall kickoff event at the University of Wisconsin.

One of the first things we asked our interviewees to do was to provide their definition of a 'C'. At the risk of spoiling the result, I'll just go ahead and say that there wasn't a clear consensus (shocking, I know). Those who chose to give criterion-based responses tended to describe C-level work as "adequate", "acceptable", or "OK". These responses track well with the meaning of the grade in most academic environments, where the C is the minimum grade a student can receive and still continue on to the next class in a sequence. Interestingly, none of them made reference to any rubric or institutionally standardized level of achievement in determining what qualified as 'adequate', giving a sense that this was still a pretty subjective judgment.

A large segment of the interviewees gave a normative response, though they seemed to have a tough time agreeing on exactly where a C fell within the norm. Only one instructor described the C as 'average'. The rest all described a C as worse than average, varying from 'slightly below average' to 'the bottom 30%' to 'basically seen as failing', with the harshest assessment coming from an undergraduate student. One professor clearly felt that it was proper for there to be multiple norms for a C, saying that because students at Wisconsin were above average, he considered a BC to be average.

A few people gave answers that were not linked to achievement in learning, instead categorizing a C as representing a lack of effort or hard work on the part of the student.

Regardless of the view you take on the validity of each of these positions, I think the sheer variety of answers is itself the strongest indictment of the utility of letter grade systems. Why, after all, do so many institutions adopt the familiar A, B, C, D, F system if not, well, for its familiarity? The value of standardization is clear; take languages as an example. I can author a post on this blog in English and be confident that, even accounting for regional vernacular, a reader in any other English-speaking region will be able to reasonably understand most of the meaning. This is the same basic goal that underpins a standardized grade system. I don't assign a student a grade of '7.5 walruses' because I can just imagine the work a college recruiter would have to go through to decipher the meaning of a transcript if every school used a proprietary system.

ABCDF has become our lingua franca, but what good is a ‘common language’ if everyone interprets the meaning of the words differently? Such a system is actually worse than a fragmented system because it gives the illusion that grades from different instructors mean the same thing, when in fact they don’t. If I give the grade of 7.5 walruses, at least the person trying to interpret the grade will recognize that they need to get more information to know what it means. Experienced recruiters and admissions agents may be able to learn to adjust for institutions with reputations for strict or lax grading, but it’s unrealistic to expect them to be able to account for significant differences between instructors within the same institution.

We’re left with a situation in which everyone loses. Instructors are stuck with a stiflingly reductive grade system, without the value of its simplicity and standardization actually being leveraged. Recruiters are given a signal that is ostensibly accurate, but which contains noise from not only the inherent error of assessment but also from a disagreement in the meaning of the grades themselves. Students are subtly penalized if they don’t seek out instructors who have a lower standard for the meaning of the grades. I suppose the advantage of such as system is that it fits with the long held traditions of giving university professors autonomy within their classrooms, but I’ve personally never met an instructor who relished the process of having to come up with their own definition of what differentiates a C from a B (or B- or BC).

There are a few potential solutions to the problem of our language barrier when it comes to letter grades. One would be to tie grades to a relatively objective quantitative measurement, such as the student's rank within a class (i.e., curved grade distributions, with all the hand-wringing that entails). Another would be to set out clear rubrics within departments and seek out collaboration with other major universities to develop standards, giving up some freedom in the process. Still another would be to broaden the use of university-level standardized tests, like the MCAT, LSAT, GRE, or FE. Whatever the solution, it's about time we started doing something.

After the conviction of Jerry Sandusky, former Penn State assistant football coach, on charges of child molestation and rape, and the release of the Freeh Report, which found that former Penn State football coach Joe Paterno and other leaders at Penn State did not act responsibly when informed of accusations against Mr. Sandusky, Penn State is faced with many difficult choices. One of those choices is whether to remove the statue of Mr. Paterno that currently sits outside Penn State's stadium.

[Image: the Paterno statue]

On the face of it, this is a choice about how Mr. Paterno's legacy should be treated. Should he be honored for his commitment to the university and his longevity as coach? Or should he be condemned for his lack of moral character after he recommended not reporting Mr. Sandusky to the police? Choose the former and the statue should stay; choose the latter and it should go.

To help clarify this debate, it is valuable to consider the statue as a representation of Mr. Paterno’s time at Penn State.

Statue Representation of Mr. Paterno = Mr. Paterno in Actuality + error

Prior to the scandal, we might have concluded there was little error in the statue's portrayal of Mr. Paterno. He was a treasured figure on campus, and the criticisms that might have been leveled against him paled in comparison to his contributions to Penn State, in both athletics and academics.

After the scandal, it turns out that Mr. Paterno, in actuality, was not quite the same as the statue representing him. Indeed, there were many complexities in Mr. Paterno's leadership. Mr. Paterno led football players onto the field to engage in sport; he also led university officials to ignore a moral imperative to protect young boys from a serial child rapist. With Mr. Sandusky exposed and convicted of his crimes, we see Mr. Paterno in an entirely different light.

The result is an error term that taints the representation of Mr. Paterno in the statue. And given that the statue is designed to represent the success of Mr. Paterno at Penn State, we can see the representation as a flawed portrayal of a man who, we can now conclude, was far from a saint (which is how one campus mural presented him, with a halo, before the artist removed it). If university officials believe that Mr. Paterno's reputation is now sullied (and they should), then they need to ask themselves what the statue is meant to represent and what it actually represents. A mismatch between the two (the error term) suggests a solid rationale for why the statue should be removed.

The error in the representation, an indication of Mr. Paterno's own errors, means the statue's intention is subverted. It now stands as a bitter, ironic representation of Mr. Paterno's time at Penn State, not a celebration of football success. Because reality now falls far short of that intention, Penn State officials should opt to remove the statue.

When I was in Paris in 2008, the funniest movie posters I saw were for the American film Step Up 2: The Streets. Dispensing with any pretense, the movie had been retitled Sexy Dance 2. This title, I suppose, told French audiences everything they needed to know about the film. According to Box Office Mojo, the film grossed over $4 million in France, or 7% of the film's global take, so we can conclude the new title was successful.

But other movie titles don’t seem as aptly translated. Some of the most perplexing and amusing:
If You Leave Me, I Delete You, instead of Eternal Sunshine of the Spotless Mind (Italy)
The Jungle Died Laughing, instead of George of the Jungle (Israel)
Urban Neurotic, instead of Annie Hall (Germany)
His Great Device Makes Him Famous, instead of Boogie Nights (China)
Six Naked Pigs, instead of The Full Monty (China)

Something seems missing from these titles. Some lack nuance. Others lack cleverness, though perhaps that is lost in the back-translation. In any case, the translations are hardly a faithful representation of the original title.

We can put this into a simple equation:

Actual Translation = Exact Translation + error

But is that error actually random? Likely not, as we can assume that the film's international distributors worked very hard to come up with a title that would attract an audience in that country. Audiences in Germany, the retitling suggests, are much more interested in seeing a film with a descriptive (and accurate!) title than one bearing a person's name. In this case, marketing appears to be the force adding error to the relationship between the actual title and an exact translation of the original title.
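In classical test theory terms, this distinction matters: purely random error washes out when you average over many measurements, but a systematic push like marketing never does. Here is a minimal sketch in Python; the scores, the bias size, and the function name are invented for illustration.

```python
import random

random.seed(1)

def measure(true_score, bias=0.0, noise=1.0):
    """Observed = true + error, where error = systematic bias + random noise."""
    return true_score + bias + random.gauss(0, noise)

true_score = 50.0
n = 10_000

# Purely random error averages away over many measurements...
random_only = sum(measure(true_score) for _ in range(n)) / n

# ...but a systematic bias (marketing's thumb on the scale) never does.
with_bias = sum(measure(true_score, bias=5.0) for _ in range(n)) / n

print(f"True score:         {true_score:.2f}")
print(f"Mean, random error: {random_only:.2f}")  # close to 50
print(f"Mean, biased error: {with_bias:.2f}")    # close to 55
```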

We see this same kind of process in education. When constructing an assessment tool, an instructor wants to make the tool capture student learning as faithfully as possible. But the instructor may also want the assessment tool to provide some pleasure (or at least lack of pain) for students. This could include paper assignments that the instructor thinks will be fun for students or test questions with amusing aspects.

Ideally, these fun tools would measure learning just as well as a more straightforward method of assessment. But just as with movie titles, marketing efforts (making the tool fun) may add error to the process. In attempting to translate student learning into an assessment tool, the instructor can easily become distracted by other goals. Marketing can make a mess of translations.

Instructors should be careful. We laugh at funny translated movie titles, but poor test items are no laughing matter. Just consider the uproar over a story question featuring a hare racing a pineapple that made news this spring. It was a funny, entertaining question that made a lot of people mad. It was just another error in translation.