Fidelity to a program, policy, or practice means that the people carrying out its procedures follow them faithfully. They do not deviate in systematic or random ways. They do not decide to do something else because it seems better to them. Instead, they stick to whatever the procedure says and earnestly carry out the work.
Organizations may check for fidelity by conducting a fidelity review. These reviews can take many forms, so I’m going to describe the one I’m most familiar with and use that example throughout this post. The type of review comes from Social Work, and it’s called a case review. In a case review, an expert reviewer goes through a selection of cases and attempts to determine whether the caseworker was faithful to the procedures she was supposed to follow. The reviewer documents the overall rate of fidelity to a set of procedures and then reports the results in aggregate.
Let’s be clear from the start: This is really hard to do. It’s hard to figure out what fidelity means. It’s hard to figure out what to look at. It’s hard to figure out what rate of fidelity (or level of fidelity) is good and what is bad. Professionals in this area have a hard time doing this work well, so it’s no surprise that anyone trying to do it right finds it challenging.
But that doesn’t mean there is any excuse for doing it poorly and then drawing conclusions from bad data. A fidelity review done incorrectly IS NOT TRUSTWORTHY! It tells us nothing! It actually makes us know LESS than we knew before, because now we have inaccurate, unreliable data flopping around in our heads, and we have to consciously tell ourselves, “Ignore those results! Forget them! They mean nothing!”
Let’s consider the steps that need to be followed so that a fidelity review can go right.
1. What are we trying to measure? That is, what does fidelity mean?
Let’s consider a simple example. A procedure instructs that a caseworker must confirm the safety of every child in a home each time she visits the home. So, what does “confirm” mean? Does it mean the caseworker should ask the adult in the home if every child is safe? Is that confirming safety? Does it mean the caseworker must see every child? What if a child is not at home when the caseworker visits? Does she then have to travel to where that child is (school, a part-time job, visiting family or friends elsewhere) to confirm the child is safe? Does confirming safety require asking questions of each child in private? Does it involve some kind of examination of each child? These are the questions that must be clarified before the fidelity review can be planned and performed.
Furthermore, these questions cannot be answered unilaterally by the person performing the review, by someone in a position of authority, or by an individual worker. Instead, the answers must already be defined and disseminated to the workforce. If they are not, then the reviewer may conclude that workers failed to meet a standard that workers had no idea they were supposed to meet. In that case, there wouldn’t be a lack of fidelity; there would be a lack of implementation.
2. What does performance of the thing we are trying to measure look like in the record?
We are discussing a case review, in which the reviewer looks over the case notes and other documents the worker produced. It is not possible for the worker to capture everything that went on during her home visit, so the reviewer will always be missing information. The worker could have conducted expansive safety checks for each child but not documented any of it. In that case, there would be no evidence that the worker followed the safety check procedure. That’s the kind of error we accept as appropriate, because it is reasonable to treat undocumented actions as actions that did not occur.
But what if the worker writes “all children okay” and makes no other mention of safety checks? Is that evidence that the worker indeed confirmed the safety of all children? The reviewer needs to specify what evidence will be accepted to judge whether the requirement was met. And if the reviewer wishes to judge with greater nuance than that binary standard, then more detailed specifications are needed for each level.
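To make that concrete, here is a minimal sketch of what such a specification might look like if written down as an explicit rubric. Everything here is hypothetical (the levels, the wording, and the requirement itself are invented for illustration); a real rubric must come from the agency’s own documented standards.

```python
# Hypothetical evidence rubric for a "confirm the safety of every child"
# requirement. All levels and wording here are invented for illustration.
SAFETY_CHECK_RUBRIC = {
    "met": (
        "Case note names each child in the home and describes how safety "
        "was confirmed for that child (seen in person, interviewed, etc.)."
    ),
    "partially_met": (
        "Case note documents a safety check for some, but not all, "
        "children in the home."
    ),
    "not_met": (
        "No safety check documented, or only a blanket statement such as "
        "'all children okay' with no supporting detail."
    ),
}

# Each rating level has a written definition a reviewer can cite.
for level, definition in SAFETY_CHECK_RUBRIC.items():
    print(f"{level}: {definition}")
```

The point is not the code itself but that every rating level has a written definition a reviewer can point to, which pays off again in question 5 below.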
3. Where will the evidence be located?
In keeping with the “if it isn’t documented, then it didn’t happen” standard, the reviewer must also specify where the evidence will be located. Does it have to be in one specific place? And if it isn’t in that place, but is somewhere else, will the reviewer find that the requirement was not met? Just like question 1, this particular standard must be disseminated throughout the entire organization and understood by everyone. Otherwise, a worker may be performing a thorough safety check but putting the information on the wrong form; she believes (and likely her supervisor does too) that she is doing everything correctly, but the reviewer will reach a different conclusion.
4. What sample of cases will be reviewed? And related, what kind of conclusions do we want to reach with the review?
If 10 “best cases” are selected and reviewed, then what kinds of conclusions can we draw about fidelity? We can conclude something about fidelity in the cases selected as “best” by the people who selected them. Is that a useful conclusion? Can it tell us anything about cases more broadly? No, and no.
Sampling strategy is complicated and cannot be summarized in a paragraph, blog post, or even a series of posts. But the basic ideas are that larger samples are better than smaller samples (and larger means dozens or scores of cases), that comparisons between case types require large samples of every case type considered, and that random selection of cases is better than any other selection method.
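To illustrate that last point, here is a minimal sketch of drawing a simple random sample of cases rather than hand-picking them. The case IDs, sample size, and seed are all hypothetical; in practice the case list would come from the agency’s case management system.

```python
import random

# Hypothetical list of all case IDs eligible for review.
all_case_ids = [f"case-{n:05d}" for n in range(1, 1201)]

# Fix the seed so the selection can be audited and reproduced later.
rng = random.Random(20240101)

# Simple random sample: every case has an equal chance of selection,
# which is what lets the results generalize beyond the sampled cases.
sample_size = 60  # "dozens or scores," per the guidance above
sampled = rng.sample(all_case_ids, k=sample_size)

print(sampled[:5])
```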
5. What evidence do we have to suggest that Reviewer A’s judgment on fidelity is the same as another reviewer’s?
This question is all about “interrater reliability,” a measure of the likelihood that, given the same materials, Reviewer A will reach the same conclusion as Reviewer B. In my experience, questions 1 through 3 are usually answered by those conducting fidelity reviews, but this question is more often ignored or answered with inadequate evidence.
Just like sampling in Question 4, there are entire books written on establishing and measuring interrater reliability. Go read those to fully understand this question. The basic procedure is that reviewers should be trained together and then given a set of cases to review and rate independently. If interrater reliability is established, then two or more reviewers, given the same materials and working independently, should reach the same conclusion. If they do not, it is because A) they aren’t using the same standards, B) the standards aren’t realistic (for example, too complicated) and can never be used to produce the same results, or C) some combination of A and B. (Reason C is the most likely explanation.)
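As a concrete (and deliberately tiny) illustration, here is a sketch that computes raw percent agreement and Cohen’s kappa, a standard chance-corrected agreement statistic, for two reviewers who rated the same ten cases independently. The ratings are invented.

```python
from collections import Counter

# Invented ratings: each list is one reviewer's independent judgment
# ("met" / "not_met") on the same ten cases, in the same order.
reviewer_a = ["met", "met", "not_met", "met", "not_met",
              "met", "met", "not_met", "met", "met"]
reviewer_b = ["met", "not_met", "not_met", "met", "not_met",
              "met", "met", "met", "met", "met"]

n = len(reviewer_a)

# Raw percent agreement: how often the two reviewers gave the same rating.
observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n

# Expected chance agreement, from each reviewer's marginal rating rates.
counts_a = Counter(reviewer_a)
counts_b = Counter(reviewer_b)
expected = sum(
    (counts_a[label] / n) * (counts_b[label] / n)
    for label in set(reviewer_a) | set(reviewer_b)
)

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"percent agreement: {observed:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

Percent agreement alone can look deceptively high when one rating dominates, which is why kappa (or a similar chance-corrected statistic) is the number worth reporting.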
If reviewers disagree, then the standards should be reviewed, rewritten as necessary, the reviewers re-trained, and the exercise repeated. If the reviewers continue to disagree, there are three options: A) once again revise the standards, retrain, and retest; B) create a method by which the reviewers’ differing judgments can be combined (say, through averaging, or through mutual discussion to reach consensus); or C) scrap the entire effort and judge the particular procedure used as too subjective to be evaluated for fidelity.
Under no circumstance should a fidelity review be conducted without diligent, earnest efforts to establish interrater reliability beforehand. For example, perhaps an agency has one person passionate about fidelity reviews. No one else has time to help or interest in the topic, so this reviewer proceeds on his own and writes up the results of his fidelity review. What good are these results? THEY ARE USELESS AND SHOULD IMMEDIATELY BE DISCARDED. Furthermore, the employee should be disciplined for wasting time. At best, the employee has produced a subjective report on his own views of what should be happening. At worst, he has produced a biased report that makes people who read it stupider because they use his subjective, unverified judgment to answer questions about fidelity, when in fact his report provides no answers at all.
The best thing about establishing interrater reliability is that we do not need multiple reviewers for each case. Instead, because we know the judgment of Reviewer A is similar to the judgment of Reviewer B, we can have each case looked at by only one reviewer. This means we can review more cases with the same resources.
6. How will conclusions be written?
Congratulations to you if you managed to complete a properly sized case review after establishing interrater reliability! Now, what kinds of conclusions can we reach? The best advice is to write limited conclusions with plenty of caveats and limitations. For example, let’s say our review finds that 60% of cases have safety checks documented, and 40% do not. It is not reasonable to conclude that 40% of children served by our agency are unsafe. Instead, we can write that 40% of children served did not have safety checks documented in the specific places reviewers looked. Anything other than this conclusion is an abuse of the fidelity review.
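If it helps to see that discipline in miniature, here is a sketch (with invented tallies that match the 60/40 example) of a conclusion that stays inside what the evidence supports:

```python
# Invented tallies matching the 60/40 example above.
cases_reviewed = 50
checks_documented = 30

rate = checks_documented / cases_reviewed

# Report only what the evidence supports: documentation in the places
# reviewers looked, not the underlying safety of children.
print(
    f"In {cases_reviewed - checks_documented} of {cases_reviewed} sampled "
    f"cases ({1 - rate:.0%}), safety checks were not documented in the "
    f"locations reviewers examined."
)
```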
Fidelity reviews are hard to do. They take practice, continued reading on best practices, and a willingness to do a lot of work for limited return. But when fidelity reviews do not follow the above recommendations, they are a waste of time that makes people know LESS than they did before the fidelity review was conducted. So proceed with caution. A quality fidelity review is a thing of beauty, but you should have no faith in unreliable fidelity reviews.