Reviewing WFTDA Official Reviews

An analysis of the (lack of) success rate of ORs during the 2014 WFTDA Division 1 playoffs.

You’re watching the action during the WFTDA Division 1 playoffs, and an official timeout is taken. Coaches and captains come to the center of the track, and one of the teams files a grievance with the head referee. The officials huddle and mull over the play in question. A decision is reached and announced to the audience, then the game gets back under way, after a lengthy delay.

This is the WFTDA official review, the process that allows teams to request that referees double-check the accuracy of any call or non-call during a jam, or for any other reason. This system ensures that an incorrect officiating decision does not unjustly affect the outcome of a game by giving teams the opportunity to right the wrong.

However, many regular viewers of WFTDA playoff roller derby may question stopping the game for official reviews that can often last several minutes, when the teams requesting them are so frequently unsuccessful in getting the call that they were looking for.

The apparent lack of overturned calls got me curious about just how many of them there were during the playoffs. Many people think that successful reviews are rare. Just how rare are they, actually? Are they rare enough for their apparent abuse to be a problem for the WFTDA game?

To find this out, I went back through the YouTube archives of all 80 WFTDA Division 1 playoff and championship games and did a review of all official reviews called for by teams. The WFTDA’s swanky new scoreboard widget, which made it very clear when an official review was taken[1], together with the play-by-play timeline of the @WFTDALive Twitter feed (which did a good job reporting reviews), allowed me to get a pretty accurate count of them, and how many of them were upheld or overturned.

The scoreboard also kept time on the length of any official review. Not letting that information go to waste, I recorded the additional delay time required to complete each review. The crack production and announcing team at made sure to let the audience know why an official review was taking place, so I was also able to collect data on the specific reasons why reviews were called for.

The fidelity of these reports also made it possible to track other breakdowns of review tendencies, such as whether a review sought to overturn a team’s own penalty or to force one on an opponent, and whether or not there were other circumstances that may have rendered an official review moot.

Let’s dive straight into the data. In the Division 1 playoffs, I counted 148 official reviews requested by teams, out of the 320 total potential reviews[2] teams could have used throughout the course of the 80-game season. That’s a use rate of 46.3%, even before we get to the success rate of those reviews. And at almost two reviews per game, it seems as if teams feel a lot of missed calls are happening on the track.
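The usage figures above are simple ratios. Here is a minimal sketch of the arithmetic, using only the counts quoted in this article; the variable names are my own, and the "one review per team per half" reading of the 320 figure is an assumption based on the review allotment described later on.

```python
# Counts quoted in the article; variable names are illustrative.
GAMES = 80
REVIEWS_REQUESTED = 148
POTENTIAL_REVIEWS = 320  # assumed: 80 games x 2 teams x 2 halves

use_rate = REVIEWS_REQUESTED / POTENTIAL_REVIEWS
reviews_per_game = REVIEWS_REQUESTED / GAMES

print(f"Use rate: {use_rate:.2%}")                   # 46.25%, quoted as 46.3%
print(f"Reviews per game: {reviews_per_game:.2f}")   # 1.85, "almost two"
```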

In reality, teams are missing the correct calls being made by referees 9 times out of 10.

Data source: Manually collected from video archives

Out of 147 legitimate reviews[3], only 15 were successful, allowing the team that requested it to retain it for use within the same period. This means that a whopping 132 reviews were not successful, leading to a loss of review.

This is a very low overturn percentage. For the most part, teams are only going to use an official review during normal gameplay conditions if they feel a call was made incorrectly. Yet WFTDA officials only needed to be corrected in 15 jams out of 3,475 total during the playoffs, an error rate of just 0.4%. Despite this, teams were throwing out reviews as if the referees were nine times as inaccurate as they actually were.
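For anyone who wants to check the percentages, here is a small sketch of the calculation. It uses only the counts quoted above; the variable names are illustrative, and the "error rate" here counts only the errors that were corrected through a review.

```python
# Counts quoted in the article; variable names are illustrative.
LEGITIMATE_REVIEWS = 147
SUCCESSFUL_REVIEWS = 15
TOTAL_JAMS = 3475

unsuccessful_reviews = LEGITIMATE_REVIEWS - SUCCESSFUL_REVIEWS
overturn_rate = SUCCESSFUL_REVIEWS / LEGITIMATE_REVIEWS
call_stands_rate = unsuccessful_reviews / LEGITIMATE_REVIEWS
# Share of jams in which a review corrected the officials.
corrected_jam_rate = SUCCESSFUL_REVIEWS / TOTAL_JAMS

print(f"Unsuccessful reviews: {unsuccessful_reviews}")   # 132
print(f"Overturn rate: {overturn_rate:.1%}")             # 10.2%
print(f"Call-stands rate: {call_stands_rate:.1%}")       # 89.8%
print(f"Corrected-jam rate: {corrected_jam_rate:.1%}")   # 0.4%
```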

Why are teams wrong so often? Let’s take a look at what sort of calls teams were reviewing. Here, we can get a better idea of where teams thought calls were being missed or misinterpreted by officials, and which few actually were.

Data source: Manually collected from video archives

By a wide margin, cutting penalties were the most requested review type throughout the D1 playoffs, with 40 out of 142 reviews where a confirmed reason was given. A variety of illegal procedure requests (14) came second[4], followed by back blocks (13) and points scored on jams (11).

These four categories made up the majority of official review topics, so it follows that the majority of successful reviews (10 of 15) landed within this range. Five cutting penalties were overturned, the most of any one penalty, but only at a success rate of 12.5%. Challenging the number of points scored on a jam happened successfully four times, making it the highest-percentage play at 36.4% of reviews retained.
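The category numbers can be checked the same way. This sketch includes only the figures stated in the text; per-category overturn counts for illegal procedures and back blocks are not given here, so they are left out rather than guessed. The dictionary names are my own.

```python
# Requests per category, out of 142 reviews with a confirmed reason.
CONFIRMED_REASON_REVIEWS = 142
requests = {
    "cutting": 40,
    "illegal procedure": 14,
    "back block": 13,
    "points scored": 11,
}
# Only the per-category overturn counts stated in the article.
overturned = {"cutting": 5, "points scored": 4}

top_four_share = sum(requests.values()) / CONFIRMED_REASON_REVIEWS
success_rates = {name: overturned[name] / requests[name] for name in overturned}

print(f"Top four share of reviews: {top_four_share:.1%}")   # 54.9%
for name, rate in success_rates.items():
    print(f"{name}: {rate:.1%} retained")   # cutting 12.5%, points scored 36.4%
```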

Of the top six most requested reviews, 95 in total, in only six instances did the review involve the actions of a blocker. Along with the jammer-centric point challenges, every single cut, back block, and skating-out-of-bounds penalty, and 7 out of 8 forearm (non)calls, had teams requesting that a power jam be awarded or overturned afterward.

It follows that 80% of all reviews requested, 114 in total, were in regard to the actions of a jammer. This is despite blockers receiving 80% of all penalties called during the playoffs. With how crucial jammer penalties are in WFTDA roller derby,[5] it makes sense that teams would burn reviews on that position more than the others. This shows in the data: Only one of the 15 overturned calls sprung a blocker from the penalty box.

The importance of power jams, and the use of ORs to potentially get one, may be getting a bit out of hand in the WFTDA. The whole point of the official review system is to make sure a bad call doesn’t screw over a team. However, with 65.7% of reviews seeking a penalty on the opposing team, it can be argued that official review has been re-purposed as a weapon aimed at trying to screw over the other team.

That the non-call stood 88.2% of the time during these accusations—or roughly the same rate as the 89.8% call-stands rate for all official reviews—shows that at the end of the day, the referees get the call right the vast majority of the time, despite the intentions of the team requesting a review after the fact.

It’s supposed to be this way; WFTDA rules are quite clear on the line that needs to be crossed for a penalty to be issued.

WFTDA Rules 2014
Section 8 – Officials

8.3 Referee Discretion

8.3.2 – If the referee is in doubt on a call (e.g., the referee sees the effects of a hit but does not see the action), a penalty must not be called.

8.3.3 – If the referee is in a position where intent must be inferred but is not clear, legal intent must be presumed.

8.3.4 – If the referee is not sure whether an action warrants a penalty, a penalty will not be assessed.

8.3.5 – If the referee is not sure whether an action warrants an expulsion, an expulsion will not be assessed.

Officials must always give the benefit of the doubt to the skaters, the same skaters that wrote the rules. That so many calls do not need to be reviewed, and that the overturn percentage on those reviewed is as low as it is, directly reflects this. Teams asking for a review, especially a review trying to generate a penalty against an opponent, are effectively asking the officiating crew to ignore the rules that prevented them from calling a penalty in the first place. Because there was no penalty in the first place!

Using official reviews as a tactical tool against an opponent instead of an insurance policy for your own team is something that started to show up last year. Now that a team that successfully uses a review gets to keep it for the remainder of the half, it appears as if this tactic was used even more in 2014.

In this next breakdown of official review data, let’s take a look at the most critical reviews, the ones that are requested during the times where a bad call could really mess up a game. If the score is close (within 25 points) or as the end of a period is approaching (within the last 2 minutes), there is little opportunity for a team to battle back from an officiating mistake. Official reviews, therefore, would be the most important during these scenarios of the game.

Since there were a fair number of close games during the playoffs this year, there were also a good number of official reviews taken while the score was close: 49, to be exact. That was just about one of every three reviews. One of every four reviews, 37 altogether, happened near the end of a period. During this time, the team requesting one has nothing to lose by requesting it; reviews don’t carry over, so you might as well.[6] Right?

Sure. Then again, just because something can be done doesn’t mean it should.

Data source: Manually collected from video archives

When it’s extra-important to get calls and non-calls correct, officials are more accurate when it matters than during general phases of play. Teams wouldn’t be asking for most of these critical ORs unless they thought they saw something that maybe should warrant/overturn a penalty and help see that the outcome of the game happens as it should.[7]

That half of all official reviews taken (71 of 148) came during close periods of games, near the end of games, or near the end of close games is not surprising. Nor is the fact that WFTDA officials, some of the finest out there, are getting so, so many of these calls right.
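The three "critical review" counts fit together by simple inclusion-exclusion. The sketch below assumes that the 71 figure is the union of the close-score group (49) and the end-of-period group (37); that reading, and the variable names, are my assumptions rather than something the charts confirm.

```python
# Counts quoted in the article; variable names are illustrative.
TOTAL_REVIEWS = 148
CLOSE_SCORE = 49       # score within 25 points
NEAR_PERIOD_END = 37   # last 2 minutes of a period
EITHER = 71            # assumed: close, late, or both

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
both = CLOSE_SCORE + NEAR_PERIOD_END - EITHER

print(f"Close-score share: {CLOSE_SCORE / TOTAL_REVIEWS:.1%}")        # 33.1%
print(f"End-of-period share: {NEAR_PERIOD_END / TOTAL_REVIEWS:.1%}")  # 25.0%
print(f"Critical share: {EITHER / TOTAL_REVIEWS:.1%}")                # 48.0%
print(f"Reviews that were both close and late: {both}")               # 15
```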

What is surprising, and disappointing, is that many of these official reviews are being taken when teams know that they have a snowball’s chance in hell of retaining them. Doing this wastes everyone’s time, and for no good reason other than to game the system by saving a team timeout.

Under WFTDA rules, it is tactically sound to use an official review as a strategic maneuver, if a clock stoppage, extended huddle, or rest break is required, but using up an allotted team TO is not desired. However, the extent to which teams are throwing official reviews out there is starting to become ridiculous. For example:

  • Three official reviews were taken near the end of a game when a team had no remaining team timeouts to stop the clock. Two of these essentially became a 4th team TO after their review requests were denied, which was probably their ulterior motive anyway. This defeats the purpose of limiting teams to three TOs in the first place. (The third review overturned an Angel City cutting penalty at the end of the 2nd period against Texas, the sole successful end-of-game review of the playoffs.)
  • There were two instances during the playoffs (SoCal vs. Jet City; Rose City vs. Bay Area) when a team requested an official review to ask for a penalty on the opposing team…only to find out from the head referee that the penalty had already been called during the jam. When requesting an OR, the teams aren’t even bothering to double-check if they missed a call before (inaccurately) suggesting that the officials missed one themselves.[8]
  • Two official reviews, both by Windy City, were blatantly frivolous clock stoppages at the end of the first half of their games against Rocky Mountain and Rose City. They made no effort to hide it, requesting clarification on how many timeouts their opponents had remaining. The dumb thing about these reviews is that neither one was necessary to preserve team timeouts: Windy had two unused TOs at the end of the Rocky game, and by the time they used their last (4th) timeout against Rose, the final result had long since been determined.
  • And the most egregious example of official review abuse: During the Bay Area vs. Charm City game in Salt Lake City, there was a 2-minute non-jam.[9] After I.M. Pain pushed Lulu Lockjaw out of bounds behind the jammer line at the start of the jam, everyone just stood around and did absolutely nothing for a full two minutes. Charm City didn’t like the fact that they didn’t get a cut or unicorn recycle out of this, so they called for an official review to try to start the next jam on the power jam. You can guess how that went.

These were the most extreme examples of official reviews no longer being used for their intended purpose. In watching the games and the circumstances around all of these reviews, I started to get the feeling that teams were asking for referees to call retroactive penalties because they don’t like the fact that one was not called in their favor, more than them wanting to make sure the correct call was made.

Did our jammer go down awkwardly after legal contact? Tell the refs to call the penalty next time! Did that already-penalized hit look extra-rough? Thirty seconds isn’t enough, ask for an expulsion! Did we lose the Hydra on the last jam because our top jammer committed a clear penalty in the penultimate jam? Better call an official review to make sure the referees aren’t in the wrong. Because we certainly couldn’t be![10]

Openly questioning officials for the express purpose of resting tired players or breaking up an opponent’s momentum is not the sporting way. But teams are doing this, whether they want to admit it or not. For the long-term health of roller derby, both from the officials’ standpoint and the viewpoint of fans, OR abuse is something that needs to be addressed.

Here’s where the rubber meets the road on that second point. As mentioned at the top, I recorded the length of additional delay required for each official review.[11] Stopping the game to discuss the review request, huddle with the zebras, and then relay the result of the review takes a definite amount of time.

With the high frequency of official reviews throughout the playoffs, the amount of time spent twiddling thumbs and waiting for derby adds up quickly.

Data source: Manually collected from video archives

If you were to take two-minute jams and replace them with two-minute official reviews, you would have to sit through nearly nine games of zebra huddles to appreciate just how much real time was spent discussing calls—which were more than 99.5% correct during the playoffs.

That’s the equivalent of having to sit through the Saturday and Sunday Division 1 games at WFTDA Championships without any roller derby happening.

Bless our referees, but everyone would rather have people play roller derby and watch roller derby being played—not watch refs talk about roller derby in the middle of the track. There is clearly room for improvement here, on multiple fronts.

First, on the length of reviews. The median additional time needed for them during the playoffs was 2m15s; the average was a bit longer, at 2m24s.[12] Only 18 of 148 official reviews kept the extra delay under 90 seconds; an additional 35 got things over and done with within a 2-minute delay. (Of these short reviews, only one was successfully retained.)
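As a rough cross-check on these delay figures, here is a sketch that multiplies the average extra delay by the number of reviews. It assumes "average" means the arithmetic mean, and it covers only the additional delay quoted in this paragraph, so treat it as a lower bound on total stoppage time rather than a reproduction of the chart. Variable names are illustrative.

```python
# Figures quoted in the article; variable names are illustrative.
TOTAL_REVIEWS = 148
MEAN_EXTRA_DELAY_S = 2 * 60 + 24   # 2m24s, assumed to be the arithmetic mean
UNDER_90_SECONDS = 18
FROM_90_SECONDS_TO_2_MINUTES = 35

total_extra_delay_s = TOTAL_REVIEWS * MEAN_EXTRA_DELAY_S
hours, remainder_s = divmod(total_extra_delay_s, 3600)
minutes = remainder_s // 60

short_reviews = UNDER_90_SECONDS + FROM_90_SECONDS_TO_2_MINUTES
short_share = short_reviews / TOTAL_REVIEWS

print(f"Total extra delay: about {hours}h{minutes:02d}m")   # about 5h55m
print(f"Reviews done within 2 minutes: {short_reviews} ({short_share:.1%})")   # 53 (35.8%)
```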

The occasional review delay isn’t a big deal, and certainly a review of any length can be tolerated if it results in the reversal of a wrong call. Successful reviews take 37.5% longer on average than unsuccessful ones, or about 50 seconds more in real time. That is always time well spent.

But people notice game stoppages; they really notice long game stoppages, even if they are for legitimate administrative reasons.

Since you can’t really make reviews shorter without messing with the necessary detective work to complete one, the next best thing is to reduce their frequency. That doesn’t necessarily mean affording teams fewer official reviews; the current system of one per half plus an extra one for a successful overturn is actually a very fair system for the WFTDA.

Instead, the WFTDA should think about encouraging a system where official reviews are a tool of last resort, a “Break Glass in Case of Emergency” kind of deal. Officials are held to this standard, in that if there is any possible reason that a penalty cannot be called, the rules make them not call it. Rules should make this apply to teams too, but in reverse: If there is any possible reason that an official review should not be used, it should not be used.

One way of deterring teams from reviewing calls for no good reason would be to charge them a team timeout if the review results in the on-track call standing. This would create a system where every single clock stoppage is justified: Either the refs were wrong and an official timeout corrects the wrong call, or the refs were right (as they usually are) and the team requesting the bad review is responsible for the delay.

Taking away a team timeout for an unsuccessful review would work great in the WFTDA.[13] The majority of WFTDA games played would not really be affected by such a change, since many aren’t particularly competitive and teams generally don’t use all their timeouts. Nor would this change affect legitimate reviews, the ones where you’re 95 to 100% sure you’re going to get the call overturned.

But for the majority of reviews, where you’ve only got a 10% chance of getting the call overturned—the going rate for the 148 of them taken during the Division 1 playoffs—it may be better for everyone that the review never be taken to begin with.

The data supports this. Teams may be more likely to use a frivolous review during close games as a free timeout, but we know that the success rate during this time is almost half what it is at any other time.

If the primary motivation for using such a review is to stop the clock, then they would be much better served by just calling a team timeout under a lose/lose scenario; it would be pointless to lose an official review with the timeout, rather than the TO alone. This setup would restore the true purpose of each tool: Stop the clock when wanted with timeouts and correct wrongs when needed with reviews.

If a team is out of timeouts, but still has a review available, they can still be allowed to take it—but at the cost of a delay of game penalty to their captain should it not be successful. A team would generally not be out of timeouts until deep into the second half, near the end of the game. Again, the data shows that end-of-period calls deemed worth reviewing are four times less likely to be overturned, so there would be absolutely no reason for a review to be called for during this time period unless a legitimate overturn is mostly assured.
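To make the proposal concrete, here is a minimal sketch of how the consequences of a review would be decided under it. This is my own illustration of the idea, not WFTDA rules text; `TeamState`, `resolve_review`, and the outcome labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TeamState:
    """Hypothetical per-team bookkeeping for one period."""
    timeouts_remaining: int
    reviews_remaining: int


def resolve_review(team: TeamState, overturned: bool) -> str:
    """Apply the proposed consequences of one official review."""
    if team.reviews_remaining <= 0:
        raise ValueError("no official review available")
    if overturned:
        # Successful review: the team keeps it, as under the current rules.
        return "review retained"
    # The on-track call stood, so the review is lost.
    team.reviews_remaining -= 1
    if team.timeouts_remaining > 0:
        # Proposed: the requesting team pays for the stoppage with a timeout.
        team.timeouts_remaining -= 1
        return "review lost, team timeout charged"
    # Proposed: no timeouts left, so the captain takes a delay of game penalty.
    return "review lost, delay of game penalty to captain"


team = TeamState(timeouts_remaining=0, reviews_remaining=1)
print(resolve_review(team, overturned=False))
```

Every branch ends with the stoppage accounted for: either the officials were wrong, or the requesting team pays for the delay.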

Additionally, reviews should be restricted to on-track activities only. WFTDA officials are expert enough to be able to quickly confirm with a coach the number of timeouts or penalties a team or player has during 30-second jam resets. This non-critical information, especially if requested in the first half, can wait until an unrelated OTO or at halftime to be affirmed if more time is required. There is absolutely no reason to stop the whole damn game for nitpicky, loopholey crap like that.

These sorts of reforms would help to create fewer clock stoppages, improving game flow and making for a better fan experience, without unduly affecting the official reviews that absolutely need to happen.

It is impossible to expect teams to only review calls that they will win and not review calls that they will lose. That is not what this analysis is trying to promote. What should happen is that the official review system eventually get to the point where it is used only on plays where it is truly uncertain whether the correct call or non-call was made, and the accuracy of that call could meaningfully affect the outcome of the game.

To show signs of progress in this area, the overturn rate of official reviews must increase. It’s a counter-intuitive thought, but it’s exactly what will make the derby product better for fans.

This works under the assumption that the error rate of human referees is more or less fixed (and low). The fewer official reviews needed to correct the same number of errors, the better. The ideal success rate would approach 50%, a figure that would indicate that the only calls being reviewed are the important ones that are close enough to really and truly go either way.[14]
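Here is a quick sketch of what that would mean in practice, assuming the number of correctable errors stays fixed at the 15 overturned calls from these playoffs. The function name and the intermediate 25% target are illustrative, not part of the data.

```python
def reviews_needed(errors: int, success_rate: float) -> float:
    """Reviews required to correct `errors` calls at a given success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return errors / success_rate


ERRORS_CORRECTED = 15   # overturned calls in the 2014 D1 playoffs
REVIEWS_TAKEN = 147     # legitimate reviews actually taken

for target in (0.25, 0.50):
    needed = reviews_needed(ERRORS_CORRECTED, target)
    avoided = REVIEWS_TAKEN - needed
    print(f"At {target:.0%} success: {needed:.0f} reviews, {avoided:.0f} fewer stoppages")
```

At a 50% success rate, the same 15 corrections would take 30 reviews instead of 147.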

Much like the WFTDA rule set itself, the official review system is still a work in progress. We’ll see if anything is changed in the upcoming 2015 rules release to help make official reviews less prominent during games, and roller derby action more prominent.

Until then, the next time you see an official review during a roller derby bout, and it is unsuccessful, ask yourself if that review was really necessary. Statistically, and actually, it probably wasn’t.