Posted: Thursday March 24, 2011 11:28AM ; Updated: Friday March 25, 2011 10:53AM

Explaining Simpson's Paradox

Story Highlights

Who do you want up to bat in a key spot? Stats can be deceiving

Which NCAA Tournament Sweet 16 teams rely too much on one player?

By Jon Wertheim and Tobias J. Moskowitz

What is Scorecasting?
In our book Scorecasting we tried to challenge assumptions and conventional wisdom in sports and explore new territory. Using principles of behavioral economics, experimental psychology, and lots of data, we tried to explain everything from the source of home-field advantage to the Cubs' ritual awfulness to why -- even before Thanksgiving night 2009 -- there was evidence that Tiger Woods was mortal.

We invited readers to suggest other topics to explore, other ways to look at sports a little bit differently. The response has been overwhelming. The goal is to use this space to keep the conversation going. If you're interested in contributing -- anything from suggesting analytic quirks to explore to sending in academic research -- send an e-mail.

Old joke: What do statistics and prisoners have in common? Twist them and torture them and eventually you can get them to say anything you want them to.

A slightly less cynical metaphor: Statistics are like pieces of forensic evidence. They can help prove a case. But, like all bits of evidence, they can also be mishandled and tampered with.

In Scorecasting we noted that anytime you hear an announcer gush that "Williams is three for his last four!" you should automatically assume that he's three for his last five (... or six or seven). Likewise, did the Flyers really score "two goals in 40 seconds!!"? Or did they score two goals in 50 minutes that just happened to come 40 seconds apart? But broadcasters -- and teams, fans and scoreboard operators -- have a vested interest in presenting the most dramatic scenario possible, so they offer a selective data set.

Sports statistics are similarly susceptible to Simpson's Paradox. Having nothing to do with O.J. or Homer, Simpson's Paradox tells us that aggregated numbers can obscure, or even reverse, the trends that hold within each subgroup.

A few years ago, the Bureau of Labor Statistics reported that, in the teeth of the Great Recession, the overall unemployment rate still wasn't quite so grim as it was during the economic slump of the 1980s. Yet the unemployment rate among college graduates was higher than during the 1980s recession. Ditto for workers with some college, high-school graduates and high-school dropouts. So, wait: The overall unemployment rate was lower than it had been in the '80s. Yet, for every category, the unemployment rate was higher in 2009? Come again?

It turns out that there was no inconsistency. Why? Because in 2009, there were far more college graduates than there were in the early '80s -- and college graduates had the lowest unemployment rate. And there were fewer high-school dropouts in 2009 than there were in the 1980s -- and dropouts had the highest unemployment rate. The overall rate is a weighted average of the category rates, and the weights had shifted toward the groups with the least unemployment. So it's entirely possible for the aggregated data to yield a different conclusion than the category-by-category data.

                Treatment A    Treatment B
Small Stones    Group 1        Group 2
Large Stones    Group 3        Group 4
Both            78%

Simpson's Paradox also comes into play with medical studies. A famous example taught in many statistics classes involves two prospective treatments for kidney stones. The table above lays out the success rates for Treatment A and Treatment B.

The counterintuitive bit, of course, is that Treatment A is more effective for treating both small and large stones, but less effective overall. Why? Because, weighting the averages, Treatment B was used on far more patients with small stones (which had the higher overall success rate) and on far fewer patients with large stones (which had the lower overall rate).
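For readers who want to see the arithmetic, here is a minimal sketch. The patient counts are an assumption: they do not appear in the text here, and are the figures commonly cited for this textbook example (a 1986 kidney-stone study by Charig and colleagues). The names `outcomes`, `rate` and `overall_rate` are ours, chosen for illustration.

```python
# (successes, patients) for each treatment and stone size.
# ASSUMPTION: these counts are the ones commonly cited for the
# textbook kidney-stone example; they are not given in the article.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate for one group of patients."""
    return successes / patients


def overall_rate(treatment):
    """Pool both stone sizes, then compute a single success rate."""
    groups = outcomes[treatment].values()
    total_successes = sum(s for s, _ in groups)
    total_patients = sum(n for _, n in groups)
    return rate(total_successes, total_patients)


for treatment, groups in outcomes.items():
    by_size = ", ".join(
        f"{size} {rate(*counts):.0%}" for size, counts in groups.items()
    )
    print(f"Treatment {treatment}: {by_size}, both {overall_rate(treatment):.0%}")
```

Under those assumed counts, Treatment A wins within each stone size yet trails once the groups are pooled, because most of A's patients had the harder-to-treat large stones.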

Confounded by Simpson's Paradox, doctors sometimes end up prescribing a treatment that doesn't maximize the patient's chances of recovery.

And here's an example from the Republic of Sports, with a nod to blogger Iowahawk. Imagine two batters, Hitter A and Hitter B.

Hitter A:

-- Against right-handed pitchers: 300 at-bats, 90 hits (.300 average)
-- Against left-handed pitchers: 200 at-bats, 50 hits (.250 average)
Total: 500 at-bats, 140 hits (.280 average)

Hitter B:

-- Against right-handed pitchers: 100 at-bats, 32 hits (.320 average)
-- Against left-handed pitchers: 300 at-bats, 78 hits (.260 average)
Total: 400 at-bats, 110 hits (.275 average)

Hitter B has a higher batting average against both righties and lefties, but Hitter A has a higher overall average by dint of facing a different mix of pitchers. Again, the devil is in the weighted averages. It's the bottom of the ninth, two out, and you need a hit. Who would you insert as a pinch hitter, A or B?
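The two stat lines can be checked with a few lines of code. This is only a sketch using the at-bats and hits listed above; the names `hitters`, `average`, `overall` and `share_vs_rhp` are hypothetical, chosen for illustration.

```python
from fractions import Fraction

# (at-bats, hits) for each split, copied from the stat lines above.
hitters = {
    "A": {"vs_rhp": (300, 90), "vs_lhp": (200, 50)},
    "B": {"vs_rhp": (100, 32), "vs_lhp": (300, 78)},
}


def average(at_bats, hits):
    """Batting average as an exact fraction, to avoid rounding surprises."""
    return Fraction(hits, at_bats)


def overall(splits):
    """Batting average over all at-bats, regardless of pitcher handedness."""
    total_ab = sum(ab for ab, _ in splits.values())
    total_hits = sum(h for _, h in splits.values())
    return average(total_ab, total_hits)


def share_vs_rhp(splits):
    """Fraction of a hitter's at-bats that came against right-handers."""
    total_ab = sum(ab for ab, _ in splits.values())
    return Fraction(splits["vs_rhp"][0], total_ab)


for name, splits in hitters.items():
    print(
        f"Hitter {name}: "
        f"vs RHP {float(average(*splits['vs_rhp'])):.3f}, "
        f"vs LHP {float(average(*splits['vs_lhp'])):.3f}, "
        f"overall {float(overall(splits)):.3f}, "
        f"{float(share_vs_rhp(splits)):.0%} of at-bats vs RHP"
    )
```

The last column is the tell: Hitter A took 60 percent of his at-bats against righties, whom both men hit better, while Hitter B took only 25 percent of his against them.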

A real-world example comes courtesy of Ken Ross, a retired professor at the University of Oregon and a baseball enthusiast. Ross noticed that in both 1995 and 1996, Derek Jeter had a lower batting average than David Justice did. Yet when the two years are combined, Jeter had the higher average.

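How's that? In 1995, Jeter had only 12 hits in 48 at-bats (a .250 average) while Justice had a .253 average in 411 at-bats. The following year, Jeter had a .314 average in 582 at-bats; Justice had only 140 at-bats, but a higher average of .321. Nearly all of Jeter's at-bats came in the season when both men hit well, while most of Justice's came in the season when both hit poorly, so Jeter's two-year average works out to about .310 against roughly .270 for Justice.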
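Here is a rough sketch of the two-year calculation. Only the averages and at-bats come from the text; except for Jeter's 12 hits in 1995, the hit totals are an assumption, recovered by rounding average times at-bats to the nearest whole hit. The names `seasons`, `hits` and `combined_average` are ours.

```python
# (batting average, at-bats) per season, as quoted in the text.
seasons = {
    "Jeter": {1995: (0.250, 48), 1996: (0.314, 582)},
    "Justice": {1995: (0.253, 411), 1996: (0.321, 140)},
}


def hits(avg, at_bats):
    # ASSUMPTION: recover whole hits by rounding average * at-bats.
    return round(avg * at_bats)


def combined_average(player):
    """Two-year average: total hits divided by total at-bats."""
    total_hits = sum(hits(avg, ab) for avg, ab in seasons[player].values())
    total_ab = sum(ab for _, ab in seasons[player].values())
    return total_hits / total_ab


for player in seasons:
    print(f"{player}: combined {combined_average(player):.3f}")
```

Averaging the two season averages directly would get this wrong; the combined figure has to weight each season by its at-bats, which is exactly where the paradox sneaks in.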

• Re: the NCAA tournament... Thanks to reader William Jewitt, who wrote in:

"I discovered a stat this weekend that screens teams from making it to the final game, much less winning it all. It's based on the premise that well balanced teams do better than "not well balanced" teams in the tournament. The stat I used to flag "not well balanced teams" is when a team has a player in the top 100 of "Percentage of Shots Taken" per Ken Pomeroy's site,

When I applied this to the 2005-2010 championship games, not a single team had a player listed for that season (6 games, 12 teams). Unfortunately, Ken doesn't list it for years prior to 2005, so I can't check back any further.

In looking at No. 4 seeds or better for this year's tourney, these teams are "not well balanced."

Duke -- Smith 30.6 percent
UConn -- Walker 32.8 percent
Texas -- Hamilton 32.7 percent
Purdue -- Moore 30.1 percent AND Johnson 30.0 percent
BYU -- Fredette (ranked No. 1) 37.8 percent
Wisconsin -- Leuer 32.2 percent"

• Nate Silver was uncannily accurate in predicting the 2008 U.S. Presidential Election. Here's his forecast for the NCAA tournament.

• In his excellent piece on Princeton's upset of UCLA in the first round of the 1996 NCAA Tournament, Time's Sean Gregory mentioned that Princeton players, accustomed to playing in small Ivy League gyms, had to adjust to the sightlines of shooting in the cavernous RCA Dome. "It was like trying to shoot in a park out in the woods," recalled Brian Earl, then a Princeton sharpshooter. "There was no depth perception. I was nervous about that because we weren't going to be doing a ton of driving to the basket." A factor worth considering next year when we fill out brackets?

Comments, ideas, suggestions are welcome by email.