Christopher Keyes

Talking Baseball: Heuristics for Unlikely No Hitters

(Originally published on March 20, 2022)

This is part of a series of posts on sabermetrics and the mathematics of baseball. You can find more here.

On May 19, 2021, Corey Kluber of the New York Yankees no-hit the Texas Rangers. This came just a day after Spencer Turnbull and the Tigers no-hit the Seattle Mariners. No hitters are pretty rare, so seeing two in back-to-back days seems unlikely enough, but this was already the sixth no hitter of a young 2021 season. What's more, three separate teams were on the losing end of two no hitters each: Texas, Seattle, and Cleveland. This was an MLB first, and the 2021 season ended with a record 9 no hitters in total.

That got me thinking: how rare should we expect this event to be? More broadly, how likely (or unlikely) is it to see any number of no hitters in a given season, and how many different ways might the winning and losing teams be configured?

I put together a quick heuristic analysis that suggests that the likelihood of this --- exactly six no hitters over the first 638 games of a season, with three teams getting no hit twice each --- is about 0.000000206, or 0.0000206%. To put that infinitesimally small number in perspective, we would expect this to occur about once in every five million stretches of 638 games.

The assumptions and calculations that led to this number are intended to be rather back-of-the-napkin, but nevertheless lead to some interesting mathematics and follow up questions!

How many no hitters are expected in a season?

To estimate the likelihood of seeing any number of no hitters in a season, we will make the basic assumption that each MLB game has some fixed probability of being a no hitter. Let's denote this probability by \(p_{NH}\). To estimate its value, let's use the number of no hitters per game from 1998 to 2019. I counted 59 no hitters during this span, a stretch of approximately 53460 games, giving \[p_{NH} = \frac{59}{53460} \approx 0.0011036 = 0.11036 \%.\] Why this range? I chose it primarily because 1998 is when MLB expanded to its current slate of 30 teams. There were also no strikes or pandemic-shortened seasons during this period, which is convenient. You might prefer to look back further (e.g. to 1969 when the mound was lowered) and get a different value of \(p_{NH}\), but this is good enough for me today.

Let's start by computing the probability that there are zero no hitters in the season: all 2430 MLB games in a season are played without seeing a single no hitter. Using our assumptions, each game has a \((1-p_{NH})\) chance of not being a no hitter, so we find \[Prob(0 \text{ no hitters}) = (1-p_{NH})^{2430} = 0.0683.\] That's a 6.83% chance, so we might expect a season without a no hitter to occur once every 15 years or so. (Incidentally, in the 24 seasons since 1998, this has happened exactly twice.)

What about the probability of exactly one no hitter in a season? There is a \(p_{NH}\) chance of one game being a no hitter, and a \((1 - p_{NH})^{2429}\) chance of the remaining games not being no hitters. This can happen in 2430 ways, since our one no hitter could occur in any game of the season, giving us \[Prob(\text{exactly 1 no hitter}) = 2430 p_{NH} (1-p_{NH})^{2429} = 0.183.\]

More generally, we can calculate the probability of seeing exactly \(N\) no hitters in a season: \[Prob(\text{exactly }N\text{ no hitters}) = \binom{2430}{N} (p_{NH})^N (1-p_{NH})^{2430-N}.\] Without the binomial coefficient factor, this is the probability of having \(N\) no hitters in a row followed by none the rest of the season. The binomial coefficient accounts for how many different ways these \(N\) no hitters could be distributed throughout the season's games. Experts will recognize the binomial distribution here! Another way to phrase our assumption is that no hitters in a season are binomially distributed with probability \(p_{NH}\) and 2430 independent trials.
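As a quick check on the formula above, here is a minimal Python sketch of the binomial probability. The names `P_NH`, `GAMES_PER_SEASON`, and `prob_exactly` are my own labels for illustration; the inputs (59 no hitters, 53460 games, 2430 games per season) are the ones quoted in the text.

```python
from math import comb

# Values taken from the text: 59 no hitters in roughly 53460 games (1998 to 2019).
P_NH = 59 / 53460
GAMES_PER_SEASON = 2430


def prob_exactly(n, games=GAMES_PER_SEASON, p=P_NH):
    """Binomial probability of exactly n no hitters in `games` independent games."""
    return comb(games, n) * p**n * (1 - p) ** (games - n)


# Zero and one no hitter in a full season (roughly 0.0683 and 0.183).
print(round(prob_exactly(0), 4))
print(round(prob_exactly(1), 3))
```

Passing `games=638` to the same function gives the probabilities for the first 638 games of a season.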

Below is a table showing these probabilities, as well as the probabilities for at most \(N\) no hitters in a season. We also repeat these calculations with 2430 replaced by 638, to obtain the probability of seeing 6 no hitters in the first 638 games, as occurred in the 2021 season.

| Number of no hitters, N | Full season (G = 2430): Prob(exactly N) | Full season (G = 2430): Prob(at most N) | First 638 games: Prob(exactly N) | First 638 games: Prob(at most N) |
| --- | --- | --- | --- | --- |
| 0 | 0.0683 | 0.0683 | 0.494 | 0.494 |
| 1 | 0.183 | 0.252 | 0.348 | 0.843 |
| 2 | 0.246 | 0.498 | 0.123 | 0.965 |
| 3 | 0.220 | 0.718 | 0.0287 | 0.994 |
| 4 | 0.148 | 0.866 | 0.00504 | 0.9992 |
| 5 | 0.0791 | 0.945 | 0.000706 | 0.99991 |
| 6 | 0.0353 | 0.980 | 0.0000823 | 0.999991 |
| 7 | 0.0135 | 0.994 | 8.21e-6 | 0.9999992 |
| 8 | 0.00452 | 0.998 | 7.15e-7 | 0.99999994 |
| 9 | 0.00134 | 0.9995 | 5.53e-8 | 0.999999996 |
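For anyone who wants to regenerate these columns, the sketch below accumulates the binomial probabilities for both game counts. The helper names `binom_pmf` and `table_rows` are hypothetical, chosen just for this example, and the output formatting is only one reasonable choice.

```python
from math import comb

P_NH = 59 / 53460  # per-game rate estimated in the text


def binom_pmf(n, games, p=P_NH):
    """Probability of exactly n no hitters in `games` games."""
    return comb(games, n) * p**n * (1 - p) ** (games - n)


def table_rows(games, max_n=9):
    """Return (N, Prob(exactly N), Prob(at most N)) for N = 0..max_n."""
    rows, running = [], 0.0
    for n in range(max_n + 1):
        exact = binom_pmf(n, games)
        running += exact  # cumulative sum gives Prob(at most N)
        rows.append((n, exact, running))
    return rows


for games in (2430, 638):
    print(f"G = {games}")
    for n, exact, at_most in table_rows(games):
        print(f"  N={n}: exactly {exact:.3g}, at most {at_most:.9f}")
```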

This model suggests that we should expect half of seasons to have 2 or fewer no hitters, while half of seasons have 3 or more. So perhaps it shouldn't be surprising that in the 24 seasons since 1998, there have been exactly 12 seasons each of 2 or fewer no hitters and 3 or more no hitters! (Though note that I am including the shortened 2020 season, which I probably shouldn't.) It also predicts between 5 and 6 of those 24 seasons are expected to have exactly 3 no hitters, and indeed there were 6 such seasons.

On the other hand, there have been two seasons with 7 no hitters (2012 and 2015), which seems very unlikely, as well as our unicorn 2021 season, with its 9 total no hitters, 6 of which came in the first 638 games. The probabilities of these last two occurrences are quite low: about 0.00134 and 0.0000823 respectively, according to the table above.

Distributing winners or losers

Not only was the number of no hitters in the 2021 season unexpected, but so was the distribution of their losers. Never before had three teams been no hit twice each in a season. To model this, let's assume that each MLB team is equally likely to be on the losing end of each no hitter. To be clear, we're sweeping a few things under the rug, but I'll address them later. Here, the key assumption is that the likelihood of each configuration of losing teams is proportional to the number of ways it could occur.

For example, three teams losing two no hitters each will be the configuration denoted (2,2,2). This can happen in (30 choose 3) = 4060 different ways, since this is the number of three team combinations out of the 30 MLB teams. Other configurations are calculated similarly. For instance, the configuration (2,1,1,1,1) represents 1 team losing two no hitters and four other teams losing one each; this can happen in 30(29 choose 4) = 712530 ways, since there are 30 choices for the team that loses two no hitters, then (29 choose 4) ways to choose the four teams to lose one no hitter from the 29 remaining teams. All of the possible configurations for 6 no hitters, the number of ways they can occur, and their respective probabilities are shown below.
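The counting rule described above can be sketched in a few lines of Python. The function name `ways` is hypothetical; it groups teams with the same number of losses and picks each group from the teams still unassigned.

```python
from collections import Counter
from math import comb


def ways(config, teams=30):
    """Number of ways to assign distinct teams to a configuration such as (2, 1, 1, 1, 1)."""
    remaining = teams
    total = 1
    # Teams sharing the same loss count are interchangeable, so choose them as a group.
    for _, group_size in sorted(Counter(config).items(), reverse=True):
        total *= comb(remaining, group_size)
        remaining -= group_size
    return total


print(ways((2, 2, 2)))        # 4060
print(ways((2, 1, 1, 1, 1)))  # 712530
```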

| Configuration | # of ways | Probability |
| --- | --- | --- |
| (1,1,1,1,1,1) | 593775 | 0.366 |
| (2,1,1,1,1) | 712530 | 0.439 |
| (2,2,1,1) | 164430 | 0.101 |
| (2,2,2) | 4060 | 0.00250 |
| (3,2,1) | 24360 | 0.0150 |
| (3,1,1,1) | 109620 | 0.0675 |
| (3,3) | 435 | 0.000268 |
| (4,2) | 870 | 0.000536 |
| (4,1,1) | 12180 | 0.00750 |
| (5,1) | 870 | 0.000536 |
| (6) | 30 | 0.0000185 |
| Total | 1623160 | 1 |

According to this heuristic, there was only a 0.25% chance that three teams lost two no hitters each! Perhaps even more surprisingly, in a season (or stretch of games) in which six no hitters occur, the most likely configuration of losers is (2,1,1,1,1) at 43.9%. The chance of having six different losers, i.e. configuration (1,1,1,1,1,1), is only 36.58%, meaning one might expect at least one repeat loser in six no hitters.
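To reproduce these numbers for any \(N\), one option is to enumerate every configuration and weight it by its number of ways, as in the sketch below. The helpers `partitions`, `ways`, and `configuration_table` are names I made up for this example, and the code assumes there are at least as many teams as no hitters.

```python
from collections import Counter
from math import comb


def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples, e.g. (2, 2, 1, 1)."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def ways(config, teams=30):
    """Number of ways to assign distinct teams to one configuration."""
    remaining, total = teams, 1
    for group_size in Counter(config).values():
        total *= comb(remaining, group_size)
        remaining -= group_size
    return total


def configuration_table(n, teams=30):
    """Map each configuration of n no hitters to (ways, probability)."""
    counts = {config: ways(config, teams) for config in partitions(n)}
    total = sum(counts.values())
    return {config: (w, w / total) for config, w in counts.items()}


# Print the N = 6 table, most likely configuration first.
table = configuration_table(6)
for config, (count, prob) in sorted(table.items(), key=lambda kv: -kv[1][0]):
    print(config, count, f"{prob:.3g}")
```

Calling `configuration_table(3)` or `configuration_table(9)` gives the corresponding probabilities for other season totals.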

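We can repeat these calculations with other numbers of total no hitters. For instance, if there were \(N = 9\) no hitters as in 2021, the same counting (out of \(\binom{38}{9} = 163011640\) total ways) gives the following for the two configurations that actually occurred:

| Configuration | # of ways | Probability |
| --- | --- | --- |
| (1,1,1,1,1,1,1,1,1) | 14307150 | 0.0878 |
| (3,2,2,1,1) | 4275180 | 0.0262 |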

Nine different teams won the 9 no hitters in 2021, so (1,1,1,1,1,1,1,1,1) is the configuration of the winning teams, while (3,2,2,1,1) is the configuration of losing teams. Both of these are rather unlikely; the most likely configurations, (2,1,1,1,1,1,1,1) and (2,2,1,1,1,1,1), would be expected to occur 28.7% and 26.2% of the time, respectively.

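For a choice of \(N\) with more available data, consider \(N=3\). Here we expect the configurations

| Configuration | # of ways | Probability |
| --- | --- | --- |
| (1,1,1) | 4060 | 0.819 |
| (2,1) | 870 | 0.175 |
| (3) | 30 | 0.00605 |
| Total | 4960 | 1 |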

In the 24 seasons since 1998, there have been 6 with exactly three no hitters (1999, 2001, 2007, 2011, 2013, and 2018). Our model suggests that we should expect (1,1,1) to occur much more often than (2,1), and we're unlikely to see (3) appear at all. Indeed, if we look at the configurations of losing teams in those years, we find that (1,1,1) occurred 5 times and (2,1) once! (The Padres were no hit twice in 2001.) In all six of these years, there were three distinct winning teams. This is a small sample size, but it's nevertheless encouraging to see once again that our heuristics seem to reflect reality.

Aside: stars, bars, and partitions

Experts might recognize more general phenomena appearing once again here! Given \(N\) no hitters, the problem of finding the total number of ways to pick losing (or symmetrically, winning) teams is the same as selecting \(N\) items from \(M=30\) distinct bins. Note that we allow selection of multiple (and possibly all \(N\)) items from each bin; this represents one team being no hit more than once. (This problem also goes by other names, such as choosing \(N\) donuts from \(M\) flavors, etc.)

It turns out the answer is \[\binom{N+M-1}{N} = \frac{(N+M-1)!}{(M-1)!N!},\] where here \(M\) denotes the number of bins. The proof I know is a construction called stars and bars. Our \(N\) items are represented by stars, \[*** \cdots ***\] and choosing the \(M\) bins they came from is represented by placing \(M-1\) bars between them, e.g. something like \[||**|* \cdots |**|*.\] Thus we have lined up \(N+M-1\) total objects, and we want to know how many different ways we can choose \(N\) of them to be stars, which is precisely \(N+M-1\) choose \(N\).
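The stars and bars formula is easy to sanity check by brute force for small cases. The sketch below is an illustration only, with a hypothetical helper name `count_multisets`. It counts multisets directly with `itertools.combinations_with_replacement` and compares against the closed form.

```python
from itertools import combinations_with_replacement
from math import comb


def count_multisets(n_items, n_bins):
    """Count the ways to pick n_items from n_bins bins, repeats allowed, by enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(n_bins), n_items))


# Compare the brute-force count with the stars and bars formula.
for n_items, n_bins in [(3, 4), (6, 5), (3, 30)]:
    formula = comb(n_items + n_bins - 1, n_items)
    print(n_items, n_bins, count_multisets(n_items, n_bins), formula)
```

The last case, 3 items from 30 bins, corresponds to a season with three no hitters.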

We also see partitions popping up! A partition of \(N\) is a collection of positive integers which sum to \(N\). Thus we see that our "configurations" of losing teams are actually partitions of \(N\). There are 11 partitions of 6, hence the 11 entries in the table above. There are 30 partitions of 9, explaining why I only listed the two that actually occurred in 2021 above.
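The partition counts quoted here can be checked with a short dynamic program. This is a standard counting recurrence rather than anything specific to this post, and `count_partitions` is just a name chosen for the example.

```python
def count_partitions(n):
    """Count the partitions of n by adding one allowed part size at a time."""
    counts = [1] + [0] * n  # counts[t] = partitions of t using the parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            counts[total] += counts[total - part]
    return counts[n]


print(count_partitions(6))  # 11
print(count_partitions(9))  # 30
```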

A one in 5 million season?

So where did that 0.0000206% come from? If we take the probability of 6 no hitters occurring in the first 638 games to be 0.0000823, and the probability that three separate teams were on the losing end of two no hitters each to be 0.00250, then multiplying them together gives the probability of both occurring: 0.000000206. Thus we might expect this to happen about once in every \(\frac{1}{0.000000206} \approx 4860000 \) stretches of 638 games, or almost 5 million.
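Here is the same multiplication as a small Python sketch, using unrounded inputs rather than the three-digit table values. The variable names are mine; with unrounded inputs the result comes out near one in 4.86 million.

```python
from math import comb

P_NH = 59 / 53460  # per-game rate estimated in the text

# Probability of exactly 6 no hitters in the first 638 games (binomial model).
prob_six_in_638 = comb(638, 6) * P_NH**6 * (1 - P_NH) ** (638 - 6)

# Probability of the (2,2,2) configuration given 6 no hitters:
# 4060 ways out of 1623160 equally weighted possibilities.
prob_222 = comb(30, 3) / comb(6 + 30 - 1, 6)

joint = prob_six_in_638 * prob_222
one_in = 1 / joint

print(f"joint probability: {joint:.3g}")
print(f"about one in {one_in:,.0f} stretches of 638 games")
```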

Looking at the full 2021 season, the probability of exactly 9 no hitters occurring with the losing teams configuration (3,2,2,1,1) is \[(0.00134)(0.0262) = 0.0000351,\] or 0.00351%. This we might expect to happen once in every 28000 years or so, if we are to believe this analysis.

Are all outcomes this rare?

My first question --- and the first one a friend asked me when I mentioned this calculation --- was: is every outcome expected to be rare? Are we asking for too much by requesting a fixed number of no hitters to occur and for those no hitters to have some given configuration of losing (or winning) teams? If every possibility is unlikely, but one has to happen, then we shouldn't be surprised when seemingly rare events occur!

I can say with some confidence that this is not the case, at least for small values of \(N\). For example, consider a season with \(N=3\) no hitters, with three distinct losing teams, i.e. a configuration of (1,1,1). Using our heuristic calculations above, we expect this to occur with probability \[(0.220)(0.819) = 0.180,\] which is a bit more than 1/6. In the 24 seasons since 1998, this has actually occurred 5 times, so once more than our expectation!

Other scenarios with small \(N\) are relatively likely and have occurred in the last 24 years: \(N=2\) with losing configuration (1,1) is expected to occur 23% of the time, and has actually happened 4 times, while \(N=1\) is expected to occur 18% of the time and has happened 6 times.

As \(N\) grows, however, the probability of having \(N\) no hitters decreases rapidly while the number of possible configurations (partitions of \(N\)) increases, lowering the probability of even the more likely configurations occurring. Thus at the larger values of \(N\) like 6 or 9, it is somewhat true that each possibility becomes unlikely.
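One way to see this effect is to look, for each \(N\), at the single most likely configuration and its joint probability with the season total. The sketch below does that under the same two heuristics as the rest of the post; the helper names (`partitions`, `ways`, `most_likely_outcome`) are hypothetical.

```python
from collections import Counter
from math import comb

P_NH = 59 / 53460
GAMES = 2430
TEAMS = 30


def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def ways(config, teams=TEAMS):
    """Number of ways to assign distinct teams to one configuration."""
    remaining, total = teams, 1
    for group_size in Counter(config).values():
        total *= comb(remaining, group_size)
        remaining -= group_size
    return total


def most_likely_outcome(n):
    """Return (config, joint probability) for the likeliest configuration of n no hitters."""
    prob_n = comb(GAMES, n) * P_NH**n * (1 - P_NH) ** (GAMES - n)
    total = comb(n + TEAMS - 1, n)
    config = max(partitions(n), key=ways)
    return config, prob_n * ways(config) / total


for n in range(1, 10):
    config, prob = most_likely_outcome(n)
    print(n, config, f"{prob:.3g}")
```

For \(N=3\) this returns (1,1,1) with joint probability of about 0.180, matching the calculation above.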

Limitations and future questions

This whole approach hinges on our heuristic assumptions, namely that each MLB game has a fixed probability of being a no hitter and that for any given no hitter, each team is equally likely to be on the winning or losing end. Both of these are flawed!

Let's first consider the effect of the teams playing. Looking at the 2021 MLB season summary, we see that the Seattle Mariners had the fewest hits in the league, while the LA Dodgers pitching allowed the fewest hits. This leads me to believe that a Dodgers-Mariners matchup might be more likely to end up as a no hitter than any other! I also expect a team like the Astros, who led the league in hits, would be harder to no hit, or a team like the Orioles, who gave up the most hits, to be less likely to throw a no hitter. Still, our assumptions made these calculations fairly easy and gave results that seem at least somewhat reasonable. Trying to tease out each team's likelihood of being on either end of a no hitter separately sounds like a massive headache.

Even if we could analyze each team separately, I might expect player level effects as well, namely from the starting pitcher. The best pitchers are usually good at preventing hits, which should translate to more no hitters, right? Perhaps equally interesting are catchers, since they appear in many more games than pitchers. Are there catcher metrics that correlate well with the number of no hitters caught over their career?

Even if we believe our assumptions are valid, we're entirely ignoring how the value of \(p_{NH}\) might be changing from year to year! I've written in the past about the effect of different run environments on the game, and something similar could be going on here too. For instance, hitters have shifted their focus from making contact to hitting home runs, resulting in a drop in league batting average from .266 in 1998 to .244 in 2021. It might be fun to study correlations between no hitter rates and batting average, or other metrics. A strong correlation might suggest that a season with 9 no hitters is more likely than I calculated here!
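To get a rough feel for how sensitive the 9 no hitter probability is to \(p_{NH}\), the sketch below recomputes it for a few scaled-up rates. The multipliers 1.25 and 1.5 are hypothetical values picked only for illustration; they are not estimates of the actual 2021 rate.

```python
from math import comb

BASE_P = 59 / 53460  # 1998 to 2019 estimate from the text
GAMES = 2430


def prob_exactly(n, p, games=GAMES):
    """Binomial probability of exactly n no hitters in a season with per-game rate p."""
    return comb(games, n) * p**n * (1 - p) ** (games - n)


# Hypothetical multipliers, chosen only for illustration; not fitted to any data.
for multiplier in (1.0, 1.25, 1.5):
    p = BASE_P * multiplier
    print(f"p_NH x {multiplier}: Prob(exactly 9) = {prob_exactly(9, p):.3g}")
```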