The coupon collector problem is a classic problem in probability. It is usually described in this way: a person (called the coupon collector) is trying to collect all coupons (promotional gift items) when purchasing a certain type of products (e.g. boxes of breakfast cereal). Suppose that there are different types of coupons. Suppose that when someone buys a unit of the product, he or she does not know what type of coupons that is inside the box. How many units of the product does the coupon collector have to buy in order to collect the entire set of coupons?
Note that the number of units to buy is a random quantity. To see it for yourself, roll a fair die until each of the six faces appears at least once. You can try this with your own die and record the numbers as you go. We rolled our own die and the following shows three such experiments.
3, 4, 6, 2, 6, 1, 3, 2, 1, 4, 6, 2, 6, 1, 6, 1, 2, 3, 5
6, 6, 4, 1, 3, 6, 6, 4, 1, 5, 4, 2
6, 2, 1, 1, 5, 5, 1, 1, 1, 3, 6, 1, 3, 5, 4
The first experiment has 19 trials, the second one has 12 trials and the third one has 15 trials. If we run the random experiment another time, we would get another sequence. Also note that the first 5 of the faces (coupons) show up fairly quickly. It takes a few more trials just to get the last coupon.
Rolling a dice is simply a way to generate random numbers. Thus we can describe the coupon collector problem as a random sampling problem. Draw numbers repeatedly from the sample space with replacement until each of the numbers has been chosen at least once. Let be the number of trials that are required to perform this experiment. As demonstrated above (or as confirmed by your rolling of a die), is a random variable. The smallest value it can take on is . Theoretically, the random variable could be or some other integer greater than . The goal is to describe the probability distribution of the random variable .
Some interesting things to know about this probability distribution are the expected value (the mean or the average number of trials that are needed to complete the experiment). Another one is the probability function (or probability mass function) so that we can determine how likely it is to complete the experiment in trials. For a more detailed discussion, go to this previous post. In the remainder of the post, we discuss the expected value.
The expected value has a clear formula and is easily computed.
For the 6-coupon case (or the rolling of a die), it would take on average 15 trials to get all 6 faces.
Of course, 15 is simply the average. The actual results may be more or less than 15 as the above illustrations show (or your own rolling of a die). To find out how likely it is to have a result around 15, it is necessary to have the probability function of . See the previous post for more information on that.
Consider this example. Suppose that you want to throw candies at random to a group of 10 kids. Each candy is assigned to each child at random. You want to throw candies until each of the children has at least one candy. Of course some kids would receive a few candies. But the goal is that no one would come away empty handed. On average, you would need to throw about 30 candies.
If there are 50 coupons, it would take on average 225 trials to get all 50 coupons. One interesting thing is that it is relatively easy to get the first few coupons. To get the last coupon, it would take a quite a few trials. In fact, it would take on average trials to get the last coupon (if is the number of coupon types).
Another point we would like to make is that the coupon collector problem requires an understanding of the geometric distribution. A random experiment that can modeled by a geometric distribution is the performance of a series of independent trials until some criterion is satisfied. The random variable described here has some similarity with a geometric model (but not totally similar). The similarity: we keep drawing numbers with replacement until all numbers are drawn.
Let’s look at the third experiment 6, 2, 1, 1, 5, 5, 1, 1, 1, 3, 6, 1, 3, 5, 4 listed above. We draw the first number (happens to be 6). Next we wish to draw another number different from 6. Luckily it takes only one trail to get 2. Now we wish to draw a number different from 6 and 2. It also takes one trial (to get 1). Next, continue to draw until we get a number different from 6, 2 and 1. This time it takes two trials to get a 5. Then continue to draw until we get a number different from 6, 2, 1, and 5. It takes 5 trials to get a 3. Lastly it takes 5 more trials to get the last number 4. What this shows is that the random variable is a sum of 6 geometric random variables.
Each of the is a geometric random variable. The first one is kind of special. The value of is always 1 (it takes one trial to get the first number). The second variable is a geometric random variable (the number of trials to get a number different from the first chosen number. Then is the number of trials to get a number different from the first two chosen numbers and so on. If there are coupons, then is the sum of geometric random variables.
The formula for the expected value is derived through this connection with the geometric distribution. See the previous post for a more in-depth discussion.