# Celebrate the year of the rooster

According to the Chinese zodiac, the year 2017 is the year of the rooster. In fact, today (January 28, 2017) is the start lunar New Year, which is the first day of the year of the rooster. What better way to celebrate the year of the rooster than working a related math puzzle, or to perform a related random experiment!

At the office yesterday, conversation at the beginning of a meeting, before discussing the main topics, centered on the Chinese zodiac animal signs (the Chinese zodiac system is a 12-year cycle with a different animal representing each year). One coworker mentioned he is a tiger. Another coworker did not know his sign and Googled to find out that he is a rat! Another coworker is a rooster. It turns out that a pig is also represented. Imagine that you keep picking people at random and ascertain his/her animal sign. How many people do you have to ask in order to have met all 12 animal signs?

The random experiment that has been described is this. Put 12 slips of papers numbered 1 through 12 in a hat. Randomly draw a piece of paper and note the number and then put it back into the hat. Keep drawing until all 12 numbers have been chosen. Let $X$ be the number of selections that are required to perform this random experiment. Of course, you can expand the sample space to include more slips of papers (i.e. with more numbers). But the context will not be picking animal signs.

There are two ways to get a handle on the random variable $X$ as described above. One is through simulation and the other is through math.

Before discussing the simulation or the math, let’s point out that the problem discussed here is a classic problem in probability that goes by the name “the coupon collector problem”. The numbers 1 to 12 in a hat are like coupon (prizes) that are randomly given out when purchasing a product (e.g. a box of cereal). The problem discussed here is that the coupon collector desires to collect the entire set of coupons.
___________________________________________________________________________

Simulation

To get a sense of how long it will take, simulate random numbers from 1 through 12 until all numbers have appeared. The following is 5 iterations of the experiment.

9, 2, 6, 8, 9, 10, 2, 9, 8, 1, 5, 11, 1, 1, 2, 10, 9, 8, 8, 9, 5, 11, 9, 3, 7, 9, 8, 8, 4, 3, 1, 4, 3, 12 (34 draws)

7, 7, 1, 10, 11, 11, 10, 4, 5, 8, 8, 2, 6, 4, 6, 2, 12, 6, 6, 12, 9, 5, 8, 10, 1, 5, 10, 4, 9, 4, 1, 11, 11, 6, 2, 1, 6, 6, 3 (39 draws)

9, 5, 2, 2, 1, 5, 6, 11, 7, 11, 4, 6, 1, 12, 3, 7, 8, 3, 3, 2, 2, 3, 5, 6, 2, 5, 1, 6, 8, 5, 4, 10 (32 draws)

1, 5, 5, 4, 5, 12, 10, 1, 8, 1, 3, 9, 1, 3, 11, 9, 10, 3, 9, 11, 4, 4, 4, 7, 7, 3, 1, 11, 11, 4, 10, 6, 3, 2 (34 draws)

6, 7, 6, 1, 12, 6, 1, 1, 7, 1, 11, 10, 3, 3, 9, 6, 9, 4, 2, 6, 11, 7, 7, 11, 2, 6, 2, 1, 7, 2, 5, 9, 6, 12, 6, 11, 1, 11, 11, 2, 5, 6, 7, 5, 2, 11, 2, 2, 6, 2, 12, 5, 5, 5, 12, 10, 3, 11, 1, 10, 10, 6, 9, 11, 10, 7, 11, 5, 1, 9, 11, 9, 8 (73 draws)

Each of the number is generated by using the =RANDBETWEEN(1, 12) in Excel. In each iteration, the numbers are generated until all 12 numbers have been generated.

There is considerable fluctuation in this 5 iterations of the experiment. With the 5th one being exceptionally long, it is possible that it takes a long time to find all 12 animal signs. The average of the first iteration is obviously 34. The average of the first two iteration is 36.5. The averages of the first 3, 4, and 5 iterations are 35, 34.75, and 42.4, respectively.The last average of 42.4 is quite different from the average of 37 indicated earlier.

What if we continue to run the experiment, say, for 10,000 times? What would the long run averages look like? The following graph shows the averages from first 100 runs of the experiment. It plots the average of the $n$ iterations from $n=1$ to $n=100$.

Figure 1 – Long run averages from 100 runs

Figure 1 shows that there is quite a bit of fluctuation in the averages in the first 25 runs or so. Eventually, the averages settle around 37 but still retain noticeable fluctuation. The following graph shows the averages from first 1000 runs of the experiment.

Figure 2 – Long run averages from 1000 runs

The graph is Figure is smoother as it moves toward 1000, but still has noticeable fluctuation from 37 (in fact the graph is mostly below 37). The following graph shows the averages from first 10000 runs of the experiment.

Figure 3 – Long run averages from 10000 runs

The graph in Figure 3 shows the average of the first $n$ iterations with $n$ goes from 1 to 10,000. The graph is for the most parts a horizontal line slightly above 37, especially after $n=3000$. In fact the average of all 10,000 iterations is 37.3381, which is close to the theoretical average of 37.2385.

The simulation is an illustration of the law of larger numbers. The essence of the law of large numbers is that the short run results of a random experiment are unpredictable while the long run results are stable and predictable and eventually settle around the theoretical average.

The first 5 runs of the experiment (as shown above) are certainly unpredictable. It may take 34 draws or may take 73 draws. The first 100 simulations also have plenty of ups and downs, even though graph in Figure 1 shows a movement toward 37. The first 1000 simulations display more stable results but are below average as the graph move toward 1000 (Figure 2). In simulating the experiment 10,000 times (Figure 3), the long run averages settle around the theoretical average of 37.2385.

So if you survey people their animal signs, the time it takes has a great deal of random fluctuations. It may take 34 asks or 73 asks (as shown in the first 5 simulations). If the experiment is done repeatedly, the results are predictable, i.e. the average is around 37.

The long run results of a gambling game are predictable too and will settle around the theoretical average. The theoretical average of a gambling game is usually referred to as the house edge. For example, for the game of roulette, the house edge is 5.26%. For each bet of \$1, the gambler is expected to lose 5.26 cents. In playing a few games, the gambler may win big. In the long run, the house is expected to gain 5.26 cents per one dollar bet. Thus the law of large numbers can mean financial ruin for the gambler (or profits for the casino). For an illustration of the law of large numbers in the context of the game of Chuck-a-Luck, see here. For an illustration in the context of the roulette wheel, see here.

Another piece of useful information from the 10,000 simulated runs of the experiment is the frequency distribution.

Table 1
Frequency Distribution of the 10,000 Simulated Runs
$\begin{array}{rrrrr} \text{Interval} & \text{ } & \text{Frequency} & \text{ } & \text{Relative Frequency} \\ \text{ } & \text{ } \\ \text{10 to 19} & \text{ } & 375 & \text{ } & 3.75 \% \\ \text{20 to 29} & \text{ } & 2817 & \text{ } & 28.17 \% \\ \text{30 to 39} & \text{ } & 3267 & \text{ } & \text{ } 32.67 \% \\ \text{40 to 49} & \text{ } & 1931 & \text{ } & \text{ } 19.31 \% \\ \text{50 to 59} & \text{ } & 901 & \text{ } & \text{ } 9.01 \% \\ \text{60 to 69} & \text{ } & 379 & \text{ } & \text{ } 3.79 \% \\ \text{70 to 79} & \text{ } & 190 & \text{ } & \text{ } 1.90 \% \\ \text{80 to 89} & \text{ } & 88 & \text{ } & \text{ } 0.88 \% \\ \text{90 to 99} & \text{ } & 30 & \text{ } & \text{ } 0.30 \% \\ \text{100 to 109} & \text{ } & 13 & \text{ } & \text{ } 0.13 \% \\ \text{110 to 119} & \text{ } & 3 & \text{ } & \text{ } 0.03 \% \\ \text{120 to 129} & \text{ } & 3 & \text{ } & \text{ } 0.03 \% \\ \text{130 to 139} & \text{ } & 1 & \text{ } & \text{ } 0.01 \% \\ \text{140 to 149} & \text{ } & 2 & \text{ } & \text{ } 0.01 \% \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Total } & \text{ } & 10000 & \text{ } & 100.00 \% \end{array}$

Figures 1 to 3 tell us the long run behavior of the simulations (e.g. the long run average is 37). Table 1 gives the counts of the simulations that fall into each interval and the corresponding relative frequency (the percentage). Table 1 tells us how often or how likely a given possibility occurs. The total number of simulations that fall within the range 20 to 49 is 8015. So about 80% of the time, the experiment ends in 20 to 49 draws. Furthermore, 92.95% of the simulations fall into the interval 20 to 69. This really tells us what the likely results would be if we perform the experiment. The frequency distribution also tells us what is unlikely. There is only 3.75% chance that the experiment can be completed with less than 20 draws. In the simulations, there are two that are above 140 (they are 141 and 142). These extreme results can happen but are extremely rare. They only happened about 2 times per 10,000 simulations.

___________________________________________________________________________

The math angle

There is also a mathematical description of the random experiment of surveying people until all 12 animal signs are obtained. For example, there is a formula for calculating mean, and there is also a formula for calculating the variance. There is also a probability function, i.e. a formula for calculating probabilities (akin to Table 1). The formula for the mean is actually simple to describe.

Let $X_n$ be the number of draws from the set $\left\{1,2,3,\cdots,n \right\}$ with replacement such that each number in the set is picked at least once. The expectation of $X_n$ is the following.

$\displaystyle E[X_n]=n \biggl[ \frac{1}{n}+\frac{1}{n-1}+ \cdots+ \frac{1}{3}+\frac{1}{2}+1 \biggr]$

The 37.2385 theoretical average discussed above comes from this formula. The the case of $n=12$, the mean would be

$\displaystyle E[X_{12}]=12 \biggl[ \frac{1}{12}+\frac{1}{11}+ \cdots+ \frac{1}{3}+\frac{1}{2}+1 \biggr]=37.23852814$

For more details of the math discussion, see this previous post. For a more in-depth discussion including the probability function, see this post in a companion blog on probability.

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$

# The coupon collector problem

The coupon collector problem is a classic problem in probability. It is usually described in this way: a person (called the coupon collector) is trying to collect all coupons (promotional gift items) when purchasing a certain type of products (e.g. boxes of breakfast cereal). Suppose that there are $n$ different types of coupons. Suppose that when someone buys a unit of the product, he or she does not know what type of coupons that is inside the box. How many units of the product does the coupon collector have to buy in order to collect the entire set of coupons?

Note that the number of units to buy is a random quantity. To see it for yourself, roll a fair die until each of the six faces appears at least once. You can try this with your own die and record the numbers as you go. We rolled our own die and the following shows three such experiments.

3, 4, 6, 2, 6, 1, 3, 2, 1, 4, 6, 2, 6, 1, 6, 1, 2, 3, 5

6, 6, 4, 1, 3, 6, 6, 4, 1, 5, 4, 2

6, 2, 1, 1, 5, 5, 1, 1, 1, 3, 6, 1, 3, 5, 4

The first experiment has 19 trials, the second one has 12 trials and the third one has 15 trials. If we run the random experiment another time, we would get another sequence. Also note that the first 5 of the faces (coupons) show up fairly quickly. It takes a few more trials just to get the last coupon.

Rolling a dice is simply a way to generate random numbers. Thus we can describe the coupon collector problem as a random sampling problem. Draw numbers repeatedly from the sample space $S=\left\{1,2,3,\cdots n \right\}$ with replacement until each of the $n$ numbers has been chosen at least once. Let $X_n$ be the number of trials that are required to perform this experiment. As demonstrated above (or as confirmed by your rolling of a die), $X_n$ is a random variable. The smallest value it can take on is $n$. Theoretically, the random variable $X_n$ could be $n$ or some other integer greater than $n$. The goal is to describe the probability distribution of the random variable $X_n$.

Some interesting things to know about this probability distribution are the expected value $E[X_n]$ (the mean or the average number of trials that are needed to complete the experiment). Another one is the probability function (or probability mass function) so that we can determine how likely it is to complete the experiment in $k$ trials. For a more detailed discussion, go to this previous post. In the remainder of the post, we discuss the expected value.

The expected value $E[X_n]$ has a clear formula and is easily computed.

$\displaystyle E[X_n]=n \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+ \cdots+\frac{1}{n-1}+\frac{1}{n} \biggr]$

For the 6-coupon case (or the rolling of a die), it would take on average 15 trials to get all 6 faces.

$\displaystyle E[X_6]=6 \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+\frac{1}{4}+\frac{1}{5}+\frac{1}{6} \biggr]=14.7 \approx 15$

Of course, 15 is simply the average. The actual results may be more or less than 15 as the above illustrations show (or your own rolling of a die). To find out how likely it is to have a result around 15, it is necessary to have the probability function of $X_6$. See the previous post for more information on that.

Consider this example. Suppose that you want to throw candies at random to a group of 10 kids. Each candy is assigned to each child at random. You want to throw candies until each of the children has at least one candy. Of course some kids would receive a few candies. But the goal is that no one would come away empty handed. On average, you would need to throw about 30 candies.

$\displaystyle E[X_{10}]=10 \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+ \cdots+\frac{1}{9}+\frac{1}{10} \biggr]=29.29 \approx 30$

If there are 50 coupons, it would take on average 225 trials to get all 50 coupons. One interesting thing is that it is relatively easy to get the first few coupons. To get the last coupon, it would take a quite a few trials. In fact, it would take on average $n$ trials to get the last coupon (if $n$ is the number of coupon types).

Another point we would like to make is that the coupon collector problem requires an understanding of the geometric distribution. A random experiment that can modeled by a geometric distribution is the performance of a series of independent trials until some criterion is satisfied. The random variable $X_n$ described here has some similarity with a geometric model (but not totally similar). The similarity: we keep drawing numbers with replacement until all numbers are drawn.

Let’s look at the third experiment 6, 2, 1, 1, 5, 5, 1, 1, 1, 3, 6, 1, 3, 5, 4 listed above. We draw the first number (happens to be 6). Next we wish to draw another number different from 6. Luckily it takes only one trail to get 2. Now we wish to draw a number different from 6 and 2. It also takes one trial (to get 1). Next, continue to draw until we get a number different from 6, 2 and 1. This time it takes two trials to get a 5. Then continue to draw until we get a number different from 6, 2, 1, and 5. It takes 5 trials to get a 3. Lastly it takes 5 more trials to get the last number 4. What this shows is that the random variable $X_6$ is a sum of 6 geometric random variables.

$X_6=C_1+C_2+C_3+C_4+C_5+C_6$

Each of the $C_i$ is a geometric random variable. The first one $C_1$ is kind of special. The value of $C_1$ is always 1 (it takes one trial to get the first number). The second variable $C_2$ is a geometric random variable (the number of trials to get a number different from the first chosen number. Then $C_3$ is the number of trials to get a number different from the first two chosen numbers and so on. If there are $n$ coupons, then $X_n$ is the sum of $n$ geometric random variables.

$X_n=C_1+C_2+C_3+\cdots +C_{n-1}+C_n$

The formula for the expected value $E[X_n]$ is derived through this connection with the geometric distribution. See the previous post for a more in-depth discussion.

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$