
Visualizing ‘Regression To The Mean’ In Python

Let’s take a more philosophical turn in our programming and look at something related to research. I chose regression to the mean because I have always found the topic fascinating.


What is regression to the mean?

Regression to the mean, sometimes called reversion towards the mean, is the phenomenon in which, if one sample of a random variable is extreme or close to an outlier, the next measurement is likely to be closer to the mean. Note that the variable being measured has to involve randomness for this effect to appear and to be significant.

Sir Francis Galton first described this phenomenon while studying hereditary stature in his 1886 paper “Regression towards Mediocrity in Hereditary Stature.” He observed that parents who were taller than the community average tended to have children who were shorter than they were, closer to the community average height.

Since then, this phenomenon has been described in other fields of life where randomness or luck is also a factor.

For example, if a business has a highly profitable quarter, it is likely not to do as well in the following quarter. Likewise, if one medical trial suggests that a particular drug or treatment outperforms all other treatments for a condition, then in a second trial that drug or treatment is likely to perform closer to the mean.

Regression to the mean should not be confused with the gambler’s fallacy. The fallacy is the belief that if an event has occurred more often than normal in the past, it is less likely to happen in the future, even when the events are known to be independent, i.e. the past does not determine the future.
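The difference can be checked with a quick simulation. The sketch below is not part of the original script; the helper name heads_rate_after_streak, the streak length of 3 and the seed are assumptions made for illustration. It flips a fair coin many times and looks only at the flips that come right after three heads in a row. If the gambler’s fallacy were true, heads would be rarer there.

```python
import random

def heads_rate_after_streak(num_flips, streak_len=3, seed=0):
    """Fraction of heads among flips that directly follow streak_len heads in a row.

    Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)  # private generator so the run is repeatable
    flips = [rng.choice('HT') for _ in range(num_flips)]
    followers = []
    for i in range(streak_len, num_flips):
        # keep flip i only if the streak_len flips before it were all heads
        if all(f == 'H' for f in flips[i - streak_len:i]):
            followers.append(flips[i])
    if not followers:
        raise ValueError('no streak found; use more flips')
    return followers.count('H') / len(followers)

rate = heads_rate_after_streak(200_000)
print(f'heads right after three heads in a row: {rate:.3f}')
```

The printed rate should sit close to 0.5, because the coin has no memory. Regression to the mean makes a weaker claim: after an unusually extreme batch of flips, the next batch will probably look ordinary.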

I was thinking about regression to the mean while working on a coding challenge that involved tossing a coin and calculating the probability of heads and tails, so I decided to write a post on this phenomenon.

This is the gist of what we are looking for in the code. Suppose we flip a coin a set number of times and record the fraction of flips that came up heads; call that one trial. We then repeat this for several trials. Among the trials, we look for those whose fraction of heads was extreme and check whether the trial that came right after regressed towards the mean. Note that the expected fraction of heads is 0.5 because the probability that a fair coin will come up heads is ½ and the probability it will come up tails is also ½.

So after collecting the extremes along with the trial that comes after each one, we want to see whether the following trials regressed towards the mean or not. We do this visually by plotting the extremes and the trials after them on one graph.
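The same comparison can also be made with numbers rather than a picture. The following is a minimal sketch, separate from the plotting code in this post: the function name count_regressions, the 15 flips per trial, the 1,000 trials and the seed are all assumptions chosen for illustration, while the 0.33 and 0.66 cut-offs are the ones the plotting function uses to decide what counts as an extreme.

```python
import random

def count_regressions(num_flips=15, num_trials=1000, seed=1):
    """Return (number of extreme trials, how many were followed by a trial nearer 0.5)."""
    rng = random.Random(seed)
    # fraction of heads in each trial of num_flips flips
    frac_heads = [
        sum(rng.random() < 0.5 for _ in range(num_flips)) / num_flips
        for _ in range(num_trials)
    ]
    extremes = closer = 0
    # pair every trial with the one that follows it
    for current, following in zip(frac_heads, frac_heads[1:]):
        if current < 0.33 or current > 0.66:
            extremes += 1
            # did the next trial land nearer to the mean than the extreme one?
            if abs(following - 0.5) < abs(current - 0.5):
                closer += 1
    return extremes, closer

extremes, closer = count_regressions()
print(f'{closer} of {extremes} extreme trials were followed by one nearer 0.5')
```

In a run like this, most extreme trials should be followed by a trial nearer 0.5, which is the pattern the graph is meant to show.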

So, here is the complete code. I will first explain the graph that accompanies the code and then go through the code line by line.

After you run the above code, you will get a graph that looks like the one below.

[Figure: extreme trials and the trial after each one, plotted around the 0.5 line]


We drew a horizontal line at the 0.5 mark on the y-axis so that you can see where the points sit relative to the average. From the graph you will see that on several occasions, when a trial is an extreme above or below the average line, the next trial moves back towards the line, except for one occasion when it did not. So, what is happening here? Each trial is independent of the one before it, and most trials land near 0.5, so the trial that follows an extreme is far more likely to be an ordinary one than another extreme.
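How often a trial counts as extreme can be worked out exactly from the binomial distribution. The sketch below is an addition for illustration: the function name prob_extreme and the trial sizes of 15, 30 and 100 flips are assumptions (the post does not state how many flips were used), and the 0.33 and 0.66 cut-offs are taken from the regress_to_mean function explained further down.

```python
from math import comb

def prob_extreme(num_flips, low=0.33, high=0.66):
    """Exact probability that a fair coin's fraction of heads is below low or above high."""
    total = 2 ** num_flips  # number of equally likely flip sequences
    extreme_ways = sum(
        comb(num_flips, heads)  # sequences with exactly this many heads
        for heads in range(num_flips + 1)
        if heads / num_flips < low or heads / num_flips > high
    )
    return extreme_ways / total

for n in (15, 30, 100):  # assumed trial sizes
    print(n, round(prob_extreme(n), 4))
```

Extreme trials become rarer as the number of flips per trial grows, so whatever follows an extreme trial is very likely to be an ordinary one near 0.5.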

Now, let me explain the code I used to draw the visuals. There are two functions here, one that acts as the coin flip function and the other to collect the extremes and subsequent trials.

First, the code for the coin flip.

    
import random

def flip(num_flips):
    ''' assumes num_flips a positive int; returns the fraction of flips that were heads '''
    heads = 0
    for _ in range(num_flips):
        if random.choice(('H', 'T')) == 'H':
            heads += 1
    return heads/num_flips

The function flip takes as its argument the number of times the coin should be tossed. Each toss is a random choice between 'H' and 'T'. If it is a head, the function adds one to the heads variable, and finally it returns the fraction of tosses that came up heads.
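A quick way to get a feel for flip is to call it with a few different trial sizes. The snippet below repeats the function so that it runs on its own; the seed and the trial sizes are assumptions for illustration.

```python
import random

def flip(num_flips):
    """Return the fraction of heads in num_flips tosses of a fair coin."""
    heads = 0
    for _ in range(num_flips):
        if random.choice(('H', 'T')) == 'H':
            heads += 1
    return heads / num_flips

random.seed(42)  # assumed seed, only so repeated runs print the same numbers
results = {n: flip(n) for n in (10, 100, 10_000)}
for n, frac in results.items():
    print(f'{n:>6} flips -> fraction of heads {frac:.3f}')
```

Small trials swing widely around 0.5 while large ones stay close to it, which is why short trials are the ones that produce extremes.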

Then the next function, regress_to_mean.

    
import matplotlib.pyplot as plt

def regress_to_mean(num_flips, num_trials):
    # get fractions of heads for each trial of num_flips
    frac_heads = []
    for _ in range(num_trials):
        frac_heads.append(flip(num_flips))
    # find trials with extreme results and for each 
    # store it and the next trial
    extremes, next_trial = [], []
    for i in range(len(frac_heads) - 1):
        if frac_heads[i] < 0.33 or frac_heads[i] > 0.66:
            extremes.append(frac_heads[i])
            next_trial.append(frac_heads[i+1])
    # plot results 
    plt.plot(extremes, 'ko', label = 'Extremes')
    plt.plot(next_trial, 'k^', label = 'Next Trial')
    plt.axhline(0.5)
    plt.ylim(0,1)
    plt.xlim(-1, len(extremes) + 1)
    plt.xlabel('Extremes example and next trial')
    plt.ylabel('Fraction Heads')
    plt.title('Regression to the mean')
    plt.legend(loc='best')
    plt.savefig('regressmean.png')
    plt.show()

This function is the heart of the code. It flips the coin a set number of times for a set number of trials, collecting the fraction of heads for each trial in a list. It then checks which of those fractions is an extreme, meaning below 0.33 or above 0.66. When it finds one, it adds it to the extremes list and adds the trial that follows to the next_trial list. Finally, we use matplotlib to draw the visuals: a plot of the extremes and next_trial values, with a horizontal line at the average so the viewer can better see which direction the next trial tends to move when there is an extreme.
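One possible variation, not part of the original script, is to separate the search for extremes from the plotting so that it can be checked on a small hand-made list. The helper name find_extremes and its low and high parameters are hypothetical; the default cut-offs mirror the 0.33 and 0.66 used above.

```python
def find_extremes(frac_heads, low=0.33, high=0.66):
    """Return (extremes, next_trial) lists using the same rule as regress_to_mean."""
    extremes, next_trial = [], []
    # stop one short of the end: the last trial has no trial after it
    for i in range(len(frac_heads) - 1):
        if frac_heads[i] < low or frac_heads[i] > high:
            extremes.append(frac_heads[i])
            next_trial.append(frac_heads[i + 1])
    return extremes, next_trial

sample = [0.5, 0.2, 0.6, 0.8, 0.7, 0.4, 0.1]  # made-up fractions of heads
extremes, next_trial = find_extremes(sample)
print(extremes)
print(next_trial)
```

Note that an extreme trial can itself be the next trial of the one before it, as 0.7 is here, and that an extreme in the final position (0.1) is skipped because nothing follows it.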

I hope you enjoyed the code. You can run it on your machine or download it to study: regress_to_mean.py.

Thanks for your time. I hope you do leave a comment.

Happy pythoning.
