Really Simple Statistics: What is heteroscedasticity? #MRX

Welcome to Really Simple Statistics (RSS). There are lots of places online where you can ponder over the minute details of complicated equations but very few places that make statistics understandable to everyone. I won’t explain exceptions to the rule or special cases here. Let’s just get comfortable with the fundamentals.

** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **

Oh my, what a gorgeous heteroscedasticity you have! You mean other than a really cool eight-syllable statistics word that you can show off with in front of friends?

This long and lovely word comes into play when you’re dealing with pairs of variables – perhaps height and weight, or grades and time spent studying, or voting behaviour and time spent reading the political section of the paper. It has mean and nasty effects on correlation coefficients and regression models so pay attention!

Specifically, it refers to the distribution of numbers for one variable in relation to the distribution of numbers for another variable. Homoscedasticity refers to a spread that is very even and regular no matter which section of the chart you look at. This is what you see in the first chart.

Heteroscedasticity refers to a spread that is uneven and irregular, like the second scatterplot you see here. The datapoints are very close to each other in the bottom left but then they are spread out a lot in the top right.

More examples please!
  1. We all know that shorter people tend to weigh less and taller people tend to weigh more. But what if most 5 foot tall women weigh between 90 and 100 pounds while most 6 foot tall women weigh between 130 and 170 pounds? The range of 10 pounds at 5 feet is very different from the range of 40 pounds at 6 feet. That’s a lot of heterobebijicty!
  2. We also know that people who study a lot tend to get higher grades. Now, what if people who studied 1 hour per week got a D while people who studied 2 hours per week got a C, B, or A? Once again, 1 hour resulted in one possible grade while 2 hours resulted in three possible grades. That’s even more heteroihjusdfgicty.
  3. And what if jogging for 30 minutes burns 200 to 250 calories while jogging for 60 minutes burns 400 to 500 calories? Half an hour resulted in a range of 50 calories while a full hour resulted in a range of 100 calories, which is…. also 50 calories per half hour. That’s a lot of…. homoscedasticity!
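
If you would rather see the idea in numbers than in pictures, here is a minimal Python sketch. Everything in it is made up for illustration: the helper names `make_data` and `spread_by_half`, the line y = 2x, and the noise sizes are assumptions, not real data. It builds one dataset where the noise is the same everywhere and one where the noise grows as x grows, then measures the spread on the left half and the right half of each.

```python
import random
import statistics

random.seed(42)  # fixed seed so the made-up data is repeatable

def make_data(n, noise_for):
    """Return (x, y) pairs where y = 2x plus noise whose size depends on x."""
    points = []
    for _ in range(n):
        x = random.uniform(1, 10)
        y = 2 * x + random.gauss(0, noise_for(x))
        points.append((x, y))
    return points

def spread_by_half(points):
    """Standard deviation of the leftover noise (y - 2x) for small x and for large x."""
    low = [y - 2 * x for x, y in points if x < 5.5]
    high = [y - 2 * x for x, y in points if x >= 5.5]
    return statistics.stdev(low), statistics.stdev(high)

even = make_data(2000, lambda x: 1.0)        # same noise everywhere: homoscedastic
uneven = make_data(2000, lambda x: 0.3 * x)  # noise grows with x: heteroscedastic

print("even spread   (left, right):", spread_by_half(even))
print("uneven spread (left, right):", spread_by_half(uneven))
```

For the even dataset the two numbers should come out close to each other. For the uneven dataset the right-hand number should be clearly bigger, which is the bottom-left versus top-right pattern described above.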

So the next time you’re wondering why your correlation coefficient or regression equation isn’t as nice as what you had hoped for, have a look for heteroscedasticity. And make it a habit to look before you statisticize.
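
One rough way to "have a look" without a chart is sketched below in Python. It fits a straight line, then compares how far the points scatter around that line for the larger x values versus the smaller ones. The function name `spread_ratio` and the study-time numbers are hypothetical, invented for this example, and this is an informal check rather than a formal statistical test.

```python
import statistics

def spread_ratio(xs, ys):
    """Fit a straight line, then compare the scatter around it for large x vs small x.

    Returns (spread for the larger-x half) / (spread for the smaller-x half).
    A value near 1 suggests an even spread; far from 1 suggests an uneven one.
    """
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # Ordinary least-squares slope and intercept.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Residual = how far each point sits above or below the fitted line.
    residuals = sorted((x, y - (intercept + slope * x)) for x, y in zip(xs, ys))
    half = len(residuals) // 2
    low = [r for _, r in residuals[:half]]
    high = [r for _, r in residuals[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

# Hypothetical study-time data: hours per week and a grade out of 100.
hours = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
grades = [60, 61, 63, 66, 64, 71, 66, 76, 67, 81, 68, 88]
print(round(spread_ratio(hours, grades), 2))
```

In this made-up data the grades fan out as study time goes up, so the ratio lands well above 1. With an even spread it would sit close to 1.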

One response

  1. mysereneparadise

    Thank you! I’m analyzing methylation microarray data in R and when it got to the use of Beta and M values and ‘heteroscedasticity’ I wasn’t quite sure what the issue was there. Your explanation really helped and I enjoyed your sense of humour n__n
