This month, the most interesting thing I learned about was “Simpson’s Paradox” in statistics. Simpson’s Paradox is a phenomenon where a trend that shows up in the combined data disappears or even reverses once the data is split into groups, so two opposite conclusions can be drawn from the same data.
Image source: RJ Andrews (https://twitter.com/infowetrust). Will happily take down upon request. All credit goes to the creator.
How does it work?
Let’s say you’re comparing the average household income in Texas vs Alabama. The results look like this:
Fake Data Between Texas and Alabama (Income):
- 2011, Texas: $37,000
- 2011, Alabama: $50,000
- 2012, Texas: $43,000
- 2012, Alabama: $53,000
- 2013, Texas: $50,000
- 2013, Alabama: $55,000
From this alone, you may conclude that people in Alabama make more than people in Texas. But what if we divide these groups by ethnicity? The results may look like this (showing one year as an example):
2011 Texas Fake Data By Ethnicity:
- White: $50,000
- Black: $34,000
- Latinx: $44,000
2011 Alabama Fake Data By Ethnicity:
- White: $47,000
- Black: $30,000
- Latinx: $37,000
As you can see, when we divide by ethnicity, the same data tells us that each ethnic group in Texas actually makes more than the same group in Alabama.
How is the same data saying two different things? It’s because the overall averages hide other factors that influence them, such as the demographic make-up of each state. Maybe Texas has a more culturally and socioeconomically diverse population than Alabama, with a larger share of people in lower-earning groups. That would pull Texas’s overall average down and make Alabama seem more “affluent” in comparison, even though every group earns more in Texas. Hidden factors like racial or socioeconomic make-up shape the combined numbers, so we can see two different trends in the same dataset depending on how it is grouped.
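Here is a rough sketch of that arithmetic in Python. It uses the 2011 per-group incomes from the fake data above, but the population shares are entirely made up by me just to show the effect. One caveat: the three Alabama groups listed are all below $50,000, so on their own they can't average out to the $50,000 overall figure in the yearly list. The overall numbers this sketch prints therefore won't match that list.

```python
# Per-group average incomes from the 2011 fake data above.
incomes = {
    "Texas": {"White": 50_000, "Black": 34_000, "Latinx": 44_000},
    "Alabama": {"White": 47_000, "Black": 30_000, "Latinx": 37_000},
}

# HYPOTHETICAL population shares in percent (an assumption, not from the post).
shares = {
    "Texas": {"White": 10, "Black": 70, "Latinx": 20},
    "Alabama": {"White": 90, "Black": 5, "Latinx": 5},
}


def overall_average(state):
    """Average income for a state, weighting each group by its population share."""
    weighted_total = sum(
        incomes[state][group] * shares[state][group] for group in incomes[state]
    )
    return weighted_total / sum(shares[state].values())


for state in incomes:
    print(state, overall_average(state))

# Every group earns more in Texas than the same group in Alabama...
assert all(
    incomes["Texas"][g] > incomes["Alabama"][g] for g in incomes["Texas"]
)
# ...yet the overall average comes out higher in Alabama.
assert overall_average("Alabama") > overall_average("Texas")
```

With these made-up shares, Texas comes out at 37,600 overall and Alabama at 45,650. The only thing that differs between the two views is how much weight each group gets in the average.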
Simpson’s Paradox really made me question the statistics we normally see. One great real-world example, about graduate admissions at Berkeley, is shown in this research paper: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf
What have you learned this month or today?