Statistics That Deceive
It is widely accepted that the larger the data set, the better the results. Simpson’s Paradox demonstrates that great care must be taken when combining smaller data sets into a larger one. Sometimes the conclusion drawn from the larger data set is the opposite of the conclusion drawn from the smaller data sets!
Example: Simpson’s Paradox
Baseball batting statistics for two players:
How could Player A beat Player B for both halves individually, but then have a lower total season batting average?
We weren’t told how many at bats each player had:
Player A’s dismal second half and Player B’s great first half had higher weights than the other two values.
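The weighting effect can be sketched in a few lines of Python. The hit and at-bat counts below are invented purely for illustration (they are not the figures from the slide), and the names `batting_average`, `season_average`, `player_a`, and `player_b` are hypothetical. The numbers are chosen so that Player A's weak second half and Player B's strong first half carry most of the at bats.

```python
def batting_average(hits, at_bats):
    """Batting average for one stretch of games: hits divided by at bats."""
    return hits / at_bats


def season_average(halves):
    """Season average: total hits over total at bats, not the mean of the halves."""
    total_hits = sum(hits for hits, _ in halves.values())
    total_at_bats = sum(at_bats for _, at_bats in halves.values())
    return total_hits / total_at_bats


# Hypothetical (hits, at bats) per half; invented for illustration only.
player_a = {"first": (4, 10), "second": (25, 100)}
player_b = {"first": (35, 100), "second": (2, 10)}

for half in ("first", "second"):
    a = batting_average(*player_a[half])
    b = batting_average(*player_b[half])
    print(f"{half} half: A = {a:.3f}, B = {b:.3f}")

print(f"season: A = {season_average(player_a):.3f}, B = {season_average(player_b):.3f}")
```

With these made-up counts, Player A leads in each half, yet Player B has the higher season average, because each player's season figure is dominated by the half with 100 at bats.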
Average college physics grades for students in an engineering program:

                      Taken HS physics   No HS physics
Number of Students           50                5
Average Grade                80               70

Average college physics grades for students in a liberal arts program:

                      Taken HS physics   No HS physics
Number of Students            5               50
Average Grade                95               85

It appears that in both majors (Liberal Arts and Engineering), taking high school physics improves your college physics grade by 10 points.
In order to get better results, let’s combine our datasets.
In particular, let’s combine all the students that took high school physics. More precisely, let’s combine the Engineering majors that took high school physics with the LA majors that took high school physics.
Likewise, combine the Engineers that did not take high school physics with the LAs that did not take high school physics.
But be careful! You can’t just take the average of the two averages, because each dataset has a different number of values!
Average college physics grades for students who took high school physics:

               # Students   Avg Grade   Weighted Grade
Engineering        50          80       50/55 * 80 = 72.7
Lib Arts            5          95        5/55 * 95 =  8.6
Total              55
Average                                 72.7 + 8.6 = 81.3

Average college physics grades for students who did not take high school physics:

               # Students   Avg Grade   Weighted Grade
Engineering         5          70        5/55 * 70 =  6.4
Lib Arts           50          85       50/55 * 85 = 77.3
Total              55
Average                                 6.4 + 77.3 = 83.7

Did the students that did not have high school physics actually do better?
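As a check on the arithmetic, here is a minimal Python sketch of the weighted average, using the student counts and average grades from the tables. The function name `weighted_mean` is a hypothetical helper. The unrounded results differ slightly from the 81.3 and 83.7 shown, which come from adding terms that were already rounded to one decimal place.

```python
def weighted_mean(counts, means):
    """Combine group averages, weighting each by its number of students."""
    total = sum(counts)
    return sum(n * m for n, m in zip(counts, means)) / total


# Order within each list: Engineering, Liberal Arts.
took_hs = weighted_mean([50, 5], [80, 95])
no_hs = weighted_mean([5, 50], [70, 85])

# The naive "average of the averages" ignores the group sizes.
naive_took_hs = (80 + 95) / 2
naive_no_hs = (70 + 85) / 2

print(f"took HS physics: weighted = {took_hs:.2f}, naive = {naive_took_hs:.2f}")
print(f"no HS physics:   weighted = {no_hs:.2f}, naive = {naive_no_hs:.2f}")
# took HS physics: weighted = 81.36, naive = 87.50
# no HS physics:   weighted = 83.64, naive = 77.50
```

The naive averages preserve the 10-point advantage, while the weighted ones reverse it. Neither number is "wrong" arithmetic; the question is whether the combined groups are comparable at all.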
Same example calculated another way
Average college physics grades for students who took high school physics:

               # Students   Avg Grade   Grade Pts
Engineering        50          80         4000
Lib Arts            5          95          475
Total              55                     4475
Average = total grade points / total students = 4475 / 55 = 81.36

Average college physics grades for students who did not take high school physics:

               # Students   Avg Grade   Grade Pts
Engineering         5          70          350
Lib Arts           50          85         4250
Total              55                     4600
Average = total grade points / total students = 4600 / 55 = 83.64

Did the students that did not have high school physics actually do better?
Two problems with combining the data
1. There was a larger percentage of one type of student in each table.
2. The engineering students had a more rigorous physics class (e.g. “Physics for Engineers”) than the liberal arts students, thus there is a hidden variable.
In fact, this ‘lurking variable’ that makes the subcategories different from one another is the most common cause of Simpson’s Paradox.
Key Point: Be very careful when you combine data into a larger set.
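The role of the unequal group sizes can be illustrated with a short Python sketch. It compares the pooled gap for the actual counts from the tables with a hypothetical balanced case in which every group has 50 students. The balanced counts are an assumption made only for this illustration, and `pooled_mean` is a hypothetical helper name.

```python
def pooled_mean(groups):
    """Mean over all students, given (count, average grade) pairs per major."""
    total = sum(count for count, _ in groups)
    return sum(count * grade for count, grade in groups) / total


# Actual counts from the tables: (students, average grade) for Engineering, Lib Arts.
gap_actual = pooled_mean([(50, 80), (5, 95)]) - pooled_mean([(5, 70), (50, 85)])

# Hypothetical balanced counts (assumed, not from the data): 50 students in every group.
gap_balanced = pooled_mean([(50, 80), (50, 95)]) - pooled_mean([(50, 70), (50, 85)])

print(f"pooled gap with actual counts:   {gap_actual:+.2f}")
print(f"pooled gap with balanced counts: {gap_balanced:+.2f}")
# pooled gap with actual counts:   -2.27
# pooled gap with balanced counts: +10.00
```

With balanced groups the pooled gap matches the 10-point advantage seen within each major. The reversal appears only when the mix of majors differs between the two groups being compared, which is how the lurking variable (the difficulty of each major's physics class) leaks into the combined result.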