Saturday, January 14, 2017

Analyzing Gapminder Data


Founded in Stockholm by Ola Rosling, Anna Rosling Rönnlund and Hans Rosling, Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, for a total of 215 areas.

Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank.

Decision tree analysis was performed (using Python) to test nonlinear relationships among a series of explanatory variables (X[0] to X[12] below) and a binary, categorical response variable (above-average suicides per 100,000 in 2005). Note that the Python decision tree implementation used here does not support pruning. The code is included below the analysis.




Index     Variable                                        Description
X[0]      income per person                               2010 Gross Domestic Product per capita in constant 2000 US$
X[1]      alcohol consumption                             2008 alcohol consumption per adult (age 15+)
X[2]      armed forces rate                               Armed forces personnel (% of total labor force)
X[3]      breast cancer per 100th                         2002 breast cancer new cases per 100,000 female
X[4]      co2 emissions                                   2006 cumulative CO2 emission (metric tons)
X[5]      female employment rate                          2007 female employees age 15+ (% of population)
X[6]      HIV rate                                        2009 estimated HIV prevalence % (ages 15-49)
X[7]      internet use rate                               2010 Internet users (per 100 people)
X[8]      oil per person                                  2010 oil consumption per capita (tonnes per year and person)
X[9]      polity score                                    2009 Democracy score (Polity)
X[10]     residential electricity consumption per person  2008 residential electricity consumption, per person (kWh)
X[11]     employment rate                                 2007 total employees age 15+ (% of population)
X[12]     urban rate                                      2008 urban population (% of total)
response  suicide per 100th                               Is the 2005 suicide rate, age adjusted, per 100,000 above average? (True or False)
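
As a rough illustration of how a binary response like the one in the last row could be derived, here is a minimal sketch. The country labels and rates are made up for illustration (they are not Gapminder values), the helper name `binarize_above_average` is hypothetical, and dropping countries with missing rates is an assumption; the post does not say how missing values were handled.

```python
from statistics import mean

# Hypothetical age-adjusted suicide rates per 100,000 (illustrative, not Gapminder data).
suicide_per_100k = {
    "A": 4.0,
    "B": 12.0,
    "C": 8.0,
    "D": None,  # missing value, dropped below (an assumption)
    "E": 16.0,
}

def binarize_above_average(rates):
    """Map each country with an observed rate to True if it is above the sample mean."""
    observed = {country: rate for country, rate in rates.items() if rate is not None}
    average = mean(observed.values())
    return {country: rate > average for country, rate in observed.items()}

response = binarize_above_average(suicide_per_100k)
print(response)
```

The mean here is taken over the countries that have a value, so the cutoff moves if rows are dropped for other reasons (for example, missing explanatory variables).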


Alcohol consumption was the first variable to split the sample into two subgroups. The threshold was 16.19 litres, and countries above it were likely to have an above-average suicide rate. The next split was on residential electricity consumption: below 230 kWh, a higher suicide rate was likely. This was followed by the armed forces rate: if less than 41% of the labor force served in the armed forces, the likelihood was higher. Alcohol consumption and employment rate came next: high alcohol consumption (above 14 litres) combined with a low employment rate (below 49%) was associated with a likely above-average suicide rate. Countries with low alcohol consumption (below 14 litres) and low female employment rates (below 56%) had a lower likelihood of a below-average suicide rate. However, when this was combined with low internet use (below 85%), an above-average suicide rate was likely.
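
To make the idea of a split threshold concrete, the sketch below searches for the cut point on a single variable that best separates two classes. It assumes Gini impurity as the split criterion, which the post does not state, and it uses made-up alcohol values and labels; `gini` and `best_threshold` are hypothetical helper names, not the code used for the analysis above.

```python
def gini(labels):
    """Gini impurity of a list of booleans (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut point between identical values
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Illustrative data only: alcohol consumption (litres) and above-average flag.
alcohol = [2.0, 5.0, 9.0, 12.0, 17.0, 18.5]
above_avg = [False, False, False, True, True, True]

threshold, impurity = best_threshold(alcohol, above_avg)
print(threshold, impurity)
```

A full decision tree repeats this search over every explanatory variable at each node and then recurses into the two resulting subgroups, which is how the same variable (such as alcohol consumption) can appear at more than one level.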

The total model classified 64% of the sample correctly.
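
For reference, a classification rate like this is the share of observations whose predicted class matches the actual class. The sketch below uses invented labels and a hypothetical `accuracy` helper; the post does not say whether the 64% was measured on the training sample or on held-out data, so no such distinction is made here.

```python
def accuracy(actual, predicted):
    """Fraction of observations where the predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Illustrative labels only (True = above-average suicide rate).
actual = [True, False, True, True, False, False, True, False]
predicted = [True, False, False, True, False, True, True, True]

print(accuracy(actual, predicted))
```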

