Tukeys Blog: January 2017

Thursday, January 26, 2017

K-Means Cluster Analysis on Outlook on Life Survey Data

The Outlook on Life Survey (OOL) was designed to study political and social attitudes in the United States. The project included two surveys fielded between August and December 2012 using a sample from an Internet panel. A total of 2,294 respondents participated in this study during Wave 1 and 1,601 were interviewed during Wave 2. There are 436 variables in this study.

Note that full access to the data archive is available to members on the ICPSR site.

Our goal in this study is to use K-Means cluster analysis to see whether we can identify a set of variables that relates to how angry people feel about the way things are going in the country these days (W1_B4).

The variables we selected as the cluster variables are:

Variable	Description
W1_H1	Society has reached the point where Blacks and Whites have equal opportunities for achievement
W1_O5	[Black people should teach their children to be careful around the police] How much emphasis or de-emphasis should Black people place on each statement in the education of their children?
PPINCIMP	Household Income
W1_F4_D	[To become wealthy] For yourself and people like you, how easy or hard is it to reach these goals?
W1_F6	How far along the road to your American Dream do you think you will ultimately get on a 10-point scale where 1 is not far at all and 10 nearly there?
W1_M5	How often do you attend religious services?

We validate our clusters with the variable W1_B4 (Description - Generally speaking, how angry do you feel about the way things are going in the country these days?), which is excluded from the clustering itself. We treat this variable as a validation measure and expect to see some differences on it amongst our clusters.

While doing the cluster analysis we noticed the following: (a) in some cases, when more data was included for determining the clusters or the cluster analysis was repeated, the cluster means for our measure variable came quite close to one another; (b) increasing the number of clusters also had an impact on the cluster means and on the significance of the differences amongst them.

First, we ran a simulation to determine the effect of the number of clusters on the average distance of the points from their centroids. As the number of clusters increases, the average distance decreases, so the choice of the number of clusters is largely subjective. For our analysis, we will go with three clusters.
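The relationship between the number of clusters and the average distance to the centroids can be sketched in plain Python. This is only an illustration, not the analysis code: it uses synthetic two-dimensional points in place of the survey variables, a hand-written k-means loop, and helper names (`kmeans`, `average_distance`) that are assumptions of this sketch.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]  # an empty cluster keeps its centroid
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def average_distance(points, centroids):
    """Mean distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


# Synthetic stand-in data (an assumption): three noisy groups of 100 points.
rng = random.Random(42)
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in [(0, 0), (10, 10), (20, 0)]
    for _ in range(100)
]

elbow = {k: average_distance(points, kmeans(points, k)) for k in range(1, 10)}
for k, distance in elbow.items():
    print(k, round(distance, 3))
```

Plotting the printed values against k gives the usual "elbow" curve; the point where the curve flattens is a judgment call, which is why the choice of three clusters is partly subjective.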

Here is the plot of the clusters after applying PCA to reduce the clustering variables to two dimensions:

The split amongst the three clusters is as follows (the cluster IDs are 0.0, 1.0 and 2.0):

Cluster	Respondents
0.0	419
2.0	355
1.0	348

Our next step is to fit a simple least squares regression with W1_B4 as the dependent variable and the cluster number we have determined for each point as the explanatory variable. The results of that are:

The mean and standard deviation for W1_B4 in the three clusters are:

Cluster	Mean of W1_B4	Standard deviation of W1_B4
0.0	2.916468	1.307026
1.0	3.129310	1.192171
2.0	3.008451	1.224716

Because we had more than two clusters, we decided to conduct a Tukey test to determine which pairs of cluster means differed significantly. It seems that cluster 0.0 and cluster 1.0 had means that were significantly different (the null hypothesis of equal means was rejected for this pair).
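As a rough cross-check, the test statistics behind this step can be recomputed from the rounded cluster sizes, means and standard deviations reported above. The sketch below is a reconstruction under stated assumptions, not the original output: it assumes the standard deviations are sample standard deviations (ddof=1) and computes only the one-way F statistic and the Tukey-Kramer studentized range statistic for each pair, without looking up critical values or p-values.

```python
import math
from itertools import combinations

# Summary statistics copied from the text above: cluster -> (n, mean, std).
# Assumption: the standard deviations are sample standard deviations (ddof=1).
clusters = {
    "0.0": (419, 2.916468, 1.307026),
    "1.0": (348, 3.129310, 1.192171),
    "2.0": (355, 3.008451, 1.224716),
}

total_n = sum(n for n, _, _ in clusters.values())
grand_mean = sum(n * mean for n, mean, _ in clusters.values()) / total_n

# Between-cluster and within-cluster mean squares.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in clusters.values())
ss_within = sum((n - 1) * std ** 2 for n, _, std in clusters.values())
ms_between = ss_between / (len(clusters) - 1)
ms_within = ss_within / (total_n - len(clusters))
f_statistic = ms_between / ms_within
print("F statistic:", round(f_statistic, 3))

# Tukey-Kramer studentized range statistic for every pair of clusters.
q_statistics = {}
for (a, (n_a, mean_a, _)), (b, (n_b, mean_b, _)) in combinations(clusters.items(), 2):
    standard_error = math.sqrt(ms_within / 2 * (1 / n_a + 1 / n_b))
    q_statistics[(a, b)] = abs(mean_a - mean_b) / standard_error

for pair, q in sorted(q_statistics.items(), key=lambda item: -item[1]):
    print(pair, round(q, 3))
```

The largest q value belongs to the 0.0 / 1.0 pair, which is consistent with that pair being the one flagged by the Tukey test. Each q value would still need to be compared against the critical value of the studentized range distribution for three groups.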

Then we compared these two clusters on the means of the different clustering variables within them.

The following is the mean of the various clustering variables in their respective clusters:

The W1_B4 mean is lower for cluster 0.0, indicating that people in cluster 0.0 are generally more angry about the situation in the country than those in cluster 1.0 (lower values in the survey correspond to higher anger felt by the respondent). We then look deeper into the means of the clustering variables. W1_F4_D is lower for cluster 0.0; a lower value indicates that the respondents think it is hard to become wealthy in this country. A lower value of W1_H1 in cluster 0.0 indicates that the respondents agree much more that Blacks and Whites have equal opportunities for achievement. A lower value of W1_M5 for cluster 0.0 indicates that respondents there attend religious services more frequently. A higher value of W1_O5 for cluster 0.0 indicates that respondents place strong emphasis on Black people teaching their children to be careful around the police. Cluster 0.0 also has a lower average household income.

So what can we establish? It seems that a person is more likely to be angry about the situation in this country when their income is lower and they generally do not believe that their wealth can significantly increase in this country (even though they believe that Blacks and Whites have equal opportunities). The higher likelihood seems connected with lower income, a more religious bent of mind and a general wariness of law enforcement.

The code is included below:

Sunday, January 22, 2017

Lasso Regression for U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) data

Background on the data used here (NESARC) was described in an earlier post on this blog. In that post, we conducted a Random Forest analysis for the boolean response variable EVER GAMBLED 5+ TIMES IN ANY ONE YEAR (S12Q1). In this post we run a Lasso Regression on the same response. Before we dive into the analysis, one observation about Lasso, which fits a linear regression to the response variable: the results differ depending on whether the response is coded as an integer or as a boolean. With a boolean response, the results are quite similar to those of the Random Forest analysis, which also used a boolean response. With an integer response, which makes more sense in a linear regression (e.g., Lasso, where MSE is calculated), the list of important variables differs from the one reported by Random Forest. This requires more analysis.

The following variables (listed below in order of importance) were identified as the most important because their coefficients were the largest in absolute value in this regression. The coefficients are comparable because we scaled the input variables to zero mean and unit variance. The most important variables were: whether the individual had panic attacks (S6Q1), had received tickets for reckless driving (S11AQ1A15), or found it hard to be open even with people close to him/her (S10Q1A3). The rest of the variables with non-zero coefficients are listed as well, and the variables whose coefficients Lasso reported as zero are briefly covered further below.

Variable	Description	Coefficients
S6Q1	HAD PANIC ATTACK, SUDDENLY FELT FRIGHTENED/OVERWHELMED/NERVOUS AS IF IN GREAT DANGER BUT WERE NOT	0.226163
S11AQ1A15	EVER GET MORE THAN 3 TICKETS FOR RECKLESS/CARELESS DRIVING, SPEEDING, OR CAUSING AN ACCIDENT	0.175727
S10Q1A3	FIND IT HARD TO BE "OPEN" EVEN WITH PEOPLE YOU ARE CLOSE TO	0.125577
S11AQ1A25	EVER DO SOMETHING YOU COULD HAVE BEEN ARRESTED FOR, REGARDLESS OF WHETHER YOU WERE CAUGHT OR NOT	0.085541
S11AQ1A2	EVER STAY OUT LATE AT NIGHT EVEN THOUGH PARENTS TOLD YOU TO STAY HOME	0.068325
S11AQ1A22	EVER SHOPLIFT	0.067515
S11AQ1A1	OFTEN CUT CLASS, NOT GO TO CLASS OR GO TO SCHOOL AND LEAVE WITHOUT PERMISSION	0.067146
S11AQ1A14	EVER DO THINGS THAT COULD EASILY HAVE HURT YOU OR SOMEONE ELSE, LIKE SPEEDING OR DRIVING AFTER HAVING TOO MUCH TO DRINK	0.054445
S9Q1A	EVER HAD 6+ MONTH PERIOD FELT TENSE/NERVOUS/WORRIED MOST OF TIME	0.047805
S10Q1A16	THE KIND OF PERSON WHO FOCUSES ON DETAILS/ORDER/ORGANIZATION OR LIKES TO MAKE LISTS AND SCHEDULES	0.047434
S1Q1G	NUMBER OF YEARS LIVED IN UNITED STATES	-0.030454
SMOKER	TOBACCO USE STATUS	0.026548
S11BQ1	BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS	0.024917
AGE	CYEAR (DATE OF INTERVIEW: YEAR) - DOBY (DATE OF BIRTH: YEAR)	-0.024049
S2AQ5G	HOW OFTEN DRANK 5+ BEERS IN LAST 12 MONTHS	0.022640
S10Q1A25	HAVE OTHERS TOLD YOU THAT YOU ARE STUBBORN OR RIGID	0.019741
S3AQ3B2	USUAL FREQUENCY WHEN SMOKED CIGARS	-0.015306
S1Q1D5	"WHITE" CHECKED IN MULTIRACE CODE	-0.012658
S1Q10A	TOTAL PERSONAL INCOME IN LAST 12 MONTHS	-0.012306
MARITAL	CURRENT MARITAL STATUS	0.011049
DGSTATUS	DRUG USE STATUS	-0.009435
S3AQ3B1	USUAL FREQUENCY WHEN SMOKED CIGARETTES	0.008346
S1Q9B	OCCUPATION: CURRENT OR MOST RECENT JOB	-0.007877
CHLD0_17	NUMBER OF CHILDREN UNDER AGE 18 IN HOUSEHOLD	0.007562
S2AQ10	HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS	-0.007253
S10Q1A43	ARE THERE VERY FEW PEOPLE YOU'RE REALLY CLOSE TO OUTSIDE OF IMMEDIATE FAMILY	0.007089
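Ranking predictors by the size of their coefficients relies on the scaling step mentioned earlier. A minimal sketch of that step follows, using the standard library only; the `standardize` helper and the two sample columns are hypothetical and stand in for the NESARC predictors.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a column to zero mean and unit variance."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    return [(value - center) / spread for value in values]


# Hypothetical raw columns measured on very different scales.
income = [12000, 35000, 58000, 91000]
age = [21, 34, 47, 60]

for name, column in {"income": income, "age": age}.items():
    scaled = standardize(column)
    print(name, [round(value, 3) for value in scaled])
```

After this transformation a one-unit change means "one standard deviation" for every predictor, so coefficient magnitudes can be compared across variables.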

Variables reported with zero coefficients. These could be correlated with the variables mentioned above that have non-zero coefficients (Lasso tends to keep only one of a group of correlated variables, somewhat arbitrarily), or unrelated to the response variable:

Variable	Description	Possible Explanation
S10Q1A52	THE SORT OF PERSON WHO DOESN'T CARE ABOUT WHAT PEOPLE THINK OF YOU	Seems Correlated
S10Q1A58	FLIRT A LOT	Seems Correlated
S2AQ12F	HOW OFTEN DROVE MOTOR VEHICLE AFTER 3+ DRINKS IN LAST 12 MONTHS	Seems Correlated
S10Q1A47	HAVE ALMOST ALWAYS PREFERRED TO DO THINGS ALONE RATHER THAN WITH OTHERS	Seems Correlated
S10Q1A46	TAKE LITTLE PLEASURE IN BEING WITH OTHERS	Seems Correlated
S10Q1A45	WOULD BE JUST HAPPY WITHOUT HAVING ANY CLOSE RELATIONSHIP	Seems Correlated
S2AQ12B	HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS	Seems Correlated
S10Q1A32	OFTEN GET ANGRY OR LASH OUT WHEN SOMEONE CRITICIZES OR INSULTS YOU	Seems Correlated
S10Q1A22	HARD TO LET OTHERS HELP IF THEY DON'T AGREE TO DO THINGS EXACTLY THE WAY YOU WANT	Seems Correlated
S2AQ9	HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)	Seems Correlated
NUMPER18	NUMBER OF PERSONS 18 YEARS AND OLDER IN HOUSEHOLD	Seems Unrelated
NUMPERS	NUMBER OF PERSONS IN HOUSEHOLD	Seems Correlated/Unrelated

The training and test R-squared values were 0.670503947 and 0.66454160107 respectively, which are fairly high and indicate a good fit of the model to both the training and test data. The training and test mean squared errors were 0.474361978873 and 0.495379372341 respectively, which are fairly low in both cases as well. The MSE plot is as follows and we can see that the MSE is successively going down as the alpha value increases.

The plot for the progression of the coefficients as the variables are added one-by-one in the Lasso Regression is shown below.
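For reference, the two fit measures quoted above are defined as follows. This is a small sketch with hypothetical numbers, not the survey data; the function names are chosen for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    average = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - average) ** 2 for a in actual)
    return 1 - residual / total


# Hypothetical observed and predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print("MSE:", round(mse, 4))
print("R-squared:", round(r2, 4))
```

Computing both on the training and the test split, as done in this post, shows whether the model generalizes: similar values on both splits suggest little overfitting.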

Python code is included below:

Friday, January 20, 2017

Random Forest Analysis of U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) data

In 2001/2002, the National Institute on Alcohol Abuse and Alcoholism (NIAAA) conducted the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), the largest and most ambitious comorbidity (simultaneous presence of two or more chronic conditions in a patient) study ever conducted. In addition to an extensive battery of questions addressing present and past alcohol consumption, alcohol use disorders (AUDs), and utilization of alcohol treatment services, NESARC included similar sets of questions related to tobacco and illicit drug use (including nicotine dependence and drug use disorders).

The unprecedented sample size of NESARC (n = 43,093) made it possible to achieve stable estimates of even rare conditions. Moreover, its oversampling of Blacks and Hispanics as well as the inclusion of Hawaii and Alaska in its sampling frame yielded enough minority respondents to make NESARC an ideal vehicle for addressing the critical issue of race and/or ethnic disparities in comorbidity and access to health care services.

NESARC studies the occurrence of more than one psychological disorder or substance use disorder in the same person. In this analysis I have utilized the data collected in the survey for gambling and tried to determine its linkage to other predictors like alcoholic tendencies or drug use or other personality traits that can be detected. General information about the survey can be found here.

The following response variable was predicted: EVER GAMBLED 5+ TIMES IN ANY ONE YEAR (S12Q1).

The following variables (listed below in order of importance) were amongst those included as explanatory variables in the Random Forest. The accuracy of the random forest was 74%, with the maximum accuracy reached at around 12 trees in the forest. Further increases in the number of estimators (trees) did little to increase the overall accuracy of the model. On closer examination of the variables the model found important, we can see that we can predict the gambling tendencies of an individual by examining their alcohol consumption patterns, personality and drug use, amongst other things.

Variable	Description	Score
S1Q1G	NUMBER OF YEARS LIVED IN UNITED STATES	0.059010
S1Q10A	TOTAL PERSONAL INCOME IN LAST 12 MONTHS	0.057991
AGE	CYEAR (DATE OF INTERVIEW: YEAR) - DOBY (DATE OF BIRTH: YEAR)	0.056678
S1Q1E	ORIGIN OR DESCENT	0.048400
S1Q9B	OCCUPATION: CURRENT OR MOST RECENT JOB	0.046075
BUILDTYP	TYPE OF BUILDING FOR HOUSEHOLD	0.036685
MARITAL	CURRENT MARITAL STATUS	0.029524
SMOKER	TOBACCO USE STATUS	0.028936
NUMPERS	NUMBER OF PERSONS IN HOUSEHOLD	0.027357
S2AQ12B	HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS	0.025185
S10Q1A52	THE SORT OF PERSON WHO DOESN'T CARE ABOUT WHAT PEOPLE THINK OF YOU	0.024643
S2AQ10	HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS	0.024418
S10Q1A16	THE KIND OF PERSON WHO FOCUSES ON DETAILS/ORDER/ORGANIZATION OR LIKES TO MAKE LISTS AND SCHEDULES	0.023065
CHLD0_17	NUMBER OF CHILDREN UNDER AGE 18 IN HOUSEHOLD	0.022352
S10Q1A20	OTHERS THINK YOU HAVE UNREASONABLY HIGH STANDARDS/MORALS/IDEAS ABOUT RIGHT AND WRONG	0.022323
S10Q1A43	ARE THERE VERY FEW PEOPLE YOU'RE REALLY CLOSE TO OUTSIDE OF IMMEDIATE FAMILY	0.022088
S10Q1A25	HAVE OTHERS TOLD YOU THAT YOU ARE STUBBORN OR RIGID	0.021601
S11AQ1A2	EVER STAY OUT LATE AT NIGHT EVEN THOUGH PARENTS TOLD YOU TO STAY HOME	0.020961
NUMPER18	NUMBER OF PERSONS 18 YEARS AND OLDER IN HOUSEHOLD	0.020405
S11AQ1A1	OFTEN CUT CLASS, NOT GO TO CLASS OR GO TO SCHOOL AND LEAVE WITHOUT PERMISSION	0.020345
S10Q1A45	WOULD BE JUST HAPPY WITHOUT HAVING ANY CLOSE RELATIONSHIP	0.020180
NUMREL18	NUMBER OF RELATED PERSONS 18 YEARS AND OLDER IN HOUSEHOLD	0.019861
S2AQ5G	HOW OFTEN DRANK 5+ BEERS IN LAST 12 MONTHS	0.019397
S3AQ3B1	USUAL FREQUENCY WHEN SMOKED CIGARETTES	0.019338
S11BQ1	BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS	0.018869
S10Q1A47	HAVE ALMOST ALWAYS PREFERRED TO DO THINGS ALONE RATHER THAN WITH OTHERS	0.018636
S11AQ1A25	EVER DO SOMETHING YOU COULD HAVE BEEN ARRESTED FOR, REGARDLESS OF WHETHER YOU WERE CAUGHT OR NOT	0.018017
S10Q1A22	HARD TO LET OTHERS HELP IF THEY DON'T AGREE TO DO THINGS EXACTLY THE WAY YOU WANT	0.017820
DGSTATUS	DRUG USE STATUS	0.017671
S1Q1D5	"WHITE" CHECKED IN MULTIRACE CODE	0.017200
S6Q1	HAD PANIC ATTACK, SUDDENLY FELT FRIGHTENED/OVERWHELMED/NERVOUS AS IF IN GREAT DANGER BUT WERE NOT	0.016674
S11AQ1A14	EVER DO THINGS THAT COULD EASILY HAVE HURT YOU OR SOMEONE ELSE, LIKE SPEEDING OR DRIVING AFTER HAVING TOO MUCH TO DRINK	0.016527
S2AQ12F	HOW OFTEN DROVE MOTOR VEHICLE AFTER 3+ DRINKS IN LAST 12 MONTHS	0.016235
S10Q1A58	FLIRT A LOT	0.015546
S3AQ3B2	USUAL FREQUENCY WHEN SMOKED CIGARS	0.014674
S2AQ9	HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY)	0.014626
S9Q1A	EVER HAD 6+ MONTH PERIOD FELT TENSE/NERVOUS/WORRIED MOST OF TIME	0.014188
S11AQ1A22	EVER SHOPLIFT	0.014131
S10Q1A3	FIND IT HARD TO BE "OPEN" EVEN WITH PEOPLE YOU ARE CLOSE TO	0.013714
S10Q1A32	OFTEN GET ANGRY OR LASH OUT WHEN SOMEONE CRITICIZES OR INSULTS YOU	0.013670
S10Q1A46	TAKE LITTLE PLEASURE IN BEING WITH OTHERS	0.012579
S11AQ1A15	EVER GET MORE THAN 3 TICKETS FOR RECKLESS/CARELESS DRIVING, SPEEDING, OR CAUSING AN ACCIDENT	0.012403

The code and output are included below:

Confusion Matrix:

[[11983   892]
 [ 3540   823]]

Accuracy Score is: 0.742893607147
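The accuracy score follows directly from the confusion matrix. The short sketch below recomputes it and adds two related figures. It assumes that rows are actual classes and columns are predicted classes, with class 0 first (the scikit-learn convention); the output above does not state the layout, so the recall and class-share figures depend on that assumption.

```python
# Confusion matrix reported above.
# Assumption: rows = actual class, columns = predicted class, class 0 first.
matrix = [[11983, 892],
          [3540, 823]]

(true_neg, false_pos), (false_neg, true_pos) = matrix
total = true_neg + false_pos + false_neg + true_pos

# Share of all predictions that were correct.
accuracy = (true_neg + true_pos) / total
# Share of actual class-1 cases that the model identified.
recall_positive = true_pos / (false_neg + true_pos)
# Share of the larger actual class, i.e. the accuracy of always guessing it.
majority_share = max(sum(row) for row in matrix) / total

print("accuracy:", accuracy)
print("recall for class 1:", recall_positive)
print("share of the larger actual class:", majority_share)
```

Comparing the accuracy with the share of the larger class, and looking at the class-1 recall, gives a sense of how much the forest adds beyond always predicting the majority class.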

Monday, January 16, 2017

The quick and dirty on getting a Hadoop cluster up and running

The last time I tested out a Hadoop cluster was about four years ago. At the time, the setup was manual and I used two machines to set up a two-node cluster. I was able to run MapReduce jobs and test the system out. Fast forward to 2017 - there are a lot more animals in the Hadoop circus, and Cloudera has made it very convenient by offering various options to test - a quickstart VM, Docker images etc. Still, getting the system up and running and executing the first Sqoop job took some effort. I have documented what I did, including the tweaks, so anyone running into the same issues can get help. Here are the steps:

Get a system ready - at least 16GB of RAM. I had a Linux Mint box on hand that I used. Mint is generally similar to Ubuntu but you do have to watch out for version-specific instructions.
The next step is to install Docker. For Linux Mint 18, the steps here from Simon Hardy came in really handy.
Cloudera provides a Docker Quickstart container - do not use that. There is a lot of documentation and links on that and it's quite easy to go down that path. A better option is to use Cloudera Clusterdock. Clusterdock is a multi-node cluster deployment on the same Docker host (by default it creates a two-node cluster). The clusterdock documentation is very useful but there are a few catches that I will note here. There are a few other links on clusterdock here:

The Cloudera online tutorial is based on the quickstart Docker container or the quickstart VM. There are several dependencies including a MySQL database, Flume files that are used in the demo etc. I would suggest that you also keep that container around for a bit and copy the data as needed to the clusterdock nodes. The clusterdock setup is much more stable in Cloudera Manager and the slight inconvenience (occasional hardware freezing) may be worth it. After launching the quickstart container, you may simply tar the /opt/examples folder and the MySQL retail_db database, transfer them to the host machine using the docker cp command and then kill the quickstart container.
The clusterdock.sh script on the Cloudera website lacks 'sudo' in a couple of places. Be aware of that. It's easy to spot in case it causes a problem. For example, the ssh function has this problem.
I wanted to run the sqoop command and for that I needed a MySQL database to connect to. There was one installed on the host machine, but connecting the clusterdock containers to the host MySQL was almost impossible, or at least looked very time consuming: you run into a Docker networking issue. The easy way out is to install a MySQL Docker container and put it on the same user-defined network that the clusterdock nodes use. Note that you have to force the IP and the network on the MySQL container to do that and also map the MySQL port 3306 so it's open for access. Else you will waste a lot of time!

The next step is to ssh into the master node and run a Sqoop job. At this point you will run into a lot of permissions issues if you are not careful about where you store the imported target files. Sqoop will generally report exception stack traces for these permissions errors. The best approach is to google the error and fix any paths you give to the sqoop command.
You also need to copy the MySQL JDBC driver jar file into the Sqoop lib folder. The easiest way is to get it on the host box and then use the docker cp command to move it to the desired location.
That should do it - get you past the sqoop step and then you can run a query in Hue.
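The MySQL container and driver steps above can be sketched as follows. The snippet is written in Python only to assemble and print the two Docker commands; nothing is executed. Every name in it (network name, IP address, container names, image tag, jar file name and target folder) is a placeholder assumed for illustration and must be replaced with the values from your own clusterdock setup. The `docker run` options used (`--network`, `--ip`, `-p`, `-e`) and `docker cp` are standard Docker CLI features.

```python
import shlex

# Every value below is a placeholder (assumption); substitute your own.
NETWORK = "cluster"                 # user-defined network of the clusterdock nodes
MYSQL_IP = "192.168.123.50"         # free address inside that network's subnet
MASTER = "node-1.cluster"           # container name of the master node
JAR = "mysql-connector-java.jar"    # MySQL JDBC driver downloaded to the host
SQOOP_LIB = "/var/lib/sqoop/"       # folder where Sqoop looks for extra jars

commands = [
    # MySQL container pinned to the clusterdock network with port 3306 published.
    ["docker", "run", "-d", "--name", "mysql-retail",
     "--network", NETWORK, "--ip", MYSQL_IP,
     "-p", "3306:3306", "-e", "MYSQL_ROOT_PASSWORD=changeme", "mysql:5.7"],
    # Copy the JDBC driver from the host into the master node container.
    ["docker", "cp", JAR, f"{MASTER}:{SQOOP_LIB}"],
]

for command in commands:
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before running them by hand (with sudo if your Docker setup needs it).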

Saturday, January 14, 2017

Analyzing Gapminder Data

Founded in Stockholm by Ola Rosling, Anna Rosling Rönnlund and Hans Rosling, Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas.

Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, US Census Bureau's International Database, United Nations Statistics Division, and the World Bank.

Decision tree analysis was performed (using Python) to test nonlinear relationships among a series of explanatory variables (X[0] to X[12] below) and a binary, categorical response variable (above average suicides per 100,000 in 2005). Note that the Python decision tree implementation used here does not support pruning. The code is included below the analysis.

Index	Variable	Description
X[0]	income per person	2010 Gross Domestic Product per capita in constant 2000 US$
X[1]	alcohol consumption	2008 alcohol consumption per adult (age 15+), litres
X[2]	armed forces rate	Armed forces personnel (% of total labor force)
X[3]	breast cancer per 100th	2002 breast cancer new cases per 100,000 female
X[4]	co2 emissions	2006 cumulative CO2 emission (metric tons)
X[5]	female employment rate	2007 female employees age 15+ (% of population)
X[6]	HIV rate	2009 estimated HIV Prevalence % - (Ages 15-49)
X[7]	internet use rate	2010 Internet users (per 100 people)
X[8]	oil per person	2010 oil Consumption per capita (tonnes per year and person)
X[9]	polity score	2009 Democracy score (Polity)
X[10]	residential electricity consumption per person	2008 residential electricity consumption, per person (kWh)
X[11]	employment rate	2007 total employees age 15+ (% of population)
X[12]	urban rate	2008 urban population (% of total)
response	suicide per 100th	Is the 2005 Suicide, age adjusted, per 100,000 above average? (True or False)
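The binary response in the last row can be derived by comparing each country's rate with the sample average. A minimal sketch, using made-up country names and rates purely for illustration (they are not Gapminder values):

```python
from statistics import mean

# Hypothetical age-adjusted suicide rates per 100,000 (illustrative only).
suicide_rate = {
    "Country A": 4.2,
    "Country B": 9.8,
    "Country C": 15.1,
    "Country D": 22.6,
}

# The response is True when a country's rate exceeds the sample average.
average_rate = mean(list(suicide_rate.values()))
above_average = {
    country: rate > average_rate for country, rate in suicide_rate.items()
}

for country, flag in above_average.items():
    print(country, flag)
```

The resulting True/False column is what the decision tree is trained to predict from the explanatory variables X[0] to X[12].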

Alcohol consumption was the first variable to separate the sample into two subgroups. The threshold was identified to be 16.19 litres, and countries above it were likely to have an above average suicide rate. The next split was the amount of electricity consumed in residences: if the consumption was below 230 kWh, higher suicide rates were likely. This was followed by the armed forces rate: if less than 41% of the labor force was part of the armed forces, the likelihood was higher. Alcohol consumption and employment rate come next: higher alcohol consumption (above 14 litres) together with a low employment rate (below 49%) pointed to an above average suicide rate. Countries with low alcohol consumption (below 14 litres) and low female employment rates (below 56%) were likely to have a below average suicide rate. However, when this was combined with low internet use (below 85 users per 100 people), an above average suicide rate became likely.

The total model classified 64% of the sample correctly.