Thursday, January 26, 2017

K-Means Cluster Analysis on Outlook on Life Survey Data

The Outlook on Life Survey (OOL) was designed to study political and social attitudes in the United States. The project included two surveys fielded between August and December 2012 using a sample from an Internet panel. A total of 2,294 respondents participated in this study during Wave 1 and 1,601 were interviewed during Wave 2. There are 436 variables in this study.

Note that full access to the data archive is available to members on the ICPSR site.

Our goal in this study is to use K-Means cluster analysis to determine whether we can identify a set of variables that captures how angry people feel about the way things are going in the country these days (W1_B4).

The variables we selected as the cluster variables are:

Variable   Description
W1_H1      Society has reached the point where Blacks and Whites have equal opportunities for achievement
W1_O5      [Black people should teach their children to be careful around the police] How much emphasis or de-emphasis should Black people place on each statement in the education of their children?
PPINCIMP   Household Income
W1_F4_D    [To become wealthy] For yourself and people like you, how easy or hard is it to reach these goals?
W1_F6      How far along the road to your American Dream do you think you will ultimately get on a 10-point scale where 1 is not far at all and 10 nearly there?
W1_M5      How often do you attend religious services?


We are going to validate our clusters by excluding the variable W1_B4 (Description - Generally speaking, how angry do you feel about the way things are going in the country these days?) from the clustering. We treat this variable as a validation measure, and we expect to see some differences in it among our clusters.

While doing the cluster analysis we noticed the following: in some cases, when more data was included for determining the clusters or the cluster analysis was repeated, the cluster means for our measure variable came quite close to one another. In addition, increasing the number of clusters also had an impact on the cluster means and on the significance of the differences among them.

First, we ran a simulation to determine the effect of the number of clusters on the average distance of the points from their cluster centroids. As we can see, the average distance decreases as the number of clusters increases, so the choice of the number of clusters is a subjective one. For our analysis, we will go with three clusters.
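The kind of simulation described above can be sketched in plain Python. This is only an illustration, not the code used for the post: the points are made up, the helper names are hypothetical, and the deterministic farthest-first initialization is a choice made here so that the result is reproducible (library implementations usually start from random or k-means++ seeds).

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from its nearest seed."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def average_distance(points, centroids):
    """Mean distance from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points) / len(points)

# Made-up data: three small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]

curve = {k: average_distance(points, kmeans(points, k)) for k in range(1, 7)}
for k, d in curve.items():
    print(k, round(d, 3))
```

On this toy data the average distance drops sharply up to three clusters and only slowly afterwards. On real survey data the bend is usually much less obvious, which is why the choice ends up being a judgment call.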



Here is the plot of the clusters after applying PCA to reduce the clustering variables to two dimensions:
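For readers unfamiliar with the PCA step, here is a minimal sketch of projecting a table of clustering variables onto its first two principal components. It assumes NumPy is available and uses a made-up matrix; `pca_2d` is a hypothetical helper, not a function from the original analysis, and it centers the data but does not rescale it.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

# Made-up data: 6 respondents by 4 variables.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [3.0, 4.0, 1.0, 2.0],
    [4.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 5.0],
])

coords = pca_2d(X)
print(coords.shape)
```

Each row of `coords` is one respondent's position in the two-dimensional plot, and coloring the points by cluster label gives a scatter plot of the kind described here.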





The split among the three clusters is as follows (the cluster IDs are 0.0, 1.0 and 2.0):

0.0    419
2.0    355
1.0    348

Our next step is to fit a simple least squares regression with W1_B4 as the dependent variable and the cluster number we determined for each point as the predictor. The results are:
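If the cluster number enters the regression as a categorical predictor (an assumption here, since the model formula is not shown), the overall F statistic of the least squares fit is the same as the one-way ANOVA F statistic. The sketch below computes it by hand on made-up scores; `one_way_f` is a hypothetical helper.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a dict mapping group -> list of values."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        # Variation of the group mean around the grand mean.
        ss_between += len(vals) * (group_mean - grand_mean) ** 2
        # Variation of the observations around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up anger scores for three clusters.
anger_by_cluster = {0: [1, 2, 3], 1: [2, 3, 4], 2: [5, 6, 7]}
f_stat = one_way_f(anger_by_cluster)
print(round(f_stat, 3))
```

A large F value says that at least one cluster mean differs from the others, but not which one, which is why a post hoc comparison is still needed.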



The means and standard deviations for W1_B4 in the three clusters are:

means for W1_B4 by cluster
            W1_B4
cluster
0.0      2.916468
1.0      3.129310
2.0      3.008451

standard deviations for W1_B4 by cluster
            W1_B4
cluster
0.0      1.307026
1.0      1.192171
2.0      1.224716
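Per-cluster summaries like these can be computed with the standard library alone. The sketch below uses made-up (cluster, score) pairs rather than the survey data, and `statistics.stdev` is the sample standard deviation (n - 1 in the denominator).

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up (cluster, W1_B4 score) pairs.
rows = [(0.0, 1), (0.0, 2), (0.0, 3),
        (1.0, 2), (1.0, 4), (1.0, 3),
        (2.0, 5), (2.0, 5), (2.0, 2)]

# Collect the scores belonging to each cluster.
by_cluster = defaultdict(list)
for cluster, score in rows:
    by_cluster[cluster].append(score)

# Mean and sample standard deviation per cluster.
summary = {c: (mean(v), stdev(v)) for c, v in sorted(by_cluster.items())}
for cluster, (m, s) in summary.items():
    print(cluster, round(m, 6), round(s, 6))
```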

Because we had more than two clusters, we conducted a Tukey HSD test to determine which pairs of cluster means differed significantly. It seems that cluster 0.0 and cluster 1.0 had means that were significantly different (the null hypothesis of equal means was rejected for this pair).
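A pairwise comparison of this kind can be sketched with `scipy.stats.tukey_hsd`, assuming a SciPy version that provides it (1.8 or later). This may not be the routine used for the post, and the three samples below are made up so that only the third group stands apart.

```python
from scipy.stats import tukey_hsd

# Made-up scores for three groups; only group_c is clearly different.
group_a = [1, 2, 3, 2, 1, 3]
group_b = [1, 2, 3, 2, 2, 3]
group_c = [11, 12, 13, 12, 11, 13]

result = tukey_hsd(group_a, group_b, group_c)

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j.
alpha = 0.05
for i in range(3):
    for j in range(i + 1, 3):
        p = result.pvalue[i, j]
        verdict = "reject" if p < alpha else "fail to reject"
        print(f"groups {i} vs {j}: p = {p:.4f} -> {verdict}")
```

Each pair gets its own adjusted p-value, so the test can flag one pair of clusters as different while leaving the other pairs indistinguishable.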









Then we compared these two clusters on the means of the clustering variables within them.

The following are the means of the clustering variables in their respective clusters:


The W1_B4 mean is lower for cluster 0.0, indicating that people in cluster 0.0 are generally angrier about the situation in the country than those in cluster 1.0 (lower values in the survey correspond to greater anger felt by the respondent). Looking deeper into the cluster variable means, we find that W1_F4_D is lower for cluster 0.0; a lower value indicates that respondents think it is hard to become wealthy in this country. A lower value of W1_H1 in cluster 0.0 indicates that respondents agree more strongly that Blacks and Whites have equal opportunities for achievement. A lower value of W1_M5 for cluster 0.0 indicates that respondents there attend religious services more frequently. A higher value of W1_O5 for cluster 0.0 indicates that respondents place strong emphasis on Black people teaching their children to be careful around the police. Cluster 0.0 also has a lower average household income.

So what can we establish? It seems that a person is more likely to be angry about the situation in this country when their income is lower and they generally do not believe that their wealth can increase significantly (even though they believe that Blacks and Whites have equal opportunities). The higher likelihood seems connected with lower income, a more religious bent of mind and a general wariness of law enforcement.

The code is included below:

