Friday, January 20, 2017

Random Forest Analysis of U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) Data

In 2001/2002, the National Institute on Alcohol Abuse and Alcoholism (NIAAA) conducted the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), the largest and most ambitious comorbidity (simultaneous presence of two or more chronic conditions in a patient) study ever conducted. In addition to an extensive battery of questions addressing present and past alcohol consumption, alcohol use disorders (AUDs), and utilization of alcohol treatment services, NESARC included similar sets of questions related to tobacco and illicit drug use (including nicotine dependence and drug use disorders).

The unprecedented sample size of NESARC (n = 43,093) made it possible to achieve stable estimates of even rare conditions. Moreover, its oversampling of Blacks and Hispanics as well as the inclusion of Hawaii and Alaska in its sampling frame yielded enough minority respondents to make NESARC an ideal vehicle for addressing the critical issue of race and/or ethnic disparities in comorbidity and access to health care services.

NESARC studies the occurrence of more than one psychological disorder or substance use disorder in the same person. In this analysis I used the gambling data collected in the survey and tried to determine how gambling is linked to predictors such as alcohol consumption, drug use, and personality traits. General information about the survey can be found here.

The following response variable was predicted: EVER GAMBLED 5+ TIMES IN ANY ONE YEAR (S12Q1).

The following variables (listed below in order of importance) were among the explanatory variables included in the random forest. The accuracy of the random forest was 74%, with the maximum accuracy reached at around 12 trees in the forest. Increasing the number of estimators (trees) further did little to improve the overall accuracy of the model. On closer examination of the variables the model found important, we can see that an individual's gambling tendencies can be predicted by examining their alcohol consumption patterns, personality, and drug use, among other things.


Variable    Description    Importance Score
S1Q1G       NUMBER OF YEARS LIVED IN UNITED STATES 0.059010
S1Q10A     TOTAL PERSONAL INCOME IN LAST 12 MONTHS 0.057991
AGE         CYEAR (DATE OF INTERVIEW: YEAR) - DOBY (DATE OF BIRTH: YEAR) 0.056678
S1Q1E       ORIGIN OR DESCENT 0.048400
S1Q9B       OCCUPATION: CURRENT OR MOST RECENT JOB 0.046075
BUILDTYP   TYPE OF BUILDING FOR HOUSEHOLD 0.036685
MARITAL     CURRENT MARITAL STATUS 0.029524
SMOKER     TOBACCO USE STATUS 0.028936
NUMPERS     NUMBER OF PERSONS IN HOUSEHOLD 0.027357
S2AQ12B     HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS 0.025185
S10Q1A52   THE SORT OF PERSON WHO DOESN'T CARE ABOUT WHAT PEOPLE THINK OF YOU 0.024643
S2AQ10     HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS 0.024418
S10Q1A16   THE KIND OF PERSON WHO FOCUSES ON DETAILS/ORDER/ORGANIZATION OR LIKES TO MAKE LISTS AND SCHEDULES 0.023065
CHLD0_17   NUMBER OF CHILDREN UNDER AGE 18 IN HOUSEHOLD 0.022352
S10Q1A20   OTHERS THINK YOU HAVE UNREASONABLY HIGH STANDARDS/MORALS/IDEAS ABOUT RIGHT AND WRONG 0.022323
S10Q1A43   ARE THERE VERY FEW PEOPLE YOU'RE REALLY CLOSE TO OUTSIDE OF IMMEDIATE FAMILY 0.022088
S10Q1A25   HAVE OTHERS TOLD YOU THAT YOU ARE STUBBORN OR RIGID 0.021601
S11AQ1A2   EVER STAY OUT LATE AT NIGHT EVEN THOUGH PARENTS TOLD YOU TO STAY HOME 0.020961
NUMPER18   NUMBER OF PERSONS 18 YEARS AND OLDER IN HOUSEHOLD 0.020405
S11AQ1A1   OFTEN CUT CLASS, NOT GO TO CLASS OR GO TO SCHOOL AND LEAVE WITHOUT PERMISSION 0.020345
S10Q1A45   WOULD BE JUST HAPPY WITHOUT HAVING ANY CLOSE RELATIONSHIP 0.020180
NUMREL18   NUMBER OF RELATED PERSONS 18 YEARS AND OLDER IN HOUSEHOLD 0.019861
S2AQ5G     HOW OFTEN DRANK 5+ BEERS IN LAST 12 MONTHS 0.019397
S3AQ3B1     USUAL FREQUENCY WHEN SMOKED CIGARETTES 0.019338
S11BQ1     BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS 0.018869
S10Q1A47   HAVE ALMOST ALWAYS PREFERRED TO DO THINGS ALONE RATHER THAN WITH OTHERS 0.018636
S11AQ1A25   EVER DO SOMETHING YOU COULD HAVE BEEN ARRESTED FOR, REGARDLESS OF WHETHER YOU WERE CAUGHT OR NOT 0.018017
S10Q1A22   HARD TO LET OTHERS HELP IF THEY DON'T AGREE TO DO THINGS EXACTLY THE WAY YOU WANT 0.017820
DGSTATUS   DRUG USE STATUS 0.017671
S1Q1D5     "WHITE" CHECKED IN MULTIRACE CODE 0.017200
S6Q1       HAD PANIC ATTACK, SUDDENLY FELT FRIGHTENED/OVERWHELMED/NERVOUS AS IF IN GREAT DANGER BUT WERE NOT 0.016674
S11AQ1A14   EVER DO THINGS THAT COULD EASILY HAVE HURT YOU OR SOMEONE ELSE, LIKE SPEEDING OR DRIVING AFTER HAVING TOO MUCH TO DRINK 0.016527
S2AQ12F     HOW OFTEN DROVE MOTOR VEHICLE AFTER 3+ DRINKS IN LAST 12 MONTHS 0.016235
S10Q1A58   FLIRT A LOT 0.015546
S3AQ3B2     USUAL FREQUENCY WHEN SMOKED CIGARS 0.014674
S2AQ9       HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY) 0.014626
S9Q1A       EVER HAD 6+ MONTH PERIOD FELT TENSE/NERVOUS/WORRIED MOST OF TIME 0.014188
S11AQ1A22   EVER SHOPLIFT 0.014131
S10Q1A3     FIND IT HARD TO BE "OPEN" EVEN WITH PEOPLE YOU ARE CLOSE TO 0.013714
S10Q1A32   OFTEN GET ANGRY OR LASH OUT WHEN SOMEONE CRITICIZES OR INSULTS YOU 0.013670
S10Q1A46   TAKE LITTLE PLEASURE IN BEING WITH OTHERS 0.012579
S11AQ1A15   EVER GET MORE THAN 3 TICKETS FOR RECKLESS/CARELESS DRIVING, SPEEDING, OR CAUSING AN ACCIDENT 0.012403
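
A ranking like the one above, together with the check on how accuracy changes with the number of trees, can be sketched with scikit-learn. This is a minimal illustration and not the code used for this post: the data is synthetic (generated with make_classification), the feature names are placeholders, and the 40% test split is an assumption chosen only because the confusion matrix in the output totals roughly 40% of the survey sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption: with the real data the
# columns would be NESARC predictors and the target would be S12Q1).
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Hold out 40% of the rows for testing (assumed split, see the note above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Fit one forest and rank the features by importance, highest first.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(feature_names[i], float(forest.feature_importances_[i]))
          for i in order]
for name, score in ranked:
    print(f"{name:12s} {score:.6f}")

# Refit with 1..25 trees and record the test accuracy for each forest size.
tree_counts = list(range(1, 26))
accuracies = []
for n in tree_counts:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

best_n = tree_counts[int(np.argmax(accuracies))]
print("Best number of trees:", best_n, "accuracy:", max(accuracies))
```

Plotting accuracies against tree_counts shows where the curve flattens, which is the kind of evidence behind a statement such as "around 12 trees".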

The code and output are included below.


Confusion Matrix: 
[[11983   892]
 [ 3540   823]]
Accuracy Score is:  0.742893607147
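
The accuracy score can be checked by hand from the four counts in the confusion matrix. The short sketch below does that arithmetic; it assumes the scikit-learn layout, where rows are true classes and columns are predicted classes, which the output itself does not state.

```python
# Confusion matrix as printed above. Assumed layout (scikit-learn convention):
# rows are true classes, columns are predicted classes.
confusion = [[11983, 892],
             [3540, 823]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp            # number of test rows

accuracy = (tn + tp) / total         # correct predictions over all rows
recall_second = tp / (tp + fn)       # share of second-class rows predicted correctly
larger_class_share = (tn + fp) / total  # share of rows in the larger true class

print("Test rows:", total)
print("Accuracy:", round(accuracy, 4))
print("Recall for second class:", round(recall_second, 4))
print("Share of larger class:", round(larger_class_share, 4))
```

Under that reading, the same four counts also give the recall for the second class and the share of rows in the larger class, which are useful reference points for the 74% figure.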
