Sunday, January 22, 2017

Lasso Regression for U.S. National Epidemiological Survey on Alcohol and Related Conditions (NESARC) data

Background on the data used here (NESARC) was described in an earlier post on this blog. In that post, we conducted a Random Forest analysis for the boolean response variable EVER GAMBLED 5+ TIMES IN ANY ONE YEAR (S12Q1). In this post we conduct a Lasso Regression for the same response. Before we dive into the analysis, one observation about Lasso is worth noting. Lasso fits a linear regression to the response variable, and the results differ depending on whether that response is coded as an integer or as a boolean. With a boolean response, the results are quite similar to those of the Random Forest analysis, which also used a boolean response. With an integer response, which is the more natural choice for a linear regression (e.g. Lasso, where MSE is calculated), the list of important variables differs from the one reported by Random Forest. This requires more analysis.

The following variables (listed below in order of importance) were identified as the most important, since their coefficients were the largest in absolute value in this regression. We can compare the coefficients directly because we scaled the input variables to zero mean and unit variance. The most important variables identified in this analysis were whether the individual had panic attacks (S6Q1), was engaged in reckless driving (S11AQ1A15), or found it hard to be open even with people close to him/her (S10Q1A3). The remaining variables with non-zero coefficients are listed as well, and the variables whose coefficients Lasso reported as zero are briefly covered further below.


Variable | Description | Coefficient
S6Q1 | HAD PANIC ATTACK, SUDDENLY FELT FRIGHTENED/OVERWHELMED/NERVOUS AS IF IN GREAT DANGER BUT WERE NOT | 0.226163
S11AQ1A15 | EVER GET MORE THAN 3 TICKETS FOR RECKLESS/CARELESS DRIVING, SPEEDING, OR CAUSING AN ACCIDENT | 0.175727
S10Q1A3 | FIND IT HARD TO BE "OPEN" EVEN WITH PEOPLE YOU ARE CLOSE TO | 0.125577
S11AQ1A25 | EVER DO SOMETHING YOU COULD HAVE BEEN ARRESTED FOR, REGARDLESS OF WHETHER YOU WERE CAUGHT OR NOT | 0.085541
S11AQ1A2 | EVER STAY OUT LATE AT NIGHT EVEN THOUGH PARENTS TOLD YOU TO STAY HOME | 0.068325
S11AQ1A22 | EVER SHOPLIFT | 0.067515
S11AQ1A1 | OFTEN CUT CLASS, NOT GO TO CLASS OR GO TO SCHOOL AND LEAVE WITHOUT PERMISSION | 0.067146
S11AQ1A14 | EVER DO THINGS THAT COULD EASILY HAVE HURT YOU OR SOMEONE ELSE, LIKE SPEEDING OR DRIVING AFTER HAVING TOO MUCH TO DRINK | 0.054445
S9Q1A | EVER HAD 6+ MONTH PERIOD FELT TENSE/NERVOUS/WORRIED MOST OF TIME | 0.047805
S10Q1A16 | THE KIND OF PERSON WHO FOCUSES ON DETAILS/ORDER/ORGANIZATION OR LIKES TO MAKE LISTS AND SCHEDULES | 0.047434
S1Q1G | NUMBER OF YEARS LIVED IN UNITED STATES | -0.030454
SMOKER | TOBACCO USE STATUS | 0.026548
S11BQ1 | BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS | 0.024917
AGE | CYEAR (DATE OF INTERVIEW: YEAR) - DOBY (DATE OF BIRTH: YEAR) | -0.024049
S2AQ5G | HOW OFTEN DRANK 5+ BEERS IN LAST 12 MONTHS | 0.022640
S10Q1A25 | HAVE OTHERS TOLD YOU THAT YOU ARE STUBBORN OR RIGID | 0.019741
S3AQ3B2 | USUAL FREQUENCY WHEN SMOKED CIGARS | -0.015306
S1Q1D5 | "WHITE" CHECKED IN MULTIRACE CODE | -0.012658
S1Q10A | TOTAL PERSONAL INCOME IN LAST 12 MONTHS | -0.012306
MARITAL | CURRENT MARITAL STATUS | 0.011049
DGSTATUS | DRUG USE STATUS | -0.009435
S3AQ3B1 | USUAL FREQUENCY WHEN SMOKED CIGARETTES | 0.008346
S1Q9B | OCCUPATION: CURRENT OR MOST RECENT JOB | -0.007877
CHLD0_17 | NUMBER OF CHILDREN UNDER AGE 18 IN HOUSEHOLD | 0.007562
S2AQ10 | HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS | -0.007253
S10Q1A43 | ARE THERE VERY FEW PEOPLE YOU'RE REALLY CLOSE TO OUTSIDE OF IMMEDIATE FAMILY | 0.007089
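
As a rough illustration of why scaling makes the coefficients comparable, the sketch below standardizes a small made-up predictor matrix and then ranks a handful of the coefficients from the table above by absolute value. The helper names `standardize` and `rank_by_magnitude` and the toy `raw` values are assumptions for illustration only; this is not the code used to produce the table.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def rank_by_magnitude(coefs):
    """Return (name, coefficient) pairs sorted by absolute value, largest first."""
    return sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical raw predictors on very different scales (made-up numbers).
raw = np.array([[25.0, 20000.0],
                [40.0, 55000.0],
                [60.0, 32000.0],
                [35.0, 90000.0]])
scaled = standardize(raw)
print("column means:", scaled.mean(axis=0))
print("column stds: ", scaled.std(axis=0))

# A few coefficients copied from the table above, deliberately out of order.
coefs = {"S1Q1G": -0.030454, "SMOKER": 0.026548, "S6Q1": 0.226163, "AGE": -0.024049}
ranking = rank_by_magnitude(coefs)
for name, value in ranking:
    print(f"{name:8s} {value:+.6f}")
```

Sorting by absolute value rather than by signed value is what places a negative coefficient such as S1Q1G ahead of a smaller positive one such as SMOKER.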

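Variables reported with zero coefficients are listed below. A zero coefficient does not necessarily mean that a variable is unrelated to the response: the variable may be correlated with one of the variables above that has a non-zero coefficient, since Lasso tends to keep one variable from a group of correlated variables, somewhat arbitrarily, and drop the rest. Otherwise, the variable may simply be unrelated to the response variable: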


Variable | Description | Possible Explanation
S10Q1A52 | THE SORT OF PERSON WHO DOESN'T CARE ABOUT WHAT PEOPLE THINK OF YOU | Seems Correlated
S10Q1A58 | FLIRT A LOT | Seems Correlated
S2AQ12F | HOW OFTEN DROVE MOTOR VEHICLE AFTER 3+ DRINKS IN LAST 12 MONTHS | Seems Correlated
S10Q1A47 | HAVE ALMOST ALWAYS PREFERRED TO DO THINGS ALONE RATHER THAN WITH OTHERS | Seems Correlated
S10Q1A46 | TAKE LITTLE PLEASURE IN BEING WITH OTHERS | Seems Correlated
S10Q1A45 | WOULD BE JUST HAPPY WITHOUT HAVING ANY CLOSE RELATIONSHIP | Seems Correlated
S2AQ12B | HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS | Seems Correlated
S10Q1A32 | OFTEN GET ANGRY OR LASH OUT WHEN SOMEONE CRITICIZES OR INSULTS YOU | Seems Correlated
S10Q1A22 | HARD TO LET OTHERS HELP IF THEY DON'T AGREE TO DO THINGS EXACTLY THE WAY YOU WANT | Seems Correlated
S2AQ9 | HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY) | Seems Correlated
NUMPER18 | NUMBER OF PERSONS 18 YEARS AND OLDER IN HOUSEHOLD | Seems Unrelated
NUMPERS | NUMBER OF PERSONS IN HOUSEHOLD | Seems Correlated/Unrelated
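
To make the "zero coefficient but still correlated" case concrete, here is a minimal sketch on synthetic data (not NESARC). It solves the Lasso problem with a small coordinate-descent routine, which is one standard way to fit Lasso and not necessarily the solver behind the results in this post. The function names, the data-generating choices and `alpha=0.3` are all assumptions made for the illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y.copy()                      # residual for the starting point w = 0
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * w[j]       # take feature j's contribution out
            rho = X[:, j] @ resid / n     # covariance of feature j with the partial residual
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
            resid -= X[:, j] * w[j]       # put the updated contribution back
    return w

# Synthetic data: x2 is correlated with x1, x3 is unrelated to everything.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # correlation with x1 of about 0.8
x3 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * rng.normal(size=n)    # only x1 drives the response

X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance
y = y - y.mean()

w = lasso_cd(X, y, alpha=0.3)
corr_x2_y = np.corrcoef(x2, y)[0, 1]
print("coefficients (x1, x2, x3):", w)
print("correlation of x2 with y: ", corr_x2_y)
```

In this setup x2 is clearly correlated with the response on its own, yet Lasso assigns it a coefficient of zero because x1 already accounts for that signal. A zero in the table above therefore cannot by itself distinguish "correlated with a retained variable" from "unrelated".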

The training and test R-squared values were 0.670503947 and 0.66454160107 respectively, which are fairly high and indicate a good fit of the model to both the training and test data. The training and test mean squared errors were 0.474361978873 and 0.495379372341 respectively, which are fairly low in both cases as well. The MSE plot is as follows and we can see that the MSE is successively going down as the alpha value increases.
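
For reference, the two metrics quoted above can be computed as in the sketch below. It assumes the standard definitions (MSE as the mean of squared residuals, R-squared as one minus the ratio of residual to total sum of squares on the same split); the function names and the toy arrays are illustrative and not taken from the original analysis.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def implied_variance(mse_value, r2_value):
    """Rearranges MSE = (1 - R^2) * Var(y) to recover Var(y)."""
    return mse_value / (1.0 - r2_value)

# Toy example: squared errors are 0, 0, 0, 4 and SS_tot around the mean 1.5 is 5.
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 2, 5]
toy_mse = mse(y_true, y_pred)        # 4 / 4
toy_r2 = r_squared(y_true, y_pred)   # 1 - 4 / 5

# Response variance implied by the figures reported in this post.
train_var = implied_variance(0.474361978873, 0.670503947)
test_var = implied_variance(0.495379372341, 0.66454160107)
print(toy_mse, toy_r2, train_var, test_var)
```

Under these definitions, both implied variances come out well above 0.25, the largest variance a 0/1 variable can have, so the reported figures are consistent with the integer-coded response discussed at the top of the post.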


The plot for the progression of the coefficients as the variables are added one-by-one in the Lasso Regression is shown below.
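
A coefficient progression of this kind can be computed with a LARS-style Lasso path, in which variables enter the model one at a time as the penalty is relaxed. The sketch below assumes scikit-learn's `lars_path` and uses synthetic data; the post does not state which routine produced its plot, so treat this as one possible way to obtain such a path and not as the author's code.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data: feature 0 is strong, feature 1 is weaker, feature 2 is noise.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)
y = y - y.mean()

# coefs has one row per feature and one column per point on the path.
alphas, active, coefs = lars_path(X, y, method="lasso")

# For each feature, the first path step at which its coefficient is non-zero.
first_step = (coefs != 0).argmax(axis=1)
entry_order = np.argsort(first_step).tolist()

print("alphas along the path:", alphas)
print("order in which features enter:", entry_order)
```

Plotting each row of `coefs` against `alphas` (or against the step index) would give a coefficient-progression chart of the kind described above.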


Python code is included below:




