Background on the data used here (NESARC) was described in an earlier post on this blog. In that post, we ran a Random Forest analysis for the boolean response variable EVER GAMBLED 5+ TIMES IN ANY ONE YEAR (S12Q1). In this post we run Lasso Regression on the same response variable. Before diving into the analysis, one observation about Lasso, which fits a linear regression to the response variable: the results differ depending on whether the response is coded as an integer or as a boolean. With a boolean response, the results are quite similar to those of the Random Forest analysis, which also used a boolean response. With an integer response, which is the more natural choice for a linear regression (e.g., Lasso, where MSE is calculated), the list of important variables differs from the one reported by Random Forest. This requires more analysis.
The following variables (listed below in order of importance) were identified as the most important because their coefficients were the largest in absolute value in this regression. The coefficients can be compared directly because the input variables were scaled to zero mean and unit variance. The three most important variables were: whether the individual had panic attacks (S6Q1), had a history of reckless driving (S11AQ1A15), or found it hard to be open even with people close to him/her (S10Q1A3). The remaining variables with non-zero coefficients are listed in the table as well, followed by the variables whose coefficients Lasso reported as zero.
Variable | Description | Coefficients |
---|---|---|
S6Q1 | HAD PANIC ATTACK, SUDDENLY FELT FRIGHTENED/OVERWHELMED/NERVOUS AS IF IN GREAT DANGER BUT WERE NOT | 0.226163 |
S11AQ1A15 | EVER GET MORE THAN 3 TICKETS FOR RECKLESS/CARELESS DRIVING, SPEEDING, OR CAUSING AN ACCIDENT | 0.175727 |
S10Q1A3 | FIND IT HARD TO BE "OPEN" EVEN WITH PEOPLE YOU ARE CLOSE TO | 0.125577 |
S11AQ1A25 | EVER DO SOMETHING YOU COULD HAVE BEEN ARRESTED FOR, REGARDLESS OF WHETHER YOU WERE CAUGHT OR NOT | 0.085541 |
S11AQ1A2 | EVER STAY OUT LATE AT NIGHT EVEN THOUGH PARENTS TOLD YOU TO STAY HOME | 0.068325 |
S11AQ1A22 | EVER SHOPLIFT | 0.067515 |
S11AQ1A1 | OFTEN CUT CLASS, NOT GO TO CLASS OR GO TO SCHOOL AND LEAVE WITHOUT PERMISSION | 0.067146 |
S11AQ1A14 | EVER DO THINGS THAT COULD EASILY HAVE HURT YOU OR SOMEONE ELSE, LIKE SPEEDING OR DRIVING AFTER HAVING TOO MUCH TO DRINK | 0.054445 |
S9Q1A | EVER HAD 6+ MONTH PERIOD FELT TENSE/NERVOUS/WORRIED MOST OF TIME | 0.047805 |
S10Q1A16 | THE KIND OF PERSON WHO FOCUSES ON DETAILS/ORDER/ORGANIZATION OR LIKES TO MAKE LISTS AND SCHEDULES | 0.047434 |
S1Q1G | NUMBER OF YEARS LIVED IN UNITED STATES | -0.030454 |
SMOKER | TOBACCO USE STATUS | 0.026548 |
S11BQ1 | BLOOD/NATURAL FATHER EVER HAD BEHAVIOR PROBLEMS | 0.024917 |
AGE | CYEAR (DATE OF INTERVIEW: YEAR) - DOBY (DATE OF BIRTH: YEAR) | -0.024049 |
S2AQ5G | HOW OFTEN DRANK 5+ BEERS IN LAST 12 MONTHS | 0.022640 |
S10Q1A25 | HAVE OTHERS TOLD YOU THAT YOU ARE STUBBORN OR RIGID | 0.019741 |
S3AQ3B2 | USUAL FREQUENCY WHEN SMOKED CIGARS | -0.015306 |
S1Q1D5 | "WHITE" CHECKED IN MULTIRACE CODE | -0.012658 |
S1Q10A | TOTAL PERSONAL INCOME IN LAST 12 MONTHS | -0.012306 |
MARITAL | CURRENT MARITAL STATUS | 0.011049 |
DGSTATUS | DRUG USE STATUS | -0.009435 |
S3AQ3B1 | USUAL FREQUENCY WHEN SMOKED CIGARETTES | 0.008346 |
S1Q9B | OCCUPATION: CURRENT OR MOST RECENT JOB | -0.007877 |
CHLD0_17 | NUMBER OF CHILDREN UNDER AGE 18 IN HOUSEHOLD | 0.007562 |
S2AQ10 | HOW OFTEN DRANK ENOUGH TO FEEL INTOXICATED IN LAST 12 MONTHS | -0.007253 |
S10Q1A43 | ARE THERE VERY FEW PEOPLE YOU'RE REALLY CLOSE TO OUTSIDE OF IMMEDIATE FAMILY | 0.007089 |
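To make the comparison of coefficients concrete, here is a minimal sketch of the two steps involved: rescaling a predictor to zero mean and unit variance, and ranking coefficients by absolute value. This is not the code used for the analysis. The helper names `standardize` and `rank_by_magnitude` and the `income` values are made up for illustration, while the four coefficients are copied from the table above.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


def rank_by_magnitude(coefficients):
    """Sort (name, coefficient) pairs by absolute value, largest first."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical raw column (not NESARC data), only to show the rescaling.
income = [12000, 35000, 48000, 51000, 90000]
scaled_income = standardize(income)

# Four coefficients copied from the table above.
coefs = {"SMOKER": 0.026548, "S1Q1G": -0.030454, "S6Q1": 0.226163, "AGE": -0.024049}
ranking = rank_by_magnitude(coefs)
for name, value in ranking:
    print(f"{name:10s} {value:+.6f}")
```

The sign of a coefficient gives the direction of the association, so it is ignored only for the ranking and kept in the printed value.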
Variables reported with zero coefficients. These are either correlated with variables listed above that have non-zero coefficients (Lasso tends to keep one variable from a group of correlated variables, somewhat arbitrarily, and shrink the others to zero) or unrelated to the response variable:
Variable | Description | Possible Explanation |
---|---|---|
S10Q1A52 | THE SORT OF PERSON WHO DOESN'T CARE ABOUT WHAT PEOPLE THINK OF YOU | Seems Correlated |
S10Q1A58 | FLIRT A LOT | Seems Correlated |
S2AQ12F | HOW OFTEN DROVE MOTOR VEHICLE AFTER 3+ DRINKS IN LAST 12 MONTHS | Seems Correlated |
S10Q1A47 | HAVE ALMOST ALWAYS PREFERRED TO DO THINGS ALONE RATHER THAN WITH OTHERS | Seems Correlated |
S10Q1A46 | TAKE LITTLE PLEASURE IN BEING WITH OTHERS | Seems Correlated |
S10Q1A45 | WOULD BE JUST HAPPY WITHOUT HAVING ANY CLOSE RELATIONSHIP | Seems Correlated |
S2AQ12B | HOW OFTEN DRANK AFTER MIDNIGHT IN LAST 12 MONTHS | Seems Correlated |
S10Q1A32 | OFTEN GET ANGRY OR LASH OUT WHEN SOMEONE CRITICIZES OR INSULTS YOU | Seems Correlated |
S10Q1A22 | HARD TO LET OTHERS HELP IF THEY DON'T AGREE TO DO THINGS EXACTLY THE WAY YOU WANT | Seems Correlated |
S2AQ9 | HOW OFTEN DRANK 4+ DRINKS OF ANY ALCOHOL IN LAST 12 MONTHS (WOMEN ONLY) | Seems Correlated |
NUMPER18 | NUMBER OF PERSONS 18 YEARS AND OLDER IN HOUSEHOLD | Seems Unrelated |
NUMPERS | NUMBER OF PERSONS IN HOUSEHOLD | Seems Correlated/Unrelated |
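Why a correlated variable can end up with a zero coefficient can be seen in a toy example. The sketch below is a plain-Python cyclic coordinate descent for the Lasso objective, not the solver used in this analysis, and the data are hypothetical: `x2` is an exact copy of `x1`, and `x3` is uncorrelated with the response.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(columns, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xb||^2 + alpha*||b||_1 by cyclic coordinate descent.

    `columns` is a list of centered predictor columns; `y` is centered too.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Residual with predictor j's own contribution left out.
            partial = [
                y[i] - sum(coefs[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(col[i] * partial[i] for i in range(n)) / n
            z = sum(v * v for v in col) / n
            coefs[j] = soft_threshold(rho, alpha) / z
    return coefs


# Hypothetical toy data, already centered.
x1 = [-2, -1, 0, 1, 2]
x2 = list(x1)            # perfectly correlated with x1
x3 = [1, -2, 0, 2, -1]   # uncorrelated with x1 and with y
y = [3 * v for v in x1]  # the response depends on x1 only

coefs = lasso_coordinate_descent([x1, x2, x3], y, alpha=0.5)
print(coefs)  # [2.75, 0.0, 0.0]
```

Here `x1` absorbs the whole effect and its duplicate `x2` is shrunk to zero, as is the unrelated `x3`. If the two identical columns were passed in the opposite order, `x2` would be kept instead, which is why a zero coefficient alone does not show that a variable is unrelated to the response.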
The training and test R-squared values were 0.670503947 and 0.66454160107 respectively, both fairly high, indicating a good fit of the model to the training and test data. The training and test mean squared errors were 0.474361978873 and 0.495379372341 respectively, fairly low in both cases as well. The MSE plot is as follows, and we can see that the MSE goes down successively as the alpha value increases.
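For reference, both metrics quoted above can be computed from the observed and predicted values as sketched below. The numbers in the example are made up and are not NESARC results, and the function names are illustrative rather than taken from the analysis code.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values.
y_true = [1, 2, 3, 4]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"MSE = {mse:.3f}, R-squared = {r2:.3f}")  # MSE = 0.125, R-squared = 0.900
```

Computing both on the training set and on the held-out test set, as done above, shows whether the fit carries over to data the model has not seen.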
The plot of the progression of the coefficients as the variables are added one by one in the Lasso Regression is shown below.
Python code is included below: