Rise Of The Machines: DSc Machine Learning In Social Research #AAPOR #MRX #NewMR 


Enjoy my live note taking at AAPOR in Austin, Texas. Any bad jokes or errors are my own. Good jokes are especially mine.  

Moderator: Masahiko Aida, Civis Analytics

Employing Machine Learning Approaches in Social Scientific Analyses; Arne Bethmann, Institute for Employment Research (IAB) Jonas F. Beste, Institute for Employment Research (IAB)

  • [Good job on starting without a computer being ready. Because who needs computers for a talk about data science which uses computers:) ]
  • Demonstration of chart of wages by age and gender which is far from linear, regression tree is fairly complex
  • Why use machine learning? Models are flexible, automatic selection of features and interactions, large toolbox of modeling strategies; but risk is overfitting, not easily interpretable, etc
  • Interesting that you can kind of see the model in the regression tree alone
  • Start by setting the variable of interest to the same value for every case in the sample, e.g., male and female are both coded 0; predict responses for every person; repeat with the other value; calculate AME/APE as the mean difference between the two sets of predictions across all cases
  • Regression tree and linear model end up with very different results
  • R package for average marginal effects – MLAME on GitHub
  • MLR package as well [please ask author for links to these packages]
  • Want to add more functions to these – conditional AME, SE estimation, MLR wrapper
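
The AME steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the speakers' MLAME package: the function name `average_marginal_effect`, the hand-written `toy_tree` model and the four-person sample are all made up for the example.

```python
from statistics import mean


def average_marginal_effect(predict, rows, feature, low=0, high=1):
    """Mean difference in predictions when `feature` is forced to
    `high` versus `low` for every case in the sample."""
    diffs = []
    for row in rows:
        row_low = {**row, feature: low}    # everyone coded as `low`
        row_high = {**row, feature: high}  # everyone coded as `high`
        diffs.append(predict(row_high) - predict(row_low))
    return mean(diffs)


def toy_tree(row):
    """Hypothetical tree-like wage model; the numbers are invented."""
    if row["age"] < 30:
        return 15.0 if row["female"] else 16.0
    return 20.0 if row["female"] else 25.0


# Invented sample: two younger and two older respondents.
sample = [
    {"age": 25, "female": 1},
    {"age": 28, "female": 0},
    {"age": 45, "female": 1},
    {"age": 52, "female": 0},
]

ame_female = average_marginal_effect(toy_tree, sample, "female")
print(ame_female)
```

Because the function only needs something that can predict, the same call would work with a regression tree, a linear model or any other learner, which is what makes the comparison between models possible.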

Using Big Census Data to Better Understand a Large Community Well-being Study: More than Geography Divides Us; Donald P. Levy, Siena College Research Institute Meghann Crawford, Siena College Research Institute

  • Interviewed 16000 people by phone, RDD
  • Survey of quality of community, health, safety, financial security, civic engagement, personal well being
  • Used factor analysis to group and test multiple indicators into factors: did the items really rest within each factor? [I love factor analysis. It helps you see groupings that are invisible to the naked eye.]
  • Mapped out cities and boroughs, some changed over time
  • Rural versus urban have more in common than neighbouring areas [is this not obvious?]
  • 5 connections – wealthy, suburban, rural, urban periphery, urban core
  • Can set goals for your city based on these scores
  • Simple scoring method based on 111 indicators to help with planning and awareness campaigns; the numbers are made public and shared in reports and on public transportation so the public knows what they are; helps to identify obstacles and enhance quality of life

Using Machine Learning to Infer Demographics for Respondents; Noble Kuriakose, SurveyMonkey; Tommy Nguyen, SurveyMonkey

  • Best accuracy for gender inferring is 80%, Google has seen this
  • Use mobile survey, but not everyone fills out the entire demographic survey
  • Works to find twins, people you look like based on app usage
  • Support vector machines try to split a scatter plot where male and female are as far apart as possible 
  • Give a lot of power to the edges to split the data 
  • Usually the data overlaps a ton, you don’t see men on the left and women on the right
  • “Did this person use this app?” Split people based on gender, Pinterest is often the first node because it is the best differentiator right now, Grindr and emoticon use follow through to define the genders well, stop when a node is all one specific gender
  • Men do use Pinterest though, ESPN is also a good indicator but it’s not perfect either, HotOrNot is more male
  • Use time spent per app, app used, number of apps installed, websites visited, etc
  • Random forest works the best
  • Feature selection really matters, use a selected list not a random list
  • Really big differences with tree depth
  • Can’t apply the app model to the android model, the apps are different, the use of apps is different
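
To make the "did this person use this app?" split concrete, here is a minimal sketch of how a decision tree might pick its first node by comparing Gini impurity across apps. It is an assumption-laden toy, not the SurveyMonkey model: the eight users, their app sets and the helper names `gini`, `split_impurity` and `best_first_split` are all invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means all one gender)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_impurity(rows, labels, app):
    """Weighted impurity after asking 'did this person use `app`?'."""
    used = [y for apps, y in zip(rows, labels) if app in apps]
    not_used = [y for apps, y in zip(rows, labels) if app not in apps]
    n = len(labels)
    return len(used) / n * gini(used) + len(not_used) / n * gini(not_used)


def best_first_split(rows, labels, candidate_apps):
    """App whose yes/no split leaves the purest two groups."""
    return min(candidate_apps, key=lambda a: split_impurity(rows, labels, a))


# Invented data: each row is the set of apps a user has; 1 = female, 0 = male.
rows = [
    {"pinterest", "news"}, {"pinterest"}, {"pinterest", "espn"}, {"pinterest", "news"},
    {"espn", "news"}, {"espn"}, {"pinterest", "espn"}, {"news"},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

first_node = best_first_split(rows, labels, ["pinterest", "espn", "news"])
print(first_node)
```

In this toy data Pinterest wins the first split even though one man uses it, which mirrors the point that no single app is a perfect indicator. A random forest would repeat this kind of splitting over many trees built on different subsets of users and features.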

Dissonance and Harmony: Exploring How Data Science Helped Solve a Complex Social Science Problem; Michael L. Jugovich, NORC at the University of Chicago; Emily White, NORC at the University of Chicago

  • [another speaker who marched on when the computer screens decided they didn’t want to work🙂 ]
  • Recidivism research, going back to prison
  • Wanted a national perspective of recidivism
  • Offences differ by state, unstructured text forms means a lot of text interpretation, historical data is included which messes up the data if it’s vertical or horizontal in different states
  • Have to account for short forms and spelling errors (kinfe)
  • Getting the data into a useable format takes the longest time and the most work
  • Big data is often blue in pictures with spirals [funny comments🙂 ]
  • Old data is changed and new data is added all the time
  • 30 000 regular expressions to identify all the pieces of text
  • They seek 100% accuracy rate [well that’s completely impossible]
  • Added in supervised learning and used to help improve the speed and efficiency of manual review process
  • Wanted state specific and global economy models, over 300 models, used brute force model
  • Want to improve with neural networks, automate database updates
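
For a sense of what regular expressions for short forms and spelling errors might look like, here is a small Python sketch. The three patterns, the category labels and the `classify_offense` helper are hypothetical examples, not anything from the NORC system, which reportedly uses around 30 000 expressions.

```python
import re

# Hypothetical patterns: each pairs an offence label with a regex that
# tolerates abbreviations and scrambled letters (e.g. "kinfe" for "knife").
OFFENSE_PATTERNS = [
    ("ASSAULT_WEAPON",
     re.compile(r"\b(?:aslt|asslt|assault)\b.*\b(?:k[nif]{3}e|wpn|weapon)\b", re.I)),
    ("BURGLARY", re.compile(r"\b(?:burg|burglary)\b", re.I)),
    ("DUI", re.compile(r"\b(?:dui|dwi)\b|driving under the influence", re.I)),
]


def classify_offense(text):
    """Return the first matching label, or None to flag for manual review."""
    for label, pattern in OFFENSE_PATTERNS:
        if pattern.search(text):
            return label
    return None


print(classify_offense("ASLT W/ KINFE"))
print(classify_offense("Burg 2nd degree"))
print(classify_offense("trespassing"))
```

Returning `None` for unmatched text is one way to route the leftovers to the manual review step mentioned above.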

Machine Learning Our Way to Happiness; Pablo Diego Rosell, The Gallup Organization

  • Are machine learning models different/better than theory driven models
  • Using Gallup daily tracking survey
  • Measuring happiness using the ladder scale, best possible life to worst possible life, where do you fall along this continuum; most people sit around 7 or 8
  • 500 interviews every day, RDD of landlines and mobile, English and Spanish, weighted to national targets and phone lines
  • Most models get an R-squared of .29. Probably because they miss interactions we can’t even imagine
  • Include variables that may not be justified in a theory driven model, include quadratic terms that you would never think of, expanded variables from 15 to 194
  • [I feel like this isn’t necessarily machine learning but just traditional statistics with every available variable crossed with every other variable included in the process]
  • For an 80% solution, needed only five variables
  • This example didn’t uncover significant unmodeled variables
  • [if machine learning is just as fast and just as predictive as a theory driven model, I’d take the theory driven model any day. If you don’t understand WHY a model is what it is, you can’t act on it as precisely.]
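
The idea of expanding a variable list with quadratic terms and crossed variables can be sketched as below. This is a generic illustration under my own assumptions: the `expand_features` helper and the three example variables are invented, and it does not attempt to reproduce how the talk got from 15 variables to 194.

```python
from itertools import combinations


def expand_features(row):
    """Add a squared term for every variable and a product term for
    every pair of variables, keeping the original values as well."""
    expanded = dict(row)
    for name, value in row.items():
        expanded[f"{name}^2"] = value ** 2
    for (a, va), (b, vb) in combinations(row.items(), 2):
        expanded[f"{a}*{b}"] = va * vb
    return expanded


# Invented respondent with three numeric variables.
example = {"age": 40, "income": 3, "health": 2}
expanded = expand_features(example)
print(len(expanded))
print(sorted(expanded))
```

A learner is then left to decide which of the expanded terms matter, rather than the analyst choosing them from theory.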