Moderator: Jessica L. Holzberg, U.S. Census Bureau
Satisfied or Dissatisfied? Does Order Matter?; Jolene D. Smyth, University of Nebraska-Lincoln; Richard Hull, University of Nebraska-Lincoln
- Best practice is to use a balanced question stem and keep response options in order
- But what order should the terms take in the question stem?
- Doesn’t seem to matter whether the scale is left to right or top to bottom
- Visual Heuristic Theory – helps make sense of questions: “left and top mean first” and “up means good”; people expect the positive answer to come first, so maybe it’s harder to answer if the good option is at the bottom
- Why should the question stem matter? We rarely look closely at it
- “How satisfied or dissatisfied are you?” [I avoid this completely by asking “What is your opinion about this?” and then using those words in the scale; why duplicate words and lengthen questions?]
- Tested satisfied first and dissatisfied second in the stem, then satisfied at the top and dissatisfied at the bottom of the answer list, and vice versa
- What did nonresponse look like across these four conditions? Zero differences
- Order in the question stem had practically no impact, essentially zero once you account for random chance
- Did find that you get more positive answers when positive answer is first
- [I think we overthink this. If the question and answers are short and simple, people have no trouble and random chance takes its course. Also, as long as all your comparisons are within the test, it won’t affect your conclusions]
- [She just presented negative results. No one would ever do that at a market research conference 🙂 ]
Question Context Effects on Subjective Well-being Measures; Sunghee Lee, University of Michigan; Colleen McClain, University of Michigan
- External effects – weather, uncomfortable chair, noise in the room
- Internal effects – survey topic, image, instructions, response scale, question order
- People don’t view questions in isolation, it’s a flow of questions
- Tested with life satisfaction and self-rated health: how are the two related, and does it matter which one you ask first? How will thinking about my health affect my rating of life satisfaction?
- People change their behaviors when they are asked to think about mortality issues, how is it different for people whose parents are alive or deceased
- High correlations in direction as expected
- When primed, people whose parents are deceased expected a shorter lifespan
- Primed respondents said they considered their parents’ deaths and ages at death
- Recommend keeping the questions apart to minimize effects [but this is rarely possible]
- Sometimes priming could be a good thing, make people think about the topic before answering
Instructions in Self-administered Survey Questions: Do They Improve Data Quality or Just Make the Questionnaire Longer?
Cleo Redline, National Center for Education Statistics; Andrew Zukerberg, National Center for Education Statistics; Chelsea Owens, National Center for Education Statistics; Amy Ho, National Center for Education Statistics
- For instance, if you say “how many shoes do you have, not including sneakers?”, what if you then also have to define loafers?
- Instructions are burdensome and confusing, and they lengthen the questionnaire
- Does formatting of instructions matter
- Put instructions in italics, put them in bullet points because there were several somewhat lengthy instructions
- Created instructions that conflicted with the natural interpretation of questions, e.g., “assessment” does not include quizzes or standardized tests
- Tried paragraph vs. list format, placement before vs. after the question, and with vs. without instructions
- Adding instructions did not change mean responses
- Instructions intended to affect the results actually did so, i.e., people read and interpreted the instructions
- Instructions before the question are more effective as a paragraph
- Instructions after the question are more effective as lists
- On average, instructions did not improve data quality; the problems are real but they are small
- Don’t spend a lot of time on it if there aren’t obvious gains
- Consider not using instructions
Investigating Measurement Error through Survey Question Placement; Ashley R. Wilson, RTI International; Jennifer Wine, RTI International; Natasha Janson, RTI International; John Conzelmann, RTI International; Emilia Peytcheva, RTI International
- Generally we pool results from self-administered and CATI modes, but what about sensitive items, social desirability, and open-ended questions – what is “truth”?
- Can evaluate error with fictitious issues – e.g., a policy that doesn’t exist [but keep in mind policy names sound the same and could be legitimately misconstrued ]
- Test using reverse-coded items, straightlining, and consistency of seemingly contradictory items [of course, there are many cases where what SEEMS to contradict is actually correct, e.g., yes, I have a dog; no, I don’t buy dog food. This is one of the weakest data quality checks]
- Can also check against administrative data
- “AssistNow” loan program did not exist [I can see people saying they agree because they think any loan program is a good thing]
- On the phone, there were more substantive answers, and more people agreed with the fictitious program [but it’s a very problematic question to begin with]
- Checked how much money they borrowed: $1,000 average measurement error [that seems pretty small to me; borrowing $9,000 vs. $10,000 is a non-issue, and even less important at $49,000 vs. $50,000]
- Mode effects aren’t that big
Do Faster Respondents Give Better Answers? Analyzing Response Time in Various Question Scales; Daniel Goldstein, NYC Department of Housing Preservation and Development; Kristie Lucking, NYC Department of Housing Preservation and Development; Jack Jerome, NYC Department of Housing Preservation and Development; Madeleine Parker, NYC Department of Housing Preservation and Development; Anne Martin, National Center for Children and Families
- 300 questions, complicated sections, administered by two interviewers, housing, finances, debt, health, safety, demographics; Variety of scales throughout
- 96,000 response times measured, right skewed with a really long tail
- People with less education take longer to answer, employed people take longer, older people take longer, and non-English speakers take the longest
- People answer more quickly as they go through the survey, become more familiar with how the survey works
- Yes/no questions are the fastest; check-all-that-apply questions are next fastest, as they are viewed as yes/no questions
- Experienced interviewers are faster
- Scales with more answer categories take longer
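A long-tailed response-time distribution means the choice of summary statistic matters; a minimal sketch with simulated, purely hypothetical log-normal times (all parameters illustrative):

```python
import random
import statistics

random.seed(42)

# Hypothetical response times in seconds: log-normal, i.e. right-skewed
# with a long tail of slow responses.
times = [random.lognormvariate(1.5, 0.6) for _ in range(96000)]

mean_t = statistics.mean(times)
median_t = statistics.median(times)

# The long tail pulls the mean above the median, which is why speeder
# and laggard cutoffs are usually set relative to the median.
print(f"mean={mean_t:.2f}s median={median_t:.2f}s")
```

Cutting at, say, a fraction of the median flags fast responders without letting a few very slow completes distort the threshold.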
It’s true that for the most part, leading questions are the sign of a poorly skilled, inexperienced survey writer. When it’s pointed out, most of us can see that these are terrible questions.
- Do you agree that sick babies deserve free healthcare?
- Should poorly constructed laws be struck down?
- Is it important to fund new products that improve the lives of people?
- Should products that cause rashes be pulled from stores?
- Should stores always have enough cashiers so that no one has to wait in a long line?
But are leading questions always bad? I think not. However, these are situations that only experienced researchers should attempt. Leading questions may be appropriate when you are trying to measure socially undesirable, embarrassing, unethical, inappropriate, or illegal activities. Consider these examples.
Would you say yes to this:
- Have you driven drunk in the past three months?
What about to this?
- Many people realize that they have driven after having too much to drink. Is this something you have done in the last three months?
Would you say no to this?
- Have you donated to charity in the past three months?
What about to this?
- Sometimes it’s hard to donate to charity even when you really want to. Have you donated to charity in the past three months?
In both cases, it is possible that the first question will cause people to give a more socially acceptable answer, but not necessarily the valid answer. The second question, meanwhile, might create a mindset in which the respondent feels better about sharing a socially undesirable answer.
The next time you need to write a survey, consider whether you need to write a leading question. Consider your wording carefully!
6 papers moderated by Martin Barron, NORC
prezzie 1: evaluating quality control questions, by Keith Phillips
- people become disengaged for a moment, not throughout an entire survey, so respondents flagged by a single trap question are falsely accused [agree so much!]
- if most people fail a data quality question, it’s a bad question
- one engagement check: a long paragraph that ends with “please answer none of the above” – use a question that everyone can answer –> is there harm in removing these people?
- no matter how the dataset was cleaned, the answers remained the same; these people don’t hurt data quality, likely because the failures happen randomly
- people who fail many data quality questions are the problem, which questions are most effective?
- most effective questions were low incidence check, open ends, speeding
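The takeaway (remove only respondents who fail several checks, not one) can be sketched as follows; respondent ids, check names, and the threshold are all illustrative:

```python
def flag_for_removal(failed_checks, threshold=2):
    """Return respondent ids failing at least `threshold` quality checks.

    A single failure is treated as a momentary lapse; only repeat
    offenders are flagged, per the finding above.
    """
    return {rid for rid, fails in failed_checks.items() if len(fails) >= threshold}

# Hypothetical per-respondent sets of failed checks
fails = {
    "r1": {"speeding"},                           # momentary lapse: keep
    "r2": {"speeding", "trap_item", "open_end"},  # repeat offender: remove
    "r3": set(),                                  # clean: keep
}
print(flag_for_removal(fails))  # {'r2'}
```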
prezzie 2: key factor of opinion poll quality
- errors in political polling have doubled over the last ten years in canada
- telephone coverage has decreased to 67% when it used to be 95%
- online panel is highly advantageous for operational reasons but it has high coverage error and it depends on demographic characteristics
- online generated higher item selection than IVR/telephone
prezzie 3: new technology for global population insights
- random domain intercept technology – samples people who land on 404 pages, reaches non-panel people
- similar to random digit dialing
- allows access to many countries around the world
- skews male, skews younger, but that is the nature of the internet
- response rates in the USA are 6%, compared to up to 29% elsewhere [wait until we train them with our bad surveys. the rates will come down!]
- 30% mobile in the USA, but this is completely different around the world
- the large majority of people have never or rarely taken surveys, very different from panels
prezzie 5: surveys based on incomplete sampling
- first mention of total survey error [it’s a splendid thing, isn’t it!]
- nonprobability samples are more likely to be early adopters [no surprise, people who want to get in with new tech want to get in with other things too]
- demographic weighting is insufficient
- how else are nonprobability samples different – more social engagement, higher self importance, more shopping behaviours, happier in life, feel like part of the community, more internet usage
- can use a subset of questions to help reduce bias – 60 measures reduced to number surveys per month, hours on internet, trying new products first, time spent watching TV, using coupons, number of times moved in last 5 years
- calibrated research results matched census data well
- probability sampling is always preferred but we can compensate greatly
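A common way to do this kind of calibration is raking (iterative proportional fitting): adjust weights until the sample matches known population margins on the chosen variables. A minimal sketch; the variable names, data, and population targets are purely illustrative, not from the talk:

```python
# Each respondent: (heavy_internet, early_adopter) coded 0/1 (hypothetical).
sample = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1), (1, 0), (1, 1)]
targets = {"heavy_internet": 0.5, "early_adopter": 0.4}  # population shares

weights = [1.0] * len(sample)
for _ in range(50):  # iterate until the margins converge
    for dim, (name, target) in enumerate(targets.items()):
        total = sum(weights)
        share = sum(w for w, row in zip(weights, sample) if row[dim] == 1) / total
        # Scale weights so this margin hits its target
        for i, row in enumerate(sample):
            if row[dim] == 1:
                weights[i] *= target / share
            else:
                weights[i] *= (1 - target) / (1 - share)

total = sum(weights)
for dim, name in enumerate(targets):
    share = sum(w for w, row in zip(weights, sample) if row[dim] == 1) / total
    print(f"{name}: weighted share = {share:.3f}")
```

In practice the calibration variables would be the behavioral measures named above rather than demographics alone, and the weighted estimates would then be checked against benchmarks.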
prezzie 6: evaluating questionnaire biases across online sample providers
- calculated the absolute difference possible when completely rewriting a survey in every possible way – same topic but different orders, words, answer options, answer order, imagery, not offering a “don’t know”
- for example, do you like turtles vs do you like cool turtles
- probability panel did the best, crowd sourced was second best, opt in panel and river and app clustered together at the worst
- conclusions – more research is needed [shocker!]
Concurrent Session A, Moderator Carl Ramirez, US Government Accountability Office, 9 papers!
prezzie 1: using item response theory modeling
- useful for generating shorter question lists [assuming you are writing scales, plan to reuse scales many times, and don’t require data for every question you’ve written]
- [know what i love about aapor? EVERYONE can present regardless of presentation skill. content comes first. and on a tangent, I’ve already eaten all the candies i found in the dish]
prezzie 2: measurements of adiposity
- prevalence rate of obesity is 36% in the USA, varies by state but every state is at least 20% [this is embarrassing in a world where millions of people starve to death]
- we most often use self-reported height and weight to calculate BMI; this is how the CDC measures it nationally, but these reports are not reliable
- the correlation of BMI and body fat is less than 40%; we create a proxy with an unreliable measure
- underwater weighing is a better measure but there are obviously many drawbacks to that
prezzie 3: asking sensitive GLBT questions
- respondents categorize things differently than researchers, instructions do affect answers, but does placement of those instructions matter? [hm, never really thought of that before]
- tested long instructions before vs after the question
- examined means and nonresponse
- data collection incomplete so can’t report results
prezzie 4: response order effects related to global warming
- most americans believe climate change is real but one third do not
- primacy and recency effects can affect results, primacy more often in self-administered, recency more often in interviewer assisted
- reverse ordered five questions for two groups, 5 attitudes were arranged on a scale from belief to disbelief
- more people believed in global warming when belief was presented first; effect size small, around 5%
- it affected the first and last items, not the middle opinions
- less educated people were more affected by response orders, also people who weren’t interested in the topic were more affected
prezzie 5: varying administration of sensitive questions to reduce nonresponse
- higher reported rates of LGB identification are assumed to be more accurate
- 18 minute survey on work and job benefits
- tried assigning numbers versus words to the answers ( are you 1: gay…. vs are you gay) [interesting idea!]
- [LOVE sample sizes of over 2300]
- nonresponse differences were significant but the effect size was just 1% or less
- it did show higher rates of LGB; they do recommend trying this in a telephone survey
prezzie 6: questionnaire length and response rates
- CATI used a $10 incentive, web was $1, and mail was $1 to $4 [confound number 1]
- short survey was 30% shorter but still well over 200 questions
- no significant difference in response rate; completion rate was 3% better for the short version
- no effect on web, significant effect on mail
prezzie 8: follow up short surveys to increase response rates
- based on a taxpayer burden survey n=20000
- 6 stage invite and reminder process, could receive up to 3 survey packages, generates 40% response rate
- short form is 4 pages about time spent and money spent, and takes 10 minutes to complete
- many of the questions are simply priming questions so that people answer the time and money questions more accurately
- at stage 6, divided into long and short form
- there was no significant difference in response rate overall
- no differences by difficulty or by method of filing
- maybe people didn’t realize the envelope had a shorter survey; they may have chucked it without knowing
- will try a different envelope next time, as well as overtly saying it’s a shorter survey
prezzie 9: unbalanced vs balanced scales to reduce measurement error
- attempted census of Peace Corps volunteers, with a 92% response rate
- before it was 0 to 5, after it was -2 to +2
- assume most variation will be in the middle as very happy and very unhappy people will go to the extremes anyways
- only 33 people in the test, 206 items
- endpoint results were consistent but the means were very slightly different [needs numerous replications to get over the sample size and margin of error issue]
Live blogged in Nashville. Any errors or bad jokes are my own.
Frances Barlas, Patricia Graham, and Thomas Subias
– we used to be constrained by an 800 by 600 screen. screen resolution has increased, so we can now have more detail, more height and width. but mobile devices mean screen resolution matters again.
– more than 25% of surveys are started on a mobile device; fewer are completed on one
– single response questions don’t serve a lot of needs on a survey but they are the easiest on a mobile device. you have to take the time to consider each one uniquely, then wait to advance to the next question
– hence we love the efficiency of grids. you can get data almost twice as fast with grids.
– myth – increasing a 3-point scale to an 11-point scale will increase your variance. not true. a range-adjusted value shows this. you’re just seeing bigger numbers.
– myth – aggregate estimates are improved by having more items measure the same construct. it’s not the number of items, it’s the number of people. more items improve the estimate for a single person, not for the construct overall. think about whether you need to diagnose a person’s illness versus measure a gender’s purchase of a product [so glad to hear someone talking about this! it’s a huge misconception]
– grids cause speeding, straightlining, break-offs, lower response rates in subsequent surveys
– on a mobile device, you can’t see all the columns of a grid. and if you shrink it, you can’t read or click on anything
– we need to simplify grids and make them more mobile friendly
– in a study, they randomly assigned people to use a device that they already owned [assuming people did as they were told, which we know they won’t 🙂 ]
– only half of completes came in on the assigned device. a percentage answered on all three devices.
– tested items in a grid, items one by one, and an odd one: items in a list with one scale on the side
– traditional method was the quickest
– no differences on means
[more to this presentation but i had to break off. ask for the paper 🙂 ]
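The range-adjustment point about scale variance can be illustrated numerically; the responses below are hypothetical, and the idea is simply to rescale each scale to [0, 1] before comparing variances:

```python
import statistics

def range_adjusted_variance(responses, lo, hi):
    """Variance after rescaling responses from [lo, hi] to [0, 1]."""
    scaled = [(r - lo) / (hi - lo) for r in responses]
    return statistics.pvariance(scaled)

# The same answer pattern expressed on a 1-3 scale and a 0-10 scale
three_pt = [1, 2, 3, 2, 1, 3, 2, 2]
eleven_pt = [0, 5, 10, 5, 0, 10, 5, 5]

raw_gap = statistics.pvariance(eleven_pt) - statistics.pvariance(three_pt)
adj_gap = (range_adjusted_variance(eleven_pt, 0, 10)
           - range_adjusted_variance(three_pt, 1, 3))
# The raw variance looks much bigger on the 11-point scale; the
# range-adjusted gap is zero, i.e. you're just seeing bigger numbers.
print(raw_gap, adj_gap)  # 12.0 0.0
```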
Over here at #AAPOR, I’ve been overdosing on questionnaire design and data quality papers. What kind of headers should we use, how should we order them, what kind of words should they contain. All of these efforts are made to correct errors such as non-normal distributions, overly positive (or negative) distributions and more.
But dare I ask: are we really correcting the distributions? Aren’t we more accurately just affecting the distributions?
There are indeed ‘correct’ answers to questions like how many cars do you own and when did you last go to the dentist. But there is no real or true distribution of how happy are you or who do you think you will vote for.
Perceptions in themselves are truth and cannot be validated. There is no correct way to ask these kinds of questions. There are only ways of asking questions that give you a distribution of responses that allows you to test your hypothesis. If you need a wide distribution and I need a narrow one, then our question designs will create two truths.
We do need to understand how our word choices and design styles affect responses but please don’t think that your truth is my truth and that your method of correcting your “errors” will fix my questions.
Survey question writing is a difficult science and a magical art. There are so many intricacies that researchers must learn. Include a “none of the above” where appropriate. Be sure to include a zero. Be sure to include an upper limit. Be sure you don’t miss an option. Be sure not to overlap options. But here’s a rule you may not have encountered before, a rule that is often ignored.
I invite you to answer only ONE of the following two questions, the correctly designed question. Those of you who know the rule will find this an easy task. For everyone else, I hope once you figure out what the rule is that you will keep it top of mind the next time you’re tasked with writing a survey.