Moderator: Jessica L. Holzberg, U.S. Census Bureau
Satisfied or Dissatisfied? Does Order Matter?; Jolene D. Smyth, University of Nebraska-Lincoln; Richard Hull, University of Nebraska-Lincoln
- Best practice is to use a balanced question stem and keep response options in order
- What order should it be in the question stem
- Doesn’t seem to matter whether the scale is left to right or top to bottom
- Visual Heuristic Theory – help make sense of questions, “left and top mean first” and “up means good”, people expect the positive answer to come first, maybe it’s harder to answer if good is at the bottom
- Why should the question stem matter, we rarely look at this
- “How satisfied or dissatisfied are you?” [I avoid this completely by saying “what is your opinion about this” and then use those words in the scale, why duplicate words and lengthen questions]
- Tested Sat first and Disat second in the stem, and then Sat top and Disat bottom in the answer list, and vice versa
- What would the nonresponse look like in these four options – zero differences
- Order in question stem had practically no impact, zero if you think about random chance
- Did find that you get more positive answers when positive answer is first
- [I think we overthink this. If the question and answers are short and simple, people have no trouble and random chance takes its course. Also, as long as all your comparisons are within the test, it won’t affect your conclusions]
- [She just presented negative results. No one would ever do that in a market research conference 🙂 ]
Question Context Effects on Subjective Well-being Measures; Sunghee Lee, University of Michigan; Colleen McClain, University of Michigan
- External effects – weather, uncomfortable chair, noise in the room
- Internal effects – survey topic, image, instructions, response scale, question order
- People don’t view questions in isolation, it’s a flow of questions
- Tested with life satisfaction and self-rated health, how are the two related, does it matter which one you ask first; how will thinking about my health satisfaction affect my rating of life satisfaction
- People change their behaviors when they are asked to think about mortality issues, how is it different for people whose parents are alive or deceased
- High correlations in direction as expected
- When primed, people whose parents are deceased expected a lesser lifespan
- Primed respondents said they considered their parents’ death and age at death
- Recommend keeping the questions apart to minimize effects [but this is often/rarely possible]
- Sometimes priming could be a good thing, make people think about the topic before answering
Instructions in Self-administered Survey Questions: Do They Improve Data Quality or Just Make the Questionnaire Longer?
Cleo Redline, National Center for Education Statistics; Andrew Zukerberg, National Center for Education Statistics; Chelsea Owens, National Center for Education Statistics; Amy Ho, National Center for Education Statistics
- For instance, if you say “how many shoes do you have not including sneakers”, and what if you have to define loafers
- Instructions are burdensome and confusing, and they lengthen the questionnaire
- Does formatting of instructions matter
- Put instructions in italics, put them in bullet points because there were several somewhat lengthy instructions
- Created instructions that conflicted with natural interpretation of questions, e.g., assessment does not include quizzes or standardized tests
- Tried using paragraph or list, before or after, with or without instructions
- Adding instructions did not change mean responses
- Instructions intended to affect the results did actually do so, i.e., people read and interpreted the instructions
- Instructions before the question are effective as a paragraph
- Instructions after the question are more effective as lists
- On average, instructions did not improve data quality, problems are real but they are small
- Don’t spend a lot of time on it if there aren’t obvious gains
- Consider not using instructions
Investigating Measurement Error through Survey Question Placement; Ashley R. Wilson, RTI International; Jennifer Wine, RTI International; Natasha Janson, RTI International; John Conzelmann, RTI International; Emilia Peytcheva, RTI International
- Generally pool results from self administered and CATI results, but what about sensitive items, social desirability, open end questions, what is “truth”
- Can evaluate error with fictitious issues – e.g., a policy that doesn’t exist [but keep in mind policy names sound the same and could be legitimately misconstrued ]
- Test using reverse coded items, straight lining, check consistency of seemingly contradictory items [of course, there are many cases where what SEEMS to contradict is actually correct, e.g., Yes, I have a dog, No I don’t buy dog food; this is one of the weakest data quality checks]
- Can also check against administrative data
- “AssistNow” loan program did not exist [I can see people saying they agree because they think any loan program is a good thing]
- On the phone, there were more substantive answers, more people agreed with the fictitious program [but it’s a very problematic question to begin with]
- Checked how much money they borrowed, $1000 average measurement error [that seems pretty small to me, borrow $9000 vs $10000 is a non-issue, even less important at $49000 and $50000]
- Mode effects aren’t that big
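For anyone who wants to try the straightlining and reverse-coded consistency checks mentioned above, here is a minimal sketch. Everything in it is my own assumption rather than anything the presenters showed: the 1 to 5 agreement scale, the function names, the one-point tolerance, and the two made-up respondents. And as noted, a flag only means "look closer", not "bad respondent".

```python
from typing import Sequence


def is_straightlined(answers: Sequence[int]) -> bool:
    """True when every item in a grid received the identical answer."""
    return len(answers) > 1 and len(set(answers)) == 1


def reverse_code(answer: int, scale_min: int = 1, scale_max: int = 5) -> int:
    """Flip a reverse-worded item so it points the same way as the other items."""
    return scale_max + scale_min - answer


def inconsistent(item: int, reversed_item: int, tolerance: int = 1,
                 scale_min: int = 1, scale_max: int = 5) -> bool:
    """Flag a pair when an item and its reverse-worded twin disagree
    by more than `tolerance` points after reverse coding."""
    recoded = reverse_code(reversed_item, scale_min, scale_max)
    return abs(item - recoded) > tolerance


# Hypothetical respondents answering a five-item grid on a 1 to 5 scale.
respondents = {
    "r1": [4, 4, 4, 4, 4],  # same answer all the way down
    "r2": [5, 4, 2, 4, 1],  # varied answers
}

for rid, answers in respondents.items():
    print(rid, "straightlined:", is_straightlined(answers))
```

The tolerance is there because a pair of answers that looks contradictory can still be perfectly honest, so a strict equality check would over-flag.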
Do Faster Respondents Give Better Answers? Analyzing Response Time in Various Question Scales; Daniel Goldstein, NYC Department of Housing Preservation and Development; Kristie Lucking, NYC Department of Housing Preservation and Development; Jack Jerome, NYC Department of Housing Preservation and Development; Madeleine Parker, NYC Department of Housing Preservation and Development; Anne Martin, National Center for Children and Families
- 300 questions, complicated sections, administered by two interviewers, housing, finances, debt, health, safety, demographics; Variety of scales throughout
- 96000 response times measured, right skewed (bunched at the fast end) with a really long tail
- People with less education take longer to answer questions, people who are employed take longer to answer, older people take longer to answer, and non-English speakers take the longest to answer
- People answer more quickly as they go through the survey, become more familiar with how the survey works
- Yes no are the fastest, check all that apply are next fast as they are viewed as yes no questions
- Experienced interviewers are faster
- Scales with more answer categories take longer
Live note taking at #AAPOR in Austin, Texas. Any errors or bad jokes are my own.
The effect of respondent commitment and tailored feedback on response quality in an online survey; Kristin Cibelli, U of Michigan
- People can be unwilling or unable to provide high quality data, will informing them of the importance and asking for commitment help to improve data quality [I assume this means the survey intent is honourable and the survey itself is well written, not always the case]
- Used administrative records as the gold standard
- People were told their answers would help with social issues in the community [would similar statements help in CPG, “to help choose a pleasant design for this cereal box”]
- 95% of people agreed to the commitment statement, 2.5% did not agree but still continued; thus, we could assume that the control group might be very similar in commitment had they been asked
- Reported income was more accurate for committed respondents, marginally significant
- Overall item nonresponse was marginally better for committed respondents, not committed people skipped more
- Not committed were more likely to straightline
- Reports of volunteering, social desirability were possibly lower in the committed group, people confessed it was important for the resume
- Committed respondents were more likely to consent to reviewing records
- Commitment led to more responses to income question, and improved the accuracy, more likely to check their records to confirm income
- Should try asking control group to commit at the very end of the survey to see who might have committed
Best Practice Instrument design and communications evaluation: An examination of the NSCH redesign by William Bryan Higgins, ICF International
- National and state estimates of child well-being
- Why redesign the survey? To shift from landline and cell phone numbers to household address based sampling design because kids were answering the survey, to combine two instruments into one, to provide more timely data
- Move to self completion mail or web surveys with telephone follow-up as necessary
- Evaluated communications about the survey, household screener, the survey itself
- Looked at whether people could actually respond to questions and understand all of the questions
- Noticed they need to highlight who is supposed to answer the survey, e.g., only for households that have children, or even if you do NOT have children. Make requirements bold, high up on the page.
- The wording assumed people had read or received previous mailings. “Since we last asked you, how many…”
- Needed to personalize the people, name the children during the survey so people know who is being referred to
- Wanted to include less legalese
Web survey experiments on fully balanced, minimally balanced, and unbalanced rating scales by Sarah Cho, SurveyMonkey
- Is now a good time or a bad time to buy a house. Or, is now a good time to buy a house or not? Or, is now a good time to buy a house?
- Literature shows a moderating effect for education
- Research showed very little difference among the formats, no need to balance questions online
- Minimal differences by education though lower education does show some differences
- Conclusion, if you’re online you don’t need to balance your results
How much can we ask? Assessing the effect of questionnaire length on survey quality by Rebecca Medway, American Institutes for Research
- Adult education and training survey, paper version
- Wanted to redesign the survey but the redesign was really long
- 2 versions were 20 pages and 28 pages, 98 questions or 138 questions
- Response rate slightly higher for shorter questionnaire
- No significant differences in demographics [but I would assume there is some kind of psychographic difference]
- Slightly more non-response in longer questionnaire
- Longer surveys had more skips over the open end questions
- Skip errors had no differences between long and short surveys
- Generally longer had lower response rate but no extra problems over the short
- [they should have tested four short surveys versus the one long survey. 98 is just as long as 138 questions in my mind]
It’s true that for the most part, leading questions are the sign of a poorly skilled, inexperienced survey writer. When it’s pointed out, most of us can see that these are terrible questions.
- Do you agree that sick babies deserve free healthcare?
- Should poorly constructed laws be struck down?
- Is it important to fund new products that improve the lives of people?
- Should products that cause rashes be pulled from stores?
- Should stores always have enough cashiers so that no one has to wait in a long line?
But are leading questions always bad? I think not. However, these are situations that only experienced researchers should attempt. Leading questions may be appropriate when you are trying to measure socially undesirable, embarrassing, unethical, inappropriate, or illegal activities. Consider these examples.
Would you say yes to this:
- Have you driven drunk in the past three months?
What about to this?
- Many people realize that they have driven after having too much to drink. Is this something you have done in the last three months?
Would you say no to this?
- Have you donated to charity in the past three months?
What about to this?
- Sometimes it’s hard to donate to charity even when you really want to. Have you donated to charity in the past three months?
In both cases, it is possible that the first question will cause people to give a more socially appropriate answer, but not necessarily the valid answer. In both cases, the second question might create a mindset where the responder feels better about sharing a socially undesirable answer.
The next time you need to write a survey, consider whether you need to write a leading question. Consider your wording carefully!
Live blogged from #ESRA15 in Reykjavik. Any errors or bad jokes are my own.
Well, last night I managed to stay up until midnight. The lights at the church went on, lighting up the tower and the very top in an unusual way. They were quite pretty! The rest of the town enjoyed mood lighting as it didn’t really get dark at all. Tourists were still wandering in the streets since there’s no point going to bed in a delightful foreign city if you can still see where you’re going. And if you weren’t a fan of the mood lighting, have no fear! The sun ‘rose’ again just four hours later. If you’re scared of the dark, this is a great place to be – in summer!
Today’s program for me includes yet another session on question data quality, polling question design, and my second presentation on how non-native English speakers respond to English surveys. We may like to think that everyone answering our surveys is perfectly fluent but let’s be realistic. About 10% of Americans have difficulty reading/writing in English because it is not their native language. Add to that weakly and non-literate people, and there’s potential big trouble at hand.
- compared 2 point scale and 11 point scale, different order of questions and questions can even be very widely apart, looked at perceived prestige of occupations
- separated two pages of the surveys with a music game of guessing the artist and song, purely as distraction from the survey. the second page was the same questions in a completely different order, did the same thing numerous times changing the number of response options and question orders each time. whole experiment lasted one hour
- assumed scale was unidimensional
- no differences comparing 4 point to 9 point scale, none between 2 point and 9 point scale [so STOP USING HUGE SCALES!!!]
- prestige does not change depending on order in the survey [but this is to be expected with non-emotional, non-socially desirable items]
- respondents confessed they tried to answer well but maybe not the best of their ability or maybe their answers would change the next time [glad to see people know their answers aren’t perfect. and i wouldn’t expect anything different. why SHOULD they put 100% effort into a silly task with no legitimate outcome for them.]
measuring attitudes towards immigration with direct questions – can we compare 4 answer categories with dichotomous responses
- when sensitive questions are asked, social desirability affects response distributions
- different groups are affected in different ways
- asked questions about racial immigration – asked binary or as a 4 point scale
- it’s not always clear that slightly is closer to none or that moderately is closer to strongly. can’t just assume the bottom two boxes are the same or the top two boxes are the same
- education does have an effect, as well as age in some cases
- expression of opposition for immigration depends on the response scale
- binary responses lead to 30 to 50% more “allow none” responses than the 4 point scale
- respondents with lower education have lower probability to choose middle scale point
cross cultural differences in the impact of number of response categories on response behaviour and data structure of a short scale for locus of control
- locus of control scale, 4 items, 2 internal, 2 external
- tested 5 point vs 9 point scale
- do the means differ, does the factor structure differ
- I’m my own boss; if I work hard, I’ll succeed; when at work or in my private life what I do is mainly determined by others; bad luck often gets in the way of my plans
- labeled doesn’t apply at all, applies completely
- didn’t see important demographic differences
- saw one interaction but it didn’t really make sense [especially given sample size of 250 and lots of other tests happening]
- [lots of chatter about significance and non-significance but little discussion of what that meant in real words]
- there was no effect of item order, # of answer options mattered for external locus but not internal locus of control
- [i’d say hard to draw any conclusions given the tiny number of items, small sample size. desperately needs a lot of replication]
the optimal number of categories in item specific scales
- type of rating scale where the answer is specific to the scale and doesn’t necessarily apply to every other item – what is your health? excellent, good, poor
- quality increased with the number of answer options comparing 11,7,5,3 point scales but not comparing 10,6,4 point scales
- [not sure what quality means in this case, other audience members didn’t know either, lacking clear explanation of operationalization]
6 papers moderated by Martin Barron, NORC
prezzie 1: evaluating quality control questions, by Keith Phillips
- people become disengaged in a moment but not throughout an entire survey, true or false – these people are falsely accused [agree so much!]
- if most people fail a data quality question, it’s a bad question
- use a long paragraph and then state at the end please answer with none of the above to this engagement question – use a question that everyone can answer –> is there harm in removing these people
- no matter how a dataset is cleaned, the answers remained the same, they don’t hurt data quality, likely because it happens randomly
- people who fail many data quality questions are the problem, which questions are most effective?
- most effective questions were low incidence check, open ends, speeding
prezzie 2: key factor of opinion poll quality
- errors in political polling have doubled over the last ten years in canada
- telephone coverage has decreased to 67% when it used to be 95%
- online panel is highly advantageous for operational reasons but it has high coverage error and it depends on demographic characteristics
- online generated higher item selection than IVR/telephone
prezzie 3: new technology for global population insights
- random domain intercept technology – samples people who land on 404 pages, reaches non-panel people
- similar to random digit dialing
- allows access to many countries around the world
- skews male, skews younger, but that is the nature of the internet
- rr in usa are 6% compared to up to 29% elsewhere [wait until we train them with our bad surveys. the rates will come down!]
- 30% mobile in USA but this is completely different around the world
- large majority of people have never or rarely taken surveys, very different than panel
prezzie 5: surveys based on incomplete sampling
- first mention of total survey error [it’s a splendid thing isn’t it!]
- nonprobability samples are more likely to be early adopters [no surprise, people who want to get in with new tech want to get in with other things too]
- demographic weighting is insufficient
- how else are nonprobability samples different – more social engagement, higher self importance, more shopping behaviours, happier in life, feel like part of the community, more internet usage
- can use a subset of questions to help reduce bias – 60 measures reduced to number surveys per month, hours on internet, trying new products first, time spent watching TV, using coupons, number of times moved in last 5 years
- calibrated research results matched census data well
- probability sampling is always preferred but we can compensate greatly
prezzie 6: evaluating questionnaire biases across online sample providers
- calculated the absolute difference possible when completely rewriting a survey in every possible way – same topic but different orders, words, answer options, answer order, imagery, not using a don’t know
- for example, do you like turtles vs do you like cool turtles
- probability panel did the best, crowd sourced was second best, opt in panel and river and app clustered together at the worst
- conclusions – more research is needed [shocker!]
Concurrent Session A, Moderator Carl Ramirez, US Government Accountability Office, 9 papers!
prezzie 1: using item response theory modeling
- useful for generating shorter question lists [assuming you are writing scales, plan to reuse scales many times, and don’t require data to every question you’ve written]
- [know what i love about aapor? EVERYONE can present regardless of presentation skill. content comes first. and on a tangent, I’ve already eaten all the candies i found in the dish]
prezzie 2: measurements of adiposity
- prevalence rate of obesity is 36% in the USA, varies by state but every state is at least 20% [this is embarrassing in a world where millions of people starve to death]
- we most often use self reported height and weight to calculate BMI, this is how national CDC measures it but these reports are not reliable
- correlation of BMI and body fat is less than 40%, we create a proxy with an unreliable measure
- underwater weight is a better measure but there are obviously many drawbacks to that
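To see why self reports are a shaky basis for BMI, here is a small sketch using the standard formula (weight in kilograms divided by height in metres squared) and the usual adult obesity cutoff of 30. The person and the size of the misreport are hypothetical numbers I made up for illustration, not figures from the presentation.

```python
OBESITY_CUTOFF = 30.0  # standard adult BMI threshold for obesity


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float) -> bool:
    """True when the BMI is at or above the obesity cutoff."""
    return bmi(weight_kg, height_m) >= OBESITY_CUTOFF


# Hypothetical person: measured values vs. slightly flattering self reports
# (3 kg lighter, 2 cm taller).
measured = (95.0, 1.76)
self_reported = (92.0, 1.78)

print("measured:", round(bmi(*measured), 1), is_obese(*measured))
print("self reported:", round(bmi(*self_reported), 1), is_obese(*self_reported))
```

In this made-up case a small and quite believable misreport is enough to move the same person from one side of the cutoff to the other, which is the kind of error that adds up in a prevalence estimate.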
prezzie 3: asking sensitive GLBT questions
- respondents categorize things differently than researchers, instructions do affect answers, does placement of those instructions matter? [hm, never really thought of that before]
- tested long instructions before vs after the question
- examined means and nonresponse
- data collection incomplete so can’t report results
prezzie 4: response order effects related to global warming
- most americans believe climate change is real but one third do not
- primacy and recency effects can affect results, primacy more often in self-administered, recency more often in interviewer assisted
- reverse ordered five questions for two groups, 5 attitudes were arranged on a scale from belief to disbelief
- more people believed global warming when it was presented first, effect size small around 5%
- it affected the first and last items, not the middle opinions
- less educated people were more affected by response orders, also people who weren’t interested in the topic were more affected
prezzie 5: varying administration of sensitive questions to reduce nonresponse
- higher rates of LGB are assumed to be more accurate
- 18 minute survey on work and job benefits
- tried assigning numbers versus words to the answers ( are you 1: gay…. vs are you gay) [interesting idea!]
- [LOVE sample sizes of over 2300]
- non response differences were significant but the effect size was just 1% or less
- it did show higher rates of LGB, do recommend trying this in a telephone survey
prezzie 6: questionnaire length and response rates
- CATI used a $10 incentive, web was $1, and mail was $1 to $4 [confound number 1]
- short survey was 30% shorter but still well over 200 questions
- no significant difference in response rate, completion rate better for short version 3% more
- no effect on web, significant effect on mail
prezzie 8: follow up short surveys to increase response rates
- based on a taxpayer burden survey n=20000
- 6 stage invite and reminder process, could receive up to 3 survey packages, generates 40% response rate
- short form is 4 pages about time to complete and money to complete, takes 10 minutes to complete
- many of the questions are simply priming questions so that people answer the time and money questions more accurately
- at stage 6, divided into long and short form
- there was no significant difference in response rate overall
- no differences by difficulty or by method of filing
- maybe people didn’t realize the envelope has a shorter survey, they may have chucked it without knowing
- will try a different envelope next time as well as saying overtly it’s a shorter survey
prezzie 9: unbalanced vs balanced scales to reduce measurement error
- attempted census, 92% response rate of peace corps volunteers
- before it was 0 to 5, after it was -2 to +2
- assume most variation will be in the middle as very happy and very unhappy people will go to the extremes anyways
- only 33 people in the test, 206 items
- endpoint results were consistent but the means were very slightly different [needs numerous replications to get over the sample size and margin of error issue]
Live blogged in Nashville. Any errors or bad jokes are my own.
Frances Barlas, Patricia Graham, and Thomas Subias
– we used to be constrained by an 800 by 600 screen. screen resolution has increased, can now have more detail, more height and width. but now mobile devices mean screen resolution matters again.
– more than 25% of surveys are being started with a mobile device, fewer are being completed with a mobile device
– single response questions don’t serve a lot of needs on a survey but they are the easiest on a mobile device. and you have to take the time to consider each one uniquely. then you have to wait to advance to the next question
– hence we love the efficiency of grids. you can get data almost twice as fast with grids.
– myth – increasing a scale from 3 points to 11 points will increase your variance. not true. a range-adjusted value shows this is not true. you’re just seeing bigger numbers.
– myth – aggregate estimates are improved by having more items measure the same construct. it’s not the number of items, it’s the number of people. it improves it for a single person, not for the construct overall. think about whether you need to diagnose a person’s illness versus a gender’s purchase of a product [so glad to hear someone talking about this! it’s a huge misconception]
– grids cause speeding, straightlining, break-offs, lower response rates in subsequent surveys
– on a mobile device, you can’t see all the columns of a grid. and if you shrink it, you can’t read or click on anything
– we need to simplify grids and make them more mobile friendly
– in a study, they randomly assigned people to use a device that they already owned [assuming people did as they were told, which we know they won’t 🙂 ]
– only half of completes came in on the assigned device. a percentage answered on all three devices.
– tested items in a grid, items one by one in a grid, and an odd one which is items in a list with one scale on the side
– traditional method was the quickest
– no differences on means
[more to this presentation but i had to break off. ask for the paper 🙂 ]
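The scale-length myth above is easy to check for yourself. In this minimal sketch I am assuming that "range adjusted" means dividing the variance by the squared width of the scale, which is my reading and not necessarily the presenters' exact calculation. The answers are made up, and the 11 point data is simply the 3 point data stretched onto a longer scale.

```python
import math
from statistics import pvariance


def stretch(answer, old_min, old_max, new_min, new_max):
    """Linearly map an answer from one scale onto another."""
    return new_min + (answer - old_min) * (new_max - new_min) / (old_max - old_min)


def range_adjusted_variance(answers, scale_min, scale_max):
    """Variance divided by the squared scale width (assumed definition)."""
    return pvariance(answers) / (scale_max - scale_min) ** 2


# Hypothetical answers on a 1 to 3 scale, then the same opinions on a 1 to 11 scale.
answers_3pt = [1, 2, 2, 3, 3, 3, 1, 2]
answers_11pt = [stretch(a, 1, 3, 1, 11) for a in answers_3pt]

print("raw variance:", pvariance(answers_3pt), "vs", pvariance(answers_11pt))
print("range adjusted:",
      range_adjusted_variance(answers_3pt, 1, 3), "vs",
      range_adjusted_variance(answers_11pt, 1, 11))
print("same after adjusting:",
      math.isclose(range_adjusted_variance(answers_3pt, 1, 3),
                   range_adjusted_variance(answers_11pt, 1, 11)))
```

The raw variance on the longer scale is bigger only because the numbers are bigger. Real respondents won't map their answers this neatly, of course, so treat this as an illustration of the arithmetic and not as evidence about how people use long scales.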
Over here at #AAPOR, I’ve been overdosing on questionnaire design and data quality papers. What kind of headers should we use, how should we order them, what kind of words should they contain. All of these efforts are made to correct errors such as non-normal distributions, overly positive (or negative) distributions and more.
But dare I ask. Are we really correcting the distributions? Aren’t we more accurately just affecting the distributions?
There are indeed ‘correct’ answers to questions like how many cars do you own and when did you last go to the dentist. But there is no real or true distribution of how happy are you or who do you think you will vote for.
Perceptions in themselves are truth and cannot be validated. There is no correct way to ask these kinds of questions. There are only ways of asking questions that give you a distribution of responses that allow you to test your hypothesis. If you need a wide distribution and I need a narrow one, then our question designs will create two truths.
We do need to understand how our word choices and design styles affect responses but please don’t think that your truth is my truth and that your method of correcting your “errors” will fix my questions.
Survey question writing is a difficult science and a magical art. There are so many intricacies that researchers must learn. Include a “none of the above” where appropriate. Be sure to include a zero. Be sure to include an upper limit. Be sure you don’t miss an option. Be sure not to overlap options. But here’s a rule you may not have encountered before, a rule that is often ignored.
I invite you to answer only ONE of the following two questions, the correctly designed question. Those of you who know the rule will find this an easy task. For everyone else, I hope once you figure out what the rule is that you will keep it top of mind the next time you’re tasked with writing a survey.