Tag Archives: reliability

Harnessing text for human insights #IIeX 

Live note-taking at #IIeX in Atlanta. Any errors or bad jokes are my own.

Chaired by Seth Grimes

Automated text coding: humans and machines learning together by Stu Shulman

  • It is a 2500 year old problem, Plato argued it would be frustrating and it still is.
  • Coders are expensive, it’s difficult at scale, and some models are easier to validate than others; don’t replace humans, there is no one right way to do it, and validation of both humans and machines is essential
  • Want to efficiently code, annotate coding with shared memos, manage coding permissions, have unlimited collaborators, easily measure inter-rater reliability, and adjudicate validity decisions
  • Wanted to take the mouse out of the process, so items load efficiently for coding
  • Computer science and HSF influence: measure everything
  • Measure how fast each annotator works and measure inter-rater reliability; reliability can change drastically by topic
  • Adjudication – sometimes it’s clear when an error has been made, allows you to create a gold standard training set, and give feedback to coders; can identify which coders are weak at even the simplest task, there is human aptitude and not everyone has it, there is a distribution of competencies 
  • 25% of codes are wrong so you need to train machines to trust the people who do a better job at coding
  • Pillars of text analytics – search, filtering, deduplication and clustering (which works well with surveys too), human coding/labelling/tagging (where most of their work goes), and machine learning – this gives a high-quality training set
  • If humans can’t do the labelling, then the machines can’t either
  • Always good to keep humans in the loop
  • Word sense disambiguation is relevant – is “bridge” a game or a road? Is “smoking” a cigarette or being awesome?
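
Two recurring themes above – measuring inter-rater reliability and adjudicating toward a gold standard – are easy to demo for the two-coder case. A minimal Python sketch of percent agreement and Cohen’s kappa, using made-up coder labels:

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items the two coders labelled identically."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance, based on each coder's label distribution."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_exp = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_obs - p_exp) / (1 - p_exp)

# invented labels from two coders on six items
coder1 = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "relevant"]
coder2 = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant", "relevant"]
print(round(percent_agreement(coder1, coder2), 3))  # 0.833
print(round(cohens_kappa(coder1, coder2), 3))       # 0.667
```

Raw agreement looks high here (5 of 6), but kappa is noticeably lower because much of that agreement could happen by chance – which is exactly why raw agreement alone overstates coder quality.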

Automated classification: interesting, at scale and depth by Ian McCarty

  • Active data collection is specific and granular, as well as standardized; but it’s slow and difficult to scale, there is uncertainty, may be observer bias via social desirability, demand characteristics, Hawthorne effect [EVERY method has strength and weaknesses]
  • Declared vs demonstrated interests – you can give 5 stars to a great movie and then watch Paul Blart Mall Cop 5 times in 6 months [Paul Blart is a great movie! Loved it 🙂 ]
  • They replicate the experience of a specific URL to generate more specific data
  • Closed network use case – examined search queries from members to recruit them into studies, segmentation was manual and company needed to automate and scale; lowered per person costs and increased accuracy, found more panelists in more specific clusters, normalized surveys if declared behaviors conflicted with demonstrated behaviors 
  • Open network use case: home improvement brand needed a modern shared meaning with customers and wanted to automate a manual process; distinguished brand followers compared to competitors’ followers, identified where brand values and consumer values aligned, delivered a map for future content creation and a path to audience connection

Text analytics or social media insights by Michalis Michael

  • Next gen research is here now, listening, asking questions, tracking behavior, insights experts
  • Revenues don’t reflect expectations, yet.
  • We’re not doing a great job of integrating insights; social media listening analytics is not completely integrated in our industry yet
  • Homonyms are major noise, eliminating them needs humans and machines
  • Machine learning is language agnostic, create a taxonomy with it, a dictionary of the product category using the words that people use in social media not marketing words
  • It is possible to have 80% agreement with text analytics and the human [I believe this when the language is reasonably simple and known]
  • Becks can mean the beer or David Beckham, and Beck Hansen is a singer; you need to train algorithms with hundreds of clarifications to identify the exact Becks that is the beer
  • Beer is related to appearance and occasions, break down occasions into in home or out of home, then at a BBQ or club
  • What do you say about a beer when they do a commercial that has nothing to do with the beer?
  • English has a lot of sarcasm, more than a lot of other languages [yeah right, sure, I believe you]
  • Break down sentiment into emotions – anger, desire, disgust, hate, joy, love, sadness – can benchmark brands in these categories as well
  • Can benchmark NPS with social media
  • Brand tracking questions can be matched to topics in a social media taxonomy, and there can be even more in the social media version than the survey version
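
The Becks problem above is word sense disambiguation. A real system learns from hundreds of human-coded clarifications; as a toy illustration only, here is a hand-built cue-set scorer in Python (the cue words and example sentences are my own invention, not from the talk):

```python
# Toy word-sense scorer for the "Becks" homonym. The cue sets below are
# invented for illustration; a production system would learn them from
# hundreds of human-coded clarifications.
CUES = {
    "beer": {"pint", "bottle", "brewery", "lager", "bar", "cold", "drink"},
    "footballer": {"goal", "united", "madrid", "pitch", "captain", "football"},
    "singer": {"song", "album", "guitar", "tour", "gig", "loser"},
}

def disambiguate(text):
    """Pick the sense whose cue words overlap the mention's context most."""
    words = set(text.lower().split())
    scores = {sense: len(words & cues) for sense, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("grabbed a cold bottle of Becks at the bar"))  # beer
print(disambiguate("Becks scored a goal for United"))             # footballer
```

When no cue fires, the sketch returns "unknown" – in practice that is precisely the bucket you route to human coders, keeping humans in the loop as Shulman recommends.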

Survey says… #MRX

After two days at CASRO, I learned the following:

  • When you use a 5 point or 7 point scale, you will get different answers
  • When you label or don’t label scales, you will get different answers
  • When you use a web survey vs a mobile survey, you will get different answers
  • When you gamify a survey, you will get different answers
  • (And from the good ol’ days) when you run the same survey on two different panels, you will get different answers

What are we to gain from all of this? Well, no matter what you do or how you do it, you will get different results on surveys every time. There’s just no way around it. What we HOPE is that the results won’t be contrary, but rather simply different in magnitude. That rank orders will remain generally similar, that hates will remain hates, and loves will remain loves. Indeed, if we are lucky enough to run a single study across a number of different methods or styles and get similar rank orders every time, it’s a good indication that the conclusions we’ve drawn are both reliable and valid. Heaven.
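
The “similar rank orders” hope can be checked directly by computing a rank correlation between the two sets of results. A quick Python sketch with hypothetical brand scores from a web and a mobile version of the same survey:

```python
def spearman_rho(x, y):
    """Spearman rank correlation for score lists with no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

web    = [4.1, 3.2, 4.6, 2.8, 3.9]   # invented mean scores from the web survey
mobile = [3.8, 3.0, 4.4, 2.5, 3.5]   # same five brands, mobile survey
print(spearman_rho(web, mobile))     # 1.0
```

Here the mobile scores are uniformly lower – different in magnitude – yet the rank order is identical, so rho is 1.0. That is the “heaven” scenario described above: the conclusions survive the method change even though the numbers do not.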

What this problem also suggests is that there is and can be no right answer. The only right answer is the one in the respondent’s head, and given that people can’t even adequately describe what is going on in their heads, it seems that we will never know the right answer. What we can do is develop clear and specific research hypotheses, and match them up with clear and specific research designs. That is the best way to create reliable and valid answers.

We may not know the exact right answer, but we can know a good answer.

Radical Market Research Idea #3: Insist on quantity over quality #MRX

Wait, was that a typo? Quantity over quality? Well, I meant what I said.

Question #1: What was the sample size of your last tracker? 30 per time frame? 50 per time frame? What about your last custom study? 300? 500?

Question #2: How many pages of questions and demos and cross-tabs did you flip through searching for any chi-square or t-test that was statistically significant? 100? 200?

Here’s the problem. We run ridiculously long surveys with far too few participants per test cell, and we are ok with fishing through far too many chance “significant” results (Type 1 errors) while our underpowered cells miss real differences (Type 2 errors).

Here’s the solution. Put your money into large sample sizes and not into question topic after question topic. Focus on sample sizes within demographic groups rather than questions with 4 or 8 people per cell. Trade variety of questions for reliability of results. Trade overly long surveys for properly sampled cells. Trade breadth of topics for validity of individual questions. Take money away from more and more questions and put it directly into more and more validity and reliability. Radical.
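
To see why cell sizes matter more than question counts, compare the margin of error a proportion carries at different cell sizes. A small Python sketch (95% confidence, worst case p = 0.5):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (8, 50, 300):
    print(f"n={n:>3}: ±{100 * margin_of_error(n):.1f} points")
```

A cell of 8 people carries a margin of error around ±35 percentage points – the estimate is close to meaningless – while 300 per cell tightens it to roughly ±6 points. Trading question topics for cell size buys exactly this.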

Please comment below. What was the sample size of your last study and what was the sample size within many of the cells?

Tom Anderson: Web Analytics #Netgain #mrx

What follows are some of my silly musings and key take-aways of the session.
Tom Anderson – Web Analytics
– 85% of all data stored is unstructured, it doubles every three months, 7 million web pages are added every single day
– First, tracking survey case study, analysis of guest satisfaction survey which has 10 point scales and permits verbatim responses
– Funny thing is the checkbox answers were different from the verbatims. Checkmarks related to the room and the bed, but the verbatim was about the food that made her throw up. The verbatims MUST be read! (People assume you’ll look after the problems and use the comment box for stuff you forgot to ask about; at least that’s what I do)
– Problem with manual coding is code frame changes over time, some codes are missing, some codes become irrelevant, inter-rater reliability (different people and same person would code it differently)
– ooooh, CHAID results, and regression equation 🙂
– Future – surveys might look like a blank post card, thumbs up or down and then write in all your comments
– Second case study, five hotels within a travel website
– Indexing might be the new word for webscraping (it’s a tech term that’s nicer than scraping!)
– 20% of the users are responsible for 80% of the posts, pareto principle, most people make just one or two posts in the last year or so
– “online introverts” folks who are listening but don’t say too much
– People posting on multiple hotel boards are looking for cheaper rates, free nights
– Loyalists who focus on one hotel board are more positive about the hotel
– Had a board lurker who interacted with posters, he knew specific people (slippery slope, researchers can’t do this but the client was the lurker so he was able to)
– Was able to see client’s promotional schedule in the text analysis, nice validation
– 60% of the online population uses a social network; anyone under 24 is on a social network, usually Facebook
– WW2 generation is showing the fastest growth particularly to stay in touch with their family, photos of the grandkids and such (ah, isn’t that sweet, STOP following me gramma!)
– LinkedIn has 65 million users (hey, LinkIn with me!)
– Social networks let people raise their hands that they like a certain brand
– Text analytics predict income and purchasing/spending power on LinkedIn
– Qualitative analysis is a sample of information, text analytics can measure entire population
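
The 20/80 pattern mentioned above is straightforward to check on posts-per-user tallies. A sketch with an invented distribution for one hotel board:

```python
def top_share(post_counts, top_frac=0.2):
    """Share of all posts contributed by the most active top_frac of users."""
    counts = sorted(post_counts, reverse=True)
    k = max(1, round(top_frac * len(counts)))
    return sum(counts[:k]) / sum(counts)

# invented posts-per-user tallies: a few heavy posters, many one-off posters
posts = [120, 85, 60, 40, 30, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1]
print(round(top_share(posts), 2))  # 0.74 – top 20% of users made ~74% of posts
```

The long tail of one-post users is also where the “online introverts” sit: they show up in the membership counts but barely in the text, which is worth remembering before treating board content as the voice of all guests.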

Related Links
#Netgain5 Keynote Roundup: Last Thoughts
Brian Levine: Neuroscience and Marketing Research
Brian Singh: Insights from the Nenshi Campaign
Monique Morden: Online Communities, MROCs
Ray Poynter – Overview of Online Research Trends
Tom Anderson: Web Analytics
Will Goodhand: Social Media Research and Digividuals

Focus groups: The best and only research method

Here’s your task. Read the following list of tasks and identify which ones are useless to brands and clients:
– Watching how people interact with and actually use a product
– Listening to how people talk about products with their peers
– Learning which features people use to convince other consumers
– Learning how consumers convince others to use a product
– Observing facial expressions of disgust and shame and love and peace
– Watching for passion and complacency

Your second task: Make a list of all of the research methods that are error-free, risk-free and always give valid and reliable results.

There may be no perfect research method but there’s definitely a place for focus groups.


PRAM: Multiple Coder Reliability Calculator

* The link seems to be unavailable now, sorry. If you find an active link, please do share it. *

I just found a quick little free bit of software for calculating reliability. Thanks to Skymeg Software and Dr. Neuendorf for making it available at the small cost of plugging it. Which I don’t mind doing.

PRAM calculates:
1. Percent Agreement
2. Scott’s Pi
3. Cohen’s Kappa
4. Spearman Rho
5. Pearson Correlation
6. Lin’s Concordance
7. Holsti’s Reliability

It doesn’t have Krippendorff’s alpha, which is a shame, but free is free.
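
If you need Krippendorff’s alpha anyway, the two-coder nominal case with no missing data is short enough to hand-roll. A Python sketch (the coder labels are illustrative):

```python
from collections import Counter

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    n = len(a)
    pooled = Counter(a) + Counter(b)   # label frequencies across both coders
    total = 2 * n
    # observed disagreement: fraction of items the coders label differently
    d_obs = sum(x != y for x, y in zip(a, b)) / n
    # expected disagreement: chance of drawing two different labels from the
    # pooled values, without replacement
    d_exp = sum(pooled[c] * pooled[k]
                for c in pooled for k in pooled if c != k) / (total * (total - 1))
    return 1 - d_obs / d_exp

coder1 = ["pos", "pos", "neg", "neg", "pos", "neutral"]
coder2 = ["pos", "neg", "neg", "neg", "pos", "neutral"]
print(round(krippendorff_alpha_nominal(coder1, coder2), 3))  # 0.756
```

For this special case (two coders, nominal categories, nothing missing) alpha behaves much like Scott’s pi with a small-sample correction, so PRAM’s pi output will usually land close to it.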

You can find PRAM here.

“We will not be held accountable for any damages that result from the use, install, download, or thinking about PRAM.”
