Sentiment analysis is a very controversial subject with many people highly doubtful of the validity of the results. With that in mind, I have developed a set of rules that will allow you to ensure your data is scored with validity levels greater than 90%.
- Choose messages that are short. The shorter the better. Tweets are a perfect example as people generally make only one concise point that can’t be misconstrued. Longer messages simply introduce extraneous information that isn’t essential to the main message.
- Don’t collect data from blogs and forums where people may express their points in long, drawn-out, overly verbose ways. These types of messages may include well-described pros and cons, positives and negatives, and this only confuses things.
- Remove from your dataset any messages that incorporate unclear or contradictory opinions. Obviously, the speaker isn’t sure what they are saying, and so their opinion won’t be helpful.
- Remove from your dataset any messages that you aren’t sure how to score. Perhaps they contain emoticons you aren’t familiar with, slang that doesn’t make any sense, or grammatical errors that render the message not understandable.
Rather than worrying about ignoring important subsamples of people who have complicated opinions, people who associate themselves with subcultures, or the obvious skewing and biasing of results, simply focus on the parts of the data that you know will be correct. And there you go. Your sentiment analysis is 95% accurate. It doesn’t generalize to any population, but boy is it accurate!
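For anyone who wants to see the trick in numbers, here is a minimal sketch in Python. Everything in it is made up for illustration: the eight messages, their "true" and "predicted" labels, the `unclear` flag, and the six-word cutoff are all assumptions, not real data or a real scorer. It simply applies the rules above and compares accuracy before and after, along with how much of the data survives.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    truth: str       # sentiment a human would assign (invented here)
    predicted: str   # what a hypothetical scorer returned (invented here)
    unclear: bool    # mixed, sarcastic, or otherwise hard to score


# Toy data, entirely fabricated for this illustration.
messages = [
    Message("Love this phone!", "pos", "pos", False),
    Message("Worst service ever.", "neg", "neg", False),
    Message("Great coffee :)", "pos", "pos", False),
    Message("Meh.", "neutral", "neutral", False),
    Message("The battery is great but the screen cracked within a week "
            "and support was slow.", "mixed", "pos", True),
    Message("Yeah, right, best update ever...", "neg", "pos", True),
    Message("It's fine I guess, though I'm not sure I'd buy it again.",
            "neutral", "neg", True),
    Message("I waited two weeks for delivery and the box arrived damaged, "
            "so I'm returning it.", "neg", "neg", False),
]

MAX_WORDS = 6  # arbitrary cutoff standing in for "the shorter the better"


def accuracy(items):
    """Share of items where the predicted label matches the true label."""
    if not items:
        return 0.0
    return sum(m.truth == m.predicted for m in items) / len(items)


def follows_the_rules(m):
    """Keep only short messages with no unclear or contradictory opinions."""
    return len(m.text.split()) <= MAX_WORDS and not m.unclear


kept = [m for m in messages if follows_the_rules(m)]
coverage = len(kept) / len(messages)

print(f"Accuracy on everything:    {accuracy(messages):.1%}")
print(f"Accuracy after the rules:  {accuracy(kept):.1%}")
print(f"Share of data still left:  {coverage:.1%}")
```

On this toy set, accuracy climbs from 62.5% to 100% while half of the messages quietly disappear, which is exactly the trade the rules are making.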
If you’re interested in a less sarcastic view of how accurate sentiment analysis can be, Seth Grimes is the expert in this field. Read here as he explains why validity scores can’t really be any better than 83%.