Why off-the-shelf text mining doesn’t cut the mustard

The shelves are filling up. There are lots of off-the-shelf text mining solutions out there, and I think I’ve tried them all. Initially, to great effect: I popped in a news story. Brilliant. I fed in a Twitter feed of recommended restaurants and, again, I was quite excited.

However, I soon learnt that one machine can’t effectively interpret all the different types of text you throw at it. When I tried out some client data I was left with an epic fail. It seemed that the more specific the genre of the data was, the weirder the results became. The generic engine was good at generic things. In fact, it was really good. But when it came to niche, specific information, such as pharma research, or reviews of high-end sports cars, well, you’d think I had shown it Japanese rather than English. I wanted to believe. I really wanted it to work. But I was disappointed.

What I needed was a way to teach the algorithm things I already knew about my data. More than that, I wanted to reach into the algorithm and point it in the right direction. At the very least I needed to tell it that ‘Repaglinide’ is the name of a pharmaceutical, that a ‘sump’ is part of a car engine and that neither is a spelling mistake. Or teach it that when someone leaves a service review quoting “free” in inverted commas, they are most likely being sarcastic, except perhaps when they’re reviewing an indie album.
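To make that concrete, here is a minimal sketch of the idea in Python. It is not how Klondike works internally; the vocabulary, the lexicon and the `interpret` function are all hypothetical, invented purely for illustration. A generic engine that only knows everyday words will "correct" an unfamiliar term to the nearest thing it recognises, while a domain lexicon lets me tell it the term is legitimate.

```python
import difflib

# Hypothetical everyday vocabulary a generic engine might know (illustrative only).
GENERAL_VOCAB = {"the", "pump", "sum", "engine", "leaked", "oil", "patient", "took"}

# Hypothetical domain lexicon: things I already know about my data.
DOMAIN_LEXICON = {
    "repaglinide": "pharmaceutical",
    "sump": "car engine part",
}


def interpret(token, lexicon=None):
    """Return (word, label) for a token.

    Domain terms are checked first, so they are never treated as typos.
    Otherwise the token is matched against the general vocabulary, and an
    unknown token is replaced by its closest general word if one is similar.
    """
    word = token.lower()
    if lexicon and word in lexicon:
        return word, lexicon[word]
    if word in GENERAL_VOCAB:
        return word, "general"
    # sorted() keeps the candidate order deterministic between runs
    matches = difflib.get_close_matches(word, sorted(GENERAL_VOCAB), n=1, cutoff=0.7)
    if matches:
        return matches[0], "corrected"
    return word, "unknown"


# Without domain knowledge, "sump" is quietly turned into a different word.
print(interpret("sump"))
# With the lexicon, it is recognised for what it is.
print(interpret("sump", DOMAIN_LEXICON))
print(interpret("Repaglinide", DOMAIN_LEXICON))
```

The point of the sketch is the order of the checks: what I know about the sector gets consulted before the generic guesswork kicks in.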

That’s why Klondike was built.


Behind the scenes whir algorithms, which we programme to learn about sector-specific topics. This gives us the freedom to fine-tune our analysis for richer, more meaningful insights. A generic topic model is simply not good enough to catch the nuances of domain-specific information.

There is another reason why a bespoke tool beats the competition: it is purpose-built for generating relevant insight, giving us the ability to manipulate and work with all the variables within the data.

For example, in an unstructured text analysis project, you need to be able to slice the text in a number of ways. If the word “queues” emerges alongside “time-wasting”, “unnecessary” or “long”, that’s an indicator something is going wrong for customers. But this in itself is not actionable insight yet, is it? I need to dig deeper so that my client can make relevant changes. I want to know where the queues are, or if they only build up during peak times. That’s the magic of being able to drill down immediately into all the mentions of “queues”. If Klondike has identified that as a key topic, I can use it to go beyond simple word search and look at a broader topic that will also pull up all the associated words people might use, such as “lines”, “files”, “backlogs” or “logjams”. The topic may also include mentions of unresponsive staff, or how quickly a manager reacts to cut down on waiting times. A drill-down using positive sentiment as a filter may go as far as finding the solution – or at the very least, give examples where customers believe queues have been effectively dealt with.
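The drill-down described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Klondike’s actual implementation: the `Comment` record, the `QUEUE_TOPIC` word set, the sentiment scores and the sample comments are all made up, and in a real project the topic terms would come from a trained topic model rather than a hand-written set.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    sentiment: float  # assumed precomputed score between -1 (negative) and 1 (positive)
    hour: int         # assumed hour of day (0-23) the visit took place


# Hypothetical topic: words a topic model might group together with "queues".
QUEUE_TOPIC = {"queue", "queues", "line", "lines", "backlog", "backlogs",
               "logjam", "logjams"}


def mentions_topic(comment, topic_terms):
    """True if any word in the comment belongs to the topic."""
    words = {w.strip('.,!?"“”').lower() for w in comment.text.split()}
    return bool(words & topic_terms)


def drill_down(comments, topic_terms, min_sentiment=None, hours=None):
    """Return comments on a topic, optionally filtered by sentiment and time."""
    hits = [c for c in comments if mentions_topic(c, topic_terms)]
    if min_sentiment is not None:
        hits = [c for c in hits if c.sentiment >= min_sentiment]
    if hours is not None:
        hits = [c for c in hits if c.hour in hours]
    return hits


# Made-up sample data for illustration.
comments = [
    Comment("Long queues at lunchtime, total time-wasting.", -0.8, 13),
    Comment("The manager opened a second till and the lines cleared fast.", 0.7, 13),
    Comment("Lovely coffee.", 0.6, 10),
    Comment("Huge backlogs at the returns desk.", -0.5, 17),
]

# Everything about queues, whichever word the customer used.
all_queue_comments = drill_down(comments, QUEUE_TOPIC)
# Only the positive ones: where did customers feel queues were handled well?
solutions = drill_down(comments, QUEUE_TOPIC, min_sentiment=0.5)
# Only peak lunchtime hours.
peak = drill_down(comments, QUEUE_TOPIC, hours=range(12, 15))
```

A plain search for “queues” would have found only the first comment. Searching by topic also finds the “lines” and “backlogs” comments, and the positive-sentiment filter surfaces the one where a manager actually fixed the problem.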

And it can do all that because it is bespoke.

Klondike is a brand new service from Simpson Carpenter that extracts the key themes from any text data, whatever the source. For more information, see Klondike.marketing or request a demo here.

Ryan Howard, Associate Director, Marketing Sciences