The so-called Hype Cycle of emerging technologies, a term coined by the research firm Gartner, begins with the spark of an idea, surges to a peak of wildly inflated expectations, and then falls into a “trough of disillusionment” as barriers emerge. Gartner now suggests the idea of big data itself may be tumbling toward just such a trough, as companies recognize that acquiring vast amounts of information can mean nothing without the ability to discern, apply and monetize patterns from this information, while remaining respectful of privacy concerns and regulatory constraints. In fact, Gartner suggests that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”
So how do insurers unlock value from big data? Jeff Heaton, RGA’s Chief Data Scientist, published author and professor at Washington University in St. Louis, has a few ideas. To start, he suggests it’s time for insurers to better understand the basics of data science. To that end, he self-produced a video to explain the basics in just four minutes. RGA sat down with Heaton to discuss the video and his thoughts on what every employee at an insurance company should know about this form of statistics.
What do people most misunderstand about data science?
Most of all, data science is a form of statistical detective work. As I say in my video, you can’t have science without data. Maybe you have a little data. Maybe a lot. You take what you have and use a variety of techniques to fill in the gaps. Often your data can resemble a Microsoft Excel spreadsheet, only with certain columns or rows missing. You then develop and train a model to predict the values that are missing based on the information you already know. It’s not magic. It’s more like deductive reasoning.
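As an editorial illustration of the spreadsheet idea above (not taken from Heaton's video), here is a minimal Python sketch that fills one missing cell. The column names, the numbers, and the choice of a straight-line model are all assumptions made for the example.

```python
# Hypothetical spreadsheet rows: (age, annual_premium); None marks a missing cell.
rows = [(30, 300.0), (40, 400.0), (50, None), (60, 600.0)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Train only on the rows where the value is already known.
known = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line(known)

# Predict the missing cells from the information we do have.
filled = [(x, y if y is not None else slope * x + intercept) for x, y in rows]
print(filled)
```

With these made-up rows the known values lie on a straight line, so the model fills the gap at age 50 with a premium of about 500.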
You mention techniques. Can you provide an overview of the basic concepts?
Sure. I talked about deduction. Another way to think about this is teaching a model to learn. That’s a counter-intuitive concept, I know. How do you teach a model? Basically there is a logic behind all data: information comes from somewhere. The data scientist looks at categories or values that are already present and extrapolates what might be missing by refining and adjusting a mathematical model intended to represent this logic. There is supervised learning, where you are trying to predict either a number, through something called regression analysis, or a class, category or type, through what we call classification. Then, with unsupervised learning, maybe you don’t know the value you are looking for, so you take the values that you do have and then try to cluster them. These are the two most popular approaches, but new methods are being developed all the time.
Cluster? Are you referring to a scatter plot?
Not quite. Clustering is what it sounds like. It refers to groups or clusters of dots representing data, but it can take many forms. Maybe you assign different colors to data points. Maybe you have more or fewer clusters. It all rests on your choices and what your algorithm or model requires.
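To make the grouping concrete, here is a small editorial sketch of one common clustering method, k-means, in plain Python. The interview does not name a specific algorithm, so k-means, the sample points, and the starting centers are assumptions chosen for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move each center to its cluster mean."""
    labels = []
    for _ in range(iterations):
        # Assignment step: label every point with the index of the closest center.
        labels = []
        for px, py in points:
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers]
            labels.append(distances.index(min(distances)))
        # Update step: recompute each center as the mean of its assigned points.
        new_centers = []
        for index, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(old_center)  # keep an empty cluster's center in place
        centers = new_centers
    return centers, labels

# Hypothetical data: two visibly separate groups of dots.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

The labels play the role of the colors mentioned above: points that share a label belong to the same cluster. How many clusters to ask for is one of the choices left to the analyst.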
What is the point of clustering data?
People don’t recognize that data science is beautiful! Data science is all about finding patterns, and these often reveal themselves through visual representations. For example, imagine I plot my data on a graph, but all those numbers are noisy. At first glance, they are represented as a jagged or even random assortment of dots on a grid. Many of our models or algorithms, especially for supervised learning, seek to fit a line on top of the data – to apply a pattern to this information. As you train the model with more inputs, it gets closer and closer to the actual distribution of your data points, forming a shape that the eye can follow. But beware: the line shouldn’t exactly match your data or you are in danger of over-fitting.
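The picture of a line creeping closer to the dots can be sketched in a few lines of Python. This editorial example assumes gradient descent as the training method and uses made-up, noise-free points; neither comes from the interview.

```python
# Hypothetical points that lie on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = 0.0, 0.0   # start with a flat line at zero
learning_rate = 0.05          # assumed step size, small enough to converge here
losses = []                   # mean squared error recorded at every step

for step in range(1000):
    # How far off is the current line at each point?
    errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
    losses.append(sum(e * e for e in errors) / len(xs))
    # Gradient of the mean squared error with respect to slope and intercept.
    grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_intercept = 2 * sum(errors) / len(xs)
    # Nudge the line a little in the direction that reduces the error.
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(round(slope, 3), round(intercept, 3), losses[0], losses[-1])
```

Each pass nudges the line toward the points, so the recorded error shrinks from step to step and the slope and intercept settle near 2 and 1.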
Isn’t fitting a model to the data a good thing? How can you over-fit?
Over-fitting is one of your arch enemies as a data scientist. Here’s an analogy: If you’re given a sample exam and you study just that sample over and over and over, you may eventually get a 100% on the exam, but will you pass the real exam? Probably not. You don’t understand the underlying concepts – or in this case the logic underlying how data is organized. Instead, to extend the analogy, you are just parroting back the responses you memorized, even though the questions may have changed. The point is to learn why certain pieces of data are what they are – and then to apply these insights to answer new questions.
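The exam analogy can be played out in code. In this editorial sketch the "memorizer" simply parrots the answer of the closest training row, while a straight line tries to learn the underlying pattern. The data, the two models, and the error measure are all assumptions chosen for illustration.

```python
# Hypothetical data: the true pattern is roughly y = 2x; the training rows carry noise.
train = [(1, 2.6), (2, 3.5), (3, 6.4), (4, 7.6), (5, 10.4)]  # the sample exam
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2)]      # the real exam: unseen questions

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def memorizer(x):
    """Over-fit model: return the memorized answer of the nearest training row."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

slope, intercept = fit_line(train)

def line(x):
    """Simpler model: a straight line fitted to the training rows."""
    return slope * x + intercept

def mean_squared_error(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

memorizer_train_error = mean_squared_error(memorizer, train)
memorizer_test_error = mean_squared_error(memorizer, test)
line_train_error = mean_squared_error(line, train)
line_test_error = mean_squared_error(line, test)

print("memorizer:", memorizer_train_error, memorizer_test_error)
print("line:     ", line_train_error, line_test_error)
```

On these made-up numbers the memorizer scores a perfect zero error on the training rows yet does noticeably worse than the straight line on the unseen rows, which is the signature of over-fitting.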
There’s under-fitting too. Under-fitting occurs when the model you’ve chosen, whether it’s an RBF, Gaussian Process, Decision Tree, Random Forest or Neural Net, fails to capture any discernible organizational structure or pattern. This is called bias error – the model simply does not fit the data. One solution is to pull in additional columns of data to try to refine and improve your model; another is an ensemble, where you combine several models. Bottom line, you want to follow the Goldilocks rule – your model should be neither too hot nor too cold, neither too close to the existing data nor too far away. You want your model to be just right.
That sounds really subjective. What happens if you just can’t find a model that fits your data?
You can try to identify a better model of your data through Feature Engineering, where you create additional calculated fields – basically strengthening your model by feeding it more and more information.
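A small editorial sketch shows how a calculated field can rescue a model that under-fits. The data here is hypothetical and follows a U-shape, so a straight line on the raw column misses the pattern, while the same straight-line model on an engineered column (the square of the raw value) captures it. The choice of a squared feature is an assumption for this example only.

```python
# Hypothetical measurements: the outcome follows a U-shape (y = x squared).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0, 1.0, 0.0, 1.0, 4.0]

def fit_line(inputs, targets):
    """Ordinary least squares for target = slope * input + intercept."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(targets) / n
    sxy = sum((a - mean_in) * (b - mean_out) for a, b in zip(inputs, targets))
    sxx = sum((a - mean_in) ** 2 for a in inputs)
    slope = sxy / sxx
    return slope, mean_out - slope * mean_in

def mean_squared_error(inputs, targets, slope, intercept):
    return sum((slope * a + intercept - b) ** 2 for a, b in zip(inputs, targets)) / len(inputs)

# Model 1: straight line on the raw column. It under-fits the U-shape.
raw_slope, raw_intercept = fit_line(xs, ys)
raw_error = mean_squared_error(xs, ys, raw_slope, raw_intercept)

# Feature engineering: add a calculated field, here the square of the raw value.
squared = [x * x for x in xs]

# Model 2: the same straight-line model, fed the engineered column instead.
eng_slope, eng_intercept = fit_line(squared, ys)
engineered_error = mean_squared_error(squared, ys, eng_slope, eng_intercept)

print(raw_error, engineered_error)
```

The model itself did not change between the two runs; only the information it was given did. In practice the useful calculated field is rarely this obvious and usually comes from knowledge of the business domain.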
If there’s one takeaway you would leave with insurers about data science, what would it be?
The truth is that data science isn’t a single monolithic field – it involves many different specialties and sub-specialties. And it’s not static. It’s constantly evolving and expanding as we learn new things. Some companies apply simple models to large datasets, some apply complex models to small ones, some need to train their models as they go, and some don’t use conventional models at all. It all depends on the industry and the application or specialization.
You can’t understand data science in four minutes, four hours or four days, but you can start to think about the opportunities and limitations of this emerging field.