Data Analytics faces a big dilemma: On one side, management has ambitious expectations; on the other, data scientists are struggling with a massive data productivity crisis, because too much time is spent on handling data (see “Data is Broken”).
Let’s take a closer look at the data crisis: For one, Big Data is often misunderstood. Yes, some select AI applications, such as deep learning with its neural network-based methods and popular tools like TensorFlow, require massive amounts of data. Flagship applications include recognition of text (sentiment analysis), images (automatic license plate recognition, ALPR), video (autonomous vehicles) and speech (Amazon’s Alexa virtual assistant). The problem: Refining raw data into AI-ready data is still a handmade, made-to-order process that doesn’t scale well. The more data is required, the bigger the productivity problem becomes. One solution here would be building a Data Factory (see “Data Factory for Data Products”).
For another, it may not necessarily take Big Data to create AI success. Instead, a lot of gold can be extracted from small data samples – yet mining it requires analytics specifically designed for small samples. Samples may be small because (a) nobody saw a need to collect more data in the past, (b) data collection is difficult, and therefore expensive and limited, or (c) the right data may be naturally sparse. For example, consider anomaly detection and predictive maintenance – both popular data analytics applications. The issue is not a lack of historical data. The problem is inherent, an essential character of the phenomenon: An event is called an anomaly because its occurrence is unusual and rare (Webster 2020). With one defect per year, even 10 years’ worth of data represents a sample of just 10 data points. There are just 10 “signals,” the rest is “noise” (Silver 2012). What can be predicted from n = 10, and how reliable can it be? In roulette, for example, we intuitively understand that getting 8 reds in a series of ten spins is no indication of an anomaly – yet 80% red in a ten times larger sample of 100 spins would be considered an anomaly, in line with the law of large numbers (LLN) in probability theory. We refer to these small sample opportunities as Small Data, which is not to be confused with the notion of small data as a synonym for small data projects or the small scope of a data initiative (Redman & Hoerl 2019).
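The roulette intuition can be made precise with a simple binomial tail probability. The sketch below (assuming a European wheel with 18 red pockets out of 37, and using only the Python standard library) computes how likely it is to see at least 8 reds in 10 spins versus at least 80 reds in 100 spins under pure chance:

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing
    at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_red = 18 / 37  # European wheel: 18 red pockets out of 37

small = binom_tail(8, 10, p_red)    # 8 or more reds in 10 spins
large = binom_tail(80, 100, p_red)  # 80 or more reds in 100 spins
```

On a fair wheel, 8-of-10 reds happens by chance a few percent of the time – unremarkable – while 80-of-100 is so far from the expected ~49% that its probability is vanishingly small, which is exactly why the larger sample justifies calling it an anomaly.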
Ignoring Small Data may not be an option, because more is coming, such as with edge computing, another key trend in the Gartner Hype Cycle analysis (Gartner 2019). Edge computing pushes analytics closer to the data source, out of the cloud and into a device at the edge, such as a sensor or a vehicle’s electronic control unit (ECU). Edge computing is done for speed and cost savings. It is about achieving similar results with much less data (the data found at the edge) and fewer computations (low-parametric algorithms) to economize on data transfer and computing power, in order to minimize latency and power consumption (battery life). So, how do we deal with less data?
The bad news: Just as a difficult medical problem requires an experienced doctor who can read the symptoms and select the right medicine, Small Data requires a good data scientist. The good news: Lately, the boom in AI has triggered great progress with new tools for Small Data analytics. One such tool improves results with heavily imbalanced samples, where the data of one class (the majority) greatly outnumbers another class (the minority). Example: Sikora & Langdon on Drucker Customer Lab: Marketing to “Minorities”: Mitigating Class Imbalance Problems with Majority Voting Ensemble Learning, link.
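The cited paper’s exact method is not reproduced here, but the general idea behind majority-voting ensembles for class imbalance can be sketched: undersample the majority class several times to form balanced training sets, train one model per set, and combine predictions by majority vote. The following toy example (all names and the deliberately simple one-dimensional threshold classifier are illustrative assumptions, not the authors’ implementation) shows the pattern using only the Python standard library:

```python
import random
from collections import Counter

def train_threshold(xs, ys):
    """Toy 1-D classifier: pick the cut point that best separates
    class 0 (below) from class 1 (above) on the training data."""
    best_t, best_acc = xs[0], -1.0
    for t in sorted(xs):
        acc = sum((x > t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def majority_vote_ensemble(majority_x, minority_x, n_models=5, seed=0):
    """Train one classifier per balanced resample (an undersample of the
    majority class plus the full minority class); predict by majority vote."""
    rng = random.Random(seed)
    k = len(minority_x)
    thresholds = []
    for _ in range(n_models):
        sub = rng.sample(majority_x, k)        # undersample majority class
        xs = sub + list(minority_x)            # balanced training set
        ys = [0] * k + [1] * k
        thresholds.append(train_threshold(xs, ys))
    def predict(x):
        votes = [int(x > t) for t in thresholds]
        return Counter(votes).most_common(1)[0][0]
    return predict
```

Each model sees a balanced view of the data, so no single classifier is dominated by the majority class, and the vote smooths out the variance introduced by undersampling – the core design choice behind this family of techniques.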