Data is touted as the next big business (e.g., Wall 2019, Gartner 2018a). Investment banks, analysts and consultants further feed the frenzy with big revenue forecasts. In terms of data monetization opportunities, consultants McKinsey & Company estimate that car-generated data alone will be worth between US$450 billion and US$750 billion by 2030, less than two vehicle generations away (McKinsey 2016). Consumer data is already a big business today. Google and Facebook live off the data that users create on their platforms. Almost all of their revenue comes from advertising, selling “eyeballs” and user engagement to advertisers.
All of the above is just the beginning. A further big data boost is expected from the Internet of Things (IoT): IoT essentially turns ordinary objects into websites. Historically, the Web and website tracking created the first wave of Big Data (which in turn spawned new technology to store and process it, such as Hadoop). Now everyday objects are being instrumented the same way. Cars are a prime example: connected and autonomous vehicles are projected to generate four terabytes (TB) of data per day (Krzanich 2016). This IoT boom is fueled by a confluence of trends in information systems: the miniaturization of sensors such as lidar (light detection and ranging, used in autonomous cars; NOAA 2020), advances in device technology such as edge computing, and the new 5G cellular mobile communications standard.
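To put the projected four terabytes per day in perspective, a quick back-of-the-envelope calculation translates it into a sustained data rate. The resulting figure is our own arithmetic for illustration, not a number from Krzanich (2016):

```python
# Back-of-the-envelope: what 4 TB of vehicle data per day implies
# as a sustained data rate (illustrative arithmetic, not a source figure).

TB = 10**12  # terabyte in bytes (decimal convention)

daily_bytes = 4 * TB
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

rate_mb_per_s = daily_bytes / seconds_per_day / 10**6
print(f"Sustained rate: {rate_mb_per_s:.1f} MB/s")  # ~46.3 MB/s, around the clock
```

Even spread over a full 24 hours, the projection implies a continuous stream of roughly 46 MB every second per vehicle, which helps explain the push toward edge computing and 5G noted above.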
A key mechanism to release value from data is analytics. With websites, it took tools like Google Analytics (formerly Urchin) to benefit from website tracking and attract advertising budgets. Google Analytics is mostly descriptive analytics. Far more value is generated in the consecutive stages of predictive and prescriptive analytics (McKinsey 2018, Gartner 2018b). Examples include product recommendations using machine learning as an amplifier of word-of-mouth marketing (e.g., on Amazon and Netflix), and the application of deep learning or neural network methods across many domains for the recognition of text (sentiment analysis), images (automatic license plate recognition, ALPR), video (autonomous vehicles) and speech (Amazon’s Alexa virtual assistant). Yet despite the media hype, a quick review of how time is spent in data analytics projects reveals a big problem.
Companies have gone from databases to data warehouses and now to data lakes (Porter & Heppelmann 2015). And they seem to be drowning in data (it is even unclear how to quantify the size of data, see “Data Quality”). If “time is money,” as famously noted by one of the founding fathers of the United States (Franklin 1749), then data analytics is a disaster. Today, according to the literature, more than 80% of the time budget of a data analytics project is spent on data processing and wrangling, not on algorithms (Press 2016, Vollenweider 2016). This would turn the 80/20 Pareto principle, a cornerstone of business efficiency, upside down (e.g., Neuman, M.E. 2005). Figure 1 illustrates the data analytics productivity crisis.
We also conducted our own analysis using surveys. We acknowledge that surveys are often a weak means of supporting an argument; they are popular because they are quick and easy, but their results are often poor and misleading. Problems with surveys range from data collection (questionable representativeness, abysmal response rates, etc.) and design (bias in survey instruments, leading or ambiguous questions, inadequate response options, etc.) to the interpretation and extrapolation of results (lack of statistical significance, rating level inconsistencies, etc.). Knowing about these survey pitfalls, we put a premium on representativeness and on simple, unambiguous questions. Our sample is a convenience sample, but it was chosen to maximize its representativeness. Because our focus is on data analytics in business, our data was collected at data science events specifically targeted at data experts in business, not at an academic or research audience. Figure 2 depicts our survey questions and results.
Our survey of data experts confirms the problem. If an analytics project is broken into the three phases of (a) data processing, (b) analytics modelling & evaluation, and (c) deployment, then the reported time shares are 48%, 32% and 20%, respectively (n = 65). The implication is clear: for data analytics to become successful, the data productivity problem has to be solved. Other industries offer clues on how to solve it, for example the auto business: data processing for AI remains handmade and made-to-order, just like cars before Henry Ford industrialized automaking. Gottlieb Daimler invented the motor car in 1886, but it was Henry Ford who invented the modern auto business about 20 years later (Womack et al. 1990).
Henry Ford evolved automaking from a hand-made affair into mass production. He invented the auto business with factories. A factory is about automation and productization. The automation is obvious: the moving assembly line is probably its most visible and striking feature. Less obviously, for automation to work, Ford critically required interchangeability of parts, which in turn required metrics (Clark and Fujimoto 1991). Parts had to be made to precise measurements so that all copies of a part were alike and could be attached to cars coming down the line quickly, without lengthy calibration and refitting work. Mechanical engineering introduced the notion of tolerance as “the range of variation permitted in maintaining a specified dimension in machining a piece” (Webster 2019). Parts were specified (“specced”) in engineering drawings or “blueprints” and then manufactured within precise tolerances to make them interchangeable.
The challenges with data pertain to both measurement and automation (Crosby & Schlueter Langdon 2019). As of 2020, data attributes have remained qualitative and subjective (for emerging quality metrics, see “Data Quantity”). For first solutions toward data productization and automation, see “Creating a data factory for data products.”