Creating data pools for AI: More of the right data

The first rule of success with data analytics and artificial intelligence (AI) is to use the right data in the right quantity. AI can only extract insights if the relevant information is captured in the data, and the more of it, the better. For example, if the objective is to predict the failure of a machine, then the data used to create the algorithm must contain information on past failure events, and on many of them. Otherwise, it is “garbage in, garbage out” (GIGO).
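
As a rough illustration of this rule, a quick check before any modeling can count how many failure events a data set actually contains. The sketch below is a hypothetical helper, not part of any library; the field name `failed` and the threshold of 50 events are assumptions chosen only for illustration.

```python
def has_enough_failures(records, label_key="failed", min_events=50):
    """Return (ok, n_failures): does the data hold at least min_events failures?"""
    # Count the records whose failure flag is set (hypothetical field name).
    n_failures = sum(1 for r in records if r.get(label_key))
    return n_failures >= min_events, n_failures


# Toy data: 1,000 machine-hours with only 3 recorded failures.
records = [
    {"machine_id": i % 10, "failed": i in (100, 500, 900)}
    for i in range(1000)
]

ok, n = has_enough_failures(records)
print(ok, n)  # prints: False 3
```

A data set like this one is large, but it carries too few failure events to learn from, which is exactly the “garbage in, garbage out” situation.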

Doing the right thing first

Artificial intelligence (AI) feeds on data. Neural networks and deep learning in particular, for example models built with frameworks such as TensorFlow, have a voracious appetite for data. Yet, despite its importance, data often remains an afterthought. Typically, planning for a new data analytics project is dominated by debates about the right skill set for the data scientists, the right tools, deadlines and, of course, budget. As a result, most of the time in a data analytics project (estimates range from 50% to just under 80%) is consumed by data search, collection, and refinement (see “Data is broken”). A key way to save time and money is to specify data needs upfront and create data pools accordingly.

Creating data pools

On their own, very few companies will be able to collect the massive amounts of data that helped data analytics pioneers like Amazon, Facebook and Google create their success stories. One way to level the playing field is to team up with others and pool data. Data can be pooled:
(a) vertically, along the successive stages of a supply chain (for example, to predict a shipment’s estimated time of arrival);
(b) horizontally, for one machine make and model across all users (for example, to predict outages and improve uptime);
(c) by stacking different data sets “on top of each other” to create “data sandwiches.” One example is layering street maps with data on vehicle traffic, pedestrian traffic, weather conditions and event information to predict traffic flows.
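
Horizontal pooling and the “data sandwich” idea from the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `pool_horizontally` and `make_sandwich`, the field names, and the toy values are all hypothetical.

```python
def pool_horizontally(*user_logs):
    """Concatenate logs for the same machine model from several users."""
    pooled = []
    for log in user_logs:
        pooled.extend(log)
    return pooled


def make_sandwich(base, *layers, key=("street", "hour")):
    """Layer extra data sets onto a base data set by matching shared key fields."""
    merged = {tuple(row[k] for k in key): dict(row) for row in base}
    for layer in layers:
        for row in layer:
            k = tuple(row[f] for f in key)
            if k in merged:  # keep only entries that exist in the base layer
                merged[k].update(row)
    return list(merged.values())


# Horizontal pooling: two users of the same machine model share their logs.
user_a = [{"machine_id": "A-1", "failed": False}]
user_b = [{"machine_id": "B-7", "failed": True},
          {"machine_id": "B-8", "failed": False}]
pooled = pool_horizontally(user_a, user_b)

# Data sandwich: street map + vehicle traffic + weather, matched on street and hour.
street_map = [{"street": "Main St", "hour": 8, "lanes": 2}]
traffic = [{"street": "Main St", "hour": 8, "vehicles": 420}]
weather = [{"street": "Main St", "hour": 8, "rain_mm": 1.5}]
sandwich = make_sandwich(street_map, traffic, weather)

print(len(pooled))
print(sandwich)
```

In practice the layers rarely share clean keys, so most of the effort goes into agreeing on common identifiers and units before any such merge.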

Data governance needed to use pooled data

As data scientists, we have often run into the same problem: insufficient amounts of good data. Pooling data from different sources could be a solution. In a recent interview with t3n (May 29, 2019), Federal Minister Altmaier even advocated data pools for all of Europe. Until recently, however, it was complicated to pool data with others. One key concern has been data governance and the ability to manage it effectively. “The question of data sovereignty is key for our competitiveness,” concludes Federal Minister Altmaier.
The International Data Spaces Association (IDSA) has responded to this challenge and created a blueprint for a data governance architecture that allows for data pools and data sandwiches across enterprise boundaries without compromising data governance. IDSA defines data sovereignty “as a natural person’s or corporate entity’s capability of being entirely self-determined with regard to its data” (IDSA Reference Architecture Model 3.0, page 9). IDSA is an association of industry participants, created to promote data governance architecture solutions based on research conducted by the German Fraunhofer Institute with funding from the German government (Fraunhofer initiative for secure dataspace launched, 2015). Today, members include automakers like Volkswagen, suppliers like Bosch, and traditional information and communications technology specialists like IBM and Deutsche Telekom.