Erleben, was verbindet.

1. Purpose

This document introduces end users to the Data Intelligence Hub (DIH) platform. It explains the main components, why they are beneficial, and how to use them.

It is primarily written for a “premium user,” a user that is set up within a premium tenant. There is also the “freemium user.” This user type has fewer features than a premium user. This means that not all features described in this guide will be available to a “freemium user.”

This guide starts with a brief overview of the Telekom DIH and its key elements, such as [Marketplace], [Workspaces], [Projects], [My workspace], [Data] and [Tools] (Chapter 2). It goes on to explain how to become a freemium or premium user (Chapter 3) and provides step-by-step instructions outlining how a premium tenant administrator can set up tenant users (Chapter 4).

All remaining chapters introduce DIH features on the basis of use cases, chapter by chapter. We start with simple and common cases, such as obtaining data from the [Marketplace], and finish with more complex and specialized examples, such as how to share data with other organizations while maintaining data sovereignty with the [DIH Trusted Connector].

This guide adheres to the following conventions:

  • [Tabs]: square brackets are used to mark key elements of the DIH.
  • Links / URLs / Buttons / Special terms: links, URLs, special terms and DIH interface buttons are italicized.
  • Sections: specific areas within a DIH page are referenced in bold.

 

For example, the picture below exhibits the key DIH elements [Marketplace], [Workspaces], [Control Center], [My Workspace], [Data], [Projects], and [Tools].

2. Overview – Data Intelligence Hub

The Telekom Data Intelligence Hub (DIH) from Deutsche Telekom is a B2B platform for the secure exchange, processing, and analysis of data from diverse organizational and professional fields. It provides three types of services and components:

  • A secure open data portal, marketplace, or data exchange: it currently provides more than 50,000 datasets from the European Open Data Portal (https://www.europeandataportal.eu/en) as well as a growing assortment of open data from various other sources
  • A cloud-based data analytics environment with data storage
  • A secure and trusted data exchange based on the specifications of International Data Spaces (IDS) with our DIH Trusted Connector

The DIH is implementing IDS as a cloud-based service offering to reduce the cost and time required for sharing data. It functions as a secure, highly available, and flexible platform for providers and consumers of Data Science and IoT applications. It provides the highest security standards, and it can guarantee individual data sovereignty through decentralized data storage.

Key elements:

  • [Marketplace] for data exchange
  • [Workspaces] for processing and manipulating data with analytics tools
  • [Control Center] for managing data and data offers
  • Within [Workspaces] is [My workspace]
  • Within [My workspace] are [Data], [Projects], and [Tools]
  • [Tools]: We offer an Open Source solution as a default. Premium tools are available, such as from Cloudera and Databricks. Our Open Source solution is a Jupyter Notebook with a curated assortment of Python libraries, such as pandas, scikit-learn, matplotlib, mlflow, and many others.
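To see which of the libraries named above can be imported in a given notebook environment, a short check like the following can be run in a notebook cell. This is a minimal sketch: the module list is taken from the text above (scikit-learn is imported under the name `sklearn`), and the helper name `available_libraries` is an assumption made for illustration, not part of the DIH.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above.
# scikit-learn is imported under the module name "sklearn".
LIBRARIES = ["pandas", "sklearn", "matplotlib", "mlflow"]


def available_libraries(names):
    """Map each top-level module name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


for name, present in available_libraries(LIBRARIES).items():
    print(f"{name}: {'available' if present else 'not installed'}")
```

The check only looks up whether a module can be found; it does not import it, so it is quick to run even for large libraries.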

Looking ahead:

  • A connector for secure peer-to-peer data transfer
  • A growing selection of commercial analytics tools
  • The possibility to trade data and analytics assets (data monetization)

 

Resources:

 

3. Becoming a freemium or premium user

DIH users are classified into two types: a) premium user and b) freemium user. The account affiliated with a premium user is called an organizational account. The default account that any user (premium and freemium included) would normally have is called a personal account. The two accounts differ in their computing power, storage resources, and the availability of premium tools.

As a premium user, your computational power, storage capacity, and access to premium tools are configured by your organization. As a freemium user, your personal account is limited in its computational power and storage. The freemium user has access to just one tool: JupyterHub.

To become a premium user, you require a special invitation from the organization’s administrator. This enables you to use the DIH’s premium environment. To become a freemium user, you do not require any kind of special invitation to gain access to the DIH. You can simply use the links on the DIH homepage to open a free account. The sections below outline a detailed process of becoming a freemium user or a premium user.

 

3.1 DIH personal account and freemium user

Step 1: DIH homepage

Visit the DIH website: https://dih.telekom.net/

 

 

Note: if you want to switch to English, please click the “DE” button, which is located next to the Login Sign, and select “EN” for English.

 

Step 2: Account registration

To register for a DIH account, click on the Get started button, which is in the lower section of the main page.

 

 

Then select the Register for free link, which can be found underneath the Log in button.

 

 

Clicking on the Register for free link brings up a Free User Registration form. Please fill out the form with your details and click the Register for free button. This should send you an automated e-mail with a link for verifying your e-mail address. Please head to your inbox to find the automated e-mail from the DIH and click the Verify E-Mail Address button. This action should directly take you to the DIH [Marketplace], logged in as a freemium user.

 

 

Step 3: Login

For subsequent logins, navigate to https://dih.telekom.net, click the Login button located in the top right corner of the main page, and enter your credentials.

 

 

 

If the login process was successful, you will see the Data Marketplace with the header “Gain business-crucial insights” and the search bar underneath.

 

To change your personal details, go to the dropdown menu at the top right (accessible from any page) and click the Profile link underneath Personal. In the General information section, you can add a profile picture, register an organization, reset your password, change the language, or delete your account.

 

To add more details about yourself, click on the Edit icon (pen) at the top right of the General information section. This displays an overlay with an input mask and the option to Save.

You can return to the overview of your personal account at any time by clicking the “X” (overlay closes).

 

If you are a part of one or more organizations, the Affiliation section of the Profile page displays the list of your affiliated organizations.

 

3.2 DIH premium user

A premium tenant is associated with an organization account. The admin of a premium tenant has the right to invite members to their tenant. They do so by sending an invite to the members’ e-mail addresses. The invitation contains a link to join the team and an autogenerated temporary password.

When you are invited, you have to click on the Join team button and log in with your username and password as displayed in the e-mail. You will then receive another e-mail to verify your e-mail address. Clicking on the Verify E-Mail Address button takes you to the personal account of the DIH, also known as the freemium account. It should be noted that becoming a premium user also gives you access to a freemium account associated with your e-mail address.

 

Now that you are in your personal account, click on the Profile link, head to the Security section, and click the Reset password button. You should now receive an e-mail with instructions for resetting the password. Following these instructions updates your account with your desired password.

 

 

In the account selection menu at the top right of each page (indicated by a down arrow), you can switch between your personal account and your organization’s account. The access details will, however, need to be entered again in order to switch between the accounts. Once you switch to an organizational account, the menu header at the top right will display your organization name instead of “Personal”.

 

 

The workspaces found in the personal and the organizational accounts are different and have different features and capabilities. The organizational account has many more features and capabilities than the personal account.

Currently there can only be one workspace per account.

 

4. DIH premium tenant administration

Unlike the freemium tenant, the premium tenant has dedicated storage for its customers. The data stored in the DIH is completely isolated from other tenants, thereby giving the tenant owner sovereignty over their data. The tenant owner also has the power to customize the availability of the premium tools for their members. This chapter describes specific details about the DIH premium tenant’s administrative functions. Firstly, we will demonstrate how administrator login credentials can be obtained and secondly, we will elaborate on functions such as customizing the company profile, sending out invitations to ask members to join the premium tenant, setting up organizational units, and resetting company details.

 

4.1 Admin login credentials

After successfully subscribing to the DIH premium tenant, the DIH delivery manager will provide the premium tenant owner with administrator login credentials. The admin user will receive an invite (see below) containing their username and password.

 

 

  1. Use the username and password provided to log in to the Data Intelligence Hub (https://dih.telekom.net/).
  2. After you have successfully logged in, you will see the Marketplace home screen. You can then get started.
  3. It is strongly recommended that you change the admin password soon after you have logged into your account.

 

4.2 Organizing your team

Log in to the DIH with your admin account credentials. In the account selection menu at the top right (accessible from every page), you will see four links below the name of your organization:

  1. Profile
  2. Team
  3. Units
  4. Details

4.2.1 Profile

This page allows you to configure the various aspects of your company profile. These include the company description, background image or logo, field of activity, communication routes (social media and e-mail), featured data offers (up to 9), and other aspects of branding.

 

4.2.2 Team

This page displays all verified users assigned to the organization (team members). It also allows you to invite new members to the tenant. You can start by clicking on the Add members button. Those who have been invited to join the team but have not accepted the invitation will have the word “pending” displayed beside their e-mail address. This page also allows you to remove members from the tenant where required.

 

 

Clicking the Add members button opens an overlay box where you can enter the invitee’s e-mail address. Paste the new member’s e-mail address into the box and click on the Send invite button.

 

The member will receive an e-mail with the invitation to your organizational tenant. The invitation will appear as in the picture below.

 

The list of team members can be sorted (according to the team members’ e-mail addresses: A – Z, Z – A) and searched (all team members matching the letter sequence in the search; e.g., by entering “max” you will find the colleagues max@organization.com and maxima@organization.com).
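The search and sort behavior described here can be pictured as a substring match over the e-mail addresses followed by an alphabetical sort. The following is only an illustrative sketch, not the DIH implementation: the function name `search_members` is hypothetical, and case-insensitive matching is an assumption.

```python
def search_members(emails, query, descending=False):
    """Return the addresses containing the query, sorted A-Z (or Z-A).

    Matching ignores case; this is an assumption, not documented DIH behavior.
    """
    needle = query.lower()
    matches = [email for email in emails if needle in email.lower()]
    return sorted(matches, reverse=descending)


team = ["maxima@organization.com", "anna@organization.com", "max@organization.com"]
print(search_members(team, "max"))
```

Passing `descending=True` gives the Z – A ordering.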

You can also remove users from the tenant by selecting the e-mail address of the user and clicking the delete icon.

 

4.2.3 Units

This page lets you define one or more organizational units. Defining the organizational units helps members to create data offers affiliated with the specific organizational unit.

 

 

To create a unit, click on the Add unit button. This will show you an overlay box with various fields that need to be filled out before you can create the organizational unit.

 

Fill out the fields in the overlay box and click Save to create a new organizational unit.

 

4.2.4 Details

The Details tab displays information such as the company’s address, logo, and admin e-mail address. This is the information that you provided to the DIH when opening the organizational tenant. This information cannot be edited by admin users. To change it, contact our DIH support using the Request changes button.

 

4.3 Organizing your tools

The tenant administrator also has the right to activate a tool for their organization members or deactivate a specific tool during idle times to save resources. For example, if you would like to deactivate the JupyterHub tool, go to [Workspaces] > [My Workspace] > [Tools]. Within this section, click on the … button beside the JupyterHub tile and click on the Deactivate tool button. You will be prompted to confirm the deactivation. Follow the same procedure to reactivate the tool.

 

5. Use case: obtaining data from the Marketplace

Data exchange and data marketplace: data offers from the Data Intelligence Hub are posted in the [Marketplace]. The [Marketplace] provides access to data offers (a) via categories and (b) via a search function. Each data offer is displayed in a separate tile and contains the most crucial information or metadata, such as title, provider, date of update, available formats (CSV, geoJSON, JSON, XML, zip, pdf, etc.), usage fee, and duration. Users can find suitable offers by using a keyword search. Search results can be filtered if a large number of hits is found. The desired data offer can be purchased by clicking on the Order button in the detailed view of the data offer. The sections below explain these steps in detail.

 

In the overview you will find:

  • A pre-selection of trending data offers
  • Open data offers
  • Topic-based categories with the number of offers they contain

The most important information is contained in the brief summary of the offers (tiles):

  • Title
  • Provider
  • Date of update
  • Available formats (csv, geoJSON, JSON, xml, zip, pdf, etc.)
  • Usage fee
  • License attribution

What can be viewed in the [Marketplace] of the DIH differs according to your user group:

  • When logged in as a team member of an organization, you can see all [Marketplace] offers and order them, depending on your user rights.
  • With a personal account, regardless of whether you belong to the team of a registered organization or have only registered as an individual user, you will see data offers that are made visible to registered members of the DIH.
  • As a user who is not logged in to the DIH, you will see data offers that are made public and are free of charge, which you can order after registering.

 

5.1 Searching for data offers

You can find matching offers using a keyword search:

 

If your search does not return any results,

  • check that you have spelled the search term correctly
  • try it with
    • fewer search terms
    • synonyms
    • a superordinate search term

You can also browse through recently posted data offers simply by clicking on the View all button on the [Marketplace] page, and refine them using filters:

 

 

Filter

You can also filter your search results by clicking on a filter above the search results and defining your preferred parameters:

  • Formats, Time period, Subject area, More filters
  • Within More filters, you can also search by License types and Region

Set one or more checkmarks in the filter selection menus. Next to each option, the number of corresponding available offers is displayed.
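The counts shown next to each filter option, and the effect of ticking one or more options, can be sketched as follows. This is an illustration only: the offers are invented sample data, the function names are hypothetical, and combining several checked formats with OR is an assumption about how the filter behaves.

```python
from collections import Counter

# Invented sample offers, for illustration only.
offers = [
    {"title": "Air quality", "formats": {"CSV", "JSON"}},
    {"title": "Road network", "formats": {"geoJSON"}},
    {"title": "Weather stations", "formats": {"CSV"}},
]


def format_counts(offers):
    """Count how many offers provide each format (the number next to an option)."""
    return Counter(fmt for offer in offers for fmt in offer["formats"])


def filter_by_formats(offers, selected):
    """Keep offers that provide at least one of the checked formats (OR logic)."""
    wanted = set(selected)
    return [offer for offer in offers if offer["formats"] & wanted]


print(format_counts(offers)["CSV"])
print([offer["title"] for offer in filter_by_formats(offers, ["CSV"])])
```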

 

Time period

With this filter, you can limit the search results according to time period. As soon as you click on one of the date fields, a calendar view will open.

 

 

Define the desired period by

  • using the arrow buttons to choose the month in the current year and selecting your start and end date in these months.
  • If your desired period is more than a year, click on the month in the first calendar view – you will see a list with years. Select a year, then a new list will open showing the months in this year.
  • The first day of the selected month is your starting point. Proceed as before to select the end point in the second calendar. The last day of the month selected there is your end point. By clicking on a date in the month view, you can then define a specific start and end date.
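The month-based selection described above, where the first day of the start month and the last day of the end month become the boundaries, corresponds to the following date arithmetic. This is a minimal sketch using the Python standard library; the function name `month_range` is hypothetical.

```python
import calendar
from datetime import date


def month_range(start_year, start_month, end_year, end_month):
    """Return (first day of the start month, last day of the end month)."""
    start = date(start_year, start_month, 1)
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(end_year, end_month)[1]
    end = date(end_year, end_month, last_day)
    if end < start:
        raise ValueError("the end month must not precede the start month")
    return start, end


start, end = month_range(2019, 11, 2020, 2)
print(start.isoformat(), end.isoformat())
```

Leap years are handled by `calendar.monthrange`, so a period ending in February 2020 ends on the 29th.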

 

Subject area

This dropdown menu displays all the subject areas that your search term is relevant to. You can check the box next to the relevant topic to further refine the search for your desired dataset.

 

More filters

Depending on the size of your screen, one or more filters are visible. If your screen is small, you might only see the first filter, i.e., Formats, and the remaining filters are grouped into the More filters overlay menu. To open it, click on “More filters” at the right-hand side of the bar.

This opens an overlay with additional filters, including License types and Region, which you can use to further refine very large numbers of results. You can use the License types filter straight away. If you would like to filter your results by region, however, you first need to set the slider to ON.

 

 

 

Region

With this filter, you can limit the search results to the spatial reference of the data contained within those offers. To do this, zoom in and out with “+” and “–” and move the displayed map with the cursor. For example, if you zoom in on Germany, only offers that are tagged with a city within that area are displayed.

Confirm the selection by clicking the Apply filter button. Click on Cancel at the bottom of the overlay to reject the settings. The overlay closes and you are returned to the page with the (if applicable, filtered) offers.
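Conceptually, the Region filter keeps the offers whose tagged location falls inside the visible map area. The sketch below illustrates that idea with a simple bounding box. It is not the DIH implementation: the class name, the sample offers, and the approximate box around Germany are assumptions for illustration, and boxes crossing the 180° meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A rectangular map area given in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (edges included)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Rough box around Germany; approximate values, for illustration only.
GERMANY = BoundingBox(south=47.2, west=5.8, north=55.1, east=15.1)

# Invented offers tagged with a city location (title, latitude, longitude).
offers = [
    ("Berlin traffic", 52.52, 13.405),
    ("Paris air quality", 48.8566, 2.3522),
]

visible = [title for title, lat, lon in offers if GERMANY.contains(lat, lon)]
print(visible)
```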

 

Sort the results

You can sort the results list by Relevant, Latest, or Popular by clicking on the corresponding links.

 

Removing and changing filters

Click on the “X” next to a filter to remove it so offers without this restriction are once again displayed in your results list.

 

 

You can remove all other filters by clicking on the “X” next to “More filters”. To change just a few of the additional filters, click on “More filters”. The overlay will open again. Here you can adapt or completely remove all other filters and “Apply” your changes.

 

5.2 Data offers – individual view

The detailed view of an offer contains the familiar summary at the top:

  • Title
  • Provider
  • Open data or commercial
  • Date of Creation
  • Date of Update
  • Data Formats (csv, geoJSON, JSON, xml, zip, pdf, etc.)
  • License type (cc-by, iodl, opendata, opengov, national, etc.)
  • Terms (free or price tag)

Using the tab navigation, you can find more detailed information on the data offer:

 

  • Description

This is where you will find detailed sections such as Offer description, Keywords, Subject area, Impressions, Examples of application, and images.

 

 

  • Preview/sample data

If the data provider supplies sample data, you can switch between different display options here:

    • Table
    • Map
    • Diagram
    • Raw data

 

The Map tab will be made available if the dataset contains information about the Latitude and Longitude.

  • Specifics

Here you will find (if supplied by the data provider)

    • Files and/or links to documentation
    • A list of included datasets in the offer (can be sorted by clicking on the column header)
    • Information on origin and topicality
    • Reference URL (DIH URL to access this data offer)

 

 

5.3 Order data

Click on Order in the detailed view of the data offer. (You can cancel the ordering process at any time by clicking on the X in the top left corner.)

 

 

    1. Check details: Confirm order. Is everything as desired? Please check the details of your order.
    2. Accept license: Review and accept terms of license. You can find a link to the license under “License”. Please read through it to see if the license fits your requirements. You need to accept the license details before concluding your order.
    3. Conclude order: After completing step 1 and step 2, click on the Conclude order button to complete the transaction.

 

Your orders should now be available under [Control Center] > [Orders]. Click on the View orders button to navigate your way to the orders section.

 

 

If you have already purchased the desired data offer, the Already ordered button will redirect you to [Control Center] > [Orders] where you will find the data offer you have already purchased.

 

 

If you have posted the data offer as a part of the organization, you will not be able to order your own dataset. Instead you will see the View our offers button.

 

5.4 Allocating ordered data to your workspace

To make use of the datasets in a data offer you have subscribed to, you first need to assign the data offer to your workspace. To do this,

  • go to [Workspaces] > [My Workspace] > [Data]
  • click the Allocate data button to display an overlay box

 

  • The overlay box displays all of the data offers that you have subscribed to under the Available offers section. Click on the check box next to your desired data offer to put it in the Selected offers section.
  • Click on the Allocate data button to proceed with allocating the dataset to your workspace. You should now see the data allocated to your workspace.

6. Use case: creating a data offer

Companies are currently underutilizing most of the IoT data they collect. The terabytes and petabytes of data that are being generated by factories, cities, vehicles, retail, etc., quickly pile up and it gets difficult to benefit from this flood of information. One effective way for them to put this data to work and monetize its value is to make metainformation about their data available on data marketplaces. These marketplaces help data providers richly furnish their data with metainformation so that it can be discovered easily by consumers in the search results. The metainformation that the provider may choose to supply could be sample data, sector of industry, location, keywords, data source type, description of the data, impressions, sample use cases, documentation, time and date, etc.

DIH is a data marketplace that provides self-service technology and infrastructure to the data provider so they can create data offers and exchange data with consumers in a sovereign manner, all while complying with legal requirements.

In previous chapters we covered how to search for and make use of datasets from the [Marketplace]. In this chapter, we will cover how to create your own data offer and supply it on the DIH [Marketplace].

Before you continue, ensure that the data you want to offer is well prepared, with a good description and examples of use to help others find your data offer on the [Marketplace]. To create a data offer, you must first ensure that you are logged into the DIH as a premium user (section 3.2).

  • Navigate your way to [Control Center] > [Offers]
  • Click on “+ Create new offer” to create a data offer for the Marketplace. This should call up an overlay asking you to fill in the details of the data offer. There are six sections in the data offer creation process, and we will visit each of them in detail.
  • As you fill out the information in the overlay box, you will see the gray indicator turn yellow and then green. Each color indicates the extent to which a section has been completed: gray indicates that nothing has been filled out, yellow indicates that the information is partially filled out, and green indicates that the information is sufficiently filled out.

For illustrative purposes, we are using screenshots of a pre-existing data offer to explain the details.

 

General information

Description:

  • Enter a title and description for your offer.

Icon:

This is the icon that will be displayed beside the offer title in the Marketplace. You therefore need to complete the following steps to display the desired icon for your data offer:

  • Prepare an icon for your offer as a .jpg, .jpeg, or .png file
  • The icon can be uploaded via the Choose file button or simply by drag and drop.
  • Upload the icon from your computer or server (drag and drop, select file)
  • Selection: you may select from (previously uploaded) icons
  • If you have not assigned particular icons, they will be displayed at random.

 

 

Data

Here you can select the data that will be part of the offer.

Included data:

  • In this list you will find the data that is currently part of the offer. To remove a file or source from an offer, remove the mark from the checkbox by clicking on it. The entry will move to the ‘More data/Available data’ section.
  • To make it easier to find the files and sources belonging to an offer, you can sort the entries (Name: A – Z/Z – A; Origin: Connector/Upload).

Copyright statement:

  • Please confirm that you have the rights to publish all the components within the data offer.

More data/Available data:

  • Use this folder to select the data components that will form the offer.
  • To add a file or source to an offer, click the checkbox next to it. The entry will move to the ‘Included data’ section.

Add new data:

  • Select ‘Add new data’ to add new data from your computer or API endpoint as an offer or an empty data source.
  • When you upload data, it will be automatically saved in the Data section of your organization account.

 

 

  • You may also use the advanced settings to annotate the data with Time reference, Geospatial reference, and Dates.

 

 

Examples

Provide data samples and examples of use to showcase your offer.

Examples of use:

  • Describe a typical usage scenario for your data offer and provide a title for the example.
  • Select or upload an image to illustrate your example of use.

Sample images:

Upload images from your own server or computer – or select from images you have already uploaded to the Data Intelligence Hub.

 

 

Sample data:

  • Upload a data sample from your own server or computer – or select from data samples you already uploaded to the Data Intelligence Hub.
  • If your sample data contains latitude and longitude information, you can turn on the map preview option and further annotate the data. This will display the Map module in the data offer preview tab.
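Before uploading a sample, you can check locally whether it carries coordinate columns at all. The following is a minimal sketch that inspects the header row of a CSV file. It assumes the columns are literally named "latitude" and "longitude" (in any letter case); how the DIH itself detects coordinates is not described in this guide, and the function name `has_coordinates` is hypothetical.

```python
import csv
import io


def has_coordinates(csv_text):
    """True if the CSV header contains both a latitude and a longitude column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    names = {column.strip().lower() for column in header}
    return {"latitude", "longitude"} <= names


# Invented sample data, for illustration only.
sample = "city,Latitude,Longitude\nBonn,50.73,7.10\n"
print(has_coordinates(sample))
```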

 

 

API documentation:
If your data source is an API endpoint, you can either provide the URL for the API documentation or upload the documentation as a PDF.

 

Meta information

Enter further information that makes it easier to find your offer.

Organizational unit:

Here you select the organizational unit you belong to. Organizational units are typically the sub-units of a parent organization. However, you do not have to choose an organizational unit in order to post a data offer. This would simply indicate that you posted the data offer as a member of the parent organization.

Keywords:

Enter the keywords for the search function in the [Marketplace] (free text). Below the text field you will find keywords which have previously been entered; you can remove these by clicking ‘x’.

Contact person:

Help prospective visitors by naming a contact person for your offer, making it possible for them to get in touch.

Categories:

If you add a topic to your offer via the dropdown menu, you will make your offer easier to find in the Marketplace.

Additional information:

Here you can further annotate your data as a key value pair and delete irrelevant pairs.

 

 

License

  • What are the licensing terms for your offer: for sale or just for a specific project, static with one-time download, with or without planned updates, or as a dynamic data stream for a specific time period?
  • Select from previously uploaded licenses or upload a new license (drag and drop, select file), PDF format only.

 

 

Access

You determine who can see and use your offer (standard setting: everyone)

Visibility: 

You can select: all authorized organizations, only registered organizations, all registered users, all users and visitors (including the public Marketplace).

Availability: 

Will your offer be available for a limited or unlimited time period: enter the beginning and end dates for the desired timeframe.

Ordering options: 

Choose between three different ordering methods: Order directly, Request quote, or Informal request.

 

 

As you start filling out the information in each of the sections, you will notice that the gray dots turn yellow and green. After making sure you have entered all the information required to make your data offer attractive to your customers, you can go ahead and click on the Publish button. Your data offer will be created, and a condensed version will be displayed in an overlay. You can now create another offer, use this as a template, or go back to the Offers overview. Your new offer is now visible in the Offers overview.

 

 

Click on the View your offer link to view your offer on the DIH Marketplace.

 

7. Use case: creating a project

A project is a combination of data, tools, and the results of working with data and tools such as an algorithm or source code for an algorithm. [My workspace] can have multiple projects. Now that you have allocated the data to your workspace, the next step is to create a project to start working with the data.

  • Select [Workspaces] > [My workspace] and select the [Projects] tile
  • Select + Create project. This action will open an overlay box prompting you to fill out three sections: Project details, Data, and Tools.

 

 

  1. Project details:
  • Call the new project “Awesome project” and add the description: see the little gray dot on the project tab turn green.

 

 

  2. Data:
  • This is where you allocate data to the project. Allocating data therefore takes two steps: first, allocating the data to [My workspace], and second, allocating it to a project within [My workspace] > [Projects].
  • Get started by selecting the Titanic dataset that you have just ordered and allocated to your workspace. Please note that the Titanic dataset does not show up in the Available data section if you have not allocated it to your workspace (see section 5.4, Allocating ordered data to your workspace).
  • It should be noted that the Data section is optional, i.e., you can still create a project without selecting any dataset. It is, however, mandatory to carry out step 1 and step 3, where you select a tool.

 

 

  3. Tool:
  • In this section you can select the tool that you would like to use for this project.
  • As a freemium user, the only available option you will see is JupyterHub. As a premium user, however, you might see a wide assortment of tools to select from.
  • In this example, you would select JupyterHub as your choice of tool.
  • Once you have labelled the project, added a description, and selected a tool, the Save button will be activated so you can click on it.
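The rule for when a project can be saved (name, description, and tool are required, while data is optional) can be summarized in a few lines. This is an illustrative sketch only: the class `ProjectDraft` and its field names are hypothetical and do not represent a DIH API.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectDraft:
    """The three sections of the project overlay, as described above."""
    name: str = ""
    description: str = ""
    tool: str = ""
    datasets: list = field(default_factory=list)  # optional section

    def can_save(self):
        """True once the name, description, and tool are all filled in."""
        required = (self.name, self.description, self.tool)
        return all(value.strip() for value in required)


draft = ProjectDraft(name="Awesome project", description="First steps")
print(draft.can_save())  # the tool is still missing
draft.tool = "JupyterHub"
print(draft.can_save())
```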

 

 

Having clicked the Save button, you should now see your project listed in the [Projects] section.

 

8. Use case: Jupyter Notebook

In data analytics projects, creating understandable scripts with reproducible results is of paramount importance. There is a great need for software that embeds the logic of the algorithm while at the same time showing the results in a digestible format, such as figures and interactive plots. Ideally, the scripts also serve as documentation that covers all of these aspects. The Jupyter project addresses these requirements with solutions such as JupyterHub, Jupyter Notebook, and JupyterLab.

 

The Jupyter solution stack is fully hosted by the DIH. Connecting to it may take a few seconds depending upon your internet bandwidth.

 

JupyterHub

JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening the users with installation and maintenance tasks. Users – including students, researchers, and data scientists – can get their work done in their own workspaces using shared resources which can be managed efficiently by system administrators. JupyterHub makes it possible to serve a pre-configured data science environment to any user in the world. It is customizable and scalable and is suitable for small and large teams, academic courses, and large-scale infrastructure.

The DIH deploys a dedicated pre-configured JupyterHub environment for its premium users. Freemium users, on the other hand, can use the general DIH JupyterHub environment, which is designed to cover the basic analytical needs of most users.

 

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. Jupyter notebooks can be recognized by their file extension, .ipynb.

 

JupyterLab

JupyterLab is a next-generation web-based user interface for Project Jupyter. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. The classic Jupyter Notebook interface served by JupyterHub opens each notebook in a separate browser tab. In contrast, JupyterLab provides an Integrated Development Environment (IDE) in which you can work on multiple aspects of coding in a single browser tab. For more on the JupyterLab use case, see section 13.1.

 

Resources:

 

8.1 Launching a Jupyter Notebook from [Projects]

There are two different ways in which you can launch the Jupyter environment from [Projects], either from the [Projects] page or the destination project’s detailed view page.

 

8.1.1 Launching from the destination project details page

In order to view the details of your project, click on the checkbox beside the project name and click on the View icon (binocular shaped). You may notice other buttons on the same row. These buttons help you perform other functions, such as launching this project with a tool specified during project creation, duplicating the project, editing the project’s details, and deleting the project.

 

 

The project details page has two sections: Project data, which lists all datasets that are allocated to this project, and Project repository, which contains the actual datasets in a hierarchy of folders. We will take a closer look at the Project data and Project repository functionalities in the next chapter (section 9.1). You will find the Start tool button at the top right corner of the project details page.

 

 

Clicking on the Start tool button will launch the Jupyter environment in a separate browser tab, prompting you to sign in with your DIH credentials. You must confirm the launch by clicking on the orange button, Sign in with DIH User Account.

 

It may take a few moments to start Jupyter for the first time.

 

 

If the environment has been successfully deployed, you should be shown the root directory with your project folder. Note that the folder name will be an internally generated number instead of your specified project name.

 

Clicking on the Start tool button a second time should take you directly to an autogenerated Jupyter Notebook with pre-filled code that displays the size of the first dataset from the Project repository.

 

 

Notice that every autogenerated Jupyter Notebook has a file_id and a project_key. The file_id usually refers to the first file in the Project repository, while the project_key is unique to every project created. If you want to read another file from your Project repository, you can therefore simply replace the file_id. See section 9.1 to learn more about obtaining the IDs of the files in your Project repository.

 

8.1.2 Launching from the [Projects] page

You can also launch the Jupyter environment directly from the [Projects] homepage. Click on the checkbox beside the project name, then click on the Start tool icon (wrench-shaped).

 

8.2 Launching a Jupyter Notebook from [My workspace] > [Tools]

The second way to launch the Jupyter Notebook is from [My workspace] > [Tools]. Note that, unlike the former method (section 8.1), launching the Jupyter Notebook from [My workspace] is a somewhat lengthier process. The following steps outline it.

  • Navigate your way to the location [My workspace] > [Tools] and click the Open tool button on the JupyterHub tile.
  • A new tab will open, and you must confirm your launch by clicking on the orange button Sign in with DIH User Account.
  • You should now see your Jupyter root directory. As you navigate your way into the subdirectories, the path is shown at the top of the interface. For each project created on the DIH, a corresponding project directory can be found in the Jupyter root directory; its name is typically a number.
  • Within each of the directories you can find a .ipynb Jupyter Notebook containing a code template which prints out the size of the first dataset assigned to your project.
  • To launch this notebook, just double click on the file and this should open the file in a new browser tab. You are now ready and can start working with your Notebook.

 

 

  • DIH-Jupyter integration: the mapping between the projects listed in the [Projects] section of the DIH (real names) and the project folders in the Jupyter root directory (numbers such as 649) is not obvious. However, every numbered folder contains a Jupyter notebook whose name is the name of the project. The straightforward way to find out which folder belongs to a project is therefore to look through the numbered folders one by one. You can also narrow down the search by comparing the folders' time stamps with the time the project was created.
  • To launch the Jupyter Notebook for a project, enter the project folder and open the Jupyter notebook (the file with the .ipynb extension). If you want to launch a new Notebook instead, click on the New symbol in the top right corner and select the Python 3 Notebook.
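The folder-by-folder search described above can also be done from a notebook cell. The following is a minimal sketch, assuming that project folders are named with digits only and sit directly under the directory you pass in; the function name map_project_folders is hypothetical and not part of the DIH.

```python
from pathlib import Path
from typing import Dict, List


def map_project_folders(root: Path) -> Dict[str, List[str]]:
    """Map each numbered folder under `root` to the notebook names it contains."""
    mapping: Dict[str, List[str]] = {}
    for folder in sorted(root.iterdir()):
        # Assumption: project folders are named with digits only, e.g. "649".
        if folder.is_dir() and folder.name.isdigit():
            # The notebook file name without ".ipynb" is the project name.
            mapping[folder.name] = sorted(nb.stem for nb in folder.glob("*.ipynb"))
    return mapping
```

Calling map_project_folders with the path of your Jupyter root directory returns a dictionary in which each folder number points to the project name(s) found in that folder.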

 

9. Use case: adding data to a destination project

All of the chapters up to now have focused on using the data that is available on the DIH Marketplace. It is often the case, however, that we work with data that we have generated on our own PCs. It is therefore essential to get this data onto the DIH [My Workspace] so that it can be analyzed on the DIH analytical platform. The following section focuses on how to put your own data on the DIH to work with it. There are three main ways to do this:

  • Upload from local storage to a specific DIH project or “destination project”.
  • Allocate data from [Data] to a project. Chapters 3 and 5.4 explained how to obtain data from the Marketplace and transfer it to [Data]. Chapter 6 explained how to allocate the data to a project when creating the project.
  • Upload data from local storage to a specific Jupyter project folder.

 

9.1 Uploading data from local storage to the destination project on the DIH

We will demonstrate this use case using our example project, Awesome Project.

  • Download the example dataset (a CSV file) to your local PC.
  • Go to [Projects].
  • Select the destination project (tick the checkbox).
  • Click on the View button.
  • Go to the Project repository section.
  • Click on the Upload files button. This should open an overlay box.
  • Choose the file from your local PC.
  • Click on the Upload data button. The file should now be listed in your Project repository.

 

 

Note:

The DIH generates a unique file ID for every file and folder present in the Project repository, irrespective of whether the files have been uploaded manually or automatically copied as part of the project creation process. The unique file ID of a file can be obtained by navigating your way to the file location, selecting the file (by ticking the checkbox), and clicking on the Copy path icon (link shaped). This ID is essential for reading the data from the Project repository with the Jupyter Notebook.

 

 

 

9.2 Allocating data from [Data] to the destination project on the DIH

This method is very useful when you want to work with additional datasets in an existing project. The DIH enables you to allocate additional datasets to an existing project. Before you get started with the following steps, make sure that you have subscribed to at least two data offers from the [Marketplace] and that you have allocated these datasets to [My Workspace] (see section 5.4).

  • Go to [Projects]
  • Select destination project (tick the checkbox)
  • Click the View button
  • Go to the Project data section
  • Click the edit button located in the top right corner of the Project data
  • Select the datasets from the list of Available data
  • The selected datasets then appear in the Selected data section.
  • Click the Allocate selected data button to allocate the additional data sources to the project.
  • The files belonging to the selected data offers will be copied to the Project repository.

 

9.3 Uploading data from local storage into the destination project in the Jupyter root directory

In contrast to the use cases above, where you upload datasets to the Project repository, it is also possible to upload datasets and Jupyter Notebooks directly within the Jupyter environment. This method is important for the upcoming use cases.

 

Download the files:

Before we proceed further, please download the following files to your local PC.

 

Create a new folder:

  • Go to [Workspaces] > [Tools]
  • Click Open tool on the JupyterHub tile.
  • JupyterHub is opened in a new browser tab.
  • Create a new folder with the name “Data Refinement” by going to New > Folder.
  • This action creates an Untitled Folder. You therefore need to select the folder and rename it.
  • Select the Untitled Folder and click on the Rename button. This brings up an overlay where you can rename the folder. Click the blue Rename button to finish.
  • You will now see the renamed folder in your root directory.

 

 

Upload the downloaded files:

The next step is to upload the downloaded files to the newly created folder.

  • Navigate your way to the newly created folder, i.e., Data Refinement
  • Click the Upload button, browse to and select all three downloaded files, and click Open to upload them to JupyterHub.
  • Click on the blue Upload button to finish uploading to the JupyterHub.

 

10. Use case: refining your data

There can be a lot of issues with data: missing values, outliers, wrongly formatted values, etc. Data is often even fragmented into several files. This requires a data scientist who can clean up the data, merge files, standardize, tag, categorize, and summarize everything before it can finally be used with an analytics or AI method.

The objective is to create AI-ready data. In the example given here, we start out with two data files, RevenueHD.csv and TimeHD.csv, which have to be merged into a single file for further data analytics. To get started quickly, we provide links to download these two datasets. We also provide a link to download a Jupyter notebook (.ipynb extension) that includes pre-prepared code.

 

10.1 Preparing data refinement folder

The files to be downloaded are as follows:

 

You will need to download the three files listed above, create a folder named “Data Refinement” in the Jupyter environment, and upload the downloaded files to the Data Refinement directory. Before proceeding any further, follow the guidelines in section 8.2 on how to open the Jupyter environment from [Workspaces] > [Tools]. You should also use the detailed instructions in section 9.3 on how to create the folder and upload the downloaded files to it.

Your directory should now resemble the picture below.
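If you prefer to verify the folder contents from a notebook cell, a check along the following lines could be used. This is only a sketch: the file names are the ones mentioned in this chapter, while the helper name missing_files is hypothetical and the folder path depends on where your notebook is running.

```python
from pathlib import Path
from typing import Iterable, List

# File names as mentioned in this chapter.
EXPECTED_FILES = ("RevenueHD.csv", "TimeHD.csv", "Data Join Lab.ipynb")


def missing_files(folder: Path, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are not present in `folder`."""
    return [name for name in expected if not (folder / name).is_file()]
```

For example, missing_files(Path("Data Refinement")) returns an empty list when all three files have been uploaded, assuming the notebook runs in the Jupyter root directory.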

 

 

10.2 Merging two datasets into one

Merging is a very frequent step in the data wrangling process. It is often done to enrich a data source in order to derive new insights. In Python, datasets are commonly merged using the pandas library.

Pandas provides high-level data structures and functions that make working with structured or tabular data fast, easy, and expressive. The primary objects in pandas are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object. In this use case example, we will focus on the DataFrame.

In this simple use case, we merge two .csv files, namely, RevenueHD.csv and TimeHD.csv. This merging operation will result in an output.csv file.
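As a rough illustration of what such a merge looks like in pandas, consider the sketch below. The column names (ID, Revenue, Time) and the values are assumptions made up for this example; the actual files and the code in the provided notebook may differ.

```python
import io

import pandas as pd

# Hypothetical stand-ins for RevenueHD.csv and TimeHD.csv; real columns may differ.
revenue_csv = io.StringIO("ID,Revenue\n1,100\n2,150\n3,210\n")
time_csv = io.StringIO("ID,Time\n1,10\n2,15\n3,20\n")

revenue_df = pd.read_csv(revenue_csv)
time_df = pd.read_csv(time_csv)

# An inner join on the shared key column keeps only rows present in both tables.
merged = revenue_df.merge(time_df, on="ID", how="inner")

# Without a path, to_csv returns the CSV text; pass a path such as "output.csv"
# to write a file instead. index=False leaves out the DataFrame row labels.
csv_text = merged.to_csv(index=False)
print(csv_text)
```

With real files, the two io.StringIO objects would simply be replaced by the file paths.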

 

 

To follow this use case, single click the file “Data Join Lab.ipynb” to open it. The file will open in a new browser tab. This Jupyter Notebook is a mix of Markdown text and Python code. Each of these elements is contained in a block called a cell. The cells can be run one by one. To do so, click on the Run button. Clicking it once runs the currently highlighted cell (marked in blue) and moves on to the next one. Go ahead and run all the cells one by one. Alternatively, go to the Cell menu and click Run All to run all of the cells at once.

 

 

 

After executing all of the cells in “Data Join Lab.ipynb”, the merged table will be displayed in the Jupyter Notebook. You will also see the output.csv file that was created in your Data Refinement directory. In addition, you will notice that the “Data Join Lab.ipynb” entry has turned from gray to green. This indicates that the notebook is currently running. Once you have finished working with the notebook, you can shut it down safely by selecting it (by ticking the checkbox) and clicking Shutdown in the menu that appears just below the File menu.

 

 

Resources:

11. Use case: data analytics with AI

This section demonstrates how to create an algorithm or analytics asset. To develop an algorithm, you need a workspace, data, and tools. The DIH provides [My workspace], which includes [Projects], a Jupyter Notebook, and a curated list of Python libraries. These include “Pandas”, “Scikit-learn”, “Matplotlib”, and more. A simple linear regression example will be used to illustrate how this is done on the DIH. First, you download the necessary files; then you upload them to a folder in the Jupyter environment and run them.

 

11.1 Preparing a data analytics folder

The files that are to be downloaded are as follows:

 

You will need to download the two files listed above, create a folder named “Data Analytics” in the Jupyter environment, and upload the downloaded files to the Data Analytics directory. Before proceeding any further, follow the guidelines in section 8.2 on how to open the Jupyter environment from [Workspaces] > [Tools]. You should also use the guidelines in section 9.3 on how to create the folder and upload the downloaded files to it.

Your directory should now resemble the picture below.

 

 

11.2 Example of creating a linear regression algorithm

For the linear regression example, we will be using the “Scikit-learn” library, which, along with “pandas”, has been critical in making Python a productive data science programming language. Scikit-learn has become a general-purpose machine learning toolkit for Python; it includes submodules for classification, regression, and clustering. In this use case, we will be using the regression functionality of Scikit-learn. To visualize our data, we will use the “Matplotlib” library, one of the most popular libraries for producing plots and other two-dimensional data visualizations. It is designed for creating publication-quality plots.

In this simple use case, we will load the Revenu_Time_data.csv file and plot the data using the Matplotlib Python library. We will then use the Scikit-learn library to fit a regression line to the data, giving us a regression model that can predict the Revenue, given the Time variable.
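The fitting step can be sketched as follows. The data below is synthetic and only stands in for Revenu_Time_data.csv; the column names Time and Revenue are taken from the description above, but the values and the exact code in the provided notebook are assumptions. Plotting is left out to keep the sketch short.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data that follows Revenue = 5 * Time + 20 exactly.
data = pd.DataFrame({"Time": [1.0, 2.0, 3.0, 4.0, 5.0]})
data["Revenue"] = 5.0 * data["Time"] + 20.0

# scikit-learn expects a two-dimensional feature table, hence the double brackets.
model = LinearRegression()
model.fit(data[["Time"]], data["Revenue"])

slope = float(model.coef_[0])
intercept = float(model.intercept_)

# Predict the Revenue for a Time value that is not in the training data.
new_times = pd.DataFrame({"Time": [6.0]})
predicted = float(model.predict(new_times)[0])

print(f"Revenue = {slope:.2f} * Time + {intercept:.2f}")
print(f"Predicted Revenue at Time = 6: {predicted:.2f}")
```

Because the synthetic data lies exactly on a line, the fitted slope and intercept recover the values used to generate it; real data will scatter around the fitted line.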

 

 

To follow this use case, navigate your way to the folder Data Analytics and single click the file “Linear_Regression_Lab.ipynb” to open it. The file will open in a new browser tab. This Jupyter Notebook is a mix of Markdown text and Python code. Each of these elements is contained in a block called a cell. The cells can be run one by one. To do so, click on the Run button. Clicking it once runs the currently highlighted cell (marked in blue) and moves on to the next one. Go ahead and run all of the cells one by one. You can also go to the Cell menu and click Run All to run all of the cells at once.

 

Resources:

12. Use case: asset management with GitHub

12.1 DIH tutorials on GitHub

Git is primarily a version control system and a staple in any software development project. It usually serves three main purposes: code backup, versioning, and sharing. You can work on your code step by step, saving your progress at each step as you go, in case you need to return to a backup copy. You can also share your code and algorithms with your colleagues.

GitHub is a code hosting platform for version control and collaboration on repositories (directories) created using Git. You will need an account to use its services, but standard accounts are free, and the professional version is free for students. With the number of public notebooks on GitHub exceeding 1.8 million by early 2018, it is one of the most popular independent platforms for sharing coding projects with the world.

We have prepared some onboarding material for you which you can access at this URL:

12.2 Cloning assets onto the DIH

Git is also available on the DIH as a command line interface (CLI). To get started:

  1. Navigate your way to https://github.com/Data-Intelligence-Hub/dih-tutorials-jupyter on your browser and copy the Git URL for cloning.

 

  2. You need to open the JupyterHub terminal in order to clone the repository. To do this, go to the JupyterHub root directory (section 8.2) and go to New > Terminal. This should open a terminal in a new browser tab.
  3. Type the following command into the terminal and press Enter.

 

git clone https://github.com/Data-Intelligence-Hub/dih-tutorials-jupyter.git

 

If the cloning is successful, you should see an output as follows.

jovyan@jupyter-mark-evans:~$ git clone https://github.com/Data-Intelligence-Hub/dih-tutorials-jupyter.git

Cloning into ‘dih-tutorials-jupyter’…

remote: Enumerating objects: 6, done.

remote: Counting objects: 100% (6/6), done.

remote: Compressing objects: 100% (6/6), done.

remote: Total 151 (delta 1), reused 3 (delta 0), pack-reused 145

Receiving objects: 100% (151/151), 73.29 MiB | 8.21 MiB/s, done.

Resolving deltas: 100% (13/13), done.

Checking out files: 100% (116/116), done.

 

You should now have all the tutorials in your session to work with.

You can also use the DIH Git CLI to move your own assets into your personal Git repository. There are plenty of online resources explaining this Git functionality.

 

 

Resources:

13. Use case: advanced analytics tools

The classic Jupyter Notebook allows a user to write Python code, but it offers neither a simultaneous view of the folders nor a quick way to cross-check essential elements in the terminal. These are functionalities of an IDE (Integrated Development Environment), and the classic Jupyter Notebook is not an IDE. The DIH therefore also comes with an IDE called JupyterLab, which has a wide range of functionalities to greatly improve your productivity.

This section will walk you through our approach:

  1. JupyterLab features and plugins
  2. Insights with pandas profiling
  3. MLflow UI and parameter tracking
  4. Building Dash applications

 

13.1 JupyterLab – features and plugins

JupyterLab is a next-generation web-based user interface for Project Jupyter. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. For a demonstration of JupyterLab and its features, you can view this video:

 

 

To launch JupyterLab in the DIH:

  • Navigate your way to the JupyterHub root directory (section 8.2).
  • In the address bar of your browser tab, replace the URL suffix tree with lab and press Enter.

These two steps should take you to the JupyterLab as seen in the picture above.
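The URL change in the second step can be expressed as a small helper. This is purely illustrative: the function name to_lab_url and the example address are hypothetical, and the sketch assumes that the address of your Jupyter root directory ends in /tree.

```python
from urllib.parse import urlsplit, urlunsplit


def to_lab_url(url: str) -> str:
    """Replace a trailing '/tree' path segment with '/lab'."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith("/tree"):
        raise ValueError("expected a URL whose path ends in /tree")
    # Swap only the final path segment; scheme, host, and query stay unchanged.
    new_path = path[: -len("tree")] + "lab"
    return urlunsplit(parts._replace(path=new_path))


# Hypothetical address, for illustration only.
print(to_lab_url("https://jupyter.example.invalid/user/jane-doe/tree"))
```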

The functionalities of JupyterLab include the following aspects:

  • Opening different kinds of file formats: CSV, images, Markdown.
  • Working with JupyterLab elements:
    • Creating Python Notebooks
    • Markdown scripts
    • Table of contents
    • Rearranging panels

 

13.2 Quick descriptive statistics with pandas profiling

In order to select the right data, a quick descriptive profile of the data is required. This could include the mean, median, and similar statistics. You can use pandas profiling to produce such a descriptive profile in just one line of code.

Pandas profiling is a Python library that generates profile reports from a pandas DataFrame. The pandas df.describe() function works well, although it is somewhat basic for serious exploratory data analysis. Pandas profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column, the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a DataFrame.
  • Essentials: type, unique values, missing values
  • Quantile statistics such as minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices
  • Missing values: matrix, count, heatmap, and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and Image analysis: extract file sizes, creation dates, and dimensions and scan for truncated images or those containing EXIF information.
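To make the difference concrete, the sketch below first shows the built-in df.describe() summary and then wraps the profiling call in a helper that is only usable if the pandas profiling package is installed. The sample data and the helper name build_profile_report are assumptions for illustration; the tutorial notebook reads its data from the DIH Project repository instead.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"Time": [10, 15, 20, 25], "Revenue": [100.0, 150.0, 210.0, 240.0]})

# Built-in summary: count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary)


def build_profile_report(frame: pd.DataFrame):
    """Return a pandas profiling report, or None if the library is not installed."""
    try:
        # Importing the package registers DataFrame.profile_report().
        import pandas_profiling  # noqa: F401
    except ImportError:
        return None
    return frame.profile_report()
```

In a notebook, displaying the object returned by build_profile_report(df) renders the interactive report when the package is available.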

 

 

If you have completed section 12.2, navigate your way to the dih-tutorials-jupyter folder in your JupyterHub root directory. Here you will find a Jupyter Notebook with the name “Getting Started.ipynb”. Single click to open the Notebook. This Notebook contains code that reads the data from the DIH Project repository and then runs pandas profiling on it to display a report such as the one in the picture above.

 

Resources:

 

13.3 Parameter tracking with MLflow

Machine learning (ML) is not easy, but creating a good workflow which you can reproduce, revisit, and deploy for production is even harder.

MLflow is an open source platform for the complete machine learning lifecycle.

MLflow is designed to work with any ML library, algorithm, deployment tool, or language. It is very easy to add MLflow to your existing ML code so you can benefit from it immediately and share code using any ML library that can be run by others in your organization. MLflow is also an open source project that can be extended by users and library developers.

Both DIH JupyterLab and the classic JupyterHub interface provide the plugins needed to run the MLflow Python code. If you have completed section 12.2:

  • Navigate your way to the folder:

/dih-tutorials-jupyter/tutorials/dih_jupyter_deep_dive/mlflow_example

  • Run the Jupyter Notebook (the file with the .ipynb extension) located in this folder.
  • Go to the JupyterHub root directory > New > Click on MLflow Tracking UI

 

 

  • Clicking on MLflow Tracking UI should open a new browser tab with the following UI, where you can explore the KPIs and parameters of the machine learning runs.

 

Resources:

14. Use case: building a dashboard using Dash

It is difficult to present analytical results in a digestible format. Presenting them in an interactive form and making them available to a general audience is even harder. You need a team of front-end and back-end developers who are acquainted with different aspects of web design.

The Dash platform empowers Data science teams to focus on data and models, while producing and sharing enterprise-ready analytic apps that sit on top of Python and R models. What would typically require a team of back-end developers, front-end developers, and IT can all be done with Dash.

The DIH tutorials include sample Dash apps that you can explore in JupyterLab. Please note that the apps provided in the tutorials only work in JupyterLab. If you have completed section 12.2:

  • Open JupyterLab (section 13.1).
  • Navigate your way to the folder:

/dih-tutorials-jupyter/tutorials/dih_jupyter_deep_dive/dash_examples/dash-oil-and-gas/

  • Open the Jupyter Notebook (the file with the .ipynb extension) located in this folder. This should open the file in the panel on the right.
  • Now go to the Run menu at the top and click Run All Cells.
  • This should open the dash app in a new tab.
  • Click on the View menu at the top and uncheck Show Left Sidebar to view the interactive app in full view.

 

 

Resources:

15. Use case: sharing data using IDS with the Trusted Connector (WIP)

The International Data Spaces (IDS) are a peer-to-peer network, a virtual data space that supports the secure exchange and simplified linking of data in business eco-systems by leveraging existing standards and technologies, as well as governance models that are well-accepted in the data economy. Thereby, IDS provide a basis for creating smart-service scenarios and facilitating innovative cross-company business processes, while at the same time guaranteeing data sovereignty for data owners. IDS is managed by the International Data Spaces Association (IDSA), a European non-profit association with currently 118 members from numerous industries and research across 20 countries, predominantly European.

Data sovereignty is a central aspect of the IDS. The International Data Spaces initiative proposes a Reference Architecture Model (RAM) for this capability, including standards for secure and trusted data exchange in business ecosystems. The architecture proposed by the IDS moves away from a central data pool toward decentralized data storage, where the data resides with the provider, who retains complete control over the data being shared. Data sharing is therefore made possible via an IDS-certified connector that strictly adheres to IDS standards.

The Trusted Connector, a connector based on IDS standards which is developed by the Telekom Data Intelligence Hub (DIH), enables its customers to securely exchange batch and streaming data while ensuring data sovereignty.

 

 

The Trusted Connector developed by the DIH is a software product that needs to be installed on the machines of the participating companies for secure data exchange. The software is operated through a web user interface. The UI has several sections; the most important are a dashboard, the IDS certifications belonging to the participating company, and the data source and data sink configuration menus.

The connector requires the Provider of the data to first connect their Data Source to the Trusted Connector, which enables them to share their data with the Consumer. This configuration is carried out using the connector UI. The configuration can vary based on the type of data source being shared, for example MySQL, MQTT, file, or DIH cloud storage. Next, the Provider needs to configure the Outgoing connection and the ID of the Consumer they intend to share the data with. The Provider is also able to configure other aspects of the transfer, e.g., whether it is a one-off transfer or streaming data.

To receive data from the Provider, the Data Sink menu on the Consumer’s connector UI needs to be configured. The Data Sink is the location where the data from the Provider is expected to arrive. The Data Sink is configured based on the type of data source received from the Provider. Before the connection can be established with the Provider, however, the Consumer is also required to configure an Incoming connection with the server details from the Provider.

When the connection is successfully established and all IDS requirements are fulfilled on both ends, the connector will initiate the data exchange. The Provider and Consumer can configure the logs in which the details concerning the data transfer are stored.

In the future, the Trusted Connector is also intended to integrate data transformation and data anonymization apps, supplying the Provider with tools for anonymizing data and the Consumer with tools for converting the data so that it is compatible with their systems.