Comment by Stefan Müller, IT-Novum
5 Tips for Modern Data Management with Open Source
15.12.2021 | By Stefan Müller
Rapid technological progress means that the amount of data produced worldwide is growing exponentially year after year. While the topic of data hardly played a role for most people two decades ago, today we are confronted with the question of who uses our data for what purpose and whether we want to give them permission for it. There is a simple reason why data plays such an important role today: anyone who has data can turn it into money in a variety of ways.
The author: Stefan Müller is Director Big Data Analytics & IoT at IT-Novum.
And this applies not only to data collected on websites to gain insights into visitors and their behavior, but to all data. All data generated within a company can be used to optimize processes and develop more effective business strategies. However, mere access to the data is not enough to benefit from it: the data must be meaningfully aggregated and analyzed so that it can serve as a basis for decision-making.
Many companies still struggle with this. According to Veritas' Databerg Report, 54 percent of company data is dark data. In other words, data that is floating around somewhere in the company but cannot be used. There are various reasons for this: in 85 percent of cases, companies lack the tools to access dark data. Often they are unable to process data in good quality (66 percent) or are simply overwhelmed by the sheer amount of data (39 percent). The following five recommendations help companies bring their hidden data assets to light with the help of modern open source solutions.
Tip 1: Use advanced data models
The foundation for the successful analysis and use of data is an efficient data model. The Data Vault approach has proven its worth here and is supported, for example, by the Pentaho data integration platform. The model is composed of several layers. First, the raw data from different data sources is merged in the staging layer. From there it moves into the Raw Data Vault of the data warehouse layer and, depending on its origin, into one of several optional vaults intended, for example, for specific business data, runtime information or data from operational systems. The third level is the Information Mart layer, in which the analyzed data is made available to consumers in visualized form.
In this way, specialist departments have quick access to the information required for decision-making. Data Vault modeling offers numerous other advantages as well. The development time for implementing new business requirements is very short, which gives the model a high degree of flexibility and scalability. The architecture also supports compliance requirements by ensuring full auditability through historization and traceability of all data back to the source system.
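The historization and traceability mentioned above rest on a few simple conventions. The following minimal sketch illustrates them in Python; the function names and record layouts are illustrative assumptions, not part of any specific Data Vault tool.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Derive a deterministic surrogate key from a business key
    (a common Data Vault convention: normalize, then hash)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def hub_record(business_key: str, record_source: str) -> dict:
    """A hub stores only the business key plus load metadata."""
    return {
        "hub_key": hash_key(business_key),
        "business_key": business_key,
        "load_ts": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,  # traceability back to the source system
    }

def satellite_record(business_key: str, attributes: dict, record_source: str) -> dict:
    """A satellite holds the descriptive attributes. New versions are
    appended rather than updated, which provides the historization."""
    return {
        "hub_key": hash_key(business_key),
        "load_ts": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
        "attributes": attributes,
    }

hub = hub_record("CUST-1001", "crm_system")
sat = satellite_record("CUST-1001", {"name": "Acme GmbH", "city": "Fulda"}, "crm_system")
```

Because hub and satellite share the same hashed key, descriptive changes can be versioned in the satellite without ever touching the hub, and every record carries its load timestamp and source.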
Tip 2: Self-service approach for more efficient data analysis
Often, the circle of people who have access to data sources and analysis results in a company is limited. This creates several problems for efficient data processing, and lengthy deployment processes frequently lead to a poor user experience on top of that.
However, for data to be used profitably in all areas of the company, it is important to authorize employees to evaluate data independently within a well-defined framework. One could speak here of a democratization of data use. This can be achieved by implementing a so-called self-service concept.
The advantages are numerous: the teams entrusted with providing data analyses are relieved, while specialist departments can work with analytical tools directly and use them to optimize their own processes. Decisive factors for the successful implementation of self-service are an intuitive user interface for users without in-depth expertise and appropriate data protection measures to minimize the increased risk of data leaks in self-service applications.
Particularly powerful open source self-service analysis tools include Pentaho, Apache Superset and Metabase.
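The data protection guardrail mentioned above often takes the form of row- and column-level restrictions per role. The following is a hypothetical sketch of that principle; the role names and data shape are illustrative assumptions, not the API of any of the tools listed.

```python
# Row-level filters per role: who may see which rows.
ROW_FILTERS = {
    "sales": lambda row: row["department"] == "sales",
    "hr": lambda row: True,           # HR may see all departments
    "default": lambda row: False,     # unknown roles see nothing
}

def self_service_query(rows, role, columns):
    """Return only the rows and columns the given role is allowed to see."""
    allowed = ROW_FILTERS.get(role, ROW_FILTERS["default"])
    return [{c: row[c] for c in columns if c in row} for row in rows if allowed(row)]

data = [
    {"department": "sales", "revenue": 120, "salary": 55},
    {"department": "it", "revenue": 0, "salary": 70},
]

print(self_service_query(data, "sales", ["department", "revenue"]))
# → [{'department': 'sales', 'revenue': 120}]
```

The point of the design: analysts query freely within the filter, so the central data team no longer has to approve each report, yet sensitive rows and columns never leave the defined frame.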
Tip 3: Analyze data streams in real time
For a long time, data analysis in companies was predominantly batch-oriented: data is extracted from the source systems, processed and analyzed at a fixed point in time. Today, however, data is generated continuously by countless sources such as apps, websites or sensors, and in many cases real-time analysis is necessary to benefit from the insights. The traditional batch philosophy therefore no longer meets today's requirements; its modern successor is streaming analytics. As the name suggests, this involves analyzing continuously arriving data streams in real time. Companies thus benefit continuously from the latest data-based information and can adapt processes without delay. In contrast to the batch method, the data is stored not before but after the analysis.
To integrate streaming analytics into existing data architectures, these must be supplemented with real-time processing technology that makes it possible to organize, process and analyze enormous data streams as they arrive. Proven open source solutions for implementing streaming analytics are Apache Kafka, Flume and Spark Streaming.
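The core difference to batch processing can be shown in a few lines: results are emitted per event over a sliding window instead of once over a finished data set. This is a minimal pure-Python sketch of the principle; in production, the stream would be an unbounded iterator fed by a system such as Kafka or Spark Streaming.

```python
from collections import deque

def sliding_window_average(stream, window_size=3):
    """Emit a running average over the last `window_size` events as each
    event arrives, instead of waiting for a complete batch."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)          # oldest value drops out automatically
        yield sum(window) / len(window)

# Simulated sensor readings standing in for a continuous data stream.
readings = [10, 20, 30, 40]
print(list(sliding_window_average(readings)))
# → [10.0, 15.0, 20.0, 30.0]
```

Because the function is a generator, each incoming event immediately produces an up-to-date result, which is exactly the property that lets processes be adapted without delay.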
Tip 4: Standardization of data access via API Gateway
The use of API gateways enables standardized and secure data access for all authorized persons. The advantages over traditional data access approaches include high reliability and security in communication between all relevant sources as well as flexible application in on-premise or cloud infrastructures.
Standardized API gateways enable users and developers to quickly deploy suitable data APIs for the respective application. This also significantly accelerates the development process of data-driven applications, which ultimately contributes to simpler and more efficient data use in the company.
API gateways form a superordinate layer within a microservice architecture, so that all communication with the microservices takes place via the gateway. A consumer gains access to all required services by sending requests to the gateway; access is therefore decoupled from the underlying microservice architecture. Another advantage: gateways can be configured individually with regard to usage policies, access controls or performance monitoring. Kong is a proven and powerful open source connectivity platform for this purpose.
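The decoupling described above boils down to two responsibilities: authenticate the consumer, then map the request path to a backend service. The following hypothetical sketch shows that principle; the routes, service addresses and API keys are illustrative assumptions, not Kong's actual configuration format.

```python
# Path prefix -> backend microservice (the consumer never sees these addresses).
ROUTES = {
    "/orders": "http://order-service:8001",
    "/customers": "http://customer-service:8002",
}
API_KEYS = {"secret-key-1": "team-analytics"}  # access control at the gateway

def route_request(path: str, api_key: str):
    """Resolve a consumer request to a backend service, or reject it."""
    if api_key not in API_KEYS:
        return (401, "unauthorized")
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return (200, backend + path)   # forward to the matched service
    return (404, "no route")

print(route_request("/orders/42", "secret-key-1"))
# → (200, 'http://order-service:8001/orders/42')
```

Because consumers only ever know the gateway, backends can be moved, split or replaced without changing a single client, and policies such as rate limiting or monitoring can be added at this one point.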
Tip 5: Create scalable infrastructures
In the early days of the data warehouse, the range of applications had to be planned precisely before going live, since subsequent changes or scaling could only be implemented with great effort. Today this is hardly imaginable, as the requirements on IT infrastructures change very quickly. Against this background, the infrastructure must be adaptable to new data applications and increased workloads as quickly as possible and with little effort.
The required flexibility is provided, for example, by the cloud-based data platforms from Microsoft, Amazon or Google. Companies benefit from a pay-as-you-go model: they only pay for what is actually needed at the time, instead of investing large sums up front. Containerized data solutions form another cornerstone of flexible, easy scaling of resources; in practice, the open technology Kubernetes has become the de facto standard here.
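Kubernetes makes this scaling concrete through its Horizontal Pod Autoscaler, whose documented core formula can be stated in a few lines. The scenario values below are illustrative assumptions.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Kubernetes Horizontal Pod Autoscaler core formula:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)"""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 4 pods at 90 % average CPU with a 60 % target: scale out to 6 pods.
print(desired_replicas(4, 90, 60))
# → 6
```

The same formula scales back in when load drops, which is what turns a containerized data platform into a pay-for-what-you-use infrastructure rather than one sized for peak load.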
Conclusion: Implementing modern data management with open source
The intelligent combination of open solutions enables companies today to efficiently collect, aggregate and analyze all company data from a wide variety of sources. All of this together forms the basis for data-based decisions and thus for the optimization of all business processes in increasingly competitive environments.