Open-source data analysis with the stream processing engine Apache Heron
Apache Heron is an open-source tool originally developed by Twitter to analyze large amounts of data, with a focus on data that has to be analyzed in a short time.
Apache Heron is an open-source tool for distributed, fault-tolerant stream processing in real time. It is mainly used in advertising management, monitoring of real-time trends, machine learning and real-time business intelligence. Twitter, for example, uses Apache Heron to count words in tweets or to identify trending topics. Java 11 and Python 3.6 are required to use Heron.
Heron replaces or supplements Storm at Twitter, Microsoft and Google
The tool was developed at Twitter to replace or supplement Storm. After a short time, Heron replaced Storm at Twitter for streaming real-time data. In addition to better performance, Heron has another advantage: it requires far less CPU power and RAM.
To ease migration, Heron also supports the APIs of Storm, Apache Beam and other solutions. Heron is extensible and can serve as the streaming engine for a wide range of applications. Well-known organizations such as Google, Microsoft and Stanford University rely on Heron.
It has been available as open source under the Apache license since 2016 and holds Apache incubator status. Heron has been in production at Twitter for years and has proven its usefulness. Heron is characterized by high data throughput and low latencies, which is essential when processing real-time data in the petabyte range. The individual components are written in C++.
The goals of Apache Heron
Apache Heron supports various programming languages, including C++, Java and Python. Heron can isolate individual analysis tasks from each other, which makes it possible to run different workloads, each analyzing different data, in separate containers with their own profiles. Heron can even be operated entirely in containers and supports Kubernetes for this purpose.
The architecture of Heron: spouts and bolts
At the heart of Heron’s architecture is the scheduler, which manages the individual topologies through which Heron obtains its data. The Topology Master in the master container in turn manages the individual data containers as well as the permissions for accessing Heron. In Heron, a streaming application is called a “topology”: a directed acyclic graph (DAG) whose nodes represent the data-processing elements and whose edges carry the data streams between them.
Heron primarily uses “spouts” and “bolts” as nodes. Spouts are the connections to the data sources and feed data into the data streams. Bolts process incoming data and can also emit data of their own, which can in turn be processed further by other bolts.
In an environment, there can be multiple spouts that tap into data sources and generate data streams. These data streams are then consumed by the various bolts in the infrastructure. Complex infrastructures with numerous spouts and bolts are possible: several spouts can supply different bolts with data streams, while the bolts in turn forward different data streams that are processed by yet another bolt. This allows very flexible scenarios for processing streaming data.
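The spout-and-bolt model described above can be illustrated with a small sketch. This is plain Python, not the Heron API; the function names are illustrative. A spout emits tuples, a first bolt transforms them, and a second bolt consumes the transformed stream:

```python
# Conceptual sketch of a Heron-style topology DAG (plain Python,
# not the Heron API): spout -> split bolt -> count bolt.

def sentence_spout():
    """Spout: taps a data source and emits a stream of tuples."""
    for sentence in ["heron processes streams", "heron replaces storm"]:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each sentence into words and re-emits them downstream."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: consumes the word stream and aggregates counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the DAG: the output stream of each node feeds the next node.
result = count_bolt(split_bolt(sentence_spout()))
print(result)  # {'heron': 2, 'processes': 1, 'streams': 1, 'replaces': 1, 'storm': 1}
```

In a real topology, each of these stages would be a separate process distributed across the cluster, connected by network streams rather than generators.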
Running and grouping Heron components in parallel
Depending on the data being processed, a single spout or bolt is often not enough to manage the data streams effectively. Even if a spout is able to keep up, the node on which it runs can become overloaded. Heron therefore also offers the possibility of building topologies in which several processes work in parallel with each other.
For every node in a Heron topology, the number of concurrent instances running on the system can be defined; spouts and bolts are treated equally in this respect. This degree of parallelism, chosen by the developers, offers a flexible way to control the performance of a Heron cluster, since different values for parallel processes can be defined on different nodes.
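A per-component parallelism configuration might be sketched as follows. This is a conceptual illustration, not the Heron API, and the component names are hypothetical:

```python
# Conceptual sketch (not the Heron API): the degree of parallelism is
# configured per spout/bolt, so different components can run with
# different numbers of concurrent instances.

parallelism = {            # hypothetical per-component settings
    "sentence_spout": 2,   # two spout instances tap the source in parallel
    "split_bolt": 4,       # the splitting work is spread over four instances
    "count_bolt": 2,
}

# Expand the configuration into the concrete instances that the
# scheduler later has to place on the cluster.
instances = [f"{name}-{i}" for name, n in parallelism.items() for i in range(n)]
print(len(instances))  # 8
```

Raising the value for a single overloaded component scales just that stage without touching the rest of the topology.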
Once the parallelism is defined, i.e. the number of instances of spouts and bolts that run on the nodes, it must also be defined how Heron should distribute the data among these instances. This is what grouping is for: it controls which spout instance sends its data streams to which bolt instance.
In addition to distributing data statically, i.e. determining exactly which spout and bolt instances work together, Heron can of course also distribute the data randomly. Even complex scenarios can be mapped and comprehensive rules can be created, including hybrid setups that combine both approaches.
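The difference between random and static distribution can be sketched with two toy routing functions. This is a conceptual model, not the Heron API; the instance count and field names are assumptions:

```python
# Conceptual sketch (not the Heron API) of two grouping strategies that
# decide which bolt instance receives each tuple.
import random

NUM_BOLT_INSTANCES = 3  # assumed parallelism of the receiving bolt

def shuffle_grouping(tuple_):
    """Random distribution: any bolt instance may receive the tuple."""
    return random.randrange(NUM_BOLT_INSTANCES)

def fields_grouping(tuple_, key):
    """Static, key-based distribution: tuples with the same key always
    reach the same instance, so per-key state needs no coordination."""
    return hash(tuple_[key]) % NUM_BOLT_INSTANCES

words = [{"word": w} for w in ["heron", "storm", "heron", "beam"]]
targets = [fields_grouping(t, "word") for t in words]

# Identical keys are routed to the same bolt instance:
assert targets[0] == targets[2]
```

Random distribution balances load evenly, while key-based distribution is what makes stateful operations such as per-word counting possible across parallel instances.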
The parallelism, the grouping and the associated rules and guidelines can be combined into containers. For each container, developers can define how many CPU cores and how much RAM should be available, and several containers can be defined. If developers do not define any containers, Heron implements the entire structure in a single container by default. The containers are in turn handed to the scheduler mentioned above, which schedules their execution.
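The container packing described above can be sketched as a simple data structure. This is a conceptual illustration, not Heron's actual configuration format; all instance names and resource figures are made up:

```python
# Conceptual sketch (not the Heron API): spout/bolt instances are packed
# into containers, each with a declared CPU/RAM budget, which are then
# handed to the scheduler for placement.

containers = [
    {"cpu_cores": 2, "ram_mb": 4096,
     "instances": ["spout-0", "split_bolt-0", "split_bolt-1"]},
    {"cpu_cores": 2, "ram_mb": 4096,
     "instances": ["spout-1", "count_bolt-0"]},
]

def schedule(containers):
    """Toy scheduler: report the total resources it has to reserve."""
    cpu = sum(c["cpu_cores"] for c in containers)
    ram = sum(c["ram_mb"] for c in containers)
    return cpu, ram

print(schedule(containers))  # (4, 8192)
```

With no explicit container definitions, the equivalent of this structure would be a single container holding every instance, matching Heron's default behavior described above.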