Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid
In use cases ranging from IoT to ecommerce, an ongoing challenge for stream-processing applications is to extract important insights from real-time systems as fast as possible and then generate effective feedback that optimizes operations or avoids costly failures. Unlike popular software platforms for streaming analytics (e.g., Apache Storm, Flink, Spark Streaming, and legacy CEP), which focus on extracting value from unfiltered data streams, in-memory data grids (IMDGs) have opened the door to stateful stream processing that correlates event streams by data sources using a “digital twin” model and enables much deeper introspection on these data sources. This talk describes the next step in the evolution of the digital twin model made possible by IMDGs: the incorporation of real-time, data-parallel analytics that further enhances introspection by providing immediate feedback on aggregate behaviors. As illustrated by several real-world applications, this new capability leverages the power of IMDGs to significantly increase the effectiveness of stream-processing applications.
Stream-processing and data-parallel analytics have traditionally led parallel lives. As described by the Lambda architecture, streaming analytics are usually hosted in the “speed layer,” while data-parallel analytics are hosted in the “batch layer” with results appearing later in queries that merge these two views. Data-parallel analytics captures vital aggregate trends that can enhance introspection. Built by the necessity to host processing layers on different systems, the Lambda architecture fails to deliver real-time feedback to stream-processing applications from data-parallel analytics.
Consider a medical monitoring application that captures and analyzes telemetry from hundreds of thousands of monitoring devices. Using an IMDG to host a digital twin model of the patients enables the tracking of real-time state for each patient, and it automatically correlates incoming telemetry from the data sources to the respective digital twin models. This allows the application to analyze telemetry in real time with rich context that includes the patient’s medical history and recent events, allowing much deeper introspection than available in stateless stream-processing systems.
The next step is to extract and analyze salient state information from the real-time state of all digital twin models and feed the results back to these models for incorporation into the analysis algorithm. This offers the next level of introspection that considers dynamic, aggregate trends. For example, the medical application can average key parameters, such as heart rate, across all patients and pivot this data by region, age group, gender, etc. These results can be reported back to the digital twin models in real time to enrich the analysis of incoming telemetry.
The use of an IMDG for stateful stream-processing makes integration of data-parallel analytics possible due to its ability to host state information for data sources in memory. Because an IMDG can perform data-parallel analytics (for example, MapReduce) in place – where the data lives, the large volume of accumulating state data does not have to be moved to a separate system for analysis. This allows results to be generated in real time and immediately fed back into the state models. It also avoids data motion which creates network bottlenecks. With these new capabilities, IMDGs significantly increase the power of digital twin models for stream processing.