13 Feb 2024


Stream Processing for IoT Data

The Internet of Things (IoT) refers to sensors and other devices that share and exchange data over a network. IoT has been on the rise for years, and its popularity only continues to grow alongside other technological advances, such as 5G cellular networks and the proliferation of “smart” devices. From tracking patient health to monitoring agriculture, the applications for IoT are plentiful and diverse. Other sectors where IoT is used include security, transportation, home automation, and manufacturing.

Oracle defines Big Data as “data that contains greater variety, arriving in increasing volumes and with more velocity.” This definition is often summarized with the 3 Vs – volume, velocity, and variety. IoT data clearly matches this description, as large volumes of varied data arrive continuously from numerous sensors and devices.

A platform capable of processing IoT data needs to be scalable in order to keep up with the volume of Big Data. It’s very common for IoT applications to have new sensors added over time. Consider a drone fleet for package deliveries as an example – you may start off with 10 or 20 drones, but as demand for deliveries increases, the size of your drone fleet can grow by orders of magnitude. The underlying systems processing this data need to be able to scale horizontally to match the increase in data volume.

Many IoT use cases, such as tracking patient health and monitoring security feeds, require low latency insights. Data that arrives in real time often needs to be acted on in real time as well. For this reason, streaming and stream processing technologies have become increasingly popular, and perhaps essential, for solving these use cases. Streaming storage technologies such as Apache Kafka, Amazon Kinesis, and RedPanda can meet the low latency data transportation requirements of IoT. On the stream processing side, technologies such as Apache Flink and managed solutions such as DeltaStream can provide low latency streaming analytics.

IoT data can also come in various types and structures, as different sensors can have different data formats. Take a smart home, for example: the cameras will likely send very different data from a light or a thermometer, yet all of these sensors relate to the same smart home. It’s important for a data platform handling IoT use cases to be able to join across different data sets and handle any variations in data structure, format, or type.

DeltaStream as a Streaming Analytics Platform and a Streaming Database

DeltaStream is a platform to unify, process, and govern streaming data. DeltaStream sits as the compute and governance layer on top of streaming storage systems such as Kafka. Powered by Apache Flink, DeltaStream is a fully managed solution that can process streaming data with very low latencies.

In this blog post we’ll cover 2 examples to show how DeltaStream can solve real-time IoT use cases. In the first use case, we’ll use DeltaStream’s Materialized Views to build a real-time request-driven application. For the second use case, we’ll use DeltaStream to power real-time event-driven pipelines.

Use Case Setup: Transportation Sensor Data

For simplicity, both use cases will use the same source data. Let’s assume that our data is available in Apache Kafka and represents updates and sensor information for a truck fleet. We’ll first define Relations for the data in 2 Kafka topics.

The first Relation represents truck information. This includes an identifier for the truck, the speed of the truck, which thermometer is in the truck, and a timestamp for this update event represented as epoch milliseconds. Later on, we will use this event timestamp field to perform a join with data from other sensors. Since we expect regular truck information updates, we’ll define a Stream for this data.

Create truck_info Stream:

  CREATE STREAM truck_info (
    event_ts BIGINT,
    truck_id INT,
    speed_kmph INT,
    thermometer VARCHAR
  ) WITH (
    'topic' = 'truck_info', 'value.format' = 'json', 'timestamp' = 'event_ts'
  );

The second Relation represents a thermometer sensor’s readings. The fields include an identifier for the thermometer, the temperature reading, and a timestamp for when the temperature was taken that is represented as epoch milliseconds. Later on, the event timestamp will be used when joining with the truck_info Stream. We will define a Changelog for this data using sensor_id as the primary key.

Create temperature_sensor Changelog:

  CREATE CHANGELOG temperature_sensor (
    "time" BIGINT,
    temperature_c INTEGER,
    sensor_id VARCHAR,
    PRIMARY KEY (sensor_id)
  ) WITH (
    'topic' = 'temperature_sensor', 'value.format' = 'json', 'timestamp' = 'time'
  );

Using the Relations we have just defined, we want to find out what the latest temperature readings are in each truck. We can achieve this by using a temporal join to enrich our truck_info updates with the latest temperature readings from the temperature_sensor Changelog. The result of this join will be a Stream of enriched truck information updates with the latest temperature readings in the truck. The following SQL statement will launch a long-lived continuous query that will continually join these two Relations and write the results to a new Stream that is backed by a new Kafka topic.

Create truck_info_enriched Stream using CSAS:

  CREATE STREAM truck_info_enriched AS
  SELECT
    truck_info.event_ts,
    truck_info.truck_id,
    truck_info.speed_kmph,
    temp.sensor_id AS thermometer,
    temp.temperature_c
  FROM truck_info
  JOIN temperature_sensor temp
    ON truck_info.thermometer = temp.sensor_id;

While a truck fleet in a real-world environment will likely have many more sensors, such as cameras, humidity sensors, and others, we’ll keep this use case simple and just use a thermometer as the additional sensor. However, users could continue to enrich their truck information events with joins for each additional sensor data feed.
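As an illustrative sketch of how one such additional enrichment might look, the query below chains a second join onto the enriched Stream. Note that the humidity_sensor Changelog and the hygrometer field are hypothetical – they assume a humidity feed defined analogously to temperature_sensor and a hygrometer identifier on the truck events, neither of which exists in this post's setup:

```sql
-- Hypothetical sketch: chaining a second enrichment join.
-- Assumes a humidity_sensor Changelog (PRIMARY KEY sensor_id) defined
-- like temperature_sensor, and a hygrometer id field on truck events.
CREATE STREAM truck_info_enriched_v2 AS
SELECT
  t.event_ts,
  t.truck_id,
  t.speed_kmph,
  t.thermometer,
  t.temperature_c,
  hum.sensor_id AS hygrometer,
  hum.humidity_pct
FROM truck_info_enriched t
JOIN humidity_sensor hum
  ON t.hygrometer = hum.sensor_id;
```

Each additional join of this shape enriches the truck update events with the latest reading from one more sensor feed.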

Use Case Part 1: Powering a real-time dashboard

Monitoring and health metrics are essential for managing a truck fleet. Being able to check on the status of particular trucks, and to see at a glance that the fleet is doing fine, can provide peace of mind for the fleet manager. This is where a real-time dashboard can be helpful – having the latest metrics on the status of the truck fleet readily available.

So for our first use case, we’ll use Materialized Views to power a real-time dashboard. By materializing our truck_info_enriched Stream into a queryable view, we can build charts that can query the view and get the latest truck information. We’ll build the Materialized View in two steps. First we’ll define a new Changelog that mirrors the truck_info_enriched Stream, then we’ll create a Materialized View from this Changelog.

Create truck_info_enriched_changelog Changelog:

  CREATE CHANGELOG truck_info_enriched_changelog (
    event_ts BIGINT,
    truck_id INT,
    speed_kmph INT,
    thermometer VARCHAR,
    temperature_c INTEGER,
    PRIMARY KEY (truck_id)
  ) WITH (
    'topic' = 'truck_info_enriched',
    'value.format' = 'json'
  );

Create truck_info_mview Materialized View using CVAS:

  CREATE MATERIALIZED VIEW truck_info_mview AS
  SELECT * FROM truck_info_enriched_changelog;

Note that we could have created this Materialized View sourcing from the truck_info_enriched Stream, but if we created the Materialized View from the Stream, then each event would be a new row in the Materialized View (append mode). Instead we are building the Materialized View from a Changelog so that each event will add a new row or update an existing one based on the Changelog’s primary key (upsert mode). For our example, we only need to know the current status of each truck, so building the Materialized View with upsert mode better suits our use case.
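For contrast, a sketch of the append-mode variant we decided against, sourcing directly from the Stream (the view name here is hypothetical):

```sql
-- Append-mode sketch (not used in this post): sourcing the view from
-- the Stream inserts one row per event, so a truck with ten updates
-- would appear as ten rows instead of one continuously updated row.
CREATE MATERIALIZED VIEW truck_info_append_mview AS
SELECT * FROM truck_info_enriched;
```

An append-mode view like this would suit use cases that need the full history of updates rather than only each truck's current status.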

A continuous query will power this Materialized View, constantly ingesting records from the truck_info_enriched Stream and sinking the results to truck_info_mview. Then, we can write queries to SELECT from the Materialized View. A dashboard can easily be built that simply queries this Materialized View to get the latest statuses for trucks. Here are some example queries that might be helpful when building a dashboard for the truck fleet.

Query to get truck IDs with the highest 10 temperatures:

  SELECT truck_id, temperature_c
  FROM truck_info_mview
  ORDER BY temperature_c DESC
  LIMIT 10;

Query to get all information about a truck:

  SELECT *
  FROM truck_info_mview
  WHERE truck_id = 3;

Query to count the number of trucks that are speeding:

  SELECT COUNT(truck_id) AS num_speeding_trucks
  FROM truck_info_mview
  WHERE speed_kmph > 90;

Use Case Part 2: Building a real-time alerting pipeline

While it’s great to be able to pull real-time metrics for our truck fleet (Use Case Part 1), there are also situations where we may want the truck update events themselves to trigger actions. In our example, we’ve included thermometers as one of the sensors in each of our delivery trucks. Groceries, medicines, and some chemicals need to be delivered in refrigerated trucks. If the trucks aren’t able to stay within a desired temperature range, it could cause the items inside to go bad or degrade. This can be quite serious, especially for medicines and hazardous materials that can have a direct impact on people’s health.

For our second use case, we want to build out a streaming analytics pipeline to power an alerting service. We can use a CSAS to perform real-time stateful transformations on our data set, then sink the results into a new Stream backed by a Kafka topic. Then the sink topic will contain alertable events that the truck fleet company can feed into their alerting system or other backend systems. Let’s stick to our refrigeration example and write a query that detects if a truck’s temperature exceeds a certain threshold.

Create overheated_trucks Stream using CSAS:

  CREATE STREAM overheated_trucks AS
  SELECT * FROM truck_info_enriched WHERE temperature_c > 10;

Submitting this CSAS will launch a long-lived continuous query that ingests from the truck_info_enriched Stream, filters for events where the truck’s temperature is greater than 10 degrees Celsius, and sinks the results to a new Stream called overheated_trucks. Downstream, the truck fleet company can ingest these records and send alerts to the correct teams or use these records to trigger actions in other backend systems.

Processing IoT Data with DeltaStream

IoT data can be challenging to process due to the high volume of data, the inherent real-time requirements of many IoT applications, and the distributed nature of collecting data from many different sources. While we often treat IoT use cases as their own category, they really span many sectors and use cases. That’s why it’s essential to use a real-time streaming platform, such as DeltaStream, that can keep up with the processing demands of IoT and serve as both a streaming database and a streaming analytics platform.

If you want to learn more about how DeltaStream can help your business, schedule a demo with us. We also have a free trial available if you want to try out DeltaStream yourself.

01 Feb 2024


Secure Sharing for Real-Time Data Streams

Real-time data has become increasingly valuable across many industries, including healthcare, IoT, machine learning, and cybersecurity. Stream processing frameworks such as Apache Flink enable organizations to gain immediate insights into their streaming data. For example, in the case of IoT, companies can understand within sub-seconds if a sensor has failed. However, what if you want to share that processed data (stream) with another team who needs it for a downstream application? The processed stream is what we refer to as a Data Product, and sharing these Data Products is difficult because of the underlying permissioning in a streaming store such as Kafka. If we build a system for governing these source and processed streams, we unlock the ability to securely share Data Products. This enables cross-functional and external sharing use cases.

What does it mean to Securely Share Data?

Secure data sharing is a technique that enables data owners to share their data resources in a secure and controlled way with data consumers. Secure sharing for real-time data streams involves the following aspects:

  • Data privacy: ensuring that the data streams are protected from unauthorized access, disclosure, and modification, and that the data owners have control over who can access their data and for what purpose.
  • Data access: ensuring that the data streams are available, accessible, and usable for the intended data consumers, and that the data sharing is scalable, efficient, and cost-effective.
  • Data auditing: ensuring that the usage and sharing of data streams is transparent, accountable, and auditable.

Secure data sharing improves both data collaboration and data innovation. With secure data sharing, data owners and consumers can safely exchange real-time data and insights, encouraging cross team collaboration and extracting more value out of the otherwise siloed data.

Current State of Sharing Streaming Data

Managing data authorization is one of the most common ways to securely share real-time streaming data. Consider Apache Kafka, for example – the most popular real-time event streaming storage system. Data events in Kafka are organized into topics, and access to these topics is managed through Access Control Lists (ACLs). However, at scale, ACLs are difficult to manage and prone to mistakes.

To process real-time streaming data, users often look towards stream processing frameworks like Flink. ACLs would need to be configured in the streaming store to allow the Flink job to read from and write to topics. Then, to share the topic containing the processed data with other users, the ACL needs further additions. There are many streaming and stream processing technologies that make up a streaming data ecosystem, and all of these systems put stress on the ACLs used to control access to data streams.

In order to overcome these challenges, there needs to be a data platform that can manage data processing, access control, and data sharing capabilities at a higher level.

Best Practices for Securely Sharing Streaming Data

There are two main types of data sharing that data platforms should support – internal data sharing and 3rd party data sharing.

Internal Data Sharing

Internal data sharing refers to the process of making data accessible to other users, teams, or applications within an organization. Secure internal data sharing capabilities pair nicely with data meshes, where data owners can be any user or team within an organization. These data owners can authorize who has access to data and determine what access permissions each party has.

One of the common ways we see popular data systems, such as Snowflake, provide internal data sharing capabilities is through Role-Based Access Control (RBAC). RBAC is scalable and easy to understand, making it an effective tool for controlling access to data. While access to streaming data has typically been defined with ACLs, an RBAC-based approach would address the scalability issues seen in streaming storage systems today. A data platform that sits on top of streaming storage systems, such as Kafka, and provides an RBAC interface for managing access to the underlying Kafka topics would provide a much more intuitive access control experience for real-time data users.
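As a rough illustration of why RBAC scales better than per-user, per-topic ACLs, the SQL-style grants below show the general shape of the approach. The exact statements vary by platform, and the role, relation, and user names here are hypothetical:

```sql
-- Illustrative RBAC sketch (syntax and names are platform-dependent).
-- Define a role once, grant it read access to a topic-backed relation,
-- then assign the role to users; adding a new analyst is one grant,
-- not one ACL entry per topic.
CREATE ROLE analysts;
GRANT SELECT ON sensor_readings TO ROLE analysts;
GRANT ROLE analysts TO USER alice;
```

Because access is attached to the role rather than to individual (user, topic) pairs, the number of grants grows with the number of roles instead of the product of users and topics.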

3rd Party Data Sharing

3rd party data sharing refers to the process of making data accessible for parties outside of an organization. This enables organizations to collaborate with other organizations without giving them access to their data ecosystem directly. In the current streaming landscape, this kind of data sharing is not natively supported. For instance, only internal data sharing is allowed in Kafka, through ACLs.

Snowflake, for example, enables secure 3rd party data sharing through the concept of shares, allowing data objects in a Snowflake organization to be made shareable with other Snowflake organizations. The data providers in this case can specify which accounts can consume from these data objects. This is just one example out of many possible ways to implement 3rd party sharing. Providing such functionality for streaming data would unlock opportunities for organizations to collaborate with real-time streaming data as well.
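In Snowflake's SQL, setting up a share looks roughly like the sketch below. The database, schema, table, and account identifiers are placeholders; consult Snowflake's documentation for the full syntax and options:

```sql
-- Sketch of Snowflake 3rd party sharing (identifiers are placeholders).
CREATE SHARE sensor_share;
GRANT USAGE ON DATABASE sensor_db TO SHARE sensor_share;
GRANT USAGE ON SCHEMA sensor_db.public TO SHARE sensor_share;
GRANT SELECT ON TABLE sensor_db.public.readings TO SHARE sensor_share;
-- Expose the share to a specific consumer account:
ALTER SHARE sensor_share ADD ACCOUNTS = partner_org.partner_account;
```

The provider retains control: the consumer account can query the shared objects but never receives a copy of the data or access to the rest of the provider's ecosystem.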

Conclusion

Secure sharing for real-time data streams improves data accessibility and enhances collaboration with real-time analytics, while enabling teams to maintain the privacy of their data assets. However, the data sharing capabilities in the current state of real-time streaming data are not up to par with the capabilities seen in the batch world for at-rest data. 

DeltaStream is a stream processing data platform that aims to provide an intuitive way to share streaming data, both internally and with 3rd parties. At DeltaStream, we use RBAC as the approach for access control and provide capabilities for sharing data between organizations. Data Governance and secure data sharing are essential for providing an easy-to-use data ecosystem that allows users to focus on their data products. If you are interested in learning more about DeltaStream, reach out to us for a demo or sign up for a free trial.

24 Jan 2024


The Importance of Data Unification in Real-Time Stream Processing

In our previous blog post, Streaming Data Governance in DeltaStream, we discussed what Data Governance is, why it’s important, and how it works hand-in-hand with Data Unification in the context of real-time streaming data. In this blog, we’ll dive deeper into the concept of Data Unification.

What is Data Unification and why is it Important?

Stream processing solutions have relied on connecting directly to streaming storage systems, such as Apache Kafka and RedPanda, and running transformations on a series of topics. For the stream processing user, this requires an understanding of where and how the data (i.e. topics) are stored. An ops team would then need to authorize access to these data assets. The current stream processing approach is similar to running individual Spark jobs as opposed to using a platform such as Databricks.

The fundamental problems with this approach include the following:

  1. Access to streaming storage is limited to a small number of individuals due to complex and disjointed permissioning.
  2. Data analysts are required to understand which data storage systems and which topics contain the data needed for analysis.
  3. Stream processing workloads are created in silos by different teams. It’s very common for teams to be running their own Flink or other stream processing workloads.
  4. Reusing and sharing new data streams and data assets is difficult without a central platform that enables collaboration.

Data Unification for stream processing is needed to provide an organizational layer on top of the streaming stores that provides a complete view of streaming, relational, and other data stores. Once a unified view is created, it unlocks the ability to seamlessly govern, access, and run stream processing workloads across an organization’s data footprint.

The Current Data Landscape

The technologies that make up the current landscape of real-time streaming data were not built with Data Unification in mind. This isn’t a criticism of these technologies, as they have enabled real-time data products and solve complex distributed systems problems, but more of a statement to point out what’s been missing in the streaming data landscape.

Let’s consider Apache Kafka, which is currently the most popular streaming storage system. Topics are the organizational objects in Kafka that hold data events, and these topics are stored in clusters. Access to these topics is granted through Access Control Lists (ACLs). In most cases, organizations with real-time data will have multiple Kafka clusters or utilize a combination of streaming storage solutions, including RedPanda, Kinesis, and Apache Pulsar. Performing streaming analytics on these data sources requires users to work directly with the storage system APIs or use a stream processing framework, such as Apache Flink. This setup has 3 problems for Data Unification:

  1. Managing access control through ACLs is cumbersome and error-prone. Access must be given per user, per topic. As the number of users and the number of topics grow, these lists can easily become unmanageable, resulting in security risks. Also, ACLs can’t be applied across multiple Kafka clusters, so access control operations are still siloed to the individual clusters or systems.
  2. Organization of data assets (topics) is flat. Without any namespacing capabilities, there is no way to organize or categorize topics. A flat namespace results in challenges with data discovery and data governance.
  3. Connecting directly to source/sink topics for each stream processing job is redundant and error-prone. Writing stream processing applications that interact directly with the storage layer results in a large overhead to configure and maintain permissions. This can easily lead to mistakes in granting the wrong data access, resulting in organizations limiting the set of users that have access to data assets.

In the batch world, products like Databricks and Snowflake address these exact issues. Let’s consider Databricks for example. Databricks’s Unity Catalog provides a hierarchical namespace for Databricks Tables, such that each Table exists within a Catalog and Schema. While the Table is backed by parquet files existing in some S3 location (in the case of using Databricks on AWS), the logical representation of the Table in the Unity Catalog can be namespaced into any Catalog and Schema. This is very similar to the organizational structure of relational databases.

Another similarity to relational databases is Databricks’s support of RBAC on the Unity Catalog. A particular Databricks user or team can be authorized access to a Catalog, Schema, or Table. Databricks also allows users to define SQL queries for data processing, which utilize Apache Spark behind the scenes. Because the Unity Catalog provides a view of all of a user’s Tables, these SQL queries can simply source from or write to Tables in the Unity Catalog. By operating at this logical abstraction layer, the burden of obtaining S3 access, setting up connectivity to S3, and interacting directly with the storage layer is eliminated for users.
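Concretely, the three-level namespace means both queries and grants address a Table by its logical name, never by its S3 path. A sketch with hypothetical catalog, schema, table, and group names:

```sql
-- Address a Table as catalog.schema.table; the parquet files in S3
-- stay hidden behind the logical name (all names are hypothetical).
SELECT order_id, total
FROM main.sales.orders;

-- RBAC on the same hierarchy: grant a group read access to the Table.
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
```

No S3 credentials or bucket paths appear anywhere; access and processing are both expressed against the catalog.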

When compared to a data streaming system like Kafka, it becomes clear that Kafka is more akin to a physical storage system than a platform like Databricks, which offers products built on top of the storage layer. What is missing in real-time data stream processing is a solution that builds on top of streaming storage systems, such as Kafka and Kinesis, and allows users to organize and represent their streaming data in a single unified data catalog.

DeltaStream as the platform for Streaming Data Unification

DeltaStream is a complete stream processing platform to Unify, Process, and Govern all of your streaming data. Taking after the examples in the batch world, DeltaStream utilizes concepts such as data catalogs and RBAC to provide a unified and governed data platform for real-time streaming data.

DeltaStream can connect to any data streaming storage system, including Kafka, RedPanda, and Kinesis, as well as relational databases, such as PostgreSQL. A Store in DeltaStream is used to define the connection to these storage systems. Once a Store has been defined, users can create Relations that are backed by the data objects in each Store. For example, a Relation can be created that is backed by a topic in a Kafka Store. These Relations are organized into DeltaStream’s Streaming Catalog. The Streaming Catalog has two levels of namespacing in the form of Databases and Schemas. Relations can belong to any Database and Schema, and Relations from different storage systems can be co-located into the same Database and Schema. Since Relations in the Streaming Catalog can be backed by data in different storage systems, the Streaming Catalog becomes the singular unified place to view and interact with all of this data.
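A sketch of what defining a Store might look like is shown below. The parameter names and values here are illustrative assumptions, not DeltaStream's exact syntax; see the product documentation for the real options:

```sql
-- Illustrative sketch only; exact Store parameters may differ from
-- DeltaStream's actual syntax (see DeltaStream's documentation).
CREATE STORE kafka_store WITH (
  'type' = KAFKA,
  'uris' = 'broker1.mycompany.com:9092'
);
-- Relations defined against this Store's topics (such as the CREATE
-- STREAM statements earlier in this post) are then organized into
-- Databases and Schemas in the Streaming Catalog.
```

Once the Store encapsulates connectivity, every Relation built on top of it inherits that connection, so queries never repeat broker or credential configuration.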

With a Streaming Catalog, DeltaStream users can write queries which read and write to Relations in the Streaming Catalog. Access to the Streaming Catalog is managed through a robust RBAC model that enables privileges only to the data a user or team needs. With RBAC, users can easily and securely share their data assets. By writing DeltaStream queries using Relations, users can simply focus on their business logic as opposed to dealing with the underlying storage layer connectivity.

Overview of DeltaStream’s Streaming Catalog

Bring Data Unification to your Organization

In this post, we covered what Data Unification is and why it is important in the context of streaming data and stream processing. The part of Data Unification that often gets overlooked is having a unified view in the form of a data catalog. With a unified data catalog, Data Governance and data processing features built on this catalog become simpler and more intuitive to use. This is exactly why DeltaStream not only connects to different data storage systems, but also provides a Streaming Catalog to provide this unified view of data to users.

If you want to learn more about DeltaStream, reach out to us and schedule a demo.

05 Dec 2023


Exploring Pattern Recognition using MATCH_RECOGNIZE

Pattern recognition is a common use case in data processing. Detecting trend reversals, identifying anomalies, and finding sequences in data are all examples of pattern recognition problems. In SQL, Row Pattern Recognition (RPR) became part of the SQL standard in 2016 (ISO/IEC 9075:2016) and introduced the MATCH_RECOGNIZE SQL syntax. Using this syntax, users can write concise SQL queries to solve pattern recognition problems.

While pattern recognition is an important challenge to solve in the batch world, there are also many pattern recognition use cases in the real-time context. That’s why, as a leading streaming platform, DeltaStream supports MATCH_RECOGNIZE in its SQL syntax.

In our previous blog post Analyzing Real-Time NYC Bus Data with DeltaStream, we showed how we could write a SQL query in DeltaStream to solve the pattern recognition use case of detecting bus trips where the delays are significantly increasing. In this blog post we will do a deep dive into that query and explain the purpose and meaning behind each line.

DSQL MATCH_RECOGNIZE Query

Below is the same pattern recognition SQL query from our Analyzing Real-Time NYC Bus Data with DeltaStream post:

  1. CREATE STREAM trips_delay_increasing AS
  2. SELECT
  3. trip,
  4. vehicle,
  5. start_delay,
  6. end_delay,
  7. start_ts,
  8. end_ts,
  9. CAST(FROM_UNIXTIME((start_epoch_secs + end_epoch_secs) / 2) AS TIMESTAMP) AS avg_ts
  10. FROM trip_updates
  11. MATCH_RECOGNIZE(
  12. PARTITION BY trip
  13. ORDER BY "ts"
  14. MEASURES
  15. C.row_timestamp AS row_timestamp,
  16. C.row_key AS row_key,
  17. C.row_metadata AS row_metadata,
  18. C.vehicle AS vehicle,
  19. A.delay AS start_delay,
  20. C.delay AS end_delay,
  21. A.ts AS start_ts,
  22. C.ts AS end_ts,
  23. A.epoch_secs AS start_epoch_secs,
  24. C.epoch_secs AS end_epoch_secs
  25. ONE ROW PER MATCH
  26. AFTER MATCH SKIP TO LAST C
  27. PATTERN (A B C)
  28. DEFINE
  29. A AS delay > 0,
  30. B AS delay > A.delay + 30,
  31. C AS delay > B.delay + 30
  32. ) AS MR WITH ('timestamp'='ts')
  33. QUERY WITH ('state.ttl.millis'='3600000');

Breakdown of MATCH_RECOGNIZE Query

First of all, you can find all of our documentation on MATCH_RECOGNIZE here. In this section, we’ll be breaking down the above query and discussing the thought process behind each line.

In this query, the source Stream is bus trip updates with a field called delay which represents the number of seconds a bus is delayed on its current route. The goal of the query is to output bus trips where the delay is significantly increasing so that we can get a better understanding of when buses will actually arrive at their upcoming stops.

CSAS (lines 1-2)

This query is what we call a CREATE STREAM AS SELECT (CSAS) query, where we are creating a new Stream that will be the output of running the continuous SELECT statement. Running a continuous query will launch a long-lived stream processing job behind the scenes and write the results to the physical storage that is backing the new Stream. As explained in the original blog post Analyzing Real-Time NYC Bus Data with DeltaStream, a Kafka topic is the physical storage layer that is backing this Stream.

Projection (lines 2-9)

The first few lines of the SELECT query are the fields that are being projected for the resulting Stream trips_delay_increasing. These fields are made available by the MEASURES and PARTITION BY clauses in the MATCH_RECOGNIZE statement. The following fields are being projected:

  • trip and vehicle identify the particular bus trip that is experiencing increasing delays
  • start_delay, start_ts, end_delay, and end_ts give insights into the pattern that was matched and how fast delays are increasing
  • avg_ts is the midpoint between start_epoch_secs and end_epoch_secs, which also represents the midpoint of the matched pattern. To evaluate this field, we use built-in functions to convert from epoch seconds, an INTEGER, into a TIMESTAMP. In the original blog post, this field was used in a follow-up query to find the positions of the buses during the time that the delays were increasing.

Source (lines 10-11, 32)

The FROM clause defines the source of data for the query. In this query, we are sourcing from the result of the MATCH_RECOGNIZE clause on the trip_updates Stream. This source also has a WITH clause to specify a timestamp field, ts, which is a field in the trip_updates Stream. By specifying the timestamp field, we are setting the event time of incoming records to be equal to the value of that field, so later on in the ORDER BY clause we can use the correct timestamp for ordering events in our patterns.

MATCH_RECOGNIZE – PARTITION BY (line 12)

The first line in the MATCH_RECOGNIZE query is the optional PARTITION BY clause. This subclause groups the underlying data based on the partitioning expression, and optimizes DeltaStream’s compute resources for parallel processing of the source data. In this query, we are partitioning the events by the trip field, so any detected patterns are unique to a particular bus trip. In other words, we want to know for each particular bus trip if there are increasing delays. The PARTITION BY clause is necessary for our use case because the delays for one bus trip are not relevant to other bus trips. Note that the fields in the PARTITION BY clause are available for projection, as shown in this query by selecting the trip field in the SELECT statement.

MATCH_RECOGNIZE – ORDER BY (line 13)

The ORDER BY clause is required for MATCH_RECOGNIZE. This clause defines the order in which the data should be sorted for each partition before they are evaluated for matched patterns. For our query, the ordering field is ts, so bus trip updates will be ordered in ascending order according to the value of the ts field. One requirement for the ORDER BY subclause is that the ordering field must be the same as the timestamp field for the source Relation, which is set on line 32 (also mentioned in the Source section above).

MATCH_RECOGNIZE – MEASURES (lines 14-24)

The MEASURES clause defines the output schema for the MATCH_RECOGNIZE statement and has a similar meaning as a SELECT clause. The fields specified in the MEASURES subclause are made available to the SELECT clause. The MEASURES subclause and the DEFINE subclause (lines 28-31) are closely related, as the MEASURES subclause is projecting fields from rows defined by the DEFINE subclause. For example, on line 19 we define start_delay as A.delay. In this case, the delay for the row matching A’s definition is being projected as start_delay, whereas on line 20 the delay for the row matching C’s definition is being projected as end_delay. There are 3 fields in the MEASURES subclause that aren’t being used in our query’s SELECT clause. These are the row metadata columns – row_timestamp, row_key, and row_metadata. Since the MATCH_RECOGNIZE operator alters the projection columns of its input Relation, a PATTERN variable must be chosen for these special fields as part of the MEASURES subclause, which we do on lines 15-17. See the MATCH_RECOGNIZE – PATTERN section below for more information on the PATTERN variables.

MATCH_RECOGNIZE – Output Mode (line 25)

ONE ROW PER MATCH is the only supported output mode, which means for a given sequence of events that matches our pattern, we should output one result. So in our query, for a bus trip with significantly increasing delays, we should output one event for this trip.

MATCH_RECOGNIZE – AFTER MATCH strategy (line 26)

The after match strategy defines where to begin looking for the next pattern. In this query, we specify AFTER MATCH SKIP TO LAST C, which means after finding a pattern match, look for the next pattern match starting with the last event of the current match. Other match strategies could inform our query to start looking for the next pattern starting from a different event, such as the event after a pattern match. However, in our case, we want to make sure we are capturing continuous patterns. Specifically for our query, the pattern that we are looking for is 3 trip updates with increases in delay in a row (see MATCH_RECOGNIZE – PATTERN section below). So, if there was a bus trip with 5 trip updates with strictly increasing delay, then there would be 2 results from our MATCH_RECOGNIZE statement with our after match strategy. The first result would be for updates 1-3, and the second result for updates 3-5. The first match’s C would also be the second match’s A in this case. See other after match strategies in our documentation.

MATCH_RECOGNIZE – PATTERN (line 27)

The PATTERN clause specifies the pattern of events that should be considered a match. The PATTERN subclause contains pattern variables, which can each be associated with a quantifier to specify how many rows of that variable to allow in the pattern (see the documentation for more details). In our query, we have a simple pattern of (A B C) without any quantifiers, meaning that in order for a series of events to be considered a match, there needs to be 3 consecutive events with the first one matching A’s definition, the second one matching B’s definition, and the third one matching C’s definition.

MATCH_RECOGNIZE – DEFINE (lines 28-31)

The DEFINE subclause defines each of the pattern variables from the PATTERN subclause. If a variable is not defined then it will evaluate to true for an event contributing to a pattern match. This clause is similar to the WHERE clause in SQL, in that it specifies conditions that an event must meet in order to be considered as one of the pattern variables. When defining these pattern variables, we can access fields from the original events of the source Relation, in our case the trip_updates Stream, and define expressions for evaluation. Aggregation and offset functions can also be applied here. In our query, we are defining our pattern variables based on their delay. For the first event in our pattern, A, we want to see bus trips that are already delayed by 30 seconds. B is defined as an event that is 30 seconds more delayed than A, and similarly C is defined as an event that is 30 seconds more delayed than B. Combined with our PATTERN of (A B C), our query is essentially finding patterns where the delay is increasing by 30 seconds with each trip update for 3 trip updates in a row.
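To make these semantics concrete, here is a small Python sketch – an illustration only, not DeltaStream’s implementation – of the (A B C) pattern with the delay-based definitions and the AFTER MATCH SKIP TO LAST C strategy described above:

```python
def find_increasing_delays(delays, step=30):
    """Emit one result per (A B C) match, where A.delay >= step and each of
    B and C is at least `step` more delayed than the previous row.
    AFTER MATCH SKIP TO LAST C: resume the search at the last matched row."""
    matches = []
    i = 0
    while i + 2 < len(delays):
        a, b, c = delays[i], delays[i + 1], delays[i + 2]
        if a >= step and b >= a + step and c >= b + step:
            matches.append((i, i + 2))
            i += 2  # SKIP TO LAST C -- this match's C may be the next match's A
        else:
            i += 1
    return matches
```

Running this on five strictly increasing delays such as [30, 60, 90, 120, 150] yields two matches – rows 0-2 and rows 2-4 – mirroring the updates 1-3 / 3-5 example in the AFTER MATCH section above.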

QUERY WITH (line 33)

The last line of our query is the QUERY WITH clause. This optional clause is used to set query properties. For our query, we are setting the state.ttl.millis which is used to inform the query when it is safe to purge its state. Another way to limit the state size for MATCH_RECOGNIZE queries is the optional WITHIN subclause that specifies a duration for patterns. Without specifying some way for the query to purge state, the amount of information the query needs to keep in memory will grow indefinitely and the query will eventually fail.

Conclusion

Queries using the MATCH_RECOGNIZE syntax can seem very complex at first glance. This blog post aims to bring clarity to the different parts of a MATCH_RECOGNIZE query by using a specific, easy-to-follow example. Hopefully this post makes pattern recognition queries easier to understand, as MATCH_RECOGNIZE is a powerful operator that can solve many use cases.

If you’re interested in trying out MATCH_RECOGNIZE queries in DeltaStream yourself, sign up for a free trial or schedule a demo with us.

29 Nov 2023

Min Read

Top 3 Challenges of Stream Processing and How to Address Them

Real-time shipment tracking, truck fleet management, real-time training of ML models, and fraud detection are all real use cases that are powered by streaming technologies. As companies try to keep up with the trends of building real-time products, they require a data stack capable of handling and processing streaming data. Unlocking stream processing allows businesses to do advanced data analytics at low latencies, which ultimately allows for faster business insights and more informed decision making. However, many of the popular stream processing frameworks, like Apache Flink, are quite complex and can be challenging to operate. In this blog post we’ll cover 3 of the main challenges with operating a stream processing platform and how to address them.

Challenge 1: Resource Management

Stream processing jobs are long lived so that they can continuously process data, which makes proper resource allocation especially important. On one hand, over-allocating resources will result in overspending, while on the other hand, under-allocating resources will cause jobs to fall behind or fail. The proper allocation for stream processing jobs varies case by case. For instance, jobs with high throughput that need to hold a lot of state should receive more memory, while jobs with complex transformations should receive more CPU. Many workloads also fluctuate, and in those cases resources need to be allocated dynamically to match the workload. Take a stream of data for page visits on a website for example – it’s likely that the website will be visited more during the day when people aren’t asleep. So, stream processing jobs that source from this data should scale up during the day, then scale down for the night.

Solution: Utilization Metrics

Exposing resource utilization metrics is an important first step to tackling the challenges of resource management. Having visibility into the resource utilization trends of your jobs can allow your stream processing platform to have rules for resource allocation. In the simplest case, if your job’s resource usage is stable, you can allocate the amount of resources to match what’s shown in the metrics. For jobs with predictable fluctuations, such as a job sourcing from data that peaks during the day and dips during the night, you can set up a system that adjusts the resource allocation on a schedule. In the most complex case, for jobs with unpredictable resource fluctuations, the best approach is to add an auto-scaling service that can automatically resize compute resources based on resource metrics. Building a platform that exposes the correct metrics, can safely resize your stream processing jobs, and includes an auto-scaling service is necessary to generically support stream processing workloads, but it can take a lot of engineering time and effort. If building these systems is too costly of an engineering investment, you can also consider implementing fully managed third party solutions that can help to partially or fully address these challenges.
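As a sketch of the simplest such scaling rule – the function name, target value, and bounds here are made up for illustration; production auto-scalers (such as the one in the Flink Kubernetes operator) also damp oscillation, weigh backlog growth, and rate-limit rescaling:

```python
import math

def desired_parallelism(current, cpu_utilization, target=0.6, min_p=1, max_p=32):
    """Reactive rule of thumb: resize the job so utilization lands near the target.
    `current` is the job's current parallelism; `cpu_utilization` is a 0..1 metric."""
    desired = math.ceil(current * cpu_utilization / target)
    return max(min_p, min(max_p, desired))
```

For example, a job running hot at 90% CPU on 4 task slots would be sized up to 6, while one idling at 20% on 8 slots would be sized down to 3.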

Challenge 2: Data Heterogeneity

For production use cases, data can come from many different sources and in many different formats. Streaming data coming from sensors, cloud providers, databases, and backend services will differ from each other, which makes them difficult to compare. It is not easy to create a stream processing platform that can handle various data formats and quality levels. The engineering team that supports this platform will need to understand the nuances of different data sources and provide tools/features that can help to make variable data more uniform. However, such a platform creates many possibilities, as businesses can use data from sources that were previously isolated.

Solution: Data Standardization

Standardizing the data across your organization and implementing quality control over the data are the best solutions for dealing with data heterogeneity. Providing data standards and encouraging data schematization at the data’s source is the best practice, but if that’s not possible, stream processing can also help to transform a set of data into a standardized format that can be easily processed and integrated with other data streams. Stream processing platforms that enable users to filter out bad or duplicate data, enrich data with missing fields, and mutate data to fit standardized data formats can mitigate many of the issues of variable data.

One tip when it comes to dealing with data with variable data quality is to provide different configurations for error handling. For many stream processing frameworks, if there is a record that a job doesn’t know how to deserialize or make sense of, the job will simply fail. For data sources that don’t have great data integrity, having options to skip over those records or produce them to a dead-letter queue for further analysis can be better options for your overall application.
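A minimal sketch of these error-handling options, using JSON deserialization in Python as a stand-in for whatever format and framework your records actually use:

```python
import json

def deserialize_batch(raw_records, on_error="dead-letter"):
    """Error-handling policies for records that fail to deserialize:
    'fail' stops the whole job, 'skip' silently drops the record,
    and 'dead-letter' routes it aside for further analysis."""
    good, dead_letter = [], []
    for raw in raw_records:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError:
            if on_error == "fail":
                raise
            if on_error == "dead-letter":
                dead_letter.append(raw)
            # on_error == "skip": drop the record and continue
    return good, dead_letter
```

With the dead-letter policy, a batch containing one malformed record still produces results for the well-formed ones, and the bad record is preserved for inspection instead of failing the job.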

Challenge 3: Streaming Knowledge Gaps

Most stream processing frameworks are highly complex pieces of software. Building up expertise to understand the challenges of streaming vs. the traditional challenges in the batch world takes time. For some organizations, having engineers ramp up on streaming technologies may not be a worthwhile or affordable investment. For organizations that do end up investing in a team of streaming experts, a knowledge gap between the streaming team and other teams often forms. Even with a stream processing platform available to them, product engineers in many cases may not have much exposure to streaming concepts or how to best leverage the streaming tools available to them. In these cases, engineers working on product features or applications may require a lot of back and forth with the team of streaming engineers to realize their projects, or they may not even realize the benefits of adding stream processing to their projects in the first place. These situations can lead to loss of business potential and impact developer velocity.

Solution: Education and Democratization

Two ways to help address these challenges are investing in developer education and democratizing the streaming and stream processing platforms. Setting up regular knowledge sharing sessions and encouraging collaboration between teams can go a long way to reducing knowledge gaps between teams. From the platform perspective, democratizing streaming and stream processing by making these platforms easy to use will lower the barrier to entry for these technologies. Popular stream processing frameworks such as Flink and Spark Streaming have SQL APIs for defining data processing jobs. Exposing SQL to abstract away some of the complexities of the underlying system is one way to make the platform easier to use.

Conclusion

In this blog post we highlighted 3 of the main challenges we’ve seen organizations face when building and operating their own stream processing platforms. Overcoming each of these challenges requires engineering time and effort. While some organizations may be able to spend the up front time and money to build their own in-house data streaming platforms, others may not be able to afford to. This is where fully managed cloud services, such as DeltaStream, can help.

Our aim at DeltaStream is to provide an easy to use data streaming platform to unify, process, and govern your data. Here’s how DeltaStream addresses each of the challenges above:

  1. Resource Management: DeltaStream is a serverless platform, meaning resource scaling and operations are completely taken care of. No cluster sizing or resource provisioning.
  2. Data Heterogeneity: Out of the box, DeltaStream has support for all major data serialization formats – JSON, ProtoBuf, and Avro. DeltaStream also has native support for many data storage systems including Kafka, Kinesis, Delta Lake, Snowflake, and PostgreSQL. DeltaStream’s rich processing capabilities also allow users to filter, enrich, and transform data to mitigate data quality issues.
  3. Streaming Knowledge Gaps: DeltaStream exposes an easy to use SQL interface for interacting with all streaming resources and streaming queries. Tools for sampling data and testing queries are also provided to help users iterate faster.

If you want to learn more about how DeltaStream can enable stream processing in your organization, schedule a demo with us. If you want to try DeltaStream out yourself, sign up for a free trial.

14 Nov 2023

Min Read

Key Components of a Modern Data Stack: An Overview

Data is one of the most valuable assets of any company in the digital age. Drawing insights from data can help companies better understand their market and make well-informed business decisions. However, to unlock the full potential of data, you need a data stack that can collect, store, process, analyze, and act on data efficiently and effectively. When considering a data stack, it’s important to understand what your needs are and how your data requirements may change in the future, then make decisions that are best suited for your particular business. For almost every business, making the most out of data collection and data insights is essential, and you don’t want to find yourself stuck with a legacy data system. Data systems can quickly go out of date and updating to new emerging technologies can be painful. Using modern cloud-based data solutions is one way you can help keep your data stack flexible, scalable, and ultimately save time and money down the road.

In this blog post, we’ll cover the benefits of building a modernized data stack, the main components of a data stack, and how DeltaStream fits into a modern data stack.

What makes a Modern Data Stack and Why You Need One

The modern data stack is built on cloud-based services and low-code or no-code tools. In contrast to legacy systems, modern data stacks are typically built in a distributed manner and aim to improve on the flexibility, scalability, latency, and security of older systems.

How we build a data stack to manage and process data has developed and morphed greatly in the last few years alone. Changing standards, new regulations, and the latest tech have all played a role in how we handle our data. Some of the core properties of a modern data stack include being:

  • Cloud-based: A cloud-based product is something that users can pay for and use over the internet, whereas before users would have to buy and manage their own hardware. The main benefit for users is scalability of their applications and reduced operational maintenance. Modern data solutions offer elastic scalability, meaning users can easily increase or decrease the resources running for their applications by simply adjusting some configuration. More recently, modern data solutions have been offering a serverless option, where users don’t need to worry about resources at all, and automatic scaling is handled behind the scenes by the solution itself. Compare this with legacy systems, where scaling up requires planning, hardware setup, and maintenance of the software and new hardware. Elastic scaling enables users to be flexible with the workloads they plan to run, as resources are always available to use. To ensure availability of their products, modern data solutions typically guarantee SLAs, so users can expect these products to work without worrying about system maintenance or outages. Without the burdens of resource management and system maintenance, users can completely focus on writing their applications, which will improve developer velocity and allow businesses to innovate faster.
  • Performant: In order for modern data solutions to be worthwhile, they need to be performant. Low latency, failure recovery, and security are all requirements for modern data solutions. While legacy systems may have been sufficient to meet the standards of the past, not all of them have evolved to meet the requirements of data workloads today. Many of the current modern data products utilize the latest open-source technologies in their domain. For example, Databricks utilizes and contributes back to the Apache Spark project, Confluent does the same for Apache Kafka, and MongoDB does the same for the MongoDB open source project. For modern data solutions that aren’t powered by open source projects, they typically feature advanced state-of-the-art software and provide benchmarks on how their performance compares to the status quo. Modern data solutions make the latest technologies accessible, enabling businesses to build powerful tools and applications that get the most out of their data.
  • Easy to use: The rapid development and advancement of technologies to solve emerging use cases has led the most powerful tech to also become the most complex and specialized. While these technologies have made it possible to solve many new use cases, they’re oftentimes only accessible to a handful of experts. Because of this, the modern trend is towards building democratized solutions that are low-code and easy to use. That’s why modern solutions abstract away the complexities of their underlying tech and expose easy to use interfaces to users. Consider AWS S3 as an example: users can use the AWS console to store or retrieve files without writing any code. However, behind the scenes, there is a highly complex system that provides strong consistency guarantees, high availability, indefinite scalability, and a low-latency experience for users. Businesses using modern data solutions no longer need to hire or train experts to manage these technologies, which ultimately saves time and money.

Components of a Data Stack

At a high level, a modern data stack consists of four components: collection, storage, processing, and analysis/visualization. While most modern data products loosely fit into a single component, it’s also not uncommon for solutions to span multiple components. For example, data warehouses such as Databricks and Snowflake act as both Data Storage (with data governance) and Data Processing.

  • Data Collection: These are services or systems that ingest data from your data sources into your data platform. This includes data coming from APIs, events from user facing applications, and data from databases. Data can come structured or unstructured and in various data formats.
  • Data Storage: These are systems that hold data for extended periods of time, either to be kept indefinitely or for processing later on. This includes databases like MongoDB, or streaming storage platforms like RedPanda and Kinesis. Data warehouses and data lakes such as Snowflake and AWS S3 can also be considered data storage.
  • Data Processing: These are the systems that perform transformations on your data, such as enrichment, filtering, aggregations, and joins. Many databases and data warehouses have processing capabilities, which is why you may see some products in multiple categories below. For example Snowflake is a solution where users can store their data and run SQL queries to transform the data in their tables. For stream processing use cases, users may look at products like DeltaStream that use frameworks such as Apache Flink to handle their processing.
  • Data Analysis and Visualization: These are the tools that enable you to explore, visualize, and share data insights. These include BI platforms, analytics software, and chart building software. Some popular tools include Tableau and Microsoft Power BI among others. Data analysis and visualization tools can help users draw insights from their data, discover patterns, and communicate findings.

Overview of the current landscape of modern data technologies

How DeltaStream fits into your Modern Data Stack

DeltaStream is a serverless data platform that unifies, governs, and processes real-time data. DeltaStream acts as both an organizational and compute layer on top of users’ data resources, specifically streaming resources. In the modern data stack, DeltaStream fits in the Data Storage and Data Processing layers. In particular, DeltaStream makes it easier to manage and process data for streaming and stream processing applications.

In the Data Storage layer, DeltaStream is a data platform that helps users properly manage their data through unification and governance. By providing proper data management tools, DeltaStream helps make the entire data stack more scalable and easier to use, allowing developers to iterate faster. Specifically, DeltaStream improves on the flat namespace model that most streaming storage systems, such as Kafka, currently have. DeltaStream instead uses a hierarchical namespace model such that data resources exist within a “Database” and “Schema”, and data from different storage systems can exist within the same database and schema in DeltaStream. This way the logical representation of your streaming data is decoupled from the physical storage systems. Then, DeltaStream provides Role Based Access Control (RBAC) on top of this relational organization of data so that roles can be created for certain permissions and users can inherit the roles that they’ve been granted. While DeltaStream isn’t actually storing any raw data itself, providing data unification and data governance to otherwise siloed data makes managing and securing all of a user’s streaming data easier. The graphic below is a good representation of DeltaStream’s federated data governance model.

DeltaStream federated data governance
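As a toy illustration of this model – the roles, grants, and function here are hypothetical, not DeltaStream’s actual API – hierarchical naming plus role-based checks might look like:

```python
# Relations are addressed by database.schema.name, independent of where the
# bytes physically live (a Kafka topic, a Kinesis stream, ...).
GRANTS = {
    "analyst": {("db", "public", "pageviews"): {"SELECT"}},
    "pipeline": {("db", "public", "pageviews"): {"SELECT", "INSERT"}},
}

def is_allowed(role, action, database, schema, relation):
    """Return True if the role has been granted the action on the relation."""
    return action in GRANTS.get(role, {}).get((database, schema, relation), set())
```

The key point the sketch captures is the decoupling: access rules hang off the logical database/schema hierarchy rather than off individual topics in individual storage systems.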

In the Data Processing layer, DeltaStream provides an easy to use SQL interface to define long lived stream processing jobs. DeltaStream is a serverless platform, so fault tolerance, deployment, resource scaling, and operational maintenance of these queries are all taken care of by the DeltaStream platform. Under the hood, DeltaStream leverages Apache Flink, which is a powerful open source stream processing framework that can handle large volumes of data and supports complex data transformations. For DeltaStream users, they get all the benefits of Flink’s powerful stream processing capabilities without having to understand Flink or write any code. With simple SQL queries, users can deploy long-lived stream processing jobs in minutes without having to worry about scaling, operational maintenance, or the complexities of stream processing. This fits well with the scalable, performant, and easy to use principles of modern data stacks.

Conclusion

In this blog post we covered the core components of a data stack and discussed the benefits of investing in a modern data stack. For streaming storage and stream processing, we discussed how DeltaStream fits nicely in the modern data stack for unifying, governing, and processing data. DeltaStream’s integrations with non-streaming databases and data warehouses such as PostgreSQL, Databricks, and Snowflake make it a great option to run alongside these already popular modern data products. If you have any questions about modern data stacks or are interested in learning more about DeltaStream, reach out to us or schedule a demo.

01 Nov 2023

Min Read

A Guide to Repartitioning Kafka Topics Using PARTITION BY in DeltaStream

In this blog post we are going to be highlighting why and how to use the PARTITION BY clause in queries in DeltaStream. While we are going to be focusing on repartitioning Kafka data in this post, any data storage layer that uses a key to partition their data can benefit from PARTITION BY.

We will first cover how Kafka data partitioning works and explain why a Kafka user may need to repartition their data. Then, we will show off how simple it is to repartition Kafka data in DeltaStream using a PARTITION BY query.

How Kafka partitions data within a topic

Kafka is a distributed and highly scalable event logging platform. A topic in Kafka is a category of data representing a single log of records. Kafka is able to achieve its scalability by allowing each topic to have 1 or more partitions. When a particular record is produced to a Kafka topic, Kafka determines which partition that record belongs to and the record is persisted to the broker(s) assigned to that partition. With multiple partitions, writes to a Kafka topic can be handled by multiple brokers, given that the records being produced will be assigned to different partitions.

In Kafka, each record has a key payload and a value payload. In order to determine which partition a record should be produced to, Kafka hashes the record’s key. Thus, all records with the same key will end up in the same topic partition. Records without a key are spread across partitions by the producer (round robin in older Kafka versions, sticky partitioning since Kafka 2.4).

Let’s see how this works with an example. Consider you have a Kafka topic called ‘pageviews’ which is filled with records with the following schema:

```
{
  ts: long,    // timestamp of pageviews event
  uid: string, // user id
  pid: string  // page id
}
```

The topic has the following records (omitting ts for simplicity):

If we partition by the uid field by setting it as the key, then the topic with 3 partitions will look like the following:

If we partition by the pid field by setting it as the key, then the topic with 3 partitions will look like the following:
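Since the record tables above appear as images in the original post, here is a hypothetical Python sketch of the same idea – the records are made up, and CRC32 stands in for Kafka’s murmur2 hash:

```python
from collections import defaultdict
import zlib

def partitioned(records, key_field, num_partitions=3):
    """Group records into partitions by hashing the chosen key field.
    Kafka's default partitioner uses murmur2; CRC32 is a deterministic stand-in."""
    parts = defaultdict(list)
    for record in records:
        p = zlib.crc32(record[key_field].encode("utf-8")) % num_partitions
        parts[p].append(record)
    return dict(parts)

# Made-up pageviews records (ts omitted, as in the post)
records = [
    {"uid": "User_1", "pid": "Page_A"},
    {"uid": "User_1", "pid": "Page_B"},
    {"uid": "User_2", "pid": "Page_A"},
]

by_uid = partitioned(records, "uid")  # both User_1 records share a partition
by_pid = partitioned(records, "pid")  # both Page_A records share a partition
```

Whichever field is chosen as the key determines which records co-locate: keying by uid groups all of a user’s pageviews together, keying by pid groups all views of a page together.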

Why repartition your Kafka topic

The relationship between partitions and consumers in Kafka for a particular application is such that there can be at most 1 consumer per partition, but a consumer can read from multiple partitions. What this means is if our Kafka topic has 3 partitions and our consumer group has 4 consumers, then one of the consumers will sit idle. In the inverse case, if our Kafka topic has 3 partitions and our consumer group has 2 consumers, then one of the consumers will read from 2 partitions while the other reads from only 1 partition.
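A toy round-robin assignment (real Kafka uses pluggable assignors such as range, round robin, and sticky) makes the idle-consumer and uneven-load cases concrete:

```python
def assign_partitions(num_partitions, num_consumers):
    """Each partition goes to exactly one consumer in the group;
    a consumer may own several partitions, or none at all."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment
```

With 3 partitions and 4 consumers, consumer 3 owns nothing and sits idle; with 3 partitions and 2 consumers, one consumer owns two partitions while the other owns one.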

In most cases, users will set up their applications that consume from a Kafka topic to have a number of consumers that is a divisor of the number of partitions so that one consumer won’t be overloaded relative to other consumers. However, data can still be distributed unevenly to different partitions if there is a hotkey or poor partitioning strategy, and repartitioning may be necessary in these cases.

To showcase how data skew can be problematic, let’s look again at our pageviews example. Imagine that half of the records have a pid value of A and we partition by the pid field. In a 3 partition topic, ~50% of the records will be sent to one partition while the other two partitions get ~25% of the records. While data skew might hurt performance and reliability for the Kafka topic itself, it can also make it difficult for downstream applications that consume from this topic. With data skew, one or more consumers will be overloaded with a disproportionate amount of data to process. This can have a direct impact on how well downstream applications perform and result in problems such as many very out of order records, exploding application state sizes, and high latencies (see what Apache Flink has implemented to address some of the problems caused by data skew in sources). By repartitioning your Kafka topic and picking a field with more balanced values as the key to partition your data, data skew can be reduced if not eliminated.
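One simple way to quantify this kind of skew is to measure the fraction of records landing on the busiest partition – a hypothetical sketch, again using CRC32 as a stand-in for Kafka’s hash:

```python
from collections import Counter
import zlib

def busiest_partition_share(keys, num_partitions):
    """Fraction of records that land on the single busiest partition.
    With a well-balanced key distribution this approaches 1 / num_partitions."""
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys)
    return max(counts.values()) / len(keys)
```

If half the records carry the hotkey (like pid A above), the busiest partition is guaranteed to hold at least 50% of the records no matter how many partitions the topic has.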

Another reason you may want to repartition your Kafka data is to align your data according to its context. In the pageviews example, if we choose the partition key to be the uid field, then all data for a particular user id will be sent to the same partition and thus the same Kafka broker. Similarly, if we choose the partition key to be the pid field, then all data for a particular page id will be sent to the same partition and Kafka broker. If our use case is to perform analysis based on users, then it makes more sense to partition our data using uid rather than pid, and downstream applications will actually process data more efficiently.

Suppose we are counting the number of pages a user visits in a certain time window and are partitioning by pid. If the application that aggregates the data has 3 parallel threads to perform the aggregation, each of these threads will be required to read records from all partitions, as the data belonging to a particular uid can exist in many different partitions. If our topic was partitioned by uid instead, then each thread can process data from their own distinct sets of partitions as all data for a particular uid would be available in a single partition. Stream processing systems like Flink and Kafka Streams require some kind of repartition step in their job to handle cases where operator tasks need to process data based on a key and the source Kafka topic is not partitioned by that key. In the case of Flink, the source operators need to map data to the correct aggregation operators over the network. The disk I/O and network involved for stream processing jobs to repartition and shuffle data can become very expensive at scale. By properly partitioning your source data to fit the context, you can avoid this overhead for downstream operations.

PARTITION BY in DeltaStream

Now the question is, how do I repartition or rekey my Kafka topic? In DeltaStream, it’s made simple by PARTITION BY. Given a Kafka topic, you can define a Stream on this topic and write a single PARTITION BY query that rekeys the data and produces the results to a new topic. Let’s see how to repartition a keyless Kafka topic ‘pageviews’.

First, define the ‘pageviews’ Stream on the ‘pageviews’ topic by writing a CREATE STREAM query:

```sql
CREATE STREAM pageviews (
  ts BIGINT, uid VARCHAR, pid VARCHAR
) WITH (
  'topic' = 'pageviews', 'value.format' = 'JSON'
);
```

Next, create a long-running CREATE STREAM AS SELECT (CSAS) query to rekey the ‘pageviews’ Stream using uid as the partition key and output the results to a different Stream:

```sql
CREATE STREAM pageviews_keyed AS
SELECT *
FROM pageviews
PARTITION BY uid;
```

The output Stream, ‘pageviews_keyed’, will be backed by a new topic with the same name. If we PRINT the input ‘pageviews’ topic and the output ‘pageviews_keyed’ topic, we can see the input has no key assigned and the output has the uid value defined as the key.

```
db.public/cc_kafka# PRINT TOPIC pageviews;
| {"ts":46253951,"uid":"User_7","pid":"Page_90"}
| {"ts":46253961,"uid":"User_5","pid":"Page_18"}
| {"ts":46253971,"uid":"User_9","pid":"Page_64"}
```

```
db.public/cc_kafka# PRINT TOPIC pageviews_keyed;
{"uid":"User_6"} | {"ts":46282721,"uid":"User_6","pid":"Page_79"}
{"uid":"User_7"} | {"ts":46282731,"uid":"User_7","pid":"Page_56"}
{"uid":"User_1"} | {"ts":46282741,"uid":"User_1","pid":"Page_70"}
```

As you can see, with a single query, you can repartition your Kafka data using DeltaStream in a matter of minutes. This is one of the many ways we remove barriers to make building streaming applications easy.
If you want to learn more about DeltaStream or try it for yourself, you can request a demo or join our free trial.

24 Oct 2023

Seamless Data Flow: How Integrations Enhance Stream Processing

Data processing systems can be broadly classified into two categories: batch processing and stream processing. Enterprises often use both because they serve different purposes and have distinct advantages, and using them together provides a more comprehensive and flexible approach to data processing and analysis. Stream processing platforms help organizations process data as close to real time as possible, which is important for latency-sensitive use cases such as monitoring IoT sensors, fraud detection, and threat detection. Batch processing systems are useful for a different set of situations – for example, when you want to analyze past data to find patterns or handle large volumes of data from different sources.

Integrating between systems

The need for both stream and batch processing is exactly why you see companies implementing the “lambda architecture”, which brings the best of both worlds together. In this architecture, batch and stream processing systems generally work together to serve multiple use cases. For these enterprises it’s important to be able to:

  1. Move data between these systems seamlessly – ETL
  2. Make data products available to users in their preferred format/platform
  3. Continue using existing systems while leveraging new technologies to improve overall efficiency

Having the ability to process and prepare data for downstream consumption as soon as it arrives is a critical function in the data landscape. For this you need to be able to (1) extract data from source systems for processing and, after processing, (2) load the data into your platform of choice. This is precisely where stream processing platforms and integrations come into the picture: integrations help you extract and load the data, while stream processing handles the transformation.

Every enterprise uses multiple disparate systems, each serving its own purpose, to manage its data. For these enterprises, it is important for data teams to produce data products in real time and serve them across the multiple data platforms in use. This requires a certain level of sophisticated processing, data governance, and integration with commonly used data systems.

Real-world integration uses

Let’s take a look at an example which can help us understand how companies manage their data across multiple platforms and how they process, access and transfer data across them.

Consider a scenario at a bank. You have an incoming stream of transactions from Kafka. This stream is connected to DeltaStream, where you can analyze transactions as they come in, flag them in real time if you notice fraudulent activity based on predefined rules, and alert your users as soon as possible. This is extremely time sensitive, and a stream processing platform is best suited for such use cases.
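As a sketch of the kind of per-record rule a streaming job might evaluate, here is a minimal Python illustration. The thresholds, field names, and rules are invented for this example – they are not DeltaStream APIs or real bank policy:

```python
# Flag a transaction if it is unusually large, or if it comes from a
# different country than the last one seen for the same account.
LARGE_AMOUNT = 10_000.0

def flag(txn: dict, recent_countries: dict) -> bool:
    uid = txn["uid"]
    suspicious = txn["amount"] >= LARGE_AMOUNT or (
        uid in recent_countries and recent_countries[uid] != txn["country"]
    )
    recent_countries[uid] = txn["country"]  # update per-account state
    return suspicious

stream = [
    {"uid": "a1", "amount": 50.0, "country": "US"},
    {"uid": "a1", "amount": 60.0, "country": "BR"},      # sudden country jump
    {"uid": "a2", "amount": 25_000.0, "country": "US"},  # unusually large
]

state: dict = {}
alerts = [t for t in stream if flag(t, state)]
```

The key point is that the job keeps a small piece of state per account and decides on each transaction the moment it arrives, rather than waiting for a nightly batch.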

Other teams within the bank – for example, the marketing team – may want to understand trends based on customer activity and customize how they market products to a given customer. For this, you need to look at transactions going back a week or a month and process them all to generate enough context. Instead of going back to the source system, you can have DeltaStream send all the processed transactions, in the right format, to your batch processing systems using our integrations so that you can:

  1. Have the data ready in the right format & schema for processing 
  2. Reduce the back and forth between multiple platforms as data transits the pipeline just once – reduction in data transfer costs 
  3. Eliminate the duplication of data sets across multiple platforms for processing
  4. Easily manage compliance – for example, by reducing the footprint of your PII data.

By integrating the batch and stream processing systems, the entire pipeline becomes more manageable, and the complexity of managing data across different systems is reduced.

Integrating with DeltaStream

It’s evident that we need integration across different platforms to enable data teams to process and manage data. DeltaStream provides all the ingredients required for such an operation. Our stream processing platform is powered by Flink. DeltaStream’s RBAC enables you to govern and securely share data products. The integrations with Databricks and Snowflake allow data products to be used in the data systems your teams already rely on. With the launch of these integrations, you can do more with your data. To unlock the power of your streaming data, reach out to us or request a free trial!

30 Aug 2023

Choosing Serverless Over Managed Services: When and Why to Make the Switch

Consider a storage service where you store and retrieve files. In on-prem environments, HDFS (Hadoop Distributed File System) has been one of the most common storage platforms. As with any service in an on-prem environment, as a user you are responsible for all aspects of operations. This includes bringing up the HDFS cluster, scaling the cluster up and down based on usage, and dealing with different types of failures, including but not limited to server failures and network failures. Of course, in an ideal situation you would just use the storage service and focus on your business without dealing with the complexity of operating such infrastructure. If you use a cloud environment instead of on-prem, you have the option of using storage services provided by a variety of vendors instead of running HDFS in the cloud yourself. However, there are different ways to provide the same service in the cloud, and they can significantly affect the user experience and ease of use.

A Look at Managed Services

Let’s go back to our storage service and assume we are now in a cloud environment, where we can take advantage of services offered by vendors instead of running our own. One common option is for the provider to offer a managed version of the on-prem service. In this case, the service provider takes the same platform used in the on-prem environment and makes some improvements to run it in the cloud. While this takes away some of the operational burden, the user is still involved in many other aspects of operating such managed services. For the storage service we are considering here, a managed HDFS service would be an example of this approach. When using a “fully managed” cloud HDFS, as a user you still have to make decisions such as provisioning an HDFS cluster through the service. This means you need a good understanding of the amount of storage you will be using, and you must tell the provider if you need more or fewer resources after provisioning the initial cluster. Requiring the user to provide such information often results in confusion; in most cases the initial estimate won’t be accurate, and as usage continues the provisioned resources will need adjusting. You cannot expect a user to know exactly how much storage they will need in the next six months or a year.

In addition, the notion of a cluster brings many limitations. A cluster has a finite amount of resources, and as usage continues those resources are consumed and more are needed. In the case of our “managed” HDFS service, the provisioned storage (disk space) is one of the limited resources: as more and more data is stored in the cluster, the available storage shrinks. At this point, the user has to decide between scaling up the existing cluster by adding more nodes and disk space, or adding a new cluster to accommodate the growing need for storage. To avoid such issues, users may over-provision resources, which in turn results in unused resources and extra cost. Finally, once a cluster is provisioned, the user starts incurring the cost of the whole cluster regardless of whether half or all of its resources are utilized. In short, a managed cloud service in most cases puts the burden of resource provisioning on the user, which requires the user to have a deep understanding of the required resources not just for now, but for the short- and long-term future.

Unlocking Cloud Simplicity: The Serverless Advantage

Now let’s assume that instead of taking the managed service path, a vendor takes a different route and builds a new cloud-only storage service from the ground up, where all the complexity and operations are handled under the hood and users don’t have to think about concerns such as resource provisioning. In the case of storage, object store services such as AWS S3 are great examples of this approach, which is called serverless. As an S3 user, you just set up buckets and folders and read and write your files. There is no need to manage anything, provision any cluster, or worry about having enough disk space or nodes. All operational aspects of the service, including making sure it is always available with the required storage space, are handled by S3. This is a huge win for users, since they can focus on building their applications instead of worrying about provisioning and sizing clusters correctly. With such simplicity, we can see why almost every cloud user chooses an object store such as S3 for their storage needs unless there is an explicit requirement to use something else. S3 is a great example of the superiority of a cloud-native serverless architecture over a “fully managed” version of an on-prem product.

Another benefit of serverless platforms such as S3 over managed services is that S3 enables users to access, organize, and secure their data in one place instead of dealing with multiple clusters. In S3 you can organize your data in buckets and folders, have a unified view of all of your data, and control access to it in one place. The same cannot be said for a managed HDFS service once you have more than one cluster! In that case users have to keep track of which cluster holds which data and how to control access across multiple clusters, which is a much more complex and error-prone process.

Choosing Serverless for Stream Processing

We can make the same argument in favor of serverless offerings over managed services for many other platforms, including stream processing and streaming databases. You can have “fully managed” services where the user has to provision a cluster, along with specifying the amount of resources it will have, before writing any query. Indeed, managed services for stream processing involve far more complexity than the managed HDFS example above. The cluster in the stream processing case is shared among multiple queries, which means imperfect isolation and the possibility of one bad query bringing down the whole cluster, disrupting the other queries even though they were healthy and running without issues. To exacerbate the situation, as you add more streaming queries to the cluster, its resources will eventually all be used, since streaming queries are long-running jobs; you will then need to scale up your cluster or launch a new one to accommodate newer queries. The first option results in a larger cluster with more queries sharing the same resources and interfering with each other.

The second option, on the other hand, results in a growing number of clusters to keep track of, along with which query is running on which cluster. So anytime you have to provision or declare a cluster in a “fully managed” stream processing or streaming database service, you are dealing with a managed service along with the restrictions mentioned above, and many more. Even worse, once you provision a cluster, billing starts regardless of whether zero or several queries are running in it.

We built DeltaStream as a serverless platform because we believe such a service should be as easy to use as S3 – you can think of DeltaStream as the S3 of stream processing. In DeltaStream there is no notion of clusters or provisioning. You just connect your streaming stores, such as Apache Kafka or AWS Kinesis, and you are up and running, ready to run queries. You only pay for queries that are running, and since there is no concept of a cluster, you won’t be charged for idle or underutilized clusters! Focus on building your streaming applications and pipelines and leave the infrastructure to us. Launch as many queries as you want – there is no notion of running out of resources. Your queries run in isolation, and we can scale them up and down independently without them interfering with each other.

DeltaStream is a unified platform that provides stream processing (streaming analytics) and a streaming database in one. You can build streaming pipelines, event-based applications, and always up-to-date materialized views with familiar SQL syntax. In addition, DeltaStream enables you to organize and secure your streaming data across multiple streaming storage systems. If you are using any flavor of Apache Kafka, including Confluent Cloud, AWS MSK, or RedPanda, or AWS Kinesis, you can now try DeltaStream for free by signing up for our free trial.

22 Jun 2023

What is Streaming ETL and how does it differ from Batch ETL?

In today’s data-driven world, organizations are seeking effective and reliable ways to extract insights to make timely decisions based on the ever-increasing volume and velocity of data. ETL (Extract, Transform, Load) is a process where data is extracted from various sources, transformed to fit specific requirements, such as cleaning, formatting, and aggregating, and loaded into a target system or data warehouse. ETL ensures data consistency, quality, and usability, and enables organizations to effectively analyze their data. Traditional batch processing, while effective for certain use cases, falls short in meeting the demands of real-time and event-driven data processing and analysis. This is where streaming ETL emerges as a powerful solution.

Streaming ETL: An Overview

Unlike traditional batch processing, which operates on fixed intervals, streaming ETL operates on a continuous stream of data as records arrive, allowing for real-time analysis. A streaming ETL pipeline begins with a continuous data ingestion phase, in which records are collected from different sources ranging from databases to event streaming platforms like Apache Kafka and Amazon Kinesis. Once data is ingested, it goes through real-time transformation operations for cleaning, normalization, enrichment, and so on. Stream processing frameworks such as Apache Flink, Kafka Streams, ksqlDB, and Apache Spark provide tools and APIs to apply these transformations and prepare the data. The same frameworks support real-time operations such as aggregation and even complex machine learning workloads. Finally, the results of the streaming ETL are delivered to downstream systems and applications for immediate consumption, or stored in data warehouses and data stores for future use.
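The extract-transform-load stages above can be sketched as a tiny Python generator pipeline, where each record flows through all three stages as it arrives. The lists standing in for the source and the sink are placeholders for a real event stream and warehouse:

```python
def extract(source):
    for raw in source:               # continuous ingestion, record by record
        yield raw

def transform(records):
    for rec in records:              # per-record cleaning and normalization
        if rec.get("user"):          # drop malformed events
            yield {"user": rec["user"].lower(), "value": float(rec["value"])}

def load(records, sink):
    for rec in records:              # deliver downstream immediately
        sink.append(rec)

source = [{"user": "Alice", "value": "3"}, {"user": "", "value": "1"},
          {"user": "Bob", "value": "2.5"}]
sink: list = []
load(transform(extract(source)), sink)
```

Because generators are lazy, no stage waits for a complete batch: each record is extracted, transformed, and loaded before the next one is pulled, which is the essential shape of a streaming ETL pipeline.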

Streaming ETL can be applied in various domains, including fraud detection and prevention, real-time analytics and personalization for targeted advertisements, and IoT data processing and monitoring to handle high velocity and volume of data generated by devices such as sensors and smart appliances.

Why should you use Streaming ETL?

Streaming ETL offers numerous advantages in real-time data processing. Here is a list of the most important ones:

  • Streaming ETL provides real-time insights into emerging trends, anomalies, and critical events as they happen. It operates with low latency and ensures the processing results are up to date, reducing the gap between the time data arrives and the time it is processed. This facilitates accurate and timely decision-making, and enables organizations to capitalize on time-sensitive opportunities or address emerging issues promptly.
  • Streaming ETL frameworks are designed to scale horizontally, which is crucial for handling increased data volumes and processing requirements in real-time applications. This elasticity allows resources to scale seamlessly with demand, enabling the system to manage spikes in data volume and velocity without sacrificing performance.

With all its advantages, Streaming ETL also presents some challenges:

  • The streaming ETL process typically introduces additional complexity to data processing. Real-time data ingestion, continuous transformations, and persisting results while maintaining performance and data consistency require careful design and implementation.
  • Streaming ETL pipelines run in a distributed streaming environment, which introduces new challenges. Unless an appropriate delivery guarantee such as exactly-once or at-least-once is used, there is a risk of delay or data loss during the ingestion and delivery stages, due to parallel and asynchronous processing when applying transformations. Ordering events and maintaining data consistency are complex in such situations, and if not handled properly they may impact the accuracy of computations that rely on event order. Fault-tolerance mechanisms such as replication, checkpointing, and backup strategies are essential to prevent data loss and ensure the reliability and correctness of results.
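To make the checkpointing idea concrete, here is a toy Python sketch. It is not how any real framework implements it (Flink, for instance, uses distributed snapshots), but it shows the core idea: periodically persist the state together with the input offset, and on recovery replay only the records after that offset, so nothing is lost and nothing is double-counted.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]
CHECKPOINT_EVERY = 3

def run(events, checkpoint=None, crash_at=None):
    # A checkpoint is (state, offset): the running total plus how far
    # into the input we had consumed when it was taken.
    total, offset = checkpoint if checkpoint else (0, 0)
    last_ckpt = (total, offset)
    for i in range(offset, len(events)):
        if i == crash_at:
            return None, last_ckpt          # simulate a failure mid-stream
        total += events[i]
        if (i + 1) % CHECKPOINT_EVERY == 0:
            last_ckpt = (total, i + 1)      # durable snapshot
    return total, last_ckpt

# First attempt crashes after 5 records; restart from the last checkpoint
# and replay only the records the checkpoint had not yet covered.
result, ckpt = run(events, crash_at=5)
recovered, _ = run(events, checkpoint=ckpt)
```

The recovered run produces the same total as an uninterrupted run would, because the checkpoint pairs the state with the exact position in the input it reflects.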

Using a modern stream processing platform, such as DeltaStream, can help address the above challenges and enable organizations to benefit from all the advantages of Streaming ETL.

Differences between Streaming ETL and Batch ETL

Data processing model: Batch ETL starts by collecting large volumes of data over a period of time and processes these batches at fixed intervals, applying transformations to the entire dataset as a batch. Streaming ETL operates on data as it arrives and continuously processes and transforms individual records or small chunks.
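The difference in processing model can be sketched in a few lines of Python: the batch side recomputes over a complete dataset, while the streaming side keeps running state and updates it per record.

```python
from collections import Counter

events = ["a", "b", "a", "a", "b"]

# Batch model: wait for the whole interval's data, then aggregate once.
batch_counts = Counter(events)

# Streaming model: one long-running computation with mutable state;
# an up-to-date result exists after every individual record.
stream_counts: Counter = Counter()
snapshots = []
for e in events:
    stream_counts[e] += 1
    snapshots.append(dict(stream_counts))
```

Both arrive at the same final counts; the difference is that the streaming side had a usable intermediate answer after each arrival, which is exactly the latency gap described below.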

Latency: Batch ETL introduces inherent latency since data is processed in intervals. This latency normally ranges from minutes to hours. Streaming ETL processes data in real-time and offers low latency. Results are available immediately and are updated continuously.

Data volume and velocity: Batch ETL is well-suited for processing large volumes of data, collected over time. Therefore it is effective when dealing with historical data. Streaming ETL, on the other hand, is designed for high-velocity data streams and is effective for use cases that require immediate processing.

Processing frameworks: Batch ETL typically utilizes frameworks like Apache Hadoop and Apache Spark, along with traditional ETL and data warehouse tools. These frameworks are optimized for processing large volumes of data in parallel, but not necessarily for real-time use cases. Streaming ETL leverages specialized stream processing frameworks such as Apache Flink and Kafka Streams, which are optimized for processing continuous streams of data in real time. With recent changes, some of these frameworks, such as Apache Flink, are now capable of processing batch workloads too. As these efforts continue, the overlap between the frameworks handling these workloads is expected to grow.

Fault tolerance: Batch ETL typically processes large volumes of data at fixed intervals. If a failure occurs, all the data within that batch may be affected, which can leave partially written results. This makes failure recovery challenging, as it involves cleaning up partial results and reprocessing the entire batch. Removing partial results and state and starting a new run is a complicated process that normally involves manual intervention. Reprocessing a batch is time- and resource-intensive, and because producing results for the current batch has fallen behind, it can cause issues for processing the next batch. Moreover, rerunning some tasks can have unexpected side effects that impact the correctness of the final results. Such issues need to be handled properly during a job restart.

Streaming ETL does not involve many jobs run sequentially over time; instead, a single long-running job maintains its state and does incremental computation as data arrives. Therefore, streaming ETL is generally better equipped to handle failures and partial results. Since results are generated incrementally, a failure does not force discarding already generated results and reprocessing the sources from the beginning. Stream processing frameworks provide transaction-like processing, exactly-once semantics, and write-ahead logs, ensuring atomicity and data consistency. They have built-in mechanisms for fault recovery, handle out-of-order events, and ensure end-to-end reliability by leveraging distributed messaging systems.

Choosing Between Streaming ETL and Batch ETL

There are several factors to consider when deciding between streaming ETL and batch ETL for a data processing use case. The most important is the latency requirement: consider the desired latency of insights and actions. If a real-time response is critical, then streaming ETL is the correct choice. The next factor is data volume and velocity, along with the cost of processing. Evaluate the volume of data and the rate at which it arrives. Streaming ETL is capable of processing fast data immediately; however, due to its inherent complexity and higher resource demand, it is more difficult to maintain.

Streaming ETL also introduces challenges related to maintaining the correct order of events, especially in distributed environments. Batch ETL processes data in a much more deterministic and mostly sequential manner, which ensures data consistency and ordered processing. A modern stream processing platform is a viable way to handle these challenges when picking streaming ETL as a solution. Finally, you need to consider how often the data sources in your use case evolve over time, as that can impact the structure of incoming records. Data processing pipelines need to handle schema evolution properly to prevent disruptions and errors: managing schema changes, versioning, and implementing schema inference mechanisms becomes crucial to ensure correctness and reliability. A stream processing framework enables you to address these changes in a streaming ETL pipeline, which is intended to run continuously with no interruption.
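One common way a long-running pipeline tolerates schema evolution is defensive, version-aware parsing. The record shapes and field names below are hypothetical, purely to illustrate the idea of mapping old and new schemas to one output format without stopping the job:

```python
def normalize(rec: dict) -> dict:
    """Map hypothetical v1 records ({"name": ...}) and v2 records
    ({"first_name": ..., "last_name": ...}) to a single output schema."""
    if "first_name" in rec:                      # newer record shape
        name = f'{rec["first_name"]} {rec["last_name"]}'
    else:                                        # original record shape
        name = rec["name"]
    # .get() with a default tolerates fields added after the job started.
    return {"name": name, "plan": rec.get("plan", "free")}

mixed_stream = [
    {"name": "Ada"},                                                # v1
    {"first_name": "Grace", "last_name": "Hopper", "plan": "pro"},  # v2
]
out = [normalize(r) for r in mixed_stream]
```

Both record generations flow through the same continuously running job and come out in one consistent shape, which is the property a streaming ETL pipeline needs when upstream schemas change underneath it.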

Conclusion

Choosing between streaming ETL and batch ETL requires a thorough understanding of the specific requirements and trade-offs of each. While both approaches have their strengths and weaknesses, they are effective in different use cases and for different data processing needs. Streaming ETL offers real-time processing with low latency and high scalability to handle high-velocity data. On the other hand, batch ETL is well-suited for historical analysis, scheduled reporting, and scenarios where near-real-time results are not critical. In this blog post, we covered the specifics as well as the pros and cons of each approach, and explained the important factors to consider when deciding which one to choose for a given data processing use case.

DeltaStream provides a comprehensive stream processing platform to manage, secure, and process all your event streams. You can easily use it to create your streaming ETL solutions; it is easy to operate and scales automatically. You can get more in-depth information about DeltaStream features and use cases in our blog series. If you are ready to try a modern stream processing solution, reach out to our team to schedule a demo and start using the system.
