05 Jun 2023

All About Streaming Data Mesh

The Current State of Streaming Data

The adoption of streaming data technologies has grown exponentially over the past five years. Every industry understands the importance of accessing and understanding data in real time so it can make decisions about its products, services, and customers. The sooner you can access an event – whether it's a customer purchasing an item or an IoT device relaying a reading – the faster you can process and react to it. Making all this happen in real time is where technologies like Apache Kafka, AWS Kinesis, Google Pub/Sub, Apache Flink, and many others come into play. Most of these are available as managed services, making them easy for companies to adopt. For example, Confluent and AWS both offer Apache Kafka as a managed service.

With rapid adoption of these services, many companies now have access to real-time data.

By their very nature, technologies like Apache Kafka interact with many types of data producers and consumers. While data lakes and data warehouses meant storing large amounts of data in centralized repositories, data streaming technologies have always promoted decentralization of services and of the data behind them. As data sets transit these systems, being able to manage, access, and self-provision data is extremely important. This is where Data Mesh comes into the picture.

What is a Data Mesh?

Data Mesh, as first defined by Zhamak Dehghani, is a modern data management architecture that facilitates and promotes a decentralized way to manage, access, and share data. The four core principles of a data mesh are:

  • Domain ownership 
  • Data as a product
  • Self-serve data platform 
  • Federated computational governance

Image from: datamesh-architecture.com

A Data Mesh gives you the agility to work with data that flows in and out of complex systems across multiple organizations. It reduces dependencies and removes bottlenecks that arise when accessing data, thereby improving the organization's ability to respond and make business decisions.

While there are tools and technologies that can help you build a Data Mesh over "data at rest", the same isn't true for "data in motion". This is where DeltaStream comes into the picture: it enables organizations to develop a Data Mesh over and across all of their streaming data, which may span multiple platforms and multiple public cloud providers (Apache Kafka, Confluent Cloud, Kinesis, Pub/Sub, Redpanda, etc.).

Streaming Data Mesh with DeltaStream

What we have built at DeltaStream is an extremely powerful framework that provides a single platform to access your operational and analytical data, in real time, across your entire organization in a decentralized manner. No more expensive pipelines, duplicated data, or central teams to manage data lakes and data warehouses.

Let's revisit the core tenets of Data Mesh and how DeltaStream helps you achieve each of them and more.

Domain Ownership

  • In DeltaStream, data boundaries are clearly defined using our namespaces and RBAC. Each stream is isolated with strict access control, and any query against that stream inherits the same permissions and boundaries. This way, your core data sets remain under your purview and you control what others can access. Queries are isolated at the infrastructure level, which means they scale independently, reducing cost and operational complexity. With all these controls in place, the core objective of decentralized ownership of data streams is achieved while keeping your data assets secure.
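As a loose illustration of this model, here is a hypothetical sketch of how a domain team might scope a stream to its own namespace and grant read-only access to another team. The object names are made up and the exact DeltaStream DDL may differ; treat this as a sketch of the pattern, not a reference.

```sql
-- Hypothetical sketch: the payments domain owns its namespace and its streams,
-- and decides who may read them. Names and exact syntax are illustrative.
CREATE DATABASE payments;
CREATE SCHEMA payments.public;

-- Grant another team read-only access to a single stream, nothing more.
CREATE ROLE analytics_reader;
GRANT USAGE ON DATABASE payments TO ROLE analytics_reader;
GRANT SELECT ON payments.public.transactions TO ROLE analytics_reader;
```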

Data as a Product

  • Making data discoverable and usable while maintaining its quality is central to serving data as a product to your internal teams. Being able to do this as close to the source as possible is extremely beneficial and in line with the data mesh philosophy of domain owners owning the quality of their data. This is exactly what you can achieve with DeltaStream. Using our powerful SQL interface, you can quickly alter and enrich your data and get it ready to use in real time. As your data models change, DeltaStream's data catalog, along with the schema registry, can track your data's evolution and help you iterate on your data assets as you grow and evolve.
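For instance, a domain team could publish a cleaned and enriched version of its raw stream as the "product" that other teams consume. The sketch below is illustrative only; the stream names, columns, and exact DeltaStream SQL syntax are assumptions.

```sql
-- Illustrative sketch: turn a raw orders stream into a ready-to-use data product.
CREATE STREAM enriched_orders AS
SELECT
  o.order_id,
  o.amount,
  c.region,          -- enrich raw orders with customer attributes
  o.event_ts
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.amount > 0;  -- simple quality rule enforced at the source
```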

Self-serve Data Platform

  • Once you achieve domain ownership and data as a product, self-service follows. The main objective of self-service is to remove the need for a centralized team to coordinate data distribution across the entire company. Leveraging DeltaStream's catalog, you can make your high-quality data streams discoverable, and by combining this with our RBAC you can securely share them with both internal and external users. This means data continuously flows from its source systems to end users without ever needing to be stored, extracted, or loaded. This is powerful: you get to securely democratize streaming data across your company, while also being able to share it with users outside your organization.

Federated Computational Governance

  • With everything decentralized – from data ownership to data delivery – governance becomes a very important requirement. It ensures that data originating in a particular domain is consumable in any part of the organization. Schemas and a schema registry go a long way toward ensuring that. Data, just like the organization it serves, changes and evolves over time, so integrating your data streams with a schema registry is critical to maintaining a common language. Interoperability is also a key component of governance, and DeltaStream's ability to operate across multiple streaming stores is a great capability here.

Conclusion

Decentralization of data ownership and data delivery is critical for organizations that want to stay nimble. As Data Mesh gains traction, tools and technologies are readily available to implement it over your data at rest, but the same can't be said for streaming data. This is what DeltaStream provides: with tooling built around its core processing framework, it enables companies to implement a data mesh on top of streaming data.

As streaming data continues its rapid growth, it is extremely important for organizations to set up the right foundations to maximize the value of their real-time data. While architectures like Data Mesh provide the right framework, platforms like DeltaStream bring it all together.

We offer a free trial of DeltaStream that you can sign up for to start using DeltaStream today. If you have any questions, please feel free to reach out to us.

10 May 2023

Why Apache Flink is the Industry Gold Standard

Apache Flink is a distributed data processing engine that is used for both batch processing and stream processing. Although Flink supports both batch and streaming processing modes, it is first and foremost a stream processing engine used for real-time computing. By processing data on the fly as a stream, Flink and other stream processing engines are able to achieve extremely low latencies. Flink in particular is a highly scalable system, capable of ingesting high throughput data to perform stateful stream processing jobs while also guaranteeing exactly-once processing and fault tolerance. Initially released in 2011, Flink has grown into one of the largest open source projects. Large and reputable companies such as Alibaba, Amazon, Capital One, Uber, Lyft, Yelp, and others have relied on Flink to provide real-time data processing to their organizations.

Before stream processing, companies relied on batch jobs for their big data needs, where jobs perform computations only on a finite set of data. However, many use cases don't naturally fit a batch model. Consider fraud detection, for example, where a bank may want to act when anomalous credit card behavior is detected. In the batch world, the bank would collect credit card transactions over a period of time, then run a batch job over that data to determine whether any transactions are potentially fraudulent. Since the batch job runs after the data collection, it might be hours before a transaction is marked as fraudulent, depending on how frequently the job is scheduled. In the streaming world, transactions can be processed in real time and the latency to action is far lower. Being able to freeze credit card accounts in real time can save the bank time and money. You can read more about the distinction between streaming and batch in our other blog post.
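As a rough illustration, the streaming version of such a check can be written as a continuous windowed aggregation. The sketch below uses Flink-style streaming SQL with made-up table and column names and an arbitrary threshold; a real fraud model would of course be more sophisticated.

```sql
-- Continuously count transactions per card in 10-minute windows and surface
-- cards with an unusually high count. Names and threshold are illustrative.
SELECT
  card_id,
  TUMBLE_START(txn_time, INTERVAL '10' MINUTE) AS window_start,
  COUNT(*) AS txn_count
FROM card_transactions
GROUP BY card_id, TUMBLE(txn_time, INTERVAL '10' MINUTE)
HAVING COUNT(*) > 5;
```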

Without Apache Flink or another stream processing engine, you could still write code to read from a streaming message broker (e.g. Kafka, Kinesis, RabbitMQ) and do stream processing from scratch. However, with this approach you would miss out on many of the features that modern stream processing engines give you out of the box, such as fault tolerance, the ability to perform stateful queries, and APIs that vastly simplify processing logic. There are many stream processing frameworks in addition to Apache Flink, such as Kafka Streams, ksqlDB, Apache Spark (Spark Streaming), Apache Storm, Apache Samza, and dozens of others. Each has its own pros and cons, and comparing them could easily become its own blog post, but Apache Flink has become an industry favorite because of these key features:

  • Flink’s design is truly based on processing streams instead of relying on micro-batching
  • Flink’s checkpointing system and exactly-once guarantees
  • Flink’s ability to handle large throughputs of data from many different types of data sources at a low latency
  • Flink's SQL API, which leverages Apache Calcite, allows users to define their Flink jobs using SQL, making it easier to build applications
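To give a feel for that last point, here is a minimal Flink SQL sketch that declares a Kafka-backed table and a continuous query over it. The topic names, fields, and connector options are illustrative and depend on your setup.

```sql
-- Declare a table over an existing Kafka topic (illustrative names and options).
CREATE TABLE pageviews (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'pageviews',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- A continuous job: copy a filtered view of the stream into another table,
-- assumed to be defined the same way over a different topic.
INSERT INTO checkout_pageviews
SELECT user_id, url, ts
FROM pageviews
WHERE url LIKE '%/checkout%';
```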

Flink is designed to be a highly scalable distributed system, capable of being deployed with most common setups, that can perform stateful computations at in-memory speed. To get a high level understanding about Flink’s design, we can look further into the Flink ecosystem and execution model.

The Flink ecosystem can be broken down into 4 layers:

  • Storage: Flink doesn’t have its own storage layer and instead relies on a number of connectors to connect to data. These include but are not limited to HDFS, AWS S3, Apache Kafka, AWS Kinesis, and SQL databases. You can similarly use managed products such as Amazon MSK, Confluent Cloud, and Redpanda among others.
  • Deploy: Flink is a distributed system and integrates seamlessly with resource managers like YARN and Kubernetes, but it can also be deployed in standalone mode or in a single JVM.
  • Runtime: Flink’s runtime layer is where the data processing happens. Applications can be split into tasks that are parallelizable and thus highly scalable across many cluster nodes. With Flink’s incremental and asynchronous checkpointing algorithm, Flink is also able to provide fault tolerance for applications with large state and provide exactly-once processing guarantees while minimizing the effect on processing latency.
  • APIs and Libraries: The APIs and libraries layer makes it easy to work with Flink's runtime. The DataStream API can be used to declare stream processing jobs and offers operations such as windowing, state updates, and transformations. The Table and SQL APIs can be used for batch or stream processing. The SQL API lets users write SQL queries to define their Flink jobs, while the Table API lets users call relational operations on tables defined by their schemas.

In the execution architecture, there is a client and the Flink cluster. Two kinds of nodes make up the Flink cluster: the Job manager and the Task manager. There is always one Job manager and any number of Task managers.

  • Client: Responsible for generating the Flink job graph and submitting it to the Job manager.
  • Job manager: Receives the job graph from the Client then schedules and oversees the execution of the job’s tasks on the Task managers.
  • Task manager: Responsible for executing all the tasks assigned to it by the Job manager and reporting status updates back to the Job manager.

In Figure 1, you’ll notice that storage and deployment are somewhat detached from the Flink runtime and APIs. The benefit of designing Flink this way is that Flink can be deployed in any number of ways and connect to any number of storage systems, making it a good fit for almost any tech stack. In Figure 2, you’ll notice that there can be multiple Task managers and they can be scaled horizontally, making it easy to add resources to run jobs at scale.

Apache Flink has become the most popular real-time computing engine and the industry gold standard for many reasons, but its ability to process large volumes of real-time data with low latency is its main draw. Other features include the following:

  • Stateful processing: Flink is a stateful stream processor, allowing for transformations such as aggregations and joins of continuous data streams.
  • Scalability: Flink can scale up to thousands of Task manager nodes to handle virtually any workload without sacrificing latency.
  • Fault tolerance: Flink's state-of-the-art checkpointing system and distributed runtime make it both fault tolerant and highly available. Checkpoints are used to save the application state, so when your application inevitably encounters a failure, it can be resumed from the latest checkpoint. For applications that use transactional sinks, end-to-end exactly-once processing can be achieved. Also, since Flink is compatible with most resource management systems like YARN and Kubernetes, crashed nodes are automatically restarted. Flink also offers high availability setups based on Apache ZooKeeper to help prevent failure states.
  • Time mechanisms: Flink has a concept called "watermarks" for working with event time, representing the application's progress with regard to time. Different watermark strategies allow users to configure how they want to handle out-of-order or late data (see the SQL sketch after this list). Watermarks are important for the system to understand when it is safe to close windows and emit results for stateful operations.
  • Connector ecosystem: Since Flink uses external systems for storage, Flink has a rich ecosystem of connectors. Wherever your data is, Flink likely already has a library to connect to it. Popular existing connectors include connectors for Apache Kafka, AWS Kinesis, Apache Pulsar, JDBC connector for databases, Elasticsearch, and Apache Cassandra. If a connector you need doesn’t exist, Flink provides APIs to help you create a new connector.
  • Unified batch and streaming: Flink is both a batch and stream processor, making it an ideal candidate for applications that need both streaming and batch processing. With Flink, you can use the same application logic for both cases reducing the headache of managing two different technologies.
  • Large open source community: In 2022, Flink had over 1,600 contributions and the GitHub project currently has over 21,000 stars. To put this in perspective, Apache Kafka currently has 24,800 stars and Apache Hadoop currently has 13,500 stars. Having a large and active open source community is extremely valuable for fixing bugs and staying up to date with the latest tech requirements.
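To make the watermark idea from the list above concrete, here is a small Flink SQL sketch of a source table with a watermark strategy. The field names, connector options, and the 5-second bound are illustrative choices, not a recommendation.

```sql
CREATE TABLE orders (
  order_id STRING,
  amount   DOUBLE,
  order_ts TIMESTAMP(3),
  -- Tolerate up to 5 seconds of out-of-order events before windows on
  -- order_ts are considered complete.
  WATERMARK FOR order_ts AS order_ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
```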

Conclusion

In this blog post, we've introduced Apache Flink at a high level and highlighted some of its features. While we suggest that those interested in adding stream processing to their organizations survey the different stream processing frameworks available, we're also confident that Flink would be a top choice for almost any use case. At DeltaStream, although we are not tied to any particular compute engine, we have chosen Flink for the reasons explained above. DeltaStream is a unified, serverless stream processing platform that takes the pain out of managing your streaming stores and launching stream processing queries. If you're interested in learning more, get a free trial of DeltaStream.

17 Apr 2023

Batch vs. Stream Processing: How to Choose

Rapid growth in the volume, velocity and variety of data has made data processing a critical component of modern business operations. Batch processing and stream processing are two major and widely used methods to perform data processing. In this blog, we will explain batch processing and stream processing and will go over their differences. Moreover, we will explore the pros and cons of each, and discuss the importance of choosing the right approach for your use cases. If you are looking for Streaming ETL vs. Batch ETL, we have a blog on that too.

What is Batch Processing?

Batch processing is a method for processing large volumes of fully stored data. Depending on the specifics of the use case, a batch processing pipeline typically consists of multiple steps, including data ingestion, data preprocessing, data analysis, and data storage. Data goes through various transformations for cleaning, normalization, and enrichment, and different tools and frameworks can be used for analysis and value extraction. The final processed data is stored in a new location and format, such as a database, a data warehouse, a file, or a report.
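As a simple example of the kind of transformation step described above, a batch job might summarize a full day of raw records into a reporting table once the data is complete. The sketch below is plain SQL with illustrative table and column names and an arbitrary date.

```sql
-- Summarize one fully collected day of orders into a reporting table.
INSERT INTO daily_sales_summary
SELECT
  store_id,
  CAST(order_ts AS DATE) AS sales_date,
  SUM(amount)            AS total_sales,
  COUNT(*)               AS order_count
FROM raw_orders
WHERE CAST(order_ts AS DATE) = DATE '2023-04-16'
GROUP BY store_id, CAST(order_ts AS DATE);
```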

Batch processing is used in a wide range of applications where large volumes of data need to be processed efficiently, cost-effectively, and with high throughput. Examples include ETL processes for data aggregation, log processing, and training predictive ML models.

Pros and cons of batch processing

Batch processing has its advantages and disadvantages. Here are some of its pros and cons.
Pros:

  • Batch processing is efficient and cost-effective for applications with large volumes of already stored data.
  • The process is predictable, reliable and repeatable which makes it easier to schedule, maintain, and recover from failures.
  • Batch processing is scalable and can handle workloads with complex transformations on large volumes of data.

Cons:

  • Batch processing has high latency, and processing time can be long depending on the volume of data and the complexity of the workload.
  • Batch processing is generally not interactive while it is running: users need to wait until the whole job completes before they can access the results. This means batch processing does not provide real-time insight into data, which can be a disadvantage in applications where real-time access to (partial) results is necessary.

What is Stream Processing?

Stream processing is a data processing method where processing is done in real time as data is being generated and ingested. It involves analyzing and processing continuous streams of data in order to extract insights and information from them. A stream processing pipeline typically consists of several phases including data ingestion, data processing, data storage, and reporting. While the steps in a stream processing pipeline may look similar to those in batch processing, these two methods are significantly different from each other.

Stream processing pipelines are designed to have low latency and to process data continuously. When it comes to the complexity of transformations, batch processing normally involves more complex and resource-intensive transformations that run over large, discrete batches of data. In contrast, stream processing pipelines run simpler transformations on smaller chunks of data as soon as the data arrives. Given that stream processing is optimized for continuous processing with low latency, it is suitable for interactive use cases where users need immediate feedback. Examples include fraud detection and online advertising.
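To contrast with the batch example earlier, a streaming pipeline typically applies a lighter transformation continuously, emitting results as each event arrives. The sketch below uses a generic streaming SQL dialect with illustrative names and an arbitrary threshold.

```sql
-- Continuously flag large orders the moment they occur, instead of waiting
-- for an end-of-day batch run. Names and threshold are illustrative.
SELECT
  store_id,
  order_id,
  amount,
  order_ts
FROM raw_orders_stream
WHERE amount > 1000;
```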

Pros and cons of stream processing

Here is a list of pros and cons for stream processing.
Pros:

  • Stream processing allows for real-time processing of data with low latency, which means results are available quickly to serve use cases and applications with real-time processing demands.
  • Stream processing pipelines are flexible and can be quickly adapted to changes in data or processing needs.

Cons:

  • Stream processing pipelines tend to be more expensive because they must handle long-running jobs and process data in real time, which means they need more powerful hardware and faster processing capabilities. However, with serverless platforms such as DeltaStream, stream processing can be simplified significantly.
  • Stream processing pipelines require more maintenance and tuning, because any change or error in the data, as it is processed in real time, must be addressed immediately. They also need more frequent updates and modifications to adapt to changes in the workload. Serverless stream processing platforms like DeltaStream simplify this aspect of stream processing as well.

Choosing between Stream Processing and Batch Processing

There are several factors to consider when deciding whether to use stream processing or batch processing for your use case. First, consider the nature and specifics of the application and its data. If the application is highly time-sensitive and requires real-time analysis and immediate response (low latency), then stream processing is the option to choose. On the other hand, if the application can rely on offline and periodic processing of large amounts of data, then batch processing is more appropriate.

The second factor is the volume and velocity of the data. Stream processing is suitable for processing continuous streams of data arriving at high velocity. However, if the data volume is too high to process in real time and it needs extensive, complex processing, batch processing is most likely a better choice, though it comes at the cost of sacrificing low latency. Scaling stream processing up to thousands of workers to handle petabytes of data is very expensive and complex. Finally, consider the cost and resource constraints of the application. Stream processing pipelines are typically more expensive to set up, maintain, and tune. Batch processing pipelines tend to be more cost-effective and scalable, as long as no major changes occur in the workload or the nature of the data.

Conclusion

Batch processing and stream processing are two widely used data processing methods for data-intensive applications. Choosing the right approach for a use case depends on the characteristics of the data being processed, the complexity of the workload, the frequency of data and workload changes, and the use case's latency and cost requirements. It is important to evaluate your data processing requirements carefully to determine which approach is the best fit. Moreover, keep an eye on emerging technologies and solutions in the data processing space, as they may provide new options and capabilities for your applications. While batch processing technologies have been around for several decades, stream processing technologies have seen significant growth and innovation in recent years. Besides open-source stream processing frameworks such as Apache Flink, which have gained widespread adoption, cloud-based stream processing services are also emerging that aim to provide easy-to-use and scalable data processing capabilities.

DeltaStream is a unified serverless stream processing platform to manage, secure, and process all your event streams. It provides a comprehensive stream processing platform that is easy to use, easy to operate, and scales automatically. You can get more in-depth information about DeltaStream's features and use cases by checking our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

22 Mar 2023

When to use Streaming Analytics vs Streaming Databases

Event streaming infrastructure has become an important part of the modern data stack in many organizations, enabling them to rapidly ingest and process streaming data to gain real-time insight and act on it. Streaming storage systems such as Apache Kafka provide a storage layer to ingest data streams and make them available in real time. Streaming analytics (stream processing) systems and streaming databases are two platforms that enable users to gain real-time insight from their data streams. However, each of these systems is suitable for different use cases, and one system alone may not be the right choice for all of your stream processing needs. In this blog, we provide a brief introduction to streaming analytics (stream processing) and streaming database systems and discuss how they work and which use cases each is suitable for. We then show how DeltaStream provides the capabilities of both systems in one unified platform, along with capabilities unique to DeltaStream.

What is Streaming Analytics? 

Streaming analytics, also known as stream processing, continuously processes and analyzes event streams as they arrive. Streaming analytics systems do not have their own storage; instead they provide a compute layer on top of other storage systems. These systems usually read data streams from streaming storage systems such as Apache Kafka, process them, and write the results back to the streaming storage system. Streaming analytics systems provide both stateless (e.g., filtering, projection, transformation) and stateful (e.g., aggregation, windowed aggregation, join) processing capabilities on data streams. Many of them also provide a SQL interface where users can express their processing logic in familiar SQL statements. Apache Flink, Kafka Streams/Confluent's ksqlDB, and Spark Streaming are some of the commonly used streaming analytics (stream processing) platforms. The following figure depicts a high-level overview of a streaming analytics system and its relation to streaming stores.
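To make the stateless/stateful distinction concrete, the sketch below shows a stateless filter and a stateful windowed aggregation over the same stream, in a Flink SQL-style dialect. Table and column names are illustrative.

```sql
-- Stateless: keep only error events (no state kept between events).
SELECT * FROM app_logs WHERE level = 'ERROR';

-- Stateful: count errors per service over one-minute tumbling windows;
-- the engine maintains window state until each window closes.
SELECT
  service,
  TUMBLE_START(log_ts, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS error_count
FROM app_logs
WHERE level = 'ERROR'
GROUP BY service, TUMBLE(log_ts, INTERVAL '1' MINUTE);
```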

Streaming analytics systems are used in a wide variety of use cases. Building real-time data pipelines is one of the main ones, where data streams are ingested from source stores, processed, and written into destination stores. Streaming data integration, real-time ML pipelines, and real-time data quality assurance are just a few solutions that can be built with real-time data pipelines powered by streaming analytics systems. Other use cases include data mesh deployments and event-driven microservices.

An important characteristic of streaming analytics systems is their focus on processing. In these systems, queries (also known as jobs) are a fundamental part of the platform. They provide fine-grained control over these queries: users can start, pause, and terminate queries individually without having to make any changes to the sources and sinks of those queries. This focus on queries is one of the main differences between streaming analytics and streaming databases, as we explore in the next section.

What is a Streaming Database?

Streaming databases combine stream processing and the serving of results in one system. Unlike streaming analytics systems, which are compute-only platforms without their own storage, streaming databases include their own storage, where the results of continuous computations are stored and served to user applications. Materialized views are one of the foundational concepts in streaming databases: they represent the results of continuous queries and are updated incrementally as input data arrives. These materialized views can then be queried through SQL.
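The pattern looks roughly like the sketch below, written in a generic streaming-database dialect with illustrative names: the view is maintained incrementally as events arrive, and applications query it like an ordinary table.

```sql
-- Incrementally maintained aggregate over an incoming stream.
CREATE MATERIALIZED VIEW orders_per_region AS
SELECT region, COUNT(*) AS order_count
FROM orders_stream
GROUP BY region;

-- Applications can then serve fresh results with point-in-time lookups.
SELECT * FROM orders_per_region WHERE region = 'EMEA';
```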

The following figure depicts a high level view of a streaming database architecture.

The role of queries is different in streaming databases. As mentioned above, continuous queries are a fundamental part of streaming analytics systems; streaming databases, however, don't provide fine-grained control over queries. In streaming databases, the life cycle of a continuous query is attached to its materialized view, and queries are started and terminated implicitly by creating and dropping materialized views. This significantly limits users' ability to manage such queries, which is an essential part of building and managing streaming data pipelines. Therefore, streaming analytics systems are much better suited to building real-time data pipelines than streaming databases.

While streaming databases may not be the right tool for building real-time streaming pipelines, the incrementally updated materialized views in streaming databases can be used to serve real-time data to power user-facing dashboards and applications. They can also be used to build real-time feature stores for machine learning use cases.

Can we have both? With DeltaStream, yes

While streaming analytics systems are ideal for building real-time data pipelines, streaming databases are great at building incrementally updated materialized views and serving them to applications. Customers are left having to choose between the two systems or, even worse, buy multiple solutions. The question is: can one system reduce complexity and provide a comprehensive, powerful platform for building real-time applications on streaming data? The answer, as you expected, is yes!

DeltaStream does this seamlessly as a serverless service. As the following diagram shows, DeltaStream provides capabilities of both streaming analytics and streaming databases along with additional features to provide a complete platform to manage, process, secure and share streaming data.

Using DeltaStream's SQL interface, you can build streaming applications and pipelines. You can also build continuously updated materialized views and serve them to applications and dashboards in real time. DeltaStream also provides fine-grained control over continuous queries: you can start, restart, and terminate each query individually. In addition to compute capabilities on streaming data, DeltaStream provides a familiar hierarchical namespace for your streams. In DeltaStream, you declare relations such as streams and changelogs over your streaming data (for example, Apache Kafka topics) and organize them into databases and schemas the same way you organize tables in traditional databases. You can then control access to your streaming data with the familiar role-based access control (RBAC) mechanism you would use in a traditional relational database. This significantly simplifies using DeltaStream, since you don't need to learn a new security model for your streaming data. Finally, DeltaStream enables users to securely share real-time data with internal and external users.
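As a hypothetical sketch of how this namespacing and access control fit together (the exact DeltaStream syntax may differ, and all names and options here are illustrative):

```sql
-- Organize streaming data the way you would organize tables.
CREATE DATABASE retail;
CREATE SCHEMA retail.orders;

-- Declare a stream over an existing Kafka topic inside that namespace.
-- The topic name, columns, and WITH options are illustrative assumptions.
CREATE STREAM retail.orders.order_events (
  order_id VARCHAR,
  amount   DOUBLE,
  event_ts TIMESTAMP
) WITH ('topic' = 'order_events', 'value.format' = 'json');

-- Share it with another team using familiar role-based access control.
GRANT SELECT ON retail.orders.order_events TO ROLE marketing_analyst;
```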

So if you are using streaming storage systems such as Apache Kafka (Confluent Cloud, AWS MSK, or any other Apache Kafka service) or AWS Kinesis, you should check out DeltaStream as the platform for processing, organizing, and securing your streaming data. You can schedule a demo to see all these capabilities in the context of a real-world streaming application.
