24 Oct 2023

Seamless Data Flow: How Integrations Enhance Stream Processing

Data processing systems can be broadly classified into two categories: batch processing and stream processing. Enterprises often use both because they serve different purposes and have distinct advantages, and using them together provides a more comprehensive and flexible approach to data processing and analysis. Stream processing platforms help organizations process data as close to real time as possible, which is important for latency-sensitive use cases such as monitoring IoT sensors, fraud detection and threat detection. Batch processing systems are useful for a different set of situations – for example, when you want to analyze historical data to find patterns or consolidate large volumes of data from different sources.

Integrating between systems

The need for both stream and batch processing is exactly why you see companies implementing a “lambda architecture”, which brings the best of both worlds together. In this architecture, batch processing and stream processing systems generally work together to serve multiple use cases. For these enterprises it is important to be able to:

  1. Move data between these systems seamlessly – ETL
  2. Make data products available to users in their preferred format/platform
  3. Continue using existing systems while leveraging new technologies to improve overall efficiency

Having the ability to process and prepare data as soon as it arrives for downstream consumption is a critical function in the data landscape. For this you need to be able to (1) extract data from source systems for processing and, after processing, (2) load the data into your platform of choice. This is precisely where stream processing platforms and integrations come into the picture: while integrations help you extract and load data, stream processing helps you with the transformation.

Every enterprise uses multiple disparate systems, each serving its own purpose, to manage their data. For these enterprises, it is important for data teams to produce data products in real-time and serve them across multiple data platforms that are in use. This will require a certain level of sophisticated processing, data governance and integration with commonly used data systems. 

Real-world integration uses

Let’s take a look at an example that helps illustrate how companies manage their data across multiple platforms and how they process, access and transfer data between them.

Consider a scenario at a bank. You have an incoming stream of transactions from Kafka. This stream of transactions is connected to DeltaStream, where you can analyze transactions as they come in, flag them in real time if you notice fraudulent activity based on predefined rules, and alert your users as soon as possible. This is extremely time sensitive, and a stream processing platform is best suited for such use cases.
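As a rough sketch of what such a rule could look like in a streaming SQL dialect – the stream, column names and threshold below are purely illustrative, not exact DeltaStream syntax:

```sql
-- Hypothetical continuous query over a Kafka-backed stream of card transactions:
-- emit a flagged stream whenever a simple rule fires.
CREATE STREAM flagged_transactions AS
SELECT
  txn_id,
  card_id,
  amount,
  merchant_country
FROM transactions
WHERE amount > 5000                       -- unusually large purchase
   OR merchant_country <> home_country;   -- purchase far from the cardholder's home country
```

Downstream alerting services can then subscribe to `flagged_transactions` instead of re-implementing the rule themselves.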

Now, other teams within the bank – for example, the marketing team – want to understand trends based on customer activity and customize how they market products to a given customer. For this, you need to look at transactions going back a week or a month and process them all to generate enough context. Instead of going back to the ‘source’ system, you can have DeltaStream send all the processed transactions, in the right format, to your batch processing systems using our integrations (see the sketch after this list) so that you can:

  1. Have the data ready in the right format & schema for processing 
  2. Reduce the back and forth between multiple platforms as data transits the pipeline just once – reduction in data transfer costs 
  3. Eliminate the duplication of data sets across multiple platforms for processing
  4. Easily manage compliance – for example, by reducing the footprint of your PII data.
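As a hedged illustration of what the integration step might look like in SQL – the object names and WITH options here are hypothetical, not the exact DeltaStream syntax for the Snowflake or Databricks integrations:

```sql
-- Continuously deliver the processed transactions to a Snowflake-backed table
-- so batch and BI workloads can use them without re-reading the source topic.
CREATE TABLE analytics_transactions
  WITH ('store' = 'my_snowflake', 'snowflake.schema' = 'ANALYTICS') AS
SELECT txn_id, card_id, amount, merchant_id, txn_ts
FROM flagged_transactions;
```

The point is that the data lands in the warehouse already transformed, typed and governed, so the marketing team can query it directly.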

By integrating the batch processing and stream processing systems, the entire pipeline becomes more manageable and the complexity of managing data across different systems is reduced.

Integrating with DeltaStream

It’s evident that we need integration across different platforms to enable data teams to process and manage data. DeltaStream provides all the ingredients required for such an operation. Our stream processing platform is powered by Apache Flink. DeltaStream’s RBAC enables you to govern and securely share data products. The integrations with Databricks and Snowflake allow data products to be used in the data systems your teams already rely on. With the launch of these integrations, you can do more with your data. To unlock the power of your streaming data, reach out to us or request a free trial!

30 Aug 2023

Choosing Serverless Over Managed Services: When and Why to Make the Switch

Consider a storage service where you store and retrieve files. In on-prem environments, HDFS (Hadoop Distributed File System) has been one of the most common storage platforms. As with any service in an on-prem environment, as a user you are responsible for all aspects of operations: bringing up the HDFS cluster, scaling it up and down based on usage, and dealing with different types of failures, including server failures, network failures and many more. Of course, in an ideal situation you would like to just use the storage service and focus on your business without dealing with the complexity of operating such infrastructure. If you use a cloud environment instead of on-prem, you have the option of using storage services provided by a variety of vendors instead of running HDFS on the cloud yourself. However, there are different ways to provide the same service on the cloud, and they can significantly affect the user experience and ease of use of such services.

A Look at Managed Services

Let’s go back to our storage service and consider that we are now using a cloud environment and can take advantage of not running our required services ourselves, instead using the ones vendors offer in this cloud environment. One common option is to provide a managed version of the on-prem service. In this case, the service provider takes the same platform that is used in the on-prem environment and makes some improvements to run it on the cloud. While this takes away some of the burden of operations, the user is still involved in many other aspects of operating such managed services. For the storage service we are considering here, a managed HDFS service would be an example of this approach. When using a “fully managed” cloud HDFS, as a user you still have to make decisions such as provisioning an HDFS cluster through the service. This means you need to have a good understanding of the amount of storage you will use and let the service provider know if you need more or fewer resources after provisioning the initial cluster. Requiring the user to provide such information often results in confusion, and in most cases the initial decision won’t be accurate, so as usage continues the provisioned resources will need adjusting. You cannot expect a user to accurately know how much storage they will need in the next six months or a year.

In addition, the notion of a cluster brings many limitations. A cluster has a finite amount of resources, and as usage continues those resources are consumed and more are needed. In the case of our “managed” HDFS service, the provisioned storage (disk space) is one of the limited resources, and as more and more data is stored in the cluster, the available storage shrinks. At this point, the user has to decide between scaling up the existing cluster by adding more nodes and disk space or adding a new cluster to accommodate the growing need for storage. To get ahead of such issues, users may over-provision resources, which in turn can result in unused resources and extra cost. Finally, once a cluster is provisioned, the user starts incurring the cost of the whole cluster regardless of whether half or all of its resources are utilized. In short, a managed cloud service in most cases puts the burden of resource provisioning on the user, which in turn requires the user to have a deep understanding of the resources required not just now but for the short- and long-term future.

Unlocking Cloud Simplicity: The Serverless Advantage

Now let’s assume that instead of taking the managed service path, a vendor takes a different route and builds a new cloud-only storage service from the ground up, where all the complexity and operations are handled under the hood by the vendor and users don’t have to think about complexities such as resource provisioning. For storage, object store services such as AWS S3 are great examples of this approach, which is called serverless. As an S3 user, you just set up buckets and folders and read and write your files. There is no need to manage anything or provision any cluster, and no need to worry about having enough disk space or nodes. All operational aspects of the service, including making sure it is always available with the required storage space, are handled by S3. This is a huge win for users since they can focus on building their applications instead of worrying about provisioning and sizing clusters correctly. With such simplicity, we can see why almost every cloud user relies on object stores such as S3 for their storage needs unless there is an explicit requirement to use something else. S3 is a great example of the superiority of cloud-native serverless architecture over a “fully managed” version of an on-prem product.

Another benefit of serverless platforms such as S3 compared to managed services is that S3 enables users to access, organize and secure their data in one place instead of dealing with multiple clusters. In S3 you can organize your data in buckets and folders, have a unified view of all of your data, and control access to it in one place. The same cannot be said for a managed HDFS service if you have more than one cluster! In that case users have to keep track of which cluster holds which data and how to control access to data across multiple clusters, which is a much more complex and error-prone process.

Choosing Serverless for Stream Processing

We can make the same argument in favor of serverless offerings over managed services for many other platforms, including stream processing and streaming databases. You can have “fully managed” services where the user has to provision a cluster, along with specifying the amount of resources it will have, before writing any query. Indeed, managed services for stream processing involve far more complexity than the managed HDFS example above. The cluster in the stream processing case is shared among multiple queries, which means imperfect isolation and the possibility of one bad query bringing down the whole cluster and disrupting the other queries even though they were healthy and running with no issues. To exacerbate the situation, as you add more streaming queries to the cluster, eventually all of its resources will be used, since streaming queries are long-running jobs, and you will need to scale up your cluster or launch a new one to accommodate newer queries. The first option results in a larger cluster with more queries sharing the same resources and interfering with each other.

The second option, on the other hand, results in a growing number of clusters to keep track of, along with keeping track of which query is running on which cluster. So anytime you have to provision or declare a cluster in a “fully managed” stream processing or streaming database service, you are dealing with a managed service along with the restrictions mentioned above and many more. Even worse, once you provision a cluster, the billing for that cluster starts regardless of whether no queries or several queries are running in it.

We built DeltaStream as a serverless platform because we believe that such a service should be as easy to use as S3; you can think of DeltaStream as the S3 of stream processing. In DeltaStream there is no notion of a cluster or provisioning. You just connect to your streaming stores, such as Apache Kafka or AWS Kinesis, and you are up and running, ready to run queries. You only pay for queries that are running, and since there is no concept of a cluster, you won’t be charged for idle or underutilized clusters! Focus on building your streaming applications and pipelines and leave the infrastructure to us. Launch as many queries as you want; there is no notion of running out of resources. Your queries run in isolation, and we can scale them up and down independently without their interfering with each other.
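As a hedged sketch of what that experience looks like – the store name, broker address and property names below are illustrative rather than exact DeltaStream syntax:

```sql
-- Register an existing Kafka cluster as a store, then query a stream over one
-- of its topics. Nothing is provisioned or sized; billing follows the query.
CREATE STORE my_kafka WITH (
  'type' = 'KAFKA',
  'uris' = 'broker-1.example.com:9092'   -- hypothetical broker address
);

SELECT viewtime, userid, pageid
FROM pageviews                            -- a stream declared over a Kafka topic
WHERE pageid = 'Page_21';
```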

DeltaStream is a unified platform that provides stream processing (streaming analytics) and a streaming database in one platform. You can build streaming pipelines, event-driven applications and always up-to-date materialized views with familiar SQL syntax. In addition, DeltaStream enables you to organize and secure your streaming data across multiple streaming storage systems. If you are using any flavor of Apache Kafka, including Confluent Cloud, Amazon MSK or Redpanda, or AWS Kinesis, you can now try DeltaStream by signing up for our free trial.

22 Jun 2023

What is Streaming ETL and how does it differ from Batch ETL?

In today’s data-driven world, organizations are seeking effective and reliable ways to extract insights to make timely decisions based on the ever-increasing volume and velocity of data. ETL (Extract, Transform, Load) is a process where data is extracted from various sources, transformed to fit specific requirements, such as cleaning, formatting, and aggregating, and loaded into a target system or data warehouse. ETL ensures data consistency, quality, and usability, and enables organizations to effectively analyze their data. Traditional batch processing, while effective for certain use cases, falls short in meeting the demands of real-time and event-driven data processing and analysis. This is where streaming ETL emerges as a powerful solution.

Streaming ETL: An Overview

Unlike traditional batch processing, which operates at fixed intervals, streaming ETL operates on a continuous stream of data as records arrive, allowing for real-time analysis. A streaming ETL pipeline begins with the continuous data ingestion phase, in which records are collected from different sources ranging from databases to event streaming platforms like Apache Kafka and Amazon Kinesis. Once data is ingested, it goes through real-time transformation operations for cleaning, normalization, enrichment, etc. Stream processing frameworks such as Apache Flink, Kafka Streams, ksqlDB and Apache Spark provide tools and APIs to apply these transformations and prepare the data. The same frameworks allow processing data in real time and support operations such as real-time aggregation and complex machine learning operations. Finally, the results of the streaming ETL are delivered to downstream systems and applications for immediate consumption, or they are stored in data warehouses and data stores for future use.
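A minimal streaming-ETL step might look like the following Flink-style SQL, assuming hypothetical `raw_clicks` and `cleaned_clicks` tables declared over streaming topics:

```sql
-- Continuously cleanse and lightly enrich click events as they arrive,
-- then load them into a sink table for downstream consumers.
INSERT INTO cleaned_clicks
SELECT
  LOWER(TRIM(user_id))           AS user_id,     -- normalize identifiers
  TO_TIMESTAMP(event_time)       AS event_ts,    -- parse the event timestamp
  COALESCE(country, 'unknown')   AS country      -- fill missing enrichment fields
FROM raw_clicks
WHERE user_id IS NOT NULL;                       -- drop malformed records
```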

Streaming ETL can be applied in various domains, including fraud detection and prevention, real-time analytics and personalization for targeted advertisements, and IoT data processing and monitoring to handle high velocity and volume of data generated by devices such as sensors and smart appliances.

Why should you use Streaming ETL?

Streaming ETL offers numerous advantages in real-time data processing. Here are some of the most important ones:

  • Streaming ETL provides real-time insight into emerging trends, anomalies and critical events in the data as they happen. It operates with low latency and keeps processing results up to date, reducing the gap between the time data arrives and the time it is processed. This facilitates accurate and timely decision-making and enables organizations to capitalize on time-sensitive opportunities or address emerging issues promptly.
  • Streaming ETL frameworks are designed to scale horizontally, which is crucial for handling increased data volumes and processing requirements in real-time applications. This elasticity allows for seamless scaling of resources based on demand and enables the system to manage spikes in data volume and velocity without sacrificing performance.

With all its advantages, Streaming ETL also presents some challenges:

  • The streaming ETL process typically introduces additional complexity to data processing. Real-time data ingestion, continuous transformations, and persisting results while maintaining performance and data consistency require careful design and implementation.
  • Streaming ETL pipelines run in distributed streaming environments, which introduces new challenges. Unless an appropriate delivery guarantee such as exactly-once or at-least-once is used, there is a risk of delay or data loss during the ingestion and delivery stages due to parallel and asynchronous processing when applying transformations. Ordering events and maintaining data consistency are complex in such situations, and if not handled properly they may impact the accuracy of computations that rely on event order. Fault-tolerance mechanisms such as replication, checkpointing, and backup strategies are essential to prevent data loss and ensure the reliability and correctness of results (a small configuration sketch follows this list).
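For example, in Flink SQL these guarantees are typically enabled through checkpointing configuration; the values below are arbitrary examples, not recommendations:

```sql
-- Turn on periodic checkpoints and exactly-once processing for subsequent queries.
SET 'execution.checkpointing.interval' = '30s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
```

End-to-end exactly-once delivery additionally depends on the sources and sinks in use (for example, transactional Kafka sinks).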

Using a modern stream processing platform, such as DeltaStream, can help address the above challenges and enable organizations to benefit from all the advantages of Streaming ETL.

Differences between Streaming ETL and Batch ETL

Data processing model: Batch ETL starts with collecting large volumes of data over a time period and processes these batches in fixed intervals. Therefore, it applies transformations on the entire dataset as a batch. Streaming ETL operates on data as it arrives in real-time and continuously processes and transforms data as individual records or small chunks.

Latency: Batch ETL introduces inherent latency since data is processed in intervals. This latency normally ranges from minutes to hours. Streaming ETL processes data in real-time and offers low latency. Results are available immediately and are updated continuously.

Data volume and velocity: Batch ETL is well-suited for processing large volumes of data, collected over time. Therefore it is effective when dealing with historical data. Streaming ETL, on the other hand, is designed for high-velocity data streams and is effective for use cases that require immediate processing.

Processing frameworks: Batch ETL typically utilizes frameworks like Apache Hadoop, Apache Spark and traditional ETL and data warehouse tools. These frameworks are optimized for processing large volumes of data in parallel, but not necessarily for real-time use cases. Streaming ETL leverages specialized stream processing frameworks such as Apache Flink and Apache Kafka. These frameworks are optimized for processing continuous streams of data in real time. With recent changes, some of these frameworks, such as Apache Flink, are now capable of processing batch workloads too. As these efforts and improvements continue, the overlap between the frameworks that process these workloads is expected to grow.

Fault tolerance: Batch ETL typically processes large volumes of data at fixed intervals. If a failure occurs, all the data within that batch may be affected, which can lead to partially written results. This makes failure recovery challenging in batch ETL, as it involves cleaning up partial results and reprocessing the entire batch. Removing partial results and state and starting a new run is a complicated process that normally involves manual intervention. Reprocessing a batch is time-consuming and resource-intensive, and can take so long that processing of the next batch falls behind. Moreover, rerunning some tasks can have unexpected side effects that impact the correctness of the final results. Such issues need to be handled properly during a job restart.

Streaming ETL does not involve many jobs that run sequentially over time; instead, a single long-running job maintains its state and performs incremental computation as data arrives. Therefore, streaming ETL is generally better equipped to handle failures and partial results. Given that results are generated incrementally, a failure does not force discarding already generated results and reprocessing the sources from the beginning. Stream processing frameworks provide transaction-like processing, exactly-once semantics, and write-ahead logs, ensuring atomicity and data consistency. They have built-in mechanisms for fault recovery, handle out-of-order events, and ensure end-to-end reliability by leveraging distributed messaging systems.

Choosing Between Streaming ETL and Batch ETL

There are several factors to consider when deciding between streaming ETL and batch ETL for a data processing use case. The most important factor is latency requirements: consider the desired latency of insights and actions. If a real-time response is critical, then streaming ETL is the correct choice. The other important factors are data volume and velocity, along with the cost of processing. You should evaluate the volume of data and the rate at which it arrives. Streaming ETL is capable of processing fast data immediately; however, due to its inherent complexity and higher resource demand, it is more difficult to maintain.

Streaming ETL also introduces challenges related to maintaining the correct order of events, especially in distributed environments. Batch ETL processes data in a much more deterministic and mostly sequential manner, which ensures data consistency and ordered processing. A modern stream processing platform is a viable way to handle these challenges when picking streaming ETL as a solution. Finally, you need to consider how often the data sources in your use case evolve over time, as that can impact the structure of incoming records. Data processing pipelines need to handle schema evolution properly to prevent disruptions and errors. Managing schema changes, versioning, and implementing schema inference mechanisms becomes crucial to ensure correctness and reliability. Using a stream processing framework enables you to address these changes in a streaming ETL pipeline, which is intended to run continuously without interruption.

Conclusion

Choosing between streaming ETL and batch ETL requires a thorough understanding of the specific requirements and trade-offs of each. While both approaches have their strengths and weaknesses, they are effective in different use cases and for different data processing needs. Streaming ETL offers real-time processing with low latency and high scalability to handle high-velocity data. On the other hand, batch ETL is well-suited for historical analysis, scheduled reporting, and scenarios where near-real-time results are not critical. In this blog post, we covered the specifics as well as the pros and cons of each approach, and explained the important factors to consider when deciding which one to choose for a given data processing use case.

DeltaStream provides a comprehensive stream processing platform to manage, secure and process all your event streams. You can easily use it to create your streaming ETL solutions; it is easy to operate and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking out our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

05 Jun 2023

All About Streaming Data Mesh

The Current State of Streaming Data

The adoption of streaming data technologies has grown exponentially over the past five years. Every industry understands the importance of accessing and understanding data in real time so it can make decisions about its products, services and customers. The sooner you can access an event – whether a customer purchasing an item or an IoT device relaying data – the faster you can process and react to it. Making all this happen in real time is where technologies like Apache Kafka, AWS Kinesis, Google Pub/Sub, Apache Flink and many more come into play. Most of these are available as managed services, making it easy for companies to adopt them. For example, Confluent and AWS offer Apache Kafka as a managed service.

With rapid adoption of these services, many companies now have access to real-time data.

By their very nature, technologies like Apache Kafka interact with multiple types of data producers and consumers. While data lakes and data warehouses mean storing large amounts of data in centralized repositories, data streaming technologies have always promoted decentralization of services and the data behind them. As data sets transit these systems, being able to manage, access and self-provision data is extremely important. This is where Data Mesh comes into the picture.

What is a Data Mesh?

Data Mesh, as first defined by Zhamak Dehghani, is a modern data management architecture that facilitates and promotes a decentralized way to manage, access and share data. The four core principles of a data mesh are:

  • Domain ownership 
  • Data as a product
  • Self-serve data platform 
  • Federated computational governance

Image source: datamesh-architecture.com

A Data Mesh gives you the agility to work with the data that flows in and out of complex systems across multiple organizations. It reduces dependencies and removes bottlenecks that arise while accessing data; thereby improving the organization’s ability to respond and make business decisions.

While there are tools and technologies that can help you build a Data Mesh over “data at rest”, the same isn’t true for “data in motion”. This is where DeltaStream comes into the picture: it enables organizations to develop a Data Mesh over and across all their streaming data, which could span multiple platforms and multiple public cloud providers (Apache Kafka, Confluent Cloud, Kinesis, Pub/Sub, Redpanda, etc.).

Streaming Data Mesh with DeltaStream

What we have built at DeltaStream is an extremely powerful framework that provides a single platform to access your operational and analytical data, in real time, across your entire organization in a decentralized manner. No more expensive pipelines, duplicated data, or central teams to manage data lakes and data warehouses.

Let’s revisit the core tenets of Data Mesh and how DeltaStream helps you achieve each one of those and more.

Domain Ownership

  • In DeltaStream, data boundaries are clearly defined using our namespaces and RBAC. Each stream is isolated with strict access control, and any query against that stream inherits the same permissions and boundaries. This way, your core data sets remain under your purview and you control what others can access. Queries are isolated at the infrastructure level, which means they are scaled independently, reducing cost and operational complexity. With all these controls in place, the core objective of decentralized ownership of data streams is achieved while keeping your data assets secure. A sketch of how such sharing might be expressed follows.
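As a hedged illustration – the role, database and stream names below are hypothetical, and the statements show the shape of SQL-style RBAC rather than exact DeltaStream syntax:

```sql
-- The payments domain owns its streams and decides who may read them.
CREATE ROLE marketing_readers;
GRANT USAGE ON DATABASE payments TO ROLE marketing_readers;
GRANT SELECT ON payments.prod.transactions TO ROLE marketing_readers;
```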

Data as a Product

  • Making data discoverable and usable while maintaining its quality is central to serving data as a product to your internal teams. Being able to do this as close to the source as possible is extremely beneficial and in line with the data mesh philosophy of domain owners owning the quality of their data. This is exactly what you can achieve with DeltaStream. Using our powerful SQL interface, you can quickly alter and enrich your data and get it ready to use in real time (see the sketch below). As your data models change, DeltaStream’s Data Catalog along with the schema registry can track your data’s evolution and help you iterate on your data assets as you grow and evolve.
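For instance, a ready-to-use data product might be produced by joining a raw stream with reference data; the stream and column names below are hypothetical and the syntax is only indicative:

```sql
-- Enrich the raw orders stream with customer attributes so downstream teams
-- consume a curated product rather than raw events.
CREATE STREAM orders_enriched AS
SELECT o.order_id, o.amount, c.segment, c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```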

Self-serve Data Platform

  • Once you achieve “domain ownership” and “data as a product”, self-service follows. The main objective of self-service is to remove the need for a centralized team to coordinate data distribution across the entire company. Leveraging DeltaStream’s catalog you can make your high-quality data streams discoverable, and combining this with our RBAC you can securely share them with both internal and external users. This means data continuously flows from its source systems to end users without ever needing to be stored, extracted or loaded again. This is very powerful: you get to securely democratize streaming data across your company, while also having the ability to share it with users outside of your organization.

Federated Computational Governance

  • With everything decentralized – from data ownership to data delivery – governance becomes a very important requirement. It ensures that data originating in a particular domain is consumable in any part of the organization. Schemas and a schema registry go a long way toward ensuring that. Data, just like the organization it serves, changes and evolves over time, and integrating your data streams with a schema registry is critical to maintaining a common language for communication. Interoperability is also a key component of governance, and DeltaStream’s ability to operate across multiple streaming stores is a great capability here.

Conclusion

Decentralization of data ownership and its delivery is critical for organizations to stay nimble. As Data Mesh gains traction, there are tools and technologies readily available to implement it over your data at rest, but the same can’t be said when you are dealing with streaming data. This is what DeltaStream provides: with tooling built around the core processing framework, it enables companies to implement a data mesh on top of streaming data.

As streaming data continues its rapid growth, it is extremely important for organizations to set up the right foundations to maximize the value of their real-time data. While architectures like Data Mesh provide the right framework, platforms like DeltaStream offer a lot to bring it all together.

We offer a free trial of DeltaStream that you can sign up for to start using DeltaStream today. If you have any questions, please feel free to reach out to us.

10 May 2023

Why Apache Flink is the Industry Gold Standard

Apache Flink is a distributed data processing engine that is used for both batch processing and stream processing. Although Flink supports both batch and streaming processing modes, it is first and foremost a stream processing engine used for real-time computing. By processing data on the fly as a stream, Flink and other stream processing engines are able to achieve extremely low latencies. Flink in particular is a highly scalable system, capable of ingesting high throughput data to perform stateful stream processing jobs while also guaranteeing exactly-once processing and fault tolerance. Initially released in 2011, Flink has grown into one of the largest open source projects. Large and reputable companies such as Alibaba, Amazon, Capital One, Uber, Lyft, Yelp, and others have relied on Flink to provide real-time data processing to their organizations.

Before stream processing, companies relied on batch jobs to fulfill their big data needs, where jobs would only perform computations on a finite set of data. However, many use cases don’t naturally fit a batch model. Consider fraud detection, for example, where a bank may want to act if anomalous credit card behavior is detected. In the batch world, the bank would need to collect credit card transactions over a period of time, then run a batch job over that data to determine if a transaction is potentially fraudulent. Since the batch job runs after the data collection, it might be a few hours before a transaction is marked as fraudulent, depending on how frequently the job is scheduled. In the streaming world, transactions can be processed in real time and the latency for action is much lower. Being able to freeze credit card accounts in real time can save time and money for the bank. You can read more about the distinction between streaming and batch in our other blog post.

Without Apache Flink or other stream processing engines, you could still write code to read from a streaming message broker (e.g. Kafka, Kinesis, RabbitMQ) and do stream processing from scratch. However, with this approach you would miss out on many of the features that modern stream processing engines give you out of the box, such as fault tolerance, the ability to perform stateful queries, and APIs that vastly simplify processing logic. There are many stream processing frameworks in addition to Apache Flink, such as Kafka Streams, ksqlDB, Apache Spark (Spark Streaming), Apache Storm, Apache Samza, and dozens of others. Each has its own pros and cons, and comparing them could easily become its own blog post, but Apache Flink has become an industry favorite because of these key features:

  • Flink’s design is truly based on processing streams instead of relying on micro-batching
  • Flink’s checkpointing system and exactly-once guarantees
  • Flink’s ability to handle large throughputs of data from many different types of data sources at a low latency
  • Flink’s SQL API, which leverages Apache Calcite, allows users to define their Flink jobs using SQL, making it easier to build applications (a short example follows this list)
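As a small sketch of what a Flink SQL job can look like – the table, topic and broker names are hypothetical, while the WATERMARK and windowing constructs are Flink’s own:

```sql
-- Declare a table over a Kafka topic with an event-time watermark,
-- then compute per-user click counts over one-minute tumbling windows.
CREATE TABLE clicks (
  user_id STRING,
  url     STRING,
  ts      TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'clicks',
  'properties.bootstrap.servers' = 'broker:9092',  -- hypothetical broker
  'scan.startup.mode' = 'earliest-offset',
  'format'    = 'json'
);

SELECT
  user_id,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS clicks_per_minute
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE);
```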

Flink is designed to be a highly scalable distributed system, capable of being deployed in most common setups and of performing stateful computations at in-memory speed. To get a high-level understanding of Flink’s design, we can look further into the Flink ecosystem and execution model.

The Flink ecosystem can be broken down into four layers:

  • Storage: Flink doesn’t have its own storage layer and instead relies on a number of connectors to connect to data. These include but are not limited to HDFS, AWS S3, Apache Kafka, AWS Kinesis, and SQL databases. You can similarly use managed products such as Amazon MSK, Confluent Cloud, and Redpanda among others.
  • Deploy: Flink is a distributed system and integrates seamlessly with resource managers like YARN and Kubernetes, but it can also be deployed in standalone mode or in a single JVM.
  • Runtime: Flink’s runtime layer is where the data processing happens. Applications can be split into tasks that are parallelizable and thus highly scalable across many cluster nodes. With Flink’s incremental and asynchronous checkpointing algorithm, Flink is also able to provide fault tolerance for applications with large state and provide exactly-once processing guarantees while minimizing the effect on processing latency.
  • APIs and Libraries: The APIs and libraries layer is provided to make it easy to work with Flink’s runtime. The DataStream API can be used to declare stream processing jobs and offers operations such as windowing, updating state, and transformations. The Table and SQL APIs can be used for batch processing or stream processing. The SQL API allows users to write SQL queries to define their Flink jobs, and the Table API allows users to call relational operations on their data based on the table schemas that define it.

In the execution architecture, there is a client and the Flink cluster. Two kinds of nodes make up the Flink cluster – the Job manager and the Task manager. There is always one Job manager and any number of Task managers.

  • Client: Responsible for generating the Flink job graph and submitting it to the Job manager.
  • Job manager: Receives the job graph from the Client then schedules and oversees the execution of the job’s tasks on the Task managers.
  • Task manager: Responsible for executing all the tasks it was assigned by the Job manager and reporting status updates back to the Job manager.

In Figure 1, you’ll notice that storage and deployment are somewhat detached from the Flink runtime and APIs. The benefit of designing Flink this way is that Flink can be deployed in any number of ways and connect to any number of storage systems, making it a good fit for almost any tech stack. In Figure 2, you’ll notice that there can be multiple Task managers and they can be scaled horizontally, making it easy to add resources to run jobs at scale.

Apache Flink has become the most popular real-time computing engine and the industry gold standard for many reasons, but its ability to process large volumes of real-time data with low latency is its main draw. Other features include the following:

  • Stateful processing: Flink is a stateful stream processor, allowing for transformations such as aggregations and joins of continuous data streams.
  • Scalability: Flink can scale up to thousands of Task manager nodes to handle virtually any workload without sacrificing latency.
  • Fault tolerance: Flink’s state-of-the-art checkpointing system and distributed runtime make it both fault tolerant and highly available. Checkpoints are used to save the application state, so when your application inevitably encounters a failure, it can be resumed from the latest checkpoint. For applications that use transactional sinks, end-to-end exactly-once processing can be achieved. Also, since Flink is compatible with most resource management systems like YARN and Kubernetes, when nodes crash they will automatically be restarted. Flink also has high-availability setup options based on Apache ZooKeeper to help prevent failure states.
  • Time mechanisms: Flink has a concept called “watermarks” for working with event time, which represents the application’s progress with regard to time. Different watermark strategies can be applied to let users configure how they want to handle out-of-order or late data. Watermarks are important for the system to understand when it is safe to close windows and emit results for stateful operations.
  • Connector ecosystem: Since Flink uses external systems for storage, Flink has a rich ecosystem of connectors. Wherever your data is, Flink likely already has a library to connect to it. Popular existing connectors include connectors for Apache Kafka, AWS Kinesis, Apache Pulsar, JDBC connector for databases, Elasticsearch, and Apache Cassandra. If a connector you need doesn’t exist, Flink provides APIs to help you create a new connector.
  • Unified batch and streaming: Flink is both a batch and stream processor, making it an ideal candidate for applications that need both streaming and batch processing. With Flink, you can use the same application logic for both cases reducing the headache of managing two different technologies.
  • Large open source community: In 2022, Flink had over 1,600 contributions, and the GitHub project currently has over 21 thousand stars. To put this in perspective, Apache Kafka currently has about 24,800 stars and Apache Hadoop about 13,500 stars. Having a large and active open source community is extremely valuable for fixing bugs and staying up to date with the latest tech requirements.

Conclusion

In this blog post, we’ve introduced Apache Flink at a high level and highlighted some of its features. While we would suggest that those interested in adding stream processing to their organizations survey the different stream processing frameworks available, we’re also confident that Flink would be a top choice for almost any use case. At DeltaStream, although we are not tied to any particular compute engine, we have chosen to use Flink for the reasons explained above. DeltaStream is a unified, serverless stream processing platform that takes the pain out of managing your streaming stores and launching stream processing queries. If you’re interested in learning more, get a free trial of DeltaStream.

17 Apr 2023

Batch vs. Stream Processing: How to Choose

Rapid growth in the volume, velocity and variety of data has made data processing a critical component of modern business operations. Batch processing and stream processing are two major and widely used methods to perform data processing. In this blog, we will explain batch processing and stream processing and will go over their differences. Moreover, we will explore the pros and cons of each, and discuss the importance of choosing the right approach for your use cases. If you are looking for Streaming ETL vs. Batch ETL, we have a blog on that too.

What is Batch Processing?

Batch processing is a data processing method that operates on large volumes of fully stored data. Depending on the specifics of the use case, a batch processing pipeline typically consists of multiple steps including data ingestion, data preprocessing, data analysis and data storage. Data goes through various transformations for cleaning, normalization, and enrichment, and different tools and frameworks can be used for analysis and value extraction. The final processed data is stored in a new location and format such as a database, a data warehouse, a file, or a report.

Batch processing is used in a wide range of applications where large volumes of data need to be processed efficiently in a cost-effective manner with high throughput. Examples are: ETL processes for data aggregation, log processing, and training predictive ML models.

Pros and cons of batch processing

Batch processing has its advantages and disadvantages. Here are some of its pros and cons.
Pros:

  • Batch processing is efficient and cost-effective for applications with large volumes of already stored data.
  • The process is predictable, reliable and repeatable which makes it easier to schedule, maintain, and recover from failures.
  • Batch processing is scalable and can handle workloads with complex transformations on large volumes of data.

Cons:

  • Batch processing has a high latency, and processing time can be long depending on the volume of data and complexity of the workload.
  • Batch processing is generally not interactive while the processing is running; users need to wait until the whole process is complete before they can access the results. This means batch processing does not provide real-time insight into data, which can be a disadvantage in applications where real-time access to (partial) results is necessary.

What is Stream Processing?

Stream processing is a data processing method where processing is done in real time as data is being generated and ingested. It involves analyzing and processing continuous streams of data in order to extract insights and information from them. A stream processing pipeline typically consists of several phases including data ingestion, data processing, data storage, and reporting. While the steps in a stream processing pipeline may look similar to those in batch processing, these two methods are significantly different from each other.

Stream processing pipelines are designed to have low latency and process data continuously. When it comes to the complexity of transformations, batch processing normally involves more complex and resource-intensive transformations that run over large, discrete batches of data. In contrast, stream processing pipelines run simpler transformations on smaller chunks of data as soon as the data arrives (the sketch below contrasts the two). Given that stream processing is optimized for continuous processing with low latency, it is suitable for interactive use cases where users need immediate feedback. Examples include fraud detection and online advertising.
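To make the contrast concrete, here is the same aggregation expressed both ways in SQL; the table and column names are illustrative:

```sql
-- Batch: run periodically over a bounded table of historical orders.
SELECT customer_id, SUM(amount) AS total_spend
FROM orders_history
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY customer_id;

-- Streaming: maintained continuously over an unbounded stream of orders,
-- emitting per-hour totals as each window closes.
SELECT customer_id,
       TUMBLE_START(order_ts, INTERVAL '1' HOUR) AS hour_start,
       SUM(amount) AS total_spend
FROM orders_stream
GROUP BY customer_id, TUMBLE(order_ts, INTERVAL '1' HOUR);
```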

Pros and cons of stream processing

Here is a list of pros and cons for stream processing.
Pros:

  • Stream processing allows for real time processing of data with low latency, which means the results can be gained quickly to serve use cases and applications with real-time processing demands.
  • Stream processing pipelines are flexible and can be quickly adapted to changes in data or processing needs.

Cons:

  • Stream processing pipelines tend to be more expensive due to their requirements to handle long-running jobs and data in real time, which means they need more powerful hardware and faster processing capabilities. However, with serverless platforms such as DeltaStream, stream processing can be simplified significantly.
  • Stream processing pipelines require more maintenance and tuning due to the fact that any change or error in data, as it is being processed in real time, needs to be addressed immediately. They also need more frequent updates and modifications to adapt to changes in the workload. Serverless stream processing platforms like DeltaStream simplify this aspect of stream processing as well.

Choosing between Stream Processing and Batch Processing

There are several factors to consider when deciding whether to use stream processing or batch processing for your use case. First, consider the nature and specifics of the application and its data. If the application is highly time-sensitive and requires real-time analysis and immediate response (low latency), then stream processing is the option to choose. On the other hand, if the application can rely on offline and periodic processing of large amounts of data, then batch processing is more appropriate.

The second factor is the volume and velocity of the data. Stream processing is suitable for processing continuous streams of data arriving at high velocity. However, if the data volume is too high to be processed in real time and needs extensive, complex processing, batch processing is most likely the better choice, though it comes at the cost of sacrificing low latency. Scaling stream processing up to thousands of workers to handle petabytes of data is very expensive and complex. Finally, you should consider the cost and resource constraints of the application. Stream processing pipelines are typically more expensive to set up, maintain, and tune. Batch processing pipelines tend to be more cost-effective and scalable, as long as no major changes appear in the workload or the nature of the data.

Conclusion

Batch processing and stream processing are two widely used data processing methods for data-intensive applications. Choosing the right approach for a use case depends on the characteristics of the data being processed, the complexity of the workload, the frequency of data and workload changes, and the use case’s latency and cost requirements. It is important to evaluate your data processing requirements carefully to determine which approach is the best fit. Moreover, keep an eye on emerging technologies and solutions in the data processing space, as they may provide new options and capabilities for your applications. While batch processing technologies have been around for several decades, stream processing technologies have seen significant growth and innovation in recent years. Besides several open-source stream processing frameworks, such as Apache Flink, which have gained widespread adoption, cloud-based stream processing services are emerging as well, aiming to provide easy-to-use and scalable data processing capabilities.

DeltaStream is a unified serverless stream processing platform to manage, secure and process all your event streams. It is easy to use, easy to operate, and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking out our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

22 Mar 2023

When to use Streaming Analytics vs Streaming Databases

Event streaming infrastructure has become an important part of the modern data stack in many organizations, enabling them to rapidly ingest and process streaming data to gain real-time insight and take action accordingly. Streaming storage systems such as Apache Kafka provide a storage layer to ingest data streams and make them available in real time. Streaming analytics (stream processing) systems and streaming databases are two platforms that enable users to gain real-time insight from their data streams. However, each of these systems is suitable for different use cases, and one system alone may not be the right choice for all of your stream processing needs. In this blog, we provide a brief introduction to streaming analytics (stream processing) and streaming database systems, discuss how they work, and describe which use cases each is suitable for. We then show how DeltaStream provides the capabilities of both of these systems in one unified platform, along with more capabilities unique to DeltaStream.

What is Streaming Analytics? 

Streaming analytics, also known as stream processing, continuously processes and analyzes event streams as they arrive. Streaming analytics systems do not have their own storage but provide a compute layer on top of other storage systems. These systems usually read data streams from streaming storage systems such as Apache Kafka, process them, and write the results back to the streaming storage system. Streaming analytics systems provide both stateless (e.g., filtering, projection, transformation) and stateful (e.g., aggregation, windowed aggregation, join) processing capabilities on data streams. Many of them also provide a SQL interface where users can express their processing logic in familiar SQL statements. Apache Flink, Kafka Streams/Confluent’s ksqlDB and Spark Streaming are some of the commonly used streaming analytics (stream processing) platforms. The following figure depicts a high-level overview of a streaming analytics system and its relation to streaming stores.

Streaming analytics systems are used in a wide variety of use cases. Building real-time data pipelines is one of the main ones, where data streams are ingested from source stores, processed, and written into destination stores. Streaming data integration, real-time ML pipelines and real-time data quality assurance are just a few of the solutions you can build with real-time data pipelines powered by streaming analytics systems. Other use cases include data mesh deployments and building event-driven microservices.

An important characteristic of streaming analytics systems is the focus on processing. In these systems, queries, also known as jobs, are a fundamental part of the platform. They provide fine-grained control over these queries: users can start, pause and terminate queries individually without having to make any changes to the sources and sinks of those queries. This focus on queries is one of the main differences between streaming analytics and streaming databases, as we explore in the next section.

What is a Streaming Database?

Streaming databases combine stream processing and the serving of results in one system. Unlike streaming analytics systems, which are compute-only platforms and don’t have their own storage, streaming databases include their own storage where the results of continuous computations are stored and served to user applications. Materialized views are one of the foundational concepts in streaming databases: they represent the results of continuous queries that are updated incrementally as input data arrives. These materialized views are then available to query through SQL.
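A minimal sketch of this pattern – the view and table names are hypothetical, and the syntax is indicative of streaming databases in general rather than any specific product:

```sql
-- An incrementally maintained view over a stream of orders.
CREATE MATERIALIZED VIEW orders_per_user AS
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id;

-- Applications query it like a regular table and always see the latest state.
SELECT order_count FROM orders_per_user WHERE user_id = 'u123';
```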

The following figure depicts a high level view of a streaming database architecture.

The role of queries is different in streaming databases. As mentioned above, continuous queries are a fundamental part of streaming analytics systems; streaming databases, however, do not provide fine-grained control over queries. In streaming databases, the life cycle of a continuous query is tied to its materialized view, and queries are started and terminated implicitly by creating and dropping materialized views. This significantly limits users’ ability to manage such queries, which is an essential part of building and managing streaming data pipelines. Therefore, streaming analytics systems are much better suited to building real-time data pipelines than streaming databases.

While streaming databases may not be the right tool for building real-time streaming pipelines, the incrementally updated materialized views in streaming databases can be used to serve real-time data to power user-facing dashboards and applications. They can also be used to build real-time feature stores for machine learning use cases.

Can we have both? With DeltaStream, yes

While streaming analytics systems are ideal for building real-time data pipelines, streaming databases are great at building incrementally updated materialized views and serving them to applications. Customers are left having to choose between two systems or, even worse, buy multiple solutions. The question is: can you have one system that reduces complexity and provides a comprehensive, powerful platform to build real-time applications on streaming data? The answer, as you expected, is yes!

DeltaStream does this seamlessly as a serverless service. As the following diagram shows, DeltaStream provides capabilities of both streaming analytics and streaming databases along with additional features to provide a complete platform to manage, process, secure and share streaming data.

Using DeltaStream’s SQL interface you can build streaming applications and pipelines. You can also build continuously updated materialized views and serve them to applications and dashboards in real time. DeltaStream also provides fine-grained control over continuous queries, letting you start, restart and terminate each query. In addition to compute capabilities on streaming data, DeltaStream provides familiar hierarchical namespacing for your streams: you declare relations such as streams and changelogs on your streaming data, such as Apache Kafka topics, and organize them in databases and schemas the same way you organize tables in traditional databases (see the sketch below). You can then control access to your streaming data with the familiar role-based access control (RBAC) mechanism you would use in traditional relational databases. This significantly simplifies using DeltaStream since you don’t need to learn a new security model for your streaming data. Finally, DeltaStream enables users to securely share real-time data with internal and external users.
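As an illustrative sketch of that organization – the database, schema, stream and property names below are hypothetical, not exact DeltaStream syntax:

```sql
-- Organize streaming data into a database and schema, then declare a stream
-- over an existing Kafka topic so it can be queried and governed like a table.
CREATE DATABASE retail;
CREATE SCHEMA retail.prod;

CREATE STREAM retail.prod.pageviews (
  viewtime BIGINT,
  userid   VARCHAR,
  pageid   VARCHAR
) WITH ('topic' = 'pageviews', 'value.format' = 'json');
```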

So if you are using a streaming storage system such as Apache Kafka (Confluent Cloud, Amazon MSK or any other Apache Kafka) or AWS Kinesis, you should check out DeltaStream as the platform for processing, organizing and securing your streaming data. You can schedule a demo to see all these capabilities in the context of a real-world streaming application.
