Real-time shipment tracking, truck fleet management, real-time training of ML models, and fraud detection are all real use cases powered by streaming technologies. As companies race to build real-time products, they need a data stack capable of handling and processing streaming data. Unlocking stream processing lets businesses run advanced data analytics at low latency, which ultimately means faster business insights and more informed decision making. However, many of the popular stream processing frameworks, like Apache Flink, are quite complex and can be challenging to operate. In this blog post we’ll cover 3 of the main challenges of operating a stream processing platform and how to address them.

Challenge 1: Resource Management

Stream processing jobs are long-lived so they can process data continuously. Because of this, it’s important that your stream processing platform allocates resources for them properly: over-allocating resources results in overspending, while under-allocating causes jobs to fall behind or fail. The right allocation varies case by case. For instance, high-throughput jobs that hold a lot of state should receive more memory, while jobs with complex transformations should receive more CPU. Many workloads also fluctuate, and in those cases resources need to be allocated dynamically to match the workload. Take a stream of page visits on a website, for example – the website will likely be visited more during the day, when people aren’t asleep. So, stream processing jobs that source from this data should scale up during the day, then scale down for the night.

Solution: Utilization Metrics

Exposing resource utilization metrics is an important first step in tackling resource management. Visibility into the utilization trends of your jobs allows your stream processing platform to enforce rules for resource allocation. In the simplest case, if a job’s resource usage is stable, you can allocate resources to match what the metrics show. For jobs with predictable fluctuations, such as one sourcing from data that peaks during the day and dips at night, you can set up a system that adjusts the allocation on a schedule. In the most complex case, jobs with unpredictable fluctuations, the best approach is an auto-scaling service that automatically resizes compute resources based on resource metrics. A platform that exposes the right metrics, can safely resize stream processing jobs, and includes an auto-scaling service is necessary to generically support stream processing workloads, but building one takes a lot of engineering time and effort. If building these systems is too costly an engineering investment, you can also consider fully managed third-party solutions that partially or fully address these challenges.
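To make the auto-scaling idea concrete, here’s a minimal sketch of the control loop at the heart of such a service: compare recent utilization against thresholds, resize the job, then wait out a cooldown so the system doesn’t thrash. The `get_avg_utilization` and `resize_job` hooks are hypothetical stand-ins for your metrics store and your platform’s rescale API.

```python
import time

SCALE_UP_THRESHOLD = 0.80    # scale up above 80% average utilization
SCALE_DOWN_THRESHOLD = 0.30  # scale down below 30%
COOLDOWN_SECONDS = 300       # wait between resizes to avoid thrashing
MIN_PARALLELISM, MAX_PARALLELISM = 1, 64

def autoscale(job_id, parallelism, get_avg_utilization, resize_job):
    """Threshold-based autoscaler sketch.

    get_avg_utilization and resize_job are hypothetical hooks into your
    metrics store and your platform's rescale API.
    """
    while True:
        # Average CPU/memory utilization over a recent window (0.0 - 1.0).
        utilization = get_avg_utilization(job_id, window_minutes=10)
        if utilization > SCALE_UP_THRESHOLD and parallelism < MAX_PARALLELISM:
            parallelism = min(parallelism * 2, MAX_PARALLELISM)
            resize_job(job_id, parallelism)  # e.g. redeploy at new parallelism
        elif utilization < SCALE_DOWN_THRESHOLD and parallelism > MIN_PARALLELISM:
            parallelism = max(parallelism // 2, MIN_PARALLELISM)
            resize_job(job_id, parallelism)
        time.sleep(COOLDOWN_SECONDS)
```

Production autoscalers layer safeguards on top of this loop – for example, waiting for a checkpoint to complete before rescaling a stateful job – but the core decision logic stays this simple.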

Challenge 2: Data Heterogeneity

For production use cases, data can come from many different sources and in many different formats. Streaming data from sensors, cloud providers, databases, and backend services all differ from one another, which makes them difficult to compare and combine. Creating a stream processing platform that can handle varied data formats and quality levels is not easy. The engineering team supporting such a platform needs to understand the nuances of the different data sources and provide tools and features that help make variable data more uniform. However, such a platform creates many possibilities, as businesses can use data from sources that were previously isolated.

Solution: Data Standardization

Standardizing data across your organization and implementing quality controls over that data are the best solutions for dealing with data heterogeneity. Providing data standards and encouraging data schematization at the source is the best practice, but where that isn’t possible, stream processing can help transform data into a standardized format that can be easily processed and integrated with other data streams. Stream processing platforms that let users filter out bad or duplicate data, enrich data with missing fields, and mutate data to fit standardized formats can mitigate many of the issues of variable data.
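As an illustration of what such a normalization step can look like, the sketch below maps records from two hypothetical sources (“web” and “mobile”) onto one standard schema, filtering out records that are missing required fields and enriching records with a fallback timestamp. The field names and schema are made up for the example.

```python
from datetime import datetime, timezone

def standardize(record: dict, source: str):
    """Map a source-specific record onto a common schema, or return None
    to filter it out. Field names here are illustrative."""
    if source == "web":
        user_id, event_type = record.get("uid"), record.get("action")
        ts = record.get("ts_millis")  # epoch milliseconds in this source
        event_time = (datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
                      if ts is not None else None)
    elif source == "mobile":
        user_id, event_type = record.get("userId"), record.get("eventName")
        raw = record.get("timestamp")  # ISO-8601 string in this source
        event_time = datetime.fromisoformat(raw) if raw else None
    else:
        return None  # unknown source: filter out (or dead-letter it)

    # Filter: drop records missing required fields.
    if user_id is None or event_type is None:
        return None
    # Enrich: fall back to processing time if the event time is missing.
    if event_time is None:
        event_time = datetime.now(timezone.utc)

    return {"user_id": str(user_id), "event_type": event_type,
            "event_time": event_time.isoformat(), "source": source}
```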

One tip for dealing with data of variable quality is to provide different configurations for error handling. In many stream processing frameworks, if a job encounters a record it doesn’t know how to deserialize or make sense of, the job simply fails. For data sources without great data integrity, options to skip such records or produce them to a dead-letter queue for later analysis can be better for your overall application.
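For example, here’s a rough sketch of that pattern using the kafka-python client: records that fail to deserialize are routed to a dead-letter topic (or skipped) instead of failing the job. The topic names and the error-handling policy flag are illustrative.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         group_id="my-processor")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

ON_ERROR = "dead-letter"  # illustrative policy: "fail", "skip", or "dead-letter"

def handle(event):
    print(event)  # stand-in for real processing logic

for message in consumer:
    try:
        event = json.loads(message.value)  # raises on malformed records
    except (json.JSONDecodeError, UnicodeDecodeError):
        if ON_ERROR == "fail":
            raise  # the default in many frameworks: the whole job dies
        if ON_ERROR == "dead-letter":
            # Preserve the raw bytes for later analysis instead of failing.
            producer.send("events.dlq", value=message.value)
        continue  # both "skip" and "dead-letter" move past the bad record
    handle(event)
```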

Challenge 3: Streaming Knowledge Gaps

Most stream processing frameworks are highly complex pieces of software. Building the expertise to understand the challenges of streaming versus the traditional challenges of the batch world takes time. For some organizations, having engineers ramp up on streaming technologies may not be a worthwhile or affordable investment. For organizations that do invest in a team of streaming experts, a knowledge gap often forms between the streaming team and other teams. Even with a stream processing platform available to them, product engineers may not have much exposure to streaming concepts or know how to best leverage the streaming tools at hand. In these cases, engineers working on product features or applications may need a lot of back and forth with the streaming team to realize their projects, or may never realize the benefits of adding stream processing to their projects in the first place. These situations lead to lost business potential and reduced developer velocity.

Solution: Education and Democratization

Two ways to address these challenges are investing in developer education and democratizing the streaming and stream processing platforms. Regular knowledge-sharing sessions and encouraging collaboration between teams go a long way toward closing knowledge gaps. From the platform perspective, democratizing streaming and stream processing by making these platforms easy to use lowers the barrier to entry. Popular stream processing frameworks such as Flink and Spark Streaming offer SQL APIs for defining data processing jobs, and exposing SQL to abstract away some of the complexity of the underlying system is one way to make a platform easier to use.
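For instance, here is roughly what a small streaming job looks like when expressed through Flink’s SQL API (shown via PyFlink so it runs as a Python script): the user declares sources, sinks, and a query, and the framework handles parallelism, state, and fault tolerance. The topic names and schema are illustrative.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare a source over an (illustrative) Kafka topic of page-view events.
t_env.execute_sql("""
    CREATE TABLE pageviews (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'pageviews',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Declare a sink table for the aggregated results.
t_env.execute_sql("""
    CREATE TABLE pageview_counts (
        user_id      STRING,
        window_start TIMESTAMP(3),
        cnt          BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'pageview_counts',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# The whole job is one declarative statement: per-user page counts
# over 1-minute tumbling windows.
t_env.execute_sql("""
    INSERT INTO pageview_counts
    SELECT user_id, TUMBLE_START(ts, INTERVAL '1' MINUTE), COUNT(*)
    FROM pageviews
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""")
```

A product engineer who already knows SQL can write a job like this without touching the lower-level DataStream APIs, which is exactly the kind of democratization this section argues for.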

Conclusion

In this blog post we highlighted 3 of the main challenges we’ve seen organizations face when building and operating their own stream processing platforms. Overcoming each of these challenges requires engineering time and effort. While some organizations may be able to spend the upfront time and money to build their own in-house data streaming platforms, others may not be able to afford to. This is where fully managed cloud services, such as DeltaStream, can help.

Our aim at DeltaStream is to provide an easy-to-use data streaming platform to unify, process, and govern your data. Here’s how DeltaStream addresses each of the challenges above:

  1. Resource Management: DeltaStream is a serverless platform, meaning resource scaling and operations are completely taken care of – no cluster sizing or resource provisioning required.
  2. Data Heterogeneity: Out of the box, DeltaStream supports all major data serialization formats – JSON, Protobuf, and Avro. DeltaStream also has native support for many data storage systems including Kafka, Kinesis, Delta Lake, Snowflake, and PostgreSQL. DeltaStream’s rich processing capabilities also allow users to filter, enrich, and transform data to mitigate data quality issues.
  3. Streaming Knowledge Gaps: DeltaStream exposes an easy-to-use SQL interface for interacting with all streaming resources and streaming queries. Tools for sampling data and testing queries are also provided to help users iterate faster.

If you want to learn more about how DeltaStream can enable stream processing in your organization, schedule a demo with us. If you want to try DeltaStream out yourself, sign up for a free trial.