27 Feb 2025

Min Read

5 Signs It’s Time to Move from Batch Processing to Real-Time

In the past decade, we’ve witnessed a fundamental transformation in the way companies handle their data. Traditionally, organizations relied on batch processing, which involves collecting and processing data at fixed intervals. This worked well in slower-paced industries where insights weren’t needed instantly. However, in a world where speed and real-time decisions are everything, batch processing can feel like an outdated relic, unable to keep up with the demands of instant decision-making and customer expectations. So, how do you know if your business is ready to make the leap from batch to real-time processing? Below, we’ll explore five telltale signs that it’s time to leave batch behind and embrace real-time systems for a more agile, responsive business.

1. Delayed Decision-Making Is Impacting Outcomes

In many industries, the ability to make decisions quickly is the difference between seizing an opportunity and losing it forever. If delays in data availability caused by batch processing consistently hinder your decision-making, your business is suffering.

For example, imagine a retailer that runs inventory updates only once a day through batch processes. If a product sells out in the morning but isn’t flagged as unavailable until the nightly update, the company risks frustrating customers with out-of-stock orders. In contrast, a real-time system would update inventory levels immediately, ensuring that availability information is always accurate.

Delayed decisions caused by outdated data can also lead to financial losses, missed revenue opportunities, and compliance risks in industries such as banking, healthcare, and manufacturing. If you say, “We could’ve avoided this if we had known sooner,” consider real-time processing.

2. Customer Expectations for Real-Time Experiences

Today’s customers expect instant gratification. Whether they want real-time updates on their food delivery, immediate approval for a loan application, or a seamless shopping experience, the demand for speed is non-negotiable. With its inherent lag, batch processing simply can’t meet these expectations.

Take, for example, the rise of ride-sharing apps like Uber or Lyft. These platforms rely on real-time data to match drivers with riders, calculate arrival times, and adjust pricing dynamically. A batch system would create noticeable delays and undermine the entire user experience.

If you receive complaints about laggy services, slow responses, or poor user experience, this is a strong indicator that you need to adopt real-time systems to meet customer expectations.

3. Data Volumes Are Exploding

The amount of data businesses collect today is staggering and growing exponentially. Whether it’s customer interactions, IoT device outputs, social media activity, or transaction data, the challenge is collecting and processing this data efficiently.

Batch processing often struggles to handle high data volumes. Processing large datasets in a single batch can lead to delays, system overloads, and inefficiencies. On the other hand, real-time processing is designed to handle continuous streams of data, breaking them into manageable chunks and processing them as they arrive.

If your data pipelines are becoming unmanageable and your batch processes are taking longer and longer to run, it’s time to shift to a real-time architecture. Real-time systems allow you to scale as data volumes grow, ensuring your business operations remain smooth and efficient.

4. Operational Bottlenecks in Data Pipelines

Batch processing systems can create bottlenecks in your data pipeline, where data piles up waiting for the next scheduled processing window. These bottlenecks can cause delays across your organization, especially when multiple teams rely on the same data to perform their functions.

For example, a finance team waiting for overnight sales reports to run forecasts, a marketing team waiting for campaign performance data, or an operations team waiting for stock updates can all face unnecessary delays due to batch processing constraints.

With real-time systems, data flows continuously, eliminating these bottlenecks and ensuring that teams have access to the insights they need, exactly when they need them. If your teams constantly wait for data to do their jobs, it’s time to break free of batch and move to real-time processing.

5. Business Use Cases Demand Continuous Insights

Certain business use cases simply cannot function without real-time data. These include fraud detection, dynamic pricing, predictive maintenance, and real-time monitoring of IoT devices. Batch processing cannot support these use cases because it relies on processing data after the fact – by which point, the window to act has often already closed.

Take fraud detection as an example. Identifying and preventing fraudulent transactions requires real-time monitoring and analysis of incoming data streams in banking. A batch system that only processes transactions at the end of the day would miss the opportunity to block fraudulent activity in real-time, exposing the business and its customers to significant risks.

If your business expands into use cases requiring immediate action based on fresh data, batch processing will hold you back. Real-time systems provide the continuous insights needed to support these advanced use cases and unlock new growth opportunities.

Making the Transition from Batch Processing to Real-Time

Transitioning from batch to real-time processing is a significant shift, but it pays off. By moving to real-time systems, you can respond instantly to customer needs, operational challenges, and market changes. You’ll also future-proof your organization, ensuring you can scale with growing data volumes and stay competitive in an increasingly real-time world.

If you see one or more of these signs in your business – delayed decisions, lagging customer experiences, overwhelmed data pipelines, or a need for continuous insights – it’s time to act. Although leaving batch processing behind may feel daunting, it’s a necessary step to meet the demands of modern business and thrive in a real-time world.

The sooner you make the move, the sooner you can start capitalizing on the benefits of real-time systems – faster decisions, happier customers, and a more agile business. So, are you ready for real-time? The signs are all there.

19 Feb 2025

Min Read

The 8 Most Impactful Apache Flink Updates

With Apache Flink 2.0 fast approaching, and as a companion to our recent blog, “What’s Coming in Apache Flink 2.0?”, I thought I’d look back on some of the most impactful updates we’ve seen since Flink was released in 2014. Apache Flink is an open-source, distributed stream processing framework that has become a cornerstone of real-time data processing. Since its release, Flink has continued to innovate, pushing the boundaries of what stream and batch processing systems can achieve. With its powerful abstractions and robust scalability, Flink has empowered organizations to process large-scale data across every business sector, and over the years it has evolved into a leading stream processing framework. With that intro out of the way, let’s dive into some history.

1. Introduction of Stateful Stream Processing

One of Apache Flink’s foundational updates was the introduction of stateful stream processing, which set it apart from traditional stream processing systems. Flink’s ability to maintain application state across events unlocked new possibilities, such as implementing complex event-driven applications and providing exactly-once state consistency guarantees.

This update addressed one of the biggest challenges in stream processing: ensuring that data remains consistent even during system failures. Flink’s robust state management capabilities have been critical for financial services, IoT applications, and fraud detection systems, where reliability is paramount.
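
To make this concrete, here is a minimal sketch in Flink SQL, assuming a hypothetical transactions source with card_id and amount columns. The grouped aggregation below keeps a running total per card; those per-key aggregates are exactly the application state that Flink snapshots in checkpoints for exactly-once recovery.

  -- Continuously updated per-card totals; the per-key aggregates are
  -- managed state that Flink checkpoints so results survive failures.
  SELECT card_id,
         COUNT(*)    AS txn_count,
         SUM(amount) AS total_spent
  FROM transactions
  GROUP BY card_id;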

2. Support for Event Time and Watermarks

Flink revolutionized stream processing by introducing event-time processing and the concept of watermarks. Unlike systems that rely on processing time (the time at which an event is processed by the system), Flink’s event-time model processes data based on the time when an event actually occurred. This feature enabled users to handle out-of-order data gracefully, a common challenge in real-world applications.

With watermarks, Flink can track the progress of event time and trigger computations once all relevant data has arrived. This feature has been a game-changer for building robust applications that rely on accurate, time-sensitive analytics, such as monitoring systems, real-time recommendation engines, and predictive analytics.
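
As a brief sketch of how this looks in Flink SQL (assuming a hypothetical Kafka topic of JSON click events), the watermark is declared on the event-time column in the table definition; here Flink will tolerate events arriving up to five seconds late and out of order:

  CREATE TABLE clicks (
    user_id    STRING,
    url        STRING,
    event_time TIMESTAMP(3),
    -- Declare event time and allow up to 5 seconds of out-of-order data
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'clicks',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
  );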

3. The Blink Planner Integration

In 2019, Flink introduced the modern planner (sometimes referred to as Blink), which significantly improved Flink’s SQL and Table API capabilities. Initially developed by Alibaba, the Blink planner was integrated into the Flink ecosystem to optimize query execution for both batch and streaming data. It offered enhanced performance, better support for ANSI SQL compliance, and more efficient execution plans.

This integration was a turning point for Flink’s usability, making it accessible to a broader audience, including data engineers and analysts who preferred working with SQL instead of Java or Scala APIs. It also established Flink as a strong contender in the world of streaming SQL, competing with other frameworks like Apache Kafka Streams and Apache Beam.

4. Kubernetes Native Deployment

With the rise of container orchestration systems like Kubernetes, Flink adapted to modern infrastructure needs by introducing native Kubernetes support in version 1.10, released in 2020. This update allowed users to seamlessly deploy and manage Flink clusters on Kubernetes, leveraging its scalability, resilience, and operational efficiency.

Flink’s Kubernetes integration simplified cluster management by enabling dynamic scaling, fault recovery, and resource optimization. This update also made it easier for organizations to integrate Flink into cloud-native environments, providing greater operational agility for companies adopting containerized workloads.

5. Savepoints and Checkpoints Enhancements

Over the years, Flink has consistently improved its checkpointing and savepoint mechanisms to enhance fault tolerance. Checkpoints allow Flink to create snapshots of application state during runtime, enabling automatic recovery in the event of failures. Conversely, savepoints are user-triggered, allowing for controlled application updates, upgrades, or redeployments.

Recent updates have focused on improving the efficiency and storage options for checkpoints and savepoints, including support for cloud-native storage systems like Amazon S3 and Google Cloud Storage. These enhancements have made it easier for enterprises to achieve high availability and reliability in mission-critical streaming applications.
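
As a rough illustration (the bucket path is hypothetical), periodic checkpoints can be enabled and pointed at cloud object storage from Flink’s SQL client with settings like the following; savepoints, by contrast, are triggered on demand, typically through the Flink CLI or REST API:

  -- Take a checkpoint every 30 seconds and store it in S3
  SET 'execution.checkpointing.interval' = '30s';
  SET 'state.checkpoints.dir' = 's3://my-bucket/flink/checkpoints';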

6. Flink’s SQL and Table API Advancements

Flink’s SQL and Table API have evolved significantly over the years, making Flink more user-friendly for developers and analysts. Recent updates have introduced support for streaming joins, materialized views, and advanced windowing functions, enabling developers to implement complex queries with minimal effort.
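
For instance, a streaming join in Flink SQL can be bounded by event time (an interval join). The sketch below assumes hypothetical orders and shipments tables whose order_time and shipped_at columns are event-time attributes:

  -- Match each order with a shipment that occurred within 4 hours of it
  SELECT o.order_id, o.amount, s.shipped_at
  FROM orders o
  JOIN shipments s
    ON o.order_id = s.order_id
   AND s.shipped_at BETWEEN o.order_time AND o.order_time + INTERVAL '4' HOUR;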

Flink’s SQL advancements have also enabled seamless integration with popular BI tools like Apache Superset, Tableau, and Power BI, making it easier for organizations to generate real-time insights from their streaming data pipelines.

7. PyFlink: Python Support

To broaden its appeal to the growing data science community, Flink introduced PyFlink, its Python API, as part of version 1.9, released in 2019. This update has been particularly impactful as Python remains the go-to language for data science and machine learning. With PyFlink, developers can write Flink applications in Python, access Flink’s powerful stream processing capabilities, and integrate machine learning models directly into their pipelines.

PyFlink has helped Flink bridge the gap between stream processing and machine learning, enabling use cases like real-time anomaly detection, fraud prevention, and personalized recommendations.

8. Flink Stateful Functions (StateFun)

Another transformative update was the introduction of Flink Stateful Functions (StateFun). StateFun extends Flink’s stateful processing capabilities by providing a framework for building distributed, event-driven applications with strong state consistency. This addition made Flink a natural fit for microservices architectures, enabling developers to build scalable, event-driven applications with minimal effort.

Conclusion

Since its inception, Apache Flink has continually evolved to meet the demands of modern data processing. From its innovative stateful stream processing to powerful integrations with SQL, Python, and Kubernetes, Flink has redefined what’s possible in real-time analytics. As organizations embrace real-time data-driven decision-making, Flink’s ongoing innovations ensure it remains at the forefront of stream processing technologies. With a strong community, enterprise adoption, and cutting-edge features, Flink’s future looks brighter than ever.

27 Jan 2025

Min Read

A Guide to the Top Stream Processing Frameworks

Every second, billions of data points pulse through the digital arteries of modern business. A credit card swipe, a sensor reading from a wind farm, or stock trades on Wall Street  – each signal holds potential value, but only if you can catch it at the right moment. Stream processing frameworks enable organizations to process and analyze massive streams of data with low latency. This blog explores some of the most popular stream processing frameworks available today, highlighting their features, advantages, and use cases. These frameworks form the backbone of many real-time applications, enabling businesses to derive meaningful insights from ever-flowing torrents of data.

What is Stream Processing?


Stream processing refers to the practice of processing data incrementally as it is generated rather than waiting for the entire dataset to be collected. This allows systems to respond to events or changes in real-time, making it invaluable for time-sensitive applications. For example:

  • Fraud detection in banking: Transactions can be analyzed in real-time for suspicious activity.
  • E-commerce recommendations: Streaming data from user interactions can be used to offer instant product recommendations.
  • IoT monitoring: Data from IoT devices can be processed continuously for system updates or alerts.

Stream processing frameworks enable developers to build, deploy, and scale real-time applications. Let’s examine some of the most popular ones.

Apache Kafka Streams

Overview:

Apache Kafka Streams, an extension of Apache Kafka, is a lightweight library for building applications and microservices. It provides a robust API for processing data streams directly from Kafka topics and writing the results back to other Kafka topics or external systems. The API only supports JVM languages, including Java and Scala.

Key Features:

  • Fully integrated with Apache Kafka, making it a seamless choice for Kafka users.
  • Provides stateful processing with the ability to maintain in-memory state stores.
  • Scalable and fault-tolerant architecture.
  • Built-in support for windowing operations and event-time processing.

Use Cases:

  • Real-time event monitoring and processing.
  • Building distributed stream processing applications.
  • Log aggregation and analytics.

Kafka Streams is ideal for developers already using Kafka for message brokering, as it eliminates the need for additional stream processing infrastructure.

Apache Flink

Overview:
Apache Flink is a highly versatile and scalable stream processing framework that excels at handling unbounded data streams. It offers powerful features for stateful processing, event-time semantics, and exactly-once guarantees.


Key Features:

  • Support for both batch and stream processing in a unified architecture.
  • Event-time processing: Handles out-of-order events using watermarks.
  • High fault tolerance with distributed state management.
  • Integration with popular tools such as Apache Kafka, Apache Cassandra, and HDFS.


Use Cases:

  • Complex event processing in IoT applications.
  • Fraud detection and risk assessment in finance.
  • Real-time analytics for social media platforms.


Apache Flink is particularly suited for applications requiring low-latency processing, high throughput, and robust state management.

Apache Spark Streaming

Overview:
Apache Spark Streaming extends Apache Spark’s batch processing capabilities to real-time data streams. Its micro-batch architecture processes streaming data in small, fixed intervals, making it easy to build real-time applications.


Key Features:

  • Micro-batch processing: Processes streams in discrete intervals for near-real-time results.
  • High integration with the larger Spark ecosystem, including MLlib, GraphX, and Spark SQL.
  • Scalable and fault-tolerant architecture.
  • Compatible with popular data sources like Kafka, HDFS, and Amazon S3.


Use Cases:

  • Live dashboards and analytics.
  • Real-time sentiment analysis for social media.
  • Log processing and monitoring for large-scale systems.


While its micro-batch approach results in slightly higher latency compared to true stream processing frameworks like Flink, Spark Streaming is still a popular choice due to its ease of use and integration with the Spark ecosystem.

Apache Storm

Overview:
Apache Storm is one of the pioneers in the field of distributed stream processing. Known for its simplicity and low latency, Storm is a reliable choice for real-time processing of high-velocity data streams.


Key Features:

  • Tuple-based processing: Processes data streams as tuples in real time.
  • High fault tolerance with automatic recovery of failed components.
  • Horizontal scalability and support for a wide range of programming languages.
  • Simple architecture with “spouts” (data sources) and “bolts” (data processors).


Use Cases:

  • Real-time event processing for online gaming.
  • Fraud detection in financial transactions.
  • Processing sensor data in IoT systems.


Although Apache Storm has been largely overtaken by newer frameworks like Flink and Kafka Streams, it remains an option for applications where low latency and simplicity are key priorities. It is being actively maintained and updated, with version 2.7.1 released in November 2024.

Google Dataflow

Overview:
Google Dataflow is a fully managed, cloud-based stream processing service. It is built on the Apache Beam model, which provides a unified API for batch and stream processing and enables portability across different execution engines.


Key Features:

  • Unified programming model for batch and stream processing.
  • Integration with Google Cloud services like BigQuery, Pub/Sub, and Cloud Storage.
  • Automatic scaling and resource management.
  • Support for windowing and event-time processing.


Use Cases:

  • Real-time analytics pipelines in cloud-native applications.
  • Data enrichment and transformation for machine learning workflows.
  • Monitoring and alerting systems.


Google Dataflow is best for businesses already operating in the Google Cloud ecosystem.

Amazon Kinesis

Overview:
Amazon Kinesis is a cloud-native stream processing platform provided by AWS. It simplifies streaming data ingestion, processing, and analysis in real-time.


Key Features:

  • Fully managed service with automatic scaling.
  • Supports custom application development using the Kinesis Data Streams API.
  • Integration with AWS services such as Lambda, S3, and Redshift.
  • Built-in analytics capabilities with Kinesis Data Analytics.


Use Cases:

  • Real-time clickstream analysis for e-commerce platforms.
  • IoT telemetry data processing.
  • Monitoring application logs and metrics.

Amazon Kinesis can be the most sensible option for a company already using AWS services, as it offers a quick way to start. 

Choosing the Right Stream Processing Framework

The choice of a stream processing framework depends on your specific requirements, such as latency tolerance, scalability needs, ease of integration, and existing technology stack. For example:

  • If you’re heavily invested in Kafka, Kafka Streams is a likely fit.
  • Apache Flink is an excellent choice for low-latency, high-throughput applications and works with a wide array of data repository types.
  • Organizations with expertise in the cloud can benefit from managed services like Google Dataflow or Amazon Kinesis.

Conclusion

Stream processing frameworks are essential for extracting real-time insights from dynamic data streams. The frameworks mentioned above – Apache Kafka Streams, Flink, Spark Streaming, Storm, Google Dataflow, and Amazon Kinesis – each have unique strengths and ideal use cases. By selecting the right tool for your needs, you can unlock the full potential of real-time data processing, powering next-generation applications and services.

17 Dec 2024

Min Read

Enhancing Fraud Detection with PuppyGraph and DeltaStream

The banking and finance industry has been one of the biggest beneficiaries of digital advancements. Many technological innovations find practical applications in finance, providing convenience and efficiency that can set institutions apart in a competitive market. However, this ease and accessibility have also led to increased fraud, particularly in credit card transactions, which remain a growing concern for consumers and financial institutions.

Traditional fraud detection systems rely on rule-based methods that struggle in real-time scenarios. These outdated approaches are often reactive, identifying fraud only after it occurs. Without real-time capabilities or advanced reasoning, they fail to match fraudsters’ rapidly evolving tactics. A more proactive and sophisticated solution is essential to combat this threat effectively.

This is where graph analytics and real-time stream processing come into play. Combining PuppyGraph, the first and only graph query engine, with DeltaStream, a stream processing engine powered by Apache Flink, enables institutions to improve fraud detection accuracy and efficiency while gaining real-time capabilities. In this blog post, we’ll explore the challenges of modern fraud detection and the advantages of using graph analytics and real-time processing. We will also provide a step-by-step guide to building a fraud detection system with PuppyGraph and DeltaStream.

Let’s start by examining the challenges of modern fraud detection.

Common Fraud Detection Challenges

Credit card fraud has always been a game of cat and mouse. Even before the rise of digital processing and online transactions, fraudsters found ways to exploit vulnerabilities. With the widespread adoption of technology, fraud has only intensified, creating a constantly evolving fraud landscape that is increasingly difficult to navigate. Key challenges in modern fraud detection include:

  • Volume: Daily credit card transactions are too vast to review and identify suspicious activity manually. Automation is critical to sorting through all that data and identifying anomalies.
  • Complexities: Fraudulent activity often involves complex patterns and relationships that traditional rule-based systems can’t detect. For example, fraudsters may use stolen credit card information to make a series of small transactions before a large one or use multiple cards in different locations in a short period.
  • Real-time: The sooner fraud is detected, the less financial loss there will be. Real-time analysis is crucial in detecting and preventing transactions as they happen, especially when fraud can be committed at scale in seconds.
  • Agility: Fraudsters will adapt to new security measures. Fraud detection systems must be agile, even learning as they go, to keep up with the evolving threats and tactics.
  • False positives: While catching fraudulent transactions is essential, it’s equally important to avoid flagging legitimate transactions as fraud. False positives can frustrate customers, especially when a card is automatically locked out due to legitimate purchases. As a consequence, they can adversely affect revenue.

To tackle these challenges, businesses require a solution that processes large volumes of data in real-time, identifies complex patterns, and evolves with new fraud tactics. Graph analytics and real-time stream processing are essential components of such a system. By mapping and analyzing transaction networks, businesses can more effectively detect anomalies in customer behavior and identify potentially fraudulent transactions.

Leveraging Graph Analytics for Fraud Detection

Traditional fraud detection methods analyze individual transactions in isolation. This can miss connections and patterns that emerge when we examine the bigger picture. Graph analytics allows us to visualize and analyze transactions as a network of connected things.

Think of it like a social network. Each customer, credit card, merchant, and device becomes a node in the graph, and each transaction connects those nodes. We can find hidden patterns and anomalies that indicate fraud by looking at the relationships between nodes.

Figure: an example schema for fraud detection use case

Here’s how graph analytics can be applied to fraud detection:

  • Finding suspicious connections: Graph algorithms can discover unusual patterns of connections between entities. For example, if the same person uses multiple credit cards in different locations in a short period or a single card is used to buy from a group of merchants known for fraud, those connections will appear in the graph and be flagged as suspicious.
  • Uncovering fraud rings: Fraudsters often work within the same circles, using multiple identities and accounts to carry out scams. Graph analytics can find those complex networks of people and their connections, helping to identify and potentially break up entire fraud rings.
  • Surfacing identity theft: When a stolen credit card is used, the spending patterns will generally be quite different from the cardholder’s normal behavior. By looking at the historical and current transactions within a graph, you can see sudden changes in spending habits, locations, and types of purchases that may indicate identity theft.
  • Predicting future fraud: Graph analytics can predict future fraud by looking at historical data and the patterns that precede a fraudulent transaction. By predicting fraud before it happens, businesses can take action to prevent it.

Of course, all of these benefits are extremely helpful. However, the biggest hurdle to realizing them is the complexity of implementing a graph database. Let’s look at some of those challenges and how PuppyGraph can help users avoid them entirely.

Challenges of Implementing and Running Graph Databases

As shown, graph databases can be an excellent tool for fraud detection. So why aren’t they used more frequently? This usually boils down to implementing and managing them, which can be complex for those unfamiliar with the technology. The hurdles that come with implementing a graph database can far outweigh the benefits for some businesses, even stopping them from adopting this technology altogether. Here are some of the issues generally faced by companies implementing graph databases:

  • Cost: Traditional relational databases have been the norm for decades, and many organizations have invested heavily in their infrastructure. Switching to a graph database or even running a proof of concept requires a significant upfront investment in new software, hardware, and training. 
  • Implementing ETL: Extracting, transforming, and loading (ETL) data into a graph database can be tricky and time-consuming. Data needs to be restructured to fit into a graph model, which requires knowledge of the underlying data to be moved over and how to represent these entities and relationships within a graph model. This requires specific skills and adds to the implementation time and cost, meaning the benefits may be delayed.
  • Bridging the skills gap: Graph databases require a different data modeling and querying approach from traditional databases. In addition to the previous point regarding ETL, finding people with the skills to manage, maintain, and query the data within a graph database can also be challenging. Without these skills, graph technology adoption is mostly dead in the water.
  • Integration challenges: Integrating a graph database with existing systems and applications is complex. This usually involves taking the output from graph queries and mapping it into downstream systems, which requires careful planning and execution. Getting data to flow smoothly and remain compatible across different systems is a significant undertaking.

These challenges highlight the need for solutions that make graph database adoption and management more accessible. A graph query engine like PuppyGraph addresses these issues by enabling teams to integrate their data and query it as a graph in minutes without the complexity of ETL processes or the need to set up a traditional graph database. Let’s look at how PuppyGraph helps teams become graph-enabled without ETL or the need for a graph database.

How PuppyGraph Solves Graph Database Challenges

PuppyGraph is built to tackle the challenges that often hinder graph database adoption. By rethinking graph analytics, PuppyGraph removes many entry barriers, opening up graph capabilities to more teams than otherwise possible. Here’s how PuppyGraph addresses many of the hurdles mentioned above:

  • Zero-ETL: One of PuppyGraph’s most significant advantages is connecting directly to your existing data warehouses and data lakes—no more complex and time-consuming ETL. There is no need to restructure data or create separate graph databases. Simply connect the graph query engine directly to your SQL data store and start querying your data as a graph in minutes.
  • Cost: PuppyGraph reduces the expenses of graph analytics by using your existing data infrastructure. There is no need to invest in new database infrastructure or software and no ongoing maintenance costs of traditional graph databases. Eliminating the ETL process significantly reduces the engineering effort required to build and maintain fragile data pipelines, saving time and resources.
  • Reduced learning curve: Traditional graph databases often require users to master complex graph query languages for every operation, including basic data manipulation. PuppyGraph simplifies this by functioning as a graph query engine that operates alongside your existing SQL query engine using the same data. You can continue using familiar SQL tools for data preparation, aggregation, and management. When more complex queries suited to graph analytics arise, PuppyGraph handles them seamlessly. This approach saves time and allows teams to reserve graph query languages specifically for graph traversal tasks, reducing the learning curve and broadening access to graph analytics.
  • Multi-query language support: Engineers can continue to use their existing SQL skills and platform, allowing them to leverage graph querying when needed. The platform offers many ways to build graph queries, including Gremlin and Cypher support, so your existing team can quickly adopt and use graph technology.
  • Effortless scaling: PuppyGraph’s architecture separates compute and storage so it can easily handle petabytes of data. By leveraging their underlying SQL storage, teams can effortlessly scale their compute as required. You can focus on extracting value from your data without scaling headaches.
  • Fast deployment: With PuppyGraph, you can deploy and start querying your data as a graph in 10 minutes. There are no long setup processes or complex configurations. Fast deployment means you can start seeing the benefits of graph analytics and speed up your fraud detection.

In short, PuppyGraph removes the traditional barriers to graph adoption so more institutions can use graph analytics for fraud detection use cases. By simplifying, reducing costs, and empowering existing teams with effortless graph adoption, PuppyGraph makes graph technology accessible for all teams and organizations.

Real-Time Fraud Prevention with DeltaStream

Speed is key in the fight against fraud, and responsiveness is crucial to preventing or minimizing the impact of an attack. Systems and processes that act on events with minimal latency can mean the difference between successful and unsuccessful cyber attacks. DeltaStream empowers businesses to analyze and respond to suspicious transactions in real-time, minimizing losses and preventing further damage.

Why Real-Time Matters:

  • Immediate Response: Rapid incident response means security and data teams can detect, isolate, and trigger mitigation protocols, minimizing their vulnerability window faster than ever. With real-time data and sub-second latency, the Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) can be significantly reduced.
  • Proactive Prevention: Data and security teams can identify behavior patterns as they emerge and implement mitigation tactics. Real-time allows for continuous monitoring of system health and security with predictive models. 
  • Improved Accuracy: Real-time data provides a more accurate view of customer behavior for precise detection. Threats are more complex than ever and often involve multi-stage attack patterns; streaming data aids in identifying these complex and ever-evolving threat tactics.

DeltaStream’s Key Features:

  • Speed: Increase the speed of your data processing and your team’s ability to create data applications. Reduce latency and cost by shifting your data transformations out of your warehouse and into DeltaStream. Data teams can also quickly write queries in SQL to create analytics pipelines with no other complex languages to learn.
  • Team Focus: Eliminate maintenance tasks with our continually optimizing Flink operator. Your team isn’t focused on infrastructure, meaning they can focus on building and strengthening pipelines.
  • Unified View: An organization’s data rarely comes from just one source. Process streaming data from multiple sources in real-time to get a complete picture of activities. This means transaction data, user behavior, and other relevant signals can be analyzed together as they occur.

By combining PuppyGraph’s graph analytics with DeltaStream’s real-time processing, businesses can create a dynamic fraud detection system that stays ahead of evolving threats.

Step-by-Step Tutorial: DeltaStream and PuppyGraph

In this tutorial, we go through the high-level steps of integrating DeltaStream and PuppyGraph. 

The detailed steps are available at:

Starting a Kafka Cluster

We start a Kafka Server as the data input. (Later in the tutorial, we’ll send financial data through Kafka.)

We create topics for financial data like this:

  bin/kafka-topics.sh --create --topic kafka-Account --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

Setting up DeltaStream

Connecting to Kafka

Log in to the DeltaStream console. Then, navigate to Resources and add a Kafka Store – for example, kafka_demo – with the Kafka cluster parameters we created in the previous step.

Next, in the Workspace, create a DeltaStream database – for example, kafka_db.

After that, we use DeltaStream SQL to create streams for the Kafka topics we created in the previous step. A stream describes the topic’s physical layout so it can be easily referenced with SQL. Once we declare the streams, we can build streaming data pipelines to transform, enrich, aggregate, and prepare streaming data for analysis in PuppyGraph. Here is an example of one of the streams we create in DeltaStream for a Kafka topic: the account_stream, defined from the kafka-Account topic.

  CREATE STREAM account_stream (
    "label" STRING,
    "accountId" BIGINT,
    "createTime" STRING,
    "isBlocked" BOOLEAN,
    "accoutType" STRING,
    "nickname" STRING,
    "phonenum" STRING,
    "email" STRING,
    "freqLoginType" STRING,
    "lastLoginTime" STRING,
    "accountLevel" STRING
  ) WITH (
    'topic' = 'kafka-Account',
    'value.format' = 'JSON'
  );

Next, we’ll define the accountrepayloan_stream from the kafka-AccountRepayLoan topic:

  CREATE STREAM accountrepayloan_stream (
    "label" STRING,
    "accountrepayloandid" BIGINT,
    "loanId" BIGINT,
    "amount" DOUBLE,
    "createTime" STRING
  ) WITH (
    'topic' = 'kafka-AccountRepayLoan',
    'value.format' = 'JSON'
  );

And finally, we’ll define the accounttransferaccount_stream from the kafka-AccountTransferAccount topic. You’ll note there are both a fromid and a toid that link to the accountId. This allows us to enrich data in the account payment stream with account information from the account_stream and combine it with the account transfer stream.

With DeltaStream, this can then easily be written out as a more succinct, enriched stream of data to our destination, such as Snowflake or Databricks. We combine data from the three streams with just the information we want, preparing the data in real-time from multiple streaming sources, which we then graph using PuppyGraph. A sketch of what such an enrichment query might look like follows the stream definition below.

  CREATE STREAM accounttransferaccount_stream (
    "label" VARCHAR,
    "accounttransferaccountid" BIGINT,
    "fromid" BIGINT,
    "toid" BIGINT,
    "amount" DOUBLE,
    "createTime" STRING,
    "ordernum" BIGINT,
    "comment" VARCHAR,
    "paytype" VARCHAR,
    "goodstype" VARCHAR
  ) WITH (
    'topic' = 'kafka-AccountTransferAccount',
    'value.format' = 'JSON'
  );
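
To illustrate the enrichment described above, here is a rough sketch of a DeltaStream query that joins the transfer stream with account details from account_stream. The output stream name and column aliases are ours, and the exact stream-to-stream join options (such as time bounds) should follow DeltaStream’s documentation:

  -- Illustrative only: enrich each transfer with details of the sending account
  CREATE STREAM enriched_transfers AS
    SELECT
      t."accounttransferaccountid" AS transfer_id,
      t."fromid"                   AS from_account,
      t."toid"                     AS to_account,
      t."amount",
      t."createTime"               AS transfer_time,
      a."isBlocked"                AS from_account_blocked,
      a."accountLevel"             AS from_account_level
    FROM accounttransferaccount_stream t
    JOIN account_stream a
      ON t."fromid" = a."accountId";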

Adding a Store for Integration

PuppyGraph will connect to the stores and allow querying as a graph.

Once our data is ready in the desired format, we can write streaming SQL queries in DeltaStream that continuously write the data into the desired storage. In this case, we can use DeltaStream’s native integration with Snowflake or Databricks, where we will use PuppyGraph. Here is an example of writing data continuously into a table in Snowflake or Databricks from DeltaStream:

  CREATE TABLE ds_account WITH (
    'store' = '<store_name>'
    <Storage parameters>
  ) AS
  SELECT * FROM account_stream;

Starting data processing

Now, you can start a Kafka Producer to send the financial JSON data to Kafka. For example, to send account data, run:

  kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-Account < json_data/Account.json

DeltaStream will process the data, and then we will query it as a graph.

Query your data as a graph

You can start PuppyGraph using Docker. Then upload the Graph schema, and that’s it! You can now query the financial data as a graph as DeltaStream processes it.

Start PuppyGraph using the following command:

  docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
    -e DATAACCESS_DATA_CACHE_STRATEGY=adaptive \
    -e <STORAGE PARAMETERS> \
    --name puppy --rm -itd puppygraph/puppygraph:stable

Log into the PuppyGraph Web UI at http://localhost:8081 with the following credentials:

Username: puppygraph

Password: puppygraph123

Upload the schema: Select the file schema_<storage>.json in the Upload Graph Schema JSON section and click Upload.

Navigate to the Query panel on the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin. For example, to query the accounts owned by a specific company and the transaction records of these accounts, you can run:

  1. g.V("Company[237]")
  2. .outE('CompanyOwnAccount').inV()
  3. .outE('AccountTransferAccount').inV()
  4. .path()

Conclusion

As this blog post explores, traditional fraud detection methods simply can’t keep pace with today’s sophisticated criminals. Real-time analysis and the ability to identify complex patterns are critical. By combining the power of graph analytics with real-time stream processing, businesses can gain a significant advantage against fraudsters.

PuppyGraph and DeltaStream offer robust and accessible solutions for building real-time dynamic fraud detection systems. We’ve seen how PuppyGraph unlocks hidden relationships and how DeltaStream analyzes real-time data to quickly and accurately identify and prevent fraudulent activity. Ready to take control and build a future-proof, graph-enabled fraud detection system? Try PuppyGraph and DeltaStream today. Visit PuppyGraph and DeltaStream to get started!

13 Nov 2024

Min Read

What’s Coming in Apache Flink 2.0?

As champions for Apache Flink, we are excited for the 2.0 release and all that it will bring. Apache Flink 1.0 was released in 2016, and while we don’t have an exact release date, it looks like 2.0 will be released in late 2024/early 2025. Version 1.20 was just released in August 2024. Version 2.0 is set to be a major milestone release, marking a significant evolution in the stream processing framework. This blog runs down some of the key features and changes coming in Flink 2.0.

Disaggregated State Storage and Management

One of the most exciting features of Flink 2.0 is the introduction of disaggregated state storage and management. It will utilize a Distributed File System (DFS) as the primary storage for state data. This architecture separates compute and storage resources, addressing key scalability and performance needs for large-scale, cloud-native data processing.

Core Advantages of Disaggregated State Storage

  1. Improved Scalability
    By decoupling storage from compute resources, Flink can manage massive datasets—into the hundreds of terabytes—without being constrained by local storage. This separation enables efficient scaling in containerized and cloud environments.
  2. Enhanced Recovery and Rescaling
    The new architecture supports faster state recovery on job restarts, efficient fault tolerance, and quicker job rescaling with minimal downtime. Key components include shareable checkpoints and LazyRestore for on-demand state recovery.
  3. Optimized I/O Performance
    Flink 2.0 uses asynchronous execution and grouped remote state access to minimize the latency impact of remote storage. A hybrid caching mechanism can improve cache efficiency, providing up to 80% better throughput than traditional file-level caching.
  4. Improved Batch Processing
    Disaggregated state storage enhances batch processing by better handling large state data and integrating batch and stream processing tasks, making Flink more versatile across diverse workloads.
  5. Dynamic Resource Management
    The architecture enables flexible resource allocation, minimizing CPU and network usage spikes during maintenance tasks like compaction and cleanup.

API and Configuration Changes

Several API and configuration changes will be introduced, including:

  • Removal of deprecated APIs, including the DataSet API and Scala versions of DataStream and DataSet APIs
  • Deprecation of the legacy SinkFunction API in favor of the Unified Sink API
  • Overhaul of the configuration layer, enhancing user-friendliness and maintainability
  • Introduction of new abstractions such as Materialized Tables, added in v1.20 and further enhanced in v2.0
  • Updates to configuration options, including proper type usage (e.g., Duration, Enum, Int)

Modernization and Unification

Flink 2.0 aims to further unify batch and stream processing:

  • Modernization of legacy components, such as replacing the legacy SinkFunction with the new Unified Sink API
  • Enhanced features that combine batch and stream processing seamlessly
  • Improvements to Adaptive Batch Execution for optimizing logical and physical plans

Performance Improvements

The community is working on making Flink’s performance on bounded streams (batch use cases) competitive with dedicated batch processors, which can further simplify your data processing stack. Planned optimizations include:

  • Dynamic Partition Pruning (DPP) to minimize I/O costs
  • Runtime Filter to reduce I/O and shuffle costs
  • Operator Fusion CodeGen to improve query execution performance

Cloud-Native Focus

Flink 2.0 is being designed with cloud-native architectures in mind:

  • Improved efficiency in containerized environments
  • Better scalability for large state sizes
  • More efficient fault tolerance and faster rescaling

This is an exciting time for Apache Flink. Version 2.0 represents a significant leap forward in unified batch and stream processing, focusing on cloud-native architectures, improved performance, and streamlined APIs. These changes aim to address the evolving needs of data-driven applications and set new standards for what’s possible in data processing. DeltaStream is proudly powered by Apache Flink and makes it easy to start running Flink in minutes. Get a free trial of DeltaStream and see for yourself.

29 Oct 2024

Min Read

A Guide to Standard SQL vs. Streaming SQL: Why Do We Need Both?

Understanding the Differences Between Standard SQL and Streaming SQL

SQL has long been a foundational tool for querying databases. Traditional SQL queries are typically run against static, historical data, generating a snapshot of results at a single point in time. However, the rise of real-time data processing, driven by applications like IoT, financial transactions, security monitoring and intrusion detection, and social media, has led to the evolution of Streaming SQL. This variant extends traditional SQL capabilities, offering features specifically designed for real-time, continuous data streams.

Standard SQL and Streaming SQL Key Differences

1. Point-in-Time vs. Continuous Queries

In standard SQL, queries are typically run once and return results based on a snapshot of data. For instance, when you query a traditional database to get the sum of all sales, it reflects only the state of data up until the moment of the query.

In contrast, Streaming SQL works with data that continuously flows in, updating queries in real-time. The same query can be run in streaming SQL, but instead of receiving a one-time result, the query is maintained in a materialized view that updates as new data arrives. This is especially useful for use cases like dashboards or monitoring systems, where the data needs to stay current.
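
As a simple illustration (assuming a hypothetical sales table and an equivalent sales_stream), the same aggregate behaves very differently in the two models:

  -- Standard SQL: a one-time snapshot of total sales at the moment of the query
  SELECT SUM(amount) AS total_sales FROM sales;

  -- Streaming SQL (Flink-style): the same aggregate over a stream is a
  -- continuous query whose result keeps updating as new sale events arrive
  SELECT SUM(amount) AS total_sales FROM sales_stream;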

2. Real-Time Processing with Window Functions

Streaming SQL introduces window functions, allowing users to segment a data stream into windows for aggregation or analysis. For example, a tumbling window is a fixed-length window (such as one minute) that collects data for aggregation over that time frame. In contrast, a hopping window is a fixed-size window that advances (hops) by a specified length. That means if you want to calculate the current inventory over the last two minutes but update the results every minute, the window size would be two minutes and the hop size one minute.

Windowing in traditional SQL is static and backward-looking, whereas in streaming SQL, real-time streams are processed continuously, updating aggregations within the described window.
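
In Flink-style streaming SQL, the inventory example above might look like the following sketch, assuming a hypothetical inventory_events stream with an event-time column event_time:

  -- Tumbling window: non-overlapping one-minute totals
  SELECT window_start, window_end, SUM(quantity) AS items_sold
  FROM TABLE(
    TUMBLE(TABLE inventory_events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
  GROUP BY window_start, window_end;

  -- Hopping window: a two-minute window that advances (hops) every minute
  SELECT window_start, window_end, SUM(quantity) AS items_sold
  FROM TABLE(
    HOP(TABLE inventory_events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE, INTERVAL '2' MINUTE))
  GROUP BY window_start, window_end;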

3. Watermarks for Late Data Handling

In streaming environments, data can arrive late or out of order. To manage this, Streaming SQL introduces watermarks.  Watermarks mark the point in time up to which the system expects to have received data. For instance, if an event is delayed by a minute, a watermark ensures it’s still processed if it arrives within that window, making streaming SQL robust for real-world, unpredictable data flows. Conventional SQL has no ability or need to address this scenario.

4. Continuous Materialization

One of the unique aspects of Streaming SQL is the ability to materialize views incrementally. Unlike traditional databases that recompute queries when data changes, streaming SQL continuously maintains these views as new data flows in. This approach dramatically improves performance for real-time analytics by avoiding expensive re-computations.
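
In engines that support incrementally maintained views (DeltaStream among them), this is typically expressed along the following lines; the view and column names here are illustrative:

  -- Kept up to date incrementally as new events arrive on sales_stream
  CREATE MATERIALIZED VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total_sales
    FROM sales_stream
    GROUP BY region;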

Use Cases for Streaming SQL

The rise of streaming SQL has been a game-changer across industries. Common applications include:

  • Real-time analytics dashboards, such as stock trading platforms or retail systems where quick insights are needed to make rapid decisions.
  • Event-driven applications where alerts and automations are triggered by real-time data, such as fraud detection or IoT sensor monitoring.
  • Real-time customer personalization, where user actions or preferences update in real-time to deliver timely recommendations.

Conclusion

While Standard SQL excels at querying static, historical datasets, Streaming SQL is optimized for real-time data streams, offering powerful features like window functions, watermarks, and materialized views. These advancements handle fast-changing data with low latency, offering immediate insights and automation. A July 2023 article at Datanami pegged streaming adoption growth at 177% over the previous 12 months. As more industries rely on real-time decision-making, streaming SQL is becoming a critical tool for modern data infrastructures.

23 Oct 2024

Min Read

Democratizing Data with All-in-One Streaming Solutions

In today’s fast-paced data landscape, organizations must maximize efficiency, enhance collaboration, and maintain data quality. An all-in-one streaming data solution offers a single, integrated platform for real-time data processing, which simplifies operations, reduces costs, and makes advanced tools accessible across teams. 

This blog explores the benefits of such solutions and their role in promoting a democratized data culture.

Key Benefits of All-in-One Streaming Data Solutions

Streamlined Learning Curve

All-in-one platforms simplify adoption by providing a single interface, unlike traditional setups requiring expertise in multiple tools and languages. This accelerates adoption and facilitates collaboration across teams.

Consolidated Toolset

By merging data integration, processing, and visualization into a unified system, these platforms eliminate the need to manage multiple applications. Teams can perform tasks like joins, filtering, and creating materialized views within one environment, improving workflow efficiency.

Simplified Language Support

Most all-in-one platforms use a common language, such as SQL, for all data operations. This reduces the need for proficiency in multiple languages, streamlines processes, and enables easier collaboration between team members.

Enhanced Security and Compliance

With centralized security controls, these platforms simplify the enforcement of compliance standards like GDPR and HIPAA. Fewer components reduce vulnerabilities, providing a more secure data environment.

Cost Savings

Managing multiple tools leads to increased costs, both in licensing and staffing. An all-in-one solution consolidates these tools, reducing expenses and providing long-term cost stability.

Improved Data Quality

Using a single platform for all data operations—collection, transformation, streaming, and analysis—minimizes errors and ensures consistent validation, resulting in more accurate and reliable insights.

Centralized Platform for Unified Operations

An all-in-one solution enables teams to handle all aspects of data processing on one platform, from combining datasets to filtering large volumes of data and creating materialized views for real-time access. This integrated approach reduces errors and boosts operational efficiency.

Single Interface for Event Streams

These platforms provide a single interface to access and work with event streams, regardless of location or device. This consistent access allows teams to monitor and manage streams globally, facilitating seamless data handling across distributed environments.

Breaking Down Silos

All-in-one platforms promote collaboration by breaking down data silos, enabling cross-functional teams to work with shared data in real-time. Whether in marketing, sales, engineering, or product development, everyone has access to the same data streams, facilitating collaboration and maximizing the value of data.

Democratized Data Access and Collaboration

Centralized Data Access

In traditional environments, only a few technical users control critical data pipelines. An all-in-one solution democratizes data by giving all team members access to the same tools, empowering them to make data-driven decisions regardless of technical expertise.

Simplified Data Analysis

These platforms provide intuitive tools for querying and visualizing data, allowing less technically sophisticated users to engage in data analysis. This extends the role of data across the organization, improving decision-making and fostering collaboration.

Cross-Functional Collaboration

The integration of all tools into a single platform enhances collaboration across functions. Teams from different departments can work together more efficiently, aligning on data-driven strategies without needing to navigate disparate systems or fight through inconsistent user access, i.e., some people may have access to tools A and B while others only to tools C and D.

Reduced Effort

With only one platform to learn, teams experience reduced effort and cognitive load, freeing up more time to focus on deriving insights rather than managing multiple tools. This ease of use encourages widespread adoption and enhances overall productivity.

Scalability and Flexibility

All-in-one solutions are designed for scalability, enabling organizations to grow without constantly adopting new tools or overhauling systems. Whether increasing data streams or integrating new sources, these platforms scale effortlessly with business needs.

Conclusion

Is this the promise of Data Mesh? All-in-one streaming data solutions are revolutionizing how organizations handle real-time data. By consolidating tools, simplifying workflows, and fostering collaboration, these platforms democratize data access while maintaining data quality and operational efficiency. Whether you’re a small team seeking streamlined processes or a large enterprise focused on scalability, the benefits of an all-in-one solution are clear. Investing in such platforms is a strategic move to unlock the full potential of real-time data.

DeltaStream can be part of your toolbox, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.

01 Oct 2024

Min Read

Streaming Analytics vs. Real-time Analytics: Key Differences to Know

Introduction


Businesses rely heavily on timely insights to make informed decisions in today’s data-driven world. Two key approaches that enable organizations to derive value from their data as it is generated are streaming analytics and real-time analytics. While both terms are often used interchangeably, they differ in how they operate and the types of use cases they address. This blog post will delve into the core differences between streaming and real-time analytics, their respective architectures, and practical applications.

Defining Streaming and Real-Time Analytics


Streaming Analytics: Streaming analytics refers to analyzing and acting on data as it flows into the system continuously. Data is processed in real-time as it is ingested, typically in small, unbounded batches or event streams. These streams come from various sources like IoT devices, log files, and social media, with the analytics system making decisions or generating insights from the live data.

Real-Time Analytics: Real-time analytics, while similar in time sensitivity, typically involves processing a dataset or query with minimal latency. It quickly processes data to provide near-instantaneous insights, although the data is often stored or batched before it is analyzed. Real-time analytics operates in response to queries where results are expected from data as it enters the system, such as personalized advertising. Typically, there are two types:

  • On-demand: Provides analytic results only when a query is submitted.
  • Continuous: Proactively sends alerts or triggers responses in other systems as the data is generated.

Differences in Data Ingestion and Processing


Streaming Analytics: In streaming analytics, data is processed in motion. As the data arrives in the system, it is immediately ingested and analyzed. The focus is on processing and analyzing the continuous flow of data, often in a windowed manner, to derive immediate actions from the data stream. This involves handling large volumes of unbounded, real-time data flows.

Example: A fraud detection system in a bank continuously monitors transactions. The moment suspicious activity is detected from a stream of transaction data, the system flags or blocks the transaction in real time.
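To make this concrete, here is a minimal, framework-agnostic Python sketch of stream-style processing: each event is evaluated the moment it arrives, against a sliding window of recent activity. The event fields, the 60-second window, and the $1,000 threshold are illustrative assumptions, not a description of any particular fraud system.

from collections import defaultdict, deque

WINDOW_SECONDS = 60          # sliding window length (assumption)
AMOUNT_THRESHOLD = 1_000.00  # per-window spend that triggers an alert (assumption)

recent = defaultdict(deque)  # account_id -> deque of (timestamp, amount)

def flag(account, ts):
    print(f"ALERT: suspicious activity on {account} at t={ts}")

def on_transaction(event):
    """Called for every event as it arrives; no batching, no waiting."""
    account, ts, amount = event["account"], event["ts"], event["amount"]
    window = recent[account]
    window.append((ts, amount))
    # Evict transactions that have slid out of the window.
    while window and ts - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    if sum(a for _, a in window) > AMOUNT_THRESHOLD:
        flag(account, ts)  # act immediately on the live stream

# Events would normally come from a stream consumer; two hand-written ones here.
for evt in [{"account": "A1", "ts": 0, "amount": 600},
            {"account": "A1", "ts": 30, "amount": 550}]:
    on_transaction(evt)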

Real-Time Analytics: While real-time analytics also deals with fast-moving data, it focuses on responding to queries in real time. The data might already reside in databases, and the system retrieves and processes it almost instantaneously when requested. This method is often less continuous than streaming analytics, but it’s still geared towards low-latency responses.

Example: A dashboard monitoring a retail chain’s sales might be refreshed every minute to reflect the latest sales data. Even though the updates are frequent, the data comes from a batched set that is processed in real time rather than directly from an event stream.
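By contrast, a sketch of the real-time analytics pattern queries data that has already been ingested into a store and refreshes on a schedule rather than reacting to each event. Plain Python below, with an in-memory list standing in for a low-latency analytical store; the row shape and the one-minute refresh are assumptions for illustration.

import time
from datetime import datetime, timedelta

# Stand-in for rows already ingested into a low-latency store.
sales_rows = []  # each row: {"ts": datetime, "store": str, "amount": float}

def query_sales_last_hour():
    """Runs when asked; reads whatever has been stored so far."""
    cutoff = datetime.now() - timedelta(hours=1)
    return sum(r["amount"] for r in sales_rows if r["ts"] >= cutoff)

def refresh_dashboard(interval_seconds=60, iterations=3):
    """Periodic refresh: each query is fast, but updates are not event-by-event."""
    for _ in range(iterations):
        total = query_sales_last_hour()
        print(f"{datetime.now().isoformat()} last-hour sales: {total:.2f}")
        time.sleep(interval_seconds)

# refresh_dashboard()  # would poll the store once a minute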

Latency and Time Sensitivity Distinctions


Streaming Analytics: Streaming analytics systems are designed for extremely low latency, as the focus is on processing data instantly as it arrives. This is critical in situations where immediate insights are required, like automated decision-making in fraud detection, predictive maintenance, or dynamic pricing. Streaming analytics typically involves sub-second latency, allowing for almost instantaneous actions based on data.

Real-Time Analytics: Real-time analytics also aims for low latency, but the data may be processed in slightly larger windows (seconds or minutes). The insights provided by real-time analytics are often near real-time, and acceptable latency can range from milliseconds to a few seconds, depending on the system’s requirements. Real-time analytics may involve batch processing, where the data is aggregated and processed as needed, rather than on a continuous stream.

Contrasting Architecture and Tools


Streaming Analytics: The architecture for streaming analytics is built around continuous data flows. The tools and platforms used for streaming analytics—such as Apache Kafka, Apache Flink, and Apache Storm—are designed to support data streams and perform calculations on the fly. The architecture involves source systems that generate continuous streams of events, a processing engine that can handle this real-time input, and sinks that store or act on the processed data.
 
Streaming analytics systems often incorporate concepts like event-driven architecture and micro-batching, where data is split into tiny batches to be processed almost instantaneously. The key focus is on scalability and the ability to handle high-throughput streams with very low latency.
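As a rough illustration of micro-batching (not tied to any particular engine), the sketch below groups an incoming event iterator into small fixed-size batches and emits each one as soon as it fills; the batch size and the commented-out consumer names are assumptions.

def micro_batches(events, batch_size=100):
    """Group a continuous event iterator into small batches.

    The batch size is arbitrary; real engines also flush on a time trigger
    so a slow stream cannot stall a partial batch indefinitely.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield batch

# for batch in micro_batches(stream_consumer(), batch_size=100):  # hypothetical consumer
#     process(batch)  # each tiny batch is processed almost immediately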

Real-Time Analytics: Real-time analytics architecture is often centered around fast querying and low-latency data retrieval from storage. Systems like Apache Pinot, Apache Druid, and in-memory stores like Memcached are frequently used to achieve real-time query performance. Data is often ingested in bursts, cleaned, stored, and queried using systems optimized for low-latency access, such as in-memory or columnar databases.

While they can handle streaming data, real-time analytics systems usually aggregate and store data first, making them well suited to reporting and dashboarding where up-to-the-second freshness is not always critical but results very close to real time are still required.

Streaming and Real-time Analytics Use Cases


Streaming Analytics:
  • IoT Sensor Monitoring: Devices continuously generate data, and analytics systems monitor it in real time to detect anomalies or trigger automated responses.
  • Stock Market and High-Frequency Trading: In financial markets, price data, transaction volumes, and other metrics must be processed in real time to make split-second trading decisions.
  • Social Media Monitoring: For businesses that rely on sentiment analysis or real-time social media engagement, streaming analytics helps gauge public reaction instantly, allowing businesses to respond immediately.

Real-Time Analytics:
  • Customer Personalization: In e-commerce, real-time analytics helps provide personalized recommendations by processing customer interaction data stored in databases, delivering insights in near real time during customer sessions.
  • Operational Dashboards: Many organizations utilize real-time analytics for internal monitoring, where data on sales, system health, or customer interactions is processed quickly but not instantaneously, such as refreshing every minute.
  • Dynamic Pricing: Real-time analytics can be used to adjust pricing based on historical sales and demand data that is processed every few minutes or hours.

Challenges with Streaming and Real-time Analytics


Streaming Analytics: One of the main challenges is dealing with the constant flow of high-velocity data. Ensuring data consistency, scaling infrastructure to handle bursts in data streams, and maintaining sub-second latency all require sophisticated engineering. Another challenge is managing “event time” versus “processing time,” where events arrive out of order or late.
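The event-time versus processing-time problem can be shown with a small sketch: each event carries its own timestamp, and the aggregator assigns it to a one-minute event-time window while tolerating a bounded amount of out-of-order arrival. The window size and allowed lateness are assumptions for illustration.

from collections import defaultdict

WINDOW = 60            # one-minute event-time windows (assumption)
ALLOWED_LATENESS = 15  # seconds of out-of-order arrival tolerated (assumption)

counts = defaultdict(int)  # window_start -> event count
watermark = 0              # highest event time seen, minus allowed lateness

def on_event(event_time):
    global watermark
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    window_start = (event_time // WINDOW) * WINDOW
    if window_start + WINDOW <= watermark:
        # Window already closed; a real system would route this to a
        # late-data path rather than silently dropping it.
        print(f"late event at t={event_time}, window starting {window_start} is closed")
        return
    counts[window_start] += 1

# Events arrive out of order in processing time but keep their event time.
for t in [5, 62, 58, 130, 61, 3]:
    on_event(t)
print(dict(counts))  # {0: 2, 60: 2, 120: 1}; the t=3 event arrived too late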

Real-Time Analytics: Real-time analytics faces the challenge of balancing query performance with data freshness. Storing and retrieving large volumes of data with low latency is difficult without optimized database architectures. Additionally, ensuring that the data queried reflects the most recent information without overwhelming the system requires careful tuning.

Conclusion


While both streaming and real-time analytics offer rapid data processing and insights, they serve different purposes depending on the specific use case. Streaming analytics excels in environments where decisions must be made instantly on data as it arrives, making it ideal for real-time monitoring and automated responses. Real-time analytics, on the other hand, offers low-latency querying for decision-making where instantaneous data streams aren’t necessary but timely responses are critical.

If your use case requires sub-second latency, consider technologies like DeltaStream. It handles streaming analytics and can also act as a streaming database, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.


02 Jul 2024

Min Read

A Guide to RBAC vs ABAC vs ACL

Access control is necessary for data platforms to securely share data. In order for users to confidently share their data resources with the intended parties, access control should be easy to understand and scalable, especially as more data objects and more users are added. Without a sensible access control model, users run a higher risk of inadvertently sharing data objects with the wrong parties and failing to notice incorrect permissions. Choosing the right access control model depends heavily on the use case, so it’s important to understand the benefits and drawbacks of popular options. In this post, we’ll cover three different access control models: access control lists (ACL), role-based access control (RBAC), and attribute-based access control (ABAC). This guide to RBAC vs ABAC vs ACL will cover what they are, their pros and cons, and what to consider when choosing an access control model.

Access Control List (ACL)

An ACL is a list of permissions for a particular resource and is the simplest of the access control models that we’ll cover. When a user attempts an action on a resource, such as a read or write, the ACL associated with that resource is used to allow or deny the attempt. To add or remove permissions on a resource, an entry in the ACL is either added or deleted. ACLs are easy to understand and implement; however, they can become difficult to manage when there are many users and resources, as these lists can grow quickly.

To illustrate how ACLs work, let’s consider an example of a university with professors, teaching assistants, and students:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As this example shows, each individual is given specific permissions for what they’re able to do. If another student were to join, the ACL would need to be updated to grant the new student the privilege to submit assignments and view their grades.
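A minimal sketch, in plain Python, of how such an ACL lookup might work, assuming a simple in-memory representation; the resource and user names are illustrative.

# Each resource carries its own list of (user, permission) entries.
acls = {
    "assignments": [("alice", "submit"), ("bob", "grade"), ("prof_kim", "grade")],
    "grades":      [("alice", "view"), ("prof_kim", "view")],
}

def is_allowed(user, permission, resource):
    """Allow only if an explicit entry exists in the resource's ACL."""
    return (user, permission) in acls.get(resource, [])

print(is_allowed("alice", "submit", "assignments"))  # True
print(is_allowed("bob", "view", "grades"))           # False: no entry exists

# Adding a new student means touching every relevant ACL individually:
acls["assignments"].append(("carol", "submit"))
acls["grades"].append(("carol", "view"))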

Pros:

  • Simple and easy to understand: User privileges for a particular resource are stated plainly in a list.
  • Allows for fine-grained access control to resources: ACLs typically allow different types of access to be defined (e.g., read, write, share).

Cons:

  • Does not scale well: As more users, user groups, and resources are added, access must be individually specified in ACLs each time.
  • Low visibility on a user’s permissions: Checking a particular user’s privileges requires a lookup in every ACL in the organization.
  • Error-prone when used at scale: When ACLs are used at scale, it can be cumbersome to add the proper permissions for users, or detect if a user has been given permissions they shouldn’t have. The difficulty in managing ACLs at scale makes it more likely that errors will occur.

Role-based Access Control (RBAC)

RBAC manages permissions with roles, where roles act as an intermediary between users and resources. In this model, users are assigned a set of roles, and roles are given permissions on resources. This model works well when there are clear groups of users who need the same set of privileges and permissions. Compared to ACLs where every permission needs to be explicitly defined, RBAC scales well with new users and resources. New users can be assigned their relevant roles and adopt all the privileges associated with those roles. Similarly, permissions for new resources can be added to existing roles and users with those roles will automatically inherit the permissions for the new resource.

Using the example from earlier, we can see how RBAC might be applied to a university setting:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As we can see, the relationships in this model are simpler than with ACLs. Instead of specifying direct access to resources, users are assigned roles, which have privileges on resources. If a new student were to join the class, they would just need to be assigned the student role, and all the permissions they need would be inherited through the “student” role.
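A minimal sketch of the same checks under RBAC, with illustrative role and permission names; note that onboarding a new student now touches only the user-to-role mapping.

role_permissions = {
    "student":            {("submit", "assignments"), ("view", "grades")},
    "teaching_assistant": {("grade", "assignments")},
    "professor":          {("grade", "assignments"), ("view", "grades")},
}

user_roles = {
    "alice":    {"student"},
    "bob":      {"teaching_assistant"},
    "prof_kim": {"professor"},
}

def is_allowed(user, permission, resource):
    """Allow if any of the user's roles carries the permission on the resource."""
    return any((permission, resource) in role_permissions[role]
               for role in user_roles.get(user, set()))

print(is_allowed("alice", "view", "grades"))  # True, via the student role
user_roles["carol"] = {"student"}             # new student: one assignment, no ACL edits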

Pros:

  • Easy-to-manage policy enforcement: Updating a privilege for a role automatically applies to all users with that role, making it easier to enforce policies consistently across the organization.
  • Scalable: New users can be granted the roles that apply to them and inherit all the privileges associated with those roles. As new resources are created, access to them can be granted to existing roles, or additional roles can easily be created.
  • Better security and compliance: RBAC ensures that users only have access to the roles relevant for them, and by extension, only the privileges given to those roles. This results in users only having the necessary permissions and reduces the risk of unauthorized access.
  • Widely adopted: RBAC has been around for decades and is used in many popular databases and data products, including PostgreSQL, MySQL, MongoDB, and Snowflake.

Cons:

  • Role explosion: While RBAC is generally quite scalable, role explosion can occur when group privileges are not clearly differentiated and too many roles end up being created. When too many roles exist, RBAC can become difficult to manage. Organizations should define and enforce best practices for creating roles to avoid role explosion.
  • Limited flexibility: For use cases where the privileges of roles are very dynamic, RBAC can feel rigid. For instance, if an organization restructures its team structure, new roles may need to be created and existing roles may need to change their permissions. The process of safely adding and removing permissions from roles, cleaning up any deprecated roles, and restructuring role hierarchy can be cumbersome, slow down productivity, and result in tech debt.

Attribute-based Access Control (ABAC)

ABAC gates access to resources based on attributes, as opposed to users or roles. Attributes, such as who the user is, what action they’re trying to perform, which environment they are performing the action in, and what resource they are trying to perform the action on, are all considered when deciding whether or not access should be permitted. Rules are set up such that access is only allowed when conditions, determined by attributes, are met. For example, a rule can be set up such that a teaching assistant can only view grades if they’re in the grading room and it’s between 4:00 pm and 8:00 pm.

Let’s see how ABAC might be applied to the university example:

Consider a student who is trying to submit their assignment under an ABAC policy. For the submission to succeed, the student needs to have specific attributes, such as being enrolled and not being suspended. There are also contextual constraints, such as the submission needing to occur before the deadline. If all of the conditions in the policy are satisfied, the student can successfully submit their assignment.
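A minimal sketch of an ABAC-style check for the assignment-submission policy; the attribute names and the rule itself are assumptions for illustration.

from datetime import datetime

def can_submit_assignment(user_attrs, resource_attrs, context):
    """Every condition is an attribute check, not a per-user or per-role grant."""
    return (
        user_attrs.get("role") == "student"
        and user_attrs.get("enrolled") is True
        and not user_attrs.get("suspended", False)
        and context["now"] <= resource_attrs["deadline"]
    )

student = {"role": "student", "enrolled": True, "suspended": False}
assignment = {"deadline": datetime(2024, 7, 1, 23, 59)}
now = {"now": datetime(2024, 6, 30, 18, 0)}
print(can_submit_assignment(student, assignment, now))  # True: all conditions hold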

Pros:

  • Highly scalable: New rules and attributes can easily be added as business needs evolve. As resources evolve, administrators can simply assign attributes to the resource, as opposed to creating a new role or changing an existing one.
  • Flexible custom policies: Rules are highly customizable, enabling administrators to easily set up access policies based on context, such as time of day and location.
  • Attributes to ensure compliance with data regulations: Administrators can label sensitive resources with attributes such as personally identifiable information (PII) or HIPAA for healthcare-related information. This makes it easier to set up rules that ensure data privacy and compliance with various regulations.

Cons:

  • Complex to implement and maintain: Attributes and policies need to be carefully defined and governed. The initial design and assignment of attributes for users and resources can be a time-consuming and complex process. Then, continuing to maintain the attributes and access policies as business needs and applications change can require significant time and effort.
  • Difficult to assess risk exposure: Although it’s generally beneficial to be able to create highly customizable access policies, it can make it difficult to audit and assess risk exposure. For instance, understanding the full access a particular user has can be difficult since policies can be complex and contingent on context-specific conditions.

Choosing an Access Control Model

When it comes to choosing an access control model, users should consider how their organization may scale in the future, who will be responsible for maintaining the access control system, and whether their needs actually require a more complex model. If there are a limited number of users and resources, ACLs may be the best approach, as they are simple to understand and implement. If access policies need to be highly customized and dynamic, then ABAC may be a better approach. For something more scalable than ACLs but without the complexity of ABAC, RBAC is probably sufficient. Organizations may also find that a hybrid approach best serves their needs, such as combining RBAC and ABAC.

At DeltaStream, we’ve taken the approach of adding RBAC to our platform. DeltaStream is a real-time stream processing platform that allows users to share, process, and govern their streaming data. In the data streaming space, Apache Kafka has been one of the leading open source projects for building streaming data pipelines and storing real-time events. However, access control with Kafka is managed through ACLs, and as the number of topics and users grow, managing these ACLs has been a pain point for Kafka users. As a data streaming platform that can connect to any streaming data source, DeltaStream allows users to manage and govern their streaming resources with RBAC. RBAC strikes the balance of improving on the scalability issues of ACLs without overcomplicating access control.

If you’re interested in discussing access control or learning more about DeltaStream, feel free to reach out or get a free trial.

23 May 2024

Min Read

Workload Isolation: Everything You Need to Know

In cloud computing, workload isolation is critical for providing efficiency and security when running business workloads. Workload isolation is the practice of separating computing tasks into their own resources and/or infrastructure. By providing physical and logical separations, one compromised workload or resource cannot impact the others. This offers security and performance benefits and may be necessary to comply with regulatory requirements for certain applications.

Benefits of Workload Isolation

  • Security: By isolating workloads, organizations can reduce the ‘blast radius’ of security breaches. For instance, if an attacker were able to compromise the workload in one environment, workload isolation would protect the other workloads because they are being run in different environments. This helps to minimize, contain, and resolve potential security issues.
  • Performance: Isolated workloads can operate without interference from other tasks, ensuring that resources are dedicated and performance is optimized for each specific task. By isolating workloads, task performance becomes more predictable as tasks don’t need to compete for shared resources, making it easier to provide service level agreements (SLAs). Without workload isolation, a sudden spike in resource utilization for one task could negatively impact the performance of other tasks running on the same resources.
  • Compliance: Workload isolation simplifies compliance with various regulations by clearly defining boundaries between different data sets and processing activities.

Achieving Workload Isolation

Workload isolation can take many different forms and can be achieved with different approaches. When thinking about workload isolation, it is best to consider the multiple ways your workloads can be isolated, and to take a combined approach.

  • Resource Governance: Resource governance is the ability to specify boundaries and limits for computing task resources. Popular container orchestration systems, such as Kubernetes, allow users to set resource limits on their services and workloads. Containerizing and limiting the resources for specific tasks removes the “noisy neighbor” problem, where one task can starve other tasks by consuming all of the resources (a minimal process-level sketch of this idea appears after this list).
  • Governance and Access Control: Providing access controls on data sets and compute environments ensures that only necessary individuals and services can access specific workloads. Most data systems have some form of access control that can be defined, whether that is in the form of an access control list (ACL), role-based access control (RBAC), or attribute-based access control (ABAC). Defining access control for users is essential to protect against unauthorized access.
  • Network Level Isolation: Network isolation aims to create distinct boundaries within a network, creating subnetworks with limited access between them. This practice improves security by limiting access to particular environments and helps ensure that an attacker cannot affect workloads on different subnetworks.
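As a process-level illustration of resource governance (deliberately simpler than what an orchestrator like Kubernetes enforces per container), the sketch below uses Python's standard resource module, available on POSIX systems, to cap the memory and CPU time available to a single worker process. The specific limits are arbitrary assumptions.

import resource  # POSIX-only standard library module

def apply_limits(max_memory_bytes, max_cpu_seconds):
    """Cap address space and CPU time for the current process.

    A process-level stand-in for the per-workload limits an orchestrator
    would enforce; the numbers passed in below are arbitrary.
    """
    resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    resource.setrlimit(resource.RLIMIT_CPU, (max_cpu_seconds, max_cpu_seconds))

# Example: limit a worker to 512 MiB of memory and 30 CPU-seconds so a
# runaway task cannot starve its neighbors on the same host.
apply_limits(512 * 1024 * 1024, 30)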

Workload Isolation for Streaming Resources with DeltaStream

DeltaStream is a stream processing platform that is fully managed and serverless, allowing users to easily govern and process their streaming data from sources such as Apache Kafka or AWS Kinesis. As a security-minded stream processing solution, DeltaStream’s workload isolation plays a significant role in ensuring that computational queries are secure and performant. Below are some ways DeltaStream provides workload isolation:

  • Each Query Runs in its Own Environment: Powered by Apache Flink, each DeltaStream query runs in its own Flink cluster with its own dedicated resources and network. This ensures that users’ data is the only data being processed in a particular environment, minimizing the risk of sensitive data leakage. It also boosts performance, as each query can be scaled and tuned independently.
  • Multiple Deployment Options: DeltaStream offers various deployment options, including dedicated deployment and private SaaS deployment (also known as bring your own cloud or BYOC), catering to security-sensitive users. With the dedicated deployment option, a DeltaStream data plane runs in a cloud account dedicated to a single organization. In the private SaaS deployment option, a DeltaStream data plane operates within an organization’s cloud account. These options provide users with an additional level of assurance that their data is confined to a non-shared network; in the case of private SaaS, the data never leaves the user’s own network.
  • Role-based Access Control (RBAC): Access to queries and data objects within the DeltaStream Catalog is managed through DeltaStream’s RBAC. This gives users an easy-to-use and scalable system for properly governing and restricting access to their streaming data and workloads.

Workload isolation is essential for maintaining security and compliance in cloud products, with the added benefit of protecting workload performance. At DeltaStream, we have designed a stream processing platform that fully embraces workload isolation. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.
