17 Dec 2024

Enhancing Fraud Detection with PuppyGraph and DeltaStream

The banking and finance industry has been one of the biggest beneficiaries of digital advancements. Many technological innovations find practical applications in finance, providing convenience and efficiency that can set institutions apart in a competitive market. However, this ease and accessibility have also led to increased fraud, particularly in credit card transactions, which remain a growing concern for consumers and financial institutions.

Traditional fraud detection systems rely on rule-based methods that struggle in real-time scenarios. These outdated approaches are often reactive, identifying fraud only after it occurs. Without real-time capabilities or advanced reasoning, they fail to match fraudsters’ rapidly evolving tactics. A more proactive and sophisticated solution is essential to combat this threat effectively.

This is where graph analytics and real-time stream processing come into play. Combining PuppyGraph, the first and only graph query engine, with DeltaStream, a stream processing engine powered by Apache Flink, enables institutions to improve fraud detection accuracy and efficiency while gaining real-time capabilities. In this blog post, we’ll explore the challenges of modern fraud detection and the advantages of using graph analytics and real-time processing. We will also provide a step-by-step guide to building a fraud detection system with PuppyGraph and DeltaStream.

Let’s start by examining the challenges of modern fraud detection.

Common Fraud Detection Challenges

Credit card fraud has always been a game of cat and mouse. Even before the rise of digital processing and online transactions, fraudsters found ways to exploit vulnerabilities. With the widespread adoption of technology, fraud has only intensified, creating a constantly evolving fraud landscape that is increasingly difficult to navigate. Key challenges in modern fraud detection include:

  • Volume: The daily volume of credit card transactions is far too large to review manually for suspicious activity. Automation is critical to sorting through all that data and identifying anomalies.
  • Complexity: Fraudulent activity often involves complex patterns and relationships that traditional rule-based systems can’t detect. For example, fraudsters may use stolen credit card information to make a series of small transactions before a large one, or use multiple cards in different locations within a short period.
  • Real-time: The sooner fraud is detected, the less financial loss there will be. Real-time analysis is crucial in detecting and preventing transactions as they happen, especially when fraud can be committed at scale in seconds.
  • Agility: Fraudsters will adapt to new security measures. Fraud detection systems must be agile, even learning as they go, to keep up with the evolving threats and tactics.
  • False positives: While catching fraudulent transactions is essential, it’s equally important to avoid flagging legitimate transactions as fraud. False positives can frustrate customers, especially when a card is automatically locked out due to legitimate purchases. As a consequence, they can adversely affect revenue.

To tackle these challenges, businesses require a solution that processes large volumes of data in real-time, identifies complex patterns, and evolves with new fraud tactics. Graph analytics and real-time stream processing are essential components of such a system. By mapping and analyzing transaction networks, businesses can more effectively detect anomalies in customer behavior and identify potentially fraudulent transactions.

Leveraging Graph Analytics for Fraud Detection

Traditional fraud detection methods analyze individual transactions in isolation. This can miss connections and patterns that emerge when we examine the bigger picture. Graph analytics allows us to visualize and analyze transactions as a network of connected entities.

Think of it like a social network. Each customer, credit card, merchant, and device becomes a node in the graph, and each transaction connects those nodes. We can find hidden patterns and anomalies that indicate fraud by looking at the relationships between nodes.

Figure: an example schema for a fraud detection use case

Here’s how graph analytics can be applied to fraud detection:

  • Finding suspicious connections: Graph algorithms can discover unusual patterns of connections between entities. For example, if the same person uses multiple credit cards in different locations in a short period or a single card is used to buy from a group of merchants known for fraud, those connections will appear in the graph and be flagged as suspicious.
  • Uncovering fraud rings: Fraudsters often work within the same circles, using multiple identities and accounts to carry out scams. Graph analytics can find those complex networks of people and their connections, helping to identify and potentially break up entire fraud rings.
  • Surfacing identity theft: When a stolen credit card is used, the spending patterns will generally be quite different from the cardholder’s normal behavior. By looking at the historical and current transactions within a graph, you can see sudden changes in spending habits, locations, and types of purchases that may indicate identity theft.
  • Predicting future fraud: Graph analytics can predict future fraud by looking at historical data and the patterns that precede a fraudulent transaction. By predicting fraud before it happens, businesses can take action to prevent it.

Of course, all of these benefits are extremely helpful. However, the biggest hurdle to realizing them is the complexity of implementing a graph database. Let’s look at some of those challenges and how PuppyGraph can help users avoid them entirely.

Challenges of Implementing and Running Graph Databases

As shown, graph databases can be an excellent tool for fraud detection. So why aren’t they used more frequently? This usually boils down to implementing and managing them, which can be complex for those unfamiliar with the technology. The hurdles that come with implementing a graph database can far outweigh the benefits for some businesses, even stopping them from adopting this technology altogether. Here are some of the issues generally faced by companies implementing graph databases:

  • Cost: Traditional relational databases have been the norm for decades, and many organizations have invested heavily in their infrastructure. Switching to a graph database or even running a proof of concept requires a significant upfront investment in new software, hardware, and training. 
  • Implementing ETL: Extracting, transforming, and loading (ETL) data into a graph database can be tricky and time-consuming. Data needs to be restructured to fit a graph model, which requires understanding the underlying data and knowing how to represent its entities and relationships in a graph model. This requires specific skills and adds to the implementation time and cost, meaning the benefits may be delayed.
  • Bridging the skills gap: Graph databases require a different data modeling and querying approach from traditional databases. In addition to the previous point regarding ETL, finding people with the skills to manage, maintain, and query the data within a graph database can also be challenging. Without these skills, graph technology adoption is mostly dead in the water.
  • Integration challenges: Integrating a graph database with existing systems and applications is complex. This usually involves taking the output from graph queries and mapping it into downstream systems, which requires careful planning and execution. Getting data to flow smoothly and remain compatible across different systems is a significant undertaking.

These challenges highlight the need for solutions that make graph database adoption and management more accessible. A graph query engine like PuppyGraph addresses these issues by enabling teams to integrate their data and query it as a graph in minutes without the complexity of ETL processes or the need to set up a traditional graph database. Let’s look at how PuppyGraph helps teams become graph-enabled without ETL or the need for a graph database.

How PuppyGraph Solves Graph Database Challenges

PuppyGraph is built to tackle the challenges that often hinder graph database adoption. By rethinking graph analytics, PuppyGraph removes many entry barriers, opening up graph capabilities to more teams than otherwise possible. Here’s how PuppyGraph addresses many of the hurdles mentioned above:

  • Zero-ETL: One of PuppyGraph’s most significant advantages is connecting directly to your existing data warehouses and data lakes—no more complex and time-consuming ETL. There is no need to restructure data or create separate graph databases. Simply connect the graph query engine directly to your SQL data store and start querying your data as a graph in minutes.
  • Cost: PuppyGraph reduces the expenses of graph analytics by using your existing data infrastructure. There is no need to invest in new database infrastructure or software and no ongoing maintenance costs of traditional graph databases. Eliminating the ETL process significantly reduces the engineering effort required to build and maintain fragile data pipelines, saving time and resources.
  • Reduced learning curve: Traditional graph databases often require users to master complex graph query languages for every operation, including basic data manipulation. PuppyGraph simplifies this by functioning as a graph query engine that operates alongside your existing SQL query engine using the same data. You can continue using familiar SQL tools for data preparation, aggregation, and management. When more complex queries suited to graph analytics arise, PuppyGraph handles them seamlessly. This approach saves time and allows teams to reserve graph query languages specifically for graph traversal tasks, reducing the learning curve and broadening access to graph analytics.
  • Multi-query language support: Engineers can continue to use their existing SQL skills and platform, allowing them to leverage graph querying when needed. The platform offers many ways to build graph queries, including Gremlin and Cypher support, so your existing team can quickly adopt and use graph technology.
  • Effortless scaling: PuppyGraph’s architecture separates compute and storage so it can easily handle petabytes of data. By leveraging their underlying SQL storage, teams can effortlessly scale their compute as required. You can focus on extracting value from your data without scaling headaches.
  • Fast deployment: With PuppyGraph, you can deploy and start querying your data as a graph in 10 minutes. There are no long setup processes or complex configurations. Fast deployment means you can start seeing the benefits of graph analytics and speed up your fraud detection.

In short, PuppyGraph removes the traditional barriers to graph adoption so more institutions can use graph analytics for fraud detection use cases. By simplifying, reducing costs, and empowering existing teams with effortless graph adoption, PuppyGraph makes graph technology accessible for all teams and organizations.

Real-Time Fraud Prevention with DeltaStream

Speed is key in the fight against fraud, and responsiveness is crucial to preventing or minimizing the impact of an attack. Systems and processes that act on events with minimal latency can mean the difference between successful and unsuccessful cyber attacks. DeltaStream empowers businesses to analyze and respond to suspicious transactions in real-time, minimizing losses and preventing further damage.

Why Real-Time Matters:

  • Immediate Response: Rapid incident response means security and data teams can detect threats, isolate them, and trigger mitigation protocols faster than ever, minimizing the window of vulnerability. With real-time data and sub-second latency, the Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) can be significantly reduced.
  • Proactive Prevention: Data and security teams can identify behavior patterns as they emerge and implement mitigation tactics. Real-time allows for continuous monitoring of system health and security with predictive models. 
  • Improved Accuracy: Real-time data provides a more accurate view of customer behavior for precise detection. Threats are more complex than ever and often involve multi-stage attack patterns; streaming data aids in identifying these complex and ever-evolving threat tactics.

DeltaStream’s Key Features:

  • Speed: Increase the speed of your data processing and your team’s ability to create data applications. Reduce latency and cost by shifting your data transformations out of your warehouse and into DeltaStream. Data teams can also quickly write queries in SQL to create analytics pipelines with no other complex languages to learn.
  • Team Focus: Eliminate maintenance tasks with our continually optimizing Flink operator. With your team freed from infrastructure work, they can focus on building and strengthening pipelines.
  • Unified View: An organization’s data rarely comes from just one source. Process streaming data from multiple sources in real-time to get a complete picture of activities. This means transaction data, user behavior, and other relevant signals can be analyzed together as they occur.

By combining PuppyGraph’s graph analytics with DeltaStream’s real-time processing, businesses can create a dynamic fraud detection system that stays ahead of evolving threats.

Step-by-Step Tutorial: DeltaStream and PuppyGraph

In this tutorial, we go through the high-level steps of integrating DeltaStream and PuppyGraph. 

The detailed steps are available at:

Starting a Kafka Cluster

We start a Kafka Server as the data input. (Later in the tutorial, we’ll send financial data through Kafka.)

We create topics for financial data like this:

  bin/kafka-topics.sh --create --topic kafka-Account --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
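
The streams we define later also read from the kafka-AccountRepayLoan and kafka-AccountTransferAccount topics; assuming those topics are created the same way, the commands look like this:

  bin/kafka-topics.sh --create --topic kafka-AccountRepayLoan --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
  bin/kafka-topics.sh --create --topic kafka-AccountTransferAccount --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1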

Setting up DeltaStream

Connecting to Kafka

Log in to the DeltaStream console. Then, navigate to Resources and add a Kafka Store – for example, kafka_demo – using the parameters of the Kafka cluster we started in the previous step.

Next, in the Workspace, create a DeltaStream database – for example, kafka_db.

After that, we use DeltaStream SQL to create streams for the Kafka topics we created earlier. A stream describes the topic’s physical layout so it can be easily referenced with SQL. Once we declare the streams, we can build streaming data pipelines to transform, enrich, aggregate, and prepare streaming data for analysis in PuppyGraph. First, we’ll define the account_stream from the kafka-Account topic.

  CREATE STREAM account_stream (
    "label" STRING,
    "accountId" BIGINT,
    "createTime" STRING,
    "isBlocked" BOOLEAN,
    "accoutType" STRING,
    "nickname" STRING,
    "phonenum" STRING,
    "email" STRING,
    "freqLoginType" STRING,
    "lastLoginTime" STRING,
    "accountLevel" STRING
  ) WITH (
    'topic' = 'kafka-Account',
    'value.format' = 'JSON'
  );

Next, we’ll define the accountrepayloan_stream from the kafka-AccountRepayLoan topic:

  CREATE STREAM accountrepayloan_stream (
    "label" STRING,
    "accountrepayloandid" BIGINT,
    "loanId" BIGINT,
    "amount" DOUBLE,
    "createTime" STRING
  ) WITH (
    'topic' = 'kafka-AccountRepayLoan',
    'value.format' = 'JSON'
  );

And finally, we’ll define the accounttransferaccount_stream from the kafka-AccountTransferAccount topic. You’ll note there are both a fromid and a toid, which link to accountId values. This allows us to enrich data in the account payment stream with account information from the account_stream and combine it with the account transfer stream.

With DeltaStream, this can then easily be written out as a more succinct and enriched stream of data to our destination, such as Snowflake or Databricks. We combine data from three streams with just the information we want, preparing the data in real time from multiple streaming sources, which we then query as a graph using PuppyGraph. A sketch of such an enrichment query follows the stream definition below.

  CREATE STREAM accounttransferaccount_stream (
    "label" VARCHAR,
    "accounttransferaccountid" BIGINT,
    "fromid" BIGINT,
    "toid" BIGINT,
    "amount" DOUBLE,
    "createTime" STRING,
    "ordernum" BIGINT,
    "comment" VARCHAR,
    "paytype" VARCHAR,
    "goodstype" VARCHAR
  ) WITH (
    'topic' = 'kafka-AccountTransferAccount',
    'value.format' = 'JSON'
  );
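
To illustrate the enrichment described above, here is a rough sketch of the kind of query you could write, assuming DeltaStream’s CREATE STREAM AS SELECT with an interval (WITHIN) join. The source stream and column names come from the definitions above, while the output stream name and join window are illustrative:

  CREATE STREAM enriched_transfer_stream AS
  SELECT
    t."accounttransferaccountid" AS transferid,
    t."fromid",
    src."nickname" AS from_nickname,
    src."isBlocked" AS from_blocked,
    t."toid",
    t."amount",
    t."createTime"
  FROM accounttransferaccount_stream t
  JOIN account_stream src WITHIN 5 MINUTES
    ON t."fromid" = src."accountId";

The accountrepayloan_stream can be joined in the same way to bring repayment activity into the enriched output.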

Adding a Store for Integration

PuppyGraph will connect to the stores and allow querying as a graph.

Once our data is ready in the desired format, we can write streaming SQL queries in DeltaStream that continuously write data to the desired storage. In this case, we can use DeltaStream’s native integration with Snowflake or Databricks, which is where we will use PuppyGraph. Here is an example of writing data continuously into a table in Snowflake or Databricks from DeltaStream:

  CREATE TABLE ds_account
  WITH
  (
    'store' = '<store_name>',
    <Storage parameters>
  ) AS
  SELECT * FROM account_stream;

Starting data processing

Now, you can start a Kafka Producer to send the financial JSON data to Kafka. For example, to send account data, run:

  kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-Account < json_data/Account.json

DeltaStream will process the data, and then we will query it as a graph.

Query your data as a graph

You can start PuppyGraph using Docker. Then upload the Graph schema, and that’s it! You can now query the financial data as a graph as DeltaStream processes it.

Start PuppyGraph using the following command:

  docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
    -e DATAACCESS_DATA_CACHE_STRATEGY=adaptive \
    -e <STORAGE PARAMETERS> \
    --name puppy --rm -itd puppygraph/puppygraph:stable

Log into the PuppyGraph Web UI at http://localhost:8081 with the following credentials:

Username: puppygraph

Password: puppygraph123

Upload the schema: Select the file schema_<storage>.json in the Upload Graph Schema JSON section and click Upload.

Navigate to the Query panel on the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin. For example, to query the accounts owned by a specific company and the transaction records of these accounts, you can run:

  1. g.V("Company[237]")
  2. .outE('CompanyOwnAccount').inV()
  3. .outE('AccountTransferAccount').inV()
  4. .path()
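
The same schema also supports fraud-oriented traversals. As an illustrative sketch (the Account vertex label is assumed; the isBlocked and accountId properties and the AccountTransferAccount edge label come from the streams defined earlier), the following query finds accounts that have transferred money into blocked accounts:

  g.V().hasLabel('Account').has('isBlocked', true)
    .inE('AccountTransferAccount').outV()
    .dedup()
    .values('accountId')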

Conclusion

As this blog post explores, traditional fraud detection methods simply can’t keep pace with today’s sophisticated criminals. Real-time analysis and the ability to identify complex patterns are critical. By combining the power of graph analytics with real-time stream processing, businesses can gain a significant advantage against fraudsters.

PuppyGraph and DeltaStream offer robust and accessible solutions for building real-time dynamic fraud detection systems. We’ve seen how PuppyGraph unlocks hidden relationships and how DeltaStream analyzes real-time data to quickly and accurately identify and prevent fraudulent activity. Ready to take control and build a future-proof, graph-enabled fraud detection system? Try PuppyGraph and DeltaStream today. Visit PuppyGraph and DeltaStream to get started!

13 Nov 2024

What’s Coming in Apache Flink 2.0?

As champions for Apache Flink, we are excited for the 2.0 release and all that it will bring. Apache Flink 1.0 was released in 2016, and while we don’t have an exact release date, it looks like 2.0 will be released in late 2024/early 2025. Version 1.20 was just released in August 2024. Version 2.0 is set to be a major milestone release, marking a significant evolution in the stream processing framework. This blog runs down some of the key features and changes coming in Flink 2.0.

Disaggregated State Storage and Management

One of the most exciting features of Flink 2.0 is the introduction of disaggregated state storage and management. It will utilize a Distributed File System (DFS) as the primary storage for state data. This architecture separates compute and storage resources, addressing key scalability and performance needs for large-scale, cloud-native data processing.

Core Advantages of Disaggregated State Storage

  1. Improved Scalability
    By decoupling storage from compute resources, Flink can manage massive datasets—into the hundreds of terabytes—without being constrained by local storage. This separation enables efficient scaling in containerized and cloud environments.
  2. Enhanced Recovery and Rescaling
    The new architecture supports faster state recovery on job restarts, efficient fault tolerance, and quicker job rescaling with minimal downtime. Key components include shareable checkpoints and LazyRestore for on-demand state recovery.
  3. Optimized I/O Performance
    Flink 2.0 uses asynchronous execution and grouped remote state access to minimize the latency impact of remote storage. A hybrid caching mechanism can improve cache efficiency, providing up to 80% better throughput than traditional file-level caching.
  4. Improved Batch Processing
    Disaggregated state storage enhances batch processing by better handling large state data and integrating batch and stream processing tasks, making Flink more versatile across diverse workloads.
  5. Dynamic Resource Management
    The architecture enables flexible resource allocation, minimizing CPU and network usage spikes during maintenance tasks like compaction and cleanup.

API and Configuration Changes

Several API and configuration changes will be introduced, including:

  • Removal of deprecated APIs, including the DataSet API and Scala versions of DataStream and DataSet APIs
  • Deprecation of the legacy SinkFunction API in favor of the Unified Sink API
  • Overhaul of the configuration layer, enhancing user-friendliness and maintainability
  • Introduction of new abstractions such as Materialized Tables, introduced in v1.20 and further enhanced in v2.0
  • Updates to configuration options, including proper type usage (e.g., Duration, Enum, Int)

Modernization and Unification

Flink 2.0 aims to further unify batch and stream processing:

  • Modernization of legacy components, such as replacing the legacy SinkFunction with the new Unified Sink API
  • Enhanced features that combine batch and stream processing seamlessly
  • Improvements to Adaptive Batch Execution for optimizing logical and physical plans

Performance Improvements

The community is working on making Flink’s performance on bounded streams (batch use cases) competitive with dedicated batch processors. This can further simplify your data processing stack. Planned optimizations include:

  • Dynamic Partition Pruning (DPP) to minimize I/O costs
  • Runtime Filter to reduce I/O and shuffle costs
  • Operator Fusion CodeGen to improve query execution performance

Cloud-Native Focus

Flink 2.0 is being designed with cloud-native architectures in mind:

  • Improved efficiency in containerized environments
  • Better scalability for large state sizes
  • More efficient fault tolerance and faster rescaling

This is an exciting time for Apache Flink. Version 2.0 represents a significant leap forward in unified batch and stream processing, focusing on cloud-native architectures, improved performance, and streamlined APIs. These changes aim to address the evolving needs of data-driven applications and set new standards for what’s possible in data processing. DeltaStream is proudly powered by Apache Flink and makes it easy to start running Flink in minutes. Get a free trial of DeltaStream and see for yourself.

29 Oct 2024

A Guide to Standard SQL vs. Streaming SQL: Why Do We Need Both?

Understanding the Differences Between Standard SQL and Streaming SQL

SQL has long been a foundational tool for querying databases. Traditional SQL queries are typically run against static, historical data, generating a snapshot of results at a single point in time. However, the rise of real-time data processing, driven by applications like IoT, financial transactions, security and intrusion monitoring, and social media, has led to the evolution of Streaming SQL. This variant extends traditional SQL capabilities, offering features specifically designed for real-time, continuous data streams.

Standard SQL and Streaming SQL Key Differences

1. Point-in-Time vs. Continuous Queries

In standard SQL, queries are typically run once and return results based on a snapshot of data. For instance, when you query a traditional database to get the sum of all sales, it reflects only the state of data up until the moment of the query.

In contrast, Streaming SQL works with data that continuously flows in, updating queries in real-time. The same query can be run in streaming SQL, but instead of receiving a one-time result, the query is maintained in a materialized view that updates as new data arrives. This is especially useful for use cases like dashboards or monitoring systems, where the data needs to stay current.
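
To make the contrast concrete, here is a minimal sketch; the table, stream, and column names are hypothetical, and the exact keywords vary by engine. The first query returns a one-time snapshot, while the second declares a continuously maintained result:

  -- Standard SQL: a point-in-time answer
  SELECT SUM(amount) AS total_sales FROM sales;

  -- Streaming SQL: a materialized view that updates as new events arrive
  CREATE MATERIALIZED VIEW total_sales AS
  SELECT SUM(amount) AS total_sales FROM sales_stream;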

2. Real-Time Processing with Window Functions

Streaming SQL introduces window functions, allowing users to segment a data stream into windows for aggregation or analysis. For example, a tumbling window is a fixed-length window (such as one minute) that collects data for aggregation over that time frame. In contrast, a hopping window has a fixed size but advances (hops) by a specified length. For example, if you want to calculate the current inventory over a two-minute window but update the results every minute, the window size would be two minutes and the hop size one minute.

Windowing in traditional SQL is static and backward-looking, whereas in streaming SQL, real-time streams are processed continuously, updating aggregations within the described window.
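
As a concrete illustration of the inventory example, using Apache Flink SQL’s windowing table-valued functions as one dialect (the stream and column names are hypothetical), a two-minute hopping window that advances every minute can be written as:

  SELECT window_start, window_end, SUM(quantity) AS current_inventory
  FROM TABLE(
    HOP(TABLE inventory_events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE, INTERVAL '2' MINUTES)
  )
  GROUP BY window_start, window_end;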

3. Watermarks for Late Data Handling

In streaming environments, data can arrive late or out of order. To manage this, Streaming SQL introduces watermarks. Watermarks mark the point in time up to which the system expects to have received data. For instance, if an event is delayed by a minute, a watermark ensures it’s still processed if it arrives within that window, making streaming SQL robust for real-world, unpredictable data flows. Conventional SQL has no ability or need to address this scenario.
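
In Apache Flink SQL, for example, this tolerance for late events is declared on the source table. The sketch below (table, columns, and connector settings are illustrative) accepts events that arrive up to one minute late:

  CREATE TABLE transactions (
    txn_id BIGINT,
    amount DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '1' MINUTE
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
  );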

4. Continuous Materialization

One of the unique aspects of Streaming SQL is the ability to materialize views incrementally. Unlike traditional databases that recompute queries when data changes, streaming SQL continuously maintains these views as new data flows in. This approach dramatically improves performance for real-time analytics by avoiding expensive re-computations.

Use Cases for Streaming SQL

The rise of streaming SQL has been a game-changer across industries. Common applications include:

  • Real-time analytics dashboards, such as stock trading platforms or retail systems where quick insights are needed to make rapid decisions.
  • Event-driven applications where alerts and automations are triggered by real-time data, such as fraud detection or IoT sensor monitoring.
  • Real-time customer personalization, where user actions or preferences update in real-time to deliver timely recommendations.

Conclusion

While Standard SQL excels at querying static, historical datasets, Streaming SQL is optimized for real-time data streams, offering powerful features like window functions, watermarks, and materialized views. These advancements handle fast-changing data with low latency, offering immediate insights and automation. This Datanami article from July 2023 pegged streaming adoption growth at 177% over the previous 12 months. As more industries rely on real-time decision-making, streaming SQL is becoming a critical tool for modern data infrastructures.

23 Oct 2024

Democratizing Data with All-in-One Streaming Solutions

In today’s fast-paced data landscape, organizations must maximize efficiency, enhance collaboration, and maintain data quality. An all-in-one streaming data solution offers a single, integrated platform for real-time data processing, which simplifies operations, reduces costs, and makes advanced tools accessible across teams. 

This blog explores the benefits of such solutions and their role in promoting a democratized data culture.

Key Benefits of All-in-One Streaming Data Solutions

Streamlined Learning Curve

All-in-one platforms simplify adoption by providing a single interface, unlike traditional setups requiring expertise in multiple tools and languages. This accelerates adoption and facilitates collaboration across teams.

Consolidated Toolset

By merging data integration, processing, and visualization into a unified system, these platforms eliminate the need to manage multiple applications. Teams can perform tasks like joins, filtering, and creating materialized views within one environment, improving workflow efficiency.

Simplified Language Support

Most all-in-one platforms use a common language, such as SQL, for all data operations. This reduces the need for proficiency in multiple languages, streamlines processes, and enables easier collaboration between team members.

Enhanced Security and Compliance

With centralized security controls, these platforms simplify the enforcement of compliance standards like GDPR and HIPAA. Fewer components reduce vulnerabilities, providing a more secure data environment.

Cost Savings

Managing multiple tools leads to increased costs, both in licensing and staffing. An all-in-one solution consolidates these tools, reducing expenses and providing long-term cost stability.

Improved Data Quality

Using a single platform for all data operations—collection, transformation, streaming, and analysis—minimizes errors and ensures consistent validation, resulting in more accurate and reliable insights.

Centralized Platform for Unified Operations

An all-in-one solution enables teams to handle all aspects of data processing on one platform, from combining datasets to filtering large volumes of data and creating materialized views for real-time access. This integrated approach reduces errors and boosts operational efficiency.

Single Interface for Event Streams

These platforms provide a single interface to access and work with event streams, regardless of location or device. This consistent access allows teams to monitor and manage streams globally, facilitating seamless data handling across distributed environments.

Breaking Down Silos

All-in-one platforms promote collaboration by breaking down data silos, enabling cross-functional teams to work with shared data in real-time. Whether in marketing, sales, engineering, or product development, everyone has access to the same data streams, facilitating collaboration and maximizing the value of data.

Democratized Data Access and Collaboration

Centralized Data Access

In traditional environments, only a few technical users control critical data pipelines. An all-in-one solution democratizes data by giving all team members access to the same tools, empowering them to make data-driven decisions regardless of technical expertise.

Simplified Data Analysis

These platforms provide intuitive tools for querying and visualizing data, allowing less technically sophisticated users to engage in data analysis. This extends the role of data across the organization, improving decision-making and fostering collaboration.

Cross-Functional Collaboration

The integration of all tools into a single platform enhances collaboration across functions. Teams from different departments can work together more efficiently, aligning on data-driven strategies without needing to navigate disparate systems or struggle with inconsistent user access (for example, some people having access to tools A and B while others only to tools C and D).

Reduced Effort

With only one platform to learn, teams experience reduced effort and cognitive load, freeing up more time to focus on deriving insights rather than managing multiple tools. This ease of use encourages widespread adoption and enhances overall productivity.

Scalability and Flexibility

All-in-one solutions are designed for scalability, enabling organizations to grow without constantly adopting new tools or overhauling systems. Whether increasing data streams or integrating new sources, these platforms scale effortlessly with business needs.

Conclusion

Is this the promise of Data Mesh? All-in-one streaming data solutions are revolutionizing how organizations handle real-time data. By consolidating tools, simplifying workflows, and fostering collaboration, these platforms democratize data access while maintaining data quality and operational efficiency. Whether you’re a small team seeking streamlined processes or a large enterprise focused on scalability, the benefits of an all-in-one solution are clear. Investing in such platforms is a strategic move to unlock the full potential of real-time data.

DeltaStream can be part of your toolbox, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.

01 Oct 2024

Streaming Analytics vs. Real-time Analytics: Key Differences to Know

Introduction


Businesses rely heavily on timely insights to make informed decisions in today’s data-driven world. Two key approaches that enable organizations to derive value from their data as it is generated are streaming analytics and real-time analytics. While the two terms are often used interchangeably, they differ in how they operate and the types of use cases they address. This blog post will delve into the core differences between streaming and real-time analytics, their respective architectures, and practical applications.

Defining Streaming and Real-Time Analytics


Streaming Analytics: Streaming analytics refers to analyzing and acting on data as it flows into the system continuously. Data is processed in real-time as it is ingested, typically in small, unbounded batches or event streams. These streams come from various sources like IoT devices, log files, and social media, with the analytics system making decisions or generating insights from the live data.

Real-Time Analytics: Real-time analytics, while similar in time sensitivity, typically involves processing a dataset or query with minimal latency. It involves quickly processing data to provide near-instantaneous insights, although the data is often stored or batched before it is analyzed. Real-time analytics operates in response to queries where results are expected from data as it enters the system, such as personalized advertising. Typically, there are two types:

  • On-demand: Provides analytic results only when a query is submitted.
  • Continuous: Proactively sends alerts or triggers responses in other systems as the data is generated.

Differences in Data Ingestion and Processing


Streaming Analytics: In streaming analytics, data is processed in motion. As the data arrives in the system, it is immediately ingested and analyzed. The focus is on processing and analyzing the continuous flow of data, often in a windowed manner, to derive immediate actions from the data stream. This involves handling large volumes of unbounded, real-time data flows.

Example: A fraud detection system in a bank continuously monitors transactions. The moment suspicious activity is detected from a stream of transaction data, the system flags or blocks the transaction in real time.

Real-Time Analytics: While real-time analytics also deals with fast-moving data, it focuses on responding to queries in real time. The data might already reside in databases, and the system retrieves and processes it almost instantaneously when requested. This method is often less continuous than streaming analytics, but it’s still geared towards low-latency responses.

Example: A dashboard monitoring a retail chain’s sales might be refreshed every minute to reflect the latest sales data. Even though the updates are frequent, the data comes from a batched set that is processed in real time rather than directly from an event stream.

Latency and Time Sensitivity Distinctions


Streaming Analytics: Streaming analytics systems are designed to handle extremely low latency, as the focus is on processing data instantly as it arrives. This is critical in situations where immediate insights are required, like automated decision-making in fraud detection, predictive maintenance, or dynamic pricing. Streaming analytics typically involves sub-second latency, allowing for almost instantaneous actions based on data.

Real-Time Analytics: Real-time analytics also aims for low latency, but the data may be processed in slightly larger windows (seconds or minutes). The insights provided by real-time analytics are often near real-time, and acceptable latency can range from milliseconds to a few seconds, depending on the system’s requirements. Real-time analytics may involve batch processing, where the data is aggregated and processed as needed, rather than on a continuous stream.

Contrasting Architecture and Tools


Streaming Analytics: The architecture for streaming analytics is built around continuous data flows. The tools and platforms used for streaming analytics—such as Apache Kafka, Apache Flink, and Apache Storm—are designed to support data streams and perform calculations on the fly. The architecture involves source systems that generate continuous streams of events, a processing engine that can handle this real-time input, and sinks that store or act on the processed data.
 
Streaming analytics systems often incorporate concepts like event-driven architecture and micro-batching, where data is split into tiny batches to be processed almost instantaneously. The key focus is on scalability and the ability to handle high-throughput streams with very low latency.

Real-Time Analytics: Real-time analytics architecture is often centered around fast querying and low-latency data retrieval from storage. Systems like Apache Pinot, Apache Druid, and in-memory databases like Memcached are frequently used to achieve real-time query performance. Data is often ingested in bursts, cleaned, stored, and queried using systems optimized for low-latency access, such as in-memory or columnar databases.

While they can handle streaming data, real-time analytics systems usually aggregate and store data first, making them suitable for reporting and dashboarding where up-to-the-second freshness is not always critical but near real-time results are required.

Streaming and Real-time Analytics Use Cases


Streaming Analytics:

  • IoT Sensor Monitoring: Where devices continuously generate data, analytics systems monitor this data in real time to detect anomalies or trigger automated responses.
  • Stock Market and High-Frequency Trading: In financial markets, price data, transaction volumes, and other metrics must be processed in real time to make split-second trading decisions.
  • Social Media Monitoring: For businesses that rely on sentiment analysis or real-time social media engagement, streaming analytics helps gauge public reaction instantly, allowing businesses to respond immediately.

Real-Time Analytics:

  • Customer Personalization: In e-commerce, real-time analytics helps provide personalized recommendations by processing customer interaction data stored in databases, delivering insights in near real-time during customer sessions.
  • Operational Dashboards: Many organizations utilize real-time analytics for internal monitoring, where data on sales, system health, or customer interactions is processed quickly but not instantaneously, such as refreshing every minute.
  • Dynamic Pricing: Real-time analytics can be used to adjust pricing based on historical sales and demand data that is processed every few minutes or hours.

Challenges with Streaming and Real-time Analytics


Streaming Analytics: One of the main challenges is dealing with the constant flow of high-velocity data. Ensuring data consistency, scaling infrastructure to handle bursts in data streams, and maintaining sub-second latency require sophisticated engineering solutions. Another challenge is managing “event time” versus “processing time,” where events arrive out of order or late.

Real-Time Analytics: Real-time analytics faces the challenge of balancing query performance with data freshness. Storing and retrieving large volumes of data with low latency is difficult without optimized database architectures. Additionally, ensuring that the data queried reflects the most recent information without overwhelming the system requires careful tuning.

Conclusion


While both streaming and real-time analytics offer rapid data processing and insights, they serve different purposes depending on the specific use case. Streaming analytics excels in environments where decisions must be made instantly on data as it arrives, making it ideal for real-time monitoring and automated responses. Real-time analytics, on the other hand, offers low-latency querying for decision-making where instantaneous data streams aren’t necessary but timely responses are critical.

If your use case requires sub-second latency, consider technologies like DeltaStream. It both handles Streaming Analytics and acts as a Streaming Database, supporting the shift-left paradigm for operational efficiency. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.


02 Jul 2024

A Guide to RBAC vs ABAC vs ACL

Access control is necessary for data platforms to securely share data. In order for users to confidently share their data resources with the intended parties, access control should be easy to understand and scalable, especially as more data objects and more users are added. Without a sensible access control model, users run a higher risk of inadvertently sharing data objects with the wrong parties and failing to notice incorrect permissions. Choosing the right access control model depends heavily on the use case, so it’s important to understand the benefits and drawbacks of popular options. In this post, we’ll cover three different access control models: access control lists (ACL), role-based access control (RBAC), and attribute-based access control (ABAC). This guide to RBAC vs ABAC vs ACL will cover what they are, their pros and cons, and what to consider when choosing an access control model.

Access Control List (ACL)

An ACL is a list of permissions for a particular resource and is the simplest of the access control models that we’ll cover. When a user attempts an action on a resource, such as a read or write, the ACL associated with that resource is used to allow or deny the attempt. In order to add or remove permissions to a resource, an entry in the ACL is either added or deleted. ACLs are a simple model that are easy to understand and implement, however they can be difficult to manage when there are many users and resources as these lists can grow quickly.

To illustrate how ACLs work, let’s consider an example of a university with professors, teaching assistants, and students:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As you can see from the diagram, each individual is given specific permissions for what they’re able to do. If another student were to join, the ACL would need to be updated to grant the new student privilege to submit assignments and view their grades.

Pros:

  • Simple and easy to understand: User privileges for a particular resource are stated plainly in a list.
  • Allows for fine-grained access control to resources: ACLs typically allow different types of access to be defined (e.g., read, write, share).

Cons:

  • Does not scale well: As more users, user groups, and resources are added, access must be individually specified in ACLs each time.
  • Low visibility on a user’s permissions: Checking a particular user’s privileges requires a lookup in every ACL in the organization.
  • Error-prone when used at scale: When ACLs are used at scale, it can be cumbersome to add the proper permissions for users, or detect if a user has been given permissions they shouldn’t have. The difficulty in managing ACLs at scale makes it more likely that errors will occur.

Role-based Access Control (RBAC)

RBAC manages permissions with roles, where roles act as an intermediary between users and resources. In this model, users are assigned a set of roles, and roles are given permissions on resources. This model works well when there are clear groups of users who need the same set of privileges and permissions. Compared to ACLs where every permission needs to be explicitly defined, RBAC scales well with new users and resources. New users can be assigned their relevant roles and adopt all the privileges associated with those roles. Similarly, permissions for new resources can be added to existing roles and users with those roles will automatically inherit the permissions for the new resource.

Using the example from earlier, we can see how RBAC might be applied to a university setting:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As we can see, the relationships in this diagram are simpler than the diagram with ACLs. Instead of specifying direct access to resources, users are assigned roles which have privileges on resources. If a new student were to join the class, they would just need to be assigned the student role and all the permissions they need will be inherited through the “student” role.
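
In SQL databases that implement RBAC, this maps to a handful of statements. Here is a minimal PostgreSQL-style sketch; the role, table, and user names are hypothetical:

  CREATE ROLE student;
  GRANT INSERT ON assignments TO student;  -- submit assignments
  GRANT SELECT ON grades TO student;       -- view grades
  GRANT student TO new_student;            -- a new student inherits the role's privileges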

Pros:

  • Easy-to-manage policy enforcement: Updating a privilege for a role automatically applies to all users with that role, making it easier to enforce policies at a more granular level.
  • Scalable: New users can be granted the roles that apply for them and inherit all the privileges with those roles. As new resources are created, access to them can be granted to roles or additional roles can easily be created.
  • Better security and compliance: RBAC ensures that users only have access to the roles relevant for them, and by extension, only the privileges given to those roles. This results in users only having the necessary permissions and reduces the risk of unauthorized access.
  • Widely adopted: RBAC has been around for decades and is used in many popular databases and data products, including PostgreSQL, MySQL, MongoDB, and Snowflake.

Cons:

  • Role explosion: While RBAC is generally quite scalable, too many roles can be created in cases where group privileges are not clearly differentiated. When too many roles get created, RBAC can become difficult to manage. Organizations should come up with and enforce best practices for defining roles to avoid role explosion.
  • Limited flexibility: For use cases where the privileges of roles are very dynamic, RBAC can feel rigid. For instance, if an organization restructures its team structure, new roles may need to be created and existing roles may need to change their permissions. The process of safely adding and removing permissions from roles, cleaning up any deprecated roles, and restructuring role hierarchy can be cumbersome, slow down productivity, and result in tech debt.

Attribute-based Access Control (ABAC)

ABAC gates access to resources based on attributes, as opposed to users or roles. Attributes, such as who the user is, what action they’re trying to perform, which environment they are performing the action in, and what resource they are trying to perform the action on, are all considered when deciding whether or not access should be permitted. Rules are set up such that access is only allowed when conditions, determined by attributes, are met. For example, a rule can be set up such that a teaching assistant can only view grades if they’re in the grading room and it’s between 4:00 pm and 8:00 pm.

Let’s see how ABAC might be applied to the university example:

In this diagram, we can see how the ABAC policy works for a student who is trying to submit their assignment. For a student to submit their assignment under this policy, the student needs to have specific attributes, such as being enrolled and not being suspended. There are also contextual constraints, such as the submission needing to be before the deadline. If all of the conditions in the policy are satisfied, then the student can successfully submit their assignment.

Pros:

  • Highly scalable: New rules and attributes can easily be added as business needs evolve. As resources evolve, administrators can simply assign attributes to the resource, as opposed to creating a new role or changing an existing one.
  • Flexible custom policies: Rules are highly customizable, enabling administrators to easily set up access policies based on context, such as time of day and location.
  • Attributes to ensure compliance with data regulations: Administrators can add attributes to sensitive resources, allowing for labels to be added such as personally identifiable information (PII) or HIPAA for healthcare related information. This makes it easier to set up rules to ensure data privacy and data compliance with various regulations are met.

Cons:

  • Complex to implement and maintain: Attributes and policies need to be carefully defined and governed. The initial designing and assigning of attributes for users and resources can be a time-consuming and complex process. Then, continuing to maintain the attributes and access policies as business needs and applications change can require significant time and effort.
  • Difficult to assess risk exposure: Although it’s generally beneficial to be able to create highly customizable access policies, it can make it difficult to audit and assess risk exposure. For instance, understanding the full access a particular user has can be difficult since policies can be complex and contingent on context-specific conditions.

Choosing an Access Control Model

When it comes to choosing an access control model, users should consider how their organization may scale in the future, who will be responsible for maintaining the access control system, and if their needs actually require going with a more complex model. If there are a limited number of users and resources, ACLs may be the best approach as they are simple to understand and implement. If access policies need to be highly customized and dynamic, then ABAC may be a better approach. For something more scalable than ACLs but without the complexity of ABAC, then RBAC is probably sufficient. Organizations may also find that a hybrid approach of these models best serves their needs, such as RBAC and ABAC together.

At DeltaStream, we’ve taken the approach of adding RBAC to our platform. DeltaStream is a real-time stream processing platform that allows users to share, process, and govern their streaming data. In the data streaming space, Apache Kafka has been one of the leading open source projects for building streaming data pipelines and storing real-time events. However, access control with Kafka is managed through ACLs, and as the number of topics and users grow, managing these ACLs has been a pain point for Kafka users. As a data streaming platform that can connect to any streaming data source, DeltaStream allows users to manage and govern their streaming resources with RBAC. RBAC strikes the balance of improving on the scalability issues of ACLs without overcomplicating access control.

If you’re interested in discussing access control or learning more about DeltaStream, feel free to reach out or get a free trial.

23 May 2024

Workload Isolation: Everything You Need to Know

In cloud computing, workload isolation is critical for providing efficiency and security when running business workloads. Workload isolation is the practice of separating computing tasks into their own resources and/or infrastructure. By providing physical and logical separations, one compromised workload or resource cannot impact the others. This offers security and performance benefits and may be necessary to comply with regulatory requirements for certain applications.

Benefits of Workload Isolation

  • Security: By isolating workloads, organizations can reduce the ‘blast radius’ of security breaches. For instance, if an attacker were able to compromise the workload in one environment, workload isolation would protect the other workloads because they are being run in different environments. This helps to minimize, contain, and resolve potential security issues.
  • Performance: Isolated workloads can operate without interference from other tasks, ensuring that resources are dedicated and performance is optimized for each specific task. By isolating workloads, task performance becomes more predictable as tasks don’t need to compete for shared resources, making it easier to provide service level agreements (SLAs). Without workload isolation, a sudden spike in resource utilization for one task could negatively impact the performance of other tasks running on the same resources.
  • Compliance: Workload isolation simplifies compliance with various regulations by clearly defining boundaries between different data sets and processing activities.

Achieving workload isolation

Workload isolation can take many different forms and can be achieved with different approaches. When thinking about workload isolation, it is best to consider the multiple ways your workloads can be isolated, and to take a combined approach.

  • Resource Governance: Resource Governance is the ability to specify boundaries and limits for computing task resources. Popular container orchestration systems, such as Kubernetes, allow users to set resource limits on their services and workloads. Containerizing and limiting the resources for specific tasks removes the “noisy neighbor” problem, where one task can starve other tasks by consuming all of the resources.
  • Governance and Access Control: Providing access controls on data sets and compute environments ensures that only necessary individuals and services can access specific workloads. Most data systems have some form of access control that can be defined, whether that is in the form of an access control list (ACL), role-based access control (RBAC), or attribute-based access control (ABAC). Defining access control for users is essential to protect against unauthorized access.
  • Network Level Isolation: Network isolation aims to create distinct boundaries within a network, creating subnetworks with limited access between them. This practice improves security by limiting access to particular environments and helps ensure that an attacker cannot affect workloads on different subnetworks.

Workload isolation for Streaming Resources with DeltaStream

DeltaStream is a stream processing platform that is fully managed and serverless, allowing users to easily govern and process their streaming data from sources such as Apache Kafka or AWS Kinesis. As a security-minded stream processing solution, DeltaStream’s workload isolation plays a significant role in ensuring that computational queries are secure and performant. Below are some ways DeltaStream provides workload isolation:

  • Each Query Runs in its Own Environment: Powered by Apache Flink, each DeltaStream query runs in its own Flink cluster with its own dedicated resources and network. This ensures that users’ data is the only data being processed in a particular environment, minimizing the risk of sensitive data leakage. It also boosts performance, as each query can be scaled and tuned independently.
  • Multiple Deployment Options: DeltaStream offers various deployment options, including dedicated deployment and private SaaS deployment (also known as bring your own cloud or BYOC), catering to security-sensitive users. With the dedicated deployment option, a DeltaStream data plane runs in a cloud account dedicated to a single organization. In the private SaaS deployment option, a DeltaStream data plane operates within an organization’s cloud account. These options provide users with an additional level of assurance that their data is confined to a non-shared network — in the case of private SaaS, the data never leaves the user’s own network.
  • Role-based Access Control (RBAC): Access to queries and data objects within the DeltaStream Catalog is managed through DeltaStream’s RBAC. This gives users an easy-to-use and scalable system for properly governing and restricting access to their streaming data and workloads.

Workload isolation is essential for maintaining security and compliance in cloud products, with the added benefit of protecting workload performance. At DeltaStream, we have designed a stream processing platform that fully embraces workload isolation. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.

24 Apr 2024

Min Read

Prepare Data for ClickHouse Using Apache Flink

ClickHouse and Apache Flink are two powerful tools used for high-performance data querying and real-time data processing. By using these tools together, businesses can significantly improve the efficiency of their data pipelines, enabling data teams to get insights into their datasets more quickly.

ClickHouse is a fast and resource-efficient column-oriented database management system (DBMS). It specializes in online analytical processing (OLAP) and can handle many queries with minimal latency. With ClickPipes, users who have streaming data, such as data in Apache Kafka, can easily and efficiently build ClickHouse tables from their Kafka topics.

Apache Flink is a stream processing framework that allows users to perform stateful computations over their real-time data. It is fast, scalable, and has become an industry standard for event time stream processing. As a system with a rich connector ecosystem, Flink also integrates easily with Apache Kafka.

ClickHouse and Flink have been used together across the industry at companies like GoldSky, InstaCart, Lyft, and others. The typical infrastructure is as follows:

  1. Data from user product interactions, backend services, or database events via CDC are produced to a streaming data storage system (e.g. Kafka, Kinesis, Pulsar).
  2. Data in streaming storage is ingested by Flink, where it can be cleaned, filtered, aggregated, or otherwise sampled down.
  3. Flink produces the data back to the streaming storage where it is then loaded into ClickHouse via ClickPipes.
  4. Data scientists and data engineers can query ClickHouse tables for the most up-to-date data and take advantage of ClickHouse’s high-performance querying capabilities.

You may be wondering why Flink is needed in this architecture. Since ClickPipes already enable users to load data from streaming stores directly into ClickHouse, why not just skip Flink altogether?

The answer is that although ClickHouse is a highly optimized DBMS, queries such as aggregations over large data sets still force the ClickHouse query engine to bring the relevant columns of every entry into memory, which can affect query latency. In this ClickHouse blog, the following query was reported to take 15 seconds to complete:

SELECT
    project,
    sum(hits) AS h
FROM wikistat
WHERE date(time) = '2015-05-01'
GROUP BY project
ORDER BY h DESC
LIMIT 10

One feature that ClickHouse has to reduce latencies for commonly run queries is Materialized Views (ClickHouse docs on creating Views). In their blog, they first created a materialized view to compute the result, then ran the same query against the materialized view. The result was computed in 3ms as opposed to 15s.

Users who load their raw streaming data directly into a ClickHouse table can use materialized views to transform and prepare the data for consumption. However, these views need to be maintained by ClickHouse, and that overhead can add up, especially if many views are created. Having too many materialized views and putting too much computational load onto ClickHouse can degrade performance, resulting in lower write throughput and higher query latencies.

Introducing a stream processing engine, such as Flink, lets users transform and prepare streaming data before loading it into ClickHouse. This alleviates pressure on ClickHouse and allows users to take advantage of the features that come with Flink. For instance, ClickHouse is known to struggle with queries that include joins. With Flink, datasets can be joined and transformed before being loaded into ClickHouse. This way, instead of diverting resources to data preparation queries, ClickHouse can focus on serving high-volume OLAP queries, which it excels at. Since Flink is built to efficiently handle large and complex stream processing workloads, offloading complex computations from ClickHouse to Flink ultimately makes data available more quickly and reduces computational expenses.
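To make this concrete, here is a minimal Flink DataStream sketch of the kind of preparation step described above. It reads raw page-view events from one Kafka topic, pre-aggregates them into per-project hit counts over one-minute windows, and writes the compact results back to Kafka, where ClickPipes could then load them into ClickHouse. The topic names, broker address, and the simple “project,hits” record format are illustrative assumptions rather than part of any particular production setup.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PrepareForClickHouse {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical raw events, one per page view, formatted as "project,hits".
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed broker address
                .setTopics("raw-pageviews")              // assumed input topic
                .setGroupId("clickhouse-prep")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Pre-aggregated rows land in this topic, which ClickPipes could ingest.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .setTopic("pageviews-by-project-1m") // assumed output topic
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Raw Pageviews")
                // Parse "project,hits" into (project, hits) pairs.
                .map(line -> {
                    String[] parts = line.split(",");
                    return Tuple2.of(parts[0], Long.parseLong(parts[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                // Sum hits per project over one-minute processing-time windows, so ClickHouse
                // ingests one compact row per project per minute instead of every raw event.
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                // Emit compact "project,total_hits" rows.
                .map(t -> t.f0 + "," + t.f1)
                .returns(Types.STRING)
                .sinkTo(sink);

        env.execute("prepare-for-clickhouse");
    }
}

The same preparation logic could also be expressed declaratively on a managed platform such as DeltaStream, avoiding the need to write and operate a Java Flink job directly.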

Building with Cloud Products

There are many benefits to utilizing this architecture for real-time analytics, but the reality for many companies is that the systems involved require too many resources to maintain and operate. This is the classic build vs buy dilemma. If your company does decide to go with the buy route, here are the cloud offerings we recommend for the 3 main components of this architecture:

  1. Streaming Storage: For Kafka-compatible solutions, Confluent Cloud, Redpanda, Amazon MSK, and WarpStream are all viable options with different tradeoffs. Other streaming storage options include Amazon Kinesis and StreamNative for managed Pulsar.
  2. Stream Processing: DeltaStream is a great serverless solution to handle stream processing workloads. Powered by Apache Flink, DeltaStream users can benefit from the capabilities of Flink without having to worry about the complexity of learning, managing, and deploying Flink themselves.
  3. ClickHouse: ClickHouse Cloud is a serverless ClickHouse solution that is simple to set up, reliable, and has an intuitive SQL-based user interface.

Conclusion

In this post, we discussed a popular architecture involving Kafka, Flink, and ClickHouse that many companies have been adopting across the industry. These systems work together to enable high-performance analytics for real-time data. In particular, we touched on how Flink complements both Kafka and ClickHouse in this architecture.

If you’re looking for a cloud-based stream processing solution, DeltaStream is a serverless platform that is powerful, intuitive, and easy to set up. Stay tuned for our next blog post as we cover a use case using this architecture, with DeltaStream in place of Flink. Meanwhile, if you want to give DeltaStream a try yourself, you can sign up for a free trial.

17 Apr 2024

Min Read

Data Warehouse vs Data Lake vs Data Lakehouse: What’s the difference?

As data technologies continue to advance, modern companies are ingesting, storing, and processing more data than ever before in order to make the most informed business decisions. While relational databases may have been enough for the data demands of 25 years ago, the continual increase in data operations has led to the emergence of new data technologies to support the era of big data. These days, there is a host of cloud products for data teams to choose from, many of which describe themselves as data warehouses, data lakes, or data lakehouses. With such similar names, it can be difficult to understand what vendors actually mean by each term. In this post, we’ll break down what these terms mean, then discuss how real-time data streaming plays a role in the big data landscape.

What is a Data Warehouse?

A data warehouse is a storage and processing hub, primarily intended for generating reports and performing historical analysis. Data stored in a data warehouse is structured and well-defined, allowing the warehouse to perform fast, efficient analysis on its datasets. Data from relational databases, streaming storage systems, backend systems, and other sources is loaded into the data warehouse through ETL (extract, transform, load) processes, where it is cleaned and otherwise transformed to match the data integrity requirements expected by the warehouse. Most data warehouses allow users to access data through SQL clients, business intelligence (BI) tools, or other analytical tools.

Data warehouses are a great choice for organizations that primarily need to do historical data analytics and reporting on structured data. However, the ETL process adds complexity to the ingestion of data into the data warehouse and the requirements for structured data can make the system limiting for some use cases. Popular data warehouse vendors include Snowflake, Amazon Redshift, Google BigQuery, and Oracle Autonomous Data Warehouse.

What is a Data Lake?

A data lake is a massive storage system designed to store both structured and unstructured data at any scale. Similar to data warehouses, data lakes can ingest data from many different sources. However, data lakes are designed to be flexible so that users can store their raw data as-is, without needing to clean, reformat, or restructure it first. By utilizing cheap object storage and accommodating a wide range of data formats, data lakes make it easy for developers to simply store their data. This ultimately results in organizations accumulating large repositories of data that can power use cases such as machine learning, aggregations on large datasets, and exploring patterns across data from different sources. One of the challenges of working with data lakes, however, is that downstream tasks need to make sense of differently formatted data before they can analyze it. Further, if poorly maintained, data quality can easily become an issue. Tools like Apache Hadoop and Apache Spark are popular for analyzing data lakes, since they allow developers to write custom logic to make sense of different kinds of data, but they require more expertise to work with, which limits the set of people who can feasibly use the data lake.

Data lakes are a good choice for organizations that have a lot of data they need to store, accommodating both structured and unstructured data, but analyzing and maintaining the data lake can be a challenge. Data lakes are commonly built on cheap cloud storage solutions such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

What is a Data Lakehouse?

Data lakehouses merge the features of data warehouses and data lakes into a single system, hence the name. As data warehouses began adding more features found in data lakes, and as data lakes began adding more features found in data warehouses, the distinction between the two concepts became somewhat blurred. Before data lakehouses, organizations would typically need both a data lake for storage and a data warehouse for processing, but this setup could end up causing data teams a lot of overhead, as data from one location would often need to be processed or duplicated to the other location for data engineers to perform complete analyses. By merging the two concepts into a single system, data lakehouses aim to remove these silos and get the benefits of both worlds. Similar to data lakes, storing data in a data lakehouse is still cheap, scalable, and flexible, but metadata layers are also provided to enforce things like schemas and data validation where necessary. This allows the data lakehouse to still be performant for querying and analytics, like data warehouses are.

Since data is typically loaded into a data lakehouse in its raw format, it’s common for a medallion architecture to be used. The medallion architecture describes a series of queries or processing steps that transform raw data (bronze) into filtered and cleaned data (silver), and then into business-ready aggregated results (gold), where the gold set of data can be easily queried for BI purposes.

While the actual distinctions of what makes a system a data lakehouse rather than a data lake or data warehouse are somewhat nuanced, popular cloud vendors with data lakehouse capabilities include Databricks Lakehouse Platform, Snowflake, Amazon Redshift Spectrum, and Google Cloud BigLake. Data lakehouses can handle a wide range of use cases, but they can be complex to manage and still require skilled data experts to extract their full benefits.

Impacts of Real-time Streaming Data

As big data technologies continue to evolve, there has been an increasing demand for real-time data products. Users are becoming more accustomed to getting results instantly, and in order to support these use cases, companies have been adopting streaming technologies such as Apache Kafka and Apache Flink.

The Challenges of Streaming Data in the Current Ecosystem

Apache Kafka is a real-time event log that uses a producer/consumer model. Microservices, clients, and other systems with real-time data produce events to Kafka topics, and the events in these topics are consumed by other real-time services that act on them. Kafka and other streaming storage systems typically set an expiration period for their data events, so in order to keep real-time data long-term, organizations typically load it into a data lake, data warehouse, or data lakehouse for later analysis. However, streaming data from IoT sensors, financial services, and web interactions can add up to a large volume of data, and performing computations on the raw form of this data can be too slow or too computationally expensive to be viable. To address this, data engineers typically downsample or otherwise transform the raw data to prepare it for end users. In the case of data lakehouses, a medallion architecture, as mentioned earlier, is recommended to prepare the data for general consumption. For data lakes, a compute engine such as a data warehouse, or some Spark/Hadoop infrastructure, is needed to transform the data into more consumable results.

A setup that requires constant recomputation comes with an inherent tradeoff. Real-time data is constantly arriving in the data lake or data lakehouse, so users need to choose between recomputing results often, which can be computationally expensive, and recomputing less frequently, which leaves datasets stale. Another issue with this setup is that computed results need to be stored as well. In the medallion architecture, for example, where raw data goes through multiple processing steps before being ready for warehouse-like querying, this can mean storing essentially the same data multiple times. The result is higher storage costs and higher latencies, as each processing step needs to be scheduled for recomputation.

Using Stream Processing to Prepare Streaming Data

This is where a stream processing solution, such as Apache Flink, becomes beneficial. Stream processing jobs are long-lived and produce analytical results incrementally, as new data events arrive. Contrast this with the medallion architecture, where new result datasets need to be completely recomputed. By adding stream processing to the data stack, streaming data can be filtered, transformed, and aggregated before it ever arrives at the data lake, data warehouse, or data lakehouse layer. This results in lower computational costs and lower end-to-end latencies.

One of the main burdens of Apache Flink and other stream processing frameworks is their complexity. Understanding how to develop, manage, scale, and provide fault tolerance for stream processing applications requires skilled personnel and time. With DeltaStream, we take all of that complexity away so that users can focus on their processing logic. DeltaStream is a fully managed serverless stream processing solution that is powered by Apache Flink. If you’re interested in how DeltaStream can help you manage your streaming data, schedule a demo with us or reach out to us on one of our socials.

13 Mar 2024

Min Read

How to Read Kafka Source Offsets with Flink’s State Processor API

Apache Flink is one of the most popular frameworks for data stream processing. As a stateful processing engine, Flink is able to handle processing logic with aggregations, joins, and windowing. To ensure that Flink jobs are recoverable with exactly-once semantics, Flink has a state-of-the-art state snapshotting mechanism, so in the event of a failure, the job can be resumed from the latest snapshot.

In some advanced use cases, such as job migrations or job auditing, users may need to inspect or modify their Flink job’s state snapshots (called Savepoints and Checkpoints in Flink). For this purpose, Flink provides the State Processor API. However, this API is not always straightforward to use and requires a deep understanding of Flink operator states.

In this post, we’ll cover an example of using the State Processor API, broken up into 3 parts:

  1. Introduce our Flink job which reads data from an Apache Kafka topic
  2. Deep dive into how Flink’s KafkaSource maintains its state
  3. Use the State Processor API to extract the Kafka partition-offset state from the Flink job’s savepoint/checkpoint

If you want to see an example of the State Processor API in use, feel free to skip ahead to the last section.

Note that this post is a technical tutorial for those who want to get started with the State Processor API, and is intended for readers who already have some familiarity with Apache Flink and stream processing concepts.

Creating a Flink Job

Below is the Java code for our Flink job. This job simply reads from the “source” topic in Kafka, deserializes the records as simple Strings, then writes the results to the “sink” topic.

  1. public class FlinkTest {
  2.
  3.     public static void main(String[] args) throws Exception {
  4.         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  5.
  6.         KafkaSource<String> source = KafkaSource.<String>builder()
  7.                 .setBootstrapServers("localhost:9092")
  8.                 .setTopics("source")
  9.                 .setGroupId("my-group")
  10.                 .setStartingOffsets(OffsetsInitializer.latest())
  11.                 .setValueOnlyDeserializer(new SimpleStringSchema())
  12.                 .build();
  13.
  14.         DataStream<String> sourceStream = env.fromSource(
  15.                 source, WatermarkStrategy.forMonotonousTimestamps(), "Kafka Source")
  16.                 .uid("kafkasourceuid");
  17.
  18.         KafkaRecordSerializationSchema<String> serializer = KafkaRecordSerializationSchema.builder()
  19.                 .setValueSerializationSchema(new SimpleStringSchema())
  20.                 .setTopic("sink")
  21.                 .build();
  22.         Properties kprops = new Properties();
  23.         kprops.setProperty("transaction.timeout.ms", "300000"); // e.g., 5 mins
  24.         KafkaSink<String> sink = KafkaSink.<String>builder()
  25.                 .setBootstrapServers("localhost:9092")
  26.                 .setRecordSerializer(serializer)
  27.                 .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  28.                 .setKafkaProducerConfig(kprops)
  29.                 .setTransactionalIdPrefix("txn-prefix")
  30.                 .build();
  31.
  32.         sourceStream.sinkTo(sink);
  33.         env.enableCheckpointing(10000L);
  34.         env.getCheckpointConfig().setCheckpointTimeout(60000);
  35.         env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
  36.         env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500L);
  37.         env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1);
  38.         env.getCheckpointConfig().setCheckpointStorage("file:///tmp/checkpoints");
  39.         env.execute("tester");
  40.     }
  41. }

There are a few important things to note from this Flink job:

  1. We create a KafkaSource object on line 6. This KafkaSource is then given to the StreamExecutionEnvironment’s fromSource method on line 14, which returns a DataStreamSource object representing the actual Flink source operator.
  2. We set the operator ID for our KafkaSource operator using the uid method on line 16. It’s best practice to set IDs for all Flink operators when possible, but we’re emphasizing it here because we’ll need to refer to this ID when we use the State Processor API to inspect the state snapshots.
  3. Flink checkpointing is turned on. On lines 33-38, we configure our Flink environment’s checkpointing so that the job takes a checkpoint every 10 seconds. We’ll be analyzing these checkpoints later on.

Understanding the KafkaSource State

Before we inspect the checkpoints generated from our test Flink job, we first need to understand how the KafkaSource Flink operator saves its state.

As we’ve already mentioned, we’re using Flink’s KafkaSource to connect to our source Kafka data. Flink sources have 3 main components – Split, SourceReader, and SplitEnumerator (Flink docs). A Split represents a portion of the data that a source consumes and is the granularity at which the source can parallelize reading. For the KafkaSource, each Kafka partition corresponds to a separate Split, represented by the KafkaPartitionSplit class. The KafkaPartitionSplit is serialized by the KafkaPartitionSplitSerializer class. The logic for this serializer is simple: it writes out a byte array containing the Split’s topic, partition, and offsets.

KafkaPartitionSplitSerializer’s serialize method:

@Override
public byte[] serialize(KafkaPartitionSplit split) throws IOException {
    try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(baos)) {
        out.writeUTF(split.getTopic());
        out.writeInt(split.getPartition());
        out.writeLong(split.getStartingOffset());
        out.writeLong(split.getStoppingOffset().orElse(KafkaPartitionSplit.NO_STOPPING_OFFSET));
        out.flush();
        return baos.toByteArray();
    }
}

At runtime, Flink instantiates all of the operators, including the SourceOperator objects. Each stateful Flink operator has a name associated with each of its stateful objects. In the case of a source operator, the name associated with the split states is defined by SPLITS_STATE_DESC.

static final ListStateDescriptor<byte[]> SPLITS_STATE_DESC =
        new ListStateDescriptor<>("SourceReaderState", BytePrimitiveArraySerializer.INSTANCE);

We can inspect the SourceOperator class further to see where these split states are initialized, in the initializeState method.

SourceOperator’s initializeState method:

@Override
public void initializeState(StateInitializationContext context) throws Exception {
    super.initializeState(context);
    final ListState<byte[]> rawState =
            context.getOperatorStateStore().getListState(SPLITS_STATE_DESC);
    readerState = new SimpleVersionedListState<>(rawState, splitSerializer);
}

The Flink state that source operators use is the SimpleVersionedListState, which uses the SimpleVersionedSerialization class. In the SimpleVersionedListState class, the serialize method calls the writeVersionAndSerialize method to ultimately serialize the state.

Finally, if we inspect the writeVersionAndSerialize method in the SimpleVersionedSerialization class, we can see that before the actual data associated with our source operator is written, a few bytes are written for the serializer version and the data’s length.

SimpleVersionedSerialization’s writeVersionAndSerialize method:

public static <T> void writeVersionAndSerialize(
        SimpleVersionedSerializer<T> serializer, T datum, DataOutputView out)
        throws IOException {
    checkNotNull(serializer, "serializer");
    checkNotNull(datum, "datum");
    checkNotNull(out, "out");

    final byte[] data = serializer.serialize(datum);

    out.writeInt(serializer.getVersion());
    out.writeInt(data.length);
    out.write(data);
}

Let’s quickly recap the important parts from above:

  1. The KafkaSource operator stores its state in KafkaPartitionSplit objects.
  2. The KafkaPartitionSplit keeps track of the current topic, partition, and offset that the KafkaSource has last processed.
  3. When Flink savepointing/checkpointing occurs, a byte array representing the KafkaSource state gets written to the state snapshot. The byte array has a header which includes the serializer version and the length of data. Then the actual state data, which is a serialized version of the KafkaPartitionSplit, makes up the rest of the state byte array.

Now that we have some idea of how data is being serialized into Flink savepoints and checkpoints, let’s see how we can use the State Processor API to extract the Kafka source operator information from these state snapshots.

State Processor API to Inspect Kafka Source State

For Maven projects, you can add the following dependency to your pom.xml file to start using the Flink State Processor API.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-state-processor-api</artifactId>
    <version>1.18.0</version>
</dependency>

The following class showcases the full example of how we can use the State Processor API to read KafkaSource offsets from a Flink savepoint or checkpoint.

  1. public class StateProcessorTest {
  2.
  3.     public static void main(String[] args) throws Exception {
  4.         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  5.
  6.         String savepointPath = Path.of("/tmp/checkpoints/609bc335486ca6cfcc8692e4c1ff8782/chk-8").toString();
  7.         SavepointReader savepoint = SavepointReader.read(env, savepointPath, new HashMapStateBackend());
  8.         DataStream<byte[]> listState = savepoint.readListState(
  9.                 OperatorIdentifier.forUid("kafkasourceuid"),
  10.                 "SourceReaderState",
  11.                 PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
  12.         CloseableIterator<byte[]> states = listState.executeAndCollect();
  13.         while (states.hasNext()) {
  14.             byte[] s = states.next();
  15.             KafkaPartitionSplitSerializer serializer = new KafkaPartitionSplitSerializer();
  16.             KafkaPartitionSplit split = serializer.deserialize(serializer.getVersion(), Arrays.copyOfRange(s, 8, s.length));
  17.             System.out.println(
  18.                 String.format("topic=%s, partition=%s, startingOffset=%s, stoppingOffset=%s, topicPartition=%s",
  19.                     split.getTopic(), split.getPartition(),
  20.                     split.getStartingOffset(), split.getStoppingOffset(), split.getTopicPartition()));
  21.         }
  22.
  23.         System.out.println("DONE");
  24.     }
  25. }

First, we’ll load the savepoint. The SavepointReader class from the State Processor API allows us to load a full savepoint or checkpoint. On line 7, we are loading a checkpoint that was created in “/tmp/checkpoints” as a result of running the test Flink job. As we mentioned in the previous section, the source operators use a SimpleVersionedListState, which the SavepointReader can read using the readListState method. When reading the list states, we need to know 3 things:

  1. Operator ID: “kafkasourceuid” set in our test Flink job
  2. State Name: “SourceReaderState” set in Flink’s SourceOperator class
  3. State TypeInformation: PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO set in Flink’s SourceOperator class

After we get our list states, we can simply iterate through each of the states, which are given as byte arrays. Since the SimpleVersionedSerialization serializer first writes the version and data length, which we don’t care about, we need to skip those headers. You’ll see on line 16 that we deserialize the byte array as a KafkaPartitionSplit after skipping the first 8 bytes of the state byte array.

Running the above code example gives the following result:

topic=source, partition=0, startingOffset=3, stoppingOffset=Optional.empty, topicPartition=source-0
DONE
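As an aside, instead of hard-coding the 8-byte skip used on line 16 of the example above, the header can be read explicitly. Below is a small, hypothetical helper (the class and method names are ours, not part of the Flink API) that reads the serializer version and payload length written by SimpleVersionedSerialization before deserializing the remaining bytes as a KafkaPartitionSplit:

import org.apache.flink.connector.kafka.source.split.KafkaPartitionSplit;
import org.apache.flink.connector.kafka.source.split.KafkaPartitionSplitSerializer;

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class SplitStateParser {

    // Parses one element of the "SourceReaderState" list state, laid out as:
    // [int serializerVersion][int payloadLength][serialized KafkaPartitionSplit]
    static KafkaPartitionSplit parse(byte[] state) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(state))) {
            int version = in.readInt();  // header written by writeVersionAndSerialize
            int length = in.readInt();   // length of the serialized split payload
            byte[] payload = new byte[length];
            in.readFully(payload);
            // Deserialize using the version recorded in the snapshot itself.
            return new KafkaPartitionSplitSerializer().deserialize(version, payload);
        }
    }
}

Using the version read from the state, rather than the current serializer’s version, keeps the parsing robust if the split serialization format changes between Flink releases.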

Conclusion

In this post, we explained how Flink’s KafkaSource state is serialized into savepoints and covered an example of reading this state with the State Processor API. Flink’s State Processor API can be a powerful tool to analyze and modify Flink savepoints and checkpoints. However, it can be confusing for beginners to use and requires some in-depth knowledge about how the Flink operators manage their individual states. Hopefully this guide will help you understand the KafkaSource and serve as a good tutorial for getting started with the State Processor API.

For more content about Flink and stream processing, check out the rest of DeltaStream’s blog. DeltaStream is a platform that simplifies the unification, processing, and governance of streaming data.
