19 Dec 2023
Securing Real-Time Streaming Data: DeltaStream’s Approach
The introduction of the General Data Protection Regulation (GDPR) in 2018, followed by laws such as the California Privacy Rights Act (CPRA), has made proper handling of data security and data privacy a requirement for businesses and consumers. While data security in the industry has improved since these laws were passed, “data leakage” events are still not uncommon. The fallout from such events can be devastating, both for the company that leaked the data and for any customers or partners associated with that company. Since the beginning of DeltaStream, real-time data security has been one of its foundational pillars. To build a data platform that users can trust in production, we knew that we had to design our system with security at the forefront of our minds. In this blog post, we discuss how DeltaStream keeps user data safe and how DeltaStream provides tools to help users safely share their data.
How DeltaStream Keeps your Data Safe: Zero Trust
DeltaStream is designed to be highly secure, and private connections can be set up between DeltaStream and other networks. However, if a security breach were to occur, the exposure from that breach should be minimal. In other words, DeltaStream takes a zero trust approach, in which security verifications exist at multiple levels and security risks are actively avoided. Here are three ways DeltaStream's design aims to keep your data safe.
Overview of DeltaStream’s architecture
Queries for Data Processing are Run in Isolated Environments
In DeltaStream, users write SQL statements to define long-running queries that process their streaming data. Behind the scenes, these long-running queries are powered by Apache Flink. On its own, the Flink framework does not have a security model, so it’s DeltaStream’s responsibility to ensure that Flink is run in a secure manner.
One way this is done is by securing the network for the query’s runtime environment. Outside network calls into the runtime environment are not allowed and only necessary network calls from the runtime environment to the outside are allowed, such as connecting to a specific Kafka broker if the query requires reading or writing to Kafka.
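To make the rule concrete, here is a minimal sketch of a deny-all-ingress, allowlist-egress policy. It is only an illustration of the idea described above, not DeltaStream's implementation; the `Endpoint` and `QueryNetworkPolicy` names and the broker hostname are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


@dataclass(frozen=True)
class QueryNetworkPolicy:
    """Hypothetical per-query policy: no ingress, allowlisted egress only."""
    allowed_egress: frozenset = frozenset()

    def allows_ingress(self, source: Endpoint) -> bool:
        # Calls from outside into the query runtime are never allowed.
        return False

    def allows_egress(self, destination: Endpoint) -> bool:
        # Outbound calls are allowed only to endpoints the query needs.
        return destination in self.allowed_egress


# Assumed example: the query reads from one specific Kafka broker.
broker = Endpoint("kafka-broker.example.internal", 9092)
policy = QueryNetworkPolicy(allowed_egress=frozenset({broker}))

print(policy.allows_egress(broker))                           # True
print(policy.allows_egress(Endpoint("other.example", 443)))   # False
print(policy.allows_ingress(Endpoint("attacker.example", 0))) # False
```

In a real deployment this kind of rule would typically be enforced by the network layer rather than by application code; the sketch only shows the shape of the decision.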
Another way we keep a query’s runtime secure is by dedicating separate environments for each query. This way, if an attacker is able to gain access to a query’s runtime environment, then the attacker will only have access to a single query’s environment and cannot affect other queries that may be running. This also means that queries do not compete for resources and a faulty query cannot adversely affect another query.
Only Required Data is Held and Encrypted by DeltaStream
The best way to keep users’ data safe is to not store it in the first place. Of course, there is some amount of data that DeltaStream must store in order to support its feature set, but any data that is not absolutely required is not stored.
Let’s consider the common use-case query that does the following:
- Read data from a Kafka topic
- Mutate the data (according to the SQL query)
- Write the results to sink storage
In this example, DeltaStream connects to the source Kafka topic and reads the source records into memory. Mutations to that data are performed in memory, then the data is written to the sink destination (another Kafka topic, S3 in Delta format for Databricks, Snowflake, etc.). User data is kept entirely in memory, and at no point in this scenario does DeltaStream persist the user data to disk.
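The read, mutate, write flow above can be sketched as a small streaming loop. This is an illustrative sketch only: plain Python lists stand in for the Kafka source and the sink, and `run_pipeline` and `mask_email` are hypothetical names, not DeltaStream or Flink APIs.

```python
from typing import Callable, Iterable


def run_pipeline(source: Iterable[dict],
                 transform: Callable[[dict], dict],
                 sink: Callable[[dict], None]) -> int:
    """Pass records from source to sink one at a time, entirely in memory."""
    written = 0
    for record in source:        # read a record from the source
        sink(transform(record))  # mutate it in memory, then write it out
        written += 1
    return written


def mask_email(record: dict) -> dict:
    """Hypothetical mutation: return a copy with the email field masked."""
    return {**record, "email": "***"}


# Stand-ins for a source topic and a sink destination.
source_records = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "b@example.com"},
]
sink_records: list = []

written = run_pipeline(iter(source_records), mask_email, sink_records.append)
print(written, sink_records)
```

Nothing in the loop opens a file or writes intermediate results anywhere; each record exists only in memory between the read and the write.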
There are three caveats where some form of user data is written to disk, and in each of these cases, the at-rest data is stored in encrypted storage.
- Stateful queries, such as queries that perform aggregations, have their operational state occasionally snapshotted and stored for fault-tolerance purposes
- Connection information that the user provides so that DeltaStream can connect to their storage layer, such as Kafka
- Queries sinking to a Materialized View write data to an encrypted data store where the view is created
Finally, any data that is in transit is encrypted with TLS to ensure end-to-end data security.
BYOC and Dedicated Data Plane Deployment Options
The DeltaStream platform is implemented with a control plane and a data plane. At a high level, the control plane decides how data is managed and processed, and the data plane is where the actual management and processing of data occurs. In DeltaStream, data never leaves the data plane. The only communication between the control plane and data plane includes the data plane pulling instructions from the control plane, and the data plane pushing metrics and status updates to the control plane.
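A rough sketch of this pull-based interaction is shown below. It assumes a simple in-process queue in place of a real network API, and every class, method, and field name here is hypothetical; the point is only that instructions flow one way, metrics and status flow the other, and records never appear in what the control plane receives.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Holds instructions and receives status; never sees user records."""
    pending: deque = field(default_factory=deque)
    statuses: list = field(default_factory=list)

    def next_instruction(self):
        # Called by the data plane; the control plane never calls into it.
        return self.pending.popleft() if self.pending else None

    def report(self, status: dict) -> None:
        self.statuses.append(status)


class DataPlane:
    def __init__(self, control_plane: ControlPlane):
        self.control_plane = control_plane
        self.records_processed = 0

    def poll_once(self, records: list) -> bool:
        """Pull one instruction, process records locally, push only metrics."""
        instruction = self.control_plane.next_instruction()
        if instruction is None:
            return False
        self.records_processed += len(records)  # records stay in the data plane
        self.control_plane.report({
            "instruction": instruction,
            "state": "running",
            "records_processed": self.records_processed,
        })
        return True


control_plane = ControlPlane(pending=deque(["start query q1"]))
data_plane = DataPlane(control_plane)
data_plane.poll_once([{"card_number": "4111"}, {"card_number": "4242"}])
print(control_plane.statuses)
```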
A user can choose between three different data plane deployment models: Public, Dedicated, and Bring Your Own Cloud (BYOC).
- The Public data plane shares network and other resources with other DeltaStream customers.
- A Dedicated data plane can be set up for users who want their own isolated network and resources in a cloud managed by DeltaStream.
- The BYOC deployment is for users who require network and resources to be in their own cloud account so that data never leaves their cloud account.
In all three options, DeltaStream manages the network and resources so that users still get a serverless DeltaStream experience. The difference is which VPC the data plane runs in.
Although most users may be satisfied with the Public deployment option, at DeltaStream we recognize that security requirements differ for different customers in different industries. That’s why we’ve chosen to provide BYOC and Dedicated as options to allow users to choose which model best suits their security needs. Read more about BYOC (also called Private SaaS).
Federated Data Governance and DeltaStream
Federated data governance is the union of data federation and data governance. Why both? Data federation is the concept of viewing multiple data sources in a unified view. Data governance is the idea of managing and ensuring data availability, security, and integrity. Only with data federation and data governance together do users get things like access control over a unified view of all their data sources. This means there is a single place to define access control policies for all your data, leading to less management overhead and fewer mistakes when defining policies.
For existing streaming storage systems like Kafka, the data is structured into a flat namespace. Access control is typically managed by Access Control Lists (ACLs), which are created for each user and resource pair (e.g. a single user has access to a single topic). Managing ACLs is cumbersome. If you need to give 3 users access to 50 topics, you have to create 150 ACLs. This time-consuming process is prone to error, especially as the storage layer grows larger. It also poses a security risk, as ACLs can easily be misconfigured and, for example, give a user more permissions than they ought to have. Further, managing consistent access control across storage systems, such as multiple Kafka clusters or multiple streaming platforms, becomes extremely difficult.
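The arithmetic is easy to check by enumerating one ACL entry per user and topic pair. The user and topic names below are made up for illustration.

```python
from itertools import product

# Hypothetical users and topics, matching the 3 x 50 example above.
users = [f"user{i}" for i in range(1, 4)]
topics = [f"topic{i}" for i in range(1, 51)]

# One ACL entry per (user, topic) pair.
acls = set(product(users, topics))
print(len(acls))  # 150

# Onboarding a fourth user means creating another 50 entries.
acls_with_new_user = acls | {("user4", topic) for topic in topics}
print(len(acls_with_new_user))  # 200
```

The number of entries grows with the product of users and resources, which is why each new user or topic adds a whole batch of ACLs to create and review.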
To address the deficiencies of a flat namespace, DeltaStream brings a hierarchical namespace model such that data resources exist within a “Database” and “Schema”. In this model, data from different storage systems can exist within the same database and schema in DeltaStream, so the logical organization of data is decoupled from the data’s physical storage systems. DeltaStream then provides Role Based Access Control (RBAC) on top of this relational organization of data, where roles can define access policies for particular resources and users can inherit one or many roles.
Diagram of the unified view that DeltaStream offers to users over their streaming resources. RBAC can be applied to Databases, Schemas, and Relations.
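As a rough illustration of the namespace and role model described above, the sketch below grants a privilege to a role at the schema level and lets users inherit it through the role. All names are hypothetical, and the assumption that a grant on a database or schema covers the objects inside it is a simplification made for this example, not a statement about DeltaStream's actual RBAC semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    # Each grant is (resource_path, privilege). A path is a tuple such as
    # ("sales",), ("sales", "public"), or ("sales", "public", "orders").
    grants: set = field(default_factory=set)

    def allows(self, resource: tuple, privilege: str) -> bool:
        # Assumption for this sketch: a grant covers everything under its path.
        return any(
            granted == privilege and resource[:len(path)] == path
            for path, granted in self.grants
        )


@dataclass
class User:
    name: str
    roles: list = field(default_factory=list)

    def can(self, resource: tuple, privilege: str) -> bool:
        # A user inherits the permissions of every role they hold.
        return any(role.allows(resource, privilege) for role in self.roles)


# Relations backed by different physical stores share one database and schema.
physical_store = {
    ("sales", "public", "orders"): "kafka_cluster_a",
    ("sales", "public", "payments"): "kafka_cluster_b",
}

analyst = Role("analyst", grants={(("sales", "public"), "SELECT")})
alice = User("alice", roles=[analyst])
bob = User("bob", roles=[analyst])

for relation in physical_store:
    print(relation, alice.can(relation, "SELECT"), bob.can(relation, "SELECT"))
print(alice.can(("hr", "public", "salaries"), "SELECT"))
```

Here a single grant on one role covers both relations for both users, regardless of which cluster backs each relation, which is the management saving that namespacing and RBAC are meant to provide over per-user, per-topic ACLs.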
By simplifying and standardizing access control with namespacing and RBAC, DeltaStream empowers users to implement granular and sophisticated access control policies across multiple data sets. This means within your organization, data can be shared easily on a need-to-know basis.
Wrapping up
Designing data systems with security in mind is essential. We have covered at a high level how DeltaStream secures user data and provides federated data governance to help users secure data within their organization. Announcing our SOC 2 compliance a few months ago was just the first step toward proving our commitment to data security. If you want to learn more about DeltaStream’s approach to security or want to try DeltaStream for yourself, reach out to us to schedule a demo.