16 Jan 2024
Streaming Data Governance and DeltaStream
Data Governance, the practice of ensuring your data is accessible, compliant, and secure across your systems, is increasingly vital for organizations. For real-time stream processing, there is a growing need to govern data assets, but to date this has proven difficult to achieve. In this post, we will explore what Data Governance is, why it’s important, and how Data Unification and Data Governance work hand in hand when it comes to stream processing.
What is Data Governance and why is it Important?
Data Governance refers to the rules, policies and systems designed to make data available, secure and compliant across the organization. This includes determining which users are responsible for particular data assets and how these data assets can be accessed. Businesses that have a strong Data Governance system are able to extract the most value out of their data because they are able to easily make data available while ensuring the quality, security, and privacy of their data. This ultimately leads to faster data-driven decision making, improved efficiency for internal operations, and enhanced data security.
The main elements of Data Governance include:
- Data cataloging - the metadata that is used to organize the different data assets in an organization to improve data discovery and management
- Access control and security of data - the mechanisms that ensure only authorized users have access to data assets
- Secure data sharing - the ability to make data resources available to other users
- Data quality - the measurement of how accurate, complete, timely, and consistent the data is
Data Governance is more Powerful with Unified Data
In the data streaming space, it’s not uncommon for companies to utilize multiple data streaming stores. Having multiple Kafka clusters or some combination of storage systems, such as Kafka with Kinesis, is a pattern we’ve seen many companies adopt across the industry. The reasons behind choosing multiple data stores vary, but can typically be attributed to one or more of the following: ensuring data isolation, tech debt as the result of data migrations, different teams preferring different technologies, and different technologies being better suited for different use cases.
However, Data Governance only goes as far as the data platform’s reach. For example, Kafka provides Access Control Lists (ACLs) as a mechanism for Data Governance. Using ACLs, users can define access policies for topics, but ACLs are limited to the topics in a single Kafka cluster. As we mentioned above, many companies don’t operate using a single Kafka cluster for their entire data streaming storage layer, meaning they’ll need to govern each streaming data store individually. If there are two Kafka clusters, this would mean two separate sets of ACLs that need to be maintained. Herein lies the problem: Data Governance at the individual storage system level has a large overhead, is cumbersome, and is error-prone.
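To make the overhead concrete, here is a minimal sketch of how the same policy can drift when it has to be maintained per cluster. It does not use any Kafka API: the `AclEntry` tuple, the `missing_entries` helper, and the cluster, principal, and topic names are all hypothetical and exist only for illustration.

```python
# Hypothetical model: an ACL entry is (principal, topic, operation).
AclEntry = tuple[str, str, str]


def missing_entries(
    desired: set[AclEntry],
    cluster_acls: dict[str, set[AclEntry]],
) -> dict[str, set[AclEntry]]:
    """Return, per cluster, the desired entries that the cluster lacks."""
    drift = {}
    for cluster, acls in cluster_acls.items():
        missing = desired - acls
        if missing:
            drift[cluster] = missing
    return drift


# The policy the team intends to have everywhere.
desired = {
    ("team-a", "orders", "READ"),
    ("team-a", "refunds", "READ"),
}

# Each cluster keeps its own ACLs; cluster_2 was never updated for "refunds".
cluster_acls = {
    "cluster_1": set(desired),
    "cluster_2": {("team-a", "orders", "READ")},
}

drift = missing_entries(desired, cluster_acls)
```

Every additional cluster adds another copy of the policy that has to be kept in sync by hand, which is where the overhead and the errors come from.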
What is really needed is a unified view over all of the data storage systems, so that Data Governance can be applied in one place. Providing capabilities like data namespacing and access control at a layer above the individual storage systems allows users to have more control over how they organize their data, as opposed to being forced to manage data resources based on where that data is physically stored. This way, if a team is responsible for topics in two different Kafka clusters, they could organize them into the same namespace in the unified view. By operating with this unified view, the data silos are eliminated, making it easier for organizations to categorize, build/share data products, and provide policies for their data. Also, having a single place to apply Data Governance relieves the burden on data users who would otherwise need to understand details of the individual storage systems, leading to improved productivity with fewer mistakes.
DeltaStream’s Approach – Unification and Governance of Streaming Data
DeltaStream is a stream processing platform that both Unifies and Governs your streaming data across all your streaming storage systems. Using DeltaStream, organizations are able to get real-time insights across all their data assets in one platform. DeltaStream can connect to both streaming and relational data stores, and provide a unified view of all your data in a Streaming Catalog. Role-Based Access Control (RBAC) is then applied to the Streaming Catalog to Govern user access to specific data assets.
Figure: Overview of Data Governance in DeltaStream’s platform (shown in the middle, highlighted in light purple)
Streaming Catalog
The Streaming Catalog organizes your data assets in the platform. DeltaStream can connect to any of your streaming or relational data stores, including your Kafka clusters, Kinesis, PostgreSQL databases, and others. This provides a global view of all your data sources. DeltaStream then allows users to categorize the data into hierarchical namespaces to isolate the data that users and teams need access to. DeltaStream allows for an unlimited number of namespaces. For an organization, this creates a central data platform for stream processing.
In DeltaStream, Stores define the connectivity to data storage systems such as Kafka, Redpanda, Kinesis, and PostgreSQL. Users can then define Relations to represent data entities in the Store: for Kafka, a data entity is a topic; for PostgreSQL, it is a table. As in other data catalogs, when these Relations are created, they are added to a specific Database and Schema that the user can specify. The Streaming Catalog in an Organization can contain any number of Databases, and any number of Schemas within those Databases. The Database and Schema provide two namespacing levels for users to organize their Relations. Data from different storage systems can exist within the same Database and Schema in DeltaStream, so the logical organization of data in the Streaming Catalog is decoupled from the data’s physical storage systems. This way, teams can organize their data to align with the context in which that data is being used.
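The relationship between Stores, Relations, Databases, and Schemas can be sketched as a small data model. This is an illustrative Python model of the concepts described above, not DeltaStream's implementation or syntax; the class names, fields, and the example store and topic names are assumptions made for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Store:
    """Connectivity to one physical storage system (hypothetical model)."""
    name: str
    kind: str  # e.g. "kafka", "kinesis", "postgresql"


@dataclass(frozen=True)
class Relation:
    """A data entity in a Store: a Kafka topic, a PostgreSQL table, etc."""
    name: str
    store: Store
    entity: str  # topic or table name in the underlying store


@dataclass
class Catalog:
    """Two namespacing levels: database -> schema -> relation name."""
    databases: dict[str, dict[str, dict[str, Relation]]] = field(default_factory=dict)

    def add(self, database: str, schema: str, relation: Relation) -> None:
        schemas = self.databases.setdefault(database, {})
        schemas.setdefault(schema, {})[relation.name] = relation

    def relations(self, database: str, schema: str) -> list[Relation]:
        return list(self.databases.get(database, {}).get(schema, {}).values())


# Two separate Kafka clusters, one logical namespace.
kafka_east = Store("kafka_east", "kafka")
kafka_west = Store("kafka_west", "kafka")

catalog = Catalog()
catalog.add("payments", "public", Relation("orders", kafka_east, "orders-v1"))
catalog.add("payments", "public", Relation("refunds", kafka_west, "refunds-v1"))

stores_used = {r.store.name for r in catalog.relations("payments", "public")}
```

In this sketch, both Relations live in the same Database and Schema even though their topics sit in different clusters, which is the decoupling of logical organization from physical storage that the paragraph describes.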
Role-Based Access Control (RBAC)
RBAC within DeltaStream is used to manage user access to the Streaming Catalog. RBAC also applies to securable objects within DeltaStream, including Databases, Schemas, Relations, Queries, UDFs, Stores, and others. Within the Streaming Catalog, Databases, Schemas, and Relations can be global, or can be for specific individuals, teams, or other units of organization. This enables the right level of privileged access to the underlying data. Using RBAC, users can easily define policies to secure and share their data resources. We chose RBAC as our access control strategy because it strikes a balance between usability and scalability. Roles are set up to clearly define permissions and ownership for data assets, and users are granted one or several roles. The roles granted to users make it clear which permissions each user has, and as new data assets get added to the Organization, they can be granted to an existing role or a new role.
RBAC addresses a major pain point in accessing streaming storage systems such as Apache Kafka and Apache Pulsar. As mentioned earlier, these systems use access control lists (ACLs). Although these lists are straightforward to use at first, they quickly become unmanageable as more users and data objects are added to the platform, because access control needs to be specified for each user and topic. RBAC directly addresses these shortcomings of ACLs by assigning users roles that authorize access to specific namespaces. As new data objects are added to a namespace, the role is automatically granted access to them.
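The namespace-level behavior described above can be illustrated with a small model. This is a sketch of the general RBAC idea only, assuming roles are granted on whole namespaces; the `RoleBasedAccess` class, its methods, and the role, user, and namespace names are hypothetical and do not represent DeltaStream's API.

```python
from collections import defaultdict


class RoleBasedAccess:
    """Hypothetical RBAC model where roles are granted on namespaces."""

    def __init__(self):
        self.role_namespaces = defaultdict(set)    # role -> namespaces
        self.user_roles = defaultdict(set)         # user -> roles
        self.namespace_objects = defaultdict(set)  # namespace -> objects

    def grant_namespace(self, role, namespace):
        self.role_namespaces[role].add(namespace)

    def grant_role(self, user, role):
        self.user_roles[user].add(role)

    def add_object(self, namespace, obj):
        self.namespace_objects[namespace].add(obj)

    def can_access(self, user, namespace, obj):
        # The object must exist in the namespace...
        if obj not in self.namespace_objects.get(namespace, set()):
            return False
        # ...and at least one of the user's roles must cover the namespace.
        return any(
            namespace in self.role_namespaces.get(role, set())
            for role in self.user_roles.get(user, set())
        )


rbac = RoleBasedAccess()
rbac.grant_namespace("payments_reader", "payments.public")
rbac.grant_role("alice", "payments_reader")
rbac.add_object("payments.public", "orders")

# A new object added later needs no extra grant for existing role holders.
rbac.add_object("payments.public", "refunds")

alice_sees_refunds = rbac.can_access("alice", "payments.public", "refunds")
bob_sees_refunds = rbac.can_access("bob", "payments.public", "refunds")
```

With per-user, per-topic ACLs, adding `refunds` would have required one new entry for every user who needs it. Here, the single grant from role to namespace already covers it.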
Combined, the Streaming Catalog and RBAC enable a single powerful platform for all your teams: a central data platform for stream processing that improves the sharing of data products, data visibility, data security, data compliance, and stream quality.
Wrapping up
In the first half of this post, we discussed what Data Governance is, why it is important, and how Data Governance over unified data solves many of the pain points of governing streaming data today. In the second half of this post, we discussed DeltaStream's approach towards Data Unification and Data Governance of streaming data, specifically highlighting the Streaming Catalog and RBAC features which power these concepts.
In the following weeks, expect more content from us showing off these concepts in action. Meanwhile, if you want to learn more about DeltaStream, reach out to us to schedule a demo or start your free trial.