release

Eyes on true compliance: how to bring STRM Privacy into a Data Warehouse

Bring STRM into your data warehouse and achieve true compliance in data privacy. That's scalable privacy infrastructure the STRM way.

Privacy is a complex domain. If you are in a legal role, it’s hard to understand how legal requirements translate into system demands (yet that is where compliance is achieved). If you are in a data role, it’s hard to grasp every legal nuance (yet that is where compliance is defined).

In this post, we’ll outline how you can combine those perspectives, bring privacy by design into any data warehouse and achieve scalable privacy infrastructure the STRM way. “Privacy by design” for data, after all.

We’ll break it up into four waves of privacy, each with its own applicability to organizational contexts:

  1. Wave 1 - Point to Point. For when you are living in the past.
  2. Wave 2 - Privacy Streams. For when you need maximum compliance.
  3. Wave 3 - Role-Based Access Control. For when you want additional control without changing your org model.
  4. Wave 4 - Purpose-Based Access Control. For when you need maximum compliance without changing your org model.

Read along to learn more about the different privacy models for data, or dive right into the tutorials and (code) examples for Snowflake, Redshift, BigQuery and Spark (Databricks).

TL;DR

Combining STRM with RBAC in a data warehouse brings:

  • Reduced risk: true data privacy compliance, not just on paper.
  • A balance between privacy and data use: integration with existing (data) workflows.
  • Less coordination (and thus lower cost), plus better governance and control of data flows.
  • ✅ privacy-by-design, integrated into your data warehouse: optimal and true compliance at little extra cost and effort.

The four models of privacy in data

The First Wave of Privacy: Point To Point

Good if: you’re stuck in the past

STRM was founded on our experiences implementing privacy in large-scale data environments and innovating in sensitive data domains (like online retail and HR).

Organisations were trying to comply with the GDPR, but translating legal constraints into system requirements was vague, lacked standards, and often opened a gap between compliance in theory and the actual data processes. We imagined there had to be a better way to balance data and compliance in light of data privacy laws, but we couldn’t find it.

Fast-forward to today, and the first wave of privacy has passed in most organizations: they have good policies reflecting organisational goals, interests and compliance arguments. Privacy officers and legal counsels are involved to ensure consistent compliance and to advise product and operational teams on edge cases or new applications (usually accompanied by a set of spreadsheets, for instance to structure information around DPIAs (Data Protection Impact Assessments)). Education and awareness around privacy are often good enough for employees to signal and reach out when privacy might be violated or a privacy perspective is needed.

But that is mostly also where it ends, because what happens inside the data is often in compliance’s blind spot. Compliance is approached as a checklist, a paper exercise.

Fulfilling data protection obligations (like RTBF, the right to be forgotten)? I have the name I need to reach out to!

But there is an important link missing: the one to the data itself.

Are we using data within purpose? We’ll just make sure that column is filtered out by access policies.

But what if a policy is misconfigured or accessed by an adversary?

And when a team runs into an edge case, like re-using HR data for a different purpose, they need to coordinate first, consuming considerable resources and capital and slowing down a release or launch.

In the first wave of privacy, many organisations go about privacy in a point-to-point fashion: the foundation is in policies, and that is where it ends.

The Next Wave of Privacy: By Design

As the sugar rush that is data in the modern business peaks, many organisations start to discover their approach is too limited. With data as a critical business asset, data protection regulations in more jurisdictions and active enforcement by data protection authorities, privacy is everywhere! As every data point is a potential compliance risk, controlling where and how data flows is now an essential aspect of data governance and compliance.

Wave 2, Privacy streams: purpose-bound data consumption

Good if: You need maximum compliance, e.g. data sharing and sensitive/risky use cases

So with policies and processes in place, how do you provide the necessary level of control and enforcement in the data itself, especially in sensitive domains like healthcare or finance?

Against that background, we started to develop the concept of privacy streams: apply a data contract to your data, annotate the privacy implications and use this to build end-to-end data pipelines.

In these pipelines, privacy constraints are immutably coupled to the data. Data is transformed according to the contract and cannot be used outside the collection purpose (for instance by stripping identifying elements or transforming it for anonymity). This allows control and enforcement of data streams while simultaneously maximising utility for data consumers. It is, in short, a by-design approach to privacy in data flows. If you do need the original values (for instance to aggregate data before downstream consumption), consuming the key stream allows you to return privacy-transformed data to its original plaintext value.
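To make that tangible, here is a minimal sketch (all table, column and function names are our own assumptions, not STRM’s actual interface) of how deterministic, purpose-bound field encryption preserves utility while the key stream stays the only road back to plaintext:

```sql
-- Illustrative only: names are hypothetical. With deterministic field
-- encryption, equal inputs encrypt to equal outputs, so grouping and
-- joining still work on the privacy-transformed data.
SELECT user_id_enc, COUNT(*) AS sessions
FROM privacy_stream_events
GROUP BY user_id_enc;

-- Only a consumer entitled to the key stream can recover plaintext:
SELECT decrypt_field(e.user_id_enc, k.key_value) AS user_id  -- hypothetical UDF
FROM privacy_stream_events e
JOIN key_stream k ON k.key_id = e.key_id;
```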

Privacy streams are great for purpose-bound data consumption, like data sharing (e.g. between clinical trial sites and pharmaceutical companies), or as the data lifeline of a machine learning system that learns and infers on personal data. If you need the data elsewhere or for a different purpose, you can define a new interface, et voilà: data pipelines as part of a structured approach to privacy! It can even be done in real time (streaming), and with customised degrees of de-identification as required for the purpose.

But there are two important caveats:

  • Although storage has become cheaper, it does require duplicating data for every purpose (a direct implication of binding data to purpose). As organisations grow, more privacy streams are necessary and storage costs multiply. Expense thus becomes a decisive factor in (not) moving towards a privacy-by-design approach.
  • Privacy streams are an excellent fit for specific use cases. But in a model with a single source of (data) truth, the operational changes and workflow implications make privacy by design hard to implement at scale.

Wave 3, Surf it straightforward: Role-Based Access Control

Good if: Utility matters more than compliance in centralised data workflows

Privacy streams, then, are a magic potion if consumed responsibly. Meanwhile, ever more complicated privacy regulations force every organisation with a decent sense of compliance to abandon data hoarding and tie data collection and usage to purpose one way or another.

With data processes strongly tied to a single truth in a data lake/warehouse, many organisations have started to “fix” the gap between policy and data by treating privacy as an access problem: who can access what?

In effect, data collection (described by the privacy policy) and access control policies determine who can query a specific data type, through profiles. Compliance gets control and enforcement, and data gets utility. It’s a practical, straightforward approach to privacy that fits neatly into existing data (work)flows and leverages the (often native) capabilities of the cloud platforms you already buy.
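A minimal sketch of what this looks like in practice (Snowflake-flavoured SQL; the database, table, role and policy names are hypothetical):

```sql
-- Grant a role read access, but mask the sensitive column for every
-- role except the one allowed to see it.
CREATE ROLE analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.public TO ROLE analyst;
GRANT SELECT ON TABLE analytics.public.events TO ROLE analyst;

CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'MARKETING' THEN val ELSE '****' END;

ALTER TABLE analytics.public.events
  MODIFY COLUMN email SET MASKING POLICY email_mask;
```

Note that nothing in these statements records why a role may see a column: access is codified, but purpose is not.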

But practical also means limited:

  • Compliance risks are reduced substantially but not optimally (misconfiguration risks remain).
  • Access is controlled, but the data itself is left untouched. Importantly, data utility is partly destroyed, as patterns in the data outside the allowed access are lost!
  • Moreover, identifying dimensions might be left in other columns, making you vulnerable to linkage attacks (using secondary data to re-identify a data subject).
  • Purpose definitions are loose: what happens if we want to use data outside its original purpose? In the RBAC model, you are simply not allowed to, or you don’t even know where the data came from.

Defining access through a profile codifies access, not origin or purpose. It’s a step forward, but a better balance between privacy and data utility is available.

Wave 4, Purpose-Based Access Control: merging RBAC and privacy-by-design

Good if: You need to balance utility and compliance in centralised data workflows (privacy by design and zero-trust security)

So, it got us thinking: can we merge the purpose-bound model with the utility-optimised access model? That would yield the optimal balance between compliance/control (ergo privacy) and data usability (ergo utility)!

We can: by integrating the concept of privacy streams and transformations with (native!) role-based access controls and foreign keys inside data warehouse solutions.

In short, it brings privacy streams (localised, purpose-bound + use case specific) to data warehousing (centralised + use case agnostic). It’s the best way to balance privacy and data utility, as you get:

  • Control and enforcement of policy implications through data contracts
  • Purpose-bound data consumption that guarantees only data within purpose is queried/consumed (see the sketch after this list).
  • Control over retention and expiry (without additional handling!) through the mechanisms used for transformation.
  • Integration into existing workflows - it doesn’t require you to change the way your teams work with data.
  • More utility from data, as transformations guarantee data can be used outside its original purpose with the right level of anonymity applied!
  • Reduced coordination as data contracts support alignment between legal and data and codify how data is collected and can be used.
  • ✅ privacy-by-design, integrated into your data warehouse: optimal and actual compliance at little extra cost and effort.
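To make the purpose-bound part concrete, here is a rough sketch (the view, table, role and purpose names are our own illustrations, assuming rows carry a purpose tag, which is not necessarily STRM’s actual setup) of mapping purposes to warehouse views, and views to roles:

```sql
-- One view per purpose over the same transformed table; a role only
-- ever gets the view(s) matching its allowed purpose.
CREATE VIEW events_marketing AS
SELECT event_id, session_id, country   -- fields released under 'marketing'
FROM events_transformed
WHERE purpose = 'marketing';

CREATE ROLE marketing_analyst;
GRANT SELECT ON VIEW events_marketing TO ROLE marketing_analyst;
```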

How it works

From an integration point of view, getting started requires the following steps:

  • Create Data Contracts to define the data schema and its privacy implications: how can data be used under which purpose?
  • Leveraging encryption, tokenisation and masking, STRM refines and filters data into privacy streams.
  • The key stream is added to your data warehouse as a foreign key, and data is assigned keys for every combination of field, data contract and purpose.
  • Data is stored in your DWH, fully transformed and encrypted.
  • You configure access policies to match data purpose (a profile/role has access to specific purpose views).
  • At query time, native foreign-key capabilities (or user-defined functions) match the keys to the data you need, as sketched below.
  • ✅ performant, natively integrated privacy by design in your data warehouse!
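A rough sketch of that query-time step (the decrypt_field UDF and all table and column names are illustrative assumptions, not STRM’s actual interface):

```sql
-- The key stream holds one key per combination of field, data contract
-- and purpose. Joining it back re-personalises only what the purpose
-- allows; deleting a key permanently expires the data it protects,
-- which is how retention can be handled without touching the data.
SELECT
  e.event_id,
  decrypt_field(e.email_enc, k.key_value) AS email   -- hypothetical UDF
FROM events_transformed e
JOIN key_stream k ON k.key_id = e.key_id
WHERE k.purpose = 'fraud-detection';
```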

Thirsty for more technical details? We’ve shown how it works for Snowflake, Redshift, BigQuery and Spark (Databricks).

At your service: a summary

| | 👓 Best if you | 🏎 Flow control | 🚮 Info loss | 🔐 Data security | 📟 Data minimisation | ㋏ Purpose limitation | 🚀 Scale & performance | 👩‍🍳 Org model |
|---|---|---|---|---|---|---|---|---|
| DWH with Role-based Access | Utility > compliance in existing data workflows | Good | Possible, column-level | Best if table x row specific keys | Coarse, no protection against secondary ID | Possible, but hard to configure | Very good | Centralized data storage and aggregation |
| Privacy streams (pipelines) | Maximum compliance is necessary | Best governance guarantees, tailor-made | None more than purpose imposes (e.g. de-ID or anon) | Best security out of the box | Best practice | By design | Very good, but duplicated for each purpose | Application- or use-case specific |
| DWH + STRM | Balance utility and compliance in existing data workflows (privacy by design and zero-trust security) | Best governance guarantees, tailor-made, integrated into existing access model | None more than purpose imposes (e.g. de-ID or anon) | Best security out of the box | Best practice | By design through access roles, on-the-fly transformation | Best of both: high-performance purpose-based access control to transformed data | Centralized data storage and aggregation, but with privacy by design |

A privacy policy is just the tip of the iceberg.

Decrease risk and cost, and increase speed: encode privacy inside data with STRM.