GDPR Compliant Event Sourcing With HashiCorp Vault
How do you comply with GDPR when using event stores with immutable, undeletable data? Capgemini is looking into a method using HashiCorp Vault.
CQRS and event sourcing are two of the most popular strategies for processing data within a microservice architecture, but making sure the event store is properly compliant and secure from a GDPR and cybersecurity perspective can be difficult.
Events are immutable and undeletable, but deleting data is mandatory from a privacy and security perspective. Many of the current options for storing events offer no adequate solution, from a cyber and information security perspective, to the challenge of building a secure, GDPR-compliant event store. The GDPR's requirements regarding the data processor's and controller's obligations to protect the privacy of each individual user may also seem to be at odds with event sourcing.
Capgemini is working on projects that use HashiCorp Vault for encryption and storing sensitive information. For one of these projects, Capgemini is building a microservice platform that gathers personal prescription history data from all Norwegian citizens.
In this talk, Johan Sydseter and Bjørn Lilleeng will show how to properly secure an event store using Vault, sharing their experiences with privacy and cybersecurity when using event sourcing to deliver a GDPR-compliant microservice platform.
Speakers
- Bjørn Lilleeng, Security Architect, Capgemini
- Johan Sydseter, Security Consultant, Capgemini
Transcript
Bjorn Lilleeng: Hello everybody, and welcome to this session. My name is Bjorn Lilleeng, and I work as a delivery architect at Capgemini Norway, and I'm currently functioning as a security architect for the project that forms the basis for this presentation.
Johan Sydseter: My name is Johan Sydseter. I'm a senior consultant and lead developer.
Bjorn Lilleeng: The mission of the project is to create a technological platform for certain health-related services for the Norwegian population. Right now, we are in the middle of the project, and the services we are currently providing—which were released about half a year ago—deal only with non-personal, non-controversial data. They do verification of certain health services, but no controversial personal data is involved.
The challenges of managing personal data
Bjorn Lilleeng: We are approaching a stage where we are facing challenges regarding personal data, and this is serious business, obviously. We learned on Monday that a large European airline was fined €200 million for exposing 400,000 personal records—including credit card data. In yesterday's news, a large hotel chain was fined about €100 million for exposing millions of customer records. The authorities are really taking this seriously now. With that in mind, it's good to know it's possible to combine technologies as we have done, using HashiCorp Vault with event sourcing. It seems that we are covered pretty well if you map the technology—the security features—against the seven foundational principles of privacy by design introduced by Ann Cavoukian back in the nineties.
We'll come back to that, and Johan will have some live code as well to show what we're dealing with. But let's start with an overview. We have a platform as a service containing microservices. It's based on Pivotal Cloud Foundry, and inside this platform is the growing amount of personal data that I mentioned.
The current release only deals with the other, less controversial data. But at some point, our services will need to handle both sides, especially the category in the middle called SPI data—sensitive personal information. This requires special care, according to GDPR Article 9, because it combines personal information with health data.
The client system exposes certain services to several locations around Norway and uses our platform to access personal data. Other client systems just access other data, but there are client systems that access the sensitive ones—the data that combines all this.
We do what we can to protect this. The platform is accessible via a private cloud from a Norwegian supplier, and the client system accesses the cloud from the Internet. We use various techniques, like mutual TLS (mTLS), OAuth2, anti-spoofing, and a lot of other techniques through our carefully designed solution to protect the platform to the best of our ability.
Customers require more data access
Bjorn Lilleeng: When we got the contract from the customer, this was the initial plan. But after a while, the customer found out that, "We need an admin interface here and the client system is not useful for administrative purposes." Then they say, "We need a VPN connection."
After a while, they also say, "There are some governmental requirements that say we need to make certain reports and certain queries available for governmental use. The government wants to keep track of what we are doing." That's a new requirement we need to take into account, and we need to let them get into our solution as well.
The customer also found out that the value of the data here is interesting to others. There's a commercial aspect, which means we need to make some of the information available to commercial actors. We know that in the future we will be challenged even more. We've got a lot of potential clients that are challenging our initial infrastructure.
This is our concern. In addition, of course, we have the GDPR concern, which will punish us hard if we don't do this correctly. We need to be careful here. According to the law, we are a data processor. The customer is the controller. Under Article 28, the controller must only appoint processors who can provide sufficient guarantees to meet the requirements. This means that the controller—our customer—trusts that we can provide such guarantees.
We are also required, according to GDPR Article 32, to use the best technology available to provide security. It says "state of the art." That's a demanding requirement in itself. In addition, the GDPR says that data subjects have special rights—and they must be allowed to exercise those rights. Several GDPR articles cover this: the right to be forgotten, the right to rectification, the right to be informed in case of a breach. Several things need to be taken into account here.
Establishing a zero-trust network
Bjorn Lilleeng: It doesn't matter if only the security architect and the lead developer know about this. This must be part of the team DNA. All developers—all architects, project leads—everyone needs to be part of this whole. We introduced zero trust as an architectural state of mind. That's our team DNA.
We want to develop security from the inside out. We don't trust the barriers that have been created around the platform. We need those as well, but the important part is that we develop security from the inside and make sure that we encrypt everything—all personal data.
This means there is a combination of processes and technologies. The technologies are important, of course, but the processes on top of the technologies must be available to everybody. The customer has a tendency to say, "Okay, if we spend too much time on security, that will take focus away from the functionality."
But we say that if we never discuss functionality without also taking the security aspect into account, that will be a business enabler rather than a business restriction. It means you don't need to think about security all the time because it's a part of your DNA.
Things are user- and application-centric. It's not about infrastructure here. And of course there's the principle of least privilege: you should only get access to what you need to do the job. No implicit access; access is granted explicitly to you. We always verify that someone who tries to access personal data actually has access to it. It's also based on context, which means that you may have access to personal data when doing one thing, but that doesn't imply you have access to the same data in another context. We need to control access here.
The customers say, "If we want to access other data, then it's probably open to all customers or all clients. If we want to access personal data, then you should restrict access for the many government actors, commercial actors, and what have you."
It gets harder as you open up a bit more—when you combine personal data and other data, and when some clients should access only the sensitive, private data. This data should only be accessible from a few client systems. Otherwise, a small error in our own code or a vulnerability in a third-party component may expose it all, unless we do security from the inside out.
Any insider needs special permission, as there are no general openings here. They need special permissions to get access to the personal data. You need a set of building blocks, and HashiCorp Vault comes in very handy here. If you look at secrets management, encryption of data at rest, and encryption of data in transit: we use HashiCorp Vault for all three areas.
We don’t just use mutual TLS for the external communications but also internally, between microservices. We use HashiCorp Vault to handle the private keys. Using HashiCorp Vault this way, we have very nice zero-trust building blocks.
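As a rough illustration of that last point (not necessarily the project's exact setup), a service can fetch a short-lived certificate for internal mTLS from Vault's PKI secrets engine at startup instead of keeping long-lived keys on disk. Here is a sketch using Python's hvac client; the Vault address, role name, and service hostname are hypothetical:

```python
import hvac  # pip install hvac

client = hvac.Client(url="https://vault.internal:8200", token="...")

# Ask Vault's PKI secrets engine to issue a fresh key pair and
# certificate under a pre-configured role ("internal-service" is a
# hypothetical role name).
resp = client.secrets.pki.generate_certificate(
    name="internal-service",
    common_name="orders.service.internal",  # hypothetical service hostname
)

cert = resp["data"]["certificate"]  # presented during the mTLS handshake
key = resp["data"]["private_key"]   # kept in memory, never written to disk
ca = resp["data"]["issuing_ca"]     # used to verify peer certificates
```

Because certificates issued this way can be short-lived, a compromised key loses its value quickly, which fits the zero-trust stance of verifying every connection rather than trusting the network.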
The right to privacy and data protection
Johan Sydseter: I have been a developer for some time, but I didn't learn what it meant to be compliant until the end of 2017, when the GDPR came into focus. That's when I learned that I, as a European citizen, have the right to privacy and data protection according to the European Charter of Fundamental Rights—and that we and all our clients have to implement these rights to be compliant.
The GDPR is mainly a reiteration of the Data Protection Directive, except for one thing: we now have to do data protection by design and data protection by default. But what does that mean? Data protection by design and by default comes from two terms. One is data protection, and the other is privacy by design and privacy by default.
Originally, the European Commission used the terms "privacy by design" and "data protection by design" to mean the same thing. "Data protection by design and by default" was chosen as the basis for many of the regulation's articles, for example Article 25, which is called "Data protection by design and by default." It was also chosen to make a distinction between the right to privacy and the right to data protection. Privacy by design and by default is still considered an essential component of fundamental privacy protection and privacy engineering, according to the European Commission.
Privacy by design and by default
Johan Sydseter: Privacy by design and by default is based on seven foundational principles. They were originally conceived by Dr. Ann Cavoukian during the nineties, when she was the Information and Privacy Commissioner of Ontario, Canada. You can look at them as the manifesto for the GDPR. HashiCorp Vault helps us implement these foundational principles, specifically in regards to data protection.
If you make your private data inaccessible by encrypting it, and you don't grant access to decrypted data unless the caller is appropriately authenticated, authorized, and audited—then you make sure you are using a proactive and preventive approach to data collection. This means you are assuming that an attacker can take control of your service and get access to your data. But even if he were to do that, the data would be encrypted and protected.
If you also use HashiCorp Vault to encrypt data by default, you're complying with principle number two, regarding the default setting. By encrypting data by default, we make sure that we are not exposing any data.
HashiCorp Vault makes sure that we can implement these principles—except for maybe principle number six, visibility and transparency. To satisfy that one, we have to be able to track changes to personal information, ensuring visibility and transparency while it is being processed.
To do this, we have to know which operations are responsible for changing our personal information and why they are changing it. Event sourcing is a very interesting architectural pattern for achieving this.
What is event sourcing
Johan Sydseter: With traditional databases, you can know where you are, you can know the current application state, but you have no idea how you got there. With event sourcing, you are recording events and changes to the system as a sequence, and you can know how you were able to get to a certain point.
Take your bank account, for example. Your bank account is not a secure box where you store your money; your bank account is a ledger: a principal book or computer file for recording and totaling economic transactions.
When you buy a t-shirt or withdraw money from an ATM, the event is registered as a transaction. By applying the sum of all transactions that have happened over time, we can calculate our balance.
We're able to track our spending to make sure that we're not using more money than we have. But we can also ensure that nobody takes advantage of us and charges us unreasonably for a t-shirt, for example.
In the same way, we can use event sourcing to ensure visibility and transparency as we are keeping track of all these events as a transaction book.
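To make the bank-account analogy concrete, here is a minimal, self-contained Python sketch (not from the project): the balance is never stored directly; it is derived by folding over the immutable sequence of recorded events. The event types and amounts are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # events are immutable historical facts
class MoneyDeposited:
    amount: int

@dataclass(frozen=True)
class MoneyWithdrawn:
    amount: int

# The event store: an append-only sequence of everything that happened.
events = [MoneyDeposited(500), MoneyWithdrawn(120), MoneyWithdrawn(35)]

def balance(events) -> int:
    """Fold the event stream into the current state."""
    total = 0
    for e in events:
        total += e.amount if isinstance(e, MoneyDeposited) else -e.amount
    return total

print(balance(events))  # 345: derived, not stored
```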
Bjorn Lilleeng: Another very good example, which has relevance to our project, is your health record. The visits you made to your doctor last year form a sequence of events. They are historical facts—you can't deny them. This means that an event sourcing log is immutable; it can't be changed. That's pure history.
In event sourcing, writing and reading data are two totally different aspects. The events that we write are never deleted. That is seemingly a contradiction with regard to GDPR. If you look at the figure here, the left side says that under GDPR, erasure is sometimes mandatory: for instance, to execute the right to be forgotten, or to enforce data retention under Article 5.
Cryptotrashing and double encryption
Bjorn Lilleeng: Laws also sometimes require that certain reports be deleted after three or five years. That, too, must be executed under GDPR, whereas in event sourcing, events are immutable and undeletable. How do we solve that particular contradiction?
Johan Sydseter: We can apply a technique called cryptotrashing to address these challenges. This means we initially encrypt the data using a specific key. Then, if we want to remove that data, we delete the key—thereby making the personal information unreadable throughout the system. And we do that without changing the application state.
But how do we use cryptotrashing to ensure we have GDPR-compliant event sourcing? We create a specific key for each person. I call this the P-ref key, and then I create another key—a temporal key—for each retention period.
Then we apply double encryption: the P-ref key encrypts the personal information for each data subject, and the temporal key then encrypts the complete personal dataset for a certain retention period.
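A minimal sketch of this double-encryption scheme, assuming Python's hvac client and Vault's transit engine with key derivation (covered later in the talk). The key name and context values are illustrative, not the project's actual ones:

```python
import base64
import os

import hvac  # pip install hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="...")

def b64(raw: bytes) -> str:
    return base64.b64encode(raw).decode()

# One derivation-capable transit key; per-person and per-period keys
# are derived from it using context values.
client.secrets.transit.create_key(name="person-data", derived=True)

p_ref = os.urandom(32)     # per-person context (the "P-ref"); stored, deletable
temporal = os.urandom(32)  # per-retention-period context

def double_encrypt(plaintext: str) -> str:
    # Inner layer: encrypt the subject's data under the P-ref-derived key.
    inner = client.secrets.transit.encrypt_data(
        name="person-data", plaintext=b64(plaintext.encode()), context=b64(p_ref)
    )["data"]["ciphertext"]
    # Outer layer: encrypt again under the retention-period-derived key.
    return client.secrets.transit.encrypt_data(
        name="person-data", plaintext=b64(inner.encode()), context=b64(temporal)
    )["data"]["ciphertext"]

print(double_encrypt("Steve Johansson, 20mg Brintellix"))
```

Deleting p_ref later makes the inner layer unrecoverable for that one person; deleting temporal makes everything written during that period unrecoverable.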
When an individual wants to exercise his right to be forgotten, I can delete his P-ref key, thereby making all his personal information unavailable. But how does this work in practice? I have installed HashiCorp Vault on my personal laptop, and I've also installed an event store from AxonIQ, which is based here in Amsterdam, for recording my events. In addition, I have created a microservice for registering health treatments for elderly patients who live in a care home and need medical assistance.
As you can see here on the Axon dashboard, this is where the events in my event store are shown. There are no events there currently, so I'll go ahead and create some. I will register a health treatment by prescribing 20 milligrams of Brintellix—an antidepressant drug—for a patient called Yohan Johansson.
Then I will prescribe 20 milligrams of Brintellix to his brother, Steve Johansson, as he has been depressed for many years now. Now if I go back to my event store, I can see that I'm starting to get some events. And if you look here—if you're able to read this payload—you can see that the data has been encrypted by HashiCorp Vault.
Vault's transit secrets engine
Johan Sydseter: To be able to see this a little better, I've created an endpoint here. Here you can see the whole data structure, and as you see, the name is encrypted. For demonstration purposes, I have created another endpoint where you can see the data in unencrypted form. We are using the transit secrets engine for encrypting this personal information. We do this for two reasons:
Reason number one: using Vault's transit secrets engine, we can apply encryption as a service without exposing the encryption key. That's quite important, because what if an attacker takes control of my service and steals my data? As long as the key is securely stored in Vault, the attacker will only have the data in encrypted form. This makes sure the data is still protected.
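As a minimal sketch of that round trip, assuming Python's hvac client against a local Vault (the key name "patient-name" is illustrative): the application only ever sees plaintext and ciphertext; the key itself never leaves Vault.

```python
import base64

import hvac  # pip install hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="...")
client.secrets.transit.create_key(name="patient-name")

# Encrypt: Vault returns a ciphertext such as "vault:v1:..."; the
# encryption key stays inside Vault.
ct = client.secrets.transit.encrypt_data(
    name="patient-name",
    plaintext=base64.b64encode(b"Yohan Johansson").decode(),
)["data"]["ciphertext"]

# Decrypt: only callers authenticated and authorized for this transit
# path can turn the ciphertext back into plaintext.
pt = base64.b64decode(
    client.secrets.transit.decrypt_data(name="patient-name", ciphertext=ct)[
        "data"
    ]["plaintext"]
)
print(pt.decode())  # "Yohan Johansson"
```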
That brings us to the second reason, which is also a very important GDPR consideration if you are processing sensitive information. Let's say that you are processing sensitive information for the population of Amsterdam. If you hadn't encrypted your information and you were to have a data breach—and hackers were to steal your data—you would have to contact each data subject individually and tell them about the breach.
This becomes impossible with such a large group of people. Instead, you make sure that you encrypt the information and that the decryption key is inaccessible to the attacker. That way, you can assure the supervisory authority that the data is protected even if it has been stolen.
Using Vault’s key derivation functionality
Johan Sydseter: What if Steve Johansson then dies and, as his final wish, wants his personal data removed, as he doesn't want anyone to know that he was depressed for the final years of his life? I will just delete his P-ref. I can do this because I have used key derivation, a feature of Vault's transit secrets engine.
This means I can use one key for multiple purposes by deriving a new key using a user-supplied context value. I create such context values for each person and for each retention period. I'll just take his P-ref here and delete it.
If I go back to the endpoint showing the decrypted data, I'm no longer able to read the name of Steve Johansson, as the key that was used to encrypt and decrypt the information about him is gone. I've ensured that his personal information is removed not only from this microservice but from the whole system, including from any backups and logs that have recorded any personal data about him.
At the same time, all the events that were recorded previously are also now unreadable. I have, therefore, ensured that he can exercise his right to be forgotten. If I want to do the same in regards to retention periods, I only need to delete the context value for the specific retention period.
The first thing I would do is take a snapshot of the current personal dataset and encrypt it using the key for the new retention period. Then I would delete the key for the old retention period. This means all the events, logs, and backups that were generated during the old retention period are completely unreadable and, therefore, removed from the system.
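Because a derived key can only be reconstructed from the named transit key plus its context value, deleting the stored context value is effectively the same as destroying the key, whether it belongs to one person or to a whole retention period. A minimal, purely illustrative sketch of that bookkeeping (the store and identifiers are hypothetical):

```python
import os

# Derivation contexts; random, so they cannot be re-derived or guessed.
# In production these would live in a durable, access-controlled store,
# not an in-memory dict.
context_store = {
    "p-ref:steve-johansson": os.urandom(32),  # per-person key material
    "retention:2019-H1": os.urandom(32),      # per-period key material
}

def forget(context_id: str) -> None:
    """Crypto-trash: once the context is gone, the derived key can never
    be recreated, so every ciphertext produced under it stays unreadable
    in events, logs, and backups alike."""
    del context_store[context_id]

forget("p-ref:steve-johansson")  # right to be forgotten (Article 17)
forget("retention:2019-H1")      # retention-period erasure (Article 5)
```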
Using Vault to show you’re correctly encrypting personal data
Johan Sydseter: So I can enforce Article 17, regarding the right to be forgotten—and Article 5, about retention. But how do I then assure the data controller and the supervisory authority that I am, in fact, encrypting the personal information that has been flagged as personal? I do that with a small library that we use around Vault.
This library takes three values when encrypting and decrypting information. First, it takes the JSON schema—I have the JSON schema here. Second, it takes the context value, which is used for key derivation.
The third value is the data structure that I want to encrypt. Now, using JSON Schema draft-07, I can specify which fields I want to encrypt using the contentEncoding and contentMediaType keywords, which are defined as part of that standard. Based on that, the library takes the fields I want to encrypt and sends them to HashiCorp Vault and back again.
Then, when the data comes back, I take the JSON schema and validate the encrypted data structure. This way, I can ensure the data is encrypted correctly. If the data doesn't comply, I log a warning. If everything is okay, I log an info statement.
Then, using that log output, I can test that the encryption is being applied correctly. Later, when the schema is in production, I can do a data flow analysis using my log output to prove to my supervisory authority that I'm actually applying encryption correctly in production.
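A hedged sketch of what such a schema-driven flow could look like, not the project's actual library: a draft-07 schema flags the fields to encrypt, a helper encrypts exactly those fields through Vault's transit engine, and the schema's pattern check rejects any value that was left unencrypted (transit ciphertexts always start with "vault:v"). All names here are illustrative.

```python
import base64

import hvac  # pip install hvac
from jsonschema import validate  # pip install jsonschema

# Draft-07 schema. The content keywords are annotations in validation,
# so the actual guard against storing cleartext is the "pattern" check
# on the transit ciphertext prefix.
SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {  # flagged as personal: must be stored encrypted
            "type": "string",
            "contentEncoding": "base64",
            "contentMediaType": "text/plain",
            "pattern": "^vault:v[0-9]+:",
        },
        "dosageMg": {"type": "number"},  # non-personal, left in the clear
    },
}

client = hvac.Client(url="http://127.0.0.1:8200", token="...")
client.secrets.transit.create_key(name="demo-key", derived=True)

def encrypt_flagged(record: dict, context: bytes) -> dict:
    """Encrypt exactly the fields the schema flags with content keywords."""
    out = dict(record)
    for field, spec in SCHEMA["properties"].items():
        if "contentEncoding" in spec:
            out[field] = client.secrets.transit.encrypt_data(
                name="demo-key",
                plaintext=base64.b64encode(record[field].encode()).decode(),
                context=base64.b64encode(context).decode(),
            )["data"]["ciphertext"]
    return out

record = encrypt_flagged({"name": "Steve Johansson", "dosageMg": 20}, b"p-ref-steve")
validate(instance=record, schema=SCHEMA)  # raises if a flagged field is cleartext
```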
In this way, we can ensure that encryption is being done. But why am I using a JSON schema? Why am I not implementing this in code? The reason is that I want to be able to model the encryption in data modeling software, so that the architects themselves can model how the encryption is supposed to happen.
Then, they can export those models directly as JSON schemas, thereby ensuring that I have end-to-end traceability from the design phase until the JSON schemas are in production. This process ensures the data is encrypted and protected by design and by default using HashiCorp Vault.
Bjorn Lilleeng: It's obvious that HashiCorp Vault has proven really useful for us, and we use it in a lot of other arenas as well—like PKI certificate management. We need to do a lot of document signing, and we find HashiCorp Vault useful for such processes. But that's a theme for another session.