How to make aggregate record destruction processes work in structured data systems
This is something that seems to mystify even very forward-looking practitioners.
In my experience, the common cause of the conceptual problem is that people instinctively go to the database to try to deal with it.
It's simplistic thinking - we've got a data store (a container for records), we can see it's organised into tables, the tables hold transactions, and the transactions are called records - so we want to work at that level! (Don't!)
This logic is like going to a pile of Lego and trying to work out what was previously built out of it. It might be possible in very simple cases where a database hasn't been normalised at all, but in most cases the level of certainty is going to be very low, and the capacity for creating unintended consequences is going to be very high.
A more accurate analogy is that it's like trying to appraise a Word document by looking at the ones and zeroes stored on the hard disk. Things get easier when you use Word (the application).
To understand what the transaction-level record is, we need to go and look at the business application itself and understand how that application creates transactions. Generally this is a trivial thing to do with user-facing applications.
There are basically two ways to work it out - sit down for five minutes with someone who uses the application and get them to walk through what they do there, or go and have a look at the application interface and look for the things it lets you create.
The thing you should focus on is the nouns.
Applications are developed to capture and update the state of things - state just means some characteristics of the thing at a point in time.
The "things" are almost always nouns - person, customer, asset, product, case, order, contract etc.
What makes managing records in structured data systems confusing for records people is that we tend to want one single record that contains everything, so that we can manage that whole thing to destruction. If we don't destroy everything, we're left feeling like we haven't done what we were supposed to.
People creating databases see that as wasteful and inefficient, so what you'll generally see is a set of one-to-many relationships between nouns that represent a person, place or thing, and nouns that represent the activities related to them (the transactions). There's a code sketch of this shape after the examples below.
It's interesting to note here that if you were thinking about it in terms of classification schemes, you might say it was Subject:Activity:Transaction - with the activity generally captured as a type of transaction.
A simple example: asset management systems.
We are likely to have an asset (one), and then a set of cases (many) related to the asset, with varying case types (different activities).
The asset (one) is related to many cases - the transactions.
Another example is "customer service" systems.
The customer (one) is related to many sales - the transactions.
The "many" noun is most likely to be our transaction.
The singular noun was also created at some point - and that creation is also a transaction that will eventually need to be deleted.
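To make the shape concrete, here's a minimal sketch in Python/SQLite. The table and column names are invented for illustration - they aren't from any particular product:

```python
import sqlite3

# Hypothetical schema for the asset management example:
# one asset row, many asset_case rows (the transactions),
# each case carrying a case_type (the activity).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE asset (
    asset_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);

CREATE TABLE asset_case (
    case_id   INTEGER PRIMARY KEY,
    asset_id  INTEGER NOT NULL REFERENCES asset(asset_id),
    case_type TEXT NOT NULL,  -- the activity: inspection, repair, etc.
    opened_on TEXT NOT NULL
);
""")
```

The "customer service" system has exactly the same shape - swap asset for customer and asset_case for sale.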
This one-to-many relationship is where most people come unstuck when thinking about destruction in databases.
What we need to recognise is that to destroy the transaction, we need to avoid destroying ANY of the shared data, and destroy only the data that is UNIQUE to that transaction. If we delete the unique data associated with the transaction, we have destroyed the capacity of what is left to represent the transaction - so we've destroyed the transaction. This sounds complicated, but it generally isn't - as long as you avoid the database like the plague and stick with the application.
Applications don't generally recognise the need for approved destruction, but they do generally have a "delete" function - even if it is not exposed to general users. If the application has been designed well, the delete function will include logic that protects the integrity of the database that the application relies on, and also limits the scope of deletion to just the unique data for that transaction - rather than the elements it depends on.
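As a sketch of what a well-behaved delete function does under the hood - continuing the hypothetical schema above - the foreign keys protect the shared data, and the delete is scoped to the rows unique to the transaction:

```python
# Continuing the hypothetical schema: case_note rows are unique to one
# case, while the asset row is shared across many cases.
conn.executescript("""
CREATE TABLE case_note (
    note_id INTEGER PRIMARY KEY,
    case_id INTEGER NOT NULL
        REFERENCES asset_case(case_id) ON DELETE CASCADE,
    body    TEXT NOT NULL
);
""")

def destroy_case(conn, case_id):
    """Destroy one transaction: delete only the data unique to the case.

    The shared asset row is untouched, and the plain foreign key from
    asset_case to asset blocks any attempt to delete an asset that
    still has cases hanging off it.
    """
    with conn:
        # Notes unique to this case are removed via ON DELETE CASCADE.
        conn.execute("DELETE FROM asset_case WHERE case_id = ?", (case_id,))
```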
It does leave a problem for things like "person" and "address" - or other nouns/entities on which transactions depend - but it's a small conceptual problem that just requires a little flexible thinking. If we have destroyed our transactions and are left with a person and an address with no transactions, and those things only exist to support those transactions (i.e. no retention requirement of their own), that's easy to report on and destroy as a second action after the transaction-level destruction process.
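That second action can be as simple as reporting on, and then removing, the "one" side rows that no longer support anything - again a sketch against the same hypothetical schema:

```python
def report_orphaned_assets(conn):
    """List assets with no remaining cases - candidates for the second
    destruction action, assuming no retention requirement of their own."""
    return conn.execute("""
        SELECT asset_id, name FROM asset a
        WHERE NOT EXISTS (
            SELECT 1 FROM asset_case c WHERE c.asset_id = a.asset_id
        )
    """).fetchall()

def destroy_orphaned_assets(conn):
    """Destroy the orphans once the report has been approved."""
    with conn:
        conn.execute("""
            DELETE FROM asset WHERE asset_id IN (
                SELECT a.asset_id FROM asset a
                WHERE NOT EXISTS (
                    SELECT 1 FROM asset_case c WHERE c.asset_id = a.asset_id
                )
            )
        """)
```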
This does get more complicated when there are downstream applications using the data.
This is what lineage and retention schedules are for.
Retention schedules are supposed to encapsulate the length of time we need that data - including for any downstream uses. A good retention schedule should capture all business use, downstream usage included. It might be a bit of a schlep to track it all down if the data is being copied, but it can be done.
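One way to make "all business use" concrete is to record the downstream consumers alongside the retention period. This is a hypothetical structure, not any particular standard:

```python
from dataclasses import dataclass, field

@dataclass
class RetentionClass:
    """A hypothetical retention schedule entry: the period must cover
    every listed use, downstream copies included."""
    record_class: str
    retention_years: int   # must satisfy the longest requirement
    trigger: str           # e.g. "case closed"
    downstream_systems: list[str] = field(default_factory=list)

maintenance_cases = RetentionClass(
    record_class="Asset maintenance cases",
    retention_years=7,     # illustrative figure only
    trigger="case closed",
    downstream_systems=["reporting warehouse", "insurance extract"],
)
```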
Lineage is helpful too - lineage tells us where our data came from; it's provenance with a data label. If we have it, we know where our data came from and where it is going. If we don't have it, the size of the problems that gap is likely to cause over a five-to-ten-year period means that having it probably isn't too far away - particularly if you've got a data warehouse involved, and especially if you have a data lake.
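A lineage record needn't be fancy - a hypothetical sketch of where each dataset came from and where it flows to:

```python
from dataclasses import dataclass

@dataclass
class LineageEdge:
    """A hypothetical lineage record: provenance with a data label."""
    dataset: str
    source_system: str   # where the data came from
    target_system: str   # where it is going

flows = [
    LineageEdge("maintenance_cases", "asset management system",
                "reporting warehouse"),
    LineageEdge("maintenance_cases", "reporting warehouse", "data lake"),
]
```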
Flippancy aside, any idea that we can't execute a destruction process on data in databases is just wrong - we just need to make sure we're looking at the right level of abstraction. Looking at the database is like trying to appraise a Word document by looking at the ones and zeroes on a hard disk; we do better when we use Word (the application).