What you Don’t Collect, you Cannot Lose!

Roel M. Hogervorst

2021/03/21

Categories: thoughts Tags: dataset decisions data-minimalisation

Hello new startup or old company! You are going to lose at least a part of your data. I’m sorry. I’m not the one who is going to do it to you, don’t worry.

But do worry about the impact! Statistically, it is only a matter of time before you get caught with your pants down. You will lose data because someone will make a mistake. That is fine, mistakes happen, and hopefully you’ll learn from them. But mistakes will be made! You store your data in a bucket on a cloud provider, or you have copies of your data in personal accounts. Someone doesn’t believe that their superstrong password, monkey123, will be guessed, and a credential stuffing attack gets access. Or maybe you misconfigure some system, making your entire database readable by anyone on the internet. This hasn’t happened yet, but it could happen.

Image: people pushing a car through a flooded street.

So right now, while you still live in wonderland, thinking nothing is wrong, please think about what could happen when things do go wrong. What data are you collecting about people? How could the data you have be abused? What could happen if a dictator found that information? Could someone use it to impersonate someone else? These are not fun questions, but your data people need to answer them, and the CEO is ultimately responsible. Are you worried? Yes? Good!

Now, for every piece of information you have about people, ask yourself:

What are you going to use this information for?

Direct goals are clear

If you have a direct goal, for example:

Write the goals for each piece of information down and talk about it with your coworkers. You might be able to use other information. Set clear data deletion and archiving rules.
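Writing the goals down can be as simple as a small register that lives next to your code. Here is a minimal sketch of that idea, not a definitive implementation. The field names, purposes and retention periods are made up for illustration; yours will differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FieldPolicy:
    purpose: str         # why we collect this field
    retention_days: int  # how long we keep it before deleting


# Hypothetical register: every entry here is an invented example.
POLICIES = {
    "email": FieldPolicy("send order confirmations", 365),
    "delivery_address": FieldPolicy("ship the order", 90),
}


def is_expired(field: str, collected_on: date, today: date) -> bool:
    """True when a value has outlived its retention period."""
    policy = POLICIES[field]
    return today > collected_on + timedelta(days=policy.retention_days)


def fields_without_purpose(collected_fields):
    """Fields we collect but never wrote a goal down for."""
    return sorted(f for f in collected_fields if f not in POLICIES)
```

Anything that `fields_without_purpose` returns is a good topic for that talk with your coworkers.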

Direct goals are not clear

If you have a vague, possible future goal for certain information:

Set a strict timeline and a measurable goal. For instance: collect gender information for 2 months and check whether it makes your sales predictions 5% better. If you don’t achieve the goal, re-evaluate. If you collect information only with the idea that you might use it in the future, throw away the data and stop collecting it. You will never use it and it is a liability.
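A time-boxed experiment like this can be written down so that nobody forgets the deadline. The sketch below is one possible way to do it, under a few assumptions of mine: "better" is measured as a relative drop in prediction error, the baseline error is greater than zero, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CollectionExperiment:
    field: str
    start: date
    end: date
    required_improvement: float  # relative improvement, 0.05 means 5% better


def decide(exp: CollectionExperiment, today: date,
           baseline_error: float, new_error: float) -> str:
    """Return 'wait', 'keep' or 're-evaluate' for a time-boxed experiment.

    Assumes baseline_error > 0 and that a lower error is better.
    """
    if today < exp.end:
        return "wait"  # the timeline has not run out yet
    improvement = (baseline_error - new_error) / baseline_error
    if improvement >= exp.required_improvement:
        return "keep"
    return "re-evaluate"


# Example: two months of collecting gender information.
experiment = CollectionExperiment(
    field="gender",
    start=date(2021, 3, 21),
    end=date(2021, 5, 21),
    required_improvement=0.05,
)
```

Calling `decide(experiment, date.today(), baseline_error, new_error)` once the end date has passed gives you the answer you committed to up front, instead of a quiet "let's keep it a bit longer".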

For example: if you collect massive amounts of data from Google Analytics because that happens automatically, re-evaluate whether you need all that information! Throw away everything you don’t need. Re-evaluate whether you want to be associated with a big company like Google. Try out whether other analytics products might be a better fit for you.
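Throwing away what you don’t need is easiest before the data is stored at all. A minimal sketch, assuming your events arrive as plain dictionaries; the allowed field names are hypothetical and only there for illustration.

```python
# Hypothetical allowlist: only fields with a written-down goal get stored.
ALLOWED_FIELDS = {"page", "timestamp"}


def minimise(event: dict) -> dict:
    """Keep only the allowlisted fields and drop everything else."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


raw_event = {
    "page": "/blog/data-minimalisation",
    "timestamp": "2021-03-21T10:00:00",
    "ip": "203.0.113.7",          # documentation-range address, not a real visitor
    "user_agent": "ExampleBrowser/1.0",
}
stored_event = minimise(raw_event)
```

An allowlist rather than a blocklist is a deliberate choice here: a new field that nobody thought about is dropped by default instead of stored by default.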

If you have a small blog: what do you do with the information you receive from the website analytics? I decided not to add tracking: I’m never going to use the data, so why bother?

Big data is like oil: it is hard to refine and can contain a lot of power, but spills are hard to clean up and the disasters that follow will haunt your company for decades.

Go for data minimalisation:

If you don’t collect stuff you don’t need, you cannot lose it!

Picture by Saikiran Kesari on Unsplash