How to Create a Data Science Competition

Roel M. Hogervorst

2021/12/06

Categories: R Tags: kaggle python flask cocaine NL simulation

We recently hosted a small Kaggle-like competition at work. It was very specific to the Netherlands: find the cocaine in sea containers coming into the port of Rotterdam. I was inspired by the Kaggle competitions we did at Xomnia when I did my traineeship there. I find the goal of most Kaggle-like competitions unrealistic for actual data science work: business people rarely want a super accurate model, they care about savings, costs and improved timings. So in this case I didn't judge people on how accurate they were, I judged them on how many kilos they found. This is highly unfair (also because I randomly generated those kilos without any relation to the features) but it is what happens in practice. I'll talk about some technical details later, but first: I made a starter set in Python and R so my coworkers could get going right away.

Creating a starter (code scaffolding)

This is super important! Although most actual data science work takes a lot of time, in these competitions you want to get to a submission as fast as possible. The goal of these competitions is to improve your feature engineering and model choices and for that you don’t want to spend a lot of time on making sure your submission works.

So I created a complete starter for R, and did what I could in the time I had for Python. A good starter has (in my opinion) code steps that:

- read in the training and test data,
- create some basic features,
- train a simple baseline model,
- predict on the test set, and
- write a submission file that is ready to upload.

The zipfile I created also contained the training set, the test set, and a README that tells the story behind the data. A minimal version of such a starter is sketched below.
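For illustration, here is a minimal sketch of what such a starter could look like in Python. The file and column names (train.csv, test.csv, container_id, kilos_cocaine) are made up for this post, and the model is just a placeholder baseline; the actual starter used the competition's own names.

```python
# Hypothetical starter sketch: file and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 1. Read in the training and test data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# 2. Create some basic features: here, just the numeric columns.
features = [
    c for c in train.select_dtypes("number").columns
    if c not in ("kilos_cocaine", "container_id")
]

# 3. Train a simple baseline model on a binary target:
#    does the container hold any cocaine at all?
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], train["kilos_cocaine"] > 0)

# 4. Predict a probability of cocaine for every test container.
test["p_cocaine"] = model.predict_proba(test[features])[:, 1]

# 5. Write a submission: the 400 containers you are most sure about.
submission = test.nlargest(400, "p_cocaine")[["container_id"]]
submission.to_csv("submission.csv", index=False)
```

Anything that runs end to end like this is a fine starter; participants only have to improve steps 2 and 3.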

I think this is called scaffolding in education science: you put up the scaffolds so people can build (and learn) faster.

The data generation

My girlfriend thought it would be interesting to do predictions on containers: get into the role of the Port Authority / Customs and figure out how we could catch the containers with cocaine hidden in them.

Of course there is no such dataset; it is by nature hidden. Cocaine smugglers don't want you to know what methods they use. And I don't think the authorities would share the data they have with me, if they have anything like this at all.

So I simulated the data based on some public numbers: the number of containers per year, types of load, and countries of origin.

First some facts:

Some assumptions:

Some loose factoids I picked up by reading reports about cocaine smuggling:

My thought was like this:

If you make sure the modifiers are on average 1, the base probabilities remain the same. (I allowed them to rise slightly.)
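To make that idea concrete, here is a sketch of the generation logic, not the actual generator (that lives in the repository linked below). The features, modifier values, and base probability are all made up for illustration; the point is only that uniformly drawn features with modifiers that average 1 preserve the base probability.

```python
# Hypothetical simulation sketch: feature values and numbers are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 400_000

base_p = 0.002  # assumed base probability that a container holds cocaine

# Features are drawn uniformly, so the modifiers below average exactly 1.
origin = rng.choice(["CO", "EC", "BR", "CN", "US"], size=n)
load = rng.choice(["fruit", "electronics", "scrap"], size=n)

origin_mod = pd.Series({"CO": 2.0, "EC": 1.8, "BR": 0.7, "CN": 0.3, "US": 0.2})
load_mod = pd.Series({"fruit": 1.8, "electronics": 0.7, "scrap": 0.5})

# Base probability times the feature modifiers.
p = base_p * origin_mod[origin].to_numpy() * load_mod[load].to_numpy()
has_cocaine = rng.random(n) < p

# The kilos are drawn independently of the features, as admitted above.
kilos = np.where(has_cocaine, rng.gamma(2.0, 50.0, size=n).round(1), 0.0)

containers = pd.DataFrame({"origin": origin, "load": load, "kilos_cocaine": kilos})
```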

link to generators

The scoring application

I wrote an application in Python (Flask) that scores the submissions (link to flaskapp repository). I could have done that in Shiny too, but I've been working with Flask a lot lately, so I was quite comfortable with it.

About 0.5% of containers is scanned yearly. I gave the contestants 400.000 containers, split into a training set (80%) and a test set (20%), and they could submit 400 of the 80.000 containers in their test set: that same 0.5%. 'Give me the ones you think have the highest probability of containing cocaine.' This is how such a model would be used in practice, so it made sense to me to score the submissions that way too.

The Flask application has two pages: one where teams upload their submission csv, and a leaderboard that shows the scores.

The scoring code uses pandas (I could have used polars too, but pandas was fast enough) to read in the csv, merge it with the actuals inside the application, sum the kilos, calculate the street value (every kilo is worth 50.000 euro), and write those results to the database.
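As a sketch, the scoring endpoint could look roughly like this. The file names, routes, table layout, and the team form field are assumptions for this post; the real code is in the linked repository.

```python
# Hypothetical scoring sketch: routes, file names and columns are assumptions.
import sqlite3

import pandas as pd
from flask import Flask, request

app = Flask(__name__)
ACTUALS = pd.read_csv("actuals.csv")  # hidden truth: container_id, kilos_cocaine

@app.route("/submit", methods=["POST"])
def submit():
    submission = pd.read_csv(request.files["file"])
    if len(submission) > 400:
        return {"error": "you may submit at most 400 containers"}, 400
    # Merge the submitted container ids with the actuals.
    hits = submission.merge(ACTUALS, on="container_id", how="inner")
    kilos = float(hits["kilos_cocaine"].sum())
    street_value = kilos * 50_000  # every kilo is worth 50.000 euro
    with sqlite3.connect("scores.db") as con:  # assumes the table exists
        con.execute(
            "INSERT INTO scores (team, kilos, street_value) VALUES (?, ?, ?)",
            (request.form["team"], kilos, street_value),
        )
    return {"team": request.form["team"], "kilos": kilos, "street_value": street_value}

@app.route("/")
def leaderboard():
    # The other page: a simple leaderboard ordered by kilos found.
    with sqlite3.connect("scores.db") as con:
        rows = con.execute(
            "SELECT team, MAX(kilos) FROM scores GROUP BY team ORDER BY 2 DESC"
        ).fetchall()
    return {"leaderboard": rows}
```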

The application is baked into a Docker image and uploaded to a cloud service. I hosted it on Azure, but any place that can host a Docker image would do.

There are some issues, of course. If the app crashes, you lose all the scores, because the SQLite backend sits inside the Docker container. I thought that wouldn't happen within one day, but the container actually restarted within two days. The fix would be to write to a permanent database hosted somewhere else; that would be relatively easy to do, as sketched below.
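As a sketch of that fix, the insert could go to a hosted database through SQLAlchemy instead of the local SQLite file. The DATABASE_URL environment variable and the table layout are assumptions here, not what the real app used.

```python
# Hypothetical fix sketch: write scores to a database outside the container.
import os

from sqlalchemy import create_engine, text

# Connection string for a managed database (e.g. a hosted Postgres),
# injected via an environment variable instead of baked into the image.
engine = create_engine(os.environ["DATABASE_URL"])

def save_score(team: str, kilos: float, street_value: float) -> None:
    # engine.begin() opens a transaction and commits on success; the
    # scores now live outside the container and survive a restart.
    with engine.begin() as con:
        con.execute(
            text(
                "INSERT INTO scores (team, kilos, street_value) "
                "VALUES (:team, :kilos, :street_value)"
            ),
            {"team": team, "kilos": kilos, "street_value": street_value},
        )
```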

All in all, I really liked building this, and the participants enjoyed it too!