Deep diving into the data lake
A few years ago, the concept of storing company analytics data in a ‘data lake’ became a thing, and it’s quickly becoming the industry standard for that purpose. Over the past few months, a team of Data Engineers have been keyboard-warrioring away to build our very own data lake, known as the Global Analytics Platform (GAP).
So, what is a data lake?
It’s a place to collect information for analysis that is stored in raw, un-processed form. Or in other words, it’s a way of storing data which our analysts then use to generate insight. This is not a place for the data that powers the ClearScore product.
What on earth does that mean?
Quite simply, this means is that any data that a system produces can be sent to the lake as what we call ‘events’. We’ll take care of storing it and make it accessible to the analysts.
These events could be:
- The website telling us that someone has visited the Learn page
- A back-end service saying that someone has successfully completed registration
- A native application telling us that someone has opened a donut on the dashboard
Traditionally - and currently at ClearScore - that data gets pushed into what’s known as a data warehouse. For the last couple of years, we’ve been using Amazon Redshift for that purpose.
The challenges of using a warehouse
The main challenges are scale and cost. Every day, we produce more and more data for analysis. Every user we register means a dozen more credit reports we pull every year. Eventually, we’ll hit a point where the Redshift cluster fills up and we’d have to rent a bigger one from Amazon Web Services (AWS).
When you pay for a bigger Redshift cluster, that usually comes with a faster CPU on it, which we don’t necessarily want or need. Think about it this way: if your phone is running out of space from all of the videos of your cat you take, your only option is to get a phone with more storage and many more bells and whistles at a huge cost. Using this analogy, what a data lake allows you to do instead is to offload the storage of your cat videos to the cloud without having to upgrade your whole phone.
Essentially, with a warehouse you can store terabytes of information, whereas with a lake you can store petabytes (thousands of terabytes). To give an indication of where we are today, we are at about 8TB of analytics data, roughly half of which was generated in 2018 alone.
The lake is the raw storage where we can then transform the data into tabular form for analysis. It allows you to keep the cloud storage of the entirety of your cat videos, but you can also edit, enhance and keep the funny parts of it on your phone so you can show your friends at the pub. This functionality is known as Extract, Transform and Load (ETL), and is key for generating business metrics.
How does the data lake help ClearScore?
Self-service is something we can really start to leverage as we build out the data lake. Our current analytics platform, IMDB, allows for some of this but in a more limited scope. The company is expanding into more markets and we are producing more features that emit events. Building out the lake will allow people to cherry-pick aspects of data that they’re interested in from the diverse data-sets. Building out and allowing for this capability unlocks some powerful things we can do in the company within the space of machine learning, predictive analytics and user profiling.
Now, I’m not saying we’re going to go full Cambridge Analytica here, but it’s important for us to develop a deeper understanding of our users and how they behave so we can innovate more, particularly in an increasingly competitive landscape.
So can I throw off loads of data in my events for people to use?
Yes and no. We should always be careful about what data we push out into the lake. We should never be sending personally identifiable information (PII) as that will make staying GDPR compliant tricky. Our priority should be building a great product for our users instead of implementing tracking to make other third party systems more usable.
If we’re not careful about the data that gets tracked on every event, we risk turning the clean lake into a swamp. A swamp will mean that delivery teams won’t want to deal with the dirty data, and might start building out their own siloed ponds. That is not a path we want to go down.
We want a clean lake that:
- Analysts can build and share their data smarts from
- Data scientists can visit to investigate trends and patterns to make predictions
- Our leadership team can rely on for accurate business reporting
Why is it so hard to get it fully operational?
There are a few reasons why it’s difficult. One of them is that building out the lake - or the ability to push data into the lake at scale - is probably only half of the work. We’ve been running in the UK for nearly four years now, which means we’ve collected nearly four years of data on what our users have done with the product. We need to fill the lake up with all of that information, and that information has to be accurate, otherwise people will continue to use the old warehouse. Our challenge has been to move the terabytes of data that we have for the UK business in a way that’s accurate, speedy and does not impact live systems.
The most important thing with the data migration is to maintain data integrity. With both platforms we maintain a parent ID for every user, so we have a mechanism that can marry up what someone does before they log in and what they do after. However, IMDB and GAP measure the parent ID in a different way, so we hit a bump that has to be overcome.
In order to make things even more challenging, not only are we migrating platforms, but we have to deal with a complete change in the way events get tracked, or the naming pattern of these events. Take something simple like someone logging in successfully; in IMDB land, that tracks two potential events. One if the user has completed registration and one if they haven’t. In GAP speak, that turns into one event but two potentially different properties. When it comes to the login failure events, that number goes up from two to six. Putting all of this metadata into the right mapping mechanism is a tedious and intricate task. Enter us having to build the first of our systems for migrating from IMDB to GAP: the migrator.
The migrator application is pretty snappy and can lift and shift data at a rate of knots. It also creates a dump of user UUIDs which we can then load up into a separate database that’s used for the aliasing functionality. To load the output from the migrator into the alias database, we had to build a second application: the importer.
The importer is also pretty snappy and moves data into the aliasing database quickly, enabling some functionality that the migrator was otherwise incapable of, but is essential for getting the data into the lake. The migrator would move the data from the old nomenclature into the new, but because data in the aliasing database would not be loaded up it wouldn’t be able to inject parent IDs into the event data. We needed to build something that could take every event generated by the migrator, look up the relevant parent ID and append it onto the event information before flushing into the lake. Enter application number three: the converter.
The converter is not very quick, sadly, because of the fact that it has to consistently look up information from the aliasing database, which proves to be the bottle-neck in the process. A couple of weeks ago, we hit a snag when we ran a conversion for a small event volume for the last two months of 2018. It took two and a half days before we pulled the plug on it. We realised that we had to go back and re-think our plan of attack.
What we managed to do was dump the table contents of the aliasing database into a separate S3 bucket. The bucket’s contents could then be read by the converter on every run, which meant we could use that information within the application and not overload the database during conversion. The same dataset that took two and a half days to not complete finished up in about 8 minutes. That’s about the length of time it takes the average ClearScore employee to neck a pint on a Friday lunchtime.
Does it really take that long to move that much data? My USB stick is faster!
As I said before, maintaining the integrity of the data that gets migrated, imported and converted is critical. If one system says that 200 people applied for a particular credit card on one day but the other one says 208 people did it generates confusion and uncertainty as to which system is trustworthy. The fundamental thing here is that we have to be able to explain why there is a shift and know that some things change when you move from one system to another.
ClearScore’s progress towards the data lake
Are we there yet? Almost. The keyboard-warrioring Data Engineering team have built the platform and are migrating data in as fast as they can. We are in the process of moving the reporting that powers the business from IMDB to GAP, which will allow us to flick the switch over, but that requires a lot of data in the lake. Remember, if one event takes about 10 minutes for a few months, then a few hundred events over nearly four years will still take a long time.
Having said that, we can see the light at the end of what has been a very long, bumpy tunnel and we’re trying to get ClearScore there as quickly as we can!
Navin’s lucky enough to run our Data Engineering team. He loves to engineer massively scalable systems (apart from when he's playing cricket - nobody stops him from playing cricket.)