How to deploy analytics workloads

Kris Peeters · Published in datamindedbe · 8 min read · Sep 11, 2020

“It works on my machine”. That’s great. But now how do you make sure it runs in production, repeatedly and reliably? Here we share our lessons learned from deploying many analytics solutions at clients.

By the way, what do we mean by “analytics workloads”? Any workload where you send data to an algorithm and create some kind of insight. That can be an ML algorithm, a data cleaning job, data integration, NLP processing, … any piece of the data pipeline, really.

Deployments are the most important thing you do

A typical development lifecycle looks like this:

From coffee to code to insight

In this cycle, deploying your code is the most important thing you do, because it’s only then that a client can get access to your work. Everything that goes before that is Work In Progress. Everything that comes after that is Business Value (hopefully).

This is often forgotten in data analytics projects. A lot of time is spent on improving the model, ingesting more data, building more features. But as long as you don’t bring your insights to your clients, you are not delivering any value.

Well, how often should you deploy then? We always promote the notion of doing 10 deploys per day (based on the great book The Phoenix Project). That means your data team pushes 10 new valuable things to downstream consumers every single day. This is unattainable for a lot of companies that still do quarterly or monthly releases. If that is the case, try to bring it down to weekly releases, or maybe even daily releases. You will discover roadblocks along the way. Removing those roadblocks will make your team more efficient and will allow you to deliver results to customers faster. It will also let you learn faster from the feedback of customers and adjust course quicker.

What are your options?

There are several ways that we see analytics workloads being deployed, and we would like to evaluate them on two axes:

  • Effectiveness: How good are your deployments?
    Are there limitations in what you can deploy? How often can you deploy? Are the deployments high quality? Are they stable and easy to monitor?
  • Feasibility: How easy is it to get started with this?
    Do you have to learn new technologies? Will you have to spend a lot of time building the deployment pipeline? Can you use one deployment mechanism for multiple processing frameworks?

If you plot the different deployment options on a 2x2 matrix, along these axes, you get the following image:

Sorry, every consultant has to make a 2x2 matrix

This is of course not 100% correct, nor the complete picture. But these are the deployment mechanisms we’ve seen a lot at clients, and we think this framework helps in deciding which deployment option is right for your use case.

Let’s go over them one by one:

Deploys to Pet VMs (Low feasibility / low effectiveness)
Use when: Never. Avoid at all costs. Sometimes, though, it’s impossible to avoid.

A Pet VM is a machine you name and really care about. It’s precious and you manually make sure it stays alive. The good thing about Pet VMs is that they’re probably a technology you already know, and sometimes they’re the only option. The downsides are that the yearly license cost of a Pet is high, maintaining a Pet takes a lot of your time, deployments are manual and error-prone, and it’s hard to recover when things go south.
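To make that concrete, a manual Pet deploy typically looks something like the following. This is a rough sketch; the machine, user and file names are hypothetical.

# Copy a new version to the machine by hand and restart it there
scp pipeline.tar.gz analyst@pet-vm-01:/opt/pipeline/
ssh analyst@pet-vm-01 'cd /opt/pipeline && tar xzf pipeline.tar.gz && ./restart.sh'

Every step depends on one person remembering to run it, in the right order, on the right machine.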

Run on Laptop and Copy (High feasibility / low effectiveness)
Use when: MVPs and Demos

You do all calculations locally and you upload the resulting model, calculations, insights, … to a central server. It’s relatively easy to do and you have full control of your own tooling. But it’s hard to scale, limited by the capacity of your laptop, and of course it comes with security issues. It is also very error-prone.
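In practice this pattern often boils down to something like the following. A rough sketch, assuming a Python job and an S3 bucket as the central location; all names are hypothetical.

# Run the pipeline locally on your laptop
python run_pipeline.py --input ./data/raw --output ./output/insights.parquet

# Copy the resulting insights to a central location (hypothetical bucket)
aws s3 cp ./output/insights.parquet s3://company-insights/daily/insights.parquet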

Run on Notebook Sandbox (High feasibility / low effectiveness)
Use when: MVPs and Demos

What is it? You open a cloud notebook with a connection to production data. You build your data pipelines right then and there, and then you schedule your notebook. Finally, you cry while you do support on this thing.

This works because it is relatively easy to do, a notebook in the cloud can scale, and multiple people can work on the same notebook.

But having multiple people on the same notebook also brings chaos into your system. There is no version control, poor error handling and poor testing. Often this leads to endless spaghetti code, no modularity, and limited visibility on what is in production.

=> This is a way of running production systems that we see way too often, and it is actively promoted by some vendors. In software engineering, it’s the equivalent of manually updating the PHP files of a live website. Great for a hobby project. No serious company would work this way.

Automated notebooks (Low feasibility / High effectiveness)
Use when: Mature teams with strong devops skills

When we challenge the notebook approach, people often push back and point to companies like Netflix, where they automated the entire notebook experience, structured the code, implemented a scheduler, have a logging solution and so forth.

Notebook architecture at Netflix

As you can see, it sounds easy, but to do it well you still need to build a lot. They recommend building an actual application when your notebook gets too big. That application is then installed as a library in the notebook environment, and from the notebook you simply call your library’s main function(s) and schedule those.

Also very interesting is that they store notebook runs as immutable traces of your application, so you can always check the notebook of a given run for errors and launch it from there to debug.
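As a minimal sketch of that pattern, assuming your library is installed in the notebook environment and the notebook is executed with a tool such as papermill, which keeps every run as a separate output notebook. The file, library and parameter names are hypothetical.

# Execute the notebook with parameters; the output notebook is the immutable trace of this run
papermill pipeline.ipynb runs/pipeline_2020-09-11.ipynb -p run_date 2020-09-11

# Inside pipeline.ipynb, the cells just call the library's entry point, e.g.:
#   import mypipeline
#   mypipeline.main(run_date=run_date)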

All in all, you can see that a properly scheduled notebook environment requires a lot of work and engineering. And what works for Netflix might not work for you, or might be complete overkill.

Docker and Kubernetes (Low feasibility / High effectiveness)
Use when: Mature teams with strong devops skills

Kubernetes is a container platform that allows you to build platforms at scale. It was born at Google in 2014 and it’s a great building block for your data platform. On top of that, it has a large and growing ecosystem.

Most software integrates, or is integrating, with Kubernetes. It probably is the platform of the future. And your applications start up fast. The downside is that the future is not there yet. It’s a great building block, but deep knowledge of Kubernetes is needed to build a data platform, and not a lot of people have this knowledge today.
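To give an idea of what running a workload on Kubernetes involves, here is a minimal sketch, assuming your pipeline is already available as a container image in a registry; the image and job names are hypothetical.

# Run the pipeline as a one-off Kubernetes Job
kubectl create job analytics-pipeline --image=registry.example.com/analytics/pipeline:1.0

# Follow the logs of the job's pod
kubectl logs job/analytics-pipeline --follow

Getting the image built and pushed, the cluster provisioned, secrets managed and scheduling in place is where the deep Kubernetes knowledge comes in.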

Containers or Jars + PaaS API (High feasibility / High effectiveness)
Use when: Adopt for most data teams

In this approach you basically take what the cloud vendors offer and rely on that as much as possible. Examples are Google Cloud Dataflow, Amazon EMR, Azure Batch, …

This approach works because a lot of the complex work is done for you. These services are stable and can be used at scale. And they often come with metrics and monitoring integrated.

The downsides are that not everything comes out of the box. You still have to do a lot of work yourself, and you need to add a lot of glue to get going. There is always a risk of vendor lock-in if you rely too much on their tooling. And autoscaling and alerting are often things you need to build yourself.
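For example, submitting a Spark step to an existing Amazon EMR cluster is a single CLI call. This is a sketch; the cluster id, class name and jar location are hypothetical.

# Submit a Spark job to a running EMR cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps 'Type=Spark,Name=AnalyticsPipeline,ActionOnFailure=CONTINUE,Args=[--class,com.example.Pipeline,s3://company-jars/pipeline.jar]'

The cluster, the packaging of the jar and the scheduling of this call are the glue you still own yourself.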

Use frameworks (High feasibility / High effectiveness)
Use when: Adopt for most data teams

You’re not the first one in this situation. Deployment patterns emerge, and frameworks help you automate the production-grade deployment of code. Tools like Netlify and Serverless.com are examples of this approach. We recently launched Datafy, a data engineering framework for building, deploying and monitoring analytics workloads at scale.

Screenshot of Datafy

The good thing about these frameworks is that most of the complex work is done for you. You follow industry best practices, and monitoring and scaling come out of the box. It makes sure you’re up and running in no time.

The downside of a framework is that it is always tailored to specific needs, so you are constrained to what the framework offers. It’s a matter of choosing the right framework for you.

How does this work at Datafy?

Deployments are the most important thing you do. That’s why we make it super easy and quick to do at least 10 deploys per day, through the use of the CLI:

Create a new project

datafy project new --name analyticspipeline --template project/python

This will set up a code structure, create a data pipeline with a sample job in it plus a unit test, add a Dockerfile, configure a Python virtualenv and basically do all the scaffolding for you.

creating an analyticspipeline is a single command line
As a result, you get a project setup with all the scaffolding done
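The exact layout depends on the template you pick, but based on the description above you can picture the resulting project roughly like this; all file and directory names here are hypothetical, purely for illustration.

analyticspipeline/
  Dockerfile            # container definition for the pipeline
  requirements.txt      # Python dependencies for the virtualenv
  src/sample_job.py     # sample job in the data pipeline
  tests/test_sample_job.py   # unit test for the sample job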

Build the project

That’s also a single command line:

datafy project build
Building a project

What this does is wrap your code in a Docker container and push that container to a container registry in your cloud, in this case ECR.
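Under the hood this is comparable to building and pushing the image yourself. A rough sketch of the manual equivalent; the account id, region and repository name are hypothetical.

# Authenticate Docker against your ECR registry
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com

# Build the image and push it to ECR
docker build -t 123456789012.dkr.ecr.eu-west-1.amazonaws.com/analyticspipeline:latest .
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/analyticspipeline:latest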

Deploy your project

That’s, as you guessed it, another single command line:

datafy project deploy --env dev --wait

And about 2 min later, you’re done. Your analytics pipeline is live.

So what?

Why is this important again? Whether you use Datafy, take something else off the shelf, or build your own system, it is important to be able to deploy 10x per day. Organisations that do this are able to deliver results to customers faster, have a higher ROI on their data investments and, most importantly, learn from the feedback of their customers.

This blog is a written version of a webinar we hosted earlier, which you can rewatch on YouTube.

Kris Peeters
Data geek at heart. Founder and CEO of Data Minded.