Why rising cloud costs are the silent killers of data platforms

Published in datamindedbe · 9 min read · Jul 6, 2022

Building data platforms in the cloud is changing. Gone are the days when you would manually set up a few EC2 instances and run some modest data processing on them. Solutions in this space have moved up the value chain, going from IaaS to PaaS to Serverless to SaaS. This has made it a lot easier for data teams to get started with data. The so-called “Modern Data Stack” is sometimes even defined in terms of this movement:

The most important difference between a modern data stack and a legacy data stack is that the modern data stack is hosted in the cloud and requires little technical configuration by the user. These characteristics promote end-user accessibility as well as scalability to quickly meet your growing data needs without the costly, lengthy downtime associated with scaling local server instances.— Fivetran

Today we can take Snowflake, Databricks, Synapse, Confluent, Fivetran, … off the shelf and start building in no time. This adds a tremendous amount of value and removes a lot of the complexity that comes with building data platforms.

But…

Moving up the value chain means moving up the pricing chain

Let’s look at some of the prices of these services.

Let’s start with Snowflake

Snowflake prices are quoted per credit, and a credit buys one hour of an X-Small warehouse. That’s a lot of money. There is a lot of debate about what kind of hardware an X-Small warehouse runs on, but it could be a c5.2xlarge behind the scenes, which has 8 cores and 16GB of RAM and costs $0.34 per hour on demand and currently $0.08 per hour on spot. If you use Snowflake Enterprise edition at $3.90 per credit, which I see a lot of companies do, that is an 11x to 48x price increase versus on-demand and spot instances respectively.
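The multiples above are simple divisions, and it can help to see them spelled out. The sketch below only uses the hourly prices quoted in this post; the mapping of an X-Small warehouse to a c5.2xlarge is a guess, not an official one, and the function name is purely illustrative.

```python
# Hourly prices in USD as quoted in the text (July 2022).
# Assumption: an X-Small warehouse is comparable to one c5.2xlarge instance.
SNOWFLAKE_ENTERPRISE_PER_CREDIT = 3.90  # one credit = one hour of X-Small
EC2_ON_DEMAND_PER_HOUR = 0.34
EC2_SPOT_PER_HOUR = 0.08


def price_multiple(managed_per_hour: float, raw_per_hour: float) -> float:
    """Return how many times more the managed service costs than raw compute."""
    return managed_per_hour / raw_per_hour


vs_on_demand = price_multiple(SNOWFLAKE_ENTERPRISE_PER_CREDIT, EC2_ON_DEMAND_PER_HOUR)
vs_spot = price_multiple(SNOWFLAKE_ENTERPRISE_PER_CREDIT, EC2_SPOT_PER_HOUR)

print(f"Snowflake vs on-demand EC2: {vs_on_demand:.1f}x")
print(f"Snowflake vs spot EC2: {vs_spot:.1f}x")
```

Swap in your own edition price and instance type to redo the comparison for your setup.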

I know, I know, there is no official mapping from EC2 instances to Snowflake credits. I am convinced they do it on purpose to obscure the real price of the service. Besides, what are you going to do with that vanilla EC2 spot instance? Manage it yourself? Can you get the same performance out of that instance as Snowflake does? Probably not. But bear with me for a second.

How about Databricks?

https://databricks.com/product/aws-pricing

A DBU, like a credit, is a unit of cost per compute hour. If we take the same c5.2xlarge instance, you pay 2.43 DBU per hour. A configuration we often see in enterprises is the Premium All-Purpose compute setup, which costs you $1.3365 per hour for that c5.2xlarge instance. Add the price of the EC2 instance itself, and you end up with a 5x to 17x price increase versus the plain vanilla on-demand and spot EC2 prices respectively.

That’s roughly 2x as “cheap” as Snowflake but still very expensive. And with their Serverless Compute options, they are closing that gap.
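The Databricks numbers work a little differently, because the DBU fee comes on top of the EC2 bill rather than replacing it. Here is a minimal sketch of that arithmetic, again using only the figures quoted above. The per-DBU price is not stated in the text; it is derived here by dividing the hourly fee by the DBU rate.

```python
# Figures in USD per hour as quoted in the text for a c5.2xlarge.
DBU_PER_HOUR = 2.43
DATABRICKS_FEE_PER_HOUR = 1.3365  # Premium All-Purpose compute
EC2_ON_DEMAND_PER_HOUR = 0.34
EC2_SPOT_PER_HOUR = 0.08

# Derived, not quoted: the price per DBU implied by the two numbers above.
implied_price_per_dbu = DATABRICKS_FEE_PER_HOUR / DBU_PER_HOUR


def total_multiple(platform_fee: float, ec2_price: float) -> float:
    """Total hourly cost (platform fee plus EC2) relative to EC2 alone."""
    return (platform_fee + ec2_price) / ec2_price


dbx_vs_on_demand = total_multiple(DATABRICKS_FEE_PER_HOUR, EC2_ON_DEMAND_PER_HOUR)
dbx_vs_spot = total_multiple(DATABRICKS_FEE_PER_HOUR, EC2_SPOT_PER_HOUR)

print(f"Implied price per DBU: ${implied_price_per_dbu:.2f}")
print(f"Databricks vs on-demand EC2: {dbx_vs_on_demand:.1f}x")
print(f"Databricks vs spot EC2: {dbx_vs_spot:.1f}x")
```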

But what about performance?

The big vendors often claim they have outstanding performance because of proprietary technology. That’s why, they say, you can’t compare based on raw EC2 instances. You have to look at performance per dollar spent. A great example of this narrative is this TPC-DS benchmark by Databricks.

They are right, in a way. But in most real-world examples we see, there are a few crazy heavy-lifting data jobs, and hundreds of smaller, almost trivial jobs. Not every job processes 2 TB of data. But it’s hard to “downscale” these services. We often move Spark jobs that run on a 3-node Databricks cluster to a very small single-node Python script that runs in a single Docker container, scheduled on a Kubernetes cluster. With zero negative impact on performance. On the contrary, because there is no network overhead and no Spark startup overhead, they often run faster. This can result in 90% cloud cost savings.
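To make "almost trivial" concrete, here is a sketch of the kind of job we mean: a group-by-and-sum that needs nothing but the Python standard library. The column names and sample data are made up for illustration; a real job would read from object storage or a database instead of an inline string.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def total_per_key(rows: Iterable[Mapping[str, str]], key: str, value: str) -> Dict[str, float]:
    """Sum the `value` column per distinct `key`, streaming row by row."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical input, standing in for a small daily extract.
SAMPLE_CSV = """customer,amount
acme,10.5
globex,4.0
acme,2.5
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
totals = total_per_key(reader, key="customer", value="amount")
print(totals)
```

A script like this starts in milliseconds and fits in the smallest container you can schedule, which is where the savings over a multi-node cluster come from.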

These are not the exceptions

We often see Snowflake and Databricks as the biggest cost drivers in data platforms. But they are not alone. Confluent, Azure Synapse, … all these large compute engines come with a significant cost. BigQuery as well, although BigQuery is dirt cheap if you run it at a small scale. What else do you need, besides processing?

You want a managed Airflow service? MWAA starts at $350 per month. Ideally you have a few of those, one for each environment. Maybe one for each team.
You want some managed ingest with that? Fivetran quickly starts costing >$1000 per month if you do it at a decent scale.
A managed data governance solution? Many solutions are not open about their pricing. Data.world is quickly > $10K per month.
A managed ML inference endpoint? On that same c5.2xlarge instance with AWS SageMaker? Another $360 per month. And you need more than one of those. Companies build dozens to hundreds of ML models.

Bill shock: We don’t know what we don’t know

The absolute numbers can be staggeringly high, but this is often not even the main concern of enterprises. The main concern is “bill shock”. Every large company has faced it at least once. I’ve seen cases where the organisation planned a 2-year budget of EUR 500K for Snowflake. It was spent in 6 months.

With the pay-as-you-go pricing options, all predictability goes out of the window. The solution is close monitoring and good governance rules. But those things are often implemented too late.
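Close monitoring does not have to be sophisticated to catch this early. As a rough sketch, a linear projection of spend so far already tells you when the budget runs out. The intermediate figure below (EUR 250K spent after 3 months) is a hypothetical data point chosen to match the story above, and the function names are illustrative.

```python
def months_until_exhausted(budget: float, spent: float, months_elapsed: float) -> float:
    """Project the total budget lifetime in months, assuming spend stays linear."""
    if months_elapsed <= 0 or spent <= 0:
        raise ValueError("need a positive spend over a positive period")
    monthly_burn = spent / months_elapsed
    return budget / monthly_burn


def is_on_track(budget: float, spent: float, months_elapsed: float, planned_months: float) -> bool:
    """True if the projected budget lifetime covers the planned period."""
    return months_until_exhausted(budget, spent, months_elapsed) >= planned_months


# Hypothetical check three months into a 24-month, EUR 500K plan.
projected = months_until_exhausted(budget=500_000, spent=250_000, months_elapsed=3)
on_track = is_on_track(budget=500_000, spent=250_000, months_elapsed=3, planned_months=24)

print(f"Budget lasts about {projected:.1f} months in total; on track: {on_track}")
```

Run a check like this monthly and the 6-month surprise becomes a 1-month warning.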

Where does this lead us?

With today’s technologies you can greatly accelerate your data roadmap. But you pay a price. Cloud bills well over $10K per month are normal for basic teams. Larger organisations quickly end up with $100K monthly bills. And the sky is the limit.

But the sky isn’t the limit.

The budget is the limit. I am the first one to advocate for the value of data and for the capabilities of cloud. I don’t ever want to go back to on-premise Hadoop clusters. But at the end of the day, the business case has to make sense.

People in data teams often don’t realise how much leeway they get from C-level. Everybody agrees that it’s important “to invest in data”. No executive who wants to get their next promotion dares to challenge the ever-increasing expenses in data. Those who do are labeled as old-fashioned, and they “don’t get it”. In reality, most executives don’t get it. The data emperor wears no clothes.

This story doesn’t last forever. Maybe the CIO is pressured to make budget cuts. Maybe no other executive sees enough value for their org to justify the costs. Sooner or later an executive musters the courage to start asking questions.

And that’s why rising cloud costs are the silent killers of data platforms. You don’t see them coming. But when you’re confronted with them, it’s often too late to do something about it. Or it requires a big investment. Which is not great timing. It usually causes a big disruption in data teams.

Should we take back control?

There is an alternative. You can set up and run your own data platform from scratch. We’ve blogged about that before (e.g. here and here) and also about why we wouldn’t do it anymore (e.g. here). That’s definitely an option, and if you have a strong technical team, you should consider it. Companies that go this route often have cloud costs that are a factor of 5 to 10 lower than with the fully managed solutions described above. Of course, it comes with its own downsides. For one, you need quite a few very good engineers. Those are hard to find, and expensive as well.

In the end, it boils down to where you stand in the unbundling versus rebundling of the data platform. No matter where you stand on the spectrum, own the decision. If you prefer to manage each low-level component yourself, to keep full control and reduce cloud costs to the max, then make sure to invest in a stellar data team that can focus on building the platform and offer a great developer experience to the users of the platform. Companies like Twitter, Spotify, Airbnb, … clearly go this route. And they are nice enough to open-source their work from time to time.

The way forward

So what is the data organisation to do? Not every organisation can attract Spotify-level talent. Here’s the advice we typically give companies today:

Focus on building the use cases that set you apart from the competition. Avoid the undifferentiated heavy lifting. Take off-the-shelf tools. Make sure those tools are self-service.
Don’t lock into one platform. Snowflake and Databricks specifically claim to do everything. From SQL to Spark to ML and governance. This brings back memories of Oracle and IBM a few decades ago. Definitely use those platforms if they add value. As part of an overall solution. But don’t go all-in.
Standardise on something that you can easily move from platform to platform. Today, for all processing code, the humble Docker container is a great abstraction layer. It’s universally supported across clouds. And you can run it yourself or go for managed solutions.
Keep a close eye on cloud costs, and proactively optimise where it hurts the most. Don’t wait for the executives to do this exercise for you. Cloud vendors have tools and dashboards to do exactly this. Use them!
Whatever you do, iterate. Don’t make 5 year plans. Don’t spend 1 year on architecture alone. Ship useful stuff. Collect feedback. Iterate. A modern data platform evolves over time.

How does this help with rising cloud costs? Simple. Let’s say you need to do data ingestion. This can be the sequence of steps you decide to take: run ingests in a Python script → replace the Python script with Fivetran if it becomes too complex to manage → move away from Fivetran if it becomes too expensive, by setting up Airbyte yourself.

I’m not saying one solution is better than the other. Just keep your options open. And make sure you can switch between DIY, open-source tooling and managed services.
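One way to keep those options open is to hide the ingestion step behind a small interface, so the rest of the pipeline never knows which tool produced the records. The sketch below is only an illustration of that idea: both backends are stand-ins that return hard-coded data, and none of the names correspond to a real Fivetran or Airbyte API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Ingestor = Callable[[], Iterable[Record]]


def diy_script_ingest() -> Iterable[Record]:
    """Stand-in for a hand-written ingest script (hypothetical data)."""
    return [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def managed_tool_ingest() -> Iterable[Record]:
    """Stand-in for a managed or self-hosted tool.

    A real version would read the tables that tool lands in your warehouse.
    """
    return [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


# Switching tools is a one-line config change, not a pipeline rewrite.
INGESTORS: Dict[str, Ingestor] = {
    "diy": diy_script_ingest,
    "managed": managed_tool_ingest,
}


def run_ingest(backend: str) -> List[Record]:
    """Run the configured backend and return its records as a list."""
    if backend not in INGESTORS:
        raise ValueError(f"unknown ingest backend: {backend}")
    return list(INGESTORS[backend]())


records = run_ingest("diy")
print(f"Ingested {len(records)} records")
```

As long as every backend delivers the same record shape, downstream jobs stay untouched when you switch.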

How we help

We’ve launched Conveyor, a managed data engineering platform which solves many of the challenges above:

We focus hard on cost management: We have a cost dashboard where you can drill down into where you spend most of your money. By default all jobs run on spot instances and we are smart about avoiding spot outages as much as we can. We aggressively scale up as well as down to reduce idle nodes. We often help our clients right-size their jobs.
It is based on containers. And we make it very easy for you to get started with Python, Spark and dbt templates. As such, whatever you run on Conveyor, you can run wherever you want. There is no lock-in.
It’s fully managed, serverless and self-service. Many buzzwords in one sentence. It means that you can focus on building the use case, and we take care of scheduling, running and monitoring your work. And yes, that includes cost monitoring :-)

If you want to learn more about Conveyor, you can take a look at our website or our 📕 documentation.

If you want to try it out, use the following link. From there you can get started for free on your own AWS account in a few easy steps. If you’d rather talk to a human, you can book a meeting with me here. I promise I won’t bite.

Looking forward to your feedback and experiences!