
BigQuery vs Redshift Pricing Strategy: How to Optimize Your Query Costs

Many of our customers ask us which data warehousing option is cheaper: BigQuery or Redshift.

TL;DR: Redshift is cheaper if you don’t leverage multi-stage querying and don’t pay much attention to SQL query structure.

In this blog post, we’re going to break down BigQuery vs Redshift pricing structures and see how they work in detail.

The cheapest Redshift cluster you can spin up will cost you $0.25 per hour, or about $180 per month.

In per-terabyte terms, that works out to $0.425 / TB / hour for HDD storage and $1.5625 / TB / hour for SSD storage.

BigQuery charges separately for storage, at $20 / TB / month, and for queries, at $5 per TB of data processed.
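To make these rates concrete, here is a back-of-the-envelope sketch in Python using the list prices quoted above. The dataset size and monthly scan volume are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope comparison using the list prices quoted above.
# Stored volume and monthly scan volume are illustrative assumptions.

HOURS_PER_MONTH = 730

redshift_hdd_tb_hour = 0.425   # $/TB/hour, HDD nodes
bq_storage_tb_month = 20.0     # $/TB/month
bq_query_tb = 5.0              # $/TB processed on demand

stored_tb = 2.0                # assumed dataset size
scanned_tb_month = 30.0        # assumed total bytes billed per month

redshift_monthly = stored_tb * redshift_hdd_tb_hour * HOURS_PER_MONTH
bigquery_monthly = stored_tb * bq_storage_tb_month + scanned_tb_month * bq_query_tb

print(f"Redshift (HDD): ~${redshift_monthly:,.0f}/month")
print(f"BigQuery:       ~${bigquery_monthly:,.0f}/month")
```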

Bytes processed in a query, also known as bytes billed, are tricky to reason about, and they usually make up about 95% of your BigQuery costs.

BigQuery uses columnar storage, and billing is based on the data scanned in the columns your query touches, not on whole rows.
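You can see what a query will be billed before running it by doing a dry run. A minimal sketch using the google-cloud-bigquery Python client; the query against the public NOAA dataset (discussed next) is illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT stn, AVG(temp) AS avg_temp
FROM `bigquery-public-data.noaa_gsod.gsod2018`
GROUP BY stn
"""

# A dry run estimates bytes billed without executing (or paying for)
# the query; the cache is disabled so the estimate reflects a cold run.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

# Only the referenced columns (stn, temp) count toward bytes billed.
tb = job.total_bytes_processed / 1e12  # decimal TB, fine for rough estimates
print(f"Estimated scan: {job.total_bytes_processed:,} bytes (~${tb * 5:.4f} at $5/TB)")
```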

Consider, for example, a query against the NOAA Global Surface Summary of the Day weather dataset. By filtering on the table suffix, the bytes billed are reduced to the size of the selected columns in the tables gsod2010 through gsod2018 (a sketch follows below).
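The original query isn’t reproduced here, but a sketch of the suffix-filtered pattern it describes, with illustrative columns, could look like this:

```python
from google.cloud import bigquery

# Sketch of table-suffix filtering (column choice is illustrative): the
# wildcard table plus _TABLE_SUFFIX limits the scan to gsod2010..gsod2018,
# and only the selected columns count toward bytes billed.
sql = """
SELECT year, mo, da, AVG(temp) AS avg_temp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2010' AND '2018'
GROUP BY year, mo, da
"""

client = bigquery.Client()
for row in client.query(sql).result():
    print(row.year, row.mo, row.da, row.avg_temp)
```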

At the moment, BigQuery supports the CREATE TABLE AS statement in beta, and it also supports load jobs; either can be used to create a table from the results of an SQL query.

So, for our query we can introduce a roll-up table and then query it directly; a sketch of this two-stage pattern follows below. Materializing the roll-up costs us 3.04 GB of processing, which is obviously more than a single suffix-filtered query, but every subsequent query runs against the much smaller roll-up.
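The original roll-up example isn’t shown here; a hedged reconstruction of the two-stage pattern, with hypothetical dataset and table names, might look like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stage 1 (run once, scans the raw tables): materialize a roll-up.
# "my_dataset.gsod_daily_rollup" is a hypothetical placeholder.
create_rollup = """
CREATE TABLE my_dataset.gsod_daily_rollup AS
SELECT year, mo, da, AVG(temp) AS avg_temp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2010' AND '2018'
GROUP BY year, mo, da
"""
client.query(create_rollup).result()

# Stage 2 (run many times): query the small roll-up instead of raw data.
query_rollup = """
SELECT year, AVG(avg_temp) AS yearly_avg
FROM my_dataset.gsod_daily_rollup
GROUP BY year
"""
for row in client.query(query_rollup).result():
    print(row.year, row.yearly_avg)
```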

Meanwhile, for that price, BigQuery will only allow you roughly 10-50 queries per day per 1 TB of data stored. At the same time, we have seen plenty of BigQuery deployments that are at least four times cheaper than Redshift thanks to multi-stage querying.
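A rough break-even sketch makes the trade-off concrete; the per-query scan size is an assumed figure, everything else comes from the list prices above:

```python
# How much scanning the entry-level Redshift price buys on BigQuery.
redshift_entry_monthly = 180.0   # $/month, smallest cluster (quoted above)
bq_query_rate = 5.0              # $/TB processed

tb_budget = redshift_entry_monthly / bq_query_rate
print(f"{tb_budget:.0f} TB of BigQuery scanning per month")  # 36 TB

# If a typical un-optimized query scans, say, 100 GB, that budget covers
# roughly 360 queries per month, about 12 per day.
queries_per_day = tb_budget * 1000 / 100 / 30
print(f"~{queries_per_day:.0f} queries/day at 100 GB scanned per query")
```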

Redshift vs. BigQuery: The Full Comparison

As our platform delivers full-stack data automation, a critical chunk of the stack hinges not only on the massively parallel data warehouse used internally to store hundreds of terabytes of data, but also on the capability to analyze it in minutes.

Analytical queries, on the other hand, are normally performed by just a few analysts, and each query is a batch process over a huge dataset that can take minutes or even hours to compute.

Conventional relational database systems, like Postgres and MySQL, store data internally in row form: all of a row’s fields are stored together, usually indexed by a primary key for efficient access.

This setup is perfect for most transactional operations, such as reading a row by id to display a user profile; however, it is ineffective for many analytical use-cases.

Columnar databases, by contrast, store each column separately. This structure significantly speeds up analytical queries by reading only the columns involved in the query, which accelerates both disk access and CPU cache utilization.
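A toy illustration of the difference, with plain Python structures standing in for on-disk layout (purely illustrative):

```python
# Row store: each record's fields live together.
rows = [
    {"id": 1, "name": "alice", "spend": 120.0},
    {"id": 2, "name": "bob",   "spend": 80.0},
]

# Column store: each column lives together.
columns = {
    "id":    [1, 2],
    "name":  ["alice", "bob"],
    "spend": [120.0, 80.0],
}

# An analytical query like "average spend" touches every field of every
# record in the row store, but reads just one column in the column store.
avg_from_rows = sum(r["spend"] for r in rows) / len(rows)
avg_from_columns = sum(columns["spend"]) / len(columns["spend"])
assert avg_from_rows == avg_from_columns
```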

In addition, they’re designed to be massively parallel where every query is distributed across multiple servers to accelerate query execution.

In an analytical context, performance has a direct effect on the productivity of your team: running a query for hours means days of iteration on a single business question (think of the compilation times of decades past).

We needed to select a technology that would scale well not only in data volume but, just as importantly, in the quantity and complexity of queries and in the number of analysts on our team.

There is a paucity of published benchmarks comparing BigQuery and Amazon Redshift performance, and we could find none that ran a real-world comparison with Redshift configured with correct sort keys and dist keys.

The only caveat is that when the data is generated in AWS (where we run our production infrastructure), transferring it to Google Cloud Storage took an additional 7,820 seconds (roughly two hours) using Google’s transfer service.

This approach affords considerable flexibility in fine-tuning performance to business needs, especially when distributing data across different clusters and node types.

For example: we store frequently-accessed critical data on DC1 SSD which gives us just over a 10x performance boost, while everything else is kept on slower/cheaper instance types.

Instead of using vanilla Redshift, unoptimized and unconfigured, we applied some best-practice optimizations, like compressing tables and setting sort and distribution keys; a sketch of what this looks like follows below.
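This is not our actual benchmark schema, but the kind of optimization meant here looks roughly like the following, executed via psycopg2 (table name, columns, keys, and encodings are all illustrative):

```python
import psycopg2

# Connection parameters are placeholders for your own cluster.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

# Illustrative DDL: per-column compression encodings plus sort and
# distribution keys, the optimizations mentioned above.
ddl = """
CREATE TABLE events (
    user_id    BIGINT      ENCODE delta,
    event_time TIMESTAMP   ENCODE lzo,
    event_type VARCHAR(64) ENCODE zstd
)
DISTKEY (user_id)
SORTKEY (event_time);
"""
with conn, conn.cursor() as cur:
    cur.execute(ddl)
```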

To work around BigQuery’s result caching, with each execution we made a tiny parameterized change to the query (like changing the variable X from 1000 to 1000.01, as sketched below) in order to invalidate the cache.
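The cache-busting trick might look like the following sketch; the query body is a stand-in, not the actual benchmark query:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Nudge the parameter X on every run so BigQuery can't serve the
# result from its query cache. The query body is illustrative.
x = 1000.0
for run in range(3):
    sql = f"""
    SELECT COUNT(*) AS n
    FROM `bigquery-public-data.noaa_gsod.gsod2018`
    WHERE temp < {x}
    """
    result = list(client.query(sql).result())
    print(f"run {run}: x={x} n={result[0].n}")
    x += 0.01   # e.g. 1000 -> 1000.01, invalidating the cache
```

(The client also lets you disable the cache outright via use_query_cache=False, as in the dry-run sketch earlier; the parameter tweak mirrors what the benchmark describes.)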

The same approach was used with Redshift; however, Redshift caches data rather than queries, which gives it an edge at handling similar queries with modified parameters while underperforming on repeated identical queries.

The first query has three variations, where X = 1000 (small), 100 (medium) and 10 (large). Note that BigQuery doesn’t allow query results returned to the client to exceed 128 MB.

The second query has three variations, where X = 8 (small), 10 (medium) and 12 (large). The third and final query measures join speed.

It has three variations, where X = 1980-04-01 (small), 1983-01-01 (medium) and 2010-01-01 (large). Contrary to previous findings that didn’t consider optimization, a reasonably optimized Redshift outperforms BigQuery hands down in 9 out of 11 use-cases.

Because BigQuery doesn’t provide any indexes, and many analytical queries cover entire database, we can assume that each query will need to scan a big chunk of the data.

In addition to the flat cost, there are three additional drawbacks to BigQuery’s pricing model.

Usability

Usability is probably the most debated, intangible and subjective aspect of all database comparisons.

That said, it cannot be skipped, so we mapped out the key subjects we cared about in terms of usability when comparing the two databases. Simplicity is by far the biggest advantage BigQuery holds over Redshift throughout this entire comparison.

This often translates into dedicated engineers battling to keep your company’s infrastructure up to par with your company’s data scale and query patterns, as well as with the industry’s best practices.

Redshift also supports the NUMERIC type, which stores values with user-defined precision. This matters for exact numeric operations, for example when dealing with financial data, where rounding errors are unacceptable.
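A small sketch of why fixed precision matters, alongside an illustrative DDL (table and column names are assumptions):

```python
from decimal import Decimal

# Binary floats drift on money math; fixed-point decimals do not.
print(0.1 + 0.2)                        # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2"))  # 0.3 exactly

# In Redshift, the equivalent guarantee comes from NUMERIC with explicit
# precision and scale, e.g. 18 total digits, 4 after the decimal point:
ddl = """
CREATE TABLE ledger (
    account_id BIGINT,
    amount     NUMERIC(18, 4)
);
"""
```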

This apparent complexity is double-edged: it can equally be seen as flexibility that lets us fine-tune Redshift to our specific needs, resulting in an even greater advantage in performance and cost.

So when we set out to develop our platform, it was easy to justify trading simplicity for functionality, and to focus on elevating Redshift to the same level of simplicity as BigQuery, and then taking it way beyond.

Redshift vs BigQuery – Choosing the Right Data Warehouse

As analytics in your company graduates from MySQL/PostgreSQL/SQL Server, a pertinent question you need to answer is which data warehouse is best suited for you.

Our customers want to know which data warehouse will give them faster query times, how much data it will be able to handle, and what it will cost.

The answer depends on various inputs like the size of data, the nature of use and the technical capability of users managing the warehouse.

In many head-to-head tests, Redshift has shown better query times when configured and tuned correctly.

For example, Redshift will expect you to know how to distribute your data across nodes, and it will require periodic vacuuming operations; a maintenance sketch follows below.
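Periodic maintenance might look like this sketch, again via psycopg2 (connection details and table name are placeholders):

```python
import psycopg2

# Placeholders for your own cluster's connection details.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # Reclaim space and re-sort rows after deletes/updates,
    # then refresh the planner's statistics.
    cur.execute("VACUUM FULL events;")
    cur.execute("ANALYZE events;")
```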

On the surface this pricing might seem cheaper, but it makes BigQuery costs unpredictable, and BigQuery can turn out to be more expensive than Redshift when query volumes are high.

How to Estimate Google BigQuery Pricing

Since Google BigQuery pricing is entirely usage-based, there are only three core aspects you need to consider when estimating costs: Storage Data, Long Term Storage Data, and Query Data Usage.

If we estimate that 25% of our 500 TB dataset ($10,000 per month at $20 / TB / month) will remain static and only be viewed or queried, we can reduce that Data Storage estimate to $8,750 per month, a savings of $1,250 per month due to Long Term Storage.

To calculate Query Data Usage, we start by estimating a few basic parameters of our service and then apply a simple calculation to get monthly costs. If we estimate about 150 users per day, each running 50 queries per day, with an average of 5 GB of data processed per query, that works out to roughly 1,125 TB of Query Data Usage per month. Multiplying by BigQuery’s $5 per TB rate at the time of writing gives an estimate of ~$5,625 / month for Query Data Usage.
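The whole estimate fits in a few lines; all inputs are the assumptions stated above:

```python
# Storage: 500 TB at $20/TB/month, with 25% qualifying for
# long-term storage at half price ($10/TB/month).
stored_tb, storage_rate, long_term_rate = 500, 20, 10
long_term_share = 0.25
storage_cost = (stored_tb * (1 - long_term_share) * storage_rate
                + stored_tb * long_term_share * long_term_rate)
print(f"Storage: ${storage_cost:,.0f}/month")   # $8,750

# Queries: 150 users x 50 queries/day x 5 GB/query x 30 days.
users, queries_per_user, gb_per_query, days = 150, 50, 5, 30
scanned_tb = users * queries_per_user * gb_per_query * days / 1000
query_cost = scanned_tb * 5   # $5 per TB
print(f"Queries: {scanned_tb:,.0f} TB -> ${query_cost:,.0f}/month")  # $5,625
```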

Analyzing Big Data in less time with Google BigQuery

Most experienced data analysts and programmers already have the skills to get started. BigQuery is fully managed and lets you search through terabytes of data ...

BigQuery and Redshift comparison

Data Warehouse Showdown: Redshift vs BigQuery vs Snowflake

Data Warehouse Showdown: Redshift vs BigQuery vs Snowflake, by George Fraser, CEO & Co-Founder, Fivetran. Funny pre-event tech-trivia: ...

BigQuery and Open Datasets

We all love data. But it can be hard to make practical use of large datasets. In this episode of AI Adventures, Yufeng Guo introduces BigQuery public datasets, ...

Google BigQuery Q&A #6: Pricing Predictability of Cloud Data Warehouses

Querying Massive Datasets using Google BigQuery

Ryan Boyd and Michael Manoochehri show you how to query some massive datasets using Google BigQuery.

Data Warehousing with Amazon Redshift

Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale ...

Trivia: Data Warehouse Showdown: Redshift vs BigQuery vs Snowflake

Pre-event trivia hosted by Eric David Benari from the Database Month NoSQL & NewSQL NYC Meetup: Data Warehouse Showdown: Redshift vs BigQuery vs ...

Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL

Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3, using standard SQL. With Athena, there is no infrastructure ...

Using BigQuery for near real-time analytics - Le Kien Truc - FOSSASIA 2018

Speaker: Le Kien Truc. Info: Real-time ingestion and analysis of data streams is advantageous for ...