To cajole skeptical colleagues into seeing my point of view, I ventured into a corner of the internet I had done nothing but roll my eyes at; it’s amazing the effort we’ll put in to win an argument.

Hungry for a free, huge dataset, I reached for the public ledger of Polygon, a web3 cryptocurrency project. Opening up the mysterious terabyte payload, I wondered aloud what a dApp column means and questioned whether the gas fees scrolling across my screen were worth it.

I’m here to run large-scale quantile computations and these NFT traders are going to help me.

BigQuery’s approximate aggregations save considerable processing cost and even make certain operations possible at huge scale. But scant details of their implementation are made available, which limits our understanding of their characteristics and leads teams to fall back on tried-and-true exact algorithms.

In this post I’ll demonstrate that approx_quantiles and the other approximate aggregations in BigQuery are deterministic by comparing them against exact queries across caching behavior, runtime characteristics and data sampling; with this information you can decide whether they are right to add to your services.

What’s this even about?

Let’s get technical: a deterministic function returns the same result under identical inputs.
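As a toy illustration (Python here purely for brevity; the function names are made up), the second function below is non-deterministic for the same reason the CURRENT_TIMESTAMP() cache-buster queries later in this post are:

import time

def median(values):
    # deterministic: identical inputs always produce the identical output
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def median_with_timestamp(values):
    # non-deterministic: the output changes between calls for the same
    # input, because it depends on when the function runs
    return median(values), time.time()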

A few years ago I worked on a project in the consumer space for a company that maintained leaderboards of performance statistics for millions of its users. Among the most expensive APIs the service supported was determining a particular user’s rank within a club, i.e. a set of other users. The exact operation scanned the entire table (a Cassandra partition in this case) looking for the ranks of their friends to compare the user against; this rarely used API easily accounted for half of the system’s total CPU utilization. Lucky for the company’s infrastructure, it was tucked away behind several clicks.

With this in mind, an argument was made to use approximate algorithms. It’s worth asking in the context of a social application: how would a user who is “exactly” ranked 23,512th out of 100,000+ competitors ever know if an approximate function with a 0.5% error reported a rank of 23,400th instead and saved a tremendous number of compute cycles?

At the time of writing this post I’m working at an A/B experimentation company selling to enterprise customers - the software needs to produce reproducible results for its customers’ experiments while being able to execute against the warehouses of some of the largest companies in the world in a timely fashion. The engineers on this team need to carefully select deterministic algorithms.

Side-track: how are quantiles computed?

No, you didn’t stumble onto LeetCode - it’s pretty straightforward: take a list of elements, sort them while keeping a count, and any percentile is then trivial to ascertain.
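As a sketch of the exact approach (nearest-rank flavor; PERCENTILE_CONT interpolates between neighbors, but the shape of the work - sort, then index - is the same; this is a toy, not what any warehouse actually runs):

import math

def exact_percentile(values, p):
    """Exact percentile by sorting: O(n log n) time, with the whole
    dataset held (or shuffled) somewhere so it can be sorted."""
    ordered = sorted(values)  # the expensive part
    index = max(0, math.ceil(p * len(ordered)) - 1)  # nearest-rank index
    return ordered[index]

gas_used = [21_000, 52_000, 60_000, 61_000, 75_000, 950_000]
print(exact_percentile(gas_used, 0.50))  # 60000
print(exact_percentile(gas_used, 0.99))  # 950000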

While simple to reason about, these algorithms are slow because they need to sort the elements. For large distributed systems this involves shuffling massive amounts of data around the network.

Compare that to t-digest, a well understood data structure which aids in computing approximate quantiles: it simply needs to scan the dataset once, accumulating statistics without ever sorting it.
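To give a feel for the one-pass idea - and only the idea; this toy is not t-digest (which is much smarter about which centroids it merges, keeping the tails accurate) and certainly not BigQuery’s internal implementation - here is a sketch that caps memory at a fixed number of centroids and merges the closest neighbours as data streams by:

import bisect
import random

class ToyQuantileSketch:
    """One-pass quantile summary with bounded memory (illustration only)."""

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, count] pairs

    def add(self, x):
        bisect.insort(self.centroids, [float(x), 1])
        if len(self.centroids) > self.max_centroids:
            # merge the two adjacent centroids whose means are closest
            i = min(
                range(len(self.centroids) - 1),
                key=lambda j: self.centroids[j + 1][0] - self.centroids[j][0],
            )
            (m1, c1), (m2, c2) = self.centroids[i], self.centroids.pop(i + 1)
            self.centroids[i] = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]

    def quantile(self, q):
        total = sum(c for _, c in self.centroids)
        running = 0
        for mean, count in self.centroids:
            running += count
            if running >= q * total:
                return mean
        return self.centroids[-1][0]

sketch = ToyQuantileSketch()
for _ in range(100_000):
    sketch.add(random.random())
print(sketch.quantile(0.5), sketch.quantile(0.99))  # roughly 0.5 and 0.99

The point is simply that one scan plus a small, mergeable summary replaces a full sort and shuffle; the real algorithms are far more careful about error bounds.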

The difference in runtime is O(n log n) compared to O(n) - this is theoretical for now, but later in this post I’ll show that the empirical differences on real queries are profound.

For non-trivially sized datasets these exact algorithms might even be untenable to compute - not only is the cost of executing them 1 or 2 orders of magnitude greater, but it’s common to be unable to complete them at all because warehouse workers run out of memory.

Run some BIG queries

It’s that time everyone’s been waiting for - I’ll review my experiment with exact and approximate quantile computation to understand how they compare.

We’ll use the information collected to demonstrate that BigQuery’s approximate aggregations are deterministic.

Proving that if BigQuery caches a query’s results, the approximate aggregations it contains are deterministic

BigQuery’s documentation says that query results are cached and identical requests will be served from cache without additional cost, albeit with some notable exceptions:

  • If any of the referenced tables or logical views have changed since the results were previously cached.
  • If the query uses non-deterministic functions; for example, date and time functions such as CURRENT_TIMESTAMP() and CURRENT_DATE, and other functions such as SESSION_USER(), which return different values depending on when the query is executed.

We know that when BigQuery caches a query its internal engine has blessed it as deterministic; therefore, if the results of queries containing approximate aggregations are cached, then those functions are deterministic.

Test plan

Run several queries to explore which are cached (a sketch for checking cache hits programmatically follows this list).

  • Exact quantiles
  • Approximate quantiles
  • Each of the above with a known non-deterministic function added, to observe the cache breaking
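To check cache hits programmatically instead of eyeballing the console, something like the following works with the google-cloud-bigquery Python client (a sketch: the project id is a placeholder, the table reference mirrors the queries below, and the job metadata exposes cache_hit and total_bytes_processed):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

SQL = """
SELECT
  approx_quantiles(receipt_gas_used, 100)[OFFSET(50)] AS percentile_50
  , approx_quantiles(receipt_gas_used, 100)[OFFSET(99)] AS percentile_99
FROM `crypto_polygon.transactions`
"""

# use_query_cache defaults to True; set explicitly for clarity
config = bigquery.QueryJobConfig(use_query_cache=True)

# Run the identical query twice; if BigQuery treats every function in it
# as deterministic, the second job should report cache_hit=True and
# process zero bytes.
for attempt in (1, 2):
    job = client.query(SQL, job_config=config)
    job.result()  # block until the query finishes
    print(f"attempt {attempt}: cache_hit={job.cache_hit}, "
          f"bytes={job.total_bytes_processed}")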

Setting up the experiment

As I joked in the beginning of the post, for this experiment I’m going to use a public BigQuery dataset: a snapshot of Polygon, an Ethereum-based smart contract platform.

Experiment 1A: Exact 50th and 99th percentile of gas costs

SELECT
  -- PERCENTILE_CONT is used as an analytic function (hence OVER());
  -- every row receives the same value, so LIMIT 1 keeps just one.
  PERCENTILE_CONT(receipt_gas_used, 0.5) OVER() AS percentile50
  , PERCENTILE_CONT(receipt_gas_used, 0.99) OVER() AS percentile99
FROM `crypto_polygon.transactions`
LIMIT 1

Data

  • percentile50 = 60,667
  • percentile99 = 950,831

Runtime

  • Bytes processed = 15.29 GB
  • Bytes shuffled = 300 GB
  • Elapsed time = 42 sec
  • Slot time consumed = 9hr 33min

Experiment 1B: Repeat exact query

✨Served from cache!

Runtime

  • Bytes processed = 0 B
  • Bytes shuffled = 0 B
  • Elapsed time = 0 sec
  • Slot time consumed = 0 sec

Experiment 1C: Add cache buster

The current_timestamp function is documented as non-deterministic, so this query is never cached.

SELECT
  PERCENTILE_CONT(receipt_gas_used, 0.5) OVER() AS percentile50
  , PERCENTILE_CONT(receipt_gas_used, 0.99) OVER() AS percentile99
  -- non-deterministic function added purely to bust the cache
  , current_timestamp()
FROM `crypto_polygon.transactions`
LIMIT 1

Experiment 2A: Approximate 50th and 99th percentile of gas cost

Switching over now to looking at approximate quantiles.

SELECT
  -- APPROX_QUANTILES(x, 100) returns an array of 101 approximate quantile
  -- boundaries; OFFSET(50) is the median, OFFSET(99) the 99th percentile.
  approx_quantiles(receipt_gas_used, 100)[OFFSET(50)] as percentile_50
  , approx_quantiles(receipt_gas_used, 100)[OFFSET(99)] as percentile_99
FROM `crypto_polygon.transactions`

Data

  • percentile50 = 60,595
  • percentile99 = 951,204

Runtime

  • Bytes processed = 15.29 GB
  • Bytes shuffled = 0.5 GB
  • Elapsed time = 7 sec
  • Slot time consumed = 13min 57sec

Experiment 2B: Repeat the approximate query

✨Served from cache!

Runtime

  • Bytes processed = 0 B
  • Bytes shuffled = 0 B
  • Elapsed time = 0 sec
  • Slot time consumed = 0 sec

Experiment 2C: Add cache buster

Adds a current_timestamp column to avoid the cache.

SELECT
  approx_quantiles(receipt_gas_used, 100)[OFFSET(50)] as percentile_50
  , approx_quantiles(receipt_gas_used, 100)[OFFSET(99)] as percentile_99
  , current_timestamp()
FROM `crypto_polygon.transactions`

What did we demonstrate?

First and foremost, BigQuery caches the results of queries which contain approximate aggregations in the same way as those using exact algorithms. This is a strong indicator that these functions are considered deterministic by BigQuery’s developers.

All results

Data

| Query       | percentile50 | percentile99 |
|-------------|--------------|--------------|
| Exact       | 60,667       | 950,831      |
| Approximate | 60,595       | 951,204      |
| Error       | 0.12%        | 0.04%        |
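(Error here is the relative difference from the exact value: |60,595 - 60,667| / 60,667 ≈ 0.12%, and |951,204 - 950,831| / 950,831 ≈ 0.04%.)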

Runtime

| Query       | Bytes processed | Bytes shuffled | Elapsed time | Slot time consumed |
|-------------|-----------------|----------------|--------------|--------------------|
| Exact       | 15.29 GB        | 300 GB         | 42 sec       | 9hr 33min          |
| Approximate | 15.29 GB        | 0.5 GB         | 7 sec        | 13min 57sec        |

Is sampling happening?

If BigQuery’s implementation were sampling data internally with the same random seed each time, it could give the appearance of determinism while masking its true behavior.

The documentation unequivocally states otherwise:

The approximate aggregate functions in this section work directly
on the input data, rather than an intermediate estimation of the data.

We see in our experimental results that the same number of bytes is read by both algorithms. This is a strong indication that no sampling is happening.

Compare to explicit table sampling for exact algorithm

SELECT
  PERCENTILE_CONT(receipt_gas_used, 0.5) OVER() AS percentile50
  , PERCENTILE_CONT(receipt_gas_used, 0.99) OVER() AS percentile99
  , current_timestamp()  -- cache buster
FROM `crypto_polygon.transactions`
-- block-level sampling: reads roughly 10% of the table's data blocks
TABLESAMPLE SYSTEM (10 PERCENT)
LIMIT 1

Data

  • percentile50 = 60,356
  • percentile99 = 964,826

Runtime

  • Bytes processed = 1.53 GB
  • Bytes shuffled = 30 GB
  • Duration = 29 sec
  • Slot time consumed = 30min 35sec

Table sampling with approximate algorithm

SELECT
  approx_quantiles(receipt_gas_used, 100)[OFFSET(50)] as percentile_50
  , approx_quantiles(receipt_gas_used, 100)[OFFSET(99)] as percentile_99
  , current_timestamp()
FROM `crypto_polygon.transactions`
TABLESAMPLE SYSTEM (10 PERCENT);

Data

  • percentile50 = 59,965
  • percentile99 = 919,490

Runtime

  • Bytes processed = 1.53 GB
  • Bytes shuffled = 50 MB
  • Duration = 1 sec
  • Slot time consumed = 57 sec

Table sampling demonstrates that far fewer bytes are read when the table is explicitly sampled; the contrast also shows that approx_quantiles on its own performs no sampling.

I’m really excited to see how low the error is at the 99th percentile for the unsampled approximate function.

Final data tabulation

Data

| Query                       | percentile50 | percentile99 |
|-----------------------------|--------------|--------------|
| Exact                       | 60,667       | 950,831      |
| Approximate                 | 60,595       | 951,204      |
| Exact (with sampling)       | 60,356       | 964,826      |
| Approximate (with sampling) | 59,965       | 919,490      |

Runtime

| Query                       | Bytes processed | Bytes shuffled | Elapsed time | Slot time consumed |
|-----------------------------|-----------------|----------------|--------------|--------------------|
| Exact                       | 15.29 GB        | 300 GB         | 42 sec       | 9hr 33min          |
| Approximate                 | 15.29 GB        | 0.5 GB         | 7 sec        | 13min 57sec        |
| Exact (with sampling)       | 1.53 GB         | 30 GB          | 29 sec       | 30min 35sec        |
| Approximate (with sampling) | 1.53 GB         | 0.05 GB        | 1 sec        | 57 sec             |

Really neat to see that the number of bytes processed is exactly 1/10th as much for the sampled queries as for the full dataset.

A really important observation is that the elapsed time and slot time consumed do NOT simply diminish by 10x for the exact algorithm.

We already discussed that the exact algorithm is slower than linear, whereas the approximate algorithm’s resource utilization scales with data volume.

Concluding thoughts

Snowflake publishes that its internal algorithm is t-digest - it would be exciting if BigQuery also shed some light on its internal implementation or allowed a configurable error parameter.

I’m excited to use these functions as the need arises for outlier detection, dimensionality reduction and other applications.

BigQuery’s documentation “encourages” developers to implement their own approximate aggregations using the lower-level sketch functions it provides - for example, the HyperLogLog++ functions, which have parameterized precision - and that seems like a fantastic challenge for another time.

Read the discussion on my Twitter post.
