AI News, Percentile and Quantile Estimation of Big Data: The t-Digest

Percentile and Quantile Estimation of Big Data: The t-Digest

Suppose you are interested in the sample mean of an array. No problem, you think, as you create a small function to sum the elements and divide by the total count.

Now suppose the dataset is spread across many computers. No problem, you think, as you create a function that returns the sum of the elements and the count of the elements, send this function to each computer, and divide the sum of all the sums by the sum of all the counts.
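The sum-and-count approach can be sketched in a few lines. This is only an illustration: the "computers" are plain Python lists, and the names partial and combine are made up for this example.

```python
def partial(chunk):
    """Runs on each machine: return the local sum and the local count."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Runs on the coordinator: sum of all sums over sum of all counts."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three hypothetical machines, each holding part of the data.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
mean = combine([partial(c) for c in chunks])
print(mean)  # 45 / 9 = 5.0
```

Each machine sends back only two numbers, however much data it holds.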

Next, suppose you are interested in the sample median of that same distributed dataset. No problem, you think, as you create a function that sorts the array and takes the middle element, send this function to each computer, and - wait.
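A tiny example shows why the same trick fails for the median: combining per-machine medians does not, in general, give the median of the whole dataset. The two chunks below are made up for illustration.

```python
import statistics

# Two hypothetical machines holding different slices of the data.
chunks = [[1, 2, 9], [3, 4, 5, 6, 7]]

local_medians = [statistics.median(c) for c in chunks]  # 2 and 5
median_of_medians = statistics.median(local_medians)    # 3.5

all_values = [x for c in chunks for x in c]
true_median = statistics.median(all_values)             # 4.5

print(median_of_medians, true_median)
```

Getting the exact answer means bringing every value to one place and sorting it, which is exactly what a distributed or streaming setting tries to avoid.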

What's needed is an algorithm that can approximate the median while still being space-efficient. First published in 2013 by the uber-practical and uber-intelligent Ted Dunning, the t-Digest is a probabilistic data structure for estimating the median (and more generally any percentile) from either distributed data or streaming data.
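To make the idea concrete, here is a heavily simplified sketch of a mergeable quantile summary in the spirit of the t-Digest. It is not Dunning's implementation: the function names compress and quantile are hypothetical, the input is assumed to be non-empty, and the cluster size limit of 4 * N * q * (1 - q) / delta is an assumption chosen to keep clusters small near the tails and larger near the middle.

```python
import random
import statistics


def compress(centroids, delta=100):
    """Greedily merge sorted (mean, weight) pairs into fewer clusters.

    Assumed rule: a cluster centred at quantile q may hold at most
    about 4 * N * q * (1 - q) / delta points.
    """
    pts = sorted(centroids)
    total = sum(w for _, w in pts)
    out = []
    done = 0  # weight already emitted in closed clusters
    cur_mean, cur_w = pts[0]
    for mean, w in pts[1:]:
        new_w = cur_w + w
        q = (done + new_w / 2) / total  # quantile at the candidate's middle
        limit = max(1.0, 4 * total * q * (1 - q) / delta)
        if new_w <= limit:
            # Absorb the point: update the weighted mean.
            cur_mean = (cur_mean * cur_w + mean * w) / new_w
            cur_w = new_w
        else:
            out.append((cur_mean, cur_w))
            done += cur_w
            cur_mean, cur_w = mean, w
    out.append((cur_mean, cur_w))
    return out


def quantile(digest, q):
    """Estimate the q-quantile by interpolating between cluster means."""
    total = sum(w for _, w in digest)
    target = q * total
    cum = 0.0
    prev_mean, prev_mid = digest[0][0], 0.0
    for mean, w in digest:
        mid = cum + w / 2  # rank at the centre of this cluster
        if target <= mid:
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mean, prev_mid = mean, mid
        cum += w
    return digest[-1][0]


# Four hypothetical machines each summarize their own chunk ...
random.seed(0)
data = [random.random() for _ in range(20000)]
chunks = [data[i::4] for i in range(4)]
digests = [compress([(x, 1) for x in chunk]) for chunk in chunks]

# ... and the coordinator merges the summaries, never the raw data.
merged = compress([c for d in digests for c in d])
estimate = quantile(merged, 0.5)
exact = statistics.median(data)
print(len(merged), estimate, exact)
```

The point of the design is that merging two summaries uses the same routine as building one, so it works equally for distributed chunks and for batches arriving from a stream.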

Flat portions of the CDF, such as those near x=3 and x=8, need only a few points to summarize them. Running a small test locally, I streamed 8 MB of Pareto-distributed data into a t-Digest.

T-Digest: An interesting data structure to estimate quantiles accurately.

A new data structure for accurate accumulation of rank-based statistics such as quantiles &

Quartiles divide a dataset into 4 equal parts, and percentiles divide it into 100 equal parts.[1]

Trimmed mean: The trimmed mean is the average of the dataset that remains after trimming X% of the values.
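A short sketch of these definitions, using Python's standard statistics module (statistics.quantiles requires Python 3.8 or later). The trimmed_mean helper is hypothetical, and it assumes that "trimming X%" means dropping X% of the values from each end of the sorted data; the text does not say which convention is meant.

```python
import statistics


def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end (assumed convention)."""
    data = sorted(values)
    k = int(len(data) * percent / 100)  # number of values cut from each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

quartiles = statistics.quantiles(scores, n=4)      # 3 cut points
percentiles = statistics.quantiles(scores, n=100)  # 99 cut points

print(statistics.mean(scores))   # 104.5, dragged up by the outlier
print(trimmed_mean(scores, 10))  # 5.5, the outlier is trimmed away
```

Note that the middle quartile is the median, and that exact cut-point values depend on the interpolation method used.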

The trimmed mean is less susceptible to the effects of extreme scores than the arithmetic mean.[3]

Internals of T-Digest:

Sample Case Study:

Problem statement: Let’s say we have a dataset of a billion values ranging from 100 to 10000000.

For a more detailed explanation, refer to this blog: [6]

Currently using:

References:
[1] Medians/Quantiles/Outliers well explained
[2] Presentation by Ted Dunning (T-Digest author)
[3] Trimmed Mean explained
[4] CDF (Cumulative Distribution Function)
[5] Paper on T-Digest
[6] Blog on T-Digest, Anomaly
