AI News, NIPS Proceedingsβ
Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)
We present an inference method for Gaussian graphical models when only pairwise distances of n objects are observed.
We argue that the extension is highly relevant as it yields significantly better results in both synthetic and real-world experiments, which is successfully demonstrated for a network of biological pathways in cancer patients.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
An algorithm that is designed for one kind of model will generally fail on a data set that contains a radically different kind of model. For example, k-means cannot find non-convex clusters.
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away.
At different distances, different clusters will form, and these can be represented using a dendrogram. This explains where the common name 'hierarchical clustering' comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
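The merging behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, assuming single linkage (distance between clusters is the distance between their closest members) and Euclidean distance; the text does not specify either choice, and the names `single_linkage` and `cut` are hypothetical. The merge history plays the role of the dendrogram, and cutting it at a distance threshold yields one partitioning.

```python
from itertools import combinations
from math import dist


def single_linkage(points):
    """Agglomerative clustering with single linkage (an assumed linkage).

    Returns the merge history as (distance, cluster_a, cluster_b) tuples,
    where each cluster is a sorted tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((d, a, b))
    return merges


def cut(points, merges, threshold):
    """Partition obtained by applying only merges at distance <= threshold."""
    clusters = {(i,) for i in range(len(points))}
    for d, a, b in merges:
        if d > threshold:
            break  # single-linkage merge distances never decrease
        clusters -= {a, b}
        clusters.add(tuple(sorted(a + b)))
    return sorted(clusters)


points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
history = single_linkage(points)
print(cut(points, history, 2))  # a fine partitioning
print(cut(points, history, 6))  # a coarser one from the same hierarchy
```

The brute-force search is cubic or worse in the number of points, so this is only meant to show the idea, not to be used on real data.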
When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the sum of squared distances from the objects to their assigned centers is minimized.
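As a rough sketch of this optimization problem, the following Python code implements Lloyd's algorithm, the usual heuristic for k-means, together with the objective it tries to reduce. The function names `kmeans` and `sse` are hypothetical, the initial centers are assumed to be supplied by the caller, and the result is in general only a local optimum.

```python
from math import dist


def kmeans(points, centers, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, labels


def sse(points, centers, labels):
    """k-means objective: sum of squared distances to assigned centers."""
    return sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(centers, labels, sse(points, centers, labels))
```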
Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).
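To illustrate what "choosing the initial centers less randomly" can mean, here is a minimal sketch of k-means++ style seeding: the first center is drawn uniformly, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_init` and the `seed` parameter are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random
from math import dist


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from `points` using k-means++ style seeding."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # Far-away points are more likely to be picked; chosen points have
        # weight zero and cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(kmeans_pp_init(points, 2))
```

The returned centers could then be handed to any k-means implementation as its starting point.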
Another interesting property of DBSCAN is that its complexity is fairly low (it requires a linear number of range queries on the database) and that it will discover essentially the same results in each run (it is deterministic for core and noise points, but not for border points), so there is no need to run it multiple times.
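The following Python sketch shows one way DBSCAN can be written so that each point triggers exactly one range query, which is where the linear number of queries comes from. It is a simplified illustration, not a reference implementation: the range query is brute force, the distance is assumed to be Euclidean, and the names `dbscan`, `eps`, `min_pts` and `NOISE` are chosen here for illustration.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (NOISE for noise)."""

    def region_query(i):
        # Brute-force range query: indices within eps of point i (i included).
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

A border point that lies within `eps` of core points from two clusters is assigned to whichever cluster reaches it first, which is the order dependence mentioned above.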
Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.
In recent years, considerable effort has been put into improving the performance of existing algorithms. Among them are CLARANS (Ng and Han, 1994) and BIRCH (Zhang et al., 1996). With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing.
This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting 'clusters' are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering.
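A rough sketch of canopy pre-clustering is given below. It assumes the common two-threshold formulation (a loose threshold `t1` and a tight threshold `t2`, with `t1 > t2`) and Euclidean distance as the cheap distance measure; the function name `canopies` is hypothetical, and the sketch picks the first remaining point as each canopy center instead of a random one so that the output is repeatable.

```python
from math import dist


def canopies(points, t1, t2):
    """Canopy pre-clustering with loose threshold t1 and tight threshold t2.

    Returns a list of (center_index, member_indices) pairs. Canopies may
    overlap: a point between t2 and t1 of a center stays available.
    """
    if not 0 < t2 < t1:
        raise ValueError("expected 0 < t2 < t1")
    remaining = list(range(len(points)))
    result = []
    while remaining:
        center = remaining[0]  # simplification: first point, not a random one
        # Every remaining point within the loose threshold joins the canopy.
        members = [i for i in remaining
                   if dist(points[center], points[i]) < t1]
        result.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining
                     if dist(points[center], points[i]) >= t2]
    return result


points = [(0,), (1,), (3,), (10,), (11,)]
print(canopies(points, t1=4, t2=2))
```

Each canopy can then be handed to a slower method such as k-means, which only needs to compare points that share a canopy.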
This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrarily rotated ('correlated') subspace clusters that can be modeled by giving a correlation of their attributes. Examples of such clustering algorithms are CLIQUE and SUBCLU. Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adapted to subspace clustering (HiSC, hierarchical subspace clustering and DiSH) and correlation clustering (HiCO, hierarchical correlation clustering, 4C using 'correlation connectivity' and ERiC exploring hierarchical density-based correlation clusters).
One is Marina Meilă's variation of information metric; another provides hierarchical clustering. Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information. Message passing algorithms, a recent development in computer science and statistical physics, have also led to the creation of new types of clustering algorithms.
Evaluation (or 'validation') of clustering results is as difficult as the clustering itself. Popular approaches involve 'internal' evaluation, where the clustering is summarized to a single quality score, 'external' evaluation, where the clustering is compared to an existing 'ground truth' classification, 'manual' evaluation by a human expert, and 'indirect' evaluation by evaluating the utility of the clustering in its intended application. Internal evaluation measures suffer from the problem that they represent functions that themselves can be seen as a clustering objective.
One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications. Additionally, this evaluation is biased towards algorithms that use the same cluster model.
Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this does not imply that one algorithm produces more valid results than another. Validity as measured by such an index depends on the claim that this kind of structure exists in the data set.
An algorithm designed for one kind of model has no chance if the data set contains a radically different set of models, or if the evaluation measures a radically different criterion. For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters.
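To make the notion of an internal measure concrete, the sketch below computes the mean silhouette coefficient, chosen here only as one well-known example; the text above does not single out any particular index. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b). Euclidean distance is an assumption, and the function name `silhouette` is hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of p's own cluster
        # (p itself contributes distance 0, hence the len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(dist(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(silhouette(points, [0, 1, 0, 1]))  # clusters that cut across the gaps
```

Because the score rewards compact, well-separated groups, it shares the bias described above: it tends to favour algorithms whose cluster model matches that assumption.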
More than a dozen internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters. Such measures can be used to assess the quality of clustering algorithms based on an internal criterion.
In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks.
However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies. Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result. In the special scenario of constrained clustering, where meta information (such as class labels) is used already in the clustering process, the hold-out of information for evaluation purposes is non-trivial.
Instead of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster. As with internal evaluation, several external evaluation measures exist.
To measure cluster tendency is to measure to what degree clusters exist in the data to be clustered; this may be performed as an initial test, before attempting clustering.
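The pair-counting idea mentioned above can be illustrated with the Rand index, used here as one simple example of such a metric and not necessarily the one the original text went on to list. It is the fraction of point pairs on which the ground truth and the predicted clustering agree, either both placing the pair together or both placing it apart. The function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(truth, predicted):
    """Fraction of point pairs on which two labelings agree."""
    if len(truth) != len(predicted) or len(truth) < 2:
        raise ValueError("need two labelings of the same length >= 2")
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        # A pair counts as agreement when both labelings put it together
        # or both put it apart; the label values themselves do not matter.
        agree += same_truth == same_pred
        pairs += 1
    return agree / pairs


truth = [0, 0, 1, 1]
print(rand_index(truth, [1, 1, 0, 0]))  # same grouping, different label names
print(rand_index(truth, [0, 1, 0, 1]))  # a grouping that splits both classes
```

Comparing pairs instead of individual labels is what makes the score independent of how the clusters happen to be numbered.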
- On Wednesday, February 26, 2020
Keynote (Google I/O '18)
Learn about the latest product and platform innovations at Google in a Keynote led by Sundar Pichai. This video is also subtitled in Chinese, Indonesian, Italian, ...
Lecture 2 | Image Classification
Lecture 2 formalizes the problem of image classification. We discuss the inherent difficulties of image classification, and introduce data-driven approaches.
Interaction Theory [New Paradigm] for Solving the Traveling Salesman Problem (TSP)
Keywords: Graph Theory, Traveling Salesman Problem (TSP), Interaction ...
Building Brains to Understand the World's Data
Google Tech Talk February 12, 2013 (more info below) Presented by Jeff Hawkins. ABSTRACT The neocortex works on principles that are fundamentally ...
Break the Addiction to Negative Thoughts & Emotions to Create What You Want - Dr. Joe Dispenza
Learn to break the addiction to ..
Lecture 18: Tackling the Limits of Deep Learning for NLP
Lecture 18 looks at tackling the limits of deep learning for NLP followed by a few presentations.
NW-NLP 2018: Compositional Language Modeling for Icon-Based Augmentative & Alternative Communication
The fifth Pacific Northwest Regional Natural Language Processing Workshop will be held on Friday, April 27, 2018, in Redmond, WA. We accepted abstracts ...
Visual Understanding in Natural Language
Bridging visual and natural language understanding is a fundamental requirement for intelligent agents. This talk will focus mainly on automatic image ...
The Microsoft AI platform - GS07
Join Joseph Sirosh, Corporate Vice President of the Cloud AI Platform, as he dives deep into the latest additions to the Microsoft AI platform and capabilities.
Preventing Overfishing with Machine Learning and Big Data Analytics (Google Cloud Next '17)
Machine learning is becoming a powerful and important aspect of analytics workloads. Amy Unruh and David Kroodsma look at how the Global Fishing Watch ...