
Graph-based machine learning: Part I

Many important problems can be represented and studied using graphs: social networks, interacting bacteria, brain network modules, hierarchical image clustering, and many more.

If we accept graphs as a basic means of structuring and analyzing data about the world, we shouldn’t be surprised to see them widely used in machine learning as a powerful tool that can encode intuitive structural properties and power a lot of useful features.

See more in this recent blog post from Google Research. This post explores the tendency of nodes in a graph to spontaneously form clusters of internally dense linkage (hereafter termed “communities”).

This approach has a lot of advantages, since it typically requires only knowledge of first-degree neighbors, and small incremental merging steps bring the global solution toward successive equilibria.

The basic approach does indeed consist of iteratively merging nodes so as to optimize a local modularity gain, so let’s define that as well. This is part of the magic for me, as this local optimization function can easily be translated into an interpretable metric within the domain of your graph.
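The post refers to this local modularity criterion without reproducing the formula; a standard choice, used by the Louvain method and presumably the quantity meant here, is the gain in modularity from moving an isolated node $i$ into a community $C$:

$$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^{2}\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^{2} - \left(\frac{k_i}{2m}\right)^{2}\right]$$

where $\Sigma_{in}$ is the sum of the weights of the links inside $C$, $\Sigma_{tot}$ the sum of the weights of the links incident to nodes in $C$, $k_i$ the sum of the weights of the links incident to node $i$, $k_{i,in}$ the sum of the weights of the links from $i$ to nodes in $C$, and $m$ the total weight of all links in the graph.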

In other words, the link weights can be a function of the node types, computed on the fly (useful if you’re dealing with a multidimensional graph with various types of relationships and nodes).

The stopping criterion can be a lot of things: a maximum number of iterations, a minimum modularity gain during the transfer phase, or any other relevant signal from your data indicating that the algorithm should stop.
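As a concrete sketch of this pipeline (not the author’s own code; this uses NetworkX’s built-in Louvain implementation, where the `threshold` argument plays the role of the minimum-modularity-gain stopping criterion mentioned above):

```python
import networkx as nx

# A small standard test graph; in practice this would be your own
# (possibly weighted, multi-typed) graph.
G = nx.karate_club_graph()

# Louvain: repeatedly move nodes between communities while the modularity
# gain exceeds `threshold`, then merge each community into a super-node
# and repeat on the aggregated graph.
communities = nx.community.louvain_communities(G, threshold=1e-7, seed=42)

print(len(communities), "communities found")
print("Q =", round(nx.community.modularity(G, communities), 3))
```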

Modularity (networks)

Modularity is a measure of the structure of networks or graphs. It was designed to measure the strength of division of a network into modules (also called groups, clusters or communities).

Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.

However, it has been shown that modularity suffers from a resolution limit and is therefore unable to detect small communities.

For example, biological and social patterns, the World Wide Web, metabolic networks, food webs, neural networks and pathological networks are real-world systems that can be mathematically represented and topologically studied to reveal some unexpected structural features.[1] Most of these networks possess a certain community structure that has substantial importance in building an understanding of the dynamics of the network.

For instance, a closely connected social community implies a faster rate of transmission of information or rumors within it than in a loosely connected community.

Thus, if a network is represented by a number of individual nodes connected by links which signify a certain degree of interaction between the nodes, communities are defined as groups of densely interconnected nodes that are only sparsely connected with the rest of the network.

Hence, it may be imperative to identify the communities in networks, since communities may have quite different properties, such as node degree, clustering coefficient, and betweenness centrality,[2] from those of the average network.

Modularity is the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random.[3] It is positive if the number of edges within groups exceeds the number expected on the basis of chance.

For a given division of the network's vertices into some modules, modularity reflects the concentration of edges within modules compared with random distribution of links between all nodes regardless of modules.

There are different methods for calculating modularity.[1] In the most common version of the concept, the randomization of the edges is done so as to preserve the degree of each vertex.

(It is important to note that multiple edges may exist between two nodes, but here we consider the simplest case.)

Modularity Q is then defined as the fraction of edges that fall within group 1 or 2, minus the expected number of edges within groups 1 and 2 for a random graph with the same node degree distribution as the given network.
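In symbols, the standard definition (reconstructed here, since the extracted text omits the equation) for an undirected graph with adjacency matrix $A$, $m$ edges, and node degrees $k_v$ reads

$$Q = \frac{1}{2m} \sum_{vw} \left[A_{vw} - \frac{k_v k_w}{2m}\right] \delta(c_v, c_w),$$

where $c_v$ is the community of node $v$ and $\delta(c_v, c_w)$ is 1 when $v$ and $w$ belong to the same community and 0 otherwise.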

The expected number of edges is computed using the concept of configuration models.[4] The configuration model is a randomized realization of a particular network.

The configuration model cuts each edge into two halves; each half-edge, called a stub, is then rewired randomly with any other stub in the network, even allowing self-loops.

Thus, even though the node degree distribution of the graph remains intact, the configuration model results in a completely random network.
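NetworkX ships this stub-rewiring construction as `nx.configuration_model`; a quick hedged illustration (not from the original article):

```python
import networkx as nx

# Take the degree sequence of a real graph...
G = nx.karate_club_graph()
degree_sequence = [d for _, d in G.degree()]

# ...and pair up stubs uniformly at random, keeping every node's degree
# but otherwise randomizing the edges (self-loops and multi-edges allowed).
R = nx.configuration_model(degree_sequence, seed=42)

print("edges:", G.number_of_edges(), "->", R.number_of_edges())  # totals match
print("self-loops in the randomized graph:", nx.number_of_selfloops(R))
```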

Consider two nodes $v$ and $w$, with node degrees $k_v$ and $k_w$ respectively, and rewire the stubs for these two nodes. The total number of rewirings possible is equal to the number of stubs remaining after choosing a particular stub, $2m - 1 \approx 2m$ for large $m$, so the expected number of edges between $v$ and $w$ is $\frac{k_v k_w}{2m}$.

Hierarchical partitioning (i.e., partitioning into two communities, then further partitioning each sub-community into two smaller sub-communities so as to maximize Q) is a possible approach to identifying multiple communities in a network.

Additionally, (3) can be generalized for partitioning a network into c communities:[5]

$$Q = \sum_{i=1}^{c} \left(e_{ii} - a_i^2\right)$$

where $e_{ij}$ is the fraction of edges with one end vertex in community $i$ and the other in community $j$, and $a_i = \sum_j e_{ij}$ is the fraction of ends of edges that are attached to vertices in community $i$.
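As a worked illustration of the $e_{ii}$ / $a_i$ bookkeeping, here is a short sketch on a hypothetical graph, two 4-node cliques joined by a single bridge edge (this example is invented for illustration; it is not the worked example from the original source):

```python
import networkx as nx
from itertools import combinations

# Hypothetical graph: two 4-node cliques joined by one bridge edge.
G = nx.Graph()
G.add_edges_from(combinations([0, 1, 2, 3], 2))  # clique = community 0
G.add_edges_from(combinations([4, 5, 6, 7], 2))  # clique = community 1
G.add_edge(3, 4)                                 # bridge between the cliques

communities = [{0, 1, 2, 3}, {4, 5, 6, 7}]
m = G.number_of_edges()

Q = 0.0
for c in communities:
    # e_ii: fraction of edges with both endpoints inside community i.
    e_ii = sum(1 for u, v in G.edges() if u in c and v in c) / m
    # a_i: fraction of edge ends attached to vertices in community i.
    a_i = sum(d for _, d in G.degree(c)) / (2 * m)
    Q += e_ii - a_i ** 2

print("Q =", round(Q, 4))
# Cross-check against NetworkX's built-in implementation:
print("Q (networkx) =", round(nx.community.modularity(G, communities), 4))
```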

An alternative formulation of the modularity, useful particularly in spectral optimization algorithms, is as follows.[1] Define $S_{vr}$ to be 1 if vertex $v$ belongs to group $r$ and zero otherwise. Then

$$Q = \frac{1}{2m} \sum_{vw} \sum_{r} \left[A_{vw} - \frac{k_v k_w}{2m}\right] S_{vr} S_{wr}$$

and hence

$$Q = \frac{1}{2m} \operatorname{Tr}\!\left(S^{\mathsf{T}} B S\right),$$

where $S$ is the (non-square) matrix having elements $S_{vr}$ and $B$ is the so-called modularity matrix, which has elements

$$B_{vw} = A_{vw} - \frac{k_v k_w}{2m}.$$

All rows and columns of the modularity matrix sum to zero, which means that the modularity of an undivided network is also always zero.

For networks divided into just two communities, one can alternatively define $s_v = \pm 1$ to indicate the community to which node $v$ belongs, which then leads to

$$Q = \frac{1}{4m} s^{\mathsf{T}} B s,$$

where $s$ is the column vector with elements $s_v$.[1] This function has the same form as the Hamiltonian of an Ising spin glass, a connection that has been exploited to create simple computer algorithms, for instance using simulated annealing, to maximize the modularity.
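The $s_v = \pm 1$ form suggests a simple spectral heuristic (Newman’s leading-eigenvector method, sketched here with numpy under the assumption of an unweighted, undirected graph): form $B$, split nodes by the sign of its leading eigenvector, and evaluate $Q = \frac{1}{4m} s^{\mathsf{T}} B s$.

```python
import networkx as nx
import numpy as np

# Any small undirected graph works; the karate club is a standard test case.
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)   # adjacency matrix
k = A.sum(axis=1)          # node degrees
m = A.sum() / 2.0          # number of edges

# Modularity matrix: B_vw = A_vw - k_v * k_w / (2m); rows/columns sum to zero.
B = A - np.outer(k, k) / (2.0 * m)

# eigh returns eigenvalues in ascending order, so the last column is leading.
_, eigvecs = np.linalg.eigh(B)
s = np.where(eigvecs[:, -1] >= 0, 1.0, -1.0)

# Two-community modularity from the spin-like formulation.
Q = float(s @ B @ s) / (4.0 * m)
print("Q =", round(Q, 3))
```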

The general form of the modularity for arbitrary numbers of communities is equivalent to a Potts spin glass, and similar algorithms can be developed for this case also.[6]

Modularity compares the number of edges inside a cluster with the expected number of edges that one would find in the cluster if the network were a random network with the same number of nodes and where each node keeps its degree, but edges are otherwise randomly attached.

This assumption is, however, unreasonable if the network is very large, as the horizon of a node includes only a small part of the network, ignoring most of it.

Moreover, this implies that the expected number of edges between two groups of nodes decreases if the size of the network increases.

So, if a network is large enough, the expected number of edges between two groups of nodes in modularity's null model may be smaller than one.

If this happens, a single edge between the two clusters would be interpreted by modularity as a sign of a strong correlation between the two clusters, and optimizing modularity would lead to the merging of the two clusters, independently of the clusters' features.

So, even weakly interconnected complete graphs, which have the highest possible density of internal edges, and represent the best identifiable communities, would be merged by modularity optimization if the network were sufficiently large.[7] For this reason, optimizing modularity in large networks would fail to resolve small communities, even when they are well defined.

This bias is inevitable for methods like modularity optimization, which rely on a global null model.[8] There are two main approaches that try to solve the resolution limit within the modularity context: the addition of a resistance r to every node, in the form of a self-loop, which increases (r>0) or decreases (r<0) the aversion of nodes to form communities;[9] or the addition of a parameter γ>0 in front of the null-case term in the definition of modularity, which controls the relative importance between internal links of the communities and the null model.[6] By optimizing modularity for values of these parameters in their respective appropriate ranges, it is possible to recover the whole mesoscale of the network, from the macroscale in which all nodes belong to the same community to the microscale in which every node forms its own community, hence the name multiresolution methods.
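In NetworkX, this γ appears as the `resolution` argument of the modularity-based routines; a brief, purely illustrative sweep (the parameter values below are made up):

```python
import networkx as nx

G = nx.karate_club_graph()

# Larger resolution (gamma) favors many small communities; smaller values
# favor a few large ones, recovering coarser or finer mesoscales.
for gamma in (0.2, 1.0, 5.0):
    parts = nx.community.louvain_communities(G, resolution=gamma, seed=7)
    print(f"gamma={gamma}: {len(parts)} communities")
```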

Graph database

In computing, a graph database (GDB[1]) is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data.

Graph databases are similar to 1970s network model databases in that both represent general graphs, but network-model databases operate at a lower level of abstraction[2] and lack easy traversal over a chain of edges.[3] The underlying storage mechanism of graph databases can vary.

Some depend on a relational engine and “store” the graph data in a table (although a table is a logical element, therefore this approach imposes another level of abstraction between the graph database, the graph database management system and the physical devices where the data is actually stored).

Searching for all records matching a given value (for example, every user with a particular phone area code) can be a time-consuming process in large tables, so relational databases offer the concept of a database index, which allows data like this to be stored in a smaller subtable, containing only the selected data and a unique key (or primary key) of the record it is part of.

If the phone numbers are indexed, the same search would occur in the smaller index table, gathering the keys of matching records, and then looking in the main data table for the records with those keys.

In order to link users and their email addresses, the system first looks up the selected user records' primary keys, looks for those keys in the userpk column in the email table (or, more likely, an index of them), extracts the email data, and then links the user and email records to make composite records containing all the selected data.

Depending on the complexity of the query, the number of joins, and the indexing of the various keys, the system may have to search through multiple tables and indexes, gather lots of information, and then sort it all to match it together.[4] In contrast, graph databases directly store the relationships between records.

For example, if one searches for all of the email addresses for users in area code '311', the engine would first perform a conventional search to find the users in '311', but then retrieve the email addresses by following the links found in those records.
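In code, that access pattern looks roughly like this hypothetical in-memory sketch (not any particular database’s API): once the user records are found, their email addresses are reached by following stored links rather than by re-searching a table:

```python
# Hypothetical records; a real graph database persists these links on disk.
users = [
    {"name": "Ada",   "area_code": "311", "emails": ["ada@example.com"]},
    {"name": "Brent", "area_code": "415", "emails": ["brent@example.com"]},
    {"name": "Cleo",  "area_code": "311", "emails": ["cleo@example.com",
                                                     "cleo@work.example"]},
]

# Step 1: conventional search for users in area code '311'.
matches = [u for u in users if u["area_code"] == "311"]

# Step 2: follow the links stored in each record; no join is needed.
for user in matches:
    for email in user["emails"]:
        print(user["name"], email)
```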

In this case a relational database has to first look for all the users with an area code in '311', then look in the subscribers table for any of those users, and then finally look in the users table to retrieve the matching users.

As another example, consider finding all the actors who have appeared in submarine movies. In a relational database this will require several separate searches through the movies and actors tables: doing a search for the submarine movies, finding all the actors in those movies, and then comparing the (large) collected results.

These sorts of labels may improve search performance under certain circumstances, but are generally more useful in providing added semantic data for end users.[5] Relational databases are very well suited to flat data layouts, where relationships between data are only one or two levels deep.

Graph databases, by contrast, can scale more naturally[citation needed] to large data sets as they do not typically need costly join operations (here "costly" means when executed on databases with non-optimal designs at the logical and physical levels).

In the pre-history of graph databases, in the mid-1960s, navigational databases such as IBM's IMS supported tree-like structures in their hierarchical model, but the strict tree structure could be circumvented with virtual records.[6][7] Graph structures could be represented in network model databases from the late 1960s.

In the 2010s, SAP HANA brought in-memory and columnar technologies to graph databases.[9] Also in the 2010s, multi-model databases that supported graph models (and other models such as relational or document-oriented databases) became available, such as OrientDB, ArangoDB, and MarkLogic (starting with its 7.0 version).

Graph Theory - An Introduction!


A Local Algorithm for Structure-Preserving Graph Cut

Dawei Zhou (Arizona State University), Si Zhang (Arizona State University), Mehmet Yigit Yildirim (Arizona ...

Finding the Maximum Flow and Minimum Cut within a Network


Local Higher Order Graph Clustering

Author: Hao Yin, Institute for Computational and Mathematical Engineering, Stanford University. Abstract: Local graph clustering methods aim to find a cluster of ...

2. Optimization Problems

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016. Instructor: John Guttag.

Graph approximation and local clustering

We discuss several fascinating concepts and algorithms in graph theory that arose in the design of a nearly-linear time algorithm for solving diagonally-dominant ...

3. Graph-theoretic Models

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016. Instructor: Eric Grimson.

Asymmetric Transitivity Preserving Graph Embedding

Author: Ziwei Zhang, Tsinghua University. Abstract: Graph embedding algorithms embed a graph into a vector space where the structure and the inherent ...

Learning from Labeled and Unlabeled Vertices in Networks

Wei Ye (University of Munich), Linfei Zhou (University of Munich), Dominik Mautz (University of ...

Data Visualization & Interactive Data Exploration with KNIME

This video shows some options for data visualization and interactive data exploration, both within KNIME Analytics Platform and from a web browser through the ...