an annotated bibliography on stochastic blockmodels
Jul 26, 2019

Summary

I’ve been reading a lot of papers on network analysis recently. I thought I’d write down some takeaways and point out papers that I’ve found helpful. This collection of papers is centered around the stochastic blockmodel, and is intended to be introductory rather than comprehensive. I’ve included a few papers with other miscellaneous tidbits of interest.

Models

The most basic random graph model is the Erdos-Renyi graph, which assumes that you have a fixed number of nodes, and edges appear independently with probability $p$. The Erdos-Renyi model has been studied to death and there are tons of interesting papers you can read about it. In particular, the graph structure goes through a number of phrase transitions as $p$ varies. I’m pretty sure there’s a nice paper on this by Fan Chung but I can’t find the reference at the moment.

However, the Erdos-Renyi graph is pretty unrealistic and doesn’t look anything like real world graphs. The stochastic blockmodel (SBM) represents a first pass at a more plausible real world graph. In the SBM, we assume that each node belongs to one of $k$ communities. We then assume connections occur independently with probabilities that depend only on the communities of the nodes.

One major issue with SBMs is that they assume a uniform degree distribution, but real world graphs have highly skewed degree distributions. Karrer and Newman (2011) propose the degree-corrected SBM as a remedy to this problem. The paper additionally contains an interesting information-theoretic interpretation of modularity scores.

It also presents one standard (but unappealing) approach for fitting SBMs: (1) assign every node a community at random, (2) swap community labels at random, keeping the community assignments that result in the highest likelihood, and (3) do this a bunch of times because it’s random and you don’t want to get stuck in a local maximum. This is heuristic way to maximize the likelihood, which we should note is a hard problem here. Likelihood maximization requires optimizing over an $n \times k$ binary matrix, where $n$ is the number of nodes and $k$ is the number of communities. This is very non-convex.

Another key extension of the SBM comes in the form of the mixed membership SBM of Airoldi et al. (2008). Instead of assigning each node to a single cluster, the mixed membership SBM allows soft community assignments, where each node can belong to different communities, with total memberships summing up to one. The paper makes an argument for this model based on heterogeneous degrees (it came out before the degree corrected SBM paper) that I didn’t find terribly convincing (to be fair, I also only skimmed it), but it’s easy to add degree correction to the mixed membership SBM, and I would consider the degree corrected mixed membership SBM as a good starting point in many circumstances. The paper also discusses selecting the number of communities $k$ via cross-validation and BIC, and introduces a variational Bayes estimator.

Personally I am not a huge fan of variational Bayes, so I didn’t read the details about the estimation procedure. In practice, if you use graph-tool, one of the standard Python libraries to fit this sort of thing, I’m fairly sure you get variational Bayes estimates. Finally, the high resolution graphics in the PDF version of this paper caused my printer to shit a brick.

Another interesting extension of the SBM (that can also be degree corrected!) is the overlapping SBM of Latouche, Birmelé, and Ambroise (2011). The overlapping SBM respects multiple community memberships better than the mixed SBM, but the finer details are somewhat lost on me. The paper has an interesting identifiability proof that went way over my mind but that I’d refer back to in the future if I ever wanted to do something similar. The paper also proposes a variational estimator, and then it compares the performance of the overlapping SBM with the mixed SBM on a network of French political blogs. It turns out the overlapping SBM does a little better than the mixed SBM, but they seem to be generally have comparable performance. The data analysis does a great job highlighting that sometimes you just get a trash cluster of nodes that don’t belong in any of the other, more meaningful clusters.

One slightly weird thing about the SBM is that the model is typically parameterized as having Poisson edges (i.e. elements of the network adjacency matrix $A_{ij}$ are assumed to come from a Poisson($\lambda_{ij}$) distribution). The naive way to generate a sample from a SBM is to loop through every element of the adjacency matrix and sample a Poisson random variable for that element. This is $O(n^2)$, which is ridiculous because most graphs are sparse. Rohe et al. (2017) presents an $O(m)$ sampling scheme for random dot product graphs (a generalization of SBMs), where $m$ is the number of edges in the graph. I’m currently working on cleaning up the original code for this into an R package.

Reading Karl’s paper helped me connect my “edge probabilities depend on communities” understanding of SBMs to the matrix-notation Poisson formulation of SBMs. The gist is that the Poisson distribution is completely determined by $\lambda$, which is also the mean of a Poisson distribution, so writing down the expected value of an adjacency matrix is enough to completely determine an SBM. This matrix also has some nice structure and can be expressed as $\mathbb{E} (A) = X B X^T$ in the undirected case, where $X$ is binary, $n \times k$ and encodes community membership, and $B$ is $k \times k$ and encodes probabilities of connections between communities.

I’m fairly sure all of the papers so far mostly deal with undirected graphs (at least in the main exposition), and it’s a bit harder to think about a directed variant of the SBM. Now you have to imagine communities based on incoming edges, and communities based on outgoing edges. Rohe, Qin, and Yu (2016) present a directional version of the SBM and fit it using regularized spectral clustering. They also introduce an asymmetry score for nodes that can detect when incoming edges typically come from a different community than outgoing edges typically head to.

Regularized spectral clustering is definitely something you want to read about at some point; Qin and Rohe (2013) is a good reference. One thing that I’ve been struggling with is the connection between the eigendecomposition of a graph and graph properties that actually make sense to me (things like betweenness, degree, etc). A key connection here comes in the form of Cheeger’s inequality, which connects eigenvalues of the graph Laplacian (more on this in a moment) with graph conductance.

Some other interesting extensions to the SBM come in the form of the bipartite SBM (Larremore, Clauset, and Jacobs (2014)) and the high dimensional SBM (Rohe, Qin, and Fan (2012)). Both of these models are SBMs with structural parameters set to zero. The bipartite SBM seems like a good idea; Karl has told me the estimator proposed in Rohe, Qin, and Fan (2012) doesn’t really work.

Other network models

Gerlach, Peixoto, and Altmann (2018) make interesting connections between SBMs and Latent Dirichlet Allocation. Another part of the network analysis universe deals with Exponential Random Graph Models (ERGMs). Personally I find ERGM parameters totally uninterpretable and the software difficult to use so I haven’t dug into them much and don’t really plan to.

Estimators for SBMs

While I feel like I’ve finally wrapped by head around the various SBM generative models, I am much less certain how to fit the bloody things. Ghasemian, Hosseinmardi, and Clauset (2018) compares a large number of methods for community detection, the vast majority of which are estimators for SBMs. The takeaway seems to be that degree corrected mixed SBMs are the way to go (if you want to do inference, at least). I’m sure there are a bunch of wacky neural nets that do better on various prediction tasks, but I haven’t found those papers yet. Well, more accurately, I have found a ton of papers but I have no idea which ones are good and I don’t care enough about neural nets to read them closely and find out.

Anyway, a rough overview of fitting SBMs. There are a lot of ways to do it, and none of them is especially appealing. The naive approach is to get MLE estimates. The papers that I’ve seen do this use a “swap labels randomly and keep the highest likelihood” approach. I find this unsatisfying, and also confusing. I don’t get why people do this instead of throwing the problem into an industrial grade integer program solver like Gurobi1. As a side note, I assume it’s much easier to optimize the likelihood of the mixed SBM, since group memberships live on simplices, which are convex, rather than lattices, which are not.

Then there’s a wide variety of Bayesian estimation approaches, many of which seem to be variational Bayes. Sure, I guess. I’m happy to use these if someone else implements them, and am happy to pick $k$ based on BIC from these models, but I’m trying to avoid learning anything about variational methods because then people might actually ask me to use them at some point. Also Bayes is slow.

Spectral methods form another very important category of estimators. Spectral here just means “based on an eigendecomposition or an SVD”. Undirected graphs have symmetric adjacency matrices, so you can do an eigendecomposition. Directed graphs might not be positive definite, so you have to do an SVD instead. Oftentimes spectral estimates are used to initialize variational fitting methods.

One important thing to note with spectral methods is that you don’t really want to use the raw adjacency matrix. The eigenvectors of the adjacency matrix localize on high degree nodes (Martin, Zhang, and Newman (2014)). So what people typically do is work with the degree normalized adjacency matrix, which is often called the graph Laplacian2. Using the graph Laplacian (often called $\mathcal L$) like this instead of the adjacency matrix $A$ is sort of like moving from a basic SBM to a degree-corrected SBM. It turns out that the graph Laplacian has problems too, though. The eigenvectors of the graph Laplacian localize on low degree nodes (Zhang and Rohe (2018)). The solution to this is to use a regularized graph Laplacian. Fingers crossed I’ll post a short preprint containing some folk wisdom on regularization soonish.

Anyway, it turns out that running k-means on the eigenvectors of the regularized graph Laplacian is a decent way to estimate the node cluster memberships (Qin and Rohe (2013), Rohe, Chatterjee, and Yu (2011)). For these spectral methods, one way to pick the number of communities is to look at the eigenvalues and use the elbow method. This is very heuristic but it works decently well. Wang and Rohe (2016) point out that getting the number of communities right isn’t terribly important, in a cleverly titled paper Don’t mind the (eigen) gap. You definitely don’t want to wildly overestimate $k$, but if you cluster using $k = 5$ versus $k = 10$, the resulting clusters typically tell a coherent story and don’t contradict each other in terms of substantive conclusions. In spectral land, and eigenvalue of zero corresponds to “this definitely isn’t a cluster”, so as long as you aren’t pushing $k$ up so high that $\lambda_k \approx 0$, you’re probably fine. The intuition for this comes from the $\mathbb E (A) = X B X^T$ parameterization of the blockmodel.

Apparently there’s also some good minimum description length (MDL) approaches to selecting the number of communities, but I haven’t actually done any reading on this.

Some other spectral papers to know about include Chaudhuri, Chung, and Tsiatas (2012), which explores a couple different forms of regularization, and Chung, Lu, and Vu (2003), which discusses the distribution of eigenvalues of the adjacency matrix and the graph Laplacian.

A third category of estimators is based on graph thingajigs like betweenness. For example, the Girvan-Newman algorithm (Newman and Girvan (2004)) finds communities in graphs and I vaguely recall that it’s a consistent estimator for community assignments under some variant of the SBM3. I could be wrong. Anyway, you typically run the algorithm for a bunch of different $k$ and then select $k$ based on a modularity score. The algorithm and modularity score are both based on computer science-y graph concepts, but it’s equivalent to MLE estimation sometimes (Newman (2016)). Anyway, Newman’s papers are generally great and you should just go read all of them4.

The triangle trick

Another paper that I find particularly interesting is Rohe and Qin (2013), both as a technical paper and as an insight into Karl’s psyche as he tries to start a religious cult based on network analysis. The gist of the paper is that there are lots of uninformative edges in sparse and transitive networks, and you can throw these edges out, keeping only the edges that participate in triangles.

You can do this really fast with some matrix algebra. Let $A$ be an adjacency matrix, and let $A \odot B$ be the elementwise product of $A$ and $B$. Then $T = AA \odot A$ is the matrix of triangles. That is, $T_{ij}$ is the number of triangles that both node $i$ and node $j$ participate in. You can then use $T$ just like you would normally use $A$, and doing this cleans clusters up really nicely. You can do this with the (regularized) graph Laplacian as well. Benson, Gleich, and Leskovec (2016) is a followup that generalizes from the shared triangles matrix $T$ to the general case of shared motifs, which means “patterns that both node $i$ and node $j$ participate in.”

Briefly, contagion

Most of the reading I’ve been doing has been about SBMs, but there are some other papers that I highly recommend you read. The first is Shalizi and Thomas (2011), which points out that social contagion is generally confounded in graphs where people are connected to people like themselves. The fancy phrase for this is homophily, and it happens in pretty much all interesting graphs. This is an important observation because there’s a huge literature on contagion models, some of which seems to aware of this result, and some of which doesn’t. One way to avoid homophily-contagion confounding is to run an experiment. Taylor and Eckles (2017) discusses how you might want to do this. In generally I think it’s worth reading some of the more sociological network papers with a healthy dose of skepticism.

The end

That’s it, that’s all I’ve got. Most of the papers I’ve linked to have great references, so if you get bored or this isn’t enough, you can always go down that rabbit hole.

I don’t have a good handle on what kinds of neural nets people are using for graph stuff thesedays, nor do I have a good handle on modern graph embeddings. If you read about and/or use that kind of stuff, please leave paper recommendations in the comments!

1. I say this as someone who likes the idea of letting Gurobi solve all my problems for me, without ever having actually used Gurobi.

2. There are a whole bunch of different but related things that get called the graph Laplacian. If you read a paper that mentions the graph Laplacian, be sure to read the definition.

3. Luay Nakhleh, legendary Rice professor who teaches COMP 182, makes freshmen implement Girvan-Newman community detection in one of his homework assignments.

4. I also really like papers by Dan Larremore, Elizaveta Levina, Aaron Clauset and Bin Yu.