Datasets

This page contains links to several datasets which I have used in my research, along with code for processing the data.

Page Index

Graph Datasets

Hypergraph Datasets

Image Datasets

Other


Wikipedia Network

The Wikipedia network dataset provides a large graph of the hyperlinks between Wikipedia articles. It is compiled and released as part of the Stanford Large Network Dataset Collection.
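Graphs in the Stanford collection are typically distributed as plain-text edge lists: header lines beginning with `#`, followed by one whitespace-separated pair of node IDs per line. The following is a minimal sketch of reading such a file into a directed adjacency map. The function name `read_edge_list` and the sample lines are illustrative assumptions, not part of the dataset or of any released code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_edge_list(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Build a directed adjacency map from edge-list lines.

    Assumes each data line holds two integer node IDs (source, target)
    and that lines starting with '#' are comments.
    """
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines in the file header.
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        adjacency[int(source)].add(int(target))
        # Make sure nodes with no outgoing links still appear as keys.
        adjacency.setdefault(int(target), set())
    return dict(adjacency)


# Made-up sample in the same layout as an edge-list file.
sample = [
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "2\t0",
]
graph = read_edge_list(sample)
```

For a downloaded file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.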

This dataset was used in the STAG technical report.

  • Peter Macgregor and He Sun. Spectral toolkit of algorithms for graphs: Technical report (1), 2023. arXiv:2304.03170.

DBLP Co-authorship Data

The DBLP computer science bibliography is a useful source of co-authorship network data.

This dataset was used in the following paper.

  • Peter Macgregor and He Sun. Finding bipartite components in hypergraphs. In 34th Advances in Neural Information Processing Systems (NeurIPS’21), pages 7912–7923, 2021.

Berkeley Segmentation Dataset (BSDS)

A dataset of images with ground-truth segmentations.

This dataset was used in the following papers.

  • Peter Macgregor and He Sun. Fast Approximation of Similarity Graphs with Kernel Density Estimation, and beyond. In 36th Neural Information Processing Systems (NeurIPS’23), 2023.
  • Peter Macgregor and He Sun. A tighter analysis of spectral clustering, and beyond. In 39th International Conference on Machine Learning (ICML’22), pages 14717–14742, 2022.

Militarized Interstate Disputes

A dataset containing all militarized disputes between states from 1816 to 2014.

  • Dataset homepage from the Correlates of War project.
  • Code for constructing a graph from the dataset.

This dataset was used in the following paper.

  • Peter Macgregor and He Sun. Local algorithms for finding densely connected clusters. In 38th International Conference on Machine Learning (ICML’21), pages 7268–7278, 2021.

US Migration Dataset

The 2000 US census included data about migration within the United States. This data can be viewed as a weighted, directed graph and used to evaluate directed graph clustering.
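As a rough illustration of the "weighted, directed graph" view, the sketch below aggregates migration records into edge weights, where the weight of the edge (u, v) is the number of people who moved from u to v. The record layout `(origin, destination, count)`, the function names, and the numbers are all hypothetical. They are not taken from the census files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (origin, destination, number of migrants).
Flow = Tuple[str, str, float]
Graph = Dict[str, Dict[str, float]]


def build_migration_graph(flows: Iterable[Flow]) -> Graph:
    """Sum the flows for each ordered pair into a directed edge weight."""
    weights: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for origin, destination, count in flows:
        weights[origin][destination] += count
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {u: dict(neighbours) for u, neighbours in weights.items()}


def net_flow(graph: Graph, u: str, v: str) -> float:
    """Weight of (u, v) minus weight of (v, u); missing edges count as zero."""
    return graph.get(u, {}).get(v, 0.0) - graph.get(v, {}).get(u, 0.0)


# Made-up numbers, for illustration only.
flows = [("A", "B", 120.0), ("B", "A", 45.0), ("A", "C", 30.0), ("A", "B", 5.0)]
graph = build_migration_graph(flows)
imbalance = net_flow(graph, "A", "B")
```

The direction of each edge carries information here: the weights of (u, v) and (v, u) generally differ, and it is this asymmetry that a directed clustering method can exploit.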

This dataset was used in the following paper.

  • Peter Macgregor and He Sun. Local algorithms for finding densely connected clusters. In 38th International Conference on Machine Learning (ICML’21), pages 7268–7278, 2021.

MNIST and USPS Datasets

Well-known datasets of hand-written digits.

These datasets were used in the following paper.

  • Peter Macgregor and He Sun. A tighter analysis of spectral clustering, and beyond. In 39th International Conference on Machine Learning (ICML’22), pages 14717–14742, 2022.

Stochastic Block Models

It is common to evaluate graph clustering algorithms on synthetic graphs drawn from the stochastic block model (SBM). The STAG library makes it easy to generate graphs from the stochastic block model in both C++ and Python.
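For readers unfamiliar with the model, here is a minimal standard-library sketch of sampling from a symmetric stochastic block model: vertices are split into equal-sized clusters, and each pair of vertices is joined independently with probability p if they share a cluster and q otherwise. This does not use STAG and does not reflect its API. The function name and parameter names are assumptions made for this illustration, and the quadratic loop is only suitable for small graphs.

```python
import random
from typing import List, Set, Tuple


def sample_sbm(
    n_per_cluster: int, k: int, p: float, q: float, seed: int = 0
) -> Tuple[List[int], Set[Tuple[int, int]]]:
    """Sample an undirected graph from a symmetric stochastic block model.

    Returns the cluster label of every vertex and the set of edges (u, v)
    with u < v.
    """
    rng = random.Random(seed)
    total = n_per_cluster * k
    # Vertices 0..n_per_cluster-1 form cluster 0, the next block cluster 1, etc.
    labels = [v // n_per_cluster for v in range(total)]
    edges: Set[Tuple[int, int]] = set()
    for u in range(total):
        for v in range(u + 1, total):
            # Same-cluster pairs use p, cross-cluster pairs use q.
            prob = p if labels[u] == labels[v] else q
            if rng.random() < prob:
                edges.add((u, v))
    return labels, edges


labels, edges = sample_sbm(n_per_cluster=20, k=3, p=0.5, q=0.05, seed=1)
inside = sum(1 for u, v in edges if labels[u] == labels[v])
across = len(edges) - inside
```

When p is much larger than q, most sampled edges fall inside clusters, so the planted labels give a ground truth against which the output of a clustering algorithm can be scored.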


Other Dataset Collections

Several others maintain lists of datasets. For example, I have found the following pages useful.


© 2023 Peter Macgregor