Datasets
This page contains links to several datasets which I have used in my research,
along with code for processing the data.
Page Index
Graph Datasets
Hypergraph Datasets
Image Datasets
Other
Wikipedia Network
The Wikipedia network dataset provides a large graph of the hyperlinks
between Wikipedia articles.
It is compiled and released as part of the
Stanford Large Network Dataset Collection.
This dataset was used in the STAG technical report.
- Peter Macgregor and He Sun. Spectral toolkit of algorithms for graphs:
Technical report (1), 2023. arXiv:2304.03170.
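Hyperlink graphs of this kind are often distributed as plain-text edge lists. The following is a minimal sketch of reading such a file into an adjacency structure. It assumes one directed edge per line, written as two whitespace-separated integer node IDs, with comment lines starting with `#`; the exact format of the Wikipedia dataset file should be checked before using it. The function name `read_edge_list` and the sample data are made up for illustration.

```python
from collections import defaultdict
import io


def read_edge_list(lines):
    """Parse a whitespace-separated directed edge list, skipping '#' comments."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        source, target = line.split()[:2]
        adjacency[int(source)].add(int(target))
    return adjacency


# A tiny in-memory example standing in for the real file (assumed format).
sample = io.StringIO("# FromNodeId\tToNodeId\n0\t1\n0\t2\n2\t1\n")
graph = read_edge_list(sample)
print(sorted(graph[0]))
```

For a real file, the same function can be given an open file handle, since it only iterates over lines.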
DBLP Co-authorship Data
The DBLP computer science bibliography data is a useful source of network data.
This dataset was used in the following paper.
- Peter Macgregor and He Sun. Finding bipartite components in hypergraphs. In
34th Advances in Neural Information Processing Systems (NeurIPS’21), pages
7912–7923, 2021.
Berkeley Segmentation Dataset (BSDS)
A dataset of images with ground-truth segmentations.
This dataset was used in the following paper.
- Peter Macgregor and He Sun. A tighter analysis of spectral clustering, and
beyond. In 39th International Conference on Machine Learning (ICML’22),
pages 14717–14742, 2022.
Militarized Interstate Disputes
A dataset containing all military disputes between 1816 and 2014.
- Dataset homepage from the Correlates of War project.
- Code for constructing a graph from the dataset.
This dataset was used in the following paper.
- Peter Macgregor and He Sun. Local algorithms for finding densely connected clusters.
In 38th International Conference on Machine Learning (ICML’21), pages 7268–7278, 2021.
US Migration Dataset
The 2000 US census included data about migration within the United States.
This data can be viewed as a weighted, directed graph and used to evaluate directed graph clustering.
This dataset was used in the following paper.
- Peter Macgregor and He Sun. Local algorithms for finding densely connected clusters.
In 38th International Conference on Machine Learning (ICML’21), pages 7268–7278, 2021.
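To illustrate how migration data can be viewed as a weighted, directed graph, here is a minimal sketch. It assumes the data has already been reduced to records of the form (origin, destination, number of people); this record format, the function name `build_migration_graph`, and the numbers are hypothetical and are not taken from the census files.

```python
from collections import defaultdict


def build_migration_graph(flows):
    """Return weights[u][v] = total number of people moving from u to v."""
    weights = defaultdict(lambda: defaultdict(int))
    for origin, destination, count in flows:
        if origin == destination:
            continue  # ignore moves within the same region
        weights[origin][destination] += count
    return weights


# Made-up numbers for illustration only; these are not census figures.
flows = [("A", "B", 120), ("B", "A", 30), ("A", "C", 45), ("A", "B", 5)]
graph = build_migration_graph(flows)
print(graph["A"]["B"], graph["B"]["A"])
```

The weight from A to B differs from the weight from B to A, and this asymmetry is the information a directed graph clustering method can exploit.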
MNIST and USPS Datasets
Well-known datasets of hand-written digits.
These datasets were used in the following paper.
- Peter Macgregor and He Sun. A tighter analysis of spectral clustering, and
beyond. In 39th International Conference on Machine Learning (ICML’22),
pages 14717–14742, 2022.
Stochastic Block Models
It is common to evaluate graph clustering algorithms on synthetic datasets from the stochastic block model.
The STAG library makes it easy to generate graphs from the stochastic block model, in C++ and Python.
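For readers unfamiliar with the model, the following is a minimal plain-Python sketch of sampling from a symmetric stochastic block model. It does not use the STAG API. It assumes `n` vertices split into `k` clusters of equal size (when `k` divides `n`), with each pair of vertices joined independently with probability `p` inside a cluster and `q` between clusters. The function name `sample_sbm` and its parameters are chosen for illustration.

```python
import random


def sample_sbm(n, k, p, q, seed=None):
    """Sample an undirected graph with k clusters; return (labels, edges)."""
    rng = random.Random(seed)
    # Vertex i is assigned to cluster i * k // n, giving contiguous blocks.
    cluster = [i * k // n for i in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if cluster[u] == cluster[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return cluster, edges


labels, edges = sample_sbm(n=6, k=2, p=1.0, q=0.0, seed=0)
print(labels)
print(len(edges))
```

This sketch examines every pair of vertices, so it takes time quadratic in `n` and is only suitable for small graphs; a dedicated library is the better choice for large instances.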
Other Dataset Collections
Several other people maintain lists of datasets. For example, I have found the following pages useful.