Datasets

Notes and resources about Datasets.

Links

Google Dataset Search ↗ (HN ↗) (HN ↗)
Tencent ML-Images ↗ - Largest multi-label image database; ResNet-101 model; 80.73% top-1 accuracy on ImageNet.
Mathematics Dataset ↗ - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
Moving autonomous vehicles forward, together. Dataset by Lyft ↗
CodeSearchNet ↗ - Datasets, tools, and benchmarks for representation learning of code.
Introducing the CodeSearchNet challenge (2019) ↗ (HN ↗)
Facets ↗ - Visualizations for machine learning datasets.
skdata ↗ - Data sets for machine learning in Python.
TensorFlow Datasets ↗ - Collection of datasets ready to use with TensorFlow.
Awesome Public Datasets ↗
Awesome Public Datasets Core ↗ - Next iteration of APD project.
LORIS ↗ - Web-accessible database solution for longitudinal multi-site studies.
ProteinNet ↗ - Standardized data set for machine learning of protein structure.
Registry of Open Data on AWS ↗ (Code ↗)
List of datasets for machine-learning research ↗
Syndetic ↗ - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN ↗)
FaceForensics++ ↗ - Learning to Detect Manipulated Facial Images.
Scale AI ↗ - High quality training and validation data for AI applications.
Audio Datasets for Machine Learning ↗ (HN ↗)
Collection of large datasets for conversational response selection ↗
NSFW data source URLs ↗ - Collection of NSFW image URLs for the purpose of training an NSFW image classifier.
Lambdagram ↗ - Tiny Cloud Service to Build Image Datasets with Instagram.
HN Stories and comments since 2006 ↗
My Giant Data Quality Checklist (2020) ↗
LabelImg ↗ - Graphical image annotation tool.
Common Voice ↗ - Mozilla’s initiative to help teach machines how real people speak.
Replica Dataset ↗ - Dataset of high quality reconstructions of a variety of indoor spaces.
Using Decision Trees for charting ill-behaved datasets (2020) ↗
Human parsing datasets ↗
Data Programming: Creating Large Training Sets, Quickly (2016) ↗
Announcing Artifacts (2020) ↗
DataHub ↗ - Provides various solutions to publish and deploy your data with power and simplicity.
Core Data ↗ - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code ↗)
Awesome collections on DataHub ↗
Label Studio ↗ - Multi-type data labeling and annotation tool with standardized output format. (Code ↗) (Time Series Data Labeling ↗)
Heartex ↗ - Data Management Platform for Machine Learning.
Clothing Dataset: Call for Action (2020) ↗
Unsplash Dataset ↗ - 2,000,000+ Unsplash images made available for research and machine learning. (Web ↗)
100k+ Rows Topic Labeled News Dataset (2020) ↗
Fashion-MNIST ↗ - MNIST-like fashion product database.
FiveThirtyEight Datasets ↗
Books in .txt format for AI training purposes ↗ (HN ↗)
Sweetviz ↗ - Visualize and compare datasets, target values and associations, with one line of code.
SuperAnnotate ↗ - Fastest annotation platform for training AI.
Activeloop Hub ↗ - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web ↗) (Docs ↗) (Reddit ↗)
Objectron Dataset ↗ - Dataset of short object-centric video clips with pose annotations.
Google Research Datasets ↗
matorage ↗ - Efficient way to store, load and manage datasets, models and optimizers for deep learning.
HN Posts datasets ↗ (HN ↗)
Hypersim Toolkit ↗ - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
mirdata ↗ - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
MetFaces Dataset ↗ - Image dataset of human faces extracted from works of art.
Lionbridge AI ↗ - Provides human-labeled data for hundreds of use cases.
Traditional Chinese Landscape Painting Dataset ↗
Awesome Satellite Imagery Datasets ↗
Wikimedia Downloads ↗ - Download the Entire Wikimedia Database. (HN ↗)
Wikipedia: Database download ↗
How to shuffle a big dataset (2018) ↗ (Reddit ↗)
ESC-50: Dataset for Environmental Sound Classification ↗
Booking.com WSDM challenge ↗ - Training dataset consists of over a million anonymized hotel reservations, based on real data.
Computer Vision Datasets ↗
Voicebook Datasets ↗ - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
The Pile ↗ - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
doccano ↗ - Open source text annotation tool for machine learning practitioners. (Web ↗)
Weather and Climate Datasets for AI Research ↗ (Code ↗)
NLP Datasets ↗
Total Text Dataset ↗ - Consists of 1555 images with more than 3 different text orientations, including horizontal, multi-oriented, and curved text.
Datasets collected for network science, deep learning and general machine learning research ↗
MER and SER Data sets ↗ - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
Common Voice Datasets ↗ - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code ↗)
Label a Dataset with a Few Lines of Code (2021) ↗ (HN ↗)
Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) ↗ (Code ↗)
Datasets should behave like git repositories (2021) ↗
The Stanford Question Answering Dataset ↗ (Visual Explorer ↗)
Data.gov ↗ - Home of the U.S. Government’s open data.
Visualizing Data Timeliness at Airbnb (2021) ↗
The Next Evolution of Data Catalogs: Data Discovery Platforms (2021) ↗
DeepLabel ↗ - Cross-platform image annotation tool for machine learning.
WIT: Wikipedia-based Image Text Dataset ↗
Harry Potter Dataset ↗
DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) ↗ (Code ↗)
Synthetic Data: Even Better than the Real Thing? (2021) ↗
Google C4 dataset ↗ - Colossal, cleaned version of Common Crawl’s web crawl corpus.
Finding a standard dataset format for machine learning (2020) ↗ (HN ↗)
Hashing techniques to compare large datasets? (2021) ↗
Machine Learning Datasets | Papers With Code ↗ (Twitter ↗)
Ocean Market ↗ - Marketplace to find, publish and trade data sets. (Code ↗)
Ocean Protocol ↗ - Tools for the Web3 Data Economy. (Contracts ↗) (GitHub ↗)
Generating Datasets with Pretrained Language Models (2021) ↗
nbodykit ↗ - Analysis kit for large-scale structure datasets, the massively parallel way.
Dataset Inference: Ownership Resolution in Machine Learning (2021) ↗ (Tweet ↗)
Diffgram ↗ - Data Labeling Software for Machine Learning. (Code ↗)
Data Profiler ↗ - Python library designed to make data analysis, monitoring and sensitive data detection easy.
Tonic ↗ - Fake Data Company. (GitHub ↗)
Datasets for Google Cloud ↗ (Article ↗)
SQLite Data Starter Packs ↗
GitHub Collection: Open data ↗ - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
Scientific Data Repositories ↗ (HN ↗)
CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) ↗ (HN ↗)
ir_datasets ↗ - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
SEDE (Stack Exchange Data Explorer) ↗ - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article ↗)
List of Medical (Imaging) Datasets ↗
musescore.com dataset ↗ - Dataset of all music sheets and users on musescore.com.
generatedata.com ↗ - Random data generator. (Code ↗)
MTData ↗ - Tool that automates collection and preparation of machine translation datasets.
The MIT Supercloud Dataset (2021) ↗
Datasheets for Datasets (2018) ↗ (Markdown Datasheet for Datasets ↗)
Lightly ↗ - Label only the data which improves your ML model. (HN ↗)
Small Open Datasets ↗ - Collection of automatically-updated, ready-to-use and open-licensed datasets.
DataQA ↗ - Labelling platform for text using distant supervision.
COCO - Common Objects in Context ↗ - Large-scale object detection, segmentation, and captioning dataset. (API ↗)
img2dataset ↗ - Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
How to fit any dataset with a single parameter (2019) ↗ (HN ↗)
Single-dataset Experts for Multi-dataset Question Answering (2021) ↗ (Code ↗)
LabelFlow ↗ - Open standard platform for image labeling. (Code ↗)
Face Synthetics dataset ↗
Toloka ↗ - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code ↗) (GitHub ↗)
PlainTextWikipedia ↗ - Convert Wikipedia database dumps into plaintext files.
Discovering Anomalous Data with Self-Supervised Learning (2021) ↗
Resources to get you the best quality of ML datasets (2021) ↗
Hugging Face Datasets ↗
SDMetrics ↗ - Metrics to evaluate quality and efficacy of synthetic datasets.
doubtlab ↗ - General tricks that may help you find bad, or noisy, labels in your dataset.
Gretel Synthetics ↗ - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
Great datasets to teach with (2021) ↗
A Cartel of Influential Datasets Are Dominating Machine Learning Research ↗ (HN ↗)
The Toxicity Dataset ↗
Data Linter ↗ - Identifies potential issues (lints) in your ML training data.
Cloud Annotations ↗ - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web ↗)
pyjanitor ↗ - Clean APIs for data cleaning. Python implementation of R package Janitor.
face2comics datasets ↗
arXiv public datasets ↗
AIST++ Dance Motion Dataset ↗ (API Code ↗)
TheAudioDB.com ↗ - Community Database of audio artwork and metadata with a JSON API.
Awesome Video Datasets ↗
Conceptual 12M ↗ - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Colliding Circles Toy Datasets ↗
Sieve ↗ - Transform raw video into high quality datasets in minutes. (HN ↗) (HN ↗)
IKEA 3D Assembly Dataset ↗
Imbalanced Dataset Sampler ↗ - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
ADE20K Dataset ↗ - Composed of more than 27K images from the SUN and Places databases. (Code ↗)
Datasets of Automatic Keyphrase Extraction ↗
Awesome Forests ↗ - Curated list of ground-truth forest datasets for the machine learning and forestry community.
PushShift Data Dumps ↗
DeepEcho ↗ - Synthetic Data Generation for mixed-type, multivariate time series.
deduplify ↗ - Python tool to search for and remove duplicated files in messy datasets.
CSVtoTable ↗ - Simple command-line utility to convert CSV files to a searchable and sortable HTML table.
Kubric ↗ - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
ASPset-510 ↗ - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
Self-Distilled Internet Photos (SDIP) Dataset ↗
Fake News Corpus ↗
Sniffer ↗ - Lightweight Python application for sorting images in your dataset.
Dataset Distillation by Matching Training Trajectories (2022) ↗ (Code ↗)
BeeRef ↗ - Simple Reference Image Viewer.
BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) ↗ (Code ↗)
HierText Dataset ↗ - Dataset featuring hierarchical annotations of text in natural scenes and documents.
MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022) ↗
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus ↗
Squirrel Datasets Core ↗
GTA-3D Dataset ↗ - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
Relative Human (RH) ↗ - Multi-person in-the-wild RGB images with rich human annotations.
CSV Base ↗ - Turn CSV files into read+write APIs. (Code ↗)
A Dataset and Explorer for 3D Signed Distance Functions (2022) ↗ (Code ↗)
Vega Datasets ↗ - Collection of datasets used in Vega and Vega-Lite examples.
Azimuth ↗ - Open-source dataset and error analysis tool for text classification.
audio2dataset ↗ - Easily turn large sets of audio URLs into an audio dataset.
Datasets for Entity Recognition ↗ - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
AudioLoader ↗ - PyTorch Dataset for Speech and Music audio.
Awesome Training Data ↗
MIDI Dataset ↗ - Code for creating a dataset of MIDI ground truth.
Labelbox ↗ - Fastest way to annotate data to build and ship computer vision applications. (Code ↗)
Bamboo ↗ - Mega-scale and information-dense dataset for classification and detection pre-training.
The How2 Dataset ↗ - Multimodal collection of instructional videos with English subtitles. (Code ↗)
Unity Dataset Insights ↗ - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) ↗ (Code ↗)
Perceptual Image Processing ALgorithms (PIPAL) ↗ (Code ↗)
Hover ↗ - Label data at scale. Fun and precision included.
