Datasets
Notes and resources about Datasets.
Links
- Google Dataset Search ↗ (HN ↗) (HN ↗)
- Tencent ML-Images ↗ - Largest multi-label image database; ResNet-101 model; 80.73% top-1 accuracy on ImageNet.
- Mathematics Dataset ↗ - Dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
- Moving autonomous vehicles forward, together. Dataset by Lyft ↗
- CodeSearchNet ↗ - Datasets, tools, and benchmarks for representation learning of code.
- Introducing the CodeSearchNet challenge (2019) ↗ (HN ↗)
- Facets ↗ - Visualizations for machine learning datasets.
- skdata ↗ - Data sets for machine learning in Python.
- TensorFlow Datasets ↗ - Collection of datasets ready to use with TensorFlow.
- Awesome Public Datasets ↗
- Awesome Public Datasets Core ↗ - Next iteration of APD project.
- LORIS ↗ - Web-accessible database solution for longitudinal multi-site studies.
- ProteinNet ↗ - Standardized data set for machine learning of protein structure.
- Registry of Open Data on AWS ↗ (Code ↗)
- List of datasets for machine-learning research ↗
- Syndetic ↗ - Replaces static data dictionaries with a live data profiling system. Annotate, measure, and monitor your datasets. Share the results. (HN ↗)
- FaceForensics++ ↗ - Learning to Detect Manipulated Facial Images.
- Scale AI ↗ - High quality training and validation data for AI applications.
- Audio Datasets for Machine Learning ↗ (HN ↗)
- Collection of large datasets for conversational response selection ↗
- NSFW data source URLs ↗ - Collection of NSFW image URLs for training an NSFW image classifier.
- Lambdagram ↗ - Tiny Cloud Service to Build Image Datasets with Instagram.
- HN Stories and comments since 2006 ↗
- My Giant Data Quality Checklist (2020) ↗
- LabelImg ↗ - Graphical image annotation tool.
- Common Voice ↗ - Mozilla’s initiative to help teach machines how real people speak.
- Replica Dataset ↗ - Dataset of high quality reconstructions of a variety of indoor spaces.
- Using Decision Trees for charting ill-behaved datasets (2020) ↗
- Human parsing datasets ↗
- Data Programming: Creating Large Training Sets, Quickly (2016) ↗
- Announcing Artifacts (2020) ↗
- DataHub ↗ - Provides various solutions to publish and deploy your data with power and simplicity.
- Core Data ↗ - Important, commonly-used data as high quality, easy-to-use & open data packages. (Code ↗)
- Awesome collections on DataHub ↗
- Label Studio ↗ - Multi-type data labeling and annotation tool with standardized output format. (Code ↗) (Time Series Data Labeling ↗)
- Heartex ↗ - Data Management Platform for Machine Learning.
- Clothing Dataset: Call for Action (2020) ↗
- Unsplash Dataset ↗ - 2,000,000+ Unsplash images made available for research and machine learning. (Web ↗)
- 100k+ Rows Topic Labeled News Dataset (2020) ↗
- Fashion-MNIST ↗ - MNIST-like fashion product database.
- FiveThirtyEight Datasets ↗
- Books in .txt format for AI training purposes ↗ (HN ↗)
- Sweetviz ↗ - Visualize and compare datasets, target values and associations, with one line of code.
- SuperAnnotate ↗ - Fastest annotation platform for training AI.
- Activeloop Hub ↗ - Fastest way to access and manage datasets for PyTorch and TensorFlow. (Web ↗) (Docs ↗) (Reddit ↗)
- Objectron Dataset ↗ - Dataset of short object-centric video clips with pose annotations.
- Google Research Datasets ↗
- matorage ↗ - Efficient way to store/load and manage dataset, model and optimizer for deep learning.
- HN Posts datasets ↗ (HN ↗)
- Hypersim Toolkit ↗ - Set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
- mirdata ↗ - Interoperable Dataset Loaders for Music Information Retrieval (MIR).
- MetFaces Dataset ↗ - Image dataset of human faces extracted from works of art.
- Lionbridge AI ↗ - Provides human-labeled data for hundreds of use cases.
- Traditional Chinese Landscape Painting Dataset ↗
- Awesome Satellite Imagery Datasets ↗
- Wikimedia Downloads ↗ - Download the Entire Wikimedia Database. (HN ↗)
- Wikipedia: Database download ↗
- How to shuffle a big dataset (2018) ↗ (Reddit ↗)
- ESC-50: Dataset for Environmental Sound Classification ↗
- Booking.com WSDM challenge ↗ - Training dataset consists of over a million anonymized hotel reservations, based on real data.
- Computer Vision Datasets ↗
- Voicebook Datasets ↗ - Comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
- The Pile ↗ - 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
- doccano ↗ - Open source text annotation tool for machine learning practitioners. (Web ↗)
- Weather and Climate Datasets for AI Research ↗ (Code ↗)
- NLP Datasets ↗
- Total Text Dataset ↗ - Consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
- Datasets collected for network science, deep learning and general machine learning research ↗
- MER and SER Data sets ↗ - Data sets for Music Emotion Recognition and Speech Emotion Recognition.
- Common Voice Datasets ↗ - Multi-language dataset of voices that anyone can use to train speech-enabled applications. (Code ↗)
- Label a Dataset with a Few Lines of Code (2021) ↗ (HN ↗)
- Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples (2020) ↗ (Code ↗)
- Datasets should behave like git repositories (2021) ↗
- The Stanford Question Answering Dataset ↗ (Visual Explorer ↗)
- Data.gov ↗ - Home of the U.S. Government’s open data.
- Visualizing Data Timeliness at Airbnb (2021) ↗
- The Next Evolution of Data Catalogs: Data Discovery Platforms (2021) ↗
- DeepLabel ↗ - Cross-platform image annotation tool for machine learning.
- WIT: Wikipedia-based Image Text Dataset ↗
- Harry Potter Dataset ↗
- DocRED: A Large-Scale Document-Level Relation Extraction Dataset (2019) ↗ (Code ↗)
- Synthetic Data: Even Better than the Real Thing? (2021) ↗
- Google C4 dataset ↗ - Colossal, cleaned version of Common Crawl’s web crawl corpus.
- Finding a standard dataset format for machine learning (2020) ↗ (HN ↗)
- Hashing techniques to compare large datasets? (2021) ↗
- Machine Learning Datasets | Papers With Code ↗ (Twitter ↗)
- Ocean Market ↗ - Marketplace to find, publish and trade data sets. (Code ↗)
- Ocean Protocol ↗ - Tools for the Web3 Data Economy. (Contracts ↗) (GitHub ↗)
- Generating Datasets with Pretrained Language Models (2021) ↗
- nbodykit ↗ - Analysis kit for large-scale structure datasets, the massively parallel way.
- Dataset Inference: Ownership Resolution in Machine Learning (2021) ↗ (Tweet ↗)
- Diffgram ↗ - Data Labeling Software for Machine Learning. (Code ↗)
- Data Profiler ↗ - Python library designed to make data analysis, monitoring and sensitive data detection easy.
- Tonic ↗ - Fake Data Company. (GitHub ↗)
- Datasets for Google Cloud ↗ (Article ↗)
- SQLite Data Starter Packs ↗
- GitHub Collection: Open data ↗ - Examples of using GitHub to store, publish, and collaborate on open, machine-readable datasets.
- Scientific Data Repositories ↗ (HN ↗)
- CatMeows: A Publicly-Available Dataset of Cat Vocalizations (2020) ↗ (HN ↗)
- ir_datasets ↗ - Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc.
- SEDE (Stack Exchange Data Explorer) ↗ - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data. (Article ↗)
- List of Medical (Imaging) Datasets ↗
- musescore.com dataset ↗ - Dataset of all music sheets and users on musescore.com.
- generatedata.com ↗ - Random data generator. (Code ↗)
- MTData ↗ - Tool that automates the collection and preparation of machine translation datasets.
- The MIT Supercloud Dataset (2021) ↗
- Datasheets for Datasets (2018) ↗ (Markdown Datasheet for Datasets ↗)
- Lightly ↗ - Label only the data which improves your ML model. (HN ↗)
- Small Open Datasets ↗ - Collection of automatically-updated, ready-to-use and open-licensed datasets.
- DataQA ↗ - Labelling platform for text using distant supervision.
- COCO - Common Objects in Context ↗ - Large-scale object detection, segmentation, and captioning dataset. (API ↗)
- img2dataset ↗ - Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
- How to fit any dataset with a single parameter (2019) ↗ (HN ↗)
- Single-dataset Experts for Multi-dataset Question Answering (2021) ↗ (Code ↗)
- LabelFlow ↗ - Open standard platform for image labeling. (Code ↗)
- Face Synthetics dataset ↗
- Toloka ↗ - Fast and efficient way to collect and label large data sources for machine learning and other business purposes. (Code ↗) (GitHub ↗)
- PlainTextWikipedia ↗ - Convert Wikipedia database dumps into plaintext files.
- Discovering Anomalous Data with Self-Supervised Learning (2021) ↗
- Resources to get you the best quality of ML datasets (2021) ↗
- Hugging Face Datasets ↗
- SDMetrics ↗ - Metrics to evaluate quality and efficacy of synthetic datasets.
- doubtlab ↗ - General tricks that may help you find bad, or noisy, labels in your dataset.
- Gretel Synthetics ↗ - Synthetic data generators for structured and unstructured text, featuring differentially private learning.
- Great datasets to teach with (2021) ↗
- A Cartel of Influential Datasets Are Dominating Machine Learning Research ↗ (HN ↗)
- The Toxicity Dataset ↗
- Data Linter ↗ - Identifies potential issues (lints) in your ML training data.
- Cloud Annotations ↗ - Fast, easy and collaborative open source image annotation tool for teams and individuals. (Web ↗)
- pyjanitor ↗ - Clean APIs for data cleaning. Python implementation of R package Janitor.
- face2comics datasets ↗
- arXiv public datasets ↗
- AIST++ Dance Motion Dataset ↗ (API Code ↗)
- TheAudioDB.com ↗ - Community Database of audio artwork and metadata with a JSON API.
- Awesome Video Datasets ↗
- Conceptual 12M ↗ - Dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
- Colliding Circles Toy Datasets ↗
- Sieve ↗ - Transform raw video into high quality datasets in minutes. (HN ↗) (HN ↗)
- IKEA 3D Assembly Dataset ↗
- Imbalanced Dataset Sampler ↗ - PyTorch imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
- ADE20K Dataset ↗ - Composed of more than 27K images from the SUN and Places databases. (Code ↗)
- Datasets of Automatic Keyphrase Extraction ↗
- Awesome Forests ↗ - Curated list of ground-truth forest datasets for the machine learning and forestry community.
- PushShift Data Dumps ↗
- DeepEcho ↗ - Synthetic Data Generation for mixed-type, multivariate time series.
- deduplify ↗ - Python tool to search for and remove duplicated files in messy datasets.
- CSVtoTable ↗ - Simple command-line utility to convert CSV files to a searchable and sortable HTML table.
- Kubric ↗ - Data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.
- ASPset-510 ↗ - Large-scale video dataset for the training and evaluation of 3D human pose estimation models.
- Self-Distilled Internet Photos (SDIP) Dataset ↗
- Fake News Corpus ↗
- Sniffer ↗ - Lightweight Python application for sorting images in your dataset.
- Dataset Distillation by Matching Training Trajectories (2022) ↗ (Code ↗)
- BeeRef ↗ - Simple Reference Image Viewer.
- BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021) ↗ (Code ↗)
- HierText Dataset ↗ - Dataset featuring hierarchical annotations of text in natural scenes and documents.
- MetaShift: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts (2022) ↗
- CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus ↗
- Squirrel Datasets Core ↗
- GTA-3D Dataset ↗ - Dataset of 2D imagery, 3D point cloud data, and 3D vehicle bounding box labels all generated using the Grand Theft Auto 5 game engine.
- Relative Human (RH) ↗ - Multi-person in-the-wild RGB images with rich human annotations.
- CSV Base ↗ - Turn CSV files into read+write APIs. (Code ↗)
- A Dataset and Explorer for 3D Signed Distance Functions (2022) ↗ (Code ↗)
- Vega Datasets ↗ - Collection of datasets used in Vega and Vega-Lite examples.
- Azimuth ↗ - Open-source dataset and error analysis tool for text classification.
- audio2dataset ↗ - Easily turn large sets of audio URLs into an audio dataset.
- Datasets for Entity Recognition ↗ - Collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
- AudioLoader ↗ - PyTorch Dataset for Speech and Music audio.
- Awesome Training Data ↗
- MIDI Dataset ↗ - Code for creating a dataset of MIDI ground truth.
- Labelbox ↗ - Fastest way to annotate data to build and ship computer vision applications. (Code ↗)
- Bamboo ↗ - Mega-scale and information-dense dataset for classification and detection pre-training.
- The How2 Dataset ↗ - Multimodal collection of instructional videos with English subtitles. (Code ↗)
- Unity Dataset Insights ↗ - Python package for downloading, parsing and analyzing synthetic datasets generated using the Unity Perception package.
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2022) ↗ (Code ↗)
- Perceptual Image Processing ALgorithms (PIPAL) ↗ (Code ↗)
- Hover ↗ - Label data at scale. Fun and precision included.