Ibis ↗ & Orchest ↗ are nice.
Notes
Links
- Bigslice ↗ - System for fast, large-scale, serverless data processing using Go.
- Reflow ↗ - Language and runtime for distributed, incremental data processing in the cloud.
- Self-managing serverless computing with Bigmachine (2019) ↗
- Bigslice: a cluster computing system for Go (2019) ↗
- When your data doesn’t fit in memory: the basic techniques (2019) ↗ (HN ↗)
- Differential Dataflow ↗ - Implementation of differential dataflow using timely dataflow in Rust. (Book ↗) (HN ↗)
- The Log: What every software engineer should know about real-time data’s unifying abstraction (2013) ↗
- Luna ↗ - Data processing and visualization environment built on a principle that people need an immediate connection to what they are building.
- Guide To The Data Lake — Modern Batch Data Warehousing (2020) ↗
- Plumbing At Scale (2020) ↗ - Event Sourcing and Stream Processing Pipelines at Grab.
- Differential Dataflow! But at what COST? (2017) ↗ (HN ↗)
- Timely Dataflow and Total Order (2020) ↗
- Nuclio ↗ - High-Performance Serverless event and data processing platform.
- Apache Spark ↗ - Unified analytics engine for large-scale data processing. (PySpark ↗) (PySpark Style Guide ↗) (Article ↗) (Web ↗) (Spark Learning Guide ↗)
- Spark: The Definitive Guide Book (2018) ↗ (Code ↗)
- Batch ↗ - Event replay platform. Version control for data passing through your messaging systems. (HN ↗)
- A log/event processing pipeline you can’t have (2019) ↗ (HN ↗)
- mm-ADT ↗ - Multi-Model Abstract Data Type. Distributed virtual machine capable of integrating a diverse collection of data processing technologies. (Code ↗)
- Data Preprocessing in Machine Learning (2020) ↗
- lakeFS ↗ - Open source layer that delivers resilience and manageability to object-storage based data lakes. (Web ↗)
- Baker ↗ - High performance, composable and extendable data-processing pipeline for the big data era.
- Cylon ↗ - Fast, scalable distributed memory data parallel library for processing structured data. (Web ↗)
- cuGraph ↗ - GPU Graph Analytics.
- Opaque ↗ - Secure Apache Spark SQL.
- Apache Beam ↗ - Unified programming model for Batch and Streaming. (Web ↗)
- Stitch ↗ - Simple, extensible ETL built for data teams.
- Databricks ↗ - Unified Data Analytics. (GitHub ↗) (CLI ↗) (Reflecting on Four Years at Databricks (2021) ↗)
- AugMix ↗ - Simple Data Processing Method to Improve Robustness and Uncertainty.
- Snapflow ↗ - Framework for building end-to-end functional data pipelines from modular components.
- Workflow Description Language (WDL) ↗ - Way to specify data processing workflows with a human-readable and writeable syntax.
- Cloudfuse ↗ - Open source serverless data solutions. Future of data pipelines. (GitHub ↗)
- Create your own data stream for Kafka with Python and Faker (2021) ↗
- Hindsight ↗ - C-based data processing infrastructure built on the Lua sandbox project.
- Reverse ETL — A Primer (2021) ↗
- I wrote one of the fastest DataFrame libraries (2021) ↗
- Build your own “data lake” for reporting purposes in a multi-services environment (2021) ↗
- Feature Stores: The Data Side of ML Pipelines (2021) ↗
- Flowgger ↗ - Fast, simple and lightweight data collector written in Rust.
- Popsink ↗ - Real-time data platform you don’t have to build.
- Flyte ↗ - Structured programming and distributed processing platform that enables highly concurrent, scalable and maintainable workflows for Machine Learning and Data Processing. (Web ↗) (GitHub ↗) (Python SDK ↗) (CLI ↗)
- Winterfell ↗ - Distributed STARK prover.
- Python to Distributed Python to Airflow task in ~5 lines of code ↗
- DataFusion ↗ - Extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
- Delta Lake ↗ - Reliable Data Lakes at Scale. (GitHub ↗)
- Delta Sharing ↗ - Open Protocol for Secure Data Sharing. (Article ↗) (Tweet ↗)
- Dataform ↗ - Manage data pipelines in BigQuery.
- Legate Pandas ↗ - Aspiring Drop-In Replacement for Pandas at Scale.
- datablocks ↗ - Flow based data processing editor. (HN ↗)
- Reproducible data processing pipelines (2021) ↗
- datasketch ↗ - Probabilistic data structures that can process and search very large amounts of data super fast, with little loss of accuracy.
- Tuplex ↗ - Parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. (Web ↗)
- file.d ↗ - Blazing fast tool for building data pipelines: read, process and output events.
- Datafuse ↗ - Modern Real-Time Data Processing in Rust. (Code ↗) (HN ↗)
- MapReduce is making a comeback (2021) ↗ (HN ↗)
- SciPipe ↗ - Robust, flexible and resource-efficient pipelines using Go and the command line. (Docs ↗)
- The Future Is Big Graphs: A Community View on Graph Processing Systems (2021) ↗ (HN ↗)
- What Is the Data Lakehouse Pattern? ↗ (HN ↗)
- Apache Hadoop ↗ - Open-source software for reliable, scalable, distributed computing. (Is Hadoop Dead? ↗) (Code ↗)
- go-stash ↗ - High performance, free and open source server-side data processing pipeline that ingests data from Kafka, processes it, and then sends it to ElasticSearch.
- pypely ↗ - Make your data processing easy - build pipelines in a functional manner.
- An opinionated map of incremental and streaming systems (2021) ↗
- Crossjoin ↗ - Joins together your data from anywhere.
- Ceramic Network ↗ - Decentralized, open source platform for creating, hosting, and sharing streams of data. (TS Code ↗) (GitHub ↗) (Doc ↗)
- Graphite-Web ↗ - Highly scalable real-time graphing system. (Docs ↗)
- vega ↗ - Faster implementation of Apache Spark from scratch in Rust.
- Memgraph ↗ - Build modern, graph-based applications on top of your streaming data in minutes. (Web ↗)
- Apache Parquet ↗ - Columnar storage format that supports nested data. (Code ↗)
- Data Pipelines Pocket Reference Book (2021) ↗ (Code ↗)
- miniwdl ↗ - Workflow Description Language developer tools & local runner.
- Rain ↗ - Framework for large distributed pipelines.
- Apache SeaTunnel ↗ - Distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time). (Code ↗)
- Databend ↗ - Open Source Serverless Data Warehouse for Everyone. (Web ↗)
- Pydra ↗ - Simple dataflow engine with scalable semantics.
- Bytewax ↗ - Open source Python framework for building highly scalable dataflows.
- Atomic Data ↗ - Modular specification for sharing, modifying and modeling graph data. (Code ↗) (Rust Code ↗)
- Apache Arrow Flight SQL: Accelerating Database Access (2022) ↗ (HN ↗)
- Grist ↗ - Modern relational spreadsheet. Open core alternative to Airtable and Google Sheets. (HN ↗)
- Data Engineering Practice Problems ↗
- Dagster: Rebundling the Data Platform (2022) ↗
- cq ↗ - Clojure Command-line Data Processor for JSON, YAML, EDN, XML and more.
- utt ↗ - Universal text transformer.
- Loggie ↗ - Lightweight, high-performance, cloud-native agent and aggregator based on Go.
- ter ↗ - CLI to run text expressions and perform basic text operations such as filtering, ignoring and replacing on the command line.
- csv-diff ↗ - Python CLI tool and library for diffing CSV and JSON files.
- pqrs ↗ - Command line tool for inspecting Parquet files.
- Kestra ↗ - Infinitely scalable open source orchestration & scheduling platform. (Code ↗) (HN ↗)
- TiFlash ↗ - Analytical engine for TiDB.
- Streamify ↗ - Data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more.
- DTL ↗ - Language and JavaScript lib to transform and manipulate data. (HN ↗)
- Hawk ↗ - Haskell text processor for the command-line.
- Alternatives to pandas library ↗
- Zed ↗ - Tooling for super-structured data: a new and easier way to manipulate data. (Web ↗)
- Fast Analysis with DuckDB + PyArrow (2022) ↗ - Trying out some new speedy tools for data analysis.
- Why isn’t there a decent file format for tabular data? (2022) ↗ (HN ↗)
- Data Engineering Wiki ↗ (Code ↗)
- csv-clean ↗ - Command line tool to clean up malformed CSV files.
- rq ↗ - Tool for doing record analysis and transformation.
- Data Integration Guide: Techniques, Technologies, and Tools (2022) ↗
- Mito ↗ - Excel-like interface for pandas DataFrames in Jupyter notebooks. (HN ↗)
- Tornado ↗ - Complex Event Processor that receives reports of events from data sources such as monitoring, email, and telegram, and matches them against pre-configured rules.
- Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash (2022) ↗
- dataPipe ↗ - Data processing and data analytics library for JavaScript.
- gosquito ↗ - Pluggable tool for data gathering, data processing and data transmitting to various destinations.
- DLT ↗ - Enables simple Python-native data pipelining for data professionals.
- PipeRider ↗ - Toolkit for detecting data issues across pipelines that works with CI systems for continuous data quality assessment.
- airflint ↗ - Enforce Best Practices for all your Airflow DAGs.