Table of Contents
This is the DuckDB design document. DuckDB consists of the following major components:
- API
- Parser
- Planner
- Optimiser
- Execution
- Storage
Below, ideas for individual components are described.
API
- C/C++ API, similar to https://github.com/hannesmuehleisen/MonetDBLite-C/blob/master/src/embedded/embedded.h
- Allows bulk access to values to support analytics tools (R/NumPy)
- Allows multiple databases instances in same process
- No authorisation etc, a wrapping process can do that.
- No schemas, because you can just have another database.
Use case
- Embedded? Server? Or both by wrapping the embedded DB into a process?
- How are we going to share resources (CPU, memory) with other processes?
Parser
- Tokenises query into strings (which may be quoted), then dispatches code that consumes tokens.
- Extensible through registrations of new tokens and code that consumes them.
- Input: Query
- Output: Parse tree (token-specific nodes, e.g. where node containing an expression)
- Perhaps use this: https://github.com/lfittl/libpg_query
Planner
- Transform parse tree into basic, stupid, runnable plan.
- Extensible with new transformation rules so it can handle the new tokens from parser plugins.
- Knows database schema so it can select columns and validate etc.
- Input: Parse tree
- Output: Query Plan, binary tree with operators (scan, project, join etc.).
Optimiser
- Static rewriting + sampling-based optimiser. Samples base tables, runs first and second layer (scan/join) of query plan to get idea about cardinalities. Perhaps http://pages.cs.wisc.edu/~wentaowu/papers/sigmod16-reoptimization.pdf is good?
see also:
** Thomas Neumann, César A. Galindo-Legaria: Taking the Edge off Cardinality Estimation Errors using Incremental Execution. BTW 2013: 73-92.
** Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann: Cardinality Estimation Done Right: Index-Based Join Sampling. CIDR 2017
-
Idea: Spawn thread that runs greedy query plan immediately while optimising to handle short-running queries.
-
Extensible with new rewriting rules (See Catalyst, https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html).
-
Question: How to include optimisation rules needed for new operators?
-
Maybe something like Orca (https://15721.courses.cs.cmu.edu/spring2016/papers/p337-soliman.pdf)
-
Or use something that is available as an library: Apache Calcite (JVM, probably not) or Orca
-
State of art: Dynamic programming for Join order (https://15721.courses.cs.cmu.edu/spring2017/papers/14-optimizer1/p539-moerkotte.pdf) and Cascades for extensibility (https://pdfs.semanticscholar.org/360e/cdfc79850873162ee4185bed8f334da30031.pdf)
Execution
- Vectorized execution engine (http://cidrdb.org/cidr2005/papers/P19.pdf)
- Alternatively, we could implement operators in Tim's VM (supposed to be a library) and get vectorized execution (planned: plus adaptive optimizations, pluggable JIT and GPU offloading)
- Work stealing parallelisation (Morsel-driven? https://db.in.tum.de/~leis/papers/morsels.pdf)
- Extensible, new operators can be added
- Input: Query plan
- Output: Result Sets possibly
Transactions
Serializability? (less Tx/s but at least the end-user understands it)
Storage Engine
- Single-File (The problem with single-file is storing the transactional log but if we say that we only require that for atomicity and consistency then we can easily put the log into another file)
- MVCC (https://db.in.tum.de/~muehlbau/papers/mvcc.pdf)
- Fixed block size to support
- Eviction based buffer manager (https://db.in.tum.de/~leis/papers/leanstore.pdf)
- Block-based compressions
- Keep samples for tables around for optimizer
- Min/Max information stored similar to samples, to restrict scan ranges/data types etc
- Use K/V for storage?
- Question: Are we planning to use primary/secondary indexes?
- For primary/clustered indexes we could just sort the table (or keep it sorted), potentially use a PMA (faster inserts, decreased scanning speed)
- Quacking - our automatic indexes?
Implementation Choices
- C++ with auto-formatting
- No dependencies (without a very good reason)
- Try to minimize amount of tunable knobs
- Basic data models are tables. Tables consists of same-length columns of a native type (?)
- Unit tests + End-To-end. Use Travis. Re-use SQLite tests.
- Auto-detect performance regressions?
Notes 2018-07-12
- can we steal planner/parser somewhere? Peloton? Peloton uses this library: https://github.com/lfittl/libpg_query
- sampling, maintain during load, hyper-log-log to estimate unique?
- tree transformation rules with pattern matching
- re-optimize during execution? Bonus feature
- execution engine run pipeline from source to sink. default vectorized but pluggable
- where is the state of the operator tree? on the stack?
- hash tables in query state
- do we need another representation for queries, data structure choices physical representation langugage
- single-prefix add delete blobs, compacting, for cloud
- single write blob, LSM, could give us multi-version
- no indexes? enforcing sorting on pk? z-order in compaction
- join order imdb benchmark
- storage small data types based on global minmax/sample