amirouche
2018-02-10 07:34:13 UTC
Héllo all,
# Introduction
I have figured out a use case for an immutable / functional database
that works like git. I like the name "streamable immutable database",
but I am not sure it applies.
This probably seems ambitious and pretentious. That said, I am certain
I can get it done. The only uncertainty is performance, but I also have
ideas for that.
The idea of building a git-like database is not new, but I now have a
better picture of it.
The question you may want to ask is: why not re-implement git in Guile,
maybe with wiredtiger as the backing store? That is a legitimate
question. What I am trying to achieve is something more general than
git. Feel free to point me to relevant documentation or to argue that
git in Guile is the way forward.
The main use case I want to handle is the ability to experiment with
different versions of a given machine learning model / data / dataset
that might be bigger than RAM. That is, to switch easily and
efficiently from one version of the model to another without resorting
to copying all the files or the whole database.
In other words, a versioned, branchable, forkable database.
Feel free to argue that data and code are different and that data MUST
BE distributed out-of-band; I will read with great interest.
# Description
It MUST have the following features:
- It supports ACID transactions
- It's multi-threaded
- It's an association list database (like guile-wiredtiger's
  feature-space) where keys are symbols and values are any Scheme
  value. In other words, it's a document database.
- It supports git-like features, i.e. tags, branches, push, pull,
  revert, merge, log, diff and of course commits and revisions. In
  particular, it's possible to access the history of a given
  association.
- It's immutable in the sense that CRUD operations, instead of changing
  values in place, create new entries in the database to reflect the
  change. In terms of the wiredtiger API, there is no call to
  cursor-update; it only uses cursor-insert calls.
- 'neon checkout REV' will bring into the working space a more
  efficient representation of the data. That representation MUST BE
  configurable. In other words, if the user wants to version CSV,
  geo-temporal data, time series or whatever, it must be possible.
- It SHOULD allow mixing data with source files.
- It SHOULD also allow storing binaries efficiently.
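
To make the "insert only, never update" requirement and the per-association history concrete, here is a minimal sketch. It is written in Python rather than Guile purely for brevity, it keeps everything in an in-memory list instead of wiredtiger, and every name in it (`AppendOnlyStore`, `TOMBSTONE`, `set`, `get`, `history`) is hypothetical, not part of guile-wiredtiger or of any planned neon API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Hypothetical marker recorded when a key is deleted (assumption, not a real API).
TOMBSTONE = object()


@dataclass
class AppendOnlyStore:
    """Append-only log of (uid, key, value) entries; nothing is changed in place."""

    log: List[Tuple[str, str, Any]] = field(default_factory=list)

    def set(self, uid: str, key: str, value: Any) -> int:
        """Record a new value and return its revision (the position in the log)."""
        self.log.append((uid, key, value))
        return len(self.log) - 1

    def delete(self, uid: str, key: str) -> int:
        """A delete is also an insert: it appends a tombstone entry."""
        return self.set(uid, key, TOMBSTONE)

    def get(self, uid: str, key: str, revision: Optional[int] = None) -> Any:
        """Return the value visible at `revision` (latest if None), or None."""
        last = len(self.log) - 1
        if revision is not None:
            last = min(revision, last)
        # Scan backwards: the most recent entry at or before `revision` wins.
        for rev in range(last, -1, -1):
            entry_uid, entry_key, value = self.log[rev]
            if (entry_uid, entry_key) == (uid, key):
                return None if value is TOMBSTONE else value
        return None

    def history(self, uid: str, key: str) -> List[Tuple[int, Any]]:
        """All (revision, value) pairs ever recorded for one association."""
        return [
            (rev, value)
            for rev, (entry_uid, entry_key, value) in enumerate(self.log)
            if (entry_uid, entry_key) == (uid, key)
        ]


store = AppendOnlyStore()
first = store.set("model-1", "learning-rate", 0.1)
second = store.set("model-1", "learning-rate", 0.01)
store.delete("model-1", "learning-rate")
```

After these three calls the log holds three entries, `store.get("model-1", "learning-rate", revision=first)` still sees the original value, and the latest read sees the deletion. The linear scan is only there to keep the sketch short; a real store would need an index, which is where the performance question comes in.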
# TODO
- Code the "bare database", i.e. the gist of the story, which is the
  immutable association list that takes inspiration from git.
- Create benchmarks.
- Index conceptnet and wikidata and demo the git-like features over
  dictionary-based named entity recognition.
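
As a rough idea of what the "bare database" could look like, the sketch below models commits as immutable sets of changes with a parent pointer, and branches as names pointing at a commit, which is the part borrowed from git. It is in Python rather than Guile for brevity and is only an assumption about the design: `BareDatabase` and its methods are hypothetical names, there is no merge, and nothing here touches wiredtiger.

```python
import itertools
from typing import Any, Dict, List, Optional, Tuple


class BareDatabase:
    """Commits are immutable; a branch is just a name pointing at a commit."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        # commit id -> (parent commit id, changes introduced by that commit)
        self.commits: Dict[int, Tuple[Optional[int], Dict[str, Any]]] = {0: (None, {})}
        self.branches: Dict[str, int] = {"master": 0}
        self.head = "master"

    def commit(self, changes: Dict[str, Any]) -> int:
        """Append a new commit on the current branch; older commits are untouched."""
        parent = self.branches[self.head]
        commit_id = next(self._ids)
        self.commits[commit_id] = (parent, dict(changes))
        self.branches[self.head] = commit_id
        return commit_id

    def branch(self, name: str) -> None:
        """Create a branch at the current commit; no data is copied."""
        self.branches[name] = self.branches[self.head]

    def checkout(self, name: str) -> None:
        """Switch branches by moving a pointer."""
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def get(self, key: str, default: Any = None) -> Any:
        """Walk from the branch tip towards the root; the nearest change wins."""
        commit_id: Optional[int] = self.branches[self.head]
        while commit_id is not None:
            parent, changes = self.commits[commit_id]
            if key in changes:
                return changes[key]
            commit_id = parent
        return default

    def log(self) -> List[int]:
        """Commit ids from the current branch tip back to the root."""
        result = []
        commit_id: Optional[int] = self.branches[self.head]
        while commit_id is not None:
            result.append(commit_id)
            commit_id = self.commits[commit_id][0]
        return result


db = BareDatabase()
db.commit({"weights": "v1", "epochs": 10})
db.branch("experiment")
db.checkout("experiment")
db.commit({"weights": "v2"})
```

On the `experiment` branch, `db.get("weights")` resolves to the new value while `db.get("epochs")` is inherited from the parent commit; `db.checkout("master")` switches back without copying anything. Reads walk the parent chain, so their cost grows with history length, which is one of the things the benchmarks would have to measure.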