Amirouche Boubekki
2017-11-26 22:33:33 UTC
Héllo,
I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.
tl;dr:
$ git clone https://a-guile-mind.github.io/culturia.one
$ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
On guix(sd) environment:
$ guix environment --ad-hoc guile-wiredtiger
I stopped trying to understand what makes
a concept search engine [0]. Instead I will
focus on plain old keyword search engine.
I don't even plan to support synonyms [1].
[1]
https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2]
https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
But, if you have insights about the suject,
don't hesitate to share them.
You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.
You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?
So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?
That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!
What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?
Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.
The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].
[3] https://github.com/pingcap/tikv
According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...
One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.
The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).
I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.
My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.
english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.
Right now culturia.one does store pages using three
tables:
Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)
key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14
There is another table that stores, all the tokens
found in the documents:
key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42
Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.
The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
keys:
key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1
(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)
Anyway, pretty simple no?
Let's imagine we have a simple query like the following:
culturia://guile+manual
We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
rest of the query for instance. Something like:
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.
[4]
https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
So what is the next iteration:
guile-wiredtiger:
- fix the tests to run with guix
culturia:
- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org dump
- make a program that will index news.ycombinator.com and the linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.
Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.
Tx!
I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.
tl;dr:
$ git clone https://a-guile-mind.github.io/culturia.one
$ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
On guix(sd) environment:
$ guix environment --ad-hoc guile-wiredtiger
I stopped trying to understand what makes
a concept search engine [0]. Instead I will
focus on plain old keyword search engine.
I don't even plan to support synonyms [1].
[1]
https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2]
https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
But, if you have insights about the suject,
don't hesitate to share them.
You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.
You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?
So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?
That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!
What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?
Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.
The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].
[3] https://github.com/pingcap/tikv
According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...
One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.
The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).
I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.
My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.
english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.
Right now culturia.one does store pages using three
tables:
Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)
key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14
There is another table that stores, all the tokens
found in the documents:
key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42
Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.
The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
keys:
key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1
(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)
Anyway, pretty simple no?
Let's imagine we have a simple query like the following:
culturia://guile+manual
We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
rest of the query for instance. Something like:
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.
[4]
https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
So what is the next iteration:
guile-wiredtiger:
- fix the tests to run with guix
culturia:
- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org dump
- make a program that will index news.ycombinator.com and the linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.
Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.
Tx!
--
Amirouche ~ amz3 ~ http://www.hyperdev.fr
Amirouche ~ amz3 ~ http://www.hyperdev.fr