What's next with culturia search engine? (and guile-wiredtiger)

Discussion:

Amirouche Boubekki

2017-11-26 22:33:33 UTC

Héllo,

I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.

tl;dr:

$ git clone https://a-guile-mind.github.io/culturia.one
$ git clone https://framagit.org/a-guile-mind/guile-wiredtiger

On guix(sd) environment:

$ guix environment --ad-hoc guile-wiredtiger

I stopped trying to understand what makes
a concept search engine [0]. Instead I will
focus on plain old keyword search engine.
I don't even plan to support synonyms [1].

[1]
https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2]
https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/

But, if you have insights about the suject,
don't hesitate to share them.

You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.

You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?

So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?

That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!

What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?

Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.

The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].

[3] https://github.com/pingcap/tikv

According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...

One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.

The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).

I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.

My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.

english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.

Right now culturia.one does store pages using three
tables:

Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)

key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14

There is another table that stores, all the tokens
found in the documents:

key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42

Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.

The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
keys:

key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1

(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)

Anyway, pretty simple no?

Let's imagine we have a simple query like the following:

culturia://guile+manual

We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).

The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
rest of the query for instance. Something like:

(filter (hit? (cdr query)) seed)

Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.

[4]
https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455

So what is the next iteration:

guile-wiredtiger:

- fix the tests to run with guix

culturia:

- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org dump
- make a program that will index news.ycombinator.com and the linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.

Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.

Tx!

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

Tom Jakubowski

2017-11-27 22:03:50 UTC

Permalink

Looks very cool! I don't have any comments to add other than that I think
the git URL for culturia.one has an error: it should be
https://github.com/a-guile-mind/culturia.one.git and not
https://a-guile-mind.github.io/culturia.one

In other words:

$ git clone https://github.com/a-guile-mind/culturia.one.git

Tom

Post by Amirouche Boubekki
Héllo,
I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.
$ git clone https://a-guile-mind.github.io/culturia.one
$ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
$ guix environment --ad-hoc guile-wiredtiger
I stopped trying to understand what makes
a concept search engine [0]. Instead I will
focus on plain old keyword search engine.
I don't even plan to support synonyms [1].
[1]
https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2]
https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
But, if you have insights about the suject,
don't hesitate to share them.
You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.
You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?
So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?
That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!
What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?
Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.
The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].
[3] https://github.com/pingcap/tikv
According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...
One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.
The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).
I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.
My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.
english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.
Right now culturia.one does store pages using three
Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)
key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14
There is another table that stores, all the tokens
key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42
Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.
The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1
(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)
Anyway, pretty simple no?
culturia://guile+manual
We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.
[4]
https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
- fix the tests to run with guix
- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org dump
- make a program that will index news.ycombinator.com and the linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.
Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.
Tx!
--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

Amirouche Boubekki

2017-11-27 23:29:30 UTC

Permalink

Post by Tom Jakubowski
Looks very cool! I don't have any comments to add other than that I
think the git URL for culturia.one has an error: it should be
https://github.com/a-guile-mind/culturia.one.git [12] and not
https://a-guile-mind.github.io/culturia.one [1]
$ git clone https://github.com/a-guile-mind/culturia.one.git [12]
Tom

Tx Tom!

Post by Tom Jakubowski
On Sun, Nov 26, 2017 at 2:34 PM Amirouche Boubekki

https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom

Post by Amirouche Boubekki
[3]
[2]

https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/

Post by Amirouche Boubekki
[4]
But, if you have insights about the suject,
don't hesitate to share them.
You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.
You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?
So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?
That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!
What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?
Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.
The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].
[3] https://github.com/pingcap/tikv [5]
According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...
One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.
The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).
I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.
My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.
english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.
Right now culturia.one does store pages using three
Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)
key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org [6] | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr [7] | 1 22 1 12 23 71 175 323 14
There is another table that stores, all the tokens
key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42
Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.
The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1
(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)
Anyway, pretty simple no?
culturia://guile+manual
We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.
[4]

https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455

Post by Amirouche Boubekki
[8]
- fix the tests to run with guix
- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org [9]
dump
- make a program that will index news.ycombinator.com [10] and the
linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org [6] website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.
Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.
Tx!
--
Amirouche ~ amz3 ~ http://www.hyperdev.fr [11]

------
[1] https://a-guile-mind.github.io/culturia.one
[2] https://framagit.org/a-guile-mind/guile-wiredtiger
[3]
https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[4]
https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
[5] https://github.com/pingcap/tikv
[6] http://gnu.org
[7] http://hyperdev.fr
[8]
https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
[9] http://archives.org
[10] http://news.ycombinator.com
[11] http://www.hyperdev.fr
[12] https://github.com/a-guile-mind/culturia.one.git

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

Catonano

2018-01-14 08:12:00 UTC

Permalink

Post by Amirouche Boubekki
Héllo,
I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.
$ git clone https://a-guile-mind.github.io/culturia.one
$ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
$ guix environment --ad-hoc guile-wiredtiger
I stopped trying to understand what makes
a concept search engine [0]. Instead I will
focus on plain old keyword search engine.
I don't even plan to support synonyms [1].
[1] https://www.slideshare.net/lucidworks/implementing-conceptua
l-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2] https://blog.algolia.com/inside-the-engine-part-6-handling-
synonyms-the-right-way/
But, if you have insights about the suject,
don't hesitate to share them.
You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.
You see wiredtiger is the last iteration
(over several decades) of somekind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?
So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?
That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!
What about performance? Well according to my
microbenchmarks it can query 1200 documents
at random in seconds. Which means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and use the raw wiredtiger API
found in (wiredtiger wiredtiger) module and
some helpers in (wiredtiger extra). You can
achieve better performance. Also, did I mention
that those are numbers for single thread access?
Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.
The thing with NOSQL wiredtiger, is that it's not
the kind of NOSQL you might think about. Unlike
REDIS it's not primary in memory, unlike Cassandra
it doesn't spread it's data accross several nodes.
Though there are ideas on how to do that see for
instance TiKV [3].
[3] https://github.com/pingcap/tikv
According to wiredtiger there is no known limitations
in the size of the database or the number of concurrent
threads − provided the underlying hardware can follow...
One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.
The search engine is the idea to have both potentially
a lot of data and a lot of users (compared to my blog).
I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old google or feu altavista.
My point other point, is for most of my search on the
web I don't need often to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me. Outside some rare scholar articles.
english wikipedia and stackoverflow are already almost
100Go so it's bigger than any one can have as blog.
Right now culturia.one does store pages using three
Document table will store information about the document,
uid is unique identifier for the document, url and a scheme
list of token uid (This is stored like that for faster
comparaison)
key | value
-----+-------------+----------------------------
uid | url | document as token uid
-----+-------------+----------------------------
1 | gnu.org | 14 32 51 42 63 74 75 23 113
2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14
There is another table that stores, all the tokens
key | value
-----+--------+-------
uid | token | count
-----+--------+-------
1 | the | 42
Where count, is the count of document where the token
appears at least once. This table as an index on token
column to quikcly retrieve the UID of given TOKEN.
The last index, is the so called inverted-index,
It's a bit special because the key part of the table
has two columns but not bizarre if you work with primary
key | value
-----+----------+-------
token| document | count
-----+------------------
42 | 1 | 1
(I just figured that I never use that count column,
it supposed to be the number of times of the token 42
appears in document 1)
Anyway, pretty simple no?
culturia://guile+manual
We will imagine that we indexed some things that containt
those words (or the engine will throw an exception).
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword.

The least freqent keyword ?
Not the most frequent keyword ?

That way, there is a 'seed' set of documents

Post by Amirouche Boubekki
that we can filter with a small vm that will interpret the
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.
[4] https://github.com/a-guile-mind/culturia.one/blob/master/src
/wiredtiger/ix.scm#L455

file not fond

All this looks pretty interesting but I have to say that I prefer the work
you're doing on GNUNet ;-)

Post by Amirouche Boubekki
- fix the tests to run with guix
- make culturia compatible with guile-wiredtiger found in guix
- write program that will index the whatever wikipedia dump we want
- make a program to index stackoverflow based on archives.org dump
- make a program that will index news.ycombinator.com and the linked
articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
to another search engine.
Conctact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.
Tx!
--
Amirouche ~ amz3 ~ http://www.hyperdev.fr

Amirouche Boubekki

2018-01-14 10:05:29 UTC

Permalink

2017-11-26 23:33 GMT+01:00 Amirouche Boubekki

Post by Amirouche Boubekki
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword.

The least frequent keyword ?
Not the most frequent keyword ?

Yes, imagine you search for serif+font, most common
word and the least discriminant is "font" because there
is (I think) more page containing "font".

The result of the inverted lookup above is used as seed
of the rest of the algorithm that is O(n) so I need to
minimize 'n' ie. the count of initial documents.

Post by Amirouche Boubekki
That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
(filter (hit? (cdr query)) seed)
Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure and the bottom
called 'search' [4] is the where the code starts.

This is badly explained. At this point SEED contains the unique
identifier of document that contains the least frequent word.
We remove it from the query hence the (cdr query) and filter
the SEED with the rest of the query. This is small optimization,
because we know that the least frequent word is already in the
documents found in the SEED, so we do not need to check its
presence in the SEED documents. 'hits?' will return somekind
of state-machine that will check that a given document match
the QUERY passed as argument.

That what I mean to do, the (cdr query) to remove the most
discriminant query term is not implemented, yet.

Post by Amirouche Boubekki
[4]

https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455

Post by Amirouche Boubekki
[8]

file not fond

It's here:
https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439

I reworked the thing to use grf3 graph abstraction to store
the documents.

Also guile-wiredtiger 0.6.4 is in guix.

All this looks pretty interesting but I have to say that I prefer the
work you're doing on GNUNet ;-)

Tx for you interest!

Catonano

2018-01-14 14:12:23 UTC

Permalink

Post by Amirouche Boubekki

2017-11-26 23:33 GMT+01:00 Amirouche Boubekki

Post by Amirouche Boubekki
The quering engine will first compute the frequency of both
keywords and then lookup the inverted index for the least
frequent keyword.

The least frequent keyword ?
Not the most frequent keyword ?

Yes, imagine you search for serif+font, most common
word and the least discriminant is "font" because there
is (I think) more page containing "font".
The result of the inverted lookup above is used as seed
of the rest of the algorithm that is O(n) so I need to
minimize 'n' ie. the count of initial documents.

I see now. Thanks

Post by Amirouche Boubekki

That way, there is a 'seed' set of documents

This is badly explained. At this point SEED contains the unique
identifier of document that contains the least frequent word.
We remove it from the query hence the (cdr query) and filter
the SEED with the rest of the query. This is small optimization,
because we know that the least frequent word is already in the
documents found in the SEED, so we do not need to check its
presence in the SEED documents. 'hits?' will return somekind
of state-machine that will check that a given document match
the QUERY passed as argument.
That what I mean to do, the (cdr query) to remove the most
discriminant query term is not implemented, yet.

Post by Amirouche Boubekki

Post by Amirouche Boubekki
[4]
https://github.com/a-guile-mind/culturia.one/blob/master/src

/wiredtiger/ix.scm#L455

Post by Amirouche Boubekki
[8]

file not fond

It's here: https://github.com/a-guile-mind/culturia.one/blob/master/src
/ix.scm#L439
I reworked the thing to use grf3 graph abstraction to store
the documents.
Also guile-wiredtiger 0.6.4 is in guix.

I know ;-)

At the moment I don't feel confident with guile code calling C code

There's guile-squee thaht needs some love too, that could be a starting
point ofrr me

Also G-golf is very important. GUIs are not optional, they are fundamental,
we should absolutely have a decent integration between Guile and Gnome

When and if I'll know more, I'll take a look at Culturia too :-)

Post by Amirouche Boubekki

All this looks pretty interesting but I have to say that I prefer the
work you're doing on GNUNet ;-)

Tx for you interest!

No, thank you !
GNUNet is also very important, probably more than g-golf, I'm not sure

I can't wait to test drive it