Experiments in writing a full text search engine

Written by Omar Polo on 14 April 2022 while listening to “Weird Fishes / Arpeggi” by Radiohead.

I'm interested in how full text search (FTS from here onwards) works. FTS is the act of searching a collection of documents for some arbitrary text. I've been using elasticsearch and I remember playing a bit with solr, but nothing beats trying to write something from scratch to understand how things work.

Here I'm doing a walkthrough of what I wrote, which is incredibly naive of course, explaining what choices I made, what I think could be improved, and which parts could be expanded.

To keep it simple I decided to index only a small amount of "documents" — the OpenBSD package collection and the abstracts of all the English Wikipedia pages — into a custom database. Later, this database is used by another utility to run the queries. For simplicity, the database is read-only and needs to be recomputed when the source documents change.

The code is available on Codeberg or in the GitHub mirror.

It was inspired by a previous post I read:

An overview

The main thing I've learned is that FTS is no black magic, of course, but there are so many things you can customize when building a system that it may feel like it.

The general idea is to index the text somehow and then use one or more indexes to find the queried text.

What most engines do — or at least I think they do — is first split the text into tokens (words), then apply some transformations to them, and finally insert them into an index that maps each word to the documents it appears in. Later, most of this procedure is applied to the user query too, and the index is consulted to find the matching documents. Or at least this is the abstract idea; in practice bigger engines probably have more than one index.

Some of the common transformations done after the tokenizer are:

  • lowercasing: so that case-insensitive searches are easy to do.
  • stop word removal: the "stop words" are very common words (like "the" or "a" in English) whose omission likely won't impact the ability of the system to provide good results, but avoids huge entries in the index.
  • stemming: reducing inflected/derived words to a common "base" form (not the morphological root of the word, more like a syntactical one).

Not all of these steps need to be done in a real engine, and more can be added. I'm limiting myself to lowercasing only.

Stop word removal is just a workaround for huge indexes and limits the kinds of queries a user can do (imagine searching for "to be or not to be" or "The Who").
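
Even though my engine skips this step, a stop word filter can be tiny. Here's a sketch; the word list is obviously not exhaustive and these helpers are hypothetical, not something from the repository:

#include <stdlib.h>
#include <string.h>

static const char *stopwords[] = {
	/* must be kept sorted for bsearch; a real list would be longer */
	"a", "an", "and", "in", "of", "or", "the", "to",
};

static int
cmpword(const void *key, const void *memb)
{
	return strcmp(key, *(const char **)memb);
}

/* returns non-zero if the (already lowercased) token is a stop word */
static int
is_stopword(const char *tok)
{
	return bsearch(tok, stopwords,
	    sizeof(stopwords)/sizeof(stopwords[0]),
	    sizeof(stopwords[0]), cmpword) != NULL;
}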

Stemming is an interesting step though. It's used to map variations of the same word to the same sequence of characters, not necessarily a valid word. For example, the Porter algorithm maps argue, argued, argues and arguing to "argu". This way, documents containing only "argued" will match for "arguing" too.
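
Just to make the idea concrete, here's a toy suffix stripper that happens to produce "argu" for all four of those words. It is nothing like the real Porter algorithm (which has rules about the measure of the stem and handles many more cases); it's only here to illustrate the kind of transformation a stemmer does:

#include <string.h>

/* toy stemmer: strips one common English suffix in place; NOT Porter */
static void
toystem(char *w)
{
	static const char *sufx[] = { "ing", "ed", "es", "e", "s" };
	size_t i, len, slen;

	len = strlen(w);
	for (i = 0; i < sizeof(sufx)/sizeof(sufx[0]); ++i) {
		slen = strlen(sufx[i]);
		/* keep at least three characters of the stem */
		if (len > slen + 2 && !strcmp(w + len - slen, sufx[i])) {
			w[len - slen] = '\0';
			return;
		}
	}
}

Running it over "argue", "argued", "argues" and "arguing" leaves "argu" in every case, so all of them would end up in the same index entry.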

Decent engines will also rank the documents. Ranking means guessing how relevant a matching document is to the user query and sorting the results accordingly. Spoiler: mine isn't a decent engine yet, so ranking won't be covered here, but hopefully I'll manage to in a follow-up post.

After a document has been processed, it is inserted into some kind of database that can then be queried somehow.

Tokenize

#define WDELIMS "..." /* everything but a-zA-Z */

char **
tokenize(const char *s)
{
	char *d, *dup, *t, **tok = NULL;
	void *newtok;
	size_t cap = 0, len = 0, newcap;

	if ((dup = strdup(s)) == NULL)
		return NULL;
	d = dup;

	for (t = d; *t; ++t)
		*t = tolower((unsigned char)*t);

	while ((t = strsep(&d, WDELIMS)) != NULL) {
		if (*t == '\0')
			continue;

		/* keep the space for a NULL terminator */
		if (len+1 >= cap) {
			newcap = cap * 1.5;
			if (newcap == 0)
				newcap = 8;
			newtok = recallocarray(tok, cap, newcap,
			    sizeof(char *));
			if (newtok == NULL)
				goto err;
			tok = newtok;
			cap = newcap;
		}

		if ((tok[len++] = strdup(t)) == NULL)
			goto err;
	}

	free(dup);
	if (tok == NULL)
		return calloc(1, sizeof(char *));
	return tok;

err:
	freetoks(tok);
	free(dup);
	return NULL;
}

My tokenizer is dead easy: it lowercases the text and splits it into words delimited by anything but a-zA-Z. Thus, the sentence "X11 Window Manager" becomes "x", "window", "manager". As a bonus, assuming the text is UTF-8, other European languages work out of the box.
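
As a side note, the code above (and the query code later) uses a freetoks() helper that I'm not showing; a minimal version, together with a quick test of the tokenizer, could look like this (a sketch, the one in the repository may differ):

#include <stdio.h>
#include <stdlib.h>

/* free a NULL-terminated array of tokens, like the one tokenize() returns */
void
freetoks(char **toks)
{
	char **t;

	if (toks == NULL)
		return;
	for (t = toks; *t != NULL; ++t)
		free(*t);
	free(toks);
}

int
main(void)
{
	char **toks, **t;

	if ((toks = tokenize("X11 Window Manager")) == NULL)
		return 1;
	for (t = toks; *t != NULL; ++t)
		printf("%s\n", *t);	/* prints "x", "window" and "manager" */
	freetoks(toks);
	return 0;
}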

While the tokenizing step may seem simple, generally speaking it really isn't. Programmatically tokenizing languages like Chinese or Japanese is a real challenge!

The dictionary

My indexer builds an in-memory dictionary of words and the documents they appear in, which is then used to generate the database.

The design of it is really simple:

struct dict_entry {
	char	*word;
	int	*ids;
	size_t	 len;
	size_t	 cap;
};

struct dictionary {
	size_t	len;
	size_t	cap;
	struct dict_entry *entries;
};

A dictionary is just a sorted array of "dictionary entries": a word and the list of ids of the documents that contain that word.

Inserting a word into the dictionary is straightforward: use binary search to locate the word in the dictionary; if it's already present then just push the document id into its list, otherwise the word needs to be inserted in the dictionary.
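
A sketch of that insertion routine, assuming the two structures above (dict_add is a hypothetical name, and error handling and the growth policy are simplified with respect to the real code):

#include <stdlib.h>
#include <string.h>

int
dict_add(struct dictionary *d, const char *word, int docid)
{
	struct dict_entry *e;
	size_t mid, lo = 0, hi = d->len, newcap;
	void *t;
	char *w;
	int r;

	/* binary search for the word; `lo' ends up at the insertion point */
	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		r = strcmp(word, d->entries[mid].word);
		if (r == 0) {
			e = &d->entries[mid];
			goto found;
		}
		if (r < 0)
			hi = mid;
		else
			lo = mid + 1;
	}

	/* not found: grow the array if needed and shift to make room */
	if (d->len == d->cap) {
		newcap = d->cap ? d->cap * 2 : 64;
		t = reallocarray(d->entries, newcap, sizeof(*d->entries));
		if (t == NULL)
			return -1;
		d->entries = t;
		d->cap = newcap;
	}
	if ((w = strdup(word)) == NULL)
		return -1;
	memmove(&d->entries[lo + 1], &d->entries[lo],
	    (d->len - lo) * sizeof(*d->entries));
	d->len++;
	e = &d->entries[lo];
	memset(e, 0, sizeof(*e));
	e->word = w;

found:
	/* the same document often repeats a word: skip consecutive dups */
	if (e->len > 0 && e->ids[e->len - 1] == docid)
		return 0;
	if (e->len == e->cap) {
		newcap = e->cap ? e->cap * 2 : 8;
		t = reallocarray(e->ids, newcap, sizeof(*e->ids));
		if (t == NULL)
			return -1;
		e->ids = t;
		e->cap = newcap;
	}
	e->ids[e->len++] = docid;
	return 0;
}

The memmove needed to keep the entries sorted is, as we'll see at the end of the post, where most of the indexing time ends up going.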

The database

Once every document has been processed and its words added to the dictionary, the processing is done. I'm saving all the information gathered during the previous steps in a custom database to avoid having to re-index all the documents for every search.

I'm not sure how a real FTS engine like Lucene manages the storage on disk, but I went with a really simple approach: a single binary file. At the start there is the index: a sequence of entries, each made of 32 bytes for the word followed by the offset at which its list of documents is stored. This makes it possible to just memory-map the database and do a binary search on the index.

Then there are the lists of document ids, one for each word. Each list has a variable length, which is why they're not stored in the main index.

At the end of the file there is another table with the document names and descriptions. It's included so that the querying program can just look there to generate a nice-looking results listing.

,______.________.
| word | offset |
| word | offset |
| ...  | ...    |
`------'--------'
[doc ids for word #1] [doc ids
for word #2] [doc ids for word
#3] [doc ids for word #4]  ...
[document #1 name and description]
[document #2 name and description]
...
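
To show how the query side uses this layout, here's a sketch of the lookup over the memory-mapped index. The exact on-disk types are an assumption (in particular the width of the offset field), not necessarily what the real database format uses:

#include <stdint.h>
#include <string.h>

/* assumed on-disk index entry: a NUL-padded 32-byte word plus the
 * offset of its document list inside the file */
struct idx_entry {
	char		word[32];
	uint32_t	offset;
};

/*
 * Binary search the memory-mapped index for `word'; returns the
 * matching entry or NULL if the word is not in the database.
 */
static const struct idx_entry *
idx_lookup(const struct idx_entry *idx, size_t nentries, const char *word)
{
	size_t lo = 0, hi = nentries, mid;
	int r;

	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		r = strncmp(word, idx[mid].word, sizeof(idx[mid].word));
		if (r == 0)
			return &idx[mid];
		if (r < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}

The offset then points at that word's list of document ids, while the table at the end of the file is consulted only when printing the results.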

Running simple queries

Finding the documents that match "file manager" is then just a matter of looking up the list of documents that contain "file" and intersecting it with the list of documents for "manager".

This is not the only kind of query that is possible; phrase queries and more general boolean queries are interesting too, but for the moment I'm limiting myself to these simple AND queries.

For the record, here's the whole `fts' routine:

struct doclist {
	uint32_t	*ids;
	size_t		 len;
};

int
fts(struct db *db, const char *query, db_hit_cb cb, void *data)
{
	struct doclist *xs = NULL;
	size_t i, len;
	char **toks, **t;
	int ret = 0;

	if ((toks = tokenize(query)) == NULL)
		return -1;

	len = 0;
	for (t = toks; *t != NULL; ++t)
		len++;

	if (len == 0)
		goto done;

	if ((xs = calloc(len, sizeof(*xs))) == NULL) {
		freetoks(toks);
		return -1;
	}

	for (i = 0; i < len; ++i) {
		xs[i].ids = db_word_docs(db, toks[i], &xs[i].len);
		if (xs[i].ids == NULL || xs[i].len == 0)
			goto done;
	}

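	/* intersect the sorted per-word document lists */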
	for (;;) {
		struct db_entry e;
		uint32_t mdoc;

		mdoc = xs[0].ids[0];
		for (i = 1; i < len; ++i) {
			if (xs[i].ids[0] > mdoc)
				goto next;
			while (xs[i].ids[0] < mdoc) {
				if (--xs[i].len == 0)
					goto done;
				xs[i].ids++;
			}

			if (xs[i].ids[0] != mdoc)
				goto next;
		}

		if (db_doc_by_id(db, mdoc, &e) == -1) {
			ret = -1;
			goto done;
		}

		if (cb(db, &e, data) == -1) {
			ret = -1;
			goto done;
		}

	next:
		if (--xs[0].len == 0)
			goto done;
		xs[0].ids++;
	}

done:
	free(xs);
	freetoks(toks);

	return ret;
}

The (only?) interesting part is maybe the inner loop. It may smell like black magic, but it's a neat solution for intersecting a variable number of sorted numeric arrays without using any additional memory. The idea is to pick the first element of the first array and then go through the other arrays:

  • If the first element of the nth array is greater than the candidate we picked, drop the candidate (see the code after the `next' label) and try again with the next element of the first array.
  • If the first element of the nth array is smaller than the candidate, drop it, since it can't be in the intersection.
  • When any of the arrays runs out of elements, we're done.

This works because the lists of document ids are guaranteed to be already sorted.

A possible improvement would be to sort the words by frequency so that the least common one comes first: it may reduce the number of times the loop runs.
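
That could be done by sorting the per-word document lists by length (short list, rare word) before entering the loop. A sketch, reusing the doclist struct from the code above:

/* sort the per-word lists so the shortest, i.e. the one for the
 * rarest word, comes first and drives the intersection */
static int
doclist_cmp(const void *a, const void *b)
{
	const struct doclist *x = a, *y = b;

	if (x->len < y->len)
		return -1;
	return x->len > y->len;
}

Then a single qsort(xs, len, sizeof(*xs), doclist_cmp) call in fts(), right after the lists have been loaded, would do the trick.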

Results and limits

The repository includes two command line utilities:

  • mkftsidx builds an FTS database from the OpenBSD packages or from a dump of the abstracts of the Wikipedia pages.
  • ftsearch runs the queries and prints the results.

Even if the whole implementation is really naive, the results aren't that bad:

% ls -lah db.wiki
-rw-r--r--  1 op  op  92.9M Apr 14 11:19 db.wiki
% time ./obj/ftsearch -d db.wiki 'small wild cat'
https://en.wikipedia.org/wiki/Wildcat Wildcat
https://en.wikipedia.org/wiki/Catopuma Catopuma
    0m00.02s real     0m00.00s user     0m00.06s system

It took 0.02 seconds to query the 90M database; for comparison, grep takes 0.21 seconds, so it's a nice ~10x speedup!

% time grep cat db.wiki > /dev/null
    0m00.21s real     0m00.21s user     0m00.15s system

The selected documents aren't that bad either; let's find, for example, all the turn-based strategy games in the ports tree:

% time ./obj/ftsearch 'turn-based strategy game'
1oom               game engine recreation of Master of Orion 1
freeciv-server     Civilization clone for X11; multiplayer; game server
freecol            free Colonization clone
lgeneral           turn-based strategy engine
ufoai              squad-based tactical strategy game
wesnoth            fantasy turn-based strategy game
    0m00.00s real     0m00.00s user     0m00.01s system

Well, these are probably not the only ones, but without fancy things like stemming we can only do so much and match exact words in the ports' comments and descriptions.

However, the database is not really fast to generate: mkftsidx takes more than three minutes to index the 947M enwiki-latest-abstract1.xml Wikipedia dump.

Profiling shows that 67% of the time is wasted in memmove(3): keeping the dictionary in a sorted array wasn't a great idea after all. Some kind of hash map or tree may lead to better results. Also, with huge amounts of data we may not be able to memory-map the whole index, so for big collections of documents something like a B+ tree may be the answer.

Further improvements include:

  • storing statistical data about the words: less popular words are more interesting and can help in ranking the results.
  • gathering per-document statistics: a document that mentions a certain word many times may be more relevant than another one that only mentions it once.
  • per-document ranking: in some use cases we may already have an idea of "how good" a document is, and the ranking can reflect that.
  • saving the word positions in the documents: this makes it possible to see how close some words are without consulting the document, at the cost of a bigger/additional index. It helps with ranking too.

In a future post I'd like to extend this small framework to cover some simple ranking solutions, and maybe even some simple stemming.