Intro:
A prefix keyword index is a search index for retrieving text documents by the number of exact and prefix keyword matches with a user-specified query. We use it mostly for entity search, comparable to the search box at wikipedia.org.
This differs from merely finding possible continuations of a user-specified query: the query “angela m” should match “Angela Merkel”, but so should a query like “m angel”, which is not a prefix of the name.
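The difference to plain query continuation can be illustrated with a small toy sketch. It is not the search-index implementation: the function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration only.

```python
def keyword_matches(query: str, name: str) -> tuple[int, int]:
    """Count (exact, prefix) keyword matches of a query against a name.

    Assumption: keywords are lowercased, whitespace-separated tokens.
    """
    name_keywords = name.lower().split()
    exact = 0
    prefix = 0
    for keyword in query.lower().split():
        if keyword in name_keywords:
            # the query keyword equals one of the name's keywords
            exact += 1
        elif any(nk.startswith(keyword) for nk in name_keywords):
            # the query keyword is a proper prefix of one of the name's keywords
            prefix += 1
    return exact, prefix


def is_continuation(query: str, name: str) -> bool:
    """Plain continuation check: the name must start with the whole query."""
    return name.lower().startswith(query.lower())


for q in ["angela m", "m angel"]:
    print(q, keyword_matches(q, "Angela Merkel"), is_continuation(q, "Angela Merkel"))
```

Both queries yield keyword matches against “Angela Merkel”, while only “angela m” passes the continuation check, because keyword order does not matter for the keyword-based view.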
Details:
- Improve the prefix keyword index implementation at https://github.com/bastiscode/search-index
- add an in-memory variant for smaller datasets (currently, only memory-mapping from a file is implemented)
- add BM25, count, and TF-IDF scoring methods (or propose others yourself)
- make it more flexible regarding the input data format (currently, only TSV with a fixed layout is supported)
- improve performance in various ways
- add parallelization with rayon
- better handling of short keywords (< 5 characters)
- better re-scoring of very short keywords (< 3 characters)
- use unsafe Rust where sensible for performance gains
- identify potential algorithmic improvements
- improve internal storage format
- find a more performant way to implement the sub_index_by_ids method (currently, a sorted list of allowed ids is checked)
- implement functionality to keep track of matched synonyms for a data point: a single data point can have multiple labels / synonyms, e.g. “Albert Einstein” and “Einstein”, but the current implementation does not report through which synonym a data point was matched
- evaluate and benchmark all changes on real-world datasets (provided)
- optionally, do something similar for the q-gram index
Requirements:
- familiarity with Rust and Python
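To make the scoring methods named under Details more concrete, the following is a minimal sketch of count, TF-IDF, and BM25 scoring for a single data point. It uses the textbook formulas with a smoothed IDF for BM25. How these would be adapted to prefix matches in search-index is left open here, and all function and parameter names are hypothetical.

```python
import math


def count_score(num_matched_keywords: int) -> float:
    """Score by the plain number of matched query keywords."""
    return float(num_matched_keywords)


def tfidf_score(term_freqs: list[int], doc_freqs: list[int], num_docs: int) -> float:
    """Sum of tf * log(N / df) over the matched keywords."""
    return sum(
        tf * math.log(num_docs / df)
        for tf, df in zip(term_freqs, doc_freqs)
    )


def bm25_score(
    term_freqs: list[int],
    doc_freqs: list[int],
    num_docs: int,
    doc_len: int,
    avg_doc_len: float,
    k1: float = 1.2,  # term frequency saturation (common default, an assumption here)
    b: float = 0.75,  # length normalization strength (common default, an assumption here)
) -> float:
    """Textbook BM25 with a smoothed, always positive IDF."""
    score = 0.0
    for tf, df in zip(term_freqs, doc_freqs):
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return score


# One matched keyword occurring once, in a data point of average length
print(bm25_score([1], [1], num_docs=10, doc_len=5, avg_doc_len=5.0))
```

With these formulas, rarer keywords and shorter data points receive higher BM25 scores, which is one reason it may behave differently from pure match counting on entity names.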