BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Space-Efficient Full-Text Search with Rust and WebAssembly

Space-Efficient Full-Text Search with Rust and WebAssembly

This item in japanese

Matthias Endler, backend engineer for Trivago, published a client-side full-text search engine designed for space efficiency by leveraging Bloom filters. Tinysearch is written in Rust, transpiled to WebAssembly, and used in the browser. Tinysearch claims sizes between 50 and 100KB and can only index full words.

Endler explained in a release blog post the motivation and design goals of Tinysearch:

Static websites don’t come with “static” search engines, too. […] I didn’t want to add yet another dependency on Google [search]; neither did I want to use a stand-alone web-backend like Algolia, which adds latency and is proprietary. […] Each [index] that lunr creates can be multiple megabytes in size. […] On top of that, parsing JavaScript is still time-consuming.
[…]
I wanted some simple, lean, and self-contained search, that could be deployed next to my other static content.

Endler created a prototype in early 2018 by porting existing Python code leveraging Bloom filters to Rust. A Bloom filter is a data structure designed to compute rapidly and memory-efficiently whether an element is present in a set. The memory-efficiency is achieved by relaxing the demand for accuracy of the computation: a Bloom filter is a probabilistic data structure that allows for false positive matches while disallowing false negatives. This means that a Bloom filter can tell whether an element either definitely is not in the set or may be in the set. A research paper quantifies the storage efficiency that can be reached thanks to Bloom filters:

In return, a Bloom filter offers very compact storage: less than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.

The 1% false-positive rate can be reduced by a factor of ten by adding 5 bits per element. The time necessary to check the presence of an item in a set is constant independently of the number of items in the set.

Endler explained that having a space-efficient indexing method was only one part. Bringing the functionality to the browser with WebAssembly was the other part and presented a few difficulties. While a Bloom-filter-based index would weigh in one instance as little as 44 KB, the client-side code (which includes the search functionality and the helper code) would weigh in at 216 KB using vanilla wasm-pack. Such a size, even taking into account the fact that WebAssembly is parsed and compiled faster than vanilla JavaScript, can lead to a download time that can hamper the user experience.

Endler detailed how miscellaneous tools helped reduced the size to an acceptable baseline. Using available compiling options for size optimization brought the wasm binary down to 196KB. Leveraging Rust allocator wee_alloc further reduced the size to 188KB. Using Wasm optimizer binaryen resulted in a size of 154 KB. Removing web-sys, which unnecessarily includes bindings for accessing the DOM resulted in a lower size of 153 KB.

At that point, the Bloom filter was taking about half the size of the binary. Further reduction in size came from optimizing the Bloom filter parameters and volume to index. Endler decided to remove stop words and index only full words, eventually reaching a total binary size of 121KB, or 51KB gzipped.

Additionally, Tinysearch download can be deferred to be loaded only when a user clicks into the search field which embodies the full-text search functionality. This further reduces the impact on user experience when loading the page including the search field.

Tinysearch is an example of the possibilities offered by the Rust/Wasm stack today to reuse existing software written in other languages than JavaScript in the browser. Endler account showcases that the generated binary may need to be optimized, and that a range of tools and techniques exist to do so.

Endler concluded the release blog post with a summary of the results of the optimization efforts:

I could finally have a small, static search without the need for a backend! […]
The final Wasm code is laser-fast because we save the roundtrips to a search-server. The instant feedback loop feels more like filtering a list than searching through files.

Demo

The full blog post is available online and goes into further details on the optimization process and capabilities of Tinysearch. Tinysearch is MIT-licensed and targets primarily static websites.

Rate this Article

Adoption
Style

BT