LinkedIn has open sourced PalDB, an embeddable, read-only key-value store that the company reports to be 8 times faster than LevelDB while using several times less memory than a HashSet.
PalDB is a write-once key-value store written in Java and open sourced by LinkedIn. After the store is created, all operations against it are read-only. Its purpose is to speed up read operations and lower the memory footprint. LinkedIn recommends it for storing side data, which the company defines as “the extra read-only data needed by a process to do its job. For instance, a list of stop words used by a natural language processing algorithm is side data.”
PalDB is embeddable, does not use a schema, and keeps its data in a single binary file. It offers random data access via an API.
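To make the write-once model concrete, the following is a minimal Java sketch of a store that is written in one pass to a single binary file and then opened read-only for random lookups. It uses only the JDK and is not PalDB's API or file format; the class name WriteOnceStoreSketch, its methods, and the length-prefixed layout are all assumptions made for illustration.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not PalDB's file format or API.
final class WriteOnceStoreSketch implements AutoCloseable {
    private final RandomAccessFile file;
    // Maps each key to the file offset of its length-prefixed value.
    private final Map<String, Long> valueOffsets = new HashMap<>();

    // Write phase: dump every entry into one binary file, then stop writing.
    static void write(Path path, Map<String, String> entries) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(path))) {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                writeBytes(out, entry.getKey());
                writeBytes(out, entry.getValue());
            }
        }
    }

    private static void writeBytes(DataOutputStream out, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read phase: open read-only and scan once to record where each value starts.
    WriteOnceStoreSketch(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "r");
        int count = file.readInt();
        for (int i = 0; i < count; i++) {
            String key = new String(readBytes(), StandardCharsets.UTF_8);
            valueOffsets.put(key, file.getFilePointer());
            int valueLength = file.readInt();
            file.seek(file.getFilePointer() + valueLength); // skip the value for now
        }
    }

    private byte[] readBytes() throws IOException {
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        return bytes;
    }

    // Random access: seek straight to the value; returns null for unknown keys.
    String get(String key) throws IOException {
        Long offset = valueOffsets.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        return new String(readBytes(), StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("side-data", ".bin");
        Map<String, String> stopWords = new HashMap<>();
        stopWords.put("the", "stop word");
        stopWords.put("a", "stop word");
        write(path, stopWords);
        try (WriteOnceStoreSketch store = new WriteOnceStoreSketch(path)) {
            System.out.println(store.get("the"));
        }
        Files.delete(path);
    }
}
```

The sketch keeps every key in an on-heap HashMap, so it shows the access pattern only and none of the memory savings reported for PalDB.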
Because it is optimized for read operations, its performance is comparable to that of in-memory data structures such as HashMap or HashSet, according to LinkedIn, but it takes significantly less space in memory, one of the main benefits the company was looking for when designing it. For example, a HashSet with 100M keys needs over 500MB, while PalDB takes about 80MB. Similarly, 35M member IDs need 1.8GB of RAM in a HashSet compared to 290MB in PalDB. Data can be compressed in PalDB using Snappy for an even smaller footprint.
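The two comparisons above work out to roughly the same saving. The short Java calculation below makes that explicit; it assumes decimal units (1GB = 1000MB) and treats "over 500MB" as exactly 500MB, so the first ratio is a lower bound. The class and method names are illustrative.

```java
final class FootprintRatios {
    // How many times larger the HashSet footprint is than the PalDB footprint.
    static double ratio(double hashSetMb, double palDbMb) {
        return hashSetMb / palDbMb;
    }

    public static void main(String[] args) {
        // Figures as quoted in the article, converted to megabytes.
        System.out.printf("100M keys: %.2fx%n", ratio(500, 80));
        System.out.printf("35M member IDs: %.2fx%n", ratio(1800, 290));
    }
}
```

Both ratios come out a little above 6, which is the "several times less memory" quoted for PalDB.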
In terms of speed, a test performed by LinkedIn shows PalDB performing 2M reads/s, which is 6 times faster than HashSet and 8 times faster than LevelDB or RocksDB, on a 3.1 GHz MacBook Pro with a 10M-key index.
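A figure such as 2M reads/s is simply the number of lookups divided by the elapsed time. The sketch below shows one rough way to take such a measurement in Java against a HashSet; it is not LinkedIn's benchmark, and the class name, key count and lookup count are assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

final class ReadThroughputSketch {
    // Times `lookups` membership checks over keys 0..keyCount-1 and returns reads per second.
    static double readsPerSecond(IntPredicate contains, int keyCount, int lookups) {
        long hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < lookups; i++) {
            if (contains.test(i % keyCount)) {
                hits++;
            }
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        // Using the hit count keeps the loop from being optimized away.
        if (hits != lookups) {
            throw new IllegalStateException("unexpected miss");
        }
        return lookups / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        int keyCount = 1_000_000; // assumed size, smaller than the 10M-key index in the article
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < keyCount; i++) {
            keys.add(i);
        }
        readsPerSecond(keys::contains, keyCount, 5_000_000); // warm-up pass, result discarded
        System.out.printf("%.0f reads/s%n", readsPerSecond(keys::contains, keyCount, 5_000_000));
    }
}
```

Numbers from a hand-rolled loop like this vary with JIT warm-up, access pattern and hardware; a harness such as JMH is better suited to serious comparisons.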
PalDB was optimized for memory access. Keeping the data on disk will result in considerably poorer performance. While there is no limit on the size of the data, the size of the index is limited to 2GB. It is also important to know that PalDB is not thread safe.
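Because the store is not thread safe, code that reads it from several threads has to coordinate access itself. One common way to do that is to serialize every lookup behind a lock, sketched below in plain Java. The SynchronizedLookup class is hypothetical and not part of PalDB, and a HashMap stands in for the store reader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper, not part of PalDB: lets several threads share one
// lookup function that is not thread safe by serializing every call.
final class SynchronizedLookup<K, V> {
    private final Function<K, V> delegate;
    private final Object lock = new Object();

    SynchronizedLookup(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    // Only one thread at a time may run the underlying lookup.
    V get(K key) {
        synchronized (lock) {
            return delegate.apply(key);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A plain HashMap stands in for the store reader in this demo.
        Map<String, String> stopWords = new HashMap<>();
        stopWords.put("the", "stop word");
        SynchronizedLookup<String, String> lookup = new SynchronizedLookup<>(stopWords::get);

        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + " -> " + lookup.get("the"));
        Thread first = new Thread(task, "reader-1");
        Thread second = new Thread(task, "reader-2");
        first.start();
        second.start();
        first.join();
        second.join();
    }
}
```

The trade-off is that a single lock makes all reads sequential, which gives up some of the read throughput the store is designed for.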