On log-structured merge for solid-state drives
Log-structured merge (LSM) is an increasingly prevalent approach to indexing, especially for modern write-heavy workloads. LSM organizes data in levels with geometrically increasing sizes. Records enter the top level; whenever a level fills up, it is merged down into the next level. Hence, the index is updated only through merges, and records are never updated in place. While originally conceived to avoid the slow random accesses of hard drives, LSM also turns out to be especially well suited to solid-state drives, or to any block-based storage with expensive writes. We study how to further reduce writes in LSM. Traditionally, LSM always merges an overflowing level fully into the next. We investigate in depth how partial merges save writes, and we prove bounds on their effectiveness. We propose new algorithms that make provably good decisions on whether to perform a partial merge and, if so, which part of a level to merge. We also show how to further reduce writes by reusing data blocks during merges. Overall, our approach offers better worst-case guarantees and better practical performance than existing LSM variants.
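To make the level structure described above concrete, the following is a minimal sketch of the traditional full-merge policy only; it does not implement the partial-merge or block-reuse algorithms of this work. The class name `ToyLSM`, its parameters `base`, `ratio`, and `num_levels`, and the counter `records_written` are hypothetical names chosen for illustration. The sketch also assumes that levels are plain in-memory dictionaries and that the last level is unbounded.

```python
from typing import Dict, List, Optional


class ToyLSM:
    """Toy LSM index (illustrative only): level i holds at most
    base * ratio**i records, so capacities grow geometrically."""

    def __init__(self, base: int = 2, ratio: int = 2, num_levels: int = 4) -> None:
        self.capacity: List[int] = [base * ratio ** i for i in range(num_levels)]
        # levels[0] is the top level; in this sketch it stands in for a
        # write buffer, and deeper levels stand in for on-storage data.
        self.levels: List[Dict[str, str]] = [{} for _ in range(num_levels)]
        # Counts records rewritten by merges, a rough proxy for write cost.
        self.records_written = 0

    def put(self, key: str, value: str) -> None:
        """New records always enter the top level."""
        self.levels[0][key] = value
        self._merge_overflows()

    def _merge_overflows(self) -> None:
        """Full merge: an overflowing level is merged entirely into the next.
        The last level is never merged further (assumed unbounded here)."""
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity[i]:
                # Entries from the upper (newer) level override older ones.
                merged = {**self.levels[i + 1], **self.levels[i]}
                self.records_written += len(merged)
                self.levels[i + 1] = merged
                self.levels[i] = {}

    def get(self, key: str) -> Optional[str]:
        """Search from the top level down, so the newest version wins."""
        for level in self.levels:
            if key in level:
                return level[key]
        return None
```

A caller might construct `ToyLSM(base=2, ratio=2)`, insert a stream of keys with `put`, and then inspect `records_written` to see how many records the cascading full merges rewrote. A partial-merge policy would replace the body of `_merge_overflows` with one that moves only a chosen part of the overflowing level.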