Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-105, October 2017.
Saurabh Kadekodi, Bin Fan*, Adit Madan*, Garth A. Gibson
Carnegie Mellon University
* Alluxio Inc.
The amount of data written to a storage object, its write size, impacts many aspects of cost and performance. Ideal write sizes for cloud storage systems can be radically different from the write, or file, sizes of a particular application. For applications creating a large number of small files, creating one backing store object per small file can not only lead to prohibitively slow write performance, but can also be cost-ineffective because of the current cloud storage pricing model.
This paper proposes a packing, or bundling, layer close to the application that transparently transforms arbitrary user workloads into a write pattern better suited to cloud storage. Implemented as a distributed write-only cache, packing coalesces small files (a few megabytes or smaller) into gigabyte-sized blobs for efficient batched transfers to cloud backing stores. Beyond the performance improvement, packing can yield even larger reductions in cost.
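The coalescing step described above can be illustrated with a minimal in-memory sketch. It is not the Alluxio implementation: the names `Blob`, `pack`, and `read_file`, the greedy fill policy, and the per-blob index layout are all assumptions made for illustration. Each small file is appended to the current blob until a capacity limit would be exceeded, and an index records the offset and length needed to read the file back.

```python
from dataclasses import dataclass, field


@dataclass
class Blob:
    # Concatenated bytes of all files packed into this blob.
    data: bytearray = field(default_factory=bytearray)
    # Hypothetical index layout: file name -> (offset, length) within data.
    index: dict = field(default_factory=dict)


def pack(files, blob_capacity):
    """Greedily pack (name, payload) pairs into blobs of at most blob_capacity bytes.

    A file larger than blob_capacity is placed alone in its own blob.
    """
    blobs = []
    for name, payload in files:
        # Start a new blob if there is none yet or this file would overflow the last one.
        if not blobs or len(blobs[-1].data) + len(payload) > blob_capacity:
            blobs.append(Blob())
        current = blobs[-1]
        current.index[name] = (len(current.data), len(payload))
        current.data.extend(payload)
    return blobs


def read_file(blobs, name):
    """Look up a packed file through the per-blob indexes and return its bytes."""
    for blob in blobs:
        if name in blob.index:
            offset, length = blob.index[name]
            return bytes(blob.data[offset:offset + length])
    raise KeyError(name)
```

In this sketch, each blob would correspond to one object written to the backing store, so the number of store writes drops from one per file to one per blob. A real system would also need to persist the index and handle concurrent writers, which this sketch omits.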
Our packing optimization, implemented in Alluxio (an open-source distributed file system), resulted in a >25000x reduction in data ingest cost for a small-file create workload and a >61x reduction in end-to-end experiment runtime.
KEYWORDS: distributed file systems, cloud storage, packing, indexing