Understand the caching mechanisms for the popular distributed SQL engine and how to use them to improve query speed and efficiency.
Caching frequently accessed data allows Presto to retrieve results from faster and closer caches rather than scanning slower storage. Presto has two types of caches: built-in caches and third-party caches.
The built-in caches include the metastore cache, the list file status cache, and the Alluxio SDK cache. Their main benefits are very low latency and no network overhead, because data is cached locally within the Presto cluster.
Third-party caches, such as the Alluxio distributed cache, are independently deployable and offer better scalability and increased cache capacity. They can use memory, SSD, or HDD as the cache medium.
None of Presto's caches are enabled by default. Presto's metastore cache stores Hive metastore query results in memory for faster access.
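To illustrate the idea of keeping metastore query results in memory, here is a minimal sketch of a time-limited in-memory cache. It is not Presto's implementation: the class name `TtlCache`, the loader `fetch_table_from_metastore`, and the 60-second lifetime are all hypothetical, chosen only to show how repeated lookups can avoid a remote call.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """In-memory cache whose entries expire after ttl_seconds (illustrative only)."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # function that performs the slow remote lookup
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            # Entry exists and has not expired: serve it from memory.
            self.hits += 1
            return entry[1]
        # Missing or expired: call the loader and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical stand-in for a remote Hive metastore call.
def fetch_table_from_metastore(table_name: str) -> dict:
    return {"table": table_name, "columns": ["id", "amount"]}


cache = TtlCache(fetch_table_from_metastore, ttl_seconds=60.0)
cache.get("sales.orders")  # miss: goes to the (stand-in) metastore
cache.get("sales.orders")  # hit: served from memory
print(cache.hits, cache.misses)  # prints: 1 1
```

The expiry time is the trade-off to tune in a cache like this: a longer lifetime saves more metastore calls but serves stale metadata for longer.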
The list file status cache substantially improves query latency when the HDFS namenode is overloaded or object stores have poor file listing performance. When the list file status cache is enabled, the Presto coordinator caches file lists in memory for faster access to frequently used data, reducing lengthy remote listFile calls.
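The following sketch shows one way a coordinator-side file list cache could avoid repeated remote listing calls. It is a simplified illustration, not Presto's code: the class `ListFileStatusCache` and the callbacks `list_files` and `is_sealed` are hypothetical names. In this sketch, only directories reported as sealed are kept in the cache, so open partitions are always listed fresh.

```python
from typing import Callable, Dict, List


class ListFileStatusCache:
    """Caches directory listings in memory, but only for sealed directories."""

    def __init__(self, list_files: Callable[[str], List[str]],
                 is_sealed: Callable[[str], bool]) -> None:
        self._list_files = list_files    # slow remote listing (HDFS, S3, ...)
        self._is_sealed = is_sealed      # True if the directory no longer changes
        self._listings: Dict[str, List[str]] = {}
        self.remote_calls = 0

    def list(self, directory: str) -> List[str]:
        cached = self._listings.get(directory)
        if cached is not None:
            return cached                # served from coordinator memory
        self.remote_calls += 1
        files = self._list_files(directory)
        if self._is_sealed(directory):
            # Safe to cache: the contents of a sealed directory will not change.
            self._listings[directory] = files
        return files
```

With this shape, repeated queries over a sealed partition trigger one remote listing, while an open partition is listed on every call.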
Note that the list file status cache can be applied only to sealed directories, as Presto skips caching open partitions to ensure data freshness. The Alluxio SDK cache is a Presto built-in cache that reduces table scan latency.
The Alluxio SDK cache is particularly beneficial for querying remote data like cross-region or hybrid cloud object stores. Soft-affinity scheduling attempts to send requests to workers based on file paths, maximizing cache hits by locating data in worker caches.
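A rough sketch of the soft-affinity idea: hash the file path to pick a preferred worker, so that repeated reads of the same file tend to land where the data is already cached, and fall back to another worker when the preferred one is busy. The function names, the `max_load` threshold, the fallback order, and the example path are assumptions made for illustration, not Presto's actual scheduler logic.

```python
import hashlib
from typing import Dict, List


def stable_hash(path: str) -> int:
    """Hash a file path to an integer that is stable across processes."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_worker(path: str, workers: List[str], load: Dict[str, int],
                max_load: int) -> str:
    """Choose a worker for a split, preferring the one derived from the path."""
    if not workers:
        raise ValueError("no workers available")
    start = stable_hash(path) % len(workers)
    # Try the preferred worker first, then the next one as a secondary choice.
    for offset in range(min(2, len(workers))):
        candidate = workers[(start + offset) % len(workers)]
        if load.get(candidate, 0) < max_load:
            return candidate
    # Both preferred workers are busy: give up affinity and balance load instead.
    return min(workers, key=lambda w: load.get(w, 0))


workers = ["worker-1", "worker-2", "worker-3"]
idle = {w: 0 for w in workers}
# Hypothetical file path; the same path maps to the same worker while it has capacity.
path = "s3://example-bucket/orders/part-0000.parquet"
assert pick_worker(path, workers, idle, max_load=4) == pick_worker(path, workers, idle, max_load=4)
```

The affinity is "soft" because the scheduler gives up cache locality when the preferred workers are overloaded, rather than queueing work behind them.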
The Alluxio distributed cache is one example of a third-party cache. As you can see in the diagram below, the Alluxio distributed cache is deployed between Presto and storage like S3.
Alluxio uses a master-worker architecture where the master manages metadata and workers manage cached data on local storage. On a cache hit, the Alluxio worker returns data to the Presto worker.
On a cache miss, the Alluxio worker retrieves the data from persistent storage and caches it for future use. In this article, we have introduced different caching mechanisms in Presto, including the metastore cache, the list file status cache, the Alluxio SDK cache, and the Alluxio distributed cache.
As summarized in the table below, you can use these caches to accelerate data access based on your use case.
[Table: cache type and when to use it. Recoverable entries: list file status cache, for an overloaded HDFS namenode or an overloaded object store like S3; Alluxio SDK cache.]