SQL engines such as Presto, Apache Spark SQL, and Apache Hive consume data structured as tables of rows and columns, whereas filesystems organize and serve data as files and directories. As a result, there is often a mismatch between SQL engines and the storage systems beneath them. The situation is analogous to a conversation between two people who speak different languages: for one to understand the other, a translator must always be present. The cost of this translation grows with data scale, since every piece of data read must be converted before the engine can consume it, and every computed result must be converted back before it is stored.
In this talk, I will go over the challenges created by the mismatch between SQL engines and storage systems and introduce a solution, using Alluxio as an example. Alluxio is an open source data orchestration system that sits between compute and storage to deliver physical data independence, in which the SQL engines' logical access to data is independent of the physical format of the stored data.