War between Data Formats

Posted on March 9th, 2017 in Hadoop, Storage Formats by Rana Faisal Munir

Big data introduces many challenges, one of which is how to physically store data for faster access. To this end, researchers have proposed many data formats that store data in different layouts, each tuned for particular workloads. However, deciding which format is best for a given workload remains difficult. In this article, I present recent research on these formats, covering research papers, benchmarks, and videos.

Research Papers:

These are some research papers on data formats in big data systems. You can observe a trend: storage first moved from plain-text formats to binary ones, and then, within binary formats, from row-oriented to columnar layouts.

D. J. Abadi, S. R. Madden, and N. Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? In SIGMOD, 2008.
A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. In SOCC, 2011.
Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011.
A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-Oriented Storage Techniques for MapReduce. In VLDB, 2011.
I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A Hands-free Adaptive Store. In SIGMOD, 2014.
T. Xu, D. Wang. KCGS-Store: A Columnar Storage Based On Group Sorting Of Key Columns. In Cloud, 2016.
R. F. Munir, O. Romero, A. Abelló, B. Bilalli, M. Thiele, and W. Lehner. ResilientStore: A Heuristic-based Data Format Selector for Intermediate Results. In MEDI, 2016.
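
To make the row-versus-columnar shift in these papers concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers or formats listed; the records, the field names, and the "values touched" counter are all hypothetical and only approximate how much data a scan would have to read from each layout.

```python
# Hypothetical toy records; field names and values are illustrative only.
rows = [
    {"id": 1, "country": "ES", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "ES", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, field):
    """Sum one field from a row layout.

    Assumption: a row layout stores whole records together, so we count
    every field of every record as touched, even though only one is used.
    """
    total, touched = 0.0, 0
    for row in rows:
        touched += len(row)
        total += row[field]
    return total, touched


def sum_from_columns(columns, field):
    """Sum one field from a columnar layout; only that column is touched."""
    values = columns[field]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_touched = sum_from_rows(rows, "amount")
col_total, col_touched = sum_from_columns(columns, "amount")
print("row layout:     ", row_total, "values touched:", row_touched)
print("columnar layout:", col_total, "values touched:", col_touched)
```

Both layouts give the same sum, but under this simplified model the row layout touches every field of every record while the columnar layout touches only the requested column. Real formats add encoding, compression, and block-level metadata on top of this basic idea, which is why measured results depend on the workload.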

Performance Comparisons:

There are some very good existing benchmarks of these formats. They show how each format performs and where it pays off across different workloads.

CERN compares two data formats (Avro and Parquet) with two storage engines (HBase and Kudu). They conclude that Parquet and Kudu are good for analytical workloads. [https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines]
SVDS compares several data formats, including plain text, Sequence Files, Avro, Parquet, and ORC. Their results show that Avro is good for scan-based workloads, whereas Parquet and ORC are good for OLAP workloads. [http://www.svds.com/dataformats/]
Hortonworks also benchmarks JSON, Avro, ORC, and Parquet. Their presentation is available on SlideShare. [http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet]
Huawei, together with other companies, introduced a new file format, Apache CarbonData. It is a columnar format suited to OLAP queries, and it also supports insert, delete, and update operations as well as indexing. [http://carbondata.incubator.apache.org/]

Videos:

Apache Parquet 2013 [https://www.youtube.com/watch?v=pFS-FScophU&list=PLA70L35Y7kjgvArPec7s6j-lJRJkGM1Yc]
Apache Parquet 2014 [https://www.youtube.com/watch?v=MZNjmfx4LMc&index=4&list=PLA70L35Y7kjgvArPec7s6j-lJRJkGM1Yc]
Hortonworks File Formats Benchmark 2016 [https://www.youtube.com/watch?v=tB28rPTvRiI]
Apache Spark with Parquet 2017 [https://www.youtube.com/watch?v=_0Wpwj_gvzg]
Audio about Apache Parquet and Apache Arrow 2017 [https://softwareengineeringdaily.com/2017/01/13/columnar-data-apache-arrow-and-parquet-with-julien-le-dem-and-jacques-nadeau/]