With a real example of a 1 TB dataset on Amazon S3
When managing big data, the way information is encoded matters not only for how it is interpreted but above all for how long it takes to process. Inefficient file formats and uncompressed storage hurt overall performance and scalability.
This article aims to describe the Parquet format and show a real example of how it can save time and resources.
The Apache Parquet open source data file format began as a joint project between Twitter and Cloudera in 2013. Since 27 April 2015 it has been a project of the non-profit Apache Software Foundation (ASF).
The ASF is a community made up of developers working on open source software projects characterized by a distributed, collaborative development process. Each project is managed by a team of volunteers who are considered the active contributors.
Apache Parquet is column-oriented and is designed for storing and reading big data datasets. It provides efficient data compression and performance-optimized encoding schemes to handle complex data, i.e. data types built by combining other existing data types.
The main feature of this format is that it is explicitly designed to separate metadata from data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple Parquet files.
In the example in Figure 1 there is a table with N columns divided into M row groups. The file metadata contains the starting positions of all the column metadata. More details on what the metadata contains can be found in the Thrift definition files of the format specification.
Figure 1 Apache Parquet format
The metadata is written after the data to allow single-pass writing. Readers are expected to read the file metadata first to find all the column chunks they are interested in; the column chunks should then be read sequentially.
This design saves significant storage space and speeds up analytical queries, because techniques such as data skipping can be used: queries that retrieve specific columns do not have to read entire rows of data.
Other important advantages of the Parquet format:
- Language agnostic.
- Highly efficient data compression and decompression.
- Supports complex data types and advanced nested data structures (lists, maps, nested records).
- Well suited to On-Line Analytical Processing (OLAP) use cases, typically alongside traditional On-Line Transaction Processing (OLTP) databases.
Comparison with CSV file
The following example, taken from the databricks.com site, compares the results obtained on a 1 TB dataset stored on Amazon S3 when the data is converted from CSV to Parquet.
The result highlighted how the Parquet format helps users reduce storage space by at least a third on large datasets. It also significantly reduced scan times and thus overall processing costs.
Table 1 Comparison of CSV and Parquet formats
| Metric | Improvement with Parquet vs CSV (1 TB dataset) |
|---|---|
| Size on Amazon S3 | 87% less |
| Query Run Time | 34x faster |
| Data Scanned | 99% less |
| Cost | over 99% less |
As Table 1 shows for the Query Run Time metric, queries over the Parquet version of the dataset ran 34x faster than over the CSV version, with a cost saving of over 99%.
Apache Parquet is a columnar file format that provides real optimizations to speed up big data queries and is far more efficient than CSV or JSON. It can store data of different kinds (structured tables, images, videos and documents).
It saves cloud storage space through highly efficient column-wise compression and flexible column encoding schemes. Today it is an indispensable format for many big data analytics solutions.
To learn more about the services I offer for this type of data, contact me directly through the contacts section; it will be a pleasure to provide further information useful for your business.
To learn more about the updated definition of big data, read the article What does the Big Data term really mean?
Published by: Nicola Lapenta