Avro vs Parquet — which should you use?
I won’t say that one is better and the other is not, because it depends entirely on where they are going to be used.
Apache Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. (Wikipedia)
- Since it’s a row-based format, it’s better to use when all fields need to be accessed
- Files support block compression and are splittable
- Suitable for write-intensive operations
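As a rough sketch of what this looks like in practice, here is a minimal Python example assuming the fastavro library; the "User" schema, records, and file name are made up for illustration:

```python
from fastavro import writer, reader, parse_schema

# Avro schemas are defined in JSON; this "User" record is an invented example.
schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Records are appended whole, which is why Avro suits write-heavy workloads;
# "deflate" turns on block compression, and the blocks keep the file splittable.
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Reading deserializes full records, i.e. every field of every row.
with open("users.avro", "rb") as f:
    for rec in reader(f):
        print(rec)
```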
Apache Parquet, on the other hand, is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. (Wikipedia)
- Since it’s a column-based format, it’s better to use when you only need to access specific fields
- Each data file contains the values for a set of rows (a row group), stored column by column
- Can’t be written directly from streaming data, since it has to wait for blocks to be completed. It does work, however, with micro-batch or bulk sinks
- Suitable for data exploration: read-intensive workloads, complex or analytical queries, and low-latency data access
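A comparable sketch for Parquet, assuming pyarrow; the table contents and file name are again illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; in practice this would arrive in batches.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "score": [9.5, 7.2, 8.8],
})

# Values are laid out column by column inside row groups and compressed per column.
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: only the requested columns are read from disk,
# which is what makes Parquet cheap for analytical queries.
subset = pq.read_table("users.parquet", columns=["id", "score"])
print(subset)
```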
Both support the following, though to varying degrees:
- Schema evolution
- Nested datasets
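To make the schema-evolution point concrete, here is a small sketch on the Avro side (again assuming fastavro, with an invented "email" field): a newer reader schema that adds an optional field with a default can still resolve files written with the older schema.

```python
from fastavro import reader, parse_schema

# The newer schema adds an optional "email" field with a default value,
# so files written with the older two-field schema still resolve cleanly.
new_schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

with open("users.avro", "rb") as f:
    for rec in reader(f, new_schema):
        print(rec)  # older records come back with email=None
```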