AVRO vs Parquet — what to use?

Ana Suzuki
2 min read · Jan 16, 2019


I won’t say one is better than the other, as it totally depends on where they are going to be used.

Apache Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. (from Wikipedia)

  • Since it’s a row-based format, it’s better to use when all fields need to be accessed (see the sketch after this list)
  • Files support block compression and are splittable
  • Suitable for write-intensive operations
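
To make the row-based idea concrete, here is a minimal sketch using the fastavro library (my choice; any Avro library would do, and the schema, file name and records are invented for the example):

```python
# Minimal Avro write/read sketch with fastavro.
# The schema is defined in JSON and the data is serialized
# in Avro's compact binary format.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": None},
]

# Write a block-compressed, splittable Avro container file.
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Reading always returns whole records (row-based access).
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```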

Apache Parquet, on the other hand, is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. (from Wikipedia)

  • Since it’s a column-based format, it’s better to use when you only need to access specific fields (see the sketch after this list)
  • Each data file contains the values for a set of rows
  • Can’t be written directly from streaming data, since it needs to wait for blocks to be finished. However, it does work with a micro-batch or bulk sink
  • Suitable for data exploration: read-intensive, complex or analytical querying, and low-latency data
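
And the column-based side, again as a minimal sketch, this time assuming pyarrow (the table, file name and column names are invented). Reading back only the columns you need is exactly where the columnar layout pays off:

```python
# Minimal Parquet write/read sketch with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "amount": [10.5, 3.2, 7.9],
})

# Parquet files are written in row groups and compressed per column.
pq.write_table(table, "events.parquet", compression="snappy")

# Only the requested columns are read from disk.
subset = pq.read_table("events.parquet", columns=["id", "amount"])
print(subset)
```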

Both support the following, but only to a certain degree:

  • Schema evolution (see the Avro sketch after this list)
  • Nested datasets
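
On the Avro side, schema evolution looks roughly like the sketch below (again fastavro, with schemas invented for the example): data written with an old schema can still be read with a newer reader schema, as long as new fields declare defaults. Parquet also supports evolution such as adding columns, though usually with help from the query engine reading the files.

```python
# Rough sketch of Avro schema evolution with fastavro.
from fastavro import parse_schema, reader, writer

old_schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Newer schema: one extra field, with a default so old data stays readable.
new_schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

# Write with the old schema...
with open("users_v1.avro", "wb") as out:
    writer(out, old_schema, [{"id": 1, "name": "Alice"}])

# ...and read with the new one: the missing field gets its default value.
with open("users_v1.avro", "rb") as fo:
    for record in reader(fo, reader_schema=new_schema):
        print(record)  # {'id': 1, 'name': 'Alice', 'country': 'unknown'}
```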
