Avro vs Parquet — which should you use?
I won’t say that one is better and the other is not, because it depends entirely on where they are going to be used.
Apache Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. (Wikipedia)
- Since it’s a row-based format, it’s better to use when all fields need to be accessed
- Files support block compression and are splittable
- Suitable for write-intensive operations
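As a rough sketch of what this looks like in practice, here is a minimal Python example assuming the fastavro library; the "User" schema, records, and file name are made up for illustration:

```python
from fastavro import writer, reader, parse_schema

# Avro schemas are defined in JSON; this "User" record is an invented example.
schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Records are appended whole, which is why Avro suits write-heavy workloads;
# "deflate" turns on block compression, and the blocks keep the file splittable.
with open("users.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Reading deserializes full records, i.e. every field of every row.
with open("users.avro", "rb") as f:
    for rec in reader(f):
        print(rec)
```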
Apache Parquet, on the other hand, is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. (Wikipedia)
- Since it’s a column-based format, it’s better to use when you only need to access specific fields
- Each data file contains the values for a set of rows (a row group), stored column by column
- Can’t be written directly from streaming data, since it has to wait for blocks to be completed. It does work, however, with micro-batch or bulk sinks
- Suitable for data exploration: read-intensive workloads, complex or analytical queries, and low-latency data access
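A comparable sketch for Parquet, assuming pyarrow; the table contents and file name are again illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; in practice this would arrive in batches.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "score": [9.5, 7.2, 8.8],
})

# Values are laid out column by column inside row groups and compressed per column.
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: only the requested columns are read from disk,
# which is what makes Parquet cheap for analytical queries.
subset = pq.read_table("users.parquet", columns=["id", "score"])
print(subset)
```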
Both support the following, though to varying degrees:
- Schema evolution
- Nested datasets
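To make the schema-evolution point concrete, here is a small sketch on the Avro side (again assuming fastavro, with an invented "email" field): a newer reader schema that adds an optional field with a default can still resolve files written with the older schema.

```python
from fastavro import reader, parse_schema

# The newer schema adds an optional "email" field with a default value,
# so files written with the older two-field schema still resolve cleanly.
new_schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

with open("users.avro", "rb") as f:
    for rec in reader(f, new_schema):
        print(rec)  # older records come back with email=None
```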