Object storage file formats
Object storage connectors support one or more file formats specified by the underlying data source.
In the case of serializable formats, only specific SerDes are allowed:
- RCText: RCFile with ColumnarSerDe
- RCBinary: RCFile with LazyBinaryColumnarSerDe
- JSON: org.apache.hive.hcatalog.data.JsonSerDe
- CSV: org.apache.hadoop.hive.serde2.OpenCSVSerde
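The format-to-SerDe pairs listed above can be treated as a simple allow-list. The following is a minimal sketch of such a check, not the connectors' actual implementation; the names `ALLOWED_SERDES` and `is_allowed_serde` are hypothetical, and accepting a fully qualified class name that ends with the listed short name is an assumption made for illustration.

```python
# Allowed SerDe for each serializable format, copied from the list above.
ALLOWED_SERDES = {
    "RCTEXT": "ColumnarSerDe",
    "RCBINARY": "LazyBinaryColumnarSerDe",
    "JSON": "org.apache.hive.hcatalog.data.JsonSerDe",
    "CSV": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
}


def is_allowed_serde(file_format: str, serde: str) -> bool:
    """Return True if `serde` is the SerDe listed for `file_format`.

    Assumption: a fully qualified class name is accepted when its last
    segment matches the short name given in the list.
    """
    expected = ALLOWED_SERDES.get(file_format.upper())
    if expected is None:
        # Format is not one of the serializable formats listed above.
        return False
    return serde == expected or serde.endswith("." + expected)
```

For example, `is_allowed_serde("CSV", "ColumnarSerDe")` returns `False`, because the CSV format is paired only with `OpenCSVSerde`.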
ORC format configuration properties
The following properties configure how supported object storage connectors read and write ORC files:
| Property Name | Description | Default |
|---|---|---|
| | Sets the default time zone for legacy ORC files that did not declare a time zone. | JVM default |
| | Access ORC columns by name. By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. The equivalent catalog session property is | |
| | Enable bloom filters for predicate pushdown. | |
| | Allow reads on ORC files with short zone ID in the stripe footer. | |
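The difference between accessing columns by ordinal position and by name matters when the column order in a file does not match the table definition. The following minimal sketch illustrates the two strategies; `resolve_columns` and its `use_column_names` parameter are hypothetical names chosen for this example, and exact, case-sensitive name matching is an assumption.

```python
from typing import Dict, List, Optional


def resolve_columns(
    table_columns: List[str],
    file_columns: List[str],
    use_column_names: bool = False,
) -> Dict[str, Optional[int]]:
    """Map each table column to an index into the file's columns.

    A value of None means the file has no matching column.
    """
    if use_column_names:
        # Match on the column name stored in the file.
        index_by_name = {name: i for i, name in enumerate(file_columns)}
        return {col: index_by_name.get(col) for col in table_columns}
    # Match on ordinal position in the table definition.
    return {
        col: (i if i < len(file_columns) else None)
        for i, col in enumerate(table_columns)
    }


table = ["id", "name", "price"]
file_cols = ["name", "id"]  # file written with a different column order

by_position = resolve_columns(table, file_cols)
by_name = resolve_columns(table, file_cols, use_column_names=True)
```

With positional access, the table column `id` is read from the file's first column, which actually holds `name`. With name-based access, `id` resolves to the file's second column, so reordered files are read correctly.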
Parquet format configuration properties
The following properties configure how supported object storage connectors read and write Parquet files:
| Property Name | Description | Default |
|---|---|---|
| | Adjusts timestamp values to a specific time zone. For Hive 3.1+, set this to UTC. | JVM default |
| | Access Parquet columns by name by default. Set this property to | |
| | Percentage of Parquet files to validate after write by re-reading the whole file. The equivalent catalog session property is | |
| | Maximum size of pages written by the Parquet writer. | |
| | Maximum value count of pages written by the Parquet writer. | |
| | Maximum size of row groups written by the Parquet writer. | |
| | Maximum number of rows processed by the Parquet writer in a batch. | |
| | Whether bloom filters are used for predicate pushdown when reading Parquet files. Set this property to | |
| | Skip reading Parquet pages by using Parquet column indices. The equivalent catalog session property is | |
| | Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The equivalent catalog session property is | |
| | Sets the maximum number of rows read in a batch. The equivalent catalog session property is named | |
| | Data size below which a Parquet file is read entirely. The equivalent catalog session property is named | |
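Two of the descriptions above imply simple decision rules: validating only a percentage of written files, and reading a file entirely when its size is below a threshold. The following minimal sketch shows one way such rules could work. It is not the connectors' actual logic: the function and parameter names are hypothetical, random sampling is an assumed way of applying the percentage, and the footer-then-ranges strategy for larger files is likewise an assumption.

```python
import random


def should_validate(validation_percentage: float, rng: random.Random) -> bool:
    """Decide whether a newly written file is re-read for validation.

    Assumption: files are sampled at random, so roughly
    `validation_percentage` percent of them are validated.
    """
    if not 0 <= validation_percentage <= 100:
        raise ValueError("validation_percentage must be between 0 and 100")
    # rng.random() is in [0, 1), so 0 never validates and 100 always does.
    return rng.random() * 100 < validation_percentage


def plan_read(file_size_bytes: int, small_file_threshold_bytes: int) -> str:
    """Choose a read strategy based on the file size.

    Files below the threshold are read entirely in one request; the
    strategy for larger files is an assumption for illustration.
    """
    if file_size_bytes < small_file_threshold_bytes:
        return "read_whole_file"
    return "read_footer_then_ranges"
```

Reading a small file in one request avoids separate round trips for the footer and for each column chunk, which tends to matter most on object storage where each request carries noticeable latency.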