Reading Hive Table Data
Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets and supports multiple data formats, including comma-separated value (.csv) TextFile, RCFile, ORC, and Parquet.
The PXF Hive connector reads data stored in a Hive table. This section describes how to use the PXF Hive connector.
When accessing Hive 3, the PXF Hive connector supports using the hive[:*] profiles described below to access Hive 3 external tables only. The connector does not support using the hive[:*] profiles to access Hive 3 managed (CRUD and insert-only transactional, and temporary) tables. Use the PXF JDBC Connector to access Hive 3 managed tables instead.
Prerequisites
Before working with Hive table data using PXF, ensure that you have met the PXF Hadoop Prerequisites.
If you plan to use PXF filter pushdown with Hive integral types, ensure that the configuration parameter hive.metastore.integral.jdo.pushdown exists and is set to true in the hive-site.xml file in both your Hadoop cluster and $PXF_BASE/servers/default/hive-site.xml. Refer to About Updating Hadoop Configuration for more information.
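The property uses the standard Hadoop XML configuration form. A minimal hive-site.xml fragment setting it might look like this (add the property alongside the file's existing entries rather than replacing them):

```xml
<property>
  <name>hive.metastore.integral.jdo.pushdown</name>
  <value>true</value>
</property>
```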
Hive Data Formats
The PXF Hive connector supports several data formats and defines the following profiles for accessing these formats:
| File Format | Description | Profile |
|---|---|---|
| TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | hive, hive:text |
| SequenceFile | Flat file consisting of binary key/value pairs. | hive |
| RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | hive, hive:rc |
| ORC | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | hive, hive:orc |
| Parquet | Compressed columnar data representation. | hive |
| Avro | Serialization system with a binary data format. | hive |
Note: The hive profile supports all file storage formats. It will use the optimal hive[:*] profile for the underlying file format type.
Data Type Mapping
The PXF Hive connector supports primitive and complex data types.
Primitive Data Types
To represent Hive data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type.
The following table summarizes external mapping rules for Hive primitive types.
| Hive Data Type | Greenplum Data Type |
|---|---|
| boolean | bool |
| int | int4 |
| smallint | int2 |
| tinyint | int2 |
| bigint | int8 |
| float | float4 |
| double | float8 |
| string | text |
| binary | bytea |
| timestamp | timestamp |
Note: The hive:orc profile does not support the timestamp data type when you specify vectorized query execution (VECTORIZE=true).
Complex Data Types
Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to text. You can create Greenplum Database functions or application code to extract subcomponents of these complex data types.
Examples using complex data types with the hive and hive:orc profiles are provided later in this topic.
Note: The hive:orc profile does not support complex types when you specify vectorized query execution (VECTORIZE=true).
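For example, because PXF returns a Hive array as text, you can unpack it with Greenplum string functions. This is a sketch only: the table and column names (hive_complex, items) are hypothetical, and the exact text rendering of complex types can vary, so verify the delimiters against your own data before relying on it:

```sql
-- items is a Hive array<string> that PXF returns as text, e.g. [apple,pear]
-- Strip the surrounding brackets, then split on the comma delimiter
SELECT string_to_array(trim(both '[]' from items), ',') AS items_array
FROM hive_complex;
```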
Sample Data Set
Examples presented in this topic operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
| Column Name | Data Type |
|---|---|
| location | text |
| month | text |
| number_of_orders | integer |
| total_sales | double |
Prepare the sample data set for use:
1. First, create a text file:

   ```
   $ vi /tmp/pxf_hive_datafile.txt
   ```

2. Add the following data to pxf_hive_datafile.txt; notice the use of the comma (,) to separate the four field values:

   ```
   Prague,Jan,101,4875.33
   Rome,Mar,87,1557.39
   Bangalore,May,317,8936.99
   Beijing,Jul,411,11600.67
   San Francisco,Sept,156,6846.34
   Paris,Nov,159,7134.56
   San Francisco,Jan,113,5397.89
   Prague,Dec,333,9894.77
   Bangalore,Jul,271,8320.55
   Beijing,Dec,100,4248.41
   ```

   Make note of the path to pxf_hive_datafile.txt; you will use it in later exercises.
Hive Command Line
The Hive command line is a subsystem similar to that of psql. To start the Hive command line:
```
$ HADOOP_USER_NAME=hdfs hive
```
The default Hive database is named default.
Example: Creating a Hive Table
Create a Hive table to expose the sample data set.
1. Create a Hive table named sales_info in the default database:

   ```
   hive> CREATE TABLE sales_info (location string, month string,
           number_of_orders int, total_sales double)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         STORED AS textfile;
   ```

   Notice that:

   - The STORED AS textfile subclause instructs Hive to create the table in Textfile (the default) format. Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
   - The DELIMITED FIELDS TERMINATED BY subclause identifies the field delimiter within a data record (line). The sales_info table field delimiter is a comma (,).

2. Load the pxf_hive_datafile.txt sample data file into the sales_info table that you just created:

   ```
   hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
           INTO TABLE sales_info;
   ```

   In examples later in this section, you will access the sales_info Hive table directly via PXF. You will also insert sales_info data into tables of other Hive file format types, and use PXF to access those directly as well.

3. Perform a query on sales_info to verify that you loaded the data successfully:

   ```
   hive> SELECT * FROM sales_info;
   ```
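With the Hive table loaded, a Greenplum readable external table over it applies the primitive type mappings described earlier (string to text, int to int4, double to float8). The following is a minimal sketch, assuming the default PXF server; the Greenplum table name salesinfo_hiveprofile is hypothetical, and the later examples in this section show the full syntax for your environment:

```sql
-- Each column maps a Hive primitive type to its Greenplum equivalent
CREATE EXTERNAL TABLE salesinfo_hiveprofile (
    location         text,
    month            text,
    number_of_orders int4,
    total_sales      float8
)
LOCATION ('pxf://default.sales_info?PROFILE=hive')
FORMAT 'custom' (FORMATTER='pxfwritable_import');
```

After creating the external table, a plain SELECT * FROM salesinfo_hiveprofile; reads the Hive data through PXF.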
Determining the HDFS Location of a Hive Table
Should you need to identify the HDFS file location of a Hive managed table, you can determine it with the DESCRIBE EXTENDED command. For example:

```
hive> DESCRIBE EXTENDED sales_info;
Detailed Table Information
...
location:hdfs://<namenode>:<port>/apps/hive/warehouse/sales_info
...
```