Skip to content

Latest commit

 

History

History
105 lines (79 loc) · 4.74 KB

File metadata and controls

105 lines (79 loc) · 4.74 KB

Hive JDBC import into H2O

H2O can ingest data from Hive through the Hive v2 JDBC driver by providing H2O with the JDBC driver for your Hive version. Note: Since Hive only support OFFSET in SELECT statements since Hive 2.0 for earlier versions of Hive it is necessary to import data with fetch_mode="SINGLE" option.

Motivation

Reading data in Hive via the JDBC interface provides a convenient way how to transform and filter data using Hive Query engine before importing it into H2O.

Prerequisities

  • Running Hive instance

Setting up table with data

You can skip these steps if you already have your data prepared in a Hive table.

  1. Get the AirlinesTest dataset from S3.

  2. Run the CLI client for Hive: $ beeline -u jdbc:hive2://hive-host:10000/db-name

  3. Create the DB table:

     CREATE EXTERNAL TABLE IF IT DOES NOT EXIST AirlinesTest(
       fYear STRING ,
       fMonth STRING ,
       fDayofMonth STRING ,
       fDayOfWeek STRING ,
       DepTime INT ,
       ArrTime INT ,
       UniqueCarrier STRING ,
       Origin STRING ,
       Dest STRING ,
       Distance INT ,
       IsDepDelayed STRING ,
       IsDepDelayed_REC INT
     )
         COMMENT 'stefan test table'
         ROW FORMAT DELIMITED
         FIELDS TERMINATED BY ','
         LOCATION '/tmp';
    
  4. Import the data from the dataset: LOAD DATA INPATH '/tmp/AirlinesTest.csv' OVERWRITE INTO TABLE AirlinesTest;

Enable Hive JDBC Driver in H2O

Retrieve the Hive JDBC Client jar

  • For Hortonworks, Hive JDBC client jars can be found on one of the edge nodes after you have installed HDP: /usr/hdp/current/hive-client/lib/hive-jdbc-<version>-standalone.jar. More information is available in Hortonworks Documentation
  • For Cloudera, install the Hive JDBC package for your operating system, and then add /usr/lib/hive/lib/hive-jdbc-<version>-standalone.jar to your classpath. More information is available in Cloudera Documentation
  • You can also retrieve this from Maven for the desire version using mvn dependency:get -Dartifact=groupId:artifactId:version.

Provide H2O with the JDBC Driver

Based on the setup you can:

  • Add the Hive JDBC driver to H2O's classpath for running stand-alone H2O from terminal: java -cp hive-jdbc-standalone.jar:<path_to_h2o_jar> water.H2OApp

  • Init from python: h2o.init(extra_classpath = ["hive-jdbc-standalone.jar"])

  • Init from R h2o.init(extra_classpath=["hive-jdbc-standalone.jar"])

  • Add the Hive JDBC driver to H2O's classpath for running clustered H2O on Hadoop from terminal: hadoop jar h2odriver.jar -libjars hive-jdbc-standalone.jar -nodes 3 -mapperXmx 6g After the jar file with JDBC driver is added, then data from the Hive databases can be pulled into H2O using the import_sql_table and import_sql_select functions.

Putting it all together

You need to define the data that would be used as well as the credentials and the connection url:

# python
import h2o
connection_url = "jdbc:hive2://localhost:10000/default"
select_query = "SELECT * FROM AirlinesTest;"
username = "root"
password = "changeit"

# Load the dataset
airlines_dataset = h2o.import_sql_select(connection_url, select_query, username, password)

# And then utilize it
airlines_dataset["table_for_h2o_import.origin"] = airlines_dataset["table_for_h2o_import.origin"].asfactor()
airlines_dataset["table_for_h2o_import.fdayofweek"] = airlines_dataset["table_for_h2o_import.fdayofweek"].asfactor()
airlines_dataset["table_for_h2o_import.uniquecarrier"] = airlines_dataset["table_for_h2o_import.uniquecarrier"].asfactor()
airlines_dataset["table_for_h2o_import.dest"] = airlines_dataset["table_for_h2o_import.dest"].asfactor()
airlines_dataset["table_for_h2o_import.fyear"] = airlines_dataset["table_for_h2o_import.fyear"].asfactor()
airlines_dataset["table_for_h2o_import.fdayofmonth"] = airlines_dataset["table_for_h2o_import.fdayofmonth"].asfactor()
airlines_dataset["table_for_h2o_import.isdepdelayed"] = airlines_dataset["table_for_h2o_import.isdepdelayed"].asfactor()
airlines_dataset["table_for_h2o_import.fmonth"] = airlines_dataset["table_for_h2o_import.fmonth"].asfactor()

airlines_X_col_names = airlines_dataset.col_names[:-2]
airlines_y_col_name = airlines_dataset.col_names[-2]

train, valid, test = airlines_dataset.split_frame([0.6, 0.2], seed=1234)

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_v1 = H2OGradientBoostingEstimator(model_id="gbm_airlines_v1", seed=2000000)
gbm_v1.train(airlines_X_col_names, airlines_y_col_name, training_frame=train, validation_frame=valid)
gbm_v1.predict(test)