1) What is Hive?
Apache Hive is an open-source data warehouse system built on top of Hadoop. It provides a SQL-like query language (HiveQL) that we can use for analyzing and querying large datasets stored in Hadoop.
2) Why do we need Hive?
Hive is a tool in the Hadoop ecosystem that provides an interface to organize and query data in a database-like fashion and to write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
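For example, a minimal sketch (the table name and HDFS path are hypothetical) of pointing Hive at data already in HDFS and querying it with HiveQL:
CREATE EXTERNAL TABLE pageviews (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/pageviews';
-- Aggregate hits per URL without writing any MapReduce code
SELECT url, count(*) AS hits
FROM pageviews
GROUP BY url;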
3) What is a metastore in Hive?
It is a relational database that stores the metadata of Hive tables, partitions, Hive databases, etc.
When you create a table, the metastore is updated with the information related to the new table, and this metadata is consulted when you issue queries on that table.
4) Is Hive suitable to be used for OLTP systems? Why?
No. Hive is designed for batch analytics and does not provide low-latency, row-level inserts and updates, so it is not suitable for OLTP systems.
5) Can you explain ACID transactions in Hive?
Hive supports ACID transactions. ACID stands for Atomicity, Consistency, Isolation and Durability.
ACID support is provided at the row level, so INSERT, UPDATE and DELETE operations are possible on transactional tables.
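A minimal sketch of enabling ACID and creating a transactional table (assumes Hive 0.14 or later with the transaction manager configured; the table and column names are hypothetical):
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Transactional tables must be bucketed and stored as ORC
CREATE TABLE orders_txn (id INT, amount FLOAT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
-- Row-level change, possible only on a transactional table
UPDATE orders_txn SET amount = 10.5 WHERE id = 1;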
6) What are the types of tables in Hive?
There are two types of tables in Hive: internal tables (also known as managed tables) and external tables.
7) What kind of data warehouse application is suitable for Hive?
Hive is not a full database. The design rules and restrictions of Hadoop and HDFS limit what Hive can do. Hive is most suitable for data warehouse applications
where the data is relatively static, fast response times are not required, and the data does not change rapidly.
Hive does not provide the fundamental features required for OLTP (Online Transaction Processing); it is suitable for data warehouse applications over large datasets.
8) Explain what a Hive variable is. What do we use it for?
A Hive variable is created in the Hive environment and referenced by Hive scripts and queries. It lets you pass values into a Hive query at the time the query starts executing, for example when running a script with the source command.
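A minimal sketch (the variable, table and column names are hypothetical) of setting a Hive variable and referencing it in a query:
SET hivevar:run_date=2018-04-25;
SELECT * FROM sales WHERE sale_date = '${hivevar:run_date}';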
9) How do you change the warehouse.dir location for older tables?
To change the base location of new Hive tables, edit the hive.metastore.warehouse.dir property. This will not affect older tables; their metadata has to be changed directly in the metastore database (MySQL or Derby). The location of Hive tables is stored in the SDS table, in the LOCATION column.
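A hedged sketch of such a metadata update against a MySQL-backed metastore (SDS and LOCATION are part of the standard metastore schema; the old and new paths are hypothetical, and the metastore should be backed up first):
UPDATE SDS
SET LOCATION = REPLACE(LOCATION, 'hdfs://namenode/user/hive/warehouse', 'hdfs://namenode/new/warehouse/path');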
10) What are the types of metastore available in Hive?
There are three types of meta stores available in Hive.
Embedded Metastore (Derby)
Local Metastore
Remote Metastore.
11) Is it possible for multiple users to share the same metastore in the case of an embedded metastore?
No, it is not possible to use an embedded metastore in sharing mode. It is recommended to use a standalone real database like MySQL or PostgreSQL.
12) If you run Hive as a server, what are the available mechanisms for connecting to it from an application?
There are the following ways to connect to the Hive server:
- Thrift Client: Using Thrift you can call Hive commands from various programming languages, e.g. C++, Java, PHP, Python and Ruby.
- JDBC Driver: It supports the JDBC protocol for Java applications.
- ODBC Driver: It supports the ODBC protocol.
13) What is SerDe in Apache Hive?
SerDe is a short name for Serializer/Deserializer.
Hive uses a SerDe (together with a FileFormat) to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop file system format in which the data is stored. Users are able to write files to HDFS with whatever tool or mechanism they prefer (CREATE EXTERNAL TABLE or LOAD DATA INPATH) and use Hive to correctly parse that file format in a way that can be used by Hive. A SerDe is a powerful and customizable mechanism that Hive uses to parse data stored in HDFS so that it can be used by Hive.
14) Which classes are used by Hive to read and write HDFS files?
The following classes are used by Hive to read and write HDFS files:
TextInputFormat / HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format.
SequenceFileInputFormat / SequenceFileOutputFormat: these two classes read/write data in the Hadoop SequenceFile format.
15) Give examples of the SerDe classes which Hive uses to serialize and deserialize data.
Hive currently uses these SerDe classes to serialize and deserialize data:
MetadataTypedColumnsetSerDe: this SerDe is used to read/write delimited records like CSV, tab-separated or control-A-separated records (quoting is not supported yet).
ThriftSerDe: this SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.
DynamicSerDe: this SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol and TCTLSeparatedProtocol (which writes data in delimited records).
16) How do you write your own custom SerDe and what is the need for that?
In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
For example, the RegexDeserializer will deserialize the data using the configuration parameter regex, and possibly a list of column names.
If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch, because the framework passes DDL to the SerDe through the Thrift DDL format, and it is non-trivial to write a Thrift DDL parser.
Depending on the nature of the data the user has, the built-in SerDes may not satisfy the format of the data, so users need to write their own Java code to satisfy their data format requirements.
17) What is ObjectInspector functionality?
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
Instance of a Java class (Thrift or native Java)
A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object with starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
In simple terms, the ObjectInspector functionality in Hive is used to analyze the internal structure of columns, rows, and complex objects. It allows access to the internal fields inside the objects.
18) What is the functionality of the Query Processor in Apache Hive?
This component implements the processing framework that converts SQL into a graph of map/reduce jobs, as well as the run-time framework that executes those jobs in the order of their dependencies, with the help of the metastore details.
19) What is the limitation of the Derby database for the Hive metastore?
With the Derby database, you cannot have multiple connections or multiple sessions instantiated at the same time. Derby runs in local (embedded) mode and places a lock on the database directory, so multiple users cannot access Hive simultaneously.
20) What are managed and external tables?
We have got two things: the data, present in HDFS, and the metadata, present in some database. There are two categories of Hive tables: managed and external tables.
In a managed table, both the data and the metadata are managed by Hive, and if you drop the managed table, both data and metadata are deleted. There are situations where your data is controlled by some other application and you want to read that data, but you must not allow Hive to delete it. In such a case, you can create an external table in Hive.
In an external table, the metadata is controlled by Hive but the actual data is controlled by some other application. So, when you delete the table accidentally, only the metadata is lost and the actual data stays wherever it is.
21) What are the complex data types in Hive?
MAP: a map contains key-value pairs, where you can look up a value using its key.
STRUCT: a struct is a collection of elements of different data types. For example, an address can have fields of different data types; the pin code would be an integer.
ARRAY: an array is a collection of homogeneous elements. For example, a skillset can be stored as an array with any number of skills.
UNIONTYPE: it represents a column whose value can belong to any one of a chosen set of data types.
A sketch combining all four types is shown below.
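A minimal sketch of a table definition using these complex types (the table and column names are hypothetical):
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,
  address STRUCT<street:STRING, city:STRING, pin:INT>,
  phone   MAP<STRING, STRING>,
  misc    UNIONTYPE<INT, STRING>
);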
22) How does partitioning help in the faster execution of queries?
With partitioning, a sub-directory is created for each value of the partitioned column, and when you query with a WHERE clause on that column, only the matching sub-directories are scanned instead of the whole table. This gives you faster execution of queries.
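A minimal sketch of partition pruning (the table name and partition value are hypothetical):
CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);
-- Only the sub-directory dt=2018-04-25 is scanned, not the whole table
SELECT count(*) FROM logs WHERE dt = '2018-04-25';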
23) How do you enable dynamic partitioning in Hive?
There are two types of partitioning: static and dynamic. In static partitioning, you specify the partition value while loading the data.
In dynamic partitioning, you push the data into Hive and Hive decides which value should go into which partition. To enable dynamic partitioning, you have to set the properties below:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
Example: insert overwrite table emp_details_partitioned partition(location)
select * from emp_details;
24) What is bucketing?
The values in a column are hashed into a number of buckets defined by the user. It is a way to avoid having too many partitions or nested partitions while still optimizing query performance.
25) How does bucketing help in the faster execution of queries?
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets (or multiples of each other) and are also sorted on the join column, a sort-merge bucket map join (SMB map join) is possible, in which the join takes place in the map phase itself by matching the corresponding buckets. Buckets are basically files created inside the HDFS directory of the table.
There are different properties which you need to set for bucket map joins, listed below (a sketch of qualifying tables follows the list):
set hive.enforce.sortmergebucketmapjoin = false;
set hive.auto.convert.sortmerge.join = false;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
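A minimal sketch (hypothetical tables) of two tables that can qualify for a bucket/sort-merge bucket map join: both are bucketed and sorted on the join key with the same number of buckets:
CREATE TABLE orders_b (order_id INT, cust_id INT, amount FLOAT)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 8 BUCKETS;
CREATE TABLE customers_b (cust_id INT, name STRING)
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 8 BUCKETS;
-- With the properties above set, this join can be executed bucket by bucket in the map phase
SELECT o.order_id, c.name
FROM orders_b o JOIN customers_b c ON o.cust_id = c.cust_id;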
26) How do you enable bucketing in Hive?
By default bucketing is disabled in Hive; you can enforce it by setting the property below:
set hive.enforce.bucketing = true;
27) What are the different file formats in Hive?
Every file format has its own characteristics, and Hive lets you easily choose the file format you want to use.
There are different file formats supported by Hive
Text File format
Sequence File format
Parquet
Avro
RC file format
ORC
28) How is a SerDe different from a file format in Hive?
SerDe stands for Serializer and Deserializer. It determines how to encode and decode the field values, or the column values, of a record, i.e. how you serialize and deserialize the values of a column. The file format, on the other hand, determines how records are stored (for example as key-value pairs) and how records are retrieved from the table.
29) What is RegexSerDe?
Regex stands for regular expression. Whenever you want to do pattern matching and store the fields based on the matched pattern, you can use RegexSerDe.
RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SERDEPROPERTIES, you have to define your input pattern and output fields. For example, for a line like xyz/pq@def, if you want to extract xyz, pq and def separately:
To extract the pattern, you can use:
"input.regex" = "(.*)/(.*)@(.*)"
To specify how to output the fields, you can use:
"output.format.string" = "%1$s %2$s %3$s"
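Putting it together, a minimal sketch of a complete table definition using RegexSerDe for such lines (the table and column names are hypothetical):
CREATE TABLE regex_sample (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)/(.*)@(.*)",
  "output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;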
30) How is ORC file format optimised for data storage and analysis?
ORC stores collections of rows in one file and within the collection the row data will be stored in a columnar format. With columnar format, it is very easy to compress, thus reducing a lot of storage cost.
While querying, only the required columns are read instead of the whole row, since the records are stored in columnar format.
ORC also keeps lightweight indexes on every block, based on column statistics (min, max, sum, count), so when you query, it can skip blocks using these indexes.
31) How to access HBase tables from Hive?
Using the Hive-HBase storage handler, you can access HBase tables from Hive, and once connected you can query HBase using SQL queries from Hive. You can also join multiple HBase tables from Hive and retrieve the result.
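A hedged sketch of mapping an HBase table into Hive via the storage handler (the HBase table, column family and column names are hypothetical):
CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name")
TBLPROPERTIES ("hbase.table.name" = "users");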
32) When running a JOIN query, I see out-of-memory errors. Why?
This is usually caused by the order of the JOIN tables. Instead of [FROM tableA a JOIN tableB b ON ...], try [FROM tableB b JOIN tableA a ON ...]. Note that if you are using LEFT OUTER JOIN, you might want to change it to RIGHT OUTER JOIN. This trick usually solves the problem; the rule of thumb is to always put the table with many rows sharing the same join-key value on the rightmost side of the JOIN.
33) Have you used MySQL as the metastore and faced errors like com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure?
This is usually caused by MySQL servers closing connections after they have been idle for some time. Running the following command on the MySQL server will solve the problem: [set global wait_timeout=120]
When using MySQL as a metastore I see the error [com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes].
This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as latin1, which is known to work.
34) Does Hive support Unicode?
You can use Unicode strings in data or comments, but you cannot use them for database, table or column names.
You can use UTF-8 encoding for Hive data. Other encodings are not supported (HIVE-7142 introduced an encoding option for LazySimpleSerDe; however, the implementation is not complete and does not address all cases).
35) Are Hive SQL identifiers (e.g. table names, columns, etc) case sensitive?
No, Hive is case insensitive.
36) What is the best way to load xml data into hive ?
The easiest way is to use the Hive XML SerDe (com.ibm.spss.hive.serde2.xml.XmlSerDe), which will allow you to directly import and work with XML data.
37) When is Hive not suitable?
It does not provide OLTP support, only OLAP-style processing. If the application requires OLTP, switch to a NoSQL database. Also, HiveQL queries have higher latency due to the MapReduce overhead.
38) Mention the different modes of Hive.
Depending on the size of the data and of the Hadoop deployment, Hive can operate in two modes.
These modes are local mode and MapReduce mode.
39) Mention what HiveServer2 (HS2) is.
It is a server interface that performs the following functions:
- It allows remote clients to execute queries against Hive
- It retrieves the results of those queries
Its latest version, based on Thrift RPC, includes advanced features such as:
- Multi-client concurrency
- Authentication
40) Mention what the Hive query processor does.
The Hive query processor converts the query into a graph of MapReduce jobs together with the execution-time framework, so that the jobs can be executed in the order of their dependencies.
41) Mention what are the steps of Hive in query processor?
The components of a Hive query processor include,
- Logical Plan Generation
- Physical Plan Generation
- Execution Engine
- Operators
- UDFs and UDAFs
- Optimizer
- Parser
- Semantic Analyzer
- Type Checking
42) Explain how can you change a column data type in Hive?
You can change a column data type in Hive by using command,
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
43) Mention the difference between ORDER BY and SORT BY in Hive.
SORT BY sorts the data within each reducer. You can use any number of reducers for a SORT BY operation.
ORDER BY sorts all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.
44) Explain when to use explode in Hive.
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert such complex data types into the desired table format, we can use the explode function, as in the sketch below.
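A minimal sketch (the table and column names are hypothetical) of exploding an array column into rows with a lateral view:
SELECT emp_id, skill
FROM employee_skills
LATERAL VIEW explode(skills) s AS skill;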
45) Mention how you can stop a partition from being queried.
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
46) Can we rename a Hive table ?
Yes, using the command below:
ALTER TABLE table_name RENAME TO new_name;
47) What is the default location where hive stores table data?
hdfs://namenode_server/user/hive/warehouse
48) Is there a date data type in Hive?
Yes. The TIMESTAMP data type stores date and time values in the java.sql.Timestamp format.
49) Can we run unix shell commands from hive? Give example.
Yes, using the ! mark just before the command.
For example, !pwd; at the hive prompt will print the current directory.
50) Can hive queries be executed from script files? How?
Using the source command.
Example −
hive> source /path/to/file/file_with_query.hql;
51) What is the importance of the .hiverc file?
It is a file containing a list of commands that need to be run when the Hive CLI starts, for example setting strict mode to true.
52) What are the default record and field delimiters used for Hive text files?
The default record delimiter is \n.
The default field delimiters are \001, \002 and \003.
53) What do you mean by schema on read?
The schema is validated with the data when reading the data and not enforced when writing data.
54) How do you list all databases whose name starts with p?
SHOW DATABASES LIKE 'p.*';
55) What does the “USE” command in hive do?
With the use command you fix the database on which all the subsequent hive queries will run.
56) How can you delete the DBPROPERTY in Hive?
There is no way you can delete the DBPROPERTY.
57) What is the significance of the line below?
set hive.mapred.mode = strict;
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents very large jobs from running for a long time.
58) How do you check if a particular partition exists?
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value');
59) Which java class handles the Input and Output records encoding into files in Hive tables ?
For Input: org.apache.hadoop.mapred.TextInputFormat
For Output: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
60) What is the significance of the 'IF EXISTS' clause while dropping a table? When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place.
61) When you point a partition of a hive table to a new directory, what happens to the data?
The data stays in the old location. It has to be moved manually.
Write a query to insert a new column (new_col INT) into a hive table (htab) at a position before an existing column (x_col).
Hive has no BEFORE clause, so the column is first added and then repositioned with FIRST or AFTER:
ALTER TABLE htab ADD COLUMNS (new_col INT);
ALTER TABLE htab CHANGE COLUMN new_col new_col INT FIRST;
(use AFTER <column preceding x_col> instead of FIRST if x_col is not the first column)
62) Does archiving Hive tables save any space in HDFS?
No. It only reduces the number of files, which makes them easier for the NameNode to manage.
63) While loading data into a hive table using the LOAD DATA clause, how do you specify it is a hdfs file and not a local file ?
By omitting the LOCAL clause in the LOAD DATA statement.
64) If you omit the OVERWRITE clause while loading data into a hive table, what happens to new files and to files which already exist?
The new incoming files are simply added to the target directory, and existing files whose names clash with incoming files are overwritten. Other files whose names do not match any of the incoming files continue to exist. If you add the OVERWRITE clause, all the existing data in the directory is deleted before the new data is written.
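A minimal sketch (the paths and table name are hypothetical) contrasting the two behaviours:
LOAD DATA INPATH '/staging/day1' INTO TABLE sales;            -- files are added to the table directory
LOAD DATA INPATH '/staging/day2' OVERWRITE INTO TABLE sales;  -- existing data is removed first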
65) What does the following query do?
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
It creates partition on table employees with partition values coming from the columns in the select clause. It is called Dynamic partition insert.
66) What is a table-generating function in Hive?
A table-generating function is a function which takes a single column as an argument and expands it into multiple columns or rows, for example explode().
67) How can Hive avoid mapreduce?
If we set the property hive.exec.mode.local.auto to true, then for small queries Hive will avoid launching MapReduce jobs on the cluster and will run them locally to fetch the query results.
68) What is the difference between the LIKE and RLIKE operators in Hive?
The LIKE operator behaves the same way as the regular SQL LIKE operator used in SELECT queries. Example:
street_name LIKE '%Chi'
The RLIKE operator uses the more advanced regular expressions available in Java.
Example: street_name RLIKE '.*(Chi|Oho).*' selects any value which has either Chi or Oho in it.
69) Is it possible to create Cartesian join between 2 tables, using Hive?
No, as this kind of join cannot be implemented in MapReduce.
70) What should be the order of table sizes in a join query?
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
71) What is the usefulness of the DISTRIBUTE BY clause in Hive?
It controls how the map output is distributed among the reducers. It is useful in case of streaming data.
72) How will you convert the string '51.2' to a float value in the price column?
SELECT cast(price AS FLOAT)
73) What will be the result when you do cast('abc' AS INT)?
Hive will return NULL
74) Can we LOAD data into a view?
No. A view cannot be the target of an INSERT or LOAD statement.
75) What types of costs are associated in creating index on hive tables?
Indexes occupy space, and there is a processing cost in arranging the values of the column on which the index is created.
Give the command to see the indexes on a table.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
76) What does /*streamtable(table_name)*/ do?
It is a query hint to stream a table into memory before running the query. It is a query optimization technique.
77) Can a partition be archived? What are the advantages and Disadvantages?
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived file can still be queried using Hive. The disadvantage is that it makes queries less efficient and does not offer any space savings.
78) What is a generic UDF in hive?
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.
79) The following statement failed to execute. What can be the cause?
LOAD DATA LOCAL INPATH '${env:HOME}/country/state/'
OVERWRITE INTO TABLE address;
The local INPATH should point to a file and not a directory. ${env:HOME} itself is a valid variable available in the Hive environment.
80) How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator' = 'Joan')
81) Which method has to be overridden when we use custom UDF in Hive?
Whenever you write a custom UDF in Hive, you have to extend the UDF class and you have to override the evaluate() function.
82) Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. Local or remote metastore configuration in Apache Hive for allowing access to multiple clients concurrently.
Following are the steps to configure a MySQL database as the local metastore in Apache Hive:
- In hive-site.xml, set the javax.jdo.option.ConnectionURL property to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
- Set the javax.jdo.option.ConnectionDriverName property to com.mysql.jdbc.Driver.
- Set javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword to the desired username and password.
- The JDBC driver JAR file for MySQL must be on the Hive classpath, i.e. copied into the Hive lib directory.
- After restarting the Hive shell, it will automatically connect to the MySQL database running as a standalone metastore.
83) Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION [hdfs_path].
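A minimal sketch (the path is hypothetical) of overriding the default location at table-creation time:
CREATE TABLE managed_sample (id INT, name STRING)
LOCATION 'hdfs://namenode/custom/warehouse/managed_sample';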
84) When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute.
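A minimal sketch (the table and column names are hypothetical) contrasting the two clauses:
SELECT * FROM sales SORT BY amount;   -- sorted within each reducer only
SELECT * FROM sales ORDER BY amount;  -- total order, forced through a single reducer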
85) What is dynamic partitioning and when is it used?
In dynamic partitioning, the values for the partition columns are known only at runtime, i.e. they are determined while the data is being loaded into the Hive table.
One may use dynamic partitioning in the following two cases:
- Loading data from an existing non-partitioned table to improve sampling and therefore decrease query latency.
- When one does not know all the partition values beforehand, so finding these partition values manually in a huge dataset would be a tedious task.
86) Suppose I create a table that contains the details of all the transactions done by the customers during the year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time in processing this query. How will you solve this problem and list the steps that I will be taking in order to do so?
We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can not partition an existing non-partitioned table directly.
So, we will be taking following steps to solve the very problem:
- Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' ;
- Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
- Transfer the data from the non – partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.
87) How can you add a new partition for the month December in the above partitioned table?
For adding a new partition to the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';
88) What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of dynamic partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = value
89) I am inserting data into a table based on partitions dynamically. But, I received an error FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?
To remove this error one has to execute following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
90) Suppose, I have a CSV file sample.csv present in temp directory with the following entries:
id | first_name | last_name | email | gender | ip_address
---|---|---|---|---|---
1 | Hugh | Jackman | hughjackman@cam.ac.uk | Male | 136.90.241.52 |
2 | David | Lawrence | dlawrence1@gmail.com | Male | 101.177.15.130 |
3 | Andy | Hall | andyhall2@yahoo.com | Female | 114.123.153.64 |
4 | Samuel | Jackson | samjackson231@sun.com | Male | 89.60.227.31 |
5 | Emily | Rose | rose.emily4@surveymonkey.com | Female | 119.92.21.19 |
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
SerDe stands for serializer or deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for the sample.csv by issuing following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
Now, we can perform any query on the table sample:
SELECT first_name FROM sample WHERE gender = 'Male';
91) Suppose, I have a lot of small CSV files present in input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:
Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load the data into temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.
92) Can we change settings within a Hive session? If yes, how?
Yes, we can change settings within a Hive session using the SET command. It helps to change Hive job settings for a specific query.
Example: the following command ensures that buckets are populated according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET on its own will list all the properties with the values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list will not include the Hadoop defaults, so we should use:
SET -v;
It will list all the properties in the system, including the Hadoop defaults.
93) Is it possible to add 100 nodes when we already have 100 nodes in Hive? How?
Yes, we can add nodes by following the steps below.
Take a new system and create a new username and password on it.
Install SSH and set up SSH connections with the master node.
Add the new node's SSH public RSA key to the authorized_keys file.
Add the new DataNode's host name, IP address and other details to /etc/hosts and to the slaves file, e.g.
192.168.1.102 slave3.in slave3
Start the DataNode on the new node.
Log in to the new node, e.g. su hadoop or ssh -X hadoop@192.168.1.103.
Start HDFS on the newly added slave node by using the following command:
./bin/hadoop-daemon.sh start datanode
Check the output of the jps command on the new node.
94) Explain the concatenation function in Hive with an example.
The CONCAT function joins its input strings. We can pass any number of strings, separated by commas.
Example:
CONCAT('It','-','is','-','a','-','eLearning','-','provider');
Output:
It-is-a-eLearning-provider
Here we had to place the separator '-' between every pair of strings. If the separator is the same for all the strings, Hive provides the CONCAT_WS function, where the separator is specified first:
CONCAT_WS('-','It','is','a','eLearning','provider');
Output: It-is-a-eLearning-provider
95) Explain the TRIM and REVERSE functions in Hive with examples.
The TRIM function removes the spaces surrounding a string.
Example:
TRIM('  BHAVESH  ');
Output:
BHAVESH
To remove only the leading spaces:
LTRIM('  BHAVESH');
To remove only the trailing spaces:
RTRIM('BHAVESH  ');
The REVERSE function reverses the characters in a string.
Example:
REVERSE('BHAVESH');
Output:
HSEVAHB
96) Explain the process to access subdirectories recursively in Hive queries.
By using the commands below we can access subdirectories recursively in Hive:
hive> SET mapred.input.dir.recursive=true;
hive> SET hive.mapred.supports.subdirectories=true;
Hive tables can then be pointed to the higher-level directory, which is suitable for a directory structure like /data/country/state/city/.
97) How to skip header rows from a table in Hive?
Suppose a log file has header records such as:
System=
Version=
Sub-version=
The above three lines are headers that we do not want to include in our Hive query. To skip header lines from our tables in Hive, set a table property that allows us to skip them:
CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");
98) What is the maximum size of the string data type supported by Hive? Mention the binary formats Hive supports.
The maximum size of the string data type supported by Hive is 2 GB.
Hive supports the text file format by default, and it supports these binary formats: Sequence files, ORC files, Avro data files and Parquet files.
Sequence files: a general binary format that is splittable, compressible and row oriented.
ORC files: ORC stands for Optimized Row Columnar. It is a record-columnar, column-oriented storage format. It divides the table into row splits, and within each split the values are stored column by column.
Avro data files: like Sequence files they are splittable, compressible and row oriented, but they additionally support schema evolution and multi-language bindings.
99) What is the precedence order of HIVE configuration?
Hive uses the following precedence hierarchy for setting properties:
The SET command in Hive
The command-line -hiveconf option
hive-site.xml
hive-default.xml
hadoop-site.xml
hadoop-default.xml
100) If you run a select * query in Hive, Why does it not run MapReduce?
The hive.fetch.task.conversion property allows Hive to skip the MapReduce overhead for simple queries such as SELECT *, queries with LIMIT, etc., which lowers their latency.
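A hedged sketch: the property accepts the values none, minimal and more, and the exact behaviour depends on the Hive version (the table name is hypothetical):
set hive.fetch.task.conversion=more;
SELECT * FROM sales LIMIT 10;  -- served by a fetch task, no MapReduce job is launched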
101) How Hive can improve performance with ORC format tables?
We can store the Hive data in a highly efficient manner using the Optimized Row Columnar (ORC) file format. It overcomes many of the limitations of the other Hive file formats, and performance improves when reading, writing and processing the data.
Set hive.compute.query.using.stats=true;
Set hive.stats.dbclass=fs;
CREATE TABLE orc_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':' LINES TERMINATED BY '\n'
STORED AS ORC;
102) Explain about the different types of join in Hive?
HiveQL has 4 different types of joins –
JOIN - similar to an inner join in SQL; returns the rows that satisfy the join condition in both tables.
FULL OUTER JOIN - returns all the records from both the left and the right table, combining them where the join condition is fulfilled.
LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
103) How can you configure remote metastore mode in Hive?
To configure a remote metastore, the hive-site.xml file has to be configured with the property below:
hive.metastore.uris = thrift://<node1 or IP address>:9083
(the IP address and port of the metastore host)
104) What happens on executing the below query? After executing the below query, if you modify the column how will the changes be tracked?
hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table. When the values of the bonus column are modified, the changes are tracked by rebuilding the index, as sketched below.
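A minimal sketch of rebuilding that index after the underlying data changes (same index and table names as above):
ALTER INDEX index_bonuspay ON employee REBUILD;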
105) How to load Data from a .txt file to Table Stored as ORC in Hive?
LOAD DATA just copies the files into the Hive data directory; Hive does not do any transformation while loading data into tables. So in this case the input file /home/user/test_details.txt would need to be in ORC format already if you load it directly into an ORC table. A workaround is to create a temporary table stored as TEXTFILE, LOAD DATA into it, and then copy the data from this table into the ORC table. Here is an example:
CREATE TABLE test_details_txt (visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc (visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into the text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to the ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;
106) How do you create a Hive table with a multi-character delimiter?
FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to handle them is to use RegexSerDe:
CREATE EXTERNAL TABLE tableex (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$")
STORED AS TEXTFILE LOCATION '/user/myusername';
107) Is there any way to get the column name along with the output while execute any query in Hive?
If we want to see the column names of a table in the query output, the following Hive configuration property should be set to true:
hive> set hive.cli.print.header=true;
If you prefer to always see the column names, then add the above setting to the first line of the $HOME/.hiverc file.
Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any
108) How to Improve Hive Query Performance With Hadoop?
- Use Tez Engine
Apache Tez Engine is an extensible framework for building high-performance batch processing and interactive data processing. It is coordinated by YARN in Hadoop. Tez improved the MapReduce paradigm by increasing the processing speed and maintaining the MapReduce ability to scale to petabytes of data.
Tez engine can be enabled in your environment by setting hive.execution.engine to tez:
set hive.execution.engine=tez;
- Use Vectorization
Vectorization improves performance by processing a batch of 1,024 rows in a single operation instead of a single row at a time. It improves the performance of operations like filter, join, aggregation, etc.
Vectorization can be enabled in the environment by executing below commands.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
- Use ORCFile
Optimized Row Columnar (ORC) format provides highly efficient ways of storing Hive data, reducing the size of the stored data to as little as 25% of the original. The ORC file format is better than the other Hive file formats when it comes to reading, writing, and processing the data. It uses techniques like predicate push-down, compression, and more to improve query performance.
- Use Partitioning
With partitioning, data is stored in separate individual folders on HDFS. Instead of scanning the whole dataset, a query scans only the relevant partitions.
1)Create Temporary Table and Load Data Into Temporary Table
2)Create Partitioned Table
3)Enable Dynamic Hive Partition
4)Import Data From Temporary Table To Partitioned Table
- Use Bucketing
A Hive table is divided into a number of partitions (Hive partitions). Each partition can be further subdivided into clusters, or buckets; this is called bucketing or clustering.
- Cost-Based Query Optimization
Hive optimizes each query's logical and physical execution plan before submitting it for final execution. However, in the initial versions of Hive this optimization was not based on the cost of the query.
In later versions of Hive, queries are optimized according to their cost (for example which type of join to perform, how to order joins, the degree of parallelism, etc.).
109) How do I turn a horizontal output into a vertical output? An easier way to achieve that is to use the explode function and select the data for the prev and next columns separately.
110) Is there a simple way in Hive to replace non-numeric characters, excluding '-', so as to allow only negative and positive numbers?
We can try to use regexp_extract instead:
regexp_extract('abcd-9090','.*(-[0-9]+)',1)
111) What is the maximum character limit for a Hive table name?
The metastore service uses an RDBMS as its back-end, and you can see that the standard MySQL schema defines TBL_NAME as a 128-character column:
CREATE TABLE IF NOT EXISTS TBLS (...
TBL_NAME varchar(128) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
...
112) How can I convert a date in string format ("April 25, 2018") to a timestamp in Hive?
Use from_unixtime in conjunction with unix_timestamp:
select from_unixtime(unix_timestamp(`date`,'MMM dd, yyyy'),'yyyy-MM-dd')
113) How do you drop a Hive database when it still contains tables? Use the CASCADE keyword while dropping the database.
Example: hive> drop database sampleDB cascade;
114) I dropped and recreated a Hive external table, but no data is shown. What should I do?
This happens when the table is a partitioned table. The earlier insert would have created a partition such as partition_column='abcxyz'. When you drop and re-create the table, Hive loses the information about the partitions; it only knows about the table.
Run the command below to get Hive to re-create the partitions based on the data already in HDFS.
MSCK REPAIR TABLE user_dinesh.sampletable;
...Thank You...