In my previous article I described the characteristics of Hadoop and its configuration. In the current market you also need to know more about one part of the Hadoop ecosystem: Hive. In this article I would like to cover the Hive architecture and its characteristics in detail. I would also like to give a snapshot of the Hive data model with its diagram.
What will we cover in this article?
- What is Hive?
- Need of Hive
- Hive Architecture
- Characteristics of Hive
What is Hive?
Do you find writing MapReduce jobs difficult? With Hadoop Hive you can submit SQL-like queries and have them run as MapReduce jobs. If you are familiar with SQL, Hive is a good fit because it lets you get MapReduce work done quickly. Like Pig, Hive has its own language, called HiveQL (HQL), which is comparable to SQL. Just as Pig compiles Pig Latin scripts, Hive converts HQL queries into MapReduce jobs. The nice part is that using Hadoop Hive doesn’t require any Java expertise.
Hive is a data warehouse system built for analyzing structured data. It was initially developed by Facebook and is now maintained by the Apache Software Foundation. Hive was developed to make fault-tolerant analysis of large amounts of data easier, and it has been widely used in big data analytics for more than a decade. Despite having numerous rivals such as Impala, Apache Hive stands out from other systems because its data processing is fault-tolerant.
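To make the idea of "a SQL-like query becomes a MapReduce job" more concrete, here is a minimal Python sketch. It is not how Hive is implemented; it only imitates the map, shuffle, and reduce phases for a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept. The employees rows and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an `employees` table: (name, dept).
ROWS = [
    ("asha", "sales"),
    ("ben", "engineering"),
    ("chen", "sales"),
    ("dina", "engineering"),
    ("eli", "sales"),
]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, keyed on the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in order, like a single MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

With Hive, the one-line query replaces all of this hand-written plumbing, which is why SQL users can be productive without writing MapReduce code themselves.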
Need of Hive :
In essence, Facebook had to overcome a lot of obstacles before it arrived at Apache Hive. One of those difficulties was the volume of data produced daily. Traditional relational databases (RDBMS) queried with SQL couldn’t handle such a massive volume of data, so Facebook searched for better options. In the beginning, MapReduce was used to address this issue. However, MapReduce was highly challenging to use because it required programming expertise (typically in Java) rather than SQL knowledge. Facebook then turned to Hive to solve these difficulties. With Apache Hive, developers can avoid hand-writing sophisticated MapReduce jobs.
Hive Architecture :
The user interface (UI) is where users submit queries and other operations to the system. As of 2011, the system had a command line interface, and a web-based GUI was in development. The driver is the component that receives the queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces. The compiler parses the query, performs semantic analysis on the different query blocks and query expressions, and then generates an execution plan using table and partition metadata retrieved from the metastore.
The metastore stores all the structure information for the tables and partitions in the warehouse, including column and column type information, the serializers and deserializers required to read and write data, and the corresponding HDFS files where the data is kept. The execution engine is the component that executes the plan produced by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these stages and executes each stage on the appropriate system components.
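The idea of a plan as a DAG of stages can be sketched in a few lines of Python. This is an illustrative model only, not Hive's execution engine: the stage names and the run_plan helper are hypothetical, and the standard library's graphlib module (Python 3.9+) stands in for the dependency tracking.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
PLAN = {
    "stage-1-scan-orders": set(),
    "stage-2-scan-customers": set(),
    "stage-3-join": {"stage-1-scan-orders", "stage-2-scan-customers"},
    "stage-4-aggregate": {"stage-3-join"},
}


def run_plan(plan, run_stage):
    """Run every stage only after all of its dependencies have finished."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    order = []
    while sorter.is_active():
        # Stages returned together have no unmet dependencies; a real engine
        # could run them in parallel. Sorting keeps this sketch deterministic.
        for stage in sorted(sorter.get_ready()):
            run_stage(stage)
            order.append(stage)
            sorter.done(stage)
    return order


if __name__ == "__main__":
    run_plan(PLAN, lambda stage: print("running", stage))
```

The two scan stages are independent, so they become ready together; the join has to wait for both, and the aggregate has to wait for the join.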
Hive Data Model with Hive Architecture :
Tables :
The two different types of tables in Hive are as follows:
- External Table
- Managed Table
Tables in Apache Hive are analogous to the tables found in an RDBMS. Conceptually, a Hive table consists of the data being stored and the associated metadata that describes how that data is organized. Tables can be filtered, projected, joined, and unioned. The data normally resides in HDFS, although it can also be on S3, the local disk, or any other Hadoop filesystem. Hive doesn’t store the metadata in HDFS, however, but in a relational database.
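The split between "data in a filesystem" and "metadata in a relational database" can be illustrated with a small Python sketch using an in-memory SQLite database. The schema, the page_views table, its location, and the serde label are all invented for this example; they do not reflect the real metastore schema.

```python
import sqlite3


def build_metastore():
    """Create a toy metadata database (hypothetical schema, not Hive's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tables (name TEXT PRIMARY KEY, location TEXT, serde TEXT);
        CREATE TABLE columns (table_name TEXT, position INTEGER, name TEXT, type TEXT);
        """
    )
    # Only metadata is stored here; the rows themselves would live in the
    # files under `location` on HDFS or another filesystem.
    conn.execute(
        "INSERT INTO tables VALUES (?, ?, ?)",
        ("page_views", "/warehouse/page_views", "text-serde"),
    )
    conn.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [
            ("page_views", 0, "user_id", "bigint"),
            ("page_views", 1, "url", "string"),
        ],
    )
    return conn


def describe(conn, table):
    """Look up where a table's data lives and how its columns are typed."""
    location, serde = conn.execute(
        "SELECT location, serde FROM tables WHERE name = ?", (table,)
    ).fetchone()
    cols = conn.execute(
        "SELECT name, type FROM columns WHERE table_name = ? ORDER BY position",
        (table,),
    ).fetchall()
    return {"location": location, "serde": serde, "columns": cols}


if __name__ == "__main__":
    print(describe(build_metastore(), "page_views"))
```

A compiler-like component can answer "which columns exist and where are the files?" from this database alone, without touching the data files.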
Partitions :
Each table can have one or more partition keys that determine how the data is stored. For example, a table T with a date partition column ds stores the files for a particular date in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune the data to be inspected based on query predicates. For example, a query that is interested in rows from T that satisfy the predicate T.ds = ‘2008-09-01’ only needs to look at files in the <table location>/ds=2008-09-01/ directory in HDFS.
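Partition pruning can be sketched in Python as follows. This is a simplified model under stated assumptions: the /warehouse/t location and the helper names are hypothetical, and a real system would obtain the list of partition values from the metastore rather than from a hard-coded list.

```python
import posixpath


def partition_dir(table_location, column, value):
    """Build the directory for one partition, e.g. <table location>/ds=<date>."""
    return posixpath.join(table_location, f"{column}={value}")


def prune_partitions(table_location, column, partition_values, predicate):
    """Return only the partition directories whose value satisfies the predicate."""
    return [
        partition_dir(table_location, column, value)
        for value in partition_values
        if predicate(value)
    ]


if __name__ == "__main__":
    known = ["2008-08-31", "2008-09-01", "2008-09-02"]  # hypothetical partitions
    # Models the predicate T.ds = '2008-09-01' from the text above.
    print(prune_partitions("/warehouse/t", "ds", known, lambda ds: ds == "2008-09-01"))
```

Because the predicate is evaluated against partition values, the other two directories are never opened at all.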
Buckets :
The data in each partition can in turn be divided into buckets based on the hash of a table column. Each bucket is stored as a separate file in the partition directory. Bucketing lets the system efficiently evaluate queries that depend on a sample of the data (these are queries that use the TABLESAMPLE clause on the table).
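A rough Python sketch of hash bucketing is shown below. It is an assumption-laden illustration: zlib.crc32 is used only because it is deterministic and is not the hash function Hive uses, and the helper names and sample rows are invented.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to a bucket number via hash modulo bucket count."""
    # crc32 is a stand-in hash chosen for determinism, not Hive's own function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def bucketize(rows, key_index, num_buckets):
    """Split rows into buckets; each list models one bucket file in a partition."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


def sample_bucket(buckets, bucket_number):
    """Sampling reads a single bucket instead of scanning every row."""
    return buckets[bucket_number]


if __name__ == "__main__":
    rows = [(user_id, f"user{user_id}") for user_id in range(100)]  # made-up data
    buckets = bucketize(rows, key_index=0, num_buckets=4)
    print([len(bucket) for bucket in buckets])
    print(sample_bucket(buckets, 0)[:3])
```

Since every row with the same column value lands in the same bucket, reading one bucket file gives a sample that is consistent across runs without scanning the whole partition.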
Characteristics of Hive :
In this section we will look at the characteristics of Hive.
- Scalability of Hive is facilitated by the ability to convert queries into MapReduce jobs.
- Supports a web interface as well, which means that both web browser clients and application APIs can contact the Hive DB server.
- Handles data warehouse applications, making it appropriate for analyzing extremely large static data sets where fast response times are not a requirement.
- Both the output of a HiveQL query and the data stored in the tables reside on the Hadoop cluster.
- Additionally, it supports User-Defined Functions (UDFs) for tasks like data filtering and cleansing. UDFs can be defined according to our needs.
I hope you liked this article on Hive Architecture with its diagrammatic representation. If you liked it, or if you have any issues or comments about it, kindly feel free to share them in the comments section.