In my previous article I covered the characteristics of Hadoop along with its configuration. In today's market you also need to know about an important part of the Hadoop ecosystem: Hive. In this article I would like to explain the Hive architecture and its characteristics in detail, along with a snapshot of the Hive data model and its diagram.
What will we cover in this article?
Do you consider building MapReduce jobs to be a difficult task? With Hadoop Hive you can submit SQL-like queries and have them run as MapReduce jobs. If you are familiar with SQL, Hive is the right tool for you, because it lets you accomplish MapReduce tasks quickly. Like Pig, Hive has its own language: HiveQL (HQL), which is comparable to SQL. Just as Pig Latin does, HQL translates SQL-like queries into MapReduce jobs. The nice aspect is that using Hadoop Hive doesn't require any Java expertise. Hive is a data warehouse system that was created for analyzing structured data. It was initially developed by Facebook and is now an Apache project. Hive was built to make fault-tolerant analysis of large amounts of data easier, and it has been widely used in big data analytics for more than a decade. Despite having several rivals, such as Impala, Apache Hive stands out because it is fault-tolerant during data processing and interpretation.
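As a quick illustration, here is a minimal HiveQL sketch; the table and column names (page_views, user_id) are hypothetical. Hive compiles a query like this into one or more MapReduce jobs behind the scenes:

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING)
-- Count page views per user; Hive turns this into MapReduce job(s).
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```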
In essence, Facebook had to overcome many obstacles before successfully adopting Apache Hive. One of those difficulties was the volume of data produced daily. Traditional databases, including RDBMS and SQL systems, couldn't manage such a massive volume of data, so Facebook searched for better options. In the beginning, MapReduce was used to address this issue. However, using MapReduce was highly challenging because it required Java programming knowledge, which not every analyst had. Later, Facebook realized that Hadoop Hive could genuinely solve its difficulties: with Apache Hive, developers can avoid hand-writing sophisticated MapReduce jobs.
The user interface (UI) is where users interact with the system to submit queries and carry out other tasks. As of 2011, the system featured a command line interface, and a web-based GUI was under development. The driver is the component that receives the queries. It provides execute and fetch APIs modeled on JDBC/ODBC interfaces and supports the idea of session handles. The compiler parses the query, performs semantic analysis on the various query blocks and query expressions, and then develops an execution plan using table and partition metadata retrieved from the metastore.
The metastore maintains all the structural metadata for the various tables and partitions in the warehouse, including column names and types, the serializers and deserializers required to read and write the data, and the locations of the HDFS files where the data is kept. The execution engine is the component that carries out the execution plan produced by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between these stages and runs each stage on the appropriate system components.
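To see the plan the compiler produces, you can prefix a query with EXPLAIN. A rough sketch follows; the table name page_views is hypothetical, and the exact output format varies by Hive version:

```sql
-- Show the stage DAG the compiler generates for a query.
-- The output lists the stages (for example a map-reduce stage
-- followed by a fetch stage) and the dependencies between them.
EXPLAIN
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```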
The two types of tables in Hive are managed tables, where Hive owns both the data and the metadata, and external tables, where Hive tracks only the metadata and leaves the underlying files in place. Tables in Apache Hive are the same concept as tables in an RDBMS: conceptually, a table consists of the data being stored, while the associated metadata describes how that data is organized. On tables we can perform filter, project, join, and union operations. Table data in Hadoop normally resides in HDFS, although it can also live on S3, the local disk, or any other Hadoop filesystem. Hive, however, stores the metadata not in HDFS but in a relational database.
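Here is a minimal DDL sketch of the two table types; the table names and the HDFS path are hypothetical:

```sql
-- Managed table: Hive controls the data; DROP TABLE deletes the files too.
CREATE TABLE page_views (user_id STRING, url STRING);

-- External table: Hive only records metadata; DROP TABLE leaves the files.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw_logs';
```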
Each table can have one or more partition keys that determine how the data is stored. For example, a table T with a date partition column ds stores the files containing the data for a given date in the <table location>/ds=<date> directory in HDFS. A query interested only in rows from T that satisfy the predicate T.ds = '2008-09-01' would therefore only need to look at files in the ds=2008-09-01 directory in HDFS. Partitions thus allow the system to prune the data to be inspected based on query predicates.
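A sketch of how this looks in HiveQL; the column names beyond ds are hypothetical:

```sql
-- Partitioned by date; each ds value gets its own HDFS subdirectory.
CREATE TABLE T (user_id STRING, url STRING)
PARTITIONED BY (ds STRING);

-- The predicate on ds lets Hive read only the .../ds=2008-09-01/ files.
SELECT user_id, url
FROM T
WHERE ds = '2008-09-01';
```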
Within each partition, the data can be further divided into buckets based on the hash of a table column. Each bucket is stored as a separate file in the partition directory. Bucketing lets the system efficiently evaluate queries that depend on a sample of the data (queries that use the TABLESAMPLE clause on the table).
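A bucketing sketch, with a hypothetical table layout:

```sql
-- Hash user_id into 32 buckets; each bucket is stored as a
-- separate file inside the partition directory.
CREATE TABLE T2 (user_id STRING, url STRING)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Sample one bucket out of 32 instead of scanning the whole table.
SELECT COUNT(*)
FROM T2 TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);
```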
In this section we will look at the characteristics of Hive.
I hope you liked this article on Hive architecture with its diagrammatic representation. If you have any issues or comments about it, feel free to share them in the comments section.