Hive Project on Data Loading and Exporting
Hive Mini Project 1
Download Dataset 1:
https://drive.google.com/file/d/1WrG-9qv6atP-W3P_-gYln1hHyFKRKMHP/view
Download Dataset 2:
https://drive.google.com/file/d/1-JIPCZ34dyN6k9CqJa-Y8yxIGq6vTVXU/view
Note: both files are CSV files.
Start a Hive shell using the command hive inside the Hive container.
Both CSV files are stored in the container's local file system in the
/app directory.
The /app directory is mounted as a volume in the docker-compose file.
These files are
AgentPerformance.csv and agent_loging_report.csv
1. Create a schema based on the given dataset
Steps to create the agent_loging_report table
This is how our agent_login_report.csv file looks.
The date format is dd-mm-yy.
By default, Hive stores dates in yyyy-MM-dd format.
Approach
So we need to create a reference table first, defining the date column as a string, and then use this reference table to create the main table.
Steps to create the reference table named agent_login_report_ref
Now create the main table using this ref table
Load the data into the main table
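Since the original query screenshots are not reproduced here, the steps above can be sketched as follows. The non-date column names (sl_no, login_time, logout_time, duration) are assumptions about the CSV layout, not taken from the source:

```sql
-- Reference table: date kept as STRING because the file uses dd-MM-yy.
CREATE TABLE IF NOT EXISTS agent_login_report_ref (
  sl_no INT,
  agent STRING,
  date_col STRING,
  login_time STRING,
  logout_time STRING,
  duration STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');

-- Load the CSV from the mounted /app directory into the reference table.
LOAD DATA LOCAL INPATH '/app/agent_loging_report.csv'
INTO TABLE agent_login_report_ref;

-- Main table: same layout, but date_col as a proper DATE.
CREATE TABLE IF NOT EXISTS agent_login_main (
  sl_no INT,
  agent STRING,
  date_col DATE,
  login_time STRING,
  logout_time STRING,
  duration STRING
);

-- Convert the dd-MM-yy string to Hive's native yyyy-MM-dd DATE on insert.
INSERT OVERWRITE TABLE agent_login_main
SELECT sl_no, agent,
       to_date(from_unixtime(unix_timestamp(date_col, 'dd-MM-yy'))),
       login_time, logout_time, duration
FROM agent_login_report_ref;
```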
Steps to create agent_performance table
Load the data from the local file system into the ref table
Use SHOW TABLES to confirm whether the table was created
Use DESCRIBE to check the details of the columns
Create the main table named agent_performance_main
Insert the data into the main table from the ref table
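A sketch of the same flow for the performance data; the column names are assumed to match those used in the queries later in the document (total_chants, total_feedback, avg_response_time, avg_resolution_time, avg_rating):

```sql
-- Reference table: date and time columns kept as STRING for parsing later.
CREATE TABLE IF NOT EXISTS agent_performance_ref (
  sl_no INT,
  date_col STRING,
  agent STRING,
  total_chants INT,
  total_feedback INT,
  avg_response_time STRING,
  avg_resolution_time STRING,
  avg_rating FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');

-- Load from the mounted /app directory.
LOAD DATA LOCAL INPATH '/app/AgentPerformance.csv'
INTO TABLE agent_performance_ref;

-- Confirm the table exists and inspect its columns.
SHOW TABLES;
DESCRIBE agent_performance_ref;

-- Main table with date_col converted to DATE.
CREATE TABLE IF NOT EXISTS agent_performance_main (
  sl_no INT,
  date_col DATE,
  agent STRING,
  total_chants INT,
  total_feedback INT,
  avg_response_time STRING,
  avg_resolution_time STRING,
  avg_rating FLOAT
);

INSERT OVERWRITE TABLE agent_performance_main
SELECT sl_no,
       to_date(from_unixtime(unix_timestamp(date_col, 'dd-MM-yy'))),
       agent, total_chants, total_feedback,
       avg_response_time, avg_resolution_time, avg_rating
FROM agent_performance_ref;
```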
Let’s check in the HDFS file system whether the tables were created.
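Managed-table data lives under Hive's warehouse directory, so it can be listed directly from the Hive shell. The path below is the default warehouse location and may differ if hive.metastore.warehouse.dir is customized in your setup:

```sql
-- List the warehouse directory from inside the Hive shell.
hive> dfs -ls /user/hive/warehouse/;
```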
3. List of all agents' names.
hive> select distinct trim(agent) from agent_performance_main union select distinct trim(agent) from agent_login_main;
4. Find out each agent's average rating.
hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent;
5. Total working days for each agent
hive> select agent,count(*) as login_count from agent_login_main group by agent;
6. Total queries that each agent has taken
hive> select agent,sum(total_chants) as total_queries from agent_performance_main group by agent;
7. Total feedback that each agent has received
hive> select agent,sum(total_feedback) as total_feedback from agent_performance_main group by agent;
8. Agent names who have an average rating between 3.5 and 4
hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating between 3.5 and 4;
9. Agent names who have a rating less than 3.5
hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating < 3.5;
10. Agent names who have a rating more than 4.5
hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating > 4.5;
11. How many feedbacks has each agent received with an average rating above 4.5?
hive> select agent,count(case when avg_rating > 4.5 then 1 end) as feedback_count from agent_performance_main group by agent;
12. Average weekly response time for each agent
hive> select agent,weekofyear(date_col) as week_no,from_unixtime(cast(avg(unix_timestamp(avg_response_time,'H:mm:ss')) as bigint),'H:mm:ss') as avg_response_week_time from agent_performance_main group by agent,weekofyear(date_col);
13. Average weekly resolution time for each agent
hive> select agent,weekofyear(date_col) as week_no,from_unixtime(cast(avg(unix_timestamp(avg_resolution_time,'H:mm:ss')) as bigint),'H:mm:ss') as avg_resolution_week_time from agent_performance_main group by agent,weekofyear(date_col);
14. Find the number of chats on which agents have received feedback
hive> select agent,count(total_chants) as total_chants from agent_performance_main where total_feedback <> 0 group by agent;
15. Total contribution hours for each agent on a weekly basis
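No query is shown in the source for this task. A sketch, assuming agent_login_main has a duration column in H:mm:ss format (both the column name and the format are assumptions):

```sql
-- Sum each agent's login duration per week, converted to hours.
SELECT agent,
       weekofyear(date_col) AS week_no,
       sum(hour(duration) + minute(duration)/60 + second(duration)/3600)
         AS total_contribution_hours
FROM agent_login_main
GROUP BY agent, weekofyear(date_col);
```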
16. Perform inner join, left join and right join based on the agent column and after joining the table export that data into your local system.
Ways to export a file to the local system
# Exports to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/test/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_login_main limit 2;
# Exports to an HDFS directory
INSERT OVERWRITE DIRECTORY '/user/data/output/export' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM emp.employee;
Performing Inner Join
hive> INSERT OVERWRITE LOCAL DIRECTORY '/inner_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main join agent_login_main on agent_login_main.agent=agent_performance_main.agent;
Performing Left Join
hive> INSERT OVERWRITE LOCAL DIRECTORY '/join/left_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main left join agent_login_main on agent_login_main.agent=agent_performance_main.agent;
Performing Right Join
hive> INSERT OVERWRITE LOCAL DIRECTORY '/join/right_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main right join agent_login_main on agent_login_main.agent=agent_performance_main.agent;
Checking our local file system to see whether the files were exported
Go to the local file system
Navigate to the root directory
cd /
You will see the join folder
cd into the join folder
cd join
You will get a list of the folders inside the join directory
17. Perform partitioning on top of the agent column and then, on top of that, perform bucketing within each partition.
Approach
So here we will do these two things:
We will partition the table on the agent column
The table will be the agent_login_main table
And on top of that we will create buckets on the date_col column
Let’s see
Let’s create the partitioned and bucketed table
Loading data into the partitioned and bucketed table
hive> set hive.exec.dynamic.partition.mode=nonstrict; |
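The remaining steps appear only as screenshots in the source. A sketch of the DDL and the dynamic-partition insert, with the non-partition column names assumed from the login report:

```sql
-- Allow all partitions to be created dynamically.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partitioned on agent, bucketed on date_col (the bucket count here is arbitrary).
CREATE TABLE IF NOT EXISTS agent_login_part_buck (
  sl_no INT,
  date_col DATE,
  login_time STRING,
  logout_time STRING,
  duration STRING
)
PARTITIONED BY (agent STRING)
CLUSTERED BY (date_col) INTO 4 BUCKETS;

-- Dynamic-partition insert: the partition column must come last in the SELECT.
INSERT OVERWRITE TABLE agent_login_part_buck PARTITION (agent)
SELECT sl_no, date_col, login_time, logout_time, duration, agent
FROM agent_login_main;
```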