Hive Project on Data Loading, Exporting

Hive Mini Project 1

Download Dataset 1:

https://drive.google.com/file/d/1WrG-9qv6atP-W3P_-gYln1hHyFKRKMHP/view

Download Dataset 2:

https://drive.google.com/file/d/1-JIPCZ34dyN6k9CqJa-Y8yxIGq6vTVXU/view

Note: both files are CSV files.

Start a Hive shell by running the hive command inside the Hive container.
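If the shell is not already open, you can attach to the container first (the container name hive_container below is a hypothetical placeholder; use the name from your docker-compose file):

docker exec -it hive_container bash
hive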

Both CSV files are stored in the container's local file system in the /app directory.

The /app directory is mounted as a volume in the docker-compose file.

These files are

AgentPerformance.csv and agent_loging_report.csv

1. Create a schema based on the given dataset

Steps to create the agent_login_report tables

In the agent_login_report.csv file, the date column is in dd-MM-yy format.

By default, Hive stores dates in yyyy-MM-dd format.

Approach

We first create a reference table that defines the date column as a string, and then use that reference table to populate the main table, converting the string dates to Hive's date format along the way.

Steps to create reference table named agent_login_report_ref
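A minimal sketch of these steps, with the date column kept as a string. The column names follow the queries used later in this document; the column order, the types, and the header-skip property are assumptions to verify against the actual CSV:

hive> create table agent_login_report_ref (
    > sl_no int,
    > agent string,
    > date_col string,
    > login_time string,
    > logout_time string,
    > duration string
    > )
    > row format delimited fields terminated by ','
    > tblproperties('skip.header.line.count'='1');

hive> load data local inpath '/app/agent_loging_report.csv' into table agent_login_report_ref;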

Now creating the main table using this ref table
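A sketch of the main table: the same layout, but date_col is now a real Hive date (types are again assumptions):

hive> create table agent_login_main (
    > sl_no int,
    > agent string,
    > date_col date,
    > login_time string,
    > logout_time string,
    > duration string
    > );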

Loading data into this main table
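One way to convert the dd-MM-yy strings into Hive dates while copying the rows across:

hive> insert overwrite table agent_login_main
    > select sl_no, agent,
    > cast(from_unixtime(unix_timestamp(date_col,'dd-MM-yy'),'yyyy-MM-dd') as date),
    > login_time, logout_time, duration
    > from agent_login_report_ref;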

Steps to create agent_performance table
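As with the login report, the date arrives as text, so we again start with a ref table. A sketch, with column names taken from the queries below; the order and types are assumptions to verify against AgentPerformance.csv:

hive> create table agent_performance_ref (
    > sl_no int,
    > date_col string,
    > agent string,
    > total_chants int,
    > avg_response_time string,
    > avg_resolution_time string,
    > avg_rating float,
    > total_feedback int
    > )
    > row format delimited fields terminated by ','
    > tblproperties('skip.header.line.count'='1');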

Loading data from the local file system into the ref table
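The file sits in the /app directory inside the container, so a local load works:

hive> load data local inpath '/app/AgentPerformance.csv' into table agent_performance_ref;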

Show tables to confirm whether the table was created
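For example:

hive> show tables;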

Use describe to check the column details
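For the ref table just created:

hive> describe agent_performance_ref;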

Creating the main table named agent_performance_main
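Same layout as the ref table, but with a proper date column (a sketch under the same column assumptions as above):

hive> create table agent_performance_main (
    > sl_no int,
    > date_col date,
    > agent string,
    > total_chants int,
    > avg_response_time string,
    > avg_resolution_time string,
    > avg_rating float,
    > total_feedback int
    > );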

Inserting data into the main table using the ref table
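Again converting the date string on the way in. The dd-MM-yy pattern below is an assumption; adjust it to whatever format AgentPerformance.csv actually uses:

hive> insert overwrite table agent_performance_main
    > select sl_no,
    > cast(from_unixtime(unix_timestamp(date_col,'dd-MM-yy'),'yyyy-MM-dd') as date),
    > agent, total_chants, avg_response_time, avg_resolution_time, avg_rating, total_feedback
    > from agent_performance_ref;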

Let's check in the HDFS file system whether the tables were created
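Assuming the default warehouse location and the default database:

hdfs dfs -ls /user/hive/warehouse
hdfs dfs -ls /user/hive/warehouse/agent_performance_main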

3. List of all agents' names.

hive> select distinct trim(agent) from agent_performance_main
    > union
    > select distinct trim(agent) from agent_login_main;

4. Find out each agent's average rating.

hive> select agent,avg(avg_rating) as avg_rating
    > from agent_performance_main
    > group by agent
    > ;

5. Total working days for each agent

hive> select agent,count(distinct date_col) as working_days from agent_login_main
    > group by agent
    > order by working_days desc;

6. Total queries that each agent has taken

hive> select agent,sum(total_chants) as total_queries
    > from agent_performance_main
    > group by agent
    > order by total_queries;

7. Total feedback that each agent has received

hive> select agent,sum(total_feedback) as total_feedback
    > from agent_performance_main
    > group by agent
    > order by total_feedback desc;

8. Agents whose average rating is between 3.5 and 4

hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating between 3.5 and 4;

9. Agents whose average rating is less than 3.5

hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating <3.5;

10. Agents whose average rating is more than 4.5

hive> select agent,avg(avg_rating) as avg_rating from agent_performance_main group by agent having avg_rating >4.5;

11. How much feedback have agents received when their average rating is more than 4.5?

hive> select agent,sum(case when avg_rating>4.5 then total_feedback else 0 end) as feedback_count from agent_performance_main group by agent;

12. Average weekly response time for each agent

hive> select agent,weekofyear(date_col) as week_no,from_unixtime(cast(avg(unix_timestamp(avg_response_time,'H:mm:ss'))as bigint),'H:mm:ss') as avg_response_week_time from
    > agent_performance_main
    > group by agent,weekofyear(date_col);

13. Average weekly resolution time for each agent

hive> select agent,weekofyear(date_col) as week_no,from_unixtime(cast(avg(unix_timestamp(avg_resolution_time,'H:mm:ss'))as bigint),'H:mm:ss') as avg_resolution_week_time from
    > agent_performance_main
    > group by agent,weekofyear(date_col);

14. Find the number of chats on which they have received feedback

hive> select agent,count(total_chants) as total_chants
    > from agent_performance_main
    > where (total_feedback<>0 and total_feedback is not null)
    > group by agent;

Alternatively, the weekly figures can be computed by splitting the HH:mm:ss strings manually (note this query also produces the total weekly contribution hours asked for in question 15):

select
agent,
date_format(date_col,'W') as week_no,
sum((split(avg_resolution_time,':')[0]*3600 + split(avg_resolution_time,':')[1]*60 + split(avg_resolution_time,':')[2])/3600) as total_weekly_contri_hrs,
avg((split(avg_response_time,':')[0]*3600 + split(avg_response_time,':')[1]*60 + split(avg_response_time,':')[2])/3600) as avg_weekly_response_time_hrs
from agent_performance_main
group by
agent,date_format(date_col,'W');

15. Total contribution hours for each agent on a weekly basis
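No query was shown for this one; a sketch using the duration column of agent_login_main, assuming duration is an HH:mm:ss string (the "Alternatively" query above derives a similar figure from resolution times):

hive> select agent, weekofyear(date_col) as week_no,
    > sum((split(duration,':')[0]*3600 + split(duration,':')[1]*60 + split(duration,':')[2])/3600) as total_contri_hrs
    > from agent_login_main
    > group by agent, weekofyear(date_col);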

16. Perform inner join, left join and right join based on the agent column and after joining the table export that data into your local system.

Ways to export query results

# Exports to a local directory

INSERT OVERWRITE LOCAL DIRECTORY '/test/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_login_main limit 2;

# Exports to an HDFS directory

INSERT OVERWRITE DIRECTORY '/user/data/output/export' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_login_main;


Performing Inner Join

hive> INSERT OVERWRITE LOCAL DIRECTORY '/inner_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main join agent_login_main on agent_login_main.agent=agent_performance_main.agent;

Performing Left Join

hive> INSERT OVERWRITE LOCAL DIRECTORY '/join/left_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main left join agent_login_main on agent_login_main.agent=agent_performance_main.agent
    > ;

Performing Right Join

hive> INSERT OVERWRITE LOCAL DIRECTORY '/join/right_join/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM agent_performance_main right join agent_login_main on agent_login_main.agent=agent_performance_main.agent
    > ;

Checking our local file system to confirm the files were exported

Go to the local file system

Navigate to the root directory

cd /

You will see the join folder.

cd into the join folder:

cd join

You will see the list of folders inside the join directory.

17. Perform partitioning on top of the agent column and then on top of that perform bucketing for each partition.

Approach

Here we will do two things:

We will partition the table agent_login_main on the agent column.

On top of that, we will create buckets on the date_col column.

Let's create the partitioned and bucketed table.
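A sketch of the DDL, partitioned on agent and bucketed on date_col. The bucket count of 4 and the column types are assumptions; on older Hive versions, also run set hive.enforce.bucketing=true; before loading:

hive> create table agent_login_main_part_buck (
    > sl_no int,
    > date_col date,
    > login_time string,
    > logout_time string,
    > duration string
    > )
    > partitioned by (agent string)
    > clustered by (date_col) into 4 buckets;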

Loading data into the partitioned and bucketed table

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table agent_login_main_part_buck partition (agent)
    > select sl_no,date_col,login_time,logout_time,duration,agent from agent_login_main;
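To verify, list the partitions that dynamic partitioning created:

hive> show partitions agent_login_main_part_buck;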