Author:
Bhawesh Mehta
Data Enthusiastic
Project Name:
NYC Parking Tickets: An Exploratory Analysis Using Hive
One of the major objectives of this assignment is gaining familiarity with how an analysis works in Hive and how you can gain insights from large datasets.
Problem Statement -
New York City is a thriving metropolis and just like most other cities of similar size, one of the biggest problems its residents face is parking.
The classic combination of a huge number of cars and a cramped geography is the exact recipe that leads to a large number of parking tickets.
In an attempt to scientifically analyze this phenomenon, the NYC Police Department regularly collects data related to parking tickets.
This data is made available by NYC Open Data portal. We will try and perform some analysis on this data.
Download Dataset - https://data.cityofnewyork.us/browse?q=parking+tickets
Action:Export the data in CSV Format from the site given above
Note: Consider only the year 2017 for analysis and not the Fiscal year.
Before Going for the assignment >>
Let’s Load the data into HDFS file storage system from local
The file named parking_vio.csv is currently in local system
Local system means the file system of the container i am in.
Please Note:
(Local System does not refer to the local computer file system)

Now I will be moving this file from container file system to hadoop file system
Before loading the file in hadoop environment let’s check in the current directory in my hadoop environment.

Now loading data into hdfs file system using put command
Here I have loaded the file inside the local_to_hdfs directory in hadoop

Let me show you the content inside the /local_to_hdfs directory of hdfs
To see the content you have to use
hadoop fs -cat /local_to_hdfs
Question arises why the .csv file is not created ?
Answer because we have not created the local_to_hdfs folder
Now let me delete this first
hadoop fs -rm rf /local_to_hdfs
First create this folder and use put command to load the file

Here clearly you can see that now .csv file can be seen inside local_to_hdfs file system
The analysis can be divided into two parts:
Part-I
Examine the data:
Let’s Examine the data in the hadoop file system
Find the total number of tickets for the year.
Find out how many unique states the cars which got parking tickets came from.
Some parking tickets don’t have addresses on them, which is cause for concern. Find out how many such tickets there are(i.e. tickets where either "Street Code 1" or "Street Code 2" or "Street Code 3" is empty )
Approach 1
Using
Simple Internal table
Lets create a internal table named parking
Challenge here is that date column contains these values in these format
That is in mm-dd-yyyy or mm-dd-yyyy
So our approach will be to create a temporary table first defining them in string format and then loading the data from that temp table to the main table in data format in hive.
Before loading the data in main table we will replace / with - so that uniformity is maintained
So let’s try this
By Default hive stores the date in yyyy-MM-dd format
Lets create a reference table named violation_ref


Lets load data from our hdfs file system to this table named violation_ref

We did a mistake here while creating the table we should have escaped the first row as it contains the column name
Let’s drop the violation_ref table

Now again create violation_ref table


Note how i have used tblproperties(“skip.header.line.count”=”1”)
Loading data from hdfs file system
Here I noticed one thing then when i dropped the table that the file inside the hdfs file system also got deleted

Checking if the table has data in hdfs file system or not

Now lets create a violation_main table with date column defined
Note:
By Default hive stores the date in yyyy-MM-dd format
hive> insert into violation_main select summons_no,plate_id,registration_state,plate_type,FROM_UNIXTIME(UNIX_TIMESTAMP(issue_date, 'MM/dd/yyyy'), 'yyyy-MM-dd') as issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code_1,street_code_2,street_code_3,vehicle_exp_date,violation_location,violation_precinct,issuer_precinct,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violationcounty,violationinfrontof_or_opp,house_no,street_no,intersection_street,date_first_observed,law_section,sub_division,violation_legal_code,daysparkingin_effect,fromhoursin_effect,tohoursin_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_no,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation from violation_ref; |
Now let’s check how violation_main table looks in hdfs file system

Notice how file is splitted into multiple sub files starting with 000000_0
1-Find the total number of tickets for the year.
hive> select issue_date,count(*) from violation_main group by issue_date; |
But we need the data for 2017 only
Sol let’s modify this and create a temp table with data of year 2017 only
hive> create table parking_vio_2017 > ( > summons_no int, > plate_id string, > registration_state string, > plate_type string, > issue_date date, > violation_code int, > vehicle_body_type string, > vehicle_make string, > issuing_agency string, > street_code_1 int, > street_code_2 int, > street_code_3 int, > vehicle_exp_date int, > violation_location int, > violation_precinct int, > issuer_precinct int, > issuer_code int, > issuer_command string, > issuer_squad string, > violation_time string, > time_first_observed string, > violation_county string, > violation_in_front_of_or_opp string, > house_no string, > street_no string, > intersection_street string, > date_first_observed string, > law_section int, > sub_division string, > violation_legal_code string, > days_parking_in_effect string, > from_hours_in_effect string, > to_hours_in_effect string, > vehicle_color string, > unregistered_vehicle int, > vehicle_year string, > meter_no string, > feet_from_curb string, > violation_post_code string, > violation_description string, > no_standing_or_stopping_violation string, > hydrant_violation string, > double_parking_violation string > ) > row format delimited > fields terminated by ','; OK Time taken: 12.221 seconds hive> |
hive> insert into parking_vio_2017 select summons_no,plate_id,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code_1,street_code_2,street_code_3,vehicle_exp_date,violation_location,violation_precinct,issuer_precinct,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violation_county,violation_in_front_of_or_opp,house_no,street_no,intersection_street,date_first_observed,law_section,sub_division,violation_legal_code,days_parking_in_effect,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_no,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation from violation_main where year(issue_date)=2017; |
Let’s check if data in loaded in the hdfs warehouse
# hadoop fs -ls /user/hive/warehouse/hive_db.db/parking_vio_2017 |
Found 8 items -rwxrwxr-x 3 root supergroup 133306082 2023-04-08 11:54 /user/hive/warehouse/hive_db.db/parking_vio_2017/000000_0 -rwxrwxr-x 3 root supergroup 133615823 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000001_0 -rwxrwxr-x 3 root supergroup 133371174 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000002_0 -rwxrwxr-x 3 root supergroup 133720470 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000003_0 -rwxrwxr-x 3 root supergroup 133466345 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000004_0 -rwxrwxr-x 3 root supergroup 133509997 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000005_0 -rwxrwxr-x 3 root supergroup 133557192 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000006_0 -rwxrwxr-x 3 root supergroup 80027489 2023-04-08 11:56 /user/hive/warehouse/hive_db.db/parking_vio_2017/000007_0 |
2-Find out how many unique states the cars which got parking tickets came from.
hive> select distinct registration_state from violation_main; |
hive> select registration_state,count(*) as instance from parking_vio_2017 group by registration_state order by instance desc; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408120319_8c69e2b9-a0d2-4432-a51e-74cc7f583b36 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:03:23,028 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:03:27,707 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:03:37,722 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1615623838_0004 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:03:39,869 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1643808020_0005 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 55123329229 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 14141225856 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK registration_state instance NY 4273941 NJ 475824 PA 140284 CT 70403 FL 69468 IN 45525 MA 38941 VA 34367 MD 30213 NC 27152 TX 18827 IL 18666 GA 17537 99 16055 AZ 12379 OH 12281 CA 12152 ME 10806 SC 10394 MN 10083 OK 9088 TN 8514 DE 7905 MI 7231 RI 5814 NH 4119 VT 3683 AL 3178 WA 3052 OR 2622 MO 2483 ON 2460 WI 2127 QB 1998 IA 1938 DC 1929 CO 1841 KY 1795 DP 1794 LA 1689 MS 1582 WV 1265 AR 994 SD 859 NM 792 ID 763 NV 725 KS 706 NE 704 UT 561 MT 505 GV 348 NS 322 AK 298 ND 254 WY 188 HI 156 AB 79 PE 61 NB 57 BC 54 PR 38 MB 17 SK 9 FO 8 Time taken: 20.127 seconds, Fetched: 65 row(s) |
3-Some parking tickets don’t have addresses on them, which is cause for concern. Find out how many such tickets there are(i.e. tickets where either "Street Code 1" or "Street Code 2" or "Street Code 3" is empty )
hive> select count(*) from parking_vio_2017 where street_code_1 is null or street_code_2 is null or street_code_3 is null or street_code_1=0 or street_code_2=0 or street_code_3=0; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408122406_546bfba9-608e-49bb-b6cc-c6ce23ab817d Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:24:08,395 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:24:11,784 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:24:18,791 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1069774194_0007 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 44057236165 HDFS Write: 5072876495 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK _c0 1816816 Time taken: 12.794 seconds, Fetched: 1 row(s) hive> |
Part-II: Aggregation tasks
1) How often does each violation code occur? (frequency of violation codes - find the top 5)
hive> select violation_code,count(*) as instance from parking_vio_2017 group by violation_code order by instance desc limit 5; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408123022_9378794a-2f09-4e0b-9bad-3113d0113f09 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:30:31,657 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:30:40,682 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:30:46,688 Stage-1 map = 33%, reduce = 0% 2023-04-08 12:30:48,710 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:30:58,737 Stage-1 map = 100%, reduce = 100% Ended Job = job_local488446953_0008 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:31:01,739 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1353223834_0009 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 79473118957 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 20228673288 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK violation_code instance 21 768082 36 662765 38 542079 14 476660 20 319646 Time taken: 38.911 seconds, Fetched: 5 row(s) hive> |
2) How often does each vehicle body type get a parking ticket? How about the vehicle make? (find the top 5 for both)
For Vehicle body type
hive> select vehicle_body_type,count(*) as instance from parking_vio_2017 group by vehicle_body_type order by instance desc limit 5; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408123551_a96f78b5-6b0a-4fae-8fd6-54ec188a7fc3 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:35:54,477 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:36:00,484 Stage-1 map = 8%, reduce = 0% 2023-04-08 12:36:03,487 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:36:09,497 Stage-1 map = 25%, reduce = 0% 2023-04-08 12:36:12,502 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:36:17,508 Stage-1 map = 75%, reduce = 0% 2023-04-08 12:36:18,509 Stage-1 map = 100%, reduce = 100% Ended Job = job_local198719902_0010 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:36:20,251 Stage-2 map = 100%, reduce = 100% Ended Job = job_local457352855_0011 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 87589715533 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 22257822432 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK vehicle_body_type instance SUBN 1883953 4DSD 1547307 VAN 724025 DELV 358982 SDN 194197 Time taken: 28.821 seconds, Fetched: 5 row(s) hive> |
For vehicle make
hive> select vehicle_make,count(*) as instance from parking_vio_2017 group by vehicle_make order by instance desc limit 5; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408123750_9647cd58-0775-4c05-af32-27742608aed6 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:37:54,203 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:37:56,206 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:38:03,214 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1454523152_0012 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:38:05,257 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1060513731_0013 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 95706312109 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 24286971576 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK vehicle_make instance FORD 636842 TOYOT 605290 HONDA 538884 NISSA 462017 CHEVR 356032 Time taken: 14.345 seconds, Fetched: 5 row(s) hive> |
3) A precinct is a police station that has a certain zone of the city under its command. Find the (5 highest) frequencies of:
a.) Violating Precincts (this is the precinct of the zone where the violation occurred)
hive> select violation_precinct,count(*) as instances from parking_vio_2017 group by violation_precinct order by instances desc limit 5; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408124537_11dc8925-f0f4-4397-8eee-b5be88d363f3 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:45:40,763 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:45:43,765 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:45:49,771 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1258832131_0014 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:45:51,660 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1685272680_0015 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 103822908685 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 26316120720 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK violation_precinct instances 0 925596 19 274443 14 203552 1 174702 18 169131 Time taken: 13.861 seconds, Fetched: 5 row(s) hive> |
b.) Issuer Precincts (this is the precinct that issued the ticket)
hive> select issuer_precinct,count(*) as instances from parking_vio_2017 group by issuer_precinct order by instances desc limit 5; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408124648_200df340-8b5c-4e61-9eb4-5a864c9ab815 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:46:50,404 Stage-1 map = 0%, reduce = 0% 2023-04-08 12:46:52,406 Stage-1 map = 100%, reduce = 0% 2023-04-08 12:46:58,414 Stage-1 map = 100%, reduce = 100% Ended Job = job_local523335232_0016 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 12:47:00,186 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1790657867_0017 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 111939505261 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 28345269864 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK issuer_precinct instances 0 1078403 19 266959 14 200494 1 168740 18 162994 Time taken: 11.936 seconds, Fetched: 5 row(s) hive> |
4) Find the violation code frequency across 3 precincts which have issued the most number of tickets - do these precinct zones have an exceptionally high frequency of certain violation codes?
hive> select issuer_precinct,violation_code,count() as instance from parking_vio_2017 group by issuer_precinct,violation_code order by issuer_precinct,instance desc;
hive> select issuer_precinct,count() as instance from parking_vio_2017 where issuer_precinct!=0 group by issuer_precinct,violation_code order by instance desc limit 3; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408131017_0062681a-3eaa-4524-8013-469a72d4e365 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 13:10:20,140 Stage-1 map = 0%, reduce = 0% 2023-04-08 13:10:22,142 Stage-1 map = 100%, reduce = 0% 2023-04-08 13:10:28,147 Stage-1 map = 75%, reduce = 0% 2023-04-08 13:10:29,150 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1999220878_0027 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 13:10:30,934 Stage-2 map = 100%, reduce = 100% Ended Job = job_local1703760097_0028 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 160639084717 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 40520164728 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK issuer_precinct instance 18 50150 19 48444 14 45036 Time taken: 12.999 seconds, Fetched: 3 row(s) hive> |
issuer_precinct are 18,19,14
Now let’s examine violation code frequency in 18,19,14 issuer_precinct
hive> select issuer_precinct,violation_code,count(*) as instance from parking_vio_2017 where issuer_precinct in (18,19,14) group by issuer_precinct,violation_code order by issuer_precinct,instance desc; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408131311_c3b7e27c-4880-4695-987e-ca09a6a842e9 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 4 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 13:13:13,474 Stage-1 map = 0%, reduce = 0% 2023-04-08 13:13:15,475 Stage-1 map = 100%, reduce = 0% 2023-04-08 13:13:21,483 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1448052520_0029 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2023-04-08 13:13:23,296 Stage-2 map = 100%, reduce = 100% Ended Job = job_local168259142_0030 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 168755681293 HDFS Write: 8116602392 SUCCESS Stage-Stage-2: HDFS Read: 42549313872 HDFS Write: 2029150598 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK issuer_precinct violation_code instance 14 14 45036 14 69 30464 14 31 22555 14 47 18364 14 42 10027 14 46 7679 14 19 7030 14 84 6743 14 82 5052 14 40 3582 14 17 3534 14 38 3269 14 9 2874 14 20 2761 14 71 2757 14 13 2701 14 48 2439 14 89 1960 14 50 1824 14 11 1745 14 79 1495 14 70 1461 14 10 1319 14 37 1256 14 64 1070 14 23 1044 14 21 1029 14 53 953 14 24 946 14 16 940 14 74 768 14 35 675 14 8 588 14 51 559 14 52 549 14 45 526 14 73 253 14 1 247 14 3 241 14 30 239 14 78 227 14 18 213 14 26 212 14 72 200 14 85 152 14 77 138 14 83 108 14 49 84 14 61 74 14 98 67 14 62 66 14 41 63 14 67 62 14 75 43 14 60 41 14 43 40 14 2 32 14 66 22 14 59 21 14 22 19 14 68 12 14 80 8 14 39 7 14 27 7 14 29 4 14 12 4 14 4 4 14 81 2 14 99 2 14 86 1 14 0 1 14 56 1 14 91 1 14 96 1 14 54 1 18 14 50150 18 69 20189 18 47 14107 18 31 11893 18 46 7863 18 42 6190 18 38 6176 18 84 5189 18 19 4580 18 20 4114 18 40 3350 18 16 2632 18 82 2242 18 37 2236 18 11 2059 18 79 2006 18 71 1977 18 13 1816 18 17 1653 18 21 1452 18 10 1373 18 70 1101 18 35 1080 18 23 841 18 24 683 18 9 634 18 48 624 18 50 602 18 45 601 18 74 518 18 53 431 18 64 428 18 51 359 18 78 279 18 8 215 18 98 173 18 77 123 18 18 108 18 66 99 18 73 92 18 72 80 18 89 79 18 49 69 18 75 56 18 26 53 18 85 52 18 39 50 18 1 46 18 60 42 18 41 35 18 83 33 18 61 27 18 43 18 18 68 15 18 67 15 18 30 14 18 3 14 18 22 13 18 62 9 18 59 9 18 80 7 18 2 3 18 29 3 18 99 3 18 33 2 18 0 2 18 12 1 18 52 1 18 15 1 18 96 1 18 95 1 18 81 1 18 27 1 19 46 48444 19 38 36386 19 37 36056 19 14 29797 19 21 28414 19 20 14629 19 40 11416 19 16 9926 19 71 7493 19 19 6856 19 10 5643 19 84 4910 19 70 4459 19 18 3148 19 69 2910 19 31 2080 19 53 1736 19 50 1483 19 17 1464 19 48 1460 19 74 1329 19 24 1029 19 42 903 19 82 888 19 47 702 19 51 539 19 9 480 19 13 445 19 64 389 19 45 241 19 23 207 19 78 189 19 11 183 19 98 92 19 75 91 19 85 75 19 72 63 19 61 60 19 83 50 19 73 44 19 41 43 19 39 27 19 30 27 19 60 26 19 8 24 19 79 13 19 43 13 19 52 12 19 68 11 19 35 7 19 49 6 19 67 5 19 66 4 19 26 4 19 81 3 19 62 3 19 33 3 19 80 3 19 99 3 19 59 2 19 0 2 19 4 1 19 89 1 19 77 1 19 91 1 19 29 1 19 2 1 19 22 1 19 32 1 19 12 1 Time taken: 11.828 seconds, Fetched: 218 row(s) hive> |
Precinct 18 and Precinct 14 has more less similar top violation code.
But Precinct 19 has very different top violation code.
Find out the properties of parking violations across different times of the day: The Violation Time field is specified in a strange format. Find a way to make this into a time attribute that you can use to divide into groups.
6.) Divide 24 hours into 6 equal discrete bins of time. The intervals you choose are at your discretion. For each of these groups, find the 3 most commonly occurring violations
select summons_no, violation_code , violation_time, issuer_precinct, case when substring(violation_time,1,2) in ('00','01','02','03','12') and upper(substring(violation_time,-1))='A' then 1 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='A' then 2 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='A' then 3 when substring(violation_time,1,2) in ('12','00','01','02','03') and upper(substring(violation_time,-1))='P' then 4 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='P' then 5 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='P' then 6 else null end as violation_time_bin from parking_vio_2017 where violation_time is not null or ( length(violation_time)=5 and upper(substring(violation_time,-1)) in ('A','P') and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07', '08','09','10','11','12') ) |
Now Let’s find 3 most commonly occurring violations
For violation_time_bin=1
with bin_table as ( select summons_no, violation_code , violation_time, issuer_precinct, case when substring(violation_time,1,2) in ('00','01','02','03','12') and upper(substring(violation_time,-1))='A' then 1 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='A' then 2 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='A' then 3 when substring(violation_time,1,2) in ('12','00','01','02','03') and upper(substring(violation_time,-1))='P' then 4 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='P' then 5 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='P' then 6 else null end as violation_time_bin from parking_vio_2017 where violation_time is not null or ( length(violation_time)=5 and upper(substring(violation_time,-1)) in ('A','P') and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07', '08','09','10','11','12') ) )
select violation_time_bin,violation_code,count(*) as instance from bin_table where violation_time_bin=1 group by violation_time_bin,violation_code order by instance desc limit 3
violation_time_bin violation_code instance 1 21 36957 1 40 25866 1 78 15528 Time taken: 16.955 seconds, Fetched: 3 row(s) hive> |
Similarly trying for violation_time_bin for 2,3,4,5,6
Three most commonly violation_codes are
21,36,38
7) Now, try another direction. For the 3 most commonly occurring violation codes, find the most common times of day (in terms of the bins from the previous part)
with bin_table as ( select summons_no, violation_code , violation_time, issuer_precinct, case when substring(violation_time,1,2) in ('00','01','02','03','12') and upper(substring(violation_time,-1))='A' then 1 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='A' then 2 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='A' then 3 when substring(violation_time,1,2) in ('12','00','01','02','03') and upper(substring(violation_time,-1))='P' then 4 when substring(violation_time,1,2) in ('04','05','06','07') and upper(substring(violation_time,-1))='P' then 5 when substring(violation_time,1,2) in ('08','09','10','11') and upper(substring(violation_time,-1))='P' then 6 else null end as violation_time_bin from parking_vio_2017 where violation_time is not null or ( length(violation_time)=5 and upper(substring(violation_time,-1)) in ('A','P') and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07', '08','09','10','11','12') ) )
select violation_time_bin,count(*) as instance from bin_table where violation_code in (21,36,38) group by violation_time_bin order by instance desc limit 3 |
violation_time_bin instance 3 1122795 4 601699 5 116648 Time taken: 16.351 seconds, Fetched: 3 row(s) hive> |
Bins 3, 4, 5 are having most violations
8.) Let’s try and find some seasonality in this data
a) First, divide the year into some number of seasons, and find frequencies of tickets for each season.
(Hint: A quick Google search reveals the following seasons in NYC:
Spring(March, April, March); Summer(June, July, August); Fall(September, October, November); Winter(December, January, February))
b)Then, find the 3 most common violations for each of these seasons.
with season_data as ( select case when month(issue_date) in (1,2,12) then 'winter' when month(issue_date) in (3,4,5) then 'spring' when month(issue_date) in (6,7,8) then 'summer' else 'fall' end as season,violation_code,count() as instances from parking_vio_2017 group by case when month(issue_date) in (1,2,12) then 'winter' when month(issue_date) in (3,4,5) then 'spring' when month(issue_date) in (6,7,8) then 'summer' else 'fall' end,violation_code ) select from ( select season,violation_code,instances,dense_rank() over (partition by season,violation_code order by instances desc) as rank from season_data )b
b.season b.violation_code b.instances b.rank fall 46 231 1 fall 21 128 2 fall 40 116 3 spring 21 402424 1 spring 36 344834 2 spring 38 271167 3 summer 21 127350 1 summer 36 96663 2 summer 38 83518 3 winter 21 238180 1 winter 36 221268 2 winter 38 187386 3 Time taken: 13.933 seconds, Fetched: 12 row(s) hive> |
Now let’s solve all above questions with some optimization techniques
Hive is better for orc type format
So here we will create an orc table with dynamic partition on issue_date,violation_code
and bucketing on summons_no
Here We will try to implement all the concepts learnt
Step 1
Lets create an orc table named parking_vio_2017_orc from reference table parking_vio_2017
hive> create table parking_vio_2017_orc > ( > summons_no int, > plate_id string, > registration_state string, > plate_type string, > issue_date date, > violation_code int, > vehicle_body_type string, > vehicle_make string, > issuing_agency string, > street_code_1 int, > street_code_2 int, > street_code_3 int, > vehicle_exp_date int, > violation_location int, > violation_precinct int, > issuer_precinct int, > issuer_code int, > issuer_command string, > issuer_squad string, > violation_time string, > time_first_observed string, > violation_county string, > violation_in_front_of_or_opp string, > house_no string, > street_no string, > intersection_street string, > date_first_observed string, > law_section int, > sub_division string, > violation_legal_code string, > days_parking_in_effect string, > from_hours_in_effect string, > to_hours_in_effect string, > vehicle_color string, > unregistered_vehicle int, > vehicle_year string, > meter_no string, > feet_from_curb string, > violation_post_code string, > violation_description string, > no_standing_or_stopping_violation string, > hydrant_violation string, > double_parking_violation string > ) > stored as orc; OK Time taken: 6.189 seconds hive> |
Lets load data into this orc table from the ref table named parking_vio_2017
hive> from parking_vio_2017 insert overwrite table parking_vio_2017_orc select *; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20230408183337_5e13629f-b635-4151-8dc9-5e1decd2870c Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2023-04-08 18:33:40,787 Stage-1 map = 0%, reduce = 0% 2023-04-08 18:33:58,800 Stage-1 map = 12%, reduce = 0% 2023-04-08 18:34:11,817 Stage-1 map = 100%, reduce = 0% 2023-04-08 18:34:17,822 Stage-1 map = 25%, reduce = 0% 2023-04-08 18:34:26,828 Stage-1 map = 37%, reduce = 0% 2023-04-08 18:34:40,838 Stage-1 map = 100%, reduce = 0% 2023-04-08 18:34:46,844 Stage-1 map = 50%, reduce = 0% 2023-04-08 18:34:54,851 Stage-1 map = 62%, reduce = 0% 2023-04-08 18:35:08,870 Stage-1 map = 100%, reduce = 0% 2023-04-08 18:35:14,874 Stage-1 map = 75%, reduce = 0% 2023-04-08 18:35:23,948 Stage-1 map = 91%, reduce = 0% 2023-04-08 18:35:29,952 Stage-1 map = 100%, reduce = 0% Ended Job = job_local1833751516_0015 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to directory hdfs://namenode:8020/user/hive/warehouse/hi.. Loading data to table hive_db.parking_vio_2017_orc MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 35083110397 HDFS Write: 510857609 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK parking_vio_2017.summons_no parking_vio_2017.plate_id parking_vio_2017.registration_state parking_vio_2017.plate_type parking_vio_2017.issue_date parking_vio_2017.violation_code parking_vio_2017.vehicle_body_type parking_vio_2017.vehicle_make parking_vio_2017.issuing_agency parking_vio_2017.street_code_1 parking_vio_2017.street_code_2 parking_vio_2017.street_code_3 parking_vio_2017.vehicle_exp_date parking_vio_2017.violation_location parking_vio_2017.violation_precinct parking_vio_2017.issuer_precinct parking_vio_2017.issuer_code parking_vio_2017.issuer_command parking_vio_2017.issuer_squad parking_vio_2017.violation_time parking_vio_2017.time_first_observed parking_vio_2017.violation_county parking_vio_2017.violation_in_front_of_or_opp parking_vio_2017.house_no parking_vio_2017.street_no parking_vio_2017.intersection_street parking_vio_2017.date_first_observed parking_vio_2017.law_section parking_vio_2017.sub_division parking_vio_2017.violation_legal_code parking_vio_2017.days_parking_in_effect parking_vio_2017.from_hours_in_effect parking_vio_2017.to_hours_in_effect parking_vio_2017.vehicle_color parking_vio_2017.unregistered_vehicle parking_vio_2017.vehicle_year parking_vio_2017.meter_no parking_vio_2017.feet_from_curb parking_vio_2017.violation_post_code parking_vio_2017.violation_description parking_vio_2017.no_standing_or_stopping_violation parking_vio_2017.hydrant_violation parking_vio_2017.double_parking_violation Time taken: 115.373 seconds hive> |
Let’s check in hadoop environment how this orc file looks like

Its taking very less time to query now
hive> select count(*) from parking_vio_2017_orc ; OK _c0 5431903 Time taken: 0.248 seconds, Fetched: 1 row(s) hive> |
Now let’s create partitions and buckets
I am going to do dynamic partition on issue_date,violation_code and bucketing on summon_no
hive> create table parking_vio_2017_orc_part > ( > summons_no int, > plate_id string, > registration_state string, > plate_type string, > vehicle_body_type string, > vehicle_make string, > issuing_agency string, > street_code_1 int, > street_code_2 int, > street_code_3 int, > vehicle_exp_date int, > violation_location int, > violation_precinct int, > issuer_precinct int, > issuer_code int, > issuer_command string, > issuer_squad string, > violation_time string, > time_first_observed string, > violation_county string, > violation_in_front_of_or_opp string, > house_no string, > street_no string, > intersection_street string, > date_first_observed string, > law_section int, > sub_division string, > violation_legal_code string, > days_parking_in_effect string, > from_hours_in_effect string, > to_hours_in_effect string, > vehicle_color string, > unregistered_vehicle int, > vehicle_year string, > meter_no string, > feet_from_curb string, > violation_post_code string, > violation_description string, > no_standing_or_stopping_violation string, > hydrant_violation string, > double_parking_violation string > ) > partitioned by (issue_date date,violation_code int) > clustered by (summons_no) > sorted by (summons_no) > into 3 buckets; OK Time taken: 0.526 seconds hive> |
insert overwrite table parking_vio_2017_orc_part_buck partition (issue_date,violation_code) select summons_no,plate_id,registration_state,plate_type,vehicle_body_type,vehicle_make,issuing_agency,street_code_1,street_code_2,street_code_3,vehicle_exp_date,violation_location,violation_precinct,issuer_precinct,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violationcounty,violationinfrontof_or_opp,house_no,street_no,intersection_street,date_first_observed,law_section,sub_division,violation_legal_code,daysparkingin_effect,fromhoursin_effect,tohoursin_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_no,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation,issue_date,violation_code from parking_vio_2017_orc |
Now let’s check how the partitions and bucketing looks in hdfs file system
