NYC Parking Tickets: An Exploratory Analysis Using Hive

Partitions | Bucketing | Hive | Analysis

Author:

Bhawesh Mehta

Data Enthusiast

Project Name:

NYC Parking Tickets: An Exploratory Analysis Using Hive

One of the major objectives of this assignment is gaining familiarity with how an analysis works in Hive and how you can gain insights from large datasets.

Problem Statement -

New York City is a thriving metropolis and just like most other cities of similar size, one of the biggest problems its residents face is parking.

The classic combination of a huge number of cars and a cramped geography is the exact recipe that leads to a large number of parking tickets.

In an attempt to scientifically analyze this phenomenon, the NYC Police Department regularly collects data related to parking tickets.

This data is made available through the NYC Open Data portal. We will perform some analysis on it.

Download Dataset - https://data.cityofnewyork.us/browse?q=parking+tickets

Action: Export the data in CSV format from the site given above.

Note: Consider only the calendar year 2017 for the analysis, not the fiscal year.

Before starting the assignment:

Let’s load the data into HDFS from the local file system.

The file named parking_vio.csv is currently on the local system.

Here, “local system” means the file system of the container I am working in.

Please Note:

(“Local system” does not refer to the local computer’s file system.)

Now I will move this file from the container’s file system to the Hadoop file system (HDFS).

Before loading the file into the Hadoop environment, let’s check the current directory in HDFS.

Now let’s load the data into HDFS using the put command.

Here I have loaded the file into the /local_to_hdfs directory in HDFS.

Let me show you the contents of the /local_to_hdfs directory in HDFS.

To list the contents, use

hadoop fs -ls /local_to_hdfs

A question arises: why was the .csv file not created?

Because we had not created the local_to_hdfs folder first, the put command created a plain file named local_to_hdfs instead of placing the .csv inside a directory.

Now let me delete this first

hadoop fs -rm -r /local_to_hdfs

First create the folder, then use the put command to load the file.
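The individual commands for this step aren’t shown above; a minimal sketch of the sequence, assuming a running HDFS and parking_vio.csv in the current container directory:

```shell
# Create the target directory first, so put copies the file INTO it
hadoop fs -mkdir -p /local_to_hdfs

# Copy the CSV from the container's local file system into HDFS
hadoop fs -put parking_vio.csv /local_to_hdfs

# Verify the file landed inside the directory
hadoop fs -ls /local_to_hdfs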

Now you can clearly see the .csv file inside the /local_to_hdfs directory.

The analysis can be divided into two parts:

Part-I

Examine the data:

Let’s examine the data in the Hadoop file system.

  1. Find the total number of tickets for the year.

  2. Find out how many unique states the cars which got parking tickets came from.

  3. Some parking tickets don’t have addresses on them, which is cause for concern. Find out how many such tickets there are (i.e. tickets where “Street Code 1”, “Street Code 2”, or “Street Code 3” is empty).

Approach 1

Using

Simple Internal table

Let’s create an internal table named parking.

The challenge here is that the date column contains values in two formats:

03-09-2017

01/18/2017

That is, MM-dd-yyyy or MM/dd/yyyy.

So our approach will be to first create a temporary table that defines these columns as strings, and then load the data from that temp table into the main table with a proper date type in Hive.

Before loading the data into the main table, we will replace / with - so that the format is uniform.
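This normalization step can be sketched in HiveQL; a minimal example of the conversion expression, using the violation_ref table defined below (the actual INSERT later in this write-up applies the same idea across all columns):

```sql
-- Sketch: normalize 01/18/2017 to 01-18-2017, then parse it into
-- Hive's default yyyy-MM-dd date representation
SELECT from_unixtime(unix_timestamp(regexp_replace(issue_date, '/', '-'),
                                    'MM-dd-yyyy'),
                     'yyyy-MM-dd') AS issue_date
FROM violation_ref;
```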

So let’s try this

By default, Hive stores dates in yyyy-MM-dd format.

Let’s create a reference table named violation_ref.

Let’s load the data from HDFS into this violation_ref table.

We made a mistake while creating the table: we should have skipped the first row, as it contains the column names.

Let’s drop the violation_ref table

Now let’s create the violation_ref table again.

Note how I have used tblproperties("skip.header.line.count"="1") to skip the header row.

Loading data from hdfs file system
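The DDL and LOAD statements themselves aren’t shown above; a minimal sketch, assuming comma-delimited CSV and using only a few representative columns (the real table declares all 43 columns as strings):

```sql
-- All columns declared as STRING in this staging table;
-- skip.header.line.count drops the CSV header row on read
CREATE TABLE violation_ref (
    summons_no STRING,
    plate_id   STRING,
    issue_date STRING
    -- ...remaining columns as in the full schema
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count" = "1");

-- Move the file from /local_to_hdfs into the table's warehouse location
LOAD DATA INPATH '/local_to_hdfs/parking_vio.csv' INTO TABLE violation_ref;
```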

Here I noticed one thing: when I dropped the table, the file in HDFS was also deleted. That is because violation_ref is a managed (internal) table, so dropping it removes the underlying data as well.

Checking whether the table’s data is present in HDFS.

Now let’s create a violation_main table with the date column properly typed.

Note:

By default, Hive stores dates in yyyy-MM-dd format.
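The CREATE statement for violation_main isn’t shown; a minimal sketch with a few representative columns (the full schema mirrors violation_ref, but with issue_date typed as DATE):

```sql
-- Same layout as violation_ref, except the date column is a real DATE
CREATE TABLE violation_main (
    summons_no         INT,
    plate_id           STRING,
    registration_state STRING,
    issue_date         DATE
    -- ...remaining columns as in the full schema
);
```

The INSERT below then populates it, converting the string dates with UNIX_TIMESTAMP/FROM_UNIXTIME.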

hive> insert into violation_main select summons_no,plate_id,registration_state,plate_type,FROM_UNIXTIME(UNIX_TIMESTAMP(issue_date, 'MM/dd/yyyy'), 'yyyy-MM-dd') as issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code_1,street_code_2,street_code_3,vehicle_exp_date,violation_location,violation_precinct,issuer_precinct,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violationcounty,violationinfrontof_or_opp,house_no,street_no,intersection_street,date_first_observed,law_section,sub_division,violation_legal_code,daysparkingin_effect,fromhoursin_effect,tohoursin_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_no,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation from violation_ref;

Now let’s check how violation_main table looks in hdfs file system

Notice how the data is split into multiple files, starting with 000000_0.

1-Find the total number of tickets for the year.

hive> select issue_date,count(*) from violation_main group by issue_date;

But we need the data for 2017 only.

So let’s modify this and create a table containing only the year-2017 data.

hive> create table parking_vio_2017
    > (
    > summons_no int,
    > plate_id string,
    > registration_state string,
    > plate_type string,
    > issue_date date,
    > violation_code int,
    > vehicle_body_type string,
    > vehicle_make string,
    > issuing_agency string,
    > street_code_1 int,
    > street_code_2 int,
    > street_code_3 int,
    > vehicle_exp_date int,
    > violation_location int,
    > violation_precinct int,
    > issuer_precinct int,
    > issuer_code int,
    > issuer_command string,
    > issuer_squad string,
    > violation_time string,
    > time_first_observed string,
    > violation_county string,
    > violation_in_front_of_or_opp string,
    > house_no string,
    > street_no string,
    > intersection_street string,
    > date_first_observed string,
    > law_section int,
    > sub_division string,
    > violation_legal_code string,
    > days_parking_in_effect string,
    > from_hours_in_effect string,
    > to_hours_in_effect string,
    > vehicle_color string,
    > unregistered_vehicle int,
    > vehicle_year string,
    > meter_no string,
    > feet_from_curb string,
    > violation_post_code string,
    > violation_description string,
    > no_standing_or_stopping_violation string,
    > hydrant_violation string,
    > double_parking_violation string
    > )
    > row format delimited
    > fields terminated by ',';
OK
Time taken: 12.221 seconds
hive>

hive> insert into parking_vio_2017 select
summons_no,plate_id,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code_1,street_code_2,street_code_3,vehicle_exp_date,violation_location,violation_precinct,issuer_precinct,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violation_county,violation_in_front_of_or_opp,house_no,street_no,intersection_street,date_first_observed,law_section,sub_division,violation_legal_code,days_parking_in_effect,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistered_vehicle,vehicle_year,meter_no,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation from violation_main where year(issue_date)=2017;

Let’s check whether the data is loaded in the HDFS warehouse.

# hadoop fs -ls /user/hive/warehouse/hive_db.db/parking_vio_2017

Found 8 items
-rwxrwxr-x   3 root supergroup  133306082 2023-04-08 11:54 /user/hive/warehouse/hive_db.db/parking_vio_2017/000000_0
-rwxrwxr-x   3 root supergroup  133615823 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000001_0
-rwxrwxr-x   3 root supergroup  133371174 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000002_0
-rwxrwxr-x   3 root supergroup  133720470 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000003_0
-rwxrwxr-x   3 root supergroup  133466345 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000004_0
-rwxrwxr-x   3 root supergroup  133509997 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000005_0
-rwxrwxr-x   3 root supergroup  133557192 2023-04-08 11:55 /user/hive/warehouse/hive_db.db/parking_vio_2017/000006_0
-rwxrwxr-x   3 root supergroup   80027489 2023-04-08 11:56 /user/hive/warehouse/hive_db.db/parking_vio_2017/000007_0
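With the 2017-only table in place, question 1 can be answered with a direct count; a sketch (the result is not reproduced here):

```sql
-- Total number of tickets issued in calendar year 2017
SELECT count(*) AS total_tickets_2017
FROM parking_vio_2017;
```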

2-Find out how many unique states the cars which got parking tickets came from.

hive> select distinct registration_state from violation_main;
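Instead of listing the distinct states, the count can also be obtained directly; a sketch:

```sql
-- Number of unique registration states among ticketed vehicles
SELECT count(DISTINCT registration_state) AS unique_states
FROM parking_vio_2017;
```

The grouped query below returns 65 rows, i.e. 65 distinct registration states (note that one of them is the placeholder value 99).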

hive> select registration_state,count(*) as instance from parking_vio_2017 group by registration_state order by instance desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408120319_8c69e2b9-a0d2-4432-a51e-74cc7f583b36
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:03:23,028 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:03:27,707 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:03:37,722 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1615623838_0004
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:03:39,869 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1643808020_0005
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 55123329229 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 14141225856 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
registration_state      instance
NY      4273941
NJ      475824
PA      140284
CT      70403
FL      69468
IN      45525
MA      38941
VA      34367
MD      30213
NC      27152
TX      18827
IL      18666
GA      17537
99      16055
AZ      12379
OH      12281
CA      12152
ME      10806
SC      10394
MN      10083
OK      9088
TN      8514
DE      7905
MI      7231
RI      5814
NH      4119
VT      3683
AL      3178
WA      3052
OR      2622
MO      2483
ON      2460
WI      2127
QB      1998
IA      1938
DC      1929
CO      1841
KY      1795
DP      1794
LA      1689
MS      1582
WV      1265
AR      994
SD      859
NM      792
ID      763
NV      725
KS      706
NE      704
UT      561
MT      505
GV      348
NS      322
AK      298
ND      254
WY      188
HI      156
AB      79
PE      61
NB      57
BC      54
PR      38
MB      17
SK      9
FO      8
Time taken: 20.127 seconds, Fetched: 65 row(s)

3-Some parking tickets don’t have addresses on them, which is cause for concern. Find out how many such tickets there are (i.e. tickets where “Street Code 1”, “Street Code 2”, or “Street Code 3” is empty).

hive> select count(*) from parking_vio_2017 where street_code_1 is null or street_code_2 is null or street_code_3 is null or street_code_1=0 or street_code_2=0 or street_code_3=0;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408122406_546bfba9-608e-49bb-b6cc-c6ce23ab817d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:24:08,395 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:24:11,784 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:24:18,791 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1069774194_0007
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 44057236165 HDFS Write: 5072876495 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
1816816
Time taken: 12.794 seconds, Fetched: 1 row(s)
hive>

Part-II: Aggregation tasks

1) How often does each violation code occur? (frequency of violation codes - find the top 5)

hive> select violation_code,count(*) as instance from parking_vio_2017 group by violation_code order by instance desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408123022_9378794a-2f09-4e0b-9bad-3113d0113f09
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:30:31,657 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:30:40,682 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:30:46,688 Stage-1 map = 33%,  reduce = 0%
2023-04-08 12:30:48,710 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:30:58,737 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local488446953_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:31:01,739 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1353223834_0009
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 79473118957 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 20228673288 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
violation_code  instance
21      768082
36      662765
38      542079
14      476660
20      319646
Time taken: 38.911 seconds, Fetched: 5 row(s)
hive>

2) How often does each vehicle body type get a parking ticket? How about the vehicle make? (find the top 5 for both)

For Vehicle body type

hive> select vehicle_body_type,count(*) as instance from parking_vio_2017 group by vehicle_body_type order by instance desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408123551_a96f78b5-6b0a-4fae-8fd6-54ec188a7fc3
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:35:54,477 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:36:00,484 Stage-1 map = 8%,  reduce = 0%
2023-04-08 12:36:03,487 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:36:09,497 Stage-1 map = 25%,  reduce = 0%
2023-04-08 12:36:12,502 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:36:17,508 Stage-1 map = 75%,  reduce = 0%
2023-04-08 12:36:18,509 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local198719902_0010
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:36:20,251 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local457352855_0011
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 87589715533 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 22257822432 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
vehicle_body_type       instance
SUBN    1883953
4DSD    1547307
VAN     724025
DELV    358982
SDN     194197
Time taken: 28.821 seconds, Fetched: 5 row(s)
hive>

For vehicle make

hive> select vehicle_make,count(*) as instance from parking_vio_2017 group by vehicle_make order by instance desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408123750_9647cd58-0775-4c05-af32-27742608aed6
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:37:54,203 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:37:56,206 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:38:03,214 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1454523152_0012
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:38:05,257 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1060513731_0013
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 95706312109 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 24286971576 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
vehicle_make    instance
FORD    636842
TOYOT   605290
HONDA   538884
NISSA   462017
CHEVR   356032
Time taken: 14.345 seconds, Fetched: 5 row(s)
hive>

3) A precinct is a police station that has a certain zone of the city under its command. Find the (5 highest) frequencies of:

a.) Violating Precincts (this is the precinct of the zone where the violation occurred)

hive> select violation_precinct,count(*) as instances from parking_vio_2017 group by violation_precinct order by instances desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408124537_11dc8925-f0f4-4397-8eee-b5be88d363f3
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:45:40,763 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:45:43,765 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:45:49,771 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1258832131_0014
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:45:51,660 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1685272680_0015
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 103822908685 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 26316120720 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
violation_precinct      instances
0       925596
19      274443
14      203552
1       174702
18      169131
Time taken: 13.861 seconds, Fetched: 5 row(s)
hive>

b.) Issuer Precincts (this is the precinct that issued the ticket)

hive> select issuer_precinct,count(*) as instances from parking_vio_2017 group by issuer_precinct order by instances desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408124648_200df340-8b5c-4e61-9eb4-5a864c9ab815
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:46:50,404 Stage-1 map = 0%,  reduce = 0%
2023-04-08 12:46:52,406 Stage-1 map = 100%,  reduce = 0%
2023-04-08 12:46:58,414 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local523335232_0016
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 12:47:00,186 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1790657867_0017
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 111939505261 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 28345269864 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
issuer_precinct instances
0       1078403
19      266959
14      200494
1       168740
18      162994
Time taken: 11.936 seconds, Fetched: 5 row(s)
hive>

4) Find the violation code frequency across 3 precincts which have issued the most number of tickets - do these precinct zones have an exceptionally high frequency of certain violation codes?

hive> select issuer_precinct,violation_code,count(*) as instance from parking_vio_2017 group by issuer_precinct,violation_code order by issuer_precinct,instance desc;


hive> select issuer_precinct,count(*) as instance from parking_vio_2017 where issuer_precinct!=0 group by issuer_precinct,violation_code order by instance desc limit 3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408131017_0062681a-3eaa-4524-8013-469a72d4e365
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 13:10:20,140 Stage-1 map = 0%,  reduce = 0%
2023-04-08 13:10:22,142 Stage-1 map = 100%,  reduce = 0%
2023-04-08 13:10:28,147 Stage-1 map = 75%,  reduce = 0%
2023-04-08 13:10:29,150 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1999220878_0027
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 13:10:30,934 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local1703760097_0028
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 160639084717 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 40520164728 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
issuer_precinct instance
18      50150
19      48444
14      45036
Time taken: 12.999 seconds, Fetched: 3 row(s)
hive>

The top 3 issuer precincts are 18, 19, and 14.

Now let’s examine the violation code frequency in issuer precincts 18, 19, and 14.

hive> select issuer_precinct,violation_code,count(*) as instance from parking_vio_2017 where issuer_precinct in (18,19,14) group by issuer_precinct,violation_code order by issuer_precinct,instance desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408131311_c3b7e27c-4880-4695-987e-ca09a6a842e9
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 13:13:13,474 Stage-1 map = 0%,  reduce = 0%
2023-04-08 13:13:15,475 Stage-1 map = 100%,  reduce = 0%
2023-04-08 13:13:21,483 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1448052520_0029
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-04-08 13:13:23,296 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local168259142_0030
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 168755681293 HDFS Write: 8116602392 SUCCESS
Stage-Stage-2:  HDFS Read: 42549313872 HDFS Write: 2029150598 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
issuer_precinct violation_code  instance
14      14      45036
14      69      30464
14      31      22555
14      47      18364
14      42      10027
14      46      7679
14      19      7030
14      84      6743
14      82      5052
14      40      3582
14      17      3534
14      38      3269
14      9       2874
14      20      2761
14      71      2757
14      13      2701
14      48      2439
14      89      1960
14      50      1824
14      11      1745
14      79      1495
14      70      1461
14      10      1319
14      37      1256
14      64      1070
14      23      1044
14      21      1029
14      53      953
14      24      946
14      16      940
14      74      768
14      35      675
14      8       588
14      51      559
14      52      549
14      45      526
14      73      253
14      1       247
14      3       241
14      30      239
14      78      227
14      18      213
14      26      212
14      72      200
14      85      152
14      77      138
14      83      108
14      49      84
14      61      74
14      98      67
14      62      66
14      41      63
14      67      62
14      75      43
14      60      41
14      43      40
14      2       32
14      66      22
14      59      21
14      22      19
14      68      12
14      80      8
14      39      7
14      27      7
14      29      4
14      12      4
14      4       4
14      81      2
14      99      2
14      86      1
14      0       1
14      56      1
14      91      1
14      96      1
14      54      1
18      14      50150
18      69      20189
18      47      14107
18      31      11893
18      46      7863
18      42      6190
18      38      6176
18      84      5189
18      19      4580
18      20      4114
18      40      3350
18      16      2632
18      82      2242
18      37      2236
18      11      2059
18      79      2006
18      71      1977
18      13      1816
18      17      1653
18      21      1452
18      10      1373
18      70      1101
18      35      1080
18      23      841
18      24      683
18      9       634
18      48      624
18      50      602
18      45      601
18      74      518
18      53      431
18      64      428
18      51      359
18      78      279
18      8       215
18      98      173
18      77      123
18      18      108
18      66      99
18      73      92
18      72      80
18      89      79
18      49      69
18      75      56
18      26      53
18      85      52
18      39      50
18      1       46
18      60      42
18      41      35
18      83      33
18      61      27
18      43      18
18      68      15
18      67      15
18      30      14
18      3       14
18      22      13
18      62      9
18      59      9
18      80      7
18      2       3
18      29      3
18      99      3
18      33      2
18      0       2
18      12      1
18      52      1
18      15      1
18      96      1
18      95      1
18      81      1
18      27      1
19      46      48444
19      38      36386
19      37      36056
19      14      29797
19      21      28414
19      20      14629
19      40      11416
19      16      9926
19      71      7493
19      19      6856
19      10      5643
19      84      4910
19      70      4459
19      18      3148
19      69      2910
19      31      2080
19      53      1736
19      50      1483
19      17      1464
19      48      1460
19      74      1329
19      24      1029
19      42      903
19      82      888
19      47      702
19      51      539
19      9       480
19      13      445
19      64      389
19      45      241
19      23      207
19      78      189
19      11      183
19      98      92
19      75      91
19      85      75
19      72      63
19      61      60
19      83      50
19      73      44
19      41      43
19      39      27
19      30      27
19      60      26
19      8       24
19      79      13
19      43      13
19      52      12
19      68      11
19      35      7
19      49      6
19      67      5
19      66      4
19      26      4
19      81      3
19      62      3
19      33      3
19      80      3
19      99      3
19      59      2
19      0       2
19      4       1
19      89      1
19      77      1
19      91      1
19      29      1
19      2       1
19      22      1
19      32      1
19      12      1
Time taken: 11.828 seconds, Fetched: 218 row(s)
hive>

Precincts 18 and 14 have more or less similar top violation codes.

But Precinct 19’s top violation codes are quite different.

5) Find out the properties of parking violations across different times of the day: the Violation Time field is specified in a strange format. Find a way to convert it into a time attribute that you can use to divide the data into groups.

6.) Divide 24 hours into 6 equal discrete bins of time. The intervals you choose are at your discretion. For each of these groups, find the 3 most commonly occurring violations

select summons_no, violation_code , violation_time, issuer_precinct,
case
when substring(violation_time,1,2) in ('00','01','02','03','12')
and upper(substring(violation_time,-1))='A' then 1
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='A' then 2
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='A' then 3
when substring(violation_time,1,2) in ('12','00','01','02','03')
and upper(substring(violation_time,-1))='P' then 4
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='P' then 5
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='P' then 6
else null
end as violation_time_bin
from parking_vio_2017
where violation_time is not null
and length(violation_time)=5
and upper(substring(violation_time,-1)) in ('A','P')
and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07','08','09','10','11','12')

Now let’s find the 3 most commonly occurring violations.

For violation_time_bin=1

with bin_table as (
select summons_no, violation_code , violation_time, issuer_precinct,
case
when substring(violation_time,1,2) in ('00','01','02','03','12')
and upper(substring(violation_time,-1))='A' then 1
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='A' then 2
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='A' then 3
when substring(violation_time,1,2) in ('12','00','01','02','03')
and upper(substring(violation_time,-1))='P' then 4
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='P' then 5
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='P' then 6
else null
end as violation_time_bin
from parking_vio_2017
where violation_time is not null
and length(violation_time)=5
and upper(substring(violation_time,-1)) in ('A','P')
and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07','08','09','10','11','12')
)

select violation_time_bin,violation_code,count(*) as instance from bin_table
where violation_time_bin=1
group by violation_time_bin,violation_code
order by instance desc
limit 3

violation_time_bin      violation_code  instance
1       21      36957
1       40      25866
1       78      15528
Time taken: 16.955 seconds, Fetched: 3 row(s)
hive>

Running the same query for violation_time_bin = 2, 3, 4, 5 and 6,

the three most commonly occurring violation codes overall are

21, 36 and 38
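The per-bin "top 3" query above is just a grouped count with a limit; a quick Python sketch of the same idea (the sample codes here are made up for illustration):

```python
from collections import Counter

# violation codes observed in one time bin (made-up sample data)
codes_in_bin = [21, 21, 40, 21, 78, 40, 14, 21, 40, 78]

# count occurrences and take the 3 most common,
# like GROUP BY violation_code ORDER BY count DESC LIMIT 3
top3 = Counter(codes_in_bin).most_common(3)
print(top3)  # [(21, 4), (40, 3), (78, 2)]
```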

7) Now, try another direction. For the 3 most commonly occurring violation codes, find the most common times of day (in terms of the bins from the previous part)

with bin_table as (
select summons_no, violation_code , violation_time, issuer_precinct,
case
when substring(violation_time,1,2) in ('00','01','02','03','12')
and upper(substring(violation_time,-1))='A' then 1
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='A' then 2
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='A' then 3
when substring(violation_time,1,2) in ('12','00','01','02','03')
and upper(substring(violation_time,-1))='P' then 4
when substring(violation_time,1,2) in ('04','05','06','07')
and upper(substring(violation_time,-1))='P' then 5
when substring(violation_time,1,2) in ('08','09','10','11')
and upper(substring(violation_time,-1))='P' then 6
else null
end as violation_time_bin
from parking_vio_2017
where violation_time is not null
and length(violation_time)=5
and upper(substring(violation_time,-1)) in ('A','P')
and substring(violation_time,1,2) in ('00','01','02','03','04','05','06','07','08','09','10','11','12')
)

select violation_time_bin,count(*) as instance from bin_table
where violation_code in (21,36,38)
group by violation_time_bin
order by instance desc
limit 3

violation_time_bin      instance
3       1122795
4       601699
5       116648
Time taken: 16.351 seconds, Fetched: 3 row(s)
hive>

Bins 3, 4 and 5 have the most violations.

8.) Let’s try and find some seasonality in this data

a) First, divide the year into some number of seasons, and find frequencies of tickets for each season.

(Hint: A quick Google search reveals the following seasons in NYC:

Spring(March, April, May); Summer(June, July, August); Fall(September, October, November); Winter(December, January, February))
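The month-to-season mapping from the hint can be sketched directly; a minimal Python version of the same CASE logic (the function name `month_to_season` is my own):

```python
def month_to_season(month):
    """Map a month number (1-12) to an NYC season, mirroring the Hive CASE expression."""
    if month in (1, 2, 12):
        return 'winter'
    if month in (3, 4, 5):
        return 'spring'
    if month in (6, 7, 8):
        return 'summer'
    return 'fall'   # 9, 10, 11

print(month_to_season(4))   # spring
print(month_to_season(10))  # fall
```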

b)Then, find the 3 most common violations for each of these seasons.

with season_data as (
select case when month(issue_date) in (1,2,12) then 'winter'
            when month(issue_date) in (3,4,5) then 'spring'
            when month(issue_date) in (6,7,8) then 'summer'
            else 'fall' end as season,
       violation_code, count(*) as instances
from parking_vio_2017
group by case when month(issue_date) in (1,2,12) then 'winter'
              when month(issue_date) in (3,4,5) then 'spring'
              when month(issue_date) in (6,7,8) then 'summer'
              else 'fall' end, violation_code
)
select b.season, b.violation_code, b.instances, b.rank
from (
select season, violation_code, instances,
       dense_rank() over (partition by season order by instances desc) as rank
from season_data
) b
where b.rank <= 3


b.season        b.violation_code        b.instances     b.rank
fall    46      231     1
fall    21      128     2
fall    40      116     3
spring  21      402424  1
spring  36      344834  2
spring  38      271167  3
summer  21      127350  1
summer  36      96663   2
summer  38      83518   3
winter  21      238180  1
winter  36      221268  2
winter  38      187386  3
Time taken: 13.933 seconds, Fetched: 12 row(s)
hive>
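The dense_rank() window used above can be emulated in plain Python to show why it keeps only the top 3 codes per season; a small sketch with made-up counts (with distinct counts per season, row position equals the dense rank):

```python
from itertools import groupby

# (season, violation_code, instances) rows, as produced by the season_data CTE;
# the counts here are made up for illustration
rows = [
    ('spring', 21, 402424), ('spring', 36, 344834), ('spring', 38, 271167),
    ('spring', 14, 100000),
    ('winter', 21, 238180), ('winter', 36, 221268), ('winter', 38, 187386),
]

rows.sort(key=lambda r: (r[0], -r[2]))   # partition by season, order by instances desc
top3 = []
for season, group in groupby(rows, key=lambda r: r[0]):
    # enumerate within each season plays the role of dense_rank (no ties in this sample)
    for rank, (_, code, n) in enumerate(group, start=1):
        if rank <= 3:
            top3.append((season, code, rank))

print(top3)
```

Note that a real dense_rank assigns equal ranks to tied counts; the enumerate shortcut only matches it when counts within a season are distinct.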

Now let’s solve all the questions above with some optimization techniques.

Hive performs better with the ORC file format.

So here we will create an ORC table with dynamic partitioning on issue_date and violation_code

and bucketing on summons_no.

Here we will try to implement all the concepts learnt so far.

Step 1

Let’s create an ORC table named parking_vio_2017_orc from the reference table parking_vio_2017.

hive> create table parking_vio_2017_orc
    > (
    > summons_no int,
    > plate_id string,
    > registration_state string,
    > plate_type string,
    > issue_date date,
    > violation_code int,
    > vehicle_body_type string,
    > vehicle_make string,
    > issuing_agency string,
    > street_code_1 int,
    > street_code_2 int,
    > street_code_3 int,
    > vehicle_exp_date int,
    > violation_location int,
    > violation_precinct int,
    > issuer_precinct int,
    > issuer_code int,
    > issuer_command string,
    > issuer_squad string,
    > violation_time string,
    > time_first_observed string,
    > violation_county string,
    > violation_in_front_of_or_opp string,
    > house_no string,
    > street_no string,
    > intersection_street string,
    > date_first_observed string,
    > law_section int,
    > sub_division string,
    > violation_legal_code string,
    > days_parking_in_effect string,
    > from_hours_in_effect string,
    > to_hours_in_effect string,
    > vehicle_color string,
    > unregistered_vehicle int,
    > vehicle_year string,
    > meter_no string,
    > feet_from_curb string,
    > violation_post_code string,
    > violation_description string,
    > no_standing_or_stopping_violation string,
    > hydrant_violation string,
    > double_parking_violation string
    > )
    > stored as orc;
OK
Time taken: 6.189 seconds
hive>

Let’s load data into this ORC table from the reference table parking_vio_2017.

hive> from parking_vio_2017 insert overwrite table parking_vio_2017_orc select *;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230408183337_5e13629f-b635-4151-8dc9-5e1decd2870c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2023-04-08 18:33:40,787 Stage-1 map = 0%,  reduce = 0%
2023-04-08 18:33:58,800 Stage-1 map = 12%,  reduce = 0%
2023-04-08 18:34:11,817 Stage-1 map = 100%,  reduce = 0%
2023-04-08 18:34:17,822 Stage-1 map = 25%,  reduce = 0%
2023-04-08 18:34:26,828 Stage-1 map = 37%,  reduce = 0%
2023-04-08 18:34:40,838 Stage-1 map = 100%,  reduce = 0%
2023-04-08 18:34:46,844 Stage-1 map = 50%,  reduce = 0%
2023-04-08 18:34:54,851 Stage-1 map = 62%,  reduce = 0%
2023-04-08 18:35:08,870 Stage-1 map = 100%,  reduce = 0%
2023-04-08 18:35:14,874 Stage-1 map = 75%,  reduce = 0%
2023-04-08 18:35:23,948 Stage-1 map = 91%,  reduce = 0%
2023-04-08 18:35:29,952 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local1833751516_0015
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://namenode:8020/user/hive/warehouse/hi..
Loading data to table hive_db.parking_vio_2017_orc
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 35083110397 HDFS Write: 510857609 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
parking_vio_2017.summons_no     parking_vio_2017.plate_id       parking_vio_2017.registration_state     parking_vio_2017.plate_type     parking_vio_2017.issue_date     parking_vio_2017.violation_code     parking_vio_2017.vehicle_body_type      parking_vio_2017.vehicle_make   parking_vio_2017.issuing_agency parking_vio_2017.street_code_1  parking_vio_2017.street_code_2      parking_vio_2017.street_code_3  parking_vio_2017.vehicle_exp_date       parking_vio_2017.violation_location     parking_vio_2017.violation_precinct parking_vio_2017.issuer_precinct        parking_vio_2017.issuer_code    parking_vio_2017.issuer_command parking_vio_2017.issuer_squad   parking_vio_2017.violation_time     parking_vio_2017.time_first_observed    parking_vio_2017.violation_county       parking_vio_2017.violation_in_front_of_or_opp   parking_vio_2017.house_no       parking_vio_2017.street_no  parking_vio_2017.intersection_street    parking_vio_2017.date_first_observed    parking_vio_2017.law_section    parking_vio_2017.sub_division   parking_vio_2017.violation_legal_code       parking_vio_2017.days_parking_in_effect parking_vio_2017.from_hours_in_effect   parking_vio_2017.to_hours_in_effect     parking_vio_2017.vehicle_color      parking_vio_2017.unregistered_vehicle   parking_vio_2017.vehicle_year   parking_vio_2017.meter_no       parking_vio_2017.feet_from_curb parking_vio_2017.violation_post_code        parking_vio_2017.violation_description  parking_vio_2017.no_standing_or_stopping_violation      parking_vio_2017.hydrant_violation      parking_vio_2017.double_parking_violation
Time taken: 115.373 seconds
hive>

Let’s check in the Hadoop environment what this ORC table looks like.

Queries now take much less time:

hive> select count(*) from parking_vio_2017_orc ;
OK
_c0
5431903
Time taken: 0.248 seconds, Fetched: 1 row(s)
hive>

Now let’s create partitions and buckets.

I am going to use dynamic partitioning on issue_date and violation_code and bucketing on summons_no.

hive> create table parking_vio_2017_orc_part
    > (
    > summons_no int,
    > plate_id string,
    > registration_state string,
    > plate_type string,
    > vehicle_body_type string,
    > vehicle_make string,
    > issuing_agency string,
    > street_code_1 int,
    > street_code_2 int,
    > street_code_3 int,
    > vehicle_exp_date int,
    > violation_location int,
    > violation_precinct int,
    > issuer_precinct int,
    > issuer_code int,
    > issuer_command string,
    > issuer_squad string,
    > violation_time string,
    > time_first_observed string,
    > violation_county string,
    > violation_in_front_of_or_opp string,
    > house_no string,
    > street_no string,
    > intersection_street string,
    > date_first_observed string,
    > law_section int,
    > sub_division string,
    > violation_legal_code string,
    > days_parking_in_effect string,
    > from_hours_in_effect string,
    > to_hours_in_effect string,
    > vehicle_color string,
    > unregistered_vehicle int,
    > vehicle_year string,
    > meter_no string,
    > feet_from_curb string,
    > violation_post_code string,
    > violation_description string,
    > no_standing_or_stopping_violation string,
    > hydrant_violation string,
    > double_parking_violation string
    > )
    > partitioned by (issue_date date,violation_code int)
    > clustered by (summons_no)
    > sorted by (summons_no)
    > into 3 buckets
    > stored as orc;
OK
Time taken: 0.526 seconds
hive>
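Before running a dynamic-partition insert, Hive typically needs its defaults relaxed; a sketch of the usual session settings (these are standard Hive properties, but the partition caps chosen here are illustrative):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- issue_date x violation_code can produce many partitions, so raise the caps
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.enforce.bucketing=true;
```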

insert overwrite table parking_vio_2017_orc_part partition (issue_date,violation_code)
select summons_no, plate_id, registration_state, plate_type, vehicle_body_type, vehicle_make,
issuing_agency, street_code_1, street_code_2, street_code_3, vehicle_exp_date, violation_location,
violation_precinct, issuer_precinct, issuer_code, issuer_command, issuer_squad, violation_time,
time_first_observed, violation_county, violation_in_front_of_or_opp, house_no, street_no,
intersection_street, date_first_observed, law_section, sub_division, violation_legal_code,
days_parking_in_effect, from_hours_in_effect, to_hours_in_effect, vehicle_color,
unregistered_vehicle, vehicle_year, meter_no, feet_from_curb, violation_post_code,
violation_description, no_standing_or_stopping_violation, hydrant_violation,
double_parking_violation, issue_date, violation_code
from parking_vio_2017_orc;

Now let’s check how the partitions and buckets look in the HDFS file system.