Calculating business days in HiveQL
A common task in data processing is calculating the number of days between two dates. In Hive, the built-in DATEDIFF function does this. You can also get the day-of-week number with a more obscure use of FROM_UNIXTIME, where your_date is a UNIX timestamp in seconds:
SELECT FROM_UNIXTIME(your_date,'u') FROM some_table;
This returns 1 for Monday through 7 for Sunday; the 'u' pattern letter comes from Java's SimpleDateFormat.
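As a quick illustration, both functions can be tried with literal dates. This is only a sketch: the dates are arbitrary examples (2015-03-02 is a Monday), and it assumes a Hive version that allows SELECT without a FROM clause (0.13 or later) and that still formats with SimpleDateFormat patterns.

```sql
-- DATEDIFF(end_date, start_date): calendar days between two dates.
SELECT DATEDIFF('2015-03-09', '2015-03-02');   -- 7

-- Day-of-week number for a date string: convert it to a UNIX timestamp
-- first, then format that timestamp with the 'u' pattern letter.
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP('2015-03-02', 'yyyy-MM-dd'), 'u');
```

Note that DATEDIFF takes the end date as its first argument, so swapping the arguments gives a negative result.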
It becomes more challenging if you need to calculate the number of business days (excluding Saturdays and Sundays) between two dates. Hive has no built-in function for this calculation, but fortunately it is rather simple to write your own Hive UDFs. I couldn't quickly find any existing open source functions that solve this problem, so I wrote my own in Scala: https://github.com/pythian/hive-udfs
There are actually three functions in the hive-udfs package: CountBusinessDays, CountSaturdays and CountSundays. Each function accepts a start date and an end date as UNIX timestamps (as returned by UNIX_TIMESTAMP) and returns the number of full days of the corresponding type in that interval, excluding the start and end points.
Here is how you can use these UDFs in Hive:
ADD JAR hdfs:///user/hive/udfs/pythian-hive-udfs-assembly-0.1.jar;
CREATE TEMPORARY FUNCTION count_business_days AS 'com.pythian.udf.CountBusinessDays';
CREATE TEMPORARY FUNCTION count_saturdays AS 'com.pythian.udf.CountSaturdays';
CREATE TEMPORARY FUNCTION count_sundays AS 'com.pythian.udf.CountSundays';
SELECT count_business_days(UNIX_TIMESTAMP(start_date), UNIX_TIMESTAMP(end_date)) FROM some_table;
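All three functions can be called in the same query. The sketch below assumes the functions have been registered as shown above and that start_date and end_date are strings in the default 'yyyy-MM-dd HH:mm:ss' format expected by the one-argument form of UNIX_TIMESTAMP. The column aliases are illustrative only.

```sql
-- Break each interval down into business days, Saturdays and Sundays.
SELECT
  start_date,
  end_date,
  count_business_days(UNIX_TIMESTAMP(start_date), UNIX_TIMESTAMP(end_date)) AS business_days,
  count_saturdays(UNIX_TIMESTAMP(start_date), UNIX_TIMESTAMP(end_date))     AS saturdays,
  count_sundays(UNIX_TIMESTAMP(start_date), UNIX_TIMESTAMP(end_date))       AS sundays
FROM some_table;

-- If the columns hold date-only strings such as '2015-03-02',
-- pass the pattern explicitly as the second argument:
--   UNIX_TIMESTAMP(start_date, 'yyyy-MM-dd')
```

If a string does not match the expected pattern, UNIX_TIMESTAMP may not return a usable timestamp, so it is worth checking the column format before relying on the counts.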
The code is open source and free to use. Comments and suggestions are always welcome.