The CASE expression is one of the most important functions for cleaning and enriching data. CASE evaluates a column or group of columns against one or more conditions and then returns an expression. This is helpful for cutting data into custom groups or mapping values between systems. Often, data is at a lower level of detailed then what’s useful for trend analysis, such as a time series with minute level timestamps. Using a case expression the analyst can cut the data into custom intervals. For example, suppose an analyst wants to compare orders on weekdays against orders on the weekend. The set of dates ‘2018-08-13’,‘2018-09-18’,‘2018-05-16’ can be used an input to a CASE an return ‘Mon-Fri’,‘Sat-Sun’,‘Mon-Fri’.
SELECT MY_DATE,
CASE
WHEN DATEPART(dw,MY_DATE)>= 2 AND DATEPART(dw,MY_DATE) <= 5 THEN 'Mon-Fri'
WHEN DATEPART(dw,MY_DATE) < 2 OR DATEPART(dw,MY_DATE) > 6 THEN 'Sat-Sun'
END AS CUSTOM_INTERVAL
FROM (
SELECT '2018-08-13' AS MY_DATE UNION ALL
SELECT '2018-09-22' UNION ALL
SELECT '2018-05-16'
Note the code above could be expressed logically as an If… Then… statement:
If MY_DATE is between the weekdays Monday and Friday then return the string ‘Mon-Fri’.
Else if MY_DATE isn’t between the weekdays Monday and Friday then return the string ‘Sat-Sun’.
Another good use case for CASE is discretizing numeric variables. A business might have custom rules about the classification of high, medium, and low priced orders. Let’s say any purchase under $30k is low, between $30k and $80k is medium and any purchase over $80k is high. Again, the If.. then.. pseudo code would look this:
If the order price is less than $30,000 then return the text string ‘low’. If the order price is between $30,000 and $80,000 then return the text string ‘medium’. If the order price is greater than $80,000 then return the text string ‘high’.
Once has the logic down it comes down to translating into sql syntax. The pseudo code broken down to a more elementary form looks like this:
If condition==True Then If is less than $30,0000 Then <‘low’>
A subtle change transforms the statement into a syntactically correct CASE expression.
CASE WHEN o_totalprice < 30000 THEN ‘low’
Using the orders table in tpch database we can fully express the high-medium-low price business rule:
SELECT TOP (10)
o_orderdate
,o_orderkey
,o_totalprice
,CASE
WHEN o_totalprice < 30000 THEN 'low'
WHEN o_totalprice BETWEEN 30000 AND 80000 THEN 'medium'
WHEN o_totalprice > 80000 THEN 'high'
END AS price_bucket
FROM tpch.dbo.orders
So far we’ve used the CASE expression against Date and Numeric columns. Another very useful application of the CASE statement is working with text columns. One may be tasked with mapping data values from one system to another. For example, a column describing the shipping mode contain the text strings (‘REG AIR’,‘SHIP’,‘FOB’) which might be a standard encoding for one system but need additional context to make sense for the consumers of data in another system. These three values might need replaced with a new string to make sense at the other system. (‘REG AIR’,‘SHIP’,‘FOB’) becomes (‘STANDARD AIRLINE’,‘BOAT’,‘FREE ON BOARD’). The case expression enables a quick and easy transformation:
SELECT l_shipmode,
CASE l_shipmode
WHEN 'REG AIR' THEN 'STANDARD AIRLINE'
WHEN 'FOB' THEN 'FREE ON BOARD'
WHEN 'SHIP' THEN 'BOAT'
ELSE l_shipmode
END AS custom_shipmode
FROM [tpch].[dbo].[lineitem]
GROUP BY l_shipmode,
CASE l_shipmode
WHEN 'REG AIR' THEN 'STANDARD AIRLINE'
WHEN 'FOB' THEN 'FREE ON BOARD'
WHEN 'SHIP' THEN 'BOAT'
ELSE l_shipmode
END
This case introduced a few subtle differences from the first two. First, one might have noticed the syntax is different. The first two start with CASE WHEN and this one is CASE WHEN. The syntax of the first two denote what’s called a searched CASE expression and the syntax of the third denotes a simple CASE expression. As a rule of thumb, one should use simple CASE expressions over searched CASE when possible because simple case expressions may result in better query performance.
Notice also the introduction of the ELSE keyword. The ELSE is always implicit in the case statement and defaulted to return NULL if none of the conditions are met. We see the behavior in action by commenting else the explicit ELSE:
SELECT l_shipmode,
CASE l_shipmode
WHEN 'REG AIR' THEN 'STANDARD AIRLINE'
WHEN 'FOB' THEN 'FREE ON BOARD'
WHEN 'SHIP' THEN 'BOAT'
--ELSE l_shipmode
END AS custom_shipmode
FROM [tpch].[dbo].[lineitem]
GROUP BY l_shipmode,
CASE l_shipmode
WHEN 'REG AIR' THEN 'STANDARD AIRLINE'
WHEN 'FOB' THEN 'FREE ON BOARD'
WHEN 'SHIP' THEN 'BOAT'
--ELSE l_shipmode
END
The last major difference of this query is the presence of the CASE expression in the GROUP BY clause. It’s common to see a CASE only in the select clause but it’s possible to see CASE used in the GROUP BY, WHERE, and ORDER BY clauses. A case statement in the ORDER BY is helpful for sorting by a custom rule to enhance presentation Suppose an analyst needs to see a count of customers grouped by market segment. The analyst knows the business prioritizes customers in the ‘HOUSEHOLD’ over customers in the ‘BUILDING’ market segment and thus would like to return results with HOUSEHOLD over the BUILDING category. Here is the query the analyst would write:
SELECT [c_mktsegment],COUNT(1) AS n_customers
FROM [tpch].[dbo].[customer]
GROUP BY c_mktsegment
ORDER BY CASE c_mktsegment
WHEN 'HOUSEHOLD' THEN 1
WHEN 'AUTOMOBILE' THEN 2
WHEN 'FURNITURE' THEN 3
WHEN 'MACHINERY ' THEN 4
WHEN 'BUILDING' THEN 5
END
Comments