Do you need to find the data for a row that has the maximum value for a specific column in Oracle? Learn how to do it and see some alternative answers in this article.
The Table Structure
Let’s say you have a table called sales_volume that looks like this:
CITY | START_DATE | SALES |
El Paso | 27/May/15 | 4493 |
Wilmington | 16/Nov/15 | 4656 |
Richmond | 11/May/15 | 3739 |
Irvine | 23/Jan/15 | 1356 |
Berkeley | 18/May/15 | 263 |
Pasadena | 20/Oct/15 | 969 |
Chandler | 30/Nov/15 | 3679 |
Montgomery | 03/Oct/15 | 4744 |
Miramar | 23/May/15 | 3754 |
Elk Grove | 22/Jul/15 | 1684 |
You can see there is a range of records here with different data.
The Problem – Select the Row That Has the Maximum Value
Now, given you have the table above (or something like it), how could you do the following:
For each different city, find the rows that match the highest start_date value for each city.
So, I would expect an output like this:
CITY | START_DATE | SALES |
Abilene | 31/Dec/15 | 2611 |
Akron | 31/Dec/15 | 621 |
Albuquerque | 31/Dec/15 | 4397 |
Alexandria | 31/Dec/15 | 3944 |
How can I do this? How can I find the row that has the maximum value?
Also, if there are multiple ways to do it, which is the best way? After describing all of the solutions, I’ll compare the performance of each of them.
Note: With each of the outputs below, I’ve only displayed a small subset, as the full result contains a few thousand rows.
Solution 1 – LEFT OUTER JOIN
The first solution uses a LEFT OUTER JOIN on another instance of the same table.
SELECT
s1.city,
s1.start_date,
s1.sales
FROM sales_volume s1
LEFT OUTER JOIN sales_volume s2 ON (s1.city = s2.city AND s1.sales < s2.sales)
WHERE s2.city IS NULL
ORDER BY city;
Result:
CITY | START_DATE | SALES |
Abilene | 31/Dec/15 | 2,611 |
Abilene | 31/Dec/15 | 4,899 |
Abilene | 31/Dec/15 | 4,177 |
Abilene | 31/Dec/15 | 3,142 |
Abilene | 31/Dec/15 | 4,862 |
Akron | 31/Dec/15 | 621 |
Akron | 31/Dec/15 | 3,442 |
Akron | 31/Dec/15 | 4,939 |
Akron | 31/Dec/15 | 2,249 |
Akron | 31/Dec/15 | 4,929 |
Akron | 31/Dec/15 | 1,640 |
Akron | 31/Dec/15 | 999 |
Akron | 31/Dec/15 | 2,157 |
Akron | 31/Dec/15 | 4,811 |
Akron | 31/Dec/15 | 862 |
Akron | 31/Dec/15 | 3,664 |
Albuquerque | 31/Dec/15 | 4,397 |
Albuquerque | 31/Dec/15 | 398 |
Albuquerque | 31/Dec/15 | 3,750 |
Albuquerque | 31/Dec/15 | 4,561 |
Albuquerque | 31/Dec/15 | 620 |
Albuquerque | 31/Dec/15 | 3,935 |
Albuquerque | 31/Dec/15 | 4,820 |
Albuquerque | 31/Dec/15 | 1,640 |
Albuquerque | 31/Dec/15 | 2,173 |
Albuquerque | 31/Dec/15 | 3,318 |
This query selects all of the fields from the sales_volume table where no other row exists for the same city and with a greater start_date value.
Solution 2 – MAX and DENSE_RANK
Another way of getting the data you need is to use the MAX function.
SELECT
s.city,
MAX(s.start_date) KEEP (DENSE_RANK FIRST ORDER BY sales DESC) AS start_date
FROM sales_volume s
GROUP BY s.city
ORDER BY s.city;
Now, the MAX function is used to find the value of the highest record. It’s used as an analytical function by adding the KEEP clause at the end.
The KEEP clause, along with DENSE RANK FIRST tells the function to keep the first value that it finds when looking for the sales value. But, the ORDER BY start_date DESC part tells the function to order the fields by the start_date in descending order. This tells the MAX function to find the row that matches the highest start_date value.
However, the drawback with this solution is that it does not show the entire record.
Solution 3 – Correlated Subquery
Another way to use SQL to select the row with the highest value is to use MAX with a correlated subquery.
SELECT
s1.city,
s1.start_date,
s1.sales
FROM sales_volume s1
WHERE s1.sales = (
SELECT MAX(sales)
FROM sales_volume s2
WHERE s1.city = s2.city
)
ORDER BY s1.city;
This query may not be very efficient, as it uses this correlated subquery. But, we will see what happens when we do the performance test at the end of this post.
Solution 4 – MAX and Partition By
The last solution that I’ll be demonstrating for using SQL to select a max row is this one, which uses the MAX function and the PARTITION BY clause.
SELECT
city,
start_date,
sales
FROM (
SELECT
city,
start_date,
sales,
MAX(sales) OVER (PARTITION BY city) AS max_sales
FROM sales_volume)
WHERE sales = max_sales
ORDER BY city;
This uses the MAX function as an analytical query. It uses an inline view (having an SQL query as the FROM clause). Within this view, it adds an extra column which is the maximum start_date value which is partitioned by city. This field has the same value for each city, which is the maximum.
Then, outside this view, in the WHERE clause, it only shows those where the start_date value is the maximum.
Performance Comparison
Now that we have a few different solutions, all showing the correct records, let’s compare the performance of each of them.
The table that I used only had a small number of records. Let’s expand this with some random data up to 1 million records and see how it performs.
I’ll insert the extra records, add an index on the city column and the sales, show the actual run times for each, and check the EXPLAIN PLAN for each solution.
Here are the EXPLAIN PLANs for each of my solutions.
SELECT
PLAN_TABLE_OUTPUT
FROM TABLE(DBMS_XPLAN.DISPLAY());
Solution 1
------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 10000 | 390K| | 5160 (1)| 00:01:02 | | 1 | SORT ORDER BY | | 10000 | 390K| 480K| 5160 (1)| 00:01:02 | |* 2 | FILTER | | | | | | | |* 3 | HASH JOIN RIGHT OUTER| | 10000 | 390K| 28M| 5054 (1)| 00:01:01 | | 4 | TABLE ACCESS FULL | SALES_VOLUME | 1000K| 17M| | 1008 (1)| 00:00:13 | | 5 | TABLE ACCESS FULL | SALES_VOLUME | 1000K| 20M| | 1009 (1)| 00:00:13 | ------------------------------------------------------------------------------------------------
Solution 2
----------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 286 | 6292 | 1035 (4)| 00:00:13 | | 1 | SORT GROUP BY | | 286 | 6292 | 1035 (4)| 00:00:13 | | 2 | TABLE ACCESS FULL| SALES_VOLUME | 1000K| 20M| 1009 (1)| 00:00:13 | -----------------------------------------------------------------------------------
Solution 3
-------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 2739 | 109K| 2048 (3)| 00:00:25 | | 1 | SORT ORDER BY | | 2739 | 109K| 2048 (3)| 00:00:25 | |* 2 | HASH JOIN | | 2739 | 109K| 2047 (3)| 00:00:25 | | 3 | VIEW | VW_SQ_1 | 286 | 5434 | 1034 (3)| 00:00:13 | | 4 | HASH GROUP BY | | 286 | 5148 | 1034 (3)| 00:00:13 | | 5 | TABLE ACCESS FULL| SALES_VOLUME | 1000K| 17M| 1008 (1)| 00:00:13 | | 6 | TABLE ACCESS FULL | SALES_VOLUME | 1000K| 20M| 1009 (1)| 00:00:13 | --------------------------------------------------------------------------------------
Solution 4
-------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | -------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1000K| 55M| | 7648 (1)| 00:01:32 | |* 1 | VIEW | | 1000K| 55M| | 7648 (1)| 00:01:32 | | 2 | WINDOW SORT | | 1000K| 20M| 30M| 7648 (1)| 00:01:32 | | 3 | TABLE ACCESS FULL| SALES_VOLUME | 1000K| 20M| | 1009 (1)| 00:00:13 | --------------------------------------------------------------------------------------------
You can see how they each compare in this table:
Solution | Cardinality | Cost |
1 | 10,010 | 5,163 |
2 | 286 | 1,035 |
3 | 2,745 | 2,048 |
4 | 1,001,040 | 7,654 |
One thing that concerns me is the cardinality estimate on Solution 4. I’m not expecting 1 million unique records to be returned, but we’ll see what happens when we run the query.
Now, after running each of the queries, I get the following information:
Solution | Run Time (Sec) | Rows |
1 | 793.724 | 2726 |
2 | 0.462 | 286 |
3 | 0.504 | 2726 |
4 | 2.01 | 2726 |
You can see here that both Solution 2 and 3 took less than a second to run. Solution 4 took 2 seconds to run. Solution 1 took 793 seconds to run (13 minutes and 13 seconds), which was by far the longest.
Solution 1, 3 and 4 have 2726 records, which is correct because there are multiple occurrences of cities and start_dates with the same value.
Solution 2, however, had much less, as it does not show the full record.
Conclusion
So, in conclusion:
- There are many ways to write this query to get the solution you need.
- They each have a different performance.
- Make sure you create the right indexes on these fields.
For the data I have, I would probably use solution 3, but you should test these solutions against your own data to see which performs better.
Lastly, if you enjoy the information and career advice I’ve been providing, sign up to my newsletter below to stay up-to-date on my articles. You’ll also receive a fantastic bonus. Thanks!
Excellent sir simply superb
suberb
Thank you thank you thank you.
I have been banging on this a while. the table I have the max value for does not have a great way to join. This was very timely. I had been getting missing group statement errors so my format must have been wrong.
I copied yours and it ran. I used to have to find that max date and update my query. Now I can run it truly ad-hoc.
Have a nice weekend
Thank you so much.
I think problem statement should be corrected to sales instead of start_date
“For each different city, find the rows that match the highest sales value for each city.”
Looks like description is explaining about maximum start_date but queries are displaying maximum sales
Solution 2 is answering a different question to the others?