Window Functions
Drawbacks of aggregate functions
Aggregate functions perform aggregations over the entire table (or over each GROUP BY group) and collapse multiple rows into a single output row.
A window function performs calculations across a set of table rows that are somehow related to the current row.
Window functions operate on a set of rows related to the current row instead of collapsing the table. 'Window' means an interval or set of rows.
Each row retains its own identity because the rows are not grouped into a single output row.
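The contrast between grouping and windowing can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module backed by SQLite 3.25 or later (the first release with window functions); the in-memory Sales table is hypothetical and mirrors the sample rows used in these notes.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

# GROUP BY collapses each customer's orders into one output row.
grouped = conn.execute(
    """
    SELECT CustomerID, SUM(SalesAmount) AS TotalSales
    FROM Sales
    GROUP BY CustomerID
    ORDER BY CustomerID
    """
).fetchall()

# The window version keeps one output row per order and repeats the total.
windowed = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS TotalSales
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

print(len(grouped), "rows from GROUP BY")          # 3 rows, one per customer
print(len(windowed), "rows from the window query")  # 6 rows, one per order
```

Both queries compute the same per-customer totals; only the window query keeps the individual orders visible next to them.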
OVER() clause and PARTITION BY
Syntax
SELECT
OrderDate,
SalesAmount,
SUM(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RunningTotal
FROM
SalesOrders;
PARTITION BY clause: defines the partitions (sets of rows) over which the calculation is performed. It is optional; if it is omitted, the entire result set is treated as one partition.
ORDER BY clause: specifies the order in which the rows of each partition are processed by the window function. For an aggregate such as SUM, adding ORDER BY makes the result cumulative up to the current row, which is why the column above is a running total.
The OVER clause determines exactly how the rows of the query are split up for processing by the window function.
If OVER() is empty, it means the entire table is considered to be a single window. The entire table is one partition.
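The three forms of OVER described above can be compared side by side. The sketch below assumes Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table; because that table has no OrderDate column, OrderID stands in as the ordering column.

```python
import sqlite3

# Hypothetical in-memory table; OrderID stands in for OrderDate as the ordering column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID,
           -- empty OVER(): the whole table is a single window
           SUM(SalesAmount) OVER () AS GrandTotal,
           -- PARTITION BY only: one total per customer
           SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS CustomerTotal,
           -- PARTITION BY + ORDER BY: running total within each customer
           SUM(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderID) AS RunningTotal
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

for row in rows:
    print(row)
```

For customer A (orders 1, 3 and 6) the running total grows 100, 250, 450, while CustomerTotal stays at 450 and GrandTotal stays at 1125 on every row.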
Types of window functions:
Aggregate window functions using the OVER() clause: SUM, MIN, MAX, AVG, COUNT
Rank-based window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE
Value window functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTH_VALUE
Aggregate Window Functions
SELECT
OrderID,
CustomerID,
SalesAmount,
COUNT(*) OVER (PARTITION BY CustomerID) AS CustomerOrderCount,
SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS TotalSales,
AVG(SalesAmount) OVER (PARTITION BY CustomerID) AS AverageSales,
MAX(SalesAmount) OVER (PARTITION BY CustomerID) AS HighestSales,
MIN(SalesAmount) OVER (PARTITION BY CustomerID) AS LeastSales
FROM
Sales;
OrderID | CustomerID | SalesAmount | CustomerOrderCount | TotalSales | AverageSales | HighestSales | LeastSales |
1 | A | 100 | 3 | 450 | 150.0 | 200 | 100 |
3 | A | 150 | 3 | 450 | 150.0 | 200 | 100 |
6 | A | 200 | 3 | 450 | 150.0 | 200 | 100 |
2 | B | 200 | 2 | 375 | 187.5 | 200 | 175 |
5 | B | 175 | 2 | 375 | 187.5 | 200 | 175 |
4 | C | 300 | 1 | 300 | 300.0 | 300 | 300 |
ROW_NUMBER() function
Assigns a sequential integer (starting from 1) to each row inside a partition.
There are no gaps in the row numbers within a partition.
Rows with equal values do not share a row number; they receive consecutive numbers, in an arbitrary order unless the ORDER BY breaks the tie.
In effect, it gives every row in a partition a unique ID.
Row numbers restart at 1 for every new partition.
Syntax:
ROW_NUMBER() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
Sample table (Sales):
OrderID | CustomerID | SalesAmount |
1 | A | 100 |
2 | B | 200 |
3 | A | 150 |
4 | C | 300 |
5 | B | 175 |
6 | A | 200 |
SELECT
OrderID,
CustomerID,
SalesAmount,
ROW_NUMBER() OVER (ORDER BY OrderID) AS RowNumber
FROM
Sales;
output:
OrderID | CustomerID | SalesAmount | RowNumber
----------------------------------------------
1 | A | 100 | 1
2 | B | 200 | 2
3 | A | 150 | 3
4 | C | 300 | 4
5 | B | 175 | 5
6 | A | 200 | 6
Applications:
Finding the nth highest value
Finding and removing duplicates
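Both applications can be sketched with ROW_NUMBER in a subquery. The example assumes Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table. Treating rows with the same CustomerID and SalesAmount as duplicates is an assumption made only for this illustration, as is the extra order 7 inserted to create one.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

# nth highest value: number each customer's orders by amount, then keep rn = 2.
second_highest = conn.execute(
    """
    SELECT CustomerID, OrderID, SalesAmount
    FROM (
        SELECT OrderID, CustomerID, SalesAmount,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerID
                   ORDER BY SalesAmount DESC, OrderID
               ) AS rn
        FROM Sales
    ) AS numbered
    WHERE rn = 2
    ORDER BY CustomerID
    """
).fetchall()
print(second_highest)  # customer C has only one order, so it does not appear

# Duplicates: add an illustrative repeat of order 2, then flag every row
# after the first within each (CustomerID, SalesAmount) group.
conn.execute("INSERT INTO Sales VALUES (7, 'B', 200)")
duplicates = conn.execute(
    """
    SELECT OrderID
    FROM (
        SELECT OrderID,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerID, SalesAmount
                   ORDER BY OrderID
               ) AS rn
        FROM Sales
    ) AS numbered
    WHERE rn > 1
    """
).fetchall()
print(duplicates)  # the rows a cleanup step could delete
```

ROW_NUMBER suits both tasks because it never repeats a number within a partition, so `rn = n` selects exactly one row and `rn > 1` selects every extra copy.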
RANK function
Assigns a rank (starting from 1) to each row inside a partition.
Syntax:
RANK() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
Example:
In the sample table above, several customers have placed orders, and one customer can have more than one order. We would like to rank each customer's orders by amount.
SELECT
OrderID,
CustomerID,
SalesAmount,
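RANK() OVER (PARTITION BY CustomerID ORDER BY SalesAmount DESC) AS SalesRank
FROM
Sales;
The OVER clause is a required part of the syntax. PARTITION BY names the column used to group the rows; here, each customer's orders are ranked separately. The alias is SalesRank rather than Rank because RANK is a reserved word in some databases (for example, MySQL 8).
OrderID | CustomerID | SalesAmount | SalesRank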
-------------------------------------------
1 | A | 100 | 3
3 | A | 150 | 2
6 | A | 200 | 1
2 | B | 200 | 1
5 | B | 175 | 2
4 | C | 300 | 1
Core differences from ROW_NUMBER
Tied values in a partition are allocated the same rank.
Ranks are not always consecutive: after a tie, the following rank or ranks are skipped (for example 1, 2, 2, 4).
As with ROW_NUMBER, the rank restarts at 1 for every new partition.
SELECT
OrderID,
CustomerID,
SalesAmount,
RANK() OVER (ORDER BY SalesAmount DESC) AS SalesRank
FROM
Sales;
Output:
OrderID | CustomerID | SalesAmount | SalesRank |
4 | C | 300 | 1 |
6 | A | 200 | 2 |
2 | B | 200 | 2 |
5 | B | 175 | 4 |
3 | A | 150 | 5 |
1 | A | 100 | 6 |
Application:
Find and delete duplicates
Performance analysis
DENSE_RANK function
Assigns a rank (starting from 1) to each row inside a partition.
Syntax:
DENSE_RANK() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
DENSE_RANK assigns consecutive ranks: tied rows share a rank and the next rank is not skipped, so there are no gaps.
SELECT
OrderID,
CustomerID,
SalesAmount,
DENSE_RANK() OVER (ORDER BY SalesAmount DESC) AS DenseRank
FROM
Sales;
Output:
OrderID | CustomerID | SalesAmount | DenseRank |
4 | C | 300 | 1 |
6 | A | 200 | 2 |
2 | B | 200 | 2 |
5 | B | 175 | 3 |
3 | A | 150 | 4 |
1 | A | 100 | 5 |
Function | Purpose | Behavior |
ROW_NUMBER() | Assigns a unique sequential integer (starting from 1) for each row within a partition. | There are no gaps in row numbers within a partition. If two rows have the same values and the same order, they will get different row numbers. Row numbers are reinitialized for each new partition. |
RANK() | Assigns a rank (starting from 1) for each row within a partition, with equal values getting the same rank and leaving gaps. | Tied values within a partition receive the same rank, and the next rank will have a gap if multiple rows share the same rank. Ranks are reinitialized for each new partition. |
DENSE_RANK() | Assigns a rank (starting from 1) for each row within a partition, with equal values getting the same rank, but no gaps. | Tied values within a partition receive the same rank, and the next rank will not have a gap if multiple rows share the same rank. Ranks are reinitialized for each new partition. |
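The differences summarized in the table are easiest to see when the three functions run over the same ordering. This is a minimal sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table. OrderID is added as a tie-breaker for ROW_NUMBER only, so that its numbering is deterministic.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, SalesAmount,
           ROW_NUMBER() OVER (ORDER BY SalesAmount DESC, OrderID) AS RowNum,
           RANK()       OVER (ORDER BY SalesAmount DESC)          AS SalesRank,
           DENSE_RANK() OVER (ORDER BY SalesAmount DESC)          AS DenseRank
    FROM Sales
    ORDER BY RowNum
    """
).fetchall()

for row in rows:
    print(row)
# The two 200 orders get RowNum 2 and 3, SalesRank 2 and 2, DenseRank 2 and 2.
# The 175 order that follows gets RowNum 4, SalesRank 4 (gap), DenseRank 3 (no gap).
```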
NTILE function
Divides the ordered rows of a partition into a specified number of roughly equal groups (buckets) and returns the bucket number of each row.
Syntax:
NTILE(number_of_buckets) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
number_of_buckets: an integer that specifies the number of buckets you want to divide the result set into.
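SELECT
OrderID,
CustomerID,
SalesAmount,
NTILE(3) OVER (ORDER BY SalesAmount DESC) AS Bucket
FROM
Sales;
With six rows and three buckets, each bucket receives two rows. Rows with equal values can land in different buckets, and which of the tied rows comes first is not guaranteed.
OrderID | CustomerID | SalesAmount | Bucket |
4 | C | 300 | 1 |
6 | A | 200 | 1 |
2 | B | 200 | 2 |
5 | B | 175 | 2 |
3 | A | 150 | 3 |
1 | A | 100 | 3 |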
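"Roughly equal" matters when the row count is not a multiple of the bucket count: the leftover rows go to the earliest buckets. The sketch below assumes Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table; OrderID is added to the ORDER BY purely to make the result deterministic.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, SalesAmount,
           NTILE(3) OVER (ORDER BY SalesAmount DESC, OrderID) AS Bucket3,
           NTILE(4) OVER (ORDER BY SalesAmount DESC, OrderID) AS Bucket4
    FROM Sales
    ORDER BY SalesAmount DESC, OrderID
    """
).fetchall()

for row in rows:
    print(row)
# Six rows into 3 buckets: sizes 2, 2, 2.
# Six rows into 4 buckets: sizes 2, 2, 1, 1 (the first buckets take the extra rows).
```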
LEAD function
Allows you to access the value of a subsequent (following) row within the result set of a query.
Syntax:
LEAD(column, offset, default_value) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
column: the column from which you want to retrieve the subsequent row's value.
offset: an optional integer that specifies how many rows after the current row to look ahead. If omitted, the default is 1.
default_value: an optional value returned when the offset goes beyond the last row of the partition or result set. If omitted, NULL is returned.
ORDER BY: this clause is required and specifies the order in which rows are processed by the LEAD() function.
Useful for calculating the difference between consecutive rows or obtaining information from the next row based on a specific order.
Example:
SELECT
OrderID,
CustomerID,
SalesAmount,
LEAD(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS SubsequentSalesAmount
FROM
Sales;
OrderID | CustomerID | SalesAmount | SubsequentSalesAmount |
1 | A | 100 | 200 |
2 | B | 200 | 150 |
3 | A | 150 | 300 |
4 | C | 300 | 175 |
5 | B | 175 | 200 |
6 | A | 200 | 0 |
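The "difference between consecutive rows" use case mentioned above could look like the following. This is a sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table. No default_value is passed here, so the last row yields NULL (None in Python) instead of a misleading difference.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, SalesAmount,
           -- next order's amount minus this order's amount
           LEAD(SalesAmount) OVER (ORDER BY OrderID) - SalesAmount AS ChangeToNext
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

for row in rows:
    print(row)
# Order 1 (100) is followed by order 2 (200), so ChangeToNext is 100.
# Order 6 has no following row, so ChangeToNext is None.
```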
LAG function
Allows access to the value of a previous (preceding) row within the result set of a query.
Often used for calculations that compare the current row with the preceding row.
Syntax:
LAG(column, offset, default_value) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
column: the column from which you want to retrieve the previous row's value.
offset: an optional integer that specifies how many rows before the current row to look back. A value of 1 indicates the immediately preceding row, 2 the row before that, and so on. If omitted, the default is 1.
default_value: an optional value returned when the offset goes beyond the first row of the partition or result set.
PARTITION BY: optional, as always.
Example:
SELECT
OrderID,
CustomerID,
SalesAmount,
LAG(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
Sales;
OrderID | CustomerID | SalesAmount | PreviousSalesAmount |
1 | A | 100 | 0 |
2 | B | 200 | 100 |
3 | A | 150 | 200 |
4 | C | 300 | 150 |
5 | B | 175 | 300 |
6 | A | 200 | 175 |
Another example, where the offset is 2 and no default value is given:
SELECT
OrderID,
CustomerID,
SalesAmount,
LAG(SalesAmount, 2) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
Sales;
Sample Output:
OrderID | CustomerID | SalesAmount | PreviousSalesAmount |
1 | A | 100 | NULL |
2 | B | 200 | NULL |
3 | A | 150 | 100 |
4 | C | 300 | 200 |
5 | B | 175 | 150 |
6 | A | 200 | 300 |
Where there is no row two positions back, and no default value is given, the result is NULL.
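The examples above look back across the whole table. Adding PARTITION BY restricts the lookback to rows of the same partition, so the default value applies to the first row of every partition. The sketch below assumes Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table; the column name PreviousForCustomer is made up for this illustration.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           -- previous order of the SAME customer; 0 when there is none
           LAG(SalesAmount, 1, 0) OVER (
               PARTITION BY CustomerID
               ORDER BY OrderID
           ) AS PreviousForCustomer
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

for row in rows:
    print(row)
# Orders 1, 2 and 4 are each customer's first order, so they get the default 0.
# Order 3 (customer A) looks back to order 1 and gets 100.
```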
FIRST_VALUE function
Retrieves the value of a specified column from the first row within a partition of your result set.
Syntax:
FIRST_VALUE(column) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
Example
SELECT
OrderID,
CustomerID,
SalesAmount,
FIRST_VALUE(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS FirstSalesAmount
FROM
SalesOrders;
Output:
OrderID | CustomerID | SalesAmount | FirstSalesAmount |
1 | A | 100 | 100 |
3 | A | 150 | 100 |
6 | A | 200 | 100 |
2 | B | 200 | 200 |
5 | B | 175 | 200 |
4 | C | 300 | 300 |
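LAST_VALUE function
Retrieves the value of a specified column from the last row of the window frame. When ORDER BY is present and no frame is given, the default frame ends at the current row, so an explicit frame is needed to reach the last row of the partition.
Syntax:
LAST_VALUE(column) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
[frame_clause]
)
Example:
SELECT
OrderID,
CustomerID,
SalesAmount,
LAST_VALUE(SalesAmount) OVER (
PARTITION BY CustomerID
ORDER BY OrderDate
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS LastSalesAmount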
FROM
SalesOrders;
Output
OrderID | CustomerID | SalesAmount | LastSalesAmount |
1 | A | 100 | 200 |
3 | A | 150 | 200 |
6 | A | 200 | 200 |
2 | B | 200 | 175 |
5 | B | 175 | 175 |
4 | C | 300 | 300 |
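LAST_VALUE is sensitive to the window frame, which is a common source of surprising results. The sketch below compares the default frame with an explicit full-partition frame. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table, with OrderID standing in for OrderDate.

```python
import sqlite3

# Hypothetical in-memory table; OrderID stands in for OrderDate as the ordering column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           -- default frame: from the partition start up to the current row
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
           ) AS DefaultFrame,
           -- explicit frame: the whole partition
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS FullPartition
    FROM Sales
    ORDER BY CustomerID, OrderID
    """
).fetchall()

for row in rows:
    print(row)
# DefaultFrame simply repeats each row's own SalesAmount;
# FullPartition returns the customer's final order amount on every row.
```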
NTH_VALUE function
Retrieves the value of a column from the nth row of the window frame.
Syntax:
NTH_VALUE(column, n) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
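n: specifies the position of the row within the window frame, counting from 1. As with LAST_VALUE, the default frame ends at the current row when ORDER BY is present, so rows that come before position n return NULL unless an explicit frame is given.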
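A possible use of NTH_VALUE is to show each customer's second-highest order amount next to every order. This is a sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later and a hypothetical in-memory Sales table. The explicit frame is included so that every row of the partition can see position 2; the column name SecondHighest is made up for this illustration.

```python
import sqlite3

# Hypothetical in-memory table; the rows mirror the Sales sample in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           NTH_VALUE(SalesAmount, 2) OVER (
               PARTITION BY CustomerID
               ORDER BY SalesAmount DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS SecondHighest
    FROM Sales
    ORDER BY CustomerID, SalesAmount DESC
    """
).fetchall()

for row in rows:
    print(row)
# Customer A's amounts are 200, 150, 100, so SecondHighest is 150 on all three rows.
# Customer C has a single order, so there is no second row and the result is None.
```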
Further Reading
https://mode.com/sql-tutorial/sql-window-functions/
https://www.postgresql.org/docs/9.1/tutorial-window.html