Window Functions

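Drawbacks of aggregate functions

Aggregate functions used with GROUP BY aggregate over the entire table (or over each group) and collapse multiple rows into a single output row, so the detail of the individual rows is lost.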

A window function performs calculations across a set of table rows that are somehow related to the current row.

Window functions operate on a set of rows (the window) related to the current row, which need not be the entire table. 'Window' means an interval or set of rows.

This allows each row to retain its own identity, since the rows are not grouped into a single output row.

OVER() clause and PARTITION BY

Syntax

SELECT
    OrderDate,
    SalesAmount,
    SUM(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RunningTotal
FROM
    SalesOrders;

PARTITION BY clause: Defines the partitions (sets of rows) over which the function is computed. It is optional; if omitted, the entire result set is treated as one partition.
ORDER BY clause: Specifies the order in which rows are processed within each partition. For aggregates such as SUM, adding ORDER BY makes the result cumulative (a running total up to the current row) rather than one total per partition.
OVER clause: Determines exactly how the rows of the query are split up for processing by the window function.

If OVER() is empty, the entire result set is treated as a single window, that is, one partition.
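
To make the contrast between GROUP BY and OVER() concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module backed by SQLite 3.25 or later (the first release with window functions) and a small hypothetical Sales table with OrderID, CustomerID and SalesAmount columns. OrderID stands in for OrderDate as the ordering column, which is an assumption made only to keep the data small.

```python
import sqlite3

# In-memory database with a small, hypothetical Sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

# GROUP BY collapses the six orders into one row per customer.
grouped = conn.execute(
    "SELECT CustomerID, SUM(SalesAmount) FROM Sales "
    "GROUP BY CustomerID ORDER BY CustomerID"
).fetchall()

# Window functions keep all six rows and add the aggregates alongside them.
windowed = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           SUM(SalesAmount) OVER () AS GrandTotal,                          -- whole table
           SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS CustomerTotal,
           SUM(SalesAmount) OVER (PARTITION BY CustomerID
                                  ORDER BY OrderID) AS RunningTotal         -- cumulative
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

print(grouped)       # 3 rows
for row in windowed: # 6 rows
    print(row)
```

The grouped query returns three rows, while the windowed query returns all six orders, each carrying the grand total, its customer's total, and a running total per customer.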

Types Of Window Functions:

Aggregate window functions using the OVER() clause: SUM, MIN, MAX, AVG, COUNT
Rank-based window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE
Value window functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTH_VALUE

Aggregate Window Functions

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    COUNT(*) OVER (PARTITION BY CustomerID) AS CustomerOrderCount,
    SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS TotalSales,
    AVG(SalesAmount) OVER (PARTITION BY CustomerID) AS AverageSales,
    MAX(SalesAmount) OVER (PARTITION BY CustomerID) AS HighestSales,
    MIN(SalesAmount) OVER (PARTITION BY CustomerID) AS LeastSales
FROM
    Sales;

OrderID	CustomerID	SalesAmount	CustomerOrderCount	TotalSales	AverageSales	HighestSales	LeastSales
1	A	100	3	450	150.0	200	100
3	A	150	3	450	150.0	200	100
6	A	200	3	450	150.0	200	100
2	B	200	2	375	187.5	200	175
5	B	175	2	375	187.5	200	175
4	C	300	1	300	300.0	300	300

ROW_NUMBER() function

Assigns a sequential integer (starting from 1) to each row within a partition.

There are no gaps in the row numbers within a partition.

Rows with equal values do not share a row number; each still receives its own consecutive number, and the order among such ties is arbitrary unless ORDER BY breaks them.

In effect, it gives every row in a partition a unique ID.

Row numbers restart at 1 for every new partition.

Syntax:

ROW_NUMBER() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Sample Sales table:

OrderID	CustomerID	SalesAmount
1	A	100
2	B	200
3	A	150
4	C	300
5	B	175
6	A	200

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    ROW_NUMBER() OVER (ORDER BY OrderID) AS RowNumber
FROM
    Sales;

Output:

OrderID  | CustomerID | SalesAmount | RowNumber
----------------------------------------------
1        | A          | 100         | 1
2        | B          | 200         | 2
3        | A          | 150         | 3
4        | C          | 300         | 4
5        | B          | 175         | 5
6        | A          | 200         | 6

Applications

Finding the nth highest value
Finding and removing duplicates
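
Both applications can be sketched with ROW_NUMBER() in a subquery. The example below is a sketch under assumptions, not a definitive recipe: it uses Python's built-in sqlite3 module (SQLite 3.25 or later), the sample Sales data, and one extra hypothetical row (OrderID 7) that repeats customer A's 100 order so that there is a duplicate to find. Treating "same CustomerID and SalesAmount" as the definition of a duplicate is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200),
     (7, "A", 100)],  # hypothetical duplicate of order 1
)

# nth highest row (here n = 2): number the rows by amount, then filter on rn.
second_highest = conn.execute(
    """
    SELECT OrderID, SalesAmount
    FROM (
        SELECT OrderID, SalesAmount,
               ROW_NUMBER() OVER (ORDER BY SalesAmount DESC, OrderID) AS rn
        FROM Sales
    ) AS numbered
    WHERE rn = 2
    """
).fetchall()

# Duplicates: within each (CustomerID, SalesAmount) group, every row after
# the first (rn > 1) is treated as a duplicate.
dup_sql = """
    SELECT OrderID
    FROM (
        SELECT OrderID,
               ROW_NUMBER() OVER (
                   PARTITION BY CustomerID, SalesAmount ORDER BY OrderID
               ) AS rn
        FROM Sales
    ) AS numbered
    WHERE rn > 1
"""
duplicates = conn.execute(dup_sql).fetchall()

# Remove the duplicates and count what is left.
conn.execute(f"DELETE FROM Sales WHERE OrderID IN ({dup_sql})")
remaining = conn.execute("SELECT COUNT(*) FROM Sales").fetchone()[0]

print(second_highest, duplicates, remaining)
```

Because ROW_NUMBER() never repeats a number, rn = 2 picks the second row, not the second distinct amount; when ties should count as one value, DENSE_RANK() is the closer fit.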

RANK function

Assigns a rank (starting from 1) to each row within a partition.

Syntax:

RANK() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Example:

The table above contains several customers, each of whom may have placed more than one order. Here we rank each customer's orders by amount, from highest to lowest.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    RANK() OVER (PARTITION BY CustomerID ORDER BY SalesAmount DESC) AS SalesRank
FROM
    Sales;

The OVER clause is a mandatory part of the syntax. PARTITION BY names the column used to group the rows. The alias is SalesRank rather than Rank because RANK is a reserved word in some databases (MySQL 8, for example).

OrderID  | CustomerID | SalesAmount | SalesRank
------------------------------------------------
1        | A          | 100         | 3
3        | A          | 150         | 2
6        | A          | 200         | 1
2        | B          | 200         | 1
5        | B          | 175         | 2
4        | C          | 300         | 1

Core differences from ROW_NUMBER

Tied values in a partition are given the same rank.
Ranks are not always consecutive: after a tie, the following rank or ranks are skipped.
As with ROW_NUMBER, the rank restarts for every new partition.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    RANK() OVER (ORDER BY SalesAmount DESC) AS SalesRank
FROM
    Sales;

Output:

OrderID	CustomerID	SalesAmount	SalesRank
4	C	300	1
6	A	200	2
2	B	200	2
5	B	175	4
3	A	150	5
1	A	100	6

Application:

Find and delete duplicates
Performance analysis

DENSE_RANK function

Assigns a rank (starting from 1) to each row within a partition.

Syntax:

DENSE_RANK() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

DENSE_RANK assigns consecutive ranks and does not skip any after a tie, so there are no gaps.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    DENSE_RANK() OVER (ORDER BY SalesAmount DESC) AS DenseRank
FROM
    Sales;

Output:

OrderID	CustomerID	SalesAmount	DenseRank
4	C	300	1
6	A	200	2
2	B	200	2
5	B	175	3
3	A	150	4
1	A	100	5

Function	Purpose	Behavior
`ROW_NUMBER()`	Assigns a unique sequential integer (starting from 1) for each row within a partition.	There are no gaps in row numbers within a partition. If two rows have the same values and the same order, they will get different row numbers. Row numbers are reinitialized for each new partition.
`RANK()`	Assigns a rank (starting from 1) for each row within a partition, with equal values getting the same rank and leaving gaps.	Tied values within a partition receive the same rank, and the next rank will have a gap if multiple rows share the same rank. Ranks are reinitialized for each new partition.
`DENSE_RANK()`	Assigns a rank (starting from 1) for each row within a partition, with equal values getting the same rank, but no gaps.	Tied values within a partition receive the same rank, and the next rank will not have a gap if multiple rows share the same rank. Ranks are reinitialized for each new partition.
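
The three functions are easiest to compare side by side on the same rows. The following is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) and the sample Sales data. The OrderID DESC tie-breaker for ROW_NUMBER() and the column aliases are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, SalesAmount,
           ROW_NUMBER() OVER (ORDER BY SalesAmount DESC, OrderID DESC) AS RowNum,
           RANK()       OVER (ORDER BY SalesAmount DESC) AS SalesRank,
           DENSE_RANK() OVER (ORDER BY SalesAmount DESC) AS DenseRank
    FROM Sales
    ORDER BY SalesAmount DESC, OrderID DESC
    """
).fetchall()

# Columns: OrderID, SalesAmount, RowNum, SalesRank, DenseRank
for row in rows:
    print(row)
# (4, 300, 1, 1, 1)
# (6, 200, 2, 2, 2)
# (2, 200, 3, 2, 2)   <- tie: same RANK and DENSE_RANK, different ROW_NUMBER
# (5, 175, 4, 4, 3)   <- RANK skips 3, DENSE_RANK does not
# (3, 150, 5, 5, 4)
# (1, 100, 6, 6, 5)
```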

NTILE

Divides the ordered rows of a result set (or of each partition) into a specified number of roughly equal groups, called buckets.

Syntax:

NTILE(number_of_buckets) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

number_of_buckets: An integer that specifies how many buckets to divide the rows into. If the rows do not divide evenly, the earlier buckets receive one extra row each.

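SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    NTILE(3) OVER (ORDER BY SalesAmount DESC) AS Bucket
FROM
    Sales;

Output:

OrderID	CustomerID	SalesAmount	Bucket
4	C	300	1
6	A	200	1
2	B	200	2
5	B	175	2
3	A	150	3
1	A	100	3

The six rows split evenly into three buckets of two. Orders 6 and 2 tie on SalesAmount, so which of them falls into bucket 1 is arbitrary unless a tie-breaking column is added to ORDER BY.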
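
The phrase "roughly equal" matters when the row count is not a multiple of the bucket count. A minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) and the sample Sales data, ordered by OrderID so that there are no ties; the aliases Thirds and Quarters are made up for this illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID,
           NTILE(3) OVER (ORDER BY OrderID) AS Thirds,   -- 6 rows / 3 buckets: 2, 2, 2
           NTILE(4) OVER (ORDER BY OrderID) AS Quarters  -- 6 rows / 4 buckets: 2, 2, 1, 1
    FROM Sales
    ORDER BY OrderID
    """
).fetchall()

for row in rows:
    print(row)
```

With four buckets over six rows, the two leftover rows go to the first two buckets, so the bucket sizes are 2, 2, 1 and 1.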

LEAD function

Allows you to access the value of a subsequent (following) row within the result set of a query.

Syntax:

LEAD(column, offset, default_value) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

column: The column from which you want to retrieve the subsequent row's value.
offset: An optional integer that specifies how many rows after the current row to look ahead. If omitted, the default is 1.
default_value: An optional value returned when the offset goes beyond the last row of the partition or result set. If omitted, NULL is returned.
ORDER BY: Required; specifies the order in which rows are processed by the LEAD() function.

Useful for calculating the difference between consecutive rows or obtaining information from the next row based on a specific order.

Example:

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LEAD(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS SubsequentSalesAmount
FROM
    Sales;

OrderID	CustomerID	SalesAmount	SubsequentSalesAmount
1	A	100	200
2	B	200	150
3	A	150	300
4	C	300	175
5	B	175	200
6	A	200	0

LAG function

Allows access to the value of a previous row within the result set of a query.

Often used for calculations that compare the current row with the preceding row.

Syntax:

LAG(column, offset, default_value) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

column: The column from which you want to retrieve the previous row's value.
offset: An optional integer that specifies how many rows before the current row to look back. A value of 1 means the immediately preceding row, 2 the row before that, and so on. If omitted, the default is 1.
default_value: An optional value returned when the offset goes beyond the first row of the partition or result set. If omitted, NULL is returned.
PARTITION BY: Optional, as with the other window functions.

Example:


SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAG(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
    Sales;

OrderID	CustomerID	SalesAmount	PreviousSalesAmount
1	A	100	0
2	B	200	100
3	A	150	200
4	C	300	150
5	B	175	300
6	A	200	175

Another example, where the offset is 2 and no default is given:

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAG(SalesAmount, 2) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
    Sales;

Sample Output:

OrderID	CustomerID	SalesAmount	PreviousSalesAmount
1	A	100	NULL
2	B	200	NULL
3	A	150	100
4	C	300	200
5	B	175	150
6	A	200	300

Where there is no row two positions back, the result defaults to NULL.
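
A common use of LAG and LEAD is comparing consecutive rows within a partition. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25 or later) and the sample Sales data; the aliases ChangeFromPrevious and NextAmount are made up for this illustration. It computes, for each customer, how much each order differs from that customer's previous order and what the next order's amount is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           -- NULL for a customer's first order, because no default is given
           SalesAmount - LAG(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
           ) AS ChangeFromPrevious,
           -- NULL for a customer's last order
           LEAD(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
           ) AS NextAmount
    FROM Sales
    ORDER BY CustomerID, OrderID
    """
).fetchall()

for row in rows:
    print(row)  # SQL NULL is shown as None
```

Because of PARTITION BY CustomerID, the look-back and look-ahead never cross from one customer to another.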

FIRST_VALUE

Retrieves the value of a specified column from the first row within a partition of your result set.

Syntax:

FIRST_VALUE(column) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Example

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    FIRST_VALUE(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS FirstSalesAmount
FROM
    SalesOrders;

Output:

OrderID	CustomerID	SalesAmount	FirstSalesAmount
1	A	100	100
3	A	150	100
6	A	200	100
2	B	200	200
5	B	175	200
4	C	300	300

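LAST_VALUE

Returns the value of a specified column from the last row of the window frame. To get the very last value in a partition, the frame must extend to the end of that partition.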

Syntax:

LAST_VALUE(column) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Example:

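SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAST_VALUE(SalesAmount) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS LastSalesAmount
FROM
    SalesOrders;

The explicit frame (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) is needed here. With ORDER BY alone, the default frame ends at the current row, so LAST_VALUE would return the current row's own SalesAmount instead of the partition's last one.

Output: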

OrderID	CustomerID	SalesAmount	LastSalesAmount
1	A	100	200
3	A	150	200
6	A	200	200
2	B	200	175
5	B	175	175
4	C	300	300
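
LAST_VALUE is sensitive to the window frame, which is easy to overlook. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25 or later) and the sample Sales data, with OrderID standing in for OrderDate as the ordering column (an assumption, since the sample table has no date column). It compares the default frame with a frame that spans the whole partition; the aliases DefaultFrame and WholePartition are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           -- Default frame with ORDER BY: partition start up to the current row
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
           ) AS DefaultFrame,
           -- Explicit frame covering the whole partition
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS WholePartition
    FROM Sales
    ORDER BY CustomerID, OrderID
    """
).fetchall()

# Columns: OrderID, CustomerID, SalesAmount, DefaultFrame, WholePartition
for row in rows:
    print(row)
```

In the DefaultFrame column each row simply sees its own SalesAmount, while WholePartition shows each customer's final order amount (200 for A, 175 for B, 300 for C).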

NTH_VALUE

Retrieves the value of a column from the nth row of the window frame within a partition.

Syntax:

NTH_VALUE(column, n) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

n: A positive integer giving the position of the row, counting from 1. If the frame contains fewer than n rows, the result is NULL.
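
As an illustration, here is a minimal sketch that fetches each customer's second order amount. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) and the sample Sales data, with OrderID as the ordering column; the alias SecondSalesAmount is made up for this illustration. The explicit frame makes every row see its whole partition. Without it, the default frame stops at the current row, so a customer's first row would get NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "B", 200), (3, "A", 150),
     (4, "C", 300), (5, "B", 175), (6, "A", 200)],
)

rows = conn.execute(
    """
    SELECT OrderID, CustomerID, SalesAmount,
           NTH_VALUE(SalesAmount, 2) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS SecondSalesAmount
    FROM Sales
    ORDER BY CustomerID, OrderID
    """
).fetchall()

for row in rows:
    print(row)  # customer C has only one order, so the result is None (NULL)
```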
