Window Functions

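Drawbacks of aggregate functions

Aggregate functions perform aggregations over the entire table (or over each GROUP BY group) and collapse multiple rows into a single output row, so the detail of the individual rows is lost.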

A window function performs calculations across a set of table rows that are somehow related to the current row.

Window functions operate on a set of rows instead of the entire table. 'Window' means an interval or set of rows.

Each row retains its own identity because the rows are not grouped into a single output row.
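To make the contrast concrete, here is a minimal sketch that runs both styles of query through Python's built-in sqlite3 module. The Sales table and its six rows are illustrative assumptions (they mirror the sample data used later in these notes), and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# GROUP BY collapses each customer's orders into one output row.
grouped = conn.execute("""
    SELECT CustomerID, SUM(SalesAmount)
    FROM Sales
    GROUP BY CustomerID
    ORDER BY CustomerID
""").fetchall()

# The window version keeps all six rows and attaches the total to each one.
windowed = conn.execute("""
    SELECT OrderID, CustomerID, SalesAmount,
           SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS CustomerTotal
    FROM Sales
    ORDER BY OrderID
""").fetchall()

print(grouped)   # one row per customer
print(windowed)  # one row per order, each carrying its customer's total
```

The grouped query returns three rows, while the windowed query still returns all six orders.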

OVER() clause and PARTITION BY

Syntax

SELECT
    OrderDate,
    SalesAmount,
    SUM(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RunningTotal
FROM
    SalesOrders;

  1. PARTITION BY clause: defines the partitions (sets of rows) over which the window function is computed. It is optional; if omitted, the entire result set is treated as one partition.

  2. ORDER BY clause: specifies the order in which the rows within each partition are processed. With an aggregate such as SUM, it turns the result into a running (cumulative) value, as in RunningTotal above.

  3. OVER clause: marks the function as a window function and determines exactly how the rows of the query are split up for processing.

If OVER() is empty, the entire result set is treated as a single window, that is, one partition.
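The three window shapes described above can be compared side by side. This is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or newer assumed); the Sales rows are illustrative, and OrderID stands in for OrderDate as the ordering column.

```python
import sqlite3

# Hypothetical sample data: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

rows = conn.execute("""
    SELECT OrderID, CustomerID, SalesAmount,
           -- empty OVER(): the whole result set is one window
           SUM(SalesAmount) OVER () AS GrandTotal,
           -- PARTITION BY only: one total per customer
           SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS CustomerTotal,
           -- PARTITION BY + ORDER BY: running total within each customer
           SUM(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderID) AS RunningTotal
    FROM Sales
    ORDER BY CustomerID, OrderID
""").fetchall()

for row in rows:
    print(row)
```

For customer A the running total grows 100, 250, 450, while CustomerTotal stays at 450 and GrandTotal stays at 1125 on every row.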

Types of Window Functions:

  1. Aggregate window functions, used with the OVER() clause: SUM, MIN, MAX, AVG, COUNT

  2. Rank-based window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE

  3. Value window functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE, NTH_VALUE

Aggregate Window Functions

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    COUNT(*) OVER (PARTITION BY CustomerID) AS CustomerOrderCount,
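    SUM(SalesAmount) OVER (PARTITION BY CustomerID) AS TotalSales,
    AVG(SalesAmount) OVER (PARTITION BY CustomerID) AS AverageSales,
    MAX(SalesAmount) OVER (PARTITION BY CustomerID) AS HighestSales,
    MIN(SalesAmount) OVER (PARTITION BY CustomerID) AS LeastSales
FROM
    Sales;

Output:

OrderID  | CustomerID | SalesAmount | CustomerOrderCount | TotalSales | AverageSales | HighestSales | LeastSales
----------------------------------------------------------------------------------------------------------------
1        | A          | 100         | 3                  | 450        | 150.0        | 200          | 100
3        | A          | 150         | 3                  | 450        | 150.0        | 200          | 100
6        | A          | 200         | 3                  | 450        | 150.0        | 200          | 100
2        | B          | 200         | 2                  | 375        | 187.5        | 200          | 175
5        | B          | 175         | 2                  | 375        | 187.5        | 200          | 175
4        | C          | 300         | 1                  | 300        | 300.0        | 300          | 300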

ROW_NUMBER() function

Assigns a sequential integer (starting from 1) to each row within a partition.

There are no gaps in the numbering within a partition.

Rows with equal values do not share a row number; they receive consecutive numbers, in an arbitrary order unless ORDER BY breaks the tie.

In effect, it gives every row in a partition a unique ID.

Row numbers restart at 1 for every new partition.

Syntax:

ROW_NUMBER() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Sample Sales table:

OrderID  | CustomerID | SalesAmount
-----------------------------------
1        | A          | 100
2        | B          | 200
3        | A          | 150
4        | C          | 300
5        | B          | 175
6        | A          | 200

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    ROW_NUMBER() OVER (ORDER BY OrderID) AS RowNumber
FROM
    Sales;

Output:

OrderID  | CustomerID | SalesAmount | RowNumber
----------------------------------------------
1        | A          | 100         | 1
2        | B          | 200         | 2
3        | A          | 150         | 3
4        | C          | 300         | 4
5        | B          | 175         | 5
6        | A          | 200         | 6
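The query above numbers rows across the whole table. Adding PARTITION BY restarts the numbering for each customer. A minimal sketch with Python's sqlite3 module (SQLite 3.25 or newer assumed), loading the same six Sales rows into an in-memory database:

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# Number each customer's orders separately, oldest OrderID first.
rows = conn.execute("""
    SELECT OrderID, CustomerID,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderID) AS RowNumber
    FROM Sales
    ORDER BY CustomerID, OrderID
""").fetchall()

for row in rows:
    print(row)
```

Customer A's orders are numbered 1 to 3, customer B's 1 to 2, and customer C's single order gets 1.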

Application:

  • Finding the nth highest value

  • Finding duplicates
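Both applications follow the same pattern: number the rows in a subquery, then filter on the row number. The sketch below is one possible way to do it with Python's sqlite3 module (SQLite 3.25 or newer assumed). The extra order 7 is a made-up duplicate added purely for illustration.

```python
import sqlite3

# Sample Sales rows plus one hypothetical duplicate (order 7 repeats order 5).
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200),
         (7, "B", 175)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# nth highest value: the 2nd largest order amount per customer.
second_highest = conn.execute("""
    SELECT CustomerID, SalesAmount
    FROM (
        SELECT CustomerID, SalesAmount,
               ROW_NUMBER() OVER (PARTITION BY CustomerID
                                  ORDER BY SalesAmount DESC) AS rn
        FROM Sales
    ) AS ranked
    WHERE rn = 2
    ORDER BY CustomerID
""").fetchall()

# Duplicates: any row after the first with the same customer and amount.
duplicates = conn.execute("""
    SELECT OrderID, CustomerID, SalesAmount
    FROM (
        SELECT OrderID, CustomerID, SalesAmount,
               ROW_NUMBER() OVER (PARTITION BY CustomerID, SalesAmount
                                  ORDER BY OrderID) AS rn
        FROM Sales
    ) AS numbered
    WHERE rn > 1
""").fetchall()

print(second_highest)
print(duplicates)
```

Customer C has only one order, so it has no second-highest row. What counts as a duplicate (here: same customer and same amount) is an assumption that depends on the data.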

RANK() function

Assigns a rank (starting from 1) to each row within a partition.

Syntax:

RANK() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

Example:

The Sales table above contains orders from several customers, and one customer can have several orders. We'd like to rank each customer's orders by amount, largest first.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    RANK() OVER (PARTITION BY CustomerID ORDER BY SalesAmount DESC) AS Rank
FROM
    Sales;

The OVER clause is a required part of the syntax. PARTITION BY names the column used to group the rows (here CustomerID), so the ranking restarts for each customer.

OrderID  | CustomerID | SalesAmount | Rank
-------------------------------------------
1        | A          | 100         | 3
3        | A          | 150         | 2
6        | A          | 200         | 1
2        | B          | 200         | 1
5        | B          | 175         | 2
4        | C          | 300         | 1

Core differences from ROW_NUMBER

  • Tied values in a partition receive the same rank.

  • Ranks are not always consecutive: after a tie, the following rank is skipped (two rows tied at rank 2 are followed by rank 4).

  • Like row numbers, ranks are reinitialized for every new partition.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    RANK() OVER (ORDER BY SalesAmount DESC) AS Rank
FROM
    Sales;

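Output:

OrderID  | CustomerID | SalesAmount | Rank
------------------------------------------
4        | C          | 300         | 1
6        | A          | 200         | 2
2        | B          | 200         | 2
5        | B          | 175         | 4
3        | A          | 150         | 5
1        | A          | 100         | 6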

Application:

  • Find and delete duplicates

  • Performance analysis

DENSE_RANK() function

Assigns a rank (starting from 1) to each row within a partition.

Syntax:

DENSE_RANK() OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

DENSE_RANK assigns consecutive ranks without skipping any, so there are no gaps even after a tie.

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    DENSE_RANK() OVER (ORDER BY SalesAmount DESC) AS DenseRank
FROM
    Sales;

Output:

OrderID  | CustomerID | SalesAmount | DenseRank
-----------------------------------------------
4        | C          | 300         | 1
6        | A          | 200         | 2
2        | B          | 200         | 2
5        | B          | 175         | 3
3        | A          | 150         | 4
1        | A          | 100         | 5

Summary of ROW_NUMBER, RANK, and DENSE_RANK:

Function     | Purpose | Behavior
---------------------------------
ROW_NUMBER() | Assigns a unique sequential integer (starting from 1) to each row within a partition. | There are no gaps in row numbers within a partition. Rows with the same values and the same order still get different row numbers. Row numbers are reinitialized for each new partition.
RANK()       | Assigns a rank (starting from 1) to each row within a partition, with equal values getting the same rank and leaving gaps. | Tied values within a partition receive the same rank, and the next rank has a gap when multiple rows share a rank. Ranks are reinitialized for each new partition.
DENSE_RANK() | Assigns a rank (starting from 1) to each row within a partition, with equal values getting the same rank, but no gaps. | Tied values within a partition receive the same rank, and the next rank has no gap when multiple rows share a rank. Ranks are reinitialized for each new partition.
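The three ranking functions are easiest to compare when they run over the same rows in one query. A minimal sketch with Python's sqlite3 module (SQLite 3.25 or newer assumed), using the six sample Sales rows. The OrderID tie-breaker inside ROW_NUMBER is an addition made here so the numbering is deterministic.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

rows = conn.execute("""
    SELECT OrderID, SalesAmount,
           ROW_NUMBER() OVER (ORDER BY SalesAmount DESC, OrderID) AS RowNumber,
           RANK()       OVER (ORDER BY SalesAmount DESC)          AS SalesRank,
           DENSE_RANK() OVER (ORDER BY SalesAmount DESC)          AS DenseRank
    FROM Sales
    ORDER BY SalesAmount DESC, OrderID
""").fetchall()

# Columns: OrderID, SalesAmount, RowNumber, SalesRank, DenseRank
for row in rows:
    print(row)
```

Orders 2 and 6 tie at 200: ROW_NUMBER gives them 2 and 3, RANK gives both 2 and then jumps to 4, and DENSE_RANK gives both 2 and continues with 3.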

NTILE() function

Divides the rows of a result set (or of each partition) into a specified number of roughly equal groups, numbered from 1.

Syntax:

NTILE(number_of_buckets) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

number_of_buckets: a positive integer that specifies how many buckets (groups) to divide the rows into. If the rows do not divide evenly, the earlier buckets each receive one extra row.

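SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    NTILE(3) OVER (ORDER BY SalesAmount DESC) AS Bucket
FROM
    Sales;

Output:

OrderID  | CustomerID | SalesAmount | Bucket
--------------------------------------------
4        | C          | 300         | 1
6        | A          | 200         | 1
2        | B          | 200         | 2
5        | B          | 175         | 2
3        | A          | 150         | 3
1        | A          | 100         | 3

Six rows split into three buckets of two rows each. Orders 6 and 2 tie on SalesAmount, so which of them lands in bucket 1 is not guaranteed.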
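NTILE is easiest to understand by comparing an even split with an uneven one. A minimal sketch with Python's sqlite3 module (SQLite 3.25 or newer assumed), using the six sample Sales rows. The OrderID tie-breaker in ORDER BY is an addition made here so the bucket assignment is deterministic.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

rows = conn.execute("""
    SELECT OrderID, SalesAmount,
           NTILE(3) OVER (ORDER BY SalesAmount DESC, OrderID) AS Thirds,
           NTILE(4) OVER (ORDER BY SalesAmount DESC, OrderID) AS Quarters
    FROM Sales
    ORDER BY SalesAmount DESC, OrderID
""").fetchall()

thirds = [r[2] for r in rows]    # 6 rows / 3 buckets: two rows per bucket
quarters = [r[3] for r in rows]  # 6 rows / 4 buckets: the first two buckets get the extra rows

print(thirds)
print(quarters)
```

With four buckets, the six rows are split 2, 2, 1, 1: the leftover rows go to the earliest buckets.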

LEAD() function

Allows you to access the value of a subsequent (following) row within the result set of a query.

Syntax:

LEAD(column, offset, default_value) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

  • column: the column from which you want to retrieve the subsequent row's value.

  • offset: an optional integer that specifies how many rows after the current row to look ahead. If omitted, the default is 1.

  • default_value: an optional value returned when the offset goes beyond the last row of the partition or result set. If omitted, NULL is returned.

  • ORDER BY: this clause is required and specifies the order in which rows are processed by the LEAD() function.

Useful for calculating the difference between consecutive rows or obtaining information from the next row based on a specific order.

Example:

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LEAD(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS SubsequentSalesAmount
FROM
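    Sales;

Output:

OrderID  | CustomerID | SalesAmount | SubsequentSalesAmount
-----------------------------------------------------------
1        | A          | 100         | 200
2        | B          | 200         | 150
3        | A          | 150         | 300
4        | C          | 300         | 175
5        | B          | 175         | 200
6        | A          | 200         | 0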
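The "difference between consecutive rows" use case can be sketched by subtracting the current amount from the LEAD value. This example uses Python's sqlite3 module (SQLite 3.25 or newer assumed) and the six sample Sales rows; the ChangeToNext column name is made up for illustration.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# No default_value is given, so the last row has no "next" and yields NULL.
rows = conn.execute("""
    SELECT OrderID, SalesAmount,
           LEAD(SalesAmount) OVER (ORDER BY OrderID) - SalesAmount AS ChangeToNext
    FROM Sales
    ORDER BY OrderID
""").fetchall()

changes = [r[2] for r in rows]
print(changes)  # [100, -50, 150, -125, 25, None]
```

Leaving out the default keeps the last row as NULL (None in Python), which avoids reporting a misleading change of -200 for the final order.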

LAG() function

Allows access to the value of a previous (preceding) row within the result set of a query.

Often used to perform calculations that involve comparing the current row with the preceding row.

Syntax:

LAG(column, offset, default_value) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

  • column: the column from which you want to retrieve the previous row's value.

  • offset: an optional integer that specifies how many rows before the current row to look back. A value of 1 indicates the immediately preceding row, 2 indicates the row before that, and so on. If omitted, the default is 1.

  • default_value: an optional value returned when the offset goes beyond the first row of the partition or result set.

  • PARTITION BY: optional, as always.

Example:


SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAG(SalesAmount, 1, 0) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
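    Sales;

Output:

OrderID  | CustomerID | SalesAmount | PreviousSalesAmount
---------------------------------------------------------
1        | A          | 100         | 0
2        | B          | 200         | 100
3        | A          | 150         | 200
4        | C          | 300         | 150
5        | B          | 175         | 300
6        | A          | 200         | 175

Another example, where the offset is 2 and no default is given: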

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAG(SalesAmount, 2) OVER (ORDER BY OrderID) AS PreviousSalesAmount
FROM
    Sales;

Sample Output:

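OrderID  | CustomerID | SalesAmount | PreviousSalesAmount
---------------------------------------------------------
1        | A          | 100         | NULL
2        | B          | 200         | NULL
3        | A          | 150         | 100
4        | C          | 300         | 200
5        | B          | 175         | 150
6        | A          | 200         | 300

Where there is no row two positions back, the result is NULL because no default_value was supplied.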
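Both LAG examples above look back across the whole table. With PARTITION BY, the lookback stays inside each customer's own orders. A minimal sketch with Python's sqlite3 module (SQLite 3.25 or newer assumed), using the six sample Sales rows; the PreviousForCustomer column name is made up for illustration.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# Previous order amount for the same customer; NULL for a customer's first order.
rows = conn.execute("""
    SELECT OrderID, CustomerID,
           LAG(SalesAmount) OVER (PARTITION BY CustomerID
                                  ORDER BY OrderID) AS PreviousForCustomer
    FROM Sales
    ORDER BY OrderID
""").fetchall()

for row in rows:
    print(row)
```

Orders 1, 2, and 4 are each customer's first order, so they have no previous value; order 6 looks back to order 3 (150), not to order 5.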

FIRST_VALUE() function

Retrieves the value of a specified column from the first row within a partition of the result set.

Syntax:

FIRST_VALUE(column) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

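Example (assuming SalesOrders holds the same rows as the Sales table plus an OrderDate column that increases with OrderID):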

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    FIRST_VALUE(SalesAmount) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS FirstSalesAmount
FROM
    SalesOrders;

Output:

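OrderID  | CustomerID | SalesAmount | FirstSalesAmount
------------------------------------------------------
1        | A          | 100         | 100
3        | A          | 150         | 100
6        | A          | 200         | 100
2        | B          | 200         | 200
5        | B          | 175         | 200
4        | C          | 300         | 300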

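LAST_VALUE() function

Retrieves the value of a specified column from the last row of the window frame.

When ORDER BY is present, the default frame ends at the current row, so LAST_VALUE returns the value of the current row (or of its last peer, if rows tie on the sort key). To get the very last value in the partition, extend the frame explicitly with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Syntax:

LAST_VALUE(column) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
    [frame_clause]
)

Example:

SELECT
    OrderID,
    CustomerID,
    SalesAmount,
    LAST_VALUE(SalesAmount) OVER (
        PARTITION BY CustomerID
        ORDER BY OrderDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS LastSalesAmount
FROM
    SalesOrders;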

Output

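OrderID  | CustomerID | SalesAmount | LastSalesAmount
-----------------------------------------------------
1        | A          | 100         | 200
3        | A          | 150         | 200
6        | A          | 200         | 200
2        | B          | 200         | 175
5        | B          | 175         | 175
4        | C          | 300         | 300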
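The output above is only produced when the window frame covers the whole partition. The sketch below compares LAST_VALUE under the default frame with an explicit full-partition frame, using Python's sqlite3 module (SQLite 3.25 or newer assumed). It uses the six sample Sales rows, with OrderID standing in for OrderDate as an assumption.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

rows = conn.execute("""
    SELECT OrderID,
           -- default frame: stops at the current row
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
           ) AS DefaultFrame,
           -- explicit frame: spans the whole partition
           LAST_VALUE(SalesAmount) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS FullFrame
    FROM Sales
    ORDER BY CustomerID, OrderID
""").fetchall()

# Columns: OrderID, DefaultFrame, FullFrame
for row in rows:
    print(row)
```

Under the default frame each row just sees its own SalesAmount (100, 150, 200 for customer A), while the full frame returns 200 for all of customer A's rows.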

NTH_VALUE() function

Retrieves the value of a column from a specific row (the nth) within the window frame.

Syntax:

NTH_VALUE(column, n) OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

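n: the 1-based position of the row within the window frame. If the frame contains fewer than n rows, the result is NULL. When ORDER BY is present, the default frame ends at the current row, so an explicit frame such as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is needed to look at the whole partition.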
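As an illustration, the sketch below fetches each customer's second order amount with NTH_VALUE, using Python's sqlite3 module (SQLite 3.25 or newer assumed) and the six sample Sales rows. The explicit frame clause and the SecondOrderAmount column name are choices made for this example.

```python
import sqlite3

# Sample Sales rows: (OrderID, CustomerID, SalesAmount)
SALES = [(1, "A", 100), (2, "B", 200), (3, "A", 150),
         (4, "C", 300), (5, "B", 175), (6, "A", 200)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (OrderID INTEGER, CustomerID TEXT, SalesAmount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", SALES)

# Second order amount per customer, repeated on every row of that customer.
rows = conn.execute("""
    SELECT OrderID, CustomerID,
           NTH_VALUE(SalesAmount, 2) OVER (
               PARTITION BY CustomerID ORDER BY OrderID
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS SecondOrderAmount
    FROM Sales
    ORDER BY CustomerID, OrderID
""").fetchall()

for row in rows:
    print(row)
```

Customer C has a single order, so there is no second row in its partition and the result is NULL (None in Python).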

Further Reading

https://mode.com/sql-tutorial/sql-window-functions/

https://www.postgresql.org/docs/9.1/tutorial-window.html

https://www.postgresql.org/docs/8.4/functions-window.html