SQL Data Cleaning

LEFT

Pulls a specified number of characters from each row's value, starting from the beginning (the left side) of the string.

| product_id | product_name       |
|------------|--------------------|
| 1          | Apple iPhone 13    |
| 2          | Samsung Galaxy S21 |
| 3          | Google Pixel 6     |

SELECT product_id, LEFT(product_name, 3) AS abbreviated_name
FROM products;

Output:

| product_id | abbreviated_name |
|------------|------------------|
| 1          | 'App'            |
| 2          | 'Sam'            |
| 3          | 'Goo'            |

RIGHT

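Pulls a specified number of characters from the right side (starting from the end of the string).

Output of RIGHT(product_name, 3) AS last_3_characters on the same products table:

| product_id | last_3_characters |
|------------|-------------------|
| 1          | ' 13'             |
| 2          | 'S21'             |
| 3          | 'l 6'             |

Spaces count as characters, which is why rows 1 and 3 include one. The same applies to any leading or trailing spaces stored in the value.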
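The query behind the last_3_characters column is not shown above. A minimal sketch, assuming the same products table as in the LEFT example, might look like this. The second column (last_3_trimmed is a made-up alias) trims the value first so that stray trailing spaces do not take up any of the three characters.

```sql
-- Assumes the products table from the LEFT example.
SELECT product_id,
       RIGHT(product_name, 3)       AS last_3_characters,
       RIGHT(TRIM(product_name), 3) AS last_3_trimmed  -- leading/trailing spaces removed before counting
FROM products;
```

RIGHT is available in PostgreSQL, MySQL and SQL Server; other databases may need SUBSTR with a computed start position instead.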

LENGTH

Provides the number of characters for each row of the specified column.

Some databases, such as SQL Server, use LEN() instead.

SELECT LENGTH('Hello, World!') AS text_length;

Output: 13
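For data cleaning, LENGTH is handy for spotting values that carry hidden white space. A rough sketch, assuming the products table from the LEFT example: any row whose length changes after trimming has leading or trailing spaces.

```sql
-- Rows whose product_name has leading or trailing spaces.
SELECT product_id, product_name
FROM products
WHERE LENGTH(product_name) <> LENGTH(TRIM(product_name));
```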

POSITION

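Some databases, such as PostgreSQL, also provide STRPOS, which takes the string first and the substring second.

Returns the index of the first occurrence of the specified substring within a string, or 0 if it is not found.

Positions are counted from 1 in SQL, not 0.

NB: Both are case sensitive (in some databases this depends on the collation), so consider converting both strings to the same case with UPPER or LOWER before searching.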

SELECT POSITION('World' IN 'Hello, World!') AS start_position;

Output: 8
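To illustrate the note on case sensitivity, here is a small sketch in PostgreSQL syntax. The first column searches as-is and finds nothing; the second lowercases the string first so the search matches.

```sql
SELECT POSITION('world' IN 'Hello, World!')        AS case_sensitive,    -- 0 in PostgreSQL: no exact match
       POSITION('world' IN LOWER('Hello, World!')) AS case_insensitive;  -- 8
```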

SUBSTRING

Extracts a substring from a string, given a start position and a length.

Oracle uses SUBSTR instead.

Example:

SELECT SUBSTRING('Hello, World!', 1, 5) AS extracted_text;
Output: Hello
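SUBSTRING is often combined with POSITION when the split point is not fixed. The sketch below uses PostgreSQL syntax and a made-up email address as the input; in practice the literal would be a column.

```sql
-- 'jane@example.com' is a hypothetical value used only for illustration.
SELECT LEFT('jane@example.com', POSITION('@' IN 'jane@example.com') - 1)      AS username,  -- jane
       SUBSTRING('jane@example.com', POSITION('@' IN 'jane@example.com') + 1) AS domain;    -- example.com
```

Leaving out the length, as in the second column, returns everything from the start position to the end of the string. SQL Server requires the length argument, so this form is an assumption about the database in use.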

CONCAT

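Combines values from several columns into a single string for each row. Some databases also support the pipe operator (||) or '+' for the same purpose.

Syntax

SELECT 
CONCAT(first_name, ' ', last_name)
FROM table_name;

SELECT 
first_name || ' ' || last_name
FROM table_name;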
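One difference worth knowing when cleaning data is how the two forms treat NULL. The sketch below shows the PostgreSQL behaviour; other databases differ (MySQL's CONCAT, for example, returns NULL if any argument is NULL).

```sql
-- PostgreSQL behaviour.
SELECT 'John' || NULL       AS with_pipes,   -- NULL: one NULL operand makes the whole result NULL
       CONCAT('John', NULL) AS with_concat;  -- 'John': CONCAT ignores NULL arguments
```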

CONCAT_WS

Concatenates strings with a specified delimiter.

Useful for joining columns with a separator.

Example:

SELECT CONCAT_WS(', ', 'John', 'Kimani') AS full_name;

Output

John, Kimani
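CONCAT_WS also skips NULL arguments after the separator, so a missing value does not leave a dangling delimiter. A small sketch with a NULL standing in for a missing middle name:

```sql
SELECT CONCAT_WS(', ', 'John', NULL, 'Kimani') AS full_name;  -- John, Kimani
```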

UPPER AND LOWER

Convert text to upper case and lower case, respectively.

Syntax:

SELECT 
UPPER(column_name), LOWER(column_name)
FROM table_name;
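A common cleaning use is to collapse values that differ only in case or padding. A minimal sketch, assuming a sales_data table with a sales_region column; region_key is a made-up alias.

```sql
-- 'East', 'EAST' and ' east ' all collapse to 'east'.
SELECT DISTINCT LOWER(TRIM(sales_region)) AS region_key
FROM sales_data;
```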

INITCAP

Capitalizes the first letter of each word in a string and lowercases the rest. Available in PostgreSQL and Oracle.

Useful for standardizing the capitalization of names or titles.
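A quick sketch of INITCAP on an inconsistently cased name (the name is a made-up value):

```sql
SELECT INITCAP('jOHN kimani') AS clean_name;  -- John Kimani
```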

REPLACE

Used to find and replace a specific substring within a string with another substring.

Example:

SELECT comment_id, REPLACE(user_comment, 'bad', 'good') AS modified_comment
FROM comments;

Every occurrence of the substring 'bad' in the comment is replaced with 'good'. This includes occurrences inside longer words, so 'badly' becomes 'goodly'.
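REPLACE can also remove characters by replacing them with an empty string, and calls can be nested to strip several characters at once. A sketch using a made-up phone number:

```sql
-- Inner call removes dashes, outer call removes spaces.
SELECT REPLACE(REPLACE('555-010 9999', '-', ''), ' ', '') AS digits_only;  -- 5550109999
```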

CAST

CAST is used to convert one data type to another (e.g., string to date).

SELECT CAST('2023-09-16' AS DATE) AS converted_date;

In PostgreSQL, you can also use the '::' operator for casting.

Example:

SELECT current_date::VARCHAR AS formatted_date;
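Casting is usually the last step after the text has been cleaned, because the cast fails if stray characters remain. A minimal sketch with literal values standing in for a text column:

```sql
SELECT CAST('42' AS INTEGER) + 1                  AS as_integer,  -- 43
       CAST(REPLACE('1,250', ',', '') AS INTEGER) AS cleaned;     -- 1250: comma removed before casting
```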

TRIM

Removes leading and trailing spaces from a string (that is, from both ends).

Useful for cleaning up stray white space.

Syntax:

SELECT
TRIM(column_name)
FROM table_name;

LTRIM: Removes leading white space, that is, white space at the beginning of the string.

RTRIM: Removes trailing white space, that is, white space at the end of the string.
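The three variants side by side, as a small sketch on literal values. The last column uses the standard TRIM(BOTH ... FROM ...) form to strip a character other than a space; check that your database supports it.

```sql
SELECT TRIM('  hello  ')              AS both_ends,    -- 'hello'
       LTRIM('  hello  ')             AS left_only,    -- 'hello  '
       RTRIM('  hello  ')             AS right_only,   -- '  hello'
       TRIM(BOTH '#' FROM '##hello#') AS custom_char;  -- 'hello'
```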

TO_DATE

Supported in Oracle and PostgreSQL.

Converts a text or character string into a date data type. It's particularly useful when you have date values stored as text and need to work with them as actual dates.

Syntax:

TO_DATE('2023-09-16', 'YYYY-MM-DD')
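The format string tells TO_DATE how to read the text, so it has to match the way the dates are actually stored. A sketch in PostgreSQL syntax with two made-up input formats, both of which parse to 16 September 2023:

```sql
SELECT TO_DATE('16/09/2023', 'DD/MM/YYYY')  AS from_slashes,
       TO_DATE('16 Sep 2023', 'DD Mon YYYY') AS from_month_name;
```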

COALESCE

The COALESCE function is used to handle NULL values in SQL.

It returns the first non-NULL value from a list of expressions.

Syntax:

SELECT COALESCE(NULL, 'Fallback Value') AS cleaned_value;

'Fallback Value' is the value returned whenever the first expression is NULL.

Example

SELECT COALESCE(sales_region, 'Unknown') AS cleaned_region
FROM sales_data;

This replaces NULL values in the "sales_region" column with the string 'Unknown'.
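COALESCE only catches real NULLs, so empty or blank strings pass through untouched. One possible way to handle both, sketched on the same sales_data table, is to turn blanks into NULL with NULLIF first:

```sql
-- NULLIF returns NULL when the trimmed value is an empty string,
-- so COALESCE then falls back to 'Unknown' for NULL, '' and '   ' alike.
SELECT COALESCE(NULLIF(TRIM(sales_region), ''), 'Unknown') AS cleaned_region
FROM sales_data;
```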

Further Reading

https://www.postgresql.org/docs/8.1/functions-string.html

https://mode.com/sql-tutorial/sql-string-functions-for-cleaning/