SQL Data Cleaning
LEFT
Returns a specified number of characters from the beginning (the left side) of a string, for each row.
| product_id | product_name |
|------------|--------------------|
| 1 | Apple iPhone 13 |
| 2 | Samsung Galaxy S21 |
| 3 | Google Pixel 6 |
SELECT product_id, LEFT(product_name, 3) AS abbreviated_name
FROM products;
Output:
| product_id | abbreviated_name |
|------------|------------------|
| 1 | 'App' |
| 2 | 'Sam' |
| 3 | 'Goo' |
RIGHT
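Returns a specified number of characters from the end (the right side) of a string.
SELECT product_id, RIGHT(product_name, 3) AS last_3_characters
FROM products;
Output:
| product_id | last_3_characters |
|------------|-------------------|
| 1 | ' 13' |
| 2 | 'S21' |
| 3 | 'l 6' |
Leading and trailing spaces are counted as characters, which is why row 1 returns ' 13' (a space followed by '13') and row 3 returns 'l 6'.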
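To illustrate how a stray trailing space changes the result, here is a small sketch that uses literal values instead of a real table. It assumes a database that supports RIGHT and RTRIM (for example PostgreSQL, MySQL or SQL Server); the column aliases are made up for illustration.

```sql
SELECT
    RIGHT('Galaxy S21 ', 3)        AS with_trailing_space,  -- '21 ' (the space is counted)
    RIGHT(RTRIM('Galaxy S21 '), 3) AS after_rtrim;          -- 'S21'
```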
LENGTH
Returns the number of characters in a string, for each row of the specified column.
Some databases (for example SQL Server) use LEN() instead.
SELECT LENGTH('Hello, World!') AS text_length;
Output: 13
POSITION
Called STRPOS in some databases (for example PostgreSQL).
Returns the index of the first occurrence of a substring within a string, or 0 if the substring is not found.
Indexing starts at 1 in SQL, not 0.
NB: Both are case-sensitive in PostgreSQL (in other databases this can depend on the collation), so consider converting both strings to upper or lower case before searching.
SELECT POSITION('World' IN 'Hello, World!') AS start_position;
Output: 8
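The case-sensitivity point can be sketched as follows. This assumes PostgreSQL; the sample string and column aliases are invented for illustration, and a database with a case-insensitive collation may find a match in both columns.

```sql
-- Assumes PostgreSQL, where POSITION is case-sensitive.
SELECT
    POSITION('nairobi' IN 'Visit NAIROBI today')        AS exact_case,   -- 0 (no match)
    POSITION('nairobi' IN LOWER('Visit NAIROBI today')) AS normalized;   -- 7
```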
SUBSTRING
Extracts part of a string, given a start position and the number of characters to return.
Oracle uses SUBSTR instead.
Example:
SELECT SUBSTRING('Hello, World!', 1, 5) AS extracted_text;
Output: Hello
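A common cleaning pattern is to combine SUBSTRING with POSITION to split a value at a delimiter. The sketch below splits a made-up email address at the '@' sign. It assumes PostgreSQL or MySQL, and the derived table t and the column aliases are hypothetical.

```sql
SELECT
    email,
    SUBSTRING(email, 1, POSITION('@' IN email) - 1) AS username,      -- 'jane.doe'
    SUBSTRING(email, POSITION('@' IN email) + 1)    AS email_domain   -- 'example.com'
FROM (SELECT 'jane.doe@example.com' AS email) AS t;
```

When the length argument is omitted, SUBSTRING returns everything from the start position to the end of the string.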
CONCAT
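Combines values from several columns (or literals) into a single string within each row. Some databases also support the pipe operator (||), for example PostgreSQL and Oracle, or '+', for example SQL Server.
Syntax:
SELECT
CONCAT(first_name, ' ', last_name)
FROM table_name;
SELECT
first_name || ' ' || last_name
FROM table_name;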
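The two forms do not always behave the same way when a value is NULL. The sketch below shows the difference as it works in PostgreSQL. Other databases differ (MySQL's CONCAT returns NULL if any argument is NULL, for instance), so treat this as an illustration and not a general rule.

```sql
-- Assumes PostgreSQL; other databases treat NULL differently here.
SELECT
    CONCAT('Jane', ' ', NULL) AS with_concat,  -- 'Jane ' (the NULL is ignored)
    'Jane' || ' ' || NULL     AS with_pipes;   -- NULL (any NULL operand makes the result NULL)
```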
CONCAT_WS
Concatenates strings with a specified delimiter.
Useful for joining columns with a separator.
Example:
SELECT CONCAT_WS(', ', 'John', 'Kimani') AS full_name;
Output:
John, Kimani
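CONCAT_WS also skips NULL arguments, so a missing value does not leave a doubled delimiter behind. A minimal sketch, assuming PostgreSQL or MySQL:

```sql
SELECT CONCAT_WS(', ', 'John', NULL, 'Kimani') AS full_name;  -- 'John, Kimani'
```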
UPPER AND LOWER
Converts text to upper and lower case respectively.
Syntax:
SELECT
UPPER(column), LOWER(column)
FROM table_name;
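One typical use is to normalize case before grouping or comparing, so that differently capitalized spellings count as the same value. The sketch below builds a small inline data set (the city values and the alias t are invented for illustration) and should run on PostgreSQL or MySQL.

```sql
SELECT LOWER(city) AS city_key, COUNT(*) AS row_count
FROM (
    SELECT 'Nairobi' AS city
    UNION ALL SELECT 'NAIROBI'
    UNION ALL SELECT 'nairobi'
    UNION ALL SELECT 'Mombasa'
) AS t
GROUP BY LOWER(city)
ORDER BY city_key;
-- mombasa | 1
-- nairobi | 3
```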
INITCAP
Capitalizes the first letter of each word in a string and converts the remaining letters to lower case.
Useful for standardizing the capitalization of names or titles. Available in PostgreSQL and Oracle.
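A minimal sketch, assuming PostgreSQL and a made-up name value:

```sql
-- Assumes PostgreSQL.
SELECT INITCAP('john KIMANI') AS cleaned_name;  -- 'John Kimani'
```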
REPLACE
Used to find and replace a specific substring within a string with another substring.
Example:
SELECT comment_id, REPLACE(user_comment, 'bad', 'good') AS modified_comment
FROM comments;
Every occurrence of 'bad' in the comment is replaced with 'good'. REPLACE matches substrings, not whole words, so 'badge' would become 'goodge'.
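REPLACE calls can be nested to strip several unwanted characters in one expression. The sketch below removes hyphens and spaces from a made-up phone number; replacing with an empty string deletes the match.

```sql
SELECT REPLACE(REPLACE('0712-345 678', '-', ''), ' ', '') AS digits_only;  -- '0712345678'
```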
CAST
CAST is used to convert one data type to another (e.g., string to date).
SELECT CAST('2023-09-16' AS DATE) AS converted_date;
In PostgreSQL, you can also use the '::' operator for casting.
Example:
SELECT current_date::VARCHAR AS formatted_date;
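CAST is often combined with other cleaning functions, for example to turn a number stored as text with a thousands separator into a real integer. This is a sketch that assumes PostgreSQL; the value and alias are illustrative.

```sql
-- Assumes PostgreSQL; MySQL would use CAST(... AS SIGNED) instead.
SELECT CAST(REPLACE('1,250', ',', '') AS INTEGER) AS amount;  -- 1250
```

Without the REPLACE step, the cast would fail in PostgreSQL because '1,250' is not a valid integer literal.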
TRIM
Removes leading and trailing spaces from a string (that is, from both ends).
Useful for cleaning up stray whitespace.
Syntax:
SELECT
TRIM(column_name)
FROM table_name;
LTRIM: Removes leading whitespace, that is, whitespace at the beginning.
RTRIM: Removes trailing whitespace, that is, whitespace at the end.
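The effect of each variant is easiest to see by comparing lengths. The sketch below uses a literal with two leading spaces and one trailing space. It assumes a database with LENGTH (for example PostgreSQL or MySQL), and the aliases are made up.

```sql
SELECT
    LENGTH('  Nairobi ')        AS raw_length,   -- 10
    LENGTH(LTRIM('  Nairobi ')) AS after_ltrim,  -- 8
    LENGTH(RTRIM('  Nairobi ')) AS after_rtrim,  -- 9
    LENGTH(TRIM('  Nairobi '))  AS after_trim;   -- 7
```

The same comparison, LENGTH(column_name) <> LENGTH(TRIM(column_name)), can be used in a WHERE clause to find rows that still need trimming.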
TO_DATE
Supported in Oracle and PostgreSQL
Converts a text string into a DATE value, using a format pattern that describes how the text is laid out. It's particularly useful when you have date values stored as text and need to work with them as actual dates.
Syntax:
TO_DATE('2023-09-16', 'YYYY-MM-DD')
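The format pattern has to match the layout of the incoming text. As a sketch, assuming PostgreSQL and a made-up day-first date string:

```sql
-- Assumes PostgreSQL with the default ISO date output style.
SELECT TO_DATE('16/09/2023', 'DD/MM/YYYY') AS converted_date;  -- 2023-09-16
```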
COALESCE
The COALESCE function is used to handle NULL values in SQL.
It returns the first non-NULL value from a list of expressions.
Syntax:
SELECT COALESCE(NULL, 'Fallback Value') AS cleaned_value;
'Fallback Value' is returned here because the first argument is NULL.
Example:
SELECT COALESCE(sales_region, 'Unknown') AS cleaned_region
FROM sales_data;
This replaces NULL values in the sales_region column with the string 'Unknown'.
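COALESCE only catches real NULLs, not empty or whitespace-only strings. One possible way to treat those as missing too is to wrap the value in TRIM and the standard NULLIF function, which returns NULL when its two arguments are equal. NULLIF is not covered in the notes above, so this is an optional extension; it assumes PostgreSQL or MySQL, and the literals stand in for a column such as sales_region.

```sql
SELECT
    COALESCE(NULLIF(TRIM('   '), ''), 'Unknown')    AS blank_value,   -- 'Unknown'
    COALESCE(NULLIF(TRIM(' East '), ''), 'Unknown') AS filled_value;  -- 'East'
```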
Further Reading
https://www.postgresql.org/docs/8.1/functions-string.html
https://mode.com/sql-tutorial/sql-string-functions-for-cleaning/