Technical Report: Analysis of Sales Data
Introduction
This report presents an analysis of sales data to uncover trends and insights. The data includes information on orders, products, quantities, prices, order dates, purchase addresses, and additional derived columns such as sales, city, and hour.
Data Preparation
The dataset, "Sales Data.csv," was loaded into a Pandas DataFrame. The first few rows of the dataset are shown below:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | Hour |
295665 | Macbook Pro Laptop | 1 | 1700.00 | 2019-12-30 00:01:00 | 136 Church St, New York City, NY | 12 | 1700.00 | New York City | 0 |
295666 | LG Washing Machine | 1 | 600.00 | 2019-12-29 07:03:00 | 562 2nd St, New York City, NY | 12 | 600.00 | New York City | 7 |
295667 | USB-C Charging Cable | 1 | 11.95 | 2019-12-12 18:21:00 | 277 Main St, New York City, NY | 12 | 11.95 | New York City | 18 |
295668 | 27in FHD Monitor | 1 | 149.99 | 2019-12-22 15:13:00 | 410 6th St, San Francisco, CA | 12 | 149.99 | San Francisco | 15 |
295669 | USB-C Charging Cable | 1 | 11.95 | 2019-12-18 12:38:00 | 43 Hill St, Atlanta, GA | 12 | 11.95 | Atlanta | 12 |
Data preprocessing was done in Python and there were no missing values, no duplicates were found. Our data had 185950 rows and 11 columns. These are the columns in the dataset :' Unnamed: 0', 'Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date', 'Purchase Address', 'Month', 'Sales', 'City', 'Hour'
# find duplicates
df.duplicated().sum()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185950 entries, 0 to 185949
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 185950 non-null int64
1 Order ID 185950 non-null int64
2 Product 185950 non-null object
3 Quantity Ordered 185950 non-null int64
4 Price Each 185950 non-null float64
5 Order Date 185950 non-null object
6 Purchase Address 185950 non-null object
7 Month 185950 non-null int64
8 Sales 185950 non-null float64
9 City 185950 non-null object
10 Hour 185950 non-null int64
dtypes: float64(2), int64(5), object(4)
Data Type Conversion
Several columns required data type conversions to facilitate accurate analysis:
- Order Date Conversion: The
Order Date
column was converted from string format to datetime format to enable time-based analysis.
Data Analysis
Correlation Analysis
A correlation analysis was conducted to understand the relationships between numerical variables in the dataset. The correlation matrix was visualized using a heatmap to easily identify strong relationships.
The code used for generating the heatmap is shown below:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Selecting numerical columns
df1 = df.select_dtypes(include=['int64', 'float64'])
corr = df1.corr()
# Plotting the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
Conclusion
In summary, this analysis of sales data has provided valuable insights into the patterns and relationships within the dataset. The key findings include:
Strong Correlations: The correlation analysis revealed strong relationships between certain numerical variables. For example, a high correlation between
Quantity Ordered
andSales
indicates that an increase in the number of items ordered is directly associated with an increase in sales revenue.No Missing Values or Duplicates: The data cleaning process confirmed that there were no missing values or duplicate entries in the dataset. This indicates that the data is complete and ready for analysis without the need for additional cleaning steps.
Data Type Conversion: Converting the
Order Date
from a string to a datetime format enabled time-based analysis, facilitating deeper insights into sales trends over different periods.
This is a task for HNG internship you can learn more about it from