Technical Report: Analysis of Sales Data

Introduction

This report presents an analysis of sales data to uncover trends and insights. The data includes information on orders, products, quantities, prices, order dates, purchase addresses, and additional derived columns such as sales, city, and hour.

Data Preparation

The dataset, "Sales Data.csv," was loaded into a Pandas DataFrame. The first few rows of the dataset are shown below:

Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	Month	Sales	City	Hour
295665	Macbook Pro Laptop	1	1700.00	2019-12-30 00:01:00	136 Church St, New York City, NY	12	1700.00	New York City	0
295666	LG Washing Machine	1	600.00	2019-12-29 07:03:00	562 2nd St, New York City, NY	12	600.00	New York City	7
295667	USB-C Charging Cable	1	11.95	2019-12-12 18:21:00	277 Main St, New York City, NY	12	11.95	New York City	18
295668	27in FHD Monitor	1	149.99	2019-12-22 15:13:00	410 6th St, San Francisco, CA	12	149.99	San Francisco	15
295669	USB-C Charging Cable	1	11.95	2019-12-18 12:38:00	43 Hill St, Atlanta, GA	12	11.95	Atlanta	12

Data preprocessing was done in Python and there were no missing values, no duplicates were found. Our data had 185950 rows and 11 columns. These are the columns in the dataset :' Unnamed: 0', 'Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date', 'Purchase Address', 'Month', 'Sales', 'City', 'Hour'

# find duplicates
df.duplicated().sum()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185950 entries, 0 to 185949
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        185950 non-null  int64  
 1   Order ID          185950 non-null  int64  
 2   Product           185950 non-null  object 
 3   Quantity Ordered  185950 non-null  int64  
 4   Price Each        185950 non-null  float64
 5   Order Date        185950 non-null  object 
 6   Purchase Address  185950 non-null  object 
 7   Month             185950 non-null  int64  
 8   Sales             185950 non-null  float64
 9   City              185950 non-null  object 
 10  Hour              185950 non-null  int64  
dtypes: float64(2), int64(5), object(4)

Data Type Conversion

Several columns required data type conversions to facilitate accurate analysis:

Order Date Conversion: The Order Date column was converted from string format to datetime format to enable time-based analysis.

Data Analysis

Correlation Analysis

A correlation analysis was conducted to understand the relationships between numerical variables in the dataset. The correlation matrix was visualized using a heatmap to easily identify strong relationships.

The code used for generating the heatmap is shown below:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Selecting numerical columns
df1 = df.select_dtypes(include=['int64', 'float64'])
corr = df1.corr()

# Plotting the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

Conclusion

In summary, this analysis of sales data has provided valuable insights into the patterns and relationships within the dataset. The key findings include:

Strong Correlations: The correlation analysis revealed strong relationships between certain numerical variables. For example, a high correlation between Quantity Ordered and Sales indicates that an increase in the number of items ordered is directly associated with an increase in sales revenue.
No Missing Values or Duplicates: The data cleaning process confirmed that there were no missing values or duplicate entries in the dataset. This indicates that the data is complete and ready for analysis without the need for additional cleaning steps.
Data Type Conversion: Converting the Order Date from a string to a datetime format enabled time-based analysis, facilitating deeper insights into sales trends over different periods.

This is a task for HNG internship you can learn more about it from

https://hng.tech/internship

https://hng.tech/hire