Summary: The Amazon Sales Dataset, sourced from Kaggle, provides detailed information on over 1,000 products sold on Amazon’s website. It includes key attributes such as product names, categories, pricing, discounts, ratings, and customer reviews, offering valuable insights for sales trend analysis and consumer behavior studies.
Based on the available data, here are some key questions we could explore with stakeholders:
Pricing Strategy Optimization:
How does the discount percentage correlate with the rating of the product? Are higher discounts leading to better ratings?
What is the average price difference between actual and discounted prices across different categories?
Which products have the smallest difference between actual price and discounted price, and do they have higher sales or ratings?
Customer Behavior Analysis:
What is the distribution of ratings across different categories? Are there categories that consistently receive higher or lower ratings?
How does the number of reviews (rating count) correlate with the product rating?
Which products have the highest review count but the lowest ratings, indicating potential quality issues?
library(tidyverse)
library(dplyr)
library(ggplot2)
library(corrplot)
dim(AmazonSalesData)
## [1] 1465 16
Data has 1465 observations (rows) and 16 variables (columns)
colnames(AmazonSalesData)
## [1] "product_id" "product_name" "category"
## [4] "discounted_price" "actual_price" "discount_percentage"
## [7] "rating" "rating_count" "about_product"
## [10] "user_id" "user_name" "review_id"
## [13] "review_title" "review_content" "img_link"
## [16] "product_link"
sum(is.na(AmazonSalesData))
## [1] 0
Data is complete with no missing values.
glimpse(AmazonSalesData)
## Rows: 1,465
## Columns: 16
## $ product_id <chr> "B07JW9H4J1", "B098NS6PVG", "B096MSW6CT", "B08HDJ8…
## $ product_name <chr> "Wayona Nylon Braided USB to Lightning Fast Chargi…
## $ category <chr> "Computers&Accessories|Accessories&Peripherals|Cab…
## $ discounted_price <chr> "₹399", "₹199", "₹199", "₹329", "₹154", "₹149", "₹…
## $ actual_price <chr> "₹1,099", "₹349", "₹1,899", "₹699", "₹399", "₹1,00…
## $ discount_percentage <chr> "64%", "43%", "90%", "53%", "61%", "85%", "65%", "…
## $ rating <chr> "4.2", "4.0", "3.9", "4.2", "4.2", "3.9", "4.1", "…
## $ rating_count <chr> "24,269", "43,994", "7,928", "94,363", "16,905", "…
## $ about_product <chr> "High Compatibility : Compatible With iPhone 12, 1…
## $ user_id <chr> "AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNL…
## $ user_name <chr> "Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jaspree…
## $ review_id <chr> "R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1KD19…
## $ review_title <chr> "Satisfied,Charging is really fast,Value for money…
## $ review_content <chr> "Looks durable Charging is fine tooNo complains,Ch…
## $ img_link <chr> "https://m.media-amazon.com/images/W/WEBP_402378-T…
## $ product_link <chr> "https://www.amazon.in/Wayona-Braided-WN3LG1-Synci…
All the variables are currently stored as characters, which could lead to issues when performing mathematical calculations later on.
# Remove special characters and convert to numeric
AmazonSalesData$discounted_price <- as.numeric(gsub("[₹,]", "", AmazonSalesData$discounted_price))
AmazonSalesData$actual_price <- as.numeric(gsub("[₹,]", "", AmazonSalesData$actual_price))
AmazonSalesData$discount_percentage <- as.numeric(gsub("%", "", AmazonSalesData$discount_percentage))
AmazonSalesData$rating <- as.numeric(AmazonSalesData$rating)
## Warning: NAs introduced by coercion
AmazonSalesData$rating_count <- as.numeric(gsub(",", "", AmazonSalesData$rating_count))
glimpse(AmazonSalesData)
## Rows: 1,465
## Columns: 16
## $ product_id <chr> "B07JW9H4J1", "B098NS6PVG", "B096MSW6CT", "B08HDJ8…
## $ product_name <chr> "Wayona Nylon Braided USB to Lightning Fast Chargi…
## $ category <chr> "Computers&Accessories|Accessories&Peripherals|Cab…
## $ discounted_price <dbl> 399.00, 199.00, 199.00, 329.00, 154.00, 149.00, 17…
## $ actual_price <dbl> 1099, 349, 1899, 699, 399, 1000, 499, 299, 999, 29…
## $ discount_percentage <dbl> 64, 43, 90, 53, 61, 85, 65, 23, 50, 33, 55, 63, 69…
## $ rating <dbl> 4.2, 4.0, 3.9, 4.2, 4.2, 3.9, 4.1, 4.3, 4.2, 4.0, …
## $ rating_count <dbl> 24269, 43994, 7928, 94363, 16905, 24871, 15188, 30…
## $ about_product <chr> "High Compatibility : Compatible With iPhone 12, 1…
## $ user_id <chr> "AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNL…
## $ user_name <chr> "Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jaspree…
## $ review_id <chr> "R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1KD19…
## $ review_title <chr> "Satisfied,Charging is really fast,Value for money…
## $ review_content <chr> "Looks durable Charging is fine tooNo complains,Ch…
## $ img_link <chr> "https://m.media-amazon.com/images/W/WEBP_402378-T…
## $ product_link <chr> "https://www.amazon.in/Wayona-Braided-WN3LG1-Synci…
We’ve converted some characters into numeric values, but it’s important to remember that the monetary figures are in rupees. We also received a notice indicating that some NAs were introduced during the conversion. Let’s double-check the data and remove those rows, as missing values won’t be useful for our analysis.
sum(is.na(AmazonSalesData))
## [1] 3
3 rows have NAs. We will go ahead and remove them as removing 3 rows won’t hinder much of our analysis.
AmazonSalesDataClean <- na.omit(AmazonSalesData)
sum(is.na(AmazonSalesDataClean))
## [1] 0
Now we have no more missing data and can continue on with our analysis.
duplicates <- AmazonSalesDataClean[duplicated(AmazonSalesDataClean) | duplicated(AmazonSalesDataClean, fromLast = TRUE), ]
print(duplicates)
## [1] product_id product_name category
## [4] discounted_price actual_price discount_percentage
## [7] rating rating_count about_product
## [10] user_id user_name review_id
## [13] review_title review_content img_link
## [16] product_link
## <0 rows> (or 0-length row.names)
We don’t have any duplicates in our dataset.
Starting off with our Pricing Strategy Optimization questions:
correlation <- cor(AmazonSalesDataClean$discount_percentage, AmazonSalesDataClean$rating, use = "complete.obs")
print(correlation)
## [1] -0.155679
The correlation between discount percentage and product rating shows a weak negative linear relationship, with a correlation coefficient of -0.1553754. This indicates that as the discount percentage increases, the product rating tends to decrease slightly, but the relationship is very weak. Therefore, we can infer that discount percentage has very little influence on the rating a product receives.
corrplot(cor(AmazonSalesDataClean[, c("discounted_price", "actual_price", "discount_percentage", "rating", "rating_count")], use = "complete.obs"), method = "color")
ggplot(AmazonSalesDataCategories, aes(x = discount_percentage, y = rating)) +
geom_point(aes(color = Category1)) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(title = "Discount Percentage vs Rating with Single Trend Line")
## `geom_smooth()` using formula = 'y ~ x'
The visualization shows a trend line with a slight downward slope,
indicating the negative correlation identified earlier. However, the
near-horizontal orientation of the line suggests that the relationship
is weak.
Avg_price_category <- AmazonSalesDataClean %>%
group_by(category) %>%
summarize(
Avg_price_actual = mean(actual_price, na.rm = TRUE),
Avg_price_discount = mean(discounted_price, na.rm = TRUE),
Price_difference = Avg_price_actual - Avg_price_discount,
Avg_rating = mean(rating,na.rm = TRUE),
Avg_rating_count = mean(rating_count, na.rm = TRUE)
)
summary(Avg_price_category)
## category Avg_price_actual Avg_price_discount Price_difference
## Length:211 Min. : 99.0 Min. : 77.0 Min. : 0.0
## Class :character 1st Qu.: 804.5 1st Qu.: 381.7 1st Qu.: 362.7
## Mode :character Median : 1660.0 Median : 722.7 Median : 801.0
## Mean : 4139.9 Mean : 2292.3 Mean : 1847.6
## 3rd Qu.: 3461.2 3rd Qu.: 1986.4 3rd Qu.: 1394.9
## Max. :75990.0 Max. :42990.0 Max. :33000.0
## Avg_rating Avg_rating_count
## Min. :3.300 Min. : 224
## 1st Qu.:4.000 1st Qu.: 3123
## Median :4.142 Median : 6509
## Mean :4.130 Mean : 13485
## 3rd Qu.:4.300 3rd Qu.: 14510
## Max. :4.600 Max. :270563
Here is a quick summary of our findings. The average actual price of products ranges from 99 rupees to 75,990 rupees, while the average discounted price ranges from 77 rupees to 42,990 rupees. The price difference between actual and discounted prices has a minimum value of 0, indicating there is at least one product not offered at a discount. On the other hand, the maximum price difference observed is 33,000 rupees. The median actual price is 1,660 rupees, with a median discounted price of 722.7 rupees, leading to a median price difference of 801 rupees. Overall, most products tend to have a discount, but the extent of the discount varies widely.
min_price_diff <- Avg_price_category %>%
filter(Price_difference == min(Price_difference, na.rm = TRUE))
max_price_diff <- Avg_price_category %>%
filter(Price_difference == max(Price_difference, na.rm = TRUE))
min_price_diff
## # A tibble: 11 × 6
## category Avg_price_actual Avg_price_discount Price_difference Avg_rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Electronics|… 549 549 0 4.5
## 2 Electronics|… 4699 4699 0 4.5
## 3 Home&Kitchen… 99 99 0 4.3
## 4 Home&Kitchen… 230 230 0 4.5
## 5 Home&Kitchen… 635 635 0 4.3
## 6 OfficeProduc… 440 440 0 4.5
## 7 OfficeProduc… 535 535 0 4.4
## 8 OfficeProduc… 99 99 0 4.3
## 9 OfficeProduc… 225 225 0 4.1
## 10 OfficeProduc… 150 150 0 4.25
## 11 Toys&Games|A… 150 150 0 4.3
## # ℹ 1 more variable: Avg_rating_count <dbl>
max_price_diff
## # A tibble: 1 × 6
## category Avg_price_actual Avg_price_discount Price_difference Avg_rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Home&Kitchen|… 75990 42990 33000 4.3
## # ℹ 1 more variable: Avg_rating_count <dbl>
summary(min_price_diff)
## category Avg_price_actual Avg_price_discount Price_difference
## Length:11 Min. : 99.0 Min. : 99.0 Min. :0
## Class :character 1st Qu.: 150.0 1st Qu.: 150.0 1st Qu.:0
## Mode :character Median : 230.0 Median : 230.0 Median :0
## Mean : 710.1 Mean : 710.1 Mean :0
## 3rd Qu.: 542.0 3rd Qu.: 542.0 3rd Qu.:0
## Max. :4699.0 Max. :4699.0 Max. :0
## Avg_rating Avg_rating_count
## Min. :4.100 Min. : 224
## 1st Qu.:4.300 1st Qu.: 4318
## Median :4.300 Median : 4798
## Mean :4.359 Mean : 5676
## 3rd Qu.:4.500 3rd Qu.: 6823
## Max. :4.500 Max. :15867
summary(max_price_diff)
## category Avg_price_actual Avg_price_discount Price_difference
## Length:1 Min. :75990 Min. :42990 Min. :33000
## Class :character 1st Qu.:75990 1st Qu.:42990 1st Qu.:33000
## Mode :character Median :75990 Median :42990 Median :33000
## Mean :75990 Mean :42990 Mean :33000
## 3rd Qu.:75990 3rd Qu.:42990 3rd Qu.:33000
## Max. :75990 Max. :42990 Max. :33000
## Avg_rating Avg_rating_count
## Min. :4.3 Min. :3231
## 1st Qu.:4.3 1st Qu.:3231
## Median :4.3 Median :3231
## Mean :4.3 Mean :3231
## 3rd Qu.:4.3 3rd Qu.:3231
## Max. :4.3 Max. :3231
The products with the smallest discounts, totaling 11, were not sold at any discount, resulting in a 0% discount rate. These products span categories such as film, painting materials, and pens. On the other hand, the product with the largest difference between its actual price and discounted price is a Split-System Air Conditioner. Across all products, the average and median rating remains consistent at 4.3, suggesting that discounts did not significantly influence product ratings. This aligns with our earlier finding that the discount percentage has little to no correlation with product ratings.
Although we don’t have exact sales data, the rating count suggests that products not sold at a discount may have had higher sales. This could be due to factors like their generally lower prices; for example, a fountain pen would naturally cost far less than an air conditioner. Therefore, while non-discounted items may have sold more, this does not necessarily indicate that discounts negatively affect sales.
Now let’s get into our Customer behavior questions:
Avg_price_category_ratings <- AmazonSalesDataClean %>%
group_by(category) %>% # Group the data by 'category'
summarize(
Avg_rating = mean(rating,na.rm = TRUE),
Avg_rating_count = mean(rating_count, na.rm = TRUE)
)
ggplot(AmazonSalesDataCategories, aes(x = Category1, y = rating, fill = Category1)) + geom_boxplot() + labs(title = "Rating Distribution Across Categories") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
lowest_ratings <- Avg_price_category_ratings %>%
arrange(Avg_rating) %>%
slice_min(Avg_rating, n = 10, with_ties = FALSE)
highest_ratings <- Avg_price_category_ratings %>%
arrange(Avg_rating) %>%
slice_max(Avg_rating, n = 10, with_ties = FALSE)
# Print the results
lowest_ratings
## # A tibble: 10 × 3
## category Avg_rating Avg_rating_count
## <chr> <dbl> <dbl>
## 1 Home&Kitchen|Kitchen&HomeAppliances|Coffee,Tea&E… 3.3 478
## 2 Computers&Accessories|Accessories&Peripherals|Ke… 3.4 10462.
## 3 Computers&Accessories|Accessories&Peripherals|Au… 3.5 7222
## 4 Electronics|HomeTheater,TV&Video|Accessories|3DG… 3.5 621
## 5 Computers&Accessories|Accessories&Peripherals|Au… 3.6 8886.
## 6 Computers&Accessories|Printers,Inks&Accessories|… 3.6 7918.
## 7 Home&Kitchen|Heating,Cooling&AirQuality|RoomHeat… 3.6 564.
## 8 Home&Kitchen|Kitchen&HomeAppliances|SewingMachin… 3.65 4762.
## 9 Computers&Accessories|Printers,Inks&Accessories|… 3.67 11414.
## 10 Home&Kitchen|Kitchen&HomeAppliances|SmallKitchen… 3.67 3718.
highest_ratings
## # A tibble: 10 × 3
## category Avg_rating Avg_rating_count
## <chr> <dbl> <dbl>
## 1 Computers&Accessories|Tablets 4.6 2886
## 2 Computers&Accessories|Components|Memory 4.5 26194
## 3 Computers&Accessories|NetworkingDevices|NetworkA… 4.5 22420
## 4 Electronics|Cameras&Photography|Accessories|Film 4.5 4875
## 5 Electronics|HomeAudio|MediaStreamingDevices|Stre… 4.5 224
## 6 Electronics|PowerAccessories|SurgeProtectors 4.5 20668
## 7 Home&Kitchen|CraftMaterials|PaintingMaterials 4.5 9427
## 8 Home&Kitchen|Kitchen&HomeAppliances|Coffee,Tea&E… 4.5 1065
## 9 Home&Kitchen|Kitchen&HomeAppliances|SmallKitchen… 4.5 2280
## 10 HomeImprovement|Electrical|CordManagement 4.5 5985
summary(lowest_ratings)
## category Avg_rating Avg_rating_count
## Length:10 Min. :3.300 Min. : 478
## Class :character 1st Qu.:3.500 1st Qu.: 1395
## Mode :character Median :3.600 Median : 5992
## Mean :3.549 Mean : 5605
## 3rd Qu.:3.638 3rd Qu.: 8644
## Max. :3.671 Max. :11414
summary(highest_ratings)
## category Avg_rating Avg_rating_count
## Length:10 Min. :4.50 Min. : 224
## Class :character 1st Qu.:4.50 1st Qu.: 2432
## Mode :character Median :4.50 Median : 5430
## Mean :4.51 Mean : 9602
## 3rd Qu.:4.50 3rd Qu.:17858
## Max. :4.60 Max. :26194
Analyzing the top 510 products with both the lowest and highest ratings reveals an interesting trend: while categories like Computers & Accessories and Home & Kitchen are present among the lowest-rated products, they also dominate the top 5 highest-rated products. This suggests a wide variability in customer satisfaction within these categories, potentially indicating inconsistent product quality or varying consumer expectations.
ggplot(top_n(AmazonSalesDataCategories, 10, rating), aes(x = reorder(Category1, -rating), y = rating)) + geom_bar(stat = "identity") + coord_flip() + labs(title = "Top 10 Most Reviewed Products by Categories")
ggplot(slice_min(AmazonSalesDataCategories, rating_count, n = 10),
aes(x = reorder(Category1, rating), y = rating)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Bottom 10 Least Reviewed Products by Categories")
correlation_ratings <- cor(Avg_price_category_ratings$Avg_rating_count, Avg_price_category_ratings$Avg_rating, use = "complete.obs")
print(correlation_ratings)
## [1] 0.06565764
A correlation of 0.06565764 between Avg_rating_count
(the average number of ratings) and Avg_rating (the average
rating) suggests a very weak positive relationship between these two
variables. This means that as the number of ratings slightly increases,
the average rating tends to increase marginally, but the relationship is
so weak that it might not be meaningful. In practical terms, this small
correlation indicates that the number of people who rate a product has
very little effect on the actual rating a product receives, implying
that other factors are likely influencing the average rating.
# Set a threshold for high review count, e.g., top 10% of review counts
high_review_threshold <- quantile(Avg_price_category_ratings$Avg_rating_count, 0.90, na.rm = TRUE)
# Filter for products with high review counts and low ratings
potential_quality_issues <- Avg_price_category_ratings %>%
filter(Avg_rating_count >= high_review_threshold) %>%
arrange(Avg_rating) %>% # Sort by rating (low to high)
slice_head(n = 5) # Select the top 5 products with lowest ratings
print(potential_quality_issues)
## # A tibble: 5 × 3
## category Avg_rating Avg_rating_count
## <chr> <dbl> <dbl>
## 1 Electronics|Cameras&Photography|Accessories|Batte… 3.8 40895
## 2 Electronics|Headphones,Earbuds&Accessories|Headph… 3.90 80864.
## 3 MusicalInstruments|Microphones|Condenser 3.9 44441
## 4 Electronics|Mobiles&Accessories|Smartphones&Basic… 3.91 61140.
## 5 Electronics|Headphones,Earbuds&Accessories|Headph… 4 34120.
The following categories exhibit very high average rating counts but relatively low average ratings, suggesting potential quality concerns.
The correlation between discount percentage and product rating is weakly negative (-0.155), indicating that higher discounts do not necessarily lead to better product ratings. Therefore, discounting alone may not improve customer satisfaction or perception of the product.
There is a wide range of price differences between actual and discounted prices across categories. Products with smaller price differences (or no discount) still maintain relatively high ratings, suggesting that product quality or perceived value may matter more than the size of the discount.
Categories like Home & Kitchen and Office Products have items with zero price difference (no discount), yet maintain high ratings. This may indicate that these products are perceived as good value for money even without discounts.
A product in the Home & Kitchen category has the largest price difference, with a 33,000 rupee discount, yet maintains an average rating of 4.3. This suggests that large discounts can still maintain high ratings when the perceived value remains high.
The average ratings across categories are fairly consistent, with a mean rating of 4.13 out of 5, indicating overall customer satisfaction is relatively high regardless of category.
Since there is no strong relationship between higher discounts and better ratings, it is essential to focus on improving product quality and features. Marketing efforts should emphasize the value and quality of the product rather than relying solely on discounts to boost sales.
Conduct a detailed sentiment analysis of reviews to identify recurring themes for low-rated products. Use this feedback to address product quality issues, which may lead to improved ratings over time.
Categories like Home & Kitchen and Office Products perform well even without large discounts. These categories could focus on bundling offers or highlighting value propositions rather than deep discounts, while categories with large price differences should be monitored for sales performance to assess the effectiveness of the discount strategy.
Products in the Electronics and Home & Kitchen categories that offer significant discounts may benefit from targeted marketing efforts to boost sales while maintaining their high ratings.
Regularly review products with low sales or low ratings and high review counts. These could indicate potential quality issues or customer dissatisfaction. Consider refining or removing underperforming products from the catalog.