Summary: The Amazon Sales Dataset, sourced from Kaggle, provides detailed information on over 1,000 products sold on Amazon’s website. It includes key attributes such as product names, categories, pricing, discounts, ratings, and customer reviews, offering valuable insights for sales trend analysis and consumer behavior studies.

Based on the available data, here are some key questions we could explore with stakeholders:

Pricing Strategy Optimization:

How does the discount percentage correlate with the rating of the product? Are higher discounts leading to better ratings?
What is the average price difference between actual and discounted prices across different categories?
Which products have the smallest difference between actual price and discounted price, and do they have higher sales or ratings?

Customer Behavior Analysis:

What is the distribution of ratings across different categories? Are there categories that consistently receive higher or lower ratings?
How does the number of reviews (rating count) correlate with the product rating?
Which products have the highest review count but the lowest ratings, indicating potential quality issues?

Load data into R studio

library(tidyverse)
library(dplyr)
library(ggplot2)
library(corrplot)

How is the data organized

dim(AmazonSalesData)

## [1] 1465   16

Data has 1465 observations (rows) and 16 variables (columns)

colnames(AmazonSalesData)

##  [1] "product_id"          "product_name"        "category"           
##  [4] "discounted_price"    "actual_price"        "discount_percentage"
##  [7] "rating"              "rating_count"        "about_product"      
## [10] "user_id"             "user_name"           "review_id"          
## [13] "review_title"        "review_content"      "img_link"           
## [16] "product_link"

sum(is.na(AmazonSalesData))

## [1] 0

Data is complete with no missing values.

Clean the Data

glimpse(AmazonSalesData)

## Rows: 1,465
## Columns: 16
## $ product_id          <chr> "B07JW9H4J1", "B098NS6PVG", "B096MSW6CT", "B08HDJ8…
## $ product_name        <chr> "Wayona Nylon Braided USB to Lightning Fast Chargi…
## $ category            <chr> "Computers&Accessories|Accessories&Peripherals|Cab…
## $ discounted_price    <chr> "₹399", "₹199", "₹199", "₹329", "₹154", "₹149", "₹…
## $ actual_price        <chr> "₹1,099", "₹349", "₹1,899", "₹699", "₹399", "₹1,00…
## $ discount_percentage <chr> "64%", "43%", "90%", "53%", "61%", "85%", "65%", "…
## $ rating              <chr> "4.2", "4.0", "3.9", "4.2", "4.2", "3.9", "4.1", "…
## $ rating_count        <chr> "24,269", "43,994", "7,928", "94,363", "16,905", "…
## $ about_product       <chr> "High Compatibility : Compatible With iPhone 12, 1…
## $ user_id             <chr> "AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNL…
## $ user_name           <chr> "Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jaspree…
## $ review_id           <chr> "R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1KD19…
## $ review_title        <chr> "Satisfied,Charging is really fast,Value for money…
## $ review_content      <chr> "Looks durable Charging is fine tooNo complains,Ch…
## $ img_link            <chr> "https://m.media-amazon.com/images/W/WEBP_402378-T…
## $ product_link        <chr> "https://www.amazon.in/Wayona-Braided-WN3LG1-Synci…

All the variables are currently stored as characters, which could lead to issues when performing mathematical calculations later on.

# Remove special characters and convert to numeric
AmazonSalesData$discounted_price <- as.numeric(gsub("[₹,]", "", AmazonSalesData$discounted_price))
AmazonSalesData$actual_price <- as.numeric(gsub("[₹,]", "", AmazonSalesData$actual_price))
AmazonSalesData$discount_percentage <- as.numeric(gsub("%", "", AmazonSalesData$discount_percentage))
AmazonSalesData$rating <- as.numeric(AmazonSalesData$rating)

## Warning: NAs introduced by coercion

AmazonSalesData$rating_count <- as.numeric(gsub(",", "", AmazonSalesData$rating_count))

glimpse(AmazonSalesData)

## Rows: 1,465
## Columns: 16
## $ product_id          <chr> "B07JW9H4J1", "B098NS6PVG", "B096MSW6CT", "B08HDJ8…
## $ product_name        <chr> "Wayona Nylon Braided USB to Lightning Fast Chargi…
## $ category            <chr> "Computers&Accessories|Accessories&Peripherals|Cab…
## $ discounted_price    <dbl> 399.00, 199.00, 199.00, 329.00, 154.00, 149.00, 17…
## $ actual_price        <dbl> 1099, 349, 1899, 699, 399, 1000, 499, 299, 999, 29…
## $ discount_percentage <dbl> 64, 43, 90, 53, 61, 85, 65, 23, 50, 33, 55, 63, 69…
## $ rating              <dbl> 4.2, 4.0, 3.9, 4.2, 4.2, 3.9, 4.1, 4.3, 4.2, 4.0, …
## $ rating_count        <dbl> 24269, 43994, 7928, 94363, 16905, 24871, 15188, 30…
## $ about_product       <chr> "High Compatibility : Compatible With iPhone 12, 1…
## $ user_id             <chr> "AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBBSNL…
## $ user_name           <chr> "Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jaspree…
## $ review_id           <chr> "R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1KD19…
## $ review_title        <chr> "Satisfied,Charging is really fast,Value for money…
## $ review_content      <chr> "Looks durable Charging is fine tooNo complains,Ch…
## $ img_link            <chr> "https://m.media-amazon.com/images/W/WEBP_402378-T…
## $ product_link        <chr> "https://www.amazon.in/Wayona-Braided-WN3LG1-Synci…

We’ve converted some characters into numeric values, but it’s important to remember that the monetary figures are in rupees. We also received a notice indicating that some NAs were introduced during the conversion. Let’s double-check the data and remove those rows, as missing values won’t be useful for our analysis.

sum(is.na(AmazonSalesData))

## [1] 3

3 rows have NAs. We will go ahead and remove them as removing 3 rows won’t hinder much of our analysis.

AmazonSalesDataClean <- na.omit(AmazonSalesData)

sum(is.na(AmazonSalesDataClean))

## [1] 0

Now we have no more missing data and can continue on with our analysis.

duplicates <- AmazonSalesDataClean[duplicated(AmazonSalesDataClean) | duplicated(AmazonSalesDataClean, fromLast = TRUE), ]
print(duplicates)

##  [1] product_id          product_name        category           
##  [4] discounted_price    actual_price        discount_percentage
##  [7] rating              rating_count        about_product      
## [10] user_id             user_name           review_id          
## [13] review_title        review_content      img_link           
## [16] product_link       
## <0 rows> (or 0-length row.names)

We don’t have any duplicates in our dataset.

Analysis

Starting off with our Pricing Strategy Optimization questions:

Pricing Strategy Optimization:
- How does the discount percentage correlate with the rating of the product? Are higher discounts leading to better ratings?
- What is the average price difference between actual and discounted prices across different categories?
- Which products have the smallest difference between actual price and discounted price, and do they have higher sales or ratings?

correlation <- cor(AmazonSalesDataClean$discount_percentage, AmazonSalesDataClean$rating, use = "complete.obs")
print(correlation)

## [1] -0.155679

The correlation between discount percentage and product rating shows a weak negative linear relationship, with a correlation coefficient of -0.1553754. This indicates that as the discount percentage increases, the product rating tends to decrease slightly, but the relationship is very weak. Therefore, we can infer that discount percentage has very little influence on the rating a product receives.

corrplot(cor(AmazonSalesDataClean[, c("discounted_price", "actual_price", "discount_percentage", "rating", "rating_count")], use = "complete.obs"), method = "color")

ggplot(AmazonSalesDataCategories, aes(x = discount_percentage, y = rating)) + 
  geom_point(aes(color = Category1)) + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  labs(title = "Discount Percentage vs Rating with Single Trend Line")

## `geom_smooth()` using formula = 'y ~ x'

The visualization shows a trend line with a slight downward slope, indicating the negative correlation identified earlier. However, the near-horizontal orientation of the line suggests that the relationship is weak.

Avg_price_category <- AmazonSalesDataClean %>%
  group_by(category) %>%  
  summarize(
    Avg_price_actual = mean(actual_price, na.rm = TRUE),  
    Avg_price_discount = mean(discounted_price, na.rm = TRUE),
    Price_difference = Avg_price_actual - Avg_price_discount,
    Avg_rating = mean(rating,na.rm = TRUE),
    Avg_rating_count = mean(rating_count, na.rm = TRUE)
  )
summary(Avg_price_category)

##    category         Avg_price_actual  Avg_price_discount Price_difference 
##  Length:211         Min.   :   99.0   Min.   :   77.0    Min.   :    0.0  
##  Class :character   1st Qu.:  804.5   1st Qu.:  381.7    1st Qu.:  362.7  
##  Mode  :character   Median : 1660.0   Median :  722.7    Median :  801.0  
##                     Mean   : 4139.9   Mean   : 2292.3    Mean   : 1847.6  
##                     3rd Qu.: 3461.2   3rd Qu.: 1986.4    3rd Qu.: 1394.9  
##                     Max.   :75990.0   Max.   :42990.0    Max.   :33000.0  
##    Avg_rating    Avg_rating_count
##  Min.   :3.300   Min.   :   224  
##  1st Qu.:4.000   1st Qu.:  3123  
##  Median :4.142   Median :  6509  
##  Mean   :4.130   Mean   : 13485  
##  3rd Qu.:4.300   3rd Qu.: 14510  
##  Max.   :4.600   Max.   :270563

Here is a quick summary of our findings. The average actual price of products ranges from 99 rupees to 75,990 rupees, while the average discounted price ranges from 77 rupees to 42,990 rupees. The price difference between actual and discounted prices has a minimum value of 0, indicating there is at least one product not offered at a discount. On the other hand, the maximum price difference observed is 33,000 rupees. The median actual price is 1,660 rupees, with a median discounted price of 722.7 rupees, leading to a median price difference of 801 rupees. Overall, most products tend to have a discount, but the extent of the discount varies widely.

min_price_diff <- Avg_price_category %>%
  filter(Price_difference == min(Price_difference, na.rm = TRUE))

max_price_diff <- Avg_price_category %>%
  filter(Price_difference == max(Price_difference, na.rm = TRUE))

min_price_diff

## # A tibble: 11 × 6
##    category      Avg_price_actual Avg_price_discount Price_difference Avg_rating
##    <chr>                    <dbl>              <dbl>            <dbl>      <dbl>
##  1 Electronics|…              549                549                0       4.5 
##  2 Electronics|…             4699               4699                0       4.5 
##  3 Home&Kitchen…               99                 99                0       4.3 
##  4 Home&Kitchen…              230                230                0       4.5 
##  5 Home&Kitchen…              635                635                0       4.3 
##  6 OfficeProduc…              440                440                0       4.5 
##  7 OfficeProduc…              535                535                0       4.4 
##  8 OfficeProduc…               99                 99                0       4.3 
##  9 OfficeProduc…              225                225                0       4.1 
## 10 OfficeProduc…              150                150                0       4.25
## 11 Toys&Games|A…              150                150                0       4.3 
## # ℹ 1 more variable: Avg_rating_count <dbl>

max_price_diff

## # A tibble: 1 × 6
##   category       Avg_price_actual Avg_price_discount Price_difference Avg_rating
##   <chr>                     <dbl>              <dbl>            <dbl>      <dbl>
## 1 Home&Kitchen|…            75990              42990            33000        4.3
## # ℹ 1 more variable: Avg_rating_count <dbl>

summary(min_price_diff)

##    category         Avg_price_actual Avg_price_discount Price_difference
##  Length:11          Min.   :  99.0   Min.   :  99.0     Min.   :0       
##  Class :character   1st Qu.: 150.0   1st Qu.: 150.0     1st Qu.:0       
##  Mode  :character   Median : 230.0   Median : 230.0     Median :0       
##                     Mean   : 710.1   Mean   : 710.1     Mean   :0       
##                     3rd Qu.: 542.0   3rd Qu.: 542.0     3rd Qu.:0       
##                     Max.   :4699.0   Max.   :4699.0     Max.   :0       
##    Avg_rating    Avg_rating_count
##  Min.   :4.100   Min.   :  224   
##  1st Qu.:4.300   1st Qu.: 4318   
##  Median :4.300   Median : 4798   
##  Mean   :4.359   Mean   : 5676   
##  3rd Qu.:4.500   3rd Qu.: 6823   
##  Max.   :4.500   Max.   :15867

summary(max_price_diff)

##    category         Avg_price_actual Avg_price_discount Price_difference
##  Length:1           Min.   :75990    Min.   :42990      Min.   :33000   
##  Class :character   1st Qu.:75990    1st Qu.:42990      1st Qu.:33000   
##  Mode  :character   Median :75990    Median :42990      Median :33000   
##                     Mean   :75990    Mean   :42990      Mean   :33000   
##                     3rd Qu.:75990    3rd Qu.:42990      3rd Qu.:33000   
##                     Max.   :75990    Max.   :42990      Max.   :33000   
##    Avg_rating  Avg_rating_count
##  Min.   :4.3   Min.   :3231    
##  1st Qu.:4.3   1st Qu.:3231    
##  Median :4.3   Median :3231    
##  Mean   :4.3   Mean   :3231    
##  3rd Qu.:4.3   3rd Qu.:3231    
##  Max.   :4.3   Max.   :3231

The products with the smallest discounts, totaling 11, were not sold at any discount, resulting in a 0% discount rate. These products span categories such as film, painting materials, and pens. On the other hand, the product with the largest difference between its actual price and discounted price is a Split-System Air Conditioner. Across all products, the average and median rating remains consistent at 4.3, suggesting that discounts did not significantly influence product ratings. This aligns with our earlier finding that the discount percentage has little to no correlation with product ratings.

Although we don’t have exact sales data, the rating count suggests that products not sold at a discount may have had higher sales. This could be due to factors like their generally lower prices; for example, a fountain pen would naturally cost far less than an air conditioner. Therefore, while non-discounted items may have sold more, this does not necessarily indicate that discounts negatively affect sales.

Now let’s get into our Customer behavior questions:

Customer Behavior Analysis:
- What is the distribution of ratings across different categories? Are there categories that consistently receive higher or lower ratings?
- How does the number of reviews (rating count) correlate with the product rating?
- Which products have the highest review count but the lowest ratings, indicating potential quality issues?

Avg_price_category_ratings <- AmazonSalesDataClean %>%
  group_by(category) %>%  # Group the data by 'category'
  summarize(
    Avg_rating = mean(rating,na.rm = TRUE),
    Avg_rating_count = mean(rating_count, na.rm = TRUE)
  )

ggplot(AmazonSalesDataCategories, aes(x = Category1, y = rating, fill = Category1)) + geom_boxplot() + labs(title = "Rating Distribution Across Categories") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

lowest_ratings <- Avg_price_category_ratings %>%
  arrange(Avg_rating) %>%  
  slice_min(Avg_rating, n = 10, with_ties = FALSE) 

highest_ratings <- Avg_price_category_ratings %>%
  arrange(Avg_rating) %>%  
  slice_max(Avg_rating, n = 10, with_ties = FALSE)

# Print the results
lowest_ratings

## # A tibble: 10 × 3
##    category                                          Avg_rating Avg_rating_count
##    <chr>                                                  <dbl>            <dbl>
##  1 Home&Kitchen|Kitchen&HomeAppliances|Coffee,Tea&E…       3.3              478 
##  2 Computers&Accessories|Accessories&Peripherals|Ke…       3.4            10462.
##  3 Computers&Accessories|Accessories&Peripherals|Au…       3.5             7222 
##  4 Electronics|HomeTheater,TV&Video|Accessories|3DG…       3.5              621 
##  5 Computers&Accessories|Accessories&Peripherals|Au…       3.6             8886.
##  6 Computers&Accessories|Printers,Inks&Accessories|…       3.6             7918.
##  7 Home&Kitchen|Heating,Cooling&AirQuality|RoomHeat…       3.6              564.
##  8 Home&Kitchen|Kitchen&HomeAppliances|SewingMachin…       3.65            4762.
##  9 Computers&Accessories|Printers,Inks&Accessories|…       3.67           11414.
## 10 Home&Kitchen|Kitchen&HomeAppliances|SmallKitchen…       3.67            3718.

highest_ratings

## # A tibble: 10 × 3
##    category                                          Avg_rating Avg_rating_count
##    <chr>                                                  <dbl>            <dbl>
##  1 Computers&Accessories|Tablets                            4.6             2886
##  2 Computers&Accessories|Components|Memory                  4.5            26194
##  3 Computers&Accessories|NetworkingDevices|NetworkA…        4.5            22420
##  4 Electronics|Cameras&Photography|Accessories|Film         4.5             4875
##  5 Electronics|HomeAudio|MediaStreamingDevices|Stre…        4.5              224
##  6 Electronics|PowerAccessories|SurgeProtectors             4.5            20668
##  7 Home&Kitchen|CraftMaterials|PaintingMaterials            4.5             9427
##  8 Home&Kitchen|Kitchen&HomeAppliances|Coffee,Tea&E…        4.5             1065
##  9 Home&Kitchen|Kitchen&HomeAppliances|SmallKitchen…        4.5             2280
## 10 HomeImprovement|Electrical|CordManagement                4.5             5985

summary(lowest_ratings)

##    category           Avg_rating    Avg_rating_count
##  Length:10          Min.   :3.300   Min.   :  478   
##  Class :character   1st Qu.:3.500   1st Qu.: 1395   
##  Mode  :character   Median :3.600   Median : 5992   
##                     Mean   :3.549   Mean   : 5605   
##                     3rd Qu.:3.638   3rd Qu.: 8644   
##                     Max.   :3.671   Max.   :11414

summary(highest_ratings)

##    category           Avg_rating   Avg_rating_count
##  Length:10          Min.   :4.50   Min.   :  224   
##  Class :character   1st Qu.:4.50   1st Qu.: 2432   
##  Mode  :character   Median :4.50   Median : 5430   
##                     Mean   :4.51   Mean   : 9602   
##                     3rd Qu.:4.50   3rd Qu.:17858   
##                     Max.   :4.60   Max.   :26194

Analyzing the top 510 products with both the lowest and highest ratings reveals an interesting trend: while categories like Computers & Accessories and Home & Kitchen are present among the lowest-rated products, they also dominate the top 5 highest-rated products. This suggests a wide variability in customer satisfaction within these categories, potentially indicating inconsistent product quality or varying consumer expectations.

ggplot(top_n(AmazonSalesDataCategories, 10, rating), aes(x = reorder(Category1, -rating), y = rating)) + geom_bar(stat = "identity") + coord_flip() + labs(title = "Top 10 Most Reviewed Products by Categories")

ggplot(slice_min(AmazonSalesDataCategories, rating_count, n = 10), 
       aes(x = reorder(Category1, rating), y = rating)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  labs(title = "Bottom 10 Least Reviewed Products by Categories")

correlation_ratings <- cor(Avg_price_category_ratings$Avg_rating_count, Avg_price_category_ratings$Avg_rating, use = "complete.obs")

print(correlation_ratings)

## [1] 0.06565764

A correlation of 0.06565764 between Avg_rating_count (the average number of ratings) and Avg_rating (the average rating) suggests a very weak positive relationship between these two variables. This means that as the number of ratings slightly increases, the average rating tends to increase marginally, but the relationship is so weak that it might not be meaningful. In practical terms, this small correlation indicates that the number of people who rate a product has very little effect on the actual rating a product receives, implying that other factors are likely influencing the average rating.

# Set a threshold for high review count, e.g., top 10% of review counts
high_review_threshold <- quantile(Avg_price_category_ratings$Avg_rating_count, 0.90, na.rm = TRUE)

# Filter for products with high review counts and low ratings
potential_quality_issues <- Avg_price_category_ratings %>%
  filter(Avg_rating_count >= high_review_threshold) %>%
  arrange(Avg_rating) %>%  # Sort by rating (low to high)
  slice_head(n = 5)  # Select the top 5 products with lowest ratings

print(potential_quality_issues)

## # A tibble: 5 × 3
##   category                                           Avg_rating Avg_rating_count
##   <chr>                                                   <dbl>            <dbl>
## 1 Electronics|Cameras&Photography|Accessories|Batte…       3.8            40895 
## 2 Electronics|Headphones,Earbuds&Accessories|Headph…       3.90           80864.
## 3 MusicalInstruments|Microphones|Condenser                 3.9            44441 
## 4 Electronics|Mobiles&Accessories|Smartphones&Basic…       3.91           61140.
## 5 Electronics|Headphones,Earbuds&Accessories|Headph…       4              34120.

The following categories exhibit very high average rating counts but relatively low average ratings, suggesting potential quality concerns.

Conclusions

The correlation between discount percentage and product rating is weakly negative (-0.155), indicating that higher discounts do not necessarily lead to better product ratings. Therefore, discounting alone may not improve customer satisfaction or perception of the product.
There is a wide range of price differences between actual and discounted prices across categories. Products with smaller price differences (or no discount) still maintain relatively high ratings, suggesting that product quality or perceived value may matter more than the size of the discount.
Categories like Home & Kitchen and Office Products have items with zero price difference (no discount), yet maintain high ratings. This may indicate that these products are perceived as good value for money even without discounts.
A product in the Home & Kitchen category has the largest price difference, with a 33,000 rupee discount, yet maintains an average rating of 4.3. This suggests that large discounts can still maintain high ratings when the perceived value remains high.
The average ratings across categories are fairly consistent, with a mean rating of 4.13 out of 5, indicating overall customer satisfaction is relatively high regardless of category.

Recommendations:

Focus on Product Quality Over Discounts:

Since there is no strong relationship between higher discounts and better ratings, it is essential to focus on improving product quality and features. Marketing efforts should emphasize the value and quality of the product rather than relying solely on discounts to boost sales.

Leverage Customer Reviews for Product Improvement:

Conduct a detailed sentiment analysis of reviews to identify recurring themes for low-rated products. Use this feedback to address product quality issues, which may lead to improved ratings over time.

Optimize Discount Strategies Based on Category Performance:

Categories like Home & Kitchen and Office Products perform well even without large discounts. These categories could focus on bundling offers or highlighting value propositions rather than deep discounts, while categories with large price differences should be monitored for sales performance to assess the effectiveness of the discount strategy.

Target Specific Categories for Promotional Efforts:

Products in the Electronics and Home & Kitchen categories that offer significant discounts may benefit from targeted marketing efforts to boost sales while maintaining their high ratings.

Monitor Low-Performing Products:

Regularly review products with low sales or low ratings and high review counts. These could indicate potential quality issues or customer dissatisfaction. Consider refining or removing underperforming products from the catalog.

Amazon Sales Data Analysis

Maritza Jimenez

2024-09-10