Mastering Retail Data Analysis: A Step-by-Step Guide to Understanding Averages Beyond the Mean

Introduction

We often rely on the average to summarize data. But in messy retail datasets, the mean can be misleading. Imagine a store where most customers spend $8–15, yet the average order value shows $20. This happens because a few large orders and returns skew the result. In this step-by-step guide, you'll learn how to properly analyze customer spending using the Online Retail Dataset. We'll move beyond the mean to explore the median and interquartile range (IQR), giving you a robust understanding of real-world data.

Source: www.freecodecamp.org

What You Need

Basic Python knowledge: Variables, functions, and data structures.
Pandas library: Familiarity with loading data and DataFrame operations.
A development environment: Jupyter Notebook, VS Code, or Google Colab.
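The Online Retail Dataset: Download from the UCI Machine Learning Repository (Excel file, 541,909 rows from a UK online retailer, 2010–2011; each row is one line item on an invoice).
Additional Python libraries: openpyxl (needed by pd.read_excel to open .xlsx files), numpy, and matplotlib (optional, for visualization).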

Step-by-Step Guide

Step 1: Load and Inspect the Dataset

First, import Pandas and load the dataset from the Excel file. Use pd.read_excel() with the openpyxl engine. Then inspect the first few rows to understand its structure.

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')
print(df.head())

You'll see columns like InvoiceNo, StockCode, Description, Quantity, UnitPrice, CustomerID, etc. The file has over 500,000 rows, and some columns (notably CustomerID and Description) contain missing values.
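A quick way to see where values are missing is to count nulls per column. The sketch below uses a tiny made-up DataFrame (the rows are illustrative, not taken from the dataset) so it runs without the download; with the real data you would call the same methods on df.

```python
import pandas as pd

# Hypothetical sample rows that mimic the dataset's columns (values are made up).
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "C536379"],
    "Quantity": [6, 8, 2, -1],
    "UnitPrice": [2.55, 3.39, 1.85, 27.50],
    "CustomerID": [17850.0, 17850.0, None, 14527.0],
})

# Count missing values per column; CustomerID is the one to watch here.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of rows with no CustomerID, as a percentage.
missing_share = sample["CustomerID"].isna().mean() * 100
print(f"Rows without CustomerID: {missing_share:.1f}%")
```

Knowing the share up front tells you how much data the cleaning step will discard.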

Step 2: Clean the Data

Real-world data is rarely clean. Remove rows where CustomerID is missing, as we need customer info for analysis. Then create a TotalPrice column by multiplying Quantity and UnitPrice.

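# Keep only rows with a known customer; .copy() makes df an independent frame
df = df.dropna(subset=['CustomerID']).copy()
# Value of each row: units (negative for returns) times price per unit
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
print(df.shape)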

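Notice that Quantity can be negative (returns) and TotalPrice can be very large (bulk purchases). These are the messy elements that will distort a simple average. Also note that each row is a single line item on an invoice, so TotalPrice is the value of that line rather than of a whole basket; this guide uses "order value" loosely for the per-row amount.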
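To make the messy elements concrete, you can flag returns and look at the extremes. This is a minimal sketch on made-up rows; the is_return name is an illustrative choice, not a column in the dataset.

```python
import pandas as pd

# Made-up rows: three ordinary lines, one return, one bulk purchase.
sample = pd.DataFrame({
    "Quantity": [6, 8, -1, 1500, 2],
    "UnitPrice": [2.55, 3.39, 27.50, 1.25, 1.85],
})
sample["TotalPrice"] = sample["Quantity"] * sample["UnitPrice"]

# A negative quantity marks a return (illustrative flag name).
is_return = sample["Quantity"] < 0
n_returns = int(is_return.sum())

# The extremes show how far the tails sit from ordinary rows.
largest = sample["TotalPrice"].max()
smallest = sample["TotalPrice"].min()
print(f"Returns: {n_returns}, smallest: {smallest:.2f}, largest: {largest:.2f}")
```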

Step 3: Calculate the Mean (Average Order Value)

Now compute the mean of TotalPrice. This is the arithmetic average: sum of all values divided by count.

mean_value = df['TotalPrice'].mean()
print(f"Mean Order Value: ${mean_value:.2f}")

You'll likely get something around $20. But is that representative? Let's examine further.
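One caveat: the rows in this dataset are invoice line items, so the mean above is a per-line figure. If you want a value per invoice, one option is to sum TotalPrice by InvoiceNo first. The sketch below uses made-up invoice numbers and prices to show how the two means can differ; whether to net cancellations against their original invoices is a separate decision it does not address.

```python
import pandas as pd

# Made-up line items: invoice A1 has three lines, A2 one, A3 two.
lines = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "A2", "A3", "A3"],
    "TotalPrice": [10.0, 5.0, 15.0, 8.0, 12.0, 4.0],
})

# Mean per line item: total divided by the number of rows.
line_mean = lines["TotalPrice"].mean()

# Mean per invoice: sum the lines of each invoice, then average the sums.
order_totals = lines.groupby("InvoiceNo")["TotalPrice"].sum()
order_mean = order_totals.mean()

print(f"Mean per line: {line_mean:.2f}, mean per invoice: {order_mean:.2f}")
```

Both numbers are valid; they just answer different questions, so it helps to state which one you report.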

Step 4: Identify the Problem – Visualize the Distribution

Plot a histogram of TotalPrice to see the spread. Use matplotlib to get a quick view.

import matplotlib.pyplot as plt
plt.hist(df['TotalPrice'], bins=50, edgecolor='black')
plt.xlabel('Total Price')
plt.ylabel('Frequency')
plt.title('Distribution of Order Values')
plt.show()

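Because the x-axis has to span everything from large negative returns to the biggest bulk purchases, nearly all rows will likely collapse into one or two tall bars near zero. That is the skew made visible: most orders are small ($0–$50), but a few are extremely high (thousands), and negative values appear due to returns. The mean is pulled upward by the large orders, hence the $20 average.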
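When a few extreme values stretch the axis, it can help to plot only a central slice of the data. The helper below is a hypothetical sketch (the function name and the 1st/99th percentile cut-offs are assumptions, not part of the original guide) that trims a Series for display purposes only.

```python
import pandas as pd

def trim_to_percentiles(values: pd.Series,
                        lower_q: float = 0.01,
                        upper_q: float = 0.99) -> pd.Series:
    """Return only the values between two quantiles (inclusive)."""
    low = values.quantile(lower_q)
    high = values.quantile(upper_q)
    return values[(values >= low) & (values <= high)]

# Made-up data: 98 ordinary rows plus one big return and one bulk order.
prices = pd.Series([-5000.0] + [10.0] * 98 + [9000.0])
trimmed = trim_to_percentiles(prices)
print(len(prices), len(trimmed))
```

Passing the trimmed Series to plt.hist in place of the full column should give a readable picture of the bulk of the data. The trimming is for the plot only; keep the full column for the statistics.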

Step 5: Calculate the Median (The Robust Alternative)

The median is the middle value when data is sorted. It resists outliers because it depends only on the central data point(s).

median_value = df['TotalPrice'].median()
print(f"Median Order Value: ${median_value:.2f}")

You'll find the median is much lower – around $10–$15. This aligns better with what you observed earlier: most customers spend between $8 and $15. The median gives a more truthful picture of typical spending.
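A small worked example shows why the median resists outliers. The numbers are made up: seven typical order values, then the same list with one bulk order added.

```python
from statistics import mean, median

# Seven made-up typical order values.
typical = [8, 9, 10, 11, 12, 13, 15]
# The same orders plus a single bulk purchase.
with_bulk = typical + [500]

print(f"Typical:   mean={mean(typical):.2f}, median={median(typical)}")
print(f"With bulk: mean={mean(with_bulk):.2f}, median={median(with_bulk)}")
```

One extra order moves the mean from about 11 to over 72, while the median only shifts from 11 to 11.5.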

Step 6: Explore the Spread with Quartiles and IQR

Averages alone are insufficient. Understand the distribution using quartiles. The first quartile (Q1) is the 25th percentile, the third quartile (Q3) is the 75th percentile, and the interquartile range (IQR = Q3 - Q1) shows the middle 50% spread.

Q1 = df['TotalPrice'].quantile(0.25)
Q2 = df['TotalPrice'].median()
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1
print(f"Q1: ${Q1:.2f}, Median: ${Q2:.2f}, Q3: ${Q3:.2f}, IQR: ${IQR:.2f}")

Now you can say: "50% of orders fall between $X and $Y" – a much richer insight than a single average.
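You can check the "middle 50%" claim directly by counting how many values fall between Q1 and Q3. The sketch below uses the numbers 1 to 101 as stand-in data so the quartiles are easy to verify by hand.

```python
import pandas as pd

# Stand-in data: the values 1.0, 2.0, ..., 101.0.
values = pd.Series(range(1, 102), dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fraction of values inside [Q1, Q3], both ends included.
share_inside = values.between(q1, q3).mean()
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, share inside={share_inside:.1%}")
```

On real retail data the share can come out above 50%, because many rows share the same price and ties at Q1 or Q3 are all counted as inside.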

Step 7: Apply IQR to Understand Outlier Boundaries

Use the IQR to define potential outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Calculate how many transactions fall outside this fence.

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = df[(df['TotalPrice'] < lower_fence) | (df['TotalPrice'] > upper_fence)]
print(f"Number of outliers: {len(outliers)}")

You'll likely find that a minority of rows fall outside the fences (the exact share depends on your cleaning choices), yet they heavily inflate the mean. This confirms that the median and IQR are better for describing the core customer behavior.
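To see how much the flagged rows move the mean, compare it with and without them. This is a sketch on ten made-up values, not a result from the dataset.

```python
import pandas as pd

# Ten made-up order values; the last one is a bulk purchase.
prices = pd.Series([8.0, 9.0, 10.0, 10.0, 11.0, 12.0, 12.0, 13.0, 15.0, 400.0])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows outside the fences, and the rows that remain.
is_outlier = (prices < lower_fence) | (prices > upper_fence)
inliers = prices[~is_outlier]

print(f"Outlier share: {is_outlier.mean():.0%}")
print(f"Mean with outliers: {prices.mean():.2f}")
print(f"Mean without outliers: {inliers.mean():.2f}")
```

Here a single flagged value out of ten lifts the mean from about 11 to 50, which is the same effect the fences reveal in the full data.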

Step 8: Compare and Conclude

Now compare the mean and median. The mean is about $20 and the median about $12; a gap that large signals a skewed distribution. For decision-making, such as setting pricing or inventory, use the median and IQR. The mean is still useful when you care about total revenue, since the mean times the number of orders equals the total, but it does not describe a typical order.

Finally, summarize your findings in a table or short report.
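One way to build such a table is a small DataFrame with one row per metric. The values below are made up, and the skewness row (from pandas' Series.skew) is an addition not used elsewhere in this guide; a positive value indicates a longer right tail.

```python
import pandas as pd

# Made-up order values with one large order.
prices = pd.Series([8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 15.0, 82.0])

mean_value = prices.mean()
median_value = prices.median()

# One row per metric, rounded for reporting.
summary = pd.DataFrame({
    "metric": ["mean", "median", "mean - median", "skewness"],
    "value": [mean_value, median_value,
              mean_value - median_value, prices.skew()],
}).round(2)

print(summary.to_string(index=False))
```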

Tips for Success

Always visualize first. A histogram or box plot reveals skewness and outliers instantly.
Don't ignore negative values. They are real transactions (returns). Consider analyzing returns and purchases separately.
Use quantiles for reporting. Instead of "average order is $20", say "half of our orders are between $8 and $18". This is more informative.
Scale matters. If you remove extreme outliers, recalculate the mean and clearly document your filtering.
Automate checks. Write a small function that outputs mean, median, Q1, Q3, and IQR for any numerical column. This becomes reusable.
Keep the business context. The goal is not just stats, but actionable insights for inventory, marketing, and customer segmentation.
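The "automate checks" tip could look like the function below. It is a minimal sketch: the name describe_robust and the choice to return a Series are assumptions, and the demo values are made up.

```python
import pandas as pd

def describe_robust(df: pd.DataFrame, column: str) -> pd.Series:
    """Return mean, median, Q1, Q3 and IQR for one numeric column."""
    values = df[column].dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return pd.Series({
        "mean": values.mean(),
        "median": values.median(),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    })

# Made-up demo data with one large value.
demo = pd.DataFrame({"TotalPrice": [5.0, 10.0, 15.0, 20.0, 100.0]})
stats = describe_robust(demo, "TotalPrice")
print(stats)
```

With the real data you would call describe_robust(df, 'TotalPrice'), or pass any other numeric column name.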

Conclusion

This step-by-step process showed why the mean can lie in messy retail data. By loading and cleaning the data, calculating the mean, identifying skewness, using the median, and exploring the IQR, you've learned a robust approach to uncover true customer spending patterns. Apply these steps to your own datasets to avoid misleading averages.

