How to Prepare Data for Analysis: A Practical Guide

June 20, 2026

how to prepare data for analysis

data preparation

data cleaning

data analysis

business intelligence

Share this post:

You export a month of orders, customer records, and cart activity from your Shopify store. You expect answers. Which traffic source brings buyers who convert? Which products get added to carts but dropped before checkout? Which repeat customers are worth a follow-up email?

Then you open the CSV.

Some dates are formatted one way, others another. Revenue fields look like text. Product names don't match. A few rows are blank in places that matter. There are duplicate order records you can't explain. What looked like a quick analysis job turns into cleanup work.

That's normal. It's also why learning how to prepare data for analysis matters so much for e-commerce. Before you build a pivot table, chart customer behavior, or compare channels, you need a dataset you can trust. Good analysis starts long before the first dashboard. It starts when raw exports become structured, consistent, documented data.

From Messy Spreadsheet to Clear Insights

Monday morning. You export orders, customer records, and cart events because you want one clear answer, such as which paid campaign brings customers who buy twice instead of once. An hour later, you are staring at three files that should fit together but do not.

That gap between export and answer is the effective start of analysis.

In e-commerce, data prep is less about tidying a spreadsheet for its own sake and more about protecting the decision you plan to make from it. If product names differ across files, cart activity will not line up with orders. If customer IDs are missing or inconsistent, repeat purchase analysis falls apart. If timestamps are stored in different formats or time zones, daily sales trends and checkout drop-off patterns can look wrong.

The fix is a simple pipeline. Pull the right exports, line up the fields that belong together, clean only the columns tied to your question, and keep a record of what you changed. That approach works whether you are in Excel, Google Sheets, or pandas, and it keeps the project manageable for a store owner doing a first serious cleanup.

A small example makes the point. If the goal is to understand cart abandonment, focus on session ID, customer ID, product SKU, event time, checkout status, and source or campaign. In Excel or Sheets, that might mean filtering blank session IDs, trimming extra spaces from SKUs, and checking whether event times sort correctly. In pandas, it often starts with a few lines such as df['sku'] = df['sku'].str.strip() and pd.to_datetime(df['event_time']) so the same cart can be traced from add-to-cart to checkout.

Use the business question to set the boundary. Merchants often lose time cleaning every column in the export, including notes, tags, or internal fields that will never be analyzed. If the question is customer lifetime value, spend your effort on customer identifiers, order dates, refunds, and net sales. If the question is channel performance, align source, medium, campaign, and order value first. The right scope keeps the work tied to business metrics that actually guide decisions.

Stores that pull data from Shopify, ad platforms, email tools, and behavior tracking run into another problem. The files were created by different systems with different naming rules. That is why even a spreadsheet-based workflow benefits from understanding how data is extracted, transformed, and loaded across tools. Merchants dealing with disconnected exports can get useful context from these modern ETL strategies, especially when order data and cart activity do not line up cleanly.

Practical rule: Clean the fields that affect the decision in front of you first, then expand only if the analysis requires it.

First Look Collecting and Profiling Your Data

A frequent initial error is editing the only copy available. Don't do that. Keep your raw export untouched and work from a duplicate.

That isn't just cautious housekeeping. A reliable workflow should preserve an untouched raw copy, run an initial frequency scan for impossible values, and inspect suspicious cases before any inferential analysis, because outlier handling should be decided before hypothesis testing to avoid post-hoc bias, as noted in this raw-data preparation guide.

Start with a copy and a simple inventory

Open your copied file in Excel, Google Sheets, or pandas and answer basic questions first:

How many rows are there
How many columns are there
What does each column appear to represent
Which columns should be IDs, dates, numbers, or categories
Which fields have obvious blanks or strange values

In Excel or Sheets, start with COUNT, COUNTA, filters, and sorting. In pandas, df.shape, df.info(), and df.describe(include='all') give you a fast health check.

Small examples help:

Excel or Sheets row count: =COUNTA(A:A)
Excel or Sheets blanks in email column: =COUNTBLANK(C:C)
pandas shape: df.shape
pandas missing values: df.isna().sum()

You're not fixing anything yet. You're building a punch list.

Look for patterns, not isolated errors

A blank phone number might not matter. A blank traffic source on a large chunk of orders probably does. A single strange order total may be a typo. A whole block of strange totals may point to a formatting import issue.

This is also where business context matters. If you sell both direct-to-consumer and wholesale, a missing company field may be normal in one segment and a problem in another. If your store captures on-site behavior or cart events, exports from tools such as Cart Whisper | Live View Pro can add useful session-level detail, but event data often needs extra profiling because one shopper can generate many records over time.

Before asking what the data says, ask what each row actually represents. One row can mean an order, a customer, a cart event, or a page view. Mixing those levels breaks analysis fast.

Many merchants pair spreadsheet profiling with direct customer research. If behavioral data tells you where shoppers drop off but not why, Formbricks' user research guide is a practical companion for choosing interviews, surveys, or feedback methods that fill that gap.

For KPI cleanup, it also helps to make sure everyone means the same thing by “conversion,” “average order value,” or “returning customer.” A simple reference like this guide to business metrics definitions can prevent analysis errors that come from inconsistent naming rather than bad rows.

The Big Cleanup Handling Messy Data

Once you've profiled the file, the main cleanup starts. In most e-commerce datasets, three issues create the biggest problems: missing values, duplicates, and outliers. Handle them in that order only if that matches the risk in your file. Sometimes duplicate records are the first thing to fix because they distort every later check.

Missing values need decisions, not guesswork

A practical workflow for missing data is to first inspect the pattern of missingness with summary tables or visual checks, then choose among deletion, simple imputation, or model-based estimation based on how much data is missing and whether the gaps appear systematic, according to Data to Policy Lab's preparation guide.

For a merchant, that means asking a business question before filling blanks:

Is the missing field essential?
Is the row still useful without it?
Is the blank random, or concentrated in one channel, device type, customer segment, or time period?

If coupon_code is blank, that might just mean no discount was used. If order_total is blank, the row may be unusable for revenue analysis.

Here's the practical playbook:

Delete the row when the missing value is in a critical field and the record can't support your analysis.
Fill with a label like “Unknown” for categorical fields such as city, source, or device when you want to keep the row visible.
Use simple imputation carefully for fields where a placeholder or central value makes analytical sense.

Examples:

Excel or Sheets: Use filters to isolate blanks, then fill selected category cells with Unknown.
pandas: df['source'] = df['source'].fillna('Unknown')

Duplicates are only obvious when you define a unique record

A merchant often says, “I have duplicates,” though the actual issue is that the file contains multiple valid events for the same shopper. If one customer viewed a product three times, that isn't duplication. It's behavior.

Define uniqueness first. Examples:

Orders: usually unique by order ID
Customers: usually unique by customer ID or email, depending on your system
Cart events: often unique only by event timestamp plus cart or session ID plus event type

Then remove true duplicates.

Problem	Excel / Google Sheets Solution	Pandas (Python) Solution
Missing categorical value	Filter blanks, fill with `Unknown` or another clear label	`df['col'] = df['col'].fillna('Unknown')`
Missing numeric value	Review whether to leave blank, delete row, or fill after inspection	`df = df.dropna(subset=['col'])` or `df['col'] = df['col'].fillna(df['col'].median())`
Duplicate rows	Use Remove duplicates on the columns that define a unique record	`df = df.drop_duplicates(subset=['order_id'])`
Inconsistent text labels	Use Find and Replace or helper columns with `TRIM` and `UPPER`	`df['country'] = df['country'].str.strip().str.upper()`
Suspicious extreme values	Sort high to low and manually inspect edge cases	`df.sort_values('order_total', ascending=False)`

Outliers can be errors or valuable signals

DASCA notes that strong preparation includes handling missing data, detecting outliers with tools such as box plots and z scores, and standardizing variables so results are more reliable and comparable in its guide to accurate statistical analysis.

In e-commerce, an extreme value isn't automatically wrong. A very large order might be a wholesale account. A very short time-to-purchase might be a repeat buyer who knew exactly what they wanted. But a negative quantity, impossible timestamp, or clearly malformed total usually needs review.

Sort the highest and lowest values first. Merchants catch more real errors that way than by jumping straight into formulas.

In Excel or Sheets, sort the column and inspect the top and bottom values. In pandas, try:

df.sort_values('order_total', ascending=False).head()

If you're working with quantities, refund-adjusted totals, or event counts, sanity checks are often more useful than fancy methods. Ask whether the value could happen in the store you run.

Standardizing for Consistency and Comparison

A cleaned export can still give the wrong answer if the same idea is stored three different ways. That happens all the time in e-commerce data. Revenue comes through as text in one report, dates are true dates in another, and the same marketing channel appears under five labels. If you group or compare before fixing that, your totals split, your filters miss records, and your trend lines stop being trustworthy.

A professional man working on a desktop computer displaying a spreadsheet with data analysis results.

Fix data types before building summaries

Start with the columns you plan to analyze first. For most merchants, that means order_total, created_at, refund_amount, discount_code, customer_id, and a few category fields such as country, device, or channel.

Type mistakes create very practical reporting problems:

Currency fields stored as text will break sums, averages, and profit calculations
Date fields stored inconsistently will ruin daily or monthly sales trends
Boolean fields like subscribed or first-time buyer need one format, not TRUE, Yes, 1, and Y
IDs should usually stay as text so leading zeros and long values do not get mangled

A quick rule helps here. If you would calculate with it, make it numeric. If you would group by it, make the labels consistent. If it identifies a record, keep it as text.

In Excel or Google Sheets, this usually means stripping symbols, checking for hidden spaces, and applying the correct format. If you need helper columns for cleanup, this guide to adding formulas in Excel is useful for building quick checks without turning the sheet into a mess.

In pandas, the core fixes are simple:

Numeric conversion: pd.to_numeric(df['order_total'], errors='coerce')
Date conversion: pd.to_datetime(df['created_at'], errors='coerce')

Standardize labels so categories group the way you expect

This step matters most when you want to compare performance across channels, regions, products, or customer segments.

Suppose your country field contains US, USA, United States, and U.S.A.. Your spreadsheet treats those as separate values. So will your pivot table. The same issue shows up in acquisition data with labels like facebook, Facebook Ads, Meta, and Paid Social.

Those labels might belong in one bucket. They might not. A merchant reviewing top-line channel performance may combine them. A paid media buyer checking campaign efficiency may keep them separate. The trade-off is between simpler reporting and preserving detail. Choose the level that matches the decision you need to make.

A good standardization rule should be boring and repeatable. For example:

Store country codes in one format, such as US, CA, GB
Pick one device set, such as mobile, desktop, tablet
Decide whether channel names reflect source platforms or reporting groups
Keep a small lookup tab or mapping file so the rule stays consistent next month too

The Analysis Factor recommends a clear file structure, consistent naming, careful cleanup for duplicates and impossible values, and a data dictionary so variables are defined the same way each time. That matters in day-to-day store analysis. If one person groups Instagram under social and another leaves it separate, your month-over-month comparisons stop being reliable.

Standardization is what lets April and May mean the same thing in your report.

Normalize only when the use case calls for it

Normalization is useful when you are scoring customers, building segments, or preparing data for a model that compares variables on very different scales. It is less useful when you are just trying to understand store performance in a dashboard or spreadsheet.

For example, order_total might range from small purchases to large wholesale orders, while discount_rate is a percentage. If you are creating a customer score, putting those fields on a comparable scale can help. If you are preparing a weekly sales report, it often adds complexity without improving the decision.

For most first cleanup projects, merchants get more value from three things: correct data types, consistent labels, and documented rules. Those steps make customer behavior easier to read and sales comparisons easier to trust.

Enriching Your Data with Feature Engineering

Prepared data isn't only clean. It's also more useful than the export you started with. That's where feature engineering comes in. You create new columns from existing data so the file answers real business questions more directly.

A professional woman interacting with a holographic interface to transform and analyze business data sets.

Turn raw timestamps into buying patterns

A created_at column is useful. A day_of_week column is often more useful for merchants. The same goes for month, hour, or is_weekend.

From order or cart activity data, you can derive:

Day of week to spot weekday versus weekend buying patterns
Hour of day to understand when shoppers add to cart
Time to purchase from first session or cart creation to order completion
Days since last purchase to identify at-risk repeat customers

In Excel, these are simple formula jobs. If you need a refresher, this walkthrough on adding formulas in Excel is handy for building helper columns without overcomplicating the sheet.

A few examples:

Excel day of week: =TEXT(A2,"dddd")
Excel days since last purchase: =TODAY()-B2
pandas day of week: df['day_of_week'] = df['purchase_date'].dt.day_name()

Build business-facing fields, not just technical ones

The best engineered features are easy to explain to someone who runs the store.

Good examples:

Customer value tier
First order month
Repeat buyer flag
Abandoned cart flag
Product margin band
Time between first and second order

These fields make analysis easier because they reflect how merchants already think. You don't want to explain ten raw columns every time you open a report. You want one field that captures the pattern you care about.

A customer table with total_orders and total_spend becomes much more usable once you add customer_value_tier. A cart event table becomes easier to analyze when you create a flag that identifies the final action in the session.

Good feature engineering reduces the number of questions you need to ask the raw data every time.

This step is where analysis starts feeling like decision support rather than cleanup. Instead of merely counting transactions, you start seeing shopping behavior.

Finalizing Your Dataset for Analysis

Most first cleanup projects go wrong at the end, not the beginning. The merchant fixes rows, standardizes fields, creates helpful columns, then immediately starts building charts. That skips the part that makes the work trustworthy.

Validate the file before you trust it

Run a final review before exporting anything.

A checklist titled Final Validation of Dataset Readiness featuring five essential steps for data analysis preparation.

Check a few things manually:

Row counts after duplicate removal
Sample records from the top, middle, and bottom of the file
New calculated columns to confirm formulas worked
Data types for every field you plan to use in analysis
Category values after standardization

If you segmented or filled missing values, be especially careful about subgroup bias. HHS ASPE notes that analysts should check whether missing-data rates differ across subgroups and consider separate imputations by subgroup when they do, to avoid introducing bias in its equity-focused quantitative analysis guidance. In e-commerce, that can matter when gaps cluster by region, customer type, acquisition source, or account class.

Document what changed

A clean file without documentation becomes confusing fast, especially a month later.

Create a simple data dictionary in a second tab or separate file:

Column name
What it means
Original source
Any recoding or transformation applied
Any assumptions that affect interpretation

This doesn't need to be academic. It just needs to be clear. “customer_value_tier = grouped from total spend and order count” is enough. The point is that someone else, or future you, can retrace the logic.

The Analysis Factor also recommends documenting changes by date and defining variables clearly so the work remains reproducible. That's not extra bureaucracy. That's what keeps a quick cleanup project from turning into a mystery file nobody trusts.

For many merchants, the final destination is still Excel. If that's your next step, this guide on how to analyze data in Excel is a practical follow-up once the dataset is validated and exported with a clear name.

Your prepared dataset is now ready for pivot tables, customer segmentation, trend charts, and more advanced modeling if you need it. And if your store works with live operational data rather than static exports, remember that preparation is often iterative. The file will need maintenance as inputs change.

If you want cleaner event-level data to work with in the first place, Cart Whisper | Live View Pro gives Shopify merchants visibility into live shopper behavior, cart activity, product views, searches, and exportable cart data that can be analyzed in Excel or Google Sheets. That's useful when you're trying to move beyond order summaries and understand what happened before checkout.