65.9K
CodeProject is changing. Read more.
Home

Reshaping Data in a Pandas DataFrame

starIconstarIconstarIconstarIconstarIcon

5.00/5 (12 votes)

May 29, 2020

CPOL

4 min read

viewsIcon

12027

downloadIcon

112

In this sixth part of the Data Cleaning with Python and Pandas series, we look at a few of the simpler methods for combining data.

Introduction

This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.

If you’d like to check out the other articles in the series, you can find them here:

Sometimes, even after you’ve cleaned up your dataset, you still sometimes need to reshape your Pandas DataFrame to get the most out of the data. Reshaping is the term used when the table structure is manipulated to form different datasets, such as making "wide" data tables "long."

This will feel familiar if you’ve worked with Pivot Tables in Excel or the built-in pivot and crosstab support included in many relational databases.

As an example, the tables above (from the Pandas documentation) have been reshaped by pivoting, stacking or unstacking the table.

  • The pivot method takes a large data set with multiple indexes and summarizes it
  • The stack method takes a table with multiple indexes and groups them
  • The unstack method takes a table with multiple unique columns and ungroups them

At this stage, we’re going to look at a number of methods to reshape the data with Pandas. We’ll see how we can use pivoting and stacking of DataFrames to get a different picture of our data.

Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and install locally.

Pivoting a Pandas DataFrame

With Pandas, we can use the pivot function to create a new DataFrame from an existing one. At the moment our tables are indexed by purchase ID, but let’s pivot our previously-created combinedData table into something a little more interesting.

First, let's try the following pivot method by starting a new code block and adding:

productsByState = combinedData.pivot(index='product_id', columns='company', values='paid')

The result looks like this:

Running this command generates a duplicate index error as pivot works only on DataFrames with keys that are unique.

But there’s another method that can give us a result that gets around this. pivot_table works much like pivot, except it aggregates the duplicate values instead of generating an error.

  • The pivot_table method takes a large data set and summarizes it by aggregating duplicates

Let's use this method with the defaults:

productsByState = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid')

You can see the result here:

This produces a DataFrame with a list of products and the mean value for each state across the columns. This is not really that useful, so let's change the aggregation method:

reshapedData = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid', aggfunc=np.sum)
reshapedData = reshapedData.fillna(0)
print(reshapedData.head(10))

This now produces a table of products with the sum of all the sales of those products by state. The second line in this method also removes the NaN values and replaces them with a 0, as the assumption is there are no sales in that state for those products.

Grouping Data in a Pandas DataFrame

The other reshaping activity we’ll look at is grouping the data elements together. Let's go back to our original big DataFrame and create a new DataFrame that groups a single customer's transactions together.

  • The groupby method takes a large data set and groups by a columns values

Start a new code block and add:

volumesData = combinedData.groupby(by='customer_id') print(volumesData.head(10))

Here's the result:

It doesn’t really look like that did anything because our DataFrame is indexed on the purchase_id.

Lets add an aggregate function to summarize the data so our grouping works as intended:

volumesData = combinedData.groupby(by='customer_id').sum()
print(volumesData.head(10))

Again, here's the result:

This groups our dataset the way we were expecting, but we seem to be missing some columns and purchase_id doesn't really make sense, so let’s expand our groupby method and trim off the purchase_id column:

volumesData = combinedData.groupby(by=['customer_id','first_name','last_name','product_id','product']).sum()
volumesData.drop(columns='purchase_id', inplace=True)
print(volumesData.head(10))

Here's our new result:

The final result looks pretty good and gives us a good idea of what the customers are buying, the amounts, and how much they’re paying.

Finally, we’ll make one more groupby change to our dataset. Add the following to create a totals-by-state DataFrame:

totalsData = combinedData.groupby(by='state').sum().reset_index()
totalsData.drop(columns=['purchase_id','customer_id','product_id'], inplace=True)

The key change here is we added a reset_index method after the sum method. This is to ensure the resulting DataFrame has a usable index for our visualization work.

Summary

We took our complete, clean dataset and reshaped it a few different ways to give us some added insights into our data.

Next we’ll look at visualizations and see how they’re an important tool for presenting our data and ensuring the results are clean.

Header image source: http://ohi-science.org/data-science-training/tidyr