With our dataset in place, we’ll take a quick look at the visualizations you can easily create from a dataset using popular Python libraries, then walk through an example of a visualization.
This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.
If you’d like to check out the other articles in the series, you can find them here:
Sometimes, even after you’ve cleaned up your dataset, you still sometimes need to reshape your Pandas DataFrame to get the most out of the data. Reshaping is the term used when the table structure is manipulated to form different datasets, such as making "wide" data tables "long."
This will feel familiar if you’ve worked with Pivot Tables in Excel or the built-in pivot and crosstab support included in many relational databases.
As an example, the tables above (from the Pandas documentation) have been reshaped by pivoting, stacking or unstacking the table.
pivot method takes a large data set with multiple indexes and summarizes it
stack method takes a table with multiple indexes and groups them
unstack method takes a table with multiple unique columns and ungroups them
At this stage, we’re going to look at a number of methods to reshape the data with Pandas. We’ll see how we can use pivoting and stacking of DataFrames to get a different picture of our data.
Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and install locally.
Pivoting a Pandas DataFrame
With Pandas, we can use the
pivot function to create a new DataFrame from an existing one. At the moment our tables are indexed by purchase ID, but let’s pivot our previously-created
combinedData table into something a little more interesting.
First, let's try the following
pivot method by starting a new code block and adding:
productsByState = combinedData.pivot(index='product_id', columns='company', values='paid')
The result looks like this:
Running this command generates a duplicate index error as
pivot works only on DataFrames with keys that are unique.
But there’s another method that can give us a result that gets around this.
pivot_table works much like pivot, except it aggregates the duplicate values instead of generating an error.
pivot_table method takes a large data set and summarizes it by aggregating duplicates
Let's use this method with the defaults:
productsByState = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid')
You can see the result here:
This produces a DataFrame with a list of products and the mean value for each state across the columns. This is not really that useful, so let's change the aggregation method:
reshapedData = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid', aggfunc=np.sum)
reshapedData = reshapedData.fillna(0)
This now produces a table of products with the sum of all the sales of those products by state. The second line in this method also removes the NaN values and replaces them with a 0, as the assumption is there are no sales in that state for those products.
Grouping Data in a Pandas DataFrame
The other reshaping activity we’ll look at is grouping the data elements together. Let's go back to our original big DataFrame and create a new DataFrame that groups a single customer's transactions together.
groupby method takes a large data set and groups by a columns values
Start a new code block and add:
volumesData = combinedData.groupby(by='customer_id') print(volumesData.head(10))
Here's the result:
It doesn’t really look like that did anything because our DataFrame is indexed on the
Lets add an aggregate function to summarize the data so our grouping works as intended:
volumesData = combinedData.groupby(by='customer_id').sum()
Again, here's the result:
This groups our dataset the way we were expecting, but we seem to be missing some columns and
purchase_id doesn't really make sense, so let’s expand our
groupby method and trim off the
volumesData = combinedData.groupby(by=['customer_id','first_name','last_name','product_id','product']).sum()
Here's our new result:
The final result looks pretty good and gives us a good idea of what the customers are buying, the amounts, and how much they’re paying.
Finally, we’ll make one more
groupby change to our dataset. Add the following to create a totals-by-state DataFrame:
totalsData = combinedData.groupby(by='state').sum().reset_index()
The key change here is we added a
reset_index method after the
sum method. This is to ensure the resulting DataFrame has a usable index for our visualization work.
We took our complete, clean dataset and reshaped it a few different ways to give us some added insights into our data.
Next we’ll look at visualizations and see how they’re an important tool for presenting our data and ensuring the results are clean.
Header image source: http://ohi-science.org/data-science-training/tidyr