So there you are. The clock is ticking and a deadline is fast approaching to provide a data file for a downstream process. You contemplate delaying, but decide that would be admitting defeat to a problem well in your control. How can this be? Spoiler alert. The power of python.
Balancing the duality of expediency vs. quality delivery is something all professionals have to balance in a fast paced work environment. At some point the pursuit of perfection diminishes the ability to deliver good enough. Fortunately for you, the savvy data enthusiast, there are tools provided by pandas to help you achieve a more perfect and expedient outcome. One of them is crosstabs. You can have it all!
Crosstabs are a powerful and easy to use tool provided by pandas to understand your data in a visual form. I find it particularly useful for conducting timely analysis and quality assurance testing on record distributions.
Within this tutorial I am going to be using techniques based on my earlier post 'Basic data exploration with python' using the 'Adult Data Set' found on the University of California Irvine Machine Learning Repository. The data is a subset of the 1994 census data containing general demographic information of respondents. This kind of data is ideal to better understand crosstab functionality because it contains a mixture of numeric (like average hours per week worked) and categorical (gender or race) variables. To view more detailed information on the data, please reference the link to the data set above. As with previous posts, you can view all the source code of my posts on my git hub site in the python-demos/data_crosstabs repository.
To render the crosstabs in a way that is easier to read and interpret I use the hydrogen package for the atom text editor. Hydrogen allows you to run python commands in-line and display the output within the text editor, similar to jupyter notebook functionality. All crosstab tables in this post are screenshots taken directly from the text editor after running the stated command.
Package Dependendencies Required
import numpy import pandas import csv
Function Dependencies Required
#import a comma seperated file def loadCSVfile(csvFile): df=pd.DataFrame(pd.read_csv('example_csv_file.csv',header=0,delimiter=',',error_bad_lines=False,dtype=str)) return df
Phase 1) Load the Data
analysisDf = loadCSVfile('adult.csv')
Phase 2) Examine the Data
Use the dtype method to evaluate the datatypes of the dataframe columns. You should see they are all objects. This is expected given the dtype=str parameter set in the 'loadCSVfile' function.
#input analysisDf.dtypes #output:
In looking at the columns, it is likely possibility hat the columns age, capital-gain, capital-loss, and hours-per are likely numeric. Examine the columns to confirm they are indeed numeric instead of categorical using the .loc method.
#input analysisDf.loc[50:60,('age','capital-gain','capital-loss','hours-per')] #output
Because you can see a wide variation in numerical variables by row, this indicates the variables are numeric rather than binary or categorical. Therefore, set the 'age', 'capital-gain', 'capital-loss', and 'hours-per' variables to an integer data type. This is important to ensure the crosstab can calculate values using the aggfunc property available in the crosstab method discussed later in this post.
Phase 3) Construct and view your crosstab
Say you wanted to better understand the education distribution by race within the data set. To accomplish this, create a crosstab using education classification as the rows with race classification as the columns. This operation first identifies the number of samples in each education category then calculates the number of samples within that category belonging to a particular race (also known as an inner join between segments).
#input pd.crosstab(analysisDf['education'],columns=analysisDf['race']) #output
With the crosstab we can clearly see how much of the sample belongs to each race and education classification. This is great for referencing raw sample sizes, but does not allow for a very clean comparative analysis.
For example, say we were conducting a study to examine the education level by race to evaluate (if any) disparity. Because the sample size by race will inherently be different, it would be better to compare the percentage of a particular race with a certain level of education. To do this you can utilize the 'normalize' parameter. With 'normalize' you can identify if you want the crosstab to calculate percentage of a particular category present (by column or row). Note that in this example I multipled the dataframe by 100 to return a percentage.
#input pd.crosstab(analysisDf['education'],columns=analysisDf['race'],normalize='columns')*100 #output
Now we can clearly see percentage of each race with different levels of education.
But lets say now that we wanted to understand the average hours per week that each race and education segment works. This would allow us to identify if there is a disparity in the number of hours worked by a paricular education and race segment. To do this use the parameter aggfunc and values. In this example I am going to pass the values 'hours-per' with a aggfunc 'mean'. This will calculate the mean number of hours-per week worked by a particular segment. We can successfully perform a mean calculation because we changed the 'hours-per' variable to an integer data type in an earlier step.
#input pd.crosstab(analysisDf['education'],columns=analysisDf['race'],values=analysisDf['hours-per'],aggfunc='mean') #output
We've now reviewed three fundamental ways that you can use crosstabs to investigate and understand your data. Hopefully you can now put this method to use to understand your data in a fast and elegant way. Please let me know if you have any comments or feedback and thanks for reading!
Until next time,