In the world of information management the ability to quickly pass and interpret information is pivotal in achieving precise and on time outcomes. In what can literally be a 'sea' of data, there may be icebergs lurking that run the risk of derailing an entire project. Why not use python to explore the data and identify such hazards long before you experience the pain of downstream errors that are almost sure to happen without thorough vetting?
In this post I am going to breakdown a basic data exploration workflow and provide some easy ways you can quickly integrate python into your own processes. If at any point in reading this post you have any burning questions or recommendations, please feel free to leave a comment! I would be happy to take on feedback or assist you in your python pursuits.
Basic python workflow
The common workflow I use to analyze data happens in three phases.
- Load the Data
- Manupulate / Work with the Data
- Save the Data
There is nothing profound about this workflow and is likely something you're very familiar with if you work with other analytical tools. The snippets below are presented based on the workflow stage and intended to demonstrate the ease in which you can integrate python to do efficient and meaningful work!
Package dependencies required
import numpy import pandas import csv
Phase 1) Load the Data
One of the best features of python is the ease of loading files into pandas dataframes. To do this simply copy these functions, along with the package dependencies noted above, into your .py file and save. At this point you can drop any file you'd like to examine into the same directory and explore the data to your hearts content.
One of the important things to note with these functions are the property settings 'dtype=False' and 'error_bad_lines=False.'
The 'dtype' parameter does as the name implies, it sets the data type of incoming columns. If you do not set this property, python will attempt to interpret data types for you. Although this can be convenient in some cases, it is less so in most. From experience, I've found the surest way to understand your data is to examine it in a string form first. This allows you to set the data types you would like to use in your analysis, instead of python making a 'mistake' in the outset that causes problems downstream.
The 'error_bad_lines' property tells python if it is to stop loading data if it encounters some corrupted data in the file. This often occurs if you are interpreting string data encoded differently between platforms (like UTF-8 v Windows-1252). By setting this property to 'False' you're telling python to continue loading the data regardless of observed errors. This is particularly advantageous if you're ultimate goal is to simply get a glace at the data.
#import a pipe delimited file def loadDATfile(datFile): df=pd.DataFrame(pd.read_csv('example_csv_file.csv',header=0,delimiter='|',error_bad_lines=False,dtype=str)) return df #import a tab delimited file def loadTabFile(tabFile): df=pd.DataFrame(pd.read_csv('example_csv_file.txt',header=1,delimiter='\t',error_bad_lines=False,dtype=str,)) return df #import a comma seperated file def loadCSVfile(csvFile): df=pd.DataFrame(pd.read_csv('example_csv_file.csv',header=0,delimiter=',',error_bad_lines=False,dtype=str)) return df #Load your dataset (csv example) into a dataframe df = loadCSVfile('example_feed_file.csv') #load an Excel File -- note the variable headerLoc denotes the row that should be used for your dataframe labels and the variable sheetName denotes the excel sheet you want to load into a dataframe def loadExcelfile(excelFile,headerLoc,sheetName): df=pd.DataFrame(pd.read_excel(os.path.join(curDir,feedFileLocation+excelFile),header=headerLoc,dtype=str,sheet_name=sheetName)) return df
Phase 2) Explore your Data
Now for the fun part. Actually examining the information you've now conviniently placed at your fingertips. As you could guess, the means of accomplishing this is quite simple. Please keep in mind the functions and tasks described in this post only scratch the surface of pythons analytical capabilities. If you want to learn more about the plethora of options available the pandas package webpage is a great start.
Prior to getting into the code I want to call out one dataframe method that is useful for checking and validating your data. It is the 'value_counts()' method (included as item #6 in the code snippet). In short, this function returns the unique values and respective counts for a column. Although there is nothing particularly 'fancy' about this method, I've used it numerious times to detect the proverbial 'data icebergs.' If you find yourself in a situation where you're doing a data integration and a stakeholders told you the data is a certain way, use this function to be an early detection/validation hero!
#1) Examine dataframe parameters row_length = len(df) column_length = len(df.columns.tolist()) #2) Examing the first 20 or last 20 rows df_first_20_rows = df.head(20) df_last_20_rows = df.tail(20) #3) Check the dtypes of your data columns df.info() #4) Get all information about your dataset df.describe() #5) Add another column to your dataset (why not?) #Creates another column called 'Append Column' with the value 'TEST VALUE' populated in each cell. This functionality is particularly helpful when doing iterative anonymous functions (lambda functions) I will talk about those in a future post! df.assign('Append Column'='TEST VALUE') #6) Evaluate the some stats by column df['Example Column'].value_counts() #what are the unique values in a column and the counts of those values? df['Example Column'].isnull().value_counts() #Is a cell null (TRUE/FALSE) and how many cases of null values (TRUE) are present?
Phase 3) Save your Data
So you've done the work and now it is time to save the data for use elsewhere or future reference. To accomplish this task load the function csvFileSave. This function first forces the data you'd like to save into a dataframe, then, using the pandas 'to_csv' method saves the file. By defult the dataframe will save in your current working directory. As you work with the 'to_csv' method be mindful of the 'sep' property. By default the function is set to use commas as the primary seperator. However, should you choose something else (like a pipe) simply replace ',' with '|'.
# Define the function def csvFileSave(arrayName,fileName): table=DataFrame(arrayName) table.to_csv(fileName,index=False,sep=',') #save your dataframe csvFileSave(df,'Example_File_Save.csv') #Congragulations, you've now completed the load, manipulate, and save workflow using python!
If you would like to have a consolidated means of using these functions, feel free to clone the code located in my github repo here. Within this repository you will find code from all posts on this site and a folder named 'data_exploration' that provides an additional load, manipulate, and save workflow I commonly use to work with data.
As always, thanks for reading and happy data exploration!