3 Tips to Audit your Data with Pandas

Igor Comune
2 min read · Sep 19, 2023


In this post, I’ll tell you why and how you should use these 3 basic tips to audit your data with Python and Pandas.

Source: pxhere.com/en/photo/1640402

Index

  1. Normalize your Data
  2. Counting Null/NaN Values
  3. Display only the COLUMNS and ROWS with Null values

Extra tip — Grouping and Counting Null Values

1 Normalize your Data

Why

Sometimes a value looks like a null but technically is not. For example, in some databases null values are stored as blank strings. So the first thing you SHOULD know is the peculiarities of your data, so you can treat them accordingly.

How

# replaces the string 'blank' with np.nan
import numpy as np

df[column_name] = df[column_name].replace('blank', np.nan)
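
If your data contains more than one placeholder for missing values (empty strings, whitespace, ‘N/A’ and so on), the sketch below normalizes all of them at once. It assumes a generic DataFrame named df and a hypothetical list of placeholders, so adjust both to your own data.

import numpy as np

# hypothetical placeholders that should be treated as missing values
placeholders = ['', ' ', 'blank', 'N/A', 'n/a']

# replace every placeholder, in every column, with np.nan
df = df.replace(placeholders, np.nan)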

2 Counting Null/NaN values

Why

That’s the core of our problem: counting null values.

How

# Counting the number of null by COLUMN
df.isnull().sum()

# Counting the number of null by ROW
df.isnull().sum(axis=1)
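
To make the output easier to read, a small variation (just a sketch, again assuming a generic DataFrame df) sorts the columns by their number of nulls and adds the percentage of missing values per column:

# number of nulls per column, from most to fewest
null_counts = df.isnull().sum().sort_values(ascending=False)

# percentage of nulls per column, rounded to two decimals
null_pct = (df.isnull().mean() * 100).round(2)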

3 Display only the COLUMNS and ROWS with null values

Why

Sometimes you will need to upload the result to an auditing database or send it as a table to a manager, so you need to filter the data, either by row or by column.

How

#################### FOR COLUMNS ####################
# Checking if there is ANY null value in each column, return the columns
# with null values in it
null_columns = df.columns[df.isnull().any()]
df_with_nulls = df[null_columns] # Display only the columns with null values


#################### FOR ROWS ######################
# Returns all the rows that contain at least one null value
rows_with_nan = df[df.isna().any(axis=1)]
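
You can also combine both filters and export the result, for example as a CSV file that can be uploaded to an auditing database or shared with a manager. This is a sketch with a hypothetical output file name:

# rows with nulls, restricted to the columns that contain nulls
audit_view = df.loc[df.isna().any(axis=1), df.columns[df.isna().any()]]

# hypothetical output file for the auditing database or report
audit_view.to_csv('null_audit.csv', index=False)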

If you’ve read this far, I’ve probably helped you somehow. If I did, please consider following me on my social networks.

Extra Tip — Grouping and Counting Null Values

Why

Most of the time, managers want a straightforward answer: exactly how many empty values there are in their database. To achieve this, you’ll need to group the data and count the number of null values.

How

The example below counts the number of null values in the salary column, grouped by sector.

df.groupby('sector')['salary'].apply(lambda x: x.isnull().sum())
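
Here is a self-contained sketch of the same idea, using a made-up DataFrame (the sector and salary values are purely illustrative):

import numpy as np
import pandas as pd

# illustrative data: salaries with some missing values per sector
df = pd.DataFrame({
    'sector': ['IT', 'IT', 'HR', 'HR', 'Sales'],
    'salary': [5000, np.nan, 4000, np.nan, 4500],
})

# number of missing salaries in each sector
print(df.groupby('sector')['salary'].apply(lambda x: x.isnull().sum()))
# HR       1
# IT       1
# Sales    0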

Final Thoughts

You will probably need a lot of for loops and if/try/except blocks to process the data, and to append and concatenate the results, before you conclude the auditing phase. If you want to know more about it, just let me know.

If it was helpful somehow, please… share it, like it, and comment on it ❤

Igor Comune | LinkedIn

