wiping down a mirror as a metaphor for cleaning College Scorecard data ... it helps you see better

Cleaning College Scorecard Data

Share this post on:

Cleaning the College Scorecard data before using it locally to query columns of interest to us in a college search allowed me to use correct datatypes and to fit the data into a Postgres table. In case you haven’t had a chance to see other walkthroughs of my automation process for various demo needs, the full Ansible setup will download data from a source and then load it into a table. Previously, I have used the process for MoMA art and artists data, and for generating a million-row table, and storing these in different YugabyteDB topologies. I leveraged this recently to load College Scorecard data for our child’s college search.

The focus of this post is to show a few tips on cleaning the data.

The file I used to do the cleaning is here if you’d like to follow along: https://github.com/dataindataout/xtest_ansible/blob/main/tasks/dml/datasets/clean-ipeds.py

First, because I was importing the data into a YugabyteDB table, I had to reduce the number of columns. The College Scorecard data is very wide, and YugabyteDB/PostgreSQL only supports up to 1600 columns. I realized that I could get all the information I needed from the IPEDS source within the full data set. The IPEDS data is about 850 fields, and this fits nicely.

I handled this by using the ‘usecols’ parameter when reading the downloaded csv into a dataframe:

df = pd.read_csv(
    "/tmp/Most-Recent-Cohorts-Institution.csv",
    usecols=[
        "UNITID",...

The second thing I needed to clean up was the presence of the string ‘PrivacySuppressed’ as a value in numeric fields. College Scorecard puts this value in fields when a small number of responses would make it too easy to identify data about an individual or small set of individuals. Of course, the database threw errors on the import because a string cannot be loaded into a numeric field. I also wanted to retain the numeric datatype to analyze data using math functions without unwieldy conversions.

To achieve the removal of that string, I used the following functions:

with open("/tmp/Most-Recent-Cohorts-Institution.csv") as f:
    newText = f.read().replace("PrivacySuppressed", "")

with open("/tmp/Most-Recent-Cohorts-Institution.csv", "w") as f:
    f.write(newText)

Before the data is read into the Pandas data frame, the code reads the file, replacing ‘PrivacySuppressed’ with a blank string (you could choose null or some other value as needed), and then overwrites the file.

At the end of this cleaning process, a new ‘cleaned’ version of the file is created, and then used in the next step of my automation to load into the database.

I hope you enjoyed this little deep dive into how I used Python for cleaning the College Scorecard data for analysis.

Author: Valerie Parham-Thompson

View all posts by Valerie Parham-Thompson >

Leave a Reply

Your email address will not be published. Required fields are marked *