Cleaning College Scorecard Data

Tips on cleaning the College Scorecard data

Cleaning College Scorecard Data

Cleaning the College Scorecard data before using it locally to query columns of interest to us in a college search allowed me to use correct datatypes and to fit the data into a Postgres table. In case you haven’t had a chance to see other walkthroughs of my automation process for various demo needs, the full Ansible setup will download data from a source and then load it into a table. Previously, I have used the process for MoMA art and artists data, and for generating a million-row table, and storing these in different YugabyteDB topologies. I leveraged this recently to load College Scorecard data for our child’s college search.

The focus of this post is to show a few tips on cleaning the data.

The file I used to do the cleaning is here if you’d like to follow along: https://github.com/dataindataout/xtest_ansible/blob/main/tasks/dml/datasets/clean-ipeds.py

First, because I was importing the data into a YugabyteDB table, I had to reduce the number of columns. The College Scorecard data is very wide, and YugabyteDB/PostgreSQL only supports up to 1600 columns. I realized that I could get all the information I needed from the IPEDS source within the full data set. The IPEDS data is about 850 fields, and this fits nicely.

I handled this by using the ‘usecols’ parameter when reading the downloaded csv into a dataframe:

df = pd.read_csv(
    "/tmp/Most-Recent-Cohorts-Institution.csv",
    usecols=[
        "UNITID",...

The second thing I needed to clean up was the presence of the string ‘PrivacySuppressed’ as a value in numeric fields. College Scorecard puts this value in fields when a small number of responses would make it too easy to identify data about an individual or small set of individuals. Of course, the database threw errors on the import because a string cannot be loaded into a numeric field. I also wanted to retain the numeric datatype to analyze data using math functions without unwieldy conversions.

To achieve the removal of that string, I used the following functions:

with open("/tmp/Most-Recent-Cohorts-Institution.csv") as f:
    newText = f.read().replace("PrivacySuppressed", "")

with open("/tmp/Most-Recent-Cohorts-Institution.csv", "w") as f:
    f.write(newText)

Before the data is read into the Pandas data frame, the code reads the file, replacing ‘PrivacySuppressed’ with a blank string (you could choose null or some other value as needed), and then overwrites the file.

At the end of this cleaning process, a new ‘cleaned’ version of the file is created, and then used in the next step of my automation to load into the database.

I hope you enjoyed this little deep dive into how I used Python for cleaning the College Scorecard data for analysis.