Open Data College Scorecard
I’ve been pretty quiet recently, I know. My youngest has been going through the college application phase, which has taken a lot of time for both of us. I wouldn’t give up all the lovely college visits and overnights, but I might easily part ways with the paperwork.
I do have something fun to share from the experience. I didn’t like the limitations of common college search websites. In particular, we were looking for a college in a subset of surrounding states. Most college search forms allow you to enter only a region, and the Southeastern region was too broad for her search. I also didn’t like that signing up for the sites subjected you to a lot of marketing.
Turns out, the core data used by college search websites is open data from the US Department of Education. See this data for yourself: https://collegescorecard.ed.gov/data/
Because I had a data-processing Ansible framework, I knew I could prepare this in a way that would make it easy to query. In this repo, I am downloading the data, creating a table of just the IPEDS data, and then starting up a local YugabyteDB cluster for querying. I chose to use just the IPEDS subset of the full data set primarily because there were too many fields in the full data set to fit in the column limit of Postgres/YugabyteDB. However, IPEDS has everything you will commonly need to create a college list.
Download the repo https://github.com/dataindataout/xtest_ansible and start the local database:
ansible-playbook -i inventory.yml tasks/college-scorecard.yml
ysqlsh college_scorecard_sql
Then you can run custom queries to your heart’s content.
The data dictionary is available here: https://collegescorecard.ed.gov/data/data-documentation
Here are a few pointers:
- The field ‘control’ allows you to filter out public (1), private nonprofit (2), and private for-profit (3). I’d suggest filtering out private for-profit schools unless you have a very specific vocation in mind.
- The fields starting with ‘CIP’ will allow you to filter on particular majors. In the example below, I’m searching for CIP45BACHL=1, which means that the institution offers a bachelor’s degree in Social Sciences. For most majors, the CIP codes are more specific than that.
- Religious affiliation (the field ‘relaffil’) tells you very little about the religious focus of a school. For example, Duke University has a United Methodist religious affiliation due to how they were founded, but nobody would say that Duke is a Christian college. Many private schools in the US were founded by religious organizations; if you don’t know which ones are Christian colleges or not by reputation, then it could be better to filter in or out based on something like the presence of a bachelor’s degree in theology and religious vocations (CIP39BACHL).
- The C100_4 field helps you see the 4-year graduation rate; important to know if you’ve budgeted for four years.
- The CCSIZSET field is an indicator of both size and percentage of students who live on campus.
Here’s a sample query:
select stabbr, instnm, insturl from ipeds where stabbr in ('NC', 'VA') and control<3 and ccsizset in (7, 8, 10, 11) and CIP45BACHL=1 and CIP39BACHL=0 and PREDDEG=3 and C100_4>.25 order by stabbr, C100_4;
Using the College Scorecard open data like this will at least allow you to narrow down a list of schools to research. Enjoy!