Data Engineering

College Scorecard Application

Valerie Parham-Thompson

Still working on the College Scorecard dataset. Previously I explored the dataset in a real-world application, talked about how to clean the data, and worked with the data API.

Now I’ve decided I want to put this in a web application so others can use the dataset in the same flexible way that I have been using it. (Reminder that several popular college-search websites exist, but they are limited in the ways you can filter the data. Also they tend to gather personal data for I suppose ad generation.)

College Scorecard API

Valerie Parham-Thompson

After I finished the YugabyteDB universe network mapping example, I started thinking about other things to map. Anything with latitude and longitude will work. College locations from my previous work on the College Scorecard data set were an obvious choice.

Previously, I had exported the data and transformed it to allow for sorting and analysis. That’s still a valid method if you want to play with the pull data set, since the API allows only page size of max 100 at a time. However, with the right filters, that might be enough, and the API is a quicker path to getting the data.

Count Large Partitions in YCQL

Valerie Parham-Thompson

One thing that can really wreck your performance in Cassandra and the similar YugabyteDB YCQL is large partitions due to an imbalanced key. Without the robust nodetool commands of Cassandra, it can be challenging to find these large partitions in YugabyteDB.

dsbulk is a tool used for migrating data, and YugabyteDB has a fork that takes into consideration slight differences from Cassandra. That tool can be leveraged to list the top largest partitions.

Correct Partition Endpoints

Valerie Parham-Thompson

I was recently reviewing a database partitioning definition in YugabyteDB (the postgres “ysql” API), and realized the partition distribution might not be what the developer intended.

What is database partitioning?

Database partitioning is used to divide large tables into smaller tables (partitions). While the data is physically separate, the application can access the data logically as a single table.

This can help performance through a process called partition pruning. The database planner skips partitions that don’t hold the data. For example, if a table is partitioned on months of the year, a query on a single month only has to access the rows in the single partition for that month.

Why You Need a Default Partition

Valerie Parham-Thompson

Postgres and YugabyteDB allow you to define partitions of parent tables. Partitions are useful in at least two ways:

  1. You can take advantage of partition pruning. The database doesn’t need to look at partitions it knows won’t meet the parameters of the query.
  2. You can easily archive data by disconnecting and/or dropping partitions instead of managing expensive delete queries.

Here’s one gotcha I ran into recently. What happens if you insert a row into a partitioned table, but there’s no partition for it? The insert fails with an error – see below for a reproduction of this scenario.

Generate Random Data

Valerie Parham-Thompson

I had to create a 10 million row table for testing recently, and put together a query to generate random data for it.

INSERT INTO my_table
(id,
mydatetime,
string1,
string2)

SELECT
(random() * 70 + 10)::int,
TIMESTAMP '2024-01-01 00:00:00.000000' + interval '1 millisecond' * (random() * 86400 * 1000 * 365),
(array['alligator','bear','cat','dog'])[(random() * 3 + 1)::int],
substr(md5(random()::text), 1, 10)

FROM generate_series(1, 10);

The id field is just a random integer in this example, but you’d probably use an identity column.

Cleaning College Scorecard Data

Valerie Parham-Thompson

Cleaning the College Scorecard data before using it locally to query columns of interest to us in a college search allowed me to use correct datatypes and to fit the data into a Postgres table. In case you haven’t had a chance to see other walkthroughs of my automation process for various demo needs, the full Ansible setup will download data from a source and then load it into a table. Previously, I have used the process for MoMA art and artists data, and for generating a million-row table, and storing these in different YugabyteDB topologies. I leveraged this recently to load College Scorecard data for our child’s college search.

Open Data College Scorecard

Valerie Parham-Thompson

I’ve been pretty quiet recently, I know. My youngest has been going through the college application phase, which has taken a lot of time for both of us. I wouldn’t give up all the lovely college visits and overnights, but I might easily part ways with the paperwork.

I do have something fun to share from the experience. I didn’t like the limitations of common college search websites. In particular, we were looking for a college in a subset of surrounding states. Most college search forms allow you to enter only a region, and the Southeastern region was too broad for her search. I also didn’t like that signing up for the sites subjected you to a lot of marketing.

Timestamps Postgres Migration

Valerie Parham-Thompson

Math… the universal language. Timestamps, not so much.

The way we decide to denote date and time differs across both computer languages and human languages. The format also differs across implementations of SQL. For example, Oracle and Postgres allow very different formats to be entered in the timestamp data type.

Oracle allows a wide variety of punctuation in dates: hyphens, slashes, commas, periods, colons. Postgres supports a more limited list.

Processing Data with Pandas

Valerie Parham-Thompson

I’ve been experimenting with processing data with Pandas this week, specifically historical NOAA weather data, and storing it in a local YugabyteDB cluster. This open data set contains max/min/precipitation for years back to 1750 (not all data points are available for all years or locations). It’s available here: https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html

I leveraged my existing demo framework to provision a local YugabyteDB cluster, and then used Pandas to import data from txt and csv files. The txt lookup files were countries, states, stations, and inventory. The csv files were available in different formats. The code I’ve linked below imports all weather data for a single year.