Processing Data with Pandas
I’ve been experimenting with processing data with Pandas this week, specifically historical NOAA weather data, and storing it in a local YugabyteDB cluster. This open data set contains daily max/min temperature and precipitation readings going back to 1750 (not all data points are available for all years or locations). It’s available here: https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html
I leveraged my existing demo framework to provision a local YugabyteDB cluster, then used Pandas to import data from txt and csv files. The txt lookup files cover countries, states, stations, and inventory; the csv weather data is available in several formats. The code linked below imports all weather data for a single year.
As in my other projects, there is a folder named “scenarios” that I will continue to build out. Right now, it holds only the lookup and data population Python scripts. Future scenarios will include queries and a demonstration of YugabyteDB indexes.
One thing to note is that the lookup files were in fixed-width format, so I had to provide column names and widths explicitly, like this for the states file:
# lookup tables don't have headers, so specify those here
headings = [
"code",
"name",
]
# lookup tables are fixed-width files, so specify the widths here
colspecs = [
(0, 2),
(3, 50),
]
I originally hard-coded SQL files for the table definition statements, but found a way to create the tables more programmatically. First, I download the files in the provisioning stage:
- name: Download states file
ansible.builtin.get_url:
url: https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-states.txt
dest: /tmp/states.txt
mode: "0775"
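The other lookup files (countries, stations, inventory) live in the same directory and follow the same ghcnd-<name>.txt naming, so the download tasks could be collapsed into a single looped task. A sketch of that consolidation (the /tmp destinations mirror the states task above, not necessarily what the repo uses):

- name: Download lookup files
  ansible.builtin.get_url:
    url: "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-{{ item }}.txt"
    dest: "/tmp/{{ item }}.txt"
    mode: "0775"
  loop:
    - countries
    - states
    - stations
    - inventory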
Then the script reads in the values using the headings and column widths defined above:
# Read in data using above specifications; data types will be auto-detected
data = pd.read_fwf(path, names=headings, header=None, colspecs=colspecs)
Note the data types are auto-detected, and Pandas does a pretty good job with this.
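One straightforward way to turn those auto-detected dtypes into a table (not necessarily the exact approach in the repo) is to hand the DataFrame to SQLAlchemy; YugabyteDB’s YSQL API is PostgreSQL-compatible, so the standard postgresql driver works. A minimal sketch, with the connection details as assumptions:

import pandas as pd
from sqlalchemy import create_engine

# Connection details are assumptions; YSQL listens on port 5433 by default
# and speaks the PostgreSQL wire protocol (psycopg2 or similar required)
engine = create_engine("postgresql://yugabyte:yugabyte@localhost:5433/yugabyte")

headings = ["code", "name"]
colspecs = [(0, 2), (3, 50)]
data = pd.read_fwf("/tmp/states.txt", names=headings, header=None, colspecs=colspecs)

# to_sql issues the CREATE TABLE from the detected dtypes, replacing hand-written DDL
data.to_sql("states", engine, if_exists="replace", index=False)

With if_exists="replace", the table definition stays in sync with whatever types Pandas detects, so there is no separate SQL file to maintain.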
Since the lookup tables did not have headers and were fixed-width, I chose to process them one at a time. If your data files have headers and are comma-separated rather than fixed-width, it’s easy to process many files in one script, which is what I will do with the data.py script eventually (see the sketch after this block). The list of input files goes in this block:
# Input file paths
input_filepaths = [
"/tmp/2011.csv",
]
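A loop over that list might look like the following sketch; it assumes comma-separated files with header rows, reuses the assumed connection details from the earlier sketch, and uses “weather” as a placeholder table name:

import pandas as pd
from sqlalchemy import create_engine

# Same assumed connection details as the earlier sketch
engine = create_engine("postgresql://yugabyte:yugabyte@localhost:5433/yugabyte")

for path in input_filepaths:  # the list defined in the block above
    # With a header row present, read_csv infers column names and dtypes directly
    df = pd.read_csv(path)
    # Appending lets multiple years accumulate in the same table
    df.to_sql("weather", engine, if_exists="append", index=False)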
Take a look at the repo for an example of processing data with Pandas: https://github.com/dataindataout/ghcn_testing