Congress mandates that the Department of Education collect data on postsecondary education in the United States. Anyone can find and use the Integrated Postsecondary Education Data System (IPEDS) to see the results of the annual surveys conducted by the National Center for Education Statistics (NCES). The data is in the public domain, and it’s the core of every college search and ranking system.
Unfortunately, it’s not the easiest data set to use. It was designed for (and likely by) researchers. You can tell because there are data files for Stata, SPSS, and SAS. Thankfully, there’s also a CSV file for each survey, so we can copy the data into PostgreSQL. Of course, you will need to set up the tables with the columns defined in the data dictionary (an Excel file) and know to specify the ISO-8859-15 encoding to avoid problems with invalid byte sequences.
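Here’s a rough sketch of what that setup looks like. The `hd2023` table name follows the IPEDS file naming convention for the institution directory, but the column list is heavily abbreviated and the CSV path is a placeholder; the real definitions come from the data dictionary:

```sql
-- Abbreviated sketch of an IPEDS directory table; the full column
-- list comes from the data dictionary spreadsheet.
CREATE TABLE hd2023 (
    unitid integer PRIMARY KEY,
    instnm text,
    city   text,
    stabbr text
    -- ...dozens more columns per the dictionary
);

-- Specifying the encoding avoids "invalid byte sequence" errors.
\copy hd2023 FROM 'hd2023.csv' WITH (FORMAT csv, HEADER true, ENCODING 'ISO-8859-15')
```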
Once you have the data in PostgreSQL, you’ll need to extract relevant information, which usually means joining several tables. For instance, if you want to know what majors a college offers, you’ll need to cross-reference (see the sketch after this list):
- the institution directory,
- the most recent completions survey (awards and degrees conferred), and
- the Classification of Instructional Programs (CIP) codes.[1]
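Roughly, that cross-reference looks like the query below. The table names follow the IPEDS file layout (`hd2023` for the directory, `c2023_a` for completions), but treat them and the `cipcodes` lookup table as assumptions; check the data dictionary for your survey year:

```sql
-- What majors does a college offer? Join the directory to the
-- completions survey, then translate CIP codes into program names.
SELECT DISTINCT hd.instnm, c.cipcode, cip.ciptitle
FROM hd2023 AS hd
JOIN c2023_a  AS c   ON c.unitid   = hd.unitid
JOIN cipcodes AS cip ON cip.cipcode = c.cipcode
WHERE hd.instnm = 'Example University'
ORDER BY cip.ciptitle;
```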
If you have a question about a school, there’s a good chance you can get the data from IPEDS, but it will require a bit of digging to find the right connections.
Ansible playbooks and database views
IPEDS data is updated several times a year as survey results are recorded, which means any work done to create connections will need to be redone when new data is released. In anticipation of regular updates, I’m creating Ansible playbooks to re-import everything. That way I have an automated way to import next year’s results and any revisions that might yet come for the most recent year.
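In outline, a playbook for one survey file might look something like this. The host name, database name, and file locations are placeholders, not the actual playbooks, which presumably loop over every table:

```yaml
# Sketch of a re-import play; names and paths are illustrative.
- name: Re-import an IPEDS survey file
  hosts: dbserver
  vars:
    survey: hd2023
  tasks:
    - name: Download and unpack the survey CSV
      ansible.builtin.unarchive:
        src: "https://nces.ed.gov/ipeds/datacenter/data/{{ survey | upper }}.zip"
        dest: /tmp
        remote_src: true

    - name: Truncate the table and copy in the new data
      ansible.builtin.command: >
        psql -d ipeds -c "TRUNCATE {{ survey }};"
        -c "\copy {{ survey }} FROM '/tmp/{{ survey }}.csv'
        WITH (FORMAT csv, HEADER true, ENCODING 'ISO-8859-15')"
```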
Importing raw IPEDS data is only half the work, though, so I’m setting up database views that extract meaningful data points and generate new derived columns. This could be done at the application level, but views keep these data manipulations in the database so that any application can use them. Since the data is updated only a few times a year, I’m using materialized views to cache the results.
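For example, the majors query from earlier can become a cached view (again, the names are illustrative, not my actual schema):

```sql
-- Cache the joined majors data; any application can now read
-- institution_majors without knowing the underlying survey tables.
CREATE MATERIALIZED VIEW institution_majors AS
SELECT hd.unitid, hd.instnm, c.cipcode, cip.ciptitle
FROM hd2023 AS hd
JOIN c2023_a  AS c   ON c.unitid   = hd.unitid
JOIN cipcodes AS cip ON cip.cipcode = c.cipcode;

-- Run after each re-import to rebuild the cached results.
REFRESH MATERIALIZED VIEW institution_majors;
```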
Generating a profile document
Once you have beaten the data into shape, the next step is to create a profile. I’m using ERB since I’m writing my programs in Ruby and plan to create a Discourse plugin in the future. With a template system, it’s relatively easy to generate different profiles depending on how they will be used (see the sketch after this list):
- Markdown for Discourse posts
- HTML for webpages
- CSV for importing into Excel
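A minimal Ruby sketch of the Markdown case, with made-up profile data and an inline template standing in for the real ones:

```ruby
require "erb"

# Placeholder profile data; the real values would come from the
# materialized views described above.
profile = {
  name:   "Example University",
  city:   "Springfield",
  majors: ["Biology", "History", "Mathematics"]
}

# One template per output format; this one emits Markdown.
markdown = ERB.new(<<~MD, trim_mode: "-")
  ## <%= profile[:name] %> (<%= profile[:city] %>)

  Majors offered:
  <% profile[:majors].each do |major| -%>
  - <%= major %>
  <% end -%>
MD

puts markdown.result(binding)
```

Swapping in an HTML or CSV template changes the output format without touching the data extraction.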
[1] There are a few complications with importing that data, starting with finding the right CSV file to download and ending with Excel’s odd habit of adding an equals sign to preserve leading zeros. ↩︎