Open Data Science Conference Boston 2015

Doing open data science on government financials is not easy. A lot of the info is not, well, open. The good news is that data on government spending, borrowing, pensions and the like exists, but often lies hidden in bulky PDFs that are difficult to work with.

In my Open Data Science Conference presentation on May 31, 2015 in Boston, I highlighted some of the obstacles reporters face in doing accountability journalism using public financial data. We are certainly not the only ones: Anyone using EMMA, the main repository for state and local government financial disclosures, will quickly realize that the data is excellent but the PDFs are not.

I got around it for a recent project on tobacco bonds – state and local governments’ borrowing against their share of the 1998 legal settlement with cigarette manufacturers – by creating an electronic database of such bonds using paper filings from EMMA.

It was worth it: I found that governments had promised to repay $64 billion on just $3 billion they raised by cashing in on their share of the legal settlement. They did it using a risky form of debt known as “capital appreciation bonds,” which let interest add up until it comes due decades later in large amounts.
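To get a feel for how letting interest add up turns $3 billion into $64 billion, here is a rough back-of-the-envelope sketch. The 40-year term and the semiannual compounding are assumptions made purely for illustration, not figures from the filings, and the function names are hypothetical. The real bonds mature on many different dates, so the single implied yield is only indicative.

```python
def cab_maturity_value(principal, annual_yield, years, periods_per_year=2):
    """Amount owed at maturity when interest compounds and nothing is paid along the way."""
    periods = periods_per_year * years
    return principal * (1 + annual_yield / periods_per_year) ** periods


def implied_annual_yield(principal, maturity_value, years, periods_per_year=2):
    """Solve the compounding formula above for the annual yield."""
    periods = periods_per_year * years
    growth = maturity_value / principal
    return periods_per_year * (growth ** (1 / periods) - 1)


raised = 3e9        # dollars raised, from the article
owed = 64e9         # dollars promised in repayment, from the article
assumed_years = 40  # ASSUMPTION: one illustrative term for all bonds

repayment_multiple = owed / raised
rate = implied_annual_yield(raised, owed, assumed_years)

print(f"Repayment multiple: {repayment_multiple:.1f}x")
print(f"Implied yield over {assumed_years} years: {rate:.2%}")
```

Changing `assumed_years` shows how sensitive the implied yield is to the term: the shorter the term, the higher the rate needed to reach the same repayment multiple.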

Was it the ideal solution? Hardly. Sometimes, though, that’s just what it comes down to.

Not always, though: In the second half of the talk, Marc Joffe, an expert in government financial data – and in wrangling it out of PDFs for analytical purposes – walked through some of the creative solutions he’s implemented for tackling the same problems, including this really cool project he published earlier this year on California cities’ pension costs.

Marc was able to apply a more programmatic approach to the filings he was looking at, which showed that some of the cities faced pension costs as high as 17.6 percent of their revenues. That adds up to a significant burden for some small cities – and their taxpayers.

If you’re into working with state and local government finance data, Marc is definitely someone you should meet. And if you’re into financial journalism and using data to tell stories generally, I’d love to connect with you as well.

In fact, there is an open-source data project I’d like to pitch you.

In 2013, csv soundsystem – an informal gathering of developers, data journalists and scientists that I call home – launched a user-friendly API that converts daily U.S. Treasury borrowing and spending data from these (not always) fixed-width text files into structured CSVs. The project finally opened up the government’s checkbook to analysis and visualization. Liberating the data inspired several data-powered stories and projects, some of which are summarized here.
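For anyone who hasn’t wrestled with fixed-width text before, here is a minimal sketch of the general idea using only Python’s standard library. It is not the project’s actual parser. The sample lines, column positions and header names are all made up for illustration, and real Treasury files are messier (hence the “not always”).

```python
import csv
import io


def fixed_width_to_rows(text, colspecs):
    """Slice each non-blank line at the given (start, end) character positions."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        rows.append([line[start:end].strip() for start, end in colspecs])
    return rows


def rows_to_csv(rows, header):
    """Write the header and rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# HYPOTHETICAL sample: a 20-character label, then 5- and 10-character amounts.
SAMPLE = "\n".join([
    "{:<20}{:>5}{:>10}".format("Item A", "1,234", "5,678"),
    "",
    "{:<20}{:>5}{:>10}".format("Item B", "90", "120"),
])
COLSPECS = [(0, 20), (20, 25), (25, 35)]       # assumed column boundaries
HEADER = ["item", "today", "month_to_date"]    # assumed column names

rows = fixed_width_to_rows(SAMPLE, COLSPECS)
csv_text = rows_to_csv(rows, HEADER)
print(csv_text)
```

Hard-coded column positions are the fragile part: when a file’s layout shifts, the slices silently grab the wrong characters, which is why a real parser needs checks on top of this.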

I’d like to expand the project to include the Treasury’s monthly statements as well. These statements provide even more detail on the financial activities of federal agencies and sub-agencies. Unfortunately, they are big, unwieldy fixed-width text files as well. But liberating them in a similar fashion could give us an even better window into the finances of our government.

The code for our original parser is available here. The GitHub repo for the monthly statements project is available here. If you’d like to help out, give me a shout on Twitter @Cezary or shoot me an email.

What else is going on?

There is legislation afoot in Congress to make government finance data more open and readily accessible. But as this debt ceiling chart makes clear, it’s not easy for Congress to avoid causing the U.S. government to default on its debts, or to agree on anything else. So it’s not clear if the Financial Transparency Act (H.R. 2477) will ever get a chance to make our lives easier. But for the hopeful, you can read more about it here.

In the meantime, it’s up to us to keep hacking, scraping, developing and building our way to data transparency. Let’s keep the good fight going, and good luck!