Designing a Course in Data Journalism

More data journalism classes are on the way, at little schools and big ones. I'm hearing this through the grapevine. Several grapevines today, in fact. And I think it's fantastic.

When asked what such a course might look like, I point folks to Brian Boyer's theoretical and amazing Hacker Journalism 101. Journalism needs more journalists who can code, and this is a great way to get there.

That's programming, though, you say. We can't teach programming in a journalism class!

You can, and you should. Basic programming will help journalists understand and deal with data used by cities, cops, politicians, agencies, campaigns, companies, banks, stores, non-profits, advocacy groups and just about any other source you can think of.

Knowing how to code, even a little, is like having a solar calculator for that database you just scored.

In addition to programming, here are some of my favorite topics for classes, readings or workshops:

  • Finding data for your stories
  • Finding stories in your data
  • How to tell one story well
  • All data is dirty ... and what to do about that
  • Basic stats
  • Percentage points for journalists
  • Mapmaking made easy
  • Lying and truthing with easily-made maps
  • When maps shouldn't be maps
  • Basic chartbuilding
  • Lying and truthing with charts and graphs
  • Did I mention programming?

And I'd have 'em code something. Every week.

What have I left out? Add comments below and I'll update this post -- and my advice to others.

Photo: My daughters in a UW-Madison lecture hall where I studied geography. Both of them have dabbled in coding.

Real-time Data Journalism

While preparing for the real-time challenge of Election night, the WNYC Data News Team -- and the entire city -- turned its attention to an oncoming storm.

For our Hurricane Sandy coverage, we quickly built and maintained several data projects to help convey information people needed. All used open, public data and several were updated regularly -- either automatically or by hand.

Our projects included:

  • The evacuation map above, built using public shapefiles from New York City's Department of Emergency Management.
  • A storm-surge map for the entire New Jersey and New York coastlines, stitched together from a variety of U.S. Army Corps of Engineers shapefiles.
  • Hurricane Tracker to watch the storm's forecast track and its radar echo, fed by four real-time feeds from the National Weather Service. 
  • Transit Tracker with the latest information about several public transportation systems, driven by a Google spreadsheet updated by a half-dozen producers and reporters from transportation agency tweets, websites and public announcements.
  • A live flood-gauge map showing where the water was rising, driven by a real-time feed from the National Weather Service.
  • A traffic map for the back-to-work crush sans subways, fed live by the Google Maps traffic layer.
  • A subway-restoration map, updated several times a day with new maps issued by the city transit agency.

For details on the data, follow the source links on each project.

With more time, we would have worked more on the aesthetics. But time wasn't something we had much of, so we did our best to be accurate and clear given the resources available.

Read more about the WNYC Data News team's thinking behind our Sandy coverage

Counting the Jay-Z subway crowd

Saturday morning we did something fun: We counted the number of people who took the subway to the opening-night Jay-Z concert at Brooklyn's new Barclays Center the night before.

Or at least got pretty close.

Traffic and transit were closely watched for the new arena, as the 19,000 or so concertgoers would have just 541 parking spaces. So we decided to grab data from subway turnstiles to measure the crowds leaving the Atlantic Ave-Barclays Center station for the show.

How we did it

Turning around the data overnight took a little planning. Here's how we pulled it off:

Every Saturday morning, the MTA posts turnstile data for the previous week. Fortunately for us, the last reading is 8 p.m. Friday, the scheduled start time for the concert.

The data files contain the entry and exit counter readings for each turnstile in the system as a sort of "odometer" reading. The data is a little tricky to use, though it does have a regular structure.

So Steve Melendez, our Data News Team programmer, wrote some Python code that grabs the data files and puts the individual readings into a SQLite database. He then sorted the readings by station (using this chart), and calculated how many exit clicks were logged for the Atlantic Avenue station from 4 p.m. to 8 p.m.
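For the curious, here is a minimal sketch of that "odometer" arithmetic. It is not the team's actual code: the table layout, column names and sample readings are all invented for illustration, and it assumes a counter never resets or rolls over inside the time window.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding a few invented counter readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "station TEXT, turnstile TEXT, read_at TEXT, exits INTEGER)"
    )
    rows = [
        # (station, turnstile, timestamp, cumulative exit counter)
        ("ATLANTIC AVE", "A-01", "2000-01-07 12:00:00", 900),
        ("ATLANTIC AVE", "A-01", "2000-01-07 16:00:00", 1000),
        ("ATLANTIC AVE", "A-01", "2000-01-07 20:00:00", 1450),
        ("ATLANTIC AVE", "A-02", "2000-01-07 16:00:00", 500),
        ("ATLANTIC AVE", "A-02", "2000-01-07 20:00:00", 800),
        ("ELSEWHERE", "B-01", "2000-01-07 16:00:00", 10),
        ("ELSEWHERE", "B-01", "2000-01-07 20:00:00", 20),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    return conn


def exits_in_window(conn, station, start, end):
    """Add up how far each turnstile's exit counter advanced in the window.

    Each reading is a running total, so the exits logged by one turnstile
    are the largest reading minus the smallest reading inside the window.
    """
    rows = conn.execute(
        """
        SELECT turnstile, MAX(exits) - MIN(exits)
        FROM readings
        WHERE station = ? AND read_at BETWEEN ? AND ?
        GROUP BY turnstile
        """,
        (station, start, end),
    ).fetchall()
    return sum(delta for _, delta in rows)


conn = build_sample_db()
total = exits_in_window(
    conn, "ATLANTIC AVE", "2000-01-07 16:00:00", "2000-01-07 20:00:00"
)
print(total)  # prints 750 for the invented rows above
```

Running the same query for earlier Fridays would give the baseline to compare against.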

We suspected there would be a jump in the time period before the concert. So earlier in the week, we ran the numbers for each Friday for much of the year and calculated those averages (I ended up using just September, because they were higher, post-summer vacation readings). Then, Saturday morning, Steven got up really early and ran the program again, including the newly posted numbers.

He sent me the latest values, and I added them to the chart in a taxi on the way to the station. At 8:35 a.m., I was on the air talking about how it appeared that about a third of the concertgoers took the subway.

It could be more: Some people could have left the system at another station. And if anyone left through an emergency exit, or if they showed up after 8 p.m., they wouldn't be in our turnstile data.

But it's a place to start, and we'll be watching how these numbers change for future concerts and for Brooklyn Nets games.

Charting Local Olympic Data

The WNYC Data News team isn't just about maps. We dig into all kinds of structured data -- and the 2012 Olympics will generate a bunch of it.

There are some great efforts afoot to follow the Games, with the New York Times doing amazing work as always.

Our slice of that effort is the Team NYC Olympian Tracker, which we made to help our audience follow who's competing today, who's in contention for a medal and who's won medals. 

WNYC's Team NYC Olympian Tracker

The entire application is fed by a Google Spreadsheet, which is linked to the chart by a cool data toolkit called Miso, from The Guardian and Bocoup. Since we're in the process of interviewing for designers, we went with a clean and quick design using Bootstrap.

It was a fun project and, just like a map, built to make interesting data easy to use.

NYPD Stop & Frisk Data for You

This week, we published a map showing total NYPD stop and frisks by block together with locations where guns were discovered during such stops.

In the tradition of showing our work, here's some information about how we built it -- and data you can download and explore yourself.

The Data

The major bumps I hit working with the NYPD's Stop, Question and Frisk data sets were 1) they're in a format I don't know, and 2) the geographic locations aren't in latitudes and longitudes.

For bump #1, I used the free statistical program "R" to convert the NYPD's ".por" files into something I could use. R is also great at handling big data sets, and easily managed the 685,724 stops in the 2011 file.

For bump #2, I noticed that each stop had data fields called "XCOORD" and "YCOORD." A couple of tests confirmed that those values described the stop's position on the New York-Long Island State Plane Coordinate System -- something I've seen in a lot of city data. So I used the free geographic software QGIS to load in the data and convert (technically, reproject) those coordinates into latitudes and longitudes.
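QGIS did the real work here. Purely to show what "reproject" means, below is a hand-rolled sketch of the inverse Lambert conformal conic math in plain Python. The zone parameters are an assumption (NAD83 / New York Long Island in US survey feet, usually listed as EPSG:2263), so check them against your own data before trusting the output. The function names are made up for this example.

```python
import math

# Assumed zone: NAD83 / New York Long Island, US survey feet (EPSG:2263).
A = 6378137.0                      # GRS80 semi-major axis, meters
INV_F = 298.257222101              # GRS80 inverse flattening
E2 = (2.0 - 1.0 / INV_F) / INV_F   # eccentricity squared, 2f - f^2
E = math.sqrt(E2)
LAT_1 = math.radians(41.0 + 2.0 / 60.0)    # first standard parallel
LAT_2 = math.radians(40.0 + 40.0 / 60.0)   # second standard parallel
LAT_0 = math.radians(40.0 + 10.0 / 60.0)   # latitude of origin
LON_0 = math.radians(-74.0)                # central meridian
FALSE_EASTING_M = 300000.0                 # false northing is zero
US_FOOT_M = 1200.0 / 3937.0                # meters per US survey foot


def _m(phi):
    return math.cos(phi) / math.sqrt(1.0 - E2 * math.sin(phi) ** 2)


def _t(phi):
    s = E * math.sin(phi)
    return math.tan(math.pi / 4.0 - phi / 2.0) / ((1.0 - s) / (1.0 + s)) ** (E / 2.0)


# Cone constants derived from the two standard parallels.
N = (math.log(_m(LAT_1)) - math.log(_m(LAT_2))) / (
    math.log(_t(LAT_1)) - math.log(_t(LAT_2))
)
BIG_F = _m(LAT_1) / (N * _t(LAT_1) ** N)
RHO_0 = A * BIG_F * _t(LAT_0) ** N


def to_lat_lon(x_ft, y_ft):
    """Convert XCOORD/YCOORD in feet to (latitude, longitude) in degrees."""
    x = x_ft * US_FOOT_M - FALSE_EASTING_M
    y = y_ft * US_FOOT_M
    rho = math.hypot(x, RHO_0 - y)
    theta = math.atan2(x, RHO_0 - y)
    t = (rho / (A * BIG_F)) ** (1.0 / N)
    lon = theta / N + LON_0
    phi = math.pi / 2.0 - 2.0 * math.atan(t)   # first guess, then refine
    for _ in range(10):
        s = E * math.sin(phi)
        phi = math.pi / 2.0 - 2.0 * math.atan(t * ((1.0 - s) / (1.0 + s)) ** (E / 2.0))
    return math.degrees(phi), math.degrees(lon)


def to_state_plane(lat, lon):
    """Forward projection, included so the conversion can be round-tripped."""
    rho = A * BIG_F * _t(math.radians(lat)) ** N
    theta = N * (math.radians(lon) - LON_0)
    x = rho * math.sin(theta) + FALSE_EASTING_M
    y = RHO_0 - rho * math.cos(theta)
    return x / US_FOOT_M, y / US_FOOT_M


# A made-up coordinate pair, not a real stop from the data set.
print(to_lat_lon(987000.0, 200000.0))
```

In practice a GIS tool or a projection library is the safer route, since it also handles datum details this sketch ignores.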

And now you can have the data I used to make the map. Just click to download:

stopfrisk2011_databundle_sans_allstops.zip
(4.3MB download, unzips to 12MB)
Contains a shapefile of all NYC blocks with the total stop-and-frisks calculated for each block, a shapefile with the points for all stops where guns were found, raw data on each of the 768 stops where guns were found and notes about each data set. Here's more detail on the contents.

stopfrisk2011_databundle_with_allstops.zip
(51MB download, unzips to 500MB)
This file has all of the above and a .csv with the raw data for all 685,724 stops in 2011. While it's in a more common format than what the NYPD provides, it's too big to open in Excel and maxes out the limits for Google Fusion Tables. So you'll need a stats program like R or some database know-how to handle it.

The Map

I built the map using TileMill from Mapbox, which I've been playing with for some months now.

While it's trickier than generating quick maps from Fusion Tables, if you're patient and spend some time with it, you can make some pretty gorgeous maps.

Besides providing wonderful control over styles and colors, TileMill solves an important problem: New York City has roughly 38,500 census blocks -- and loading the data to draw them all onto a Google map will anger any browser. With TileMill, you bake the data into individual image tiles, which get served up to the user as they zoom and pan.

To cover the area of NYC and provide 8 levels of zoom, I pre-cooked 59,095 tiles. But once they're uploaded to the Mapbox server, which took about 15 minutes, they load almost instantly.
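To get a feel for why the tile count climbs so quickly, here is a rough sketch of the standard web-map tile arithmetic, in which each extra zoom level roughly quadruples the tiles covering an area. The bounding box and zoom range below are guesses for illustration, not the settings used for this map, so the total is not expected to match 59,095.

```python
import math


def tile_xy(lat, lon, zoom):
    """Return the (x, y) index of the web-map tile containing a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still land on a valid tile.
    return min(x, n - 1), min(y, n - 1)


def tiles_for_bbox(west, south, east, north, zoom):
    """Count the tiles needed to cover a bounding box at one zoom level."""
    x_min, y_min = tile_xy(north, west, zoom)   # tile rows count down from the north
    x_max, y_max = tile_xy(south, east, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)


# Assumed values: a loose box around the five boroughs and eight zoom levels.
NYC_BBOX = (-74.26, 40.49, -73.70, 40.92)
ZOOMS = range(10, 18)

counts = {z: tiles_for_bbox(*NYC_BBOX, z) for z in ZOOMS}
for z, count in counts.items():
    print(z, count)
print("total", sum(counts.values()))
```

Most of the total comes from the deepest one or two zoom levels, which is why trimming a single level saves so much rendering and upload time.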

As always, I welcome comments and questions below or at john (at) johnkeefe.net.

Love Design? Join the WNYC Data News Team

Do you want to ...

  Inform the citizens of New York?

  Help people understand their world?

  Root out corruption?

  Make a mark on society?

  Craft beautiful online projects and visualizations?

  ( Like this diversity map, this stop & frisk project and this election tracker? )

WNYC is growing our Data News Team to make high-impact visualizations and projects, and to help WNYC reporters and producers present the facts, expose corruption and explain our world. We've been pioneers in the fields of crowdsourcing, data journalism and mapping -- even winning some prestigious awards for our work.

Now we're kicking it up a notch. Like to join us? 

What we have:

  • An award-winning staff of reporters and producers
  • A committed, innovative digital staff
  • A mission to conduct journalism in the public interest
  • Millions of engaged, passionate and active listeners and readers

What you have:

  • A passion for news
  • An attention to detail, a respect for fairness and a hatred of inaccuracy
  • A user-centered approach to exploring information
  • An appreciation for clean lines, clear stories and use of white space
  • A genuine and friendly disposition, and an honest spirit of collaboration
  • A bias toward sharing what you know, and helping others build on it

What you'll do:

  • Huddle with reporters to figure out how we might help their stories with data, design and web technology
  • Work as a team to turn ideas into realities in days or weeks, tops
  • Learn from and build on successes and mistakes along the way
  • Have your work consumed online and talked about on air to millions of New Yorkers

Head over to our official application for Interaction Designer and tell us all about you.

The Thinking Behind WNYC's Vertical Timeline

Making a music stand, my father said, was a great challenge: Even though people had made them for centuries, it was still possible to blend beauty and function in a new way.

In journalism, the same is true for the timeline.

Presenting a chronological story online, beautifully and functionally, has been tricky. There are some great examples, such as the New York Times' chronology of the Iraq war, and the three-dimensional Middle East timeline from The Guardian.

ProPublica built the excellent TimelineSetter to put Times-like timelines in the hands of non-Times journalists, and we used it for a while. But TimelineSetter's horizontal layout got cramped in WNYC's article columns, and we longed for something that fit better.

Working with Balance Media and the WNYC web design team, we kicked around several ideas and settled on a vertical version. As it happened, Facebook's new vertical timeline had come out, inspiring a crop of JavaScript libraries we could work with.

We also decided to dispense with a journalistic convention that represented temporal gaps visually -- making months wider than weeks, for example -- and focus, instead, on seeing the sequence of events at once.

The live version at WNYC is here.

We also went with a center-spine orientation, which gives it balance and allows the user to see more items at the same time. And the very cool Isotope code reshuffles the items to fit as they are closed, opened, resorted or even added.

Open to use

Finally, we wanted it to be easy for us -- and you -- to use. So we wired it to Google Spreadsheets, allowing reporters and editors to easily enter and update the information. The wiring there is based on a previous project of ours called Tabletop.js.

And we made the source code openly available and scary-easy to use, and you can start by copying this Google spreadsheet template.

We usually build an HTML page just like the one in the code example, and then use a simple line of HTML to iframe it onto an article page. The only trick is to make sure the iframe is tall enough.

The code is free for non-commercial use; commercial use requires a $25 license fee for Isotope.

We hope folks will use the timeline, and come up with improvements. Let us know about either in the comments below or by writing me at john (at) johnkeefe (dot) net.

NICAR 2012 - Links from My Presos

I had the honor of presenting at four sessions at the Computer Assisted Reporting conference this week. For those who attended, here are the links I referenced in each session.

If you weren't in St. Louis for the conference, you can still get a sense of what's here. If you see something you want to know more about, let me know. For everyone else's presentations, check out this great list.

Election Night Results & Maps

Ins and Outs of APIs

Election Data Without a Database

Apps Without a Backend CMS (using Google Spreadsheets instead)

Hacking the Census - How we made a Fusion Tables census map

States & Municipalities with Real-Time Election Results

Here's my unofficial list of state and local governments that have -- or have provided in the past -- real-time election results. Great opportunities to build your own election-night maps and analysis!

Know of others? Drop me a note at john [at] johnkeefe.net. (Updated as of Feb. 2012.)

The Nevada Vote: In 3-D

The Guardian pushed the limits of election-night data display this week with a relief map of the Florida primary vote. 

They didn't push far enough.

As promised: Live election results in True 3-D.

Nevada 3d Still

(To avoid blog lag, I've put the live version here.)

You need a current browser to see it. Recent versions of Chrome and Firefox work. Safari does, too, if you nudge it.

With any luck, the counties shall grow as the vote rolls in tonight.

For those interested, I built it in Processing and use Processing.js to put it on the web. You're welcome to embed it if you wish. Just drop me a note or comment that you did.

UPDATE: My data-fetching code is a little wonky. Refresh the page to ensure the latest results!

UPDATE 2: I actually don't believe this is the best way to present numeric data. Representing numeric scale with a 3D drawing on a 2D surface is exceptionally tricky and should probably be avoided. Also, there are no rollovers or other clarifying information -- like county names and vote counts.

That said, I like the idea that some data sets might be worth spinning, touching and flying through. So maybe this is my first step in that direction.

Plus, it was fun.

UPDATE 3: By request, here is the Processing sketch upon which this was built.