Zoë Wilkinson Saldaña Data scientist + data librarian

On visualizing the loveliness of the pinball universe

network graph of pinball machines showing relationship between machines across United States

It makes me happy to think somewhere in the glowing bubbles are machines that I’ve fed with quarters and hopefully treated with care. And yet, as I try to wrap my mind around the way pinball is growing, I realize how small and personal it feels to me now. Am I really part of this contiguous world? What would my pinball universe look like, if not this one?

Read my essay on pinball + network analysis on Messy Data

Stories from the city, stories from the cloud: an introduction to city open data portals in the United States

bar chart showing number of datasets in open data portal for cities with top fifteen highest Census ratings.

What stories can you tell with a city’s open data? And what’s missing from the data?…

In this post, I will introduce you to the producers and gatekeepers of civic data that Google describes as they play out in practice. When it comes to cities, these questions tend to revolve around the city open data portal: a model adopted nationwide that facilitates tens of thousands of datasets, a buzzing cross-section of hyper-local civic data activity, and a service almost exclusively built on just one tech company’s platform.

Read on Messy Data (my new feminist critical data science blog!)

Creating a custom network visualization using the Scalar API Explorer, Part 1

Scalar is a unique and powerful open source publishing platform. Its strength lies in its ability to combine linear and nonlinear methods of exploring media, narratives, annotations, and scholarship within the single organizing structure of a book. Scalar also provides several out-of-the-box options to visualize your data - including as an interactive network visualization.

But what happens when you want to customize your visualizations beyond what the Scalar presets allow? I recently ran up against this issue and decided to find a way to create a network visualization “from scratch” (in reality, leveraging a number of excellent open source tools, demos, and APIs!). I used the Scalar API Explorer to export data about our pages and the tag relationships between them, prepared the network data with Python and various packages (NetworkX, BeautifulSoup, etc.), and wrote a custom network visualization using D3.js and Canvas.

I waded through a fair bit of code and experimentation along the way, and I’d like to share with you some notes, lessons, code, and tools that resulted from that process. I also tried to identify several places where you may wish to deviate from my process, or to experiment further, depending on your Scalar book and your own vision of what such a visualization might look like.

This tutorial is written as two parts:

  • Part 1: Represent your Scalar book as a network using Python and the Scalar API Explorer
  • Part 2: Create an interactive visualization of the network data using D3.js and Canvas

In Part 1, I will introduce the goals of this process and walk through the Python code needed to prepare your data.
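
To give a rough sense of what "representing a book as a network" can mean, here is a minimal sketch, not the code from the tutorial itself. It assumes a made-up mapping from tag pages to the pages they tag (the `tagged_pages` dictionary and its page names are hypothetical, and a real Scalar API Explorer export will look different), and it uses only the Python standard library rather than NetworkX. The output is a plain "nodes and links" structure of the kind D3.js force layouts commonly consume.

```python
import json
from collections import defaultdict

# Hypothetical input, invented for illustration: each tag page maps to
# the pages it tags. A real Scalar export would need parsing first.
tagged_pages = {
    "tag-a": ["page-1", "page-2"],
    "tag-b": ["page-2", "page-3"],
}


def build_network(tagged_pages):
    """Return a dict with 'nodes' and 'links' lists describing the tag network."""
    degree = defaultdict(int)  # number of links touching each node
    links = []
    for tag, pages in tagged_pages.items():
        for page in pages:
            links.append({"source": tag, "target": page})
            degree[tag] += 1
            degree[page] += 1
    # One node per tag or page, with its degree attached so a
    # visualization could size nodes by how connected they are.
    nodes = [{"id": name, "degree": degree[name]} for name in sorted(degree)]
    return {"nodes": nodes, "links": links}


network = build_network(tagged_pages)
print(json.dumps(network, indent=2))
```

Keeping a degree count on each node is one design choice among many; you might prefer to attach page titles, content types, or other metadata from your own book.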

...(read more)...

What does critical data science add to our understanding of sexual harassment in academia?

A cautious introduction to NLP and Machine Learning methods in analyzing thousands of anonymous sexual harassment & assault reports.


“The data are too messy.”

“There’s no way we could work through it in time.”

“I’d like to figure something out. But I don’t know.”

I was sitting in the grad student lounge in Ann Arbor with three classmates from Information Visualization. Each of us had the same Google Sheet pulled up on our browsers: “Sexual Harassment In the Academy: A Crowdsource Survey. By Dr. Karen Kelsky, of The Professor Is In”.

In just a few months, a call for anonymous survey submissions in the popular The Professor Is In blog had resulted in over 2,300 reports of sexual harassment and assault.

Our group quickly realized a few things: this data was immense. It was messy in the sense that data folks often describe messy data: non-standard, full of missing values and strange capitalizations. It was also data that spoke to an immensity of pain, the loss of futures diverted and destroyed.

One peer described the issues in the data, column by column, but she stopped when it came to the “Event” column. We reached a point where we stopped knowing what to say, and just made eye contact with each other. Eventually, our group passed on the Sexual Harassment dataset in favor of a climate change-related project (which generated its own share of complex data issues).

However, I couldn’t stop thinking about the survey. Thousands of individuals had revealed their experiences with sexual harassment and assault in academia, many apparently for the first time. These reports detailed the devastating effect these events had on their lives. This is vital, powerful data that deserves to have its story told.

...(read more)...