A cautious introduction to NLP and machine learning methods for analyzing thousands of anonymous sexual harassment & assault reports.
“The data are too messy.”
“There’s no way we could work through it in time.”
“I’d like to figure something out. But I don’t know.”
I was sitting in the grad student lounge in Ann Arbor with three classmates from Information Visualization. Each of us had the same Google Sheet pulled up in our browsers: “Sexual Harassment In the Academy: A Crowdsource Survey. By Dr. Karen Kelsky, of The Professor Is In.”
In just a few months, a call for anonymous survey submissions on the popular blog The Professor Is In had resulted in over 2,300 reports of sexual harassment and assault.
Our group quickly realized a few things. The data was immense. It was messy in the sense that data folks usually mean by messy data: non-standard, full of missing values and strange capitalizations. It was also data that spoke to an immensity of pain and to the loss of futures diverted and destroyed.
One peer described the issues in the data, column by column, but she stopped when it came to the “Event” column. We reached a point where we stopped knowing what to say, and just made eye contact with each other. Eventually, our group passed on the Sexual Harassment dataset in favor of a climate change-related project (which generated its own share of complex data issues).
However, I couldn’t stop thinking about the survey. Thousands of individuals had revealed their experiences with sexual harassment and assault in academia, many apparently for the first time. These reports detailed the devastating effect these events had on their lives. This is vital, powerful data that deserves to have its story told.