Innovation

Making Text Data Accessible for Social Science

July 19, 2019


Text is everywhere, and everything is text. More textual data than ever before are available to computational social scientists—be it in the form of digitized books, communication traces on social media platforms, or digital scientific articles. Researchers in academia and industry increasingly use text data to understand human behavior and to measure patterns in language. Techniques from natural language processing have created fertile ground for these tasks and for making inferences from text data at scale.

However, a central obstacle prevalent across research areas—particularly in computational social science and social science as a whole—is access to sensitive data. Arguably, most existing text datasets are never shared, primarily as a result of data protection restrictions. Privacy regulations are in place—quite rightly—to protect the identity of persons referred to in text data, but the consequence of this is that many of the datasets of greatest potential value cannot be studied. For example, police forces, businesses and health care organizations hold massive amounts of data with an untapped potential to solve hard problems such as identifying trends in police reports or understanding language change in patient interviews at scale. In short, while there is huge potential in text data, the most valuable datasets hardly reach the researchers with the ability to analyze, interpret and generate societal impact from such data.

In addition, isolated data sharing agreements between stakeholders and universities hardly solve the problem: they typically prohibit further sharing of the data among researchers, which makes follow-up research and replication studies unlikely and ultimately slows scientific progress. The lack of access to sensitive data not only inhibits research but also risks biasing the field towards questions about readily available data (e.g. Twitter) rather than questions driven by real-world importance. As a whole, the problem of accessing sensitive text data is a major obstacle to progress in computational social science. Text Wash aims to solve the underlying dilemma: how to research genuinely pressing issues using sensitive data without violating privacy and data protection regulation?

Solving the dilemma

This post was originally published on our sister site, SAGE Ocean, under the title “Making sensitive text data accessible for computational social science,” by Dr Bennett Kleinberg, Maximilian Mozes and Dr Toby Davies.

If the primary impediment to the sharing of text data is the presence of sensitive and identifying information (e.g. names, dates), a straightforward solution lies in anonymizing the data by removing that information. However, current approaches to data anonymization either require cost- and time-intensive manual anonymization by human experts or rely on automatic manipulation of texts that replaces identifying information with generic, context-independent terms (e.g. replacing all names and dates in a text with the phrase “XXX”). The latter, for example, is the approach employed by the anonymization tool provided by the UK Data Service. However, since both the syntactic and the semantic characteristics of texts are essential for current text processing and information extraction methods to work, such approaches make texts unusable for proper linguistic analyses. To overcome this problem, we propose a fully automated text anonymization tool that removes traceable and confidential information from English texts and substitutes meaningful replacements for it. In doing so, both the syntactic correctness and the semantic meaning of each sentence are preserved.
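
Both automatic routes share the same first step: detecting the identifying spans in a text. As a purely illustrative sketch (this is not the Text Wash implementation), an off-the-shelf named entity recognizer such as spaCy's can already surface the kinds of spans an anonymizer has to deal with, assuming the en_core_web_sm model has been downloaded:

```python
# Illustrative only -- uses spaCy's general-purpose English model, not a
# purpose-built anonymization model. Assumes en_core_web_sm is downloaded.
import spacy

nlp = spacy.load("en_core_web_sm")

# A made-up sentence containing the kinds of spans an anonymizer must find.
doc = nlp("Jane Smith met the data protection officer in Manchester on 3 May 2021.")

for ent in doc.ents:
    # Each entity comes with a text span, a category label (e.g. PERSON,
    # GPE, DATE) and character offsets that a replacement step can use.
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```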

“Text Wash aims to solve the underlying dilemma: how to research genuinely pressing issues using sensitive data without violating privacy and data protection regulation?”

Preserving meaning in text anonymization

In order to illustrate the importance of preserving the semantics when altering texts for linguistic analyses, let us consider the following example. Assume we would like to anonymize the statement

“Alice was very happy about Jamie’s termination of employment with TextWash Inc. last Monday. Three days later, Jamie sued TextWash Inc.”

Assuming that Alice, Jamie and TextWash Inc. are real entities, such a statement would be difficult to share freely, since it would expose Alice’s pleasure at Jamie’s firing and reveal that Jamie sued their former employer. Nevertheless, from a scientific perspective, this statement might contain valuable information and could be used, for example, to automatically extract the emotional valence of Alice’s reaction to Jamie’s firing. If we anonymized this text by simply replacing identifiable information with the generic phrase “XXX”, we would obtain the sentence

“XXX was very happy about XXX termination of employment with XXX last XXX. XXX later, XXX sued XXX”.

Such an approach would make it impossible to extract meaningful information from this statement. Neither humans nor computational algorithms could understand which of the de-identified entities correspond to the same person or organization, thus making it impossible to retain the semantic relationships between individual entities in the statement (in the second sentence, for example, we would not be able to understand that Jamie sued TextWash Inc. since all the references to previous occurrences of the same entity are lost).
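
To make the naive scheme concrete, here is a minimal sketch of how it might be automated. It is illustrative only, not the UK Data Service tool or the Text Wash implementation, and it assumes spaCy's en_core_web_sm model is installed; every detected entity span is overwritten with the same placeholder, so co-reference information is discarded along with the identities.

```python
# Sketch of naive redaction -- every detected entity becomes the same "XXX".
# Assumes spaCy and its en_core_web_sm model are installed; which spans are
# detected (and therefore redacted) depends entirely on that model.
import spacy

nlp = spacy.load("en_core_web_sm")

def naive_redact(text, placeholder="XXX"):
    doc = nlp(text)
    redacted = text
    # Replace spans from the end of the string so earlier offsets stay valid.
    for ent in reversed(doc.ents):
        redacted = redacted[:ent.start_char] + placeholder + redacted[ent.end_char:]
    return redacted

statement = ("Alice was very happy about Jamie's termination of employment "
             "with TextWash Inc. last Monday. Three days later, Jamie sued TextWash Inc.")
print(naive_redact(statement))
# Every detected person, organization and date collapses to "XXX", so nothing
# in the output tells us that the same person or company is mentioned twice.
```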

Now if we consider an entity-specific anonymization method that i) differentiates between individual categories of identifiable information (e.g. persons, locations, organizations, dates) and ii) consistently replaces instances of these categories throughout the text, we would obtain the sentence

“[PERSON_1] was very happy about [PERSON_2] termination of employment with [ORGANISATION_1] last [DATE_1]. [DATE_2] later, [PERSON_2] sued [ORGANISATION_1]”.

The resulting sequence no longer contains any sensitive information and could hence be shared openly online without violating confidentiality. Additionally, since each entity and all of its occurrences are replaced consistently, we could still analyze the statement in an automated way without giving up valuable textual characteristics. The following table summarizes the characteristics of both anonymization procedures, and the code sketch after it illustrates the idea.


Context-preserving anonymization:
- Categorizes anonymized entities
- Identifies reoccurrences of the same concept and thus preserves co-references between entities

Naive anonymization with “XXX”:
- Does not categorize anonymized entities
- Does not account for repeated occurrences of anonymized concepts

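For comparison, a minimal sketch of the consistent, category-aware approach is shown below. It is again illustrative only and not the Text Wash implementation: it relies on spaCy's en_core_web_sm model (whose labels are PERSON, ORG and DATE rather than the ORGANISATION spelling used above) and treats two mentions as the same entity only when their surface strings match, which a production tool would need to handle far more carefully.

```python
# Sketch of context-preserving anonymization: each unique entity is mapped to
# an indexed, category-specific placeholder that is reused on every reoccurrence.
# Assumes spaCy's en_core_web_sm model; detection quality depends on that model.
import spacy

nlp = spacy.load("en_core_web_sm")

def consistent_anonymize(text):
    doc = nlp(text)
    mapping = {}   # (entity text, label) -> placeholder
    counters = {}  # label -> running index
    # Pass 1: assign placeholders in order of first appearance.
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        if key not in mapping:
            counters[ent.label_] = counters.get(ent.label_, 0) + 1
            mapping[key] = f"[{ent.label_}_{counters[ent.label_]}]"
    # Pass 2: substitute from the end of the string so offsets stay valid.
    anonymized = text
    for ent in reversed(doc.ents):
        anonymized = (anonymized[:ent.start_char]
                      + mapping[(ent.text, ent.label_)]
                      + anonymized[ent.end_char:])
    return anonymized

statement = ("Alice was very happy about Jamie's termination of employment "
             "with TextWash Inc. last Monday. Three days later, Jamie sued TextWash Inc.")
print(consistent_anonymize(statement))
# "Jamie" and "TextWash Inc." map to the same placeholders in both sentences,
# so the fact that the fired employee sued the employer is still recoverable.
```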

This example clearly demonstrates the importance of context-preserving text anonymization when anonymized texts are to be used for linguistic analysis. To enable researchers to anonymize text in this manner, Text Wash aims to provide an easy-to-use tool that consistently replaces identifiable information and thereby preserves both the syntactic structure and the semantic representation of texts.

A bottom-up tool for text anonymization

Aside from preserving context and retaining the usefulness of anonymized texts for follow-up analyses, the success of text anonymization efforts crucially depends on stakeholder needs. Previous anonymization efforts have either ignored researchers’ needs (e.g. by fully redacting texts with “XXX”) or neglected the anonymization requirements of data owners (so that the data may still not meet the criteria for sharing).

We therefore collaborate closely with stakeholders from police forces and with governmental data protection officials to define precisely the requirements of our text anonymization procedure. Once these requirements are defined, our system will use state-of-the-art methods from natural language processing and information extraction to anonymize texts sufficiently that the resulting data can eventually be shared openly across research communities. Our aim is to maximize impact by ensuring that the needs of both data holders and researchers are built into the tool from the start.

Text Wash is currently under development thanks to a SAGE Concept Grant. The tool will be available to the research community as an open-source R library and to data owners as a stand-alone offline version.


Dr Bennett Kleinberg is an assistant professor at the Department of Security and Crime Science and the Dawes Centre for Future Crime at University College London. He is interested in understanding crime and security problems with computational techniques and in behavioral inferences from text data. Maximilian Mozes is a Ph.D. student at University College London supervised by Lewis Griffin (Department of Computer Science) and Bennett Kleinberg (Department of Security and Crime Science). His research interests lie at the intersection of natural language processing and crime science and his doctoral studies focus on assessing the vulnerabilities of statistical learning techniques operating on textual data. Dr Toby Davies is an assistant professor in the Department of Security and Crime Science at University College London. His interests lie in the quantitative and computational analysis of crime, with particular focus on spatial analysis and the role of networks in crime.
