Innovation

No More Tradeoffs: The Era of Big Data Analysis Has Come

July 16, 2019 1814

For centuries, being a scientist has meant learning to live with limited data. People only share so much on a survey form. Experiments don’t account for all the conditions of real world situations. Field research and interviews can only be generalized so far. Network analyses don’t tell us everything we want to know about the ties among people. And text/content/document analysis methods allow us to dive deep into a small set of documents, or they give us a shallow understanding of a larger archive. Never both. So far, the truly great scientists have had to apply many of these approaches to help us better see the world through their kaleidoscope of imperfect lenses.

This post was originally posted on sister site, SAGE Ocean, under the title, “No more tradeoffs: The era of big data content analysis has come“, by Nick Adams.

But, at least in the area of text analysis (AKA content analysis, or natural language processing), the old limits are crumbling, opening up a frontier of big social data that is fertile enough to support hundreds of research projects and next generation scientists.

The shift, like so many others, is coming thanks to advancements in data science technology. But it does not represent a total break with the past. For decades, social scientists have used expert document labeling tools, often called CAQDAS (Computer Assisted Qualitative Data Analysis Software), like AtlasTI, NVivo, and MaxQDA that allow a highly trained scholar to apply a custom set of labels to a few hundred documents. Initially built for labeling (AKA hand-coding) interview transcripts and meeting minutes, these tools are great for smaller projects. But training up a team to use them on a larger set of documents has always been a costly process of transferring expertise to research assistants and closely supervising their work to ensure its consistency. Lately, tools like Dedoose, DiscoverText, LightTag, and TagTog have re-factored this expert labeling approach into cloud-based software that is cheaper to deliver, update, and maintain—but their users still suffer the same bottleneck of expertise transfer. Even when they are applied to the simplest of labeling tasks, CAQDAS tools require the face-to-face training and oversight of a research team. In the university context, this management burden is grossly compounded by the high turnover among the transient student researcher workforce.

Crowds teaching machines

More recently, machine learning researchers have given social scientists hope. They have developed a number of new technologies able to harness the efforts of online workers to label data like images and video. The online workers go to a crowd work marketplace such as Amazon’s Mechanical Turk, where they are able to find brief ‘Human Intelligence Tasks’ they can perform through their Internet browser for a couple dollars per task. These crowd workers tend to be above average in education, and many do the work to earn supplemental household income. Already, they have completed billions of tasks applying highly accurate labels to images. 

But the research community has been waiting a long time for a way to enlist the crowd in the much more difficult analysis of textual data like news articles, court proceedings, historical archives, and more. Computers–originally built to apply logical and mathematical operations on numeric, ordinal, and categorical data—still can’t faithfully understand human language. Computer scientists’ text mining (AKA natural language processing) approaches have been especially inadequate when it comes to applying, measuring, and testing the rich social theory scholars have built up over centuries. The clamoring for a high-scale, high-fidelity content labeling tool has only intensified since Benoit, Conway, and Laver published their landmark paper proving that crowds of independently working internet laborers could produce results as good as or better than experts.

Finally, the wait is over. My new company, Thusly Inc. is now providing the first and only full-service, end-to-end, online textual data labeling factory that can be custom-rigged for researchers’ use in a matter of several hours, and allow them to go as deep and as broad as they wish with their text labeling projects. No more compromises.

New, different, bigger, better

Notably, this new tool—called TagWorks—has not come from billion dollar tech companies like Amazon’s Mechanical Turk or Figure Eight. I imagined, designed, and developed it as a social scientist and data scientist intimately familiar with the headaches of managing annotation projects using the older, expert-dependent software. My vision of vast, powerful, richly labeled archives combined with my burning frustration as I tried to use other tools at high scale. I became obsessed with the goal of providing researchers with customizable annotation tools producing world-class, research-grade data in months, not years. So, while industry software developers are encouraged to “move fast and break things,” the TagWorks team and I have carefully ensured that researchers’ data will be accurate, reviewable, and reproducible—and articulate with even their most nuanced and elaborate theories about the social world.

TagWorks is one of a kind. Big tech companies like Figure Eight sometimes offer to build their customers annotation tools over the course of 4-6 months at great expense. (The folks at LightTag help you see the prohibitive cost of building even simple tools). But only TagWorks is optimized for efficient assembly line data flows and automated crowd task management. And only TagWorks provides interfaces specifically designed to guide non-expert crowd workers in the production of highly accurate data without face-to-face training. 

this is a screenshot of the TagWorks homepage

Available now

For researchers who want high-quality data labels at a high-scale, the era of tradeoffs is over. You can label as many documents as you have with as many tags as you want. You can go deep. And you can go big. You can finally know and index everything important about your giant archive of documents. In a matter of months you could have the world’s most enriched archive in your discipline, or create the best labeled training data set producing the best text classification algorithm in your domain. To receive TagWorks’ free hybrid text analysis tips or schedule a free consultation to learn more about how you can advance the frontier in your field, email us at office@thusly.co or go to https://tag.works/get-started to get started.



Nick Adams is an expert in social science methods and natural language processing, and the CEO of Thusly Inc., which provides TagWorks as a service. He holds a doctorate in sociology from the University of California, Berkeley and is the founder and Chief Scientist of the Goodly Labs, a tech for social good non-profit based in Oakland, CA.

View all posts by Nick Adams

Related Articles

NAS Report Examines Nexus of AI and Workplace
Bookshelf
December 20, 2024

NAS Report Examines Nexus of AI and Workplace

Read Now
When Do You Need to Trust a GenAI’s Input to Your Innovation Process?
Business and Management INK
December 13, 2024

When Do You Need to Trust a GenAI’s Input to Your Innovation Process?

Read Now
The Authors of ‘Artificial Intelligence and Work’ on Future Risk
Innovation
December 4, 2024

The Authors of ‘Artificial Intelligence and Work’ on Future Risk

Read Now
Beware! AI Can Lie.
Innovation
December 3, 2024

Beware! AI Can Lie.

Read Now
Canada’s Storytellers Challenge Seeks Compelling Narratives About Student Research

Canada’s Storytellers Challenge Seeks Compelling Narratives About Student Research

“We are, as a species, addicted to story,” says English professor Jonathan Gottschall in his book, The Storytelling Animal. “Even when the […]

Read Now
Our Open-Source Tool Allows AI-Assisted Qualitative Research at Scale

Our Open-Source Tool Allows AI-Assisted Qualitative Research at Scale

The interactional skill of large language models enables them to carry out qualitative research interviews at speed and scale. Demonstrating the ability of these new techniques in a range of qualitative enquiries, Friedrich Geiecke and Xavier Jaravel, present a new open source platform to support this new form of qualitative research.

Read Now
This Anthropology Course Looks at Built Environment From Animal Perspective

This Anthropology Course Looks at Built Environment From Animal Perspective

Title of course: Space/Power/Species What prompted the idea for the course? A few years ago, I came across the architect Joyce Hwang’s […]

Read Now
0 0 votes
Article Rating
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Newest
Oldest Most Voted
Inline Feedbacks
View all comments