“Everything Not Saved Will Be Lost.” –Nintendo “Quit Screen” Message
In this post, authors Richard F.J. Haans and Marc J. Mertens reflect on the inspiration behind their research article, “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data,” published in Organizational Research Methods.
Websites are more than just digital storefronts. They are vital for engaging with customers, attracting talent, appealing to investors, and communicating with other stakeholders. The richness of websites also makes them a potential treasure trove for researchers interested in understanding organizations and their leadership. Additionally, websites are dynamic. They constantly evolve in design, content, and functionality—reflecting the changing priorities, strategies, and market conditions of the organizations they represent.
However, this dynamism also poses significant challenges for researchers: Websites are regularly updated, and older content gets removed or replaced as a result—preventing researchers from leveraging these valuable data unless they happen to have saved them in time.
This is where the Wayback Machine, a digital archive of the World Wide Web operated by the Internet Archive, comes into play. By capturing and preserving billions of webpages as they appeared at various points in time, the Wayback Machine allows researchers to retrieve historical snapshots of websites. This offers a solution to the problem of data loss over time.
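To make this concrete, the sketch below shows one way historical snapshots can be listed and retrieved programmatically via the Wayback Machine's public CDX API. This is a minimal illustration, not the codebase from the article; the domain and date range are placeholders.

```python
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def list_snapshots(url, from_year="1996", to_year="2020"):
    """Return (timestamp, original_url) pairs for archived captures of `url`."""
    params = {
        "url": url,
        "from": from_year,           # restrict to the study window
        "to": to_year,
        "output": "json",            # first row of the response is a header
        "filter": "statuscode:200",  # keep successful captures only
        "fl": "timestamp,original",
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    return [tuple(row) for row in rows[1:]] if rows else []

def fetch_snapshot(timestamp, url):
    """Download the raw archived page; the `id_` flag strips the Wayback banner."""
    snapshot_url = f"http://web.archive.org/web/{timestamp}id_/{url}"
    return requests.get(snapshot_url, timeout=30).text

# Example usage with a placeholder domain:
# for ts, original in list_snapshots("example.com"):
#     html = fetch_snapshot(ts, original)
```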
Yet, despite its promise, the Wayback Machine remains underutilized in organization and management research. This motivated us to systematically assess the quality and coverage of the Wayback Machine and develop innovative methods and tools to access and analyze its data. In our paper published in Organizational Research Methods, we introduce an open-source codebase that facilitates high-volume access to historical website data of organizations using the Wayback Machine. Specifically, we lay out a comprehensive, four-step tutorial—complete with code and best-practice examples—enabling researchers to systematically collect and utilize longitudinal website data.
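As a rough illustration of what longitudinal collection can look like in practice (a sketch under our own assumptions here, not the four-step procedure laid out in the article), one capture per firm per year can be requested by collapsing the CDX index on the year portion of the timestamp. The firm list below is hypothetical.

```python
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def yearly_snapshots(domain, from_year="1996", to_year="2020"):
    """One capture per year for `domain`, collapsing on the first 4 timestamp digits."""
    params = {
        "url": domain,
        "from": from_year,
        "to": to_year,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:4",   # keep only the first capture of each year
        "fl": "timestamp,original",
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    return [tuple(row) for row in rows[1:]] if rows else []

# Hypothetical domains; a real application would read them from firm records.
firms = ["example.com", "example.org"]
panel = {firm: yearly_snapshots(firm) for firm in firms}
```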
We also created the CompuCrawl database, a freely accessible dataset featuring the historical websites of over 11,000 North American firms listed in Compustat, spanning 1996 to 2020. With more than 1.6 million webpages, CompuCrawl serves as a powerful resource for future organizational research.
In conclusion, while websites are a rich and largely untapped data source for organizational research, the challenges of accessing and preserving historical content have limited their use. Archives like the Wayback Machine, in combination with the methodologies presented in our paper, open up new avenues for longitudinal research that can deepen our understanding of organizational dynamics over time. To dive deeper into our approach, explore the codebase, and access the CompuCrawl database, we invite you to read our article in Organizational Research Methods and visit our project website at https://haans-mertens.github.io/.