Business and Management INK

“Everything Not Saved Will Be Lost.” –Nintendo “Quit Screen” Message

April 8, 2025

In this post, authors Richard F.J. Haans and Marc J. Mertens reflect on the inspiration behind their research article, “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data,” published in Organizational Research Methods.

Websites are more than just digital storefronts. They are vital for engaging with customers, attracting talent, appealing to investors, and communicating with other stakeholders. The richness of websites also makes them a potential treasure trove for researchers interested in understanding organizations and their leadership. Additionally, websites are dynamic. They constantly evolve in design, content, and functionality—reflecting the changing priorities, strategies, and market conditions of the organizations they represent.

However, this dynamism also poses significant challenges for researchers: Websites are regularly updated, and older content gets removed or replaced as a result—preventing researchers from leveraging these valuable data unless they happen to have saved them in time.

This is where the Wayback Machine, a digital archive of the World Wide Web operated by the Internet Archive, comes into play. By capturing and preserving billions of webpages as they appeared at various points in time, the Wayback Machine allows researchers to retrieve historical snapshots of websites. This offers a solution to the problem of data loss over time.
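For readers who want to see what retrieving a snapshot looks like in practice, here is a minimal Python sketch. It is not taken from the authors' codebase. It assumes the Internet Archive's public availability endpoint at `archive.org/wayback/available`, which returns a JSON description of the capture closest to a requested date. The function names and `example.com` are placeholders chosen for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site, timestamp):
    """Build the query URL for the capture of `site` closest to `timestamp`.

    `timestamp` is a string of digits such as "20100101" (YYYYMMDD).
    """
    query = urlencode({"url": site, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def closest_snapshot(payload):
    """Return (snapshot_url, snapshot_timestamp) from a decoded JSON reply,
    or None when the archive reports no usable capture."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(site, timestamp):
    """Query the archive over the network (hypothetical helper)."""
    with urlopen(build_availability_url(site, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20100101")` would ask the archive for the capture nearest to January 1, 2010. The parsing step is kept separate from the network call so that it can be checked without an internet connection.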

Yet, despite its promise, the Wayback Machine remains underutilized in organization and management research. This motivated us to systematically assess the quality and coverage of the Wayback Machine and develop innovative methods and tools to access and analyze its data. In our paper published in Organizational Research Methods, we introduce an open-source codebase that facilitates high-volume access to historical website data of organizations using the Wayback Machine. Specifically, we lay out a comprehensive, four-step tutorial—complete with code and best-practice examples—enabling researchers to systematically collect and utilize longitudinal website data.
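To give a flavor of what collecting longitudinal data can involve, the sketch below lists roughly one archived capture per year for a single site. It is an illustration only and does not reproduce the four steps or the codebase described in the paper. It assumes the Wayback Machine's CDX server endpoint and its `from`, `to`, `filter`, `collapse`, and `output` parameters. The function names are hypothetical.

```python
from urllib.parse import urlencode

# CDX server endpoint, which indexes the captures held by the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site, start_year, end_year):
    """Build a CDX query for successful captures of `site`, keeping only
    the first capture in each calendar year."""
    params = {
        "url": site,
        "output": "json",
        "from": str(start_year),
        "to": str(end_year),
        "filter": "statuscode:200",   # keep only captures that returned HTTP 200
        "collapse": "timestamp:4",    # first 4 timestamp digits = the year
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(rows):
    """Turn a decoded CDX JSON reply into replayable snapshot URLs.

    With output=json the first row is a header naming the columns and
    every following row describes one capture.
    """
    if not rows:
        return []
    header, *records = rows
    ts_col = header.index("timestamp")
    original_col = header.index("original")
    return [
        f"https://web.archive.org/web/{record[ts_col]}/{record[original_col]}"
        for record in records
    ]
```

Fetching the URL returned by `build_cdx_url` and passing the decoded JSON to `snapshot_urls` would yield one address per year that can then be downloaded. A study covering thousands of firms would also need rate limiting, retries, and error handling, which are left out here.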

We also created the CompuCrawl database, a freely accessible dataset featuring historical websites of over 11,000 North American firms listed in Compustat, spanning from 1996 to 2020. The CompuCrawl database contains more than 1.6 million webpages, serving as a powerful resource for future organizational research.

In conclusion, while websites are a rich and largely untapped data source for organizational research, the challenges of accessing and preserving historical content have limited their use. Archives like the Wayback Machine, in combination with the methodologies presented in our paper, open up new avenues for longitudinal research that can deepen our understanding of organizational dynamics over time. To dive deeper into our approach, explore the codebase, and access the CompuCrawl database, we invite you to read our article in Organizational Research Methods and visit our project website at https://haans-mertens.github.io/.

Richard F.J. Haans (PhD) is the Director of Full-time Doctoral Education and Associate Professor at Rotterdam School of Management, Erasmus University. He received his PhD from Tilburg University and has research interests in competitive dynamics and methodological advances. Marc J. Mertens is a research associate and PhD candidate at the University of Mannheim, Germany. He has research interests in stakeholder strategy, financial activism, impression management theory and optimal distinctiveness theory. He also recently received the Best Paper Award at the International Conference of the Global Research Foundation for Corporate Governance.

