“Everything Not Saved Will Be Lost.” –Nintendo “Quit Screen” Message
In this post, authors Richard F.J. Haans and Marc J. Mertens reflect on the inspiration behind their research article, “The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data,” published in Organizational Research Methods.
Websites are more than just digital storefronts. They are vital for engaging with customers, attracting talent, appealing to investors, and communicating with other stakeholders. The richness of websites also makes them a potential treasure trove for researchers interested in understanding organizations and their leadership. Additionally, websites are dynamic. They constantly evolve in design, content, and functionality—reflecting the changing priorities, strategies, and market conditions of the organizations they represent.
However, this dynamism also poses significant challenges for researchers: Websites are regularly updated, and older content gets removed or replaced as a result—preventing researchers from leveraging these valuable data unless they happen to have saved them in time.
This is where the Wayback Machine, a digital archive of the World Wide Web operated by the Internet Archive, comes into play. By capturing and preserving billions of webpages as they appeared at various points in time, the Wayback Machine allows researchers to retrieve historical snapshots of websites. This offers a solution to the problem of data loss over time.
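To make this concrete, the sketch below shows one way historical snapshots can be listed and retrieved programmatically via the Wayback Machine's public CDX API. This is a minimal illustration, not the codebase from the article; the domain and date range are placeholders.

```python
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def list_snapshots(url, from_year="1996", to_year="2020"):
    """Return (timestamp, original_url) pairs for archived captures of `url`."""
    params = {
        "url": url,
        "from": from_year,           # restrict to the study window
        "to": to_year,
        "output": "json",            # first row of the response is a header
        "filter": "statuscode:200",  # keep successful captures only
        "fl": "timestamp,original",
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    return [tuple(row) for row in rows[1:]] if rows else []

def fetch_snapshot(timestamp, url):
    """Download the raw archived page; the `id_` flag strips the Wayback banner."""
    snapshot_url = f"http://web.archive.org/web/{timestamp}id_/{url}"
    return requests.get(snapshot_url, timeout=30).text

# Example usage with a placeholder domain:
# for ts, original in list_snapshots("example.com"):
#     html = fetch_snapshot(ts, original)
```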
Yet, despite its promise, the Wayback Machine remains underutilized in organization and management research. This motivated us to systematically assess the quality and coverage of the Wayback Machine and develop innovative methods and tools to access and analyze its data. In our paper published in Organizational Research Methods, we introduce an open-source codebase that facilitates high-volume access to historical website data of organizations using the Wayback Machine. Specifically, we lay out a comprehensive, four-step tutorial—complete with code and best-practice examples—enabling researchers to systematically collect and utilize longitudinal website data.
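As a rough illustration of what longitudinal collection can look like in practice (a sketch under our own assumptions here, not the four-step procedure laid out in the article), one capture per firm per year can be requested by collapsing the CDX index on the year portion of the timestamp. The firm list below is hypothetical.

```python
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def yearly_snapshots(domain, from_year="1996", to_year="2020"):
    """One capture per year for `domain`, collapsing on the first 4 timestamp digits."""
    params = {
        "url": domain,
        "from": from_year,
        "to": to_year,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:4",   # keep only the first capture of each year
        "fl": "timestamp,original",
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    return [tuple(row) for row in rows[1:]] if rows else []

# Hypothetical domains; a real application would read them from firm records.
firms = ["example.com", "example.org"]
panel = {firm: yearly_snapshots(firm) for firm in firms}
```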
We also created the CompuCrawl database, a freely accessible dataset featuring the historical websites of over 11,000 North American firms listed in Compustat, spanning 1996 to 2020. With more than 1.6 million webpages, CompuCrawl serves as a powerful resource for future organizational research.
In conclusion, while websites are a rich and largely untapped data source for organizational research, the challenges of accessing and preserving historical content have limited their use. Archives like the Wayback Machine, in combination with the methodologies presented in our paper, open up new avenues for longitudinal research that can deepen our understanding of organizational dynamics over time. To dive deeper into our approach, explore the codebase, and access the CompuCrawl database, we invite you to read our article in Organizational Research Methods and visit our project website at https://haans-mertens.github.io/.