Whither Small Data?
The distinction between small and big data is a recent one. Prior to 2008, data were rarely considered in terms of being ‘small’ or ‘big’. All data were, in effect, what is now sometimes referred to as ‘small data’, regardless of their volume. Due to factors such as cost, resourcing, and the difficulties of generating, processing, analyzing and storing data, limited volumes of high quality data were produced through carefully designed studies using sampling frameworks calculated to ensure representativeness. The small data generated by such studies are thus characterized by their generally limited volume, non-continuous collection, narrow variety, and a focus on answering specific questions.
In the last decade or so, small data have been complemented by what has been termed ‘big data’, which have very different characteristics (see the table below). Big data are large in volume, striving to be exhaustive, produced continuously, and varied in nature, although they are often a by-product of a system rather than being designed to investigate particular phenomena or processes.
Comparing small and big data

| Characteristic | Small data | Big data |
| --- | --- | --- |
| Volume | Limited to large | Very large |
| Exhaustivity | Samples | Entire populations |
| Resolution and identification | Coarse & weak to tight & strong | Tight & strong |
| Relationality | Weak to strong | Strong |
| Velocity | Slow, freeze-framed | Fast |
| Variety | Limited to wide | Wide |
| Flexibility and scalability | Low to middling | High |
The term ‘big’ in big data, then, is somewhat misleading, as big data are characterized by much more than volume. Indeed, some ‘small’ datasets can be very large in size, such as national censuses, which also seek to be exhaustive and have strong resolution (high granularity) and relationality (data fields can be linked together to derive new information). However, census datasets lack velocity (usually conducted once every ten years), variety (usually c.30 structured questions), and flexibility (once a census is set and is being administered it is all but impossible to tweak, add or remove questions, and the fields are generally fixed across censuses to enable time-series analysis). As such, census data, whilst voluminous, are quite different in nature to the big data produced by new digital traffic and building management systems, surveillance and policing systems, customer management and logistics chains, financial and payment systems, and locative and social media. For example, in 2012 Facebook reported that it was processing 2.5 billion pieces of content (links, stories, photos, news, etc.), 2.7 billion ‘Like’ actions and 300 million photo uploads per day, and Wal-Mart was generating more than 2.5 petabytes of data relating to more than 1 million customer transactions every hour.
The rapid growth and impact of big data have led some to ponder whether big data might lead to the demise of small data, or whether the stature of studies based on small data might be diminished, given their limited size and temporality and their relatively high cost. Such a framing, however, misunderstands both the nature of big data and the value of small data.
Big data may seek to be exhaustive, but as with all data they are both a representation and a sample. What data are captured is shaped by the field of view/sampling frame (not everyone belongs to Facebook or shops in Wal-Mart); the technology and platform used (different surveys, sensors, textual prompts, and layouts produce variances and biases in what data are generated); the data ontology employed (how the data are calibrated and classified); and the regulatory environment with respect to privacy, data protection and security.
Big data undoubtedly strive to be more exhaustive and provide dynamic, fine-grained insight but, nonetheless, their panoptic promise can never be fully fulfilled. Big data generally capture what is easy to ensnare: data that are openly expressed (what is typed, swiped, scanned, sensed, etc.), as well as data that are the ‘exhaust’, a by-product, of the primary task or output. Tackling a question through big data often means repurposing data that were not designed to reveal insights into a particular phenomenon, with all the attendant issues of such a maneuver, for example the risk of ecological fallacies.
In contrast, small data may be limited in volume and velocity, but they have a long history of development across science, state agencies, non-governmental organizations and business, with established methodologies and modes of analysis, and a record of producing meaningful answers. Small data studies can be much more finely tailored to answer specific research questions and to explore in detail and in depth the varied, contextual, rational and irrational ways in which people interact and make sense of the world, and how processes work. Small data can focus on specific cases and tell individual, nuanced and contextual stories. Small data studies thus seek to mine gold from working a narrow seam, whereas big data studies seek to extract nuggets through open-pit mining, scooping up and sieving huge tracts of data.
These two approaches of narrow versus open mining have consequences with respect to data quality, veracity and lineage. Given the limited sample sizes of small data, data quality (how clean [error and gap free], objective [bias free] and consistent [few discrepancies] the data are), veracity (the authenticity of the data and the extent to which they accurately [precision] and faithfully [fidelity, reliability] represent what they are meant to), and lineage (documentation that establishes provenance and fitness for use) are of paramount importance. Much work is expended on limiting sampling and methodological biases as well as ensuring that data are as rigorous and robust as possible before they are analyzed or shared.
In contrast, it has been argued by some that big data studies do not need the same standards of data quality, veracity and lineage because the exhaustive nature of the datasets removes sampling biases and more than compensates for any errors, gaps or inconsistencies in the data, or weaknesses in fidelity. The argument here is that “more trumps better.” Of course, this presumes that all uses of big data will tolerate inexactitude, when in fact many big data applications do require precision (e.g., financial data). Moreover, the warning “garbage in, garbage out” still holds: big datasets full of dirty, gamed or biased data, or data with poor fidelity, will produce analyses and conclusions that have weak validity.
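The point that exhaustivity can offset random error but not systematic bias can be illustrated with a minimal simulation sketch. The Python code and numbers below are purely hypothetical and are not drawn from any dataset discussed here; they simply show that as a sample grows the noise averages away while a built-in bias does not.

```python
import random

# Minimal sketch with hypothetical numbers: a quantity whose true mean is 50,
# observed with random noise plus a constant systematic bias of +5
# (standing in for gamed, skewed or poorly calibrated data).
TRUE_MEAN, BIAS = 50.0, 5.0

def biased_sample_mean(n: int) -> float:
    """Mean of n observations that carry both random noise and the fixed bias."""
    return sum(TRUE_MEAN + BIAS + random.gauss(0, 10) for _ in range(n)) / n

random.seed(42)
for n in (100, 10_000, 1_000_000):
    error = biased_sample_mean(n) - TRUE_MEAN
    print(f"n={n:>9,}: estimate is off by {error:+.2f}")

# The random component of the error shrinks as n grows, but the estimate
# remains roughly 5 units off however large n gets: exhaustivity does not
# remove bias, so 'more' does not trump 'better' when the data are dirty.
```

Under these assumed numbers, the reported offset settles near +5 no matter how large n becomes, which is the sense in which scale alone cannot substitute for quality and veracity.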
Moreover, with a few exceptions, such as satellite imagery and national security and policing data, big data are mainly produced by the private sector. Access is usually restricted behind paywalls and proprietary licensing, limited to ensure competitive advantage and to leverage income through sale or licensing. Indeed, it is somewhat of a paradox that only a handful of entities are drowning in the data deluge, and companies such as mobile phone operators, app developers, social media providers, financial institutions, retail chains, and surveillance and security firms are under no obligation to share freely the data they collect through their operations. Of course, many small datasets are similarly limited in use to defined personnel or available only for a fee or under license. Increasingly, however, data from public institutions and academia are becoming more openly accessible.
Despite the rapid growth and benefits of big data and associated new analytics, small data will continue to be a vital part of the research landscape: small and big data will complement one another. Mining narrow seams of high-quality data will continue alongside open-pit mining because it affords much greater control over research design and enables specific, targeted questions to be answered. However, such small data will increasingly come under pressure to utilize new archiving technologies: to be scaled up within data infrastructures such as archives and repositories so that they are preserved for future generations, become accessible for re-use and combination with other small and big data, and yield more value and insight through the application of big data analytics.
There seems little reason to worry about the demise of small data studies, then, despite the hype and hubris of big data advocates, except that research funding and attention are in danger of being disproportionately directed at developing big data solutions. Rather than overtly favoring one approach over another, funders should take note that the key concern of research is to answer specific questions as validly as possible, whether that be by using small or big data.

Acknowledgements

This blog post is extracted and reworked from the book The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences (Sage, 2014). The research was supported by a European Research Council Advanced Investigator award (ERC-2012-AdG-323636).