A New Sociological Approach to Big Data
IBM estimates that, via retail transactions, Google searches, tweets and countless other prosaic activities, we are creating 2.5 billion gigabytes of data a day. This is ‘Big Data’: a catch-all, superficial and problematic term for unwieldy datasets that require substantial computing power to curate and put to constructive use. Although its slippery definition is relevant to this blog post, problematising the concept of Big Data is for another time. This post begins with the methodological enthusiasm and scepticism that coexist within the contemporary sociological dialogue about these large, potentially instructive datasets. Enthusiasts describe how Big Data offers the possibility of access to interactions, behaviours and opinions at a scale beyond the dreams of sociologists who had to settle for representative sampling. Sceptics, meanwhile, warn of being distracted from more profound challenges, or of being seduced by the promise of Big Data into, for example, inferring causation from correlation.
A new paper in Sociology argues that at the heart of sociology’s equivocal stance towards Big Data is an absence of methods that can “explore the proportionality, dynamism and relationality” that make Big Data so alluring. Using Twitter as a source, its authors show that previous approaches, by “sampling users or tweets according to a priori criteria, external to the data themselves”, have imposed an external structure on Big Data. Paradoxically, these approaches have therefore neutralised one of Big Data’s most attractive affordances: a vision of how “data emerges dynamically at scale”.
Southampton University’s Tinati, Halford, Carr and Pope describe a new digital tool that “can follow the emergent flow of information”. In this case, that means “what is tweeted, retweeted and hashtagged, and the evolving networks that form and reform between people over time”. Crucially, this method allows its exponents “to show how specific pieces of information flow and how the incremental actions of individual users produce social roles and networks”. To demonstrate their claims, the authors present a case study that captures the Twitter activity generated in November 2011 by students protesting against the UK coalition government’s decision to increase English students’ tuition fees. Because all actors were rendered, not just a pre-selected cohort, the paper’s analysis shows how the network materialised and which “actors and information emerged as important over time”. Significant tweeters emerged who, had the network been sampled, might have remained invisible. This “shows for the first time how specific pieces of information flow and how the incremental actions of individual users produce social roles and networks inside Twitter”.
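To make the contrast between following the whole flow and sampling a pre-selected cohort concrete, here is a minimal Python sketch. It is not the authors’ tool, and it makes several assumptions: the retweet records, the user names, the hourly time steps and the function names are all invented for illustration. It simply counts cumulative retweets per author over time, then repeats the count on a cohort chosen in advance.

```python
from collections import Counter

# Hypothetical retweet events: (hour, retweeting_user, original_author).
# These records are invented for illustration; they are not from the paper.
EVENTS = [
    (1, "bob", "alice"),
    (1, "carol", "alice"),
    (2, "dave", "erin"),
    (3, "bob", "erin"),
    (3, "frank", "erin"),
    (3, "carol", "erin"),
]


def retweet_counts_over_time(events):
    """Return {hour: Counter of cumulative retweets received per author}."""
    cumulative = Counter()
    snapshots = {}
    for hour, _retweeter, author in sorted(events):
        cumulative[author] += 1
        # Overwritten until the last event of each hour has been counted.
        snapshots[hour] = cumulative.copy()
    return snapshots


def most_retweeted(snapshots):
    """Return {hour: the author with the most cumulative retweets so far}."""
    return {
        hour: counts.most_common(1)[0][0]
        for hour, counts in snapshots.items()
    }


def sample_by_cohort(events, cohort):
    """Keep only retweets of authors chosen in advance (a priori sampling)."""
    return [event for event in events if event[2] in cohort]


# Following every actor: "erin" overtakes "alice" by hour 3.
leaders = most_retweeted(retweet_counts_over_time(EVENTS))

# Sampling a pre-selected cohort: "erin" never appears at all.
cohort = {"alice", "bob", "carol"}
sampled_leaders = most_retweeted(
    retweet_counts_over_time(sample_by_cohort(EVENTS, cohort))
)

print(leaders)
print(sampled_leaders)
```

In this toy data the most-retweeted author changes as events accumulate, and the cohort-based sample never sees the actor who ends up most prominent. A real analysis would of course work with far richer data than a retweet count.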
After describing further possible applications for their method (“the same principles and method could be applied to any web-based system of dynamic information diffusion from emails, to Youtube, Facebook or Flickr”), the authors acknowledge that “retweets only offer one way to explore the exposure of information within a communications network”. In response, they conclude by advocating a “wide data” approach, which contemplates traversing boundaries between data sources such as open social networks, Google, government and corporate data, and even dissolving the distinction between offline and online to access ‘traditional’ data sources such as print and broadcast media.
A colleague from the Interdisciplinary Research Unit in Web Science Lebanon, Stéphane Bazan, helped illustrate this final point. He said that if you want to learn anything about Lebanese communication networks from Twitter, you have to know that, despite deeply identifying with their country, 3 out of 4 Lebanese live and work abroad. Moreover, Lebanese political affiliation is determined by religious loyalties; jobs from president to police officer are distributed according to religion. Lebanese, he said, “think and react according to religion”. A study of how and why information spreads in this context would need to exploit all forms of available data to account for how complex systems of power and patronage within 18 religious communities could be represented in seemingly innocuous acts. For example, it’s a sobering thought that users, in this context, may disappear from emergent networks after being arrested for demonstrating their ‘misplaced’ loyalty by retweeting. Data from Twitter alone would tell us little about this network; however, a more expansive, mixed-methods approach to data collection would get us closer to the truth.