AI News, Beyond Just “Big” Data

Beyond Just “Big” Data

First, although big data—those massive amounts of information that require special techniques to store, search, and analyze—remains a thriving and much-discussed area, it’s no longer the new kid on the data block.

Wrangling even petabyte-size data sets (a petabyte is 1,000 terabytes) and data lakes (data stored and readily accessible in its pure, unprocessed state) are tasks for professionals, so not only are listings for big-data-related jobs thick on the ground but the job titles themselves now display a pleasing variety: companies are looking for data architects (specialists in building data models), data custodians and data stewards (who manage data sources), data visualizers (who can translate data into visual form), data change agents and data explorers (who change how a company does business based on analyzing company data), and even data frackers (who use enhanced or hidden measures to extract or obtain data).

Now there is also thick data (which combines both quantitative and qualitative analysis), long data (which extends back in time hundreds or thousands of years), hot data (which is used constantly, meaning it must be easily and quickly accessible), and cold data (which is used relatively infrequently, so it can be less readily available).
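To make the hot/cold distinction concrete, here is a minimal sketch (hypothetical threshold, names, and numbers, not from the article) of how a storage system might route data to fast or cheap tiers based on recent access counts.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    accesses_last_30_days: int   # recent read frequency

def storage_tier(ds: Dataset, hot_threshold: int = 100) -> str:
    """Route frequently used ('hot') data to fast storage and rarely used
    ('cold') data to cheaper, slower storage. The threshold is illustrative."""
    return "hot" if ds.accesses_last_30_days >= hot_threshold else "cold"

for ds in (Dataset("clickstream", 5_000), Dataset("2009_audit_logs", 3)):
    print(ds.name, "->", storage_tier(ds))
```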

On Brontobyte Data and Other Big Words | What's The Big Data?

Source: Datafloq | Paul McFedries in IEEE Spectrum

We're all data geeks now

When Gartner released its annual Hype Cycle for Emerging Technologies for 2014, it was interesting to note that big data was now located on the downslope from the 'Peak of Inflated Expectations,' while the Internet of Things (often shortened to IoT) was right at the peak, and data science was on the upslope.

In 2008, Chris Anderson (2008), at that time the Editor–in–Chief of Wired, proposed that in the age of the petabyte, there was no longer any need for the scientific method, nor for models or theories.

His earlier paper on ‘The long tail’ (Anderson, 2004) has proved highly influential, and the term itself has now entered the standard lexicon of many disciplines — particularly marketing, microfinance, business modelling, innovation, and social networking [2].

Anderson, on the other hand, looks to the advent of Google — the specific company as well as the generic phenomenon — as exemplifying the process whereby models become obsolete: Once we have all the data there will be no more need for models, since they are at best incomplete.

Anderson seems to assume that once all the data is collected it will be a fairly straightforward step to move seamlessly to ‘correct’, and presumably useful, conclusions via a range of computer–based and computational applications targeted on the data.

This rather glosses over the manner and extent to which a complete set of data renders models or other forms of explanation obsolete: Anderson assumes this is the case, but there is the contrary argument — one with which the current authors would concur — that with a plethora of data there is often a greater demand for some form of model or abstraction.

In part this is to avoid being overwhelmed by the detail of it all, but also because explanation and understanding intrinsically necessitate the use of forms of abstraction — i.e., models which are developed in order to provide a focus on some aspects of the data, at the expense of others.

Only in this way can any findings from the data actually be incorporated into our actions and strategies — i.e., be of actual use: Something that Helles and Jensen (2013) stress in their editorial where they encourage responses to the ‘invitation to scholarly and social dialogues about the data that we all make as communicators, citizens, and consumers’ (stress added).

But before moving on to that it is important to recognize that Anderson’s argument does encompass an important insight into the possibilities opened up with the dawning of what he termed ‘The Petabyte Age’, but which is now referred to as ‘The Age of Big Data’ — i.e., the ability to derive patterns and correlations from huge data resources which can themselves prompt further investigation.

We offer a brief and highly specific overview of the history of scientific theory generation with a view to examining the ways in which the availability of massive data resources can be utilized to generate new theories: Affording an additional avenue for research and conceptualization, rather than a strategy that is seen as superseding the need for any models or theories at all.

What actually constitutes the scientific method, however, is a highly fraught issue, so it is not too surprising that at certain times the Wikipedia entry for ‘scientific method’ contains the admonition ‘editing of this article by new or unregistered users is currently disabled due to vandalism’.

On the one hand there have been those who take their lead from Aristotle and Francis Bacon who advocated the collection of large quantities of data, followed by exploration for patterns or regularities from which theories or hypotheses might then be derived, based on induction — i.e., moving from a set of specific observations towards a more general conclusion.

Boyle’s advocacy of public experiment was severely criticized by Hobbes, who objected to the idea of arriving at truth by consensus — i.e., the consensus of those who attended the public experiments — and, moreover, a truth couched in probabilistic statements rather than definitive ones.

Clearly in the long term Boyle’s empiricism and experimental method won out, but the contentious nature of any attempts to generalize specific methods to a universal epistemology remains, particularly in the light of the onslaught that emerged in the later decades of the twentieth century from the writings of Karl Popper (1959), Thomas Kuhn (1970) and Richard Rorty (1979) amongst many others.

Anderson’s argument is particularly vulnerable in its identification of scientific method with what, in another context, has been termed ‘naive Baconian inductivism’ — i.e., the idea that stacking up vast amounts of data, observations or the like will lead to increasingly better (more complete, more certain, even definitive) knowledge.

The weakness of this approach is something that should already be understood to some extent given the experience of the advent of such technologies as management information systems [MIS] in the 1960s and 1970s, which, it was claimed, would make management decision–making ever more effective, as more information was made available to the decision–makers.

Already in the 1960s Russell Ackoff (1967) characterized and criticized this misconception in his classic paper ‘Management misinformation systems’, although a fairly mundane and uncritical view of ‘rational behaviour’ continues to be a common assumption in many areas.

Our views are broadly in line with those expressed by the respondents in the Edge discussion on Anderson’s article (Dyson, et al., 2008), seeing it as a mix of provocation, together with a mistaken view of the scientific method, but justifiably making the case for new forms of coping with and understanding the potentialities of massive collections of digitized data.

Moreover in the interim the claims for and understanding of Big Data have developed, so that it is largely accepted that the skills needed to operate in this field include not just technical skills and expertise centred on analytic tools developed for specialist applications like astronomy (SKYCAT), fraud detection (HNC Falcon and Nestor PRISM), and financial transactions (various), but also the ability to present the outputs visually and also to understand the questions to pose in the first place.

‘Yes, someone will still need to know which questions to ask of the data, but the hard–core science of it should be rendered simpler by applications.’ (Asay, 2013a) We take up Anderson’s argument and its aftermath, particularly developing the view that although Big Data offers significant opportunities, it does not preclude the necessity for insight and innovation.

Coincidentally, the day BBC reported on this research (12 March 2013) was also the date on which a report from Her Majesty’s Inspectorate of Constabulary noted that a failure to share intelligence allowed a well–known celebrity to avoid arrest on a number of reports of sexual harassment — and worse — which were never linked (Casciani, 2013).

Although initially this trend might appear to amount simply to a shift of content from one location to another — albeit with different formats for control and innovative protocols and processes — it quickly becomes evident that the ramifications are substantive and significant.

In general parlance the word theory can be understood as either something akin to a hunch — ‘it’s only a theory’, ‘it’s just my theory’ — or as referring to a well–structured and robust basis for explanation, prediction, and control — the theory of gravity, helio–centrism, planned behaviour, or plate tectonics would be candidates in this sense.

But for present purposes it is sufficient to note that our views are founded on John Dewey’s (1930) pragmatism which views theories, models, concepts, or any other form of explanation in terms of usefulness rather than criteria pertaining to veracity or formality.

In other words this conjunction of data, computer–based analytic tools, and the skills and insights of specially trained and experienced people has resulted, in some cases, in the development of new and improved theories and insights, enhanced levels of understanding, and more effective policies and interventions.

In less extreme cases it also applies to gambling, for instance where people study the sequence of numbers that come up in successive turns of a roulette wheel, believing that once they understand the pattern of results they can crack the system: in reality each spin of the wheel is entirely independent of preceding or succeeding ones.
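The independence claim is easy to check with a quick simulation; the minimal sketch below (illustrative only) shows that a ‘system’ based on the previous spin does no better than chance.

```python
import random

random.seed(42)
spins = [random.randint(0, 36) for _ in range(100_000)]  # European wheel: pockets 0-36

# Naive "system": after seeing number k, bet that the next spin repeats it.
hits = sum(1 for prev, nxt in zip(spins, spins[1:]) if prev == nxt)
print(f"repeat rate: {hits / (len(spins) - 1):.4f}  (chance level = {1/37:.4f})")
```

The observed repeat rate hovers around 1/37 regardless of how long the history is, which is exactly what independence of spins implies.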

Anderson may have achieved renown on the basis of his paper on ‘The long tail’, but the person often regarded as ‘the father of the long tail’ is Benoit Mandelbrot, and his work and career offer useful counterpoints to claims of the type that are made by Anderson and other proponents of Big Data.

This led him to write his now classic paper ‘The variation of certain speculative prices’ (Mandelbrot, 1963), which offered completely new ways of analyzing data — leading to the concepts of ‘long tails’, ‘fat tails’, fractals, and roughness.

He refers to Pasteur’s apothegm to the effect that chance favours the prepared mind, but adds that ‘I also think that my long string of lucky breaks can be credited to my mode of paying attention: I look at funny things and never hesitate to ask questions’ (stress added).

As such the age of Big Data continues to demand that all researchers and analysts engage in processes of modelling and theory generation, continually offering further bases for establishing rigor and relevance in their theoretical and conceptual development.

In effect these data sets are another, and increasingly important, resource for developing our insights, rather than something that effectively displaces existing approaches: Indeed there is something of a renewed necessity in encouraging the skills and tools that might lead to these conceptual developments and outcomes.

Ultimately we see this paper as a contribution to age–old questions around the issue of the source and nature of knowledge, and the status of knowledge claims: topics that have taken on a new resonance in the light of the growth of the internet and in particular in the age of Big Data.

In this regard DM is a component of the KDD process, providing the means and techniques ‘to extract and enumerate patterns from the data according to the specifications of measures and thresholds, using databases together along with pre-processing, sub–sampling and transformations of the data’ (Fayyad, et al., 1996).
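As a minimal, generic illustration of extracting patterns ‘according to the specifications of measures and thresholds’ (a toy stand-in, not Fayyad et al.’s own algorithm), the sketch below counts co-occurring items in a handful of invented transactions and keeps only the pairs whose support clears an analyst-chosen threshold.

```python
from itertools import combinations
from collections import Counter

# Invented transaction data for illustration only.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"beer", "bread"},
]

min_support = 0.5  # the 'threshold' is chosen by the analyst, not given by the data

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

patterns = {pair: n / len(transactions)
            for pair, n in pair_counts.items()
            if n / len(transactions) >= min_support}
print(patterns)  # -> {('bread', 'butter'): 0.5, ('bread', 'milk'): 0.5}
```

The point of the toy is simply that the ‘patterns’ which come out depend directly on the measures and thresholds that a human puts in.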

Although many standard texts portray the relationship between data and information as that between raw material and processed product, this ‘chemical engineering metaphor’ has long been subject to criticism emanating from a semiotic or semantic perspective that refuses to endorse the distinction between data and information expressed in many standard texts (see Bryant, 2006, for an extended critique).

It is somewhat disappointing to note that despite the longstanding and serious nature of these forms of criticism, many scholarly papers and core texts remain stubbornly resistant even to acknowledging the contested nature of these claims.

In her paper for the recent First Monday special issue on Big Data, Markham (2013) offers a refreshing critique of the term, invoking the work of Geertz who is quoted to the effect that ‘what we call our data are really our own constructions of other people’s constructions of what they and their compatriots are up to’ (stress added), and Bowker who argued that the term ‘raw data’ is an oxymoron.

On the other hand the metaphorical resonances of the term ‘data mining’ may now in the digital age have some considerable ‘grab’ given the ways in which vast stores of data can be searched and analyzed using current technologies.

Mining may well be a reasonable label for these processes, although it must be understood that only a limited range of such activities centres on data processing performed by silicon–based entities, and these must be initiated, guided, and supplemented by meaning–oriented actions of carbon–based entities.

So it may be valid to argue that the existence of massive data sets affords additional enhancements and opportunities for analysis and investigation, and that to some extent standard statistical models that try to take account of sampling are not always relevant when one has the entire population at hand.

Investigation and research have always employed a wide variety of resources, including documents, observations, and various forms of field studies and participative activities; now, in addition, we have access to the expanding panoply of digital resources, including e–mail, blogs, Facebook, and Twitter.

Moreover the aim of investigation and research is not simply to report upon or transcribe reality, but to derive patterns and offer critical and innovative insights, including explanations aimed at responding to ‘why?’ and ‘how?’ type questions.

An example of Big Data analytics, and its wider ramifications: Culturomics 2.0

In 2011 First Monday published a paper by Kalev Leetaru (2011) offering a contribution to what he terms the ‘emerging field of Culturomics’ [7], which ‘seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society’.

More importantly for our present purposes his overall strategy provides the basis for a discussion of the ways in which the analyses of and findings from Big Data need to be evaluated, critiqued, and challenged rather than received and accepted purely and simply as the result of applying non–controversial computational analysis to a massive set of data — in this case amounting to several million news articles, several billion words.

His work is premised on the assertion that news reports ‘contain far more than just factual details: an array of cultural and contextual influences strongly impact how events are framed for an outlet’s audience, offering a window into national consciousness’.

He characterizes his data domain in noting that ‘accurately measuring the local press in nearly every country of the world requires a data source that continuously monitors domestic print, Internet, and broadcast media worldwide in their vernacular languages and delivers it as a uniform daily translated compilation’.

As the issue of Big Data has developed, KDD has evolved in terms of associated tools and techniques, but essentially it encompasses the way in which massive data sources can be used as the basis from which to derive patterns and models, often with a commercial interest guiding the agenda — hence the link with business intelligence.

The key features can be summarized as follows:

- Developing an understanding of the application domain
- Creating target data sets
- Data cleaning or pre–processing
- Data reduction and projection
- Data mining
- Interpretation of results

Developing an understanding of the application domain

Leetaru can be seen to have oriented his work around an interest in Culturomics wedded to a view that interrogation of large text archives provides a basis for ‘insights to the functioning of society, including predicting future economic events’.
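Read schematically, these stages form a pipeline; the runnable sketch below (all records, field names, values, and helper functions are invented for illustration) walks a few fake news records through each stage in miniature.

```python
# Minimal, illustrative walk-through of the KDD stages listed above.
raw_source = [
    {"id": 1, "country": "EG", "tone": -3.2},
    {"id": 2, "country": "EG", "tone": -1.1},
    {"id": 3, "country": "FR", "tone": None},
]

def create_target_dataset(rows, country):        # creating target data sets
    return [r for r in rows if r["country"] == country]

def preprocess(rows):                            # data cleaning / pre-processing
    return [r for r in rows if r["tone"] is not None]

def reduce_and_project(rows):                    # data reduction and projection
    return [r["tone"] for r in rows]

def mine(tones):                                 # data mining (here: a trivial summary statistic)
    return sum(tones) / len(tones) if tones else None

def interpret(result):                           # interpretation of results (by a human)
    return f"mean tone = {result:.2f}" if result is not None else "no usable data"

target = create_target_dataset(raw_source, "EG")
print(interpret(mine(reduce_and_project(preprocess(target)))))   # -> mean tone = -2.15
```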

Leetaru offers some account of this last aspect, and his Web crawl and later analyses using ‘two well–known tonal dictionaries’ can in part be seen as a strategy for coping with outliers and missing values, although there might be some doubts regarding his coverage of non–English and non–Web sources.
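To make the notion of a tonal dictionary concrete, the toy scorer below uses tiny invented word lists (far smaller and cruder than the dictionaries Leetaru refers to) to assign a tone to a piece of text.

```python
# Toy tonal-dictionary scoring; word lists invented for illustration only.
POSITIVE = {"peace", "growth", "agreement", "stable"}
NEGATIVE = {"crisis", "protest", "violence", "collapse"}

def tone(text: str) -> float:
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if not words else (pos - neg) / len(words)

print(tone("Protest and violence follow economic collapse"))   # negative tone
print(tone("Leaders reach agreement on stable growth"))         # positive tone
```

Even this caricature makes the point that the ‘tone’ of an archive is a function of which words a human chose to put in each list.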

In fact both KDD and GTM can be seen as instantiations of hermeneutics — the GTM approach of iteration between data gathering and analysis is akin to the hermeneutic circle in which our understanding of certain detailed aspects is dependent on our understanding of the whole, which is itself dependent on understanding the details.

Bryant and Charmaz (2007) offer a succinct characterization of the method: Most fundamentally, grounded theory methods emphasize analyzing data and entail an iterative process of simultaneous data collection and analysis through which each informs and focuses the other.

Certainly Glaser and Strauss appeared to propose it as ‘inductive’ and ‘grounded in the data’ in their early work, as they were determined to offer a distinct alternative to the form of hypothetico–deductive model which they saw as prevalent amongst social researchers in the U.S. in the 1960s.

The key features of the method can be summarized under the following headings [10]:

- An initial interest in a problem domain or context
- An open but purposive sampling strategy in the earliest stages
- Simultaneous and iterative data collection and analysis
- Construction of various higher level abstractions — codes and categories in the parlance of GTM — derived from examination of the data, and not from previously derived theories or logical categories
- Repeated sampling and analysis in order to perform constant and repeated comparisons of the data in order to develop theoretical concepts and abstractions
- Selection of one or more specific concepts for further development
- Application of the selected concepts for use in a more deliberate manner against the context and appropriate data — theoretical sampling
- Articulation of theoretical statements and constructs offering a substantive account of some aspects of the initial context

We shall refer to all of these in what follows — offering an alternative approach for data–driven investigation and analysis, whether aimed at Big Data or more limited resources.
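Read purely procedurally, the iterative core of the list above can be caricatured as a loop of collection, coding, and comparison; the sketch below is only a schematic reading (all names and the ‘saturation’ test are placeholders), not a suggestion that GTM coding can be automated.

```python
# Deliberately simplistic rendering of the GTM loop above; real GTM coding is
# an interpretive, researcher-driven activity, not a mechanical procedure.
incidents = ["interview_1", "interview_2", "fieldnote_1", "fieldnote_2"]

def collect(sample_frame):
    return sample_frame.pop(0) if sample_frame else None   # gather the next slice of data

def code(incident, codes):
    codes.setdefault(f"code_for_{incident}", []).append(incident)  # open coding (placeholder)
    return codes

def saturated(codes):
    return len(codes) >= 3   # stand-in for a judgement of theoretical saturation

codes = {}
while not saturated(codes):
    incident = collect(incidents)      # simultaneous, iterative data collection...
    if incident is None:
        break
    codes = code(incident, codes)      # ...and analysis, each informing the other

print(sorted(codes))                   # candidate concepts for theoretical sampling
```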

Getting started — Developing an understanding of the data domain

The various contributors to the October 2013 issue of First Monday made the point that although the term ‘data’ initially referred to what was given or present–at–hand, this is highly misleading as data is usually gathered or ‘harvested’ (Helles, 2013) with some specific aim in mind, and with regard to Big Data it is crucial to ensure that there is some level of clarity and comprehension of the agenda underlying the application of specific analytic tools and strategies to existing data sets that often appear simply to have ‘grow’d like Topsy’.

As Albert Einstein put it, ‘If we knew what it was we were doing, it would not be called research, would it?’ So in many instances investigation may well start from a position where one is unable to frame hypotheses, specific research questions, or detailed agendas — starting instead from a hunch or the desire to follow up on some personal experience.

Similarly KDD requires that in the early stages investigators seek an understanding of the problem domain, engaging in exploratory research aimed at articulating a research problem in such a way that useful variables can be identified from the dataset to formulate models.

Although it can be contended that this is ‘data fishing’, it should also be understood that having too constrained an idea about hypotheses, research questions, or agendas early on in the knowledge discovery process may be inappropriate in some instances — a stance that is very much the same as that proposed for GTM research.

Whatever the reason for starting the investigation, which may well be some personal motivation, the initial stage is to engage with the research domain and gather some data — possibly in the form of interviews, observations, documents, a large data set or a combination of all of these.

This exploratory phase of ‘open coding’ leads to identification of issues in the sense that the researcher(s) begin to identify patterns across incidents, in much the same way as KDD practitioners look to identify useful variables from the dataset to formulate models.

In contrast to Leetaru’s research — where he specifically interrogated his data with ‘tone’ and ‘location’ in mind — GTM advocates wide–ranging queries of the data in the first instance — e.g., ‘What is happening here?’, ‘What is this data a study of?’, and so on.

On the basis of these insights from GTM and KDD, when confronted by Big Data — either as an investigator or as someone responding to someone else’s findings — one should seek clarification regarding the nature of the data and the ways in which it has been used in the articulation of aspects of interest — i.e., variables, patterns, codes or other abstractions.

Although it may be countered that the GTM strategy may not be readily adapted to the massive data sets that now confront researchers, it is important to keep in mind that there are always other possible ways of interrogating a data set whether in the form of asking different questions or using a different algorithm.

This is based on developing abstractions or concepts as a result of having accomplished the initial coding or categorization of the data — i.e., the researcher(s) should now have some idea of what is going on in the context under investigation, and what the data can be understood to indicate.

This will involve selecting variables of interest, usually in the form of keywords or phrases, that need to be identified from the data set, but this needs to be done in a controlled manner to avoid a proliferation of variables that may result in misleading or ambiguous results.

This may include something along the lines of the KDD stage of data refinement, involving the identification and bracketing or disregard of outliers, deciding on strategies for missing values and further characterization of the sample population: procedures designed to produce a more focused basis for analysis, based on detailed and iterative investigation of the data at hand.
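For readers unfamiliar with this refinement step, the short example below shows one common way of handling missing values and bracketing outliers (invented numbers, a simple z-score rule with an arbitrary threshold).

```python
import statistics

values = [4.1, 3.9, 4.3, None, 4.0, 19.7, 4.2]   # invented measurements; None = missing

# 1. Decide a strategy for missing values (here: fill with the median of observed values).
observed = [v for v in values if v is not None]
filled = [v if v is not None else statistics.median(observed) for v in values]

# 2. Bracket outliers with a simple z-score rule (a threshold of 2 is a common, arbitrary choice).
mean, sd = statistics.mean(filled), statistics.stdev(filled)
kept = [v for v in filled if abs(v - mean) / sd <= 2]

print("filled:", filled)
print("kept after outlier bracketing:", kept)   # the 19.7 reading is set aside
```

How the missing value is filled and where the outlier threshold is set are analyst decisions, which is precisely why the refinement stage deserves scrutiny when evaluating someone else’s findings.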

On the other hand some investigators would argue that the robustness of one’s findings depends in part on the extent to which negative cases or contrary findings were sought — along the lines of Popper’s advocacy of a strategy of conjecture and refutation.

Again we stress that the reason for dwelling on these aspects as they are incorporated into the two approaches is to provide a basis for the assessment and evaluation of the plethora of Big Data ‘findings’ that are already becoming a central feature of what can be termed ‘familiar knowledge’, emanating from Internet sources and searches, news reports, policy initiatives and other aspects of our time.

Moreover since findings of Big Data analyses are themselves forms of data, anyone seeking to understand or incorporate such accounts in any manner needs to be aware of the necessary skills involved in interrogating the data at hand, and the various ways in which we inevitably tend to categorize and stress particular features as we seek relevant and useful conceptual insights.

What his paper illustrates is that research using Big Data should be seen as affording the potential for developing new insights, but that this requires a methodical approach consciously combining computational analyses of these resources with the expertise of researchers and practitioners.

In the case of Leetaru’s work it can be seen that what has been erroneously reported as a case of computer technology directly offering predictions is in fact a far more complex process derived initially from human ingenuity and suppositions, combined with the computational power and massive data sets now available to researchers.

Moreover now that two years have passed since the initial paper was published it is not unfair to ask if the predictive nature of the work has actually been borne out — e.g., did Leetaru himself or anyone using his model predict the events in Egypt in 2013 that led to the ouster of President Morsi?

But Reichertz (2007) makes the point even more forcefully in discussing the necessity for understanding theoretical sensitivity as a form of abduction, since the result is to bring together the logic of discovery with the logic of justification within the context of methodological considerations.

Researching is not simply a case of collecting data or evidence; the researcher is a key factor in the research landscape, a link in the chain that reaches iteratively around data, codes, concepts, knowledge discovery, DM, and tentative theories.

If one uses the process–oriented terms — modeling, theorizing, researching — rather than the simple noun forms, the issues come more readily into focus: someone is engaging in these activities, and different people will come to the research domain or the realm of Big Data with different skills, expertise, experiences, and presuppositions.

Charmaz offers a definition of abduction as follows: ‘a type of reasoning that begins by examining data and after scrutiny of these data, entertains all possible explanations for the observed data, and then forms hypotheses to confirm or disconfirm until the researcher arrives at the most plausible interpretation of the observed data.’ [13] Reichertz offers another: ‘Something unintelligible is discovered in the data, and on the basis of the mental design of a new rule, the rule is discovered or invented and, at the same time, it also becomes clear what the case is.

Here one has decided (with whatever degree of awareness and for whatever reasons) no longer to adhere to the conventional view of things.’ [14] The idea of entertaining ‘all possible explanations’ is an intriguing one, and it takes on new resonances given the potential for computational analysis of massive data sets.

Once we move away from the idea that Big Data will magically provide the one correct answer or model, we can surely entertain the possibility that it will allow a wide range of possible explanations and outcomes depending on the form of analytics and techniques that might be applied.

The proponents of AI saw intelligence as essentially rule–based, so the technology required for a machine to exhibit intelligence centred on a vast and ever–growing rule base amenable to rapid processing whenever responses to specific questions and interrogations were demanded.

Thus a novice nurse is taught ‘how to read blood pressure, measure bodily outputs, and compute fluid retention, and is given rules for determining what to do when those measurements reach certain values’ [15].
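A minimal caricature of such a rule base, using the novice-nurse example (the thresholds and actions below are invented for illustration, not clinical guidance):

```python
# Caricature of a rule-based system: 'intelligence' as a scan over condition-action rules.
RULES = [
    (lambda f: f.get("temperature_c", 0) > 38.0,      "suspect fever"),
    (lambda f: f.get("systolic_bp", 120) < 90,        "flag low blood pressure"),
    (lambda f: f.get("fluid_retention_ml", 0) > 500,  "alert: review fluid balance"),
]

def respond(facts: dict) -> list[str]:
    """Fire every rule whose condition matches the recorded measurements."""
    return [action for condition, action in RULES if condition(facts)]

print(respond({"temperature_c": 39.1, "systolic_bp": 85}))
# -> ['suspect fever', 'flag low blood pressure']
```

The rule-based picture assumes that expertise scales simply by adding more such rules, which is exactly the assumption the Dreyfus account of skill acquisition challenges.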

Later stages — advanced beginner, competence, and then proficiency — move on from this strict and narrow adherence to specific rules, gradually encompassing a wider and more diverse attempt to understand and respond appropriately to specific contexts, each with its unique characteristics.

Dreyfus and Dreyfus offered a dramatic illustration of this with regard to an expert chess player being given the task of adding numbers spoken to him at a rate of one number per second, while playing five–seconds–a–move chess against a player of only slightly lesser ability, and convincingly beating his opponent.

Again a lesson can be drawn from the world of AI in the 1980s and 1990s where specific implementations of AI techniques, referred to as ‘expert systems’, were used in various fields, including diagnoses of patients presenting with stomach pains.

At the time it was reported that these expert systems were better at diagnosis than doctors, but on a closer reading it transpired that the main reason for this was that ‘better’ was simply seen in terms of the percentage probability of providing the correct diagnosis.

In the context of GTM abduction has taken on a renewed resonance particularly with the development of the constructivist account of the method, since it specifically moves away from the idea that concepts or theories ‘emerge’ in some fashion from the data, instead putting the onus fairly and squarely on the shoulders of the researcher(s).

As such this conjunction of abduction with theoretical sensitivity is in direct contrast to the concept of apophenia referred to earlier: The former affirms the importance of researchers as active and insightful agents, investigating contexts and data;

Although the precise nature of abduction remains a matter of debate (e.g., Minnameier, 2010), given that Peirce himself offered seemingly contradictory or divergent accounts of the term at different phases of his writing, it is important to recognize that, however diverse these accounts, abduction offers an alternative to the usual recourse to reasoning by induction or deduction.

As Jensen (2008) points out, induction and deduction themselves are not immune to significant criticism, and it is only with abduction that the ‘interchange between researcher and informants [can serve to] establish — infer — relevant categories and concepts’.

‘… able to buy access to data, and students from the top universities are the ones most likely to be invited to work within large social media companies.’ [16] The algorithms underlying much of the power of Big Data have also come under increasing scrutiny.

The recent furore regarding the marketing of ‘Keep Calm ...’ t–shirts brought the concept of algorithms into the wider realm, with the supposition that many aspects of Internet marketing were no longer within the remit of human responsibility (McVeigh, 2013): Amazon was forced to take action on Saturday after it was found to be selling T–shirts with slogans promoting rape and violence on its Web site.

In fact in the case of the offensive t–shirts this was clearly not the case, since it was not possible to order a t–shirt bearing the slogan ‘Keep Calm and Hit Him!’ But this is not to undermine the importance of having some scrutiny of the algorithms underlying analysis of massive data sets.
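The mechanism behind such incidents was reportedly simple combinatorial generation; the sketch below (invented word lists, with ‘Hit’ echoing the slogan mentioned above) shows how a script can churn out large numbers of slogan variants mechanically, and where a human-curated exclusion list or review step would sit.

```python
from itertools import product

# Word lists invented for illustration; combination is mechanical and unsupervised.
verbs = ["Carry", "Drink", "Hug", "Hit"]
objects = ["On", "Tea", "Her", "Him", "Me"]
BLOCKLIST = {"Hit", "Punch"}   # terms a human editor decides must never appear

slogans = [f"Keep Calm and {v} {o}"
           for v, o in product(verbs, objects)
           if v not in BLOCKLIST and o not in BLOCKLIST]

print(len(slogans), "variants kept out of", len(verbs) * len(objects), "generated")
print(slogans[:3])   # e.g. ['Keep Calm and Carry On', 'Keep Calm and Carry Tea', ...]
```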

A recent paper by Asay (2013b) offers a well–balanced view on Big Data, presenting an argument that outlines the way in which the ‘phenomenon’ or ‘fad’ can perhaps mature into a discipline, with less hype and a greater understanding of its potential uses, weaknesses, and complexities.

What we have sought to do is provide a mapping of the theory and model building process in GTM with KDD to argue that the latter can be understood as a theory generation tool in the age of Big Data if researchers take account of a variety of methods that can foster conceptual innovation, specifically those encompassed by GTM.

Various authors, such as boyd and Crawford, have warned of this possibility in the context of Big Data, and it is interesting to note that many of those heralding the age of big data tend to offer analyses that indicate how patterns have been detected that relate to events in the past — e.g., Leetaru’s work.

As more predictions and strategies are made based on big data analytics, examples of apophenia and false correlation will no doubt abound, but in the meantime Silverman’s (2014) perceptive review of a recent encomium to big data is worth noting.

Rocket-Powered Data Science

For aspiring data scientists of all ages, I provide in my article at MapR the full, unabridged version of my answers, which may help you even more to achieve your goal.

“For someone who stays in school, do you recommend that they enroll in a program tailored toward data science, or would they get the requisite skills in a ‘hard science’ program such as astrophysics (like you)?”