AI News, Data science: how is it different to statistics?

Data science: how is it different to statistics?

I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant.

In this first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of data science that are typically considered to be out of bounds for statistics research.

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when presenting results you’ll discover a flaw in your model.

Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.

Organizing data into the right ‘shape’ is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data.
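
To make the point concrete, here is a minimal sketch of reshaping data with pandas; the library choice, column names, and numbers are illustrative assumptions of mine, not from the column itself.

    # Reshape a "wide" table into a "long" (tidy) one with pandas.
    import pandas as pd

    # Wide format: one row per country, one column per year (hypothetical data).
    wide = pd.DataFrame({
        "country": ["A", "B"],
        "2019": [100, 150],
        "2020": [110, 160],
    })

    # Long format: one row per country-year observation, which most
    # modelling and plotting tools expect.
    long = wide.melt(id_vars="country", var_name="year", value_name="value")
    print(long)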

Communication is not a mainstream thread of statistics research (if you attend the JSM, it’s easy to come to the conclusion that some academic statisticians couldn’t care less about the communication of results).

Statistics research focuses on data collection and modelling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products.

Data science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.[1][2]

Data science is a 'concept to unify statistics, data analysis, machine learning and their related methods' in order to 'understand and analyze actual phenomena' with data.[3]

Turing award winner Jim Gray imagined data science as a 'fourth paradigm' of science (empirical, theoretical, computational and now data-driven) and asserted that 'everything about science is changing because of the impact of information technology' and the data deluge.[4][5]

In many cases, earlier approaches and solutions are now simply rebranded as 'data science' to be more attractive, which can cause the term to become 'dilute[d] beyond usefulness.'[10]

In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of applications.

In his 2001 report, William S. Cleveland establishes six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.

In 2005, The National Science Board published 'Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century', defining data scientists as 'the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection' whose primary activity is to 'conduct creative inquiry and analysis.'[25]

Turing award winner Jim Gray envisioned 'data-driven science' as a 'fourth paradigm' of science that uses the computational analysis of large data as a primary scientific method.[4][5]

Similarly, in the business sector, multiple researchers and analysts state that data scientists alone are far from sufficient to grant companies a real competitive advantage,[33]

and consider data scientists to be only one of the four broader job families companies require to leverage big data effectively, namely: data analysts, data scientists, big data developers and big data engineers.[34]

Now the data in those disciplines and applied fields that lacked solid theories, like health science and social science, could be sought and utilized to generate powerful predictive models.[1]

In an effort similar to Dhar's, Stanford professor David Donoho, in September 2015, takes the proposition further by rejecting three simplistic and misleading definitions of data science.[36] First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion to distinguish data science and statistics.[36]

Second, data science is not defined by the computing skills of sorting big data sets, in that these skills are already generally used for analyses across all disciplines.[36]

Third, data science is a heavily applied field where academic programs right now do not sufficiently prepare data scientists for the jobs, in that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program.[36][37]

In this way, the future of data science will not only exceed the boundary of statistical theories in scale and methodology, but will also revolutionize current academia and research paradigms.[36]

As Donoho concludes, 'the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available.'[36]

Handbook of Applied Multivariate Statistics and Mathematical Modeling

Multivariate statistics and mathematical models provide flexible and powerful tools essential in most disciplines.

The Handbook of Applied Multivariate Statistics and Mathematical Modeling explains the appropriate uses of multivariate procedures and mathematical modeling techniques, and prescribes practices that enable applied researchers to use these procedures effectively without needing to concern themselves with the mathematical basis.

Data analysis

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, while being used in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.[1]

Predictive analytics focuses on application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data.

Statistician John Tukey defined data analysis in 1961 as: 'Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.'[3]

For instance, data processing may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as in a spreadsheet or statistical software.[4]

In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).[2]

For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y).

It may be described as Y = aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X.
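
As a minimal sketch of that least-squares idea (the numbers and the use of NumPy are illustrative assumptions, not from the source):

    # Fit Y = aX + b by least squares on made-up advertising/sales figures.
    import numpy as np

    advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # X (hypothetical spend)
    sales       = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # Y (hypothetical sales)

    # For degree 1, np.polyfit returns the slope a and intercept b that
    # minimize the sum of squared errors between a*X + b and Y.
    a, b = np.polyfit(advertising, sales, 1)
    residuals = sales - (a * advertising + b)
    print(f"a={a:.2f}, b={b:.2f}, residual sum of squares={np.sum(residuals**2):.3f}")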

Tables are helpful to a user who might look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.

Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message.

Hypothesis testing is used when a particular hypothesis about the true state of affairs is made by the analyst and data is gathered to determine whether that state of affairs is true or false.
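
For instance, a two-sample t-test is one common way to carry this out; the sketch below uses SciPy on synthetic data (both the data and the library choice are assumptions made for illustration).

    # Test whether two groups share the same mean (synthetic data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=10.0, scale=2.0, size=50)
    group_b = rng.normal(loc=11.0, scale=2.0, size=50)

    # Null hypothesis: equal means. A small p-value is evidence against it.
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t={t_stat:.2f}, p={p_value:.4f}")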

Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., 'To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?').

Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., 'To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?').

Whereas (multiple) regression analysis uses additive logic where each X-variable can produce the outcome and the X's can compensate for each other (they are sufficient but not necessary), necessary condition analysis (NCA) uses necessity logic, where one or more X-variables allow the outcome to exist, but may not produce it (they are necessary but not sufficient).

For example, in August 2010, the Congressional Budget Office (CBO) estimated that extending the Bush tax cuts of 2001 and 2003 for the 2011–2020 time period would add approximately $3.3 trillion to the national debt.[18]

As another example, the auditor of a public company must arrive at a formal opinion on whether financial statements of publicly traded corporations are 'fairly stated, in all material respects.'

In his book Psychology of Intelligence Analysis, retired CIA analyst Richards Heuer wrote that analysts should clearly delineate their assumptions and chains of inference and specify the degree and source of the uncertainty involved in the conclusions.

More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in corporate financial statements.

For example, when analysts perform financial statement analysis, they will often recast the financial statements under different assumptions to help arrive at an estimate of future cash flow, which they then discount to present value based on some interest rate, to determine the valuation of the company or its stock.
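
The discounting step itself is simple arithmetic; here is a minimal sketch in Python, with hypothetical cash flows and an assumed 8% rate.

    # Discount projected cash flows to present value: PV = sum of CF_t / (1 + r)^t.
    projected_cash_flows = [100.0, 110.0, 121.0]  # years 1..3, hypothetical
    discount_rate = 0.08

    present_value = sum(
        cf / (1 + discount_rate) ** year
        for year, cf in enumerate(projected_cash_flows, start=1)
    )
    print(f"Present value: {present_value:.2f}")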

The different steps of the data analysis process are carried out in order to realise smart buildings, where building management and control operations, including heating, ventilation, air conditioning, lighting and security, are realised automatically by mimicking the needs of the building users and optimising resources like energy and time.

These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system and making key package/display and content decisions) to improve the accuracy of educators’ data analyses.[24]

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question.

Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), and normality checks (skewness, kurtosis, frequency histograms); variables can also be compared with coding schemes of variables external to the data set, and possibly corrected if the coding schemes are not comparable.
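
In practice these checks are a few lines of code; the sketch below uses pandas on a made-up column (the column name, the data, and the library choice are all assumptions).

    # Quick data-quality checks: descriptive statistics, normality, frequencies.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=1000)})

    print(df["income"].describe())               # count, mean, std, median (50%), ...
    print("skewness:", df["income"].skew())      # asymmetry of the distribution
    print("kurtosis:", df["income"].kurt())      # heaviness of the tails
    print(df["income"].value_counts(bins=10))    # coarse frequency counts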

When a model is found through exploratory analysis in a dataset, following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type 1 error that produced the exploratory model in the first place.

Different companies and organizations hold data analysis contests to encourage researchers to utilize their data or to solve a particular question using data analysis.

The 10 Statistical Techniques Data Scientists Need to Master

Regardless of where you stand on the matter of Data Science sexiness, it’s simply impossible to ignore the continuing importance of data, and our ability to analyze, organize, and contextualize it.

With technologies like Machine Learning becoming ever more commonplace, and emerging fields like Deep Learning gaining significant traction amongst researchers and engineers — and the companies that hire them — Data Scientists continue to ride the crest of an incredible wave of innovation and technological progress.

As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” I personally know too many software engineers looking to transition into data science who blindly apply machine learning frameworks such as TensorFlow or Apache Spark to their data without a thorough understanding of the statistical theories behind them.

Having now been exposed to the content twice, I want to share the 10 statistical techniques from the book that I believe any data scientist should learn to be more effective in handling big datasets.

I previously wrote one of the most popular Medium posts on machine learning, so I am confident I have the expertise to justify these differences. In statistics, linear regression is a method to predict a target variable by fitting the best linear relationship between the dependent and independent variables.

Classification is a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.

Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
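
A minimal sketch of fitting such a model, using scikit-learn on synthetic data (both are illustrative choices of mine, not the article's):

    # Logistic regression: model the probability of a binary outcome.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 2))                    # two continuous predictors
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    print(model.coef_, model.intercept_)
    print(model.predict_proba(X[:3]))   # estimated class probabilities, first 3 rows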

In discriminant analysis, two or more groups, clusters, or populations are known a priori, and one or more new observations are classified into one of the known populations based on the measured characteristics.

Discriminant analysis models the distribution of the predictors X separately in each of the response classes, and then uses Bayes’ theorem to flip these around into estimates for the probability of the response category given the value of X.
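
As a minimal sketch of that idea, the code below fits linear discriminant analysis with scikit-learn on the bundled iris data; the dataset and library are illustrative assumptions.

    # Linear discriminant analysis: fit per-class distributions of X, then use
    # Bayes' theorem to get P(class | x) for new observations.
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    lda = LinearDiscriminantAnalysis().fit(X, y)

    print(lda.predict(X[:1]))         # most probable class for one observation
    print(lda.predict_proba(X[:1]))   # the posterior class probabilities behind it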

In other words, the method of resampling does not involve the use of generic distribution tables in order to compute approximate p-values.

In order to understand the concept of resampling, you should understand the terms bootstrapping and cross-validation. For linear models, ordinary least squares is usually the main criterion used to fit them to the data.
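
A minimal sketch of both resampling ideas on synthetic data (scikit-learn and the numbers are assumptions of mine, not from the article):

    # Bootstrapping: gauge the variability of an estimate by refitting on resamples.
    # Cross-validation: estimate out-of-sample performance without distribution tables.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.utils import resample

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 1))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

    slopes = []
    for _ in range(200):
        Xb, yb = resample(X, y)                     # sample rows with replacement
        slopes.append(LinearRegression().fit(Xb, yb).coef_[0])
    print("bootstrap std of slope:", np.std(slopes))

    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print("5-fold cross-validated R^2:", scores.mean())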

Shrinkage approaches, by contrast, fit a model involving all p predictors, but the estimated coefficients are shrunk toward zero relative to the least squares estimates.
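
A minimal sketch comparing ordinary least squares with two common shrinkage fits, ridge and lasso; the synthetic data, penalty values, and use of scikit-learn are illustrative assumptions.

    # Shrinkage: penalized fits pull coefficients toward zero; lasso can zero them out.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 10))                       # p = 10 predictors
    y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=50)  # only the first one matters

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks all coefficients
    lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty can set some exactly to zero

    print(np.round(ols.coef_, 2))
    print(np.round(ridge.coef_, 2))
    print(np.round(lasso.coef_, 2))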

In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.
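
For example, an exponential model y = a * exp(b * x) is nonlinear in its parameters; the sketch below fits it with SciPy on synthetic data (both are illustrative assumptions).

    # Nonlinear regression: estimate a and b in y = a * exp(b * x).
    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, a, b):
        return a * np.exp(b * x)

    rng = np.random.default_rng(5)
    x = np.linspace(0, 2, 50)
    y = model(x, 2.0, 1.3) + rng.normal(scale=0.2, size=x.size)

    params, covariance = curve_fit(model, x, y, p0=(1.0, 1.0))
    print("estimated a, b:", params)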

So far, we have only discussed supervised learning techniques, in which the groups are known and the experience provided to the algorithm is the relationship between actual entities and the group they belong to.
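
By contrast, an unsupervised method is given no group labels at all; k-means clustering, sketched below with scikit-learn on synthetic data (both illustrative assumptions), discovers the groups itself.

    # Unsupervised learning: k-means assigns cluster labels without being given any.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(6)
    X = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)                  # two recovered group centers
    print(kmeans.labels_[:5], kmeans.labels_[-5:])  # labels inferred from X alone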

The most widely used unsupervised learning algorithms include clustering methods such as k-means and hierarchical clustering, and dimensionality-reduction methods such as principal component analysis. This was a basic run-down of some basic statistical techniques that can help a data science program manager and/or executive better understand what is running under the hood of their data science teams.

Is Most Published Research Wrong?

Mounting evidence suggests a lot of published research is false.

Practice 4 - Analyzing and Interpreting Data

Science and Engineering Practice 4: Analyzing and Interpreting Data. Paul Andersen explains how scientists analyze and interpret data. Data can be organized ...

Statistical Text Analysis for Social Science

What can text analysis tell us about society? Corpora of news, books, and social media encode human beliefs and culture. But it is impossible for a researcher to ...

Sociology Research Methods: Crash Course Sociology #4

Today we're talking about how we actually DO sociology. Nicole explains the research method: form a question and a hypothesis, collect data, and analyze that ...

Biomedical Big Data Revolution | Dr. Stefan Bekiranov | TEDxRVA

Find a cure for cancer from the comfort of your living room while in your PJs. It's more possible today than it was a short time ago. We are currently undergoing a ...

Natural Language Processing: Crash Course Computer Science #36

Today we're going to talk about how computers understand speech and speak themselves. As computers play an increasing role in our daily lives there has ...

Research Methods - Introduction

In this video, Dr Greg Martin provides an introduction to research methods, methodology and study design. Specifically he takes a look at qualitative and ...

Analyzing and modeling complex and big data | Professor Maria Fasli | TEDxUniversityofEssex

This talk was given at a local TEDx event, produced independently of the TED Conferences. The amount of information that we are creating is increasing at an ...

Choosing which statistical test to use - statistics help

Seven different statistical tests and a process by which you can decide which to use. The tests are: Test for a mean, test for a proportion, difference of proportions ...

Ways to represent data | Data and statistics | 6th grade | Khan Academy

Here are a few of the many ways to look at data. Which is your favorite? Practice this lesson yourself on KhanAcademy.org right now: ...