AI News, Data science: how is it different to statistics?
I believe that statistics is a crucial part of data science, but at the same time, most statistics departments are at grave risk of becoming irrelevant.
In this first column, I’ll discuss why I think data science isn’t just statistics, and highlight important parts of data science that are typically considered to be out of bounds for statistics research.
I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.
It’s rare to walk this process in one direction: often your analysis will reveal that you need new or different data, or when presenting results you’ll discover a flaw in your model.
Good questions are crucial for good analysis, but there is little research in statistics about how to solicit and polish good questions, and it’s a skill rarely taught in core PhD curricula.
Organizing data into the right ‘shape’ is essential for fluent data analysis: if it’s in the wrong shape you’ll spend the majority of your time fighting your tools, not questioning the data.
Communication is not a mainstream thread of statistics research (if you attend the JSM, it’s easy to come to the conclusion that some academic statisticians couldn’t care less about the communication of results).
Statistics research focuses on data collection and modelling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products.
Operations research, or operational research in British usage, is a discipline that deals with the application of advanced analytical methods to help make better decisions. Further, the term 'operational analysis' is used in the British (and some British Commonwealth) military as an intrinsic part of capability development, management and assurance.
Operational research (OR) encompasses a wide range of problem-solving techniques and methods applied in the pursuit of improved decision-making and efficiency, such as simulation, mathematical optimization, queueing theory and other stochastic-process models, Markov decision processes, econometric methods, data envelopment analysis, neural networks, expert systems, decision analysis, and the analytic hierarchy process. Nearly all of these techniques involve the construction of mathematical models that attempt to describe the system.
Since that time, operational research has expanded into a field widely used in industries ranging from petrochemicals to airlines, finance, logistics, and government, moving to a focus on the development of mathematical models that can be used to analyse and optimize complex systems, and has become an area of active academic and industrial research. Early work in operational research was carried out by individuals such as Charles Babbage.
His research into the cost of transportation and sorting of mail led to England's universal 'Penny Post' in 1840, and studies into the dynamical behaviour of railway vehicles in defence of the GWR's broad gauge. Percy Bridgman brought operational research to bear on problems in physics in the 1920s and would later attempt to extend these to the social sciences. Modern operational research originated at the Bawdsey Research Station in the UK in 1937 and was the result of an initiative of the station's superintendent, A.
Early in the war while working for the Royal Aircraft Establishment (RAE) he set up a team known as the 'Circus' which helped to reduce the number of anti-aircraft artillery rounds needed to shoot down an enemy aircraft from an average of over 20,000 at the start of the Battle of Britain to 4,000 in 1941. In 1941, Blackett moved from the RAE to the Navy, working first with RAF Coastal Command and then, early in 1942, with the Admiralty. Blackett's team at Coastal Command's Operational Research Section (CC-ORS) included two future Nobel prize winners and many other people who went on to be pre-eminent in their fields. They undertook a number of crucial analyses that aided the war effort.
The reason was that if a U-boat saw an aircraft only shortly before it arrived over the target, then at 100 feet the charges would do no damage (because the U-boat wouldn't have had time to descend as far as 100 feet), and if it saw the aircraft a long way from the target, it had time to alter course under water, so the chances of it being within the 20-foot kill zone of the charges were small.
revealed that glossy enamel paint was more effective camouflage for night fighters than traditional dull camouflage paint finish, and the smooth paint finish increased airspeed by reducing skin friction. On land, the operational research sections of the Army Operational Research Group (AORG) of the Ministry of Supply (MoS) were landed in Normandy in 1944, and they followed British forces in the advance across Europe.
In 1967 Stafford Beer characterized the field of management science as 'the business use of operations research'. However, in modern times the term management science may also be used to refer to the separate fields of organizational studies or corporate strategy. Like operational research itself, management science (MS) is an interdisciplinary branch of applied mathematics devoted to optimal decision planning, with strong links with economics, business, engineering, and other sciences.
Management science is concerned with developing and applying models and concepts that may prove useful in helping to illuminate management issues and solve managerial problems, as well as designing and developing new and better models of organizational excellence. The application of these models within the corporate sector became known as management science. Some of the fields that have considerable overlap with Operations Research and Management Science include: Applications are abundant such as in airlines, manufacturing companies, service organizations, military branches, and government.
These include: The International Federation of Operational Research Societies (IFORS) is an umbrella organization for operational research societies worldwide, representing approximately 50 national societies including those in the US, UK, France, Germany, Italy, Canada, Australia, New Zealand, Philippines, India, Japan and South Africa. The constituent members of IFORS form regional groups, such as that in Europe. Other important operational research organizations are the Simulation Interoperability Standards Organization (SISO) and the Interservice/Industry Training, Simulation and Education Conference (I/ITSEC). In 2004 the US-based organization INFORMS began an initiative to market the OR profession better, including a website entitled The Science of Better which provides an introduction to OR and examples of successful applications of OR to industrial problems.
The 10 Statistical Techniques Data Scientists Need to Master
Regardless of where you stand on the matter of Data Science sexiness, it’s simply impossible to ignore the continuing importance of data, and our ability to analyze, organize, and contextualize it.
With technologies like Machine Learning becoming ever more commonplace, and emerging fields like Deep Learning gaining significant traction amongst researchers and engineers (and the companies that hire them), Data Scientists continue to ride the crest of an incredible wave of innovation and technological progress.
As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” I personally know too many software engineers who are looking to transition into data science and who blindly apply machine learning frameworks such as TensorFlow or Apache Spark to their data without a thorough understanding of the statistical theories behind them.
Having now been exposed to the content twice, I want to share the 10 statistical techniques from the book that I believe any data scientist should learn to be more effective in handling big datasets.
I wrote one of the most popular Medium posts on machine learning before, so I am confident I have the expertise to justify these differences:
In statistics, linear regression is a method to predict a target variable by fitting the best linear relationship between the dependent and independent variables.
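As a rough illustration of "fitting the best linear relationship", here is a minimal sketch of simple (one-predictor) least-squares regression. The function name and the sample data are hypothetical and chosen only for this example; "best" is assumed to mean the line that minimises the sum of squared residuals.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # The slope is the ratio of the x-y co-variation to the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope)
```

With real, noisy data the points will not sit on the line, and the same formulas give the line with the smallest total squared error.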
Now I need to answer the following questions:
Classification is a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
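To make the relationship between a binary outcome and a predictor concrete, the following is a minimal sketch of logistic regression with a single predictor, fitted by plain gradient descent on the average log-loss. The data (hours studied against pass or fail), the learning rate and the number of steps are assumptions made up for illustration; real libraries use more robust optimisers.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the average log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # predicted probability minus label
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 1.0)   # estimated pass probability after 1 hour
p_high = sigmoid(b0 + b1 * 4.0)  # estimated pass probability after 4 hours
print(b0, b1, p_low, p_high)
```

A positive `b1` means the estimated odds of the outcome rise with the predictor, which is the kind of relationship the model is meant to describe.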
Types of questions that a logistic regression can examine:
In Discriminant Analysis, 2 or more groups or clusters or populations are known a priori and 1 or more new observations are classified into 1 of the known populations based on the measured characteristics.
Discriminant analysis models the distribution of the predictors X separately in each of the response classes, and then uses Bayes’ theorem to flip these around into estimates for the probability of the response category given the value of X.
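The "flip" via Bayes' theorem can be sketched for a single predictor. This example assumes Gaussian class densities with one pooled variance shared by all classes (the linear discriminant analysis setting); the class labels, data and function names are hypothetical.

```python
import math
from statistics import mean


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_lda_1d(samples_by_class):
    """Estimate per-class priors and means, plus one pooled variance for all classes."""
    n_total = sum(len(v) for v in samples_by_class.values())
    priors = {k: len(v) / n_total for k, v in samples_by_class.items()}
    means = {k: mean(v) for k, v in samples_by_class.items()}
    pooled_ss = sum(
        (x - means[k]) ** 2 for k, v in samples_by_class.items() for x in v
    )
    var = pooled_ss / (n_total - len(samples_by_class))
    return priors, means, var


def posterior(x, priors, means, var):
    """Bayes' theorem: P(class | x) is proportional to prior * density of x in that class."""
    joint = {k: priors[k] * gaussian_pdf(x, means[k], var) for k in priors}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}


# Hypothetical one-dimensional measurements for two known groups.
data = {"A": [1.0, 1.5, 2.0, 2.5], "B": [4.0, 4.5, 5.0, 5.5]}
priors, means, var = fit_lda_1d(data)
post = posterior(2.0, priors, means, var)
print(post)
```

A new observation is assigned to the class with the largest posterior probability; with equal priors and a shared variance, the decision boundary falls midway between the two class means.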
In other words, the method of resampling does not involve the utilization of the generic distribution tables in order to compute approximate p-values.
In order to understand the concept of resampling, you should understand the terms Bootstrapping and Cross-Validation:
Usually for linear models, ordinary least squares is the major criterion to be considered to fit them to the data.
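The two resampling terms named above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the article's own method: bootstrapping is taken to mean drawing samples with replacement from the observed data, and cross-validation is shown in its k-fold form. All names and numbers are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample the data with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds, yielding (train, test) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical sample of eight measurements.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

# Bootstrap: the spread of the resampled means estimates the standard error of the mean.
boot = bootstrap_means(data)
standard_error = stdev(boot)

# Cross-validation: each observation is held out for testing exactly once.
folds = list(k_fold_indices(len(data), k=4))
print(standard_error, folds[0])
```

In practice the indices are usually shuffled before splitting into folds, and a model is refitted on each training set and scored on the matching test set.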
This approach fits a model involving all p predictors; however, the estimated coefficients are shrunken towards zero relative to the least squares estimates.
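The shrinking effect can be seen in a small sketch. Ridge regression is used here as one common shrinkage method; that choice is an assumption, since the passage does not name a specific technique. The example uses a single centred predictor with no intercept, and the data and penalty values are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimising sum((y - b * x) ** 2) + lam * b ** 2 for centred x and y."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty lam inflates the denominator, pulling the estimate towards zero.
    return sxy / (sxx + lam)


# Hypothetical centred data (both xs and ys have mean zero).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.7]

ols = ridge_slope(xs, ys, lam=0.0)  # lam = 0 recovers the least squares estimate
shrunk = [ridge_slope(xs, ys, lam) for lam in (1.0, 10.0, 100.0)]
print(ols, shrunk)
```

Larger penalties shrink the coefficient further, but it never reaches exactly zero, so the predictor always stays in the model.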
In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.
So far, we have only discussed supervised learning techniques, in which the groups are known and the experience provided to the algorithm is the relationship between actual entities and the group they belong to.
Below is the list of the most widely used unsupervised learning algorithms:
This was a run-down of some basic statistical techniques that can help a data science program manager and/or executive have a better understanding of what is running underneath the hood of their data science teams.
Turing award winner Jim Gray imagined data science as a 'fourth paradigm' of science (empirical, theoretical, computational and now data-driven) and asserted that 'everything about science is changing because of the impact of information technology' and the data deluge. When Harvard Business Review called it 'The Sexiest Job of the 21st Century', the term 'data science' became a buzzword, and is now often applied to business analytics, business intelligence, predictive modeling, or any arbitrary use of data, or used as a glamorized term for statistics. In many cases, earlier approaches and solutions are now simply rebranded as 'data science' to be more attractive, which can cause the term to become 'dilute[d] beyond usefulness.' While many university programs now offer a data science degree, there exists no consensus on a definition or suitable curriculum contents. To its discredit, however, many data science and big data projects fail to deliver useful results, often as a result of poor management and utilization of resources. The term 'data science' has appeared in various contexts over the past thirty years but did not become an established term until recently.
In 2005, The National Science Board published 'Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century' defining data scientists as 'the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection' whose primary activity is to 'conduct creative inquiry and analysis.' Around 2007, Turing award winner Jim Gray envisioned 'data-driven science' as a 'fourth paradigm' of science that uses the computational analysis of large data as primary scientific method and 'to have a world in which all of the science literature is online, all of the science data is online, and they interoperate with each other.' In the 2012 Harvard Business Review article 'Data Scientist: The Sexiest Job of the 21st Century', DJ Patil claims to have coined this term in 2008 with Jeff Hammerbacher to define their jobs at LinkedIn and Facebook, respectively.
Now the data in those disciplines and applied fields that lacked solid theories, like health science and social science, could be sought and utilized to generate powerful predictive models. In an effort similar to Dhar's, Stanford professor David Donoho, in September 2015, takes the proposition further by rejecting three simplistic and misleading definitions of data science in lieu of criticisms. First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion to distinguish data science and statistics. Second, data science is not defined by the computing skills of sorting big data sets, in that these skills are already generally used for analyses across all disciplines. Third, data science is a heavily applied field where academic programs right now do not sufficiently prepare data scientists for the jobs, in that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. As a statistician, Donoho, following many in his field, champions the broadening of learning scope in the form of data science, like John Chambers, who urges statisticians to adopt an inclusive concept of learning from data, or like William Cleveland, who urges prioritizing the extraction of applicable predictive tools from data over explanatory theories. Together, these statisticians envision an increasingly inclusive applied field that grows out of traditional statistics and beyond.
Handbook of Applied Multivariate Statistics and Mathematical Modeling
Multivariate statistics and mathematical models provide flexible and powerful tools essential in most disciplines.
The Handbook of Applied Multivariate Statistics and Mathematical Modeling explains the appropriate uses of multivariate procedures and mathematical modeling techniques, and prescribes practices that enable applied researchers to use these procedures effectively without needing to concern themselves with the mathematical basis.
- On Monday, August 19, 2019
Excel 2013 Statistical Analysis #01: Using Excel Efficiently For Statistical Analysis (100 Examples)
Download File: All Excel Files for All Video files: ..
The best stats you've ever seen | Hans Rosling
With the drama and urgency of a sportscaster, statistics guru Hans Rosling uses an amazing new presentation tool, Gapminder, to present ..
Statistical Text Analysis for Social Science
What can text analysis tell us about society? Corpora of news, books, and social media encode human beliefs and culture. But it is impossible for a researcher to ...
Practice 4 - Analyzing and Interpreting Data
Science and Engineering Practice 3: Analyzing and Interpreting Data Paul Andersen explains how scientists analyze and interpret data. Data can be organized ...
Time Series Forecasting Theory | AR, MA, ARMA, ARIMA | Data Science
In this video you will learn the theory of Time Series Forecasting. You will learn about univariate time series analysis, AR, MA, ARMA & ARIMA modelling and how to ...
Choosing which statistical test to use - statistics help
Seven different statistical tests and a process by which you can decide which to use. If this video helps you, please donate by clicking on: ...
Types of Data: Nominal, Ordinal, Interval/Ratio - Statistics Help
The kind of graph and analysis we can do with specific data is related to the type of data it is. In this video we explain the different levels of data, with examples.
What is action research?
Here's a short description of action research. TRANSCRIPT: Teaching is a craft. It's both an art and a science, which is why great teachers always experiment ...
Biomedical Big Data Revolution | Dr. Stefan Bekiranov | TEDxRVA
Find a cure for cancer from the comfort of your living room while in your PJs. It's more possible today than it was a short time ago. We are currently undergoing a ...
The Correlation Coefficient - Explained in Three Steps
The correlation coefficient is a really popular way of summarizing a scatter plot into a single number between -1 and 1. In this video, I'm giving an intuition how ...