AI News, The Risky Eclipse of Statisticians

If statisticians have historically been leaders of data, why was there a need for a brand new breed of data scientists?  While the world is exploding with bounties of valuable data, statisticians are strangely working quietly in the shadows.

Back in 2010, predictive modeling and analytics website Kaggle proudly dangled Varian’s prediction as a carrot on its careers page to lure people to join its team.

For instance, UC Berkeley’s Terry Speed observes: Justin Strauss, co-founder at Storyhackers, who previously led data science programs in the healthcare industry, can attest to this more generally. He says he has “seen an underrepresentation” of statisticians at conferences and other events related to Big Data.

As renowned statistician Gerry Hahn once said: “This is a Golden Age of statistics, but not necessarily for statisticians.” Instead of crowning statisticians king, the Big Data revolution borrowed the foundational elements of applied statistics, married them with computer science and birthed an entirely new heir: The Data Scientist.

“The area of massive datasets, though currently of great interest to computational statisticians and to many data analysts, has not yet become part of mainstream statistical science.”

Second, statistics is a crucial part of data science, but it alone is insufficient for making sense of the exponentially growing volume of messy data we produce daily.

Machine learning is deeply rooted in statistics, but few statisticians have the technical skills to manipulate a dataset of 10 billion data points, each with 10,000 dimensions.
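A rough back-of-the-envelope calculation puts that scale in perspective. The sketch below assumes the data is stored densely as 8-byte floating-point values; that storage format is an illustrative assumption, not something the article specifies.

```python
# Back-of-the-envelope size of a dense dataset.
# Assumption (not from the article): every value is an 8-byte float.
N_POINTS = 10_000_000_000   # 10 billion data points
N_DIMS = 10_000             # 10,000 dimensions per point
BYTES_PER_VALUE = 8         # assumed float64 storage


def dense_size_bytes(n_points: int, n_dims: int, bytes_per_value: int) -> int:
    """Return the raw storage needed for a dense n_points x n_dims table."""
    return n_points * n_dims * bytes_per_value


total_bytes = dense_size_bytes(N_POINTS, N_DIMS, BYTES_PER_VALUE)
total_terabytes = total_bytes / 10**12  # decimal terabytes

print(f"{total_terabytes:,.0f} TB")  # 800 TB
```

Under that assumption the raw table needs roughly 800 terabytes, far more than a single machine holds in memory, which is why distributed computing skills matter as much as statistical ones at this scale.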

Hardtke says.  But now, we’re at this convergence of super cheap, high-speed computing that’s helping data scientists process powerful insights and find answers to questions that remained a mystery 20 years ago.

Some say it’s a buzzword with good marketing, others say it’s a made-up title, and some call data scientists folks who sold out to shareholders.

Even without a prominent presence of statisticians, educational institutions have churned out entirely new curriculums devoted to the so-called “new” field of data science in just the last few years.

into R (not to mention <lm()>, <knn()>, <gbm()>, etc.) can “do” statistics, even if they misuse those methods in ways that William Sealy Gosset wouldn’t approve on his booziest days at the Guinness brewery.” Reddy writes.  The worst part is, you can usually get away with carrying out subpar analysis because it’s hard to identify the quality of statistics without examining analysis in detail, he adds.

He makes a compelling point: It’s better to have someone who’s really passionate about geology, physics or any other science because they’ll pick up the tools of data manipulation as part of a bigger mission.

When Hardtke was tasked with building a strong data science team at startup Bright.com several years ago, he couldn’t afford to recruit the best data scientists away from the likes of Google and Facebook.

But he knew something most data scientist-crazed recruiters don’t understand: At its core, it’s all about learning how to ingest data using statistical methodology and computational techniques to find an answer.

Sharma says. And, ultimately, Sharma’s team was able to work together to find a successful plan to monitor a user’s happiness, which offered deeper insight into search behavior and satisfaction with search results. While data science and statistics share a common goal of extracting meaningful insight from data, the evolution of data science in the last 10 years underscores the demand for a combination of interdisciplinary skills.

Although the demand for pure statistics will shrink relative to data science over time, it’s going to be more important than ever to have interdisciplinary knowledge from a variety of fields.

Data science

Turing award winner Jim Gray imagined data science as a 'fourth paradigm' of science (empirical, theoretical, computational and now data-driven) and asserted that 'everything about science is changing because of the impact of information technology' and the data deluge.[4][5] When Harvard Business Review called it 'The Sexiest Job of the 21st Century'[6] the term became a buzzword, and is now often applied to business analytics,[7] or even arbitrary use of data, or used as a sexed-up term for statistics.[8] While many university programs now offer a data science degree, there exists no consensus on a definition or curriculum contents.[7] Because of the current popularity of this term, there are many 'advocacy efforts' surrounding it.[9] The term 'data science' (originally used interchangeably with 'datalogy') has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960.

Now the data in those disciplines and applied fields that lacked solid theories, like health science and social science, could be sought and utilized to generate powerful predictive models.[1] In an effort similar to Dhar's, Stanford professor David Donoho, in September 2015, takes the proposition further by rejecting three simplistic and misleading definitions of data science in lieu of criticisms.[28] First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion to distinguish data science from statistics.[28] Second, data science is not defined by the computing skills of sorting big data sets, in that these skills are already generally used for analyses across all disciplines.[28] Third, data science is a heavily applied field where academic programs right now do not sufficiently prepare data scientists for the jobs, in that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program.[28][29] As a statistician, Donoho, following many in his field, champions the broadening of learning scope in the form of data science,[28] like John Chambers, who urges statisticians to adopt an inclusive concept of learning from data,[30] or like William Cleveland, who urges prioritizing the extraction of applicable predictive tools from data over explanatory theories.[14] Together, these statisticians envision an increasingly inclusive applied field that grows out of traditional statistics and beyond.

For the future of data science, Donoho projects an ever-growing environment for open science where data sets used for academic publications are accessible to all researchers.[28] The US National Institutes of Health has already announced plans to enhance reproducibility and transparency of research data.[31] Other big journals are likewise following suit.[32][33] This way, the future of data science will not only exceed the boundary of statistical theories in scale and methodology but also revolutionize current academia and research paradigms.[28] As Donoho concludes, 'the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available.'[28]

The Age of Big Data

Rick Smolan, creator of the “Day in the Life” photography series, is planning a project later this year, “The Human Face of Big Data,” documenting the collection and uses of data.

Big Data is a meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions.

Now, with people supplying millions of questions, Siri is becoming an increasingly adept personal assistant, offering reminders, weather reports, restaurant suggestions and answers to an expanding universe of questions.

Retailers, like Walmart and Kohl’s, analyze sales, pricing and economic, demographic and weather data to tailor product selections at particular stores and determine the timing of price markdowns.

Online dating services, like Match.com, constantly sift through their Web listings of personal characteristics, reactions and communications to improve the algorithms for matching men and women on dates.

Police departments across the country, led by New York’s, use computerized mapping and analysis of variables like historical arrest patterns, paydays, sporting events, rainfall and holidays to try to predict likely crime “hot spots” and deploy officers there in advance.

They studied 179 large companies and found that those adopting “data-driven decision making” achieved productivity gains that were 5 percent to 6 percent higher than other factors could explain.

Researchers have found a spike in Google search requests for terms like “flu symptoms” and “flu treatments” a couple of weeks before there is an increase in flu patients coming to hospital emergency rooms in a region (and emergency room reports usually lag behind visits by two weeks or so).

The group will conduct so-called sentiment analysis of messages in social networks and text messages — using natural-language deciphering software — to help predict job losses, spending reductions or disease outbreaks in a given region.

In economic forecasting, research has shown that trends in increasing or decreasing volumes of housing-related search queries in Google are a more accurate predictor of house sales in the next quarter than the forecasts of real estate economists.

It was a classic demonstration of the “small-world phenomenon,” captured in the popular phrase “six degrees of separation.” Today, social-network research involves mining huge digital data sets of collective behavior online.

Among the findings: people whom you know but don’t communicate with often — “weak ties,” in sociology — are the best sources of tips about job openings.

With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.” Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions.
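A small simulation makes the "straw that looks like needles" problem concrete. The sketch below is illustrative only: it assumes a simple two-sided z-test on pure noise with known variance, and the test count, sample size and seed are arbitrary choices rather than figures from the article.

```python
import random
import statistics


def count_false_discoveries(n_tests: int, n_samples: int, seed: int = 0) -> int:
    """Count pure-noise tests that look 'significant' at the 5% level."""
    rng = random.Random(seed)
    threshold = 1.96  # two-sided 5% cutoff for a standard normal z-score
    hits = 0
    for _ in range(n_tests):
        # Every sample is noise with true mean 0, so any "effect" is spurious.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        # z-score of the sample mean, using the known standard deviation of 1.
        z = statistics.fmean(sample) * n_samples ** 0.5
        if abs(z) > threshold:
            hits += 1
    return hits


n_tests = 1000
hits = count_false_discoveries(n_tests=n_tests, n_samples=100)
print(f"{hits} of {n_tests} noise-only tests passed the 5% threshold")
```

Because each test has a 5 percent chance of crossing the threshold by luck, about 50 of 1,000 noise-only tests are expected to look like discoveries. The more variables a data set lets you test, the more such false needles turn up unless the analysis corrects for multiple comparisons.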

A model might spot a correlation and draw a statistical inference that is unfair or discriminatory, based on online searches, affecting the products, bank loans and health insurance a person is offered, privacy advocates warn.

Data Scientist: The Sexiest Job of the 21st Century

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up.

For one thing, he had given Goldman a way to circumvent the traditional product release cycle by publishing small modules in the form of ads on the site’s most popular pages.

Through one such module, Goldman started to test what would happen if you presented users with names of people they hadn’t yet connected with but seemed likely to know—for example, people who had shared their tenures at schools and workplaces.
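The idea of suggesting people who shared a tenure can be sketched in a few lines. This is a hypothetical illustration, not LinkedIn's actual method: the `Tenure` record, the `suggest_contacts` function and the sample names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenure:
    """A stay at a school or workplace (hypothetical record layout)."""
    person: str
    organization: str
    start_year: int
    end_year: int


def overlaps(a: Tenure, b: Tenure) -> bool:
    """True when two tenures share an organization and at least one year."""
    return (
        a.organization == b.organization
        and a.start_year <= b.end_year
        and b.start_year <= a.end_year
    )


def suggest_contacts(user: str, tenures: list, connections: set) -> list:
    """Rank people not yet connected to `user` by number of overlapping tenures."""
    mine = [t for t in tenures if t.person == user]
    scores = {}
    for other in tenures:
        if other.person == user or other.person in connections:
            continue
        shared = sum(overlaps(m, other) for m in mine)
        if shared:
            scores[other.person] = scores.get(other.person, 0) + shared
    # Highest overlap count first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))


tenures = [
    Tenure("ana", "State University", 2001, 2005),
    Tenure("ana", "Acme Corp", 2006, 2010),
    Tenure("ben", "State University", 2003, 2007),
    Tenure("ben", "Acme Corp", 2008, 2012),
    Tenure("cy", "Acme Corp", 2009, 2011),
    Tenure("dee", "State University", 2010, 2014),
    Tenure("eve", "Acme Corp", 2006, 2007),
]
suggestions = suggest_contacts("ana", tenures, connections={"eve"})
print(suggestions)  # ['ben', 'cy']
```

Here "ben" ranks first because he overlapped with "ana" at both the school and the workplace, "dee" is dropped because her years do not overlap, and "eve" is skipped because she is already a connection.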

Goldman is a good example of a new key player in organizations: the “data scientist.” It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data.

If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity.

Much of the current enthusiasm for big data focuses on technologies that make taming it possible, including Hadoop (the most widely used framework for distributed file system processing) and related open-source tools, cloud computing, and data visualization.

Greylock Partners, an early-stage venture firm that has backed companies such as Facebook, LinkedIn, Palo Alto Networks, and Workday, is worried enough about the tight labor pool that it has built its own specialized recruiting team to channel talent to businesses in its portfolio.

“Once they have data,” says Dan Portillo, who leads that team, “they really need people who can manage it and find insights in it.” If capitalizing on big data depends on hiring scarce data scientists, then the challenge for managers is to learn how to identify that talent, attract it to an enterprise, and make it productive.

In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data.

More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both.

But we would say the dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested.

As Portillo told us, “The traditional backgrounds of people you saw 10 to 15 years ago just don’t cut it these days.” A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.

A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data—and also not at actually analyzing the data.

Several universities are planning to launch data science programs, and existing programs in analytics, such as the Master of Science in Analytics program at North Carolina State, are busy adding big data exercises and coursework.

The Insight Data Science Fellows Program, a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training), takes scientists from academia and in six weeks prepares them to succeed as data scientists.

As one of them commented, “If we wanted to work with structured data, we’d be on Wall Street.” Given that today’s most qualified prospects come from nonbusiness backgrounds, hiring managers may need to figure out how to paint an exciting picture of the potential for breakthroughs that their problems offer.

One described being a consultant as “the dead zone—all you get to do is tell someone else what the analyses say they should do.” By creating solutions that work, they can have more impact and leave their marks as pioneers of their profession.

As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

At Intuit, data scientists are asked to develop insights for small-business customers and consumers and report to a new senior vice president of big data, social design, and marketing.

New conferences and informal associations are springing up to support collaboration and technology sharing, and companies should encourage scientists to become involved in them with the understanding that “more water in the harbor floats all boats.” Data scientists tend to be more motivated, too, when more is expected of them.

The challenges of accessing and structuring big data sometimes leave little time or energy for sophisticated analytics involving prediction or optimization.

People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?” If “sexy” means having rare qualities that are much in demand, data scientists are already there.

In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies.

One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting.

Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms like GE and Walmart, whose aggressive strategies require them to be at the forefront?

If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages.

New Breed of Super Quants at NYU Prep for Wall Street

The ginned-up name didn’t really exist a decade ago and today almost anyone with some coding or data know-how can claim the “scientist”

In the near absence of degree programs, investment firms must sort through the wannabes and find skilled data scientists from fields like physics and math.

“The term is a fairly loose term, and it can mean anything from somebody who’s an extreme expert in machine learning all the way down to someone who’s really more of a data analyst, preparing and cleaning data and producing charts, and it can mean everything in between,”

In the academic community there are diverging views about whether data science should be its own discipline, said Ronald Wasserstein, executive director of the American Statistical Association, a group that includes statisticians in academia and government.

While he sees data science emerging as a separate discipline, some academics are less certain and consider it a combined set of skills and ideas drawn from computer science and statistics, he said.

NYU professors discussed this question at length and decided that data science is sufficiently distinct from computer science and statistics and deserves its own academic center and shingle, said Vasant Dhar, a professor of data science who helped start the Ph.D.

Patil, who ran the data team at LinkedIn Corp., and Jeff Hammerbacher, who managed the data group at Facebook Inc., were both under pressure from their human resource departments to come up with an appropriate title for their team members.

Patil then tested their idea by posting the same job opening multiple times using different titles, such as analyst, research scientist and data scientist.

NYU trains data science students to conduct scientific inquiry that encompasses causal reasoning, machine learning, graph models, big-data analysis, statistics, math, ethics and more.

said NYU’s Dhar, who founded and runs the $250 million Adaptive Quant Trading program at SCT Capital Management, a hedge fund that uses machine learning to make investment decisions without human intervention.

Newly minted doctorates stand to make a lot of money -- $200,000 plus a bonus -- at a hedge fund, estimated Adam Zoia, head of recruiting firm Glocap.

Data Science and Statistics: Opportunities and Challenges

We now live in a world where it seems that everything about us is (or soon will be) tracked and recorded: what we eat, what we watch, how we socialize, what we like and dislike, our vital health statistics—and the list goes on.

Such unprecedented access to personal data presents potentially enormous opportunities to, for instance, help government officials make better policy decisions, allow businesses to operate more efficiently and profitably, streamline the use of public resources, support more personalized healthcare and drug design, and otherwise improve the overall quality of life in our society.

But it will also address such concerns as the latest trends in machine learning: how to extract meaningful insights and preferences from customer data in general and how to ask the right questions to make better business decisions.

However, we still lack the critical ability to seamlessly stitch together various pieces of data to make meaningful predictions that lead to high-impact decisions.

In a typical organization, basic operational tasks depend on decisions about how to invest available resources among different competing options, with an eye on one or more objectives.

The retailer’s primary operational problem is figuring out which products to showcase for customers, given various operational constraints such as its budget for buying inventory, the limits on its stores’ shelf space, and its suppliers’ schedules.

The question of choosing which products to showcase arises at different times for different types of decisions, such as deciding which products to purchase across the chain of stores, which to ship to various locations from distribution centers, which products to discount, which to promote via e-mail, and which to show to customers when they visit stores or e-commerce sites.

Operationally, this requires building a data-processing system that might be extremely large-scale and that might need to operate in real time with three high-level components: interfaces, infrastructure, and algorithms.

The resulting algorithms use the computation and storage infrastructure, based on the data obtained through the interface, and produce end results that can be delivered to the end user through the interface.

That data is collected through a customer’s browsing history and clicks on the e-commerce website, past purchases, and other online activity gleaned through our Web and mobile interfaces.

It is transformed into real-time, personalized decisions via potentially sophisticated data-processing algorithms that use behavioral models from the social sciences, along with methods from mathematical statistics and machine learning.

Key to building this type of personalization or recommendation system is having access to a skilled team of data scientists and statisticians who can identify appropriate statistical methods and behavioral models to develop data-processing algorithms.

Meanwhile, our new six-week, online course, “Data Science: Data to Insights”, which begins October 4, will share the latest information about ways to apply data science techniques to more effectively address your organization’s many challenges.

Statistics & Data Analysis: Does It Have A Future?

"Statistics and Big Data at Google"

Tim Hesterberg, Senior Statistician, Google.

LinkedIn Data Scientist Talks Statistics

Watch Deepak Kumar explain how important statistics--the science of learning from data--is to his job at LinkedIn.

Employers Discuss The Demand For Statisticians

Love data? A career in statistics could be for you. A new national survey by SHRM finds data analysis skills are in high demand and the job growth for statisticians is expected to increase...

Good With Numbers? Consider A Statistician Career & Earn Over $100K

Jill Schlesinger tells us how big data jobs are pouring into the Bay Area. (7/15/15)

Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four

More about this programme: Hans Rosling's famous lectures combine enormous quantities of public data with a sports commentator's style to reveal the..

Statisticians in Other Fields

Think statistics isn't used in your field? Think again. For fields as varied as journalism, sports, healthcare, agriculture, video game development, and many others, statistics is integral....

Data Statistician at ESPN | What Next?

If you love sports and math, then a job as an ESPN data statistician might be perfect for you.

1. Introduction to Statistics

NOTE: This video was recorded in Fall 2017. The rest of the lectures were recorded in Fall 2016, but video of Lecture 1 was not available. MIT 18.650 Statistics for Applications, Fall 2016...

Andrea Lodi - Why Montréal is a World Leader in Big Data and Machine Learning | Contact MTL

Canada Excellence Research Chair in Data Science for Real-Time Decision-Making, professor Andrea Lodi, is considered to be one of the most promising researchers in the field of big data. Polytechni...