AI News


Today I’m going to describe some of the principles outlined by Beck and Andres, again asking whether they are applicable to data science and whether data science could benefit from their application.

[Update: new post on the next layer: practices] Beck and Andres have a list of 14 principles: humanity, economics, mutual benefit, self-similarity, improvement, diversity, reflection, flow, opportunity, redundancy, failure, quality, baby steps and accepted responsibility.

Acknowledging that fact leads to a discussion about how to get the most out of people, and the authors offer this list of needs: basic safety, accomplishment, belonging, growth and intimacy.

The practices of XP seek to meet these needs and, by limiting working hours, give space for the individual to meet other needs like relaxation, exercise and socialising.

It is sometimes hard to remember that a row of network logs or credit card transactions represents the actions of a real human being.

Beck and Andres use the perhaps controversial example of internal documentation for software, arguing that excessive documentation does not benefit the developer; the mutually beneficial solution is to write automated tests, refactor out complexity, and choose coherent names and metaphors that make the code clear to anyone reading it for the first time.
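A minimal sketch of what "tests as documentation" can look like in practice. The function and test names here are entirely hypothetical; the point is that a descriptive test name plus a plain assertion tells a first-time reader what the code is supposed to do, without a separate document.

```python
def monthly_churn_rate(customers_start, customers_lost):
    """Fraction of starting customers lost during the month."""
    if customers_start == 0:
        raise ValueError("cannot compute churn with no customers")
    return customers_lost / customers_start


# The tests double as documentation: each name states a behaviour,
# and the assertion shows a concrete example of it.
def test_churn_is_fraction_of_starting_customers():
    assert monthly_churn_rate(200, 10) == 0.05


def test_churn_with_no_customers_is_an_error():
    try:
        monthly_churn_rate(0, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Run under a test runner such as pytest, these tests also guard against regressions when the code is later refactored, which is the other half of the mutual benefit.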

For data scientists, making your code and model usable is clearly important, especially in a consulting environment where you will probably not be around to implement the model and iterate on it further.

In the spirit of the humanity point above, perhaps data scientists can expand this definition to include the benefit of the person who provides the data, or the population they come from.

Does a normal consumer benefit enough from the fraud detection algorithm to understand the need for deep inspection of all their credit card transactions?

Consider the high-profile example of Facebook’s mood manipulation experiment: the people in the negatively influenced group clearly did not benefit from that activity.

I find this principle quite freeing, as it makes clear that you shouldn’t worry if you are starting from a bad position, as long as you decide to head in the right direction.

Often the personality types drawn to data science exhibit some degree of perfectionism, whether it manifests as continually tweaking code to eke out a little more performance or obsessing over image placement in slide decks.

At Pivotal we have data scientists with backgrounds including biology, physics, computer science and online media, and we often find that techniques and solutions in one field are extremely useful when applied to a different one.

There is nothing unique about how data scientists should approach reflection, but in my experience it is easy for high-achieving data scientists with academic backgrounds to be less reflective than necessary, resting somewhat on the laurels of past achievements.

Broad Institute

Tracing back to the Human Genome Project, Broad scientists have been involved in systematic efforts to create large datasets intended to serve as a foundation for biological and medical studies in thousands of laboratories around the world.

As an academic non-profit research institute, Broad recognizes the unique role that such institutions play in propelling the biomedical ecosystem by exploring fundamental questions and working on risky, early-stage projects that often lack clear economic return.

To maximize its impact, our work (including discoveries, data, tools, technologies, knowledge, and intellectual property) should be made readily available for use, at no cost, by other academic and non-profit research institutions.

Five principles for applying data science for social good

Editor's note: Jake Porway expanded on the ideas outlined in this piece in his Strata + Hadoop World NYC 2015 keynote address, 'What does it take to apply data science for social good?'

It’s a satirical take on our sector’s occasional tendency to equate narrow tech solutions like “software-designed data centers for cloud computing” with historical improvements to the human condition.

Whether you take it as parody or not, there is a very real swell in organizations hoping to use “data for good.” Every week, a data or technology company declares that it wants to “do good” and there are countless workshops hosted by major foundations musing on what “big data can do for society.” Add to that a growing number of data-for-good programs from Data Science for Social Good’s fantastic summer program to Bayes Impact’s data science fellowships to DrivenData’s data-science-for-good competitions, and you can see how quickly this idea of “data for good” is growing.

Yes, it’s an exciting time to be exploring the ways new datasets, new techniques, and new scientists could be deployed to “make the world a better place.” We’ve already seen deep learning applied to ocean health, satellite imagery used to estimate poverty levels, and cellphone data used to elucidate Nairobi’s hidden public transportation routes.

At DataKind, we’ve spent the last three years teaming data scientists with social change organizations, bringing the same algorithms that companies use to boost profits to mission-driven organizations to boost their impact.

Hillary Clinton, Melinda Gates, and Chelsea Clinton stood on stage and lauded the report, the culmination of a year-long effort to aggregate and analyze new and existing global data, as the biggest, most comprehensive data collection effort about women and gender ever attempted.

These datasets are sometimes cutely referred to as “massive passive” data, because they are large, backward-looking, exceedingly coarse, and nearly impossible to make decisions from, much less actually perform any real statistical analysis upon.

The promise of a data-driven society lies in the sudden availability of more real-time, granular data, accessible as a resource for looking forward, not just a fossil record to look back upon.

Mobile phone data, satellite data, even simple social media data or digitized documents can yield mountains of rich, insightful data from which we can build statistical models, create smarter systems, and adjust course to provide the most successful social interventions.

To effect social change, we must spread the idea beyond technologists that data is more than “spreadsheets” or “indicators.” We must consider any digital information, of any kind, as a potential data source that could yield new information.

In other words, “data science is not overhead.” But many organizations doing tremendous work still think of data science as overhead, or don’t think of it at all, even though their expertise is critical to moving the entire field forward.

As data scientists, we need to find ways of illustrating the power and potential of data science to address social sector issues, so that organizations and their funders see this untapped powerful resource for what it is.

It was clear that, like so many other well-intentioned efforts, the project was at risk of gathering dust on a shelf if the team of volunteers couldn’t help the organization understand what they had learned and how it could be integrated into the organization’s ongoing work.

Take, for example, a seemingly innocuous challenge like “providing healthier school lunches.” What initially appears to be a straightforward opportunity to improve the nutritional offerings available to schools quickly involves the complex educational budgeting system, which in turn is determined through even more politically fraught processes.

DataKind is piloting a collective impact model called DataKind Labs, which seeks to bring together diverse problem holders, data holders, and data science experts to co-create solutions that can be applied to an entire sector-wide challenge.

The current approach appears to be “get the tech geeks to hack on this problem, and we’ll have cool new solutions!” I’ve opined that, though there are many benefits to hackathons, you can’t just hack your way to social change.

Under this media partnership, we will be regularly contributing our findings to O'Reilly, bringing new and inspirational examples of data science across the social sector to our community, and giving you new opportunities to get involved with the cause, from volunteering on world-changing projects to simply lending your voice.

Why Program? Reproducibility, Provenance, and Tracking Changes

Before diving into the “why program?” portion of his talk, Wickham discusses the two “main engines” that help data scientists understand what is going on within a data set: visualization and models.

During the talk, Wickham says “that the first visualization you look at will always reveal a data quality error, and if it doesn’t reveal a data quality error, that just means you haven’t found one yet.” Yet he also notes that visualizations do not scale particularly well, and suggests using models to complement them.
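A toy illustration of that point, with entirely made-up data: even the crudest first look at a column, a summary before any plot, tends to surface a quality problem. Here a -999 sentinel value is hiding among plausible temperature readings.

```python
# Hypothetical sensor readings; -999.0 is a common "missing" sentinel
# that silently poisons any statistics computed over the raw column.
temperatures = [21.3, 22.1, -999.0, 20.8, 21.9, -999.0, 22.4]


def summarize(values):
    """Quick first-look summary of a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


print(summarize(temperatures))   # impossible minimum flags the problem

# Drop the sentinel values before doing any real analysis or modelling.
clean = [v for v in temperatures if v != -999.0]
print(summarize(clean))
```

The physically impossible minimum in the first summary is exactly the kind of data quality error Wickham says your first visualization will reveal; a histogram of the same column would show it just as plainly.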

Wickham advocates using a programming language, rather than a GUI, to do data science as it provides the opportunity to reproduce work, understand the data provenance (which is also linked to reproducibility), and the ability to see how the data analysis has evolved over time.

In this portion of his talk, Wickham references a project on GitHub where people can browse the series of commits, drill down into each one, and see not just where the data analysis is now but how it evolved over time. He contrasts this with Excel, which lets people accidentally randomize their data without any record of provenance or a rollback option.
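The reproducibility argument can be sketched in a few lines. The data and steps below are hypothetical; the point is that every transformation from raw input to result is explicit code, so re-running the script regenerates the analysis exactly, and committing the script to version control records how the analysis changed, unlike a spreadsheet where a stray keystroke can silently alter cells.

```python
# Raw export as it might arrive, including a blank cell.
raw = ["12", "7", "", "31", "5"]

# Step 1: parse, explicitly dropping blank cells (the decision is
# visible in the code, not hidden in a manual spreadsheet edit).
parsed = [int(x) for x in raw if x != ""]

# Step 2: aggregate.
total = sum(parsed)

print(f"kept {len(parsed)} of {len(raw)} rows, total = {total}")
```

Anyone re-running this script gets the same numbers from the same input, and `git log` over its history shows exactly when and why each step changed.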

As Wickham defines data science as “the process by which data becomes understanding, knowledge, and insight”, he advocates using data science tools where value is gained from iteration, surprise, reproducibility, and scalability.

In particular, he argues that being a data scientist and being a programmer are not mutually exclusive, and that using a programming language helps data scientists find the real signal within their data.

What is Data Science?

Are you ready to reap the business value of Data Science? Data science is the art of looking at data and applying scientific principles to figure out how to make ...

Data Science and Statistics: different worlds?

Chris Wiggins (Chief Data Scientist, New York Times) David Hand (Emeritus Professor of Mathematics, Imperial College) Francine Bennett (Founder, ...

Basics of Data Science Terminology

This video is an introduction to the subject of Data Science and explains the basic terminologies used in the subject. It tries to elaborate about the underlying ...

Data Science and AI in Pharma and Healthcare (CXOTalk #275)

Data, artificial intelligence and machine learning are having a profound influence on healthcare, drug discovery, and personalized medicine. On this episode ...

Integrating Business Intelligence and Data Science

Data for Good Exchange 2017: Why the industry needs a data science code of ethics

At the Data for Good Exchange on Sunday, September 24, 2017, we spoke with Gideon Mann, Bloomberg's Head of Data Science, Natalie Evans Harris, COO ...

The Most Important Data Science Technologies to Learn for 2017


7 Data Science Projects & Use Cases in the Insurance Industry

The insurance industry has been using quantitative research for a very long time. Actuaries have been using statistical modelling to price insurance products and ...

Want a career in science? Consider BIOINFORMATICS!

Bioinformatics is a rapidly growing field that mixes biology with computer science and technology in the analysis of big data. Watch as two students from ...

HR meets science at Google with Prasad Setty

Prasad Setty, Vice President of People Analytics & Compensation at Google, moderated a panel on using data to make better people decisions. Google uses ...