$6M for UC Berkeley and Cal Poly to expand open-source software for scientific computing and data science

July 7, 2015 — Three foundations pledged $6M over the next three years to Project Jupyter, an open-source software project that supports scientific computing and data science across a wide range of programming languages via a large, public, open and inclusive community.

With funding from the Leona M. and Harry B. Helmsley Charitable Trust, the Alfred P. Sloan Foundation, and the Gordon and Betty Moore Foundation, researchers at UC Berkeley and Cal Poly will expand and improve the capabilities of the Jupyter Notebook, a web-based platform that allows scientists, researchers and educators to combine live code, equations, narrative text and rich media into a single, interactive document.

“Given the importance of computing across modern society, we see uses of our tools that range from high school education in programming to the nation’s supercomputing facilities and the leaders of the tech industry,” the project leaders write. Teachers, for example, can prepare a lecture with the Jupyter Notebook and then turn it into a web-based slide show in which they can write code and see its results in real time.
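That notebook-to-slides conversion can be scripted with nbconvert’s Python API. The snippet below is a minimal sketch, where "lecture.ipynb" is a hypothetical file name; the same result is available from the command line via `jupyter nbconvert --to slides`.

```python
# Minimal sketch: render a lecture notebook as a reveal.js slide show
# using nbconvert's SlidesExporter ("lecture.ipynb" is hypothetical).
from nbconvert import SlidesExporter

exporter = SlidesExporter()
body, resources = exporter.from_filename("lecture.ipynb")

with open("lecture.slides.html", "w", encoding="utf-8") as f:
    f.write(body)  # open this file in a browser to present
```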

The capabilities of the Jupyter Notebook will be expanded to give users easier access to collaborative computing and to let them reuse their content in a wide range of settings, such as standalone web applications and dashboards.

UC Berkeley enrolls more than 36,000 undergraduate and graduate students, and has more than 1,500 full-time and 500 part-time faculty members in more than 130 academic departments and more than 110 interdisciplinary research units and field stations.

Today, UC Berkeley is considered one of the nation’s most prestigious universities – public or private – and is internationally recognized for its distinguished record of world-class scholarship, innovation and concern for the betterment of our world.

Known for its Learn by Doing approach, small class sizes and open access to expert faculty, Cal Poly is a distinctive learning community whose 20,000 academically motivated students enjoy an unrivaled hands-on educational experience that prepares them to lead successful personal and professional lives.

Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science

Thus, in order for data, and the computations that process and visualize that data, to be useful for humans, they must be embedded into a narrative — a computational narrative — that tells a story for a particular audience and context.

Eventually, it may even be important to enable non-coding lab scientists to perform those same statistical analyses and visualizations on data from new samples using a simplified graphical user interface.

That is, other people — including the same scientist six months later — need to be able to understand exactly what was done (code, data and narrative) and be able to reliably reproduce the work in order to build new ideas off it.

Reproducibility has long been one of the foundations of the scientific method, but the rise of data science brings new challenges to scientific reproducibility, while simultaneously extending these questions to other domains like policy making, government or journalism.

Given this background, the core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.

We propose to accomplish this through Project Jupyter (formerly IPython), a set of open-source software tools for interactive and exploratory computing. These software projects support scientific computing and data science across a wide range of programming languages (Python, Julia, R, etc.) and already provide basic reproducibility and collaboration features. This grant aims at making major progress atop this foundation.

The main application offered by Project Jupyter is the Jupyter Notebook, a web-based interactive computing platform that allows users to author computational narratives that combine live code, equations, narrative text, interactive user interfaces and other rich media.
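Concretely, a notebook is a JSON document whose cells interleave narrative and code. The sketch below, which assumes the nbformat package, builds a tiny two-cell notebook programmatically; the file name and cell contents are illustrative.

```python
# A minimal sketch of a computational narrative built with nbformat:
# one Markdown cell (text plus a LaTeX equation) and one live code cell.
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

nb = new_notebook(cells=[
    new_markdown_cell("# Free fall\nDistance fallen: $d = \\tfrac{1}{2} g t^2$"),
    new_code_cell("g, t = 9.81, 3.0\n0.5 * g * t**2  # metres"),
])
nbformat.write(nb, "free_fall.ipynb")  # illustrative file name
```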

While these products are extremely popular, their proprietary nature and expensive licensing fees make them unattractive for open and reproducible scientific research and data science.

First, Google Drive[12] has, quite literally, invented modern online collaboration by offering a productive environment that allows multiple, distributed users to simultaneously edit documents, spreadsheets, and slide presentations.

GitHub is a commercial (but free for public usage) collaboration platform built around git that has become invaluable for companies, open source projects and scientists alike.

Over the past few years, we have invested significant time and effort in relationships with other individuals and organizations whose missions, impact areas, user groups and technologies overlap with those of Project Jupyter.

We also work closely with a number of companies that build products based on the Jupyter Notebook, contribute code and financial resources to the project, and serve as advisors on a wide range of technical and strategic topics.

In 2014, we began working with Andrew Odewahn, the CTO of O’Reilly, to explore ways of integrating the Jupyter architecture into their publishing platform, to enable both authors and readers of O’Reilly content to experience books as live, computational entities. O’Reilly already has multiple books that include code examples as Jupyter Notebooks.

Project Jupyter’s mission is to create open source tools for interactive scientific computing and data science in research, education and industry, with an emphasis on usability, collaboration and reproducibility.

The core development team has grown to roughly a dozen active contributors and a “long tail” of community contributors currently numbering over 400, who participate with various degrees of regularity.

For the first decade, IPython focused strictly on scientific and interactive computing in the Python language, providing a rich interactive shell well suited to the workflow of everyday research, as well as tools for parallel computing.

While scientists have always used computers as a research tool, they use them differently than industrial software engineers: in science, the computer is a kind of “abstract microscope” that enables the scientist to peek into data and models that represent or summarize the real world. Software engineers tend to write programs to solve reasonably well-defined and independently specified problems, and their deliverable is a software artifact: a standalone application, library or system.

While standalone software libraries exist in science (say, the building of a library to solve differential equations), we target a more common scenario: the iterative exploration of a problem via computation and the interactive study of intermediate results.

In this kind of computational work, scientists evolve their codes iteratively, executing small test programs or fragments and using the results of each iteration as insight that informs the next step.

The nature of this process means that, for scientists, an interactive computing system is of paramount importance: the ability to execute very small fragments of code (possibly a single line) and immediately see the results is at the heart of their workflow.

As Richard Hamming put it, “the purpose of computing is insight, not numbers.” For this reason, computation in science is ultimately in service of a result that needs to be woven into the bigger narrative of the questions under study: that result will be part of a paper, will support or contest a theory, will advance our understanding of a domain.

The problem the Jupyter project tackles is precisely this intersection: creating tools that best support the computational workflow of scientific inquiry, and providing the environment to create the proper narrative around that central act of computation.

Finally, while all the above has been cast in the context of scientific research, the rise of ubiquitous data science means that these same questions are no longer the purview only of physicists or biologists. Today, policy makers, journalists, business analysts and financial model builders all work with the same tools and challenges: their data may come from a population census or the stock market, and instead of an academic paper they may be writing a blog post or a sales report for a client, but ultimately the process is similar.

Given the importance of computing across modern society, we see uses of our tools that range from high school education in programming to the nation’s supercomputing facilities and the leaders of the tech industry mentioned above.

The challenge for our organization is to maintain a focused research agenda where we provide a coherent vision of the future in interactive computation, a clean set of abstractions and tools, and a sustainable community model.

A summary of that structure and governance follows[16]. The main project activities are supported by a combination of open source volunteers, funded researchers and industry partners.

The project has seen many other relevant achievements over the last few years; there are many more teaching materials, conference talks, blog posts and projects using our architecture and tools than we can fit in this space.

At the heart of the entire Jupyter architecture lies the idea of interactive computing: humans executing small pieces of code in various programming languages, and immediately seeing the results of their computation.

Interactive computing is central to data science because scientific problems benefit from an exploratory process where the results of each computation inform the next step and guide the formation of insights about the problem at hand.
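This execute-and-observe loop is what the Jupyter messaging protocol implements. As a rough illustration, the jupyter_client package can drive a kernel directly; the sketch below assumes a locally installed python3 kernel.

```python
# A minimal sketch of the interactive loop: start a kernel, send it a
# tiny code fragment, and block until the result comes back.
from jupyter_client.manager import start_new_kernel

km, kc = start_new_kernel(kernel_name="python3")
try:
    reply = kc.execute_interactive("2 + 2")  # prints output as it arrives
    print(reply["content"]["status"])        # "ok" on success
finally:
    kc.stop_channels()
    km.shutdown_kernel()
```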

Third, we will create a system that allows users to bundle, share and deploy sets of widgets as independent “apps.” This will allow users and developers to leverage the notebook for highly customized, but still data- and code-driven, user interfaces that can be used with non-technical audiences.
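For a sense of what such widget-driven interfaces look like, the sketch below uses ipywidgets inside a notebook: a slider a non-coding user can drag, with the plot updating immediately (ipywidgets, numpy and matplotlib are assumed to be installed; the example itself is illustrative).

```python
# A minimal sketch of a widget-based UI: dragging the slider re-runs the
# plotting function, with no code editing required by the user.
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

@widgets.interact(freq=(1.0, 10.0, 0.5))
def plot_wave(freq=2.0):
    x = np.linspace(0, 2 * np.pi, 500)
    plt.plot(x, np.sin(freq * x))
    plt.title(f"sin({freq:.1f} x)")
    plt.show()
```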

First, we will create a more modular set of UI components to enable users and third-party developers to build purpose-specific UIs with custom components, such as file browsers, debuggers, variable inspectors, documentation panes, etc.

These include multicell operations (cut/copy/paste), structural operations that allow different sections and subsections to be collapsed/expanded and moved atomically, and an improved dashboard for working with directories of files and notebooks.

First, we will improve the ability of users to transition from a single large notebook to a smaller notebook that calls code contained in external modules that can be tested and documented separately.
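As a rough illustration of that refactoring, the sketch below shows analysis logic moved out of notebook cells into a small module that a test suite can cover independently; all names here are hypothetical.

```python
# stats_utils.py -- hypothetical module extracted from a large notebook,
# now testable and documentable on its own; the notebook simply imports it.
def zscore(values):
    """Standardize a sequence of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# A matching pytest-style test, kept alongside the module.
def test_zscore_centers_and_scales():
    z = zscore([1.0, 2.0, 3.0])
    assert abs(sum(z)) < 1e-9           # zero mean
    assert abs(max(z) - 1.2247) < 1e-3  # unit variance
```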

We will also develop tools to verify that a notebook is reproducible, that is, that it gives the same results when run again. This verification will be performed by rerunning the original notebook, comparing the output of the rerun notebook with that of the original, and then creating a human-readable “reproducibility report” that summarizes the differences, if any.
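A minimal sketch of that verification scheme is shown below, using the nbformat and nbclient packages; the notebook name is hypothetical and the comparison is deliberately simplified to text outputs only.

```python
# Rerun a notebook, compare the fresh outputs against the stored ones,
# and print a small, human-readable reproducibility report.
import copy
import nbformat
from nbclient import NotebookClient

original = nbformat.read("analysis.ipynb", as_version=4)  # hypothetical file
rerun = copy.deepcopy(original)
NotebookClient(rerun).execute()

def text_outputs(cell):
    """Collect the plain-text parts of a code cell's outputs."""
    chunks = []
    for out in cell.get("outputs", []):
        if "text" in out:                          # stream output
            chunks.append("".join(out["text"]))
        elif "text/plain" in out.get("data", {}):  # execution result
            chunks.append("".join(out["data"]["text/plain"]))
    return chunks

changed = [i for i, (a, b) in enumerate(zip(original.cells, rerun.cells))
           if a.cell_type == "code" and text_outputs(a) != text_outputs(b)]
for i in changed:
    print(f"Cell {i}: outputs differ after re-execution")
print("Report:",
      "no differences found" if not changed else f"{len(changed)} cell(s) changed")
```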

These narratives end up being used in a wide variety of contexts: academic publications, blog posts, books, traditional journalism articles, technical documentation, government reports, grant applications, industry research and commercial products.

In our modern, web-enabled companies, universities, research labs and non-profits, data science and scientific computing are carried out by distributed teams whose work and contributions are tightly coupled.

Today, the Jupyter notebook has almost no support for these types of synchronous and asynchronous collaboration, which limits the impact and usefulness of the notebook in collaboration-rich contexts such as education and scientific research.

This will allow multiple users to share notebooks with each other online, and edit those notebooks together in real time. To this collaborative editing system we will add user presence, commenting and cloud-based document storage.

Because of the difficulty and scope of this work, we are working directly with Google Research to help us design the underlying architectures and implement them in our software (see above for the details of this collaboration). The initial implementation of these features will rely on open Google APIs (Drive API, Realtime API); however, we plan on building abstractions and APIs that will allow us to plug into a number of different collaborative backends (Firebase, etc.).
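The sketch below illustrates the kind of backend abstraction this implies. Every name in it is hypothetical, not an actual Jupyter API; concrete implementations (Drive Realtime API, Firebase, etc.) would sit behind the same small interface.

```python
# A hypothetical sketch of a pluggable collaboration backend: the notebook
# talks to one small interface, and concrete transports plug in behind it.
from abc import ABC, abstractmethod
from typing import Callable

class CollaborationBackend(ABC):
    """Pluggable transport for shared, real-time notebook documents."""

    @abstractmethod
    def open_document(self, doc_id: str) -> dict:
        """Fetch the current shared state of a document."""

    @abstractmethod
    def apply_change(self, doc_id: str, change: dict) -> None:
        """Send a local edit so other participants see it."""

    @abstractmethod
    def on_remote_change(self, doc_id: str, callback: Callable[[dict], None]) -> None:
        """Register a handler for edits made by other users."""

class InMemoryBackend(CollaborationBackend):
    """Toy single-process implementation, useful for testing the interface."""
    def __init__(self):
        self._docs, self._subscribers = {}, {}

    def open_document(self, doc_id):
        return self._docs.setdefault(doc_id, {"cells": []})

    def apply_change(self, doc_id, change):
        self._docs.setdefault(doc_id, {"cells": []}).update(change)
        for cb in self._subscribers.get(doc_id, []):
            cb(change)  # notify other "users" in this process

    def on_remote_change(self, doc_id, callback):
        self._subscribers.setdefault(doc_id, []).append(callback)
```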

JupyterHub eases the installation and deployment of the notebook to large numbers of users and opens the door for novel collaboration possibilities. However, the version of JupyterHub that exists today has very limited sharing capabilities.
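As a rough illustration, a JupyterHub deployment is driven by a single Python configuration file; the values below are illustrative, and a real deployment would also choose an authenticator and spawner class.

```python
# jupyterhub_config.py -- a minimal sketch of serving the notebook to many
# users with JupyterHub; all values are illustrative.
c = get_config()  # noqa: F821 -- injected by JupyterHub when loading this file

c.JupyterHub.ip = "0.0.0.0"                   # listen on all interfaces
c.JupyterHub.port = 8000                      # public-facing port
c.Authenticator.admin_users = {"instructor"}  # hypothetical admin account
c.Spawner.notebook_dir = "~/notebooks"        # per-user starting directory
```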

This focus area is conceptually different from the three main technical areas described above and will involve an ongoing set of activities throughout the project period.

First, we will set up a robust training program that leverages senior project staff to manage and train new undergraduates, graduate students and postdocs to work on the project at Cal Poly and UC Berkeley.

These meetings bring together 5–15 core developers and designers to review the project’s progress, discuss major technical and architectural issues and plan the future roadmap of the project.

The approach described here has emerged from our own experience building open source software over the last 14 years, as well as a careful study and application of the methods described by Eric Ries in his book The Lean Startup and in the books and courses of Steve Blank.

First, the validation stage is completely unpredictable. Features are used in unexpected ways, new groups of users emerge, other developers extend and reuse our work in innovative ways, and new collaborators and stakeholders emerge.


Fernando Pérez

After he completed a PhD in particle physics at the University of Colorado at Boulder, his postdoctoral research in applied mathematics centered on the development of fast algorithms for the solution of partial differential equations in multiple dimensions.

 Today, his research focuses on creating tools for modern computational research and data science across domain disciplines, with an emphasis on high-level languages, interactive and literate computing, and reproducible research.

Software System Award Honors Project Jupyter Team

Project Jupyter is an open, international collaboration that develops tools for interactive computing: a process of human-computer interplay for scientific exploration and data analysis.

The collaboration develops applications such as the widely popular Jupyter Notebook, an open-source web app that allows users to create and share documents that contain live code, equations, visualizations and narrative text.

“The flexibility of the Jupyter architecture makes it easy to deploy in a variety of scenarios: while individual users can run the tools on a personal laptop or workstation, the same tools can be deployed on remote resources,” says Shane Canon, a project engineer at NERSC.

“In fact, NERSC offers Jupyter as an interactive tool for remote access to its supercomputing resources.” At UC Berkeley, two new courses, Foundations of Data Science and Principles and Techniques of Data Science, will be supported by Jupyter Notebooks deployed in the cloud and integrated with campus authentication.

As a graduate student studying physics at the University of Colorado in the early 2000s, Pérez remembers using a hodgepodge of software systems to illustrate code, equations, visualizations and text in his scientific computing papers.

He found researchers around the globe who had all independently started building scientific computing tools in Python, and combined these disparate efforts into one open-source platform called IPython—“I” for interactive.

“One afternoon in late 2001, I was a physics graduate student at the University of Colorado working on my dissertation and decided to spend an afternoon writing the original, tiny version of IPython,” says Pérez.

“For me, it’s been a wild ride, made possible by going from a personal exploration to an open collaboration with an incredible team.”

“This is a project that has demonstrated 20 years of intellectual contributions with major impact in research, education and industry, and it continues to make its advances available to the world as an open platform,” says Kathy Yelick, Associate Laboratory Director for Computing Sciences at Berkeley Lab.

Lawrence Berkeley National Laboratory addresses the world’s most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe.

Project Jupyter: Bridging Science, Education and Communication (Data Dialogs 2017)

Fernando Pérez — Project Jupyter, evolved from the IPython environment, provides a platform for interactive computing that is widely used today in research, ...

Gateways 2016: Fernando Perez on Project Jupyter

Dr. Fernando Perez, creator of IPython (now Project Jupyter) and scientist at Lawrence Berkeley National Laboratory and Berkeley Institute for Data Science at ...

Keynote: Project Jupyter | SciPy 2016 | Brian Granger

Brian Granger is an Associate Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical physics, with a Ph.D ...

Large Scale Teaching Infrastructure with Kubernetes - Yuvi Panda, Berkeley University

Data Science & Programming literacy is an important aspect of literacy in ...

JupyterHub from the Ground Up with Kubernetes - Camilla Montonen

PyData London 2018 JupyterHub is a great way to provide a data analytics environment for a class, a research group or a team of data scientists. In this talk, we ...
