AI News, BOOK REVIEW: Artificial Intelligence for Computational Sustainability: A Lab Companion/Machine Learning for Prediction

Artificial Intelligence for Computational Sustainability: A Lab Companion/Machine Learning for Prediction

Machine learning for purposes of predicting properties of objects and events -- as opposed to machine learning for purposes of improving search, planning, and problem solving -- is the dominant form of machine learning studied (though the latter is often usefully understood in terms of the former).

In this lab, you will examine the effects of climate and climate change on the distributions of several species of tree, and then use climate and species-range data to construct computational models of species distribution using maximum entropy modeling (also known as Maxent)[9][10][11].

Maxent is a general method from information theory for finding the probability distribution that has maximum entropy (i.e., is the most non-committal, or closest to a uniform distribution), subject to a set of constraints that represent our partial knowledge of the target distribution.
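The principle can be illustrated on a toy problem that is much simpler than species distribution modeling. The sketch below is an assumption-laden illustration, not the Maxent software's algorithm: it finds the maximum-entropy distribution over the faces of a die when the only constraint is a fixed mean. The function names (`maxent_distribution`, `entropy`) and the bisection search are choices made for this example. The solution is known to have the exponential (Gibbs) form, with probability proportional to `exp(lam * value)`.

```python
import math

def maxent_distribution(values, target_mean, tol=1e-10):
    """Maximum-entropy distribution over `values` with a fixed mean.

    Assumes min(values) < target_mean < max(values). The solution has the
    form p_i proportional to exp(lam * v_i); lam is found by bisection,
    using the fact that the mean increases monotonically with lam.
    """
    def gibbs(lam):
        exps = [lam * v for v in values]
        m = max(exps)  # subtract the max exponent for numerical stability
        weights = [math.exp(e - m) for e in exps]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(gibbs(lam), values))

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return gibbs((lo + hi) / 2)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

faces = [1, 2, 3, 4, 5, 6]
uniform = maxent_distribution(faces, 3.5)  # constraint matches a fair die
skewed = maxent_distribution(faces, 4.5)   # constraint forces higher faces

print([round(p, 3) for p in uniform])
print([round(p, 3) for p in skewed])
```

With a mean of 3.5 the constraint carries no information beyond a fair die, so the result is uniform. A mean of 4.5 shifts probability toward the higher faces, but only as much as the constraint demands.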

Each location on the map, including the known samples, is characterized by a set of climate variables, such as mean annual temperature, mean diurnal temperature range, mean precipitation during the coldest quarter of the year, etc.

Overlaid on each climate map are maps of six species’ ranges: bigcone Douglas fir (Pseudotsuga macrocarpa), Bishop pine (Pinus muricata), blue oak (Quercus douglasii), Jeffrey pine (Pinus jeffreyi), coast redwood (Sequoia sempervirens), and giant sequoia (Sequoiadendron giganteum).

The Maxent software [1] for species distribution modeling was developed in a collaboration between machine learning researchers and a biologist (emphasizing the interdisciplinary nature of computational sustainability) in 2004.

To learn the species distribution models, Maxent takes two inputs: (1) a file containing exact locations where a species of interest is known to grow and (2) a file containing climate data for each of those locations.

By evaluating the climate data at each location where the species of interest is present, Maxent calculates a probability function that describes the chances of a tree location having any given climate setting.

Each output folder will contain an .html webpage that summarizes the model's information, including the predicted species distribution overlaid on a map and several performance curves, as shown in the figures below.

Cooler colors (blue/green) indicate areas where the model calculates a low probability of species presence and warmer colors (red/yellow) indicate areas where the model calculates a higher probability of species presence.

For the response curve (middle figure), the x-axis represents a variety of climate values (in this case the annual precipitation in mm) and the y-axis indicates the probability of finding the species of interest in an area with any given annual precipitation.

Notice that the ROC curve lies in the unit square, so a model with perfect (100%) accuracy would have the red line go all the way to the upper left (coordinates (0,1)) and would have area 1 (although this is seldom achieved).
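To make the geometry concrete, here is a minimal sketch of how an ROC curve and its area can be computed from model scores. It assumes labeled data (1 for presence, 0 for absence or background); the function names `roc_points` and `auc` and the sample scores are hypothetical and are not part of the Maxent software.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score threshold, starting at (0, 0).
    Labels are 1 (presence) or 0 (absence/background); both classes must occur.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so they yield a single point.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A model that ranks every presence above every absence.
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# A model that ranks one absence above one presence.
mixed = roc_points([0.9, 0.7, 0.6, 0.3], [1, 0, 1, 0])

print(perfect, auc(perfect))
print(mixed, auc(mixed))
```

The perfect ranking passes through (0, 1) and has area 1. In the second case, three of the four presence/absence pairs are ordered correctly, which matches its area of 0.75.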

This warming is expected to continue for many years to come as a result of an increase in the amount of long-wave radiation emitted towards the ground by greenhouse gas molecules like CO2, CH4, and H2O.

If evapotranspiration increases, new seedlings and mature trees growing on the lower elevation tree line between alpine forest above and desert scrub below will die more often and the lower tree line will also rise.

While real temperature change will be very spatially, seasonally, and diurnally variable (warming should be most substantial near poles, during winter, and at night), this hypothetical temperature change is applied everywhere at all times.

Few undergraduate textbooks go significantly into other forms of regression, such as polynomial regression and tree-structured regression, but these texts typically provide ample material for instructors and the lab text to get into these issues, perhaps with pointers to other online content that is created in response to the lab text’s coverage.

An important aspect of both regression and decision trees is that they make explicit the principle of 'context' through the strategy of recursive decomposition – some variables may be informative in some contexts (e.g., subtrees), but not others.
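The role of context can be shown with a small sketch of the split search at the heart of a regression tree. This is an illustrative assumption rather than code from the lab text: the helper names (`sse`, `best_split`), the two features (a wet/dry flag and a scaled temperature), and the data are all made up for the example.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, targets):
    """Return (feature index, threshold, error reduction) or None.

    Tries every midpoint between consecutive distinct values of every
    feature and keeps the binary split that most reduces squared error.
    """
    base = sse(targets)
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            gain = base - sse(left) - sse(right)
            if gain > 1e-12 and (best is None or gain > best[2]):
                best = (f, t, gain)
    return best

# Hypothetical data: feature 0 is a wet-site flag, feature 1 a scaled temperature.
rows = [(0, 0.2), (0, 0.4), (0, 0.6), (0, 0.8),
        (1, 0.2), (1, 0.4), (1, 0.6), (1, 0.8)]
targets = [0, 0, 10, 10, 5, 5, 5, 5]

# Split search within each context (subtree) defined by feature 0.
dry = [(r, y) for r, y in zip(rows, targets) if r[0] == 0]
wet = [(r, y) for r, y in zip(rows, targets) if r[0] == 1]
dry_split = best_split([r for r, _ in dry], [y for _, y in dry])
wet_split = best_split([r for r, _ in wet], [y for _, y in wet])

print(dry_split)  # temperature is informative on dry sites
print(wet_split)  # no split helps on wet sites
```

In this toy data, temperature predicts the target on dry sites but carries no information on wet sites, so a tree would split on it in one subtree and not in the other.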

An ecological footprint (Global Footprint Network, 2012) is the amount of land (e.g., in hectares) that is needed to sustain indefinitely, without degradation, a process or entity, ranging in scale from (manufacture, use, and disposal of) individual artifacts to cities, nations and the world’s human population.

In supervised learning there is (typically) one attribute or variable, called the dependent variable, that is the focus of attention -- the goal of supervised learning from labeled data is to optimize prediction performance of this one dependent variable given some or all of the values of the remaining independent variables.

In contrast, unsupervised learning can be cast as a problem in which no one variable is the exclusive focus of attention, but rather a system might be called upon to make predictions along any variables with unknown values, given known values for other variables.

Unsupervised learning, such as belief network learning and clustering, can be used to discover, represent, and exploit statistical relationships between features and objects (e.g., people, processes, artifacts) for purposes of contextualizing and predicting ecological footprints.

A comparison of absolute performance of different correlative and mechanistic species distribution models in an independent area

Species distribution models are valuable tools for addressing questions in climate change ecology and biogeography, as well as in evolutionary and conservation biology. Understanding how correlative and mechanistic models are tested and evaluated is therefore vital to their practical usefulness (Guisan and Thuiller 2005).

In measuring model performance, AUC is threshold independent and thus particularly suitable for the performance evaluation of ordinal score models such as logistic regression with true presence–absence data.

In this regard, we believe that the MESS maps will not identify changes in correlations between variables, and tests for these are also critical because the model parameters are estimated on the correlation structure between predictors in the training data.

We also note that the comparison of the individual technique of CL to an ensemble approach of the five correlative models showed that there was a better agreement between the ensemble output of correlative model projections with the mechanistic model output when compared to only using single‐modeling techniques.

In this regard, our results also indicate that it is paramount to know how reliable SDM predictions are, and that ideally this should be tested case by case: the TSS for Asparagus asparagoides in the Bioclim, GLM, MaxEnt, and BRT models was 0.73, 0.72, 0.74, and 0.76, respectively, while it was 0.42 in the RF model.
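For readers unfamiliar with the true skill statistic (TSS), it is conventionally defined as sensitivity plus specificity minus one, computed at a chosen threshold. The sketch below assumes that standard definition; the function name, threshold values, and sample data are hypothetical and do not come from the study.

```python
def true_skill_statistic(scores, labels, threshold):
    """TSS = sensitivity + specificity - 1 at a given score threshold.

    A site is predicted present when its score is >= threshold.
    Labels are 1 (presence) or 0 (absence); both classes must occur.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

scores = [0.9, 0.7, 0.6, 0.3]
labels = [1, 0, 1, 0]

for threshold in (0.5, 0.65, 0.8):
    print(threshold, true_skill_statistic(scores, labels, threshold))
```

Unlike AUC, TSS depends on the threshold: the same scores give different values at different cut-offs, so the threshold used should be reported alongside the statistic.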

(2008) who documented that AUC is not an appropriate measure of comparative accuracy between model results for five reasons: (1) the probability values predicted and the closeness of fit of the model are ignored;

It should be noted that there are a number of important decisions to be made in constructing an SDM and our study, as well as other related studies, describes factors which can impact on or limit results including (1) occurrence in theoretically unsuitable habitat of a particular mobile species, (2) occurrence in theoretically unsuitable habitat of sessile species (e.g., plants), (3) failure to observe a species in a suitable habitat, (4) low detectability of a particular species, (5) ecotypes of the same species and sibling species, (6) historical bias in natural history collections, and (7) no absences (Guisan and Thuiller 2005;

Mapping Species Distributions with MAXENT Using a Geographically Biased Sample of Presence Data: A Performance Assessment of Methods for Correcting Sampling Bias

As an unexpected first finding, we noticed that the range of AUC values obtained for biased and corrected models remained high even for models with the strongest biases.

AUC may be a good statistical measure of discrimination ability, but it often fails to quantify the ecological realism of the modeled distribution [72], [73], [83], especially when estimated from presence-only data.

Contrary to previous studies investigating sampling bias correction in SDM that focused on a few methods and simple biases [41], [52]–[54], we reviewed here five different ways to deal with sampling bias and used both real and virtual datasets under various bias scenarios.

In addition, rather than basing our conclusions on island species [41], [52], [54], we used continental species whose distributions are clearly shaped by climate and not by a geographically bound space.

Our results clearly show that the efficiency of the sampling bias correction methods tested here can vary widely depending on the modeling conditions (bias type and correction method).

For instance, AUC often increases with the size of the study area, because a larger area includes background points whose environmental characteristics are far from the species' requirements, which artificially inflates SDM validation scores [65].

A relevant selection of the training area (the geographic region in which background points are selected) should reflect the geographical space accessible to the species over a given time period [65].

Given the high variability in the correction performance of the different methods depending on various factors, it is difficult to propose a universal guideline for solving sampling bias.

An estimation of sampling probabilities across the study area can provide insight into the potential bias affecting the collected observations and help in choosing a correction method.

For instance, it is possible to split a large dataset and apply systematic sampling to each subset for species with broad distributions [44], or to apply the biasfile method after first using another correction method. Nevertheless, we found that only systematic sampling consistently performed well irrespective of species and bias type.

Since systematic sampling seems to be robust to different sampling biases and species, we suggest that it could be the method selected first if no further attempts to correct sampling bias are to be made.

Even if a step-by-step framework would be ideal for defining a reliable bias correction strategy, we highlight that most SDM studies, at least those that use MAXENT as the modeling algorithm, would benefit greatly from this simple tweaking of input occurrences.

In this regard, we encourage MAXENT users to carefully take sampling bias into account, and to use systematic sampling of their input occurrences as a quick and simple resolution of bias.
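One common way to implement systematic sampling is grid-based thinning: overlay a grid on the study area and keep a single randomly chosen occurrence per cell. The sketch below assumes that interpretation. The function name, the cell size, and the coordinates are hypothetical, and the excerpt above does not specify the exact procedure or grid resolution the authors used.

```python
import math
import random

def systematic_sample(points, cell_size, seed=0):
    """Keep one randomly chosen occurrence per grid cell.

    `points` is a list of (longitude, latitude) pairs and `cell_size` is the
    grid resolution in the same units. The seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    cells = {}
    for lon, lat in points:
        key = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        cells.setdefault(key, []).append((lon, lat))
    # Iterate cells in sorted order so the output is deterministic.
    return [rng.choice(cells[key]) for key in sorted(cells)]

# Hypothetical occurrences: three clustered near a road, two isolated records.
occurrences = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.5, 5.5), (9.1, 2.2)]
thinned = systematic_sample(occurrences, cell_size=1.0)

print(thinned)
```

The three clustered records fall in one cell and are reduced to a single point, while the isolated records are kept, which evens out sampling intensity before the occurrences are passed to the model.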

Species Distribution Modeling in R Tutorial

Practice I: Biological and environmental data for Species Distribution Modelling

This is the third part of a training course on Species Distribution Modelling (also called Ecological Niche Modelling) taught by Richard Pearson at University ...

Create an ArcGIS R tool for species distribution modeling

A couple of months ago, ESRI released a bridge library to connect ArcGIS and R. This library was developed with the purpose of facilitating management and ...

Applications V: Predicting species’ invasions

This is the tenth part of a training course on Species Distribution Modelling (also called Ecological Niche Modelling) taught by Richard Pearson at University ...

Course: Introduction to Ecological Niche Modelling - 2nd Edition

2nd Edition Introduction to Ecological Niche Modelling November 19th-23rd, 2018, Barcelona (Spain) Topic: Ecology Course overview This course will teach ...

MaxEnt 2017 - Yannis L. Kalaidzidis - Text attribution by Bayesian Network Classifier

Oral session presented at MaxEnt 2017 37th International Workshop on Bayesian Inference and Maximum ..

Ecological Niche modelling workflow on the BioVeL portal

This movie shows you how to create, test, and project a species distribution model under different climate scenarios, using a butterfly species as example.

High Resolution Climate Models to Benefit Avian Conservation

This webinar, "Application of High Resolution Climate Models to Benefit Avian Conservation in the Prairie Pothole Region, ..

Webinar: Species distribution models and future extinction risk of reptiles and amphibians

Title: Ecophysiological species distribution models and future extinction risk of reptiles and amphibians Presenter: Barry Sinervo, Professor, University of ...

Statistical model choice in phylogenetic biogeography

Dr. Nicholas Matzke discusses special cases of a supermodel implemented in BioGeoBEARS, a new R package that implements the most popular ...