AI News, NIPS Proceedingsβ

Part of: Advances in Neural Information Processing Systems 27 (NIPS 2014)

We define a fairness solution criterion for multi-agent decision-making problems, where agents have local interests.

Our experiments on resource allocation problems show that this fairness criterion provides a more favorable solution than the utilitarian criterion, and that our game-theoretic approach is significantly faster than linear programming.

Lipschitz constant of the underlying target function which serves as a

nonparametric machine learning method for which we derive online learning

model-reference adaptive control and provide a convergence guarantee on the

compare the performance of our approach to recently proposed alternative

effectiveness of radial basis function kernel (beyond Gaussian) estimators

outperform state-of-the-art mechanisms that use iid sampling under weak

can be accurately measured by a property we define called the charm of the

kernel, and that orthogonal random features provide optimal (in terms of mean

which explain why orthogonal random features outperform unstructured on

downstream tasks such as kernel ridge regression by showing that orthogonal

random features provide kernel algorithms with better spectral properties

generally to estimate the benefits from applying orthogonal transforms.
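
The span above describes approximating a kernel with orthogonal rather than i.i.d. random features. As a hedged, minimal illustration (not the paper's estimator or experiments), the sketch below approximates a Gaussian (RBF) kernel with random Fourier features whose frequency matrix is either i.i.d. Gaussian or an orthogonal block obtained via QR decomposition with chi-distributed row norms; the dimensions, lengthscale and MSE comparison are illustrative choices.

# Compare i.i.d. vs orthogonal random Fourier features for the RBF kernel (illustrative sketch).
import numpy as np

def random_features(X, W):
    # Feature map phi(x) = [cos(Wx), sin(Wx)] / sqrt(D): phi(x)^T phi(y) is an unbiased
    # estimate of the RBF kernel when rows of W follow the Gaussian spectral density.
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

def iid_frequencies(D, d, lengthscale=1.0, seed=None):
    return np.random.default_rng(seed).standard_normal((D, d)) / lengthscale

def orthogonal_frequencies(D, d, lengthscale=1.0, seed=None):
    # Replace the Gaussian block by a Haar-random orthogonal matrix (via QR), rescaling each
    # row by a chi-distributed norm so the row marginals match the i.i.d. Gaussian case.
    rng = np.random.default_rng(seed)
    assert D <= d  # one orthogonal block for simplicity; stack several blocks if D > d
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    norms = np.sqrt(rng.chisquare(df=d, size=D))
    return (norms[:, None] * Q[:D]) / lengthscale

d = D = 8
rng = np.random.default_rng(0)
X = rng.standard_normal((200, d))
K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
for name, W in [("iid", iid_frequencies(D, d, seed=1)),
                ("orthogonal", orthogonal_frequencies(D, d, seed=1))]:
    Phi = random_features(X, W)
    print(name, "kernel-approximation MSE:", np.mean((Phi @ Phi.T - K_exact) ** 2))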

and data availability, as well as algorithmic advances, have led machine

learning techniques to impressive results in regression, classification,

the proximity to the physical limits of chip fabrication alongside the

increasing size of datasets are motivating a growing number of researchers to

explore the possibility of harnessing the power of quantum computation to

in quantum machine learning and discuss perspectives for a mixed readership

of classical machine learning and quantum computation experts.

emphasis will be placed on clarifying the limitations of quantum algorithms,

resources are expected to provide advantages for learning problems.

in the presence of noise and certain computationally hard problems in machine

questions, like how to upload classical data into quantum form, will also be

agent interacts with the environment by taking actions and observing the next

rewards, and actions can all induce randomness in the observed long-term

work advocating a distributional approach to reinforcement learning in which

the distribution over returns is modeled explicitly instead of only

number of gaps between the theoretical and algorithmic results given by

reinforcement learning algorithm consistent with our theoretical formulation.

Finally, we evaluate this new algorithm on the Atari 2600 games, observing

algorithms can learn complex behavioral skills, but real-world application of

these methods requires a large amount of experience to be collected by the

learning process requires extensive human intervention.

propose an autonomous method for safe and efficient reinforcement learning

that simultaneously learns a forward and reset policy, with the reset policy

function for the reset policy, we can automatically determine when the

forward policy is about to enter a non-reversible state, providing for

of the reset policy can greatly reduce the number of manual resets required

to learn a task, can reduce the number of unsafe actions that lead to

machine learning methods in numerous domains involving human subjects,

methods to measure and eliminate unfairness from machine learning methods.

of fair decision making: distributive fairness, i.e., the fairness of the

organizational justice and focus on another dimension of fair decision

making: procedural fairness, i.e., the fairness of the decision making

features used in the decision process, and evaluate the moral judgments of

on two real world datasets using human surveys on the Amazon Mechanical Turk

(AMT) platform, demonstrating that we capture important properties of

optimize the tradeoff between procedural fairness and prediction accuracy.

our datasets, we observe empirically that procedural fairness may be achieved

with little cost to outcome fairness, but that some loss of accuracy is

of probabilities, are omnipresent in real-world problems tackled by machine

simulators that are widely used in engineering and scientific research,

generative adversarial networks (GANs) for image synthesis, and

hot-off-the-press approximate inference techniques relying on implicit

models rely on approximating the intractable distribution or optimisation

objective for gradient-based optimisation, which is liable to produce

estimates the score function of the implicitly defined distribution.

efficacy of the proposed estimator is empirically demonstrated by examples

that include meta-learning for approximate inference and entropy regularised

variational continual learning (VCL), a simple but general framework for

continual learning that fuses online variational inference (VI) and recent

successfully train both deep discriminative models and deep generative models

in complex continual learning settings where existing tasks evolve over time

continual learning outperforms state-of-the-art continual learning methods on

a variety of tasks, avoiding catastrophic forgetting in a fully automatic

learning (RL) has been proven to be a powerful, general tool for learning

large for solving challenging real-world problems, even for off-policy

that the learning signal consists only of scalar rewards, ignoring much of

the rich information contained in state transition tuples.

uses this information, by training a predictive model, but often does not

achieve the same asymptotic performance as model-free RL due to model bias.

We introduce temporal difference models (TDMs), a family of goal-conditioned

value functions that can be trained with model-free learning and used for

RL: they leverage the rich information in state transitions to learn very

efficiently, while still attaining asymptotic performance that exceeds that

range of continuous control tasks, TDMs provide a substantial improvement in

reinforcement learning model the entire distribution of returns, rather than

proposed C51 algorithm, based on categorical distributional reinforcement

framework to analyse CDRL algorithms, establish the importance of the
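
Since the fragments above refer to the C51 algorithm and categorical distributional reinforcement learning (CDRL), a minimal sketch of the categorical projection step those methods rely on may help: the target distribution, shifted by the reward and scaled by the discount, is projected back onto a fixed grid of atoms. The 51-atom grid, reward, discount and the uniform toy distribution are illustrative.

# Hedged sketch of a C51-style categorical Bellman backup for a single transition.
import numpy as np

def categorical_projection(next_probs, reward, gamma, atoms):
    """Project the target distribution (reward + gamma * Z') onto the fixed atom grid."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = atoms[1] - atoms[0]
    target = np.zeros_like(next_probs)
    tz = np.clip(reward + gamma * atoms, v_min, v_max)   # shift, scale and clip each atom
    b = (tz - v_min) / delta                             # fractional index on the grid
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    for j, p in enumerate(next_probs):
        if lower[j] == upper[j]:                         # lands exactly on an atom
            target[lower[j]] += p
        else:                                            # split mass between neighbouring atoms
            target[lower[j]] += p * (upper[j] - b[j])
            target[upper[j]] += p * (b[j] - lower[j])
    return target

atoms = np.linspace(-10.0, 10.0, 51)                     # 51 atoms, as in C51
next_probs = np.full(51, 1.0 / 51)                       # toy next-state return distribution
target = categorical_projection(next_probs, reward=1.0, gamma=0.99, atoms=atoms)
print("projected mass:", target.sum())                   # probability mass is preserved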

inference algorithms for probabilistic programming languages, as used in data

We show how to conceptualise and analyse such inference algorithms as

manipulating intermediate representations of probabilistic programs using

higher-order functions and inductive types, and their denotational semantics.

difficulty: it is impossible to define a measurable space structure over the

collection of measurable functions between arbitrary measurable spaces that

proposed mathematical structure that supports both function spaces and

representing probabilistic programs, and semantic validity criteria for

traditional measure theoretic origins, we use Kock’s synthetic measure

towards unsupervised representation learning: we encode samples into

back into the original input space by leveraging MPE inference.

reconstructions and extend it towards dealing with missing embedding

attack for solving programming competition-style problems from input-output

predict properties of the program that generated the outputs from the inputs.

We use the neural network's predictions to augment search techniques from the

programming languages community, including enumerative search and an

comparable to the simplest problems on programming competition websites.

method to sample from a discrete probability distribution, or to estimate its

random perturbation to the distribution in a particular way, each time

related methods, of which the Gumbel trick is one member, and show that the

new methods have superior properties in several settings with minimal

computational benefits for discrete graphical models, Gumbel perturbations on

setting, proving new upper and lower bounds on the log partition function and

we balance the discussion by showing how the simpler analytical form of the
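
As a hedged illustration of the basic Gumbel trick these relatives extend, the toy sketch below perturbs unnormalised log-potentials with Gumbel noise: the argmax gives exact samples from the normalised distribution, and the expected maximum (minus the Euler–Mascheroni constant) estimates the log partition function. The potentials and sample count are arbitrary.

# Hedged sketch of the Gumbel(-max) trick on a toy unnormalised discrete distribution.
import numpy as np

rng = np.random.default_rng(0)
log_potentials = np.array([1.0, 0.5, -0.3, 2.0])        # unnormalised log-probabilities
log_Z = np.log(np.sum(np.exp(log_potentials)))          # exact log partition function

num_samples = 100_000
perturbed = log_potentials + rng.gumbel(size=(num_samples, log_potentials.size))

samples = perturbed.argmax(axis=1)                      # exact samples from softmax(log_potentials)
print("empirical:", np.bincount(samples, minlength=4) / num_samples)
print("softmax:  ", np.exp(log_potentials - log_Z))

# E[max_i(phi_i + G_i)] = log Z + Euler-Mascheroni constant, so subtracting it estimates log Z.
print("log Z:", log_Z, "estimate:", perturbed.max(axis=1).mean() - np.euler_gamma)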

consider the problem of sampling a sequence from a discrete random probability measure (RPM) with countable support, under (probabilistic)

inference methods in probabilistic programming systems.

suite of methods that enable these models to be deployed in large data regime

lacks a principled method to handle streaming data in which the posterior

distribution over function values and the hyperparameters are updated in an

hand-crafted heuristics for hyperparameter learning, or suffer from

catastrophic forgetting or slow updating when new data arrive.

develops a new principled framework for deploying Gaussian process

probabilistic models in the streaming setting, providing principled methods

proposed framework is experimentally validated using synthetic and real-world

flexible distributions over functions that enable high-level assumptions

about unknown functions to be encoded in a parsimonious, flexible and general

analytical intractabilities that arise when data are sufficiently numerous or

approximation schemes have been developed over the last 15 years to address

Unlike much of the previous venerable work in this area, the new framework is

built on standard methods for approximate inference (variational free-energy,

is performed at `inference time' rather than at `modelling time', resolving

awkward philosophical and empirical questions that trouble previous

pseudo-point approximation methods that outperform current approaches on

and numerically robust approaches to nonparametric machine learning that have

been proposed to be utilised in the context of system identification and

order to compute inferences over unobserved function values.

most of these approaches rely on exact knowledge about the input space metric

estimate the Lipschitz constants from the data are not robust to noise or

seem to be ad-hoc and typically are decoupled from the ultimate learning and

optimising parameters of the presupposed metrics by minimising validation set

To avoid poor performance due to local minima, we propose

of embeddings based on structured random matrices with orthogonal rows which

can be applied in many machine learning applications including dimensionality

transform and the angular kernel, we show that we can select matrices

yielding guaranteed improved performance in accuracy and/or speed compared to

chain-based perspectives to help understand the benefits, and empirical

results which suggest that the approach is helpful in a wider range of

holds the promise of enabling autonomous robots to learn large repertoires of

applications of reinforcement learning often compromise the autonomy of the

learning process in favor of achieving training times that are practical for

learning alleviates this limitation by training general-purpose neural

network policies, but applications of direct deep reinforcement learning

algorithms have so far been restricted to simulated settings and relatively

simple tasks, due to their apparent high sample complexity.

demonstrate that a recent deep reinforcement learning algorithm based on

off-policy training of deep Q-functions can scale to complex 3D manipulation

tasks and can learn deep neural network policies efficiently enough to train

further reduced by parallelizing the algorithm across multiple robots which

that our method can learn a variety of 3D manipulation skills in simulation

and a complex door opening skill on real robots without any prior

reinforcement learning (RL) methods have been successful in a wide variety of

learning, but at the cost of high variance, which often requires large

present Q-Prop, a policy gradient method that uses a Taylor expansion of the

stable, and effectively combines the benefits of on-policy and off-policy

algorithms, and use control variate theory to derive two variants of Q-Prop

provides substantial gains in sample efficiency over trust region policy

stability over deep deterministic policy gradient (DDPG), the

reinforcement learning methods using previously collected data can improve

sample efficiency over on-policy policy gradient techniques.

paper examines, both theoretically and empirically, approaches to merging on-

show that off-policy updates with a value function estimator can be

interpolated with on-policy policy gradient updates whilst still satisfying

Our analysis uses control variate methods to produce a

family of policy gradient algorithms, with several recently proposed

comparison of these techniques with the remaining algorithmic details fixed,

and show how different mixing of off-policy gradient estimates with on-policy

algorithm provides a generalization and unification of existing deep policy

gradient techniques, has theoretical guarantees on the bias introduced by

off-policy updates, and improves on the state-of-the-art model-free deep RL
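
Both abstracts above lean on control variate theory. The toy sketch below is a generic illustration of that idea, not the Q-Prop or interpolated policy gradient estimator itself: subtracting a correlated baseline with known mean from a Monte Carlo estimator leaves it unbiased while reducing its variance. The target E[X^2] and the baseline g(X) = X are arbitrary choices.

# Generic control-variate variance reduction on a toy Monte Carlo estimate (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)

f = x ** 2                                   # quantity of interest; true expectation is 2
g = x                                        # control variate with known mean E[g] = 1
beta = np.cov(f, g)[0, 1] / np.var(g)        # (near-)optimal scaling of the control variate

controlled = f - beta * (g - 1.0)            # still unbiased for E[f], but less variable
print("plain estimate:     ", f.mean(), " est. variance:", f.var() / f.size)
print("controlled estimate:", controlled.mean(), " est. variance:", controlled.var() / f.size)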

natural choice for representing discrete structure in the world.

stochastic neural networks rarely use categorical latent variables due to the

efficient gradient estimator that replaces the non-differentiable sample from

estimators on structured output prediction and unsupervised generative

modeling tasks with categorical latent variables, and enables large speedups
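
A minimal sketch of the kind of continuous relaxation such gradient estimators use, in the style of a Gumbel-softmax (Concrete) sample: perturb the logits with Gumbel noise and pass them through a temperature-controlled softmax, which approaches a one-hot categorical sample as the temperature decreases while remaining differentiable. The logits and temperatures here are illustrative.

# Hedged sketch of a Gumbel-softmax style relaxed categorical sample.
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    gumbels = rng.gumbel(size=logits.shape)
    y = (logits + gumbels) / temperature
    y = np.exp(y - y.max())                  # softmax, computed stably
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))
for temperature in [5.0, 1.0, 0.1]:
    print(f"tau={temperature}: {np.round(gumbel_softmax_sample(logits, temperature, rng), 3)}")
# Lower temperatures give samples closer to one-hot vectors (the hard categorical sample),
# while the relaxed sample stays differentiable with respect to the logits.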

general method for improving the structure and quality of sequences generated

by a recurrent neural network (RNN), while maintaining information originally

on data using maximum likelihood estimation (MLE), and the probability

distribution over the next token in the sequence learned by this model is

learning (RL) to generate higher-quality outputs that account for

domain-specific incentives while retaining proximity to the prior policy of

proposed method improves the desired properties and structure of the

generated sequences, while maintaining information learned from data.

fairness in machine learning has focused on various statistical

observational criteria have severe inherent limitations that prevent them

criteria, we frame the problem of discrimination based on protected

to assume about our model of the causal data generating process?" Through

articulate why and when observational criteria fail, thus formalizing what

put forward natural causal non-discrimination criteria and develop algorithms

inertial sensors (3D accelerometers and 3D gyroscopes) have become widely

available due to their small size and low cost.

are obtained at high sampling rates and can be integrated to obtain position

scale, but suffer from integration drift over longer time scales.

this issue, inertial sensors are typically combined with additional sensors

position and orientation estimation using inertial sensors.

different modeling choices and a selected number of important algorithms.

algorithms include optimization-based smoothing and filtering as well as

computationally cheaper extended Kalman filter and complementary filter
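
As a hedged, one-dimensional illustration of the complementary filter mentioned above (not the tutorial's algorithms), the sketch below fuses an integrated gyroscope rate, which is accurate over short time scales but drifts, with a noisy but drift-free accelerometer tilt angle. The sample rate, mixing coefficient and synthetic signals are made up for the example.

# Minimal 1-D complementary filter: gyro integration corrected by accelerometer tilt.
import numpy as np

dt = 0.01                      # 100 Hz sampling
alpha = 0.98                   # weight on the integrated gyro estimate
rng = np.random.default_rng(0)

t = np.arange(0, 10, dt)
true_angle = 0.5 * np.sin(0.5 * t)                                # ground-truth angle (rad)
gyro_rate = np.gradient(true_angle, dt) + 0.02                    # rate measurement with bias
accel_angle = true_angle + 0.05 * rng.standard_normal(t.size)     # noisy tilt measurement

estimate = np.zeros_like(t)
for k in range(1, t.size):
    predicted = estimate[k - 1] + gyro_rate[k] * dt               # integrate the gyro
    estimate[k] = alpha * predicted + (1 - alpha) * accel_angle[k]  # correct with accelerometer

print("gyro-only drift (rad):       ", abs((np.cumsum(gyro_rate) * dt - true_angle)[-1]))
print("complementary filter error (rad):", abs(estimate[-1] - true_angle[-1]))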

estimates with real-world Bayesian deep learning models, practical inference

been used for machine vision and medical applications, but VI can severely

existing models to be changed radically, thus are of limited use for

objectives, deriving a simple inference technique which, together with

dropout, can be easily implemented with existing models by simply changing

uncertainty far away from the data using adversarial images, showing that

learning technique called LACKI, the estimated (possibly nonlinear) model

on these, a number of predictive controllers with stability guaranteed by

off-line is considered, and robust stability and recursive feasibility are

ensured by using tightened constraints in the optimisation problem.

controller has been extended to the more interesting and complex case: the

online learning of the model, where the new data collected from feedback is

compositional model by either stacking more blocks or by using a

in order to obtain analogous compact representations for the class of

blocks is that inference is often costly, intractable or difficult to carry

dimensional component forbids the direct use of standard simulation-based

Monte Carlo methods, which are exact inference schemes since they target the

provide general purpose exact inference schemes in the flavour of

probabilistic programming: the user is able to choose from a variety of

performing posterior inference with a Bayesian nonparametric mixture model.

Specifically, we introduce a novel and efficient MCMC sampling scheme in an

augmented space that has a small number of auxiliary variables per iteration.

We apply our sampling scheme to density estimation and clustering tasks

vehicle (AV) software is typically composed of a pipeline of individual

components, linking sensor inputs to motor outputs.

outputs propagate downstream, hence safe AV software must consider the

improved by quantifying the uncertainties of component outputs and

propose example problems, and highlight possible solutions.

data-efficient reinforcement learning method for continuous state-action

small noise exist, such as PILCO which learns the cartpole swing-up task in

in belief space, consistent with partially observable Markov decision processes, with

significant observation noise, outperforming more naive methods such as

post-hoc application of a filter to policies optimised by the original

task, which involves nonlinear dynamics and requires nonlinear control.

multitude of data-modelling contexts ranging from robotics to the social

standard probabilistic modelling tools to the circular domain.

introduce a new multivariate distribution over circular variables, called the

multivariate circular distributions are shown to be special cases of this

probabilistic principal component analysis with circular hidden variables.

These models can leverage standard modelling tools (e.g.

development of an efficient variational free-energy scheme for performing

attempt to find a most likely configuration of a discrete graphical model.

a solution to the relaxed problem is obtained at an integral vertex then the

We consider binary pairwise models and introduce new methods which allow us

to demonstrate refined conditions for tightness of LP relaxations in the

relaxations, treewidth is not precisely the right way to characterize

uprooting and rerooting graphical models was introduced specifically for

binary pairwise models by Weller [18] as a way to transform a model to any of

a whole equivalence class of related models, such that inference on any one

inference, or relevant bounds, may be much easier to obtain or more accurate

to models with higher-order potentials and develop theoretical insights.

example, we demonstrate that the triplet-consistent polytope TRI is unique in

significantly improve accuracy of methods of inference for higher-order

presents three iterative methods for orientation estimation.

(NLS) estimation of the angular velocity which is used to parametrise the

obtaining class annotations is expensive while at the same time unlabelled

assumptions on the data distribution, recent work has managed to learn

SPNs are deep probabilistic models admitting inference in linear time in

allows generative and discriminative semi-supervised learning, (2) guarantees

that adding unlabelled data can increase, but not degrade, the performance

(safe), and (3) is computationally efficient and does not enforce restrictive

safe semi-supervised learning with SPNs is competitive compared to

magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the

mechanics of a charged particle coupled to a magnetic field.

and construct a symplectic, leapfrog-like integrator allowing for the

these non-canonical dynamics can lead to improved mixing of magnetic HMC

convolutional structure into Gaussian processes, making them more suited to

construction of an inter-domain inducing point approximation that is

generalisation benefit of a convolutional kernel, together with fast but

marginal likelihood can be used to find an optimal weighting between

automated, data-driven decision making in an ever expanding range of

applications has raised concerns about its potential unfairness towards

focused on defining, detecting, and removing unfairness from data-driven

(equality) in treatment or outcomes for different social groups, tend to be

needlessly stringent, limiting the overall decision making accuracy.

literature in economics and game theory and propose preference-based notions

of fairness: given the choice between various sets of decision treatments

or outcomes, any group of users would collectively prefer its treatment or

Then, we introduce tractable proxies to design convex margin-based

we experiment with a variety of synthetic and real-world datasets and show

that preference-based fairness allows for greater decision accuracy than

learning, and admits a fast kernel-width-selection procedure as the random

constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and

2014], where trees are also sampled via a Mondrian process, but fit

formulas are a domain-specific language adopted by several R packages for

describing an important and useful class of statistical models: hierarchical

regression, in which regression coefficients are themselves defined by nested

language, enriched with formulas based on our regression calculus.

best of our knowledge, this is the first formal description of the core ideas

formally equivalent to neural networks with multiple, infinitely wide hidden

more flexible, have a greater capacity to generalise, and provide better

calibrated uncertainty estimates than alternative deep models.

develops a new approximate Bayesian learning scheme that enables DGPs to be

applied to a range of medium to large scale regression problems for the first

utilise presupposed Lipschitz properties to compute inferences over

constant of the target function is known a priori they offer convergence

general setting that builds on Hölder continuity relative to

constant online from function value observations that possibly are corrupted

parameters within a kinky inference rule gives rise to a nonparametric

machine learning method, for which we establish strong universal

any continuous function in the limit of increasingly dense data to within a

worst-case error bound that depends on the level of observational

roll-dynamics and performance metrics our approach outperforms recently

mechanism for multi-agent convex optimisation and coordination with no-regret

develop an indirect mechanism for coordinated, distributed multi-agent

no-regret learning based mechanism design and renders it applicable to

stated as a collection of single-agent convex programmes coupled by common

A key idea is to recast the joint optimisation problem as

distributed learning in a repeated game between the original agents and a

newly introduced group of adversarial agents who influence prices for

that all agents employ selfish, sub-linear regret algorithms in the course of

the repeated game, we guarantee that our mechanism can achieve design goals

within an error which approaches zero as the agents gain experience.

error bounds are deterministic or probabilistic, depending on the nature of

the regret bounds available for the algorithms employed by the agents.

functions encode smoothness assumptions on the structure of the function to

limitation is to find a different representation of the data by introducing a

novel supervised method that jointly learns a transformation of the data into

a feature space and a GP regression from the feature space to observed space.

The Manifold GP is a full GP and allows us to learn data representations, which

evaluate our approach on complex non-smooth functions where standard GPs

perform poorly, such as step functions and robotics tasks with contacts.

Bayesian generalised ensemble (BayesGE) is a new method that addresses two

major drawbacks of standard Markov chain Monte Carlo algorithms for inference

in high-dimensional probability models: inapplicability to estimate the

approach to iteratively update the belief about the density of states

(distribution of the log likelihood under the prior) for the model, with the

dual purpose of enhancing the sampling efficiency and making the estimation of

systems and show that it compares favourably to existing state-of-the-art

powerful parametric models that can be trained efficiently using the

large parametric functions with that of graphical models, which makes it

not directly applicable to stochastic networks that include discrete sampling

operations within their computational graph, training such networks remains

likelihood-ratio estimator by reducing its variance using a control variate

unlike prior attempts at using backpropagation for training stochastic

experiments on structured output prediction and discrete latent variable

modeling demonstrate that MuProp yields consistently good performance across

has been successfully applied to a range of challenging problems, and has

recently been extended to handle large neural network policies and value

particularly when using high-dimensional function approximators, tends to

algorithms and representations to reduce the sample complexity of deep

call normalized advantage functions (NAF), as an alternative to the more

commonly used policy gradient and actor-critic methods.

allows us to apply Q-learning with experience replay to continuous tasks, and

substantially improves performance on a set of simulated robotic control

of learned models for accelerating model-free reinforcement learning.

that iteratively refitted local linear models are especially effective for

this, and demonstrate substantially faster learning on domains where such
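
The normalized advantage function mentioned above restricts the advantage to a quadratic in the action, so the greedy continuous action is available in closed form. Below is a minimal sketch of that parameterisation, with the networks that would produce V(s), mu(s) and the Cholesky factor L(s) stubbed out by fixed toy values.

# Hedged sketch of the NAF parameterisation: Q(s,a) = V(s) - 0.5 (a - mu)^T P (a - mu), P = L L^T.
import numpy as np

def naf_q_value(action, value, mu, L):
    P = L @ L.T                              # positive semi-definite, so the advantage is <= 0
    diff = action - mu
    return value - 0.5 * diff @ P @ diff

value = 3.0                                  # V(s), the state value
mu = np.array([0.2, -0.1])                   # mu(s), the advantage-maximising action
L = np.array([[1.0, 0.0], [0.3, 0.5]])       # lower-triangular factor of P(s)

print("Q at mu:      ", naf_q_value(mu, value, mu, L))                 # equals V(s)
print("Q at other a: ", naf_q_value(np.array([1.0, 1.0]), value, mu, L))
# Because max_a Q(s, a) = V(s) in closed form, the Q-learning target r + gamma * V(s')
# needs no inner maximisation over continuous actions.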

new approximate inference method based on the minimization of

complex probabilistic models with little effort since it only requires as

parameter α, the method is able to interpolate between variational Bayes

methods unifies a number of existing approaches, and enables a smooth

interpolation from the evidence lower-bound to the log (marginal) likelihood

optimisation methods are deployed to obtain a tractable and unified framework

We further consider negative alpha values and propose a

novel variational inference method as a new special case in the proposed
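
As a hedged toy illustration of the alpha-parameterised family of variational objectives described above, the sketch below Monte Carlo estimates a Renyi-style bound, L_alpha = (1/(1-alpha)) log E_q[(p(theta, x)/q(theta))^(1-alpha)], for a conjugate Gaussian model: alpha approaching 1 recovers the evidence lower bound and alpha = 0 recovers the log marginal likelihood, with negative alpha values lying above it. The model, the approximating distribution q and the alpha grid are arbitrary, and the exact correspondence to the papers' parameterisation is an assumption of this sketch.

# Hedged Monte Carlo estimate of an alpha/Renyi variational bound on a toy Gaussian model.
import numpy as np
from scipy.stats import norm

x = 1.0
prior, lik_std = norm(0.0, 1.0), 1.0
q = norm(0.3, 0.9)                                   # a deliberately imperfect approximation
log_px = norm(0.0, np.sqrt(2.0)).logpdf(x)           # exact log marginal likelihood

theta = q.rvs(size=200_000, random_state=0)
log_w = prior.logpdf(theta) + norm(theta, lik_std).logpdf(x) - q.logpdf(theta)

def renyi_bound(alpha):
    if np.isclose(alpha, 1.0):                       # alpha -> 1 gives the standard ELBO
        return log_w.mean()
    a = (1.0 - alpha) * log_w
    return (np.logaddexp.reduce(a) - np.log(a.size)) / (1.0 - alpha)

for alpha in [1.0, 0.5, 0.0, -1.0]:
    print(f"alpha={alpha:+.1f}: bound = {renyi_bound(alpha):.4f}")
print(f"exact log p(x):      {log_px:.4f}")
# The bound is non-increasing in alpha, matching the interpolation from the evidence
# lower bound (alpha = 1) to the log marginal likelihood (alpha = 0) described above.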

variational framework for learning inducing variables (Titsias, 2009a) has

new proof of the result for infinite index sets which allows inducing points

that are not data points and likelihoods that depend on all function values.

We then discuss augmented index sets and show that, contrary to previous

we show how our framework sheds light on interdomain sparse approximations

unfamiliar dynamical systems with increasing autonomy are ubiquitous.

robotics, to finance, to industrial processing, autonomous learning helps

obviate a heavy reliance on experts for system identification and controller

Often real world systems are nonlinear, stochastic, and expensive to

therefore, nonlinear systems can be identified with minimal system

This thesis considers data efficient autonomous learning of

use deterministic models, which easily overfit data, especially small

scratch, similar to the PILCO algorithm, which achieved unprecedented data

noise by simulating a filtered control process using a tractably analytic

variable belief Markov decision process' when filters must predict under

take a step towards data efficient learning of high-dimensional control using

mitigates adverse effects of observation noise, much greater performance is

achieved when optimising controllers with evaluations faithful to reality: by

simulating closed-loop filtered control if executing closed-loop filtered

outperforming filters applied to systems optimised by unfiltered simulations.

We show directed exploration improves data efficiency.

dynamics models are almost as data efficient as Gaussian process models.

Results show data efficient learning of high-dimensional control is possible

areas such as computer vision and natural language processing to predict

prediction is performed by MAP inference or, equivalently, by solving an

obtain accurate predictions, both learning and inference typically require

striking observation that approximations based on linear programming (LP)

that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.

non-parametric estimation of functions of random variables using kernel mean

of the mean embedding of a random variable X lead to consistent estimators of

require an estimator of the mean embedding of their joint distribution as a

case, our results cover both mean embeddings based on i.i.d.

as 'reduced set' expansions in terms of dependent expansion points.

latter serves as a justification for using such expansions to limit memory
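
A minimal sketch of the mean-embedding idea referred to above (not the paper's estimators or guarantees): the empirical embedding of X is an expansion sum_i w_i k(x_i, .), and an embedding of f(X) can be obtained by pushing the expansion points through f. The toy data, kernel bandwidth and the squaring function f are illustrative.

# Hedged sketch: empirical kernel mean embeddings of X and of f(X).
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=500)                      # samples of X
weights = np.full(x.size, 1.0 / x.size)       # uniform weights -> empirical mean embedding

f = lambda t: t ** 2                          # the function of the random variable
y_fresh = rng.normal(size=500) ** 2           # independent samples of f(X) for comparison

# Evaluate both embeddings of f(X) at test locations: <mu, k(z, .)> = E[k(f(X), z)].
z = np.linspace(0.0, 4.0, 9)
embed_pushforward = weights @ rbf(f(x), z)    # push the expansion points through f
embed_fresh = rbf(y_fresh, z).mean(axis=0)    # embedding from fresh samples of f(X)

print(np.round(embed_pushforward, 3))
print(np.round(embed_fresh, 3))
# The two rows agree up to Monte Carlo error, illustrating that an expansion for the mean
# embedding of X yields a consistent expansion for the mean embedding of f(X).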

Most of the world's digital data is currently encoded in a sequential form,

concrete techniques for compressing various kinds of non-sequential data via

arithmetic coding, and derives re-usable probabilistic data models from

described for certain types of permutations, combinations and multisets;
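
A hedged back-of-the-envelope illustration of why encoding such structures natively helps: writing a multiset down as a sequence spends the bits needed to pick one of its distinguishable orderings, i.e. log2 of the multinomial coefficient n!/(c_1! ... c_k!). The example multiset is arbitrary.

# Bits wasted by encoding a multiset as an (arbitrarily ordered) sequence.
from collections import Counter
from math import lgamma, log

def wasted_bits_if_encoded_as_sequence(items):
    counts = Counter(items)
    n = len(items)
    # Natural log of the multinomial coefficient n! / (c_1! ... c_k!), via log-gamma for stability.
    log_orderings = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts.values())
    return log_orderings / log(2)             # convert nats to bits

items = list("abracadabra")                   # a small multiset of symbols
print(f"{wasted_bits_if_encoded_as_sequence(items):.2f} bits saved by ignoring order")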

model may be ‘uprooted’ to a fully symmetric model, wherein original

singleton potentials are transformed to potentials on edges to an added

variable, and then ‘rerooted’ to a new model on the original number of

deepens our understanding, may be applied to any existing algorithm to yield

improved methods in practice, generalizes earlier theoretical results, and

pairwise graphical models and provide an exact characterization (necessary

and sufficient conditions observing signs of potentials) of tightness for the

problem, by forbidding an odd-K5 (complete graph on 5 variables with all

edges repulsive) as a signed minor in the signed suspension graph.

captures signs of both singleton and edge potentials in a compact and

efficiently testable condition, and improves significantly on earlier

forbidding minors, draw connections and suggest paths for future

examine the effect of clamping variables for approximate inference in

undirected graphical models with pairwise relationships and discrete

and summing approximate sub-partition functions can lead only to a decrease

mean field method, in each case guaranteeing an improvement in the

approximation to consideration and examine ways to choose good variables to

frustrated cycles, and of checking the singleton entropy of a variable.

explore the value of our methods by empirical analysis and draw lessons to

programming (LP) relaxations are widely used to attempt to identify a most

relaxation attains an optimum vertex at an integral location and thus

guarantees an exact solution to the original optimization problem.

pairwise models and derive sufficient conditions for guaranteed tightness of

We provide simple new proofs of earlier results and

derive significant novel results including that LP+TRI is tight for any

theorem that may be used to break apart complex models into smaller pieces.
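
To make the LP relaxation being analysed concrete, the hedged sketch below solves the standard pairwise relaxation for two toy binary triangles with an off-the-shelf LP solver: an attractive model, where the relaxation is tight and returns an integral vertex, and a frustrated model, where it returns a fractional point. The potentials are illustrative, and this is the generic local-polytope relaxation rather than the specific constructions of the papers above.

# Hedged sketch: pairwise LP relaxation of MAP inference for small binary models.
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (0, 2), (1, 2)]

def solve_lp(theta, W):
    # Variables: q_0, q_1, q_2 (relaxed x_i) and one q_ij per edge (relaxed x_i * x_j).
    n, m = len(theta), len(edges)
    c = -np.concatenate([theta, W])           # linprog minimises, so negate the MAP objective
    A, b = [], []
    for e, (i, j) in enumerate(edges):
        p = n + e
        for other in (i, j):                  # q_ij <= q_i and q_ij <= q_j
            a = np.zeros(n + m); a[p] = 1.0; a[other] = -1.0
            A.append(a); b.append(0.0)
        a = np.zeros(n + m); a[i] = 1.0; a[j] = 1.0; a[p] = -1.0   # q_i + q_j - q_ij <= 1
        A.append(a); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=[(0, 1)] * (n + m))
    return res.x[:n], -res.fun

attractive = (np.array([1.0, -0.5, -0.5]), np.array([1.0, 1.0, 1.0]))
frustrated = (np.array([1.0, 1.0, 1.0]), np.array([-2.0, -2.0, -2.0]))
for name, (theta, W) in [("attractive", attractive), ("frustrated", frustrated)]:
    q, value = solve_lp(theta, W)
    print(f"{name}: LP optimum {value:.2f} at q = {np.round(q, 2)}")
# The attractive triangle yields an integral vertex (the LP is tight), while the frustrated
# triangle returns the fractional point q_i = 0.5 whose LP value exceeds every integral score.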

provide an approach to one-dimensional numerical integration on bounded

continuous with a known Lipschitz constant, these quadrature rules can

provide a tight error bound around their integral estimates and utilise the

translating a sample into an integral estimate with probabilistic uncertainty

since data-driven learning makes it possible to reduce the amount of engineering

learning (RL) approaches typically require many interactions with the system

to learn controllers, which is a practical limitation in real systems, such

address this problem, current learning approaches typically require

task-specific knowledge in the form of expert demonstrations, realistic

simulators, pre-shaped policies, or specific knowledge about the underlying

probabilistic, non-parametric Gaussian process transition model of the

and controller learning our approach reduces the effects of model errors, a

model-based policy search method achieves an unprecedented speed of learning.

We demonstrate its applicability to autonomous learning in real robot and

Abstract: We consider training a deep neural network to generate samples

optimization minimizing a two-sample test statistic: informally speaking, a

good generator network produces samples that cause a two-sample test to fail

unbiased estimate of the maximum mean discrepancy, which is the centerpiece

of the nonparametric kernel two-sample test proposed by Gretton et al.

network and an adversarial discriminator network, both trained to outwit the
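
A minimal sketch of the unbiased MMD^2 statistic from the Gretton et al. two-sample test referred to above, evaluated on toy samples with a Gaussian kernel; the bandwidth and data are illustrative. A generator trained this way would minimise this quantity between its samples and the data.

# Hedged sketch of the unbiased maximum mean discrepancy (MMD^2) estimator.
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Drop the diagonal terms so that the estimator of MMD^2 is unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
print("same distribution:   ", mmd2_unbiased(rng.normal(size=(500, 2)), reference))
print("shifted distribution:", mmd2_unbiased(rng.normal(loc=0.5, size=(500, 2)), reference))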

probability measures, which is referred to as the class Q, in the context

flexible parameterization, the distinguishing feature of the class Q is the

leads to derive an efficient marginal MCMC algorithm for posterior sampling

important in fields as disparate as the social sciences, biology, engineering

designed to learn Bayesian nonparametric models of time series.

quantify the uncertainty due to limitations in the quantity and the quality

whilst preventing overfitting when the data does not warrant complex

We begin with a unifying literature review on time series models

to exploit those insights by developing new learning algorithms for the

model with a simple, robust and fast learning algorithm that makes it well

difficult to tune) smoothing step that is a key part of learning nonlinear

Multivariate categorical data occur in many applications of machine learning.

One of the main difficulties with these vectors of categorical variables is

gained significant improvement in supervised tasks with this data.

models embed observations in a continuous space to capture similarities

unsupervised task of distribution estimation of multivariate categorical

Our model ties together many existing models, linking the

linear categorical latent Gaussian model, the Gaussian process latent

our model based on recent developments in sampling based variational

discrete counterparts in imputation tasks of sparse data.

covariance function with a finite Fourier series approximation and treat it

transforms the random covariance function to fit the data.

properties of our approximate inference, compare it to alternative ones, and

captures complex functions better than standard approaches and avoids

principal theoretical and practical approaches for designing machines that

which describes how to represent and manipulate uncertainty about models and

predictions, has a central role in scientific data analysis, machine

learning, robotics, cognitive science and artificial intelligence.

state-of-the-art advances in the field, namely, probabilistic programming,

Bayesian optimization, data compression and automatic model discovery.

process (GP) regression has O(N^3) runtime for data size N, making

structure inherent in particular covariance functions, including GPs with

the multidimensional input setting, despite the preponderance of

extensions of structured GPs to multidimensional inputs, for models with

inference in additive GPs, showing a novel connection between the classic

kernel structure, we present a novel method for GPs with inputs on a

several data sets, achieving performance equal to or very close to the naive
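
For reference, a minimal sketch of the exact GP regression computation whose O(N^3) Cholesky factorisation motivates the structured approaches above, following the standard textbook equations; the kernel, noise level and toy data are illustrative.

# Hedged sketch of exact GP regression: the Cholesky step dominates the O(N^3) cost.
import numpy as np

def rbf_kernel(a, b, lengthscale=0.5, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=200))
y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
X_star = np.linspace(0, 5, 50)
noise = 0.1 ** 2

K = rbf_kernel(X, X) + noise * np.eye(X.size)
L = np.linalg.cholesky(K)                        # O(N^3): the bottleneck for large N
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

K_star = rbf_kernel(X_star, X)
mean = K_star @ alpha                            # posterior predictive mean
v = np.linalg.solve(L, K_star.T)
var = np.diag(rbf_kernel(X_star, X_star)) - np.sum(v ** 2, axis=0)  # predictive variance

print(np.round(mean[:5], 3))
print(np.round(var[:5], 3))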

or particle filtering, is a popular class of methods for sampling from an

intractable target distribution using a sequence of simpler intermediate

critically dependent on the proposal distribution: a bad proposal can lead to

presents a new method for automatically adapting the proposal using an

parameterised proposal distribution and it supports online and batch

with rich parameterisations based upon neural networks leading to Neural

significantly improves inference in a non-linear state space model

outperforming adaptive proposal methods including the Extended Kalman and

translates into improved parameter learning when NASMC is used as a

NASMC is able to train a neural network-based deep recurrent generative model

methods and the recent work in scalable, black-box variational inference.
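
As a hedged baseline illustrating what the adaptive proposal above improves on, the sketch below runs a plain bootstrap particle filter on a toy nonlinear state-space model: it proposes from the transition prior, weights by the observation likelihood, and resamples. The model and its parameters are made up for the example.

# Hedged sketch of a bootstrap particle filter on a toy nonlinear state-space model.
import numpy as np

rng = np.random.default_rng(0)
T, num_particles = 50, 500
process_std, obs_std = 0.5, 1.0

# Simulate x_t = 0.9 x_{t-1} + 2 cos(x_{t-1}) + noise, y_t = x_t + noise.
x_true, y_obs = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + 2.0 * np.cos(x_true[t - 1]) + process_std * rng.standard_normal()
    y_obs[t] = x_true[t] + obs_std * rng.standard_normal()

particles = np.zeros(num_particles)
estimates = np.zeros(T)
for t in range(1, T):
    # Propose from the transition prior (the "bootstrap" proposal).
    particles = 0.9 * particles + 2.0 * np.cos(particles) + process_std * rng.standard_normal(num_particles)
    # Weight by the observation likelihood and normalise.
    log_w = -0.5 * ((y_obs[t] - particles) / obs_std) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    estimates[t] = np.sum(w * particles)
    # Multinomial resampling to avoid weight degeneracy.
    particles = particles[rng.choice(num_particles, size=num_particles, p=w)]

print("posterior-mean RMSE:", np.sqrt(np.mean((estimates[1:] - x_true[1:]) ** 2)))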

process (GP) models form a core part of probabilistic machine learning.

Considerable research effort has been made into attacking three issues with

GP models: how to compute efficiently when the number of data is large;

addresses these, using a variational approximation to the posterior which is

simultaneously, with efficient computations based on inducing-point sparse

within a variational inducing point framework, out-performing the state of

exploited to allow classification in problems with millions of data points,

Additionally, computing EI requires a current best solution, which may not

exist if none of the data collected so far satisfy the constraints.

many-to-many object matching from multiple networks, which is the task of

example, the proposed method can discover shared word groups from

models with this assumption, objects in different networks are clustered into

common groups depending on their interaction patterns, discovering a

demonstrated by using synthetic and real relational data sets, which include

nature of EP appears to make it an ideal candidate for performing Bayesian

learning on large models in large-scale dataset settings.

crucial limitation in this context: the number of approximating factors needs to

called stochastic expectation propagation (SEP), that maintains a global

posterior approximation (like VI) but updates it in a local way (like EP).

Experiments on a number of canonical learning problems using synthetic and

real-world datasets indicate that SEP performs almost as well as full EP, but

suited to performing approximate Bayesian learning in the large model, large

difficult task in many situations, but when expressing beliefs about complex

order to reveal appropriate probability distributions for arbitrary

this language allows for the effective automatic construction of

language of kernels can be mapped neatly onto natural language allowing for

automating the construction of statistical models, the need to be able to

effectively check or criticise these models becomes greater.

demonstrates how kernel two sample tests can be used to demonstrate where a

probabilistic model most disagrees with data allowing for targeted

thesis also briefly discusses the philosophy of model criticism within the

exploratory approach to statistical model criticism using maximum mean

instead constructed as an analytic maximisation over a large space of

possible statistics and therefore automatically select the statistic which

selected statistic, called the witness function, can be used to identify

then apply the procedure to real data where the models being assessed are

restricted Boltzmann machines, deep belief networks and Gaussian process

regression and demonstrate the ways in which these models fail to capture the

posterior sampling in Bayesian nonparametric mixture models with priors that

way of representing the infinite dimensional component of the model such that

while explicitly representing this infinite component it has less memory and

simulation results demonstrating the efficacy of the proposed MCMC algorithm

From training data from several related domains (or tasks), methods of domain

adaptation try to combine knowledge to improve performance.

discusses an approach to domain adaptation which is inspired by a causal

assumption holds true for a subset of predictor variables: the conditional of

the target variable given this subset of predictors is invariant with respect

corresponding conditional expectation in the training domains and use it for

for automatic inference of the above subset in regression and classification.

We study the performance of this approach in an adversarial setting, in the

case where no additional examples are available in the test domain.

labeled sample is available, we provide a method for using both the

transferred invariant conditional and task specific information.

results on synthetic data sets and a sentiment analysis problem.

learning community has recently shown a lot of interest in practical

probabilistic models as computational processes using syntax resembling

known to offer a convenient and elegant abstraction for programming with

to construct probabilistic models for machine learning, while still offering

algorithms for multisets of sequences, taking advantage of the multiset's

storing its elements in some sequential order, but then information is wasted

order-invariant tree representation, and derive an arithmetic code that

compressing collections of SHA-1 hash sums, and multisets of arbitrary,

makes several improvements to the classic PPM algorithm, resulting in a new

algorithm with superior compression effectiveness on human text.

original escape mechanism, we use a generalised blending method with explicit

hyper-parameters that control the way symbol counts are combined to form

procedure to model stationary signals as the convolution between a

continuous-time white-noise process and a continuous-time linear filter drawn

moving average process and, conditionally, is itself a Gaussian process with

model can be equivalently considered in the frequency domain, where the power

spectral density of the signal is specified using a Gaussian process.

the main contributions of the paper is to develop a novel variational

free-energy approach based on inter-domain inducing variables that efficiently

learns the continuous-time linear filter and infers the driving white-noise

In turn, this scheme provides closed-form probabilistic estimates of

the covariance kernel and the noise-free signal both in denoising and

provides closed-form expressions for the approximate posterior of the

spectral density given the observed data, leading to new Bayesian

models (SSMs) is proposed whereby the state-transition function of the model

learning a latent function that resides in the state space and for which

input-output sample pairs are not available, thus prohibiting the use of

learn the mixing weights of the kernel estimate by sampling from their

version of the proposed algorithm, followed by an online version which

performs inference on both the parameters and the hidden state through

proposed algorithm outperforms kernel adaptive filters in the prediction of

real-world time series, while also providing probabilistic estimates, a key

proposed recently and provide a high-dimensional feature space (alternative

lack of general quaternion-valued kernels, which are necessary to exploit the

proposes a novel way to design quaternion-valued kernels, this is achieved by

transforming three complex kernels into quaternion ones and then combining

emphasis is on a new quaternion kernel of polynomial features, which is

algorithms require complete knowledge (or accurate estimation) of the second

order statistics, this makes Gaussian processes (GP) well suited for

modelling complex signals, as they are designed in terms of covariance

approach for modelling complex signals, whereby the second-order statistics

allows for circularity coefficient estimation in a robust manner when the

observed signal is corrupted by (circular) white noise.

validated using climate signals, for both circular and noncircular cases.

results obtained open new possibilities for collaboration between the complex

signal processing and Gaussian processes communities towards an appealing

efficient proposal optimized for iHMMs and leverages ancestor sampling to

significant convergence improvements on synthetic and real world data

configuration of a discrete graphical model with highest probability (termed

MAP inference) is to reduce the problem to finding a maximum weight stable

set (MWSS) in a derived weighted graph, which, if perfect, allows a solution

class of binary pairwise models where this method may be applied.

their analysis made a seemingly innocuous assumption which simplifies

analysis but led to only a subset of possible reparameterizations being

demonstrating that this greatly expands the set of tractable models.

provide a simple, exact characterization of the new, enlarged set and show

how such models may be efficiently identified, thus settling the power of the

graphical models, belief propagation often performs remarkably well for

approximate marginal inference, and may be viewed as a heuristic to minimize

a broad family of related pairwise free energy approximations with arbitrary

robot applications, such as grasping and manipulation, it is difficult to

program desired task solutions beforehand, as robots are within an uncertain

performance, machine learning, especially reinforcement learning, usually

available for learning, due to system constraints and practical issues,

inference for learning control method (PILCO), can be tailored to cope with

probabilistic Gaussian processes framework, additional system knowledge can

be incorporated by defining appropriate prior distributions, e.g.

evaluation, we employ the approach for learning an object pick-up task.

results show that by including prior knowledge, policy learning can be sped

structure determination based on NMR chemical shifts are becoming

fragment replacement strategy, in which structural fragments are repeatedly

consistent with the chemical shift data, they do not enable the sampling of

the conformational space of proteins with correct statistical weights.

we present a method of molecular fragment replacement that makes it possible

to perform equilibrium simulations of proteins, and hence to determine their

chemical shift information in a probabilistic model in Markov chain Monte

possible to fold proteins to their native states starting from extended

condition and hence it can be used to carry out an equilibrium sampling from

the Boltzmann distribution corresponding to the force field used in the

strategy can be used in combination with chemical shift information to

quadratic in the number of latent variables and training runtime scales

grid factor graphs, which are prevalent in computer vision and spatial

GPstruct models based on ensemble learning, with weak learners (predictors)

trained on subsets of the latent variables and bootstrap data, which can

presents problems in time-series settings or in spatial datasets where large

approximation whose complexity grows linearly with the number of

tasks including missing data imputation for audio and spatial datasets.

trace out the speed-accuracy trade-off for the new method and show that the

natural connection between random partitions of objects and kernels between

strategies for deep networks is crucial to good predictive performance.

shed light on this problem, we analyze the analogous problem of constructing

tends to capture fewer degrees of freedom as the number of layers increases,

also examine deep covariance functions, obtained by composing infinitely many

tool for sampling complex systems such as large biomolecular structures.

ensemble to distributions constructed from generative probabilistic models of

comparison to conventional parametric models, we offer the possibility to

straightforwardly trade off model capacity and computational cost whilst

stochastic variational inference and online learning approaches for fast

order to derive non-approximate parallel MCMC inference for it – work which

experimental evidence, analysing the load balance for the inference and

of variational inference for sparse GP regression and latent variable models

exploiting the decoupling of the data given the inducing points to

(on flight data with 2 million records) and latent variable modelling (on

functional parameters are usually learned by maximum likelihood, which can

This new model can capture highly flexible functional relationships for the

inference procedures and it avoids overfitting problems by following a fully

random tree structure with a set of leaves that defines a collection of

process for the tree is defined in terms of particles (representing the

diffusion tree, however, multiple copies of a particle may exist and diffuse

to multiple locations in the continuous space, resulting in (a random number

a hierarchically-clustered factor analysis model with the beta diffusion tree

on missing data problems with data sets of gene expression arrays,

the tree structure is defined in terms of particles (representing the

diffusion tree (Neal, 2003b), which defines a tree structure over partitions

diffusion tree, multiple copies of a particle may exist and diffuse along

multiple branches in the beta diffusion tree, and an object may therefore

hierarchically-clustered factor analysis model with the beta diffusion tree

on missing data problems with data sets of gene expression microarrays,

processes have served as models for unknown multisets of a measurable space.

buffet process, which we call a negative binomial Indian buffet process.

an intermediate step toward this goal, we provide constructions for the beta

negative binomial process that avoid a representation of the underlying beta

evaluation point that maximizes the expected information gained with respect

synthetic and realworld applications, including optimization problems in

factorization model for collaborative filtering that learns from data that is

these models usually assume that the data is missing at random (MAR), and

efficient stochastic inference algorithm for PMF models of fully observed

expensive batch approaches and has better predictive performance than

strategies which produce large gains over standard uniform subsampling.
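
As a rough illustration of the kind of stochastic (subsampled) inference described above for fully observed matrix factorization, the sketch below runs minibatch MAP updates on a PMF-style model; the learning rate, regularisation, and uniform minibatch subsampling are illustrative stand-ins rather than the algorithm from the work itself.

```python
import numpy as np

def sgd_pmf(R, rank=10, lr=0.01, reg=0.1, epochs=20, batch=256, seed=0):
    """Minibatch MAP updates for a PMF-style model R ~= U V^T on a fully
    observed matrix (illustrative hyper-parameters)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    idx = np.array([(i, j) for i in range(n) for j in range(m)])
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch):
            i, j = idx[start:start + batch, 0], idx[start:start + batch, 1]
            err = R[i, j] - np.sum(U[i] * V[j], axis=1)
            # gradient steps; np.subtract.at handles repeated row/column indices
            np.subtract.at(U, i, lr * (-err[:, None] * V[j] + reg * U[i]))
            np.subtract.at(V, j, lr * (-err[:, None] * U[i] + reg * V[j]))
    return U, V
```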

larger than the number of allowed function evaluations, whereas the

formulate the problem in terms of probabilistic inference within a topic

the large number of topics we propose a novel efficient Gibbs sampling scheme

factorization model for rating data and a corresponding active learning

challenging tasks for recommender systems: what to recommend with new users

of active learning depends strongly upon having accurate estimates of i) the

uncertainty in model parameters and ii) the intrinsic noisiness of the data.

enables us to gain useful information in the cold-start setting from the very

Modern DNA sequencing methods produce vast amounts of data that often

method based on position-specific scoring matrices, which can take into

data-specific biases, and sequencing errors are naturally dealt with

probabilistic approach can limit the problem of random matches from short

reads of contamination and that it improves the mapping of real reads from

The presented work is an implementation of a novel approach to short read

mapping where quality scores, prior mismatch probabilities and mapping

quality and/or biased sequencing data but also a demonstration of the

feasibility of using a probability based alignment method on real and

sequential hypothesis test that allows us to accept or reject samples with

high confidence using only a fraction of the data required for the exact MH

'Strong' (S) and 'weak' (w) syllables] is cued, among others, by slow rates

unclear exactly which envelope modulation rates and statistics are the most

judgment task, adult listeners identified AM tone-vocoded nursery rhyme

that a 1π radian phase-shift (half a cycle) would reverse the perceived

rhythm pattern (i.e., trochaic -> iambic) whereas a 2π

cycle) would retain the perceived rhythm pattern (i.e., trochaic ->

explores an open-ended space of statistical models to discover a good

explanation of a data set, and then produces a detailed report with figures

performance evaluated over 13 real time series data sets from various

separate strand of recent research, randomized methods have been proposed to

construct features that help reveal nonlinear patterns in data.

tasks such as regression or classification, random features exhibit little or

single test point scales linearly with the number of training points and the
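
The random feature construction referred to above can be sketched as follows for a Gaussian (RBF) kernel; the feature count and lengthscale are placeholder values, and the specific randomized features studied in the work may differ.

```python
import numpy as np

def random_fourier_features(X, num_features=500, lengthscale=1.0, seed=0):
    """Sketch of random feature construction for a Gaussian (RBF) kernel:
    k(x, y) ~= z(x) . z(y), with z built from random projections.
    X: (n, d) array; dimensions and lengthscale are illustrative."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```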

analysis of the pairwise graph representable Markov random field, which we

use to extend the model to semi-supervised learning problems, and propose an

analysis on supervised and semi-supervised datasets and show good empirical

deduce the correct control values to elicit a particular desired response.

model needs to capture strongly nonlinear effects and deal with the presence

input noise into heteroscedastic output noise, and compare it to other

to an effective model for nonlinear functions with input and output noise.

then consider the broad topic of GP state space models for application to

in state space models, including introducing a new method based on

methods in some detail including providing a systematic comparison between

complex systems is a difficult task and thus a framework which allows an

automatic design directly from data promises to be extremely useful.

specifically gamma processes, to construct a countably infinite graph with

reinforcement scheme has recently been proposed with similar properties, but

approach of explicitly constructing the mixing measure, which allows more

inverse regression (MNIR) has been proposed as a new model of annotated text

the IRTM enables systematic discovery of in-topic lexical variation, which is

We construct a family of probabilistic numerical methods that instead return

posterior means match the outputs of the Runge-Kutta family exactly, thus

light on the structure of Runge-Kutta solvers from a new direction, provide a

richer, probabilistic output, have low computational cost, and raise new

form expressions for the marginal likelihood and predictive distribution of a

Student-t process, by integrating away an inverse Wishart process prior over

process - a nonparametric representation, analytic marginal and predictive

distributions, and easy model selection through covariance kernels - but has

recently proved using graph covers (Ruozzi, 2012) that the Bethe partition

function is upper bounded by the true partition function for a binary

Bethe partition functions for each sub-model obtained after clamping any

bound on a broad class of approximate partition functions of general pairwise

clamping a few wisely chosen variables can be of practical value by

inductive biases, provide a distinct opportunity to develop intelligent

the introductory chapter, we discuss the high level principles behind all of

automatically discover rich structure in data, a model must have large

models for large datasets, which typically provide more information for

learning structure, and 4) we can often exploit the existing inductive biases

then discuss, in chapter 2, Gaussian processes as kernel machines, and my

the GPRN is a highly expressive kernel, formed using an adaptive mixture of

the time-varying expression levels of 1000 genes, the spatially varying

(input dependent noise covariances) between returns on equity indices and

we introduce simple closed form kernel for automatic pattern discovery and

SM kernel to discover patterns and perform long range extrapolation on

multidimensional pattern extrapolation, particularly suited to image and movie

Without human intervention - no hand crafting of kernel features, and

large scale pattern extrapolation, inpainting and kernel discovery problems,

structure of a spectral mixture product (SMP) kernel, for fast yet exact

alternative scalable Gaussian process methods in speed and accuracy.
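
For reference, the standard one-dimensional form of a spectral mixture (SM) kernel, written as a closed-form expression in the mixture weights w_q, spectral means μ_q and variances v_q (the multidimensional and spectral mixture product variants referred to above build on products of such terms):

```latex
k_{\mathrm{SM}}(\tau) \;=\; \sum_{q=1}^{Q} w_q \,
    \exp\!\left(-2\pi^2 \tau^2 v_q\right)\,
    \cos\!\left(2\pi \tau \mu_q\right),
\qquad \tau = x - x'.
```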

inference which exploits existing model structure are useful in combination

important part of a wide variety of real data - including problems in

econometrics, gene expression, geostatistics, nuclear magnetic resonance

spectroscopy, ensemble learning, multi-output regression, change point

modelling, time series, multivariate volatility, image inpainting, texture

properties of atomic nuclei to discover the structure, reaction state and

weighted sum of trigonometric functions undergoing exponential decay to model

the components of this general model - amplitudes, phase shifts,

frequencies, decay rates, and noise variances - and offer practical

is particularly robust to low signal to noise ratios (SNR), and overlapping
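
A hedged way to write the decaying-sinusoid signal model described above, with symbols chosen to match the listed components (amplitudes a_k, phase shifts φ_k, frequencies f_k, decay rates τ_k, and noise variance σ²); the exact parameterisation in the original work may differ:

```latex
y(t) \;=\; \sum_{k=1}^{K} a_k \, e^{-t/\tau_k}\,
      \cos\!\left(2\pi f_k t + \phi_k\right) \;+\; \varepsilon(t),
\qquad \varepsilon(t) \sim \mathcal{N}\!\left(0, \sigma^2\right).
```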

a novel compositional, generative model for vector space representations of

space semantics as a top-down process, and provides efficient algorithms for

latent variable models which have been shown to successfully learn the hidden

a countably infinite number of hidden states, which rids us not only of the

necessity to specify a priori a fixed number of hidden states available but

complexity of the task at hand increases, the computational cost of such

infinite HCRF models, and a novel variational inference approach on a model

via cross-validation- for the difficult tasks of recognizing instances of

with respect to existing approaches, among others, conditional random fields

framework can be instantiated for a wide range of structured objects such as

the model is benchmarked on several natural language processing tasks and a

video gesture segmentation task involving a linear chain structure.

are built compositionally by adding and multiplying a small number of base

outperforms many widely used kernels and kernel combination methods on a

investigate the expressive power of three classes of model-those with binary

problem of inferring marginal probabilities and consider whether it is

possible to 'simply reduce' marginal inference from general discrete factor

binary pairwise factor graphs is able to simply reduce only positive models.

performance of standard approximate inference algorithms on the outputs of

models are successfully used in many areas of science, engineering and

dynamics, resulting in a flexible model able to capture complex dynamical

dynamics of the model and instead infer directly the joint smoothing

nonlinear system identification based on a nonlinear autoregressive exogenous

system's dynamics which is able to report its uncertainty in regions where

its relatively low computational cost make GP-FNARX a good candidate for

parameters, these models provide a scalable and reliable starting point for

modelling uses probability theory to express all aspects of uncertainty in

which simply uses the rules of probability theory in order to make

predictions, compare alternative models, and learn model parameters and

probabilistic modelling and an accessible survey of some of the main tools in

for modelling unknown functions, density estimation, clustering, time series

present new methods for additive GPs, showing a novel connection between the

accuracy-complexity tradeoff, we extend this model with a novel variant of

achieving close performance to the naive Full GP at orders of magnitude less

models for dynamic social network data have focused on modelling the

influence of evolving unobserved structure on observed social interactions.

paper, we introduce a new probabilistic model for capturing this phenomenon,

our model's capability for inferring such latent structure in varying types

of social network datasets, and experimental studies show this structure

achieves higher predictive performance on link prediction and forecasting

probabilistic model based on the horseshoe prior is proposed for learning de-

estimating feature selection dependencies may suffer from over-fitting in the

model proposed, additional data from a multi-task learning scenario are

experiments also show that the model is able to induce suitable feature

prior shows that it is well suited for regression problems that are sparse at

knowledge about specific groups of features that are a priori believed to be

automatic relevance determination and additional variants used for group

experimental design (also known as active learning), where the data instances

reducing the number of instances needed to obtain a particular level of

inference algorithm for PMF models of fully observed binary matrices.

method exhibits faster convergence rates than more expensive batch approaches

proposed method includes new data subsampling strategies which produce large

automatically selecting the size of the minibatches of data and we propose an

estimation of dependencies between multiple variables is a central problem in

equities and currencies and observe consistent performance gains compared to

approach, based on Bayesian statistical methods, that allows the fitting of a

rich mental representations that guide their behavior in a variety of

tomography, that can extract complex, multi-dimensional priors across tasks.

familiarity and odd-one-out, involving an ecologically relevant set of

and vary dramatically across subjects, but are invariant across the tasks

with high precision the behavior of subjects for novel stimuli both in the

naturalistic stimulus set that guides behavior in multiple tasks.

Gaussians fit to a single curved or heavy-tailed cluster will report that the

derive a simple inference scheme for this model which analytically integrates

model is effective for density estimation, performs better than infinite

Gaussian mixture models at recovering the true number of clusters, and

framework yields the desired visualization with fewer user interactions than

this paper, we propose a probabilistic model for discovering latent influence

used for modeling a sequence, in which adoption by a user triggers the

multiple items, we employ multiple inhomogeneous Poisson processes, which

discovering relations between users and predicting item popularity in the

model is demonstrated by using real data sets in a social bookmark sharing

behavior of citizens are increasingly reflected online - therefore, mining

surveys enhanced with rich socio-demographic information to enable insights

We report an experimental realization of an adaptive quantum state tomography

we observe close to $N^{-1}$ scaling of infidelity with overall number of

growing number of large-scale knowledge bases in a variety of domains

knowledge bases would make it possible to unify many sources of structured

greedy nature, our experiments indicate that SiGMa can efficiently match some

Abstract: This report discusses methods for forecasting hourly loads of a

learning / regression algorithms and few domain specific adjustments were

non-linear dependence between random variables of arbitrary dimension based

is defined in terms of correlation of random non-linear copula projections;

low computational cost and is easy to implement: just five lines of R code,
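
The recipe hinted at above (a rank-based copula transform, random non-linear projections, then a canonical correlation) can be sketched in a few lines of Python; the projection count, scale, and sine non-linearity are illustrative choices rather than the exact published estimator.

```python
import numpy as np
from scipy.stats import rankdata

def rdc_like(x, y, k=20, s=1 / 6.0, seed=0):
    """Sketch of a randomized dependence measure: copula (rank) transform,
    random sinusoidal projections, then the largest canonical correlation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x); x = x.reshape(x.shape[0], -1)
    y = np.asarray(y); y = y.reshape(y.shape[0], -1)
    n = x.shape[0]
    # empirical copula transform via ranks, plus a bias column
    cx = np.column_stack([rankdata(c) / n for c in x.T] + [np.ones(n)])
    cy = np.column_stack([rankdata(c) / n for c in y.T] + [np.ones(n)])
    # random non-linear projections
    fx = np.sin(cx @ rng.normal(scale=s, size=(cx.shape[1], k)))
    fy = np.sin(cy @ rng.normal(scale=s, size=(cy.shape[1], k)))
    # largest canonical correlation between the two feature sets
    c = np.corrcoef(np.hstack([fx, fy]).T)
    cxx, cyy, cxy = c[:k, :k], c[k:, k:], c[:k, k:]
    m = np.linalg.pinv(cxx) @ cxy @ np.linalg.pinv(cyy) @ cxy.T
    eig = np.real(np.linalg.eigvals(m))
    return float(np.sqrt(np.clip(eig.max(), 0.0, 1.0)))
```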

assumption by discovering the latent functions that specify the shape of a

conditional copula given its conditioning variables. We learn these functions

real-world datasets show that, when modeling all conditional dependencies, we

sequencing technologies has revolutionized the way we study genomes and gene

genomic origin of the reads, i.e., mapping the reads to a reference genome.

In this new situation, conventional alignment tools are obsolete, as they

cannot handle this huge amount of data in a reasonable amount of time.

new mapping algorithms have been developed, which are fast at the expense of

in short read mapping and show that mapping reads correctly is a nontrivial

the type of data, and that a considerable fraction of uniquely mapped reads

simple statistical results on the expected number of random matches in a

probabilistic model to infer supervised latent variables in the Hamming space

semantic concept have similar latent values, and objects in different

latent variable problem based on an intuitive principle of pulling objects

variables can be directly used to perform a nearest neighbour search for the

hash codes, and show how to effectively couple the structure of the hash
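
Once binary latent codes are available, the nearest-neighbour search mentioned above reduces to a Hamming-distance lookup; a minimal sketch (dense 0/1 arrays, no bit packing) is:

```python
import numpy as np

def hamming_nn(query_code, database_codes):
    """Nearest-neighbour retrieval in Hamming space: codes are binary vectors
    (e.g. learned latent hash codes); distance is the number of differing bits."""
    dists = (database_codes != query_code).sum(axis=1)
    return np.argsort(dists)   # indices of database items, nearest first
```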

latent feature models is inherently difficult as the inference space grows

exponentially with the size of the input data and number of latent features.

the task of clustering data points into clusters where only a fraction of the

Bayesian kernel based method to cluster data points without the need to

prespecify the number of clusters or to model complicated densities from

case where we are given additional information about the training data, which

attributes, bounding boxes, image tags and rationales as additional

kernels are derived by modelling a spectral density - the Fourier transform

patterns and performing long range extrapolation on synthetic examples, as

- enabling automatic pattern extrapolation with Gaussian processes on large

procedures - we show that GPatt can solve large scale pattern extrapolation,

inpainting, and kernel discovery problems, including a problem with 383,400

inference which exploits model structure are useful in combination for

models suffer from a) overfitting problems and multiple local optima, b)

failure to capture shifts in market conditions and c) large computational

in market conditions are captured by assuming a diffusion process in

data show excellent performance of the proposed method with respect to

standard beamformer in a linear dynamical system, thereby introducing the

settings, data is collected as multiple time series, where each recorded time

Gaussian Process model for analyzing multiple time series with multiple time

learning, robotics, and control for representing unknown system functions by

dynamic systems is based on analytic moment matching in the context of the

control community, in particular for the modelling of discrete time state

This paper introduces a method of achieving this, yielding faster dynamics

present a new model based on Gaussian processes (GPs) for learning pairwise

learning of user preferences with unsupervised dimensionality reduction for

present an efficient active learning strategy for querying preferences.

proposed technique performs favorably on real-world data against

show that the criterion minimised when selecting samples in kernel herding is
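
For context, a minimal sketch of kernel herding over a finite candidate pool: points are picked greedily so that their kernel mean embedding tracks that of the pool. The `kernel` argument is a user-supplied Gram-matrix function and is a placeholder assumption.

```python
import numpy as np

def kernel_herding(candidates, kernel, num_samples):
    """Greedy herding over a finite pool: at each step pick the point that
    best matches the pool's kernel mean embedding, penalised by similarity
    to previously chosen points."""
    K = kernel(candidates, candidates)     # Gram matrix over the pool
    mean_embedding = K.mean(axis=1)        # empirical E_p[k(x, .)]
    chosen = []
    for t in range(num_samples):
        penalty = K[:, chosen].sum(axis=1) / (t + 1) if chosen else 0.0
        chosen.append(int(np.argmax(mean_embedding - penalty)))
    return chosen
```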

In this paper we revisit the problem of optimal design of quantum tomographic

reduction in the total number of measurements required as compared to

linearly mixes the probabilistic predictions of multiple models, each

combining multiple models only under certain restrictive assumptions, which

combination procedure starting from a classic statistical model proposed by

MOTIVATION: The integration of multiple datasets remains a key challenge in

generate a broad array of different data types, providing distinct – but often

range of different datasets and data types simultaneously (including the

ability to model time series data explicitly using Gaussian processes).

artificially constructed time series datasets, we show that MDI is able to

capabilities of current approaches and integrate gene expression, chromatin

integration of multiple datasets remains a key challenge in systems biology

array of different data types, providing distinct – but often complementary

range of different datasets and data types simultaneously (including the

ability to model time series data explicitly using Gaussian processes).

model, with dependencies between these models captured via parameters that

artificially constructed time series datasets, we show that MDI is able to

data, to identify a set of protein complexes for which genes are co-regulated

techniques – as well as to non-integrative approaches – demonstrate that

fundamental problem in the analysis of structured relational data like

graphs, networks, databases, and matrices is to extract a summary of the

We obtain a flexible yet simple Bayesian nonparametric model by placing a

utilises elliptical slice sampling combined with a random sparse

model to network data and clarify its relation to models in the literature,

new framework based on the theory of copulas is proposed to address

be detected and corrected to adapt a density model across different learning

regression problems with real-world data illustrate the efficacy of the

has generated immense research interest, with many successful applications in

diverse areas such as signal acquisition, image coding, genomics and

variable models, and develop L1 minimising factor models, Bayesian variants

clustering solutions from data and the feature subsets that are relevant for

be shared and therefore the sets of relevant features are allowed to overlap.

provide an inference approach to learn the latent parameters corresponding to

hyperparameters in closed form, and introduces an active learning scheme to

network data extract a summary of the relational structure underlying an

models provide a natural approach to capture more complex dependencies.

comparisons, the model achieves significantly improved predictive performance

models with a single layer hierarchy over-simplify real networks.

Factor analysis models effectively summarise the covariance structure of high

mutual information, the proposed dependence measure is invariant to any

estimator is consistent, robust to outliers, and uses rank statistics only.

We derive upper bounds on the convergence rate and propose independence tests

characteristic subpatterns in potentially noisy graph data, it appears

different graph instances have different edge sets pose a serious challenge.

connected, 2) occurs in all or almost all graph instances, and 3) has the

the number of edges missing to make a subset of vertices into a clique.

optimization problem, or b) a min-min soft margin optimization problem.

real social network data we show that the proposed method is able to reliably

find soft cliques in graph data, even if the data is distorted by random noise or

embeds the trajectory of the Markov chain into a reproducing kernel

out analytically: our proposal distribution in the original space is a

normal distribution whose mean and covariance depend on where the current

sample lies in the support of the target distribution, and adapts to its

tive samplers on multivariate, highly nonlinear target distributions, arising

unsupervised regime with training examples but without class labels, and the

show that the augmented representation achieves better results in terms of

haplotype-based approach for inferring local genetic ancestry of individuals

haplotypes give rise to both the ancestral and admixed population haplotypes,

an effective utilization of the population structural information under a

the robustness under deviation from common modeling assumptions by

admixed population from an arbitrary number of ancestral populations and also

performs competitively in terms of spurious ancestry proportions under

simulation under various admixing scenarios and present empirical analysis

family of infinitely exchangeable priors over discrete tree structures that

allows the depth of the tree to grow with the data, and then showing that our

family contains all hierarchical models with certain mild symmetry

algorithm to exploit the existing data resources when learning on a new

Kalman filter (UKF) is a widely used method in control and time series

point placement, potentially causing it to perform poorly in nonlinear

sigma points correctly from data can make sigma point collapse much less

by imposing soft constraints on the amplitude and phase variables of the

evaluate the method on synthetic and natural, clean and noisy signals,

dependent covariance matrices Σ(x), allowing one to model input varying

can naturally scale to thousands of response variables, as opposed to

competing multivariate volatility models which are typically intractable for

class of covariance dynamics - periodicity, Brownian motion, smoothness -

(predictor) dependent signal and noise correlations between multiple output

(response) variables, input dependent length-scales and amplitudes, and

demonstrating substantially improved performance over eight popular multiple

output (multi-task) Gaussian process models and three multivariate volatility

models on real datasets, including a 1000 dimensional gene expression

discrete variable undirected models into fully continuous systems.

constants (partition functions), and in general opens up a number of new

due to prohibitive computational resource costs that are not taken into

probabilities that trades off maximizing a given utility criterion and

avoiding resource costs that arise due to deviating from initially given

same formalism generalizes to discrete control problems, leading to linearly

solvable bounded rational control policies in the case of Markov systems.

serve as a general principle in control design that unifies a number of

recently reported approximate optimal control methods both in the continuous

narrow corridor is a familiar example of a two-person motor game that

sensorimotor tasks that correspond to classic coordination games with

multiple Nash equilibria, such as 'choosing sides', 'stag hunt', 'chicken',

continuous payoff in the form of a resistive force counteracting their

disabled people by decoding neural activity into useful behavioral commands.

'offline', using neural activity previously gathered from a healthy animal,

testing may neglect important features of a real prosthesis, most notably the

critical role of feedback control, which enables the user to adjust neural

can, for a particular decode system, algorithm, or parameter, engage feedback

human subject, using a closed-loop neural prosthesis driven by synthetic

simulator (OPS) to optimize 'online' decode performance based on a key

parameter of a current state-of-the-art decode algorithm, the bin width of a

agree that neural activity should be analyzed in bins of 100- to 300-ms

shorter bin widths (25-50 ms) yield higher decode performance.

confirm this surprising finding using a closed-loop rhesus monkey prosthetic

outlines a new language-independent model for sentiment analysis of short,

happy vs sad sentiment, and show that in some circumstances this outperforms

allow the modelling of different sentiment distributions in different

Twitter data and present a scalable system of data acquisition and

incorporating model uncertainty into long-term planning, PILCO can cope with

demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop

policies for a stacking task in only a handful of trials - from scratch.

efficient, reduces model bias, and deals with several noise sources in a

responses of humans and RL algorithms, we also find that humans appear to

allows efficient evaluation of all input interaction terms, whose number is

defining a probability distribution over equivalence classes of sparse binary

matrices with a finite number of rows and an unbounded number of columns.

represent objects using a potentially infinite array of features, or that

involve bipartite graphs in which the size of at least one class of nodes is

applications of the Indian buffet process in machine learning, discuss its
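
A generative sketch of the process described above (binary matrices with a finite number of rows and an unbounded number of columns) in its standard 'customers and dishes' form; α is the usual concentration parameter.

```python
import numpy as np

def sample_ibp(num_customers, alpha, seed=0):
    """Draw a binary feature matrix from the Indian buffet process: customer n
    takes each previously sampled dish k with probability m_k / n, then
    Poisson(alpha / n) new dishes."""
    rng = np.random.default_rng(seed)
    dish_counts = []           # how many customers have taken each dish
    rows = []
    for n in range(1, num_customers + 1):
        row = [rng.random() < m / n for m in dish_counts]   # existing dishes
        new = rng.poisson(alpha / n)                         # new dishes
        row += [True] * new
        dish_counts = [m + int(t) for m, t in zip(dish_counts, row)] + [1] * new
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z
```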

tracking, control policies given no prior knowledge of the dynamical system

algorithm which can be viewed as a form of indirect self-tuning regulator.

the task of reference tracking using the inverted pendulum it was shown to

yield generally improved performance on the best controller derived from the

standard linear quadratic method using only 30 s of total interaction with

are often affected by overfitting problems when labeling errors occur far

theoretic active learning has been widely studied for probabilistic models.

For simple regression an optimal myopic policy is easily tractable.

approach that expresses information gain in terms of predictive entropies,

makes minimal approximations to the full information theoretic objective.

experimental performance compares favourably to many popular active learning
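
One common way to express that information gain with predictive entropies, assuming Monte Carlo samples of the predictive class probabilities are available (the array shapes here are illustrative assumptions):

```python
import numpy as np

def information_gain_scores(prob_samples):
    """Mutual information between the unknown label and the model parameters,
    written as a difference of predictive entropies.
    prob_samples: (S, N, C) array - S posterior samples of class probabilities
    for N candidate points and C classes."""
    eps = 1e-12
    mean_p = prob_samples.mean(axis=0)                        # (N, C)
    entropy_of_mean = -(mean_p * np.log(mean_p + eps)).sum(-1)
    mean_entropy = -(prob_samples * np.log(prob_samples + eps)).sum(-1).mean(0)
    return entropy_of_mean - mean_entropy   # larger = more informative query
```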

reformulation of binary preference learning to a classification problem, we

paper, we aim to fill this gap and demonstrate the potential of applying the

kernel trick to tractable Bayesian parametric models in a wider context than

machine for density estimation that is obtained by applying the kernel trick

approximate inference methods for DDT models and show excellent performance

drive sequential tree building and greedy search to find optimal tree

demonstrate appropriate observation models for continuous and binary data.

model and then present two inference methods: a collapsed MCMC sampler which

demonstrated on synthetic and real world data, both continuous and

proposed where observed data Y is modeled as a linear superposition, G, of a

expression data is investigated using randomly generated data sets based on a

multinomial case we introduce a novel variational bound for the softmax

factor which is tighter than other commonly used bounds whilst maintaining

ageing related phenotypes measured on the 12,000 female twins in the Twins UK

adjustment to be made to an individual's chronological age to give her

approximating general properties of the posterior, ignoring the decision task

approximate inference methods with respect to the decision task at hand.

present a general framework rooted in Bayesian decision theory to analyze

neocortex code and compute as part of a locally interconnected population.

population processes empirically by fitting statistical models to unaveraged

population, the most appropriate model captures shared variability by a

low-dimensional latent process evolving with smooth dynamics, rather than by

model with realistic spiking observations to coupled generalised linear

latent dynamical approach outperforms the GLM in terms of goodness-of-fit,

method of maximisation of the marginal likelihood, and allow estimation of

underlying structure in data, and have relevance in a vast number of

application areas including collaborative filtering, source separation,

missing data imputation, gene expression analysis, information retrieval,

develops generalisations of matrix factorisation models that advance our

three concerns: widening the applicability of latent variable models to the

including higher order data structures into the matrix factorisation

These three issues reflect the reality of modern data analysis and

we develop new models that allow for a principled exploration and use of data

define ideas of weakly and strongly sparse vectors and investigate the

classes of prior distributions that give rise to these forms of sparsity,

Based on these sparsity favouring priors, we develop and compare methods for

used to allow binary data with specified correlation to be generated, based

generalisation considers the extension of matrix factorisation models to

changes in end- stage cardiomyopathy led us to hypothesise that distinct

sequencing, covering 24 million out of the 28 million CG di-nucleotides in

of set functions on Polish spaces to establish countable additivity of the

includes: (a) subjective expected utility (SEU) theory, the framework of

control law that deals with the case where autonomous agents are uncertain

decision-makers maximize expected utility, but crucially ignore the resource

variational free utility principle akin to thermodynamical free energy that

and robust (minimax) control schemes fall out naturally from this framework

that an intractable planning problem reduces to a simple multi-armed bandit

adaptive bandit player that is universal with respect to a given class of

optimal bandit players, thus indirectly constructing an adaptive agent that

network may provide concurrent views into the dynamical processes of that

multiple linear dynamical laws approximate a nonlinear and potentially

non-stationary dynamical process - is able to distinguish dynamical regimes

within single-trial motor cortical activity associated with the preparation

between them correlate strongly with external events whose timing may vary

comparable models in predicting the firing rate of an isolated neuron based

dynamical processes underlying the coordinated evolution of network activity

approximation play a fundamental role in the possibility and plausibility of

(pseudo-) training set such that a specific measure of information loss is

this thesis is to improve scaling instead by exploiting any structure that is

obtain accuracies close to, or exactly equal to the full GP model at a

the covariance matrices generated have rank D - this results in significant

with a tensor product kernel evaluated on a multivariate grid of inputs

critical regulators of sleep-wake cycles, reward-seeking, and body energy

unclear whether hcrt/orx neurons are one homogenous population, or whether

structural and functional information about individual hcrt/orx neurons in

mouse brain slices, by combining patch-clamp analysis of spike firing,

membrane currents, and synaptic inputs with confocal imaging of cell shape

Statistical cluster analysis of intrinsic firing properties revealed that

provide quantitative evidence that, at the cellular level, the mouse hcrt/orx

topics per document is small, exact inference takes polynomial time.

contrast, we show that, when a document has a large number of topics, finding

These are extended to the state space approach to time series in two

performance on six real world data sets, which include three environmental

based, to general nonlinear systems for nonparametric system identification.

Gaussian process approaches to time series into a change point framework.

data, from an old regime, that hinders predictive performance is

point model when change point labels are available in training.

mentioned methodologies significantly improve predictive performance on the

important middle ground, retaining distributional information about

uncertainty in latent variables, unlike maximum a posteriori methods (MAP),

well-known compactness property of variational inference is a failure to

parameter learning and analytically reveal systematic biases in the

computationally challenging but improves on existing methods in a number of

encompasses noise and uncertainty, allowing PAD to cope with missing data and

recent scientific and engineering problems require signals to be decomposed

into a product of a slowly varying positive envelope and a quickly varying

carrier whose instantaneous frequency also varies slowly over time.

signal processing provides algorithms for so-called amplitude- and

semi-definite random matrices indexed by any arbitrary input variable.

The GWP captures a diverse class of covariance dynamics, naturally handles

missing data, scales nicely with dimension, has easily interpretable

improved performance over eight popular multiple output (multi-task) Gaussian

process models and three multivariate volatility models on benchmark

datasets, including a 1000 dimensional gene expression dataset.

models offer a platform for analyzing multi-electrode recordings of neuronal

(L1L method) to detect short-term (order of 3ms) neuronal interactions.

estimate the parameters in this model using a coordinate descent algorithm,

able to detect excitatory interactions with both high sensitivity and

to multi-electrode recordings collected in the monkey dorsal premotor cortex

stick-breaking processes to allow for trees of unbounded width and depth,

methods based on slice sampling to perform Bayesian inference and simulate

networks are a powerful way to model complex probability distributions.

structure of a layered, directed belief network that is unbounded in both

the nonlinear Gaussian belief network framework to allow each unit to vary

Abstract: The design of optimal adaptive controllers is usually based on

parameter point estimates as if they represented 'true' parameter values.

Here we present a stochastic control rule instead where controls are sampled

inference and action sampling both work forward in time and hence such a
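
The action-sampling idea can be illustrated in the simplest Beta-Bernoulli bandit setting below; this is only an analogy to the control rule described above, not the rule itself, and `reward_fn` is a hypothetical environment callback.

```python
import numpy as np

def posterior_sampling_bandit(reward_fn, num_arms, horizon, seed=0):
    """Sample a parameter hypothesis from the posterior, act optimally for that
    sample, observe a 0/1 reward, and update (Beta-Bernoulli conjugacy)."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(num_arms)   # Beta posterior parameters per arm
    beta = np.ones(num_arms)
    rewards = []
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)   # sampled hypothesis, not a point estimate
        arm = int(np.argmax(theta))
        r = reward_fn(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        rewards.append(r)
    return np.array(rewards)
```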

learn part-of-speech tags from newswire text in an unsupervised fashion.

scores, embodied by iteration duration, ease of development, deployment and

assumption is that preparatory activity constitutes a subthreshold form of

movement activity: a neuron active during rightward movements becomes

the level of a single neuron, preparatory tuning was weakly correlated with

initial state of a dynamical system whose evolution produces movement

represent specific factors, and may instead play a more mechanistic role.

from seven areas spanning the four cortical lobes in monkeys and cats.

every case, stimulus onset caused a decline in neural variability.

occurred even when the stimulus produced little change in mean firing rate.

The variability decline was observed in membrane potential recordings, in the

spiking of individual neurons and in correlated spiking variability measured

all stimuli tested, regardless of whether the animal was awake, behaving or

medical applications, we face decision-making problems where data are limited

ingredient is a probabilistic dynamics model learned from data, which is

takes uncertainties into account in a principled way and, therefore, reduces

to jointly learning models and controllers when expert knowledge is difficult

a double pendulum attached to a cart or to balance a unicycle with five

step toward pilco's extension to partially observable Markov decision

processes, we propose a principled algorithm for robust filtering and

nonlinear systems, it does neither rely on function linearization nor on

telomere length 0.70 +/- 0.24 compared with 1.02 +/- 0.52 in individuals who had

mixture modeling framework it is possible to infer the necessary number of

problem of finding the 'correct' number of mixture components by assuming

models are cast as infinite mixture models and inference using Markov chain

primary goal of this paper is to compare the choice of conjugate and

non-conjugate base distributions on a particular class of DPM models which is

single data set may be multi-faceted and can be grouped and interpreted in

provide a variational inference approach to learn the features and clustering

clusterings and views but also allows us to automatically learn the number of

representations humans develop - in other words, to 'read' subject's minds.

In order to eliminate potential biases in reporting mental contents due to

verbal elaboration, subjects' responses in experiments are often limited to

binary decisions or discrete choices that do not require conscious reflection

such ideal observer models allowed us to infer subjects' mental

results demonstrate a significant potential in standard binary decision tasks

work is that user feedback can greatly improve the quality of automatically

importantly, a principled model for exploiting user feedback to learn the

truth values of statements in the knowledge base would be a major step

family of probabilistic graphical models that builds on user feedback and

logical inference rules derived from the popular Semantic-Web formalism of

extensive experiments on real-world datasets, with feedback collected from

biomedical data modeling of 42 skin and ageing phenotypes measured on the

problems include high missingness, heterogeneous data, and repeat

predicting disease labels and symptoms from available explanatory variables,

concluding that factor analysis type models have the strongest statistical

scalable algorithm for fitting the Kronecker graph generation model to large

In contrast, KRONFIT takes linear time, by exploiting the structure of

functional similarity of genes and their distances in networks based on

predicting gene function from synthetic lethality interaction networks.

gene function prediction from synthetic lethal interaction networks based on

improved gene function prediction compared with state-of-the-art competitors

the problem is to search for a lower dimensional manifold which captures the

in a low dimensional latent space, and a stochastic map to the observed

improved generalisation performance and better density estimates in

based on three desiderata, namely that rewards should be real- valued,

might allow for a novel approach to adaptive control based on a minimum

adaptive agent that is universal with respect to a given class of

minimizing the relative entropy of the adaptive agent from the expert that is

problem is given by a stochastic controller called the Bayesian control rule,

is shown that under mild assumptions, the Bayesian control rule converges to

expected utility (MEU) principle, but crucially ignores the resource costs

conjugate utility function based on three axioms: utilities should be

these axioms enforce a unique conversion law between utility and probability

characterized as a variational principle: given a utility function, its

conjugate probability measure maximizes a free utility functional.

free utility due to the addition of new constraints expressed by a target

that optimal control, adaptive estimation and adaptive control problems can

a principled approach to bounded rationality that establishes a close link to

inference methods is provided, including exact and variational inference,

classification using machine learning continues to be an active research

approach turns all the modelling complexity into a feature selection problem.

classification problem by designing a custom probabilistic graphical model.

online change point detection with Gaussian processes to create a

nonparametric time series model which can handle change points.

Bayesian online change point detection algorithms, is applicable when
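
As background, a minimal run-length filter in the style of Bayesian online change point detection, using a Gaussian observation model with known noise variance and a conjugate Normal prior on the mean; hazard rate and priors are illustrative, and the nonparametric GP-based model described above replaces this simple predictive.

```python
import numpy as np

def bocpd_gaussian(x, hazard=1 / 200.0, mu0=0.0, var0=1.0, obs_var=1.0):
    """Return the run-length distribution R[t] after each observation,
    recursively mixing 'growth' and 'change point' probabilities."""
    T = len(x)
    R = np.zeros((T + 1, T + 1)); R[0, 0] = 1.0
    mu, var = np.array([mu0]), np.array([var0])
    for t, xt in enumerate(x):
        # Gaussian predictive for each current run length
        pred = np.exp(-0.5 * (xt - mu) ** 2 / (var + obs_var)) / np.sqrt(
            2 * np.pi * (var + obs_var))
        growth = R[t, :t + 1] * pred * (1 - hazard)
        change = (R[t, :t + 1] * pred * hazard).sum()
        R[t + 1, 1:t + 2] = growth
        R[t + 1, 0] = change
        R[t + 1] /= R[t + 1].sum()
        # conjugate Normal-Normal update for each run length, plus a fresh prior
        new_var = 1.0 / (1.0 / var + 1.0 / obs_var)
        new_mu = new_var * (mu / var + xt / obs_var)
        mu = np.concatenate([[mu0], new_mu])
        var = np.concatenate([[var0], new_var])
    return R
```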

modules (TMs) by integrating gene expression and transcription factor binding

particular, it allows us to identify the subset of genes that share the same

combining gene expression and transcription factor binding (ChIP-chip) data

in this way, we are better able to determine the groups of genes that are

the code for the work presented in this article, please contact the

to relational learning which, given a set of pairs of objects S = A(1):B(1),

for analogous pairs of objects that match the query set of interest.

containing features of the objects of interest and a link matrix specifying

approach can work in practice even if a small set of protein pairs is

important step towards this goal is to detect genes whose expression levels

this information may provide insights into the course and causal structure of

genes to an infection by a fungal pathogen using a microarray time series

dataset covering 30,336 gene probes at 24 observed time points.

classification experiments, our test compares favorably with existing methods

sounds in order to solve important tasks, like automatic speech recognition,

textures, like running water, wind, fire and rain, depends on

are more robust to noise, they can also fill-in missing sections of data, and

modulation depth and modulation time-scale in each sub-band of a signal are

coloured noise carriers is shown to be capable of generating a range of

inference in the model for auditory textures qualitatively replicates the

primitive grouping rules that listeners use to understand simple acoustic

methods to do fast online anomaly detection using scan statistics.

statistics have long been used to detect statistically significant bursts of

issues that occur in application: dealing with an unknown background rate of

events, allowing for slow natural changes in background frequency, the

inverse problem of finding an unusual lack of events, and setting the test
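
A toy sliding-window version of the idea, for a binned event stream with an estimated background rate; the window lengths and simple Poisson null are illustrative assumptions rather than the full method, which also handles slow rate drift and unusual deficits of events.

```python
import numpy as np
from scipy.stats import poisson

def scan_statistic_pvalues(event_counts, window=50, background_window=500):
    """Compare the event count in a short window against a Poisson null whose
    rate is estimated from a longer trailing window; small p-values flag bursts."""
    counts = np.asarray(event_counts, dtype=float)
    pvals = np.ones_like(counts)
    for t in range(background_window + window, len(counts)):
        rate = counts[t - background_window - window:t - window].mean()
        burst = counts[t - window:t].sum()
        pvals[t] = poisson.sf(burst - 1, mu=rate * window)  # P(X >= burst)
    return pvals
```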

unscented Kalman filter (UKF) is a widely used method in control and time

a step known as sigma point placement, causing it to perform poorly in

the sigma points correctly from data can make sigma point collapse much less

task of verb clustering, incorporating supervision in the form of must-links

active learning approach for constraint selection employing uncertainty-based

predict the latent standard deviations of a sequence of random variables.

easily handle missing data, incorporate covariates other than time, and model

variable models represent hidden structure in observational data.

for the distribution of the observational data changing over time, space or

generalizations has been proposed for latent variable models based on the

generates a binary random matrix with an unbounded number of columns for each

membership model - each data point is modeled with a collection of

point is positively correlated with its proportion within that data point.

topic (component) might be rare throughout the corpus but dominant within

data point's 'focus', is determined independently from the amount that each

model of the data's probability density can assist in identifying clusters.

image analysis the objective is to unmix a set of acquired pixels into pure

algorithms are based on sparsity regularization encouraging pure spectral

evaluate the method on synthetic and real hyperspectral data of wheat

system must learn to infer the presence of objects and features in the world

explicitly, model the way these elements interact to create the image.

response properties of cells in the mammalian visual system reflect this

properties of simple and complex cells in the primary visual cortex (V1).

particular, feature identity variables were activated in a way that resembled

the activity of complex cells, while feature attribute variables responded

the model closely parallelled the reported anatomical grouping of simple

of complex and simple cells as elements in the segmentation of a visual scene

identity and appearance of more articulated visual elements, culminating in

Our second class of methods uses Dirichlet process mixtures (DPM) of such

a new approximate inference method based on the minimization of

probabilistic models with little effort since it only requires as input the

probit regression and neural network regression and classification problems

show that BB-α with non-standard settings of α, such as α = 0.5, usually

ultimatum game or the prisoner's dilemma typically lead to Nash equilibria

when multiple competitive decision makers with perfect knowledge select

in motor interactions in which players vie for control and try to minimize

single player took both roles, playing the sensorimotor game bimanually,

the study of human motor interactions within a game theoretic framework,

suggesting that the coupling of motor systems can lead to game theoretic

volume and cardiac rate can induce changes in the blood-oxygen level

of neural activation using fMRI, and it is therefore important to model and

explain significantly more variance in gray matter BOLD signal than a model

that includes RV alone, and an average HR response function is proposed that

associated with individual array dimensions are jointly retrieved while the

against classical models on applications to modeling amino acid fluorescence,

collaborative filtering and a number of benchmark multiway array data.

analytic moment-based filter for nonlinear stochastic dynamic systems modeled

contrast to humans or animals, artificial learners often require more trials

when learning motor control tasks solely based on experience.

autonomous learners will reduce the amount of engineering required to solve

key ingredients of biological learning systems to speed up artificial

when learning motor control tasks in the absence of expert knowledge.

implement two key ingredients of biological learning systems, generalization

learning (RL) and optimal control of systems with continuous states and

optimal control problem, where problem-specific prior knowledge is available,

classic optimal control problem, GPDP models the unknown value functions with

state and explores the state space using Bayesian active learning.

to learn probabilistic models of the a priori unknown transition dynamics and

nonparametric latent feature model that does not bound the number of active

The core contribution of this thesis are three new inference procedures that

Bayesian inference in a broad class of conjugate models, (b) a parallel,

distributed across multiple processors, and (c) a variational inference

useful in planning domains where agents must balance actions that provide

agent explores its world and only models visited states explicitly.

seek to identify co-occurring hidden features in a set of observations.

Gibbs sampler for conjugate likelihood models and demonstrate its efficacy on

framework to generate several specific models and demonstrate applications on

divides a large data set between multiple processors and uses message passing

parallel inference scheme for IBP-based models, scales to datasets orders of

principled prior in situations where the number of hidden features is

nonparametric prior for latent feature models in which observations are

feature models seek to infer these unobserved features from a set of

accurate in the limit, samplers for the IBP tend to mix slowly in practice.

on truncating to infinite models, provide theoretical bounds on the

for approximate inference on graphical models based on belief propagation

variables one at a time for conditioning, running belief propagation after

clamp at each level of recursion, and propose a fast heuristic which applies

performs better than selecting variables at random, and give experimental

scores are found to be significant covariates in determining insertion and

deletion (indel) error rates, but not mismatch rates which depend on the

for classifying genuine mutations: a hypothesis testing framework and a

kernel-based approaches to this problem share a set of common features: (i)

inference, and propose (i) an unsupervised kernel method (ii) that takes the

We study unsupervised learning in a probabilistic generative model for

the objects present, specified in the model parameters, combine to form the

standard bars benchmark test for object learning show that the algorithm

The model and the learning algorithm thus connect research on occlusion with

tools for dimensionality reduction when dealing with real valued data.

based on a factorisation of the observed data matrix, and show performance of

infinite-dimensional random objects, such as functions, infinite graphs or

of stochastic process theory, the construction of stochastic processes from

approaches in the literature of microarray gene expression data, little

process mixture (DPM) models provide a nonparametric Bayesian alternative to

the bootstrap approach to modeling uncertainty in gene expression clustering.

clustering of high-dimensional nontime series gene expression data using full

expression profiles which extend previously published cluster analyses of

computational approaches in the literature of microarray gene expression data

mixture) to model uncertainty in the data and Bayesian model selection to

plausible results are presented from a well studied data set: expression

new approach to non-linear regression called function factorization, that is

suitable for problems where an output variable can reasonably be modeled by a

number of multiplicative interaction terms between non-linear functions of

high-dimensional space by the sum of products of simpler functions on

superior predictive performance of the method on a food science data set

features extracted using existing related matrix factorization

probabilistic model for learning non-negative tensor factorizations (NTF), in

evaluate the model on two food science data sets, and show that the

probabilistic NTF model leads to better predictions and avoids overfitting

discuss how the Gibbs sampler can be used for model order selection by

iterated conditional modes algorithm that rivals existing state-of-the-art

used as a representation of conditional independence in machine learning and

machine learning tasks can benefit from the principle presented here: the

power to model dependencies that are generated from hidden variables, but

constraints play an important role in learning with graphical models.

variable model where two independent observed variables have no common latent

marginal observed distribution directly, without explicitly including latent

series of microarray measurements have become available over recent years.

whether a gene is differentially expressed across the whole time

article, we propose a Gaussian process based approach for studying these

thaliana gene expression levels, our novel technique helps us to uncover

and important step towards this goal is to detect genes whose expression

expressed, although this information may provide insights into the course and

process regression, can deal with arbitrary numbers of replicates and is

using a microarray time series dataset covering 30,336 gene probes at 24 time

(IHMM) extends hidden Markov models to have a countably infinite number of

application of this model to artificial data, a video gesture classification

task, and a musical theme labeling task, and show that components of the

adapt the UPM to incoming regime changes as soon as possible, necessitating

hyper-parameters which allow the user to fully specify the hazard function

promising result we evaluate the output of the unsupervised PoS tagger as a

direct replacement for the output of a fully supervised PoS tagger for the

guiding DPMMs towards a particular clustering solution using pairwise

highlights the benefits of both standard and constrained DPMMs compared to

evaluate a method of guiding DPMMs towards a particular clustering solution

Bayesian (VB) [4] and collapsed variational methods [5], [6] recently

efficiently sum over exponentially many ways of partitioning the data and

offer a novel lower bound on the marginal likelihood of the DPM [6].

paper we make the following contributions: (1) We show empirically that the

BHC lower bounds are substantially tighter than the bounds given by VB [4]

and by collapsed variational methods [5] on synthetic and real datasets.

combinatorial approximate inference methods and lower bounds may be useful

neural trajectories that summarize the activity recorded simultaneously from

visualizing the high-dimensional, noisy spiking activity in a compact form,

trajectories involve a two-stage process: the spike trains are first smoothed

first describe extensions of the two-stage methods that allow the degree of

present a novel method for extracting neural trajectories - Gaussian-process

methods to the activity of 61 neurons recorded simultaneously in macaque

trajectories, we directly observed a convergence in neural state during motor

activity across a neural population to the subject's behavior on a

characterize neural population activity when the underlying time course is

known, we performed simulations that revealed that GPFA performed tens of

problem of extracting smooth, low-dimensional neural trajectories that

high-dimensional, noisy spiking activity in a compact form, such trajectories

can offer insight into the dynamics of the neural circuitry underlying the

a two-stage process: the spike trains are first smoothed over time, then a

extensions of the two-stage methods that allow the degree of smoothing to be

chosen in a principled way and that account for spiking variability, which

- which unifies the smoothing and dimensionality-reduction operations in a

61 neurons recorded simultaneously in macaque premotor and motor cortices

other recorded neurons, we found that the proposed extensions improved the

directly observed a convergence in neural state during motor planning, an

methods can be a powerful tool for relating the spiking activity across a

activity when the underlying time course is known, we performed simulations

of visual cortex, and in particular those based on sparse coding, have

Having validated our methods on toy data, we find that natural images are

offer an attractive framework within which to infer underlying intensity

naive implementation will become computationally infeasible in any problem of

problem specific methods for a class of renewal processes that eliminate the

memory burden and reduce the solve time by orders of magnitude.

trains present challenges to analytical efforts due to their noisy, spiking

on a smoothed, denoised estimate of the spike train's underlying firing rate.

Current techniques to find time-varying firing rates require ad hoc choices

optimal estimates of firing rate functions underlying single or multiple
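
The firing-rate fragments above contrast principled rate estimation with the usual ad hoc smoothing choices. For reference, a minimal sketch of that ad hoc baseline, a fixed-bandwidth Gaussian-kernel smoother of a spike train; the function name, kernel width sigma, and units are illustrative assumptions.

    import numpy as np

    def smoothed_rate(spike_times, t_grid, sigma=0.05):
        """Fixed-bandwidth estimate of an underlying firing rate (spikes/s).

        spike_times : spike times in seconds
        t_grid      : times at which to evaluate the rate
        sigma       : Gaussian kernel width in seconds (the ad hoc choice)
        """
        s = np.asarray(spike_times)[None, :]            # (1, n_spikes)
        t = np.asarray(t_grid)[:, None]                 # (n_grid, 1)
        kernels = np.exp(-0.5 * ((t - s) / sigma) ** 2)
        kernels /= sigma * np.sqrt(2.0 * np.pi)         # each kernel integrates to 1
        return kernels.sum(axis=1)                      # sum over spikes -> rate

    t = np.linspace(0.0, 2.0, 400)
    rate = smoothed_rate([0.20, 0.25, 0.30, 1.10, 1.15], t, sigma=0.05)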

difficult to determine an optimal closed-loop policy in nonlinear control

with people using natural language often experience communication errors and

successfully used probabilistic models of speech, language, and user behavior

to generate robust dialog performance in the presence of noisy speech

recognition and ambiguous language choices, but decisions made using these

probabilistic models are still prone to errors due to the complexity of

acquiring and maintaining a complete model of human language and behavior.

specify a priori from domain knowledge, and learning these parameters from

model of the user online, including new vocabulary and word choice

questioning as the agent learns, but also allows the agent to actively query

for additional information when its uncertainty suggests a high risk of

study UK road traffic data and explore a range of modelling and inference

motorway record speed and flow measurements at regularly spaced locations as

data helps us to better understand and quantify the nature of congestion on

understand the overall journey times and we look at methods to improve our

ability to predict journey times given access jointly to both real-time and

framework for modeling partial memberships of data points to clusters.

a standard mixture model which assumes that each data point belongs to one

and only one mixture component, or cluster, a partial membership model allows

which assign data points partial memberships to clusters can be useful for

tasks such as clustering genes based on microarray data (Gasch

chemoinformatics studied graph data with dozens of nodes, systems biology and

significant increase in graph size: Classic algorithms for data analysis are

overcome this problem is to design novel efficient algorithms, the other is

'representative' small subgraph from the original large graph, with

We present a form of GPC which is robust to labeling errors in the data set.

This model allows label noise not only near the class boundaries, but also

far from the class boundaries which can result from mistakes in labelling or

usefulness of the proposed algorithm with model selection method through

mistakes in labelling or gross errors in measuring the input features.

derive an outlier robust algorithm for training this model which alternates

recent algorithms for approximate inference in Gaussian process models for

some methods produce good predictive distributions although their marginal

methods in various ways, and provide unifying code implementing all

novel framework for very fast model-based reinforcement learning in

framework, we use flexible, non-parametric models to describe the world based

problem in a setting where we provide very limited prior knowledge about the

predicting class labels for objects within a relational database, it is often

encodes that, conditioned on input features, each object label is independent

probabilistic models based on factorizing over latent variables and model

shown that the LSVB approach gives better estimates of the model evidence as

contains features that span four orders of magnitude: Sentences ($\sim1$s);

time-scales to solve complicated tasks such as auditory scene analysis [1].

One route toward understanding how auditory processing accomplishes this

algorithms largely concentrate on the shorter time-scale structures in

resolution (to extract short temporal information) and for long duration (to

develop a new statistical model for natural sounds that captures structure

across a wide range of time-scales, and to provide efficient learning and

number of states considered at each time step to a finite number, with

dynamic programming, which samples whole state trajectories efficiently.

present applications of iHMM inference using the beam sampler on changepoint

binary hidden Markov chains which together produce an observable sequence of

on a dataset based on Levin’s (1993) verb classes using the recently

learning task in natural language processing (NLP): lexical-semantic verb

method to add human supervision to the model in order to influence the

performed highlights the benefits of the chosen method compared to previously

accuracy, especially when the number of training examples per problem is

set of latent variable models for different multi-task learning scenarios.

show that the framework is a generalization of standard learning methods for

single prediction problems and it can effectively model the shared structure

experiments on both simulated datasets and real world classification datasets

standard multi-task learning setting and transfer learning setting.

introduce and demonstrate a new approach for estimating a distribution over

the missing labels where data points are viewed as nodes of a graph, and

pairwise similarities are used to derive a transition probability matrix P

set to be absorbing states of the Markov random walk, and the probability of

derived and demonstrated on both real and artificial data sets, including a
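
The fragment above describes labels estimated from an absorbing random walk: pairwise similarities define a transition matrix P, labeled points are absorbing states, and unlabeled points are scored by their absorption probabilities. Below is a minimal sketch of that computation under those assumptions; variable names are illustrative.

    import numpy as np

    def absorption_probabilities(W, labeled_idx, Y_l):
        """Label distributions for unlabeled nodes from an absorbing random walk.

        W           : (n, n) non-negative pairwise similarity matrix
        labeled_idx : indices of labeled nodes (treated as absorbing states)
        Y_l         : (n_labeled, n_classes) one-hot labels for those nodes
        Returns the probability that a walk started at each unlabeled node is
        first absorbed by a node of each class.
        """
        n = W.shape[0]
        P = W / W.sum(axis=1, keepdims=True)            # row-stochastic transitions
        u = np.setdiff1d(np.arange(n), labeled_idx)     # unlabeled nodes
        P_uu = P[np.ix_(u, u)]
        P_ul = P[np.ix_(u, labeled_idx)]
        return np.linalg.solve(np.eye(len(u)) - P_uu, P_ul @ Y_l)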

via a kernel function using input attributes of the instances.

knowledge can further reveal additional pairwise correlations between

Experimental results on several real world data sets verify the usefulness of

We describe a flexible nonparametric approach to latent variable modelling in

a probability distribution over equivalence classes of binary matrices with a

finite number of rows, corresponding to the data points, and an unbounded

matrix indicate which latent feature is possessed by which data point, and

over unbounded binary matrices by taking the limit of a distribution over

single hyperparameter which controls both the number of features per object

number of features per object and the total number of features in the matrix.
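
The latent feature fragments above describe a prior over binary matrices with unboundedly many columns, governed by a single hyperparameter. Here is a minimal generative sketch of the Indian buffet process matching that description; the helper name is an assumption, and the Poisson/harmonic-number remarks are standard properties of the IBP rather than claims about the source.

    import numpy as np

    def sample_ibp(n_objects, alpha, rng=None):
        """Draw a binary feature matrix Z from the Indian buffet process.

        Row i is object i; columns are latent features.  alpha sets both the
        expected number of features per object (alpha) and the expected total
        number of features (roughly alpha times the n-th harmonic number).
        """
        rng = np.random.default_rng(rng)
        columns = []                                  # one 0/1 list per feature
        for i in range(n_objects):
            for col in columns:                       # existing features
                m_k = sum(col)                        # objects already owning it
                col.append(int(rng.random() < m_k / (i + 1)))
            for _ in range(rng.poisson(alpha / (i + 1))):
                columns.append([0] * i + [1])         # brand-new features
        if not columns:
            return np.zeros((n_objects, 0), dtype=int)
        return np.array(columns, dtype=int).T

    Z = sample_ibp(10, alpha=2.0, rng=0)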

each cluster and forms an overlapping mixture by taking products of such

focus in on overlapping regions while maintaining the ability to model a

proposed where observed data Y is modelled as a linear superposition, G, of a

active for a specific data point is specified by an infinite binary matrix,

algorithm on synthetic and gene expression data and compare to standard ICA

assign each row or column to a single cluster based on a categorical hidden

feature, our binary feature model reflects the prior belief that items and

provide simple learning and inference rules for this new model and show how

corresponds to solving a multiclass learning task in each clique, which

concentrate on the best known procedures and standard generalisations of

which objects can be related, making automated analogical reasoning very

this classical problem as a problem of Bayesian analysis of relational

approximations, in that they try to summarize all the training data via a

large body of powerful learning algorithms for diverse applications.

peer reviewed software accompanied by short articles would be highly valuable

distribution whereby objects are modelled using an unbounded number of latent

adopted by the brain, is to shape useful representations of sounds on prior

This demodulation cascade relates to classical amplitude demodulation, but

approach, probabilistic amplitude demodulation, is shown to out-perform the

inference and learning in the limiting case of a suitable probabilistic model
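
The fragments above contrast probabilistic amplitude demodulation with the classical variant. For orientation, a minimal sketch of that classical baseline (Hilbert envelope plus low-pass smoothing); the cutoff frequency and function name are illustrative choices, and this is not the probabilistic method itself.

    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def classical_amplitude_demodulation(x, fs, cutoff_hz=20.0):
        """Split a sound x (sampled at fs Hz) into envelope * carrier.

        Classical recipe: take the magnitude of the analytic signal, then
        low-pass it so the envelope varies slowly relative to the carrier.
        """
        envelope = np.abs(hilbert(x))                 # instantaneous amplitude
        b, a = butter(4, cutoff_hz / (fs / 2.0))      # low-pass smoothing
        envelope = np.maximum(filtfilt(b, a, envelope), 1e-8)
        carrier = x / envelope
        return envelope, carrier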

identifying news articles in multiple languages that report on the same news

We discuss a general way for constrained clustering using a recent,

a correlation clustering implementation that features linear program chunking

clustering algorithm to the crosslingual link detection problem and present

experimental results that show correlation clustering improves upon the

hierarchical clustering approaches commonly used in link detection, and,

clustering in its most basic form where only a local metric on the data space

novel approach to clustering where data points are viewed as nodes of a

graph, and pairwise similarities are used to derive a transition probability

reveals structure at increasing scales by varying the number of steps taken

of clusters, and the number of steps that best reveal it, are found by

is a simple yet powerful method for finding structure in data using spectral

new insights into how the method works and uses these to derive new

algorithms which given the data alone automatically learn different plausible

given affinity matrix to infer the number of clusters in data, and the second

combines learning the affinity matrix with inferring the number of clusters.

feature model that allows for multi-complex membership by individual proteins

is coupled with a graph diffusion kernel that evaluates the likelihood of two

complexes and automatically infers the number of significant complexes from

method is capable of partitioning the data in a biologically meaningful way.

A supplementary web site containing larger versions of the figures is

multiple sequence alignment profiles with the purpose of improving the

Markov model where a hidden state generates segments of various length and

likelihood function that explicitly represents multiple sequence alignment

benchmark data sets show that incorporating the profiles results in

concept or cluster, given a query consisting of a few items from that

concept of a cluster and ranks items using a score which evaluates the

marginal probability that each item belongs to a cluster containing the query

binary data and show that our score can be evaluated exactly using a single

sparse matrix multiplication, making it possible to apply our algorithm to
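
The retrieval-by-example fragments above state that, for sparse binary data, the exact ranking score reduces to a single sparse matrix-vector product. The sketch below shows how that can work under a per-feature Beta-Bernoulli model; the hyperparameter defaults and helper name are assumptions, and the key point is that the log-score is linear in x, so ranking all items is one multiplication.

    import numpy as np

    def query_by_example_scores(X, query_idx, alpha=None, beta=None):
        """Score every item for membership in the concept implied by a query set.

        X          : (n_items, n_features) binary matrix
        query_idx  : indices of the query items
        alpha,beta : per-feature Beta hyperparameters (here scaled to feature means)
        """
        X = np.asarray(X, dtype=float)
        n_items, d = X.shape
        if alpha is None or beta is None:
            m = X.mean(axis=0)
            alpha, beta = 2.0 * m + 1e-6, 2.0 * (1.0 - m) + 1e-6
        Xq = X[query_idx]
        N = Xq.shape[0]
        alpha_t = alpha + Xq.sum(axis=0)              # posterior Beta parameters
        beta_t = beta + N - Xq.sum(axis=0)
        # log p(x | query) - log p(x) = c + q . x, so ranking is one mat-vec product
        q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
        c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
                   + np.log(beta_t) - np.log(beta))
        return c + X @ q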

aspects (EBA) is a probabilistic choice model describing how humans decide

choosing which mobile phone to buy the features to consider may be: long

2006 track, which asks participants to retrieve passages from scientific

generation and reranking query results to encourage relevance and diversity.

For query generation, we automatically identify noun phrases from the topic

descriptions, and use online resources to gather synonyms as expansion terms.

passages using a naive clustering-based approach in our second run, and we

test GRASSHOPPER, a novel graph-theoretic algorithm based on absorbing random

failed to produce adequate query results for several topics, causing our

surprisingly achieved higher aspect-level scores using the initial ranking

probability distribution over equivalence classes of binary matrices with a

this distribution as a prior in an infinite latent feature model, deriving a

image retrieval which models the distribution of color and texture features

marginal likelihoods, which integrate over model parameters, in the case of

sparse binary data the score reduces to a single matrix-vector multiplication

certain class at an input location is monotonically related to the value of

over this latent function, data are used to infer both the posterior over the

discriminating between images of faces of men and women from face images.

consider the gender classification task of discriminating between images of

auxiliary-variable scheme (Møller et al., 2004) offers a solution: exact

and due to high-quality requirements the effect on the devices' performance

specifications, and model-based sensitivity analysis has made its way into

scheme, making it possible to apply the analysis to computationally costly

systems serve us as case studies to introduce the analysis and to assess its

costly simulation runs and can ensure a reliable accuracy of the

account point predictions and losses we proposed that evaluate the quality of

provide a principled, practical, probabilistic approach to learning in kernel

independencies that is closed under marginalization and arises naturally from

process (GP) regression model whose covariance is parameterized by the

training cost and O(M^2) prediction cost per test case.

We finally demonstrate its performance on some large data sets, and make a

optimization of a small set of M pseudo-inputs, thereby reducing complexity

that this optimization space becomes impractically big for high dimensional

dimensional space is learned in a supervised manner, alongside the

even more powerful in this regard - we learn an uncertainty parameter for

input-dependent noise makes it possible to apply GPs to much larger and more
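
The sparse GP fragments above are about reducing the cost of exact Gaussian process regression by optimizing a small set of M pseudo-inputs. For contrast, here is a minimal sketch of the exact O(N^3) predictor being approximated, with a squared-exponential kernel; the hyperparameter values and function name are illustrative, and no pseudo-input machinery is shown.

    import numpy as np

    def gp_predict(X, y, X_star, lengthscale=1.0, signal_var=1.0, noise_var=0.1):
        """Exact GP regression predictions; the Cholesky factorisation is O(N^3)."""
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

        K = k(X, X) + noise_var * np.eye(len(X))
        L = np.linalg.cholesky(K)                               # O(N^3)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y
        K_s = k(X_star, X)
        mean = K_s @ alpha
        v = np.linalg.solve(L, K_s.T)
        var = signal_var + noise_var - (v ** 2).sum(axis=0)     # predictive variance
        return mean, var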

of hidden causes is unbounded, but only a finite number influence observable

approaches in discovering hidden causes in simulated data, and use our

non-parametric approach to discover hidden causes in a real medical

transcriptional networks from highly replicated gene expression profiling

time series data obtained from a well-established model of T cell activation.

measurements depend on some hidden state variables that evolve according to

directly measured in a gene expression profiling experiment, for example:

bootstrap procedure was used to derive classical confidence intervals for

variational approximations are used to perform the analogous model selection

generalizes the probit function is used as the likelihood function for

on support vector machines on some benchmark and real-world data sets,

including applications of ordinal regression to collaborative filtering and

likelihood function is proposed to capture the preference relations in the

results compared against the constraint classification approach on several

analysis, these ordinal labels have been rarely treated in a principled way.

This paper describes a gene selection algorithm based on Gaussian processes

to discover consistent gene expression patterns associated with ordinal

ordinal labels is demonstrated by the gene expression signature associated

prognostic application in mind are also useful as an investigative tool to

reveal associations between specific molecular and cellular events and

microarray data with binary labels with results comparable to other methods

present a novel algorithm for agglomerative hierarchical clustering based on

used to compute the predictive distribution of a test point and the

It uses a model-based criterion to decide on merging clusters rather than an

Nested sampling provides a robust alternative to annealing-based methods for

an ability to draw samples uniformly from the prior subject to a constraint
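
The nested sampling fragments above rest on one primitive: drawing from the prior subject to a hard likelihood constraint. Below is a toy sketch of the evidence accumulation built on that primitive; the constrained sampler is assumed to be supplied, and real implementations work in log space with more careful termination.

    import numpy as np

    def nested_sampling_evidence(log_like, sample_prior_above, n_live=50,
                                 n_iter=500, rng=None):
        """Toy nested sampling estimate of the evidence Z = E_prior[likelihood].

        sample_prior_above(L, rng) must return a draw from the prior restricted
        to log_like(theta) > L; with L = -inf it is an ordinary prior draw.
        """
        rng = np.random.default_rng(rng)
        live = [sample_prior_above(-np.inf, rng) for _ in range(n_live)]
        live_logL = np.array([log_like(t) for t in live])
        Z, X_prev = 0.0, 1.0                              # evidence so far, prior mass left
        for i in range(1, n_iter + 1):
            worst = int(np.argmin(live_logL))
            L_min = live_logL[worst]
            X_i = np.exp(-i / n_live)                     # expected remaining prior mass
            Z += np.exp(L_min) * (X_prev - X_i)           # shell contribution
            live[worst] = sample_prior_above(L_min, rng)  # replace the worst live point
            live_logL[worst] = log_like(live[worst])
            X_prev = X_i
        return Z + np.exp(live_logL).mean() * X_prev      # remaining live points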

better sparse approximations, combining the best of the existing strategies,

impractical when the size of the training set exceeds a few thousand cases.

framework for learning precise, compact, and fast representations of the

model for density estimation and on binary linear classification, with both

synthetic data sets for visualization and several real data sets.

show significant reductions in prediction time and memory footprint.

In this paper we consider latent variable models and introduce a new

possible to identify the hidden, independent components instead of just

experimental results on two multi-label text classification data sets show

unique in that major participants report scientific progress on a weekly

protein sequence (target) under consideration by the major structural

analyze the potential impact that this major initiative provides to

proteins targeted by structural genomics and how biased is the target set

of structure predictions using different methods integrated with data from

models (SSMM) to exploit multiple sequence alignment profiles which contain

interactions in β-sheets, this model is capable of carrying out inference on

paper, we merge the parametric structure of neural networks into a segmental

of a sigmoid belief network, captures the underlying dependency in residue

novel approach to the problem of automatically clustering protein sequences

and discovering protein families, subfamilies etc., based on the theory of

dictate how many mixture components are required to model it, and provides a

illustrate our methods with application to three data sets: globin sequences,

method is producing biologically meaningful results, which provide a very

information, we obtain a classification of sequences of known structure which

site containing larger versions of the figures is available at

task is to predict the orientation of visual stimuli from the activity of a

improving the coding of the input (i.e., the spike data) as well as of the

output (i.e., the orientation), and report the results obtained using

The use of non-orthonormal basis functions in ridge regression leads to an

implicit whitening of the basis functions by penalizing directions in

an improved average performance compared to standard ridge regression.
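
The ridge fragments above turn on the fact that the usual weight penalty behaves differently when the basis is not orthonormal. A minimal sketch of ridge regression on an explicit basis expansion, the estimator those remarks refer to; names and the single regulariser lam are illustrative.

    import numpy as np

    def ridge_basis_regression(Phi, y, lam=1.0):
        """Minimise ||y - Phi w||^2 + lam * ||w||^2 for a basis design matrix Phi.

        Phi : (n, m) matrix of basis functions evaluated at the inputs
        y   : (n,) targets
        With non-orthonormal columns the plain ||w||^2 penalty weights directions
        in function space unevenly, which is the effect discussed above.
        """
        m = Phi.shape[1]
        return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)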

neurons from extracellular recordings, known as spike sorting, is a

used as a means for evaluating the quality of clustering and therefore spike

non-parametric modelling approach for black-box identification of non-linear

space where prediction quality is poor, due to the lack of data or its

Gaussian process models contain noticeably fewer coefficients to be optimised.

This paper illustrates possible application of Gaussian process models within

Gaussian process model is used in predictive control, where optimisation of

learning of posterior distributions over undirected model parameters has been

schemes and test on fully observed binary models (Boltzmann machines) for a

small coronary heart disease data set and larger artificial systems.

approximations generally performed poorly, more advanced methods using loopy

propagation, brief sampling and stochastic dynamics lead to acceptable

classification problems the input contains a large number of potentially

features by optimizing the model marginal likelihood, also known as the

selection, and to select data points in a sparse Bayesian kernel classifier.

performance is more accurate on test data than relevance determination by

transcriptional networks from highly replicated gene expression profiling

time series data obtained from a well-established model of T-cell activation.

State space models are a class of dynamic Bayesian networks that assume that

the observed measurements depend on some hidden state variables that evolve

equations for incorporating training data and examine how to learn the

reinforcement learning in continuous state spaces and discrete time.

demonstrate how the GP model allows evaluation of the value function in

would allow the method to capture entire distributions over future values

problem of estimating the depth of a point in space from observing its image

learning approach where the mapping from image to spatial coordinates is

the generic learning approach, in addition to simplifying the procedure of

calibration, can lead to higher depth accuracies than classical calibration

transformation can lead to significantly better performance than using a

form, map posteriors are dominated by a small number of links that tie

interconnected, where links represent relative information between pairs of

nearby features, as well as information about the robot's pose relative to

empirical results obtained for a benchmark data set collected in an outdoor

non-parametric Dirichlet process prior to model the growing number of

algorithm based on convex optimization for constructing kernels for

decomposition of graph Laplacians, and combine labeled and unlabeled data in

Gaussian random field kernels, a nonparametric kernel approach is presented

kernels on real datasets using support vector machines, with encouraging

problem of multi-step ahead prediction in time series analysis using the

discrete-time non-linear dynamic system can be performed by doing repeated

modelling dynamic systems caution has to be exercised when signals are

highlight areas of the input space where prediction quality is poor, due to

the lack of data or its complexity, by indicating the higher variance around

optimisation of control signal takes the variance information into account.

The predictive control principle is demonstrated on a simulated example of

statistics approach, are used to implement a nonlinear adaptive control law.

focus on reliably estimating the predictive mean and variance of forecasted

predictive mean and variance for Gaussian kernel shapes under the assumption

evaluations suffice for the generation of 100 roughly independent points from

involved in computing marginal likelihoods of statistical models (a.k.a.

paper, we study the general class of bound optimization algorithms -

relationship between the updates performed by bound optimization methods and

under which bound optimization algorithms exhibit quasi-Newton behavior, and

results supporting our analysis and showing that simple data preprocessing

can result in dramatically improved performance of bound optimizers in

estimation of latent variable models, and report empirical results showing

outperform standard EM in terms of speed of convergence in certain cases.

processes provide an approach to nonparametric modelling which allows a

of multiple local linear models in a consistent manner, inferring consistent

dynamic system identification, by summarising large quantities of

set size - traditionally a problem for Gaussian process models.

and unlabeled data are represented as vertices in a weighted graph, with edge

mean of the field is characterized in terms of harmonic functions, and is

efficiently obtained using matrix methods or belief propagation.

resulting learning algorithms have intimate connections with random walks,

incorporate class priors and the predictions of classifiers obtained by

entropy minimization, and show the algorithm's ability to perform feature

susceptibility contrast MR imaging requires deconvolution to obtain the

automatically estimates the noise level in each voxel, has the advantage that

value decomposition (SVD) using a common threshold for the singular values

and to SVD using a threshold optimized according to the noise level in each

Dirichlet process capable of capturing a rich set of transition dynamics.

three hyperparameters control the time scale of the dynamics, the sparsity of

allow the alphabet of emitted symbols to be infinite - consider, for

In this paper, we study a special kind of learning problem in which each

training instance is given a set of (or distribution over) candidate class

problem can occur, e.g., in an information retrieval setting where a set of

proposed approach over five different UCI datasets show that our approach is

able to find the correct label among the set of candidate labels and actually

achieve performance close to the case when each training instance is given a

allows the effective covariance function to vary with the inputs, and may

handle large datasets - thus potentially overcoming two of the biggest

sequence or secondary structure data alone have been shown to have potential

which simultaneously learns amino acid sequence, secondary structure and

awareness of the errors inherent in predicted secondary structure may be

validation data have been derived for a number of protein superfamilies from

results using posterior probability classification demonstrate that the

Bayesian network performs better in classifying proteins of known structural

superfamily than a hidden Markov model trained on amino acid sequences

deterministic algorithm to approximately optimize the objective function by

of experts (MoE) models to experimentally show that the proposed method can

find the optimal number of experts of a MoE while avoiding local maxima.

tutorial on learning and inference in hidden Markov models in the context of

possible to consider novel generalizations to hidden Markov models with

multiple hidden state variables, multiscale representations, and mixed
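
The hidden Markov model fragments above concern inference and learning in HMMs and their factorial and multiscale generalizations. As a concrete anchor, here is a minimal sketch of the scaled forward recursion that computes the data log-likelihood in a plain discrete HMM; matrix conventions are stated in the docstring, and nothing about the generalized models is implied.

    import numpy as np

    def hmm_log_likelihood(pi, A, B, obs):
        """Scaled forward algorithm: log p(obs) under a discrete HMM.

        pi  : (K,) initial state distribution
        A   : (K, K) transitions, A[i, j] = p(z_t = j | z_{t-1} = i)
        B   : (K, M) emissions, B[k, m] = p(x_t = m | z_t = k)
        obs : sequence of observed symbol indices
        """
        alpha = pi * B[:, obs[0]]
        log_lik = np.log(alpha.sum())
        alpha /= alpha.sum()
        for x in obs[1:]:
            alpha = (alpha @ A) * B[:, x]        # predict, then weight by emission
            log_lik += np.log(alpha.sum())
            alpha /= alpha.sum()                 # rescale to avoid underflow
        return log_lik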

architecture provide a prior distribution over tree structures, and simulated

We find that the mean field method captures the posterior better than

number of training examples the learner needs in order to achieve good

review how optimal data selection techniques have been used with feedforward

sharply decreases the number of training examples the learner needs in order

best characterized by an interaction of multiple independent causes or

interesting models, beyond the simple linear dynamical system or hidden

variational approach is based on exploiting tractable substructures in the

demonstrate how this can be used to infer the hidden state dimensionality of

an algorithm that infers the model structure of a mixture of factor analysers

determine the optimal number of components and the local dimensionality of

approximation we show how to obtain unbiased estimates of the true evidence,

Abstract: We introduce a new statistical model for time series which

iteratively segments data into regimes with approximately linear dynamics and

neural network (Jacobs et al, 1991) to its fully dynamical version, in which

maximizes a lower bound on the log likelihood and makes use of both the

artificial data sets and on a natural data set of respiration force from a

connections can be learned using simple rules that require only locally

and the localised disparity detectors in the first hidden layer form a

model develops topographically organised local feature detectors.

Real-world learning tasks may involve high-dimensional data sets with

based on maximum likelihood density estimation for learning from such data

widely used tools for learning probabilistic models of time series data.

inferring the posterior probabilities of the hidden state variables given the

nature of the hidden state representation, this exact algorithm is

state variables are decoupled, yielding a tractable algorithm for learning

chorales and show that factorial HMMs can capture statistical structure in

connections can be learned using simple rules that require only locally

sparse, distributed, hierarchical representation of global disparity from

properties of the algorithm on this problem and find that : (1) increasing

likelihood parameter estimation from data sets with missing or hidden

"maximization" step re-estimates the parameters using these uncertain state

Gaussian radial basis function (RBF) approximators are used to model the

nonlinearities, the integrals become tractable and the maximization step can

both neural networks and the central nervous system share is the ability to

coordinates into motor coordinates-to study the generalization effects of

remapping of one or two input-output pairs induced a significant global, yet

decaying, change in the visuomotor map, suggesting a representation for the

map composed of units with large functional receptive fields.

context-dependent remappings indicated that a single point in visual space

possible: estimating the mean of a random variable with known variance.

involve a 'noise limit', below which they regularize with infinite weight

updated using a very simple learning rule that only requires locally

describe a class of probabilistic models that we call credibility networks.

Using parse trees as internal representations of images, credibility networks

are able to perform segmentation and recognition simultaneously, removing the

neural activity) during single trial fMRI activation experiments with blocked

time learning effects between repeated trials is possible since inference is

methods, which exploit laws of large numbers to transform the original

graphical model into a simplified graphical model in which inference is

We study a time series model that can be viewed as a decision tree with

performed exactly and the layers of the decision tree are decoupled, one in

which the decision tree calculations are performed exactly and the time steps

mixture model it is not necessary a priori to limit the number of components

which neatly sidesteps the difficult problem of finding the 'right' number of

In reasonably small amounts of computer time this approach outperforms other

state-of-the-art methods on 5 data-limited tasks from real-world domains.

sources of uncertainty in the estimated generalisation performances due to

simplest method, these parameters are fit from the data using optimization.

against several other methods, on regression tasks using both real data and

approach to learning in multi-layer perceptron neural networks achieves

better performance than the commonly used early stopping procedure, even for

many sources, an environment within which this data can be used to assess the

examine in formal models several types of competitive mechanism that have

confirmed by theoretical analysis and full scale computer simulations.

gaussian clusters, vector quantization, Kalman filter models, and hidden

introducing a new way of linking discrete and continuous state models using a

be implemented in autoencoder neural networks and learned using squared error

known as sensible principal component analysis, as well as a novel concept of

involving global and local mixtures of the basic models and provide

algorithm to overcome the local maxima problem in parameter estimation of

operations using a new criterion for efficiently selecting the

of gaussian mixtures and mixtures of factor analyzers using synthetic and

real data and show the effectiveness of using the split-and-merge operations

configurations we repeatedly perform split and merge operations using a new

Experimental results on synthetic and real data show the effectiveness of

using the split and merge operations to improve the likelihood of both the

local maxima problem in parameter estimation of finite mixture models.

case of mixture models, local maxima often involve having too many components

repeatedly perform simultaneous split-and-merge operations using a new

the proposed algorithm to the training of gaussian mixtures and mixtures of

factor analyzers using synthetic and real data and show the effectiveness of

using the split-and-merge operations to improve the likelihood of both the

usefulness of the proposed algorithm by applying it to image compression and
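
The split-and-merge fragments above wrap extra moves around ordinary EM for mixture models in order to escape poor local maxima. For reference, a minimal sketch of the plain EM inner loop for a Gaussian mixture that such moves would wrap; the split/merge proposals themselves are not shown, and all names are illustrative.

    import numpy as np

    def mvn_logpdf(X, mu, cov):
        """Log-density of the rows of X under a multivariate Gaussian."""
        d = X.shape[1]
        diff = X - mu
        sol = np.linalg.solve(cov, diff.T).T
        return -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1]
                       + (diff * sol).sum(axis=1))

    def em_gmm(X, K, n_iter=100, rng=None):
        """Plain EM for a K-component Gaussian mixture."""
        rng = np.random.default_rng(rng)
        n, d = X.shape
        means = X[rng.choice(n, size=K, replace=False)]
        covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
        weights = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] = p(component k | x_i)
            log_r = np.stack([np.log(weights[k]) + mvn_logpdf(X, means[k], covs[k])
                              for k in range(K)], axis=1)
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and covariances
            Nk = r.sum(axis=0) + 1e-12
            weights = Nk / n
            means = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - means[k]
                covs[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return weights, means, covs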

simple prior over weights implies a complex prior over functions.

central nervous system (CNS) uses an internal model to simulate the dynamic

experimental results and simulations based on a novel approach that

CFTC's Technology Advisory Committee Public Meeting on February 14, 2018 at the CFTC's headquarters, Washington, DC.