AI News, A robot has figured out how to use tools


Learning to use tools played a crucial role in the evolution of human intelligence.

It also includes a camera that sees the environment within reach of the arm—and, most important, a computer running a very large neural network that lets the robot learn.

“We really want to study that sort of generality, rather than a robot learning to use one tool.” The researchers have previously shown how a robot can learn to move objects without explicit instruction.

The new robot learns in a similar way, but it builds a more complex model of the physical world (“Moving this item can move those other items over there”).

Annie Xie, an undergraduate student at UC Berkeley involved with the project, writes about the work in a related blog post: “With a mix of demonstration data and unsupervised experience, a robot can use novel objects as tools and even improvise tools in the absence of traditional ones.” Levine, a leading researcher in robotic learning, says he was surprised by the robot’s ability to improvise.

A Teen Pretends to Be Trans, and Some Viewers Object to the Deception

Ernst, 36, who was a director and producer on the Amazon series “Transparent,” said in an interview that he was skeptical of the novel’s premise initially.

Ernst said he feared that the film was intended to be laughed at by straight viewers — a comedy of errors reminiscent of “Twelfth Night” or “Tootsie” that swaps gender identity rather than plain old gender.

Some on social media asserted that it was problematic for a lesbian to be interested in a transgender man, the implication being that trans men are merely lesbians “playing dress up.” But Schrag and Ernst say that these relationships were true to their social scene in 2006, the year that the story takes place.

Stories by Steve LeVine

The big picture: Historians focus on social forces, big personalities and economics, but wind, tectonics and Earth's orbit are among the deeper dynamics that set "the very course of history,"

The Middle East events prior to the Bronze Age mirror some of the challenges we face today, according to a 2016 paper published in the Quaternary Science Reviews: "If you fast-forward 10, 20, 30 years in the future, it will fundamentally change where rain falls.


easily, why do we want to create clunky mechanical replicas of ourselves?

all we know, Octavia (pictured here) is pondering these questions right now.

She's an advanced social robot who can scuttle around on her wheels, pick up objects,

Very likely a humanoid—a humanlike robot with arms, legs, and a head, probably painted metallic silver.

Unless you happen to work in robotics, I doubt you pictured a mechanical

snake or a clockwork cockroach, a bomb disposal robot, or a Roomba robot vacuum cleaner.

Where the sci-fi robots we see in movies and TV shows tend to be

humanoids, the humdrum robots working away in the world around us (things like robotic welder arms in car-assembly plants) are much more functional,

For some reason, sci-fi writers have an obsession with robots that are little more than flawed, tin-can, replacement humans.

Maybe that makes for a better story, but it doesn't really reflect the current state of robot technology, with its emphasis on developing practical robots

It certainly looks like one, but it has no senses of any kind, no electronic or mechanical onboard computer for thinking, and its limbs have no motors or other means to move themselves.

It's easy enough to write entertaining stories about intelligent robots taking control of the planet, but just try developing

make our robot 1) sense things (detect objects in the world), 2) think about

problem we'll explore in a moment), and then 3) act on them (move or

In psychology (the science of human behavior) and in robotics, these things are

called perception (sensing), cognition (thinking), and action (moving).

arms in factories are mostly about action (though they may have

sensors), while robot vacuum cleaners are mostly about perception

a moment, there's been a long and lively debate over whether robots really need cognition, but

most engineers would agree that a machine needs both perception and action to qualify as a robot.

the tiger is locked inside a cage)—is almost infinitely harder.

then apply the brakes and tell it to creep away in a different

Spot, a quadruped robot built by Boston Dynamics, has a lidar (a kind of laser radar) where

would be to create a super-lifelike humanoid robot and stick it in

its digital camera eyes), interpreting what it sees, and controlling

also build a self-driving car an entirely different way without anyone

satellite navigation, lidar, sonar, radar, and infrared detectors, accelerometers—and

self-driving cars have computers, surfing a flood of digital data quite unlike a human's mental model.

computer simulation of interconnected brain cells that can be trained to

recognize patterns) processing information from a self-driving car's sensors so

the vehicle can recognize situations like driving behind a learner, spotting a looming emergency when children are playing ball by the side of the

road, and other danger signs that experienced drivers recognize automatically.

vision, so the other human senses (hearing, smell, taste, and touch)

hears with their ears, a robot uses a microphone to convert sounds

straightforward to sample a sound signal, analyze the frequencies

in your signal match the pattern of a human scream, it's a

turning human speech into recognizable text for decades; even

can listen to my voice and faithfully print my words on the screen.

yawning iris, or the volatile liquids in perfume drift into our noses and

highly unusual features, such as why smells are powerful memory triggers.

(The answer is simply because the bits of our brain that process smells are physically very close to two other key bits of our brains, namely

the hippocampus, a kind of 'crossroads' in our memory, and amygdala, where emotions

are processed.) So, in the words of the old joke, if robots have no nose, how do they smell?

including mass spectrometers and gas chromatographs, but they're

least, conceptually) the way the human nose converts smells into electrical

left your eyes, ears, or nose, and reached your brain, the problem is simply one of pattern recognition.

If we could modify these things with touch sensors, maybe they could double up as working hands for robots?

over half a century, giving them anything like a working human hand has

high-precision brain surgery, carve stone like a sculptor, or

the New York Times reported in September 2014, building a robot with human touch has

suddenly become one of the most interesting problems in robotics research.

One of the misleading things about trying to develop a humanoid robot is that it tricks us into replicating only the five basic human senses—and one of the great things about robots is that they can use any kind of sensor or detector

is now called the Turing test (a way of establishing whether a machine is 'intelligent') in 1950.

Photo: Robots are designed with friendly faces so humans don't feel threatened when they work alongside them.

digital-camera eyes help it to learn and recognize human expressions, while the rubber-tube

Would you rather your coworkers were cold, logical, hyperintelligent beings who could solve every problem and never make a

ability to listen, smile, tell jokes round the water cooler, and sympathize when your life takes a turn for the worse is

When people look at cars, they tend to see faces (two headlights for eyes, a radiator grille for

a mouth) or link particular emotions with certain colours of paintwork

(a red car is racy, a black one is dark and mysterious, a silver

of his students, listens, coos, and pays attention to humans in a

startlingly babylike way—to the extent that people grow very attached to it,

words, we might redefine the problem of developing emotional robots as making machines that humans really care about.

nothing easier than lifting your hand to scratch your nose—your brain makes it

are generally faster and more reliable, but hopeless at managing rough terrain or stairs).

giant electric, hydraulic, or pneumatic arms fitted with various tools geared

one is driven by four hydraulic legs powered by a small internal combustion engine from a go-kart.

In theory, that gives it a big advantage over robots powered by batteries (it should be able to go

Action is a much simpler problem: movement is movement—we don't have to worry about defining it, the same way we worry over 'intelligence,' for example.

Ironically, though we admire the remarkable grace of a ballet dancer, the leaps and bounds of a world-class athlete, or the painstaking care of a traditional craftsman, we take it for granted that robots will be able to zing about or make things for us with even greater precision.

Most, however, rely on relatively simple, much more affordable electric stepper motors and servo motors,

which let a robot spin its wheels or swing its limbs with pinpoint control.

Unlike humans, who get tired and make mistakes, robots move in a reliably repeatable way;

of doing a wide variety of jobs (in the way that humans are general-purpose

Riveting and welding, swinging and sparking—most of the world's robots are high-powered arms, like the ones you see in car factories.

strong, powerful, and dangerous, they're usually fenced off in safety cages and

A few years ago, Rodney Brooks reinvented the whole idea of the robot arm with an affordable

($25,000), easy-to-use, user-friendly industrial robot called Baxter, which evolved into a similar machine named Sawyer.

It can be 'trained' (Brooks avoids the word 'reprogrammed') simply by moving its limbs, and it has enough onboard sensory

Photo: Robot arms are versatile, precise, and—unlike human factory workers—don't need

robots work this way: they're simply robot trucks with cameras

robots were designed much the same way, though autonomous rovers (with enough onboard cognition to control themselves) are now commonplace. So

remote-controlled from Earth, the much bigger and newer Mars Spirit and Opportunity rovers (launched in 2003) are far more autonomous.

This one, Explosive Ordnance Disposal Mobile Unit (EODMU) 8, can pick up suspect devices with its jaw and carry them to safety.

machine parts for quality control or shifting boxes from one place

the Machines, building intelligent, autonomous, general-purpose robots was considered an overly ambitious research goal.

alternative approach to robotics, where grand plans were put aside and robots simply evolved as their creators figured out better ways of building robots

incarnation of the stair stomping, chair balancing, car driving robots

But the sheer complexity of driving—even humans take years to properly

turning a corner, overtaking, parallel parking, slowing down when

prosthetic limbs, heart pacemakers, cochlear implants for deaf

people, robot 'exoskeletons' that paralyzed people can slip over their bodies to help

to robot but a smarter, smoother transition from flesh machines to hybrids

Frontiers in Neurorobotics

In the extrinsic phase, the agent is required to use the knowledge acquired in the intrinsic phase to solve extrinsic tasks: first the agent has to memorize the state of some objects set in a certain configuration (goal;

The figure shows that during the intrinsic phase the architecture uses intrinsic motivations to learn action affordances and forward models, and during the extrinsic phase it uses affordances and forward models to plan and solve the extrinsic tasks.

Importantly, in both phases active vision allows the agent to focus on one object at a time, in particular to elicit object-centered intrinsic motivations, to learn or activate the affordances and the forward models related to specific objects, and to parse the extrinsic goal into simpler sub-goals each achievable with 1-step planning.

Open-ended learning processes allow robots to acquire knowledge (e.g., goals, action policies, forward models, and inverse models, etc.) in an incremental fashion by interacting with the environment.

The utility and adaptive function of IMs reside in that they can produce learning signals, or trigger the performance of behaviors, to drive the acquisition of knowledge and skills that become useful only in later stages with respect to the time in which they are acquired (Baldassarre, 2011).

The architecture presented here, as usually done in the robotic literature on affordances (see below), assumes the agent is given a set of actions and the capacity to recognize if the performance of those actions leads to their desired effects (goals): the challenge for the robot is indeed to use intrinsic motivations to learn which actions can be successfully accomplished on which objects (object affordances).

Although the intrinsic and extrinsic learning processes are often intermixed in realistic situations, separating them can help to clarify problems and to develop algorithms, most notably to use the performance in the extrinsic phase to measure the quality of the knowledge autonomously acquired in the intrinsic phase.

A possible strategy to solve complex tasks is based on planning (Ghallab et al., 2004), directed to assemble sequences of sub-goals/skills leading to accomplishing the overall complex goal.

The second type involves utility-based planners that have to decide between alternative conflicting goals having a different desirability and pursued in uncertain conditions (here the stochasticity of the environment is due to the fact that actions succeed only with a certain probability).

The interest of affordances for autonomous robotics, as we will also show here, resides in the fact that they can represent a simple and efficient mechanism to rapidly decide which actions can be performed, with some potential utility, on the objects available in the environment.

can be used to support forward planning, as here, because they allow the agent to predict the effect of performing a given action A on a certain object O and check if the obtained effect E matches a desired goal/sub-goal (G).

checking (Russell and Norvig, 2016) allows the agent to search actions for which the effect fulfills the goal and then to search for relevant objects (on which the actions can be applied) to accomplish the goal (the function C has also been used often in the affordance literature to perform sheer object recognition based on how objects respond to actions, e.g., see Fitzpatrick et al., 2003;

Formally, the definition of affordance used in this work is as follows: An affordance is an agent's estimated probability Pr(s′_{b,o} ∈ G | a, s_{b,o}) that if it performs action a on the object o when the object and own body b are in state s_{b,o} then the outcome will be a state s′_{b,o}

(h) The definition assumes a binary success/failure of the action, for example based on the use of a threshold (e.g.: “the object is considered as reached if the distance between the object and the hand-palm after action execution is smaller than 2 cm”).

An alternative definition, assuming that suitable distance metrics could be applied to the object/body states, could state that the affordance is accomplished in a continuous degree related to the final distance of the state achieved by the action and the reference goal-state (e.g.: “a reaching action toward an object brought the hand 5 cm close to it”).

These assumptions reflect fundamental principles of organization of the visual system of primates (Ungerleider and Haxby, 1994) and in artificial systems they allow the reduction of visual information processing and an easier analysis of spatial relations between scene elements (Ognibene and Baldassare, 2015).

The planning processes considered here are akin to those used within the Dyna systems of reinforcement learning literature where planning is implemented as a reinforcement learning process running within a world model rather than in the actual environment (Sutton, 1990;

These simplifications allow us to develop the overall architecture of the system, but in future work some of the components of the system could be substituted by more sophisticated components, in particular for object segmentation and detection (Yilmaz et al., 2006;

The mechanism is inspired by the concept of opportunity cost used in economics, referring to the value of the opportunities that are lost by allocating a certain resource (here a unit of learning time) to a certain activity (Buchanan, 2008).

second contribution of this work is the study of how the introduction of the attention mechanism, extracting information about the single object and about the object appearance/location impacts (a) the affordance learning process and (b) the second extrinsic phase where planning is needed to accomplish an extrinsic complex goal.

The second issue is important as attention is a key means to detect objects in humanoid robots (Camoriano et al., 2017) and information on objects is pivotal for both affordance detection (e.g., Montesano et al., 2008) and for planning.

In particular we will face the problem of what could be the utility of affordances, defined as the probability estimate of action success, within a planning system that is endowed with refined components implementing forward models and relevance checking.

In this respect, we will propose that: (a) affordances can play a role in forward planning as they support fast selection of relevant actions within the system's controller in a way similar to the way they are used to act in the environment, akin to the role of the “preconditions”

The agent (section 2.3) is endowed with a simulated camera sensor that can look at different sub-portions of the working space, and is able to select and perform four actions on the object that is at the center of its camera.

Assuming the working space has a side measuring 1 unit, the circle has a diameter measuring 0.1 units, the square has a side measuring 0.1 units, and the rectangle has sides measuring 0.6 and 0.16 units.

mechanism that leads the system to scan the environment based on its salient features (color blobs of objects) and also observes the action effects in the environment by looking at portions of the scene that are changed by the actions (section 2.3.1);

The outer attention scans the environment on the basis of two bottom-up processes both affecting gaze (they sum up): the first process is sensitive to the saliency of objects, and the second process is sensitive to the changes of the appearance of objects produced by actions.

Attention actively guides an RGB visual sensor (a pan-tilt camera) returning an image centered on the current attention focus and sufficient to always cover the whole working space independently of the gaze pointing.

Firstly the system focuses on the portion of space where a change in the periphery image takes place (this mimics some processes of primates for which a reflex focuses attention on changes happening in the environment (Comoli et al., 2003;

This process leads the system to look at the object that has been changed by the action, e.g., the object that has been changed in color or displaced in space (in the latter case, the system gazes at the position to which the object has moved, and not at the position from which it disappeared, because no object saliency remains at the vacated position).

After this attentional movement on the changed object is performed, the system compares the focus image (involving only the area covering the object) and the object position (given by the gaze direction) before and after performing the attentional movement itself: this allows the system to decide if the performed action was successful or not (presence of affordance).

The focus image comparison is based on the L1 norm of the difference between the vectors of the two images before and after action performance, divided by the number of dimensions of the vectors: if this measure is higher than τ_i = 10^{-5} then the focus image is considered to have changed.

The position comparison is based on the L1 norm of the difference between the vectors of the two (x, y) focus positions before and after action performance, divided by 2: if this measure is higher than τ_p = 0.02 then the position of the object is considered to have changed.
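The two comparisons above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, images are assumed to be flat lists of pixel values, and only the two thresholds come from the text.

```python
from typing import Sequence

TAU_IMAGE = 1e-5     # focus-image threshold quoted in the text
TAU_POSITION = 0.02  # position threshold quoted in the text


def mean_abs_diff(before: Sequence[float], after: Sequence[float]) -> float:
    """L1 norm of the difference, divided by the number of dimensions."""
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(after, before)) / len(before)


def image_changed(before: Sequence[float], after: Sequence[float],
                  tau: float = TAU_IMAGE) -> bool:
    """True if the unrolled focus image differs by more than tau."""
    return mean_abs_diff(before, after) > tau


def position_changed(before: Sequence[float], after: Sequence[float],
                     tau: float = TAU_POSITION) -> bool:
    """True if the (x, y) focus position differs by more than tau."""
    return mean_abs_diff(before, after) > tau
```

For a position vector the divisor is 2, so a displacement of 0.06 along one axis yields a measure of 0.03 and counts as a change, whereas a displacement of 0.02 yields 0.01 and does not.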

The predictor component is formed by 16 predictors (these are regressors), 4 for each of the 4 actions: (a) the affordance predictor predicts the object affordance (i.e., the probability that the action effect takes place when the action is performed);

(b) the learning-progress predictor predicts the learning progress of the affordance predictor when applying the action to the target object, and is used to generate intrinsic motivations based on the learning progress of the affordance predictors;

Each predictor gets as input the focus image (whose pixels are each mapped onto (0, 1) and unrolled into a vector) and returns, with one output sigmoid unit, the prediction of the action success.

Each predictor is trained with a standard rule and a learning target 0 or 1, encoding respectively the failure or success of the action to produce its desired effect, i.e., the absence/presence of the affordance (the learning rate used varied in the different tests, see section 3).
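A predictor of this kind could look like the sketch below. The text does not say which "standard rule" is used, so the delta rule on a single-layer network is an assumption here, as are the class name, the weight initialisation, and the default learning rate.

```python
import math
import random
from typing import Sequence


class AffordancePredictor:
    """Single sigmoid unit over an unrolled focus image (illustrative only)."""

    def __init__(self, n_inputs: int, learning_rate: float = 0.1, seed: int = 0):
        rng = random.Random(seed)
        # Small random initial weights: an assumption, not from the text.
        self.weights = [rng.uniform(-0.01, 0.01) for _ in range(n_inputs)]
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, pixels: Sequence[float]) -> float:
        """Estimated probability that the action succeeds on this object."""
        activation = self.bias + sum(w * x for w, x in zip(self.weights, pixels))
        return 1.0 / (1.0 + math.exp(-activation))

    def update(self, pixels: Sequence[float], target: float) -> float:
        """One delta-rule step toward target (0 = failure, 1 = success)."""
        error = target - self.predict(pixels)
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, pixels)]
        self.bias += self.learning_rate * error
        return error
```

Calling `update` repeatedly with the observed outcome of each action moves the output toward the empirical success rate for that object appearance.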

Each of the where-effect predictors gets as input the initial (x, y) position of the target object and the desired (x, y) position of the object depending on the sub-goal, and predicts, with two linear units, the predicted object (x, y) position after the action performance [x and y coordinates are each mapped to the range (0, 1)].

In this section we first present the motivation signals (section 2.4.1) and the algorithm for learning affordances and forward models (section 2.4.2) used by the three versions of the system (IGN, FIX, and IMP) during the intrinsic phase.

The use of this formula is justified by the fact that entropy is a measure of ignorance (uncertainty) of the system: the uncertainty is minimal when p = 0 or p = 1, and maximal when p = 0.5 (the value of the entropy is here normalized so that H(p) ∈

(2007), the current object is considered interesting, and hence worth exploring, when the entropy is above a threshold th (here th = 0.3, which corresponds to an ignorance value, i.e., affordance predictor output, of 0.947 or 0.053).
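The threshold can be checked numerically. The sketch below assumes the entropy is the binary entropy with a base-2 logarithm, which peaks at 1 for p = 0.5 and reproduces the quoted correspondence between th = 0.3 and outputs of about 0.947 or 0.053; the function names are hypothetical.

```python
import math


def normalized_entropy(p: float) -> float:
    """Binary entropy in bits: 0 at p = 0 or p = 1, and 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def is_interesting(p: float, threshold: float = 0.3) -> bool:
    """True if the affordance prediction p is uncertain enough to explore."""
    return normalized_entropy(p) > threshold
```

With this definition, `normalized_entropy(0.053)` and `normalized_entropy(0.947)` both evaluate to roughly 0.299, so predictions outside the band between those two values fall below the fixed threshold.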

The mechanism of the leaky average threshold, used in IGN and IMP, allows the agent to indirectly compare the relative levels of how interesting different objects are, and to focus the exploration effort on the most interesting of them notwithstanding the fact that different objects are in focus at different times due to the presence of the active vision mechanisms.
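One way to picture a leaky average threshold is as an exponential moving average of the ignorance values seen so far, against which the current object is compared. The update rule, the leak rate, and the names below are assumptions for illustration; the text does not give the exact formula.

```python
class LeakyAverageThreshold:
    """Running (leaky) average of ignorance, used as a dynamic threshold."""

    def __init__(self, leak_rate: float = 0.1, initial: float = 0.0):
        self.leak_rate = leak_rate  # hypothetical value, not from the text
        self.value = initial

    def update(self, ignorance: float) -> float:
        """Move the average a fraction leak_rate toward the new sample."""
        self.value += self.leak_rate * (ignorance - self.value)
        return self.value

    def is_interesting(self, ignorance: float) -> bool:
        """True if this object is more uncertain than the recent average."""
        return ignorance > self.value
```

Because the average is built from objects visited at different times, the comparison is indirect: the agent never needs to hold all objects in focus at once to rank them.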

The algorithm is based on the following operations and functions: (a) the Scan function focuses the system visual sensor on an object based on the bottom-up attention mechanism and returns the image and position of the object;

(d) ScanEffect looks where a change in the environment has happened (e.g., at the new position occupied by the object after the move action, or at the object that changed its color after a change-color action), and returns the resulting new object image and position;

A new sub-goal is selected either in the case that the previous sub-goal has been accomplished (in which case the Boolean variable sub_goal_active = FALSE) or if a time out elapses in the unsuccessful attempt to pursue it (here the time out is equal to 8 iterations of the algorithm).

The agent uses the function Scan to identify a new target sub-goal: this function scans the goal-image with the saliency-based attention mechanism and returns the new sub-goal image and focus location (sub_goal_image and sub_goal_position).

To this purpose, the function ScanEnvironmentWithSameFocusAsSubGoal drives the outer attention focus (targeting the environment) to the position corresponding to the inner attention focus (targeting the goal image) and returns the corresponding focus image (focus_image).

Then the function GoalNotAchievedCheck compares the sub_goal_image and the focus_image to check that the sub-goal has not been achieved yet, in particular it sets the variable sub_goal_active to FALSE or TRUE if they, respectively, match or mismatch (the match holds if the Euclidean distance between the vectors corresponding to the two images is below a threshold τ

To this purpose, the function uses the effect predictors to predict the effect of each action and then compares it with the sub-goal (a match is declared if the Euclidean distances are below 0.0035 between the sub-goal image and the object image, and below 0.01 between their position coordinates).

This allows affordances, which are computed fast through 1-output neural networks, to speed up the planning search by reducing the number of more computationally expensive operations involving the prediction of action effects (new object image and position) and their comparison with the sub-goal.
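The pruning idea can be sketched as a one-step search in which the cheap affordance estimate gates the costlier effect prediction. All names, the callable signatures, and the 0.5 cut-off are hypothetical; the sketch only illustrates the ordering of the two checks.

```python
from typing import Callable, Iterable, Optional


def plan_one_step(
    obj: object,
    sub_goal: object,
    actions: Iterable[str],
    affordance: Callable[[str, object], float],
    predict_effect: Callable[[str, object], object],
    matches: Callable[[object, object], bool],
    min_affordance: float = 0.5,
) -> Optional[str]:
    """Return the first afforded action whose predicted effect matches."""
    for action in actions:
        # Cheap check first: skip actions the object is unlikely to afford.
        if affordance(action, obj) < min_affordance:
            continue
        # Expensive forward-model prediction only for surviving actions.
        if matches(predict_effect(action, obj), sub_goal):
            return action
    return None
```

Actions filtered out by the affordance check never reach `predict_effect`, which is where the saving in computation would come from.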

In cases where the agent has a limited amount of resources available to accomplish the goals (e.g., time or energy to perform actions), it should first invest such resources in the accomplishment of the most valuable sub-goals (for simplicity, we assume here a constant cost per action and a negligible cost of reasoning with respect to acting, as often done in utility-based planning, Russell and Norvig, 2016).

When the Boolean variable max_utility_estimatation is TRUE, the planner evaluates the value of the possible sub-goals it can achieve with the available object-action combinations and stores an estimate of its value in the variable potential_utility, otherwise it acts in the world.

Various mechanisms could be used to set and keep the system in the evaluation mode: here for simplicity we gave the system a certain amount of iterations before performing an action, but more flexible mechanisms might be used (e.g., passing to act when the estimates stabilize).

However, instead of executing the planned actions the system only updates the potential_utility if the current goal-object couple has a higher utility than it: this ensures that the potential utility estimation tends to approximate the value of the most valuable sub-goals.
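The update of potential_utility amounts to keeping a running maximum over the candidates evaluated so far. In the sketch below, the utility of a candidate is assumed to be the sub-goal value multiplied by the estimated success probability; that product, the function name, and the tuple layout are illustrative assumptions.

```python
from typing import Iterable, Tuple


def update_potential_utility(
    potential_utility: float,
    candidates: Iterable[Tuple[float, float]],
) -> float:
    """Raise the estimate whenever a candidate's utility exceeds it.

    Each candidate is (sub_goal_value, success_probability).
    """
    for sub_goal_value, success_probability in candidates:
        utility = sub_goal_value * success_probability
        if utility > potential_utility:
            potential_utility = utility
    return potential_utility
```

Because the estimate only ever increases during evaluation, it converges toward the utility of the most valuable sub-goal seen, which the planner can then use to decide which sub-goals are worth acting on.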

The second test, called the late object test, involves an intrinsic phase where some objects, not initially present, are introduced after the system has acquired knowledge on objects introduced initially.

These results offer a first validation of the idea that the IMP and IGN systems, using a dynamic threshold for evaluating the interest of the current object in terms of its potential return of information, outperform the FIX system previously proposed in the literature.

The reason for the poor scaling is mainly due to the simple bottom-up attention mechanism used here to guide attention, which uses a random exploration to find the objects needed to accomplish a certain sub-goal.

In the first test the six red-green-blue square and circle objects are present from the start of the simulation and have all affordances, while the red-green-blue rectangle objects are introduced late, have move affordance set to 0 (not movable) and the color affordances set to 1 (“greenable,”

In the second test, the square objects are present from the start and afford all actions, the circle objects are present from the start and do not afford any action, and the three rectangle objects are introduced late and afford all actions.

In the second and third late-object tests, the three systems differed in their behaviors while learning the affordances during the intrinsic phase, but all presented a similar performance when tested in the extrinsic phase, so we report the data related to them in Supplementary Material.

Regarding the extrinsic-phase tests (Table 5), all three systems were successful in the first three scenarios, but failed in the fourth and fifth scenarios, showing that the different quality of affordances acquired in the intrinsic phase did not affect the performance in these particular tests.

All systems failed extrinsic-phase scenarios 4 and 5, each involving an additional circle, because in late-object tests 2 and 3 the circle objects do not afford any action and so their state cannot be changed.

The first and second late-object tests confirm that the IGN and IMP systems outperform the FIX system in learning affordances as they can decide to explore a certain object on the basis of a comparison between its expected information gain and the information gain expected on average from other objects.

As a consequence, after learning, the affordance predictions of such objects were inaccurate (far from 0) (object numbers 4, 5 and 6) and showed high variance for most objects (Supplementary Figure 8).

This result can be explained by the fact that the predictions for the novel objects are bootstrapped from previously learned affordances of similar objects, in particular based on the color that causes synergies when it involves objects with same present/absent affordance, and interference in the opposite case.

A mechanism of replay of past experience would possibly overcome this problem as it would intermix experience related to the different objects, allowing the neural-network predictors to disentangle the present/absent affordances of similar objects.

After learning, all three systems showed a good capacity to predict the affordances, but the IMP system was more accurate than the IGN and FIX systems as it could better employ the available learning time to accumulate more knowledge (Figure 11).

The results show that the utility-based planner performed significantly better than the goal-based planner when it could rely on a small number of actions, and showed a statistical trend to do so for a higher number of actions (Figure 12).

However, reality, offering a very large number of alternative (sub-)goals with respect to the actions that can be performed, is similar to the case of the experiment where the system has only 1 or 2 actions available, so utility-based planning is very important in such conditions.

To this purpose, we compared the performance of the utility-planner using affordances and forward models trained with either one of the IGN/FIX/IMP mechanisms for 4,000 executed actions, a time not sufficient to fully learn the forward models.

Figure 13 shows the performance (overall gained utility) of the three utility planners using a maximum of 1, 2, or 3 actions and averaged over 100 repetitions of the experiment.

Many works have focused on intrinsic motivations as a means to solving extrinsic challenging tasks where a long sequence of skills is required to solve a task or maximize a specific reward function (“sparse reward,”

In the intrinsic phase, the first system learns the skills on the basis of reinforcement learning guided by intrinsic motivations and reward functions found by a genetic algorithm that uses the performance in the extrinsic phase as fitness.

Russell and Norvig, 2016), until the late 90's the research on planning mainly focused on backward planning because it proved more efficient than forward planning, which generates a wide search branching as many actions are applicable to each state.

This suggests that the common use of forward planning by organisms (Wikenheiser and Redish, 2015) might rely on affordances for pruning relevant actions: affordances are hence so important for organisms (Thill et al., 2013) because they support not only efficient action but also planning.

Here we compared this mechanism with a more sophisticated mechanism where the ignorance for the current object and more uncertain affordance is compared with the estimated average ignorance for the other objects and affordances on which exploration time and energy might be alternatively invested.

The authors equated affordances to the forward models, in particular to the triad <object-features, action, effect>, where actions are pre-coded behaviors for moving or lifting objects and effects are clustered with a support vector machine.

In a first phase, the system learns the affordances; in a second phase, the system is assigned a goal and plans the course of actions to pursue it, based on a breadth-first forward search over actions and states that runs until it finds a state similar to the goal.
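The breadth-first forward search described above can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function names (`bfs_plan`, `toy_step`), the depth limit, and the one-dimensional toy world are all assumptions made for illustration, with `step` standing in for a learned forward model and `goal_test` for the "similar to the goal" check.

```python
from collections import deque


def bfs_plan(start, goal_test, actions, step, max_depth=3):
    """Breadth-first forward search: return the shortest action list, or None."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal_test(state):
            return plan
        if len(plan) >= max_depth:
            continue  # do not expand beyond the depth limit
        for action in actions:
            nxt = step(state, action)  # forward-model prediction (assumed)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


# Hypothetical toy world: the state is the position of one object on a line.
ACTIONS = ("push_left", "push_right")


def toy_step(pos, action):
    return pos - 1 if action == "push_left" else pos + 1


# "Similar to the goal" is modelled here as a distance tolerance around 2.
plan = bfs_plan(0, lambda s: abs(s - 2) < 1, ACTIONS, toy_step)
print(plan)
```

Because every applicable action is expanded at every state, the frontier grows quickly with depth, which is the branching problem that affordance-based pruning is meant to reduce.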

(2015) affordances support the selection of actions based on the amount of the expected desired (continuous) effect, whereas in our system they support the selection of actions based on their probability of producing the desired effect (which can be present/absent).

This literature is relevant because many action affordances tend to involve objects, and a controllable visual sensor with a limited perceptual scope is a means to isolate information on specific objects (setting aside here the important problem that objects have different sizes, which requires an adjustable visual scope).

Ognibene and Baldassare, 2015) have shown how such systems decrease the computational burden of processing overly wide visual images, and research on state-of-the-art deep neural networks applied to vision problems is confirming the utility of an attentional focus (Xu et al., 2015).

As mentioned in the introduction, this is a fundamental operation representing a first important step from a factored (feature-based) representation of the world state to a structured representation, which allows reasoning about the state of, and the relations between, objects, as typically done in classic planning (Russell and Norvig, 2016).

We started to explore the factorization of a scene by an active vision system endowed with controllable restricted visual sensors in a camera-arm robot interacting with simple-shaped 2D objects such as those used here (Ognibene et al., 2008, 2010).

The sensor of this system was controlled not only by a bottom-up attention component, as here, but also by a top-down component able to learn to find a desired target object by reinforcement learning: the latter component might be integrated into the current model in the future.

These systems have functionalities complementary to those of the system presented here, so they might be suitably integrated in the future to obtain a system able to: (a) self-generate goals and use them to drive the learning of the attention and motor skills needed to accomplish them through intrinsic motivations (previous systems) and (b) develop affordances and re-use the previously acquired skills to solve complex extrinsic tasks (system proposed here).

This work has focused on one specific instance of the concept of affordance, understood as the probability of achieving a certain desired outcome associated with an action by performing that action on a certain object.

We investigated here three issues related to this concept: (a) within an open-ended autonomous learning context, how can intrinsic motivations guide affordance learning in a system that moves the attention of a visual sensor over different objects;

Regarding the first issue, we proposed a mechanism that uses intrinsic motivations (system IGN) to improve on previously proposed ways (Ugur et al., 2007) in which a system endowed with a mechanism focusing on only one object/condition at a time can decide whether or not to invest energy to explore it.

The proposed solution is based on an adjustable variable storing the opportunity cost of the current choice, i.e., the value that the system loses by selecting the current option rather than alternative ones (Buchanan, 2008).
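A minimal sketch of how such an adjustable opportunity-cost variable could gate exploration is given below. The ignorance measure, the leaky-average update, the learning rate, and all names (`ignorance`, `OpportunityCost`, `should_explore`) are assumptions made for illustration; the source does not specify them.

```python
def ignorance(p):
    """Assumed ignorance measure: 1 when p = 0.5, 0 when p is 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


class OpportunityCost:
    """Adjustable variable estimating the value of the alternative options."""

    def __init__(self, initial=0.5, rate=0.5):
        self.value = initial  # current estimate of what alternatives offer
        self.rate = rate      # update rate of the leaky average (assumed)

    def should_explore(self, current_value):
        # Explore the attended object only if it is worth at least
        # as much as the estimated value of the alternatives.
        explore = current_value >= self.value
        # Move the estimate toward the value just observed.
        self.value += self.rate * (current_value - self.value)
        return explore


tracker = OpportunityCost(initial=0.5, rate=0.5)
first = tracker.should_explore(ignorance(0.5))    # poorly known affordance
second = tracker.should_explore(ignorance(0.95))  # nearly learned affordance
print(first, second, tracker.value)
```

With these assumed settings, the system invests in the poorly known object and skips the nearly learned one, because the latter offers less than the running estimate of what could be gained elsewhere.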

With respect to intrinsic motivations, in the case of deterministic scenarios where the system knows in advance that the affordance probability is either 0 or 1, the value of actions on the current object and the cost of the alternatives were estimated here in terms of intrinsic motivations measuring the system's ignorance (system IGN).

However, an open problem of this solution, known in the literature (Santucci et al., 2013), is that error improvement signals, such as those used to compute intrinsic motivation in IMP, are small with respect to noise, as they are equivalent to a derivative in time (vs.

Another important aspect related to autonomous learning driven by intrinsic motivations is that here, for the sake of focussing the research, the current system learns affordances on the basis of pre-wired actions and goals (expected outcomes of affordances).

Regarding the second issue, related to the advantage for planning of having an attention system focusing on objects, we showed how the parsing of the scene into objects allows the solution of non-trivial planning problems on the basis of relatively simple one-step planning mechanisms.

Here we have proposed a first solution to this problem that requires low computational resources (scanning objects in sequence, computing their expected utility, updating a variable that stores the maximum expected utility encountered thus far, and deciding to act on the current object depending on how its utility compares with the maximum expected utility).
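The sequential procedure in the parentheses above can be sketched as follows. This is only an illustration under stated assumptions: the warm-up period, the tolerance factor, the data layout, and the names `best_action` and `scan_and_act` are hypothetical, and expected utility is taken to be the success probability times the outcome value.

```python
def best_action(options):
    """options maps action -> (success_probability, outcome_value)."""
    action = max(options, key=lambda a: options[a][0] * options[a][1])
    p, value = options[action]
    return action, p * value


def scan_and_act(scan_sequence, warmup=3, tolerance=0.9):
    """Scan objects one at a time and act once the current one is good enough."""
    max_seen = 0.0  # maximum expected utility encountered thus far
    for looks, (name, options) in enumerate(scan_sequence, start=1):
        action, utility = best_action(options)
        max_seen = max(max_seen, utility)
        # After an assumed warm-up, act if the current object is close
        # to the best expected utility seen so far.
        if looks > warmup and utility >= tolerance * max_seen:
            return name, action, utility
    return None


# Hypothetical scene; attention is assumed to revisit objects.
scene = [
    ("red", {"push": (0.9, 1.0), "lift": (0.2, 3.0)}),
    ("blue", {"push": (0.5, 4.0)}),
    ("green", {"lift": (0.1, 2.0)}),
]
choice = scan_and_act(scene * 2)
print(choice)
```

Only one object is evaluated per step and a single scalar is stored between steps, which is why the memory and computation required stay low regardless of how many objects the scene contains.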

Regarding the third and last issue, related to the possible added value of affordances in planning systems, we showed that affordances as defined here can be useful in goal-based planning systems as they allow a search focused on actions that can be used in the current context.
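As a rough illustration of this point, the sketch below filters the actions considered for an object by its learned affordance probability before any search step expands them. The threshold, the probability table, and the name `afforded_actions` are assumptions for illustration and are not taken from the source.

```python
def afforded_actions(affordance_prob, obj, actions, threshold=0.5):
    """Keep only actions whose estimated success probability on obj is high enough."""
    return [a for a in actions if affordance_prob.get((obj, a), 0.0) >= threshold]


ACTIONS = ("push", "lift", "roll")

# Hypothetical learned affordances: P(desired effect | object, action).
AFFORDANCE_PROB = {
    ("cube", "push"): 0.9,
    ("cube", "lift"): 0.8,
    ("cube", "roll"): 0.05,
    ("ball", "push"): 0.6,
    ("ball", "lift"): 0.1,
    ("ball", "roll"): 0.95,
}

for obj in ("cube", "ball"):
    usable = afforded_actions(AFFORDANCE_PROB, obj, ACTIONS)
    print(obj, usable, f"branching {len(ACTIONS)} -> {len(usable)}")
```

In this toy case the branching factor per object drops from 3 to 2; the saving compounds with search depth, since each pruned action removes a whole subtree from a forward search.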

In this respect, the concept of affordances used here, pivoting on sb, o, starts to integrate the two approaches, as it focuses on single objects (the body and the target object) and at the same time considers probabilities of their states (specifically encoded with factored representations, such as pixel images, as commonly done in robotic reinforcement learning models, Wiering and Van Otterlo, 2012).

To face this condition the system should be endowed with object segmentation capabilities (Zhang et al., 2008) or robust object recognition algorithms such as deep neural networks (LeCun et al., 2015).

Interestingly, some of these latter algorithms have started to use attention mechanisms to improve object recognition (Maiettini et al., 2018): future work might investigate the links between these mechanisms and the attention processes presented here.

The component could be enhanced with the addition of more sophisticated top-down attention mechanisms able to drive attention on the basis of the current knowledge on the identity and position of objects in the scene (Rasolzadeh et al., 2010;

A final general feature of the system that should be addressed in future work is that the information flows between the several components of the architecture are managed by a hard-coded central algorithm using time flags and, in some cases, symbolic representations.