AI News

Beyond Deep Learning: A Case Study in Sports Analytics

When it comes to televised sporting events, the ability of the cameras to follow the action — both smoothly and accurately — can make or break the viewing experience.

Our project involved creating machine learning algorithms that trained automated cameras to strike a delicate balance between capturing all movements accurately and maintaining a smooth viewing experience.

Before I elaborate on the strategies we used to achieve this objective, which involved combining a deep neural network with a model-based approach, I want to take a step back and discuss, at a higher level, machine learning, deep learning, and imitation learning.

Deep learning is the workhorse behind many of the recent breakthroughs in large-scale applications of machine learning, ranging from image classification, to object recognition, to more complicated sequence prediction tasks such as machine translation.

Image credit: Nvidia

Another field within machine learning is imitation learning, in which an AI system is tasked with performing sequential decision making that mimics human demonstrations, a technique we employed in our camera automation project.

A machine learning method by itself can be very good at choosing the right camera orientation, but without incorporating a smooth model that imitates human actions over a whole sequence of movements, the camera will course-correct step by step, resulting in a jerky and unpleasant viewing experience.
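To make the smoothing idea concrete, here is a minimal Python sketch (not the authors' actual model) of blending raw per-frame camera predictions into a smooth trajectory; the `alpha` parameter and the `predicted_pan` input are illustrative assumptions:

```python
import numpy as np

def smooth_camera_path(predicted_pan: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Exponentially smooth per-frame pan-angle predictions.

    `predicted_pan` holds the raw per-frame camera angles suggested by a
    learned model; a small `alpha` favors smoothness over responsiveness.
    """
    smoothed = np.empty_like(predicted_pan, dtype=float)
    smoothed[0] = predicted_pan[0]
    for t in range(1, len(predicted_pan)):
        # Blend the previous smoothed angle with the new raw prediction,
        # so the camera never course-corrects by a large jump in one step.
        smoothed[t] = (1 - alpha) * smoothed[t - 1] + alpha * predicted_pan[t]
    return smoothed
```

A lone low-pass filter like this lags fast action, which is exactly the accuracy-versus-smoothness trade-off the project had to balance.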

Yisong's research lies primarily in the theory and application of statistical machine learning, and he is particularly interested in developing novel methods for spatiotemporal reasoning, structured prediction, interactive learning systems, and learning with humans in the loop.

Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning

To better understand the complexities of natural ecosystems and better manage and protect them, it would be helpful to have detailed, large-scale knowledge about the number, location, and behaviors of animals in natural ecosystems (2).

While they can take millions of images (6–8), extracting knowledge from these camera-trap images is traditionally done by humans (i.e., experts or a community of volunteers) and is so time-consuming and costly that much of the valuable knowledge in these big data repositories remains untapped.

In this work, we focus on harnessing computer vision to automatically extract the species, number, presence of young, and behavior (e.g., moving, resting, or eating) of animals, which are statistics that wildlife ecologists have previously decided are informative for ecological studies based on SS data (9–12).

Automatic animal identification and counting could improve all biology missions that require identifying species and counting individuals, including animal monitoring and management, examining biodiversity, and population estimation (3).

Instead, we investigate the efficacy of deep learning to enable many such future studies by offering a far less expensive way to provide the data from large-scale camera-trap projects that has previously led to many informative ecological studies (9–12).

Here, we combine the millions of labeled data from the SS project, modern supercomputing, and state-of-the-art deep neural network (DNN) architectures to test how well deep learning can automate information extraction from camera-trap images.
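The excerpt does not name the specific DNN architectures used, so as a hedged illustration of the general approach (fine-tuning a pretrained network on labeled camera-trap crops), here is a sketch using torchvision's ResNet-18; the class count and training step are placeholders, not the SS project's actual setup:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SPECIES = 48  # placeholder class count, not the SS project's value

# Start from ImageNet weights and replace the classification head.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_SPECIES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a mini-batch of labeled camera-trap images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Each mini-batch of labeled images would pass through `train_step` until held-out accuracy plateaus.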

What is Human-in-the-Loop for Machine Learning?

Despite huge advances in the development and accuracy of machine-driven systems, they still tend to fall short of the desired accuracy rates.

The intention is to use a trained crowd or the general human population to correct inaccuracies in machine predictions, thereby increasing accuracy and yielding higher-quality results.
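A minimal sketch of that correction loop, with an illustrative confidence threshold and a hypothetical `ask_human_reviewer` queue standing in for the crowd:

```python
def route_prediction(prob: float, label: str, threshold: float = 0.9) -> str:
    """Accept confident machine predictions; queue the rest for humans.

    `prob` is the model's confidence in `label`; the 0.9 threshold is an
    illustrative choice, not a value from the article.
    """
    if prob >= threshold:
        return label                  # trust the machine
    return ask_human_reviewer(label)  # hypothetical human-review queue

def ask_human_reviewer(suggested_label: str) -> str:
    # Placeholder: a real system would enqueue the item for a trained
    # crowd and return the corrected label once a reviewer responds.
    raise NotImplementedError
```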

Research suggests that a variant of Pareto's 80:20 rule holds for the most accurate machine learning systems to date: roughly 80% AI-driven, 19% human input, and 1% randomness.

Supervised ML: ML experts use curated (labeled) data sets to train algorithms, adjusting parameters so that the model makes accurate predictions for incoming data.
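As a minimal illustration of that supervised workflow (using scikit-learn's bundled iris data rather than any data set from the article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A curated, labeled data set is split; the model's parameters are fit
# on the training portion; predictions are scored on unseen data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```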

Computational Intelligence and Neuroscience

Human detection in videos plays an important role in various real-life applications.

Compared with handcrafted features, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically, without the need for expert knowledge.

In this paper, we utilize automatic feature learning methods which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes.

Human detection in videos (i.e., series of images) plays an important role in various real-life applications (e.g., visual surveillance and automated driver assistance).

Deep learning methods are able to learn highly abstract features automatically, without human intervention, directly from raw pixels and without designing specific features.

On top of that, they are robust against dynamic events in different scenarios. The convolutional neural network (CNN), one of the supervised feature learning methods, extracts spatial structure by using convolutions that provide local representations, pooling that is shift-invariant, and normalization that adapts to illumination change.

Hierarchical extreme learning machine (H-ELM), which is one of the unsupervised feature learning methods, utilizes sparse autoencoders to provide more robust features that adapt with data variations without preprocessing.

Several papers have already utilized handcrafted features for human detection and have demonstrated that these features are useful and successful for specific tasks.

The AlexNet architecture includes five convolutional layers, max pooling layers, three fully connected layers, and a 1000-class soft-max layer.

In this system, automatic feature learning via fast deep network cascades was used to perform human detection.

A framework that is based on optical flow and graph representation was employed to extract the moving areas from the frames of moving cameras in the Predator Unmanned Airborne Vehicle (UAV) [16].

For mobile robot navigation, optical flow has been used to detect humans in real time while the robot is moving [19].

The objective of this paper is to study and compare different deep learning methods to detect humans in a challenging scenario that includes a camera attached to a moving airborne object.

A comparison of three different deep models (supervised CNN, pretrained CNN, and HELM) is presented for feature learning and model building on the UCF-ARG aerial dataset.

An optical flow model is added as a first stage in the three systems to get the training and testing samples as inputs to deep models.

The novelty of our work is as follows:

(i) To the best of our knowledge, this work is the first to utilize different deep models on the public UCF-ARG aerial dataset for human detection.

(ii) Supervised CNN is demonstrated to find optimal features that discriminate between the two classes, human and nonhuman. Soft-max and SVM are used in the last layer of the CNN to produce the classification output.

(iii) The pretrained AlexNet CNN model, already trained on the ImageNet dataset for visual object recognition over 1000 classes, is demonstrated as a feature extractor with fixed parameters (after removing the fully connected layers) to find discriminative features for human/nonhuman classification.

(iv) HELM is also discussed, taking into consideration the trade-off between high accuracy and low training time.

(v) The comparison between CNN as a supervised feature learner, pretrained CNN as a feature extractor, and HELM as an unsupervised feature learner is evaluated in terms of learning speed and accuracy for five human actions (digging, waving, throwing, walking, and running).

The organization of the paper is as follows: In Section 2, the three proposed systems that consist of the optical flow model and three deep models are described.

These representations are then classified into binary classes (human and nonhuman) using soft-max or a support vector machine (SVM) for the supervised CNN and pretrained CNN, and an extreme learning machine (ELM) for HELM.

A brief review about each module used in the proposed detection system (optical flow, supervised CNN, pretrained CNN, ELM, and HELM) is summarized in the following subsections.

The quality of optical flow for background stabilization is important, as it is the first stage before feature learning is performed via the deep models, whose features act as input to the classifiers.

To find the optical flow between two frames, the optical flow constraint equation is used:

$$I_x u + I_y v + I_t = 0,$$

where $I_x$, $I_y$, and $I_t$ are the spatiotemporal brightness derivatives of a frame, $v$ is the vertical component of the optical flow, and $u$ is the horizontal component.

When optical flow is applied over the whole frame, the Horn-Schunck approach [23] finds, for each pixel in the frame, the velocity field estimate $[u, v]^T$ that minimizes the following equation:

$$E = \iint \left[ \left( I_x u + I_y v + I_t \right)^2 + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx\, dy,$$

where $\alpha$ is a scaling factor for the optical flow computation and $\nabla u$ and $\nabla v$ are the spatial derivatives of the optical velocity. The minimization proceeds iteratively:

$$u^{k+1}_{x,y} = \bar{u}^{k}_{x,y} - \frac{I_x \left( I_x \bar{u}^{k}_{x,y} + I_y \bar{v}^{k}_{x,y} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}, \qquad v^{k+1}_{x,y} = \bar{v}^{k}_{x,y} - \frac{I_y \left( I_x \bar{u}^{k}_{x,y} + I_y \bar{v}^{k}_{x,y} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}.$$

In these equations, $u^{k}_{x,y}$ is the velocity estimate at pixel $(x, y)$ at iteration $k$, and $\bar{u}^{k}_{x,y}$ is the neighborhood average of $u^{k}_{x,y}$.
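A compact NumPy rendering of the Horn-Schunck iteration above; the derivative kernels and iteration count are standard textbook choices rather than values from the paper:

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(frame1, frame2, alpha=1.0, n_iters=100):
    """Minimal Horn-Schunck optical flow following the update rule above.

    `frame1` and `frame2` are float grayscale images; `alpha` is the
    smoothness scaling factor (larger values give smoother flow fields).
    """
    # Spatiotemporal brightness derivatives via simple finite differences.
    kx = 0.25 * np.array([[-1, 1], [-1, 1]])
    ky = 0.25 * np.array([[-1, -1], [1, 1]])
    Ix = convolve(frame1, kx) + convolve(frame2, kx)
    Iy = convolve(frame1, ky) + convolve(frame2, ky)
    It = convolve(frame2, np.full((2, 2), 0.25)) - convolve(frame1, np.full((2, 2), 0.25))

    # Kernel computing the neighborhood average of the flow field.
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6],
                    [1/12, 1/6, 1/12]])

    u = np.zeros_like(frame1, dtype=float)
    v = np.zeros_like(frame1, dtype=float)
    for _ in range(n_iters):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        # Shared term from the optical flow constraint equation.
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha**2 + Ix**2 + Iy**2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v
```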

After applying optical flow to stabilize the scenes, humans are detected utilizing a deep model approach for human and nonhuman classification purposes.

Thus, in this work, the CNN functions well in learning important and discriminative features from a series of images (i.e., videos) containing humans.

CNN architectures often differ from one another in the number of feature maps in each convolutional layer, the dimensions of the convolution filters, the specific connections between layers, and the activation functions.

The network parameters are updated by stochastic gradient descent:

$$\theta_{\ell+1} = \theta_{\ell} - \alpha \nabla E(\theta_{\ell}),$$

where $\ell$ is the number of iterations, $\alpha$ is the learning rate, $\theta$ is the vector of parameters, $E(\theta)$ is the loss function, and $\nabla E(\theta)$ is the gradient.

The loss function with regularization is as follows:

$$E_R(\theta) = E(\theta) + \lambda \Omega(w),$$

where $w$ is the weight vector, $\lambda$ is the regularization coefficient, and $\Omega(w)$ is the regularization function.

The error function is the cross-entropy function for 1-of-$k$ mutually exclusive classes, as shown in the following equation:

$$E(\theta) = -\sum_{i=1}^{n} \sum_{j=1}^{k} t_{ij} \ln y_j(x_i, \theta),$$

where $\theta$ is a vector of parameters, $t_{ij}$ indicates that the $i$th sample is linked to the $j$th class, and $y_j(x_i, \theta)$ is the $j$th output for the $i$th sample and can be interpreted as a probability.

The activation function of the output is the soft-max function:

$$y_j(x, \theta) = \frac{\exp(a_j(x, \theta))}{\sum_{r=1}^{k} \exp(a_r(x, \theta))},$$

where $0 \le y_j \le 1$ and $\sum_{j=1}^{k} y_j = 1$. In this work, a fixed learning rate of 0.01 is used.
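For concreteness, the soft-max and cross-entropy computations above amount to a few lines of NumPy (the two-class activations below are made-up numbers):

```python
import numpy as np

def softmax(a: np.ndarray) -> np.ndarray:
    """Soft-max over class activations; the outputs sum to 1."""
    e = np.exp(a - a.max())  # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(y: np.ndarray, t: np.ndarray) -> float:
    """1-of-k cross-entropy; `t` is a one-hot target vector."""
    return -float(np.sum(t * np.log(y)))

y = softmax(np.array([2.0, 0.5]))  # two classes: human / nonhuman
t = np.array([1.0, 0.0])           # ground truth: human
print(y, cross_entropy(y, t))
```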

The biases and weights of the hidden layers are set randomly, but the output weights are calculated analytically. The network output is

$$f(x) = \sum_{i=1}^{L} \beta_i\, g(w_i \cdot x + b_i),$$

where $g(w_i \cdot x + b_i)$ is the activation function of the $i$th hidden node, $w_i$ is an input weight, $b_i$ is a bias, and $\beta_i$ is the weight applied to the output.

$L$ neurons are used in the hidden layer. The output weights are obtained as

$$\beta = H^{\dagger} T,$$

where $H$ is the output of the hidden layer, $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix $H$, and $T$ is the target; with regularization, $\beta = H^{T} \left( I/C + H H^{T} \right)^{-1} T$, where $C$ is a regularization coefficient.
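A minimal single-hidden-layer ELM along these lines; the hidden-layer size, activation, and regularization coefficient are illustrative, since the excerpt does not give the paper's values:

```python
import numpy as np

def train_elm(X, T, n_hidden=1000, C=1e3, seed=0):
    """Basic ELM: random hidden layer, analytic regularized output weights.

    X: (n_samples, n_features) inputs; T: (n_samples, n_classes) targets.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random biases
    H = np.tanh(X @ W + b)                           # hidden-layer output
    # Regularized least squares: beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

Because `beta` comes from one linear solve rather than gradient descent, training is far faster than iterative fine-tuning, which is the trade-off the paper highlights.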

This aerial dataset is considered one of the most challenging datasets because the image samples vary in activity, position, orientation, viewpoint, clothing color, and scale.

Ten actions are performed four times by each person: boxing, carrying, clapping, digging, jogging, open-close trunk, running, throwing, walking, and waving.

This is sufficient as the main interest is to evaluate the performance of the proposed approach by using various deep models to classify aerial datasets which are highly challenging.

For this work, 48 videos per activity × 5 activities = 240 videos are used as training and testing data.

Because the patches are not of equal size, a result of the varying altitudes of the airborne platform, they are resized to a fixed input size for the pretrained CNN and to a different fixed size for S-CNN and HELM before being processed by the deep models to extract the discriminative features.

To further clarify this issue, in each frame that is processed, only one patch is extracted for humans (via optical flow) and multiple patches are extracted for the others (i.e., nonhumans).
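A sketch of that sampling scheme, with an assumed patch size, negative count, and a hypothetical `motion_box` produced by the optical flow stage:

```python
import numpy as np

def sample_patches(frame, motion_box, patch=64, n_neg=4, seed=0):
    """One positive patch from the motion box, several non-overlapping negatives.

    `motion_box` = (top, left, height, width) of the detected moving region;
    the patch size and negative count are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    t, l, h, w = motion_box
    pos = frame[t:t + h, l:l + w]  # the human (motion) patch
    H, W = frame.shape[:2]
    negs = []
    while len(negs) < n_neg:
        y = int(rng.integers(0, H - patch))
        x = int(rng.integers(0, W - patch))
        # Keep only crops that do not overlap the moving (human) region.
        if x + patch <= l or x >= l + w or y + patch <= t or y >= t + h:
            negs.append(frame[y:y + patch, x:x + patch])
    return pos, negs
```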

The architecture consists of 17 layers: an input layer, three convolutional layers, three max pooling layers, six rectified linear unit layers, two fully connected layers, a soft-max layer, and a classification layer.

Figure 6 shows the snapshots of feature maps after the first convolution layer C1, the second convolution layer C2, and the third convolution layer C3.

The learned features are extracted from the eighth layer “fc8” which is connected directly to a soft-max layer or SVM classifier to produce two classes: human and nonhuman.
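A hedged sketch of this feature-extractor pipeline using torchvision's AlexNet (whose final layer corresponds to fc8) and a linear SVM; the preprocessing constants are torchvision's standard ImageNet statistics, and the training-data names are placeholders:

```python
import torch
from torchvision import models, transforms
from sklearn.svm import LinearSVC

# AlexNet trained on ImageNet is frozen and used only to produce features.
alexnet = models.alexnet(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    """Stack preprocessed patches and return the 1000-d fc8 activations."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return alexnet(batch).numpy()

# train_patches / train_labels are placeholders for optical-flow patches
# and their human (1) / nonhuman (0) annotations:
# svm = LinearSVC().fit(extract_features(train_patches), train_labels)
```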

The supervised CNN layers are as follows:

(1) Image input layer (grey image).
(2) Convolution layer: 20 feature maps.
(3) ReLU layer.
(4) Max pooling layer: 2 × 2 pooling regions, returning the maximum of the four values.
(5) ReLU layer.
(6) Convolution layer: 20 feature maps.
(7) ReLU layer.
(8) Max pooling layer: 2 × 2 pooling regions, returning the maximum of the four values.
(9) ReLU layer.
(10) Convolution layer: 20 feature maps.
(11) ReLU layer.
(12) Max pooling layer: 2 × 2 pooling regions, returning the maximum of the four values.
(13) Fully connected layer with 1000 nodes.
(14) ReLU layer.
(15) Fully connected layer with 2 classes.
(16) Soft-max layer.
(17) Classification layer.
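Rendered as a hedged PyTorch sketch, with the layer numbers from the list above in comments; the input resolution (64 × 64 grey) and 5 × 5 filters are assumptions, since the excerpt omits those values:

```python
import torch.nn as nn

s_cnn = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2),   # (2) 20 feature maps
    nn.ReLU(),                                    # (3)
    nn.MaxPool2d(2),                              # (4) 2x2 max pooling
    nn.ReLU(),                                    # (5)
    nn.Conv2d(20, 20, kernel_size=5, padding=2),  # (6)
    nn.ReLU(),                                    # (7)
    nn.MaxPool2d(2),                              # (8)
    nn.ReLU(),                                    # (9)
    nn.Conv2d(20, 20, kernel_size=5, padding=2),  # (10)
    nn.ReLU(),                                    # (11)
    nn.MaxPool2d(2),                              # (12) 64 -> 32 -> 16 -> 8
    nn.Flatten(),
    nn.Linear(20 * 8 * 8, 1000),                  # (13) 1000 nodes
    nn.ReLU(),                                    # (14)
    nn.Linear(1000, 2),                           # (15) human / nonhuman
    nn.Softmax(dim=1),                            # (16) soft-max
)
# (17) The classification layer corresponds to taking the argmax of the
# soft-max output; during training one would drop Softmax and apply
# nn.CrossEntropyLoss to the raw logits instead.
```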

The networks include five convolutional layers, ReLU layers, max pooling layers, three fully connected layers, a soft-max layer, and a classification layer.

Stochastic gradient descent was used to train the model with a mini batch size of 32 to ensure that the CNN and image data fit into GPU memory.

This speed results from utilizing the hierarchical extreme learning machine, a fast deep model that does not require iterative fine-tuning of weights.

The challenging part of this dataset is the size of the human patches, which varies according to the altitude of the moving airborne platform and the multiple viewpoints of humans within the same video.

The results of this work can be summarized as follows:

(1) The quality of the stabilization method (optical flow) is important in our proposed systems as a first stage, before the deep models are applied for human/nonhuman classification.

(2) The proposed systems solve the trade-off problem between high accuracy and speed of detection.

The advantages of the proposed systems are as follows:

(i) The proposed system detects humans automatically and does not need a manually chosen detection threshold to select the one with the highest true positive rate.

(ii) Both deep models generalize well enough to detect humans accurately in all 80 videos that are not in the training data.

(iii) The proposed system achieves real-time performance when testing live-captured videos, because the optical flow model only utilizes two successive frames to find motion. Moreover, the deep models only require a single frame to classify the optical flow patches as human or nonhuman (i.e., human detection).

(iv) The proposed deep models are robust against various activities, positions, orientations, viewpoints, cloth colors, scales, and altitudes.

As future work, we will integrate tracking that uses the initially extracted regions around humans as positive training samples and other regions as negative samples.

Since the results are highly accurate and efficient, we will utilize the human detection results demonstrated in this paper for human action recognition, mapping each activity to a specific action class.

How computers learn to recognize objects instantly | Joseph Redmon

Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision ...

Robots Are Teaching Themselves With Simulations, What’s Next?

This robotic hand practiced rotating a block for 100 years inside a 50 hour simulation! Is this the next revolutionary step for neural networks? A.I. Is Monitoring ...

Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization

CVPR 2015 Paper Video Project Page: Recent advances in camera relocalization use predictions ...

The Evolution of Convolution Neural Networks

From the one that started it all "LeNet" (1998) to the deeper networks we see today like Xception (2017), here are some important CNN architectures you should ...

Computer Vision: Crash Course Computer Science #35

Today we're going to talk about how computers see. We've long known that our digital cameras and smartphones can take incredibly detailed images, but taking ...

Blue skies to ground truth: Machine learning for Kinect human motion capture

Kinect for XBox 360 is not just a new way of controlling computer games, it represents a fundamental change in the way humans can interact with machines.

Robust Solving of Optical Motion Capture Data by Denoising

Raw optical motion capture data often includes errors such as occluded markers, mislabeled markers, and high frequency noise or jitter. Typically these errors ...

Two-Stream RNN/CNN for action recognition in 3D videos

We combine GRU-RNNs with CNNs for robust action recognition based on 3D voxel and tracking data of human movement. Our system reaches a classification ...

Deep Learning Algorithm to Achieve High Accuracy, Advantech(EN)

Advantech's IVA Inference Systems familiarize themselves with information from a variety of video sources and formats, and Intelligent Video Analysis (IVA) ...

Deep Compression, DSD Training and EIE

Deep Compression, DSD Training and EIE: Deep Neural Network Model Compression, Regularization and Hardware Acceleration Neural networks are both ...