AI News, Pitcher Prognosis: Using Machine Learning to Predict Baseball Injuries

Pitcher Prognosis: Using Machine Learning to Predict Baseball Injuries

In the multibillion dollar world of sports entertainment, we often think of injuries as being chance events.

Although professional players are placed under a high level of medical scrutiny, I reasoned that the information encoded in performance statistics might add a useful leading indicator of injury risk to the medical toolbox.

Then, I would aggregate the player’s statistics from preceding games and use those as features. The idea is that a coach, a medical support staff member, or even a player could enter the player’s accumulated statistics on a given day (the “intervention point”) into my model and see how likely it is that playing on that day would precede an injury.

In my case, the well-structured nature of baseball and prior familiarity with the dataset had assured me that my data were relatively clean, so the most urgent question confronting me was whether game statistics in fact contained any predictive information at all in relation to injuries.

Although in many careers the early forties are a highly productive time, the extreme physical demands of baseball mean that few players can continue to perform at the professional level that long.

The light blue bars show the distribution of ages in games that did not precede an injury event, the red bars show the distribution in games that did precede an injury event, and the dark blue fractional bars mark where the light blue and the red overlap.

Note that the bins are not integer values of innings: since innings pitched is counted by the number of outs recorded when a pitcher leaves the game, there are twenty-eight possible values for innings pitched in a standard game.
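The count of twenty-eight follows from simple arithmetic: a nine-inning game has 27 outs, and a pitcher can leave after recording anywhere from 0 to 27 of them. A minimal sketch, assuming the conventional box-score notation in which the digit after the decimal point is a number of outs rather than a decimal fraction (the function name is hypothetical):

```python
def innings_notation(outs):
    """Convert outs recorded into box-score notation, e.g. 19 outs -> '6.1'."""
    whole_innings, extra_outs = divmod(outs, 3)
    return f"{whole_innings}.{extra_outs}"


# A standard game has 9 innings x 3 outs = 27 outs; 0 outs is also possible.
possible_values = [innings_notation(outs) for outs in range(27 + 1)]
print(len(possible_values))                     # number of distinct values
print(possible_values[:4], possible_values[-1])
```

Because "6.1" means six and one-third innings, histogram bins for this feature fall on thirds rather than on whole numbers.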

Feature Engineering

To hone the predictive power of my features, I first generated new features by applying different aggregation windows: for each player, I created separate features for each performance metric for the one game preceding the intervention point, for the average of the seven games preceding the intervention point, and for the player’s entire career.
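The three aggregation windows can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's pipeline: the function name, the dictionary layout, and the pitch counts are hypothetical, and the career window is taken to be a plain average over all prior games.

```python
from statistics import mean


def window_features(games, metric, n_recent=7):
    """Build last-game, recent-average, and career-average features.

    `games` is a chronologically ordered list of dicts, one per game the
    player appeared in before the intervention point.
    """
    values = [game[metric] for game in games]
    if not values:
        raise ValueError("need at least one preceding game")
    return {
        f"{metric}_last_game": values[-1],
        # Uses fewer games if the player has not yet appeared in n_recent.
        f"{metric}_last_{n_recent}_avg": mean(values[-n_recent:]),
        f"{metric}_career_avg": mean(values),
    }


# Hypothetical pitch counts for one pitcher's eight prior games.
history = [{"pitches": p} for p in [90, 100, 95, 105, 110, 85, 100, 115]]
features = window_features(history, "pitches")
print(features)
```

Each performance metric yields three columns this way, so the feature count grows to three times the number of raw metrics.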

For a relatively casual baseball fan like myself, it is difficult to draw consistent, distinct categories of pitching style from expert commentary or from the statistical data that I had already collected.

I projected the term frequency vectors I had created, which had a dimensionality on the order of the total number of terms present, onto a two-dimensional space using multidimensional scaling, which is meant to preserve the approximate relation of each of the pitcher descriptions to all of the others.

In the way that I set up the term frequency vectors, a single word can occur more than once because I accounted for the frequency of bigrams, or pairs of words occurring together, and trigrams as well as single words.
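To make the point about repeated words concrete, here is a minimal sketch of counting single words, bigrams, and trigrams together. It assumes whitespace tokenization and a made-up pitcher description; the original tokenizer and vocabulary are not given in the text, and the multidimensional scaling step is not shown.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count every run of 1..max_n consecutive words in `text`."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


# Hypothetical scouting description.
description = "hard sinking fastball and a hard slider"
term_frequencies = ngram_counts(description)

# Every term in which the word "hard" takes part.
terms_with_hard = sorted(t for t in term_frequencies if "hard" in t.split())
print(terms_with_hard)
```

Here the word "hard" contributes to seven different terms (itself, four bigrams, and some of the trigrams), which is why a single word can appear more than once among the terms describing a pitcher.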

I optimized the random forest hyperparameters to maximize the area under an ROC curve, which has two characteristics that make it better than accuracy score for this sort of situation: 1) the value of this metric is still meaningful with greatly imbalanced datasets (and there are many more games preceding noninjuries in baseball than games preceding injuries), and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa.

The hyperparameters I focused on were the number of features each decision tree could choose from at each step in its creation and the maximum depth of those trees, or the total number of features that could be used in the classification of a single point.

Although I saw little increase in performance beyond 300 trees, I settled on 1,000 because compute time was not limiting and having redundancy within the forest would not be expected to harm model performance.

The performance metric I chose to maximize with my grid search was area under the ROC curve, which has two characteristics that make it better than the standard accuracy score for this sort of situation: 1) the value of this metric is still meaningful with greatly imbalanced datasets - and there are many more games preceding noninjuries in baseball than games preceding injuries - and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa.
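The first of those two characteristics is easy to demonstrate on toy data. The sketch below computes ROC AUC directly from its rank interpretation (the chance that a randomly chosen injury game scores higher than a randomly chosen non-injury game, with ties counted as half) and compares it with accuracy. The labels and scores are invented for illustration, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of games classified correctly at a fixed score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# 2 games preceding an injury, 98 games that did not.
labels = [1, 1] + [0] * 98
always_safe = [0.0] * 100                    # never flags anyone
informative = [0.9, 0.4] + [0.5] + [0.1] * 97  # ranks injuries near the top

for name, scores in [("always_safe", always_safe), ("informative", informative)]:
    print(name, accuracy(labels, scores), round(roc_auc(labels, scores), 4))
```

Both score sets reach the same accuracy at a 0.5 threshold, yet only the second ranks injury games above almost all others, and only AUC reflects that. Because AUC summarizes every possible threshold, it also avoids committing in advance to a particular trade-off between false positives and false negatives.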

The “injury score” output by the random forest model is notionally the probability that a particular set of feature values indicates that an injury will occur, or more precisely the average of this probability across all of the decision trees in the forest, although depending on how one deals with the class imbalance in the injury prediction problem, this interpretation is not necessarily correct.

To avoid forcing baseball players and coaches to deal with the intricacies of random forest output, the web application I designed compares the injury score for a given player’s input to all of the scores in the database used for the modeling and outputs the player’s injury score percentile, which should be readily understandable to many people.
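Converting a raw score into a percentile is a small lookup against the sorted reference scores. The sketch below is one plausible way to do it, assuming the percentile is defined as the share of reference scores at or below the player's score; the function name and the reference values are hypothetical.

```python
from bisect import bisect_right


def score_percentile(score, reference_scores):
    """Percentage of reference scores that are less than or equal to `score`."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical injury scores for the games used in modeling.
reference = [0.02, 0.05, 0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
percentile = score_percentile(0.30, reference)
print(percentile)
```

A percentile sidesteps the calibration question entirely: it only claims that the player's score is higher than a given share of previously seen games, not that the score is a true probability.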

Some users may distrust what seems like a data science black box, and to provide more persuasive analysis or explanation, I also use nearest neighbors analysis to identify games similar to the user’s entered values.
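A nearest-neighbors lookup of this kind can be sketched with plain Euclidean distance. The example assumes that features have already been scaled to comparable ranges, and the game records, field names, and choice of distance are hypothetical rather than taken from the original application.

```python
import math


def nearest_games(query, games, k=3):
    """Return the k games whose feature vectors lie closest to `query`."""
    return sorted(games, key=lambda g: math.dist(query, g["features"]))[:k]


# Hypothetical, pre-scaled feature vectors for four historical games.
games = [
    {"id": "A", "features": (0.10, 0.20), "preceded_injury": False},
    {"id": "B", "features": (0.90, 0.80), "preceded_injury": True},
    {"id": "C", "features": (0.85, 0.75), "preceded_injury": True},
    {"id": "D", "features": (0.50, 0.50), "preceded_injury": False},
]

neighbors = nearest_games((0.80, 0.80), games, k=2)
print([(g["id"], g["preceded_injury"]) for g in neighbors])
```

Showing the user concrete past games that resemble their input, along with whether each one preceded an injury, gives them something tangible to check the model's score against.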

The Mystery Sabermetrics Still Can’t Solve

That was the sentiment coming from the Miami Marlins in the wake of this week’s devastating news that Jose Fernandez, the team’s brilliant 21-year-old ace, would likely be out for the remainder of the year with a torn elbow ligament requiring Tommy John surgery.

Bleacher Report injury expert Will Carroll reported last summer that a staggering third of all current major league pitchers have undergone Tommy John surgery at some point in their careers, and the procedure has been performed more than ever before in recent years. And the spate of pitching injuries this year is causing the sports-media equivalent of a moral panic.

The most common month of the season for the surgery, June, is yet to come. As Bill James, the godfather of sabermetrics, has repeatedly noted, starting pitcher injuries haven’t really increased in recent seasons, despite our perceptions to the contrary. But they aren’t decreasing, either.

Throwing while tired is dangerous to a pitcher’s arm.” To quantify this effect, Jazayerli and Woolner set up a scale to separate ordinary starts from high pitch-count outings that put tremendous strain on the arm, with a stress factor that compounds as more pitches are thrown.
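The text above does not give the formula, so the following is an assumption: the scale is commonly described as Pitcher Abuse Points in its cubed form, where a start earns the cube of the number of pitches thrown beyond 100 and nothing at or below that mark. A minimal sketch under that assumption (the function name and default threshold are illustrative):

```python
def pitcher_abuse_points(pitch_count, threshold=100):
    """Cube of the pitches thrown beyond `threshold`; zero for ordinary starts.

    Assumes the cubed variant of the metric; the source text does not
    state the exact formula.
    """
    excess = max(0, pitch_count - threshold)
    return excess ** 3


for pitches in (95, 110, 120, 130):
    print(pitches, pitcher_abuse_points(pitches))
```

Under this definition a 130-pitch start scores 27 times as many points as a 110-pitch start, which is what "a stress factor that compounds" looks like in numbers.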

In “Extra Innings: More Baseball Between the Numbers”, Corey Dawkins reported a failure to find any significant correlation between PAP and either short- or long-term future pitching injuries, while BP’s Russell Carleton could only find compelling support for the notion that eliminating 130-pitch starts reduces the probability of future injuries.

The newer, more data-rich models tended to settle on two factors as important predictors of a future injury: sudden, unexplained declines in a pitcher’s fastball velocity and increased variation in his release point.

(As the hypothesis goes, tired pitchers, like those who have accumulated more abuse points, find it harder to maintain consistency in their release points.) But there’s hardly a consensus among analysts that PITCHf/x is providing useful data. Again writing in “Extra Innings,” Dawkins expresses a major concern about any findings that rely on PITCHf/x’s release-point data because there’s a large degree of measurement error inherent to the system.

But there’s also the possibility that survivorship bias is at work. Pitchers who have bad mechanics (or any other flaw that would put them at greater risk of seeing their careers end early via injury) are automatically weeded out of baseball at a young age, leaving behind only the group of pitchers who made it through that initial checkpoint.

Keep pitching without incident, though, and Fernandez’s odds of future ailments would be significantly reduced, simply because the biggest test — whether he can handle a major league workload — would already have been passed.

Softball pitchers could face arm injury due to overuse

For aspiring baseball pitchers, the throwing arm is treated as a fragile package: an object that must be cared for and tended to often, and that can be damaged at any moment.

Precautions such as pitch counts and days' rest are used by coaches to keep arms healthy, but despite all of the attempts to protect pitchers, the number of Tommy John surgeries — the usual end result when a pitcher's arm is injured from overuse — continues to soar.

There are no established guidelines on how many pitches are too many, and no regulations are being considered, though some researchers suggest that the number of girls being injured is significantly underestimated.

A survey conducted in 2012-13 among active major league baseball players revealed that 25 percent of pitchers underwent the famed elbow-ligament replacement surgery at some point in their careers.

Clarkstown South senior Briana Keaveney suffered a rotator cuff injury years ago due to overuse of her shoulder between softball and swimming, but pitched every game for the Vikings last season.

Pitching in an underhand motion may be more natural than an overhand delivery, but the SUNY Cortland-committed hurler is still conscious that it can't be safe to be pitching game after game.

Numbers don't lie, but they can embellish

There are significantly more boys playing baseball (including in professional leagues) than there are girls playing softball.

While there is no denying that the overwhelming majority of pitching-related injuries occur in baseball players, some medical experts claim the numbers can be slightly misleading.

Nicholas said the ratio of baseball injuries to softball injuries he sees is about 5:1, but that's only because of the much larger number of baseball players, compared to softball players.

Boys might still have a small chance to play professional ball — overseas, minor league, for an independent team or major league baseball — and so may consider surgery no more than a bump in the road.

'Girls are more likely to stop playing if they needed Tommy John surgery, versus young boys, who will have the surgery and go back to play the game,' said Nicholas, who estimated he operates on well over 80 baseball players a year.

'There's never been much discussion on it at all'

Beginning at the Little League level, baseball pitchers must rest a certain number of days between starts depending on their pitch count from the previous game.

Last May, Rye sophomore George Kirby threw 153 pitches during a game on only three days' rest, helping to bring home his team's first Section 1 championship in three decades.

A pitcher usually throws about 100 pitches per game and Kirby, fans said, could have hurt himself during the game, or put an unneeded strain on his arm resulting in an injury down the road.

By contrast, when North Rockland junior Kayla McDermott and Suffern junior Allie Wood faced off this April, the two pitchers threw a combined total of 359 pitches during a single softball game.

On April 30, McDermott and Wood battled in an epic pitcher's duel that lasted 11 innings, during which the Manhattan-committed McDermott threw a career-high 176 pitches and struck out 18 batters across all 11 innings.

'I usually pitch five days a week, sometimes six, so my arm has been fine with pitching multiple games in a row,' McDermott said after the game.

Both girls admitted that they're usually sorer from outfield practice — when they're throwing overhand — than a day in the circle, but they had divergent views on the possibility of injuring themselves down the line due to overuse of their arms.
