AI News, Pitcher Prognosis: Using Machine Learning to Predict Baseball Injuries

Pitcher Prognosis: Using Machine Learning to Predict Baseball Injuries

In the multibillion dollar world of sports entertainment, we often think of injuries as being chance events.

Although professional players are placed under a high level of medical scrutiny, I reasoned that the information encoded in performance statistics might add a useful leading indicator of injury risk to the medical toolbox.

Then, I would aggregate the player’s statistics from preceding games and use those as features. The idea is that a coach, a member of the medical support staff, or even a player could enter the player’s accumulated statistics on a given day (the “intervention point”) into my model and see how likely it is that playing on that day would precede an injury.

In my case, the well-structured nature of baseball and prior familiarity with the dataset had assured me that my data were relatively clean, so the most urgent question confronting me was whether game statistics in fact contained any predictive information at all in relation to injuries.

Although the early forties are a highly productive time in many careers, the extreme physical demands of baseball mean that few players can continue to perform at the professional level that long.

The light blue bars are the distribution of ages in games that did not precede an injury event, the red bars are the distribution in games that did precede an injury event, and the dark blue fractional bars are the overlap of the light blue and the red.

Note that the bins are not integer values of innings: since innings pitched is counted by the number of outs recorded when a pitcher leaves the game, there are twenty-eight possible values for innings pitched in a standard game.
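The count of twenty-eight follows from simple arithmetic, sketched below under the assumption that a standard game means nine innings of three outs each. The names `OUTS_PER_INNING`, `INNINGS_PER_GAME`, and `innings_values` are illustrative and not taken from the original analysis.

```python
from fractions import Fraction

OUTS_PER_INNING = 3
INNINGS_PER_GAME = 9  # assumption: a standard game with no extra innings


def innings_values(innings=INNINGS_PER_GAME):
    """Return every innings-pitched total reachable in a game of this length."""
    total_outs = innings * OUTS_PER_INNING
    # A pitcher can leave after recording 0, 1, 2, ..., total_outs outs.
    return [Fraction(outs, OUTS_PER_INNING) for outs in range(total_outs + 1)]


values = innings_values()
print(len(values))                    # 28: zero outs through 27 outs
print([str(v) for v in values[:4]])   # ['0', '1/3', '2/3', '1']
```

Each value is a multiple of one third of an inning, which is why histogram bins for innings pitched do not fall on integers.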

Feature Engineering

To hone the predictive power of my features, first I generated new features by applying different aggregation windows: for each player, I created separate features for each performance metric for one game preceding the intervention point, for the average of seven games preceding the intervention point, and for the player’s entire career.
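The three aggregation windows can be sketched with pandas as follows. This is a minimal illustration rather than the original pipeline: the column names `player_id`, `game_date`, and `pitches`, and the helper `add_window_features`, are hypothetical. The one design point worth noting is the one-game shift, which keeps the game at the intervention point out of its own features.

```python
import pandas as pd


def add_window_features(games, metric, window=7):
    """Add last-game, rolling-window, and career averages of `metric`.

    Assumes hypothetical columns `player_id` and `game_date`. Every feature
    is built only from games played before the current row.
    """
    games = games.sort_values(["player_id", "game_date"]).copy()
    # Shift by one game per player so the current game never leaks in.
    prior = games.groupby("player_id")[metric].shift(1)
    by_player = prior.groupby(games["player_id"])
    games[f"{metric}_last1"] = prior
    games[f"{metric}_mean{window}"] = by_player.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    games[f"{metric}_career"] = by_player.transform(
        lambda s: s.expanding(min_periods=1).mean()
    )
    return games


# Toy data for one pitcher; a window of 2 keeps the example short.
games = pd.DataFrame({
    "player_id": ["a"] * 4,
    "game_date": pd.to_datetime(
        ["2014-04-01", "2014-04-06", "2014-04-11", "2014-04-16"]
    ),
    "pitches": [90, 100, 110, 95],
})
features = add_window_features(games, "pitches", window=2)
print(features)
```

A player's first game has no history, so its features are missing values; how to fill them is a separate modeling choice.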

For a relatively casual baseball fan like myself, it is difficult to draw consistent, distinct categories of pitching style from expert commentary or from the statistical data that I had already collected.

I projected the term frequency vectors I had created, which had a dimensionality on the order of the total number of terms present, onto a two-dimensional space using multidimensional scaling, which is meant to preserve the approximate relation of each of the pitcher descriptions to all of the others.

In the way that I set up the term frequency vectors, a single word can occur more than once because I accounted for the frequency of bigrams, or pairs of words occurring together, and trigrams as well as single words.

I optimized the random forest hyperparameters to maximize the area under an ROC curve, which has two characteristics that make it better than accuracy score for this sort of situation: 1) the value of this metric is still meaningful with greatly imbalanced datasets (and there are many more games preceding noninjuries in baseball than games preceding injuries), and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa.

The hyperparameters I focused on were the number of features each decision tree could choose from at each step in its creation and the maximum depth of those trees, or the total number of features that could be used in the classification of a single point.

Although I saw little increase in performance beyond 300 trees, I settled on 1,000 because compute time was not limiting and having redundancy within the forest would not be expected to harm model performance.

The performance metric I chose to maximize with my grid search was area under the ROC curve, which has two characteristics that make it better than the standard accuracy score for this sort of situation: 1) the value of this metric is still meaningful with greatly imbalanced datasets - and there are many more games preceding noninjuries in baseball than games preceding injuries - and 2) how a risk-predicting application may be used is not necessarily known before deployment: avoiding false positives may matter more than avoiding false negatives, or vice versa.
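A grid search of this shape might look like the following scikit-learn sketch. The synthetic imbalanced data, the grid values, and the use of only 50 trees (to keep the example fast, where the text settled on 1,000) are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the game data: roughly 5% positive ("injury") rows.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# The two hyperparameters discussed above: features considered at each
# split and the maximum depth of each tree.
param_grid = {"max_features": [2, 4, 8], "max_depth": [3, 6, None]}

search = GridSearchCV(
    forest,
    param_grid,
    scoring="roc_auc",  # area under the ROC curve instead of accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep some injury rows in every split, which matters when positives are this rare.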

The “injury score” output by the random forest model is notionally the probability that a particular set of feature values indicates an injury will occur, or more precisely the average of this probability across all of the decision trees in the forest. Depending on how one deals with the class imbalance in the injury prediction problem, however, this interpretation is not necessarily correct.

To avoid forcing baseball players and coaches to deal with the intricacies of random forest output, the web application I designed compares the injury score for a given player’s input to all of the scores in the database used for the modeling and outputs the player’s injury score percentile, which should be readily understandable to many people.
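The percentile conversion can be sketched in a few lines of standard-library Python. The reference scores below are invented, and `score_percentile` is a hypothetical helper; it uses the "less than or equal to" definition of percentile, which is one of several reasonable conventions.

```python
from bisect import bisect_right


def score_percentile(score, reference_scores):
    """Percentage of reference scores less than or equal to `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Invented injury scores standing in for the modeling database.
reference = [0.02, 0.05, 0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
print(score_percentile(0.35, reference))  # 60.0
```

Because the percentile depends only on the rank of the score, it stays interpretable even when the raw score is not a calibrated probability.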

Some users may distrust what seems like a data science black box, and to provide more persuasive analysis or explanation, I also use nearest neighbors analysis to identify games similar to the user’s entered values.
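One way to find similar games is a nearest neighbors lookup on standardized features, sketched here with scikit-learn. The three features (pitches, innings, age), the toy rows, and the choice of Euclidean distance after scaling are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical historical games: [pitches thrown, innings pitched, age].
games = np.array([
    [95.0, 6.0, 27.0],
    [110.0, 7.0, 31.0],
    [60.0, 3.0, 24.0],
    [102.0, 19 / 3, 29.0],
    [88.0, 5.0, 35.0],
])

# Standardize so no single feature dominates the distance calculation.
scaler = StandardScaler().fit(games)
index = NearestNeighbors(n_neighbors=2).fit(scaler.transform(games))

# Values a user might enter for the game under consideration.
query = np.array([[100.0, 6.0, 28.0]])
distances, neighbors = index.kneighbors(scaler.transform(query))
print(neighbors[0], distances[0])
```

The returned row indices point back at real games, whose known outcomes can then be shown alongside the model's score.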

The Mystery Sabermetrics Still Can’t Solve

That was the sentiment coming from the Miami Marlins in the wake of this week’s devastating news that Jose Fernandez, the team’s brilliant 21-year-old ace, would likely be out for the remainder of the year with a torn elbow ligament requiring Tommy John surgery.

Bleacher Report injury expert Will Carroll reported last summer that a staggering third of all current major league pitchers have undergone Tommy John surgery at some point in their careers, and the procedure has been performed more than ever before in recent years. And the spate of pitching injuries this year is causing the sports-media equivalent of a moral panic.

The most common month of the season for the surgery (June) is yet to come. As Bill James, the godfather of sabermetrics, has repeatedly noted, starting pitcher injuries haven’t really increased in recent seasons, despite our perceptions to the contrary. But they aren’t decreasing, either.

“Throwing while tired is dangerous to a pitcher’s arm.” To quantify this effect, Jazayerli and Woolner set up a scale to separate ordinary starts from high pitch-count outings that put tremendous strain on the arm, with a stress factor that compounds as more pitches are thrown.
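The text does not give the formula behind this scale. As an assumption, the sketch below uses the commonly cited cubic form of Pitcher Abuse Points, in which a start scores the cube of the number of pitches thrown beyond 100; the threshold and exponent are exposed as parameters because they are not confirmed by the text above.

```python
def pitcher_abuse_points(pitches, threshold=100, exponent=3):
    """Score one start; pitches at or below the threshold score zero.

    Assumption: the commonly cited cubic form, not confirmed by the source.
    """
    return max(0, pitches - threshold) ** exponent


for count in (95, 110, 130):
    print(count, pitcher_abuse_points(count))
# 95 0
# 110 1000
# 130 27000
```

Under this form, going from 110 to 130 pitches multiplies the score by 27, which is the sense in which the stress factor compounds.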

In “Extra Innings: More Baseball Between the Numbers,” Corey Dawkins reported a failure to find any significant correlation between Pitcher Abuse Points (PAP) and either short- or long-term future pitching injuries, while BP’s Russell Carleton could only find compelling support for the notion that eliminating 130-pitch starts reduces the probability of future injuries.

The newer, more data-rich models tended to settle on two factors as important predictors of a future injury: sudden, unexplained declines in a pitcher’s fastball velocity and increased variation in his release point.

(As the hypothesis goes, tired pitchers, like those who have accumulated more abuse points, find it harder to maintain consistency in their release points.) But there’s hardly a consensus among analysts that PITCHf/x is providing useful data. Again writing in “Extra Innings,” Dawkins expresses a major concern about any findings that rely on PITCHf/x’s release-point data because there’s a large degree of measurement error inherent to the system.

But there’s also the possibility that survivorship bias is at work. Pitchers who have bad mechanics (or any other flaw that would put them at greater risk of seeing their careers end early via injury) are automatically weeded out of baseball at a young age, leaving behind only the group of pitchers who made it through that initial checkpoint.

Keep pitching without incident, though, and Fernandez’s odds of future ailments would be significantly reduced, simply because the biggest test — whether he can handle a major league workload — would already have been passed.

Softball pitchers could face arm injury due to overuse

For aspiring baseball pitchers, the throwing arm is treated as a fragile package: an object that must be cared for, tended to often, and that can be damaged at any moment.

Precautions such as pitch counts and days' rest are used by coaches to keep arms healthy, but despite all of the attempts to protect pitchers, the number of Tommy John surgeries — the usual end result when a pitcher's arm is injured from overuse — continues to soar.

There are no established guidelines on how many pitches are too many, and no regulations are being considered, though some researchers suggest that the number of girls being injured is significantly underestimated.

A survey conducted in 2012-13 among active Major League Baseball players revealed that 25 percent of pitchers had undergone the famed elbow-ligament replacement surgery at some point in their careers.

Clarkstown South senior Briana Keaveney suffered a rotator cuff injury years ago due to overuse of her shoulder between softball and swimming, but pitched every game for the Vikings last season.

Pitching in an underhand motion may be more natural than an overhand delivery, but the SUNY Cortland-committed hurler is still conscious that it can't be safe to be pitching game after game.

Numbers don't lie, but they can embellish

There are significantly more boys playing baseball, including in professional leagues, than there are girls playing softball.

While there is no denying that the overwhelming majority of pitching-related injuries occur in baseball players, some medical experts claim the numbers can be slightly misleading.

Nicholas said the ratio of baseball injuries to softball injuries he sees is about 5:1, but that's only because of the much larger number of baseball players, compared to softball players.

Boys might still have a small chance to play professional ball — overseas, minor league, for an independent team or major league baseball — and so may consider surgery no more than a bump in the road.

'Girls are more likely to stop playing if they needed Tommy John surgery, versus young boys, who will have the surgery and go back to play the game,' said Nicholas, who estimated he operates on well over 80 baseball players a year.

'There's never been much discussion on it at all'

Beginning at the Little League level, baseball pitchers must rest a certain number of days between starts depending on their pitch count from the previous game.

Last May, Rye sophomore George Kirby threw 153 pitches during a game on only three days' rest, helping to bring home his team's first Section 1 championship in three decades.

A pitcher usually throws about 100 pitches per game and Kirby, fans said, could have hurt himself during the game, or put an unneeded strain on his arm resulting in an injury down the road.

By contrast, when North Rockland junior Kayla McDermott and Suffern junior Allie Wood faced off this April, the two pitchers threw a combined total of 359 pitches during a single softball game.

On April 30, McDermott and Wood battled in an epic pitchers' duel that lasted 11 innings, during which the Manhattan-committed McDermott threw 176 pitches, a career high, and struck out 18 batters across all 11 innings.

'I usually pitch five days a week, sometimes six, so my arm has been fine with pitching multiple games in a row,' McDermott said after the game.

Both girls admitted that they're usually sorer from outfield practice — when they're throwing overhand — than a day in the circle, but they had divergent views on the possibility of injuring themselves down the line due to overuse of their arms.

Faculty Directory

Experienced in nonlinear science and dynamical systems, Dr. Albers has recently aimed to apply these skills and methods to electronic health records and intensive care data, particularly looking at temporal changes.

His current research interests include robotic grasping, 3-D vision and modeling, and medical robotics.

He is doing research on computational biology with emphasis on systems-based analysis of genomic data.

He is the inventor of 17 patents and has written 3 books and more than 200 peer-reviewed scientific papers.

He is a financial economist whose work centers on understanding the nature of risk and return in asset prices.

His work spans municipal and government bond markets, equities, investment management and portfolio allocation, and alternative investments.

She is a marketing modeler who uses tools from statistics and economics to answer marketing questions.

Her main research areas are customer analytics and pricing in the context of subscription businesses.

She specializes in understanding and predicting changes in customer behavior, such as customer retention and usage.

Guillaume Bal Health Analytics Foundations of Data Science Applied Physics and Applied Mathematics Professor gb2030@columbia.edu Website Research Specialty: applied mathematics, partial differential equations with random coefficients, theory of inverse problems. Education: Diploma, Ecole Polytechnique, 1993; Ph.D.

A recipient of the NIH Director's Pioneer Award in 2007, Bearman is currently investigating the social determinants of the autism epidemic. A specialist in network analysis, he co-designed the National...

His research interests are mainly in statistical learning, graphical models, social networks and causal models.

His research often involves developing theoretical tools for learning patterns and structures in high dimensional data.

His primary research interests are in the area of decision-making under model uncertainty with a focus on applications in e-commerce,...

Raimondo Betti Smart Cities Civil Engineering and Engineering Mechanics Chair & Professor betti@civil.columbia.edu Website Professor Raimondo Betti received his Laurea degree magna cum laude in Civil Engineering from the University of Rome “La Sapienza” in 1985 and his Master of Science in Structural Mechanics (1988) and PhD in Civil Engineering (1991) from the University of Southern California.

His research involves probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference.

He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data.

Maura Boldrini Health Analytics Psychiatry Associate Professor of Neurobiology mb928@columbia.edu Website In the Boldrini laboratory, research focuses on studying adult neurogenesis and angiogenesis in the human brain, which are mechanisms of structural plasticity occurring in mammals including humans, that are necessary for learning and coping, and for treatment response.

We study stem cells and their progeny in the human adult brain and the molecular expression profile associated with the cellular changes, in the...

DuBois Bowman Health Analytics Foundations of Data Science Biostatistics Chair & Professor fb2403@columbia.edu Website DuBois Bowman, PhD, has built an active research program involving the development of biostatistical methods for brain imaging data, including functional magnetic resonance imaging, diffusion tensor imaging, and positron emission tomography.

Dr. Bowman's research spans numerous substantive areas including Parkinson's disease, Alzheimer's disease, depression, schizophrenia, and cocaine addiction, among others, and...

His research interests include the pricing of derivative securities, risk management and, more generally, quantitative methods for decision-making under uncertainty.

Brown Health Analytics Biological Sciences Research Scientist lb2425@columbia.edu Website As Director of the Quantitative Proteomics Center at Columbia University my work focuses on the identification of proteins with differential quantitative expression in cells, tissues or in affinity purifications.

A wide variety of proteomes can be processed including cells, tissues, organelles, biofluids, and affinity...

His main research interests are in the area of financial engineering with main focus on counterparty risk, systemic risk, and dynamic optimization.

His research, by experience in industry, is centered on real world impact and emerging computing trends, while his training, in mathematics and theoretical computer science, is focused on guiding principles.

He is an active researcher leading development of theories, algorithms, and systems for multimedia analysis and retrieval.

He has published extensively in many of the field’s top academic journals such as “Management Science” and “Operations Research.” His research has addressed issues in production/distribution planning, procurement auctions, supplier management, supply chain coordination, supply chain information sharing, incentive contracts, salesforce incentives, etc....

Jan Claassen Health Analytics Neurology Assistant Professor jc1439@columbia.edu Website Dr. Claassen is a nationally and internationally recognized expert in the treatment of neurological intensive care.

A leader in the field of water resources and urban sustainability, Culligan has worked extensively with The Earth Institute's Urban Design Lab at Columbia University to explore novel, interdisciplinary solutions to the modern day challenges of urbanization, with a particular emphasis on the City of New York.

Cunningham Foundations of Data Science Statistics Assistant Professor jpc2181@columbia.edu Website Many fields and industries are witnessing huge increases in the quantity and complexity of recorded data.

This changing data paradigm will only lead to a similarly dramatic increase in theoretical understanding and useful technologies if we create the analytical methods to meaningfully interrogate this data.

Davis Financial and Business Analytics Statistics Howard Levene Professor rdavis@stat.columbia.edu Website Research Interests: My research interests lie primarily in the areas of applied probability, time series, and stochastic processes.

While my research interests have gravitated towards problems in time series analysis (inference, estimation, prediction and general properties of time series models), extreme value theory still has a strong...

Prior to joining Columbia, he was a managing director at Goldman Sachs, where he was head of the quantitative strategies group in the equities division, and then head of quantitative risk strategies in firm-wide risk.

She then joined the New England Complex Systems Institute in 2006 as a postdoctoral fellow working on information theoretic tools for data analysis.

She is interested in developing scalable machine learning algorithms (distributed and parallel learning) and distributed optimization.

Noémie Elhadad Health Analytics Biomedical Informatics Assistant Professor noemie@dbmi.columbia.edu Website My research interests are in natural language processing, with particular focus on text summarization and discourse-level structuring of information.

I investigate ways in which clinical texts (these include scientific articles, medical textbooks, and patient notes) and health consumer texts (health news stories, educational health documents, and peer-patient forum posts) can be processed automatically to enhance...

Ellis Data, Media & Society Electrical Engineering Associate Professor dpwe@ee.columbia.edu Website Dan Ellis is an associate professor of electrical engineering at Columbia Engineering, where he leads the Laboratory for Recognition and Organization of Speech and Audio (LabROSA), working on extracting information from speech, music, and environmental sound.

His interests include human–computer interaction, augmented reality and virtual environments, 3-D user interfaces, knowledge-based design of graphics and multimedia, mobile and wearable computing, computer games, and information...

His research is in high-dimensional statistical learning, a rapidly growing area in statistics that arises from the emergence of 'big data'.

In particular, he works on high-dimensional variable selection and classification, nonparametric and semi-parametric methods, bioinformatics and...

Her research is on the forefront of multidisciplinary science and engineering in sensors, structural health monitoring, intelligent structures and system control for smart city applications, with an emphasis on structural safety and system resilience against natural and man-made hazards.

He trained in Physics (B.S.) and Physiology (Ph.D.) and he is a world leader in the field of single molecule biology, and the founder of the field of protein mechanics.

Her research focuses on developing high-resolution optical imaging and spectroscopy instruments in conjunction with real-time image analysis for diagnosis and therapy monitoring of diseases of the heart.

Kristina Ford Smart Cities International and Public Affairs Professor of Professional Practice kf2381@columbia.edu Website In the immediate aftermath of Hurricane Katrina, Kristina Ford’s thoughtful, well-informed and articulate assessments – heard on CNN, BBC and National Public Radio – became the first, public voice of reason to mediate the great storm’s human and civic consequences to America and beyond.

Galea’s research program seeks to uncover how determinants at multiple levels—including policies, features of the social environment, molecular, and genetic factors—jointly influence the...

Gallego Financial and Business Analytics Industrial Engineering and Operations Research Liu Family Professor ggallego@ieor.columbia.edu Website Professor Guillermo Gallego joined Columbia University's Industrial Engineering and Operations Research Department in 1988 where he has been conducting research in the areas of inventory theory, supply chain management, revenue management, and semi-conductor manufacturing.

I am interested in computer systems in a broad sense, including distributed systems, the Web, security and privacy, operating systems, and databases.

More specifically, my current research focuses on the challenges and opportunities created by today's emerging technologies, such as the Web, cloud computing, and powerful mobile...

He has received the Outstanding Statistical Application award from the American Statistical Association, the award for best article published in the American Political Science Review, and the Council of Presidents of Statistical Societies award for outstanding contributions by a...

Pierre Gentine is working on land-atmosphere interactions, convection-clouds, and surface hydrology using conceptual models, numerical models and a wide range of data analysis tools.

His overall research objective is to understand how soil and atmospheric moisture organizes across...

Javad Ghaderi Foundations of Data Science Electrical Engineering Assistant Professor jghaderi@ee.columbia.edu Website My research interests are broadly in the analysis, design, and management of large-scale networked systems.

In particular, my research draws upon mathematical tools from control, optimization, information theory, algorithms, and stochastic processes to study communication networks, wireless systems, social networks, and data centers.

Anderson Professor pg20@columbia.edu Website Professor Glasserman's research and teaching address risk management, the pricing of derivative securities, Monte Carlo simulation, statistics and operations.

Donald Goldfarb Foundations of Data Science Industrial Engineering and Operations Research Alexander and Hermine Avanessians Professor goldfarb@columbia.edu Website Donald Goldfarb, the Alexander and Hermine Avanessians Professor of Industrial Engineering and Operations Research (IEOR), has been at Columbia Engineering since 1982, serving as acting dean of the School in 1994-95 and chair of the IEOR Department from 1984 to 2002.

Dr. Goldsmith joined Columbia after receiving his PhD in Biostatistics from Johns Hopkins in 2012, where his dissertation focused on statistical methods for high-dimensional structured data. Dr. Goldsmith has research interests in scientific domains including neuroimaging, physical activity monitoring...

Her research, which has focused on the development and application of mathematical models of service systems, has resulted in dozens of publications in the premier technical journals such as Operations Research and Management Science as well as prominent healthcare journals such as Health Services Research, Inquiry and Academic Emergency...

The technologies developed by his laboratory are licensed and/or used today in Adobe Photoshop and Illustrator, at major film...

In 1985, he moved to The Miami Herald and eventually became city editor, where he oversaw the paper’s local coverage of Hurricane Andrew.

Mark Hansen Data, Media & Society Journalism Professor David and Helen Gurley Brown Institute for Media Innovation Director markh@columbia.edu Website Mark Hansen joined Columbia Journalism School in July of 2012, after a decade of shuttling between the west and east coasts.

He was a faculty member in the IEOR department until June 2005 and during this time his teaching and research focused on financial engineering.

He led the effort to create the Arden Syntax, a language for representing health knowledge that has become a national standard.

Robert Stanley Hum, MD FRCPC Health Analytics Columbia University Medical Center Assistant Professor of Pediatrics rsh2117@columbia.edu Website In addition to being a full-time pediatric intensivist, Dr. Hum focuses his research primarily on the use of informatics and technology in medical education and in the treatment and diagnosis of diseases in pediatric critical care medicine.

Professor Iyengar teaches courses in simulation and optimization. Professor Garud Iyengar’s research interests include convex optimization, robust optimization, queuing networks, combinatorial optimization, mathematical and computational finance, communication and information theory.

He directs the Columbia Machine Learning Laboratory whose research intersects computer science and statistics to develop new frameworks for learning from data with applications in vision, networks, spatio-temporal data, and text.

He conducts research in the fields of pricing, revenue management, logistics, supply chain management, algorithmic trading and transportation analysis.

He teaches courses in the areas of quantitative corporate finance, industrial economics, operations consulting, logistics, and production and inventory...

His primary research interests are in the use of statistical and semantic methods for navigating through collections of videos, particularly those showing human activities.

His research interests are in analog, RF and power integrated circuits in nanoscale CMOS technologies and the applications they enable in communications, sensing and energy.

CSUD seeks to understand the complexity of interactions between land use and transport through research and education, including student and professional training programs, in addition to working with...

He is researching inequality and executive compensation, ideological slant in academic scholarship, and the political color of boards using the tool kit of computational social science.

Her work explores problems ranging from digital location technologies, the ethics and politics of mapping, to new structures of participation in design, and the visualization of urban and...

He researches thin film devices and systems, especially focusing on optoelectronic and sensing devices based on organic and amorphous metal oxide thin film materials.

Laine Health Analytics Biomedical Engineering Chair, Professor Radiology Professor laine@columbia.edu Website Research Areas: Mathematical analysis and quantification of medical images, signal and image processing, computer-aided diagnosis.

Upmanu Lall Smart Cities Foundations of Data Science Engineering Alan & Carol Silberstein Professor Columbia Water Center Director lall@civil.columbia.edu Website Research Areas: Hydro-climate modeling, Spatial data analysis and visualization, Time series analysis and forecasting, Bayes Networks for Process Modeling and Decision Making, Risk and reliability, Water Resource Management using Climate Information. Biography: B.

His research relates to the etiopathogenesis, genetics, natural history, environmental provocation, biomarker/bioimaging development...

The theoretical work builds on methods of communications/networking, information theory, machine learning, nonlinear dynamical systems, signal processing and systems identification.

SLAC maintained the largest scientific database in the world and operates a world-class high performance datacenter.

My current areas of research are cryptography, complexity theory, harmonic analysis, combinatorics, and distributed computing.

Guohua Li is an epidemiologist specializing in injury and perioperative outcomes research using complex data systems and innovative statistical techniques....

He has performed numerous econometric studies of the impact of biomedical innovation on longevity and health in the United States and other nations.

Her research interest is at the intersection of intellectual property law and theory, the economics of information, property law, and contract law.

Her research examines how modifiable built and social environments influence cardiovascular and pulmonary health, as well as differences in these effects across population subgroups.

He has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance and probabilistic graphical models.

His research focuses on stochastic networks, financial engineering, and quantitative pricing and revenue management.

in Computer Science from the Massachusetts Institute of Technology in 2000, and joined Columbia after three years as a research scientist in the Secure Systems Research Department at AT&T Labs - Research. Her research interests are in cryptography, security, complexity...

From 1996 to 2001 he was a technical leader in AT&T laboratories, from 1976 to 1996 he was at AT&T Bell Laboratories, first as a member of the technical staff and then...

My research interests include statistical learning, applied statistics, large scale optimization, mathematical...

A leading scholar and researcher in the field of natural language processing, McKeown focuses her research on big data; her interests include text summarization, question answering, natural language generation, multimedia explanation, digital libraries, and multilingual applications.

My research interests are public health organizational systems, information management, workforce development and preparedness, information display and access, and risk communication.

My current research takes a complex systems approach toward understanding public health organizational processes by applying dynamic network analysis, a quantitative,...

Currently, he is focused on three projects: leading the infrastructure team for the Millennium Villages Project (10 countries, 14 sites across sub-Saharan Africa);...

Eben Moglen Cybersecurity Columbia Law School Professor moglen@columbia.edu Website Moglen started out as a computer programming language designer and then received his bachelor's degree from Swarthmore College in 1980, where he won the Hicks Prize for Literary Criticism.

Her research is focused on grammar induction, computational semantics, language in social media, and applications to computational social science and health informatics.

A vast array of digitized historical documents, as old as the printed word, have become available through large-...

Emi Nakamura Financial and Business Analytics Economics Associate Professor enakamura@columbia.edu Website My research focuses on empirical macroeconomics, using a combination of micro and macro data and a combination of theory and empirical methods.

My work has centered on 1) microeconomic pricing behavior and its implications for questions in macroeconomics and international economics, 2) the macroeconomic effects of government spending—i.e., fiscal stimulus, and 3) the implications of macroeconomic disasters and long‐...

Natriello teaches graduate courses in the social organization of schools and classrooms, the social dimensions of assessment processes, the sociology of online learning, and research methods.

the creation of novel cameras, the design of physics based models for vision, and the development of algorithms for scene understanding.

Nelson specializes in the area of international media development and has worked extensively as an analyst, evaluator, and practitioner in the field.

Oded Netzer Financial and Business Analytics Columbia Business School Associate Professor on2110@columbia.edu Website Professor Netzer's research interests focus on customer relationships, preference measurement, and modeling various aspects of choice behavior, including how choices change over time, contexts, and consumers.

He has served as a consultant to both government and industry, including as the technical advisor to nine States on the Microsoft Antitrust Settlement, and as an expert witness before the US International Trade Commission.

His main research area is on design methodologies and CAD tools for synthesis and optimization of asynchronous and mixed-timing (i.e.

Buhmann at ETH Zurich. My main research interests are the statistics of discrete objects and structures: permutations, graphs, partitions, binary sequences...

All these questions pose complex analytical challenges, with direct impact on medical research. Web: www.cs.columbia.edu/~itsik

Dana Pe'er Foundations of Data Science Biological Sciences Assistant Professor dpeer@biology.columbia.edu Website Dana Pe'er is associate professor at the Departments of Biological Sciences and Systems Biology at Columbia University.

In particular, we study how genetic variation alters regulatory network function and, subsequently, phenotype in health and disease.

Dr. Perotte’s primary research area is the development and application of statistical machine learning methods, including probabilistic graphical models for biomedical informatics.

His research and teaching interests are in the broad area of predictive analytics and the use of quantitative methods to help businesses make more effective decisions.

Protter Financial and Business Analytics Statistics Professor pep2117@columbia.edu Website Professor Protter’s primary research interests include mathematical finance (capital asset pricing theory, the pricing and hedging of derivatives, liquidity issues, financial bubbles, insider trading, high frequency trading, and credit risk), stochastic integration theory, stochastic differential equation theory, numerical solutions of stochastic differential equations, discretization of stochastic processes (as a...

He leads an interdisciplinary team that develops and implements mathematical and computational tools to extract biologically and clinically relevant information from large data sets, with interests in infectious diseases and cancer.

Owen Rambow Data, Media & Society Center for Computational Learning Systems Research Scientist ocr2101@columbia.edu Website My expertise lies in the areas of formal and computational models of syntax and other levels of linguistic representation.

Email summarization poses special problems (as compared to other summarization tasks) because of the dialogic nature of the source, and the informal or incomplete...

He also is Executive Vice President and partner of Gedeon GRC Consulting, a full service engineering consulting firm, and a partner in CleanTrans, an environmental transportation company....

His research interests touch on various aspects of database systems, including query processing, query language design, data warehousing, and architecture-sensitive database system design.

His current research interests include combinatorial optimization, large-scale graphical models, approximate counting and inference, and belief propagation style message-passing algorithms with an emphasis on applications in machine learning, statistical physics, and artificial intelligence.

Soumitra Sengupta Cybersecurity Biomedical Informatics Associate Clinical Professor sen@columbia.edu Website Education: PhD 1987 State University of New York, Stony Brook, 1984-1987 MS, 1984 State University of New York, Stony Brook, 1982-1984 BE, 1980 Birla Institute of Technology and Science, Pilani, India, 1975-1980. Research Interests: Systems management, information and systems architecture, information security, metrics. Current Projects: Searching for Access Anomalies in Clinical Audit Logs. Description: Discovering patterns...

His research interests cover extreme energy-efficient circuit and system design that can benefit various cyber physical systems for ubiquitous and reliable sensing,...

He is particularly interested in foundational questions of understanding what types of learning problems have—and do not have—computationally efficient algorithms.

Sethumadhavan’s research interests are in hardware security, hardware support for security and privacy and energy-efficient computing.

After completing his PhD in computational biology in 2007 at the Human Genome Sequencing Center at Baylor College of Medicine, he led the analysis of the first personal genome produced by next-generation sequencing (that of Dr. James D.

degree from Princeton University, Princeton, NJ, in 1987 where he was valedictorian of his graduating class and received the Phi Beta Kappa prize for the highest academic standing.

personal health devices, community health workers, community organization reported) with health information systems and payment mechanisms.

Our goal is to improve primary health care system delivery by making communities integral and active parts of the health...

He specializes in structural health monitoring, using sensor information to determine the condition of critical infrastructure.

Smyth has been involved with the sensor instrumentation and vibration analysis and remote monitoring of a large number of iconic long-span bridges and landmark buildings and museums.

Adam is an atmospheric scientist who specializes in the dynamics of climate and weather, particularly in the tropics, on time scales of days to decades.

His research interests include tropical cyclones, tornadoes, intraseasonal variability, tropical precipitation, the general...

Her research focuses on technological capabilities, industry development plans, and employment and skills systems.

This includes investigating traditional and modern techniques and the co-evolution of jobs and skills with national and supra-national technical standards.

Stein Financial and Business Analytics Foundations of Data Science Industrial Engineering and Operations Research Professor cliff@ieor.columbia.edu Website Professor Clifford Stein joined Columbia University's Industrial Engineering and Operations Research Department in 2001, where he has been conducting research in the areas of combinatorial optimization, scheduling, and network algorithms.

His research involves the discovery of small molecules that can be used to understand and treat cancer and neurodegeneration.

This includes studying adequacy and robustness in replicated results, designing and implementing validation systems, developing standards of openness for...

He is particularly interested in the role of social, financial and economic networks in shaping economic outcomes.

Tatonetti Health Analytics Foundations of Data Science Biomedical Informatics Assistant Professor nick.tatonetti@columbia.edu Website Nicholas P Tatonetti, PhD Assistant Professor of Biomedical Informatics Department of Biomedical Informatics, Columbia Initiative for Systems Biology, & Department of Medicine at Columbia University Research Specialty Translational bioinformatics, machine learning, observational data mining, combinatorial drug design, emergent biology, genetic networks and network analysis, clinical data analysis Education Stanford...

Dennis Tenen Data, Media & Society English and Comparative Literature Assistant Professor dt2406@columbia.edu Website Dennis Tenen writes and teaches in the field of computational culture studies both as in the critical study of computational culture and in the sense of applying computational approaches to the study of culture.

As a scholar, a coder, and an aspiring activist, Dennis is interested in the impact of technology on the way we think, read, and write: issues of authority, textual production, reception, influence,...

Tippett Smart Cities Applied Physics and Applied Mathematics Lecturer in Discipline michael.tippett@columbia.edu Website Research Interests My research focuses on the predictability and variability of the climate system, with emphasis on the application of statistical methods to data from observations and numerical models.

His research focuses on various aspects of innovation (including idea generation, preference measurement, and the diffusion of innovation), social networks and behavioral economics.

Prior to forming Resolution Seven, Duy founded and was the Chief Operations Officer of Missing Pixel, an award-winning interactive production company. Duy has shot for major broadcast networks, cable channels, independent...

Through an integrated program of empirical and quantitative approaches, research in her lab examines forest ecological dynamics in response to natural disturbance and human land use.

Viele Professor venkat@columbia.edu Website Research Interests: Our group’s research contributions have been in three areas: (1) risk analysis and management in complex engineered systems, (2) cyberinfrastructure and “big data” analytics for molecular products design and discovery, and (3) complex adaptive teleological systems.

These challenges are addressed using a combination of artificial intelligence, informatics, statistics, and mathematical programming...

Their work involves developing methods that connect network structure to function to phenotypes, and can be used to make experimentally verifiable predictions.

His research interests fall in the general areas of computing, signal processing, and communications, and he has published extensively in these areas.

Weintraub Financial and Business Analytics Columbia Business School Sidney Taurel Associate Professor gweintraub@columbia.edu Website Gabriel Weintraub's research covers several subjects that lie in the intersection between operations/management science, and applied economics.

He is particularly interested in developing mathematical and computational models for the economic analysis of problems in operations;

Dr. Weng's research goal is to design enabling technologies for clinician scientists and to improve the cost-effectiveness and efficiency of clinical and translational research.

Her research addresses the socio-technical issues around sharing and reusing fragmented research sources, repurposing...

Ward Whitt Financial and Business Analytics Industrial Engineering and Operations Research Professor ww2040@columbia.edu Website Professor Whitt joined Columbia University’s IEOR Department in 2002, after spending 25 years in research at AT&T, first at Bell Labs and then at AT&T Labs, where he was a technology leader and an AT&T fellow.

She is widely recognized for her intellectual leadership, having earned the respect and admiration of colleagues across the field for her role in the dramatic transformation of data science into a discipline essential to so much scholarly...

Some of my current projects involve efficient and reliable multithreading, tools for the cloud, and operating systems support for reliability.

He is particularly interested in the areas of stochastic modeling and statistics, and their synergistic application to problems arising in service operations, revenue management, and financial services...

Chaolin Zhang Foundations of Data Science Systems Biology Assistant Professor cz2294@columbia.edu Website Chaolin Zhang uses a combination of computational and experimental methods to infer RNA regulatory networks in the nervous system.

In particular, he is interested in characterizing the regulatory networks that specify neuronal cell types, and how these networks can be compromised in certain pathologic contexts, such as neurodegenerative diseases and brain tumors. Lab website: http://zhanglab.c2b2.columbia.edu

Tian Zheng Foundations of Data Science Statistics Associate Professor tzheng@stat.columbia.edu Website Tian Zheng is associate professor of Statistics at Columbia University.

Her research aims to develop novel methods and improve existing methods for exploring and analyzing interesting patterns in complex data from different application domains.

Her current projects are in the fields of statistical genetics, bioinformatics and computational biology, feature selection and...

From a methodological perspective his research focuses on new statistical methods for causal inference in randomized experiments and observational studies.

From a substantive point of view, he is very interested in using these methods to address questions in health care and public policy.

WATCH: Scherzer Blasts First Career Home Run Then Leaves Game with Injury

Max Scherzer hit his first career home run before leaving Tuesday's game against the Marlins with an injury.

In the second inning with the Nationals up 1-0, Scherzer showed bunt before pulling back and knocking a three-run home run to Marlins Park's second deck.

But cameras caught Scherzer grimacing in the dugout after hitting the homer, and he pointed to his neck and said 'I can't go.'

STL@ATL: Winkler exits the game with an arm injury

Daniel Winkler grabs his arm after a pitch and exits the game with an injury in the top of the 7th inning.

ALCS Gm3: Bauer leaves game with finger wound in 1st

Trevor Bauer has to leave the game in the bottom of the 1st inning when his previous finger wound begins to bleed while pitching.

BRAIN INJURY FOR THE PITCHER! | MLB The Show 17 | Road to the Show #520

Weirdest Baseball Injuries

14 Year Old Pitcher Gets Hit In Face With Softball.

The ball hit me so hard that it passed the catcher and hit the backstop. There was no serious injury. All it left me was the seams from the ball on my forehead.

MLB | Injured Umpires

CWS@NYY: Tanaka throws 35 pitches in simulated game

8/23/14: Masahiro Tanaka throws 35 pitches in a simulated game, as he continues to strengthen and rehab a partially torn elbow ligament.

DET@CLE: Carrasco struck on pitching hand, exits game

Ian Kinsler's line drive strikes Carlos Carrasco on his pitching hand, forcing Carrasco to leave the game in the top of the 1st.

MIN@COL: Suzuki gets hit by two pitches in one at-bat

7/13/14: Kurt Suzuki is struck by a foul ball twice in the span of three pitches but remains in the ballgame.

BROKEN FOOT! | MLB The Show 16 | Road to the Show #371