AI News, QA with Data Scientists: Dirk Tassilo Hettich
Dirk Tassilo Hettich holds a PhD in Computer Science, as well as MSc and BSc degrees in Bioinformatics, from the University of Tübingen, Germany, where he worked closely with neuroscientists on applied machine learning in the context of human-computer interaction and affective computing.
After researching brain-computer interfacing for communication and control for almost 10 years, Dirk Tassilo decided that it was time to see how big data and advanced analytics apply in an economic context, and in March 2016 he joined the Tax Technology & Analytics team at EY Stuttgart, led by Florian Buschbacher.
Since then he has applied his software development, machine learning, and visualization expertise in multiple client projects as project and application lead.
classification, regression, clustering, feature analysis (basic statistics), performance testing and metrics (e.g.
I guess everybody in the data science community agrees that the “right” features and data are key to successful or effective machine learning.
That said, I am a big fan of linear support vector machines, both for exploratory machine learning and for applied use, because their inner workings are relatively easy to understand (e.g.
(*) The ETL process of aggregating multiple data sources can be labor-intensive, because the real world is usually not well organized (e.g.
think of different granularity in samples, failing systems, or missing data), but it is usually required in an industrial context (i.e.
On the contrary, I strongly support the parsimony principle and usually seek to use the smallest set of highly descriptive features.
Once patterns are identified, parsers can be derived that apply certain rules to incoming data in a productive system.
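As a rough illustration of such a rule-based parser, the sketch below applies a small set of regular-expression rules to each incoming line. The rule names, the patterns, and the `parse_record` helper are hypothetical and chosen only for this example; they are not taken from the interview.

```python
import re

# Hypothetical rules: one regular expression per field, each with a
# named group "value" that captures the part we want to keep.
RULES = {
    "date": re.compile(r"(?P<value>\d{4}-\d{2}-\d{2})"),
    "amount": re.compile(r"(?P<value>\d+(?:\.\d{2})?)\s*EUR"),
}


def parse_record(line):
    """Apply every rule to one incoming line; unmatched fields become None."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(line)
        record[field] = match.group("value") if match else None
    return record


print(parse_record("Invoice 2016-03-01 total 120.50 EUR"))
```

Keeping the rules in a dictionary means that a newly identified pattern can be added without touching the parsing loop.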
Coming from a performance analysis point of view, one should ask how many samples are required in order to successfully perform n-fold cross-validation.
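One way to make this question concrete is to look at how many samples each fold actually receives. The sketch below is a minimal, assumed implementation of contiguous n-fold splitting (no shuffling or stratification); the function name `kfold_indices` is made up for this example. The only hard lower bound it enforces is one test sample per fold; in practice, far more samples per fold are needed for stable estimates.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index lists."""
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for i in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        folds.append((train, test))
        start = stop
    return folds


for train, test in kfold_indices(10, 3):
    print(len(train), len(test))
```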
With access to big data resources and potentially sensitive information comes great responsibility for data engineers, InfoSec, and also data scientists.
Data Transformations “By Example” in the Azure Machine Learning Workbench
In this post, I talk about the Derive Column By Example transformation – an unexpected, powerful and super-efficient way to perform complex data transformations in the Azure Machine Learning Workbench.
You’ve just invented this amazing new recipe for sautéing vegetables and are now wondering how best to teach this new recipe to Sunny, the new robot on the team.
Shortly after, you see Sunny merrily sautéing away several kinds of vegetables using your latest recipe, to demonstrate their learning.
The underlying technology, known as PROgram Synthesis using Examples (PROSE) within Microsoft, is a new frontier in AI that brings together advances in logical reasoning and machine learning based methods.
Logical reasoning-based search techniques are used to efficiently search for programs within an underlying domain-specific language (DSL) that are consistent with the examples.
Machine learning-based ranking techniques are used to pick an intended program from among the many programs that are consistent with the examples.
The DSL for data transformation in the Azure Machine Learning Workbench is designed to perform common data transformation tasks for data professionals.
If the intended task can be expressed by a program in the underlying DSL, the tool will synthesize one such program using examples.
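The idea of searching a DSL for programs consistent with examples can be sketched with a toy brute-force enumerator. This is not PROSE and not the Workbench DSL: the five string primitives, the `run` and `synthesize` helpers, and the "shortest program first" ordering (a crude stand-in for the learned ranking described above) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy DSL: each primitive maps a string to a string.
PRIMITIVES = {
    "upper": str.upper,
    "lower": str.lower,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "first_char": lambda s: s[:1],
}


def run(program, text):
    """Apply the primitives named in `program` from left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text


def synthesize(examples, max_len=3):
    """Return every program up to max_len consistent with all examples, shortest first."""
    consistent = []
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                consistent.append(program)
    return consistent


examples = [("Dirk Hettich", "H"), ("Ada Lovelace", "L")]
best = synthesize(examples)[0]
print(best, run(best, "Grace Hopper"))
```

Exhaustive enumeration only works for a DSL this small, and several programs are usually consistent with the same examples, which is why the search and ranking techniques described above matter in a real system.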
The technology is a result of many years of research at Microsoft and has come a long way since it was first released in Excel 2013 (some of you may be familiar with the Excel Flash Fill experience).
What’s more, PROSE generates programs which, when combined with your other data preparation steps, can be operationalized using the scale-out capabilities of Azure Machine Learning.
Here, we are using the name, mass (g), year, and GeoLocation columns as inputs to compute the output column.
If none of those satisfy your needs, you can write custom transformations in Python for a variety of tasks such as adding a derived column based on values from other columns, filtering rows, manipulating the entire dataset, or writing the output to a custom destination.
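To give a feel for what a derived-column transformation does, here is a minimal sketch in plain Python over rows represented as dictionaries. It deliberately does not use the Workbench's own interface, which is not shown in this post; the `add_mass_kg` helper, the `mass (kg)` column, and the sample rows are hypothetical.

```python
def add_mass_kg(rows):
    """Return copies of the rows with a derived 'mass (kg)' column.

    Rows whose 'mass (g)' value is missing keep None in the new column.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        mass_g = row.get("mass (g)")
        new_row["mass (kg)"] = None if mass_g is None else float(mass_g) / 1000.0
        result.append(new_row)
    return result


# Hypothetical sample rows, invented for this example only.
rows = [
    {"name": "Sample A", "mass (g)": 2500, "year": 1900},
    {"name": "Sample B", "mass (g)": None, "year": 2000},
]
for row in add_mass_kg(rows):
    print(row["name"], row["mass (kg)"])
```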
Summary, Next Steps
In this post, we introduced a powerful new approach to data transformations, one that truly has the promise to revolutionize the data preparation workflow for data scientists.
- On Thursday, July 18, 2019
Decision Tree 1: how it works
Full lecture: A Decision Tree recursively splits training data into subsets based on the value of a single attribute. Each split corresponds to a node in the. Splitting..
Machine Learning and Robust Optimization, Fengqi You, Cornell University
When Machine Learning Meets Robust Optimization: Data-driven Adaptive Robust Optimization Models, Algorithms & Applications In this presentation, we will introduce a novel data-driven adaptive...
The Python ecosystem for Data Science: A guided tour - Christian Staudt
Description Pythonistas have access to an extensive collection of tools for data analysis. The space of tools is best understood as an ecosystem: Libraries build upon each other, and a good...
(ML 2.1) Classification trees (CART)
Basic intro to decision trees for classification using the CART approach. A playlist of these Machine Learning videos is available here:
Feature Scaling | Machine Learning | Data science
In this video you will learn why scaling input variables is required before running the gradient descent algorithm. For Training & Study packs on Analytics/Data Science/Big Data, Contact us...
Valerio Maggio – Data Formats for Data Science
Plain text is one of the simplest yet most intuitive formats in which data can be stored. It is easy to create, human and machine readable, storage-friendly (i.e. highly compressible),...
Python for Data Science - Feature Engineering: Overview
This first video on feature engineering covers basic questions like 'What are features' and 'What is feature engineering?'. As always, code is posted in the Python for Data Science repository...
Automatic Discovery of the Statistical Types of Variables in a Dataset
A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model,...
Learning Representations: A Challenge for Learning Theory
VideoLectures.Net Computer Science View the talk in context: View the complete 26th Annual Conference on Learning Theory (COLT), Princeton 2013:..
Data Science Lecture Series: Maximizing Human Potential Using Machine Learning-Driven Applications
Lecture | September 19 | 1:00-2:30 p.m. | Sutardja Dai Hall, Banatao Auditorium Speaker/Performer...