AI News, Cite This Page


Wikimedia Commons contributors, 'File:Sample view of data collection process using artificial intelligence application.png', Wikimedia Commons, the free media repository, 27 November 2016, 22:07 UTC.


Data mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1]

Data mining is an interdisciplinary subfield of computer science with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.[1][2][3][4]

Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[1]

The term 'data mining' is in fact a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.[6]

The term is also frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence.

Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining).

For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system.
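The grouping step described above can be sketched with a tiny one-dimensional k-means (Lloyd's algorithm). This is only an illustration under stated assumptions: the purchase amounts, the starting centers, and the function names `nearest` and `kmeans_1d` are all made up for the example and do not come from the text.

```python
from statistics import mean

def nearest(value, centers):
    """Index of the center closest to value (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the given starting centers (Lloyd's algorithm)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical purchase amounts that fall into two visible groups.
amounts = [9, 10, 11, 12, 48, 50, 52, 55]
centers, groups = kmeans_1d(amounts, centers=[0, 100])
print(centers)  # [10.5, 51.25]
print(groups)   # [[9, 10, 11, 12], [48, 50, 52, 55]]
```

A downstream system could then treat the group index returned by `nearest(new_value, centers)` as an extra input when making predictions for a new record.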

Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall knowledge discovery in databases (KDD) process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered.

In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.

As data sets have grown in size and complexity, direct 'hands-on' data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s).

Data mining bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit.

In machine learning, a simple version of the problem of finding patterns that do not generalize is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split (when applicable at all) may not be sufficient to prevent it.[19]
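One way to see why a single held-out set may not be enough is to select among many candidate rules using that set and then re-score the winner on fresh data. The sketch below is a hypothetical simulation, not a method from the cited source: labels and rules are independent coin flips, and the names `accuracy` and `dredge` and all sizes are assumptions chosen for illustration.

```python
import random

def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def dredge(n_rules=200, n_test=30, n_fresh=500, seed=0):
    """Pick the best of many random rules on a test set, then re-score it.

    Labels and rule outputs are independent coin flips, so no rule has
    real predictive power; a high selected score is a selection artifact.
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]

    best_test_score, best_fresh_score = -1.0, 0.0
    for _ in range(n_rules):
        # Each "rule" is modeled as a random guess per record.
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]
        score = accuracy(test_preds, test_labels)
        if score > best_test_score:
            # Keep the winner on the test set and its score on fresh data.
            best_test_score = score
            best_fresh_score = accuracy(fresh_preds, fresh_labels)
    return best_test_score, best_fresh_score

test_score, fresh_score = dredge()
print(f"selected rule on the test set: {test_score:.2f}")
print(f"same rule on fresh records:    {fresh_score:.2f}")
```

Because the winner is chosen for its test-set score, that score would be expected to sit well above chance, while the score on fresh records would be expected to stay close to 0.5. The exact numbers depend on the seed.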

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications.
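To give a feel for what exchanging a model as XML can look like, here is a minimal sketch that reads a toy linear regression from a PMML-style document and evaluates it. The document below is hand-written for illustration; its element and attribute names follow my understanding of PMML regression models and should be checked against the DMG specification. The `predict` helper is hypothetical and only handles this tiny fragment, not PMML in general.

```python
import xml.etree.ElementTree as ET

# Hand-written toy document (assumed structure): y = 1.5 + 2.0 * x
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Prefix-to-namespace map used by ElementTree path lookups.
NS = {"p": "http://www.dmg.org/PMML-4_4"}

def predict(pmml_text, inputs):
    """Evaluate intercept + sum(coefficient * input) from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for term in table.findall("p:NumericPredictor", NS):
        result += float(term.get("coefficient")) * inputs[term.get("name")]
    return result

print(predict(PMML_DOC, {"x": 3.0}))  # 7.5
```

The point of the sketch is that the model travels as data: any tool that understands the format can rebuild the same prediction function without access to the code that trained it.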

Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).[29]

The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.[30][31][32]

However, even 'de-identified'/'anonymized' data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[33]
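One common way to quantify this kind of risk, not named in the text above, is k-anonymity: the size of the smallest group of records that share the same combination of indirectly identifying attributes. The sketch below uses made-up records, and the field names (`zip`, `birth_year`, `sex`, `query`) are assumptions for the example. It is a simpler setting than the AOL case, where the query text itself was revealing.

```python
from collections import Counter

# Attributes assumed to be linkable to outside data (hypothetical choice).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of key values."""
    return Counter(tuple(r[k] for k in keys) for r in records)

def smallest_group(records, keys=QUASI_IDENTIFIERS):
    """Return k, the size of the smallest group; k == 1 means someone is unique."""
    return min(group_sizes(records, keys).values())

def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Records whose key combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]

# Made-up "anonymized" records: no names, but indirect attributes remain.
records = [
    {"zip": "02138", "birth_year": 1980, "sex": "F", "query": "flu symptoms"},
    {"zip": "02138", "birth_year": 1980, "sex": "F", "query": "cheap flights"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "query": "knee surgery"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "query": "tax forms"},
    {"zip": "02139", "birth_year": 1962, "sex": "F", "query": "landscapers nearby"},
]

print(smallest_group(records))                 # 1
print(unique_records(records)[0]["query"])     # landscapers nearby
print(smallest_group(records, keys=("zip",)))  # 2
```

Here one record is unique on the three attributes, so anyone holding an outside list with the same attributes could link it to a person. Releasing only the ZIP code raises the smallest group to two, at the cost of less detailed data.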

As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion about revoking this agreement, in particular because the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.[citation needed]

The focus on licences, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.[38]

For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitisation project displayed, one being text and data mining.[39]

AI (artificial intelligence)

While AI tools present a range of new functionality for businesses, artificial intelligence also raises some ethical questions.

Hackers are starting to use sophisticated machine learning tools to gain access to sensitive systems, complicating the issue of security beyond its current state.

Deep learning-based video and audio generation tools also present bad actors with the tools necessary to create so-called deepfakes, convincingly fabricated videos of public figures saying or doing things that never took place.

Enhance! Super Resolution From Google | Two Minute Papers #124

The paper discussed is "RAISR: Rapid and Accurate Image Super Resolution".

Neural Networks and Tensorflow - 5 - Backpropagation

In this series we're going to look into concepts of deep learning and neural networks with TensorFlow. Backpropagation is an essential topic in deep learning.

Artificial Neural Networks with Python - 2 - Basic Concepts

In this series we're exploring artificial neural networks with Python. In this tutorial we're discussing some very basic concepts of artificial neural networks: how ...

AI-Based Large-Scale Texture Synthesis | Two Minute Papers #252

Intelligence cycle management

Intelligence cycle management refers to the overall activity of guiding the intelligence cycle, which is a set of processes used to ...

Motion planning

Motion planning (also known as the navigation problem or the piano mover's problem) is a term used in robotics for the process of breaking ...

Final Year Projects | Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology

Final Year Projects | Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-Based Model Including Packages ...

Machine Learning with Scikit-Learn - The Cancer Dataset - 13 - Decision Trees 3

In this machine learning series I will work on the Wisconsin Breast Cancer dataset that comes with scikit-learn. I will train a few algorithms and evaluate their ...

Artificial Neural Networks with Python - 10 - Optical Character Recognition - 2

In this series we're exploring artificial neural networks with Python. In this tutorial we're continuing our work on optical character recognition. Here we are doing ...