__Data Preprocessing for Data Mining__ addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely contain inconsistencies and errors or, most importantly, will not be ready for a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim to reduce the complexity of the data by detecting or removing irrelevant and noisy elements.

This book reviews the tasks that fill the gap between data acquisition from the source and the data mining process. It takes a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, ranging from basic concepts and detailed descriptions of classical algorithms to a foray into an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.

- Preface
- Contents
- Acronyms
- 1 Introduction
  - 1.1 Data Mining and Knowledge Discovery
  - 1.2 Data Mining Methods
  - 1.3 Supervised Learning
  - 1.4 Unsupervised Learning
    - 1.4.1 Pattern Mining
    - 1.4.2 Outlier Detection
  - 1.5 Other Learning Paradigms
    - 1.5.1 Imbalanced Learning
    - 1.5.2 Multi-instance Learning
    - 1.5.3 Multi-label Classification
    - 1.5.4 Semi-supervised Learning
    - 1.5.5 Subgroup Discovery
    - 1.5.6 Transfer Learning
    - 1.5.7 Data Stream Learning
  - 1.6 Introduction to Data Preprocessing
    - 1.6.1 Data Preparation
    - 1.6.2 Data Reduction
  - References
- 2 Data Sets and Proper Statistical Analysis of Data Mining Techniques
  - 2.1 Data Sets and Partitions
    - 2.1.1 Data Set Partitioning
    - 2.1.2 Performance Measures
  - 2.2 Using Statistical Tests to Compare Methods
    - 2.2.1 Conditions for the Safe Use of Parametric Tests
    - 2.2.2 Normality Test over the Group of Data Sets and Algorithms
    - 2.2.3 Non-parametric Tests for Comparing Two Algorithms in Multiple Data Set Analysis
    - 2.2.4 Non-parametric Tests for Multiple Comparisons Among More than Two Algorithms
  - References
- 3 Data Preparation Basic Models
  - 3.1 Overview
  - 3.2 Data Integration
    - 3.2.1 Finding Redundant Attributes
    - 3.2.2 Detecting Tuple Duplication and Inconsistency
  - 3.3 Data Cleaning
  - 3.4 Data Normalization
    - 3.4.1 Min-Max Normalization
    - 3.4.2 Z-score Normalization
    - 3.4.3 Decimal Scaling Normalization
  - 3.5 Data Transformation
    - 3.5.1 Linear Transformations
    - 3.5.2 Quadratic Transformations
    - 3.5.3 Non-polynomial Approximations of Transformations
    - 3.5.4 Polynomial Approximations of Transformations
    - 3.5.5 Rank Transformations
    - 3.5.6 Box-Cox Transformations
    - 3.5.7 Spreading the Histogram
    - 3.5.8 Nominal to Binary Transformation
    - 3.5.9 Transformations via Data Reduction
  - References
- 4 Dealing with Missing Values
  - 4.1 Introduction
  - 4.2 Assumptions and Missing Data Mechanisms
  - 4.3 Simple Approaches to Missing Data
  - 4.4 Maximum Likelihood Imputation Methods
    - 4.4.1 Expectation-Maximization (EM)
    - 4.4.2 Multiple Imputation
    - 4.4.3 Bayesian Principal Component Analysis (BPCA)
  - 4.5 Imputation of Missing Values. Machine Learning Based Methods
    - 4.5.1 Imputation with K-Nearest Neighbor (KNNI)
    - 4.5.2 Weighted Imputation with K-Nearest Neighbour (WKNNI)
    - 4.5.3 K-means Clustering Imputation (KMI)
    - 4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)
    - 4.5.5 Support Vector Machines Imputation (SVMI)
    - 4.5.6 Event Covering (EC)
    - 4.5.7 Singular Value Decomposition Imputation (SVDI)
    - 4.5.8 Local Least Squares Imputation (LLSI)
    - 4.5.9 Recent Machine Learning Approaches to Missing Values Imputation
  - 4.6 Experimental Comparative Analysis
    - 4.6.1 Effect of the Imputation Methods in the Attributes' Relationships
    - 4.6.2 Best Imputation Methods for Classification Methods
    - 4.6.3 Interesting Comments
  - References
- 5 Dealing with Noisy Data
  - 5.1 Identifying Noise
  - 5.2 Types of Noise Data: Class Noise and Attribute Noise
    - 5.2.1 Noise Introduction Mechanisms
    - 5.2.2 Simulating the Noise of Real-World Data Sets
  - 5.3 Noise Filtering at Data Level
    - 5.3.1 Ensemble Filter
    - 5.3.2 Cross-Validated Committees Filter
    - 5.3.3 Iterative-Partitioning Filter
    - 5.3.4 More Filtering Methods
  - 5.4 Robust Learners Against Noise
    - 5.4.1 Multiple Classifier Systems for Classification Tasks
    - 5.4.2 Addressing Multi-class Classification Problems by Decomposition
  - 5.5 Empirical Analysis of Noise Filters and Robust Strategies
    - 5.5.1 Noise Introduction
    - 5.5.2 Noise Filters for Class Noise
    - 5.5.3 Noise Filtering Efficacy Prediction by Data Complexity Measures
    - 5.5.4 Multiple Classifier Systems with Noise
    - 5.5.5 Analysis of the OVO Decomposition with Noise
  - References
- 6 Data Reduction
  - 6.1 Overview
  - 6.2 The Curse of Dimensionality
    - 6.2.1 Principal Components Analysis
    - 6.2.2 Factor Analysis
    - 6.2.3 Multidimensional Scaling
    - 6.2.4 Locally Linear Embedding
  - 6.3 Data Sampling
    - 6.3.1 Data Condensation
    - 6.3.2 Data Squashing
    - 6.3.3 Data Clustering
  - 6.4 Binning and Reduction of Cardinality
  - References
- 7 Feature Selection
  - 7.1 Overview
  - 7.2 Perspectives
    - 7.2.1 The Search of a Subset of Features
    - 7.2.2 Selection Criteria
    - 7.2.3 Filter, Wrapper and Embedded Feature Selection
  - 7.3 Aspects
    - 7.3.1 Output of Feature Selection
    - 7.3.2 Evaluation
    - 7.3.3 Drawbacks
    - 7.3.4 Using Decision Trees for Feature Selection
  - 7.4 Description of the Most Representative Feature Selection Methods
    - 7.4.1 Exhaustive Methods
    - 7.4.2 Heuristic Methods
    - 7.4.3 Nondeterministic Methods
    - 7.4.4 Feature Weighting Methods
  - 7.5 Related and Advanced Topics
    - 7.5.1 Leading and Recent Feature Selection Techniques
    - 7.5.2 Feature Extraction
    - 7.5.3 Feature Construction
  - 7.6 Experimental Comparative Analyses in Feature Selection
  - References
- 8 Instance Selection
  - 8.1 Introduction
  - 8.2 Training Set Selection Versus Prototype Selection
  - 8.3 Prototype Selection Taxonomy
    - 8.3.1 Common Properties in Prototype Selection Methods
    - 8.3.2 Prototype Selection Methods
    - 8.3.3 Taxonomy of Prototype Selection Methods
  - 8.4 Description of Methods
    - 8.4.1 Condensation Algorithms
    - 8.4.2 Edition Algorithms
    - 8.4.3 Hybrid Algorithms
  - 8.5 Related and Advanced Topics
    - 8.5.1 Prototype Generation
    - 8.5.2 Distance Metrics, Feature Weighting and Combinations with Feature Selection
    - 8.5.3 Hybridizations with Other Learning Methods and Ensembles
    - 8.5.4 Scaling-Up Approaches
    - 8.5.5 Data Complexity
  - 8.6 Experimental Comparative Analysis in Prototype Selection
    - 8.6.1 Analysis and Empirical Results on Small Size Data Sets
    - 8.6.2 Analysis and Empirical Results on Medium Size Data Sets
    - 8.6.3 Global View of the Obtained Results
    - 8.6.4 Visualization of Data Subsets: A Case Study Based on the Banana Data Set
  - References
- 9 Discretization
  - 9.1 Introduction
  - 9.2 Perspectives and Background
    - 9.2.1 Discretization Process
    - 9.2.2 Related and Advanced Work
  - 9.3 Properties and Taxonomy
    - 9.3.1 Common Properties
    - 9.3.2 Methods and Taxonomy
    - 9.3.3 Description of the Most Representative Discretization Methods
  - 9.4 Experimental Comparative Analysis
    - 9.4.1 Experimental Set up
    - 9.4.2 Analysis and Empirical Results
  - References
- 10 A Data Mining Software Package Including Data Preparation and Reduction: KEEL
  - 10.1 Data Mining Softwares and Toolboxes
  - 10.2 KEEL: Knowledge Extraction Based on Evolutionary Learning
    - 10.2.1 Main Features
    - 10.2.2 Data Management
    - 10.2.3 Design of Experiments: Off-Line Module
    - 10.2.4 Computer-Based Education: On-Line Module
  - 10.3 KEEL-Dataset
    - 10.3.1 Data Sets Web Pages
    - 10.3.2 Experimental Study Web Pages
  - 10.4 Integration of New Algorithms into the KEEL Tool
    - 10.4.1 Introduction to the KEEL Codification Features
  - 10.5 KEEL Statistical Tests
    - 10.5.1 Case Study
  - 10.6 Summarizing Comments
  - References
- Index