This textbook, suitable for an early undergraduate up to a graduate course, provides an overview of many basic principles and techniques needed for modern data analysis. In particular, this book was designed and written as preparation for students planning to take rigorous Machine Learning and Data Mining courses. It introduces key conceptual tools necessary for data analysis, including concentration of measure and PAC bounds, cross validation, gradient descent, and principal component analysis. It also surveys basic techniques in supervised (regression and classification) and unsupervised learning (dimensionality reduction and clustering) through an accessible, simplified presentation. Students are recommended to have some background in calculus, probability, and linear algebra. Some familiarity with programming and algorithms is useful to understand advanced topics on computational techniques. Preface 7 Acknowledgements 11 Contents 12 1 Probability Review 17 1.1 Sample Spaces 17 1.2 Conditional Probability and Independence 20 1.3 Density Functions 20 1.4 Expected Value 22 1.5 Variance 23 1.6 Joint, Marginal, and Conditional Distributions 24 1.7 Bayes' Rule 26 1.7.1 Model Given Data 27 1.8 Bayesian Inference 30 Exercises 35 2 Convergence and Sampling 38 2.1 Sampling and Estimation 38 2.2 Probably Approximately Correct (PAC) 41 2.3 Concentration of Measure 41 2.3.1 Markov Inequality 42 2.3.2 Chebyshev Inequality 43 2.3.3 Chernoff-Hoeffding Inequality 44 2.3.4 Union Bound and Examples 46 2.4 Importance Sampling 49 2.4.1 Sampling Without Replacement with Priority Sampling 54 Exercises 56 3 Linear Algebra Review 58 3.1 Vectors and Matrices 58 3.2 Addition and Multiplication 61 3.3 Norms 64 3.4 Linear Independence 66 3.5 Rank 67 3.6 Square Matrices and Properties 68 3.7 Orthogonality 70 Exercises 72 4 Distances and Nearest Neighbors 74 4.1 Metrics 74 4.2 Lp Distances and their Relatives 75 4.2.1 Lp Distances 75 4.2.2 Mahalanobis Distance 78 4.2.3 Cosine and Angular Distance 79 4.2.4 KL Divergence 80 4.3 Distances for Sets and Strings 81 4.3.1 Jaccard Distance 82 4.3.2 Edit Distance 84 4.4 Modeling Text with Distances 85 4.4.1 Bag-of-Words Vectors 85 4.4.2 k-Grams 88 4.5 Similarities 91 4.5.1 Set Similarities 91 4.5.2 Normed Similarities 92 4.5.3 Normed Similarities between Sets 93 4.6 Locality Sensitive Hashing 95 4.6.1 Properties of Locality Sensitive Hashing 97 4.6.2 Prototypical Tasks for LSH 98 4.6.3 Banding to Amplify LSH 99 4.6.4 LSH for Angular Distance 102 4.6.5 LSH for Euclidean Distance 104 4.6.6 Min Hashing as LSH for Jaccard Distance 105 Exercises 108 5 Linear Regression 110 5.1 Simple Linear Regression 110 5.2 Linear Regression with Multiple Explanatory Variables 114 5.3 Polynomial Regression 117 5.4 Cross-Validation 119 5.4.1 Other ways to Evaluate Linear Regression Models 123 5.5 Regularized Regression 124 5.5.1 Tikhonov Regularization for Ridge Regression 125 5.5.2 Lasso 127 5.5.3 Dual Constrained Formulation 128 5.5.4 Matching Pursuit 130 Exercises 137 6 Gradient Descent 140 6.1 Functions 140 6.2 Gradients 143 6.3 Gradient Descent 144 6.3.1 Learning Rate 144 6.4 Fitting a Model to Data 150 6.4.1 Least Mean Squares Updates for Regression 151 6.4.2 Decomposable Functions 152 Exercises 156 7 Dimensionality Reduction 158 7.1 Data Matrices 158 7.1.1 Projections 160 7.1.2 Sum of Squared Errors Goal 161 7.2 Singular Value Decomposition 162 7.2.1 Best Rank-k Approximation of a Matrix 167 7.3 Eigenvalues and Eigenvectors 170 7.4 The Power Method 172 7.5 Principal Component Analysis 175 7.6 Multidimensional Scaling 176 7.6.1 Why does Classical MDS work? 178 7.7 Linear Discriminant Analysis 181 7.8 Distance Metric Learning 182 7.9 Matrix Completion 184 7.10 Random Projections 186 Exercises 188 8 Clustering 192 8.1 Voronoi Diagrams 192 8.1.1 Delaunay Triangulation 195 8.1.2 Connection to Assignment-Based Clustering 197 8.2 Gonzalez's Algorithm for k-Center Clustering 198 8.3 Lloyd's Algorithm for k-Means Clustering 200 8.3.1 Lloyd's Algorithm 201 8.3.2 k-Means++ 206 8.3.3 k-Mediod Clustering 207 8.3.4 Soft Clustering 208 8.4 Mixture of Gaussians 209 8.4.1 Expectation-Maximization 211 8.5 Hierarchical Clustering 211 8.6 Density-Based Clustering and Outliers 214 8.6.1 Outliers 215 8.7 Mean Shift Clustering 216 Exercises 218 9 Classification 221 9.1 Linear Classifiers 221 9.1.1 Loss Functions 224 9.1.2 Cross-Validation and Regularization 226 9.2 Perceptron Algorithm 227 9.3 Support Vector Machines and Kernels 231 9.3.1 The Dual: Mistake Counter 232 9.3.2 Feature Expansion 233 9.3.3 Support Vector Machines 235 9.4 Learnability and VC dimension 236 9.5 kNN Classifiers 239 9.6 Decision Trees 239 9.7 Neural Networks 242 9.7.1 Training with Back-propagation 244 10 Graph Structured Data 250 10.1 Markov Chains 252 10.1.1 Ergodic Markov Chains 255 10.1.2 Metropolis Algorithm 258 10.2 PageRank 259 10.3 Spectral Clustering on Graphs 262 10.3.1 Laplacians and their EigenStructures 263 10.4 Communities in Graphs 267 10.4.1 Preferential Attachment 269 10.4.2 Betweenness 269 10.4.3 Modularity 270 Exercises 272 11 Big Data and Sketching 274 11.1 The Streaming Model 275 11.1.1 Mean and Variance 277 11.1.2 Reservoir Sampling 277 11.2 Frequent Items 278 11.2.1 Warm-Up: Majority 281 11.2.2 Misra-Gries Algorithm 282 11.2.3 Count-Min Sketch 283 11.2.4 Count Sketch 285 11.3 Matrix Sketching 286 11.3.1 Covariance Matrix Summation 287 11.3.2 Frequent Directions 288 11.3.3 Row Sampling 290 11.3.4 Random Projections and Count Sketch Hashing 291 Exercises 293 Index 295