A comprehensive introduction to machine learning that uses probabilistic models and inference as a unifying approach.Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach.The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package—PMTK (probabilistic modeling toolkit)—that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students. Cover Page......Page 1 Half Title Page......Page 2 Title Page......Page 4 Copyright Page......Page 5 Dedication......Page 6 Contents......Page 8 Preface......Page 28 1.1 Machine learning: what and why?......Page 32 1.1.1 Types of machine learning......Page 33 1.2.1 Classification......Page 34 1.2.2 Regression......Page 39 1.3 Unsupervised learning......Page 40 1.3.1 Discovering clusters......Page 41 1.3.2 Discovering latent factors......Page 42 1.3.3 Discovering graph structure......Page 44 1.3.4 Matrix completion......Page 45 1.4.2 A simple non-parametric classifier: K-nearest neighbors......Page 47 1.4.3 The curse of dimensionality......Page 49 1.4.5 Linear regression......Page 50 1.4.6 Logistic regression......Page 52 1.4.8 Model selection......Page 53 1.4.9 No free lunch theorem......Page 55 2.1 Introduction......Page 58 2.2.2 Fundamental rules......Page 59 2.2.3 Bayes rule......Page 60 2.2.4 Independence and conditional independence......Page 61 2.2.5 Continuous random variables......Page 63 2.2.7 Mean and variance......Page 64 2.3.1 The binomial and Bernoulli distributions......Page 65 2.3.2 The multinomial and multinoulli distributions......Page 66 2.3.4 The empirical distribution......Page 68 2.4.1 Gaussian (normal) distribution......Page 69 2.4.2 Degenerate pdf......Page 70 2.4.4 The gamma distribution......Page 72 2.4.5 The beta distribution......Page 73 2.4.6 Pareto distribution......Page 74 2.5.1 Covariance and correlation......Page 75 2.5.3 Multivariate Student t distribution......Page 77 2.5.4 Dirichlet distribution......Page 78 2.6.1 Linear transformations......Page 80 2.6.2 General transformations......Page 81 2.6.3 Central limit theorem......Page 82 2.7 Monte Carlo approximation......Page 83 2.7.1 Example: change of variables, the MC way......Page 84 2.7.3 Accuracy of Monte Carlo approximation......Page 85 2.8.1 Entropy......Page 87 2.8.2 KL divergence......Page 88 2.8.3 Mutual information......Page 90 3.2 Bayesian concept learning......Page 96 3.2.2 Prior......Page 98 3.2.3 Posterior......Page 99 3.2.4 Posterior predictive distribution......Page 102 3.3 The beta-binomial model......Page 103 3.3.1 Likelihood......Page 104 3.3.2 Prior......Page 105 3.3.3 Posterior......Page 106 3.3.4 Posterior predictive distribution......Page 108 3.4 The Dirichlet-multinomial model......Page 109 3.4.3 Posterior......Page 110 3.4.4 Posterior predictive......Page 112 3.5 Naive Bayes classifiers......Page 113 3.5.1 Model fitting......Page 114 3.5.2 Using the model for prediction......Page 116 3.5.4 Feature selection using mutual information......Page 117 3.5.5 Classifying documents using bag of words......Page 118 4.1.2 Basics......Page 128 4.1.3 MLE for an MVN......Page 130 4.2 Gaussian discriminant analysis......Page 132 4.2.1 Quadratic discriminant analysis (QDA)......Page 133 4.2.2 Linear discriminant analysis (LDA)......Page 134 4.2.3 Two-class LDA......Page 135 4.2.5 Strategies for preventing overfitting......Page 137 4.2.6 Regularized LDA *......Page 138 4.2.7 Diagonal LDA......Page 139 4.2.8 Nearest shrunken centroids classifier *......Page 140 4.3 Inference in jointly Gaussian distributions......Page 141 4.3.2 Examples......Page 142 4.3.3 Information form......Page 146 4.3.4 Proof of the result *......Page 147 4.4.1 Statement of the result......Page 150 4.4.2 Examples......Page 151 4.4.3 Proof of the result *......Page 155 4.5 Digression: The Wishart distribution *......Page 156 4.5.1 Inverse Wishart distribution......Page 157 4.6 Inferring the parameters of an MVN......Page 158 þÿ......Page 159 þÿ......Page 163 4.6.4 Sensor fusion with unknown precisions *......Page 169 5.2.1 MAP estimation......Page 180 5.2.2 Credible intervals......Page 183 5.2.3 Inference for a difference in proportions......Page 185 5.3 Bayesian model selection......Page 186 5.3.1 Bayesian Occams razor......Page 187 5.3.2 Computing the marginal likelihood (evidence)......Page 189 5.3.3 Bayes factors......Page 194 5.3.4 Jeffreys-Lindley paradox *......Page 195 5.4.1 Uninformative priors......Page 196 5.4.2 Jeffreys priors *......Page 197 5.4.4 Mixtures of conjugate priors......Page 199 5.5.1 Example: modeling related cancer rates......Page 202 5.6 Empirical Bayes......Page 203 5.6.2 Example: Gaussian-Gaussian model......Page 204 5.7 Bayesian decision theory......Page 207 5.7.1 Bayes estimators for common loss functions......Page 208 5.7.2 The false positive vs false negative tradeoff......Page 211 5.7.3 Other topics *......Page 215 6.2 Sampling distribution of an estimator......Page 222 6.2.1 Bootstrap......Page 223 6.2.2 Large sample theory for the MLE *......Page 224 6.3 Frequentist decision theory......Page 225 6.3.1 Bayes risk......Page 226 6.3.2 Minimax risk......Page 227 6.3.3 Admissible estimators......Page 228 6.4.2 Unbiased estimators......Page 231 6.4.3 Minimum variance estimators......Page 232 6.4.4 The bias-variance tradeoff......Page 233 6.5 Empirical risk minimization......Page 235 6.5.1 Regularized risk minimization......Page 236 6.5.3 Estimating the risk using cross validation......Page 237 6.5.4 Upper bounding the risk using statistical learning theory *......Page 240 6.5.5 Surrogate loss functions......Page 241 6.6 Pathologies of frequentist statistics *......Page 242 6.6.1 Counter-intuitive behavior of confidence intervals......Page 243 6.6.2 p-values considered harmful......Page 244 6.6.3 The likelihood principle......Page 245 6.6.4 Why isnt everyone a Bayesian?......Page 246 7.3 Maximum likelihood estimation (least squares)......Page 248 7.3.1 Derivation of the MLE......Page 250 7.3.2 Geometric interpretation......Page 251 7.3.3 Convexity......Page 252 7.4 Robust linear regression *......Page 254 7.5.1 Basic idea......Page 256 7.5.2 Numerically stable computation *......Page 258 7.5.3 Connection with PCA *......Page 259 7.5.4 Regularization effects of big data......Page 261 7.6 Bayesian linear regression......Page 262 7.6.1 Computing the posterior......Page 263 7.6.2 Computing the posterior predictive......Page 264 7.6.4 EB for linear regression (evidence procedure)......Page 265 8.3 Model fitting......Page 276 8.3.1 MLE......Page 277 8.3.2 Steepest descent......Page 278 8.3.3 Newtons method......Page 280 8.3.4 Iteratively reweighted least squares (IRLS)......Page 281 8.3.5 Quasi-Newton (variable metric) methods......Page 282 8.3.7 Multi-class logistic regression......Page 283 8.4 Bayesian logistic regression......Page 285 8.4.2 Derivation of the BIC......Page 286 8.4.4 Approximating the posterior predictive......Page 287 8.4.5 Residual analysis (outlier detection) *......Page 291 8.5 Online learning and stochastic optimization......Page 292 8.5.2 Stochastic optimization and risk minimization......Page 293 8.5.3 The LMS algorithm......Page 295 8.5.4 The perceptron algorithm......Page 296 8.5.5 A Bayesian view......Page 297 8.6 Generative vs discriminative classifiers......Page 298 8.6.1 Pros and cons of each approach......Page 299 8.6.2 Dealing with missing data......Page 300 8.6.3 Fishers linear discriminant analysis (FLDA) *......Page 302 9.2 The exponential family......Page 312 9.2.2 Examples......Page 313 9.2.3 Log partition function......Page 315 9.2.4 MLE for the exponential family......Page 317 9.2.5 Bayes for the exponential family *......Page 318 9.2.6 Maximum entropy derivation of the exponential family *......Page 320 9.3.1 Basics......Page 321 9.3.2 ML and MAP estimation......Page 323 9.4 Probit regression......Page 324 9.4.2 Latent variable interpretation......Page 325 9.4.4 Multinomial probit models *......Page 326 9.5.2 Application to personalized email spam filtering......Page 327 9.5.4 Other kinds of prior......Page 328 9.6.1 Example: semi-parametric GLMMs for medical data......Page 329 9.7 Learning to rank *......Page 331 9.7.2 The pairwise approach......Page 332 9.7.3 The listwise approach......Page 333 9.7.4 Loss functions for ranking......Page 334 10.1.1 Chain rule......Page 338 10.1.3 Graphical models......Page 339 10.1.4 Graph terminology......Page 340 10.1.5 Directed graphical models......Page 341 10.2.1 Naive Bayes classifiers......Page 342 10.2.2 Markov and hidden Markov models......Page 343 10.2.3 Medical diagnosis......Page 344 10.2.4 Genetic linkage analysis *......Page 346 10.2.5 Directed Gaussian graphical models *......Page 349 10.3 Inference......Page 350 10.4.1 Plate notation......Page 351 10.4.2 Learning from complete data......Page 353 10.4.3 Learning with missing and/or latent variables......Page 354 10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)......Page 355 10.5.3 Markov blanket and full conditionals......Page 358 10.6 Influence (decision) diagrams *......Page 359 11.2 Mixture models......Page 368 11.2.1 Mixtures of Gaussians......Page 370 11.2.3 Using mixture models for clustering......Page 371 11.2.4 Mixtures of experts......Page 373 11.3 Parameter estimation for mixture models......Page 376 11.3.1 Unidentifiability......Page 377 11.3.2 Computing a MAP estimate is non-convex......Page 378 11.4 The EM algorithm......Page 379 11.4.1 Basic idea......Page 380 11.4.2 EM for GMMs......Page 381 11.4.3 EM for mixture of experts......Page 388 11.4.4 EM for DGMs with hidden variables......Page 389 11.4.5 EM for the Student distribution *......Page 390 11.4.6 EM for probit regression *......Page 393 11.4.7 Theoretical basis for EM *......Page 394 11.4.8 Online EM......Page 396 11.4.9 Other EM variants *......Page 398 11.5.2 Model selection for non-probabilistic methods......Page 401 11.6 Fitting models with missing data......Page 403 11.6.1 EM for the MLE of an MVN with missing data......Page 404 12.1.1 FA is a low rank parameterization of an MVN......Page 412 12.1.2 Inference of the latent factors......Page 413 12.1.3 Unidentifiability......Page 414 12.1.4 Mixtures of factor analysers......Page 416 12.1.5 EM for factor analysis models......Page 417 12.2.1 Classical PCA: statement of the theorem......Page 418 12.2.2 Proof *......Page 420 12.2.3 Singular value decomposition (SVD)......Page 423 12.2.4 Probabilistic PCA......Page 426 12.2.5 EM algorithm for PCA......Page 427 12.3.1 Model selection for FA/PPCA......Page 429 12.3.2 Model selection for PCA......Page 430 12.4 PCA for categorical data......Page 433 12.5 PCA for paired and multi-view data......Page 435 12.5.1 Supervised PCA (latent factor regression)......Page 436 12.5.2 Partial least squares......Page 437 12.6 Independent Component Analysis (ICA)......Page 438 12.6.1 Maximum likelihood estimation......Page 441 12.6.2 The FastICA algorithm......Page 442 12.6.3 Using EM......Page 445 12.6.4 Other estimation principles *......Page 446 13.1 Introduction......Page 452 13.2 Bayesian variable selection......Page 453 13.2.1 The spike and slab model......Page 455 13.2.2 From the Bernoulli-Gaussian model to 0 regularization......Page 456 13.2.3 Algorithms......Page 457 13.3 e1 regularization: basics......Page 460 13.3.1 Why does e1 regularization yield sparse solutions?......Page 461 13.3.2 Optimality conditions for lasso......Page 462 13.3.3 Comparison of least squares, lasso, ridge and subset selection......Page 466 13.3.4 Regularization path......Page 467 13.3.5 Model selection......Page 470 13.3.6 Bayesian inference for linear models with Laplace priors......Page 471 13.4.2 LARS and other homotopy methods......Page 472 13.4.3 Proximal and gradient projection methods......Page 473 13.4.4 EM for lasso......Page 478 13.5.1 Group Lasso......Page 480 13.5.2 Fused lasso......Page 485 13.5.3 Elastic net (ridge and lasso combined)......Page 486 13.6 Non-convex regularizers......Page 488 13.6.2 Hierarchical adaptive lasso......Page 489 13.6.3 Other hierarchical priors......Page 493 13.7.1 ARD for linear regression......Page 494 13.7.3 Connection to MAP estimation......Page 496 13.7.4 Algorithms for ARD *......Page 497 13.8 Sparse coding *......Page 499 13.8.1 Learning a sparse coding dictionary......Page 500 13.8.2 Results of dictionary learning from image patches......Page 501 13.8.4 Image inpainting and denoising......Page 503 14.2 Kernel functions......Page 510 14.2.2 Kernels for comparing documents......Page 511 14.2.3 Mercer (positive definite) kernels......Page 512 14.2.5 Matern kernels......Page 513 14.2.6 String kernels......Page 514 14.2.7 Pyramid match kernels......Page 515 14.2.8 Kernels derived from probabilistic generative models......Page 516 14.3.1 Kernel machines......Page 517 14.3.2 L1VMs, RVMs, and other sparse vector machines......Page 518 14.4 The kernel trick......Page 519 14.4.2 Kernelized K-medoids clustering......Page 520 14.4.3 Kernelized ridge regression......Page 523 14.4.4 Kernel PCA......Page 524 14.5 Support vector machines (SVMs)......Page 527 14.5.1 SVMs for regression......Page 528 14.5.2 SVMs for classification......Page 529 14.5.4 Summary of key points......Page 535 14.6 Comparison of discriminative kernel methods......Page 536 14.7.1 Smoothing kernels......Page 538 14.7.2 Kernel density estimation (KDE)......Page 539 14.7.3 From KDE to KNN......Page 540 14.7.4 Kernel regression......Page 541 14.7.5 Locally weighted regression......Page 543 15.1 Introduction......Page 546 15.2 GPs for regression......Page 547 15.2.1 Predictions using noise-free observations......Page 548 15.2.2 Predictions using noisy observations......Page 549 15.2.3 Effect of the kernel parameters......Page 550 15.2.4 Estimating the kernel parameters......Page 552 15.2.6 Semi-parametric GPs *......Page 555 15.3.1 Binary classification......Page 556 15.3.2 Multi-class classification......Page 559 15.3.3 GPs for Poisson regression......Page 562 15.4.1 Linear models compared to GPs......Page 563 15.4.2 Linear smoothers compared to GPs......Page 564 15.4.4 L1VM and RVMs compared to GPs......Page 565 15.4.5 Neural networks compared to GPs......Page 566 15.4.6 Smoothing splines compared to GPs *......Page 567 15.4.7 RKHS methods compared to GPs *......Page 569 15.5 GP latent variable model......Page 571 15.6 Approximation methods for large datasets......Page 573 16.1 Introduction......Page 574 16.2.1 Basics......Page 575 16.2.2 Growing a tree......Page 576 16.2.3 Pruning a tree......Page 580 16.2.5 Random forests......Page 581 16.2.6 CART compared to hierarchical mixture of experts *......Page 582 16.3.1 Backfitting......Page 583 16.3.3 Multivariate adaptive regression splines (MARS)......Page 584 16.4 Boosting......Page 585 16.4.1 Forward stagewise additive modeling......Page 586 16.4.2 L2boosting......Page 588 16.4.3 AdaBoost......Page 589 16.4.4 LogitBoost......Page 590 16.4.5 Boosting as functional gradient descent......Page 591 16.4.6 Sparse boosting......Page 592 16.4.8 Why does boosting work so well?......Page 593 16.5 Feedforward neural networks (multilayer perceptrons)......Page 594 16.5.1 Convolutional neural networks......Page 595 16.5.3 A brief history of the field......Page 599 16.5.4 The backpropagation algorithm......Page 600 16.5.6 Regularization......Page 603 16.5.7 Bayesian inference......Page 607 16.6.1 Stacking......Page 611 16.6.3 Ensemble learning is not equivalent to Bayes model averaging......Page 612 16.7.1 Low-dimensional features......Page 613 16.7.2 High-dimensional features......Page 614 16.8 Interpreting black-box models......Page 616 17.2.1 Transition matrix......Page 620 17.2.2 Application: Language modeling......Page 622 17.2.3 Stationary distribution of a Markov chain *......Page 627 17.2.4 Application: Googles PageRank algorithm for web page ranking......Page 631 17.3 Hidden Markov models......Page 634 17.3.1 Applications of HMMs......Page 635 17.4.1 Types of inference problems for temporal models......Page 637 17.4.2 The forwards algorithm......Page 640 17.4.3 The forwards-backwards algorithm......Page 641 17.4.4 The Viterbi algorithm......Page 643 17.4.5 Forwards filtering, backwards sampling......Page 647 17.5.1 Training with fully observed data......Page 648 17.5.2 EM for HMMs (the Baum-Welch algorithm)......Page 649 17.5.4 Discriminative training......Page 651 17.6 Generalizations of HMMs......Page 652 17.6.1 Variable duration (semi-Markov) HMMs......Page 653 17.6.2 Hierarchical HMMs......Page 655 17.6.3 Input-output HMMs......Page 656 17.6.4 Auto-regressive and buried HMMs......Page 657 17.6.5 Factorial HMM......Page 658 17.6.7 Dynamic Bayesian networks (DBNs)......Page 659 18.1 Introduction......Page 662 18.2.1 SSMs for object tracking......Page 663 18.2.2 Robotic SLAM......Page 664 18.2.3 Online parameter learning using recursive least squares......Page 667 18.2.4 SSM for time series forecasting *......Page 668 18.3.1 The Kalman filtering algorithm......Page 671 18.3.2 The Kalman smoothing algorithm......Page 674 18.4.1 Identifiability and numerical stability......Page 677 18.5 Approximate online inference for non-linear, non-Gaussian SSMs......Page 678 18.5.1 Extended Kalman filter (EKF)......Page 679 18.5.2 Unscented Kalman filter (UKF)......Page 681 18.5.3 Assumed density filtering (ADF)......Page 683 18.6 Hybrid discrete/continuous SSMs......Page 686 18.6.1 Inference......Page 687 18.6.2 Application: data association and multi-target tracking......Page 689 18.6.3 Application: fault diagnosis......Page 690 18.6.4 Application: econometric forecasting......Page 691 19.2.1 Key properties......Page 692 19.2.2 An undirected alternative to d-separation......Page 694 19.2.3 Comparing directed and undirected graphical models......Page 695 19.3.1 The Hammersley-Clifford theorem......Page 696 19.3.2 Representing potential functions......Page 698 19.4.1 Ising model......Page 699 19.4.2 Hopfield networks......Page 700 19.4.3 Potts model......Page 702 19.4.4 Gaussian MRFs......Page 703 19.4.5 Markov logic networks *......Page 705 19.5.1 Training maxent models using gradient methods......Page 707 19.5.2 Training partially observed maxent models......Page 708 19.5.4 Pseudo likelihood......Page 709 19.5.5 Stochastic maximum likelihood......Page 710 19.5.6 Feature induction for maxent models *......Page 711 19.5.7 Iterative proportional fitting (IPF) *......Page 712 19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem......Page 715 19.6.2 Applications of CRFs......Page 717 19.6.3 CRF training......Page 723 19.7.1 SSVMs: a probabilistic view......Page 724 19.7.2 SSVMs: a non-probabilistic view......Page 726 19.7.3 Cutting plane methods for fitting SSVMs......Page 729 19.7.4 Online algorithms for fitting SSVMs......Page 731 19.7.5 Latent structural SVMs......Page 732 20.2.1 Serial protocol......Page 738 20.2.2 Parallel protocol......Page 740 20.2.3 Gaussian BP *......Page 741 20.2.4 Other BP variants *......Page 743 20.3 The variable elimination algorithm......Page 745 20.3.2 Computational complexity of VE......Page 748 20.4.1 Creating a junction tree......Page 751 20.4.2 Message passing on a junction tree......Page 753 20.4.3 Computational complexity of JTA......Page 756 20.5 Computational intractability of exact inference in the worst case......Page 757 20.5.1 Approximate inference......Page 758 21.1 Introduction......Page 762 21.2 Variational inference......Page 763 21.2.2 Forward or reverse KL? *......Page 764 21.3 The mean field method......Page 766 21.3.1 Derivation of the mean field update equations......Page 767 21.3.2 Example: mean field for the Ising model......Page 768 21.4 Structured mean field *......Page 770 21.4.1 Example: factorial HMM......Page 771 21.5.1 Example: VB for a univariate Gaussian......Page 773 21.5.2 Example: VB for linear regression......Page 777 21.6 Variational Bayes EM......Page 780 21.6.1 Example: VBEM for mixtures of Gaussians *......Page 781 21.8.1 Motivating applications......Page 787 21.8.2 Bohnings quadratic bound to the log-sum-exp function......Page 789 21.8.3 Bounds for the sigmoid function......Page 791 21.8.4 Other bounds and approximations to the log-sum-exp function......Page 793 21.8.5 Variational inference based on upper bounds......Page 794 22.2.1 A brief history......Page 798 22.2.2 LBP on pairwise models......Page 799 22.2.3 LBP on a factor graph......Page 800 22.2.4 Convergence......Page 802 22.2.5 Accuracy of LBP......Page 805 22.2.6 Other speedup tricks for LBP *......Page 806 22.3.1 UGMs represented in exponential family form......Page 807 22.3.2 The marginal polytope......Page 808 22.3.3 Exact inference as a variational optimization problem......Page 809 22.3.5 LBP as a variational optimization problem......Page 810 22.4.1 Generalized belief propagation......Page 814 22.4.2 Convex belief propagation......Page 816 22.5 Expectation propagation......Page 818 22.5.1 EP as a variational inference p......Page 819 22.5.2 Optimizing the EP objective using moment matching......Page 820 22.5.3 EP for the clutter problem......Page 822 22.5.4 LBP is a special case of EP......Page 823 22.5.5 Ranking players using TrueSkill......Page 824 22.6.1 Linear programming relaxation......Page 830 22.6.2 Max-product belief propagation......Page 831 22.6.3 Graphcuts......Page 832 22.6.4 Experimental comparison of graphcuts and BP......Page 835 22.6.5 Dual decomposition......Page 837 23.2.1 Using the cdf......Page 846 23.3.1 Basic idea......Page 848 23.3.2 Example......Page 849 23.3.4 Adaptive rejection sampling......Page 850 23.4.1 Basic idea......Page 851 23.4.2 Handling unnormalized distributions......Page 852 23.4.4 Sampling importance resampling (SIR)......Page 853 23.5 Particle filtering......Page 854 23.5.1 Sequential importance sampling......Page 855 23.5.3 The resampling step......Page 856 23.5.4 The proposal distribution......Page 858 23.5.6 Application: visual object tracking......Page 859 23.6.1 RBPF for switching LG-SSMs......Page 862 23.6.2 Application: tracking a maneuvering target......Page 863 23.6.3 Application: Fast SLAM......Page 865 24.1 Introduction......Page 868 24.2.2 Example: Gibbs sampling for the Ising model......Page 869 24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM......Page 871 24.2.4 Collapsed Gibbs sampling *......Page 872 24.2.5 Gibbs sampling for hierarchical GLMs......Page 875 24.2.6 BUGS and JAGS......Page 877 24.2.8 Blocking Gibbs sampling......Page 878 24.3.1 Basic idea......Page 879 24.3.2 Gibbs sampling is a special case of MH......Page 880 24.3.3 Proposal distributions......Page 881 24.3.4 Adaptive MCMC......Page 884 24.3.6 Why MH works *......Page 885 24.3.7 Reversible jump (trans-dimensional) MCMC *......Page 886 24.4.1 The burn-in phase......Page 887 24.4.2 Mixing rates of Markov chains *......Page 888 24.4.3 Practical convergence diagnostic......Page 889 24.4.4 Accuracy of MCMC......Page 891 24.4.5 How many chains?......Page 893 24.5.1 Auxiliary variable sampling for logistic regression......Page 894 24.5.2 Slice sampling......Page 895 24.5.3 Swendsen Wang......Page 897 24.6 Annealing methods......Page 899 24.6.1 Simulated annealing......Page 900 24.6.3 Parallel tempering......Page 902 24.7.2 Harmonic mean estimate......Page 903 24.7.3 Annealed importance sampling......Page 904 25.1.1 Measuring (dis)similarity......Page 906 25.1.2 Evaluating the output of clustering methods *......Page 907 25.2.1 From finite to infinite mixture models......Page 910 25.2.2 The Dirichlet process......Page 913 25.2.3 Applying Dirichlet processes to mixture modeling......Page 916 25.2.4 Fitting a DP mixture model......Page 917 25.3 Affinity propagation......Page 918 25.4 Spectral clustering......Page 921 25.4.1 Graph Laplacian......Page 922 25.4.2 Normalized graph Laplacian......Page 923 25.5 Hierarchical clustering......Page 924 25.5.1 Agglomerative clustering......Page 926 25.5.2 Divisive clustering......Page 929 25.5.4 Bayesian hierarchical clustering......Page 930 25.6 Clustering datapoints and features......Page 932 25.6.2 Multi-view clustering......Page 934 26.1 Introduction......Page 938 26.2.1 Relevance networks......Page 939 26.2.2 Dependency networks......Page 940 26.3 Learning tree structures......Page 941 26.3.1 Directed or undirected tree?......Page 942 26.3.3 Finding the MAP forest......Page 943 26.4.1 Markov equivalence......Page 945 26.4.2 Exact structural inference......Page 947 26.4.3 Scaling up to larger graphs......Page 951 26.5.1 Approximating the marginal likelihood when we have missing data......Page 953 26.5.2 Structural EM......Page 956 26.5.3 Discovering hidden variables......Page 957 26.5.4 Case study: Googles Rephil......Page 959 26.5.5 Structural equation models *......Page 960 26.6.1 Causal interpretation of DAGs......Page 962 26.6.2 Using causal DAGs to resolve Simpsons paradox......Page 964 26.6.3 Learning causal DAG structures......Page 966 26.7.1 MLE for a GGM......Page 969 26.7.2 Graphical lasso......Page 970 26.7.3 Bayesian inference for GGM structure *......Page 972 26.8.1 Graphical lasso for MRFs/CRFs......Page 973 26.8.2 Thin junction trees......Page 975 27.1 Introduction......Page 976 27.2.1 Mixture models......Page 977 27.2.2 Exponential family PCA......Page 978 27.2.3 LDA and mPCA......Page 979 27.2.4 GaP model and non-negative matrix factorization......Page 980 27.3.1 Basics......Page 981 27.3.3 Quantitatively evaluating LDA as a language model......Page 984 27.3.4 Fitting using (collapsed) Gibbs sampling......Page 986 27.3.5 Example......Page 987 27.3.6 Fitting using batch variational inference......Page 988 27.3.7 Fitting using online variational inference......Page 990 27.3.8 Determining the number of topics......Page 991 27.4.1 Correlated topic model......Page 992 27.4.2 Dynamic topic model......Page 993 27.4.3 LDA-HMM......Page 994 27.4.4 Supervised LDA......Page 998 27.5 LVMs for graph-structured data......Page 1001 27.5.1 Stochastic block model......Page 1002 27.5.2 Mixed membership stochastic block model......Page 1004 27.5.3 Relational topic model......Page 1005 27.6 LVMs for relational data......Page 1006 27.6.1 Infinite relational model......Page 1007 27.6.2 Probabilistic matrix factorization for collaborative filtering......Page 1010 27.7 Restricted Boltzmann machines (RBMs)......Page 1014 27.7.1 Varieties of RBMs......Page 1016 27.7.2 Learning RBMs......Page 1018 27.7.3 Applications of RBMs......Page 1022 28.2 Deep generative models......Page 1026 28.2.2 Deep Boltzmann machines......Page 1027 28.2.3 Deep belief networks......Page 1028 28.2.4 Greedy layer-wise learning of DBNs......Page 1029 28.3.1 Deep multi-layer perceptrons......Page 1030 28.3.2 Deep auto-encoders......Page 1031 28.4.1 Handwritten digit classification using DBNs......Page 1032 28.4.2 Data visualization and feature discovery using deep auto-encoders......Page 1033 28.4.3 Information retrieval using deep auto-encoders (semantic hashing)......Page 1034 28.4.4 Learning audio features using 1d convolutional DBNs......Page 1035 28.5 Discussion......Page 1036 Notation......Page 1040 Bibliography......Page 1046 Index to Code......Page 1078 Index to Keywords......Page 1081 This Textbook Offers A Comprehensive And Self-contained Introduction To The Field Of Machine Learning, Based On A Unified, Probabilistic Approach. The Coverage Combines Breadth And Depth, Offering Necessary Background Material On Such Topics As Probability, Optimization, And Linear Algebra As Well As Discussion Of Recent Developments In The Field, Including Conditional Random Fields, L1 Regularization, And Deep Learning. The Book Is Written In An Informal, Accessible Style, Complete With Pseudo-code For The Most Important Algorithms. All Topics Are Copiously Illustrated With Color Images And Worked Examples Drawn From Such Application Domains As Biology, Text Processing, Computer Vision, And Robotics. Rather Than Providing A Cookbook Of Different Heuristic Methods, The Book Stresses A Principled Model-based Approach, Often Using The Language Of Graphical Models To Specify Models In A Concise And Intuitive Way. Almost All The Models Described Have Been Implemented In A Matlab Software Package--pmtk (probabilistic Modeling Toolkit)--that Is Freely Available Online--back Cover. Probability -- Generative Models For Discrete Data -- Gaussian Models -- Bayesian Statistics -- Frequentist Statistics -- Linear Regression -- Logistic Regression -- Generalized Linear Models And The Exponential Family -- Directed Graphical Models (bayes Nets) -- Mixture Models And The Em Algorithm -- Latent Linear Models -- Sparse Linear Models -- Kernels -- Gaussian Processes -- Adaptive Basis Function Models -- Markov And Hidden Markov Models -- State Space Models -- Undirected Graphical Models (markov Random Fields) -- Exact Inference For Graphical Models -- Variational Inference -- More Variational Inference -- Monte Carlo Inference -- Markov Chain Monte Carlo (mcmc) Inference -- Clustering -- Graphical Model Structure Learning -- Latent Variable Models For Discrete Data -- Deep Learning -- Notation. Kevin P. Murphy. Includes Bibliographical References And Index. A comprehensive introduction to machine learning that uses probabilistic models and inference as a unifying approach. Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software packagePMTK (probabilistic modeling toolkit)that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.