Mining the Web: Discovering Knowledge from Hypertext Data is the first book devoted entirely to techniques for producing knowledge from the vast body of unstructured Web data. Building on an initial survey of infrastructural issues, including Web crawling and indexing, Chakrabarti examines low-level machine learning techniques as they relate specifically to the challenges of Web mining. He then devotes the final part of the book to applications that unite infrastructure and analysis to bring machine learning to bear on systematically acquired and stored data. Here the focus is on results: the strengths and weaknesses of these applications, along with their potential as foundations for further progress. From Chakrabarti's work (painstaking, critical, and forward-looking), readers will gain the theoretical and practical understanding they need to contribute to the Web mining effort.

Contents:

FOREWORD
PREFACE
1 - Introduction
  1.1 Crawling and Indexing
  1.2 Topic Directories
  1.3 Clustering and Classification
  1.4 Hyperlink Analysis
  1.6 Structured vs. Unstructured Data Mining
  1.7 Bibliographic Notes
Part I - Infrastructure
2 - Crawling the Web
  2.1 HTML and HTTP Basics
  2.2 Crawling Basics
  2.3 Engineering Large-Scale Crawlers
  2.4 Putting Together a Crawler
  2.5 Bibliographic Notes
  3.1 Boolean Queries and the Inverted Index
  3.2 Relevance Ranking
  3.3 Similarity Search
  3.4 Bibliographic Notes
Part II - Learning
4 - Similarity and Clustering
  4.1 Formulations and Approaches
  4.2 Bottom-Up and Top-Down Partitioning Paradigms
  4.3 Clustering and Visualization via Embeddings
  4.4 Probabilistic Approaches to Clustering
  4.5 Collaborative Filtering
  4.6 Bibliographic Notes
5 - Supervised Learning
  5.1 The Supervised Learning Scenario
  5.2 Overview of Classification Strategies
  5.3 Evaluating Text Classifiers
  5.4 Nearest Neighbor Learners
  5.5 Feature Selection
  5.6 Bayesian Learners
  5.7 Exploiting Hierarchy among Topics
  5.8 Maximum Entropy Learners
  5.9 Discriminative Classification
  5.10 Hypertext Classification
  5.11 Bibliographic Notes
6 - Semisupervised Learning
  6.1 Expectation Maximization
  6.2 Labeling Hypertext Graphs
  6.3 Co-training
  6.4 Bibliographic Notes
Part III - Applications
7 - Social Network Analysis
  7.1 Social Sciences and Bibliometry
  7.2 PageRank and HITS
  7.3 Shortcomings of the Coarse-Grained Graph Model
  7.4 Enhanced Models and Techniques
  7.5 Evaluation of Topic Distillation
  7.6 Measuring and Modeling the Web
  7.7 Bibliographic Notes
8 - Resource Discovery
  8.1 Collecting Important Pages Preferentially
  8.2 Similarity Search Using Link Topology
  8.3 Topical Locality and Focused Crawling
  8.4 Discovering Communities
  8.5 Bibliographic Notes
9 - The Future of Web Mining
  9.1 Information Extraction
  9.2 Natural Language Processing
  9.3 Question Answering
  9.4 Profiles, Personalization, and Collaboration
REFERENCES
INDEX

* A comprehensive, critical exploration of statistics-based attempts to make sense of Web mining.
* Details the special challenges associated with analyzing unstructured and semi-structured data.
* Looks at how classical Information Retrieval techniques have been modified for use with Web data.
* Focuses on today's dominant learning methods: clustering and classification, hyperlink analysis, and supervised and semi-supervised learning.
* Analyzes current applications for resource discovery and social network analysis.
* An excellent way to introduce students to especially vital applications of data mining and machine learning technology.

Mining the Web: Discovering Knowledge from Hypertext Data is the first book devoted entirely to techniques for extracting and producing knowledge from the vast body of unstructured Web data. Building on an initial survey of infrastructural issues, including Web crawling and indexing, Chakrabarti examines machine learning techniques as they relate specifically to the challenges of Web mining and provides applications of machine learning to systematically acquire, store, and analyze data. Here the focus is on results: the strengths and weaknesses of these applications, along with their potential as foundations for further progress toward a Web that is more aware of content semantics. This thorough and forward-looking book gives the theoretical and practical foundations you need to build innovative applications for mining the Web.

The book examines low-level machine learning techniques as they relate specifically to the challenges of Web mining. It focuses on applications that unite infrastructure and analysis to bring machine learning to bear on systematically acquired and stored data. The World Wide Web is the largest and most widely known repository of hypertext.