- Level Foundation
- Duration 17 hours
- Course by University of Washington
About
Case Studies: Finding Similar Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?

In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. You will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.

Learning Outcomes: By the end of this course, you will be able to:
- Create a document retrieval system using k-nearest neighbors.
- Identify various similarity metrics for text data.
- Reduce computations in k-nearest neighbor search by using KD-trees.
- Produce approximate nearest neighbors using locality sensitive hashing.
- Compare and contrast supervised and unsupervised learning tasks.
- Cluster documents by topic using k-means.
- Describe how to parallelize k-means using MapReduce.
- Examine probabilistic clustering approaches using mixture models.
- Fit a mixture of Gaussians model using expectation maximization (EM).
- Perform mixed membership modeling using latent Dirichlet allocation (LDA).
- Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
- Compare and contrast initialization techniques for non-convex optimization objectives.
- Implement these techniques in Python.

Modules
What is this course about?
4
Videos
- Welcome and introduction to clustering and retrieval tasks
- Course overview
- Module-by-module topics covered
- Assumed background
5
Readings
- Important Update regarding the Machine Learning Specialization
- Slides presented in this module
- Software tools you'll need for this course
- A big week ahead!
- Get help and meet other learners. Join your Community!
Introduction to nearest neighbor search and algorithms
3
Videos
- Retrieval as k-nearest neighbor search
- 1-NN algorithm
- k-NN algorithm
1
Readings
- Slides presented in this module
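The retrieval-as-k-NN idea covered in these videos can be sketched in a few lines of NumPy (an illustrative sketch only; the document vectors and function name are assumptions, not course materials):

```python
import numpy as np

def knn_search(query, corpus, k):
    """Brute-force k-nearest-neighbor search under Euclidean distance.

    corpus: (n_docs, n_features) array of document vectors.
    Returns the indices of the k closest documents, nearest first.
    """
    dists = np.linalg.norm(corpus - query, axis=1)
    return np.argsort(dists)[:k]

# Three toy "documents" as 2-D feature vectors.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nearest = knn_search(np.array([1.0, 0.0]), docs, k=2)
```

Setting k=1 recovers the 1-NN algorithm as a special case.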
The importance of data representations and distance metrics
1
Assignment
- Representations and metrics
5
Videos
- Document representation
- Distance metrics: Euclidean and scaled Euclidean
- Writing (scaled) Euclidean distance using (weighted) inner products
- Distance metrics: Cosine similarity
- To normalize or not and other distance considerations
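The metrics discussed in this module have compact formulas; a minimal sketch (the weight vector and example inputs are illustrative assumptions):

```python
import numpy as np

def scaled_euclidean(a, b, weights=None):
    # Scaled Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2).
    # With no weights this reduces to ordinary Euclidean distance.
    w = np.ones_like(a) if weights is None else weights
    return np.sqrt(np.sum(w * (a - b) ** 2))

def cosine_similarity(a, b):
    # Cosine similarity: inner product of the two normalized vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d = scaled_euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
s = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Writing the scaled distance with a weight vector makes the "(weighted) inner products" view explicit: the squared distance is a weighted inner product of the difference vector with itself.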
Programming Assignment 1
1
Assignment
- Choosing features and metrics for nearest neighbor search
1
Readings
- Choosing features and metrics for nearest neighbor search
Scaling up k-NN search using KD-trees
1
Assignment
- KD-trees
6
Videos
- Complexity of brute force search
- KD-tree representation
- NN search with KD-trees
- Complexity of NN search with KD-trees
- Visualizing scaling behavior of KD-trees
- Approximate k-NN search using KD-trees
1
Readings
- (OPTIONAL) A worked-out example for KD-trees
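As a sketch of KD-tree search in practice (assuming SciPy is available; the data here is random and purely illustrative), `scipy.spatial.cKDTree` supports both exact and approximate queries:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))

tree = cKDTree(points)                  # build the tree once
dist, idx = tree.query(points[0], k=3)  # exact 3-NN search

# Approximate search: eps > 0 lets the search prune more aggressively,
# returning neighbors within (1 + eps) of the true nearest distance.
dist_approx, idx_approx = tree.query(points[0], k=3, eps=0.5)
```

The `eps` knob mirrors the approximate k-NN idea in the last video: trade a bounded loss in accuracy for faster queries.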
Locality sensitive hashing for approximate NN search
1
Assignment
- Locality Sensitive Hashing
7
Videos
- Limitations of KD-trees
- LSH as an alternative to KD-trees
- Using random lines to partition points
- Defining more bins
- Searching neighboring bins
- LSH in higher dimensions
- (OPTIONAL) Improving efficiency through multiple tables
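The random-lines idea in these videos generalizes to random hyperplanes in higher dimensions; a minimal sketch (the data, seed, and function name are assumptions for illustration):

```python
import numpy as np

def lsh_bins(data, n_planes, seed=0):
    """Hash each row vector to a bin via signs of random hyperplane projections."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((data.shape[1], n_planes))
    bits = (data @ planes >= 0).astype(int)   # one sign bit per plane
    # Interpret the bit pattern as an integer bin index in [0, 2^n_planes).
    return bits @ (1 << np.arange(n_planes))

docs = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.3]])
bins = lsh_bins(docs, n_planes=4)
```

Nearby vectors tend to fall on the same side of most random planes, so they usually land in the same bin; searching neighboring bins (bit patterns at small Hamming distance) recovers candidates that a single unlucky plane separated.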
Programming Assignment 2
1
Assignment
- Implementing Locality Sensitive Hashing from scratch
1
Readings
- Implementing Locality Sensitive Hashing from scratch
Summarizing nearest neighbor search
1
Videos
- A brief recap
Introduction to clustering
3
Videos
- The goal of clustering
- An unsupervised task
- Hope for unsupervised learning, and some challenge cases
1
Readings
- Slides presented in this module
Clustering via k-means
1
Assignment
- k-means
4
Videos
- The k-means algorithm
- k-means as coordinate descent
- Smart initialization via k-means++
- Assessing the quality and choosing the number of clusters
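The k-means algorithm alternates the two coordinate-descent steps described in these videos; a minimal NumPy sketch (random initialization rather than k-means++, and toy data, both assumptions for illustration):

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated toy clusters, at the origin and at (10, 10).
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
centers, labels = kmeans(X, k=2)
```

Each step can only lower the within-cluster sum of squares, which is why the algorithm converges (to a local optimum, hence the value of smart initialization).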
Programming Assignment
1
Assignment
- Clustering text data with k-means
1
Readings
- Clustering text data with k-means
MapReduce for scaling k-means
1
Assignment
- MapReduce for k-means
4
Videos
- Motivating MapReduce
- The general MapReduce abstraction
- MapReduce execution overview and combiners
- MapReduce for k-means
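One k-means iteration fits the MapReduce abstraction directly: map each point to its nearest center, then reduce per center to a new mean. A single-machine simulation of that data flow (the helper names and toy data are assumptions, not a real MapReduce framework):

```python
import numpy as np
from collections import defaultdict

def mapper(point, centers):
    # Map: emit (nearest-center id, (point, 1)) for one data point.
    j = int(np.argmin(((centers - point) ** 2).sum(axis=1)))
    return j, (point, 1)

def reducer(values):
    # Reduce: sum points and counts for one center id, then divide.
    total = sum(p for p, _ in values)
    count = sum(c for _, c in values)
    return total / count

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = [np.array([1.0, 1.0]), np.array([9.0, 9.0]), np.array([0.0, 2.0])]

groups = defaultdict(list)
for p in points:
    key, value = mapper(p, centers)
    groups[key].append(value)            # shuffle: group values by center id
new_centers = {k: reducer(v) for k, v in groups.items()}
```

Because the reduce is just a sum, a combiner can pre-aggregate (partial sum, partial count) pairs on each mapper machine before the shuffle, which is the efficiency point made in the execution-overview video.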
Summarizing clustering with k-means
2
Videos
- Other applications of clustering
- A brief recap
Motivating and setting the foundation for mixture models
4
Videos
- Motivating probabilistic clustering models
- Aggregating over unknown classes in an image dataset
- Univariate Gaussian distributions
- Bivariate and multivariate Gaussians
1
Readings
- Slides presented in this module
Mixtures of Gaussians for clustering
3
Videos
- Mixture of Gaussians
- Interpreting the mixture of Gaussian terms
- Scaling mixtures of Gaussians for document clustering
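The mixture-of-Gaussians density interpreted in these videos is a weighted sum of component densities; a univariate sketch (the parameter values are illustrative assumptions):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # Univariate Gaussian density N(x | mu, var).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, mus, variances):
    # Mixture density: sum_k pi_k * N(x | mu_k, var_k),
    # where the weights pi_k are nonnegative and sum to one.
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, mus, variances))

p = mixture_pdf(0.0, weights=[0.5, 0.5], mus=[0.0, 4.0], variances=[1.0, 1.0])
```

Each term's weight is the prior probability of that cluster, and the component density says how likely the observation is under that cluster, which is exactly the interpretation the second video walks through.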
Expectation Maximization (EM) building blocks
4
Videos
- Computing soft assignments from known cluster parameters
- (OPTIONAL) Responsibilities as Bayes' rule
- Estimating cluster parameters from known cluster assignments
- Estimating cluster parameters from soft assignments
The EM algorithm
1
Assignment
- EM for Gaussian mixtures
3
Videos
- EM iterates in equations and pictures
- Convergence, initialization, and overfitting of EM
- Relationship to k-means
1
Readings
- (OPTIONAL) A worked-out example for EM
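The EM iterates shown in equations and pictures can be sketched for a univariate Gaussian mixture (a minimal sketch; the toy data and initialization are assumptions for illustration):

```python
import numpy as np

def em_step(x, weights, mus, variances):
    """One EM iteration for a univariate Gaussian mixture."""
    w, mu, var = map(np.asarray, (weights, mus, variances))
    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, var_k).
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments.
    n_k = r.sum(axis=0)
    new_w = n_k / len(x)
    new_mu = (r * x[:, None]).sum(axis=0) / n_k
    new_var = (r * (x[:, None] - new_mu) ** 2).sum(axis=0) / n_k
    return new_w, new_mu, new_var

x = np.array([-3.0, -2.8, 3.0, 3.2])       # two well-separated groups
w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(30):
    w, mu, var = em_step(x, w, mu, var)
```

On this toy data the means converge near -2.9 and 3.1; with hard (0/1) responsibilities instead of soft ones, the same two steps reduce to k-means, the relationship covered in the last video.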
Summarizing mixture models
1
Videos
- A brief recap
Programming Assignment 1
1
Assignment
- Implementing EM for Gaussian mixtures
1
Readings
- Implementing EM for Gaussian mixtures
Programming Assignment 2
1
Assignment
- Clustering text data with Gaussian mixtures
1
Readings
- Clustering text data with Gaussian mixtures
Introduction to latent Dirichlet allocation
1
Assignment
- Latent Dirichlet Allocation
4
Videos
- Mixed membership models for documents
- An alternative document clustering model
- Components of latent Dirichlet allocation model
- Goal of LDA inference
1
Readings
- Slides presented in this module
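The components of the LDA model can be made concrete by sampling from its generative process (a minimal sketch; the tiny topic-word table, vocabulary, and function name are assumptions for illustration):

```python
import numpy as np

def generate_document(topic_word, alpha, doc_len, rng):
    """Sample one document from the LDA generative process.

    topic_word: (n_topics, vocab_size) per-topic word distributions.
    alpha: Dirichlet prior over this document's topic proportions.
    """
    theta = rng.dirichlet(alpha)                  # topic proportions for the doc
    words = []
    for _ in range(doc_len):
        z = rng.choice(len(alpha), p=theta)       # topic assignment for one word
        words.append(rng.choice(topic_word.shape[1], p=topic_word[z]))
    return words

rng = np.random.default_rng(0)
topic_word = np.array([[0.9, 0.1, 0.0, 0.0],      # topic 0 favors words 0-1
                       [0.0, 0.0, 0.1, 0.9]])     # topic 1 favors words 2-3
doc = generate_document(topic_word, alpha=[1.0, 1.0], doc_len=10, rng=rng)
```

LDA inference runs this story in reverse: given only the words, recover the topic-word distributions, the per-document proportions theta, and the per-word assignments z.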
Bayesian inference via Gibbs sampling
3
Videos
- The need for Bayesian inference
- Gibbs sampling from 10,000 feet
- A standard Gibbs sampler for LDA
Collapsed Gibbs sampling for LDA
4
Videos
- What is collapsed Gibbs sampling?
- A worked example for LDA: Initial setup
- A worked example for LDA: Deriving the resampling distribution
- Using the output of collapsed Gibbs sampling
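The resampling distribution derived in the worked example has a simple closed form over counts; a sketch (the toy counts and hyperparameters are assumptions for illustration):

```python
import numpy as np

def resample_probs(doc_topic_counts, topic_word_counts, topic_counts,
                   word, alpha, beta, vocab_size):
    """Collapsed Gibbs resampling distribution for one word's topic.

    p(z = k | rest) is proportional to
        (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta),
    where all counts exclude the word currently being resampled.
    """
    probs = ((doc_topic_counts + alpha)
             * (topic_word_counts[:, word] + beta)
             / (topic_counts + vocab_size * beta))
    return probs / probs.sum()

# Toy counts for 2 topics and a 4-word vocabulary.
p = resample_probs(doc_topic_counts=np.array([3.0, 1.0]),
                   topic_word_counts=np.array([[5.0, 1.0, 1.0, 1.0],
                                               [1.0, 1.0, 1.0, 5.0]]),
                   topic_counts=np.array([8.0, 8.0]),
                   word=0, alpha=0.1, beta=0.1, vocab_size=4)
```

Because the topic and document distributions are integrated out ("collapsed"), the sampler only tracks counts, which is what makes this variant practical at scale.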
Summarizing latent Dirichlet allocation
1
Assignment
- Learning LDA model via Gibbs sampling
1
Videos
- A brief recap
Programming Assignment
1
Assignment
- Modeling text topics with Latent Dirichlet Allocation
1
Readings
- Modeling text topics with Latent Dirichlet Allocation
What we've learned
4
Videos
- Module 1 recap
- Module 2 recap
- Module 3 recap
- Module 4 recap
1
Readings
- Slides presented in this module
Hierarchical clustering and clustering for time series segmentation
6
Videos
- Why hierarchical clustering?
- Divisive clustering
- Agglomerative clustering
- The dendrogram
- Agglomerative clustering details
- Hidden Markov models
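Agglomerative clustering and the dendrogram covered above can be sketched with SciPy (assuming SciPy is available; the four-point toy dataset is an illustrative assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])

# Agglomerative clustering: repeatedly merge the two closest clusters.
# Z encodes the merge tree (the dendrogram) bottom-up.
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the tree at different heights yields coarser or finer clusterings from one run, which is the practical appeal of the hierarchical approach.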
Programming Assignment
1
Assignment
- Modeling text data with a hierarchy of clusters
1
Readings
- Modeling text data with a hierarchy of clusters
Summary and what's ahead in the specialization
2
Videos
- What we didn't cover
- Thank you!
Auto Summary
Explore document similarity with "Machine Learning: Clustering & Retrieval" on Coursera, a foundational 17-hour course from the University of Washington. You will learn to find and recommend similar documents using similarity-based algorithms, covering k-nearest neighbors, KD-trees, locality sensitive hashing, k-means clustering, MapReduce, Gaussian mixture models, and latent Dirichlet allocation, all with hands-on implementation in Python. Ideal for data science learners starting out in document clustering and retrieval.

Instructors
- Emily Fox
- Carlos Guestrin