Clustering is one form of classifying structured datasets. It is useful when you want to know something about a larger subset of the data, and only about that subset.
Situations where we might cluster include automatically tagging someone in a photograph based on prior photos of that person, or recommending a new song to someone based on their prior music selections.
When first using a clustering algorithm, it helps to have a set of training data. Two risks we run with that training data are overfitting and overgeneralization.
"If it walks like a duck and quacks like a duck, it might be a duck."
Overfitting: "It is a duck only if it looks and quacks precisely like the ducks I have observed. If a new species of duck is added to the dataset, it can't be a duck."
Overgeneralization: "If it hobbles on two legs and emits a high-pitched noise, it must be a duck."
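To make the duck analogy concrete, here is a minimal Python sketch of the two failure modes next to a more balanced rule. It is not a real clustering algorithm, just hand-written rules for illustration. The `Animal` class, its features (`legs`, `waddles`, `call_pitch_hz`), the observed ducks, and every threshold are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Animal:
    legs: int
    waddles: bool
    call_pitch_hz: float  # invented feature, for illustration only


# Hypothetical "training data": ducks we have already observed.
OBSERVED_DUCKS = [
    Animal(legs=2, waddles=True, call_pitch_hz=400.0),
    Animal(legs=2, waddles=True, call_pitch_hz=450.0),
]


def overfit_is_duck(animal):
    # Overfitting: accept only exact copies of ducks seen in training.
    return animal in OBSERVED_DUCKS


def overgeneral_is_duck(animal):
    # Overgeneralization: anything on two legs with a high-pitched call.
    # The 300 Hz cutoff is an arbitrary assumption.
    return animal.legs == 2 and animal.call_pitch_hz > 300.0


def balanced_is_duck(animal, tolerance_hz=100.0):
    # A middle ground: two legs, waddles, and a call within some
    # tolerance of at least one observed duck.
    return (
        animal.legs == 2
        and animal.waddles
        and any(
            abs(animal.call_pitch_hz - duck.call_pitch_hz) <= tolerance_hz
            for duck in OBSERVED_DUCKS
        )
    )


new_duck_species = Animal(legs=2, waddles=True, call_pitch_hz=500.0)
songbird = Animal(legs=2, waddles=False, call_pitch_hz=3000.0)

print(overfit_is_duck(new_duck_species))   # rejects the new duck species
print(overgeneral_is_duck(songbird))       # accepts the songbird as a duck
print(balanced_is_duck(new_duck_species))  # accepts the new duck species
print(balanced_is_duck(songbird))          # rejects the songbird
```

The overfit rule misses a genuine duck because it memorized its training examples, while the overgeneral rule lets a songbird through because its criteria are too loose. Where the tolerance should sit in practice depends on the data.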