-->

Wednesday, April 27, 2016

Data Science Basics - Structured

Big data is data that exceeds normal processing techniques because it is too big, moves too fast, or simply doesn't fit the structural requirements of a conventional architecture.


Clustering is one way of classifying structured datasets. Clusters are useful when you want to know something about a larger subset of the data, and only about that subset.

Some situations where we might cluster include automatically tagging someone in a photograph based on prior photos of that person, or recommending a new song to someone based on prior music selections.
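To make the idea of grouping similar records concrete, here is a minimal sketch of one common clustering method, k-means, in plain Python. The post does not name a specific algorithm, so k-means is our assumption, and the `listeners` data and the `k_means` helper are made up for illustration.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Group points around the nearest of the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, clusters

# Hypothetical listeners scored on two made-up taste features, each in 0..1.
listeners = [(0.1, 0.2), (0.15, 0.1), (0.2, 0.25),
             (0.8, 0.9), (0.9, 0.85), (0.85, 0.8)]

# Start from one listener in each apparent group (an arbitrary choice).
centroids, clusters = k_means(listeners, centroids=[listeners[0], listeners[3]])
print(clusters)
```

A new listener could then be matched to the nearest centroid and offered songs that are popular within that cluster.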

When first using a clustering algorithm, it helps to have a set of training data. Two of the risks we run with training data are overfitting and overgeneralization.

"If it walks like a duck and quacks like a duck, it might be a duck."

Overfitting: "It is a duck only if it looks and quacks precisely as I have observed ducks. If a new species of duck is added to the dataset, it can't be a duck."

Overgeneralization: "If it hobbles on two legs and emits a high-pitched noise, it must be a duck."
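The duck rules above can be sketched as three tiny hand-written classifiers. This is only an illustration: the `Animal` fields, the sample animals, and the rule functions are all hypothetical, and a real model would learn its rule from data rather than have it written by hand.

```python
from collections import namedtuple

# Hypothetical features for an observed animal.
Animal = namedtuple("Animal", "legs gait sound weight_kg")

# The only ducks we have seen so far (made-up training data).
training_ducks = {
    Animal(2, "waddle", "quack", 1.1),
    Animal(2, "waddle", "quack", 1.2),
}

def overfit_rule(animal):
    """Overfitting: a duck only if it exactly matches a duck we already saw."""
    return animal in training_ducks

def overgeneral_rule(animal):
    """Overgeneralization: anything two-legged that makes a noise is a duck."""
    return animal.legs == 2 and animal.sound is not None

def balanced_rule(animal):
    """A middle ground: walks like a duck and quacks like a duck."""
    return animal.gait == "waddle" and animal.sound == "quack"

new_duck_species = Animal(2, "waddle", "quack", 2.0)   # heavier than any seen duck
squeaky_toddler = Animal(2, "hobble", "squeal", 12.0)  # clearly not a duck

for name, animal in [("new duck", new_duck_species), ("toddler", squeaky_toddler)]:
    print(name, overfit_rule(animal), overgeneral_rule(animal), balanced_rule(animal))
```

The overfit rule rejects the new duck species because its weight was never observed, while the overgeneral rule accepts the toddler. Only the middle rule gets both right.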