When we talked about the Naïve Bayes model and the theory and the formulation behind it,

we didn't really focus on the features and what the features represented.

There are two ways in which Naïve Bayes features can be modeled.

These are the two classic variants of Naïve Bayes for text.

One is the multinomial Naïve Bayes model and the other is the Bernoulli Naïve Bayes model,

and we will talk about both in turn.

The multinomial Naïve Bayes model is one in

which you assume that the data follows a multinomial distribution.

So what does that mean?

It means that when you have the set of features that define a particular data instance,

we're assuming that each of these occurs independently of the others and

that each feature can occur multiple times.

So, counts become important in this multinomial distribution model.

So each feature value is

some sort of count or weighted count.

Examples would be word occurrence counts, TF-IDF weighting, and so on.

So, suppose you have a piece of text, a document,

and you are finding out what are all the words that were used in this document.

That would be called a bag-of-words model.

And if you just use the words,

whether they were present or not,

then that is a Bernoulli distribution for each feature.

So, it becomes a multivariate Bernoulli when you're talking about it for all the words.

But if you say that the number of

times a particular word occurs is important, so for example,

if the statement is "to be or not to be,"

and you want to somehow say that the word "to" occurs twice,

the word "be" occurs twice,

the word "or" occurs just once, and so on,

you want to somehow keep track of what was the frequency of each of these words.
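
To make the counting concrete, here is a minimal sketch of the word counts for the example sentence. The helper name `word_counts` and the simple whitespace tokenization are assumptions made for illustration; real tokenizers handle punctuation and casing more carefully.

```python
from collections import Counter

def word_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

counts = word_counts("to be or not to be")

# "to" and "be" occur twice; "or" and "not" occur once.
print(counts["to"], counts["be"], counts["or"], counts["not"])  # 2 2 1 1
```

These per-word counts are the kind of feature values a multinomial model works with.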

And then, if you want to give more importance to rarer words,

then you would add on something called

term frequency-inverse document frequency (TF-IDF) weighting.

So you not only give importance to the frequency,

but also ask how common this word is in the entire collection,

and that's where the idea of weighting comes from.

So for example, the word THE is very common,

it occurs in almost every sentence,

it occurs in every document,

so it is not very informative.

But a word like

SIGNIFICANT is significant because it's not going to occur in every document.

So, you want to give higher importance to a document that

has the word SIGNIFICANT as compared to the word THE,

and that kind of variation in

weighting is possible when you're doing a multinomial Naïve Bayes model.
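
The weighting idea can be sketched with one simple TF-IDF variant, tf × log(N / df), where N is the number of documents and df is the number of documents containing the word. This exact formula, the function name `tf_idf`, and the three toy documents are assumptions for illustration; libraries typically use smoothed or normalized variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical toy collection: "the" appears in every document.
docs = [
    "the result is significant",
    "the cat sat on the mat",
    "the dog chased the cat",
]
weights = tf_idf(docs)

print(weights[0]["the"])          # 0.0, because "the" is in every document
print(weights[0]["significant"])  # log(3), about 1.0986
```

Under this variant a word that appears in every document gets weight zero, while a word that appears in only one document gets the largest weight.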

The second model is the Bernoulli Naïve Bayes model.

Here, the assumption is that the data follows a multivariate Bernoulli distribution,

where each feature is a binary feature, that is,

the word is present or not present,

and it's only the information about whether the word is present that is

modeled; it does not matter how many times that word was present.

In fact, it also does not matter whether the word is

informative or not, that is, whether it is the word THE,

which is fairly common in everything,

or a word like SIGNIFICANT,

which is less common across documents.

So when you have just the binary features, I mean,

just a binary model for every feature,

then the entire data,

the set of features follows what is called a multivariate Bernoulli model.
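
As a small illustration of binary features, the sketch below turns a document into a presence/absence vector over a fixed vocabulary. The function name `binary_features` and the five-word vocabulary are hypothetical choices for this example.

```python
def binary_features(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical vocabulary; "the" does not appear in the example sentence.
vocabulary = ["be", "not", "or", "the", "to"]

print(binary_features("to be or not to be", vocabulary))  # [1, 1, 1, 0, 1]
```

Note that "to" and "be" each appear twice in the sentence, but the vector records only that they are present.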

So these are the two standard classic variants in Naïve Bayes,

and you'll see that most of the approaches and most of

the tools that you have for Naïve Bayes modeling give you that option,

that is, multinomial Naïve Bayes or Bernoulli Naïve Bayes.

It's fairly common in text documents to use the multinomial Naïve Bayes,

but there are instances where you would want to go the Bernoulli route,

especially if you want to say that the frequency is

immaterial and it's just

the presence or absence of a word that is important.
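
To see the difference between the two variants, the sketch below scores a document against a single class under each model. It follows one common textbook formulation with add-one smoothing, which is an assumption and not something stated in this lecture; the function names and the toy training documents are also hypothetical.

```python
import math
from collections import Counter

def multinomial_log_likelihood(tokens, class_docs, vocabulary):
    """log P(doc | class) up to a constant: every token occurrence contributes."""
    counts = Counter(w for doc in class_docs for w in doc)
    total = sum(counts[w] for w in vocabulary)
    return sum(
        math.log((counts[w] + 1) / (total + len(vocabulary)))  # add-one smoothing
        for w in tokens
        if w in vocabulary
    )

def bernoulli_log_likelihood(tokens, class_docs, vocabulary):
    """log P(doc | class): every vocabulary word contributes, present or absent."""
    present = set(tokens)
    n = len(class_docs)
    score = 0.0
    for w in vocabulary:
        # Smoothed fraction of this class's documents that contain the word.
        p = (sum(1 for doc in class_docs if w in doc) + 1) / (n + 2)
        score += math.log(p) if w in present else math.log(1 - p)
    return score

# Hypothetical training documents for one class, already tokenized.
class_docs = [["to", "be", "or", "not", "to", "be"], ["the", "question"]]
vocabulary = sorted({w for doc in class_docs for w in doc})

once = ["to", "be"]
repeated = ["to", "be", "to", "be"]

# Repeating words changes the multinomial score but not the Bernoulli score.
print(multinomial_log_likelihood(once, class_docs, vocabulary))
print(multinomial_log_likelihood(repeated, class_docs, vocabulary))
print(bernoulli_log_likelihood(once, class_docs, vocabulary))
print(bernoulli_log_likelihood(repeated, class_docs, vocabulary))
```

In this sketch the multinomial score depends on how often each word is repeated, while the Bernoulli score depends only on which vocabulary words appear, including the ones that are absent.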