Okay, you've gotten the data, and you've read up on natural language processing and
text mining, so it's time to start digging into the data.
The two basic goals for this task are tokenization and profanity filtering.
So the basic idea is that you want to be able to take a bunch of text and
divide it into what we would call words.
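To make that concrete, here's one way a first-pass tokenizer might look. This is a minimal sketch in Python, and the regular expression is just one of many reasonable choices; the function name and the decision to drop anything that isn't a letter are assumptions for illustration, not requirements of the task.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens.

    A deliberately simple approach: treat any run of letters
    (with optional internal apostrophes, so "you've" stays whole)
    as one token. Everything else acts as a separator.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

print(tokenize("Okay, you've gotten the data -- 100% ready!"))
# ['okay', "you've", 'gotten', 'the', 'data', 'ready']
```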
So there are a number of issues that you're going to have to deal with here,
including things like how to handle punctuation,
how to treat digits and capital versus lowercase letters,
and how to deal with typos, because people can spell things incorrectly.
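Here's a sketch of how those decisions might look in code. Each step is one possible answer to the questions above, not the required one; the keep_digits flag is a made-up parameter showing how you might expose a decision instead of hard-coding it. Typo correction is left out here, since even a basic spell-checker is a project in itself.

```python
import re

def clean_tokens(text, keep_digits=False):
    """Normalize text before tokenizing, making each decision explicit."""
    text = text.lower()                      # decision: fold case
    if not keep_digits:
        text = re.sub(r"\d+", " ", text)     # decision: drop numbers
    text = re.sub(r"[^a-z'\s]", " ", text)   # decision: strip punctuation,
                                             # but keep apostrophes
    return text.split()

print(clean_tokens("It's 2024 -- Time to tokenize!!!"))
# ["it's", 'time', 'to', 'tokenize']
```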
So, as you're coding your solution, you're going to have to think through
a strategy for dealing with all of these issues,
one that balances performance against accuracy.
And so, keep in mind you're going to have to make a lot of decisions as you go
through this process.
And it's not always going to be obvious which choice is best,
because there often isn't a single right answer.
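As one example of how these decisions might fit together, here's a minimal sketch that chains tokenization with the profanity-filtering goal mentioned at the start. The PROFANITY set is a placeholder (in practice you'd load a published word list from a file), and the choice to drop banned words rather than mask them is just one of the decisions you'd have to make.

```python
import re

# A tiny stand-in word list; these entries are placeholders for
# a real profanity list you would load from a file.
PROFANITY = {"badword", "anotherbadword"}

def tokenize(text):
    """Lowercase and split on anything that isn't a letter or apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def remove_profanity(tokens):
    """Decision: drop banned tokens entirely (masking is another option)."""
    return [t for t in tokens if t not in PROFANITY]

tokens = remove_profanity(tokenize("Some badword text here."))
print(tokens)  # ['some', 'text', 'here']
```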