Hello and welcome to this lesson on Production Data Analytics.
This lesson is going to focus on real world issues in data analytics,
in particular, the challenges that we
often don't face too heavily in a classroom setting.
This includes how to achieve modeling success, in particular,
when you're faced with issues of data cleaning and
data preparation and how you can understand
whether your model has achieved success based
on the goals you define at the start of the process.
We're also going to introduce CRISP-DM, the Cross-Industry Standard Process for Data Mining.
This is an optional reading,
but I strongly encourage you to look at it.
The idea is to develop a formal procedure that
you can follow when confronting
a problem that's going to involve data analytics challenges.
So, specifically, at the end of this lesson,
you should understand the challenges of building successful models in the real world,
understand some of the differences between
classroom analytic problems and those in the real world and hopefully,
you'll also understand why we have a difficult task of trying to teach you
analytics in the classroom because companies rarely share real-world datasets with us,
and also hopefully, you can articulate issues that hinder
the promotion of data analytic processes
when you try to move them into a production environment.
In the end, that's what you want.
You don't want to do your data analytics on a small dataset on
your laptop and then have all of that knowledge disappear.
You want to do something that's going to be impactful in the end.
It's going to have to move into a production environment.
We won't talk too much about production environments in this class mostly because that
involves much larger computational resources and big data and big data lakes.
And so, in this course, we're going to focus more on
the algorithms and how to apply them and where to apply them.
But you do need to be aware of the fact that you will need, in the real world,
to be working with teams who are going to be taking
what you do and putting it into a production environment.
So there are a number of readings for this particular week.
I mentioned the optional reading here that talks about CRISP-DM.
Briefly, I'll walk through these websites.
You, of course, are expected to read them all.
The first one is modeling analytics successes,
and this article talks about how you can go from the start of
your data analytics task to
successfully implementing it and knowing you've implemented it appropriately.
The author lays out steps that you can follow.
This is not the CRISP-DM process that we're going to talk about later,
but it is an idea of how to actually go through
this data analytics task that can lead you to success.
The second article is an interesting one.
It talks basically about universities doing a bad job of teaching.
And there are some points in here that I think are true.
Of course, I would counter that the reason we're not doing a good job is that we
don't have real-world datasets on which to demonstrate how to do data analytics.
The main point that these types
of articles make is that in the real world,
in other words, in industry,
you're confronted with a question which is not clear cut.
So for instance, we may say,
make a prediction on this dataset whether it's the iris class A,
iris class B, or iris class C. That's a clean dataset.
It's a very well-defined problem.
You're often faced with a more difficult task where
you actually have to define what it is you're supposed to measure.
You also then have to find the datasets.
You may have to merge the datasets.
You may have to clean the data, fix missing data,
fix dates and time stamps,
identify missing entries that don't even fall into a normal category.
All of that work has to take place before you can
even think about applying a machine learning algorithm,
and that's the basic point of this particular article, and many like it.
That's because the authors are looking at it from the real world
and the challenges they're facing there,
and they say students are not coming out aware of these problems.
But, of course, here I am talking about these problems.
So, hopefully, you are somewhat aware of them.
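To make those cleanup steps a little more concrete, here is a minimal sketch in Python of merging two sources, normalizing dates, and handling missing entries that are hiding behind placeholder text. Everything in it is an assumption made for illustration: the records, the field names, the list of "missing" tokens, the two date formats, and the choice to fill missing amounts with the median. A real project would have to decide each of those for itself.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; names and values are invented for illustration.
customers = [
    {"id": 1, "signup": "2021-03-05"},
    {"id": 2, "signup": "03/07/2021"},   # same field, different date format
    {"id": 3, "signup": "unknown"},      # placeholder that really means missing
]
orders = [
    {"id": 1, "amount": "20.0"},
    {"id": 2, "amount": "N/A"},          # missing value in disguise
    {"id": 3, "amount": "40.0"},
]

# Assumed conventions; real data will have its own.
MISSING_TOKENS = {"", "n/a", "na", "unknown", "none"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def is_missing(text):
    """True when the text is absent or one of the assumed placeholder tokens."""
    return text is None or text.strip().lower() in MISSING_TOKENS


def parse_date(text):
    """Return a date, or None when the text is missing or unparseable."""
    if is_missing(text):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Return a float, or None when the text is missing or not numeric."""
    if is_missing(text):
        return None
    try:
        return float(text)
    except ValueError:
        return None


def merge_and_clean(customers, orders):
    """Join the two sources on id, normalize dates, and impute amounts."""
    amounts = {row["id"]: parse_amount(row["amount"]) for row in orders}
    known = [a for a in amounts.values() if a is not None]
    fill = median(known) if known else None
    cleaned = []
    for row in customers:
        amount = amounts.get(row["id"])
        cleaned.append({
            "id": row["id"],
            "signup": parse_date(row["signup"]),
            "amount": fill if amount is None else amount,
            "amount_imputed": amount is None,  # keep a record of what was filled
        })
    return cleaned


cleaned = merge_and_clean(customers, orders)
for row in cleaned:
    print(row)
```

Notice that none of this involves a machine learning algorithm yet, and that the sketch keeps an `amount_imputed` flag so that later steps can tell real values from filled ones.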
The next article talks about moving from practice to production.
So there are lots of things that people do,
and they talk about how great these offerings are,
how great these machine learning techniques are,
but you have to think that it only matters
if they can be deployed to real world problems.
And so, this article talks about that challenge.
This other article also talks about
building a machine learning or data science pipeline.
The whole idea is that the entire data analytics process is large and so,
you actually need to think about it from end to end.
Too often, we want to just jump right into
that machine learning algorithm and see what great things it's going to teach us.
And the point is that you have to think about it all from the start to the end.
And that's what this last one does.
This, of course, is an optional reading,
but I highly recommend you at least skim it.
It talks about
the entire way of approaching a data mining or data analytics process.
And that is, you first have to understand what it is you're actually trying to do,
then you have to understand your data,
then you have to prepare the data,
then you can build your model,
you can evaluate your model,
and then you can actually deploy it into a production environment.
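One way to picture those six phases is as a chain of small functions, each feeding the next. The sketch below is only an illustration under stated assumptions: the toy churn data, the 0.75 accuracy goal, and the nearest-mean classifier are all invented here and do not come from the reading.

```python
# Hypothetical toy data: (hours_of_use, churned) pairs, invented for illustration.
raw = [(1.0, 1), (2.0, 1), (None, 1), (8.0, 0), (9.0, 0),
       (1.5, 1), (7.5, 0), (None, 0), (2.5, 1), (8.5, 0)]


def understand_business():
    # Phase 1: state the question and the success criterion before touching data.
    return {"question": "Will this customer churn?", "min_accuracy": 0.75}


def understand_data(rows):
    # Phase 2: profile the data (size, missing values) before modeling.
    missing = sum(1 for hours, _ in rows if hours is None)
    return {"rows": len(rows), "missing_hours": missing}


def prepare_data(rows):
    # Phase 3: here we simply drop rows with a missing feature.
    return [(hours, label) for hours, label in rows if hours is not None]


def build_model(train):
    # Phase 4: a nearest-mean classifier on a single feature.
    churned = [hours for hours, label in train if label == 1]
    stayed = [hours for hours, label in train if label == 0]
    mean_churned = sum(churned) / len(churned)
    mean_stayed = sum(stayed) / len(stayed)

    def predict(hours):
        return 1 if abs(hours - mean_churned) < abs(hours - mean_stayed) else 0

    return predict


def evaluate(model, test):
    # Phase 5: measure accuracy on held-out rows.
    correct = sum(1 for hours, label in test if model(hours) == label)
    return correct / len(test)


def ready_to_deploy(accuracy, goals):
    # Phase 6: deploy only if the goal set in phase 1 was met.
    return accuracy >= goals["min_accuracy"]


goals = understand_business()
profile = understand_data(raw)
prepared = prepare_data(raw)
train, test = prepared[:4], prepared[4:]
model = build_model(train)
accuracy = evaluate(model, test)
ready = ready_to_deploy(accuracy, goals)
print(profile, accuracy, ready)
```

The point of the layout is that the success criterion is fixed in the first step and checked in the last one, instead of being chosen after you have seen the model's score.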
Ideally then, this process would be continued because you're going to say,
is that model doing a good job in the production environment?
Is the new data that has been accumulating since I first built the model
telling me something different?
Do I need to answer a different question?
All of those are important things that you need to understand and approach in
a systematic manner in order to have success in a data analytics project.
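As a rough illustration of that ongoing check, the sketch below flags a deployed model for review when its live accuracy falls below the goal, or when the recent inputs have drifted away from the training data. The thresholds, the sample values, and the simple mean-shift measure are all assumptions chosen for this example, not a recommended standard.

```python
from statistics import mean, stdev


def mean_shift(reference, recent):
    """Shift of the recent mean, in units of the reference standard deviation."""
    return abs(mean(recent) - mean(reference)) / stdev(reference)


def needs_review(reference, recent, live_accuracy,
                 min_accuracy=0.75, max_shift=2.0):
    """Return the reasons, if any, to revisit the model.

    min_accuracy and max_shift are hypothetical thresholds for illustration.
    """
    reasons = []
    if live_accuracy < min_accuracy:
        reasons.append("accuracy below goal")
    if mean_shift(reference, recent) > max_shift:
        reasons.append("input distribution shifted")
    return reasons


# Invented values: feature seen at training time versus in production.
training_hours = [1.0, 2.0, 8.0, 9.0, 1.5, 7.5, 2.5, 8.5]
recent_hours = [14.0, 15.5, 13.0, 16.0]

reasons = needs_review(training_hours, recent_hours, live_accuracy=0.6)
print(reasons)
```

An empty list means nothing was flagged. Any reason in the list is a prompt to go back to the earlier steps: look at the data again, rebuild the model, or ask whether the original question still applies.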
So, with that, I'm going to go ahead and stop.
If you have any questions,
please let us know. And of course, good luck.