Hi, in this unit we'll talk about analyzing social media data using Python. At this point, it's important that you have some basic idea about statistics. This is not a statistics course, so it's not going to cover the background. If you don't have the background, please pause at this moment and get access to some material; some of it is provided as part of this class to familiarize yourself with the basic ideas of statistics. If you have done this before, make sure that you're okay going forward; this may be a good time to brush up on those concepts.

Okay, having said that, we can talk about how we're going to do some statistical analysis using Python. In order to do that, we'll need a few libraries or packages. One of them is NumPy. It's a great library with a lot of functionality, and it is one of the most popular libraries for scientific data analysis in Python. You can get it from this link, and how to install packages in Python is covered elsewhere, so make sure you do that. Matplotlib is a great library for visualizing data. And Pandas is a great library for data manipulation, but more importantly, here we'll be using it for importing data, that is, loading the data as a data frame in Python. So at this point, make sure that you have these three things installed, configured, and ready to go.

Okay, so we're going to look at some social media data, but the first thing we need to do is, of course, access this data. You can collect the data using some of the things that we have done before; we have shown how to download data from things like Twitter and YouTube. You can use those, or you can get data from other sources: Kaggle, government organizations, and some other websites often provide sample data sets. At this point we're going to deal with only CSV data.
Later we'll see a different kind of data format, but for now, CSV keeps things simpler. So if you are getting data from somewhere else, make sure that it is in CSV. To get us started, we'll do a small project, and in that we'll use the YouTube data that we collected before, for which we wrote a little Python program. If you remember, we were able to download CSV-format data that had things like title, videoId, viewCount, and so on. Using that data we're going to do some analysis and ask a question: to what extent, if any, does viewCount relate to likeCount and other variables? That's our simple question, and we're going to work in Python with the data that we've collected before.

So I'm going to switch to Python now and start writing a script. The first thing we normally do is import the libraries that we're going to need. This is how you do it: you say import numpy and assign an alias to it, so we'll call it np. We're going to import matplotlib, specifically pyplot, and call it plt, and we're going to import pandas as pd. These are the three libraries that we talked about installing before, and now we're going to use them. Let's go ahead and save this script as analysis.py.

Now the first thing to do, of course, is load up the data. We'll call it youtube_data and load it using pandas, so pd refers to our pandas library. It has a function read_csv, and we just give it the name of the file, which in this case is in the current directory. This is very important: if you're giving just a file name like this, make sure that file is actually in the current directory. If it isn't, you can select the file some other way, including giving a full path.
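The import and loading steps just described might look roughly like the sketch below. The lecture does not name the CSV file, so this example reads a few made-up rows from an in-memory string instead; only the column names are taken from the lecture, and `my_data.csv` in the comment is a placeholder.

```python
import io

import numpy as np                 # imported as in the lecture; used in later steps
import matplotlib.pyplot as plt    # imported as in the lecture; used in later steps
import pandas as pd

# Hypothetical rows standing in for the CSV collected earlier.
# Only the column names come from the lecture; the values are invented.
sample_csv = io.StringIO(
    "title,videoId,viewCount,likeCount,dislikeCount,commentCount,favoriteCount\n"
    "Video A,id001,1000,30,2,5,0\n"
    "Video B,id002,50000,700,40,120,0\n"
    "Video C,id003,1000000,11000,650,900,0\n"
)

# With a real file in the current directory you would pass its name instead,
# e.g. pd.read_csv("my_data.csv"), or a full path if the file lives elsewhere.
youtube_data = pd.read_csv(sample_csv)

print(youtube_data.shape)    # (rows, columns)
print(youtube_data.dtypes)   # the count columns should load as integers
```

Checking the shape and column types right after loading is a quick way to confirm that the file was found and parsed as expected.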
Okay, so this should load up the data. Now let's do a simple plot and see what it gives us. Let's go ahead and run this. There's nothing really here, because we have simply asked to create a figure; we still need to do a couple more things to create the chart. We're going to create a histogram, and there are two steps here. One is having the right kind of data to create the histogram, and the second is to actually visualize it.

For the first step, we're going to use NumPy's histogram function, and we'll give it our YouTube data; we're interested in seeing viewCount. When the result comes back, we need to catch it in something, so let's call it hist1 and edges1. Don't worry too much about this; just take it for granted for now that this function from NumPy returns two objects that we're going to need to create the histogram, and we catch those two objects in these two variables.

Step two is to actually visualize it. For that we'll use the plotting library and do a bar chart. The syntax uses those two objects we got, hist1 and edges1. I'm going to use part of edges1, and we'll come back to that syntax. We also need to say how wide each bar is, so we'll say the width is edges1[1:] minus edges1[:-1]. So that's our syntax; I'll explain in a second. For now let's go ahead and run it.

And this is what we see. This is the histogram, a bar chart of view count. On the x axis you see the number of views, and on the y axis you see the number of instances in our data that fit in that range of views. So this is 500,000 here, this is 1 million, and so on, and each bar shows how many instances from our data set fit in that range.
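The two histogram steps can be sketched as follows. The view counts are invented stand-ins for youtube_data.viewCount, the `align="edge"` argument is an addition here so that each bar starts at its left bin edge, and the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view the chart with plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical view counts standing in for youtube_data.viewCount
view_counts = np.array([1200, 5400, 15000, 48000, 73000,
                        250000, 510000, 980000, 1500000, 4200000])

# Step 1: np.histogram returns two arrays: the count in each bin
# and the bin edges (one more edge than there are bins).
hist1, edges1 = np.histogram(view_counts)

# Step 2: draw one bar per bin. The width of each bar is the gap
# between consecutive edges: right edges minus left edges.
widths = edges1[1:] - edges1[:-1]
plt.figure()
plt.bar(edges1[:-1], hist1, width=widths, align="edge")
plt.xlabel("View count")
plt.ylabel("Number of videos")

print(hist1)
print(edges1)
```

Because `edges1` has one more element than `hist1`, slicing off the last edge with `edges1[:-1]` gives exactly one left edge per bar.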
So that's our bar chart. Now back to this thing, edges. If you look here, it's an array. I'm not going to go too much into details, but here are edges1 and hist1, and both of them are arrays. edges1 holds the boundaries of the ranges on the x axis, and hist1 holds the counts on the y axis. You can see it's got 19, 2, so this is your 19 here, 2 here, then 0, 0, 1, and so on. So hist1 has this information and edges1 has this information. When we create a bar chart, we give it the x axis and the y axis, and we use the x axis values to determine how wide each bar should be. If you're interested you can look up more help on this, but for now that's all the detail I'm going to cover.

So now we have some visualization of the data, but of course we want to do more than that. We want to see how different variables correlate. It's actually very easy to do. youtube_data is a data frame, and it's already loaded. The nice thing is that, because it was loaded using the pandas library, there is some functionality already associated with the data frame. One of those is a correlation function. So we call that, put the whole thing in a print statement, and run it.

Then you see an output like this on the console. It's a correlation table where you have the variables viewCount, likeCount, dislikeCount, commentCount, and favoriteCount on the rows, and the same variables on the columns. Of course, the correlation between viewCount and viewCount is 1, and between likeCount and likeCount is 1, so the diagonal is all 1s, because each variable is perfectly correlated with itself. But then you can also start looking at pairs like viewCount and likeCount.
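As a rough illustration of the correlation table, here is a minimal sketch on invented data, so its numbers will not match the ones discussed in the lecture. Selecting the numeric columns explicitly before calling `corr()` is a choice made here, since text columns such as title cannot be correlated.

```python
import pandas as pd

# Hypothetical rows; real values would come from the collected CSV.
youtube_data = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "viewCount": [1000, 5000, 20000, 100000, 500000, 1000000],
    "likeCount": [30, 120, 400, 1500, 6000, 11000],
    "dislikeCount": [2, 8, 20, 90, 400, 700],
    "commentCount": [5, 40, 35, 300, 250, 900],
    "favoriteCount": [0, 1, 0, 2, 1, 3],
})

# Restrict to the numeric variables, then compute pairwise correlations.
numeric_cols = ["viewCount", "likeCount", "dislikeCount",
                "commentCount", "favoriteCount"]
corr_table = youtube_data[numeric_cols].corr()

print(corr_table)
# A single pair can be read out of the table by row and column name.
print(corr_table.loc["viewCount", "likeCount"])
```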
The correlation between viewCount and likeCount is fairly high: 0.76. That's interesting because it shows that as viewCount goes up, likeCount goes up too. That's not very surprising, but it's good to see the confirmation. Similarly, as viewCount goes up, dislikeCount goes up, because the correlation is a positive 0.83. That's a high correlation. As viewCount goes up, commentCount also goes up, but the correlation is smaller, 0.46, so they are not as highly correlated. In other words, as viewCount goes up, commentCount tends to go up too, but less consistently. So this is a very easy way to do some correlation analysis using Python.

All right, let's do some other kind of plotting. I'm going to remove some of these lines, because we're not going to need them for this exercise. We now know that there's a good correlation between viewCount and likeCount. That's just a number, so let's visualize it in some way. I'm going to use a plotting function, this time a scatter plot. I need the x and the y, so I'll use viewCount as my x. This is how you access it: youtube_data is your data frame, and .viewCount means you're accessing that specific variable. And the y is youtube_data.likeCount. Now we run this, and here's your scatter plot. So once again, this is your correlation table and here's the scatter plot. Each point sits in two dimensions: the x axis is the viewCount and the y axis is the likeCount.

Now of course, what we really want to know is how these two variables are related. We already know that there's a high correlation, but that's just part of the story. For that, we'll do a regression analysis. Regression analysis is where we want to learn how two variables relate to each other. And so this is what's called our regression equation.
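The equation shown on screen is not captured in the transcript. Based on the description that follows (a factor applied to viewCount, plus a constant), it presumably has the form below; the symbols $\beta$ and $c$ are chosen here for illustration and are not the lecture's own notation.

```latex
\[
\text{likeCount} = \beta \cdot \text{viewCount} + c
\]
```

Here $\beta$ is the unknown factor (the coefficient on viewCount) and $c$ is the unknown constant; regression analysis estimates both from the data.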
So here we want to see how viewCount, which is called the independent variable or predictor, relates to likeCount, which is called the dependent variable or outcome; in this case, it depends on viewCount. So this is our outcome and this is our input; this is the dependent variable and this is the independent variable. To get to likeCount, viewCount must be multiplied by some kind of factor, and we don't know what that is. And perhaps we also need to adjust it by adding some constant. So this is the regression equation. If we know the value of viewCount and want to find the value of likeCount, we need to know this factor and this constant. That's why we're going to do regression analysis.

Fortunately, it's very easy to do this in Python. In regression analysis, the predictor is often represented using X and the outcome using y, so we're going to follow that notation here. We'll say y is youtube_data.likeCount, and X, typically a capital X, is youtube_data.viewCount. Remember there's also that constant that we need to add. To do that adjustment we need to import another library, statsmodels.api, and we're going to import it as sm. That library will allow us to do the linear modeling that we are trying to do. What we need is the function add_constant, applied to X. That's the adjustment we need to make to do the prediction.

Then we're going to create a linear regression model, and the sm library will allow us to do that. We're going to use the OLS function, which is the ordinary least squares method. We won't worry about the details here, but we need to give it y and X, so y is the outcome and X is the predictor. And then we say fit. So remember, this is our data, and we're trying to find a line.
Okay, this is linear regression, so we're going to get a line that fits this data as well as possible. And that should do it: this particular line of code creates the model. But of course, at this point we don't know what that model is, so we're going to print it. Just print lr_model.summary(). Save it, and let's go ahead and run it. When you do, you will see the model here, under OLS Regression Results.

So that's your model, and this is what it looks like. I know it's kind of boring, and we don't have to go through everything, but let's look at a couple of things that are useful. Here you can see on this line that viewCount's coefficient is this, and the constant that we talked about is this. So if you go back to that equation, with viewCount's coefficient and the constant, these two values are now known. What that means is, if we have a viewCount value, we can multiply it by this coefficient, add this constant, and predict likeCount.

Let's actually try doing that. Let's bring up a calculator and say our view count is a million. If view count is a million, what is the prediction for likeCount? How many likes should it have? Our model tells us that we need to multiply this by 0.0101, that's the coefficient, and add the constant, 887.4049. So this is our prediction: 10,987 as the likeCount. If you go to the visualization again, you can kind of see it. Here's a million, and we're saying that the likeCount will be 10,987, so about 11,000 [INAUDIBLE]. At a one million view count, the like count should be somewhere here, if that's our prediction. And it makes sense if you think about how this data is distributed. But it's probably better to do some validation. At least now you know how you can do the predictions.
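The calculator step can be reproduced in a few lines. The coefficient and constant are the values read from the lecture's summary table; the helper name `predict_like_count` is made up for this sketch.

```python
# Values read from the OLS summary in the lecture
coef_view_count = 0.0101   # coefficient on viewCount
constant = 887.4049        # the constant (intercept)


def predict_like_count(view_count: float) -> float:
    """Hypothetical helper: likeCount = coefficient * viewCount + constant."""
    return coef_view_count * view_count + constant


prediction = predict_like_count(1_000_000)
print(f"Predicted likeCount: {prediction:,.0f}")  # prints "Predicted likeCount: 10,987"
```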
So this is how it works: if you know the view count, you can multiply it by this number and add this number to get a prediction for likeCount. Now we're going to actually test this. What we'll do is generate a bunch of artificial values. We're going to use linspace, a function that creates a set of evenly spaced data points. We take X.viewCount.min(), the minimum value of view count, and X.viewCount.max(), the maximum value. That's the whole range, and we want to generate, say, 100 points in that range. That's what the function does, and we're calling the result X_prime.

Then we're going to apply our model. We adjust for the constant, much like we did before, but this time these are artificial points. Before, we used the real viewCount, the real data; now we're using artificially generated data in this range. That's the only difference. Then we predict using the linear model that we generated, which has a predict function. We give it the input, our independent variable, the predictor. lr_model is our existing model, and we use its predict function to see what we come up with. We'll call the result y_hat; that's our prediction.

Now let's plot it. First, a scatter plot of X.viewCount and y. These are the real values that we have from the data. Let's put labels on the axes to make it easier to see what we are dealing with: plt.xlabel, view count, and plt.ylabel, like count. Then we do a line plot of the artificial data, the 100 points we just generated, against the predictions that we made. And that's it, so let's go ahead and run this, and here we have it.
So what we did first, as you can see, is plot the original data, the view count and y, so the points here represent the real data that we have. Then we plotted a line. To draw a line in two dimensions you need x and y. The x is the 100 points that we generated; these are artificial points, and for each point we made a prediction using the model. So that's what it shows: an x point and the corresponding prediction, another x point and its corresponding prediction. There are a hundred of those, and those hundred points are joined, and that's our line. It looks pretty reasonable as our prediction, and this is called the regression line. So what we have now is that, given any x value, which is view count, we can get the y value, which is the like count.

All right, so to summarize: we saw how to load data in Python using the Pandas package. We saw how to use NumPy for doing statistical analyses. We saw Matplotlib for creating visualizations. We did some descriptive statistics and a histogram. We did some correlation analysis, and finally we did regression to connect the independent variable, in this case view count, also called the predictor, to the dependent variable, also called the outcome, and then used that knowledge to make predictions. Once we have that equation, you can take any view count value and say what the like count is likely to be. That's very powerful stuff, being able to create this kind of model and use it for making predictions. Okay, so that's it for doing social media data analysis using Python.