Okay folks. So our next topic is again taking these growing random network models. Now we've got a version which is fairly rich and spans a whole series of different degree distributions. Let's have a look at some data and see how this works out. If you recall, we ended up with an equation for the degree distribution which looks like this. It's easy to work with in the log-log world: we bring the F over to one side and then take a log of each side. What we end up with is that the log of one minus the distribution function is c minus x times the log of (d plus amx): log(1 − F(d)) = c − x log(d + amx). Okay. So what's the difficulty with actually fitting this kind of equation? We could just look at frequencies of degrees, regress the log frequencies off of a log function of degrees, and then try to estimate what the x is. The difficulty is that the x here is entering in both parts, so the x is telling us something about the a. So we've got this a coming in in several places, right? And a is the parameter we want to estimate: how much link formation is uniformly at random, how much is being formed via preferential attachment. The a enters here, and it also enters through this x term, since x is two over one minus a. So what we've got is c minus 2/(1 − a) times the log of (d plus am · 2/(1 − a)), right? So a is entering in several places, both here and here. So what we'll do is search over a's. We'll search over a's until we minimize the distance between the degree distribution generated by a particular a and the actual degree distribution. Okay? So we'll try to select an a to minimize that distance between the actual distribution and the model's distribution. So when we search over a's, let me say a little bit about how one can do that search; you can program the search by different methods.
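To make the functional form concrete, here is a minimal sketch of the model's log complementary CDF in Python. The function and parameter names are my own, not from the lecture, and it assumes a strictly less than one (the a = 1, purely uniform case has a separate exponential-type limiting expression).

```python
import math

def log_ccdf(d, a, m, c):
    """log(1 - F(d)) = c - x * log(d + a*m*x), with x = 2/(1 - a).

    a : fraction of links formed uniformly at random (assumed < 1 here;
        the a = 1 limit takes a different, exponential-type form)
    m : number of links each entering node forms
    c : intercept term pinned down by the rest of the distribution
    """
    x = 2.0 / (1.0 - a)
    return c - x * math.log(d + a * m * x)
```

Note that at a = 0 the expression collapses to c − 2 log d, the pure preferential attachment power law with log-log slope −2.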
One is you can just plug in a bunch of different a's. Each one gives you a degree distribution, and then you see, for each degree, what's the distance between the frequency in the data and the frequency the model predicts; square those differences, sum them up, and try to minimize that. Alternatively, you can guess an a, do a regression that gives you a new estimate for a, plug it back in, regress again, and so forth. You can actually show that that is a contraction, a well-behaved mathematical process, and it'll converge to the right a. I don't want to go through a proof of that, but that's an alternative technique. So one method is brute force: put down a bunch of a's between zero and one, search over a grid, and see how close each one comes to matching the data, comparing the model's frequency distribution to the actual frequency distribution. Right? So what we've got is the true frequency for each degree, and for each a we've got a model frequency. We look at the differences, square them, sum them up, and choose the a that minimizes the sum of squared distances between the actual distribution and the one generated by that a. So for each a we get an expression, plug it in, calculate all those numbers, and then minimize. Or you can do the iterative process: guess an a, do a regression to get a new x, which gives you a new a estimate, and iterate. That gives you another way to do this. Okay. So what do you end up with? Let's look at some plots. Here we've got different datasets. This first one is looking at citations of the original Small Worlds paper. And what do you end up with?
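The brute-force grid search described above can be sketched as follows. This is only an illustration under my own assumptions: the function names are hypothetical, the empirical input is taken as a dict mapping degree to log empirical CCDF, and for each candidate a the intercept c is concentrated out by least squares; the lecture doesn't specify the exact grid or objective.

```python
import math

def fit_a_grid(emp_log_ccdf, m, grid_steps=99):
    """Search a in (0, 1) on a grid; for each a, choose the intercept c by
    least squares, then score the sum of squared deviations between the
    model's log-CCDF, c - x*log(d + a*m*x) with x = 2/(1-a), and the data."""
    degrees = sorted(emp_log_ccdf)
    best = None
    for step in range(1, grid_steps + 1):
        a = step / (grid_steps + 1)          # strictly inside (0, 1)
        x = 2.0 / (1.0 - a)
        # model prediction up to the unknown intercept c
        base = [-x * math.log(d + a * m * x) for d in degrees]
        resid = [emp_log_ccdf[d] - b for d, b in zip(degrees, base)]
        c = sum(resid) / len(resid)          # least-squares intercept
        sse = sum((r - c) ** 2 for r in resid)
        if best is None or sse < best[1]:
            best = (a, sse)
    return best[0]
```

The iterative regress-and-update method the lecture mentions would replace the outer loop with a fixed-point iteration on a, but the grid version is the simplest to get right.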
You end up with a at about 0.56, and the fit of that is about 98%, so you're doing very well: the fraction of the variation in the actual frequency distribution that is matched by this model is about 98%. There's about a 2% error when you compare the squared deviations between the actual F(d)'s and the fitted ones to the overall squares of the F(d)'s. So you're doing very well here. When you look at different datasets — this is the prison inmates dataset that we talked about before, MacRae's dataset — here the best fit is a equals one. So it looks like it's pretty much uniformly at random, with an R-squared of about 94%. So again a fairly tight fit; that one looks like it's being matched fairly well by uniformly at random, and in fact there are a number of datasets that come out at one. There's another one, amateur radio operators. For people that don't know: before the Internet, people used to use shortwave radios to talk to each other, so instead of having chat rooms you could get on the radio, search around to find somebody else who might also be broadcasting, and talk to them. These were amateur radio operators. There's a dataset on these; again a equals one, uniformly at random, a fairly good fit. The high school romance network that we looked at — Bearman, Moody, and Stovel's data — the degree distribution there is again very well fit by uniformly at random. So it looks more like uniformly random; here the R-squared is .99. So a remarkably strong fit to the uniformly-at-random model, which says something about high school romances, in any case. So, a couple of notes on these fits. Some of these are actually even more curved than with a equals one. In those cases it could be that instead of having this growing network model, if you go back to the static uniformly-at-random model you can get an even better fit. So those are fit better by not including the growth.
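The fit percentages quoted here can be read as an R-squared-style measure. A minimal sketch, under my own assumption about the exact convention (the lecture describes squared errors relative to the overall squares of the actual values, i.e. an uncentered version; the function name is hypothetical):

```python
def fit_quality(f_actual, f_model):
    """One minus the sum of squared errors between actual and fitted
    frequencies, relative to the sum of squared actual frequencies.
    Uncentered R^2-style measure; conventions vary across authors."""
    ss_res = sum((f - g) ** 2 for f, g in zip(f_actual, f_model))
    ss_tot = sum(f ** 2 for f in f_actual)
    return 1.0 - ss_res / ss_tot
```

A value of 0.98 then means the squared fitting errors are about 2% of the overall squared frequencies.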
So the growth isn't a major factor in those. For the citation networks, one of the issues is that the data have too many nodes with degree zero to fit: the model starts with everybody having some degree, and it's not taking into account that citations are directed and some articles might never be cited. Interestingly, if we go back to the Notre Dame World Wide Web data, which generated a lot of the interest in power laws: the best fit actually has a at .36, so a little more than a third of the links are being formed uniformly at random. Now, when you actually plot things out — with a very fine binning of the actual degrees instead of coarse bins — the R-squared is .97. And interestingly, in a log-log plot this still looks fairly close to a line. Right. But the fit is saying that there's actually some curvature. One thing that's very important to understand about these plots is that when you do a log-log plot, most of your data is actually squeezed into a small part of the graph. Taking the log of the frequencies changes the relative weights, so what your eye ends up capturing is the long tail. This can look fairly linear while there's actually a lot of curvature that you've missed, because the log straightens things out and it's very difficult for your eye to detect. And so what this tells us is that if we actually want accurate understandings of what's really going on, just eyeballing these plots is not going to give us a good idea of what the true distributions look like. When you actually fit the distribution, what looks like roughly two-thirds preferential attachment and one-third uniformly at random is a much better fit than a pure preferential attachment model.
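The curvature point can be made numerically rather than by eye: compute the local log-log slope of the model's CCDF between consecutive degrees. A small sketch (my own function names; parameters as in the fitted form log(1 − F(d)) = c − x log(d + amx), x = 2/(1 − a), intercept dropped since slopes don't depend on it):

```python
import math

def local_slopes(a, m, degrees):
    """Slope of log(1 - F(d)) against log d between consecutive degrees.
    For a pure power law (a = 0) the slope is a constant -2; for a > 0
    it drifts with d -- curvature the eye misses on a log-log plot."""
    x = 2.0 / (1.0 - a)
    lf = [-x * math.log(d + a * m * x) for d in degrees]
    ld = [math.log(d) for d in degrees]
    return [(lf[i + 1] - lf[i]) / (ld[i + 1] - ld[i])
            for i in range(len(degrees) - 1)]
```

With a around one third, the local slope is shallow at low degrees and steepens toward the power-law slope in the tail, even though the plotted curve still "looks" linear.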
And so there might be other things going on, where pages accumulate some of their links through some process other than preferential attachment. That's an important caveat to these things. So when we see apparent power laws, one thing we should understand is that there are fat tails, but not necessarily purely linear relationships. And when you fit them, it's very difficult to let your eyes do the work in a log-log plot; you can get a very different distribution if you're careful about it statistically. All right, so we end up here — just to emphasize that eyeballing log-log plots can be misleading. So fat tails, yes. An actual power-law distribution, no. Here we're finding that about two-thirds preferential attachment and one-third uniform at random seems to be a better fit. Okay. One other thing in terms of fitting this kind of model to data: there are other things that the friends-of-friends model allows us to do. It does give us a variety of degree distributions, so we can span different degree distributions, and it ties the degree distribution to the way that links were formed, in terms of meeting friends of friends. But to emphasize some other things that were missing from the original models, including a straight preferential attachment model: there you still don't get clustering. You're just forming links at random, and the chance that two of my friends were already connected to each other vanishes as time becomes large. When you actually do things by this friends-of-friends model, you end up connecting to people in ways that form triangles. So let me give you an illustration of that, because it's an important idea. So recall, here we've got these nodes: we've got a new node, and let's suppose we have some existing nodes, and let's suppose those existing nodes were already connected to each other. Right.
So in this network, the way the new node formed its attachments was: first it found one node uniformly at random, and then it attached to one of that node's friends. So the fact that the attachment was via this search process of finding a friend of a friend means that we've actually formed a triangle here. We ended up with two of this node's friends being friends with each other, partly because the way it found people was through finding friends of friends. So we get natural clustering directly through this process, and that's different from what happens in the pure preferential attachment model or the uniformly-at-random model. The fact that you're doing things through network search gives clustering directly. So the clustering is going to depend on this a parameter, but we'll find clustering in the data. Second, the diameter is naturally going to be small, partly because you've got random network formation: either something that looks uniformly at random, or preferential attachment, which has an even lower diameter than Erdős–Rényi kinds of networks. Another thing — one we haven't talked about yet in much detail, but which you will observe in a lot of networks — is what's known as assortativity. Okay, and what does assortativity mean? Let's think of the older nodes. The older nodes are more likely to be connected to each other, because when they were born and young, the formation process was only between the nodes existing at that time period, and so older nodes are more likely to be connected to other older nodes.
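The triangle-forming mechanism can be seen in a toy simulation. This is a deliberately simplified two-link version of the lecture's model (each entering node links to one node found uniformly at random and to one of that node's neighbors); all names are my own, and it's only meant to show that the search step mechanically creates closed triples.

```python
import random

def simulate_fof(n, seed=0):
    """Toy friends-of-friends growth starting from a seed triangle.
    Each new node links to a uniformly random existing node ('parent')
    and to one of the parent's neighbors -- closing a triangle."""
    rng = random.Random(seed)
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}       # seed triangle
    for new in range(3, n):
        parent = rng.randrange(new)                # found uniformly at random
        friend = rng.choice(sorted(adj[parent]))   # a friend of that node
        adj[new] = {parent, friend}
        adj[parent].add(new)
        adj[friend].add(new)
    return adj

def avg_clustering(adj):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        closed = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += 2.0 * closed / (k * (k - 1))
    return total / len(adj)
```

Every entering node is born inside a triangle, so average clustering stays bounded well away from zero — unlike pure preferential attachment or uniform attachment, where clustering vanishes as the network grows.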
And so what we end up with is a correlation in the ages of connected nodes, which also means that the ones with higher degree are going to tend to be connected to other ones with high degree, and ones with lower degree are going to have more connections to ones with lower degree, because they formed their links later on and so will tend to have more links to younger nodes too. So we end up with a degree distribution that exhibits correlation in degrees: more of a high-degree node's friends will be high degree, and more of a low-degree node's friends will be low degree. So we get a series of other things coming out directly. This is something that's going to come out of any of these growing systems. The assortativity comes out of the growth; the small-world diameter comes out of the fact that we've got a random network; and the clustering, the other part of small worlds, comes from the friends-of-friends part of the process. So we're going to get three extra things. Now, if we actually look at fitting these, looking at these different datasets, you can look at the number of nodes, the average in-degree m, and r. Instead of keeping track of a, let's let r be the relative fraction of links formed uniformly at random compared to the preferential attachment, or finding-friends-of-friends, part. So as r becomes huge it's really a random network; as r goes to zero it's pretty much pure preferential attachment. And so you get these different r's: some of them going off to infinity, looking like random networks, some much lower, looking more like preferential attachment. Then you can look at the fit of these things.
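The claim that growth alone produces assortativity can be checked in a minimal simulation: grow a network by pure uniform attachment (no preferential attachment, no search) and measure the correlation between the ages, and between the degrees, at the two ends of each link. All function names and parameter choices here are my own illustration, not from the lecture.

```python
import random

def grow_uniform(n, m=3, seed=1):
    """Growing network: each new node links to m distinct existing nodes
    chosen uniformly at random. Returns edges as (newer, older) pairs."""
    rng = random.Random(seed)
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]  # seed clique
    for new in range(m, n):
        for old in rng.sample(range(new), m):
            edges.append((new, old))
    return edges

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def degree_assortativity(edges, n):
    """Pearson correlation of degrees at link endpoints, counting each
    edge in both orientations so the measure is symmetric."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    return pearson(xs, ys)
```

Older nodes were only able to link to other early-born nodes, so birth dates at the two ends of a link are positively correlated, and since age drives degree in a growing network, degrees correlate too.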
You can also look at the clustering that comes out, and the clustering from this model fits remarkably well for some of the datasets; for others the fit isn't doing as well — it's not generating quite as much clustering as we observe — but for some of them it hits it remarkably closely. For the diameters, to compare against the actual diameters in the data, what you have to do is simulate the fitted model, and then you get some range of values; the ranges of the diameters are usually fairly close to what's observed. So this model also fits on a series of other dimensions besides just matching the degree distribution, which is also interesting. Okay, so that takes us through a look at some of these growing random network models, and what they've done is enrich the system to allow us to capture a series of other attributes of actually observed networks. We can produce a variety of different degree distributions. We can take these things to data, estimate different parameters, and see how well the models match the data. And it gives us an idea of how the attributes that we see in the real world might have come from some formation process. So we now see a richer idea of how exactly we can generate these particular attributes of social networks and why they might be arising. For our next subject, we'll be looking at some other classes of models and understanding statistical estimation of those models.