[MUSIC] Hello, everybody. Welcome to Python for Everybody. We're going to do a little bit of sample code. If you're interested in getting the sample code, you can download the ZIP file at https://www.py4e.com/materials.php, and you will get all the files that I'm looking at here. The one I'm going to play with today is the file called urllinks.py.
So the first thing you've got to do before urllinks.py works is install BeautifulSoup. I've got some simple instructions at the beginning of the file. One way to do it is to use Python's install process to install BeautifulSoup for all Python applications. If you are the owner of your computer and you're going to use BeautifulSoup a lot, it's a fine idea to do that. But I want to show you a simpler way, for when you don't own your computer and you just want to make BeautifulSoup work. You can download this file right here, BeautifulSoup4.zip, unzip it, and put it in the same folder as the code. If you look in this folder, I have a subfolder called bs4. That's the unzipped version of the file, and it has these things in it. I didn't write this code, so I'm sorry if the name is bad, but this is the code to BS4, and this is what's in bs4.zip. And it's in the same folder as urllinks.py.
So what happens is, when you do this from bs4 import BeautifulSoup, Python can either go to the sort of global magic place where it installs stuff and pull in the BeautifulSoup object, or it can go to the folder bs4 and pull it in from there, okay? That's how that works, so you have to do one of these two things. I prefer to keep it simple: download and unzip this file and put it in the same folder as this code, anywhere you go.
So, as in the previous example, we're going to use urllib, of course. And then we're going to pull in BeautifulSoup: from the BeautifulSoup4 library we're going to get the BeautifulSoup object.
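To make the two install options a little more concrete, here is a minimal sketch that asks Python where it would load bs4 from, without actually importing it. The helper name describe_bs4_location is made up for illustration and is not part of urllinks.py; importlib.util.find_spec is from the standard library.

```python
import importlib.util


def describe_bs4_location():
    """Report where Python would load the bs4 package from, if anywhere.

    Hypothetical helper for illustration; not part of urllinks.py.
    """
    spec = importlib.util.find_spec("bs4")
    if spec is None:
        return "bs4 not found: install it or unzip it next to your script"
    # spec.origin is the path of the package's __init__.py. It points either
    # into the global install location or into a local bs4 folder.
    return f"bs4 would be loaded from {spec.origin}"


print(describe_bs4_location())
```

If the path that comes back is inside the folder holding your script, Python picked up the unzipped copy; otherwise it found a globally installed one.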
Now, if the websites we're going to play with use SSL, you pretty much have to do this little hack. These three lines, don't worry too much about them. You could Google it or look on Stack Overflow and figure this out, but this is the way you ignore SSL certificate errors. And so we have to add this parameter, context=ctx, where ctx is the variable that we create. So this part and this part, just sort of do them. If you don't need them you can actually take them out, but then you won't be able to do HTTPS sites.
So let's take a look at what we're doing other than dealing with the HTTPS problem. We're going to ask the user for a URL, and we're going to retrieve all the HTML. We're going to do a urlopen just like we did before. Now, this would return us something we could loop through line by line with a for loop. But instead, we're going to say, hey, read the whole thing. And that basically returns us the entire document at that web page in a single big string with newlines at the end of each line. This is not a Unicode string; it's bytes, probably UTF-8 encoded. But it turns out BeautifulSoup knows how to deal with UTF-8, and it also knows how to deal with Unicode strings. So what we're saying is, BeautifulSoup, read through this and deal with all the nasty bits, right?
HTML is very, very flexible. So take dr-chuck.com/page1.htm. If we take a look at the source of this, view page source, make this bigger. You might be able to do it with regular expressions, but HTML does things like break stuff across lines. There could be a line break here, there could be all kinds of things, right? And so writing regular expressions or splits or whatever is really hard for HTML. So what we use is something someone has written, called BeautifulSoup. This is the code, and its name is based on a joke from a children's story.
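The SSL hack and the read-the-whole-thing step described above can be sketched roughly like this. The function names make_unverified_context and fetch_html are made up for illustration; the ssl and urllib.request calls are standard library. Treat it as a sketch of the idea rather than the exact contents of urllinks.py.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that ignores certificate errors."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # has to be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # accept any certificate
    return ctx


def fetch_html(url, ctx):
    """Return the entire document at url as bytes (not yet decoded)."""
    return urllib.request.urlopen(url, context=ctx).read()


ctx = make_unverified_context()
# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://www.si.umich.edu", ctx)
```

Turning verification off like this is fine for playing with sample code, but it means Python no longer checks that the site is who it says it is.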
Basically, someone has gone through and figured out all the bad things that could possibly happen when you're reading and parsing HTML. So either you use it, or you'll slowly but surely discover for yourself all the things that don't work. And so when we look at this line right here, this line at a high level is saying: we're giving you ugly, nasty HTML that could make no sense whatsoever. Please read it, and with all the brains that you have, figure out all the weird stuff for us and give us back an object. I happen to call it soup; you don't have to call it soup. It's an object that is a proxy for that HTML. But this soup object is clean.
And so what we can do is retrieve all the anchor tags. We can talk to this object and ask it, give me the anchor tags. What's an anchor tag? Well, if we take a look at this source, the anchor tag is the a through the /a. That is the tag: it is the tag itself, it is the attributes that are on the tag, it is the text within the tag, everything. So that's what we're going to get. Now, I called the variable tags, plural, not because plural matters at all to Python, but because we're going to get a list of tags, because a web page can have lots and lots of tags. If we look at, say, dr-chuck.com, and view source, whoa, that's kind of small, view page source, right? And we go look for anchor tags: we've got 45 of them, and they all kind of have weird stuff in them, right? So this line will give us back a list of tags. It will give us all the anchor tags in this document. So the tag goes from there to there.
And then what we're going to do is write a loop to loop through all the tags. So it's basically hopping through the document sort of like this. That's what it's doing. Hop, hop, hop, hop, hop. And it's pulling the text of the href attribute, so it's going to pull out this bit right here. Whoops, darn, that was so cool, because that's a flaw. Look at that. This is my own page.
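Here is a small sketch of the parse-and-loop step just described, run on a made-up page held in a variable so that it needs no network. SAMPLE_HTML and the function name anchor_hrefs are assumptions for illustration; the BeautifulSoup calls are the ones the walkthrough talks about, and it assumes bs4 is importable by one of the two routes described earlier.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded page; bytes, like what read() returns.
SAMPLE_HTML = b"""
<html><body>
<p>First: <a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>
<p>No link target: <a name="top">Top</a></p>
</body></html>
"""


def anchor_hrefs(html):
    """Return the href of every anchor tag, or None where there is no href."""
    # "html.parser" is Python's built-in parser, so nothing extra is needed.
    soup = BeautifulSoup(html, "html.parser")
    tags = soup("a")  # calling the soup is shorthand for soup.find_all("a")
    return [tag.get("href", None) for tag in tags]


for href in anchor_hrefs(SAMPLE_HTML):
    print(href)
```

The second anchor has no href attribute, so get hands back None for it instead of blowing up.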
There is no closing quote here, but it's going to work because BeautifulSoup is like, I know what to do about that, I can deal with that. So let's check to see if that one works, because that's like a mistake. But that's one of the things we like about BeautifulSoup. So we're going to read through, and then we're going to pull out all the hrefs. So [LAUGH] this is probably thousands of lines of code that you really don't want to write.
So python3 urllinks.py. And, well, let's start with a simple one, http://www.dr-chuck.com. And it reads it. No, that's actually the hard one, because we've got a whole bunch. So let's see if the Tsugi one worked. It found that one, it's right after sakaiproject.org. Where is that? Is there another Tsugi? No, it didn't find that one, that's kind of funky. Look, it found it wrong, but that's okay. So you see, it found all these and did a lot of nice stuff for us. If we run python3 urllinks.py again and do the easy one, http://www.dr-chuck.com/page1.htm, we will only see one. And there we go.
Now, the SSL part is for when you are looking at a page that uses SSL. So python3 urllinks.py again, and I'll go to https://www.si.umich.edu, and that will get a bunch of links. And so you'll see all kinds of stuff coming back. If it wasn't for this bit right here and this bit right here, this HTTPS wouldn't have worked. And it's not that that website has a bad URL; it has a certificate that's not in Python's official list. So the URL is okay. So that gives you a quick summary of using the BeautifulSoup library in Python, along with urllib. [MUSIC]
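To give a feel for the kind of sloppy HTML that BeautifulSoup puts up with, here is a minimal sketch using a made-up fragment where the p and a tags are never closed. BROKEN_HTML and hrefs_from are hypothetical names, and this is a different flaw from the missing closing quote on the page in the walkthrough, which may not come out this cleanly.

```python
from bs4 import BeautifulSoup

# Made-up fragment: neither the <p> tags nor the <a> tags are ever closed.
BROKEN_HTML = """
<html><body>
<p>First <a href="http://example.com/one">one
<p>Second <a href="http://example.com/two">two
</body>
"""


def hrefs_from(html):
    """Pull the href out of every anchor tag BeautifulSoup can find."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get("href") for tag in soup("a")]


for href in hrefs_from(BROKEN_HTML):
    print(href)
```

Both links still come out, even though a strict parser would have plenty to complain about in that fragment.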