We discussed the difference between online and offline algorithms. Offline algorithms preprocess the text T, where as online algorithms do not. And, we also decided that the Naive exact matching algorithm and Boyer-Moore, were both examples of online algorithms. Whereas web search engines, and our target problem that we're concentrating on, the read alignment problem, are both situations where offline algorithms are more appropriate. So again, our target problem is the read alignment problem, where we are given a large collection of sequencing reeds, and we'd like to find out where they best match, within a long reference genome. And, we decided that we need an offline solution, because the reference genome doesn't change from experiment to experiment, but the sequencing reads do. So, it seems to make sense to preprocess T, the reference genome. So, we are going to preprocess the text T, where T is long, and to understand how we might do this, we can use a couple of analogies. So, one kind of text that we might consider is, a book, and one familiar way to preprocess a book is by indexing it. So, this is an excerpt from the index of a book. An index is a list basically of key terms, and each key term is associated with a list of places, pages in the book, where it's mentioned. So, if we're trying to find more information about, for example, memory, here. We can look memory up in the index, and this is very convenient and easy for us to do, because the key terms are in alphabetical order. And then the index tells us which pages to look on for more information about memory. And, this is pretty convenient. It's obviously a lot more convenient than just scanning through the entire book, looking for mentions of the word memory. So, looking up a key term, like finding memory here, in the index, I'm going to call that querying the index. So, we have a key term in mind, and we'd like to know where in the book that key term occurs, so that's called a query. Here's another analogy. When you go to the grocery store to buy some milk, you don't have to look at every single shelf in the store until you find milk. You know that it's going to be with the other dairy items, because grocery stores are organized that way. They're organized into groups of related items, like dairy, or paper products, and things like this. So, it's easy to find an item in the grocery store, but not because all the items are in alphabetical order, or in any kind of order like that, it's because of grouping, or items are grouped together. So, these two examples highlight two different basic tools we have for organizing data, so that it's easy for us to get the item we want. So, that it's easy to query. So, ordering, like the alphabetical ordering in the index of a book, or grouping, like the aisles of a grocery store. So, let's build an index for a DNA string now. And, the index that we build will be more like, the index of a book. We're going to use the idea of ordering. So, here's our text T, and we're going to put our index in this orange box, here in the middle. And unlike our index of a book, a genome doesn't consist of words. It's not like a book in that sense. It's just one big long string. So, we kind of have to break it up into words our self, and one way we can do this is to just take every substring of the text T that's of a certain length, and we'll just treat each of those substrings as a word. So, in this example, we'll take each substring of length five, and insert it into our index. So, let's take the first substring of length five, and insert it. And, as in the index of a book, in our index here, we're going to associate with our substring that we extracted, the offset where it occurs in the text. So, for this very first substring of length five, we associate the offset 0. Okay. Let's do the second substring of length five. Okay. Same thing. And now, for the third. And now, for the fourth. And, an interesting thing about the fourth substring of length five is that it came alphabetically between two substrings that we had all ready added to the index. So, we had to stick it in between them here, in order, in order to keep the overall index in alphabetical order. Okay, now let's add the next substring of length five. Now, this substring is one that we've seen before, so instead of adding a new entry into the index, we simply append one more offset. This new offset, onto the list of offsets, already associated with this substring. So, we just add a four onto the lists that already had the zero there. And then, we just continue like this until we've added all the substrings of length five. And there we go, that's our index. It's very analogous to the index of a book, except for the fact that we built it over all substrings of length five. So, a quick piece of vocabulary. We use the term k-mer to refer to a substring of length k. So, for substrings of length five, we'll say 5-mer. So, this index we built where we took every 5-mer and inserted it into the index, so that for every 5-mer we were able to map it back to all the offsets where it occurred, we're going to call this a 5-mer index. Okay. We have our index. Now, how do we query this index? So, say we have a pattern string P, we can solve the exact matching problem. We can find all the offsets where P occurs within T, with the help of our 5-mer index. So, let's do that. Let's take the first 5-mer from P, and query the index. So, we're asking the index, where have you seen this 5-mer before? And the index replies, that it's seen it at offset three. So, we know that these five characters of P, the first five characters of P, match five characters of T, starting at offset three, within T. But we're done yet, because we don't know whether the rest of P matches where it should within T. We have to do one more character comparison here. And, this additional work that we do is called verification. So, finding the index hits, the place where the 5-mer occurs within the text, that's the first step. But then, this additional work that we have to do, to determine if we have a full match of P to T, that's called verification. So, in this case, the verification succeeds, because this C here matches this C here. And so, overall, we can conclude that P occurs within T at offset three. Here is a variation on that. What if, instead of taking the first 5-mer from P, we took the second 5-mer from P? In a sense, this doesn't matter very much, because the index contains all the 5-mers from T. So, whether we query it with the first 5-mer from P, or the second 5-mer from P, any of them should really work. None of them will fail to find a match of P within T. So, let's go ahead and the second 5-mer from P, and query the index. And this time, our 5-mer occurs twice in the text T, once at offset 0 and once at offset 4. So, we have to do two different verifications. So, let's start with the verification at offset 0. Well, our verification here fails, because there's no character to the left. There's a character to the left within P, but there's no character to the left within T, right? We're at offset 0. So, there's no way that P can occur within T here. So, now let's try offset 4. And then, our verification, we have to check whether this G, at beginning of P right here, matches this G. And it does, and this illustrates that it doesn't really matter which 5-mer from the pattern is used. Any of them will lead to the correct set of matches of P within T. Here again, we found the same match, the match at offset 3. So, here's a modified example. This is just like the previous example, except for that we made a substitution within P. So, this last base here is now an A. So, P no longer matches within T, because of this change that we made. But, let's go through our procedure anyway, and let's take the first 5-mer, and we'll query the index, and we see that that 5-mer hits at offset 3. Right here. But then, when we go verify, we do this character comparison. We compare this A to this C, and we find that P does not match T. So, that's the correct answer. P does not match T in this case. Now, here's another modified example. This time, we modified this character right here to be an A. That's the change we made. So again, the pattern does not occur in the text T. And in fact, in this case, we can give up even faster. We notice this even sooner, because in this case, the first 5-mer that we take from P doesn't even occur in the index at all. So, if a substring of length five, from our query string P doesn't occur in the text at all, then the full string can't possibly occur either. So, we can give up at that point. So, one piece of vocabulary which I've already been using is that when we use the index to determine that the 5-mer occurs within P, that's called an index hit. For each offset that the index reports back, that's called an index hit. So, we saw the example where we queried with the second 5-mer from P, and we found two index hits, at offset 0 and 4. So, when P matches within T, we've been calling that a match, or an occurrence. But, an index hit may or may not correspond to a match, it's just a hint that we should look harder in that particular region of T. So, not all index hits lead to matches.