We discussed the kMer index, as well as two different multimap data structures that we can use to implement the kMer index, and now we'll see two variations on the theme of a kMer index. So, before when we built our kMer index, we built it from all of the kMers from the text T. Every kMer offset pair appears in this index, for example. And it starts with the fivemer at offset 0 and goes all the way to the fivemer at offset 6. So, but what if we didn't take every kMer? What if, instead, we took every other kMer? So, for example, what if we took just the kMer that are at even offsets, with respect to the text T, and we got rid of the kMer that were at odd offsets with respect to the text? So, for example, starting with the index you see here, we can remove, as you see indicated by these red slashes, we can remove the kMers that start at the odd offsets. And so, once we remove them, we're left only with the kMer that start at even offsets. So, why might we do this? An advantage is that the index is now smaller. Alright. The index takes up less memory, and it's also a little faster to query. So, for example, if we were using Binary search to query this index, we've now cut the size of the index in half, which is going to save us about one bisection when we do Binary search. So it's going to be a little bit quicker to query. On the other hand, the fact that there are fewer kMers in the index now might be a little bit concerning. So you might wonder if that'll cause us to miss some of the matches that we might otherwise have been able to find, if we had kept all the kMers in our index. So for example, say we are trying to match this pattern P, that you see here. And we take the first fiber from P, and we query the index. But we don't get a hit. This particular fiber does not appear in the index, and yet this pattern P does occur in the text T you can see it occurs right here G-C-G-T-C-T-T. So that's concerning but you might also notice if we had just taken instead of the first 5 mir. We had taken the second 5 mir from P and used it to query the index we actually would have found two hits. And these correspond to these two different offsets with respect to T. And if we do verification that these two offset we will find that in the case of the second offset there was in fact a match. So we'll find the match. If we use just the first kMer we would have missed the match. But if we tried both the first and the second kMer then we would have found the match. So let's think about the scenario again but using a slightly different picture here. So here we represent the text T here up the top and the kMer that we add to the index are shown here below. Then here's our pattern P and then the kMers that we take from P in order to query the index are shown here. So again our strategy is to build the index with every other kMer from the text. Every even offset kMer, so here's the kMer in offset 0, the kMer in offset 2, offset 4, offset 6, ecetera. And then we query the index using the first kMer from P here recording all the index hits that we find. And then we query it again using the kMer the second offset with respect to P here. And again we would query all the index hits that we find. And we verify all those hits. So actually there's nothing all that special about using the first and second. kMers from P. So the important thing is that the two queries have to come from even and odd offsets with respect to P. So in the example we just saw, we use the kMers at offsets 0 and 1 with respect to P, but in another scenario, we might've used different kMers. So here are the kMers at offsets 4 and 3 with respect to P. That's fine. As long as one of them comes from an even offset, and the other one comes from an odd offset, with respect to P. And in fact, it's not too hard to generalize this scheme so that we index not every other kMer from the text T, but say, every third kMer. Or every fourth kMer or every nth kMer where we can choose that n. So for example say that our index is built not over every other kMer but over every third kMer. So maybe that looks a little bit like this. Now given so here we are we've extracted the zeroth and the third and the sixth, the kMers at all the different offsets. At multiples of three, and then we query it three times now, instead of twice. We query it with the first three kMers with respect to P. It's a similar thing but we're doing it in three, taking every third instead of every other. And then again, there's nothing special about taking the first three kMers with respect to the pattern P and using those to query the index. We could have just picked any three, as long as they are at different offsets, modulo three. You know, using modular arithmetic, where some offset modulo three is just the remainder after dividing by three. At any rate, it's easy to generalize this scheme, so that instead of taking every other gamer, we take every n-th gamer. So moving on to another variation on the theme of the kMer index. Instead of building the index over sub strings, taken from the textee, we can also build the index over subsequences, taken from the textee. So what's a subsequence? How's that different from a sub string? Well, subsequence of a string S is a string consisting of characters that also occur in S, in the same order, but not necessarily all in a row, not necessarily consecutively, which was a requirement of a substring. So, for example, here's a string, this string called seq here. And here is a subsequence of that string. And the way we got this subsequence is that it consists of the character at offset 0, concatenated with the character at offset 1, concatenated with the character at offset 5, offset 7 and it makes this subsequence here. So, AAGT, is a subsequence of the string, but it is not a substring. We can look for it using dot find and we get the return value -1 which just basically means that that does not occur as a substring. So, that string does not occur as a substring, but it is a subsequence. So substrings are always also subsequences, but subsequences are not necessarily substrings. In order to be a substring, there has to be some place where those characters occur consecutively, in a row, in a string. So let's look at an example of this. So here, instead of taking the first fivemur of T and putting into the index. We're going to take the first subsequence of a particular shape. So the shape of the subsequence is indicated by the blue underscores here. So we're going to take two characters, and then skip one, and then take one character, and then skip one, and then take two more characters after that. So we're going to take all these characters and concatenate them together and then insert them into the index and here we go, we got this subsequence extracted from offset 0 and then we can do the same thing and offset 1. Again we take a subsequence, it’s of the same shape, so we'll take two characters and then skip one, then take one character, and then skip one, and then take two. And then well insert the corresponding entry into our index. And we do the same for offset two and we do the same for all the other offsets. And here is the index we would get if we did things this way. So now we have an index built from subsequences of T and if we have a pattern P and we would like to now query this index, then the first thing we would do is we would extract the same shaped subsequence from the string P. So in this case, again, we'll take the first two characters GC and then skip one and take the next character and then skip one and then take the next two and those concatenated together are going to be our key that we can go look up in the index. And in this case we'll look in the index and get a hit at offset 3. So, why should we do this? And what sort of advantages and disadvantages would it have? Well, the short answer is that it tends to increase the specificity of the filter, provided by the index. In other words, when we get an index hit, it's going to lead to a successful verification more of the time than if we had taken those characters to be consecutive. To be a substring of T. And we'll elaborate on what this means a little bit later in a homework assignment.