Hello and welcome to the first lecture of the Bioinformatics Algorithms class! When I started teaching bioinformatics two decades ago, one of the biggest challenges was to connect with a very diverse audience. I had to talk to computer scientists and biologists, to physicists and biochemists, to mathematicians and statisticians. The challenge was how to keep computer scientists focused while explaining sometimes convoluted biological concepts. And likewise, how to keep biologists awake while explaining sometimes complex algorithmic ideas. I hope that in this class we have an equally diverse audience, and that's why I decided to start from a very simple problem. Here is the first problem in this class. Given a string called Text1, copy it to a string called Text2. And I see that some of you are already thinking about quitting this this class because it is so trivial. Please wait. You might think that it is trivial, but billions of cells (replicating your own genome right now) don't think so. In fact, they are trying to accomplish the incredibly complex task of replicating genome. Just think about this: you need to unwind DNA (two strands of DNA) and separate them. You need to make sure that there are enough free-floating nucleotides so replication can complete because if you start replication and you have to stop in the middle because there is no energy left, you are dead. So, in this course, we will learn about this and many other challenges related to replication. In fact, replication of the human genome is so complex that biologists still do not fully understand how its done. That's why we will focus on replication of bacterial genomes that are 1000 times smaller (their genomes are 1000 times smaller) and that's why replicating bacterial genomes is a little bit easier. I am not saying that it is trivial. And in fact, right now, there are billions of bacterial cells trying to replicate bacterial genomes inside you because the number of bacterial cells inside you is 10 times larger than the number of your own human cells. Now, since bacteria usually have circular chromosomes, I have to slightly change my problem. My new problem is: Given a CIRCULAR genome called Text1, copy it to a CIRCULAR Text2. It is equally trivial, but it makes some slight changes in the setup that we need to address. A circular genome does not have a start and an end, and that's why the roughly 5 million nucleotide-long E. coli genome has a choice on where to start replication. In this class, we will try to figure out where it starts because all E. coli cells on Earth start replication in the same region, called the "origin of replication". So, our first problem is "Where in the genome does it all begin?" and the only information that we have is the genome itself. Let's go ahead and first try to answer the question: What are the hidden messages in the replication origin that help the E. coli genome figure out where to start. And here is the first, or should I say the 3rd problem in this course: finding the origin of replication. I'll give you a genome, you will have to give me the location of origin of replication in this genome. Does the problem make sense? Yes [from the audience] You may be right or wrong depending on who you are. "Are you a computer scientist or a biologist?" Computer scientist [from the audience] Then you are wrong. If you were a biologist then you could think about an experiment. For example, you can start cutting short pieces from DNA, and when you cut a piece and the cell suddenly lost ability to replicate, it means that probably you cut out the origin of replication. But if you are a computer scientist, you should just refuse solving this problem; it is an ill-defined computational problem. So, how can we turn this ill-defined computational problem into something that makes sense? Lets try to ask a question: "how does the cell know to begin replication in short oriC region?" There must be some hidden messages in the genome that tell the cell "Start replication right here!" What are these hidden messages? Let's first try to formulate the problem right. The right formulation will be the Hidden Message Problem. In this case, Input is a string Text representing the replication origin, and Output is the hidden message in Text. Is this problem clear? If you feel this problem clear, please raise your hands. 1, 2, 3, 4, 5, 6, 7, 8... This problem is absolutely unclear. It is equally poorly defined as the first problem I showed you. Indeed, Son is not happy again because what is a hidden message? Have I described it? No. So, for a computer scientist there is not even a starting point to address this problem. But let's go two centuries back and let's see how somebody in a somewhat similar situation tried to solve a similar problem. We go to "The Gold Bug" short story by Edgar Allan Poe; in this story, the hero, whose name was Legrand, tried to decode the message left by pirates. And his hope was that when he decoded the message, he would be able to find pirate treasure, so he was quite motivated. The only thing he saw is this text. How would you tell what is written, what is coded in this text? What Legrand noticed is that a combination of three symbols ";48" appears surprisingly frequently in this text. Does it suggest an idea on what is written in the message? Here is the hint: The message is in English" And here is another hint: "THE" is the most frequent word in English And that's why he was able to substitute ";48" for "THE" and this message started to make sense. And if you have time, you can complete the decoding. Now, if our original Hidden Message Problem was clearly ill-defined, let's now try to redefine it so that it would make sense even for a computer scientist. How we would redefine it? Well, for various biological signals, certain words appear surprisingly frequently in the text. For example, on this slide, AATTT appears surprisingly frequently in a short text -- its unlikely that it would happen by chance. Then we formulate our first real Rosalind problem. The Frequent Words Problem is: "I'll give you a string and an integer k, and you need to find all most frequent k-mers in Text, which means all most frequent strings of length k in the Text. Is the problem clear now? Let's ask Son. He feels better, but he is still not clear because he does not know what exactly a "most frequent word" means. I think he is a little bit picky today because it's easy to define what a most frequent word means. A k-mer Pattern (a string of length k) is a most frequent word in Text if no other k-mer is more frequent than Pattern. And now even Son is satisfied and I want to thank Son for great help with development of this lecture. Now, Son is satisfied but we should ask biologists whether this problem actually makes sense. Remember, bioinformatics is about connecting computational and experimental ideas. And if we check with biologists, it also looks that they are quite satisfied with this problem. Indeed, replication is performed by DNA polymerase, but to initiate replication, DNA polymerase needs a protein called DnaA. DnaA binds to a short region (typically just a 9 nucleotide-long segment) within the replication origin that is known as a DnaA box. A DnaA box is actually the hidden message that we are looking for, and the DnaA box actually tells the DnaA protein: bind right here! And the DnaA protein wants to see multiple DnaA boxes to bind efficiently. So, it looks like we brought computational people and biologists on the same page, and we are now ready to try to solve this problem. Since it's the first problem at Rosalind, we expect a wide variety of different algorithms proposed by you. The simplest algorithm probably is this one. Start from the 1st k-mer in the Text, slide it through the Text and see how many identical k-mers are present in the Text. How much time it takes? It takes time roughly equal to |Text| ∙ k. But you need to do it for every k-mer in Text, and there are roughly Text of them. So, the overall running time of this very naïve algorithm is |Text2|∙ k. If |Text| is one million, you may have problems, but in our case |Text| is typically small. The average length of the replication origin is just 500, maybe 700 hundred nucleotides. In this course, we will soon learn much more sophisticated algorithms: for example, some of you may be amazed that the same task can be done incredibly fast. And we will cover
these algorithms in the class.