We will start looking at the genomic tools for alignment and for variant detection, and we will first look at the standard formats for files and data. So let's start with the alignment format, which I presented in the previous section, so I'm only going to go over it briefly. The standard for representing alignments of next-generation sequencing data is the SAM format. And if you look here, it consists of a header at the top followed by a number of lines, each of which represents an alignment of one read to the genome. In the header, we have some information about the file, whether it's sorted, for instance, and by [INAUDIBLE], followed by a number of sequence lines that show us the sequences that are present in the reference genome. And then lastly, one line that tells us the program and the command-line options that were used to create this file.
And then we have, as I've said, one line for each alignment. Let's look at that more closely. It has a number of columns separated by tabs, where the first one gives us the read identifier. The second one is a FLAG that gives us comprehensive information about this particular match and the mate, if we have paired-end reads. Then come the location, given by chromosome and start position, the mapping quality, and an alignment string that we call CIGAR, and then some information about where the mate is matched: whether it's on the same chromosome, unmapped, or on a different chromosome, the mate start position, and the distance, the insert between the two reads, that's inferred from the alignment. These are followed by the query sequence, then by the base qualities for each of the bases, and a number of optional fields that give us information about the alignment, such as the edit distance to the reference, the number of hits, whether this is the primary alignment, the strand, and so on. So this is, in a nutshell, the SAM format.
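To make the column layout just described concrete, here is a minimal Python sketch that splits one alignment line into its eleven mandatory fields and its optional tags, and decodes a few FLAG bits. The names `SamRecord`, `parse_sam_line`, and `describe_flag` are made up for illustration, and the example read is invented rather than taken from the slides. The field order and the bit values (0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x10 reverse strand, 0x100 secondary alignment) follow the SAM specification as I understand it.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    """One SAM alignment line (hypothetical container for this sketch)."""
    qname: str   # read identifier
    flag: int    # bitwise FLAG
    rname: str   # chromosome / reference sequence name
    pos: int     # 1-based start position
    mapq: int    # mapping quality
    cigar: str   # CIGAR alignment string
    rnext: str   # mate reference ("=" means same chromosome, "*" unknown)
    pnext: int   # mate start position
    tlen: int    # inferred insert (template) length
    seq: str     # query sequence
    qual: str    # base qualities, one character per base
    tags: Dict[str, Union[int, str]]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-separated alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        # Only integer tags are converted; everything else stays a string.
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def describe_flag(flag: int) -> Dict[str, bool]:
    """Decode a subset of the FLAG bits."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "mate_unmapped": bool(flag & 0x8),
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
    }


# Invented example line: a paired read on chr17 with its mate on the same chromosome.
example = "read1\t99\tchr17\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII\tNM:i:1\tNH:i:1"
record = parse_sam_line(example)
flags = describe_flag(record.flag)
```

Here `record.tags["NM"]` holds the edit distance to the reference and `record.tags["NH"]` the number of hits, the two optional fields mentioned above. A real pipeline would more likely use an existing SAM/BAM library than hand-rolled parsing.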
Now I'd like to talk about one type of format that is used for representing variants, the SAMtools mpileup format. In the mpileup format, each line represents one position along the genome at which we have reads aligned. And then we have a number of fields that are again separated by tabs. The first one gives us the chromosome, in this case chromosome 17. The second gives us the position that we're looking at, in 1-based coordinates. The third column tells us what the letter is, what the base is, at that position in the genome. The fourth one gives us the number of reads aligned, or the depth, at that position. The fifth column gives us a string of characters that tells us something about the bases in each of the 19 reads at that position. And we have a number of codes that can represent the options here. So for instance, a dot is used to represent a match in the forward direction. A comma is also used to represent a match; however, the read must have been reverse complemented. A capital T or capital A, or in this case a C or a T for instance, would represent a mismatch where the read aligns in the forward direction, and T in this case would be the letter in the read. A lowercase t would mean a mismatch, where t is the letter in the read, again; however, the read has to be reverse complemented to match. There are characters to represent letters that are at the beginning of a read, given that those positions are usually lower quality than the rest of the read, and also to represent the end of the read, and this is the dollar sign. In case there are insertions or deletions with respect to the reference, those can be marked by a plus or minus sign, respectively, followed by a number, and then followed by the string that was inserted or, respectively, deleted. And lastly, a greater-than sign represents a reference skip, so jumping over a portion of the sequence.
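The codes in the fifth column can be decoded with a short routine. The Python sketch below is illustrative only: `decode_read_bases` is a hypothetical helper and the example string is invented. It covers the codes described above and adds three details from the samtools pileup format as I understand it: the start-of-read marker is a caret followed by one character encoding the mapping quality, a less-than sign is the reverse-strand counterpart of the reference skip, and an asterisk stands for a base deleted in that read.

```python
from typing import List, Tuple


def decode_read_bases(ref_base: str, bases: str) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Decode the mpileup read-bases column.

    Returns (calls, indels): calls holds one (base, strand) pair per read,
    indels holds (kind, sequence) pairs for insertions and deletions.
    """
    ref = ref_base.upper()
    calls: List[Tuple[str, str]] = []
    indels: List[Tuple[str, str]] = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Start of a read: skip the marker and the mapping-quality character.
            i += 2
        elif ch == "$":
            # End of a read: carries no base of its own.
            i += 1
        elif ch in "+-":
            # Indel: sign, then a length, then that many inserted/deleted letters.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            kind = "insertion" if ch == "+" else "deletion"
            indels.append((kind, bases[j:j + length].upper()))
            i = j + length
        elif ch == ".":
            calls.append((ref, "+"))        # match, forward strand
            i += 1
        elif ch == ",":
            calls.append((ref, "-"))        # match, reverse strand
            i += 1
        elif ch in "ACGTN":
            calls.append((ch, "+"))         # mismatch, forward strand
            i += 1
        elif ch in "acgtn":
            calls.append((ch.upper(), "-")) # mismatch, reverse strand
            i += 1
        elif ch in "><":
            calls.append(("skip", "+" if ch == ">" else "-"))  # reference skip
            i += 1
        elif ch == "*":
            calls.append(("*", "."))        # base deleted in this read
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return calls, indels


# Invented example: reference base C covered by eight reads.
calls, indels = decode_read_bases("C", "^F.,.T,t$,+2AG.")
mismatches = [c for c in calls if c[0] != "C"]
```

The number of entries in `calls` should agree with the depth reported in the fourth column, which gives a cheap consistency check when reading real files.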
Following the bases in the reads, the sixth column gives the qualities of those bases in the corresponding reads, and lastly, an optional column gives us the specific position of that particular letter in the read. So for instance, the first read had a match on the forward strand, and that occurred at position 74 within the read. Next we'll be talking about another standard for representing sequence variation, the VCF format.