The VCF format, or variant call format, describes information about sequence variation. It can also be compressed into the binary BCF format to save space. Unlike the mpileup format, which showed one line for every base in the genome at which we had reads aligned, the VCF format gives detailed information about only those positions at which we identify variation. So as you can see here, it is a fairly dense file that begins with a preamble. We recognize the preamble because every line starts with two # signs. At the beginning of the preamble, we have some information about the format, so this here is VCF version 4.2, about the source, about the reference file, that is the genome file and perhaps the URL where it was extracted from, and about the sequences in it: chromosomes, scaffolds, or contig sequences. This portion of the preamble is followed by a set of so-called INFO lines, followed in turn by a set of FILTER lines, which describe the filters that were applied to produce the variants, and a set of FORMAT lines. And then lastly, following this information, we have the header line and one line for each sequence variant. I am going to describe each one of these shortly. So let's look at the lines here and see what kind of information we have in the VCF format. The first column is easy: we have the chromosome. In this case, we have five variants, all of them on chromosome 20. The second column gives us the position in the genome. The third column gives us an identifier, if one has been assigned in a database such as dbSNP or maybe the 1000 Genomes. The fourth column gives us the reference allele, that is, the letter or sequence that is present in the reference genome, whereas the next column gives us the alternate, so the variation that is present in the sequencing reads. We then have the quality of the variant call.
Then a filter, which tells us whether the variant passed the filters or not, and if it hasn't, why not. This is followed by the INFO column, which gives us information about the reads that support the call at that particular position, and then by the FORMAT and genotype information. The FORMAT column gives us a template for how the genotype information will be represented in one or multiple samples. You can find more information about this, and about this particular example, at the address that I listed here. So let's look first at the INFO lines. In the INFO lines, we have a definition of each particular field and the characteristics of that field. So, for instance, we have the field NS, the number of samples, which is represented by just one value of integer type and which gives the number of samples that contain data. The second one, DP, is again an integer value, just a single one, and the description is total depth, and so on. All of these that you can see here, NS, DP, AF, which stands for allele frequency, AA, which stands for ancestral allele, DB for dbSNP membership, and H2 for HapMap2 membership, along with a few others, represent standard fields that are recognizable among different programs and different file representations. However, one can define new fields as well. Some examples are shown here. For instance, for our first variant NS = 3, so we have three samples here, the depth is 14 reads, and the allele frequency is 0.5. Let's look at the third variant: it is present in 2 samples, so NS is 2. The total depth of reads at that position is 10, and you might recall from the previous slide that it has two possible alternate alleles, whose corresponding allele frequencies are 0.333 and 0.667. So now that we have presented the INFO lines, let's move on and describe the format.
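Before moving on, the INFO definitions and values just described can be sketched in a few lines of Python. This is only a minimal sketch: the header lines are assumed to be the ones from the example in the VCF specification, which the slide appears to follow, and the helper names parse_info_header and parse_info are hypothetical, not part of any library.

```python
import re

# INFO definitions, assumed to match the preamble of the example file.
HEADER = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">',
    '##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">',
]

INFO_LINE = re.compile(
    r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">'
)

# How each declared Type is converted; flags carry no value at all.
CASTS = {"Integer": int, "Float": float, "String": str}


def parse_info_header(lines):
    """Map each INFO ID to its declared Number, Type and Description."""
    defs = {}
    for line in lines:
        match = INFO_LINE.match(line)
        if match:
            field_id, number, type_, description = match.groups()
            defs[field_id] = {
                "Number": number,
                "Type": type_,
                "Description": description,
            }
    return defs


def parse_info(info, defs):
    """Decode one INFO column, e.g. 'NS=3;DP=14;AF=0.5;DB;H2'."""
    decoded = {}
    for item in info.split(";"):
        if "=" not in item:
            decoded[item] = True  # a flag such as DB or H2
            continue
        key, raw = item.split("=", 1)
        definition = defs.get(key, {})
        cast = CASTS.get(definition.get("Type"), str)
        values = [cast(value) for value in raw.split(",")]
        # Number=1 means a single value; anything else stays a list.
        decoded[key] = values[0] if definition.get("Number") == "1" else values
    return decoded


defs = parse_info_header(HEADER)
first = parse_info("NS=3;DP=14;AF=0.5;DB;H2", defs)
third = parse_info("NS=2;DP=10;AF=0.333,0.667;AA=T;DB", defs)
print(first)
print(third)
```

Note that AF is kept as a list because it is declared with one value per alternate allele, which is why the third variant carries two frequencies.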
So as you might remember, the format simply gives the template for representing genotype information. Just like before, we first define the fields that belong to the format. So for instance here, we have four different types of entities, which are coded GT, GQ, DP, and HQ. And just like before, every FORMAT line tells us how many values we expect and what type they are, which can be string, integer, float, or flag, and gives us some description of what the field represents. For instance, GT stands for Genotype, GQ for Genotype Quality, DP for Read Depth, and HQ for Haplotype Quality. Just like we mentioned on the previous slide about the INFO data, there are a number of standard fields, but others can be defined as well. Let's take a look at our example, and look at the format and also at one of the samples, NA00001. So the template here for the first SNP, because that's what we have, a single nucleotide polymorphism, a substitution, is GT separated by a colon from GQ, separated by another colon from DP, and then from HQ. And if we look at the string that corresponds to this format in the sample, we have that allele 0, the reference allele, is present on both copies. We have that the genotype quality is 48 and that there's only one read mapping. And then for the haplotype quality, where we need to have two values, we have 51 and 51, and so on. Now, let's take a look at the fields that describe the actual type of sequence variation. In the simplest case, this is a SNP, or single nucleotide polymorphism. That's a case of a substitution, where the letter in the reference is replaced by a different letter in the sequencing reads. So let's assume that the reference here has the sequence g c a G g t. And let's assume that the sequencing reads show the pattern g c a A g t. How would we represent this? We will have one line on which we show first the chromosome, say chromosome 14.
Followed by the position, four, because G occurs at position four in the reference sequence. Followed by a dot, because we don't have this variant identified in dbSNP or some other database. Followed by G, which shows us the base in the reference genome, and then the alternate allele, which is A. The quality is unknown, so we leave it as a dot, and let's assume that this variant passed the filter. And then the number of supporting reads is 100. So that was a simple example. Let's try another one, a deletion from the reference. The reference stays the same, g c a G g t. And let's assume that the variant now shows the same sequence, except that the G at position four is removed: g c a g t. We would show this as one line with the first field, 14, as the chromosome number. We mark that the change starts at position three, and there's no known identifier in the databases. The reference allele is AG, and that becomes an A in the alternate. Notice that AG starts at position three. Then we have no information about the quality, we assume that the variant has passed the filter, and the number of supporting reads is 100. And lastly, a more complex example, in which let's assume that we have three different alleles that may come from multiple individuals. The reference genome is the same, g c a G g t. Variant one shows a deletion of the G at the fourth position. Variant two shows a substitution, a single nucleotide polymorphism, where G is replaced by A. Variant three shows an insertion of a T after that position. So how do we represent all of these in one line? Well, we show the chromosome, again 14, and we show that the mutation starts at position 3. Again, no identifier is given. The reference genome has the sequence AG, and this is replaced by A for variant one, AA for variant two, and AGT for variant three. This variant passed the filter, and the depth at that position was 100. So these are ways in which we can represent sequence variation.
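The three examples above can be reproduced with a short Python sketch. It assumes the reference sequence is written in uppercase as GCAGGT, that the 100 supporting reads are recorded as DP=100 in the INFO column, and that all helper names (snp, deletion, insertion, merge, vcf_line) are hypothetical ones chosen for illustration.

```python
REFERENCE = "GCAGGT"  # 1-based positions: G=1, C=2, A=3, G=4, G=5, T=6


def snp(pos, new_base, reference=REFERENCE):
    """Substitution at 1-based pos: returns (POS, REF, ALT)."""
    return (pos, reference[pos - 1], new_base)


def deletion(pos, length=1, reference=REFERENCE):
    """Delete `length` bases starting at 1-based pos (pos must be > 1).

    The record is anchored on the base just before the deleted bases.
    """
    anchor = pos - 1
    ref = reference[anchor - 1:anchor + length]
    return (anchor, ref, ref[0])


def insertion(pos, new_bases, reference=REFERENCE):
    """Insert new_bases right after the base at 1-based pos."""
    base = reference[pos - 1]
    return (pos, base, base + new_bases)


def merge(alleles, reference=REFERENCE):
    """Express several (POS, REF, ALT) alleles against one shared REF."""
    start = min(pos for pos, ref, alt in alleles)
    end = max(pos + len(ref) for pos, ref, alt in alleles)  # exclusive
    shared_ref = reference[start - 1:end - 1]
    alts = []
    for pos, ref, alt in alleles:
        left = reference[start - 1:pos - 1]             # padding before
        right = reference[pos - 1 + len(ref):end - 1]   # padding after
        alts.append(left + alt + right)
    return start, shared_ref, alts


def vcf_line(chrom, pos, ref, alts, depth):
    """Format a record with unknown ID and QUAL, and FILTER set to PASS."""
    fields = [chrom, str(pos), ".", ref, ",".join(alts), ".", "PASS",
              f"DP={depth}"]
    return "\t".join(fields)


pos, ref, alt = snp(4, "A")
snp_line = vcf_line("14", pos, ref, [alt], 100)

pos, ref, alt = deletion(4)
deletion_line = vcf_line("14", pos, ref, [alt], 100)

pos, ref, alts = merge([deletion(4), snp(4, "A"), insertion(4, "T")])
multi_line = vcf_line("14", pos, ref, alts, 100)

for line in (snp_line, deletion_line, multi_line):
    print(line)
```

The merge step is where the padding happens: each allele is extended with reference bases until all of them cover the same stretch, which is why the SNP shows up as AA rather than A in the combined record.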
Structural variants, which are defined as those variants that range in size from about 1,000 bases to a few megabases, can also be represented, but we will not be covering that here. So let's put this all together and try to decode the entries in the VCF file that I've shown you. So we have in column one the chromosome, and all five of these variants are located on chromosome 20. Then the position, as explained on the previous slide. Then the identifier, and you can see that we have two that had been previously identified and stored in dbSNP, as well as a microsatellite. Then we have the reference allele and the alternate allele, and we have a SNP for the first variant and a SNP for the second. We have two possibilities for the alternate allele for the third variant, and so on. We have the qualities and the filter, and we can see that all except variant number two pass the filters. Variant number two does not pass because its quality is below 10. Then the INFO field, which gives us the number of samples with data, the total read depth, and the allele frequency, among others. And lastly, the format, the template, and then correspondingly the representation of the genotype, which we've shown on one of the previous slides. So that's how we put it all together. In the next section, we'll be looking at how we can actually use genomic tools for producing alignments, namely Bowtie and BWA, and then how we can analyze those alignments with tools such as SAMtools and BCFtools in order to produce variant calls.
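To make the decoding concrete, here is a minimal Python sketch that splits a data line into its columns and applies the FORMAT template to each sample. The header line and the two records are assumed to be the first two entries of the example in the VCF specification, which the slide appears to follow, and the function names decode and genotype_alleles are hypothetical.

```python
# Header line and two records, assumed to match the file shown on the slide.
HEADER = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
          "\tNA00001\tNA00002\tNA00003")
RECORDS = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
    "\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017"
    "\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3",
]


def decode(record, header=HEADER):
    """Split one data line into named columns and per-sample genotype fields."""
    names = header.lstrip("#").split("\t")
    fields = record.split("\t")
    row = dict(zip(names[:9], fields[:9]))   # the nine fixed columns
    row["ALT"] = row["ALT"].split(",")       # one entry per alternate allele
    keys = row["FORMAT"].split(":")          # the template, e.g. GT:GQ:DP:HQ
    # zip stops at the shorter sequence, so dropped trailing fields are fine.
    row["SAMPLES"] = {
        sample: dict(zip(keys, values.split(":")))
        for sample, values in zip(names[9:], fields[9:])
    }
    return row


def genotype_alleles(row, sample):
    """Translate the GT indices (0 = REF, 1.. = ALT) into actual alleles."""
    gt = row["SAMPLES"][sample]["GT"]
    alleles = [row["REF"]] + row["ALT"]
    separator = "|" if "|" in gt else "/"    # '|' is phased, '/' is unphased
    return [None if index == "." else alleles[int(index)]
            for index in gt.split(separator)]


first = decode(RECORDS[0])
second = decode(RECORDS[1])
print(first["SAMPLES"]["NA00001"])
print(genotype_alleles(first, "NA00003"))
print(second["FILTER"])
```

All values are left as strings here on purpose, so that the sketch stays short; converting them would use the types declared in the FORMAT lines of the preamble.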