In this lecture, I will show you how to make a clustergram in MATLAB. Hierarchical clustering is another way to visualize high-dimensional data. It clusters observations by distance and builds a hierarchical structure on top of that. It gives more detailed information about the differences among clusters. For example, it can tell you which genes contribute the most to the difference between two clusters.
Here is an example of a hierarchical clustergram. It is made of a heat map in the middle, dendrograms on the left and top, and row and column labels on the right and at the bottom. There is also a scale bar on the left. This is the same data set I used in the PCA plotting. Each column is one tumor cell gene expression profile, and each row is a gene. The color indicates relative expression values: red indicates high expression values and blue indicates low expression values. Looking at the column labels, we find that gene expression profiles of the same subtype cluster nicely together. There are three red clusters in the heat map, corresponding to the three subtypes. Recalling that the colors indicate expression values, we can say that this bunch of genes at the top is highly expressed in cluster one, which is subtype three. These genes in the middle are highly expressed in subtype two. And these genes at the bottom are highly expressed in cluster three, which is subtype one.
Here is an example of a simulated clustergram made from random numbers. In this clustergram, no distinct clusters can be observed. The red and blue colors are just mixed together, and the column labels of the three subtypes are, as expected, also mixed. You cannot find any order in it. I always like to present a random figure, because the tumor gene expression data we used is quite good. You can see clear patterns in it, but many data sets will be noisy and fall somewhere between the nice tumor cell data and the simulated random data.
Though the clustergram may look amazing and complex at first sight, its mechanism is quite simple. In this and the next few slides, I will explain how it works. Suppose that we have six gene expression profiles, a to f. On the left are their positions in a two-dimensional PCA plot. The question is, how would we like to cluster them? Well, by eye, you may want to cluster b and c together, d, e and f together, and leave a alone, but this is quite arbitrary. So is there a way to cluster these data points rationally and computationally? Hierarchical clustering offers the solution.
Here is the process. First, we calculate the distance between every two points and find that d and e have the shortest distance. We cluster d and e together and treat de as one single data point. Now we have five points: a, b, c, de and f. Then we calculate the distance between every two of these five points and find that b and c have the shortest distance. We cluster b and c together and treat bc as one single data point. Now we are left with four data points: a, bc, de and f. In the next round of calculation, we find that de and f have the shortest distance, and we cluster them as def. We iterate this process until we get one single cluster that contains all the data points.
The whole process, as illustrated in this picture, is a tree-like structure. This tree-like structure is hierarchical and has different levels. How many clusters we get depends on the level at which we set the cutoff. If we set the cutoff here, we get only two clusters: cluster a and cluster bcdef. If we set the cutoff here, we get three clusters: cluster a, cluster bc and cluster def. And if we set the cutoff at the lowest level, we have our original six data points. The dendrogram we saw in the clustergram is just a compact representation of this hierarchical tree-like structure, turned upside down.
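The merging procedure just described can be sketched in a few lines of plain Python. This is only an illustration, not the MATLAB code used in the demo. The coordinates of the points a to f are made up so that the merges happen in the order described, the function names `agglomerate` and `cut` are hypothetical, and the distance between two clusters is assumed to be the average of all point-to-point Euclidean distances.

```python
from itertools import combinations
from math import dist
from statistics import mean


def cluster_distance(c1, c2, points, linkage=mean):
    """Distance between two clusters: `linkage` applied to all cross-cluster pairs."""
    return linkage([dist(points[p], points[q]) for p in c1 for q in c2])


def agglomerate(points, linkage=mean):
    """Repeatedly merge the two closest clusters; return the list of merges."""
    clusters = [(name,) for name in sorted(points)]  # every point starts alone
    history = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the shortest distance.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        history.append(merged)
    return history


def cut(points, history, k):
    """Replay the merges until only k clusters remain (the cutoff level)."""
    clusters = {(name,) for name in points}
    for merged in history[: len(points) - k]:
        # Drop the clusters that were absorbed by this merge, then add the merge.
        clusters = {c for c in clusters if not set(c) <= set(merged)} | {merged}
    return sorted(clusters)


# Hypothetical 2-D coordinates, chosen only to mimic the slide.
points = {
    "a": (-6.0, 8.0),
    "b": (1.0, 1.0),
    "c": (1.8, 1.2),
    "d": (5.0, 4.0),
    "e": (5.4, 4.2),
    "f": (6.2, 3.2),
}

history = agglomerate(points)
for step, merged in enumerate(history, start=1):
    print(step, "".join(merged))
print(cut(points, history, 2))
print(cut(points, history, 3))
```

With these made-up coordinates the merges come out as de, bc, def, bcdef and finally all six points, and cutting at two or three clusters gives the same groups as on the slide. The sketch recomputes every distance at each round, which is fine for six points but would be too slow for a real expression matrix.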
That is the main idea of hierarchical clustering. Here are some additional things you may want to consider when making a clustergram. The first topic is the metric. The metric defines how to measure the distance between two gene expression profiles. The most common metric is the Euclidean distance. Each gene expression profile is a vector of values, and the Euclidean distance is calculated by the formula shown here. I think most of you are familiar with this formula. Besides Euclidean distance, you can choose cosine distance, correlation distance, Hamming distance and so on, but most of the time Euclidean distance will do the job. One special case is, for example, when your data set is binary. Then you may want to use Hamming distance as your metric, because it is specially designed for binary data.
Look at this picture again. You can see that hierarchical clustering is performed twice, in both directions: column-wise and row-wise. These two clusterings are independent of each other, because the order of the components does not matter when you compute the distance between two vectors. If this doesn't make sense to you, never mind. Just remember that the two clusterings are independent of each other. The result is that similar expression profiles are clustered together, and genes that have similar expression across all profiles are also clustered together. For example, genes that are consistently highly expressed in cluster two are clustered together, like here.
The second topic is the linkage function. You need a linkage function when you want to calculate the distance between clusters. Here is a simple example. You want to calculate the distance between the clustered data point de and the data point f. So how do you define the distance between them? There are a few options. The most common method is called average. In this method, we calculate the distance between d and f and the distance between e and f. Then we use the average of the two distances as the distance between the de cluster and f.
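To make the metric and the linkage function concrete, here is a minimal sketch in plain Python. It is not MATLAB code. The helper names `euclidean`, `hamming` and `linkage_distance` are hypothetical, the coordinates of d, e and f are made up for round numbers, and Hamming distance is assumed to mean the fraction of positions that differ. The Euclidean distance is the square root of the summed squared differences between the two profiles.

```python
from math import sqrt
from statistics import mean


def euclidean(u, v):
    """Euclidean distance between two equal-length expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def hamming(u, v):
    """Fraction of positions at which two binary profiles differ (assumed definition)."""
    return sum(x != y for x, y in zip(u, v)) / len(u)


def linkage_distance(cluster1, cluster2, method=mean, metric=euclidean):
    """Combine all point-to-point distances between two clusters with `method`."""
    return method([metric(p, q) for p in cluster1 for q in cluster2])


# Hypothetical coordinates: d and e already form one cluster, f is a single point.
d, e, f = (0.0, 0.0), (3.0, 0.0), (3.0, 4.0)

print(euclidean(d, f))                    # distance between d and f
print(euclidean(e, f))                    # distance between e and f
print(linkage_distance([d, e], [f]))      # average of the two distances
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))
```

Passing a different `method`, such as `min` or `max`, changes how the individual distances are combined without touching the metric, which is why the two choices can be made independently.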
In the median method, we use the median of the distances. In the single method, we use the shortest of the two distances, and in the complete method, we use the longest of the two. Here's one more example. If you now want to calculate the distance between cluster bc and cluster de using the single method, you calculate the distances from b to d, c to d, b to e and c to e, which gives you four distances. You will find that the distance between c and d is the shortest, so you use this distance as the distance between the two clusters.
One more thing to consider is standardization. Standardization converts data into standardized z-scores. A z-score tells you how many standard deviations a value is away from the mean. If a value equals the mean plus 2 standard deviations, its z-score is 2. Standardization is a normalization process that brings the values into the range that is most suitable for visualization in a clustergram. There are two options, row standardization and column standardization. Row standardization calculates the z-scores for each row, and column standardization calculates the z-scores for each column. For gene expression data, we generally use row standardization, because we want to see, for each gene, how its expression values change across different conditions.
Okay, now we will begin our demo on the clustergram in MATLAB. Let's make a clustergram from the cancer subtype gene expression data. First, import the data from a CSV file. Click the Import Data button, select the CSV file and wait for it to load. We will import all the numerical values as a matrix and name it expressions. Then click Import. Next we will import the column labels. We will import them as a cell array and name it subtypes. Now we change the data type to text, because they are strings, and click Import. The next step is to import the row labels, which are the gene symbols. We also import them as a cell array, and we name it genes.
The data type is already text, so we won't change that, and I click Import. Okay, now in our workspace we have three variables: expressions, genes and subtypes.
Making a clustergram is quite easy in MATLAB, because it takes only one single function. You just need to specify all its properties, so this function has many input arguments. The name of the function is clustergram. Okay, I think I will open the script I already wrote. After the function name clustergram, the first input argument is expressions. Then we specify all the properties. The first one is the row labels, and we give it the variable genes. Then come the column labels, which are the subtypes. Next are RowPDist and ColumnPDist, which specify the metrics to be used in the row-wise and the column-wise hierarchical clustering. We will use Euclidean for both here. For the linkage function we will use average. Then we specify the standardization. I give it the number 2 here, because in MATLAB the number 2 means row-wise standardization, while the number 1 means column-wise standardization. The colormap I will use is redbluecmap. Traditionally, it is usually red-green. Now red-blue is more popular, to accommodate people with red-green color blindness.
Okay. Now I will copy this command, paste it into the Command Window and press Enter to run it. Okay, now we've got our figure. This command, however, looks too long and is not easy to write. Actually, many popular properties are already set by default: the default metric is Euclidean and the default linkage is average. So you can write the command in a shorter form, like the one below. In this command, you do not need to specify RowPDist, ColumnPDist and Linkage, because Euclidean and average are already used by default. So this command looks nicer and shorter, and it does the same thing. I paste it here and run it, and we get the same figure.
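What the row-wise standardization option in the command does to the numbers can be sketched in plain Python. This is an illustration, not MATLAB's implementation. The function names are hypothetical, the small matrix is made up, and the sample standard deviation (dividing by n minus 1) is assumed.

```python
from statistics import mean, stdev


def zscore_rows(matrix):
    """Standardize each row (gene) to mean 0 and sample standard deviation 1."""
    result = []
    for row in matrix:
        mu, sigma = mean(row), stdev(row)  # a constant row would give sigma == 0
        result.append([(x - mu) / sigma for x in row])
    return result


def zscore_columns(matrix):
    """Standardize each column (profile): transpose, reuse zscore_rows, transpose back."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*zscore_rows(transposed))]


# Made-up expression values: two genes (rows) measured in three profiles (columns).
expressions = [
    [2, 4, 6],     # a weakly expressed gene
    [10, 20, 30],  # a strongly expressed gene with the same pattern
]

print(zscore_rows(expressions))
```

Both made-up genes end up with the same z-scores even though their raw values differ by a factor of five, which is the reason for standardizing by row: the heat map then shows how each gene changes across profiles rather than how strongly it is expressed overall.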
After you get this clustergram, you can use this button to show a scale bar, this button to toggle the dendrogram, this button to zoom in and this button to zoom out. Once you are in zoom-in mode, you can use this button to pan over the figure. One nice thing about this clustergram is that you can select a subset of it and copy it to a new clustergram. Then you can examine that part of the clustergram in close detail.
Here I will teach you a trick for exporting the clustergram in vector format. First click Export Setup. Change Rendering to painters (vector format) and click Export. Choose PDF as the format. This is the key step; using EPS won't work. I will name the file clustergram. In this export, I had already resized the figure, but it's better not to resize it and just use the figure as it first appeared.