Welcome back to the course on audio signal processing for music applications. This week we are talking about the application of audio analysis to the description of sounds. In the last lecture we talked about audio features, presenting various spectral analysis methods with which we can obtain features of a sound that may be relevant for describing it. Now in this lecture we want to go beyond the idea of describing single sounds and introduce the concept of describing collections of sounds. So we'll first present the idea of the music information plane, and then we will distinguish between sounds and music, with the aim of developing methodologies that are either relevant to the more generic concept of sound or more specific to the characteristics of music. So we'll first talk about sounds and sound recordings and then collections of sound recordings, and then we'll continue with the idea of music recordings and collections of music recordings. So this is the music information plane, which should help us understand what we mean by music description in our particular context. We can define different abstraction levels, and the left column lists these different abstraction levels. We can go from the physical level, which is basically the lowest level that we're dealing with, up to the cognitive level, okay, which would be the highest level that we can see there, with some steps in between. So at the physical level, when we talk about sounds and music, we can talk about concepts like the frequency or the duration of the sound, or the spectrum and some clear characteristics of the spectrum, like the centroid. And we can also talk about the intensity of the sound. If we go a level higher, to the sensorial level, then instead of frequency we can talk about the pitch of the sound, and instead of duration we can talk about the perceived, sensorial duration.
And then instead of talking about spectrum we can talk about timbre, which is a sensorial concept. And finally, in terms of intensity, we now talk about loudness, which again is a sensorial concept. We can go a level higher and talk about perceptual topics, perceptual concepts that are more musical, that are related to musical concepts. So here we talk about successive and simultaneous intervals of pitches, what would be called notes. Then, when we talk about time, we talk about the structuring of time, and we talk about things like the beat. And for timbre we talk about aspects of the timbre that we can identify and characterize with some aspect of a musical sound; for example, the spectral envelope would be that. And then finally, instead of loudness, when we talk about musical loudness we normally refer to dynamics, and we have a vocabulary that talks about the dynamics of musical sounds. We can go still a level higher, towards the more formalized way of talking about musical concepts. And therefore, when we talk about pitch-related concepts, we talk about things like melody, or key, or tonality. When we talk about timing-related concepts, we talk about rhythmic patterns, we talk about tempo, we talk about meter. And when we talk about spectral timbre characteristics, we identify musical instruments or voices, entities that have a characteristic timbre. And then finally, when we talk about dynamics or loudness, we are interested in the articulations of the sounds and how sounds change from one to another. And finally we can reach the highest level, the cognitive level, the level that relates to us as humans in a very subjective way: how we listen to music and what issues are relevant for us in the interaction with music. At this level the columns are no longer valid; there is interaction between all these different concepts.
And then we can talk about emotion, or musical style, or semantic concepts that clearly integrate all these other levels of description to obtain these concepts, these ways of describing music that are clearly more generic and definitely subjective or cultural. That is a level that would definitely be hard to reach, and in this class we are definitely not going to talk much about that. So we will focus on the low-level descriptions of sounds, hopefully reaching a high enough level of description that is of relevance to user applications. If we want to describe sounds in a generic way, sounds like the ones we find in Freesound, we can group audio features, the audio features that we talked about in the last lecture, into different categories. So we can talk about the timbre-related features, and we mentioned quite a few of them, like the spectral centroid, or the MFCCs, or the high-frequency content, etc. Then we can talk about another group of features that relate to dynamics; that's basically the loudness and the level of a particular recording. Then we can talk about the pitch-related features, and here is where we can talk about the pitch or the pitch salience. And finally we have to describe also the time-varying aspects of a sound, aspects that relate to the evolution of the sound, to the texture of the sound, and these we can group under the term morphological features. Here we can talk about things like the envelope of a sound, or the onset rate, or many other types of descriptions that we could include under this. We have already seen quite a few of these descriptors, so what is interesting now is that from these descriptors, from these features that we can analyze, we can talk about collections of sounds. So let's talk about how to describe collections of sounds. Clearly there are many ways that we can analyze a collection of sounds and describe it, and we'll focus on three basic concepts.
The first one, and the most important concept that we need to develop, is the idea of similarity. If we want to talk about collections of sounds, we have to talk about the similarity between these sounds so we can form the idea of collections and group them. Once we can talk about similarity, then we can cluster sounds; we can group sounds according to some criteria. And finally, if we know some classes, some existing labels that we use to describe a particular group of sounds, then we can classify sounds; we can assign classes to particular sounds. In our context, a sound collection can be represented by a diagram like this one. We consider a sound as a set of audio features, each feature having a numerical value. In order to properly describe a sound, we have to use many features, but for simplicity we will be taking only, in this case, two features. So if we consider a sound as represented by two features, we can display a sound as a point in a two-dimensional space, and that's what we're seeing here: every feature is one dimension. So here we're showing two audio features. The horizontal axis is the mean of the spectral centroid. We have analyzed notes of three instruments, a violin, a flute, and a trumpet; we have computed the spectral centroid and we have taken the mean of it, so this is a multi-frame feature averaged into a single value. And we have also taken the mean of one of the MFCC coefficients, the second coefficient; that's the mean of the second coefficient on the vertical axis. And we can see that the violin has quite a high value for this coefficient, for the MFCC value, and it has a centroid that covers quite a bit of space. The trumpet has this MFCC coefficient quite a bit lower, so these blue dots are more on the lower side. And the flute sounds are kind of in between, and also their MFCCs are in between. So we can kind of see that these types of sounds are distinct according to these two features.
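To give a rough idea of how such a multi-frame feature could be obtained, here is a minimal sketch, not the course's actual analysis tools, that estimates the mean spectral centroid of a signal with plain NumPy; the frame size, hop size, and test tone are arbitrary choices for illustration:

```python
import numpy as np

def mean_spectral_centroid(x, fs, frame_size=1024, hop=512):
    """Compute the spectral centroid of each frame and average over frames."""
    centroids = []
    for start in range(0, len(x) - frame_size, hop):
        frame = x[start:start + frame_size] * np.hanning(frame_size)
        mag = np.abs(np.fft.rfft(frame))                  # magnitude spectrum
        freqs = np.fft.rfftfreq(frame_size, 1.0 / fs)     # bin frequencies in Hz
        if mag.sum() > 0:
            # centroid = magnitude-weighted mean frequency of the frame
            centroids.append(np.sum(freqs * mag) / np.sum(mag))
    return float(np.mean(centroids))

# A 1-second 440 Hz sine: its centroid should sit near 440 Hz
fs = 44100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
print(mean_spectral_centroid(tone, fs))
```

In the same way we could average an MFCC coefficient over frames, giving each sound one point in the two-dimensional feature space described above.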
Now, in order to work with this space, the most fundamental thing is to measure the distance between sounds, between points. So we have to find a way, in a multidimensional space, not just in this simple 2D space, to compare two sounds. How do we find the similarity between the two? The Euclidean distance is one of the simplest ways to measure the distance between two points in a multidimensional space. In this case, p would correspond to one sound, the collection of feature values of one sound, and q would correspond to another sound, the collection of feature values of the other sound. Then, for every dimension i, we just take the difference between those two values on that particular feature, we square it, we sum over all the features, the dimensions, and then we take the square root. That's the Euclidean distance. In the case of a 2D space, of just two features, it becomes much simpler. So in this case, the red and the blue are two sounds with two features, and we can just measure the Euclidean distance; it's basically the line that connects these two points, the length of this line. Now that we know how to measure distance, we can cluster sounds. K-means is a clustering algorithm: if we give the algorithm the desired number of clusters, it will create the clusters and it will return the mean value of each of these clusters. K-means clustering aims to partition n observations, where the observations correspond to the sounds, into K clusters, so into K categories or groups of sounds, and each observation belongs to the cluster that has the nearest mean. This mean serves as the prototype of the cluster that we are creating. The problem of finding these means, this clustering, is computationally difficult; it is what is called NP-hard. However, there are efficient heuristic algorithms that are commonly employed and that converge quickly to a local optimum.
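The Euclidean distance formula translates directly into code. A minimal sketch, with feature values that are made up purely for illustration:

```python
import numpy as np

def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2) for two feature vectors p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Two sounds, each described by two hypothetical feature values:
print(euclidean_distance([3.0, 4.0], [0.0, 0.0]))  # → 5.0
```

The same function works unchanged for any number of features, since the sum runs over all dimensions.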
So this equation expresses the mean minimization process that we have to go through in the K-means algorithm. The goal is to find the mean mu for every cluster, for each of the K clusters, that minimizes this overall sum; we have to do it holistically, attaining this overall minimization result. Here in the plot we see three steps in this process of obtaining the clusters. On the left, we start from a collection of points; in fact, these are not sound features, these are just random points in space. And the goal is to group them into two clusters. So we initialize the algorithm by choosing two points that will be used as the initial means of the two clusters. In the middle diagram, the red and the blue are the two initial means, and with these two initial means, this collection of samples, of sounds, gets clustered in the way that we see here, with the red cluster and the cyan cluster. And now, with K-means, we iterate over this minimization, this equation that we have here. After a certain number of iterations, it converges, and it converges to the clustering that we have on the right: it has clustered the red dots in the lower left corner and the cyan dots in the upper right corner. Clearly, this is a much better clustering than the initial random clustering that the algorithm started with. So now, with that, we can take collections of sounds and automatically find classes that group sounds that have similar audio features. The last thing that we talk about for describing sound collections is the classification of sounds. That means that we know some classes, we have identified certain categories of sounds, and what we want to do is, given a new sound, classify it into one of these known classes. The K-nearest neighbors classifier, KNN, is an algorithm used for this type of classification.
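The iteration just described, assign each point to its nearest mean, then recompute each mean, can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic 2-D points (random blobs, not sound features), using a fixed number of iterations instead of a proper convergence check:

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(seed)
    # initialize the means with k randomly chosen points
    means = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins the cluster with the nearest mean
        d = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each mean becomes the centroid of its assigned points
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return means, labels

# Two well-separated synthetic blobs of 2-D points:
rng = np.random.default_rng(1)
blob1 = rng.normal([0, 0], 0.1, (20, 2))
blob2 = rng.normal([5, 5], 0.1, (20, 2))
pts = np.vstack([blob1, blob2])
means, labels = kmeans(pts, 2)
print(means)  # one mean near (0, 0), the other near (5, 5)
```

With sounds, each row of `points` would instead hold the audio feature values of one sound.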
The rule that we implement with KNN classifies a sound by assigning to it the class that is most frequent among its neighbors. So we find the K nearest neighbors, and whatever is the majority vote of those neighbors becomes the class of this query, of this new sound. This block diagram exemplifies the process, the set of rules that is implemented in the KNN algorithm. We start from a query, okay, which would correspond to a new sound, and we also start with target examples, a collection of samples, of sounds, that have a label. For example, in the diagram below, we have two such labeled collections, the blue and the red ones, and the cyan dots are our queries. So we have to label, or assign, these query samples to one of these two collections. What we do is measure the distance, with the Euclidean distance, from every query sample to all the target examples, okay? And we take the K top results, so we only look at the K nearest neighbors. From those, we take a majority vote based on the classes they belong to. So in the last box, we basically know the classes that the neighbors belong to, and we assign the class that gets the majority of the votes. On the right diagram, we see the result: the cyan dots have been assigned a color, so some have been assigned to the blue class and the rest have been assigned to the red class. This is a very simple but quite efficient way to classify sounds or, of course, any other type of data into classes. If we now go to musical sounds, to recordings of pieces of music, the features to be analyzed should be more specific and more related to musically meaningful concepts. So let's start by defining some categories of features or descriptors that are musically relevant. We can talk about timbre-related descriptors.
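The steps above, measure distances, keep the K nearest, take a majority vote, translate almost directly into code. A minimal sketch with made-up 2-D feature points and class labels:

```python
import numpy as np
from collections import Counter

def knn_classify(query, examples, labels, k=3):
    """Assign to the query the majority class among its k nearest examples."""
    # Euclidean distance from the query to every labeled example
    d = np.linalg.norm(np.asarray(examples, dtype=float) - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                   # indices of the k nearest examples
    votes = Counter(labels[i] for i in nearest)   # count the class labels among them
    return votes.most_common(1)[0][0]             # the majority class

# Labeled examples in a 2-D feature space (hypothetical values):
examples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["blue", "blue", "blue", "red", "red", "red"]
print(knn_classify([0.5, 0.5], examples, labels, k=3))  # → blue
print(knn_classify([5.5, 5.5], examples, labels, k=3))  # → red
```

In practice each example would be the feature vector of a sound whose class we know, and the query would be the feature vector of the new sound to classify.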
These include things that we mentioned, like instrument characterization or instrumentation characterization, or even the remixing of musical recordings, which is also an important aspect of music. Then another category would be related to melody and harmony, and that includes things like the phrase, the motive, or the tonic of a piece of music; and if we talk about non-Western music traditions, like the Indian music tradition, we talk about raga, or, as in the Turkish music tradition, we talk about makam. These are melodic concepts that can be described and that are important for characterizing a particular piece of music. Then we can talk about rhythm, and there again we talk about patterns, or we can talk about tempo, or we can talk about beat; and in the case of many music traditions, the concept of the metrical cycle is an important way to think about rhythm. And finally, another way of describing music is by describing the structure of a piece of music, defining the sections or the movements that a bigger piece of music might have. These descriptions cannot be obtained by just performing audio analysis. We normally start from audio features, but then we have to develop models from a combination of features that can capture the essence of each concept, and clearly this is beyond the aim of this class. This is very much an open research area, very active, that hopefully will be evolving through the years so that we will eventually be able to do things like this. And if we go to music collections, it's even harder. The description of music collections can be very complex if we want to do musically relevant tasks. This is a very active research topic again, and it is what is referred to as music information retrieval, in which we want to automatically classify pieces of music and be able to perform tasks such as recommending a piece of music or finding pieces of music that might be related to another one.
The concepts that we talked about for sounds also apply here, but they have to be adapted. Similarity is a fundamental concept, but we can divide it, we can find different facets of similarity: we can talk about rhythmic similarity, we can talk about similarity of the instrumentation, of the melodic aspects or the harmonic aspects, or structural similarity. And then, of course, we could combine them in order to find similar songs. These types of similarity are clearly not Euclidean distances; we have to develop similarity measures that are much more sophisticated. Then we can classify and cluster these pieces of music according to different criteria: the pieces can be classified, for example, according to genre, or style, or artist, or the school that the music tradition comes from. Again, this is much beyond what we can cover in this class, but it is a fascinating topic that is a natural continuation of the kinds of things we talked about. Given that this is a very open research problem, the references come from research papers. Typically a lot of this research is labeled under what we call music information retrieval, so the Wikipedia entry for music information retrieval is a good starting point. And then, for the more specific things that we have talked about, you can look at the specific Wikipedia entries for the Euclidean distance, for K-means clustering, or for classification based on K-nearest neighbors; each has a good entry. Of course, these are just two examples of clustering and classification strategies; there are a lot of different strategies coming from the field of machine learning, which has brought many new possibilities for doing these types of tasks. And that's all. So in this lecture, we have opened the door into a huge research field that aims at automatically describing and organizing large collections of sounds and music recordings.
We have just introduced some of the basic concepts and specific methodologies that can be used to start working on this topic. In the programming lectures, we will show some examples of how to actually do some of this. Clearly we cannot do justice to this field of research here; however, I hope you got a taste of it. And I will see you next class, where we will present some more demonstrations and practical examples of all this. See you next time, bye bye.