[MUSIC] So how can we build an algorithm capable of discriminating the patterns corresponding to signal or background before the experiment is even built? Partially, we can rely on simulators that can produce those electromagnetic showers. But we would be interested in getting a more realistic background. And for this, we can use the OPERA experiment, which has been running for five years in a laboratory on a beam coming from CERN, looking for oscillations of neutrinos. The sensitive element of this detector is so precise that the SHiP experiment is going to use the very same detector parts for detection of the showers. Those elements are based on nuclear emulsion. The emulsion consists of silver halide crystals embedded in a polymer. As a charged particle passes through those crystals, it activates individual crystals, and after development, we can see black dots at the places of the activated crystals. On the right side of the slide, you can see the trajectories produced by different kinds of particles. There is one high-energy electron that passes through the volume of the detector, and there is another, less energetic electron. And you see that the scale of this emulsion image is really small: it is 550 microns that are depicted on the slide. Those emulsion layers are put on both sides of a plastic plate; you see them on the left. Interleaved with lead plates, a massive target that makes particles interact and produce traces, they are formed into a brick. This is the building block for the OPERA experiment. Remember, from the slides before, that OPERA is a huge experiment, so it is like a wall of a three-story building made of such bricks. And those bricks are designed according to these techniques. So at the bottom of this slide, you can see examples of electromagnetic showers that are produced by neutrinos.
And looking at those trajectories, you can measure the energy and momentum of the original particle just by counting the number of tracks that belong to the shower. The additional problem here is that we have to dig through a lot of noise, a lot of background tracks, that were collected during the period of data taking, or that were produced inside the emulsion plates during transportation. In the case of OPERA, the plates were transported from Japan to Italy, and during this transportation they were exposed to cosmic-ray radiation. So we have to clean away a lot of background to get a clear signal. In this image, you see some kind of clustering technique being applied to the dataset, and you can see the individual showers there. But in a more realistic scenario, the cleaning goes from the image on the top-left to the one at the bottom of the picture. Every brick consists of about 10 million base tracks, and out of those you have to identify maybe 1,000 base tracks that belong to an electromagnetic shower. So the machine learning challenges related to this problem fall into the area of tracking. You have to identify the properties of the shower: the original vertex the shower is coming from, the trajectories of the particles in the final state, and the types of those particles, which is particle identification. In the case of OPERA, the dataset consists of background collected in a real environment, in bricks of the same configuration that are going to be used for the SHiP experiment, and of simulated signal. The signal comes in a cone-like shape, and every signal shower consists of roughly a thousand different tracks. You also have information about the origin of the shower, so you know the coordinates and angles of the initial particle that produced it.
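The slide only says "some kind of clustering technique," so here is one minimal way such a separation could look on toy data. This is an illustration, not the OPERA pipeline: the choice of DBSCAN, the coordinate scales, the `eps`/`min_samples` values, and the 500/200 track counts are all invented for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical toy data: background base tracks scattered uniformly over a
# (10 mm)^2 area, and signal tracks densely packed around a shower axis.
background = rng.uniform(0, 10_000, size=(500, 2))        # microns
signal = rng.normal(loc=5_000, scale=150, size=(200, 2))  # microns
points = np.vstack([background, signal])

# eps and min_samples are illustrative values, not tuned for real bricks.
labels = DBSCAN(eps=120, min_samples=10).fit_predict(points)

# The dense signal blob forms a cluster, while the sparse background is
# mostly labelled -1 (noise) by DBSCAN.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

The real problem is much harder (tracks live in 3D with angles, and the background is ten thousand times denser), but the principle of separating a dense shower from diffuse noise is the same.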
So in this example dataset, each base track is described by the following features: coordinates, angles, and a goodness of fit, which could be the minimum mean squared error of fitting the individual silver crystals to the line of a base track. And as I mentioned before, the background consists of base tracks randomly scattered around the brick; in a real scenario there are around ten to the seventh background tracks. In the setting that you will be working with as homework, there will be less background. The signal consists of base tracks forming a cone-like shape. There are several signal volumes, and there will be about 1,000 base tracks per signal volume. The origin of a shower is known; as I mentioned before, its coordinates and angles are available. So before going to the solutions that you can apply to this problem, let's consider the figure of merit that physicists are interested in. In this case, we are interested in the estimated energy of an electron, right? So once we have found the tracks that belong to a signal shower, we can build a regression model that connects the number of tracks our algorithm has found with the energy of the original particle. By minimizing the mean squared error, we estimate the coefficients A and B and fit a straight line that gives us the mean energy for a given number of tracks returned by our algorithm. And if we compute the relative error of every blue point with respect to the red line, we can plot a histogram like the one on the right. It looks like a Gaussian by design, because we decided to use linear regression. And the width of this shape gives us the uncertainty of our estimate. This is the energy resolution that physicists are interested in, and they want to make it as sharp as possible, since that would mean the algorithm is capable of very accurate reconstruction of the energy. But the thing is, we can design a proxy metric that can be applied to the design of a regular machine learning algorithm.
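The straight-line fit described above can be sketched in a few lines. This is a toy stand-in: the assumed proportionality of roughly 60 tracks per GeV and the Poisson fluctuations are invented; only the shape of the procedure (fit E = A·N + B, then look at the spread of the relative error) follows the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: true electron energies (GeV) and the number of shower
# tracks an algorithm would find, roughly proportional to energy with
# Poisson-like fluctuations.  The 60 tracks/GeV factor is invented.
energy = rng.uniform(1.0, 10.0, size=1000)
n_tracks = rng.poisson(60.0 * energy)

# Least-squares fit of the straight line E = A * N + B.
A, B = np.polyfit(n_tracks, energy, deg=1)
predicted = A * n_tracks + B

# Relative error per shower; the width of its distribution is the
# energy resolution the physicists care about.
rel_err = (predicted - energy) / energy
resolution = rel_err.std()
print(f"E = {A:.4f} * N + {B:.3f}, resolution = {resolution:.3f}")
```

The histogram of `rel_err` is the Gaussian-looking plot on the right of the slide; a sharper algorithm shrinks `resolution`.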
So in terms of machine learning, we can substitute the energy resolution with a slightly simpler metric that we call a metric proxy, which is the average precision along a certain recall range. In general, it could just be the area under the precision-recall curve, or, if we are interested in a specific region, we can give higher weights to that region. As another proxy, we can also use the area under the ROC curve. And the baseline solution produced by people at the OPERA experiment is the following: we consider only the tracks within a cone of 50 milliradians around the direction of the original particle. We iterate over all base tracks in the cone, and for every base track we compute the distance from the origin and the impact parameter; you can see it in the figure, it is the distance from the line of the base track to the point of the original particle. And we compute the angle between the original particle and each base track. Those features are fed into a classifier like random forest or XGBoost or whatever. And then we get a baseline result of around 0.96 for the area under the ROC curve, or a precision that is roughly 1.0 at a recall equal to 0.5. [MUSIC]
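A baseline of this shape can be sketched as follows. The feature distributions, cutoff values, and sample sizes below are invented for illustration; only the structure, impact parameter and angle features fed into a random forest and scored by ROC AUC, follows the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def make_tracks(n, signal):
    """Toy base tracks described by (distance from origin, impact
    parameter, opening angle).  All scales here are invented."""
    z = rng.uniform(0, 50_000, size=n)                  # microns
    if signal:
        theta = np.abs(rng.normal(0.0, 0.02, size=n))   # rad, inside the cone
        ip = np.abs(rng.normal(0.0, 200.0, size=n))     # microns, small IP
    else:
        theta = rng.uniform(0.0, 0.05, size=n)          # up to 50 mrad
        ip = rng.uniform(0.0, 1_500.0, size=n)          # scattered background
    return np.column_stack([z, ip, theta])

X = np.vstack([make_tracks(2000, True), make_tracks(2000, False)])
y = np.concatenate([np.ones(2000), np.zeros(2000)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"toy ROC AUC: {auc:.3f}")
```

On the real OPERA-style dataset this recipe is what yields the quoted baseline of about 0.96 AUC; the number printed here only reflects the toy distributions.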