We now introduce the discrete memoryless channel (DMC). A discrete channel can be used repeatedly at every time index i = 1, 2, and so on. We assume that the noise for the transmissions over the channel at different time indices is independent. To formulate a DMC properly, we regard it as a subsystem of a discrete-time stochastic system, which will be referred to as the system. In such a system, random variables are generated sequentially in discrete time. More than one random variable may be generated instantaneously, but sequentially, at a particular time index.

We now introduce our first definition of a DMC, called DMC I. This definition is based on our first definition of a discrete channel: a DMC specified by a transition matrix p(y|x) is a sequence of replicates of a generic discrete channel p(y|x). Let X_i and Y_i be the input and output of the channel at time i, and let T_{i-} be all the random variables in the system generated before X_i. The memoryless property of the channel, which corresponds to the independent-noise assumption, is specified as follows: Pr{Y_i = y, X_i = x, T_{i-} = t} = Pr{X_i = x, T_{i-} = t} p(y|x) for all (x, y, t). Note that the transition probability p(y|x) here corresponds to Pr{Y_i = y | X_i = x, T_{i-} = t}. As such, the memoryless property is equivalent to the Markov chain T_{i-} -> X_i -> Y_i, or in words: given X_i, the input at time i, Y_i, the output at time i, is independent of everything in the past.

Definition 7.4 is the formal definition of DMC I. A discrete memoryless channel p(y|x) is a sequence of replicates of a generic discrete channel p(y|x).
These discrete channels are indexed by a discrete-time index i, where i is greater than or equal to 1, with the i-th channel being available for transmission at time i. Transmission through a channel is assumed to be instantaneous. Let X_i and Y_i be, respectively, the input and output of the DMC at time i, and let T_{i-} denote all the random variables that are generated in the system before X_i. Then the equality Pr{Y_i = y, X_i = x, T_{i-} = t} = Pr{X_i = x, T_{i-} = t} p(y|x) holds for all (x, y, t).

We now introduce an alternative definition of a DMC, called DMC II. This definition is based on our second definition of a discrete channel: a DMC specified by the pair (alpha, Z) is a sequence of replicates of a generic discrete channel (alpha, Z), where Z_i, the noise variable for the transmission at time i, has the same distribution as Z. The memoryless property, which again corresponds to independent noise, is specified as follows: Z_i, the noise variable for the i-th transmission, is independent of X_i, the input at time i, and of T_{i-}, all the random variables generated in the system before X_i. That is, the noise for the transmission at time i is independent of both X_i and everything in the past.

Definition 7.5 is the formal definition of DMC II. A discrete memoryless channel (alpha, Z) is a sequence of replicates of a generic discrete channel (alpha, Z). These discrete channels are indexed by a discrete-time index i, where i is greater than or equal to 1, with the i-th channel being available for transmission at time i. Transmission through a channel is assumed to be instantaneous. Let X_i and Y_i be, respectively, the input and output of the DMC at time i, and let T_{i-} denote all the random variables that are generated in the system before X_i.
The noise variable Z_i for the transmission at time i is a copy of the generic noise variable Z, and is independent of X_i and T_{i-}. The output of the DMC at time i is given by Y_i = alpha(X_i, Z_i). The equivalence of Definition 7.4 (DMC I) and Definition 7.5 (DMC II) can be shown; for the details, please see the textbook. For the rest of our discussion, we assume that both the input alphabet X and the output alphabet Y are finite.

Definition 7.6 defines the capacity of a DMC. The capacity of a discrete memoryless channel p(y|x) is defined as C = max over p(x) of I(X;Y), where X and Y are, respectively, the input and the output of the generic discrete channel, and the maximum is taken over all input distributions p(x). The meaning of this maximization is as follows. For each input distribution p(x), we can determine the joint distribution of the input and the output of the channel as p(x, y) = p(x) p(y|x), because p(y|x) is given. From the joint distribution p(x, y), we can compute I(X;Y), the mutual information between the input and the output of the channel. We then maximize I(X;Y) over all input distributions p(x).

Here are a couple of remarks. Since I(X;Y) is a continuous functional of p(x), that is, I(X;Y) varies continuously with the input distribution p(x), and the set of all p(x) is a compact set (closed and bounded in the Euclidean space with dimension equal to the size of the input alphabet X), the maximum value of I(X;Y) can be attained. We are going to see that C is in fact the maximum rate at which information can be communicated reliably through a DMC, so it is indeed possible to communicate through a channel at a positive rate while P_e, the error probability, tends to zero.

In Example 7.7, we look at an alternative representation of a BSC with crossover probability epsilon.
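Before working through the example, the maximization in Definition 7.6 can be made concrete with a short numerical sketch. Assuming NumPy (the function names here are illustrative, not from the text), the following computes I(X;Y) from an input distribution and a transition matrix, and approximates C for a binary-input channel by a brute-force grid search over p(0):

```python
import numpy as np

def mutual_information(p_x, P):
    """I(X;Y) in bits, for input distribution p_x and transition
    matrix P with P[x, y] = p(y|x)."""
    p_xy = p_x[:, None] * P          # joint: p(x, y) = p(x) p(y|x)
    p_y = p_xy.sum(axis=0)           # output distribution p(y)
    mask = p_xy > 0                  # skip 0 log 0 terms
    return float(np.sum(p_xy[mask]
                        * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

def capacity_binary_input(P, steps=20001):
    """Approximate C = max over p(x) of I(X;Y) for a binary-input
    channel by sweeping p(0) = a over a fine grid."""
    return max(mutual_information(np.array([a, 1.0 - a]), P)
               for a in np.linspace(0.0, 1.0, steps))
```

For a BSC with epsilon = 0.25 (transition matrix [[0.75, 0.25], [0.25, 0.75]]), the grid search agrees with the closed form 1 - h_b(0.25), about 0.189 bits per use, which is derived below for Example 7.7.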
The output variable Y is equal to the input variable X plus a noise variable Z, where the addition is modulo 2, with Pr{Z = 0} = 1 - epsilon and Pr{Z = 1} = epsilon, and Z is independent of X. Note that when Z = 0, Y = X, and when Z = 1, Y is not equal to X. Therefore, Y is not equal to X with probability epsilon, so this is exactly the same model of a BSC that we have seen before.

We now determine the capacity of this channel. First, consider I(X;Y), the mutual information between the input and the output, which equals H(Y) - H(Y|X). We then write H(Y|X) as the sum over x of p(x) H(Y|X = x). Now the conditional entropy of Y given X = x is equal to h_b(epsilon), which does not depend on x, so we can move it outside the summation, where the sum over x of p(x) is equal to 1. Hence H(Y|X) = h_b(epsilon), and I(X;Y) = H(Y) - h_b(epsilon), which is at most 1 - h_b(epsilon), because Y, being a binary random variable, has entropy at most 1. So the capacity of the channel, which is the maximum of I(X;Y) over all input distributions p(x), is at most 1 - h_b(epsilon). This upper bound on I(X;Y) is tight if H(Y) = 1. It can be shown that by taking the input distribution to be the uniform distribution, the output distribution is also uniform, and so H(Y) = 1. Therefore, we conclude that the capacity of the channel is equal to 1 - h_b(epsilon) bits per use.

Consider the plot of the channel capacity versus epsilon, the crossover probability. When epsilon = 0, C = 1. Intuitively this is correct, because there is no error in the channel, and so we can communicate information at the rate of one bit per channel use without error. When epsilon = 1, that is, when there is always an error in the channel, C is also equal to 1.
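The derivation of C = 1 - h_b(epsilon) can be checked directly. A minimal sketch, assuming NumPy (the helper names are mine, not from the text):

```python
import numpy as np

def h_b(p):
    """Binary entropy function in bits, with the convention 0 log 0 = 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def bsc_mutual_information(a, eps):
    """I(X;Y) = H(Y) - h_b(eps) for a BSC with crossover probability
    eps and input distribution p(0) = a."""
    p_y0 = a * (1.0 - eps) + (1.0 - a) * eps   # p(Y = 0)
    return h_b(p_y0) - h_b(eps)
```

With a = 0.5, p(Y = 0) = 0.5 for every epsilon, confirming that a uniform input makes the output uniform, so I(X;Y) attains 1 - h_b(epsilon). Note also that when epsilon = 0.5, p(Y = 0) = 0.5 for every a, so I(X;Y) = 0 for all input distributions.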
Again, intuitively this is correct, because in that situation the decoder only needs to flip the value of Y in order to recover the value of X. Therefore, from the information-transmission point of view, there is no difference between epsilon = 0 and epsilon = 1. When epsilon = 0.5, C = 0. It can be shown that the input X and the output Y are always independent when epsilon = 0.5; we leave the proof as an exercise. For this reason, knowing the output Y of the channel tells us nothing about the input X, and so it is not possible to communicate any information through the channel. Therefore, intuitively, the channel capacity is equal to 0 when epsilon = 0.5.

We now introduce the second channel model, called the binary erasure channel. The channel is described as follows. The input X to the channel is binary, but the output Y of the channel can take the values 0, 1, and e, where the output symbol e denotes an erasure. Gamma, the parameter of the channel, is called the erasure probability, where gamma is between 0 and 1. With probability 1 - gamma, the output is equal to the input, regardless of whether the input is 0 or 1. With probability gamma, the output Y is equal to the erasure symbol e. If X = 0 then Y cannot be 1, and if X = 1 then Y cannot be 0. In other words, in the binary erasure channel, either there is no error or there is an erasure.

The erasure symbol can be interpreted as follows. In some communication systems, a symbol transmitted into the channel can be lost, so that it is not received by the receiver. In other communication systems, the receiver may not be able to distinguish between a 0 and a 1. Both situations can be modelled as an erasure. We now determine the capacity of this channel.
First, consider the channel capacity C, which equals the maximum of I(X;Y), the mutual information between the input and the output, over all input distributions p(x). Now, I(X;Y) = H(Y) - H(Y|X), and H(Y|X) = h_b(gamma), as is evident from the transition diagram above. Since h_b(gamma) does not depend on the input distribution p(x), we only have to maximize H(Y) over all p(x). Toward this end, we let p(0) = a, as indicated in the diagram, so that p(1) = 1 - a. We are going to find the value of a that maximizes the output entropy H(Y).

Now we define a binary random variable E by E = 0 if Y is not equal to e, and E = 1 if Y = e. The random variable E indicates whether an erasure has occurred, and it is a function of Y. Therefore, H(Y) = H(Y, E), because E is a function of Y, and this is equal to H(E) + H(Y|E). Now H(E) is simply h_b(gamma). It can be shown that H(Y|E) = (1 - gamma) h_b(a); we leave the proof as an exercise.

Following from the above, we have C equal to the maximum of H(Y) over all p(x), minus h_b(gamma). Using the expression H(Y) = h_b(gamma) + (1 - gamma) h_b(a), this is equivalent to maximizing h_b(gamma) + (1 - gamma) h_b(a) over all a between 0 and 1, minus h_b(gamma). The h_b(gamma) terms cancel, and we are left with (1 - gamma) times the maximum of h_b(a) over all a. This is equal to 1 - gamma, because the maximum of h_b(a) is 1, attained when a = 0.5, that is, when the input distribution is uniform. Therefore, we conclude that the channel capacity is equal to 1 - gamma bits per use.
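The decomposition H(Y) = h_b(gamma) + (1 - gamma) h_b(a) and the resulting capacity 1 - gamma can also be checked numerically. A small sketch, again assuming NumPy with illustrative helper names:

```python
import numpy as np

def entropy(probs):
    """Entropy in bits of a finite distribution, skipping zero masses."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def h_b(p):
    """Binary entropy function in bits."""
    return entropy([p, 1.0 - p])

def bec_output_entropy(a, gamma):
    """H(Y) for a BEC with erasure probability gamma and input
    p(0) = a; Y takes the values 0, e, 1 with probabilities
    a(1 - gamma), gamma, (1 - a)(1 - gamma)."""
    return entropy([a * (1.0 - gamma), gamma, (1.0 - a) * (1.0 - gamma)])

# Decomposition H(Y) = H(E) + H(Y|E) = h_b(gamma) + (1 - gamma) h_b(a):
gamma = 0.3
for a in (0.1, 0.5, 0.9):
    assert abs(bec_output_entropy(a, gamma)
               - (h_b(gamma) + (1.0 - gamma) * h_b(a))) < 1e-12

# The maximum over a is at a = 0.5, where h_b(a) = 1, so
# C = max H(Y) - h_b(gamma) = 1 - gamma:
assert abs(bec_output_entropy(0.5, gamma) - h_b(gamma) - (1.0 - gamma)) < 1e-12
```

The assertions mirror the steps of the derivation: the first loop checks the entropy decomposition for several input distributions, and the last line checks that the uniform input (a = 0.5) yields C = 1 - gamma.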