Maximum Entropy Fundamentals

Entropy 2001, 3, 191-226. ISSN 1099-4300. www.mdpi.org/entropy/

P. Harremoës (1) and F. Topsøe (2)
1: Rønne Allé, Søborg, Denmark. email: moes@post7.tele.dk
2: Department of Mathematics, University of Copenhagen, Denmark. email: topsoe@math.ku.dk

Received: 12 September 2001 / Accepted: 18 September 2001 / Published: 30 September 2001

Abstract: In its modern formulation, the Maximum Entropy Principle was promoted by E.T. Jaynes, starting in the mid-fifties. The principle dictates that one should look for a distribution, consistent with available information, which maximizes the entropy. However, this principle focuses only on distributions and it appears advantageous to bring information theoretical thinking more prominently into play by also focusing on the "observer" and on coding. This view was brought forward by the second named author in the late seventies and is the view we will follow up on here. It leads to the consideration of a certain game, the Code Length Game, and, via standard game theoretical thinking, to a principle of Game Theoretical Equilibrium. This principle is more basic than the Maximum Entropy Principle in the sense that the search for one type of optimal strategies in the Code Length Game translates directly into the search for distributions with maximum entropy. In the present paper we offer a self-contained and comprehensive treatment of fundamentals of both principles mentioned, based on a study of the Code Length Game. Though new concepts and results are presented, the reading should be instructional and accessible to a rather wide audience, at least if certain mathematical details are left aside at a first reading. The most frequently studied instance of entropy maximization pertains to the Mean Energy Model, which involves a moment constraint related to a given function, here taken to represent "energy". This type of application is very well known from the literature, with hundreds of applications pertaining to several different fields, and will also here serve as an important illustration of the theory. But our approach reaches further, especially regarding the study of continuity properties of the entropy function, and this leads to new results which allow a discussion of models with so-called entropy loss. These results have tempted us to speculate over the development of natural languages. In fact, we are able to relate our theoretical findings to the empirically found Zipf's law, which involves statistical aspects of words in a language. The apparent irregularity inherent in models with entropy loss turns out to imply desirable stability properties of languages.

Keywords: Maximum Entropy, Minimum Risk, Game Theoretical Equilibrium, Information Topology, Nash Equilibrium Code, Entropy Loss, Partition Function, Exponential Family, continuity of entropy, hyperbolic distributions, Zipf's law.

Research of both authors supported by the Danish Natural Science Research Council, by INTAS (project 00-738) and by IMPAN-BC, the Banach Center, Warsaw. © 2001 by the authors. Reproduction for noncommercial purposes permitted.

1 The Maximum Entropy Principle: overview and a generic example

The Maximum Entropy Principle as conceived in its modern form by Jaynes, cf. [11], [12] and [13], is easy to formulate: "Given a model of probability distributions, choose the distribution with highest entropy."
With this choice you single out the most significant distribution, the least biased one, the one which best represents the "true" distribution. The sensibility of this principle in a number of situations is well understood and discussed at length by Jaynes, in particular. The principle is by now well established and has numerous applications in physics, biology, demography, economy etc. For practically all applications, the key example which is taken as point of departure, and often the only example discussed, is that of models prescribed by moment conditions. We refer to Kapur [14] for a large collection of examples as well as a long list of references.

In this section we present models defined by just one moment condition. These special models will later be used to illustrate theoretical points of the more technical sections to follow.

Our approach will be based on the introduction of a two-person zero-sum game. The principle which this leads to, called the principle of Game Theoretical Equilibrium, is taken to be even more basic than the Maximum Entropy Principle. In fact, from this principle you are led directly to the Maximum Entropy Principle and, besides, new interesting features emerge naturally by focusing on the interplay between a system and the observer of the system. As such the new principle is in conformity with views of quantum physics; e.g. we can view the principle of Game Theoretical Equilibrium as one way of expressing certain sides of the notion of complementarity, as advocated by Niels Bohr, in a precise mathematical way.

To be more specific, let us choose the language of physics and assume that on the set of natural numbers $\mathbb N$ we have given a function $E$, the energy function. This function is assumed to be bounded below. Typically, $E$ will be non-negative. Further, we specify a certain finite energy level, $\lambda$, and take as our model all probability distributions with mean energy $\lambda$. We assume that the energy $E_i$ in "state" $i \in \mathbb N$ goes fast enough to infinity that the entropies of distributions in the model remain bounded. In particular, this condition is fulfilled if $E_i = \infty$ for all $i$ sufficiently large (the corresponding states are then "forbidden states") and in this case the study reduces to a study of models with finite support.

Once you have accepted the Maximum Entropy Principle, this leads to a search for a maximum entropy distribution in the model. It is then tempting to introduce Lagrange multipliers and to solve the constrained optimization problem you are faced with in the standard manner. In fact, this is what practically all authors do and we shall briefly indicate this approach.

We want to maximize the entropy $H = -\sum_{i=1}^\infty p_i\log p_i$ subject to the moment condition $\sum_{i=1}^\infty p_i E_i = \lambda$ and subject to the usual constraints $p_i \ge 0$, $i \in \mathbb N$, and $\sum_{i=1}^\infty p_i = 1$. Introducing Lagrange multipliers $\beta$ and $\gamma$, we are led to search for a solution for which all partial derivatives of the function $H - \beta\sum_{i=1}^\infty p_i E_i + \gamma\sum_{i=1}^\infty p_i$ vanish. This leads to the suggestion that the solution is of the form

$p_i = \dfrac{\exp(-\beta E_i)}{Z(\beta)}, \quad i \ge 1$    (1.1)

for some value of $\beta$ for which the partition function $Z$ defined by

$Z(\beta) = \sum_{i=1}^\infty \exp(-\beta E_i)$    (1.2)

is finite.

The approach is indeed very expedient. But there are difficulties connected with it. Theoretically, we have to elaborate on the method to be absolutely certain that it leads to the solution, even in the finite case when $E_i = \infty$ for $i$ sufficiently large. Worse than this, in the infinite case there may not be any solution at all.
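Before turning to these difficulties, it may help to see the standard recipe in action numerically. The sketch below is a minimal illustration, not anything taken from the paper: it truncates the alphabet to finitely many states, assumes a concrete energy function and energy level (both made up), and solves (1.1)-(1.2) by bisection on the monotone map from $\beta$ to mean energy.

```python
import numpy as np

def gibbs_distribution(E, beta):
    """Distribution (1.1): p_i proportional to exp(-beta * E_i)."""
    x = -beta * E
    w = np.exp(x - x.max())          # subtract the maximal exponent for numerical stability
    return w / w.sum()

def mean_energy(E, beta):
    """Mean energy <E, P_beta> of the distribution (1.1)."""
    return float(gibbs_distribution(E, beta) @ E)

def solve_beta(E, lam, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection for the beta whose mean energy equals lam; assumes lam lies
    strictly between min(E) and max(E), so a root exists in [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(E, mid) > lam:
            lo = mid                 # mean energy decreases in beta: move right
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Toy instance; all concrete numbers are illustrative assumptions.
E = np.arange(1, 101, dtype=float)   # energies E_i = i on a truncated alphabet
lam = 2.5                            # prescribed mean energy level
beta = solve_beta(E, lam)
p = gibbs_distribution(E, beta)
print(beta, float(p @ E), -float(np.sum(p * np.log(p))))   # beta, mean energy, entropy
```

On such a truncated problem the recipe always succeeds; the difficulties discussed next concern the genuinely infinite case, where no admissible $\beta$ may reproduce the prescribed level.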
The possible absence of a solution is connected with the fact that there may be no distribution of the form (1.1) which satisfies the required moment condition. In such cases it is not clear what to do. Another concern is connected with the observation that the method of Lagrange multipliers is a completely general tool, and this very fact indicates that in any particular situation there may possibly be other ways forward which better reflect the special structure of the problem at hand. Thus one could hope to discover new basic features by appealing to more intrinsic methods.

Finally we note that the method of Lagrange multipliers cannot handle all models of interest. If the model is refined by just adding more moment constraints, this is no great obstacle. Then the distributions and partition functions that will occur instead of (1.1) and (1.2) will work with inner products of the form $\beta' E'_i + \beta'' E''_i + \cdots$ in place of the simple product $\beta E_i$. In fact, we shall look into this later. Also other cases can be handled based on the above analysis; e.g. if we specify the geometric mean, this really corresponds to a linear constraint by taking logarithms, and the maximum entropy problem can be solved as above (and leads to interesting distributions in this case, so-called power laws). But for some problems it may be difficult, or even impossible, to use natural extensions of the standard method or to use suitable transformations which will reduce the study to the standard set-up. In such cases, new techniques are required in the search for a maximum entropy distribution. As examples of this difficulty we point to models involving binomial or empirical distributions, cf. [8] and [22].

After presentation of preliminary material, we introduce in Section 3 the basic concepts related to the game we shall study. Then follows a section which quickly leads to familiar key results. The method depends on the information- and game-theoretical point of view. This does not lead to complete clarification. For the proper understanding of certain phenomena, a more thorough theoretical discussion is required and this is taken up in the remaining sections.

New results are related to so-called entropy loss, situations where a maximum entropy distribution does not exist. In the last section, these types of models are related to Zipf's law regarding statistical aspects of the semantics of natural languages. Mathematical justification of all results is provided. Some technical results which we shall need involve special analytical tools regarding Dirichlet series and are delegated to an appendix.

2 Information theoretical preliminaries

Let $A$, the alphabet, be a discrete set, either finite or countably infinite, and denote by $\widetilde M^1_+(A)$, respectively $M^1_+(A)$, the set of non-negative measures $P$ on $A$ (with the discrete Borel structure) such that $P(A) \le 1$, respectively $P(A) = 1$. The elements in $A$ can be thought of in many ways, e.g. as letters (for purely information theoretical or computer science oriented studies), as pure states (for applications to quantum physics) or as outcomes (for models of probability theory and statistics). For convenience, $A$ will always be taken to be the set $\mathbb N$ of natural numbers or a finite section thereof, and elements in $A$ are typically referred to by indices like $i, j, \dots$.

Measures in $M^1_+(A)$ are probability distributions, or just distributions, measures in $\widetilde M^1_+(A)$ are general distributions and measures in $\widetilde M^1_+(A) \setminus M^1_+(A)$ are incomplete distributions.
For $P, Q, \dots \in \widetilde M^1_+(A)$, the point masses are, typically, denoted by $p_i, q_i, \dots$.

By $\widetilde K(A)$ we denote the set of all mappings $\kappa: A \to [0,\infty]$ which satisfy Kraft's inequality

$\sum_{i\in A} \exp(-\kappa_i) \le 1$.    (2.3)

Elements in $\widetilde K(A)$ are general codes. The values of a general code $\kappa$ are denoted $\kappa_i$. The terminology is motivated by the fact that if $\kappa \in \widetilde K(A)$ and if the base for the exponential in (2.3) is 2, then there exists a binary prefix-free code such that the $i$'th code word consists of approximately $\kappa_i$ binary digits. By $K(A)$ we denote the set of mappings $\kappa: A \to [0,\infty]$ which satisfy Kraft's equality

$\sum_{i\in A} \exp(-\kappa_i) = 1$.    (2.4)

This case corresponds to codes without superfluous digits. For further motivation, the reader may wish to consult [23] or standard textbooks such as [3] and [6]. Elements in $K(A)$ are compact codes, for short just codes.

For mathematical convenience, we shall work with exponentials and logarithms to the base $e$. For $\kappa \in \widetilde K(A)$ and $i \in A$, $\kappa_i$ is the code length associated with $i$ or, closer to the intended interpretation, we may think of $\kappa_i$ as the code length of the code word which we imagine is associated with $i$.

There is a natural bijective correspondence between $\widetilde M^1_+(A)$ and $\widetilde K(A)$, expressed notationally by writing $P \leftrightarrow \kappa$ or $\kappa \leftrightarrow P$, and defined by the formulas

$\kappa_i = -\log p_i, \qquad p_i = \exp(-\kappa_i)$.

Here the values $\kappa_i = \infty$ and $p_i = 0$ correspond to each other. When the above formulas hold, we call $(\kappa, P)$ a matching pair and we say that $\kappa$ is adapted to $P$ or that $P$ is the general distribution which matches $\kappa$. If $\mathcal P \subseteq M^1_+(A)$ and $\kappa \in \widetilde K(A)$, we say that $\kappa$ is $\mathcal P$-adapted if $\kappa$ is adapted to one of the distributions in $\mathcal P$. Note that the correspondence $\kappa \leftrightarrow P$ also defines a bijection between $M^1_+(A)$ and $K(A)$.

The support of $\kappa$ is the set of $i \in A$ with $\kappa_i < \infty$. Thus, with obvious notation, $\mathrm{supp}(\kappa) = \mathrm{supp}(P)$, where $P$ is the distribution matching $\kappa$ and $\mathrm{supp}(P)$ is the usual support of $P$.

For expectations, always w.r.t. genuine probability distributions, we use the bracket notation. Thus, for $P \in M^1_+(A)$ and $f: A \to [-\infty,\infty]$, we put

$\langle f, P\rangle = \sum_{i\in A} f(i)\, p_i$

whenever this is a well-defined extended real number. Mostly, our functions will be non-negative and then $\langle f, P\rangle$ will of course be a well-defined number in $[0,\infty]$. In particular this is the case for the average code length, defined for $\kappa \in \widetilde K(A)$ and $P \in M^1_+(A)$ by

$\langle \kappa, P\rangle = \sum_{i\in A} \kappa_i\, p_i$.

Entropy and divergence are defined as usual, i.e., for $P \in M^1_+(A)$, the entropy of $P$ is given by

$H(P) = -\sum_{i\in A} p_i \log p_i$    (2.5)

or, equivalently, by $H(P) = \langle\kappa, P\rangle$ where $\kappa$ is the code adapted to $P$. And for $P \in M^1_+(A)$ and $Q \in \widetilde M^1_+(A)$ we define the divergence (or relative entropy) between $P$ and $Q$ by

$D(P\|Q) = \sum_{i\in A} p_i \log\frac{p_i}{q_i}$.    (2.6)

Divergence is well defined with $0 \le D(P\|Q) \le \infty$ and $D(P\|Q) = 0$ if and only if $P = Q$.

The topological properties which we shall find useful for codes and for distributions do not quite go in parallel. On the coding side we consider the space $\widetilde K(A)$ of all general codes and remark that this space is a metrizable, compact and convex Hausdorff space. This may be seen by embedding $\widetilde K(A)$ in the space $[0,\infty]^A$ of all functions on $A$ taking values in the compact space $[0,\infty]$. The topology on $\widetilde K(A)$ then is the topology of pointwise convergence. This is the only topology we shall need on $\widetilde K(A)$.

On the distribution side we shall primarily consider probability distributions, but on the corresponding space, $M^1_+(A)$, we find it useful to consider two topologies: the usual, pointwise topology and then a certain stronger, non-metrizable topology, the information topology.
As to the usual topology on $M^1_+(A)$ we remind the reader that this is a metrizable topology; indeed it is metrized by total variation, defined by

$V(P, Q) = \sum_{i\in A} |p_i - q_i|$.

We write $P_n \xrightarrow{V} P$ for convergence and $\overline{\mathcal P}^V$, $\overline{\operatorname{co}}^V\mathcal P$ etc. for closure in this topology (the examples show the closure of $\mathcal P$ and of the convex hull of $\mathcal P$, respectively).

As to the information topology, the second topology which we need on the space $M^1_+(A)$, this can be described as the strongest topology such that, for $(P_n)_{n\ge 1} \subseteq M^1_+(A)$ and $P \in M^1_+(A)$, $\lim_{n\to\infty} D(P_n\|P) = 0$ implies that the sequence $(P_n)$ converges to $P$. Convergence in this topology is denoted $P_n \xrightarrow{D} P$. We only need convergence in this topology for sequences, not for generalized sequences or nets. Likewise, we only need sequential closure, and $\overline{\mathcal P}^D$, $\overline{\operatorname{co}}^D\mathcal P$, or what the case may be, denotes sequential closure. Thus $\overline{\mathcal P}^D$ denotes the set of distributions $P$ for which there exists a sequence $(P_n)_{n\ge 1}$ of distributions in $\mathcal P$ with $P_n \xrightarrow{D} P$. The necessary and sufficient condition that $P_n \xrightarrow{D} P$ holds is that $D(P_n\|P) \to 0$ as $n \to \infty$. We warn the reader that the corresponding statement for nets (generalized sequences) is wrong; only the sufficiency part holds generally. For the purposes of this paper, the reader needs only worry about sequences, but it is comforting to know that the sequential notion $P_n \xrightarrow{D} P$ is indeed a topological notion of convergence. Further details will be in [9].

An important connection between total variation and divergence is expressed by Pinsker's inequality,

$D(P\|Q) \ge \tfrac{1}{2} V(P, Q)^2$,    (2.7)

which shows that convergence in the information topology is stronger than convergence in total variation.

The functions of relevance to us, entropy and divergence, have important continuity properties: $P \mapsto H(P)$ is lower semi-continuous on $M^1_+(A)$ and $(P,Q) \mapsto D(P\|Q)$ is jointly lower semi-continuous on $M^1_+(A) \times \widetilde M^1_+(A)$. These continuity properties even hold w.r.t. the usual, pointwise topology. Details may be found in [23].

3 The Code Length Game, introduction

In this section $\mathcal P$ is a non-empty subset of $M^1_+(A)$, neutrally referred to as the model. In specific applications other terminology may be more appropriate, e.g. the preparation space or the statistical model. Distributions in $\mathcal P$ are called consistent distributions.

With $\mathcal P$ we associate a two-person zero-sum game, called the Code Length Game over $\mathcal P$. In this game, Player I chooses a consistent distribution, and Player II chooses a general code. The cost function, seen from the point of view of Player II, is the map $\mathcal P \times \widetilde K(A) \to [0,\infty]$ given by the average code length: $(P, \kappa) \mapsto \langle\kappa, P\rangle$. This game was introduced in [20], see also [15], [21], [10], [8] and [22]. Player I may be taken to represent "the system", "Nature", "God" or ..., whereas Player II represents "the observer", "the statistician" or ....

We can motivate the game introduced in various ways. The personification of the two participants in the game is natural as far as Player II is concerned since, in many situations, we can identify ourselves with that person. Also, the objective of Player II appears well motivated. To comment on this in more detail, we first remind the reader that we imagine that there is associated a real code consisting of binary sequences to $\kappa \in \widetilde K(A)$ and that $\kappa$ merely tells us what the code lengths of the various code words are.
We can think of a specific code in at least three different ways: as a representation of the letters in $A$, as a means for identification of these letters and, the view we find most fruitful, as a strategy for making observations from a source generating letters from $A$. The two last views are interrelated. In fact, for the strategy of observation which we have in mind, we use the code to identify the actual outcome by posing a succession of questions, starting with the question "is the first binary digit in the code word corresponding to the outcome a 1?"; then we ask for the second binary digit and so on until it is clear to us which letter is the actual outcome from the source. The number of questions asked is the number of binary digits in the corresponding code word.

The cost function can be interpreted as mean representation time, mean identification time or mean observation time, and it is natural for Player II to attempt to minimize this quantity. The sense in assuming that Player I has the opposite aim, namely to maximize the cost function, is more dubious. The arguments one can suggest to justify this, thereby motivating the zero-sum character of the Code Length Game, are partly natural to game theory in general and can partly be borrowed from Jaynes' reasoning behind his Maximum Entropy Principle. Without going into lengthy discussions we give some indications: Though we do not seriously imagine that Player I is a "real" person with rational behaviour, such thoughts regarding the fictive Player I reflect back on our own conceptions. With our fictitious assumptions we express our own modelling. If all we know is the model $\mathcal P$ and if, as is natural, all we strive for is minimization of the cost function, we cannot do better than imagining that Player I is a real person behaving rationally in a way which is least favourable to us. Any other assumption would, typically, lead to nonsensical results which would reveal that we actually knew more than first expressed and, as a consequence, we should then change the model in order better to reflect our level of knowledge.

To sum up, we have argued that the observer should be allowed freely to choose the means of observation, that codes offer an appropriate technical tool for this purpose and that the choice of a specific code should be dictated by the wish to minimize mean observation time, modelled adequately by the chosen cost function. Further, the more fictitious views regarding Player I and the behaviour of that player really reflect on the adequacy and completeness of our modelling. If our modelling is precise, the assumptions regarding Player I are sensible and the general theory of two-person zero-sum games can be expected to lead to relevant and useful results.

The overall principle we shall apply we call the principle of Game Theoretical Equilibrium. It is obtained from general game theoretical considerations applied to the Code Length Game. No very rigid formulation of this principle is necessary. It simply dictates that in the study of a model, we shall investigate standard game theoretical notions such as equilibrium and optimal strategies.

According to our basic principle, Player I should consider, for each possible strategy $P \in \mathcal P$, the infimum of $\langle\kappa, P\rangle$ over $\kappa \in \widetilde K(A)$. This corresponds to the optimal response of Player II to the chosen strategy. The infimum in question can easily be identified by appealing to an important identity which we shall use frequently in the following.
The identity connects average code length, entropy and divergence and states that

$\langle\kappa, P\rangle = H(P) + D(P\|Q)$,    (3.8)

valid for any $\kappa \in \widetilde K(A)$ and $P \in M^1_+(A)$, with $Q$ the (possibly incomplete) distribution matching $\kappa$. The identity is called the linking identity. As $D(P\|Q) \ge 0$ with equality if and only if $P = Q$, an immediate consequence of the linking identity is that entropy can be conceived as minimal average code length:

$H(P) = \min_{\kappa\in\widetilde K(A)} \langle\kappa, P\rangle$.    (3.9)

The minimum is attained for the code adapted to $P$ and, provided $H(P) < \infty$, for no other code.

Seen from the point of view of Player I, the optimal performance is therefore achieved by maximizing entropy. The maximum value to strive for is called the maximum entropy value ($H_{\max}$-value) and is given by

$H_{\max}(\mathcal P) = \sup_{P\in\mathcal P} H(P)$.    (3.10)

On the side of Player II, the "coding side", we consider, analogously, for each $\kappa \in \widetilde K(A)$ the associated risk given by

$R(\kappa|\mathcal P) = \sup_{P\in\mathcal P} \langle\kappa, P\rangle$    (3.11)

and then the minimum risk value ($R_{\min}$-value)

$R_{\min}(\mathcal P) = \inf_{\kappa\in\widetilde K(A)} R(\kappa|\mathcal P)$.    (3.12)

This is the value for Player II to strive for.

We have now looked at each side of the game separately. Combining the two sides, we are led to the usual concepts, well known from the theory of two-person zero-sum games. Thus, the model $\mathcal P$ is in equilibrium if $H_{\max}(\mathcal P) = R_{\min}(\mathcal P) < \infty$, and in this case $H_{\max}(\mathcal P) = R_{\min}(\mathcal P)$ is the value of the game. Note that as a "sup inf" is bounded by the corresponding "inf sup", the inequality

$H_{\max}(\mathcal P) \le R_{\min}(\mathcal P)$    (3.13)

always holds.

The concept of optimal strategies also follows from general considerations. For Player I, this is a consistent distribution with maximal entropy, i.e. a distribution $P \in \mathcal P$ with $H(P) = H_{\max}(\mathcal P)$. And for Player II, an optimal strategy is a code $\kappa^* \in \widetilde K(A)$ such that $R(\kappa^*|\mathcal P) = R_{\min}(\mathcal P)$. Such a code is also called a minimum risk code ($R_{\min}$-code).

4 Cost-stable codes, partition functions and exponential families

The purpose of this section is to establish a certain sufficient condition for equilibrium and to identify the optimal strategies for each of the players in the Code Length Game. This cannot always be done, but the simple result presented here already covers most applications. Furthermore, the approach leads to familiar concepts and results. This will enable the reader to judge the merits of the game theoretical method as compared to a more standard approach via the introduction of Lagrange multipliers.

As in the previous section, we consider a model $\mathcal P \subseteq M^1_+(A)$. Let $\kappa^* \in K(A)$ together with its matching distribution $P^*$ be given and assume that $P^*$ is consistent. Then we call $\kappa^*$ a Nash equilibrium code for the model $\mathcal P$ if

$\langle\kappa^*, P\rangle \le \langle\kappa^*, P^*\rangle, \quad P \in \mathcal P,$    (4.14)

and if $H(P^*) < \infty$. The terminology is adapted from mathematical economy, cf. e.g. Aubin [2]. The requirement can be written $R(\kappa^*|\mathcal P) \le H(P^*) < \infty$. Note that here we insist that a Nash equilibrium code be $\mathcal P$-adapted. This condition will later be relaxed.

Theorem 4.1. Let $\mathcal P$ be a model and assume that there exists a $\mathcal P$-adapted Nash equilibrium code $\kappa^*$, say, with matching distribution $P^*$. Then $\mathcal P$ is in equilibrium and both players have optimal strategies. Indeed, $P^*$ is the unique optimal strategy for Player I and $\kappa^*$ the unique optimal strategy for Player II.

Proof. Since $R(\kappa^*|\mathcal P) \le H(P^*)$, $R_{\min}(\mathcal P) \le H_{\max}(\mathcal P)$. As the opposite inequality always holds by (3.13), $\mathcal P$ is in equilibrium, the value of the Code Length Game associated with $\mathcal P$ is $H(P^*)$, and $\kappa^*$ and $P^*$ are optimal strategies.

To establish the uniqueness of $\kappa^*$, let $\kappa$ be any code distinct from $\kappa^*$.
Let $P$ be the distribution matching $\kappa$. Then, by the linking identity,

$R(\kappa|\mathcal P) \ge \langle\kappa, P^*\rangle = H(P^*) + D(P^*\|P) > H(P^*)$,

hence $\kappa$ is not optimal. For the uniqueness proof of $P^*$, let $P$ be a consistent distribution distinct from $P^*$. Then, again by the linking identity,

$H(P) < H(P) + D(P\|P^*) = \langle\kappa^*, P\rangle \le H(P^*)$,

and $P$ cannot be optimal.

As we shall see later, the existence of a Nash equilibrium code is, essentially, also necessary for the conclusion of the theorem. This does not remove the difficulty of actually finding the Nash equilibrium code in concrete cases of interest. In many cases it turns out to be helpful to search for codes with stronger properties. A code $\kappa^*$ is a cost-stable code for $\mathcal P$ if there exists $h < \infty$ such that $\langle\kappa^*, P\rangle = h$ for all $P \in \mathcal P$. Clearly, a cost-stable code with a consistent matching distribution is a Nash equilibrium code. Therefore, we obtain the following corollary from Theorem 4.1:

Corollary 4.2. If $\kappa^*$ is a cost-stable code for $\mathcal P$ and if the matching distribution $P^*$ is consistent, then $\mathcal P$ is in equilibrium and $\kappa^*$ and $P^*$ are the unique optimal strategies pertaining to the Code Length Game.

(Footnote: The reader may want to note that it is in fact easy to prove directly that if $\mathcal P$ is convex, $H_{\max}(\mathcal P)$ finite and $P^*$ a consistent distribution with maximum entropy, then the adapted code $\kappa^*$ must be a Nash equilibrium code. To see this, let $P_0$ and $P_1$ be distributions with finite entropy and put $P_\alpha = (1-\alpha)P_0 + \alpha P_1$. Then $h(\alpha) = H(P_\alpha)$, $0 \le \alpha \le 1$, is strictly concave and $h'(\alpha) = \langle\kappa_\alpha, P_1\rangle - \langle\kappa_\alpha, P_0\rangle$ with $\kappa_\alpha$ the code adapted to $P_\alpha$. From this it is easy to derive the stated result. A more complete result is given in Theorem 7.3.)

In order to illustrate the usefulness of this result, consider the case of a model $\mathcal P$ given by finitely many linear constraints, say

$\mathcal P = \{P \in M^1_+(A) \mid \langle E_1, P\rangle = \lambda_1, \dots, \langle E_n, P\rangle = \lambda_n\}$    (4.15)

with $E_1, \dots, E_n$ real-valued functions bounded from below and $\lambda_1, \dots, \lambda_n$ real-valued constants. Let us search for cost-stable codes $\kappa$ for $\mathcal P$. Clearly, any code of the form

$\kappa = \alpha + \beta_1 E_1 + \cdots + \beta_n E_n = \alpha + \beta\cdot E$    (4.16)

is cost-stable. Here, $\alpha$ and the $\beta$'s denote constants, $\beta$ and $E$ vectors, and a dot signifies the scalar product of vectors. For $\kappa$ defined by (4.16) to define a code we must require that $\kappa \ge 0$ and, more importantly, that Kraft's equality (2.4) holds. We are thus forced to assume that the partition function evaluated at $\beta = (\beta_1, \dots, \beta_n)$ is finite, i.e. that

$Z(\beta) = \sum_{i\in A} \exp(-\beta\cdot E_i)$    (4.17)

is finite, and that $\alpha = \log Z(\beta)$. When these conditions are fulfilled, $\kappa = \kappa_\beta$ defined by

$\kappa_\beta = \log Z(\beta) + \beta\cdot E$    (4.18)

defines a cost-stable code with individual code lengths given by

$\kappa_{\beta,i} = \log Z(\beta) + \beta\cdot E_i$.    (4.19)

The matching distribution $P_\beta$ is given by the point probabilities

$P_{\beta,i} = \frac{\exp(-\beta\cdot E_i)}{Z(\beta)}$.    (4.20)

In most cases where linear models occur in the applications, one will be able to adjust the parameters in $\beta$ such that $P_\beta$ is consistent. By Corollary 4.2, the entropy maximization problem will then be solved. However, not all cases can be settled in this way, as there may not exist a consistent maximum entropy distribution.

We have seen that the search for cost-stable codes led us to consider the well-known partition function and also the well-known exponential family consisting of the distributions $(P_\beta)$ with $\beta$ ranging over all vectors $\beta \in \mathbb R^n$ for which $Z(\beta) < \infty$. From our game theoretical point of view, the family of codes $(\kappa_\beta)$ with $Z(\beta) < \infty$ has at least as striking features as the corresponding family of distributions.
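As a concrete illustration of (4.17)-(4.20), the sketch below builds the code $\kappa_\beta$ and its matching distribution $P_\beta$ on a small truncated alphabet with two made-up energy functions, checks Kraft's equality (2.4), and verifies numerically that $\langle\kappa_\beta, P\rangle$ depends on $P$ only through the moments $\langle E_\nu, P\rangle$, which is exactly cost-stability over a model of the form (4.15). All concrete choices (alphabet size, energy functions, the value of $\beta$) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 50                                  # truncated alphabet {0, ..., 49} (illustrative)
E = np.stack([np.arange(A, dtype=float),        # E_1(i) = i
              np.log(np.arange(A) + 1.0)])      # E_2(i) = log(i+1)   (both made up)
beta = np.array([0.3, 0.5])             # any beta with Z(beta) finite

Z = np.sum(np.exp(-beta @ E))           # partition function (4.17)
kappa = np.log(Z) + beta @ E            # cost-stable code (4.18)-(4.19)
P_beta = np.exp(-beta @ E) / Z          # matching distribution (4.20)

print(np.sum(np.exp(-kappa)))           # Kraft's equality (2.4): approximately 1.0

# Cost-stability: <kappa, P> depends on P only through the moments <E_nu, P>.
P = rng.random(A); P /= P.sum()         # an arbitrary distribution on A
lhs = kappa @ P
rhs = np.log(Z) + beta @ (E @ P)
print(np.isclose(lhs, rhs))             # True: identical moments give identical cost
```

Adjusting $\beta$ so that the moments of $P_\beta$ hit prescribed values $\lambda_1, \dots, \lambda_n$ is then a finite-dimensional root-finding problem, analogous to the single-constraint sketch in Section 1.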
We shall therefore focus on both types of objects and shall call the family of matching pairs $(\kappa_\beta, P_\beta)$, with $\beta$ ranging over the vectors with $Z(\beta) < \infty$, the exponential family associated with the set $E = (E_1, \dots, E_n)$ of functions on $A$, or associated with the family of models one can define from $E$ by choosing $\lambda = (\lambda_1, \dots, \lambda_n)$ and considering $\mathcal P$ given by (4.15). We stress that the huge literature on exponential families displays other families of discrete distributions than those that can be derived from the above definition. In spite of this we maintain the view that an information theoretical definition in terms of codes (or related objects) is more natural than the usual structural definitions. We shall not pursue this point vigorously here as it will require the consideration of further games than the simple Code Length Game.

In order to further stress the significance of the class of cost-stable codes we mention a simple continuity result:

Theorem 4.3. If a model $\mathcal P$ has a cost-stable code, the entropy function $H$ is continuous when restricted to $\mathcal P$.

Proof. Assume that $\langle\kappa^*, P\rangle = h < \infty$ for all $P \in \mathcal P$. Then $H(P) + D(P\|P^*) = h$ for $P \in \mathcal P$, with $P^*$ the distribution matching $\kappa^*$. As the sum of the two lower semi-continuous functions in this identity is a constant function, each of the functions, in particular the entropy function, must be continuous.

As we have already seen, the notion of cost-stable codes is especially well suited to handle models defined by linear constraints. In Section 6 we shall take this up in more detail.

The following sections will be more technical and mathematically abstract. This appears necessary in order to give a comprehensive treatment of all basic aspects related to the Code Length Game and to the Maximum Entropy Principle.

5 The Code Length Game, further preparations

In Section 3 we introduced a minimum of concepts that enabled us to derive the useful results of Section 4. With that behind us as motivation and background material, we are ready to embark on a more thorough investigation which will lead to a clarification of certain obscure points, especially related to the possibility that a consistent distribution with maximal entropy may not exist. In this section we point out certain results and concepts which will later be useful.

In view of our focus on codes it is natural to look upon divergence in a different way, as redundancy. Given is a code $\kappa \in \widetilde K(A)$ and a distribution $P \in M^1_+(A)$. We imagine that we use $\kappa$ to code letters from $A$ generated by a "source" and that $P$ is the "true" distribution of the letters. The optimal performance is, according to (3.9), represented by the entropy $H(P)$, whereas the actual performance is represented by the number $\langle\kappa, P\rangle$. The difference $\langle\kappa, P\rangle - H(P)$ is then taken as the redundancy. This is well defined if $H(P) < \infty$ and then coincides with $D(P\|Q)$, where $Q$ denotes the distribution matching $\kappa$. As $D(P\|Q)$ is always well defined, we use this quantity for our technical definition: The redundancy of $\kappa \in \widetilde K(A)$ against $P \in M^1_+(A)$ is denoted $D(P\|\kappa)$ and defined by

$D(P\|\kappa) = D(P\|Q)$ with $\kappa \leftrightarrow Q$.

Thus $D(P\|\kappa)$ and $D(P\|Q)$ can be used synonymously and reflect different ways of thinking. Using redundancy rather than divergence, the linking identity takes the following form:

$\langle\kappa, P\rangle = H(P) + D(P\|\kappa)$.    (5.21)

We shall often appeal to basic concavity and convexity properties. Clearly, the entropy function is concave as a minimum of affine functions, cf. (3.9).
However, we need a more detailed result which also implies strict concavity. The desired result is the following identity:

$H\Big(\sum_\nu \alpha_\nu P_\nu\Big) = \sum_\nu \alpha_\nu H(P_\nu) + \sum_\nu \alpha_\nu D(P_\nu\|\bar P)$,    (5.22)

where $\bar P = \sum_\nu \alpha_\nu P_\nu$ is any finite or countably infinite convex combination of probability distributions. This follows by the linking identity. A closely related identity involves divergence and states that, with notation as above and with $Q$ denoting an arbitrary general distribution,

$\sum_\nu \alpha_\nu D(P_\nu\|Q) = D\Big(\sum_\nu \alpha_\nu P_\nu \,\Big\|\, Q\Big) + \sum_\nu \alpha_\nu D(P_\nu\|\bar P)$.    (5.23)

The identity shows that divergence $D(\cdot\|Q)$ is strictly convex. A proof can be found in [23].

For the remainder of the section we consider a model $\mathcal P \subseteq M^1_+(A)$ and the associated Code Length Game. By $\mathrm{supp}(\mathcal P)$ we denote the support of $\mathcal P$, i.e. the set of $i \in A$ for which there exists $P \in \mathcal P$ with $p_i > 0$. Thus $\mathrm{supp}(\mathcal P) = \bigcup_{P\in\mathcal P} \mathrm{supp}(P)$, the union of the usual supports of all consistent distributions. Often one may restrict attention to models with full support, i.e. to models with $\mathrm{supp}(\mathcal P) = A$. However, we shall not make this assumption unless pointed out specifically.

Recall that distributions in $\mathcal P$ are said to be consistent. Often, it is more appropriate to consider distributions in $\overline{\mathcal P}^D$. These distributions are called essentially consistent distributions. Using these distributions we relax the requirements to a distribution with maximum entropy, previously only considered for consistent distributions. Accordingly, a distribution $P^*$ is called a maximum entropy distribution ($H_{\max}$-distribution) if $P^*$ is essentially consistent and $H(P^*) = H_{\max}(\mathcal P)$. We warn the reader that the usual definition in the literature insists on the requirement of consistency. Nevertheless, we find the relaxed requirement of essential consistency more adequate. For one thing, lower semi-continuity of the entropy function implies that

$H_{\max}(\mathcal P) = H_{\max}(\overline{\mathcal P}^D) = H_{\max}(\overline{\mathcal P}^V)$    (5.24)

and this points to the fact that the models $\overline{\mathcal P}^D$ and $\overline{\mathcal P}^V$ behave in the same way as $\mathcal P$. This view is further supported by the observation that for any $\kappa \in \widetilde K(A)$,

$R(\kappa|\mathcal P) = R(\kappa|\overline{\operatorname{co}}^V\mathcal P)$.    (5.25)

This follows as the map $P \mapsto \langle\kappa, P\rangle$ is lower semi-continuous and affine. As a consequence,

$R_{\min}(\mathcal P) = R_{\min}(\overline{\operatorname{co}}^V\mathcal P)$.    (5.26)

It follows that all models $\mathcal P'$ with $\mathcal P \subseteq \mathcal P' \subseteq \overline{\operatorname{co}}^V\mathcal P$ behave similarly as far as the Code Length Game is concerned. The reason why we do not relax further the requirement of an $H_{\max}$-distribution from $P^* \in \overline{\mathcal P}^D$ to $P^* \in \overline{\mathcal P}^V$ is firstly that we hold the information topology to be more relevant for our investigations than the usual topology. Secondly, we shall see that the property $P^* \in \overline{\mathcal P}^D$, which is stronger than $P^* \in \overline{\mathcal P}^V$, can in fact be verified in the situations we have in mind (see Theorem 6.2).

The fact that a consistent $H_{\max}$-distribution may not exist leads to further important notions. Firstly, a sequence $(P_n)_{n\ge 1}$ of distributions is said to be asymptotically optimal if all the $P_n$ are consistent and if $H(P_n) \to H_{\max}(\mathcal P)$ for $n \to \infty$. And, secondly, a distribution $P^*$ is the maximum entropy attractor ($H_{\max}$-attractor) if $P^*$ is essentially consistent and if $P_n \xrightarrow{D} P^*$ for every asymptotically optimal sequence $(P_n)_{n\ge 1}$.

As an example, consider the (uninteresting!) model of all deterministic distributions. For this model, the $H_{\max}$-attractor does not exist and there is no unique $H_{\max}$-distribution. For more sensible models, the $H_{\max}$-attractor $P^*$ will exist, but it may not be an $H_{\max}$-distribution, as lower semi-continuity only guarantees the inequality $H(P^*) \le H_{\max}(\mathcal P)$, not the corresponding equality.
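The identities (5.22) and (5.23) introduced above are appealed to repeatedly in the proofs that follow, so a quick numerical sanity check may be useful before moving on. The sketch below draws a random finite mixture on a small alphabet (the sizes and the seed are arbitrary choices of ours, not taken from the paper) and confirms both identities to machine precision.

```python
import numpy as np

rng = np.random.default_rng(1)

def H(p):
    """Entropy (2.5), with the convention 0*log(0) = 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def D(p, q):
    """Divergence (2.6); assumes supp(p) is contained in supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A finite mixture P_bar = sum_nu alpha_nu P_nu on a 20-letter alphabet.
k, A = 4, 20
alpha = rng.random(k); alpha /= alpha.sum()
Ps = rng.random((k, A)); Ps /= Ps.sum(axis=1, keepdims=True)
P_bar = alpha @ Ps
Q = rng.random(A); Q /= Q.sum()

lhs_522 = H(P_bar)
rhs_522 = sum(a * H(p) for a, p in zip(alpha, Ps)) + sum(a * D(p, P_bar) for a, p in zip(alpha, Ps))
print(np.isclose(lhs_522, rhs_522))     # identity (5.22)

lhs_523 = sum(a * D(p, Q) for a, p in zip(alpha, Ps))
rhs_523 = D(P_bar, Q) + sum(a * D(p, P_bar) for a, p in zip(alpha, Ps))
print(np.isclose(lhs_523, rhs_523))     # identity (5.23)
```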
Having by now refined the concepts related to the distribution side of the Code Length Game, we turn to the coding side. It turns out that we need a localized variant of the risk associated with certain codes.

The codes we shall consider are, intuitively, all codes which the observer (Player II) off-hand finds it worthwhile to consider. If $P \in \mathcal P$ is the "true" distribution, and the observer knows this, he will choose the code adapted to $P$ in order to minimize the average code length. As nature (Player I) could from time to time change the choice of $P \in \mathcal P$, essentially any strategy in the closure of $\mathcal P$ could be approached. With these remarks in mind we find it natural for the observer only to consider $\overline{\mathcal P}^D$-adapted codes in the search for reasonable strategies.

Assume now that the observer decides to choose a $\overline{\mathcal P}^D$-adapted code $\kappa$. Let $P \in \overline{\mathcal P}^D$ be the distribution which matches $\kappa$. Imagine that the choice of $\kappa$ is dictated by a strong belief that the true distribution is $P$ or some distribution very close to $P$ (in the information topology!). Then the observer can evaluate the associated risk by calculating the localized risk associated with $\kappa$, which is defined by the equation

$R_{\mathrm{loc}}(\kappa|\mathcal P) = \sup_{(P_n)\subseteq\mathcal P,\; P_n\xrightarrow{D}P}\; \limsup_{n\to\infty}\, \langle\kappa, P_n\rangle$,    (5.27)

where the supremum is over the class of all sequences of consistent distributions which converge in the information topology to $P$. Note that we insist on a definition which operates with sequences.

Clearly, the normal "global" risk must be at least as large as the localized risk; therefore, for any $\overline{\mathcal P}^D$-adapted code,

$R_{\mathrm{loc}}(\kappa|\mathcal P) \le R(\kappa|\mathcal P)$.    (5.28)

A further and quite important inequality is the following:

$R_{\mathrm{loc}}(\kappa|\mathcal P) \le H_{\max}(\mathcal P)$.    (5.29)

This inequality is easily derived from the defining relation (5.27) by writing $\langle\kappa, P_n\rangle$ in the form $H(P_n) + D(P_n\|P)$ with $\kappa \leftrightarrow P$, noting also that $P_n \xrightarrow{D} P$ implies that $D(P_n\|P) \to 0$. We note that had we allowed nets in the defining relation (5.27), a different and sometimes strictly larger quantity would result and (5.29) would not necessarily hold.

As the last preparatory result, we establish pretty obvious properties of an eventual optimal strategy for the observer, i.e. of an eventual $R_{\min}$-code.

Lemma 5.1. Let $\mathcal P \subseteq M^1_+(A)$ with $R_{\min}(\mathcal P) < \infty$ be given. Then the $R_{\min}$-code is unique and if it exists, say $R(\kappa^*|\mathcal P) = R_{\min}(\mathcal P)$, then $\kappa^*$ is compact with $\mathrm{supp}(\kappa^*) = \mathrm{supp}(\mathcal P)$.

Proof. Assume that $\kappa^* \in \widetilde K(A)$ is an $R_{\min}$-code. As $R(\kappa^*|\mathcal P) < \infty$, $\mathrm{supp}(\mathcal P) \subseteq \mathrm{supp}(\kappa^*)$. Then consider an $a_0 \in \mathrm{supp}(\kappa^*)$ and assume, for the purpose of an indirect proof, that $a_0 \in A\setminus\mathrm{supp}(\mathcal P)$. Then the code $\kappa$ obtained from $\kappa^*$ by putting $\kappa(a_0) = \infty$ and keeping all other values fixed is a general non-compact code which is not identically $+\infty$. Therefore, there exists $\varepsilon > 0$ such that $\kappa - \varepsilon$ is a compact code. For any $P \in \mathcal P$, we use the fact that $a_0 \notin \mathrm{supp}(P)$ to conclude that $\langle\kappa - \varepsilon, P\rangle = \langle\kappa^* - \varepsilon, P\rangle$, hence $R(\kappa - \varepsilon|\mathcal P) = R(\kappa^*|\mathcal P) - \varepsilon$, contradicting the minimality property of $\kappa^*$. Thus we conclude that $\mathrm{supp}(\kappa^*) = \mathrm{supp}(\mathcal P)$. Similarly, it is clear that $\kappa^*$ must be compact, since otherwise $\kappa^* - \varepsilon$ would be more efficient than $\kappa^*$ for some $\varepsilon > 0$.

In order to prove uniqueness, assume that both $\kappa_1$ and $\kappa_2$ are $R_{\min}$-codes for $\mathcal P$. If we assume that $\kappa_1 \ne \kappa_2$, then $\kappa_1(a) \ne \kappa_2(a)$ holds for some $a$ in the common support of the codes $\kappa_1$ and $\kappa_2$, and then, by the geometric/arithmetic inequality, we see that $\frac{1}{2}(\kappa_1 + \kappa_2)$ is a general non-compact code. For some $\varepsilon > 0$, $\frac{1}{2}(\kappa_1 + \kappa_2) - \varepsilon$ will then also be a code and as this code is seen to be more efficient than $\kappa_1$ and $\kappa_2$, we have arrived at a contradiction.
Thus $\kappa_1 = \kappa_2$, proving the uniqueness assertion.

6 Models in equilibrium

Let $\mathcal P \subseteq M^1_+(A)$. By definition, the requirement of equilibrium is one which involves the relationship between both sides of the Code Length Game. The main result of this section shows that the requirement can be expressed in terms involving only one of the sides of the game, either distributions or codes.

Theorem 6.1 (conditions for equilibrium). Let $\mathcal P \subseteq M^1_+(A)$ be a model and assume that $H_{\max}(\mathcal P) < \infty$. Then the following conditions are equivalent:

(i) $\mathcal P$ is in equilibrium,
(ii) $H_{\max}(\operatorname{co}\mathcal P) = H_{\max}(\mathcal P)$,
(iii) there exists a $\overline{\mathcal P}^D$-adapted code $\kappa^*$ such that $R(\kappa^*|\mathcal P) = R_{\mathrm{loc}}(\kappa^*|\mathcal P)$.

Proof. (i) $\Rightarrow$ (iii): Here we assume that $H_{\max}(\mathcal P) = R_{\min}(\mathcal P)$. In particular, $R_{\min}(\mathcal P) < \infty$. As the map $\kappa \mapsto R(\kappa|\mathcal P)$ is lower semi-continuous on $\widetilde K(A)$ (as the supremum of the maps $\kappa \mapsto \langle\kappa, P\rangle$, $P \in \mathcal P$), and as $\widetilde K(A)$ is compact, the minimum of $\kappa \mapsto R(\kappa|\mathcal P)$ is attained. Thus there exists $\kappa^* \in \widetilde K(A)$ such that $R(\kappa^*|\mathcal P) = R_{\min}(\mathcal P)$. As observed in Lemma 5.1, $\kappa^*$ is a compact code and is the unique $R_{\min}$-code.

For $P \in \mathcal P$,

$H(P) + D(P\|\kappa^*) = \langle\kappa^*, P\rangle \le R_{\min}(\mathcal P) = H_{\max}(\mathcal P)$.

It follows that $D(P_n\|\kappa^*) \to 0$ for any asymptotically optimal sequence $(P_n)_{n\ge 1}$. In other words, the distribution $P^* \in M^1_+(A)$ which matches $\kappa^*$ is the $H_{\max}$-attractor of the model. We can now consider any asymptotically optimal sequence $(P_n)_{n\ge 1}$ in order to conclude that

$R_{\mathrm{loc}}(\kappa^*|\mathcal P) \ge \limsup_{n\to\infty}\langle\kappa^*, P_n\rangle = \limsup_{n\to\infty}\big(H(P_n) + D(P_n\|\kappa^*)\big) = H_{\max}(\mathcal P) = R_{\min}(\mathcal P) = R(\kappa^*|\mathcal P)$.

By (5.28), the assertion of (iii) follows.

(iii) $\Rightarrow$ (ii): Assuming that (iii) holds, we find from (5.25), (3.13) and (5.29) that

$H_{\max}(\operatorname{co}\mathcal P) \le R_{\min}(\operatorname{co}\mathcal P) = R_{\min}(\mathcal P) \le R(\kappa^*|\mathcal P) = R_{\mathrm{loc}}(\kappa^*|\mathcal P) \le H_{\max}(\mathcal P)$

and the equality of (ii) must hold.

(ii) $\Rightarrow$ (i): For this part of the proof we fix a specific asymptotically optimal sequence $(P_n)_{n\ge 1} \subseteq \mathcal P$. We assume that (ii) holds. For each $n$ and $m$ we observe that, by (5.22) and (2.7), with $M = \frac{1}{2}(P_n + P_m)$,

$H_{\max}(\mathcal P) = H_{\max}(\operatorname{co}\mathcal P) \ge H\big(\tfrac{1}{2}(P_n + P_m)\big) = \tfrac{1}{2}H(P_n) + \tfrac{1}{2}H(P_m) + \tfrac{1}{2}D(P_n\|M) + \tfrac{1}{2}D(P_m\|M) \ge \tfrac{1}{2}H(P_n) + \tfrac{1}{2}H(P_m) + \tfrac{1}{8}V(P_n, P_m)^2$.

It follows that $(P_n)$ is a Cauchy sequence with respect to total variation, hence there exists $P^* \in M^1_+(A)$ such that $P_n \xrightarrow{V} P^*$. Let $\kappa^*$ be the code adapted to $P^*$.

In order to evaluate $R(\kappa^*|\mathcal P)$ we consider any $P \in \mathcal P$. For a suitable sequence $(\varepsilon_n)_{n\ge 1}$ of positive numbers converging to zero, we consider the sequence $(Q_n)_{n\ge 1} \subseteq \operatorname{co}\mathcal P$ given by

$Q_n = (1 - \varepsilon_n)P_n + \varepsilon_n P, \quad n \ge 1$.

By (5.22) we find that

$H_{\max}(\mathcal P) = H_{\max}(\operatorname{co}\mathcal P) \ge H(Q_n) \ge (1-\varepsilon_n)H(P_n) + \varepsilon_n H(P) + \varepsilon_n D(P\|Q_n)$,

hence

$H(P) + D(P\|Q_n) \le H_{\max}(\mathcal P) + \varepsilon_n^{-1}\big(H_{\max}(\mathcal P) - H(P_n)\big)$.

As $Q \mapsto D(P\|Q)$ is lower semi-continuous, we conclude from this that

$H(P) + D(P\|P^*) \le H_{\max}(\mathcal P) + \liminf_{n\to\infty}\, \varepsilon_n^{-1}\big(H_{\max}(\mathcal P) - H(P_n)\big)$.

Choosing the $\varepsilon_n$'s appropriately, e.g. $\varepsilon_n = (H_{\max}(\mathcal P) - H(P_n))^{1/2}$, it follows that $H(P) + D(P\|P^*) \le H_{\max}(\mathcal P)$, i.e. that $\langle\kappa^*, P\rangle \le H_{\max}(\mathcal P)$. As this holds for all $P \in \mathcal P$, $R(\kappa^*|\mathcal P) \le H_{\max}(\mathcal P)$ follows. Thus $R_{\min}(\mathcal P) \le H_{\max}(\mathcal P)$, hence equality must hold here, and we have proved that $\mathcal P$ is in equilibrium, as desired.

It is now easy to derive the basic properties which hold for a system in equilibrium.

Theorem 6.2 (models in equilibrium). Assume that $\mathcal P \subseteq M^1_+(A)$ is a model in equilibrium. Then the following properties hold:

(i) There exists a unique $H_{\max}$-attractor and for this distribution, say $P^*$, the inequality

$R_{\min}(\mathcal P) + D(P^*\|\kappa) \le R(\kappa|\mathcal P)$    (6.30)

holds for all $\kappa \in \widetilde K(A)$.
(ii) There exists a unique $R_{\min}$-code and for this code, say $\kappa^*$, the inequality

$H(P) + D(P\|\kappa^*) \le H_{\max}(\mathcal P)$    (6.31)

holds for every $P \in \mathcal P$, even for every $P \in \overline{\operatorname{co}}^V\mathcal P$. The $R_{\min}$-code is compact.

Proof. The existence of the $R_{\min}$-code was established by the compactness argument in the beginning of the proof of Theorem 6.1. The inequality (6.31) for $P \in \mathcal P$ is nothing but an equivalent form of the inequality $R(\kappa^*|\mathcal P) \le H_{\max}(\mathcal P)$, and this inequality immediately implies that the distribution $P^*$ matching $\kappa^*$ is the $H_{\max}$-attractor. The extension of the validity of (6.31) to $P \in \overline{\operatorname{co}}^V\mathcal P$ follows from (5.26). To prove (6.30), let $(P_n)_{n\ge 1} \subseteq \mathcal P$ be asymptotically optimal. Then

$R(\kappa|\mathcal P) \ge \limsup_{n\to\infty}\langle\kappa, P_n\rangle = \limsup_{n\to\infty}\big(H(P_n) + D(P_n\|\kappa)\big) \ge H_{\max}(\mathcal P) + \liminf_{n\to\infty} D(P_n\|\kappa) \ge H_{\max}(\mathcal P) + D(P^*\|\kappa)$,

which is the desired conclusion as $H_{\max}(\mathcal P) = R_{\min}(\mathcal P)$.

If $\mathcal P$ is a model in equilibrium, we refer to the pair $(\kappa^*, P^*)$ from Theorem 6.2 as the optimal matching pair. Thus $\kappa^*$ denotes the $R_{\min}$-code and $P^*$ the $H_{\max}$-attractor.

Combining Theorem 6.2 and Theorem 6.1 we realize that, unless $R(\kappa|\mathcal P) = \infty$ for every $\kappa \in \widetilde K(A)$, there exists a unique $R_{\min}$-code. The matching distribution is the $H_{\max}$-attractor for the model $\overline{\operatorname{co}}^V\mathcal P$. We also note that simple examples (even with $A$ a two-element set) show that $\mathcal P$ may have an $H_{\max}$-attractor without $\mathcal P$ being in equilibrium, and this attractor may be far away from the $H_{\max}$-attractor for $\overline{\operatorname{co}}^V\mathcal P$.

Corollary 6.3. For a model $\mathcal P \subseteq M^1_+(A)$ in equilibrium, the $R_{\min}$-code and the $H_{\max}$-attractor form a matching pair, $\kappa^* \leftrightarrow P^*$, and for any matching pair $(\kappa, P)$ with $P \in \overline{\operatorname{co}}^V\mathcal P$,

$V(P, P^*) \le \big(R(\kappa|\mathcal P) - H(P)\big)^{1/2}$.

Proof. Combining (6.30) with (6.31) it follows that for $\kappa \leftrightarrow P$ with $P \in \overline{\operatorname{co}}^V\mathcal P$,

$D(P\|P^*) + D(P^*\|P) \le R(\kappa|\mathcal P) - H(P)$

and the result follows from Pinsker's inequality (2.7).

Corollary 6.3 may help us to judge the approximate position of the $H_{\max}$-attractor $P^*$ even without knowing the value of $H_{\max}(\mathcal P)$. Note also that the proof gave the more precise bound $J(P, P^*) \le R(\kappa|\mathcal P) - H(P)$, with assumptions as in the theorem and with $J(\cdot,\cdot)$ denoting Jeffreys' measure of discrimination, cf. [3] or [16].

Corollary 6.4. Assume that the model $\mathcal P$ has a cost-stable code $\kappa^*$ and let $P^*$ be the matching distribution. Then $\mathcal P$ is in equilibrium and has $(\kappa^*, P^*)$ as optimal matching pair if and only if $P^*$ is essentially consistent.

Proof. By definition, an $H_{\max}$-attractor is essentially consistent. Therefore, the necessity of the condition $P^* \in \overline{\mathcal P}^D$ is trivial. For the proof of sufficiency, assume that $\langle\kappa^*, P\rangle = h$ for all $P \in \mathcal P$ with $h$ a finite constant. Clearly then, $H_{\max}(\mathcal P) \le h$. Now, let $(P_n)_{n\ge 1}$ be a sequence of consistent distributions with $P_n \xrightarrow{D} P^*$. By the linking identity, $H(P_n) + D(P_n\|P^*) = h$ for all $n$, and we see that $H_{\max}(\mathcal P) \ge h$. Thus $H_{\max}(\mathcal P) = h$ and $(P_n)$ is asymptotically optimal. Moreover, $R(\kappa^*|\mathcal P) = h = H_{\max}(\mathcal P)$, so $R_{\min}(\mathcal P) \le H_{\max}(\mathcal P)$ and $\mathcal P$ is in equilibrium. By Theorem 6.2, the sequence converges in the information topology to the $H_{\max}$-attractor, which must then be $P^*$. The result follows.

Note that this result is a natural further development of Corollary 4.2.

Corollary 6.5. Assume that $\mathcal P \subseteq M^1_+(A)$ is a model in equilibrium. Then all models $\mathcal P'$ with $\mathcal P \subseteq \mathcal P' \subseteq \overline{\operatorname{co}}^V\mathcal P$ are in equilibrium too and they all have the same optimal matching pair.

Proof. If $\mathcal P \subseteq \mathcal P' \subseteq \overline{\operatorname{co}}^V\mathcal P$ then

$H_{\max}(\operatorname{co}\mathcal P') \le H_{\max}(\overline{\operatorname{co}}^V\mathcal P) = H_{\max}(\operatorname{co}\mathcal P) = H_{\max}(\mathcal P) \le H_{\max}(\mathcal P')$

and we see that $\mathcal P'$ is in equilibrium. As an asymptotically optimal sequence for $\mathcal P$ is also asymptotically optimal for $\mathcal P'$, it follows that $\mathcal P'$ has the same $H_{\max}$-attractor, hence also the same optimal matching pair, as $\mathcal P$.
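Corollary 6.3 rests on two purely distributional ingredients: Pinsker's inequality (2.7) and the step from the Jeffreys-type bound $D(P\|P^*) + D(P^*\|P) \le R(\kappa|\mathcal P) - H(P)$ to the bound on total variation. The sketch below checks both inequalities numerically on random pairs of distributions; the alphabet size, the number of trials and the seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def V(p, q):
    """Total variation as defined in Section 2."""
    return float(np.abs(p - q).sum())

def D(p, q):
    """Divergence (2.6); here both arguments have full support."""
    return float(np.sum(p * np.log(p / q)))

A = 30
for _ in range(1000):
    p = rng.uniform(1e-3, 1.0, A); p /= p.sum()
    q = rng.uniform(1e-3, 1.0, A); q /= q.sum()
    assert D(p, q) >= 0.5 * V(p, q) ** 2       # Pinsker's inequality (2.7)
    assert V(p, q) ** 2 <= D(p, q) + D(q, p)   # the step behind Corollary 6.3
print("all checks passed")
```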
Another corollary is the following result, which can be used as a basis for proving certain limit theorems, cf. [22].

Corollary 6.6. Let $(A, \mathcal P_n)_{n\ge 1}$ be a sequence of models and assume that they are all in equilibrium with $\sup_n H_{\max}(\mathcal P_n) < \infty$ and that they are nested in the sense that $\operatorname{co}(\mathcal P_1) \subseteq \operatorname{co}(\mathcal P_2) \subseteq \cdots$. Let there further be given a model $\mathcal P$ such that

$\bigcup_{n\ge 1}\mathcal P_n \;\subseteq\; \mathcal P \;\subseteq\; \overline{\operatorname{co}}^V\Big(\bigcup_{n\ge 1}\mathcal P_n\Big)$.

Then $\mathcal P$ is in equilibrium too, and the sequence of $H_{\max}$-attractors of the $\mathcal P_n$'s converges in divergence to the $H_{\max}$-attractor of $\mathcal P$.

Clearly, the corollaries are related and we leave it to the reader to extend the argument in the proof of Corollary 6.5 so that it also covers the case of Corollary 6.6.

We end this section by developing some results on models given by linear conditions, thereby continuing the preliminary results from Sections 1 and 4. We start with a general result which uses the following notion: A distribution $P^*$ is algebraically inner in the model $\mathcal P$ if, for every $P \in \mathcal P$, there exists $Q \in \mathcal P$ such that $P^*$ is a convex combination of $P$ and $Q$.

Lemma 6.7. If the model $\mathcal P$ is in equilibrium and has an $H_{\max}$-distribution $P^*$ which is algebraically inner in $\mathcal P$, then the code adapted to $P^*$ is cost-stable.

Proof. Let $\kappa^*$ be the code adapted to $P^*$. To any $P \in \mathcal P$ we determine $Q \in \mathcal P$ such that $P^*$ is a convex combination of these two distributions. Then, as $\langle\kappa^*, P\rangle \le H_{\max}(\mathcal P)$ and $\langle\kappa^*, Q\rangle \le H_{\max}(\mathcal P)$, and as a convex combination gives $\langle\kappa^*, P^*\rangle \le H_{\max}(\mathcal P)$, we must conclude that $\langle\kappa^*, P\rangle = \langle\kappa^*, Q\rangle$, since $\langle\kappa^*, P^*\rangle$ is in fact equal to $H_{\max}(\mathcal P)$. Therefore, $\kappa^*$ is cost-stable.

Theorem 6.8. If the alphabet $A$ is finite and the model $\mathcal P$ affine, then the model is in equilibrium and the $R_{\min}$-code is cost-stable.

Proof. We may assume that $\mathcal P$ is closed. By Theorem 6.1, the model is in equilibrium and, by continuity of the entropy function, the $H_{\max}$-attractor is an $H_{\max}$-distribution. For the $R_{\min}$-code $\kappa^*$, $\mathrm{supp}(P^*) = \mathrm{supp}(\mathcal P)$ by Lemma 5.1. As $A$ is finite we can then conclude that $P^*$ is algebraically inner and Lemma 6.7 applies.

We can now prove the following result:

Theorem 6.9. Let $\mathcal P$ be a non-empty model given by finitely many linear constraints as in (4.15):

$\mathcal P = \{P \in M^1_+(A) \mid \langle E_1, P\rangle = \lambda_1, \dots, \langle E_n, P\rangle = \lambda_n\}$.

Assume that the functions $E_1, \dots, E_n, 1$ are linearly independent and that $H_{\max}(\mathcal P) < \infty$. Then the model is in equilibrium and the optimal matching pair $(\kappa^*, P^*)$ belongs to the exponential family defined by (4.18) and (4.20). In particular, $\kappa^*$ is cost-stable.

Proof. The model is in equilibrium by Theorem 6.1. Let $(\kappa^*, P^*)$ be the corresponding optimal matching pair. If $A$ is finite the result follows by Theorem 6.8 and some standard linear algebra. Assume now that $A$ is infinite. Choose an asymptotically optimal sequence $(P_n)_{n\ge 1}$. Let $A_0$ be a finite subset of $A$, chosen sufficiently large (see below), and denote by $\mathcal P^{(n)}$ the convex model of all $P \in \mathcal P$ for which $p_i = P_{n,i}$ for all $i \in A\setminus A_0$. Let $\tilde P_n$ be the $H_{\max}$-attractor for $\mathcal P^{(n)}$ and $\tilde\kappa_n$ the adapted code. Then this code is cost-stable for $\mathcal P^{(n)}$ and of the form

$\tilde\kappa_{n,i} = \alpha_n + \sum_\nu \beta_{n,\nu} E_\nu(i), \quad i \in A_0$.

If the set $A_0$ is sufficiently large, the constants appearing here are uniquely determined. We find that $(\tilde P_n)_{n\ge 1}$ is asymptotically optimal for $\mathcal P$, and therefore $\tilde P_n \xrightarrow{D} P^*$. It follows that the constants $\alpha_n$ and $\beta_{n,\nu}$ converge to some constants $\alpha$ and $\beta_\nu$ and that

$\kappa^*_i = \alpha + \sum_\nu \beta_\nu E_\nu(i), \quad i \in A_0$.

As $A_0$ can be chosen arbitrarily large, the constants $\alpha$ and $\beta_\nu$ must be independent of $A_0$ and the above equation must hold for all $i \in A$.
Remark. Extensions of the result just proved may well be possible, but care has to be taken. For instance, if we consider models obtained by infinitely many linear constraints, the result does not hold. As a simple instance of this, the reader may consider the case where the model is a "line", viz. the affine hull generated by the two distributions $P, Q$ on $A = \mathbb N$ given by $p_i = 2^{-i}$, $i \ge 1$, and $q_i = (\zeta(3)\, i^3)^{-1}$, $i \ge 1$. This model is in equilibrium with $P$ as $H_{\max}$-distribution, but the adapted code is not cost-stable. These facts can be established quite easily via the results quoted in the footnote following the proof of Theorem 4.1.

7 Entropy-continuous models

In the sequel we shall only discuss models in equilibrium. Such models can be quite different regarding the behaviour of the entropy function near the maximum. We start with a simple observation.

Lemma 7.1. If $\mathcal P$ is in equilibrium and the $H_{\max}$-value $H_{\max}(\mathcal P)$ is attained on $\mathcal P$, it is only attained for the $H_{\max}$-attractor.

Proof. Assume that $P \in \mathcal P$ and that $H(P) = H_{\max}(\mathcal P)$. Choose a sequence $(P_n)_{n\ge 1} \subseteq \mathcal P$ which converges to $P$ in total variation. By lower semi-continuity and as $H(P) = H_{\max}(\mathcal P)$, we see that $(P_n)$ is asymptotically optimal. Therefore, for the $H_{\max}$-attractor $P^*$, $P_n \xrightarrow{D} P^*$, hence also $P_n \xrightarrow{V} P^*$. It follows that $P = P^*$.

Lemma 7.2. For a model $\mathcal P$ in equilibrium and with $H_{\max}$-attractor $P^*$ the following conditions are equivalent:

(i) $H: \overline{\mathcal P}^V \to \mathbb R$ is continuous at $P^*$ in the topology of total variation,
(ii) $H: \overline{\mathcal P}^D \to \mathbb R$ is sequentially continuous at $P^*$ in the information topology,
(iii) $H(P^*) = H_{\max}(\mathcal P)$.

Proof. Clearly, (i) implies (ii). Assume that (ii) holds and let $(P_n)_{n\ge 1} \subseteq \mathcal P$ be asymptotically optimal. Then $P_n \xrightarrow{D} P^*$. By assumption, $H(P_n) \to H(P^*)$ and (iii) follows since $H(P_n) \to H_{\max}(\mathcal P)$ also holds.

Finally, assume that (iii) holds and let $(P_n)_{n\ge 1} \subseteq \overline{\mathcal P}^V$ satisfy $P_n \xrightarrow{V} P^*$. By lower semi-continuity,

$H_{\max}(\mathcal P) = H(P^*) \le \liminf_{n\to\infty} H(P_n) \le \limsup_{n\to\infty} H(P_n) \le H_{\max}(\overline{\mathcal P}^V) = H_{\max}(\mathcal P)$

and $H(P_n) \to H(P^*)$ follows. Thus (i) holds.

A model $\mathcal P$ in equilibrium is entropy-continuous if $H(P^*) = H_{\max}(\mathcal P)$, with $P^*$ the $H_{\max}$-attractor. In the opposite case we say that there is an entropy loss.

We now discuss entropy-continuous models. As we shall see, the previously introduced notion of a Nash equilibrium code, cf. Section 4, is of central importance in this connection. We need this concept for any $\overline{\mathcal P}^D$-adapted code. Thus, by definition, a code $\kappa^*$ is a Nash equilibrium code if $\kappa^*$ is $\overline{\mathcal P}^D$-adapted and if

$R(\kappa^*|\mathcal P) \le H(P^*) < \infty$,    (7.32)

where $P^*$ denotes the distribution matching $\kappa^*$. We stress that the definition is used for any model $\mathcal P$ (whether or not it is known beforehand that the model is in equilibrium). We shall see below that a Nash equilibrium code is unique.

Theorem 7.3 (entropy-continuous models). Let $\mathcal P \subseteq M^1_+(A)$ be a model. The following conditions are equivalent:

(i) $\mathcal P$ is in equilibrium and entropy-continuous,
(ii) $\mathcal P$ is in equilibrium and has a maximum entropy distribution,
(iii) $\mathcal P$ has a Nash equilibrium code.

If these conditions are fulfilled, the $H_{\max}$-distribution is unique and coincides with the $H_{\max}$-attractor. Likewise, the Nash equilibrium code is unique and it coincides with the $R_{\min}$-code.

Proof. (i) $\Rightarrow$ (ii): This is clear since, assuming that (i) holds, the $H_{\max}$-attractor must be an $H_{\max}$-distribution.

(ii) $\Rightarrow$ (iii): Assume that $\mathcal P$ is in equilibrium and that $P_0 \in \overline{\mathcal P}^D$ is an $H_{\max}$-distribution. Let $(\kappa^*, P^*)$ be the optimal matching pair. Applying Theorem 6.2, (6.31) with $P = P_0$, we conclude that $D(P_0\|P^*) = 0$, hence $P_0 = P^*$.
Then we find that

$R(\kappa^*|\mathcal P) = R_{\min}(\mathcal P) = H_{\max}(\mathcal P) = H(P_0) = H(P^*)$

and we see that $\kappa^*$ is a Nash equilibrium code.

(iii) $\Rightarrow$ (i): If $\kappa^*$ is a Nash equilibrium code for $\mathcal P$, then

$R_{\min}(\mathcal P) \le R(\kappa^*|\mathcal P) \le H(P^*) \le H_{\max}(\mathcal P) \le R_{\min}(\mathcal P)$

and we conclude that $\mathcal P$ is in equilibrium and that $\kappa^*$ is the minimum risk code. In establishing the equivalence of (i)-(iii) we also established the uniqueness assertions claimed.

The theorem generalizes the previous result, Theorem 4.1. We refer to Section 4 for results which point to the great applicability of results like Theorem 7.3.

8 Loss of entropy

We shall study a model $\mathcal P$ in equilibrium. By previous results we realize that for many purposes we may assume that $\mathcal P$ is a closed, convex subset of $M^1_+(A)$ with $H_{\max}(\mathcal P) < \infty$. Henceforth, these assumptions are in force. Denote by $(\kappa^*, P^*)$ the optimal matching pair associated with $\mathcal P$.

By the dissection of $\mathcal P$ we understand the decomposition of $\mathcal P$ consisting of all non-empty sets of the form

$\mathcal P_x = \{P \in \mathcal P \mid \langle\kappa^*, P\rangle = x\}$.    (8.33)

Let $\Omega$ denote the set of $x \in \mathbb R$ with $\mathcal P_x \ne \emptyset$. As $R(\kappa^*|\mathcal P) = R_{\min}(\mathcal P) = H_{\max}(\mathcal P)$, and as $\mathcal P$ is convex with $P^* \in \overline{\mathcal P}^D$, $\Omega$ is a subinterval of $[0, H_{\max}(\mathcal P)]$ which contains the interval $[H(P^*), H_{\max}(\mathcal P)[$.

Clearly, $\kappa^*$ is a cost-stable code for all the models $\mathcal P_x$, $x \in \Omega$. Hence, by Theorem 4.3, the entropy function is continuous on each of the sets $\mathcal P_x$, $x \in \Omega$. Each set $\mathcal P_x$, $x \in \Omega$, is a sub-model of $\mathcal P$ and, as each $\mathcal P_x$ is convex with $H_{\max}(\mathcal P_x) < \infty$, these sub-models are all in equilibrium. The linking identity shows that, for all $P \in \mathcal P_x$,

$H(P) + D(P\|P^*) = x$.    (8.34)

This implies that $H_{\max}(\mathcal P_x) \le x$, a sharpening of the trivial inequality $H_{\max}(\mathcal P_x) \le H_{\max}(\mathcal P)$. From (8.34) it also follows that maximizing entropy $H(\cdot)$ over $\mathcal P_x$ amounts to the same thing as minimizing divergence $D(\cdot\|P^*)$ over $\mathcal P_x$. In other words, the $H_{\max}$-attractor of $\mathcal P_x$ may, alternatively, be characterized as the I-projection of $P^*$ on $\mathcal P_x$, i.e. as the unique distribution $P$ for which $Q_n \xrightarrow{D} P$ for every sequence $(Q_n) \subseteq \mathcal P_x$ for which $D(Q_n\|P^*)$ converges to the infimum of $D(Q\|P^*)$ with $Q \in \mathcal P_x$. (Footnote: Terminology is close to that adopted by Csiszár, cf. [4], [5], who first developed the concept for closed models. This was later extended, using a different terminology, in Topsøe [20]. In this paper we refrain from a closer study of I-projections and refer the reader to the sources just cited.)

Further basic results are collected below:

Theorem 8.1 (dissection of models). Let $\mathcal P$ be a convex model in equilibrium with optimal matching pair $(\kappa^*, P^*)$ and assume that $P^* \in \mathcal P$. Then the following properties hold for the dissection $(\mathcal P_x)_{x\in\Omega}$ defined by (8.33):

(i) The set $\Omega$ is an interval with $\sup\Omega = H_{\max}(\mathcal P)$. A necessary and sufficient condition that $H_{\max}(\mathcal P) \in \Omega$ is that $\mathcal P$ is entropy-continuous. If $\mathcal P$ has entropy loss, $\Omega$ contains the non-degenerate interval $[H(P^*), H_{\max}(\mathcal P)[$.

(ii) The entropy function is continuous on each sub-model $\mathcal P_x$, $x \in \Omega$.

(iii) Each sub-model $\mathcal P_x$, $x \in \Omega$, is in equilibrium and the $H_{\max}$-attractor for $\mathcal P_x$ is the I-projection of $P^*$ on $\mathcal P_x$.

(iv) For $x \in \Omega$, $H_{\max}(\mathcal P_x) \le x$ and the following bi-implications hold, where $P^*_x$ denotes the $H_{\max}$-attractor of $\mathcal P_x$:

$H_{\max}(\mathcal P_x) = x \iff P^*_x = P^* \iff x \ge H(P^*)$.    (8.35)

Proof. (i) and (ii), as well as the inequality $H_{\max}(\mathcal P_x) \le x$ of (iv), were proved above. For the proof of (iv) we consider an $x \in \Omega$ and let $(P_n)_{n\ge 1} \subseteq \mathcal P_x$ be an asymptotically optimal sequence for $\mathcal P_x$. Then the condition $H_{\max}(\mathcal P_x) = x$ is equivalent to the condition $H(P_n) \to x$,
and the condition $P^*_x = P^*$ is equivalent to the condition $P_n \xrightarrow{D} P^*$. In view of the equality $x = H(P_n) + D(P_n\|P^*)$ we now realize that the first bi-implication of (8.35) holds. For the second bi-implication we first remark that, as $x \ge H_{\max}(\mathcal P_x)$ holds generally, if $P^*_x = P^*$ then $x \ge H(P^*)$ must hold.

For the final part of the proof of (iv), we assume that $x \ge H(P^*)$. The equality $H_{\max}(\mathcal P_x) = x$ is evident if $x = H(P^*)$. We may therefore assume that $H(P^*) < x < H_{\max}(\mathcal P)$. We now let $(P_n)_{n\ge 1}$ denote an asymptotically optimal sequence for the full model $\mathcal P$ such that $H(P_n) \ge x$, $n \ge 1$. As $\langle\kappa^*, P^*\rangle \le x \le \langle\kappa^*, P_n\rangle$ for all $n$, we can find a sequence $(Q_n)_{n\ge 1}$ of distributions in $\mathcal P_x$ such that each $Q_n$ is a convex combination of the form $Q_n = \alpha_n P^* + (1-\alpha_n)P_n$. By (5.23), $D(Q_n\|P^*) \le D(P_n\|P^*) \to 0$. Thus $P^*$ is essentially consistent for $\mathcal P_x$ and, as the code adapted to $P^*$ is cost-stable for this model, Corollary 6.4 implies that the model has $P^*$ as its $H_{\max}$-attractor.

A distribution $P^*$ is said to have potential entropy loss if it is the $H_{\max}$-attractor of a model in equilibrium with entropy loss. As we shall see, this amounts to a very special behaviour of the point probabilities. The definition we need at this point we first formulate quite generally for an arbitrary distribution $P$. With $P$ we consider the density function associated with the adapted code, cf. the appendix. In terms of $P$ this function is given by

$\nu(t) = \#\{i \in A \mid p_i \ge \exp(-t)\}$    (8.36)

($\#$ = "number of elements in"). We can now define a hyperbolic distribution as a distribution $P$ such that

$\limsup_{t\to\infty} \frac{\log \nu(t)}{t} = 1$.    (8.37)

Clearly, $\nu(t) \le \exp(t)$ for each $t$, so that the equality in the defining relation may just as well be replaced by the inequality "$\ge$".

We note that zero point probabilities do not really enter into the definition; therefore we may assume without any essential loss of generality that all point probabilities are positive. And then, we may as well assume that the point probabilities are ordered: $p_1 \ge p_2 \ge \cdots$. In this case, it is easy to see that (8.37) is equivalent to the requirement

$\liminf_{i\to\infty} \frac{-\log p_i}{\log i} = 1$.    (8.38)

In the sequel we shall typically work with distributions which are ordered in the above sense. The terminology regarding hyperbolic distributions is inspired by [19] but goes back further, cf. [24]. In these references the reader will find remarks and results pertaining to this and related types of distributions and their discovery from empirical studies, which we will also comment on in the next section.

We note that in (8.38) the inequality "$\ge$" is trivial, as $p_i \le \frac{1}{i}$ for every $i \in A$. Therefore, in more detail, a distribution with ordered point probabilities is hyperbolic if and only if, for every $a > 1$,

$p_i \ge i^{-a}$    (8.39)

for infinitely many indices.

Theorem 8.2. Every distribution with infinite entropy is hyperbolic.

Proof. Assume that $P$ is not hyperbolic and that the point probabilities are ordered. Then there exists $a > 1$ such that $p_i \le i^{-a}$ for all sufficiently large $i$. As the distribution with point probabilities equal to $i^{-a}$, properly normalized, has finite entropy, the result follows.

With every model $\mathcal P$ in equilibrium we associate a partition function and an exponential family, simply by considering the corresponding objects associated with the $R_{\min}$-code for the model in question. This then follows the definition given in Section 4, but for the simple case where there is only one "energy function", with the $R_{\min}$-code playing the role of the energy function.
Theorem 8.3 (maximal models). Let 𝒫₀ ⊆ M¹₊(A) be given and assume that there exists a model 𝒫 such that 𝒫₀ ⊆ 𝒫, 𝒫 is in equilibrium and H_max(𝒫) = H_max(𝒫₀). Then 𝒫₀ itself must be in equilibrium. Furthermore, there exists a largest model 𝒫_max with the stated properties, namely the model

    𝒫_max = {P ∈ M¹₊(A) | ⟨κ*, P⟩ ≤ H_max(𝒫₀)},   (8.40)

where κ* denotes the minimum risk code of 𝒫₀. Finally, any model 𝒫′ with 𝒫₀ ⊆ 𝒫′ ⊆ 𝒫_max is in equilibrium and has the same optimal matching pair as 𝒫₀.

Proof. Choose 𝒫 with the stated properties. By Theorem 6.1,

    H_max(co 𝒫₀) ≤ H_max(co 𝒫) = H_max(𝒫) = H_max(𝒫₀),

hence 𝒫₀ is in equilibrium. Let κ* be the R_min-code of 𝒫₀ in accordance with Theorem 6.2 and consider 𝒫_max defined by (8.40).

Now let 𝒫 be any equilibrium model with 𝒫₀ ⊆ 𝒫 and H_max(𝒫) = H_max(𝒫₀). As an asymptotically optimal sequence for 𝒫₀ is also asymptotically optimal for 𝒫, we realize that 𝒫 has the same H_max-attractor, hence also the same R_min-code, as 𝒫₀. Thus R(κ*|𝒫) = R_min(𝒫) = H_max(𝒫) = H_max(𝒫₀) and it follows that 𝒫 ⊆ 𝒫_max.

Clearly, 𝒫_max is convex and H_max(𝒫_max) = H_max(𝒫₀) < ∞, hence 𝒫_max is in equilibrium by Theorem 6.1. The final assertion of the theorem follows by one more application of Theorem 6.1.

The models which can arise as in Theorem 8.3 via (8.40) are called maximal models. Let κ ∈ K(A) and 0 ≤ h < ∞. Put

    𝒫_{κ,h} = {P ∈ M¹₊(A) | ⟨κ, P⟩ ≤ h}.   (8.41)

We know that any maximal model must be of this form. Naturally, the converse does not hold. An obvious necessary condition is that the entropy of the matching distribution be finite. But we must require more. Clearly, the models in (8.41) are in equilibrium, but it is not clear that they have κ as R_min-code and h as H_max-value.

Theorem 8.4. A distribution P ∈ M¹₊(ℕ) with finite entropy has potential entropy loss if and only if it is hyperbolic.

Proof. We may assume that the point probabilities of P are ordered, and we let κ denote the code adapted to P. Assume first that P is not hyperbolic and that P is the attractor for some model 𝒫. Consider the corresponding maximal models 𝒫_{κ,h} and consider a value of h with H(P) ≤ h ≤ H_max(𝒫). Let β₀ be the abscissa of convergence associated with κ and let Φ be defined as in the appendix. As β₀ < 1, we can choose β > β₀ such that Φ(β) = h. Now both P and the distribution Q_β given by

    Q_β = exp(−βκ)/Z(β)

are attractors for 𝒫_{κ,h} and hence equal. It follows that h = H(P), so that 𝒫 cannot have entropy loss.

Next we show that a hyperbolic distribution has potential entropy loss. Consider the maximal models 𝒫_{κ,h}. Each one of these models is given by a single linear constraint. Therefore, the attractor is an element of the corresponding exponential family. The abscissa of convergence is 1 and, therefore, the range of the map Φ : [1, ∞[ → ℝ is ]κ_1, H(P)]. For h ∈ ]κ_1, H(P)] there exists a consistent maximum entropy distribution. Assume that h₀ > H(P) and that the attractor of 𝒫_{κ,h₀} equals

    Q_β = exp(−βκ)/Z(β).

By Theorem 8.1, Q_β must be the attractor for all 𝒫_{κ,h} with h ∈ [Φ(β), h₀]. In particular, this holds for h = H(P). This shows that P = Q_β. By Theorem 8.1 the conclusion is now clear.

9 Zipf's law

Zipf's law is an empirically discovered relationship for the relative frequencies of the words of a natural language. The law states that

    log f_i ≈ −a log i + b,

where f_i is the relative frequency of the i'th most common word in the language, and where a and b denote constants. For large values of i we then have

    (−log f_i)/(log i) ≈ a.

The constants a and b depend on the language, but for many languages a ≈ 1, see [24].
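Before turning to the qualitative discussion of an ideal language, here is a small self-contained sketch of how such a rank–frequency fit looks in practice. It is only an illustration: the synthetic "corpus", the vocabulary size and the sample size are assumptions, not data from the paper. Word tokens are sampled from an ideal source with probabilities proportional to 1/rank, and a and b are then estimated by least squares on the log–log rank–frequency pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: draw word tokens from an ideal source whose word
# probabilities are proportional to 1/rank (vocabulary and sample sizes
# are arbitrary choices for this sketch).
V, n_tokens = 5000, 200_000
p = 1.0 / np.arange(1, V + 1)
p /= p.sum()
tokens = rng.choice(V, size=n_tokens, p=p)

# Empirical rank/frequency statistics: f_i = frequency of the i'th most
# common word, exactly the quantity appearing in Zipf's law.
counts = np.bincount(tokens, minlength=V)
f = np.sort(counts)[::-1] / n_tokens
ranks = np.arange(1, V + 1)
seen = f > 0                                  # ignore words never sampled

# Least-squares fit of log f_i = -a log i + b.
A = np.vstack([np.log(ranks[seen]), np.ones(seen.sum())]).T
slope, b = np.linalg.lstsq(A, np.log(f[seen]), rcond=None)[0]
print(f"estimated a = {-slope:.2f}, b = {b:.2f}")
```

With the source chosen as above, the estimated exponent comes out close to a = 1, the value reported for many natural languages.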
Now consider an ideal language where the frequencies of words are described by a hyperbolic probability distribution P. Assume that the entropy of the distribution is finite. We shall describe in qualitative terms the consequences of these assumptions as they can be derived from the developed theory, especially Theorem 8.4. We shall see that our assumption introduces a kind of stability of the language which is desirable in most situations.

Small children with a limited vocabulary will use the few words they know with relative frequencies very different from the probabilities described by P. They will only form simple sentences, and at this stage the number of bits per word will be small in the sense that the entropy of the child's probability distribution is small. Therefore the parents will often be able to understand the child even though the pronunciation is poor. The parents will, typically, talk to their children with a lower bit rate than they normally use, but with a higher bit rate than their children. Thereby new words and grammatical structures will be presented to the child, and, adopting elements of this structure, the child will be able to increase its bit rate. At a certain stage the child will be able to communicate at a reasonably high rate (about H(P)). Now the child knows all the basic words and structures of the language. The child is still able to increase its bit rate, but from now on this will make no significant change in the relative frequencies of the words. Bit rates higher than H(P) are from now on obtained by the introduction of specialized words, which occur seldom in the language as a whole. The introduction of new specialized words can be continued during the rest of one's life. Therefore one is able to express even complicated ideas without changing the basic structure of the language; indeed there is no limit, theoretically, to the bit rate at which one can communicate without change of basic structure.

We realize that, in view of our theoretical results, specifically Theorem 8.4, the features of a natural language as just discussed are only possible if the language obeys Zipf's law. Thus we have the striking phenomenon that the apparently "irregular" behaviour of models with entropy loss (or just potential entropy loss) is actually the key to desirable stability: the fact that for such models you can increase the bit rate, the level of communication, and still maintain the basic features of the language. One could even speculate that modelling based on entropy loss lies behind the phenomenon that many will recognize as a fact, viz. that "we can talk without thinking". We just start talking, using the basic structure of the language (and rather common words), and then from time to time stick in more informative words and phrases in order to give our talk more semantic content; but in doing so, we use relatively infrequent words and structures, thus not violating basic principles, and hence still speaking recognizably Danish, English or what the case may be, so that also the receiver or listener feels at ease and recognizes our talk as unmistakably Danish, English or ...

We see that very informative speech can be obtained by the use of infrequent expressions. Therefore a conversation between, say, two physicists may use English supplemented with specialized words like electron and magnetic flux. We recognize their language as English because the basic words and grammar are the same in all English.
The specialists only have to know special words, not a special grammar. In this sense the languages are stable.

If the entropy of our distribution is infinite, the language will behave in just about the same manner as described above. In fact, one would not feel any difference between a language with finite entropy and a language with infinite entropy.

We see that it is convenient that a language follows Zipf's law, but the information theoretic methods also give some explanation of how the language may have evolved into a state which obeys Zipf's law. The set of hyperbolic distributions is convex. Therefore, if two information sources both follow Zipf's law, then so does their mixture, and if two information sources both approximately follow Zipf's law, their mixture will do so even more. The information sources may be from different languages, but it is more interesting to consider a small child learning the language. The child gets input from different sources: the mother, the father, other children etc. Trying to imitate their language, the child will use the words with frequencies which are closer to Zipf's law than the sources. As the language develops during the centuries, the frequencies will converge to a hyperbolic distribution.

Here we have discussed entropy as bits per word and not bits per letter. The letters give an encoding of the words which should primarily be understood by others, and therefore the encoding cannot just be changed to obtain a better data compression. To stress the difference between bits per word and bits per letter, we remark that the words are the basic semantic structure of the language. Therefore we may have an internal representation of the words which has very little to do with their length when spoken, which could explain why it is often much easier to remember a long word in a language you understand than a short word in a language you do not understand.

It would be interesting to compare these ideas with empirical measurements of the entropy here considered but, precisely in the regime where Zipf's law holds, such a study is very difficult as convergence of estimators of the entropy is very slow, cf. [1].

A The partition function

In this appendix we collect some basic facts about partition functions associated with one linear constraint. The point of departure is a code κ ∈ K(A). With κ we associate the partition function Z = Z_κ which maps ℝ into ]0, ∞], given by

    Z(β) = Σ_{i∈I} e^{−βκ_i}.   (A.42)

Here we adopt the convention that e^{−βκ_i} = 0 if β = 0 and κ_i = ∞. Clearly, Z is decreasing on [1, ∞[ and Z(β) → 0 for β → ∞ (note that, given K, e^{−βκ_i} ≤ e^{−K} e^{−κ_i} for all i when β is large enough).

The series defining Z is a Dirichlet series, cf. Hardy and Riesz [7] or Mandelbrojt [17]. The abscissa of convergence we denote by β₀. Thus, by definition, Z(β) < ∞ for β > β₀ and Z(β) = ∞ for β < β₀. As Z(1) ≤ 1, β₀ ≤ 1. If supp(κ) is infinite, Z(β) = ∞ for β ≤ 0, hence β₀ ≥ 0. If supp(κ) is finite, Z(β) < ∞ for all β ∈ ℝ and we then find that β₀ = −∞. Mostly, we shall have the case when supp(κ) is infinite in mind.

We shall characterize β₀ analytically. This is only a problem when supp(κ) is infinite. So assume that this is the case and also assume, for the sake of convenience, that κ has full support and that the indexing set I is the set of natural numbers: I = ℕ, and that κ_1 ≤ κ_2 ≤ ⋯.

Lemma A.1. With assumptions as just introduced,

    β₀ = lim sup_{i→∞} (log i)/κ_i.   (A.43)

Proof. First assume that β is larger than the right-hand side of (A.43). Then, for some t > 1, log i ≤ βκ_i/t for all sufficiently large values of i. For these values of i, e^{−βκ_i} ≤ i^{−t} and we conclude that Z(β) < ∞. Conversely, assume that, for some value of β, Z(β) < ∞. Then β > 0 and, since the κ_i are non-decreasing, i·e^{−βκ_i} ≤ Z(β) for every i, i.e. log i ≤ βκ_i + log Z(β). As this forces κ_i → ∞, we get lim sup_{i→∞} (log i)/κ_i ≤ β. Combining the two observations, (A.43) follows.
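As a quick numerical check of Lemma A.1 (a sketch only; the particular code below is an assumption made for illustration, not taken from the paper), consider the compact code κ_i = 2 log i + log(π²/6), for which Kraft's equality holds and (A.43) gives β₀ = 1/2. The ratio in (A.43) and the partial sums of the Dirichlet series behave accordingly:

```python
import numpy as np

# A concrete compact code (an assumption for illustration): kappa_i = 2 log i + c
# with c = log(pi^2/6), so that Kraft's equality sum_i exp(-kappa_i) = 1 holds
# and, by (A.43), the abscissa of convergence is beta_0 = 1/2.
c = np.log(np.pi ** 2 / 6.0)
i = np.arange(1, 10**6 + 1, dtype=float)
kappa = 2.0 * np.log(i) + c

# The ratio in Lemma A.1 tends to 1/2.
print("log i / kappa_i at i = 10^6:", np.log(i[-1]) / kappa[-1])

# Partial sums of the Dirichlet series Z(beta) = sum_i exp(-beta * kappa_i):
# they keep growing for beta < 1/2, converge (slowly near the abscissa) for
# beta > 1/2, and equal 1 at beta = 1 by Kraft's equality.
for beta in (0.45, 0.55, 1.0):
    for n in (10**4, 10**6):
        print(f"beta = {beta}, n = {n:>7}: Z_n(beta) = {np.exp(-beta * kappa[:n]).sum():.3f}")
```

The same illustrative code is reused in the sketch at the end of this appendix.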
Remark. If no special ordering on A is given, the abscissa of convergence can be expressed analytically via the density function λ : ℝ → ℕ₀ (ℕ₀ = ℕ ∪ {0}) which is defined by

    λ(t) = #{a ∈ A | κ(a) ≤ t}   (A.44)

(# = "number of elements in"). In fact, as follows easily from (A.43),

    β₀ = lim sup_{t→∞} log λ(t)/t.   (A.45)

We can now introduce the exponential family associated with the code κ. It is the family of distributions (Q_β), with β ranging over all values with Z(β) < ∞, which is defined by

    Q_β(a_i) = e^{−βκ_i}/Z(β),  i ∈ I.   (A.46)

The family of adapted codes, denoted (κ_β), is also of significance. These codes are given by

    κ_β(a_i) = log Z(β) + βκ_i,  i ∈ I.   (A.47)

We also need certain approximations to Z, Q_β and κ_β. For convenience we stick to the assumption I = ℕ, κ_1 ≤ κ_2 ≤ ⋯. We then define Z_n, Q_{n,β} and κ_{n,β} by

    Z_n(β) = Σ_{i=1}^{n} e^{−βκ_i},  β ∈ ℝ,   (A.48)

    Q_{n,β}(a_i) = e^{−βκ_i}/Z_n(β),  i ≤ n,   (A.49)

    κ_{n,β}(a_i) = log Z_n(β) + βκ_i,  i ≤ n,   (A.50)

it being understood that supp(Q_{n,β}) = supp(κ_{n,β}) = {1, 2, …, n}. Formally, the approximating quantities could be obtained from (non-compact) codes obtained from κ by replacing κ_i by the value ∞ for i > n.

We are particularly interested in the mean values ⟨κ, Q_β⟩ and ⟨κ, Q_{n,β}⟩, and define functions Φ and Φ_n, n ≥ 1, by

    Φ(β) = ⟨κ, Q_β⟩,  Z(β) < ∞,   (A.51)

    Φ_n(β) = ⟨κ, Q_{n,β}⟩,  β ∈ ℝ.   (A.52)

Note that Φ(1) = H(P), P being the distribution matching κ, and that

    Φ(β) = −Z′(β)/Z(β) = −(log Z)′(β),   (A.53)

    Φ_n(β) = −Z_n′(β)/Z_n(β) = −(log Z_n)′(β).   (A.54)

Furthermore, −Z′ is a Dirichlet series with the same abscissa of convergence as Z, hence Φ(β) is well defined and finite for all β > β₀.

Lemma A.2. With assumptions and notation as above, the following properties hold:

(i) Φ_1 ≤ Φ_2 ≤ ⋯,

(ii) Φ_n is strictly decreasing on ℝ (except if κ_1 = ⋯ = κ_n),

(iii) lim_{β→∞} Φ_n(β) = κ_1 and lim_{β→−∞} Φ_n(β) = κ_n,

(iv) Φ is strictly decreasing on ]β₀, ∞[,

(v) lim_{β→∞} Φ(β) = κ_1,

(vi) Φ(β₀+) is infinite if and only if −Z′(β₀) = ∞,

(vii) if Z(β₀) < ∞, then Φ_n → Φ uniformly on [β₀, ∞[,

(viii) lim_{n→∞} Φ_n(β₀) = Φ(β₀+), the limit from the right at β₀,

(ix) for every β < β₀, lim_{n→∞} Φ_n(β) = ∞.

Proof. (i) follows from

    Φ_{n+1}(β) − Φ_n(β) = (1/(Z_{n+1}(β) Z_n(β))) Σ_{i=1}^{n} (κ_{n+1} − κ_i) e^{−β(κ_{n+1} + κ_i)},

(ii) from

    Φ_n′(β) = −(1/Z_n(β)²) Σ_{n≥i>j≥1} (κ_i − κ_j)² e^{−β(κ_i + κ_j)},

and (iv) from an obvious extension of this formula. Writing Φ_n in the form

    Φ_n(β) = Σ_{i=1}^{n} κ_i / (Σ_{j=1}^{n} e^{−β(κ_j − κ_i)}),

we derive the limiting behaviour of (iii) and, with a little care, the limit relation of (v) follows in a similar way.

(vii) follows from (A.53) and (A.54) since the convergences Z_n(x) → Z(x) and Z_n′(x) → Z′(x) hold uniformly on [β₀, ∞[ when Z(β₀) < ∞. Actually, the uniform convergence is first derived only for intervals of the form [β₀, K]. By (i) and (v) it is easy to extend the uniform convergence to [β₀, ∞[.

It is now clear that for n ≥ 1 and x > β₀, Φ_n(x) ≤ Φ(x) ≤ Φ(β₀+), hence lim_{n→∞} Φ_n(β₀) ≤ Φ(β₀+). On the other hand, for x > β₀, lim_{n→∞} Φ_n(β₀) ≥ lim_{n→∞} Φ_n(x) = Φ(x). We conclude that (viii) holds.

If −Z′(β₀) < ∞, then Z(β₀) < ∞ and lim_{n→∞} Φ_n(β₀) = −Z′(β₀)/Z(β₀) < ∞. By (viii), this shows that Φ(β₀+) < ∞. Now assume that −Z′(β₀) = ∞. If Z(β₀) < ∞, it is easy to see that Φ(β₀+) = ∞. If also Z(β₀) = ∞, we choose, to given ε > 0 and N ≥ 1, an n₀ > N such that

    Q_{n,β₀}({1, 2, …, N}) ≤ ε  for n ≥ n₀.

Then, for n ≥ n₀,

    Φ_n(β₀) = Σ_{i=1}^{n} κ_i Q_{n,β₀}(i) ≥ κ_N · Q_{n,β₀}({N+1, …, n}) ≥ κ_N (1 − ε).

This shows that Φ_n(β₀) → ∞, hence Φ(β₀+) = ∞. We have now proved (vi).

In order to prove (ix), let β < β₀ and choose, to a given K, an n₀ such that κ_i ≥ K for i ≥ n₀.
Then, for n ≥ n₀,

    Φ_n(β) ≥ K (1 + (Σ_{i=1}^{n₀−1} e^{−βκ_i}) / (Σ_{i=n₀}^{n} e^{−βκ_i}))^{−1}.

As Σ_i e^{−βκ_i} = ∞, we see that for n sufficiently large, Φ_n(β) ≥ K/2, and (ix) follows.

Remark. The formulas for Φ and Φ′ can be interpreted more probabilistically. Consider Φ and remark first that, when we consider κ as a random variable defined on the discrete probability space (A, Q_β) = (ℕ, Q_β), Φ(β) is the expectation of this random variable. A simple calculation shows that −Φ′(β) is the variance of this random variable.
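The truncated quantities (A.48)–(A.52) are easy to experiment with. The sketch below (Python; it reuses the illustrative code κ_i = 2 log i + log(π²/6) assumed in the sketch after Lemma A.1, and the target mean h is likewise an assumption) evaluates Z_n, Φ_n and the variance −Φ_n′, and solves Φ_n(β) = h by bisection, the kind of computation needed to locate the member of the exponential family (A.46) with a prescribed mean code length, as in the Mean Energy Model.

```python
import numpy as np

# Truncated versions of (A.48)-(A.52) for the same illustrative code as in the
# sketch after Lemma A.1: kappa_i = 2 log i + log(pi^2/6), i = 1..n (beta_0 = 1/2).
# The target mean h below is also just an assumption for the example.
n = 10**5
i = np.arange(1, n + 1, dtype=float)
kappa = 2.0 * np.log(i) + np.log(np.pi ** 2 / 6.0)

def Z_n(beta):
    return np.exp(-beta * kappa).sum()

def Phi_n(beta):
    """Mean <kappa, Q_{n,beta}> = -(log Z_n)'(beta), cf. (A.52) and (A.54)."""
    w = np.exp(-beta * kappa)
    return (kappa * w).sum() / w.sum()

def Var_n(beta):
    """-Phi_n'(beta): the variance of kappa under Q_{n,beta} (cf. the Remark)."""
    w = np.exp(-beta * kappa)
    q = w / w.sum()
    m = (kappa * q).sum()
    return (((kappa - m) ** 2) * q).sum()

# At beta = 1, Q_{n,1} approximates the distribution matching kappa, so
# Phi_n(1) approximates its entropy H(P).
print("Z_n(1) =", round(Z_n(1.0), 4), " Phi_n(1) =", round(Phi_n(1.0), 4),
      " -Phi_n'(1) =", round(Var_n(1.0), 4))

# Solving Phi_n(beta) = h by bisection (Phi_n is strictly decreasing,
# Lemma A.2 (ii)) picks out the member of the exponential family (A.46)
# with prescribed mean code length h.
h, lo, hi = 5.0, 0.55, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if Phi_n(mid) > h else (lo, mid)
print("beta with Phi_n(beta) =", h, ":", round(0.5 * (lo + hi), 4))
```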
Zipf, "Human Behavior and the Principle of Least E ort," Addison-Wesley, Cambridge, http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Entropy Unpaywall

With this choice you single out the most signi cant distribution, the least biased one, the one which best represents the \true" distribution. The sensibility of this principle in a number of situations is well understood and discussed at length by Jaynes, in particular. The principle is by now well established and has numerous applications in physics, biology, demography, economy etc. For practically all applications, the key example which is taken as point of departure { and often the only example discussed { is that of models prescribed by moment conditions. We refer to Kapur, [14] for a large collection of examples as well as a long list of references. In this section we present models de ned by just one moment condition. These special models will later be used to illustrate theoretical points of more technical sections to follow. Our approach will be based on the introduction of a two-person zero-sum game. The principle which this leads to, called the principle of Game Theoretical Equilibrium is taken to be even more basic than the Maximum Entropy Principle. In fact, from this principle you are led directly to the Maximum Entropy Principle and, besides, new interesting features emerge naturally by focusing on the interplay between a system and the observer of the system. As such the new principle is in conformity with views of quantum physics, e.g. we can view the principle of Game Theoretical Equilibrium as one way of expressing certain sides of the notion of complementarity as advocated by Niels Bohr in a precise mathematical way. To be more speci c, let us choose the language of physics and assume that on the set of natural numbers N we have given a function E, the energy function. This function is assumed to be bounded below. Typically, E will be non-negative. Further, we specify a certain nite energy level, , and take as our model all probability distributions with mean energy . We assume that the energy E in \state" i 2 N goes fast enough to in nity that the entropies of distributions in the model remain bounded. In particular, this condition is ful lled if E = 1 for all i suciently large { the corresponding states are then \forbidden states" { and in this case the study reduces to a study of models with nite support. Once you have accepted the Maximum Entropy Principle, this leads to a search for a maximum entropy distribution in the model. It is then tempting to introduce Lagrange multipliers and to solve the constrained optimization problem you are faced with in the standard manner. In fact, this is what practically all authors do and we shall brie y indicate this approach. P P 1 1 We want to maximize entropy H = p log p subject to the moment condition p E = i i i i 1 1 Entropy, 2001, 3 194 and subject to the usual constraints p  0; i 2 N and p = 1. Introducing Lagrange i i multipliers and , we are led to search for a solution for which all partial derivatives of the P P 1 1 function H p E + p vanish. This leads to the suggestion that the solution is of the i i i 1 1 form exp( E ) p = ; i  1 (1.1) Z ( ) for some value of for which the partition function Z de ned by Z ( ) = exp( E ) (1.2) i=1 is nite. The approach is indeed very expedient. But there are diculties connected with it. Theoret- ically, we have to elaborate on the method to be absolutely certain that it leads to the solution, even in the nite case when E = 1 for i suciently large. Worse than this, in the in nite case there may not be any solution at all. 
This is connected with the fact that there may be no distribution of the form (1.1) which satis es the required moment condition. In such cases it is not clear what to do. Another concern is connected with the observation that the method of Lagrange multipliers is a completely general tool, and this very fact indicates that in any particular situation there may possibly be other ways forward which better re ect the special structure of the problem at hand. Thus one could hope to discover new basic features by appealing to more intrinsic methods. Finally we note that the method of Lagrange multipliers cannot handle all models of interest. If the model is re ned by just adding more moment constraints, this is no great obstacle. Then the distributions and partition functions that will occur instead of (1.1) and (1.2) will work with 0 0 00 00 inner products of the form E + E +  in place of the simple product E . In fact, we i i shall look into this later. Also other cases can be handled based on the above analysis, e.g. if we specify the geometric mean, this really corresponds to a linear constraint by taking logarithms, and the maximum entropy problem can be solved as above (and leads to interesting distributions in this case, socalled power laws). But for some problems it may be dicult, or even impossible to use natural extensions of the standard method or to use suitable transformations which will reduce the study to the standard set-up. In such cases, new techniques are required in the search for a maximum entropy distribution. As examples of this diculty we point to models involving binomial or empirical distributions, cf. [8] and [22]. After presentation of preliminary material, we introduce in Section 3 the basic concepts related to the game we shall study. Then follows a section which quickly leads to familiar key results. The method depends on the information- and game-theoretical point of view. This does not lead Entropy, 2001, 3 195 to complete clari cation. For the proper understanding of certain phenomena, a more thorough theoretical discussion is required and this is taken up in the remaining sections. New results are related to so-called entropy loss { situations where a maximum entropy distri- bution does not exist. In the last section, these type of models are related to Zipf 's law regarding statistical aspects of the semantics of natural languages. Mathematical justi cation of all results is provided. Some technical results which we shall need involve special analytical tools regarding Dirichlet series and are delegated to an appendix. 2 Information theoretical preliminaries Let A, the alphabet, be a discrete set, either nite or countably in nite and denote by M (A), respectively M (A) the set of non-negative measures P on A (with the discrete Borel structure) such that P (A)  1, respectively P (A) = 1. The elements in A can be thought of in many ways, e.g. as letters (for purely information theoretical or computer science oriented studies), as pure states (for applications to quantum physics) or as outcomes (for models of probability theory and statistics). For convenience, A will always be taken to be the set N of natural numbers or a nite section thereof, and elements in A are typically referred to by indices like i; j; . 1  1 Measures in M (A) are probability distributions, or just distributions, measures in M (A) + + 1 1 are general distributions and measures in M (A) n M (A) are incomplete distributions. 
For + + P; Q; 2 M (A), the point masses are, typically, denoted by p ; q ; . i i By K (A), we denote the set of all mappings  : A ! [0;1] which satisfy Kraft's inequality exp( )  1 : (2.3) i2A Elements in K (A) are general codes. The values of a general code  are denoted  . The terminology is motivated by the fact that if  2 K (A) and if the base for the exponential in (2.3) is 2, then there exists a binary pre x-free code such that the i'th code word consists of approximately  binary digits. By K (A) we denote the set of mappings  : A ! [0;1] which satisfy Kraft's equality exp( ) = 1 : (2.4) i2A This case corresponds to codes without super uous digits. For further motivation, the reader may wish to consult [23] or standard textbooks such as [3] and [6]. Elements in K (A) are compact codes, for short just codes. Entropy, 2001, 3 196 For mathematical convenience, we shall work with exponentials and logarithms to the base e. For  2 K (A) and i 2 A,  is the code length associated with i or, closer to the intended interpretation, we may think of  as the code length of the code word which we imagine associates with i. There is a natural bijective correspondance between M (A) and K (A), expressed notationally by writing P $  or  $ P , and de ned by the formulas = log p ; p = exp( ) : i i i i Here the values  = 1 and p = 0 correspond to eachother. When the above formulas hold, we i i call (; P ) a matching pair and we say that  is adapted to P or that P is the general distribution which matches . If P  M (A) and  2 K (A), we say that  is P -adapted if  is adapted to one of the distributions in P . Note that the correspondance  $ P also de nes a bijection between M (A) and K (A). The support of  is the set of i 2 A with  < 1. Thus, with obvious notation, supp () = supp (P ) where P is the distribution matching  and supp (P ) is the usual support of P . For expectations { always w.r.t. genuine probability distributions { we use the bracket notation. Thus, for P 2 M (A) and f : A ! [1;1], we put hf; Pi = f (i)p i2A whenever this is a well-de ned extended real number. Mostly, our functions will be non-negative and then hf; Pi will of course be a well de ned number in [0;1]. In particular this is the case for average code length de ned for  2 K (A) and P 2 M (A) by h; Pi =  p : i i i2A Entropy and divergence are de ned as usual, i.e., for P 2 M (A), the entropy of P is given by H (P ) = p log p (2.5) i i i2A or, equivalently, by H (P ) = h; Pi where  is the code adapted to P . And for P 2 M (A) and Q 2 M (A) we de ne the divergence (or relative entropy) between P and Q by D(PkQ) = p log : (2.6) i2A Divergence is well de ned with 0  D(PkQ)  1 and D(PkQ) = 0 if and only if P = Q. The topological properties which we shall nd useful for codes and for distributions do not quite go in parallel. On the coding side we consider the space K (A) of all general codes and Entropy, 2001, 3 197 remark that this space is a metrizable, compact and convex Hausdor space. This may be seen by embedding K (A) in the space [0;1] of all functions on A taking values in the compact space [0;1]. The topology on K (A) then is the topology of pointwise convergence. This is the only topology we shall need on K (A). On the distribution side we shall primarily consider probability distributions but on the corre- sponding space, M (A), we nd it useful to consider two topologies, the usual, pointwise topology and then a certain stronger non-metrizable topology, the information topology. 
As to the usual topology on M (A) we remind the reader that this is a metrizable topology, indeed it is metrized by total variation de ned by V (P; Q) = jp q j: i i We write P ! P for convergence and P , co P etc. for closure in this topology (the examples show the closure of P and of the convex hull of P , respectively). As to the information topology { the second topology which we need on the space M (A) { 1 1 this can be described as the strongest topology such that, for (P )  M (A) and P 2 M (A), n n1 + + lim D(P kP ) = 0 implies that the sequence (P ) converges to P . Convergence in this n!1 n n n1 topology is denoted P ! P . We only need convergence in this topology for sequences, not for generalized sequences or nets. Likewise, we only need sequential closure and P ; co P , or what the case may be denotes sequential closure. Thus P denotes the set of distributions P for which there exists a sequence (P ) of distributions in P with P ! P . The necessary and n n1 n sucient condition that P ! P holds is that D(P kP ) ! 0 as n ! 1. We warn the reader n n that the corresponding statement for nets (generalized sequences) is wrong { only the suciency part holds generally. For the purposes of this paper, the reader needs only worry about sequences but it is comforting to know that the sequential notion P ! P is indeed a topological notion of convergence. Further details will be in [9]. An important connection between total variation and divergence is expressed by Pinsker's inequality: D(PkQ)  V (P; Q) ; (2.7) which shows that convergence in the information topology is stronger than convergence in total variation. The functions of relevance to us, entropy and divergence, have important continuity properties: P y H (P ) is lower semi-continuous on M (A) and (P; Q) y D(PkQ) is jointly lower semi- 1  1 continuous on M (A) M (A). These continuity properties even hold w.r.t. the usual, pointwise + + topology. Details may be found in [23]. Entropy, 2001, 3 198 3 The Code Length Game, introduction In this section P is a non-empty subset of M (A), neutrally referred to as the model. In speci c applications it may be more appropriate with other terminology, e.g. the preparation space or the statistical model. Distributions in P are called consistent distributions. With P we associate a two-person zero-sum game, called the Code Length Game over P . In this game, Player I chooses a consistent distribution, and Player II chooses a general code. The cost-function, seen from the point of view of Player II, is the map P  K (A) ! [0;1] given by the average code length: (P; ) y h; Pi: This game was introduced in [20], see also [15], [21], [10], [8] and [22]. Player I may be taken to represent \the system", \Nature", \God" or  , whereas Player II represents \the observer", \the statistician" or  . We can motivate the game introduced in various ways. The personi cation of the two partic- ipants in the game is natural as far as Player II is concerned since, in many situations, we can identify ourselves with that person. Also, the objective of Player II appears well motivated. To comment on this in more detail, we rst remind the reader that we imagine that there is associated a real code consisting of binary sequences to  2 K (A) and that  merely tells us what the code lengths of the various code words are. 
We can think of a speci c code in at least three di erent ways: as a representation of the letters in A, as a means for identi cation of these letters and { the view we nd most fruitful { as a strategy for making observations from a source generating letters from A. The two last views are interrelated. In fact, for the strategy of observation which we have in mind, we use the code to identify the actual outcome by posing a succession of questions, starting with the question \is the rst binary digit in the code word corresponding to the outcome a 1?" , then we ask for the second binary digit and so on until it is clear to us which letter is the actual outcome from the source. The number of questions asked is the number of binary digits in the corresponding code word. The cost function can be interpretated as mean representation time, mean identi cation time or mean observation time and it is natural for Player II to attempt to minimize this quantity. The sense in assuming that Player I has the opposite aim, namely to maximize the cost function is more dubious. The arguments one can suggest to justify this, thereby motivating the zero-sum character of the Code Length Game, are partly natural to game theory in general, partly can be borrowed from Jaynes' reasoning behind his Maximum Entropy Principle. Without going into lengthy discussions we give some indications: Though we do not seriously imagine that Player I is a \real" person with rational behaviour, such thoughts regarding the ctive Player I re ect back on our own conceptions. With our ctitious assumptions we express our own modelling. If all we Entropy, 2001, 3 199 know is the model P and if, as is natural, all we strive for is minimization of the cost function, we cannot do better than imagining that Player I is a real person behaving rationally in a way which is least favourable to us. Any other assumption would, typically, lead to non-sensical results which would reveal that we actually knew more than rst expressed and therefore, as a consequence, we should change the model in order better to re ect our level of knowledge. To sum up, we have argued that the observer should be allowed freely to choose the means of observation, that codes o er an appropriate technical tool for this purpose and that the choice of a speci c code should be dictated by the wish to minimize mean observation time, modelled adequately by the chosen cost function. Further, the more ctitious views regarding Player I and the behaviour of that player, really re ect on the adequacy and completeness of our modelling. If our modelling is precise, the assumptions regarding Player I are sensible and general theory of two-person zero-sum games can be expected to lead to relevant and useful results. The overall principle we shall apply, we call the principle of Game Theoretical Equilibrium. It is obtained from general game theoretical considerations applied to the Code Length Game. No very rigid formulation of this principle is necessary. It simply dictates that in the study of a model, we shall investigate standard game theoretical notions such as equilibrium and optimal strategies. According to our basic principle, Player I should consider, for each possible strategy P 2 P , the in mum of h; Pi over  2 K (A). This corresponds to the optimal response of Player II to the chosen strategy. The in mum in question can easily be identi ed by appealing to an important identity which we shall use frequently in the following. 
The identity connects average code length, entropy and divergence and states that h; Pi = H (P ) + D(PkQ); (3.8) valid for any  2 K (A) and P 2 M (A) with Q the (possibly incomplete) distribution matching . The identity is called the linking identity. As D(PkQ)  0 with equality if and only if P = Q, an immediate consequence of the linking identity is that entropy can be conceived as minimal average code length: H (P ) = min h; Pi : (3.9) 2 K(A) The minimum is attained for the code adapted to P and, provided H (P ) < 1, for no other code. Seen from the point of view of Player I, the optimal performance is therefore achieved by maximizing entropy. The maximum value to strive for is called the maximum entropy value (H -value) and is given by max H (P ) = sup H (P ) : (3.10) max P2P Entropy, 2001, 3 200 On the side of Player II { the \coding side" { we consider, analogously, for each  2 K (A) the associated risk given by R(jP ) = suph; Pi (3.11) P2P and then the minimum risk value (R -value) min R (P ) = inf R(jP ) : (3.12) min 2 K(A) This is the value to strive for for Player II. We have now looked at each side of the game separately. Combining the two sides, we are led to the usual concepts, well known from the theory of two-person zero-sum games. Thus, the model P is in equilibrium if H (P ) = R (P ) < 1, and in this case, H (P ) = R (P ) is the value max min max min of the game. Note that as a \supinf " is bounded by the corresponding \infsup", the inequality H (P )  R (P ) (3.13) max min always holds. The concept of optimal strategies also follows from general considerations. For Player I, this is a consistent distribution with maximal entropy, i.e. a distribution P 2 P with H (P ) = H (P ). max And for Player II, an optimal strategy is a code  2 K (A) such that R( jP ) = R (P ). Such min a code is also called a minimum risk code (R -code). min 4 Cost-stable codes, partition functions and exponential families The purpose of this section is to establish a certain sucient condition for equilibrium and to identify the optimal strategies for each of the players in the Code Length Game. This cannot always be done but the simple result presented here already covers most applications. Furthermore, the approach leads to familiar concepts and results. This will enable the reader to judge the merits of the game theoretical method as compared to a more standard approach via the introduction of Lagrange multipliers. As in the previous section, we consider a model P  M (A). Let  2 K (A) together with its matching distribution P be given and assume that P is consistent. Then we call  a Nash equilibrium code for the model P if h ; Pi  h ; P i ; P 2 P (4.14) Entropy, 2001, 3 201 and if H (P ) < 1. The terminology is adapted from mathematical economy, cf. e.g. Aubin [2]. The requirement can be written R( jP )  H (P ) < 1. Note that here we insist that a Nash equilibrium code be P -adapted. This condition will later be relaxed. Theorem 4.1. Let P be a model and assume that there exists a P -adapted Nash equilibrium code , say, with matching distribution P . Then P is in equilibrium and both players have optimal strategies. Indeed, P is the unique optimal strategy for Player I and  the unique optimal strategy for Player II. Proof. Since R( jP )  H (P ), R (P )  H (P ). As the opposite inequality always holds by min max (3.13), P is in equilibrium, the value of the Code Length Game associated with P is H (P ) and and P are optimal strategies. To establish the uniqueness of  , let  be any code distinct from . 
Let P be the distribution matching . Then, by the linking identity, R(jP )  h; P i = H (P ) + D(P kP ) > H (P ) ; hence  is not optimal. For the uniqueness proof of P , let P be a consistent distribution distinct from P . Then, again by the linking identity, H (P ) < H (P ) + D(PkP ) = h ; Pi  H (P ) ; and P cannot be optimal. As we shall see later, the existence of a Nash equilibrium code is, essentially, also necessary for the conclusion of the theorem. This does not remove the diculty of actually nding the Nash equilibrium code in concrete cases of interest. In many cases it turns out to be helpful to search for codes with stronger properties. A code  is a cost-stable code for P if there exists h < 1 such that h ; Pi = h for all P 2 P . Clearly, a cost-stable code with a consistent matching distribution is a Nash equilibrium code. Therefore, we obtain the following corollary from Theorem 4.1: Corollary 4.2. If  is a cost-stable code for P and if the matching distribution P is consistent, then P is in equilibrium and  and P are the unique optimal strategies pertaining to the Code Length Game. The reader may want to note that it is in fact easy to prove directly that if P is convex, H (P ) nite and max P a consistent distribution with maximum entropy, then the adapted code  must be a Nash equilibrium code. To see this, let P and P be distributions with nite entropy and put P = (1 )P + P . Then h( ) = H (P ); 0 1 0 1 0   1 is strictly concave and h ( ) = h ; P ih ; P i with  the code adapted to P . From this it is easy 1 0 to derive the stated result. A more complete result is given in Theorem 7.3. Entropy, 2001, 3 202 In order to illustrate the usefulness of this result, consider the case of a model P given by nitely many linear constraints, say P = fP 2 M (A) j hE ; Pi =  ; : : : ;hE ; Pi =  g (4.15) 1 1 n n with E ; : : : ; E real-valued functions bounded from below and  ; : : : ;  real-valued constants. 1 n 1 n Let us search for cost-stable codes  for P . Clearly, any code of the form = + E + + E = +  E (4.16) 1 1 n n is cost-stable. Here, and the 's denote constants, and E vectors and a dot signi es scalar products of vectors. For  de ned by (4.16) to de ne a code we must require that   0 and, more importantly, that Kraft's equality (2.4) holds. We are thus forced to assume that the partition function evaluated at = ( ; : : : ; ) is nite, i.e. that 1 n Z ( ) = exp(  E ) (4.17) i2A is nite, and that = log Z ( ). When these conditions are ful lled,  =  de ned by = log Z ( ) +  E (4.18) de nes a cost-stable code with individual code lengths given by = log Z ( ) +  E : (4.19) The matching distribution P is given by the point probabilities exp(  E ) P = : (4.20) Z ( ) In most cases where linear models occur in the applications, one will be able to adjust the parameters in such that P is consistent. By Corollary 4.2, the entropy maximization problem will then be solved. However, not all cases can be settled in this way as there may not exist a consistent maximum entropy distribution. We have seen that the search for cost-stable codes led us to consider the well-known partition function and also the well-known exponential family consisting of distributions (P ) with ranging over all vectors 2 R for which Z ( ) < 1. From our game theoretical point of view, the family of codes ( ) with Z ( ) < 1 has at least as striking features as the corresponding family of distributions. 
We shall therefore focus on both types of objects and shall call the family of matching pairs ( ; P ) with ranging over vectors Entropy, 2001, 3 203 with Z ( ) < 1 for the exponential family associated with the set E = (E ; : : : ; E ) of functions 1 n on A or associated with the family of models one can de ne from E by choosing  = ( ; : : : ;  ) 1 n and considering P given by (4.15). We stress that the huge literature on exponential families displays other families of discrete distributions than those that can be derived from the above de nition. In spite of this we maintain the view that an information theoretical de nition in terms of codes (or related objects) is more natural than the usual structural de nitions. We shall not pursue this point vigorously here as it will require the consideration of further games than the simple Code Length Game. In order to further stress the signi cance of the class of cost stable codes we mention a simple continuity result: Theorem 4.3. If a model P has a cost-stable code, the entropy function H is continuous when restricted to P . Proof. Assume that h ; Pi = h < 1 for all P 2 P . Then H (P ) + D(PkP ) = h for P 2 P with P the distribution matching  . As the sum of the two lower semi-contimuous functions in this identity is a constant function, each of the functions, in particular the entropy function, must be continuous. As we have already seen, the notion of cost-stable codes is especially well suited to handle models de ned by linear constraints. In section 6 we shall take this up in more detail. The following sections will be more technical and mathematically abstract. This appears nec- essary in order to give a comprehensive treatment of all basic aspects related to the Cost Length Game and to the Maximum Entropy Principle. 5 The Code Length Game, further preparations In section 3 we introduced a minimum of concepts that enabled us to derive the useful results of section 4. With that behind us as motivation and background material, we are ready to embark on a more thorough investigation which will lead to a clari cation of certain obscure points, especially related to the possibility that a consistent distribution with maximal entropy may not exist. In this section we point out certain results and concepts which will later be useful. In view of our focus on codes it is natural to look upon divergence in a di erent way, as redundancy. Given is a code  2 K (A) and a distribution P 2 M (A). We imagine that we use  to code letters from A generated by a \source" and that P is the \true" distribution of the letters. The optimal performance is, according to (3.9), represented by the entropy H (P ) whereas the actual performance is represented by the number h; Pi. The di erence h; PiH (P ) is then Entropy, 2001, 3 204 taken as the redundancy. This is well de ned if H (P ) < 1 and then coincides with D(PkQ) where Q denotes the distribution matching . As D(PkQ) is always well de ned, we use this quantity for our technical de nition: The redundancy of  2 K (A) against P 2 M (A) is denoted D(Pk) and de ned by D(Pk) = D(PkQ) with  $ Q : Thus D(Pk) and D(PkQ) can be used synonymously and re ect di erent ways of thinking. Using redundancy rather than divergence, the linking identity takes the following form: h; Pi = H (P ) + D(Pk) : (5.21) We shall often appeal to basic concavity and convexity properties. Clearly, the entropy function is concave as a minimum of ane functions, cf. (3.9). 
However, we need a more detailed result which also implies strict concavity. The desired result is the following identity X X X H ( P ) = H (P ) + D(P kP ) ; (5.22) where P = P is any nite or countably in nite convex combination of probability distri- butions. This follows by the linking identity. A closely related identity involves divergence and states that, with notation as above and with Q denoting an arbitrary general distribution, X X X D(P kQ) = D P kQ + D(P jP ) : (5.23) The identity shows that divergence D(kQ) is strictly convex. A proof can be found in [23]. For the remainder of the section we consider a model P  M (A) and the associated Code Length Game. By supp (P ) we denote the support of P , i.e. the set of i 2 A for which there exists P 2 P with p > 0. Thus, supp (P ) = [ supp (P ), the union of the usual supports of all consistent i P2P distributions. Often one may restrict attention to models with full support, i.e. to models with supp (P ) = A. However, we shall not make this assumption unless pointed out speci cally. Recall that distributions inP are said to be consistent. Often, it is more appropriate to consider distributions in P . These distributions are called essentially consistent distributions. Using these distributions we relax the requirements to a distribution with maximum entropy, previously only considered for consistent distributions. Accordingly, a distribution P is called a maximum entropy Entropy, 2001, 3 205 distribution (H -distribution) if P is essentially consistent and H (P ) = H (P ). We warn max max the reader that the usual de nition in the literature insists on the requirement of consistency. Nevertheless, we nd the relaxed requirement of essential consistency more adequate. For one thing, lower semi-continuity of the entropy function implies that H (P ) = H (P ) = H (P ) (5.24) max max max and this points to the fact that the models P and P behave in the same way as P . This view is further supported by the observation that for any  2 K (A), R(jP ) = R(jco P ) : (5.25) This follows as the map P y h; Pi is lower semi-continuous and ane. As a consequence, R (P ) = R (co ) : (5.26) min min It follows that all models with P  P  P behave similarly as far as the Code Length Game is concerned. The reason why we do not relax further the requirement of a H -distribution from max P 2 P to P 2 P is rstly, that we hold the information topology for more relevant for our investigations than the usual topology. Secondly, we shall see that the property P 2 P which is stronger than P 2 P can in fact be veri ed in the situations we have in mind (see Theorem 6.2). The fact that a consistent H -distribution may not exist leads to further important notions. max Firstly, a sequence (P ) of distributions is said to be asymptotically optimal if all the P n n1 n are consistent and if H (P ) ! H (P ) for n ! 1. And, secondly, a distribution P is the n max maximum entropy attractor (H -attractor) if P is essentially consistent and if P ! P for max n every asymptotically optimal sequence (P ) . n n1 As an example, consider the (uninteresting!) model of all deterministic distributions. For this model, the H -attractor does not exist and there is no unique H -distribution. For more max max sensible models, the H -attractor P will exist, but it may not be the H -distribution as max max lower semi-continuity only quarantees the inequality H (P )  H (P ), not the corresponding max equality. 
Having by now re ned the concepts related to the distribution side of the Code Length Game, we turn to the coding side. It turns out that we need a localized variant of the risk associated with certain codes. The codes we shall consider are, intuitively, all codes which the observer (Player II) o -hand nds it worth while to consider. If P 2 P is the \true" distribution, and the observer knows this, he will choose the code adapted to P in order to minimize the average code length. As nature (Player Entropy, 2001, 3 206 I) could from time to time change the choice of P 2 P , essentially any strategy in the closure of P could be approached. With these remarks in mind we nd it natural for the observer only to consider P -adapted codes in the search for reasonable strategies. D D Assume now that the observer decides to choose a P -adapted code . Let P 2 P be the distribution which matches . Imagine that the choice of  is dictated by a strong belief that the true distribution is P or some distribution very close to P (in the information topology!). Then the observer can evaluate the associated risk by calculating the localized risk associated with which is de ned by the equation: R (jP ) = sup lim suph; P i ; (5.27) loc n D n!1 (P )P;P !P n n where the supremum is over the class of all sequences of consistent distributions which converge in the information topology to P . Note that we insist on a de nition which operates with sequences. Clearly, the normal \global" risk must be at least as large as localized risk, therefore, for any P -adapted code, R (jP )  R(jP ) : (5.28) loc A further and quite important inequality is the following: R (jP )  H (P ) : (5.29) loc max This inequality is easily derived from the de ning relation (5.27) by writing h; P i in the form H (P ) + D(P kP ) with  $ P , noting also that P ! P implies that D(P kP ) ! 0. We note n n n n that had we allowed nets in the de ning relation (5.27), a di erent and sometimes strictly larger quantity would result and (5.29) would not necessarily hold. As the last preparatory result, we establish pretty obvious properties of an eventual optimal strategy for the observer, i.e. of an eventual R -code. min Lemma 5.1. Let P  M (A) with R (P ) < 1 be given. Then the R -code is unique and if min min it exists, say R( jP ) = R (P ), then  is compact with supp ( ) = supp (P ). min Proof. Assume that  2 K (A) is a R -code. As R( jP ) < 1, supp (P )  supp ( ). Then min consider an a 2 supp ( ) and assume, for the purpose of an indirect proof, that a 2 Ansupp (P ). 0 0 Then the code  obtained from  by putting (a ) = 1 and keeping all other values xed, is a general non-compact code which is not identically +1. Therefore, there exists " > 0 such that " is a compact code. For any P 2 P , we use the fact that a 2= supp (P )) to conclude that h "; Pi = h "; Pi, hence R( "jP ) = R( jP ) ", contradicting the minimality property Entropy, 2001, 3 207 of  . Thus we conclude that supp ( ) = supp (P ). Similarly, it is clear that  must be compact { since otherwise,  " would be more ecient than  for some " > 0. In order to prove uniqueness, assume that both  and  are R -codes for P . If we assume 1 2 min that  6=  , then  (a) 6=  (a) holds for some a in the common support of the codes  and 1 2 1 2 1 2 and then, by the geometric/arithmetic inequality, we see that ( + ) is a general non-compact 1 2 code. For some " > 0, ( +  ) " will then also be a code and as this code is seen to be more 1 2 ecient than  and  , we have arrived at a contradiction. 
Thus  =  , proving the uniqueness 1 2 1 2 assertion. 6 Models in equilibrium Let P  M (A). By de nition, the requirement of equilibrium is one which involves the rela- tionship between both sides of the Code Length Game. The main result of this section shows that the requirement can be expressed in terms involving only one of the sides of the game, either distributions or codes. Theorem 6.1 (conditions for equilibrium). Let P  M (A) be a model and assume that H (P ) < 1. Then the following conditions are equivalent: max (i) P is in equilibrium, (ii) H (co P ) = H (P ), max max (iii) there exists a P -adapted code  such that R( jP ) = R ( jP ) : loc Proof. (i) ) (iii): Here we assume that H (P ) = R (P ). In particular, R (P ) < 1. As the max min min map  y R(jP ) is lower semi-continuous on K (A) (as the supremum of the maps  y h; Pi; P 2 P ), and as K (A) is compact, the minimum of  y R(jP ) is attained. Thus, there exists 2 K (A) such that R( jP ) = R (P ). As observed in Lemma 5.1,  is a compact code and min is the unique R -code. min For P 2 P , H (P ) + D(Pk ) = h ; Pi  R (P ) = H (P ) : min max It follows that D(P k ) ! 0 for any asymptotically optimal sequence (P ) . In other words, n n n1 the distribution P 2 M (A) which matches  is the H - attractor of the model. max + Entropy, 2001, 3 208 We can now consider any asymptotically optimal sequence (P ) in order to conclude that n n1 R ( jP )  lim suph ; P i = lim sup(H (P ) + D(P k )) loc n n n n!1 n!1 = H (P )  R (P ) = R( jP ) : max min By (5.28), the assertion of (iii) follows. (iii) ) (ii): Assuming that (iii) holds, we nd from (5.25), (3.13) and (5.29) that H (co P )  R (co P ) = R (P )  R( jP ) max min min R ( jP )  H (P ) loc max and the equality of (ii) must hold. (ii) ) (i): For this part of the proof we x a speci c asymptotically optimal sequence (P ) n n1 P . We assume that (ii) holds. For each n and m we observe that by (5.22) and (2.7), with M = (P + P ), n m H (P ) = H (coP )  H ( (P + P )) max max n m 1 1 1 1 = H (P ) + H (P ) + D(P kM ) + D(P kM ) n m n m 2 2 2 2 1 1 1 H (P ) + H (P ) + V (P ; P ) : n m n m 2 2 8 It follows that (P ) is a Cauchy sequence with respect to total variation, hence there exists n n1 P 2 M (A) such that P ! P . Let  be the code adapted to P . In order to evaluate R( jP ) we consider any P 2 P . For a suitable sequence (" ) of positive numbers converging to zero, we consider the sequence n n1 (Q )  P given by n n1 Q = (1 " )P + " P ; n  1 : n n n n By (5.22) we nd that H (P ) = H (coP )  H (Q )  (1 " )H (P ) + " H (P ) + " D(PkQ ) ; max max n n n n n n hence H (P ) + D(PkQ )  H (P ) + (H (P ) H (P )) : n n max n As Q y D(PkQ) is lower semi-continuous, we conclude from this that H (P ) + D(PkP )  H (P ) + lim inf (H (P ) H (P )) : max max n n!1 n Entropy, 2001, 3 209 Choosing the " 's appropriately, e.g. " = (H (P )H (P )) , it follows that H (P )+D(PkP ) n n max n H (P ), i.e. that h ; Pi  H (P ). As this holds for all P 2 P , R( jP )  H (P ) follows. max max max Thus R (P )  H (P ), hence equality must hold here, and we have proved that P is in min max equilibrium, as desired. It is now easy to derive the basic properties which hold for a system in equilibrium. Theorem 6.2 (models in equilibrium). Assume that P  M (A) is a model in equilibrium. Then the following properties hold: (i) There exists a unique H -attractor and for this distribution, say P , the inequality max R (P ) + D(P k)  R(jP ) (6.30) min holds for all  2 K (A). 
It is now easy to derive the basic properties which hold for a system in equilibrium.

Theorem 6.2 (models in equilibrium). Assume that $\mathcal{P}\subseteq M^1_+(A)$ is a model in equilibrium. Then the following properties hold:

(i) There exists a unique $H_{\max}$-attractor and, for this distribution, say $P^*$, the inequality
$$R_{\min}(\mathcal{P})+D(P^*\|\kappa)\le R(\kappa|\mathcal{P}) \qquad (6.30)$$
holds for all $\kappa\in K(A)$.

(ii) There exists a unique $R_{\min}$-code and, for this code, say $\kappa^*$, the inequality
$$H(P)+D(P\|\kappa^*)\le H_{\max}(\mathcal{P}) \qquad (6.31)$$
holds for every $P\in\mathcal{P}$, even for every $P\in\mathrm{co}\,\mathcal{P}$. The $R_{\min}$-code is compact.

Proof. The existence of the $R_{\min}$-code was established by the compactness argument at the beginning of the proof of Theorem 6.1. The inequality (6.31) for $P\in\mathcal{P}$ is nothing but an equivalent form of the inequality $R(\kappa^*|\mathcal{P})\le H_{\max}(\mathcal{P})$, and this inequality immediately implies that the distribution $P^*$ matching $\kappa^*$ is the $H_{\max}$-attractor. The extension of the validity of (6.31) to $P\in\mathrm{co}\,\mathcal{P}$ follows from (5.26). To prove (6.30), let $(P_n)_{n\ge1}\subseteq\mathcal{P}$ be asymptotically optimal. Then
$$R(\kappa|\mathcal{P})\ge\limsup_{n\to\infty}\langle\kappa,P_n\rangle=\limsup_{n\to\infty}\bigl(H(P_n)+D(P_n\|\kappa)\bigr)\ge H_{\max}(\mathcal{P})+\liminf_{n\to\infty}D(P_n\|\kappa)\ge H_{\max}(\mathcal{P})+D(P^*\|\kappa),$$
which is the desired conclusion, as $H_{\max}(\mathcal{P})=R_{\min}(\mathcal{P})$.

If $\mathcal{P}$ is a model in equilibrium, we refer to the pair $(\kappa^*,P^*)$ from Theorem 6.2 as the optimal matching pair. Thus $\kappa^*$ denotes the $R_{\min}$-code and $P^*$ the $H_{\max}$-attractor.

Combining Theorem 6.2 and Theorem 6.1 we realize that, unless $R(\kappa|\mathcal{P})=\infty$ for every $\kappa\in K(A)$, there exists a unique $R_{\min}$-code. The matching distribution is the $H_{\max}$-attractor for the model $\mathrm{co}(\mathcal{P})$. We also note that simple examples (even with $A$ a two-element set) show that $\mathcal{P}$ may have an $H_{\max}$-attractor without $\mathcal{P}$ being in equilibrium, and this attractor may be far away from the $H_{\max}$-attractor for $\mathrm{co}(\mathcal{P})$.

Corollary 6.3. For a model $\mathcal{P}\subseteq M^1_+(A)$ in equilibrium, the $R_{\min}$-code and the $H_{\max}$-attractor form a matching pair: $\kappa^*\leftrightarrow P^*$, and for any matching pair $(\kappa,P)$ with $P\in\mathrm{co}\,\mathcal{P}$,
$$V(P,P^*)\le\bigl(R(\kappa|\mathcal{P})-H(P)\bigr)^{1/2}.$$

Proof. Combining (6.30) with (6.31), it follows that for $\kappa\leftrightarrow P$ with $P\in\mathrm{co}\,\mathcal{P}$,
$$D(P\|P^*)+D(P^*\|P)\le R(\kappa|\mathcal{P})-H(P),$$
and the result follows from Pinsker's inequality, (2.7).

Corollary 6.3 may help us to judge the approximate position of the $H_{\max}$-attractor $P^*$ even without knowing the value of $H_{\max}(\mathcal{P})$. Note also that the proof gave the more precise bound $J(P,P^*)\le R(\kappa|\mathcal{P})-H(P)$, with assumptions as above and with $J(\cdot,\cdot)$ denoting Jeffrey's measure of discrimination, cf. [3] or [16].

Corollary 6.4. Assume that the model $\mathcal{P}$ has a cost-stable code $\kappa^*$ and let $P^*$ be the matching distribution. Then $\mathcal{P}$ is in equilibrium and has $(\kappa^*,P^*)$ as optimal matching pair if and only if $P^*$ is essentially consistent.

Proof. By definition, an $H_{\max}$-attractor is essentially consistent; therefore, the necessity of the condition is trivial. For the proof of sufficiency, assume that $\langle\kappa^*,P\rangle=h$ for all $P\in\mathcal{P}$, with $h$ a finite constant. Clearly then, $H_{\max}(\mathcal{P})\le h$. Now let $(P_n)_{n\ge1}$ be a sequence of consistent distributions with $P_n\to P^*$. By the linking identity, $H(P_n)+D(P_n\|P^*)=h$ for all $n$, and we see that $H_{\max}(\mathcal{P})\ge h$. Thus $H_{\max}(\mathcal{P})=h$ and $(P_n)$ is asymptotically optimal. By Theorem 6.2, the sequence converges in the information topology to the $H_{\max}$-attractor, which must then be $P^*$. The result follows.

Note that this result is a natural further development of Corollary 4.2.

Corollary 6.5. Assume that $\mathcal{P}\subseteq M^1_+(A)$ is a model in equilibrium. Then all models $\mathcal{P}'$ with $\mathcal{P}\subseteq\mathcal{P}'\subseteq\overline{\mathrm{co}}^{\,V}\mathcal{P}$ (the closure of $\mathrm{co}\,\mathcal{P}$ in total variation) are in equilibrium too, and they all have the same optimal matching pair.

Proof. If $\mathcal{P}\subseteq\mathcal{P}'\subseteq\overline{\mathrm{co}}^{\,V}\mathcal{P}$, then
$$H_{\max}(\mathrm{co}\,\mathcal{P}')\le H_{\max}(\overline{\mathrm{co}}^{\,V}\mathcal{P})=H_{\max}(\mathrm{co}\,\mathcal{P})=H_{\max}(\mathcal{P})\le H_{\max}(\mathcal{P}'),$$
and we see that $\mathcal{P}'$ is in equilibrium. As an asymptotically optimal sequence for $\mathcal{P}$ is also asymptotically optimal for $\mathcal{P}'$, it follows that $\mathcal{P}'$ has the same $H_{\max}$-attractor, hence also the same optimal matching pair, as $\mathcal{P}$.
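The two inequalities of Theorem 6.2 and the bound of Corollary 6.3 can be checked numerically on the same kind of toy model. This is again an editorial sketch with arbitrary distributions; $P^*$ is located by a grid search, so the comparisons include a small tolerance.

```python
import math

def H(p): return -sum(x * math.log(x) for x in p if x > 0)
def D(p, q): return sum(x * math.log(x / q[i]) for i, x in enumerate(p) if x > 0)
def V(p, q): return sum(abs(x - y) for x, y in zip(p, q))     # total variation
def mix(a, p, q): return [a * x + (1 - a) * y for x, y in zip(p, q)]
def avg_len(kappa, p): return sum(k * x for k, x in zip(kappa, p))

P1, P2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]

# Locate the H_max-attractor of co{P1, P2} by a grid search, as before.
a_star = max((i / 10000 for i in range(10001)), key=lambda a: H(mix(a, P1, P2)))
P_star = mix(a_star, P1, P2)
H_max = H(P_star)

for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    P = mix(a, P1, P2)
    kappa = [-math.log(x) for x in P]                    # the code matching P
    risk = max(avg_len(kappa, P1), avg_len(kappa, P2))   # R(kappa | co P), attained at an extreme point
    # (6.31): H(P) + D(P || P*) <= H_max   (up to the grid error in P*)
    assert H(P) + D(P, P_star) <= H_max + 1e-3
    # Corollary 6.3: V(P, P*) <= (R(kappa | P) - H(P))^(1/2)
    assert V(P, P_star) <= math.sqrt(risk - H(P)) + 1e-3
```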
Another corollary is the following result, which can be used as a basis for proving certain limit theorems, cf. [22].

Corollary 6.6. Let $(A,\mathcal{P}_n)_{n\ge1}$ be a sequence of models, assume that they are all in equilibrium with $\sup_n H_{\max}(\mathcal{P}_n)<\infty$, and that they are nested in the sense that $\mathrm{co}(\mathcal{P}_1)\subseteq\mathrm{co}(\mathcal{P}_2)\subseteq\cdots$. Let there further be given a model $\mathcal{P}$ such that
$$\bigcup_{n\ge1}\mathcal{P}_n\subseteq\mathcal{P}\subseteq\overline{\mathrm{co}}\Bigl(\bigcup_{n\ge1}\mathcal{P}_n\Bigr).$$
Then $\mathcal{P}$ is in equilibrium too, and the sequence of $H_{\max}$-attractors of the $\mathcal{P}_n$'s converges in divergence to the $H_{\max}$-attractor of $\mathcal{P}$.

Clearly, the corollaries are related, and we leave it to the reader to extend the argument in the proof of Corollary 6.5 so that it also covers the case of Corollary 6.6.

We end this section by developing some results on models given by linear conditions, thereby continuing the preliminary results from Sections 1 and 4. We start with a general result which uses the following notion: a distribution $P_0$ is algebraically inner in the model $\mathcal{P}$ if, for every $P\in\mathcal{P}$, there exists $Q\in\mathcal{P}$ such that $P_0$ is a convex combination of $P$ and $Q$.

Lemma 6.7. If the model $\mathcal{P}$ is in equilibrium and has an $H_{\max}$-distribution $P^*$ which is algebraically inner in $\mathcal{P}$, then the code $\kappa^*$ adapted to $P^*$ is cost-stable.

Proof. Let $\kappa^*$ be the code adapted to $P^*$. To any $P\in\mathcal{P}$ we determine $Q\in\mathcal{P}$ such that $P^*$ is a convex combination of these two distributions. Then, as $\langle\kappa^*,P\rangle\le H_{\max}(\mathcal{P})$ and $\langle\kappa^*,Q\rangle\le H_{\max}(\mathcal{P})$, and as the convex combination gives $\langle\kappa^*,P^*\rangle\le H_{\max}(\mathcal{P})$, we must conclude that $\langle\kappa^*,P\rangle=\langle\kappa^*,Q\rangle=H_{\max}(\mathcal{P})$, since $\langle\kappa^*,P^*\rangle$ is in fact equal to $H_{\max}(\mathcal{P})$. Therefore, $\kappa^*$ is cost-stable.

Theorem 6.8. If the alphabet $A$ is finite and the model $\mathcal{P}$ affine, then the model is in equilibrium and the $R_{\min}$-code is cost-stable.

Proof. We may assume that $\mathcal{P}$ is closed. By Theorem 6.1 the model is in equilibrium and, by continuity of the entropy function, the $H_{\max}$-attractor is an $H_{\max}$-distribution. For the $R_{\min}$-code $\kappa^*$, $\mathrm{supp}(P^*)=\mathrm{supp}(\kappa^*)=\mathrm{supp}(\mathcal{P})$ by Lemma 5.1. As $A$ is finite, we can then conclude that $P^*$ is algebraically inner and Lemma 6.7 applies.

We can now prove the following result:

Theorem 6.9. Let $\mathcal{P}$ be a non-empty model given by finitely many linear constraints as in (4.15):
$$\mathcal{P}=\{P\in M^1_+(A)\mid\langle E_1,P\rangle=\lambda_1,\dots,\langle E_n,P\rangle=\lambda_n\}.$$
Assume that the functions $E_1,\dots,E_n,1$ are linearly independent and that $H_{\max}(\mathcal{P})<\infty$. Then the model is in equilibrium and the optimal matching pair $(\kappa^*,P^*)$ belongs to the exponential family defined by (4.18) and (4.20). In particular, $\kappa^*$ is cost-stable.

Proof. The model is in equilibrium by Theorem 6.1. Let $(\kappa^*,P^*)$ be the corresponding optimal matching pair. If $A$ is finite, the result follows by Theorem 6.8 and some standard linear algebra.

Assume now that $A$ is infinite. Choose an asymptotically optimal sequence $(P_n)_{n\ge1}$. Let $A_0$ be a finite subset of $A$, chosen sufficiently large (see below), and denote by $\mathcal{P}_n$ the convex model of all $P\in\mathcal{P}$ for which $p_i=P_{n,i}$ for all $i\in A\setminus A_0$. Let $\tilde P_n$ be the $H_{\max}$-attractor for $\mathcal{P}_n$ and $\tilde\kappa_n$ the adapted code. Then this code is cost-stable for $\mathcal{P}_n$ and of the form
$$\tilde\kappa_{n,i}=\alpha_n+\sum_{\nu}\beta_{n,\nu}E_\nu(i),\qquad i\in A_0,$$
the sum being over the constraint index $\nu$. If the set $A_0$ is sufficiently large, the constants appearing here are uniquely determined. We find that $(\tilde P_n)_{n\ge1}$ is asymptotically optimal for $\mathcal{P}$ and therefore $\tilde P_n\to P^*$. It follows that the constants $\alpha_n$ and $\beta_{n,\nu}$ converge to some constants $\alpha$ and $\beta_\nu$ and that
$$\kappa^*_i=\alpha+\sum_{\nu}\beta_\nu E_\nu(i),\qquad i\in A_0.$$
As $A_0$ can be chosen arbitrarily large, the constants $\alpha$ and $\beta_\nu$ must be independent of $A_0$, and the above equation must hold for all $i\in A$.
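For the Mean Energy Model with a single constraint, the optimal pair can be computed explicitly on a finite alphabet. The sketch below is an editorial illustration: the alphabet, the energy function $E(i)=i$ and the level $\lambda=2$ are arbitrary choices, and the parametrization $P^*_i\propto e^{-\beta E(i)}$ is one common way of writing a member of the exponential family of (4.18) and (4.20). It solves the moment equation by bisection and checks that the adapted code is cost-stable on the model.

```python
import math

A = range(6)                 # alphabet {0,...,5}
E = lambda i: float(i)       # a single "energy" function
lam = 2.0                    # moment constraint  <E, P> = lam

def Z(beta):                 # partition function on this finite alphabet
    return sum(math.exp(-beta * E(i)) for i in A)

def mean_E(beta):            # <E, Q_beta> for the exponential family member Q_beta
    return sum(E(i) * math.exp(-beta * E(i)) for i in A) / Z(beta)

# mean_E is decreasing in beta; solve mean_E(beta) = lam by bisection.
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean_E(mid) > lam else (lo, mid)
beta = (lo + hi) / 2

P_star = [math.exp(-beta * E(i)) / Z(beta) for i in A]
kappa_star = [math.log(Z(beta)) + beta * E(i) for i in A]   # adapted code

# Cost-stability: <kappa*, P> = log Z(beta) + beta*lam for every consistent P.
consistent = [
    [0, 0, 1, 0, 0, 0],       # point mass at 2
    [0.5, 0, 0, 0, 0.5, 0],   # mean 0.5*0 + 0.5*4 = 2
]
h = math.log(Z(beta)) + beta * lam
for P in consistent:
    assert abs(sum(k * p for k, p in zip(kappa_star, P)) - h) < 1e-9

# H(P_star) equals the same constant, so P_star is the maximum entropy distribution.
H_star = -sum(p * math.log(p) for p in P_star)
assert abs(H_star - h) < 1e-6
```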
Remark. Extensions of the result just proved may well be possible, but care has to be taken. For instance, if we consider models obtained by infinitely many linear constraints, the result does not hold. As a simple instance of this, the reader may consider the case where the model is a "line", viz. the affine hull generated by the two distributions $P,Q$ on $A=\mathbb{N}$ given by $p_i=2^{-i}$, $i\ge1$, and $q_i=(\zeta(3)\,i^3)^{-1}$, $i\ge1$. This model is in equilibrium with $P$ as $H_{\max}$-distribution, but the adapted code is not cost-stable. These facts can be established quite easily via the results quoted in the footnote following the proof of Theorem 4.1.

7 Entropy-continuous models

In the sequel we shall only discuss models in equilibrium. Such models can behave quite differently as regards the entropy function near its maximum. We start with a simple observation.

Lemma 7.1. If $\mathcal{P}$ is in equilibrium and the value $H_{\max}(\mathcal{P})$ is attained on $\mathcal{P}$, it is only attained for the $H_{\max}$-attractor.

Proof. Assume that $P\in\mathcal{P}$ and that $H(P)=H_{\max}(\mathcal{P})$. Choose a sequence $(P_n)_{n\ge1}\subseteq\mathcal{P}$ which converges to $P$ in total variation. By lower semi-continuity, and as $H(P)=H_{\max}(\mathcal{P})$, we see that $(P_n)$ is asymptotically optimal. Therefore, for the $H_{\max}$-attractor $P^*$, $P_n\to P^*$ in the information topology, hence also in total variation. It follows that $P=P^*$.

Lemma 7.2. For a model $\mathcal{P}$ in equilibrium and with $H_{\max}$-attractor $P^*$, the following conditions are equivalent:

(i) $H:\mathcal{P}\to\mathbb{R}$ is continuous at $P^*$ in the topology of total variation,

(ii) $H:\mathcal{P}\to\mathbb{R}$ is sequentially continuous at $P^*$ in the information topology,

(iii) $H(P^*)=H_{\max}(\mathcal{P})$.

Proof. Clearly, (i) implies (ii). Assume that (ii) holds and let $(P_n)_{n\ge1}\subseteq\mathcal{P}$ be asymptotically optimal. Then $P_n\to P^*$ in the information topology. By assumption, $H(P_n)\to H(P^*)$, and (iii) follows since $H(P_n)\to H_{\max}(\mathcal{P})$ also holds. Finally, assume that (iii) holds and let $(P_n)_{n\ge1}\subseteq\mathcal{P}$ converge to $P^*$ in total variation. By lower semi-continuity,
$$H_{\max}(\mathcal{P})=H(P^*)\le\liminf_{n\to\infty}H(P_n)\le\limsup_{n\to\infty}H(P_n)\le H_{\max}(\mathcal{P})=H(P^*),$$
and $H(P_n)\to H(P^*)$ follows. Thus (i) holds.

A model $\mathcal{P}$ in equilibrium is entropy-continuous if $H(P^*)=H_{\max}(\mathcal{P})$, with $P^*$ the $H_{\max}$-attractor. In the opposite case we say that there is an entropy loss.

We now discuss entropy-continuous models. As we shall see, the previously introduced notion of a Nash equilibrium code, cf. Section 4, is of central importance in this connection. We need this concept for any $\mathcal{P}$-adapted code. Thus, by definition, a code $\kappa^*$ is a Nash equilibrium code if $\kappa^*$ is $\mathcal{P}$-adapted and if
$$R(\kappa^*|\mathcal{P})\le H(P^*)<\infty, \qquad (7.32)$$
where $P^*$ denotes the distribution matching $\kappa^*$. We stress that the definition applies to any model $\mathcal{P}$ (whether or not it is known beforehand that the model is in equilibrium). We shall see below that a Nash equilibrium code is unique.
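As a simple concrete case (an editorial illustration), let $\mathcal{P}$ be the set of all distributions on a finite alphabet. The code adapted to the uniform distribution then satisfies (7.32) with equality, since its average length equals $\log|A|$ under every $P$:

```python
import math
import random

A_size = 8
kappa = [math.log(A_size)] * A_size      # code adapted to the uniform distribution
H_uniform = math.log(A_size)             # = H(P*) for the matching distribution P* = U

# <kappa, P> = log|A| for every P, so R(kappa | P) = H(U) < infinity: condition (7.32).
random.seed(0)
for _ in range(5):
    w = [random.random() for _ in range(A_size)]
    P = [x / sum(w) for x in w]
    assert abs(sum(k * p for k, p in zip(kappa, P)) - H_uniform) < 1e-12
```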
Theorem 7.3 (entropy-continuous models). Let $\mathcal{P}\subseteq M^1_+(A)$ be a model. The following conditions are equivalent:

(i) $\mathcal{P}$ is in equilibrium and entropy-continuous,

(ii) $\mathcal{P}$ is in equilibrium and has a maximum entropy distribution,

(iii) $\mathcal{P}$ has a Nash equilibrium code.

If these conditions are fulfilled, the $H_{\max}$-distribution is unique and coincides with the $H_{\max}$-attractor. Likewise, the Nash equilibrium code is unique and coincides with the $R_{\min}$-code.

Proof. (i) $\Rightarrow$ (ii): This is clear since, assuming that (i) holds, the $H_{\max}$-attractor must be an $H_{\max}$-distribution.

(ii) $\Rightarrow$ (iii): Assume that $\mathcal{P}$ is in equilibrium and that $P_0\in\mathcal{P}$ is an $H_{\max}$-distribution. Let $(\kappa^*,P^*)$ be the optimal matching pair. Applying Theorem 6.2, (6.31) with $P=P_0$, we conclude that $D(P_0\|P^*)=0$, hence $P_0=P^*$. Then we find that
$$R(\kappa^*|\mathcal{P})=R_{\min}(\mathcal{P})=H_{\max}(\mathcal{P})=H(P_0)=H(P^*),$$
and we see that $\kappa^*$ is a Nash equilibrium code.

(iii) $\Rightarrow$ (i): If $\kappa^*$ is a Nash equilibrium code for $\mathcal{P}$, then
$$R_{\min}(\mathcal{P})\le R(\kappa^*|\mathcal{P})\le H(P^*)\le H_{\max}(\mathcal{P})\le R_{\min}(\mathcal{P}),$$
and we conclude that $\mathcal{P}$ is in equilibrium and that $\kappa^*$ is the minimum risk code.

In establishing the equivalence of (i)-(iii) we also established the uniqueness assertions claimed.

The theorem generalizes the previous result, Theorem 4.1. We refer to Section 4 for results which point to the wide applicability of results like Theorem 7.3.

8 Loss of entropy

We shall study a model $\mathcal{P}$ in equilibrium. By previous results we realize that, for many purposes, we may assume that $\mathcal{P}$ is a closed, convex subset of $M^1_+(A)$ with $H_{\max}(\mathcal{P})<\infty$. Henceforth, these assumptions are in force. Denote by $(\kappa^*,P^*)$ the optimal matching pair associated with $\mathcal{P}$.

By the dissection of $\mathcal{P}$ we understand the decomposition of $\mathcal{P}$ consisting of all non-empty sets of the form
$$\mathcal{P}_x=\{P\in\mathcal{P}\mid\langle\kappa^*,P\rangle=x\}. \qquad (8.33)$$
Let $\Lambda$ denote the set of $x\in\mathbb{R}$ with $\mathcal{P}_x\neq\emptyset$. As $R(\kappa^*|\mathcal{P})=R_{\min}(\mathcal{P})=H_{\max}(\mathcal{P})$, and as $\mathcal{P}$ is convex with $P^*\in\mathcal{P}$, $\Lambda$ is a subinterval of $[0,H_{\max}(\mathcal{P})]$ which contains the interval $[H(P^*),H_{\max}(\mathcal{P})[$.

Clearly, $\kappa^*$ is a cost-stable code for all models $\mathcal{P}_x$, $x\in\Lambda$. Hence, by Theorem 4.3, the entropy function is continuous on each of the sets $\mathcal{P}_x$, $x\in\Lambda$. Each set $\mathcal{P}_x$, $x\in\Lambda$, is a sub-model of $\mathcal{P}$ and, as each $\mathcal{P}_x$ is convex with $H_{\max}(\mathcal{P}_x)<\infty$, these sub-models are all in equilibrium. The linking identity shows that, for all $P\in\mathcal{P}_x$,
$$H(P)+D(P\|P^*)=x. \qquad (8.34)$$
This implies that $H_{\max}(\mathcal{P}_x)\le x$, a sharpening of the trivial inequality $H_{\max}(\mathcal{P}_x)\le H_{\max}(\mathcal{P})$.

From (8.34) it also follows that maximizing the entropy $H(\cdot)$ over $\mathcal{P}_x$ amounts to the same thing as minimizing the divergence $D(\cdot\|P^*)$ over $\mathcal{P}_x$. In other words, the $H_{\max}$-attractor of $\mathcal{P}_x$ may, alternatively, be characterized as the I-projection of $P^*$ on $\mathcal{P}_x$, i.e. as the unique distribution $P^*_x$ for which $Q_n\to P^*_x$ for every sequence $(Q_n)\subseteq\mathcal{P}_x$ for which $D(Q_n\|P^*)$ converges to the infimum of $D(Q\|P^*)$ over $Q\in\mathcal{P}_x$. (The terminology is close to that adopted by Csiszár, cf. [4], [5], who first developed the concept for closed models; this was later extended, using a different terminology, in Topsøe [20]. In this paper we refrain from a closer study of I-projections and refer the reader to the sources just cited.)

Further basic results are collected below:

Theorem 8.1 (dissection of models). Let $\mathcal{P}$ be a convex model in equilibrium with optimal matching pair $(\kappa^*,P^*)$ and assume that $P^*\in\mathcal{P}$. Then the following properties hold for the dissection $(\mathcal{P}_x)_{x\in\Lambda}$ defined by (8.33):

(i) The set $\Lambda$ is an interval with $\sup\Lambda=H_{\max}(\mathcal{P})$. A necessary and sufficient condition that $H_{\max}(\mathcal{P})\in\Lambda$ is that $\mathcal{P}$ is entropy-continuous. If $\mathcal{P}$ has entropy loss, $\Lambda$ contains the non-degenerate interval $[H(P^*),H_{\max}(\mathcal{P})[$.

(ii) The entropy function is continuous on each sub-model $\mathcal{P}_x$, $x\in\Lambda$.

(iii) Each sub-model $\mathcal{P}_x$, $x\in\Lambda$, is in equilibrium, and the $H_{\max}$-attractor for $\mathcal{P}_x$ is the I-projection of $P^*$ on $\mathcal{P}_x$.

(iv) For $x\in\Lambda$, $H_{\max}(\mathcal{P}_x)\le x$ and the following bi-implications hold, where $P^*_x$ denotes the $H_{\max}$-attractor of $\mathcal{P}_x$:
$$H_{\max}(\mathcal{P}_x)=x\iff P^*_x=P^*\iff x\ge H(P^*). \qquad (8.35)$$

Proof. (i)-(iii), as well as the inequality $H_{\max}(\mathcal{P}_x)\le x$ of (iv), were proved above. For the proof of (iv) we consider an $x\in\Lambda$ and let $(P_n)\subseteq\mathcal{P}_x$ be an asymptotically optimal sequence for $\mathcal{P}_x$. Then the condition $H_{\max}(\mathcal{P}_x)=x$ is equivalent to the condition $H(P_n)\to x$,
and the condition $P^*_x=P^*$ is equivalent to the condition $P_n\to P^*$. In view of the equality $x=H(P_n)+D(P_n\|P^*)$ we now realize that the first bi-implication of (8.35) holds. For the second bi-implication we first remark that, as $H(P^*_x)\le x$ holds generally, if $P^*_x=P^*$ then $x\ge H(P^*)$ must hold.

For the final part of the proof of (iv), we assume that $x\ge H(P^*)$. The equality $H_{\max}(\mathcal{P}_x)=x$ is evident if $x=H(P^*)$. We may therefore assume that $H(P^*)<x<H_{\max}(\mathcal{P})$. We now let $(P_n)$ denote an asymptotically optimal sequence for the full model $\mathcal{P}$ such that $H(P_n)\ge x$, $n\ge1$. As $\langle\kappa^*,P_n\rangle\ge x\ge\langle\kappa^*,P^*\rangle$ for all $n$, we can find a sequence $(Q_n)_{n\ge1}$ of distributions in $\mathcal{P}_x$ such that each $Q_n$ is a convex combination of the form $Q_n=\alpha_nP_n+(1-\alpha_n)P^*$. By (5.23), $D(Q_n\|P^*)\le\alpha_nD(P_n\|P^*)\to0$. Thus $P^*$ is essentially consistent for $\mathcal{P}_x$ and, as the code adapted to $P^*$ is cost-stable for this model, Corollary 6.4 implies that the model has $P^*$ as its $H_{\max}$-attractor; by cost-stability, $H_{\max}(\mathcal{P}_x)=x$ then also follows.

A distribution $P$ is said to have potential entropy loss if it is the $H_{\max}$-attractor of a model in equilibrium with entropy loss. As we shall see, this amounts to a very special behaviour of the point probabilities. The definition we need at this point we first formulate quite generally for an arbitrary distribution $P$. With $P$ we consider the density function associated with the adapted code, cf. the appendix. In terms of $P$ this function is given by
$$\nu(t)=\#\{i\in A\mid p_i\ge\exp(-t)\} \qquad (8.36)$$
($\#$ = "number of elements in"). We can now define a hyperbolic distribution as a distribution $P$ such that
$$\limsup_{t\to\infty}\frac{\log\nu(t)}{t}=1. \qquad (8.37)$$
Clearly, $\nu(t)\le\exp(t)$ for each $t$, so that the equality in the defining relation may just as well be replaced by the inequality "$\ge$".

We note that zero point probabilities do not really enter into the definition; therefore we may assume, without any essential loss of generality, that all point probabilities are positive. And then we may as well assume that the point probabilities are ordered: $p_1\ge p_2\ge\cdots$. In this case it is easy to see that (8.37) is equivalent to the requirement
$$\liminf_{i\to\infty}\frac{\log p_i}{\log(1/i)}=1. \qquad (8.38)$$
In the sequel we shall typically work with distributions which are ordered in the above sense. The terminology regarding hyperbolic distributions is inspired by [19] but goes back further, cf. [24]. In these references the reader will find remarks and results pertaining to this and related types of distributions and their discovery from empirical studies, which we will also comment on in the next section.

We note that in (8.38) the inequality "$\ge$" is trivial, as $p_i\le1/i$ for every $i\in A$. Therefore, in more detail, a distribution with ordered point probabilities is hyperbolic if and only if, for every $a>1$,
$$p_i\ge i^{-a} \qquad (8.39)$$
for infinitely many indices $i$.

Theorem 8.2. Every distribution with infinite entropy is hyperbolic.

Proof. Assume that $P$ is not hyperbolic and that the point probabilities are ordered. Then there exists $a>1$ such that $p_i\le i^{-a}$ for all sufficiently large $i$. As the distribution with point probabilities equal to $i^{-a}$, properly normalized, has finite entropy, the result follows.

With every model $\mathcal{P}$ in equilibrium we associate a partition function and an exponential family, simply by considering the corresponding objects associated with the $R_{\min}$-code for the model in question. This follows the definition given in Section 4, but for the simple case where there is only one "energy function", with the $R_{\min}$-code playing the role of the energy function.
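Criterion (8.38) is easy to probe numerically. In the sketch below (an editorial illustration; the two sequences are arbitrary examples) the point probabilities are only specified up to a normalizing constant, which does not affect the limit of $(-\log p_i)/\log i$. The first sequence, $p_i\propto 1/(i(\log i)^3)$, is hyperbolic and has finite entropy; a geometric sequence is not hyperbolic.

```python
import math

# Probabilities are only needed up to a normalizing constant: adding a constant
# to -log p_i does not change the limit of (-log p_i) / log i.
def neg_log_p_hyperbolic(i):     # p_i proportional to 1 / (i * (log i)^3), i >= 2
    return math.log(i) + 3 * math.log(math.log(i))

def neg_log_p_geometric(i):      # p_i proportional to 2^(-i)
    return i * math.log(2)

for i in (10**2, 10**4, 10**6, 10**8):
    r_hyp = neg_log_p_hyperbolic(i) / math.log(i)
    r_geo = neg_log_p_geometric(i) / math.log(i)
    print(i, round(r_hyp, 3), round(r_geo, 3))
# r_hyp decreases (slowly) towards 1, matching (8.38); r_geo grows without bound,
# so the geometric distribution is not hyperbolic.
```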
Theorem 8.3 (maximal models). Let $\mathcal{P}\subseteq M^1_+(A)$ be given and assume that there exists a model $\mathcal{P}'$ such that $\mathcal{P}\subseteq\mathcal{P}'$, $\mathcal{P}'$ is in equilibrium and $H_{\max}(\mathcal{P}')=H_{\max}(\mathcal{P})$. Then $\mathcal{P}$ itself must be in equilibrium. Furthermore, there exists a largest model $\mathcal{P}_{\max}$ with the stated properties, namely the model
$$\mathcal{P}_{\max}=\{P\in M^1_+(A)\mid\langle\kappa^*,P\rangle\le H_{\max}(\mathcal{P})\}, \qquad (8.40)$$
where $\kappa^*$ denotes the minimum risk code of $\mathcal{P}$. Finally, any model $\mathcal{P}'$ with $\mathcal{P}\subseteq\mathcal{P}'\subseteq\mathcal{P}_{\max}$ is in equilibrium and has the same optimal matching pair as $\mathcal{P}$.

Proof. Choose $\mathcal{P}'$ with the stated properties. By Theorem 6.1,
$$H_{\max}(\mathrm{co}\,\mathcal{P})\le H_{\max}(\mathrm{co}\,\mathcal{P}')=H_{\max}(\mathcal{P}')=H_{\max}(\mathcal{P}),$$
hence $\mathcal{P}$ is in equilibrium. Let $\kappa^*$ be the $R_{\min}$-code of $\mathcal{P}$ in accordance with Theorem 6.2 and consider $\mathcal{P}_{\max}$ defined by (8.40).

Now let $\mathcal{P}'\supseteq\mathcal{P}$ be an equilibrium model with $H_{\max}(\mathcal{P}')=H_{\max}(\mathcal{P})$. As an asymptotically optimal sequence for $\mathcal{P}$ is also asymptotically optimal for $\mathcal{P}'$, we realize that $\mathcal{P}'$ has the same $H_{\max}$-attractor, hence also the same $R_{\min}$-code, as $\mathcal{P}$. Thus $R(\kappa^*|\mathcal{P}')=R_{\min}(\mathcal{P}')=H_{\max}(\mathcal{P}')=H_{\max}(\mathcal{P})$, and it follows that $\mathcal{P}'\subseteq\mathcal{P}_{\max}$.

Clearly, $\mathcal{P}_{\max}$ is convex and $H_{\max}(\mathcal{P}_{\max})=H_{\max}(\mathcal{P})<\infty$, hence $\mathcal{P}_{\max}$ is in equilibrium by Theorem 6.1. The final assertion of the theorem follows by one more application of Theorem 6.1.

The models which can arise as in Theorem 8.3 via (8.40) are called maximal models. Let $\kappa\in K(A)$ and $0\le h<\infty$, and put
$$\mathcal{P}_{\kappa,h}=\{P\in M^1_+(A)\mid\langle\kappa,P\rangle\le h\}. \qquad (8.41)$$
We know that any maximal model must be of this form. Naturally, the converse does not hold. An obvious necessary condition is that the entropy of the matching distribution be finite, but we must require more. Clearly, the models in (8.41) are in equilibrium, but it is not clear that they have $\kappa$ as $R_{\min}$-code and $h$ as $H_{\max}$-value.

Theorem 8.4. A distribution $P\in M^1_+(\mathbb{N})$ with finite entropy has potential entropy loss if and only if it is hyperbolic.

Proof. We may assume that the point probabilities of $P$ are ordered. Denote by $\kappa^*$ the code adapted to $P$, by $\beta_0$ the abscissa of convergence associated with $\kappa^*$, and by $\eta$ the mean value function defined as in the appendix.

Assume first that $P$ is not hyperbolic and that $P$ is the attractor for some model $\mathcal{P}$ in equilibrium. Consider the corresponding maximal models $\mathcal{P}_{\kappa^*,h}$ and consider a value of $h$ with $H(P)\le h\le H_{\max}(\mathcal{P})$. As $\beta_0<1$, we can choose $\beta>\beta_0$ such that $\eta(\beta)=h$. Now both $P$ and the distribution $Q_\beta$ given by
$$Q_\beta=\frac{\exp(-\beta\kappa^*)}{Z(\beta)}$$
are attractors for $\mathcal{P}_{\kappa^*,h}$ and hence equal. It follows that $h=H(P)$; as $h$ was arbitrary in the range considered, $H_{\max}(\mathcal{P})=H(P)$ and there is no entropy loss.

Next we show that a hyperbolic distribution has potential entropy loss. Consider the maximal models $\mathcal{P}_{\kappa^*,h}$. Each of these models is given by a single linear constraint; therefore the attractor is an element of the corresponding exponential family. The abscissa of convergence is $1$ and, therefore, the range of the map $\eta$ on $[1,\infty[$ is $]\kappa^*_1,H(P)]$. For $h$ in this range there exists a consistent maximum entropy distribution. Assume now that $h_0>H(P)$ and that the attractor of $\mathcal{P}_{\kappa^*,h_0}$ equals
$$Q_\beta=\frac{\exp(-\beta\kappa^*)}{Z(\beta)}.$$
By Theorem 8.1, $Q_\beta$ must be the attractor for all $\mathcal{P}_{\kappa^*,h}$ with $h\in[\eta(\beta),h_0]$; especially, this holds for $h=H(P)$. This shows that $P=Q_\beta$. By Theorem 8.1 the conclusion is now clear.

9 Zipf's law

Zipf's law is an empirically discovered relationship for the relative frequencies of the words of a natural language. The law states that
$$\log f_i\approx -a\log i+b,$$
where $f_i$ is the relative frequency of the $i$'th most common word in the language, and where $a$ and $b$ denote constants. For large values of $i$ we then have
$$\frac{\log f_i}{\log(1/i)}\approx a.$$
The constants $a$ and $b$ depend on the language, but for many languages $a\approx1$, see [24].
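As an illustration of the form of the law (an editorial sketch using synthetic ranked frequencies, not an empirical corpus), data generated exactly as $f_i\propto i^{-a}$ are recovered by a least-squares fit of $\log f_i$ against $\log i$:

```python
import math

N, a_true = 50_000, 1.0
weights = [i ** (-a_true) for i in range(1, N + 1)]
total = sum(weights)
f = [w / total for w in weights]          # synthetic Zipf-type word frequencies

xs = [math.log(i) for i in range(1, N + 1)]
ys = [math.log(fi) for fi in f]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(-slope, intercept)   # -slope recovers a_true; intercept is the constant b
```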
Now consider an ideal language where the frequencies of words are described by a hyperbolic probability distribution $P$, and assume that the entropy of the distribution is finite. We shall describe in qualitative terms the consequences of these assumptions as they can be derived from the developed theory, especially Theorem 8.4. We shall see that our assumption introduces a kind of stability of the language which is desirable in most situations.

Small children with a limited vocabulary will use the few words they know with relative frequencies very different from the probabilities described by $P$. They will only form simple sentences, and at this stage the number of bits per word will be small in the sense that the entropy of the child's probability distribution is small. Therefore the parents will often be able to understand the child even though the pronunciation is poor. The parents will, typically, talk to their children with a lower bit rate than they normally use, but with a higher bit rate than their children. Thereby new words and grammatical structures are presented to the child, and, adopting elements of this structure, the child will be able to increase its bit rate. At a certain stage the child will be able to communicate at a reasonably high rate (about $H(P)$). Now the child knows all the basic words and structures of the language. The child is still able to increase its bit rate, but from now on this will make no significant change in the relative frequencies of the words. Bit rates higher than $H(P)$ are from now on obtained by the introduction of specialized words, which occur seldom in the language as a whole. The introduction of new specialized words can be continued during the rest of one's life. Therefore one is able to express even complicated ideas without changing the basic structure of the language; indeed there is, theoretically, no limit to the bit rate at which one can communicate without change of basic structure.

We realize that, in view of our theoretical results, specifically Theorem 8.4, the features of a natural language just discussed are only possible if the language obeys Zipf's law. Thus we have the striking phenomenon that the apparently "irregular" behaviour of models with entropy loss (or just potential entropy loss) is actually the key to desirable stability: for such models you can increase the bit rate, the level of communication, and still maintain the basic features of the language. One could even speculate that modelling based on entropy loss lies behind the phenomenon that many will recognize as a fact, viz. that "we can talk without thinking". We just start talking, using the basic structure of the language (and rather common words), and then from time to time stick in more informative words and phrases in order to give our talk more semantic content; but in doing so, we use relatively infrequent words and structures, thus not violating basic principles, and hence still speak recognizably Danish, English or whatever the case may be, so that the receiver or listener also feels at ease and recognizes our talk as unmistakably Danish, English or ...

We see that very informative speaking can be obtained by the use of infrequent expressions. Therefore a conversation between, say, two physicists may use English supplied with specialized words like "electron" and "magnetic flux". We recognize their language as English because the basic words and the grammar are the same in all English.
The specialists only have to know special words, not a special grammar. In this sense the languages are stable.

If the entropy of our distribution is infinite, the language will behave in just about the same manner as described above. In fact one would not feel any difference between a language with finite entropy and a language with infinite entropy.

We see that it is convenient for a language to follow Zipf's law, but the information theoretic methods also give some explanation of how the language may have evolved into a state which obeys Zipf's law. The set of hyperbolic distributions is convex. Therefore, if two information sources both follow Zipf's law, then so does their mixture, and if two information sources both approximately follow Zipf's law, their mixture will do so even more. The information sources may be from different languages, but it is more interesting to consider a small child learning the language. The child gets input from different sources: the mother, the father, other children etc. Trying to imitate their language, the child will use the words with frequencies which are closer to Zipf's law than those of the sources. As the language develops through the centuries, the frequencies will converge to a hyperbolic distribution.

Here we have discussed entropy as bits per word and not bits per letter. The letters give an encoding of the words which should primarily be understood by others, and therefore the encoding cannot simply be changed to obtain better data compression. To stress the difference between bits per word and bits per letter, we remark that the words are the basic semantic structure of the language. Therefore we may have an internal representation of the words which has very little to do with their length when spoken, which could explain why it is often much easier to remember a long word in a language you understand than a short word in a language you do not understand.

It would be interesting to compare these ideas with empirical measurements of the entropy considered here but, precisely in the regime where Zipf's law holds, such a study is very difficult, as convergence of estimators of the entropy is very slow, cf. [1].

A The partition function

In this appendix we collect some basic facts about partition functions associated with one linear constraint. The point of departure is a code $\kappa\in K(A)$. With $\kappa$ we associate the partition function $Z=Z_\kappa$, which maps $\mathbb{R}$ into $]0,\infty]$ and is given by
$$Z(\beta)=\sum_{i\in I}e^{-\beta\kappa_i}. \qquad (A.42)$$
Here we adopt the convention that $e^{-\beta\kappa_i}=0$ if $\beta=0$ and $\kappa_i=\infty$. Clearly, $Z$ is decreasing on $[1,\infty[$ and $Z(\beta)\to0$ for $\beta\to\infty$ (note that, given $K$, $e^{-\beta\kappa_i}\le e^{-K}e^{-\kappa_i}$ for all $i$ when $\beta$ is large enough).

The series defining $Z$ is a Dirichlet series, cf. Hardy and Riesz [7] or Mandelbrojt [17]. The abscissa of convergence we denote by $\beta_0$. Thus, by definition, $Z(\beta)<\infty$ for $\beta>\beta_0$ and $Z(\beta)=\infty$ for $\beta<\beta_0$. As $Z(1)=1$, $\beta_0\le1$. If $\mathrm{supp}(\kappa)$ is infinite, $Z(\beta)=\infty$ for $\beta\le0$, hence $\beta_0\ge0$. If $\mathrm{supp}(\kappa)$ is finite, $Z(\beta)<\infty$ for all $\beta\in\mathbb{R}$ and we then find that $\beta_0=-\infty$. Mostly, we shall have the case when $\mathrm{supp}(\kappa)$ is infinite in mind.

We shall characterize $\beta_0$ analytically. This is only a problem when $\mathrm{supp}(\kappa)$ is infinite. So assume that this is the case and also assume, for the sake of convenience, that $\kappa$ has full support, that the indexing set $I$ is the set of natural numbers, $I=\mathbb{N}$, and that $\kappa_1\le\kappa_2\le\cdots$.
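Before stating the analytical characterization of $\beta_0$, here is a small numerical sketch (an editorial illustration; the particular code $\kappa_i=\log(i(i+1))$ is an arbitrary example satisfying Kraft's equality, since $\sum_i 1/(i(i+1))=1$). For this code the ratio $(\log i)/\kappa_i$ tends to $1/2$, and the partial sums of the Dirichlet series settle down for $\beta>1/2$ but keep growing for $\beta<1/2$, in agreement with Lemma A.1 below.

```python
import math

def kappa(i):                    # a compact code: sum_i exp(-kappa_i) = sum_i 1/(i(i+1)) = 1
    return math.log(i * (i + 1))

for i in (10, 10**3, 10**6):     # the ratio log i / kappa_i tends to 1/2
    print(i, round(math.log(i) / kappa(i), 4))

for beta in (0.6, 0.45):         # partial sums of Z(beta) = sum_i exp(-beta * kappa_i)
    s, checkpoints = 0.0, (10**4, 10**5, 10**6)
    for i in range(1, 10**6 + 1):
        s += math.exp(-beta * kappa(i))
        if i in checkpoints:
            print(beta, i, round(s, 3))
```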
Lemma A.1. With assumptions as just introduced,
$$\beta_0=\limsup_{i\to\infty}\frac{\log i}{\kappa_i}. \qquad (A.43)$$

Proof. Denote the right-hand side of (A.43) by $\beta'$. First assume that $\beta>\beta'$. Then, for some $a>1$, $\kappa_i\ge a(\log i)/\beta$ for all sufficiently large values of $i$. For these values of $i$, $e^{-\beta\kappa_i}\le i^{-a}$ and we conclude that $Z(\beta)<\infty$. Conversely, assume that, for some value of $\beta$, $Z(\beta)<\infty$. Then $\beta>0$ and, as the terms $e^{-\beta\kappa_i}$ are non-increasing, $i\,e^{-\beta\kappa_i}\le Z(\beta)$ for all $i$; hence $\kappa_i\ge(\log i-\log Z(\beta))/\beta$ and therefore $\beta'\le\beta$. Together, the two observations show that $\beta_0=\beta'$.

Remark. If no special ordering on $A$ is given, the abscissa of convergence can be expressed analytically via the density function $\nu:\mathbb{R}\to\mathbb{N}_0$ ($\mathbb{N}_0=\mathbb{N}\cup\{0\}$), which is defined by
$$\nu(t)=\#\{a\in A\mid\kappa(a)\le t\} \qquad (A.44)$$
($\#$ = "number of elements in"). In fact, as follows easily from (A.43),
$$\beta_0=\limsup_{t\to\infty}\frac{\log\nu(t)}{t}. \qquad (A.45)$$

We can now introduce the exponential family associated with $\kappa$. It is the family of distributions $(Q_\beta)$, with $\beta$ ranging over all values with $Z(\beta)<\infty$, defined by
$$Q_\beta(a_i)=\frac{e^{-\beta\kappa_i}}{Z(\beta)},\qquad i\in I. \qquad (A.46)$$
The family of adapted codes, denoted $(\kappa_\beta)$, is also of significance. These codes are given by
$$\kappa_\beta(a_i)=\log Z(\beta)+\beta\kappa_i,\qquad i\in I. \qquad (A.47)$$

We also need certain approximations to $Z$, $Q_\beta$ and $\kappa_\beta$. For convenience we stick to the assumption $I=\mathbb{N}$, $\kappa_1\le\kappa_2\le\cdots$. We then define $Z_n$, $Q_{n,\beta}$ and $\kappa_{n,\beta}$ by
$$Z_n(\beta)=\sum_{i=1}^{n}e^{-\beta\kappa_i},\qquad\beta\in\mathbb{R}, \qquad (A.48)$$
$$Q_{n,\beta}(a_i)=\frac{e^{-\beta\kappa_i}}{Z_n(\beta)},\qquad i\le n, \qquad (A.49)$$
$$\kappa_{n,\beta}(a_i)=\log Z_n(\beta)+\beta\kappa_i,\qquad i\le n, \qquad (A.50)$$
it being understood that $\mathrm{supp}(Q_{n,\beta})=\mathrm{supp}(\kappa_{n,\beta})=\{1,2,\dots,n\}$. Formally, the approximating quantities could be obtained from the (non-compact) codes obtained from $\kappa$ by replacing $\kappa_i$ by the value $\infty$ for $i>n$.

We are particularly interested in the mean values $\langle\kappa,Q_\beta\rangle$ and $\langle\kappa,Q_{n,\beta}\rangle$, and define functions $\eta$ and $\eta_n$, $n\ge1$, by
$$\eta(\beta)=\langle\kappa,Q_\beta\rangle,\qquad Z(\beta)<\infty, \qquad (A.51)$$
$$\eta_n(\beta)=\langle\kappa,Q_{n,\beta}\rangle,\qquad\beta\in\mathbb{R}. \qquad (A.52)$$
Note that $\eta(1)=H(P)$, $P$ being the distribution matching $\kappa$, and that
$$\eta(\beta)=-\frac{Z'(\beta)}{Z(\beta)}=-\frac{d}{d\beta}\log Z(\beta), \qquad (A.53)$$
$$\eta_n(\beta)=-\frac{Z_n'(\beta)}{Z_n(\beta)}=-\frac{d}{d\beta}\log Z_n(\beta). \qquad (A.54)$$
Furthermore, $Z'$ is a Dirichlet series with the same abscissa of convergence as $Z$, hence $\eta(\beta)$ is well defined and finite for all $\beta>\beta_0$.
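The identities (A.53)-(A.54) are easy to verify numerically for a truncated partition function. The sketch below is an editorial illustration, reusing the example code $\kappa_i=\log(i(i+1))$; the truncation level, the value of $\beta$ and the step of the numerical derivative are arbitrary choices. It also checks the monotonicity properties (i) and (ii) of Lemma A.2 below.

```python
import math

def kappa(i):                                 # example code from above: kappa_i = log(i(i+1))
    return math.log(i * (i + 1))

def Z_n(beta, n):                             # truncated partition function, cf. (A.48)
    return sum(math.exp(-beta * kappa(i)) for i in range(1, n + 1))

def eta_n(beta, n):                           # mean code length under Q_{n,beta}, cf. (A.52)
    z = Z_n(beta, n)
    return sum(kappa(i) * math.exp(-beta * kappa(i)) for i in range(1, n + 1)) / z

# (A.54): eta_n(beta) = -(d/dbeta) log Z_n(beta); compare with a central difference.
n, beta, h = 1000, 0.8, 1e-6
numeric = -(math.log(Z_n(beta + h, n)) - math.log(Z_n(beta - h, n))) / (2 * h)
assert abs(numeric - eta_n(beta, n)) < 1e-6

# eta_n is decreasing in beta and increasing in n, cf. Lemma A.2 (i)-(ii).
assert eta_n(1.0, n) < eta_n(0.8, n)
assert eta_n(0.8, n) < eta_n(0.8, 2 * n)
```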
Lemma A.2. With assumptions and notation as above, the following properties hold:

(i) $\eta_1\le\eta_2\le\cdots$,

(ii) $\eta_n$ is strictly decreasing on $\mathbb{R}$ (except if $\kappa_1=\cdots=\kappa_n$),

(iii) $\lim_{\beta\to\infty}\eta_n(\beta)=\kappa_1$ and $\lim_{\beta\to-\infty}\eta_n(\beta)=\kappa_n$,

(iv) $\eta$ is strictly decreasing on $]\beta_0,\infty[$,

(v) $\lim_{\beta\to\infty}\eta(\beta)=\kappa_1$,

(vi) $\eta(\beta_0^+)$ is infinite if and only if $Z(\beta_0)=\infty$,

(vii) if $Z(\beta_0)<\infty$, then $\eta_n\to\eta$ uniformly on $[\beta_0,\infty[$,

(viii) $\lim_{n\to\infty}\eta_n(\beta_0)=\eta(\beta_0^+)$, the limit from the right at $\beta_0$,

(ix) for every $\beta<\beta_0$, $\lim_{n\to\infty}\eta_n(\beta)=\infty$.

Proof. (i) follows from
$$\eta_{n+1}(\beta)-\eta_n(\beta)=\frac{1}{Z_{n+1}(\beta)Z_n(\beta)}\sum_{i=1}^{n+1}(\kappa_{n+1}-\kappa_i)\,e^{-\beta(\kappa_i+\kappa_{n+1})},$$
(ii) from
$$\eta_n'(\beta)=-\frac{1}{Z_n(\beta)^2}\sum_{n\ge i>j}(\kappa_i-\kappa_j)^2\,e^{-\beta(\kappa_i+\kappa_j)},$$
and (iv) from an obvious extension of this formula. Writing $\eta_n(\beta)$ as the weighted average of $\kappa_1,\dots,\kappa_n$ with weights proportional to $e^{-\beta\kappa_i}$, we derive the limiting behaviour of (iii) and, with a little care, the limit relation of (v) follows in a similar way.

(vii) follows from (A.53) and (A.54), since the convergences $Z_n(x)\to Z(x)$ and $Z_n'(x)\to Z'(x)$ hold uniformly on $[\beta_0,\infty[$ when $Z(\beta_0)<\infty$. Actually, the uniform convergence is first derived only for intervals of the form $[\beta_0,K]$; by (i) and (v) it is easy to extend the uniform convergence to $[\beta_0,\infty[$.

It is now clear that, for $n\ge1$ and $x>\beta_0$, $\eta_n(x)\le\eta(x)\le\eta(\beta_0^+)$; hence $\lim_{n\to\infty}\eta_n(\beta_0)\le\eta(\beta_0^+)$. On the other hand, for $x>\beta_0$, $\lim_{n\to\infty}\eta_n(\beta_0)\ge\lim_{n\to\infty}\eta_n(x)=\eta(x)$. We conclude that (viii) holds.

If $Z'(\beta_0)$ is finite, then $Z(\beta_0)<\infty$ and $\lim_{n\to\infty}\eta_n(\beta_0)=-Z'(\beta_0)/Z(\beta_0)<\infty$; by (viii), this shows that $\eta(\beta_0^+)<\infty$. Now assume that $Z'(\beta_0)$ is infinite. If $Z(\beta_0)<\infty$, it is easy to see that $\eta(\beta_0^+)=\infty$. If also $Z(\beta_0)=\infty$, we choose, to $N\ge1$, $n_0>N$ such that $Q_{n,\beta_0}(\{1,2,\dots,N\})\le\tfrac12$ for $n\ge n_0$. Then, for $n\ge n_0$,
$$\eta_n(\beta_0)=\sum_{i=1}^{n}\kappa_i\,Q_{n,\beta_0}(i)\ge\kappa_N\,Q_{n,\beta_0}(\{N+1,\dots,n\})\ge\tfrac12\kappa_N.$$
This shows that $\eta_n(\beta_0)\to\infty$, hence $\eta(\beta_0^+)=\infty$. We have now proved (vi).

In order to prove (ix), let $\beta<\beta_0$ and choose, to a given $K$, $n_0$ such that $\kappa_i\ge K$ for $i\ge n_0$. Then, for $n\ge n_0$,
$$\eta_n(\beta)\ge K\left(1+\frac{\sum_{i=1}^{n_0-1}e^{-\beta\kappa_i}}{\sum_{i=n_0}^{n}e^{-\beta\kappa_i}}\right)^{-1}.$$
As $\sum_{i=n_0}^{\infty}e^{-\beta\kappa_i}=\infty$, we see that for $n$ sufficiently large, $\eta_n(\beta)\ge K/2$, and (ix) follows.

Remark. The formulas for $\eta$ and $\eta'$ can be interpreted more probabilistically. Consider $\eta$ and remark first that, when we regard $\kappa$ as a random variable defined on the discrete probability space $(A,Q_\beta)=(\mathbb{N},Q_\beta)$, then $\eta(\beta)$ is the expectation of this random variable. A simple calculation shows that $-\eta'(\beta)$ is the variance of this random variable.

References

[1] A. Antos and I. Kontoyiannis, "Convergence Properties of Functional Estimates for Discrete Distributions," to appear.

[2] J.-P. Aubin, Optima and Equilibria. An Introduction to Nonlinear Analysis, Berlin: Springer.

[3] T.M. Cover and J.A. Thomas, Information Theory, New York: Wiley, 1991.

[4] I. Csiszár, "I-divergence geometry of probability distributions and minimization problems," Ann. Probab., vol. 3, pp. 146-158, 1975.

[5] I. Csiszár, "Sanov property, generalized I-projection and a conditional limit theorem," Ann. Probab., vol. 12, pp. 768-793, 1984.

[6] R.G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.

[7] G.H. Hardy and M. Riesz, The General Theory of Dirichlet's Series, Cambridge: Cambridge University Press, 1915.

[8] P. Harremoës, "Binomial and Poisson Distributions as Maximum Entropy Distributions," IEEE Trans. Inform. Theory, vol. 47, pp. 2039-2041, 2001.

[9] P. Harremoës, "The Information Topology," in preparation.

[10] D. Haussler, "A general Minimax Result for Relative Entropy," IEEE Trans. Inform. Theory, vol. 43, pp. 1276-1280, 1997.

[11] E.T. Jaynes, "Information Theory and Statistical Mechanics," Physical Review, vol. 106, pp. 620-630, and vol. 108, pp. 171-190, 1957.

[12] E.T. Jaynes, "Clearing up mysteries - the original goal," in Maximum Entropy and Bayesian Methods, J. Skilling (ed.), Kluwer, Dordrecht, 1989.

[13] http://bayes.wustl.edu [ONLINE] - a web page dedicated to Edwin T. Jaynes, maintained by L. Brethorst.

[14] J.N. Kapur, Maximum Entropy Models in Science and Engineering, New York: Wiley, 1993 (first edition 1989).

[15] D. Kazakos, "Robust Noiseless Source Coding Through a Game Theoretic Approach," IEEE Trans. Inform. Theory, vol. 29, pp. 577-583, 1983.

[16] S. Kullback, Information Theory and Statistics, New York: Wiley, 1959 (Dover edition 1968).

[17] S. Mandelbrojt, Séries de Dirichlet, Paris: Gauthier-Villars, 1969.

[18] B.B. Mandelbrot, "On the theory of word frequencies and on related Markovian models of discourse," in R. Jacobsen (ed.), Structures of Language and its Mathematical Aspects, New York: American Mathematical Society, 1961.

[19] M. Schroeder, Fractals, Chaos, Power Laws, New York: W.H. Freeman, 1991.

[20] F. Topsøe, "Information theoretical Optimization Techniques," Kybernetika, vol. 15, pp. 8-27.

[21] F. Topsøe, "Game theoretical equilibrium, maximum entropy and minimum information discrimination," in Maximum Entropy and Bayesian Methods, A. Mohammad-Djafari and G. Demoments (eds.), pp. 15-23, Kluwer, Dordrecht, 1993.

[22] F. Topsøe, "Maximum Entropy versus Minimum Risk and Applications to some classical discrete Distributions," submitted for publication.

[23] F. Topsøe, "Basic Concepts, Identities and Inequalities - the Toolkit of Information Theory," Entropy, vol. 3, pp. 162-190, 2001. http://www.mdpi.org/entropy/ [ONLINE].

[24] G.K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge.
