doi: 10.1111/jeb.12010
Natural selection. V. How to read the fundamental equations
of evolutionary change in terms of information theory*
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
Keywords: evolutionary theory; Fisher information; mathematical models; population genetics.
The equations of evolutionary change by natural selection are commonly
expressed in statistical terms. Fisher’s fundamental theorem emphasizes the
variance in fitness. Quantitative genetics expresses selection with covariances and regressions. Population genetic equations depend on genetic variances. How can we read those statistical expressions with respect to the
meaning of natural selection? One possibility is to relate the statistical
expressions to the amount of information that populations accumulate by
selection. However, the connection between selection and information theory has never been compelling. Here, I show the correct relations between
statistical expressions for selection and information theory expressions for
selection. Those relations link selection to the fundamental concepts of
entropy and information in the theories of physics, statistics and communication. We can now read the equations of selection in terms of their natural
meaning. Selection causes populations to accumulate information about the environment.
There are difficulties in applying information theory
in genetics. They arise principally, not in the transmission of information, but in its meaning (Maynard
Smith, 2000, p. 181).
I show that natural selection can be described by the
same measure of information that provides the conceptual foundations of physics, statistics and communication. Briefly, the argument runs as follows. The
classical models of selection express evolutionary rates
in proportion to the variance in fitness. The variance in
fitness is equivalent to a symmetric form of the Kullback–Leibler information that the population acquires
about the environment through the changes in gene
frequency caused by selection.
Correspondence: Steven A. Frank, Department of Ecology and
Evolutionary Biology, University of California, Irvine,
CA 92697–2525, USA
Tel.: +1 949 824 2244; fax: +1 949 824 2181
*Part of the Topics in Natural Selection series. See Box 1.
Kullback–Leibler information is closely related to
Fisher information, likelihood and Bayesian updating
from statistics, as well as Shannon information and the
measures of entropy that arise as the fundamental
quantities of communication theory and physics. Thus,
the common variances and covariances of evolutionary
models are equivalent to the fundamental measures of
information that arise in many different fields of study.
In Fisher’s fundamental theorem of natural selection,
the rate of increase in fitness caused by natural selection is equal to the genetic variance in fitness. Equivalently, the rate of increase in fitness is proportional to
the amount of information that the population acquires
about the environment (Frank, 2009).
In my view, information is a primary quantity with
intuitive meaning in the study of selection, whereas
the genetic variance just happens to be an algebraic
equivalence for the measure of information. The history
of evolutionary theory has it backwards, using statistical
expressions of variances and covariances in place of the
equivalent and more meaningful expressions of information. To read the fundamental equations of evolutionary change, one must learn to interpret the
standard expressions of variances and covariances as
expressions of information.
© 2012 THE AUTHOR. J. EVOL. BIOL. 25 (2012) 2377–2396
Box 1: Topics in the theory of natural selection
This article is part of a series on natural selection.
Although the theory of natural selection is simple, it
remains endlessly contentious and difficult to apply. My
goal is to make more accessible the concepts that are so
important, yet either mostly unknown or widely misunderstood. I write in a nontechnical style, showing the key
equations and results rather than providing full derivations or discussions of mathematical problems. Boxes list
technical issues and brief summaries of the literature.
The first section reviews the classic statistical expressions
for selection. Evolutionary change caused by selection is
the covariance between fitness and character value.
That covariance equals the regression of character value
on fitness multiplied by the variance in fitness.
The second section expresses selection in terms of
the classic equations from information theory (Box 2).
I show that the change in the mean logarithm of fitness
is the Jeffreys information divergence. That divergence
measures the accumulation of information by natural
selection between the initial population and the
population after it has been updated by selection. The
relations between the statistical and information
perspectives follow by connecting the classic statistical
expressions of selection to the new information description for selection.
The third section analyses the Jeffreys divergence as
the measure of information in the fundamental equations of selection. The Jeffreys divergence is the sum of
two expressions for relative entropy. Relative entropy,
known as the Kullback–Leibler divergence, measures
the gain in information with regard to an abstract and
universal notion of encoding, independently of the
meaning of that information. A universal, abstract measure of information in terms of encoding allows a general theory of information to provide the foundation
for the deepest concepts in communication, physics and statistics.
The fourth section concerns the meaning of information. Although encoding provides a useful measure with
regard to information theory, we must also interpret the
meaning of that information in terms of selection. Meaning arises by the relation of encoded information to
whatever scale we use to interpret a particular problem.
For selection, we interpret meaning with regard to characters. Characters may be gene frequencies or measurements made on individuals. Characters lead to a general
notion of the scale for meaning with respect to the scale
of encoded information.
Box 2: Information, entropy and complexity
Cover & Thomas (1991) give an excellent introduction to
information theory and its applications. Jaynes (2003) is a
fascinating analysis of the connections between information,
entropy, probability, Bayesian analysis and statistical inference. Kullback (1959) is a broad synthesis of information
theory in relation to classical statistics. Fisher’s (1922, 1925)
original papers on the theoretical foundations of statistics set
the basis for all future work on information and statistics,
with the 1925 paper showing the key role of Fisher information.
Entropy arose in the study of thermodynamics (Clausius,
1867; Boltzmann, 1872; Gibbs, 1902). Ben-Naim (2008a)
gives a simple introduction. Hill (1987) provides a classical
text. Information theory arose in Fisher’s work and separately in the study of communication through the analyses
of Hartley (1928) and Shannon (1948a, b). The underlying
concepts of entropy and information are very close. Some
think the concepts are identical, but controversy remains
(Jaynes, 2003; Ben-Naim, 2008b).
Jeffreys (1946) divergence first appeared in an attempt to
derive prior distributions for use in Bayesian analysis rather
than as the sort of divergence used in this article. Kullback
& Leibler (1951) and Kullback (1959) presented both the
asymmetric divergence D, given in eqn 10, which is now
known as the Kullback–Leibler divergence, and the symmetric form, J, given in eqn 12, which is now known as the
Jeffreys divergence. They noted Jeffreys’ previous usage of J
in the context of Bayesian priors and then developed the
importance of the divergence interpretation for statistical
theory, particularly the asymmetric form, D.
I do not discuss Kolmogorov complexity in this article.
However, it is an important concept that may ultimately
prove as interesting for biological applications as the classic
analyses of entropy and information. Kolmogorov complexity measures the information content of an object
(individual) by the shortest binary computer program that
fully describes the object (Cover & Thomas, 1991; Li &
Vitányi, 2008). At the population level, the average Kolmogorov complexity often has a close association with the
formal theories of entropy and information, but it is not
exactly the same.
With respect to selection, fitness is, in essence, the match of
characters to environmental challenge. That match depends
on the algorithmic relation between the information content
of an organism and the interpretation of that information
through the development of phenotype. Development is not
exactly like running a computer program encoded in the
genes, but the analogy is not so far off. I suspect that, someday, Kolmogorov complexity or related measures will help to
understand biochemical, developmental and evolutionary
processes. A few authors have taken the first steps (Gell-Mann
& Lloyd, 1996; Adami & Cerf, 2000; Adami, 2002).
Selection and information
The fifth section explicitly connects the abstract scale
of encoded information to the meaningful scale of
information in problems of selection. The analysis leads
to the relation between the Jeffreys divergence, the
most general expression for selection, and Fisher information as the limiting form of the Jeffreys divergence
when changes in magnitude are small. Fisher information is the sensitivity of changes in abstract encoded
information relative to the distance that one moves
along a scale of meaning. Encoded information is
equivalent to the log-likelihood ratio, which is why
Fisher information provides the conceptual foundations
for the theory of statistics.
The sixth section uses Fisher information to derive
various elegant expressions for selection. For example,
suppose that changes in the average value of a character sufficiently describe the changes caused by
selection. Then, mean log fitness increases by the
Fisher information in an observation about the average character value multiplied by the squared change
in the average character value. This expression connects the scale of encoded information, which is
mean log fitness, to the scale of meaning, which in
this case is the average value of a character in the population.
The seventh section relates the parametric description
of characters to a more general nonparametric expression. In the previous example, the change caused by
selection was described fully by a change in a parameter, the mean. In the general case, no parametric summary statistics fully capture the change in populations.
Instead, one must use the full range of different types
in the population, providing a nonparametric description of the change in the distribution of frequencies
caused by selection. The full nonparametric expression
shows the universal applicability of the equations of selection and information.
The eighth section distinguishes changes by selection
from total evolutionary change. Numerous extrinsic
and unpredictable forces beyond selection can change
the characteristics of populations and their fit to the
environment. I show the full expression for evolutionary change, placing selection in the broader evolutionary context. No general conclusion about total
evolutionary change is possible, because the complete
range of forces that can perturb populations remains
unpredictable. However, we can express an elegant
equilibrium condition. At equilibrium, the gain in
information by selection must be exactly balanced by
the decay in information caused by other evolutionary forces.
The discussion reviews the main argument. Classic
equations for selection describe the change by statistical
expressions of covariances, variances and regressions.
In terms of encoded information, the change caused by
selection is the Jeffreys divergence. A generalized notion
of Fisher information connects encoded information to
the scale of meaning. By equating the statistical description with the information description, we learn how to
read the fundamental equations of selection in terms of information.
Classic equations of natural selection
Equations of natural selection are often expressed in
the statistical language of population variances, covariances and regressions. In this section, I show how these
statistical expressions arise from the simplest models of
selection. Later sections connect these classic equations
to the amount of information that a population accumulates by selection.
Textbooks on population genetics and quantitative
genetics present the classic equations of selection (Crow
& Kimura, 1970; Falconer & Mackay, 1996; Roff, 1997;
Futuyma, 1998; Lynch & Walsh, 1998; Charlesworth &
Charlesworth, 2010; Ewens, 2010). Lande developed
the statistical nature of selection equations (Lande,
1979; Lande & Arnold, 1983; Frank, 1997c).
A simple model starts with n different types of individuals. The frequency of each type is $q_i$. Each type has $w_i$
offspring, where w expresses fitness. In the simplest
case, each type is a clone producing $w_i$ copies of itself
in each round of reproduction.
The frequency of each type after selection is

$$q'_i = q_i \frac{w_i}{\bar{w}}, \tag{1}$$

where $\bar{w} = \sum q_i w_i$ is average fitness. The summation
is over all of the n different types indexed by
the i subscripts. See Box 3 for the proper interpretation
of $q'_i$.
This equation is called a haploid model in classical
population genetics, because it expresses the dynamics
of different alleles at a haploid genetic locus. Recently,
economists, mathematicians and game theorists have
called this expression the replicator equation, because it
expresses in the simplest way the dynamics of replication (Taylor & Jonker, 1978; Hofbauer & Sigmund,
1998, 2003).
It is often convenient to rewrite eqn 1 as the change
in the frequency of each type, $\Delta q_i = q'_i - q_i$. Subtracting
$q_i$ from both sides of eqn 1 yields

$$\Delta q_i = q_i \left( \frac{w_i}{\bar{w}} - 1 \right). \tag{2}$$
Box 3 describes a universal interpretation of these
equations for selection that transcends the narrow
haploid and replicator models.
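As a minimal sketch of eqns 1 and 2, the following Python code applies one round of selection to a small set of clonal types. The function name `select` and the example frequencies and fitnesses are illustrative assumptions, not values taken from this article.

```python
def select(q, w):
    """One round of selection following eqn 1: q'_i = q_i * w_i / w_bar."""
    w_bar = sum(qi * wi for qi, wi in zip(q, w))        # average fitness
    q_prime = [qi * wi / w_bar for qi, wi in zip(q, w)]
    return q_prime, w_bar

# Hypothetical example with three types
q = [0.5, 0.3, 0.2]   # initial frequencies
w = [1.0, 2.0, 3.0]   # fitnesses (offspring per individual)

q_prime, w_bar = select(q, w)

# Change in frequency of each type (eqn 2)
delta_q = [qp - qi for qp, qi in zip(q_prime, q)]
```

In this sketch, the updated frequencies still sum to one and the frequency changes sum to zero, as eqns 1 and 2 require.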
Box 3: Interpretation of q′ and z′
Classical population genetics and replicator equation analyses
interpret $q'_i$ in eqn 1 as the frequency of type i in the descendant population. However, selection theory in its most
abstract and general form requires a set mapping interpretation, in which $q'_i$ is the frequency of descendants derived from
type i in the ancestral population. The set mapping interpretation arises from the Price equation (Price, 1972a; Frank,
1995, 1997c, 1998).
Similarly, $z'_i$, developed in eqn 26 and mentioned earlier,
is the average value of the property associated with z among
the descendants derived from ancestors with index i, rather
than the usual interpretation of the character value of i
types in the descendant population. Here, I elaborate briefly
on these interpretations of $q'$ and $z'$ by adapting the presentation in Frank (2012b).
Let $q_i$ be the frequency of the ith type in the ancestral population. The index i may be used as a label for any sort of
property of things in the set, such as allele, genotype, phenotype, group of individuals and so on. Let $q'_i$ be the frequencies
in the descendant population, defined as the fraction of the
descendant population that is derived from members of the
ancestral population that have the label i. Thus, if i = 2 specifies a particular phenotype, then $q'_2$ is not the frequency of the
phenotype i = 2 among the descendants. Rather, it is the fraction of the descendants derived from entities with the phenotype i = 2 in the ancestors. One can have partial assignments,
such that a descendant entity derives from more than one
ancestor, in which case each ancestor gets a fractional assignment of the descendant. The key is that the i indexing is
always with respect to the properties of the ancestors, and
descendant frequencies have to do with the fraction of
descendants derived from particular ancestors.
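The set mapping definition of $q'_i$, including partial assignments, can be illustrated with a short Python sketch. The data layout (one dictionary of ancestral shares per descendant) and the function name `descendant_fractions` are assumptions made only for illustration.

```python
from collections import defaultdict

def descendant_fractions(descendants):
    """Return q'_i: the fraction of descendants derived from ancestors labelled i.

    `descendants` holds one dict per descendant, mapping an ancestral label i
    to the share of that descendant assigned to i. Shares are assumed to sum
    to 1 for each descendant.
    """
    totals = defaultdict(float)
    for shares in descendants:
        for label, share in shares.items():
            totals[label] += share
    n = len(descendants)
    return {label: total / n for label, total in totals.items()}

# Hypothetical population of four descendants. The last one derives equally
# from ancestors with labels 1 and 2, so each label gets a fractional share.
descendants = [{1: 1.0}, {2: 1.0}, {2: 1.0}, {1: 0.5, 2: 0.5}]
q_prime = descendant_fractions(descendants)
```

The labels always refer to properties of the ancestors. The phenotypes of the descendants themselves never enter the calculation.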
Given this particular mapping between sets, we can specify a particular definition for fitness. Let $q'_i = q_i (w_i/\bar{w})$,
where $w_i$ is the fitness of the ith type and $\bar{w} = \sum q_i w_i$ is
average fitness. Here, $w_i/\bar{w}$ is proportional to the fraction of
the descendant population that derives from type i entities
in the ancestors.
Usually, we are interested in how some measurement
changes or evolves between sets or over time. Let the measurement for each i be $z_i$. The value z may be the frequency
of a gene, the squared deviation of some phenotypic value
in relation to the mean, the value obtained by multiplying
measurements of two different phenotypes of the same
entity and so on. In other words, $z_i$ can be a measurement
of any property of an entity with label, i. The average property value is $\bar{z} = \sum q_i z_i$, where this is a population average.
The value $z'_i$ has a peculiar definition that parallels the
definition for $q'_i$. In particular, $z'_i$ is the average measurement
of the property associated with z among the descendants
derived from ancestors with index i. The population average
among descendants is $\bar{z}' = \sum q'_i z'_i$.
The Price equation (eqn 26) expresses the total change
in the average property value, $\Delta \bar{z} = \bar{z}' - \bar{z}$, in terms of
these special definitions of set relations. This way of
expressing total evolutionary change and the part of total
change that can be separated out as selection is very different from the usual ways of thinking about populations
and evolutionary change. The set mapping interpretation
allows one to generalize equations of selection theory and
total evolutionary change to a much wider array of problems than would be possible under the common interpretations of the terms. By following the set mapping
approach, our evaluation of selection and information can
be presented in a much simpler and more general way.
Note that the classic interpretations of the haploid and
replicator models are special cases of the generalized set
mapping expressions.

Equation 2 describes the change in frequency. How
does selection change the value of characters? Suppose
that each type, i, has an associated character value, $z_i$.
The average character value in the initial population is
$\bar{z} = \sum q_i z_i$. The average character value in the descendant population is $\bar{z}' = \sum q'_i z'_i$, where $z'_i$ is the character
value in the descendants (Box 3). For now, assume
that descendants have the same character value as their
parents, $z'_i = z_i$. Then, $\bar{z}' = \sum q'_i z_i$, and the change in the
average value of the character caused by selection is

$$\bar{z}' - \bar{z} = \Delta_s \bar{z} = \sum q'_i z_i - \sum q_i z_i = \sum (q'_i - q_i) z_i,$$

where $\Delta_s$ means the change caused by selection (Price,
1972b; Ewens, 1989; Frank & Slatkin, 1992). We may
simplify this expression by using $\Delta q_i = q'_i - q_i$ for
frequency changes

$$\Delta_s \bar{z} = \sum \Delta q_i z_i. \tag{3}$$
This equation expresses the fundamental concept of
selection (Frank, 2012b). Frequencies change according
to differences in fitness, as given by eqn 2. Thus, eqn 3
is the change in character value caused by differences
in fitness, holding constant the character values, $z_i$.
Later, we will also include the changes in character
values during transmission from parent to offspring,
$\Delta z_i = z'_i - z_i$.
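A short Python sketch, under the assumption $z'_i = z_i$, shows that the frequency-change form of eqn 3 gives the same number as the direct difference between the average character value after and before selection. The function name `selection_change` and the numerical values are hypothetical.

```python
def selection_change(q, w, z):
    """Change in mean character caused by selection (eqn 3): sum of dq_i * z_i."""
    w_bar = sum(qi * wi for qi, wi in zip(q, w))
    dq = [qi * (wi / w_bar - 1.0) for qi, wi in zip(q, w)]   # eqn 2
    return sum(dqi * zi for dqi, zi in zip(dq, z))

# Hypothetical frequencies, fitnesses and character values
q = [0.5, 0.3, 0.2]
w = [1.0, 2.0, 3.0]
z = [10.0, 12.0, 15.0]

ds_z = selection_change(q, w, z)

# Direct check: mean after selection minus mean before, holding z_i constant
w_bar = sum(qi * wi for qi, wi in zip(q, w))
z_bar = sum(qi * zi for qi, zi in zip(q, z))
z_bar_after = sum(qi * wi / w_bar * zi for qi, wi, zi in zip(q, w, z))
```

Because fitness and character value rise together in this example, `ds_z` is positive and matches `z_bar_after - z_bar` up to rounding.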
Variance, covariance and regression
Many of the classic equations of selection are expressed
in terms of variances, covariances and regressions. I show
the relation between the expression for frequency
changes in eqn 3 and the common statistical expressions
for selection.
Combining eqns 2 and 3 leads to

$$\Delta_s \bar{z} = \sum \Delta q_i z_i = \sum q_i \left( \frac{w_i}{\bar{w}} - 1 \right) z_i.$$
On the right-hand side, move the $\bar{w}$ term outside

$$\Delta_s \bar{z} = \sum q_i \left( \frac{w_i}{\bar{w}} - 1 \right) z_i = \sum q_i (w_i - \bar{w}) z_i / \bar{w}. \tag{4}$$
The definition of the population covariance allows us
to rewrite this equation. Given a population of paired
values $(x_i, y_i)$, where each particular pair subscripted by
i occurs at frequency $q_i$, and writing $\bar{x}$ as the mean
value in the population of the x values, the population
covariance has the general form

$$\sum q_i (x_i - \bar{x}) y_i = \mathrm{Cov}(x, y).$$

Note that the right-hand expression in eqn 4 has
the form of the covariance definition, so we can write

$$\Delta_s \bar{z} = \sum q_i (w_i - \bar{w}) z_i / \bar{w} = \mathrm{Cov}(w, z)/\bar{w}, \tag{5}$$

following Price (1970). The standard definition of a
regression coefficient of y on x is the covariance of y
and x divided by the variance of x. Thus, the regression
of fitness, w, on character, z, is

$$\beta_{wz} = \frac{\mathrm{Cov}(w, z)}{V_z},$$

where $V_z$ denotes the variance of z. This expression
implies $\mathrm{Cov}(w, z) = \beta_{wz} V_z$. We can also reverse the order
of the regression, $\mathrm{Cov}(w, z) = \beta_{zw} V_w$. Thus, eqn 5 is

$$\Delta_s \bar{z} = \beta_{wz} V_z/\bar{w} = \beta_{zw} V_w/\bar{w}.$$

Because z can be the value of any character, we can
use fitness, w, in place of z, yielding

$$\Delta_s \bar{w} = V_w/\bar{w},$$

where the regression has disappeared because the
regression of a variable on itself is one, thus $\beta_{ww} = 1$.
This expression shows that the change in mean fitness
is the variance in fitness, normalized by the initial
mean value.
All of these expressions assume that character values
do not change between parent and offspring, $\Delta z_i = 0$.
As I mentioned, I will take up changes during transmission in a later section.
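The equivalence of the frequency-change, covariance and regression forms can be checked numerically. The Python sketch below uses hypothetical values and helper names (`mean`, `cov`) that are assumptions for illustration. The covariance follows the frequency-weighted definition given above.

```python
def mean(q, x):
    """Frequency-weighted population mean."""
    return sum(qi * xi for qi, xi in zip(q, x))

def cov(q, x, y):
    """Population covariance: sum of q_i * (x_i - x_bar) * y_i."""
    x_bar = mean(q, x)
    return sum(qi * (xi - x_bar) * yi for qi, xi, yi in zip(q, x, y))

# Hypothetical frequencies, fitnesses and character values
q = [0.5, 0.3, 0.2]
w = [1.0, 2.0, 3.0]
z = [10.0, 12.0, 15.0]

w_bar = mean(q, w)
V_w = cov(q, w, w)             # variance in fitness
V_z = cov(q, z, z)             # variance in character value
b_wz = cov(q, w, z) / V_z      # regression of fitness on character
b_zw = cov(q, w, z) / V_w      # regression of character on fitness

# Frequency-change form (eqn 3 combined with eqn 2)
ds_z = sum(qi * (wi / w_bar - 1.0) * zi for qi, wi, zi in zip(q, w, z))
# Covariance form (eqn 5) and the two regression forms
ds_z_cov = cov(q, w, z) / w_bar
ds_z_reg_z = b_wz * V_z / w_bar
ds_z_reg_w = b_zw * V_w / w_bar

# Change in mean fitness: variance in fitness over mean fitness
ds_w = V_w / w_bar
ds_w_direct = mean([qi * wi / w_bar for qi, wi in zip(q, w)], w) - w_bar
```

All four expressions for the change in the average character agree, and `ds_w` matches the directly computed change in mean fitness.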
Selection expressed as change in information
This section derives a new result that connects the
change in fitness caused by natural selection to the
amount of information accumulated by the population.
In particular, I express the change caused by selection
in terms of a classical measure of information from formal information theory. Those readers unfamiliar with
information theory will find some new expressions in
this section, presented without explanation. The following sections explain the meaning of the expressions
from information theory and the connection to natural
selection. (See Boxes 4–6 for prior work on selection
and information.)
Change in log fitness
Fitness captures the notion of a match between a type
and the environment. We may therefore expect that
fitness is, in some way, an expression of the information in the population about the environment. Those
types with high fitness increase in frequency, increasing
the fitness (information) contained in the population.
From eqn 1, we can write the fitness of a type, $w_i$, in
terms of current frequencies, $q_i$, and updated frequencies
after selection, $q'_i$, as

$$w_i = \bar{w} \frac{q'_i}{q_i}.$$

Fitness depends on the ratio of frequencies, $q'_i/q_i$.
Entities that depend on ratios have a natural logarithmic
scaling (Hand, 2004). Therefore, we should use the logarithmic scale when analysing fitness (Wagner, 2010).
It is traditional to describe the logarithm of fitness as the
Malthusian expression, $m_i = \log(w_i)$, yielding

$$m_i = \log(w_i) = \log(\bar{w}) + \log \frac{q'_i}{q_i}.$$

Using $z \equiv m$ as our character in the selection expression of eqn 4, we have the increase in mean log fitness
by natural selection as

$$\Delta_s \bar{m} = \sum \Delta q_i \log \frac{q'_i}{q_i}. \tag{9}$$

The constant term $\log(\bar{w})$ drops out because the frequency changes sum to zero, $\sum \Delta q_i = 0$.
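As a hedged numerical check of eqn 9, the following Python sketch computes the change in mean log fitness in two ways: from the log fitnesses $m_i$ directly, and from the frequency ratios alone. The example values are hypothetical, and all frequencies and fitnesses are assumed to be strictly positive so that the logarithms exist.

```python
import math

# Hypothetical frequencies and fitnesses (all strictly positive)
q = [0.5, 0.3, 0.2]
w = [1.0, 2.0, 3.0]

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_prime = [qi * wi / w_bar for qi, wi in zip(q, w)]   # eqn 1
dq = [qp - qi for qp, qi in zip(q_prime, q)]          # eqn 2

# Log fitness of each type
m = [math.log(wi) for wi in w]

# Selection expression with the character z set to m
ds_m_direct = sum(dqi * mi for dqi, mi in zip(dq, m))

# Eqn 9: the same quantity written only in terms of frequencies
ds_m_freq = sum(dqi * math.log(qp / qi) for dqi, qp, qi in zip(dq, q_prime, q))
```

The two values agree because $m_i$ and $\log(q'_i/q_i)$ differ only by the constant $\log(\bar{w})$, which contributes nothing when weighted by frequency changes that sum to zero.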
An information measure for the change in fitness
Perhaps the most important measure of information in
communication, statistics and physics is the Kullback–
Leibler divergence
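$$D(q' \| q) = \sum q'_i \log \frac{q'_i}{q_i}. \tag{10}$$

This divergence has directionality from the initial
population, $q$, to the updated population after selection, $q'$ (Box 2). Using this definition for D in the
expression for the change in fitness given in eqn 9,
we obtain

$$\Delta_s \bar{m} = \sum q'_i \log \frac{q'_i}{q_i} + \sum q_i \log \frac{q_i}{q'_i} = D(q' \| q) + D(q \| q'). \tag{11}$$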
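The decomposition of the change in mean log fitness into two directed divergences can be verified numerically. The Python sketch below is illustrative only: the helper name `kl` and the example values are assumptions, natural logarithms are used, and all frequencies are assumed positive.

```python
import math

def kl(p, r):
    """Kullback-Leibler divergence D(p || r) = sum of p_i * log(p_i / r_i), in nats."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# Hypothetical frequencies and fitnesses (all strictly positive)
q = [0.5, 0.3, 0.2]
w = [1.0, 2.0, 3.0]

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_prime = [qi * wi / w_bar for qi, wi in zip(q, w)]   # eqn 1

# Change in mean log fitness (eqn 9)
ds_m = sum((qp - qi) * math.log(qp / qi) for qp, qi in zip(q_prime, q))

forward = kl(q_prime, q)    # D(q' || q)
backward = kl(q, q_prime)   # D(q || q')
```

In this example, `forward + backward` reproduces `ds_m`, while `forward` and `backward` differ slightly from each other, which reflects the directionality of D.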
The expression for $\Delta_s \bar{m}$ is the sum of Kullback–Leibler divergences taken in each direction between the initial
population, $q$, and the updated population after selection, $q'$. In information theory, this sum is known as
the Jeffreys divergence

$$J(q', q) = D(q' \| q) + D(q \| q'). \tag{12}$$

Thus, we have the simple expression for the change
in mean log fitness caused by natural selection as

$$\Delta_s \bar{m} = J,$$

where J is shorthand for $J(q', q)$. Equating this expression with eqn 7, using $m \equiv z$, we have

$$J = \beta_{wm} V_m/\bar{w} = \beta_{mw} V_w/\bar{w}.$$

Thus, the variance in fitness is proportional to the
information divergence, J. The regression terms divided
by $\bar{w}$ give the constants of proportionality that adjust for
the different scales of measurement for fitness, w or
m = log(w). This expression shows the relation between
the information accumulated by natural selection, J, and
the traditional statistical expressions of natural selection
in terms of variances and regression coefficients.

Box 4: Selection and information
No one seems to have provided a full development of the
relations between selection and information. In many
respects, R.A. Fisher created the key concepts. However,
before I start listing aspects of the problem and related citations, I cannot resist quoting from Li & Vitányi (2008, p. 96)
about the difficulties of attribution. In discussing the name
‘Kolmogorov complexity’ for the discipline of the algorithmic
analysis of complexity, they note that Solomonoff published
the key idea before Kolmogorov, although Kolmogorov later
discovered the idea independently and developed it more
deeply and thoroughly. Ultimately, Kolmogorov got almost
all the credit, perhaps because he was much more famous
than Solomonoff. Li & Vitányi summarize as follows.
Associating Kolmogorov’s name with the area may
be viewed as an example in the sociology of science of the Matthew effect, first noted in the Gospel according to Matthew, 25: 29–30, ‘For to every
one who has more will be given, and he will have
in abundance; but from him who has not, even
what he has will be taken away’.
Fisher (1930) discussed the relation of his fundamental
theorem of natural selection to the second law of thermodynamics, a universal law about changes in entropy. However,
Fisher never came around to an information perspective in
this discussion and, perhaps for that reason, was restrained
in his enthusiasm for the analogy. Alternatively, Fisher’s
restraint may have had to do with the high dimensionality
of the evolutionary problem (Edwards, 2000). However, one
of Fisher’s great contributions in his book was his use of the
average effect to reduce the dimensionality required for
analysing selection. Although Fisher never developed an
information analysis of selection, one must remember that
the modern field of information theory only began with
Shannon’s work on communication (Shannon, 1948a,b).
The use of Fisher information outside of statistical problems
developed later.
The analogy between selection and information is obvious
and has been mentioned often. However, brief mention of
the analogy does not, by itself, provide any real insight
about the connections between information and selection or
new ways in which to understand selection.
Edwards (2000) noted that, in the continuous-time limit,
the fundamental equations of selection can be expressed in
terms of Fisher information. However, he concluded that
the analogy between selection and Fisher information provides little insight. By contrast, Frieden et al. (2001) argued
that selection expressed in terms of Fisher information is
indeed significant. Although I believe Frieden et al. were on
the right track, their particular analysis and presentation did
not add much. Fisher information is always information
about an underlying scale. Frieden et al. concluded that natural selection provides a measure of Fisher information
about time, which I think is the wrong scale on which to
interpret meaning. The present article extends the start
made in Frank (2009).

The encoding of information
Before continuing to discuss the relation between selection and information, we need some additional background about the nature of information. I first describe
an example in which an observation provides information. I then discuss how to quantify the amount of information. Finally, I analyse the amount of information in a
comparison, which provides the basis for comparing the
information in a population before and after selection.
Statistics and information
In statistical problems, the divergence, D, measures the
amount of information in an observation with respect
to discriminating between two distributions (Kullback,
1959; Cover & Thomas, 1991). Suppose the true underlying probability distribution is $q'$. However, we do not
know whether we are sampling from $q'$ or an alternative distribution $q$. The different distributions may
be associated with different values of a parameter, $\theta'$
and $\theta$. The parameter may, for example, be the mean
or the variance.
When we take a sample from the true underlying
distribution, $q'$, how much information do we obtain
about whether the sampled distribution is $q'$ or $q$? In
the parametric case, how much information do we
obtain about whether the parameter of the distribution
from which we sampled is $\theta'$ or $\theta$?
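One way to make this question concrete is a small simulation. The Python sketch below draws observations from a true distribution and averages the log of the likelihood ratio $q'_i/q_i$ over the sample. With a large sample, that average approaches the divergence $D(q' \| q)$ of eqn 10. The two distributions, the sample size and the random seed are hypothetical choices.

```python
import math
import random

def kl(p, r):
    """Kullback-Leibler divergence D(p || r), in nats."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

q_true = [0.2, 0.3, 0.5]   # hypothetical true distribution, q'
q_alt = [0.5, 0.3, 0.2]    # hypothetical alternative distribution, q

rng = random.Random(1)     # fixed seed so the sketch is reproducible
n = 100_000
samples = rng.choices(range(len(q_true)), weights=q_true, k=n)

# Average log-likelihood ratio in favour of q' over q across the sample
avg_llr = sum(math.log(q_true[i] / q_alt[i]) for i in samples) / n

# Exact expectation of the same quantity under q'
expected = kl(q_true, q_alt)
```

Each observation contributes evidence $\log(q'_i/q_i)$, which may be positive, zero or negative. Only the average over many observations settles near `expected`.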
Box 5: Entropy, information and stochastic evolutionary models
The most interesting development of the theory arises from
stochastic models of evolutionary change framed in terms of
entropy and statistical mechanics. Iwasa (1988) derived a
general expression for ‘free fitness’ by analogy with free
energy and entropy. Iwasa showed the analogy between the
continual increase in free fitness in evolutionary models and
the second law of thermodynamics, by which entropy continually increases. He also calculated the distributions in
population characteristics as they change under various stochastic models of evolutionary change.
These kinds of stochastic evolutionary models require certain assumptions in order to achieve continual increase in
entropy or free fitness. There is certainly no universal law
about the increase in fitness in evolution, whereas restricted
notions of selection may have universal properties. I have
drawn a sharp distinction between selection and evolution
in my own analyses. The evolutionary literature does not
always make that distinction so clearly.
de Vladar & Barton (2011a) reviewed the significant
advances in the use of entropy and statistical mechanics to
study evolutionary dynamics, including their own contributions to the subject (Barton & de Vladar, 2009; de Vladar &
Barton, 2011b). This work on stochastic evolutionary models
may eventually converge with general studies of entropy,
information and dynamics. For example, there has been
recent discussion about a maximum entropy production
(MEP) principle for dynamics (Dewar, 2005; Kleidon, 2010;
Volk & Pauluis, 2010). In the MEP theory, the most likely
dynamical path is associated with the greatest production of
entropy. Further, the probability distribution over dynamical
paths may be a function of the relative entropy production
associated with the different paths.
One may be able to use the distribution of entropy
changes over paths to calculate the stochastic evolution of
populations. Under some conditions, one may be able to
specify the expected probability distribution over types when
the population achieves certain kinds of equilibrium. However, a full understanding of MEP and its limitations has yet
to be achieved. There may be some relation between
dynamics analysed in terms of Fisher information (Frieden,
2004) and MEP. However, I do not understand the similarities and differences of those approaches.

For each observation, with value associated with the
index i, the relative likelihood of obtaining that observation from the true distribution, $q'$, versus the alternative distribution, $q$, is the ratio $q'_i/q_i$. The log of the
likelihood ratio is $\log(q'_i/q_i)$. Because the true distribution is $q'$, the actual probability of observing i is $q'_i$.
Thus, averaging the log-likelihood ratio over the probability of each observed i value gives the average
log-likelihood ratio, which is

$$D(q' \| q) = \sum q'_i \log \frac{q'_i}{q_i}.$$

The divergence D is simply the average log-likelihood
ratio, which means an average of the relative weight of
evidence in favour of $q'$ as the true distribution
compared with $q$. The greater the ratio of likelihoods,
the greater the divergence between distributions and
the greater the information in each observed value to
discriminate between the distributions.

The scale of information
Clearly, D gives a measure of information provided by
an observed value. But what sort of scale, or units, does
that measure have? If, for example, D = 2, then what
does the value ‘two’ mean?
The Shannon measure of information is commonly
used. That measure is related to entropy, which means
randomness. The more random something is, the less
information we have about it. For example, if a flipped
coin comes up on either side with equal probability, we
say that it is completely random. We also say that we
have no information about which side is likely to come
up. The Shannon measure captures this duality between
increasing randomness and decreasing information or,
equivalently, between decreasing randomness and
increasing information.
The Shannon measure is
$$H(q) = -\sum_i q_i \log(q_i). \tag{15}$$
We can use any base for the logarithm. It is sometimes convenient to use base 2, in which case H is the
average number of bits required to encode a message.
This bit-encoding interpretation arises from the fact that
$-\log_2(q_i) = \log_2(1/q_i)$
expresses the number of bits required to encode a probability. For example, if $q_i$ is 1/32, then $-\log_2(1/32) =
\log_2(32) = 5$ bits. A bit is a single digit in base
two that takes on a value of either 0 or 1. Five such digits
suffice to label 32 equally likely alternatives, from 00000
to 11111.
To encode a probability 1/32 requires five bits. By
contrast, to encode a probability of 1/2 requires only
$\log_2(2) = 1$ bit. It takes four bits more to encode 1/32
compared with 1/2. The key idea is that a rarer event,
with lower probability, q, provides greater surprise
when the event actually occurs. A greater surprise
means a greater distinction from what was expected, a
lower ability to predict, more randomness and less
information. Thus, more bits means more randomness
and less information, providing a scale for measuring
information in terms of bits.
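
The bit scale can be checked with a short calculation. The following Python sketch is an illustration only and is not part of the original analysis; the function names and the example distributions are assumptions chosen for the example.

```python
import math

def surprisal_bits(p):
    """Bits needed to encode an event of probability p, -log2(p)."""
    return -math.log2(p)

def shannon_entropy_bits(q):
    """H(q) = -sum_i q_i log2(q_i), with 0 log 0 taken as 0."""
    return -sum(qi * math.log2(qi) for qi in q if qi > 0)

# Hypothetical distributions used only for illustration.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
uniform_32 = [1 / 32] * 32

print(surprisal_bits(1 / 32))             # rare event: five bits
print(surprisal_bits(1 / 2))              # common event: one bit
print(shannon_entropy_bits(fair_coin))    # two equally likely outcomes
print(shannon_entropy_bits(biased_coin))  # less random than the fair coin
print(shannon_entropy_bits(uniform_32))   # 32 equally likely outcomes
```

The biased coin has lower entropy than the fair coin: it is less random, so we have more information about which side is likely to come up.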
© 2012 THE AUTHOR. J. EVOL. BIOL. 25 (2012) 2377–2396
The number of bits associated with each probability
concerns only that particular probability. How should
we measure the randomness and information over a set
of different possible outcomes? For a distribution, q,
with different probabilities $q_i$ for each outcome, i, we
must combine the randomness (bits) associated with
each probability, $-\log_2(q_i)$, and the chance that the
event i occurs, $q_i$.
In particular, the randomness associated with each
event is the product of how often the event happens
multiplied by the randomness of that event,
$-q_i \log_2(q_i)$. The total over all events is the sum given
in the definition for H(q) in eqn 15, which measures
the total randomness over a set of events.
To understand the notion of total randomness over a
set, we can think of each i as a symbol to be communicated or an event that may occur. A message, or a set
of events, has frequencies $q_i$. In such a set, each
$-\log_2(q_i)$ is the number of bits required to encode each
i, and the event i occurs with frequency $q_i$, so
$-q_i \log_2(q_i)$ is the relative cost in terms of bits required
to encode event i. If the message, or set, is highly random, it takes more bits to encode the message. High
randomness corresponds to a high average level of surprise per event, which means that we have relatively
little information.
Note that information is the opposite of randomness
and entropy. The measurement of information can be
expressed as the negative entropy, $-H$.

Box 6: Bayesian interpretations of selection
Bayesian updating combines prior information with new
information to improve prediction. The Bayesian process
makes an obvious analogy with selection. The initial population encodes predictions about the fit of characters to the
environment. Selection through differential fitness provides
new information. The updated population combines the
prior information in the initial population with the new
information from selection to improve the fit of the new
population to the environment. I am sure this Bayesian
analogy has been noted many times. But it has never developed into a coherent framework that has contributed significantly to understanding selection.
Part of the problem is that the analogy, as currently
developed, provides little more than a match of labels
between the theory of selection and Bayesian theory. As
Harper (2010) shows, if one begins with the replicator
equation (eqn 1), then one can label the set $\{q_i\}$ as the
initial (prior) population, $\{w_i/\bar{w}\}$ as the new information
through differential fitness and $\{q'_i\}$ as the updated (posterior) population. Shalizi (2009) presents a similar view.
The analogy provides a useful correspondence between
the structure of the theories but, by itself, does not provide any truly significant insight into selection. It may be
possible to develop the analogy in useful ways, a challenge that remains open.
Another Bayesian line of study analyses how individuals
adjust their characters in response to information obtained
directly from the environment. Those studies include learning, phenotypic plasticity, and various aspects of conditional
development. By one view, learning and other processes
that accumulate information follow Popper’s (1972) dictum
that all new knowledge must ultimately derive from trial
and error, in effect, from selection.
Vast literatures discuss information theoretic and Bayesian interpretations of learning, which are beyond our
scope. In an explicitly selectionist view, Fernando et al.
(2012) analyse theories of neural development in relation
to Bayesian updating – part of the wider field of developmental selection (Frank, 1996, 1997a,b). Closer to the standard evolutionary interpretation of selection, Donaldson-Matasci et al. (2010) provide an interesting discussion of
information directly acquired from the environment in
relation to fitness. Frank (1998, section 6.3) used a Bayesian analysis to combine selectively acquired information by
the population as a prior state with new information
acquired directly from the environment (learning).

The information in a comparison
The problem with H as a measure of information is
that, by itself, it does not give a sense of comparison or
information gain. In the statistical example, we compared two distributions and the information gained to
discriminate between those distributions provided by an
observation. In terms of selection, we will be concerned
with the information gain by a population before and
after evolutionary change, requiring a comparison
between the initial and updated probability distributions
that describe the population before and after selection.
In a comparison, one way to measure a gain in information is by the reduction in the number of bits required
to encode, or to predict, the distribution of outcomes in
one population relative to another. A reduced number of
bits corresponds to reduced randomness, and reduced
randomness corresponds to improved prediction and
more information. Thus, we can measure information
gain by the reduction in the number of bits.
To make comparisons, we need an expanded definition of entropy
$$H(r,p) = -\sum_i r_i \log_2(p_i),$$
where H(r,p) is the entropy in the probability distribution r when encoded by the associated probabilities p.
This expression may be interpreted by thinking of the
different i values as symbols in an alphabet, the $r_i$ as
the frequency of the symbols in a message and the $p_i$ as
the frequencies used to determine the encoding of the
symbols i. Then, H(r,p) is the average number of bits
required to encode a message r in a code based on p.
To compare populations, suppose an updated population has probabilities of types (events) $q'_i$, and entropy
$H(q',q') = H(q')$. By contrast, the entropy of the new
population, when using the encoding of the old
population, q, before new information was acquired, is
$H(q',q)$, which is the randomness in the new population when encoded by the old frequencies.
In the updated population, the change in information
obtained from the updated encoding is the average
number of bits to encode $q'$ based on the new frequencies, $H(q',q')$, minus the average number of bits to
encode $q'$ based on the old frequencies, $H(q',q)$, which is
$$-\left(H(q',q') - H(q',q)\right) = \sum_i q'_i \log_2(q'_i) - \sum_i q'_i \log_2(q_i) = \sum_i q'_i \log_2\frac{q'_i}{q_i} = D(q' \| q), \tag{17}$$
where the initial minus sign is used to express negative
entropy, which is information. The term $\log_2(q'_i/q_i)$ is
the number of extra bits to encode $q'_i$ given a prior
assumption that event i happens with probability $q_i$.
The expression D measures the average number of
extra bits needed when encoding the new population
by the old frequencies rather than with the new,
updated frequencies. Thus, D is the average gain in
information in a population update when measured in
terms of number of bits. A value of $D = 2$ means that
an efficiency gain of two bits has been achieved by the
extra information provided. Alternatively, we may say
that the new information enhances predictability, such
that the remaining randomness, or unpredictability, has
been reduced by two bits.
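
The encoding argument can be illustrated numerically. The sketch below is a hedged example, not part of the original derivation; the function names and the two frequency vectors are hypothetical. It checks that the reduction in bits, H(q′,q) minus H(q′,q′), equals the divergence D(q′‖q).

```python
import math

def cross_entropy_bits(r, p):
    """H(r, p) = -sum_i r_i log2(p_i): average bits to encode a message with
    symbol frequencies r using a code built for frequencies p."""
    return -sum(ri * math.log2(pi) for ri, pi in zip(r, p) if ri > 0)

def kl_divergence_bits(q_new, q_old):
    """D(q' || q) = sum_i q'_i log2(q'_i / q_i)."""
    return sum(a * math.log2(a / b) for a, b in zip(q_new, q_old) if a > 0)

# Hypothetical frequencies before and after an update.
q_old = [0.25, 0.25, 0.25, 0.25]
q_new = [0.5, 0.25, 0.125, 0.125]

bits_old_code = cross_entropy_bits(q_new, q_old)  # H(q', q)
bits_new_code = cross_entropy_bits(q_new, q_new)  # H(q', q') = H(q')
gain = -(bits_new_code - bits_old_code)           # reduction in bits

print(bits_old_code, bits_new_code, gain, kl_divergence_bits(q_new, q_old))
```

For these assumed frequencies, the old code needs 2 bits per symbol and the updated code needs 1.75 bits, so the gain in information is 0.25 bits, which matches D.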
Selection and the meaning of information
The encoding interpretation of information is well
known and widely accepted (Kullback, 1959; Cover &
Thomas, 1991). By contrast, a formal interpretation of
natural selection in terms of information has never
been developed in a simple, clear and widely agreed
manner. Here, I give my interpretation of natural selection and information.
Why J rather than D?
To analyse the meaning of information with regard to
natural selection, we must begin with the fundamental
expression of selection in terms of information divergence given in eqn 13 as $\Delta_s \bar{m} = J$. That expression
states that the change in mean log fitness is the Jeffreys
divergence, J. Recall the definition of J from eqn 12 as
$$J(q',q) = D(q' \| q) + D(q \| q').$$
In most statistical and physical applications, measures
of divergence and information typically use D (Cover &
Thomas, 1991). For example, Bayesian updating can
often be expressed in terms of a prior distribution, $q$, an
updated distribution based on new data, $q'$, and the
divergence of the updated distribution from the prior,
$D(q' \| q)$. In the Bayesian expression, D describes the
gain in information measured in terms of bits and
interpreted with regard to the efficiency of encoding
information or, equivalently, the reduced randomness
and increased predictability of outcomes.
The measure D is asymmetric, because $D(q' \| q) \ne
D(q \| q')$. By contrast, J is symmetric, because it is the
sum of the divergence in each direction. The symmetry
in the selection equation arises because, from eqn 9,
we have
$$\Delta_s \bar{m} = \sum_i \Delta q_i \log\frac{q'_i}{q_i} = \sum_i \Delta q_i \left[\log(q'_i) - \log(q_i)\right] = \sum_i \Delta q_i \left[\Delta\log(q_i)\right]. \tag{18}$$
If we switch $q'_i$ and $q_i$, then $\Delta q_i$ changes sign and
$\Delta\log(q_i)$ also changes sign. The two sign changes cancel.
Thus, we obtain the same information gain when selection moves a population as $q \to q'$ or in the reverse
direction as $q' \to q$.
Fitness in terms of encoded information
The information expression for fitness in eqn 18 is in
terms of $\log(q'_i/q_i)$. Thus, the information gain continues to be about efficiency of encoding or, equivalently,
the reduced randomness and increased predictability of
outcomes. We could, for example, think of an increase
in mean log fitness as an increase in the population’s
prediction of, or match to, the state of nature – the fit
of the population to the environmental challenge.
This interpretation of fitness in terms of encoding is
universal, in the sense that the particular environmental challenges and the particular meaning of the gain in
fitness with respect to particular characters do not enter
into the expressions. The universal expression of fitness
and selection in terms of probabilities and encoding
yields the match between changes in mean log fitness
and changes in the classical expressions of information.
Encoding versus meaning
The great power and universality of the classic theory
of information arises because it does not depend on
meaning. Information is formulated strictly in terms of
encoding, bits, randomness and predictability, independently of what is being encoded or predicted. Fitness
obtains the same universality, because fitness uses the
same expressions of relative frequency as the classic
information measures. That universality for fitness
makes sense, because fitness is a general expression for
the way in which populations accumulate information,
independent of the characters and environmental challenges that distinguish particular cases.
Although it is certainly beneficial to have a universal
expression of fitness in terms of information, we pay for
that universality by the limited scope of fitness expressed
only in terms of encoding. Information is about predictability, and predictability is always predictability about
something. Natural selection must, in some way, be about
the increased information with respect to the environmental challenges that shape success. How can we bring
this particular meaning of the information about environmental challenges into the formulation of fitness?
There is perhaps no universal way to express meaning
with respect to information. That may be why the
encoding interpretation has been so valuable. The following sections explore two related ways in which to
bring meaning into the information interpretation of fitness. The next section develops the notion of Fisher
information. Later sections present the idea of a coordinate system for information and evolutionary change – a
connection between the Price equation and information.
Natural selection and Fisher information
Shannon information is not really information as
such, but rather the capacity to transmit information,
whereas Fisher information is truly a measure of
informativeness about something specific, the value of
a parameter. Shannon’s refers to the medium, Fisher’s
to the message (Edwards, 2000, p. 6).
We have been working on the scale of encoded information. That scale depends only on probability distributions, without any explicit connection to what sort of
events or meaning attach to the probabilities. Units of
encoded information can be measured in terms of bits.
The following extends Frank (2009).
One way to interpret meaning is to change the scale.
Suppose we could relate bits of encoded information to
a new scale on which we interpret meaning. To relate
the change in information to the change in meaning,
we could evaluate
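$$\Delta\text{information} = \Delta\text{meaning} \times \frac{\Delta\text{information}}{\Delta\text{meaning}}. \tag{19}$$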
The relation is trivial when expressed in this way.
However, we can see that the ratio of change in information to change in meaning provides the translation
between the two scales.
To make this expression for the relations between
the scales useful, we must connect each of the terms to
our prior discussion of information and to a new way
of describing meaning. That connection leads us to
expressions of natural selection in terms of the fit
of characters to the environment, rather than the
efficiency of encoding information in terms of bits.
Up to this point, I have been writing $q_i$ or $q'_i$ for the
probability of event i, whatever sort of event or characteristic i may be. The probability distribution is the set of
$q_i$ values over the range of possible characters, each possible character associated with a label i. In this formulation, one can think of the probability distributions as
interpreted nonparametrically, in the sense that we work
directly with the actual distribution of probabilities without reference to any underlying parameters or causes.
Now suppose we associate a set of values, $\theta$, with
each probability distribution (Amari & Nagaoka, 2000).
We could think of $\theta$ as a parameter, for example the
mean of the distribution. Or we could think of $\theta$ as the
predictions about the environment associated with a
probability distribution. The predictions might be
expressed as characters. The quality of the predictions
could be associated with fitness.
For now, we take $\theta$ in the general sense of some
values associated with a distribution. To express the
association, we expand our notation for probabilities to
write $q_i|\theta$, the probability of event i given the associated
value $\theta$. An updated population may have a new
associated value, $\theta'$, such as a new mean or a new
prediction about the environment, so we write $q'_i|\theta'$.
The change in probability is now expressed as
$$\Delta q_i|\theta = q'_i|\theta' - q_i|\theta.$$
To express the scaling of probability changes relative
to changes on the new $\theta$ scale, we can divide both sides
by the change on the $\theta$ scale, yielding
$$\frac{\Delta q_i|\theta}{\Delta\theta} = \frac{q'_i|\theta' - q_i|\theta}{\theta' - \theta}.$$
This expression gives us a way to match changes on
the scale of meaning, $\theta$, to changes on the scale of
probability and encoded information, q.
We can now follow eqn 19 to express the change in
information as the change on the scale of meaning multiplied by the change of information scaled relative to the
change in meaning. To develop this expression, we must
continue to match our previous work on information
and selection to the new notation in relation to meaning.
The log-likelihood ratio, $\log(q'_i/q_i)$, can be written as
$\log(q'_i) - \log(q_i)$, which may be abbreviated as $\Delta\log(q_i)$,
as in eqn 18. This difference of logarithms expresses the
change in the number of bits required to encode the
probabilities associated with i (as described below
eqn 17). If we now express probabilities in relation to
$\theta$, as $q|\theta$, and divide by $\Delta\theta$, we obtain the change in the
number of bits in relation to the change on our scale of
meaning,
$$\frac{\Delta\log(q_i|\theta)}{\Delta\theta} = \frac{\log(q'_i|\theta') - \log(q_i|\theta)}{\theta' - \theta}.$$
We can now put the pieces together by relating these
new expressions with the expression in eqn 18 for the
change in mean log fitness, yielding a form equivalent
to the intuitive description in eqn 19 as
$$\Delta_s \bar{m} = \frac{J(\theta)}{\Delta\theta^2}\,\Delta\theta^2, \tag{20}$$
in which I write $\Delta\theta^2 = (\Delta\theta)^2$ for the square of the change
in the parameter, and the term $J(\theta)$ is the Jeffreys
divergence, which is now a function of the scale of
meaning, $\theta$, and is written as
$$J(\theta) = \sum_i (\Delta q_i|\theta)\left[\Delta\log(q_i|\theta)\right].$$
These expressions simply repeat our prior derivation
of $\Delta_s \bar{m} = J$, but with explicit consideration of $\theta$.
As the changes become small, $\Delta\theta \to 0$, the Jeffreys
divergence, $J(\theta)$, divided by the squared change in
scale, $\Delta\theta^2$, converges to the important quantity in
statistical theory known as Fisher information, $F(\theta)$,
which we write as
$$\frac{J(\theta)}{\Delta\theta^2} \to F(\theta),$$
as shown in Appendix A. Thus, for small changes on
the scale of meaning, $\Delta\theta \to 0$, we may write the change
in average log fitness as
$$\Delta_s \bar{m} = F(\theta)\,\Delta\theta^2.$$
This derivation provides a more general way to arrive
at my earlier statement that changes in mean fitness
are proportional to Fisher information (Frank, 2009).
Fisher information is the information in an observation
about a parameter, or a set of parameters. In our case,
$\theta$ represents the parameters, which is our scale of
meaning.
One can also think of Fisher information as the Jeffreys divergence between populations, $J(\theta)$, relative to
the squared divergence on the scale of meaning, $\Delta\theta^2$.
Thus, Fisher information is the sensitivity of change in
the encoded information in populations, $J(\theta)$, relative
to change on the parametric scale of meaning. The
greater the sensitivity, the more information in an
observation with respect to the divergence between
populations on the underlying parametric scale. See
Appendix B for ways in which Fisher information has
been used in previous models of selection.
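
The convergence of the scaled Jeffreys divergence to Fisher information can be illustrated numerically. The following sketch is only an illustration under stated assumptions and does not replace the derivation in Appendix A: it assumes a two-type (Bernoulli) population whose parameter is the frequency of one type, for which the standard Fisher information is 1/[θ(1 − θ)]. The function names and the value θ = 0.3 are hypothetical.

```python
import math

def jeffreys(q_new, q_old):
    """J = sum_i (q'_i - q_i) * (log q'_i - log q_i), in natural-log units."""
    return sum((a - b) * (math.log(a) - math.log(b))
               for a, b in zip(q_new, q_old))

def bernoulli(theta):
    """Frequencies of the two types when the first has frequency theta."""
    return [theta, 1.0 - theta]

theta = 0.3                              # hypothetical parameter value
fisher = 1.0 / (theta * (1.0 - theta))   # Fisher information for a Bernoulli mean

ratios = []
for d_theta in (0.1, 0.01, 0.001):
    ratio = jeffreys(bernoulli(theta + d_theta), bernoulli(theta)) / d_theta ** 2
    ratios.append(ratio)
    print(d_theta, ratio, fisher)
```

As the step on the parametric scale shrinks, the printed ratio J(θ)/Δθ² approaches the Fisher information.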
Parametric coordinates for selection and information
The change in mean log fitness measures the amount
of information that the population accumulates by
selection. Because fitness describes changes in relative
frequencies, fitness concerns encoding of information,
which can be measured in numbers of bits.
The previous section showed how to convert from
bits to an alternative scaling of information in terms of
$\theta$. We may interpret the parameters $\theta$ as a scale that
has meaning with respect to the fit of the population’s
characteristics to the environment. This section further
analyses the notion of parametric coordinates for selection and information, followed by an example.
Parametric coordinates and Fisher information
From eqn 20, the key result for the change in mean log
fitness in terms of a parametric scale can be rewritten as
$$\frac{\Delta_s \bar{m}}{\Delta\theta^2} \to F(\theta).$$
Change in mean log fitness is the amount of information gained by selection. The ratio $\Delta_s \bar{m}/\Delta\theta^2$ is the
change in information per unit change in squared distance on the parametric scale. Because we consider the
parametric scale as the scale of meaning, this ratio is
the change in information relative to the change in
squared distance on the scale of meaning (Amari &
Nagaoka, 2000). The arrow on the right-hand side states
that the relative change in information per unit of
squared parametric distance is the Fisher information in
an observation about the parameter, $\theta$.
The interpretation of ‘observation’ with respect to
natural selection is interesting. Each interaction of an
individual with the environment leads to a realized fitness. That realized individual fitness is an observation,
by the population, of the fit between certain characteristics and the environment. For a particular type, i, the
average information in each observed individual fitness
is $\log(q'_i/q_i) = \Delta\log(q_i|\theta)$. Thus, the ratio $\Delta\log(q_i|\theta)/\Delta\theta$ is
the change, or sensitivity, of information in an observation relative to a change in $\theta$. To get the average over
all types, i, we weight this information per type by $q_i|\theta$.
To analyse selection, we need the change in frequencies, or sensitivity of those changes, relative to changes
in $\theta$, which is $\Delta q_i|\theta/\Delta\theta$. Combining these terms yields
$J(\theta)/\Delta\theta^2 \to F(\theta)$.
Change in the mean or variance of a character
A few examples clarify the abstract expressions for information. To keep things simple, I assume small changes
so that we can use the Fisher information simplification
in eqn 23. With larger changes, we could make exact
calculations using $J(\theta)$ instead of Fisher information.
Change in the mean of a normal distribution under
directional selection
Suppose the character values in a population, $z_i$, follow
a normal distribution with mean, $\mu$, and variance, v.
An observation from that population provides information about the mean of the population. It is well known
that an observation from a normal population provides
Fisher information about the mean of $F(\mu) = 1/v$. The
more variable the population, the larger v and the less
information in an observation about the average value.
Put another way, the precision in measurement is
proportional to 1/v. More variable populations yield less
precise measurements and thus less information per
observation about the average value.
We interpret natural selection as obtaining information through the observed fitnesses associated with
character values. Suppose that the population retains a
normal shape and a fixed variance before and after
selection and changes only in its mean value. Then, the
change in the mean, $\Delta\mu$, is sufficient to describe the
effects of selection. From eqn 22, the increase in information by natural selection is
$$\Delta_s \bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v}.$$
This expression provides the relation between the
change in information, $\Delta_s \bar{m}$, which is a universal
abstract quantity about encoding, and the scaling of the
character that gives meaning for this particular case,
$\Delta\mu^2/v$.
Change in the variance of a normal distribution under
stabilizing selection
The previous example described directional selection on
the average trait value, holding the variance constant.
This section considers stabilizing selection. In this case,
the population begins with its centre at the optimum.
Selection reduces the variance, but leaves the mean
unchanged. For a normal distribution, the Fisher information in an observation about the variance, v, is
$1/(2v^2)$. Thus,
$$\Delta_s \bar{m} = F(v)\,\Delta v^2 = \frac{\Delta v^2}{2v^2},$$
which is the gain in information when stabilizing selection reduces the variance of a normally distributed
character.
Change in the mean of an exponential distribution
Suppose the character follows an exponential distribution before and after selection. An observation from an
exponential population provides Fisher information of
1/v about the mean, $\mu$. The variance of an exponential
distribution is $v = \mu^2$. The change in information by
selection is
$$\Delta_s \bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v},$$
which matches the case of the normal distribution.
However, the variance of the exponential distribution
changes with the mean. By contrast, the normal
distribution has a separate parameter for the variance,
which we held constant by assumption.
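
The three continuous examples above can be checked by comparing the Fisher information approximation with a direct numerical estimate of the Jeffreys divergence. The sketch below is an illustration only: the midpoint-rule integrator, the function names, the integration limits and the parameter values are all assumptions made for the example.

```python
import math

def jeffreys_continuous(pdf_new, pdf_old, lo, hi, n=20000):
    """Midpoint-rule estimate of J = integral of (q' - q)(log q' - log q) dz."""
    dz = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * dz
        a, b = pdf_new(z), pdf_old(z)
        if a > 0 and b > 0:
            total += (a - b) * (math.log(a) - math.log(b)) * dz
    return total

def normal_pdf(mu, v):
    """Normal density with mean mu and variance v."""
    return lambda z: math.exp(-(z - mu) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def exponential_pdf(mu):
    """Exponential density with mean mu, defined for z >= 0."""
    return lambda z: math.exp(-z / mu) / mu

# Hypothetical parameter values and small changes caused by selection.
mu, v, d_mu, d_v = 0.0, 1.0, 0.1, -0.05
m_exp, d_exp = 2.0, 0.1

j_mean = jeffreys_continuous(normal_pdf(mu + d_mu, v), normal_pdf(mu, v), -12.0, 12.0)
j_var = jeffreys_continuous(normal_pdf(mu, v + d_v), normal_pdf(mu, v), -12.0, 12.0)
j_exp = jeffreys_continuous(exponential_pdf(m_exp + d_exp), exponential_pdf(m_exp), 0.0, 80.0)

print(j_mean, d_mu ** 2 / v)             # directional selection on a normal mean
print(j_var, d_v ** 2 / (2 * v ** 2))    # stabilizing selection on a normal variance
print(j_exp, d_exp ** 2 / m_exp ** 2)    # change in an exponential mean, v = mu^2
```

In the normal case with fixed variance, the Jeffreys divergence equals Δμ²/v exactly, so the two printed values agree up to integration error. In the other two cases, the Fisher information expression is a small-change approximation and differs from the numerical value by a few per cent for these assumed step sizes.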
Change in allele frequency
Suppose $q_1 = p$ is the frequency of a particular allele
and $q_0 = 1 - p$ is the frequency of the alternative allele.
The distribution of allele frequencies is binomial with a
single observation. The mean allelic value is $\mu = p$, and
the variance is $v = p(1-p)$. The Fisher information in an
observation about the mean of a binomial population is
1/v. The change in information by selection is
$$\Delta_s \bar{m} = F(\mu)\,\Delta\mu^2 = \frac{\Delta\mu^2}{v}.$$
Using p for gene frequency to match the familiar
notation of population genetics,
$$\Delta_s \bar{m} = F(p)\,\Delta p^2 = \frac{\Delta p^2}{p(1-p)},$$
which holds when $\Delta\mu = \Delta p$ is small. For larger changes,
we can obtain an exact expression by using the Jeffreys
divergence rather than the Fisher information, as in
eqn 23.
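
A small numerical example may help to compare the exact and approximate expressions for allele frequency change. The sketch below is a hedged illustration: it assumes one generation of selection under the replicator update, in which each frequency is multiplied by its fitness relative to the mean, and the function name, starting frequency and fitness values are hypothetical.

```python
import math

def allele_information_gain(p, w1, w0):
    """Return (p', exact Jeffreys gain, Fisher approximation) for one step
    of selection on an allele at frequency p with fitnesses w1 and w0."""
    w_bar = p * w1 + (1 - p) * w0
    p_new = p * w1 / w_bar                      # replicator update
    q_old = [p, 1 - p]
    q_new = [p_new, 1 - p_new]
    exact = sum((a - b) * math.log(a / b) for a, b in zip(q_new, q_old))
    approx = (p_new - p) ** 2 / (p * (1 - p))   # F(p) * (delta p)^2
    return p_new, exact, approx

# Hypothetical weak selection favouring the focal allele.
p_new, exact, approx = allele_information_gain(p=0.2, w1=1.05, w0=1.0)
print(p_new, exact, approx)

# Hypothetical strong selection, where the approximation is less accurate.
p_new_strong, exact_strong, approx_strong = allele_information_gain(p=0.2, w1=2.0, w0=1.0)
print(p_new_strong, exact_strong, approx_strong)
```

Under weak selection, the two values differ by only a few per cent. Under the stronger assumed selection, the gap widens, which is when the exact Jeffreys divergence is the appropriate measure.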
Character coordinates and selection
The previous section assumed that the parameters, $\theta$,
summarize all differences in the frequency distributions
before and after selection. We can think of $\theta$ as defining
the coordinate system for evolutionary change. The
reduction of frequencies to a parametric description,
such as the mean of the distribution, typically requires
character values to be associated with the i values. By
convention, we use $z_i$ for character values. Thus, if
changes in the mean are sufficient to describe the
changes in the probability distribution of characters in
the population before and after selection, then
$\mu = \bar{z} = \sum_i q_i z_i$ is a reduction of the full distribution of
character values to a single parametric dimension.
Parametric character coordinates
Let us review the use of parametric coordinates before
discussing nonparametric coordinates. In a parametric
example, suppose that frequencies before and after
selection are normally distributed, with parameters $(\mu, v)$
for the mean and the variance. Selection moves the
population from the initial location, defined by the
parameters $(\mu, v)$, to the location after selection, $(\mu', v')$.
The two parametric dimensions provide a complete
description of change by selection. If we hold one
parameter constant, such as the variance, and only
allow the mean to change, then change in the single
parametric dimension from $\mu$ to $\mu'$ fully describes the
population before and after selection.
Parametric expressions describe the total change in
information by
$$\Delta_s \bar{m} = \frac{J(\theta)}{\Delta\theta^2}\,\Delta\theta^2 \to F(\theta)\,\Delta\theta^2.$$
For example, let the parameter be the mean, $\theta = \mu$.
The term $J(\mu)/\Delta\mu^2 \to F(\mu)$ reduces the change in the
average information per observation to the single
dimension of $\mu$. If we multiply the information per
observation by the distance moved in the parametric
dimension, $\Delta\mu^2$, we obtain the total change in information. Thus, the calculation for the change in information is made along the single parametric dimension
of $\mu$.
The parametric dimension of $\mu$ can be thought of as
the coordinate system in which we evaluate the change
by selection. Each change in position along the coordinate of $\mu$ corresponds to changes by selection, because
$\mu$ is a sufficient description for the full frequency distribution of character values. In general, when we can
reduce the description of frequency distributions to a
sufficient set of parameters, $\theta$, those parameters form
the coordinates in which we evaluate the changes by
selection.
Nonparametric character coordinates
We can think of our fundamental expression for selection
$$\Delta_s \bar{z} = \sum_i \Delta q_i z_i$$
as a nonparametric expression. Each term includes the
actual frequencies in the population. The calculation is
made over the full dimensionality of the frequency
distribution.
The character values, $\{z_i\} = z_1, z_2, \ldots$, form a nonparametric coordinate system. For the population frequencies, $\{q_i\}$, the point $\{q_i z_i\}$ locates the population
before selection and the point $\{q'_i z_i\}$ locates the population after selection. The movement of the population
caused by selection is given by $\{\Delta q_i z_i\}$.
The expression for the total change in information
caused by selection is
$$\Delta_s \bar{m} = \sum_i \Delta q_i\, \Delta\log(q_i) = \sum_i \Delta q_i \log\frac{q'_i}{q_i}.$$
Each frequency change, $\Delta q_i$, associates with the character $z_i = \Delta\log(q_i)$, the change in information for the
ith type. This is a nonparametric expression, because
the calculation is made over the full frequency
distribution.
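
The nonparametric expression can be evaluated directly from the frequencies. The following sketch is an illustration only; it assumes the replicator update for selection, and the function names, frequencies and fitnesses are hypothetical. It computes the selection term with the character set to the change in log frequency and compares the result with the Jeffreys divergence, including the symmetry under reversal of the direction of change.

```python
import math

def select(q, w):
    """One replicator step: q'_i = q_i * w_i / w_bar."""
    w_bar = sum(qi * wi for qi, wi in zip(q, w))
    return [qi * wi / w_bar for qi, wi in zip(q, w)]

def partial_change(q_old, q_new, z):
    """Nonparametric selection term, sum_i (q'_i - q_i) * z_i."""
    return sum((a - b) * zi for a, b, zi in zip(q_new, q_old, z))

def kl(a, b):
    """D(a || b) in natural-log units."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

q = [0.5, 0.3, 0.2]   # hypothetical initial frequencies
w = [1.0, 1.2, 0.7]   # hypothetical fitnesses
q_new = select(q, w)

m = [math.log(a / b) for a, b in zip(q_new, q)]   # z_i = change in log frequency
info_gain = partial_change(q, q_new, m)           # change in mean log fitness
jeffreys = kl(q_new, q) + kl(q, q_new)

# Reversing the direction flips the sign of both factors in every term.
reverse_gain = partial_change(q_new, q, [-mi for mi in m])

print(info_gain, jeffreys, reverse_gain)
```

The three printed values agree up to rounding, in line with the statement that the change in mean log fitness caused by selection is the Jeffreys divergence.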
Character coordinates and information
The character values provide the coordinates of meaning
in an analysis of selection. We can derive the relations
between information and the coordinates of meaning by
using the results of eqns 7 and 8. From those equations,
we obtain the relation between the change given the
coordinates of meaning, $\Delta_s \bar{z}$, and the change given the
coordinates of information, $\Delta_s \bar{m}$,
$$\Delta_s \bar{z} = \frac{\beta_{zw}}{\beta_{mw}}\,\Delta_s \bar{m}.$$
The term $\beta_{zw}$ is the regression coefficient of the character values, z, on the fitnesses, w. The term $\beta_{mw}$ is the
regression coefficient of the log fitnesses, m, on the fitnesses, w. These regressions provide an exact expression
for changing the coordinates from information, $\Delta_s \bar{m}$, to
characters, $\Delta_s \bar{z}$. When the magnitudes of the changes
are small, $w \to m + 1$, thus
$$\Delta_s \bar{z} \to \beta_{zm}\,\Delta_s \bar{m}.$$
To repeat, it is important to recognize a regression
coefficient as an exact expression for the change in
scale associated with a change in coordinates. The
regression is sufficient when evaluating the consequences for a change in coordinates with respect to a
change in mean value.
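
The change of coordinates by regression can be verified with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: covariances and regressions are weighted by the initial frequencies, selection follows the replicator update, log fitness is taken relative to mean fitness, and all names and numerical values are hypothetical.

```python
import math

def mean(q, x):
    """Frequency-weighted mean of x."""
    return sum(qi * xi for qi, xi in zip(q, x))

def cov(q, x, y):
    """Frequency-weighted covariance of x and y."""
    mx, my = mean(q, x), mean(q, y)
    return sum(qi * (xi - mx) * (yi - my) for qi, xi, yi in zip(q, x, y))

def slope(q, y, x):
    """Regression coefficient of y on x, weighted by q."""
    return cov(q, x, y) / cov(q, x, x)

q = [0.40, 0.35, 0.25]   # hypothetical frequencies
w = [1.00, 1.03, 0.98]   # hypothetical fitnesses (weak selection)
z = [2.0, 3.5, 1.0]      # hypothetical character values

w_bar = mean(q, w)
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]
m = [math.log(wi / w_bar) for wi in w]   # log fitness, equal to log(q'_i / q_i)

ds_z = sum((a - b) * zi for a, b, zi in zip(q_new, q, z))
ds_m = sum((a - b) * mi for a, b, mi in zip(q_new, q, m))

exact = slope(q, z, w) / slope(q, m, w) * ds_m   # ratio of regressions on w
approx = slope(q, z, m) * ds_m                   # small-change form

print(ds_z, exact, approx)
```

The ratio of regressions reproduces the change in the character mean up to rounding, whereas the single regression of z on m is close only because the assumed fitness differences are small.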
The underlying values, $z_i$, may themselves be nonlinear functions of other values (Frank, 2012b). For example, $z_i$ could be the product of different character values
measured on each individual, or the square of some
underlying character. What matters is that we average
over the $z_i$ values to get $\Delta_s \bar{z}$.
Character coordinates and total evolutionary change
The previous analyses have focused on the selection
part of total evolutionary change. I defined selection as
the change caused by frequency differences
$$\Delta_s \bar{z} = \sum_i \Delta q_i z_i.$$
The subscript s emphasizes that this expression is the
partial change caused by selection (Price, 1972b; Ewens,
1989; Frank & Slatkin, 1992).
Total change in characters
The partial change arises by holding constant the character values, such that $\Delta z_i = z'_i - z_i = 0$. This assumption fixes the coordinates, $z_i$, and evaluates the
meaning of changing frequencies in the context of that
fixed set of coordinates.
If the coordinates that give meaning also change,
$\Delta z_i \ne 0$, then we must account for that change in
coordinates with respect to the total evolutionary
change. In particular, the total change is the sum of the
change, $\Delta_s$, caused by selection through varying frequencies, q, holding constant the coordinates, z, plus
the change in coordinates, $\Delta_c$, holding constant the
new frequencies in the updated population, $q'$. We
write the total change as
$$\Delta\bar{z} = \Delta_s \bar{z} + \Delta_c \bar{z} = \sum_i \Delta q_i z_i + \sum_i q'_i \Delta z_i. \tag{26}$$
This expression is a form of the Price equation.
I devoted the prior article to a full discussion of this equation (Frank, 2012b). Here, I focus only on those aspects
that concern information. In particular, I emphasize the
interpretation of z as a coordinate system that gives
meaning to the information basis of natural selection.
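
The partition of total change into a selection term and a coordinate term can be confirmed numerically. The sketch below is an illustration only; the function name and all frequencies and character values are hypothetical.

```python
def price_decomposition(q, q_new, z, z_new):
    """Return (selection term, coordinate term, total change) for the mean of z."""
    selection = sum((a - b) * zi for a, b, zi in zip(q_new, q, z))
    coordinates = sum(a * (zn - zi) for a, zi, zn in zip(q_new, z, z_new))
    total = (sum(a * zn for a, zn in zip(q_new, z_new))
             - sum(b * zi for b, zi in zip(q, z)))
    return selection, coordinates, total

# Hypothetical population before and after one round of change.
q = [0.5, 0.3, 0.2]
q_new = [0.55, 0.30, 0.15]
z = [1.0, 2.0, 4.0]
z_new = [0.9, 2.1, 4.0]

selection, coordinates, total = price_decomposition(q, q_new, z, z_new)
print(selection, coordinates, total)
```

The selection term holds the character values fixed while the frequencies change, the coordinate term holds the new frequencies fixed while the character values change, and the two terms sum to the total change in the mean.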
Total change in information
The total evolutionary change in eqn 26 can be used to
evaluate information. Let z = m, where the log fitness,
m, provides a measure of the information accumulated
by a population. Thus,
$$\Delta\bar{m} = \Delta_s \bar{m} + \Delta_c \bar{m}.$$
From eqn 13, the selection component of change is
$\Delta_s \bar{m} = J$. In general, no simplified reduction or particular interpretation is possible for the change in coordinates, $\Delta_c \bar{m}$. That change in coordinates arises from any
environmental or extrinsic factors that may change,
altering the fit of the characters to the environment.
The changes in the frequencies themselves can be an
‘environmental’ change that alters fitnesses (Price,
1972b; Ewens, 1989; Frank & Slatkin, 1992). Thus, no
general expression for total evolutionary change in
fitness is possible other than
$$\Delta\bar{m} = J + \Delta_c \bar{m}.$$
One can, of course, analyse particular models such as
mutation–selection balance. Mutation decays information through changes in fitness that are, on average,
negative, causing a loss of information through the
term $\Delta_c \bar{m} = \sum_i q'_i \Delta m_i$. The particular loss of information
through $\Delta_c \bar{m}$ depends on the specific assumptions. By
contrast, the gain in information through selection is
always $\Delta_s \bar{m} = J$.
Equilibrium balance between information gain and loss
Many processes lead to an equilibrium balance between
gain of information by selection and decay of information by an opposing force (Frank, 2012a). Mutation–
selection balance is one example. Frequency-dependent
selection is another, in which the gain in information
by selection is balanced by the decay of information
(fitness) caused by frequency changes. For example, in
the evolution of sex ratios, making more daughters
may be favoured by selection. But as the number
of daughters increases by selection, the advantage of
making extra daughters decays.
Although we cannot, in general, specify the change
in the coordinate term, $\Delta_c \bar{m}$, we can express the equilibrium condition, $\Delta\bar{m} = 0$. Under a balance between
information gain by selection and information decay by
change in coordinates,
$$J = -\Delta_c \bar{m}.$$
It is sometimes possible to analyse particular problems by using that universal expression for the balance
of forces (Frank & Slatkin, 1990; Frank, 1995).
Evolution of the coordinate system
In the previous sections, I have fixed the particular
dimensions that define the coordinate system. Although
the coordinates may change, $\Delta z_i$, each dimension $i$
remained. From a broader perspective, the evolution of
the various dimensions in the coordinate system itself is
perhaps among the most interesting evolutionary problems. One aspect concerns the origin of new characters
(West-Eberhard, 2003). More generally, one may consider the evolution of the optimal set of characters with
respect to the capture of information.
There is an interesting literature in engineering about
optimal design of sensors with respect to capturing
information. That literature sometimes uses Fisher
information as the optimality criterion with respect to
design (Borguet & Léonard, 2008). Application of that
design perspective with regard to information may provide insight into biological problems. For example, multiple cellular receptors may respond to the same sort of
information, such as the concentration of a hormone.
But those receptors may be tuned differently with
regard to sensitivity to signals. A related idea concerns
the common trade-off between informativeness and
simplicity in classification (Kemp & Regier, 2012).
A second aspect of coordinates concerns the parametric reduction of the full nonparametric distribution of
characters. Reducing the full distribution to the mean is
an extreme reduction and probably not justified in general. However, there often may be some suitable reduction of dimensionality to a sufficient set of parameters
with respect to the acquisition of information (Carter
et al., 2009; Goh et al., 2011). That sufficient set defines
the coordinates of information and meaning followed
by an evolving population. It may be that an improved
parametric representation of information in the environment by a set of characters enhances fitness. Thus,
it may be the parametric representation itself that is
under the strongest selection or, at least, a particularly
interesting form of selection.
The fundamental equations of selection are often written in the statistical terms of variances, covariances and
regressions. I have argued that one obtains a deeper
understanding of selection if one learns to read the
fundamental equations in terms of information. Here,
I review my argument by listing the key steps derived
in previous sections. I start with the classic statistical
equations of selection. I then show the connection of
those statistical expressions of selection to expressions
for the information that populations accumulate about
the fit of characters to the environment.
Statistical expressions of selection
To understand where the classic statistical expressions
of selection come from and what they mean, let us start
with the basic equation for evolutionary change by
natural selection
$$\Delta_s\bar{z} = \sum_i \Delta q_i z_i \tag{28}$$
given in eqn 3. Here, $\Delta_s\bar{z}$ is the change caused by selection in the average value of a character, $\bar{z}$. This expression applies generally to selection of any value. For
example, $z$ could be gene frequency, leading to population genetics expressions, or $z$ could be a quantitative
trait such as weight, or $z$ could be a nonlinear function
of several characters. The $\Delta q_i$ terms are the changes
caused by selection in the frequency of the $i$th character value, $z_i$. Total selection is the total change in
frequencies, with each change caused by selection, $\Delta q_i$,
weighted by its associated character value, $z_i$.
I showed that one can rewrite the association
between the change caused by selection and the character value as
$$\sum_i \Delta q_i z_i = \mathrm{Cov}(w, z)/\bar{w},$$
a form known as the Price equation and also related to
Robertson’s secondary theorem of natural selection
(Frank, 2012b). This form provides the foundation for
quantitative genetics theory and also arises in standard
models of population genetics. The definition of covariance allows us to rewrite the covariance as the product
of a regression coefficient and a variance term
$$\Delta_s\bar{z} = \mathrm{Cov}(w, z)/\bar{w} = \beta_{zw} V_w/\bar{w}, \tag{29}$$
where $\beta_{zw}$ is the regression of character value, $z$, on fitness, $w$, and $V_w$ is the variance in fitness. These sorts of
regression and variance terms arise repeatedly in the
fundamental equations of selection.
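The equivalence of the frequency-change, covariance and regression forms can be illustrated with a short Python sketch. The three-type population and its values are hypothetical, the helper `mean` is introduced here only for illustration, and the update $q_i' = q_i w_i/\bar{w}$ is assumed for the frequencies after selection.

```python
# Illustrative values only: a hypothetical population with three types.
q = [0.5, 0.3, 0.2]   # frequencies before selection
w = [1.0, 1.5, 0.5]   # fitnesses
z = [2.0, 3.0, 5.0]   # character values


def mean(x, freq):
    """Frequency-weighted average of the values in x."""
    return sum(f * xi for f, xi in zip(freq, x))


w_bar = mean(w, q)
z_bar = mean(z, q)
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]

# Selection as the sum of frequency changes weighted by character values.
delta_s_z = sum((qn - qi) * zi for qn, qi, zi in zip(q_new, q, z))

# The same quantity from the covariance and from the regression form.
cov_wz = mean([(wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z)], q)
var_w = mean([(wi - w_bar) ** 2 for wi in w], q)
beta_zw = cov_wz / var_w                    # regression of z on w

price_form = cov_wz / w_bar
regression_form = beta_zw * var_w / w_bar

print(delta_s_z, price_form, regression_form)
```

All three printed values agree up to floating-point error, whatever values are chosen for `q`, `w` and `z`, provided the fitnesses are not all equal so that the regression coefficient is defined.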
One can easily understand why selection depends on
an association between fitness, w, and character value,
z. Those character values associated with higher fitness
will increase, whereas those character values associated
with lower fitness will decrease. But why should the
expression for selection be exactly the covariance, or
the regression multiplied by the variance, which capture only the linear component of association? The reason is that $\Delta_s\bar{z}$ describes selection by a change in
average values. To calculate a change in the average,
we need only the linear component of association
between character and fitness.
These statistical expressions of selection in terms of
covariances, variances and regressions have been very
useful throughout the history of evolutionary theory.
However, these expressions give no sense of what
selection means. To say that selection is the covariance
of fitness and character value is simply to express an
algebraic relation. That algebraic relation is very useful,
but it does not give a sense of what selection is actually
doing with regard to adaptation or how selection relates
to processes in other fields of study. The statistical
expressions do not tell us how to read the fundamental
equations of selection with regard to the meaning of
the underlying process.
Selection in terms of information
In this article, I argued that selection causes populations
to accumulate information about the fit of characters to
the environment. I gave a precise definition of ‘information’. That definition of information with respect to
selection matches exactly the classic usage of information and entropy from the fundamental theories of
physics, statistics and communication. By showing the
exact relations between selection and information, I
tied the theory of natural selection to the broader conceptual framing of problems at the foundation of many
key scientific disciplines.
I will not repeat the whole argument here. Instead, I list
a few steps to emphasize the essential points. To understand the information associated with selection and fitness,
we must analyse fitness on a logarithmic scale
$$m_i = \log(w_i) = \log(\bar{w}) + \log\left(\frac{q_i'}{q_i}\right).$$
The logarithmic scale compares relative magnitudes.
We need relative magnitudes because there is no meaning in the number of babies or the number of copies
produced with regard to whether a type, i, is increasing
or decreasing in the population. We need to know the
relative success. The logarithmic scale is the natural
scale of relative magnitudes.
Using log fitness, $m$, as the character value of interest
in eqn 28, we obtain
$$\Delta_s\bar{m} = \sum_i \Delta q_i m_i = \sum_i \Delta q_i \log\left(\frac{q_i'}{q_i}\right).$$
The second equality follows because the $\log(\bar{w})$ term is constant across types and $\sum_i \Delta q_i = 0$.
We recognize the fundamental expression for the
change in information given by the Kullback–Leibler
divergence, or relative entropy, as
$$D(q' \| q) = \sum_i q_i' \log\left(\frac{q_i'}{q_i}\right).$$
Using this definition for change in information, $D$,
we can express the change in mean log fitness caused
by selection as
$$\Delta_s\bar{m} = D(q' \| q) + D(q \| q').$$
This sum of the changes in information in each direction is known as the Jeffreys divergence, $J$. Thus, we
can write the fundamental expression for the accumulation in information by natural selection as
$$\Delta_s\bar{m} = J.$$
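As a minimal numerical sketch of this result, the following Python code computes the change in mean log fitness caused by selection and the Jeffreys divergence for a hypothetical three-type population. The values are illustrative, the function name `kl` is introduced here for illustration, and the update $q_i' = q_i w_i/\bar{w}$ is assumed. All frequencies and fitnesses must be strictly positive for the logarithms to be defined.

```python
import math

# Illustrative values only: a hypothetical population with three types.
q = [0.5, 0.3, 0.2]   # frequencies before selection
w = [1.0, 1.5, 0.5]   # fitnesses (strictly positive)

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]
m = [math.log(wi) for wi in w]              # log fitness, m_i

# Change in mean log fitness caused by selection.
delta_s_m = sum((qn - qi) * mi for qn, qi, mi in zip(q_new, q, m))


def kl(a, b):
    """Kullback-Leibler divergence D(a || b) for strictly positive frequencies."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


# Jeffreys divergence: the sum of the divergences in each direction.
jeffreys = kl(q_new, q) + kl(q, q_new)

print(delta_s_m, jeffreys)
```

The two printed values agree up to floating-point error, and the Jeffreys divergence is positive whenever selection changes the frequencies.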
Because $z$ in eqn 29 is just a placeholder for any
character, we can use $m$ in place of $z$ in that equation,
$$\Delta_s\bar{m} = \beta_{mw} V_w/\bar{w}.$$
Thus, the information accumulated by natural selection is equivalently expressed in terms of the regression
coefficient and variance
$$J = \beta_{mw} V_w/\bar{w}. \tag{30}$$
The value of $J$ is the gain in information. The variance
in fitness, $V_w$, is therefore a measure of the separation
between the initial population and the population after
selection, when the separation between populations is
expressed on a scale of information. The regression
divided by the mean fitness, $\beta_{mw}/\bar{w}$, is a scaling factor
that translates the measure of information in $V_w$ to the
scale of log fitness, $m$. That scaling change is required
because log fitness is the proper measure of information
in expressions of selection.
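The role of $\beta_{mw}/\bar{w}$ as a scaling factor on the variance in fitness can be seen in a small Python sketch. The population values are hypothetical, the helper `mean` is introduced for illustration, and the update $q_i' = q_i w_i/\bar{w}$ is assumed.

```python
import math

# Illustrative values only: a hypothetical population with three types.
q = [0.5, 0.3, 0.2]   # frequencies before selection
w = [1.0, 1.5, 0.5]   # fitnesses (strictly positive)


def mean(x, freq):
    """Frequency-weighted average of the values in x."""
    return sum(f * xi for f, xi in zip(freq, x))


w_bar = mean(w, q)
m = [math.log(wi) for wi in w]
m_bar = mean(m, q)

var_w = mean([(wi - w_bar) ** 2 for wi in w], q)
cov_mw = mean([(mi - m_bar) * (wi - w_bar) for mi, wi in zip(m, w)], q)
beta_mw = cov_mw / var_w                    # regression of log fitness on fitness

# Jeffreys divergence computed directly from the frequency change.
q_new = [qi * wi / w_bar for qi, wi in zip(q, w)]
jeffreys = sum((qn - qi) * math.log(qn / qi) for qn, qi in zip(q_new, q))

# The same quantity as the variance in fitness rescaled by beta_mw / w_bar.
scaled_variance = (beta_mw / w_bar) * var_w

print(var_w, beta_mw / w_bar, jeffreys, scaled_variance)
```

The variance in fitness alone is not the information gain; it equals $J$ only after multiplication by the scaling factor, which converts the linear scale of fitness to the logarithmic scale.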
Equation 30 shows the equivalence between the
expression of information gain and the expression of it
in terms of statistical quantities. There is nothing in the
mathematics to favour either an information interpretation or a statistical interpretation.
I have argued that, when reading the fundamental
equations of selection for meaning, we should prefer
the information interpretation. The information perspective makes sense intuitively. Selection is the process
by which populations accumulate information about
the environment.
Acknowledgments
My research is supported by National Science Foundation grant EF-0822399.
References
Adami, C. 2002. What is complexity? BioEssays 24: 1085–1094.
Adami, C. & Cerf, N.J. 2000. Physical complexity of symbolic
sequences. Physica D 137: 62–69.
Amari, S. & Cichocki, A. 2010. Information geometry of divergence functions. Bull. Polish Acad. Sci. Tech. Sci. 58: 183–195.
Amari, S. & Nagaoka, H. 2000. Methods of Information Geometry.
Oxford University Press, New York.
Barton, N.H. & de Vladar, H.P. 2009. Statistical mechanics and
the evolution of polygenic quantitative traits. Genetics 181:
Ben-Naim, A. 2008a. Entropy Demystified: The Second Law
Reduced to Plain Common Sense. World Scientific, Singapore.
Ben-Naim, A. 2008b. A Farewell to Entropy: Statistical Thermodynamics Based on Information. World Scientific, Singapore.
Boltzmann, L. 1872. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen. Wien. Akad. Sitz 66: 275–370.
Borguet, S. & Léonard, O. 2008. The Fisher information matrix
as a relevant tool for sensor selection in engine health monitoring. Int. J. Rotat. Machine. 2008: 784749.
Carter, K.M., Raich, R., Finn, W. & Hero, A. 2009. FINE:
Fisher information nonparametric embedding. IEEE Trans.
Pattern Anal. Mach. Intell. 31: 2093–2098.
Charlesworth, B. & Charlesworth, D. 2010. Elements of Evolutionary Genetics. Roberts & Company, Greenwood Village, CO.
Cichocki, A. & Amari, S. 2010. Families of alpha- beta- and
gamma-divergences: Flexible and robust measures of similarities. Entropy 12: 1532–1568.
Cichocki, A., Cruces, S. & Amari, S. 2011. Generalized alpha-beta divergences and their application to robust nonnegative
matrix factorization. Entropy 13: 134–170.
Clausius, R. 1867. Mechanical Theory of Heat. John Van Voorst,
Cover, T.M. & Thomas, J.A. 1991. Elements of Information Theory.
Wiley, New York.
Crow, J.F. & Kimura, M. 1970. An Introduction to Population
Genetics Theory. Burgess, Minneapolis, MN.
Dewar, R.C. 2005. Maximum entropy production and the
fluctuation theorem. J. Phys. A Math. Gen. 38: L371-L381.
Donaldson-Matasci, M.C., Bergstrom, C.T. & Lachmann, M.
2010. The fitness value of information. Oikos 119: 219–230.
Edwards, A.W.F. 2000. Fisher information and the fundamental theorem of natural selection. Istit. Lomb. (Rend. Sc.) B 134:
Ewens, W.J. 1989. An interpretation and proof of the fundamental theorem of natural selection. Theor. Popul. Biol. 36: 167–180.
Ewens, W.J. 1992. An optimizing principle of natural selection
in evolutionary population genetics. Theor. Popul. Biol. 42:
Ewens, W.J. 2010. Mathematical Population Genetics: I. Theoretical
Introduction, 2nd edn. Springer-Verlag, New York.
Falconer, D.S. & Mackay, T.F.C. 1996. Introduction to Quantitative Genetics, 4th edn. Longman, Essex, England.
Fernando, C., Szathmáry, E. & Husbands, P. 2012. Selectionist
and evolutionary approaches to brain function: a critical
appraisal. Front. Comput. Neurosci. 6: 24.
Fisher, R.A. 1922. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. A 222: 309–368.
Fisher, R.A. 1925. Theory of statistical estimation. Math. Proc.
Cambridge Phil. Soc. 22: 700–725.
Fisher, R.A. 1930. The Genetical Theory of Natural Selection.
Clarendon, Oxford.
Frank, S.A. 1995. George Price’s contributions to evolutionary
genetics. J. Theor. Biol. 175: 373-388.
Frank, S.A. 1996. The design of natural and artificial adaptive
systems. In: Adaptation (M.R. Rose & G.V. Lauder, eds), pp.
451–505. Academic Press, San Diego, CA.
Frank, S.A. 1997a. The design of adaptive systems: optimal
parameters for variation and selection in learning and development. J. Theor. Biol. 184: 31–39.
Frank, S.A. 1997b. Developmental selection and self-organization. BioSystems 40: 237-243.
Frank, S.A. 1997c. The Price equation, Fisher’s fundamental
theorem, kin selection, and causal analysis. Evolution 51:
Frank, S.A. 1998. Foundations of Social Evolution. Princeton
University Press, Princeton, NJ.
Frank, S.A. 2009. Natural selection maximizes Fisher information. J. Evol. Biol. 22: 231–244.
Frank, S.A. 2012a. Natural selection. III. Selection versus transmission and the levels of selection. J. Evol. Biol. 25: 227–243.
Frank, S.A. 2012b. Natural selection. IV. The Price equation.
J. Evol. Biol. 25: 1002–1019.
Frank, S.A. & Slatkin, M. 1990. The distribution of allelic effects
under mutation and selection. Genet. Res. 55: 111–117.
Frank, S.A. & Slatkin, M. 1992. Fisher’s fundamental theorem
of natural selection. Trends Ecol. Evol. 7: 92–95.
Frieden, B.R. 2004. Science From Fisher Information: A Unification. Cambridge University Press, Cambridge, UK.
Frieden, B.R., Plastino, A. & Soffer, B.H. 2001. Population
genetics from an information perspective. J. Theor. Biol. 208:
Futuyma, D.J. 1998. Evolutionary Biology, 3rd edn. Sinauer
Associates, Sunderland, MA.
Gell-Mann, M. & Lloyd, S. 1996. Information measures, effective complexity, and total information. Complexity 2: 44–52.
Gibbs, J.W. 1902. Elementary Principles in Statistical Mechanics.
Scribner, New York.
Goh, A., Lenglet, C., Thompson, P.M. & Vidal, R. 2011. A nonparametric Riemannian framework for processing high
angular resolution diffusion images and its applications to
odf-based morphometry. NeuroImage 56: 1181–1201.
Hand, D.J. 2004. Measurement Theory and Practice. Arnold, London.
Harper, M. 2010. The replicator equation as an inference
dynamic. arXiv:0911.1763v3.
Hartley, R.V.L. 1928. Transmission of information. Bell Syst.
Tech. J. 7: 535–563.
Hill, T.L. 1987. An Introduction to Statistical Thermodynamics.
Dover, New York.
Hofbauer, J. & Sigmund, K. 1998. Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK.
Hofbauer, J. & Sigmund, K. 2003. Evolutionary game dynamics. Bull. Am. Math. Soc. 40: 479–519.
Iwasa, Y. 1988. Free fitness that always increases in evolution.
J. Theor. Biol. 135: 265–281.
Jaynes, E.T. 2003. Probability Theory: The Logic of Science.
Cambridge University Press, New York.
Jeffreys, H. 1946. An invariant form for the prior probability
in estimation problems. Proc. R. Soc. London A 186: 453–461.
Kemp, C. & Regier, T. 2012. Kinship categories across languages reflect general communicative principles. Science 336:
Kimura, M. 1958. On the change of population fitness by
natural selection. Heredity 12: 145–167.
Kleidon, A. 2010. A basic introduction to the thermodynamics of the earth system far from equilibrium and maximum entropy production. Philos. Trans. R. Soc. Lond. B
365: 1303–1315.
Kullback, S. 1959. Information Theory and Statistics. Wiley,
New York.
Kullback, S. & Leibler, R.A. 1951. On information and sufficiency. Ann. Math. Stat. 22: 79–86.
Lande, R. 1979. Quantitative genetic analysis of multivariate
evolution, applied to brain: body size allometry. Evolution
33: 402–416.
Lande, R. & Arnold, S.J. 1983. The measurement of selection
on correlated characters. Evolution 37: 1212–1226.
Li, M. & Vitányi, P. 2008. An Introduction to Kolmogorov
Complexity and its Applications. Springer-Verlag, New York.
Lynch, M. & Walsh, B. 1998. Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
Maynard Smith, J. 2000. The concept of information in
biology. Philos. Sci. 67: 177–194.
Popper, K.R. 1972. Objective Knowledge: An Evolutionary
Approach. Oxford, Oxford, UK.
Price, G.R. 1970. Selection and covariance. Nature 227: 520–
Price, G.R. 1972a. Extension of covariance selection mathematics. Ann. Hum. Genet. 35: 485–490.
Price, G.R. 1972b. Fisher’s ‘fundamental theorem’ made clear.
Ann. Hum. Genet. 36: 129–140.
Roff, D.A. 1997. Evolutionary Quantitative Genetics. Chapman
and Hall, New York.
Shalizi, C.R. 2009. Dynamics of Bayesian updating with
dependent data and misspecified models. Electron. J. Stat. 3:
Shannon, C.E. 1948a. A mathematical theory of communication. Bell Syst. Tech. J. 27: 379–423.
Shannon, C.E. 1948b. A mathematical theory of communication. Bell Syst. Tech. J. 27: 623–656.
Taylor, P.D. & Jonker, L.B. 1978. Evolutionary stable strategies
and game dynamics. Math. Biosci. 40: 145–156.
de Vladar, H.P. & Barton, N.H. 2011a. The contribution of
statistical physics to evolutionary biology. Trends Ecol. Evol.
26: 424–432.
de Vladar, H.P. & Barton, N.H. 2011b. The statistical mechanics
of a polygenic character under stabilizing selection, mutation
and drift. J. R. Soc. Interface 8: 720–739.
Volk, T. & Pauluis, O. 2010. It is not the entropy you produce,
rather, how you produce it. Philos. Trans. R. Soc. Lond. B 365:
Wagner, G.P. 2010. The measurement theory of fitness. Evolution
64: 1358–1376.
West-Eberhard, M.J. 2003. Developmental Plasticity and Evolution.
Oxford University Press, New York.
Appendix A: Fisher information as the limiting form of the Jeffreys divergence
A large family of divergence measures converges to
Fisher information in the limit of small changes (Amari
& Nagaoka, 2000; Amari & Cichocki, 2010; Cichocki &
Amari, 2010; Cichocki et al., 2011). In this appendix,
I show that the limit of the Jeffreys divergence is the
Fisher information multiplied by a scaling factor for
parametric distance.
I also show that the chi-square divergence becomes
the Fisher information metric in the limit of small
changes. The different forms of divergence can be confusing if one does not realize that all of the different
divergence measures in the Fisher family are equivalent
in the limit, but differ when changes are not small.
My main point is that the Jeffreys divergence holds
the unique position as the only correct divergence measure for models of selection. It is the only measure that
is correct both for large changes and, in the limit, for
small changes. As far as I know, my derivation in this
article of the Jeffreys divergence in relation to selection
has not been shown previously. The clear relation of
the Jeffreys divergence to changes in information is
essential to make the proper connection between selection and information.
Limiting form of Jeffreys divergence
I show $J(\theta) \to F(\theta)\,\Delta\theta^2$ as the distance in the parametric
coordinates $\Delta\theta^2 \to 0$. Notationally, $\Delta\theta^2 \equiv (\Delta\theta)^2$. Using
the standard differential notation for small differences,
we write $\Delta\theta^2 \to d\theta^2$. Thus, I show $J(\theta) \to F(\theta)\,d\theta^2$.
I use the vector $\theta$ as parametric coordinates for probability distributions, following standard analysis in
information geometry (Amari & Nagaoka, 2000). For
simplicity, I usually treat the parametric vector as a single dimension. The extension to multiple dimensions is straightforward.
The Jeffreys divergence in parametric form, from
eqn 21, is
$$J(\theta) = \sum_i (\Delta q_i|\theta)\,[\Delta\log(q_i|\theta)].$$
As the changes become small, $\Delta q_i|\theta = q_i'|\theta' - q_i|\theta \to 0$
and $\Delta\theta = \theta' - \theta \to 0$, we write
$$\Delta q_i|\theta \to dq_i|\theta = \frac{dq_i|\theta}{d\theta}\,d\theta = \dot{q}_i\,d\theta,$$
where $\dot{q}_i$ is the derivative of $q_i|\theta$ with respect to $\theta$. Next,
$$\Delta\log(q_i|\theta) \to d\log(q_i|\theta) = \frac{d\log(q_i|\theta)}{d\theta}\,d\theta = \frac{\dot{q}_i}{q_i}\,d\theta,$$
where, to make the notation more concise, I use
$q_i \equiv q_i|\theta$. Thus,
$$J(\theta) \to \sum_i \frac{\dot{q}_i^2}{q_i}\,d\theta^2.$$
Below, I show that $\sum_i \dot{q}_i^2/q_i$ is Fisher information,
$F(\theta)$. Thus, $J(\theta) \to F(\theta)\,d\theta^2$.
Pearson’s chi-square divergence
We have from the previous expression
$$J(\theta) \to \sum_i \frac{\dot{q}_i^2}{q_i}\,d\theta^2 = \sum_i \frac{dq_i^2}{q_i}.$$
Pearson’s chi-square divergence, or chi-square test
statistic, is usually described as follows. Given an
expected probability distribution, $\{q_i\}$, and an observed
probability distribution, $\{q_i'\}$, the chi-square statistic is
the sum of observed minus expected squared over
expected. Writing the observed minus expected squared
as $\Delta q_i^2 = (q_i' - q_i)^2$, we have
$$\chi^2(\theta) = \sum_i \frac{\Delta q_i^2}{q_i}.$$
As the changes become small,
$$\chi^2(\theta) \to \sum_i \frac{dq_i^2}{q_i} = \sum_i \frac{\dot{q}_i^2}{q_i}\,d\theta^2,$$
demonstrating that the Jeffreys and chi-square divergences have the same limiting form. The next section
shows that the limiting form is related to the Fisher
information metric.
When changes are large, only the Jeffreys divergence
gives the correct expression for changes by selection in
mean log fitness, $\Delta_s\bar{m}$. The chi-square divergence is the
change in mean fitness on a linear scale
$$\frac{\Delta_s\bar{w}}{\bar{w}} = \sum_i \Delta q_i \frac{w_i}{\bar{w}} = \sum_i \frac{\Delta q_i^2}{q_i}.$$
As I discussed in the text, the correct scale for analysing the changes in fitness is logarithmic, because fitness is a relative measure, and logarithmic scaling is the
correct scale for relative measures (Wagner, 2010). In
addition, the relations between selection and information are only clear on the logarithmic scale, because it
is only on that scale that one can see the connections
to the classic theories of entropy and information. In
the limit of small changes, the logarithmic scale
becomes linear, and thus, $\Delta_s\bar{m} \to \Delta_s\bar{w}/\bar{w}$.
Alternative expressions for Fisher information
One can think of Fisher information as the change in a
probability distribution with respect to a change in a
parameter that specifies the distribution. The more rapidly a distribution changes with respect to a parameter,
the more information each observation provides about
the value of the parameter. For example, if the distribution changes very slowly, then small differences in the
distribution of observed values may translate into big
differences in parameter values. Thus, approximately
similar distributions of observations map to widely different parameter values, so each observation provides
relatively little information about the parameter. If, by
contrast, the distribution changes rapidly with respect
to a parameter, then the distribution of observations is
very different for small changes in the parameter, and
each observation provides much information about the
likely value of the parameter.
Mathematically, Fisher information is the negative
value of the expected curvature of the log-likelihood
$$F(\theta) = -\sum_i q_i \frac{d^2\log(q_i|\theta)}{d\theta^2}.$$
Doing the differentiation, and noting (Amari & Nagaoka, 2000) that
$$\sum_i \frac{d^2 q_i|\theta}{d\theta^2} = \frac{d}{d\theta}\sum_i \frac{dq_i|\theta}{d\theta} = 0,$$
because the sum of changes in frequencies must be zero
over a distribution, we obtain
$$F(\theta) = \sum_i \frac{\dot{q}_i^2}{q_i}.$$
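The agreement between the curvature form and the squared-derivative form of Fisher information can be checked numerically. The Python sketch below assumes a hypothetical one-parameter family, $q_i(\theta) \propto \exp(\theta a_i)$, with illustrative scores $a_i$ that are not part of the article. Derivatives are approximated by central finite differences, so the two results agree only up to a small numerical error. For this particular family, a standard property of exponential families is that Fisher information equals the variance of $a$, which gives a third value for comparison.

```python
import math

a = [0.0, 1.0, 2.5]   # hypothetical scores defining the family q_i(theta)


def q_of(theta):
    """Frequencies q_i(theta) proportional to exp(theta * a_i)."""
    weights = [math.exp(theta * ai) for ai in a]
    total = sum(weights)
    return [x / total for x in weights]


def fisher_squared_derivative(theta, h=1e-5):
    """Sum of q_dot_i^2 / q_i, with q_dot_i from a central difference."""
    q = q_of(theta)
    q_plus, q_minus = q_of(theta + h), q_of(theta - h)
    q_dot = [(p - m) / (2 * h) for p, m in zip(q_plus, q_minus)]
    return sum(d * d / qi for d, qi in zip(q_dot, q))


def fisher_curvature(theta, h=1e-4):
    """Negative expected second derivative of log q_i(theta)."""
    q = q_of(theta)
    log_plus = [math.log(x) for x in q_of(theta + h)]
    log_mid = [math.log(x) for x in q]
    log_minus = [math.log(x) for x in q_of(theta - h)]
    curvature = [(p - 2 * c + m) / (h * h)
                 for p, c, m in zip(log_plus, log_mid, log_minus)]
    return -sum(qi * ci for qi, ci in zip(q, curvature))


def variance_of_a(theta):
    """Variance of the scores a_i under q(theta)."""
    q = q_of(theta)
    mean_a = sum(qi * ai for qi, ai in zip(q, a))
    return sum(qi * (ai - mean_a) ** 2 for qi, ai in zip(q, a))


theta = 0.3
print(fisher_squared_derivative(theta), fisher_curvature(theta), variance_of_a(theta))
```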
A large number of different divergence measures
converge to Fisher information in the limit. Thus,
knowing only that the limiting form of a divergence is
Fisher information only weakly constrains the associated form of divergence. For example, from the expression above for the chi-square divergence
$$\chi^2(\theta) \to \sum_i \frac{dq_i^2}{q_i} = \sum_i \frac{\dot{q}_i^2}{q_i}\,d\theta^2,$$
it might be tempting, in a particular application in which
Fisher information arises, to think of the chi-square
divergence as somehow the natural measure of change,
because the chi-square form for large changes most closely resembles the limiting Fisher information form for
small changes. In the case of selection, that conclusion
would not be correct. The Jeffreys divergence is in fact
the natural measure of change, because the logarithmic
scale is the natural scale for changes in fitness and for
changes in information.
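The common limit of the two divergences, and their disagreement for large changes, can be illustrated numerically. The Python sketch below assumes a hypothetical one-parameter family, $q_i(\theta) \propto \exp(\theta a_i)$, with illustrative scores $a_i$; neither the family nor the function names come from the article. For this family, Fisher information equals the variance of $a$, a standard property of exponential families that is used here as the reference value.

```python
import math

a = [0.0, 1.0, 2.5]   # hypothetical scores defining the family q_i(theta)


def q_of(theta):
    """Frequencies q_i(theta) proportional to exp(theta * a_i)."""
    weights = [math.exp(theta * ai) for ai in a]
    total = sum(weights)
    return [x / total for x in weights]


def jeffreys(q, q_new):
    """Jeffreys divergence: sum of (q_i' - q_i) * log(q_i' / q_i)."""
    return sum((qn - qi) * math.log(qn / qi) for qi, qn in zip(q, q_new))


def chi_square(q, q_new):
    """Chi-square divergence: sum of (q_i' - q_i)^2 / q_i."""
    return sum((qn - qi) ** 2 / qi for qi, qn in zip(q, q_new))


def fisher(theta):
    """Fisher information of this family: the variance of a under q(theta)."""
    q = q_of(theta)
    mean_a = sum(qi * ai for qi, ai in zip(q, a))
    return sum(qi * (ai - mean_a) ** 2 for qi, ai in zip(q, a))


theta = 0.0
for d_theta in (1.0, 0.1, 0.001):
    q, q_new = q_of(theta), q_of(theta + d_theta)
    scale = fisher(theta) * d_theta ** 2
    # Ratios of each divergence to F(theta) * d_theta^2.
    print(d_theta, jeffreys(q, q_new) / scale, chi_square(q, q_new) / scale)
```

Both ratios approach one as the parameter change shrinks, whereas for the largest change the Jeffreys and chi-square divergences differ from each other and from $F(\theta)\,\Delta\theta^2$.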
Appendix B: Historical aspects
Kimura (1958) noted that the change in fitness in
certain models of selection is
$$\Delta_s\bar{m} = \sum_i \frac{\dot{q}_i^2}{q_i}. \tag{32}$$
Kimura used the standard notion of change with
respect to time in his study of continuous dynamics
with respect to small changes. Thus, the parameter is
$\theta \equiv t$ for time, and $\dot{q} = dq/dt$.
Ewens (1992) and Edwards (2000) provide comprehensive syntheses of the literature
on the various uses of Kimura’s expression,
$\sum_i \dot{q}_i^2/q_i$. The main use concerned information geometry expressions of selection
dynamics on a Riemannian manifold. Neither Ewens
nor Edwards found that discussion of information
geometry particularly useful. Edwards did note that
Kimura’s expression is in fact just an expression for
Fisher information. But Edwards did not think that
association was useful.
I agree with the criticisms by Ewens and Edwards
within the context of how the literature had been
framed. From Kimura (1958) through the various
developments in the literature, the emphasis had
always been on dynamics with respect to time. I agree
with Edwards that one cannot say anything very interesting about the temporal dynamics of evolutionary
change from the simple expression in eqn 32 for selection. That expression is the partial change caused by
selection (Price, 1972b; Ewens, 1989; Frank & Slatkin,
1992), not the total evolutionary change. The partial
change gives a clear sense of what selection is doing at
any moment, but provides no insight by itself about
evolutionary dynamics.
My presentation in this article is also based on Fisher
information and, more generally, on the Jeffreys divergence. Two aspects of my presentation go beyond the
past work and, in my view, provide a compelling case
for framing our understanding of selection in these terms.
First, I connected selection to information theory
through the general result $\Delta_s\bar{m} = J$, the Jeffreys divergence. This result does not depend on the limit of small
changes, but instead is a general description of the
nature of selection. This result establishes the proper
measure for the amount of information accumulated by selection.
Second, I related the change in information to various underlying parametric and nonparametric scales.
Those scales provide the meaning with respect to the
abstract scale for encoded information that forms the
basis for classical information theory. As Edwards
(2000) emphasized, Fisher information is information
about meaning with respect to underlying parameters
(Frank, 2009). Earlier work implicitly used time as the
parameter, which is not a meaningful way of expressing the accumulation of information. One does not
think of selection as providing information about time.
In addition to making the parametric basis for selection
and information explicit, my use of the Jeffreys
divergence clarified the relation of selection to classical
information theory.
Finally, I achieved greater generality than past work
by respecting the fundamental distinction between
selection and evolution. Past work often tried to make
general statements about evolutionary dynamics, which
is not possible. It is possible to make strong and completely general statements about the partial change
caused by selection. Such statements clarify the relations
between selection and information. One can achieve
that depth and generality only by working within the
fundamental limitations imposed by the distinction
between selection and total evolutionary change.
I mentioned that Ewens (1992) and Edwards (2000)
concluded that past work based on Kimura’s result
did not contribute significantly to understanding selection. Ewens (1992) did develop his own extension to
that theory, in which he showed an optimization principle in relation to Fisher’s fundamental theorem.
Frank (2009) developed a similar idea but with a different approach that emphasized information and the
Fisher information metric. Those studies derive from a
partitioning of the causes of fitness, which is the topic
of a future article in this series.
Received 6 July 2012; revised 15 September 2012; accepted 20
September 2012