14 Statistical Inference and Information Theory

Author

Alec Loudenback

“My greatest concern was what to call [the amount of unpredictability in a random outcome]. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’

When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’”- Claude Shannon (1971)

14.1 Chapter Overview

A brief introduction to information theory and its foundational role in statistics. Entropy and probability distributions. Bayes’ rule and model selection comparison via likelihoods. A brief tour of modern Bayesian statistics.

14.2 Introduction

Statistics has an invaluable role in any data-driven modeling enterprise. As financial professionals dealing inherently with risk and uncertainty - we use probability and statistics to understand, model, and communicate these aspects.

Statistics curricula and practice is undergoing a significant transformation, with a larger focus on information theory and Bayesian methods (as opposed to the common Frequentist methods that have dominated the statistics field for more than a century). Why the change? In short, these methods work better in a wider range of situations and convey more meaningful information about model performance and uncertainty. Until now, computational challenges have limited Bayesian methods to simpler problems, but newer algorithms and better hardware are overcoming this limitation.

In our experience, it is rare that a financial professional has had exposure to non-Frequentist theory and methods. Given how central probability and statistics is to this endeavor, we have drafted this chapter as a introduction - a proper treatment is beyond the scope of this book. Armed with this knowledge, some of the terminology and tools should be more accessible, and possibly included in new financial models.

14.3 Information Theory

Probability, statistics, machine learning, signal processing, and even physics have a foundational link in information theory which is the description and analysis of how much useful data is contained within something. We will work through this with concrete examples.

14.3.1 Example: The Missing Digit

Let’s consider the following situation: we are studying a poorly made copy of a financial statement. Amongst many associated exhibits, we are interested in the par value of a particular asset class. Unfortunately, for one reason or another one of the digits is completely indecipherable. Here’s what you can read, with the _ indicating that one of the digit is missing from the scanned copy:

32,000,_00

It is likely that you quickly formed an opinion on what the missing number is, but let us make that intuition more formal and quantitative.

Given that we know that par values of assets tend to be nice round numbers, our prior assumption for what the probability of the missing digit is may be something like the \(p(x_i)\) row of Table 14.1. This prior distribution assumes that the missing digit is most likely a 0. We shall call the individual outcomes \(x_i\) and the overall set of probabilities \(\{x_0,x_1,...x_9\}\) is called \(X\).

The information content of an outcome, \(h(x)\) is measured in bits and defined as¹:

\[ h(x_i) = \text{log}_2\frac{1}{p(x_i)} = -\text{log}_2 p(x_i) \tag{14.1}\]

Looking at Table 14.1, we can see that the information content of an outcome is lower when that outcome has a higher probability than the other potential outcomes. Specifically, If the digit was indeed 0, we have gained less information relative to our expectation than if the missing digit were anything other than 0 .

Tip

The information content is sometimes referred to as a measure of surprise that one would have when observing a realized outcome. In our missing digit example (@tbl-digit-information-human), we would not be surprised at all to find out that the missing digit were 0. In contrast, we would be more surprised to find out the digit were an 8.

We can characterize the entire distribution \(X\) via the entropy, \(H(X)\), of a probability set is the ensemble’s average information content:

\[ H(X) = \sum{p(x_i)\text{log}_2}\frac{1}{p(x_i)} = -\sum_{i} p(x_i)\,\log_2 p(x_i) \tag{14.2}\]

The entropy \(H(X)\) of the presumed outcomes in Table 14.1 distribution of outcomes is \(0.722 \text{bits}\). In just

Table 14.1: Probability distribution of missing digit, knowing the human inclination to prefer round numbers for par values of assets.

\(x_i\)	0	1	2	3	4	5	6	7	8	9
\(p(x_i)\)	.91	.01	.01	.01	.01	.01	.01	.01	.01	.01
\(h(x_i)\)	0.136	6.644	6.644	6.644	6.644	6.644	6.644	6.644	6.644	6.644

To be clear, we have take a non-uniform view on the probability distribution for the missing digit, and we’ll refer to this as the prior assumption (or just prior). This is unashamedly an opinionated assumption, just like your intuition when you encountered 32,000,_00! All we are doing is giving a quantitative basis for describing this assumption. Taking a view on a prior distribution is a quantitatively incorporating previously encountered data and professional judgment. Having a prior assumption like this is completely compatible with information theory.

Our professional judgment notwithstanding: what if we had another colleague who believed humans are completely rational and without bias for certain numbers? They think an asset’s par value need not be rounded at all. They argue for a prior distribution consistent with Table 14.2.

With the uniform prior assumption, \(H(X) = 3.322 \text{bits}\) and \(h(x_i)\) is also uniform. Note that \(H\) is higher for the uniform prior than the prior in Table 14.1. We will not prove it here, but a uniform probability over a set of outcomes is the highest entropy distribution that can be assumed for this problem. A higher entropy prior distribution can typically be viewed as a less biased prior assumption than a lower entropy prior.

Table 14.2: Probability distribution of missing digit with uniform, maximal entropy for the assumed probability distribution.

\(x_i\)	0	1	2	3	4	5	6	7	8	9
\(p(x_i)\)	.10	.10	.10	.10	.10	.10	.10	.10	.10	.10
\(h(x_i)\)	3.322	3.322	3.322	3.322	3.322	3.322	3.322	3.322	3.322	3.322

The choice of prior assumption can significantly impact the interpretation and analysis of the missing information. If we have strong reasons to believe that the human bias prior is more appropriate given the context (e.g., knowing that the number is likely a round number), then we would expect the missing digit to be ‘0’ with high probability. However, if we have no specific knowledge about the nature of the number and prefer to make a more conservative assumption, the uniform prior may be more suitable.

In real-world scenarios, the choice of prior assumptions often depends on domain knowledge, available data, and the specific problem at hand. It is important to carefully consider and justify the prior assumptions used in information-theoretic and statistical analyses.

14.3.2 Example: Classification

In this example, we will determine the optimal splits for a decision tree² based on the information gained at each node in the tree.

using DataFrames

employed = [true, false, true, true, true, false, false, true]
good_credit = [true, true, false, true, false, false, false, true]
default = [true, false, true, true, true, true, false, true]
default_data = DataFrame(; employed, good_credit, default)

Table 14.3: Fictional data regarding loan attributes and whether or not a loan defaulted before its maturity.

8×3 DataFrame

Row	employed	good_credit	default
	Bool	Bool	Bool
1	true	true	true
2	false	true	false
3	true	false	true
4	true	true	true
5	true	false	true
6	false	false	true
7	false	false	false
8	true	true	true

The entropy of the default rate data is, per Equation 14.2:

H0 = let
    p_default = sum(default_data.default) / nrow(default_data)
    p_good = 1 - p_default
    p_default * log2(1 / p_default) + p_good * log2(1 / p_good)
end

0.8112781244591328

Our goal is to determine which attribute (employed or good_credit) to use as the first split in the decision tree. Intuitively, we are looking for the most important factor in predicting default rates. We will quantitatively evaluate this by calculating the information gain, which is the difference in entropy between the prior node and the candidate node. Whichever criteria gains us the most information is the preferred attribute to create a decision split.

In our case we start with H0 as calculated above for the output variable default and calculate the difference in entropy between it and the average entropy of the data if we split on that node. The name for this is the information gain, \(IG(inputs, attributes)\):

\[ IG(T,a) = H{(T)} - H{(T|a)} \]

In words, the the information gain is simply the difference in entropy before and after learning the value of an outcome \(a\). We will illustrate that by determining the first branch in the decision tree.

Let’s first consider splitting the tree based on the employed status. We will calculate the entropy of each subset: with employment and without employment.

If we split the data based on being employed, we’d get two sub-datasets:

df_employed = filter(:employed => ==(true), default_data)

5×3 DataFrame

Row	employed	good_credit	default
	Bool	Bool	Bool
1	true	true	true
2	true	false	true
3	true	true	true
4	true	false	true
5	true	true	true

and

df_unemployed = filter(:employed => ==(false), default_data)

3×3 DataFrame

Row	employed	good_credit	default
	Bool	Bool	Bool
1	false	true	false
2	false	false	true
3	false	false	false

Let’s call its entropy H_employed, which should be zero because there is no variability in the default outcome for this subset.

H_employed = let
    p_default = sum(df_employed.default) / nrow(df_employed)
    p_good = 1 - p_default
    # p_default * log2(1 / p_default) + p_good * log2(1 / p_good) 
1    p_default * log2(1 / p_default) + 0
end

1: In the case of \(p_i = 0\) the value of \(h\) (the second term in the sum above) is taken to be \(0\), which is consistent with the \(\lim_{p\to0^+}p\log (p) = 0\).

0.0

And the corresponding candidate leaf is H_unemployed:

H_unemployed = let
    p_default = sum(df_unemployed.default) / nrow(df_unemployed)
    p_good = 1 - p_default
    p_default * log2(1 / p_default) + p_good * log2(1 / p_good)
end

0.9182958340544893

To balance these two results, we weight them according to the amount of data (number of observations) that would fall into each leaf:

H1_employment = let
    p_emp = nrow(df_employed) / nrow(default_data)
    p_unemp = 1 - p_emp

    p_emp * H_employed + p_unemp * H_unemployed
end

0.34436093777043353

The information gain for splitting the tree using employment status is the difference between the root entropy and the entropy of the employment split:

IG_employment = H0 - H1_employment

0.4669171866886993

We could repeat the analysis to determine the information gain if we were to split the tree based on having good credit. However, given that there are only two attributes we can already conclude that employed is a better attribute to split the data on. This is because the information gain of IG_employment (0.467) is the majority of the overall entropy H0 (0.811). Entropy is always additive and you cannot have negative entropy, therefore no other attribute could have greater information gain. This also matches our intuition when looking at Table 14.3 as the eye can spot a higher correlation between employed and default than good_credit and default.

The above example demonstrates how we can use information theory to create more optimal inferences on data.

14.3.3 Maximum Entropy Distributions

Why is information theory a useful concept? Many financial models are statistical in nature and concepts of randomness and entropy are foundational. For example, when trying to estimate parameter distributions or assume a distribution for a random process you can lean on information theory to use the most conservative choice: the distribution with the highest entropy given known constraints. These distributions are referred to as maximum entropy (maxent) distributions.

In many real-world problems, we seek a distribution that is “least biased” or “most conservative” given certain known information (such as a mean or a range). Maximum entropy distributions accomplish this by spreading out probability as widely as possible under the constraints we know to be true (e.g., average value, bounded domain). They make no additional assumptions beyond those constraints, thereby avoiding unwarranted specificity.

By using a maxent distribution, we effectively acknowledge that everything else about the system’s behavior is unknown and should remain as “random” (unconstrained) as possible. This is a powerful principle because it aligns well with real-world modeling scenarios where we might know just a few key facts—like a process’s average rate or finite variance—but have no strong reason to assume anything else about its structure.

Many of the most common probability distributions (Normal, Exponential, Gamma, etc.) can be derived by applying the maximum entropy principle under simple, natural constraints: - Normal distribution when the mean and variance are finite but otherwise unconstrained.
- Exponential distribution when we know only that the mean is positive, with outcomes over ([0,)).
- Uniform distribution when outcomes are bounded within a certain interval, and we have no further information about how likely each point is.

Additional maxent distributions and associated constraints are listed in Table 14.4. Those distributions arise again and again in nature because of the second law of thermodynamics - nature likes to have constantly increasing entropy and therefore it should be no surprise (random) processes that maximize entropy pop up all over the place. The second law of thermodynamics in physics is an analogy: it states that a closed system tends to move toward higher entropy states. Similarly, in purely probabilistic settings, when few constraints are imposed, the system’s “natural” distribution tends to be the one that maximizes entropy.

Maxent distributions have practical modeling use:

Conservative Assumptions: Using a maxent distribution guards against over-fitting or adding hidden assumptions. It essentially says: “Given only these constraints, let the data spread out in the most uniform (least structured) way consistent with what I know.”
Simplicity and Clarity: It’s often easier to justify a maxent model to stakeholders or regulators. If you only know a mean and a variance, a Normal distribution may be the least-biased fit. If you only know a mean and posit that values must be positive, the Exponential distribution is your maxent choice.
Built-In Neutrality: In financial or actuarial contexts, adopting a maxent framework can prevent overly optimistic or pessimistic models. By sticking to the distribution with the fewest assumptions, the risk analysis remains transparent and more robust to model mis-specification.

Some discussion of maximum entropy distributions in the context of risk assessment is available in an article by Duracz³.

Table 14.4: Maximum Entropy Distributions and the conditions under which they are applicable. For example, if you know that a probability must be continuous and have a positive mean (and can’t be normalized), then the MED is the Exponential Distribution.

Constraint	Discrete Distribution	Continuous Distribution
Bounded range	Uniform (discrete)	Uniform (continuous)
Bounded range (0 to 1) with information about the mean or variance		Beta
Mean is finite, two possible values	Binomial
Mean is finite and positive	Geometric	Exponential
Mean is finite and range is > zero		Gamma
Mean and Variance are finite		Gaussian (Normal)
Positive and equal mean and variance	Poisson

As an example, let’s look at processes that behave like the Gaussian (Normal) distribution.

14.3.3.1 Processes that give rise to certain distributions

A random walk can be viewed as the cumulative impact of nudges pushing in opposite directions. This behavior culminates in the random, terminal position being able to be described by a Gaussian distribution. The center of a Gaussian distribution is “thick” because there are many more ways for the cumulative total nudges to mostly cancel out, while its increasingly rare to end up further and further from the starting point (mean). The distribution then spreads out as flat (randomly) as it can while still maintaining the constraint of having a given, finite variance. Any other continuous distribution that has the same mean and variance has lower entropy than the Gaussian.

Table 14.5: Underlying processes create typical probability distributions. That there is significant overlap with the distributions in Section 14.3.3 is not a coincidence.

Process	Distribution of Data	Examples
Many additive pluses and minus that move an outcome in one dimension	Normal	Sum of many dice rolls, errors in measurements, sample means (Central Limit Theorem)
Many multiplicative pluses and minus that move an outcome in one dimension	Log-normal	Incomes, sizes of cities, stock prices
Waiting times between independent events occurring at a constant average rate	Exponential	Time between radioactive decay events, customer arrivals
Discrete trials each with the same probability of success, counting the number of successes	Binomial	Coin flips, defective items in a batch
Discrete trials each with the same probability of success, counting the number of trials until the first success	Geometric	Number of job applications until getting hired
Continuous trials each with the same probability of success, measuring the time until the first success	Exponential	Time until a component fails, time until a sales call results in a sale
Waiting time until the r-th event occurs in a Poisson process	Gamma	Time until the 3rd customer arrives, time until the 5th defect occurs

Probability Distributions

There are a lot of specialized distributions. There are lists of distributions you can find online or in references such as Leemis and McQueston (2008) which has a full-page network diagram of the relationships.

The information-theoretic and Bayesian perspective on it is to eschew memorization of a bunch of special cases and statistical tests. If you pull up the aforementioned diagram in Leemis and McQueston (2008), you can see just a handful of distributions that have the most central roles in the universe of distributions. Many distributions are simply transformations, limiting instances, or otherwise special cases of a more fundamental distribution. Instead of trying to memorize a bunch of probability distributions, it’s better to think critically about:

The fundamental processes that give rise to the randomness we are interested in modeling.
Transformations of the data to make it nicer to work with, such as translations, scaling, or other non-destructive changes.

Then when you encounter an unusual dataset, you don’t need to comb the depths of Wikipedia to find the perfect probability distribution for that situation.

14.3.3.2 Additive and Multiplicative Processes

Table 14.5 describes some examples, let us discuss further what it means to have a process that arises via an additive vs multiplicative effect⁴. Additive processes result in a normal distribution while multiplicative processes give rise to a log-normal distribution⁵.

An outcome is additive and results in a normal distribution if it’s the sum or difference of multiple independent processes. Examples of this include:

Rolling multiple dice and taking their sum.
A random walk along the natural numbers wherein with equal probability you take a step left or right.
Calculating the arithmetic mean of samples (the Central Limit Theorem).

However, many processes are multiplicative in nature. For example the population density of cities is distributed in a log-normal fashion. If we think about the factors that contribute to choice of place to live, we can see how these factors multiply: an attractive city might make someone 10% more likely to move, a city with water features 15% more likely, high crime 30% less likely, etc. These forces combine in a multiplicative way in the generative process of deciding where to move. In finance, many price processes are considered multiplicative.

Tip 14.1: Logarithms

The logarithm of a geometric process transforms the outcomes into “log-space”. The information is the same, but is often a more convenient form for the analysis. That is, if:

\[ Y = x_1 \times x_2 \times \ldots \times x_i \]

Then,

\[ log(Y) = log(x_1) + log(x_2) + \ldots + \log(x_i) \]

This is effectively the transformation that gives rise to the Normal versus Log-Normal distribution.

In the context of computational thinking:

First, we should think about how to transform data or modeling outcomes into a more convenient format. The log transform doesn’t eliminate any information but may map the information into a shape that is easier for an optimizer or Monte Carlo simulation to explore.

Second, per Chapter 5, floating point math is a lossy transformation of real numbers into a digital computer representation. Some information (in the literal Shannon information sense) is lost when computing and this tends to be worst with very small real numbers, such as those we encounter frequently in probabilities and likelihoods. Logarithms map very small numbers into negative numbers that don’t encounter the same degree of truncation error that tiny numbers do

Third, modern CPUs are generally much faster at adding or subtracting numbers than multiplying or dividing. Therefore working with the logarithm of processes may be computationally faster than the direct process itself.

14.4 Bayes’ Rule

With some of the foundational concepts laid down, we now turn to the perpetual challenge of attempting to make inferences and predictions given a set of data. We covered basic information theory, probability distributions, log transformations, and random processes because modern statistical analysis relies heavily on those concepts and techniques. We’ll introduce Bayes’ Rule, but then analysis beyond trivial applications one will typically quickly encounter challenges posed by those ideas.

The remainder of this chapter will re-introduce Bayes’ Rule and then build up modeling applications that illustrate some core concepts of applying Bayes Rule to complex, data-intensive problems.

14.4.1 Bayes’ Rule Formula

The minister and statistician Thomas Bayes derived a relationship of conditional probabilities that we today know as Bayes’ Rule. Laplace⁶ furthered the notion, and developed the modern formulation. commonly written as:

\[ P(H|D) = \frac{P(D|H) \times P(H)}{P(D)} \]

The components of this are:

\(P(H∣D)\) is the conditional probability of event \(H\) occurring given that \(D\) is true.
\(P(D∣H)\) is the conditional probability of event \(D\) occurring given that \(H\) is true.
\(P(H)\) is the prior probability of event \(H\).
\(P(D)\) is the prior probability of event \(D\).

If we take the following:

\(D\) is the available data
\(H\) is our hypothesis

Then we can draw conclusions about the probability of a hypothesis being true given the observed data. When thought about this way, Bayes’ rule is often described as:

\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \]

This is a very useful framework, which we’ll return to more completely in Section 14.5. First, let’s look at combining information theory and Bayes’ rule in an applied example.

14.4.2 Model Selection via Likelihoods

Let’s say that we have competing hypothesis about a data generating process, such as: “given a set of data representing risk outcomes, what distribution best fits the data”? We will not be able to determine an absolute probability of a model given the data, but amazingly we can determine relative probability of models given the data. This is powerful because often times one of the most difficult modeling tasks is to select a model formulation - and Bayes gives us a powerful tool to help choose.

We can compare these models using Bayes’ rules by observing the following: Suppose we have two models, \(H_1\) and \(H_2\), and we want to compare their likelihoods given the observed data, D. We can use Bayes’ rule to calculate the posterior probability of each model:

\[ P(H_1\|D) = (P(D\|H_1) \times P(H_1)) / P(D) \]

\[ P(H_2\|D) = (P(D\|H_2) \times P(H_2)) / P(D) \]

Where:

\(P(H_1|D)\) and \(P(H_2|D)\) are the posterior probabilities of models \(H_1\) and \(H_2\), respectively, given the data \(D\).
\(P(D|H_1)\) and \(P(D|H_2)\) are the likelihoods of the data \(D\) under models \(H_1\) and \(H_2\), respectively.
\(P(H_1)\) and \(P(H_2)\) are the prior probabilities of models \(H_1\) and \(H_2\), respectively.
\(P(D)\) is the marginal likelihood of the data, which serves as a normalizing constant.

To compare the likelihoods of the two models, we can calculate the ratio of their posterior probabilities, known as the Bayes factor, \(BF\):

\[ BF = \frac{P(H_1|D)}{P(H_2|D)} \]

Substituting the expressions for the posterior probabilities from Bayes’ rule, we get:

\[ BF = \frac{P(D|H_1) \times P(H_1)} {P(D|H_2) \times P(H_2)} \]

The marginal likelihood \(P(D)\) cancels out since it appears in both the numerator and denominator. If we assume equal prior probabilities for the models, i.e., \(P(H_1)\) = \(P(H_2)\), then the Bayes factor simplifies to the likelihood ratio: \[ BF = \frac{P(D|H_1)}{P(D|H_2)} \]

The Bayes factor then is a statement about the relative probability of two competing models for the given data. We can interpret the results as:

If \(BF > 1\), the data favor model \(H_1\) over model \(H_2\).
If \(BF < 1\), the data favor model \(H_2\) over model \(H_1\).
If \(BF = 1\), the data do not provide evidence in favor of either model.

In practice, the likelihoods \(P(D|H_1)\) and \(P(D|H_2)\) are often calculated using the probability density or mass functions of the models, evaluated at the observed data points. The prior probabilities \(P(H_1)\) and \(P(H_2)\) can be assigned based on prior knowledge or assumptions about the models. By comparing the likelihoods of the models using the Bayes factor, we can quantify the relative support for each model given the observed data, while taking into account the prior probabilities of the models.

Another way of interpreting this is the evaluation of which model has the higher likelihood given the data.

Null Hypothesis Statistical Test

Null Hypothesis Statistical Tests (NHST) is the idea of trying to statistically support an alternative hypothesis over a null hypothesis. The support in favor of alternative versus the null is reported via some statistical power, such as the p-value (the probability that the test result is as, or more extreme, than the value computed). The idea is that there’s some objective way to push science towards greater truths and NHST was seen as a methodology that avoided the subjectivity of the Bayesian approach. However, while pure in intention, the NHST choices of both null hypothesis and model contain significant amounts of subjectivity! There is subjectivity in the null hypothesis, data collection methodologies, study design, handling of missing data, choice of data not to include, which statistical tests to perform, and interpretation of relationships.

We might as well call the null hypothesis a prior and stop trying to disprove it absolutely. Instead: focus on model comparison, model structure, and posterior probabilities of the competing theories.

Over 100 statistical tests have been developed in service of NHST Lewis (2013), but it’s widely viewed now that a focus on NHST has led to worse science due to a multitude of factors, such as:

“P-hacking” or trying to find subsets of data which can (often only by chance) support rejecting some null.
Cognitive anchoring to the importance of a p-value of 0.05 or less – why choose that number versus 0.01 or 0.001 or 0.49?
Bias in research processes where one may stop data collection or experimentation after achieving a favorable test result.
Inappropriate application of the myriad of statistical tests.
Focus on p-values rather than effects that simply matter more or have greater effect.
- For example, which is of more interest to doctors? A study indicating a 1 in a billion chance of serious side effect , with p-value of 0.0001 or a study indicating a 1 in 3 chance with p-value 0.06? Many journals would only publish the former study, even though the latter study intuitively suggests a potentially more risky drug.
Difficulty to determine causal relationships.

The authors of this book recommend against basic NHST and memorization of statistical tests in favor of principled Bayesian approaches. For the actuarial readers, NHST is analogous to traditional credibility methods (of which the authors also prefer more modern statistical approaches).

14.4.2.1 Example: Rainfall Risk Model Comparison

The example we’ll look at relates to the annual rainfall totals for a specific location in California⁷, which could be useful for insuring flood risk or determining the value of a catastrophe bond. Acknowledging that we are attempting to create a geocentric model⁸ instead of a scientifically accurate weather model, we narrow the problem to finding a probability distribution that matches the historical rainfall totals.

Our goal is to recommend a model that best fits the data and justify that recommendation quantitatively. Before even looking at the data, Table 14.6 shows three competing models based on thinking about the real-world outcome we are trying to model. These three are chosen for the increasingly sophisticated thought process that might lead the modeler to recommend them - but which is supportable by the statistics?

Table 14.6: Three alternative hypothesis about the distribution of annual rainfall totals.

Hypothesis	Process	Possible Rationale
\(H_1\)	A Normal (Gaussian) distribution	The sum of independent rainstorms creates annual rainfall totals that are normally distributed
\(H_2\)	A LogNormal distribution	Since it’s normal-ish, but skewed and can’t be negative
\(H_3\)	A Gamma Distribution	Since rainfall totals would be the sum of exponentially-distributed independent rainfall events

Note

In the literature for rainfall modeling, \(H_3\) (the Gamma distribution) is known as the “Log-Pearson Type III distribution”. It’s actually recommended by the US Corps of Army Engineers as the recommended way to model rainfall totals.

We are able to avoid learning and memorizing specialty distributions and statistical tests, which are so common in Frequentist approaches. First-principles reasoning on the probabilistic processes can get one to a reasonable hypothesis, comparable to ‘specialist’ knowledge one would encounter in the literature for a particular applied field.

Here’s the data:

rain = [
    39.51, 42.65, 44.09, 41.92, 28.42, 58.65, 30.18, 64.4, 29.02,
    37.00, 32.17, 36.37, 47.55, 27.71, 58.26, 36.55, 49.57, 39.84,
    82.22, 47.58, 51.18, 32.28, 52.48, 65.24, 51.12, 25.03, 23.27,
    26.11, 47.3, 31.8, 61.45, 94.95, 34.8, 49.53, 28.65, 35.3, 34.8,
    27.45, 20.7, 36.99, 60.54, 22.5, 64.85, 43.1, 37.55, 82.05, 27.9,
    36.55, 28.7, 29.25, 42.32, 31.93, 41.8, 55.9, 20.65, 29.28, 18.4,
    39.31, 20.36, 22.73, 12.75, 23.35, 29.59, 44.47, 20.06, 46.48,
    13.46, 9.34, 16.51, 48.24
];

Plotted, we see some of the characteristics that align with our prior assumptions and knowledge about the system itself, such as: the data being constrained to positive values and a skew towards having some extreme weather years with lots of rainfall.

using CairoMakie
hist(rain)

Figure 14.1: Annual rainfall totals for a specific location in California.

We will show the likelihood of the three models after deriving the maximum likelihood (MLE), which is simply finding the parameters that maximize the calculated likelihood. In general, this can be accomplished by an optimization routine, but here we will just use the functions built into Distributions.jl:

using StatsBase
using Distributions

# H1: Normal
h1_dist = fit_mle(Normal, rain)

# H2: LogNormal - parameters are from fitting a Normal to the log of the data
h2_params = fit_mle(Normal, log.(rain))
h2_dist = LogNormal(params(h2_params)...)

# H3: Gamma
h3_dist = fit_mle(Gamma, rain)

@show h1_dist
@show h2_dist
@show h3_dist;

h1_dist = Distributions.Normal{Float64}(μ=38.91442857142857, σ=16.643603630714306)
h2_dist = Distributions.LogNormal{Float64}(μ=3.5690550009062663, σ=0.44148379736539156)
h3_dist = Distributions.Gamma{Float64}(α=5.577859182601731, θ=6.976588561577376)

Let’s look at the likelihoods by applying the maximum likelihood distribution to the observed data. For the practical reasons described in Tip 14.1, we will compare the the log-likelihoods to maintain convention with what you’d likely see or deal with in practice. Taking the log of the likelihood does not change the ranking of the likelihoods.

h1_loglik = sum(log.(pdf.(h1_dist, rain)))
h2_loglik = sum(log.(pdf.(h2_dist, rain)))
h3_loglik = sum(log.(pdf.(h3_dist, rain)))

@show h1_loglik
@show h2_loglik
@show h3_loglik

h1_loglik = -296.16751566478115
h2_loglik = -291.9265702808178
h3_loglik = -291.08855184856486

-291.08855184856486

Note

Instead of sum(log.(pdf.(distribution,data))), we could call loglikelihood(distribution,data). We show the former for the explicitness of the calculation for pedagogy. In real use, loglikelihood should be preferred because it has additional logic to improve numerical stability for floating point operations.

The results indicate that the LogNormal and the Gamma model for rainfall distribution are very superior to the Normal model, consistent with the visual inspection of the quantiles in Figure 14.2. We reach that conclusion by noting how much more likely the latter two are, as the likelihoods of -291.93 and -291.09 is higher (less negative) than -296.17⁹. Higher (less negative) log-likelihood values indicate a better fit.

let x = rain
    # Use the fitted distributions from the previous step
    range = 1:0.1:100
    fig, ax, _ = lines(range, cdf.(h1_dist, range), label="Normal", axis=(xgridvisible=false, ygridvisible=false,))
    lines!(ax, range, cdf.(h2_dist, range), label="LogNormal")
    lines!(ax, range, cdf.(h3_dist, range), label="Gamma")
    
    # Plot empirical CDF of the data
    lines!(quantile.(Ref(x), 0.01:0.01:0.99), 0.01:0.01:0.99, label="Data", color=(:black, 0.6), linewidth=3)
    
    fig[1, 2] = Legend(fig, ax, "Model", framevisible=false)
    fig
end

We evaluated the likelihood at a single point estimate of the parameters, but a true posterior probability of the parameters of the distributions will be represented by a distribution rather than a point. The rest of chapter will describe how to express the posterior probabilities of the the parameters for \(H_1\), \(H_2\), and \(H_3\) using Bayesian statistical methods.

14.5 Modern Bayesian Statistics

14.5.1 Background

Bayesian statistics is generally not taught in undergraduate statistics. Bayes’ rule is introduced, basic probability exercises are assigned, and then statistics moves on to a curriculum of regression and NHSTs (of the Frequentist school). Why is the applied practice of statistics currently gravitating towards Bayesian approaches? There are both philosophical and practical reasons why.

14.5.1.1 Philosophical Motivations

Philosophically, one of the main reasons why Bayesian thinking is appealing is its ability to provide straightforward interpretations of statistical conclusions.

For example, when estimating an unknown quantity, a Bayesian probability interval can be directly understood as having a high probability of containing that quantity. In contrast, a Frequentist confidence interval is typically interpreted only in the context of a series of similar inferences that could be made in repeated practice. In recent years, there has been a growing emphasis on interval estimation rather than hypothesis testing in applied statistics. This shift has strengthened the Bayesian perspective since it is likely that many users of standard confidence intervals intuitively interpret them in a manner consistent with Bayesian thinking.

“The fallacy of placing confidence in confidence intervals”

“The fallacy of placing confidence in confidence intervals” is the title of an article (Morey_Hoekstra_Rouder_Lee_Wagenmakers_2016?) describing the issues with confidence intervals, with one of the primary issues being what they call the “Fundamental Confidence Fallacy,” or FCF.

The FCF is the belief that if you have a \(X\%\) confidence interval, that the probability that the true value of interest lies within that interval is \(X\%\). This is false, and an abbreviated explanation of why is as follows:

An \(X\%\) confidence interval refers to a procedure that results in a range \(R\) wherein \(X\%\) of the time the true value of interest lies within \(R\).
However, once the data is observed we can know with higher probability than \(X\) whether the true value lies within that range.

Here’s an example taken from the article referenced above: we wish to estimate the mean of a continuous random variable. We observe two data points \(y_1\) and \(y_2\). If \(y_1 < y_2\) then we say our interval is \((-\infty,\infty)\) otherwise the interval is empty. Our credibility procedure creates an interval for which 50% of the time it contains the true value. However, once we’ve observed the data and created the interval, then “post-data” we can tell with certainty whether our interval contains the variable of interest. We can only say “pre-data”, or pre-observations, that the credibile interval is really \(X\%\) probable to contain the right value. Similar (and other) issues arises with more “real world” examples of confidence intervals.

In contrast, the Bayesian procedure of estimating an interval using the posterior distribution of the data can be interpreted as an interval for which you can say that you believe the true value is contained within the interval. Typically, this is referred to as a credibility interval to distinguish from a confidence interval.

Another meaningful way to understand the contrast between Bayesian and Frequentist approaches is through the lens of decision theory, specifically how each view treats the concept of randomness. This perspective pertains to whether you regard the data being random or the parameters being random.

Frequentist statistics treats parameters as fixed and unknown, and the data as random — that data you collect is but one realization of an infinitely repeatable random process. Consequently, Frequentist procedures, like hypothesis testing or confidence intervals, are generally based on the idea of long-run frequency or repeatable sampling.

Conversely, Bayesian statistics turns this on its head by treating the data as fixed — after all, once you’ve collected your data, it’s no longer random but a fixed observed quantity. Parameters, which are unknown, are treated as random variables. The Bayesian approach then allows us to use probability to quantify our uncertainty about these parameters.

The Bayesian approach tends to align more closely with our intuitive way of reasoning about problems. Often, you are given specific data and you want to understand what that particular set of data tells you about the world. You’re likely less interested in what might happen if you had infinite data, but rather in drawing the best conclusions you can from the data you do have.

14.5.1.2 Practical Motivations

Practically, recent advances in computational power, algorithm development, and open-source libraries have enabled practitioners to adapt the Bayesian workflow.

For most real-world problems, deriving the posterior distribution is analytically intractable and computational methods must be used. Advances in raw computing power only in the 1990’s made non-trivial Bayesian analysis possible, and recent advances in algorithms have made the computations more efficient. For example, one of the most popular algorithms, NUTS, was only published in the 2010’s.

Many problems require the use of compute clusters to manage runtime, but if there is any place to invest in understanding posterior probability distributions, it’s financial companies trying to manage risk!

The availability of open-source libraries, such as Turing.jl, PyMC3, and Stan provide access to the core routines in an accessible interface. To get the most out of these tools requires the mindset of computational thinking described in this book - understanding model complexity, model transformations and structure, data types and program organization, etc.

14.5.1.3 Advantages of the Bayesian Approach

The main advantages of this approach over traditional actuarial techniques are:

Focus on distributions rather than point estimates of the posterior’s mean or mode. We are often interested in the distribution of the parameters and a focus on a single parameter estimate will understate the risk distribution.
Model flexibility. A Bayesian model can be as simple as an ordinary linear regression, but as complex as modeling a full insurance mechanics.
Simpler mental model. Fundamentally, Bayes’ theorem could be distilled down to an approach where you count the ways that things could occur and update the probabilities accordingly.
Explicit Assumptions.: Enumerating the random variables in your model and explicitly parameterizing prior assumptions avoids ambiguity of the assumptions inside the statistical model.

14.5.1.4 Challenges with the Bayesian Approach

With the Bayesian approach, there are a handful of things that are challenging. Many of the listed items are not unique to the Bayesian approach, but there are different facets of the issues that arise.

Model Construction. One must be thoughtful about the model and how variables interact. However, with the flexibility of modeling, you can apply (actuarial) science to makes better models!
Model Diagnostics. Instead of R^2 values, there are unique diagnostics that one must monitor to ensure that the posterior sampling worked as intended.
Model Complexity and Size of Data. The sampling algorithms are computationally intensive - as the amount of data grows and model complexity grows, the runtime demands cluster computing.
Model Representation. The statistical derivation of the posterior can only reflect the complexity of the world as defined by your model. A Bayesian model won’t automatically infer all possible real-world relationships and constraints.

Subjectivity of the Priors?

There are two ways one might react to subjectivity in a Bayesian context: It’s a feature that should be embraced or it’s a flaw that should be avoided.

14.5.1.5 Subjectivity as a Feature

A Bayesian approach to defining a statistical model is an approach that allows for explicitly incorporating professional judgment. Encoding assumptions into a Bayesian model forces the actuary to be explicit about otherwise fuzzy predilections. The explicit assumption is also more amenable to productive debate about its merits and biases than an implicit judgmental override.

14.5.1.6 Subjectivity as a Flaw

Subjectivity is inherent in all useful statistical methods. Subjectivity in traditional approaches include how the data was collected, which hypothesis to test, what significant levels to use, and assumptions about the data-generating processes.

In fact, the “objective” approach to null hypothesis testing is so prone to abuse and misinterpretation that in 2016, the American Statistical Association issued a statement intended to steer statistical analysis into a “post p<0.05 era.” That “p<0.05” approach is embedded in most traditional approaches to actuarial credibility¹⁰ and therefore should be similarly reconsidered.

14.5.2 Implications for Financial Modeling

Like Bayes’ Formula itself, another aspect of financial literature that is taught but often glossed over in practice is the difference between process risk (volatility), parameter risk, and model formulation risk. When performing analysis that relies on stochastic results, in practice typically only process/volatility risk is assessed.

Bayesian statistics provides the tools to help financial modelers address parameter risk and model formulation. The posterior distribution of parameters derived is consistent with the observed data and modeled relationships. This posterior distribution of parameters can then be run as an additional dimension to the risk analysis.

Additionally, best practices include skepticism of the model construction itself, and testing different formulation of the modeled relationships and variable combinations to identify models which are best fit for purpose. Tools such as Information Criterion, posterior predictive checks, Bayes factors, and other statistical diagnostics can inform the actuary about trade-offs between different choices of model.

Bayesian Versus Machine Learning

Machine learning (ML) is fully compatible with Bayesian analysis - one can derive posterior distributions for the ML parameters like any other statistical model and the combination of approaches may be fruitful in practice.

However, to the extent that actuaries have leaned on ML approaches due to the shortcomings of traditional actuarial approaches, Bayesian modeling may provide an attractive alternative without resorting to notoriously finicky and difficult-to-explain ML models. The Bayesian framework provides an explainable model and offers several analytic extensions beyond the scope of this introductory chapter:

Causal Modeling: Identifying not just correlated relationships, but causal ones, in contexts where a traditional designed experiment is unavailable.
Bayes Action: Optimizing a parameter for, e.g., a CTE95 level instead of a parameter mean.
Information Criterion: Principled techniques to compare model fit and complexity.
Missing data: Mechanisms to handle the different kinds of missing data.
Model averaging: Posteriors can be combined from different models to synthesize different approaches.
Credibility Intervals: A posterior representation around the likely range of values for parameters of interest.

14.5.3 Basics of Bayesian Modeling

A Bayesian statistical model has four main components to focus on:

Prior encoding assumptions about the random variables related to the problem at hand, before conditioning on the data.
A Model that defines how the random variables give rise to the observed outcome.
Data that we use to update our prior assumptions.
Posterior distributions of our random variables, conditioned on the observed data and our model

While this is simply stating Bayes’ formula in words, it’s also the blueprint for a workflow to implement more advanced Bayesian methods.

Having defined a prior assumption, selected a model, and collected our data, the computation of the posterior can be the most challenging. The workflow involves computationally sampling the posterior distribution, often using a technique called Markov Chain Monte-Carlo (MCMC). The result is a series of values that are sampled statistically from the posterior distribution. Introducing this process is the focus of the rest of this chapter.

14.5.4 Markov-Chain Monte Carlo

Computing the posterior distribution for most model parameters is analytically intractable. However, we can probabilistically sample from the posterior distribution and achieve an approximation of the posterior distribution. MCMC samplers, as they are called, do this by moving through the parameter space (the set of possible values for the parameters) in a special way. The statistical marvel is that they travel to different points in proportion to the posterior probability. It is a “Markov-Chain” because the probability of the next point’s location is influenced by the prior sampling point’s location.

14.5.4.1 Example: MCMC from Scratch

Here is a simple example demonstrated with one of the oldest MCMC algorithm, called Metropolis-Hastings. The general idea is this:

Start at an arbitrary point and make that the current_state.
Propose a new point which is the current_state plus some movement that comes from a random distribution, proposal_dist.
Calculate the likelihood ratio of the proposed versus current point (acceptance_ratio below).
Draw a random number - if that random number is less than the acceptance_ratio, then move to that new point. Otherwise do not move.
Repeat steps 2-4 until the distribution of points converges to a stable posterior distribution.

This gets us what we desire because the resulting distribution of samples has frequency that’s proportional to the posterior distribution.

We will try to find the posterior of an arbitrary set of normally distributed asset returns. We set the true, (unobserved in reality) values for \(\mu\) and \(\sigma\) and then draw 250 generated observations:

# In reality, we don't observe the parameters 
# we are interested in determining the values for.
σ = 0.15
μ = 0.1


n_observations = 250
return_dist = Normal(μ,σ)
returns = rand(return_dist,n_observations)

# plot the distribution of returns
μ_range = LinRange(-0.5, 0.5, 400)
σ_range = LinRange(0.0, 3.0, 400)

f = Figure()
ax1 = Axis(f[1,1],title="True Distribution of Returns")
ax2 = Axis(f[2,1],title="Simulated Outcomes",xlabel="Return")
plot!(ax1,return_dist)
vlines!(ax1,[μ],color=(:black,0.7))
text!(ax1,μ,0;text="mean ($μ)",rotation=pi/2)
hist!(ax2,returns)
vlines!(ax2,[mean(returns)],color=(:black,0.7))
text!(ax2,mean(returns),0;text="mean ($(round(mean(returns);digits=3)))",rotation=pi/2)

linkxaxes!(ax1,ax2)

f

The target probability densities which we will attempt to infer via MCMC.

Having generated sample data, we will next define a probability distribution for the random step that we take from the current_point on the Markov chain. We choose a 2D Gaussian for this, since the parameter space to explore is two-dimensional ( \(\mu\) and \(\sigma\)). The proposal_std controls how big of a movement is taken at each step.

# Define the proposal step distribution
proposal_std = 0.05
proposal_dist = Normal(0,proposal_std)

Distributions.Normal{Float64}(μ=0.0, σ=0.05)

We next define how many steps we want the chain to sample for, and implement the algorithm’s main loop containing the logic in the steps above.

# MCMC parameters
num_samples = 5000
burn_in = 500

# Define priors
μ_prior = Normal(0, 0.25)
σ_prior = Gamma(0.5,1.0)

# Initialize the Markov chain
μ_current, σ_current =  0.0, 0.25
current_prob = sum(logpdf(Normal(μ_current, σ_current), r) for r in returns) + 
               logpdf(μ_prior, μ_current) + 
               logpdf(σ_prior, σ_current)


# The 'chain' will store all samples, including the burn-in phase.
# We will discard the first 'burn_in' samples later during analysis.
chain = zeros(num_samples, 2)

count = 0

# MCMC sampling loop
while count < num_samples
    
    # Generate a new proposal
    μ̇, σ̇ = μ_current + rand(proposal_dist), σ_current + rand(proposal_dist)
    if σ̇ > 0 
        
        # Calculate the acceptance ratio
        
        proposal_prob = sum(logpdf(Normal(μ̇, σ̇), r) for r in returns) +
                           logpdf(μ_prior, μ̇) + logpdf(σ_prior, σ̇)
        log_acceptance_ratio = proposal_prob - current_prob
        
        # Accept or reject the proposal
        if log(rand()) < log_acceptance_ratio
            μ_current, σ_current = μ̇, σ̇
            current_prob = proposal_prob
        end
        
        # Store the current state as a sample
        count += 1
        chain[count, :] .= μ_current, σ_current
    else
        # skip because σ can't be negative
    end
end

chain

5000×2 Matrix{Float64}:
 0.0        0.25
 0.0        0.25
 0.0        0.25
 0.0        0.25
 0.0        0.25
 0.0        0.25
 0.0396623  0.263393
 0.0263357  0.245004
 0.0263357  0.245004
 0.0716547  0.227808
 ⋮          
 0.101525   0.158281
 0.101525   0.158281
 0.101525   0.158281
 0.101525   0.158281
 0.101525   0.158281
 0.101525   0.158281
 0.101525   0.158281
 0.115062   0.162088
 0.115062   0.162088

The resulting chain contains a list of points that the algorithm has moved along during the sampling process. Note that there is a burn-in parameter. This is because we want the chain iterate long enough to be effectively independent of both (1) the starting point for the sample, and (2) that different chains are effectively independent.

After having performed the sampling, we can now visualize the chain versus the target_distribution. A few things to note:

The red line indicates the “warm up” or “burn-in” phase and we do not consider that as part of the sampled chain because those values are too correlated with the arbitrary starting point.
The blue line indicates the path traveled by the Metropolis-Hasting algorithm. Long asides into low-probability regions are possible, but in general the path will traverse areas in proportion to the probability of interest.

# Plot the chain
let 
    f = Figure()
    
    # μ lines
    ax1 = Axis(f[1, 1], ylabel="σ", xticklabelsvisible=false)

    # burn in lines
    scatterlines!(ax1,chain[1:burn_in,1], 
                      chain[1:burn_in,2],
                      color=(:red,0.1),
                      markercolor=(:red,0.1))

    # sampled lines
    scatterlines!(ax1,chain[burn_in+1:end,1],
                      chain[burn_in+1:end,2],
                      color=(:blue,0.1),
                      markercolor=(:blue,0.1))

    # μ histogram
    ax2 = Axis(f[2,1],xlabel="μ", yticklabelsvisible=false)
    hist!(ax2,chain[burn_in+1:end,1],color=:blue)
    linkxaxes!(ax1, ax2)

    # σ histogram
    ax3 = Axis(f[1,2],xticklabelsvisible=false)
    hist!(ax3,chain[burn_in+1:end,2],color=:blue, direction=:x)
    linkyaxes!(ax1, ax3)

    f
end

Figure 14.3: The blue lines of the MCMC chain explore the posterior density of interest (after discarding the burn-in samples in red). Note that locations where the sampler remained longer (rejected more proposals) show up as darker points.

In this example, \(\mu\) and \(\sigma\) are independent, but if there were a correlation (such as when \(\mu\) were higher, \(\sigma\) were also higher) then the sampler would pick up on this, and we would see a skew in the plotted chain.

The point of this short, ground-up introduction to MCMC is that the technique is not magic by demonstrating that we could do-it-from-scratch with small amounts of code. The challenge is that it’s just computationally intensive. Modern libraries perform the sampling for you with more advanced algorithms than Metropolis-Hastings.

14.5.5 MCMC Algorithms

The Metropolis-Hasting algorithm is simple, but somewhat inefficient. Some challenges with MCMC sampling are both mathematical and computational:

Often times the algorithm will back-track (take a “U-Turn”), wasting steps in regions already explored.
The algorithm can have a very high rate of rejecting proposals if the proposal mechanism generates steps that would move the current state into a low-probability regions.
The choice of proposal distribution and parameters can greatly influence the speed of convergence. Too large of movement and key regions can be entirely skipped over, while small movements can take much longer than necessary to explore the space.
As the number of parameters grows, the dimensionality of the parameter space to explore also grows making posterior exploration much harder.
The shape of the posterior space can be more or less difficult to explore. Complex models may have regions of density that are not nicely “round” - regions may be curved, donut shaped, or disjointed.

The issues above mean that MCMC sampling is very computationally expensive for more complex examples. Compared with Metropolis-Hastings, modern algorithms (such as the No-U-Turn (NUTS)) algorithm explore the posterior distribution more efficiently by avoiding back-tracking to already explored regions and dynamically adjusting the proposals to adaptively fit the posterior. Many of them take direct influence from particle physics, with the algorithm keeping track of the energy of the current state as it explores the posterior space.

Algorithms have only brought so much relief to the modeler with finite resources and compute. There is still a lot of responsibility for modeler to design models that are computationally efficient, transformed to eliminate oddly-shaped density regions, or find the right simplifications to the analysis in order to make the problem tractable.

Note

What does it mean to transform the parameter space?

An example will be shown in Chapter 31 where we want to ensure that a binomial variable is constrained to the region \([0,1]\) but the underlying factors are allowed to vary across the entire real numbers. We use a logit (or inverse logit, a.k.a. logistic) to transform the parameters to the required probability range for the binomial outcome.

Another common transform is “Normalizing” the data to center the data around zero and to scale the outcomes such that the sample standard deviation is equal to one.

14.5.6 Rainfall Example (Continued)

We will construct a Bayesian model using the Turing.jl library. Using a battle-tested library allows us to step back from the intricacies of defining our own sampler and routine and focus on the models and analysis. The goal is to fit the parameters of one of the competing models from above in order to demonstrate an MCMC analysis workflow and essential concepts.

The first thing that we will do is use Turing’s @model macro to define a model. This has a few components:

The “model” is really just a Julia function that takes in data and relates the data to the statistical outcomes modeled.
The ~ is the syntax to either relate a parameter to a prior assumptions.
A loop (or broadcasted .~) that ties specific data observations to the random process.

Think of the @model block really as a model constructor. It isn’t until we pass data to the model that you get a fully instantiated Model type¹¹.

Here’s what defining the LogNormal model looks like in Turing. We have to specify prior distributions for LogNormal parameters.

using Turing

@model function rain_lognormal(data)
    μ ~ Normal(4.0, 1.0)
    σ ~ Exponential(2.0) 
    for i in eachindex(data)
        data[i] ~ LogNormal(μ, σ)
    end
end

m = rain_lognormal(rain)

DynamicPPL.Model{typeof(rain_lognormal), (:data,), (), (), Tuple{Vector{Float64}}, Tuple{}, DynamicPPL.DefaultContext}(rain_lognormal, (data = [39.51, 42.65, 44.09, 41.92, 28.42, 58.65, 30.18, 64.4, 29.02, 37.0  …  12.75, 23.35, 29.59, 44.47, 20.06, 46.48, 13.46, 9.34, 16.51, 48.24],), NamedTuple(), DynamicPPL.DefaultContext())

Defining the model uses the @model macro from Turing.
We presume that there will be positive rainfall and 96% of mean annual rainfall will be somewhere between \(exp(2)\) and \(exp(6)\), or 7 and 403 inches.
In a LogNormal model, 0.5 deviations covers a lot of variation in outcomes.

14.5.6.1 Setting Priors

In the example above, we used “weakly informative” priors. We constrained the prior probability to plausible ranges, knowing enough about the system of study (rainfall) that it would be completely implausible for there to be a Uniform(0,Inf) distribution of mean log-rainfall total, knowing that rain can’t fall in infinite quantities.

Admittedly, we haven’t confirmed with a meteorologist that \(exp(20)\) (485 million) inches of rain per year is impossible. But such is the beauty of the transparency of Bayesian analysis that the prior assumption is right there! Front and center and ready to be debated by other modelers! If you think that 485 million inches of rain is possible next year than you can challenge this assumption and propose another explicit alternative.

“Strongly informative” priors would be something where we want to encode a stronger assumption about the plausible range of outcomes, such as if we knew enough about the problem domain that we could tell given the location of the rainfall, we’d expect 95% of the rainfall to be between, say, 10 and 30 inches per year. Then we could constrain the prior even more than we did above.

“Uninformative” priors use only maximum entropy or uniform priors to avoid encoding other assumptions into the model.

14.5.6.2 Sampling

Analysis should begin by evaluating the prior assumptions for reasonability and coverage over possible outcomes of the process we are trying to model. The top plot in Figure 14.7 shows the modeled rainfall outcomes taking on a wide range of possible outcomes. If we had more knowledge of the system we could enforce a stronger (narrower) prior assumption to constrain the model to a smaller set of values.

The object returned is an MCMCChains structure containing the samples as well as diagnostic information. Summary information gets printed below.

chain_prior = sample(m, Prior(), 1000)

Chains MCMC chain (1000×3×1 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 0.21 seconds
Compute duration  = 0.21 seconds
parameters        = μ, σ
internals         = lp

Summary Statistics
  parameters      mean       std      mcse    ess_bulk   ess_tail      rhat    ⋯
      Symbol   Float64   Float64   Float64     Float64    Float64   Float64    ⋯

           μ    4.0009    1.0254    0.0326   1004.1610   873.7012    1.0008    ⋯
           σ    1.9860    2.0059    0.0604    967.4756   978.0497    1.0007    ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

           μ    2.0439    3.2891    4.0289    4.6373    6.0328
           σ    0.0414    0.5618    1.4293    2.8617    7.1660

Figure 14.4: Model output for the sampled prior. This isn’t running an MCMC algorithm, it’s simply taking draws from the defined prior assumptions.

Assessment of samples from the prior should include:

Confirming that the model’s behavior is reasonable that
Confirming that the model covers the range of possible data that might be observed.

The sample outcomes from the modeled prior are shown in Figure 14.7.

Next, we sample the posterior by using the No-U-Turns (NUTS) algorithm and drawing 1000 samples (not including the warm-up phase). This is the primary result we will analyze further.

chain_posterior = sample(m, NUTS(), 1000)

Chains MCMC chain (1000×14×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 2.02 seconds
Compute duration  = 2.02 seconds
parameters        = μ, σ
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   e ⋯
      Symbol   Float64   Float64   Float64    Float64    Float64   Float64     ⋯

           μ    3.5696    0.0550    0.0018   908.6817   515.2105    1.0071     ⋯
           σ    0.4511    0.0365    0.0012   915.4671   640.0229    1.0051     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

           μ    3.4614    3.5314    3.5700    3.6082    3.6737
           σ    0.3853    0.4273    0.4499    0.4748    0.5278

Figure 14.5: Model output for the sampled posterior.

14.5.6.3 Diagnostics

Before analyzing the result itself, we should check a few things to ensure the model and sampler were well behaved. MCMC techniques are fundamentally stochastic and randomness can cause an errant sampling path. Or a model may be mis-specified such that the parameter space to explore is incompatible with the current algorithm (or any known so far).

A few things we can check:

First, the ess or effective sample size which adjusts the number of samples for the degree of autocorrelation in the chain.Ideally, we would be able to draw independent samples from the posterior, but due to the Markov-Chain approach the samples can have autocorrelation between neighboring samples. We collect less information about the posterior in the presence of positive autocorrelation.

An ess greater than our sample indicates that there was less (negative) autocorrelation than we would have expected for the chain. An ess much less than the number of samples indicates that the chain isn’t sampling very efficiently but, aside from needing to run more samples, isn’t necessarily a problem.

ess(chain_posterior)

ESS
  parameters        ess   ess_per_sec 
      Symbol    Float64       Float64 

           μ   908.6817      450.7350
           σ   915.4671      454.1007

Second, the rhat (\(\hat{R}\)) is the Gelman-Rubin convergence diagnostic and its value should be very close to 1.0 for a chain that has converged properly. Even a value of 1.01 may indicate an issue and quickly gets worse for higher values.

rhat(chain_posterior)

R-hat
  parameters      rhat 
      Symbol   Float64 

           μ    1.0071
           σ    1.0051

Next, we can look at the “trace” plots for the parameters being sampled (Figure 14.6). These are sometimes called “hairy caterpillar” plots because in a healthy chain sample, we should see a series without autocorrelation and that the values bounce around randomly between individual samples.

let
    f = Figure()
    ax1 = Axis(f[1,1],ylabel="μ")
    lines!(ax1,vec(get(chain_posterior,:μ).μ.data))
    ax2 = Axis(f[2,1],ylabel="σ")
    lines!(ax2,vec(get(chain_posterior,:σ).σ.data))
    f
end

Figure 14.6: The trace plots indicate low autocorrelation which is desirable for an MCMC sample.

The ess, rhat, and trace plots all look good for our sampled chain so we can next we will analyze the results in the context of our rainfall problem.

14.5.6.4 Analysis

Let’s see how it looks compared to the data first. Figure 14.7 shows 200 samples from the prior and posterior. The prior (top) shows how wide the range of possible rainfall outcomes could be using our weakly informative prior assumptions. The bottom shows that after having learned from the data, the posterior probability of rainfall has narrowed considerably.

function chn_cdf!(axis,chain,rain)    
    n = 200
    s = sample(chain, n)
    vals = get(sample(chain, 200), [:μ, :σ])
    ds = LogNormal.(vals.μ, vals.σ)
    xs = range(1.0, stop=200.0, length=200)
    for d in ds
        lines!(axis, xs, cdf.(d, xs), color=(:gray, 0.3))
    end

    # plot the actual data
    percentiles= 0.01:0.01:0.99
    lines!(axis,quantile.(Ref(rain),percentiles),percentiles,linewidth=3)
end

let 
    f = Figure()
    ax1 = Axis(f[1,1],title="Prior", xgridvisible=false, ygridvisible=false,ylabel="Quantile")
    chn_cdf!(ax1,chain_prior,rain)

    ax2 = Axis(f[2,1],title="Posterior", xgridvisible=false, ygridvisible=false,xlabel="Annual Rainfall (inches)",ylabel="Quantile")
    chn_cdf!(ax2,chain_posterior,rain)

    linkxaxes!(ax1, ax2)

    f
end

Figure 14.7: The prior model show a wide range of possible outcomes, and the shape of the distribution is reasonable: there’s a nice ‘S’ shape to the CDF, indicating a dense region where most outcomes would fall in the PDF. The fitted posterior model (bottom) has good coverage of the observed data (shown in blue).

Comparing to the maximum likelihood analysis from before by plotting the MLE point estimate onto the marginal densities in Figure 14.8. The peak of the the posterior is referred to as the maximum a posteriori (MAP) and would be the point estimate proposed by this Bayesian analysis. However, the Bayesian way of thinking about distributions of outcomes rather than point estimates is one of the main aspects we encourage for financial modelers. Using the posterior distribution of the parameters, we can assess parameter uncertainty directly instead of ignoring it as we tend to do with point estimates.

let
    # get the parameters from the earlier MLE approach
    p = params(h2_dist)

    f = Figure()

    # plot μ posterior
    ax1 = Axis(f[1,1],title="μ posterior",xgridvisible=false)
    hideydecorations!(ax1)
    d = density!(ax1,vec(get(chain_posterior,:μ).μ.data))
    l = vlines!(ax1,[p[1]],color=:red)
    

    # plot σ posterior
    ax2 = Axis(f[2,1],title="σ posterior", xgridvisible=false)
    hideydecorations!(ax2)
    density!(ax2,vec(get(chain_posterior,:σ).σ.data))
    vlines!(ax2,[p[2]],color=:red)
    
    Legend(f[1,2],[d,l],["Posterior Density", "MLE Estimate"])

    f
end

Figure 14.8: The MLE point estimate need not necessarily align with the peak or center of posterior densities (e.g. in the case of a bimodal distribution).

14.5.6.5 Model Limitations

We have built and assessed a simple statistical model that could be used in the estimation of risk for a particular location. Nowhere in our model did we define a mechanism to capture a more sophisticated view of the world. There is no parameter for changes over time due to climate change, or inter-annual seasonality for El Niño or La Niña cycles, or any of a multitude of other real-world factors that can influence the forecasting. All we’ve defined is that there is a LogNormal process generating rainfall in a particular location. This may or may not be sufficient to capture the dynamics of the problem at hand.

Part of the benefits of the Bayesian approach is that it allows us to extend the statistical model to be arbitrarily complex in order to capture our intended dynamics. We are limited by the availability of data, computational power and time, and our own expertise in the modeling. Regardless of the complexity of the model, the same fundamental techniques and idea apply in the Bayesian approach.

14.5.6.6 Continuing the Analysis

Like any good model, you can often continue the analysis in any number of directions, such as: collecting more data, evaluating different models, creating different visualizations, making predictions about future events, creating a multi-level model that predicts rainfall for multiple related locations simultaneously, among many other threads of analysis.

Earlier we discussed model comparison. To compute a real Bayes Factor in comparing the different models, we would take the average likelihood across the posterior samples instead of just comparing the maximum likelihood points as we did earlier. There are more sophisticated tools for estimating out of sample performance of the model, or measures that evaluate a model for over-fitting by penalizing the diagnostic statistic for the model having too many free parameters. See LOO (leave-one-out) cross-validation and various “information criteria” in the resources listed in Section 14.5.8.

14.5.7 Conclusion

This chapter has attempted to make accessible the foundations of statistical inference and the modern tools and approaches available. Underlying this approach to thinking about statistical problems are informational theoretic and mathematical concepts that can be challenging to learn. This is especially true when traditional finance and actuarial curricula is not centered on the necessary computational foundations associated with modern statistical analysis. Most importantly, moving beyond single ‘best estimate’ values to embrace full probability distributions leads to richer financial analysis and more comprehensive risk assessment.

14.5.8 Further Reading

Bayesian approaches to statistical problems are rapidly changing the professional statistical field. To the extent that the actuarial profession incorporates statistical procedures, financial professionals should consider adopting the same practices. The benefits of this are a better understanding of the distribution of risk and return, results that are more interpretable and explainable, and techniques that can be applied to a wider range of problems. The combination of these things would serve to enhance best practices related to understanding and communicating about financial quantities.

Textbooks recommended by the author are:

Statistical Rethinking (McElreath)
Bayes Rules! (Johnson, Ott, Dogucu)
Bayesian Data Analysis (Gelman, et. al.)

Chi Feng has an interactive demonstration of different MCMC samplers available at: https://chi-feng.github.io/mcmc-demo/.

Log base two turns out to be the most natural representation of information content as it mimics the fundamental 0 or 1 value bit. A more complete introduction is available in “Information Theory, Inference, and Learning Algorithms” by David MacKay.↩︎
A decision tree is a classification algorithm which attempts to optimally classify an output based on if/else type branches on the input variables.↩︎
https://www.researchgate.net/publication/239752412_Derivation_of_Probability_Distributions_for_Risk_Assessment ↩︎
Multiplicative process are often referred to as “geometric”, as in “geometric Brownian motion” or “geometric mean”. Additive processes are sometimes referred to as “arithmetic”. This root of this confusing terminology appears to be due to the fact that series involving repeated multiplication were solved via geometric (triangles, angles, etc.) methods while those using sums and differences were solved via arithmetic.↩︎
See Tip 14.1.↩︎
Laplace actually deserves most of the credit, as it was he who formalized the modern notion of Bayes’ rule and cemented the mathematical formulation. Bayes just described it first, in a way that actually had almost no direct impact on math or science. See “The Theory That Would Not Die”.↩︎
https://data.ca.gov/dataset/annual-precipitation-data-for-northern-california-1944-current ↩︎
See @sec-predictive-vs-explanatory.↩︎
The values are negative because we are taking the logarithm of a number less than \(1\). The likelihoods are less than \(1\) because the likelihood is the joint (multiplicative) probability of observing each of the individual outcomes.↩︎
Note that the approach discussed here is much more encompassing than the Bühlmann-Straub Bayesian approach described in the actuarial literature.↩︎
Specifically: a DynamicPPL.Model type (PPL = Probabilistic Programming Language).↩︎