Mathematical Foundations

This chapter describes the mathematical foundations on which kafe2 is built, in particular the method of maximum likelihood and how confidence intervals are determined from the profile likelihood.

When performing a fit (as a physicist) the problem is as follows: you have some amount of measurement data from an experiment that you need to compare to one or more models to figure out which of the models - if any - provides the most accurate description of physical reality. You typically also want to know the values of the parameters of a model and the degree of (un)certainty on those values.

For very simple problems you can figure this out analytically: you take a formula and simply plug in your measurements. However, as your problems become more complex this approach becomes much more difficult - or even straight up impossible. For this reason complex problems are typically solved numerically: you use an algorithm to calculate a result that only approximates the analytical solution but is much easier to obtain. kafe2 is a tool for the latter approach.

Cost Functions

In the context of parameter estimation, a model m_i(\bm{p}) (where i is just some index) is a function of the parameters \bm{p}. During the fitting process the parameters \bm{p} are varied in such a way that the agreement between the model and the corresponding data d_i becomes “best”. This of course means that we need to somehow define “good” or “bad” agreement - we need a metric. This metric is called a loss function or cost function. It is defined in such a way that a lower cost function value corresponds with a “better” agreement between a model and our data.

The cost functions implemented by kafe2 are based on the method of maximum likelihood. The idea behind this method is to maximize the likelihood function \mathcal{L}({\bm p}) which represents the probability with which a model \bm{m}(\bm{p}) would result in the data \bm{d} that we ended up measuring:

\mathcal{L}({\bm p}) = \prod_i P(m_i({\bm p}), d_i),

where P(m_i({\bm p}), d_i) describes the probability of measuring the data point d_i given the corresponding model prediction m_i({\bm p}). This approach allows for a statistical interpretation of our fit results as we will see later. However, instead of the likelihood described above we use twice its negative logarithm, the so-called negative log-likelihood (NLL):

\mathrm{NLL} (\bm{p})
= - 2 \log \mathcal{L}({\bm p})
= - 2 \log \left( \prod_i P(m_i({\bm p}), d_i) \right)
= - 2 \sum_i \log P(m_i({\bm p}), d_i).

This transformation is allowed because logarithms are strictly monotonically increasing functions, and therefore the negative logarithm of any function will have its global minimum at the same place where the original function is maximal. The model \bm{m}({\bm p}) that minimizes the NLL will therefore also maximize the likelihood.

While the above transformation may seem nonsensical at first, there are important advantages to calculating the negative log-likelihood over the likelihood:

  • The product of the probabilities \prod_i P(m_i({\bm p}), d_i) is replaced by a sum over the logarithms of the probabilities \sum_i \log P(m_i({\bm p}), d_i). In the context of a computer program sums are preferable to products because they can be calculated more quickly and because a product of many small numbers can lead to arithmetic underflow (see the numerical sketch after this list).

  • Because the probabilities P(m_i({\bm p}), d_i) oftentimes contain exponential functions, calculating their logarithm is actually faster because it reduces the number of necessary operations.

  • Algorithms for numerical optimization minimize functions so they can be directly used to optimize the NLL.
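
To illustrate the underflow point from the first item, here is a minimal NumPy sketch (the per-point probabilities and their number are invented for illustration): a naive product of many small probabilities collapses to zero in double precision, while the sum of logarithms stays finite.

    import numpy as np

    rng = np.random.default_rng(0)
    # 1000 invented per-point probabilities, each somewhere around 1e-4.
    probabilities = rng.uniform(1e-5, 1e-3, size=1000)

    # Naive product: underflows to exactly 0.0 in double precision.
    print(np.prod(probabilities))

    # Negative log-likelihood: a perfectly well-behaved finite number.
    print(-2.0 * np.sum(np.log(probabilities)))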

As an example, let us consider the likelihood function of data d_i that follows a normal distribution around means \mu_i with standard deviations (uncertainties) \sigma_i:

\mathcal{L}
= \prod_i P(\mu_i, \sigma_i, d_i)
= \prod_i \frac{1}{\sqrt[]{2 \pi} \: \sigma_i}
  \exp \left[ - \frac{1}{2} \left( \frac{d_i - \mu_i}{\sigma_i} \right)^2 \right].

The immediate trouble that we run into with this definition is that we have no idea what the means \mu_i are - after all these are the “true values” that our data deviates from. However, since we are trying to determine the likelihood with which our data would be produced given our model we can set \mu_i = m_i({\bm p}).

Note

Conceptually uncertainties are typically associated with the data \bm{d} even though they represent deviations from the model \bm{m}({\bm p}). However, because the normal distribution is symmetric this does not have an effect on the likelihood function \mathcal{L} (as long as the uncertainties \bm{\sigma} do not depend on the model \bm{m}({\bm p})).

For the NLL we now find:

\mathrm{NLL}(\bm{p})
= -2 \log \mathcal{L}({\bm p}) \\
= - 2 \log \prod_i \frac{1}{\sqrt[]{2 \pi} \: \sigma_i}
 \exp \left[ - \frac{1}{2} \left( \frac{d_i - m_i({\bm p})}{\sigma_i} \right)^2 \right] \\
= - 2 \sum_i \log \frac{1}{\sqrt[]{2 \pi} \: \sigma_i}
 + \sum_i \left( \frac{d_i - m_i({\bm p})}{\sigma_i} \right)^2 \\
=: - 2 \log L_\mathrm{max} + \chi^2({\bm p}) .

As we can see the logarithm cancels out the exponential function of the normal distribution and we are left with two parts: the first, - 2 \log L_\mathrm{max}, depends only on the uncertainties \sigma_i but not on the model m_i({\bm p}) or the data d_i. It is the minimum value the NLL could possibly take on if the model m_i({\bm p}) were to exactly fit the data d_i. The second part can be summed up as

\chi^2 (\bm{p}) = \sum_i \left( \frac{d_i - m_i({\bm p})}{\sigma_i} \right)^2.

As mentioned before, this is much easier and faster to calculate than the original likelihood function. If the uncertainties \bm{\sigma} are constant we can ignore the first part and directly use \chi^2({\bm p}) (chi-squared) as our cost function because we are only interested in differences between cost function values. This special case of the method of maximum likelihood is known as the method of least squares and it is by far the most common cost function used for fits.
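
As a minimal sketch of the method of least squares (a generic illustration, not kafe2's implementation), the following example minimizes \chi^2({\bm p}) for a straight-line model with SciPy; the data values and uncertainties are invented.

    import numpy as np
    from scipy.optimize import minimize

    # Invented example data: y values d measured at x with uncertainties sigma.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    sigma = np.full(5, 0.3)

    def model(x, p):
        # Straight-line model m_i(p) = a * x_i + b with p = (a, b).
        a, b = p
        return a * x + b

    def chi2(p):
        # Sum of squared, uncertainty-normalized residuals.
        return np.sum(((d - model(x, p)) / sigma) ** 2)

    result = minimize(chi2, x0=[1.0, 0.0])
    print(result.x, result.fun)   # best-fit (a, b) and chi^2 at the minimum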

Covariance

The \chi^2({\bm p}) cost function that we discussed in the previous section assumes that our data points d_i are subject to uncertainties \sigma_i. With this notation we have implicitly assumed that our uncertainties are independent of one another: there is no relationship between the uncertainties of any two individual data points. Independent uncertainties are frequently caused by random fluctuations in the measurement values, either because of technical limitations of the experimental setup (electronic noise, mechanical vibrations) or because of the intrinsic statistical nature of the measured observable (as is typical in quantum mechanics, e.g. radioactive decays).

However, there are also correlated uncertainties that arise due to effects that distort multiple measurements in the same way. Such uncertainties can for example be caused by a random imperfection of the measurement device which affects all measurements equally. The uncertainties of the measurements taken with such a device are no longer uncorrelated, but instead have one common uncertainty.

Historically uncertainties have been divided into statistical and systematic uncertainties. While this is appropriate when propagating the uncertainties of the input variables by hand it is not a suitable distinction for a numerical fit. In kafe2 multiple uncertainties are combined to construct a so-called covariance matrix. This is a matrix with the pointwise data variances \mathrm{Var}_i on its diagonal and the covariances \mathrm{Cov}_{ij} between two data points outside the diagonal. By using this covariance matrix for our fit we can estimate the uncertainty of our model parameters numerically with no need for propagating uncertainties by hand.

As mentioned before, the diagonal elements of our covariance matrix represent the variances \mathrm{Var}_i = \sigma_i^2 of our data points. They simply represent the uncertainty of a single data point d_i while ignoring all other data points. An element outside the diagonal at position (i,j) represents the covariance \mathrm{Cov}_{ij} between points d_i and d_j:

\mathrm{Cov}_{ij}
= E[ (d_i - E[d_i])(d_j - E[d_j]) ]
= E[d_i \cdot d_j] - E[d_i] \cdot E[d_j],

where E is the expected value of a variable. The covariance \mathrm{Cov}_{ij} is a measure of the joint variability of d_i and d_j - but for a meaningful interpretation it needs to be considered relative to the pointwise uncertainties \sigma_i. We therefore define the so-called Pearson correlation coefficient \rho_{ij} as follows:

\rho_{ij} = \frac{\mathrm{Cov}_{ij}}{\sigma_i \sigma_j}.

The correlation \rho_{ij} is normalized to the interval [-1, 1]. Its absolute value is a measure of how strongly the residuals r_k = d_k - \mu_k depend on one another. In other words, the absolute value of \rho_{ij} measures how much information you get about r_i or r_j if you know the other one. For \rho = 0 they are completely independent of one another. For \rho = \pm 1, r_i and r_j are directly proportional to one another with a positive (negative) proportionality constant for \rho = +1 (\rho = -1). Let’s look at some toy samples for different values of \rho_{ij}:

../_images/covariance_plot.png

For \rho_{ij} = 0 the sample forms a circle around (0,0). As the absolute value of \rho_{ij} increases the sample changes its shape to a tilted ellipse - some combinations of r_i and r_j become more likely than others. For \rho_{ij} = \pm 1 the ellipse becomes a line - in this degenerate case we really only have one source of uncertainty that affects two data points.

Covariance Matrix Construction

In a physics experiment it is typically necessary to consider more than one source of uncertainty. Let us consider the following example: we want to measure Earth’s gravitational acceleration g by dropping things from various heights and measuring the time they take to hit the ground with a stopwatch. We assume an independent uncertainty of \sigma_{\rm human} = 0.5 s for each data point because humans are not able to precisely align pressing the button of a stopwatch with the actual event. For one reason or another the stopwatch we’re using is also consistently off by a few percent. To account for this we assume a fully correlated (\rho_{ij} = 1) uncertainty of \sigma_{\rm watch} = 2 \% for all data points. To determine the variance of a single data point we can simply add up the variances of the uncertainty sources:

{\rm Var}_{\rm total}
= \sigma_{\rm total}^2
= {\rm Var}_{\rm human} + {\rm Var}_{\rm watch}
= \sigma_{\rm human}^2 + \sigma_{\rm watch}^2.

As it turns out we can use the same approach for the covariances: we can simply add up the covariance matrices of the different uncertainty sources to calculate a total covariance matrix:

{\bm V}_{\rm total} = {\bm V}_{\rm human} + {\bm V}_{\rm watch}.

The next question is how to determine the covariance matrices for the individual uncertainty sources. A useful approach is to split a covariance matrix into a vector of uncertainties \bm \sigma and the corresponding correlation matrix \bm \rho:

\bm{V} = (\bm{\sigma} \cdot \bm{\sigma}^T) \circ \bm{\rho},

where \circ is the Hadamard product (a.k.a. Schur product). In other words, the components of \bm V are calculated by simply multiplying the components of {\bm \sigma} \cdot {\bm \sigma}^T and \bm \rho at the same position. If we assume that we have three data points we can express the human uncertainty as follows:

\bm{\sigma}_\mathrm{human} = \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \end{pmatrix},
\quad \bm{\rho}_\mathrm{human} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix},
\quad \bm{V}_\mathrm{human}
= \begin{pmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0\\ 0 & 0 & 0.25 \end{pmatrix}.

Because the human uncertainties of the individual data points are completely independent from one another the covariance/correlation matrix is a diagonal matrix. On the other hand, given some data points \bm{d} the watch uncertainty is expressed like this:

\bm{\sigma}_\mathrm{watch} = 0.02 \cdot \bm{d}
= 0.02 \cdot \begin{pmatrix} d_1 \\ d_2 \\ d_3 \end{pmatrix},
\quad \bm{\rho}_\mathrm{watch} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1\\ 1 & 1 & 1\end{pmatrix},
\quad \bm{V}_\mathrm{watch} = 0.0004 \cdot
 \begin{pmatrix} d_1^2 & d_1 d_2 & d_1 d_3 \\
                 d_1 d_2 & d_2^2 & d_2 d_3 \\
                 d_1 d_3 & d_2 d_3 & d_3^2
 \end{pmatrix}.

Because the watch uncertainties of the individual data points are fully correlated all components of the correlation matrix are equal to 1. However, this does not necessarily mean that all components of the covariance matrix are also equal. In this example the watch uncertainty per data point is relative, meaning that the absolute uncertainty differs from data point to data point.
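
In code this construction is just an outer product and an element-wise multiplication; a minimal NumPy sketch with invented measured times d for the example above:

    import numpy as np

    d = np.array([1.0, 1.4, 2.0])              # invented measured times in s

    # Independent "human" uncertainty: 0.5 s per point, no correlation.
    sigma_human = np.full(3, 0.5)
    V_human = np.outer(sigma_human, sigma_human) * np.eye(3)

    # Fully correlated "watch" uncertainty: 2 % of each data point.
    sigma_watch = 0.02 * d
    V_watch = np.outer(sigma_watch, sigma_watch) * np.ones((3, 3))

    # The total covariance matrix is simply the sum of the components.
    V_total = V_human + V_watch
    print(V_total)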

If we were to visualize the correlations of the uncertainty components described above, we would find that samples of the human component form a circle while samples from the watch component form a line. If we were to visualize the total uncertainty we would end up with the mixed case where the sample forms an ellipse.

Correlated Least Squares

We previously defined the \chi^2 cost function like this:

\chi^2 (\bm{p}) = \sum_i \left( \frac{d_i - m_i({\bm p})}{\sigma_i} \right)^2.

This definition is only correct if the uncertainties for each data point are independent. If we want to consider the correlations between uncertainties we need to use the covariance matrix \bm{V} instead of the pointwise uncertainties \sigma_i:

\chi^2 (\bm{p})
= (\bm{d} - \bm{m}(\bm{p}))^T \cdot \bm{V}^{-1} \cdot (\bm{d} - \bm{m}(\bm{p})).

Notably the division by the uncertainties \sigma_i has been replaced by a matrix inversion. The uncorrelated definition is simply a special case of the correlated one: if the uncertainties are completely uncorrelated then \bm{V} is a diagonal matrix, and inverting it amounts to replacing the diagonal elements V_{ii} with 1 / V_{ii}.
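
A minimal sketch of the correlated \chi^2 as a quadratic form, with an invented residual vector and covariance matrix (in practice the linear system is solved rather than explicitly inverting \bm{V}):

    import numpy as np

    # Invented data, model prediction, and covariance matrix.
    d = np.array([1.0, 1.4, 2.0])
    m = np.array([0.9, 1.5, 2.1])
    V = 0.25 * np.eye(3) + 0.0004 * np.outer(d, d)

    r = d - m                          # residual vector d - m(p)
    chi2 = r @ np.linalg.solve(V, r)   # equivalent to r^T V^-1 r
    print(chi2)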

Parameter Confidence Intervals

When we perform a fit we are not only interested in the parameter values that fit our data “best”, we also want to determine the uncertainty on our result. The standard method with which fitting tools determine parameter confidence intervals is to make use of the so-called Rao-Cramér-Fréchet bound. For the variance of the estimator \hat{p} of a single parameter it states:

\mathrm{Var}_{\hat{p}} \ge 2 \frac{\left( 1 + \frac{\partial b}{\partial p} \right)^2}
{E \left[ \frac{\partial^2 \mathrm{NLL}}{\partial p^2} \right]},

where b is the bias of the estimator. Because the bias cannot be easily computed it is usually assumed to be 0 in practice (check with a Monte Carlo study when in doubt). Furthermore, because likelihood methods are efficient (if an efficient estimator exists at all) the uncertainties on the fit results decrease “quickly” as more data is added and the RCF bound becomes an equality. Finally, in the large sample limit (i.e. if you have “enough” data and your uncertainties are sufficiently small) the expectation in the denominator can be replaced with the second derivative of the NLL evaluated at the cost function minimum. Altogether we thus find for the uncertainty of our fit result:

\hat{\mathrm{Var}}_{\hat{p}}
= \left. \frac{2}{\frac{\partial^2 \mathrm{NLL}}{\partial p^2}} \right|_{p = \hat{p}}.

The default output of a fitting tool is \hat{p} as the parameter value and \hat{\sigma}_{\hat{p}} = \sqrt{\hat{\mathrm{Var}}_{\hat{p}}} as the parameter error. For reasons that will become clear in the following sections these errors will also be referred to as the “parabolic errors”. If the estimator \hat{p} of a parameter is normally distributed then so-called confidence intervals can be calculated which (for a random dataset) contain the true value p with a given probability called the confidence level \mathrm{CL}. For a confidence interval [a, b) the confidence level is calculated by integrating the probability density function of the normal distribution:

\mathrm{CL} = \int_a^b \frac{1}{\hat{\sigma}_{\hat{p}} \sqrt{2 \pi}}
\exp \left( {-\frac{1}{2} \frac{(x - \hat{p})^2}{\hat{\sigma}_{\hat{p}}^2}} \right) \mathrm{d}x.

The confidence interval bounds are frequently chosen symmetrically around \hat{p} and expressed as multiples of \hat{\sigma}_{\hat{p}}. For example, the “1 \sigma interval” [\hat{p} - \hat{\sigma}_{\hat{p}}, \hat{p} + \hat{\sigma}_{\hat{p}}) has a confidence level of approximately 68%.

If \hat{p} is normally distributed then the method described above can be used directly. In the large sample limit this is always the case for maximum likelihood estimators. If you don’t have enough data (or if you don’t know) you will need to use e.g. the method described in the next section.

Profile Likelihood (1 Parameter)

Let’s assume we have a fit with only a single parameter a. If the estimator \hat{a} is normally distributed then the negative log-likelihood \mathrm{NLL} is a parabola. If a is varied by N standard deviations \hat{\sigma}_{\hat{a}} from the optimal value in either direction then \mathrm{NLL} increases by N^2. For a non-Gaussian estimator confidence intervals can still be estimated by determining the parameter values at which \Delta \mathrm{NLL}(a) = \mathrm{NLL}(a) - \mathrm{NLL}(\hat{a}) is equal to the squared equivalent sigma value N^2. In general the confidence intervals determined in this manner with the “profile likelihood method” will be asymmetrical. In loose terms, if the cost function value increases very sharply when we move away from the cost function minimum then even a small deviation from our fit result would result in a significantly worse fit, making large deviations unlikely. Conversely, if the cost function value increases very slowly when we move away from the cost function minimum then a deviation from our fit result would result in a fit that is only slightly worse than our optimal fit result, making such a deviation quite possible.

The obvious problem with the profile likelihood method described above is that in practice fits will almost always have more than one parameter (the additional parameters being denoted as \bm{p}). So how do we determine the values for these other parameters as we vary just one of them? The solution is to choose \bm{p} in such a way that \Delta \mathrm{NLL}(a, \bm{p}) becomes minimal. In practical terms this means that we fix a to several values near the cost function minimum and then perform a fit over all other parameters for each of these values (this process is called profiling). In this context the parameters \bm{p} are called nuisance parameters: we don’t care about their values (right now) but we need to include them in our fits for a statistically correct result.
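
The profiling procedure can be sketched generically with SciPy (this is not kafe2's implementation): the parameter of interest a is fixed to a grid of values and the nuisance parameter is re-fitted at every grid point; the data and the exponential toy model are invented.

    import numpy as np
    from scipy.optimize import minimize

    # Invented data for a toy exponential model with unit y uncertainties.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    d = np.array([5.1, 3.2, 1.7, 1.4, 0.6])

    def nll(a, p):
        # a: parameter of interest, p: nuisance parameter(s); chi^2 with sigma = 1.
        return np.sum((d - a * np.exp(-p[0] * x)) ** 2)

    # Global fit over both parameters.
    global_fit = minimize(lambda q: nll(q[0], q[1:]), x0=[5.0, 0.5])
    nll_min = global_fit.fun

    # Profile: fix a on a grid, re-fit the nuisance parameter at each point.
    a_grid = np.linspace(4.0, 6.5, 51)
    delta_nll = np.array([
        minimize(lambda p: nll(a, p), x0=[global_fit.x[1]]).fun - nll_min
        for a in a_grid
    ])

    # The (asymmetric) 1-sigma interval is where Delta NLL stays below 1.
    inside = a_grid[delta_nll <= 1.0]
    print(inside[0], global_fit.x[0], inside[-1])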

If the estimator \hat{a} is normally distributed then the confidence intervals derived from the profile likelihood are the same as the ones derived from the RCF bound - if you suspect that this is not the case you should always check the profiles of the parameters. The easiest way to do this is to set the flag profile = True when calling kafe2.xy_fit or kafe2.plot. The parabolic parameter uncertainties are then replaced with the edges of the 1-\sigma-intervals determined from the profile likelihood. Because these intervals are not necessarily symmetric around the cost function minimum they are referred to as asymmetric parameter errors in kafe2 (in Minuit they are called Minos errors).

When the above flag is set kafe2 will then also create plots of the profile likelihood in addition to the regular plots. As an example, let us look at the profile of the parameter g from the double slit example:

../_images/003_double_slit_profile_g.png

The profile of this parameter is very clearly asymmetric and not even close to the parabolic approximation of a normal distribution. If we had only looked at the parabolic parameter error we could have unknowingly assumed confidence intervals that are very inaccurate.

Profile Likelihood (2 Parameters)

In the previous section we learned about the profiles of single fit parameters which serve as a replacement for the parabolic errors of single fit parameters. In this section we will learn about so-called contours, which serve as a replacement for the covariance of two fit parameters. Conceptually profiles and contours are very similar: a profile defines confidence intervals that contain the true value of a single parameter with a certain probability, while a contour defines a confidence region that contains the true values of a pair of parameters with a certain probability. Let us start by looking at the contours produced in the double slit example:

../_images/003_double_slit_contours.png

In this visualization the confidence region inside the contours is colored. By looking at the legend we find that the contours correspond to 1 \sigma and 2 \sigma. Notably the confidence levels of the corresponding confidence regions are not the same as in one dimension. In one dimension 1 \sigma corresponds to roughly 68% while 2 \sigma corresponds to roughly 95%. We could derive these confidence levels by integrating the probability density function of the standard normal distribution over the interval [-\sigma , \sigma] for a desired \sigma value. In two dimensions we instead integrate the PDF of the uncorrelated standard bivariate normal distribution over a circle with radius \sigma around the origin:

\mathrm{CL}(\sigma)
= \int_0^\sigma dr \int_0^{2 \pi} d \varphi \ r \frac{1}{2 \pi} e^{- \frac{r^2}{2}}
= \int_0^\sigma dr \ r e^{- \frac{r^2}{2}}
= \left[ -e^{-\frac{r^2}{2}} \right]_0^\sigma
= 1 - e^{-\frac{\sigma^2}{2}}.

With this formula we now find \mathrm{CL}(1) = 39.3\%,\ \mathrm{CL}(2) = 86.4\%,\ \mathrm{CL}(3) = 98.8\%.

Note

So far there has been no mention of how a contour for a given \Delta \mathrm{NLL} could be calculated. This is because (efficiently) calculating these contours is not straightforward and even in kafe2 this is an area of active development.

The parabolic equivalent of a contour is to look at the parameter covariance matrix and to extrapolate the correlated distribution of two estimators. As with the input uncertainties the confidence region calculated this way will always be an ellipse. For (nearly) normally distributed estimators such as the estimators from the exponential fit in the “model functions” example the calculated contours will then look something like this:

../_images/002_exponential_contours.png

If the estimators were normally distributed the 1-\sigma-contour would reach exactly from -\sigma to +\sigma while the 2-\sigma-contour would reach exactly from -2 \sigma to +2 \sigma. As we can see the deviation from this is very small so we can probably use the parameter covariance matrix (or the parabolic parameter errors and the parameter correlation matrix) without issue. If we require highly precise confidence intervals for our parameters this might not be acceptable though.

Note

The degree to which confidence intervals/regions are distorted from their parabolic approximation depends on the scale at which the profile likelihood is calculated. Because every function minimum can be accurately approximated by a parabola at infinitesimally small scales (Taylor expansion) the parabolic approximation becomes more accurate for small parameter uncertainties. Conversely, for large parameter uncertainties the parabolic approximation of the profile likelihood becomes less accurate.

Nonlinear Models

In the previous section we discussed the profile likelihood and how it can be used to calculate confidence intervals for our fit parameters. We also discussed how it is necessary to use this method if the parameter estimators are not normally distributed. What was not discussed is how to determine if an estimator is normally distributed in the first place. This is where this section comes in.

Linear Models

Let us assume we have some vector of N data points d_i with corresponding constant Gaussian uncertainties \sigma_i (that can also be correlated). A linear model is then defined as a model m_i(\bm{p}) that is a linear function of its M parameters p_j:

m_i(\bm{p}) = b_i + \sum_{j=1}^M w_{ij} p_j,

where the weights w_{ij} and biases b_i are simply real numbers (the biases here have nothing to do with the bias in the RCF bound). Put another way, each model value m_i is a linear combination of the parameter values p_j plus some bias b_i. We can express the same relationship as above with a weight matrix \bm{W} and a bias vector \bm{b}:

\bm{m}(\bm{p}) = \bm{W} \bm{p} + \bm{b}.

If we now use the method of least squares (\chi^2 ) to estimate the optimal fit parameters \hat{\bm{p}} we get a very useful property: the estimators of our parameters are normally distributed. We can therefore skip the (relatively) expensive process of profiling the parameters!

Let us look at some examples of linear models in the context of xy fits since those are the most common. Assume that we have some y data \bm{d} measured at x values \bm{x} = (0, 1, 2)^T. For our model function we choose the first degree polynomial f(x) = a + b x. We can thus express our model like this:

\bm{m}(\bm{p})
= \bm{W} \bm{p}
= \left( \bm{x}^0, \bm{x}^1 \right) \bm{p}
= \begin{pmatrix} 1 & 0\\ 1 & 1\\ 1 & 2 \end{pmatrix} \begin{pmatrix} a\\ b \end{pmatrix}
= a \bm{x}^0 + b \bm{x}^1.

The upper indices of vectors are to be interpreted as powers of said vectors using the Hadamard/Schur product (component-wise multiplication). In the above equation we only have a weight matrix W = \left( \bm{x}^0, \bm{x}^1 \right) but no bias vector. We can clearly see that the first degree polynomial (a line) is a linear model. Let’s take a look at the third degree polynomial f(x) = a + b x + c x^2 + d x^3:

\bm{m}(\bm{p})
= \bm{W} \bm{p}
= \left( \bm{x}^0, \bm{x}^1, \bm{x}^2, \bm{x}^3 \right) \bm{p}
= \begin{pmatrix} 1 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\end{pmatrix}
  \begin{pmatrix} a\\ b\\ c\\ d \end{pmatrix}
= a \bm{x}^0 + b \bm{x}^1 + c \bm{x}^2 + d \bm{x}^3.

Again we find that the model \bm{m}(\bm{p}) is a linear function of its parameters \bm{p}. A third degree polynomial is therefore also a linear model. This is the case even though the model function is not a linear function of the independent variable x - but that was never required in our definition of linear models because x is not one of our fit parameters. In fact, all polynomials are linear models.
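
In code the weight matrix of a polynomial model is just a matrix of powers of \bm{x}, and with constant uncertainties the least-squares estimate follows by ordinary linear algebra; a small sketch with invented y data (unit uncertainties assumed):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0])
    d = np.array([1.2, 2.1, 2.9])            # invented y data

    # Weight (design) matrix for a first-degree polynomial: columns x^0 and x^1.
    W = np.column_stack([x ** 0, x ** 1])

    # Linear least squares: because the model is linear in the parameters the
    # estimate is itself a linear function of the data.
    p_hat, *_ = np.linalg.lstsq(W, d, rcond=None)
    print(p_hat)                             # estimated (a, b) for f(x) = a + b*x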

Nonlinear Models

Now that we have defined linear models, the definition of nonlinear models is rather easy: a model that is not a linear model. The natural consequence of this is that the estimators of our fit parameters are no longer guaranteed to be normally distributed. We will therefore need to resort to calculating confidence intervals from the profile likelihood. Let us consider an exponential model as an example: f(x) = A \cdot e^{- \lambda x}. It is simply not possible to express this function using only a finite weight matrix \bm{W} and a bias vector \bm{b}. We would instead need an infinitely large matrix and infinitely many parameters. With the same x vector \bm{x} = (0, 1, 2)^T as before we find:

\bm{m}(\bm{p})
= A \cdot e^{- \lambda \bm{x}}
= A \cdot \sum_{k=0}^\infty \frac{(- \lambda \bm{x})^k}{k!}

= A \cdot \begin{pmatrix}
         \bm{x}^0 & -\bm{x}^1 & \frac{\bm{x}^2}{2} & -\frac{\bm{x}^3}{6} & \cdots
 \end{pmatrix} \begin{pmatrix}
         \lambda^0 \\ \lambda^1 \\ \lambda^2 \\ \lambda^3 \\ \vdots
 \end{pmatrix}
= A \cdot \begin{pmatrix}
         1 & 0 & 0 & 0 & \\
         1 & -1 & \frac{1}{2} & -\frac{1}{6} & \cdots \\
         1 & -2 & 2 & -\frac{4}{3} & \\
 \end{pmatrix} \begin{pmatrix}
         \lambda^0 \\ \lambda^1 \\ \lambda^2 \\ \lambda^3 \\ \vdots
 \end{pmatrix}.

Note

We could of course just cut off the series at some point to approximate the exponential function. This would be equivalent to approximating the exponential function with a polynomial. But this would not allow us to calculate an estimate for the parameter \lambda - which is very often the entire point of a physics experiment.

Unfortunately, even with a linear model function the regression problem as a whole can become nonlinear if certain kafe2 features are used. As of right now these features are uncertainties in x direction for xy fits and uncertainties relative to the model. This is because when using those features the uncertainties that we feed to our negative log-likelihood are no longer constant. Instead they become a function of the fit parameters: \sigma_i \rightarrow \sigma_i(\bm{p}).

Another complication is that we then have to consider the full Gaussian likelihood rather than just \chi^2 to avoid biasing our results:

\mathrm{NLL}(\bm{p})
= - 2 \log L_\mathrm{max}(\bm p) + \chi^2({\bm p})
= - 2 \sum_i \log \frac{1}{\sqrt[]{2 \pi} \: \sigma_i(\bm{p})} + \chi^2(\bm{p}) \\
= N \log (2 \pi) + 2 \sum_i^N \log \sigma_i(\bm{p}) + \chi^2(\bm{p})
=: N \log (2 \pi) + C_\mathrm{det}(\bm{p}) + \chi^2(\bm{p}).

As with our derivation of \chi^2 we end up with a constant term N \log (2 \pi) which we can ignore because we are only interested in the differences in cost. We also get a new term C_\mathrm{det}(\bm{p}) = 2 \sum_i^N \log \sigma_i(\bm{p}) that we need to consider when our uncertainties depend on our fit parameters. The new term results in higher cost when the uncertainties increase. If we didn’t add C_\mathrm{det}(\bm{p}) while handling parameter-dependent uncertainties we would end up with a bias towards parameter values for which the uncertainties are increased because those values result in a lower value for \chi^2. The subscript “det” is short for determinant, the reason for which should become clear when we look at the full Gaussian likelihood with correlated uncertainties represented by a covariance matrix \bm{V}(\bm{p}):

\mathrm{NLL}(\bm{p})
= - 2 \log L_\mathrm{max}(\bm{p}) + \chi^2(\bm{p})
= - 2 \log \left[ (2 \pi)^{-\frac{N}{2}}
  \frac{1}{\sqrt{\det \bm{V}(\bm{p})}} \right] + \chi^2(\bm{p}) \\
= N \log (2 \pi) + \log \det \bm{V}(\bm{p}) + \chi^2(\bm{p})
=: N \log (2 \pi) + C_\mathrm{det}(\bm{p}) + \chi^2(\bm{p})

The constant term is the same as with the uncorrelated uncertainties but the term we’re interested in has changed to C_\mathrm{det}(\bm{p}) = \log \det \bm{V}(\bm{p}). If the uncertainties are uncorrelated then the covariance matrix is diagonal and the result is equal to the term we found earlier.

Note

Handling correlated uncertainties that are a function of our fit parameters \bm{p} is computationally expensive because we need to recalculate the inverse (actually the Cholesky decomposition) of our covariance matrix many times, which has complexity O(N^3) for N data points - on modern hardware this is typically not an issue though.

Uncertainties In x Direction

Now that we know how to handle parameter-dependent uncertainties we can use this knowledge to handle a very common problem: fitting a model with model function f(x; \bm{p}) to data with x values x_i and uncertainties in both the x and the y direction. The uncertainties in the y direction \sigma_{y, i} can be used directly. For the x uncertainties \sigma_{x, i} we need a trick: we project them onto the y axis by multiplying them with the derivative of the model function with respect to x, f'(x_i; \bm{p}):

\sigma_{xy,i}(\bm{p}) = \sqrt{\sigma_{y,i}^2 + (\sigma_{x,i} \cdot f'(x_i; \bm{p}))^2}.

The formula for the pointwise projected xy uncertainties \bm{\sigma}_{xy} can be generalized for the equivalent covariance matrices \bm{V}_x and \bm{V}_y:

\bm{V}_{xy}(\bm{p})
= \bm{V}_y + (f'(\bm{x}; \bm{p}) \cdot f'(\bm{x}; \bm{p})^T) \circ \bm{V}_x,

where \circ is again the Hadamard product (a.k.a. Schur product) where two matrices are multiplied on a component-by-component basis. We are also implicitly assuming that f'(\bm{x}; \bm{p}) is a vectorized function à la NumPy that returns a vector of derivatives for a vector of x values \bm{x}.
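
A sketch of this projection for a hypothetical model function f (everything here - data, uncertainties, parameter values - is invented, and the derivative is taken numerically):

    import numpy as np

    def f(x, a, b):
        # Hypothetical model function f(x; p) = a * x + b.
        return a * x + b

    x = np.array([0.0, 1.0, 2.0])
    sigma_x = np.full(3, 0.1)
    sigma_y = np.full(3, 0.3)
    a, b = 2.0, 1.0                          # current parameter values

    # Numerical derivative of the model function with respect to x.
    eps = 1e-6
    f_prime = (f(x + eps, a, b) - f(x - eps, a, b)) / (2 * eps)

    # Pointwise projection: add the projected x uncertainties in quadrature.
    sigma_xy = np.sqrt(sigma_y ** 2 + (sigma_x * f_prime) ** 2)

    # Matrix version with covariance matrices V_x and V_y.
    V_xy = np.diag(sigma_y ** 2) + np.outer(f_prime, f_prime) * np.diag(sigma_x ** 2)
    print(sigma_xy, np.sqrt(np.diag(V_xy)))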

Uncertainties Relative To The Model

Relative uncertainties are very common. For example, the uncertainties of digital multimeters are typically specified as a percentage of the reading. Unfortunately such uncertainties are therefore relative to the true values which we don’t know. The standard approach for handling relative uncertainties is therefore to specify them relative to the data points d_i which we do know. However, this approach introduces a bias: if the random fluctuation represented by an uncertainty causes our data d_i to have a reduced (increased) absolute value then the relative uncertainties are underestimated (overestimated). This causes a bias towards models with smaller absolute values in our fit because we are giving data points that randomly happen to have a low absolute value a higher weight than data points with a high absolute value - and this bias increases for large relative uncertainties.

A better result can be achieved by specifying uncertainties relative to the model m_i(\bm{p}) rather than the data d_i. Because the model (ideally) converges against the true values in the large sample limit we no longer give a higher weight to data that randomly happens to have a lower absolute value. The price we pay for this is that our total uncertainty becomes a function of our model parameters \bm{p} which results in an increase in computation time as described above.

Gaussian Approximation Of The Poisson Distribution

kafe2 has a built-in approximation of the Poisson distribution where the Gaussian uncertainty is assumed as:

\sigma_i(\bm{p}) = \sqrt{m_i(\bm{p})}.

The rationale for using the square root of the model m_i(\bm{p}) rather than the square root of the data d_i is the same as with the relative uncertainties described in the previous section. The benefit of using this approximation of the Poisson distribution instead of the Poisson distribution itself is that it is capable of handling additional Gaussian uncertainties on our data.

Hypothesis Testing

So far we have used cost functions to compare how well certain models and parameter values fit our data relative to each other - but we have never discussed how good or bad a fit is in an absolute sense. Luckily for us there is a metric that we can use: \chi^2 / \mathrm{NDF}, where \chi^2 is simply the sum of the squared, uncertainty-normalized residuals that we already know and \mathrm{NDF} is the number of degrees of freedom of our fit. The basic definition of \mathrm{NDF} is that it’s simply the number of data points N_{\bm{d}} minus the number of parameters N_{\bm{p}}:

\mathrm{NDF} = N_{\bm{d}} - N_{\bm{p}}.

Conceptually the number of degrees of freedom is the number of “extra measurements” over the minimum number of data points needed to fully specify a model with N_{\bm{p}} linearly independent parameters. If our model is not fully specified then our cost function has multiple (or even infinitely many) global minima. For example, a line with model function f(x; a, b) = a x + b has two linearly independent parameters and as such needs at least two data points to be fully specified.

If our model accurately describes our data, and if our assumptions about the uncertainties of our data are correct, then \chi^2 / \mathrm{NDF} has an expected value of 1. If \chi^2 / \mathrm{NDF} is smaller (larger) than 1 we might be overestimating (underestimating) the uncertainties on our data. If \chi^2 / \mathrm{NDF} is much larger than 1 then our model may not accurately describe our data at all.

To further quantify these rather loose criteria we can make use of Pearson’s chi-squared test. This is a statistical test that allows us to calculate the probability P(\chi^2, \mathrm{NDF}) with which we can expect to observe deviations from our model that are at least as large as the deviations that we saw in our data. To conduct this test we first need to define the so-called \bm{\chi^2} distribution. This distribution has a single parameter k; a sample from it is obtained by drawing from k independent standard normal distributions x_l, squaring the draws, and adding them up:

\chi^2 (k) = \sum_{l=1}^k x_l^2 .

The deviations of our data relative to its true values (represented by our model) and normalized to its uncertainties follow such standard normal distributions. We can therefore expect the sum of the squares of these deviations \chi^2 (\bm{p}) to follow a \chi^2 (k) distribution with k = \mathrm{NDF} - if our model and our assumptions about the uncertainties of our data are correct. We can associate the following cumulative distribution function (CDF) F(x, k) with the \chi^2 distribution:

F(x, k)
= \frac{\int_0^\frac{x}{2} t^{\frac{k}{2} - 1}
e^{-t} dt}{\int_0^\infty t^{\frac{k}{2} - 1} e^{-t} dt} .

To calculate the probability P(\chi^2, \mathrm{NDF}) with which we would expect a \chi^2 value larger than what we got for our fit (i.e. the probability of our fit being worse if we were to repeat it with a new random dataset) we can now simply use:

P(\chi^2, \mathrm{NDF}) = 1 - F(\chi^2, \mathrm{NDF}).

In kafe2 P(\chi^2, \mathrm{NDF}) is also referred to as the \chi^2 probability. We can use this number to determine if deviations from our assumed model are statistically significant.
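
In practice this probability can be computed with SciPy’s implementation of the \chi^2 distribution; a minimal sketch with invented numbers for \chi^2 and \mathrm{NDF}:

    from scipy.stats import chi2 as chi2_distribution

    chi2_value = 12.3   # invented chi^2 value at the cost function minimum
    ndf = 10            # invented number of degrees of freedom

    # P(chi^2, NDF) = 1 - F(chi^2, NDF); sf() is the survival function 1 - CDF.
    chi2_probability = chi2_distribution.sf(chi2_value, df=ndf)
    print(chi2_value / ndf, chi2_probability)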

The concept of \chi^2 / \mathrm{NDF} can be generalized for non-Gaussian likelihoods where the metric becomes the goodness of fit per degree of freedom \mathrm{GoF} / \mathrm{NDF}. For a negative log-likelihood \mathrm{NLL}(\bm{m}(\bm{p}), \bm{d}) with model \bm{m}(\bm{p}) and data \bm{d} it is defined like this:

\mathrm{GoF} / \mathrm{NDF}
= \frac{\mathrm{NLL}(\bm{m}(\hat{\bm{p}}), \bm{d}) - \mathrm{NLL}(\bm{d}, \bm{d})}{\mathrm{NDF}}.

We are subtracting the so-called saturated likelihood \mathrm{NLL}(\bm{d}, \bm{d}) (the minimum value our NLL could have if our model were to perfectly describe our data) from the global cost function minimum \mathrm{NLL}(\bm{m}(\hat{\bm{p}}), \bm{d}) and then divide this difference by \mathrm{NDF}. As before the expected value of \mathrm{GoF} / \mathrm{NDF} is 1 if our model and our assumptions about the uncertainties of our data are correct.

Calculating Data Uncertainties from \chi^2 / \mathrm{NDF}

Many fitting tools allow users to fit a model to data without specifying any data uncertainties. This seems to be at odds with our current understanding of Gaussian likelihood-based fits where we always required our data to have some amount of uncertainty. So how does this work? The “solution” is to first give all data points an uncorrelated uncertainty of 1 and to scale these uncertainties after the fit in such a way that \chi^2 / \mathrm{NDF} is equal to 1. This approach has a big problem which makes it unsuitable for physics experiments: we cannot do any hypothesis tests because we are implicitly assuming that our model is 100% correct. This goes against the very purpose of many physics experiments where experimenters are trying to determine if a theoretical model is consistent with experimental data.

For example, at the Large Hadron Collider the standard model of particle physics has undergone very thorough testing that continues to this day. So far, no statistically significant deviations from the standard model have been found - which is actually a bummer for theoretical physicists. You see, we know for a fact that the standard model is incomplete because (among other things) it does not include gravity. If we were to find an area in which the predictions of the standard model are incompatible with the measured data this would give theorists an important clue for a new theory that could potentially fix the problems of the standard model.

Fixing And Constraining Parameters

kafe2 allows users to fix fit parameters. The practical consequence of this is that one of our fit parameters becomes a constant and is not changed during the fit. Because this effectively lowers the number of fit parameters we have to consider the number of fixed parameters N_\mathrm{fixed} in the calculation of the number of degrees of freedom:

\mathrm{NDF}
= N_{\bm{d}} - (N_{\bm{p}} - N_\mathrm{fixed})
= N_{\bm{d}} - N_{\bm{p}} + N_\mathrm{fixed}.

It’s also possible to constrain fit parameters. Constraints are effectively direct measurements of our fit parameters and they increase the cost of our fit if they are not exactly met. For example, the additional cost C_\mathrm{con} of a Gaussian constraint for fit parameter a with mean \mu_a and standard deviation \sigma_a can be calculated like this:

C_\mathrm{con} = \left( \frac{a - \mu_a}{\sigma_a} \right)^2.

We can of course generalize this concept to account for correlations between parameters \bm{p} as defined by a covariance matrix \bm{V}_{\bm{p}}:

C_\mathrm{con}
= (\bm{p} - \bm{\mu}_{\bm{p}})^\intercal \bm{V}_{\bm{p}}^{-1} (\bm{p} - \bm{\mu}_{\bm{p}}).
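
The constraint terms are simply extra summands added to the cost function during the fit; a minimal NumPy sketch with invented parameter values, constraint means, and constraint covariance:

    import numpy as np

    def gaussian_constraint(a, mu_a, sigma_a):
        # Extra cost from a Gaussian constraint on a single parameter a.
        return ((a - mu_a) / sigma_a) ** 2

    def matrix_constraint(p, mu_p, V_p):
        # Extra cost from a correlated constraint on a parameter vector p.
        r = p - mu_p
        return r @ np.linalg.solve(V_p, r)

    p = np.array([2.4, 1.1])                 # current parameter values
    mu_p = np.array([2.5, 1.0])              # constraint means
    V_p = np.array([[0.01, 0.002],
                    [0.002, 0.04]])          # constraint covariance matrix

    # These values are added to chi^2 (or the NLL) at every minimizer step.
    print(gaussian_constraint(p[0], mu_p[0], np.sqrt(V_p[0, 0])))
    print(matrix_constraint(p, mu_p, V_p))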

If we define any constraints we are adding more data to our fit. We therefore also have to increase \mathrm{NDF} by the number of constraints N_\mathrm{con}:

\mathrm{NDF}
= N_{\bm{d}} + N_\mathrm{con} - N_{\bm{p}} + N_\mathrm{fixed}.

A simple parameter constraint that constrains a single parameter counts as one constraint. On the other hand, a matrix parameter constraint that constrains n parameters at once counts as n constraints.

Data/Fit Types

A large percentage of fits can be expressed as an XYFit. However, there are cases where an XYFit is not suitable; kafe2 offers alternative fit types for those cases. Typically these alternative fit types are associated with alternative data (container) types so both concepts are explained together in this section. For example, an XYFit uses an XYContainer to hold its xy data while a HistFit uses a HistContainer to hold and bin its data.

For the following considerations \bm{p} always describes the vector of fit parameters. Unless mentioned otherwise fits calculate their cost from a data vector \bm{d} and a model vector \bm{m}.

XYFit

Let’s start with the most common fit type: XYFit. The data associated with this fit type consists of two vectors of equal length: a vector of x data \bm{x} and a vector of y data \bm{d}. Our model values are calculated as \bm{m}(\bm{x}; \bm{p}) = f(\bm{x}; \bm{p}): they are a function of our x data and our fit parameters. As the difference in notation implies the x and y axes are not treated in the same way. The x axis is interpreted as the independent variable of our fit while the y data values \bm{d} and y model values \bm{m}(\bm{x}; \bm{p}) are what we ultimately compare to calculate the negative log-likelihood.

Note

Although we only have a few discrete x values for which we have to calculate our model \bm{m}(\bm{x}; \bm{p}), our model function f(x; \bm{p}) is still expected to be a continuous function of x.

A visualization of XYFit is fairly straightforward: the xy axes of our fit directly correspond to the axes of a plot.

IndexedFit

Conceptually IndexedFit is a simplified version of XYFit: we only have a data vector \bm{d} and no independent variable at all. Instead we calculate the model vector \bm{m}(\bm{p}) as a function of just the fit parameters. In kafe2 IndexedFit is visualized by interpreting the indices of the data/model vectors as x values and the corresponding vector entries as y values.

HistFit

HistFit handles N one-dimensional data points \bm{x} by binning them according to some bin edges x_0 < ... < x_k < ... < x_K to form our data vector \bm{d} \in \mathbb{R}^K. By default the model function f(x; \bm{p}) that is fitted to these bins is a probability density function for the observed values \bm{x}. The bin heights \bm{m}(\bm{p}) predicted by our model are obtained by integrating f(x; \bm{p}) over a given bin and multiplying the result with N:

m_k(\bm{p}) = N \int_{x_{k-1}}^{x_k} f(t; \bm{p}) dt .

The amplitude of our distribution is therefore not one of the fit parameters; we are effectively fitting a density function to a normalized histogram. By setting density=False when creating the HistFit object an arbitrary model can be fit to a histogram that is not normalized.
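
The predicted bin heights can be sketched with numerical integration over each bin; the density (a normal distribution), the bin edges, and N below are invented for illustration, and this is not kafe2's internal code:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    N = 500                                   # invented number of observations
    mu, sigma = 0.0, 1.0                      # invented model parameters
    bin_edges = np.linspace(-3.0, 3.0, 13)    # 12 bins

    def density(x, mu, sigma):
        # Probability density function f(x; p) of the assumed model.
        return norm.pdf(x, loc=mu, scale=sigma)

    # Predicted bin heights: N times the integral of the density over each bin.
    m = np.array([
        N * quad(density, lo, hi, args=(mu, sigma))[0]
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ])
    print(m)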

Unlike with XYFit or IndexedFit the default distribution assumed for the data of a HistFit is the Poisson distribution rather than the normal distribution.

UnbinnedFit

Just like HistFit an UnbinnedFit accepts a vector of N one-dimensional data points \bm{x} in conjunction with a probability density function f(x; \bm{p}) for these values as its model function. As the name implies the data is not binned. Instead, because our model function can be interpreted as a probability density we can simply calculate the negative log-likelihood like this:

\mathrm{NLL}(\bm{p}) = - 2 \sum_{n=1}^N \log f(x_n; \bm{p}).

In kafe2 UnbinnedFit is visualized by interpreting the independent variable as the x axis of a plot and the height of the probability density function as the y axis. Additionally, a thin, vertical line is added for each data point to indicate the density of our data.
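
The unbinned NLL is easy to write down directly; a minimal sketch assuming an exponential decay density with invented data points and parameter values:

    import numpy as np

    # Invented one-dimensional data points (e.g. measured decay times).
    x = np.array([0.2, 0.5, 0.7, 1.1, 1.8, 2.3, 3.0])

    def density(x, tau):
        # Normalized exponential decay density f(x; tau) for x >= 0.
        return np.exp(-x / tau) / tau

    def nll(tau):
        # Unbinned negative log-likelihood: -2 times the sum of log-densities.
        return -2.0 * np.sum(np.log(density(x, tau)))

    print(nll(1.0), nll(1.4))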

CustomFit

Unlike the other fit types discussed so far, CustomFit does not explicitly use data \bm{d} or a model \bm{m}. Instead the user has to manually define how the cost function value is calculated from the fit parameters \bm{p}. Because any potential data is outside kafe2 there is no built-in visualization (plotting) available except for the fit parameter profiles/contours calculated by ContoursProfiler.

MultiFit

A MultiFit is constructed from N regular fits with cost functions C_i(\bm{p}). The idea behind MultiFit is rather simple: multiple models that share at least one parameter are simultaneously fitted to their respective data. In accordance with the method of maximum likelihood the optimal fit parameters are those that make the observed combination of individual datasets the most likely. The corresponding cost function can simply be calculated as:

C_\mathrm{multi}(\bm{p}) = \sum_i^N C_i(\bm{p}).

If a MultiFit is built from several fits that assume Gaussian uncertainties, it’s possible to specify uncertainties that are correlated between those fits. For example, in the case of two fits that have a fully correlated source of uncertainty expressed by a covariance matrix \bm{V}_\mathrm{shared} the effective covariance matrix \bm{V}_\mathrm{multi} for the MultiFit becomes:

\bm{V}_\mathrm{multi} = \begin{pmatrix}
   \bm{V}_\mathrm{shared} & \bm{V}_\mathrm{shared} \\
   \bm{V}_\mathrm{shared} & \bm{V}_\mathrm{shared}
\end{pmatrix} .

Cost Functions

So far we have almost universally assumed that the uncertainties of our data can be described with a normal distribution. However, this is not always the case. For example, the number of radioactive decays in a given time interval follows a Poisson distribution. In kafe2 such distinctions are handled via the cost function, the function that in one way or another calculates a scalar cost from the data, model, and uncertainties of a fit. This section describes the built-in cost functions that kafe2 provides.

\chi^2 Cost Function

By far the most common cost function is the \chi^2 cost function which assumes a normal distribution for the uncertainties of our data. In kafe2 the name is strictly speaking a misnomer because the actual cost calculation considers the full likelihood rather than just \chi^2 in order to handle non-constant uncertainties. For N data points d_i with corresponding model values m_i(\bm{p}) and uncorrelated (but possibly non-constant) uncertainties \sigma_i(\bm{p}) the cost function value is calculated like this:

\mathrm{NLL}(\bm{p})
= C_\mathrm{det}(\bm{p}) + \chi^2(\bm{p})
= \sum_i^N 2 \log \sigma_i(\bm{p}) + \left( \frac{d_i - m_i(\bm{p})}{\sigma_i(\bm{p})} \right)^2.

If the uncertainties are instead correlated as described by a covariance matrix \bm{V}(\bm{p}) the cost function value becomes:

\mathrm{NLL}(\bm{p})
= C_\mathrm{det}(\bm{p}) + \chi^2(\bm{p})
= \log \det \bm{V}(\bm{p})
 + (\bm{d} - \bm{m}(\bm{p}))^T\: \bm{V}(\bm{p})^{-1}\: (\bm{d} - \bm{m}(\bm{p})).

Poisson Cost Function

The Poisson cost function assumes - as the name implies - a Poisson distribution for our data. Compared to the normal distribution the Poisson distribution has two important features: firstly the data values d_i (but not the model values m_i(\bm{p})) have to be non-negative integers, and secondly the mean and variance are inherently linked. We can define the likelihood function \mathcal{L}(\bm{p}) of the Poisson distribution like this:

\mathcal{L}(\bm{p}) = \prod_i^N \frac{m_i(\bm{p})^{d_i}\: e^{-m_i(\bm{p})}}{d_i !}.

The negative log-likelihood \mathrm{NLL}(\bm{p}) thus becomes:

\mathrm{NLL}(\bm{p})
= - 2 \log \mathcal{L}
= 2 \sum_i^N \left[ m_i(\bm{p}) - d_i \log m_i(\bm{p}) + \log \left( d_i ! \right) \right].

Notably \mathrm{NLL}(\bm{p}) depends only on the data d_i and the model m_i(\bm{p}) but not on any specified uncertainties \bm{\sigma}. The advantage is that we don’t need to specify any uncertainties - but the significant disadvantage is that we can’t specify any uncertainties either. In such cases the cost function in the following section will need to be used.
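
A minimal sketch of this Poisson NLL with invented counts and model values; scipy.special.gammaln gives \log(d_i!) in a numerically stable way, and since that term does not depend on the parameters it only matters for goodness-of-fit comparisons:

    import numpy as np
    from scipy.special import gammaln

    d = np.array([3, 7, 12, 9, 4])            # invented measured counts
    m = np.array([3.5, 6.8, 11.2, 9.4, 4.1])  # invented model predictions

    # NLL(p) = 2 * sum( m_i - d_i * log(m_i) + log(d_i!) ); gammaln(d+1) = log(d!).
    nll = 2.0 * np.sum(m - d * np.log(m) + gammaln(d + 1))
    print(nll)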

Gauss Approximation Cost Function

Because a Poisson distribution cannot handle Gaussian data uncertainties the Poisson distribution is frequently approximated with a normal distribution. The easiest approach is to simply derive the uncertainties \sigma_i from the data d_i:

\sigma_i = \sqrt{d_i}.

However, as described in the earlier section on uncertainties relative to the model, this leads to a bias towards small model values m_i(\bm{p}). In kafe2 the uncertainties are therefore derived from the model values:

\sigma_i = \sqrt{m_i(\bm{p})}.

Just like before these uncertainties can be easily combined with other sources of uncertainty by simply adding up the (co)variances. However, this approach has an important limitation: it is only valid if the model values m_i(\bm{p}) are large enough (something like m_i(\bm{p}) \ge 10). This is because for small model values the asymmetry of the Poisson distribution and the portion of the normal distribution that resides in the unphysical region with m_i(\bm{p}) < 0 are no longer negligible.

Numerical Considerations

The mathematical description of \chi^2 shown so far makes use of the inverse of the covariance matrix \bm{V}^{-1}. However, kafe2 does not actually calculate \bm{V}^{-1}. Instead the Cholesky decomposition \bm{L} \bm{L}^T = \bm{V} of the covariance matrix is being used where \bm{L} is a lower triangular matrix. Calculating \bm{L} is much faster than calculating \bm{V}^{-1} and it also reduces the rounding error from floating point operations.

We can always calculate a Cholesky decomposition for a matrix that is symmetrical and positive-definite. Obviously a covariance matrix is symmetrical by definition. And because all eigenvalues of a covariance matrix are (typically) positive a covariance matrix is (typically) also positive definite.

Note

The eigenvalues of a covariance matrix represent the equivalent variances in a coordinate system where said variances are uncorrelated (see principal component analysis). The eigenvalues are therefore all positive unless the uncertainties of two or more data points are fully correlated. In this case some of the eigenvalues are 0. However, as a consequence the covariance matrix would also no longer have full rank so we wouldn’t be able to invert it either.

Because \bm{L} is a triangular matrix solving the corresponding system of linear equations for the residual vector \bm{r} = \bm{d} - \bm{m} (difference between data and model) can be done very quickly:

\bm{L} \bm{x} = \bm{r} .

With \bm{x} = \bm{L}^{-1} \bm{r} we now find:

\chi^2
= \bm{r}^T \bm{V}^{-1} \bm{r}
= \bm{r}^T (\bm{L} \bm{L}^T)^{-1} \bm{r}
= \bm{r}^T \bm{L}^{-T} \bm{L}^{-1} \bm{r}
= \bm{x}^T \bm{x}.

Because \bm{L} is a triangular matrix it can also be used to efficiently calculate \log \det(\bm{V}):

\det (\bm{L}) = \det (\bm{L}^T) = \prod_i^N L_{ii},

C_\mathrm{det}
= \log \det (\bm{V})
= \log \det (\bm{L} \bm{L}^T)
= \log (\det \bm{L} \cdot \det \bm{L}^T)
= \log (\prod_i^N L_{ii}^2)
= 2 \sum_i^N \log L_{ii}.
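
A sketch of the Cholesky-based evaluation with NumPy/SciPy, using an invented residual vector and covariance matrix and comparing against the direct (slower, less stable) computation:

    import numpy as np
    from scipy.linalg import solve_triangular

    # Invented residual vector and covariance matrix.
    r = np.array([0.3, -0.5, 0.2])
    V = np.array([[0.30, 0.05, 0.02],
                  [0.05, 0.40, 0.10],
                  [0.02, 0.10, 0.50]])

    L = np.linalg.cholesky(V)                 # V = L L^T, L lower triangular

    # chi^2 = x^T x with L x = r, solved by forward substitution.
    x = solve_triangular(L, r, lower=True)
    chi2 = x @ x

    # log det(V) from the diagonal of L.
    C_det = 2.0 * np.sum(np.log(np.diag(L)))

    print(chi2, r @ np.linalg.inv(V) @ r)     # identical up to rounding
    print(C_det, np.log(np.linalg.det(V)))    # identical up to rounding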