New highlights added 2025-04-17

Unobserved variables are usually called parameters (View Highlight)

In conventional statistics, a distribution function assigned to an observed variable is usually called a likelihood. That term has special meaning in non-Bayesian statistics, however. (View Highlight)

Beyond all of the above, there’s no law mandating we use only one prior. If you don’t have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. (View Highlight)
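
To make this sensitivity check concrete, here is a minimal sketch (not from the book) that fits the same simple binomial model under several different priors using grid approximation; the counts `k = 6` and `n = 9` and the prior choices are purely illustrative.

```python
import numpy as np
from scipy.stats import binom, beta

# Illustrative data: 6 successes out of 9 trials
k, n = 6, 9

# Grid of candidate parameter values
p_grid = np.linspace(0, 1, 1000)

# Several priors to interrogate the prior assumption
priors = {
    "flat":           np.ones_like(p_grid),           # Beta(1, 1)
    "skeptical":      beta.pdf(p_grid, 10, 10),       # concentrated near 0.5
    "step (p >= .5)": (p_grid >= 0.5).astype(float),  # zero plausibility below 0.5
}

for name, prior in priors.items():
    likelihood = binom.pmf(k, n, p_grid)                  # Pr(data | p) on the grid
    unstd_posterior = likelihood * prior                  # numerator of Bayes' theorem
    posterior = unstd_posterior / unstd_posterior.sum()   # normalize over the grid
    print(f"{name:>14}: posterior mean = {np.sum(p_grid * posterior):.3f}")
```

If the reported inference barely moves across reasonable priors, the conclusion is not sensitive to that assumption.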

For every unique combination of data, likelihood, parameters, and prior, there is a unique posterior distribution. This distribution contains the relative plausibility of different parameter values, conditional on the data and model. (View Highlight)

various numerical techniques are needed to approximate the mathematics that follows from the definition of Bayes’ theorem (View Highlight)
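
For reference, the quantity being approximated is the posterior implied by Bayes' theorem; written here for a single parameter θ and data y (notation mine, not quoted from the highlight), the integral in the denominator is what typically has no closed form:

```latex
\Pr(\theta \mid y) = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\int \Pr(y \mid \theta')\,\Pr(\theta')\,d\theta'}
```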

(1) Grid approximation. (2) Quadratic approximation. (3) Markov chain Monte Carlo (MCMC). There are many other engines, and new ones are being invented all the time. (View Highlight)

The same model fit to the same data using different techniques may produce different answers. (View Highlight)

A useful approach is quadratic approximation. Under quite general conditions, the region near the peak of the posterior distribution will be nearly Gaussian—or “normal”—in shape. This means the posterior distribution can be usefully approximated by a Gaussian distribution. A Gaussian distribution is convenient, because it can be completely described by only two numbers: the location of its center (mean) and its spread (variance). A Gaussian approximation is called “quadratic approximation” because the logarithm of a Gaussian distribution forms a parabola. And a parabola is a quadratic function. So this approximation essentially represents any log-posterior with a parabola. (View Highlight)
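
A rough sketch of the idea (not the book's own quap routine): find the peak of the log-posterior numerically, estimate its curvature there, and read off the mean and standard deviation of the approximating Gaussian. The binomial data and the flat prior below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: 6 successes out of 9 trials, flat prior on p
k, n = 6, 9

def neg_log_posterior(p):
    # Negative log-posterior up to a constant; a flat prior adds nothing
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

# 1. Locate the posterior mode (the peak)
mode = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# 2. Curvature: second derivative of the log-posterior at the mode (finite differences)
h = 1e-5
curvature = -(neg_log_posterior(mode + h) - 2 * neg_log_posterior(mode)
              + neg_log_posterior(mode - h)) / h**2   # d^2(log posterior)/dp^2

# 3. Gaussian approximation: mean at the mode, variance = -1 / curvature
sd = np.sqrt(-1.0 / curvature)
print(f"quadratic approximation ≈ Normal(mean={mode:.3f}, sd={sd:.3f})")
```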

A Hessian is a square matrix of second derivatives. It is used for many purposes in mathematics, but in the quadratic approximation it is second derivatives of the log of posterior probability with respect to the parameters. It turns out that these derivatives are sufficient to describe a Gaussian distribution, because the logarithm of a Gaussian distribution is just a parabola. Parabolas have no derivatives beyond the second, so once we know the center of the parabola (the posterior mode) and its second derivative, we know everything about it. And indeed the second derivative (with respect to the outcome) of the logarithm of a Gaussian distribution is proportional to its inverse squared standard deviation (its “precision”: page 79). So knowing the standard deviation tells us everything about its shape. (View Highlight)
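
The statement about the second derivative follows from writing out the log of the Gaussian density (standard algebra, supplied here rather than quoted):

```latex
\log \mathcal{N}(x \mid \mu, \sigma)
  = -\frac{(x - \mu)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right),
\qquad
\frac{d^2}{dx^2} \log \mathcal{N}(x \mid \mu, \sigma) = -\frac{1}{\sigma^2}.
```

The second derivative is constant and equal to minus the precision, so recovering it at the mode pins down the standard deviation.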

The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples. (View Highlight)
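
A minimal Metropolis sampler makes the strategy tangible (a sketch with the same illustrative binomial data as above, not one of the samplers used later in the book):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Illustrative data: 6 successes out of 9 trials, flat prior on p
k, n = 6, 9

def unnorm_posterior(p):
    # Unnormalized posterior: likelihood times a flat prior, zero outside (0, 1)
    return binom.pmf(k, n, p) if 0 < p < 1 else 0.0

n_samples = 10_000
samples = np.empty(n_samples)
p_current = 0.5

for i in range(n_samples):
    p_proposal = p_current + rng.normal(0.0, 0.1)          # propose a nearby value
    ratio = unnorm_posterior(p_proposal) / unnorm_posterior(p_current)
    if rng.uniform() < ratio:                               # accept with this probability
        p_current = p_proposal
    samples[i] = p_current

# The histogram of `samples` approximates the posterior distribution
print(f"posterior mean ≈ {samples.mean():.3f}, sd ≈ {samples.std():.3f}")
```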

There is actually a set of theorems, the No Free Lunch theorems. These theorems—and others which are similar but named and derived separately—effectively state that there is no optimal way to pick priors (for Bayesians) or select estimators or procedures (for non-Bayesians). See Wolpert and Macready (1997) for example. (View Highlight)

New highlights added 2025-07-09

Intervals of defined mass. It is more common to see scientific journals reporting an interval of defined mass, usually known as a confidence interval. An interval of posterior probability, such as the ones we are working with, may instead be called a credible interval. We’re going to call it a compatibility interval instead, in order to avoid the unwarranted implications of “confidence” and “credibility.” What the interval indicates is a range of parameter values compatible with the model and data. The model and data themselves may not inspire confidence, in which case the interval will not either. (View Highlight)
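
Concretely, an interval of defined mass is just a pair of quantiles of the posterior samples; the Beta-distributed samples and the 89% mass below are illustrative choices, not anything prescribed by the highlight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for samples drawn from some posterior distribution
posterior_samples = rng.beta(7, 4, size=10_000)

# Central 89% compatibility interval: equal posterior mass cut off in each tail
lower, upper = np.quantile(posterior_samples, [0.055, 0.945])
print(f"89% compatibility interval: [{lower:.3f}, {upper:.3f}]")
```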

New highlights added 2025-07-13

Remember, the entire posterior distribution is the Bayesian “estimate.” It summarizes the relative plausibilities of each possible value of the parameter. Intervals of the distribution are just helpful for summarizing it. If choice of interval leads to different inferences, then you’d be better off just plotting the entire posterior distribution. (View Highlight)

One principled way to go beyond using the entire posterior as the estimate is to choose a loss function. A loss function is a rule that tells you the cost associated with using any particular point estimate. (View Highlight)
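
As a small sketch of the idea (grid and posterior shape are illustrative): compute the expected loss of every candidate point estimate under the posterior and pick the one that minimizes it. Under absolute loss the winner is the posterior median.

```python
import numpy as np

# Illustrative posterior over a grid of parameter values (unnormalized Beta(7, 4) shape)
p_grid = np.linspace(0, 1, 1000)
posterior = p_grid**6 * (1 - p_grid)**3
posterior /= posterior.sum()

def expected_loss(d):
    # Expected cost of reporting `d` as the point estimate, under absolute loss |d - p|
    return np.sum(posterior * np.abs(d - p_grid))

losses = np.array([expected_loss(d) for d in p_grid])
best = p_grid[np.argmin(losses)]

# The minimizer of absolute loss coincides with the posterior median
median = p_grid[np.searchsorted(np.cumsum(posterior), 0.5)]
print(f"loss-minimizing estimate = {best:.3f}, posterior median = {median:.3f}")
```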