announcing the distributions3 package
Sep 3, 2019
Alex Hayes

I am pleased to announce that distributions3 is on CRAN! The package is a collaborative effort with Emil Hvitfeldt, Ralph Trane, Dan Jordan and Bruna Wundervald. Working with them has been fantastic, and I strongly encourage you to team up with them for future projects.

What is distributions3?

distributions3 is a package for using S3 with probability distributions. This means that we start by constructing distribution objects, and then we interact with the distributions by calling S3 methods on them.

For example, if we want to get a random sample from a binomial distribution, we’d do this:

library(distributions3)

X <- Binomial(size = 10, p = 0.1)  # construct the distribution object

random(X, n = 20)                  # get samples from the distribution
##  [1] 3 0 2 0 0 1 0 0 0 0 1 2 1 2 2 2 3 0 1 0

Note that the distributions3 convention is to name random variables with single uppercase letters. This matches the notation used in introductory textbooks.

distributions3 replicates the functionality of base R with four key generics:

  • random(): Draw samples from a distribution.
  • pdf(): Evaluate the probability density (or mass) at a point.
  • cdf(): Evaluate the cumulative probability up to a point.
  • quantile(): Determine the quantile for a given probability.

We can take the new functions for a spin with a normal distribution.

Y <- Normal(mu = 2, sigma = 4)
pdf(Y, 0.3)
## [1] 0.09112297
cdf(Y, 2)
## [1] 0.5
quantile(Y, 0.5)
## [1] 2
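For comparison, here is a rough sketch of the same three calculations using only base R's d/p/q functions. The variable names (`mu`, `sigma`, `dens`, `prob`, `med`) are just illustrative choices; the point is that the distribution's parameters have to be passed to every call, rather than living in one object.

```r
# Base R equivalents of the distributions3 calls above,
# for a normal distribution with mean 2 and standard deviation 4.
mu <- 2
sigma <- 4

dens <- dnorm(0.3, mean = mu, sd = sigma)  # corresponds to pdf(Y, 0.3)
prob <- pnorm(2, mean = mu, sd = sigma)    # corresponds to cdf(Y, 2)
med  <- qnorm(0.5, mean = mu, sd = sigma)  # corresponds to quantile(Y, 0.5)

print(c(dens, prob, med))
```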

Why should I use distributions3?

The r/d/p/q memory test isn’t as fun as it used to be

The simplest reason to use distributions3 is that you can’t remember if you’re supposed to be using rnorm(), dnorm(), pnorm(), or qnorm(). Hopefully it’s easy to remember random(), pdf(), cdf() and quantile().

Extensive online documentation

We’ve also tried to improve upon base R’s documentation. If you visit https://alexpghayes.github.io/distributions3/, you’ll find a reference page for each distribution that includes example code and mathematical details about the distribution.

Here’s the section on mathematical details of the binomial distribution:

We’ve put a lot of effort into these reference pages, and we hope this online documentation is easier for students in intro stats courses both to find and to understand.

No, really, extensive online documentation

In addition to extensive notes on each distribution, distributions3 has, dare I say, a shit ton of vignettes. These vignettes are self-contained tutorials on common topics in intro stat courses.

At the moment, we have vignettes on:

These vignettes include detailed examples, beginning with a sample problem, working through assumption checking, the null hypothesis and test statistic, p-value calculations, rejection regions, and power and sample size calculations where appropriate.

In practice, these vignettes cover about half a semester of a frequentist intro probability course. We developed this material as a strictly selfish endeavour: students tend to run into the same issues again and again, whether understanding hypothesis testing, finding formulas, or using R. The goal of this material is to dramatically reduce teaching workload by providing students with high quality examples.

Please, please take a look at the one sample Z test vignette to get a sense for the amount of effort we’ve invested in these things. We hope these vignettes are a serious pedagogical resource for stats instructors.

S3 is love, S3 is life

Another audience for the package is more experienced R programmers. Because distributions are S3 objects, you can easily implement new generics and dispatch on S3 classes, taking advantage of functional object-oriented programming.

For example, the distributions3 implementation of a log-likelihood calculation is gloriously simple:

log_likelihood <- function(d, x, ...) {
  sum(log_pdf(d, x, ...))
}

Since log_pdf() is an S3 generic, log_likelihood() will immediately work with any distribution object. By contrast, doing this in base R would be rather painful.
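To make the dispatch idea concrete, here is a minimal toy sketch in base R. It is not the package's internal code: every name prefixed with `toy_` is hypothetical and exists only for this illustration. It shows how one generic plus one method per class lets a single likelihood function work across distributions.

```r
# Toy S3 sketch, base R only. All toy_* names are hypothetical
# stand-ins for illustration, not part of the distributions3 API.

# Constructors: each returns a list of parameters tagged with a class.
toy_normal <- function(mu = 0, sigma = 1) {
  structure(list(mu = mu, sigma = sigma), class = "toy_normal")
}

toy_exponential <- function(rate = 1) {
  structure(list(rate = rate), class = "toy_exponential")
}

# The generic: dispatches on the class of `d`.
toy_log_pdf <- function(d, x, ...) {
  UseMethod("toy_log_pdf")
}

# One method per distribution class.
toy_log_pdf.toy_normal <- function(d, x, ...) {
  dnorm(x, mean = d$mu, sd = d$sigma, log = TRUE)
}

toy_log_pdf.toy_exponential <- function(d, x, ...) {
  dexp(x, rate = d$rate, log = TRUE)
}

# Written once; works for any class that has a toy_log_pdf method.
toy_log_likelihood <- function(d, x, ...) {
  sum(toy_log_pdf(d, x, ...))
}

x <- c(0.5, 1.2, 2.0)
ll_normal <- toy_log_likelihood(toy_normal(mu = 1, sigma = 2), x)
ll_exp    <- toy_log_likelihood(toy_exponential(rate = 0.5), x)
print(c(ll_normal, ll_exp))
```

Adding a new distribution to this sketch only requires a constructor and a `toy_log_pdf` method; `toy_log_likelihood()` needs no changes.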

What’s next?

For me, research. This means I’ll be stepping back from active development and will take on a more advisory role for the next bit. I’m hoping that distributions3 is useful enough that the R community can contribute new distributions and desired functionality via PRs. Documentation improvements are particularly appreciated!

I am very happy to review PRs and will try to keep a close-ish eye on the repo. If you are really excited about this package, I’d also be happy to mentor a Google Summer of Code project to flesh out more functionality.


