Summary In this post I work through a recent homework exercise that illustrates why you shouldn't compare means by checking for confidence interval overlap. I calculate the type I error rate of this procedure for a simple case. This reveals where our intuition goes wrong: namely, we can recover the confidence interval heuristic by confusing standard deviations and variances.

Checking confidence intervals for overlap Sometimes you may want to check if two (or more) means are statistically distinguishable.
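As a rough sketch of the calculation the summary describes, assume the simplest case (an assumption here, not necessarily the post's exact setup): two independent, normally distributed sample means \(\bar{X}_1\) and \(\bar{X}_2\) with the same known standard error \(s\), and a true difference of zero. The two 95% intervals \(\bar{X}_i \pm z s\), with \(z \approx 1.96\), fail to overlap exactly when \(|\bar{X}_1 - \bar{X}_2| > 2 z s\). Since \(\bar{X}_1 - \bar{X}_2 \sim \mathcal{N}(0, 2 s^2)\) under the null, the type I error rate of the overlap check is

\[
P\left(|\bar{X}_1 - \bar{X}_2| > 2 z s\right)
= 2\left(1 - \Phi\left(\frac{2 z s}{\sqrt{2}\, s}\right)\right)
= 2\left(1 - \Phi\left(\sqrt{2}\, z\right)\right)
\approx 0.006,
\]

far below the nominal 0.05. The overlap rule behaves as if the standard deviation of the difference were \(s + s = 2s\) (adding standard deviations) rather than \(\sqrt{s^2 + s^2} = \sqrt{2}\, s\) (adding variances).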
Motivation Yesterday, for the first time ever, I coded up a model in Stan and it actually did what I wanted. My current knowledge of Stan is, at best, nascent, but I'll show you the process I went through to write my first Stan program, pointing out what I wish I'd known along the way.

My goal is to provide a quick and dirty introduction to Stan, hopefully enough to get you started without having to dig into the manual yourself.
Tue, 11 Dec 2018 00:00:00 +0000
https://www.alexpghayes.com/blog/consent-in-the-presence-of-correlation/
Motivation This post explores some ideas for a normative ethics of personal data.
Motivation This post assumes you are familiar with logistic regression and that you just fit your first or second multinomial logistic regression model. While there is an interpretation for the coefficients in a multinomial regression, that interpretation is relative to a base class, which may not be the most useful. Partial dependence plots are an alternative way to understand multinomial regression, and in fact can be used to understand any predictive model.
Summary Ockham's Razor is about what to believe when we have no evidence, not how to pick between theories supported by equal amounts of evidence.
In slightly longer form I’m in the middle of The Science of Conjecture and I just realized that I’ve been misinterpreting Ockham’s Razor for the last several years. Ockham’s Razor says:
Entities are not to be multiplied without necessity.
Tue, 14 Aug 2018 00:00:00 +0000
https://www.alexpghayes.com/blog/swans-uncertainty-and-randomness/
Motivation Why is probability an appropriate way to represent uncertainty?
Today is the last day of my summer internship with RStudio. This is the first year that RStudio has had an official internship program, and I couldn't be happier to have been a part of it.

My mandate for the summer has been to make broom better. My project was advised by both Dave Robinson (DataCamp) and Max Kuhn (RStudio). Dave originally wrote the broom package and acted as my primary mentor.
speeding up GPX ingest: profiling, Rcpp and furrr
This post is a casual case study in speeding up R code. I work through several iterations of a function to read and process GPS running data from Strava stored in the GPX format. Along the way I describe how to visualize code bottlenecks with profvis and briefly touch on fast compiled code with Rcpp and parallelization with furrr.

The problem: tidying trajectories in GPX files I record my runs on my phone using Strava.
Fri, 01 Jun 2018 00:00:00 +0000
https://www.alexpghayes.com/blog/reflections-on-samsis-2018-undergraduate-modelling-workshop/
I spent the last week at the Statistical and Applied Mathematical Sciences Institute’s (SAMSI) undergraduate modelling workshop. This year the workshop was hosted at North Carolina State University in Raleigh.
Runners often vary the distance and intensity of their workouts. In this post I demonstrate how to compare runs of different lengths using Riegel's formula. The formula accurately describes the tradeoff between run distance and average speed for aerobic runs up to about a half-marathon in length. Using my Strava data, I demonstrate how to use Riegel's formula to measure the difficulty of runs on a standardized scale and briefly investigate how my fitness has changed over time with GAMs.
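For reference, Riegel's formula in its commonly quoted form predicts the time \(T_2\) for a run of distance \(D_2\) from a known effort of time \(T_1\) over distance \(D_1\). The exponent 1.06 is Riegel's usual published value and is an assumption here; the post may estimate its own exponent from the Strava data.

\[
T_2 = T_1 \left(\frac{D_2}{D_1}\right)^{1.06}
\]

As a hypothetical example, a 5 km run in 25 minutes corresponds to a predicted 10 km time of \(25 \times 2^{1.06} \approx 52.1\) minutes. Equivalently, average speed scales as \(D^{-0.06}\), which is the tradeoff between distance and speed that makes runs of different lengths comparable.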
When we build a predictive model, we are interested in how the model will perform on data it hasn't seen before. If we have lots of data, we can split it into training and test sets to assess model performance. If we don't have lots of data, it's better to fit a model using all of the available data and to assess its predictive performance using resampling techniques. The bootstrap is one such resampling technique.
As the semester ends, I would like to remind students of the value of a well-written course evaluation. Course evaluations allow students to share wisdom with the next generation and to provide feedback to instructors and the university. Despite this, few students fill out narrative reviews. I propose we up our game.

In my ideal world, course evaluations are written for students by students. They contain any advice you would go back and give to yourself before the class.
Motivation Suppose you have some loss function \(\mathcal{L}(\beta) : \mathbb{R}^n \to \mathbb{R}\) you want to minimize with respect to some model parameters \(\beta\). You understand how gradient descent works and you have a correct implementation of \(\mathcal{L}\) but aren't sure if you took the gradient correctly or implemented it correctly in code.

Solution We can compare our implementation of the gradient of \(\mathcal{L}\) to a finite difference approximation of the gradient.
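One common way to write such an approximation is the central difference below. This is a sketch under assumptions: the post may use a forward difference instead, and the step size \(\epsilon\) and the unit vectors \(e_i\) are notation introduced here for illustration. For each coordinate \(i = 1, \dots, n\),

\[
\frac{\partial \mathcal{L}}{\partial \beta_i}(\beta) \approx \frac{\mathcal{L}(\beta + \epsilon e_i) - \mathcal{L}(\beta - \epsilon e_i)}{2 \epsilon},
\]

where \(e_i\) is the \(i\)-th standard basis vector of \(\mathbb{R}^n\) and \(\epsilon\) is small (something like \(10^{-6}\) is a typical choice). If the analytic gradient \(g\) and the approximation \(\hat{g}\) agree, a relative error such as \(\lVert g - \hat{g} \rVert / (\lVert g \rVert + \lVert \hat{g} \rVert)\) should be close to zero; a large value points to a mistake in either the derivation or the code.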
Mon, 07 Aug 2017 00:00:00 +0000
https://www.alexpghayes.com/blog/gentle-tidy-eval-with-examples/
I’ve been using the tidy eval framework introduced with dplyr 0.7 for about two months now, and it’s time for an update to my original post on tidy eval. My goal is not to explain tidy eval to you, but rather to show you some simple examples that you can easily generalize from.
library(tidyverse)
starwars
## # A tibble: 87 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
## 1 Luke~    172    77 blond      fair       blue              19 male
## 2 C-3PO    167    75 <NA>       gold       yellow           112 <NA>
## 3 R2-D2     96    32 <NA>       white, bl~ red               33 <NA>
## 4 Dart~    202   136 none       white      yellow          41.
about
I'm a first-year PhD student in the University of Wisconsin-Madison statistics program. I just graduated from Rice University with a degree in statistics. At Rice, I spent most of my time getting Rice DataSci, the fledgling data science club, off the ground.
I spent my summer interning at RStudio. Previously I’ve done biostats research at Fred Hutch. Before that I led canoe trips for YMCA Camp Menogyn.
Mon, 01 Jan 0001 00:00:00 +0000
https://www.alexpghayes.com/news/
January 2019: rstudio::conf(2019) was an absolute blast! It was a pleasure to spend two days teaching the tidymodels approach to machine learning in R with Max Kuhn and Davis Vaughan (workshop materials). I also presented on broom and how it can smooth modeling workflows (video, slides).
August 2018: I’m pleased to announce that I’ll be co-teaching a workshop on machine learning with Max Kuhn at rstudio::conf 2019.
August 2018: Finished my fantastic summer with RStudio and moved to Madison, Wisconsin.