aleatoric
https://www.alexpghayes.com/
Recent content on aleatoric. Last updated: Sun, 28 Mar 2021 00:00:00 +0000.

many models workflows in python: part ii
https://www.alexpghayes.com/blog/many-models-workflows-in-python-part-ii/
Sun, 28 Mar 2021 00:00:00 +0000

In this follow-up to my earlier post on modeling workflows in Python, I demonstrate how to integrate sample splitting, parallel processing, exception handling and caching into many-models workflows. I also discuss some differences between exploration/inference-centric workflows and tuning-centric workflows.
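The parallelism and exception-handling ingredients mentioned above can be sketched in a few lines of Python (a toy illustration with invented function and parameter names, not the post's code): fit each model specification concurrently and record failures per model instead of letting one bad fit kill the whole run.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_one(config):
    """Hypothetical 'model fit' that fails loudly on a bad specification."""
    if config["n_clusters"] < 1:
        raise ValueError("n_clusters must be positive")
    return 1.0 / config["n_clusters"]  # stand-in for a real estimate

def fit_safely(config):
    """Record any exception alongside the config instead of raising it."""
    try:
        return {"config": config, "estimate": fit_one(config), "error": None}
    except Exception as e:
        return {"config": config, "estimate": None, "error": e}

configs = [{"n_clusters": k} for k in (0, 2, 3)]

# Fit all specifications in parallel; results come back in config order.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fit_safely, configs))

ok = [r for r in results if r["error"] is None]
failed = [r for r in results if r["error"] is not None]
```

Swapping in `ProcessPoolExecutor` for CPU-bound fits, and memoizing `fit_one` to disk (e.g. with `joblib.Memory`), are the natural extensions for the caching piece.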
Motivating example

We will work with the Palmer Penguins dataset, which contains various biological measurements on three species of penguins. Our goal will be to cluster the penguins into groups that correspond to their species using their bill length, bill depth, flipper length and body mass.

many models workflows in python: part i
https://www.alexpghayes.com/blog/many-models-workflows-in-python-part-i/
Tue, 25 Aug 2020 00:00:00 +0000

This summer I worked on my first substantial research project in Python. I’ve used Python for a number of small projects before, but this was the first time that it was important for me to have an efficient workflow for working with many models at once. In R, I spend almost all of my time using a ‘many models’ workflow that leverages list-columns in tibbles and a hefty amount of tidyverse manipulation.

using the data twice
https://www.alexpghayes.com/blog/using-the-data-twice/
Mon, 04 May 2020 00:00:00 +0000

Berna Devezer, Danielle Navarro, Joachim Vandekerckhove, and Erkan Ozge Buzbas recently posted a pre-print, Devezer et al. (2020), responding to various claims within the open science community. In particular, they explore the following claims:
reproducibility of an effect can be used to demarcate scientific claims and non-scientific claims,
data should not be used twice in a data analysis, and
exploratory data analysis is characterized by poor statistical practice.

synthetic control: elon's tweet tanked tesla's stock
https://www.alexpghayes.com/blog/elon-musk-send-tweet/
Fri, 01 May 2020 00:00:00 +0000

At 2020-05-01 15:11:26 UTC, Elon Musk tweeted
and Tesla stock started tanking. I find this absolutely hilarious, especially since he did this a while back and got fined like several million dollars for tinkering with the market or something illegal like that.
Anyway, I asked myself: can we causally attribute Tesla tanking stock price to this tweet?
The answer is yes, yes, we absolutely can. In the following I use a synthetic control approach to estimate the causal impact of Musk’s tweet on the Tesla stock price (God I hate that I just wrote that sentence).

to transform or not to transform
https://www.alexpghayes.com/blog/to-transform-or-not-to-transform/
Sun, 22 Mar 2020 00:00:00 +0000

You may have heard that it is impossible to compare models when the outcome has been transformed in one model but not the other. This is not the case. Models fit to transformed data implicitly model the original data as well as the transformed data, and it is relatively straightforward to calculate the corresponding likelihoods. In this post, I’ll show you how to calculate these induced likelihoods. This will allow you to compare models fit to transformed data with models fit to the original, untransformed data.

overfitting: a guided tour
https://www.alexpghayes.com/blog/overfitting-a-guided-tour/
Mon, 06 Jan 2020 00:00:00 +0000

Summary
This post introduces overfitting, describes how overfitting influences both prediction and inference problems, provides supervised and unsupervised examples of overfitting, and presents a fundamental relationship between train and test error. The goal is to provide some additional intuition beyond material covered in introductory machine learning resources.
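A minimal sketch of that train/test relationship (my own toy example, not the post's): as polynomial degree grows, training error only goes down, while error on fresh data tells a different story.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Noisy quadratic: the 'true' signal is x**2."""
    x = rng.uniform(-1, 1, n)
    return x, x**2 + rng.normal(0, 0.1, n)

x_train, y_train = sample(30)
x_test, y_test = sample(1000)

def train_test_mse(degree):
    """Fit a degree-d polynomial on the training set, score on both sets."""
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

errors = {d: train_test_mse(d) for d in (1, 2, 15)}
```

Degree 1 underfits, degree 2 matches the truth, and degree 15 fits the noise: its training error is the smallest of the three, which says nothing about how it behaves on fresh data.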
A coin flip guessing game
Before we begin, I want to play a guessing game. Here’s how it works: I show you two sequences of coin flips.

announcing the distributions3 package
https://www.alexpghayes.com/blog/announcing-the-distributions3-package/
Tue, 03 Sep 2019 00:00:00 +0000

I am pleased to announce that distributions3 is on CRAN! The package is a collaborative effort with Emil Hvitfeldt, Ralph Trane, Dan Jordan and Bruna Wundervald. Working with them has been fantastic, and I strongly encourage you to team up with them for future projects.
What is distributions3?
distributions3 is a package for using S3 with probability distributions. This means that we start by constructing distribution objects, and then we interact with the distributions by calling S3 methods on them.

consistency and the linear probability model
https://www.alexpghayes.com/blog/consistency-and-the-linear-probability-model/
Sat, 31 Aug 2019 00:00:00 +0000

Summary
A while back Twitter once again lost its collective mind and decided to rehash the logistic regression versus linear probability model debate for the umpteenth time. The genesis for this new round of chaos was Gomila (2019), a pre-print by Robin Gomila, a grad student in psychology at Princeton. You can get a taste of the discussion in the replies to the announcement:
When the outcome is binary: logistic or linear?

an annotated bibliography on stochastic blockmodels
https://www.alexpghayes.com/blog/an-annotated-bibliography-on-stochastic-block-models/
Fri, 26 Jul 2019 00:00:00 +0000

Summary
I’ve been reading a lot of papers on network analysis recently. I thought I’d write down some takeaways and point out papers that I’ve found helpful. This collection of papers is centered around the stochastic blockmodel, and is intended to be introductory rather than comprehensive. I’ve included a few papers with other miscellaneous tidbits of interest.
Models
The most basic random graph model is the Erdos-Renyi graph, which assumes that you have a fixed number of nodes, and edges appear independently with probability \(p\).

testing statistical software
https://www.alexpghayes.com/blog/testing-statistical-software/
Fri, 07 Jun 2019 00:00:00 +0000

Motivation
Recently I’ve been implementing and attempting to extend some computationally intensive methods. These methods come from papers published in the last several years, and haven’t made their way into mainstream software libraries yet. So I’ve been spending a lot of time reading research code, and I’d like to share what I’ve learned.
In this post, I describe how I evaluate the trustworthiness of a modeling package, and in particular what I want from the test suite.

type stable estimation
https://www.alexpghayes.com/blog/type-stable-estimation/
Tue, 21 May 2019 00:00:00 +0000

Abstract
This post discusses how the mathematical objects we use in formal data modeling are represented in statistical software. First I introduce these objects, then I argue that each object should be represented by a distinct type. Next I present three principles to ensure the type system is statistically meaningful. These principles suggest that existing modeling software has an overly crude type system. I believe a finer type system in statistical packages would result in more intuitive interfaces while increasing extensibility and reducing possibilities for methodological errors.

implementing the super learner with tidymodels
https://www.alexpghayes.com/blog/implementing-the-super-learner-with-tidymodels/
Sat, 13 Apr 2019 00:00:00 +0000

Summary
In this post I demonstrate how to implement the Super Learner using tidymodels infrastructure. The Super Learner is an ensembling strategy that relies on cross-validation to determine how to combine predictions from many models. tidymodels provides low-level predictive modeling infrastructure that makes the implementation rather slick. The goal of this post is to show how you can use this infrastructure to build new methods with consistent, tidy behavior. You’ll get the most out of this post if you’ve used rsample, recipes and parsnip before and are comfortable working with list-columns.

overlapping confidence intervals: correcting bad intuition
https://www.alexpghayes.com/blog/overlapping-confidence-intervals-correcting-bad-intuition/
Thu, 31 Jan 2019 00:00:00 +0000

Summary

In this post I work through a recent homework exercise that illustrates why you shouldn’t compare means by checking for confidence interval overlap. I calculate the type I error rate of this procedure for a simple case. This reveals where our intuition goes wrong: namely, we can recover the confidence interval heuristic by confusing standard deviations and variances.
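The conservativeness is easy to see in simulation (a quick sketch, not the post's derivation). Two 95% intervals fail to overlap only when the means differ by the sum of both half-widths, roughly 2 × 1.96 standard errors, while the standard error of the difference is only √2 times a single standard error, so the implicit cutoff is about 2.77 standard errors rather than 1.96:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two groups with EQUAL means, so every "detected difference" is a
# type I error. Decision rule under test: declare a difference whenever
# the two 95% confidence intervals do not overlap.
n, reps, z = 30, 20_000, 1.96
rejections = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    half_a = z * a.std(ddof=1) / np.sqrt(n)
    half_b = z * b.std(ddof=1) / np.sqrt(n)
    no_overlap = abs(a.mean() - b.mean()) > half_a + half_b
    rejections += no_overlap
rate = rejections / reps  # far below the nominal 0.05
```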
Checking confidence intervals for overlap

Sometimes you may want to check if two (or more) means are statistically distinguishable.

some things i've learned about stan
https://www.alexpghayes.com/blog/some-things-ive-learned-about-stan/
Mon, 24 Dec 2018 00:00:00 +0000

Motivation
Yesterday, for the first time ever, I coded up a model in Stan and it actually did what I wanted. My current knowledge of Stan is, at best, nascent, but I’ll show you the process I went through to write my first Stan program, pointing out what I wish I’d known along the way.
My goal is to provide a quick and dirty introduction to Stan, hopefully enough to get you started without having to dig into the manual yourself.

understanding multinomial regression with partial dependence plots
https://www.alexpghayes.com/blog/understanding-multinomial-regression-with-partial-dependence-plots/
Tue, 23 Oct 2018 00:00:00 +0000

Motivation
This post assumes you are familiar with logistic regression and that you just fit your first or second multinomial logistic regression model. While there is an interpretation for the coefficients in a multinomial regression, that interpretation is relative to a base class, which may not be the most useful. Partial dependence plots are an alternative way to understand multinomial regression, and in fact can be used to understand any predictive model.

ockham's razor isn't about model selection
https://www.alexpghayes.com/blog/ockhams-razor-isnt-about-model-selection/
Mon, 03 Sep 2018 00:00:00 +0000

Summary
Ockham’s Razor is about what to believe when we have no evidence, not how to pick between theories supported by equal amounts of evidence.
In slightly longer form
I’m in the middle of The Science of Conjecture and I just realized that I’ve been misinterpreting Ockham’s Razor for the last several years. Ockham’s Razor says:
Entities are not to be multiplied without necessity.
For a long time, I’d taken this to mean:

swans, uncertainty and randomness
https://www.alexpghayes.com/blog/swans-uncertainty-and-randomness/
Tue, 14 Aug 2018 00:00:00 +0000

Motivation
Why is probability an appropriate way to represent uncertainty?
Statisticians typically emphasize the need to estimate uncertainty in inference and prediction. Despite making heavy use of randomness in statistics, we rarely explain why randomness is an appropriate tool for modeling the world. If we would like others to use statistics, I believe we should explain the importance of probability. This post contains one explanation I find personally satisfying.

a summer with rstudio
https://www.alexpghayes.com/blog/a-summer-with-rstudio/
Fri, 10 Aug 2018 00:00:00 +0000

Today is the last day of my summer internship with RStudio. This is the first year that RStudio has had an official internship program, and I couldn’t be happier to have been a part of it.
My mandate for the summer has been to make broom better. My project was advised by both Dave Robinson and Max Kuhn. Dave originally wrote the broom package and acted as my primary mentor.

speeding up GPX ingest: profiling, Rcpp and furrr
https://www.alexpghayes.com/blog/speeding-up-gpx-ingest-profiling-rcpp-and-furrr/
Fri, 15 Jun 2018 00:00:00 +0000

This post is a casual case study in speeding up R code. I work through several iterations of a function to read and process GPS running data from Strava stored in the GPX format. Along the way I describe how to visualize code bottlenecks with profvis and briefly touch on fast compiled code with Rcpp and parallelization with furrr.
The problem: tidying trajectories in GPX files
I record my runs on my phone using Strava.

reflections on SAMSI's 2018 undergraduate modelling workshop
https://www.alexpghayes.com/blog/reflections-on-samsis-2018-undergraduate-modelling-workshop/
Fri, 01 Jun 2018 00:00:00 +0000

I spent the last week at the Statistical and Applied Mathematical Sciences Institute’s (SAMSI) undergraduate modelling workshop. This year the workshop was hosted at North Carolina State University in Raleigh.
Rundown of the workshop
About thirty students attended the workshop. To get in, there’s a mellow application process. SAMSI covered travel, rooming and food for the participants. We were expected to bring laptops with R and RStudio installed. The purpose of the workshop was to give undergrads experience modelling real-world data.

comparing runs with riegel's formula and GAMs
https://www.alexpghayes.com/blog/comparing-runs-with-riegels-formula-and-gams/
Wed, 16 May 2018 00:00:00 +0000

Runners often vary the distance and intensity of their workouts. In this post I demonstrate how to compare runs of different lengths using Riegel’s formula. The formula accurately describes the tradeoff between run distance and average speed for aerobic runs up to about a half-marathon in length. Using my Strava data, I demonstrate how to use Riegel’s formula to measure the difficulty of runs on a standardized scale and briefly investigate how my fitness has changed over time with GAMs.

predictive performance via bootstrap variants
https://www.alexpghayes.com/blog/predictive-performance-via-bootstrap-variants/
Thu, 03 May 2018 00:00:00 +0000

When we build a predictive model, we are interested in how the model will perform on data it hasn’t seen before. If we have lots of data, we can split it into training and test sets to assess model performance. If we don’t have lots of data, it’s better to fit a model using all of the available data and to assess its predictive performance using resampling techniques. The bootstrap is one such resampling technique.

dear students: take course evals seriously
https://www.alexpghayes.com/blog/dear-students-take-course-evals-seriously/
Fri, 08 Dec 2017 00:00:00 +0000

As the semester ends, I would like to remind students of the value of a well-written course evaluation. Course evaluations allow students to share wisdom with the next generation and to provide feedback to instructors and the university. Despite this, few students fill out narrative reviews. I propose we up our game.
In my ideal world, course evaluations are written for students by students. They contain any advice you would go back and give to yourself before the class.

numerical gradient checks
https://www.alexpghayes.com/blog/numerical-gradient-checks/
Wed, 18 Oct 2017 00:00:00 +0000

Motivation
Suppose you have some loss function \(\mathcal{L}(\beta) : \mathbb{R}^n \to \mathbb{R}\) you want to minimize with respect to some model parameters \(\beta\). You understand how gradient descent works and you have a correct implementation of \(\mathcal{L}\) but aren’t sure if you took the gradient correctly or implemented it correctly in code.
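Concretely (my own minimal sketch with an invented quadratic loss, not the post's code), such a check compares an analytic gradient against a central-difference approximation:

```python
import numpy as np

# Toy loss: L(beta) = ||X beta - y||^2, with its analytic gradient.
def loss(beta, X, y):
    r = X @ beta - y
    return r @ r

def grad(beta, X, y):
    return 2 * X.T @ (X @ beta - y)

def numeric_grad(f, beta, eps=1e-6):
    """Central-difference approximation to the gradient of f at beta."""
    g = np.zeros_like(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = eps
        g[i] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)

analytic = grad(beta, X, y)
numeric = numeric_grad(lambda b: loss(b, X, y), beta)
max_abs_err = float(np.max(np.abs(analytic - numeric)))
```

If the two gradients disagree beyond finite-difference error, either the math or the code is wrong.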
Solution
We can compare our implementation of the gradient of \(\mathcal{L}\) to a finite difference approximation of the gradient.

gentle tidy eval with examples
https://www.alexpghayes.com/blog/gentle-tidy-eval-with-examples/
Mon, 07 Aug 2017 00:00:00 +0000

I’ve been using the tidy eval framework introduced with dplyr 0.7 for about two months now, and it’s time for an update to my original post on tidy eval. My goal is not to explain tidy eval to you, but rather to show you some simple examples that you can easily generalize from.
library(tidyverse)
starwars
## # A tibble: 87 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
## 1 Luke~    172    77 blond      fair       blue            19   male
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>
## 3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>
## 4 Dart~    202   136 none       white      yellow          41.

about
https://www.alexpghayes.com/about/
I’m a third year PhD student in the University of Wisconsin-Madison statistics program. Before grad school, I got a degree in statistics at Rice University. Previously I interned at RStudio, conducted biostats research at Fred Hutch, and led canoe trips for YMCA Camp Menogyn.
I’m interested in building statistical tools, and how statistics can help people make better decisions. In my free time, I enjoy long afternoon bike rides.

research
https://www.alexpghayes.com/research/
I study community detection in networks with Karl Rohe, primarily using spectral methods. The central idea behind these methods is to estimate parameters in stochastic blockmodels (and more generally random dot product graphs) by taking the singular value decomposition of matrix representations of graphs. Much of this work is motivated by applied problems in social networks, especially on Twitter.
Outside of network analysis, I’m excited about (word) embeddings, GAMs, (semi-parametric) causal inference, statistical software design, and the philosophical foundations of statistics.