I spent the last week at the Statistical and Applied Mathematical Sciences Institute’s (SAMSI) undergraduate modelling workshop. This year the workshop was hosted at North Carolina State University in Raleigh.
Rundown of the workshop
About thirty students attended the workshop. To get in there’s a mellow application process. SAMSI covered travel, rooming and food for the participants. We were expected to bring laptops with R and RStudio installed. The purpose of the workshop was to give undergrads experience modelling real world data. Each year the workshop has a different theme, in our case statistical analysis of climate phenomena.
Before the workshop, we chose from a list of six projects for the week. On Sunday night, we flew in for a welcome dinner and met the other students on our project team. Each group had a SAMSI postdoc as group leader.
On Monday Doug Nychka and Chris Jones gave us a broad overview of the statistical issues present in climate science. We spent the afternoon doing some team building activities, discussing our interests, what skills we brought to our respective groups and developing research questions.
We spent the next three days working on our projects. Each day we probably spent six hours modelling, an hour or so at a research presentation or R workshop, and an hour goofing off and hanging out. The talks in particular were very good, presenting current research at the undergrad level in an engaging way.
In the evenings a small group would normally explore the bars in the NCSU area, which was nice after a long day on campus. The workshop concluded on Friday, when each group presented their findings before flying out in the afternoon.
What students got out of the workshop
- Experience picking a research question and trying to answer it
- An introduction to spatial and climate statistics
- Exposure to extreme value distributions, spatial autocorrelation and Gaussian processes
- Practice checking for violations of (spatial) modelling assumptions
- Practice munging and working with complicated real world data
- Experience using R for statistical projects
- Individualized feedback on modelling from SAMSI postdocs
My group was led by Mikael Kuusela, who did a fantastic job helping us find research questions. He gave us a ton of individual feedback and was very attentive and patient. I particularly appreciated his advice on choosing questions that scientists care about.
Personally, the workshop enabled me to make some valuable connections within the stats community. At the end of the workshop, Mikael asked me if I’d like to write up a short outreach piece with him based on my project, which I’m super excited about. Keep an eye out for an upcoming piece on a functional decomposition of ocean thermoclines during El Niño (featuring a plot we’re calling The Bananafold).
Earlier this year Maggie Johnson, another SAMSI postdoc, put me in contact with some of the bioinformatics crew at Pacific Northwest National Laboratory and I nearly ended up taking a year off to work on omics projects with them.
I also had a blast getting to know Doug Nychka. Not only was Doug super patient with my many newbie questions about GAMs and splines, but it was also fun to chat with him about climbing and the UW-Madison statistics program.
Some noticings about the undergrad stats community
As someone who’s spent a bunch of time organizing undergrad statistics activities over the last year, the workshop was an interesting opportunity to learn about the broader community of statistics undergraduates. Here are some of the notes I took.
We have fundamental misconceptions about the purpose of modelling: When groups presented their initial research questions, it was immediately clear that many students were conflating description, prediction and causation. Throughout the week, there were many attempts to turn everything into a prediction problem, or to interpret descriptive analyses as causal.
Everything is a nail: Many students had only taken one or two modelling courses and tried to frame their research question in the context of the tools they knew. For some reason an astonishing number of people wanted to tackle every problem with ARIMA.
The pre-requisite stack is not very deep: Most students had taken a mathematical statistics course, but very few had much coursework beyond that. Fewer than half of the attendees had a background in linear regression, and people were much less comfortable with linear algebra than I would have expected. Barely anyone had a probability or analysis background.
Programming skills are rate-determining: We dramatically overestimate our R capabilities. In particular, non-tabular data really threw people off. My group took about three days to calculate mostly summary statistics and make basic plots.
Everybody’s resume looks the same: I’ll write more about this soon, but everybody advertises themselves in exactly the same way. This is despite having wildly varying skillsets. As a job seeker, how do you demonstrate that you are on the upper end of the competency spectrum? As a recruiter, how do you differentiate between candidates who look identical?
I was thrilled by Rice’s incredible attendance rate (eight out of the thirty!). Given the huge amount of work I’ve put into Rice DataSci over the last year, it’s great to see increasing excitement and an expanding community. The workshop organizers were pretty surprised that Rice showed up in so much force, and I’m hoping the trend will continue in years to come.
If you have the opportunity, I strongly recommend visiting the SAS corporate headquarters in nearby Cary[1]. One of my friend’s parents works for SAS, and he took us on a tour of their campus and to lunch at the SAS cafeteria. The campus is astoundingly beautiful and full of delightfully nerdy art, and I was blown away by how well they treat their employees. They also seem to have a healthy internship program that may be of interest to students.
We learned things at a great workshop. Everyone should go if they get a chance. The statistics community should spend more time teaching beginners about the big picture: what statistics is and how we should use it.
[1] I’ve always been fairly anti-SAS, mostly because I think the expense makes it a borderline immoral tool to teach to students. Hell, I’m working on open source statistical software at RStudio this summer. But you should seriously go check SAS out. They do some seriously cool stuff and I’m re-evaluating my opinion of them. Plus, I got to eat lunch in the building where the intro to Iron Man 3 was filmed.