Tuesday, May 17, 2016

The Intersection of Running and Statistical Analysis

I spent this semester taking Foundations of Experimental Design at RIT. I was dreading this class based on horror stories from other students, but it actually turned out to be probably my favorite class so far. It was a lot of work, but everyone knows that I work hard, and that I will diligently search through resources (old class material and textbooks, the internet, etc.) for hours in order to figure out problems and better understand what I'm doing. I am also now known as "the girl who posts questions on the online message board" - literally the ONLY one posting questions. It became a running gag throughout the class, and I was thanked by several classmates for posting my questions. I tend to start my homework early, as early as right after class on Monday night, so my questions got posted early in the week. Anyways, I loved the class, it only made me cry about twice, and it will be useful to me in the future. I digress. The most interesting part was the project I had to do, which involved designing an experiment (about anything I wanted), executing the experiment, analyzing the results, and then presenting my results in the form of a paper/presentation. (I didn't love writing the paper, but the rest was fun.) I haven't done any writing other than blogging (fun) or technical writing (work) since I graduated from college TEN YEARS AGO.

I did my experiment on the accuracy of GPS devices while running. It took me a long time to come up with this idea. I wanted to do something interesting to me because otherwise, what's the point? Originally I had wanted to do a coffee tasting experiment, but that got scrapped. One day I just had an epiphany and came up with this idea. It ended up turning into a very interesting experiment with a really cool design that I otherwise would not have had a chance to use in this class. The best part was that I was truly interested in the outcome and did my best to run the experiment well!

I really love my Garmin
Now, DC Rainmaker has explored this subject several times. I probably should have read his methodology before I set up my experiment, because he had some better ideas than I did. Since I had to be able to reproduce my methods multiple times over several trials and days, I chose to use high school tracks - a track has a standardized, built-in measurement system that establishes a "truth" which could be replicated multiple times at different tracks. There was no other way to do this that I could think of (DC Rainmaker actually ran while rolling one of those measuring wheels, which is a brilliant idea that I wish I had seen before I submitted my idea).

I had 5 devices: a Garmin ForeRunner 305, two Garmin ForeRunner 310XTs, a Garmin Edge 500, and my iPhone with MapMyRun. I wore the 305 and the 310XT on my right wrist (I'm a lefty), I held the iPhone in my right hand, I wore the Edge in a spi-belt around my waist, and I hooked the other 310XT to the spi-belt to see how the accuracy would change with it at my waist (that was my professor's idea). I wanted to see how these devices measured 1 or 4 laps around a track at a jog (10:00 min/mile) or a run (8:00 min/mile). I did this four times at four tracks: Greece Athena, Greece Odyssey, East High School, and Brighton High School. I wanted to explore how accuracy changed with speed, with distance, and between the devices themselves.


I'm not going to go into the nuances of my experimental design*, other than to say that I did this as a Split-Plot design, which, to non-stats people, means that I wore/carried all 5 devices at once while I ran. Each day, I did 4 runs: jog 1 lap, run 1 lap, jog 4 laps, run 4 laps. I recorded the distance measured by each device and then subtracted the actual distance of the track (0.25 miles and 0.99 miles, respectively). This left me with my "response" or "Y" value, called "Diff" (the difference between the measurement and the truth).
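Just to make that concrete, here is roughly what the response calculation looks like in R (I'll say more about R below). This is only a sketch: the data frame, the column names, and the numbers in it are made up for illustration, not my actual data.

```r
# A sketch only: the data frame, column names, and values here are hypothetical,
# not what my actual spreadsheet looked like.
gps <- data.frame(
  Device   = c("305", "iPhone"),
  Laps     = c(1, 4),           # 1 lap or 4 laps
  Measured = c(0.27, 1.03)      # what the device reported, in miles (made-up numbers)
)
gps$True <- ifelse(gps$Laps == 1, 0.25, 0.99)  # nominal track distance, in miles
gps$Diff <- gps$Measured - gps$True            # the response: measurement minus truth
```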

I executed this experiment over 4 days in April. It was interesting trying to get access to tracks during track season (I hadn't thought about this), but I managed to use tracks near work (in Greece) during the day for the Thursday and Friday that I did this. I had to dodge a few phys-ed classes and felt kind of like a creeper, but I managed. During the weekend I used tracks near my house that didn't have weekend track meets at them. I will note that I randomized the run order (the four combinations of speed/distance) at each track, but that was the only randomization that I needed. Four days of running gave me 80 data points, which was great. Plus I got in some workouts, which have otherwise been few and far between.

Awkward..
After getting the data, I sat on the project for a while due to other work associated with the class. I kind of poked at it here and there and had most of the basic analysis done prior to the last week of school (I needed a little bit of help from my professor since we never went over how to do this type of analysis). I was able to find a lot of resources online; however, my example was a bit more complicated than basically every example I could find on the internet. After I finished my last take home exam for the class, I was able to power through the rest of the analysis. I will summarize a bit of it here, trying to explain these tables and figures as I go. For those interested, I use R to do basically everything stats related unless otherwise required. R is a free piece of software that is very powerful and flexible. However, its user-friendliness is "meh." I'm slowly getting better with hard work. I can also do quite a bit of stuff in SAS (early in this course I did everything in both R and SAS, but that got time consuming and I stopped doing it).


OK, back to the analysis. After typing all of the data into Excel (I had to record it by hand at the track using a data collection sheet and a clipboard like some kind of hobo), I imported it into R and got to work. First, I ran the "full model" which included track, speed, distance, and device, plus all interactions between speed/distance/device.

Sidenote: In general, you can think of an interaction as follows: you have two foods - ice cream and a hamburger, and two condiments - chocolate sauce and ketchup. Individually, you like all four of these items. However, you only like specific combinations of them. You like ice cream with chocolate sauce but not with ketchup. You like your hamburger with ketchup but not with chocolate sauce. This is an interaction between food and condiment. When these interactions are plotted, the slopes of their lines are not parallel (in fact, with the ice cream/hamburger example, they would probably cross and form an X, which you can see below in my super professional interaction plot drawn via Paint). My responses of "Yum!" or "Gross" are especially scientific and totally valid. Understanding this plot will be useful later.

Anyways, back to the model. A full model is run to determine which factors and interactions are truly important in the model, and which factors can be removed. This is done using Analysis of Variance (ANOVA).
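For anyone who wants to try something similar at home, here is a minimal sketch of how a split-plot model like this can be fit in R with aov(). The file name and column names (gps, Track, Distance, Speed, Device, Diff) are assumptions for illustration; my actual script may have looked a bit different.

```r
# A sketch of the full split-plot model, assuming a data frame "gps" with
# columns Track, Distance, Speed, Device and the response Diff (the names and
# the file name are my guesses for illustration).
gps <- read.csv("gps_experiment.csv", stringsAsFactors = TRUE)
gps$Distance <- factor(gps$Distance)   # make sure the design variables are factors
gps$Speed    <- factor(gps$Speed)

# Each whole plot is one run: a unique Track x Distance x Speed combination.
gps$Run <- interaction(gps$Track, gps$Distance, gps$Speed)

# Track is the block, Distance and Speed are whole-plot factors, Device is the
# split-plot factor. Error(Run) makes aov() test the whole-plot terms against
# the whole-plot residuals rather than the split-plot residuals.
fit <- aov(Diff ~ Track + Distance * Speed * Device + Error(Run), data = gps)
summary(fit)   # prints the two ANOVA strata shown in the tables below
```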

Whole Plot

| Source | DF | Sum Sq. | Mean Sq. | F-Value | P-Value |
| --- | --- | --- | --- | --- | --- |
| Track | 3 | 0.001874 | 0.000625 | 6.379 | 0.01315 |
| Distance | 1 | 0.011761 | 0.011761 | 120.115 | 1.66 x 10^-6 |
| Speed | 1 | 0.001051 | 0.001051 | 10.736 | 0.00958 |
| Distance:Speed | 1 | 0.000911 | 0.000911 | 9.306 | 0.01378 |
| Residuals | 9 | 0.000881 | 0.000098 |  |  |

Split-Plot

| Source | DF | Sum Sq. | Mean Sq. | F-Value | P-Value |
| --- | --- | --- | --- | --- | --- |
| Device | 4 | 0.05454 | 0.013636 | 68.036 | < 2.0 x 10^-16 |
| Distance:Device | 4 | 0.01693 | 0.004233 | 21.122 | 4.27 x 10^-10 |
| Speed:Device | 4 | 0.00137 | 0.000342 | 1.706 | 0.164 |
| Distance:Speed:Device | 4 | 0.00026 | 0.000064 | 0.321 | 0.862 |
| Residuals | 48 | 0.00962 | 0.000200 |  |  |

The table seen above is an ANOVA table for the full model. In the left column, you can see all of the factors and their interaction terms (denoted by Factor1:Factor2). "Residuals" refers to the unexplained error in the model. To determine if a factor or interaction is important to the model, we compare the mean square of the factor in question to the mean square of the Residuals (by dividing). This is essentially comparing the variation in the response due to the factor to the variation in the response due to "noise" caused by unexplained variation. If the variation due to the factor is about the same as the variation due to the noise, then the factor is irrelevant. However, if the variation generated by the factor is greater than the noise, then the factor IS causing a change in the response. Dividing the mean square of the factor by the mean square of the residuals gives an F-value, which is the statistical test of whether or not that factor is important. Since we now do these tests in software rather than by hand, the software also generates a p-value, which you can see in the right-hand column. If the p-value is less than 0.05, the factor it's associated with is important and needs to be kept in the model. You can actually try this by hand above: if you divide the mean square for Device (0.013636) by the mean square of the split-plot Residuals (0.000200), you will get an F-value of about 68. Big F-value = important factor!
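If you want to check that arithmetic yourself, it's a two-liner in R, with the numbers pulled straight from the table above:

```r
# The Device test "by hand": factor mean square divided by the split-plot
# residual mean square, with 4 and 48 degrees of freedom.
F_device <- 0.013636 / 0.000200                               # roughly 68
p_device <- pf(F_device, df1 = 4, df2 = 48, lower.tail = FALSE)
p_device                                                      # effectively 0, matching < 2.0 x 10^-16
```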

Note: this ANOVA table is a bit more complicated than a typical one because of the split-plot analysis. Typically there is only one error term (Residuals); however, with a split-plot there are two (or sometimes more!). Getting the error terms correct and having them in the appropriate place is critical for performing the tests to see if factors are significant.

Based on this table, the terms that are important (statistically significant) are: Track, Distance, Speed, Distance:Speed, Device and Distance:Device. Can you see why this is so?

Cool! My experiment worked! Now what? What does this mean? This confusing table can't be the end of the analysis, right? Absolutely not. Part of the job of a statistician is not only to run the analysis, but to translate the results into explanations (words), plots, and tables that a client or management can understand. What did we find out from this experiment, and how does it apply to your product, company, clinical trial, etc.? This isn't always possible for the statistician to do - if we're consulting, then we are probably not also an SME (subject matter expert), and sometimes the data may even be coded so that we don't actually know what the experiment is about. In that case, we would summarize in terms of plots and leave the "why" to the experts.

So this ANOVA table tells me a lot. However, it doesn't explain WHY these factors are significant. So let's try to find out. Before going any further, the model needs to be checked for adequacy. If the model is not adequate, then any inference (fancy stats word for conclusion) made from it may not be correct. I won't go into this further other than to point out one interesting feature that I saw.


This is a plot used to detect outliers in my data. An outlier is a data point that is far away from the majority of the other data points. Sometimes outliers can have negative effects on the data analysis, and it can be appropriate to remove them prior to analyzing the data. I had 4 outliers, and interestingly enough, ALL 4 were associated with the Garmin Edge 500 - they are circled in pink in case that wasn't obvious enough. For those of you who are familiar with Garmin's lineup, this is actually a cycling computer that I borrowed from John (again at the suggestion of my professor, smart man). In this case, I left the outliers in my data set (they did not have much of an effect on the overall outcome of my analysis, and I saw no compelling reason to remove them).
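I won't claim this is exactly how my outlier plot was made, but as a rough sketch, standardized residuals from a plain fixed-effects version of the model will flag the same kind of points (using the same assumed gps data frame as in the model-fitting sketch above):

```r
# Rough outlier check (a sketch, not necessarily my exact method): fit a plain
# fixed-effects version of the model and look for large standardized residuals.
fit_fixed <- lm(Diff ~ Track + Distance * Speed * Device, data = gps)
plot(rstandard(fit_fixed), ylab = "Standardized residual")
abline(h = c(-2.5, 2.5), lty = 2)   # flag anything beyond +/- 2.5
gps[abs(rstandard(fit_fixed)) > 2.5, c("Device", "Track", "Distance", "Speed", "Diff")]
```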

So we've got the ANOVA table, the model checks out, now it's time for the fun part - see what happened and try to figure out why! We do this by using main effects plots and interaction plots. The rule of thumb is: if there is an interaction between two factors, use the interaction plot. If a factor is not involved in an interaction, use the main effects plot. It's not appropriate to use a main effects plot for a factor that is involved in an interaction, since you don't end up seeing the whole story for that factor!

First we'll look at the interaction plots. What an interaction plot shows us is the mean of all of the responses at certain combinations of factor levels. For instance, below we have one factor, Distance, at 1600 m and 400 m - those are the two lines, dotted and solid. The other factor, Speed, is on the X axis. Speed is at two levels - jog and run. The response - mean of Diff - is on the Y axis, and the mean is taken across all 5 devices at each Distance:Speed point. So this plot is showing us that at the short distance of 400 m, the accuracy of all devices together doesn't change much whether you run slowly or quickly. Conversely, the speed DOES matter when you run a mile. The devices (on average) are more accurate when you run slowly (the mean of Diff is closer to 0) than when you run fast. I would wager that this interaction is significant purely due to the interaction of the 1600 m distance with speed.
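If you want to recreate a plot like this, base R's interaction.plot() does it in one call (again using the assumed gps data frame and column names from the model-fitting sketch above):

```r
# Distance:Speed interaction plot: Speed on the x-axis, one line per Distance,
# mean of Diff on the y-axis (averaged over all five devices).
with(gps, interaction.plot(x.factor = Speed, trace.factor = Distance,
                           response = Diff, fun = mean,
                           xlab = "Speed", ylab = "Mean of Diff (miles)",
                           trace.label = "Distance"))
```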


What could have caused this to happen? Well, think about what happens when you run 1 lap of a track and your Garmin reads 0.27 instead of 0.25. Not a big deal, right? That's pretty close. However, when you run 3 more laps, now your Garmin might read 1.08 (0.27 x 4) instead of 0.99. Propagation of error. Also, think about what happens when you run faster, but your Garmin is updating in 1 second intervals. You are running further in between those 1 second intervals, and the Garmin has to estimate your travel path in between the two update points. This can also cause more error. Just a thought. I'm not an expert.

The second interaction plot, seen below, shows the interaction of Distance and each Device. Again, Distance is represented by the lines, Device is on the X axis, and mean of Diff is on the Y axis. This time, the mean of Diff is the mean across both speeds for each Device:Distance point. Overall there is a pattern seen at both distances: the Garmin Edge tends to read short (less than 0) and the iPhone tends to read long (greater than 0). The other three (all ForeRunners) are in the middle somewhere. It also looks like this pattern is exaggerated at the longer distance - i.e. if the iPhone reads "long" for 400 m, it reads even longer at 1600 m! Same goes for the Edge.


This one is a little more interesting. Why does the Edge suck so much? One thing I failed to mention was that in setting up this experiment, I set all three ForeRunners to update every second. The Edge didn't have this option (it would be silly for the Edge to update every second because it would require more storage, and the device is generally used to go much further than a ForeRunner). I honestly have no idea how often the Edge updates, but if it's less than once per second, then again, it's going to be guessing your path of travel between those updates. And since I was going around in circles, it probably assumes travel in a straight line rather than a curve, so it effectively cuts the tangents and under-measures. Why does the iPhone suck so much? Welp, it's not designed for running. Also, I held it in my right hand, so it was getting the additive effect of the arm-pumping motion from running and of traveling further by basically being in the 2nd lane (ok, that's a bit of an exaggeration since I'm not really that wide).

Next we'll look at the main effects plot. I haven't talked much about Track or its purpose in the model. Essentially, Track and Day were not something I cared about. They were not experimental factors. However, because the track was changing, as was the day, I needed to include them in the model in case there were differences seen track-to-track or day-to-day (due to cloud cover, satellite position, etc.). This is called a blocking factor (or sometimes, a nuisance factor). In experimental design, we group (block) data points that we think will probably be similar to each other. In industry, this could be all products made on one machine; in agriculture, it could be all plants on one tract of land. In my case, all runs done on Thursday at Greece Athena are probably more similar to each other than to all runs done on Friday at Greece Odyssey, and so on. Therefore, we block on day/track. Let's go back to the ice cream/hamburger example. If 10 people were taste testing this combination, what do you think our blocking factor would be? If you said "person" then you are correct! We would block on person because each person's taste preferences would be different. Blocking factors need to be included in the model to account for potential noise or changes in the response due to the different blocks.

Below you can see my main effects plot of Track. A main effects plot is used when there are no interactions involved. This plot shows the mean of all data points obtained at each track. Even though Track was significant in the model (meaning it contributed to a change in the accuracy that was discernible from noise), when looking at this plot, the change doesn't seem to be a big deal. It's probably due to changes in satellite positions day to day, or possibly to track surroundings (for example, East High has a large stadium that could impact the GPS accuracy by reflecting the signal).
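There's no magic to a main effects plot; it's just the mean response at each level of the factor. A quick base-R sketch (same assumed gps data frame as above):

```r
# Main effects plot for Track: the mean of Diff at each track, connected by a line.
track_means <- tapply(gps$Diff, gps$Track, mean)
plot(track_means, type = "b", xaxt = "n",
     xlab = "Track", ylab = "Mean of Diff (miles)")
axis(1, at = seq_along(track_means), labels = names(track_means))
```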


So now we've seen that distance, speed, and device are all important in determining how accurate a GPS device will be. We know that the block plays a minor role. However, we still don't exactly know how these devices are different from each other. We know that they are different - Device is significant in the model and we saw that interaction plot above that looks like they are different, but I wanted to investigate this further.

This is called "post-hoc" analysis, which is probably Latin for something cool. There are some constraints to post-hoc testing that I won't touch on, but in this case, I wanted to compare all of the GPS devices to each other, known as pairwise testing. I used pairwise Tukey tests to do this.

With one line of code, R (my statistical package of choice) gives me the sweet table below, which has all of the information that I need to answer my question! The left column shows groupings - devices that share a letter are not statistically different from each other. Therefore, the 305 and the 310XT that I wore on my wrist are the same. So glad I dropped that $400 to upgrade. The devices are ordered by their mean Diff (Diff being the response), meaning that on average, the iPhone measured ~0.05 miles long. On average, the Edge measured ~0.03 miles short. The two 310XTs are in different groups, and the 310XT worn at the waist is the best device seen here (it's almost perfectly accurate)!!
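I won't swear this is the exact line I used, but HSD.test() from the agricolae package, for example, will produce grouping letters like the ones in the table below if you hand it the split-plot error terms (same assumed gps data frame as above):

```r
# Pairwise Tukey comparisons of the devices (a sketch); the error df and mean
# square are taken from the split-plot stratum of the ANOVA table above.
library(agricolae)
tukey_dev <- with(gps, HSD.test(Diff, Device, DFerror = 48, MSerror = 0.000200))
tukey_dev$groups   # mean Diff per device, with grouping letters
```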

| Grouping | Treatment (Device) | Mean Diff (miles) |
| --- | --- | --- |
| A | iPhone | 0.04875 |
| B | Garmin ForeRunner 305 | 0.02812 |
| B | Garmin ForeRunner 310XT arm | 0.02500 |
| C | Garmin ForeRunner 310XT waist | -0.000625 |
| D | Garmin Edge 500 | -0.02688 |


So what does this mean? Well, remember that I was running counter-clockwise in circles with all these things strapped to my right arm or my waist. So it kind of makes sense for the things strapped to my right arm (the top 3) to have a mean Diff that is positive, and for the thing strapped at my waist to be nearly dead-on. (And don't forget, the Edge just sucks for running, so maybe stick with a ForeRunner for your marathoning.) I suspect that if I had done this experiment while running in a straight line, or turning equally in both directions, the waist 310XT and the arm 310XT would be the same. What if I had a 3rd 310XT that I could have worn on my left wrist? Even better, right? Maybe next time..


Anyways, that was my class project. I wanted to just kind of explain what I found and ended up basically typing out an entire tutorial with explanations, but I'm ok with that. If more people understood the power of statistical analysis and how cool it is, maybe I would get fewer ugly faces made at me when I tell people what my graduate degree will be.


*For a typical "full factorial" experimental design, you have multiple factors that you wish to study (in my case: device, speed, and distance) and you run them in all possible combinations. It would have been extremely time consuming to do one device at a time (0.25 miles x 2 speeds x 5 devices = 2.5 miles + 1 mile x 2 speeds x 5 devices = 10 miles for a total of 12.5 miles per day for 4 days). You use a Split-Plot design when one factor is "harder to change" than the others - in this case, the Device. It was an interesting use of the Split-Plot design because these were developed (I believe) for agricultural studies where it's easier to spread some sort of treatment (i.e. fertilizer or pesticide) over a large area than a small one. So visually, it's easy to see for the ag. industry, but it wasn't quite as easy to visualize this with my project. A friend of mine actually suggested carrying them at the same time, I asked my professor about it, he guided me towards split-plot, and I researched how to do it and proceeded from there.