Research Article: Overcoming the problem of multicollinearity in sports performance data: A novel application of partial least squares correlation analysis

Date Published: February 14, 2019

Publisher: Public Library of Science

Author(s): Dan Weaving, Ben Jones, Matt Ireton, Sarah Whitehead, Kevin Till, Clive B. Beggs, Chris Connaboy.


Professional sporting organisations invest considerable resources collecting and analysing data in order to better understand the factors that influence performance. Recent advances in non-invasive technologies, such as global positioning systems (GPS), mean that large volumes of data are now readily available to coaches and sport scientists. However, analysing such data can be challenging, particularly when sample sizes are small and data sets contain multiple highly correlated variables, as is often the case in a sporting context. Multicollinearity in particular, if not treated appropriately, can be problematic and might lead to erroneous conclusions. In this paper we present a novel ‘leave one variable out’ (LOVO) partial least squares correlation analysis (PLSCA) methodology, designed to overcome the problem of multicollinearity, and show how this can be used to identify the training load (TL) variables that most influence ‘end fitness’ in young rugby league players.

The accumulated TL of sixteen male professional youth rugby league players (17.7 ± 0.9 years) was quantified via GPS, a micro-electrical-mechanical-system (MEMS), and players’ session-rating-of-perceived-exertion (sRPE) over a 6-week pre-season training period. Immediately prior to and following this training period, participants undertook a 30–15 intermittent fitness test (30-15IFT), which was used to determine a player’s ‘starting fitness’ and ‘end fitness’. In total, twelve TL variables were collected, and these, along with ‘starting fitness’ as a covariate, were regressed against ‘end fitness’. However, considerable multicollinearity in the data (VIF > 1000 for nine variables) meant that the multiple linear regression (MLR) process was unstable, so we developed a novel LOVO PLSCA adaptation to quantify the relative importance of the predictor variables and thus minimise multicollinearity issues. As such, the LOVO PLSCA was used as a tool to inform and refine the MLR process.
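The variance inflation factor (VIF) screening referred to above can be illustrated with a minimal sketch. It assumes the usual definition, VIF_j = 1 / (1 − R_j²), where R_j² is obtained by regressing predictor j on the remaining predictors. The data are synthetic and the variable names (`total`, `high`, `srpe`) are hypothetical stand-ins, not the study's measurements.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on an intercept plus all remaining predictors.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example with 16 "players" (all values are made up for illustration).
rng = np.random.default_rng(0)
total = rng.normal(5000, 500, size=16)          # hypothetical total distance
high = 0.1 * total + rng.normal(0, 5, size=16)  # almost a linear function of `total`
srpe = rng.normal(3000, 400, size=16)           # unrelated hypothetical variable
vifs = vif(np.column_stack([total, high, srpe]))
print(vifs)
```

In this toy data set the two distance-like columns inflate each other's VIF, while the unrelated column stays close to 1. That is the pattern that makes least squares coefficients unstable.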

The LOVO PLSCA identified the distance accumulated at very-high speed (> 7 m·s⁻¹) as being the most important TL variable to influence improvement in player fitness, with this variable causing the largest decrease in singular value inertia (5.93). When included in a refined linear regression model, this variable, along with ‘starting fitness’ as a covariate, explained 73% of the variance in v30-15IFT ‘end fitness’ (p < 0.001) and completely eliminated any multicollinearity issues. The LOVO PLSCA technique appears to be a useful tool for evaluating the relative importance of predictor variables in data sets that exhibit considerable multicollinearity. When used as a filtering tool, LOVO PLSCA produced an MLR model that demonstrated a significant relationship between ‘end fitness’ and the predictor variable ‘accumulated distance at very-high speed’ when ‘starting fitness’ was included as a covariate. As such, LOVO PLSCA may be a useful tool for sport scientists and coaches seeking to analyse data sets obtained using GPS and MEMS technologies.
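The leave-one-variable-out idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that both blocks are z-scored, that the cross-correlation matrix between predictors and outcome is decomposed by SVD, and that "inertia" is the sum of the singular values. The scaling used in the paper may differ, so the numbers are not comparable with the reported 5.93. The data and the function names (`zscore`, `inertia`, `lovo_drops`) are hypothetical.

```python
import numpy as np

def zscore(A):
    """Centre each column and scale it to unit sample standard deviation."""
    A = np.asarray(A, dtype=float)
    if A.ndim == 1:
        A = A[:, None]
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

def inertia(X, Y):
    """Sum of singular values of the X-Y cross-correlation matrix (assumed definition)."""
    Zx, Zy = zscore(X), zscore(Y)
    R = Zx.T @ Zy / (Zx.shape[0] - 1)
    return np.linalg.svd(R, compute_uv=False).sum()

def lovo_drops(X, Y):
    """Decrease in inertia when each predictor column is left out in turn."""
    X = np.asarray(X, dtype=float)
    full = inertia(X, Y)
    return np.array([full - inertia(np.delete(X, j, axis=1), Y)
                     for j in range(X.shape[1])])

# Synthetic data: the outcome depends on x0; x1 is a noisy copy of x0; x2 is unrelated.
rng = np.random.default_rng(1)
n = 16
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = rng.normal(size=n)
y = x0 + 0.1 * rng.normal(size=n)

X = np.column_stack([x0, x1, x2])
drops = lovo_drops(X, y)
ranking = np.argsort(drops)[::-1]  # predictor indices, largest inertia drop first
print(drops, ranking)
```

The predictor whose removal produces the largest drop is treated as the most influential and becomes a candidate for the refined regression model.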

Partial Text

Professional sporting organisations invest considerable resources collecting and analysing data to better understand the factors that influence athletic performance. Recent advances in wearable technology and computing power mean that large volumes of data are now readily available to the applied practitioner [1]. However, while this data is becoming easier to collect, analysing it can be a challenging task, particularly when sample sizes are small (i.e. limited by squad size) and the data is highly correlated, something that can lead to instability when applying standard least squares regression techniques, making it difficult to draw firm inference [2–3]. With respect to this, global positioning system (GPS) and micro-electrical-mechanical-system (MEMS) data can be particularly problematic [4–5]. GPS and MEMS are often used to measure an athlete’s movement, from which speed, distance travelled, and acceleration can be computed using standard mathematical algorithms. For example, a player’s velocity and acceleration are simply the first and second derivatives of the distance travelled with respect to time. Consequently, these variables are not independent, but instead are highly correlated. It is therefore not surprising that strong correlations have been reported between variables widely used to assess training load (TL) [4–5].
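The dependence between these variables can be seen in a small numerical sketch. It assumes a 10 Hz sampling rate and an idealised player accelerating from rest at a constant 2 m·s⁻² (both values are illustrative assumptions), and it uses finite differences as one possible "standard algorithm". Commercial GPS and MEMS units may filter and differentiate the signal differently.

```python
import numpy as np

hz = 10.0                    # assumed sampling rate (Hz)
dt = 1.0 / hz
t = np.arange(0.0, 5.0, dt)  # 5 s of samples
accel_true = 2.0             # assumed constant acceleration (m/s^2), starting from rest

# Distance travelled under constant acceleration: s(t) = 0.5 * a * t^2
distance = 0.5 * accel_true * t ** 2

# Speed and acceleration are derived entirely from the same distance trace.
speed = np.gradient(distance, dt, edge_order=2)  # first derivative of distance
accel = np.gradient(speed, dt, edge_order=2)     # second derivative of distance
print(speed[:3], accel[:3])
```

Because every derived quantity is computed from one underlying trace, summary variables built on them (total distance, distance above a speed threshold, acceleration counts) tend to move together across sessions.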

The TL descriptive results are presented in Table 2 along with the study data collected for each of the 16 subjects.

The overall aim of the study was to evaluate the extent to which PLSCA might be helpful when analysing TL data that exhibited considerable multicollinearity. As such, we wanted to identify the TL variables that best related to 30-15IFT performance in young rugby league players following 6 weeks of training. With respect to this, the specific findings of the current study revealed, perhaps unsurprisingly, that ‘starting fitness’ is an important covariate of ‘end fitness’, with a strong positive correlation between the two, something that others have observed [30–31]. The strongest regression model (Table 7; MLR Model 1) suggests that professional youth rugby league players with a lower starting fitness require a lower accumulation of distance at very-high speed (> 7 m·s⁻¹) than players with a higher starting fitness to elicit a comparable incremental improvement in end fitness (e.g. +1 km·h⁻¹ in v30-15IFT) following 6 weeks of training. This model suggests, for example, that a professional youth rugby league player with a starting v30-15IFT of 17.5 km·h⁻¹ would require an accumulation of 350 m at very-high speed over 6 weeks to improve their v30-15IFT by 1 km·h⁻¹, compared to 1050 m for a player with a starting fitness of 20.5 km·h⁻¹. As such, this regression model could be used to translate TL data (in conjunction with starting fitness) into practical targets for the applied practitioner working with youth rugby league players. However, it is important to note that this relationship (and associated MLR model) was observed within a single team, meaning the variability between players in the accumulated distances at very-high speed is specific to the context of the training modalities prescribed by the coaching staff at this club [32].
We therefore recommend that future researchers conduct randomised controlled trials with appropriate comparator arms in order to consolidate or refute our findings regarding the importance of the interaction between the distance accumulated at very-high speed and a player’s ‘starting fitness’ to improving prolonged intermittent running ability in team sport athletes.

The findings of the current study demonstrate that multicollinearity is a major limiting factor, which has the potential to compromise analysis of TL data. However, this problem can be overcome by using an orthogonal PLSCA approach, which is immune to multicollinearity, thus enabling the user to quantify the strength of the relationships between the respective variables. Using LOVO PLSCA we were able to identify those variables that were most influential in explaining improvements in player fitness. This enabled us to remove irrelevant variables, overcome the multicollinearity issues, and produce a robust MLR model for predicting ‘end fitness’. From this model we inferred that ‘starting fitness’ and the accumulation of distance at ‘very-high speed’ across a 6-week period of training were the most influential predictors of end fitness in professional youth rugby league players. As such, PLSCA appears to be a useful tool for filtering out irrelevant information and identifying those variables that should be included prior to any given MLR analysis.



