The impact of Orthogonal variation in Chemometrics: Review of 15 years of method development and applications

Johan Trygg, Hans Stenlund, Erik Johansson, Max Bylesjö, Svante Wold
Computational life science cluster (CLiC), Department of Chemistry, Umeå University, Sweden

What is Orthogonal Variation?

The concept of Orthogonal variation

Defining properties of Orthogonal variation (X matrix)
• Systematic variation in X
• Orthogonal to Y (considering noise level in data)
• Belongs to the X space (i.e. you can use it to predict new samples)

Orthogonal variation is important for understanding a complex system
– Gender, drift, side reactions, unknown interferents, sampling, experimental problems, non-linearities, biological variation

The new set of 'O'-methods (OPLS, OPLS-DA, K-OPLS, O2PLS, OnPLS and other related methods) divides the systematic X-variation into two parts:
– What in X is related to Y: Predictive variation
– What in X is uncorrelated to Y: Orthogonal variation

Orthogonal variation – schematic view

[Figure: % of variation across X variables 1-5, divided into Y-predictive, Orthogonal and noise]

Measured signal is the sum of many contributing factors
– Pharmaceutical tablet formulation (e.g. binders, fillers, active drug, lubricant)
– Human urine sample (e.g. genetics, diet, gender, age, stress, disease)
– Plant biotech / Pulp & paper (e.g. wood species, cellulose & lignin content, water, age)
– QSAR: the molecular descriptors are a function of their chemical and biological property/activity/function

• Lots of unknown systematic variation
– mostly due to poor knowledge…
– strong dietary, environmental, hormonal variations, etc…
– Experimental variation, sampling, instrumental variation
– Input material varies with supplier

Orthogonal variation is included in the measured data values, and forms an integral part of data variability (multi-component)

Example: Two component system

$X = y_1 x_1^T + y_2 x_2^T + E$
Constraint: $y_1 \perp y_2$

[Figure: spectral profile of the Predictive component, spectral profile of the Orthogonal component, the X matrix and the y1 and y2 vectors (labels 100 and 70), passed to PLS]

2011-06-15

PLS results
Example: Two component system

[Figure: Observed vs Predicted for the one PLS component model; Observed vs Predicted for the two PLS component model]

What about interpretation of PLS model?
Example: Single-Y, two component system

[Figure: score plot and loading plot (w*1c1, w*2c2, y); R2X(1) = 94% variation, R2X(2) = 3.7% variation; profiles w*1, w*2, p1, p2 and regression coefficients b]

Target rotation – Olav Kvalheim
Kvalheim O M, Karstang T V. Interpretation of Latent-Variable Regression Models. Chemometrics Intell. Lab. Syst., 1989; 7: 39-51
• Although a multi-component PLS model is fitted for a single-Y variable,
– only a single Y-related component exists

[Figure: X, y and yhat, with py' and b' coeff.]

Orthogonal Signal Correction (OSC) - Presented in 1997 in Lahti by S. Wold
• Basic idea, X → Y: remove structured noise (i.e. systematic variation) from X that is not correlated to Y (i.e. $Y^T t_o = 0$)
– $X = t_o p_o^T + X_E$

Indirect approach
1.) Start from a vector $t_o$ and use it as the y-variable
2.) PCA/PLS/PCR regression model with $t_o$ as y-vector to find the orthogonal component
3.) Iterate if necessary
References: OSC Wold (1998), OSC Sjöblom (1998), DO Andersson (1999), DOSC Westerhuis (2001), POSC Trygg (2002)

Direct approach
1.) Calculate covariation matrix $W = X^T Y$
2.) Any row vector in X orthogonal to W is an orthogonal component.
References: OSC Fearn (2000), OSC Höskuldsson (2001), OPLS Trygg (2002)

Reviews: Svensson (2002), Goicoechea (2001)

How OSC method (Wold et al. 1998) finds one Orthogonal component

[Figure: diagram with labels X, PCA component, score t, "Make score t orthogonal to y", multi-component PLS model (w = regression coefficient vector, wT), tOSC]

OSC method: problems
• Overfit in estimating OSC component(s)
• Failure to achieve Y-orthogonality
• Unclear objectives – can result in a more complex model
• It does not consider the prediction model (e.g. PLS)
• Two-step process (OSC + PLS)

Orthogonal projections to latent structures (OPLS) - Prediction model with integrated filter (Orthogonal+Predictive)
Trygg J, Wold S. J. Chemometr., 2002; 16: 119-128

$X = t_p p_p^T + T_o P_o^T + E$
$y = u_p c_p^T + f$
Only a single Y-related component.

[Figure: OPLS diagram with X, y, tp1, pp1', To (to2, to3, to4), Po' (po2', po3', po4'), up, cp', wp1* and XE]

Some theoretical properties of OPLS
– Focus on prediction
– For single-Y variable OPLS model: $w_o = p - w$

Alternative methods
• POSC (Trygg, 2002)
• PLS-PCP (Langsrud, 2003)
• PLS-CCA (Yu, 2004)
• PLS-ST (Ergon, 2005, 2007)
• XTP (Kvalheim, 2008)
• Additional theoretical aspects of OPLS
– Verron (2004)
– Kemsley (2009)

PLS vs OPLS model
Example: Single-Y, two component system

[Figure: PLS model (w1, w*1, w*2, p1, p2; 94% variation and 3.7% variation) next to OPLS model (Predictive and Orthogonal, to[1]; p1 Predictive profile with 49% variation, p2 Orthogonal profile with 49% variation)]

Understanding Orthogonal variation is important

[Figure: OPLS score plots (t[2]O axis, loadings p1p and p1o) for the 90° case (HS_rot90.M2) and the 45° case (HS_rot45.M2); 90° case: R2X[1] = 0.4968, R2X[2] = 0.496254]
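The single-Y OPLS relations quoted on the slides above (one predictive component, Y-orthogonal components removed from X, and w_o = p - w) can be illustrated with a small sketch. This is a minimal, dependency-free illustration and not the authors' implementation: the function name `opls_one_orthogonal`, the helper functions and the toy data are hypothetical, and the orthogonal weight is computed by removing from p its projection onto the unit-length w, which is an assumption about how the slide's relation is normalised.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def xt_times(X, v):
    """Return X^T v for X given as a list of rows."""
    return [dot(col, v) for col in zip(*X)]

def x_times(X, w):
    """Return X w for X given as a list of rows."""
    return [dot(row, w) for row in X]

def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def opls_one_orthogonal(X, y):
    """Extract one Y-orthogonal component (hypothetical helper).

    X is a column-centred matrix (list of rows), y a centred vector.
    Returns the orthogonal score t_o, its loading p_o and the filtered X.
    """
    w = normalise(xt_times(X, y))      # predictive weight, w proportional to X^T y
    t = x_times(X, w)                  # predictive score
    tt = dot(t, t)
    p = [v / tt for v in xt_times(X, t)]   # X loading belonging to t
    # Assumed normalisation: remove from p its projection onto w.
    wp = dot(w, p)
    w_o = normalise([pi - wp * wi for pi, wi in zip(p, w)])
    t_o = x_times(X, w_o)              # Y-orthogonal score
    too = dot(t_o, t_o)
    p_o = [v / too for v in xt_times(X, t_o)]
    # Filter: subtract the orthogonal component t_o p_o^T from X.
    X_f = [[xij - ti * pj for xij, pj in zip(row, p_o)]
           for row, ti in zip(X, t_o)]
    return t_o, p_o, X_f

# Toy two-component system X = y1 x1^T + y2 x2^T with y1 orthogonal to y2
# (made-up numbers, no noise term).
y1 = [-3.0, -1.0, 1.0, 3.0]
y2 = [1.0, -1.0, -1.0, 1.0]
x1 = [1.0, 2.0, 0.0]   # profile of the predictive component
x2 = [0.0, 1.0, 2.0]   # profile of the orthogonal component
X = [[a * u + b * v for u, v in zip(x1, x2)] for a, b in zip(y1, y2)]

t_o, p_o, X_f = opls_one_orthogonal(X, y1)
```

For this noise-free toy system the orthogonal score comes out proportional to y2, its loading proportional to x2, and the filtered matrix equals y1 x1^T, so the Y-orthogonal part is separated from the predictive part. With real, noisy data the separation is only approximate and the number of orthogonal components has to be chosen, for example by cross-validation.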
[Figure: PLS score plots (t[2] axis, loadings p1 and p2) for the 90° case (HS_rot90.M1; R2X[1] = 0.955399, R2X[2] = 0.0376559) and the 45° case (HS_rot45.M1; R2X[1] = 0.989184, R2X[2] = 0.00660776); the 45° OPLS model has R2X[1] = 0.845847, R2X[2] = 0.149945]

OPLS multi-Y in multivariate calibration - Pure profile estimation!
• Direct calibration predicts X from Y (Classical Least Squares): $X = Y K^T + E$
• Indirect calibration predicts Y from X (PLS, OPLS): $Y = X B + F$

B are the regression coefficients for X (X → Y)
K are the regression coefficients for Y (Y → X)

K matrix is useful for spectral or chromatographic data
- estimate of the pure profile for each analyte (column) in Y
- useful model diagnostics (focus on correct variation in model)
B matrix does not have a similar interpretation.
However, there is a link between them: $K = B (B^T B)^{-1}$

Trygg J. Prediction and spectral profile estimation in multivariate calibration. Journal of Chemometrics, 2004; 18 (3-4): 166-172

Single-Y vs multi-Y OPLS models
Trygg J. Prediction and spectral profile estimation in multivariate calibration. Journal of Chemometrics, 2004; 18 (3-4): 166-172

[Figure: two single-Y OPLS models (y1, y2) with Predictive profiles and Y-orthogonal profiles (po2, X orth), compared with a multi-Y OPLS regression, $K = B (B^T B)^{-1}$, with Predictive profiles; variation labels: 84 %, 15 %, 50 %, 50 %]

Case study: Plant metabolomics on Poplar trees
• PttPME1 expression was up and down regulated in transgenic aspen trees
• PME enzyme activity in wood forming tissues was correspondingly altered
• Lines in this study
– WT poplar
– 5: down regulated PttPME1 gene

Metabolomics study of xylem
[Figure: OPLS model; axes labelled Between class variation and Orthogonal variation]

Plant metabolomics on Poplar trees
Orthogonal variation reveals experimental problems with scraping
[Figure: Line 5 vs WT; Orthogonal S-plot]

Multivariate calibration
Carrageenan application
• Carrageenans are polysaccharides extracted from seaweed, which are used as gelling and thickening agents in a wide range of industries, including food, pharmaceuticals and cosmetics.
• Five naturally occurring carrageenan types, viz. Lambda, Kappa, Iota, Mu and Nu
• Three spectral techniques (NIR, IR, Raman)
– DOE mixture design
• Objectives:
– (i) to find out overlapping spectral information;
– (ii) to highlight the unique features of the different spectroscopic techniques;
– (iii) to accomplish a predictive calibration model for five different carrageenan constituents.
– Reference: M. Dyrby et al., Carbohydrate Polymers 57, 337-348, 2004.

Hi-OPLS & Hi-OPLS/PCA
Eriksson L, Toft M, Johansson E, Wold S, Trygg J. Separating Y-predictive and Y-orthogonal variation in multi-block spectral data. Journal of Chemometrics, 2006; 20 (8-10): 352-361
• Base level: OPLS models between each spectral block & Y
• Top level: Separate OPLS and PCA models
– OPLS to focus on Y-correlating information (P)
– PCA to focus on Y-orthogonal variation (O)

[Figure: block diagram with blocks NIR, IR, Raman and Y, each split into P and O parts; size labels 704, 667, 3406 and 5; row labels 102 and 26]

Interpretation of PCA model
Orthogonal variation
Line plot of t1 reveals time trend!
[Figure: line plot of t[1] for model Hi-Carra_NIR_IR_Raman_SNV.M7 (PCA-Class(1)), PCA top level, only orthogonal score variables]
[Figure details: t[1] vs Num, R2X[1] = 0.186045, observations marked Day 1 to Day 4, with 2 SD and 3 SD limits]

Base level OPLS model
NIR & IR reveal water peak (Raman not influenced)
[Figure: contribution plot for model Hi-Carra_NIR_IR_Raman_SNV.M7 (PCA-Class(1)), PCA top level, only orthogonal score variables; Score Contrib(Obs Group - Obs Group), Weight = p[1]; variables $M3.t4 to $M3.t7 (NIR), $M4.t5 to $M4.t7 (IR), $M5.t4 to $M5.t8 (Raman)]

Orthogonal variation for fault detection and quality control
NIR spectroscopy data
[Figure: PLS scores; OPLS scores]

Chemical imaging application: FT-IR imaging spectroscopy on mouse liver
1.) Stenlund H, et al., Analytical Chemistry, 2008, 80, 6898-6906
2.) Gorzsás A, et al., The Plant Journal, doi: 10.1111/j.1365-313X.2011.04542.x
Liver samples with two different cell types
- Hepatocytes (cell of the main tissue of the liver)
- Erythrocytes (red blood cells)
Bruker Equinox 55 spectrometer, FPA detector (64x64)

Orthogonal Projections to Latent Structures (OPLS)
Example PAT: Binary powder
Access to the current data set was kindly granted by Dr Ola Berntsson of AstraZeneca, Södertälje [Berntsson, et al., 2000; Berntsson, 2001]
• Diffuse reflectance NIR spectroscopy
• Mixture of two powders with markedly different particle size
• 11 batches of powders, 0% to 100% in steps of 10%.
• X = NIR spectra (SNV) in the range 1080-2025 nm
• Y = % binary mix of powders

[Figure: PLS model scores]
Figure: Schematic overview of the vertical cone mixer and the fibre-optic probe set-up.
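The SNV (standard normal variate) pretreatment listed for X in the binary powder example centres each spectrum by its own mean and scales it by its own standard deviation. A minimal sketch, assuming spectra stored as plain Python lists; the function name `snv` and the absorbance values are made up for illustration and are not taken from the data set.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum (row)."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)   # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in spectrum]

# Hypothetical readings: raw_b is raw_a with an additive offset and a
# multiplicative scaling, as might arise from light scatter differences.
raw_a = [0.10, 0.20, 0.40, 0.30]
raw_b = [0.5 + 2.0 * v for v in raw_a]

snv_a = snv(raw_a)
snv_b = snv(raw_b)
```

After SNV the two spectra coincide, because per-spectrum offsets and scalings are removed. This is why the pretreatment is commonly applied to diffuse reflectance NIR data, where such effects often stem from particle size rather than composition.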
[Figure: OPLS model scores]

Example PAT: Binary powder
Non-linearity is detected in Orthogonal components
[Figure: PLS loading profiles (p); OPLS loading profiles (p)]

Example: Batch processes
Orthogonal variation = Kinetics
Batch mini plant: Hydrogenation, Nitrobenzene to Aniline
OPLS model: 1st derivative UV spectra vs Gas feed
• Modelled variation in X
– Y-predictive (92%): loading vector pp1
– Y-orthogonal (7%): loading vector po2
• Loading profile similarity: kinetic differences, not a competing side reaction.
[Figure: loading profiles with Nitrobenzene (260 nm) and Aniline (235 nm) marked]

OPLS method was top-ranked for microarray-based predictive model, not PLS!
(1) The performance of the prediction models depends largely on the quality and relevance of data
(2) The experience and proficiency of the data analysis team are crucial factors for success
(3) Different prediction methods yield similar prediction results.
Reference: Leming Shi et al., Nature Biotechnology, Aug 09 2010. doi:10.1038/nbt.1665

O2PLS model for overview - extended OPLS model
• Separate model for joint and orthogonal variation
• Model of X: $X = T_p P_p^T + T_o P_o^T + E$
• Model of Y: $Y = U_p Q_p^T + U_o Q_o^T + F$

[Figure: O2PLS diagram of X and Y; X-Y joint variation ('Y-predictive' T, P^T and 'X-predictive' U, Q^T), variation unique to X ('Y-orthogonal', T_o, P_o^T) and variation unique to Y ('X-unrelated', U_o, Q_o^T)]

Trygg J, Wold S. O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. Journal of Chemometrics, 2003; 17 (1): 53-64

O2PLS for overview has been extended - finds two types of Orthogonal variation
O2PLS model
- Finds ALL systematic Orthogonal variation
- PCA on residual matrix E
- Does not affect prediction model
- Added to existing Orthogonal scores and loadings

[Figure: O2PLS diagram with X, y, tp1, pp1', up, qp1' and XE; Orth(OPLS) scores and loadings To, Po' (to1, to2, to3; po1', po2', po3', po4'); Orth(PCA) scores and loadings from E (to4, to5; po4', po5')]

O2PLS for two-block overview (think PCA)
Preference mapping
• Sensory and preference data for a set of 13 apples
– 70 sensory attributes (X-variables); panel averages across 12 judges
– 108 consumer likings (Y-variables), expressed on a nine-grade scale
– Original reference [MacFie, H., et al., 1999].
[Figure: X = Sensory judges (13 × 70), Y = Consumer Likings (13 × 108)]
• Group formation among sensory attributes
– 1_ is a "First Bite" attribute
– E_ is an "External Appearance" attribute
– EA_ is "External Aroma"
– A_ is "Astringent aftertaste"
– F_ is "Flavor"
– I_ is "Internal Appearance"
– T_ is "Texture"

Preference mapping
O2PLS overview plot
[Figure: X/Y Overview for model SensCons.usp_5.M16 (O2PLS), X model and Y model; bars for R2 Predictive (X), R2 Orthogonal in X (PCA), R2 Predictive (Y), R2 Orthogonal in Y (OPLS), R2 Orthogonal in Y (PCA), plus R2 and Q2, on a scale from 0 to 1; produced with SIMCA 13.0]

O2PLS modeling in Preference mapping
Unique variation in Y (uncorrelated)
Unique variation in Y (28%): consumer likings not picked up by sensory data

Process integration in pharma industry
Food and Drug Administration (FDA)
Instead go from product testing to quality by design!
Risk minimisation – process understanding

Systems biology approach: Combined profiling of transgenic Poplar
Study design
O2PLS model results
G3 & G5 contain antisense constructs of the gene PttMYB21a, affecting plant growth. The closest ortholog to PttMYB21a in Arabidopsis thaliana is AtMYB52.

New OPLS developments: OnPLS for modeling any number of matrices
• Extension of O2PLS, uses MAXDIFF for predictive model, fully symmetric
• Tommy Löfstedt, Johan Trygg. OnPLS - a novel multiblock method for the modelling of predictive and orthogonal variation. Journal of Chemometrics, 2011.
• Mohamed Hanafi and Henk A. L. Kiers. Analysis of k sets of data, with differential emphasis on agreement between and within sets. Computational Statistics & Data Analysis, 51(3):1491-1508, 2006.

Concluding remarks
• Introduction of OSC, OPLS and related efforts has made a real impact.
In biology, our methods are now even more credible and accessible for scientists outside our field.
• OPLS is used by more than 150 Swedish companies, 50 international institutions and the ten largest pharmaceutical companies in the world
• In total, more than 500-600 citations of these methods so far.
• We have only begun to scratch the surface of the potential

[Figure: citations over time, with labels "Thesis", "2011" and "Highest impact journals (Nature, Lancet)"]

Acknowledgements
• Chemistry dep, Umeå University: M Eliasson, M. Bylesjö, H. Stenlund, R. Madsen, S. Wiklund, P. Jonsson, H. Antti
• Uppsala University: T. Lundstedt
• Imperial College: J. Nicholson, E. Holmes, M. Rantalainen, O. Cloarec
• Umeå Plant Science Center, Umeå Univ, Sweden: T. Moritz, L. Gerber, A. Sjödin, R. Nilsson, A Grönlund, S Jansson, B Sundberg, G. Wingsle, J. Karlsson, V. Srivastava, R. Bahlerao, G. Sandberg
• Riken University: M. Kusano
• MKS Umetrics: E. Johansson, L. Eriksson, J. Christensen, S. Wold

Funding:
• Swedish Research Council
• Swedish Foundation for Strategic Research (SSF)
• FORMAS
• Knut & Alice Wallenberg Foundation
• GlaxoSmithKline
• AstraZeneca
• MKS Umetrics
