# When does dimension reduction go wrong?

## 11 Dimension reduction

Transcript

Chapter 11 Dimension Reduction. This chapter deals with complex situations in which several variables have to be analyzed at the same time. The aim is to reduce this complexity so that essential structures in the data become recognizable without losing too much information. A by-product of such an analysis is the creation of new, simpler variables based on the original ones, which can be used in further analyses. Learning objectives: After working through this chapter, you will have achieved the following: You know what principal components are and what a principal component analysis is, and you can compute them in R and display them graphically. You can interpret the result both technically and in terms of content. You know what component loadings and communalities are and can compute and evaluate them in R. You know how to determine the number of principal components; with R you can create a scree plot and compute the eigenvalues. You can check the prerequisites of a principal component analysis using the Kaiser-Meyer-Olkin (KMO) and Bartlett statistics. You know what rotation of components means and can compute it in R. You are able to use the results of a principal component analysis in the form of component scores. Can the complexity of multidimensional metric data be reduced to a few important principal components? So far we have limited ourselves to two settings with regard to the number of variables analyzed: univariate and bivariate methods.

Univariate analysis, as the name suggests, focuses on a single variable. This does not mean that several variables cannot also be involved. If you want to answer questions in which the distinction between dependent and independent variables matters, then univariate means that there is only a single dependent variable. We have, for example, got to know the one-sample test (Section 8.2), but also regression (Section 9.2) and analysis of variance as representatives of such methods. The number of further variables whose influence on the dependent variable is examined is irrelevant for the assignment to the class of univariate methods. The bivariate methods we have seen so far were about the association of two variables, e.g. in the chi-square test for independence (Section 7.4) or in correlation analyses (Section 9.1). This chapter covers an extension to multivariate methods, of which we only consider those for metric (scale) variables. In addition, we will not deal with situations in which we distinguish between dependent and independent variables; rather, we will examine structures of association among several variables. The term multivariate goes hand in hand with the term multidimensional. In scatter plots (Section 9.1.1), individual persons (or observation units) are shown as points in a coordinate system spanned by two variables. When three variables are considered at the same time, each person is represented as a point in three-dimensional space; with ten variables, in a ten-dimensional space. Unfortunately, one can no longer picture that, but it is not necessary at all. For visual intuition it is best to stay in two or three dimensions and abstract from there. And mathematics (especially linear algebra) helps us with the calculations anyway. So if we want to study several variables at the same time, we are dealing with multidimensional problems.
Since we can only imagine multidimensional relationships with great difficulty, let alone grasp them, we need methods that help us. Fundamentals of principal component analysis. The method that we want to get to know and that is suitable for analyzing such questions is called principal component analysis (PCA). In some ways it is related to what is known as factor analysis and often gives similar results. Conceptually, however, factor analysis is a completely different procedure and actually an umbrella term for various specialized methods. In some popular statistical programs this distinction is unfortunately not made clear, and

the impression arises that the principal component method is only a special case of factor analysis. A detailed presentation and comparison of principal component and factor analysis would go far beyond the scope of this book. A very good and understandable overview is provided, e.g., by Bühner (2004) for everyone who wants to delve deeper into this topic. We can only present a few basic ideas for analyzing multivariate relationships and will concentrate on the most important and most frequently used method, principal component analysis. The focus here is on the exploratory nature of the method and on the objective of data reduction.

Principal component analysis: Overview. Objectives:

- Reduction of a larger number of correlated variables to a smaller number of uncorrelated variables, preserving a large part of the information
- Uncovering a structure that underlies a large number of variables
- Discovering patterns of common dispersion (correlation) among the variables
- Generating artificial dimensions or new variables (so-called principal components) that correlate highly with the original variables

Principal component analysis is an exploratory method, i.e. there is no prior knowledge (no hypotheses) about the underlying patterns; these are to be discovered. As an exploratory method, principal component analysis does not provide results in the sense of decision aids for testing statistical hypotheses, but it can provide valuable information about structures in the data. In this respect, it can also be viewed as a hypothesis-generating method. Principal component analysis is based on covariances (correlations) between all variables. Therefore, the data must be metric and, as in linear regression, the relationships between them must be linear. So all the conditions that apply to the Pearson correlation coefficient apply here as well.
In practice, however, Likert items are often used which, strictly speaking, are only ordinally scaled. The application of a principal component analysis is nevertheless justified (owing to its exploratory nature) if the results can be meaningfully interpreted. We first want to get to know the basic ideas using a simple example. It is about the self-concept of students, i.e. how students see

themselves. Let us assume that we presented the following items to a sample of students, which they answered on a 5-point Likert response scale (with values ranging from strongly agree to strongly disagree):

1. I like to go to parties (A1)
2. I prefer to live together rather than alone (A2)
3. It is easy for me to talk to anybody (A3)
4. I am very good at math (B1)
5. I like to write essays or seminar papers (B2)
6. My favorite place at the university is the library (B3)

If you look at these items, you can see that there are two groups, identified by the labels A and B. Group A describes social characteristics, group B characteristics that have more to do with academic, study-centered aspects. For the sake of clarity, we have anticipated the result by dividing the six initial items into two groups. In real problems, this grouping would of course not be known in advance. The aim of principal component analysis is to help find such groupings. The starting point of the analysis are the correlations between all variables, which can be represented in the form of a table, the so-called correlation matrix (Table 11.1).

Table 11.1: Correlation matrix for the self-concept example

You can see that items A1 to A3 correlate highly with one another, namely with 0.7 to 0.8. They form a group (also indicated by the colored background in the table) and thus have something to do with each other. The higher, e.g., the agreement with "I like to go to parties", the higher the agreement with "I prefer to live together rather than alone". These three items have only very small correlations with the items in group B, so they have nothing to do with them. We see the same pattern for the items in group B: the B items correlate highly with one another (0.79 to 0.85), but only slightly with the A items.
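The pattern described here, high correlations within a group and near-zero correlations across groups, can be sketched numerically. The chapter's own code is in R; the following is a minimal Python/NumPy sketch with made-up data (the latent "social" and "academic" traits and all coefficients are invented for illustration):

```python
import numpy as np

# Made-up data: 300 people, 6 items built from two uncorrelated latent
# traits, mimicking the two-group structure of the self-concept example
rng = np.random.default_rng(42)
social = rng.normal(size=(300, 1))
academic = rng.normal(size=(300, 1))
X = np.hstack([social + 0.4 * rng.normal(size=(300, 3)),     # items A1-A3
               academic + 0.4 * rng.normal(size=(300, 3))])  # items B1-B3

R = np.corrcoef(X, rowvar=False)  # 6x6 correlation matrix

# Items within a group correlate highly ...
assert R[0, 1] > 0.5 and R[3, 4] > 0.5
# ... while correlations across the two groups are near zero
assert abs(R[0, 3]) < 0.2
```

In practice the correlation matrix would of course be computed from the observed item responses, not from simulated traits.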

In essence, we have already carried out a kind of intuitive principal component analysis this way. In practice this is unfortunately not so easy, as the groups often overlap, the correlation coefficients are not so large and clear-cut, and the variables are not so conveniently arranged. The two groups can now be summarized insofar as we regard each of them as a new variable. These new variables are known as principal components. We can give them names, e.g. social orientation and academic orientation. And we can work out, for each person, the value that describes them on this new variable. For example, if a person gives high approval to items A1 to A3, then they will also have a high value for social orientation. This value is called the component value or component score. Think of it like linear regression, with the component scores taking on the role of the dependent variable. The equations for the two principal components are

Score_A = β_{A,a1}·a1 + β_{A,a2}·a2 + β_{A,a3}·a3 + β_{A,b1}·b1 + β_{A,b2}·b2 + β_{A,b3}·b3
Score_B = β_{B,a1}·a1 + β_{B,a2}·a2 + β_{B,a3}·a3 + β_{B,b1}·b1 + β_{B,b2}·b2 + β_{B,b3}·b3

where Score_A and Score_B are the values that the persons receive on the new variables, a1 to b3 are the standardized values of the original variables A1 to B3 (see footnote 1), and the βs are weights. For example, β_{A,a1} is the weight with which the (standardized) value of item A1 enters Score_A, while β_{B,a1} is the weight with which the (standardized) value of the same item enters Score_B. In Score_A, items A1 to A3 will have high weights, while B1 to B3 will have only very low weights for this component. The opposite applies to component B. How do you obtain the principal components, and how many are useful? Our fictitious example of the self-concept of students is of course kept simple and manageable. It is obvious that there are two major principal components.
But purely mathematically there are 6 principal components in this case, namely as many as there are original variables.

Footnote 1: For each person, we get the standardized value of A1, namely a1, by subtracting the mean of the variable A1 from the person's value of A1 and then dividing by the standard deviation of A1, i.e. a1 = (A1 − mean(A1)) / sd(A1). This transformation (standardization) results in the mean of a1 being 0 and the variance of a1 being 1. The same applies to the standardization of any variable.
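The standardization from the footnote, and a component score as a weighted sum of standardized items, can be illustrated with a short sketch (Python/NumPy here for illustration, since the chapter's code is in R; the responses and the weights β are made up, not the result of an actual PCA):

```python
import numpy as np

# Hypothetical Likert responses of six people on items A1-A3
# (1 = strongly disagree ... 5 = strongly agree)
X = np.array([[5, 4, 5],
              [4, 5, 4],
              [2, 1, 2],
              [1, 2, 1],
              [4, 4, 5],
              [2, 2, 1]], dtype=float)

# Standardization: subtract the column mean, divide by the standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Each standardized variable now has mean 0 and variance 1
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.var(axis=0), 1.0)

# A component score is a weighted sum of the standardized values
# (these weights are invented for illustration only)
beta = np.array([0.5, 0.4, 0.45])
score_A = Z @ beta
print(score_A.shape)  # one score per person -> (6,)
```

In R, scale() performs the same standardization and principal() computes the actual weights.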

But since we have set ourselves the goal of reducing dimensionality, we have to find a solution with fewer new variables than original items. The basic idea is as follows; we proceed step by step:

Step 1: Look for the largest group of items that correlate highly with each other. They form the first principal component.
Step 2: Look for the second-largest group of items that correlate highly with one another, but correlate as little as possible with the first group.
Further steps: Continue in this way until there are as many principal components as items.

This procedure is called extraction of principal components (see footnote 2). Of course it becomes more and more difficult to find further groups of items that correlate highly with each other but only weakly with all previous groups. Therefore, one needs a stopping rule that separates important from unimportant principal components (or components for short). Unfortunately, there is no universally valid method for determining the number of important components. Since principal component analysis is an exploratory method, you will choose the number for which the extracted components are easy to interpret and meaningful in terms of content. It is a subjective decision. However, there are a few rules of thumb which, as a first step, give you an idea of roughly how many components to choose. In practice, two methods are usually used to help find a reasonable number: one examines the so-called eigenvalues (numerically), or the so-called scree plot (graphically).

Footnote 2: This description is mathematically inaccurate and is only intended to convey a basic understanding. A more detailed description can be found in the appendix to this chapter.
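Mathematically, the extraction described above amounts to an eigen-decomposition of the correlation matrix. A minimal sketch (Python/NumPy for illustration; the correlation values are invented to mimic the two-group pattern of the self-concept example):

```python
import numpy as np

# Correlation matrix of the six self-concept items, with made-up values:
# high correlations within groups A and B, low correlations across groups
R = np.array([
    [1.00, 0.75, 0.70, 0.10, 0.05, 0.08],
    [0.75, 1.00, 0.80, 0.05, 0.10, 0.06],
    [0.70, 0.80, 1.00, 0.08, 0.06, 0.10],
    [0.10, 0.05, 0.08, 1.00, 0.82, 0.79],
    [0.05, 0.10, 0.06, 0.82, 1.00, 0.85],
    [0.08, 0.06, 0.10, 0.79, 0.85, 1.00],
])

eigvals = np.linalg.eigvalsh(R)[::-1]  # eigenvalues, largest first

# There are as many principal components (eigenvalues) as variables ...
assert len(eigvals) == 6
# ... and the eigenvalues sum to the number of standardized variables
assert np.isclose(eigvals.sum(), 6.0)
# The first two components dominate, reflecting the two item groups
assert eigvals[0] > 1 and eigvals[1] > 1 and eigvals[2] < 1
```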

Figure 11.1: Eigenvalues and proportions of explained variance (table of values not reproduced in the transcript)

Meaning of the eigenvalues. Each principal component has an eigenvalue. The size of the eigenvalue describes the proportion of the total variance in the data that is explained by this component. The greater the number of items that are grouped together in one of the steps mentioned above, and the higher the correlations within this group, the greater the eigenvalue of the corresponding principal component (which is formed by this group). The size of the eigenvalue corresponds to the explanatory value of the principal component. If a component has only a small eigenvalue, then it contributes little to the explanation of the total variance and can be ignored as unimportant compared with the principal components with large eigenvalues. In that case the number of items in its group is small, the correlations between the items are relatively low, and the correlations with items in other groups are relatively high. The original variables (items) are standardized, i.e. the values are transformed so that their mean is 0 and their variance is 1 (see footnote 1). Then the total variance is equal to the number of variables (6 in our example). The sum of the eigenvalues is therefore also 6. For each eigenvalue, the proportion of explained variance can be calculated by dividing the eigenvalue by the number of variables. For our example of the student self-concept, the eigenvalues and the proportions of explained variance are shown in Figure 11.1. You can see that the first component has an eigenvalue of 2.9 and explains 48% of

the total variance. The second component has an eigenvalue of 2.2 and explains 37%; together they explain 86% of the total variance. All other principal components have only very small eigenvalues and can be neglected. One of the rules of thumb for determining the number of principal components is the eigenvalue criterion.

Eigenvalue criterion: All principal components that have an eigenvalue greater than 1 are retained. The reasoning is that principal components with an eigenvalue less than 1 have less explanatory value than a single original (standardized) variable. This is an unambiguous but rigid rule, whose application in practice often fails if one wants to obtain principal components that can be meaningfully interpreted. In our example, however, this rule fits well.

An alternative is the so-called scree plot. In this graphical method, the eigenvalues are plotted on the y-axis against the principal component number on the x-axis and connected with a line (for our example, see Figure 11.2). The scree plot exploits the fact that the first principal component(s) usually have high eigenvalues which quickly become smaller; from a certain point onwards they remain relatively constant at a fairly low level, producing a kink or elbow.

Scree plot criterion: All principal components to the left of the kink (elbow) in the scree plot are retained. If there are several kinks, those principal components are selected that lie to the left of the rightmost kink.

According to this criterion, only the principal components to the left of the kink or elbow are significant. The word scree refers to the rubble slope at the foot of a mountain: what belongs to the mountain is significant, the rubble can be ignored.
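The eigenvalue (Kaiser) criterion and the proportions of explained variance can be computed directly from the eigenvalues. A sketch (Python/NumPy for illustration), using the two eigenvalues reported in the text and four small made-up ones so that all six sum to 6:

```python
import numpy as np

# Eigenvalues roughly matching the self-concept example; the last four
# values are invented so that the sum equals the number of variables (6)
eigvals = np.array([2.9, 2.2, 0.35, 0.25, 0.2, 0.1])

# Proportion of explained variance: eigenvalue / number of variables
prop = eigvals / eigvals.sum()
print(np.round(prop[:2], 2))  # first two components, about 0.48 and 0.37

# Eigenvalue (Kaiser) criterion: keep components with eigenvalue > 1
n_keep = int(np.sum(eigvals > 1))
print(n_keep)  # -> 2
```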

Figure 11.2: Scree plot for the self-concept example

The problem is sometimes that you do not know exactly what is still mountain and what is already rubble, or there are several kinks. If there is no kink at all, the scree plot will not help. For both criteria it is true that they should only be viewed as rough guides. It is best to combine both methods. The most important thing is that the principal components can be meaningfully interpreted in terms of content. In our example, both criteria agree; the scree plot (Figure 11.2) also suggests choosing two principal components. How do you interpret the principal components? One of the results obtained when calculating a principal component analysis is the component loading matrix (or component matrix for short). It contains the so-called component loadings. For the self-concept example, it looks as shown in Figure 11.3. The component loadings are the correlation coefficients between the original

Figure 11.3: Component matrix: component loadings for the self-concept example

variables (rows) and the principal components (columns). One can now interpret the principal components by looking for the variables that show a high correlation with a component. You then have to find a name for the common properties of the items that load on that component. Component loading matrix: indicates how strongly each variable loads on a component. Rule of thumb for the absolute size of a loading: very high / high / poor (the threshold values are not reproduced in the transcript).

Figure 11.4: Component loadings for the self-concept example, plotted in the plane spanned by components 1 and 2

Generally, one interprets such a graphic as follows: items that lie close to an axis (a principal component) have a strong relationship to it, and the further they are from the origin, the more important they are for this component. If we had not already known it, we could see from the graph (Figure 11.4) that there are two principal components, clearly represented by the two groups of points. These two groups could not have been read off so easily from the loadings in the table (Figure 11.3). The groups lie midway between the two axes and are therefore related to both. But we want principal components that are as independent of each other as possible, i.e. such that the items can be clearly assigned to one of the principal components and thus lie on one of the axes (or at least very close to one). Then we could interpret the axes better. Graphically this is very easy: you just have to turn, or rotate, the axes so that they pass through the item groups. Figure 11.5 shows what happens when the axes are rotated.
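That a loading really is the correlation between an original variable and a component score can be checked numerically. A sketch (Python/NumPy for illustration, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 200 people, 4 items, with some induced correlation
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized items

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]         # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvector scaled by the square root of its eigenvalue
loadings = eigvecs * np.sqrt(eigvals)

# Component scores for the first principal component
scores = Z @ eigvecs[:, 0]

# The loading equals the correlation between each item and the component
corr = np.array([np.corrcoef(Z[:, j], scores)[0, 1] for j in range(4)])
assert np.allclose(corr, loadings[:, 0], atol=1e-8)
```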

Figure 11.5: Unrotated and rotated axes

Rotation of the component structure: The coordinate system that depicts the component structure is rotated so that the variables (items) come as close as possible to the rotated axes, the goal being that each variable is related to only one component, if possible. The pattern is usually clearer, and the simpler solutions are easier to interpret. The rotation redistributes the correlations between the variables and the principal components. The explanatory value of the individual principal components changes in the direction of a more even distribution (in contrast to the eigenvalues originally obtained). In other words, the explanatory value of the second most important (third most important, etc.) principal component does not fall as far behind the first component as in the unrotated solution. Overall, however, the proportion of variance explained by the extracted principal components does not change. There are a variety of methods for rotating the principal components. A distinction is made between orthogonal and oblique rotations. In orthogonal rotation, the right angles between the principal components are preserved (meaning that the principal components remain uncorrelated); in oblique rotations, correlations between the principal components are allowed. For various reasons (e.g. strong sample dependency), it is better to avoid oblique rotations. Among the orthogonal methods, the so-called varimax rotation is used most often. It transforms the axes in such a way that for each component (each column of the component loading matrix) a few items have high loadings, while the other items have loadings close to zero.
Another method is quartimax, in which the aim is that for each item (each row of the component loading matrix) high loadings become even higher, while low loadings become even lower. Equamax is a compromise between these two methods.
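A compact implementation helps make varimax concrete. The following sketch (Python/NumPy for illustration) uses the widely known SVD-based iteration; the loading matrix is hypothetical. As noted above, an orthogonal rotation leaves the communalities (row sums of squared loadings) unchanged:

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix L (items x components)."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        R = u @ vt                       # R stays orthogonal
        d_new = s.sum()
        if d_new < d * (1 + tol):        # criterion no longer improving
            break
        d = d_new
    return L @ R

# Hypothetical unrotated loadings for six items on two components
L = np.array([[0.60,  0.60],
              [0.55,  0.62],
              [0.58,  0.58],
              [0.60, -0.60],
              [0.62, -0.55],
              [0.58, -0.58]])
Lr = varimax(L)

# Orthogonal rotation preserves the communalities of every item
assert np.allclose((L ** 2).sum(axis=1), (Lr ** 2).sum(axis=1))
```

In R, principal() applies varimax by default; this sketch only illustrates what the rotation does to the loading matrix.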

Figure 11.6: Rotated component matrix

For our example, after the rotation the points are all very close to the axes. They are now clearly assignable, and the axes are much easier to interpret. We conclude that the first component is characterized by items B1 to B3, the second by A1 to A3. We name the first component academic self-concept (as an umbrella term for the items B1 to B3 that load on it) and the second social self-concept (an umbrella term for items A1 to A3). The new, rotated component matrix now looks as shown in Figure 11.6. As can already be seen from the graphic (Figure 11.5), the loadings have changed in such a way that each item group has very high values on one component, but only very small values on the other. After the rotation, this structure can also be clearly seen in the component matrix, and it is easier to use the matrix for interpretation without resorting to graphics. This is important when a large number of principal components has been extracted, because in that case many two-dimensional graphics arise (with 5 principal components that is 5·4/2 = 10), all of which would have to be examined at the same time. This is easier to do in tabular form. You can also aid the interpretation by masking out the small loadings in the component matrix (as in Figure 11.7). Of course, this result does not surprise us, since in this example we knew the solution from the start. In practice, however, we do not know the solution, or we only have a (sometimes vague) idea of the underlying structure. The rotation of axes is usually a very suitable aid to facilitate interpretation.

Figure 11.7: Rotated component matrix in which the small loadings are hidden

General aspects of the interpretation of principal components: For each component, an umbrella term must be found for the variables (items) that load on it. The substantive meaning attached to a component is usually based on careful consideration of what the variables with high loadings actually have in common. Principal components are orthogonal, i.e. uncorrelated (unless one uses oblique rotations). For the naming, this means that the components should also be conceptually understood as independent of each other; all combinations of high/low across all principal components should be possible. (In the self-concept example it should be possible, for instance, that someone either likes or does not like being with people, regardless of whether they are strongly or weakly inclined toward academic pursuits.) If an item loads on two components (with an otherwise satisfactory component structure), then one should look for an aspect of this item that is common to both components but contrasts (strongly) with the other components on which its loadings are small. This aspect is then common to the two components for which the loadings are high. Principal components must be named differently than any particular original variable. (Principal components are aggregated from the original variables, so their names should reflect the aggregate; a single specific variable does not describe them adequately.) If it is very difficult or impossible to find a name for the common aspects of a component: try a different number of principal components, or omit variables with low communality (see below). The data should not be forced; if in doubt, it is better to look for a different analysis method.

Applying principal component analysis. Let us now turn to a more complex example of the kind that appears in practice, in which we will discuss strategies for analyzing multidimensional data.

Case 31: Characteristics of supermarkets (data file: superm.dat). In a representative study of 637 people in Wales, Hutcheson and Moutinho (1998) examined characteristics of supermarkets. The respondents were asked to rate the importance of the following properties on a scale ranging from 1 (not very important) to 5 (very important): non-food offerings, open tills, express tills, baby facilities, petrol station, restaurant or cafeteria, regular customer discount, parking spaces, favorable location, customer service and advice center, frequency of special offers, friendliness of staff, general atmosphere, help with packing, length of lines at checkouts, low prices, quality of fresh products, quality of packaged products, quality of shopping trolleys, supply of deliveries.

Question: Can the characteristics of supermarkets be summarized in a few key dimensions?

As already outlined, the calculation of a principal component analysis involves several steps:

1. Checking the prerequisites: linearity, Kaiser-Meyer-Olkin (KMO) statistic
2. First calculation: attempting a standard solution
3. Refinement: changing the number of extracted principal components

Checking the prerequisites. Since principal component analysis is based on covariances or correlations, it makes sense to take a closer look at these. Correlation coefficients are strongly influenced by outliers and non-linearities. To check for these, the variables can be displayed pairwise in scatter plots and examined for anomalies. It may be possible to improve the situation through transformations (in the case of non-linearities) or by omitting outliers (which can be problematic). Or you may decide to take the variable in question out of the analysis entirely. Another prerequisite concerns the underlying statistical methodology: the summary measures mean and variance should be meaningfully calculable for the original variables, since otherwise the eigenvalue calculation merely makes the new variables uncorrelated without necessarily grouping the items meaningfully. Strictly speaking, this assumes that the data have metric properties and follow a normal distribution. However, if one emphasizes the exploratory character of the principal component analysis, one may weaken these restrictive assumptions. A general check of whether the data are suitable for a principal component analysis at all can be carried out using the Kaiser-Meyer-Olkin (KMO) statistic. Put simply, this statistic indicates the degree of correlation present in the data and also takes into account the degree of partial correlation, i.e. how strongly the correlation of two variables is influenced by their correlations with other variables. The higher the value of the KMO statistic, which lies between 0 and 1, the more likely one is to arrive at a satisfactory principal component solution. Values below 0.5 are considered unacceptable, those above 0.8 very good. Another criterion for whether it makes sense to use a principal component analysis at all is to examine whether the variables correlate with one another in the first place.
A common technique for doing this is Bartlett's test, which tests the null hypothesis that all correlations are zero. In practice this null hypothesis is very rarely true, so a non-significant result would be an alarm signal. We want to check this for our data. For the calculation of the principal component analysis, we will mainly use the R package psych. For the KMO statistic, we also need the rela package. We first load these two packages and then read the data file superm.dat with the supermarket data. The 20 items, with the variable names q08a to q08n and q08q to q08v, can be found in columns 6 to 25. We save this data under smd, removing persons with missing values.
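Bartlett's test itself is easy to state: it compares the determinant of the correlation matrix against that of the identity matrix (all correlations zero). A sketch of the usual chi-square approximation (Python/SciPy for illustration; the correlation matrix and sample size are made up):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity: H0 = all correlations are zero."""
    p = R.shape[0]
    # Chi-square approximation based on the log-determinant of R
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2  # one degree of freedom per variable pair
    return chi2, df, stats.chi2.sf(chi2, df)

# Made-up correlation matrix with clearly correlated variables
R = np.array([[1.00, 0.70, 0.60],
              [0.70, 1.00, 0.65],
              [0.60, 0.65, 1.00]])
chi2, df, pval = bartlett_sphericity(R, n=200)
print(df)           # 3 variable pairs -> 3
print(pval < 0.05)  # True: the correlations clearly differ from zero
```

The R functions used below (paf() from rela, cortest.bartlett() from psych) compute this test for us; the sketch only shows what is being tested.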

R> smarkt <- read.table("smarkt.dat", header = TRUE)
R> smd <- na.omit(smarkt[, 6:25])

Using nrow(smd) we can convince ourselves that we now have 497 persons available. In addition, we want to replace the variable names q08a to q08n and q08q to q08v with abbreviations to facilitate interpretation.

R> itemnam <- c("non-food", "open checkouts",
+   "express checkouts", "baby facilities", "petrol station",
+   "restaurant", "regular customer discount", "parking spaces",
+   "location", "customer service", "Special offers",
+   "Friendliness", "Atmosphere", "Packing assistance",
+   "Queue checkouts", "Prices", "Qual.frischprod.",
+   "Qual.verp.Prod.", "Qual.Wagen", "Delivery")
R> colnames(smd) <- itemnam

The KMO and Bartlett statistics are contained in the output object of the paf() function from the rela package. We calculate them using

R> library(rela)
R> paf.obj <- paf(as.matrix(smd))
R> cat("KMO statistic:", paf.obj$kmo,
+     "Bartlett statistic:", paf.obj$bartlett, "\n")
KMO statistic:  Bartlett statistic: 3036

The psych package contains a somewhat more informative implementation of the Bartlett test, but here you need the correlation matrix and the number of observations as input. The result is obtained using

R> bart <- cortest.bartlett(cor(smd), n = nrow(smd))
R> unlist(bart)
chisq  p.value  df

The value of the KMO statistic indicates that the data are suitable for principal component analysis. The Bartlett test is highly significant and does not speak against the application of principal component analysis.

Calculation of a first solution with standard settings. In R there are different ways to calculate a principal component analysis. We use the function principal() from the package psych, with which we have already calculated the Bartlett test. Since we have to specify the number of components to be extracted in principal(), we first look at a scree plot (Figure 11.8) for orientation. You can create one with the VSS.scree() function from the psych package.

R> VSS.scree(smd, lty = "dotted")

Figure 11.8: Scree plot for the supermarket example

The plot indicates that only two to three principal components should be selected, but five principal components have an eigenvalue greater than one. Therefore, as a first approach, we extract five components and do not perform any rotation for the time being.

R> pca.smd <- principal(smd, 5, rotate = "none")
R> pca.smd$criteria <- NULL
R> pca.smd

Principal Components Analysis
Call: principal(r = smd, nfactors = 5, rotate = "none")
Standardized loadings based upon correlation matrix
(one row per item, from Non-food to Delivery, with columns PC1 PC2 PC3 PC4 PC5 h2 u2, followed by the rows SS loadings, Proportion Var and Cumulative Var; the numeric values are not reproduced in the transcript)

We again get a lot of output, but first we are interested in how much explanatory value the individual items provide overall. This can be seen from the communalities in the column h2. The communality of an item is the sum of its squared loadings on all extracted components. If the communality of an item is low, it is not well represented by the extracted components, and the item should possibly be removed from the analysis. In our example there are no very low values, so we leave all items in the analysis. It should also be mentioned that the communalities do not change under rotation. Next we want to examine the extracted principal components (PC1 to PC5, where PC stands for principal component), in particular their eigenvalues. We extracted the five principal components with eigenvalues greater than 1. We find the five eigenvalues in the output in the line SS loadings. The proportion of the total variance that they explain is given in the line Proportion Var; the cumulated
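The relationship between communalities (h2) and uniquenesses (u2) can be sketched directly (Python for illustration; the loadings are hypothetical):

```python
import numpy as np

# Hypothetical loadings of three items on two extracted components
L = np.array([[0.80, 0.10],
              [0.75, 0.20],
              [0.30, 0.85]])

# Communality h2: sum of squared loadings over the extracted components
h2 = (L ** 2).sum(axis=1)
# Uniqueness u2: the part of each item's variance left unexplained
u2 = 1 - h2

print(np.round(h2, 2))  # communalities of the three items
assert np.allclose(h2 + u2, 1.0)
```

An item with a small h2 (large u2) is poorly represented by the extracted components, which is exactly the criterion used in the text for possibly removing items.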

The cumulative proportions of variance are shown in the row Cumulative Var. Overall, the five components explain 57% of the total variance. This is not a particularly high value, but extracting further components to increase the explained variance does not seem sensible, since all remaining eigenvalues are less than 1. As mentioned, the scree plot (Figure 11.8) would rather suggest that even five principal components are too many and that only two or three should be chosen. So we have to find a compromise, which depends on how well and how sensibly the components can be interpreted. To do this, we look at the loading matrix. In the output above it is very confusing, for two reasons: first, there are many small loadings that do not add much to the interpretation, and second, we should rotate the components to get a clearer picture. It would be better to have the output sorted by the size of the loadings, with small loadings suppressed. After we have calculated the principal component analysis again without the rotate option (the default is a varimax rotation) and removed the parts of the output that are not of interest, we can print the loadings sorted (option sort) and rounded to 2 decimal places.

> pca.smdr <- principal(smd, 5)
> pca.smdr$criteria <- NULL
> print(pca.smdr, cut = 0.5, sort = TRUE, digits = 2)

Principal Components Analysis
Call: principal(r = smd, nfactors = 5)
Standardized loadings based upon correlation matrix

[Output: rotated loadings of the items (the item column gives the original item numbers) on RC1, RC2, RC5, RC3 and RC4, sorted by size, with loadings below 0.5 suppressed, plus the columns h2 and u2 and the rows SS loadings, Proportion Var and Cumulative Var; the numeric values are not legible in this transcript.]
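For readers curious what a varimax rotation actually computes, here is a compact Python/numpy sketch (the standard SVD-based varimax iteration, without the Kaiser row normalization that R implementations typically apply by default; toy loadings stand in for the supermarket solution). It also verifies the claim from above that rotation leaves the communalities unchanged.

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    # Find an orthogonal rotation of the loading matrix L (items x
    # components) that maximizes the varimax criterion, i.e. drives
    # each item's loadings toward one large and many near-zero values.
    p, k = L.shape
    Rot = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        B = L @ Rot
        u, s, vt = np.linalg.svd(
            L.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / p))
        Rot = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
        d_old = d
    return L @ Rot

# Toy loadings from an eigendecomposition, in place of the real data.
rng = np.random.default_rng(3)
R = np.corrcoef(rng.normal(size=(400, 10)) @ rng.normal(size=(10, 10)),
                rowvar=False)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
L = vecs[:, order[:4]] * np.sqrt(vals[order[:4]])

Lr = varimax(L)
# Rotation redistributes variance across components, but each item's
# communality (row sum of squared loadings) is unchanged, because the
# rotation matrix is orthogonal.
print(np.allclose((L ** 2).sum(axis=1), (Lr ** 2).sum(axis=1)))  # True
```

This is why the h2 column is identical before and after rotation, while the SS loadings row changes.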

At first glance, the result looks promising. At least two items load (> 0.5) on each component (the components are now labelled RC1 to RC5, where RC stands for rotated component). In addition, there are no items that load on more than one component. The attempt to interpret the content of the components also leads to a satisfactory result. (The original item numbers are given in the item column.) The items that load on the first component could be summarized under the heading quality of products and staff. The second component describes the availability of additional services, the third the price-performance ratio, and the fourth facilities for cars. Finally, the items in the fifth group could be called convenience. The fa.diagram() function can be used to display the result graphically (Figure 11.9). The cex and rsize options scale the font size and the rectangles; they are chosen according to the length of the names. To get nice graphics, you usually have to play around a bit. The setting main = "" suppresses the title.

> fa.diagram(pca.smdr, cut = 0.5, cex = 0.8, rsize = 0.5,
+            main = "")

[Figure 11.9: Assignment of the items (with loadings > 0.5) to the components RC1, RC2, RC5, RC3 and RC4]

Due to the rotation, the eigenvalues of the components have of course also changed, and with them the proportions of the total variance that the components explain. A comparison with the output of the first calculation (on page 406) shows that the variance is now more evenly distributed over the five extracted components.

Refinement of the solution

You are rarely satisfied with the first solution; usually you will also extract different numbers of principal components (and possibly try other rotation methods or other ways of displaying the loadings, e.g. a different cutoff than 0.5 for suppressing small loadings). The aim is to find out which solution makes the most sense and is easiest to interpret. The scree plot in Figure 11.8 indicated that only two components should actually be extracted. We want to check that.

> pca.smd2 <- principal(smd, 2)
> pca.smd2$criteria <- NULL
> print(pca.smd2, cut = 0.5, sort = TRUE, digits = 2)

Principal Components Analysis
Call: principal(r = smd, nfactors = 2)
Standardized loadings based upon correlation matrix

[Output: loadings of the items on RC1 and RC2, sorted by size, with loadings below 0.5 suppressed, plus the columns h2 and u2 and the rows SS loadings, Proportion Var and Cumulative Var; the numeric values are not legible in this transcript.]

You can see that finding labels is not so easy here. The first component describes aspects of the supermarket per se, while the second covers additional services. However, the item that records the availability of enough open checkouts does not fit this particularly well; one would rather expect a higher loading on the first component. The same applies to other items, such as regular-customer discounts or help with packing, which have their highest loading (below 0.5) on the second component but would fit the first better. Overall, different aspects seem to be mixed together in this solution, which points to more components and therefore more underlying characteristics of supermarkets. In the same way, one can calculate solutions with three and four components, but it turns out that even then the interpretation is not as conclusive as with the five principal components found first. So it makes sense to stay with that solution.
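The trade-off weighed in this section, parsimony versus explained variance, can be made concrete with a short Python sketch (toy data with 20 variables standing in for smd): cumulating the sorted eigenvalues shows how much of the total variance any number k of components would retain.

```python
import numpy as np

# With p standardized items, each eigenvalue divided by p is the share
# of total variance one component explains; the cumulative sums show
# what k components retain together.
rng = np.random.default_rng(11)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))
R = np.corrcoef(X, rowvar=False)
ev = np.sort(np.linalg.eigvalsh(R))[::-1]

p = len(ev)
cum = np.cumsum(ev) / p
for k in (2, 3, 5):
    print(f"k={k}: cumulative variance explained = {cum[k-1]:.2f}")
```

The curve rises monotonically toward 1, so the numbers alone never tell you to stop; interpretability has to decide, just as it did for the supermarket data.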

Case study 31: Interpretation

A principal component analysis was performed to obtain a clearer structure of the importance of 20 different characteristics of supermarkets as assessed by 497 consumers. The assignment of the items to specific components allows a simpler interpretation of the importance of the various supermarket properties. The KMO statistic shows that the correlation structure in the data contains enough information to perform a principal component analysis. Five principal components were extracted and rotated using the varimax method. The five principal components, each with the variables loading high (> 0.5) on it, are:

                                        Loading
Quality of products and staff
  Quality of packaged products            .727
  General atmosphere                      .685
  Quality of shopping carts               .683
  Quality of fresh products               .643
  Friendliness of staff                   .634
Availability of additional services
  Restaurant or cafeteria                 .686
  Home deliveries                         .581
  Baby facilities                         .578
  Help with packing                       .524
  Non-food items                          .516
Value for money
  Frequency of special offers             .733
  Low prices                              .659
  Discounts for regular customers         .507
Facilities for cars
  Parking spaces                          .819
  Petrol station                          .815
Convenience
  Express checkouts                       .658
  Convenient location                     .656
  Open checkouts                          .536

Case study 31: Interpretation (continued)

The consistently positive loadings mean that people with high values on the respective items also have high scores on the component and consider the respective property to be important. (Two items had medium loadings and were not taken into account in the presentation: customer service and advisory services loaded on component RC1 as well as on component RC5, while the length of the queues at the checkouts also showed a loading on component RC5.) The five extracted principal components can be interpreted directly as quality of products and staff, availability of additional services, value for money, facilities for cars, and convenience. Overall, 57% of the total variance could be explained by these principal components. The relative importance of the principal components (after the varimax rotation) is, in order:

Component 1 (RC1): Quality
Component 2 (RC2): Services
Component 3 (RC5): Price-performance
Component 4 (RC3): Facilities for cars
Component 5 (RC4): Convenience

Supermarkets can accordingly be characterized by five key features. Customers rate the quality of the goods and the staff highly when they assess the quality of packaged and fresh products, the quality of the shopping trolleys, the friendliness of the staff and the general atmosphere positively. The availability of additional services is also appreciated: the possibility of going to a restaurant or a cafeteria, facilities for the care of infants, help with packing the goods, and the offer of home deliveries. The price-performance ratio comes third: the frequency of special offers, generally low prices and discounts for regular customers count positively. Other important properties of supermarkets concern the facilities for cars, i.e. the availability of parking spaces and petrol stations, as well as general convenience, i.e.
whether the location is assessed as favorable and whether there are enough open tills or express tills. The more positively these individual aspects are assessed, the more positive the overall assessment of a supermarket will be.

11.2 How can the results of a principal component analysis be used for further analyses?

One of the objectives of principal component analysis was to combine a large number of variables into a few groups of variables. These variable groups or principal components act as new variables. If you know the values of the persons on the new variables, you can use them for further analyses and do not have to deal with the large number of original variables. As mentioned in the last section, these new values are called component values or component scores.

Case study 32: The quality aspect in supermarkets
Data file: superm_scores.dat
The principal component analysis of the characteristics of supermarkets in case study 31 showed, among other things, that the quality of the goods and the friendliness of the staff are key characteristics of supermarkets. Question: Is the quality aspect of a supermarket equally important to men and women?

If we had not carried out a principal component analysis but wanted to answer the question from case study 32, we would have to run an analysis for each individual item in question, or answer the question for all items loading on this component (provided we know the results of the principal component analysis). Once we have calculated the component values, however, a single analysis is sufficient: we simply use the new variable quality. In principle, component values can be viewed as if they were observations on the principal components. A person's component value (for a specific component) is an aggregation or index derived from the person's original values. We know the values of a person on all items (the raw data) and also the relationship of these items to the principal components (the result of the principal component analysis). The component values can be calculated from these two pieces of information. There are various methods for doing this, the most common being the multiple regression method.
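As a sketch of the regression method just mentioned (in Python/numpy rather than R, with simulated data; the variable names are illustrative): the weight matrix is obtained as B = R⁻¹L, where R is the item correlation matrix and L the loading matrix, and each person's component scores are their standardized item values multiplied by B.

```python
import numpy as np

# Simulated raw data standing in for the supermarket ratings:
# 497 persons, 6 items.
rng = np.random.default_rng(5)
X = rng.normal(size=(497, 6)) @ rng.normal(size=(6, 6))

# Standardize the raw data and compute the item correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Loadings of the two retained components (unrotated, for brevity).
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1][:2]
L = vecs[:, order] * np.sqrt(vals[order])

# Regression-method weights B = R^-1 L, then one score per person
# and component: scores = Z B.
B = np.linalg.solve(R, L)
scores = Z @ B
print(scores.shape)  # (497, 2)
```

The resulting score columns are standardized and uncorrelated, so a new variable such as "quality" can go straight into a further analysis, e.g. a comparison between men and women.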
The formula of how to get the component value