Red Wine Quality

Data Analyst Nanodegree - Project 4

Cheng Wang


Introduction

The red wine quality dataset will be explored and analyzed in this report.

Univariate Analysis

Structures

The structure of the dataset is shown below. There are 1599 observations and 13 variables. The first variable, X, is an index and will be removed from the dataset since it is not used in the analysis. The 11 numeric variables are physicochemical properties of red wine. quality is the output variable, based on sensory data.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
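
Dropping the index column can be sketched as follows. This is a minimal Python illustration, not the code used for this report. The `drop_column` helper is hypothetical, and the three rows are the first observations from the structure output above, with only a few columns kept for brevity.

```python
# First three observations from the structure output, a few columns only.
rows = [
    {"X": 1, "fixed.acidity": 7.4, "pH": 3.51, "quality": 5},
    {"X": 2, "fixed.acidity": 7.8, "pH": 3.20, "quality": 5},
    {"X": 3, "fixed.acidity": 7.8, "pH": 3.26, "quality": 5},
]


def drop_column(rows, name):
    """Return copies of the rows without the column `name` (hypothetical helper)."""
    return [{key: value for key, value in row.items() if key != name} for row in rows]


wines = drop_column(rows, "X")
print(sorted(wines[0]))  # remaining column names of the first row
```

The original rows are left untouched, so the index is still available if it is needed later.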


Statistics and Histograms

Basic statistics of the dataset are shown below. Boxplots and histograms of the 12 variables are in Figures 1 and 2.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000


Figure 1: Univariate Boxplots


Figure 2: Univariate Histograms

Quality

Quality is the output variable and it ranges from 3 to 8 in this dataset. About 82% of wines are rated 5 or 6. The distribution is close to a normal distribution, so I will be more interested in the physicochemical properties that are also close to normally distributed.
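
The share of wines at given rating levels can be computed as in the sketch below. `rating_share` is a hypothetical helper, and the ten ratings are only the first ten `quality` values from the structure output, so the result is not the 82% reported for the full dataset.

```python
from collections import Counter

# First ten quality ratings from the structure output (not the full dataset).
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]


def rating_share(ratings, levels):
    """Fraction of ratings that fall in the given set of levels."""
    counts = Counter(ratings)
    return sum(counts[level] for level in levels) / len(ratings)


share = rating_share(quality, {5, 6})
print(f"Share rated 5 or 6 in this sample: {share:.0%}")
```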

Residual sugar

Residual sugar indicates the sweetness of wine. According to Wikipedia, wine is classified by sweetness as dry (<= 4 g/L), medium dry (4 to 12 g/L) or medium sweet (12 to 45 g/L), with anything above 45 g/L counted as sweet. A new variable, sweetness, is created from these categories. The maximum residual sugar is 15.5 g/L, so there is no sweet wine in this dataset.

Figure 5: Sweetness of Wine
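
The sweetness categories described above can be expressed as a small classifier. This is a sketch, not the report's own code: the `sweetness` function name is chosen for illustration, and treating each upper bound as inclusive is an assumption, since the text only gives the ranges.

```python
def sweetness(residual_sugar):
    """Map residual sugar in g/L to a sweetness label.

    Upper bounds are treated as inclusive, which is an assumption.
    """
    if residual_sugar <= 4:
        return "dry"
    if residual_sugar <= 12:
        return "medium dry"
    if residual_sugar <= 45:
        return "medium sweet"
    return "sweet"


# Minimum, a mid-range value from the first rows, and the maximum in the summary.
for sugar in (0.9, 6.1, 15.5):
    print(sugar, sweetness(sugar))
```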

Others

  • Chlorides are roughly normally distributed, with several outliers at the high end, as seen in Figures 1 and 2. After truncating the data at the 95th percentile, the distribution is clearer and closer to normal, as in Figure 6.
  • Density is normally distributed.
  • Alcohol is not normally distributed. Its log10 and sqrt transforms are shown in Figure 6.

Figure 6: Chloride and Alcohol Distribution
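
Truncating a variable at its 95th percentile can be sketched as below. The helpers `percentile` and `truncate_upper` are hypothetical, the percentile uses linear interpolation between order statistics, and keeping values equal to the cutoff is an assumption. The input is only the first ten chloride values from the structure output.

```python
import math


def percentile(values, p):
    """Percentile for 0 <= p <= 1, interpolating linearly between order statistics."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * p
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


def truncate_upper(values, p=0.95):
    """Return the cutoff and the values at or below the p-th percentile."""
    cutoff = percentile(values, p)
    return cutoff, [v for v in values if v <= cutoff]


# First ten chloride values from the structure output (not the full dataset).
chlorides = [0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.073, 0.071]
cutoff, kept = truncate_upper(chlorides)
print(len(chlorides) - len(kept), "value(s) removed above", round(cutoff, 4))
```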

Questions

What is/are the main feature(s) of interest in your dataset?

I am mainly interested in pH since it is normally distributed, like wine quality, and is also related to the six acidity-related features through their physicochemical meaning, as discussed above.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Volatile acidity and sulphates could be helpful for my investigation. Both of them are normally distributed on a log10 scale.

Did you create any new variables from existing variables in the dataset?

A new variable, sweetness, is created to classify each wine as dry, medium dry or medium sweet. Total acid is created as the sum of all the acid compounds. Five log10-transformed variables are also created, as discussed above.
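
The derived variables can be sketched as follows. Two points are assumptions rather than details from this report: total acid is taken as the sum of fixed acidity, volatile acidity and citric acid, and only two of the log10 variables are shown (volatile acidity and sulphates). The `add_derived` helper is hypothetical, and the input is the first observation from the structure output.

```python
import math


def add_derived(row):
    """Return a copy of the row with total acid and log10 columns added."""
    out = dict(row)
    # Assumption: "all the acid compounds" means these three columns.
    out["total.acid"] = row["fixed.acidity"] + row["volatile.acidity"] + row["citric.acid"]
    # log10 is undefined at zero, so it is applied only to strictly positive columns.
    for name in ("volatile.acidity", "sulphates"):
        out["log10." + name] = math.log10(row[name])
    return out


wine = {"fixed.acidity": 7.4, "volatile.acidity": 0.7, "citric.acid": 0.0, "sulphates": 0.56}
derived = add_derived(wine)
print(round(derived["total.acid"], 2))
```

Citric acid is zero for several wines, so a log10 transform of that column would need those rows handled separately.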

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed several variables to a log10 scale to bring their distributions closer to normal, since the original distributions have long tails. Detailed discussions are in the acidity-related features section.



Bivariate Analysis

Bivariate Plots Section

Two overall plots are shown below for the bivariate analysis. In Figure 7, all physicochemical properties are shown as boxplots against wine quality. A pairwise correlation matrix is plotted in Figure 8. Larger dots indicate stronger correlation; blue indicates positive correlation and red indicates negative correlation. The correlation coefficients are also shown in the bottom-left triangle.

Figure 7: Physicochemical Properties and Wine Quality by Boxplot


Figure 8: Pairwise Correlation Plot

Properties and Wine quality

Clear relations between quality and physicochemical properties can be identified from the boxplots in Figure 7. The trends in the boxplots can be summarized as:

  • Good wines have high citric acid, sulphates and alcohol.
  • Good wines have low volatile acidity, density and pH.
  • Residual sugar, chlorides, and free and total sulfur dioxide do not seem to have an impact on the quality of wine.

Among these variables, alcohol and density are correlated, with a correlation of -0.50. The physical reason is that alcohol has a lower density than water. Wine is a mixture dominated by water and alcohol, so the more alcohol a wine contains, the closer its density is to that of alcohol; in other words, the lower it is. Physically, the relation between density and alcohol percentage is not linear. However, over a short range such as the one here (8.4% to 14.9% alcohol), we can approximate it as linear. The relation in Figure 9 is not perfectly linear because the wine also contains sugar, acids and sodium chloride, which contribute to the density as well. In addition, good wines have a high alcohol percentage, with a correlation of 0.48, which is the highest among the properties.

Figure 9: Scatter Plot of Density and Alcohol

Good wines have low volatile acidity, with a correlation of -0.39. This makes sense to me: if volatile acid (acetic acid) is high, the wine does not smell good.

Good wines have high citric acid, which surprised me. As I discussed above, citric acid is present in only minute amounts in wine compared with tartaric acid (Wikipedia). It will be interesting to explore this feature further in the future.

Sweetness and quality are shown in Figure 10. Sweetness seems to have no relation to the quality of wine, which matches the conclusion from Figures 7 and 8.

Figure 10: Histogram of Quality by Sweetness

Questions

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol correlates best with quality in this dataset, and good wines have a high percentage of alcohol. The relation between alcohol and density is also discussed above.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

pH and fixed acidity are highly correlated, as expected. pH and the other acidity-related features are also explored above.

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and pH (correlation coefficient -0.683). The created variable total.acid is correlated with fixed.acidity (correlation coefficient close to 1), and the reasons are discussed above.
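
A correlation coefficient like the one quoted above can be computed with the Pearson formula, sketched here in plain Python. The `pearson` helper is written for illustration, and the inputs are only the first ten fixed.acidity and pH values from the structure output, so the result will differ from the -0.683 obtained on the full dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# First ten observations from the structure output (not the full dataset).
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

r = pearson(fixed_acidity, ph)
print(f"r = {r:.3f}")
```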



Multivariate Analysis

Top Correlated Features with Wine Quality

The top three features correlated with quality are alcohol (0.48), volatile acidity (-0.39) and sulphates in log10 scale (0.309). The relations between these features and quality are shown in the following section.

Alcohol, volatile acidity and quality are shown together in Figure 14 (left). However, the relations among them are not clear, so I plot volatile acidity against alcohol using facet_wrap by quality in Figure 14 (right). We can see that the worst wines (quality = 3) are in the top left corner and the best wines (quality = 8) are in the bottom right corner.

The best and worst wines are plotted together in Figure 15 (left). It clearly indicates that the best wines have low volatile acidity and a high alcohol percentage, while the worst wines have high volatile acidity and a low alcohol percentage. If wines of quality 4 and 7 are added to the plot, as in Figure 15 (middle), the same trend appears with some overlap. If wines of quality 5 and 6 are also added, as in Figure 15 (right), the trend is still visible but there is a lot of overlap.


Figure 14: Scatter Plots of Alcohol, Volatile Acid and Quality


Figure 15: Modified Scatter Plots of Alcohol, Volatile Acid and Quality
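
The three panels of Figure 15 correspond to progressively wider sets of quality levels. The sketch below shows that selection step only, not the plotting. `select_by_quality` is a hypothetical helper, and the rows are the first ten observations from the structure output, which happen to contain no wines of quality 3, 4 or 8.

```python
# (alcohol, volatile acidity, quality) for the first ten observations.
rows = [
    (9.4, 0.70, 5), (9.8, 0.88, 5), (9.8, 0.76, 5), (9.8, 0.28, 6), (9.4, 0.70, 5),
    (9.4, 0.66, 5), (9.4, 0.60, 5), (10.0, 0.65, 7), (9.5, 0.58, 7), (10.5, 0.50, 5),
]


def select_by_quality(rows, levels):
    """Keep only the rows whose quality is in the given set of levels."""
    return [row for row in rows if row[2] in levels]


panels = {
    "left": {3, 8},                 # worst and best only
    "middle": {3, 4, 7, 8},         # adds quality 4 and 7
    "right": {3, 4, 5, 6, 7, 8},    # all quality levels
}
counts = {name: len(select_by_quality(rows, levels)) for name, levels in panels.items()}
print(counts)
```

Each subset can then be passed to a scatter plot of volatile acidity against alcohol, colored by quality.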

As with quality, alcohol and volatile acidity, I repeated the plots with volatile acidity replaced by sulphates (log10). The results are in Figures 16 and 17.

Figure 17 (left) shows that the best wines have a high alcohol percentage and high sulphates, while the worst wines have a low alcohol percentage and low sulphates. After the middle-quality wines are added to the plots, as in Figure 17 (middle and right), the boundary is no longer clear.


Figure 16: Scatter Plots of Alcohol, Sulphates (log10) and Quality


Figure 17: Modified Scatter Plots of Alcohol, Sulphates (log10) and Quality

pH with features

pH is correlated with fixed.acidity and citric.acid. In Figure 18, the relation of pH to fixed acidity and citric acid is shown in a scatter plot. We can see that fixed acidity and citric acid are correlated. In addition, wines with low fixed acidity and low citric acid have high pH, while wines with high fixed acidity and high citric acid have low pH.


Figure 18: pH with Fixed Acidity and Citric Acid

In addition, quality is shown against pH and alcohol in Figures 19 and 20. We can see in Figure 20 (left) that the best wines have low pH and high alcohol, while the worst wines have high pH and low alcohol. Similar to Figures 15 and 17, the boundary becomes unclear as we add more wines with quality between 4 and 7.


Figure 19: Scatter Plots of Alcohol, pH and Quality


Figure 20: Modified Scatter Plots of Alcohol, pH and Quality

Questions

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I explored the relation of wine quality with the top three correlated features. These features can clearly tell the difference between the best (quality = 8) and worst (quality = 3) wines in the dataset. Combining them in one plot shows the difference clearly and enhances the visualization. The relation between pH and pH-related features is also explored above.

Were there any interesting or surprising interactions between features?

It is interesting to see that combining features in one plot (Figures 15 and 17, left) can clearly express several of the relations discussed in the bivariate section.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.


Final Plots and Summary

Plot One

Figure 21: Histogram of Acid Compounds

Description One

Plot One (Figure 21) shows the distributions of three types of acid compounds in log10 scale. Fixed acidity and volatile acidity are close to normal distributions, while citric acid has a long tail at the lower end. From the figure, we can see that the concentrations of citric acid and volatile acidity are close, while fixed acidity is about 10 times higher.

Reason: this figure represents the exploration in the univariate section. It shows the distributions of the three acid compounds in the dataset.

Plot Two

Figure 22: Scatter Plot of pH and Fixed Acidity in log10 Scale

Description Two

In Plot Two (Figure 22), the relation between pH and fixed acidity (in log10 scale) is shown in a scatter plot. We can see that fixed acidity (in log10 scale) is linearly correlated with pH, with a correlation coefficient of -0.706. The negative coefficient is consistent with the definition of pH, which is the negative logarithm of the concentration of hydronium ions released by the acid.

Reason: this figure represents the bivariate exploration and it shows the relation of pH and acid concentration.
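
The link between pH and the logarithm of acid concentration can be made explicit with a short derivation. This is an idealized sketch, not a model fitted in this report: it assumes a single weak monoprotic acid with dissociation constant $K_a$ at molar concentration $c$, and ignores buffering and the other acids present in wine.

```latex
\begin{align}
\mathrm{pH} &\approx -\log_{10} [\mathrm{H_3O^+}] \\
[\mathrm{H_3O^+}] &\approx \sqrt{K_a\, c}
  \qquad \text{(weak acid, } [\mathrm{H_3O^+}] \ll c \text{)} \\
\mathrm{pH} &\approx \tfrac{1}{2}\,\mathrm{p}K_a - \tfrac{1}{2}\log_{10} c
\end{align}
```

Under these assumptions pH falls linearly with $\log_{10} c$, which matches the sign of the observed correlation. Expressing $c$ in g/L rather than mol/L only shifts the intercept.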

Plot Three

Figure 23: Scatter Plot of Good and Bad Wine

Description Three

In Plot Three (Figure 23), the good (quality = 7 or 8) and bad (quality = 3 or 4) wines are plotted by alcohol and volatile acidity. This figure clearly indicates that good wines have high alcohol and low volatile acidity, although there is some overlap between the two groups.

Reason: this figure represents the multivariate exploration and it shows the relation of wine quality with two properties.


Reflection

In this report, the red wine dataset is explored by univariate, bivariate and multivariate analysis. The features related to pH are explored in the original and log10 scales. A new feature, the sweetness of wine, is created. However, it turns out there is no clear relation between sweetness and wine quality in this dataset. The correlation between each feature and quality is explored in the bivariate section with boxplots and a correlation plot, and the trends of quality along each feature are examined. It is interesting that quality is correlated with trace citric acid, which I thought was not important in the univariate analysis. Sulphates in log10 scale show a better correlation with quality than in the original scale.

When I explored quality with multiple features, the relationship was not clear with all data points, which was discouraging. So I started to explore multiple features at each quality level. This clearly shows what makes a best wine or a worst wine. When I added wines of quality 4 and 7, the plots still showed the separation between good and bad wine. The features that contribute to good and bad wine are successfully identified. As discussed in the quality part, 82% of wines in this dataset are rated 5 or 6, which means a lot of the wines here are average. It is not a balanced dataset, and the features of average-quality wines overlap a lot with those of good and bad wines. It would be helpful to include more good and bad wines in this dataset. In addition, the quality score is based on averaged sensory data and is subjective.

I thought pH was an important property for wine quality, but the correlation coefficient does not support this. In the multivariate analysis, I found that pH can tell the difference between the best and worst wines, but not for wines of average quality.

Volatile acidity is the only variable related to the smell of wine in this dataset. Modern techniques can determine the compounds in the smell beyond acetic acid. For example, esters are well-known compounds in wine's aroma. More smell-related features could be helpful for wine quality prediction.

In summary, many of the log10 transformations of features are actually not helpful. The newly defined features, such as sweetness and total acid, also do not have an impact on my analysis. Both of these are discouraging. The successful part is that facet plots by quality level helped me clearly identify the relations between wine quality and features. In addition, more smell-related features could be added, and adding more good and bad wines could make the dataset more balanced.