Exploratory Data Analysis

We begin with an example. We have the return activated sludge (RAS) concentrations from two clarifiers, named Clarifier A and Clarifier B. The sample data spans a period of three years and represents the daily RAS concentrations from each clarifier. A simple process flow diagram illustrating the sampling points is shown in Figure 1. Given this substantial data set, what can we learn about the data, and what story does it tell? This is always the starting point, and it is what is called exploratory data analysis. It is a slow, iterative process, and at this initial stage there should be no rush and no assumptions. Properly conducted, exploratory data analysis will allow the data to speak for itself.

Figure 1: Two Clarifiers, Two RAS Sample Points

Simple wastewater process flow diagram

One practical way to begin a review of the data is to look at the mean and the standard deviation for each set of values. The mean is defined as the sum of the values in the data set, or the sum of the measurements, divided by the total number of measurements or values in the data set. The mean is commonly referred to as the average. The mean, or average, describes where the center of a set of measurements lies, but it can easily be distorted by a few very small or a few very large values.
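As a rough illustration of how a single extreme reading moves the mean, here is a minimal Python sketch. The concentrations are hypothetical values chosen for easy arithmetic; they are not taken from the clarifier data set discussed here.

```python
from statistics import mean

# Hypothetical daily RAS concentrations in mg/L (illustrative only,
# not the clarifier data described in this article).
ras = [7800, 8100, 8300, 8000, 7900]

# The mean is the sum of the values divided by the number of values.
base_mean = sum(ras) / len(ras)

# One unusually high reading pulls the mean upward.
high_mean = mean(ras + [20000])

# One unusually low reading pulls the mean toward zero.
low_mean = mean(ras + [500])

print(f"mean of typical values:  {base_mean:.1f} mg/L")
print(f"with one high reading:   {high_mean:.1f} mg/L")
print(f"with one low reading:    {low_mean:.1f} mg/L")
```

In this sketch a single reading out of six is enough to move the mean by well over 1,000 mg/L in either direction.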

For the data set comprised of the RAS concentrations, the smallest possible value, in theory at least, is zero and the practical upper limit is in the range of 25,000 mg/L (total suspended solids) or so. With this particular data set, the influence of a few very small values would be to pull the mean closer to zero. In contrast, a few very large values would pull the mean upward, toward the upper limit of 25,000. In the first case, the distribution would be skewed left and in the second case it would be skewed right. The concept of skewness is illustrated in Figure 2 where the two skewed distributions are compared to a normally-distributed set of data.

Figure 2: Skewed Distributions

Before we discuss the standard deviation we will take a look at the results of a first pass at the data. This is shown in Figure 3, where a statistical analysis program, called Minitab, was used to produce the sample statistics mean and standard deviation. A spreadsheet could have been used, but not quite as easily. Shortly we are going to expand our analysis, at which point the ease of a dedicated statistical analysis program really starts to set itself apart from a spreadsheet like Excel, which I like and use all the time.

Figure 3: Mean and Standard Deviation Produced by Minitab

Descriptive statistics

The means and standard deviations don’t look to be too different from one another but the standard deviations themselves do seem to be rather large. The standard deviation gives us information about the spread, or the variability in the data. The standard deviation of a set of measurements is defined as being the positive square root of the variance, which we have not mentioned until now.

The variance of a set of values with a mean y-bar is the sum of the squared deviations divided by n - 1, where n represents the number of individual values. The equation for calculating the variance is shown in Equation 1. We are not going to focus on the equations; I just wanted to give an example of how the variance is calculated. We will use Minitab to include the variance. First, though, we need to discuss one other measure that describes the central tendency of a data set, the median.

Equation 1: Formula for Calculating Variance

$$ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} $$
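To make the calculation concrete, the sketch below works through the variance step by step and then takes the positive square root to get the standard deviation. The values are hypothetical, and Python's standard `statistics` module is used here only as a cross-check; it is not the tool used in this article.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical RAS concentrations in mg/L (illustrative only).
ras = [7800, 8100, 8300, 8000, 7900]

n = len(ras)
y_bar = mean(ras)

# Squared deviation of each value from the mean.
squared_devs = [(y - y_bar) ** 2 for y in ras]

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(squared_devs) / (n - 1)

# Standard deviation: positive square root of the variance.
sample_std = sqrt(sample_variance)

print(f"variance: {sample_variance:.0f} (mg/L)^2")
print(f"std dev:  {sample_std:.1f} mg/L")

# Cross-check against the library functions, which also divide by n - 1.
print(variance(ras), stdev(ras))
```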

The median of a set of measurements is defined as the middle value when the data has been arranged from the smallest value to the largest value. When the data set consists of an even number of values the mean of the two middle-most values defines the median. Using Minitab, we will expand our statistics to include the median and the variance as shown in Figure 4.
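The definition above, including the even-count case, can be sketched in a few lines of Python. The function name `median_by_hand` and the sample values are hypothetical, introduced only for illustration.

```python
from statistics import median


def median_by_hand(values):
    """Return the middle value of the sorted data.

    For an even number of values, return the mean of the two
    middle-most values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical RAS concentrations in mg/L (illustrative only).
odd_count = [8300, 7800, 8100, 7900, 8000]
even_count = odd_count + [8600]

median_odd = median_by_hand(odd_count)
median_even = median_by_hand(even_count)

print(median_odd, median_even)

# The standard library gives the same results.
print(median(odd_count), median(even_count))
```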

Figure 4: Additional Statistics Produced by Minitab

The variances are huge numbers, representing the total suspended solids (TSS) concentration in units of (mg/L)². I don’t know about you but I really cannot relate to this sample statistic. In taking the square root of the variance to obtain the standard deviation the units go back to being mg/L, which makes more sense. Regarding the sample statistics in Figure 4 it can be seen that for both clarifiers the median RAS concentrations are less than the mean values. Since the means are to the right of the medians, this suggests both data sets are skewed right. A histogram with a normal distribution fit to the data will clarify this.

Figures 5 & 6 show the histograms with the normal distribution fit to each set of data. In addition, each histogram has the median (vertical red line) and mean (vertical black line) indicated. Notice in both figures that there is no data under the left tail but the data extends well into the right tail, showing that the data is skewed right.
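Comparing the mean with the median gives a quick, informal check on the direction of skew. The sketch below is only a rule of thumb applied to hypothetical values; the helper name `skew_direction` is made up for this example, and a histogram remains the better way to confirm what the numbers suggest.

```python
from statistics import mean, median


def skew_direction(values):
    """Informal rule of thumb: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"


# Hypothetical RAS concentrations in mg/L with a long right tail.
right_tailed = [7000, 7200, 7400, 7500, 7600, 9000, 14000]

# The same idea mirrored: a few very small values form a left tail.
left_tailed = [1000, 6000, 7400, 7500, 7600, 7700, 7800]

print(mean(right_tailed), median(right_tailed), skew_direction(right_tailed))
print(mean(left_tailed), median(left_tailed), skew_direction(left_tailed))
```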

Figure 5: Histogram of Clarifier “A” RAS Concentrations

Histogram for Clarifier A

Figure 6: Histogram of Clarifier “B” RAS Concentrations

Histogram for Clarifier B
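For readers without a statistics package, the idea behind these histograms can be approximated with Python's standard library: count the values in each bin, then compare those counts with what a normal distribution having the same mean and standard deviation would predict. This is a minimal sketch on hypothetical values with an arbitrarily chosen bin width; it is not how the figures above were produced.

```python
from statistics import NormalDist

# Hypothetical RAS concentrations in mg/L (illustrative only).
ras = [7000, 7200, 7400, 7500, 7600, 7700, 7900, 8200, 9000, 11000, 14000]

# Normal distribution with the sample mean and sample standard deviation.
fit = NormalDist.from_samples(ras)

# Bin edges: 6000, 8000, ..., 16000 (bin width chosen arbitrarily).
edges = [6000 + i * 2000 for i in range(6)]
bins = list(zip(edges, edges[1:]))

# Observed count in each half-open bin [lo, hi).
observed = [sum(1 for y in ras if lo <= y < hi) for lo, hi in bins]

# Count the fitted normal distribution would predict for each bin.
expected = [len(ras) * (fit.cdf(hi) - fit.cdf(lo)) for lo, hi in bins]

for (lo, hi), obs, exp in zip(bins, observed, expected):
    print(f"{lo:>6}-{hi:<6} observed {obs:>2}  normal fit {exp:5.2f}  {'#' * obs}")
```

With right-skewed values like these, the lowest bin holds more observations than the normal fit predicts and the far right bin still contains data, which mirrors the pattern described for Figures 5 & 6.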

In Figures 7 & 8 distributions have been fitted to the data using @Risk. Each of these graphs gives a much better idea of how the data extends into the right tail, pulling the mean to the right as well. The differences between the two sets of data are better revealed in Figures 7 & 8, and further evidence of the difference is given by the fact that the best distribution fit for Clarifier A was a Beta General while the best distribution fit for Clarifier B was a Weibull. The graphical analysis is indicating a degree of difference that was hard to discern from the means, medians, and standard deviations alone.

Figure 7: Distribution Fitted to Clarifier “A” RAS Data

Clarifier A fitted distribution

Figure 8: Distribution Fitted to Clarifier “B” RAS Data

Clarifier B fitted distribution

We will conclude Exploratory Data Analysis here, ending with this question: Are the two data sets, consisting of the return activated sludge concentrations from clarifiers A & B, the same or is there a statistically significant difference between them? The answer is provided using hypothesis testing.
