Sharing resources on Statistical Learning Theory and Machine Learning (including SVM, Semi-Supervised Learning, Ensemble Learning, and Clustering). You are welcome to contact me: Email: xiankaichen@gmail.com, QQ: 112035246

Wednesday, June 25, 2008

Box Plot Graphs

  • Conceptual Overview

Box Plot graphs, also referred to as Box and Whisker Plot graphs, are quite common in statistics and quality measurements. Graphs in the Box Plot data class organize data items by category. There is only one series of data, and all data items are the same color.

A Box Plot graph data item has five main values: Low, Q1, Median, Q3 and High. These numbers are determined from the data set you are using to create the Box Plot. The data set can also contain any number of Outlier data values.

Let's look at an example with a small set of data:

35, 42, 48, 50, 51, 53, 54, 60, 75

Important: These are not the numbers we send to the graph. These are the numbers we use to compute the values that we send to the graph. The graph itself cannot compute these figures for us.

In this case, the Median value is 51. If we divide the data into two sets, we have 35, 42, 48, 50, 51 and 51, 53, 54, 60, 75 (the Median value is included in both sets because there is an odd number of items). Now we find the median of each of these two sets to find the first quartile, Q1, and the third quartile, Q3. These numbers are 48 and 54 respectively.
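The quartile computation described above can be sketched in Python. This is a minimal illustration, not part of any graphing library: the function name `median_inclusive_quartiles` is hypothetical, and the handling of an even number of items (splitting the data into two equal halves) is an assumption, since the text only covers the odd case.

```python
from statistics import median

def median_inclusive_quartiles(values):
    """Return (Q1, Median, Q3) using the split described in the text.

    For an odd number of items the median is included in both halves.
    For an even number of items the data is split into two equal halves
    (an assumption; the text does not cover that case).
    """
    data = sorted(values)
    half = len(data) // 2
    if len(data) % 2 == 1:
        lower = data[:half + 1]   # includes the median
        upper = data[half:]       # includes the median
    else:
        lower = data[:half]
        upper = data[half:]
    return median(lower), median(data), median(upper)

q1, med, q3 = median_inclusive_quartiles([35, 42, 48, 50, 51, 53, 54, 60, 75])
print(q1, med, q3)  # 48 51 54
```

Other quartile conventions exist and can give slightly different Q1 and Q3 values for the same data, so it is worth checking which one your tools use.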

The difference between Q1 and Q3 is 6 (54 - 48 = 6). This is called the interquartile range, or IQR. The Low and High values are based on the IQR. The Low value, or lower whisker, lies at most 1.5 times the IQR below Q1; in this case, that is at most 9 (6 x 1.5 = 9). So the minimum Low value for our example is 39 (48 - 9 = 39). However, the Low value is the smallest data item value that is equal to or greater than this minimum; in our set, the Low value is 42.

The High value, or upper whisker, is determined similarly. We find the maximum High value, which is Q3 plus 1.5 times the IQR, or 63 (54 + 9 = 63). We then find the largest data item value that is equal to or less than the maximum High value; in our data set, the High value is 60.

Therefore, for this data set, the values are as follows: Low=42, Q1=48, Median=51, Q3=54 and High=60.
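As a rough sketch of the whisker rule, the Low and High values can be computed from the data once Q1 and Q3 are known. The function name `whiskers` and the parameter `k` are hypothetical names chosen for illustration; Q1 = 48 and Q3 = 54 are taken from the worked example.

```python
def whiskers(values, q1, q3, k=1.5):
    """Return (Low, High): the extreme data values within k * IQR of the box."""
    iqr = q3 - q1
    min_low = q1 - k * iqr    # minimum Low value (39 in the example)
    max_high = q3 + k * iqr   # maximum High value (63 in the example)
    inside = [v for v in values if min_low <= v <= max_high]
    return min(inside), max(inside)

data = [35, 42, 48, 50, 51, 53, 54, 60, 75]
low, high = whiskers(data, q1=48, q3=54)
print(low, high)  # 42 60
```

The whiskers snap to actual data values rather than to the computed limits, which is why Low is 42 and not 39.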

So what do we do with the values 35 and 75? These are Outlier values, because they lie more than 1.5 times the IQR below Q1 or above Q3. Each Outlier value is represented by a small circle symbol in the Box Plot graph. If an Outlier is more than 3 times the IQR away from Q1 or Q3, it is classified as an extreme Outlier and is represented in the graph by a plus sign. In our example, the lower threshold for an extreme Outlier is Q1 minus 3 times the IQR, or 30 (48 - (3 x 6) = 30). Since our minimum data value is 35 and thus higher than this threshold, it is a normal Outlier. The upper threshold for an extreme Outlier is Q3 plus 3 times the IQR, or 72 (54 + (3 x 6) = 72). Our maximum data value is 75 and thus larger than this threshold, making it an extreme Outlier value.
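The outlier classification can be sketched the same way. The function name `classify_outliers` is hypothetical, and Q1 = 48 and Q3 = 54 are again taken from the worked example rather than computed here.

```python
def classify_outliers(values, q1, q3):
    """Split values into (normal outliers, extreme outliers).

    A value is a normal Outlier if it lies more than 1.5 * IQR outside
    the box, and an extreme Outlier if it lies more than 3 * IQR outside.
    """
    iqr = q3 - q1
    normal, extreme = [], []
    for v in values:
        # Distance below Q1 or above Q3; negative when v is inside the box.
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            normal.append(v)
    return normal, extreme

data = [35, 42, 48, 50, 51, 53, 54, 60, 75]
normal, extreme = classify_outliers(data, q1=48, q3=54)
print(normal, extreme)  # [35] [75]
```

In the graph described here, the values in the first list would be drawn as small circles and those in the second as plus signs.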

The image below displays the Box Plot graph derived from our data set.

Box Plot graphs are in the Box Plot Data Class.

  • Possible Uses

Box Plot graphs can be used for:

  • Statistical analysis

  • Emphasizing outlying data
