~Class Meeting #2 - Describing data


   
                              

~Frequency distribution: The data values are listed individually or grouped into intervals (classes), along with their counts (frequencies).

~Relative frequency distribution: The count for each data value or interval (class) is expressed as a percent or decimal (divide the frequency of each class by the total frequency).

~Cumulative frequency distribution: The sum of the frequency for a given class and the frequencies of all classes below that class.

~Note: The following are guidelines for grouping data:
1) There should be between 5 & 20 classes (levels of data: intervals or points).
2) Each data point belongs to one & only one class.
3) All classes (intervals) should have the same width.

~Note: Single-valued grouping has one numerical value for each class (discrete data).

~Example: The following are hypothetical results from one of my calculus exams:
52,62,65,65,68,68,70,73,75,75,78,82,82,87,88,88,88,89,89,92,96,96,98,98,99.

Scores    Frequency    Relative frequency    Cumulative frequency
50-59         1        1/25 = .04 = 4%                1
60-69         5        5/25 = .20 = 20%               6
70-79         5        5/25 = .20 = 20%              11
80-89         8        8/25 = .32 = 32%              19
90-99         6        6/25 = .24 = 24%              25
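
The table above can be reproduced with a short Python sketch. The function name frequency_table and the variable names are illustrative assumptions, not part of the notes; the classes are taken to be the five 10-point intervals shown in the table.

```python
# Scores from the example above.
scores = [52, 62, 65, 65, 68, 68, 70, 73, 75, 75, 78, 82, 82,
          87, 88, 88, 88, 89, 89, 92, 96, 96, 98, 98, 99]

# Lower class limits; each class is 10 wide (50-59, 60-69, ...).
lower_limits = [50, 60, 70, 80, 90]
width = 10


def frequency_table(data, lower_limits, width):
    """Return one (label, frequency, relative, cumulative) row per class."""
    total = len(data)
    rows = []
    cumulative = 0
    for low in lower_limits:
        high = low + width - 1  # upper class limit for whole-number data
        freq = sum(1 for x in data if low <= x <= high)
        cumulative += freq
        rows.append((f"{low}-{high}", freq, freq / total, cumulative))
    return rows


table = frequency_table(scores, lower_limits, width)
for label, freq, rel, cum in table:
    print(f"{label:<8}{freq:>4}{rel:>8.2f}{cum:>6}")
```

The last cumulative frequency always equals the total number of data points (25 here), which is a quick check that no score was missed or counted twice.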

Lower class limits are 50, 60, 70, 80, 90.
Upper class limits are 59, 69, 79, 89, 99.
Class midpoint (class mark) is the average of the class limits: 54.5, 64.5, 74.5, 84.5, 94.5.
Class width is the difference between consecutive lower class limits: 10 in each case.
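
As a quick check of the midpoints and widths, here is a minimal Python sketch (the variable names are assumptions chosen for illustration):

```python
lower_limits = [50, 60, 70, 80, 90]
upper_limits = [59, 69, 79, 89, 99]

# Midpoint (class mark): average of the lower and upper limit of each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Width: difference between consecutive lower class limits.
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

print(midpoints)
print(widths)
```

Note that the width is not upper limit minus lower limit of the same class: 59 - 50 = 9, while the class 50-59 actually covers 10 whole-number scores.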

~Frequency Histogram: A bar graph where the bars (rectangles) are drawn adjacent to each other (no gaps). The horizontal axis displays the classes & the heights of the rectangles display the frequencies. Each rectangle is centered over the midpoint of its class.

~Relative Frequency Histogram: Same as above with the vertical axis displaying the relative frequencies. An added feature is that the areas of all the rectangles sum to 1 if the data set consists of single values (rectangle widths are 1).

~Cumulative Frequency Histogram: Same as above with the vertical axis displaying the cumulative frequencies.

~Frequency Polygon: Points are placed at the top of each rectangle (over its midpoint) then these points are connected by line segments. Segments are extended to the extreme left & right so that they originate & terminate on the horizontal axis.
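
~Note: A minimal Python sketch of the points of a frequency polygon, using the exam-score classes from the example above. The variable names are assumptions, and the two end points are placed one class width beyond the first and last midpoints, which is the usual way to bring the polygon down to the horizontal axis.

```python
midpoints = [54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [1, 5, 5, 8, 6]
width = 10

# One point above each class midpoint, at the height of its rectangle.
points = list(zip(midpoints, frequencies))

# Extend to the axis: frequency 0 one class width to the left and right.
polygon = [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

for x, y in polygon:
    print(x, y)
```

Connecting these seven points in order with line segments gives the polygon.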

~Ogive: Points are placed over the upper class limit of each class, with the vertical axis displaying the cumulative frequencies, then connected by line segments. We start the ogive from a point over the smallest value & end with a point over the largest value. (Useful for determining the number of values below some particular value; a good way to visualize percentiles if the vertical axis shows relative cumulative frequencies.)
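
~Note: A minimal Python sketch of the ogive points for the exam-score example. The names are assumptions, and so is the starting point: it is placed at the first lower class limit (50) with cumulative frequency 0.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99]
frequencies = [1, 5, 5, 8, 6]

# Running totals of the class frequencies.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Assumed starting point (50, 0), then one point over each upper class limit.
ogive_points = [(50, 0)] + list(zip(upper_limits, cumulative))

# Relative version: vertical axis runs from 0 to 1, handy for percentiles.
relative_points = [(x, c / total) for x, c in ogive_points]

for (x, c), (_, r) in zip(ogive_points, relative_points):
    print(x, c, round(r, 2))
```

Reading the relative version at 79 gives 0.44, meaning 44% of the scores are 79 or below.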

~Pareto Graphs: (used for qualitative or discrete data). A bar graph with adjacent bars arranged in decreasing order of their frequencies. (The vertical axis shows frequencies or relative frequencies.)

~Note: For single-valued data, we use bar graphs & place the center of the bar over the value.

~Note: Bar graphs are very commonly displayed sideways (with horizontal bars).

~Pie Charts: (I think they should be called Pizza Charts). Each relative frequency (%) is shown as a slice of the pie. The key is to get a reasonable estimate of the central angle for each slice (class): central angle = relative frequency × 360°.
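
~Note: A minimal Python sketch of the central angles, using the class frequencies from the exam-score example purely as an illustration (the variable names are assumptions). Each angle is the relative frequency times 360 degrees.

```python
labels = ["50-59", "60-69", "70-79", "80-89", "90-99"]
frequencies = [1, 5, 5, 8, 6]
total = sum(frequencies)

# Central angle of each slice = relative frequency * 360 degrees.
angles = [f / total * 360 for f in frequencies]

for label, angle in zip(labels, angles):
    print(f"{label}: {angle:.1f} degrees")
```

The angles should add up to 360 degrees, which is a useful check before drawing.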

~Dot Plots: (used for small amounts of data). A single horizontal axis is drawn covering the range of the data, and a dot is placed above the axis at each data value (repeated values are stacked).

~Stem and Leaf Plots: (more informative than a histogram since the actual data points are used and are visualized). The data is separated into two columns. The right column contains the leaves (the ones digits) and the left column contains the stems (the tens or higher digits).
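
~Note: A minimal Python sketch of a stem and leaf plot for the exam scores from the example above. The function name stem_and_leaf is an assumption, and the sketch assumes non-negative whole numbers so that the stem is the tens (or higher) part and the leaf is the ones digit.

```python
from collections import defaultdict

scores = [52, 62, 65, 65, 68, 68, 70, 73, 75, 75, 78, 82, 82,
          87, 88, 88, 88, 89, 89, 92, 96, 96, 98, 98, 99]


def stem_and_leaf(data):
    """Group the ones digits (leaves) under their tens digits (stems)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)  # e.g. 87 -> stem 8, leaf 7
        stems[stem].append(leaf)
    return dict(stems)


plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Turned on its side, the rows of leaves have the same shape as the histogram for the classes 50-59 through 90-99, but every individual score can still be read off.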

~Scatter diagrams: Paired data (x,y), collected from two different data sets (one for x and one for y), are plotted as points in the xy-plane. Looking at the way the points are scattered, one can determine whether a relationship is present. The relationship could be linear or non-linear. A linear relationship is directly related to correlation (will study later in the course). Equations are found & predictions are made (regression line).

~Time-Series Graph: Data collected at successive points in time & plotted as Quantity vs. Time in the xy-plane. Many trends can be visualized this way.

~Examples of each (will do these during class time)
                                                    
Pareto Chart: Train derailments: 23 were caused by bad track, 9 by faulty equipment, 12 by human error, & 6 had other causes. The following is the Pareto Chart. (will do in class)
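
Ahead of the in-class drawing, here is a minimal Python sketch of the ordering step and a rough text version of the chart. The variable names and the text-bar layout are assumptions for illustration only.

```python
# Derailment counts from the example above.
causes = {
    "bad track": 23,
    "faulty equipment": 9,
    "human error": 12,
    "other": 6,
}
total = sum(causes.values())

# Pareto order: bars sorted by decreasing frequency.
ordered = sorted(causes.items(), key=lambda kv: kv[1], reverse=True)

for name, count in ordered:
    # Name, frequency, relative frequency, then one '#' per derailment.
    print(f"{name:<18}{count:>3}{count / total:>7.0%}  {'#' * count}")
```

Sorting first means the tallest bar (bad track) comes first and the shortest (other causes) comes last.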