Statistics & Probability
Statistics have the following important characteristics:
(i) Statistics are an aggregate of facts, not a single observation.
(ii) Statistics are expressed quantitatively.
(iii) In an experiment, statistics are related to each other and comparable. They can be classified into various groups.
(iv) Statistics are collected for a pre-determined purpose.
(v) In collection of statistics a reasonable standard of accuracy must be maintained.
Statistics have the following limitations:
(i) Statistics is not suited to the study of qualitative phenomena such as honesty, intelligence and poverty.
(ii) Statistics deals with groups and does not study individuals.
(iii) Laws of statistics are not exact. They hold only on average.
(iv) Data collected for a definite purpose may not be suitable for another purpose.
Statistical data are the facts which are collected for the purpose of investigation. There are two types of statistical data:
(i) Primary data: The data collected by an investigator for the first time for his own purpose are called primary data. As primary data are collected by the user of the data, they are more reliable and relevant.
(ii) Secondary data: The data collected by a secondary source and used by the investigator for his purpose are called secondary data. For example, the score of a cricket match noted from a newspaper is secondary data.
Thus data which are primary in the hands of one become secondary in the hands of the other.
Data collected from any source can also be divided into the following two types:
(i) Raw Data: Raw data are data obtained from the original source but not yet arranged numerically. They are also called ‘ungrouped data’. For example, the marks of 10 students in maths are given as:
75, 96, 25, 32, 89, 62, 40, 79, 35, 55
An ‘array’ is an arrangement of raw numerical data in ascending or descending order of magnitude. The above data can be written as the array
25, 32, 35, 40, 55, 62, 75, 79, 89, 96
(ii) Grouped data: An array can be placed systematically in groups or categories. For example, the above data can be grouped in the following manner:
|GROUPS||MARKS||TOTAL NUMBER OF STUDENTS|
|0 to 20||-||0|
|21 to 40||25, 32, 35, 40||4|
|41 to 60||55||1|
|61 to 80||62, 75, 79||3|
|81 to 100||89, 96||2|
(i) Variate: Variate is a quantity that may vary from observation to observation.
(ii) Range: Range is difference between the maximum and minimum observations.
(iii) Class Interval: When data are divided in groups, each group is called a class interval.
(iv) Class Limit: Every class interval has two limits. The smallest observation of the interval is called lower limit and the largest observation of the interval is called upper limit.
(v) Class Mark: The mid value of any class is called its class mark.
Class Mark = (Upper limit of the class + lower limit of the class) / 2
(vi) Class Size: Class size is defined as the difference between two successive class marks. It is also the difference between the true upper and true lower limits of any class interval.
(vii) Frequency: The number of observations falling in a particular class is called its frequency, or class frequency.
(viii) Cumulative Frequency: The cumulative frequency of any class is obtained by adding the frequencies of all the classes prior to it to its own frequency, i.e. it is the sum of all frequencies up to and including that class.
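The range, class mark and class size defined above can be checked with a short Python sketch. The marks are the ten values quoted earlier; the three exclusive classes and the helper name `class_mark` are hypothetical, chosen only for illustration.

```python
def class_mark(lower, upper):
    """Mid value of a class: (upper limit + lower limit) / 2."""
    return (lower + upper) / 2

# Marks of the 10 students quoted in the text.
scores = [75, 96, 25, 32, 89, 62, 40, 79, 35, 55]
data_range = max(scores) - min(scores)  # maximum minus minimum observation

# Hypothetical exclusive classes (an assumption, not taken from the text).
classes = [(0, 10), (10, 20), (20, 30)]
marks = [class_mark(lo, hi) for lo, hi in classes]

# Class size as the difference between two successive class marks.
sizes = [b - a for a, b in zip(marks, marks[1:])]

print(data_range)  # 71
print(marks)       # [5.0, 15.0, 25.0]
print(sizes)       # [10.0, 10.0]
```

For these classes the class size also equals the upper limit minus the lower limit of any class, 10.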
Inclusive and Exclusive distributions:
Inclusive Distribution: When the upper limit of a class does not coincide with the lower limit of the next class, the distribution is called an inclusive distribution. e.g.
|Height (in cm)||No. of Students|
Exclusive Distribution: An exclusive distribution is that distribution in which the upper limit of one class coincides with the lower limit of the next class. e.g.
|Height (in years)||No. of Students|
True Class Limit: In the case of exclusive classes the upper and lower limits are respectively known as its true upper limits and true lower limits.
In the case of inclusive classes, the true lower and upper limits are obtained by subtracting 0.5 from the lower limit and adding 0.5 to the upper limit.
True upper limits and true lower limits are also known as boundaries of the class.
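As a minimal sketch of the rule for inclusive classes, the function below widens each class by half the gap between consecutive classes. The classes are those of the marks table earlier in these notes; the function name `true_limits` and the `gap` parameter are assumptions made for illustration, with `gap=1` reproducing the 0.5 adjustment described above.

```python
def true_limits(lower, upper, gap=1):
    """Return (true lower limit, true upper limit) of an inclusive class.

    `gap` is the difference between the upper limit of one class and the
    lower limit of the next; a gap of 1 gives the 0.5 adjustment.
    """
    half = gap / 2
    return (lower - half, upper + half)

# Inclusive classes from the marks table (21 to 40, 41 to 60, ...).
inclusive = [(21, 40), (41, 60), (61, 80), (81, 100)]
boundaries = [true_limits(lo, hi) for lo, hi in inclusive]
print(boundaries)
# [(20.5, 40.5), (40.5, 60.5), (60.5, 80.5), (80.5, 100.5)]
```

After the adjustment the upper boundary of each class coincides with the lower boundary of the next, so the classes are in exclusive form.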
Tally: The tally method is used to keep the chance of error in counting to a minimum. A bar (|), called a tally mark, is put against an item each time it occurs. The fifth occurrence of an item is represented by drawing a diagonal cross tally through the first four tallies.
The tabular arrangement of data showing the frequency of each item is called a frequency distribution table. It is a method to present raw data in the form from which one can easily understand the information contained in the raw data.
Frequency distributions are of two types:
(i) Discrete frequency distribution: In this type of frequency distribution, in the first column of frequency table we write all possible values of the variables from the lowest to the highest, in the second column we write tally marks and in the third column we show frequency of each item. In this method data are not divided into groups or classes.
(ii) Continuous or Grouped Frequency Distribution: In this type of frequency distribution, the data are divided into groups or classes. This method is used where the raw data take many different values and the difference between the greatest and the smallest observations is large.
The following steps are taken to prepare a frequency distribution table:
(i) First of all we arrange the data in an array.
(ii) Then draw a table consisting of 3 columns. First column is used for class, the second column for tally and the third column for frequency.
(iii) Then in the first column we write the classes keeping the lowest and the highest scores in view.
(iv) In second column we put tally marks against each class according to the scores.
(v) Then we write frequency of each class in the third column after counting the tally.
(vi) Figures in first column and third column taken together represent the frequency table.
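The steps above can be sketched in Python using the marks and groups from the earlier table. This is only an illustration: the function name `frequency_table` is hypothetical, and the tally column is simplified to one bar per observation without the cross tally for every fifth item.

```python
from collections import Counter

scores = [75, 96, 25, 32, 89, 62, 40, 79, 35, 55]
classes = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]

def frequency_table(data, classes):
    """Return a list of (lower, upper, frequency) rows, one per class."""
    counts = Counter()
    for value in sorted(data):          # step (i): arrange the data in an array
        for lo, hi in classes:          # find the class containing the value
            if lo <= value <= hi:
                counts[(lo, hi)] += 1   # step (iv): one tally per observation
                break
    # step (v): a Counter returns 0 for classes that received no tally
    return [(lo, hi, counts[(lo, hi)]) for lo, hi in classes]

table = frequency_table(scores, classes)
for lo, hi, f in table:
    print(f"{lo} to {hi}", "|" * f, f)
```

The frequencies come out as 0, 4, 1, 3 and 2, matching the grouped table given earlier.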
Cumulative frequency table is obtained from the ordinary frequency table by successively adding the several frequencies. Thus to form a cumulative frequency table we add a column of cumulative frequency in the frequency distribution table. It is obvious that the cumulative frequency of the last class is the sum of the frequencies of all the classes.
Cumulative frequency series are of two types:
(i) Less than series
(ii) More than series
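The two series can be illustrated with the class frequencies of the marks table. The sketch assumes the usual reading of the terms, which the notes do not spell out: a less than series accumulates frequencies from the first class upward, and a more than series accumulates them from the last class downward.

```python
from itertools import accumulate

# Frequencies of the classes 0-20, 21-40, 41-60, 61-80, 81-100.
frequencies = [0, 4, 1, 3, 2]

# Less than series: running total from the first class.
less_than = list(accumulate(frequencies))

# More than series: running total from the last class, put back in class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)  # [0, 4, 5, 8, 10]
print(more_than)  # [10, 10, 6, 5, 2]
```

The last entry of the less than series and the first entry of the more than series both equal the total frequency, 10.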
Given data can be represented graphically. There are various methods of graphical representation of a frequency distribution. Here we shall study only four of them:
(i) Bar Graphs
(ii) Histogram
(iii) Frequency Polygon
(iv) Cumulative frequency curve or ogive
The frequency distribution of a discrete variable is best represented by a bar graph. The height of each bar is proportional to the frequency of the corresponding variate-value. In a bar graph the bars must be kept distinct to show that the variate-values are distinct. The bars are of equal width and are drawn with equal spacing between them on the x-axis, which depicts the variable. The frequencies are shown on the y-axis.
Histogram is a graphical representation of a grouped frequency distribution with continuous classes. It consists of a set of rectangles where heights of rectangles are proportional to their class frequencies, for equal class intervals. There is no gap between two successive rectangles. The rectangles are constructed with base as the class size and their heights representing the frequencies.
A frequency polygon is a graph of frequency distribution. It is a line graph of class frequency which is plotted against class mark. A frequency polygon can be obtained by two methods:
(1) By using Histogram: A frequency polygon can be obtained by joining the mid-points of the tops of the rectangles of a histogram. For this we obtain the mid-points of the upper horizontal sides of each rectangle and join them by dotted lines. The ends of the frequency polygon are preferably extended to the mid-points of imagined class intervals adjacent to the first and last class intervals.
(2) Frequency polygon without using Histogram: Following procedure is used to make a frequency polygon without using histogram.
(i) Calculate the class marks, x1, x2, ...., xn of each of the given class intervals.
(ii) Mark class marks x1, x2, .... xn, along X-axis and frequencies f1, f2, .... fn along Y-axis.
(iii) Plot the points (x1, f1), (x2, f2), ....., (xn, fn).
(iv) Obtain the mid-points of two imagined class intervals of zero frequency, one before the first interval and one after the last interval.
(v) Join the points (x1, f1), (x2, f2), ....., (xn, fn) by line segments and complete the frequency polygon by joining the first and last points to the mid-points of the imagined classes adjacent to them.
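The procedure above can be sketched as a function that returns the vertices of the frequency polygon in plotting order. The classes and frequencies are hypothetical, and the sketch assumes all classes have the same size, so that each imagined class lies one class size beyond the first or last class mark.

```python
def polygon_points(classes, frequencies):
    """Vertices of a frequency polygon, including the two imagined classes."""
    # Step (i): class marks of the given class intervals.
    marks = [(lo + hi) / 2 for lo, hi in classes]
    size = classes[0][1] - classes[0][0]       # assumes equal class size
    # Steps (ii)-(iii): points (class mark, frequency).
    points = list(zip(marks, frequencies))
    # Steps (iv)-(v): add the zero-frequency mid-points at both ends.
    return [(marks[0] - size, 0)] + points + [(marks[-1] + size, 0)]

# Hypothetical exclusive classes and frequencies, for illustration only.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [3, 7, 5, 2]
vertices = polygon_points(classes, frequencies)
print(vertices)
# [(-5.0, 0), (5.0, 3), (15.0, 7), (25.0, 5), (35.0, 2), (45.0, 0)]
```

Joining consecutive vertices by line segments gives the polygon; any plotting library could draw it from this list.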
The graphical representation of a cumulative frequency distribution is known as cumulative frequency curve or an ogive.
An ogive can be constructed by following two methods:
(1) Less than method: A less than ogive can be constructed by following steps:
(i) First of all we make class intervals in exclusive form if it is given in inclusive form.
(ii) Then we construct a less than type cumulative frequency distribution by adding the frequency of each class to the sum of frequencies of its prior classes.
(iii) Now we mark upper class limits along X-axis and cumulative frequencies along Y-axis.
(iv) We plot the points (upper class limit, corresponding cumulative frequency) and join them by a free hand curve.
(v) The lower limit of the first class interval becomes the upper limit of the imagined class with frequency 0. We join the imagined point (lower limit of first class, 0) with the first point of the curve and so on.
In this way we get the required curve called an Ogive by less than type method.
(2) More than method:
We apply the following steps to construct a more than type ogive:
Step (1) : First of all we make class intervals in exclusive form if it is given in inclusive form.
Step (2) : Then we construct a more than type cumulative frequency distribution.
Step (3) : Now we mark lower class limits along x-axis and cumulative frequencies along y-axis.
Step (4) : We plot the points (lower class limit, corresponding cumulative frequency) and join them by a free hand curve.
Step (5) : The upper limit of the last class interval becomes the lower limit of the imagined class interval with frequency 0. We join the imagined point (upper limit of last class, 0) with the last point of the curve to end the ogive.
In this way we get the required curve called an ogive by more than type method.
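Both constructions reduce to computing a list of points to plot. The sketch below does this for hypothetical exclusive classes; the function names are illustrative, and the free hand curve itself is left to whatever plotting tool is used.

```python
from itertools import accumulate

def less_than_ogive(classes, frequencies):
    """Points (upper class limit, less than cumulative frequency)."""
    cumulative = accumulate(frequencies)
    # Imagined point: lower limit of the first class with frequency 0.
    points = [(classes[0][0], 0)]
    points += [(hi, c) for (_, hi), c in zip(classes, cumulative)]
    return points

def more_than_ogive(classes, frequencies):
    """Points (lower class limit, more than cumulative frequency)."""
    total = sum(frequencies)
    # Sum of the frequencies of all classes before each class.
    before = [0] + list(accumulate(frequencies))[:-1]
    points = [(lo, total - b) for (lo, _), b in zip(classes, before)]
    # Imagined point: upper limit of the last class with frequency 0.
    points.append((classes[-1][1], 0))
    return points

# Hypothetical exclusive classes and frequencies, for illustration only.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [3, 7, 5, 2]
less = less_than_ogive(classes, frequencies)
more = more_than_ogive(classes, frequencies)
print(less)  # [(0, 0), (10, 3), (20, 10), (30, 15), (40, 17)]
print(more)  # [(0, 17), (10, 14), (20, 7), (30, 2), (40, 0)]
```

The less than curve rises from 0 to the total frequency and the more than curve falls from the total frequency to 0.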
An average of a distribution is a single expression which represents a group of variables in a simple and concise manner. It is the representative of entire distribution. Averages are generally in the central parts of the distribution and therefore they are called Measures of Central Tendency.
An ideal measure of central tendency should have the following properties:
(i) It should be defined rigidly.
(ii) It should be based on all observations.
(iii) It should be easy to calculate and readily comprehensible.
(iv) It should be affected as little as possible by fluctuations of sampling.
(v) It should not be affected much by extreme values.
(i) Arithmetic mean
Arithmetic mean for ungrouped data (A. M.)
The arithmetic mean is the most commonly used measure of central tendency. It is obtained by dividing the sum of the observations by the number of observations. The A. M. of n observations x1, x2, x3, ....., xn is given by
A. M. = x̄ = (x1 + x2 + x3 + ..... + xn) / n
Properties of Arithmetic Mean
(1) If x̄ is the mean of n observations x1, x2, ....., xn, then the mean of the observations x1 + a, x2 + a, ....., xn + a is x̄ + a, i.e. if each observation is increased by a, then the mean is also increased by a.
(2) If x̄ is the mean of n observations x1, x2, ....., xn, then the mean of the observations x1 – a, x2 – a, ....., xn – a is x̄ – a, i.e. if each observation is decreased by a, then the mean is also decreased by a.
(3) If x̄ is the mean of x1, x2, ....., xn, then the mean of ax1, ax2, ....., axn is ax̄, where a is any number different from zero, i.e. if each observation is multiplied by a non-zero number a, then the mean is also multiplied by a.
(4) If x̄ is the mean of n observations x1, x2, ....., xn, then the mean of x1/a, x2/a, ....., xn/a is x̄/a, where a ≠ 0, i.e. if each observation is divided by a non-zero number, then the mean is also divided by it.
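The four properties can be checked numerically. The sketch below uses hypothetical data and exact fractions so that the comparisons are free of rounding error; the helper `mean` is defined here for illustration.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    values = list(values)
    return Fraction(sum(values), len(values))

data = [2, 4, 6, 8]   # hypothetical observations
a = 3                 # hypothetical non-zero constant
x_bar = mean(data)    # 5

assert mean(x + a for x in data) == x_bar + a            # property (1)
assert mean(x - a for x in data) == x_bar - a            # property (2)
assert mean(a * x for x in data) == a * x_bar            # property (3)
assert mean(Fraction(x, a) for x in data) == x_bar / a   # property (4)
print(x_bar, x_bar + a, x_bar - a, a * x_bar, x_bar / a)  # 5 8 2 15 5/3
```

A check on one data set is not a proof, but each property follows directly from the definition of the mean.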
Arithmetic mean of Grouped Data:
Let x1, x2, x3, ....., xn be n observations whose frequencies are f1, f2, f3, ....., fn respectively; then the arithmetic mean of this distribution is given by
x̄ = (f1x1 + f2x2 + f3x3 + ..... + fnxn) / (f1 + f2 + f3 + ..... + fn)
Let x̄1 and x̄2 be the means of two groups of observations with n1 and n2 observations respectively; then the combined mean of the two groups is given by
x̄ = (n1x̄1 + n2x̄2) / (n1 + n2)
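A minimal sketch of the grouped mean (sum of fi·xi divided by sum of fi) and of the combined mean of two groups follows. All numbers and function names are hypothetical and serve only to show the arithmetic.

```python
from fractions import Fraction

def grouped_mean(values, frequencies):
    """Mean of observations x_i occurring with frequencies f_i."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return Fraction(total, sum(frequencies))

def combined_mean(mean1, n1, mean2, n2):
    """Mean of two groups taken together, from their means and sizes."""
    return Fraction(n1 * mean1 + n2 * mean2, n1 + n2)

# Hypothetical distribution: 10 occurs twice, 20 five times, 30 three times.
print(grouped_mean([10, 20, 30], [2, 5, 3]))  # (20 + 100 + 90) / 10 = 21

# Hypothetical groups: 20 observations with mean 50, 30 with mean 60.
print(combined_mean(50, 20, 60, 30))          # (1000 + 1800) / 50 = 56
```

The combined mean lies between the two group means and is pulled toward the mean of the larger group.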
Merits of Arithmetic Mean
(i) A. M. is rigidly defined.
(ii) It is very simple. One can easily understand and calculate it.
(iii) It is uniquely defined.
(iv) It is based upon all the observations.
(v) A. M. is least affected by sampling fluctuations.
(vi) The mean is capable of further mathematical treatment.
(vii) A. M. is relatively reliable.
Demerits of Arithmetic Mean
(i) A. M. cannot be used for qualitative characteristics like richness, beauty, poverty etc.
(ii) The A. M. of given data cannot be determined by inspection. It cannot be represented graphically either.
(iii) If any observation is missing then A.M. cannot be calculated.
(iv) A. M. is very much affected by extreme values. In case of extreme items, A. M. gives a distorted picture of the distribution and no longer remains representative of the distribution.
(v) If the extreme class is open, e.g. below 10 or above 100 then A. M. cannot be calculated.
(vi) If the data from which the mean has been calculated are not given along with it, the A. M. may lead to wrong conclusions.
(vii) A. M. cannot be used in the study of ratios, rates etc.
Uses of Arithmetic Mean
(i) A. M. is extensively used in practical statistics.
(ii) Estimates can be obtained using A. M.
(iii) A. M. is used for different purposes by different persons; for example, it is used for calculating the average marks of students. It is also used by businessmen to find profit per unit article, output per machine, average monthly income and expenditure, etc.
Median is defined as the value of that item of the arrayed data which divides the whole data into two equal parts. Hence we have following definition of median:
The middle item of the arrayed data is called its median.
Calculation of median of raw data:
(i) If the number of observations ‘n’ is odd, then the median will be the value of the ((n + 1)/2)th observation.
(ii) If n is even, then we have two middle terms i.e. (n/2)th observation and (n/2 + 1)th observation.
Median of the given data will be mean of these two middle observations.
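The two cases can be sketched in Python as follows. The function name `median_of` is illustrative; the even-sized example uses the ten marks quoted earlier in these notes, and the odd-sized example simply drops one of them.

```python
def median_of(data):
    """Median of raw data: middle value of the arrayed observations."""
    arr = sorted(data)          # arrange the data in an array
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1)/2)th observation, at index mid when counting from 0
        return arr[mid]
    # n even: mean of the (n/2)th and (n/2 + 1)th observations
    return (arr[mid - 1] + arr[mid]) / 2

scores = [75, 96, 25, 32, 89, 62, 40, 79, 35, 55]
print(median_of(scores))       # n = 10: (55 + 62) / 2 = 58.5
print(median_of(scores[1:]))   # n = 9, with 75 removed: 5th observation = 55
```

The function assumes the data are non-empty; an empty list would raise an error.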
Merits of Median
(i) Median is rigidly defined.
(ii) It can be easily understood and calculated.
(iii) The median is not much affected by extreme values and therefore it is a better representative as an average of given data.
(iv) The median can be calculated graphically, while mean can not be.
(v) In some cases, median can be determined even by inspection.
(vi) If the class intervals are unequal then also median can be calculated.
Demerits of Median
(i) Median is not based on all the observations.
(ii) If the number of observations is even, median cannot be determined exactly.
(iii) If there is fluctuation of sampling then the median would be much affected by it.
(iv) It is not subject to algebraic treatment.
Uses of Median
(1) Since the median is middle term of an arrayed data, therefore it is the only average which is used while dealing with qualitative data which can be arrayed but cannot be measured quantitatively.
(2) Median is used for determining the typical value in problems concerning wages, distribution of wealth etc.
Mode of a given data is the value of that observation which occurs maximum number of times i.e. the observation which occurs with the highest frequency.
According to Croxton and Cowden, “The mode of a distribution is the value at the point around which, the items tend to be most heavily concentrated.”
Mode of ungrouped data
For given ungrouped data, the mode can be located simply by inspection. It is the variate which has the maximum frequency.
Empirical Formula: If two or more observations occur the same number of times with the highest frequency, then the mode can be determined by the following formula
Mode = 3 median – 2 mean
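The rule can be sketched in Python: the mode is found by counting, and the empirical formula is used only when the highest frequency is shared. The data sets and the function name `mode_of` are hypothetical; `mean` and `median` come from Python's standard `statistics` module.

```python
from collections import Counter
from statistics import mean, median

def mode_of(data):
    """Mode by inspection, falling back to 3 median - 2 mean on a tie."""
    counts = Counter(data)
    top = max(counts.values())
    modes = [x for x, c in counts.items() if c == top]
    if len(modes) == 1:
        return modes[0]                        # unique highest frequency
    return 3 * median(data) - 2 * mean(data)   # empirical formula

print(mode_of([2, 3, 3, 5, 7]))      # 3 occurs most often
print(mode_of([1, 1, 2, 5, 5, 10]))  # tie between 1 and 5: 3(3.5) - 2(4) = 2.5
```

In the second example the formula gives only an estimate, and it need not equal any observed value.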
Merits of Mode:
(i) Mode can be easily understood and calculated.
(ii) It can be calculated graphically.
(iii) It is not affected by extreme values.
(iv) In some cases, it can be found by inspection also.
(v) It can be used for open ended distribution and qualitative data.
Demerits of Mode:
(i) Mode is not based upon all the observations.
(ii) Mode is ill-defined. It is not always possible to find a clearly defined mode.
(iii) Mode is affected to a greater extent by fluctuations of sampling.
Uses of Mode
Mode is the average which is used to find the ideal size, e.g. in business forecasting, in manufacturing of ready-made garments, shoes etc.
The theory of probability originated from games of chance related to gambling. An Italian mathematician, Jerome Cardan (1501–1576), was the first to write a book on the subject, “Book on Games of Chance”, which was published posthumously in 1663. Notable contributions were also made by the mathematicians J. Bernoulli, P. Laplace and A. A. Markov. In the twentieth century, the book “Foundation of Probability” was published by the Russian mathematician Kolmogorov in 1933, and this was the first book to introduce probability as a set function.
|(i)||Coin||:||Coin is a well known object. It has two faces, one is Head and other is Tail.|
|(ii)||Die||:||A die is a well balanced solid cube having six faces marked with numbers (dots) from 1 to 6, one number of one face. The plural of die is dice.|
|(iii)||Playing Cards||:||A pack of playing cards contains 52 cards out of which 26 are red cards and 26 are black cards.|
|These 52 cards are divided into four groups; each group is called a suit and has 13 cards. The names of these suits are:|
|(i) Diamond (♦)|
|(ii) Heart (♥)|
|(iii) Spade (♠)|
|(iv) Club (♣)|
Out of these four suits, Diamond and Heart are red cards and Spade and Club are black cards. Each suit has 13 cards, which are 1, 2, 3, ...., 10, Jack, Queen and King. The card having 1 is also called an ace. Jack, Queen and King are known as face cards. Therefore there are 12 face cards in total in a pack of 52 playing cards.
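The counts quoted above can be verified by building the pack in Python. The representation of a card as a (rank, suit) pair is an arbitrary choice made for this sketch.

```python
from itertools import product

# Suit names and colours as described in the text.
suits = {"Diamond": "red", "Heart": "red", "Spade": "black", "Club": "black"}
ranks = ["Ace"] + [str(n) for n in range(2, 11)] + ["Jack", "Queen", "King"]

# Every (rank, suit) pair is one card.
deck = list(product(ranks, suits))

face_cards = [card for card in deck if card[0] in {"Jack", "Queen", "King"}]
red_cards = [card for card in deck if suits[card[1]] == "red"]

print(len(deck))        # 52
print(len(face_cards))  # 12
print(len(red_cards))   # 26
```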
Experiment: An activity which ends in some well defined results is called an experiment. These results are called outcomes. There are two types of experiments:
(i) Deterministic experiment
Those experiments which when repeated under identical conditions produce the same results or outcomes are known as deterministic experiments.
Example: Formation of Methane in laboratory.
(ii) Random experiment
An experiment which, when repeated under identical conditions, does not produce the same outcome every time, but whose outcome is one of several possible outcomes, is known as a random experiment.
Performing an experiment once is called a trial.
The collection of all the possible outcomes of a random experiment is called a sample space. It is usually denoted by S.
Example: After tossing a coin, possible outcomes are head and tail so sample space for tossing a coin consists of head and tail.
Any possible outcome of a trial, or any collection of such outcomes, is known as an event. It is generally denoted by E. It is of two types:
(i) Simple Event: If any event E contains only one outcome of sample space then it is known as simple event. In this way each outcome of sample space related to any experiment is a simple event.
Example: The experiment of throwing a die once consists of 6 simple events, viz. the face showing up is 1 or 2 or 3 or 4 or 5 or 6.
(ii) Compound Event: If any event contains more than one outcome of the sample space, then it is known as a compound event.
Example: After throwing a die the outcome is an even number i.e. 2 or 4 or 6.
There are the following approaches to the theory of probability:
(1) Empirical Approach
(2) Classical Approach
(3) Axiomatic Approach
Here we study only the Empirical Approach to probability.
Let E be any event related to a random experiment whose sample space has n outcomes. If m of these n outcomes are favorable to the event E, then the probability of occurrence of E is
P(E) = m / n
Since 0 ≤ m ≤ n, we have 0 ≤ P(E) ≤ 1, i.e. the probability of any event lies between 0 and 1.
(i) Probability of any event cannot be less than 0 and cannot be more than 1. So it can be any fraction from 0 to 1.
(ii) If p is the probability of occurrence of an event E and q is the probability of non-occurrence of that event, then
p + q = 1
q = 1 – p
(iii) The sum of the probabilities of all the possible outcomes of a trial is 1.
Impossible Event
If the number of favorable outcomes for an event is zero then the probability of occurrence of that event will be zero and such type of event is known as Impossible Event.
Sure Event or Certain Event
If the number of favorable outcomes for an event is equal to the total number of possible outcomes then the probability of occurrence of that event will be one and such type of event is known as sure or certain event.
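The ratio of favorable outcomes to total outcomes, together with the impossible and sure events, can be illustrated with a single throw of a die. This is a sketch assuming equally likely outcomes; the function name `probability` is hypothetical.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(E) = m / n: favorable outcomes over all outcomes (equally likely)."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}       # sample space S for one throw of a die
even = {2, 4, 6}               # compound event: an even number shows up

p = probability(even, die)     # probability of occurrence
q = 1 - p                      # probability of non-occurrence
print(p, q)                    # 1/2 1/2

print(probability({7}, die))   # impossible event: 0
print(probability(die, die))   # sure event: 1

# The probabilities of all the simple events add up to 1.
print(sum(probability({face}, die) for face in die))  # 1
```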