
Non-Parametric Tests: The Kolmogorov-Smirnov One-Sample and Two-Sample Tests

 Non-Parametric tests

Kolmogorov-Smirnov One-Sample Test (also called the Kolmogorov-Smirnov Goodness-of-Fit Test)

The test is named after A. N. Kolmogorov and N. V. Smirnov, who developed two versions: a one-sample test and a two-sample test. In this article we discuss the Kolmogorov-Smirnov one-sample test. It is a simple non-parametric test used to check whether the data follow a specific distribution, i.e. whether there is a significant difference between the observed and theoretical distributions (the theoretical distribution means the assumed or hypothesized distribution). The test measures the goodness of fit of the theoretical distribution. We know that the chi-square test is also used to test goodness of fit; the main difference is that the chi-square test is used when the data are categorical, while the Kolmogorov-Smirnov test is used when the data are continuous.

The assumptions of the Kolmogorov-Smirnov test are as follows:

1. The sample is selected from a population having an unknown distribution.

2. The observations are independent.

3. The variable under study is continuous.

The procedure of the Kolmogorov-Smirnov one-sample test is:

Let X1, X2, ..., Xn be a random sample of size n from a population with unknown continuous distribution function F(X). We are interested in testing whether the data follow a specified distribution F0(X) or not. The hypotheses are:

H0: The data follow the specified distribution.

H1: The data do not follow the specified distribution.

OR 

H0: F(X) = F0(X)   vs   H1: F(X) ≠ F0(X)

(This is a two-tailed test.)

The test consists of the following steps:

Step I: The Kolmogorov-Smirnov test (K-S test) is based on a comparison of the empirical cumulative distribution function (the observed or sample distribution) with the theoretical cumulative distribution function (the specified or hypothesized cumulative distribution function). We first find the empirical cumulative distribution function, which is based on the sample and is defined as the proportion of sample observations less than or equal to some value x; it is denoted S(X).

S(X) = (number of observations less than or equal to x) / (total number of observations)
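The proportion in the formula above can be computed directly. Here is a minimal Python sketch (not from the article; the data values are hypothetical):

```python
def ecdf(sample, x):
    """Empirical cumulative distribution function S(x):
    the proportion of sample observations <= x."""
    return sum(1 for obs in sample if obs <= x) / len(sample)

data = [0.2, 0.5, 0.1, 0.9, 0.4]
print(ecdf(data, 0.4))  # 3 of the 5 observations are <= 0.4, so S(0.4) = 0.6
```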

Step II: In this step we find the theoretical cumulative distribution function F0(X) for all possible values of x.

Step III: After finding both the empirical and theoretical cumulative distribution functions for all possible values of x, we take their difference for all x,

i.e. S(X) − F0(X) for all x.

Step IV: The test statistic is Dn = Sup | S(X) − F0(X) |,

where Dn is the supremum over all x of the absolute value of the difference between the empirical and theoretical cumulative distribution functions.

Step V: The calculated value of the test statistic Dn = Sup | S(X) − F0(X) | is compared with the critical value at the α% level of significance, and we decide whether to accept or reject the null hypothesis.
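Steps I–IV can be sketched in Python as follows. This is a minimal illustration, not from the article; the data and the choice of the Uniform(0,1) CDF as F0 are hypothetical. Because S(X) is a step function, the supremum is attained at a sample point, so we check both sides of each jump:

```python
def ks_one_sample(sample, F0):
    """Compute Dn = sup_x |S(x) - F0(x)| for a one-sample K-S test."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = F0(x)
        # S(x) jumps from (i-1)/n to i/n at x, so compare F0(x) with both sides
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

# Hypothetical data tested against the Uniform(0,1) CDF, F0(x) = x
data = [0.1, 0.2, 0.5, 0.7, 0.9]
D = ks_one_sample(data, lambda x: x)
print(round(D, 3))  # 0.2 for this data
```

The value of Dn is then compared with the critical value as described in Step V.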

Two cases arise: the small-sample test and the large-sample test.

i) Small-sample test: If the sample size n is less than or equal to 40 (i.e. n ≤ 40),

then the test statistic Dn = Sup | S(X) − F0(X) | is compared with the tabulated critical value at the α% level of significance.

If the calculated Dn is greater than or equal to the critical value Dn,α at the α% level of significance, i.e.

Dn ≥ Dn,α

we reject the null hypothesis at the α% level of significance; otherwise we accept the null hypothesis.

ii) Large-sample test: If the sample size n is greater than 40 (i.e. n > 40),

then the test statistic Dn = Sup | S(X) − F0(X) | is again compared with the critical value at the α% level of significance.

When the sample size n is greater than 40, the critical value at the given α% level of significance is calculated approximately as

Dn,α = 1.36 / √n      (note that in the large-sample test we calculate the critical value)

where n is the sample size.
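The approximate critical value above is a one-line calculation. A small sketch (the sample size n = 100 and the default constant 1.36 for α = 0.05 are illustrative choices, not from the article):

```python
import math

def ks_critical_large_n(n, c=1.36):
    """Approximate large-sample critical value D_{n,alpha} = c / sqrt(n).
    c = 1.36 corresponds to the 0.05 level of significance."""
    return c / math.sqrt(n)

print(round(ks_critical_large_n(100), 4))  # 1.36 / 10 = 0.136
```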

We then compare the calculated and critical values of Dn and decide whether to accept or reject the null hypothesis. If the calculated Dn is greater than or equal to the critical value Dn,α at the α% level of significance, i.e. Dn ≥ Dn,α, we reject the null hypothesis at the α% level of significance; otherwise we accept the null hypothesis.


Kolmogorov-Smirnov Two-Sample Test

The Kolmogorov-Smirnov one-sample test compares an empirical cumulative distribution function with a hypothesized cumulative distribution function; the two-sample test compares the empirical cumulative distribution functions of two samples.

Assumptions:

1. The samples are selected from populations having unknown distributions.

2. The observations are independent.

3. The variable under study is continuous.

The procedure of the Kolmogorov-Smirnov two-sample test is:

Let X1, X2, ..., Xn1 and Y1, Y2, ..., Yn2 be random samples of sizes n1 and n2 from the first and second populations respectively. Let S1(X) and S2(X) be the empirical cumulative distribution functions of the first and second samples respectively. We want to test whether the samples come from populations with the same distribution or not. The hypotheses are:

H0:    F1(X) =  F2(X)

H1:  F1(X) ≠ F2(X)

The test consists of the following steps:

Step I: The test is based on a comparison of the sample (empirical) cumulative distribution functions. We first calculate the empirical cumulative distribution function of each sample, denoted S1(X) and S2(X), each calculated as the proportion of sample observations less than or equal to some value x out of the total number of observations in that sample.

S1(X) = (number of observations in the first sample less than or equal to x) / (total number of observations in the first sample)

and 

S2(X) = (number of observations in the second sample less than or equal to x) / (total number of observations in the second sample)

S1(X) and S2(X) are calculated for all values of x.

Step II: After calculating the empirical distribution functions S1(X) and S2(X) for both samples, we take the difference between them, i.e. S1(X) − S2(X).

Step III: The test statistic is Dn = Sup | S1(X) − S2(X) |,

where Dn is the supremum (maximum) over all x of the absolute value of the difference between the empirical cumulative distribution functions of the two samples.

The calculated value of the test statistic Dn = Sup | S1(X) − S2(X) | is compared with the critical value at the α% level of significance, and we decide whether to accept or reject the null hypothesis.
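The two-sample statistic can be sketched in Python as follows (a minimal illustration, not from the article; the samples a and b are hypothetical). Since both ECDFs are step functions, it is enough to evaluate the difference at the observed points of either sample:

```python
def ks_two_sample(x, y):
    """Compute D = sup_t |S1(t) - S2(t)| for a two-sample K-S test."""
    def ecdf(sample, t):
        # Proportion of observations in the sample that are <= t
        return sum(1 for v in sample if v <= t) / len(sample)

    # The supremum is attained at one of the observed data points
    points = sorted(set(x) | set(y))
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in points)

# Hypothetical samples
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, 7]
print(ks_two_sample(a, b))  # maximum ECDF gap is 0.4
```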

Two cases arise: the small-sample test and the large-sample test.

i) Small-sample test: If the sample size n is less than or equal to 40 (i.e. n ≤ 40),

then the test statistic Dn = Sup | S1(X) − S2(X) | is compared with the tabulated critical value at the α% level of significance.

If the calculated Dn is greater than or equal to the critical value Dn,α at the α% level of significance, i.e.

Dn ≥ Dn,α

we reject the null hypothesis at the α% level of significance; otherwise we accept the null hypothesis.

ii) Large-sample test: If the sample size n is greater than 40 (i.e. n > 40),

then the test statistic Dn = Sup | S1(X) − S2(X) | is compared with the critical value at the α% level of significance.

When the sample sizes are large, the critical value at the given α% level of significance is calculated approximately as

Dn,α = 1.36 / √n      (note that in the large-sample test we calculate the critical value)

Here the choice of the constant 1.36 depends on the level of significance: 1.22 for the 0.10 level, 1.36 for the 0.05 level, and 1.63 for the 0.01 level; refer to the table below.

OR

The formula above applies when the sample sizes are equal, i.e. n1 = n2 = n. For unequal sample sizes, the critical value is calculated using the formula

Dn,α = c(α) × √((n1 + n2) / (n1 n2)),

where the value of c(α) is selected from the table:

α       0.10    0.05    0.025   0.01
c(α)    1.22    1.36    1.48    1.63

We then compare the calculated and critical values of Dn and decide whether to accept or reject the null hypothesis. If the calculated Dn is greater than or equal to the critical value Dn,α at the α% level of significance, i.e. Dn ≥ Dn,α, we reject the null hypothesis at the α% level of significance; otherwise we accept the null hypothesis.
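The unequal-sample-size critical value can be computed as a short sketch (the sizes n1 = 50 and n2 = 60 are hypothetical; the c(α) values are those from the table above):

```python
import math

# c(alpha) values from the table above
C_ALPHA = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48, 0.01: 1.63}

def ks_two_sample_critical(n1, n2, alpha=0.05):
    """Critical value D_alpha = c(alpha) * sqrt((n1 + n2) / (n1 * n2))."""
    return C_ALPHA[alpha] * math.sqrt((n1 + n2) / (n1 * n2))

print(round(ks_two_sample_critical(50, 60), 4))
```

Note that for equal sample sizes n1 = n2 = n the formula reduces to c(α) · √(2/n).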


These tests are helpful to T.Y. B.Sc. (Statistics) students; please share this article with them.
