Basic parameters of a small sample

Small sample method

The main advantage of the small sample method is the ability to evaluate the dynamics of the process over time while reducing the time spent on computational procedures.

Instantaneous samples of 5 to 20 units each are randomly selected at certain intervals of time. The sampling period is established empirically and depends on the stability of the process, which is determined by analyzing a priori information.

For each instantaneous sample, the main statistical characteristics are determined. The instantaneous samples and their main statistical characteristics are presented in Appendix B.

A hypothesis about the homogeneity of sample dispersion is put forward and tested using one of the possible criteria (Fisher’s criterion).

Testing the hypothesis about the homogeneity of sample characteristics.

To check the significance of the difference between the arithmetic means of two series of measurements, the measure G is introduced. The calculations are given in Appendix B.

The decision rule is formulated as follows:

where tp is the quantile of the normalized distribution at the given confidence probability P = 0.95; for n = 10, tp = 2.78.

When the inequality is satisfied, the hypothesis is confirmed that the difference between the sample means is not significant.

Since the inequality is satisfied in all cases, the hypothesis that the difference between the sample means is not significant is confirmed.
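A minimal sketch of this comparison of the means of two instantaneous samples. The data are illustrative, not the Appendix B values; t_crit = 2.101 is the tabulated two-sided Student value for df = 18 at P = 0.95:

```python
import math

# Hypothetical instantaneous samples (illustrative values only).
x1 = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 9.8, 10.1, 10.2, 10.0]
x2 = [10.0, 10.2, 10.1, 9.9, 10.3, 10.1, 10.0, 10.2, 9.9, 10.1]

def mean(xs):
    return sum(xs) / len(xs)

def var_unbiased(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(x1), len(x2)
m1, m2 = mean(x1), mean(x2)
# Pooled variance estimate (the two samples are assumed to have equal variance)
s2 = ((n1 - 1) * var_unbiased(x1) + (n2 - 1) * var_unbiased(x2)) / (n1 + n2 - 2)
t = abs(m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))

t_crit = 2.101  # two-sided Student quantile for df = 18, P = 0.95
print(t < t_crit)  # True -> the difference between the means is not significant
```

When the computed t stays below the tabulated quantile, the hypothesis that the difference between the sample means is not significant is accepted, exactly as in the decision rule above.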

To test the hypothesis about the homogeneity of sample variances, the measure F0 is introduced as the ratio of the unbiased estimates of the variances of the results of two series of measurements. The larger of the two estimates is taken as the numerator: if Sx1 > Sx2, then F0 = S²x1 / S²x2.

The calculation results are given in Appendix B.

Then the confidence probability P is specified and the values of F(K1; K2; α/2) are determined, with K1 = n1 − 1 and K2 = n2 − 1.

With P = 0.95 (α/2 = 0.025), K1 = 10 − 1 = 9 and K2 = 10 − 1 = 9, we have F(9; 9; 0.025) = 4.1.

Decision rule: if F(K1; K2; α/2) > F0, then the hypothesis of homogeneity of the variances in the two samples is accepted.

Since the condition F(K1; K2; α/2) > F0 is satisfied in all cases, the hypothesis of homogeneity of variances is accepted.
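A sketch of the F-ratio computation. The standard deviations are illustrative, not the Appendix B values; 4.03 is the tabulated upper 2.5% point of F(9; 9), close to the 4.1 quoted above:

```python
# Unbiased standard-deviation estimates from two series of n = 10 measurements
# (illustrative values only).
s1, s2 = 0.21, 0.15

# The larger estimate goes in the numerator, so F0 >= 1 by construction.
F0 = max(s1, s2) ** 2 / min(s1, s2) ** 2

F_crit = 4.03  # upper 2.5% point of F(9; 9) for two samples of n = 10
print(F0 < F_crit)  # True -> homogeneity of variances is accepted
```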

Thus, the hypothesis of homogeneity of the sample variances is confirmed, which indicates the stability of the process. The hypothesis of homogeneity of the sample means, tested by the method of comparison of means, is also confirmed; this means that the center of dispersion has not shifted and the process is in a stable state.

Scatter and precision plot method

Over a certain period of time, instantaneous samples of 3 to 10 products each are taken, and the statistical characteristics of each sample are determined.

The obtained data are plotted on diagrams with time t (or the sample number k) on the abscissa axis and, on the ordinate axis, the individual values xk or the value of one of the statistical characteristics (the sample arithmetic mean, the sample standard deviation). In addition, two horizontal lines, Тв and Тн, are drawn on the diagram, marking the upper and lower limits of the product tolerance range.
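A sketch of preparing the points for such a diagram: for each instantaneous sample k the mean and standard deviation are computed and compared against the tolerance limits. The plotting itself is omitted; the samples and tolerance limits are illustrative, not the Appendix B data:

```python
import math

# Three hypothetical instantaneous samples of 5 products each.
samples = [
    [20.1, 20.3, 19.9, 20.2, 20.0],
    [20.4, 20.2, 20.5, 20.3, 20.6],
    [19.8, 20.0, 19.7, 19.9, 20.1],
]
T_upper, T_lower = 20.8, 19.4  # tolerance limits (Тв and Тн)

points = []  # (sample number k, mean, standard deviation) for the chart
for k, s in enumerate(samples, start=1):
    m = sum(s) / len(s)
    sd = math.sqrt(sum((x - m) ** 2 for x in s) / (len(s) - 1))
    points.append((k, round(m, 3), round(sd, 3)))

for k, m, sd in points:
    inside = T_lower <= m <= T_upper
    print(k, m, sd, "within tolerance" if inside else "OUT of tolerance")
```

The resulting (k, mean, sd) triples are exactly what gets plotted on the precision chart, with Тв and Тн as horizontal reference lines.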

Instantaneous samples are given in Appendix B.


Figure 1 – Accuracy chart

The diagram clearly shows the course of the production process and, in this case, indicates that the process is unstable.

The extension of sample characteristics to the general population, based on the law of large numbers, requires a sufficiently large sample size. In the practice of statistical research, however, one often encounters the impossibility, for one reason or another, of increasing the size of a small sample. This applies to studies of the activities of enterprises, educational institutions, commercial banks, etc., whose number in a region is, as a rule, insignificant, sometimes only 5-10 units.

In the case when the sample population consists of a small number of units, fewer than 30, the sample is called small. In this case, Lyapunov's theorem cannot be used to calculate the sampling error, since the sample mean is significantly influenced by the value of each randomly selected unit and its distribution may differ substantially from the normal one.

In 1908, W.S. Gosset proved that the estimate of the discrepancy between the sample mean of a small sample and the general mean has a special distribution law (see Chapter 4). Dealing with the problem of the probabilistic estimation of a sample mean from a small number of observations, he showed that in this case one must consider the distribution not of the sample means themselves but of their deviations from the mean of the original population. In this case the conclusions can be quite reliable.

Student's discovery is called small sample theory.

When assessing the results of a small sample, the value of the general variance is not used in the calculations. In small samples, the “corrected” sample variance is used to calculate the average sampling error:

i.e., in contrast to large samples, the denominator contains (n − 1) instead of n. The calculation of the average sampling error for a small sample is given in Table 5.7.
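As a sketch (with illustrative data, not the Table 5.7 values), the corrected sample variance with the (n − 1) denominator and the resulting average error of a small sample can be computed as:

```python
import math

# Illustrative small sample of n = 8 observations.
x = [5.2, 5.4, 5.1, 5.3, 5.0, 5.5, 5.2, 5.3]
n = len(x)
mean = sum(x) / n

s2 = sum((v - mean) ** 2 for v in x) / (n - 1)  # "corrected" variance, denominator n - 1
mu = math.sqrt(s2 / n)                          # average error of the small sample

print(round(s2, 4), round(mu, 4))
```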

Table 5.7

Calculation of the average error of a small sample

The marginal error of a small sample is Δ = t·μ, where t is the confidence coefficient.

The value of t is related to the probability estimate differently than in a large sample. In accordance with the Student distribution, the probability estimate depends both on the value of t and on the sample size n, in the case when the marginal error does not exceed t times the average error in small samples; to a large extent it depends on the number of units selected.

W.S. Gosset compiled a table of probability distributions in small samples corresponding to given values of the confidence coefficient t and different small-sample sizes n; an excerpt from it is given in Table 5.8.

Table 5.8

Fragment of Student's probability table (probabilities multiplied by 1000)

The data in Table 5.8 indicate that with an unlimited increase in the sample size (n = ∞), the Student distribution tends to the normal law, and at n = 20 it already differs little from it.

The Student distribution table is often given in a different form, more convenient for practical use (Table 5.9).

Table 5.9

Some values of Student's t-distribution

Number of degrees of freedom | for a one-sided interval | for a two-sided interval | P = 0.99

Let us look at how to use the distribution table. For each fixed value of P, the number of degrees of freedom k is calculated, where k = n − 1. For each value of the degrees of freedom, the table gives the limiting value tp (t0.95 or t0.99), which with the given probability P will not be exceeded due to random fluctuations in the sampling results. Based on the value of tp, the boundaries of the confidence interval are determined.

As a rule, a confidence level of P = 0.95 or P = 0.99 is used for two-sided testing, which does not exclude the choice of other probability values. The probability value is chosen based on the specific requirements of the tasks for which the small sample is used.

The probability that the general mean falls outside the confidence interval is equal to q, where q = 1 − P. This value is very small: for the considered probabilities P it is 0.05 and 0.01, respectively.
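A minimal sketch of building such a confidence interval, with illustrative data and the tabulated tp = 2.262 for k = 9 degrees of freedom at P = 0.95 (two-sided):

```python
import math

# Illustrative small sample of n = 10 measurements.
x = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2, 12.0, 12.4, 11.7, 12.1]
n = len(x)
x_bar = sum(x) / n
s = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))  # corrected estimate
mu = s / math.sqrt(n)                                      # average error

t_p = 2.262  # Student's t for k = n - 1 = 9 degrees of freedom, P = 0.95
low, high = x_bar - t_p * mu, x_bar + t_p * mu
print(round(low, 3), round(high, 3))  # confidence interval for the general mean
```

With probability q = 1 − P = 0.05, the general mean falls outside this interval.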

Small samples are widespread in the technical sciences and biology, but they must be used in statistical research with great caution, only with appropriate theoretical and practical examination. A small sample can be used only if the distribution of the characteristic in the population is normal or close to it, and the average value is calculated from sample data obtained as a result of independent observations. In addition, keep in mind that the accuracy of results from a small sample size is lower than from a large sample size.

Small-sample statistics

It is generally accepted that small-sample statistics, or "small-n" statistics as it is often called, began in the first decade of the 20th century with the publication of the work of W. Gosset, in which he presented the t-distribution postulated by the "Student" who gained world fame a little later. At the time, Gosset was working as a statistician at the Guinness breweries. One of his duties was to analyze successive batches of barrels of freshly brewed porter. For a reason he never really explained, Gosset experimented with the idea of substantially reducing the number of samples taken from the very large number of barrels in the brewery's warehouses for random quality control of the porter. This led him to postulate the t-distribution. Because the Guinness breweries' bylaws prohibited their employees from publishing research results, Gosset published the results of his experiment, which compared quality-control sampling based on the t-distribution for small samples with sampling based on the traditional z-distribution (the normal distribution), anonymously under the pseudonym "Student" (hence the name Student's t-distribution).

t-distribution. The theory of the t-distribution, like that of the z-distribution, is used to test the null hypothesis that two samples are simply random samples from the same population and that, therefore, the calculated statistics (e.g., the mean and standard deviation) are unbiased estimates of the population parameters. However, unlike normal-distribution theory, the theory of the t-distribution for small samples does not require a priori knowledge or precise estimates of the population's expected value and variance. Moreover, whereas testing the difference between the means of two large samples for statistical significance rests on the fundamental assumption that the characteristic is normally distributed in the population, the theory of the t-distribution does not require assumptions about the parameters.

It is well known that normally distributed characteristics are described by one single curve - the Gaussian curve, which satisfies the following equation:
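The equation referred to here did not survive reproduction; the standard Gaussian density, consistent with the text, is:

```latex
y = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\bar{x})^{2}}{2\sigma^{2}}}
```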

With the t-distribution, the whole family of curves is represented by the following formula:
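The formula in question is the standard Student density, written here with n denoting the number of degrees of freedom, as in the discussion that follows:

```latex
f(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)}
\left(1+\frac{t^{2}}{n}\right)^{-\frac{n+1}{2}}
```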

This is why the equation for t includes a gamma function, which in mathematics means that as n changes, a different curve will satisfy the given equation.

Degrees of freedom

In the equation for t, the letter n denotes the number of degrees of freedom (df) associated with the estimate of the population variance (S2), which represents the second moment of any moment-generating function, such as the equation for the t-distribution. In statistics, the number of degrees of freedom indicates how many values remain free to vary after some have been used up in a particular type of analysis. In a t-distribution, one of the deviations from the sample mean is always fixed, since the sum of all such deviations must equal zero. This affects the sum of squares when calculating the sample variance as an unbiased estimate of S2 and leads to df being equal to the number of measurements minus one for each sample. Hence, in the formulas and procedures for calculating t-statistics for testing the null hypothesis, df = n − 2 (one degree of freedom is lost in each of the two samples).
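The loss of one degree of freedom can be checked directly: the deviations from the sample mean always sum to zero, so once n − 1 of them are known, the last one is determined:

```python
# Deviations from the sample mean sum to zero, so only n - 1 are free.
x = [3.0, 5.0, 4.0, 6.0, 2.0]
m = sum(x) / len(x)
deviations = [v - m for v in x]

total = sum(deviations)
print(abs(total) < 1e-9)  # True: the last deviation is fixed by the others
```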

F-distribution. The null hypothesis tested by the t-test is that the two samples were randomly drawn from the same population, or randomly drawn from two different populations with the same variance. But what if more than two groups need to be analyzed? The answer to this question was sought for twenty years after Gosset discovered the t-distribution. Two of the most eminent statisticians of the 20th century were directly involved in obtaining it. One was the great English statistician R. A. Fisher, who proposed the first theoretical formulations whose development led to the F-distribution; his work on small-sample theory, developing Gosset's ideas, was published in the mid-1920s (Fisher, 1925). The other was George Snedecor, one of the galaxy of early American statisticians, who developed a way of comparing two independent samples of any size by calculating the ratio of two variance estimates. He called this ratio the F-ratio, after Fisher. Snedecor's research led to the F-distribution being specified as the distribution of the ratio of two χ² statistics, each with its own degrees of freedom:
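In standard notation, with ν₁ and ν₂ the respective degrees of freedom, this ratio is:

```latex
F = \frac{\chi_{1}^{2}/\nu_{1}}{\chi_{2}^{2}/\nu_{2}}
```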

From this came Fisher's classic work on analysis of variance, a statistical method explicitly focused on the analysis of small samples.

The sampling distribution F (where n = df) is represented by the following equation:
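The equation itself is not reproduced above; the standard form of the F density, with n1 and n2 the degrees of freedom of the numerator and denominator, is:

```latex
f(F) = \frac{\Gamma\!\left(\frac{n_{1}+n_{2}}{2}\right)}{\Gamma\!\left(\frac{n_{1}}{2}\right)\Gamma\!\left(\frac{n_{2}}{2}\right)}
\left(\frac{n_{1}}{n_{2}}\right)^{\frac{n_{1}}{2}}
F^{\frac{n_{1}}{2}-1}
\left(1+\frac{n_{1}F}{n_{2}}\right)^{-\frac{n_{1}+n_{2}}{2}},
\qquad F>0
```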

As with the t-distribution, the gamma function indicates that there is a family of distributions that satisfy the equation for F. In this case, however, the analysis involves two df quantities: the number of degrees of freedom for the numerator and for the denominator of the F-ratio.

Tables for estimating t- and F-statistics. When testing the null hypothesis with statistics based on large-sample theory, usually only one lookup table is required: the table of normal deviates (z), which gives the area under the normal curve between any two z values on the x-axis. The tables for the t- and F-distributions, however, are necessarily presented as sets of tables, since they are based on a multitude of distributions resulting from varying the number of degrees of freedom. Although the t- and F-distributions are probability density distributions, like the normal distribution for large samples, they differ from it in the four characteristics used to describe them. The t-distribution, for example, is symmetric (note the t² in its equation) for all df, but becomes increasingly peaked as the sample size decreases. Peaked curves (those with kurtosis greater than normal) tend to be less asymptotic (i.e., less close to the x-axis at the tails of the distribution) than curves with normal kurtosis, such as the Gaussian curve. This difference produces noticeable discrepancies between the points on the x-axis corresponding to the t and z values. With df = 5 and a two-tailed α level of 0.05, t = 2.57, whereas the corresponding z = 1.96. Therefore t = 2.57 indicates statistical significance at the 5% level; in the case of the normal curve, however, z = 2.57 (more precisely, 2.58) would already indicate a 1% level of statistical significance. Similar comparisons can be made with the F-distribution, since t² equals F when the number of groups is two.
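The discrepancy between t and z critical values can be tabulated directly; the t values below are standard two-tailed table entries at α = 0.05:

```python
# Two-tailed critical values at alpha = 0.05: Student's t for several df
# versus the normal z. All values are standard table entries.
t_crit = {5: 2.571, 10: 2.228, 30: 2.042}
z_crit = 1.96

for df in sorted(t_crit):
    # The discrepancy t - z shrinks as df grows
    print(df, t_crit[df], round(t_crit[df] - z_crit, 3))
```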

What constitutes a “small” sample?

At one time, the question was raised of how large a sample must be in order to be considered small. There is simply no definite answer to this question. However, the conventional boundary between a small and a large sample is taken to be df = 30. The basis for this somewhat arbitrary decision is the result of comparing the t-distribution with the normal distribution. As noted above, the discrepancy between t and z values tends to increase as df decreases and to decrease as df increases. In fact, t begins to approach z closely long before the limiting case where t = z at df = ∞. A simple visual examination of the tabulated values of t shows that this approximation becomes quite close from df = 30 onward. The comparative values of t (at df = 30) and z are, respectively: 2.04 and 1.96 for p = 0.05; 2.75 and 2.58 for p = 0.01; 3.65 and 3.29 for p = 0.001.

Other statistics for “small” samples

Although statistics such as t and F are specifically designed for use with small samples, they are equally applicable to large samples. There are, however, many other statistical methods designed for the analysis of small samples and often used for that purpose. These are the so-called nonparametric, or distribution-free, methods. In the main, they are intended for measurements obtained on scales that do not satisfy the definition of ratio or interval scales, most often ordinal (rank) or nominal measurements. Nonparametric methods require no assumptions about distribution parameters, in particular about estimates of dispersion, because ordinal and nominal scales exclude the very concept of dispersion. For this reason, nonparametric methods are also used for interval- and ratio-scale measurements when small samples are analyzed and the basic assumptions required by parametric methods are likely to be violated. The tests that can reasonably be applied to small samples include: Fisher's exact probability test, Friedman's two-way nonparametric (rank) analysis of variance, Kendall's τ rank correlation coefficient, Kendall's coefficient of concordance (W), the Kruskal-Wallis H-test for nonparametric (rank) one-way analysis of variance, the Mann-Whitney U-test, the median test, the sign test, Spearman's rank correlation coefficient rs, and the Wilcoxon T-test.
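As one example, a minimal sketch of the Mann-Whitney U statistic for two small independent samples. The data are illustrative, and for simplicity ties are not handled (each tie would contribute 0.5 to the count):

```python
# Mann-Whitney U via the direct pair-counting definition (no ties in the data).
def mann_whitney_u(a, b):
    # U for sample a: number of pairs (x, y), x from a, y from b, with x > y
    u_a = sum(1 for x in a for y in b if x > y)
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)  # the smaller U is compared with the critical value

a = [7.1, 8.3, 6.5, 9.0, 7.8]
b = [6.0, 6.9, 7.4, 6.2, 6.7]
u = mann_whitney_u(a, b)
print(u)  # compare with the tabulated critical U for n1 = n2 = 5
```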

When studying variability, quantitative and qualitative traits are distinguished; they are studied by variation statistics, which is based on probability theory. Probability indicates the possible frequency with which an individual with a particular trait value is encountered: P = m/n, where m is the number of individuals with the given trait value and n is the number of all individuals in the group. Probability ranges from 0 to 1 (for example, a probability of 0.02 for the appearance of twins in a herd means two twin births per 100 calvings). Thus, the object of study of biometrics is a varying trait, studied on a certain group of objects, i.e., a population. General and sample populations are distinguished. The general population is the large group of individuals that interests us with respect to the trait under study; it may comprise a species of animal or a breed within a species. A general population (a breed) may include several million animals, and a breed is divided into many groups, i.e., the herds of individual farms. Since the general population consists of a large number of individuals, it is technically difficult to study in full. Therefore, not the entire population is studied but only a part of it, which is called the sample population.

A judgment about the entire population is made on the basis of the sample population. Sampling must be carried out according to all the rules and must include individuals with all values of the varying trait. Individuals are selected from the general population at random, for example, by drawing lots. In biometrics, two types of random sample are distinguished: a large sample includes more than 30 individuals or observations, and a small sample fewer than 30. Different data-processing methods exist for large and small sample populations. One source of statistical information is zootechnical and veterinary records, which provide information about each animal from birth to disposal; another is data from scientific and production experiments conducted on a limited number of animals. Once the sample has been obtained, processing begins. It yields, in the form of mathematical quantities, a number of statistical values or coefficients that characterize the groups of animals of interest.

The following statistical parameters or indicators are obtained using the biometric method:

1. Average values ​​of a varying characteristic (arithmetic mean, mode, median, geometric mean).

2. Coefficients that measure the amount of variation (variability) of the studied trait (standard deviation, coefficient of variation).

3. Coefficients that measure the magnitude of the relationship between characteristics (correlation coefficient, regression coefficient and correlation ratio).

4. Statistical errors and reliability of the obtained statistical data.

5. The share of variation arising under the influence of various factors and other indicators that are associated with the study of genetic and selection problems.

When statistically processing a sample, the members of the population are organized into a variation series. A variation series is a grouping of individuals into classes depending on the value of the trait being studied. The variation series consists of two elements: the classes and the series of frequencies. A variation series can be discontinuous or continuous. Traits that can take only integer values are called discontinuous (discrete): number of head, number of eggs, number of piglets, and others. Traits that can be expressed in fractional numbers are called continuous (height in cm, milk yield in kg, % fat, live weight, and others).

When constructing a variation series, the following principles or rules are adhered to:

1. Determine or count the number of individuals for which the variation series (n) will be constructed.

2. Find the max and min value of the characteristic being studied.

3. Determine the class interval K = (max − min) / number of classes; the number of classes is chosen arbitrarily.

4. Construct the classes and determine the boundaries of each class: min, min + K, min + 2K, and so on.

5. Distribute the members of the population into classes.
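The five steps above can be sketched as follows (the measurements and the choice of five classes are illustrative):

```python
# Building a variation series: classes and frequencies.
values = [48, 52, 55, 49, 61, 57, 53, 50, 59, 54, 56, 51, 60, 52, 58]  # step 1
num_classes = 5                       # chosen arbitrarily
lo, hi = min(values), max(values)     # step 2: min and max of the trait
k = (hi - lo) / num_classes           # step 3: class interval K

# Step 4: class boundaries min, min + K, min + 2K, ...
bounds = [lo + i * k for i in range(num_classes + 1)]

# Step 5: distribute the members of the population into classes
freq = [0] * num_classes
for v in values:
    i = min(int((v - lo) / k), num_classes - 1)  # the max value goes in the last class
    freq[i] += 1

print(bounds)
print(freq)
```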

After constructing the classes and distributing the individuals among them, the main indicators of the variation series (X̄, σ, Cv, Mx, Mσ, MCv) are calculated. The average value of the trait is the most important in characterizing the population. When solving any zootechnical, veterinary, medical, economic, or other problem, the average value of a trait is always determined (average milk yield for the herd, % fat, fertility in pig breeding, egg production in chickens, and other traits). The parameters characterizing the average value of a trait include the following:

1. Arithmetic mean.

2. Weighted arithmetic average.

3. Geometric mean.

4. Mode (Mo).

5. Median (Me) and other parameters.

The arithmetic mean shows what value of the trait the individuals of a given group would have if it were the same for all of them, and is determined by the formula X̄ = A + b × K.

The main property of the arithmetic mean is that it eliminates the variation of a characteristic and makes it common to the entire population. At the same time, it should be noted that the arithmetic mean takes on an abstract meaning, i.e. when calculating it, fractional indicators are obtained, which in reality may not exist. For example: the yield of calves per 100 cows is 85.3 calves, the fertility of sows is 11.8 piglets, the egg production of chickens is 252.4 eggs and other indicators.

The value of the arithmetic mean is very high in livestock farming practice and population characteristics. In the practice of animal husbandry, in particular cattle breeding, a weighted arithmetic value is used to determine the average fat content in milk during lactation.

The geometric mean is calculated when it is necessary to characterize growth rates or the rate of population increase, cases in which the arithmetic mean would distort the data.

The mode is the most frequently encountered value of a varying trait, whether quantitative or qualitative. The modal teat number for a cow is 4, although there are cows with five or six teats. In a variation series, the modal class is the class with the largest number of frequencies, and it is taken as the zero class.

The median is the variant that divides all members of the population into two equal parts: half of the members have a trait value less than the median and the other half a value greater than the median (for example, the breed standard). The median is most often used to characterize qualitative traits, for example, udder shape: cup-shaped, round, goat-like. In a correctly drawn sample, all three indicators (X̄, Mo, Me) should coincide. Thus, the first characteristic of a population is its average values, but they alone are not enough to judge the population.

The second important indicator of any population is the variability of the trait. The variability of a trait is determined by many external (environmental) factors and by internal, i.e., hereditary, factors.

Determining the variability of a trait is of great importance, both in biology and in animal husbandry practice. Thus, using statistical parameters that measure the degree of variability of a trait, it is possible to establish breed differences in the degree of variability of various economically useful traits, to predict the level of selection in different groups of animals, as well as its effectiveness.

The current state of statistical analysis makes it possible not only to establish the degree of manifestation of phenotypic variability, but also to divide phenotypic variability into its component types, namely genotypic and paratypic variability. This decomposition of variability is done using analysis of variance.

The main indicators of variability are the following statistical values:

1. Limits;

2. Standard deviation (σ);

3. Coefficient of variability or variation (Cv).

The simplest way to present the amount of variability of a trait is through the limits. The limits are determined as the difference between the max and min values of the trait: the greater this difference, the greater the variability of the trait. The main parameter for measuring the variability of a trait is the standard deviation (σ), determined by the formula:

σ = ±K · √(∑pa²/n − b²)

The main properties of the standard deviation i.e. (σ) are as follows:

1. Sigma is always a named value and is expressed (in kg, g, meters, cm, pcs.).

2. Sigma is always a positive value.

3. The greater the value of σ, the greater the variability of the trait.

4. In a variation series, practically all variants fall within ±3σ of the mean.

Using the standard deviation, one can determine to which variation series a given individual belongs. The limits and the standard deviation have a drawback: they cannot be used to compare the variability of different traits. Yet one needs to know the variability of various traits in the same animal or the same group of animals, for example, the variability of milk yield, fat content of milk, live weight, and amount of milk fat. Therefore, to compare the variability of different traits and establish the degree of their variability, the coefficient of variability is calculated by the formula Cv = (σ / X̄) · 100%.

Thus, the main methods for assessing the variability of characteristics among members of a population are: limits; standard deviation (σ) and coefficient of variation or variability.
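A sketch showing why Cv, unlike the limits and σ, allows traits measured in different units to be compared (the milk yield and fat content data are illustrative):

```python
import math

# Two traits of the same group of animals, in incomparable units.
milk_yield = [4200, 4500, 3900, 4800, 4100, 4600]  # kg
fat_pct = [3.6, 3.8, 3.5, 3.9, 3.7, 3.6]           # %

def stats(xs):
    m = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    cv = sigma / m * 100  # coefficient of variation, a unit-free percentage
    return max(xs) - min(xs), sigma, cv  # limits, sigma, Cv

lim1, s1, cv1 = stats(milk_yield)
lim2, s2, cv2 = stats(fat_pct)
print(round(cv1, 1), round(cv2, 1))  # the Cv values are directly comparable
```

The limits and σ of the two traits differ by orders of magnitude simply because of the units, whereas the Cv values can be compared directly.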

In animal husbandry practice and experimental research, one often has to deal with small samples. A small sample is one numbering 30 individuals or animals or fewer. Patterns established with a small sample are transferred to the entire population. For a small sample, the same statistical parameters are determined as for a large one (X̄, σ, Cv, Mx), but their formulas and calculations differ from those for a large sample (i.e., from the variation-series formulas).

1. Arithmetic mean value: X̄ = ∑V / n, where:

V - absolute value of the option or characteristic;

n is the number of variants or number of individuals.

2. Standard deviation: σ = ±√(∑α² / (n − 1)), where:

α = x − x̄ is the difference between the value of the variant and the arithmetic mean; this difference is squared to give α². The quantity n − 1 is the number of degrees of freedom, i.e., the number of all variants or individuals reduced by one.
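The two small-sample formulas can be sketched as follows (the five measurements are illustrative):

```python
import math

# Small-sample formulas: X = sum(V) / n and sigma = sqrt(sum(alpha^2) / (n - 1)).
V = [11.2, 10.8, 11.5, 10.9, 11.1]  # n = 5 individuals
n = len(V)
X = sum(V) / n                       # arithmetic mean
alpha = [v - X for v in V]           # deviations alpha = V - X
sigma = math.sqrt(sum(a ** 2 for a in alpha) / (n - 1))  # n - 1 degrees of freedom
print(round(X, 2), round(sigma, 3))
```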

Control questions:

1. What is biometrics?

2. What statistical parameters characterize the population?

3. What indicators characterize variability?

4. What is a small sample?

5. What are mode and median?

Lecture No. 12

Biotechnology and embryo transplantation

1. The concept of biotechnology.

2. Selection of donor and recipient cows, embryo transplantation.

3. The importance of transplantation in animal husbandry.

In the practice of statistical research one often encounters small samples, which have a volume of less than 30 units. Large samples are usually considered to be those of more than 100 units.

Usually small samples are used in cases where it is impossible or impractical to use a large sample. One has to deal with such samples, for example, when surveying tourists and hotel visitors.

The magnitude of the error of a small sample is determined using formulas that differ from those for a relatively large sample.

With a small sample size n, the relationship between the sample and the population variance should be taken into account:

Since in a small sample the correction fraction n/(n − 1) is significant, the variance is calculated taking into account the so-called number of degrees of freedom, understood as the number of variants that can take arbitrary values without changing the value of the mean.

The average error of a small sample is determined by the formula:

The maximum sampling error for the mean and proportion is found similarly to the case of a large sample:

where t is the confidence coefficient, depending on the given level of significance and the number of degrees of freedom (Appendix 5).

The values of the coefficient depend not only on the given confidence probability but also on the sample size n. For individual values of t and n, the confidence probability is determined from the Student distribution, which describes the distribution of the standardized deviations:

Note: as the sample size increases, the Student distribution approaches the normal distribution; at n = 20 it already differs little from it. When conducting small-sample surveys, it should be kept in mind that the smaller the sample size n, the greater the difference between the Student distribution and the normal distribution. For example, at n = 4 this difference is quite significant, which indicates reduced accuracy of small-sample results.