Empirical distribution function: definition and properties

Lecture 13. The concept of statistical estimates of random variables

Let the statistical frequency distribution of a quantitative characteristic X be known. Denote by $n_x$ the number of observations in which the value of the characteristic was less than x, and by n the total number of observations. Obviously, the relative frequency of the event X < x equals $n_x/n$ and is a function of x. Since this function is found empirically (by experiment), it is called empirical.

The empirical distribution function (sampling distribution function) is the function $F^*(x)$ that determines, for each value x, the relative frequency of the event X < x. Thus, by definition, $F^*(x) = n_x/n$, where $n_x$ is the number of variants less than x and n is the sample size.
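The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `make_ecdf` and the sample values are assumptions made for the example. It follows the strict inequality X < x used in this text; many software libraries count X ≤ x instead, which shifts the value only at the observed points.

```python
from bisect import bisect_left


def make_ecdf(sample):
    """Return the function F*(x) = n_x / n, where n_x counts values strictly less than x."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def ecdf(x):
        # bisect_left gives the number of elements of `ordered` that are < x
        return bisect_left(ordered, x) / n

    return ecdf


# Hypothetical sample, used only for illustration
F = make_ecdf([3, 1, 4, 1, 5])
print(F(1), F(2), F(4.5), F(6))
```

Sorting once and using binary search keeps each evaluation logarithmic in the sample size, which matters only when the function is evaluated many times.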

In contrast to the empirical distribution function of a sample, the distribution function F(x) of the population is called the theoretical distribution function. The difference between these functions is that the theoretical function F(x) determines the probability of the event X < x, whereas the empirical function $F^*(x)$ determines the relative frequency of the same event.

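As n increases, the relative frequency of the event X < x, i.e. $F^*(x)$, tends in probability to the probability F(x) of this event. In other words, for any $\varepsilon > 0$,

\[\lim_{n\to\infty} P\left(|F(x) - F^*(x)| < \varepsilon\right) = 1.\]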

Properties of the empirical distribution function:

1) The values of the empirical function belong to the segment $[0, 1]$.

2) $F^*(x)$ is a non-decreasing function.

3) If $x_1$ is the smallest variant, then $F^*(x) = 0$ for $x \le x_1$; if $x_k$ is the largest variant, then $F^*(x) = 1$ for $x > x_k$.

The empirical distribution function of the sample serves to estimate the theoretical distribution function of the population.

Example. Let's construct an empirical function based on the sample distribution:

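Variants $x_i$: 2, 6, 10

Frequencies $n_i$: 12, 18, 30

Let's find the sample size: $n = 12+18+30 = 60$. The smallest variant is 2, so $F^*(x) = 0$ for $x \le 2$. The value $X < 6$, i.e. $x_1 = 2$, was observed 12 times, hence $F^*(x) = 12/60 = 0.2$ for $2 < x \le 6$. Similarly, the values $X < 10$, i.e. $x_1 = 2$ and $x_2 = 6$, were observed $12+18 = 30$ times, so $F^*(x) = 30/60 = 0.5$ for $6 < x \le 10$. Since $x = 10$ is the largest variant, $F^*(x) = 1$ for $x > 10$. Thus, the desired empirical function has the form:

\[F^*(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}\]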
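The arithmetic of this example can be checked with a short sketch that works directly from the frequency table. The function name `ecdf_from_table` is a hypothetical helper introduced here for illustration; the variants 2, 6, 10 and the frequencies 12, 18, 30 are those of the example.

```python
def ecdf_from_table(variants, frequencies):
    """Build F*(x) = n_x / n from a table of variants and their frequencies."""
    pairs = sorted(zip(variants, frequencies))
    n = sum(frequencies)  # sample size

    def ecdf(x):
        # n_x: total frequency of the variants strictly less than x
        n_x = sum(freq for value, freq in pairs if value < x)
        return n_x / n

    return ecdf


F = ecdf_from_table([2, 6, 10], [12, 18, 30])
for x in (2, 4, 6, 8, 10, 11):
    print(x, F(x))
```

The printed values stay constant between consecutive variants and change only after each variant is passed, which is the step structure written out in the example.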

The most important properties of statistical estimates

Suppose we need to study some quantitative characteristic of the general population. Assume that theoretical considerations have established exactly which distribution the characteristic follows, and that the parameters determining this distribution must be estimated. For example, if the characteristic is normally distributed in the population, then the mathematical expectation and the standard deviation must be estimated; if the characteristic has a Poisson distribution, then the parameter $\lambda$ must be estimated.

Typically, only sample data are available, for example the values $x_1, x_2, \dots, x_n$ of a quantitative characteristic obtained from n independent observations. Regarding $x_1, x_2, \dots, x_n$ as independent random variables $X_1, X_2, \dots, X_n$, we can say that finding a statistical estimate of an unknown parameter of a theoretical distribution means finding a function of the observed random variables that gives an approximate value of the estimated parameter. For example, to estimate the mathematical expectation of a normal distribution, the role of this function is played by the arithmetic mean

\[\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}.\]



For statistical estimates to provide good approximations of the estimated parameters, they must satisfy certain requirements, the most important of which are unbiasedness and consistency.

Let $\theta^*$ be a statistical estimate of the unknown parameter $\theta$ of the theoretical distribution. Let the estimate $\theta_1^*$ be found from a sample of size n. Let us repeat the experiment, i.e. draw another sample of the same size from the general population and, based on its data, obtain a different estimate $\theta_2^*$. Repeating the experiment many times, we get different numbers $\theta_1^*, \theta_2^*, \dots, \theta_k^*$. The estimate $\theta^*$ can therefore be regarded as a random variable, and the numbers $\theta_1^*, \theta_2^*, \dots, \theta_k^*$ as its possible values.

If the estimate $\theta^*$ overestimates, i.e. each number $\theta_i^*$ is greater than the true value $\theta$, then the mathematical expectation (mean value) of the random variable $\theta^*$ is also greater than $\theta$: $M(\theta^*) > \theta$. Likewise, if $\theta^*$ underestimates, then $M(\theta^*) < \theta$.

Thus, using a statistical estimate whose mathematical expectation is not equal to the estimated parameter would lead to systematic errors (errors of the same sign). If, on the contrary, $M(\theta^*) = \theta$, then this guards against systematic errors.

An unbiased estimate is a statistical estimate whose mathematical expectation is equal to the estimated parameter for any sample size.

A biased estimate is an estimate that does not satisfy this condition.
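The distinction between unbiased and biased estimates can be illustrated by exact enumeration. The sketch below is built on assumptions chosen for the example: a tiny hypothetical population {1, 2, 3}, samples of size 2 drawn with replacement (all equally likely), and two estimators of the population mean, the sample mean and, purely as a deliberately poor choice, the sample maximum. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_value(estimator, population, n):
    """Exact mean of an estimator over all equally likely samples of size n (with replacement)."""
    samples = list(product(population, repeat=n))
    return sum(Fraction(estimator(s)) for s in samples) / len(samples)


def sample_mean(s):
    return Fraction(sum(s), len(s))


population = [1, 2, 3]  # hypothetical population
theta = Fraction(sum(population), len(population))  # estimated parameter: the mean

mean_of_means = expected_value(sample_mean, population, n=2)
mean_of_maxima = expected_value(max, population, n=2)
print(theta, mean_of_means, mean_of_maxima)  # prints: 2 2 22/9
```

The expectation of the sample mean coincides with the parameter, while the expectation of the sample maximum exceeds it, so the maximum would err systematically in one direction.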

Unbiasedness alone does not guarantee a good approximation of the estimated parameter, since the possible values of the estimate can be widely scattered around their mean, i.e. the variance of the estimate can be significant. In this case, an estimate found from the data of a single sample may turn out to be far from the mean value, and therefore from the parameter being estimated.

An efficient estimate is a statistical estimate that, for a given sample size n, has the smallest possible variance.

For large samples, statistical estimates are also required to be consistent.

A consistent estimate is a statistical estimate that tends in probability to the estimated parameter as $n \to \infty$. For example, if the variance of an unbiased estimate tends to zero as $n \to \infty$, then the estimate is consistent.
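The remark about the variance of an unbiased estimate can be made concrete with another exact enumeration. This is only a sketch under stated assumptions: the population {1, 2, 3} is hypothetical, sampling is with replacement, and the function name is invented for the example. It shows the trend for a few small n and proves nothing about the limit.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance_of_sample_mean(population, n):
    """Exact mean and variance of the sample mean over all samples of size n (with replacement)."""
    means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
    m = sum(means) / len(means)
    var = sum((v - m) ** 2 for v in means) / len(means)
    return m, var


population = [1, 2, 3]  # hypothetical population with mean 2 and variance 2/3
results = {n: mean_and_variance_of_sample_mean(population, n) for n in (1, 2, 4)}
for n, (m, var) in results.items():
    print(n, m, var)
```

For this population the mean of the estimate stays at 2 for every n, while its variance falls as 2/3, 1/3, 1/6, in line with the familiar rule that the variance of the sample mean is the population variance divided by n.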

Suppose we study some quantitative characteristic $\xi$ of the general population, and assume that for any sample size the frequency distribution of this characteristic is known. Fixing the sample size n, denote by $n_x$ the number of variants less than x. Then it is easy to see that the ratio $n_x/n$ expresses the relative frequency of the event $(\xi < x)$.

This ratio depends on a fixed number x and, therefore, is some function of this quantity x. Let us denote it by F*(x).

Definition 1.10. The function $F^*(x) = \frac{n_x}{n}$, expressing the relative frequency of the event $(\xi < x)$, is called the empirical distribution function (sampling distribution function or statistical distribution function).

Thus, by definition,

\[F^*(x) = \frac{n_x}{n}.\]

Recall that the distribution function of the characteristic $\xi$ of the population is defined as the probability of the event $(\xi < x)$:

\[F(x) = P(\xi < x),\]

and, in contrast to the empirical distribution function, is called the theoretical distribution function. Since the empirical distribution function is the relative frequency of the same event, then according to Bernoulli's theorem (see section 5.4), for a large sample size the two differ little from each other in the sense that

\[\lim_{n\to\infty} P\left(|F^*(x) - F(x)| < \varepsilon\right) = 1, \qquad (1.2)\]

where $\varepsilon$ is any arbitrarily small positive number.

Relation (1.2) shows that if the theoretical distribution function is unknown, then the empirical distribution function found from the sample can be used as its sample estimate. From formula (1.2) it simultaneously follows that this estimate is consistent (see Definition 2.4).
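A small simulation can make the convergence statement tangible. The sketch below rests on assumptions chosen for convenience: the population is taken to be uniform on (0, 1), so that F(x) = x, the point x = 0.3 and the seed are arbitrary, and the helper name `ecdf_at` is hypothetical. A single simulated run illustrates the tendency; it is not a proof.

```python
import random


def ecdf_at(sample, x):
    """Relative frequency of the event (value < x) in the sample."""
    return sum(1 for v in sample if v < x) / len(sample)


rng = random.Random(0)  # fixed seed so that the run is reproducible
x = 0.3                 # for Uniform(0, 1) the theoretical value is F(x) = x
errors = {}
for n in (100, 10_000, 100_000):
    sample = [rng.random() for _ in range(n)]
    errors[n] = abs(ecdf_at(sample, x) - x)
    print(n, errors[n])
```

The deviation typically shrinks on the order of $1/\sqrt{n}$, so for the largest sample it is expected to be a few thousandths at most.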

Remark 1.6. The ratio $n_x/n$ can also be interpreted as the share of those sample members that lie to the left of a fixed number x. Denoting this share by $\omega_x$, we have $F^*(x) = \omega_x$.

Now let's look at an example of constructing an empirical distribution function for a discrete sample.

Example 1.2. The distribution of the sample is known (Table 1.7).

Table 1.7

Variant $x_i$: $x_1$, 3, 5, 7, 9

Frequency $n_i$: 6, 9, 18, 12, 5

Construct its empirical distribution function.

First, let's find the sample size: $n = 6 + 9 + 18 + 12 + 5 = 50$.

The variant $x_1$ is the smallest. Therefore $n_x = 0$ and $F^*(x) = 0$ for $x \le x_1$. If $x = 3$, then $n_3 = 6$, i.e. to the left of the point $x = 3$ there are six sample values. Hence $F^*(3) = \frac{6}{50} = 0.12$. To the left of $x = 5$ there are $n_5 = 6 + 9 = 15$ sample values, so $F^*(5) = \frac{15}{50} = 0.3$.

Since $n_7 = 6 + 9 + 18 = 33$, we have $F^*(7) = \frac{33}{50} = 0.66$. Similarly, $n_9 = 33 + 12 = 45$, and therefore $F^*(9) = \frac{45}{50} = 0.9$.

The variant $x_5 = 9$ is the largest. Therefore, for $x > 9$, the entire sample lies to the left of the point x, so $n_x = 50$ and $F^*(x) = \frac{50}{50} = 1$ for $x > 9$.

Thus, from the calculations carried out above, it follows that the desired empirical function is uniquely defined on the entire real axis, is piecewise constant and has the form

\[F^*(x) = \begin{cases} 0, & x \le x_1, \\ 0.12, & x_1 < x \le 3, \\ 0.3, & 3 < x \le 5, \\ 0.66, & 5 < x \le 7, \\ 0.9, & 7 < x \le 9, \\ 1, & x > 9. \end{cases}\]

The graph of this function is a step figure and is shown in Fig. 1.6.

Constructing an empirical function for continuous (interval) samples is, generally speaking, not an unambiguous task. The values of the empirical function can be found uniquely only at the endpoints of the partial intervals into which the main interval containing the sample is divided; at the interior points of the partial intervals the function is not defined. At these points it is completed either by a piecewise constant function (see the previous example) or by some increasing continuous function, for example a linear one, i.e. a linear approximation is used to construct the empirical distribution function.

Example 1.3. Using the data of Table 1.3, find the empirical distribution function of the enterprise's employees by length of service.

For definiteness, we assume that the partial intervals under consideration are closed on the left and open on the right, i.e. they contain only their left endpoints. Let $x = 2$. Then $n_2 = 0$ and $F^*(2) = 0$. If $x \in (2; 6)$, then at this point the value $n_x$ is not defined, and with it the value of the empirical function. For example, if $x = 3$, then the conditions of the problem do not allow us to determine the number of workers with less than three years of service, i.e. we cannot find the frequency $n_x$ and therefore $F^*(x)$.

Reasoning in a similar way, we find that the required function $F^*(x)$ takes specific values at the left endpoints of the partial intervals: $F^*(6) = 4/100 = 0.04$; $F^*(10) = 0.12$; $F^*(14) = 0.24$; $F^*(18) = 0.59$; $F^*(22) = 0.78$; $F^*(26) = 0.90$; $F^*(30) = 1$, but it is not defined at the interior points of the partial intervals. To complete the solution, the desired function is extended to the interior points of the partial intervals either by a piecewise constant function (Fig. 1.7) or by some continuous increasing function (Fig. 1.8, where the desired empirical function is completed by a linear function).
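The two ways of completing the function between interval endpoints can be sketched as follows, using the endpoint values quoted in Example 1.3. The function names are hypothetical, and the step version rests on an assumption: it holds the value taken at the left endpoint of each partial interval, which may differ from the convention used in the source figure.

```python
from bisect import bisect_right

# Endpoint values of F*(x) quoted in Example 1.3
ENDPOINTS = [2, 6, 10, 14, 18, 22, 26, 30]
VALUES = [0.0, 0.04, 0.12, 0.24, 0.59, 0.78, 0.90, 1.0]


def _left_index(x):
    """Index of the last endpoint that does not exceed x."""
    return bisect_right(ENDPOINTS, x) - 1


def ecdf_step(x):
    """Piecewise-constant completion (assumption: keep the left endpoint's value)."""
    if x <= ENDPOINTS[0]:
        return 0.0
    if x >= ENDPOINTS[-1]:
        return 1.0
    return VALUES[_left_index(x)]


def ecdf_linear(x):
    """Linear interpolation between the two neighbouring endpoints."""
    if x <= ENDPOINTS[0]:
        return 0.0
    if x >= ENDPOINTS[-1]:
        return 1.0
    i = _left_index(x)
    x0, x1 = ENDPOINTS[i], ENDPOINTS[i + 1]
    y0, y1 = VALUES[i], VALUES[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


for x in (4, 6, 16, 28):
    print(x, ecdf_step(x), ecdf_linear(x))
```

Both versions agree at the endpoints, where the value is determined by the data, and differ only inside the intervals, where the choice is a modelling convention.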

Determination of the empirical distribution function

Let $X$ be a random variable and $F(x)$ its distribution function. Suppose we carry out $n$ independent experiments on this random variable under identical conditions. We then obtain a sequence of values $x_1, x_2, \dots, x_n$, which is called a sample.

Definition 1

Each value $x_i$ ($i = 1, 2, \dots, n$) is called a variant.

One estimate of the theoretical distribution function is the empirical distribution function.

Definition 3

An empirical distribution function $F_n(x)$ is a function that determines for each value $x$ the relative frequency of the event $X < x$:

\[F_n(x) = \frac{n_x}{n},\]

where $n_x$ is the number of variants less than $x$ and $n$ is the sample size.

The difference between the empirical function and the theoretical one is that the theoretical function determines the probability of the event $X < x$, whereas the empirical function determines the relative frequency of the same event.

Properties of the empirical distribution function

Let us now consider several basic properties of the empirical distribution function.

    The range of the function $F_n\left(x\right)$ lies in the segment $[0, 1]$.

    $F_n\left(x\right)$ is a non-decreasing function.

    $F_n\left(x\right)$ is a left-continuous function.

    $F_n\left(x\right)$ is a piecewise constant function and increases only at the observed values of the random variable $X$.

    Let $X_1$ be the smallest and $X_n$ the largest variant. Then $F_n\left(x\right)=0$ for $x\le X_1$ and $F_n\left(x\right)=1$ for $x > X_n$.
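Some of the listed properties can be spot-checked numerically. The sketch below uses a hypothetical sample and a small evaluation grid, both chosen only for illustration; a check with a small offset `eps` is a numerical probe of left-continuity, not a proof.

```python
def F_n(sample, x):
    """Relative frequency of the event X < x in the sample."""
    return sum(1 for v in sample if v < x) / len(sample)


sample = [1, 2, 2, 5]  # hypothetical sample
grid = [0, 0.5, 1, 1.5, 2, 3, 5, 6]
values = [F_n(sample, x) for x in grid]

# Values stay within [0, 1] and never decrease along the grid
in_range = all(0 <= v <= 1 for v in values)
non_decreasing = all(a <= b for a, b in zip(values, values[1:]))

# At each observed value the function equals its value just to the left,
# and the jump shows up only just to the right of that value
eps = 1e-9
left_continuous = all(F_n(sample, v - eps) == F_n(sample, v) for v in set(sample))
jumps_after = all(F_n(sample, v + eps) > F_n(sample, v) for v in set(sample))

print(values)
print(in_range, non_decreasing, left_continuous, jumps_after)
```

With the strict inequality in the definition, the jump at an observed value belongs to the points immediately to its right, which is why the function is left-continuous rather than right-continuous.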

Let us introduce a theorem that connects the theoretical and empirical functions.

Theorem 1

Let $F_n\left(x\right)$ be the empirical distribution function, and $F\left(x\right)$ be the theoretical distribution function of the general population. Then, for every $x$, the following holds (with convergence understood in probability):

\[\lim_{n\to \infty} \left|F_n\left(x\right)-F\left(x\right)\right| = 0\]

Examples of problems on finding the empirical distribution function

Example 1

Let the sampling distribution have the following data recorded using a table:

Picture 1.

Find the sample size, create an empirical distribution function and plot it.

Sample size: $n=5+10+15+20=50$.

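By property 5, $F_n\left(x\right)=0$ for $x\le 1$ and $F_n\left(x\right)=1$ for $x>4$.

On the three intermediate intervals between consecutive variants, $F_n\left(x\right)$ takes in turn the accumulated relative frequencies $\frac{5}{50}=0.1$, $\frac{5+10}{50}=0.3$ and $\frac{5+10+15}{50}=0.6$.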

Thus we get:

Figure 2.

Figure 3.

Example 2

20 cities were randomly selected from the cities of the central part of Russia, for which the following data on public transport fares were obtained: 14, 15, 12, 12, 13, 15, 15, 13, 15, 12, 15, 14, 15, 13, 13, 12, 12, 15, 14, 14.

Create an empirical distribution function for this sample and plot it.

Let's write down the sample values in ascending order and calculate the frequency of each value. We get the following table:

Value $x_i$: 12, 13, 14, 15

Frequency $n_i$: 5, 4, 4, 7

Sample size: $n=20$.

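By property 5, $F_n\left(x\right)=0$ for $x\le 12$ and $F_n\left(x\right)=1$ for $x>15$.

For $12 < x \le 13$: $F_n\left(x\right)=\frac{5}{20}=0.25$.

For $13 < x \le 14$: $F_n\left(x\right)=\frac{5+4}{20}=0.45$.

For $14 < x \le 15$: $F_n\left(x\right)=\frac{5+4+4}{20}=0.65$.

Thus we get:

\[F_n\left(x\right)=\begin{cases} 0, & x \le 12, \\ 0.25, & 12 < x \le 13, \\ 0.45, & 13 < x \le 14, \\ 0.65, & 14 < x \le 15, \\ 1, & x > 15. \end{cases}\]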

Let's plot the empirical distribution:

Figure 6.
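The frequency table and the values of the empirical function for the fare data of Example 2 can be reproduced with a short sketch. The variable and function names are hypothetical; the data are the 20 fares listed in the example.

```python
from collections import Counter

fares = [14, 15, 12, 12, 13, 15, 15, 13, 15, 12,
         15, 14, 15, 13, 13, 12, 12, 15, 14, 14]

# Frequency table: (value, frequency) pairs in ascending order of value
table = sorted(Counter(fares).items())
n = len(fares)


def F_n(x):
    """Relative frequency of the event X < x for the fare sample."""
    return sum(count for value, count in table if value < x) / n


print(table)
for x in (12, 13, 14, 15, 16):
    print(x, F_n(x))
```

Evaluating the function at each variant gives the level held on the interval that ends at that variant, which makes it easy to write down the piecewise formula and the coordinates of the steps of the graph.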


As is known, the distribution law of a random variable can be specified in various ways. A discrete random variable can be specified using a distribution series or an integral (cumulative distribution) function, and a continuous random variable using either an integral or a differential (density) function. Let us consider the sample analogues of these two functions.

Let there be a sample of size $n$ of values of some random variable $X$, with each variant in the sample associated with its frequency. Let $x$ be some real number and $n_x$ the number of sample values of $X$ that are smaller than $x$. Then the number $n_x/n$ is the relative frequency of the sample values of $X$ smaller than $x$, i.e. the relative frequency of the event $X < x$. When $x$ changes, the value $n_x/n$ will in general also change. This means that the relative frequency $n_x/n$ is a function of the argument $x$. Since this function is found from sample data obtained as a result of experiments, it is called sample or empirical.

Definition 10.15. The empirical distribution function (sampling distribution function) is the function $F^*(x)$ defining for each value $x$ the relative frequency of the event $X < x$:

\[F^*(x) = \frac{n_x}{n}. \qquad (10.19)\]

In contrast to the empirical distribution function of the sample, the distribution function $F(x)$ of the general population is called the theoretical distribution function. The difference between them is that the theoretical function $F(x)$ determines the probability of the event $X < x$, and the empirical one determines the relative frequency of the same event. From Bernoulli's theorem it follows that

\[\lim_{n\to\infty} P\left(|F(x) - F^*(x)| < \varepsilon\right) = 1, \qquad (10.20)\]

i.e. for large $n$ the probability $F(x)$ of the event $X < x$ and its relative frequency $F^*(x)$ differ little from one another. From this it follows that it is advisable to use the empirical distribution function of the sample to approximate the theoretical (integral) distribution function of the general population.

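The functions $F^*(x)$ and $F(x)$ have the same properties. This follows from the definition of the function $F^*(x)$.

Properties of $F^*(x)$:

1) the values of $F^*(x)$ belong to the segment $[0, 1]$;

2) $F^*(x)$ is a non-decreasing function;

3) if $x_1$ is the smallest variant, then $F^*(x) = 0$ for $x \le x_1$; if $x_k$ is the largest variant, then $F^*(x) = 1$ for $x > x_k$.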

Example 10.4. Construct an empirical function based on the given sample distribution:

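Variants $x_i$: 2, 6, 10

Frequencies $n_i$: 12, 18, 30

Solution: Let us find the sample size: $n = 12+18+30 = 60$. The smallest variant is $x_1 = 2$, hence $F^*(x) = 0$ for $x \le 2$. The value $X < 6$, namely $x_1 = 2$, was observed 12 times, therefore $F^*(x) = \frac{12}{60} = 0.2$ for $2 < x \le 6$.

The values $X < 10$, namely $x_1 = 2$ and $x_2 = 6$, were observed $12+18=30$ times, therefore $F^*(x) = \frac{30}{60} = 0.5$ for $6 < x \le 10$. For $x > 10$, $F^*(x) = 1$.

The required empirical distribution function:

\[F^*(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}\]

The graph of $F^*(x)$ is shown in Fig. 10.2.

Fig. 10.2. Graph of the empirical distribution function $F^*(x)$.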

Control questions

1. What main problems does mathematical statistics solve? 2. General and sample population? 3. Define sample size. 4. What samples are called representative? 5. Errors of representativeness. 6. Basic methods of sampling. 7. Concepts of frequency, relative frequency. 8. The concept of statistical series. 9. Write down the Sturges formula. 10. Formulate the concepts of sample range, median and mode. 11. Frequency polygon, histogram. 12. The concept of a point estimate of a sample population. 13. Biased and unbiased point estimate. 14. Formulate the concept of a sample average. 15. Formulate the concept of sample variance. 16. Formulate the concept of sample standard deviation. 17. Formulate the concept of sample coefficient of variation. 18. Formulate the concept of sample geometric mean.
