Empirical distribution function: properties and example

Lecture 13. The concept of statistical estimates of random variables

Let the statistical frequency distribution of a quantitative characteristic $X$ be known. Let $n_x$ denote the number of observations in which the value of the characteristic was less than $x$, and $n$ the total number of observations (the sample size). The relative frequency of the event $X < x$ is $n_x/n$, and it is a function of $x$. Since this function is found empirically (by experiment), it is called empirical.

The empirical distribution function (sample distribution function) is the function $F_n(x)$ that gives, for each value $x$, the relative frequency of the event $X < x$. Thus, by definition, $F_n(x) = n_x/n$, where $n_x$ is the number of variants less than $x$ and $n$ is the sample size.

In contrast to the empirical distribution function of the sample, the distribution function $F(x)$ of the population is called the theoretical distribution function. The difference between these functions is that the theoretical function gives the probability of the event $X < x$, whereas the empirical function gives the relative frequency of the same event.

As $n$ increases, the relative frequency of the event $X < x$, i.e. $F_n(x)$, tends in probability to the probability $F(x)$ of this event. In other words, for any $\varepsilon > 0$,

\[\lim_{n\to\infty} P\left(|F_n(x) - F(x)| < \varepsilon\right) = 1.\]

Properties of the empirical distribution function:

1) The values of the empirical function belong to the segment $[0, 1]$.

2) $F_n(x)$ is a non-decreasing function.

3) If $x_1$ is the smallest variant, then $F_n(x) = 0$ for $x \le x_1$; if $x_k$ is the largest variant, then $F_n(x) = 1$ for $x > x_k$.

The empirical distribution function of the sample serves to estimate the theoretical distribution function of the population.

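Example. Let us construct the empirical function for the following sample distribution:

Variants $x_i$:     2    6    10
Frequencies $n_i$:  12   18   30

Find the sample size: $n = 12 + 18 + 30 = 60$. The smallest variant is 2, so $F_n(x) = 0$ for $x \le 2$. The value $X < 6$, i.e. $x_1 = 2$, was observed 12 times, so $F_n(x) = 12/60 = 0.2$ for $2 < x \le 6$. Similarly, the values $X < 10$, i.e. $x_1 = 2$ and $x_2 = 6$, were observed $12 + 18 = 30$ times, so $F_n(x) = 30/60 = 0.5$ for $6 < x \le 10$. Since $x = 10$ is the largest variant, $F_n(x) = 1$ for $x > 10$. Thus, the desired empirical function has the form:

\[F_n(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}\]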
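The construction in this example can be sketched in Python. This is only an illustration: the helper name `make_ecdf` is hypothetical (it does not come from the lecture), and it assumes the strict-inequality definition used here, i.e. the relative frequency of the event X < x.

```python
def make_ecdf(variants, frequencies):
    """Build F_n(x) = n_x / n from a grouped sample (variants with frequencies)."""
    pairs = sorted(zip(variants, frequencies))
    n = sum(frequencies)  # sample size

    def ecdf(x):
        # n_x: total frequency of the variants strictly less than x
        n_x = sum(f for v, f in pairs if v < x)
        return n_x / n

    return ecdf


# Sample from the example: variants 2, 6, 10 with frequencies 12, 18, 30
F = make_ecdf([2, 6, 10], [12, 18, 30])
for x in (2, 4, 6, 8, 10, 11):
    print(x, F(x))
```

Because the inequality is strict, the function takes its new value only to the right of each variant: `F(6)` is still 0.2, and the value 0.5 is reached for x slightly greater than 6.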

The most important properties of statistical estimates

Suppose some quantitative characteristic of the population is to be studied. Assume that theoretical considerations have established which distribution the characteristic follows, so that it remains to estimate the parameters that determine this distribution. For example, if the characteristic is normally distributed in the population, the mathematical expectation and the standard deviation must be estimated; if it has a Poisson distribution, the parameter $\lambda$ must be estimated.

Typically, only sample data are available, for example the values $x_1, x_2, \ldots, x_n$ of the quantitative characteristic obtained from $n$ independent observations. Treating $x_1, x_2, \ldots, x_n$ as independent random variables $X_1, X_2, \ldots, X_n$, we can say that finding a statistical estimate of an unknown parameter of a theoretical distribution means finding a function of the observed random variables that gives an approximate value of the estimated parameter. For example, to estimate the mathematical expectation of a normal distribution, this function is the arithmetic mean

\[\bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n}.\]

For statistical estimates to give good approximations of the estimated parameters, they must satisfy certain requirements, the most important of which are unbiasedness and consistency.

Let $\theta^*$ be a statistical estimate of the unknown parameter $\theta$ of the theoretical distribution. Suppose the estimate $\theta^*_1$ is found from a sample of size $n$. Let us repeat the experiment, i.e. draw another sample of the same size from the population and obtain from its data a different estimate $\theta^*_2$. Repeating the experiment many times, we get different numbers $\theta^*_1, \theta^*_2, \ldots, \theta^*_k$. The estimate $\theta^*$ can therefore be regarded as a random variable, and the numbers $\theta^*_1, \theta^*_2, \ldots, \theta^*_k$ as its possible values.

Suppose the estimate $\theta^*$ gives an approximate value with an excess, i.e. each number $\theta^*_i$ is greater than the true value $\theta$. Then the mathematical expectation (mean value) of the random variable $\theta^*$ is also greater than $\theta$: $M(\theta^*) > \theta$. Likewise, if $\theta^*$ gives an estimate with a deficit, then $M(\theta^*) < \theta$.

Thus, using a statistical estimate whose mathematical expectation is not equal to the estimated parameter would lead to systematic errors (errors of the same sign). If, on the contrary, $M(\theta^*) = \theta$, this guards against systematic errors.

A statistical estimate is called unbiased if its mathematical expectation equals the estimated parameter for any sample size.

An estimate that does not satisfy this condition is called biased.

Unbiasedness alone does not guarantee a good approximation of the estimated parameter, since the possible values of $\theta^*$ may be widely scattered around their mean, i.e. the variance $D(\theta^*)$ may be large. In that case the estimate found from one sample, for example $\theta^*_1$, may lie far from the mean value of $\theta^*$, and therefore from the estimated parameter $\theta$ itself.

A statistical estimate is called efficient if, for a given sample size $n$, it has the smallest possible variance.

For large samples, statistical estimates are also required to be consistent.

A statistical estimate is called consistent if it tends in probability to the estimated parameter as $n \to \infty$. For example, if the variance of an unbiased estimate tends to zero as $n \to \infty$, then the estimate is also consistent.
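One way to see the difference between a biased and an unbiased estimate is an exact enumeration over a tiny population. The sketch below is an illustration under stated assumptions: the population (the six faces of a fair die), the sample size 3 with replacement, and all function names are hypothetical choices, not part of the lecture. It averages each estimator over all 216 equally likely samples, which gives its mathematical expectation exactly.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: faces of a fair die
n = 3                            # sample size (sampling with replacement)


def mean(xs):
    return Fraction(sum(xs), len(xs))


def var_biased(xs):
    # Sum of squared deviations divided by n
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def var_unbiased(xs):
    # Sum of squared deviations divided by n - 1
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


samples = list(product(population, repeat=n))  # all 6**3 equally likely samples


def expectation(estimator):
    # Exact mathematical expectation of the estimator over all samples
    return sum(estimator(s) for s in samples) / len(samples)


true_mean = mean(population)       # 7/2
true_var = var_biased(population)  # population variance, 35/12

print(expectation(mean), true_mean)         # 7/2 7/2
print(expectation(var_unbiased), true_var)  # 35/12 35/12
print(expectation(var_biased), true_var)    # 35/18 35/12
```

The sample mean and the variance estimate with divisor n - 1 reproduce the true parameters on average, while the estimate with divisor n is systematically too small by the factor (n - 1)/n.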

As is known, the distribution law of a random variable can be specified in various ways. A discrete random variable can be specified by a distribution series or by an integral (cumulative distribution) function, and a continuous random variable by either an integral or a differential (density) function. Let us consider the sample analogues of these two functions.

Let there be a sample of size $n$ of values of some random variable $X$, with each variant in the sample associated with its frequency. Let $x$ be some real number and $n_x$ the number of sample values of the random variable $X$ that are less than $x$. Then the number $n_x/n$ is the relative frequency of the values of $X$ observed in the sample that are less than $x$, i.e. the relative frequency of the event $X < x$. When $x$ changes, the value $n_x/n$ in general changes as well. This means that the relative frequency $n_x/n$ is a function of the argument $x$. Since this function is found from sample data obtained by experiment, it is called the sample or empirical function.

Definition 10.15. The empirical distribution function (sample distribution function) is the function $F_n(x)$ that gives, for each value $x$, the relative frequency of the event $X < x$:

\[F_n(x) = \frac{n_x}{n}. \qquad (10.19)\]

In contrast to the empirical distribution function of the sample, the distribution function $F(x)$ of the population is called the theoretical distribution function. The difference between them is that the theoretical function $F(x)$ gives the probability of the event $X < x$, and the empirical function gives the relative frequency of the same event. From Bernoulli's theorem it follows that, for any $\varepsilon > 0$,

\[\lim_{n\to\infty} P\left(|F(x) - F_n(x)| < \varepsilon\right) = 1, \qquad (10.20)\]

i.e. for large $n$ the probability $F(x)$ of the event $X < x$ and the relative frequency $F_n(x)$ of this event differ little from one another. It follows that the empirical distribution function of the sample can reasonably be used to approximate the theoretical (integral) distribution function of the population.

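The functions $F_n(x)$ and $F(x)$ have the same properties. This follows from the definition of the function $F_n(x)$.

Properties of $F_n(x)$:

1) The values of $F_n(x)$ belong to the segment $[0, 1]$.

2) $F_n(x)$ is a non-decreasing function.

3) If $x_1$ is the smallest variant, then $F_n(x) = 0$ for $x \le x_1$; if $x_k$ is the largest variant, then $F_n(x) = 1$ for $x > x_k$.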


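Example 10.4. Construct the empirical function for the given sample distribution:

Variants $x_i$:     2    6    10
Frequencies $n_i$:  12   18   30

Solution: Find the sample size: $n = 12 + 18 + 30 = 60$. The smallest variant is $x_1 = 2$, hence $F_n(x) = 0$ for $x \le 2$. The value $X < 6$, namely $x_1 = 2$, was observed 12 times, therefore $F_n(x) = 12/60 = 0.2$ for $2 < x \le 6$.

The values $X < 10$, namely $x_1 = 2$ and $x_2 = 6$, were observed $12 + 18 = 30$ times, therefore $F_n(x) = 30/60 = 0.5$ for $6 < x \le 10$. For $x > 10$, $F_n(x) = 1$.

The required empirical distribution function:

\[F_n(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}\]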

The graph of $F_n(x)$ is shown in Fig. 10.2.

Fig. 10.2

Control questions

1. What main problems does mathematical statistics solve?
2. What are the general population and the sample?
3. Define sample size.
4. Which samples are called representative?
5. Errors of representativeness.
6. Basic methods of sampling.
7. The concepts of frequency and relative frequency.
8. The concept of a statistical series.
9. Write down the Sturges formula.
10. Formulate the concepts of sample range, median and mode.
11. Frequency polygon, histogram.
12. The concept of a point estimate from a sample.
13. Biased and unbiased point estimates.
14. Formulate the concept of the sample mean.
15. Formulate the concept of sample variance.
16. Formulate the concept of sample standard deviation.
17. Formulate the concept of the sample coefficient of variation.
18. Formulate the concept of the sample geometric mean.

Variation series. Polygon and histogram.

A distribution series is an ordered distribution of the units of the population under study into groups according to a certain varying characteristic.

Depending on the characteristic underlying the series, attributive and variation distribution series are distinguished:

• Distribution series constructed in ascending or descending order of the values of a quantitative characteristic are called variation series.

A variation series consists of two columns:

The first column gives the quantitative values of the varying characteristic, which are called variants. A discrete variant is expressed as an integer. An interval variant is given as a range "from ... to ...". Depending on the type of variants, a discrete or an interval variation series can be constructed.
The second column gives the number of occurrences of each variant, expressed as frequencies or relative frequencies:

Frequencies are absolute numbers showing how many times a given value of the characteristic occurs in the population. The sum of all frequencies must equal the number of units in the entire population.

Relative frequencies are frequencies expressed as a share of the total. The sum of all relative frequencies must equal 100% if they are expressed as percentages, or 1 if they are expressed as fractions of one.
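As an illustration, a discrete variation series can be built from raw observations in a few lines of Python. The data below are hypothetical and only serve to show the two columns (variants and frequencies) together with the relative frequencies.

```python
from collections import Counter
from fractions import Fraction

observations = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]  # hypothetical raw data
n = len(observations)

# Variants in ascending order with their frequencies
frequencies = dict(sorted(Counter(observations).items()))

# Relative frequencies as exact fractions of the total
relative = {v: Fraction(f, n) for v, f in frequencies.items()}

for v in frequencies:
    print(v, frequencies[v], f"{float(relative[v]):.0%}")

print("sum of frequencies:", sum(frequencies.values()))
print("sum of relative frequencies:", sum(relative.values()))
```

The two checks at the end correspond to the rules stated above: the frequencies add up to the number of units, and the relative frequencies add up to one.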

Graphic representation of distribution series

Distribution series are presented visually by means of graphs.

Distribution series are depicted as:

• Polygons

• Histograms

• Cumulative curves (cumulates)

Polygon

When constructing a polygon, the values of the varying characteristic are plotted on the horizontal axis (abscissa), and the frequencies or relative frequencies are plotted on the vertical axis (ordinate).

The polygon in Fig. 6.1 is based on data from the 1994 micro-census of the population of Russia.

Histogram

To construct a histogram, the interval boundaries are marked on the abscissa axis, and rectangles are built on these intervals with heights proportional to the frequencies (or relative frequencies).

Fig. 6.2 shows a histogram of the distribution of the Russian population by age group in 1997.

Fig. 1. Distribution of the Russian population by age groups
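The grouping step behind a histogram can be sketched as follows. The ages and interval boundaries are hypothetical (they are not the census data behind the figures), and the helper `interval_series` is an illustrative name; intervals are taken as half-open, so a value equal to a boundary falls into the interval that starts there.

```python
def interval_series(values, edges):
    """Count the values falling into each interval [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


ages = [3, 17, 25, 31, 38, 44, 52, 59, 63, 71]  # hypothetical observations
edges = [0, 20, 40, 60, 80]                     # hypothetical interval boundaries

counts = interval_series(ages, edges)

# Text histogram: the bar length is proportional to the frequency
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:>2}-{right:<2} {'#' * c} ({c})")
```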

Empirical distribution function, properties.

Let the statistical frequency distribution of a quantitative characteristic $X$ be known. Let $n_x$ denote the number of observations in which the value of the characteristic was less than $x$, and $n$ the total number of observations. The relative frequency of the event $X < x$ is $n_x/n$.

The empirical distribution function (sample distribution function) is the function $F_n(x) = n_x/n$ that gives, for each value $x$, the relative frequency of the event $X < x$.

In contrast to the empirical distribution function of the sample, the distribution function $F(x)$ of the population is called the theoretical distribution function. The difference between these functions is that the theoretical function gives the probability of the event $X < x$, whereas the empirical function gives its relative frequency.

As $n$ increases, the relative frequency of the event $X < x$ tends in probability to the probability of this event.

Basic properties

Let an elementary outcome $\omega \in \Omega$ be fixed. Then $F_n(x)$ is the distribution function of the discrete distribution given by the following probability function:

\[p(x) = \begin{cases} \dfrac{n_i}{n}, & x = x_i, \\ 0, & \text{otherwise}, \end{cases}\]

where $x_i = X_i(\omega)$, $i = 1, \ldots, n$, and $n_i$ is the number of sample elements equal to $x_i$. In particular, if all elements of the sample are different, then $p(x_i) = 1/n$.

The mathematical expectation of this distribution is

\[\frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.\]

Thus, the sample mean is the theoretical mean of the sample distribution.

Similarly, the sample variance is the theoretical variance of the sample distribution.

For fixed $x$, the random variable $n F_n(x)$ has a binomial distribution:

\[n F_n(x) \sim \mathrm{Bin}(n, F(x)).\]

The sample distribution function is an unbiased estimate of the distribution function:

\[M[F_n(x)] = F(x).\]

The variance of the sample distribution function has the form

\[D[F_n(x)] = \frac{F(x)\,(1 - F(x))}{n}.\]

According to the strong law of large numbers, the sample distribution function converges almost surely to the theoretical distribution function:

\[F_n(x) \to F(x) \text{ almost surely as } n \to \infty.\]

The sample distribution function is an asymptotically normal estimate of the theoretical distribution function. If $0 < F(x) < 1$, then

\[\sqrt{n}\,\bigl(F_n(x) - F(x)\bigr) \to N\bigl(0,\ F(x)(1 - F(x))\bigr)\]

in distribution as $n \to \infty$.
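The statements about the expectation and the variance of the sample distribution function can be checked exactly for small cases. The sketch below assumes that, for a fixed x, the count n_x of sample values below x is binomial with success probability p = F(x); the function name and the values n = 20, p = 3/10 are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb


def ecdf_value_moments(n, p):
    """Exact mean and variance of F_n(x) = n_x / n when n_x ~ Bin(n, p), p = F(x)."""
    # Binomial probabilities P(n_x = k) for k = 0..n, kept as exact fractions
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(Fraction(k, n) * w for k, w in enumerate(pmf))
    var = sum((Fraction(k, n) - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var


n = 20
p = Fraction(3, 10)  # assumed value of F(x) at the chosen point x

mean, var = ecdf_value_moments(n, p)
print(mean == p)                # expectation equals F(x)
print(var == p * (1 - p) / n)   # variance equals F(x)(1 - F(x)) / n
```

Since the variance is F(x)(1 - F(x))/n, it tends to zero as n grows, which is the reason the estimate is also consistent.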

Definition of the empirical distribution function

Let $X$ be a random variable and $F(x)$ its distribution function. Suppose $n$ independent experiments are carried out on this random variable under the same conditions. We then obtain a sequence of values $x_1, x_2, \ldots, x_n$, which is called a sample.

Definition 1

Each value $x_i$ ($i=1,2,\ldots,n$) is called a variant.

One estimate of the theoretical distribution function is the empirical distribution function.

Definition 3

The empirical distribution function $F_n(x)$ is the function that gives, for each value $x$, the relative frequency of the event $X < x$:

\[F_n(x) = \frac{n_x}{n},\]

where $n_x$ is the number of variants less than $x$ and $n$ is the sample size.

The difference between the empirical function and the theoretical one is that the theoretical function gives the probability of the event $X < x$, while the empirical function gives the relative frequency of the same event.

Properties of the empirical distribution function

Let us now consider several basic properties of the empirical distribution function.

    1. The values of the function $F_n\left(x\right)$ lie in the segment $[0, 1]$.

    2. $F_n\left(x\right)$ is a non-decreasing function.

    3. $F_n\left(x\right)$ is a left-continuous function.

    4. $F_n\left(x\right)$ is a piecewise constant function and increases only at the points that are sample values of the random variable $X$.

    5. Let $X_1$ be the smallest and $X_n$ the largest variant. Then $F_n\left(x\right)=0$ for $x\le X_1$ and $F_n\left(x\right)=1$ for $x> X_n$.

Let us introduce a theorem that connects the theoretical and empirical functions.

Theorem 1

Let $F_n\left(x\right)$ be the empirical distribution function and $F\left(x\right)$ the theoretical distribution function of the population. Then, for every $x$, the following equality holds with probability 1:

\[\lim_{n\to \infty} \left|F_n\left(x\right)-F\left(x\right)\right| = 0\]

Examples of problems on finding the empirical distribution function

Example 1

Let the sample distribution be given by the following table:

Picture 1.

Find the sample size, construct the empirical distribution function and plot it.

Sample size: $n=5+10+15+20=50$.

By property 5, $F_n\left(x\right)=0$ for $x\le 1$ and $F_n\left(x\right)=1$ for $x>4$.

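Let $x_1 = 1 < x_2 < x_3 < x_4 = 4$ be the variants from the table, with frequencies 5, 10, 15 and 20 respectively.

Values $X < x_2$ were observed 5 times, so $F_n\left(x\right)=5/50=0.1$ for $x_1 < x\le x_2$.

Values $X < x_3$ were observed $5+10=15$ times, so $F_n\left(x\right)=15/50=0.3$ for $x_2 < x\le x_3$.

Values $X < x_4$ were observed $5+10+15=30$ times, so $F_n\left(x\right)=30/50=0.6$ for $x_3 < x\le x_4$.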

Thus we get:

Figure 2.

Figure 3.

Example 2

20 cities were randomly selected from the cities of the central part of Russia, and the following data on public transport fares were obtained for them: 14, 15, 12, 12, 13, 15, 15, 13, 15, 12, 15, 14, 15, 13, 13, 12, 12, 15, 14, 14.

Create an empirical distribution function for this sample and plot it.

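Let us write the sample values in ascending order and count the frequency of each value. We get the following table:

Value:      12   13   14   15
Frequency:  5    4    4    7

Figure 4.

Sample size: $n=20$.

By property 5, $F_n\left(x\right)=0$ for $x\le 12$ and $F_n\left(x\right)=1$ for $x>15$.

Values $X<13$ (i.e. 12) were observed 5 times, so $F_n\left(x\right)=5/20=0.25$ for $12<x\le 13$.

Values $X<14$ (i.e. 12 and 13) were observed $5+4=9$ times, so $F_n\left(x\right)=9/20=0.45$ for $13<x\le 14$.

Values $X<15$ (i.e. 12, 13 and 14) were observed $5+4+4=13$ times, so $F_n\left(x\right)=13/20=0.65$ for $14<x\le 15$.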

Thus we get:

Figure 5.

Let's plot the empirical distribution:

Figure 6.
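The calculation in Example 2 can be reproduced from the raw fares. The sketch below is one possible way to do it, not the method of the original text: it sorts the sample and uses `bisect_left` from the Python standard library, which returns the number of sorted elements strictly less than x, i.e. exactly $n_x$.

```python
from bisect import bisect_left
from collections import Counter

fares = [14, 15, 12, 12, 13, 15, 15, 13, 15, 12,
         15, 14, 15, 13, 13, 12, 12, 15, 14, 14]

ordered = sorted(fares)
n = len(ordered)

# Frequency table: value -> number of occurrences, in ascending order
freq = dict(sorted(Counter(fares).items()))
print(freq)


def ecdf(x):
    # bisect_left counts the sample values strictly less than x
    return bisect_left(ordered, x) / n


for x in (12, 13, 14, 15, 16):
    print(x, ecdf(x))
```

The printed values are the heights of the steps of the function just to the left of (and at) each variant, consistent with the left-continuity property.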


Find out what an empirical formula is. In chemistry, the empirical formula (EF) is the simplest way to describe a compound: essentially a list of the elements that make up the compound, based on their percentages. Note that this simple formula does not describe the arrangement of the atoms in the compound; it only indicates which elements the compound consists of. For example:

  • A compound consisting of 40.92% carbon, 4.58% hydrogen and 54.5% oxygen has the empirical formula C3H4O3 (how to find the EF of this compound is discussed in the second part).
  • Understand the term "percentage composition". The percentage composition is the percentage by mass of each element in the whole compound. To find the empirical formula of a compound, you need to know its percentage composition. If you are finding an empirical formula for homework, the percentages will most likely be given.

    • To find the percentage composition of a chemical compound in the laboratory, the compound is subjected to physical experiments followed by quantitative analysis. Unless you are in a lab, you do not need to do these experiments.
  • Keep in mind that you will have to deal with gram atoms. A gram atom is the amount of an element whose mass in grams is numerically equal to its atomic mass. To find the number of gram atoms, divide the percentage of the element in the compound by the atomic mass of the element.

    • Let's say, for example, that we have a compound that contains 40.92% carbon. The atomic mass of carbon is 12, so the calculation is 40.92 / 12 = 3.41.
  • Know how to find atomic ratios. A compound gives you more than one gram-atom value. After finding all of them, pick the smallest gram-atom value and divide every value by it. For example:

    • Let's say you are working with a compound whose three gram-atom values are 1.5, 2 and 2.5. The smallest of these numbers is 1.5. Therefore, to find the ratio of atoms, divide all the numbers by 1.5 and separate the results with a ratio sign (:).
    • 1.5 / 1.5 = 1; 2 / 1.5 = 1.33; 2.5 / 1.5 = 1.66. Therefore, the ratio of atoms is 1 : 1.33 : 1.66.
  • Understand how to convert atomic ratio values to integers. An empirical formula must use whole numbers, so values like 1.33 cannot be used. After finding the ratio of atoms, find a whole number that turns every value of the ratio into a whole number (to within rounding) when multiplied by it. For example:

    • Try 2. Multiply the atomic ratio numbers (1, 1.33 and 1.66) by 2. You get 2, 2.66 and 3.32. These are not integers, so 2 does not work.
    • Try 3. Multiplying 1, 1.33 and 1.66 by 3 gives 3, 3.99 and 4.98, i.e. 3, 4 and 5 after rounding. Therefore, the whole-number atomic ratio is 3 : 4 : 5.
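The steps above can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the rounding tolerance `tol` and the limit `max_multiplier` are hypothetical choices, and the atomic masses are rounded (12 for carbon as in the text, 1 for hydrogen and 16 for oxygen as assumed values).

```python
def atomic_ratio(percentages, atomic_masses):
    """Divide each percentage by the atomic mass, then by the smallest result."""
    gram_atoms = {el: pct / atomic_masses[el] for el, pct in percentages.items()}
    smallest = min(gram_atoms.values())
    return {el: g / smallest for el, g in gram_atoms.items()}


def to_integers(ratio, max_multiplier=10, tol=0.1):
    """Try multipliers 1, 2, 3, ... until every value is close to a whole number."""
    for k in range(1, max_multiplier + 1):
        scaled = {el: r * k for el, r in ratio.items()}
        if all(abs(v - round(v)) <= tol for v in scaled.values()):
            return {el: round(v) for el, v in scaled.items()}
    return None  # no suitable multiplier found within the limit


# Compound from the text: 40.92% C, 4.58% H, 54.5% O
masses = {"C": 12, "H": 1, "O": 16}  # rounded atomic masses (assumed)
ratio = atomic_ratio({"C": 40.92, "H": 4.58, "O": 54.5}, masses)
formula = to_integers(ratio)
print(formula)

# Gram-atom values 1.5, 2 and 2.5 from the worked example (labels A, B, C are placeholders)
smallest = 1.5
example = to_integers({"A": 1.5 / smallest, "B": 2 / smallest, "C": 2.5 / smallest})
print(example)
```

The tolerance matters: with rounded input data the scaled values are rarely exact integers, so a multiplier is accepted as soon as every value is within `tol` of a whole number.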