Bias – p. 2

Subsampling Bias (Continued)

Note: For the equations on this page, I’m using MathML and MathJax. Elsewhere I’ve used PNG image files for most equations. The original PNG images looked fine until I got a retina display, with two device pixels per CSS px, and suddenly every equation appeared fuzzy. On other pages, I’m using higher-resolution images and onload event handlers to improve the appearance. On my Mac and iPhone, which have retina displays, the two approaches now look more or less the same, but MathML/MathJax gives a slightly better rendering on all my PC displays (possibly a ClearType issue).

The theorems given below provide bounds for the bias of a sample chosen according to Gy’s criterion.

Definition Suppose a lot to be sampled and tested for an analyte is composed of N fragments. A sample from this lot is defined to be a random nonempty subset of the N fragments. In other words, a sample is a random variable whose possible values are nonempty subsets of the N fragments. (Note that the term random sample has a different meaning.) A sample is correct if each fragment in the lot has the same probability of being included in the sample.
Notation Index the fragments of the lot using the set L = {1, 2, …, N}. For any integer j ∈ L, let mj denote the mass of the jth fragment, Aj the mass of the critical component (analyte) in the jth fragment, and aj the critical content (mass fraction of analyte) in the jth fragment (aj = Aj / mj). The fragment masses, mj, are assumed to be known, but the masses of critical component, Aj, and critical contents, aj, are unknown. In problems where Aj and aj are allowed to vary, Aj will be treated as a function of aj and mj, which are considered more fundamental (Aj = aj mj).

For any nonempty subset G ⊆ L, identify G with the collection of fragments indexed by the elements of G. For example if G = {1, 2, 3}, then identify G with the collection that consists of the 1st, 2nd, and 3rd fragments in the lot. Also, for any nonempty subset G ⊆ L, let:

\[ m_G = \sum_{j \in G} m_j, \qquad A_G = \sum_{j \in G} A_j, \qquad a_G = \frac{A_G}{m_G} \]

In particular mL denotes the total mass of the lot, AL denotes the mass of critical component in the lot, and aL denotes the critical content of the lot. Furthermore, if S denotes a sample from the lot, then:

mS = mass of sample S,
AS = mass of the critical component in sample S, and
aS = critical content of sample S.

In this case, mS, AS, and aS are numerical random variables.
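
To make the definitions concrete, here is a minimal simulation sketch (Python with NumPy; the fragment masses and critical contents are made-up illustration values, not data from any real lot). It draws a correct sample by including each fragment independently with a common probability p, then computes mS, AS, and aS.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lot: N fragments with masses m_j and critical contents a_j.
N = 1000
m = rng.lognormal(0.0, 0.5, size=N)   # fragment masses m_j (known)
a = rng.beta(2, 50, size=N)           # critical contents a_j (unknown in practice)
A = a * m                             # analyte masses A_j = a_j * m_j

p = 0.05                              # common inclusion probability

def correct_sample():
    """Draw one correct sample S; return (m_S, A_S, a_S).

    Each fragment is included independently with the same probability p,
    so the sample is correct in the sense defined above.  (An empty
    sample is possible in principle, but with N = 1000 and p = 0.05 its
    probability is about 5e-23, so it is ignored here.)
    """
    in_S = rng.random(N) < p          # the indicator I[j in S] for each j
    return m[in_S].sum(), A[in_S].sum(), A[in_S].sum() / m[in_S].sum()

m_S, A_S, a_S = correct_sample()
print(m_S, A_S, a_S)                  # one realization of the three random variables
```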
Theorem A.1 Let S be a correct sample from lot L. Then:

\[ \frac{E(A_S)}{E(m_S)} = a_L \]
Proof: Since S is correct, there is a real number p with 0 < p ≤ 1, such that Pr[ j ∈ S ] = p for j = 1, 2, …, N. For any event F, let IF denote the random variable whose value is 1 if F occurs and 0 if F does not occur. So, for example, if j ∈ L, then I[ j ∈ S ] equals 1 if fragment j belongs to sample S and it equals 0 otherwise. Then:
\[ m_S = \sum_{j=1}^{N} I[\,j \in S\,]\, m_j \quad \text{and} \quad A_S = \sum_{j=1}^{N} I[\,j \in S\,]\, A_j \]

For any event F, the expected value of IF equals the probability of F. So, if j ∈ L, then E(I[ j ∈ S ]) = Pr[ j ∈ S ] = p. Therefore,

\[ E(m_S) = \sum_{j=1}^{N} E\big(I[\,j \in S\,]\, m_j\big) = \sum_{j=1}^{N} p\, m_j = p\, m_L \]

and:

\[ E(A_S) = \sum_{j=1}^{N} E\big(I[\,j \in S\,]\, A_j\big) = \sum_{j=1}^{N} p\, A_j = p\, A_L \]

So,

\[ \frac{E(A_S)}{E(m_S)} = \frac{p\, A_L}{p\, m_L} = \frac{A_L}{m_L} = a_L \]

A stronger result can also be proved. It can be shown that E(AS) / E(mS) = aL for all possible values of a1, a2, …, aN if and only if S is correct.
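
Theorem A.1 is easy to check by simulation. The sketch below (self-contained Python, using the same kind of made-up lot as the earlier sketch) draws many correct samples and compares the ratio of the averaged AS to the averaged mS with aL; the two agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical lot (illustration values only).
N = 1000
m = rng.lognormal(0.0, 0.5, size=N)
a = rng.beta(2, 50, size=N)
A = a * m
a_L = A.sum() / m.sum()               # critical content of the lot
p = 0.05                              # common inclusion probability

trials = 50_000
m_S = np.empty(trials)
A_S = np.empty(trials)
for t in range(trials):
    in_S = rng.random(N) < p          # correct sampling: same p for every fragment
    m_S[t] = m[in_S].sum()
    A_S[t] = A[in_S].sum()

# Theorem A.1: E(A_S) / E(m_S) = a_L (up to Monte Carlo error).
print(A_S.mean() / m_S.mean(), a_L)
```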

Note that the sampling bias is a bias in the mass fraction of analyte in the sample, aS = AS / mS. So, a sample S is unbiased if and only if E(AS / mS) = aL. Unfortunately, the mean of the quotient, E(AS / mS), is not necessarily equal to the quotient of the means, E(AS) / E(mS). If one measured the total mass of analyte in a correct sample, AS, and divided it by the expected mass of the sample, E(mS), rather than the actual mass, mS, the result would be unaffected by sampling bias. In practice, however, this is not done, and in most cases it would not be desirable anyway, because eliminating a rather small bias would not be worth the resulting increase in variability.
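
The difference between E(AS / mS) and E(AS) / E(mS) is easy to exhibit with a tiny example. The sketch below uses a hypothetical three-fragment lot in which one heavy “nugget” carries all the analyte, and enumerates every nonempty subset, so the expectations are exact rather than simulated. (Conditioning on the sample being nonempty leaves the inclusion probabilities equal, so the sample is still correct and Theorem A.1 still applies.)

```python
from itertools import chain, combinations

# Tiny hypothetical lot: two barren fragments and one analyte-rich "nugget".
m = [1.0, 1.0, 10.0]          # fragment masses m_j
A = [0.0, 0.0, 1.0]           # analyte masses A_j
a_L = sum(A) / sum(m)         # critical content of the lot: 1/12
p = 0.5                       # common inclusion probability

# Enumerate every nonempty subset S, conditioning on the sample being
# nonempty.  The conditional inclusion probability is the same for every
# fragment, p / (1 - (1 - p)**3), so the sample is still correct.
subsets = list(chain.from_iterable(combinations(range(3), k) for k in (1, 2, 3)))
nonempty = 1 - (1 - p) ** 3
probs = [p ** len(S) * (1 - p) ** (3 - len(S)) / nonempty for S in subsets]

E_m = sum(pr * sum(m[j] for j in S) for pr, S in zip(probs, subsets))
E_A = sum(pr * sum(A[j] for j in S) for pr, S in zip(probs, subsets))
E_a = sum(pr * sum(A[j] for j in S) / sum(m[j] for j in S)
          for pr, S in zip(probs, subsets))

print(E_A / E_m, a_L)   # quotient of the means: exactly a_L (Theorem A.1)
print(E_a, a_L)         # mean of the quotient: about 0.052 vs 0.083 -- biased
```

In this example the sample is correct, so E(AS) / E(mS) equals aL exactly, yet E(aS) comes out well below aL: with only three fragments the sample mass varies wildly, which is exactly the situation the corollary below quantifies.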

One may consider the sampling bias to be negligible if it is a small fraction of the standard deviation of aS, and the following corollary to Theorem A.1 shows that this is true whenever the relative standard deviation of the sample mass, mS, is small.

Corollary A.1.1 Assume S is a correct sample. Then:

\[ \operatorname{Bias}(a_S) = -\frac{\operatorname{Cov}(a_S,\, m_S)}{E(m_S)} = -\,\sigma(a_S) \times \operatorname{RSD}(m_S) \times \rho(a_S,\, m_S) \]

and

\[ \bigl|\operatorname{Bias}(a_S)\bigr| \le \sigma(a_S) \times \operatorname{RSD}(m_S) \]

where RSD denotes relative standard deviation (coefficient of variation).

Proof: First use the fact from Theorem A.1 that aL = E(AS) / E(mS) to derive the following equations.

\[
\begin{aligned}
\operatorname{Bias}(a_S) &= E(a_S) - a_L \\
&= E(a_S) - \frac{E(A_S)}{E(m_S)} \\
&= E(a_S) - \frac{E(a_S\, m_S)}{E(m_S)} \\
&= E(a_S) - \frac{E(a_S)\,E(m_S) + \operatorname{Cov}(a_S,\, m_S)}{E(m_S)} \\
&= -\frac{\operatorname{Cov}(a_S,\, m_S)}{E(m_S)} \\
&= -\frac{\rho(a_S,\, m_S)\,\sigma(a_S)\,\sigma(m_S)}{E(m_S)} \\
&= -\,\sigma(a_S) \times \operatorname{RSD}(m_S) \times \rho(a_S,\, m_S)
\end{aligned}
\]

Then, since |ρ(aS, mS)| ≤ 1, it follows that |Bias(aS)| ≤ σ(aS) × RSD(mS).

Note that a large value for RSD(mS) does not necessarily imply a large sampling bias, because aS and mS may be only weakly correlated.
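
Both the identity and the bound in Corollary A.1.1 can be checked exactly on the same hypothetical three-fragment lot used above, again enumerating all nonempty subsets rather than simulating:

```python
from itertools import chain, combinations
from math import sqrt

# Same hypothetical three-fragment lot as in the previous sketch.
m = [1.0, 1.0, 10.0]
A = [0.0, 0.0, 1.0]
a_L = sum(A) / sum(m)
p = 0.5

subsets = list(chain.from_iterable(combinations(range(3), k) for k in (1, 2, 3)))
nonempty = 1 - (1 - p) ** 3
probs = [p ** len(S) * (1 - p) ** (3 - len(S)) / nonempty for S in subsets]
m_S = [sum(m[j] for j in S) for S in subsets]
a_S = [sum(A[j] for j in S) / mS for S, mS in zip(subsets, m_S)]

def E(x):                     # exact expectation over the 7 possible samples
    return sum(pr * xi for pr, xi in zip(probs, x))

E_m, E_a = E(m_S), E(a_S)
cov = E([ai * mi for ai, mi in zip(a_S, m_S)]) - E_a * E_m
sd_a = sqrt(E([ai ** 2 for ai in a_S]) - E_a ** 2)
rsd_m = sqrt(E([mi ** 2 for mi in m_S]) - E_m ** 2) / E_m

bias = E_a - a_L
print(bias, -cov / E_m)            # agree: Bias(a_S) = -Cov(a_S, m_S) / E(m_S)
print(abs(bias), sd_a * rsd_m)     # the bound |Bias(a_S)| <= sigma(a_S) * RSD(m_S)
```

For this lot, ρ(aS, mS) is close to 1 (the heavier samples are exactly the ones containing the nugget), so the bound is nearly attained; with many fragments of comparable mass, RSD(mS) would be small and the bias a small fraction of σ(aS).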

Theorem A.2 Assume S is a sample chosen in a manner such that the mass of the sample always falls between (1 − δ) × M and (1 + δ) × M for specified values of M and δ. If Pr[ j ∈ S ] = M / mL for all j ∈ L, then:

\[ (1 - \delta)\, E(a_S) \;\le\; a_L \;\le\; (1 + \delta)\, E(a_S) \]

Given the premise of Theorem A.2, it can also be shown that if S is unbiased for all possible values of a1, a2, …, aN, then for all j ∈ L,

\[ (1 - \delta)\,\frac{M}{m_L} \;\le\; \Pr[\, j \in S \,] \;\le\; (1 + \delta)\,\frac{M}{m_L} \]

So, if the mass of the sample is not allowed to vary much, one can ensure zero sampling bias only if all the fragments have nearly the same selection probability.
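
Theorem A.2 can also be illustrated numerically. One simple scheme that satisfies its premise exactly is interleaved splitting: divide the fragments into k groups by taking every kth fragment, and choose one group uniformly at random as the sample. Each fragment then has inclusion probability exactly 1/k = M / mL with M = mL / k. The sketch below (made-up lot data again) checks the resulting bracketing of aL:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical lot (illustration values only).
N, k = 1000, 20
m = rng.lognormal(0.0, 0.5, size=N)
a = rng.beta(2, 50, size=N)
A = a * m
a_L = A.sum() / m.sum()

# The k possible samples: fragment j goes to group j mod k, and one group
# is chosen uniformly at random, so Pr[j in S] = 1/k exactly.
groups = [np.arange(N) % k == r for r in range(k)]
m_S = np.array([m[g].sum() for g in groups])
a_S = np.array([A[g].sum() / m[g].sum() for g in groups])

M = m.sum() / k                       # E(m_S); note M / m_L = 1/k = Pr[j in S]
delta = np.max(np.abs(m_S - M)) / M   # smallest delta covering every sample mass
E_a = a_S.mean()                      # exact: each group has probability 1/k

# Theorem A.2: (1 - delta) E(a_S) <= a_L <= (1 + delta) E(a_S).
print((1 - delta) * E_a, a_L, (1 + delta) * E_a)
```

Because each group here averages many fragments, δ comes out small and the bracket around aL is correspondingly tight, in line with the remark above: keeping the sample mass nearly constant while preserving equal selection probabilities keeps the bias small.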