Bias – p. 2

Subsampling Bias (Continued)

Note: For the equations on this page, I’m using MathML and MathJax. Elsewhere I’ve used PNG image files for most equations. The original PNG images looked fine until I got a retina display, with two device pixels per CSS px, and suddenly every equation appeared fuzzy. On other pages, I’m using higher-resolution images and onload event handlers to improve the appearance. On my Mac and iPhone, which have retina displays, the two approaches now look more or less the same, but MathML/MathJax gives a slightly better rendering on all my PC displays (possibly a ClearType issue).

The theorems given below provide bounds for the bias of a sample chosen according to Gy’s criterion.

Definition Suppose a lot to be sampled and tested for an analyte is composed of N fragments. A sample from this lot is defined to be a random nonempty subset of the N fragments. In other words, a sample is a random variable whose possible values are nonempty subsets of the N fragments. (Note that the term random sample has a different meaning.) A sample is correct if each fragment in the lot has the same probability of being included in the sample.
Notation Index the fragments of the lot using the set L = {1, 2, …, N}. For any integer j ∈ L, let mj denote the mass of the jth fragment, Aj the mass of the critical component (analyte) in the jth fragment, and aj the critical content (mass fraction of analyte) in the jth fragment (aj = Aj / mj). The fragment masses, mj, are assumed to be known, but the masses of critical component, Aj, and critical contents, aj, are unknown. In problems where Aj and aj are allowed to vary, Aj will be treated as a function of aj and mj, which are considered more fundamental (Aj = aj mj).

For any nonempty subset G ⊆ L, identify G with the collection of fragments indexed by the elements of G. For example if G = {1, 2, 3}, then identify G with the collection that consists of the 1st, 2nd, and 3rd fragments in the lot. Also, for any nonempty subset G ⊆ L, let:

\[ m_G = \sum_{j \in G} m_j, \qquad A_G = \sum_{j \in G} A_j, \qquad a_G = \frac{A_G}{m_G} \]

In particular mL denotes the total mass of the lot, AL denotes the mass of critical component in the lot, and aL denotes the critical content of the lot. Furthermore, if S denotes a sample from the lot, then:

mS = mass of sample S,
AS = mass of the critical component in sample S, and
aS = critical content of sample S.

In this case, mS, AS, and aS are numerical random variables.
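
To make the definitions concrete, here is a minimal simulation sketch (Python with NumPy; the fragment masses and critical contents are made-up illustration values, not data from any real lot). It draws a correct sample by including each fragment independently with a common probability p, then computes mS, AS, and aS.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lot: N fragments with masses m_j and critical contents a_j.
N = 1000
m = rng.lognormal(0.0, 0.5, size=N)   # fragment masses m_j (known)
a = rng.beta(2, 50, size=N)           # critical contents a_j (unknown in practice)
A = a * m                             # analyte masses A_j = a_j * m_j

p = 0.05                              # common inclusion probability

def correct_sample():
    """Draw one correct sample S; return (m_S, A_S, a_S).

    Each fragment is included independently with the same probability p,
    so the sample is correct in the sense defined above.  (An empty
    sample is possible in principle, but with N = 1000 and p = 0.05 its
    probability is about 5e-23, so it is ignored here.)
    """
    in_S = rng.random(N) < p          # the indicator I[j in S] for each j
    return m[in_S].sum(), A[in_S].sum(), A[in_S].sum() / m[in_S].sum()

m_S, A_S, a_S = correct_sample()
print(m_S, A_S, a_S)                  # one realization of the three random variables
```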
Theorem A.1 Let S be a correct sample from lot L. Then:

\[ \frac{E(A_S)}{E(m_S)} = a_L \]
Proof: Since S is correct, there is a real number p with 0 < p ≤ 1, such that Pr[ j ∈ S ] = p for j = 1, 2, …, N. For any event F, let IF denote the random variable whose value is 1 if F occurs and 0 if F does not occur. So, for example, if j ∈ L, then I[ j ∈ S ] equals 1 if fragment j belongs to sample S and it equals 0 otherwise. Then:
\[ m_S = \sum_{j=1}^{N} I[\,j \in S\,]\, m_j \quad \text{and} \quad A_S = \sum_{j=1}^{N} I[\,j \in S\,]\, A_j \]

For any event F, the expected value of IF equals the probability of F. So, if j ∈ L, then E(I[ j ∈ S ]) = Pr[ j ∈ S ] = p. Therefore,

\[ E(m_S) = \sum_{j=1}^{N} E\big(I[\,j \in S\,]\, m_j\big) = \sum_{j=1}^{N} p\, m_j = p\, m_L \]

and:

\[ E(A_S) = \sum_{j=1}^{N} E\big(I[\,j \in S\,]\, A_j\big) = \sum_{j=1}^{N} p\, A_j = p\, A_L \]

So,

\[ \frac{E(A_S)}{E(m_S)} = \frac{p\, A_L}{p\, m_L} = \frac{A_L}{m_L} = a_L \]

A stronger result can also be proved. It can be shown that E(AS) / E(mS) = aL for all possible values of a1, a2, …, aN if and only if S is correct.
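
Theorem A.1 is easy to check by simulation. The sketch below (self-contained Python, using the same kind of made-up lot as the earlier sketch) draws many correct samples and compares the ratio of the averaged AS to the averaged mS with aL; the two agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical lot (illustration values only).
N = 1000
m = rng.lognormal(0.0, 0.5, size=N)
a = rng.beta(2, 50, size=N)
A = a * m
a_L = A.sum() / m.sum()               # critical content of the lot
p = 0.05                              # common inclusion probability

trials = 50_000
m_S = np.empty(trials)
A_S = np.empty(trials)
for t in range(trials):
    in_S = rng.random(N) < p          # correct sampling: same p for every fragment
    m_S[t] = m[in_S].sum()
    A_S[t] = A[in_S].sum()

# Theorem A.1: E(A_S) / E(m_S) = a_L (up to Monte Carlo error).
print(A_S.mean() / m_S.mean(), a_L)
```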

Note that the sampling bias is a bias in the mass fraction of analyte in the sample, aS = AS / mS. So, a sample S is unbiased if and only if E(AS / mS) = aL. Unfortunately, the mean of the quotient, E(AS / mS), is not necessarily equal to the quotient of the means, E(AS) / E(mS). If one measured the total mass of analyte in a correct sample, AS, and divided it by the expected mass of the sample, E(mS), rather than the actual mass, mS, the result would be unaffected by sampling bias. In practice, however, this is not done, and in most cases it would not be desirable anyway, because eliminating a rather small bias would not be worth the resulting increase in variability.
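
The difference between E(AS / mS) and E(AS) / E(mS) is easy to exhibit with a tiny example. The sketch below uses a hypothetical three-fragment lot in which one heavy “nugget” carries all the analyte, and enumerates every nonempty subset, so the expectations are exact rather than simulated. (Conditioning on the sample being nonempty leaves the inclusion probabilities equal, so the sample is still correct and Theorem A.1 still applies.)

```python
from itertools import chain, combinations

# Tiny hypothetical lot: two barren fragments and one analyte-rich "nugget".
m = [1.0, 1.0, 10.0]          # fragment masses m_j
A = [0.0, 0.0, 1.0]           # analyte masses A_j
a_L = sum(A) / sum(m)         # critical content of the lot: 1/12
p = 0.5                       # common inclusion probability

# Enumerate every nonempty subset S, conditioning on the sample being
# nonempty.  The conditional inclusion probability is the same for every
# fragment, p / (1 - (1 - p)**3), so the sample is still correct.
subsets = list(chain.from_iterable(combinations(range(3), k) for k in (1, 2, 3)))
nonempty = 1 - (1 - p) ** 3
probs = [p ** len(S) * (1 - p) ** (3 - len(S)) / nonempty for S in subsets]

E_m = sum(pr * sum(m[j] for j in S) for pr, S in zip(probs, subsets))
E_A = sum(pr * sum(A[j] for j in S) for pr, S in zip(probs, subsets))
E_a = sum(pr * sum(A[j] for j in S) / sum(m[j] for j in S)
          for pr, S in zip(probs, subsets))

print(E_A / E_m, a_L)   # quotient of the means: exactly a_L (Theorem A.1)
print(E_a, a_L)         # mean of the quotient: about 0.052 vs 0.083 -- biased
```

In this example the sample is correct, so E(AS) / E(mS) equals aL exactly, yet E(aS) comes out well below aL: with only three fragments the sample mass varies wildly, which is exactly the situation the corollary below quantifies.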

One may consider the sampling bias to be negligible if it is a small fraction of the standard deviation of aS, and the following corollary to Theorem A.1 shows that this is true whenever the relative standard deviation of the sample mass, mS, is small.

Corollary A.1.1 Assume S is a correct sample. Then:

\[ \operatorname{Bias}(a_S) = -\frac{\operatorname{Cov}(a_S,\, m_S)}{E(m_S)} = -\,\sigma(a_S) \times \operatorname{RSD}(m_S) \times \rho(a_S,\, m_S) \]

and

\[ \bigl|\operatorname{Bias}(a_S)\bigr| \le \sigma(a_S) \times \operatorname{RSD}(m_S) \]

where RSD denotes relative standard deviation (coefficient of variation).

Proof: First use the fact from Theorem A.1 that aL = E(AS) / E(mS) to derive the following equations.

\[
\begin{aligned}
\operatorname{Bias}(a_S) &= E(a_S) - a_L \\
&= E(a_S) - \frac{E(A_S)}{E(m_S)} \\
&= E(a_S) - \frac{E(a_S\, m_S)}{E(m_S)} \\
&= E(a_S) - \frac{E(a_S)\,E(m_S) + \operatorname{Cov}(a_S,\, m_S)}{E(m_S)} \\
&= -\frac{\operatorname{Cov}(a_S,\, m_S)}{E(m_S)} \\
&= -\frac{\rho(a_S,\, m_S)\,\sigma(a_S)\,\sigma(m_S)}{E(m_S)} \\
&= -\,\sigma(a_S) \times \operatorname{RSD}(m_S) \times \rho(a_S,\, m_S)
\end{aligned}
\]

Then, since |ρ(aS, mS)| ≤ 1, it follows that |Bias(aS)| ≤ σ(aS) × RSD(mS).

Note that a large value for RSD(mS) does not necessarily imply a large sampling bias, because aS and mS may be only weakly correlated.
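
Both the identity and the bound in Corollary A.1.1 can be checked exactly on the same hypothetical three-fragment lot used above, again enumerating all nonempty subsets rather than simulating:

```python
from itertools import chain, combinations
from math import sqrt

# Same hypothetical three-fragment lot as in the previous sketch.
m = [1.0, 1.0, 10.0]
A = [0.0, 0.0, 1.0]
a_L = sum(A) / sum(m)
p = 0.5

subsets = list(chain.from_iterable(combinations(range(3), k) for k in (1, 2, 3)))
nonempty = 1 - (1 - p) ** 3
probs = [p ** len(S) * (1 - p) ** (3 - len(S)) / nonempty for S in subsets]
m_S = [sum(m[j] for j in S) for S in subsets]
a_S = [sum(A[j] for j in S) / mS for S, mS in zip(subsets, m_S)]

def E(x):                     # exact expectation over the 7 possible samples
    return sum(pr * xi for pr, xi in zip(probs, x))

E_m, E_a = E(m_S), E(a_S)
cov = E([ai * mi for ai, mi in zip(a_S, m_S)]) - E_a * E_m
sd_a = sqrt(E([ai ** 2 for ai in a_S]) - E_a ** 2)
rsd_m = sqrt(E([mi ** 2 for mi in m_S]) - E_m ** 2) / E_m

bias = E_a - a_L
print(bias, -cov / E_m)            # agree: Bias(a_S) = -Cov(a_S, m_S) / E(m_S)
print(abs(bias), sd_a * rsd_m)     # the bound |Bias(a_S)| <= sigma(a_S) * RSD(m_S)
```

For this lot, ρ(aS, mS) is close to 1 (the heavier samples are exactly the ones containing the nugget), so the bound is nearly attained; with many fragments of comparable mass, RSD(mS) would be small and the bias a small fraction of σ(aS).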

Theorem A.2 Assume S is a sample chosen in a manner such that the mass of the sample always falls between (1 − δ) × M and (1 + δ) × M for specified values of M and δ. If Pr[ j ∈ S ] = M / mL for all j ∈ L, then:

\[ (1 - \delta)\, E(a_S) \;\le\; a_L \;\le\; (1 + \delta)\, E(a_S) \]

Given the premise of Theorem A.2, it can also be shown that if S is unbiased for all possible values of a1, a2, …, aN, then for all j ∈ L,

\[ (1 - \delta)\,\frac{M}{m_L} \;\le\; \Pr[\, j \in S \,] \;\le\; (1 + \delta)\,\frac{M}{m_L} \]

So, if the mass of the sample is not allowed to vary much, one can ensure zero sampling bias only if all the fragments have nearly the same selection probability.
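
Theorem A.2 can also be illustrated numerically. One simple scheme that satisfies its premise exactly is interleaved splitting: divide the fragments into k groups by taking every kth fragment, and choose one group uniformly at random as the sample. Each fragment then has inclusion probability exactly 1/k = M / mL with M = mL / k. The sketch below (made-up lot data again) checks the resulting bracketing of aL:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical lot (illustration values only).
N, k = 1000, 20
m = rng.lognormal(0.0, 0.5, size=N)
a = rng.beta(2, 50, size=N)
A = a * m
a_L = A.sum() / m.sum()

# The k possible samples: fragment j goes to group j mod k, and one group
# is chosen uniformly at random, so Pr[j in S] = 1/k exactly.
groups = [np.arange(N) % k == r for r in range(k)]
m_S = np.array([m[g].sum() for g in groups])
a_S = np.array([A[g].sum() / m[g].sum() for g in groups])

M = m.sum() / k                       # E(m_S); note M / m_L = 1/k = Pr[j in S]
delta = np.max(np.abs(m_S - M)) / M   # smallest delta covering every sample mass
E_a = a_S.mean()                      # exact: each group has probability 1/k

# Theorem A.2: (1 - delta) E(a_S) <= a_L <= (1 + delta) E(a_S).
print((1 - delta) * E_a, a_L, (1 + delta) * E_a)
```

Because each group here averages many fragments, δ comes out small and the bracket around aL is correspondingly tight, in line with the remark above: keeping the sample mass nearly constant while preserving equal selection probabilities keeps the bias small.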