5/17/2017

Overview

  1. What is Size-Biased Data?
  2. Scientific Background for Mitochondria
  3. Goals for this project
  4. How did the sampling process cause size-biased data?
  5. Investigate possible estimators with a simulation study
  6. Use the best ones on real data
  7. Conclusion
  8. Future Work

Examples for Size-Biased Data



Scientific Background for Mitochondria

Goals for this project

  1. Do the properties (area, perimeter, circularity and aspect ratio) of mitochondria differ by location (proximal, middle and distal end)?
  2. Suggest a sampling method for future research (more cells).

Sampling Process - 1

  • A young muscle fiber cell was magnified into 166 different images using a Transmission Electron Microscope (TEM).


Sampling Process - 2

  • For each location, divide the images into two groups:
    Subsarcolemmal and Interfibrillar (this grouping is ignored later).
  • In each group, randomly pick one image.
  • In each image, sample 20 mitochondria.

Sampling Process - 3

  • Generate a list of random coordinates.
  • Pick the mitochondria whose area in the photo includes one or more generated coordinates.

Problems from the Sampled Data

  1. It is NOT a random sample; it is size-biased!
  2. Larger mitochondria are more likely to be picked for our sample.
  3. If we use the sample mean as the estimator of the population mean, it will tend to overestimate it!
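
Why the sample mean overshoots, as a short sketch: assume each mitochondrion is selected with probability proportional to its area \(A\), and write \(\mu\) and \(\sigma^{2}\) for the population mean and variance of \(A\) (notation introduced here for illustration). The expected area of a selected mitochondrion is then

\[ \frac{E(A^{2})}{E(A)} = \frac{\sigma^{2}+\mu^{2}}{\mu} = \mu + \frac{\sigma^{2}}{\mu} \geq \mu , \]

so the sample mean is biased upward by \(\sigma^{2}/\mu\) unless all areas are equal.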

Raw Data

  • Area \(({\mu m}^{2})\):
    The area occupied by a mitochondrion in an image.
  • Perimeter \((\mu m)\):
    The length of the boundary of a mitochondrion in an image.
  • Circularity:
    Circularity is equal to \(\frac{4 \pi Area}{Perimeter^2}\).

    (Measuring the resemblance of a mitochondrion to a circle. The range of circularity is between 0 and 1. 1 means a perfect circle.)

  • Aspect Ratio:
    Aspect Ratio is equal to \(\frac{Length}{Width}\).

    (If \(AR \leq 2\), it is considered short; if \(2 < AR \leq 4\), intermediate; if \(AR > 4\), long.)
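
The two derived measures can be computed directly from the definitions above. This is a minimal sketch; the function names are hypothetical and not taken from the project's code.

```python
import math

def circularity(area, perimeter):
    """4*pi*Area / Perimeter^2; equals 1 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2

def aspect_ratio_class(length, width):
    """Classify a mitochondrion by the AR thresholds listed above."""
    ar = length / width
    if ar <= 2:
        return "short"
    if ar <= 4:
        return "intermediate"
    return "long"

# Sanity check: a circle of radius 1 has area pi and perimeter 2*pi.
circle_c = circularity(math.pi, 2.0 * math.pi)
```

For the unit circle, `circle_c` comes out as 1 up to floating-point rounding.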

Data Exploration: Area

Data Exploration: Perimeter

Data Exploration: Circularity

Data Exploration: Aspect Ratio

Data Exploration: Scatter Plots



Best Estimators

  • Circularity:
    Arithmetic Mean
  • Aspect Ratio:
    Arithmetic Mean

New Goals for this project

  1. What is the appropriate estimator for the size-biased data?
    A: Simulation Study for finding the best estimator.
  2. Do the properties of mitochondria differ by location?
    A: Permutation Test and Bootstrapping Confidence Interval
  3. Suggestions on sampling scheme for future research.
    A: Based on the Simulation Study.

Weighted Distribution

  • Cox (1962) proposed the idea of a Weighted Distribution, \[{f}^{\ast}(x)=\frac{w(x)f(x)}{{E}_{f}(w(x))}.\]
  • Cox (1962) also proposed the Harmonic Mean (\(\frac{n}{\sum_{i=1}^{n}\frac{1}{{x}_{i}}}\)) as an estimator of the population mean of \(X\), and proved that it converges to \(\mu={E}_{f}(x)\) as \(n \to \infty\) when \(w(x)=x\).
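
The convergence claim can be checked in a few lines, assuming \(w(x)=x\) and \(\mu={E}_{f}(x)>0\):

\[ \begin{align*} {E}_{{f}^{\ast}}\left(\frac{1}{x}\right) &= \int \frac{1}{x}\,\frac{x f(x)}{\mu}\,dx = \frac{1}{\mu}\int f(x)\,dx = \frac{1}{\mu} \\ \frac{1}{n}\sum_{i=1}^{n}\frac{1}{{x}_{i}} &\to \frac{1}{\mu} \quad\Longrightarrow\quad \frac{n}{\sum_{i=1}^{n}\frac{1}{{x}_{i}}} \to \mu \end{align*} \]

The second line is the law of large numbers applied to \(1/{x}_{i}\) under the size-biased density \({f}^{\ast}\).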

Simulation Study - Area

  • Assume the true distribution is \(Area\; \sim \; Exp(\theta) = f(A)\), where \(\theta\) is the mean.
  • Then the observed (size-biased) distribution is \(Area\; \sim \; Gamma(2,\theta) = f^{*}(A)\).
  • The red dashed line is \(Gamma(2, \widehat{\theta})\), where \(\widehat{\theta} = \frac{\bar{a}}{2} \doteq 1183\).
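
A sketch of why size-biasing an exponential gives \(Gamma(2,\theta)\), assuming \(w(A)=A\) and that \(\theta\) is the mean (scale) of the exponential:

\[ f(a)=\frac{1}{\theta}{e}^{-a/\theta},\qquad {E}_{f}(A)=\theta,\qquad {f}^{\ast}(a)=\frac{a\,f(a)}{\theta}=\frac{a}{{\theta}^{2}}{e}^{-a/\theta},\quad a>0 \]

This is the Gamma density with shape 2 and scale \(\theta\). Its mean is \(2\theta\), which is why the observed sample mean is halved to estimate \(\theta\).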

Candidate Estimators - Area

  1. Arithmetic Mean (AM) \[\frac{\sum_{i=1}^{n}{a}_{i}}{n}\]
  2. Weighted Mean (WM) or Harmonic Mean

    \[ \frac{\sum_{i=1}^{n}{w}_{i}{a}_{i}}{\sum_{i=1}^{n}{w}_{i}}=\frac{n}{\sum_{i=1}^{n}\frac{1}{{a}_{i}}}\;,\;\;\text{where}\;\; {w}_{i}=\frac{1}{{p}_{i}}=\frac{n\bar{a}}{{a}_{i}} \]
  3. Maximum Likelihood Estimator (MLE)

    \[\frac{\sum_{i=1}^{n}{a}_{i}}{2n}=\frac{AM}{2}\]
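
The MLE follows from the log-likelihood, assuming the observed areas \({a}_{1},\dots,{a}_{n}\) are independent draws from \(Gamma(2,\theta)\) with scale \(\theta\):

\[ \begin{align*} \ell(\theta) &= \sum_{i=1}^{n}\log {a}_{i} - 2n\log\theta - \frac{1}{\theta}\sum_{i=1}^{n}{a}_{i} \\ \ell'(\theta) &= -\frac{2n}{\theta} + \frac{1}{{\theta}^{2}}\sum_{i=1}^{n}{a}_{i} = 0 \quad\Longrightarrow\quad \widehat{\theta} = \frac{\sum_{i=1}^{n}{a}_{i}}{2n} \end{align*} \]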

Simulation Study - Area (Overview)

  • Simulate mitochondria data in a muscle fiber cell.
  • Sample from a finite population (\(\mathbf{N}\)) rather than an infinite population.
  • Sample both with replacement and without replacement.
  • The sample size (\(\mathbf{n}\)) is set by the \(\mathbf{Ratio}\) of \(\mathbf{n}\) to \(\mathbf{N}\).

Simulation Study - Area

  1. Assume \(Area \sim Exp(\mu)\),
    Set \(\mu = 1000\)
    \(N = 2000\),
    \(Ratio\) = \((5\%, 10\%, 30\%, 50\%, 70\%, 95\%)\),
    \(Repeated\;Times = 1000\).
  2. Generate \(N\) values from \(Exp(\mu)\) as the subpopulation of Area.
    Calculate the subpopulation mean, \(\mu_A\) (this is what we are interested in!).
  3. Draw a sample of size \(n\) from the subpopulation (\(N\)), with sampling probability proportional to the value of Area, both with and without replacement (\(n = N \times Ratio\)).

Simulation Study - Area

  4. For each sample, calculate the candidate estimators: Arithmetic Mean (AM), Weighted Mean (WM) and Maximum Likelihood Estimator (MLE).
  5. Repeat steps 3 and 4 \(Repeated\;Times\) times for each \(Ratio\).
  6. Calculate the Mean, Standard Deviation and Root MSE of each candidate estimator.
    Draw plots of the sampling distribution of each candidate estimator.
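
The Area simulation described on these slides can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, and the without-replacement draw uses Efraimidis-Spirakis keys because the slides do not say how that draw was implemented.

```python
import math
import random
import statistics

def arithmetic_mean(a):
    return sum(a) / len(a)

def weighted_mean(a):
    # Weights proportional to 1/a_i reduce to the harmonic mean.
    return len(a) / sum(1.0 / x for x in a)

def mle(a):
    # MLE of theta when the observed areas follow Gamma(2, theta).
    return sum(a) / (2.0 * len(a))

ESTIMATORS = {"AM": arithmetic_mean, "WM": weighted_mean, "MLE": mle}

def size_biased_sample(pop, n, replace, rng):
    """Draw n values with selection probability proportional to the value."""
    if replace:
        return rng.choices(pop, weights=pop, k=n)
    # Assumed scheme for sampling without replacement: keep the n largest
    # keys log(u)/x, which matches successive size-proportional draws.
    keyed = sorted(pop, key=lambda x: math.log(1.0 - rng.random()) / x,
                   reverse=True)
    return keyed[:n]

def simulate_area(mu=1000.0, N=2000, ratio=0.05, reps=1000,
                  replace=True, seed=0):
    rng = random.Random(seed)
    # Finite subpopulation of areas and its mean mu_A (the target).
    pop = [rng.expovariate(1.0 / mu) for _ in range(N)]
    mu_A = sum(pop) / N
    n = max(2, round(N * ratio))
    draws = {name: [] for name in ESTIMATORS}
    # Repeated size-biased sampling and estimation.
    for _ in range(reps):
        sample = size_biased_sample(pop, n, replace, rng)
        for name, est in ESTIMATORS.items():
            draws[name].append(est(sample))
    # Mean, standard deviation and root MSE for each estimator.
    summary = {}
    for name, vals in draws.items():
        rmse = math.sqrt(sum((v - mu_A) ** 2 for v in vals) / len(vals))
        summary[name] = (statistics.fmean(vals), statistics.stdev(vals), rmse)
    return mu_A, summary
```

Calling `simulate_area(ratio=r, replace=flag)` for each `r` in the Ratio list and each replacement setting gives one row of the results per estimator.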

Results of Simulation Study - Area

Best Estimators - Area

  • Sampling "WITH" Replacement:
    Weighted Mean and MLE
  • Sampling "WITHOUT" Replacement:
    Unfortunately, not clear yet.

Simulation Study - Perimeter

  • \(Perimeter =\sqrt{4\pi}\sqrt{\frac{Area}{Circularity}}\)
  • \(Area \perp Circularity\).
  • The observed distribution is \(Circularity\;\sim\;Beta(15,5)\).
  • Assume that the true distribution is \(Circularity\;\sim\;Beta(\alpha=15, \beta=5)\).
  • The red dashed line is \(Beta(15, 5)\).


Candidate Estimators - Perimeter

  1. Arithmetic Mean (AM) \[\frac{\sum_{i=1}^{n}{p}_{i}}{n}\]
  2. Weighted Mean (WM) \[\frac{\sum_{i=1}^{n}{w}_{i}{p}_{i}}{\sum_{i=1}^{n}{w}_{i}}\;,\;\;\text{where}\;\; {w}_{i}=\frac{n\bar{a}}{{a}_{i}}\]
  3. Delta Method Estimator (DME) \[\sqrt{4\pi}\sqrt{\frac{\bar{a}/2}{\bar{c}}}\]
  4. 2nd Order Taylor's Approximation Estimator (2TAE) \[\sqrt{4 \pi}\left[ \sqrt{\frac{\bar{a}/2}{\bar{c}}} - \frac{1}{8} \left(\frac{\bar{a}}{2}\right)^{-\frac{3}{2}}(\bar{c})^{-\frac{1}{2}}\frac{{s}_{a}^2}{2}+\frac{3}{8}\left(\frac{\bar{a}}{2}\right)^{\frac{1}{2}}(\bar{c})^{-\frac{5}{2}}{s}_{c}^2\right]\]

Simulation Study - Perimeter

  1. Generate the finite subpopulation (\(\mathbf{N}\)) of Circularity from \(Beta(15,5)\).
  2. Plug the generated Area and Circularity data into the formula to obtain the subpopulation of Perimeter.
  3. Sample from the finite subpopulation (\(N\)) of Perimeter with sampling probability proportional to Area, both with and without replacement.
  4. Compare the performance of the candidate estimators: Arithmetic Mean (AM), Weighted Mean (WM), Delta Method Estimator (DME) and 2nd Order Taylor's Approximation Estimator (2TAE).
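
A minimal sketch of the Perimeter simulation, for the with-replacement case only. The function names are hypothetical. The 2TAE line uses the standard second-order expansion of \(\sqrt{a/c}\) around \((\bar{a}/2, \bar{c})\), so its last term carries \((\bar{a}/2)^{1/2}\); treat that as an assumption about the intended formula.

```python
import math
import random
import statistics

def perimeter_estimators(areas, circs):
    """AM, WM, DME and 2TAE of mean Perimeter from a size-biased sample."""
    n = len(areas)
    perims = [math.sqrt(4.0 * math.pi * a / c) for a, c in zip(areas, circs)]
    a_half = (sum(areas) / n) / 2.0      # a-bar / 2 estimates the true mean area
    c_bar = sum(circs) / n
    s2_a = statistics.variance(areas)
    s2_c = statistics.variance(circs)
    inv_a = [1.0 / a for a in areas]     # weights proportional to 1 / a_i
    first = math.sqrt(a_half / c_bar)
    second = (first
              - (1.0 / 8.0) * a_half ** -1.5 * c_bar ** -0.5 * (s2_a / 2.0)
              + (3.0 / 8.0) * a_half ** 0.5 * c_bar ** -2.5 * s2_c)
    k = math.sqrt(4.0 * math.pi)
    return {
        "AM": sum(perims) / n,
        "WM": sum(w * p for w, p in zip(inv_a, perims)) / sum(inv_a),
        "DME": k * first,
        "2TAE": k * second,
    }

def simulate_perimeter(mu=1000.0, N=2000, ratio=0.3, reps=200, seed=0):
    rng = random.Random(seed)
    # Finite subpopulation: independent Area and Circularity values.
    areas = [rng.expovariate(1.0 / mu) for _ in range(N)]
    circs = [rng.betavariate(15, 5) for _ in range(N)]
    true_mean = sum(math.sqrt(4.0 * math.pi * a / c)
                    for a, c in zip(areas, circs)) / N
    n = max(2, round(N * ratio))
    totals = {"AM": 0.0, "WM": 0.0, "DME": 0.0, "2TAE": 0.0}
    for _ in range(reps):
        # Select units with probability proportional to Area, with replacement.
        idx = rng.choices(range(N), weights=areas, k=n)
        est = perimeter_estimators([areas[i] for i in idx],
                                   [circs[i] for i in idx])
        for name in totals:
            totals[name] += est[name]
    # Average of each estimator over the replicates.
    return true_mean, {name: t / reps for name, t in totals.items()}
```

`simulate_perimeter()` returns the subpopulation mean Perimeter and the average of each estimator, so the bias of each candidate can be read off directly.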

Simulated Data - Perimeter

Results of Simulation Study - Perimeter

Best Estimators - Perimeter

  • Sampling "WITH" Replacement:
    Weighted Mean and 2TAE
  • Sampling "WITHOUT" Replacement:
    Unfortunately, not clear yet.

Hypothesis Test

  • Overall Hypothesis Test:

\[ \begin{align*} {H}_{0} &: {\mu}_{{i}_{P}} = {\mu}_{{i}_{M}} = {\mu}_{{i}_{D}}\\ {H}_{A} &: \text{At least one} \: {\mu}_{{i}_{j}} \neq {\mu}_{{i}_{k}} \end{align*} \]

  • Pairwise Comparison Test:

\[ \begin{align*} {H}_{0} &: {\mu}_{{i}_{j}} = {\mu}_{{i}_{k}} \\ {H}_{A} &: {\mu}_{{i}_{j}} \neq {\mu}_{{i}_{k}} \\ \end{align*} \] \[ \begin{align*} i &= \left \{ \text{Area, Perimeter, Circularity, Aspect Ratio} \right \} \\ j,k & = \left \{ \text{P, M, D} \right \} \end{align*} \]

Hypothesis Test : Permutation Test

  • Reasons:
    • Area and Perimeter are size-biased.
    • For Circularity and Aspect Ratio, the data violate the normality assumption of ANOVA and the t-test.

  • Overall Test (Permutation Test of ANOVA):
    • Test statistic: \(\sum_{i=\left\{P,M,D\right\}}{(\widehat{{\mu}_{i}}-\widehat{{\mu}})}^{2}\)
    • significance level = \(5\%\)

  • Pairwise Comparison Test (Permutation Test of t-test):
    • Test statistic: \(\widehat{\mu}_{i}-\widehat{\mu}_{j}\), where \(i, j \in \left \{P,M,D\right\}\) and \(i \neq j\)
    • Bonferroni’s correction: significance level = \(\frac{5\%}{3} \approx 0.0167\).
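
The two permutation tests can be sketched as below. This is an illustration, not the project's code: the function names are hypothetical, \(\widehat{\mu}\) is assumed to be the estimator applied to the pooled data, the pairwise test is made two-sided by taking the absolute difference, and the add-one p-value is one common convention. Any estimator (for example the weighted mean for Area) can be passed in place of the plain mean.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def overall_stat(groups, estimator):
    """Sum over locations of (group estimate - pooled estimate)^2."""
    pooled = estimator([x for g in groups for x in g])
    return sum((estimator(g) - pooled) ** 2 for g in groups)

def pairwise_stat(groups, estimator):
    """Absolute difference of the two group estimates (two-sided)."""
    return abs(estimator(groups[0]) - estimator(groups[1]))

def permutation_pvalue(groups, stat, estimator=mean, n_perm=999, seed=0):
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    observed = stat(groups, estimator)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the location labels by shuffling the pooled values.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if stat(shuffled, estimator) >= observed:
            hits += 1
    # Add-one p-value, so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

The overall p-value is compared with 5%, and each of the three pairwise p-values with the Bonferroni level 5%/3.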

Results for the Hypothesis Test

Bootstrapping CI for Means

Bootstrapping CI for the differences
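
A minimal sketch of bootstrap intervals for a location mean and for the difference between two locations. The slides do not say which bootstrap interval was used, so the percentile method here is an assumption, and the function names are hypothetical.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def percentile_interval(stats, alpha):
    """Percentile interval from bootstrap replicates (approximate quantiles)."""
    stats = sorted(stats)
    b = len(stats)
    lo = stats[int((alpha / 2) * b)]
    hi = stats[int((1 - alpha / 2) * b) - 1]
    return lo, hi

def bootstrap_ci(data, estimator=mean, n_boot=2000, alpha=0.05, seed=0):
    """CI for one location: resample the data with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return percentile_interval(reps, alpha)

def bootstrap_diff_ci(x, y, estimator=mean, n_boot=2000, alpha=0.05, seed=0):
    """CI for estimator(x) - estimator(y): resample each group separately."""
    rng = random.Random(seed)
    reps = [estimator(rng.choices(x, k=len(x)))
            - estimator(rng.choices(y, k=len(y)))
            for _ in range(n_boot)]
    return percentile_interval(reps, alpha)
```

Passing a size-bias-corrected estimator such as the harmonic mean instead of `mean` gives the corresponding interval for Area. A difference interval that excludes 0 points the same way as a significant pairwise test.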

Conclusions

  1. The middle and proximal parts of the muscle fiber cell have significantly larger Area, Perimeter and Circularity.
    • Area: \(\underline{M > P} > D\)
    • Perimeter: \(\underline{M > P} > D\)
    • Circularity: \(\underline{M > P} > D\) & \(M > \underline{P > D}\)
    • Aspect Ratio: \(\underline{P > D > M}\)

  2. The appropriate estimator for the size-biased data is the nonparametric Weighted Mean.

  3. We suggest using Sampling With Replacement (SWR) rather than Sampling Without Replacement (SWOR) in future sampling schemes.

Future Work

  • Find the best estimator for SWOR.
  • Robustness to the distribution assumptions is an interesting topic.
    The nonparametric Weighted Mean gave notably different results from the parametric estimators (MLE for Area and 2TAE for Perimeter). This may be due to improper distribution assumptions on Area and Circularity.
  • Include the effect of the Subsarcolemmal and Interfibrillar groups, and possible interactions.

References

  • Bratic, Ana and Larsson, Nils-Göran. “The Role of Mitochondria in Aging.” Journal of Clinical Investigation 123, no. 3 (2013): 951-57.
  • Cox, D. R. Renewal Theory. London: Methuen, 1962.
  • Patil, G. P. and Ord, J. K. “On Size-Biased Sampling and Related Form-Invariant Weighted Distributions.” Sankhya, Series B 38 (1976): 48-61.
  • Jones, M. C. “Kernel Density Estimation for Length Biased Data.” Biometrika 78, no. 3 (1991): 511-19.

Photos

The end

Questions?