Inference

To run inference on your data with sweepLink, specify the infer task.

Example run command:

  $ ./sweepLink infer --meta sweepLink_meta.txt --counts sweepLink_alleleCounts.txt --nGridPoints 50 --numThreads 8 --h.update false --numStatesAbsS 18 --grid_abs_s 0.008,0.011,0.015,0.019,0.024,0.031,0.036,0.044,0.0520,0.061,0.072,0.084,0.098,0.113,0.131,0.151,0.174,0.2 --forwardPrior betaSeparate

Required Input Data

SweepLink requires three options as input:

  1. Meta file (--meta option)

    This file describes the populations and time points of the genetic data, with one line per population/time point sample. It consists of three tab-separated columns: label, sampling time, and population.

    The labels are used in the allele counts file. Note that time runs forward: 0 is in the past and larger values are more recent.

    Example of meta file:

       time_0_pop_0    0       pop0
       time_10_pop_0   10      pop0
       time_20_pop_0   20      pop0
       time_30_pop_0   30      pop0
       time_40_pop_0   40      pop0
       time_50_pop_0   50      pop0
       time_60_pop_0   60      pop0
       time_70_pop_0   70      pop0
       time_80_pop_0   80      pop0
       time_90_pop_0   90      pop0
    
  2. Allele counts file (--counts option)
    This file contains allele counts for each population and time point specified in the meta file. It consists of the following columns: chromosome, position, and one allele count column for each (time point + population) label. Each allele count is written as C/N, where C is the derived allele count and N is the total number of haploid samples. Missing data can be marked as 0/0.

    Example of allele counts file:

       -       -       time_90_pop_0   time_80_pop_0   time_70_pop_0   time_60_pop_0   time_50_pop_0   time_40_pop_0   time_30_pop_0   time_20_pop_0   time_10_pop_0   time_0_pop_0
       1       3482    0/200   0/200   3/200   2/200   3/200   1/200   2/200   8/200   7/200   17/200
       1       4576    1/200   1/200   3/200   0/200   0/200   0/200   0/200   0/200   0/200   0/200
       1       6981    2/200   1/200   0/200   0/200   0/200   0/200   0/200   0/200   0/200   0/200
    
  3. Mutation rates (--mu_a_A and --mu_A_a options)

    One has to specify two mutation rates: the rate of mutation from the ancestral allele to the derived allele (--mu_a_A) and the rate of mutation in the opposite direction (--mu_A_a).
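
The two input files can be checked against each other before a run. The Python sketch below is only an illustration: it reads a shortened meta file and allele counts file, verifies that every label in the counts header appears in the meta file, and converts each C/N entry into a derived allele frequency, with 0/0 treated as missing. All function names are hypothetical, splitting columns on whitespace is an assumption, and the 0/0 entry in the second data row was put there to show the missing-data case.

```python
# Hypothetical helpers for checking sweepLink input files before a run.
# Assumption: columns are separated by whitespace (tabs in the examples).

META_TEXT = "time_0_pop_0\t0\tpop0\ntime_10_pop_0\t10\tpop0\n"
COUNTS_TEXT = (
    "-\t-\ttime_10_pop_0\ttime_0_pop_0\n"
    "1\t3482\t7/200\t17/200\n"
    "1\t4576\t0/0\t0/200\n"  # 0/0 inserted here to illustrate missing data
)


def parse_meta(text):
    """Return {label: (time, population)} from three-column meta lines."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        label, time, population = line.split()
        meta[label] = (float(time), population)
    return meta


def parse_count(entry):
    """Parse 'C/N' into (C, N); return None for the missing marker 0/0."""
    derived, total = (int(part) for part in entry.split("/"))
    if derived < 0 or derived > total:
        raise ValueError(f"invalid allele count: {entry!r}")
    return None if total == 0 else (derived, total)


def parse_counts(text, meta):
    """Yield (chromosome, position, {label: frequency or None}) per locus."""
    lines = [line for line in text.splitlines() if line.strip()]
    labels = lines[0].split()[2:]  # the first two header fields are placeholders
    unknown = [label for label in labels if label not in meta]
    if unknown:
        raise ValueError(f"labels missing from meta file: {unknown}")
    for line in lines[1:]:
        chromosome, position, *entries = line.split()
        if len(entries) != len(labels):
            raise ValueError(f"wrong number of count columns: {line!r}")
        freqs = {}
        for label, entry in zip(labels, entries):
            count = parse_count(entry)
            freqs[label] = None if count is None else count[0] / count[1]
        yield chromosome, int(position), freqs


meta = parse_meta(META_TEXT)
loci = list(parse_counts(COUNTS_TEXT, meta))
for chromosome, position, freqs in loci:
    print(chromosome, position, freqs)
```

To check real files, the two text constants can be replaced by the contents of the files passed to --meta and --counts.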

Additional Input Options

Several additional options can be specified for an inference run.

  1. The grid for the Wright-Fisher diffusion equation (--nGridPoints option and flags for grid type)

    One can specify the number of grid points used by the numerical method with the --nGridPoints option.

    By default the grid is quadratic; it can be switched to a uniform grid (--uniformGrid) or a logarithmic grid (--logisticGrid). (TODO: describe grid types)

  2. Partial Differential Equation solver (--PDEmethod option and --backward flag)

    Two methods are available for the PDE solver, the numerical method that solves the Wright-Fisher diffusion equation:

    • --PDEmethod CC - use the Chang-Cooper numerical scheme (default)

    • --PDEmethod CN - use the Crank-Nicolson numerical scheme; it is less unstable according to our experiments

    The second option is the --backward flag, which makes sweepLink use a backward pass to evaluate the likelihood of the data at one locus for given demographic parameters. By default, it uses a forward pass with a prior (see the prior options below).

  3. Prior for forward pass (--forwardPrior option)

    Three priors are available for the forward pass used for likelihood evaluation in sweepLink:

    • --forwardPrior uniform specifies a uniform prior distribution

    • --forwardPrior beta specifies a beta prior distribution for each population; the parameters of these distributions (alpha and beta for each population) are estimated along with the demographic parameters

    • --forwardPrior betaSeparate specifies beta prior distributions that differ between loci that are segregating at the first time point and loci that start to segregate at some intermediate time point (default)

    The idea behind the last option is that the prior should differ between loci that were just introduced and loci that were introduced in the past and have not been lost. (TODO: better explanation)

  4. Selection estimation (--numStatesAbsS, --grid_abs_s, --abs_s_max options)

    One can set a specific grid of selection coefficients. The --numStatesAbsS N option sets the number of positive selection coefficients on the grid. The maximum selection coefficient can be set with the --abs_s_max X option, where X lies in the interval (0, 1]. If --numStatesAbsS and --abs_s_max are set, sweepLink uses a uniform grid with the required number of points. A non-uniform grid can be specified with the --grid_abs_s X1,X2,X3,X4...,Xn option, where X1,X2,X3,X4...,Xn are the specific positive grid points.

    One can disable selection inference with the --s.update false option. By default, all loci are then treated as neutral. To fix selection for each locus to a specific value, use the option --s <file_with_values>, where <file_with_values> is a file that contains indices of selection values in the grid on each line.

  5. Dominance estimation (use --h.update false to avoid estimation of dominance coefficients)
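
The --grid_abs_s value in the example run command is a comma-separated list of increasing positive coefficients, passed together with a matching --numStatesAbsS. The Python sketch below shows one way such a pair of arguments could be generated. It is an illustration under stated assumptions: the function names are hypothetical, and the geometric spacing is chosen only for illustration (it does not reproduce the values of the example command and is not something sweepLink prescribes).

```python
# Hypothetical helper for building a non-uniform selection grid argument.
# Assumption: geometric spacing is just one possible choice of grid points.

def geometric_grid(s_min, s_max, n):
    """Return n geometrically spaced values from s_min to s_max."""
    if n < 2 or not 0 < s_min < s_max <= 1:
        raise ValueError("need n >= 2 and 0 < s_min < s_max <= 1")
    ratio = (s_max / s_min) ** (1 / (n - 1))
    return [s_min * ratio**i for i in range(n)]


def grid_arguments(values, digits=4):
    """Format grid points as command-line arguments, mirroring the example command."""
    rounded = [round(value, digits) for value in values]
    if rounded[0] <= 0:
        raise ValueError("grid points must be positive")
    if any(b <= a for a, b in zip(rounded, rounded[1:])):
        raise ValueError("grid points must be strictly increasing after rounding")
    grid = ",".join(f"{value:g}" for value in rounded)
    return ["--numStatesAbsS", str(len(rounded)), "--grid_abs_s", grid]


args = grid_arguments(geometric_grid(0.008, 0.2, 18))
print(" ".join(args))
```

The printed arguments can be appended to a sweepLink infer command in place of the hand-written grid.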