SpartaABC: A web server to simulate sequences based on indel parameters inferred using an approximate Bayesian computation algorithm
Research Site | Pupko Group
Approximate Bayesian Computation (ABC) is an inference framework originally proposed by Rubin 1984 and Tavaré et al. 1997. It is designed for parameter inference in cases where a likelihood function cannot be formulated or easily computed. This is often the case with complex statistical models, such as the process of insertion and deletion along a phylogenetic tree. ABC bypasses the need to formulate and evaluate a likelihood function directly by relying on a simulation procedure to approximate it. The basic rejection procedure of ABC starts by proposing a parameter set θ*, sampled from a prior distribution of the parameters. Next, θ* is used to simulate data D* under the statistical model defined by θ*. Finally, the distance between D* and the real data D is measured based on numerical features extracted from them (the summary statistics, detailed below). This process is repeated many times, recording the distance between D* and D for each examined θ*. At the end of the procedure, only the cases in which the distance between D* and D was small enough are retained. This yields the subset of proposed θ* that gave rise to the datasets most similar to the real data D. This subset of θ* approximates the posterior distribution of the model parameters.
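The rejection procedure described above can be sketched on a toy problem. The example below is only an illustration under simplifying assumptions: the "model" is a sequence of Bernoulli draws with a single parameter θ rather than an indel process along a tree, the only summary statistic is the fraction of ones, and all function names are hypothetical. It is not the SpartaABC implementation.

```python
import random

def simulate(theta, n, rng):
    """Toy simulator: n Bernoulli draws with success probability theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]

def summary(data):
    """Single summary statistic: the fraction of ones in the dataset."""
    return sum(data) / len(data)

def abc_rejection(real_data, n_sims, n_keep, rng):
    """Return the n_keep proposed thetas whose simulated data are closest to real_data."""
    s_real = summary(real_data)
    results = []
    for _ in range(n_sims):
        theta_star = rng.uniform(0.0, 1.0)                   # propose theta* from a uniform prior
        d_star = simulate(theta_star, len(real_data), rng)   # simulate D* under theta*
        dist = abs(summary(d_star) - s_real)                 # distance between D* and D
        results.append((dist, theta_star))
    results.sort(key=lambda pair: pair[0])                   # smallest distances first
    return [theta for _, theta in results[:n_keep]]

rng = random.Random(0)
real_data = simulate(0.3, 500, rng)   # stands in for the observed dataset D
posterior = abc_rejection(real_data, n_sims=5000, n_keep=100, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
print(round(posterior_mean, 2))
```

The retained values in `posterior` play the role of the approximate posterior sample; their mean is one possible point estimate of θ.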
The difference between the real data D and each simulated instance D* can be computed directly from the datasets themselves. However, it is often more efficient to reduce the dimensionality of the datasets by considering only a set of representative numerical features extracted from each of them. This is especially useful when the probability of simulating a dataset that exactly matches the real data D is very slim. These numerical features are called summary statistics. SpartaABC extracts 27 summary statistics from each dataset (real or simulated). These summary statistics are used to compare each simulated dataset to the real dataset: the weighted Euclidean distance between each simulated dataset and the real dataset is computed from them. The summary statistics computed by SpartaABC were chosen to capture the impact of the different indel parameters on the dataset. For example, the summary statistic “average gap block length” is expected to have a higher value when longer indels are more probable, so it can help in inferring the value of the parameter controlling the indel length. Naturally, all summary statistics are evaluated simultaneously, so that all indel parameters are inferred at once.
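As a rough illustration of how a summary statistic and a weighted Euclidean distance might be computed, the sketch below uses two gap-based statistics on tiny hand-made alignments. Several things here are assumptions made for the example: a gap block is taken to be a maximal run of `-` within one aligned sequence, the distance is written as sqrt(Σ w_i (s_i − s'_i)²), and the weights are arbitrary. The actual definitions and weights used by SpartaABC may differ.

```python
import math
import re

def average_gap_block_length(msa):
    """Mean length of maximal runs of '-' over all rows of an alignment."""
    blocks = [len(run) for row in msa for run in re.findall(r"-+", row)]
    return sum(blocks) / len(blocks) if blocks else 0.0

def gap_block_count(msa):
    """Total number of maximal runs of '-' over all rows of an alignment."""
    return sum(len(re.findall(r"-+", row)) for row in msa)

def summary_statistics(msa):
    """Two illustrative summary statistics (SpartaABC itself uses 27)."""
    return [average_gap_block_length(msa), float(gap_block_count(msa))]

def weighted_euclidean(s_sim, s_real, weights):
    """sqrt(sum_i w_i * (s_sim_i - s_real_i)^2); this weighting form is an assumption."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(s_sim, s_real, weights)))

real_msa = ["AC--GT", "A---GT", "ACGTGT"]   # gap blocks of length 2 and 3
sim_msa = ["AC-GGT", "ACTTGT", "A-GTGT"]    # two gap blocks of length 1
weights = [1.0, 0.5]                        # hypothetical weights

s_real = summary_statistics(real_msa)
s_sim = summary_statistics(sim_msa)
distance = weighted_euclidean(s_sim, s_real, weights)
print(s_real, s_sim, distance)
```

Here the two alignments have the same number of gap blocks but different average block lengths, so the distance is driven entirely by the first statistic.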
The indel parameters inferred by SpartaABC, 'R_I', 'R_D', 'A_I', 'A_D' (under RIM; under SIM, 'R_ID' and 'A_ID' instead) and 'RL', are the model parameters with which each simulation in the ABC procedure is conducted. In each simulation step, a combination of these parameters is sampled at random from uniform (non-informative) prior distributions. The sampled set is provided to the integrated sequence simulator to produce a simulated dataset D*. The summary statistics, in contrast, are numerical features that can be computed from a dataset (simulated or real). They are not inferred or sampled but rather computed directly from the dataset. The summary statistics computed by SpartaABC were chosen to capture the impact of the different indel parameters on the dataset.
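Drawing one parameter combination from independent uniform priors could look like the following sketch. The prior bounds are hypothetical and chosen only for illustration; they are not SpartaABC's actual ranges. Treating 'RL' as an integer is likewise an assumption of this example.

```python
import random

# Hypothetical prior bounds for the RIM parameters (illustration only).
CONTINUOUS_BOUNDS = {
    "R_I": (0.0, 0.05),
    "R_D": (0.0, 0.05),
    "A_I": (1.0, 2.0),
    "A_D": (1.0, 2.0),
}
RL_BOUNDS = (50, 500)  # assumed to be integer-valued in this sketch

def sample_parameters(rng):
    """Draw one parameter combination theta* from independent uniform priors."""
    theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in CONTINUOUS_BOUNDS.items()}
    theta["RL"] = rng.randint(*RL_BOUNDS)  # randint includes both endpoints
    return theta

rng = random.Random(42)
proposals = [sample_parameters(rng) for _ in range(1000)]
print(proposals[0])
```

Each dictionary in `proposals` would then be handed to a sequence simulator to produce one simulated dataset D*.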
The uniform priors make no specific assumption about the divergence of the analyzed sequences; they serve as a tool to suggest parameter combinations to simulate from. Coupled with a large number of simulation steps, they cover combinations from the whole parameter space. Sequence divergence and the phylogenetic divergence among species are explicitly handled by simulating all pseudo-MSAs along the tree provided with (or inferred from) the real sequence dataset. This means the tree used for the simulations has the same branch lengths as the tree of the real sequences, so the simulated sequences have divergence comparable to that of the real sequences.
We follow the practice suggested by Beaumont et al. 2002 and set the distance cutoff, ε, empirically, so that the accepted parameter combinations make up p% of the total simulations. When p% is large, more parameter combinations are used to estimate the parameters, leading to a more stable inference. However, a large p% also means that parameter combinations behind MSAs with a larger distance from the input MSA are included in the estimation. Thus, there is a trade-off between the quantity and the quality of the subset of simulations used for inference. We conducted a simulation study to determine the number of simulations to retain and found that retaining 100 parameter combinations offers the most accurate indel parameter estimates.
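The empirical choice of ε can be illustrated as follows: rank all simulations by distance, keep a fixed number of the closest ones, and read ε off as the largest retained distance. This is a minimal sketch with made-up distances and a single parameter; the function names are hypothetical, and the posterior mean is used here only as one possible point estimate.

```python
def retain_closest(simulations, n_keep):
    """simulations: list of (distance, params) pairs.
    Returns the n_keep closest pairs and the implied cutoff epsilon."""
    ranked = sorted(simulations, key=lambda sim: sim[0])
    kept = ranked[:n_keep]
    epsilon = kept[-1][0]   # largest distance among the retained simulations
    return kept, epsilon

def posterior_means(kept):
    """Average each parameter over the retained parameter combinations."""
    names = kept[0][1].keys()
    return {name: sum(params[name] for _, params in kept) / len(kept) for name in names}

# Made-up example: 101 simulations whose distance is smallest around R_I = 0.05.
simulations = [(abs(i - 50) / 10, {"R_I": i / 1000}) for i in range(101)]

kept, epsilon = retain_closest(simulations, n_keep=5)
accepted_percent = 100 * len(kept) / len(simulations)   # the resulting p%
estimates = posterior_means(kept)
print(epsilon, round(accepted_percent, 2), estimates)
```

Raising `n_keep` increases p% and stabilizes the averages, but it also raises ε, admitting simulations that are farther from the real data.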
In theory, SpartaABC can evaluate any number of sequences. However, the running time increases substantially with larger datasets. We therefore recommend limiting SpartaABC to 400 sequences. Users who wish to analyze more sequences are advised to sample a subset of the sequences before using the SpartaABC web server. Such sampling is possible using, for example, CD-HIT.
SpartaABC is a stochastic inference method. In each step, parameters are drawn from a prior distribution (the first source of randomness). Next, MSAs are stochastically simulated based on these parameters (the second source of randomness). Because of this, results are likely to differ slightly among runs. However, the larger the dataset, the more stable the results.