API

Species definitions

The Catalog contains a large number of species and simulation model definitions, which are built using a number of classes defined here. These are usually not intended to be instantiated directly, but should be accessed through the main entry point, get_species().

stdpopsim.get_species(id)[source]

Returns a Species object for the specified id.

Parameters

id (str) – The string identifier for the requested species. E.g. “HomSap”. A complete list of species, and their IDs, can be found in the Catalog.

Returns

An object containing the species definition.

Return type

Species

class stdpopsim.Species[source]

Class representing a species in the catalog.

Variables
  • id (str) – The unique identifier for this species. The species ID is the three first letters of the genus name followed by the first three letters of the species name, and does not contain any spaces or punctuation. The usual scheme is to use the first three letters of the genus and species (similar to the approach used in the UCSC genome browser), e.g., “HomSap” is the ID for Homo Sapiens.

  • name (str) – The full name of this species in binominal nomenclature as it would be used in written text, e.g., “Homo sapiens”.

  • common_name (str) – The name of this species as it would most often be used informally in written text, e.g., “human”, or “Orang-utan”. Where no common name for the species exist, use the most common abbreviation, e.g., “E. Coli”.

  • genome (stdpopsim.Genome) – The Genome instance describing the details of this species’ genome.

  • generation_time (float) – The current best estimate for the generation time of this species in years. Note that individual demographic models in the catalog may or may not use this estimate: each model uses the generation time that was used in the original publication(s).

  • population_size (float) – The current best estimate for the population size of this species. Note that individual demographic models in the catalog may or may not use this estimate: each model uses the population sizes defined in the original publication(s).

  • citations (list) – A list of Citation objects providing the source for the generation time and population size estimates.

  • demographic_models (list) – This list of DemographicModel instances in the catalog for this species.

  • dfes (list) – This list of DFE instances in the catalog for this species.

  • ensembl_id (str) – The ensembl id for the species which is used by maintenance scripts to query ensembl’s database.

get_annotations(id)[source]

Returns a set of annotations with the specified id.

Parameters

id (str) – The string identifier for the set of annotations A complete list of IDs for each species can be found in the “Annotations” subsection for the species in the Catalog.

Return type

Annotation

Returns

A Annotation that holds genome annotation information from Ensembl

get_contig(chromosome=None, genetic_map=None, length_multiplier=1, length=None, mutation_rate=None, inclusion_mask=None, exclusion_mask=None)[source]

Returns a Contig instance describing a section of genome that is to be simulated based on empirical information for a given species and chromosome.

Parameters
  • chromosome (str) – The ID of the chromosome to simulate. A complete list of chromosome IDs for each species can be found in the “Genome” subsection for the species in the Catalog. If the chromosome is not given, we specify a “generic” contig with given length.

  • genetic_map (str) – If specified, obtain recombination rate information from the genetic map with the specified ID. If None, simulate using a default uniform recombination rate on a region with the length of the specified chromosome. The default rates are species- and chromosome- specific, and can be found in the Catalog. (Default: None)

  • length_multiplier (float) – If specified, simulate a region of length length_multiplier times the length of the specified chromosome with the same chromosome-specific mutation and recombination rates. This option cannot currently be used in conjunction with the genetic_map argument.

  • mutation_rate (float) – The per-base mutation rate. If none is given, the mutation rate defaults to the rate specified by species chromosomes.

  • inclusion_mask – If specified, simulated genomes are subset to only inlude regions given by the mask. The mask can be specified by the path and file name of a bed file or as a list or array of intervals given by the left and right end points of the intervals.

  • exclusion_mask – If specified, simulated genomes are subset to exclude regions given by the mask. The mask can be specified by the path and file name of a bed file or as a list or array of intervals given by the left and right end points of the intervals.

  • length (float) – Used with a “generic” contig, specifies the length of genome sequence for this contig. For a generic contig, mutation and recombination rates are equal to the genome-wide average across all autosomal chromosomes.

Return type

Contig

Returns

A Contig describing the section of the genome.

get_demographic_model(id)[source]

Returns a demographic model with the specified id.

Parameters

id (str) – The string identifier for the demographic model. A complete list of IDs for each species can be found in the “Demographic Models” subsection for the species in the Catalog.

Return type

DemographicModel

Returns

A DemographicModel that defines the requested model.

get_dfe(id)[source]

Returns a DFE with the specified id.

Parameters

id (str) – The string identifier for the DFE. A complete list of IDs for each species can be found in the # TODO add that section to the species catalogo “DFE” subsection for the species in the Catalog.

Return type

DFE

Returns

A DFE that defines the requested model.

class stdpopsim.Genome[source]

Class representing the genome for a species.

Variables
  • chromosomes (list) – A list of Chromosome objects.

  • citations (list) – A list of Citation objects providing the source for the genome assembly, mutation rate and recombination rate estimates.

  • length (int) – The total length of the genome.

static from_data(genome_data, *, recombination_rate, mutation_rate, citations)[source]

Construct a Genome object from the specified dictionary of genome information from Ensembl, recombination_rate and mutation_rate dictionaries.

This method is for internal use only.

get_chromosome(id)[source]

Returns the chromosome with the specified id.

Parameters

id (str) – The string ID of the chromosome. A complete list of chromosome IDs for each species can be found in the “Genome” subsection for the species in the Catalog.

Return type

Chromosome

Returns

A Chromosome that defines properties of the chromosome such as length, mutation rate, and recombination rate.

property mean_mutation_rate

The length-weighted mean mutation rate across all chromosomes.

property mean_recombination_rate

The length-weighted mean recombination rate across all chromosomes.

class stdpopsim.Chromosome[source]

Class representing a single chromosome for a species.

Variables
  • id (str) – The string identifier for the chromosome.

  • length (int) – The length of the chromosome.

  • mutation_rate (float) – The mutation rate used when simulating this chromosome.

  • recombination_rate (float) – The recombination rate used when simulating this chromosome (if not using a genetic map).

  • synonyms (list of str) – List of synonyms that may be used when requesting this chromosome by ID, e.g. from the command line interface.

class stdpopsim.Contig[source]

Class representing a contiguous region of genome that is to be simulated. This contains the information about mutation rates and recombination rates that are needed to simulate this region. Genomic element types can be provided.

Variables
  • mutation_rate (float) – The rate of mutation per base per generation.

  • recombination_map (msprime.RateMap) – The recombination map for the region. See the msprime.RateMap for details.

  • mask_intervals (array-like (?)) – Intervals to keep in simulated tree sequence, as a list of (left_position, right_position), such that intervals are non-overlapping and in ascending order. Should have shape Nx2, where N is the number of intervals.

  • exclude (bool) – If True, mask_intervals specify regions to exclude. If False, mask_intervals specify regions in keep.

  • genomic_element_types (list) – a list of GenomicElementType objects. By default, the only element is a single genomic element type that spans the entire contiguous region.

  • mutation_types – a list of ext.MutationType object. By default, the only mutation type is neutral. See ext.MutationType for more details.

Note

To run stdpopsim simulations with alternative, user-specified mutation or recombination rates, a new contig can be created based on an existing one. For instance, the following will create a new_contig that, when simulated, will have double the mutation rate of the old_contig:

new_contig = stdpopsim.Contig(
    mutation_rate=old_contig.mutation_rate * 2,
    recombination_map=old_contig.recombination_map,
    genetic_map=old_contig.genetic_map,
)
add_DFE(intervals, DFE)[source]

Adds the provided DFE to the intervals specified, by making a new genomic element type with the mutation types and proportions specified by the DFE.

Parameters
  • intervals (array) – A valid set of intervals.

  • dfe (DFE) – A DFE object.

add_mutation_types(mutation_types, proportions, genomic_element_type_id)[source]

Adds mutation types with their respective proportions to the genomic element type provided.

property all_intervals_array

Returns an (n, 3) numpy array with the intervals across all genomic element types in the form (left, right, type).

is_neutral()[source]

returns true if the contig has no non-neutral mutation types

static species_contig(*, species, chromosome=None, genetic_map=None, length_multiplier=1, length=None, mutation_rate=None, inclusion_mask=None, exclusion_mask=None)[source]

Build a Contig for a species.

class stdpopsim.Citation[source]

A reference to the literature that should be acknowledged by users of stdpopsim.

Variables
  • doi (str) – The DOI for the publication providing the definitive reference.

  • author (str) – Short author list, .e.g, “Author 1 et. al”.

  • year (int) – Year of publication as a 4 digit integer, e.g. 2008.

assert_valid()[source]

Checks that this citation is valid by checking the types and values of the instance variables.

because(reasons)[source]

Returns a new Citation with the given reasons.

fetch_bibtex()[source]

Retrieve the bibtex of a citation from Crossref.

static merge(citations)[source]

Returns a deduplicated list of Citation objects.

class stdpopsim.Annotation[source]

Class representing an annotation track.

Variables
  • id (str) – String that uniquely identifies the annotation.

  • species (Species) – The species to which this annotation applies.

  • url (str) – The URL where the packed and compressed GFF3 can be found.

  • intervals_url (str) – The URL of the intervals cache of the annotations.

  • intervals_sha256 (str) – The SHA256 checksum of the annotations cache.

  • description (str) – One line description of the annotation.

  • citations (list of Citation) – List of citations for the annotation.

  • file_pattern – The pattern used to map individual chromosome id strings to files

download()[source]

Downloads the intervals URL and stores it in the cache directory.

get_chromosome_annotations(id)[source]

Returns the numpy interval array for the chromosome with the specified id.

is_cached()[source]

Returns True if this annotation is cached locally.

Demographic Models

class stdpopsim.DemographicModel[source]

Class representing a demographic model.

Instances of this class are constructed by model implementors, following the developer documentation. To instead obtain a pre-specified model as listed in the Catalog, see Species.get_demographic_model.

Variables
  • id (str) – The unique identifier for this model. DemographicModel IDs should be short and memorable, and conform to the stdpopsim naming conventions for demographic models.

  • description (str) – A short description of this model as it would be used in written text, e.g., “Three population Out-of-Africa”. This should describe the model itself and not contain author or year information.

  • long_description (str) – A concise, but detailed, summary of the model.

  • generation_time (int) – Mean inter-generation interval, in years.

  • citations (list of Citation) – A list of Citation, that describe the primary reference(s) for the model.

  • mutation_rate (float) – The mutation rate associated with the demographic model. If no mutation rate is given, the species’ default mutation rate is used.

get_samples(*args)[source]

Returns a list of msprime.Sample objects, with the number of samples from each population determined by the positional arguments. For instance, model.get_samples(2, 5, 7) would return a list of 14 samples, two of which are from the model’s first population (i.e., with population ID model.populations[0].id), five are from the model’s second population, and seven are from the model’s third population. The number of of arguments must be less than or equal to the number of “sampling” populations, model.num_sampling_populations; if the number of arguments is less than the number of sampling populations, then remaining numbers are treated as zero.

Todo

This documentation is broken. We’re now returning msprime SampleSet objects.

register_qc(qc_model)[source]

Register a QC model implementation for this model.

class stdpopsim.Population[source]

Class recording metadata representing a population in a simulation.

Variables
  • id (str) – The id of the population.

  • description (str) – a short description of the population

  • sampling_time (int) – an integer value which indicates how many generations prior to the present individuals should samples should be drawn from this population. If None, sampling not allowed from this population (default = 0).

asdict()[source]

Returns a dictionary representing the metadata about this population.

Generic models

The Catalog contains simulation models from the literature that are defined for particular species. It is also useful to be able to simulate more generic models, which are documented here. Please see the Running a generic model for examples of using these models.

class stdpopsim.PiecewiseConstantSize(N0, *args)[source]

Class representing a generic simulation model that can be run to output a tree sequence. This is a piecewise constant size model, which allows for instantaneous population size change over multiple epochs in a single population.

Parameters
  • N0 (float) – The initial effective population size

  • args – Each subsequent argument is a tuple (t, N) which gives the time at which the size change takes place and the population size.

The usage is best illustrated by an example:

model1 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1))  # One change
model2 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1), (t2, N2))  # Two changes
class stdpopsim.IsolationWithMigration(NA, N1, N2, T, M12, M21)[source]

Class representing a generic simulation model that can be run to output a tree sequence. A generic isolation with migration model where a single ancestral population of size NA splits into two populations of constant size N1 and N2 time T generations ago, with migration rates M12 and M21 between the split populations. Sampling is disallowed in population index 0, as this is the ancestral population.

Parameters
  • NA (float) – The initial ancestral effective population size

  • N1 (float) – The effective population size of population 1

  • N2 (float) – The effective population size of population 2

  • T (float) – Time of split between populations 1 and 2 (in generations)

  • M12 (float) – Migration rate from population 1 to 2

  • M21 (float) – Migration rate from population 2 to 1

Example usage:

model1 = stdpopsim.IsolationWithMigration(NA, N1, N2, T, M12, M21)

Simulation Engines

Support for additional simulation engines can be implemented by subclassing the abstract Engine class, and registering an instance of the subclass with register_engine(). These are usually not intended to be instantiated directly, but should be accessed through the main entry point, get_engine().

stdpopsim.get_engine(id)[source]

Returns the simulation engine with the specified id.

Parameters

id (str) – The string identifier for the requested engine. The currently supported engines are “msprime” and “slim”.

Returns

A simulation engine object with a simulate() method.

Return type

Engine

stdpopsim.get_default_engine()[source]

Returns the default simulation engine (msprime).

Return type

Engine

stdpopsim.register_engine(engine)[source]

Registers the specified simulation engine.

Parameters

engine (Engine) – The simulation engine object to register.

class stdpopsim.Engine[source]

Abstract class representing a simulation engine.

To implement a new simulation engine, one should inherit from this class. At a minimum, the id, description and citations attributes must be set, and the simulate() and get_version() methods must be implemented. See msprime example in engines.py.

Variables
  • id (str) – The unique identifier for the simulation engine.

  • description (str) – A short description of this engine.

  • citations (list of Citation) – A list of citations for the simulation engine.

get_version()[source]

Returns the version of the engine.

Return type

str

simulate(demographic_model, contig, samples, *, seed=None, dry_run=False)[source]

Simulates the model for the specified contig and samples. demographic_model, contig, and samples must be specified.

Parameters
  • demographic_model (DemographicModel) – The demographic model to simulate.

  • contig (Contig) – The contig, defining the length, mutation rate, and recombination rate(s).

  • samples (list of msprime.simulations.Sample) – The samples to be obtained from the simulation.

  • seed (int) – The seed for the random number generator.

  • dry_run (bool) – If True, the simulation engine will return None without running the simulation.

Returns

A succinct tree sequence.

Return type

tskit.trees.TreeSequence or None

class stdpopsim.engines._MsprimeEngine[source]

Bases: stdpopsim.engines.Engine

description = 'Msprime coalescent simulator'
id = 'msprime'
simulate(demographic_model, contig, samples, *, seed=None, msprime_model=None, msprime_change_model=None, dry_run=False, **kwargs)[source]

Simulate the demographic model using msprime. See Engine.simulate() for definitions of parameters defined for all engines.

Parameters
  • msprime_model (str) – The msprime simulation model to be used. One of hudson, dtwf, smc, or smc_prime. See msprime API documentation for details.

  • msprime_change_model (list of (float, str) tuples) – A list of (time, model) tuples, which changes the simulation model to the new model at the time specified.

  • dry_run (bool) – If True, end_time=0 is passed to msprime.simulate() to initialise the simulation and then immediately return.

  • **kwargs – Further arguments passed to msprime.sim_ancestry()

class stdpopsim.slim_engine._SLiMEngine[source]

Bases: stdpopsim.engines.Engine

description = 'SLiM forward-time Wright-Fisher simulator'
id = 'slim'
recap_and_rescale(ts, demographic_model, contig, samples, extended_events=None, slim_scaling_factor=1.0, seed=None, **kwargs)[source]

Apply post-SLiM transformations to ts. This rescales node times, does recapitation, simplification, and adds neutral mutations.

If the SLiM engine was used to output a SLiM script, and the script was run outside of stdpopsim, this function can be used to transform the SLiM tree sequence following the procedure that would have been used if stdpopsim had run SLiM itself. The parameters after ts have the same meaning as for simulate(), and the values for demographic_model, contig, samples, and slim_scaling_factor should match those that were used to generate the SLiM script with simulate().

Parameters

ts (tskit.TreeSequence) – The tree sequence output by SLiM.

Warning

The recap_and_rescale() function is provided in the hope that it will be useful. But as we can’t anticipate what changes you’ll make to the SLiM code before using it, the stdpopsim source code should be consulted to determine if the behaviour is appropriate for your case.

simulate(demographic_model, contig, samples, *, seed=None, extended_events=None, slim_path=None, slim_script=False, slim_scaling_factor=1.0, slim_burn_in=10.0, dry_run=False, verbosity=None)[source]

Simulate the demographic model using SLiM. See Engine.simulate() for definitions of the demographic_model, contig, and samples parameters.

Parameters
  • seed (int) – The seed for the random number generator.

  • slim_path (str) – The full path to the slim executable, or the name of a command in the current PATH.

  • slim_script (bool) – If true, the simulation will not be executed. Instead the generated SLiM script will be printed to stdout.

  • slim_scaling_factor (float) – Rescale model parameters by the given value, to speed up simulation. Population sizes and generation times are divided by this factor, whereas the mutation rate, recombination rate, and growth rates are multiplied by the factor. See SLiM manual: 5.5 Rescaling population sizes to improve simulation performance.

  • slim_burn_in (float) – Length of the burn-in phase, in units of N generations.

  • dry_run (bool) – If True, run the first generation setup and then end the simulation.