Development¶
stdpopsim is a community effort, and we welcome YOU to join us!
We envision at least three main types of stdpopsim developers:
Demographic model contributors
API developers
Documentation and tutorial curators
Demographic model contributors add simulation code for published models. This could be your own published model or any other published model you think would be useful. This is the main way we envision biologists continually adding to the catalog of available models, and it is a great first step for new contributors to learn the ins and outs of stdpopsim development. See the sections Adding a new demographic model and Demographic model review process to get started.
API developers work on infrastructure development for the PopSim Consortium, which could include improvements and additions to the internal code base of stdpopsim, establishment of benchmarking pipelines, and new projects that align with consortium goals.
Documentation and tutorial curators help maintain the documentation and tutorials. This can be as easy as pointing out confusing bits of the documentation in a GitHub issue, or adding to or editing the documentation. See the section Documentation.
Get in touch with the stdpopsim community by subscribing to our email listserv and by creating and commenting on GitHub issues. There is a lot of chatter through GitHub, and we’ve been building code there cooperatively.
If you want to help out and don’t know where to start, you can look through the list of Good first issues or Help wanted issues.
To get started helping with stdpopsim development, please read the following sections to learn how to contribute. And, importantly, please have a look at our code of conduct.
Installation¶
Before installing, be sure to make a fork of the repo and clone it locally following the instructions in the GitHub Workflow.
The stdpopsim library requires Python 3.4 or later.
For pip users, install the packages required for development using:
$ python3 -m pip install -r requirements/development.txt
You can then install the development version of stdpopsim like this:
$ python3 setup.py install
For conda users, you will need to add the conda-forge channel to your conda environment, and you should then be able to install the development requirements using:
$ conda config --add channels conda-forge
$ conda install --file=requirements/development.txt
We do require msprime, so please see the installation notes if you encounter problems with it.
Note
If you have trouble installing any of the requirements, your pip may be the wrong version. Try pip3 install -r requirements/development.txt
Using a Virtual Environment¶
We encourage the use of a virtual environment. For pip, you can use venv.
First, create the virtual environment (you only need to do this once):
$ python3 -m venv stdpopsim_env
Next, activate the virtual environment:
$ source stdpopsim_env/bin/activate
You will then see the virtual environment in your prompt, like so:
(stdpopsim_env) $
Once the virtual environment is activated, install the requirements:
(stdpopsim_env) $ python3 -m pip install -r requirements/development.txt
You can then run any of the code in the virtual environment with the packages installed, without conflicting with other packages in your local environment. To deactivate the virtual environment:
(stdpopsim_env) $ deactivate
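If you are unsure whether the environment is active before installing the requirements, a quick check from Python can help. The sketch below is not part of stdpopsim; the function name is made up for illustration, and it relies only on the standard library behaviour that sys.prefix and sys.base_prefix differ inside a venv.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    state = "active" if in_virtual_environment() else "not active"
    print(f"Virtual environment is {state}: {sys.prefix}")
```

Running this with the interpreter from stdpopsim_env should report the environment as active; running it with the system interpreter should not.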
GitHub workflow¶
Make your own fork of the stdpopsim repository on GitHub, and clone a local copy.
Install the pre-commit hooks with:
$ pre-commit install
Make sure that your local repository has been configured with an upstream remote.
Create a “topic branch” to work on. One reliable way to do it is to follow this recipe:
$ git fetch upstream
$ git checkout upstream/main
$ git checkout -b topic_branch_name
As you work on your topic branch you can add commits to it. Once you’re ready to share this, you can then open a pull request. Your PR will be reviewed by some of the maintainers, who may ask you to make changes.
If your topic branch has been around for a long time and has gotten out of date with the main repository, we might ask you to rebase. Please see the next section on how to rebase.
Pre-commit checks¶
On each commit a pre-commit hook will run that checks for violations of code style and other common problems. Where possible, these hooks will try to fix any problems that they find (including reformatting your code to conform to the required style). In this case, the commit will not complete and will report that “files were modified by this hook”. To include the changes that the hooks made, git add any files that were modified and run git commit again (or use git commit -a to commit all changed files).
If you would like to run the checks without committing, use pre-commit run (but note that this will only check changes that have been staged; use pre-commit run --all-files to check every file in the repository, including unstaged changes).
To bypass the checks (to save or get feedback on work-in-progress) use git commit --no-verify.
Rebasing¶
Rebasing is used for two basic tasks we might ask for during review:
Your topic branch has gotten out of date with the tip of upstream/main and needs to be updated.
Your topic branch has lots of messy commits, which need to be cleaned up by “squashing”.
Rebasing in git basically means changing where your branch forked off the main code in upstream/main. A good way of visualising what’s happening is to look at the Network view on GitHub. This shows you all the forks and branches that GitHub knows about and how they relate to the main repository. Rebasing lets you change where your branch splits off.
To see this for your local repo on your computer, you can look at the Git graph output via the command line:
$ git log --decorate --oneline --graph
This will show something like:
|* 923ab2e Merge pull request #9 from mcveanlab/docs-initial
|\
| * 0190a92 (origin/docs-initial, docs-initial) First pass at development docs.
| * 2a5fc09 Initial outline for docs.
| * 1ccb970 Initial addition of docs infrastructure.
|/
* c49601f Merge pull request #8 from mcveanlab/better-genomes
|\
| * fab9310 (origin/better-genomes, better-genomes) Added pongo tests.
| * 62c9560 Tidied up example.
| * 51e21e8 Added basic tests for population models.
| * 6fff557 Split genetic_maps into own module.
| * 90d6367 Added Genome concept.
| * e2aaf95 Changed debug to info for logging on download.
| * 2fbdfdc Added badges for CircleCI and CodeCov.
|/
* c66b575 Merge pull request #5 from mcveanlab/tests-ci
|\
| * 3ae454f (origin/tests-ci, tests-ci) Initial circle CI config.
| * c39415a Added basic tests for genetic map downloads.
|/
* dd47000 Merge pull request #3 from mcveanlab/recomb-map-infrastructure
|\
This shows a nice, linear git history: we can see four pull requests, each of which consists of a small number of meaningful commits. This is the ideal that we’re aiming for, and git allows us to achieve it by rewriting history as much as we want within our own forks (we never rewrite history in the upstream repository, as this would cause problems for other developers). Having a clean, linear git history is a good idea for lots of reasons, not least of which is making git bisect easier.
One of the most useful things that we can do with rebasing is to “squash” commits so that we remove some noise from the git history. For example, this PR (on the branch topic_branch_name) currently looks like:
$ git log --decorate --oneline --graph
* 97a9458 (HEAD -> topic_branch_name) DONE!!!
* c9c4a28 PLEASE work, CI!
* ad4c807 Please work, CI!
* 0fe6dc4 Please work, CI!
* 520e6ac Add documentation for rebasing.
* 20fb835 (upstream/main) Merge pull request #22 from mcveanlab/port-tennyson
|\
| * b3d45ea (origin/port-tennyson, port-tennyson) Quickly port Tennesen et al model.
|/
* 79d26b4 Merge pull request #20 from andrewkern/fly_model
|\
Here, in my initial commit (520e6ac) I’ve added some updated documentation for rebasing. Then, there are four more commits where I’m trying to get CI to pass. History doesn’t need to know about this, so I can rewrite it using rebase:
$ git fetch upstream
$ git rebase -i upstream/main
We first make sure that we’re rebasing against the most recent version of the upstream repo. Then, we ask git to perform an interactive rebase against the upstream/main branch. This starts up your editor, showing something like this:
pick 520e6ac Add documentation for rebasing.
pick 0fe6dc4 Please work, CI!
pick ad4c807 Please work, CI!
pick c9c4a28 PLEASE work, CI!
pick 97a9458 DONE!!!
# Rebase 20fb835..97a9458 onto 20fb835 (5 commands)
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out
We want git to squash all five commits into one, so we keep the first commit as pick, mark the other four as s (squash), and edit the rebase instructions to look like:
pick 520e6ac Add documentation for rebasing.
s 0fe6dc4 Please work, CI!
s ad4c807 Please work, CI!
s c9c4a28 PLEASE work, CI!
s 97a9458 DONE!!!
# Rebase 20fb835..97a9458 onto 20fb835 (5 commands)
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out
After performing these edits, we then save and close. Git will try to do the rebasing, and if successful will open another editor screen that lets you edit the text of the commit message:
# This is a combination of 5 commits.
# This is the 1st commit message:
Add documentation for rebasing.
# This is the commit message #2:
Please work, CI!
# This is the commit message #3:
Please work, CI!
# This is the commit message #4:
PLEASE work, CI!
# This is the commit message #5:
DONE!!!
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Date: Tue Mar 5 17:00:39 2019 +0000
#
# interactive rebase in progress; onto 20fb835
# Last commands done (5 commands done):
# squash c9c4a28 PLEASE work, CI!
# squash 97a9458 DONE!!!
# No commands remaining.
# You are currently rebasing branch 'topic_branch_name' on '20fb835'.
#
# Changes to be committed:
# modified: docs/development.rst
#
#
We don’t care about the commit messages for the squashed commits, so we delete them and end up with:
Add documentation for rebasing.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Date: Tue Mar 5 17:00:39 2019 +0000
#
# interactive rebase in progress; onto 20fb835
# Last commands done (5 commands done):
# squash c9c4a28 PLEASE work, CI!
# squash 97a9458 DONE!!!
# No commands remaining.
# You are currently rebasing branch 'topic_branch_name' on '20fb835'.
#
# Changes to be committed:
# modified: docs/development.rst
After saving and closing this editor session, we then get something like:
[detached HEAD 6b8a2a5] Add documentation for rebasing.
Date: Tue Mar 5 17:00:39 2019 +0000
1 file changed, 2 insertions(+), 2 deletions(-)
Successfully rebased and updated refs/heads/topic_branch_name.
Finally, after a successful rebase, you must force-push! If you try to push without specifying -f, you will get a very confusing and misleading message:
$ git push origin topic_branch_name
To github.com:jeromekelleher/stdpopsim.git
! [rejected] topic_branch_name -> topic_branch_name (non-fast-forward)
error: failed to push some refs to 'git@github.com:jeromekelleher/stdpopsim.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
DO NOT LISTEN TO GIT IN THIS CASE! Git is giving you terrible advice which will mess up your branch. What we need to do is replace the state of the branch topic_branch_name on your fork on GitHub (the origin remote) with the state of your local branch, topic_branch_name. We do this by “force-pushing”:
$ git push -f origin topic_branch_name
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 4.33 KiB | 1.44 MiB/s, done.
Total 4 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:jeromekelleher/stdpopsim.git
+ 6b8a2a5...d033ffa topic_branch_name -> topic_branch_name (forced update)
Success! We can check the history again to see if everything looks OK:
$ git log --decorate --oneline --graph
* d033ffa (HEAD -> topic_branch_name, origin/topic_branch_name) Add documentation for rebasing.
* 20fb835 (upstream/main) Merge pull request #22 from mcveanlab/port-tennyson
|\
| * b3d45ea (origin/port-tennyson, port-tennyson) Quickly port Tennesen et al model.
|/
* 79d26b4 Merge pull request #20 from andrewkern/fly_model
|
This looks just right: we have one commit, pointing to the head of upstream/main, and have successfully squashed and rebased.
When rebasing goes wrong¶
Sometimes rebasing goes wrong, and you end up in a frustrating loop of making and undoing the same changes over and over again. In this case, it can be simplest to make a diff of your current changes, and apply these in a single commit. First we take the diff between the current state of the files in our branch and upstream/main and save it as a patch:
$ git diff upstream/main > changes.patch
After that, we can check out a fresh branch and check if everything works as it’s supposed to:
$ git checkout -b test_branch upstream/main
$ patch -p1 < changes.patch
$ git commit -a
# check things work
After we’ve verified that everything works, we then check out the original topic branch and replace it with the state of the test_branch, and finally force-push to the remote topic branch on your fork:
$ git checkout topic_branch_name
$ git reset --hard test_branch
$ git push -f origin topic_branch_name
Hard resetting and force pushing are not reversible operations, so please beware!
Adding a new demographic model¶
Steps for adding a new demographic model:
If this is your first time implementing a demographic model in stdpopsim, it’s a good idea to take some time browsing the Catalog and species’ demographic models in the source code to see how existing models are typically written and documented. If you have any questions or confusion about formatting or implementing demographic models, please don’t hesitate to open an issue – we’re more than happy to answer any questions and help get you up and running.
What models are appropriate to add?¶
Stdpopsim supports any demographic model from the published literature that gives enough information to be able to define msprime demography objects. At a minimum, that includes population sizes and the timing of demographic events. These values need to either be given in “physical” units (that is, raw population sizes and time units in generations), or be able to be converted to physical units using, e.g., mutation rates used in the published study.
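As an illustration of the unit conversion mentioned above, here is a minimal sketch that assumes the study reports parameters in the scaling used by tools such as dadi: theta = 4 * N_ref * mu * L, population sizes relative to N_ref, and times in units of 2 * N_ref generations. Other methods scale differently, so check the conventions of the source publication. The function name and all numbers are hypothetical.

```python
def scaled_to_physical(theta, mutation_rate, sequence_length, nu, scaled_time):
    """Convert dadi-style scaled parameters to physical units.

    Assumes theta = 4 * N_ref * mu * L, sizes expressed relative to N_ref,
    and times expressed in units of 2 * N_ref generations.
    """
    # Reference population size implied by theta, mu, and sequence length.
    n_ref = theta / (4 * mutation_rate * sequence_length)
    return {
        "N_ref": n_ref,
        "N": nu * n_ref,  # raw population size
        "T_generations": 2 * n_ref * scaled_time,  # time in generations
    }


# Hypothetical values chosen only to make the arithmetic easy to follow.
physical = scaled_to_physical(
    theta=4000.0,
    mutation_rate=1e-8,
    sequence_length=1e7,
    nu=0.5,
    scaled_time=0.1,
)
print(physical)
```

With these placeholder inputs the reference size is about 10,000, the relative size 0.5 becomes about 5,000 individuals, and the scaled time 0.1 becomes about 2,000 generations.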
Note that it is not necessary that the demographic model is attached to a particular species. Stdpopsim contains a collection of generic models that are widely used in developing and testing inference methods. If there is a generic model that does not currently exist in our catalog but would be useful to include, we also welcome those contributions. Again, you should provide a citation for a generic model, or it should be commonly used.
Fork the repository and create a branch¶
Before implementing any model, be sure to have forked the stdpopsim repository and cloned it locally, following the instructions in the GitHub Workflow section. Models are first implemented and tested locally, and then submitted as a pull request to the stdpopsim repository, at which point they are verified by another developer before being fully supported within stdpopsim.
Write the model function in the catalog source code¶
In the stdpopsim catalog source code (found in stdpopsim/catalog/), each species has a module that defines all of the necessary functions to run simulations for that species, including the demographic model. In each species module, you will see that each type of function is divided by comments, such as:
###########################################################
#
# Demographic models
#
###########################################################
Go to the Demographic models section of the source code. The demographic model function should follow this format:
def _model_func_name():
    id = "FILL ME"
    description = "FILL ME"
    long_description = """
    FILL ME
    """
    populations = [
        stdpopsim.Population(id="FILL ME", description="FILL ME"),
    ]
    citations = [
        stdpopsim.Citation(
            author="FILL ME",
            year="FILL ME",
            doi="FILL ME",
            reasons={stdpopsim.CiteReason.DEM_MODEL},
        )
    ]
    generation_time = "FILL ME"
    mutation_rate = "FILL ME"

    # parameter value definitions based on published values

    return stdpopsim.DemographicModel(
        id=id,
        description=description,
        long_description=long_description,
        populations=populations,
        citations=citations,
        generation_time=generation_time,
        mutation_rate=mutation_rate,
        population_configurations=["FILL ME"],
        migration_matrix=["FILL ME"],
        demographic_events=["FILL ME"],
    )


_species.add_demographic_model(_model_func_name())
The demographic model should include the following:
id: A unique, short-hand identifier for this demographic model. This id contains a short description written in camel case, followed by an underscore, and then four characters (the number of sampled populations, the first letter of the name of the first author, and the year the study was published). For example, the Gutenkunst et al. (2009) Out of Africa demographic model has the id “OutOfAfrica_3G09”. See Naming conventions for more details.
description: A brief one-line description of the demographic model.
long_description: A longer description (say, a concise paragraph) that describes the model in more detail.
populations: A list of stdpopsim.Population objects, which have their own id and description. For example, the Thousand Genomes Project Yoruba panel could be defined as stdpopsim.Population(id="YRI", description="1000 Genomes YRI (Yorubans)").
citations: A list of stdpopsim.Citation objects for the appropriate citation for this model. The citation object requires author, year, and doi information, and a specified reason for citing this model.
generation_time: The generation time for the species in years. If you are implementing a generic model, the generation time should default to 1.
mutation_rate: The mutation rate assumed during the inference of this demographic model, if a mutation rate was used. If no mutation rate is associated with this demographic model, which is generally uncommon but possible, depending on the inference method, the mutation rate should be set to None.
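The id convention described above can be sketched as a small helper. This function is hypothetical and not part of stdpopsim; it only mirrors the pattern in the example id, assuming the year is written with its last two digits.

```python
def make_model_id(
    description: str, num_populations: int, first_author: str, year: int
) -> str:
    """Assemble a model id such as "OutOfAfrica_3G09" (illustrative helper)."""
    # Suffix: number of sampled populations, first letter of the first
    # author's name, and the last two digits of the publication year.
    suffix = f"{num_populations}{first_author[0].upper()}{year % 100:02d}"
    return f"{description}_{suffix}"


print(make_model_id("OutOfAfrica", 3, "Gutenkunst", 2009))  # OutOfAfrica_3G09
```

Always check the result against the Naming conventions section, which remains the authority on ids.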
Every demographic model has a few necessary features or attributes. First of all, demographic models are defined by the population sizes, migration rates, split and admixture times, and generation lengths given in the source publication. We often take the point estimates for each of the values from the best fit model (for example, the parameters that give the maximum likelihood fit), which are translated into msprime-formatted demographic inputs.
Msprime-defined demographic models are specified through the population_configurations, migration_matrix, and demographic_events. If this is your first time specifying a model using msprime, it’s worth taking some time to read through the msprime documentation and tutorials.
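To make the shape of one of these inputs concrete, here is a dependency-free sketch of a migration matrix. msprime expects an N x N array of per-generation rates with zeros on the diagonal; the helper name, the assumption of symmetric migration, and the rates below are placeholders for illustration, not values from any published model.

```python
def symmetric_migration_matrix(num_populations, rates):
    """Build an N x N matrix of per-generation migration rates.

    `rates` maps an (i, j) pair of population indices to a rate that is
    applied in both directions. Diagonal entries stay at zero.
    """
    matrix = [[0.0] * num_populations for _ in range(num_populations)]
    for (i, j), rate in rates.items():
        if i == j:
            raise ValueError("migration within a population is not meaningful")
        matrix[i][j] = rate
        matrix[j][i] = rate
    return matrix


# Hypothetical three-population example; the rates are placeholders.
migration_matrix = symmetric_migration_matrix(3, {(0, 1): 2.5e-4, (1, 2): 1e-5})
for row in migration_matrix:
    print(row)
```

A nested list like this is the kind of value that would be passed as migration_matrix; asymmetric models need each direction set separately.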
Write parameter table¶
The parameters used in the implementation must also be listed in a csv file in the docs/parameter_tables directory. This ensures that the documentation for this model displays the parameters. Take a look at the csv files currently in docs/parameter_tables for inspiration. The csv file should have the format:
Parameter Type (units), Value, Description
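One way to produce rows in this three-column layout is Python's csv module, which also takes care of quoting values that contain commas. The rows below are invented for illustration; real values must come from the source publication, and the existing files in docs/parameter_tables remain the reference for layout.

```python
import csv
import io

# Hypothetical rows: (parameter type with units, value, description).
rows = [
    ("Population size", "10,000", "Ancestral population size"),
    ("Epoch time (gen.)", "500", "Time of the population split"),
]

# Write to an in-memory buffer here; a real table would be written to a
# file inside docs/parameter_tables instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
table_text = buffer.getvalue()
print(table_text)
```

Note that the value "10,000" is written with surrounding quotes so that it is still read back as a single column.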
We can check that the documentation builds properly after implementation by running make in the docs directory and opening the Catalog page from the docs/_build/ directory. See Documentation for more details.
Test the model locally¶
Once you have written the demographic model function, you should test the model locally with stdpopsim. Follow the development Installation instructions to install the development stdpopsim version along with the requirements.
Now check that your new demographic model function has been imported:
import stdpopsim

species = stdpopsim.get_species("HomSap")
for x in species.demographic_models:
    print(x.id)
# OutOfAfrica_3G09
# OutOfAfrica_2T12
# Africa_1T12
# AmericanAdmixture_4B11
# OutOfAfricaArchaicAdmixture_5R19
# Zigzag_1S14
# AncientEurasia_9K19
# PapuansOutOfAfrica_10J19
# AshkSub_7G19
# OutOfAfrica_4J17
The example above lists the imported demographic models for humans. You should replace "HomSap" with whichever species you added your model to. Your new model should be printed along with the currently available demographic models.
Note
If your demographic model is not printed, check that, after defining your model function, you included the call _species.add_demographic_model(_model_func_name()), where _model_func_name() is your model function name. If you are still having trouble, check the GitHub issues, or open an issue.
Next, check that you can successfully run a simulation with your new model with the Python API. See Running stdpopsim with the Python interface (API) for more details.
Submit a Pull Request on GitHub¶
Once you have implemented the demographic model locally, including documentation, the next step is to open a pull request with this addition. See the GitHub workflow for more details.
So the model is implemented. What next?¶
Now at this point, most of your work is done! The model is reviewed and verified following the Demographic model review process by an independent member of the development team, and there may be some discussion about formatting and to clear up any confusing bits of the demographic parameters before the model is fully incorporated into stdpopsim.
Thank you for your contribution, and welcome to the stdpopsim development team!
Demographic model review process¶
When Developer A creates a new demographic model on their local fork they must follow these steps for it to be officially supported by stdpopsim:
Developer A submits a PR to add a new model to the catalog. This includes full documentation (i.e., the documentation that will be rendered on Read the Docs). The code is checked for any obvious problems, style issues, etc. by a maintainer and merged when it meets these basic standards. The new catalog model is considered ‘preliminary’.
Developer A creates an issue tracking the QC for the model which includes information about the primary sources used to create the model and the population indices used for their msprime implementation. To create a new Model QC issue, click “New issue” from the “Issues” tab on GitHub, and click “Get started” to use the Model QC issue template. Follow the template to include the necessary information in the issue. Developer B is then assigned/volunteers to do a blind implementation of the model.
Developer B creates a blind implementation of the model in the stdpopsim/qc/species_name.py file, remembering to register the QC model implementation (see other QC models for examples). Note that if you are adding a new species you will have to add a new import to stdpopsim/qc/__init__.py.
Developer B runs the unit tests to verify the equivalence of the catalog and QC model implementations.
Developer B then creates a PR, and all being good, this PR is merged and the QC issue is closed.
Arbitration¶
When developers A and B disagree on the model implementation, the process is to:
Try to hash out the details between them on the original issue thread.
If this fails, contact the authors of the original publication to resolve ambiguities.
If changes have to be made to the production model Developer A submits a PR with the hotfix for the production model. Developer B then rebases the branch containing their PR against the main branch to check for model equality. Repeat steps 1-3 until this is achieved. If changes have to be made to the QC model they are committed to the branch where the QC PR originates from.
Adding a new species¶
To add a new species to stdpopsim several things are required:
1. The genome definition
2. Default species parameters
3. A genetic map with local recombination rates (optional)
Once you have these things the first step is to create a new file in the catalog directory named for the species (see Naming conventions for more details). All code described below should go in this file unless explicitly specified otherwise.
Default species parameters¶
Four default parameters are required to create a new species:
1. Generation time estimate
2. Mutation rate
3. Recombination rate
4. Characteristic population size
These parameters should be based on what values might be drawn from a typical population as represented in the literature for that species. Consequently one or more citations for each value are expected and will be required for constructing the species object detailed below.
Adding/Updating a genome definition¶
A genome definition is created with a call to stdpopsim.Genome(), which requires a list of chromosomes and a citation for the assembly. stdpopsim has an automated procedure for obtaining this list from Ensembl and saving it for automated parsing. First, however, the initial species directory must be created in the stdpopsim/catalog directory (e.g. stdpopsim/catalog/AraTha). Once that is done, run the update_ensembl_data.py script present in the top level directory, providing the Ensembl species id(s) as “_” delimited name(s) for positional arguments as shown below. If no positional arguments are specified then all species registered in stdpopsim will be updated.
python update_ensembl_data.py arabidopsis_thaliana
This will write/overwrite the genome_data.py file in the appropriate catalog subdirectory. Then add the following to the head of catalog/{species_id}/__init__.py:
from . import genome_data
To create the chromosome objects that make up a genome, add the following code to catalog/{species_id}/__init__.py and supply default mutation and recombination rates along with citations for the assembly (and additional ones for the mutation and recombination rates if necessary). These are then used to create a genome object.
# A citation for the chromosome parameters. Additional citations may be needed if
# the mutation or recombination rates come from other sources. In that case create
# additional citations with the appropriate reasons specified (see API documentation
# for stdpopsim.citations)
_assembly_citation = stdpopsim.Citation(
    doi="FILL ME",
    year="FILL ME",
    author="Author et al.",
    reasons={stdpopsim.CiteReason.ASSEMBLY},
)

# Parse list of chromosomes into a list of Chromosome objects which contain the
# chromosome name, length, mutation rate, and recombination rate
_chromosomes = []
for name, data in genome_data.data["chromosomes"].items():
    _chromosomes.append(
        stdpopsim.Chromosome(
            id=name,
            length=data["length"],
            synonyms=data["synonyms"],
            mutation_rate=FILL_ME,
            recombination_rate=FILL_ME,
        )
    )

# Create a genome object
_genome = stdpopsim.Genome(
    chromosomes=_chromosomes, assembly_citations=[_assembly_citation]
)
Once you have a genome object you can create a new Species object which contains species identifiers, the genome, and default generation time and population size settings along with the relevant citation(s). Below is an example species definition for Arabidopsis thaliana and a final line of code that registers the species in the catalog.
_gen_time_citation = stdpopsim.Citation(
    doi="https://doi.org/10.1890/0012-9658(2002)083[1006:GTINSO]2.0.CO;2",
    year="2002",
    author="Donohue",
    reasons={stdpopsim.CiteReason.GEN_TIME},
)

_pop_size_citation = stdpopsim.Citation(
    doi="https://doi.org/10.1016/j.cell.2016.05.063",
    year="2016",
    author="1001GenomesConsortium",
    reasons={stdpopsim.CiteReason.POP_SIZE},
)

_species = stdpopsim.Species(
    id="AraTha",
    name="Arabidopsis thaliana",
    common_name="A. thaliana",
    genome=_genome,
    generation_time=1.0,
    generation_time_citations=[_gen_time_citation],
    population_size=10 ** 4,
    population_size_citations=[_pop_size_citation],
)

stdpopsim.register_species(_species)
Once all of this is done, go to the catalog/__init__.py file and add a line like the one below using the six-letter species identifier. Make sure to keep the comment to prevent linting issues.
from .catalog import PonAbe # NOQA
Species review process¶
Once you are satisfied that the species can be simulated via the CLI, submit a pull request with your changes. The species definition will go through a review process. This process includes not only a code review, but also includes a QC process to double check parameters and citations are appropriate. To initiate the QC process, open a new issue using the ‘Species QC issue template’. One or more volunteers will check items off the checklist, until all items have been completed satisfactorily. The QC issue, or the pull request, may be used for review discussion. The new species will be merged once the checklist is completed.
Adding a genetic map¶
Some species have sub-chromosomal recombination maps available. They can be added to stdpopsim by creating a new GeneticMap object and providing a formatted file detailing recombination rates to a designated stdpopsim maintainer who then uploads it to AWS. If there is one for your species that you wish to include, create a space-delimited file with four columns: Chromosome, Position(bp), Rate(cM/Mb), and Map(cM). Each chromosome should be placed in a separate file, with the chromosome id in the file name in such a way that it can be programmatically parsed out. IMPORTANT: chromosome ids must match those provided in the genome definition exactly! Below is an example start to a recombination map file (see here for more details):
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 32807 5.016134 0
chr1 488426 4.579949 0
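Before packaging a map, it can be useful to confirm that each file parses cleanly in this four-column layout. The reader below is a minimal sketch written for this guide, not stdpopsim's own parser; the function name is hypothetical and it assumes the header shown above is present on the first line.

```python
import io

EXPECTED_HEADER = ["Chromosome", "Position(bp)", "Rate(cM/Mb)", "Map(cM)"]


def read_recombination_map(handle):
    """Parse a space-delimited map into (chrom, position, rate, map) tuples."""
    header = handle.readline().split()
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    records = []
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        chrom, position, rate, cumulative = line.split()
        records.append((chrom, int(position), float(rate), float(cumulative)))
    return records


# The example rows from the text above, read from an in-memory file.
example = io.StringIO(
    "Chromosome Position(bp) Rate(cM/Mb) Map(cM)\n"
    "chr1 32807 5.016134 0\n"
    "chr1 488426 4.579949 0\n"
)
records = read_recombination_map(example)
print(records)
```

A line with the wrong number of columns raises an error at the unpacking step, which is usually enough to catch formatting slips early.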
Once you have the recombination map files formatted, tar and gzip them into a single compressed archive. The gzipped tarball must be FLAT (there are no directories in the tarball). This file will be sent to one of the stdpopsim uploaders for placement in the AWS cloud once the new genetic map(s) are approved. Finally, you must add a GeneticMap object to the file named for your species in the catalog directory (the same one in which the genome is defined) as shown below:
_genetic_map_citation = stdpopsim.Citation(
    doi="FILL_ME", author="FILL_ME", year=9999, reasons={stdpopsim.CiteReason.GEN_MAP}
)
"""
The file_pattern argument is a pattern that matches the recombination map filenames,
where '{id}' is replaced with the 'id' field of a given chromosome.
"""
_gm = stdpopsim.GeneticMap(
    species=_species,
    id="FILL_ME",  # ID for genetic map, see naming conventions
    description="FILL_ME",
    long_description="FILL_ME",
    url=("https://stdpopsim.s3-us-west-2.amazonaws.com/genetic_maps/dir/filename"),
    sha256="FILL_ME",
    file_pattern="name_{id}_more_name.txt",
    citations=[_genetic_map_citation],
)

_species.add_genetic_map(_gm)
The SHA256 checksum of the genetic map tarball can be obtained using the
sha256sum
command from GNU coreutils. If this is not available on your
system, the following can instead be used:
python -c 'from stdpopsim.utils import sha256; print(sha256("genetic_map.tgz"))'
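The packaging steps above (a flat gzipped tarball plus its SHA256 checksum) can also be scripted with the Python standard library. This is only a sketch under stated assumptions: the helper names `make_flat_tarball`, `is_flat` and `sha256_of_file` are hypothetical, and the checksum is computed with `hashlib` rather than with the `stdpopsim.utils.sha256` function mentioned above.

```python
import hashlib
import os
import tarfile
import tempfile


def make_flat_tarball(map_files, tarball_path):
    """Write map_files into a gzipped tarball with no directory components."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        for path in map_files:
            # arcname drops any leading directories so the archive stays flat.
            tar.add(path, arcname=os.path.basename(path))


def is_flat(tarball_path):
    """Return True if every archive member is a regular file at the top level."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return all(m.isfile() and "/" not in m.name for m in tar.getmembers())


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Demonstration with a throwaway file; the names used here are made up.
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "maps"))
        map_file = os.path.join(tmp, "maps", "name_chr1_more_name.txt")
        with open(map_file, "w") as f:
            f.write("Chromosome Position(bp) Rate(cM/Mb) Map(cM)\n")
        tarball = os.path.join(tmp, "genetic_map.tgz")
        make_flat_tarball([map_file], tarball)
        print(is_flat(tarball), sha256_of_file(tarball))
```

The printed digest is the value that would go in the sha256 field of the GeneticMap definition.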
Once all this is done, submit a PR containing the code changes and wait for directions on whom to send the compressed archive of genetic maps to (currently Andrew Kern is the primary uploader but please wait to send files to him until directed).
Lifting over a genetic map¶
Existing genetic maps will need to be lifted over to a new assembly, if and when the current assembly is updated in stdpopsim. This process can be partially automated by running the liftOver maintenance code.
First, you must download and install the liftOver
executable from the
UCSC Genome Browser Store.
Next, you must download the appropriate chain files, again from UCSC
(see UCSC Genome Browser downloads for more details).
Validating the remapping between assemblies requires chain files
corresponding to both directions of the liftOver
(e.g. hg19ToHg38.over.chain.gz and hg38ToHg19.over.chain.gz) as in the
example below.
An example of the process for
lifting over the GeneticMap "HapMapII_GRCh37"
to the "Hg38"
assembly
is shown below:
python /maintenance/liftOver_catalog.py \
--species HomSap \
--map HapMapII_GRCh37 \
--chainFile hg19ToHg38.over.chain.gz \
--validationChain hg38ToHg19.over.chain.gz \
--winLen 1000 \
--useAdjacentAvg \
--retainIntermediates \
--gapThresh 1000000
Here, the argument "--winLen"
corresponds to the size of the window over which a weighted
average of recombination rates is taken when comparing the original map with the
back-lifted map (for validation purposes only). The argument "--gapThresh"
sets a length threshold for gaps in the new assembly: gaps longer than "--gapThresh"
are assigned a recombination rate of 0.0000 instead of an average rate.
For gaps shorter than "--gapThresh", the average rate is either
the mean rate of the two most adjacent windows
or the mean rate for the entire chromosome, selected with the options "--useAdjacentAvg"
or "--useChromosomeAvg", respectively.
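The gap-handling rule described above can be illustrated with a small sketch. This is not the code in the maintenance script: the function name `fill_gap_rates`, the interval representation, the use of neighbouring intervals in place of windows, and the length-weighted chromosome mean are all assumptions made for illustration.

```python
def fill_gap_rates(intervals, gap_thresh, use_adjacent_avg=True):
    """Assign rates to gaps in a lifted-over map (illustrative only).

    intervals: list of (start, end, rate) tuples sorted by position, where
    rate is None for a gap in the new assembly.
    """
    known = [(end - start, rate) for start, end, rate in intervals if rate is not None]
    total_len = sum(length for length, _ in known)
    # Assumption: the chromosome-wide mean is weighted by interval length.
    chrom_mean = (
        sum(length * rate for length, rate in known) / total_len if total_len else 0.0
    )
    filled = []
    for i, (start, end, rate) in enumerate(intervals):
        if rate is None:
            if end - start > gap_thresh:
                # Long gaps get a rate of zero rather than an average.
                rate = 0.0
            elif use_adjacent_avg:
                # Mean of the nearest intervals on either side that have a rate.
                neighbours = [
                    intervals[j][2]
                    for j in (i - 1, i + 1)
                    if 0 <= j < len(intervals) and intervals[j][2] is not None
                ]
                rate = sum(neighbours) / len(neighbours) if neighbours else chrom_mean
            else:
                rate = chrom_mean
        filled.append((start, end, rate))
    return filled


toy_map = [
    (0, 1000, 2.0),
    (1000, 1500, None),  # short gap: filled with an average rate
    (1500, 2500, 4.0),
    (2500, 5000, None),  # long gap: set to zero
    (5000, 6000, 6.0),
]
adjacent = fill_gap_rates(toy_map, gap_thresh=1000, use_adjacent_avg=True)
chromosome = fill_gap_rates(toy_map, gap_thresh=1000, use_adjacent_avg=False)
```

With this toy input, the short gap receives the mean of its two neighbours in the first call and the chromosome-wide mean in the second, while the long gap is set to zero in both.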
Validation plots will automatically be generated in the "/liftOver_validation/"
directory. Intermediate files created by the liftOver
executable will be saved
for inspection in the "/liftOver_intermediates/"
directory, but only if the
"--retainIntermediates"
option is used. Once the user has inspected the validation plots
and deemed the liftOver process to be sufficiently accurate, they can proceed to generating
the SHA256 checksum.
The SHA256 checksum of the new genetic map tarball can be obtained using the
sha256sum
command from GNU coreutils. If this is not available on your
system, the following can instead be used:
python -c 'from stdpopsim.utils import sha256; print(sha256("genetic_map.tgz"))'
The newly lifted over maps will be formatted in a compressed archive and automatically named using the assembly name from the chain file. This file will be sent to one of the stdpopsim uploaders for placement in the AWS cloud, once the new map is approved. Finally, you must add a GeneticMap object to the file named for your species in the catalog directory (the same one in which the genome is defined) as shown in Adding a genetic map.
Again, once all this is done, submit a PR containing the code changes and wait for directions on whom to send the compressed archive of genetic maps to (currently Andrew Kern is the primary uploader but please wait to send files to him until directed).
Note
The GeneticMap
named "ComeronCrossoverV2_dm6"
for "DroMel"
was generated by similar code (albeit slightly different
compared to that shown above) using the following command:
python /maintenance/liftOver_comeron2012.py \
--winLen 1000 \
--gapThresh 1000000 \
--useAdjacentAvg \
--retainIntermediates
Coding standards¶
To ensure that the code in stdpopsim
is as readable as possible
and follows a reasonably uniform style, we require that all code follows
the PEP8 style guide.
Lines of code should be no more than 89 characters.
Conformance to this style is checked as part of the Continuous Integration
testing suite.
Naming conventions¶
To ensure uniformity in naming schemes across objects in stdpopsim
we have strict conventions for species, genetic maps, and demographic
models.
Species names follow a ${first_3_letters_genus}${first_3_letters_species}
convention with capitalization such that Homo sapiens becomes “HomSap”. This
is similar to the UCSC Genome Browser naming convention and should be familiar.
Genetic maps are named using a descriptive name and the assembly version according
to ${CamelCaseDescriptiveName}_${Assembly}
. For example, the HapMap phase 2 map on
the GRCh37 assembly becomes HapMapII_GRCh37.
Demographic models are named using a combination of a descriptive name,
information about the simulation, and information about the publication it was
presented in. Specifically we use
${SomethingDescriptive}_${number_of_populations}${first_author_initial}${two_digit_date}
where the descriptive text is meant to capture something about the model
(e.g. an admixture model or a population crash) and the number of populations
is the number of populations implemented in the model (not necessarily the number
from which samples are drawn). For author initial we will use a single letter, the 1st,
until an ID collision, in which case we will include the 2nd letter, and so forth.
DFEs (Distributions of Fitness Effects) are similarly named using something descriptive
of the distribution, and information about the publication:
${SomethingDescriptive}_${First_authors_last_name_first_letter}${two_digit_date}
.
For instance, if the distribution in question is a lognormal distribution,
then LogNormal
might be the descriptive string.
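The conventions above can be summarised as simple string templates. The helpers below are a hypothetical sketch, not functions provided by stdpopsim, and the example inputs are for illustration only.

```python
def species_id(genus, species):
    """First three letters of genus and species, each capitalized."""
    return genus[:3].capitalize() + species[:3].capitalize()


def genetic_map_id(descriptive_name, assembly):
    """CamelCase descriptive name followed by the assembly version."""
    return f"{descriptive_name}_{assembly}"


def model_id(descriptive_name, num_populations, author_initial, year):
    """Descriptive name, population count, author initial(s), two-digit year."""
    return f"{descriptive_name}_{num_populations}{author_initial}{year % 100:02d}"


def dfe_id(descriptive_name, author_initial, year):
    """Descriptive name, author initial(s), two-digit year."""
    return f"{descriptive_name}_{author_initial}{year % 100:02d}"


print(species_id("Homo", "sapiens"))
print(species_id("Drosophila", "melanogaster"))
print(genetic_map_id("HapMapII", "GRCh37"))
# Illustrative inputs: a three-population model by an author "G", from 2009.
print(model_id("OutOfAfrica", 3, "G", 2009))
```

If an ID collides with an existing one, the author_initial argument would be extended to two or more letters, as described above.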
Unit tests¶
All code added to stdpopsim
should have
unit tests. These are typically
simple and fast checks to ensure that the code makes basic sense (the
entire unit test suite should not require more than a few seconds to run).
Test coverage is checked using CodeCov,
which generates reports about each pull request.
It is not practical to test the statistical properties of simulation models as part of unit tests.
The unit test suite is in the tests
directory. Tests are run using the
pytest module. Use:
$ python3 -m pytest
from the project root to run the full test suite. Pytest is very powerful and has lots of options; please see the tskit documentation for help on how to run pytest and some common options.
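As a rough illustration of the kind of simple, fast check meant here, a test module might look like the sketch below. The helper `chromosome_map_filename` is hypothetical and defined inline only so that the example is self-contained; real tests would import the code under test from stdpopsim.

```python
def chromosome_map_filename(file_pattern, chrom_id):
    """Hypothetical helper: fill the '{id}' placeholder of a file pattern."""
    if "{id}" not in file_pattern:
        raise ValueError("file_pattern must contain '{id}'")
    return file_pattern.format(id=chrom_id)


# pytest collects functions whose names start with "test_" and reports any
# failing assert statement.
def test_filename_substitutes_chromosome_id():
    result = chromosome_map_filename("name_{id}_more_name.txt", "chr1")
    assert result == "name_chr1_more_name.txt"


def test_pattern_without_placeholder_is_rejected():
    # With pytest installed, `with pytest.raises(ValueError):` is the usual
    # idiom; a plain try/except keeps this sketch free of imports.
    try:
        chromosome_map_filename("name.txt", "chr1")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Each test checks one behaviour and runs in well under a second, which keeps the full suite fast.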
It’s useful to run the flake8
CI tests locally before pushing a commit.
To set this up use either pip
or conda
to install flake8.
To run the test, simply use:
$ flake8 --max-line-length 89 stdpopsim tests
If you would like to automatically run this test before a commit is permitted,
add the following line in the file stdpopsim/.git/hooks/pre-commit.sample
:
exec flake8 --max-line-length 89 setup.py stdpopsim tests
before:
# If there are whitespace errors, print the offending file names and fail.
exec git diff-index --check --cached $against --
Finally, rename pre-commit.sample
to simply pre-commit.
Code Coverage¶
As part of the continuous testing suite, we automatically check how well the unit tests cover the source code. It is therefore very helpful to check locally how well your tests cover your code by asking pytest for coverage reports. This can be done with:
$ pytest --cov-report html --cov=stdpopsim tests/
This will output a directory of HTML files in which you can browse test coverage for every file in stdpopsim in a reasonably straightforward graphical way. Look for lines of code highlighted in yellow or red, which indicate that tests do not currently cover that bit of code.
Documentation¶
Documentation is written using reStructuredText
markup and the sphinx documentation system.
It is defined in the docs
directory.
To build the documentation, type make
in the docs
directory. This should build
HTML output in the _build/html/
directory.
Note
You will need stdpopsim
to be installed for the build to work.