Alexis Mergez authored

Pan1c (Pangenome at chromosome level) workflow

Pan1c is a Snakemake workflow for creating pangenomes at chromosome scale. Tools used within the workflow:

File architecture

Before running the workflow

Pan1c/
├── config.yaml
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
├── example
│   └── ...
├── getApps.sh
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
└── Snakefile

After the workflow (Arabidopsis thaliana example)

The following tree is non-exhaustive for clarity. Temporary files are not listed, but key files are included.

Pan1c-06AT-v3
├── chrInputs
├── config.yaml
├── data
│   ├── chrGraphs
│   │   ├── chr<id>
│   │   ├── chr<id>.gfa
│   │   └── graphsList.txt
│   ├── chrInputs
│   │   └── chr<id>.fa.gz
│   ├── haplotypes
│   └── hap.ragtagged
│       ├── <sample>.hap<hid>
│       └── <sample>.hap<hid>.ragtagged.fa.gz
├── logs
│   ├── pan1c.pggb.06AT-v3.logs.tar.gz
│   └── pggb
│       ├── chr<id>.pggb.cmd.log
│       └── chr<id>.pggb.time.log
├── output
│   ├── figures
│   │   ├── chr<id>.1Dviz.png
│   │   └── chr<id>.pcov.png
│   ├── pan1c.pggb.06AT-v3.chrGraph.stats.tsv
│   ├── pan1c.pggb.06AT-v3.gfa
│   ├── pan1c.pggb.06AT-v3.workflow.stats.tsv
│   ├── panacus.reports
│   │   └── chr<id>.histgrowth.html
│   ├── pggb.usage.figs
│   └── stats
│       └── chr<id>.stats.tsv
├── Pan1c-06AT-v3.log
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
├── Snakefile
└── workflow.svg

Example DAG (Arabidopsis thaliana example)

This DAG shows the workflow for a pangenome of Arabidopsis thaliana using the TAIR10.1 reference. (Figure: workflow DAG)

Prepare your data

This workflow accepts chromosome-level as well as contig-level assemblies, but it requires a reference assembly.
Fasta files need to be compressed using bgzip2 (included in PanGeTools). Sequence names in the reference must follow this pattern: <sample>#<haplotype>#<contig or chromosome name>.
For example, CHM13 chromosomes (haploid) must be named CHM13#1#chr... Only the reference needs to follow this pattern for its sequence names; sequences of the other haplotypes will be renamed based on the reference and their respective fasta file names. Fasta file names must also follow a pattern: <sample>.hap<haplotype>.fa.gz. With CHM13 again, the fasta file should be named CHM13.hap1.fa.gz.

See PanSN for more info on sequence naming.
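The naming rules above can be checked before launching the workflow. The following is a minimal sketch, not part of Pan1c: the function names and regular expressions are illustrative assumptions written from the patterns described in this section, and the haplotype is assumed to be an integer.

```python
import re

# Assumed patterns, derived from the naming rules described above.
# The haplotype identifier is assumed to be numeric.
FASTA_NAME = re.compile(r"^(?P<sample>[^#]+)\.hap(?P<hap>\d+)\.fa\.gz$")
PANSN_NAME = re.compile(r"^(?P<sample>[^#]+)#(?P<hap>\d+)#(?P<contig>\S+)$")


def parse_fasta_name(filename):
    """Return (sample, haplotype) for '<sample>.hap<haplotype>.fa.gz', else None."""
    match = FASTA_NAME.match(filename)
    return (match["sample"], int(match["hap"])) if match else None


def parse_pansn(seq_name):
    """Return (sample, haplotype, contig) for '<sample>#<haplotype>#<contig>', else None."""
    match = PANSN_NAME.match(seq_name)
    return (match["sample"], int(match["hap"]), match["contig"]) if match else None


def pansn_prefix(filename):
    """Build the '<sample>#<haplotype>#' prefix implied by a fasta file name."""
    parsed = parse_fasta_name(filename)
    if parsed is None:
        raise ValueError(f"file name does not match <sample>.hap<haplotype>.fa.gz: {filename}")
    sample, hap = parsed
    return f"{sample}#{hap}#"
```

For instance, `pansn_prefix("CHM13.hap1.fa.gz")` yields the prefix `CHM13#1#`, which is the prefix the reference sequence names are expected to carry.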

Note: Input files should be read-only to prevent Snakemake from modifying them (which seems to happen in rare cases).
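One way to make the inputs read-only is to clear their write permission bits. The sketch below is an illustration only: `make_read_only` is a hypothetical helper, and the `*.fa.gz` pattern and the `data/haplotypes` directory are taken from the file layout shown earlier.

```python
import stat
from pathlib import Path


def make_read_only(directory, pattern="*.fa.gz"):
    """Clear the write bits of matching files and return the paths that were changed."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        mode = path.stat().st_mode
        if mode & write_bits:
            # Keep the existing permission bits, minus every write bit.
            path.chmod(stat.S_IMODE(mode) & ~write_bits)
            changed.append(path)
    return changed
```

Calling `make_read_only("data/haplotypes")` from the repository root would cover the input haplotypes. Running it twice is harmless, since files that are already read-only are skipped.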

Download apptainer images

Before running the workflow, some Apptainer images need to be downloaded. Use the getApps.sh script to do so:

./getApps.sh -a <apps directory>

Running the workflow

Clone this repository and create a data/haplotypes directory where you will place all your haplotypes.
Update the reference name and the apptainer image directory in config.yaml.
Then, modify the variables in runSnakemake.sh to match your requirements (number of threads, memory, job name, email, etc.).
Navigate to the root directory of the repository and execute sbatch runSnakemake.sh!
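Before submitting the job, it can help to confirm that the repository layout matches the steps above. This is a hedged sketch: `preflight` is a hypothetical helper, not a Pan1c script, and it checks only the files and directory named in this README. It does not inspect the contents of config.yaml.

```python
import re
from pathlib import Path

# Assumed file-name pattern: <sample>.hap<haplotype>.fa.gz with a numeric haplotype.
HAP_FILE = re.compile(r"^[^#]+\.hap\d+\.fa\.gz$")


def preflight(repo_root):
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(repo_root)
    problems = []
    for name in ("config.yaml", "runSnakemake.sh", "Snakefile"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    hap_dir = root / "data" / "haplotypes"
    if not hap_dir.is_dir():
        problems.append("missing directory: data/haplotypes")
        return problems
    names = sorted(p.name for p in hap_dir.iterdir() if p.is_file())
    if not names:
        problems.append("no haplotype files found in data/haplotypes")
    for name in names:
        if not HAP_FILE.match(name):
            problems.append(f"unexpected file name: {name}")
    return problems
```

Running `preflight(".")` from the repository root and printing the result gives a quick list of anything to fix before calling sbatch.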

Outputs

The workflow generates several key files:

  • Aggregated graph including all chromosome-scale graphs (output/pan1c.pggb.<panname>.gfa)
  • Chromosome-scale graphs (data/chrGraphs/chr<id>.gfa)
  • Panacus HTML reports for each chromosome graph (output/panacus.reports/chr<id>.histgrowth.html)
  • Statistics on input sequences, graphs and resources used by the workflow (output/pan1c.pggb.<panname>.workflow.stats.tsv)
  • PAV matrices (optional) for each chromosome graph (output/pav.matrices/chr<id>.pav.matrix.tsv)
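The output paths listed above follow a regular pattern, so they can be generated for a given pangenome name and set of chromosome ids. The sketch below is illustrative: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the path templates are copied from the list above.

```python
from pathlib import Path


def expected_outputs(panname, chrom_ids, with_pav=False):
    """List the key output paths (relative to the run directory) described above."""
    outputs = [
        f"output/pan1c.pggb.{panname}.gfa",
        f"output/pan1c.pggb.{panname}.workflow.stats.tsv",
    ]
    for cid in chrom_ids:
        outputs.append(f"data/chrGraphs/chr{cid}.gfa")
        outputs.append(f"output/panacus.reports/chr{cid}.histgrowth.html")
        if with_pav:
            # PAV matrices are optional, so they are only listed on request.
            outputs.append(f"output/pav.matrices/chr{cid}.pav.matrix.tsv")
    return outputs


def missing_outputs(run_dir, panname, chrom_ids, with_pav=False):
    """Return the expected output paths that do not exist under run_dir."""
    root = Path(run_dir)
    return [p for p in expected_outputs(panname, chrom_ids, with_pav)
            if not (root / p).exists()]
```

With the Arabidopsis thaliana example, `expected_outputs("06AT-v3", ["1"])` includes `output/pan1c.pggb.06AT-v3.gfa` and `data/chrGraphs/chr1.gfa`.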