Read-level modification calling

Use SWARM_read_level.py to detect modifications in individual molecules.

sam + slow5 preprocessing (default)

Use this approach for faster, simultaneous preprocessing and model inference. You must first run build.sh from the install section to compile the C++ code for signal preprocessing.

Models for RNA002 or RNA004 chemistry are automatically selected by default based on the blow5 header.

Example bash code to run SWARM read-level prediction (for m6A, set -m m6A):

module load tensorflow
conda activate SWARM
python3 SWARM_read_level.py -m <RNAmod> -s <SAM> -f <FASTA> -r <BLOW5> -o <OUT>

Required:
  -m RNAMOD, --RNAmod RNAMOD    Target RNA modification [m6A/pU/m5C]
  -o OUT, --out OUT             Path for the output tsv file
  -s SAM, --sam SAM             Path to the input sam event align
  -f FASTA, --fasta FASTA       Path to the input fasta reference genome
  -r RAW, --raw RAW             Path to the input signals in blow5 format

Optional:
  --model1 MODEL1               Custom path to the trained model1
  --kmer KMER                   Custom path to the kmer model
  --cpp CPP                     Custom path to the compiled C++ preprocessing binary
  --kit KIT                     RNA sequencing kit [RNA004/RNA002]
  --temp TEMP                   Directory for temp files
  --arch ARCH                   Model1 network, Mini is default. [Mini/Mid/Large].
  --modsam                      Outputs OUT.mod.sam (MM/ML tags) and OUT.pred.tsv
  --nworkers NWORKERS           Number of preprocessing workers (default 4, 1 for --modsam)
  -h, --help                    Show this help message and exit
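
When the default run is launched from a Python pipeline, the command line above can be assembled as an argument list. The following is a minimal sketch, not part of SWARM: the helper name build_read_level_cmd is hypothetical, and it uses only the flags listed above.

```python
# Hypothetical helper (not shipped with SWARM): builds the argv list for the
# default sam + slow5 run using only the flags documented above.
VALID_MODS = {"m6A", "pU", "m5C"}


def build_read_level_cmd(rnamod, sam, fasta, blow5, out, modsam=False, nworkers=None):
    """Return the argument list for a default SWARM_read_level.py run."""
    if rnamod not in VALID_MODS:
        raise ValueError(f"RNAmod must be one of {sorted(VALID_MODS)}, got {rnamod!r}")
    cmd = [
        "python3", "SWARM_read_level.py",
        "-m", rnamod,
        "-s", str(sam),
        "-f", str(fasta),
        "-r", str(blow5),
        "-o", str(out),
    ]
    if modsam:
        # Also write OUT.mod.sam (MM/ML tags) and OUT.pred.tsv
        cmd.append("--modsam")
    if nworkers is not None:
        cmd += ["--nworkers", str(nworkers)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the SWARM directory, after the environment has been activated as shown above.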

eventalign.tsv preprocessing

Alternatively, preprocessing and prediction can be run separately from eventalign.tsv, but this produces very large temp files (potentially terabytes).

First preprocess the event alignments:

python3 SWARM_read_level.py --preprocess -m m6A --bam $BAM --nanopolish $EVENTS -o $OUT.pickle

Then predict modifications:

python3 SWARM_read_level.py --predict -m m6A --pickle $OUT.pickle -o $OUT.pred.tsv
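
The two steps can also be chained from Python so that prediction only starts once preprocessing has succeeded. This is a minimal sketch under the assumption that the flags are exactly those shown in the two commands above; the function names two_step_cmds and run_two_step are hypothetical.

```python
import subprocess


def two_step_cmds(rnamod, bam, events, out_prefix):
    """Return (preprocess, predict) argument lists for the eventalign.tsv route."""
    pickle_path = f"{out_prefix}.pickle"
    preprocess = [
        "python3", "SWARM_read_level.py", "--preprocess",
        "-m", rnamod, "--bam", str(bam), "--nanopolish", str(events),
        "-o", pickle_path,
    ]
    predict = [
        "python3", "SWARM_read_level.py", "--predict",
        "-m", rnamod, "--pickle", pickle_path,
        "-o", f"{out_prefix}.pred.tsv",
    ]
    return preprocess, predict


def run_two_step(rnamod, bam, events, out_prefix):
    """Run preprocessing, then prediction, stopping at the first failure."""
    for cmd in two_step_cmds(rnamod, bam, events, out_prefix):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so prediction is never attempted on a failed preprocessing run.
        subprocess.run(cmd, check=True)
```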

Output format

pred.tsv

Default output is in tsv format and contains 3 columns:

ENST00000390289.2_53_TCCCTCTCC_3f1d97d4-036e-478e-807a-dc68148e832d_326_28_T    0.01007  1
ENST00000390289.2_55_CCTCTCCCA_3f1d97d4-036e-478e-807a-dc68148e832d_328_37_T    0.03012  1
ENST00000390289.2_63_AGCCTGTGC_3f1d97d4-036e-478e-807a-dc68148e832d_336_42_T    0.00112  1
ENST00000390289.2_65_CCTGTGCTG_3f1d97d4-036e-478e-807a-dc68148e832d_338_42_T    0.00612  1
ENST00000390289.2_68_GTGCTGACT_3f1d97d4-036e-478e-807a-dc68148e832d_341_41_T    0.01945  1

1. Base metadata: refContig_refPosition_9mer_readID_readPosition_qscore_calledBase
2. Molecule-level probability, between 0 and 1. Reads carrying the target modification should have a higher probability.
3. Model code, used for selecting parameters for site-level prediction.

model_code = {
    "pU_RNA002":    "1",
    "pU_RNA004":    "2",
    "m5C_RNA002":   "3",
    "m5C_RNA004":   "4",
    "m6A_RNA002":   "5",
    "m6A_RNA004":   "6"
}
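
For downstream analysis, each pred.tsv row can be split back into its fields. The sketch below is illustrative only and not part of SWARM: the names ReadPrediction and parse_pred_line are hypothetical. It assumes whitespace-separated columns and a read ID without underscores (as in the UUIDs shown above), and it splits the metadata from the right so that contig names containing underscores stay intact.

```python
from typing import NamedTuple

# Same mapping as above, inverted so a code can be looked up directly.
MODEL_CODE = {
    "pU_RNA002": "1", "pU_RNA004": "2",
    "m5C_RNA002": "3", "m5C_RNA004": "4",
    "m6A_RNA002": "5", "m6A_RNA004": "6",
}
CODE_TO_MODEL = {code: name for name, code in MODEL_CODE.items()}


class ReadPrediction(NamedTuple):
    contig: str
    ref_pos: int
    kmer: str
    read_id: str
    read_pos: int
    qscore: float
    called_base: str
    probability: float
    modification: str
    kit: str


def parse_pred_line(line):
    """Parse one pred.tsv row into a ReadPrediction (hypothetical helper)."""
    meta, prob, code = line.split()
    # The last six underscore-separated fields are fixed; everything before
    # them is the reference contig name.
    contig, ref_pos, kmer, read_id, read_pos, qscore, base = meta.rsplit("_", 6)
    modification, kit = CODE_TO_MODEL[code].split("_")
    return ReadPrediction(contig, int(ref_pos), kmer, read_id, int(read_pos),
                          float(qscore), base, float(prob), modification, kit)
```

Applied to the first example row above, this yields contig ENST00000390289.2, reference position 53, and model code 1 decoded as pU on RNA002.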

mod.sam

SWARM predictions can be encoded in modsam format using the --modsam flag.

This format is described in detail in the modification visualization section.