Site-level detection
SWARM_site_level.py aggregates read-level data per reference site to compute:
-
Stoichiometry (modification rate) computed from read-level modification calls
-
Site-level probability (can narrow down modified sites from millions of tested coordinates)
Sort read-level output
Read-level predictions must be sorted to ensure correct site-level aggregation.
# Use cat if pooling multiple replicates
cat rep1.pred.tsv rep2.pred.tsv > reads.pred.tsv
# Sort based on the first column
sort -k 1 reads.pred.tsv > reads.pred.tsv.sorted
# Remove temp file if pooling
rm -f reads.pred.tsv
Running site-level detection
Run SWARM_site_level.py on sorted read-level data:
python3 SWARM_site_level.py -i <pred.tsv.sorted> -o <OUT> -d <0.5,0.5>
Required:
-i, --input Path to the sorted read-level prediction file
-o, --file_out Path to the output file
Optional:
-d, --double_cutoff Read-level cutoffs for computing stoichiometry [0.5,0.5]
-n, --min_reads Minimun read coverage to output a site [20]
-c, --cutoff Site-level probability cutoff for printing sites [0.0]
-m, --DL_model Custom path to pretrainned site-level model
--arch NN architecture [Mini/Mid/Large]
-h, --help Show this help message and exit
Stoichiometry parameters
-d, --double_cutoff arg can be used to select read-level cutoffs for computing stoichiometry.
Expected input is a ',' separated value providing cutoffs for unmodified and modified calls.
Default value is 0.5,0.5 ; meaning that p < 0.5 is unmodified and p > 0.5 is modified.
Values between the two cutoffs are not included in stoichiometry computation, but are still kept for site-level model prediction and included in the coverage.
Using higher cutoffs for calling modified reads would reduce stoichiometry but should give lower false positive rates.
Stoichiometry is reported as rate (between 0 and 1).
Output format
SWARM_site_level.py outputs a tsv file in the same fromat as CHEUI (Mateos et al., 2024):
contig position site coverage stoichiometry probability
ENST00000000233.10 1000 GATCTTGAG 223 0.04522613065326633 4.0531158447265625e-06
ENST00000000233.10 1001 ATCTTGAGT 227 0.06111111111111111 8.106231689453125e-06
ENST00000000233.10 1012 TAAATTTGC 164 0.08527131782945736 6.079673767089844e-06
ENST00000000233.10 1013 AAATTTGCT 182 0.07586206896551724 4.1484832763671875e-05
ENST00000000233.10 1017 TTGCTGTGG 151 0.00684931506849315 0.00010669231414794922
Post-processing
Site-level cutoffs
SWARM can detect modifications in a single sample, where the target modfiication is usually present in under 1% of the tested coordinates.
To enrich for the true modified sites, we use a standard 10% stoichiometry cutoff with additional neural-network filtering based on the distribution of read-level probabilities.
Site-level models were trained on pre-defined mixtures of modified and unmodified reads and extensively benchmarked on IVT transcriptomes and cellular data.
For single-sample detection we select sites with:
-
Stoichiometry > 0.1 (>10%)
-
Site-level probability with a false-positive rate < 0.1% on unmodified IVT transcriptomes
Cutoffs for different contexts:
| Kit | Modification | Context | Cutoff |
|---|---|---|---|
| RNA002 | m6A | all-context | 0.999953 |
| RNA002 | m6A | DRACH-only | 0.9972 |
| RNA004 | m6A | all-context | 0.9999999999986 |
| RNA004 | m6A | DRACH-only | 0.9999986 |
| RNA002 | Ψ | all-context | 0.9943 |
| RNA002 | Ψ | PUS7-TRUB1-only | 0.9925 |
| RNA004 | Ψ | all-context | 0.999999988 |
| RNA004 | Ψ | PUS7-TRUB1-only | 0.999999975 |
| RNA002 | m5C | all-context | 0.999969 |
| RNA002 | m5C | NSUN6-only | 0.99981 |
| RNA004 | m5C | all-context | 0.99999999989 |
| RNA004 | m5C | NSUN6-only | 0.988 |