Preprocessing raw data
Basecalling
SWARM was trained and tested with following basecaller versions and models, use the same versions for results comparable to our benchmarks:
| sequencing kit | basecaller version | basecaller model |
|---|---|---|
| sqk-RNA002 | guppy 6.4.6 | rna_r9.4.1_70bps_hac |
| sqk-RNA004 | dorado 0.7.2 | rna004_130bps_sup@v5.0.0 |
Recommended parameters RNA002:
guppy_basecaller -i $INPUTDIR --recursive -s $output_path.fastq -c guppy/ont-guppy/data/rna_r9.4.1_70bps_hac.cfg --device cuda:all:100%
Recommended parameters RNA004:
MODEL=dorado-0.7.2-linux-x64/rna004_130bps_sup@v5.0.0
dorado basecaller $MODEL $INPUTDIR -r -x cuda:all --emit-fastq > $output_path.fastq
Alignment
minimap 2.24 for alignment and samtools 1.22 for quality control
Recommended parameters: -k 5 for sythetic IVTs and -k 14 for human transcriptomes
Example transcriptome alignment:
minimap2 -ax map-ont -k 14 ${fasta} ${input_path}/guppy_pass.fastq | samtools sort -o ${output_path}.bam
samtools index ${output_path}.bam
samtools view -b -F 2324 ${bam_file}.bam > ${bam_file}_pass_filtered.bam
samtools index ${bam_file}_pass_filtered.bam
fast5 to slow5
This step is highly recommended, especially for large datasets.
Install slow5tools from: https://github.com/hasindu2008/slow5tools
Example conversion command:
#convert fast5 files to slow5 files using 8 I/O processes
slow5tools f2s $INPUT_DIR -d $TEMPDIR -p 8
#Merge all the slow5 files in to a single file using 8 threads
slow5tools merge $TEMPDIR -o $OUTDIR/${SAMPLE}.blow5 -t 8
#remove the temporary directory
rm -rf $TEMPDIR
Event alignment
f5c
Our workflow supports both f5c .sam and nanopolish .tsv formats. We highly recommend opting for f5c and sam files. This requires the slow5 conversion outlined in previous step.
f5c is available from: https://github.com/hasindu2008/f5c
Example event align command:
f5c index -t 48 $FASTQ_PATH --slow5 $SLOW5_PATH
f5c eventalign -t 48 -r $FASTQ_PATH --rna -g $genome -b $BAM --slow5 $SLOW5_PATH --min-mapq 0 --signal-index --scale-events --samples --print-read-names --sam > $OUT
nanopolish
We used this format in earlier stages of the project, our workflow can still support it. Note that our prediction workflow is optimised for f5c sam format.
nanopolish f5c is available from: https://github.com/jts/nanopolish
Example event align command:
nanopolish index -d ${fast5_path} -s ${guppy_files}/sequencing_summary.txt $fastq
nanopolish eventalign -t 48 --reads $fastq --bam $bam_file \
--genome $fasta --signal-index --scale-events --samples --print-read-names > $output_path