Overview¶
barseqcount
is a tool for the analysis of DNA barcode sequencing experiments, typically used for biodistribution studies. It has 2 main commands, count
and analyze
.
barseqcount count
counts barcodes combinations from NGS read files. barseqcount analyze
analyzes the data and displays the results as a collection of plots.
- Source code:
- Python package:
- Bioconda package:
- Bug report / feature requests:
https://github.com/damienmarsic/barseqcount/issues/new/choose
Latest news¶
2023-02-23: update to 0.1.4. Improved detection of errors in the count configuration file. 2023-05-16: update to 0.1.5. Updated some function calls in order to be compatible with latest version of dmbiolib.
Installation¶
It is recommended to install barseqcount
as a bioconda package under a Python 3 conda environment.
If you don’t have such environment, first install Miniconda3 from here (exists for Linux, MacOSX and Windows).
To install the barseqcount
bioconda package, at the conda prompt type:
conda install -c bioconda barseqcount
barseqcount
can be also installed using pip:
pip install barseqcount
If using pip intall, note that dependencies (numpy, matplotlib and regex) might need to be installed individually if not already present.
Windows conda users: note that Bioconda does not support Windows and does not allow the automatic creation of a bat file. At the conda prompt, please type the following:
(echo @echo off & echo python -m barseqcount %%*) > %CONDA_PREFIX%\Scripts\barseqcount.bat
Windows non-conda users: follow the instruction above except that you need to replace %CONDA_PREFIX%
with the full path to a directory that is part of your PATH system variable.
Linux and MacOSX non-conda users: create a file named barseqcount
in a directory that is part of your PATH environment variable, with the following content:
#!/bin/sh
python -m barseqcount $@
Then type the following to make the file executable:
chmod +x barseqcount
Update¶
To update barseqcount
to the latest version:
Using conda:
conda update barseqcount
or even better:
conda update --all
Using pip:
pip install barseqcount -U
Usage¶
barseqcount [-h] [-v] {count,analyze} ...
or
python -m barseqcount [-h] [-v] {count,analyze} ...
The optional arguments -h/--help
and -v/--version
allow to show a help message and version information respectively.
Positional arguments count
and analyze
are the two commands that barseqcount
can run.
The former processes read files, collects barcode data and saves a barcode distribution file.
The latter analyzes the barcode distribution data and displays the results as a collection of plots.
barseqcount count
has optional arguments -c/--configuration_file
and -n/--new
.
The -c
argument (followed by a file name) specifies which configuration file to use, or which to create if it does not exist yet.
The -n
argument allows to ignore an existing configuration file and to create a new one.
barseqcount analyze
has the same -c
and -n
arguments, but also a third one: -f/--file_format
, allowing to chose a file format to save the plots individually.
If the -f
argument is not used, then all plots will be saved in a single multipage pdf file.
barseqcount count¶
Barcodes¶
Although barseqcount count
works with any number of barcodes, typically, each read contains one variant barcode, as well as possibly one or more sample barcodes. The sample barcodes are introduced by barcoded primers which are used to amplify the variant barcode sequence. In the case of Illumina sequencing, if indexes (barcodes introduced by PCR or ligation) are used for different samples, demultiplexing is typically performed by the NGS provider, in which case multiple files will be provided (1 file or 1 file pair per index), with their reads only containing the variant barcode. In that case, and in any case where multiple read files are present, file name prefixes will be used as sample barcodes.
Paired-end reads¶
Any common read file format (fasta or fastq, either uncompressed or gzipped) can be read by barseqcount count
. However, paired-end reads must be merged before use. Many merger programs can be used, for example NGmerge
from the ngmerge
package is recommanded. Example of installation in a conda environment:
conda install ngmerge
Example of merging read files Reads_1.fq.gz and Reads_2.fq.gz into Merged_reads.fq.gz:
NGmerge -1 Reads_1.fq.gz -2 Reads_2.fq.gz -o Merged_reads.fq.gz
If you get the error message Error! Quality scores outside of set range
, then add the -u 41
and -g
arguments (see NGmerge documentation for more information):
NGmerge -1 Reads_1.fq.gz -2 Reads_2.fq.gz -o Merged_reads.fq.gz -u 41 -g
Configuration file¶
If no configuration file (barseqcount_count.conf by default or any file name entered after the -c
argument) exists in the current directory, or if the -n
argument is used, the command barseqcount count
will create a new configuration file (named barseqcount_count.conf by default if the -c
argument is not used).
If the -n
argument is used, the existing configuration file will be renamed by adding a unique string of numbers before the file extension.
The configuration file needs to be edited by the user and each section needs to be filled out with appropriate information before it can be used.
Some sections are populated automatically if the program detects the most plausible content: Project name (working directory name), Read file(s) (any gzipped files present in the current directory) and Template sequence (if a single fasta file is present in the current directory).
All other sections must be populated by the user according to the instructions provided within the configuration file.
When the configuration file is ready, running barseqcount count
will open it and check its contents.
If errors are detected, the program will exit with an explanatory message. Otherwise, it will proceed with processing the read file(s).
Error correction¶
Whether error correction is performed is determind automatically by barseqcount count
by analyzing the barcode sequences in the configuration file.
If all barcodes within the same barcode location differ by at least 3 nucleotide substitutions from any other barcode, then single substitution error correction will be activated for that location, which means that if an unknown barcode is obtained which can be converted to a know barcode by a single substitution, it will be converted to that known barcode.
The other type of error correction corrects for indels within homopolymers of the sequences surrounding the barcode and for homopolymer insertions within the barcode sequence, and is only activated if homopolymers are absent from all expected barcodes in the barcode locationand if the ends of the barcodes are different from the nucleotide next to them.
Read file processing¶
Barcodes combinations are collected, error corrected when applicable, converted to variant names and sample names whenever possible, and saved into a barcode distribution csv file, which can later be used by the barseqcount analyze
program. A result summary is also displayed and added to a report file.
barseqcount analyze¶
Configuration file¶
If no configuration file (barseqcount_analyze.conf by default or any file name entered after the -c
argument) exists in the current directory, or if the -n
argument is used, the command barseqcount analyze
will create a new configuration file (named barseqcount_analyze.conf by default if the -c
argument is not used).
If the -n
argument is used, the existing configuration file will be renamed by adding a unique string of numbers before the file extension.
The configuration file will only be created if a count report file can be found in the current directory (if more than one is present, the most recent will be used), from which relevant information (such as the barcode distribution file name and the definitions) will be used to prepopulate some sections of the configuration file.
The configuration file needs to be edited by the user and each section needs to be filled out with appropriate information before it can be used.
Most sections are actually populated automatically by barseqcount analyze
(but should still be edited by the user according to their preferences) except for the global genome and expression titers which need to be entered manually (although simplified analysis can still be performed if these sections are empty).
When the configuration file is ready, running barseqcount analyze
will open it and check its contents.
If errors are detected, the program will exit with an explanatory message.
Analysis¶
barseqcount analyze
analyzes the data from the barcode distribution file according to the settings in the configuration files, and displays the results as a collection of configurable bar plots and heat maps.
For each plot, the data is also saved as a csv file, so the user also has the option of creating their own plots.
Variant mix composition¶
If a variant mix exists in the sample definitions, its composition is displayed as a bar plot, with the variants in the x-axis and the deviation from equimolar frequency in the y-axis. If some variants have a frequency below a threshold defined in the configuration file, they will be removed from all subsequent analyses.
Global read count per sample¶
Total read counts per sample are displayed as a bar plot, allowing to verify that each sample is represented by a sufficient number of reads.
Global variant enrichment¶
Enrichment of each variant between the variant mix (if present) and each sample is displayed as a heat map, with colors indicating enrichment factors in Log scale. If mix is absent, equimolar variant mix is assumed.
Global biodistributions¶
If both Global titers and Combine data sections exist (and are not empty) in the configuration file, a global biodistribution plot will be displayed for each group in the Combine data section.
Detailed biodistributions¶
If the Combine data section exists and is not empty, detailed biodistribution plots will be displayed for each group in the section. In these plots, data from biological replicates are combined. If Global titers exist in the configuration file, biodistribution is expressed as titers in the appropriate unit, otherwise it is shown as enrichment factors. Each group is represented by two plots: a heat map and a bar plot. In the bar plots, individual data points corresponding to biological replicates can be overlaid in a choice of shapes, and error bars can be shown as range, standard deviation or standard error, according to settings in the configuration file.
Functions¶
Many of the functions used in barseqcount
are also used in other projects and have been included in the dmbiolib package.
main()¶
The main()
function uses argparse
to read and process the command line arguments.
count(args)¶
args: optional arguments following the
count
command
analyze(args)¶
args: optional arguments following the
analyze
command
anaconf(fname,args)¶
fname: name of the configuration file to be created
args: arguments
barseqcount analyze
programcountconf(fname,args)¶
fname: name of the configuration file to be created
args: arguments
barseqcount count
programfind_bc(l,templ,bcr,cl,ctempl,cbcr)¶
l: read
templ: template
bcr: dictionary containing information about barcode locations and error correction
cl: compressed read (using compress function from
dmbiolib
)ctempl: compressed template
cbcr: dictionary containing information about barcode locations based on compressed template
fb(l,templ,i,bcr)¶
l: read (nucleotide sequence)
templ: template
i: barcode index
bcr: dictionary containing information about barcode locations and error correction
maxmatch(sample,target,probe)¶
sample: nucleotide sequence of primer
target: nucleotide sequence of template
probe: initial probe size
override(func)¶
Allows argparse to handle the -v/--version
argument correctly.
version()¶
Displays version and other information:
python -m barseqcount -v
Project: barseqcount
Version: 0.1.4
Latest update: 2023-02-23
Author: Damien Marsic, damien.marsic@aliyun.com
License: GNU General Public v3 (GPLv3)