BLS Speller Spark

Spark

Apache Spark is required on the machine or cluster on which you will execute the BLS Speller program. It can be downloaded from <https://spark.apache.org/downloads.html>, and instructions for installing Spark on Ubuntu are available at <https://phoenixnap.com/kb/install-spark-on-ubuntu>.

Installation

The BLS Speller Spark software can be installed like this:

git clone https://dries_decap@bitbucket.org/dries_decap/bls-speller-spark.git
cd bls-speller-spark
# sbt is required to build the software, see https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html for installation details
sbt assembly

The preprocessing and motif iterator software can be installed as described here and here

Execution

The BLS Speller tool requires preprocessed promoter FASTA files, see this page

Some Spark environment variables need to be set:

# spark variables
version=0.2
jar=/path/to/blsspeller-jar/bls-speller-assembly-${version}.jar
jars=$jar
class=be.ugent.intec.ddecap.BlsSpeller
execmem=10g # memory per executor
execcores=1 # cores per executor
numexec=8 # assumes an 8 core machine with at least 80GB memory available for Spark
sparksubmit="spark-submit --conf spark.ui.showConsoleProgress=true --executor-memory ${execmem} --num-executors ${numexec} --executor-cores ${execcores} --conf spark.task.cpus=1 --jars ${jars} --class ${class}"
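
The comment on numexec ties the executor count to the memory available for Spark. A minimal sketch of that arithmetic, reusing the values above; total_gb is a hypothetical helper variable, and the figure covers only the requested executor heap (some cluster managers reserve extra overhead memory on top of it):

```shell
# Values copied from the Spark variables above
execmem=10g   # memory per executor
numexec=8     # number of executors

# Strip the trailing "g" and multiply to get the total executor memory requested
total_gb=$(( ${execmem%g} * numexec ))
echo "Requesting ${total_gb}g across ${numexec} executors"
```

If the machine has less memory than this total, lowering execmem or numexec before building sparksubmit is the simplest adjustment.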

Motif discovery

The BLS Speller tool can run all steps automatically like this:

# bls speller variables
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
length=8
degen=3
partitions=48 # should be exactly the number of executors that will start
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
input=/tmp/preprocessed_folder
output=/tmp/blsspeller-output
cmd="${sparksubmit} ${jar} getMotifs --input ${input} --output ${output} --bindir ${bindir} --partitions ${partitions} --alphabet 3 --degen ${degen} --min_len ${length} --bls_thresholds ${blst}"
# run the command
$cmd
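
Because the paths above are placeholders, it can help to check them before submitting the job. The following is a sketch only: require_dir is a hypothetical helper, not part of BLS Speller, and it assumes that input and bindir are local directories (it would not apply to HDFS or other remote paths).

```shell
# Hypothetical helper: report a missing directory instead of letting the Spark job fail later
require_dir() {
    if [ ! -d "$1" ]; then
        echo "missing directory: $1" >&2
        return 1
    fi
}

input=/tmp/preprocessed_folder
bindir=/path/to/motifIterator_binary/

if require_dir "$input" && require_dir "$bindir"; then
    echo "paths look fine, the command can be run"
else
    echo "fix the paths above before running the command"
fi
```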

Optionally, the gene families can be processed separately with the motifIterator tool, as shown below. This produces Parquet files that can be used as input for the BLS Speller tool.

# the data can first be reduced to a single parquet file to reduce storage space.
input=/tmp/preprocessed_folder
parquet=/tmp/folder_of_parquet_files/
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
cmd="${sparksubmit} ${jar} mergeParquet --input ${input} --output ${parquet} --bls_thresholds ${blst}"
# run the merge before the variable is reused below
$cmd

# bls speller variables
length=8
degen=3
partitions=48 # should be exactly the number of executors that will start
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
parquet=/tmp/folder_of_parquet_files/
output=/tmp/blsspeller-output
# add --merged_parquet when working with merged parquet input file
cmd="${sparksubmit} ${jar} getMotifs --input ${parquet} --output ${output} --bindir ${bindir} --partitions ${partitions} --alphabet 3 --degen ${degen} --min_len ${length} --bls_thresholds ${blst}"
# run the command
$cmd
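
Both commands above receive the same --bls_thresholds string, so a malformed value affects every step. A quick sanity check of that string might look like the sketch below. It assumes that each threshold is a decimal number greater than 0 and at most 1, which matches the example values but is not stated by the tool itself; n_total and n_bad are hypothetical variable names.

```shell
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'

# Count the comma-separated entries
n_total=$(printf '%s\n' "$blst" | tr ',' '\n' | wc -l | tr -d ' ')

# Count entries that are not plain decimals in the assumed range (0, 1]
n_bad=$(printf '%s\n' "$blst" | tr ',' '\n' |
    awk '$0 !~ /^[0-9]+(\.[0-9]+)?$/ || $0+0 <= 0 || $0+0 > 1 { bad++ } END { print bad+0 }')

echo "${n_total} thresholds, ${n_bad} invalid"
```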

Locating conserved motifs

After the motifs have been detected in the previous step, they can be filtered on a confidence score and their locations can be found in specific species.

# bls speller variables, same as before:
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
length=8
degen=3
partitions=200 # lower confidence score thresholds require more partitions
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
input=/tmp/preprocessed_folder # same input as before

# updated variables:
motifs=/tmp/blsspeller-output # output of previous step
output=/tmp/blsspeller-output-locations
fasta=/tmp/species_promotors/ # either a single fasta file of the promoter sequences per gene in that species, or a folder with a fasta file per species.
maxl=$((length+1))
c=0.9 # keep only motifs with a conf score of at least ${c}
fam=1 # drop motifs that appear in fewer than ${fam} gene families

# optionally add --gene_pos to get positions relative to the start of the promoter region
cmd="${sparksubmit} ${jar} locateMotifs --fasta ${fasta} --motifs ${motifs} --input ${input} --output ${output} --bindir ${bindir} --partitions ${partitions} --degen ${degen} --max_len ${maxl} --conf_cutoff ${c} --fam_cutoff ${fam} --bls_thresholds ${blst}"
$cmd
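
The value passed to --max_len is derived from length with shell arithmetic, and every option in the command depends on a variable being set. The sketch below isolates those two points. It builds only a fragment of the command for illustration; opts is a hypothetical variable, and the `${name:?message}` form is standard shell parameter expansion that aborts with the message when the variable is unset or empty, instead of silently dropping the value.

```shell
length=8
partitions=200

# Arithmetic expansion: one more than the motif length used during discovery
maxl=$((length+1))

# Fail early if a variable is missing rather than passing an empty option value
opts="--partitions ${partitions:?partitions is not set} --max_len ${maxl:?maxl is not set}"
echo "$opts"
```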