BLS Speller Spark

Spark

Apache Spark must be installed on the machine or cluster where you will run the BLS Speller program. It can be downloaded here <https://spark.apache.org/downloads.html>, and instructions for installing Spark on Ubuntu are available here <https://phoenixnap.com/kb/install-spark-on-ubuntu>.

Installation

The BLS Speller Spark software can be installed like this:

git clone https://dries_decap@bitbucket.org/dries_decap/bls-speller-spark.git
cd bls-speller-spark
sbt assembly # sbt is required to build the software, see https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html for installation details
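The build relies on git, sbt and Spark being available, so it can help to confirm that before cloning. The following is a minimal sketch, not part of the BLS Speller software: `check_tools` is a hypothetical helper name, and it only tests whether each command can be found on the `PATH`.

```shell
# Hypothetical helper (not part of BLS Speller): report tools missing from PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the command name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the tools used in the installation and execution steps.
check_tools git sbt spark-submit || echo "install the missing tools before building"
```

The helper returns a non-zero status when anything is missing, so it can also be used as a guard in a longer install script.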

The preprocessing and motif iterator software can be installed as described here and here

Execution

The BLS Speller tool requires preprocessed promoter FASTA files; see this page.

Some Spark-related shell variables need to be set:

# spark variables
version=0.2
jar=/path/to/blsspeller-jar/bls-speller-assembly-${version}.jar
jars=$jar
class=be.ugent.intec.ddecap.BlsSpeller
execmem=10g # memory per executor
execcores=1  # cores per executor
numexec=8 # assumes an 8 core machine with at least 80GB memory available for Spark
sparksubmit="spark-submit --conf spark.ui.showConsoleProgress=true --executor-memory ${execmem} --num-executors ${numexec} --executor-cores ${execcores} --conf spark.task.cpus=1 --jars ${jars} --class ${class}"
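The comment on `numexec` assumes 8 cores and 80GB of memory. As a rough sketch of where those numbers come from, the totals can be derived from the per-executor settings. The variable names `mem_gb`, `total_mem` and `total_cores` are introduced here for illustration only, and the calculation assumes `execmem` is given in whole gigabytes with a trailing `g`. It does not account for any extra overhead that Spark or a cluster manager may add per executor.

```shell
# Same values as in the Spark variables above.
execmem=10g   # memory per executor
execcores=1   # cores per executor
numexec=8     # number of executors

# Strip the trailing "g" so the value can be used in arithmetic
# (assumes memory is given in whole gigabytes).
mem_gb=${execmem%g}

total_mem=$((mem_gb * numexec))
total_cores=$((execcores * numexec))

echo "Spark needs about ${total_mem}GB and ${total_cores} cores"
# prints: Spark needs about 80GB and 8 cores
```

When changing `execmem`, `execcores` or `numexec` for a different machine, recomputing these totals is a quick way to check that the request still fits the available hardware.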

Motif discovery

The BLS Speller tool can run all steps automatically like this:

# bls speller variables
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
length=8
degen=3
partitions=48 #should be exactly the number of executors that will start
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
input=/tmp/preprocessed_folder
output=/tmp/blsspeller-output
cmd="${sparksubmit} ${jar} getMotifs --input ${input} --output ${output} --bindir ${bindir} --partitions ${partitions} --alphabet 3 --degen ${degen} --min_len ${length} --bls_thresholds ${blst}"
# run the command
$cmd
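The BLS thresholds are passed as one comma-separated string, so a typo there only surfaces once the Spark job has started. The sketch below checks the string before submitting. It is not part of BLS Speller: `valid_thresholds` is a hypothetical helper, and it assumes that each threshold should be a plain decimal number between 0 and 1.

```shell
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'

# Hypothetical helper (not part of BLS Speller): succeed only if every
# comma-separated value is a plain decimal number in the range [0, 1].
valid_thresholds() {
  echo "$1" | awk -F',' '{
    if (NF == 0) exit 1
    for (i = 1; i <= NF; i++)
      if ($i !~ /^[0-9]*\.?[0-9]+$/ || $i < 0 || $i > 1) exit 1
  }'
}

# Number of thresholds in the list.
count=$(echo "$blst" | awk -F',' '{ print NF }')

if valid_thresholds "$blst"; then
  echo "${count} thresholds look valid"
else
  echo "check the --bls_thresholds value: ${blst}" >&2
fi
```

Running this check before `$cmd` keeps an obviously malformed list from reaching `spark-submit`.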

Optionally, the gene families can be processed separately with the motifIterator tool. This produces parquet files, which can then be used as input for the BLS Speller tool, like this:

# the data can first be reduced to a single parquet file to reduce storage space.
input=/tmp/preprocessed_folder
parquet=/tmp/folder_of_parquet_files/
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
cmd="${sparksubmit} ${jar} mergeParquet --input ${input} --output ${parquet} --bls_thresholds ${blst}"
# run the merge before continuing; getMotifs below reads its output
$cmd

# bls speller variables
length=8
degen=3
partitions=48 #should be exactly the number of executors that will start
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
parquet=/tmp/folder_of_parquet_files/
output=/tmp/blsspeller-output
cmd="$sparksubmit $jar getMotifs --input ${parquet} --output ${output} --bindir ${bindir} --partitions ${partitions} --alphabet 3 --degen ${degen} --min_len ${length} --bls_thresholds ${blst}" # add --merged_parquet when working with merged parquet input file
# run the command
$cmd

Locating conserved motifs

Once the motifs have been detected in the previous step, they can be filtered by confidence score and located in the promoter sequences of selected species.

# bls speller variables, same as before:
blst='0.07,0.13,0.41,0.54,0.65,0.75,0.85,0.95'
length=8
degen=3
partitions=200 # lower confidence score thresholds require more partitions
inputdir=/data/bls/wheat
bindir=/path/to/motifIterator_binary/
input=/tmp/preprocessed_folder # same input as before

# updated variables:
motifs=/tmp/blsspeller-output # output of previous step
output=/tmp/blsspeller-output-locations
fasta=/tmp/species_promotors/ # either a single fasta file with the promoter sequence of each gene in one species, or a folder with one fasta file per species
maxl=$((length+1))
c=0.9 # keep only motifs with a confidence score of at least ${c}
fam=1 # filter out motifs that appear in fewer than ${fam} gene families

cmd="${sparksubmit} ${jar} locateMotifs --fasta ${fasta} --motifs ${motifs} --input ${input} --output ${output} --bindir ${bindir} --partitions ${partitions} --degen ${degen} --max_len ${maxl} --conf_cutoff ${c} --fam_cutoff ${fam} --bls_thresholds ${blst}" # optionally add --gene_pos to get positions relative to the start of the promoter region
$cmd