Skip to content

Help with parameters and interpretation of results #37

@SanSanMX

Description

@SanSanMX

Hi @GDKO

First of all, thank you so much for developing this pipeline. It's very practical, and I had no problems with the installation and running.

I'm writing to ask you a couple of questions. I have a genome from an insect of the order Hemiptera, and I'd like to know if there are any horizontally transferred genes from bacteria and fungi to my insect. I ran it step by step as indicated in the wiki, but I'm wondering if I'm setting the parameters correctly. For example, for my groups, I set the file like this:

Ingroup:

33208: Metazoa
EGP:

7524: Hemiptera
  1. Is this configuration correct?

And the configurations file similar to:

---
max_threads: 30

# DB path
blast_db_path: nr # blast: change to the local blast_db path
fasta_path: uniprot_sprot.fasta # diamond: change to the local fasta path for sp, ur90, or custom database
mode: sp # use blast for blast database, use sp for swissprot database, ur90 for unif90 or custom database
data_type: AA # data type DNA, AA

## Algorithm options
# prepare
ai_cutoff: 0
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai or ahs" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
percent_identity: 100 # select hits equal or below this number
cutoffextend: 20# when ingroup hit is found, we take this hit + n hits
number_hits_noingroup: 50 # when no ingroup hit is found, we take this number of hits
trimal: false
min_num_hits: 4 # select queries with at least that many blast hits
percentage_similar_hits: 0.7 # group queries based on this
# detect, classify, evaluate
fastml: true # Use fasttree instead of IQTree
node_support: 0 # nodes below that number will collapse
complex_per_ingroup: 20 # if D/(D+I) smaller than this then node is considered Ingroup
complex_per_donor: 80 # if D/(D+I) greater than this then node is considered Donor
complex_per_node: 90 # if node contains percent number of this category, it is assigned

# Program specific options
mafft_options: '--anysymbol --auto'
trimal_options: '-automated1'

#IQ-Tree
iqmodel: '-mset WAG,LG,JTT -AICc -mrate E,I,G,R'
ufbootstrap: 1000
iq_threads: 30
  1. My question is, should I adjust the ai_cutoff and ahs_cutoff values ​​similar to the tutorial? What is your suggestion? I left them at 0.

Finally, after running each step I got the following:

(avp39) [enriquepet19961998@pitzer-login02 output_dir]$ ls -lh
total 707K
drwxr-xr-x 2 enriquepet19961998 PAS0471 32K Dec 19 16:11 alt_topology
-rw-r--r-- 1 enriquepet19961998 PAS0471 2.1K Dec 19 16:11 alt_topology_results.tsv
drwxr-xr-x 11 enriquepet19961998 PAS0471 4.0K Dec 19 15:28 classification
-rw-r--r-- 1 enriquepet19961998 PAS0471 232 Dec 19 15:28 classification_results.txt
-rw-r--r-- 1 enriquepet19961998 PAS0471 166K Dec 19 15:28 classification_tree_results.txt
drwxr-xr-x 2 enriquepet19961998 PAS0471 32K Dec 19 01:22 fastagroups
drwxr-xr-x 2 enriquepet19961998 PAS0471 32K Dec 19 01:33 fasttree
-rw-r--r-- 1 enriquepet19961998 PAS0471 177 Dec 19 01:34 fasttree_general_results.txt
drwxr-xr-x 2 enriquepet19961998 PAS0471 128K Dec 19 01:34 fasttree_nexus
-rw-r--r-- 1 enriquepet19961998 PAS0471 197K Dec 19 01:34 fasttree_tree_results.txt
-rw-r--r-- 1 enriquepet19961998 PAS0471 16K Dec 19 17:52 fasttree_tree_results.txt_k5.prox.txt
-rw-r--r-- 1 enriquepet19961998 PAS0471 35K Dec 19 01:22 groups.tsv
drwxr-xr-x 2 enriquepet19961998 PAS0471 32K Dec 19 01:26 mafftgroups
drwxr-xr-x 2 enriquepet19961998 PAS0471 4.0K Dec 19 01:21 tmp
  1. I understand that the final file for decision-making should be fasttree_tree_results.txt_k5.prox.txt?
(avp39) [enriquepet19961998@pitzer-login02 output_dir]$  wc -l fasttree_tree_results.txt_k5.prox.txt
82 fasttree_tree_results.txt_k5.prox.txt

Content:

HGT	output_dir/mafftgroups/gp9.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp9.fa.fasttree	Dinsect_Chr_1_g674.t1	NN-NN|X|H--NN	0.56
HGT	output_dir/mafftgroups/gp9.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp9.fa.fasttree	Dinsect_Chr_1_g674.t2	N-NNH|X|--NNN	0.56
HGT	output_dir/mafftgroups/gp26.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp26.fa.fasttree	Dinsect_Chr_8_g575.t1	NNNNN|X|HHNNN	0.6
HGT	output_dir/mafftgroups/gp26.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp26.fa.fasttree	Dinsect_Chr_8_g575.t3	NNNHH|X|NNNNN	0.6
HGT	output_dir/mafftgroups/gp26.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp26.fa.fasttree	Dinsect_Chr_8_g575.t2	NNNNH|X|HNNNN	0.6
HGT	output_dir/mafftgroups/gp27.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp27.fa.fasttree	Dinsect_Chr_6_g286.t1	NN-NN|X|NNNNN	0.92
HGT	output_dir/mafftgroups/gp29.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp29.fa.fasttree	Dinsect_Chr_1_g201.t1	NNNNN|X|NNN--	0.84
HGT-NT	output_dir/mafftgroups/gp42.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp42.fa.fasttree	Dinsect_Chr_1_g1409.t1	NNNNN|X|NN---	0.76
HGT-NT	output_dir/mafftgroups/gp46.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp46.fa.fasttree	Dinsect_Chr_6_g707.t1	NNH--|X|N---N	0.4
HGT-NT	output_dir/mafftgroups/gp46.fa	/fs/ess/test2/hgt/output_dir/fasttree/gp46.fa.fasttree	Dinsect_Chr_3_g640.t1	--N--|X|N-N--	0.44

Finally, when I run a blastp command on some sequences, for example Dinsect_Chr_1_g674.t1, blastp shows that it's identical to another insect:

Select seq gb|KAJ6626570.1| 1 member(s), 1 organism(s) flies Glycine betaine reductase ATRR [Pseudolycoriella hygida] 1598 1598 100% 0.0 59.15% 1271 KAJ6626570.1
  1. How should this result be interpreted? Several proteins from my insect, classified as HGT and with positive values, match those of other insects.

I would greatly appreciate your feedback.

Thanks so much.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions