PHANS-C Tutorial

For those who want to hit the ground running…

You can either use a pre-made Amazon AWS AMI, Docker image, or install PHANS-C locally.

For tips getting PHANS-C running, and getting the tutorial folder, check out Get PHANS-C.

Get the tutorial folder

  • AWS users will find the folder in /home/ubuntu/Tutorial
  • Docker users will need to download the Tutorial folder, then mount it as a shared folder (see Get PHANS-C)
  • Local users will find the tutorial folder in the main PHANS-C repository download.

The Tutorial

The order of the tutorial steps roughly follow the flow diagram of the pipeline found here. Open the Tutorial directory. From here on out, paths are relative to the Tutorial directory.:

If you prefer, there is a shell script which automates the following process. However, we recommend that you manually run through the steps to gain a greater understanding. If you prefer to use the shell script for the tutorial data, it can be run as follows:

$ sh run_tutorial.sh

Warning: If you installed PHANS-C and prerequisites manually, rather than using the AWS image, it is very common to have to slightly modify some scripts to get the workflow running. For example, the raxml binary can be called different things (e.g. raxml, raxmlHPC, raxmlHPC-AVX, raxml-AVX2, raxml_8.2.11, etc) depending on how it was installed. You may need to open the run_tutorial.sh and ProteinModelSelection.pl scripts and adjust the name of the raxml binary. Be on the lookout for other similar errors. Another example is the third party ProteinModelSelection.pl script may need to have the perl path modified on the first line of the script.

1: Construct the Phylogenetic Reference Tree

First, you will need to assemble a list of full length gene sequences which will act as known reference sequence which will be used to compare the unknown metagenomic sequences. For this tutorial, we will identify and categorize sequences of the ribosomal protein rpsD. Full length rpsD reference sequences in amino acid space can be found in the file AA/rpsD_AA.fasta.

1.1: Make reference sequence alignment

Move into the AA directory:

$ cd AA/

Generate an alignment for the reference rpsD sequences using the program muscle.

$ muscle -in rpsD_AA.fasta -out rpsD_AA_aln.fasta

1.2: Convert alignment to phymlAln format

Convert from fasta to phymlAln format for tree construction.

$ phan-fastaToPhyml.py -f rpsD_AA_aln.fasta

phan-fastaToPhyml.py command line options:

-h, --help Show this help message and exit.
-f FILE Specify the FASTA in-file that contains the sequence alignment.

1.3: Find best amino acid substitution model for reference tree

Since we are using amino acid space, you may want to test for the best amino acid substitution model. However, you may use any model available in RAxML. An excellent perl script, proteinModelSelection.pl, for identifying the best AA substitution model for tree generation was created by the RAxML author Alexis Stamatakis and can be found at the RAxML site.

The proteinModelSelect.pl script generates a lot of output files,so, to manage these create a new directory:

$ mkdir ProteinModel

Will need to move a copy of the alignment file into this directory in order to run proteinModelSelection.pl.:

$ cp rpsD_AA_aln.phymlAln ./ProteinModel/

Now, move into the ProteinModel directory.

$ cd ProteinModel

Run proteinModelSelection.pl and redirect the standard output, which is a printout of the best substitution model, to the file BestModel.txt.

$ proteinModelSelection.pl rpsD_AA_aln.phymlAln > BestModel.txt

When proteinModelSelection.pl is finished running, check for the best identified substitution model.

$ cat BestModel.txt

Which should print out Best Model: LG

Also, check to see if a file rpsD_AA_aln.phymlAln.reduced has been generated. If there are duplicate sequences, this .reduced file is generated, and it is best to construct a tree on this alignment file. You can find what sequences were duplicates and which were removed from the alignment by checking the RAxML_info files.

$ more RAxML_info.LG_rpsD_AA_aln.phymlAln_EVAL

Move a copy of the .reduced file back to the AA directory, which should be one level above the current directory.

$ cp rpsD_AA_aln.phymlAln.reduced ../

Now, move back to the AA directory.

$ cd ../

1.4: Construct an amino acid reference phylogenetic tree

Build a tree using the generated alignment files and the best amino acid subtiution model identified. You can find more information about constructing trees with RAxML at the RAxML software page.

The PHANS-C Pipeline data analysis was developed using RAxML version 8.2.8. On your system, the RAxML exectuable may be named differently. For example, on our local system, ours is called raxmlHPC-PTHREADS-AVX. On the AWS AMI, version 8.2.8 can be accessed via the AMI executable raxml; however, a newer version of RAxML, version 8.2.11, is also available on the AMI and has been tested for compatibility with the PHANS-C Pipeline. RAxML version 8.2.11 can be accessed through the AMI executable raxml_8.2.11 or via raxml.

$ raxml -m PROTCATLG -s rpsD_AA_aln.phymlAln.reduced -# 1 -T 8 -p 610 -n rpsD_AA

Some of the options used to construct this tree are as follows:

-m STR Designates the model used to build the tree. Here, we are constructing an amino acid tree PROT using a gamma model of rate heterogeneity GAMMA using the best amino acid substitution model printed by proteinModelSelection.pl which is LG… PROTGAMMALG
-s FILE Specify the input alignment file.
-T INT Specify the number of threads. Do not use more than your computer has available.
-p INT Specify a random seed for starting parsimony inferences.
-n STR Specify a title for the output file names.

-# INT Specify the number of distinct starting trees.

For expediency in running this tutorial, we are using the CAT model and only using 1 starting tree.However, we would strongly encourage more stringent settings for real data, and recommend using the GAMMA model of rate heterogeneity -m PROTGAMMALG and at least 10 starting trees -# 10.

See the RAxML documentation for further explanation of options.

2: Parse and Format reads from Metagenome

2.1: Construct a BLAST database from the metagenomic sequence reads

You will need BLAST+ in order to build a blast database. You can find information on obtaining BLAST Get PHANS-C. The BLAST database will need to be hash indexed in order to pull metagenomic reads from the database for placement.

First, move into the blast directory:

$ cd ../blastDB/artificialMG/

An artificial metagenome, artificialMG.fasta, is located in this directory. Now, use this file to generate a blast database.

$ makeblastdb -parse_seqids -hash_index -dbtype nucl -in artificialMG.fasta -out artificialMG

In order to access this blast database from any directory, we will need to add it to the blast path. If you have it installed locally, use your own custom location. If you are using the AWS image, to add this directory to the blast path for this session of the shell, enter the following in the command line:

$ export BLASTDB=$HOME/Tutorial/blastDB/artificialMG

In order to add this directory to the blast path for every session, update the shell .profile with the following:

$ echo 'export BLASTDB=$HOME/Tutorial/blastDB/artificialMG/' >> ~/.profile

2.2: Search the BLAST database for potential reads of interest

Use BLAST search to pull out reads from the metagenome which show similarity to your target gene of interest. In this case, parse out reads that show similarity to a set of rpsD sequences we are interested in. We will use a set of full length amino acid sequences as queries from the file ~/Tutorial/AA/rpsD_AA_seeds_for_blast.fasta. NOTE: It is important to use full length gene sequences, as the trimming script used later will trim the metagenomic reads to be within the length or your starting query sequences.

Move to the directory AA

$ cd ../../AA/

Now use this amino acid sequence seed file to search a 6-frame translation of the metagenomic reads.

$ tblastn -db_gencode 11 -db ../blastDB/artificialMG/artificialMG -outfmt 5 -num_threads 2 -max_target_seqs 1000000 -evalue 1 -query rpsD_AA_seeds_for_blast.fasta -out rpsD_artificialMG.tblastn.xml

2.3: Parse reads from the BLAST results

Take the outfile from this BLAST search, rpsD_artificialMG.tblastn.xml, and parse out the desired information:

$ phan-parseBlastXML.py -f rpsD_artificialMG.tblastn.xml -n 0 -e ../blastDB/artificialMG/artificialMG -s -l -o rpsD_artificialMG_blastResults -c 2

phan-parseBlastXML.py command line options:

-h, --help Show this help message and exit.
-f FILE Specify the blast XML format FILE that you wish to use as input.
-o FILE Specify the outfile that you wish to use. If nothing is specified, the name and location of the infile will be used, with a “best” as the suffix.
-n INT Specify the number of hits you want from each blast record. A 0 means return all results. Default is 1.
-s, --sort Sort by evalue.
-e BLASTDB Specify db to extract hit read sequences from.
-c INT Specify number of cpu cores to use. Default=2.

phan-parseBlastXML.py output files:

rpsD_artificialMG_blastResults.hits - A file of all the blast hits: contains the blast information as well as the sequence of the reads. Blast results partitioned by query sequence. This file to be passed as input to next pipeline step

Each hit has a fasta title in the following tab delimited format:
Query   Hit Accession   Description Score               Evalue  Hit Number              Query Length    Hit ID  Hit Length              Align Length    Identities              Positives               Gaps    Query From         Query To                Hit From                Hit To  Frame Query Frame Hit



#####
#Hits below from query Acaryochloris_marina_MBIC11017_66
#####

>SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2                SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 No definition line        130.0    3.53312e-10     1       202     SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 142     42      25      30      0       89      130     1     126      0               -2
TCGGTAGCTGGGAATATCCACGCGCTTGCCGTTGACCAGGATGTGACCGTGACCCACAAACTGCCGGGCCTGACGGCGGGTCGAGGCAAACCCCATGCGGAACACGACATTGTCGAGGCGGCGCTCCAGCAGCTGCAGGAAC
>SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2                SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 No definition line        130.0    3.53312e-10     2       202     SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 142     42      25      30      0       89      130     1     126      0               -2
...

rpsD_artificialMG_blastResults - A tab delimited file of all the unique blast hits with associated blast result information.:

Query   Hit Accession   Description Score               Evalue  Hit Number              Query Length    Hit ID  Hit L
ength   Align Length    Identities              Positives               Gaps    Query From              Query To                Hit F
rom             Hit To  Frame Query Frame Hit
Acaryochloris_marina_MBIC11017_66               SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_3197
95_37835_2              SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 No definition
 line   130.0   3.53312e-10             1               202             SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonom
yId_319795_37835_2              142             42              25              30              0               89              130             1               126             0       -2
Acaryochloris_marina_MBIC11017_66               SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_3197
95_37835_2              SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2 No definition
 line   130.0   3.53312e-10             2               202             SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonom
yId_319795_37835_2              142             42              25              30              0               89              130             1               126             0       -2
...

rpsD_artificialMG_blastResults.hits_all.fasta - A file that just contains all of the read sequences for the blast hits.:

>lcl|SMPLB_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_2
TCGGTAGCTGGGAATATCCACGCGCTTGCCGTTGACCAGGATGTGACCGTGACCCACAAACTGCCGGGCCTGACGGCGGG
TCGAGGCAAACCCCATGCGGAACACGACATTGTCGAGGCGGCGCTCCAGCAGCTGCAGGAAC
>lcl|SMPLB_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_2
TTTTTACCGTTAACTCTTACGTGACCATGGTTAATAAGTTGTCTCGCTGAAAATACTGTTGTTGAAAATCTTGCTCTATA
AACAACAGAATCTAATCTTCTTTCAAGCAAACCAATTAAATTTTCACCCGTATCCCCTTTTAACAT
...

2.4: Trim the metagenomic reads:

Pass the .hits output file from the previous step to phan-trimFormatReads.py in order to format and clean the reads for placement.:

$ phan-trimFormatReads.py -g rpsD -f rpsD_artificialMG_blastResults.hits -m 100 -e 1e-5

phan-trimFormatReads.py command line options:

--version Show program’s version number and exit.
-h, --help Show this help message and exit.
-g STR Specify the gene name which will be used to label output files.
-f FILE Specify the input file that is parsed from blast output… *.hits file.
-m INT Specify the minimum length of reads that will be passed on to rest of analysis, if no min specified all reads will be passed.
-e STR (OPTIONAL) Evalue cutoff all reads must be =< this evalue to continue, if no max specified all reads will be considered. Enter in scientific notation (1e-5).
-s INT (OPTIONAL) All reads must be >= this score to continue, if no min specified all reads will be considered.

Functions of phan-trimFormatReads.py:

  • Trims the reads so that they are in the proper reading frame which corresponds with the BLAST hit.
  • Trims the reads to remove any overhang that might exist - ie. if the read captures the beginning of a gene, but also sequence upstream, the upstream sequence will be removed to improve alignment quality.
  • Sets an e-value or score threshold for reads - must meet certain similarity scores in order to continue for placement analysis.
  • Sets a length requirement - reads, after trimming, must be of a specific length (designated in nucleotide bases) in order to continue for placement analysis.
  • Removes any reads with ambiguous nucleotide bases - creates a file with all the reads which ambiguous nucleotides.
  • Removes stop codons in the amino acid translation (the stop codons do not work with downstream placement software). List of reads with stop codons produced.

phan-trimFormatReads.py output files:

rpsD_formatted_query_seqs_NT.fasta - A fasta file of the trimmed read sequences. Trimmed according to the blast hit information.:

>SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_1
AAAGTTAACATTGGAAGTTACGTTGTTAAAGAAGAAGATATAATTGAAATTAGAGATAAATCTAAACAGTTGGCTATAATTGATATCGCTCTAGCTAGTAA
AGAAAGAGAAACACCTGAATACATTAATTTAGATGAAAAAAAT
>SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_2
ATGTTAAAAGGGGATACGGGTGAAAATTTAATTGGTTTGCTTGAAAGAAGATTAGATTCTGTTGTTTATAGAGCAAGATTTTCAACAACAGTATTTTCAGC
GAGACAACTTATTAACCATGGTCACGTAAGAGTTAACGGTAAA
>SMPLB_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_79864_1
AAACATAAAATAGATAGAAGGTTAAAAATTAACCTATGGGGTAGACCTAAAAGTCCTTTTAATAAAAGGGATTATGGTCCAGGACAACACGGACAAGGTAG
AAAAGGAAAACCTTCTGATTACGGAATTCAGCTACAAGCTAAACAAAAATTAAAA
...

rpsD_AA_translation_stops_removed.fasta - A fasta file of the amino acid translation of the trimmed sequence reads according to the blast hit information. Any stop codons (*) have been removed as stop codons interfere with downstream programs. A file called rpsD_AA_translations_with_Stops.fasta is also produced which contains any reads with stop codons.:

>SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_1
KVNIGSYVVKEEDIIEIRDKSKQLAIIDIALASKERETPEYINLDEKN
>SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_2
MLKGDTGENLIGLLERRLDSVVYRARFSTTVFSARQLINHGHVRVNGK
>SMPLB_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_79864_1
KHKIDRRLKINLWGRPKSPFNKRDYGPGQHGQGRKGKPSDYGIQLQAKQKLK
...

rpsD_info_DICTIONARY_FILE_for_back_translate.csv - A data file in .csv format which contains key sequence information.:

Accession, e-value, abs_frame, raw_read, fixed_read, AA_with_stop, AA_removed_stop
>SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_58376_1, 1.15292e-24, 3, ATTTTTTTCATCTAAATTAATGTATTCAGGTGTTTCTCTTTCTTTACTAGCTAGAGCGATATCAATTATAGCCAACTGTTTAGATTTATCTCTAATTTCAATTATATCTTCTTCTTTAACAACGTAACTTCCAATGTTAACTTTTT, AAAGTTAACATTGGAAGTTACGTTGTTAAAGAAGAAGATATAATTGAAATTAGAGATAAATCTAAACAGTTGGCTATAATTGATATCGCTCTAGCTAGTAAAGAAAGAGAAACACCTGAATACATTAATTTAGATGAAAAAAAT, 233.0, KVNIGSYVVKEEDIIEIRDKSKQLAIIDIALASKERETPEYINLDEKN, KVNIGSYVVKEEDIIEIRDKSKQLAIIDIALASKERETPEYINLDEKN
...

rpsD_unique_reads_with_ambiguous_nucleotides.fasta - If there are reads which have ambiguous nucleotides (e.g. base = ‘N’), then those reads are removed and listed here.

rpsD_query_reads_processing_info.log - A helpful basic stats file that should be reviewed before proceeding to next step of the pipeline. If there are significant reads which are removed due to ambiguous nucleotides or a high abundance of reads with stop codons, then the user should proceed with caution.:

The original number of blast reads prior to deduplication = 3110

The original number of unique reads recruited by blast = 106

e-value cutoff of 1e-05 used; all reads reported are <= this e-value threshold.
Number of reads which did not make this e-value threshold and were discarded: 23

The number of unique reads recruited from blast that also met thresholds specified = 83

The number of unique reads containing ambiguous nucleotides = 0
The percent of unique reads with ambiguous nucleotides that were removed from analysis = 0.00%

The number of unique reads that were less than 100 bases & removed from analysis = 7
The percent of reads that were too short & removed from analysis = 8.43%

The number of clean unique reads without ambiguous nucleotides = 76

The number of reads with at least one stop codon = 0
Percent of unique reads with stop codons = 0.00%

3: Align unknown reads to Reference Alignment

3.1: Use the program PaPaRa to align reads

Combine the formatted unknown reads with the reference phylogeny for identification. Take the alignmment that was used to construct the phylogentic reference tree and align all the unknown clean reads (amino acid space) to the reference alignment using the software package PaPaRa.

$ papara -t RAxML_bestTree.rpsD_AA -s rpsD_AA_aln.phymlAln.reduced -q rpsD_AA_translation_stops_removed.fasta -n rpsD_AA_artificialMG -j 1 -a -f

PaPaRa command line options used:

-t FILE Specify the reference phylogenetic tree.
-s FILE Specify the alignment used to build the reference tree.
-q FILE Specify the file with the formatted environmental sequence reads.
-n STR Specify label for output files.
-j INT Specify number of threads (do not specify more threads than available on your machine).
-a Using amino acid sequences.

PaPaRa output files:

papara_alignment.rpsD_AA_artificialMG - The alignment file in phymlAln format which contains the recruited reads aligned to the reference alignment.

140 228
Bifidobacterium_angulatum_DSM_20098_21                                                                            --MTNVQRSRRQVRLSRALGIAL----------TPKAQRIFEKRPYAPGEHG--RTRRRTESDYAVRLREKQRLRAQY-GLSEKQLRAVYEKGTKTSGQTGNAMLQDLEVRLDNLVLRAGFARTTAQARQFVVHRHILVDGNIVNRPSYRVKPGQTIQVKAKSQTMEPFQAAAEGVHRDVLPPVPGYLDVNLASLKATLTRKPEAEE--V-PAQVNIQYVVEFYAR--
Pirellula_sp_1_61                                                                                                                         MKVTMARYTGPKARINRRLGTMLYET-----AGAARA---MDRRPQPPGMHT----RGRRPSNYGAALMEKQKIKHYY-GLGERQLRRYFENVGRKSGNTGELLLLMCERRLDNVVRRVGFTKTRPQARQGITHGHFRVNGVKVTKPGYMLRAGDLIEVRGRENLKNLYRGVIANSPPDGLD----WVSFDSETLRATVLSLPGAVD--I-SLPVDANSVVEFLSR--
Planctomyces_brasiliensis_DSM_5305_53                                                                             ----MGRYTGPKARINRRLGSPVFES-----AGALRA---SEKRDSPPGMH---QRRK-KPTIYGAALTEKQKIKYYY-GFRERQLRKYFNEARRLKGNTGQNLLVLCERRLDNVVRRAGFAQTRPQARQAIVHSHFQLNGRTVNKPSIQVRAGDVITVRNRPNLHAIYKEQVGRADTAACE----FVSVDDSALKIVVSAIPTFAD--V-SLPVDVNQVVAFLSR--
...
SMPLB_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_32557_2  -------------------------------------------------------------------------------------------------------------------------------------------------------KEEDIIEIRDKSKQLAIIDIALA-SKERETPE---YINLDEKNKKVTFVRTPKFDE---------------------
...

papara_log.rpsD_AA_artificialMG - Settings and statistics for the PaPaRa run. This information is also printed to StdOut upon running PaPaRa.

papara called as:
papara -t RAxML_bestTree.rpsD_AA -s rpsD_AA_aln.phymlAln.reduced -q rpsD_AA_translation_stops_removed
.fasta -n rpsD_AA_artificialMG -j 16 -a -f
references container instantiated as: papara::references<pvec_pgap, sequence_model::tag_aa>
edges: 125
scoring scheme: -3 -1 2 -3
papara_core version 2.4
start scoring, using 16 threads
thread 1: 0.786583 gncup/s
thread 3: 0.378382 gncup/s
thread 10: 0.200862 gncup/s
thread 6: 0.37806 gncup/s
thread 4: 0.371692 gncup/s
thread 9: 0.193493 gncup/s
thread 8: 0.378985 gncup/s
thread 12: 0.209433 gncup/s
thread 2: 0.477034 gncup/s
thread 13: 0.177332 gncup/s
thread 14: 0.282384 gncup/s
thread 0: 0.286568 gncup/s
thread 15: 0.283097 gncup/s
thread 11: 0.327415 gncup/s
thread 5: 0.594323 gncup/s
thread 7: 0.489285 gncup/s
scoring finished: 0.023149
generating best scoring alignments
SUCCESS 0.147885

3.2: Join reads to increase phylogenetic signal

Next generation sequencing technologies often use mate pairing or paired ends - geneating 2 reads per strand of genomic DNA. In order to enhance the phylogenetic signal of the unknown reads for placement, and reduce errors of duplicate counting, the script spaceJoin will combine the alignments of individual read pairs into one read for placement. Since the exact length between the read pairs is unknown, an assumption is made that the alignment of the individual reads is accurrate. Read pairs should be designated as _1 and _2 in fasta file used for building the BLAST database, and the optional suffix _0 can represent read pairs that have already been joined.

$ phan-spaceJoin.py -p papara_alignment.rpsD_AA_artificialMG -r 64 -o rpsD_AA_cut40 -l 40

phan-spaceJoin.py command line options:

--version Show program’s version number and exit.
-h, --help Show this help message and exit.
-p FILE Specify the output file from PaPaRa that has the reference aln and metagenome reads aln combined (“papara_alignment.*”).
-r INT Specify the number of sequences in the reference alignment (# of sequences in reference tree).
-o FILE Specify the your desired output alignmemt file name.
-l INT Specify the minimum length of each read for placement; default set to 45 bases/residues.

phan-spaceJoin.py output files:

Log_spaceJoin_rpsD_AA_cut40_joinedReadsForPlacement.txt - A log file that outputs statistics about the joining of read pairs. Reads that are joined are relabeles as “_3”.

The following of the input alignment are the following:

Pre-Joind BOTH READS: 0
BOTH READS: 29
ONLY READ 1: 12
ONLY READ 2: 6


*****************************************************************************************
The following are the set of reads in the output file that are longer than 40
LONG Pre-Joined BOTH READS: 0
LONG BOTH READS: 29
LONG ONLY READ 1: 12
LONG ONLY READ 2: 4

Perecent of reads from total dataset that are longer than 40: 97.37%

Percent of long reads with pairs: 78.38%


*****************************************************************************************
Reads that are written to output file

Reads with mate pairs joined:
SMPLA_Vibrio_sp_MED222_NCBI_taxonomyId_314290_82469_3, SMPLB_Vibrio_sp_MED222_NCBI_taxonomyId_314290_82469_3, SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_3, SMPLA_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_19173_3,
...

rpsD_AA_cut40_joinedReadsForPlacement.phymlAln - A new alignment file with the joined reads.

109 228
Bifidobacterium_angulatum_DSM_20098_21                                                                            --MTNVQRSRRQVRLSRALGIAL----------TPKAQRIFEKRPYAPGEHG--RTRRRTESDYAVRLREKQRLRAQY-GLSEKQLRAVYEKGTKTSGQTGNAMLQDLEVRLDNLVLRAGFARTTAQARQFVVHRHILVDGNIVNRPSYRVKPGQTIQVKAKSQTMEPFQAAAEGVHRDVLPPVPGYLDVNLASLKATLTRKPEAEE--V-PAQVNIQYVVEFYAR--
Pirellula_sp_1_61                                                                                                                         MKVTMARYTGPKARINRRLGTMLYET-----AGAARA---MDRRPQPPGMHT----RGRRPSNYGAALMEKQKIKHYY-GLGERQLRRYFENVGRKSGNTGELLLLMCERRLDNVVRRVGFTKTRPQARQGITHGHFRVNGVKVTKPGYMLRAGDLIEVRGRENLKNLYRGVIANSPPDGLD----WVSFDSETLRATVLSLPGAVD--I-SLPVDANSVVEFLSR--
...
SMPLA_Escherichia_coli_W3110_NCBI_taxonomyId_316407_83762_3                                       -------------------------------------------------------------------------------------FRNYYKEAARLKGNTGENLLALLEGRLDNVVYRMGFGATRAEARQLVSHKAIMVNGRVVNIASYQVSPNDVVSIREKAKKQSRVKAALELAEQREKPT---------------------------------------------
SMPLB_Candidatus_Pelagibacter_ubique_HTCC1062_NCBI_taxonomyId_335992_8052_3       ----MTKRISAKHKIDRRLKINL--------WGRPKS--PFNKRDYGPGQHG--QGRKGKPSDYGIQLQAKQKLKGYYGNINERQFRNIYKKAT--------------------------------------------------------------------------------------------------------------------------------------
...

4: Phylogenetic placement of unknown reads

4.1: Use the Evolutionary Placement Algorithm (EPA) to assign unknown reads within tree

The formatted unknown metagenomic reads are combined with the reference phylogeny using RAxML EPA.

$ raxml -f v -m PROTCATLG -s rpsD_AA_cut40_joinedReadsForPlacement.phymlAln -r RAxML_bestTree.rpsD_AA -n rpsD_AA_placements -T 8

RAxML-EPA command line options used:

-f v Use EPA to place environmental sequence reads onto a reference tree.
-m PROTGAMMALG Use the PROTGAMMALG model.
-s FILE Alignment file with joined environmental reads aligned to reference alignment.
-r FILE Specify the original reference tree.
-n STR Label for output files.
-T INT Specify number of threads (do not specify more threads than available on your machine).

RAxML-EPA output files:

RAxML-EPA produces numerous files - details of which can be found in the RAxML manual.

RAxML_classification.rpsD_AA_placements - Read placement/classification results file.

SMPLB_Bacteroides_plebeius_DSM_17135_NCBI_taxonomyId_484018_1483_1 I9 1 0.00000100000050002909
SMPLB_Shewanella_baltica_OS155_NCBI_taxonomyId_325240_2956_1 I96 1 0.00000100000050002909
SMPLA_Bacteroides_plebeius_DSM_17135_NCBI_taxonomyId_484018_1483_1 I9 1 0.00000100000050002909
SMPLA_Shewanella_baltica_OS155_NCBI_taxonomyId_325240_2956_1 I96 1 0.00000100000050002909
...

RAxML_classificationLikelihoodWeights.rpsD_AA_placements - Read placement/classification results using likelihood weights.

RAxML_entropy.rpsD_AA_placements - Entropy results using likelihood weights.

RAxML_originalLabelledTree.rpsD_AA_placements - Original reference tree with labeled node IDs.

RAxML_labelledTree.rpsD_AA_placements - Reference tree with branch labels and read placements.

RAxML_portableTree.rpsD_AA_placements.jplace - Labelled reference tree with branch labels in portable pplacer/EPA format (without query sequences). This file can be used along with the package genesis for viewing/manipulating placement data. genesis is not included within the AWS AMI or the PHANS-C Pipeline.

4.2: Visualize distribution of likelihood scores (or branch-lengths)

All of the reads may not be placed with high quality scores. In order to evaluate read placement quality, you can view an interactive visualization of the distribution of either likelihood scores for the final placement of a read or the branch lengths. The script “phan-distributions.py” will generate a stand-alone interactive html file where you can inspect the distribution of reads (see figure below for example). Below the chart of the distribution is a table of the reads sorted in order of the score bins. By clicking on a bar in the chart, it will highlight the corresponding row in the table. You can then use keyboard shortcuts to copy the row of values (the reads with scores in that bin) and paste them into a text editor.

The following will generate a distribution of the likelihood scores in a stand-alone .html file:

$ phan-distributions.py -i RAxML_classificationLikelihoodWeights.rpsD_AA_placements -l -o rpsD_likelihoods
_images/PHANS-C_distirbution_ML_Tutorial.png

The following will generate a distribution of branch lengths in a stand-alone .html file:

$ phan-distributions.py -i RAxML_classification.rpsD_AA_placements -b -o rpsD_branchLengths
_images/PHANS-C_distribution_branch_tutorial.PNG

While these figures are extremely simple for this tutorial, they become more useful with complex data. For example, the following are the distributions for the more complicated dataset analyzed and described in the manuscript.

_images/PHANS-C_distirbution_ML.png _images/PHANS-C_distribution_branch.png

4.3: View the placements and tree with FigTree

The script “phan-figTreeFiles.py” will reformat the placement tree into a newick-like format that can be viewed with the phylogenetic tree viewer FigTree. A basic annotation file will also be generated that can be loaded into FigTree for additional tree visualizations (used in the color mapping of reads).

$ phan-figTreeFiles.py -i RAxML_labelledTree.rpsD_AA_placements -o rpsD

phan-figTreeFiles.py options:

-i Specify the tree you wish to reformat. In the example, using the labelled tree output from RAxML-EPA, but the original tree output can also be used as input.
-o specify the label you would like to use in your output file names.

You can now open the output file in the FigTree GUI tree viewer. Open the file PlacementTree_rpsD.nw with FigTree. When the “Input” window pops up, change the field to “node”.

_images/FigTree_node_load.PNG

Now you can see the Tree in FigTree. Notice that the names of the terminal nodes now have the node number appended to the end of the label as [I#]. You can now add in internal node labels as well by checking the box for “Node Labels” in the left pane. If you select the arrow, the field will drop down with more details. Click on the drop-down selector next to “Display” and select the field “node”. You should now be able to see all the internal nodes labeled throughout the tree.

_images/Node_labels.PNG

You can add further customization to the tree by using the default annotations file that is produced by phan-figTreeFiles.py called AnnotationsFile_PlacementTree_rpsD.tab. You can manually add additional labels to this file in order to add further customizations. In order to load this annotations file, in FigTree go to File > Import_Annotations and load the file AnnotationsFile_PlacementTree_rpsD.tab.

_images/FigTree_load_annotations.png

In order to visualize the annotations, select the arrow next to “Tip Labels” in the left pane. From the drop down selector next to “Colour by:” select “Category”. In order to adjust the colors further, next to “Setup:” click on the box “Colour”.

_images/FigTree_annotations_color.png

This is a very short overview on working with annotations in FigTree. Please see the FigTree documentation and discussion threads for further information.

4.4: Create annotation key for read placements

Generate tree file dictionary:

The RAxML-EPA output file RAxML_classification.rpsD_AA_placements classifies the placement reads by labeled node ID. In order to make the classifications more readily interpretable, a dictionary file is created based upon the tree file RAxML_originalLabelledTree.rpsD_AA_placements which acts as a key to convert labeled node number into the reference sequence name for all extant nodes.

$ phan-taxaDictionary.py -O RAxML_originalLabelledTree.rpsD_AA_placements -N rpsD_taxa_dictionary.txt

If the output is improperly named (see note about -N below) you may have to rename the output file.

$ mv rpsD_taxa_dictionary.txt_Taxa_Dictionary.txt rpsD_taxa_dictionary.txt

phan-taxaDictionary.py command line options:

--version Show program’s version number and exit.
-h, --help Show this help message and exit.
-O FILE Specify the original tree file output from RAxML-EPA that is the labeled tree by node of Ref taxa only.
-N STR Specify the output file name for the taxa dictionary. Note that the -N flag may not be honored, if so you must hand rename the file.

phan-taxaDictionary.py output file:

rpsD_taxa_dictionary.txt - The dictionary output file.

...
I6              Flavobacterium_psychrophilum_JIP0286_17
I9              Bacteroides_plebeius_DSM_17135_36
I10             Bacteroides_vulgatus_ATCC_8482_37
I8              I8
I11             Bacteroides_eggerthii_DSM_20697_32
I7              I7
I5              I5
I12             Chlorobium_limicola_DSM_245_31
...

Update tree file dictionary:

If we look at this tree, we can see that some of the internal nodes can also be relabeled from node numbers to more meaningful text labels. Add in FigTree info Some tree viewing software may require you to add the ‘.tree’ suffix to the end of the file name in order to open it. Below is a zoomed in section of the tree showing the portion of the reference tree containing *Bacteroides.

_images/tree.png

The internal nodes I8 and I7 are internal nodes containing Bacteroides. It is possible to manually update the dictionary file to reflect this, so that any placements made to these internal nodes will be labeled as Bacteroides.

Manually update rpsD_taxa_dictionary.txt:

...
I6              Flavobacterium_psychrophilum_JIP0286_17
I9              Bacteroides_plebeius_DSM_17135_36
I10             Bacteroides_vulgatus_ATCC_8482_37
I8              Bacteroides
I11             Bacteroides_eggerthii_DSM_20697_32
I7              Bacteroides
I5              I5
I12             Chlorobium_limicola_DSM_245_31
...

4.5: Reannotate read placements with meaningful labels

Use the dictionary file, rpsD_taxa_dictionary.txt to relabel the nodes from the RAxML-EPA classification file.

$ phan-treeLabels.py -D rpsD_taxa_dictionary.txt -C RAxML_classification.rpsD_AA_placements

phan-treeLabels.py command line options:

--version Show program’s version number and exit.
-h, --help Show this help message and exit.
-D FILE Specify the dictionary file that contains the node location ‘[I#]’ with the associated reference taxa name.
-C FILE Specify the output file from RAxML-EPA that is a list of the Query sequences assigned to their locations on the reference tree

phan-treeLabels.py output file:

RAxML_classification.rpsD_AA_placements_placement_list.csv - The reads placed by relabeled tree nodes in .csv format.

Tree_node, Read_ID
Bacteroides_plebeius_DSM_17135_36,SMPLB_Bacteroides_plebeius_DSM_17135_NCBI_taxonomyId_484018_1483_1
Shewanella_baltica_OS155_43,SMPLB_Shewanella_baltica_OS155_NCBI_taxonomyId_325240_2956_1
Bacteroides_plebeius_DSM_17135_36,SMPLA_Bacteroides_plebeius_DSM_17135_NCBI_taxonomyId_484018_1483_1
...

4.6: Tabulate reads & sort by metagenome

Finally, you can group and count all reads placed to unique tree labeled nodes. If nodes were labeled with the same text string (e.g. internal nodes I7 & I8 relabeled as Bacteroides, then the count will be a combination of both of these labels).

If the original BLAST database contained sequences from multiple metagenomic samples, then this script can optionally parse out the placements by metagenome. In order to utilize this feature, the reads must all be labeled with unique metagenome sample ID pre-fixes (e.g. read SMPLA_Bacteroides_plebeius_DSM_17135_NCBI_taxonomyId_484018_1483_1 is from metagenome SMPLA). A key file of the metagenomes must also be included which consists of a text file with each unique metagenomic sample ID listed per line. For the tutorial, one has been provided: Sample_list_IDs.txt

SMPLA
SMPLB

Since the tutorial consists of a database with two metagenomes, we will separate them out.

$ phan-sortPlacements.py -C RAxML_classification.rpsD_AA_placements_placement_list.csv -L Sample_list_IDs.txt

phan-sortPlacements.py command line options:

--version Show program’s version number and exit.
-h, --help Show this help message and exit.
-L FILE Specify the metagenomic sample unique identifiers - one ID per line without special characters.
-C FILE Specify the reads listed according to their node which has already been relabeled by phan-taxaDict.py.

phan-sortPlacements.py output files:

Two files are generated for every metagenomic sample. Here, we have files generate for SMPLA and SMPLB metagenomes.

RAxML_classification.rpsD_AA_placements_placement_list_SMPLA_counts.csv - A .csv file of the counts of reads by unique node label.

Tree_node, Total Reads
Escherichia_coli_W3110_41,1
Bacillus_selenitireducens_MLS10_29,2
Deinococcus_geothermalis_DSM_11300_33,3
Cand_Pelagibacter_ubique_HTCC1062_1,9
Bacteroides_plebeius_DSM_17135_36,1
Vibrio_sp_MED222_45,3
Shewanella_baltica_OS155_43,1
Marinomonas_sp_MED121_49,2
...

RAxML_classification.rpsD_AA_placements_placement_list_SMPLA_readsPlaced.csv - A .csv file of all the reads placed to each unique node label. The first entry on each line is the unique node label and all the following elements are the individual reads which placed to that node.

Tree_node, Reads_placed
Escherichia_coli_W3110_41,SMPLA_Escherichia_coli_W3110_NCBI_taxonomyId_316407_83762_3
Bacillus_selenitireducens_MLS10_29,SMPLA_Bacillus_selenitireducens_MLS10_NCBI_taxonomyId_439292_28809_3,SMPLA_Bacillus_selenitireducens_MLS10_NCBI_taxonomyId_439292_52168_3
Deinococcus_geothermalis_DSM_11300_33,SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_37835_3,SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_2 6433_3,SMPLA_Deinococcus_geothermalis_DSM_11300_NCBI_taxonomyId_319795_93937_3
...

The related files, RAxML_classification.rpsD_AA_placements_placement_list_SMPLB_counts.csv and RAxML_classification.rpsD_AA_placements_placement_list_SMPLB_readsPlaced.csv are also generated for reads from metagenomic sample SMPLB.

5: Final Summary

The final result of running the PHANS-C Pipeline is a list of short sequencing reads that have been characterized according to full length references. If you ran the run_tutorial.sh script, then you will find a directory called /Results which has the highlights produced from this script. In order to generate these results manually (if you followed this tutorial manually, step-by-step), run the following to generate this same directory:

$ mkdir ../Results
$ mkdir ../Results/Quantitation
$ mkdir ../Results/EPA-Out
$ mkdir ../Results/Vis-Distributions
$ cp RAxML_labelledTree.rpsD_AA_placements RAxML_originalLabelledTree.rpsD_AA_placements RAxML_classification.rpsD_AA_placements RAxML_classificationLikelihoodWeights.rpsD_AA_placements ../Results/EPA-Out/
$ cp rpsD_branchLengths_distribution_branch-lengths.html rpsD_likelihoods_distribution_likelihoodWeights.html ../Results/Vis-Distributions/
$ cp AnnotationsFile_PlacementTree_rpsD_figTree.tab PlacementTree_rpsD_figTree.nw ../Results/
$ cp RAxML_classification.rpsD_AA_placements_placement_list.csv RAxML_classification.rpsD_AA_placements_placement_list_SMPLA_counts.csv RAxML_classification.rpsD_AA_placements_placement_list_SMPLA_readsPlaced.csv RAxML_classification.rpsD_AA_placements_placement_list_SMPLB_counts.csv RAxML_classification.rpsD_AA_placements_placement_list_SMPLB_readsPlaced.csv ../Results/Quantitation/

The overall results are broken up into 3 subdirectories. /Results/EPA-Out are copies of the main output files from the EPA placement algorithm - to view the placement files in FigTree, you can use $ phan-figTreeFiles.py to generate a newick-like file that can be loaded by FigTree (See Section 4.3 for details). /Results/Vis-Distributions are copies of the interactive visualizations of the read placement metrics. And /Results/Quantitation lists copies of .csv files which list counts of reads placed to nodes on the reference tree.