class: center, middle, inverse, title-slide

# Shotgun metagenomics
## Migale facility
### Valentin Loux - Cédric Midoux - Olivier Rué
### 2021/06/24-25

---

<style type="text/css">
.remark-slide-content {
    font-size: 28px;
}
</style>

# Practical information

- 9h00 - 17h00
- 2 breaks, morning and afternoon
- Lunch at INRAE restaurant (not mandatory)

---

# Getting to know each other

* Who are you?
  - Institution, laboratory, position …
* What are your needs in metagenomics?
* Have you already dealt with metagenomics data?
  - What kind of data?
  - Aim of the study?
* Have you generated data?
  - Which design? How many samples? Sequencing technology? Problems encountered?

---

# Migale team

<img src="images/migale-orange.png" width="50%" style="display: block; margin: auto;" />

* <a href="https://migale.inrae.fr/">Migale website</a>
* Dedicated service to Data Analysis
  - Specialists in Metagenomics
  - Bioinformatics & Statistics
  - More than 60 projects since 2016
  - Collaboration or support

---

<img src="images/frogs_stuff.gif" width="75%" style="display: block; margin: auto;" />

---

# Objectives

After these 2 days, you will (should):

* Know the advantages and limits of shotgun metagenomics data and their analyses
* Identify tools to analyze your data and answer your own questions
* Run tools with Migale resources

---

# Program

.pull-left[
Day 1

* Introduction
* Reminders
* QC
* Taxonomic classification
]

.pull-right[
Day 2

* Cleaning
* Assembly
* Annotation
]

---

class: heading-slide, middle, center

# Introduction to metagenomics analyses

---

## Introduction

The term *Metagenomics* sometimes includes:

* Marker-gene sequencing (metataxonomics or metabarcoding)
* Shotgun metagenome sequencing
* Meta-transcriptome sequencing

---

## Introduction

The term *Metagenomics* sometimes includes:

* Marker-gene sequencing (metataxonomics or metabarcoding)
* **Shotgun metagenome sequencing**
* Meta-transcriptome sequencing

---

## Introduction

<img src="images/c6mb00217j-f1_hi-res.gif" width="80%"
style="display: block; margin: auto;" />

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1039/C6MB00217J">Addis et al. (2016)</a></cite>

---

## Introduction

<img src="images/terminology.png" width="90%" style="display: block; margin: auto;" />

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bib/bbx120">Breitwieser, Lu and Salzberg (2019)</a></cite>

---

## Introduction

<img src="images/fgene-06-00348-g001.jpg" width="95%" style="display: block; margin: auto;" />

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bib/bbx120">The Road to Metagenomics, Escobar-Zepeda et al., 2015</a></cite>

---

## Challenges

* Complexity of ecosystem
* Completeness of databases
* Sequencing depth
* Computational resources required

---

## Common analysis procedures for metagenomics data

<img src="images/metagenomics.png" width="65%" style="display: block; margin: auto;" />

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bib/bbx120">Breitwieser, Lu and Salzberg (2019)</a></cite>

---

class: heading-slide, middle, center

# Reminders

---

## Reminders about Migale facility

* working directories
* SGE submission system
* conda environments

---

## Working directories

* https://tutorials.migale.inra.fr/posts/migale/

---

## Databanks

* https://migale.inrae.fr/databanks
* Shared resources
* Ready to use
* To request new resources: https://migale.inrae.fr/ask-databank

---

## Tools

* Conda environments
* Need to be activated to access binaries
* To request a new tool: https://migale.inrae.fr/ask-tool

---

class: tp, middle, center

# Switch to TP

---

## Reminders about sequencing

<img src="images/illumina_sequencing.png" width="60%" style="display: block; margin: auto;" />

---

## Sequencing - Vocabulary

.pull-left[
**Read** : piece of sequenced DNA

**DNA fragment** = 1 or more reads depending on whether the sequencing is single
end or paired-end

**Insert** = Fragment size

**Depth** = `\(N*L/G\)`

N = number of reads, L = read length, G = genome size

**Coverage** = % of genome covered
]

.pull-right[
<img src="images/se-pe.png" width="80%" style="display: block; margin: auto;" />

<img src="images/fragment-insert.png" width="80%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="images/depth-breadth.png" alt="Single-End , Paired-End" width="80%" />
<p class="caption">Single-End , Paired-End</p>
</div>
]

---

## Sequencing data

* Huge amount of reads (up to billions)
* FASTQ format

---

class: heading-slide, middle, center

# FASTQ format

---

## FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent data. It may be thought of as a variant of the FASTA format that associates a quality measure with each sequence base: **FASTA with QUALITIES**.

---

## FASTQ syntax

The FASTQ format consists of 4 sections:

1. A FASTA-like header, but instead of the <code>></code> symbol it uses the <code>@</code> symbol. This is followed by an ID and more optional text, similar to the FASTA headers.
2. The second section contains the measured sequence (typically on a single line), but it may be wrapped until the <code>+</code> sign starts the next section.
3. The third section is marked by the <code>+</code> sign and may optionally be followed by the same sequence ID and header as the first section.
4. The last section encodes the quality values for the sequence in section 2, and must be of the same length as section 2.

---

## FASTQ syntax

<i>Example</i>

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

---

## FASTQ quality

Each character represents a numerical value: a so-called Phred score, encoded as a single character.
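A minimal Python sketch of this encoding, assuming the Sanger (+33) offset: subtracting 33 from the ASCII code of each character gives the Phred score P, and `10 ** (-P / 10)` gives the error probability. The function names are illustrative only and do not come from any FASTQ library.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    # offset=33 is the Sanger convention (assumption); older files may use 64
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("!+5?I")
print(scores)  # [0, 10, 20, 30, 40]
print([error_probability(s) for s in scores])
```

The same function decodes the older +64 encoding with `phred_scores(quality, offset=64)`.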
```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

The numbers represent the error probabilities via the formula:

`\(Error=10^{-P/10}\)`

It is basically summarized as:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)

---

## FASTQ quality encoding specificities

There was a time when instrumentation makers could not decide at what character to start the scale. The **current standard** shown above is the so-called Sanger (+33) format where the ASCII codes are shifted by 33. There is also an older, so-called +64 format that starts close to where the other scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
<p class="caption">FASTQ encoding values</p>
</div>

---

## FASTQ header information

Information is often encoded in the “free” text section of a FASTQ file.
<code>@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG</code> contains the following information:

- <code>EAS139</code>: the unique instrument name
- <code>136</code>: the run id
- <code>FC706VJ</code>: the flowcell id
- <code>2</code>: flowcell lane
- <code>2104</code>: tile number within the flowcell lane
- <code>15343</code>: ‘x’-coordinate of the cluster within the tile
- <code>197393</code>: ‘y’-coordinate of the cluster within the tile
- <code>1</code>: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- <code>Y</code>: Y if the read is filtered, N otherwise
- <code>18</code>: 0 when none of the control bits are on, otherwise it is an even number
- <code>ATCACG</code>: index sequence

This information is specific to a particular instrument/vendor and may change with different versions or releases of that instrument.

---

class: heading-slide, middle, center

# Quality control

---

## Why QC your reads?

Try to answer some (not always) simple questions:

--

- Do the data conform to the expected level of performance?
  - Size
  - Number of reads
  - Quality
- Residual presence of adapters or indexes?
- (Un)expected technical biases
- (Un)expected biological biases

<div class="alert comment">Quality control without context leads to misinterpretation</div>

---

## Quality control for FASTQ files

- FastQC <a name=cite-fastqc></a>([Andrews, 2010](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
  - QC for (Illumina) FastQ files
  - Command line fastqc or graphical interface
  - Complete HTML report to spot problems originating from the sequencer, library preparation, contamination
  - Summary graphs and tables to quickly assess your data

<img src="images/fastqc.png" width="40%" style="display: block; margin: auto;" />

- https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/

---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Taxonomic affiliation

---

## Strategies

* Assign a taxon identifier to each read
* Methods
  * Alignment
  * Mapping
  * K-mer matches
* Level
  * Nucleotide
  * Protein
* Relies on taxonomic information

---

## Kraken

* First method

<img src="images/kraken.png" width="70%" style="display: block; margin: auto;" />

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/gb-2014-15-3-r46">Wood and Salzberg (2014)</a></cite>

---

## Kraken

* Very fast
* Database build may be long and need a lot of memory

---

## Kaiju

* Database of protein sequences
* Supposed to be more sensitive
* Translates reads in all six reading frames, split at stop codons
* Uses BWT and FM-index table

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1038/ncomms11257">Menzel et al. (2016)</a></cite>

---

## Kaiju databanks

<img src="images/kaiju_databanks.png" width="45%" style="display: block; margin: auto;" />

---

## Taxonomic classification caveats

* Databanks
* K-mer choice (sensitivity / specificity)
* Allow a "fast" overview of your data
  - Contaminants?
  - Host reads?
  - Classification rate

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bib/bbx120">Breitwieser et al. (2019)</a></cite>

---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Reads cleaning

---

## Objectives

- Detect and remove sequencing adapters (still) present in the FastQ files
- Filter / trim reads according to quality (as plotted in FastQC)

## Tools

- Simple & fast : Sickle <a name=cite-sickle></a>([Joshi and Fass, 2011](#bib-sickle)) (quality), cutadapt <a name=cite-cutadapt></a>([Martin, 2011](#bib-cutadapt)) (adapter removal)
- Ultra-configurable : Trimmomatic
- All in one & ultra-fast : fastp <a name=cite-fastp></a>([Zhou, Chen, Chen, and Gu, 2018](https://dx.doi.org/10.1093/bioinformatics/bty560))

<img src="images/fastp_wkwf.png" width="45%" style="display: block; margin: auto;" />

---

# rRNA filtering

- Essential step in metatranscriptomics, can be applied in metagenomics
- rRNA is not important if you are interested in gene content

## Tool

SortMeRNA <a name=cite-sortmerna></a>([Kopylova, Noé, and Touzet, 2012](#bib-sortmerna))

- SortMeRNA takes as input :
  - one or two (paired) reads file(s)
  - one or multiple rRNA database file(s) with index
- Sorts aligned and rejected reads into two separate files.

---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Alignment

---

# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC--ATT    # GLOBAL
                 ATCTTGATCGCCGACATT      # LOCAL, with soft clipping
```

## Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <code>Needleman–Wunsch algorithm</code>, which is based on dynamic programming.
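The dynamic programming idea can be sketched in a few lines of Python. This is a minimal illustration that only returns the optimal global score; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example, not parameters of any particular aligner.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # previous[j] is the best score for aligning the prefix of a seen so far
    # (initially empty) with b[:j]; an empty prefix costs one gap per base.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = previous[j] + gap                     # a[i-1] faces a gap in b
            left = current[j - 1] + gap                # b[j-1] faces a gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches, one gap
```

Only two rows of the matrix are kept because the score alone is requested; recovering the alignment itself would require storing the full matrix and tracing back from the last cell.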
---

# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC--ATT    # GLOBAL
                 ATCTTGATCGCCGACATT      # LOCAL, with soft clipping
```

## Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <code>Smith–Waterman algorithm</code> is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

---

# Seed-and-extend especially adapted to NGS data

.pull-left[
Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.

<img src="images/seed_and_extend.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
1. The mapper obtains a read
2. The mapper selects smaller DNA segments from the read to serve as seeds
3. The mapper indexes a data structure with each seed to obtain a list of possible locations within the reference genome that could result in a match
4. For each possible location in the list, the mapper obtains the corresponding DNA sequence from the reference genome
5. The mapper aligns the read sequence to the reference sequence, using an expensive sequence alignment (i.e., verification) algorithm to determine the similarity between the read sequence and the reference sequence.
]

---

# Mapping tools

<img src="images/mapping_tools.png" width="100%" style="display: block; margin: auto;" />

Figure from

---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Metagenome assembly

---

## Objectives

- Reconstruct genes and/or organisms from complex mixtures
- Dealing with the ecosystem's heterogeneity, multiple genomes at varying levels of abundance
- Limiting the reconstruction of chimeras

---

## Tools

- Generic tool with a meta option : SPAdes and metaSPAdes <a name=cite-spades></a>([Bankevich, Nurk, Antipov, Gurevich, Dvorkin, Kulikov, Lesin, Nikolenko, Pham, Prjibelski, and others, 2012](#bib-spades))
- Tools requiring less memory : MEGAHIT <a name=cite-megahit></a>([Li, Liu, Luo, Sadakane, and Lam, 2015](#bib-megahit))
- The historical tool allowing many parameters : Velvet (and MetaVelvet)
- A (not so) recent benchmark of short reads metagenome assemblers. <a name=cite-Vollmers2017></a>([Vollmers, Wiegand, and Kaster, 2017](https://doi.org/10.1371/journal.pone.0169662))
- Long read / Hybrid assemblies use different algorithms and strategies and are still a research question.
---

class: tp, middle, center

# Switch to TP

---

## Assessment of assembly quality

After assembly, we use MetaQUAST <a name=cite-metaquast></a>([Mikheenko, Saveliev, and Gurevich, 2015](https://doi.org/10.1093/bioinformatics/btv697)) to evaluate and compare metagenome assemblies.

What MetaQUAST does :

- De novo metagenomic assembly evaluation
- [Optionally] identify reference genomes from the content of the assembly
- Reference-based evaluation
- Filtering so-called misassemblies based on read mapping
- Report and visualisation

---

## De novo metrics

Evaluation of the assembly based on

* Number of contigs greater than a given threshold (0, 500 nt, 1 kb)
* Total / thresholded assembly size
* Largest contig size
* N50 : the length of the shortest contig among the longest contigs that together cover 50% of the total assembly length (a length-weighted median of contig lengths)
* L50 : the number of contigs needed to reach 50% of the total assembly length
* N75/L75 : idem, for 75% of the assembly length

---

## Reference-based metrics

* Metrics based on the comparison with reference genomes.
* Reference genomes are given by the user or automatically selected by MetaQUAST based on a comparison of the rRNA gene content of the assembly with a reference database (Silva). Complete genomes are then automatically downloaded.
* For each given reference genome, based on an alignment of all contigs on it :
  - duplication rate
  - percent genome complete
  - NGA50 : equivalent of N50 but with the aligned blocks of the contigs
  - "Misassemblies" : breakpoints of alignment in a contig. "False misassemblies" (i.e. true inversion/translocation/…) are filtered by aligning paired reads on the reference. If a "misassembly" is supported by properly aligned paired-end reads, it is counted.
  - An individual QUAST report and an alignment visualisation.
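The N50/L50 and N75/L75 de novo metrics can be computed directly from the list of contig lengths. The sketch below is a minimal illustration, not MetaQUAST code; the function name and the example lengths are made up, and `fraction` is assumed to lie between 0 and 1.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths, e.g. (N50, L50)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    target = fraction * sum(contig_lengths)
    cumulative = 0
    # Walk through contigs from longest to shortest until the target is covered
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        cumulative += length
        if cumulative >= target:
            return length, rank


lengths = [80, 70, 50, 40, 30, 20]   # total assembly length: 290
print(nx_lx(lengths))                 # (70, 2): 80 + 70 covers half of 290
print(nx_lx(lengths, fraction=0.75))  # (40, 4)
```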
---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Mapping

---

# Mapping

* Assess how many reads map to the assembly

---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Contig Binning

---

## Objectives

- Binning is a good compromise when the assembly of whole genomes is not feasible.
- Similar contigs are grouped together.

<img src="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5045144/bin/yjbm_89_3_353_g02.jpg" width="55%" style="display: block; margin: auto;" />

---

## Approach

- MetaBAT <a name=cite-metabat></a>([Kang, Froula, Egan, and Wang, 2015](#bib-metabat)) is a tool for reconstructing genomes from complex microbial communities.
- The binning approach is based on :
  - Tetranucleotide frequency
  - Abundance (i.e., mean base coverage)

<div class="figure" style="text-align: center">
<img src="https://dfzljdn9uc3pi.cloudfront.net/2015/1165/1/fig-1-2x.jpg" alt="metabat" width="45%" />
<p class="caption">metabat</p>
</div>

---

## Bins evaluation

- For the evaluation of bins, we will use *completeness* and *contamination* estimated by CheckM <a name=cite-checkm></a>([Parks, Imelfort, Skennerton, Hugenholtz, and Tyson, 2015](#bib-checkm)).
- Use of collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.
- Among the tools in CheckM, we will use the `checkm lineage_wf` workflow, whose only mandatory input is a directory of genome bins.

--

CheckM is included in the M-Tools suite (https://ecogenomic.org/m-tools) with RefineM (population genomes), GraftM (marker genes), GroopM (metagenomic binning), BamM (parse BAM files), FinishM (improve/finish a draft genome), CoverM (read coverage calculator), ...
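The idea behind these two estimates can be sketched as follows. This is a simplified illustration and not CheckM's implementation: the marker names and sets are hypothetical, and the averaging over marker sets follows the formulas of the CheckM paper as we understand them (a marker found N ≥ 1 times contributes N - 1 to contamination).

```python
def completeness_contamination(marker_sets, copies):
    """Estimate bin completeness and contamination (in %) from marker counts.

    marker_sets: list of sets of single-copy marker gene names (hypothetical)
    copies: dict mapping a marker name to its copy number found in the bin
    """
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # Markers seen at least once count towards completeness
        present = sum(1 for gene in markers if copies.get(gene, 0) >= 1)
        # Every copy beyond the first one counts towards contamination
        extra = sum(max(copies.get(gene, 0) - 1, 0) for gene in markers)
        completeness += present / len(markers)
        contamination += extra / len(markers)
    n_sets = len(marker_sets)
    return 100 * completeness / n_sets, 100 * contamination / n_sets


marker_sets = [{"m1", "m2"}, {"m3"}]
copies = {"m1": 1, "m2": 2}  # m3 is missing, m2 is duplicated
print(completeness_contamination(marker_sets, copies))  # (50.0, 25.0)
```

Grouping markers into sets before averaging limits the weight of genes that tend to be gained or lost together, which is the motivation given for collocated sets.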
---

class: tp, middle, center

# Switch to TP

---

class: heading-slide, middle, center

# Annotation

---

# Annotation

* Goal :
  - Syntactic annotation (gene prediction)
  - Functional annotation (function prediction for protein coding genes)
* Prokka is a software tool to annotate bacterial, archaeal and viral genomes **quickly** and produce standards-compliant output files.
* Prokka *automatically* annotates a complete bacterial genome in ~5 min.
* Prokka will not replace expert annotation but gives you a homogeneous procedure for the annotation of conserved gene families

---

## Gene prediction

- Gene prediction in complete prokaryotic genomes is no longer much of a problem.
- Efficient gene predictors are available (bactgeneSHOW, prodigal, glimmerHMM,…).
- Most of them use HMM models to predict the gene structure
- Gene prediction on metagenomes is difficult due to:
  - assembly fragmentation
  - assembly errors, frameshifts, chimeras,…
  - different species in the same sample, which could/should lead to the use of different models
- Prodigal (with the `-p meta` parameter) and FragGeneScan give good enough results on metagenomic contigs.
- Beware of partial genes!

---

# Prokka pipeline

- Coding gene prediction with Prodigal <a name=cite-prodigal></a>([Hyatt, Chen, LoCascio, Land, Larimer, and Hauser, 2010](#bib-prodigal))
- tRNA; rRNA gene prediction with Aragorn, Barrnap, RNAmmer (optional)
- Functional annotation based on similarity search with a threshold (1e-6), hierarchically against :
  - [Optional] a given proteome ( `--proteins` parameter)
  - [ISFinder](https://isfinder.biotoul.fr/) for transposases, not entire IS
  - [NCBI Bacterial Antimicrobial Resistance Reference Gene Database](https://www.ncbi.nlm.nih.gov/bioproject/313047) for Antimicrobial Resistance Genes.
  - [UniprotKB/Swissprot](https://www.uniprot.org/uniprot/?query=reviewed:yes) **curated** proteins that (i) are from Bacteria (or Archaea or Viruses); (ii) are not "Fragment" entries; and (iii) have an evidence level ("PE") of 2 or lower, which corresponds to experimental mRNA or proteomics evidence.
- Domain and motifs (hmmsearch) :
  - [Pfam](https://pfam.xfam.org)
  - [HAMAP](https://hamap.expasy.org)

---

# Prokka pipeline

<img src="images/prokka-pipelinepng.png" width="50%" style="display: block; margin: auto;" />

---

## eggNOG-mapper

* eggNOG-mapper <a name=cite-eggnogmapper2017></a>([Huerta-Cepas, Forslund, Coelho, Szklarczyk, Jensen, von Mering, and Bork, 2017](https://doi.org/10.1093/molbev/msx148)) is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.
* eggNOG-mapper uses hmmsearch to search against the eggNOG HMM database OR Diamond to search against the eggNOG protein database
* Refine first hit using a list of precomputed orthologs. Assign one ortholog
* Functional annotation using this ortholog

---

## Other options for functional annotation

### Diamond

Diamond <a name=cite-diamond2014></a>([Buchfink, Xie, and Huson, 2014](https://doi.org/10.1038/nmeth.3176)) is a sequence aligner for protein (equivalent blastp) and translated DNA (equivalent blastx) searches, designed for high performance analysis of big sequence data. Diamond is 100x to 20,000x the speed of BLAST.

* Diamond could be used to query against any given databank.

### ghostKOALA

<a name=cite-ghostkoala></a>([Kanehisa, Sato, and Morishima, 2016](#bib-ghostkoala))

[Online] KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment.
### Clustering of protein coding genes : cd-hit

cd-hit <a name=cite-cdhit2012></a>([Fu, Niu, Zhu, Wu, and Li, 2012a](https://doi.org/10.1093/bioinformatics/bts565)) is a program for clustering protein sequences. It can be used to reduce the number of rows in the gene count matrix.

---

class: heading-slide, middle, center

# Automation

---

# Automation

.pull-left[
We have developed a workflow that allows us to automate all these analyses.

* developed with snakemake
* executable on MIGALE

```json
{
  "SAMPLES": ["mock"],
  "NORMALIZATION": true,
  "SORTMERNA": true,
  "ASSEMBLER": "metaspades",
  "CONTIGS_LEN": 1000,
  "PROTEINS-PREDICTOR": "prodigal"
}
```

https://forgemia.inra.fr/cedric.midoux/workflow_metagenomics
]

.pull-right[
<img src="images/mock-global_graph.png" width="80%" style="display: block; margin: auto;" />
]

---

# Other leads/tools

* [Pear](https://cme.h-its.org/exelixis/web/software/pear/index.html) : merge paired-end reads
* SimkaMin: fast and resource frugal de novo comparative metagenomics <a name=cite-simka2019></a>([Benoit, Mariadassou, Robin, Schbath, Peterlongo, and Lemaitre, 2019b](https://doi.org/10.1093/bioinformatics/btz685)) A quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. Very effective at the QC stage for a first comparison of samples / replicates.
* Cd-hit : ([Fu, Niu, Zhu, et al., 2012a](https://doi.org/10.1093/bioinformatics/bts565)) : CDS clustering
* Linclust (MMseqs2) : Clustering huge protein sequence sets in linear time. <a name=cite-linclust2018></a>([Steinegger and Söding, 2018](https://doi.org/10.1038/s41467-018-04964-5))
* PLASS : Protein-Level ASSembler. Assembles short read sequencing data on a protein level.
  <a name=cite-plass2019></a>([Steinegger, Mirdita, and Söding, 2019](https://doi.org/10.1038/s41592-019-0437-4)) The main purpose of Plass is the assembly of complex metagenomic datasets. It assembles 10 times more protein residues in soil metagenomes than Megahit.

---

# GraftM : a tool for scalable, phylogenetically informed classification of genes within metagenomes

<img src="https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6007438/bin/gky174fig1.jpg" width="80%" style="display: block; margin: auto;" />

<a name=cite-graftM2018></a>([Boyd, Woodcroft, and Tyson, 2018](https://doi.org/10.1093/nar/gky174))

---

# Metagenomics atlas

.pull-left[
* Metagenome-atlas <a name=cite-metagenomeatlas20></a>([Kieser, Brown, Zdobnov, Trajkovski, and McCue, 2020](https://doi.org/10.1186/s12859-020-03585-4)) is an easy-to-use metagenomic pipeline based on *snakemake*.
* It handles all steps of analysis :
  - QC
  - Assembly
  - Binning
  - Annotation.
]

.pull-right[
<img src="https://raw.githubusercontent.com/metagenome-atlas/atlas/master/resources/images/ATLAS_scheme.png" width="80%" style="display: block; margin: auto;" />
]

---

# Metagenomics atlas

<img src="images/altas_example.png" width="50%" style="display: block; margin: auto;" />

---

# Anvi’o: integrated multi-omics at scale

* Anvi'o <a name=cite-anvio2015></a>([Eren, Esen, Quince, Vineis, Morrison, Sogin, and Delmont, 2015](https://doi.org/10.7717/peerj.1319)) is an open-source, community-driven analysis and visualization platform for microbial ‘omics.
* With [this tutorial](http://merenlab.org/2016/06/22/anvio-tutorial-v2/), starting from a metagenomic assembly, you will :
  - Process your contigs,
  - Profile your metagenomic samples and merge them,
  - Visualize your data, identify and/or refine genome bins interactively, and create summaries of your results.

---

# Metagenomics and long reads

* Long read data can be used to improve assembly
* Bottlenecks :
  - DNA extraction (?)
  - cost of data generation
  - sequencing errors
* State of the art pipeline for assembly :
  - standalone long read assembly (ex: metaFlye <a name=cite-metaFlye19></a>([Kolmogorov, Rayko, Yuan, Polevikov, and Pevzner, 2019](https://doi.org/10.1101/637637)))
  - optional error correction with short reads

---

# Take home message

- Shotgun metagenomics is still an active bioinformatics research field
- Numerous software tools dedicated to assembly, binning and functional annotation are actively developed
- Depending on the ecosystem, one can take different approaches :
  - mapping on a reference database
  - assembly and mapping
- Biostatistics…

---

# References

<a name=bib-fastqc></a>[Andrews, S.](#cite-fastqc) (2010). _FastQC A Quality Control tool for High Throughput Sequence Data_. URL: [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<a name=bib-spades></a>[Bankevich, A., S. Nurk, D. Antipov, et al.](#cite-spades) (2012). "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing". In: _Journal of computational biology_ 19.5, pp. 455-477.

<a name=bib-simka2019></a>[Benoit, G., M. Mariadassou, S. Robin, et al.](#cite-simka2019) (2019b). "SimkaMin: fast and resource frugal de novo comparative metagenomics". In: _Bioinformatics_. Ed. by J. Hancock. DOI: [10.1093/bioinformatics/btz685](https://doi.org/10.1093%2Fbioinformatics%2Fbtz685). URL: [https://doi.org/10.1093/bioinformatics/btz685](https://doi.org/10.1093/bioinformatics/btz685).

<a name=bib-graftM2018></a>[Boyd, J. A., B. J. Woodcroft, and G. W. Tyson](#cite-graftM2018) (2018). "GraftM: a tool for scalable, phylogenetically informed classification of genes within metagenomes". In: _Nucleic Acids Research_ 46.10, pp. e59-e59. DOI: [10.1093/nar/gky174](https://doi.org/10.1093%2Fnar%2Fgky174). URL: [https://doi.org/10.1093/nar/gky174](https://doi.org/10.1093/nar/gky174).

<a name=bib-diamond2014></a>[Buchfink, B., C.
Xie, and D. H. Huson](#cite-diamond2014) (2014). "Fast and sensitive protein alignment using DIAMOND". In: _Nature Methods_ 12.1, pp. 59-60. DOI: [10.1038/nmeth.3176](https://doi.org/10.1038%2Fnmeth.3176). URL: [https://doi.org/10.1038/nmeth.3176](https://doi.org/10.1038/nmeth.3176).

---

# References (2)

<a name=bib-anvio2015></a>[Eren, A. M., O. C. Esen, C. Quince, et al.](#cite-anvio2015) (2015). "Anvi'o: an advanced analysis and visualization platform for 'omics data". In: _PeerJ_ 3, p. e1319. DOI: [10.7717/peerj.1319](https://doi.org/10.7717%2Fpeerj.1319). URL: [https://doi.org/10.7717/peerj.1319](https://doi.org/10.7717/peerj.1319).

<a name=bib-cdhit2012></a>[Fu, L., B. Niu, Z. Zhu, et al.](#cite-cdhit2012) (2012a). "CD-HIT: accelerated for clustering the next-generation sequencing data". In: _Bioinformatics_ 28.23, pp. 3150-3152. DOI: [10.1093/bioinformatics/bts565](https://doi.org/10.1093%2Fbioinformatics%2Fbts565). URL: [https://doi.org/10.1093/bioinformatics/bts565](https://doi.org/10.1093/bioinformatics/bts565).

<a name=bib-eggnogmapper2017></a>[Huerta-Cepas, J., K. Forslund, L. P. Coelho, et al.](#cite-eggnogmapper2017) (2017). "Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper". In: _Molecular Biology and Evolution_ 34.8, pp. 2115-2122. ISSN: 0737-4038. DOI: [10.1093/molbev/msx148](https://doi.org/10.1093%2Fmolbev%2Fmsx148). eprint: https://academic.oup.com/mbe/article-pdf/34/8/2115/24367816/msx148.pdf. URL: [https://doi.org/10.1093/molbev/msx148](https://doi.org/10.1093/molbev/msx148).

<a name=bib-prodigal></a>[Hyatt, D., G. Chen, P. F. LoCascio, et al.](#cite-prodigal) (2010). "Prodigal: prokaryotic gene recognition and translation initiation site identification". In: _BMC bioinformatics_ 11.1, p. 119.

---

# References (3)

<a name=bib-sickle></a>[Joshi, N. and J. Fass](#cite-sickle) (2011). _Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files_.

<a name=bib-ghostkoala></a>[Kanehisa, M., Y.
Sato, and K. Morishima](#cite-ghostkoala) (2016). "BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences". In: _Journal of molecular biology_ 428.4, pp. 726-731.

<a name=bib-metabat></a>[Kang, D. D., J. Froula, R. Egan, et al.](#cite-metabat) (2015). "MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities". In: _PeerJ_ 3, p. e1165.

<a name=bib-metagenomeatlas20></a>[Kieser, S., J. Brown, E. M. Zdobnov, et al.](#cite-metagenomeatlas20) (2020). "ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data". In: _BMC Bioinformatics_ 21.1. DOI: [10.1186/s12859-020-03585-4](https://doi.org/10.1186%2Fs12859-020-03585-4). URL: [https://doi.org/10.1186/s12859-020-03585-4](https://doi.org/10.1186/s12859-020-03585-4).

<a name=bib-metaFlye19></a>[Kolmogorov, M., M. Rayko, J. Yuan, et al.](#cite-metaFlye19) (2019). "metaFlye: scalable long-read metagenome assembly using repeat graphs". DOI: [10.1101/637637](https://doi.org/10.1101%2F637637). URL: [https://doi.org/10.1101/637637](https://doi.org/10.1101/637637).

---

# References (4)

<a name=bib-sortmerna></a>[Kopylova, E., L. Noé, and H. Touzet](#cite-sortmerna) (2012). "SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data". In: _Bioinformatics_ 28.24, pp. 3211-3217.

<a name=bib-megahit></a>[Li, D., C. Liu, R. Luo, et al.](#cite-megahit) (2015). "MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph". In: _Bioinformatics_ 31.10, pp. 1674-1676.

<a name=bib-cutadapt></a>[Martin, M.](#cite-cutadapt) (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads". In: _EMBnet.journal_ 17.1, pp. 10-12.

<a name=bib-metaquast></a>[Mikheenko, A., V. Saveliev, and A. Gurevich](#cite-metaquast) (2015). "MetaQUAST: evaluation of metagenome assemblies". In: _Bioinformatics_ 32.7, pp.
1088-1090. ISSN: 1367-4803. DOI: [10.1093/bioinformatics/btv697](https://doi.org/10.1093%2Fbioinformatics%2Fbtv697). eprint: https://academic.oup.com/bioinformatics/article-pdf/32/7/1088/19568745/btv697.pdf. URL: [https://doi.org/10.1093/bioinformatics/btv697](https://doi.org/10.1093/bioinformatics/btv697).

<a name=bib-checkm></a>[Parks, D. H., M. Imelfort, C. T. Skennerton, et al.](#cite-checkm) (2015). "CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes". In: _Genome research_ 25.7, pp. 1043-1055.

---

# References (5)

<a name=bib-plass2019></a>[Steinegger, M., M. Mirdita, and J. Söding](#cite-plass2019) (2019). "Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold". In: _Nature Methods_ 16.7, pp. 603-606. DOI: [10.1038/s41592-019-0437-4](https://doi.org/10.1038%2Fs41592-019-0437-4). URL: [https://doi.org/10.1038/s41592-019-0437-4](https://doi.org/10.1038/s41592-019-0437-4).

<a name=bib-linclust2018></a>[Steinegger, M. and J. Söding](#cite-linclust2018) (2018). "Clustering huge protein sequence sets in linear time". In: _Nature Communications_ 9.1. DOI: [10.1038/s41467-018-04964-5](https://doi.org/10.1038%2Fs41467-018-04964-5). URL: [https://doi.org/10.1038/s41467-018-04964-5](https://doi.org/10.1038/s41467-018-04964-5).

<a name=bib-Vollmers2017></a>[Vollmers, J., S. Wiegand, and A. Kaster](#cite-Vollmers2017) (2017). "Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters!" In: _PLOS ONE_ 12.1, pp. 1-31. DOI: [10.1371/journal.pone.0169662](https://doi.org/10.1371%2Fjournal.pone.0169662). URL: [https://doi.org/10.1371/journal.pone.0169662](https://doi.org/10.1371/journal.pone.0169662).

<a name=bib-fastp></a>[Zhou, Y., Y. Chen, S. Chen, et al.](#cite-fastp) (2018). "fastp: an ultra-fast all-in-one FASTQ preprocessor". In: _Bioinformatics_ 34.17, pp.
i884-i890. ISSN: 1367-4803. DOI: [10.1093/bioinformatics/bty560](https://doi.org/10.1093%2Fbioinformatics%2Fbty560). eprint: http://academic.oup.com/bioinformatics/article-pdf/34/17/i884/25702346/bty560.pdf. URL: [https://dx.doi.org/10.1093/bioinformatics/bty560](https://dx.doi.org/10.1093/bioinformatics/bty560).