class: center, middle, inverse, title-slide

# NGS data analysis with Galaxy
## Migale facility
### Valentin Loux - Cédric Midoux
### 2021/02/17

---

<style type="text/css">
.remark-slide-content {
  font-size: 28px;
}
</style>

# Practical information

- 9h30 - 17h00
- 2 breaks (morning and afternoon)
- Lunch break of 1 hour
- First session held remotely… please be understanding!

---

# Remote session rules

- Even remotely, it should be interactive!
- Please interrupt us:
  - raise your (virtual) hand
  - ask questions in the chat
- Practical sessions will be different:
  - tutorial support, with exercises to do on your own for a few minutes
  - group synchronization to be sure everyone follows
- Inform us of your progress:
  - ask (any) questions
  - use green / red reactions

---

# Ice-breaking session

* Who are you?
  - Institution, laboratory, position…
* Why are you here?
  - What are your needs in NGS data analysis?
* Have you already dealt with NGS data?
  - Which kind of data?
  - Aim of the study?
* Have you ever used *Galaxy*?
---

# Migale team

<img src="images/migale-orange.png" width="50%" style="display: block; margin: auto;" />

* <a href="https://migale.inrae.fr/">Migale website</a>
* Infrastructure for bioinformatics
  * storage, compute
  * tools, databanks
  * interfaces (Galaxy)
* Dedicated data analysis service
  - Specialists in metagenomics
  - Bioinformatics & statistics
  - More than 60 projects since 2016
  - Collaboration or support

---

<img src="images/frogs_stuff.gif" width="80%" style="display: block; margin: auto;" />

---

# Objectives

After this training day, you will know:

- the characteristics of the main types of sequencers
- how to perform quality control of raw sequences
- how to assemble a (small) genome
- how to align reads to a reference genome
- how to explore an alignment graphically
- how to compare assemblies

---

# Program

.pull-left[
**Morning**

* Introduction & round table
* Sequencing technologies

*Break*

* Quality control
* Data cleaning
* Assembly
]

.pull-right[
**Afternoon**

* Assembly evaluation and comparison

*Break*

* Mapping
* Visualisation
]

---
class: heading-slide, middle, center

# Next-generation sequencing in a few slides

---

## Sequencing Cost per Megabase

<img src="https://www.genome.gov/sites/default/files/inline-images/Sequencing_Cost_per_Megabase_May2020.jpg" width="90%" style="display: block; margin: auto;" />

---

# Genome sequencing, why?
Interest in a genome that has not yet been sequenced

* Assembly and annotation
* de novo sequencing
* chromosomal rearrangements
* metagenomics

A reference genome is available

* Alignment (mapping) of reads on the genome
  - Detection of genomic variants (SNPs)
  - RNA-seq (gene expression)
  - ChIP-seq (regulation of gene expression)
  - Chromosomal rearrangements, variation in gene copy number
  - Detection of small non-coding RNAs
  - metagenomics

---

# Sequencing challenges

Smallest known (non-viral) genome:

- *Carsonella ruddii* = 0.16 Mbp

Largest known genomes:

- *Paris japonica* = 150 Gbp
- *Amoeba dubia* = 670 Gbp

Maximum read size:

- 1st generation (Sanger): up to 900 bp
- 2nd generation: up to 500 bp
- 3rd generation: up to 100 - 1000 kbp

The genome therefore has to be cut into millions of fragments (**shotgun sequencing**), coming from both DNA strands.

Reconstructing the genetic elements from the raw reads is called **assembly**.

---

# Sequencing technologies

- First generation:
  - Sanger sequencing
  - First step: fragment cloning
  - Reads up to 900 bp
  - Expensive
  - Low throughput

---

# Next-generation sequencing technologies

Second generation (since 2007)

- **454**: Sequencing by Synthesis - PCR Amplification
- **SOLiD**: Sequencing by Ligation - PCR Amplification
- **Ion Torrent**: Sequencing by Synthesis - PCR Amplification
- **Illumina**: Sequencing by Synthesis - PCR Amplification

---

# Illumina : principles

- Based on "reversible terminator chemistry": reversible terminators enable the identification of single nucleotides as they are washed over the DNA strands.

Three steps:

- Amplification of DNA fragments
- Sequencing
- Analysis

[Reference: Technology Spotlight: Illumina® Sequencing](https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf)

---

# Prepare genomic DNA samples

<img src="images/Illumina-1.png" width="30%" style="display: block; margin: auto;" />

Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.

---

# Attach DNA to Flow Cell Surface

<img src="images/Illumina-2.png" width="30%" style="display: block; margin: auto;" />

Bind single-stranded fragments randomly to the inside surface of the flow cell channels.

---

# Bridge Amplification

<img src="images/Illumina-3.png" width="30%" style="display: block; margin: auto;" />

Add **unlabeled** nucleotides and enzyme to initiate solid-phase bridge amplification.

---

# Fragments Become Double Stranded

<img src="images/Illumina-4.png" width="30%" style="display: block; margin: auto;" />

The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.

---

# Denature the Double-Stranded Molecule

<img src="images/Illumina-5.png" width="30%" style="display: block; margin: auto;" />

Denaturation leaves single-stranded templates anchored to the substrate.

---

# Complete Amplification

<img src="images/Illumina-6.png" width="30%" style="display: block; margin: auto;" />

Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell.

---

# Determine First Base

<img src="images/Illumina-7.png" width="30%" style="display: block; margin: auto;" />

The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase.

---

# Image First Base

<img src="images/Illumina-8.png" width="30%" style="display: block; margin: auto;" />

After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.
The blocked 3' terminus and the fluorophore are then removed and the flow cell is washed, leaving the 3' end free for a second cycle.

---

# Determine Second Base

<img src="images/Illumina-9.png" width="30%" style="display: block; margin: auto;" />

The next cycle repeats the incorporation of four labeled reversible terminators, primers, and DNA polymerase.

---

# Image Second Chemistry Cycle

<img src="images/Illumina-10.png" width="30%" style="display: block; margin: auto;" />

After laser excitation, the image is captured as before, and the identity of the second base is recorded.

---

# Sequencing Over Multiple Chemistry Cycles

<img src="images/Illumina-11.png" width="20%" style="display: block; margin: auto;" />

The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

Millions of clusters are processed in parallel, allowing high-throughput sequencing.

---

# Illumina : summary

- High accuracy, >99.5% (main type of errors: substitutions)
- Short reads (maximum 2 x 250 bp)
- Huge throughput (up to 6 Tbp per run on NovaSeq)
- Some under-representation of AT-rich and GC-rich regions.
- [Video](https://youtu.be/fCd6B5HRaZ8)

---

## Sequencing - Glossary

.pull-left[
**Read** : piece of sequenced DNA

**DNA fragment** = 1 or more reads, depending on whether the sequencing is single-end or paired-end

**Insert** = Fragment size

**Depth** = `\(N*L/G\)`
N = number of reads, L = read length, G = genome size

**Coverage** = % of genome covered
]

.pull-right[
<img src="images/se-pe.png" width="80%" style="display: block; margin: auto;" />

<img src="images/fragment-insert.png" width="80%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="images/depth-breadth.png" alt="Single-End , Paired-End" width="80%" />
<p class="caption">Single-End , Paired-End</p>
</div>
]

---

# 3rd generation

Targets the weaknesses of the 2nd generation:

- PCR amplification
- Short reads

Two main competitors (in production):

- Pacific Biosciences (PacBio)
- Oxford Nanopore Technologies (ONT)

---

# PacBio

<img src="images/pacbio.jpg" width="60%" style="display: block; margin: auto;" />

A polymerase is immobilized at the bottom of a sequencing unit called a zero-mode waveguide (ZMW). Four fluorescently labeled nucleotides, which generate distinct emission spectra, are added to the SMRT cell. As a base is held by the polymerase, a light pulse is produced that identifies the base. The replication processes in all ZMWs of a SMRT cell are recorded as a “movie” of light pulses, and the pulses corresponding to each ZMW can be interpreted as a sequence of bases.

[Reference](https://doi.org/10.1126/science.1162986)

---

# PacBio : summary

- Long reads (in the kb range with Sequel II)
- Depends on DNA quality
- High error rate.
Tends to decrease with depth
- Medium throughput

Applications:

- IsoSeq (full-length RNA isoform sequencing)
- Detection of DNA modifications
- Assembly

---

# Oxford Nanopore

<img src="images/nanopore-1.png" width="40%" style="display: block; margin: auto;" />

---

# Oxford Nanopore

<img src="images/nanopore-2.png" width="50%" style="display: block; margin: auto;" />

---

# MinION, GridION, PromethION

<img src="images/nanopore-3.jpg" width="60%" style="display: block; margin: auto;" />

---

# Sequencing on the ISS

<img src="https://nanoporetech.com/sites/default/files/s3/Asset%202hdpi_0.png" width="90%" style="display: block; margin: auto;" />

---

# ONT Summary

- Ultra-long reads (up to 1 Mb (!))
- Length of the reads depends on DNA quality
- Low to high throughput
- "In the field" sequencing
- Direct RNA sequencing, peptide sequencing
- High error rate (5-10%), which tends to decrease with new chemistry, base-calling algorithms and depth

Applications:

- Full-length isoform sequencing, direct RNA sequencing
- Detection of DNA modifications
- Assembly

---

# Another view on sequencing technologies (probably out of date)

<img src="https://flxlexblog.files.wordpress.com/2016/07/developments_in_high_throughput_sequencing.jpg" width="60%" style="display: block; margin: auto;" />

---

# Global Summary (probably out of date)

<img src="images/sequencingsummary.png" width="60%" style="display: block; margin: auto;" />

An interesting review <a name=cite-Goodwin2016></a>([Goodwin, McPherson, and McCombie, 2016](https://doi.org/10.1038/nrg.2016.49))

Nature review : [Milestones in Genomic Sequencing](https://www.nature.com/immersive/d42859-020-00099-0/index.html)

---
class: tp, middle, center

# Switch to Hands-on :
## Connect to Galaxy

---

# Practical session

- *Escherichia coli* genome (re)sequencing
- Illumina MiSeq
- Paired-end sequencing (2 x 150 bp, insert size ~300 bp)
- Subsampled

---

# Connect to Galaxy

- https://galaxy.migale.inrae.fr
- Login : **stageXX**
- Data in "Shared Data / Data 
Libraries / formation NGS / Reads"
- References in "Shared Data / Data Libraries / formation NGS / Refs"

---
class: heading-slide, middle, center

# FASTQ format

---

## FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent data. It may be thought of as a variant of the FASTA format that associates a quality measure with each base of the sequence: **FASTA with QUALITIES**.

---

## FASTQ syntax

The FASTQ format consists of 4 sections:

1. A FASTA-like header, but instead of the <code>></code> symbol it uses the <code>@</code> symbol. This is followed by an ID and more optional text, similar to the FASTA headers.
2. The second section contains the measured sequence (typically on a single line), but it may be wrapped until the <code>+</code> sign starts the next section.
3. The third section is marked by the <code>+</code> sign and may optionally be followed by the same sequence ID and header as the first section.
4. The fourth section encodes the quality values for the sequence in section 2, and must be of the same length as section 2.

---

## FASTQ syntax

<i>Example</i>

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

---

## FASTQ quality

Each character represents a numerical value, a so-called Phred score, encoded as a single character.
```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

The numbers represent the error probabilities via the formula:

`\(Error=10^{-P/10}\)`

It is basically summarized as:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)

---

## FASTQ quality encoding specificities

There was a time when instrument makers could not decide at which character to start the scale. The **current standard** shown above is the so-called Sanger (+33) format, where the ASCII codes are shifted by 33. There is also the so-called +64 format, which starts close to where the other scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
<p class="caption">FASTQ encoding values</p>
</div>

---

## FASTQ header information

Information is often encoded in the “free” text section of a FASTQ file.
<code>@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG</code> contains the following information:

- <code>EAS139</code>: the unique instrument name
- <code>136</code>: the run id
- <code>FC706VJ</code>: the flowcell id
- <code>2</code>: flowcell lane
- <code>2104</code>: tile number within the flowcell lane
- <code>15343</code>: ‘x’-coordinate of the cluster within the tile
- <code>197393</code>: ‘y’-coordinate of the cluster within the tile
- <code>1</code>: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- <code>Y</code>: Y if the read is filtered, N otherwise
- <code>18</code>: 0 when none of the control bits are on, otherwise it is an even number
- <code>ATCACG</code>: index sequence

This information is specific to a particular instrument/vendor and may change with different versions or releases of that instrument.

---
class: tp, middle, center

# Switch to Hands-on :
## Fastq import & visualisation

---
class: heading-slide, middle, center

# Quality control

---

## Why QC your reads?

**What information do you want about the sequencing run when you perform quality control?**

Collective answer on this [collaborative whiteboard](http://scrumblr.ca/ngs)

---

## Why QC your reads?

Try to answer (not always) simple questions:

--

- Do the data conform to the expected level of performance?
  - Size
  - Number of reads
  - Quality
- Residual presence of adapters or indexes?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?

<div class="alert comment">
Quality control without context leads to misinterpretation</div>

---

## Quality control for FASTQ files

- FastQC <a name=cite-fastqc></a>([Andrews, 2010](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
- QC for (Illumina) FastQ files
- Command line fastqc or graphical interface
- Complete HTML report to spot problems originating from the sequencer, library preparation or contamination
- Summary graphs and tables to quickly assess your data

<img src="images/fastqc.png" width="40%" style="display: block; margin: auto;" />

- https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/

---
class: tp, middle, center

# Switch to Hands-on :
## Quality Control with FastQC

---
class: heading-slide, middle, center

# Reads cleaning

---

## Objectives

- Detect and remove sequencing adapters (still) present in the FastQ files
- Filter / trim reads according to quality (as plotted in FastQC)

## Tools

- Simple & fast : Sickle <a name=cite-sickle></a>([Joshi and Fass, 2011](#bib-sickle)) (quality), cutadapt <a name=cite-cutadapt></a>([Martin, 2011](#bib-cutadapt)) (adapter removal)
- Ultra-configurable : Trimmomatic
- All in one & ultra-fast : fastp <a name=cite-fastp></a>([Zhou, Chen, Chen, and Gu, 2018](https://dx.doi.org/10.1093/bioinformatics/bty560))

<img src="images/fastp_wkwf.png" width="45%" style="display: block; margin: auto;" />

---
class: tp, middle, center

# Switch to Hands-on :
## Clean your data with Sickle

---

# Assembly : principles

Similar to a puzzle:

- millions of pieces
- without the original image
- with pieces in both orientations
- the pieces do not necessarily fit together (sequencing errors)
- parts of the puzzle are missing (coverage + sequencing bias)

<img src="images/puzzle.png" width="45%" style="display: block; margin: auto;" />

---

# Assembly

All assembly algorithms are based on read overlap.
- Different ways of calculating overlap:
  * "All vs All" comparison:
      - "old" assemblers are based on this approach
      - Graph representing overlap between reads
      - Quadratic number of comparisons (number of reads^2)
      - does not scale to billions of reads
  * de Bruijn Graph
      - Named after Nicolaas Govert de Bruijn
      - Directed graph representing overlaps between sequences of symbols
      - Sequences can be reconstructed by moving between nodes in the graph

[Slide Credits](https://galaxyproject.github.io/training-material/topics/assembly/tutorials/debruijn-graph-assembly/slides.html)

---

# De Bruijn Graph

- A directed graph of sequences of symbols
- Nodes in the graph are k-mers
- Edges represent consecutive k-mers (which overlap by k-1 symbols)

Consider the de Bruijn graph of the 2-symbol alphabet (0 & 1) for k = 3

<img src="images/dbg-1.png" width="100%" style="display: block; margin: auto;" />

---

# Producing sequences

* Sequences of symbols are produced by moving through the graph

<img src="images/dbg-2.png" width="100%" style="display: block; margin: auto;" />

---

# K-mers?

* To be able to use de Bruijn graphs, we need reads of **length L to overlap by L-1 bases**.
* Not all reads will overlap another read perfectly.
  * Read errors
  * Coverage "holes"
* Not all reads are the same length (depending on technology and quality cleanup)

**To get around these problems, we use all k-length subsequences of the reads: these are the k-mers.**

---

# What are K-mers?
<img src="images/dbg-4.png" width="45%" style="display: block; margin: auto;" />

---

# K-mers de Bruijn graph

<img src="images/dbg-5.png" width="100%" style="display: block; margin: auto;" />

---

# K-mers de Bruijn graph

<img src="images/dbg-6.png" width="100%" style="display: block; margin: auto;" />

---

# K-mers de Bruijn graph

<img src="images/dbg-7.png" width="100%" style="display: block; margin: auto;" />

---

# The problem of repeats

<img src="images/dbg-8.png" width="60%" style="display: block; margin: auto;" />

---

# The problem of repeats

<img src="images/dbg-9.png" width="60%" style="display: block; margin: auto;" />

---

# The problem of repeats

<img src="images/dbg-10.png" width="60%" style="display: block; margin: auto;" />

---

# Different k

<img src="images/dbg-11.png" width="60%" style="display: block; margin: auto;" />

---

# Different k

<img src="images/dbg-12.png" width="60%" style="display: block; margin: auto;" />

2 contigs : *MISSISSIS* *SSIPPI*

---

# Choose k wisely

* Lower k
  * More connections
  * Less chance of resolving small repeats
  * Higher k-mer coverage
* Higher k
  * Fewer connections
  * More chance of resolving small repeats
  * Lower k-mer coverage

**Optimum value for k will balance these effects.**

---

# Sequencing errors

<img src="images/dbg-13.png" width="100%" style="display: block; margin: auto;" />

---

# Sequencing errors

<img src="images/dbg-14.png" width="90%" style="display: block; margin: auto;" />

---

# Sequencing errors

<img src="images/dbg-15.png" width="90%" style="display: block; margin: auto;" />

---

# More coverage

* Errors won't be duplicated in every read
* Most reads will be error free
* We can count the frequency of each k-mer
* Annotate the graph with the frequencies
* Use the frequency data to clean the de Bruijn graph

**More coverage depth will help overcome errors!**

---

# Sequencing errors - coverage

<img src="images/dbg-16.png" width="60%" style="display: block; margin: auto;" />

Which path looks most valid? Why?
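---

# Counting k-mers : a minimal sketch

The idea of counting k-mer frequencies to spot errors can be sketched in a few lines of Python. This is an illustration only, not how an assembler is implemented: the reads, `k` and `coverage_cutoff` are made-up values and `count_kmers` is a hypothetical helper.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-length substring (k-mer) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy data (assumption): 4 error-free reads + 1 read with a substitution (C -> A)
reads = ["ATGGCGTGCA"] * 4 + ["ATGGAGTGCA"]
counts = count_kmers(reads, k=4)

coverage_cutoff = 2  # hypothetical threshold
solid = {kmer for kmer, n in counts.items() if n >= coverage_cutoff}
rare = {kmer for kmer, n in counts.items() if n < coverage_cutoff}

print("kept:", sorted(solid))
print("likely errors:", sorted(rare))
```

The k-mers created by the single error are seen only once, while genuine k-mers are seen in almost every read.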
---

# An important parameter : coverage cutoff

* At what point is a low coverage indicative of an error?
* Can we ignore low coverage nodes and paths?
* This is a new assembly parameter

**Coverage cutoff is an important parameter to differentiate errors from real variations**

---

# de Bruijn Graph Assembly process

1. Select a value for k
2. "Hash" the reads (make the k-mers)
3. Count the k-mers
4. Make the de Bruijn graph
5. **Perform graph simplification steps** - use the coverage cutoff
6. Read off contigs from the simplified graph

---

# Graph simplification : Chain Merging

<img src="images/dbg-17.png" width="60%" style="display: block; margin: auto;" />

---

# Graph simplification : Tip Clipping

<img src="images/dbg-18.png" width="60%" style="display: block; margin: auto;" />

---

# Graph simplification : Bubble Collapsing

<img src="images/dbg-19.png" width="60%" style="display: block; margin: auto;" />

---

# Make contigs

* Find an unbalanced node in the graph
* Follow the chain of nodes and "read off" the bases to produce the contigs
* Where there is an ambiguous divergence/convergence, stop the current contig and start a new one.
* Re-trace the reads through the contigs to help with repeat resolution

---

# Graph simplification : Remove low coverage nodes

* remove erroneous nodes and edges using the "**coverage cutoff**"
* genuine short nodes will be kept because of their high coverage

---

# Assemble with SPAdes

SPAdes <a name=cite-spades></a>([Bankevich, Nurk, Antipov, Gurevich, Dvorkin, Kulikov, Lesin, Nikolenko, Pham, Prjibelski, and others, 2012](#bib-spades)) is the de Bruijn graph assembler from Pavel Pevzner's group in St. Petersburg

- Uses multiple k-mers to build the graph
- Graph has connectivity and specificity
- Usually uses a low, medium and high k-mer size together.
- Performs error correction on the reads first
- Maps reads back to the contigs and scaffolds as a check
- Under active development

---
class: tp, middle, center

# Switch to Hands-on :
## Assembly with SPAdes

---

## Assessment of assembly quality

After assembly, we use QUAST <a name=cite-quast></a>([Gurevich, Saveliev, Vyahhi, and Tesler, 2013](#bib-quast)) to evaluate and compare genome assemblies.

What QUAST does:

- De novo genome assembly evaluation
- Reference-based evaluation
- Evaluating so-called misassemblies
- Report and visualisation

---

## De novo metrics

Evaluation of the assembly based on

* Number of contigs greater than a given threshold (0, 500 nt, 1 kb)
* Total / thresholded assembly size
* Largest contig size
* N50 : the length of the shortest contig needed to reach 50% of the total assembly length when contigs are added from longest to shortest (a length-weighted median of contig lengths)
* L50 : the smallest number of contigs whose lengths add up to 50% of the total assembly length
* N75/L75 : idem, for 75% of the assembly length

---

## Reference-based metrics

* Metrics based on an alignment of all contigs to a reference genome:
  - duplication rate
  - percentage of the genome covered
  - NGA50 : equivalent of N50, computed on the aligned blocks of the contigs
  - "Misassemblies" : alignment breakpoints within a contig
  - Visualisation available

---
class: tp, middle, center

# Switch to Hands-on :
## Assembly QC with Quast

---
class: heading-slide, middle, center

# Alignment

---

# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT    # GLOBAL
                 ATCTTGATCGCCGACATT        # LOCAL, with soft clipping
```

## Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <code>Needleman–Wunsch algorithm</code>, which is based on dynamic programming.
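---

# Global alignment : a minimal sketch

The dynamic programming behind a global alignment can be sketched as follows. This is an illustrative Python version that only returns the best score; the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for the example, not the parameters of any particular tool.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (scoring values are assumptions)."""
    # Row 0: aligning an empty prefix of a with the first j characters of b costs j gaps
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i characters of a aligned with nothing cost i gaps
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("GATTACA", "GATCA"))  # 5 matches and 2 gaps: 3
```

Every cell depends only on its three neighbours (diagonal, up, left), so two rows of the matrix are enough to compute the score; recovering the alignment itself would require keeping the full matrix for traceback.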
---

# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT    # GLOBAL
                 ATCTTGATCGCCGACATT        # LOCAL, with soft clipping
```

## Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <code>Smith–Waterman algorithm</code> is a general local alignment method based on the same dynamic programming scheme, but with the additional freedom to start and end anywhere.

---

# Seed-and-extend : especially suited to NGS data

<img src="https://www.researchgate.net/publication/328816579/figure/fig1/AS:690855836921867@1541724275885/Two-different-approaches-to-genome-assembly-a-in-Overlap-Layout-Consensus_W640.jpg" width="80%" style="display: block; margin: auto;" />

---

# Seed-and-extend : especially suited to NGS data

.pull-left[
Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.

<img src="images/seed_and_extend.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
1. The mapper obtains a read
2. The mapper selects smaller DNA segments from the read to serve as seeds
3. The mapper looks up each seed in an index data structure to obtain a list of possible locations within the reference genome that could result in a match
4. For each possible location in the list, the mapper obtains the corresponding DNA sequence from the reference genome
5. The mapper aligns the read sequence to the reference sequence, using an expensive sequence alignment (i.e., verification) algorithm to determine the similarity between the read sequence and the reference sequence.
]

---

# Mapping

* For further analysis it is necessary to map all the reads on the contigs.
<img src="images/mapping_tools.png" width="55%" style="display: block; margin: auto;" />

* We will use bowtie2 <a name=cite-bowtie2></a>([Langmead and Salzberg, 2012](http://dx.doi.org/10.1038/nmeth.1923))
  * First, we build an index.
  * Then, reads are aligned.
* We can use samtools <a name=cite-samtools></a>([Li, Handsaker, Wysoker, Fennell, Ruan, Homer, Marth, Abecasis, and Durbin, 2009](#bib-samtools)) and bedtools <a name=cite-bedtools></a>([Quinlan and Hall, 2010](#bib-bedtools)) to manipulate SAM/BAM files.

---

## BAM/SAM

* SAM = Sequence Alignment Map
* BAM = Binary Alignment Map

These files represent an alignment of FASTQ reads against a reference (e.g. a FASTA file).

* After a header section (for reference), each line represents the alignment of one read.

<img src="images/SAM_format.jpg" width="55%" style="display: block; margin: auto;" />

---
class: tp, middle, center

# Switch to Hands-on :
## Mapping

---
class: heading-slide, middle, center

# Visualization

---

# Visualization

* Some tools for visualisation and browsing
  * IGV (alignments and reference)
  * Artemis (genome and annotations)

<img src="http://software.broadinstitute.org/software/igv/sites/cancerinformatics.org.igv/files/images/igv_desktop_callouts.jpg" width="55%" style="display: block; margin: auto;" />

---
class: tp, middle, center

# Switch to Hands-on :
## Visualization

---
class: heading-slide, middle, center

# Long reads

---

# Tools for long reads

* Long read data can be used to improve assembly
* Bottlenecks:
  * DNA extraction (?)
  * cost of data generation
  * sequencing errors
* State-of-the-art pipeline for assembly:
  * standalone long read assembly
      * FLYE <a name=cite-metaFlye19></a>([Kolmogorov, Rayko, Yuan, Polevikov, and Pevzner, 2019](https://doi.org/10.1101/637637))
      * canu
  * Optional error correction with short reads
      * Unicycler

---
class: heading-slide, middle, center

# Take home message

---
class: tp, middle

# Take home message

### → You have in your hands the first tools to analyze your NGS data
### → Data quality control is a crucial step
### → It is essential to define your analysis plan upstream of your project.
### → NGS data analysis is still an active bioinformatics research field
### → Biostatistics ...

---

# References

<a name=bib-fastqc></a>[Andrews, S.](#cite-fastqc) (2010). _FastQC A Quality Control tool for High Throughput Sequence Data_. URL: [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<a name=bib-spades></a>[Bankevich, A, S. Nurk, D. Antipov, et al.](#cite-spades) (2012). "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing". In: _Journal of computational biology_ 19.5, pp. 455-477.

<a name=bib-Goodwin2016></a>[Goodwin, S, J. D. McPherson, and W. R. McCombie](#cite-Goodwin2016) (2016). "Coming of age: ten years of next-generation sequencing technologies". In: _Nature Reviews Genetics_ 17.6, pp. 333-351. DOI: [10.1038/nrg.2016.49](https://doi.org/10.1038%2Fnrg.2016.49). URL: [https://doi.org/10.1038/nrg.2016.49](https://doi.org/10.1038/nrg.2016.49).

<a name=bib-quast></a>[Gurevich, A, V. Saveliev, N. Vyahhi, et al.](#cite-quast) (2013). "QUAST: quality assessment tool for genome assemblies". In: _Bioinformatics_ 29.8, pp. 1072-1075.

<a name=bib-sickle></a>[Joshi, N. and J. Fass](#cite-sickle) (2011). _Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files_.

---

# References (2)

<a name=bib-metaFlye19></a>[Kolmogorov, M, M. Rayko, J. 
Yuan, et al.](#cite-metaFlye19) (2019). "metaFlye: scalable long-read metagenome assembly using repeat graphs". DOI: [10.1101/637637](https://doi.org/10.1101%2F637637). URL: [https://doi.org/10.1101/637637](https://doi.org/10.1101/637637).

<a name=bib-bowtie2></a>[Langmead, B. and S. L. Salzberg](#cite-bowtie2) (2012). "Fast gapped-read alignment with Bowtie 2". In: _Nature Methods_ 9.4, pp. 357-359. ISSN: 1548-7105. DOI: [10.1038/nmeth.1923](https://doi.org/10.1038%2Fnmeth.1923). URL: [http://dx.doi.org/10.1038/nmeth.1923](http://dx.doi.org/10.1038/nmeth.1923).

<a name=bib-samtools></a>[Li, H, B. Handsaker, A. Wysoker, et al.](#cite-samtools) (2009). "The sequence alignment/map format and SAMtools". In: _Bioinformatics_ 25.16, pp. 2078-2079.

<a name=bib-cutadapt></a>[Martin, M.](#cite-cutadapt) (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads". In: _EMBnet.journal_ 17.1, pp. 10-12.

<a name=bib-bedtools></a>[Quinlan, A. R. and I. M. Hall](#cite-bedtools) (2010). "BEDTools: a flexible suite of utilities for comparing genomic features". In: _Bioinformatics_ 26.6, pp. 841-842.

---

# References (3)

<a name=bib-fastp></a>[Zhou, Y, Y. Chen, S. Chen, et al.](#cite-fastp) (2018). "fastp: an ultra-fast all-in-one FASTQ preprocessor". In: _Bioinformatics_ 34.17, pp. i884-i890. ISSN: 1367-4803. DOI: [10.1093/bioinformatics/bty560](https://doi.org/10.1093%2Fbioinformatics%2Fbty560). eprint: http://academic.oup.com/bioinformatics/article-pdf/34/17/i884/25702346/bty560.pdf. URL: [https://dx.doi.org/10.1093/bioinformatics/bty560](https://dx.doi.org/10.1093/bioinformatics/bty560).