
SPRI alternatives for NGS: survey results

Everyone loves bead cleanups, and it appears that almost everyone (85%) who read my recent post about SPRI alternatives loves Agencourt AMPure XP. I'd asked readers to complete a short survey on whether they used AMPure XP, a commercial alternative, or a home-brew version - the results are below.

Take the survey: https://www.surveymonkey.co.uk/r/V6LN5VX

I was surprised to see more home-brew responses than commercial alternatives, but this could simply reflect the attitudes of people reading CoreGenomics.

Come and work in my lab...

I've just readvertised for someone to join my lab as the new Genomics Core Deputy Manager. We're expanding the Genomics core and building single-cell genomics capabilities (we currently have both 10X Genomics and Fluidigm C1). I'm looking for a senior scientist who can help lead the team, who has significant experience of NGS methods and applications, and who ideally has an understanding of the challenges single-cell genomics presents. You'll be hands-on, helping to define the single-cell genomics services we offer and build these over the next 12-18 months.



You'll have a real opportunity to make a contribution to the science in our institute and drive single-cell genomics research. The Cancer Research UK Cambridge Institute is a great place to work. It's a department of the University of Cambridge and one of Europe's top cancer research institutes. We are situated on the Addenbrooke's Biomedical Campus, and are part of both the University of Cambridge School of Clinical Medicine and the Cambridge Cancer Centre. The Institute's focus is high-quality basic and translational cancer research, and we have an excellent track record in cancer genomics. The majority of data generated by the Genomics Core facility is next-generation sequencing, and we support researchers at the Cambridge Institute as well as nine other University Institutes and Departments within our NGS collaboration.

You can get more information about the lab on our website. You can get more information about the role, and apply, on the University of Cambridge website.

I don't want to leave Europe

Brexit sucks... probably. The issue is we don't really know what the vote means, or even if we'll actually leave the European Union in the next couple of years at all. However, one thing cannot be ignored: the two-fingered salute to our European colleagues from the 52% of the UK voting population that got out of bed on a rainy Thursday.


I am privileged to work in one of the UK's top cancer institutes at the top UK university: the Cancer Research UK Cambridge Institute, a department of the University of Cambridge. The institute, and the University, is an international one with people from all across the globe; many of the staff in my lab have come from outside the UK, and they are all great people to work with. I dislike the idea that these people feel insecure about their future because our politicians have done such a crap job of governing the country.

I'd like to keep the international feel, so if you're still thinking that working in the UK would be good for you (and you're a genomics whiz) then why not check out the job ad for a new Genomics Core Deputy Manager? We're expanding the lab and putting lots of effort into our single-cell genomics service (10X Genomics and Fluidigm C1 right now). I'm looking for a senior scientist of any nationality to help lead the team, with NGS experience and ideas about single-cell genomics.



You can get more information about the lab on our website. You can get more information about the role, and apply, on the University of Cambridge website.

Measuring translation by ribosomal footprinting on MinION?

Oxford Nanopore should have kits for direct RNA sequencing available later this year, and have several examples of how these might be used on their website. The method presented at London Calling (see OmicsOmics coverage) is primarily for mRNA, but it is likely to be adapted for other RNA species in due course.

One of the ideas I've briefly thought about is using the MinION to perform ribosome profiling - a basic method would involve fixing ribosomes to mRNAs with cycloheximide treatment, then ligating adapters to the RNA after cell lysis. Fast-mode sequencing would identify the 3' end and the transcript, then sequencing speed would be massively ramped up to zip the mRNA through the pore; the bound ribosomes should cause sequencing to stall, allowing counting of stall events and therefore the number of ribosomes attached to an mRNA.
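To make the counting idea concrete, here's a minimal sketch of just the stall-counting step, assuming a hypothetical export of per-base dwell times for each read; the threshold and run-length values are invented for illustration and this is very much not a real ONT analysis tool.

# Toy sketch: count sustained pauses in a per-base dwell-time trace (seconds per base)
# as putative ribosome stalls. Thresholds are illustrative assumptions only.
def count_stalls(dwell_times, stall_threshold=1.0, min_run=5):
    """Count runs of >= min_run consecutive bases whose dwell time exceeds stall_threshold."""
    stalls, run = 0, 0
    for dt in dwell_times:
        if dt > stall_threshold:
            run += 1
        else:
            if run >= min_run:
                stalls += 1
            run = 0
    if run >= min_run:
        stalls += 1
    return stalls

# Example: a read with two sustained pauses -> two putative ribosomes on this mRNA
trace = [0.01] * 50 + [2.0] * 10 + [0.01] * 100 + [1.5] * 8 + [0.01] * 30
print(count_stalls(trace))  # 2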


Nanopore ribosome profiling: In the figure above, cells from two distinct populations would be sorted (A) and mRNA with ribosomes still attached would be extracted (B). mRNAs are ligated with nanopore sequencing adapters (C). Nanopore sequencing shows distinct pauses at ribosome locations, allowing estimation of quantitative translation (D). This is very much an idea that has not gone much further than the back of an envelope, and many issues need to be considered in developing methods like this:

  • Ribosomes might not survive the ONT library prep
  • Ribosomes may be too firmly attached
  • RNA may "slip" through the ribosome with no effect on sequencing
  • Pauses in sequencing may be sequence specific
  • RNA modifications may add all sorts of challenges


Figure A from Smith et al 2015
tRNA-seq: Mark Akeson's group recently published work on MinION tRNA-seq where they tailed the tRNA with a leader-adapter such that the molecule unzipped from its usual, highly structured conformation as it was sequenced through the pore (possibly even refolding on the other side of the membrane). Other RNA species are likely to benefit from similar methods, perhaps including ribosome profiling.

I think data like the tRNA-seq paper, and ideas like mine above and an earlier rRNA-seq idea, demonstrate how much more we may be able to do on a Nanopore sequencer than on our current short-read Illumina sequencers.

The future is very difficult to predict - but it is fun to try!


Comparison of DNA library prep kits by the Sanger Institute

A recent paper from Mike Quail's group at the Sanger Institute compares 9 different library prep kits for WGS. In Quantitation of next generation sequencing library preparation protocol efficiencies using droplet digital PCR assays - a systematic comparison of DNA library preparation kits for Illumina sequencing, the authors used a digital PCR (ddPCR) assay to look at the efficiency of ligation and post-ligation steps. They show that even though final library yield can be high, this can mask poor adapter ligation efficiency - ultimately leading to lower diversity libraries.

In the paper they state that PCR-free protocols offer obvious benefits in not introducing amplification biases or PCR errors that are impossible to distinguish from true SNVs. They also discuss how the emergence of greatly simplified protocols that merge library prep steps can significantly improve the workflow as well as the chemical efficiency of those merged steps. As a satisfied user of the Rubicon Genomics library prep technology (e.g. for ctDNA exomes) I'd like to have seen this included in the comparison*. In a 2014 post I listed almost 30 different providers.



Hidden ligation inefficiency: The analysis of ligation efficiency by the authors sheds light on an issue that has been discussed by many NGS users - whether library yield is an important QC metric or not. Essentially yield is a measure of how much library a kit can generate from a particular sample, but it is not a measure of how "good" that library is. Only analysis of final library diversity can really act as a sensible QC.

The authors saw that kits with high adapter ligation efficiency gave similar yields when compared to kits with low adapter ligation efficiency (fig 4 reproduced above). They determined that the most likely cause was that the relatively high amount of adapter-ligated DNA going into PCR inhibits the amplification reaction, leading to lower than expected yields. For libraries with low adapter ligation efficiency a much lower amount of adapter-ligated DNA makes it into PCR, but because there is no inhibition the amplification reaction leads to higher than expected yields. The best-performing kits were Illumina TruSeq Nano and PCR-free, and the KAPA Hyper kit, with ligation yields above 30%; the KAPA HyperPlus was fully efficient.
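As a back-of-the-envelope sketch of why yield and diversity decouple (my own illustration, not the authors' model; the molecule counts and per-cycle PCR efficiencies are invented for the example):

# ddPCR-style molecule counts let you separate ligation efficiency from final yield.
# The per-cycle PCR efficiencies below are illustrative assumptions only.
def ligation_efficiency(adapter_ligated_molecules, input_molecules):
    return adapter_ligated_molecules / input_molecules

def library_yield(adapter_ligated_molecules, pcr_cycles, per_cycle_efficiency):
    return adapter_ligated_molecules * (1 + per_cycle_efficiency) ** pcr_cycles

print(ligation_efficiency(3e8, 1e9))   # a 30% efficient ligation
print(ligation_efficiency(3e7, 1e9))   # a 3% efficient ligation
# Despite the 10-fold difference in ligated molecules, an inhibited PCR (0.6/cycle here)
# versus an uninhibited one (0.95/cycle) brings the final yields into the same ballpark,
# while library diversity still differs 10-fold.
print(library_yield(3e8, 10, 0.6), library_yield(3e7, 10, 0.95))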

Control amplicon bias: the PhiX control used had three separate PCR amplicons amplified to assess bias. The kits with the lowest bias, at less than 25% for each fragment size, were KAPA HyperPlus and NEBNext. The Illumina TruSeq Nano kit showed different biases when using the "Sanger adaptors" rather than "Illumina adaptors", which the authors suggest highlights that both adapter and fragment sequence play a role in the cause of this bias.

Which kit to choose: The authors took the same decision as most kit comparison papers and shied away from making overt claims about which kit was "best". They did discuss fragmentation and PCR-free protocols as important points to consider.
  • If you have lots of DNA then aim for PCR-free to remove any amplification errors and/or bias.
  • If you don't have a Covaris then the newest enzymatic shearing methods, e.g. KAPA Fragmentase, have significantly less bias than previous chemical fragmentation methods.
Ultimately practicability, the overall time and number of steps required to complete a protocol, will be uppermost in many users' minds. The fastest protocols were the NEBNext Ultra kit, KAPA HyperPlus, and Illumina TruSeq DNA PCR-free.

*Disclosure: I am a paid member of Rubicon Genomics' SAB.

How much time is lost formatting references?

I just completed a grant application and one of the steps required me to list my recent papers in a specific format. This was an electronic submission and I’m sure it could be made much simpler, possibly by working off the DOI or PubMed ID? But this got me thinking about the pain of reformatting references and the reasons we have so many formats in the first place. It took me ten minutes to get references in the required format, and I've spent much longer in the past - all wasted time in a day that is already too full!

I use Mendeley as my reference manager of choice and it has a very good Word plug-in that makes it easy to add references and build a final reference list when writing papers. I used it for my PhD with over 160 references and it coped pretty well. Mendeley, EndNote, et al make changing reference styles pretty easy, but why do we have to bother at all?

In digging into this I came across a post by Jay Fitzsimmons at the Canadian Field-Naturalist blog. Jay's post is well written and describes the problem well - lots of citation styles, but no real evidence about which is most efficient.

How did reference styles evolve: Once upon a time the only way to access published information was to go to your University library and find the paper you were looking for (it wasn't that long ago). House styles were developed by publishers as a set of standards for the writing and design of articles in their periodicals. There was little or no effort to determine the most efficient way to communicate the information in a reference. A big reason for abbreviating information, or omitting article titles etc., was to reduce the amount of text - simply to save money for publishers of printed materials. There is even an ISO standard just for abbreviating journal titles! Even though we're in the electronic age there might still be good reasons to abbreviate references. Who wants to read a 300-author list (unless you're one of the authors of course)!

What do I think is important: It depends on why I’m looking at a reference in the first place but here are my priorities
  1. The title is the most common thing I use to decide whether a paper is one I should read; I'd like to see it every time.
  2. Second on my list is the year of publication; there's sometimes no point looking at old references in a fast-moving field (but beware of this simple cull of useful reading material).
  3. Then I’d like a link to the paper - personally I’m happy with the PubMed ID or DOI.
  4. Last come the lead author(s), as these are likely to be the people with the most to gain from the publication in the immediate future.
As far as the authors go, in the context of my grant application (or perhaps a CV or job application) I'd prefer a simple numbering format: the author's place in the author list and the total number of authors, perhaps with an asterisk to denote joint first or corresponding author status, e.g. 2*/17 where I am the 2nd author in a list of 17 but a joint 1st author.

Lastly I’d set it all to a nice delimited format so a screen grab from almost anything can be easily imported into whatever I need to use the reference in.

I don't really care about the Journal and certainly not volume and/or page numbers as I am NOT going to look for this in the library!

So here’s my suggestion:
Murtaza/Dawson/Tsui. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature. 2013. DOI:10.1038/nature12065. PMID:23563269. 13/17.

Compare this to:
Murtaza M, Dawson SJ, Tsui DW, Gale D, Forshew T, Piskorz AM, Parkinson C, Chin SF, Kingsbury Z, Wong AS, Marass F, Humphray S, Hadfield J, Bentley D, Chin TM, Brenton JD, Caldas C, Rosenfeld N. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature. 2013 May 2;497(7447):108-12.
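For what it's worth, turning my suggested format into code is trivial - a minimal sketch, with function and field names of my own invention:

# Minimal sketch of the reference format suggested above; names are my own choice.
def format_ref(lead_authors, title, journal, year, doi, pmid, my_position, total_authors, joint_first=False):
    lead = "/".join(lead_authors)                                  # first few surnames
    position = f"{my_position}{'*' if joint_first else ''}/{total_authors}"
    return f"{lead}. {title}. {journal}. {year}. DOI:{doi}. PMID:{pmid}. {position}."

print(format_ref(["Murtaza", "Dawson", "Tsui"],
                 "Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA",
                 "Nature", 2013, "10.1038/nature12065", "23563269", 13, 17))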

I guess nothing is going to change in the field anytime soon. But I feel better for getting this off my chest. And I’ve sent feedback to the funder...

Whole genome amplification improved

A new genome amplification technology from Expedeon/Sygnis, TruePrime, looks like it might work great for single-cell and low-input analyses - particularly copy number. TruePrime is a primer-free multiple displacement amplification technology. It uses the well established phi29 DNA polymerase and a new TthPrimPol primase, which eliminates the need to use random primers and therefore avoids their inherent amplification bias. The senior author on the TthPrimPol primase paper, Prof Luis Blanco, is leading the TruePrime research team.


I saw a recent poster with results demonstrating equal amplification and homogeneous coverage (see image above), no primer artefacts, and high identification of both SNPs and CNVs. TruePrime gave very similar CNV data to unamplified DNA, with very little apparent amplification or coverage bias, from low-coverage whole genome sequencing (12 million reads). Competitors "R" and "G" did not look so good.

What does TthPrimPol do in the cell: TthPrimPol is a DNA and RNA primase with DNA-dependent DNA and RNA polymerase activity. It is a unique human enzyme capable of de novo DNA synthesis solely with dNTPs and is found primarily in the nucleus - TthPrimPol -/- cells show inefficient mtDNA replication, but it is not an essential protein. In the mitochondria TthPrimPol provides the primers for leading-strand mtDNA synthesis in the replication fork. It is an important protein in the mitochondria, where the highly oxidative environment leads to replication stress and genome instability. It is also capable of reading through template lesions such as 8oxoG, a common DNA lesion produced by reactive oxygen species that causes G to T and C to A substitutions. This may have a useful application in the amplification of FFPE-damaged DNA.

Using TruePrime in single-cell sequencing: I can see several opportunities for using this technology in my lab, including both single-cell systems, 10X Genomics and Fluidigm C1, for future copy-number methods. It is also likely to be useful for other low-input experiments and we're likely to couple it with Nextera XT or similar.

I'm sure we'll see some great work using this enzyme if it really works as well as the company suggests - if you are using TruePrime please do let me know how you are getting on!




Core Genomics is going cor-porate (sort of)

I've just had my five year anniversary of starting the Core Genomics blog! Those five years have whizzed by and NGS technologies have surpassed almost anything I dreamed would have been possible when I started using them in 2007. My blog has also grown beyond anything I dreamed possible and the feedback I've had has been a real motivating factor in keeping up with the writing. It also stimulated my move onto Twitter and I now have multiple accounts: @CIGenomics (me), @CRUKgenomecore (my lab) and @RNA_seq, @Exome_seq (PubMed Twitter bots).

The blog is still running on the Google Blogger site I set up back in 2011 and I feel ready for a change. This will allow me to do a few things I've wanted to do for a while and over the next few months I'll be migrating core-genomics to a new WordPress site: Enseqlopedia.com




Introducing Enseqlopedia: The new home of Core Genomics will be a chance for me to expand on something I've been doing for many years - explaining NGS to users. The same blog content is going to keep flowing, but other stuff will appear alongside, and I hope you'll find it informative and entertaining.

The Enseqlopedia name was chosen as I'll be adding content describing methods, linking to the best papers that demonstrate or advance them, and hopefully making the new site a useful resource for the community. It will also be somewhere I can serve up more PubMed Twitterbot output in a single place outside of Twitter. I'd also like to reinvigorate the sequencer map Nick Loman and I put together many years ago. Some of the motivation for these changes comes from my dissatisfaction with sites that serve up NGS news but simply regurgitate press releases from academics or companies in the NGS field; I want to deliver more than this. Hopefully you already agree that my blog posts hit the spot, and I'm hoping the new stuff is of real interest to readers. I aim to make sure you can see that what appears has been carefully chosen and has an opinion behind it.

Core Genomics corp: The biggest change is going to be the appearance of commissioned or sponsored content, i.e. stuff I get paid to post. I've not tried to monetise my blog before, mainly because I don't like unsightly ads all over the place; however, I've been asked reasonably frequently to write about new products in the NGS space and until now I've always turned the offers down. I have ghost-written other content, but nothing on Core Genomics has been paid for - and all the topics have been chosen by me. The two new types of post will be tagged so you can tell immediately what you're reading:
Commissioned posts will be tagged "commissioned content" and labelled at the top of the post so you know who paid me to write the piece. All commissioned content will be taken on with full editorial control i.e. I decide what ends up in the final piece, and I will have written the post.
Sponsored posts will be tagged "sponsored content" and labelled at the top of the post so you know who wrote the piece. Sponsored content will only be accepted by me if I think readers of Core Genomics would be interested. The content is likely to be written by the sponsor and should be considered as an advert. Although I will decide whether a sponsorship opportunity will get posted I will NOT have full editorial control i.e. I get to decide on what sponsored content appears on the site, but I will not have written the post.
My first sponsored piece will be coming soon. Although the topic has been chosen by someone else, the opinions are very much my own. I'm not expecting to write much more than one sponsored post a month (so any NGS companies reading this better get their requests in soon), and I'm not going to write about something I really don't believe in. 

I'll also be making it clearer what kind of consultancy work I'm happy to take on. Mostly this has been technological consulting for investors who want to understand market reactions to new instruments or developments (with Brexit came a rush of consultancy work). But I've also consulted for technology companies, and for research groups.

Thanks for reading Core Genomics - hopefully you'll be reading for another five years!

James.

RNA-seq advice from Illumina

This article was commissioned by Illumina Inc.

The most common NGS method we discuss in our weekly experimental design meeting is RNA-seq. Nearly all projects will use it at some point, either to delve deeply into hypothesis-driven questions or simply as a tool to go fishing for new biological insights. It is amazing how far a project can progress in just 30 minutes of discussion; methodology, replication, controls, analysis, and all sorts of bias get covered as we try to come up with an optimal design. However, many users don't have the luxury of in-house bioinformatics and/or genomics core facilities, so they have to work out the right sort of experiment for themselves. Fortunately people have been hard at work creating resources that can really help, and most recently Illumina released an RNA-seq "Buyer's Guide" with lots of helpful information, including how to keep costs down.



Illumina's "Buyer’s Guide": the guide offers advice on common RNA-Sequencing methods and should help new users in evaluating the many options available for next-generation sequencing of RNA. Anyone considering a differential gene expression analysis experiment should have RNA-seq as their platform of choice and the guide presents three simple steps for users to consider different aspects of their experiments.


1) First of all make sure you understand what your scientific question is! This sounds simple but all too often people want to get too much out of one experiment and end up in a bit of a mess. Better to answer one question well than two questions badly. Once you've thought about this it should be clear whether you want to analyse mRNAs for a simple differential gene expression experiment or are after something else, e.g. splicing, and also whether you'll need to look at more than just poly-adenylated mRNAs. And if possible try to determine ahead of time whether the genes you're interested in studying are highly expressed or very rare.
 
2) Once you've thought about this you can consider what sort of samples you have: are they low quality and/or low quantity? You should also consider who's going to do the work in the lab and who's going to analyse the sequence data.

3) Now you can really think about the final experimental design: what type of library preparation kit to use, replicate numbers, proper controls, depth of sequencing, etc. Illumina's RNA-seq buyer's guide describes some of the things you'll need to consider in choosing the read depth and run type, and also includes some tips for keeping the costs of your experiment down.

What do people mean when they say "RNA-seq": When people say "RNA-seq" most of them are talking about differential gene expression (DGE) by sequence analysis of reverse-transcribed poly-adenylated mRNAs, but by changing the depth or type of sequencing, and/or choosing a different library prep kit, you can investigate so much more. The guide includes three different scenarios for RNA-seq experiments: basic differential gene expression; DGE plus allele-specific expression, isoforms, SNVs and fusions; and finally whole transcriptome analysis. These show the breadth of experiments you can consider once you've mastered this method.

The first two scenarios showcase the power of RNA-seq and demonstrate how using a single library prep method, but varying the sequencing allows very different questions to be asked of your samples. The guide recommends Illumina's TruSeq Stranded mRNA-seq kits (these are the ones we use most in my lab and we have done so ever since beta-testing the original RNA-seq kit many years ago). Scenario #1 is a simple DGE experiment and Illumina recommends you generate ≥ 10 million reads per sample, using single-end 50bp reads (SE50). Scenario #2 allows a full mRNA analysis by simply changing read depth to ≥ 25 million reads per sample, and using paired-end 75 bp reads (PE75).

If you are interested in more than poly-adenylated mRNAs then changing the RNA-seq library prep kit to Illumina's TruSeq Stranded Total RNA gets rid of ribosomal RNAs, letting you analyse both coding and non-coding RNA. Much greater read depth is needed and Illumina recommend ≥ 50 million PE75 reads per sample. Completing the RNA-seq line-up are the TruSeq small RNA kits, which allow you to analyse microRNAs and other smaller transcripts; this usually requires only 1-2 million SE50 reads per sample.

How do Illumina's recommendations stack up: The guide is pretty good in the suggestions it makes for common RNA-seq methods. I'd aim a bit higher for DGE and suggest 20 million reads per sample to allow profiling of high, medium and lowly expressed genes. I'm really not keen on the suggestion that MiSeq or NextSeq mid-output are good tools for RNA-seq as, from my experience, most experiments with sufficient replication will be too large to fit into a single sequencing run. I'd argue that the cheapest way to get your RNA-seq data is going to be on HiSeq 4000, until of course we can run RNA-seq on X Ten. Of course not everyone should buy a HiSeq, and a MiniSeq, MiSeq or NextSeq may be a good fit for your own laboratory; but I'd encourage you to consider the benefits of using your local core lab first, especially if you are planning experiments bigger than 12-24 samples. I'm not sure I'd argue quite as strongly for paired-end data, and would prefer splicing, ASE and fusion detection to come from higher-depth sequencing instead (50M SE50 reads cost about the same as 25M PE75 reads).
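To show the sort of arithmetic behind that opinion, here's a minimal sketch; the per-run read counts are rough assumptions from memory, so check current spec sheets before planning a real experiment.

# Rough lane/run arithmetic; the per-run outputs (millions of reads) are approximate
# assumptions, not quoted specifications.
import math

RUN_OUTPUT_M_READS = {
    "MiSeq v3": 25,
    "NextSeq high-output": 400,
    "HiSeq 4000 lane": 300,
}

def runs_needed(samples, reads_per_sample_m, platform):
    total = samples * reads_per_sample_m
    return total, math.ceil(total / RUN_OUTPUT_M_READS[platform])

# A modest DGE experiment: 3 groups x 6 replicates at 20M reads per sample = 360M reads
for platform in RUN_OUTPUT_M_READS:
    print(platform, runs_needed(18, 20, platform))
# -> 15 MiSeq runs, 1 NextSeq high-output run, or 2 HiSeq 4000 lanes (before overheads)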

Why does my lab focus on mRNA-seq DGE: My own choices for RNA-seq are primarily informed by the questions people say they want to answer in experimental design discussions - and nearly all of these are differential gene expression questions. As such my lab runs lots and lots of Illumina's stranded mRNA-seq kits. We only run some form of ribosomal reduction when the experiment warrants it, as these methods generally require deeper sequencing for the same differential gene expression analysis power. We've very few users who need to run FFPE RNA, so although we tested the RNA Access kit we've yet to really use it in a significant project. This is partly because the research groups coming to my lab understand the limitations of FFPE samples, and work hard to procure fresh-frozen material wherever possible.

A brief bit about informatics: This article is focussed on the wet lab, but without a good analysis pipeline you'll be stuck with some big but unusable FASTQ files. The analysis requirements are heavily influenced by the biological questions being asked, by the samples available, and by the library preparation and sequencing performed. I'd always recommend that users make sure they know what analysis is likely to be performed before generating data.

Many others have weighed in on how to use and design RNA-seq experiments (see the list of my favourite references at the bottom of this post). Nearly everyone agrees that replication is key, with most people suggesting 4-6 biological replicates. Most papers agree on read depth being kept to under 20M reads per sample. The ENCODE RNA-seq guidelines are very different, recommending just two biological replicates and 30M paired-end reads per sample - I've never agreed with this, even when it was published in 2011, and have steered people to other resources. The blogosphere also offers lots of help; a 2013 post by GKNO (Marth lab, U. Utah) and the RNA-seqlopedia (U. Oregon) are two great reads for people who want to know more.

All Illumina products listed are for research use only. Not for use in diagnostic procedures (except as specifically noted).

Further reading:

10X Genomics single-cell 3'mRNA-seq explained

10X Genomics have been very successful in developing their gel-bead droplet technology for phased genome sequencing and, more recently, single-cell 3'mRNA-seq. I've posted about their technology before (at AGBT 2016, and in March and November 2015) and based most of what I've written on discussions with 10X or on presentations by early-access users. Now 10X have a paper up on the BioRxiv: Massively parallel digital transcriptional profiling of single cells. This describes their approach to single-cell 3'mRNA-seq in some detail and shows how you might use their technology to better understand biology and complex tissues.

Technical performance of the GEMcode system: The paper is unfortunately based on the earlier GEMcode system rather than the latest Chromium, but the results are likely, though not certain, to be representative of what Chromium can deliver.

Technical performance was assessed using 1200 Human 293T or Mouse 3T3 cells, with 100,000 reads per cell. 71% of reads aligned to the Human or Mouse genomes (38% and 33% respectively). Analysis of the UMIs allowed the authors to estimate the total number of cell-containing GEMs at just over 1000 (482 and 538 Human or Mouse respectively). Only 8 GEMs appeared to have Human and Mouse cells co-located, as assessed by GEM-barcoded reads aligning to both genomes. It is not easy (is it even possible?) to detect Human:Human or Mouse:Mouse cell doublets, so the inferred doublet rate for this experiment was 1.6% (see figure 2a in the paper with multiplet GEMs as grey dots).



The 1.6% multiplet (doublet, triplet, or higher) rate appears low, but as cell numbers increase so does the multiplet rate; the authors describe a linear relationship of multiplet rate to cell loading from 1000-10,000 cells (Supplementary Fig. 1a), however it is not clear how this rate changes at 20k, 30k, 40k, or 50k (the maximum loading recommended). What the impact is on experiments I do not know - but this is an area several labs are focusing on. The multiplet rate "approximately followed a Poisson distribution" as assessed by imaging experiments (Supplementary Fig. 1b). In these a Nikon microscope equipped with a high-speed camera capable of capturing 4000 frames per second imaged GEMs as they were created. 28,000 frames were analysed for single-cell encapsulation (7 seconds of video, which only represents about 1.5% of the time your Chromium is actually making GEMs), but the multiplet rate was 16% higher than expected - I don't think the authors delve deeply enough into the reasons for this. Multiplets are likely to add significant noise to analysis of single-cell experiments; every single-cell technology has to account for them, and cells like to stick together, so users probably can't rely on actually having a single-cell suspension in the first place.
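For readers who want a feel for what Poisson loading predicts, here's a minimal sketch; the mean cells per GEM (lam) is left as a free parameter because the GEM count needed to convert "cells loaded" into lam is not something I've taken from the paper.

# Minimal sketch of Poisson-loading expectations (not 10X's own model). lam is the mean
# number of cells per GEM and scales with the number of cells loaded.
import math

def multiplet_fraction(lam):
    """Expected fraction of cell-containing GEMs that hold more than one cell."""
    p0 = math.exp(-lam)          # empty GEMs
    p1 = lam * p0                # singlet GEMs
    return 1 - p1 / (1 - p0)

for lam in (0.01, 0.05, 0.1):
    print(lam, round(multiplet_fraction(lam), 3))
# At low loading the fraction is roughly lam/2, i.e. it grows approximately linearly
# with cell loading, consistent with the linear relationship the authors report.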

To further investigate this the authors also carried out mixing experiments with Human 293T (female & expressing XIST) and Jurkat cells (male & expressing CD3D). Figure 2e (see above) in the paper shows the PCA for these mixes at 100% 293T, 100% Jurkat, 50:50 or 10:90. The 50:50 mix shows a lot of cells in the space between the cell clusters; I'd suggest this indicates higher multiplet rates in this experiment than the 1.6% suggested? But I could not see the cell loading density used, which may explain the higher numbers of apparent multiplets.

Cell capture efficiency: The rate of cell capture is important, especially where rare cell populations are being studied. 10X captures about 50% of the cells loaded into GEMs (Supplementary Tables 1&3), and whilst this could be increased it would come at the cost of an increased cell doublet/triplet rate. This might be a parameter users are willing to tweak depending on their needs, and it would be interesting to ask how many users would accept higher doublet rates in return for 80-90% cell capture. What we really need in a single-cell system is the ability to image cells in droplets so we can exclude empty drops, doublets and triplets; I'd be interested to know if anyone is working on something like this?

The level of cross-talk between cell barcodes was about 1% (see Online Methods) but it is not clear in the manuscript where this cross-talk comes from. If it is error in reading the cell barcodes then this could be reduced by sequencing longer, more error-tolerant barcodes, and a longer barcode read (if >25bp) would allow a proper error estimation of the index read. But if this is coming from molecular cross-over during the downstream library prep (which is going to happen to some degree) then fixing it will be much more difficult (see these papers to learn more about PCR chimeras and their effect on NGS: NAR 2012, NAR 1990, JBioChem 1990, NAR 1995).
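As an illustration of what "more error-tolerant barcodes" buys you, here's a generic whitelist-plus-Hamming-distance sketch - my own example, not necessarily what 10X actually do:

# Generic barcode rescue by Hamming distance: an observed barcode within one mismatch of
# a single whitelisted barcode is corrected; anything ambiguous or too distant is dropped.
# Longer barcodes make accidental collisions between whitelist entries rarer.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed, whitelist, max_dist=1):
    hits = [bc for bc in whitelist if hamming(observed, bc) <= max_dist]
    return hits[0] if len(hits) == 1 else None

whitelist = ["ACGTACGTAC", "TTGCAGGTCA", "GGATCCTTAG"]
print(correct_barcode("ACGTACGTAA", whitelist))  # one error: rescued to ACGTACGTAC
print(correct_barcode("ACGTACGGCA", whitelist))  # too many errors: None (discarded)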

83% of UMIs were associated with cell barcodes suggesting that cell-free RNA does not significantly affect the results - this is an issue scSeq users will have to consider carefully as the amount of cell-free RNA or DNA in a sample is likely to be highly variable, and it may be that experiments with artificially high levels might show us the failure mode in these sample types.

Transcript counting: With 100,000 reads per cell the authors report a median detection rate of 4500 genes or 27,000 transcripts, with little bias for GC content or gene length. However, as this is a 3' assay I'd not expect a huge variation here, and this is something that would become much more important as 10X, and others, move to whole-transcript assays. Clustering analysis was performed using Seurat (Satija et al., 2015).
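The UMI logic itself is simple enough to sketch in a few lines (this ignores UMI error correction and everything else a real pipeline does):

# Minimal UMI counting: transcripts per gene per cell are the number of unique UMIs,
# so PCR duplicates of the same molecule are only counted once. UMI sequencing-error
# correction and multi-mapping reads are deliberately ignored here.
from collections import defaultdict

def count_umis(reads):
    """reads: iterable of (cell_barcode, gene, umi) tuples from aligned, tagged reads."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(u) for key, u in umis.items()}

reads = [("CELL1", "ACTB", "AATTCC"), ("CELL1", "ACTB", "AATTCC"),  # PCR duplicate
         ("CELL1", "ACTB", "GGCCTT"), ("CELL2", "ACTB", "AATTCC")]
print(count_umis(reads))  # {('CELL1', 'ACTB'): 2, ('CELL2', 'ACTB'): 1}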

SNV detection from scRNA-seq data: While deciphering population structure and discovering rare cells is great, many people will want to look for SNP/SNVs in their scRNA-seq data. The authors reported an analysis of a curated set of high-quality SNVs observed in only 293T or only Jurkat cells, but not both (see Online Methods). They showed that they could detect SNVs reliably, and that multiplet rates predicted from SNVs were highly correlated with those from gene expression analysis. The paper is confusing in suggesting that each cDNA generates 250bp of sequence for SNV detection, when the sequencing run generates only 98bp in read 1 from the cDNA (I'd like to understand this better, or see it corrected in the final version if it is a typo).

scRNA-seq from frozen cells: In the discussion the authors make a strong statement about the ability to analyse frozen cells: "the ability of GemCode to generate faithful scRNA-seq profiles from cryopreserved samples enables its application to clinical samples". The frozen cells in question were fresh cells recovered from whole blood, cryopreserved and "gently thawed" one week later (see Online Methods). Only a small number of genes (57) showed greater than 2-fold upregulation (no down-regulated genes were reported), suggesting that freezing cells is possible. However, I suspect that the minimal freezing time and "gentle" protocols will put many users off relying on cell storage until more comprehensive evaluation is undertaken. The fact that they got such good results is encouraging; we're working on a project with patient material that needs to be processed immediately for best results. Right now we're bringing cells over from the hospital about one hour after collection and processing them straight away, but this is not an efficient use of the technology when the plastic chip holds 8 samples and costs $150 each time.

A few words about sequencing 10X scRNA-seq libraries: In the paper the authors say that after GEMcode prep "libraries then undergo standard Illumina short-read sequencing" - there is nothing standard about the run type you need to do for 10X. It is a 98.14.8.10 format run - 98bp 1st read (mRNA), 14bp Index 1 (UMI), 8bp Index 2 (sample index), 10bp 2nd read (Cell index) - I hope I got that right!
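Spelled out in code, with the same caveat that I may have the index assignments the wrong way round, the run structure looks like this:

# The 98.14.8.10 run structure as described above (caveat: the UMI/cell-index
# assignments may be swapped - check the protocol before setting up a run).
RUN_STRUCTURE = [
    ("read 1", 98, "cDNA (mRNA)"),
    ("index 1", 14, "UMI"),
    ("index 2", 8, "sample index"),
    ("read 2", 10, "cell barcode"),
]
total_cycles = sum(length for _, length, _ in RUN_STRUCTURE)
print(total_cycles)  # 130 cycles - not a standard paired-end recipe on most instruments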

10X sequencing does not fit easily into a core lab running HiSeq instruments due to the run configuration (we need 8 lanes of the same sample type). I suspect this is going to get much easier as we do more and more 10X sequencing, but for now we're either running longer reads than necessary, or using NextSeq/2500 RapidRuns. Chromium genomes can now be run on X Ten as PE150 with no modification. Hopefully single-cell RNA-seq will move to a more standard single-end run for differential gene expression, this would make life easier for my team, and reduce costs by around 40%.

Summary: All in all this paper explains many of the things potential users of 10X single-cell are looking to understand. I'm expecting papers to come thick and fast over the next six months now people have the instruments in their hands.

It is going to be interesting to see how 10X develop their chemistry, particularly for whole transcriptome single-cell, for copy-number and for applications like G&T-seq or scM&T-seq, or even ATAC-seq.

How will RainDance fight back with their own single cell methods? And how does this 3'mRNA-seq assay compare to Fluidigm's C1? Both of these are questions I look forward to seeing answered. Ultimately the more technologies we have for single-cell the better, there are likely to be strengths and weaknesses in each. But I'd not be surprised if the one with the most open chemistry becomes dominant - this was part of Illumina/Solexa's success as it meant users could develop methods from a core technology.

PS: Supplementary Figure legends are available on BioRxiv, but not the figures - go figure! Online methods are also missing. Probably because the BioRxiv does not check if these have been submitted.

Upcoming Genomics conferences in the UK

It is almost time for the kick off at Genome Science, probably the best organised academic conference in the UK. It runs from August 30th to September 1st next week and sadly I can't be there (just returned from holidays and too much going on). You can hear from a wide range of speakers in a jam packed agenda. This year it is hosted by the University of Liverpool, and the evening entertainment comes from Beatles Tribute Band “The Cheatles”!

What other conferences are available for Genomics in the UK, and which one should you attend if you too can't make it over to Liverpool? The Wellcome Trust Genome Campus is holding their first Single Cell Genomics conference from September 9th (sold-out I'm afraid). Personally I thought that the London Festival of Genomics was excellent and I've high hopes for the January 2017 meeting. 

Often it is word of mouth that brings a conference to my attention, but there are a couple of resources out there to help.
  • AllSeq maintain a list of conferences.
  • GenomeWeb has a similar list, but it seems less focused than AllSeq.
  • NextGenSeek has a list for 2016, but nothing on the cards for 2017 yet.
  • Nature has an events page (searchable) that lists 50 upcoming NGS conferences.

PS: please do let me know if you've particular recommendations on conferences to attend. And do get in touch with the groups above to list your conference on their sites.

PPS: If you can justify it then the HVP/HUGO Variant Detection Training Course - "Variant Effect Prediction" running from 31st October 2016 is in Heraklion, Crete - a beautiful place to learn!

Optalysys eco-friendly genomics analysis

The amount of power used in a genome analysis is not something I'd ever thought about until I heard about Optalysys, a company developing optical computing that has the potential to be 90% more energy-efficient and 20X faster than standard (electronic) compute infrastructure. Read on if you are interested in finding out more, and watch the video below - featuring Prof Heinz Wolff!



Optalysys was originally spun out from the University of Cambridge and the technology needs a lot more explanation than I'll give: briefly, they split laser light across liquid crystal grids where each "pixel" can be modulated to encode analogue numerical data in the laser beam; the light diffracts, forming an interference pattern, and a mathematical calculation is performed - all at the speed of light. The beam can be split across many liquid crystals to increase the multiplicity and complexity of the mathematical operations performed.
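As a very rough electronic analogue of the principle (my own illustration, not Optalysys' actual pipeline), cross-correlating one-hot encoded sequences in the Fourier domain finds where a short pattern best matches a longer sequence - the kind of operation an optical correlator can perform in a single pass of light:

# Rough electronic analogue only: FFT-based cross-correlation of one-hot encoded DNA,
# illustrating Fourier-domain pattern matching; this is not Optalysys' method.
import numpy as np

def one_hot(seq):
    lut = {"A": 0, "C": 1, "G": 2, "T": 3}
    m = np.zeros((4, len(seq)))
    for i, base in enumerate(seq):
        m[lut[base], i] = 1
    return m

def best_match_offset(reference, query):
    ref, q = one_hot(reference), one_hot(query)
    n = ref.shape[1] + q.shape[1]                      # zero-pad to avoid wrap-around
    score = np.zeros(n)
    for channel in range(4):                           # sum per-base correlations
        score += np.real(np.fft.ifft(np.fft.fft(ref[channel], n) *
                                     np.conj(np.fft.fft(q[channel], n))))
    return int(np.argmax(score[:ref.shape[1]]))

print(best_match_offset("ACGTACGTTTGGCCAACGT", "TTGGCC"))  # 8: query matches at offset 8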

Optalysys and the Earlham Institute in Norwich are collaborating on a project to build hardware/software that will be used for metagenomic analysis. This is a long way from comparing 500 matched tumour and normal genomes in an ICGC project; but if Optalysys can build systems to handle this scale then the huge compute processing tasks might be carried out at a fraction of the current cost, whilst running from a standard mains power supply.

PS: do you remember the Great Egg race as fondly as I do?

Celebrating 10 years at the CRUK-Cambridge Institute today

Today I have been working for Cancer Research UK for ten years! September 1st 2006 seems like such a short time ago, but a huge amount has changed in that time in the world of genomics. NGS has changed the way we do biology, and is changing the way we do medicine. The original Solexa SBS has been pushed hard by Illumina to give us the $1000 genome, and perhaps just as exciting are the results coming out of Oxford Nanopore's MAP community - this may be the technology to displace Illumina? What the next ten years will hold is difficult to predict, but today I wanted to focus on my highlights of the last ten years at CRUK.

CRUK-Cambridge Institute circa early 2006
I was employed to build a brand new genomics facility and was hired for my expertise in gene expression microarrays - previously I'd set up an Affymetrix facility at the John Innes Centre in what is now the Earlham Institute. Perhaps the one thing I remember from my interview is the answer to a question I'd posed at the end: "Will the CRUK institute be using the new next-generation sequencing technologies?" NGS was still in its infancy then: in late 2005 the first 454 paper made a big splash, a Solexa sequencer had been installed at the Sainsbury lab in Norwich, and I'd heard interesting things about the technology.

The answer was something like "we want this facility to focus on microarrays, we'll see if the NGS comes to anything useful". Well, everyone reading Core-Genomics knows how disruptive NGS was; microarrays are dead (for gene expression anyway), and virtually all the data we generate in my lab comes off an Illumina HiSeq sequencer.

When I arrived the site had only just been handed over by the builders. In January of 2007 we had the first instruments installed and were processing Sanger sequencing and Illumina arrays by the Spring. But we'd decided to get our first sequencer and our initial discussions with the Solexa rep ended up with the purchase of an Illumina GAI. The rest as they say is history.

Highlights from the last ten years: The Institute celebrates its 10th anniversary in February 2017 so I'll not go into too much detail about the top ten projects the Genomics core has been involved with. But I did want to pick out three projects that I was personally involved with and that I think were major advances.
  1. Understanding gene regulation: In a wonderful paper, Species-specific transcription in mice carrying human chromosome 21, Mike Wilson, in Duncan Odom's group, demonstrated that sequence differences in regulatory regions are the dominant force in governing when and where genes are expressed. Mike designed an incredibly elegant experiment using a Mouse model of Down's syndrome: the TC1 mouse carries an extra copy of chromosome 21, but it is a Human copy. That Human chromosome is in a mouse nuclear environment, and this allowed the authors to show that Mouse transcription factors bound to Human DNA in a Human-specific context, i.e. the DNA sequence was the dominant force driving gene expression. Mike and Duncan were instrumental in the development of NGS at the Institute. Mike was great to work with, and hosted probably the best "crash pad" in Cambridge; and Duncan has kept up an amazing pace of research over the whole of the last ten years.
  2. Molecular subtyping of Breast cancer: The METABRIC project was a major reason I took the job at CRUK. It was the largest array project I ever worked on and had a huge impact on our understanding of breast cancer, revealing novel subtypes with distinct clinical outcomes and subtype-specific driver genes. It was truly a landmark study. The Genomics core processed all of the UK-based samples, extracting DNA and RNA, quality controlling and normalising them for analysis. I managed the Affymetrix genotyping on SNP6.0 arrays, carried out as a service by Aros in Denmark. And my lab processed all of the 2500 Illumina HT12 arrays used in the study in just 6-8 weeks. Christina Curtis now runs her own lab at Stanford. And the Caldas group continues to lead on breast cancer genomics; most recently we've been working with them on a PDX project where we introduced low-coverage WGS of pre-capture exome libraries to significantly improve CNV calling.
  3. Liquid biopsy: Probably the biggest advance I've been involved with, NGS analysis of ctDNA as a liquid biopsy, is changing the way we do cancer medicine. Tim Forshew in Nitzan Rosenfeld's group was the first person to use NGS to non-invasively identify mutations by sequencing the DNA from a patient's tumour circulating in their blood. In a hugely impactful Science Translational Medicine paper Tim and colleagues showed that this could be used to detect and quantify mutations seen in the tumour, that de novo mutations could be identified, and that a liquid biopsy could be used to monitor tumour progression in patients. Mohammad Murtaza (now Assistant Prof at TGen) pushed the technology even further by showing that it was possible to perform whole exome analysis of ctDNA, and that this could be used to monitor tumour evolution. This was a groundbreaking study published in Nature, but when I presented it at AGBT the following year the audience was still highly skeptical of how widely ctDNA might be used - that has changed, and now there are dozens of companies pursuing liquid biopsy, including Nitzan and Tim's Inivata.
I've worked with some amazing people over the last decade, many of whom have gone on to start their own labs. My team has been great; people have come and gone, marriages have happened and babies have been born. The CRUK Cambridge Institute continues to be an excellent place to work, and is still a world leader in genomics, and I've played my part in helping that to happen. Here's to the next ten years.

Sequencing base modifications: going beyond mC and 5hmC

A great new resource was recently brought to my attention on Twitter and there is a paper describing it on the BioRxiv: DNAmod: the DNA modification database. Nearly all of the modified nucleotide sequencing we hear and read about concerns modifications to cytosine, mostly methylcytosine and hydroxymethylcytosine; you may also have heard about 8-oxoG if you are interested in FFPE analysis. All sorts of modified nucleotides occur in nature and may be important in biological processes, where they can vary across the tissues of an organism, or may just be chemical noise. The modifications are most important when they change the properties of the DNA strand, how it is read, and what might or might not bind to it, e.g. mC.




The biology of base modification is very complex - DNA methyltransferase marking Cytosine with a 5-methyl, TET family enzymes oxidising 5-methylcytosine to 5-hydroxymethylcytosine, and thymine DNA glycosylase-mediated base excision repair back to unmodified Cytosine. Many groups have worked on methods to sequence modified bases, with Shankar Balasubramanian's research group here in Cambridge most closely associated with 5hmC-seq in his CEGX spinout.

DNAmod DB: The DNA modification database lists 38 modified bases, only 7 of which have only been observed synthetically. It gives a brief description of each modified base, including its likely biological function, and most importantly for readers of Core Genomics it lists the methods that can be used to map the modifications in the genome.

Unfortunately it appears to miss the OxBS-seq method published by Booth et al in 2012, but does have the competing TAB-seq method published by Yu et al in the same year.

Not all bases are modified to the same extent: There are a total of 128 modified nucleotides reported in the unverified list on DNAmod. I'd assumed modifications would be spread about evenly across the biological building blocks, but they vary quite significantly: Uracil has 45 mods (I'm guessing modifications in ribonucleotides need less careful control?), Adenine (39) has nearly twice as many modifications as Guanine (19), and Cytosine (13) and Thymine (12) have the fewest.


Citation: Sood AJ, Viner C, Hoffman MM. 2016. DNAmod: the DNA modification database. bioRxiv 071712.

Nuclear sharks live for 400 years

A wonderful paper in a recent edition of Science uses radiocarbon dating to show that the Greenland shark can live for up to 400 years - making it the longest lived vertebrate known. See: Eye lens radiocarbon reveals centuries of longevity in the Greenland shark (Somniosus microcephalus).


“Who would have expected that nuclear bombs [one day] could help to determine the life span of marine sharks?” The authors used measurements of 14C radiocarbon isotopes in eye lens nuclei to estimate a life span of around 300 years, with the oldest animal approximately 392 years old. A complication in their analysis was the “bomb pulse”: the pulse of carbon-14 produced by nuclear tests in the 1950s. This creates a spike in radiocarbon levels; however, only the two smallest, and presumably youngest, of the 28 animals analysed had the high 14C levels associated with the bomb tests.

Why eye lens nuclei? It turns out that the lens is made from metabolically inert crystalline proteins, and the nucleus, which is formed during prenatal development, retains proteins synthesised at age 0.

No sex until you're 150: This longevity comes at a price though, and for the Greenland shark the price is that sexual maturity is not reached for a very long time - around a female shark's 156th birthday!

Animals that live this long are rare, and horribly susceptible to Human activities - primarily fishing, shipping and pollution in the case of marine vertebrates. Most of the animals used in this study came from several years of collecting dead sharks, many of them accidentally ensnared when trawling for commercial catches.

10X Genomics phasing explained

This post follows on from my previous one explaining the 10X Genomics single-cell mRNA-seq assay. This time round I'm reviewing the method as described in a paper recently put up on the BioRxiv by 10X's Deanna Church and David Jaffe: Direct determination of diploid genome sequences. This follows on from the earlier Nat Methods paper, which was the first 10X de novo assembly of NA12878, but on the GemCode system. While we are starting some phasing projects on our 10X Chromium box, the more significant interest has been in the single-cell applications. But if we can combine the two methods (or something else) to get single-cell CNV then 10X are onto a winner!




The paper describes the 10X Genomics Chromium phasing technology. They highlight the impact of their tech by first reminding us that the majority of Human genomes sequenced to date are analysed by alignment to the reference (an important point often forgotten by users). They say that only a few de novo Human assemblies have been created, but that most do not truly represent complex biological genomes. The authors only consider two published genomes as true diploid de novo assemblies - Levy et al. PLoS Biol 2008: The diploid genome sequence of an individual human, and Cao et al. Nat Biotech 2015: De novo assembly of a haplotype-resolved human genome.

The method: They introduce the 10X Chromium library prep. This starts with 1.25ng of >50kb DNA, from which 16bp-barcoded random genomic loci are copied (by polymerase extension?) inside the Chromium gel-beads. Each droplet contains around 10 molecules, equal to ~0.5 Mb of the genome. The most important bit of the tech is the ability to put just 0.01% of the diploid Human genome into a single droplet - this makes the probability of both alleles being present vanishingly small. With 2 lanes of X Ten you can expect to get about 60X Human genome coverage, and the authors calculate the number of "linked reads" per molecule as 60, which equates to around 0.4x coverage per molecule (enough for shallow CNV sequencing to reveal clonality in tumours perhaps).
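The back-of-the-envelope numbers in that paragraph are easy to reproduce; the constants below are as stated above or assumed (X Ten paired 150bp reads):

# Reproducing the rough numbers above; constants are as stated in the text or assumed.
GENOME_BP = 3.2e9                        # haploid human genome
MOLECULES_PER_GEM = 10
MOLECULE_BP = 50_000                     # ~50 kb input molecules
LINKED_READ_PAIRS_PER_MOLECULE = 60
READ_PAIR_BP = 2 * 150                   # X Ten PE150 (assumed)

dna_per_gem_bp = MOLECULES_PER_GEM * MOLECULE_BP
fraction_of_diploid_genome = dna_per_gem_bp / (2 * GENOME_BP)
per_molecule_coverage = LINKED_READ_PAIRS_PER_MOLECULE * READ_PAIR_BP / MOLECULE_BP

print(f"{fraction_of_diploid_genome:.4%} of the diploid genome per GEM")  # ~0.008%
print(f"{per_molecule_coverage:.2f}x coverage per molecule")              # ~0.36x, i.e. ~0.4x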

Question to the authors: I do not understand the statement about smaller genomes getting lower linked-read coverage: "For smaller genomes, assuming that the same DNA mass was loaded and that the library was sequenced to the same read-depth, the number of Linked-Reads (read pairs) per molecule would drop proportionally, which would reduce the power of the data type. For example, for a genome whose size is 1/10th the size of the human genome (320 Mb), the mean number of Linked-Reads per molecule would be about 6, and the distance between Linked-Reads would be about 8 kb, making it hard to anchor barcodes to short initial contigs." My first assumption was that genome size would have no impact on linked-read depth, but that it would significantly affect the amount of the genome present in a single droplet. As such the smaller genome, with DNA fragments of the same size, should still have around 60 linked reads per DNA molecule, but a 10Mb genome would mean 5% was in each droplet, making the phasing much harder to determine. Please feel free to explain this to me.

The data: In the paper they present data from seven Human genomes, sequenced on HiSeq X Ten and assembled using the "pushbutton" Supernova algorithm (it won't run on your MacBook Pro as you'll need >384Gb of RAM). In just two days per genome they generated 100kb+ contigs with 2.5Mb phase blocks. The 7 genomes include 4 with parental data to verify phasing results, as well as one sample used in the HGP. They include a figure (see below) showing the Supernova assembly of the HGP sample aligned to a 162kb clone which is part of the GRCh37 reference. It almost completely matches the reference sequence, with the 8 variants including just 1 SNV (green), plus 6 homopolymer and 1 di-nucleotide repeat length variants (blue/cyan). The second figure shows the representation of the path a FASTA sequence takes through the "megabubbles" separating parental alleles, and "microbubbles" caused by longer repeats and homopolymers.


Whose careful hand at 10X Genomics drew this representation of FASTA?

Tuning 10X phasing to your needs: Users may be able to "tune" scaffold N50 by varying DNA length or sequencing coverage. A single X Ten lane generating 30x coverage looks like it would push scaffold N50 down from 17 to 12 Mb. DNA quality is probably most important and I suspect many people will accept a significant improvement in phasing estimation from lower cost experiments.

Many groups will also want to run differently sized genomes and will need to estimate how much DNA to use and how much sequencing they'll require. For small genomes this gets really interesting and 10X could be an awesome metagenomics tool allowing strain-level analysis of complex samples. For the larger non-Human genomes people will need to use a much smaller amount of DNA in a single run, which may limit the number of genome copies to an unreasonable level (a rough version of this arithmetic is sketched after the list below).
  • Human 3Gb = 1ng = 300 genome copies
  • Wheat 5Gb = 0.67ng = 135 genome copies
  • Maize 20Gb = 0.17ng = 8 genome copies
  • Salamander 50Gb = 0.07ng = 1.3 genome copies
  • Paris japonica 150Gb = 0.02ng = 0.15 genome copies
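The arithmetic behind that list is straightforward - a minimal sketch using the standard ~0.978 Gb of DNA per picogram conversion, which roughly reproduces the numbers above:

# Rough genome-copy arithmetic behind the list above (~0.978 Gb of DNA per pg).
def genome_copies(mass_ng, genome_gb):
    pg_per_copy = genome_gb / 0.978        # mass of one genome copy in picograms
    return mass_ng * 1000 / pg_per_copy

for name, gb, ng in [("Human", 3, 1.0), ("Wheat", 5, 0.67), ("Maize", 20, 0.17),
                     ("Salamander", 50, 0.07), ("Paris japonica", 150, 0.02)]:
    print(name, round(genome_copies(ng, gb), 1))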

Who's going to use Chromium phasing: Is this kind of data going to be relevant enough for people to adopt 10X Chromium as the default genome library prep? I suspect many teams are working on 100s or even 1000s of 10X Genomics genomes right now and we'll see many more publications very soon. If the $500 Chromium prep can add real value (biologically or clinically) then 10X have a real chance of becoming a new standard for library prep. If that's the case I guess we'll see how strong their IP is as the competition builds their own variants of the technology.

10X Genomics publications

Anyone that's been reading Core-Genomics will have seen my interest in the technology from 10X Genomics. I've been watching and waiting for publications to come out to get a better understanding of how people are using the technology and thought you might like my current list of articles: many of these are on the BioRxiv and should be available in a reputable journal if you're reading this in 2017 or later!

The number of 10X Genomics publications is going to grow rapidly; and this list will only be updated sporadically!






This paper by Deanna Church and David Jaffe et al describes the 10X Genomics Chromium phasing technology. I've done a more comprehensive write up of this paper here on Core-Genomics. Essentially this is the paper to refer to if you're considering using Chromium phasing in your own research and want to better understand how it works and what you can do. The authors explain the basic principles of generating LinkedReads, and present data on 7 Human genomes successfully assembled from HiSeq X data using the Supernova algorithm. Assemblies are good with 100kb+ contigs and 2.5Mb phase blocks, and the HGP sample used had excellent alignment to the reference along a 162kb contig.



ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter. BioRxiv 2016 Aug.

The authors present ABySS 2.0 and compare it to the previous version and 5 other assemblers: BCALM2, DISCOVAR, Minia, SGA and SOAPdenovo. They used the Genome in a Bottle data: a 70x coverage Human genome sequenced with Illumina paired 250 bp reads (PE250), as well as mate-pair data, 10X Genomics Chromium data, and BioNano optical mapping data. ABySS 2.0 generated an N50 of 3.5 Mb using only 35 GB of RAM (still won't run on your MacBook Pro). Whilst this is not a 10X paper per se, they do discuss the limitations of current short reads and the impact the 10X technology is likely to have on assembly: adding the BioNano Genomics and 10X Chromium data increased scaffold N50 from 29 to 42 Mb. In Fig. 3 from the paper (see below) the authors show all 90 scaffolds over 3 Mb, which add up to 90% of the genome, and state that "most chromosome arms are reconstructed by 1 to 4 large scaffolds".


Fig.3 from Jackman/Vandervalk et al 2016




High-Quality Assembly of an Individual of Yoruban Descent. BioRxiv 2016 Aug.

The authors present a hybrid assembly of NA19240 using multiple technologies including PacBio, BioNano Genomics, Illumina sequencing, 10X Genomics LinkedReads, and BAC hybridisation and sequencing. They explain the need for multiple technologies given that no single method "can fully resolve every genomic feature and/or region", and argue that BAC tiling is still a useful technology. I'd be interested to know how useful this remains once 10X Genomics becomes standardised, as the time and cost involved in BAC library construction, mapping and sequencing - let alone the huge amount of DNA required - put it quite outside the reach of most labs.

The assembly presented is the first in a set of 5 genomes which the authors are aiming to use to improve the diversity of the reference genome. They refer to "Gold" and "Platinum" genomes, but I cannot tell which the final assembly was considered to be. The final assembly had a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb, which according to the authors "represents one of the most contiguous high-quality human genomes".




A hybrid approach for de novo human genome sequence assembly and phasing. Nat Methods. 2016 Jul.

This paper describes a combinatorial approach to de novo assembly and phasing analysis using Illumina sequencing, 10X Genomics (GemCode) LinkedReads, and BioNano Genomics mapping; again using NA12878.





Massively parallel digital transcriptional profiling of single cells. BioRxiv 2016 Jul.

This paper describes the 10X Genomics single-cell 3' mRNA-seq technology. I'd previously covered this paper here on Core-Genomics. Essentially this is the paper to read if you're considering 10X Genomics single-cell RNA-seq in your own research and want to better understand how it works and what you can do. The authors explain the basic principles of the method, and present data from 250,000 cells across 29 samples. An awesome paper...when does it come out in a journal?

Ben Hindson (10X Genomics CSO) will be presenting this work at the San Diego Festival of Genomics if you'd like to know more.




Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. BioRxiv 2016 Jun.

This paper presents a novel clustering method for single-cell RNA-seq data. One of the data sets they used was the 10X Genomics single-cell RNA-seq of PBMCs from Zheng et al 2016. They present their method, SIMLR (single-cell interpretation via multi-kernel learning), and show that it more accurately defines subpopulations from single-cell data than either t-SNE or PCA methods.


Fig 5: 2D visualisation of data from 5 cell sub-populations by PCA (b), SIMLR (c) and t-SNE (d).





Health and population effects of rare gene knockouts in adult humans with related parents. Science Apr 2016 (originally on BioRxiv).


This paper presents the use of 10X Genomics phased genome sequencing as a confirmatory method in a study identifying gene knockouts created by rare homozygous predicted loss-of-function (rhLOF) variants from exome sequencing data. In one case a PRDM9 rhLOF was confirmed by 10X Genomics sequencing. PRDM9 is a gene involved in the localisation of meiotic crossovers, yet the individual was healthy and fertile. The results suggest there are alternative mechanisms for localising human meiotic crossovers, as PRDM9 LOF leads to infertility in mice and an inability to repair double-strand breaks. The authors state that we need to be careful when interpreting predicted loss-of-function events.


Third-generation sequencing and the future of genomics. BioRxiv April 2016.

This review of third-generation NGS systems describes the 10X Genomics Chromium genome technology as a mapping, rather than a sequencing, application. 10X Genomics is lumped in with BioNano Genomics, Dovetail Genomics' Chicago (Hi-C) method, genetic maps and mate-pair mapping. The paper includes a great table highlighting the characteristics of the different 3rd-gen platforms (reproduced below).




Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016 Mar.

This paper from Hanlee Ji's group at Stanford and 10X's Ben Hindson et al describes the 10X Genomics GemCode phasing technology. It is the first paper to demonstrate droplet-based methods for phasing and structural variant analysis. This is the other paper to refer to if you're considering phasing in your own research, but the more up-to-date Chromium BioRxiv Church/Jaffe paper (see above) will give you better information about the technical performance today (Sept 2016).


Fig 1: 10X technology overview

This paper demonstrates what can be done for cancer genomes, and that is what makes it such an important read for people deciding if 10X might be useful in their research here at the CRUK Cambridge Institute. I've previously written about why I'm excited about using phasing to resolve complex structural rearrangements and determine whether multiple variants in the same gene are in cis or trans (cis- is on the same allele, trans- is on the other allele).

For a single colorectal cancer patient they generated 50x Illumina WGS and 30x 10X Genomics WGS. (The choice of the name 10X Genomics requires some explanation; written as it is here, only the inclusion of the word "Genomics" in the sentence makes it easily interpretable. And given that most 10X Genomics phasing data will be generated on Illumina's X Ten, we're having to ask for "10X on X Ten" or "an X Ten 10X genome" - I get them mixed up in conversations and I know PIs and post-docs do too!) Multiple deleterious cancer mutations, including the known driver genes TP53 and NRAS, five rearrangements and 26 copy-number variants were found. The most interesting result presented was a C>T mutation in TP53 causing a deleterious nonsynonymous R213Q substitution, confirmed in the LinkedRead data as being on one haplotype. The other haplotype was shown to be deleted in the same region, leading to LOH, with the only copy present carrying the TP53 C>T mutation - a single but inactivated copy of TP53. This phased cancer genome was produced from 1 ng of ~50 kb DNA from a sample with 70% tumour purity - this is pretty close to many of the samples that people are collecting, but careful reporting of this kind of information is going to be vital as we work out which samples might sensibly be run on the 10X Genomics tech, and which we should leave for now.

A previously validated EML4-ALK translocation was detected in the lung cancer cell line NCI-H2228. To target the exome for phasing, 10X Genomics and Agilent have partnered on a modified capture panel that includes baits designed to target the introns and improve pull-down of the large genomic fragments. The 200x sequencing depth was measured after removal of duplicates, so this could be very deep sequencing indeed. However the 10X Genomics data revealed that this is not a simple inversion, but something more complex involving a deletion that includes exons 2-19 of ALK.

They discuss Illumina's Moleculo technology (refs 6-9 in the paper), pointing out that the main reasons these methods are sub-optimal are the relatively large amount of DNA required and the relatively low number of partitions generated - both limit how well the technology can be applied.

The authors conclude their discussion with the following statement "phased cancer genomes will provide new insight into the genomic alterations underlying tumor development and maintenance". I think the next few months will see other papers being published confirming how useful the technology really is. And who knows how soon we might see a phasing panel specifically for DNA repair genes being used in the clinic for instance?




Haplotypes drop by drop. Nat Biotechnol 2016 Mar.

In this News and Views article Jacob Kitzman (University of Michigan) describes the data from the Zheng et al paper (see above) in the same issue, and explores the impact it might have on the field. The article clearly describes the issue that clinicians want to understand: whether both copies of a gene are affected, e.g. as in cystic fibrosis, where two mutations, one on each allele, knock out both copies of the CFTR gene, or whether the same haplotype is hit twice with mutations in cis.

He suggests how other methods might be improved by the use of 10X phasing technology, including metagenomics (we're trying this with a collaborator), and phasing cDNA to analyse transcriptomes more deeply with regard to splice isoform diversity.

One of the questions Kitzman poses is "Whether the 10X Genomics platform will be widely adopted may depend as much on its cost above and beyond standard whole-genome shotgun sequencing as on its technical merit." The papers above are showing just how useful the 10X Genomics tech is turning out to be...but as I said at the start of this post, this list is going to grow rapidly and will only be updated sporadically!



Reporting on Fluidigm's single-cell user meeting at the Sanger Institute

The Genomics community is pushing ahead fast on single-cell analysis methods, as these are revolutionising how we approach biological questions. Unfortunately my registration went in too late for the meeting running at the Sanger Institute this week (follow #SCG16 on Twitter), but the Fluidigm pre-meeting was a great opportunity to hear what people are doing with their tech. And it should be a great opportunity to pick other users' brains about their challenges with all single-cell methods.



Imaging mass-cytometry: the most exciting thing to happen in 'omics?

Mark Unger (Fluidigm VP of R&D) started the meeting off by asking the audience to consider the two axes of single-cell analysis: 1) Number of cells being analysed, 2) what questions can you ask of those cells (mRNA-seq is only one assay) - proteomics, epigenetics, SNPs, CNVs, etc.

Right now Fluidigm has the highest number of applications that can be run on single cells, with multiple Fluidigm and/or user-developed protocols on the Fluidigm Open App website; 10X Genomics only have single-cell 3' mRNA-seq right now, as do BioRad/Illumina and Drop-seq. But I am confident other providers will expand into non-3' mRNA assays...I'd go further and say that if they don't they'll find it hard to get traction, as users are likely to require a platform that can do more than one thing.


There are three sessions over the two days:

  • Session I: Single-cell heterogeneity, classification and discovery
  • Session II: Immunotherapy in oncology—new insights at single-cell resolution
  • Session III: Single-cell functional biology 


Session I: Single-cell heterogeneity, classification and discovery


Achieve new insights through single-cell biology. Candia Brown, Director Strategic Marketing, Fluidigm

Candia asked the audience "what are we trying to do with single-cell genomics methods?" She focussed her brief introductory presentation on understanding biological mechanisms and pathways, cell differentiation, cell lineage, etc, and on biomarker discovery, therapeutics...or even use in the clinic in the future? Much of the initial work has been done on identifying cell types within populations and understanding heterogeneity. Moving beyond this kind of classification requires more complex methods and analyses. Ultimately we'll need to be using spatio-temporal methods such as in-situ sequencing of carefully prepared samples, and combined analyses of RNA, DNA and protein data. Detection from single cells was a hot topic for Fluidigm at the beginning of 2016, and Candia showed examples of population classification and discussed how we might move past relatively "simple" atlasing studies to more complex experiments that aim to make mechanistic insights. Fluidigm aim to present all the latest updates on their tech during this meeting for the C1, Biomark, Helios and Polaris systems.

Dissecting cerebral organoids and fetal cortex using single-cell RNA-seq. Gray Camp, a post-doc in Svante Pääbo's group at the Max Planck Institute for Evolutionary Anthropology, Germany. Gray is also collaborating closely with the Treutlein lab.


Cerebral organoids make biological experimentation easier in the same way that tumour organoids are better informing cancer biology. The group are deconstructing cellular heterogeneity in cerebral organoids using single-cell RNA-seq compared to bulk analysis. Now using organoids developed from patients to generate samples that recapitulate periventricular neuronal heterotopia. Also following the reprogramming of fibroblasts into induced neurons (recently published in Nature and in their News and Views). This great editorial in Development discusses the impact that organoids are having on biological research.

Becoming a new neuron in the cerebral cortex. Ludovic Telley, University of Geneva, Switzerland.

Ludovic is also talking about cells in the brain; single-cell methods are having a huge impact on brain biology. His talk focussed on the "L4 neurons", the main recipient of sensory input into the brain. They use a novel technology called FlashTag to visualise and isolate neurons during their development (see their Science 2016 paper). Isolated neurons are then profiled using Fluidigm single-cell RNA-seq to track neuronal transcriptional programs. They found that waves of transcriptional activity are seen as each neuron progresses from proliferative to migratory and finally to connectivity phases.

A cost-effective 5' selective single-cell transcriptome profiling approach. Pascal Barbry, Institut de Pharmacologie Moléculaire et Cellulaire, France.


Pascal's group are using Fluidigm single-cell methods to investigate mucociliary differentiation. Today he describes the modified SMART-seq method they developed, incorporating on-chip barcoding and UMIs. This is somewhat similar to STRT-seq, published in 2011, but now on the Fluidigm IFC. Pascal spent some time describing the impact of UMIs (Unique Molecular Identifiers), showed the figure from Cellular Research's PNAS paper, and mentioned one of the four methods to reduce RNA-ligation biases. After processing, cDNA is fragmented and 5' fragments are isolated via the biotin tag before completion of library prep and sequencing. He showed data on the performance and reproducibility of the assay: reads are very biased to the 5' end of transcripts (but have not been compared directly to CAGE data), they saw about 25% efficiency for ERCC spike-ins, and the data suggest that more than 1 million reads per cell is unnecessary. Interestingly they saw a correlation of 0.9 for C1+Ion Proton versus Drop-seq+Illumina, but with a reasonable number of genes that appear to be present in only one method! The script will appear on Fluidigm's Open App site after publication!
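As an aside, for readers new to UMIs, here is a toy sketch of what they buy you: PCR duplicates of the same original molecule share the same (cell, gene, UMI) combination, so molecule counts come from distinct UMIs rather than raw reads. This is a simplification (real pipelines also collapse UMIs within a small edit distance), and the data below are made up.

```python
from collections import defaultdict

# Toy reads: (cell barcode, gene, UMI). Entirely made-up data.
reads = [
    ("cell01", "ACTB", "AAGTC"),   # original molecule 1
    ("cell01", "ACTB", "AAGTC"),   # PCR duplicate of molecule 1
    ("cell01", "ACTB", "GGTCA"),   # molecule 2
    ("cell02", "ACTB", "AAGTC"),   # different cell, counted separately
]

umis = defaultdict(set)
read_counts = defaultdict(int)
for cell, gene, umi in reads:
    umis[(cell, gene)].add(umi)     # distinct molecules per cell/gene
    read_counts[(cell, gene)] += 1  # raw read count for comparison

for key in sorted(umis):
    cell, gene = key
    print(f"{cell} {gene}: {read_counts[key]} reads -> {len(umis[key])} molecules")
```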

Pascal briefly mentioned their work on the 800-cell IFC; they're pretty happy so far. But they would like to be sequencing on the NextSeq, which needs lots of PhiX to be added due to the need to read through the oligo-dT sequence. He suggested starting sequencing from the 5' end instead.


Single-cell analysis of clonal dynamics and tumour evolution in childhood ALL. Virginia Turati, Enver lab UCL, UK.


ALL is the most common childhood cancer, with 1 in 2000 affected and around 500 cases per year in the UK. ALL was also one of the first diseases where branching evolution was described. They are using Fluidigm C1 single-cell analysis in a "mouse clinic" built from primary patient tumour material, where treatments can be monitored over time. Analysis of PDXs during chemotherapy shows no impact on intratumour heterogeneity, i.e. PDXs recapitulate the patient tumour. Single-cell WGS was much more difficult than RNA-seq! But an average of 37 CNVs were found in each cell. They are generating around 10 million reads per cell to give a coverage of around 0.2x, and saw multiple variants around the CDKN2A locus.

Virginia presented some data that shows how small numbers of cells (Freddy) overlap transcriptomes with resistant cells, suggesting that these are evolving towards resistance. Understanding this process is key to improving outcomes for patients. They are aiming to identify a signature of resistant cells to use in the clinic.

See more with the C1™: explore the breadth of applications available on the C1 platform for single-cell genomics. Shaun Cordes, Senior Product Manager, Fluidigm.


Shaun gave an overview of the different methods users can run on the C1 system. He also confirmed that the higher-throughput 10,000-cell capability is coming soon, as is a Fluidigm automated imaging system which includes a cloud-based software toolkit. New applications coming include single-cell protein analysis with two antibodies carrying probes that allow qPCR analysis (read more about the Proximity Ligation Assay approach in the Science 2015 paper).


Session II: Immunotherapy in oncology—new insights at single-cell resolution


Mass Cytometry applications from Fluidigm. Gary Impey, Director, Product Management - Mass Cytometry, and Robert Ellis, Director, Product Management, Fluidigm.


About half the audience are either using mass-cytometry already, or are considering using it. A search on PubMed for "mass-cytometry" or "CyTOF" returns 196 papers - a pretty high number given how new this method is. Gary is talking about how Fluidigm's Helios system can be used to interrogate cells for immunogenic markers. He referenced a Wall Street Journal article on immunotherapy and cancer's super survivors, in which David Lane (formerly Chief Scientist at CRUK) was quoted as saying "It's the most exciting thing I've ever seen". To get real insights we need highly dimensional single-cell methods - Fluidigm's Helios CyTOF is one tool that can help.

Fluidigm currently have 50 high-purity metal isotope tags, which allow generation of data with minimal biological or technical noise. Metals are tagged to antibodies and these are used to label cell surface or intra-cellular markers.


Robert is presenting an overview of a new method called imaging mass-cytometry (see the figure at the top of this post - it may be the most exciting thing to happen in 'omics in a while). This allows spatial resolution of proteomic data from tissues in situ. The system requires a new box to be bolted onto the Helios instrument to perform imaging: a UV laser vaporises tissue by scanning across the section one line at a time (approximately 1 µm per pixel), and the ionised tissue goes into the mass cytometer for semi-quantitative analysis. It works with fixed or frozen tissue on standard microscope slides. The process takes approximately 1 hour to image a 0.5 mm square region - highly detailed but highly focused (spatially). Robert presented software developed in the Bodenmiller group in Zurich. You can do LCM-style selection and pick defined regions.

Robert showed some wonderful images of imaging mass-cytometry compared to IHC or FISH, along with some slides from David Hedley's group in Toronto. You can label your own antibodies using a kit from Fluidigm, but Robert showed a slide of their Immuno-Onc panel with a broad concentration range for the different antibodies - just how much empirical work is required to get the balance right is unclear!

Imaging Mass Cytometry—about proteins, tissues and biomedical research. Valerie Dubost and Markus Stoeckli (also on the SAB of Imabiotech a CRO for mass-cytometry imaging), Novartis, Switzerland.


Valerie is talking about her early-access results from the imaging mass-cytometry method presented by Robert. Valerie is a histologist, so her perspective is an interesting one, and potentially gives insights into how likely this technology is to make it into the clinic. Novartis have moved quickly to build a cross-functional team to focus on mass-cytometry imaging technology application and development. They are using fresh-frozen and FFPE tissue, incubating with a panel of up to 30 antibodies, then loading slides into the imaging mass cytometer for laser ablation and analysis.

Data presented included validation of the antibodies - this is critical, and too many scientific papers are messed up by the use of poorly characterised antibodies. The comparison of IHC to IMC looked excellent. She showed beautiful images of cell segmentation by Voronoi boundaries. The need to carefully consider cellular architecture is important in interpreting results from IMC - you are still going to need a pathologist to help interpret this kind of data. From pathology, to molecular pathology, to IMC pathology: this is going to increase our understanding of tissue architecture, and possibly interactions.


Session III: Single-cell functional biology 



An introduction to single-cell functional biology. Simon Margerison, Senior Manager, Application Support, Fluidigm 


Simon gave an overview of how the Helios and Polaris systems can be used to investigate functional single-cell biology. We heard lots about the Helios yesterday, and Simon showed some cancer data using panels where 10 markers were used for phenotyping and 30+ markers were used to investigate functional biology.

However Simon spent a little more time describing the Polaris system, which was not really mentioned yesterday. This system allows the selection of 48 single cells, which can then be cultured for up to 24 hours while the environment is modulated - this is automated cell culture, and I'm hoping Polaris is the first of many such systems that will allow highly parametric experiments to be performed, where instead of a simple A vs B, treated and untreated experiment, we'll do A, B, C, D, E, F & G, treated at different doses and times, all without being messed up in the tissue culture lab.

A holistic view of the mucosal immune system: identification of tissue- and disease-specific cellular networks. Frits Koning, Leiden University Medical Center, Netherlands 


Frits is presenting work published recently in Immunity. His lab has built a mass-cytometry panel to look at heterogeneity of the adaptive and innate immune compartments, applied to Human intestinal samples (coeliac disease). He presented data from an initial cohort of 44 patients: 8 months to generate the data, 6 months to analyse it - a common bioinformatics challenge! He showed a merged scatterplot of all 2.5 million cells from all 44 patients; the different cell types clearly separate into the canonical immune cell populations. However the different samples (PBMC vs colon) and individual patients show very different enrichments for cell populations.

They were able to distinguish distinct mass-cytometry signatures that divide patients from controls, and were able to detect patients with mucosal lymphoid malignancies. His group has been working hard on developing computational methods to analyse these huge datasets quickly - all 5.2 million cells in 1 hour on a 32 GB laptop! See the Cytosplore website for more details. Frits was very bullish about the use of mass-cytometry in the clinic and finished by saying "we are moving towards an unbiased diagnostic tool".

The nature and nurture of cell heterogeneity: single-cell functional analysis, temporal single-cell sequencing and imaging of gene-edited macrophages. Esther Mellado, Wellcome Trust Centre for Human Genetics, UK.


Esther's work is the focus of a spotlight article on Fluidigm's website. She is running the Polaris system at the WTCHG and presented her work isolating single cells and perturbing them to understand the role of macrophages in HIV pathology - in particular cells with mutations in the SAMHD1 gene and the effect of these mutations on HIV latency. They used multiple microenvironmental conditions in early and late activation, adjusting dosing for either 1 or 8 hours and comparing mutant and wild-type macrophages across 10 replicates. They performed high-resolution imaging off the Polaris to investigate morphology and behaviour, and saw that knockout of SAMHD1 has important paracrine signalling effects.

The WTCHG team call the Polaris their "10 postdocs in a box". It allows much more complex experiments to be performed than an individual in the lab can realistically manage. As I said above, I'm hoping Polaris is the first of many automated cell culture systems - and ideally we'd see instruments that can handle bulk cells too.

Understanding cellular heterogeneity. Sarah Teichmann, Wellcome Trust Sanger Institute and EMBL-EBI, UK


Sarah is presenting her group's work on cellular heterogeneity; it turns out that much of this heterogeneity is of functional significance. She stumbled upon this when bulk RNA-seq could not relate the abundance of transcripts to counts from single-molecule RNA-FISH. Bulk RNA is limiting, single-cell rocks!

She presented data from a new publication just deposited on the BioRxiv: Temporal mixture modelling of single-cell RNA-seq data resolves a CD4+ T cell fate bifurcation. They used temporal modelling of single-cell RNA-seq to analyse development of Th1 and Tfh cell populations in mice infected with Plasmodium, and show that a single cell gives rise to both cell types. I'd really suggest reading the paper.

The future of Illumina according to @chrissyfarr

In yesterday's Fast Company piece, Christina Farr (on Twitter) gives a very nice write-up of Illumina's history and where they are going with respect to bringing DNA sequencing into the clinic. I really liked the piece and wanted to share my thoughts after reading it with Core-Genomics readers.


To showcase how Illumina is impacting medicine Christina mentions two recent Illumina spin-outs: Helix (an Apple-esque app store for genome applications) and Grail (aiming to develop early cancer detection tests from deep sequencing of ctDNA). She also highlights some wonderful examples of where Illumina themselves have applied sequencing to clinical cases: the Jaynome (Flatley's own genome) and the discovery that he carries the variant for malignant hyperthermia; and the more compelling rare disease cases such as Massimo, a boy with a genetic mutation causing HBSL (hypomyelination with brain stem and spinal cord involvement and leg spasticity), a new disease found only through the use of Illumina's technology.

Next-generation sequencing is changing medicine and the reality is when we say NGS most of us mean Illumina sequencing - for now at least.

New business models are emerging in genomics: Illumina's Helix is subsidising exome sequencing costs in the hope that users will pay to query the data over time, and that this use will more than cover sequencing costs. In an era of very low borrowing costs, buying in now to sequence 100 million genomes might only require users to sign up to a $10 a month plan for the rest of their lives, with queries costing a few dollars. In the case of Flatley's own malignant hyperthermia, which can result in sudden death while under general anaesthesia, a user might query this before deciding on surgery, for instance. Or a family might check for an MT-RNR1 m.1555A>G mutation before their child is treated with gentamicin, saving the 1 in 500 kids with this particular variant from going deaf while in the ICU.

$10 per month is pretty low compared to life-insurance policies, and if Illumina or others can do a deal with the "Man from the Pru", personalised genomes outside of the clinic really could become the norm. $10 per month over 10 years is $1,200 versus a $1,000 genome, but over 40 or even 80 years it should be attractive, and this does not consider the reselling of consumer genomics data, which 23andMe are showing is possible.

The negative impact of Illumina's lack of competition: Christina comes back several times in her article to an issue Illumina are facing more and more: the fine line Illumina are walking to bring new products to clinical and even consumer markets without competing with their academic and clinical customers. The liquid biopsy market is predicted to be worth $1 billion by 2020 (personally I reckon a figure much higher than this), and NIPT possibly $2.4 billion by 2022. The size of these markets is a temptation for the company that is delivering most of the infrastructure being used to service them today.

John Stuelpnagel (Illumina's cofounder) and Jonathan Groberg (biotech analyst at UBS) both express some reservations about where Illumina are going in the comments Christina quotes in her article. John immediately jumps into one of the worries I hear about at conferences and meetings, especially when talking to the commercial sector: he says "people [companies] are apprehensive about Illumina and worried about if, and when, they might choose to compete against them". When asked about the fine line that Illumina should walk to stay on the right side of their customers, Jon Groberg says "As Illumina moves into the clinical markets, it's making for some tough conversations", and Christina acknowledges that some of the people she spoke to were reluctant to talk openly. This comes out later in the article when Christina interviews Christian Henry (Illumina EVP & COO) about the purchase of Verinata and the signal it sends to Illumina's users, who could be viewed as competitors. Whilst Henry is clear that competition with customers is "a foundational question for Illumina" (i.e. Illumina does not want to compete directly), Groberg adds that Illumina might be unable NOT to compete. And a description of Illumina as an "800-pound gorilla in genomics" by 23andMe's director of research Joyce Tung is not completely flattering.

In the article Christina highlights Illumina's early days facing litigation from the likes of Affymetrix, when it was the underdog, through to its own litigation against ONT, where it has been described as a bully trying to stifle competition. Illumina's dominance in the NGS market is so large that questions are being asked about whether it is unfairly abusing its monopoly position. As a long-term user, and having previously been described as "an Illumina fan-boy", I see Illumina's dominance as down to the simple fact that they bought the best technology (an element of luck), but they also put a team together that made it work really, really well (they made their own luck by investing and working hard). It is Illumina's investment in R&D that has given us the family of instruments from the MiniSeq to the HiSeq X. I'd love to see stronger competition, but it's not there yet, and some big guns have tried and failed (454, LifeTech and CGI). I hope Illumina don't become another ABI, bullying other companies trying to get into the space, as well as users - 10 years ago ABI was not a nice company to work with and users were pretty happy to drop them and move over to Illumina. I'm sure Illumina are working on not making the same mistake. But in her article Christina mentions that some of the people she spoke to were afraid to talk openly about this aspect of their relationships with Illumina.

NGS is here to stay and it is going to become more and more common to hear about it in the news and even down the pub. Jay Flatley, Shankar Balasubramanian, David Klenerman et al, Solexa and Illumina will be remembered for developing a technology that changed the world (has anyone written a screenplay?). Illumina may not be an Apple yet, but it can't be far away. However, predicting the future of NGS has proven to be tough; nearly everyone has under-estimated what might be possible and when. New technologies like Oxford Nanopore's sequencers are looking like they may be ready for the clinic in as little as two or three years.

I am certain that after almost ten years working with NGS the next ten are likely to be almost as exciting.

Index mis-assignment to Illumina's PhiX control

Multiplexing is the default option for most of the work being carried out in my lab, and it is one of the reasons Illumina has been so successful. Rather than the one-sample-per-lane we used to run when a GA1 generated only a few million reads per lane, we can now run a 24 sample RNA-seq experiment in one HiSeq 4000 lane and expect to get back 10-20M reads per sample. For almost anything other than genomes multiplexed sequencing is the norm.

But index sequencing can go wrong, and this can and does happen even before anything gets on the sequencer. We noticed that PhiX has been turning up in demultiplexed sample Fastq files. PhiX does not carry a sample index, so something is going wrong! What's happening? Is this a problem for indexing and multiplexing in general on NGS platforms? These are the questions I have recently been digging into after our move from HiSeq 2500 to HiSeq 4000. In this post I'll describe what we've seen with mis-assignment of sample indexes to PhiX. And I'll review some of the literature that clearly pointed out the issue - in particular I'll refer to Jeff Hussmann's PhD thesis from 2015.

The problem of index mis-assignment to PhiX can be safely ignored, or easily fixed (so you could stop reading now). But understanding it has made me realise that index mis-assignment between samples is an issue we do not know enough about - and that the tools we're using may not be quite up to the job (but I'll not cover this in depth in this post).



Issues with index mis-assignment and quality were initially noticed when we detected Illumina's PhiX control in demultiplexed Fastq data. PhiX is supplied by Illumina as a non-indexed library and as such should never appear in demultiplexed Fastq files. In our default analysis pipeline it should only appear in the "lost-reads" file and should be around 1% in data from lanes 1-7, and 5% in data from lane 8 of an Illumina flowcell (the actual percentage of PhiX can vary for several reasons, so we're not surprised to see higher or lower percentages than expected). We are still running PhiX in almost every lane of sequencing as an easy control to monitor run quality. But if PhiX is getting a barcode what's going wrong?

The main concern is that if the barcode read is failing in some manner and attributing barcodes incorrectly, this will lead to erroneous results. There are two major problems that index mis-assignment causes:

  1. reads are lost because a spurious barcode was assigned; this data would usually be discarded, should be minimal, and can potentially be ignored.
  2. barcodes are mis-assigned to the wrong sample; this is a much more serious issue, and understanding what causes it, and the likelihood of it happening, will be critical in reducing the technical factors that limit low-frequency variant calling. 
With PhiX on every lane we should be able to monitor index mis-assignment in every run. PhiX may also allow us to estimate the rate of mis-assignment between samples, which will be vital if users need to allow for this in their analysis, particularly in low-frequency variant calling.
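For anyone wanting to do something similar, here is a minimal sketch (not our production pipeline) of how you could estimate PhiX bleed-through per demultiplexed sample: build a k-mer set from the PhiX reference and count reads containing a matching k-mer. The file names ("phix.fa", "sample_R1.fastq.gz") are placeholders.

```python
import gzip

K = 31

def revcomp(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def load_fasta(path):
    with open(path) as fh:
        return "".join(line.strip().upper() for line in fh if not line.startswith(">"))

def kmer_set(seq, k=K):
    # k-mers from both strands of the reference
    kmers = set()
    for strand in (seq, revcomp(seq)):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers

def phix_fraction(fastq_gz, phix_kmers, k=K):
    total = hits = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 != 1:                      # sequence lines only
                continue
            total += 1
            read = line.strip().upper()
            # a sparse scan across the read is enough to flag PhiX-derived reads
            if any(read[j:j + k] in phix_kmers
                   for j in range(0, max(len(read) - k + 1, 1), k)):
                hits += 1
    return hits, total

phix_kmers = kmer_set(load_fasta("phix.fa"))
hits, total = phix_fraction("sample_R1.fastq.gz", phix_kmers)
print(f"PhiX-like reads: {hits}/{total} ({100.0 * hits / max(total, 1):.3f}%)")
```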

Previous reports about multiplexing on Illumina sequencers: As was anticipated several years ago, multiplex sequencing has become a common tool in many studies. The level of multiplexing varies but it is almost ubiquitous - one anomaly is the Genomics England sequencing program, where indexed libraries are created but the sequencing contractor, Illumina, runs them non-indexed and single-sample-per-lane.

Several key papers describing this issue are discussed below; probably the most useful are Kircher et al from the Meyer lab at the MPI, Mike Quail and Peter Ellis's SASI-seq paper from the Sanger, and Jeff Hussmann's PhD thesis.

The Kircher paper presents data from three slightly different preps: no-CAP (standard library prep), SP-CAP (single-plex in-solution capture libraries), and MP-CAP (multi-plex in-solution capture libraries). They were able to determine the fraction of mis-tagging events caused by barcode contamination during oligo synthesis, pooling or handling, by mixed clusters, or by PCR recombination. After removing possible contamination as a source of error they reported that both no-CAP and SP-CAP had low levels of index mis-assignment (0.018% and 0.034%) but that the MP-CAP libraries had more than ten times higher mis-assignment (0.390%). The low percentages in the first two libraries were due to mixed clusters that could not be eliminated by quality filtering. The high, almost 0.5%, mis-assignment in the MP-CAP library was due to PCR recombination during multiplex PCR after in-solution capture. Importantly, they calculated that if this recombination occurs primarily in the adapter sequences then half of the chimeric reads, almost 0.25% of all exome reads, would be mis-assigned to a sample if a single index was used, and that dual indexing would therefore be recommended.
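A toy way to see why dual indexing helps so much (my own simplification, not a calculation from Kircher et al): for a read to be silently mis-assigned with unique dual indexes, both index reads would have to independently end up pointing at the same wrong sample, pushing the silent mis-assignment rate down towards the square of the per-index error rate or lower, while single-ended swaps simply become rejected, mismatched index pairs.

```python
# Toy numbers only: p is a per-index mis-assignment rate of the order reported
# for single-indexed multiplex capture libraries; the plex level and the
# independence assumption are mine, not values from the paper.
p = 0.0025                         # ~0.25% single-index mis-assignment
n_samples = 96                     # assumed plex level
p_dual = p * p / (n_samples - 1)   # both ends must agree on the same wrong sample
print(f"single index: {p:.3%} of reads silently mis-assigned")
print(f"unique dual index: ~{p_dual:.7%} "
      f"(the rest show a mismatched index pair and are discarded)")
```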

Their analysis was confirmed by Mitra et al 2015, who went further in showing that the template read on the HiSeq was part of the problem - on the HiSeq 2500 this is kept to 4 cycles to reduce memory requirements, but when Mitra et al increased template read lengths to 20 cycles they saw 2-5 fold lower index mis-assignment. Such a long template read would kill most of our HiSeq instruments, but upgrading the memory is suggested by the authors and could be very economical given the impact of low-quality cluster detection and index mis-assignment.

In Jeff's PhD he used reads from the shortest library molecules, with read-through into the adapters, to determine that the PhiX control uses the older 'PE' primers, which have no sequence complementarity to the standard indexing read primers; as such they cannot generate a signal during the index read. He noticed the same drop in quality scores for PhiX index reads compared to the indexed samples as we had. But he also showed that the PhiX reads that appear to be indexed are physically closer to an indexed cluster than PhiX reads with no index read. This led him to propose the same model of index bleeding as I describe here.

Jeff also carefully investigated PCR-mediated recombination (as did Kircher et al) as an additional source of index mis-assignment. This was first reported back at the start of the 1990s by Meyerhans et al. In any PCR the polymerase can stall or fall off the template, creating short extension products which can then hybridise in place of a primer in the next round of PCR. The issue with Illumina libraries is that such a product can create a chimeric molecule, and thus index mis-assignment due to molecular swapping of indexes. This is likely to be most pronounced in multiplexed amplification after indexed library prep, i.e. most exome and amplicon strategies. He also stated that his analysis "constituted overwhelming evidence that PCR-mediated recombination happens during cluster generation". His analysis was all on the HiSeq 2500 "Manteia" clustering chemistry; this is likely to perform quite differently from the patterned flowcell "exclusion amplification" chemistry, and we're looking into index mis-assignment on that right now.

In the SASI-seq paper Quail et al highlighted the issue of index mis-assignment and discussed the need for confirmation that contamination is not present before a data set is analysed. They presented a simple and inexpensive method to verify that results are not contaminated: a mix of three uniquely barcoded amplicons of different sizes, spanning the range of insert sizes one would normally use for Illumina sequencing, added to samples at a spike-in level of approximately 0.1%. They also designed a set of 384 11 bp Illumina index sequences with high Hamming distance (at least 5 bp apart), giving higher levels of error correction and very low levels of barcode mis-assignment due to sequencing errors.
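As an illustration of the index-design idea (this is not the SASI-seq authors' actual algorithm, and I've shrunk the barcode length to 8 bp so the toy example runs quickly): greedily keep candidates that sit at least a minimum Hamming distance from every barcode already accepted.

```python
from itertools import product

def hamming(a, b):
    # number of mismatching positions between two equal-length barcodes
    return sum(x != y for x, y in zip(a, b))

def pick_indexes(length=8, min_dist=5, max_set=96):
    chosen = []
    for candidate in ("".join(b) for b in product("ACGT", repeat=length)):
        if all(hamming(candidate, kept) >= min_dist for kept in chosen):
            chosen.append(candidate)
            if len(chosen) >= max_set:
                break
    return chosen

indexes = pick_indexes()
print(f"{len(indexes)} indexes at Hamming distance >= 5, e.g. {indexes[:3]}")
```

A real design would also filter on GC content, homopolymers and colour balance across index cycles, which is presumably part of why the published set took more care than a ten-line greedy search.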

Our PhiX mis-assignment analysis results: We took historical data to verify whether PhiX mis-assignment was happening across all flowcells, and could clearly see this was the case. (A) simply shows the percentage of PhiX we added to each lane. In (B) you can see that the majority of lanes show a reasonably low level of index mis-assignment to PhiX, at just 0.01-1% in single-indexed samples (green), and 0.01-0.0001% in dual-indexed samples (red). Dual indexing appears to help significantly. We also saw that the level of PhiX contamination was worse on the 2500 than the 4000, and increased as the amount of PhiX used increased. In fact the rate of PhiX index mis-assignment was more strongly correlated with the amount of PhiX in the lane for single-indexed samples than for dual-indexed samples (C). We see PhiX appearing at as much as 1% of the sample in the very worst cases - however this is generally in single-indexed multiplexed sequencing with very high levels of PhiX, e.g. low-diversity spiking.



Indexed versus non-indexed PhiX analysis: Whilst the Illumina PhiX control is not indexed, it is possible to purchase an indexed version from SEQMATIC. When we compared indexed versus non-indexed PhiX the results were clear - non-indexed PhiX shows around 0.02% bleed-through, while the SEQMATIC indexed version is around 0.005%; a four-fold reduction in bleed-through.


Indexed versus non-indexed PhiX comparison

Index-read base-quality scores are worthless: We saw that mis-assigned PhiX reads (PhiX FQ below) generally had lower sequence read quality scores than the correctly assigned samples (D). The mis-assigned PhiX index reads also had generally lower quality scores than the correctly assigned samples (E & F), and it would be great to filter on base quality scores to remove mis-assigned reads. Unfortunately the quality score you get from an Illumina index read is pretty much useless, primarily due to its short length. Actually getting the index quality scores requires quite a bit of messing around with the default bcl2fastq pipeline.

These index Q-scores are currently discarded. Just to get the data for the plots below we had to rerun the flowcell through a modified bcl2fastq pipeline. Keeping index Q-scores would require changes to our default pipelines and an increase in our compute and storage requirements. However we may be able to develop methods similar to Q-score binning to reduce this extra data and still allow an assessment of index quality.

Going further than this, Illumina sequencing might benefit from running a longer template read at the beginning of all reads, e.g. read 1, i5, i7 and read 2. What the computational burden might be, and exactly what impact on index mis-assignment this would have, is difficult to predict. But even small reductions in errors like this would be worthwhile for low allele frequency applications. I'd expect that companies aiming for tumour screening in the general population (e.g. Grail) would benefit the most from doing these experiments.

PhiX mis-assignment analysis conclusions: Based on our analysis, and the results presented in Jeff's PhD, we've come to the conclusion that PhiX index mis-assignment is caused by two issues: index bleeding and/or poly-clonal clusters. And that it can be fixed or safely ignored.


In the figure above (1A) I've tried to illustrate "index bleeding" - each library template cluster emits a signal according to its base fluorophore, represented by the capitalised circles as GAT (green=G/T, red=A/C), however this fluorescent signal "bleeds" outward from each cluster. A non-indexed PhiX cluster, represented by the lower-case circles, does not emit signal and is base-called from the erroneous "index bleeding" library cluster signal as gat. An indexed PhiX cluster emits a signal according to its base fluorophore and is correctly base-called as CTA. In figure 1B I've tried to illustrate what may be happening on mixed-template poly-clonal clusters. These are caused by the random nature of clustering, where some clusters are made from two template molecules that may have seeded at different times. A cluster produced from a single library molecule (α) is correctly base-called as GAT. A mixed-template non-indexed PhiX cluster (β) is base-called, in the indexing read only, on the low signal from the erroneous library cluster, due to the lack of a PhiX index signal, as gat. A mixed-template indexed PhiX cluster (γ) emits a signal according to its base fluorophore that is higher than the signal from the erroneous library cluster and is correctly base-called as CTA.

Index-bleeding should only be an issue for non-patterned flowcells, whilst poly-clonal clusters will be a problem on both patterned and non-patterned flowcells i.e. HiSeq 4000 and 2500.

How to fix the problem: for index mis-assignment to PhiX the fix is relatively straight-forward. Either use an indexed PhiX, or spike in an oligo to the indexing read primers such that PhiX generates a signal. Both strategies will mean the PhiX clusters generate a signal that outcompetes the index-bleeding, or poly-clonal cluster signals. PhiX will no longer appear in your demultiplexed fastq, or will be at such low levels you'd only see it if you specifically went looking.

Unfortunately index mis-assignment between samples is still an unresolved issue. In a follow up post I'm going to discuss what we've seen, and what the apparent causes are. Again some relatively simple fixes are available - but if you are using multiplexed sequencing to detect low-frequency alleles in populations; e.g. cancer, single-cells, population genomics, then you need to consider whether you understand how your experiments might be affected.

PS: I think it is pretty lax of Illumina not to provide an indexed PhiX. The V2 PhiX was indexed but V3 dropped this, probably because there were only 96 TruSeq indexes. Come on Illumina, sort this one out!

Useful references: