In many instances, long reads are more helpful for sequencing and assembly than short reads. That's one of the reasons why Nature Methods chose long-read sequencing as its Method of the Year for 2022.
Nature Methods also published a Focus on this subject with comments, perspectives and my news feature, for which I chatted with scientists at companies and in academia about long-read sequencing. I did some podcasts, too; this one is a conversation with Johns Hopkins University researcher Dr. Steven Salzberg.
You can listen to the podcast here or on streaming services such as Spotify, Apple Podcasts and Google Podcasts.
Dr. Steven Salzberg is a Johns Hopkins University researcher and director of the Center for Computational Biology at Hopkins.
I spoke with him about genomics, long-read sequencing, human biology and human diversity, funding, technology choice, complete and incomplete genomes, and jobs in bioinformatics. He described his technology choices and the choices one has to make in a small lab. He shared his thoughts about the trend toward pangenomes and graph genomes. And he described how technology has changed and how happy that makes him.
Teeny reminder, Steven Salzberg headed bioinformatics at TIGR, the Institute for Genomic Research run by J. Craig Venter. It was part of the venture to determine the sequence of the human genome. And yes, there were human genome assemblies based on teeny tiny read lengths.
Transcript of the podcast with Steven Salzberg
Note: These podcasts are produced to be heard. If you can, please tune in. Transcripts are generated using speech recognition software and there’s a human editor. But a transcript may contain errors. Please check the corresponding audio before quoting.
There are, what, three or four biggest killers of human beings. Cancer is one of those, you know, heart disease probably the other big one, in the first world anyway; it's infections in the third world.
And cancer is a genetic disease, we all know that now. It's caused by some mutation that happens in a cell. And now that cell starts to grow and won't stop. That's basically it. And there's many, many different mutations that can make a cell go haywire; we've already seen a lot of them. So what can we do? Is there something we can do with sequencing that can help us understand that better, treat that better? You know, so we certainly want to catalogue all the mutations that are, you know, making it go haywire. So sequencing tumors is a way to do that.
That's Dr. Steven Salzberg, Johns Hopkins University researcher and director of the Center for Computational Biology at Hopkins.
Hi and welcome to conversations with scientists, I'm Vivien Marx.
You will hear more about and from Dr. Salzberg in this episode that is about genomics, about long-read sequencing, about human biology and human diversity, about funding, technology choice, about complete and incomplete genomes, about jobs in bioinformatics.
When something is incomplete, it just keeps tugging at you with a question: when will you finish this? It's true for many kinds of tasks and puzzles, including ones in science, and in genomics in particular. Sequencing and assembly are a way to see the information a genome holds, and this becomes shareable information, too. And you can sequence many genomes and compare them. For example, as Steven Salzberg just mentioned, you can sequence the genomes in tumors to better understand why they act so treacherously.
It's hard to sequence and assemble genomes to completion. Long-read sequencing helps with this, and it is the Nature Methods Method of the Year for 2022.
The fact that long-read sequencing exists doesn't mean that all genomes are now sequenced to completion and sometimes they don't have to be complete. But for some questions it's important to have genomes to analyze that do not have gaps.
I should add, researchers need to both sequence and assemble. C. Titus Brown, now at the University of California, Davis, was at Michigan State University in East Lansing when he explained assembly this way to me for an article in Nature. He said: Imagine that 1,000 copies of Charles Dickens' novel A Tale of Two Cities have been shredded in a woodchipper. Your job is to put them back together into a single book. Side note here: I was afraid my editors would not like this at all. Many of them are British, and the idea of sending Dickens novels into a woodchipper might seem awful to them. So I buried this deep in my story. But the editors actually liked it and asked me to put it right at the beginning. That was a nice surprise. The link to the story is in the transcript of this podcast.
So that's the assembly challenge, explained with Dickens and woodchippers. Sequence and then assemble so that the sequence represents the real thing as closely as possible. The human reference genome which labs around the world use for analysis in basic research or in medicine is called GRCh38, which stands for Genome Reference Consortium Human Build 38. GRCh38 came together over 20 years. The challenge has been that the assembled sequence has many imperfections, for example gaps.
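The woodchipper analogy maps directly onto what assembly software does: find reads whose ends overlap and merge them. As a toy illustration, here is a minimal greedy overlap-merge sketch in Python. It is nothing like a production assembler (no sequencing errors, no reverse strand, no serious repeat handling), and the names and the toy "genome" are made up for this example.

```python
# Toy illustration of assembly: shred a "genome" into overlapping
# error-free reads, then repeatedly merge the pair of reads with the
# longest suffix-prefix overlap. Illustrative only.

def shred(genome, read_len, step):
    """Cut the genome into overlapping reads (one strand, no errors)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=3):
    reads = list(dict.fromkeys(reads))   # drop duplicate reads
    while len(reads) > 1:
        best = (0, None, None)
        for a in reads:
            for b in reads:
                if a is not b:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, a, b)
        k, a, b = best
        if a is None:        # no overlaps left: return the fragments
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])          # merge the best pair
    return reads

genome = "TOBEORNOTTOBETHATISTHEQUESTION"
contigs = greedy_assemble(shred(genome, read_len=10, step=4))
assert contigs == [genome]   # the toy genome is reconstructed exactly
```

Real assemblers, including the ones discussed in this episode, use overlap graphs or de Bruijn graphs and must cope with errors and repeats, which is where the hard problems, and the need for long reads, come in.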
The T2T Consortium, which stands for Telomere-to-Telomere Consortium, has worked on these gaps. And succeeded. Except that the sequence they completed is from a particular kind of cell line that is essentially homozygous. Typically, the two sets of chromosomes in human cells are not identical the way they are in this cell line. But that made assembly, and attacking gaps, easier.
As part of my reporting on long-read sequencing I thought I would ask Steven Salzberg about his views and get his perspective on sequencing and assembly. He runs a computational biology lab that is part of the larger Center for Computational Biology at Johns Hopkins. He directs that center. In his lab he and his team work on assembly and ways to analyze DNA computationally.
He has a blog and a column on Forbes.com. Earlier in his career he headed bioinformatics at TIGR, the Institute for Genomic Research run by J. Craig Venter. It was part of the venture to determine the sequence of the human genome. There were disagreements about how the project was run, and Craig Venter set out with his scientists, including Steven Salzberg, to sequence the human genome.
This ended up as a public-private race to sequence and assemble the human genome. The publicly funded International Human Genome Sequencing Consortium published its assembled human genome sequence in Nature.
https://www.nature.com/articles/35057062 At the same time, Craig Venter's company Celera Genomics published its version in Science. That was in 2001.
Updates have continued since then. And gaps keep getting closed. But one issue with the existing human genome reference is that it's not diverse; it's not a representation of the diversity of humans on the planet. That is something the Human Pangenome Reference Consortium is taking on. Steven Salzberg is part of this consortium and a co-author on a more recent Nature paper from this consortium about assembly and reference genomes. https://www.nature.com/articles/s41586-022-05325-5
There's a paper led by Erich Jarvis that I'm on. It's on semi-automated assembly of high-quality diploid human reference genomes.
In sequencing, technology has changed plenty over the years and the changes are making new kinds of projects possible. First there was Sanger sequencing, which was accurate but time-consuming. Then came high-throughput sequencing with short reads on Solexa machines; Solexa was bought by Illumina, a sequencing company known mainly for its short-read technology. And then came long-read sequencing, which a number of instruments can do. One of the companies with such an instrument is Oxford Nanopore Technologies, often nicknamed Nanopore, and another is Pacific Biosciences, often nicknamed PacBio. There are other companies with long-read instruments, and even Illumina is offering a kind of long-read sequencing. Steven Salzberg is happy about the evolution toward long-read sequencing.
Steven Salzberg (6:30)
I've been excited about it for a number of years now. You know, we went from having 800 base pair reads to having 25 base pair reads in 2007. And initially nobody even thought of using that for assembly. And then it was so much cheaper that, you know, once the reads got to be a little over 50 bases, people started trying to assemble genomes from them. And there were assemblies published with, you know, 55 base pair reads, 75 base pair reads. They were terrible assemblies, but they were much, much cheaper than Sanger sequencing.
These were the early machines, was this Solexa?
Yes, Solexa. The first Solexa machine was 25 bases; nobody was using that for assembly. They were up at 35 bases within about a year, and then it got to 50 another year later. The first human genome sequenced and assembled from Illumina used 54 base pair reads, I think is what their length was. And it was very, very fragmented, of course, in God knows how many, you know, hundreds of thousands of pieces. It wasn't clear that that technology was ever going to get very long reads. And here we are, you know, over 10 years later, it still doesn't have very long reads, because there's sort of an inherent limitation in the technology and the way it works. You know, because you're sequencing this little tiny fragment that you amplify in place, and you don't have that many molecules there. So that's the limitation there. And it seems like they can go to a few hundred base pairs at most.
And then PacBio came out and then Nanopore came out, and they don't really have this length limit, especially Nanopore. You know, they have reads of more than a million bases now that can be sequenced, with lower accuracy. And starting, I don't remember what year, but a good six or eight years ago, using the earlier Nanopore reads, or maybe it was PacBio, I don't remember, Adam Phillippy and his group started doing complete assemblies of bacteria with no gaps, without having to do any sort of gap filling in a lab, just totally computationally. That was a big step forward, I thought, because we had never been able to do even that. So around, I don't know, let's say 2015, 2016, we were finally able to do a bacterial genome and get it completely assembled just from the raw shotgun data. But we were still a long way from eukaryotes, because of these large, repetitive regions, mostly centromeres.
Centromeres are regions in the genome. Karen Miga at the University of California Santa Cruz has long studied these regions and she co-leads the T2T Consortium. When you sequence with short reads, it's hard to fit puzzle pieces together across a stretch of repeated DNA like the ones you find at the centromeres. Which is why long reads are helpful.
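To see why read length matters so much at repeats, consider a toy example: two different genomes that contain the same repeated element in different arrangements. If every read is shorter than the repeat, the two genomes shred into exactly the same collection of reads, so no assembler, however clever, can tell them apart; only reads that span the repeat resolve the ambiguity. This Python sketch, with an 8-base run of X standing in for a repetitive element and CCC/GGG standing in for unique sequence, is illustrative only; real reads also carry errors.

```python
# Two toy genomes that differ only in where their unique segments sit
# between copies of the same repeat. Short reads cannot distinguish
# them; reads long enough to span the repeat can.

def all_reads(genome, read_len):
    """Every read of the given length, as a multiset (sorted list)."""
    return sorted(genome[i:i + read_len]
                  for i in range(len(genome) - read_len + 1))

R = "X" * 8                               # the repeated element
genome_a = R + "CCC" + R + "GGG" + R      # unique segments in one order
genome_b = R + "GGG" + R + "CCC" + R      # ...and in the other order

# Reads shorter than the repeat: identical read sets, so the two
# genomes are indistinguishable from the data alone.
assert all_reads(genome_a, 5) == all_reads(genome_b, 5)

# Reads longer than the repeat anchor both unique flanks at once,
# and the two genomes now yield different reads.
assert all_reads(genome_a, 12) != all_reads(genome_b, 12)
```

This is, in miniature, why centromeres with their megabase-scale repeat arrays resisted assembly until reads became both very long and accurate enough to tell nearly identical repeat copies apart.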
Steven Salzberg (9:30)
The key parameter that you care about in doing assembly is read length, and this is important for all the trees that we're doing. We have these massive tree genomes that are sometimes upwards of 30 gigabases, you know, 10 times the size of human.
And some of them are like 80% repeats, they're just filled with repeats. It's not like they have more genes, they don't; they just have a lot more repetitive DNA in them. But the repeat elements aren't that big. Most repeat elements are in the 2,000 to 5,000 base pair range, usually towards the shorter end of that. And so as long as your individual reads are longer than that, hopefully, you know, a good bit longer than that, then you can place all of those repeats in the right place.
So once you have 10,000 base pair reads, which we have, you can span almost everything in a genome, with the exception of the centromeres. And the telomeres. So you do have a few special regions, but only a few special regions, that still remain as gaps. And the T2T Consortium, to finish the human genome, they didn't get read lengths that span the centromeres, because the centromeres are millions of base pairs long. But there's just enough variation in the centromeres. The repeats aren't all identical, but they're very, very close to identical. So with really long reads that were accurate enough, that's why the HiFi reads are necessary, you can tease apart the centromeres and assemble them, we hope correctly, and we're done. So we hope it's correct. That's still very challenging to do.
As a computational biologist Steven Salzberg works on many types of genomes, such as tree genomes. He has a project in the works on the whitebark pine with David Neale, and a podcast with David Neale is another one I still have in the making.
Steven Salzberg (11:25)
He is retired from UC Davis but he's now in charge of the Whitebark Pine Ecosystem Foundation, which is trying to raise funding to sequence the whitebark pine, a high-altitude pine that's endangered or nearly endangered. I'm hoping it will be listed as endangered soon; that will help them raise the money. Sorry, I misspoke, we're already assembling whitebark pine. We don't have all the money we need to do it at all the depth we want, but we've already got some of the money and we're already well underway with that. We're also sequencing bristlecone pine, doing all of these on, like, a shoestring budget, though.
Money. Money matters in science and in sequencing and assembly. There are large-scale projects that garner large grants. And individual labs do more targeted projects, like the ones Steven Salzberg just mentioned on tree genomes. So one has to choose technology based on resources. For tree genomes Steven Salzberg and his colleagues combine Illumina sequencing with Oxford Nanopore Technologies sequencing.
Steven Salzberg (12:30)
I went to Illumina for all of them, because these are all very, very large genomes in the 25 to 30 gigabase range, and we don't have that much funding; David's trying to raise the money, mostly from private sources. You know, so he gets small amounts here and there. And we're not these wealthily funded places like the NIH- or NHGRI-funded groups. So anyway, we need long reads to span most of those repeats, because there's so many of them, and they're long. And we get pretty good assemblies from these massive genomes. We're also doing this Hi-C technology at the end, to make them even better, with a company called Dovetail. That's part of our recipe too.
That technology works surprisingly well. Once you have your big pieces, which we call scaffolds, if you want to kind of make them even bigger, kind of link them together, it's a high-throughput way to link them all together. And we frequently get chromosome-size scaffolds out of that. So the whole chromosome, it's got gaps in it, but we'll have everything kind of laid out on something which is a whole chromosome or chromosome arm.
The project takes a combination of technologies and methods to sequence and assemble these genomes. I hear in my interviews that it would be great if labs had one box to go to for everything related to sequencing. Here's Steven Salzberg on that aspect, the one-box idea.
Steven Salzberg (13:30)
Well, I mean, ultimately, you know, maybe in decades to come. But we're not seeing anything around the corner like that. I work a lot on the human genome and human genetics, too, and human genomics, but human work almost should be in its own category.
Because in every way, all the resources are so much vaster. A cancer person, they may not appreciate it, but compare it to somebody who is trying to sequence a tree, where you have to struggle to justify why it's worth doing at all, and then you get, like, a little grant. With cancer, we're basically looking at sequencing the whole genome from an individual tissue sample from an individual patient. And there's funding to do that. There's resources to do that, because we're humans. So we want to put our money into that kind of work.
What we actually want to get out of sequencing has changed dramatically over the past 20 years. It's changed over the past ten years. So if you have a cancer sample, first of all, do you want to sequence the DNA or the RNA? A lot of people do RNA sequencing. And sometimes you do both; if you have enough money you do both, I mean. With DNA, you're looking for mutations in the DNA itself. And with RNA, you're looking for changes in gene expression, in each case changes that are somehow abnormal, and you want to compare them to some sort of normal signature.
But you know, some years ago, we would just be looking at SNPs and running SNP chips, and people still do that; there are these SNP chips with millions of SNPs. And that's not really sequencing, but it's cheap. And you're interrogating the genome at several million locations for SNPs that are kind of standardized, so we know what they are. And some of them are markers for certain diseases or other traits. I don't do SNP analysis. But with anything other than humans, you can't really contemplate doing that kind of detailed sequencing.
Much funding these days is going into population-scale sequencing. To get, for example, reference genomes that reflect human diversity, which the current reference genome does not.
Steven Salzberg (15:45)
Yeah, so my view on that is, the NIH is pushing this pangenome idea. I think it's a little bit misguided, honestly. I think what we ought to do, and I'm doing some of this in my own lab, is we ought to be doing individual, more or less complete genomes from many different populations, and building up a library of those. We've done two of them in my lab, one of an Ashkenazi Jewish individual and one of a Puerto Rican individual, both of which we published. And, you know, they are genomes that are basically better quality than the human reference genome is now, and they're annotated. And you could use them right now as a research tool.
And I think there should be hundreds, if not thousands, of those. The pangenome is a different idea. It's like, let's take all the different populations' genomes and combine them into one giant data structure, which is not yet agreed upon; nobody really knows what it's gonna look like. And then we can do all our analysis on that. I don't think that's the best solution. I honestly don't think it's ever gonna happen.
One issue with the concept of a pangenome is how to analyze these data and how to democratize genome graph analysis so that it can be done widely, for example at medical centers large and small as well as at basic research labs.
Steven Salzberg (17:05)
It's technically very challenging. And I know why people are interested in that, and I've worked a little bit on graph genomes, too. But that's not a good solution. It's just not; it raises additional technical problems. It also kind of ignores the fact that we have a vast amount of infrastructure already invested in analyzing things on one genome. We have all kinds of software, we have laboratory-based kits, all based on the human reference genome. Adding other reference genomes from different populations would be pretty straightforward. Adding a thing that doesn't look like that at all, that requires all kinds of new algorithms, new software that doesn't exist. And I don't see people creating or using such stuff. In the research world, maybe. But in the clinical world, no, they're very reluctant.
Many, many clinical sites are still using what's called GRCh37, or hg19, the old version of the human genome, which was updated in 2013 to the current version. And because of inertia, you know, they don't want to switch all their software, so they haven't even updated to the newer genome assembly. Now we're hoping they'll all update to the CHM13 genome, which is so much better, and I hope they will. But I think that's going to take years, just to update to that. And that's a single-genome, linear representation. To ask them to go to something they don't even understand, or where the community hasn't even agreed on what the representation is, it's not going to happen. I think it's not going to happen.
Sequencing and assembly is not done just for one lab or group; these data are for others to use, too. In the clinical world, when a software analysis pipeline has been set up, there is reluctance to change those pipelines. This is the inertia Steven Salzberg just described. Pipelines can break, and that can be time-consuming and expensive to fix. So these clinical sites stick to using a previous version of the assembled human genome. And perhaps in some cases they would first have to be made aware of the fact that there is a finished human genome.
Steven Salzberg (19:05)
Are you aware of the fact that there's now a finished human genome? Would you like to switch to that? What would it take to get you to switch to that? That's not a big change. And yet, it is. There's a lot of things that are all kind of tied into GRCh38, the current reference, despite its flaws, its shortcomings.
With the real-world application of sequencing that is widespread across big labs, small labs, big hospital centers and smaller ones, it's not just about the technology but about software, people, patients. And of course money. The assembled human genome CHM13 from the Telomere-to-Telomere Consortium is not diploid the way a human genome is. But sequencing and assembly have been completed, as the name indicates, from telomere to telomere, from one end of each chromosome to the other, for the entire genome. It's not a true diploid genome, but maybe for most questions that does not matter so very much. Here's Steven Salzberg.
Steven Salzberg (20:00)
There's a lot of focus in the technical world that I live in on: let's assemble things and make a diploid assembly. Some of these papers are diploid assemblies, meaning we assembled two copies of each chromosome. And, you know, just from a purely technical point of view, that's kind of interesting and challenging. And that is a better representation of the biological reality than what we usually produce, which is kind of a, you know, a mosaic of the two chromosomes kind of mashed together. But humans, we're a pretty inbred species, and so our chromosomes are all really, really, really similar. And so it doesn't really matter if you mash the two together or not, except, well, I can't actually give you a good example, because I don't know of one, where having a diploid assembly is gonna let you solve a problem that you couldn't solve already.
There are some gene loci where there is a lot of variation between people. One of them is the HLA locus, the human leukocyte antigen superlocus, also called the major histocompatibility complex. It's a region with genes that matter for immune system function. It's important when considering, for example, whether one person is a good match for an organ transplant to another person.
Steven Salzberg (21:10)
So the HLA locus is incredibly complicated, and we already have a lot of trouble doing genotyping for that. I don't see a diploid assembly as a solution to one locus. I mean, you can work on that locus and do something with that locus. If you're interested in HLA typing, then, okay, maybe you want to separate the haplotypes, just for that.
But that's just one small piece of a very large genome. So you know, maybe there is something there, but you kind of have to dig deep to find justification for putting a lot of effort into diploid assembly.
I think a lot of this comes from the funders and from NHGRI, and people say I shouldn't bite the hand that feeds me, because I have funding from them, but they like to come up with initiatives that they think of. They have advisors, but who knows how much the advisors are involved, and then they say, we're giving out money for this, and that's what you've got to do. And so, you know, they want to have pangenome work done. They want to have diploid assemblies made. So that's getting funded. And you don't have to justify it, really, if the funder says that's what we're giving out money for. If you write an R01 not tied to any particular initiative, then you have to explain, like, why is that valuable? So I think that they are putting too much money into these sort of top-down-driven initiatives that they think up, and they really should just let us propose things and let the peer-review process sort it out. Although, you know, flawed as that is, it's still better.
The large grants from the National Institutes of Health National Human Genome Research Institute, NIH NHGRI set the stage for much of what happens in genomics, at least in the United States.
Steven Salzberg (23:05)
They do have these very big grants out of NHGRI. And those are things where they write the, you know, call for proposals, and they define what it's going to be. They are somehow keen on graph genomes and diploid genomes right now. So a lot of people are saying how great this is, but I'm not one of those. I think some people saying it's great know full well that it's not really that substantial, but it's, you know, the flavor of the month. Diploid assembly, again, even if you had a diploid assembly, we don't have tools to work with those. So I don't think they're going to be of much use for quite a few years.
But I think what's missing from the NHGRI perspective at the moment about these different technical directions is a real sense of: well, why are you doing this? Tell me the direct connection to human health, without waving your hands too much. Now, sequencing the human genome, you can argue, as some people have, that it hasn't really delivered all the things that were promised. But the fact is that there are a lot of things that it has delivered on. We now can track down genetic mutations very quickly that we had no idea about before. We at least know what all the genes are. We know a lot more about genetic causes of cancer and many other disorders.
Some scientists have told me that having the human genome reference helps to study aspects such as changes to the genome that aren't changes to the gene sequence. Those are epigenetic changes such as methylation and they can play a role in health and disease too.
Steven Salzberg (24:40)
Yeah, you can argue, there are people who are looking at methylation and how that's related to various diseases. I think that the idea of having a more accurate assembly or a diploid assembly or a pangenome assembly is being driven by some sort of technical interest. Not by: oh, if only we had that, then we could figure out this problem. It's not that you always have to work that way. That's fine. I've done lots of basic research in my career, and I think we should fund that; I think basic research is great. But when you're talking about putting so much money, so many resources into doing human sequencing, we should be thinking about, okay, well, what is this supposed to do? Is there a particular disease we're trying to cure? Is there a particular class of disorders that we're looking at? What's it going to do, even if that's 10, 20 years out? Why are we doing it? I have not seen a good answer when it comes to graph genomes and diploid genomes. I think they're addressing a much narrower concern. If one person had a grant to do it, I'd be like, okay, fine, whatever. There's lots of grants like that. But there's a lot of other questions out there that are probably more important and probably have a more immediate, near-term effect on human health.
We returned to talking about cancer research. Not every tumor in every patient is being sequenced, and not all sequencing works with low amounts of tissue, which is common in cancer; biopsies, of course, have a certain size. And one big aspect of sequencing is a more social one: sharing. Here's Steven Salzberg.
Steven Salzberg (26:25)
Let's go to cancer again. Because, you know, there are, what, three or four biggest killers of human beings. Cancer is one of those, you know, heart disease probably the other big one, in the first world anyway. It's infections in the third world.
And cancer is a genetic disease, we all know that now. It's caused by some mutation that happens in a cell. And now that cell starts to grow and won't stop. That's basically it. And there's many, many different mutations that can make a cell go haywire; we've already seen a lot of them. So what can we do? Is there something we can do with sequencing that can help us understand that better, treat that better? You know, so we certainly want to catalogue all the mutations that are, you know, making it go haywire. So sequencing tumors is a way to do that.
That's not addressed by, you know, diploid assembly or pangenomes or anything like that. But we are doing lots and lots of cancer sequencing. We're also doing RNA sequencing, which is another thing, because if you can't pinpoint the mutation, presumably the mutation is also making some gene go haywire. It may not be a mutation in that gene, it may be something that controls the gene. But if the gene suddenly is upregulated by a lot or downregulated by a lot, and it's not supposed to be, then okay, that's also important to know. And that might be a target for a drug. If that gene is going haywire, maybe that's a target.
So we are doing things like that. And the more sequencing we can do, the faster we can do it, the better. We don't sequence everybody's tumor; we're far from that. So it'd be nice if we got to the point where that was just a routine assay, that anybody who had cancer would have it sequenced. We're not there yet, so you can think about what we can do to get there, because we at least could get that information. And even though we don't know the genetic causes of all cancers, we're working on that. But also, the technology is getting cheaper and faster, and maybe requiring less DNA. Because with tumors, you know, like your cancer doctor said, you can't just go back and get more tissue. You have a limited amount, you get a biopsy, and that's where you got your DNA from.
But another thing, which is not addressed adequately, I think, but NIH could address this, this is a bit orthogonal, is that we ought to share all this data completely amongst every doctor, every basic scientist in the world. We would already probably have 100 times more data about every type of cancer if all of the sequencing that has already been done were publicly available, or at least shared in some easy way. And it's just not. And that's kind of a sociological, cultural problem. But you know, if I had a tumor, if you had a tumor, and you went to your doctor, and they sequenced it, they're not going to share that with anybody.
And if someone asked for it, they'd say, oh, we're not allowed to because of HIPAA. But if they asked the patient, would you mind? You know, I've known some people with cancer, and their answer is almost always: of course I wouldn't mind, please share the data, this would help cure my cancer and help other people. So all you have to do is ask. That's not part of the culture. We don't do that.
It's all about, oh no, we can't share their data, because they might be identified from it someday, and then someone might, like, lose their health insurance or whatever hypothetical thing, and it's illegal to share it. But you can share it if you ask. So I think, in addition to making it easier to get all this data, we haven't done much to try to share all this data, and NIH is not helping. They're really not helping.
Data sharing is a tough challenge with many layers of complication, and it applies to all kinds of sequencing: short reads, long reads, just parts of a genome, whole genomes or many genomes. People share data or they don't. And they make choices about the technology they use for their work.
I asked Steven Salzberg a bit about his technology choices for long-read sequencing. One company is Oxford Nanopore Technologies, also known as ONT.
Steven Salzberg (30:20)
I like the ONT mission of sequencing becoming ubiquitous, so that, not just park rangers, but the elementary school kids in the park, you know, they could also have a sequencer. We'll probably be there one day. That may be a long way away. But it is a little complicated. I don't do sequencing in my lab, it's still a little bit big, you have to have a wet lab to do it. So one of my colleagues, Winston Timp does tons and tons of sequencing. So when I need something sequenced, I talk to him.
Yeah, he's trained a couple of my students in how to do the sequencing. So I have some students who can do it, I can't myself but they can, and we just walk over to his lab, which is across the street. And we can do that. So it's much easier than it used to be, but it's not quite at the point where, you know, out in the field, anybody without knowing what they're doing can just take a little sample and somehow turn it into sequence data.
If a lab wants to do long-read sequencing, there is a growing number of tech options. The more established options are from PacBio and Oxford Nanopore. Steven Salzberg has a clear favorite.
Steven Salzberg (31:30)
Well, with few exceptions, there's only one real choice and that's Oxford Nanopore. I mean, it's not a close call. Oxford Nanopore has a device that costs a few hundred dollars. And I have one on my desk that I show to people. You know, they still have this device, which costs, you know, 500 bucks, and any lab can get it, you can take it out in the field. And it generates a fair amount of data. And it's very low cost. And the reads are very long.
When they first appeared, both PacBio and Nanopore were so error prone that I was skeptical they would ever be adopted. But they've steadily gotten better. And now, you know, the error rates are still pretty high compared to Illumina. They claim lower error rates, but they're really still pretty high, with the exception of HiFi. The raw sequencing reads you get out of a PacBio or Nanopore are still pretty error prone, but they're much longer, so they're valuable for that.
And for people who are just starting out or trying to do some long-read sequencing, even if their lab has been around for a while, they can't go and buy a $700,000 PacBio machine. So if they're at a place which has one already and has a core facility, and they can pay for sequencing that way, they might do a few jobs, you know, a few runs. And that's an option for them. But if they're not at a place where that's an option, and they want to have their own sequencer and run it themselves, their options are limited.
There's currently only one technology on the market, and that's Nanopore. I don't know when someone will emerge to compete with them, but it's just not a competition really. Once Nanopore got to be, you know, had a longer read length and similar accuracy, which has been several years now, I thought PacBio would be basically gone pretty soon. And I don't know anything about their finances because I don't pay attention to that. But in the sequencing arena, the HiFi technology is the only thing they have to offer that's really better. And I don't know how much business they get from that. But if you really want to assemble a large genome accurately, then the HiFi technology is great. So it's good that we have it. It's a shame that they sued Nanopore and prevented them from using their read-twice technology, because that might even supplant HiFi technology. But anyway, most places, most people, most labs don't have the money to pay for that. It's quite an expensive technology, the HiFi sequencing.
So these NIH NHGRI-funded groups that are part of the Pangenome Consortium, it seems that NHGRI is happy to spend tons of money on sequencing. So they're kind of keeping that technology alive. But in the research world you need pretty big grants to afford that kind of thing.
I always like asking about jobs, in this case about job opportunities on the computational side of sequencing and assembling genomes.
Steven Salzberg (34:30)
Well, I mean, I wouldn't be telling you anything particularly new or interesting: there are academic jobs. There are jobs doing what I do, and some of my students want to do that, and I know a lot about that. So I, you know, sort of tell them where there might be jobs, where there might be good places to go. And I know people on the academic side.
On the commercial side, I've had students go. Well, one just went to Illumina a year, year and a half ago, and a couple of students went to a cancer sequencing startup, though it's no longer a startup, it's too big for that, called Personal Genome Diagnostics here in Baltimore, started by some of the people in cancer here at Hopkins, namely Victor Velculescu, who's a colleague of Bert Vogelstein's. They have a company that does cancer sequencing, and two of my former students have gone there.
But I don't know the commercial market as well. I have my hands full keeping up with the academic world and the basic science side.
My experience so far is that there are lots of jobs. If you're trained in bioinformatics, the job market is still very, very good. It's kind of a seller's market: if you have the skills, there are a lot of places to go. And there are always lots of startups, it seems. Even when the economy has slowed down, there seems to be a fair number of startups in this area.
There are a lot of metagenomics companies starting up, there's a lot of interest in that. I don't know how many of them, if any, are ever going to succeed. The sequencing side of things, that's just a few technology companies, but the data analysis side of things, all of that involves bioinformatics.
So like, for example, there are companies that are doing microbiome analysis where, you know, they want to figure out, is there some useful diagnostic capability we can get from that. There are some opportunities in government, where they're trying to do a better job at tracking down foodborne outbreaks of infections. We use sequencing for that now, and it's a totally different approach than what we ever did before. But we do. We've sequenced collectively, I mean, the community has sequenced over 100,000 different strains of Salmonella, because the FDA collects this stuff, and that's their job, to track down foodborne outbreaks. So my students know how to do that if they're interested in that kind of thing. So there are government jobs and there's the private sector, but the private sector I don't know as well.
That was Conversations with Scientists. Today's guest was Dr. Steven Salzberg, computational biologist at Johns Hopkins University and director of the Center for Computational Biology at Hopkins. The music pieces used for this media project are Funky Energetic Intro by Winnie The Moog and Acid Trumpet by Kevin MacLeod, downloaded and licensed from filmmusic.io.
And I just wanted to say because there's confusion about these things sometimes. Johns Hopkins University didn't pay for this podcast and nobody paid to be in this podcast. This is independent journalism that I produce in my living room. I'm Vivien Marx, thanks for listening.