How a suite of advanced machine learning algorithms is finding new cancer-causing patterns of DNA damage
Cancer Grand Challenges researchers and collaborators have identified four new mutational signatures, including a pattern of DNA mutations that links bladder cancer to smoking tobacco. The discovery was made possible by a powerful new machine learning tool developed by members of the team taking on our Unusual Mutation Patterns challenge, seeking to find patterns of mutations caused by carcinogens and other DNA-altering processes.
The research, published in Cell Genomics, could help pinpoint the cause of some people’s cancers, by enabling scientists to trace a pattern of mutations back to an environmental factor or exposure. Knowing which mutational signatures are present in a tumour could also lead to more personalised treatments for an individual’s cancer.
The missing link between smoking tobacco and bladder cancer
As part of the study, the team found a mutational signature in the DNA of bladder cancer that can be linked to smoking tobacco. The finding is significant because tobacco-linked signatures had previously been detected in other cancer types associated with tobacco, such as lung, mouth and oesophageal cancers, but not in bladder cancer.
“There is strong epidemiological evidence tying bladder cancer to tobacco smoking,” says senior author of the study Ludmil Alexandrov, Cancer Grand Challenges co-investigator and professor of bioengineering and cellular and molecular medicine at University of California, San Diego. “The fact that we weren’t finding this signature in the bladder was strange.”
Interestingly, the mutational signature linked to smoking tobacco in bladder cancer is different to that found in lung cancer. What’s more, the signature can also be found in normal bladder tissues of people who smoke tobacco but who haven’t developed bladder cancer – adding to a growing body of evidence that ‘healthy’ tissues can harbour cancer-causing mutations without triggering tumorigenesis. We explored this phenomenon in an article written with Peter Campbell, and we are supporting a new team to investigate it.
“What this signature tells us is that certain mutations [in bladder cancer DNA] are due to exposure to tobacco smoke,” says study co-first author Marcos Diaz-Gay, a postdoctoral researcher in the Alexandrov lab.
For this study, the team analysed more than 23,000 sequenced human cancers. In addition to the bladder cancer signature tied to tobacco smoking, it found three new mutational signatures, in stomach, colon and liver cancers, that had not previously been detected. What causes these three other signatures is still unclear, opening up new potential avenues of investigation.
“Signatures can have external origin, or originate from different processes within the cell,” says Marcos. “It will be interesting to look at different cancer datasets, maybe from different countries with different environmental exposures, to identify the causes of these new signatures that we can see in the genome.”
“A powerful machine learning approach”
The team’s findings were made possible thanks to their development of a next-generation machine learning tool, SigProfilerExtractor – which the team describes as the most advanced, automated bioinformatics tool for extracting mutational signatures directly from large amounts of genomic data. Earlier this year, we shared how the team had used the tool to identify 21 copy number signatures in cancer DNA, and build the first comprehensive map of chromosomal changes that occur during cancer development.
“This is a powerful machine learning approach to recognize patterns of mutations and separate them from genomic data,” says Ludmil. “It takes those patterns and deciphers them, so that we can see what the mutational signatures are and match them with their meaning.”
Ludmil compares the team’s machine learning approach to picking out individual conversations at a party. “You have multiple groups of people talking all around you, but you are only interested in hearing certain individuals speaking,” he says. “Our tool essentially helps you do that, but with cancer genetic data. You have multiple people around the world exposed to different environmental mutagens, and some of those exposures are leaving imprints on their genomes. This tool goes through all that data to pick out what are the processes that cause the mutations.”
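To make the party analogy concrete, here is a minimal sketch of how mixed mutation counts can be separated into underlying patterns. It assumes non-negative matrix factorisation (NMF), a technique commonly used for this kind of separation; the article does not describe the internals of SigProfilerExtractor, so this should not be read as the team's implementation. The four mutation types, two signatures and five samples are invented toy numbers, and every function name is hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Distance between the observed catalogue V and its reconstruction W x H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factorise V (mutation types x samples) into W (types x k signatures)
    and H (k signatures x samples) with multiplicative updates.
    Toy illustration only, not SigProfilerExtractor's algorithm."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update the per-sample activities H while holding W fixed.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update the signature profiles W while holding H fixed.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Invented example: two "conversations" (signatures) over four mutation types.
true_signatures = [[0.7, 0.1], [0.1, 0.1], [0.1, 0.2], [0.1, 0.6]]
# How strongly each signature is active in each of five samples.
true_activities = [[100, 10, 50, 0, 80], [5, 120, 50, 90, 20]]
# The observed catalogue is the mixture of both signatures in every sample.
catalogue = matmul(true_signatures, true_activities)

signatures, activities = nmf(catalogue, k=2)
print("reconstruction error:", frobenius_error(catalogue, signatures, activities))
```

The columns of `signatures` play the role of the individual voices, and `activities` records how loudly each one speaks in each sample. In practice the number of signatures `k` is not known in advance and has to be chosen, which is part of what automated tools address.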
To test the power of their tool, Ludmil and the team compared it to 13 existing bioinformatics tools, assessing their ability to extract mutational signatures from more than 60,000 synthetic genomes that contained 2,500 simulated signatures. In this task, not only did SigProfilerExtractor detect 20-50% more true positive signatures than other tools, but it also found almost no false positive signatures in the data.
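As a rough illustration of how such a benchmark can be scored, the sketch below counts true and false positives by matching each extracted signature to a known simulated one. It assumes cosine similarity with a cut-off of 0.9 and a simple greedy one-to-one matching; the article does not state the criteria the team used, so the threshold, the matching rule, the function names and the toy signatures are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two signature profiles (1.0 = same shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_extraction(extracted, ground_truth, threshold=0.9):
    """Greedily match each extracted signature to its most similar unmatched
    ground-truth signature. The 0.9 threshold is an assumed cut-off."""
    unmatched_truth = list(range(len(ground_truth)))
    true_positives = 0
    false_positives = 0
    for signature in extracted:
        best_index, best_similarity = None, threshold
        for index in unmatched_truth:
            similarity = cosine_similarity(signature, ground_truth[index])
            if similarity >= best_similarity:
                best_index, best_similarity = index, similarity
        if best_index is None:
            # Nothing in the simulated data resembles this signature.
            false_positives += 1
        else:
            true_positives += 1
            unmatched_truth.remove(best_index)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        # Simulated signatures that the tool failed to recover.
        "false_negatives": len(unmatched_truth),
    }

# Invented toy data: three simulated signatures, two extracted ones.
simulated = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7], [0.25, 0.25, 0.25, 0.25]]
extracted = [[0.68, 0.12, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]]

print(score_extraction(extracted, simulated))
```

Here the first extracted signature closely matches a simulated one and counts as a true positive, while the second resembles none of them closely enough and counts as a false positive.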
“In bioinformatics, this is the first time that such a comprehensive benchmarking has been done on this scale for mutational signature extraction,” says Marcos. “It is a huge undertaking, comparing many tools across many datasets.”
The team’s ultimate goal is to create a web-based tool that more researchers can use and, as a result, profile more patients’ tumours.
“Right now, this tool requires bioinformatics expertise to run it,” says Ludmil. “What we want is to create a user-friendly version on the web, where researchers can just drop in a patient’s genome, and it immediately gives you the set of mutational signatures and what processes caused them.”
In particular, the team hopes to work with a global database called COSMIC (the Catalogue of Somatic Mutations in Cancer) to enable researchers from anywhere in the world to analyse their patients’ tumours using the algorithm.
“We’re actually trying at the moment to set up a web server on the COSMIC website where anyone can go and upload a sample,” Ludmil adds. “Then you’re able to analyse the mutational signatures in an individual patient with a very, very high accuracy.
“Obviously, one cannot give clinical advice from a website. But one can say, ‘this is a signature and there is a lot of evidence from previous research that people who have these signatures are likely to respond to this specific drug’.”
Find out more about the SigProfiler suite and read the paper in Cell Genomics.
This story was first covered by Liezel Labios on UCSD Today, and Jacob Smith at Cancer Research UK.