Bioinformatics and Genomics Student Inspires Journal Cover

The July 2019 issue of Genome Research features art inspired by the work of Penn State graduate student Vijay Kumar Pounraja.

July 2019 Genome Research journal cover inspired by the work of Vijay Kumar Pounraja

The July 2019 issue of the journal Genome Research features a cover image inspired by the innovative efforts of Penn State Bioinformatics and Genomics graduate student Vijay Kumar Pounraja.

The illustration depicts a scientist hunched over a laptop as genetic sequencing data pours down around her like raindrops. Fortunately, she’s sheltered from the deluge by a handy umbrella tucked neatly under her chin. This is the piece Pounraja’s work inspired: the umbrella represents a new machine-learning technique he spearheaded, which shields genome researchers from a flood of spurious results.

“We’ve been studying large chromosomal deletions and duplications called copy-number variants in the genome, or CNVs, given their frequent association with autism and other neurodevelopmental disorders,” said Pounraja.

“Unfortunately, high-resolution whole-genome sequencing (WGS) is one of the few fully reliable methods for detecting CNVs, and it is prohibitively expensive for large cohorts. The current go-to method for clinical diagnosticians is a more cost-effective technique called whole-exome sequencing (WES), which selectively sequences the protein-coding regions spread discontinuously along the human genome.”

The problem with WES data is that this discontinuity leaves many gaps, and current methods of predicting CNVs from such gapped data produce very high false-positive rates. “Using WES data to detect CNVs is akin to searching for cracks in the train track between New York and Boston in the dark, but only using the faint lights from the train stations between these two cities,” Pounraja explained.

Because decisions about a patient’s diagnosis and eventual treatment are based on these predictions, false-positive calls are problematic and make it very difficult to select CNVs for further clinical investigation. This challenge has led to a proliferation of CNV detection tools and methods. However, the tools do not all reach similar conclusions or report the same set of CNVs; some are conservative and others liberal in the number and type of CNVs they predict.

“Generally, the high number of false positives from copy-number-variant algorithms has been dealt with by using multiple algorithms and only counting the variants identified by all the methods, like a Venn diagram,” said Pounraja. “This approach has multiple drawbacks and limitations, so we decided to develop a new machine-learning method.”
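
For readers who think in code, the "Venn diagram" consensus approach Pounraja describes can be sketched in a few lines of Python. This is only an illustration: the caller names and CNV calls below are invented, and the sketch assumes that two tools report a shared CNV with identical coordinates, which real callers rarely do.

```python
# Hypothetical CNV calls from three made-up callers. Each call is
# (chromosome, start, end, type). None of this is real data.
calls_by_caller = {
    "caller_a": {("chr1", 1000, 5000, "DEL"), ("chr2", 200, 900, "DUP"), ("chr7", 50, 400, "DEL")},
    "caller_b": {("chr1", 1000, 5000, "DEL"), ("chr2", 200, 900, "DUP")},
    "caller_c": {("chr1", 1000, 5000, "DEL"), ("chr9", 10, 80, "DUP")},
}

def consensus_calls(calls):
    """Keep only the calls reported by every caller (the center of the Venn diagram)."""
    return set.intersection(*calls.values())

consensus = consensus_calls(calls_by_caller)
print(sorted(consensus))
```

In this toy example only the chr1 deletion survives. The chr2 duplication, found by two of the three callers, is discarded along with the single-caller calls, which hints at one way a strict intersection can lose real variants.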

Pounraja and his collaborators named this method CN-Learn. It employs a random-forest-based machine-learning technique that trains hundreds of decision trees on a small validated set of CNVs. The model built from these trees yields far higher precision than was previously attainable.
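
The general idea behind a random forest can be sketched with a toy example. The code below is not CN-Learn and does not reflect its actual features or implementation. It is a deliberately simplified, standard-library-only ensemble in which each "tree" is a single yes/no split trained on a random resample of a small labeled set, and the forest's score is the fraction of trees voting that a candidate CNV is real. The two features (number of callers supporting a call, and CNV size in kilobases) and all the data are assumptions made up for illustration.

```python
import random

def train_stump(rows, labels, feature):
    """Fit a one-split tree: pick the threshold on one feature that best matches the labels."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for polarity in (True, False):
            # polarity=True means "predict real when value >= threshold"
            preds = [(row[feature] >= threshold) == polarity for row in rows]
            correct = sum(p == y for p, y in zip(preds, labels))
            if best is None or correct > best[0]:
                best = (correct, threshold, polarity)
    _, threshold, polarity = best
    return feature, threshold, polarity

def train_forest(rows, labels, n_trees=200, seed=0):
    """Train many one-split trees, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature for this tree
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def predict_proba(forest, row):
    """Fraction of trees that vote 'real' for one candidate CNV."""
    votes = sum((row[f] >= t) == pol for f, t, pol in forest)
    return votes / len(forest)

# Hypothetical validated training set: (callers_supporting, size_kb) -> validated as real?
rows = [(4, 120), (3, 80), (4, 45), (1, 5), (1, 60), (2, 3), (3, 150), (1, 10)]
labels = [True, True, True, False, False, False, True, False]

forest = train_forest(rows, labels)
print(predict_proba(forest, (4, 200)))  # large call supported by four callers
print(predict_proba(forest, (1, 2)))    # tiny call supported by one caller
```

Unlike the all-or-nothing intersection of several callers, an ensemble like this returns a graded score for each candidate, so a confidence cutoff can be chosen rather than imposed. A production random forest would grow full multi-level trees on many more features and far more validated examples.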

“With our new method, around 90% of the copy number variants we report are real,” said Santhosh Girirajan, associate professor of biochemistry and molecular biology at Penn State and the lead author of a paper describing the method in the July 2019 issue of Genome Research.

“I was looking for a postdoc or a student who could address this fundamental problem in the CNV field,” Girirajan explained. “Vijay’s deep interest in machine learning methods made this project successful. Personally, I think getting a publication early in the PhD career is a huge boost for the student’s confidence.”

George Perry, chair of Penn State’s Intercollege Graduate Degree Program in Bioinformatics and Genomics, agrees. “Vijay’s contribution with this work is an outstanding example of the high caliber of our Bioinformatics and Genomics students, and we couldn’t be prouder.”

To aid the global community of genome researchers, the Girirajan lab has made CN-Learn and all of the necessary supporting programs freely available to download in one easy package at this GitHub link: https://github.com/girirajanlab/CN_Learn

In addition to Girirajan and Pounraja, the research team at Penn State includes Gopal Jayakar, Matthew Jensen, and Neil Kelkar. The research was supported by the U.S. National Institutes of Health, the Simons Foundation Autism Research Initiative, and the Penn State Huck Institutes of the Life Sciences.

A July 23, 2019 Penn State News article explains more about the project: https://news.psu.edu/story/580104/2019/07/09/research/new-method-helping-find-deletions-and-duplications-human-genome