CRC 1678: Central Projects C

INF: Information Infrastructure Project: Scientific data management and data analytics

About INF

The CRC will generate substantial amounts of complex and heterogeneous data; our current estimates are 1,699 files (one per sample) of raw sequencing data and 1,553 files (one per sample) of proteomics data. Interpreting these data well will require sophisticated computational processing. The CRC will therefore establish a support unit, the Information Infrastructure Project (INF), responsible for the following tasks:
1) establish, maintain and update standardized pipelines for the analysis of high-throughput molecular (‘omics’) data to address needs shared across multiple projects, such as comparative quantification of RNA-seq and mass-spectrometry experiments and their functional annotation;
2) develop and improve bioinformatics algorithms adapted to project-specific questions when published solutions are unavailable;
3) train postdoctoral researchers and doctoral students in computational methods for high-throughput data analysis (in close cooperation with the IRTG) and educate CRC scientists on the FAIR data principles and best practices in data storage, as agreed upon in the CRC data policy;
4) implement a research data management system for the long-term storage of scientific data and assist in data annotation, including the deposition of data in community repositories (such as GEO, ArrayExpress, PRIDE);
5) assist in and conduct computational analysis (bioinformatics and biostatistics) for CRC research projects;
6) contribute to experimental design to maximize CRC project success (e.g. by handling batch effects);
7) support the professional publication of programming code in community resources (e.g. GitHub, Bioconductor).
Note that INF staff will primarily guide and assist in the analysis; most of the hands-on analysis will be done by researchers employed through the individual projects, with support from INF.
The INF Project will cooperate with the CECAD Bioinformatics Facility, which has existing pipelines for routine tasks such as differential expression analysis of RNA-seq data. The project leaders are experienced in generating and working with diverse large-scale data types and are well placed to oversee and support this aspect of the CRC.