Introduction

Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but are universal statements about disease and developmental stages. On the other hand, by virtue of the experimental procedures, the reads these experiments produce are tagged with small bits of patients’ variant information, which creates privacy challenges for data sharing, even though there are many benefits to sharing the data as broadly as possible. Measuring the amount of variant information leaked in a variety of experiments, particularly in relation to the amount of sequencing, will allow us to uncover ways of reducing information leakage and to determine an appropriate set point for sharing information with minimal leakage.

To resolve the dilemma between data sharing and privacy leakage, we propose a file format that enables the sharing of large amounts of data while protecting individuals’ sensitive information and preserving the utility of the data. The proposed format can achieve different balances between privacy and utility. At the highest level of privacy, it masks all the variant information leaked from reads, yet the masked files can still be used to calculate signal profiles with 99% recovery of the original profiles and 100% recovery of the original gene expression levels.