Information Quantification and Linking - Private Information Leakage in Functional Genomics Data

All fastq files first mapped to the human reference genome (b37). We used bwa for Hi-C and ChIP-Seq, STAR for RNA-Seq data. After mapping, the resulting BAM files are used to call variants using GATK best practices.
The codes for the information quantification and linking, preparation of input files and some of the input files can be downloaded in the download page. To perform mapping and use GATK best practices, use the combination of codes in:

sh SNPcalling.sh

We provide a bash script for preparation of input files called ProcessCalledVariants.sh, a python script called convertVcf.py to convert GATK vcd output to a text format and a C++ code for information quantification and linking attack.

Detailed README

We first transformed 1000 genmes genotypes for 2504 Phase III individuals for each autosomal chromosome seperately. These are converted from .vcf files in order to keep the information necessary in this study. The 1st column of each file has the location on the chromosome for the variant, 2nd and 3rd columns are the reference and alternate alleles of the variant and 4th column to 2508th column are the genotypes of the individuals for the variant. 0|1 and 1|0 corresponds to heterozygous variant, 1|1 correponds to homozygous variant and 0|0 corresponds no variant for the individual. Individuals are ordered as seen in the original .vcf files taken from 1000 genomes data portal, where NA12878 corresponds to 1753th individual in the list.

getInfonewmat.cpp first needs to be compiled as follows:

  
    g++ -std=c++11 -o getinfonewmat getInfonewmat.cpp

ProcessCalledVariants.sh script can be used to prepare the input files and then run the above C++ code. After running GATK on a .bam file, it will create filtered variant files in .vcf format, called passed_snps.vcf and passed_indels.vcf. In the ProcessCalledVariants.sh script, we first covert split these variants into autosomal chromosomes andconvert them into .txt format by keeping the information that we need only. An example line from these converted txt files as follows:

  
    1	12478	A	C	0/1

1st column: chromosome
2nd column: location on the chromosome
3rd column: reference allele
4th column: alternate allele
5th column: genotype (0/1 for heterozygous, 1/1 for homozygous)

The next conversion is to create new matrix and genotype files that contain onlt the variants called using GATK. By preprocessing the input files this way, we can save time when we quantify the information. In ProcessCalledVariants.sh, we used awk to create new matrices and genotype files as:

  
awk 'NR==FNR{arr[$2$3$4];next} $1$2$3 in arr' chr$i.fromexp.snp.txt chr$i.matrix.1kg.txt | awk '!seen[$0]++' > chr$i.newmat.snp.txt
awk 'NR==FNR{arr[$1$2$3];next} $2$3$4 in arr {print $5}' chr$i.newmat.snp.txt chr$i.fromexp.snp.txt > chr$i.gt.snp.txt

First line grabs the rows that are in the called genotypes and create a new matrix file from them. Second line uses these rows to create a file that has the genotypes that are called. $i denotes the chromosome. Once we have a new matrix that has only the variants and the individuals with those variants and the genotype file wth the genotypes called for the target individual, we can run the quantification and linking code as:

  
./getinfomat chr$i.newmat.snp.txt chr$i.gt.snp.txt > chr$i.snp.info.ind.txt

The resulting output file will have 2504 line. Each line will be the pmi value between the called variants and the individual in the genotype panel. The highest pmi value will correspond to the best matching individual. Gap values can be calculated by sorting these pmi values and finding the ratio between the top and second best match.

Risk Assessment

Detailed README