BMS8110 Review (7): Lecture 7 - ChIP Sequencing (transcription factor)


  • Introduction to transcription factor
  • ChIP-seq Experimental Procedure
  • Data Analysis
  • Case Study

Transcription Factor

  • General transcription factors and sequence-specific transcription factors
    • Mediator, transcription machinery, structural proteins
  • A TF is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA by binding to a specific DNA sequence.
  • The function of TFs is to regulate genes (turn them on and off) so that they are expressed in the right cell, at the right time, and in the right amount throughout the life of the cell and the organism.

DNA Binding Domain (DBD)

  • Most TFs are composed of three parts
    • DBD - DNA binding (sequence specificities)
    • SSD - signal sensing (ligand binding or modifications)
    • TAD - transactivation (releasing transcription initiation)

TF Family

  • TFs can be classified according to DBD structures
  • In humans, there are ~1,600 TFs
  • DBD alone determines sequence specificities of the TF

DNA Binding Specificities

  • Databases that store the information of DNA binding specificities of TFs:
    • JASPAR

Position Weight Matrix (PWM)

  • Consensus sequences
    • Most preferred individual sequences
    • A[CT]N{A}YR (A means that an A is always found in that position; [CT] stands for either C or T; N stands for any base; {A} means any base except A; Y represents any pyrimidine; and R indicates any purine.)
  • PWM is a commonly used representation of motifs (patterns) in biological sequences.
  • PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery.
  • Position Frequency Matrix (PFM)
    • Counting frequency for each nucleotide at each position
  • Position Probability Matrix (PPM)
    • Convert frequency to probability (0-1)
  • Each column can therefore be regarded as an independent multinomial distribution. This makes it easy to calculate the probability of a sequence given a PPM, by multiplying the relevant probabilities at each position.
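  • Most often the elements of a PWM are calculated as log-likelihood ratios (log-odds): each PPM probability is divided by a background probability for that nucleotide, and the logarithm of the ratio is taken.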
  • There are various algorithms to scan for hits of PWMs in sequences. One example is the MATCH algorithm.
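
The PFM, PPM and PWM steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MATCH algorithm or any particular tool: the aligned sites, the pseudocount of 1, the uniform background of 0.25 and the score threshold are all assumptions chosen for the example.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm

def pfm_to_ppm(pfm, pseudocount=1.0):
    """Convert counts to per-position probabilities (the pseudocount avoids zeros)."""
    length = len(pfm["A"])
    ppm = {b: [0.0] * length for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + 4 * pseudocount
        for b in BASES:
            ppm[b][i] = (pfm[b][i] + pseudocount) / total
    return ppm

def ppm_to_pwm(ppm, background=0.25):
    """Log2-odds of each probability against an assumed uniform background."""
    return {b: [math.log2(p / background) for p in ppm[b]] for b in BASES}

def sequence_probability(ppm, seq):
    """Probability of seq under the PPM: product of per-position probabilities."""
    prob = 1.0
    for i, base in enumerate(seq):
        prob *= ppm[base][i]
    return prob

def scan(pwm, sequence, threshold):
    """Slide the PWM along the sequence; return (offset, score) for windows scoring >= threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
pfm = build_pfm(sites)
ppm = pfm_to_ppm(pfm)
pwm = ppm_to_pwm(ppm)
hits = scan(pwm, "GGCATGACTCATTCC", threshold=5.0)
print(sequence_probability(ppm, "TGACTCA"))
print(hits)
```

Because the PWM entries are logarithms, summing them along a window is equivalent to multiplying the underlying probability ratios, which is why scanning uses addition rather than multiplication.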

Advanced Models: Hidden Markov Chain Model

  • A PWM assumes that each position contributes independently to the binding of the TF to the DNA sequence. Therefore, it sometimes fails to describe DNA binding specificities.
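
A toy example may help show why the independence assumption can fail. In the sketch below, the invented sites are always "AA" or "TT", so the two positions depend on each other. The per-position model still assigns probability to "AT", which never occurs, while a simple position-specific first-order Markov model (each base conditioned on the previous one) does not. This is only an illustration of modelling dependencies between positions, not a full hidden Markov model.

```python
# Toy sites where the two positions always agree (invented for illustration).
sites = ["AA", "TT", "AA", "TT"]

def ppm_probability(sites, seq):
    """Probability of seq when every position is modelled independently."""
    prob = 1.0
    for i, base in enumerate(seq):
        column = [s[i] for s in sites]
        prob *= column.count(base) / len(column)
    return prob

def first_order_probability(sites, seq):
    """P(x1) * P(x2 | x1) * ...: each base is conditioned on the previous base."""
    first = [s[0] for s in sites]
    prob = first.count(seq[0]) / len(sites)
    for i in range(1, len(seq)):
        matching = [s for s in sites if s[i - 1] == seq[i - 1]]
        if not matching:
            return 0.0
        prob *= sum(s[i] == seq[i] for s in matching) / len(matching)
    return prob

print(ppm_probability(sites, "AT"))          # 0.25, although "AT" never occurs
print(first_order_probability(sites, "AT"))  # 0.0
```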

Advanced Models: Machine Learning Model Support Vector Machine (SVM)

  • In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
  • K-mer based methods
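
As a rough sketch of what a k-mer based representation could look like, the function below turns a DNA sequence into a fixed-length vector of k-mer frequencies. Vectors like this, labelled as bound or unbound, could then be passed to an SVM implementation; no classifier is trained here. The function name, the choice of k = 3 and the example sequences are assumptions made for illustration.

```python
from itertools import product

def kmer_features(sequence, k=3):
    """Return a fixed-length vector of k-mer frequencies for one sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for short sequences
    return [counts[kmer] / total for kmer in kmers]

# Hypothetical sequences: one containing a motif twice, one without it.
bound = kmer_features("TTGACTCATGACTCAGG")
unbound = kmer_features("ACGTACGTTTTTGGGCC")
print(len(bound))
```

Every sequence maps to a vector of the same length (4^k entries), regardless of how long the sequence is, which is what makes the representation convenient as classifier input.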

Chromatin Immunoprecipitation (ChIP)

  • Cross-linking (1% formaldehyde)
  • Shearing chromatin to small fragments (300-500 bp)
  • Immunoprecipitation (Ab: a specific TF)
  • Wash to remove non-specific binding
  • De-crosslinking (heat)
  • DNA extraction and purification

Sequencing Library

  • End polishing (generate blunt end for adaptor)
  • Adenine nucleotide addition (facilitate TA ligation for adaptor)
  • Adaptor ligation (Y shape to generate two different ends for sequencing bridging)
  • PCR amplification
  • DNA purification and quantification (Qubit)

Data Analysis

  • FastQC (FASTQ Quality Check): Preliminary Quality Control of Library
  • Mapping: Identify the genomic location of each individual read
  • Remove redundant (duplicate) reads
  • Peak Calling: a computational method used to identify regions of the genome that are enriched with aligned reads as a result of the ChIP-seq experiment.
  • Further Analysis
    • Motif Enrichment
    • Gene Ontology Analysis
    • Others
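
The redundant-read removal and peak calling steps above can be illustrated with a toy sketch. It assumes reads have already been mapped and are given as (chromosome, start, strand) tuples, treats reads with identical coordinates as redundant, and flags fixed-size windows whose ChIP read count is unlikely under a Poisson model based on a control sample. The window size, the minimum expected count of 1.0, the p-value cutoff and the example reads are all assumptions; real peak callers use more elaborate models.

```python
import math
from collections import Counter

def remove_duplicates(reads):
    """Keep one read per (chromosome, start, strand); extra copies are treated as redundant."""
    return sorted(set(reads))

def poisson_tail(k, lam, terms=200):
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing the first `terms` tail terms."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)
    return total

def call_peaks(chip, control, window=200, p_cutoff=1e-5):
    """Report windows whose ChIP read count exceeds the control-based expectation."""
    chip_counts = Counter((chrom, start // window) for chrom, start, _ in chip)
    ctrl_counts = Counter((chrom, start // window) for chrom, start, _ in control)
    scale = len(chip) / max(len(control), 1)  # normalise for library size
    peaks = []
    for (chrom, idx), count in sorted(chip_counts.items()):
        # Assumed floor of 1.0 so windows without control reads still get an expectation.
        expected = max(ctrl_counts[(chrom, idx)] * scale, 1.0)
        p_value = poisson_tail(count, expected)
        if p_value < p_cutoff:
            peaks.append((chrom, idx * window, (idx + 1) * window, count, p_value))
    return peaks

# Hypothetical mapped reads: a pile-up near position 1000 plus scattered reads.
chip_reads = [("chr1", 1000 + 5 * i, "+") for i in range(20)]
chip_reads += [("chr1", 5000, "+"), ("chr1", 5000, "+"), ("chr1", 9000, "-")]
control_reads = [("chr1", 200 * i + 10, "+") for i in range(22)]

unique_chip = remove_duplicates(chip_reads)
peaks = call_peaks(unique_chip, remove_duplicates(control_reads))
print(peaks)
```

The enriched regions reported this way would be the input to the further analyses listed above, such as motif enrichment.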
