Classifying lncRNAs by chromosomal position

1/1/2018


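Based on their position relative to protein-coding genes, lncRNAs can be divided into five classes: Intergenic LncRNAs (lincRNA), Bidirectional LncRNAs, Intronic LncRNAs, Antisense LncRNAs, and Sense-overlapping LncRNAs. The detailed definition of each class is given below.

Definitions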

ref: https://www.arraystar.com/reviews/v30-lncrna-classification/

  1. Intergenic LncRNAs

    Intergenic LncRNAs are long non-coding RNAs that are located between annotated protein-coding genes and are at least 1 kb away from the nearest protein-coding gene. They are named according to the nearby 3′ protein-coding genes. Gene expression patterns have implicated these LincRNAs in diverse biological processes, including cell-cycle regulation, immune surveillance and embryonic stem cell pluripotency. LincRNAs collaborate with chromatin-modifying proteins (PRC2, CoREST and SCMX) to regulate gene expression at specific loci.

  2. Bidirectional LncRNAs

    A Bidirectional LncRNA is oriented head to head with a protein-coding gene within 1 kb. A Bidirectional LncRNA transcript exhibits a similar expression pattern to its protein-coding counterpart, which suggests that they may be subject to shared regulatory pressures. However, discordant expression relationships between bidirectional LncRNAs and protein-coding gene pairs have also been found, challenging the assertion that LncRNA transcription occurs solely to “open” chromatin to promote the expression of neighboring coding genes.

  3. Intronic LncRNAs

    Intronic LncRNAs are RNA molecules that overlap with an intron of an annotated coding gene in either sense or antisense orientation. Most Intronic LncRNAs have the same tissue expression patterns as the corresponding coding genes, and may stabilize protein-coding transcripts or regulate their alternative splicing.

  4. Antisense LncRNAs

    Antisense LncRNAs are RNA molecules that are transcribed from the antisense strand and overlap in part with well-defined spliced sense or intronless sense RNAs. Antisense-overlapping LncRNAs have a tendency to undergo fewer splicing events and typically show lower abundance than sense transcripts. The basal expression levels of antisense-overlapping LncRNAs and sense mRNAs in different tissues and cell lines can be either positively or negatively regulated. Antisense-overlapping LncRNAs are frequently functional and use diverse transcriptional and post-transcriptional gene regulatory mechanisms to carry out a wide variety of biological roles.

  5. Sense-overlapping LncRNAs

    These LncRNAs can be considered transcript variants of protein-coding mRNAs, as they overlap with a known annotated gene on the same genomic strand. The majority of these LncRNAs lack substantial open reading frames (ORFs) for protein translation. Others contain an ORF that shares the same start codon as a protein-coding transcript of that gene but are unlikely to encode a protein, for several reasons: nonsense-mediated decay (NMD), which limits the translation of mRNAs with premature termination codons and triggers destruction of the mRNA, or an upstream alternative ORF that inhibits translation of the predicted ORF.
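The five definitions above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the method used by Arraystar or by the bedtools script in this post. The names `Interval`, `near`, `head_to_head` and `classify` are hypothetical, coordinates are assumed to be inclusive, and when a lncRNA overlaps exons on both strands the sketch arbitrarily reports it as sense-overlapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int   # inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def near(a: Interval, b: Interval, slack: int = 0) -> bool:
    """True if a and b overlap, or lie within `slack` bp of each other."""
    return (a.chrom == b.chrom
            and a.start <= b.end + slack
            and b.start <= a.end + slack)


def head_to_head(a: Interval, b: Interval) -> bool:
    """True if a and b are on opposite strands with their 5' ends facing."""
    if a.strand == b.strand:
        return False
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    # Divergent pair: the minus-strand feature lies entirely upstream.
    return minus.end < plus.start


def classify(lnc: Interval, transcripts: List[Interval],
             exons: List[Interval], window: int = 1000) -> str:
    """Assign one class label to a lncRNA (assumed rule order, see lead-in)."""
    neighbours = [t for t in transcripts if near(lnc, t, window)]
    if not neighbours:
        return "intergenic"
    if not any(near(lnc, t) for t in neighbours):
        # Within the window of a coding transcript but not overlapping it.
        if any(head_to_head(lnc, t) for t in neighbours):
            return "bidirectional"
        return "near-gene, other orientation"
    hits = [e for e in exons if near(lnc, e)]
    if not hits:
        return "intronic"
    if any(e.strand == lnc.strand for e in hits):
        return "sense-overlapping"
    return "antisense"


# Toy annotation: one coding transcript with two exons (made-up coordinates).
GENES = [Interval("chr1", 10000, 20000, "+")]
EXONS = [Interval("chr1", 10000, 11000, "+"),
         Interval("chr1", 19000, 20000, "+")]

if __name__ == "__main__":
    for lnc in (Interval("chr1", 30000, 31000, "+"),
                Interval("chr1", 9000, 9500, "-"),
                Interval("chr1", 12000, 13000, "-")):
        print(lnc, classify(lnc, GENES, EXONS))
```

The `window` default of 1000 mirrors the 1 kb threshold in the definitions. The catch-all label covers lncRNAs that sit within 1 kb of a gene without being head to head with it, a case the five definitions do not name.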

Implementation

# step0: prepare BED files (chrom, start, end, transcript_id, score, strand) from the GTF annotations
# Note: GTF starts are 1-based while BED starts are 0-based; use $4-1 in place of $4 if strict BED coordinates are required
awk -F "\t" 'BEGIN{OFS="\t"}{if ($3=="transcript") {print $1,$4,$5,$9,$7}}' lnc.gtf | sed -e 's/gene_id.*transcript_id "//g' -e 's/".*\t/\t.\t/g' >lnc.transcript.bed
awk -F "\t" 'BEGIN{OFS="\t"}{if ($3=="transcript") {print $1,$4,$5,$9,$7}}' mRNA.gtf | sed -e 's/gene_id.*transcript_id "//g' -e 's/".*\t/\t.\t/g' >mRNA.transcript.bed
awk -F "\t" 'BEGIN{OFS="\t"}{if ($3=="exon") {print $1,$4,$5,$9,$7}}' mRNA.gtf | sed -e 's/gene_id.*transcript_id "//g' -e 's/".*\t/\t.\t/g' >mRNA.exon.bed

head mRNA.transcript.bed
# chr1 65419 71585 ENST00000641515 . +
# chr1 69055 70108 ENST00000335137 . +
# chr1 450703 451697 ENST00000426406 . -
# chr1 685679 686673 ENST00000332831 . -
# chr1 923928 939291 ENST00000420190 . +
# chr1 925150 935793 ENST00000437963 . +


# step1: get intergenic lncRNAs
# -v inverts the match: keep lncRNAs that overlap no protein-coding transcript, including the 1000 bp flanks on either side (bedtools window defaults to -w 1000)
bedtools window -a lnc.transcript.bed -b mRNA.transcript.bed -v >lnc.Intergenic.bed
############### lnc.Intergenic.bed ###############


# step2: lncRNAs within 1000 bp of a gene
# lncRNAs that overlap a protein-coding transcript -> lnc_gene.bed
bedtools intersect -a lnc.transcript.bed -b mRNA.transcript.bed -wa | sort -u > lnc_gene.bed
# lncRNAs within 1000 bp upstream or downstream of a protein-coding transcript but not overlapping it -> lnc_gene_1k.bed
# (all lncRNAs minus the overlapping set minus the intergenic set)
python -c 'with open("lnc.transcript.bed") as a, open("lnc_gene.bed") as m, open("lnc.Intergenic.bed") as i, open("lnc_gene_1k.bed", "w") as k1: k1.write("".join(set(a.readlines())-set(m.readlines())-set(i.readlines())))'

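# step2b: split lnc_gene_1k.bed by transcription direction
# -Sm: the neighboring protein-coding transcript is on the opposite strand
bedtools window -a lnc_gene_1k.bed -b mRNA.transcript.bed -u -Sm | sort -u >lnc.Bidirectional.bed
############### lnc.Bidirectional.bed ###############
# -sm: the neighboring protein-coding transcript is on the same strand
bedtools window -a lnc_gene_1k.bed -b mRNA.transcript.bed -u -sm | sort -u >lnc.Enhancer.bed
############### lnc.Enhancer.bed ###############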


# step3: classify the lncRNAs that overlap a protein-coding transcript
# overlaps a protein-coding transcript but no exon -> intronic lncRNAs
bedtools intersect -a lnc_gene.bed -b mRNA.exon.bed -v > lnc.Intronic.bed
############### lnc.Intronic.bed ###############
# overlaps an exon on the same strand (-s) -> sense lncRNAs
bedtools intersect -a lnc_gene.bed -b mRNA.exon.bed -wa -s | sort -u > lnc.Sense.bed
############### lnc.Sense.bed ###############
# overlaps an exon on the opposite strand (-S) -> antisense lncRNAs
bedtools intersect -a lnc_gene.bed -b mRNA.exon.bed -wa -S | sort -u > lnc.Antisense.bed
############### lnc.Antisense.bed ###############
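
The awk and sed pipeline in step0 relies on the attribute column having a `gene_id ... transcript_id "..."` layout. As a possible alternative, here is a minimal Python sketch of the same GTF to BED6 conversion that pulls `transcript_id` out with a regular expression. The function name `gtf_to_bed6` and its `zero_based` switch are hypothetical, and the sample GTF line uses a made-up `gene_id`. By default the sketch keeps the 1-based start, as the script above does.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the transcript_id attribute wherever it sits in column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

Bed6 = Tuple[str, int, int, str, str, str]


def gtf_to_bed6(lines: Iterable[str], feature: str = "transcript",
                zero_based: bool = False) -> Iterator[Bed6]:
    """Yield (chrom, start, end, transcript_id, ".", strand) per GTF feature.

    With zero_based=True the start is shifted by -1 to follow the BED
    convention; the default mirrors the shell script and leaves it as is.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # feature without a transcript_id attribute
        start = int(fields[3]) - (1 if zero_based else 0)
        yield (fields[0], start, int(fields[4]), match.group(1), ".", fields[6])


# Made-up GTF line for illustration (gene_id "G1" is a placeholder).
SAMPLE_GTF = [
    "#comment line\n",
    'chr1\tsrc\ttranscript\t65419\t71585\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "ENST00000641515"; gene_type "protein_coding";\n',
    'chr1\tsrc\texon\t65419\t65433\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "ENST00000641515";\n',
]

if __name__ == "__main__":
    for record in gtf_to_bed6(SAMPLE_GTF):
        print("\t".join(str(x) for x in record))
```

Passing `feature="exon"` would give the equivalent of `mRNA.exon.bed`. Whichever coordinate convention is chosen should be applied to all three BED files so that the bedtools comparisons stay consistent.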