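Screening of hub genes in pancreatic adenocarcinoma and construction of a support vector machine diagnostic model based on bioinformatics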

Abstract. Background and Aims: Pancreatic cancer is a common malignant tumor of the digestive tract, and its main pathological type is pancreatic adenocarcinoma (PAAD). Because early diagnosis is difficult and effective treatments are lacking, the prognosis of PAAD is extremely poor. Identifying new targets for the diagnosis and treatment of PAAD is therefore of great significance. This study used bioinformatics analysis to screen hub genes related to the diagnosis and prognosis of PAAD, and then constructed a support vector machine (SVM) model to classify PAAD and normal pancreatic samples, so as to provide a basis for the diagnosis and treatment of PAAD and for research into its mechanisms.

Methods: Three microarray datasets (GSE28735, GSE62165, GSE62452) were downloaded from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) between PAAD tissue and normal pancreatic tissue were screened using the Limma package in R. GO and KEGG pathway enrichment analyses of the DEGs were performed using the STRING database. A protein-protein interaction (PPI) network of the DEGs was then built with STRING and visualized in Cytoscape, and key subnetwork modules were identified with the MCODE plug-in. The R survival package was used to screen key nodes related to prognosis in the PPI network and the key subnetworks, and these key nodes were uploaded to Metascape for functional enrichment analysis. The recursive feature elimination (RFE) algorithm in the R caret package was used to select the optimal feature genes among the key nodes, and the expression differences of the optimal feature genes were verified in the GEPIA database. An SVM classifier based on the optimal feature genes was constructed using the R e1071 package, and the R pROC package was used to validate the model in the 3 microarray datasets. In the TCGA database, the R survminer package was used to select, among the optimal feature genes, those related to the prognosis of PAAD as the hub genes.

Results: A total of 257 DEGs were identified, including 168 upregulated and 89 downregulated genes. GO analysis showed that the DEGs were mainly involved in extracellular matrix composition, cell adhesion, serine peptidase activity and related biological processes. KEGG analysis showed that the DEGs were mainly enriched in protein digestion and absorption, pancreatic secretion, focal adhesion, and the PI3K-Akt signaling pathway. Survival analysis identified 14 key nodes that were associated with prognosis in both GSE28735 and GSE62452 (all P<0.05); these genes play a role in tumor invasion and tumorigenesis. RFE selected 8 optimal feature genes: LAMA3, FN1, ITGA3, MET, PLAU, CENPF, MMP14, and OAS2. Validation in the GEPIA database found that these 8 optimal feature genes were markedly up…
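The validation step in the Methods reports how well the SVM separates PAAD from normal samples, a task the study carried out with the R pROC package. As a rough illustration of the quantity involved, the sketch below computes the area under the ROC curve in plain Python using the pairwise (Mann-Whitney) definition. It is not the authors' code and does not use pROC; the function name `roc_auc` and the labels and scores are hypothetical, chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive (tumor) sample
    receives a higher score than a randomly chosen negative (normal) sample.
    Ties between a positive and a negative score count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = PAAD, 0 = normal; scores are made-up classifier outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 1.0 means every tumor sample scores above every normal sample, while 0.5 corresponds to chance-level separation. The pairwise loop is quadratic in the number of samples, which is acceptable for datasets of microarray size; a rank-based formulation would scale better.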

Authors: ZHANG Bo (张波); XU Tao (徐涛); XU Hao (徐浩); XIA Yu (夏雨); ZHOU Wence (周文策) (The First Clinical Medical College, Lanzhou University, Lanzhou 730000, China; Department of General Surgery, the First Hospital of Lanzhou University, Lanzhou 730000, China)

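Affiliations: The First Clinical Medical College, Lanzhou University; Department of General Surgery, the First Hospital of Lanzhou University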

Source: Chinese Journal of General Surgery (《中国普通外科杂志》); indexed in CAS, CSCD, Peking University Core Journals; 2021, Issue 3, pp. 276-285 (10 pages)

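Funding: Key Research and Development Program of Gansu Province (17YF1FA128); Talent Innovation and Entrepreneurship Fund of Lanzhou City, Gansu Province (2017-RC-37).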

Keywords: Pancreatic Neoplasms; Gene Expression Profiling; Support Vector Machine; Computational Biology

Classification code: R735.9 [Medicine and Health: Oncology]