Annotation files for the Affymetrix Human Genome U133 Plus 2.0 Array: why do the GEO, official Affymetrix, and Bioconductor annotations differ?


After comparing them, I found that hgu133plus2.db annotates some probes that the GEO annotation does not, and I would like to know whether those extra annotations are correct. In short, hgu133plus2.db contains more annotations than both the GEO and the Affymetrix files; where do the extra ones come from? In addition, hgu133plus2.db has no probe that maps to more than one Entrez ID, whereas the other two annotation files do. How should I choose between them, and which annotation is the better one to use? My code is as follows:

    library("hgu133plus2.db")
    library(GEOquery)

    hug133plus2 = toTable(hgu133plus2ENTREZID)
    grep(pattern = '/', x = hug133plus2$gene_id, fixed = TRUE)
    which(hug133plus2$gene_id == '')
    nrow(hug133plus2)  # no one-to-many probes; 42307 rows

    ## Load the GEO annotation file
    GEOanno = getGEO(file = 'Mycloud/课题/辐射_GSEA/DataSource/Annotation_file/GPL570.annot.gz')
    GEOanno = Table(GEOanno)
    colnames(GEOanno)
    GEOanno = GEOanno[, c(1, 4)]
    ## How many annotations are empty?
    length(which(GEOanno$`Gene ID` == ''))  # 9557 rows with an empty annotation; delete them
    rindex = which(GEOanno$`Gene ID` == '')
    GEOanno = GEOanno[-rindex, ]
    geo_probe_multigene = length(grep(pattern = '/', x = GEOanno$`Gene ID`, fixed = TRUE))
    nrow(GEOanno)
    geo_probe_multigene / nrow(GEOanno)  # one-to-many probes make up 0.049 of the rows, a small share

    ## The official Affymetrix file
    affyanno = read.csv(file = 'Mycloud/课题/辐射_GSEA/DataSource/Annotation_file/HG-U133_Plus_2-na36-annot-csv/HG-U133_Plus_2.na36.annot.csv',
                        header = TRUE, as.is = TRUE, comment.char = '#')
    colnames(affyanno)
    affyanno_entrez = affyanno[, c(1, 19)]
    length(which(affyanno_entrez$Entrez.Gene == '---'))
    rindex = which(affyanno_entrez$Entrez.Gene == '---')
    affyanno_entrez = affyanno_entrez[-rindex, ]
    nrow(affyanno_entrez)
    affy_probe_multigene = length(grep(pattern = '/', x = affyanno_entrez$Entrez.Gene, fixed = TRUE))
    affy_probe_multigene / nrow(affyanno_entrez)  # 0.03, an even lower share

    ## Compare the annotation files
    ## hgu133plus2.db versus the GEO file
    hugmerge = paste0(hug133plus2$probe_id, hug133plus2$gene_id)
    geomerge = paste0(GEOanno$ID, GEOanno$`Gene ID`)
    nrow(hug133plus2)
    nrow(GEOanno)  # GEO has more rows than hgu133plus2.db
    table(geomerge %in% hugmerge)
    dif_geo_hug = GEOanno[!(geomerge %in% hugmerge), ]
    nrow(dif_geo_hug) / nrow(hug133plus2)  # about 8% of the probes differ

    ## On inspection, most of these are one-to-many probes; drop them and look again
    GEOanno = GEOanno[-(grep(pattern = '/', x = GEOanno$`Gene ID`, fixed = TRUE)), ]
    nrow(GEOanno) - nrow(hug133plus2)  # still nearly 600 rows more; compare again
    geomerge = paste0(GEOanno$ID, GEOanno$`Gene ID`)
    table(geomerge %in% hugmerge)
    dif_geo_hug = GEOanno[!(geomerge %in% hugmerge), ]
    nrow(dif_geo_hug) / nrow(hug133plus2)  # this time only 2% of the probes differ

    dif_hug_geo = hug133plus2[!(hugmerge %in% geomerge), ]
    dif_hug_geo[, 1]
    ## Why are these probes annotated differently?
    dif_hug_geo[1, 1]
    hug133plus2[which(hug133plus2$probe_id == '1552258_at'), ]  # result: 112597
    GEOanno[which(GEOanno$ID == '1552258_at'), ]
    which(GEOanno$ID == '1552258_at')
    ## The probe is absent, so it had no Entrez ID in the GEO file to begin with
    ## and was removed together with the empty annotations.
    ## So is the hgu133plus2.db annotation correct?
    ## Check the official Affymetrix data
    affyanno_entrez[which(affyanno_entrez$Probe.Set.ID == '1552258_at'), ]
    ## Nothing there either. Why?
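The comparison in the question treats a one-to-many entry as a single string, so a probe whose genes partly agree between two files is counted as entirely different. A minimal sketch of an alternative is to expand each table into one row per probe-gene pair before comparing. The tables below are made-up toy data, `to_long` is a hypothetical helper rather than a function from any package, and the `///` separator is an assumption about how multi-gene entries are joined in the files being compared.

```r
# Toy annotation tables (invented probe IDs and gene IDs, not real GPL570 data).
anno_a <- data.frame(
  probe_id = c("p1_at", "p2_at", "p3_at"),
  gene_id  = c("100", "200", "300"),
  stringsAsFactors = FALSE
)
anno_b <- data.frame(
  probe_id = c("p1_at", "p2_at", "p4_at"),
  gene_id  = c("100", "200///201", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand "a///b" entries into one row per probe-gene pair
# and drop empty ("" or "---") annotations.
to_long <- function(anno, sep = "///") {
  genes <- strsplit(anno$gene_id, sep, fixed = TRUE)
  long <- data.frame(
    probe_id = rep(anno$probe_id, lengths(genes)),
    gene_id  = trimws(unlist(genes)),
    stringsAsFactors = FALSE
  )
  long[long$gene_id != "" & long$gene_id != "---", , drop = FALSE]
}

long_a <- to_long(anno_a)
long_b <- to_long(anno_b)

# Build keys with an explicit separator so probe and gene IDs cannot run together.
key_a <- paste(long_a$probe_id, long_a$gene_id, sep = "|")
key_b <- paste(long_b$probe_id, long_b$gene_id, sep = "|")

only_in_a <- setdiff(key_a, key_b)                        # pairs found only in table A
only_in_b <- setdiff(key_b, key_a)                        # pairs found only in table B
probes_only_a <- setdiff(long_a$probe_id, long_b$probe_id)  # probes annotated only in A
```

With real data, `probes_only_a` would list probes such as `1552258_at` that one source annotates and the other leaves empty, which separates "annotated differently" from "annotated in only one source".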


2 answers

fcdslz, super administrator, user from: Beijing
2018-08-25 01:21

[attach]41[/attach] I finally found it. So that is how it works; in that case I can simply delete them all.


孟浩巍, super administrator, user from: Beijing
2018-08-24 22:50

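1. Take the official CDF file as the reference.
2. The CDF file generally does not change, because it records the probe information.
3. When you use the default annotation packages in Bioconductor, make sure they match your chip version exactly.
4. An expression array will almost always have several probe IDs that map to a single gene ID.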
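Because several probe IDs can map to one gene ID, an analysis at gene level needs some rule for collapsing them. The sketch below shows one common convention, keeping the probe with the highest mean signal for each gene. This is an illustrative assumption, not a rule stated in the answer above; the expression values and the `probe2gene` map are made-up toy data.

```r
# Toy expression matrix: rows are probes, columns are samples (invented values).
expr <- matrix(
  c(5, 6, 7,
    8, 9, 10,
    2, 3, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1_at", "p2_at", "p3_at"), c("s1", "s2", "s3"))
)

# Hypothetical probe-to-gene map: p1_at and p2_at target the same gene.
probe2gene <- c(p1_at = "100", p2_at = "100", p3_at = "300")

probe_mean <- rowMeans(expr)
genes <- probe2gene[rownames(expr)]

# For each gene, pick the probe with the highest mean signal across samples.
best_probe <- vapply(
  split(names(probe_mean), genes),
  function(p) p[which.max(probe_mean[p])],
  character(1)
)

# Gene-level matrix: one row per gene, taken from its selected probe.
gene_expr <- expr[best_probe, , drop = FALSE]
rownames(gene_expr) <- names(best_probe)
```

Other rules, such as averaging the probes of a gene or keeping the one with the largest variance, fit the same structure by changing the function passed to `vapply`.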


About the author

fcdslz, super administrator

Question activity

Posted: 2018-08-24 22:29

Updated: 2018-08-25 01:21
