Sharing resources on Statistical Learning Theory and Machine Learning (including SVM, Semi-Supervised Learning, Ensemble Learning, and Clustering). You are welcome to contact me: Email: xiankaichen@gmail.com, QQ: 112035246

Friday, June 27, 2008

Normalizing Data in SPSS

Here is a detailed introduction to standardizing data with SPSS:
http://blog.sina.com.cn/s/blog_49f78a4b01000844.html
One thing to add: the standardized values are appended as new variables to the original data table.
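For readers who prefer to do the same thing outside SPSS, here is a minimal R sketch of the z-score standardization described in the linked post; the toy data frame dat is hypothetical, and the standardized columns are appended to the original table, just as SPSS does.

    # z-score standardization: (value - column mean) / column standard deviation
    dat <- data.frame(x1 = c(2, 4, 6, 8), x2 = c(10, 20, 30, 40))  # toy data
    z   <- scale(dat)                          # standardize each column
    colnames(z) <- paste0("Z", colnames(dat))  # SPSS-style names: Zx1, Zx2
    dat_z <- cbind(dat, z)                     # append to the original table
    dat_z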

Wednesday, June 25, 2008

Box Plot Graphs

  • Conceptual Overview

Box Plot graphs, also referred to as Box and Whisker Plot graphs, are quite common in statistics and quality measurements. Graphs in the Box Plot data class organize data items by category. There is only one series of data, and all data items are the same color.

A Box Plot graph data item has five main values: Low, Q1, Median, Q3 and High. These numbers are determined from the data set you are using to create the Box Plot. The data set can also have any number of Outlier data values.

Let's look at an example with a small set of data:

35, 42, 48, 50, 51, 53, 54, 60, 75

Important: These are not the numbers we send to the graph. These are the numbers we use to compute the values that we send to the graph. The graph itself cannot compute these figures for us.

In this case, the Median value is 51. If we divide the data into two sets, we have 35, 42, 48, 50, 51 and 51, 53, 54, 60, 75 (the Median value is included in both sets because there is an odd number of items). Now we find the median of each of these two sets to find the first quartile, Q1, and the third quartile, Q3. These numbers are 48 and 54 respectively.

The difference between Q1 and Q3 is 6 (54 - 48 = 6). This is called the interquartile range, or IQR. The Low and High values are based on the IQR. The Low value, or lower whisker, is at most 1.5 times the IQR below Q1; in this case, that is at most 9 (6 x 1.5 = 9) below Q1, so the minimum possible Low value for our example is 39 (48 - 9 = 39). The Low value is the smallest data item that is equal to or greater than this minimum; in our set, the Low value is 42.

The High value, or upper whisker, is determined similarly. We find the maximum High value, which is Q3 plus 1.5 times the IQR, or 63 (54 + 9 = 63). We then find the largest data item value that is equal to or less than the maximum High value; in our data set, the High value is 60.

Therefore, for this data set, the values are as follows: Low=42, Q1=48, Median=51, Q3=54 and High=60.

So what do we do with the values 35 and 75? These are Outlier values, because they are more than 1.5 times the IQR away from Q1 and Q3. Each Outlier value is represented by a small circle symbol in the Box Plot graph. If an Outlier is more than 3 times the IQR away from Q1 or Q3, it is classified as an extreme Outlier and is represented in the graph by a plus sign. In our example, the lower threshold for an extreme Outlier is Q1 minus 3 times the IQR, or 30 (48 - (3 x 6) = 30). Since our minimum data value is 35 and thus higher than this threshold, it is a normal Outlier. The upper threshold for an extreme Outlier is Q3 plus 3 times the IQR, or 72 (54 + (3 * 6) = 72). Our maximum data value is 75 and thus larger than this threshold, making it an extreme Outlier value.
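If you want to check these numbers programmatically, the following small R sketch reproduces the quartile convention used above (the median is counted in both halves); note that R's own quantile()/boxplot() defaults use a slightly different quartile rule, so their Q1/Q3 may not be exactly 48 and 54.

    # Verify the worked example: quartiles, whiskers and outliers
    x <- c(35, 42, 48, 50, 51, 53, 54, 60, 75)

    med <- median(x)                    # 51
    q1  <- median(x[x <= med])          # lower half (median included): 48
    q3  <- median(x[x >= med])          # upper half (median included): 54
    iqr <- q3 - q1                      # 6

    low  <- min(x[x >= q1 - 1.5 * iqr]) # smallest value >= 39  -> 42
    high <- max(x[x <= q3 + 1.5 * iqr]) # largest value  <= 63  -> 60

    outliers <- x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]  # 35 and 75
    extreme  <- x[x < q1 - 3 * iqr | x > q3 + 3 * iqr]      # 75 only

    c(Low = low, Q1 = q1, Median = med, Q3 = q3, High = high)

boxplot(x) draws the corresponding picture directly; its range argument (default 1.5) is the same whisker factor discussed above.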

The image below displays the Box Plot graph derived from our data set.

Box Plot graphs are in the Box Plot Data Class.

  • Possible Uses

Box Plot graphs can be used for:

  • Statistical analysis

  • Emphasizing outlying data

Tuesday, June 24, 2008

The Bootstrap Sampling Method (Principle and Implementation)

  • Quoting the article "My views on some statistical methods" (author: Yihui Xie). The article puts it this way: the usual Bootstrap sampling scheme is to "draw the full sample size with replacement" (in fact the sample size also depends on the situation and does not have to equal the original sample size). That is, the Bootstrap sample has the same size as the original sample, but it is drawn with replacement. This sampling can be repeated B times, each time computing the corresponding statistic/estimator, and at the end one examines how stable that statistic is (expressed by its variance).
  • It is easy to see that the Bootstrap is a very simple but slightly odd sampling method. The real meaning of the Bootstrap sampling mentioned in the ensemble learning method Bagging is exactly what is explained above. How is the sampling in Bagging implemented concretely? It can be done simply with the sample(x,replace=TRUE,num) function in R, where x is the sample to draw from, replace=TRUE means drawing with replacement (FALSE means without replacement), and num is the number of draws. A small example run is sketched after this list.


  • The implementation is simple and easy to understand.
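As a small, self-contained illustration of the idea (my own sketch, not taken from the quoted article), the following base-R code draws B bootstrap samples with sample() and looks at the variance of the resulting estimates of the mean:

    # Bootstrap the sample mean and measure its stability
    set.seed(1)
    x <- rnorm(100, mean = 5, sd = 2)   # stand-in for the original sample
    B <- 1000                           # number of bootstrap replications

    boot_means <- replicate(B, {
      xb <- sample(x, length(x), replace = TRUE)  # draw with replacement
      mean(xb)                                    # statistic of interest
    })

    mean(boot_means)   # centre of the bootstrap distribution
    var(boot_means)    # stability of the estimator, as described above

In Bagging the same resampling is used, except that a base learner is trained on each bootstrap sample instead of computing a summary statistic.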

My Summary of Algorithmic Modeling

Recently I have been working on improving an algorithm. I ran into many problems and learned a lot from them, so I want to write a summary to make my thinking clearer.
1. Data preprocessing. This covers many analysis methods. What needs to be done here: data collection, noise handling (Xindong Wu has done good work on this), correlation analysis (a chi-square test works), and principal component analysis (which I am still studying); a short base-R sketch of the last two follows this list.
2. Choose a suitable learning algorithm and train a model. A direct problem when training the model is the choice of its parameters; cross-validation can be used to judge whether the chosen parameters are reasonable.
3. Use the model to predict on the test data.
4. Algorithm evaluation: time complexity analysis, space complexity analysis, comparison of best accuracies, and significance testing (paired t-test). I prefer the latter. Many algorithms are compared only on their best accuracy, which is clearly not very reasonable, because those results may not be reproducible; a significance test is needed to show that the improved results are significantly different. If there is no significant difference, then however good the best accuracy looks, it does not show that your model is better than others'.
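As promised in step 1, here is a short base-R sketch of the two preprocessing tools mentioned there; the variables f1, f2 and X are hypothetical stand-ins for real data.

    # Association between two categorical variables (chi-square test)
    f1 <- factor(c("a", "a", "b", "b", "a", "b", "a", "b"))
    f2 <- factor(c("x", "x", "y", "y", "x", "y", "y", "x"))
    chisq.test(table(f1, f2))   # tiny toy counts, so expect a warning

    # Principal component analysis on a numeric matrix
    X  <- matrix(rnorm(200), ncol = 4)
    pc <- prcomp(X, scale. = TRUE)
    summary(pc)                 # proportion of variance per component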
Below I use the support vector machine to explain further.
(1) Let the training set be S = {x1, ..., xn} with labels Y = {-1, +1}. S has already been preprocessed; here I mainly describe the implementation details of steps 2-4.
(2) To split off a training set and a test set we use random sampling. One way to do it: use a random-number function to assign each sample a random number, sort the samples by these random numbers, and then take a fixed proportion of the sorted samples (for example 7:3, i.e. 70% as the training set and the remaining 30% as the test set). Note that the training and test sets can differ from one draw to the next, so the data from a single draw, especially the training data, may not follow the distribution the model assumes. In that case we can draw several times, train a model on each draw, and pick the model with the best accuracy as the algorithm's model. If the training and test sets are already fixed at the start, this step can be skipped.
(3) After (2), the training and test sets should not be changed any more. Next comes training the model, that is, applying the support vector machine algorithm to the training set to obtain the model used for the prediction of step 3. To train this model we use cross-validation to judge how reasonable the model obtained under a given parameter setting is, namely 10-fold cross-validation. Briefly: with k = 10, after choosing the SVM parameters p (a parameter vector: C and the kernel parameters), we split the training set into 10 parts and loop ten times to obtain the cross-validation error e. Then e serves as an evaluation of how good p is: a small e means p is good, a large e means it is not. Choosing p takes some experience and patience; from the candidate parameters we pick the p with the smallest e as the SVM parameters, and then train on the training set to obtain the corresponding model M. (A rough R sketch of steps (2)-(4) follows step (6) below.)
(4) Having obtained the model M, we next use the test set to measure its accuracy.
(5) As mentioned in (2), the model M trained in (3) is not necessarily reasonable, so we repeat (2) to draw several times, obtain several models M_i, and choose the one with the best prediction accuracy as the final support vector machine model.
(6) Evaluate your algorithm from every angle. As far as accuracy goes, consider a statistical test, or treat the best accuracy as one evaluation criterion; as far as speed goes, focus on time and space complexity analysis.
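The following is a rough R sketch of steps (2)-(4) above. It is only an illustration under stated assumptions: it uses the e1071 package's svm() (which the text above does not prescribe) and a synthetic toy data set in place of the real training set S, and the parameter grid is arbitrary.

    # Sketch of steps (2)-(4): random 7:3 split, 10-fold CV over (C, gamma),
    # final model M, and test accuracy. Requires install.packages("e1071").
    library(e1071)

    set.seed(123)
    n <- 200
    X <- matrix(rnorm(2 * n), ncol = 2)
    y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n, sd = 0.5) > 0, "+1", "-1"))
    dat <- data.frame(X, y = y)          # stand-in for the preprocessed S

    # Step (2): random split via random numbers and sorting
    idx   <- order(runif(n))
    train <- dat[idx[1:round(0.7 * n)], ]
    test  <- dat[idx[(round(0.7 * n) + 1):n], ]

    # Step (3): choose (C, gamma) by 10-fold cross-validation on the training set
    grid   <- expand.grid(cost = 2^(0:4), gamma = 2^(-4:0))
    cv_acc <- apply(grid, 1, function(p)
      svm(y ~ ., data = train, kernel = "radial",
          cost = p["cost"], gamma = p["gamma"], cross = 10)$tot.accuracy)
    best <- grid[which.max(cv_acc), ]

    # Train the final model M with the selected parameters
    M <- svm(y ~ ., data = train, kernel = "radial",
             cost = best$cost, gamma = best$gamma)

    # Step (4): accuracy of M on the held-out test set
    mean(predict(M, test) == test$y)

Step (5) would simply wrap the split and the cross-validated search in a loop over several random splits and keep the best resulting model.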

The steps above are the core of implementing and applying a support vector machine. They are my own summary; please point out anything inappropriate so we can discuss it.

Clustering + FSVM

A while ago I ran some experiments on fuzzy support vector machines and the results were not ideal. We were following Xu Guangyou's paper, but it did not work well; his method has limitations. I think we should first use clustering to separate out the main components, and then train each component with Xu Guangyou's method; that should give better results.
Looking back at what I did a month ago, there were some problems with the research approach at the time.
The main reasons, I think, are:
1. It was a first exploration, there was a lot of knowledge to pick up quickly, and the difficulty was too high;
2. The algorithm evaluation was flawed: we did not use cross-validation at the time;
3. Our understanding of kernel functions was off;
4. The advisor misled us by insisting on nothing less than a 5-percentage-point improvement, which badly dented our confidence. In fact, after looking into it I found that Prof. Ming's requirement at the time was unreasonable; these days an improvement of 0.5 to 2 percentage points is already acceptable;
5. For accuracy, consider a statistical significance test on the results (paired t-test).

Next, I plan to spend half a month tackling this problem again.

How to Follow Machine Learning Research in China

Machine Learning Mailing List in China

You can subscribe at:
http://cs1.shu.edu.cn/gzli/mlchina.htm

List of Important Machine Learning Journals

Machine Learning Journal
Journal of Machine Learning Research
Data Mining and Knowledge Discovery
IEEE Transactions on Knowledge and Data Engineering
Knowledge and Information Systems
IEEE Transactions on Pattern Analysis and Machine Intelligence
Artificial Intelligence Journal
IEEE Transactions on Evolutionary Computation
Pattern Recognition
Neural Computation
IEEE Transactions on Neural Networks

Monday, June 23, 2008

Communications of the ACM Newsletter – New Digital Edition

Communications of the ACM Newsletter – New Digital Edition
July 2008


--------------------------------------------------------------------------------


With this issue, ACM is proud to announce the publication of an entirely redesigned and revitalized Communications of the ACM. In flipping through the pages of this Digital Edition, and of the beautiful print edition many of you will receive in the coming weeks, you will notice that in many ways CACM looks more like an entirely new magazine than a revamped version of its former self. A great deal of effort by a great many people has gone into this redesign, with the ultimate goal of making Communications of the ACM a magazine that is both of higher quality and more relevant for the broader computing community than ever before. It is our hope that you begin to see CACM fulfill both of these important goals with this July issue.



About the Redesign

The redesign refers both to the look and feel of the CACM pages and to the editorial scope of the magazine. In the past year, ACM hired the renowned international design firm Pentagram to completely overhaul the physical publication. They introduced a new three column format and new typography as part of an entirely new graphic design of the magazine. You will find the new magazine filled with comfortable white space, clean lines, fresh colors, more relevant imagery, and more readable typography.

In parallel, ACM appointed an active and vibrant new editorial team, led by Editor-in-Chief Moshe Y. Vardi, to revamp the magazine's editorial model. Over the coming months, you will see many new editorial features in the magazine, but what is noteworthy for now is the introduction of several well defined sections, including Departments, an expanded News section with in-depth articles on hot computing topics plus shorter news briefs, Viewpoints or opinion articles by leading experts in particular fields, a completely new Practice section led by Stephen Bourne and the editorial board of ACM Queue magazine, Contributed Articles, Review Articles, and the Research Highlights section, which presents full length research papers from around the computing community along with one-page Technical Perspectives that seek to provide context and relevance for non-researchers interested in the latest and most important research coming down the pike.

As with any new magazine design, the first issue is a starting point, and your suggestions for improving the magazine will play an important role in how CACM evolves over time.

In this Issue

This issue offers a rich assortment of articles on emerging areas of computer science objectives, practical research applications, technology news developments, in-depth engineering features, even an editorial debate. Indeed, this great array of topics reflects the diverse professional interests and endeavors of the readers of Communications.

The cover story on Web Science, by James Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Daniel Weitzner, examines an interdisciplinary approach to understanding the Web as an entity in its own right to ensure that it continues to flourish.

Also in this issue:

James Larus and Christos Kozyrakis ponder if transactional memory is the reasonable answer for improving parallel programming.
Mark Oskin traces how changes in computer architecture are about to impact everyone in the IT business.
Erik Wilde and Bob Glushko debunk some popular XML myths.
Adam Leventhal questions the potential of flash memory as a viable platform for a new tier in the storage hierarchy.
Margo Seltzer warns there is more to data access than SQL.
Stephen Andriole and Eric Roberts argue over the technology curriculum in the first in a series of Point/Counterpoint editorial debates. Look for other leading voices to take opposing sides of an industry issue in this new feature.
Donald Knuth reflects on the influences that set the course for his extraordinary career in this first of a two-part interview with fellow A.M. Turing Award winner Edward Feigenbaum.
News on the latest trends in cloud computing, quantum computing, and dependable software design.
A Q&A about model-checking technology with the recipients of the 2007 A.M. Turing Award.
Author Guidelines

In the current issue of the magazine you will find new Author Guidelines, which provide instructions on submitting manuscripts to the new Communications of the ACM.

As always, we welcome your thoughts and comments on Communications of the ACM. Please forward your feedback to: cacmfeedback@acm.org

Sincerely,

Scott Delman
Group Publisher
Association for Computing Machinery


--------------------------------------------------------------------------------


Some Records from a CS PhD Discussion Group (How to Do Research)

dudu(562014591) 22:14:18
Directly swapping one algorithm for another, "applying" it to some special case, and then comparing the results in a self-deceiving way - that is a bad paper
云海山人(57373880) 22:15:29
What makes a good paper?
dudu(562014591) 22:15:41
The opposite of that
云海山人(57373880) 22:15:42
Share some experience,
Leung-图形学(25728739) 22:16:15
How do you do the opposite?.. Tell us more
怡外(260240477) 22:16:18
A pioneering one is a good paper
云海山人(57373880) 22:16:27
What counts as pioneering? Share some experience,
校园商务平台(30940045) 22:16:51
That's exactly how I plan to write a paper now, ha
校园商务平台(30940045) 22:16:58
Still mulling it over
怡外(260240477) 22:17:05
Haha
KaiBao-ML(112035246) 22:17:19
Being pioneering is pretty hard
Leung-图形学(25728739) 22:17:30
Directly swapping one algorithm for another, "applying" it to some special case, and then comparing the results in a self-deceiving way - that is a bad paper

It depends on what the special case is.....
校园商务平台(30940045) 22:17:52
I think that is a very important first step in writing a paper,
云海山人(57373880) 22:18:14
next ?
SVM信息分类(14029756) 22:18:30
I don't like doing pure research; doing cross-disciplinary applications is better
怡外(260240477) 22:18:34
I think you should start from the basic concepts and basic theorems of your own discipline
怡外(260240477) 22:18:56
And combine them with other disciplines
Leung-图形学(25728739) 22:19:07
If someone invented algorithm A and someone else invented algorithm B
but I use A+B somewhere and get good results
that should count as a good paper too
怡外(260240477) 22:19:28
Haha
Leung-图形学(25728739) 22:19:29
There isn't much left to truly pioneer
none of us is Einstein
怡外(260240477) 22:19:54
I wrote a paper like that
校园商务平台(30940045) 22:20:08
Exactly - first imitate others, and once you have grasped certain patterns and insights, show your own originality. Right now I want to combine this with my own project background and reach that first step
Leung-图形学(25728739) 22:20:42
Take graphics: many papers at SIGGRAPH
are really a stacking of A+B+C+D+...
with at most a little of the authors' own invention
if it forms a complete piece of work, it is a good paper
KaiBao-ML(112035246) 22:21:33
Leung-图形学(25728739) 22:19:07 - agreed
怡外(260240477) 22:21:37
I combined a paper from the Chinese Journal of Computers with one from Science in China
YJX-并行计算(24974976) 22:21:55
Very clever.
YJX-并行计算(24974976) 22:22:36
Combining two strong things.
困惑-网格计算(9386619) 22:23:39
"How much sorrow can one hold? As much as a portfolio stuffed with PetroChina"
c(12303108) 22:23:53
Impressive
Leung-图形学(25728739) 22:23:54
Of course, if you invent an algorithm A yourself
you can send it straight to SIGGRAPH
but most researchers can only do some integration and small improvements
or apply things in a special domain
for example, I apply wavelets to surface modeling, and you cannot say that work is rubbish
KaiBao-ML(112035246) 22:26:42
If I improve some algorithm and its accuracy goes up a little, but not by much, does that count as an improvement?
KaiBao-ML(112035246) 22:26:55
How should it be judged?
Leung-图形学(25728739) 22:27:10
Look at what the improvement means for your application domain
校园商务平台(30940045) 22:27:14
Then improve it some more
KaiBao-ML(112035246) 22:27:53
I am only making pure improvements to the algorithm, with no application background
沧海一粟(104589894) 22:28:17
困惑-网格计算(9386619) 22:23:39
"How much sorrow can one hold? As much as a portfolio stuffed with PetroChina"

Brother, can I go bottom-fishing now?
KaiBao-ML(112035246) 22:28:18
I have tried many methods and none of the effects are very noticeable
云海山人(57373880) 22:30:14
That is how papers get mass-produced! A+B+C+...
YJX-并行计算(24974976) 22:30:21
Large improvements are generally hard; otherwise someone else would already have made them.
Pick a fairly recent paper from a good journal; if your accuracy reaches it, or even beats it a little, then you can turn that into a good paper.
The secondary processing is just the remaining craftsmanship.
夏蛙的诱惑(7120043) 22:30:22
Don't, brother, this year is really abnormal
校园商务平台(30940045) 22:31:01
The background still matters a lot
Leung-图形学(25728739) 22:31:16
I think many bad papers in the Chinese core journals are like this:
1. outright plagiarism;
2. borrowing a method but comparing it only against someone's ancient baseline;
3. deliberately talking down other people's work;
4. faking data
困惑-网格计算(9386619) 22:31:18
沧海一粟(104589894) 22:28:17
困惑-网格计算(9386619) 22:23:39
"How much sorrow can one hold? As much as a portfolio stuffed with PetroChina"

Brother, can I go bottom-fishing now?

Brother, I was only reciting a line of verse; I have no idea where the bottom is
KaiBao-ML(112035246) 22:32:00
Agreed. I have run quite a few experiments and found the results are not as good as they claim
夏蛙的诱惑(7120043) 22:32:07
Stop bickering.
YJX-并行计算(24974976) 22:32:46
For many good papers, the results differ in another environment (which is perfectly normal), but you can still build a paper on them.
校园商务平台(30940045) 22:32:47
Papers like that mislead people who are doing real scholarship
夏蛙的诱惑(7120043) 22:32:51
The bottom is always the bottom; the market makers will not let it reach your hands
怡外(260240477) 22:32:58
If you cite it, it does not count as plagiarism.
Leung-图形学(25728739) 22:33:28
校园商务平台
Papers like that mislead people who are doing real scholarship

What do you mean?
YJX-并行计算(24974976) 22:33:29
Truly original papers are too few. Most papers are roughly like this.
KaiBao-ML(112035246) 22:33:42
But that is exactly the difficulty in front of me: large improvements are hard, so how can I push the algorithm's accuracy as high as possible (speaking only of accuracy)?
YJX-并行计算(24974976) 22:34:02
There certainly are ways.
校园商务平台(30940045) 22:34:36
I mean the bad papers you mentioned - the ones with fake data
YJX-并行计算(24974976) 22:34:40
Last year there was an award-winning paper (you may not care about that particular paper) that made five improvements on top of a pile of good papers.
Leung-图形学(25728739) 22:34:41
Do you know the lifting wavelet scheme?
One of its advantages is that it can be done with integer arithmetic
integer arithmetic can cut storage dramatically
and in computing, precision really matters
YJX-并行计算(24974976) 22:34:54
In essence the improvements were in the details, but the effect was very good.
Leung-图形学(25728739) 22:35:09
Numerical errors accumulate
every bit of extra precision counts
Leung-图形学(25728739) 22:35:52
Take image denoising: a gain of 0.5 dB already counts as good
with a gain of 1-2 dB, go straight to a Transactions journal
KaiBao-ML(112035246) 22:36:09
What kinds of details, for example?
Leung-图形学(25728739) 22:39:21
Sometimes a parameter is adjusted only a tiny bit, yet the final result is simply different
Let me try an example:
in pattern recognition, some clustering algorithms look like
f(x) = a(x) + b(x) ... and so on, a long expression
but someone adds a term c(x) after f(x) and the result changes; I have seen plenty of papers like that
校园商务平台(30940045) 22:40:00
Hmm, not bad
YJX-并行计算(24974976) 22:40:14
That is why a certain person could publish 36 SCI and 24 EI papers in 1.5 years of his PhD - there is something behind it.
奋斗(329621457) 22:40:36
So-called "practical development experience" means you already have the following abilities: 1) you consider C++ and assembly simple languages and use them freely; 2) you can come up with a correct design approach and direction for a Gomoku AI algorithm within 30 minutes; 3) you fully understand why the STL is so important; 4) you can solve any compilation and linking problem on your own, even ones you have never seen, without asking anyone; 5) English-language websites are your primary source of information; 6) you can read international standards written in English, such as the NTFS disk format specification; 7) you often think about algorithm problems from the viewpoint of set theory; 8) you can understand a simple device driver and a simple 3D interactive program; 9) you recognize how extremely important linear algebra and probability theory are in real programming work; 10) you fully understand the design philosophy of COM, and in particular why COM was designed that way; 11) when I mention the important role of virtual functions, you do not rush off to look it up in a book; 12) you can state reasons why C++ is better than other languages, and those reasons come from your own development experience rather than from what everyone else says. There are many more criteria, but if you meet five or more of these at once, you can be considered to have the corresponding development experience. Going to graduate school in that state will get you the most out of it.
YJX-并行计算(24974976) 22:41:48
He told me that some problems are worth digging into carefully. In fact, once you study many problems in detail, there really are a lot of problems in there.
校园商务平台(30940045) 22:42:15
The approach Leung-图形学 described is well worth borrowing.
Leung-图形学(25728739) 22:42:24
Another example:
a Gaussian filter has an important parameter, delta, and choosing that delta takes skill
whatever factor you multiply in, the result will be quite different
people used to build bilateral filters, essentially two Gaussian filters (not simply stacked, of course), and now people build trilateral ones, with a few more constraints and parameters
of course, if your experiments show a benefit, that is your achievement
KaiBao-ML(112035246) 22:42:26
Who is this superstar?
dudu(562014591) 22:42:27
1.5 years, 36 SCI + 24 EI papers
YJX-并行计算(24974976) 22:42:44
Let's not get into the paper count for now.
怡外(260240477) 22:42:48
So doing research is really about chewing over every word
YJX-并行计算(24974976) 22:43:04
Not quite. What you consider a small problem may well not be one.
怡外(260240477) 22:43:51
He very likely opened up a new direction --- 1.5 years, 36 SCI + 24 EI papers
YJX-并行计算(24974976) 22:43:53
Because nobody in this group, looking at any particular problem, would dare claim to be sure whether it is a small problem or a big one. If you find it and it proves useful, it may well be a big problem.
奋斗(329621457) 22:44:11
In a word: if you only want to become a software development expert (the kind who thinks writing drivers or antivirus software makes you an expert), I suggest going to work rather than applying to graduate school; if you have no work experience at all, I also do not recommend it - you would just muddle through. If you have the work experience described above and want to become a senior software engineer (the kind who can independently understand and implement the fast Fourier transform), then I strongly recommend graduate school. It gives you three years to think at leisure and three years for your ideas and skills to accumulate and settle - a very rare opportunity. Without it, that kind of opportunity is a luxury you can see but never reach.
YJX-并行计算(24974976) 22:44:45
At least on this point, I have seen it in a large batch of recent experiments: the results in many papers actually fail to consider another angle.

Leung-图形学(25728739) 22:44:56
Agreed
奋斗(329621457) 22:45:30
On the relative merits of hands-on experience versus theoretical study: there is no settled answer. As noted before, management information systems, device driver development, utility software, virus analysis and so on do not require much creativity; they require patience, experience, and an accurate understanding of existing specifications, so that kind of work is best improved through practice, and theoretical study does not help much. But in artificial intelligence, pattern recognition, image compression, virtual reality, massive-scale data retrieval, natural language understanding, computer graphics and so on, theoretical study holds an absolutely dominant position! Breakthroughs in these areas have an enormous and profound impact on human life. Some of them are developing extremely fast - computer graphics, for example; I trust you can sense this from the dazzling brilliance of today's 3D games. In these fields, without a solid theoretical foundation, everything stays out of reach, no matter how much time you spend programming.
Leung-图形学(25728739) 22:45:48
If you spot a problem from a narrow angle and make an improvement within it, that is enough
never mind how narrow the eventual application is
maybe it is only useful in certain special cases, and that is still valuable
怡外(260240477) 22:46:25
If you bring every factor in, the problem can get very complicated; that is why the 21st century is the century of complexity science
YJX-并行计算(24974976) 22:46:39
Back when my advisor was an undergraduate, he found exactly such a narrow, obscure, very theoretical problem, and as a result he published 5 top-journal papers in those years.
YJX-并行计算(24974976) 22:47:43
To this day it keeps me from dismissing problems I assume are unimportant, unless I have figured that problem out very thoroughly.
KaiBao-ML(112035246) 22:48:08
Agreed, I have learned a great deal
YJX-并行计算(24974976) 22:49:30
There are indeed grand, sweeping papers, and they may be only a few pages long. But a great many top-journal papers discuss fine-grained problems in exhaustive detail.
Put bluntly, the research simply has to be done in sufficient quantity. That is what it comes down to.
Leung-图形学(25728739) 22:49:47
Systematic
Leung-图形学(25728739) 22:50:13
Do it completely, do it thoroughly
校园商务平台(30940045) 22:50:24
Tell us about that process
校园商务平台(30940045) 22:50:49
Could you describe how it went back then?
YJX-并行计算(24974976) 22:51:24
A professor studied a Nature paper until he understood it completely, and in response wrote another paper for Science, doing a huge amount of work - a full 26 pages - and it was accepted.
dudu(562014591) 22:52:00
Sigh! Let's just think about how we are going to graduate
Asian Engine(372760127) 22:52:18
Agreed
YJX-并行计算(24974976) 22:52:58
I'll stop here; any more and it loses its point. I have a paper to revise.
Sunday, June 22, 2008

"Learning Has Just Started" - an interview with Prof. Vladimir Vapnik(2008.4对老瓦的采访)

As a part of the renovation of the learningTheory.org web site, we are launching a series of interviews with leading researchers in learning theory and related fields. We are proud that Prof. Vladimir Vapnik accepted our invitation to be the first to be interviewed.

Prof. Vapnik has been working on learning theory related problems for more than four decades. Together with Alexey Chervonenkis he studied the problem of uniform convergence of empirical means and developed the VC theory. He also developed the large margin principles and the Support Vector Machines algorithm.

R-GB: Thank you for accepting our invitation to be the first one to be interviewed for learningtheory.org. Can you tell us what your current research directions are?

V-V: My current research interest is to develop advanced models of empirical inference. I think that the problem of machine learning is not just a technical problem. It is a general problem of philosophy of empirical inference. One of the ways for inference is induction. The main philosophy of inference developed in the past strongly connected the empirical inference to the inductive learning process.

I believe that induction is a rather restrictive model of learning and I am trying to develop more advanced models. First, I am trying to develop non-inductive methods of inference, such as transductive inference, selective inference, and many other options. Second, I am trying to introduce non-classical ways of inference. Here is an example of such an inference. In the classical scheme, given a set of admissible indicator functions {f(x)} and given a set of training data, pairs (xi, yi) ∈ X × {±1}, one tries to find the best classification function in this set. In the new setting, called master-class learning, we are also given a set of admissible functions {f(x)} and our goal is also to find the best classification function in this set. However, for the training data we are given additional information: we are given triplets (xi, x*i, yi), where the vectors x*i belong to a space X* (generally speaking, different from the space X). These vectors are carriers of "hidden information" about the vectors x (they will not be available during testing). These vectors can be a special type of holistic description of the data.

For example, when you have a technical description x of an object and some impression x* about this object, you have two forms of description: a formal description and a holistic, or Gestalt, description. Using both descriptions during training can help to find a better decision function. This technique is like master-class learning - the way musicians train in master classes. The teacher does not show exactly how to play; he talks to the students and gives them images that transmit some hidden information - and this helps. So, the challenge is to create an algorithm which, using the additional information, will generalize better than classical algorithms.

I believe that learning has just started, because whatever we did before, it was some sort of a classical setting known to classical statistics as well. Now we come to the moment where we are trying to develop a new philosophy which goes beyond classical models.

R-GB: You gave the example of master-classes where you see this additional information, can you give another example?

V-V: Consider for example a figure skating coach. The coach cannot skate as well as a good young skater; nevertheless, he can explain how to skate. The explanation is not in technical details but something like what you should focus on more, or giving you some images you should think of. You can look at it as if it is just blah-blah-blah, but it is not.

This type of description contains hidden information that affects your choice of a good rule. We checked this opportunity in the digit recognition task. We developed metaphoric descriptions of all digits of the training set, and used these descriptions to improve performance, and it works. This is what real learning is about: it uses technical description x and uses hidden information provided by the teacher in a completely different language to create a good technical decision rule.

R-GB: Do you think this setting can be formalized in the same sense that uniform convergence was formalized?

V-V: It is easy to formalize it, and you can use it with well-known algorithms like support vector machines. In support vector machines one uses many independent slack parameters in the optimization process. Hidden information leads to a restricted set of admissible slack functions which have a smaller capacity than all possible slack functions used in classical SVMs. Many of these ideas were discussed in the after-word in the second edition of my 1982 book "Estimation of Dependencies Based on Empirical Data".

I believe that something drastic has happened in computer science and machine learning. Until recently, philosophy was based on the very simple idea that the world is simple. In machine learning, for the first time, we have examples where the world is not simple. For example, when we solve the "forest" problem (which is a low-dimensional problem) and use data of size 15,000 we get 85%-87% accuracy. However, when we use 500,000 training examples we achieve 98% correct answers. This means that a good decision rule is not a simple one; it cannot be described by a very few parameters. This is actually a crucial point in the approach to empirical inference.

This point was very well described by Einstein who said "when the solution is simple, God is answering". That is, if a law is simple we can find it. He also said "when the number of factors coming into play is too large, scientific methods in most cases fail". In machine learning we are dealing with a large number of factors. So the question is what is the real world? Is it simple or complex? Machine learning shows that there are examples of complex worlds. We should approach complex worlds from a completely different position than simple worlds. For example, in a complex world one should give up explain-ability (the main goal in classical science) to gain a better predict-ability.


R-GB: Do you claim that the assumption of mathematics and other sciences that there are very few and simple rules that govern the world is wrong?

V-V: I believe that it is wrong. As I mentioned before, the (low-dimensional) problem "forest" has a perfect solution, but it is not simple and you cannot obtain this solution using 15,000 examples.

R-GB: Maybe it is because learning from examples is too limited?

V-V: It is limited, but it is not too limited. I want to stress another point: you can get a very good decision rule, but it is a very complicated function. It can be like a fractal. I believe that in many cases we have these kinds of decision rules. But nonetheless, we can make empirical inferences. In many cases, to make empirical inference, we do not need to have a general decision rule; we can do it using different techniques. That is why empirical inference is an interesting problem. It is not a function estimation problem that has been known in statistics since the time of Gauss. Now a very exciting time has come when people try to find new ways to attack complex worlds. These ways are based on a new philosophy of inference.

In classical philosophy there are two principles to explain the generalization phenomenon. One is Occam's razor and the other is Popper's falsifiability. It turns out that by using machine learning arguments one can show that both of them are not very good and that one can generalize violating these principles. There are other justifications for inferences.

R-GB: What are the main challenges that machine learning should address?

V-V: The main challenge is to attack complex worlds.

R-GB: What do you think are the main accomplishments of machine learning?

V-V: First of all, machine learning has had a tremendous influence both on modern intelligent technology and modern methods of inferences. Ten years ago, when statisticians did not buy our arguments, they did not do very well in solving high-dimensional problems. They introduced some heuristics, but this did not work well. Now they have adopted the ideas of statistical learning theory and this is an important achievement.

Machine learning theory started in the early 1960s with the Perceptron of Rosenblatt and the Novikoff theorem about the Perceptron algorithm. The development of these works led to the construction of a learning theory. In 1963, Alexey Chervonenkis and I introduced an algorithm for pattern recognition based on the optimal hyper-plane. We proved the consistency of this algorithm using uniform convergence arguments and got a bound for its accuracy. Generalization of these results led to the VC theory.

From a conceptual point of view the most important part of VC theory is the necessary and sufficient conditions for learn-ability not just sufficient conditions (bounds). These conditions are based on capacity concepts. There are three capacity measures, one is entropy, the second is growth function and the last is VC dimension. VC dimension is the most crude description of capacity. The best measure is the VC-entropy which is different than the classical entropy. The necessary and sufficient condition for a given probability measure states that the ratio of the entropy to the number of examples must go to zero. What happens if it goes to value a which is not zero? Then one can prove that there exists in the space X a subspace X0 with probability measure a, such that subset of training vectors that belong to this subspace can be separated in all possible ways. This means that you cannot generalize. This also means that if you have to choose a good function from an admissible set of functions you can not avoid VC type of reasoning.

R-GB: What do you think about the bounds on uniform convergence? Are they as good as we can expect them to be?

V-V: They are O.K. However the main problem is not the bound. There are conceptual questions and technical questions. From a conceptual point of view, you cannot avoid uniform convergence arguments; it is a necessity. One can try to improve the bounds, but it is a technical problem. My concern is that machine learning is not only about technical things, it is also about philosophy: What is the complex world science about? The improvement of the bound is an extremely interesting problem from a mathematical point of view. But even if you get a better bound it will not be able to help attack the main problem: what to do in complex worlds?


R-GB: Today we use these bounds and design algorithms to minimize these bounds. But the bounds are loose. Are we doing the right thing?

V-V: I don't think you can get a bound which is not loose, because technically it is very difficult. Not everything can be described by simple mathematical expressions. Now, people are working on transductive inference. What is the difference between transductive inference and inductive inference? In transductive inference, you have a set of equivalent decision rules: rules which classify the training and test data in the same way are equivalent. So instead of an infinite number of hypotheses, you have a finite number of equivalence classes. For a set of equivalence classes the problem becomes simpler; it is combinatorial. You can measure the size of equivalence classes using different measures and this leads to different algorithms. You can restructure risk minimization on equivalence classes. It describes the situation better. But in this case, you will not have a general decision rule, you just have answers for your test points. You give up generality but you gain in accuracy. I am happy that people have started to develop transductive inference, which was introduced back in the 1970s.

R-GB: Do you have some pointers to interesting work that you ran across recently?

V-V: There are a lot of very good works. Recently, the book "Semi-Supervised Learning" was published. Along with semi-supervised learning it contains chapters about transductive learning. I think that semi-supervised learning is still an attempt to do induction. But I believe that in order to get accuracy in inference we should give up the inductive approach. It should be something else like transductive inference or selective inference. For example, consider the following problem. You are given pairs, (xi,yi), i=1,...,l for training and simultaneously you are given m testing examples x1,…,xm. The problem is, given these two sets, can you find k examples from the test set which most probably belong to the first class? This is a decision making problem. It is easier than the inductive pattern recognition because you don't need to classify everything. Classification is difficult for the border elements which are close to the decision boundary. It is also not a ranking problem, because you are not interested in the ranking of chosen vectors. It is a simpler problem, and therefore it can be solved more accurately.

R-GB: Your SVM work gained a lot of popularity, what do you think is the reason that made it so popular?

V-V: First of all, SVM was developed over 30 years. The first publication, jointly with Alexey Chervonenkis in 1963, was about optimal separating hyper-planes - it was actually the linear SVM. In 1992, jointly with Bernhard Boser and Isabelle Guyon, we introduced the kernel trick, and in 1995, jointly with Corinna Cortes, we introduced slack variables and it became SVM. Why is it so popular? I believe there are several reasons. First of all, it is effective. It gives very stable and good results. From a theoretical point of view, it is very clear, very simple, and allows many different generalizations, called kernel machines. It also introduced pattern recognition to optimization scientists, which brought new researchers into the field. There exist several very good libraries for SVM algorithms. All together these make SVM popular.

R-GB: It seems nowadays that almost all machine learning is about margins and maximizing margins. Is it the right thing to look at?

V-V: In my research, I am trying to invent something better than margin, because margin is not the best characteristic of capacity. Through margin one can bound the VC dimension. But VC dimension is a loose capacity concept. Now I am trying to introduce something instead of the margin. I am checking different ideas, such as using Universums and making inference by choosing the equivalence class which has the maximal number of contradictions on the Universum. Margin is good to prove a general point, but to find advanced techniques maybe we should think about what could be better than margin - to move closer to entropy and not to VC dimension.

R-GB: Switching to a different topic: you have been active in machine learning for more than 4 decades and keep being very innovative. How do you do that?

V-V: The problem of machine learning is very wide. It includes technical aspects as well as philosophical aspects. It is also a unification of humanities and technicalities. In different periods of my life I tried to understand different aspects of this problem. Through the 1960s-1970s, I was interested in the mathematical aspects of this problem. In the 1980s, I understood the relation of the pattern recognition problem to problems of classical philosophy of science. In the 1990s, developing SVM, I was happy to introduce a theoretical line in creating algorithms instead of heuristics. Now I see in this problem a central point for changing a general paradigm that has existed for many hundreds of years which separates inductive (or scientific) inference developed for dealing with a simple world from direct (non-inductive or non-scientific) inference which is a tool for dealing with a complex world. It is a very rich problem that has many aspects and one can find appropriate aspects of this problem in different periods of one's life.

R-GB: When you started in the 60's, did you consider your work as studying learning?

V-V: Yes, I considered my research to be research in learning theory from the very beginning. The uniform convergence theory was constructed to justify learning algorithms developed in the 1960s and, in particular, the optimal separating hyperplane. For me, the PAC theory that started in the mid-1980s was a step backward from the earlier development presented in our joint book with Alexey Chervonenkis, "Theory of Pattern Recognition" (1974), and in my book "Estimation of Dependencies Based on Empirical Data" (1979 Russian version and 1982 English translation).

R-GB: Can you share with us some recommendations for papers or books in machine learning or related areas that you find interesting?

V-V: There are many excellent articles and books on machine learning. Now is the time when many different ideas are implemented into algorithms for solving large-scale technical, biological, and linguistic problems. Now is a time when a lot of new facts and observations have appeared. It is very interesting to try to follow them in order to understand how they can fit in a general model.

Recently, the first book in the philosophy of science on this subject appeared, "Reliable Reasoning - Induction and Statistical Learning Theory" by G. Harman and S. Kulkarni, which tries to explain the philosophical component of what was done in machine learning: what the VC dimension is, what was wrong with Popper's non-falsifiability concept in contrast to the VC non-falsifiability concept, and so on. From my point of view this book is still conservative: it is primarily about induction but also mentions transductive inference. Nevertheless, people in philosophy are trying to take our science into account. I believe that we should also try to advance general philosophical problems of intelligence.


From: http://www.learningtheory.org/

AI Organizations (International)

AI Organizations (China)

Leading Researchers in AI

Qiang Yang, Zhi-Hua Zhou, Jiawei Han, Liefeng Bo, Huan Liu, Lei Yu, Yiyu Yao, Guozheng Li, Hui Wang, XinDong Wu, Changshui Zhang, Qiang Shen, Josh Tenenbaum, Richard Jensen, Dominik Slezak, Sheng Zhong, L. A. Zadeh, Ronald R. Yager, Tieniu Tan, Congfu Xu, Ivo Düntsch

QQ Discussion Groups for Data Mining, Machine Learning, Statistics, and Matlab


Data Mining:
Shenwei Intelligent Mining II: 12023354
Data Mining Enthusiasts Group No. 1: 56615792
Machine Learning:
HIT Machine Learning Group: 27980716
Matlab:
Matlab Technical Exchange Group: 46721754
Matlab Research Discussion Group: 46912946
Statistics:
China Statistics Alliance: 14112616

Agent:
Agent Study Group: 62761088
Supercomputing:
Supercomputer Exchange Group: 13242003

CS PhD Academic Exchange Group: 62135360

A Good Way to Learn a New Field Quickly: Wikipedia


Wikipedia has a wealth of material, including introductions, applications, and theory. It is fairly representative and has many related links, so it can help you get up to speed on a field quickly. Give it a try: Wikipedia

Artificial Intelligence Conferences


A few AI conferences I know (first tier)

IJCAI (1+): The best comprehensive AI conference, started in 1969 and held every two years, in odd years. Because AI is so broad, even though each edition accepts 100-odd papers (now more than 200), each subfield only gets a few; areas as large as machine learning or computer vision get only about 10 papers per edition, so it is very hard to get in. The acceptance rate does not look that low, though, roughly 20%, because insiders weigh their chances and do not waste the reviewers' time on hopeless submissions. Lately, submissions from mainland China to international conferences have been pouring in, and since few domestic groups can vet their own work, many conferences complain that low-quality Chinese submissions seriously hamper the PC's efficiency. Under these circumstances, acceptance rates at international conferences will probably drop over the next few years. Also, IJCAI used to have no posters; starting in 2003, two-page posters were added to reduce the number of good papers killed by mistake. It is worth mentioning that IJCAI is run by what looks like a company, "IJCAI Inc." (it is of course not really a company but a foundation). Several awards are handed out at each conference, the two most important being the IJCAI Research Excellence Award and the Computers and Thought Award. The former is a lifetime achievement award, one person per edition, and essentially the highest award in AI (interestingly, of the six people who won the Turing Award mainly for AI, two have not received it); the latter goes to a scientist under 35, one person per edition. The award lectures for these two are a highlight of every IJCAI. In addition, an IJCAI PC member is equivalent to an area chair at other conferences and has a lot of power, because PC members go and find reviewers to do the reviewing, unlike most conferences where the PC members are themselves the reviewers. To check this power, IJCAI's review process assigns two PC members to each paper: the primary PC member finds three reviewers, the secondary PC member finds one.

AAAI (1): The annual conference of AAAI, the American AI association. A very good conference, but its level is unstable: it can be 1+, or 1- or 2+; overall I give it "1". The reason is that its schedule is completely constrained by IJCAI: it is held every year, except that when that year's IJCAI is in North America it is skipped. So in even years, with no IJCAI, it is the best comprehensive AI conference, but since its pull is somewhat weaker than IJCAI's (in particular, far fewer Europeans, and Asians for that matter, turn out for AAAI than for IJCAI), it is still slightly weaker, roughly between 1 and 1+. In odd years, if IJCAI is not in North America, AAAI naturally becomes a conference one notch below IJCAI (1- or 2+). For example, in 2005 there were both IJCAI and AAAI, and the two coordinated so that IJCAI's notifications came out a few days before AAAI's deadline, letting papers rejected by IJCAI be resubmitted to AAAI. During reviewing, the IJCAI PC chair kept urging everyone to hurry, because AAAI was worried that late IJCAI notifications would cause it trouble.

COLT (1): The best conference in computational learning theory, run by ACM and held every year. Computational learning theory can roughly be seen as the intersection of theoretical computer science and machine learning, so some people regard it as a theory conference rather than an AI conference. A friend of mine captured it brilliantly in one sentence: "a small group of mathematicians holding a meeting." Because COLT's field is small, it is basically the same people at the conference every year. A side note: lately far too many sloppy conferences have been organized in China, many with proceedings in LNCS/LNAI, so LNCS/LNAI has pretty much acquired a bad name; unfortunately, LNCS/LNAI also contains some very good conferences, such as COLT.

CVPR (1): One of the best conferences in computer vision and pattern recognition, run by IEEE and held every year. Although the title mentions computer vision, I personally think its flavor is more pattern recognition; in fact it should be the best conference for pattern recognition, while in computer vision ICCV is its equal. IEEE has always had a tendency to turn conferences into "mega-events", and it has already turned some high-quality conferences into mega-events; CVPR may well go the same way. In recent years it has been accepting quite a lot of papers. Recently the chair of the TC responsible for CVPR wrote that, for this community, killing a good paper by mistake is worse than letting a bad one slip through, so shouldn't we reduce the chance of good papers being wrongly killed? So I expect CVPR to expand further next year or the year after.

ICCV (1): Mentioned above under CVPR; one of the best computer vision conferences, run by IEEE. ICCV is held in odd years, rotating among North America, Europe and Asia. It was originally scheduled for Beijing in 2003 but swapped with the 2005 edition in France because of SARS; ICCV'07 will be held in South America (Brazil) for the first time. CVPR is in principle held in North America every year; if ICCV happens to be in North America that year, there is no CVPR.

ICML (1): One of the best machine learning conferences, now run by IMLS and held every year. See the note under NIPS.

NIPS (1): One of the best conferences in neural computation, run by the NIPS Foundation and held every year. Notably, it is held in the same place every year, formerly Denver in the US and now Vancouver in Canada, and it meets at the end of the year, with the proceedings appearing the following year; the NIPS'05 proceedings came out in 2006. The conference name is "Advances in Neural Information Processing Systems", so unlike "standard" machine learning conferences such as ICML and ECML, NIPS contains a fair amount of neuroscience content that is somewhat removed from machine learning. But because the bulk of the conference is machine learning, or closely related to it, many people count NIPS among the best machine learning conferences. The conference is largely controlled by Michael Jordan's academic descendants, so for people in the Jordan camp getting into NIPS is not hard and some not-so-strong work gets in, whereas for people outside that circle it is very hard, because the opening left for "outsiders" is small. So for non-Jordan people, NIPS is harder than ICML. In other words, ICML is more open and small-circle influence matters less than at NIPS, so both North Americans and Europeans respect it, whereas some people (especially some Europeans, including some big names) flatly refuse to submit to NIPS. That is of course not good for the conference itself, but since the Jordan camp is strong, it does not seem to care much. When IMLS (the International Machine Learning Society) elects its board, those eligible for nomination include anyone who has published in ICML, ECML or COLT in the past three years; NIPS is excluded. In any case, it is a very good conference.

ACL (1-): The best conference in computational linguistics / natural language processing, run by the ACL (Association for Computational Linguistics) and held every year.

KR (1-): One of the best conferences on knowledge representation and reasoning, and in fact one of the best conferences in traditional (logic-based) AI. Run by KR Inc.; now held in even years.

SIGIR (1-): The best conference in information retrieval, run by ACM and held every year. It is getting more and more cliquish. Information retrieval arguably is not AI, but because machine learning is used in it more and more, in recent years it has even taken on the flavor of an applied machine learning conference, so I list it here.

SIGKDD (1-): The best conference in data mining, run by ACM and held every year. The conference has a rather short history; after all, compared with other areas, data mining is still a little brother or even a young nephew. A few years ago it was hard to place it in tier 1: its reputation was nowhere near as strong as the other top conferences, and it was relatively easy to get accepted. But by now its place in tier 1 should be beyond doubt, and KDD's quality has been very high in recent years. Since 2000, SIGKDD's full-paper acceptance rate has stayed between 10% and 12%, far lower than IJCAI and ICML. One often hears that KDD is harder than IJCAI and ICML: IJCAI papers are only 6 pages, while KDD papers are 10, and without solid, systematic work it is hard not to leave holes. Quite a few IJCAI regulars also submit to KDD every year, yet few of them manage to get in regularly.

UAI (1-): The name means "Uncertainty in Artificial Intelligence"; it covers representation, reasoning, learning and many other aspects. Run by AUAI (the Association for UAI) and held every year.



AI conferences I know (second and third tier)

(Originally posted by daniel on lilybbs.us.) This is purely personal opinion, for reference only. Tier 1 is listed fairly completely, tier 2 less completely, and tier 3 very incompletely. Conferences with the same grade are listed alphabetically. Loosely speaking, tier 1 is enviable and tier 2 is respectable; and since there are so many AI-related conferences, making it into tier 3 is not bad either.
Tier 2 (this list is incomplete; the areas I know well are covered more fully):

AAMAS (2+): The best agent conference. But "agent" has become a generic concept and almost every AI-related conference now covers it, so AAMAS's decline is very evident.

ECCV (2+): The computer vision conference second only to ICCV; because the field is developing fast, it may get promoted to 1-.

ECML (2+): The machine learning conference second only to ICML. Europeans back it strongly, and some consider it already 1-. I am more conservative and keep it at 2+. Because machine learning is developing fast, this conference's reputation is rising noticeably.

ICDM (2+): The data mining conference second only to SIGKDD, currently on a par with SDM. It is only five years old and has risen astonishingly fast. A few years ago ICDM was not even up to PAKDD; now it has pulled far ahead.

SDM (2+): The data mining conference second only to SIGKDD, currently on a par with ICDM. SIAM has a solid foundation, but its influence within CS is still smaller than ACM's and IEEE's. SDM looks about to be overtaken by ICDM, but for now they are comparable.

ICAPS (2): The best conference in AI planning, formed by merging the former international and European planning conferences. As the area has gradually cooled off, its influence is smaller than it used to be.

ICCBR (2): The best conference on Case-Based Reasoning. The area is not large and has always been lukewarm, so it stays at 2.

COLING (2): The computational linguistics / NLP conference second only to ACL, but the gap to ACL is much larger than the ICCV-ECCV or ICML-ECML gaps.

ECAI (2): The European comprehensive AI conference. It has a long history, but with IJCAI/AAAI above it, it is hard for it to move up.

ALT (2-): Something like a tier-2 version of COLT. But there are not many people doing computational learning theory, and the groups that do it well mostly go to COLT, so quite a bit of what appears at ALT is not computational learning theory.

EMNLP (2-): A decent conference in computational linguistics / NLP. Some consider it on a par with COLING, but I think it is still a bit weaker.

ILP (2-): The best conference on inductive logic programming. But since many other conferences also carry ILP content, it can only hold on to 2-.

PKDD (2-): The European data mining conference, currently ranked fourth among data mining conferences. Europeans would very much like to lift it up, so for years it has been co-located with ECML in the hope that ECML would pull it along. But with ICDM and SDM around, that is no longer likely. So although this year's PKDD and ECML are still co-located, they now review independently (previously you could submit to both at once, stating which conference should consider the paper first, and a paper not accepted by ECML could still be accepted by PKDD).

Tier 3: very incomplete. Also, since there are so many AI-related conferences, making it into tier 3 is already decent, roughly the top 30% of all AI conferences.

ACCV (3+): The Asian computer vision conference; very good for an Asia-Pacific-level conference.

DS (3+): A conference close to data mining, started by the Japanese.

ECIR (3+): The European information retrieval conference; until a few years ago it was only the British IR conference.

ICTAI (3+): IEEE's main AI conference, application-oriented, and a typical example of a conference that IEEE has run into the ground. Its quality used to be decent, but the longer it runs the worse its reputation gets, and unfortunately it seems to be still sliding; by now it can barely hold on to 3+.

PAKDD (3+): The Asia-Pacific data mining conference, currently ranked fifth among data mining conferences.

ICANN (3+): The European neural networks conference; in terms of quality it is the best of the neural network conferences, but people in this field do not value conferences much, and within the field it is less important than IJCNN.

AJCAI (3): The Australian comprehensive AI conference; quite good for a national/regional AI conference.

CAI (3): The Canadian comprehensive AI conference; quite good for a national/regional AI conference.

CEC (3): One of the most important conferences in evolutionary computation; a mega-event. IJCNN, CEC and FUZZ-IEEE are the three most important conferences in computational intelligence (or soft computing); they are often held together, in which case the joint event is called WCCI (World Congress on Computational Intelligence). But this field is unlike other branches of CS and more like other disciplines: it values journals rather than conferences, so the acceptance rate is often around 85%, and the accepted papers range from very high-quality work to beginners' exercises.

FUZZ-IEEE (3): The most important conference on fuzzy systems; a mega-event. See the note under CEC.

GECCO (3): One of the most important conferences in evolutionary computation, comparable to CEC; a mega-event.

ICASSP (3): One of the most important conferences in speech; people in this field do not care much about conferences either.

ICIP (3): One of the best-known conferences in image processing; a mega-event.

ICPR (3): One of the best-known conferences in pattern recognition; a mega-event.

IEA/AIE (3): An AI applications conference. Most conferences nominate only a few papers for best-paper awards, and being nominated is already a high honor; amusingly, this one nominates ten or twenty papers each time and runs several dedicated sessions for the nominated papers, which makes things quite lively.

IJCNN (3): The most important neural networks conference; a mega-event. See the note under CEC.

IJNLP (3): A fairly well-known conference in computational linguistics / NLP.

PRICAI (3): The Asia-Pacific comprehensive AI conference. Its history is not short, but there are too many comprehensive AI conferences that are better or comparable, so it is hard for it to move up.

How to Test Whether Two Sets of Experimental Results Differ Significantly


There are two methods:

One is the box plot.
The other is the paired t-test.
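Here is a minimal R sketch of both methods; the two accuracy vectors are hypothetical, e.g. per-fold accuracies of two algorithms measured on the same ten folds.

    # Compare two algorithms' per-fold accuracies on the same folds
    acc_a <- c(0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79)
    acc_b <- c(0.83, 0.80, 0.85, 0.82, 0.83, 0.80, 0.86, 0.82, 0.83, 0.81)

    boxplot(acc_a, acc_b, names = c("A", "B"))  # method 1: side-by-side box plots
    t.test(acc_a, acc_b, paired = TRUE)         # method 2: paired t-test

A small p-value from the paired t-test indicates that the difference between the two algorithms is statistically significant.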

Commonly Used Data Sets for Data Mining

1. Climate monitoring data set: http://cdiac.ornl.gov/ftp/ndp026b

2. Several useful sites for downloading test data sets

http://www.cs.toronto.edu/~roweis/data.html
http://kdd.ics.uci.edu/summary.task.type.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.phys.uni.torun.pl/~duch/software.html
The Reuters data set can be found at: http://www.research.att.com/~lewis/reuters21578.html

Various data sets are available at:
http://kdd.ics.uci.edu/summary.data.type.html

For text classification there is another usable data set, the rainbow data set:
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

3. I have collected many test data sets; those of you writing papers will surely need them, at least to check how well an algorithm works.
Some of the links may no longer work, but surely some still do:

Machine learning data sets collected by UCI
ftp://pami.sjtu.edu.cn/
http://www.ics.uci.edu/~mlearn//MLRepository.htm

StatLib
http://liama.ia.ac.cn/SCILAB/scilabindexgb.htm
http://lib.stat.cmu.edu/

Sample databases
http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLRepository.html

A site on data mining about funds
http://www.gotofund.com/index.asp

http://lans.ece.utexas.edu/~strehl/

Reuters data set
http://www.research.att.com/~lewis/reuters21578.html

Various data sets:
http://kdd.ics.uci.edu/summary.data.type.html
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html
http://lib.stat.cmu.edu/datasets/
http://dctc.sjtu.edu.cn/adaptive/datasets/
http://fimi.cs.helsinki.fi/data/
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
http://miles.cnuce.cnr.it/~palmeri/datam/DCI/

Text classification & Web
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

http://www.w3.org/TR/WD-logfile-960221.html
http://www.w3.org/Daemon/User/Config/Logging.html#AccessLog
http://www.w3.org/1998/11/05/WC-workshop/Papers/bala2.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.web-caching.com/traces-logs.html
http://www-2.cs.cmu.edu/webkb
http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf
http://www.cs.cornell.edu/projects/kddcup/index.html


Time series data
http://www.stat.wisc.edu/~reinsel/bjr-data/

Test data for the Apriori algorithm
http://www.almaden.ibm.com/cs/quest/syndata.html

Links to data generators
http://www.cse.cuhk.edu.hk/~kdd/data_collection.html
http://www.almaden.ibm.com/cs/quest/syndata.html


Association rule mining:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData

WEKA:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
1. A jar file containing 37 classification problems, originally obtained from the UCI repository
http://prdownloads.sourceforge.net/weka/datasets-UCI.jar
2. A jar file containing 37 regression problems, obtained from various sources
http://prdownloads.sourceforge.net/weka/datasets-numeric.jar
3. A jar file containing 30 regression datasets collected by Luis Torgo
http://prdownloads.sourceforge.net/weka/regression-datasets.jar

Cancer gene data:
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

Financial data:
http://lisp.vse.cz/pkdd99/Challenge/chall.htm



Download the Financial Data (~17.5M zipped file, ~67M unzipped data)
Download the Medical Data (~2M zipped file, ~6M unzipped data)
http://lisp.vse.cz/pkdd99/Challenge/chall.htm


KDnuggets related data set links:
http://www.kdnuggets.com/datasets/index.html