ldatopicnumber-创新互联

Hi Vikas --

the optimum number of topics (K in LDA) depends on at least two factors: 
First, your data set may have an intrinsic number of topics, i.e., it may derive 
from natural clusters in your data. In the best case, this number minimises 
perplexity (ppx). A non-parametric approach like HDP would ideally 
result in the same K as the one that minimises ppx for LDA. The second type of 
influence is that of the hyperparameters. If you fix the Dirichlet parameters 
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and 
(phi | beta)), you bias the optimum K. For instance, smaller alpha will force 
more "decisive" choices of z for each token, leading to a concentration of 
theta on fewer weights, which influences K.
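
To make the effect of alpha on theta concrete, here is a minimal sketch that samples document-topic vectors from a symmetric Dirichlet and reports how much weight the largest topic receives on average. It is only an illustration of the prior, not of a fitted LDA model. The function names, the choice of 20 topics and the alpha values are assumptions made for this example. In this sketch, small alpha puts most of the mass on a few topics and large alpha spreads it out.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one theta from a symmetric Dirichlet(alpha) over k topics.

    Uses the standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_weight(alpha, k=20, n_docs=2000, seed=0):
    """Average weight of the single largest topic over n_docs sampled thetas."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs


# Illustrative alpha values (assumed, not taken from the thread).
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha:<5} mean largest theta weight = {mean_max_weight(alpha):.2f}")
```

A value close to 1 means a typical document is dominated by one topic; a value close to 1/k means the weights are nearly uniform.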

Trouble minimizing perplexity in LDA


I am running LDA from Mark Steyvers' MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (e.g., words such as Apache and the Java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases as the number of topics increases. I tried different values of ALPHA, but it made no difference.


I need to find the optimal number of topics, and for that the perplexity plot should reach a minimum. Please suggest what may be wrong.

The definition of the perplexity of a topic model, and details of how it is calculated, are explained in this post.
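
The linked post is not reproduced here. As a rough sketch, and assuming the commonly used held-out definition (the exponential of the negative total log-likelihood divided by the total number of test tokens), perplexity can be computed as below. The function name and its inputs are hypothetical; how the per-document log-likelihoods are estimated depends on the toolkit.

```python
import math


def perplexity(doc_log_likelihoods, doc_token_counts):
    """Held-out perplexity: exp(-sum_d log p(w_d) / sum_d N_d).

    doc_log_likelihoods: natural-log likelihood of each test document.
    doc_token_counts: number of tokens in each test document.
    """
    total_tokens = sum(doc_token_counts)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)
```

As a sanity check, a model that assigns every token a uniform probability over a vocabulary of V words has a perplexity of V, so lower values mean the model is less "surprised" by the test data.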

Edit: I played with the hyperparameters alpha and beta, and now perplexity seems to reach a minimum. It is not clear to me how these hyperparameters affect perplexity. Initially I was plotting results up to 200 topics without any success. Now, over the same range, the minimum is reached at around 50-60 topics (which matched my intuition) after modifying the hyperparameters. Also, as this post notes, you bias the optimal number of topics according to the specific values of the hyperparameters.
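
The difference between the two situations described above (a curve that keeps decreasing versus one with a minimum inside the tested range) can be checked mechanically. The following is a small sketch under the assumption that test perplexity has already been measured for several values of K; `pick_num_topics` is a hypothetical helper, and the numbers in the usage lines are made-up placeholders, not results from the Lucene corpus.

```python
def pick_num_topics(perplexity_by_k):
    """Return (best_k, is_interior) for a dict mapping K to test perplexity.

    is_interior is False when the minimum sits at the smallest or largest K
    tried, which suggests the curve has not turned around within the range.
    """
    if not perplexity_by_k:
        raise ValueError("no perplexity measurements given")
    ks = sorted(perplexity_by_k)
    best_k = min(ks, key=lambda k: perplexity_by_k[k])
    is_interior = ks[0] < best_k < ks[-1]
    return best_k, is_interior


# Placeholder values for illustration only.
still_decreasing = {10: 900.0, 50: 700.0, 100: 650.0, 200: 640.0}
has_minimum = {10: 900.0, 50: 700.0, 100: 750.0, 200: 820.0}
print(pick_num_topics(still_decreasing))
print(pick_num_topics(has_minimum))
```

When the minimum falls on the boundary, the options are to extend the range of K or, as in the edit above, to revisit alpha and beta before rerunning the sweep.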

machine-learning topic-models hyperparameter


edited Sep 15 '12 at 2:13

asked Sep 14 '12 at 5:22

abhinavkulkarni

Many of us probably don't know what perplexity means and what a perplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54

@MichaelChernick: I edited the post to include a link detailing the perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27

Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52

How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22

@Nick: Indeed, HDP, a nonparametric topic modelling algorithm, is an alternative to LDA in which you don't have to tune hyperparameters. However, at this point I would like to stick to LDA and understand how and why perplexity behaviour changes so drastically with small adjustments to the hyperparameters. Also, my corpus is quite large. For example, I have tokenized the Apache Lucene source code, with ~1800 Java files and 367K lines of source code. So that's a pretty big corpus, I guess. – abhinavkulkarni Sep 15 '12 at 2:21

1 Answer


You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which, according to this paper, makes the model much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and you can set the number of topics pretty high without negatively affecting results.

In my experience, hyperparameter optimization and asymmetric priors gave significantly better topics than training without them, but I haven't tried the MATLAB Topic Modelling toolkit.
