INVESTIGATING THE TASK PERFORMANCE OF PROBABILISTIC TOPIC MODELS: AN EMPIRICAL STUDY OF PLSA AND LDA

Introduction and problem statement: This article deals with the task performance of PLSA (Probabilistic Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). Much work has reported promising performance of topic models, but none of it has systematically investigated that performance across tasks. As a result, some critical questions that affect every application of topic models remain largely unanswered, in particular:

• How should one choose between competing models?
• How do multiple local maxima affect task performance? (An EM sketch at the end of this summary illustrates the issue.)
• How should the parameters of a topic model be set?

The authors address these questions by conducting a systematic investigation of two representative probabilistic topic models, PLSA and LDA, on three representative text mining tasks: document clustering, text categorization, and ad hoc retrieval.

Important Terms:

Probabilistic Topic Models, the Basic Concepts: The idea behind probabilistic topic models is that documents are mixtures of topics, where a topic is represented by a multinomial distribution over words. ϕ_w^(j) = P(w | z = j) denotes the multinomial distribution over words for topic j, and θ_j^(d) = P(z = j | d) denotes the multinomial distribution over topics for document d. The parameters ϕ and θ indicate, respectively, which words are important for which topic and which topics are important for a particular document. (A small code sketch of these two distributions appears at the end of this summary.)

Probabilistic Latent Semantic Analysis (PLSA): PLSA was introduced by Hofmann. A document d is considered a sample of the following mixture model, i.e., the probability distribution over words w for a given document d:

P(w | d) = Σ_j P(z = j | d) P(w | z = j) = Σ_j θ_j^(d) ϕ_w^(j)

The word-topic distributions ϕ an...

[... middle of paper ...]

... answered. The authors address these issues in the current empirical study of PLSA and LDA. Chang et al. (2009) conduct user studies to quantitatively compare the semantic meaning of topics inferred by PLSA and LDA; their goal is to quantify the interpretability of topics using human judgments. The authors of this article instead study the task performance of topic models in three standard text mining applications, which can be quantified objectively with standard measures, so this work is complementary to theirs.

Previous work: As stated above, much work has reported promising performance of topic models, such as the text categorization results in the original LDA paper (Blei et al. 2003). Wei and Croft (2006) show that LDA can improve the state of the art in information retrieval within the language modeling framework, among other results.
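As an illustration (my own sketch, not from the paper; all sizes and names are placeholders), ϕ and θ can be stored as row-stochastic matrices, and the mixture P(w | d) is then a single matrix product:

    import numpy as np

    # Illustrative sizes (assumptions, not taken from the paper).
    K, V, D = 3, 1000, 50  # topics, vocabulary size, documents

    rng = np.random.default_rng(0)

    # phi[j]  : multinomial over words for topic j      (rows sum to 1)
    # theta[d]: multinomial over topics for document d  (rows sum to 1)
    phi = rng.dirichlet(np.ones(V), size=K)    # shape (K, V)
    theta = rng.dirichlet(np.ones(K), size=D)  # shape (D, K)

    # PLSA mixture: P(w | d) = sum_j theta[d, j] * phi[j, w]
    p_w_given_d = theta @ phi                  # shape (D, V)

    assert np.allclose(p_w_given_d.sum(axis=1), 1.0)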
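To make the local-maxima question concrete, here is a rough sketch of fitting PLSA by EM with random restarts. It follows the standard EM updates for the mixture model above; it is my own illustration, not the authors' implementation, and the toy corpus is a random placeholder:

    import numpy as np

    def plsa_em(counts, K, n_iter=100, seed=0):
        """Fit PLSA by EM on a document-word count matrix of shape (D, V).
        Returns (theta, phi, log_likelihood)."""
        rng = np.random.default_rng(seed)
        D, V = counts.shape
        theta = rng.dirichlet(np.ones(K), size=D)  # P(z | d), shape (D, K)
        phi = rng.dirichlet(np.ones(V), size=K)    # P(w | z), shape (K, V)
        for _ in range(n_iter):
            # E-step: posterior P(z | d, w) over topics for each (d, w).
            joint = theta[:, :, None] * phi[None, :, :]          # (D, K, V)
            posterior = joint / joint.sum(axis=1, keepdims=True)
            # M-step: re-estimate phi and theta from expected counts.
            expected = counts[:, None, :] * posterior            # (D, K, V)
            phi = expected.sum(axis=0)
            phi /= phi.sum(axis=1, keepdims=True)
            theta = expected.sum(axis=2)
            theta /= theta.sum(axis=1, keepdims=True)
        ll = np.sum(counts * np.log(theta @ phi))
        return theta, phi, ll

    # Toy corpus: random counts (a placeholder, not real data).
    data_rng = np.random.default_rng(42)
    counts = data_rng.poisson(1.0, size=(20, 30)).astype(float)

    # Different initializations typically converge to different local
    # maxima, visible as different final log-likelihoods.
    for seed in range(3):
        _, _, ll = plsa_em(counts, K=4, seed=seed)
        print(f"restart {seed}: log-likelihood = {ll:.2f}")

The spread in final log-likelihoods across restarts is the local-maxima effect whose impact on downstream tasks the authors investigate.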