TY - JOUR
T1 - Nonparametric multi-assignment clustering
AU - Liu, Chien-Liang
AU - Hsaio, Wen Hoar
AU - Chang, Tao Hsing
AU - Jou, Tzai Min
N1 - Publisher Copyright:
© 2017 - IOS Press and the authors. All rights reserved.
PY - 2017
Y1 - 2017
N2 - Multi-label learning has attracted significant attention from machine learning and data mining over the last decade. Although many multi-label classification algorithms have been devised, few research studies focus on multi-assignment clustering (MAC), in which a data instance can be assigned to multiple clusters. The MAC problem is practical in many application domains, such as document clustering, customer segmentation and image clustering. Additionally, specifying the number of clusters is always a difficult but critical problem for a certain class of clustering algorithms. Hence, this work proposes a nonparametric multi-assignment clustering algorithm called multi-assignment Chinese restaurant process (MACRP), which allows the model complexity to grow as more data instances are observed. The proposed algorithm determines the number of clusters from data, so it provides a practical model to process massive data sets. In the proposed algorithm, we devise a novel prior distribution based on the similarity graph to achieve the goal of multi-assignment, and propose a Gibbs sampling algorithm to carry out posterior inference. The implementation in this work uses collapsed Gibbs sampling and compares with several methods. Additionally, previous evaluation metrics used by multi-label classification are inappropriate for MAC, since label information is unavailable. This work further devises an evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems. We conduct experiments on two real data sets, and the experimental results indicate that the proposed method is competitive and outperforms the alternatives on most data sets.
AB - Multi-label learning has attracted significant attention from machine learning and data mining over the last decade. Although many multi-label classification algorithms have been devised, few research studies focus on multi-assignment clustering (MAC), in which a data instance can be assigned to multiple clusters. The MAC problem is practical in many application domains, such as document clustering, customer segmentation and image clustering. Additionally, specifying the number of clusters is always a difficult but critical problem for a certain class of clustering algorithms. Hence, this work proposes a nonparametric multi-assignment clustering algorithm called multi-assignment Chinese restaurant process (MACRP), which allows the model complexity to grow as more data instances are observed. The proposed algorithm determines the number of clusters from data, so it provides a practical model to process massive data sets. In the proposed algorithm, we devise a novel prior distribution based on the similarity graph to achieve the goal of multi-assignment, and propose a Gibbs sampling algorithm to carry out posterior inference. The implementation in this work uses collapsed Gibbs sampling and compares with several methods. Additionally, previous evaluation metrics used by multi-label classification are inappropriate for MAC, since label information is unavailable. This work further devises an evaluation metric for MAC based on the characteristics of clustering and multi-assignment problems. We conduct experiments on two real data sets, and the experimental results indicate that the proposed method is competitive and outperforms the alternatives on most data sets.
KW - Chinese restaurant process (CRP)
KW - Multi-assignment clustering
KW - Non-parametric Bayesian
UR - http://www.scopus.com/inward/record.url?scp=85027988796&partnerID=8YFLogxK
U2 - 10.3233/IDA-160105
DO - 10.3233/IDA-160105
M3 - Article
AN - SCOPUS:85027988796
SN - 1088-467X
VL - 21
SP - 893
EP - 911
JO - Intelligent Data Analysis
JF - Intelligent Data Analysis
IS - 4
ER -