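The error occurs because your loop runs over different numbers of clusters n. In the first iteration, n_clusters is 1, so every sample is assigned the same label.
In other words, there is only one cluster, with label 0 (hence np.unique(km.labels_) prints array([0], dtype=int32)).
silhouette_score requires at least 2 distinct cluster labels, so this raises the error. The error message says exactly that. For example:

    from sklearn import datasets
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import numpy as np

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    km = KMeans(n_clusters=3)
    km.fit(X)  # KMeans is unsupervised, so y is not needed here

    # check how many unique labels you have
    np.unique(km.labels_)
    # array([0, 1, 2], dtype=int32)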
We have 3 distinct clusters / cluster labels.

    silhouette_score(X, km.labels_, metric='euclidean')
    # 0.38788915189699597

The function works fine.
With a single cluster, the same call fails:

    km2 = KMeans(n_clusters=1)
    km2.fit(X)
    silhouette_score(X, km2.labels_, metric='euclidean')

    ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
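If the number of labels cannot be controlled ahead of time, one option is to check it before calling silhouette_score. The following is a minimal sketch, not part of the original answer: safe_silhouette is a hypothetical helper name, and returning NaN for the undefined case is an assumption about what the caller wants.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels, metric="euclidean"):
    """Return the silhouette score, or NaN when it is undefined.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    n_labels = len(np.unique(labels))
    n_samples = len(labels)
    # silhouette_score only accepts 2 <= n_labels <= n_samples - 1
    if n_labels < 2 or n_labels > n_samples - 1:
        return float("nan")
    return silhouette_score(X, labels, metric=metric)


X = load_iris().data
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
three_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

score_k1 = safe_silhouette(X, one_cluster.labels_)     # undefined case, NaN
score_k3 = safe_silhouette(X, three_clusters.labels_)  # regular score
print(score_k1, score_k3)
```

The guard mirrors the range stated in the error message, so the single-cluster case yields NaN instead of raising.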
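Question

I am trying to compute the silhouette score while searching for the optimal number of clusters to create, but I get this error message:

    ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

I cannot understand the reason for this. Below is the code I use to cluster and to compute the silhouette score.
I read a CSV file containing the text to cluster and run K-Means over a range of n cluster values. What could be causing this error?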
    # Create cluster using K-Means
    # Only creates graph
    import matplotlib
    # matplotlib.use('Agg')
    import re
    import os
    import nltk, math, codecs
    import csv
    from nltk.corpus import stopwords
    from gensim.models import Doc2Vec
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.metrics import silhouette_score

    model_name = checkpoint_save_path
    loaded_model = Doc2Vec.load(model_name)

    # Load the test csv file
    data = pd.read_csv(test_filename)
    overview = data['overview'].astype('str').tolist()
    overview = filter(bool, overview)
    vectors = []

    def split_words(text):
        return ''.join([x if x.isalnum() or x.isspace() else ' ' for x in text]).split()

    def preprocess_document(text):
        sp_words = split_words(text)
        return sp_words

    for i, t in enumerate(overview):
        vectors.append(loaded_model.infer_vector(preprocess_document(t)))

    sse = {}
    silhouette = {}

    for k in range(1, 15):
        km = KMeans(n_clusters=k, max_iter=1000, verbose=0).fit(vectors)
        sse[k] = km.inertia_
        # FOLLOWING LINE CAUSES ERROR
        silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')

    best_cluster_size = 1
    min_error = float('inf')

    for cluster_size in sse:
        if sse[cluster_size] < min_error:
            min_error = sse[cluster_size]
            best_cluster_size = cluster_size

    print(sse)
    print('====')
    print(silhouette)
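A corrected version of the loop above might start k at 2, so that silhouette_score always sees at least two labels. This is a sketch under assumptions: the Doc2Vec vectors are replaced by synthetic data from make_blobs so the example runs on its own, and n_init and random_state are added only for reproducibility. It also picks the best k by the highest silhouette score instead of the lowest SSE, which is a swapped-in criterion and not what the question's code does; SSE tends to keep shrinking as k grows, so its minimum usually lands on the largest k tried.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the Doc2Vec vectors (an assumption for this sketch).
vectors, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)

sse = {}
silhouette = {}

# Start at k=2: the silhouette score is undefined for a single cluster.
for k in range(2, 15):
    km = KMeans(n_clusters=k, max_iter=1000, n_init=10, random_state=0).fit(vectors)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(vectors, km.labels_, metric="euclidean")

# Choose the k with the highest silhouette score (swapped-in criterion).
best_k = max(silhouette, key=silhouette.get)
print(best_k, silhouette[best_k])
```

If the SSE for k=1 is still wanted, for example for an elbow plot, it can be computed in a separate step that skips the silhouette call.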