Readers familiar with TF-IDF will surely object: won't the TF-IDF vocabulary get huge too? With a large sample set and all kinds of parameter values, won't the feature vectors end up hopelessly sparse? My answer to this problem is to post-process the TF-IDF matrix: sum the TF-IDF entries of all tokens that share the same parameter key. Let the set of parameter keys be K = {k1, k2, …, kn} and the TF-IDF vocabulary be x = {x1, x2, …, xm}. Then the feature value for each parameter key is:
- vn = ∑ TF-IDF(x), for x ∈ {x | x startswith 'kn='}
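To see the collapse in action, here is a minimal sketch on three hypothetical request strings (the sample strings and the `key_columns` name are my own; the grouping-by-key-prefix logic is the technique described above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy request strings, each a space-separated list of key=value pairs.
samples = [
    "id=1 name=alice ",
    "id=2 name=bob ",
    "id=admin name=alice ",
]

# Tokenize whole key=value pairs: any run of 2+ non-space characters.
vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b\S\S+\b")
tfidf = vectorizer.fit_transform(samples)

# Group vocabulary column indices by the parameter key before '='.
key_columns = {}
for token, col in vectorizer.vocabulary_.items():
    key_columns.setdefault(token.split('=')[0], []).append(col)

# Each sample collapses from one column per token to one summed value per key.
dense = tfidf.toarray()
collapsed = np.array([[dense[i, cols].sum() for cols in key_columns.values()]
                      for i in range(dense.shape[0])])
print(collapsed.shape)  # (3, 2): three samples, two keys ('id' and 'name')
```

The five distinct tokens (`id=1`, `id=2`, `id=admin`, `name=alice`, `name=bob`) shrink to just two columns, one per key, so the dimensionality no longer grows with the variety of parameter values.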
The concrete code lives in vectorize/vectorizer.py:
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

for path, strs in path_buckets.items():
    if not strs:
        continue
    vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r"(?u)\b\S\S+\b")
    try:
        tfidf = vectorizer.fit_transform(strs)
        # putting same key's indices together
        param_index = {}
        for kv, index in vectorizer.vocabulary_.items():
            k = kv.split('=')[0]
            if k in param_index:
                param_index[k].append(index)
            else:
                param_index[k] = [index]
        # shrinking tfidf vectors: one summed value per parameter key
        tfidf_vectors = []
        for vector in tfidf.toarray():
            v = []
            for param, index in param_index.items():
                v.append(np.sum(vector[index]))
            tfidf_vectors.append(v)
        # other features
        other_vectors = []
        for s in strs:
            ov = []
            kvs = s.split(' ')[:-1]
            lengths = np.array([len(kv) for kv in kvs])
            # param count
            ov.append(len(kvs))
            # mean kv length
            ov.append(np.mean(lengths))
            # max kv length
            ov.append(np.max(lengths))
            # min kv length
            ov.append(np.min(lengths))
            # kv length std
            ov.append(np.std(lengths))
            other_vectors.append(ov)
        tfidf_vectors = np.array(tfidf_vectors)
        other_vectors = np.array(other_vectors)
        vectors = np.concatenate((tfidf_vectors, other_vectors), axis=1)
    except ValueError:
        # fit_transform raises ValueError when no token survives the pattern
        continue
```
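To make the five handcrafted statistics concrete, here is a minimal sketch on one hypothetical request string of my own; the trailing space matches the `split(' ')[:-1]` convention in the loop above:

```python
import numpy as np

# One hypothetical request string: space-separated key=value pairs
# with a trailing space, as produced earlier in the pipeline.
s = "id=1024 name=alice token=abc123 "
kvs = s.split(' ')[:-1]          # ['id=1024', 'name=alice', 'token=abc123']
lengths = np.array([len(kv) for kv in kvs])  # [7, 10, 12]

# param count, mean/max/min kv length, kv length std
features = [len(kvs), lengths.mean(), lengths.max(),
            lengths.min(), lengths.std()]
print(features)  # [3, 9.67, 12, 7, 2.05] (rounded)
```

These length statistics complement the TF-IDF part: an injected payload typically inflates the max and the standard deviation of the key=value lengths even when its tokens are unseen.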