2016-08-31

Word2Vec 4 Python

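Word2Vec was created by a research team led by Tomas Mikolov at Google. It uses a shallow neural network to learn word vectors from a corpus, mapping each word to an n-dimensional vector. Most importantly, unlike the simple Bag of Words model, the vectors Word2Vec produces are dense and their values carry meaning: distances and directions between vectors reflect relationships between words.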
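
To make the contrast with Bag of Words concrete, here is a minimal sketch. The dense vectors are hand-picked toy values, not output from a trained model; the only general fact shown is that distinct one-hot vectors are always orthogonal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "apple"]

# One-hot (Bag of Words style) vectors: one dimension per vocabulary word.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense vectors, chosen by hand for illustration only.
dense = {
    "king":  [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

# Two distinct one-hot vectors never share a non-zero dimension.
one_hot_sim = cosine(one_hot["king"], one_hot["queen"])

# Dense vectors can place related words closer than unrelated ones.
dense_related = cosine(dense["king"], dense["queen"])
dense_unrelated = cosine(dense["king"], dense["apple"])
```

With one-hot vectors every pair of different words looks equally unrelated, while the dense toy vectors rank "queen" closer to "king" than "apple" is.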

This post gives a brief walkthrough of how to use a Word2Vec model in Python.

1. Preparing the input

Word2Vec is implemented in the gensim package as gensim.models.Word2Vec, so the gensim module has to be installed first.

Training a word2vec model takes a sequence of sentences as input. Each sentence is a list of words, and stop words should be filtered out as well.


import jieba


def get_stop_words():
    # Load stop words from a file, one word per line.
    stop_word_set = set()

    with open('stop_words') as f:
        for word in f:
            stop_word_set.add(word.strip())

    return stop_word_set


def sen2words(sentence, stop_word_set):
    # Segment the sentence into words with jieba,
    # then filter out the stop words.
    words_list = []

    seg_words = jieba.lcut(sentence, cut_all=False)

    for word in seg_words:
        if word not in stop_word_set:
            words_list.append(word)

    return words_list

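The [jieba](https://github.com/fxsjy/jieba "GitHub homepage") Chinese word segmentation library does the segmentation, a set is used to filter out the stop words, and the function returns the list of words for each sentence.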
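
The shape of the data this step produces can be sketched without jieba installed. In the example below, str.split() stands in for jieba.lcut() purely for illustration (it is not a Chinese tokenizer), and the stop words, sentences, and the helper name filter_tokens are all made up.

```python
def filter_tokens(tokens, stop_word_set):
    """Drop stop words from an already tokenized sentence."""
    return [t for t in tokens if t not in stop_word_set]

# Hypothetical stop words and raw sentences, for illustration only.
stop_words = {"the", "a", "of"}
raw_sentences = [
    "the cat sat on a mat",
    "a history of the city",
]

# str.split() is a stand-in for jieba.lcut() here.
sentences = [filter_tokens(s.split(), stop_words) for s in raw_sentences]

# sentences is a list of token lists, the input shape described above:
# [['cat', 'sat', 'on', 'mat'], ['history', 'city']]
```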

2. Training

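Taking the novel dataset I am currently working on as an example, training does not require loading all of the data into memory at once. An iterable built on a generator can stream it instead: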

import os


class MySentences(object):

    # Stream the data lazily, one sentence at a time.

    def __init__(self, files_dir, stop_word_set):
        self.files_dir = files_dir
        self.stop_word_set = stop_word_set

    def __iter__(self):

        # Skip files whose names end with '_sub'.
        files = [f for f in os.listdir(self.files_dir) if not f.endswith('_sub')]

        for file in files:
            with open(os.path.join(self.files_dir, file)) as f:
                # Iterate over the file object directly so the whole
                # file is never read into memory at once.
                for line in f:
                    yield sen2words(line, self.stop_word_set)

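The MySentences class defines a custom iteration process. Any other processing the words need, such as case conversion, encoding handling, removing digits, or extracting named entities, can also be done inside MySentences.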
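
As a rough sketch of such extra processing, the helper below lowercases tokens and strips digits. The name normalize_tokens is hypothetical; a function like it could be applied to the result of sen2words before each sentence is yielded.

```python
import re

_DIGITS = re.compile(r"\d+")

def normalize_tokens(tokens):
    """Lowercase tokens, strip digits, and drop tokens that end up empty."""
    cleaned = []
    for token in tokens:
        token = _DIGITS.sub("", token.lower()).strip()
        if token:
            cleaned.append(token)
    return cleaned

example = normalize_tokens(["Word2Vec", "2016", "Python", " "])
# example == ['wordvec', 'python']
```

Whether digits should be removed from inside a token (as happens to "Word2Vec" here) depends on the corpus; this is only one possible choice.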

With the input handled, training can begin. The gensim.models.Word2Vec constructor accepts a number of parameters:

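import gensim


def train_save(files_dir, modelname):
    stop_word_set = get_stop_words()
    sentences = MySentences(files_dir, stop_word_set)

    num_features = 200
    min_word_count = 20
    num_workers = 48
    context = 20
    epoch = 20
    sample = 1e-5
    # Note: gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`.
    model = gensim.models.Word2Vec(
        sentences,
        size = num_features,        # dimensionality of the word vectors (hidden layer size), default = 100
        min_count = min_word_count, # vocabulary filter: words seen fewer than min_count times are dropped, default = 5
        workers = num_workers,      # number of worker threads; parallel training requires Cython
        sample = sample,            # downsampling threshold for high-frequency words
        window = context,           # maximum distance between the current word and a context word
        iter = epoch,               # number of training passes over the corpus
        )

    model.save(modelname)

    return model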
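
The effect of min_count can be illustrated with a few lines of plain Python. This is a sketch of the filtering idea only, not gensim's internal implementation; build_vocab and the toy sentences are made up for the example.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count tokens and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized sentences.
toy_sentences = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "mat"],
]

vocab = build_vocab(toy_sentences, min_count=2)
# 'cat', 'sat' and 'mat' each occur twice and are kept;
# 'ate', 'fish' and 'dog' occur once and are dropped.
```

On a large corpus a threshold such as 20 removes rare words, which usually have too few contexts to learn a useful vector from.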

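Training iterates over the dataset more than once: the first pass collects the words and their frequencies to build an internal dictionary tree, and the following passes (as many as iter specifies) train the neural network. With iter=1 that makes two passes in total.
If the data can only be traversed once, the two steps can be run separately: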

>>> model = gensim.models.Word2Vec() # an empty model, no training
>>> model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
>>> model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
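
The difference between a one-shot generator and a restartable iterable such as MySentences can be seen in plain Python. The names below are made up for the illustration.

```python
def sentence_generator():
    """A one-shot generator: exhausted after a single pass."""
    yield ["first", "sentence"]
    yield ["second", "sentence"]

class RestartableSentences:
    """An iterable whose __iter__ starts a fresh pass every time."""
    def __iter__(self):
        yield ["first", "sentence"]
        yield ["second", "sentence"]

gen = sentence_generator()
gen_pass_1 = list(gen)
gen_pass_2 = list(gen)          # empty: the generator is already exhausted

restartable = RestartableSentences()
iter_pass_1 = list(restartable)
iter_pass_2 = list(restartable)  # the same two sentences again
```

A second pass over the generator silently yields nothing, which is why multi-pass training wants an object that defines __iter__ rather than a bare generator.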

3. Saving and loading the model

>>> model.save('mymodel')
>>> new_model = gensim.models.Word2Vec.load('mymodel')
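>>> # load a model produced by the original C word2vec tool
>>> # (newer gensim versions expose this as KeyedVectors.load_word2vec_format)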
>>> model = Word2Vec.load_word2vec_format('vectors.txt', binary=False)
>>> # using gzipped/bz2 input works too, no need to unzip:
>>> model = Word2Vec.load_word2vec_format('vectors.bin.gz', binary=True)
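
For orientation, here is a sketch of what a text-format vectors file looks like and how it could be read by hand. It assumes the commonly described layout (a header line with the vocabulary size and the dimensionality, then one line per word followed by its components); parse_word2vec_text is a hypothetical helper, not part of gensim, and the binary variant is not handled.

```python
import io

def parse_word2vec_text(stream):
    """Parse text-format word2vec vectors into a {word: vector} dict."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match file contents")
    return vectors

# Hypothetical file contents: 2 words, 3 dimensions each.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
vectors = parse_word2vec_text(sample)
```

In practice the gensim loader shown above is the better choice; the sketch is only meant to show what is inside the file.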

An existing model can also be loaded and trained further online:

>>> model = gensim.models.Word2Vec.load('mymodel')
>>> model.train(more_sentences)

4. Using the model

Word similarity queries:

>>> model['computer']  # get a word's vector directly, like a dict
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> ll = model.most_similar('lady')  # get the list of most similar words
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')   # get the similarity of two words
0.73723527
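
The idea behind the positive/negative query can be sketched with vector arithmetic and cosine similarity. The 2-D vectors are hand-made toy values and the most_similar function below is a simplified stand-in; gensim's own implementation normalizes vectors and differs in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors: axis 0 loosely means "royalty", axis 1 "gender".
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def most_similar(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# woman + king - man = [1, -1], which coincides with the toy "queen" vector.
best = most_similar(["woman", "king"], ["man"], toy)
```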