Training Your Own ELMo with the Open-Source Code from HIT-SCIR (Harbin Institute of Technology)

xiaoxiao, 2023-10-15

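This post uses ELMoForManyLangs from the SCIR lab at Harbin Institute of Technology.

Link: https://github.com/HIT-SCIR/ELMoForManyLangs

Usage:

1. git clone the repository to your local machine.

2. The Downloads section of the repository provides pretrained language models for many languages (including Simplified Chinese). Download the one you need; each downloaded model ships with its own config.

3. Run the setup command:

python setup.py install

4. In the model's config file (e.g. zhs.model/config.json), set config_path to the relative path of cnn_50_100_512_4096_sample.json.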
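Step 4 can be done by hand in a text editor, but a small script avoids typos in the JSON. The following is a minimal sketch, assuming config.json is a flat JSON object with a config_path key and that the path is resolved relative to the model directory. The helper name set_config_path is made up for this example and is not part of ELMoForManyLangs.

```python
import json
from pathlib import Path


def set_config_path(model_dir, config_file):
    """Rewrite "config_path" in <model_dir>/config.json and return the updated settings.

    Hypothetical helper for illustration; all other keys are left untouched.
    """
    config_json = Path(model_dir) / "config.json"
    settings = json.loads(config_json.read_text(encoding="utf-8"))
    settings["config_path"] = str(config_file)
    config_json.write_text(
        json.dumps(settings, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return settings
```

For example, if cnn_50_100_512_4096_sample.json has been copied into zhs.model/, calling set_config_path("zhs.model", "cnn_50_100_512_4096_sample.json") would point the model at it.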

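How do I fine-tune ELMo?

When you only use the embeddings that ELMo provides, class Embedder in ELMoForManyLangs/elmo.py calls model.eval() on line 168, and my own code wraps the call to the ELMo embedding in with torch.no_grad(). Together these ensure ELMo is not modified, and they also speed things up and reduce GPU memory use.

By the same logic, to fine-tune ELMo, remove the model.eval() call on line 168 and do not wrap the ELMo call in with torch.no_grad().

(This approach needs more GPU memory.)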
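The difference between the two modes can be shown with plain PyTorch. This is only a sketch: TinyEncoder is a hypothetical stand-in for the ELMo model, not the real Embedder class, and the two helper functions are invented for illustration. It shows that eval() plus torch.no_grad() yields outputs with no gradient history, while train mode without no_grad lets a loss back-propagate into the encoder weights.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the ELMo model (NOT the real Embedder)."""

    def __init__(self, vocab_size=100, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.dropout = nn.Dropout(p=0.5)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.dropout(self.embed(token_ids)))


def encode_frozen(encoder, token_ids):
    """Feature extraction: dropout disabled, no autograd graph is built."""
    encoder.eval()
    with torch.no_grad():
        return encoder(token_ids)


def encode_trainable(encoder, token_ids):
    """Fine-tuning: dropout active, gradients can flow into the encoder."""
    encoder.train()
    return encoder(token_ids)


encoder = TinyEncoder()
token_ids = torch.tensor([[1, 2, 3, 4]])

frozen = encode_frozen(encoder, token_ids)
trainable = encode_trainable(encoder, token_ids)

# A dummy loss on the trainable output populates the encoder's gradients.
trainable.sum().backward()
```

Any step that detaches the output from the autograd graph (for example, converting it to a NumPy array) would also have to be avoided when fine-tuning, since gradients cannot flow through it.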

How do I train my own ELMo?

1. Requirements

python >= 3.6; pytorch 0.4; other requirements from allennlp

2. Prepare the input data and vocabulary

Data format (one sentence per line, tokens separated by spaces):

Notable alumni
Aris Kalafatis ( Acting )
Labour Party
They build an open nest in a tree hole , or man - made nest - boxes .
Legacy
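If your corpus is already tokenized in memory, writing it out in this layout takes only a few lines. This is a minimal sketch that assumes the expected format is one sentence per line with tokens joined by single spaces, as the sample suggests. The function name write_corpus is hypothetical.

```python
def write_corpus(sentences, path):
    """Write tokenized sentences to `path`, one per line, tokens joined by spaces.

    `sentences` is an iterable of token lists. Empty tokens and empty
    sentences are skipped. Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            cleaned = [token.strip() for token in tokens if token.strip()]
            if not cleaned:
                continue
            handle.write(" ".join(cleaned) + "\n")
            written += 1
    return written
```

The resulting file is what would be passed as the training data path in the next step.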

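3. Change into the repository directory and run the command

python -m elmoformanylangs.biLM train \
    --train_path data/en.raw \
    --config_path configs/cnn_50_100_512_4096_sample.json \
    --model output/en \
    --optimizer adam \
    --lr 0.001 \
    --lr_decay 0.8 \
    --max_epoch 10 \
    --max_sent_len 20 \
    --max_vocab_size 150000 \
    --min_count 3 \
    --gpu 2

--train_path: the training data, in the format described above
--config_path: path to the model configuration JSON file
--model: where the trained model is saved
--max_sent_len: maximum sentence length; for example, a 70-word sentence with max_len=30 would be split into 3 sentences
--max_vocab_size: I could not find where this is used in the code
--min_count: the minimum word count is 3 (<S>, </S>, <UNK>)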
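To make the max_sent_len example concrete, here is an illustrative sketch of chunking a long sentence into pieces of at most a given length. It mirrors the behavior described above (70 words, limit 30, 3 pieces) and is not the library's actual implementation. The function name split_sentence is hypothetical.

```python
def split_sentence(tokens, max_sent_len):
    """Split a token list into consecutive chunks of at most `max_sent_len` tokens.

    Illustration only: it mirrors the described behavior, not the library code.
    """
    if max_sent_len <= 0:
        raise ValueError("max_sent_len must be positive")
    return [
        tokens[start:start + max_sent_len]
        for start in range(0, len(tokens), max_sent_len)
    ]


# A 70-token sentence with a limit of 30 yields chunks of 30, 30 and 10 tokens.
long_sentence = ["w{}".format(i) for i in range(70)]
chunks = split_sentence(long_sentence, 30)
chunk_lengths = [len(chunk) for chunk in chunks]
```

With the value used in the command above (20), the same 70-token sentence would instead be cut into four pieces under this scheme.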

4. The original authors trained on a 20-million-word corpus. They also note:

However, we need to add that the training process is not very stable. In some cases, we end up with a loss of nan. We are actively working on that and hopefully improve it in the future.

The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.
