Lang's Blog

Keep writing, Keep loving

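Task goals

1. Continue training HIT's (Harbin Institute of Technology) RoBERTa model, which has already been pretrained on the MLM task, on the text dataset of the target task

2. Enable it to perform the downstream text classification task

Loading the model

Model download link. Only the model files are needed: config.json, pytorch_model.bin, and vocab.txt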

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

model_path = 'D:/Models/chinese-roberta-wwm-ext'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=15)  # number of classes

Building the training data

Dataset download link. The record format is shown below; see the readme file at the download link for details

6551700932705387022_!_101_!_news_culture_!_京城最值得你来场文化之旅的博物馆_!_保利集团,马未都,中国科学技术馆,博物馆,新中国
6552368441838272771_!_101_!_news_culture_!_发酵床的垫料种类有哪些?哪种更好?_!_
6552310157706002702_!_102_!_news_entertainment_!_成龙改口决定不裸捐了,20亿财产给儿子一半,你怎么看?_!_
6552309039697494532_!_103_!_news_sports_!_亚洲杯夺冠赔率:日本、伊朗领衔 中国竟与泰国并列_!_土库曼斯坦,乌兹别克斯坦,亚洲杯,赔率,小组赛
6552477789642031623_!_103_!_news_sports_!_9轮4球本土射手仅次武磊 黄紫昌要抢最强U23头衔_!_黄紫昌,武磊,卡佩罗,惠家康,韦世豪
6552495859798376712_!_103_!_news_sports_!_如果今年勇士夺冠,下赛季詹姆斯何去何从?_!_
import re
from sklearn.utils import shuffle
import pandas as pd

# Map each category name in the dataset to an integer class id
label_dic = {
    'news_story': 0,
    'news_culture': 1,
    'news_entertainment': 2,
    'news_sports': 3,
    'news_finance': 4,
    'news_house': 5,
    'news_car': 6,
    'news_edu': 7,
    'news_tech': 8,
    'news_military': 9,
    'news_travel': 10,
    'news_world': 11,
    'stock': 12,
    'news_agriculture': 13,
    'news_game': 14
}

def get_train_data(file_path, col_num):
    """Read at most col_num rows and return (titles, label ids)."""
    content = []
    label = []
    with open(file_path, "r", encoding="utf-8") as f:
        for num, line in enumerate(f):
            if num >= col_num:  # stop once col_num rows have been read
                break
            fields = line.split("_!_")
            content.append(re.sub('[^\u4e00-\u9fa5]', "", fields[3]))  # strip non-Chinese characters
            label.append(label_dic.get(fields[2]))
    return content, label

content, label = get_train_data("./file/toutiao_cat_data.txt", 8000)
data = pd.DataFrame({"content": content, "label": label})
data = shuffle(data)

train_data = tokenizer(data.content.to_list(), padding="max_length", max_length=100, truncation=True, return_tensors="pt")
train_label = data.label.to_list()
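
To make the record format concrete, here is a minimal sketch that parses a single line of the dataset the same way the loader does. The helper name parse_line is hypothetical (it is not part of the original post), and the one-entry label dictionary is only for illustration.

```python
import re

# Hypothetical helper, not from the original post: parse one dataset record.
def parse_line(line, label_dic):
    """Split a `_!_`-separated record and return (cleaned_title, label_id)."""
    fields = line.rstrip("\n").split("_!_")
    # fields: [news id, category code, category name, title, keywords]
    title = re.sub(r"[^\u4e00-\u9fa5]", "", fields[3])  # keep only Chinese characters
    return title, label_dic.get(fields[2])

sample = ("6552309039697494532_!_103_!_news_sports_!_"
          "亚洲杯夺冠赔率:日本、伊朗领衔 中国竟与泰国并列_!_"
          "土库曼斯坦,乌兹别克斯坦,亚洲杯,赔率,小组赛")
title, label = parse_line(sample, {"news_sports": 3})
print(title, label)
```

Punctuation, spaces, and digits are dropped from the title, which is why the preprocessed samples contain only Chinese characters.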

After preprocessing, the training samples in the data variable look like this:

index content label
4383 以色列警告称如果战机被击落将会轰炸俄军事基地你怎么看 9
5244 月份北京楼市各区成交排名昌平丰台密云三区热度高 5
5608 市值与业绩倒挂华大基因是第二个乐视网吗 12

Defining the optimizer and learning rate

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 16
train = TensorDataset(train_data["input_ids"], train_data["attention_mask"], torch.tensor(train_label))
train_sampler = RandomSampler(train)
train_dataloader = DataLoader(train, sampler=train_sampler, batch_size=batch_size)

# Define the optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4)
# Define the learning-rate schedule and the number of epochs
num_epochs = 1
from transformers import get_scheduler
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
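
As a rough illustration of what the "linear" schedule does to the learning rate, the sketch below reimplements the decay rule in plain Python. The function name linear_lr is made up for this example, and the figure of 500 steps assumes about 8000 samples at batch size 16; it is not the library code itself.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Assumed setting: 8000 samples / batch size 16 = 500 steps in one epoch
steps = 500
for s in (0, 250, 500):
    print(s, linear_lr(s, 1e-4, steps))
```

With no warmup, the rate starts at the full 1e-4 and shrinks steadily until it reaches zero at the last step.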

Training the model

from tqdm import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
for epoch in range(num_epochs):
    total_loss = 0.0
    model.train()
    # Wrap the dataloader directly so batches are loaded lazily and the bar advances once per batch
    with tqdm(train_dataloader, ncols=100) as _tqdm:
        _tqdm.set_description('epoch {}/{}'.format(epoch + 1, num_epochs))
        for step, batch in enumerate(_tqdm):
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            model.zero_grad()
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask,
                            labels=b_labels)

            loss = outputs.loss
            total_loss += loss.item()
            loss.backward()
            # Clip the global gradient norm to 1.0 to keep updates stable
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()
            # outputs.loss is already the batch mean, so average over the batches seen so far
            _tqdm.set_postfix(loss=loss.item(), avg_loss=total_loss / (step + 1))
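
The gradient clipping step in the loop can be pictured with a small plain-Python sketch. It treats the gradients as one flat list of numbers and is only an illustration of norm clipping under that simplification; the function name clip_by_global_norm is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink, never enlarge; the small constant avoids division by zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # norm before clipping
print(clipped)  # rescaled gradients, direction unchanged
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only affects unusually large updates.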

Model prediction

test = tokenizer("通过考研大纲谈农学复习之化学部分年农学将第二次实行全国统一考试这使农学考生的复习备考十分迷茫", return_tensors="pt", padding="max_length", max_length=100, truncation=True)
test = test.to(device)

model.eval()
with torch.no_grad():
    outputs = model(test["input_ids"],
                    token_type_ids=None,
                    attention_mask=test["attention_mask"])

logits = outputs["logits"].cpu()
pred = logits.argmax(dim=1).item()  # index of the highest-scoring class
print(pred)
# Reverse lookup: class id back to category name
print(list(label_dic.keys())[list(label_dic.values()).index(pred)])
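
The reverse lookup in the last line can also be done with an inverted dictionary, which may read more clearly. The sketch below uses a shortened label dictionary and plain Python lists in place of model logits; predict_label is a hypothetical helper name.

```python
# Shortened copy of the label dictionary, for illustration only
label_dic = {"news_story": 0, "news_culture": 1, "news_entertainment": 2}
id_to_label = {idx: name for name, idx in label_dic.items()}

def predict_label(logits):
    """Return the label name whose logit is largest."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_label[best]

print(predict_label([0.1, 2.3, -0.5]))
```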

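Note: fine-tuning a pretrained model basically follows this pattern. The model code changes very little; usually you only need to preprocess the dataset into the input format the model expects.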

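Overall, expanding the disk of an Ubuntu virtual machine takes three steps:

1. Expand the virtual machine's disk

2. Boot into the system and create a new partition from the extra capacity

3. Mount the new partition on a chosen directory and make the mount persistent

Expanding the virtual machine disk

There is not much to say here. In VMware, all snapshots must be deleted before a disk can be expanded; otherwise the only option is to add another disk.

Creating the partition

1. Create a new partition from the extra disk space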

# Create a directory to mount the new partition on
mkdir /data

# Switch to root; alternatively, prefix each command with sudo
su root

# Enter the fdisk interactive prompt
fdisk /dev/sda
# Enter m to see the command help
Command (m for help): m

Help:

Generic
d delete a partition
F list free unpartitioned space
l list known partition types
n add a new partition
p print the partition table
t change a partition type
v verify the partition table
i print information about a partition

Misc
m print this menu
x extra functionality (experts only)

Script
I load disk layout from sfdisk script file
O dump disk layout to sfdisk script file

Save & Exit
w write table to disk and exit
q quit without saving changes

Create a new label
g create a new empty GPT partition table
G create a new empty SGI (IRIX) partition table
o create a new empty DOS partition table
s create a new empty Sun partition table

# 1. Enter n to create a new partition; press Enter to accept every default
Command (m for help): n
Partition number (6-128, default 6):
First sector (41940992-104857566, default 41940992):
Last sector, +sectors or +size{K,M,G,T,P} (41940992-104857566, default 104857566):

Created a new partition 6 of type 'Linux filesystem' and of size 30 GiB.

# 2. Enter p to inspect the result; sda6 is the new partition
Command (m for help): p
Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: D7617B57-3A11-4B5B-B28D-FE5A8EEAD05D

Device Start End Sectors Size Type
/dev/sda1 2048 4095 2048 1M BIOS boot
/dev/sda2 4096 2101247 2097152 1G Linux filesystem
/dev/sda3 2101248 8392703 6291456 3G Linux swap
/dev/sda4 8392704 29364223 20971520 10G Linux filesystem
/dev/sda5 29364224 41940991 12576768 6G Linux filesystem
/dev/sda6 41940992 104857566 62916575 30G Linux filesystem

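# 3. Enter w to write the partition table; at this point df -lh does not show the new partition yet

# 4. Format the sda6 partition as ext4; the new partition is now ready
mkfs -t ext4 /dev/sda6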
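
As a quick sanity check on the fdisk output, the size of the new partition can be recomputed from its start and end sectors. This is a small Python sketch using the sector numbers printed above; the function name partition_size_gib is made up for the example.

```python
SECTOR_BYTES = 512  # logical sector size reported by fdisk

def partition_size_gib(first_sector, last_sector):
    """Size in GiB of a partition spanning first_sector..last_sector inclusive."""
    sectors = last_sector - first_sector + 1
    return sectors * SECTOR_BYTES / 2**30

# /dev/sda6 from the table above
print(round(partition_size_gib(41940992, 104857566), 2))
```

The sector count is end minus start plus one because both ends are inclusive, which matches the Sectors column that fdisk prints.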

Mounting the partition

The space becomes usable only after the partition is mounted on a directory

mount /dev/sda6 /data

A mount made this way has one drawback: it is lost on reboot and has to be redone, so the mount needs to be made persistent

# 1. Find the partition's UUID
blkid /dev/sda6

/dev/sda6: UUID="01c6810e-738f-47c2-882f-2451b93a163d" TYPE="ext4" PARTUUID="212a7071-d49e-0f41-95df-b7a35eef881b"

Copy the UUID value; it is needed in the next step

# 2. Persist the partition information in the system configuration
vim /etc/fstab

# Follow the format of the existing entries

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/disk/by-uuid/bfae9057-2637-4bdb-b355-395ab9ca30f1 none swap sw 0 0
# / was on /dev/sda4 during curtin installation
/dev/disk/by-uuid/9bdde825-6d74-45e2-bbf5-0b497e0d0fcf / ext4 defaults 0 1
# /home was on /dev/sda5 during curtin installation
/dev/disk/by-uuid/40ca4129-cacf-4d8b-a213-fd3c654ca1ce /home ext4 defaults 0 1
# /boot was on /dev/sda2 during curtin installation
/dev/disk/by-uuid/1e0131bc-9662-41b6-81e7-9a5db92f3509 /boot ext4 defaults 0 1
# Append a similar entry, modelled on the /home mount configuration
/dev/disk/by-uuid/01c6810e-738f-47c2-882f-2451b93a163d /data ext4 defaults 0 1
/swap.img none swap sw 0 0

Save the file and reboot the system. The expansion is now complete; after the reboot, df -lh shows the new capacity
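
To avoid copy-and-paste mistakes with the UUID, the fstab line can be assembled from the blkid output. The following is only a sketch: fstab_entry is a hypothetical helper, and the option, dump, and pass values are passed in explicitly to mirror the entry used above rather than being recommended defaults.

```python
import re

def fstab_entry(blkid_output, mount_point, options, dump, fsck_pass):
    """Build an fstab line from one line of `blkid` output."""
    # \b keeps UUID= from matching inside PARTUUID=
    uuid = re.search(r'\bUUID="([^"]+)"', blkid_output).group(1)
    fstype = re.search(r'\bTYPE="([^"]+)"', blkid_output).group(1)
    return f"/dev/disk/by-uuid/{uuid} {mount_point} {fstype} {options} {dump} {fsck_pass}"

blkid_line = ('/dev/sda6: UUID="01c6810e-738f-47c2-882f-2451b93a163d" TYPE="ext4" '
              'PARTUUID="212a7071-d49e-0f41-95df-b7a35eef881b"')
entry = fstab_entry(blkid_line, "/data", "defaults", 0, 1)
print(entry)
```

The printed line has the same fields as the /data entry appended to /etc/fstab in this post.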

Anaconda environment setup

1. Create a folder and download Anaconda

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.05-Linux-x86_64.sh

2. Install

chmod +x Anaconda3-2021.05-Linux-x86_64.sh

./Anaconda3-2021.05-Linux-x86_64.sh

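Please, press ENTER to continue
# Press Enter; the license agreement appears, keep pressing Enter to page through it
Please answer 'yes' or 'no':'
# Type yes and press Enter
[/root/anaconda3] >>> /root/Anaconda
# Set the install directory (it must not exist yet); pressing Enter accepts the default shown in the brackets on the left
Unpacking payload ...
# When this appears, the installation is in progress
by running conda init? [yes|no]
# Whether to initialize conda: be sure to type yes instead of just pressing Enter, because the default is no. The installation then finishes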

Open ~/.bashrc and you will see this block

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/Anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/Anaconda/etc/profile.d/conda.sh" ]; then
        . "/root/Anaconda/etc/profile.d/conda.sh"
    else
        export PATH="/root/Anaconda/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

3. Verify the installation

source ~/.bashrc
# Check that the installation succeeded
conda -V
conda 4.10.1 # seeing this output means the installation succeeded

4. Disable automatic activation of the base environment

conda config --set auto_activate_base false
conda deactivate

5. Configure the Tsinghua mirror

# Show channel URLs when downloading
conda config --set show_channel_urls yes
# Append the mirror settings to ~/.condarc
channels:
  - defaults
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

Verify with conda config --show

Development environment setup

1. Create the nlp virtual environment

conda create -n nlp python=3.6.8

Jupyter Lab environment setup

1. Install jupyter lab

conda install jupyterlab

2. Edit the configuration

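# Generate the configuration file (~/.jupyter/jupyter_lab_config.py)
jupyter lab --generate-config
# Append the following to the configuration file
c.ServerApp.open_browser = False # do not open a browser automatically
c.ServerApp.ip='*' # accept connections from any IP
c.ServerApp.allow_remote_access = True # allow remote access
c.ServerApp.allow_root = True # allow running as root
c.ServerApp.port = 8888 # port to listen on, 8888 by default
c.ServerApp.root_dir = '/jupyter/xuxingchen' # working directory
c.ServerApp.password = ''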

3. Generate the remote access password

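# In a Python shell
from jupyter_server.auth import passwd; passwd()
# After you enter and confirm the password, a hashed string such as 'sha1:*****' is printed; paste it as the value of c.ServerApp.password in the configuration file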

4. Use conda Python environments

# Install the dependency
conda install nb_conda_kernels
# Locate anaconda/etc/jupyter/jupyter_config.json and change its contents to:
{
  "CondaKernelSpecManager": {
    "kernelspec_path": "--user",
    "name_format": "{kernel} ({environment})"
  }
}
# From base, install ipykernel directly into the target environment
conda install -n ENV_NAME ipykernel
# Switch to that environment and register it as a Jupyter kernel
python -m ipykernel install --user --name ENV_NAME

5. Multi-user setup

# Make several copies of .jupyter/jupyter_lab_config.py, change the port and password in each, and run them in the background from a script
# Start script
#!/bin/bash

source /root/Anaconda/etc/profile.d/conda.sh
# Switch environment
conda activate base

# Launch jupyter
root_dir="/jupyter/"
log_dir="/root/.jupyter/logs/"
declare -A dic
dic=([zl]=8888 [xj]=8889 [zgp]=8890 [xxc]=8891)
if [ "$1" == "all" ]
then
    for dirname in "${!dic[@]}"
    do
        (jupyter-lab --port=${dic[$dirname]} --notebook-dir=$root_dir$dirname) > $log_dir$dirname.log 2>&1 &
        sleep 2
        cat $log_dir$dirname.log
        echo "jupyter lab $dirname is running successfully"
    done
else
    (jupyter-lab --port=${dic[$1]} --notebook-dir=$root_dir$1) > $log_dir$1.log 2>&1 &
    sleep 3
    cat $log_dir$1.log
    echo "jupyter lab $1 is running successfully"
fi

# Stop script
#!/bin/bash

if [ "$1" == "all" ]
then
    list=("zl" "xj" "zgp" "xxc")
    for dirname in "${!list[@]}"
    do
        # Look up the process pid
        PID=$(pgrep -f dir=/jupyter/${list[$dirname]})
        if [ -n "$PID" ]
        then
            kill -n 15 $PID
            echo "Stopped ${list[$dirname]} $PID"
        else
            echo "No process found for ${list[$dirname]}"
        fi
    done
else
    # Look up the process pid
    PID=$(pgrep -f dir=/jupyter/$1)
    if [ -n "$PID" ]
    then
        kill -n 15 $PID
        echo "Stopped $1 $PID"
    else
        echo "No process found for $1"
    fi
fi
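
For readers who prefer Python over bash, the user-to-port mapping in the start script could be expressed roughly as below. This is a sketch only: jupyter_commands is a hypothetical helper that builds the command lines without launching anything, and it assumes the same user names, ports, and /jupyter/ root as the script.

```python
def jupyter_commands(user_ports, root_dir="/jupyter/"):
    """Map each user to the jupyter-lab command line that serves their directory."""
    return {
        user: ["jupyter-lab", f"--port={port}", f"--notebook-dir={root_dir}{user}"]
        for user, port in user_ports.items()
    }

cmds = jupyter_commands({"zl": 8888, "xj": 8889, "zgp": 8890, "xxc": 8891})
for user, argv in cmds.items():
    print(user, " ".join(argv))
```

Each argument list could then be handed to subprocess.Popen with its output redirected to a per-user log file, which is what the bash script does with `&` and `>`.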

6. Code completion

① Download the package from the official Node website, then extract and install it to a chosen location

sudo mkdir -p /usr/local/lib/nodejs
sudo tar -xJvf node-xxx.tar.xz -C /usr/local/lib/nodejs
vim ~/.profile

# Append the environment variable
export PATH=/usr/local/lib/nodejs/node-$VERSION-$DISTRO/bin:$PATH
# Reload the shell environment
source ~/.profile
# Verify the installation
node -v

② Stop all jupyter processes

③ Activate the base environment and install the jupyterlab-lsp extension

conda activate
pip install jupyterlab-lsp
pip install python-lsp-server[all]
pip install nbclassic==0.2.8

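④ Restart jupyter; if the status bar at the bottom left shows ✔Fully, the setup succeeded

⑤ To remove the underlined diagnostics, go to toolbar - Settings - Advanced Settings, switch to the JSON view, and add the key-value pair below: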

{"ignoreMessagesPatterns": [".*"]}

Note: this post partially reproduces another article

First install the dependencies needed to build Python:

sudo apt update
sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev

Download the source code of the latest release:

wget https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tgz

Extract it:

tar -xf Python-3.9.0.tgz

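Change into the Python source directory and run the configure script. The script checks that the system dependencies are complete and sets the build options. The --enable-optimizations option turns on profile-guided optimization, which runs a series of tests to tune the Python binary and makes the build take longer: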

cd Python-3.9.0
./configure --enable-optimizations

Start the Python 3.9 build. To speed it up, set -j to the number of processor cores. You can find the core count by typing nproc:

make -j 12
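
If the build is driven from a script, the job count can be derived instead of hard-coded. The sketch below uses Python's os.cpu_count() as a stand-in for nproc (the two can differ when CPU affinity is restricted); make_command is a hypothetical helper that only builds the argument list.

```python
import os

def make_command(jobs=None):
    """Return the make invocation, defaulting the job count to the CPU count."""
    jobs = jobs or os.cpu_count() or 1  # cpu_count() may return None
    return ["make", "-j", str(jobs)]

print(" ".join(make_command()))
```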

When the build finishes, install the Python binaries with:

sudo make altinstall

We use altinstall rather than install because install would overwrite the system's default python3 binary. Python 3.9 is now installed and ready to use. To verify it, type:

python3.9 --version     

The output should show the Python version:

Python 3.9.0+ 