HexoSEOAutoPush installation and usage

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

model_path = 'D:/Models/chinese-roberta-wwm-ext'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=15)  # number of classes

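Building the training data

Dataset download link. The records have the structure shown below; see the README provided with the download for details.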

6551700932705387022_!_101_!_news_culture_!_京城最值得你来场文化之旅的博物馆_!_保利集团,马未都,中国科学技术馆,博物馆,新中国
6552368441838272771_!_101_!_news_culture_!_发酵床的垫料种类有哪些？哪种更好？_!_
6552310157706002702_!_102_!_news_entertainment_!_成龙改口决定不裸捐了，20亿财产给儿子一半，你怎么看？_!_
6552309039697494532_!_103_!_news_sports_!_亚洲杯夺冠赔率：日本、伊朗领衔 中国竟与泰国并列_!_土库曼斯坦,乌兹别克斯坦,亚洲杯,赔率,小组赛
6552477789642031623_!_103_!_news_sports_!_9轮4球本土射手仅次武磊 黄紫昌要抢最强U23头衔_!_黄紫昌,武磊,卡佩罗,惠家康,韦世豪
6552495859798376712_!_103_!_news_sports_!_如果今年勇士夺冠，下赛季詹姆斯何去何从？_!_
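
Each record is a single line whose five fields are separated by `_!_`: news ID, category code, category name, title, and keywords (the last field may be empty). A minimal parsing sketch, assuming that field order as read off the samples above; `Record` and `parse_record` are hypothetical names, not part of the dataset tooling:

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    news_id: str
    category_code: str
    category_name: str
    title: str
    keywords: List[str]


def parse_record(line: str) -> Record:
    """Split one `_!_`-separated line into its five fields."""
    news_id, code, name, title, keywords = line.rstrip("\n").split("_!_")
    # The keyword field is comma-separated and may be empty.
    return Record(news_id, code, name, title, keywords.split(",") if keywords else [])


sample = "6552310157706002702_!_102_!_news_entertainment_!_成龙改口决定不裸捐了，20亿财产给儿子一半，你怎么看？_!_"
record = parse_record(sample)
print(record.category_name, record.keywords)
```

The training code only uses the category name and the title, which is why it indexes fields 2 and 3 of the split line.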

import re
from sklearn.utils import shuffle
import pandas as pd

label_dic = {
    'news_story':0,
    'news_culture':1,
    'news_entertainment':2,
    'news_sports':3,
    'news_finance':4,
    'news_house':5,
    'news_car':6,
    'news_edu':7,
    'news_tech':8,
    'news_military':9,
    'news_travel':10,
    'news_world':11,
    'stock':12,
    'news_agriculture':13,
    'news_game':14
}

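def get_train_data(file_path, col_num):
    """Read at most col_num samples and return (titles, label ids)."""
    content = []
    label = []
    with open(file_path, "r", encoding="utf-8") as f:
        for num, line in enumerate(f):
            if num >= col_num:  # stop once col_num samples have been read
                break
            fields = line.split("_!_")
            content.append(re.sub('[^\u4e00-\u9fa5]', "", fields[3]))  # strip non-Chinese characters from the title
            label.append(label_dic.get(fields[2]))  # map the category name to its label id
    return content, label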
        
content,label = get_train_data("./file/toutiao_cat_data.txt", 8000)
data = pd.DataFrame({"content":content,"label":label})
data = shuffle(data)

train_data = tokenizer(data.content.to_list(), padding = "max_length", max_length = 100, truncation=True ,return_tensors = "pt")
train_label = data.label.to_list()

After preprocessing, the training samples in the data variable look like this:

index	content	label
4383	以色列警告称如果战机被击落将会轰炸俄军事基地你怎么看	9
5244	月份北京楼市各区成交排名昌平丰台密云三区热度高	5
5608	市值与业绩倒挂华大基因是第二个乐视网吗	12

Defining the optimizer and learning rate

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 16
train = TensorDataset(train_data["input_ids"], train_data["attention_mask"], torch.tensor(train_label))
train_sampler = RandomSampler(train)
train_dataloader = DataLoader(train, sampler=train_sampler, batch_size=batch_size)

# Define the optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4)
# Define the learning-rate schedule and the number of epochs
num_epochs = 1
from transformers import get_scheduler
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
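
With `name="linear"` and `num_warmup_steps=0`, the scheduler decays the learning rate linearly from its initial value to zero over `num_training_steps`. A plain-Python sketch of that schedule; `linear_lr` is an illustrative helper, not a transformers API, and the figure of 500 steps is an assumption (roughly 8000 samples at batch_size 16 for one epoch):

```python
def linear_lr(step: int, base_lr: float, num_training_steps: int,
              num_warmup_steps: int = 0) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


total_steps = 500  # assumed number of optimizer steps
for step in (0, 250, 500):
    print(step, linear_lr(step, 1e-4, total_steps))
```

The rate starts at the optimizer's `lr`, is halved at the midpoint, and reaches zero on the last step.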

Training the model

from tqdm import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    with tqdm(train_dataloader, ncols=100) as _tqdm:
        _tqdm.set_description('epoch {}/{}'.format(epoch+1, num_epochs))
        for step, batch in enumerate(_tqdm):
            # outputs.loss is already the mean over the batch, so the running
            # average divides by the number of completed steps only
            if step > 0:
                _tqdm.set_postfix(avg_loss=total_loss / step)
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            model.zero_grad()        
            outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)

            loss = outputs.loss       
            total_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            lr_scheduler.step()

Prediction

import numpy as np

test = tokenizer("通过考研大纲谈农学复习之化学部分年农学将第二次实行全国统一考试这使农学考生的复习备考十分迷茫",return_tensors="pt",padding="max_length",max_length=100)
test.to(device)

model.eval()
with torch.no_grad():  
    outputs = model(test["input_ids"], 
                    token_type_ids=None, 
                    attention_mask=test["attention_mask"])

logits = outputs["logits"].cpu()
pred = torch.argmax(logits, dim=1).item()  # id of the highest-scoring class
print(pred)
id2label = {v: k for k, v in label_dic.items()}  # reverse lookup: label id -> category name
print(id2label[pred])
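
The argmax picks the class with the largest logit; applying softmax to the logits additionally gives a rough confidence score. A self-contained sketch with made-up logits for three classes (the values and the shortened label list are illustrative only, not model output):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["news_story", "news_culture", "news_edu"]  # shortened, illustrative label list
logits = [0.3, -1.2, 2.5]                            # made-up values
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```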

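Note: fine-tuning a pretrained model generally follows this pattern. The model code barely changes; usually all that is needed is to preprocess the dataset into the input format the model expects.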

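Expanding the disk of an Ubuntu Server virtual machine

Posted on 2022-05-07, filed under Solutions

Broadly, expanding the disk of an Ubuntu virtual machine takes three steps:

1. Enlarge the virtual disk in the hypervisor

2. Boot the system and create a new partition in the added space

3. Mount the new partition at the target directory and make the mount persistent

Enlarging the virtual disk

There is not much to say here. In VMware, all snapshots must be deleted before the disk can be enlarged; otherwise the only option is to add another disk.

Creating the partition

1. Create a new partition in the added disk space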

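# Switch to root; alternatively, stay as a normal user and prefix each command with sudo
su root

# Create the directory where the new partition will be mounted
mkdir /data

# Start the interactive fdisk prompt
fdisk /dev/sda

# Enter m to list the available commands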
Command (m for help): m

Help:

  Generic
   d   delete a partition
   F   list free unpartitioned space
   l   list known partition types
   n   add a new partition
   p   print the partition table
   t   change a partition type
   v   verify the partition table
   i   print information about a partition

  Misc
   m   print this menu
   x   extra functionality (experts only)

  Script
   I   load disk layout from sfdisk script file
   O   dump disk layout to sfdisk script file

  Save & Exit
   w   write table to disk and exit
   q   quit without saving changes

  Create a new label
   g   create a new empty GPT partition table
   G   create a new empty SGI (IRIX) partition table
   o   create a new empty DOS partition table
   s   create a new empty Sun partition table
   
# 1. Enter n to create a partition; press Enter to accept every default
Command (m for help): n
Partition number (6-128, default 6): 
First sector (41940992-104857566, default 41940992): 
Last sector, +sectors or +size{K,M,G,T,P} (41940992-104857566, default 104857566): 

Created a new partition 6 of type 'Linux filesystem' and of size 30 GiB.

# 2. Enter p to print the partition table; sda6 is the new partition
Command (m for help): p
Disk /dev/sda: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: D7617B57-3A11-4B5B-B28D-FE5A8EEAD05D

Device        Start       End  Sectors Size Type
/dev/sda1      2048      4095     2048   1M BIOS boot
/dev/sda2      4096   2101247  2097152   1G Linux filesystem
/dev/sda3   2101248   8392703  6291456   3G Linux swap
/dev/sda4   8392704  29364223 20971520  10G Linux filesystem
/dev/sda5  29364224  41940991 12576768   6G Linux filesystem
/dev/sda6  41940992 104857566 62916575  30G Linux filesystem

# 3. Enter w to write the partition table; the new partition does not show up in df -lh yet

# 4. Format sda6 as ext4; the partition is now ready
mkfs -t ext4 /dev/sda6
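
The 30 GiB that fdisk reports for sda6 follows directly from its sector range in the partition table printed by p: sectors 41940992 through 104857566 at 512 bytes each. A quick check of that arithmetic (`partition_size_gib` is just an illustrative helper):

```python
SECTOR_BYTES = 512  # logical sector size reported by fdisk


def partition_size_gib(start: int, end: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Size in GiB of a partition given its inclusive start and end sectors."""
    sectors = end - start + 1  # both ends are inclusive
    return sectors * sector_bytes / 2**30


sectors = 104857566 - 41940992 + 1
print(sectors)  # 62916575, matching the Sectors column
print(round(partition_size_gib(41940992, 104857566), 2))
```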

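Mounting the partition

The space only becomes usable once the partition is mounted at a directory:

mount /dev/sda6 /data

A mount made this way does not survive a reboot and would have to be repeated each time, so it needs to be made persistent.

# 1. Find the partition's UUID
blkid /dev/sda6

/dev/sda6: UUID="01c6810e-738f-47c2-882f-2451b93a163d" TYPE="ext4" PARTUUID="212a7071-d49e-0f41-95df-b7a35eef881b"

Copy the UUID value; it is needed in the next step.

# 2. Persist the mount in the system configuration
vim /etc/fstab

# Follow the format of the existing entries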

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/disk/by-uuid/bfae9057-2637-4bdb-b355-395ab9ca30f1 none swap sw 0 0
# / was on /dev/sda4 during curtin installation
/dev/disk/by-uuid/9bdde825-6d74-45e2-bbf5-0b497e0d0fcf / ext4 defaults 0 1
# /home was on /dev/sda5 during curtin installation
/dev/disk/by-uuid/40ca4129-cacf-4d8b-a213-fd3c654ca1ce /home ext4 defaults 0 1
# /boot was on /dev/sda2 during curtin installation
/dev/disk/by-uuid/1e0131bc-9662-41b6-81e7-9a5db92f3509 /boot ext4 defaults 0 1
# Modeled on the /home entry, append a similar line for /data
/dev/disk/by-uuid/01c6810e-738f-47c2-882f-2451b93a163d /data ext4 defaults 0 1
/swap.img       none    swap    sw      0       0

Save the file and reboot. The expansion is now complete, and the new capacity shows up in df -lh after the reboot.
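
Each fstab entry has six whitespace-separated fields: file system, mount point, type, options, dump, and pass. A small sketch that assembles the /data line from a UUID; `fstab_entry` is a hypothetical helper, and the UUID is the one printed by blkid above:

```python
def fstab_entry(uuid: str, mount_point: str, fs_type: str = "ext4",
                options: str = "defaults", dump: int = 0, fsck_pass: int = 1) -> str:
    """Format one /etc/fstab line that refers to the file system by UUID."""
    if not mount_point.startswith("/"):
        raise ValueError("mount point must be an absolute path")
    return f"/dev/disk/by-uuid/{uuid} {mount_point} {fs_type} {options} {dump} {fsck_pass}"


line = fstab_entry("01c6810e-738f-47c2-882f-2451b93a163d", "/data")
print(line)
assert len(line.split()) == 6  # an fstab entry always has six fields
```

Running mount -a as root after editing the file attempts to mount everything listed in fstab, which is a way to catch a typo before rebooting.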

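Deploying an NLP production environment

Posted on 2022-02-23, updated on 2023-01-10, filed under Installation notes

Setting up Anaconda

1. Create a working directory and download Anaconda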

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.05-Linux-x86_64.sh

2. Install

chmod +x Anaconda3-2021.05-Linux-x86_64.sh

./Anaconda3-2021.05-Linux-x86_64.sh

Please, press ENTER to continue
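# Press Enter; the license agreement appears, keep pressing Enter to page through it
Please answer 'yes' or 'no':'
# Type yes and press Enter
[/root/anaconda3] >>> /root/Anaconda
# Set the install directory (it must not exist yet); pressing Enter accepts the default shown in the brackets on the left
Unpacking payload ...  
# When this appears, the installation is in progress
by running conda init? [yes|no] 
# Whether to initialize conda: be sure to type yes rather than just pressing Enter, because the default is no. The installation then finishes.

After that, ~/.bashrc contains this block: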

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/Anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/Anaconda/etc/profile.d/conda.sh" ]; then
        . "/root/Anaconda/etc/profile.d/conda.sh"
    else
        export PATH="/root/Anaconda/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

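3. Verify the installation

source ~/.bashrc
# Check that the installation succeeded
conda -V
conda 4.10.1 # seeing this output means the installation succeeded

4. Disable automatic activation of the base environment

conda config --set auto_activate_base false
conda deactivate

5. Configure the Tsinghua mirror

# Show channel URLs when downloading
conda config --set show_channel_urls yes
# Append the mirror settings to ~/.condarc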
channels:
  - defaults
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

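Verify the result with conda config --show.

Configuring the development environment

1. Create the nlp virtual environment

conda create -n nlp python=3.6.8

Setting up JupyterLab

1. Install JupyterLab

conda install jupyterlab

2. Edit the configuration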

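# Generate the configuration file (~/.jupyter/jupyter_lab_config.py)
jupyter lab --generate-config
# Append the following to the configuration file
c.ServerApp.open_browser = False  # do not open a browser automatically
c.ServerApp.ip='*'  # accept connections from any IP address
c.ServerApp.allow_remote_access = True  # allow remote access
c.ServerApp.allow_root = True  # allow running as root
c.ServerApp.port = 8888  # port to listen on, 8888 by default
c.ServerApp.root_dir = '/jupyter/xuxingchen'  # working directory
c.ServerApp.password = ''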

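3. Generate the remote-access password

# Run in a Python shell
from jupyter_server.auth import passwd; passwd()
# After you enter and confirm the password, a hash such as 'sha1:*****' is printed; paste it into c.ServerApp.password in the configuration file

4. Use conda's Python environments

# Install the dependency
conda install nb_conda_kernels
# Open anaconda/etc/jupyter/jupyter_config.json and change it to the following: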
{
  "CondaKernelSpecManager": {
    "kernelspec_path": "--user",
    "name_format": "{kernel} ({environment})"
  }
}
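# From base, install ipykernel directly into the target environment
conda install -n <env-name> ipykernel
# Switch to that environment and register its kernel with Jupyter
python -m ipykernel install --user --name <env-name>

5. Multi-user setup

# Make one copy of .jupyter/jupyter_lab_config.py per user, change the port and password in each, and run them in the background with a script
# Start script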
#!/bin/bash

source /root/Anaconda/etc/profile.d/conda.sh
# Activate the environment
conda activate base

# Launch Jupyter
root_dir="/jupyter/"
log_dir="/root/.jupyter/logs/"
declare -A dic
dic=([zl]=8888 [xj]=8889 [zgp]=8890 [xxc]=8891)
if [ $1 == "all" ]
then
	for dirname in ${!dic[*]}
	do
		(jupyter-lab --port=${dic[$dirname]} --notebook-dir=$root_dir$dirname) > $log_dir$dirname.log 2>&1 &
		sleep 2
		cat $log_dir$dirname.log
		echo "jupyter lab $dirname are running successfully"
	done
else
	(jupyter-lab --port=${dic[$1]} --notebook-dir=$root_dir$1) > $log_dir$1.log 2>&1 &
	sleep 3
	cat $log_dir$1.log
	echo "jupyter lab $1 is running successfully"
fi

# Stop script
#!/bin/bash

if [ $1 == "all" ]
then
	list=("zl" "xj" "zgp" "xxc")
	for dirname in ${!list[@]}
		do
			# Look up the process PID
			PID=$(pgrep -f dir=/jupyter/${list[$dirname]})
			if [ $PID ]
			then 
				kill -n 15 $PID
				echo "Terminated ${list[$dirname]} $PID"
			else
				echo "No process found for ${list[$dirname]}"
			fi
		done
else
	# Look up the process PID
	PID=$(pgrep -f dir=/jupyter/$1)
	if [ $PID ]
	then 
		kill -n 15 $PID
		echo "Terminated $1 $PID"
	else
		echo "No process found for $1"
	fi
fi
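
The start script maps each user to a port and to a working directory under /jupyter/. The same mapping can be sketched in Python to preview the commands before launching anything; `build_command` is an illustrative helper, and nothing is executed here:

```python
ROOT_DIR = "/jupyter/"
PORTS = {"zl": 8888, "xj": 8889, "zgp": 8890, "xxc": 8891}  # same mapping as the shell script


def build_command(user: str) -> list:
    """Return the jupyter-lab argv for one user, mirroring the start script."""
    if user not in PORTS:
        raise KeyError(f"unknown user: {user}")
    return ["jupyter-lab", f"--port={PORTS[user]}", f"--notebook-dir={ROOT_DIR}{user}"]


for user in PORTS:
    print(" ".join(build_command(user)))
```

Rejecting unknown users up front avoids the case in the shell version where an unrecognized argument yields an empty port.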

6. Code completion

① Download the package from the official Node.js site, then extract it to the target location

sudo mkdir -p /usr/local/lib/nodejs
sudo tar -xJvf node-xxx.tar.xz -C /usr/local/lib/nodejs
vim ~/.profile

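# Append the environment variable
export PATH=/usr/local/lib/nodejs/node-$VERSION-$DISTRO/bin:$PATH
# Reload the shell environment
source ~/.profile
# Verify the installation
node -v

② Stop all running Jupyter processes

③ Activate the base environment and install the jupyterlab-lsp extension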

conda activate
pip install jupyterlab-lsp
pip install python-lsp-server[all]
pip install nbclassic==0.2.8

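④ Restart Jupyter; a "✔ Fully" indicator in the status bar at the bottom left means it worked

⑤ To hide the underline diagnostics, open Settings > Advanced Settings from the toolbar, switch to the JSON view, and add the following key-value pair:

{"ignoreMessagesPatterns": [".*"]}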

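Building and installing Python from source on Ubuntu

Posted on 2022-02-11, filed under Installation notes

Note: this post is a partial repost of another article.

First install the dependencies needed to build Python:

sudo apt update
sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev

Download the source code (3.9.0 in this walkthrough):

wget https://www.python.org/ftp/python/3.9.0/Python-3.9.0.tgz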

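Extract it:

tar -xf Python-3.9.0.tgz

Change into the Python source directory and run the configure script. It checks that the system dependencies are complete and sets the build options; the --enable-optimizations option optimizes the Python binary by running a series of tests during the build, which makes the build slower:

cd Python-3.9.0
./configure --enable-optimizations

Start the Python 3.9 build. To speed it up, set -j to the number of processor cores, which you can find by running nproc:

make -j 12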
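
The job count passed to -j can also be derived programmatically. A small sketch using Python's `os.cpu_count()`, which reports the number of logical CPUs and may differ from nproc when CPU affinity is restricted; `make_command` is an illustrative helper:

```python
import os


def make_command(jobs=None):
    """Build the make argv; default to the logical CPU count, falling back to 1."""
    jobs = jobs or os.cpu_count() or 1
    return ["make", "-j", str(jobs)]


print(" ".join(make_command()))
```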

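When the build finishes, install the Python binaries with:

sudo make altinstall

We use altinstall rather than install because install would overwrite the system's default python3 binary. Python 3.9 is now installed and ready to use. To verify it, run:

python3.9 --version

The output should show the Python version:

Python 3.9.0+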