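Informer Code Walkthrough

1. Argument setup module (main_informer)

Note that the original script declares the --model and --data arguments with required=True. Remove that setting and give both arguments a default; otherwise, running the script without them on the command line fails with an error saying that --model and --data are required.

After editing the arguments, run the module once and confirm that it finishes without errors before stepping through it in a debugger.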
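The effect of required=True can be reproduced in isolation. The sketch below is a minimal illustration, not the project's script: build_parser is a hypothetical helper, and only the two arguments discussed above are declared.

```python
import argparse

def build_parser(required):
    """Build a tiny parser with the two arguments discussed above."""
    parser = argparse.ArgumentParser()
    if required:
        # Original style: the argument must be given on the command line.
        parser.add_argument('--model', type=str, required=True)
        parser.add_argument('--data', type=str, required=True)
    else:
        # Edited style: a default makes the argument optional.
        parser.add_argument('--model', type=str, default='informer')
        parser.add_argument('--data', type=str, default='WTH')
    return parser

# With defaults, an empty argument list (as when launching from an IDE) works.
args = build_parser(required=False).parse_args([])
print(args.model, args.data)

# With required=True, the same empty list makes argparse exit with an error.
try:
    build_parser(required=True).parse_args([])
    failed = False
except SystemExit:
    failed = True
print('required=True failed without arguments:', failed)
```

Giving the arguments defaults is what lets the script be launched directly from a debugger without a command line.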

1.1 Argument meanings

The arguments and their meanings are annotated in the comments below.

import argparse
import torch

parser = argparse.ArgumentParser(description='[Informer] Long Sequences Forecasting')

# Model selection (required=True removed; defaults to informer)
parser.add_argument('--model', type=str, default='informer',help='model of experiment, options: [informer, informerstack, informerlight(TBD)]')

# Dataset selection (required=True removed)
parser.add_argument('--data', type=str, default='WTH', help='data')
# Directory that contains the data file
parser.add_argument('--root_path', type=str, default='./data/', help='root path of the data file')
# Data file name
parser.add_argument('--data_path', type=str, default='WTH.csv', help='data file')
# Forecasting task (M: multivariate to multivariate, S: univariate to univariate, MS: multivariate to univariate)
parser.add_argument('--features', type=str, default='M', help='forecasting task, options:[M, S, MS]; M:multivariate predict multivariate, S:univariate predict univariate, MS:multivariate predict univariate')
# Target column to predict (used by the S and MS tasks)
parser.add_argument('--target', type=str, default='OT', help='target feature in S or MS task')
# Frequency used for time-feature encoding (h: hourly); this does not resample the data
parser.add_argument('--freq', type=str, default='h', help='freq for time features encoding, options:[s:secondly, t:minutely, h:hourly, d:daily, b:business days, w:weekly, m:monthly], you can also use more detailed freq like 15min or 3h')
# Where model checkpoints are saved
parser.add_argument('--checkpoints', type=str, default='./checkpoints/', help='location of model checkpoints')

# Input sequence length (encoder)
parser.add_argument('--seq_len', type=int, default=96, help='input sequence length of Informer encoder')
# Start-token length: the known steps fed to the decoder
parser.add_argument('--label_len', type=int, default=48, help='start token length of Informer decoder')
# Prediction sequence length
parser.add_argument('--pred_len', type=int, default=24, help='prediction sequence length')
# Informer decoder input: concat[start token series(label_len), zero padding series(pred_len)]

# Encoder input size: the number of feature columns
parser.add_argument('--enc_in', type=int, default=7, help='encoder input size')
# Decoder input size: same as the encoder by default
parser.add_argument('--dec_in', type=int, default=7, help='decoder input size')
# Output size: the number of predicted columns
parser.add_argument('--c_out', type=int, default=7, help='output size')

# Model width
parser.add_argument('--d_model', type=int, default=512, help='dimension of model')
# Number of attention heads
parser.add_argument('--n_heads', type=int, default=8, help='num of heads')
# Number of encoder layers
parser.add_argument('--e_layers', type=int, default=2, help='num of encoder layers')
# Number of decoder layers
parser.add_argument('--d_layers', type=int, default=1, help='num of decoder layers')
# Number of encoder layers in each stack (used by informerstack)
parser.add_argument('--s_layers', type=str, default='3,2,1', help='num of stack encoder layers')
# Width of the feed-forward layer
parser.add_argument('--d_ff', type=int, default=2048, help='dimension of fcn')
# Sampling factor of ProbSparse attention
parser.add_argument('--factor', type=int, default=5, help='probsparse attn factor')
# Placeholder value for the decoder steps to be predicted (0: zeros, 1: ones)
parser.add_argument('--padding', type=int, default=0, help='padding type')
# Whether to distil (shorten) the sequence between encoder layers; store_false means passing --distil turns it OFF
parser.add_argument('--distil', action='store_false', help='whether to use distilling in encoder, using this argument means not using distilling', default=True)
# Dropout rate (regularization)
parser.add_argument('--dropout', type=float, default=0.05, help='dropout')
# Attention type
parser.add_argument('--attn', type=str, default='prob', help='attention used in encoder, options:[prob, full]')
# Time-feature encoding method
parser.add_argument('--embed', type=str, default='timeF', help='time features encoding, options:[timeF, fixed, learned]')
# Activation function
parser.add_argument('--activation', type=str, default='gelu',help='activation')
# Whether to return the attention weights
parser.add_argument('--output_attention', action='store_true', help='whether to output attention in ecoder')
# Whether to predict unseen future data after training
parser.add_argument('--do_predict', action='store_true', help='whether to predict unseen future data')
# Whether to use mix attention in the decoder; store_false means passing --mix turns it OFF
parser.add_argument('--mix', action='store_false', help='use mix attention in generative decoder', default=True)
# Specific columns of the data file to use as input features
parser.add_argument('--cols', type=str, nargs='+', help='certain cols from the data files as the input features')
# Number of DataLoader worker processes (use 0 on Windows to avoid errors)
parser.add_argument('--num_workers', type=int, default=0, help='data loader num workers')
# Number of times the whole experiment is repeated
parser.add_argument('--itr', type=int, default=2, help='experiments times')
# Number of training epochs
parser.add_argument('--train_epochs', type=int, default=6, help='train epochs')
# Mini-batch size
parser.add_argument('--batch_size', type=int, default=32, help='batch size of train input data')
# Early-stopping patience
parser.add_argument('--patience', type=int, default=3, help='early stopping patience')
# Learning rate
parser.add_argument('--learning_rate', type=float, default=0.0001, help='optimizer learning rate')
# Experiment description (becomes part of the setting name)
parser.add_argument('--des', type=str, default='test',help='exp description')
# Loss function
parser.add_argument('--loss', type=str, default='mse',help='loss function')
# Learning-rate adjustment strategy
parser.add_argument('--lradj', type=str, default='type1',help='adjust learning rate')
# Whether to use automatic mixed precision training
parser.add_argument('--use_amp', action='store_true', help='use automatic mixed precision training', default=False)
# Whether to inverse-transform the outputs back to the original data scale
parser.add_argument('--inverse', action='store_true', help='inverse output data', default=False)

# Whether to train on the GPU
# (type=bool applies bool() to the string, so any non-empty value, even 'False', parses as True)
parser.add_argument('--use_gpu', type=bool, default=True, help='use gpu')
# Index of the GPU to use
parser.add_argument('--gpu', type=int, default=0, help='gpu')
# Whether to train on multiple GPUs
parser.add_argument('--use_multi_gpu', action='store_true', help='use multiple gpus', default=False)
# Device ids used for multi-GPU training
parser.add_argument('--devices', type=str, default='0,1,2,3',help='device ids of multile gpus')

# Parse the arguments
args = parser.parse_args()
# Use the GPU only if one is actually available
args.use_gpu = True if torch.cuda.is_available() and args.use_gpu else False
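Two of the declarations above behave in ways that are easy to misread: --distil uses action='store_false', and --use_gpu uses type=bool. The following sketch re-declares just those two arguments on a throwaway parser to show what argparse returns; everything else is omitted.

```python
import argparse

parser = argparse.ArgumentParser()
# Same declarations as in the listing above.
parser.add_argument('--distil', action='store_false', default=True)
parser.add_argument('--use_gpu', type=bool, default=True)

no_flags = parser.parse_args([])
with_distil = parser.parse_args(['--distil'])
gpu_false = parser.parse_args(['--use_gpu', 'False'])

print(no_flags.distil)      # distilling is on by default
print(with_distil.distil)   # passing --distil switches it off
print(gpu_false.use_gpu)    # bool('False') is True, so this stays True
```

In practice this means the GPU cannot be disabled with --use_gpu False; editing the default in the script is the reliable way to do it.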

1.2 Dataset parameters

# Dataset parameters
data_parser = {
    'ETTh1':{'data':'ETTh1.csv','T':'OT','M':[7,7,7],'S':[1,1,1],'MS':[7,7,1]},
    'ETTh2':{'data':'ETTh2.csv','T':'OT','M':[7,7,7],'S':[1,1,1],'MS':[7,7,1]},
    'ETTm1':{'data':'ETTm1.csv','T':'OT','M':[7,7,7],'S':[1,1,1],'MS':[7,7,1]},
    'ETTm2':{'data':'ETTm2.csv','T':'OT','M':[7,7,7],'S':[1,1,1],'MS':[7,7,1]},
    # data: data file name; T: target column;
    # M / S / MS: [enc_in, dec_in, c_out] for that task
    # (to predict 12 features from 12 features, M is [12,12,12])
    'WTH':{'data':'WTH.csv','T':'WetBulbCelsius','M':[12,12,12],'S':[1,1,1],'MS':[12,12,1]},
    'ECL':{'data':'ECL.csv','T':'MT_320','M':[321,321,321],'S':[1,1,1],'MS':[321,321,1]},
    'Solar':{'data':'solar_AL.csv','T':'POWER_136','M':[137,137,137],'S':[1,1,1],'MS':[137,137,1]},
}
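A table like this is only useful once its entries overwrite the parsed arguments. The sketch below shows one plausible way to apply it, assuming the three numbers in each list are enc_in, dec_in and c_out in that order; apply_data_parser is a hypothetical helper written for this illustration, and SimpleNamespace stands in for the argparse result.

```python
from types import SimpleNamespace

data_parser = {
    'WTH': {'data': 'WTH.csv', 'T': 'WetBulbCelsius',
            'M': [12, 12, 12], 'S': [1, 1, 1], 'MS': [12, 12, 1]},
}

def apply_data_parser(args, table):
    """Overwrite file name, target and sizes from the table (hypothetical helper)."""
    if args.data in table:
        info = table[args.data]
        args.data_path = info['data']
        args.target = info['T']
        # Assumed order of the three numbers: encoder in, decoder in, output size.
        args.enc_in, args.dec_in, args.c_out = info[args.features]
    return args

args = SimpleNamespace(data='WTH', features='MS', data_path='', target='OT',
                       enc_in=7, dec_in=7, c_out=7)
apply_data_parser(args, data_parser)
print(args.data_path, args.target, args.enc_in, args.dec_in, args.c_out)
```

Under this reading, the default target OT and the default sizes of 7 are replaced whenever the dataset name is found in the table.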

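1.3 Data processing module (data_loader)

In main_informer.py, exp.train(setting) calls the train method in exp_informer.py. Its _get_data method contains the mapping that decides which dataset class handles the WTH data.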

data_dict = {
    'ETTh1':Dataset_ETT_hour,
    'ETTh2':Dataset_ETT_hour,
    'ETTm1':Dataset_ETT_minute,
    'ETTm2':Dataset_ETT_minute,
    'WTH':Dataset_Custom,
    'ECL':Dataset_Custom,
    'Solar':Dataset_Custom,
    'custom':Dataset_Custom,
}

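The WTH data is handled by Dataset_Custom, so open data_loader.py and find the Dataset_Custom class.

__init__ mostly stores the arguments passed in and is not covered here; the explanation focuses on __read_data__.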

def __read_data__(self):
    # Scaler used to standardize the data
    self.scaler = StandardScaler()
    # Read the data file with pandas
    df_raw = pd.read_csv(os.path.join(self.root_path,
                                      self.data_path))
    # If specific input columns were requested (--cols)
    if self.cols:
        cols = self.cols.copy()
        # Remove the target column
        cols.remove(self.target)
    else:
        # Take all column names, then drop the target column and the date column
        cols = list(df_raw.columns)
        cols.remove(self.target)
        cols.remove('date')
    # Reorder the columns: date + features + target
    df_raw = df_raw[['date']+cols+[self.target]]

    # Size of the training set (70%)
    num_train = int(len(df_raw)*0.7)
    # Size of the test set (20%)
    num_test = int(len(df_raw)*0.2)
    # Size of the validation set (the remainder)
    num_vali = len(df_raw) - num_train - num_test
    # Start and end rows of [train, val, test]; val and test start seq_len rows
    # early so that their first window has a full input sequence
    border1s = [0, num_train-self.seq_len, len(df_raw)-num_test-self.seq_len]
    border2s = [num_train, num_train+num_vali, len(df_raw)]
    border1 = border1s[self.set_type]
    border2 = border2s[self.set_type]

    # Task M (multivariate to multivariate) or MS (multivariate to univariate)
    if self.features=='M' or self.features=='MS':
        # Take every column except the date column
        cols_data = df_raw.columns[1:]
        df_data = df_raw[cols_data]
    # Task S (univariate to univariate)
    elif self.features=='S':
        # Take only the target column
        df_data = df_raw[[self.target]]
    # Standardize the data; the scaler is fitted on the training rows only
    if self.scale:
        train_data = df_data[border1s[0]:border2s[0]]
        self.scaler.fit(train_data.values)
        data = self.scaler.transform(df_data.values)
    else:
        data = df_data.values
    # Take the date column for this split
    df_stamp = df_raw[['date']][border1:border2]
    # Convert it to datetime with pandas
    df_stamp['date'] = pd.to_datetime(df_stamp.date)
    # Build the time features
    data_stamp = time_features(df_stamp, timeenc=self.timeenc, freq=self.freq)

    self.data_x = data[border1:border2]
    if self.inverse:
        # With inverse, the targets stay on the original (unscaled) scale
        self.data_y = df_data.values[border1:border2]
    else:
        # Otherwise the targets are the same scaled rows as the inputs
        self.data_y = data[border1:border2]
    self.data_stamp = data_stamp
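The border arithmetic in __read_data__ is easier to follow with concrete numbers. The sketch below repeats the same formulas for a hypothetical file of 1000 rows; split_borders is an illustrative name, not a function in the project.

```python
def split_borders(n_rows, seq_len):
    """Return (start, end) row ranges for train, val and test."""
    num_train = int(n_rows * 0.7)
    num_test = int(n_rows * 0.2)
    num_vali = n_rows - num_train - num_test
    border1s = [0, num_train - seq_len, n_rows - num_test - seq_len]
    border2s = [num_train, num_train + num_vali, n_rows]
    return list(zip(border1s, border2s))

# Hypothetical file with 1000 rows and the default seq_len of 96.
borders = split_borders(n_rows=1000, seq_len=96)
for name, (start, end) in zip(['train', 'val', 'test'], borders):
    print(name, start, end)
```

The validation and test ranges overlap the preceding split by seq_len rows. Those rows are only ever used as encoder input, so the first predicted step of each split still falls inside that split.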

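  • Note the time_features function, which extracts features from the date column. For example, 't': ['month', 'day', 'weekday', 'hour', 'minute'] means that month, day, weekday, hour and minute are extracted for minutely data. See timefeatures.py for the details.
  • __getitem__ is explained in the same way below.
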
def __getitem__(self, index):
    # Start of the input window; index comes from the DataLoader (shuffled during training)
    s_begin = index
    # End of the encoder input window
    s_end = s_begin + self.seq_len
    # Decoder window: label_len known steps followed by pred_len steps to predict
    r_begin = s_end - self.label_len
    r_end = r_begin + self.label_len + self.pred_len

    # Encoder input
    seq_x = self.data_x[s_begin:s_end]
    if self.inverse:
        # Known part from the scaled data, future part from the unscaled targets
        seq_y = np.concatenate([self.data_x[r_begin:r_begin+self.label_len], self.data_y[r_begin+self.label_len:r_end]], 0)
    else:
        # Known part + part to predict
        seq_y = self.data_y[r_begin:r_end]
    # Time features of the encoder window
    seq_x_mark = self.data_stamp[s_begin:s_end]
    # Time features of the decoder window
    seq_y_mark = self.data_stamp[r_begin:r_end]

    return seq_x, seq_y, seq_x_mark, seq_y_mark

def __len__(self):
    # Number of windows that fit in this split
    return len(self.data_x) - self.seq_len- self.pred_len + 1

def inverse_transform(self, data):
    return self.scaler.inverse_transform(data)
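The index arithmetic of __getitem__ and __len__ can be checked with the default lengths. This is a small sketch with hypothetical function names; the 8640-row figure is an assumption about the size of the ETTh1 training split (12 * 30 * 24 hourly rows), used only because it reproduces the "train 8521" line in the log at the end of this post.

```python
def window_indices(index, seq_len=96, label_len=48, pred_len=24):
    """Return the encoder range and the decoder range for one sample."""
    s_begin = index
    s_end = s_begin + seq_len
    r_begin = s_end - label_len
    r_end = r_begin + label_len + pred_len
    return (s_begin, s_end), (r_begin, r_end)

def num_windows(n_rows, seq_len=96, pred_len=24):
    """Number of samples a split of n_rows rows yields."""
    return n_rows - seq_len - pred_len + 1

enc_range, dec_range = window_indices(index=0)
print(enc_range, dec_range)

# Assumed size of the ETTh1 training split: 12 * 30 * 24 = 8640 rows.
print(num_windows(8640))
```

For index 0 the decoder range starts 48 steps before the encoder range ends, so the last label_len encoder steps double as the decoder's start tokens.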

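2. Informer model architecture (model)

The architecture diagram from the Informer paper is placed here for reference while reading the code.

[Figure: Informer architecture diagram from the paper]

Rationale for the choice of K and the selection method:

[Figure: explanation of how K is chosen and how queries are selected]

Start in exp_informer.py: the train function contains the call that runs the network.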

def train(self, setting):
    # Data loaders
    train_data, train_loader = self._get_data(flag = 'train')
    vali_data, vali_loader = self._get_data(flag = 'val')
    test_data, test_loader = self._get_data(flag = 'test')

    path = os.path.join(self.args.checkpoints, setting)
    if not os.path.exists(path):
        os.makedirs(path)

    # Reference time for the speed estimate
    time_now = time.time()
    # Number of training steps (batches) per epoch
    train_steps = len(train_loader)
    # Early stopping
    early_stopping = EarlyStopping(patience=self.args.patience, verbose=True)

    # Optimizer (Adam)
    model_optim = self._select_optimizer()
    # Loss function (MSE)
    criterion = self._select_criterion()

    # Gradient scaler for automatic mixed precision (only with --use_amp)
    if self.args.use_amp:
        scaler = torch.cuda.amp.GradScaler()

    # Loop over epochs
    for epoch in range(self.args.train_epochs):
        iter_count = 0
        train_loss = []

        self.model.train()
        epoch_time = time.time()
        for i, (batch_x,batch_y,batch_x_mark,batch_y_mark) in enumerate(train_loader):
            iter_count += 1
            # Reset the gradients
            model_optim.zero_grad()
            # Forward pass (this is where the network runs)
            pred, true = self._process_one_batch(
                train_data, batch_x, batch_y, batch_x_mark, batch_y_mark)
            # Compute the loss
            loss = criterion(pred, true)
            # Record it
            train_loss.append(loss.item())

            # Progress output every 100 iterations
            if (i+1) % 100==0:
                print("\titers: {0}, epoch: {1} | loss: {2:.7f}".format(i + 1, epoch + 1, loss.item()))
                speed = (time.time()-time_now)/iter_count
                left_time = speed*((self.args.train_epochs - epoch)*train_steps - i)
                print('\tspeed: {:.4f}s/iter; left time: {:.4f}s'.format(speed, left_time))
                iter_count = 0
                time_now = time.time()

            if self.args.use_amp:
                scaler.scale(loss).backward()
                scaler.step(model_optim)
                scaler.update()
            else:
                # Backpropagation
                loss.backward()
                # Update the parameters
                model_optim.step()

        # Print the epoch time
        print("Epoch: {} cost time: {}".format(epoch+1, time.time()-epoch_time))
        train_loss = np.average(train_loss)
        vali_loss = self.vali(vali_data, vali_loader, criterion)
        test_loss = self.vali(test_data, test_loader, criterion)

        # Print the losses
        print("Epoch: {0}, Steps: {1} | Train Loss: {2:.7f} Vali Loss: {3:.7f} Test Loss: {4:.7f}".format(
            epoch + 1, train_steps, train_loss, vali_loss, test_loss))
        # Early stopping (also saves the checkpoint when the validation loss improves)
        early_stopping(vali_loss, self.model, path)
        if early_stopping.early_stop:
            print("Early stopping")
            break

        adjust_learning_rate(model_optim, epoch+1, self.args)

    # Path of the best checkpoint saved during training
    best_model_path = path+'/'+'checkpoint.pth'
    # Load the best weights back into the model
    self.model.load_state_dict(torch.load(best_model_path))

    return self.model
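The call to adjust_learning_rate is not shown in this post, but the training log at the end prints 0.0001, 5e-05, 2.5e-05 and 1.25e-05 after epochs 1 to 4, which is consistent with halving the rate every epoch. The sketch below reproduces those numbers under that assumption; type1_learning_rate is a hypothetical name and not the project's function.

```python
def type1_learning_rate(base_lr, epoch):
    """Halve the learning rate after every epoch (epoch counts from 1)."""
    return base_lr * (0.5 ** (epoch - 1))

for epoch in range(1, 5):
    print('Updating learning rate to {}'.format(type1_learning_rate(0.0001, epoch)))
```

With six epochs at most and the rate halved each time, the later epochs make only small updates, which fits the early plateaus visible in the log.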

The part that actually runs the model during training is _process_one_batch, so step into that method.

def _process_one_batch(self, dataset_object, batch_x, batch_y, batch_x_mark, batch_y_mark):
    # Move the batch to the training device (GPU if available)
    batch_x = batch_x.float().to(self.device)
    batch_y = batch_y.float()

    batch_x_mark = batch_x_mark.float().to(self.device)
    batch_y_mark = batch_y_mark.float().to(self.device)

    # Decoder input
    if self.args.padding==0:
        # All-zero placeholder of shape [batch, pred_len, features], here [32,24,12]
        dec_inp = torch.zeros([batch_y.shape[0], self.args.pred_len, batch_y.shape[-1]]).float()
    elif self.args.padding==1:
        dec_inp = torch.ones([batch_y.shape[0], self.args.pred_len, batch_y.shape[-1]]).float()
    # Shape becomes [32,72,12] (72 = 48 + 24): the first label_len = 48 known
    # steps of batch_y, followed by the 24-step placeholder
    dec_inp = torch.cat([batch_y[:,:self.args.label_len,:], dec_inp], dim=1).float().to(self.device)
    # encoder - decoder
    if self.args.use_amp:
        with torch.cuda.amp.autocast():
            if self.args.output_attention:
                outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)[0]
            else:
                outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)
    else:
        if self.args.output_attention:
            outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)[0]
        else:
            # With the default settings execution reaches this line; model holds the network
            # outputs has shape [batch, pred_len, number of predicted features]
            outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)
    if self.args.inverse:
        outputs = dataset_object.inverse_transform(outputs)
    # For MS (multivariate input, single target) keep only the last column;
    # otherwise keep all columns
    f_dim = -1 if self.args.features=='MS' else 0

    # Ground truth: the last pred_len steps of batch_y
    batch_y = batch_y[:,-self.args.pred_len:,f_dim:].to(self.device)

    return outputs, batch_y
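The shapes in _process_one_batch can be verified without a model. The sketch below rebuilds the decoder input and the ground-truth slice with NumPy standing in for torch (np.concatenate in place of torch.cat); the random batch is a stand-in for real data.

```python
import numpy as np

batch_size, label_len, pred_len, n_features = 32, 48, 24, 12

# Stand-in for batch_y: label_len known steps followed by pred_len future steps.
batch_y = np.random.rand(batch_size, label_len + pred_len, n_features)

# padding == 0: placeholder of zeros for the steps to be predicted.
placeholder = np.zeros((batch_size, pred_len, n_features))

# Known start tokens first, placeholder after them, joined along the time axis.
dec_inp = np.concatenate([batch_y[:, :label_len, :], placeholder], axis=1)

# Ground truth compared with the model output: the last pred_len steps.
target = batch_y[:, -pred_len:, :]

print(dec_inp.shape, target.shape)
```

The decoder therefore never sees the values it is asked to predict: the future part of its input is all zeros, and the true values are used only in the loss.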

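The line outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark) is where the model is called; model contains the core Informer architecture, which is the most important part.

Open model.py, find the Informer class, and go straight to forward.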

def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec,
            enc_self_mask=None, dec_self_mask=None, dec_enc_mask=None):
    # x_enc: [batch, sequence length, feature columns]; x_mark_enc: [batch, sequence length, time features]
    # x_enc.shape: (32,96,12), x_mark_enc.shape: (32,96,4)
    enc_out = self.enc_embedding(x_enc, x_mark_enc)
    # enc_self_mask is an optional attention mask; it is None in this project
    enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)

    # Decoder embedding
    # x_dec: [batch, known + unknown sequence length, feature columns], here (32, 72 = 48 + 24, 12)
    dec_out = self.dec_embedding(x_dec, x_mark_dec)
    # Decoder
    dec_out = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask)
    # Linear projection to the output size, 512 --> 12
    dec_out = self.projection(dec_out)

    # dec_out = self.end_conv1(dec_out)
    # dec_out = self.end_conv2(dec_out.transpose(2,1)).transpose(1,2)
    if self.output_attention:
        return dec_out[:,-self.pred_len:,:], attns
    else:
        # Truncate: keep only the last 24 steps, the ones to be predicted
        return dec_out[:,-self.pred_len:,:] # [B, L, D]

2.1 Encoder Embedding

The embedding code is in embed.py.

class DataEmbedding(nn.Module):
    def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1):
        super(DataEmbedding, self).__init__()

        self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
        self.position_embedding = PositionalEmbedding(d_model=d_model)
        self.temporal_embedding = TemporalEmbedding(d_model=d_model, embed_type=embed_type, freq=freq) if embed_type!='timeF' else TimeFeatureEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, x_mark):
        # Sum of three embeddings, each of width 512:
        # the 12 feature columns mapped by a convolution, the position embedding,
        # and the 4 time features mapped by a linear layer
        x = self.value_embedding(x) + self.position_embedding(x) + self.temporal_embedding(x_mark)
        # Apply dropout and return the embedding
        return self.dropout(x)

2.2 Encoder module

The Encoder module is in encoder.py.

class Encoder(nn.Module):
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super(Encoder, self).__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None
        self.norm = norm_layer

    def forward(self, x, attn_mask=None):
        # x [B, L, D]
        attns = []
        if self.conv_layers is not None:
            for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):
                # Run one attention layer
                x, attn = attn_layer(x, attn_mask=attn_mask)
                # Distilling: conv_layer applies MaxPool1d, which halves the
                # sequence length (96 --> 48); the feature width stays at 512.
                # This is the pyramid in the architecture diagram, introduced to speed up the model
                x = conv_layer(x)
                attns.append(attn)
            # The last attention layer has no distilling layer after it
            x, attn = self.attn_layers[-1](x, attn_mask=attn_mask)
            attns.append(attn)
        else:
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask)
                attns.append(attn)

        if self.norm is not None:
            # Apply the normalization layer
            x = self.norm(x)

        return x, attns
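How much the distilling step shortens the sequence follows from the standard output-length formula of a 1D max pool. The sketch below assumes a pool with kernel size 3, stride 2 and padding 1, and a convolution before it that preserves the length; these settings are not shown in this post, so treat them as assumptions. Both function names are hypothetical.

```python
def pooled_length(length, kernel_size=3, stride=2, padding=1):
    """Output length of a 1D max pool (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def encoder_lengths(seq_len, e_layers):
    """Sequence length seen by each attention layer when distilling is on."""
    lengths = [seq_len]
    # There is one distilling layer between consecutive attention layers.
    for _ in range(e_layers - 1):
        lengths.append(pooled_length(lengths[-1]))
    return lengths

print(encoder_lengths(96, 2))
print(encoder_lengths(96, 3))
```

With the default e_layers=2 there is a single distilling step, so under these assumptions the encoder output that the decoder attends to has 48 time steps rather than 96.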

Step into the EncoderLayer class to find where attention is computed.

class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4*d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None):
        # x is passed three times, to be projected into Q, K and V
        new_x, attn = self.attention(
            x, x, x,
            attn_mask = attn_mask
        )
        # Residual connection
        x = x + self.dropout(new_x)

        # Position-wise feed-forward block built from two kernel-size-1 convolutions
        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1,1))))
        y = self.dropout(self.conv2(y).transpose(-1,1))

        return self.norm2(x+y), attn

Note the call new_x, attn = self.attention(x, x, x, attn_mask = attn_mask) in the code.

2.3 Attention layer

The attention layer is in attn.py; find AttentionLayer.

class AttentionLayer(nn.Module):
    def __init__(self, attention, d_model, n_heads,
                 d_keys=None, d_values=None, mix=False):
        super(AttentionLayer, self).__init__()

        d_keys = d_keys or (d_model//n_heads)
        d_values = d_values or (d_model//n_heads)

        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model, d_keys * n_heads)
        self.key_projection = nn.Linear(d_model, d_keys * n_heads)
        self.value_projection = nn.Linear(d_model, d_values * n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, d_model)
        self.n_heads = n_heads
        self.mix = mix

    def forward(self, queries, keys, values, attn_mask):
        # Batch size and query length (B=32, L=96); the last dimension is
        # d_model = 512 after the embedding, not the 12 raw features
        B, L, _ = queries.shape
        # Likewise S=96 for the keys
        _, S, _ = keys.shape
        # Number of attention heads, 8 here
        H = self.n_heads

        # Linear layers map the 512 features to 512, giving Q, K and V
        # (512 is the feature width after the embedding)
        # The result is reshaped to (batch, sequence length, heads, 512 / heads = 64)
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)

        # Compute attention
        out, attn = self.inner_attention(
            queries,
            keys,
            values,
            attn_mask
        )
        if self.mix:
            out = out.transpose(2,1).contiguous()
        # Merge the heads: (batch, sequence length, heads * 64 = 512)
        out = out.view(B, L, -1)
        # Output linear layer, 512 --> 512
        return self.out_projection(out), attn

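Note self.inner_attention in the code; follow it to ProbAttention.

_prob_QK, which selects the queries and keys to use, is the core of the model and deserves a careful read. The formula is:

[Figure: formula image]

_get_initial_context computes the initial context from V, and _update_context overwrites the context of the important (selected) queries.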
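Since the formula image did not survive, the measurement can be read off the _prob_QK code that follows instead. Written out for one query q_i, with S_i denoting the sample_k key positions drawn at random for that query (with replacement, since torch.randint is used) and L_K the total number of keys, the code computes:

```latex
M(q_i, K) = \max_{j \in S_i} \left( q_i k_j^{\top} \right)
            - \frac{1}{L_K} \sum_{j \in S_i} q_i k_j^{\top}
```

The symbols S_i and M are notation chosen for this note. The queries with the largest M are the ones kept for full attention. Note that the code applies the 1/sqrt(D) scaling only later, to the scores of the selected queries, so it does not appear in this expression.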

import numpy as np
import torch
import torch.nn as nn
from math import sqrt

from utils.masking import ProbMask

class ProbAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
        super(ProbAttention, self).__init__()
        self.factor = factor
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)

    def _prob_QK(self, Q, K, sample_k, n_top): # n_top: c*ln(L_q)
        # Shapes are [batch, heads, sequence length, per-head width]
        B, H, L_K, E = K.shape
        _, _, L_Q, _ = Q.shape

        # Add a dimension so that every query sees all keys:
        # [batch, heads, L_Q, L_K, per-head width]
        K_expand = K.unsqueeze(-3).expand(B, H, L_Q, L_K, E)
        # Random key indices in 0..95, shape [L_Q, 25]
        index_sample = torch.randint(L_K, (L_Q, sample_k)) # real U = U_part(factor*ln(L_k))*L_q
        # Each of the 96 queries gets its own 25 sampled keys:
        # [batch, heads, number of Q, number of sampled K, per-head width]
        K_sample = K_expand[:, :, torch.arange(L_Q).unsqueeze(1), index_sample, :]
        # Dot product of each query with its sampled keys:
        # [batch, heads, number of Q, number of sampled K]
        Q_K_sample = torch.matmul(Q.unsqueeze(-2), K_sample.transpose(-2, -1)).squeeze(-2)

        # Sparsity measurement per query: largest score minus (sum of scores / L_K)
        M = Q_K_sample.max(-1)[0] - torch.div(Q_K_sample.sum(-1), L_K)
        # Indices of the top 25 of the 96 queries
        M_top = M.topk(n_top, sorted=False)[1]

        # Gather the selected queries: [batch, heads, number of selected Q, per-head width]
        Q_reduce = Q[torch.arange(B)[:, None, None],
                     torch.arange(H)[None, :, None],
                     M_top, :] # factor*ln(L_q)
        # Scores of the selected queries against ALL keys
        Q_K = torch.matmul(Q_reduce, K.transpose(-2, -1)) # factor*ln(L_q)*L_k

        return Q_K, M_top

    # Initial context
    def _get_initial_context(self, V, L_Q):
        # batch, heads, sequence length, per-head width
        B, H, L_V, D = V.shape
        if not self.mask_flag:
            # Queries outside the selected 25 get the mean of V
            # (they stay "average")
            V_sum = V.mean(dim=-2)
            # Start by giving all 96 positions the mean
            contex = V_sum.unsqueeze(-2).expand(B, H, L_Q, V_sum.shape[-1]).clone()
        else: # use mask
            # With a causal mask, start from the cumulative sum of V instead
            assert(L_Q == L_V) # requires that L_Q == L_V, i.e. for self-attention only
            contex = V.cumsum(dim=-2)
        return contex

    # Update the context of the 25 selected queries
    def _update_context(self, context_in, V, scores, index, L_Q, attn_mask):
        B, H, L_V, D = V.shape

        if self.mask_flag:
            attn_mask = ProbMask(B, H, L_Q, index, scores, device=V.device)
            scores.masked_fill_(attn_mask.mask, -np.inf)

        # Softmax over the keys
        attn = torch.softmax(scores, dim=-1)

        # Overwrite the context of the 25 selected queries; the rest keep their initial value
        context_in[torch.arange(B)[:, None, None],
                   torch.arange(H)[None, :, None],
                   index, :] = torch.matmul(attn, V).type_as(context_in)
        if self.output_attention:
            attns = (torch.ones([B, H, L_V, L_V])/L_V).type_as(attn).to(attn.device)
            attns[torch.arange(B)[:, None, None], torch.arange(H)[None, :, None], index, :] = attn
            return (context_in, attns)
        else:
            return (context_in, None)

    def forward(self, queries, keys, values, attn_mask):
        # batch, sequence length, heads, per-head width
        B, L_Q, H, D = queries.shape
        # Key sequence length (96 queries and 96 keys in the encoder)
        _, L_K, _, _ = keys.shape

        # Transpose to (batch, heads, sequence length, per-head width)
        queries = queries.transpose(2,1)
        keys = keys.transpose(2,1)
        values = values.transpose(2,1)

        # Number of sampled keys and of selected queries; this is the core of
        # the speed-up. factor defaults to 5 and can be changed; the larger it
        # is, the higher the computational cost
        U_part = self.factor * np.ceil(np.log(L_K)).astype('int').item() # c*ln(L_k)
        u = self.factor * np.ceil(np.log(L_Q)).astype('int').item() # c*ln(L_q)

        U_part = U_part if U_part<L_K else L_K
        u = u if u<L_Q else L_Q

        # Select the queries and compute their scores
        scores_top, index = self._prob_QK(queries, keys, sample_k=U_part, n_top=u)

        # Scale by 1/sqrt(D) to reduce the effect of the dimension on the scores
        scale = self.scale or 1./sqrt(D)
        if scale is not None:
            scores_top = scores_top * scale
        # Initialize the context
        context = self._get_initial_context(values, L_Q)
        # Update the context of the 25 selected queries
        context, attn = self._update_context(context, values, scores_top, index, L_Q, attn_mask)

        return context.transpose(2,1).contiguous(), attn
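To see the whole selection procedure at a small scale, the sketch below re-implements the unmasked path for a single head on 2D arrays, using NumPy in place of torch. It is a simplified illustration written for this note (prob_sparse_attention is a hypothetical name), with no batch or head dimensions, no masking and no dropout.

```python
import math
import numpy as np

def prob_sparse_attention(Q, K, V, factor=5, seed=0):
    """Single-head, unmasked ProbSparse-style attention on 2D arrays.

    Q: [L_Q, D], K and V: [L_K, D]. Returns (context, selected query indices).
    """
    rng = np.random.default_rng(seed)
    L_Q, D = Q.shape
    L_K = K.shape[0]

    # Number of sampled keys per query and number of queries kept.
    sample_k = min(factor * math.ceil(math.log(L_K)), L_K)
    n_top = min(factor * math.ceil(math.log(L_Q)), L_Q)

    # Score every query against its own random sample of keys.
    index_sample = rng.integers(0, L_K, size=(L_Q, sample_k))
    K_sample = K[index_sample]                        # [L_Q, sample_k, D]
    QK_sample = np.einsum('qd,qsd->qs', Q, K_sample)  # [L_Q, sample_k]

    # Sparsity measurement: max minus (sum / L_K), as in _prob_QK.
    M = QK_sample.max(axis=-1) - QK_sample.sum(axis=-1) / L_K
    top = np.argsort(-M)[:n_top]

    # Every query starts from the mean of V ...
    context = np.tile(V.mean(axis=0), (L_Q, 1))
    # ... and only the selected queries get full softmax attention.
    scores = (Q[top] @ K.T) / math.sqrt(D)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    context[top] = weights @ V
    return context, top

rng = np.random.default_rng(1)
Q = rng.standard_normal((96, 64))
K = rng.standard_normal((96, 64))
V = rng.standard_normal((96, 64))
context, top = prob_sparse_attention(Q, K, V)
print(context.shape, len(top))
```

With L = 96 and factor = 5, ceil(ln 96) = 5, so 25 keys are sampled per query and 25 queries are kept, which is where the number 25 in the code comments comes from.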

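2.4 Decoder Embedding

The decoder embedding is identical to the encoder embedding; only the shape of the input differs. x_dec has shape [batch, known + unknown sequence length, feature columns], here (32, 72 = 48 + 24, 12).

2.5 Decoder module

The Decoder class is in decoder.py.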

class Decoder(nn.Module):
    def __init__(self, layers, norm_layer=None):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList(layers)
        self.norm = norm_layer

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        for layer in self.layers:
            # Each layer first runs self-attention over the decoder sequence
            # (72 queries, 72 keys), in the same way as the encoder's attention,
            # and then cross-attention against the encoder output (cross)
            x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)

        if self.norm is not None:
            x = self.norm(x)

        return x

The layer used in the loop is defined in the same file; find DecoderLayer.

class DecoderLayer(nn.Module):
    def __init__(self, self_attention, cross_attention, d_model, d_ff=None,
                 dropout=0.1, activation="relu"):
        super(DecoderLayer, self).__init__()
        d_ff = d_ff or 4*d_model
        self.self_attention = self_attention
        self.cross_attention = cross_attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        x = x + self.dropout(self.self_attention(
            # Q, K and V all come from the decoder sequence (length 72)
            x, x, x,
            attn_mask=x_mask
        )[0])
        x = self.norm1(x)

        # Cross-attention between the decoder and the encoder output;
        # this is the line connecting Encoder and Decoder in the architecture diagram
        x = x + self.dropout(self.cross_attention(
            # x supplies Q; cross (the encoder output) supplies both K and V
            x, cross, cross,
            attn_mask=cross_mask
        )[0])

        y = x = self.norm2(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1,1))))
        y = self.dropout(self.conv2(y).transpose(-1,1))

        return self.norm3(x+y)

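Results

The log below is from a run on ETTh1 (data='ETTh1' with 7 features, as the Namespace line shows), not on the WTH settings used in the walkthrough above.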

Args in experiment:
Namespace(model='informer', data='ETTh1', root_path='./data/ETT/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=96, label_len=48, pred_len=24, enc_in=7, dec_in=7, c_out=7, d_model=512, n_heads=8, e_layers=2, d_layers=1, s_layers=[3, 2, 1], d_ff=2048, factor=5, padding=0, distil=True, dropout=0.05, attn='prob', embed='timeF', activation='gelu', output_attention=False, do_predict=False, mix=True, cols=None, num_workers=0, itr=2, train_epochs=6, batch_size=32, patience=3, learning_rate=0.0001, des='test', loss='mse', lradj='type1', use_amp=False, inverse=False, use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1,2,3', detail_freq='h')
Use GPU: cuda:0
>>>>>>>start training : informer_ETTh1_ftM_sl96_ll48_pl24_dm512_nh8_el2_dl1_df2048_atprob_fc5_ebtimeF_dtTrue_mxTrue_test_0>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8521
val 2857
test 2857
iters: 100, epoch: 1 | loss: 0.5261109
speed: 0.1242s/iter; left time: 185.9112s
iters: 200, epoch: 1 | loss: 0.3267951
speed: 0.0466s/iter; left time: 65.1050s
Epoch: 1 cost time: 20.148507595062256
Epoch: 1, Steps: 266 | Train Loss: 0.4189890 Vali Loss: 0.6623457 Test Loss: 0.5295426
Validation loss decreased (inf --> 0.662346). Saving model ...
Updating learning rate to 0.0001
iters: 100, epoch: 2 | loss: 0.2857513
speed: 0.1165s/iter; left time: 143.4282s
iters: 200, epoch: 2 | loss: 0.2075945
speed: 0.0464s/iter; left time: 52.4259s
Epoch: 2 cost time: 12.32896614074707
Epoch: 2, Steps: 266 | Train Loss: 0.2562122 Vali Loss: 0.6945553 Test Loss: 0.5665584
EarlyStopping counter: 1 out of 3
Updating learning rate to 5e-05
iters: 100, epoch: 3 | loss: 0.1802989
speed: 0.1160s/iter; left time: 111.9148s
iters: 200, epoch: 3 | loss: 0.2122464
speed: 0.0473s/iter; left time: 40.9259s
Epoch: 3 cost time: 12.661777019500732
Epoch: 3, Steps: 266 | Train Loss: 0.1954239 Vali Loss: 0.7046237 Test Loss: 0.6552624
EarlyStopping counter: 2 out of 3
Updating learning rate to 2.5e-05
iters: 100, epoch: 4 | loss: 0.1874317
speed: 0.1169s/iter; left time: 81.6874s
iters: 200, epoch: 4 | loss: 0.1856833
speed: 0.0463s/iter; left time: 27.7299s
Epoch: 4 cost time: 12.376939058303833
Epoch: 4, Steps: 266 | Train Loss: 0.1685006 Vali Loss: 0.7188290 Test Loss: 0.7713081
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : informer_ETTh1_ftM_sl96_ll48_pl24_dm512_nh8_el2_dl1_df2048_atprob_fc5_ebtimeF_dtTrue_mxTrue_test_0<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2857
test shape: (89, 32, 24, 7) (89, 32, 24, 7)
test shape: (2848, 24, 7) (2848, 24, 7)
mse:0.5292219519615173, mae:0.5165334343910217
Use GPU: cuda:0
>>>>>>>start training : informer_ETTh1_ftM_sl96_ll48_pl24_dm512_nh8_el2_dl1_df2048_atprob_fc5_ebtimeF_dtTrue_mxTrue_test_1>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8521
val 2857
test 2857
iters: 100, epoch: 1 | loss: 0.4414886
speed: 0.0469s/iter; left time: 70.2441s
iters: 200, epoch: 1 | loss: 0.3686439
speed: 0.0463s/iter; left time: 64.7000s
Epoch: 1 cost time: 12.486876964569092
Epoch: 1, Steps: 266 | Train Loss: 0.4205188 Vali Loss: 0.7174129 Test Loss: 0.6916030
Validation loss decreased (inf --> 0.717413). Saving model ...
Updating learning rate to 0.0001
iters: 100, epoch: 2 | loss: 0.2283735
speed: 0.1189s/iter; left time: 146.3317s
iters: 200, epoch: 2 | loss: 0.2095163
speed: 0.0466s/iter; left time: 52.7198s
Epoch: 2 cost time: 12.505866289138794
Epoch: 2, Steps: 266 | Train Loss: 0.2619197 Vali Loss: 0.6467599 Test Loss: 0.5395069
Validation loss decreased (0.717413 --> 0.646760). Saving model ...
Updating learning rate to 5e-05
iters: 100, epoch: 3 | loss: 0.1923385
speed: 0.1177s/iter; left time: 113.5350s
iters: 200, epoch: 3 | loss: 0.1816102
speed: 0.0465s/iter; left time: 40.2514s
Epoch: 3 cost time: 12.474884033203125
Epoch: 3, Steps: 266 | Train Loss: 0.1994059 Vali Loss: 0.6823798 Test Loss: 0.6388326
EarlyStopping counter: 1 out of 3
Updating learning rate to 2.5e-05
iters: 100, epoch: 4 | loss: 0.1672164
speed: 0.1154s/iter; left time: 80.6885s
iters: 200, epoch: 4 | loss: 0.1530543
speed: 0.0467s/iter; left time: 27.9933s
Epoch: 4 cost time: 12.345957517623901
Epoch: 4, Steps: 266 | Train Loss: 0.1715669 Vali Loss: 0.7120979 Test Loss: 0.6874654
EarlyStopping counter: 2 out of 3
Updating learning rate to 1.25e-05
iters: 100, epoch: 5 | loss: 0.1682257
speed: 0.1163s/iter; left time: 50.3378s
iters: 200, epoch: 5 | loss: 0.1570243
speed: 0.0465s/iter; left time: 15.4723s
Epoch: 5 cost time: 12.490875959396362
Epoch: 5, Steps: 266 | Train Loss: 0.1574323 Vali Loss: 0.7080439 Test Loss: 0.6615207
EarlyStopping counter: 3 out of 3
Early stopping
>>>>>>>testing : informer_ETTh1_ftM_sl96_ll48_pl24_dm512_nh8_el2_dl1_df2048_atprob_fc5_ebtimeF_dtTrue_mxTrue_test_1<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
test 2857
test shape: (89, 32, 24, 7) (89, 32, 24, 7)
test shape: (2848, 24, 7) (2848, 24, 7)
mse:0.5387394428253174, mae:0.5245636701583862

Process finished with exit code 0
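The "EarlyStopping counter" lines in the log can be traced by hand. The sketch below is a simplified stand-in for the EarlyStopping class, which is not shown in this post: it assumes the counter resets on any improvement of the validation loss and training stops once the counter reaches patience. epochs_until_stop is a hypothetical name.

```python
def epochs_until_stop(vali_losses, patience=3):
    """Return (epoch at which training stops, epoch of the best checkpoint)."""
    best_loss = float('inf')
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(vali_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it (the real code saves a checkpoint here).
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return len(vali_losses), best_epoch

# Validation losses copied from the two runs in the log above.
run_0 = [0.6623457, 0.6945553, 0.7046237, 0.7188290]
run_1 = [0.7174129, 0.6467599, 0.6823798, 0.7120979, 0.7080439]
print(epochs_until_stop(run_0))
print(epochs_until_stop(run_1))
```

Under these assumptions the first run stops at epoch 4 with its best checkpoint from epoch 1, and the second stops at epoch 5 with its best checkpoint from epoch 2, matching the log. The weights loaded for testing are therefore the early ones, not those of the final epoch.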

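  • After the run, two folders are created in the project directory. The checkpoints folder holds the model file, which has the .pth extension. The results folder holds 3 files, among them pred.npy (the predictions) and true.npy (the ground truth).
  • The authors describe the exact prediction procedure on GitHub.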
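The mse and mae values at the end of each run can be recomputed from the saved arrays. The sketch below defines the two metrics with NumPy and runs them on synthetic arrays that have the shape reported in the log; with real outputs, np.load on pred.npy and true.npy would replace the synthetic data. The function names are chosen for this note.

```python
import numpy as np

def mse(pred, true):
    """Mean squared error over all samples, steps and features."""
    return float(np.mean((pred - true) ** 2))

def mae(pred, true):
    """Mean absolute error over all samples, steps and features."""
    return float(np.mean(np.abs(pred - true)))

# Synthetic stand-ins with the shape reported in the log: (2848, 24, 7).
rng = np.random.default_rng(0)
true = rng.standard_normal((2848, 24, 7))
pred = true + 0.5  # a constant error of 0.5 everywhere
print(mse(pred, true), mae(pred, true))
```

Both metrics average over every sample, time step and feature at once, so a single number summarizes all 7 predicted columns.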

Original article: https://blog.csdn.net/qq_20144897/article/details/127298319