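LSTUR: News Recommendation with Long- and Short-term User Representations

In an age of information overload, personalized news recommendation matters more and more. LSTUR (Long- and Short-term User Representations) is a neural news recommendation method that models both a user's long-term preferences and short-term interests in order to produce more accurate recommendations.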

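The core idea of LSTUR

LSTUR is built from a news encoder and a user encoder. The news encoder learns a representation of each news item from its title. The user encoder has two parts: a long-term user representation and a short-term user representation.

Long- and short-term user representations: LSTUR learns a user's long-term preferences from an embedding of the user ID, and runs a GRU network over the news the user browsed recently to learn short-term interests.
Combining the two representations: LSTUR offers two ways to combine them. The first uses the long-term user representation to initialize the hidden state of the GRU network, so that long-term preferences flow into the short-term interest model. The second concatenates the long-term and short-term representations into a single user vector.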
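The two combination strategies can be sketched with a small NumPy GRU. This is only an illustration under stated assumptions, not the library's implementation: the toy dimensions, the random weights, and the helper `gru_last_state` are hypothetical, and the labels "ini" and "con" follow the variant names used in the LSTUR paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(inputs, h0, params):
    """Run a plain GRU over `inputs` (T x d_in) and return the final hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = h0
    for x in inputs:
        z = sigmoid(x @ Wz + h @ Uz)               # update gate
        r = sigmoid(x @ Wr + h @ Ur)               # reset gate
        h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
num_users, news_dim, hidden_dim = 10, 8, 8         # toy sizes (assumptions)

# Long-term representation: one trainable vector per user ID (random here).
user_embedding = rng.normal(size=(num_users, hidden_dim))
# GRU weights in the order Wz, Uz, Wr, Ur, Wh, Uh.
params = [rng.normal(scale=0.1, size=shape)
          for shape in [(news_dim, hidden_dim), (hidden_dim, hidden_dim)] * 3]

clicked_news = rng.normal(size=(5, news_dim))      # encoded vectors of 5 recent clicks
long_term = user_embedding[3]                      # look up user with index 3

# Method 1 ("ini"): the long-term vector initializes the GRU hidden state.
user_vec_ini = gru_last_state(clicked_news, long_term, params)

# Method 2 ("con"): the GRU starts from zeros, then both vectors are concatenated.
short_term = gru_last_state(clicked_news, np.zeros(hidden_dim), params)
user_vec_con = np.concatenate([short_term, long_term])

print(user_vec_ini.shape, user_vec_con.shape)
```

With "ini" the user vector keeps the GRU's hidden size, while "con" doubles it, so a real model would need to map the concatenated vector back to the news vector size before scoring candidates.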

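Advantages of LSTUR

LSTUR has the following advantages:

It captures long- and short-term interests together: LSTUR learns a user's long-term preferences and short-term interests at the same time, which leads to more accurate recommendations.
It learns long-term preferences from the user ID: embedding the user ID is a simple and effective way to model long-term interests.
It learns short-term interests with a GRU network: running a GRU over recently browsed news captures how a user's interests shift over time.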

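Data format

For training and evaluation, this example uses the MIND dataset. MIND comes in three versions: large, small, and demo. The demo version contains 5000 users, is a subset of the small version, and is meant for quick experiments. The large version contains more users and news articles and is meant for more thorough evaluation.

The example reads two files from MIND: a news file and a behaviors file.

news file: contains the news information, namely the news ID, category, subcategory, title, abstract, URL, and the entities found in the title and in the abstract.

Example: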

N46466    lifestyle   lifestyleroyals The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By  Shop the notebooks, jackets, and more that the royals can't live without.   https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata   [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]  []
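A line like the one above can be split into named fields as follows. This is a minimal sketch that assumes the fields are tab-separated (the files are named `.tsv`) and that the two entity columns hold JSON lists; the column names in `NEWS_COLUMNS` and the function `parse_news_line` are hypothetical labels chosen for illustration, and the sample keeps only one title entity for brevity.

```python
import json

# Hypothetical field names, in the order described for the news file.
NEWS_COLUMNS = ["news_id", "category", "subcategory", "title",
                "abstract", "url", "title_entities", "abstract_entities"]

def parse_news_line(line):
    """Split one tab-separated news line into a dict and decode the entity lists."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    for key in ("title_entities", "abstract_entities"):
        record[key] = json.loads(record[key] or "[]")
    return record

sample = "\t".join([
    "N46466", "lifestyle", "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,"
    "-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata",
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])

news = parse_news_line(sample)
print(news["news_id"], news["category"], news["title_entities"][0]["Label"])
```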

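behaviors file: contains the user behavior information, namely the impression ID, user ID, impression time, the user's click history, and the news shown in the impression. Each impression news ID carries a suffix: -1 if the user clicked it, -0 if not.

Example: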

1    U82271  11/11/2019 3:28:58 PM   N3130 N11621 N12917 N4574 N12140 N9748  N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0
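Parsing a behaviors line mostly comes down to separating the click history from the labeled impression list. The sketch below assumes tab-separated fields with space-separated news IDs inside the last two fields; `parse_behavior_line` and its output keys are hypothetical names, and the sample is a shortened version of the line above.

```python
def parse_behavior_line(line):
    """Split one behaviors line into history, candidate news IDs, and click labels."""
    impr_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates, labels = [], []
    for item in impressions.split():
        news_id, label = item.rsplit("-", 1)   # "N15368-1" -> ("N15368", "1")
        candidates.append(news_id)
        labels.append(int(label))
    return {
        "impression_id": int(impr_id),
        "user_id": user_id,
        "time": time,
        "history": history.split(),
        "candidates": candidates,
        "labels": labels,
    }

sample = "\t".join([
    "1", "U82271", "11/11/2019 3:28:58 PM",
    "N3130 N11621 N12917",
    "N13390-0 N15368-1 N481-0",
])

behavior = parse_behavior_line(sample)
clicked = [n for n, y in zip(behavior["candidates"], behavior["labels"]) if y == 1]
print(behavior["user_id"], behavior["history"], clicked)
```

The history feeds the user encoder, while the candidates and labels are what the model scores and is evaluated against.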

Code example

The following code shows how to use the LSTUR model for news recommendation:

1. Import the required libraries

import os
import sys
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

2. Set the parameters

epochs = 5
seed = 40
batch_size = 32

# Dataset version: demo, small, or large
MIND_type = "demo"

3. Download and load the data

tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'lstur.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', os.path.join(data_path, 'utils'), mind_utils)

4. Create the hyperparameters

hparams = prepare_hparams(yaml_file,
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file,
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs)
print(hparams)

5. Create the iterator

iterator = MINDIterator

6. Train the LSTUR model

model = LSTURModel(hparams, iterator, seed=seed)

# Evaluate the untrained model to get a baseline
print(model.run_eval(valid_news_file, valid_behaviors_file))

# Train the model
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

# Evaluate the trained model again
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

7. Save the model

model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_ckpt"))

8. Generate the prediction file

group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)
f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
f.close()
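The double `argsort` in the loop above turns prediction scores into 1-based ranks, where rank 1 is the highest score. A small worked example makes the trick easier to follow; the helper name `scores_to_ranks` is hypothetical and only wraps the same expression.

```python
import numpy as np

def scores_to_ranks(preds):
    """Return the 1-based rank of each score, with rank 1 for the highest score."""
    order = np.argsort(preds)[::-1]   # item indices sorted from best to worst
    ranks = np.argsort(order) + 1     # position of each item in that ordering
    return ranks.tolist()

# The second candidate has the highest score, so it gets rank 1.
print(scores_to_ranks([0.1, 0.9, 0.5]))  # [3, 1, 2]
```

The first `argsort` answers "which item is in each position", and the second inverts that permutation to answer "which position does each item hold", which is the form written to prediction.txt.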

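Summary

LSTUR is a neural news recommendation method that models both long-term and short-term user interests in order to produce more accurate recommendations. Its strength is that it learns the two jointly, using a user ID embedding for long-term preferences and a GRU network over recent clicks for short-term interests. The experiments reported in the original paper indicate that LSTUR performs well on the news recommendation task.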

References

An, Mingxiao, et al. "Neural News Recommendation with Long- and Short-term User Representations." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Pennington, Jeffrey, et al. "GloVe: Global Vectors for Word Representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
