LSTUR: News Recommendation with Long- and Short-Term User Representations

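In an era of information overload, personalized news recommendation matters more and more. LSTUR (Long- and Short-term User Representations) is a neural news recommendation method that models both a user's long-term preferences and short-term interests in order to produce more accurate recommendations.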

The core idea of LSTUR

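LSTUR is built around a news encoder and a user encoder. The news encoder learns a representation of each news article from its title. The user encoder has two parts: a long-term user representation and a short-term user representation.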

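  • Long- and short-term user representations: LSTUR learns a user's long-term preferences from an embedding of the user ID, and runs a GRU network over the user's recently browsed news to learn short-term interests.
  • Combining the two representations: LSTUR offers two ways to combine them. The first uses the long-term user representation to initialize the hidden state of the GRU, so that long-term preferences feed into the short-term model. The second concatenates the long-term and short-term representations into a single user vector.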
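The two combination strategies can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the library's implementation: `TinyGRU`, the random weights, the dimensions, the halved sizes in the concatenation variant, and the dot-product click score are all hypothetical choices made to keep the example small.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRU:
    """A minimal GRU cell with random weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix per gate, acting on [input, hidden state].
        self.W_z = rng.normal(0, 0.1, shape)  # update gate
        self.W_r = rng.normal(0, 0.1, shape)  # reset gate
        self.W_h = rng.normal(0, 0.1, shape)  # candidate state
        self.hidden_dim = hidden_dim

    def run(self, inputs, h0=None):
        """Read `inputs` in order and return the final hidden state."""
        h = np.zeros(self.hidden_dim) if h0 is None else h0
        for x in inputs:
            xh = np.concatenate([x, h])
            z = sigmoid(self.W_z @ xh)
            r = sigmoid(self.W_r @ xh)
            h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h]))
            h = (1 - z) * h + z * h_tilde
        return h


rng = np.random.default_rng(0)
news_dim, num_users, user_index = 8, 5, 2
clicked_news = rng.normal(size=(4, news_dim))  # stand-ins for encoded recent clicks
candidate = rng.normal(size=news_dim)          # stand-in for an encoded candidate

# Strategy 1: the user-ID embedding initializes the GRU hidden state.
gru_ini = TinyGRU(news_dim, news_dim, rng)
user_emb_ini = rng.normal(0, 0.1, (num_users, news_dim))
user_vec_ini = gru_ini.run(clicked_news, h0=user_emb_ini[user_index])

# Strategy 2: the GRU output is concatenated with the user-ID embedding.
# Each half has size news_dim // 2 so the result matches the news vector size.
gru_con = TinyGRU(news_dim, news_dim // 2, rng)
user_emb_con = rng.normal(0, 0.1, (num_users, news_dim // 2))
short_term = gru_con.run(clicked_news)
user_vec_con = np.concatenate([short_term, user_emb_con[user_index]])

# Assumed scoring: dot product between user vector and candidate news vector.
score_ini = float(user_vec_ini @ candidate)
score_con = float(user_vec_con @ candidate)
print(score_ini, score_con)
```

In the first strategy the long-term embedding influences every GRU step through the hidden state; in the second it bypasses the GRU entirely and is only joined to its output.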

Advantages of LSTUR

LSTUR has the following advantages:

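  • It captures long- and short-term interests together: LSTUR learns a user's stable preferences and recent interests in a single model, which leads to more accurate recommendations.
  • It learns long-term preferences from the user ID: embedding the user ID is a simple and effective way to model long-term interests.
  • It learns short-term interests with a GRU: running a GRU over the user's recently browsed news captures how the user's interests change over time.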

Data format

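For training and evaluation, this walkthrough runs LSTUR on the MIND dataset. MIND comes in three versions: large, small, and demo. The demo version contains 5,000 users; it is a subset of the small version and is meant for quick experiments. The large version contains more users and news articles and is meant for more thorough evaluation.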

The two MIND files used here are the news file and the behaviors file.

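  • news file: one tab-separated line per article, with the news ID, category, subcategory, title, abstract, URL, and the entities found in the title and in the abstract.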

Example:

N46466    lifestyle   lifestyleroyals The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By  Shop the notebooks, jackets, and more that the royals can't live without.   https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata   [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]  []
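A line of the news file can be split into named fields with the standard library alone. The sketch below assumes the columns arrive tab-separated in the order listed above; the file itself has no header row, so the field names in `NEWS_COLUMNS` are hypothetical labels, and the sample line is a shortened version of the example with a placeholder URL.

```python
import json

# Hypothetical field names; the file has no header row.
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def parse_news_line(line):
    """Split one tab-separated news line and decode its entity lists."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    for key in ("title_entities", "abstract_entities"):
        # The entity columns hold JSON arrays; treat an empty field as no entities.
        record[key] = json.loads(record.get(key) or "[]")
    return record


sample = "\t".join([
    "N46466",
    "lifestyle",
    "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://example.com/article",  # placeholder URL for the sketch
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], '
    '"SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])

news = parse_news_line(sample)
print(news["news_id"], news["category"], news["title_entities"][0]["Label"])
```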
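  • behaviors file: one line per impression, with the impression ID, user ID, impression time, the user's click history, and the news shown in the impression. Each shown item carries a suffix: -1 if the user clicked it and -0 if not.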

Example:

1    U82271  11/11/2019 3:28:58 PM   N3130 N11621 N12917 N4574 N12140 N9748  N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0
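A behaviors line can be parsed the same way. This sketch assumes five tab-separated columns in the order described above and uses a shortened version of the example line; `parse_behavior_line` and its dictionary keys are hypothetical names, not part of any library.

```python
def parse_behavior_line(line):
    """Split one tab-separated behaviors line into typed fields."""
    impression_id, user_id, time, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates = []
    for item in impressions.split():
        # Each item looks like "N15368-1": news ID, then a click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": int(impression_id),
        "user_id": user_id,
        "time": time,
        "history": history.split(),  # previously clicked news IDs, in order
        "impressions": candidates,   # (news ID, clicked?) pairs
    }


sample = (
    "1\tU82271\t11/11/2019 3:28:58 PM\t"
    "N3130 N11621 N12917\t"
    "N13390-0 N7180-0 N15368-1 N481-0"
)

behavior = parse_behavior_line(sample)
clicked = [news_id for news_id, label in behavior["impressions"] if label == 1]
print(behavior["user_id"], behavior["history"], clicked)
```

The click history is what the user encoder consumes, while the labeled impression items provide the positive and negative candidates for training and evaluation.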

Code example

The following code trains and evaluates an LSTUR model for news recommendation, using the recommenders package:

1. Import the required libraries

import os
import sys
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

2. Set the parameters

epochs = 5
seed = 40
batch_size = 32

# Dataset version: demo, small, or large
MIND_type = "demo"

3. Download and load the data

tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'lstur.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', os.path.join(data_path, 'utils'), mind_utils)

4. Create the hyperparameters

hparams = prepare_hparams(yaml_file,
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file,
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs)
print(hparams)

5. Create the iterator

iterator = MINDIterator

6. Train the LSTUR model

model = LSTURModel(hparams, iterator, seed=seed)

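# Evaluate the untrained model to get a baseline
print(model.run_eval(valid_news_file, valid_behaviors_file))

# Train the model
# (%%time is a Jupyter cell magic; to time this step in a notebook,
# put it on the first line of its own cell.)
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

# Evaluate the trained model
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)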
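The evaluation step prints ranking metrics that are computed per impression and then averaged. Which metrics appear depends on the YAML configuration, which is not shown here. As an illustration of how one such metric treats a single impression, here is a minimal sketch of reciprocal rank, assuming it is among the configured metrics; `mrr_score` is a hypothetical helper written for this example, not a call into the library.

```python
import numpy as np


def mrr_score(labels, scores):
    """Reciprocal rank of the clicked items within one impression.

    labels: 1 for clicked candidates, 0 otherwise.
    scores: predicted click scores, one per candidate.
    """
    order = np.argsort(scores)[::-1]            # best-scored candidate first
    ranked_labels = np.take(labels, order)
    reciprocal_ranks = ranked_labels / (np.arange(len(ranked_labels)) + 1)
    # Average over the clicked items in this impression.
    return float(np.sum(reciprocal_ranks) / np.sum(ranked_labels))


labels = [0, 0, 1, 0]
scores = [0.2, 0.7, 0.5, 0.1]
# The clicked item has the second-highest score, so its reciprocal rank is 1/2.
print(mrr_score(labels, scores))
```

Averaging this value over all impressions in the validation set gives a mean reciprocal rank; a value closer to 1 means clicked items tend to be ranked near the top.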

7. Save the model

model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_ckpt"))

8. Generate the prediction file

group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
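The double `argsort` in the loop above turns prediction scores into ranks, where rank 1 is the highest-scored candidate. A small worked example may make that clearer; `scores_to_ranks` is a hypothetical helper name, and the scores are made up for illustration.

```python
import numpy as np


def scores_to_ranks(preds):
    """Return the 1-based rank of each candidate, highest score ranked 1."""
    # Indices of candidates ordered from highest to lowest score.
    order = np.argsort(preds)[::-1]
    # argsort of that ordering gives each candidate's position in it.
    return (np.argsort(order) + 1).tolist()


preds = [0.1, 0.9, 0.5]
ranks = scores_to_ranks(preds)
print(ranks)  # [3, 1, 2]

# One line of the prediction file: the impression index, then the ranks.
line = "1 [" + ",".join(str(r) for r in ranks) + "]"
print(line)  # 1 [3,1,2]
```

The output keeps the candidates in their original order and replaces each score with its rank, which is why the second candidate (score 0.9) gets rank 1.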

Summary

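LSTUR is a neural news recommendation method that captures both the long-term and the short-term interests of a user. Its strength is that it learns the two jointly, using user ID embeddings for long-term preferences and a GRU over recent clicks for short-term interests. The experiments in the original paper (An et al., 2019) report that LSTUR performs well on the news recommendation task.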

References

  1. An, Mingxiao, et al. “Neural News Recommendation with Long- and Short-term User Representations.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
  2. Wu, Fangzhao, et al. “MIND: A Large-scale Dataset for News Recommendation.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
  3. Pennington, Jeffrey, et al. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
