Frequently Asked Questions¶
Here you can find some Frequently Asked Questions, as well as some other use-case examples that are not part of the User Guide.
How to get the top-N recommendations for each user¶
Here is an example where we retrieve the top-10 items with the highest rating prediction for each user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, then predict the ratings for all pairs (user, item) that are not in the training set. We then retrieve the top-10 predictions for each user.
examples/top_n_recommendations.py¶
from collections import defaultdict
from surprise import Dataset, SVD
def get_top_n(predictions, n=10):
"""Return the top-N recommendation for each user from a set of predictions.
Args:
predictions(list of Prediction objects): The list of predictions, as
returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
is 10.
Returns:
A dict where keys are user (raw) ids and values are lists of tuples:
[(raw item id, rating estimation), ...] of size n.
"""
# First map the predictions to each user.
top_n = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
top_n[uid].append((iid, est))
# Then sort the predictions for each user and retrieve the k highest ones.
for uid, user_ratings in top_n.items():
user_ratings.sort(key=lambda x: x[1], reverse=True)
top_n[uid] = user_ratings[:n]
return top_n
# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=10)
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
print(uid, [iid for (iid, _) in user_ratings])
How to compute precision@k and recall@k¶
Here is an example where we compute Precision@k and Recall@k for each user:
\(\text{Precision@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Recommended items} \} | }\)
\(\text{Recall@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Relevant items} \} | }\)
An item is considered relevant if its true rating \(r_{ui}\) is greater than a given threshold. An item is considered recommended if its estimated rating \(\hat{r}_{ui}\) is greater than the threshold, and if it is among the k highest estimated ratings.
Note that in the edge cases where division by zero occurs, Precision@k and Recall@k are undefined. As a convention, we set their values to 0 in such cases.
examples/precision_recall_at_k.py¶
from collections import defaultdict
from surprise import Dataset, SVD
from surprise.model_selection import KFold
def precision_recall_at_k(predictions, k=10, threshold=3.5):
"""Return precision and recall at k metrics for each user"""
# First map the predictions to each user.
user_est_true = defaultdict(list)
for uid, _, true_r, est, _ in predictions:
user_est_true[uid].append((est, true_r))
precisions = dict()
recalls = dict()
for uid, user_ratings in user_est_true.items():
# Sort user ratings by estimated value
user_ratings.sort(key=lambda x: x[0], reverse=True)
# Number of relevant items
n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
# Number of recommended items in top k
n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
# Number of relevant and recommended items in top k
n_rel_and_rec_k = sum(
((true_r >= threshold) and (est >= threshold))
for (est, true_r) in user_ratings[:k]
)
# Precision@K: Proportion of recommended items that are relevant
# When n_rec_k is 0, Precision is undefined. We here set it to 0.
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
# Recall@K: Proportion of relevant items that are recommended
# When n_rel is 0, Recall is undefined. We here set it to 0.
recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
return precisions, recalls
data = Dataset.load_builtin("ml-100k")
kf = KFold(n_splits=5)
algo = SVD()
for trainset, testset in kf.split(data):
algo.fit(trainset)
predictions = algo.test(testset)
precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)
# Precision and recall can then be averaged over all users
print(sum(prec for prec in precisions.values()) / len(precisions))
print(sum(rec for rec in recalls.values()) / len(recalls))
How to get the k nearest neighbors of a user (or item)¶
You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.
Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
There's a lot of boilerplate because of the conversions between movie names and their raw/inner ids (see this note), but it all boils down to the use of the get_neighbors() method:
examples/k_nearest_neighbors.py¶
import io  # noqa
from surprise import Dataset, get_dataset_dir, KNNBaseline
def read_item_names():
"""Read the u.item file from MovieLens 100-k dataset and return two
mappings to convert raw ids into movie names and movie names into raw ids.
"""
file_name = get_dataset_dir() + "/ml-100k/ml-100k/u.item"
rid_to_name = {}
name_to_rid = {}
with open(file_name, encoding="ISO-8859-1") as f:
for line in f:
line = line.split("|")
rid_to_name[line[0]] = line[1]
name_to_rid[line[1]] = line[0]
return rid_to_name, name_to_rid
# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
sim_options = {"name": "pearson_baseline", "user_based": False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)
# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()
# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid["Toy Story (1995)"]
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
# Convert inner ids of the neighbors into names.
toy_story_neighbors = (
algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors
)
toy_story_neighbors = (rid_to_name[rid] for rid in toy_story_neighbors)
print()
print("The 10 nearest neighbors of Toy Story are:")
for movie in toy_story_neighbors:
print(movie)
Naturally, the same can be done for users with minor modifications.
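For instance, here is a minimal sketch of the user-based variant (the raw user id "196" is just an example value from ml-100k); the only changes are the user_based similarity option and the user-id conversion methods:
from surprise import Dataset, KNNBaseline

data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

# Compute similarities between users instead of items.
sim_options = {"name": "pearson_baseline", "user_based": True}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Convert the raw user id to its inner id, query the neighbors, convert back.
inner_uid = trainset.to_inner_uid("196")
neighbor_inner_ids = algo.get_neighbors(inner_uid, k=10)
print([trainset.to_raw_uid(inner_id) for inner_id in neighbor_inner_ids])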
How to serialize an algorithm¶
Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and used again for making predictions:
examples/serialize_algorithm.py¶
import os
from surprise import Dataset, dump, SVD
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())
# Dump algorithm and reload it.
file_name = os.path.expanduser("~/dump_file")
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)
# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print("Predictions are the same")
Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Some examples of this are given in two notebooks.
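As a rough sketch of the general idea (reusing the algo and predictions objects from the example above), one can dump the predictions together with the algorithm and load them into a pandas dataframe for analysis:
import os

import pandas as pd

from surprise import dump

# Dump both the algorithm and its predictions...
file_name = os.path.expanduser("~/dump_file")
dump.dump(file_name, predictions=predictions, algo=algo)

# ... and load them back for analysis. Prediction objects are named tuples
# with fields (uid, iid, r_ui, est, details).
predictions, algo = dump.load(file_name)
df = pd.DataFrame(predictions, columns=["uid", "iid", "rui", "est", "details"])
df["err"] = abs(df["est"] - df["rui"])
print(df.head())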
How to build your own prediction algorithm¶
There's a whole guide here.
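As a bare-bones sketch of what the guide covers: a custom algorithm derives from AlgoBase and implements estimate(), which receives inner ids and returns a rating prediction. The toy class below simply predicts the global mean:
from surprise import AlgoBase, Dataset
from surprise.model_selection import cross_validate


class MyOwnAlgorithm(AlgoBase):
    """A toy algorithm that always predicts the global mean rating."""

    def fit(self, trainset):
        # Always call the base class fit() first.
        AlgoBase.fit(self, trainset)
        self.the_mean = trainset.global_mean
        return self

    def estimate(self, u, i):
        # u and i are inner ids here.
        return self.the_mean


data = Dataset.load_builtin("ml-100k")
cross_validate(MyOwnAlgorithm(), data, verbose=True)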
What are raw and inner ids¶
Users and items have a raw id and an inner id. Some methods will use/return a raw id (e.g. the predict() method), while some others will use/return an inner id.
Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note though that if the ratings were read from a file (which is the standard scenario), they are represented as strings. This is important to know if you're using e.g. predict() or other methods that accept raw ids as parameters.
On trainset creation, each raw id is mapped to a unique integer called inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done with the to_inner_uid(), to_inner_iid(), to_raw_uid(), and to_raw_iid() methods of the trainset object.
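For illustration, here is a small sketch of these conversions on ml-100k (the raw ids "196" and "302" are just example values; note that they are strings):
from surprise import Dataset, SVD

data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

# Raw ids read from a file are strings; inner ids are integers.
inner_uid = trainset.to_inner_uid("196")
inner_iid = trainset.to_inner_iid("302")
print(inner_uid, inner_iid)

# And back again.
print(trainset.to_raw_uid(inner_uid), trainset.to_raw_iid(inner_iid))

# predict() takes raw ids.
algo = SVD()
algo.fit(trainset)
print(algo.predict("196", "302").est)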
Can I use my own dataset with Surprise, and can it be a pandas dataframe?¶
Yes, and yes. See the user guide.
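For a quick sketch with made-up data: a dataframe whose columns are user ids, item ids and ratings (in that order) can be loaded with Dataset.load_from_df() and a Reader:
import pandas as pd

from surprise import Dataset, Reader

# A toy dataframe; the column names are up to you, only the order matters.
ratings_dict = {
    "userID": [9, 32, 2, 45, 9],
    "itemID": [1, 1, 1, 2, 2],
    "rating": [3, 2, 4, 3, 1],
}
df = pd.DataFrame(ratings_dict)

# A Reader only needs the rating_scale parameter here.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings, in that order.
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)
trainset = data.build_full_trainset()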
How to tune an algorithm's parameters¶
You can tune the parameters of an algorithm with the GridSearchCV class as described here. After tuning, you may want to get an unbiased estimate of your algorithm's performance.
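For reference, a minimal grid search looks like the sketch below; the full recipe for getting the unbiased estimate afterwards is given further down in this FAQ.
from surprise import Dataset, SVD
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")

param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# Best RMSE score, and the parameter combination that achieved it.
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])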
How to get accuracy measures on the training set¶
You can use the build_testset() method of the Trainset object to build a test set that can then be used with the test() method:
examples/evaluate_on_trainset.py¶
from surprise import accuracy, Dataset, SVD
from surprise.model_selection import KFold
data = Dataset.load_builtin("ml-100k")
algo = SVD()
trainset = data.build_full_trainset()
algo.fit(trainset)
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE should be low as we are biased
accuracy.rmse(predictions, verbose=True) # ~ 0.68 (which is low)
Check out the example file for more usage examples.
How to save some data for unbiased accuracy estimation¶
If your goal is to tune the parameters of an algorithm, you may want to spare some data to get an unbiased estimate of its performance. For instance you may want to split your data into two sets A and B: A is used for parameter tuning with grid search, and B is used for the unbiased estimate. This can be done as follows:
examples/split_data_for_unbiased_estimation.py¶
import random
from surprise import accuracy, Dataset, SVD
from surprise.model_selection import GridSearchCV
# Load the full dataset.
data = Dataset.load_builtin("ml-100k")
raw_ratings = data.raw_ratings
# shuffle ratings if you want
random.shuffle(raw_ratings)
# A = 90% of the data, B = 10% of the data
threshold = int(0.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]
data.raw_ratings = A_raw_ratings # data is now the set A
# Select your best algo with grid search.
print("Grid Search...")
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
grid_search.fit(data)
algo = grid_search.best_estimator["rmse"]
# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)
# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print("Biased accuracy on A,", end=" ")
accuracy.rmse(predictions)
# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings) # testset is now the set B
predictions = algo.test(testset)
print("Unbiased accuracy on B,", end=" ")
accuracy.rmse(predictions)
How to have reproducible experiments¶
Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. If you need to reproduce your experiments multiple times, you just have to set the seed of the RNG at the beginning of your program:
import random
import numpy as np
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)
Where are datasets stored, and how to change it?¶
By default, datasets downloaded by Surprise will be saved in the '~/.surprise_data' directory. This is also where dump files are stored. You can change the default directory by setting the 'SURPRISE_DATA_FOLDER' environment variable.
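For example (a sketch; '~/my_surprise_data' is an arbitrary path, and the variable can just as well be exported from the shell), set the variable before any dataset is downloaded or loaded:
import os

# Must be set before Surprise downloads or looks up any dataset.
os.environ["SURPRISE_DATA_FOLDER"] = os.path.expanduser("~/my_surprise_data")

from surprise import Dataset  # noqa: E402

data = Dataset.load_builtin("ml-100k")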
Can Surprise support content-based data or implicit ratings?¶
No: this is out of scope for Surprise. Surprise was designed for explicit ratings.