A Hybrid-Model Recommendation Algorithm (ACM Summer School Case Study)

xiaoxiao2022-07-07 208

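Recommendation engines that rely purely on content-based, knowledge-based, or collaborative filtering techniques are becoming rare. Content-based recommendation suffers from over-personalization and a lack of serendipity, while collaborative filtering suffers from the cold-start problem. A good solution is therefore to combine the strengths of several techniques: use a content-based strategy to handle cold start, and a collaborative filtering strategy to bring serendipity to the user. This post puts into practice a hybrid recommendation algorithm that combines content-based and collaborative filtering.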

1. Introduction: the success of Netflix

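As mentioned at the start, a hybrid recommender combines several simple models into a stronger, more robust system that gives users more accurate product suggestions. There are currently a few paradigms for building hybrid recommender systems: ① run content-based and collaborative filtering recommenders separately, then combine their predictions using adaptive weights; ② embed content-based techniques into a collaborative filtering framework to form an end-to-end recommendation engine; ③ embed collaborative filtering techniques into a content-based framework to form an end-to-end recommendation engine.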
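To make paradigm ① concrete, here is a minimal sketch of a weighted hybrid. The weighting rule (trust collaborative filtering more as the user accumulates ratings) and the names `weighted_hybrid`, `n_user_ratings`, and `k` are assumptions made for illustration; they are not taken from Netflix or from any particular library.

```python
import numpy as np

def weighted_hybrid(content_scores, cf_scores, n_user_ratings, k=20.0):
    """Blend two score vectors with a weight that adapts to the user's history.

    Hypothetical rule: the weight on collaborative filtering is
    n / (n + k), so it is 0 for a brand-new user and approaches 1
    as the user rates more items.
    """
    content_scores = np.asarray(content_scores, dtype=float)
    cf_scores = np.asarray(cf_scores, dtype=float)
    alpha = n_user_ratings / (n_user_ratings + k)
    return alpha * cf_scores + (1.0 - alpha) * content_scores

# Toy scores for three candidate items (made-up numbers).
content = [0.9, 0.2, 0.5]
cf = [0.1, 0.8, 0.6]

cold = weighted_hybrid(content, cf, n_user_ratings=0)    # new user
warm = weighted_hybrid(content, cf, n_user_ratings=180)  # active user
print(cold)
print(warm)
```

For the new user the blend falls back entirely on the content-based scores, which is how this style of hybrid sidesteps cold start; for the active user the collaborative scores dominate.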

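Netflix won its market share, and unprecedented success in online TV and film, on the strength of powerful and highly accurate hybrid recommendation technology. While we are watching a film, Netflix's recommender uses content-based techniques to suggest similar titles. An example is shown below: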

When Ratatouille is selected, Netflix recommends the top 5 most similar films. As the results show, they are all Disney Pixar animated films.

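However, users do not come to Netflix only to watch animated films; they may also enjoy drama, action, comedy, and so on. In that case Netflix uses collaborative filtering to identify similar users and recommend films that carry some serendipity. The recommendation page looks like this: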

For films across different genres, Netflix recommends using collaborative filtering.

In short, Netflix employs both content-based and collaborative filtering recommendation techniques, and this kind of engine has proven effective in practice.

2. Case study: building a hybrid recommendation model

The hybrid model here means one that fully combines the strengths of content-based and collaborative filtering.

Vertical use of content-based recommendation

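Take YouTube as an example. Whenever we watch a film or video, a list of recommendations appears on the right side of the panel, and these are produced by a content-based method. This exploits the fine-grained descriptive power of content-based recommendation: when users are watching a video they are interested in, they tend to keep watching similar content.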
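As a rough illustration of how "similar content" can be scored, the sketch below computes cosine similarity between item feature vectors with NumPy. The four genre flags per film are hypothetical stand-ins for real metadata, and the helper names `cosine_similarity_matrix` and `most_similar` are made up for this example.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Return the item-by-item cosine similarity matrix."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for empty rows
    unit = features / norms
    return unit @ unit.T

def most_similar(sim, idx, n):
    """Indices of the n items most similar to item idx, excluding idx itself."""
    order = np.argsort(-sim[idx], kind="stable")
    return [int(j) for j in order if j != idx][:n]

titles = ["Ratatouille", "Toy Story", "The Dark Knight", "Batman Begins"]
# Hypothetical feature columns: animation, comedy, action, crime.
features = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]

sim = cosine_similarity_matrix(features)
for i, title in enumerate(titles):
    nearest = most_similar(sim, i, 1)[0]
    print(title, "->", titles[nearest])
```

With real data the feature vectors would more likely come from something like TF-IDF over plot descriptions, cast, and keywords, but the similarity step stays the same.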

Horizontal use of collaborative filtering

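Suppose a user is watching The Dark Knight, which is a Batman film. A recommender built purely on content would very likely suggest other Batman (or superhero) films while ignoring the quality of the films it recommends. For example, most people who like The Dark Knight may not rate other Batman or superhero films highly, even though they share the same protagonist and a similar theme. This is where collaborative filtering needs to come in, to raise the serendipity of what the user is shown.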
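For readers who want to see what a rating predictor of this kind does internally, here is a small matrix factorization sketch trained with stochastic gradient descent. It is written in the spirit of the SVD-style model used later in this post, but it is not Surprise's implementation; the toy ratings, hyperparameters, and function names are all assumptions for illustration.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2,
             n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))  # global mean rating
    bu = np.zeros(n_users)                            # user biases
    bi = np.zeros(n_items)                            # item biases
    P = rng.normal(0.0, 0.1, (n_users, n_factors))    # user factors
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))    # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Predicted rating, clipped to the 0.5-5 range of the toy data."""
    mu, bu, bi, P, Q = model
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], 0.5, 5.0))

def rmse(model, ratings):
    errors = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errors)))

# Made-up ratings: (user, item, rating).
ratings = [
    (0, 0, 5.0), (0, 1, 4.5), (0, 2, 1.0),
    (1, 0, 4.5), (1, 2, 1.5),
    (2, 0, 1.0), (2, 1, 1.5), (2, 2, 5.0),
    (3, 1, 2.0), (3, 2, 4.5),
]

model = train_mf(ratings, n_users=4, n_items=3)
mean_rating = np.mean([r for _, _, r in ratings])
baseline_rmse = float(np.sqrt(np.mean([(r - mean_rating) ** 2 for _, _, r in ratings])))
train_rmse = rmse(model, ratings)
print("baseline RMSE:", baseline_rmse)
print("training RMSE:", train_rmse)
print("predicted rating for user 1 on unseen item 1:", predict(model, 1, 1))
```

The point of the sketch is that the prediction for a user depends on how other users with similar taste rated the item, not on the item's genre or cast, which is what lets it surface films a content-based filter would miss.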

The workflow of the hybrid recommender can therefore be designed as follows:

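① Take a movie title and a user profile as input.
② Use the content-based model to find the 25 most similar movies.
③ Use the collaborative filtering model to predict this user's rating for each of those 25 movies.
④ Sort by predicted rating and return the top 10 movies.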
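The steps above can be sketched on toy data, without any data files, as follows. The function name `hybrid_rerank`, the similarity row, and the predicted ratings are all hypothetical; the rating function stands in for whatever collaborative filtering model is used.

```python
def hybrid_rerank(seed_idx, sim_row, predict_rating, n_candidates=25, n_final=10):
    """Shortlist by content similarity, then re-rank by predicted rating."""
    # Step 2: the n_candidates items most similar to the seed, excluding the seed.
    ranked = sorted(
        (j for j in range(len(sim_row)) if j != seed_idx),
        key=lambda j: sim_row[j],
        reverse=True,
    )
    candidates = ranked[:n_candidates]
    # Step 3: predicted rating of this user for every candidate.
    scored = [(j, predict_rating(j)) for j in candidates]
    # Step 4: keep the candidates with the highest predicted ratings.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_final]

# Similarity of every item to item 0 (made-up values).
sim_row = [1.0, 0.9, 0.8, 0.7, 0.1]
# Made-up predicted ratings of one user for items 1 to 4.
predicted = {1: 2.5, 2: 4.5, 3: 4.0, 4: 5.0}

result = hybrid_rerank(0, sim_row, predicted.get, n_candidates=3, n_final=2)
print(result)  # [(2, 4.5), (3, 4.0)]
```

Note that item 4 has the highest predicted rating but never makes the shortlist because it is not similar enough to the seed, while item 1 is the most similar but is dropped because the user is predicted to dislike it. That interplay is the design choice behind the two-stage pipeline.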

Preparing the datasets:

ratings_small.csv: https://www.kaggle.com/rounakbanik/the-movies-dataset/downloads/ratings_small.csv/7 (100,000 ratings given by 700 users to 9,000 movies; highly sparse)

movie_ids.csv: https://drive.google.com/drive/folders/1H9pnfVTzP46s7VwOTcC5ZY_VahRTr5Zv?usp=sharing (the links_small.csv file there contains the movie IDs of every movie rated in ratings_small.csv)

import numpy as np
import pandas as pd

# Import or compute the cosine_sim matrix
cosine_sim = pd.read_csv('../data/cosine_sim.csv')

# Import or compute the cosine sim mapping matrix
cosine_sim_map = pd.read_csv('../data/cosine_sim_map.csv', header=None)

# Convert cosine_sim_map into a Pandas Series (title -> cosine_sim index)
cosine_sim_map = cosine_sim_map.set_index(0)
cosine_sim_map = cosine_sim_map[1]

# Build the SVD based Collaborative filter
from surprise import SVD, Reader, Dataset

reader = Reader()
ratings = pd.read_csv('../data/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Fit on the full dataset. Current Surprise releases use fit() in place of
# the older train(); the data.split(n_folds=5) call is dropped because no
# cross-validation is run here.
svd = SVD()
trainset = data.build_full_trainset()
svd.fit(trainset)

# Build title to ID and ID to title mappings
id_map = pd.read_csv('../data/movie_ids.csv')
id_to_title = id_map.set_index('id')
title_to_id = id_map.set_index('title')

# Import or compute relevant metadata of the movies
smd = pd.read_csv('../data/metadata_small.csv')

def hybrid(userId, title):
    # Extract the cosine_sim index of the movie
    idx = cosine_sim_map[title]

    # Extract the TMDB ID of the movie
    tmdbId = title_to_id.loc[title]['id']

    # Extract the movie ID internally assigned by the dataset
    movie_id = title_to_id.loc[title]['movieId']

    # Extract the similarity scores and their corresponding index
    # for every movie from the cosine_sim matrix
    sim_scores = list(enumerate(cosine_sim[str(int(idx))]))

    # Sort the (index, score) tuples in decreasing order of similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Select the top 25 tuples, excluding the first
    # (as it is the similarity score of the movie with itself)
    sim_scores = sim_scores[1:26]

    # Store the cosine_sim indices of the top 25 movies in a list
    movie_indices = [i[0] for i in sim_scores]

    # Extract the metadata of the aforementioned movies
    # (copy so that adding a column does not modify a view of smd)
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']].copy()

    # Compute the predicted ratings using the SVD filter
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, id_to_title.loc[x]['movieId']).est)

    # Sort the movies in decreasing order of predicted rating
    movies = movies.sort_values('est', ascending=False)

    # Return the top 10 movies as recommendations
    return movies.head(10)

Testing the system:

hybrid(1, 'Avatar')
