Hive练习之影评案例分析（一）

xiaoxiao2025-05-19 125

影评案例分析能够较为完整的学习Hive，今天在复习Hive的过程中又将这个案例操作了一遍。

一、数据说明和下载

在本案例中总共有三份数据，分别是：（1）users.dat 数据格式为： 2::M::56::16::70072，对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String

（2）movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children's|Fantasy，对应字段为：MovieID BigInt, Title String, Genres String

（3）ratings.dat 数据格式为： 1::1193::5::978300760，对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String

数据下载链接：

https://github.com/qianhonglinIT/HiveMovieCaseStudy

二、建表和导入数据

在数据库中创建3张表，t_user，t_movie，t_rating，由于原始数据是使用::进行分割的，所以需要使用能解析多字节分隔符的Serde

create table t_user( userid bigint, sex string, age int, occupation string, zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') stored as textfile; use movie; create table t_movie( movieid bigint, moviename string, movietype string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile; use movie; create table t_rating( userid bigint, movieid bigint, rate double, times string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') stored as textfile;

导入数据：

load data local inpath "/home/hive/users.dat" into table t_user; load data local inpath "/home/hive/movies.dat" into table t_movie; load data local inpath "/home/hive/ratings.dat" into table t_rating;

三、题目和解答

（1）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

分析：按照电影名进行分组统计，求出每部电影的评分次数并按照评分次数降序排序

create table question1 as select a.moviename as moviename,count(a.moviename) as total from t_movie a join t_rating b on a.movieid=b.movieid group by a.moviename order by total desc limit 10;

结果：

（2）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

分析：三表联合查询，按照性别过滤条件，电影名作为分组条件，影评分作为排序条件进行查询

女性：

create table question2_1 as select "F" as sex, c.moviename as name, avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="F" group by c.moviename having total >= 50 order by avgrate desc limit 10;

结果：

男性：

create table question2_2 as select "M" as sex, c.moviename as name, avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="M" group by c.moviename having total >= 50 order by avgrate desc limit 10;

结果：

明天继续。。。。。。。。。。。。。。。

最新回复(0)