Hive练习之影评案例分析(一)

    xiaoxiao2025-05-19  44

    影评案例分析能够较为完整的学习Hive,今天在复习Hive的过程中又将这个案例操作了一遍。

    一、数据说明和下载

    在本案例中总共有三份数据,分别是: (1)users.dat 数据格式为: 2::M::56::16::70072, 对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String

    (2)movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy, 对应字段为:MovieID BigInt, Title String, Genres String

    (3)ratings.dat 数据格式为: 1::1193::5::978300760, 对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String  

    数据下载链接:

    https://github.com/qianhonglinIT/HiveMovieCaseStudy

    二、建表和导入数据

    在数据库中创建3张表,t_user,t_movie,t_rating,由于原始数据是使用::进行分割的,所以需要使用能解析多字节分隔符的Serde

    create table t_user( userid bigint, sex string, age int, occupation string, zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') stored as textfile; use movie; create table t_movie( movieid bigint, moviename string, movietype string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile; use movie; create table t_rating( userid bigint, movieid bigint, rate double, times string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') stored as textfile;

    导入数据:

    load data local inpath "/home/hive/users.dat" into table t_user; load data local inpath "/home/hive/movies.dat" into table t_movie; load data local inpath "/home/hive/ratings.dat" into table t_rating;

    三、题目和解答

    (1)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)

    分析:按照电影名进行分组统计,求出每部电影的评分次数并按照评分次数降序排序

    create table question1 as select a.moviename as moviename,count(a.moviename) as total from t_movie a join t_rating b on a.movieid=b.movieid group by a.moviename order by total desc limit 10;

    结果:

    (2)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)

    分析:三表联合查询,按照性别过滤条件,电影名作为分组条件,影评分作为排序条件进行查询

    女性:

    create table question2_1 as select "F" as sex, c.moviename as name, avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="F" group by c.moviename having total >= 50 order by avgrate desc limit 10;

    结果:

    男性:

    create table question2_2 as select "M" as sex, c.moviename as name, avg(a.rate) as avgrate, count(c.moviename) as total from t_rating a join t_user b on a.userid=b.userid join t_movie c on a.movieid=c.movieid where b.sex="M" group by c.moviename having total >= 50 order by avgrate desc limit 10;

    结果:

    明天继续。。。。。。。。。。。。。。。

     

    最新回复(0)