影评练习

xiaoxiao2025-08-01 40

大数据影评练习

现有如此三份数据： 1、users.dat 数据格式为： 2::M::56::16::70072 对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String 对应字段中文解释：用户id，性别，年龄，职业，邮政编码

2、movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children’s|Fantasy

对应字段为：MovieID BigInt, Title String, Genres String 对应字段中文解释：电影ID，电影名字，电影类型

3、ratings.dat 数据格式为： 1::1193::5::978300760 对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String 对应字段中文解释：用户ID，电影ID，评分，评分时间戳

题目要求：

数据要求：

（1）写shell脚本清洗数据。（hive不支持解析多字节的分隔符，也就是说hive只能解析’:’, 不支持解析’::’，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗）（2）使用Hive能解析的方式进行

Hive要求：

（1）正确建表，导入数据（三张表，三份数据），并验证是否正确

首先创建一个影评数据库 create database if not exists yingping; 进入数据库 use yingping; 创建表

user用户表

create external table if not exists users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’ with serdeproperties(‘input.regex’=’(.):?.):?.):?.):?.*)’,‘output.format.string’=’%1 $s$ s %3 $s$ s %5$s’) stored as textfile location “/user/data/yingping/users”; 在hdfs中直接将数据放入到"/user/data/yingping/users"下就可以直接访问数据了，也可以用 load data locla inpath “hdts中的全路径名” into table user表名验证是否创建成功测试时最好set hive.exec.mode.local.auto=true;设置本地模式，这样可以提高执行效率，注意这仅仅是在练习是用

select * from users limit 5;

注意：在hive 中禁用 select * from tnname; 一定配合 limit使用，因为一张

movies 电影表

create external table if not exists movies(MovieID BigInt, Title String, Genres String) row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’ with serdeproperties(‘input.regex’=’(.):?.):?.*)’,‘output.format.string’=’%1 $s$ s %3$s’) stored as textfile location “/user/data/yingping/movies”;

select * from movies limit 10;

ratings评分表

create external table if not exists ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) row format serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’ with serdeproperties(‘input.regex’=’(.):?.):?.):?.)’,‘output.format.string’=’%1 $s$ s %3 $s$ s’) stored as textfile location “/user/data/yingping/ratings”;

select * from ratings limit 10;

（2）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

set hive.exec.mode.local.auto=true;本地模式

select count(*) rcount,title from ratings a join movies b on a.movieid=b.movieid group by a.movieid ,b.title order by rcount desc limit 10;

（3）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

select c.Gender,b.title,avg(a.rating) avga from ratings a join movies b on a.movieid=b.movieid join users c on a.userid=c.userid where c.Gender=“M” group by b.movieid,b.title,c.Gender order by avga desc limit 10;

select c.Gender,b.title,avg(a.rating) avga from ratings a join movies b on a.movieid=b.movieid join users c on a.userid=c.userid where c.Gender=“F” group by b.movieid,b.title,c.Gender order by avga desc limit 10;

（4）求movieid = 2116这部电影各年龄段（因为年龄就只有7个，就按这个7个分就好了）的平均影评（年龄段，影评分）

select b.Age,avg(a.rating) avg from ratings a join users b on a.userid=b.userid group by b.Age;

（5）求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的平均影评分（观影者，电影名，影评分）（该女性评价最高的10部电影，都必须是至少有50个用户参与评价的）

select g.userid,h.title,g.rating from (select e.userid,e.movieid,e.rating from (select d.userid userid,avg(d.rating) rating,d.movieid movieid from (select count(*) count,b.userid userid from ratings a join users b on a.userid=b.userid where b.Gender=“F” group by b.userid order by count desc limit 1) c left join ratings d on c.userid=d.userid group by d.userid,d.movieid) e join (select count(distinct a.userid) ypcount,b.movieid movieid from users a join ratings b on a.userid=b.userid where a.Gender=“F” group by b.movieid having ypcount>50 order by ypcount desc) f on e.movieid=f.movieid order by e.rating desc limit 10)g join movies h on g.movieid=h.movieid;

（6）求好片（评分>=4.0）最多的那个年份的最好看的10部电影

select avg(a.rating) rating,substr(b.Title,-5,4) year,b.movieid movieid from ratings a join movies b on a.movieid=b.movieid group by substr(b.Title,-5,4),b.movieid having rating >4 order by rating desc limit 10;

（7）求1997年上映的电影中，评分最高的10部Comedy类电影

select avg(b.rating) rating,a.title from(select movieid,title,leixing.lx Genres from movies lateral view explode(split(Genres,"\|"))leixing as lx) a join ratings b on a.movieid=b.movieid where lower(a.Genres)=“comedy” and substr(a.Title,-5,4)=“1997” group by a.movieid,a.Title order by rating desc limit 10;

（8）该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）

select * from (select c.title title,c.Genres,c.rating,row_number() over(distribute by c.Genres sort by c.rating desc) index from (select a.title title,avg(b.rating) rating,a.movieid,a.Genres from (select movieid,title,leixing.lx Genres from movies lateral view explode(split(Genres,"\|"))leixing as lx) a join ratings b on a.movieid=b.movieid group by a.Genres,a.movieid,a.title) c) d where d.index<=5 ;

最新回复(0)