Hadoop Experiment - Using Pig
Pig
Using the Pig language (Pig Latin), we can write a script that performs MapReduce operations similar to those in the previous post. I will be using the same CSV file as before.
gamedata_01.pig
gamedata = LOAD 'nesgamedata.csv' AS (index:int, name:chararray, grade:chararray, publisher:chararray, reader_rating:chararray, number_of_votes:int, publish_year:int, total_grade:chararray);
DESCRIBE gamedata;
DUMP gamedata;
[root@quickstart gamedata]# pig -f gamedata_01.pig
...
(269,Winter Games,12,Epyx,13,24,1987,12.96)
(270,Wizards and Warriors,9,Rare,6,55,1987,6.053571428571429)
(271,World Games,6,Epyx,9,8,1986,8.666666666666666)
(272,Wrath of the Black Manta,7,Taito,6,31,1989,6.03125)
(273,Wrecking Crew,10,Nintendo,8,18,1985,8.105263157894736)
(274,Xevious,5,Namco,6,36,1988,5.972972972972973)
(275,Xexyz,10,Hudson Soft,5,26,1989,5.185185185185185)
(276,Yoshi,5,Nintendo,6,41,1992,5.976190476190476)
(277,Yoshi's Cookie,5,Nintendo,7,23,1993,6.916666666666667)
(278,Zanac,2,Pony,3,21,1986,2.9545454545454546)
(279,Zelda II: The Adventure of Link,3,Nintendo,4,112,1989,3.9911504424778763)
(280,Zelda, The Legend of,3,Nintendo,3,140,1986,3.0)
(281,Zombie Nation,4,Kaze,8,26,1991,7.851851851851852)
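The last column is worth a closer look before using it. In the rows shown above, total_grade looks like the author's grade counted as one extra vote alongside the reader votes. The sketch below checks that reading in plain Python against a few of the dumped rows. The helper names `blended_grade` and `max_deviation` are made up for this illustration, and the formula is inferred from the output rather than taken from the data source.

```python
# Rows copied from the DUMP output above:
# (name, grade, reader_rating, number_of_votes, total_grade)
rows = [
    ("Winter Games", 12, 13, 24, 12.96),
    ("Wizards and Warriors", 9, 6, 55, 6.053571428571429),
    ("World Games", 6, 9, 8, 8.666666666666666),
    ("Zelda II: The Adventure of Link", 3, 4, 112, 3.9911504424778763),
]


def blended_grade(grade, reader_rating, votes):
    """Assumed formula: the author's grade counts as one extra vote."""
    return (grade + reader_rating * votes) / (votes + 1)


def max_deviation(rows):
    """Largest gap between the assumed formula and the dumped total_grade."""
    return max(
        abs(blended_grade(grade, rating, votes) - total)
        for _, grade, rating, votes, total in rows
    )


for name, grade, rating, votes, total in rows:
    print(name, blended_grade(grade, rating, votes), total)
```

If the assumption holds, the two numbers printed for each game agree up to floating-point rounding.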
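Now let's take all Nintendo games, group them by the grade given by the author of the website, and calculate the average total_grade for each group. Note that total_grade is not the raw reader rating: in the sample rows above it blends the author's grade with the reader votes, and the reader votes dominate.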
gamedata_02.pig
gamedata = LOAD 'nesgamedata.csv' AS (index:int, name:chararray, grade:int, publisher:chararray, reader_rating:int, number_of_votes:int, publish_year:int, total_grade:float);
gamesNintendo = FILTER gamedata BY publisher == 'Nintendo';
gamesRatings = GROUP gamesNintendo BY grade;
averaged = FOREACH gamesRatings GENERATE group AS rating,
    AVG(gamesNintendo.total_grade) AS avgRating;
DUMP averaged;
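To check the logic without a cluster, the FILTER, GROUP and AVG steps of the script can be mirrored in plain Python. This is only a sketch: the function name `average_by_grade` is made up for this example, and the input is just a handful of rows copied from the first DUMP, so the averages will not match the output of the full data set.

```python
from collections import defaultdict

# (publisher, grade, total_grade) taken from the rows dumped by gamedata_01.pig
rows = [
    ("Epyx", 12, 12.96),
    ("Nintendo", 10, 8.105263157894736),
    ("Namco", 5, 5.972972972972973),
    ("Nintendo", 5, 5.976190476190476),
    ("Nintendo", 5, 6.916666666666667),
    ("Nintendo", 3, 3.9911504424778763),
    ("Nintendo", 3, 3.0),
]


def average_by_grade(rows, publisher="Nintendo"):
    """Mirror FILTER BY publisher, GROUP BY grade and AVG(total_grade)."""
    groups = defaultdict(list)
    for pub, grade, total in rows:
        if pub == publisher:              # FILTER gamedata BY publisher
            groups[grade].append(total)   # GROUP gamesNintendo BY grade
    # AVG(gamesNintendo.total_grade) per group, ordered by grade
    return {grade: sum(vals) / len(vals) for grade, vals in sorted(groups.items())}


for grade, avg in average_by_grade(rows).items():
    print(grade, avg)
```

Each dictionary key plays the role of `group` in the Pig script, and each list of values corresponds to the bag that AVG is applied to.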
Run the Pig script on the Hadoop machine:
[root@quickstart gamedata]# pig -f gamedata_02.pig
...
(1,2.321279764175415)
(2,3.3024109601974487)
(3,3.7930258750915526)
(4,3.0212767124176025)
(5,5.381512546539307)
(6,5.773015689849854)
(7,6.020833492279053)
(8,9.833333015441895)
(9,6.624411582946777)
(10,8.105262756347656)
(12,8.070609092712402)
(13,10.066511631011963)
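From this we can observe that, on average, the users do not really agree with the author's ratings. In 7 of the 12 groups (grades 4, 6, 7, 9, 10, 12 and 13) the author's grade is numerically higher than the average, and the gap is widest at the extremes: games graded 13 by the author average about 10.1, while games graded 1 average about 2.3.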
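The comparison can be made concrete by counting the groups in which the author's grade lies above or below the average. The sketch below uses the values from the output above, rounded to four decimals. The function name `split_by_direction` is made up for this example.

```python
# (author grade, averaged total_grade) pairs from the DUMP output, rounded
averaged = [
    (1, 2.3213), (2, 3.3024), (3, 3.7930), (4, 3.0213),
    (5, 5.3815), (6, 5.7730), (7, 6.0208), (8, 9.8333),
    (9, 6.6244), (10, 8.1053), (12, 8.0706), (13, 10.0665),
]


def split_by_direction(pairs):
    """Return the grades where the author's number is above / below the average."""
    above = [grade for grade, avg in pairs if grade > avg]
    below = [grade for grade, avg in pairs if grade < avg]
    return above, below


above, below = split_by_direction(averaged)
print("author grade higher than average:", above)
print("author grade lower than average:", below)
```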