NETS 212: Scalable & Cloud Computing
Fall 2014

Assignment 4: SocialRank via Spark
Due November 13th at 10:00pm EST

Your fourth homework assignment involves revisiting SocialRank using a different platform, Apache Spark. Spark models state as resilient distributed datasets (RDDs), which are represented in Java as variables, e.g., of type JavaRDD (non-keyed values) or JavaPairRDD (keyed values). You can manipulate these RDDs by calling methods on them; most of these methods take a callback function (not so different from Node.js). This basic programming model is also shared by a variety of other programming platforms like Apache Crunch (an open-source version of Google's FlumeJava).

The SocialRank Algorithm in Spark

Recall from the previous assignment that we had to implement iteration through a series of map-reduce steps, and that each time we had to pass state from iteration to iteration through the data. In Spark, this is simpler. The Spark equivalent of SocialRank can work as follows:

1. Initialize an RDD that contains all of the edges.
2. Initialize an RDD that contains all of the nodes plus their SocialRanks (initialized to 1 or whatever).
3. Iterate n times:
   a. Create an RDD with all of the weight propagations from nodes to their neighbors.
   b. Recompute the ranks RDD by summing up the incoming weights (plus the decay factor).

This can be described in essentially a sequential fashion, even though it involves parallel stages.

Tasks

We have given you the broad skeleton of the SocialRank algorithm (check out home1/n/nets212/nets212/HW4 from the svn repo). Take a look at the class edu.upenn.cis.nets212.hw4.SocialRank, which operates as we sketched above.

First let's actually run things. You'll need to go to Terminal and get Spark running (over HDFS and Hadoop MapReduce/YARN) as follows.
1. [Each time you reboot] Make sure HDFS and YARN are running (recall how you ran start-dfs.sh and start-yarn.sh from HW2).
2. Run (one time) sh ~/workspace/HW4/setup-spark.sh, which should make some directories and copy some files to ensure Spark is able to run on your VM. You'll need to type in your nets212 administrator password.
3. If necessary, copy the sample graph data from HW3 (we've re-included it in the project, as data/input and specifically data/input/part-0000) to HDFS. (In general, you should still have this in HDFS from HW3.)
4. [Each time you reboot] Run source ~/workspace/HW4/set-classpath.sh
5. [Each time you reboot] Launch Spark via /opt/spark/sbin/start-master.sh

Now go back to Eclipse. Right-click on your project, hit Export, and make a JAR file. You can call it something like hw4.jar. Save the JAR into the workspace directory. Make sure "Export generated class files and resources" is selected. Don't worry if it says "jar finished with warnings."

Run your program by first changing to the workspace directory, then:

    spark-submit --class edu.upenn.cis.nets212.hw4.SocialRank hw4.jar input 10

where hw4.jar should be replaced by the actual name of your JAR file, input is the name of your sample input data folder in HDFS, and 10 is the number of SocialRank iterations.

If you look in your HDFS directory you should see the final output and the initial-ranks for the nodes. You can use hdfs dfs -get or hdfs dfs -copyToLocal to copy them to your hard drive for inspection. Or you can run hdfs dfs -cat with the filename to see it on the screen.

Structure of code

The basic code provided computes SocialRank under the following assumptions:

1. Every node is assumed to have at least one out-edge.
2. We should initialize the rank of every node to 1.0.
3. We are not using the decay function.
4. There is no existing output file in the way!

Tasks

Your tasks will require you to look at some of the functions you can apply on RDDs.
(You can find these in detail at https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/api/java/JavaPairRDD.html.) Let's start by removing the first assumption from above (out-edges, which are in fact not always present! You'll see there's no initial rank value for Node 4), and also rescale the initial value from the second assumption so it's a probability.

1. Think about how you can create the ranks RDD in a way that ensures every node shows up if it is connected. Pay special attention to the union operator, which enables you to merge two RDDs, and (hint) also think about how you can transpose source and destination nodes.
2. Think about how you can scale the initial ranks to 1/n as opposed to 1.0, where n is the number of nodes. Do keep in mind that there are a variety of functions for RDDs like aggregateByKey, count, etc.
3. Find the function doing the SocialRank computation, and revise it to incorporate the decay factor, with value 0.15 for the random jump and 0.85 for out-link traversal.
4. Remove the initial-ranks and output files if they exist.

Submitting Your Work

Submit via Canvas the main edu.upenn.cis.nets212.hw4.SocialRank source code. Verify that:

- Your code contains a reasonable amount of useful documentation (required for style points).
- You have checked your final code into your SVN repository.

Extra Credit

For extra credit, get Spark running on Amazon Elastic MapReduce following the guide at:
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923

Run it over the LiveJournal data you used for HW3, using the same cluster size as in HW3. Create a file called extra-credit.txt explaining how you did this, and showing the full running times for Spark. Which was faster, Hadoop from HW3 or Spark from HW4?
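
Appendix: a sketch of the arithmetic behind tasks 1 to 3

To make tasks 1 through 3 concrete, here is a minimal plain-Java sketch of the computation, deliberately written without Spark so the arithmetic is easy to check by hand. It is not the provided skeleton and not a solution in RDD form. The class name SocialRankSketch, its method names, and the four-edge sample graph are all hypothetical. The update rule rank = 0.15 + 0.85 * (sum of incoming weights) is also an assumption, since the handout gives the two constants but not the exact formula; use whatever form you used in HW3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the RDD pipeline, using in-memory maps instead of Spark.
class SocialRankSketch {
    static final double RANDOM_JUMP = 0.15; // value from the handout
    static final double FOLLOW_LINK = 0.85; // value from the handout

    // Tasks 1 and 2: collect every node that appears as a source OR a
    // destination (the in-memory analogue of a union with the transposed
    // edge list), then give each node the rank 1/n.
    static Map<String, Double> initialRanks(List<String[]> edges) {
        Map<String, Double> ranks = new TreeMap<>();
        for (String[] e : edges) {
            ranks.put(e[0], 0.0); // source node
            ranks.put(e[1], 0.0); // destination node
        }
        double n = ranks.size();
        ranks.replaceAll((node, old) -> 1.0 / n);
        return ranks;
    }

    // Task 3: one iteration. Each node splits its rank evenly over its
    // out-edges; each node's new rank is computed from the sum it receives.
    // ASSUMPTION: new rank = 0.15 + 0.85 * incoming sum.
    static Map<String, Double> iterate(List<String[]> edges, Map<String, Double> ranks) {
        Map<String, Integer> outDegree = new TreeMap<>();
        for (String[] e : edges) {
            outDegree.merge(e[0], 1, Integer::sum);
        }
        // Start every node at 0 so nodes without in-edges do not vanish.
        Map<String, Double> incoming = new TreeMap<>();
        for (String node : ranks.keySet()) {
            incoming.put(node, 0.0);
        }
        for (String[] e : edges) {
            double share = ranks.get(e[0]) / outDegree.get(e[0]);
            incoming.merge(e[1], share, Double::sum);
        }
        Map<String, Double> next = new TreeMap<>();
        for (Map.Entry<String, Double> entry : incoming.entrySet()) {
            next.put(entry.getKey(), RANDOM_JUMP + FOLLOW_LINK * entry.getValue());
        }
        return next;
    }

    public static void main(String[] args) {
        // Made-up sample graph; node 4 has no out-edges.
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"1", "2"});
        edges.add(new String[] {"2", "3"});
        edges.add(new String[] {"3", "1"});
        edges.add(new String[] {"1", "4"});

        Map<String, Double> ranks = initialRanks(edges);
        for (int i = 0; i < 10; i++) {
            ranks = iterate(edges, ranks);
        }
        System.out.println(ranks);
    }
}
```

Note that node 4 still receives an initial rank here even though it never appears as a source, which is the behavior task 1 asks for. In the Spark version the same effect would come from RDD operations such as union and count rather than from in-memory maps.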