Final Exam¶

Exam Format¶

The exam will be a mixture of programming, and short answer / multiple choice + justification.

Programming questions will either be: “What does this Spark code do” / “Which of the two spark programs do you think will be faster, and why?” (i.e. no coding on your part) or “Write/Complete the following Spark Program so that it does ____”

You will get a hand-out with any necessary APIs (e.g. RDD methods, DataFrame methods, DStream methods)

Spark coding can be written in Scala, Python, or pseudocode as the important thing is demonstrating you can combine the RDD/DataFrame operations together to achieve the task.

What to know: I’ll use notation such as RDD[Int] to indicate an RDD holding Int values, or RDD[(String, String)] to indicate a Pair RDD where the keys and values are both strings.

(DStream and DataFrame will use identical notation, if necessary)

Alright, so below is a list of “important” and “not testable” concepts from each module.

There is of course a little middle ground there. If I’ve forgot anything from a module when making the bullet points, ask so I can put it in the right category.

Modules Covered¶

1 (Intro)¶

Not covered, it’s just memes, syllabus, and backstory

2+3 (MapReduce and Spark basics)¶

What to study:¶

The basics of mappers and reducers (with Spark, “map-like” operations are ones that are done within their own partition, like map, flatMap, filter, etc and “reduce-like” operations that require a shuffle, e.g. repartition, reduceByKey)
Documentation for Spark methods like .map, .reduceByKey, etc. will be provided. You should know what the functions do so you’re not spending all your time looking at the reference sheet, but it’ll be there if you forget some specifics.

What not to study:¶

You will not be asked detailed questions about MapReduce / Spark frameworks. So exactly how the Hadoop buffers work, how intermediate files are merged, that kind of thing.
You could be asked about Combiners but will not be asked EXACTLY when they run. If you remember that they’re optional and cannot be guaranteed to happen exactly once, that’s enough.

4 (Text Processing, Information Retrieval)¶

What’s important:¶

N-Grams, Boolean Retrieval, Ranked Retrieval
inverted indexing, index compression
You are expected to know the basics of compression codes like VInt, Elias Gamma, Golomb, but not the implementation details. You should know:
- the BASIC idea of VInt and when they use fewer / more bytes than Int
- The underlying assumptions for Elias codes / Golomb codes

What not to study:¶

You won’t be asked to encode / decode VInt, Elias, or Golomb codes, or for details on how these schemes work (see above)
You won’t be asked to recite equations like Heap’s law, Zipf’s law, but should remember the basics (most frequent terms are very common, but most terms are rare)

5 (Graphs)¶

What’s important:¶

Parallel Breadth First Search
Page Rank / Personalized Page Rank
This general style of graph algorithms (Parallel synchronized message passing)

What’s not testable:¶

Don’t worry about memorizing PageRank formula.
Although Module 10 is also about graphs, it’s not testable material.
Details about Spam Rank, Trust Rank, and other variations.
- You did an assignment on Personalized PageRank so you should know about that variation.

6 (Machine Learning)¶

What’s important:¶

Online (Stochastic) Gradient Descent vs regular Gradient Descent
Training vs. Testing sets, Cross Validation
Min-Hashing, Hierarchical Clustering, K-Means Clustering
- For min-hashing and Hierarchical Clustering, you only need to know the high level overview, not the implementation specifics. For Euclidean K-means there’s very little to know so you should know it.

What’s not testable:¶

The specifics of MinHashing or Hierarchical Clustering (If there’s a question on it, I’ll remind you of the specifics)
You won’t be asked to recite the definitions for stats terms like precision, recall, etc.

7 (Relational Data)¶

What important:¶

How to convert SQL queries to RDD operations [mostly interesting are the joins]
DataFrames - though since we didn’t program using them on A6 I won’t ask you to do that for the final!
OLAP vs OLTP workloads

What’s not testable:¶

The history of HIVE / other SQL-on-Hadoop tools
Jargon like data warehousing, data lakes, etc.
Don’t memorize the DataFrame API (if needed, they’ll be on the handout)

8 (Streaming)¶

What’s important:¶

Spark Streaming (like on the last assignment)
Repeating myself, but if the API is required it will be provided for you
Algorithms for dealing with data bursts (to a lesser extent)

What not testable:¶

Pipelines like Kafka, Flume, MQTT etc. might be MENTIONED, but you are not expected to memorize any features or details of these tools (especially since they weren’t given except for Kafka).
The algorithms / data structures for dealing with bursts shouldn’t be memorized. (E.g. testable: What’s a bloom filter used for. Not testable: sketch the code for a bloom filter. In other words, you should know what it is but it’s okay if you don’t remember the details)

9 (Mutable State)¶

What’s important:¶

CAP and PACELC theorems
- Example question might be “Here is a description of a NoSQL database, which best describes it: PC/EC, etc.”

What not testable:¶

There won’t be any questions that require you to know details of HBase, BigTable, Redis, Memcached, etc.
- any details needed will be provided
Paxos and 2PC (Two Phase Commit)

Example Questions¶

The following are (mostly) questions that Ali Abedi had for the online quizzes when the class was online-only. I think they’re basically reasonable (and have modified a few of them a little), but a few I wouldn’t ask because I’d need to remind you of what the thing I’m asking about is, which would (probably) defeat the purpose of the question…these quizzes were timed but open-book. The final exam is closed book.

The only “do you remember?” sorts of questions I’d want are things that are so fundamental you’d HAVE to know them by now, if you did the assignments!

Sections 2+3¶

Q)

In MapReduce, how does the framework determine which reducer should receive a given intermediate result?

a) Mappers exchange messages to determine the reducer for each key

b) Reducers exchange messages to determine the reducer for each key

d) Keys are assigned to reducers in a round-robin manner

e) Key-value pairs are assigned to reducers in a round-robin manner

f) Mappers use a hash function on each key to determine the reducer for that key

(That’s what I mean by “so fundamental you’d have to know”)

Q)

In a MapReduce program, what are the benefits and pitfalls of using an IMC (In-Mapper Combiner) instead of a traditional Combiner?

Q)

Consider the following Spark code:

def noisySquare(x : Integer) = {
  println(f"Squaring $x")
  x * x
}

def mystery(n : Integer) = {
  sc.parallelize(range(n)).map(noisySquare)
}
run code snippetVisit Manage Class to disable runnable code snippets×

a) What is the type of the RDD returned by the mystery function?

b) What will be printed to the screen if I have the assignment val x = mystery(3)? (Ignoring any Spark logging output)

Section 4¶

Q)

Describe why we would want to “smooth” our n-grams when training a language model.

Q)

True or False: If a posting list is in sorted order, then VInt compression will always resulting in a smaller posting. Justify your answer (if your answer is “false” you can justify it by counterexample).

Section 5¶

Q)

When performing single-source-shortest-path in MapReduce / Spark, we used Breadth-First-Search instead of Dijkstra. Why is that?

Q)

Why are spider-traps a problem in Page Rank? How is this problem resolved?

Q)

Why are dead-end nodes a problem in Page Rank? How is this problem resolved?

(NOTE FROM DAN: In a closed book format I’d remind you what spider-traps and dead-ends are, even though dead-end at least is a self-explanatory name)

Section 6¶

Q)

If doing machine-learning on user data, is age a sparse or a dense feature?

Q)

Assume we have k centroids and a dataset in x HDFS blocks. When running K-Means Clustering, what is the maximum number of:

Mappers: Reducers:

Explain your reasoning. (Note: You’re expected to know what K-Means Clustering is)

Section 7¶

Q)

When is it appropriate to perform a Hash Join (also known as a Broadcast Join)? When is it inappropriate?

Q)

Given a DataFrame people with 3 columns: name: String, age: Integer, job: String

Compute the average age for each unique job. You can use Spark SQL or DataFrame operations. The DataFrame API can be found on the handout. You can assume the DataFrame has already been registered as a View with the name people.

Section 8¶

(Reminder: If there’s a DStream question there will be a handout with the DStream API)

Q)

True / False: Spark Streaming processes incoming messages as soon as they arrive. Explain your answer

Q)

If I have a DStream ds, and create a new DStream newDS = ds.flatMap(f) which of the following is true

a) There will be a 1:1 matching between RDD in newDS and ds

b) Every RDD in ds will map to 1 or more RDD in newDS

c) Every RDD in ds will map to 0 or nore RDD in newDS

d) There is no correlation between RDD in ds and newDS

Q)

Assume we have an DStream[String] called “words” and we want, every minute, to print the LONGEST word seen in the last minute. (Ties can be broken in any way you like) Complete the following code by giving definitions for f1, f2, x, and y

(Note: You can provide “None” for the inverse function, in which case it’s less efficient, but will work even if you can’t write an inverse function)

words.reduceByWindow(f1, None, x, y).foreachRDD(f2)

Section 9¶

Q)

Bloom filters:

a) Can give false positives (falsely report a value has been seen)

b) Can give false negatives (falsely report a value has not been seen)

c) All of the above

d) None of the above

(This is a bit of a borderline question, since I’ve said above you should have a high level overview, not implementation specifics. This question is a case where you’d need to either have memorized the answer, or know the implementation details to be able to work it out).

Q) In Redis, keys are hashed and taken modulo 16384. Each of these 16K buckets is assigned to a single server, and transactions can only insert / delete / update a single key at a time. The cluster controller is in charge of assigning buckets to nodes in the cluster. If the controller loses contact with a node, it will reassign the buckets to another node (using replicas). If a node loses contact with the controller, it will go into offline mode and will not accept transactions until it reestablishes connection (and which point it will be out of date and need to resynced before it can be in charge of any buckets again) . In terms of the CAP theorem, what category does Redis fall under? In terms of PACELC, what category does Redis fall under?

CAP

a) PC

b) PA

c) CA

PACELC

a) PA/EL

b) PA/EC

c) PC/EL

d) PC/EC