instructions.

1. Solution may not be submitted by students in pairs.

2. You may submit a pdf of the homework, either printed or hand-written and scanned, as long as it is easily readable.

3. If your solution is illegible not clearly written, it might not be graded.

4. Unless otherwise stated, you should prove the correctness of your answer. A correct answer without justification may be worth less.

5. If you have discussed any problems with other students, mention their names clearly on the homework. These discussions are not forbidden and are actually encouraged. However, you must write your whole solution yourself.

6. Unless otherwise specified, all questions have same weight.

7. You may refer to data structure or their properties studied in class without having to repeat details, and may reference theorems we have studied without proof. If your answer requires only modifications to one of the algorithms, it is enough to mention the required modifications, and the effect (if any) on the running time and on other operations that the algorithm performs.

8. In general, a complete solution should contain the following parts:

(a) A high level description of the data structures (if needed). E.g. We use a binary balanced search tree. Each node contains, a key and pointers to its children. We augment the tree so each node also contains a field…

(b) A clear description of the main ideas of the algorithm. You may include pseu- docode in your solution, but this may not be necessary if your description is clear.

(c) Proof of correctness (e.g. show that your algorithm always terminates with the desired output).

(d) A claim about the running time, and a proof showing this claim.

1

1. You are given k sorted arrays of keys A1 . . . Ak. Each key is a float number. The total number of keys is n. Your goal it to sort all these keys in time O(n logk).

(a) Suggest a solution under the assumption that there are no constrains on memory availability nor on access time to memory, and access to each memory cell takes a constant time.

(b) Suggest a solution under the following assumptions: Each ‘Write’ Operation is much slower than a ‘read’ operation (as is commonly the case for Solid State Driver). However, you have access to fast memory containing O(k) words. (hint – you might need to borrow some ideas from CSC345).

To elaborate, assume that your computer’s memory contains two major components:

i. SlowMem – a slower memory. Reading and writing from this components are slower. The input arrays are stored in SlowMem, and the final output should be writing to SlowMem. Your algorithm should keep the number of reads/writes from SlowMem to the bare minimum.

ii. FastMem – a faster memory unit. Could only contains say 10k words.

2. You insert n different keys into an initially empty SkipList L. The version of SkipList discussed in class assumes that if a key k participates in level i, then with probability p = 0.5 it is promoted to level i + 1. What is the probability that after all insertions take place, all keys only appear on level 2, and none appear on level 3?

3. When considering the probability of having a SkipList of height ≥ Z log2 n, we discussed the case where only insertions of keys are performed.

Give bound to the probability that the height is ≥ Z log2 n if n is the current number of keys, but possibly a larger number of keys were present (in the past) in the SkipList but were deleted.

4. (a) Let d > 2 be a fixed positive integer. consider a perfect SkipList constructed as follows: In order to create the ith level Li of the SkipList, we scan the keys of level Li−1, and promote to Li every d’th key. So for example, the perfect SkipList discussed in class uses the value d = 2. The case d = 3 implies that every third key is promoted, and so on.

Express your answer as a function of n and d

i. What is the number of levels, as a function of n and d?

ii. What is the worst case time for performing find(x) operation ? For delete(x), For insert(x) ?

iii. Assume n = 109. Compare the case d = 2 vs. d = d = 10 vs. d = 1000. When will the search time be optimal.

(b) Assume that we re-create a SkipList by inserting n keys, in the same order, but this time we are using the randomized insertion algorithm shown on the slides. However a key that appears in level i is promoted to level i+1 with probability p = 0.1 (rather than p = 0.5 that was discussed on the slides).

Will the expected time to perform find(x) operation increase or decrease, compared to the expected time for the same operation in the original SkipList created with p = 0.5.

2

(c) Repeat the question, but now assume p = 0.9.

5. A vicious hacker just got access to the SkipList L you have built, which contains n keys. Show that the hacker could delete an expected number of n/2 keys from L, such that the operation search(x) would take Ω(n) in the worst case.

6. Suggest a modification of the SkipList structure, such that in addition to the operations insert(x), find(x) and delete(x), you could also answer the operation LesserThan(x) specifying the sum of all the keys in the SkipList are strictly smaller than x.

The expected time for each operation should be O(log n).

Hint: Store at each node v of the SL another field, called size[v], containing the number sum of keys in the SL between v and the next node at the same level as v. Note that maintaining the values of these fields might imply extra work while performing other operations.

7. In addition to the operations mentioned in the previous question, now support also range(x1, x2) which returns the sum of all keys which are ≥ x1 and ≤ x2. The expected time for every operation is O(log n).

8. In addition to the operation in the last question, not support also avg(x1, x2) which returns the average value of all keys stored in the SL, which are are ≥ x1 and ≤ x2. The expected time for every operation is O(log n).

9. (a) Explain possible drawbacks of using the following hashing method: m = 32, h(k) = (kA) mod m, where A is a large integer. All keys in U are integers whose last digit (in decimal representation) is 8.

(b) Does the drawback that you pointed out exist even if we use the multiplication method, as described in the slides (but with the same value of A) ? If not, explain exactly why.

(c) Does the drawback that you pointed out exist even if we use the function h(k) = (kA) mod m but A is an long float number ? if you have to use this method, which values of A would you prefer, and why.

10. Assume a hash table T [0..20] (that is, m = 21), and a open addressing hashing where

h(x, i) = {x + i · (x mod 10)} mod m.

Assume you start with an empty table. Show an example of a set of 4 distinct keys {k1, k2, k3, k4} such that

(a) kj mod 10 > 0 for j = 1, 2, 3, 4. And in addition,

(b) You could not insert all of them into the table. That is, calling insert(k1), followed by insert(k2), followed by insert(k3) and insert(k4) would report that the last operation is unsuccessful.

11. You have stored a huge number of images, and you are running out of disk space. Explain how you would use hash functions to find if your computer contains two identical image

3

files (possibly under different names). Give a pseudocode of your solution. Specify which and how your hash functions are used. Do not use values provided by the file system.

Your algorithm should be as efficient (in space and running time) as possible.

Assume images are stored as raw data. That is, images are given as matrix of pixels, and for each pixel, we are given the RGB values, as numbers between 0 and 255.

If you prefer, think about these images as ASCII documents that you could read sequen- tially, one character after the other.

Assume that beside the memory used for storing the files, you could use only 1MB of fast-access memory. of data. You have over 10000 files, each over a 2GB long.

12. Repeat the previous question, but this time you could use values provided by the File System/Operating System (such as MD5). Could you use the MD5 value of the file as an index to a hash table?