An interesting problem

Given a file of 4 billion, 32-bit integers, how to find an integer that appears at least twice?

One simple solution that I can think of quickly is:

Sort the integers in the file, using any of the efficient sorting techniques like Merge, Quick etc and then go on comparing adjacent integers to find duplicates.

But come to think of it, sorting a 16 GB file is not a feasible. I need to come up with something more efficient.

5 comments:

Sharad said...

well i am not understanding the whole question.
1:is this a one time thing?
2:if repeating,will you get another set of 4.2 billion nos?in a file?or will the addition be a no after another.

my solution will be split it up.the first sorting will be a huge task, but if you split it up into many files or segments, then the future overhead can be reduced.once this is done, searching will never be a big task.
i understand, you are thinking on this line, but then afew more details might help.

Pavan Kulkarni said...

There is no searching here.. nor is it repeating.. its simple and straight forward.. find an integer which is appearing twice from a list of integers... I hope that helps.

Sharad said...

hey, did you get any faster method to solve it?

goit said...

Here is another alternative solution using bitmap and better than the ones proposed by the author:

Let:
C=create array of byte char of size (2^32)/8 = 4,294,967,296/8 = 536,870,912 and initialize all to 0 [linear time O(N)]. A bit in char array corresponds to a 32 bit integer value as:

Let m be a 32 bit integer, then its correspnding bit will be
base = m/8
offset = m mod 8
Therefore the bit at the offset poistion of C[base] char is the corresponding bit.

For eg:
m=1
base = 1/8 = 0
offset = (1 mod 8) = 1
So the bit at position 1 of C[0] is its corresponding bit.

Now we traverse the input file, and in the process we determine its corrosponding bit place, and we check the bit value. If the bit value is 0 we set it to 1, if not then that integer appears atleast twice so exit/done. This operation also takes a linear time growth O(N).

Therefore, the total time complexity of this algorithm is basically O(N).

BTW, to set the particular bit a value of 1 use this logic (for integer m)

base = m/8
offset_value = 1 << (m mod 8)
C[base] |= offset_value

To compare the bit use the following expression
(C[base] & offset_value)


Further analysis,

If it takes 1 milli sec for one computer computational step (not for all kinds of insructions), then for the given input 4Billion

O(N) takes around 46 days whereas the O(N * log(N)) would take around 9 times 46 days = 414 days. (Morethan by a YEAR!!)


Still I haven't mentioned about space complexity (there is no way the computation can be done in an infinte memory space...as of this time)

Can we do it better? I would love to hear...

mymagnadata said...

goit why divsion by 8?