Arrays and rearrangements

By David Loeffler (P865) on Tuesday, February 15, 2000 - 11:37 am :

While writing a computer program recently, I needed a short routine to check whether two arrays of numbers were rearrangements of each other. Of course you can just sort them and compare individual terms but this is very slow. It occurred to me that I could just check that they had the same sum, sum of squares, sum of cubes etc.
This of course is very similar to Richard Mycroft's problem about means and variances.
What I would like to know is, how many different powers do I need to sum to be sure of not getting spurious matches? I suspect I would need to sum up to the nth power if there were n numbers but I cannot prove this generally, only for n=2 and n=3. Also, how does the situation change if the numbers are all known to be positive integers? And what if I check their products are the same too?

David Loeffler

By David Loeffler (P865) on Tuesday, February 15, 2000 - 11:38 am :

P.S. Sorry, I forgot to say that both arrays are known to be the same size.

David

By Amanda Turner (Agt24) on Tuesday, February 15, 2000 - 01:24 pm :

Suppose your numbers (in one array) are

x_{1}

, ...,

x_{n}

, and consider the polynomial

P (t) = (t - x_{1}) \dots (t - x_{n})

whose roots are the

x_{i}

's. Two arrays are rearrangements of each other if and only if they 'generate' the same polynomial. Now the coefficients of the above polynomial are

$e_{1} = - \sum_{i = 1}^{n} x_{i}$

$e_{2} = \sum_{i < j} x_{i} x_{j}$

$e_{3} = - \sum_{i < j < k} x_{i} x_{j} x_{k}$

... $e_{n} = (- 1)^{n} x_{1} \dots x_{n}$

where $P (t) = t^{n} + e_{1} t^{n - 1} + \dots + e_{n}$ .

Now Newton's theorem states that any symmetric function in $x_{1}$ , ..., $x_{n}$ (i.e. if you rearrange the $x_{i}$ 's the function is unchanged) can be expressed in terms of $e_{1}$ , ..., $e_{n}$ . In particular, $s_{k} = \sum_{i = 1}^{n} x_{i}^{k}$ can be expressed in terms of the $e_{i}$ 's for all $k$ . Now try and deduce that you can express the $e_{i}$ 's in terms of the $s_{k}$ 's. Why does it then follow that your two arrays are rearrangements of each other if and only if the corresponding $s_{k}$ 's are equal? Since you have $n$ elements in your array, you will need at least $n$ equations to guarantee that you don't get spurious matches. Try and find two sets of numbers which give the same values of $s_{k}$ for $n - 1$ values of $k$ but that are not rearrangements of each other. If you have any set of $n$ functions $f_{k}$ such that you can express the $e_{i}$ 's in terms of the $f_{k}$ 's then it is enough that your two arrays agree for these functions. What other sets of functions can you think of?

I hope this is helpful. Please ask if there is anything that you don't understand or if you have any other questions.

By Dan Goodman (Dfmg2) on Tuesday, February 15, 2000 - 01:34 pm :

The problem with this method is you will get overflows for even quite small n. You're right that you would need to know all of the sums up to n, and I'm not even sure that this method would be not spurious even if you had n sums to work with. You can see why you need the n sums by thinking about it like this: If x₁ +x₂ +...+x_n =S₁ then x₁ =S₁ -(x₂ +x₃ +...+x_n ). In other words, you can choose x₂ to x_n arbitrarily and get the same sum. You've just lost one degree of freedom. Each additional constraint you put on the problem will lose one more degree of freedom until you have done the first n sums. The n sums will certainly fix it for positive numbers, but they might not for negative numbers, think of S₂ =x₁ ² +...+x_n ² , this gives x₂ =±sqrt(S₂ -x₁ ² +x₃ ² +...+x_n ² ), i.e. two possible values.

My advice is to find another algorithm for comparing them, because this one will not be practical if you have even moderately large n (like 10 or so) and moderately large data. Anyway, the order of operations it takes to generate this data is approximately O(n³ ) which is very big (assuming that is that x^k is k elementary operations). There are sort algorithms which can do better than O(nlog(n)), so it would be quicker to sort and compare than to use the sum of powers method. I suggest looking up bubblesort, shellsort, and quicksort on the internet to see which would best sort your data.

By James Lingard (Jchl2) on Wednesday, February 16, 2000 - 11:57 pm :

A few thoughts...

Firstly, regarding the time complexity of the proposed algorithm, you can quite easily do x^k in O(log k) operations (I think - ask if you want details), so this would mean that the algorithm is actually 'only' O(n² .log n).

Also, you could improve this method still by doing all the power sums simultaneously, and using the fact that to generate x^k by the usual method, you have already found x² , ..., x^k-1 . This would bring the complexity down to O(n² ).

However, this is still high, and is rather irrelavent anyway since, as Dan and Amanda have already pointed out, the algorithm won't necessarily guarantee that the two arrays are different, and also the problem of overflow does rather mess things up.

Secondly, I don't think that it's true that there are sorting algorithms which 'can do better than O(n.log n)' - there are certainly algorithms which achieve this in the worst case, but unless you assume some knowledge of the input data (e.g. you know that all the numbers come from a certain small set - 1,2,3,...,9 for example) then no sorting algorithm can do better than O(n.log n) in the worst case.

A brief summary of sorting algorithms:
- Insertion sort is a very simple algorithm, particularly good on data which is not very muddled up to start with, but it is O(n² ) in the worst case.
- Bubble sort should be avoided at all costs - although it looks very simple it is slow. Insertion sort would almost certainly be a better alternative.
- Merge sort and heap sort are two algorithms which are O(n.log n) in the worst case, so are particularly good for sorting large amounts of data.
- Quicksort is actually O(n² ) in the worst case, but on average it turns out to be O(n.log n) and in practice usually beats merge sort or heap sort, so it's a very good algorithm. There are sometimes (e.g. in C) standard library implementations of this so you don't have to even write the code yourself.
- I don't know much about shell sort , but it may well be worth a look.

Finally, I should add that when designing your algorithm it is often useful to consider what behaviour is expected most often. For example, if you know that 99\% of the time your two arrays will not be the same, then you could do some additional checks before sorting the arrays and comparing them, e.g. you could take the sum of the elements of each array and compare them. If the arrays are the same then these sums will be the same and you will have to go ahead and sort the arrays anyway, but if they are different then almost certainly the sum will be different, and you could save yourself a lot of time on average, although in the worst case it would take longer due to the extra check. However, if most of the time you expect the arrays to be the same then you may as well just get on and sort them straight away.

I hope that some of this has made sense/been of interest/useful,

James.

PS. If the arrays you are comparing are of real numbers rather than integers then you also have the additional problem that due to rounding errors you could get different values by adding up the same set of numbers in a different order!

By Amanda Turner (Agt24) on Thursday, February 17, 2000 - 01:18 pm :

As James says, a sorting algorithm will be faster than computing

s_{k}

for

k = 1

, ...,

n

. However my argument above is intended to show that this summing powers algorithm WILL guarantee that two arrays are rearrangements of each other.

Letting $e_{i}$ be defined as above, except without the $(- 1)^{i}$ coefficients in front (sorry, I probably should have left it off originally), and letting

$E (t) = 1 + e_{1} t + e_{2} t^{2} + \dots + e_{n} t^{n} = (1 + x_{1} t) (1 + x_{2} t) \dots (1 + x_{n} t)$

we notice that

$E' (t) / E (t) = (\log E (t))' = \sum_{i = 1}^{n} x_{i} (1 + x_{i} t)^{- 1} = s_{1} - s_{2} t + s_{3} t^{2} + \dots$

So $E' (t) = (e_{1} + 2 e_{2} t + \dots n e_{n} t^{n - 1}) = (s_{1} - s_{2} t + s_{3} t^{2} + \dots) (1 + e_{1} t + e_{2} t^{2} + \dots + e_{n} t^{n})$ .

Equating coefficients of $t$ , we get $e_{k}$ defined in terms of $e_{1}$ , ..., $e_{k - 1}$ , $s_{1}$ , ..., $s_{k}$ and so, by induction, we can obtain expressions for $e_{k}$ in terms of $s_{1}$ , ..., $s_{k}$ and so you can express $P (t)$ in terms of the $s_{k}$ 's. So by my original argument this guarantees that your arrays are rearrangements of each other if and only if the corresponding $s_{k}$ 's are equal.

By David Loeffler (P865) on Friday, February 18, 2000 - 01:11 pm :

Dear all...

Thanks - it will probably take me a while to digest all that though. If I have any problems I will get back to you.

David Loeffler