While writing a computer program recently, I needed a short
routine to check whether two arrays of numbers were rearrangements
of each other. Of course you can just sort them and compare
individual terms but this is very slow. It occurred to me that I
could just check that they had the same sum, sum of squares, sum of
cubes etc.
This of course is very similar to Richard Mycroft's problem about
means and variances.
What I would like to know is, how many different powers do I need
to sum to be sure of not getting spurious matches? I suspect I
would need to sum up to the nth power if there were n numbers but I
cannot prove this generally, only for n=2 and n=3. Also, how does
the situation change if the numbers are all known to be positive
integers? And what if I check their products are the same
too?
David Loeffler
P.S. Sorry, I forgot to say that both arrays are known to be the
same size.
David
Suppose your numbers (in one array) are
x1, ..., xn, and consider the polynomial
P(t)=(t-x1)...(t-xn) whose roots are the
xi's. Two arrays are rearrangements of eachother if and
only if they 'generate' the same polynomial. Now the coefficients
of the above polynomial are
e1=-Sn
i=1xi
e2=S
i<jxixj
e3=-S
i<j<kxixjxk
...
en=(-1)nx1...xn
where
P(t)=tn+e1tn-1+...+en.
Now Newton's theorem states that any symmetric function in
x1, ..., xn (i.e. if you rearrange the
xi's the function is unchanged) can be expressed in
terms of e1, ..., en. In particular,
sk=Sn
i=1xik can
be expressed in terms of the ei's for all k. Now try and
deduce that you can express the ei's in terms of the
sk's. Why does it then follow that your two arrays are
rearrangements of eachother if and only if the corresponding
sk's are equal? Since you have n elements in your array,
you will need at least n equations to guarantee that you don't get
spurious matches. Try and find two sets of numbers which give the
same values of sk for n-1 values of k but that are not
rearrangements of eachother. If you have any set of n functions
fk such that you can express the ei's in
terms of the fk's then it is enough that your two arrays
agree for these functions. What other sets of functions can you
think of?
I hope this is helpful. Please ask if there is anything that you
don't understand or if you have any other questions.
The problem with this method is you will
get overflows for even quite small n. You're right that you would
need to know all of the sums up to n, and I'm not even sure that
this method would be not spurious even if you had n sums to work
with. You can see why you need the n sums by thinking about it like
this: If
x1+x2+...+xn=S1 then
x1=S1-(x2+x3+...+xn).
In other words, you can choose x2 to xn
arbitrarily and get the same sum. You've just lost one degree of
freedom. Each additional constraint you put on the problem will
lose one more degree of freedom until you have done the first n
sums. The n sums will certainly fix it for positive numbers, but
they might not for negative numbers, think of
S2=x12+...+xn2,
this gives
x2=±sqrt(S2-x12+x
32+...+xn2), i.e. two
possible values.
My advice is to find another algorithm for comparing them, because
this one will not be practical if you have even moderately large n
(like 10 or so) and moderately large data. Anyway, the order of
operations it takes to generate this data is approximately
O(n3) which is very big (assuming that is that
xk is k elementary operations). There are sort
algorithms which can do better than O(nlog(n)), so it would be
quicker to sort and compare than to use the sum of powers method. I
suggest looking up bubblesort, shellsort, and quicksort on the
internet to see which would best sort your data.
A few thoughts...
Firstly, regarding the time complexity of the proposed algorithm,
you can quite easily do xk in O(log k) operations (I
think - ask if you want details), so this would mean that the
algorithm is actually 'only' O(n2.log n).
Also, you could improve this method still by doing all the power
sums simultaneously, and using the fact that to generate
xk by the usual method, you have already found
x2, ..., xk-1. This would bring the
complexity down to O(n2).
However, this is still high, and is rather irrelavent anyway since,
as Dan and Amanda have already pointed out, the algorithm won't
necessarily guarantee that the two arrays are different, and also
the problem of overflow does rather mess things up.
Secondly, I don't think that it's true that there are sorting
algorithms which 'can do better than O(n.log n)' - there are
certainly algorithms which achieve this in the worst case, but
unless you assume some knowledge of the input data (e.g. you know
that all the numbers come from a certain small set - 1,2,3,...,9
for example) then no sorting algorithm can do better than O(n.log
n) in the worst case.
A brief summary of sorting algorithms:
- Insertion sort is a very simple algorithm, particularly
good on data which is not very muddled up to start with, but it is
O(n2) in the worst case.
- Bubble sort should be avoided at all costs - although it
looks very simple it is slow. Insertion sort would almost certainly
be a better alternative.
- Merge sort and heap sort are two algorithms which
are O(n.log n) in the worst case, so are particularly good for
sorting large amounts of data.
- Quicksort is actually O(n2) in the worst case,
but on average it turns out to be O(n.log n) and in practice
usually beats merge sort or heap sort, so it's a very good
algorithm. There are sometimes (e.g. in C) standard library
implementations of this so you don't have to even write the code
yourself.
- I don't know much about shell sort, but it may well be
worth a look.
Finally, I should add that when designing your algorithm it is
often useful to consider what behaviour is expected most often. For
example, if you know that 99% of the time your two arrays will not
be the same, then you could do some additional checks before
sorting the arrays and comparing them, e.g. you could take the sum
of the elements of each array and compare them. If the arrays are
the same then these sums will be the same and you will have to go
ahead and sort the arrays anyway, but if they are different then
almost certainly the sum will be different, and you could save
yourself a lot of time on average, although in the worst case it
would take longer due to the extra check. However, if most of the
time you expect the arrays to be the same then you may as well just
get on and sort them straight away.
I hope that some of this has made sense/been of
interest/useful,
James.
PS. If the arrays you are comparing are of real numbers rather than
integers then you also have the additional problem that due to
rounding errors you could get different values by adding up the
same set of numbers in a different order!
As James says, a sorting algorithm will be
faster than computing sk for k=1, ..., n. However my
argument above is intended to show that this summing powers
algorithm WILL guarantee that two arrays are rearrangements of
eachother.
Letting ei be defined as above, except without the
(-1)i coefficients in front (sorry, I probably should
have left it off originally), and letting
E(t)=1+e1t+e2t2+...+ent
n=(1+x1t)(1+x2t)...
(1+xnt)
we notice that
E'(t)/E(t)=(log E(t))'=Sn
i=1xi(1+xit)-1=s
1-s2t+s3t2+...
So
E'(t)=(e1+2e2t+...nentn-1)=(s
1-s2t+s3t2+...)(1+e1t+e
2t2+...+entn).
Equating coefficients of t, we get ek defined in terms
of e1, ..., ek-1, s1, ...,
sk and so, by induction, we can obtain expressions for
ek in terms of s1, ..., sk and so
you can express P(t) in terms of the sk's. So by my
original argument this guarantees that your arrays are
rearrangements of eachother if and only if the corresponding
sk's are equal.
Dear all...
Thanks - it will probably take me a while to digest all that
though. If I have any problems I will get back to you.
David Loeffler