Montag, 16. Mai 2011

Why equals() is broken and how to do it right

Two simple explanations about the intention of equals() (and always associated, of course, hashCode()):
1.) I once read in some Effective Java book from Josh Bloch something like: "When designing a class, you should think about: is it supposed to be a value type, then override .equals() and .hashCode(), otherwise don't and leave the default identity comparison."
2.) Most tutorials briefly streak the implementation of equals() as considering the "significant" fields of the class and then they go on and on about how equals() and hashCode() have to be implemented consistently (a little unnatural, isn't it?) and that it is needed for hash collections etc.

Both statments are conveniently simple and as most conveniently simple things, they are only partly correct.

Ad 1.) Nice idea, but what ARE value types? Okay String and Integer are value types. What about Date? Or Person? Aren't those value types, too? What the simple value type definition is lacking is the view if immutability. Date is not immutable, an instance of Person most probably won't be, either. Collections themselves are not immutable in general. So what happens when you put those mutable objects in a hash collection and change their internal state: the collection becomes inconsistent. It becomes broken (I won't go into the basics here). This is often described in guides as "the collections will do weird things" (how funny if your application depends on correctness) or they sell it as being the programmer's fault (hello? I simply NEED to have mutable objects in many places. Why does that mean that I can't use elemental data structures any longer with them?)

On the other hand, still some kind of logic is needed to evaluate if two objects of one type are content-wise equal.

Ad 2.) So what about the "significant" fields? Wait a second: what ARE the significant fields? Doesn't that depend on the situation? How should we know when implementing a class? Non of those guides explain which fields to use for it. Because it CAN'T be decided once and for all.
In one place, your business logic has to ensure that a collection contains only distinct person instances (say with distinct first name, last name and date of birth). Somewhere else, the persons have to be distinct by hometown. Or SSID. Or whatever. And in the next place, the logic is about actual instances (reference equality) again (say in an OR mapper registry that keeps track of loaded instances).

There really is a TON of ways what "equality" can mean in different contexts:
Equality of...
... reference (identity)
... data identity (what a PRIMARY KEY in a database is)
... full data (with transient fields, without transient fields?)
... some special key data combination (like additional UNIQUE keys in databases)
... data identity and behaviour (type)
... full data and behaviour (type)
... and so on

Which one to implement in a single hard coded equals() method?
The answer is: none, because it is not that simple. Because equality changes with the situation.
It also won't lead you anywhere to extends subclasses every now and then just to override a troublesome equals() (yes, and hashCode()) implementation (SsidEqualityPerson, lol). Your software complexity will explode without need and apart from that, there's no multiple class inheritence, which alone will kill attempts like that. Almost the same goes for creating decoration classes over and over. In any case, both approaches will kill referential integrity (which is crucial for advanced object graph technologies like OR-Mappers)

The right way to do it is: define an Equalator<T> interface, pretty similar to the already existing Comparator<T>.
So and only so, you can define the right equality for the right situation.

And what about the hashCode which has to be consistent with the equals logic? Trivial: HashEqualator<T> extends Equalator<T>.
Here you are. And notice: that way, you have NATURALLY defined the contract that whoever implements equals() (or equal(T t1, Tt2) now) also has to implement hashCode() (or hash(T t) now) as well. And all of a sudden, the architecture gets less weird (like the unnatural "you have to implement equals() and hashCode() consistently although it's not really obvious why from the language") and yet more powerful and flexible.

Another indication of why equals() is misconceptioned: why, for probably non existing god's sake, does the parameter type have to be Object?
"Oh that's because on such a low level of language, typing has to be flexible enough to... blabla... because collections must... bla... no generics back then... blablabla". Bullshit. Because it's misconceptioned, peroid. Just like any other MyClass.doStuffWithOtherStuff(Object shouldActuallyBeOfTypeStuffButHeySorry) is that we all got a dressing-down in our early days when we tried to plug in another load of functionality in an conceptionally unfitting architecture. Nothing else.

Here's a another nice example for the misconception, actually even a bug, of equals() and hashCode():
All JDK collections assume that all equals() methods are implemented as either identity equality or "safe" simple value type equality. Yet they themselves break that assumption.
Following example (again I spare to type the explicit code sample):
Define two List<String>, an ArrayList and a LinkedList, both with content "A", "B", "C".
Define a List<List<String>> and add the ArrayList and then the LinkedList.
Tell the list-list to remove the LinkedList.
Which list remains in the list-list? Right: The LinkedList remains. The ArrayList has been removed by the command to remove the LinkedList.
Why? Because lists inherently always use .equals() internally to determine identity (which is extremely stupid, tbh). And ArrayList and LinkedList both implement equals only by (again stupidly) comparing their content, but NOT by comparing the type. This means a LinkedList and an ArrayList with equal content are seen as completely equal, up to being "identical" (huh?). If that is so, why are there two implementations in the first place? Why is everyone advised (rightly) to think about which implementation is needed for a certain situation, if all list implementation types are considered irrelevantly equal among each other and by the JDK itself?
The answer is: because it's ill-conceived.
Academic example? Hardly relevant in practice? I say: it's a clean indication for a half baked architecture that every now and then causes situations like "Oh my, JDK classes don't work like that although it would be perfectly logical. Now I have to implement a workaround and copy huge collections into arrays and back all over the place which will bloat my code, kill performance and create tons of potential errors".

The conceptional problem here is that equals() got overused for different meanings. On the one hand, it is used for identifying objects (either identity-wise or immutable-value-type-wise, such as Strings) and on the other hand, it is used for content comparison of mutable objects (like with date or with collections). By this unguided overuse, the consequence is that no matter how equals() gets implemented, it is always broken for the one concearn or the other. That's a typical "overlapping of concearns" instead of clean "seperation of concearns".

And don't get started on "that's why there's collections like IdentityHashSet". Because
a) They explicitely break their own collection contracts (also a very good sign of the oversimplified equality misconception)
b) And what about all the other types of situational equality? Do we really have to implement hundreds of PersonFullDataHashSet, PersonFullDataArrayList, PersonTypedDataIdentityArrayList and so one for every single type we want to use in collections? Of course not. This is ... Sparta maybe, but no clean seperation of concearns and sure as hell no quality software development)

Again, the revolutioning simple solution is: Inherent equality of ALL things is equality of reference at first. Everything else is done by Equalators as required.
For example:
stringList.contains("Hello"); // identity comparison
stringList.contains("Hello", String.valueEquality); // value equality Equalator
stringList.contains("Hello", MyUtils.firstLetterEquality);
stringList.contains("Hello", MyUtils.stringLengthEquality);

If this basic model then gets suboptimal in certain situations (e.g. because Strings are compared value-wise in 95% of the time and you don't want to pass equalator functions every time), there can be ONE alternative implementation that accepts a custom equalator to be used as internal logic that will fulfill all needs for custom equality. Something like:

VarList<String> stringList = new VarList<String>(String.valueEquality);
stringList.contains("Hello"); // value equality

Or when using Sets:
VarSet<String> stringSet = new VarSet<String>(String.valueHashEquality);
stringSet.contains("Hello"); // internal value hashing and value equality

Only then are collections braced for all situations of equality that is needed to do efficient quality software development. Note that this is not just some simple improvement that fixes one certain type of equality handling, but it fixes equality handling in principle by exploiting functional programming (really surprising why Comparator is around for so long but there has never been introduced an Equalator)

And what about Object.equals() and Object.hashCode() ?
Well, there are severel options here:
1.) Implement one HashEqualator<?> valueHashEquality that calls them internally to keep backwards compatability if you must.
2.) For actual value types, meaning immutable types (or at lest immutable in the state used for determination of equality and hash calculation, which I call "hashimmutable"), it's still okay to use them.
3.) Apart from that, as crazy as it might sound: @Deprecated equals(), @Deprecated hashCode() (along with the even more broken @Deprecated clone(), of course, which really cleans up the Object class).


This new (and the first complete) concept of equality is one of many revolutionary core concepts of the "Next Generation Collections" framework I'm working on and which is nearing completion and will be presented in a corresponding paper here once it's done in a all around usable degree. It fixes many other misconceptions and bugs of current JDK's collections framework and adds a whole new world of functionality and performance to collections, thus massively reducing software complexity (already does in my project at work) while increasing collection usability and performance at the same time.

Keine Kommentare:

Kommentar posten