Jadoth: Mai 2011

Sonntag, 22. Mai 2011

Assignment Inlining

What is the difference between the following two code samples:

1.)

final int[] values = this.someValues;
if(index < values.length){
    return values[index];
}

and

2.)

final int[] values;
if(index < (values = this.someValues).length){
    return values[index];
}

Both do exactly the same thing and both a equally long. The second one is a little more complicated to read, granted. But it is also reasonably faster.
Why is it faster?

For a very natural reason, that applies to CPUs as well as to human beings. Let's say CPUs are only human and work on a desk with a shelf (= CPU registers). A local variable is effectively just a CPU register or a slot in the shelf, for that matter. To fetch a sheet of paper (load a value of the variable) from the shelf (the register) or to put (store) it there takes time.

Example 1 instructs the CPU via bytecode to:
- write the address of the array on a sheet of paper
- put the sheet into slot 1 of the shelf
- fetch the sheet from slot 1
- compare the length of the array on the sheet with the index
- ...

Example 2 instructs the CPU via bytecode to:
- write the address of the array on a sheet of paper
- and compare the length of the array on the sheet with the index

(Okay the comparison is a little lopsided: actually, the value on the stack is duplicated and stored away once instead of stored away and loaded back. But the picture stays the same)

It's quite obvious: The instructions of the first example are a little... well.. silly for the executing CPU or human being, because the newly written sheet is put away just to be fetched again. Number two is a little smarter, because it says "continue to use what you've already got".

The performance gain is very small in numbers, BUT: If a method doesn't do much else (like in the example), or if multiple optimizations of that type can be done, then the small gain compared to the anyway small workload becomes relatively high.
And that comes without "side effect". No structural disadvantage, no instability, nothing. Even variable finality is kept. It's just that the characters in the source code are arranged differently without changing their meaning and the stuff gets faster. That's all.

Sometimes, such an "assignment inlining" can even spare a local variable completely - and then it gets really fast (because storing to a register is surprisingly expansive).

The internet provides hardly any information to that topic, which I wonder about a little. Only one sophisticated bytecode optimization talks about replacing an equivalent STORE/LOAD by a DUP(licate) instruction. So I researched it on my own. It really works that way, every little test project can show it.

Of course that doesn't mean that everyone shall write bytecode-optimized code all the time.
It depends on the situation if it makes sense to do it.
If you have a piece of code that is called only once when pressing a button, or that assembles a SQL statement (which itself takes computer-AGES to be executed), etc., then there's absolutely NO point in reducing readability to gain some 0.000001% performance. In such cases, readability and maintainability of source code is much more important than any performance optimizations.

On the other hand, if you implement something like a Comparator, a tiny piece of code (3-4 lines maybe) but which gets called a million or a billion times subsequently, then such a small change can make a big difference. Especially when algorithms grow logarithmical or even exponential in runtime complexity (like sorting, for example), this can quickly mean the difference of several seconds response time per action for the user.
(and honestly: in 3-4 lines of code, even inlined statements still keep up readability)

This is one of many performance optimizations I use in my "Next Generation Collections" framework, all in all resulting for example in GrowList.add() (the behaviour-wise cleanly named equivalent to the implementation-wise named ArrayList of JDK) being 200% to 300% as fast as ArrayList.add().

Montag, 16. Mai 2011

Why equals() is broken and how to do it right

Two simple explanations about the intention of equals() (and always associated, of course, hashCode()):
1.) I once read in some Effective Java book from Josh Bloch something like: "When designing a class, you should think about: is it supposed to be a value type, then override .equals() and .hashCode(), otherwise don't and leave the default identity comparison."
2.) Most tutorials briefly streak the implementation of equals() as considering the "significant" fields of the class and then they go on and on about how equals() and hashCode() have to be implemented consistently (a little unnatural, isn't it?) and that it is needed for hash collections etc.

Both statments are conveniently simple and as most conveniently simple things, they are only partly correct.

Ad 1.) Nice idea, but what ARE value types? Okay String and Integer are value types. What about Date? Or Person? Aren't those value types, too? What the simple value type definition is lacking is the view if immutability. Date is not immutable, an instance of Person most probably won't be, either. Collections themselves are not immutable in general. So what happens when you put those mutable objects in a hash collection and change their internal state: the collection becomes inconsistent. It becomes broken (I won't go into the basics here). This is often described in guides as "the collections will do weird things" (how funny if your application depends on correctness) or they sell it as being the programmer's fault (hello? I simply NEED to have mutable objects in many places. Why does that mean that I can't use elemental data structures any longer with them?)

On the other hand, still some kind of logic is needed to evaluate if two objects of one type are content-wise equal.

Ad 2.) So what about the "significant" fields? Wait a second: what ARE the significant fields? Doesn't that depend on the situation? How should we know when implementing a class? Non of those guides explain which fields to use for it. Because it CAN'T be decided once and for all.
In one place, your business logic has to ensure that a collection contains only distinct person instances (say with distinct first name, last name and date of birth). Somewhere else, the persons have to be distinct by hometown. Or SSID. Or whatever. And in the next place, the logic is about actual instances (reference equality) again (say in an OR mapper registry that keeps track of loaded instances).

There really is a TON of ways what "equality" can mean in different contexts:
Equality of...
... reference (identity)
... data identity (what a PRIMARY KEY in a database is)
... full data (with transient fields, without transient fields?)
... some special key data combination (like additional UNIQUE keys in databases)
... data identity and behaviour (type)
... full data and behaviour (type)
... and so on

Which one to implement in a single hard coded equals() method?
The answer is: none, because it is not that simple. Because equality changes with the situation.
It also won't lead you anywhere to extends subclasses every now and then just to override a troublesome equals() (yes, and hashCode()) implementation (SsidEqualityPerson, lol). Your software complexity will explode without need and apart from that, there's no multiple class inheritence, which alone will kill attempts like that. Almost the same goes for creating decoration classes over and over. In any case, both approaches will kill referential integrity (which is crucial for advanced object graph technologies like OR-Mappers)

The right way to do it is: define an Equalator<T> interface, pretty similar to the already existing Comparator<T>.
So and only so, you can define the right equality for the right situation.

And what about the hashCode which has to be consistent with the equals logic? Trivial: HashEqualator<T> extends Equalator<T>.
Here you are. And notice: that way, you have NATURALLY defined the contract that whoever implements equals() (or equal(T t1, Tt2) now) also has to implement hashCode() (or hash(T t) now) as well. And all of a sudden, the architecture gets less weird (like the unnatural "you have to implement equals() and hashCode() consistently although it's not really obvious why from the language") and yet more powerful and flexible.

Another indication of why equals() is misconceptioned: why, for probably non existing god's sake, does the parameter type have to be Object?
"Oh that's because on such a low level of language, typing has to be flexible enough to... blabla... because collections must... bla... no generics back then... blablabla". Bullshit. Because it's misconceptioned, peroid. Just like any other MyClass.doStuffWithOtherStuff(Object shouldActuallyBeOfTypeStuffButHeySorry) is that we all got a dressing-down in our early days when we tried to plug in another load of functionality in an conceptionally unfitting architecture. Nothing else.

Here's a another nice example for the misconception, actually even a bug, of equals() and hashCode():
All JDK collections assume that all equals() methods are implemented as either identity equality or "safe" simple value type equality. Yet they themselves break that assumption.
Following example (again I spare to type the explicit code sample):
Define two List<String>, an ArrayList and a LinkedList, both with content "A", "B", "C".
Define a List<List<String>> and add the ArrayList and then the LinkedList.
Tell the list-list to remove the LinkedList.
Which list remains in the list-list? Right: The LinkedList remains. The ArrayList has been removed by the command to remove the LinkedList.
Why? Because lists inherently always use .equals() internally to determine identity (which is extremely stupid, tbh). And ArrayList and LinkedList both implement equals only by (again stupidly) comparing their content, but NOT by comparing the type. This means a LinkedList and an ArrayList with equal content are seen as completely equal, up to being "identical" (huh?). If that is so, why are there two implementations in the first place? Why is everyone advised (rightly) to think about which implementation is needed for a certain situation, if all list implementation types are considered irrelevantly equal among each other and by the JDK itself?
The answer is: because it's ill-conceived.
Academic example? Hardly relevant in practice? I say: it's a clean indication for a half baked architecture that every now and then causes situations like "Oh my, JDK classes don't work like that although it would be perfectly logical. Now I have to implement a workaround and copy huge collections into arrays and back all over the place which will bloat my code, kill performance and create tons of potential errors".

The conceptional problem here is that equals() got overused for different meanings. On the one hand, it is used for identifying objects (either identity-wise or immutable-value-type-wise, such as Strings) and on the other hand, it is used for content comparison of mutable objects (like with date or with collections). By this unguided overuse, the consequence is that no matter how equals() gets implemented, it is always broken for the one concearn or the other. That's a typical "overlapping of concearns" instead of clean "seperation of concearns".

And don't get started on "that's why there's collections like IdentityHashSet". Because
a) They explicitely break their own collection contracts (also a very good sign of the oversimplified equality misconception)
b) And what about all the other types of situational equality? Do we really have to implement hundreds of PersonFullDataHashSet, PersonFullDataArrayList, PersonTypedDataIdentityArrayList and so one for every single type we want to use in collections? Of course not. This is ... Sparta maybe, but no clean seperation of concearns and sure as hell no quality software development)

Again, the revolutioning simple solution is: Inherent equality of ALL things is equality of reference at first. Everything else is done by Equalators as required.
For example:

stringList.contains("Hello"); // identity comparison
stringList.contains("Hello", String.valueEquality); // value equality Equalator
stringList.contains("Hello", MyUtils.firstLetterEquality);
stringList.contains("Hello", MyUtils.stringLengthEquality);

If this basic model then gets suboptimal in certain situations (e.g. because Strings are compared value-wise in 95% of the time and you don't want to pass equalator functions every time), there can be ONE alternative implementation that accepts a custom equalator to be used as internal logic that will fulfill all needs for custom equality. Something like:

VarList<String> stringList = new VarList<String>(String.valueEquality);
stringList.contains("Hello"); // value equality

Or when using Sets:

VarSet<String> stringSet = new VarSet<String>(String.valueHashEquality);
stringSet.contains("Hello"); // internal value hashing and value equality

Only then are collections braced for all situations of equality that is needed to do efficient quality software development. Note that this is not just some simple improvement that fixes one certain type of equality handling, but it fixes equality handling in principle by exploiting functional programming (really surprising why Comparator is around for so long but there has never been introduced an Equalator)

And what about Object.equals() and Object.hashCode() ?
Well, there are severel options here:
1.) Implement one HashEqualator<?> valueHashEquality that calls them internally to keep backwards compatability if you must.
2.) For actual value types, meaning immutable types (or at lest immutable in the state used for determination of equality and hash calculation, which I call "hashimmutable"), it's still okay to use them.
3.) Apart from that, as crazy as it might sound: @Deprecated equals(), @Deprecated hashCode() (along with the even more broken @Deprecated clone(), of course, which really cleans up the Object class).

This new (and the first complete) concept of equality is one of many revolutionary core concepts of the "Next Generation Collections" framework I'm working on and which is nearing completion and will be presented in a corresponding paper here once it's done in a all around usable degree. It fixes many other misconceptions and bugs of current JDK's collections framework and adds a whole new world of functionality and performance to collections, thus massively reducing software complexity (already does in my project at work) while increasing collection usability and performance at the same time.

Donnerstag, 5. Mai 2011

Java Single Page Reference

Some time ago, I had the idea of condensing all non-trivial Java syntax, specially treated standard API (JDK) components and basic conceptional terms on one single page - a Java Single Page Reference.
The intention is not to write another programming guide or such (would hardly fit on one page :-D), but to have a nice and lean overview of common language items that are enountered in basic and advanced Java developing with each item being a link to a detailed explanation somewhere else on the web. For example if someone hasn't dealt with, say the Externalizable interface or reflection or proxy classes, etc., it would be always a simple thing to open the reference page and find a link to a nice article on the unknown topic (because, believe it or not, some Java topics are still hard to google or the search takes at least more than a quick glance. For example researching what that "SAM" meant that everybody on lambda-dev was taking about was anything but a quick thing to do back then ^^).
Also, I find it intriguing and inspiring to have the whole language (on a very condensed level, of course) at a glance.

Sadly, for now it's just a plain PDF without any links, but the goal is to create it a nice HTML site, of course.
Still, it might be worthwhile to present it even in this early form. I'm sure many people can find an intresting item to google here and there that they aren't aware of, yet. Or post a comment about a missing item that I don't know, yet, of course :-).

So, here it is: Java Single Page Reference (PDF)