Condividi tramite


Puzzling through Erasure II

Bruce Eckel has some further thoughts about the use of erasure in Java generics.

Erasure is the way that Java gets generic types without adding VM support. When the developer writes something like:

List<Integer>

in Java generics, the compiler knows that this type can only contain an integer, but when the bytecodes are generated, the type is really:

List<Object>

This means that Java generics doesn't give you the performance benefit that .NET generics do - not only do you not have the ability to create generics on value types (not that surprising given that Java has previously not had boxing), but you still have typecasts in your generated code.

Advantages? Well, the advantage I always heard was that erasure meant that you could run on downlevel VMs, but it turns out that that isn't true. It does mean that it's much easier to write a 5.0 VM than it would be if an approach such as the .NET one was used.

The other advantage is that because the underlying object is really a List<Object>, it's possible to implement a wildcard feature in which you can write something like (and forgive me if my Java syntax is wrong):

List<?>

which means a List of anything. You can get away with this in erasure because everything's the same type under the covers, but it would be much more challenging in the .NET implementation because List<double> is very different than List<string>.

My personal opinion is that it's far more important to be able to have generics over value types and performant implementations than to support wildcards. In VS2003/C# V1.1, programmers sometimes need to ask the following question:

"Hmm, I need a list of integers here. Should I use an ArrayList, which is easy to write, or should I use my own IntegerList class, which is a bit harder to write, but performs better. Or should I use an array".

The fact that users need to consider that is bad, as is the fact that code contains all three options, and it's often hard to know why a specific option was chosen.

Comments

  • Anonymous
    September 23, 2004
    The comment has been removed
  • Anonymous
    September 23, 2004
    pragmatism vs academia
  • Anonymous
    September 23, 2004
    On a related note, my personal opinion is that all generic classes should be abstract. ie

    List<string> mylist=new List<string>;

    should be dissallowed because List<string> should be an abstract type. This would force all programmers to do the following:

    internal class StringList : List<string> {}

    ...

    StringList mylist=new StringList();


    which hides the details of the list of strings behind a class which can be easily extended with additional functionality at a later point. These additional changes would then be localized to the StringList class, rather than having to modify many instance creation statements across the entire codebase.

    For some strange reason, House of Straw, House of Sticks and House of Bricks come to mind...
  • Anonymous
    September 23, 2004
    Doesn't the .NET approach involve JITting the collection code for every new type it's instantiated with (like C++ templates, but at run-time)?

    If this is the case then I imagine the Java approach scales down to PDA's (and other such small devices, like phones) much better. It should provide lower memory requirements and could be faster code (I would assume that JITting is quite slow on these devices).
  • Anonymous
    September 23, 2004
    I really prefer the .NET way of generics and I would have been really disappointed if left the boxing in the generated code.

    However, there is still some work to do, to support generic operators and stuff like that ;-)
  • Anonymous
    September 23, 2004
    Pete, IIRC, the jitting is handled on a per-appdomain basis.

    All reference types share the same jitted code. For each value type, the code is jitted the first time such a type is created, and then shared.
  • Anonymous
    September 23, 2004
    So... where does the performance gain come from (in the .NET version of generics)?

    If a List<T> is only jitted once for all reference types of T, surely the jitted code must be no different from List<Object>? If so, how is this different from the Java approach (for ref types)?
  • Anonymous
    September 23, 2004
    The comment has been removed
  • Anonymous
    September 23, 2004
    "One advantage you missed is that because a List<T> is really just a List under the covers, you can use generics and still pass a List<T> to library code that was written before generics were even a possibility. Of course, it would be up to you to verify the type safeness of this decision (or be prepared to accept a failed checked cast in the future)."

    That's what IList is for, and other untyped collection interfaces. Correct? So what is the advantage again of wild cards? I see absolutely none.
  • Anonymous
    September 23, 2004
    Frank-

    As far as I'm aware, an IList<T> is not an IList and the two interfaces don't directly match up. You'd have to thunk all your calls and make a wrapper going through each time. Also, this is a property of erasure, not wild cards. Wild cards are a particular generic feature; erasure is a way to implement generics. It's possible to have one without the other (in both directions).
  • Anonymous
    September 23, 2004
    Nicholas,

    You have found a few cases where I wasn't precise in what I wrote, which is good. I have a few comments on other things that you wrote.

    When I said, "there are still typecasts in the generated code", I meant "the final result is as if you still had typecasts in the code". With erasure, you still need to have the runtime type checking in the native code. With the .NET implementation, that's not true - there is no need for runtime type checking, so the effect is as if there are no casts any more. Which improves performance, though the improvement due to removing casts is much smaller than the improvement from getting rid of boxing.

    I'm not an expert on our generics implementation, but I'm not terribly optimistic about the possibility of supporting wildcards. If you define a supertype of both List<double> and List<string>, that type needs to have Add(<some type>).

    But there's no type that is both a reference type and an unboxed value type. If you use object, you're back to boxing again.

    List<T> implements both ILIst<T> and IList, so if you want to pass a List<int> to a routine that expects an IList, it will work fine. You just have to know that you'll be getting boxing on each item.


  • Anonymous
    September 23, 2004
    Eric, that is good news about IList<T> : IList! I had complained about that months ago and got wontfixed. Too bad about the boxing, but that's a hard case.

    > I meant "the final result is as if you still had typecasts in the code"

    Right, exactly. But there's some subtlety here. If we can perform an escape analysis on a generic object and see that it's never accessible to non-generic code (trickier than it sounds), then we can prove cases where the typecasts are unnecessary. I have no idea what Sun has planned, but it's very similar tech to their virtual method analysis...

    It's definitely true that eliminating boxing is a bigger win than this in most real world programs though.

    > If you define a supertype of both List<double> and List<string>

    I don't think it's necessary to define such a supertype though. The <?> captures types but the implementation doesn't care what that results in. It's all just type theoretical bookkeeping on the compiler's end, like varargs. At some point you have to make a concrete call and pass in either the List<double> or List<string>. Admittedly it would be a bit gooky if the compiler had to emit separate methods for List<? extends object> and List<? extends System.ValueType> but I hope it could be done smoother than that.

    40 characters by 8 lines is a tough interface to explain things in. A big feature like this would take a lot of careful thought. I think it would be great if it happened though.
  • Anonymous
    September 23, 2004
    I still don't see how you avoid run-time casts:

    Say I have a class, MyClass, that is a reference type. You have said that all List<RefType>s will share the same code (only jitted once), so how can I get a MyClass object back from the List<MyClass>, without it doing a run-time cast?
  • Anonymous
    September 23, 2004
    > Eric, that is good news about IList<T> : IList! I had complained about that months ago and got wontfixed. Too bad about the boxing, but that's a hard case.

    That's not what Eric said. You're implying this:

    interface IList<T> : IList {...}

    Eric said this:

    class List<T> : IList<T>, IList {...}

    This is probably why your bug got marked as wontfix.

    Regardless, you can still pass List<T> to an IList method:

    void OldMethod (IList list);

    List<int> n = new List<int>();
    n.Add (42);
    OldMethod (n);
  • Anonymous
    September 23, 2004
    > Say I have a class, MyClass, that is a reference type. You have said that all List<RefType>s will share the same code (only jitted once), so how can I get a MyClass object back from the List<MyClass>, without it doing a run-time cast?

    This is possible because the JIT operates at a lower level. The compiler + JIT type check all method boundaries, ensuring that everything is type checked.

    The JIT generated code doesn't need to redo the type checking; it was already done during the JIT. Consequently, the generated code can just stay in the realm hardware, of pointers. Since Reference Types are really just pointers, they all have the same size, so the generated code doesn't need to worry about the actual types involved -- no fixup or change would actually be required.

    It's akin to just using a C++ reinterpret_cast everywhere an object reference is used -- the cast doesn't do anything, it's just present to satisfy compiler type safety:

    int *i = new int;
    // doesn't do anything on the actual hardware; just pleases the compiler
    char c = reinterpret_cast<char>(i);
    // which can be bad if you're not careful...
    printf ("%sn", c); // undefined

    In the context of .NET, "generated reinterpret_casts" (really just a register copy) look terribly unsafe, they're not -- all the type checking has already been done, so there is no loss in type safety (as long as there aren't any bugs in the type analysis code).
  • Anonymous
    September 23, 2004
    "This is possible because the JIT operates at a lower level. The compiler + JIT type check all method boundaries, ensuring that everything is type checked."

    So the jitter compiles the code for List<T> the first time it hits one, but checks all future declarations/usages separately when it reaches them? Or can this type-checking be done entirely by the compiler? I guess it must be, because otherwise it would be a possible run-time exception for an invalidly declared container.

    Interesting.. I assume this is why constraints are necessary? Because the code is jitted once, all operations must be valid at that time. I guess this is the biggest restriction of implementing generics this way (unless I'm getting confused).

    In that case, Java has really lost out. They don't have the power of C++ templates, and they don't have the speed of .NET generics. I guess C++/CLI wins all round (since it supports both).
  • Anonymous
    September 23, 2004
    The comment has been removed
  • Anonymous
    September 23, 2004
    > So the jitter compiles the code for List<T> the first time it hits one, but checks all future declarations/usages separately when it reaches them? Or can this type-checking be done entirely by the compiler?

    Not exactly. The JIT always checks declarations and usage, not just the first time, to ensure consistency and safety. Of course, this is at JIT time, so you only pay once (per AppDomain).

    This type checking CANNOT be done ONLY by the compiler, as the compiler may be buggy, or the program executed may be corrupted/trojaned, or ILASM is your compiler, or...

    > I assume this is why constraints are necessary?

    Absolutely. Without these constraints, the JITted code would need to perform the runtime check (or some similar mechanism) to ensure that the method dispatch would be correct/safe. This would hurt performance.

    Alas, this implementation also means that a generic class can't inherit from one of its arguments, which is possible in C++:

    template <class T> class Helper : public T {/* ... */}

    Such functionality would be useful for "kludging" multiple-inheritance-like behavior, but obviously isn't possible under the current Generics mechanism.
  • Anonymous
    September 24, 2004
    Jonathan- I'm not implying anything. What I asked for is exactly what Eric said is happening. You're reading things into posts that aren't there. I am glad though that the problem got solved even it if does involve boxing.
  • Anonymous
    September 24, 2004
    Jonathan,

    I agree that's it's possible to build analysis into the system so that you can know that the casts really aren't possible. It's a fairly involved process with the prospect of dynamically loaded code and IL generation on the fly (don't remember if Java has this), but it's possible.

    I'd prefer, however, a design in which you didn't have to do this, as it adds a bunch of complexity on the runtime side.

    Constraints are very necessary for the compilers to give a good programming experience. Unlike the C++ model (anything goes, we'll check at instantiation time), the .NET model tightly constrains based on the constraints of the generic type. That means that if I constrain a type parameter to implement IList<T> (for example), I can depend on it having those members.

    To do it the C++ way would require that there be a validator that checked whether the generic type IL could work correctly for every different type it was instantiated on. There's a bit of checking in the JIT, but not really of this sort, and you'd have to do it for every type (ie both List<Employee> and List<Address>).
  • Anonymous
    September 24, 2004
    The comment has been removed
  • Anonymous
    September 24, 2004
    Jonathan,

    I meant to comment on this last time...

    While we don't currently allow it, it is possible for the .NET implementation to be extended to allow

    class Extend<T>: T

    We don't currently allow it, and it would break some invariants that currently make things simpler in the runtime, but it's possible. I don't know if/when we'll do this, but we do understand the utility of being able to do so.

    Pete,

    If you chose to do that, you would have to have that checking both at the instantiation time and at compilation time. You would also lose some of your design-time utility. Right now, Intellisense knows exactly what operations are valid on type parameter T. Without this being limited by constraints, it would have no idea what to present - the valid set would become every field and operation on every type.

    We do agree that the current constraint syntax is less expressive than where we'd like to end up.