Billy McCafferty



Using Equals/GetHashCode Effectively

[Updated June 23, 2008:  A simplified and more systematic approach to this may be found within S#arp Architecture.  While the complexities of the Equals/GetHashCode method are kept away in a base class, you simply need to add an [Identity] attribute over each property in a class which is part of the domain object's "signature."  It also supports multiple attributes to be used in the same class for multi-property uniqueness - outside of the ID itself.]

Every object can override GetHashCode, but doing so is seldom required for POCOs; the Equals method usually provides all of the comparison functionality we'll ever need.  When using an ORM such as NHibernate, however, GetHashCode takes a more prominent role because it helps NHibernate determine whether an object already exists in a collection.  Not overriding GetHashCode, or doing so inappropriately, may lead to duplicate objects showing up in collections and/or objects missing altogether.  When needed, most people implement both methods and end up with similar code in both.  To ease the burden of managing both methods, we can exploit the overlap between Equals and GetHashCode and kill two birds with one stone...no offense to any PETA members out there.  (A thanks goes out to Alan Northam whose approach made me aware of the code duplication I previously had in my own approach.)

In my project work, I consider the following to be true when comparing two objects:

  • If two objects have IDs, and have the same IDs, then they may be considered equal without examining them further.  (I'm assuming ID to be an identity field, or equivalent, in the DB.)
  • If two objects have IDs, but have different IDs, then they may be considered not equal without examining them further.  E.g., if customer A has an ID of 4 while customer B has an ID of 5, then they are not equal, QED.
  • If neither object has an ID or only one of them has an ID, but their "business value signatures" are identical, then they're equal.  E.g., customer A has an ID of 4 and a social-security-number of 123-45-6789 while customer B has no ID but also has a social-security-number of 123-45-6789.  In this case, customer A and customer B are equal.  By "business signature," I mean the combination of properties which makes an entity unique, regardless of its ID.  (As an aside, see Eric Evans' Domain Driven Design for a stellar discussion of value and entity objects.  A summary version is also available.)
  • If one of them is null, then they are not equal.  (Whoo, that was easy.)

With the above considerations in mind, we'll want to write code to make the following unit test pass.  Note that Customer takes a company name into its constructor.  It also has a settable contact name.  The combination of its company name and contact name gives Customer its unique business signature.

Although some argue against a single object which all other persistable domain objects inherit from, I use one nonetheless and ingeniously call it "DomainObject."  (Those are Dr. Evil quotes there.)  "DomainObject," in its entirety, contains the following:
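
A minimal sketch of such a base class, following the four rules listed earlier. The generic ID type parameter named IdT is mentioned in the comments below; everything else here (the shape of the ID property, the IsTransient helper, the use of EqualityComparer) is an assumption made for illustration and not necessarily the original code.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: member names other than Equals/GetHashCode are assumptions.
public abstract class DomainObject<IdT>
{
    // Stays at its default (0 or null) until the ORM assigns a database identity.
    public IdT ID { get; protected set; }

    // Sealed so that subclasses cannot alter the comparison rules.
    public sealed override bool Equals(object obj)
    {
        var other = obj as DomainObject<IdT>;

        // Rule 4: null (or an unrelated type) is never equal.
        if (other == null) return false;

        // Rules 1 and 2: when both objects have IDs, the IDs alone decide.
        if (!IsTransient() && !other.IsTransient())
            return EqualityComparer<IdT>.Default.Equals(ID, other.ID);

        // Rule 3: at least one side has no ID, so compare business signatures.
        return GetHashCode() == other.GetHashCode();
    }

    // True while the object has not been given an ID yet.
    public bool IsTransient()
    {
        return EqualityComparer<IdT>.Default.Equals(ID, default(IdT));
    }

    // Subclasses return a hash of their business signature (not of the ID).
    public abstract override int GetHashCode();
}
```

Rule 3 is implemented by comparing hash codes, which is the overlap between Equals and GetHashCode that this post relies on.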

Note that Equals is sealed and cannot be overridden by a DomainObject implementation.  I suppose it could be unsealed, but since I put a lot of work into that method, I don't want anyone mucking it up! 

Now assume that Customer implements DomainObject.  As mentioned above, the combination of its company name and contact name give it its unique signature.  So its GetHashCode would be as follows:
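
The listing, as a commenter quotes it further down, builds the hash from the type name and the two signature properties. Here it is wrapped in a minimal stand-alone Customer class; the constructor and property shapes are assumptions based on the description above, and the DomainObject base class is left out so that the snippet compiles by itself.

```csharp
using System;

// Stand-alone sketch: in the design described here, Customer would derive
// from DomainObject rather than directly from object.
public class Customer
{
    public Customer(string companyName)
    {
        CompanyName = companyName;
    }

    public string CompanyName { get; private set; }
    public string ContactName { get; set; }

    // Hash of the business signature: type name + company name + contact name.
    // The ID is deliberately not part of it.
    public override int GetHashCode()
    {
        return (GetType().FullName + "|" +
                CompanyName + "|" +
                ContactName).GetHashCode();
    }
}
```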

You'll notice that the start of the method includes the full name of the class type itself.  With this in place, two different classes never produce the same signature string.  (You'll have to reconsider how GetHashCode is implemented to handle inheritance structures; e.g. a Customer and an Employee both inherit from a Person class but Customer and Employee may be equal in some instances...for this, I'd probably only add GetHashCode to the Person class.)  Additionally, note that GetHashCode should only contain the "business signature" of the object and not include its ID.  Including the ID in the signature would make it impossible to find equality between a transient object and a non-transient object.  (Equality for all regardless of transience I say!)

Billy McCafferty



Comments

Tomas Tintera said:

Hello Billy,

I like this article. I have one question: what about GetHashCode when security rules prevent some signature properties of an object from being visible, so that they are not present in a given layer or tier?

For example, the SSN for user A is in the database but, for security reasons, is not loaded into the DomainObject when viewed by user B.

# May 3, 2007 1:21 PM

Brian said:

Good stuff.  I like the idea of combining GetHashCode and Equals together.

# May 4, 2007 4:09 PM

Christian said:

Hi Billy,

I personally implement identity in the following way:

1. If a class represents an Entity and I know that, based on the design of the application, instances are never detached from / reattached to an NHibernate session, then I don't override the Equals method. The rationale here is that NHibernate's IdentityMap (the session cache) will maintain a one-to-one mapping between an object in memory and the row in the database.

2. If a class is an Entity and the design of the application allows for detached/reattached instances, I override Equals (and GetHashCode) using the technique described here: http://www.onjava.com/pub/a/onjava/2006/09/13/dont-let-hibernate-steal-your-identity.html?page=1

3. If a class represents a ValueObject I override the Equals and GetHashCode (using Resharper to do the work of comparing each field value in turn)

Again a personal thing - I prefer to assign object id as a guid (modified Guid algorithm) in a EntityObject super class, that way the whole transient / persistent distinction goes away for the purpose of Entity identity. I then tend to provide a separate static method for comparison of two instances by business key as and when this makes sense. The name of this method is usually called AreEquivalentByBusinessKey.

On the subject of business key equivalence, I think this is just one of several variants all coming from the requirement to compare two instances "by value" or by some natural key of which there could be one or more keys.

Recently I've been playing with the idea of objectifying these keys as immutable inner classes. There could then be some mileage in having these "key" classes implement a common interface, which allows them to be compared generically. They could possibly even inherit from an abstract class so as to provide some useful common behaviour such as printing themselves.

Christian

# May 14, 2007 4:01 PM

hitch said:

Hi Billy, how would you implement GetHashCode for an association entity, i.e. an entity that is only unique by the IDs of the objects it is associated with? It doesn't have any comparable properties of its own. I'm trying to think of a real-world example. A department can have a collection of people who have roles, but can only have one person with each role, therefore I want to model this with a "DepartmentPersonRole" class, which is simply an association between the department and the combination of role and person. I think this is a real-world example :)

# May 23, 2007 11:31 AM

Billy McCafferty said:

The GetHashCode could return the following:  return (Department.GetHashCode().ToString() + "|" + Person.GetHashCode().ToString() + "|" + Role.GetHashCode().ToString()).GetHashCode()

...or something like that ;)

# May 23, 2007 12:57 PM

hitch said:

Thanks Billy - would you be concerned by the overhead this could introduce if lazy initialization is being used, and (eg.) Department, Person and Role haven't been loaded yet?

# May 24, 2007 10:02 AM

Billy McCafferty said:

On the flipside, you probably wouldn't have much reason to look at the relationship in the first place if you didn't want to look at the objects it joins.  Alternatively, you could set the lazy="false" for the three sides of the relationship so that (potentially) only one query is used to load it.

# May 24, 2007 11:35 AM

Rémy said:

I think you take some risk in combining GetHashCode and Equals, because GetHashCode is not always unique. Its size is too limited (32 bits). If you have a 64-bit integer, then it's not possible to have a unique hash code for each value.

So you can have hash codes that are equal, but that doesn't mean that the objects are equal, so you can't depend on GetHashCode when you implement Equals.

In your case, the risk is smaller, because you first check the IDs for equality.

I am still struggling with this: what if the IDs are the same, but the properties of one object have changed? Are they still the same object?

You make a good point with the statement: if the IDs are equal, the objects are equal and you don't look further.

# May 27, 2007 11:49 AM
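
Rémy's point about 64-bit values can be checked directly. The sketch below assumes the Int64.GetHashCode behaviour found in current .NET implementations, which folds the upper 32 bits into the lower 32 bits with XOR; that is an implementation detail rather than a documented guarantee. The class and member names are hypothetical.

```csharp
using System;

public static class HashCollisionDemo
{
    // Two different 64-bit values. With the XOR folding described above,
    // both hash to 0: (0 ^ 0) for First and (1 ^ 1) for Second.
    public const long First = 0L;
    public const long Second = 4294967297L;   // 2^32 + 1

    // True when the two values share a hash code.
    public static bool SameHashCode()
    {
        return First.GetHashCode() == Second.GetHashCode();
    }

    // True when the two values are actually equal (they are not).
    public static bool AreEqual()
    {
        return First.Equals(Second);
    }
}
```

SameHashCode() reports a match while AreEqual() does not, so an Equals built only on hash codes would treat these two values as the same.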

Billy McCafferty said:

Rémy,

I'm not sure if this would be a problem since the 64bit integer is converted to a string before the hash code is interpreted from it.  And the hash code algorithm dictates that the same hash code will be returned for the same string.  E.g. 4,000,000,000 would return a different hash code from 4,000,000,001 even though they are both 64bit numbers; and 4,000,000,000 would always return the same hash code.

But to your point, there are certainly more strings possible than there are numbers in the 32bit int range; but I would think that the odds of coincidental overlap would be small enough to be dismissed.

I am certainly no expert when it comes to GetHashCode limitations and welcome any other concerns you may have.  (You're certainly welcome to tell me I'm wrong, as well! ;)

# May 29, 2007 2:01 PM

Rémy said:

You can 'only' have 2^32 hash code values, so in some cases you will get duplicates. You think that the odds are low, but if it happens it's an irritating bug.

I had a situation where we had to sync products if they were changed. We would sync a product if its hash code was different from the hash code we stored before the change. What happened was that each day, some products didn't sync after a change. This happened because the hash code was equal, but the products were not.

So if Equals is true then the hash codes are equal, but the reverse isn't true. The hash codes can be equal even if Equals is false!

For you the odds are lower, because you first check the IDs for equality, so chances are that you will never encounter it, but I make this remark for people blindly implementing your solution without thinking about it ;-)

# May 31, 2007 5:26 AM

AndyHitchman said:

I'm concerned that you're taking an even bigger risk by not using immutable values to calculate your hash code.

Put a subclass of DomainObject into a List and change one of the values used in GetHashCode and then see if the list thinks it Contains the object. It'll return false.

I hit this problem yesterday and it made me cry. I originally had exactly the same goal: make DomainObjects play nicely with NHibernate.  My GetHashCode was based on TId.GetHashCode(), so I'd add a transient (default(TId)) DomainObject to an ISet and have NHibernate persist the object and assign keys. Then my set would be thoroughly borked.

I googled http://weblogs.asp.net/bleroy/archive/2004/12/15/316601.aspx and realised this was a big problem.

By not using the Id's hash code you wouldn't have this specific problem, but if you changed the values of any of the properties used in the business signature, then expect broken collections.

My solution was to leave Equals and GetHashCode alone and introduce a method that checks for business-level identity equality.

Andy.

# July 12, 2007 8:17 PM
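
The effect Andy describes can be reproduced with a hash-based collection. This is a minimal sketch with a hypothetical Product class whose hash code comes from a mutable property; it uses HashSet<T> from the .NET base class library rather than NHibernate's ISet, which is an assumption made to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;

public class Product
{
    // Mutable value that feeds both Equals and GetHashCode.
    public int Sku { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as Product;
        return other != null && Sku == other.Sku;
    }

    public override int GetHashCode()
    {
        return Sku;
    }
}

public static class MutableHashDemo
{
    // The set stored the object under its old hash code, so the lookup misses.
    public static bool HashSetStillContains()
    {
        var product = new Product { Sku = 1 };
        var set = new HashSet<Product> { product };
        product.Sku = 2;
        return set.Contains(product);
    }

    // A List scans with Equals and never consults the hash code.
    public static bool ListStillContains()
    {
        var product = new Product { Sku = 1 };
        var list = new List<Product> { product };
        product.Sku = 2;
        return list.Contains(product);
    }
}
```

HashSetStillContains() returns false while ListStillContains() returns true, so whether the problem shows up depends on which kind of collection holds the object.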

vade said:

Another nice article about Equals and GetHashCode. I'm not a native English speaker and haven't spent enough time on the given article, but the advice there conflicts with Billy's DomainObject structure.

summary:

1) do not use mutable objects for creating hashcode

2) equals and gethashcode should remain the same after object creation

Billy's suggestions and functions like HasSameBusinessMeaning etc. are still a good idea, but have to be re-implemented/structured somehow.

http://www.onjava.com/pub/a/onjava/2006/09/13/dont-let-hibernate-steal-your-identity.html?page=3

# July 27, 2007 3:40 PM

vade said:

Hi, I did some tests. Fortunately C# works in a different way, so using your GetHashCode and Equals methods will not affect .NET collections. But a question still remains: should the hash code remain constant throughout the object's lifetime?

# July 28, 2007 5:39 AM

Billy McCafferty said:

I've updated the CodeProject article NHibernate Best Practices with ASP.NET, 1.2 ed with a number of

# August 13, 2007 4:06 PM

Billy McCafferty said:

Download the sample code for this post. In part one of this three part series, I offered what we’d ideally

# December 7, 2007 2:04 AM

Gary Brunton said:

I know I'm very late to the game but can you please help me?  

I'm sure I'm just missing something here but I can't see how the following test would pass:

Assert.IsFalse(customer.Equals(owner))

Assumptions:

1. customer is of type Customer, which inherits from DomainObject<int>, and owner is of type Owner, which inherits from DomainObject<int>

2. customer is not transient and has an ID equal to 7

3. owner is not transient and has an ID equal to 7

Maybe we are just not concerned with comparing objects of different types?  Is this equality method only meant to be used by NHibernate when comparing objects within a typed specific collection?  This would make sense in that case but I'm a little surprised that this method would not be used in some other context.

Thanks for any insight you may have!

# March 31, 2008 2:17 PM

Proof Police said:

Just so you are aware, you misused "QED".

> If two objects have IDs, but have different IDs, than they may be considered not equal without examining them further.  E.g., If customer A has an ID of 4 while customer B has an ID of 5, then they are not equal, QED.

QED is used after you've proved something, but here, you've defined something (equality), which is different from proving something.

Good article, though. I was looking for a way to generate hash codes, and I hadn't thought of creating a string and taking the created string's hash code.

# April 9, 2008 3:48 PM

Billy McCafferty said:

@Proof Police,

That's hilarious...thanks for keeping me in line!! ;)

# April 10, 2008 10:47 AM

S. Gryphon said:

Your implementation of GetHashCode() is not consistent with your implementation of Equals().

Looking at your first unit test where you Assert.AreEqual(customerA, customerB);, these values are equal because their Id's are the same.

Your GetHashCode(), however, would be based on the business values, "Acme" vs "Anvil", so customer A & B would have different hash codes.

So, if you stored customerA in a hash lookup (i.e. the key), then customerB, supposedly an equal value, would not find it (it would be looking in a different hash bucket).

One key rule with hash values is that if objects compare as equal, then they must be in the same hash bucket (have the same hash value).

In fact, given the 'OR' definition in your equality, there isn't any reasonable way to hash the values.

I would recommend keeping Equals() and GetHashCode() as programming constructs used in hash lookups and collections, e.g. by database PK only, and have separate implementation for various notions of business "equality", e.g. an "AreSameCustomer()" method.

# May 21, 2008 10:35 PM
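
A minimal sketch of the mismatch described above, using a hypothetical, simplified Customer in which an Id of 0 stands for "not yet persisted". Equals follows the ID-or-signature rule, while GetHashCode uses only the company name.

```csharp
using System;
using System.Collections.Generic;

public class Customer
{
    public int Id;                 // 0 means "not yet persisted" (assumption)
    public string CompanyName;

    public override bool Equals(object obj)
    {
        var other = obj as Customer;
        if (other == null) return false;

        // When both have IDs, the IDs alone decide.
        if (Id != 0 && other.Id != 0) return Id == other.Id;

        // Otherwise the business signature decides.
        return CompanyName == other.CompanyName;
    }

    // Based on the business signature only, as in the post.
    public override int GetHashCode()
    {
        return (CompanyName ?? "").GetHashCode();
    }
}

public static class HashContractDemo
{
    public static readonly Customer A = new Customer { Id = 1, CompanyName = "Acme" };
    public static readonly Customer B = new Customer { Id = 1, CompanyName = "Anvil" };

    // Equal according to Equals, because the IDs match.
    public static bool EqualsSaysEqual()
    {
        return A.Equals(B);
    }

    // A hash-based lookup for B only succeeds if B lands in A's bucket.
    public static bool SetFindsEqualCustomer()
    {
        var set = new HashSet<Customer> { A };
        return set.Contains(B);
    }
}
```

EqualsSaysEqual() returns true, yet SetFindsEqualCustomer() will in practice return false, because "Acme" and "Anvil" almost certainly hash differently and the set never gets as far as calling Equals.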

Pirobox said:

Sorry, but your implementation of Equals is causing me a bit of headache. I'd like to propose an example.

class MyObj : DomainObject<string>
{
    public string Name { get; set; }

    public override int GetHashCode() { return Name.GetHashCode(); }
}

If, for a chance, I get these objects in memory:

Obj1: MyObj(ID = "X", Name = "NameOfX"); // persistent and not modified

Obj2: MyObj(ID = "X", Name = "ChangedNameOfX"); // persistent, but modified

Obj3: MyObj(ID = null, Name = "NameOfX"); // transient

I'll obtain:

Assert.AreEqual(Obj1, Obj2);  // same id

Assert.AreEqual(Obj1, Obj3);  // same business representation

Assert.AreNotEqual(Obj2, Obj3);  // different id and different business representation.

So we have broken the transitive property of equality. Have I lost something?

# July 9, 2008 6:54 PM
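
Pirobox's three objects can be run through the comparison rules in a few lines. This is a stand-alone sketch with a hypothetical MyObj that does not derive from DomainObject; it compares names directly rather than via hash codes, an assumption that keeps the result independent of hash collisions.

```csharp
using System;

public class MyObj
{
    public string Id;      // null means transient
    public string Name;

    public override bool Equals(object obj)
    {
        var other = obj as MyObj;
        if (other == null) return false;

        // Both persistent: the IDs alone decide.
        if (Id != null && other.Id != null) return Id == other.Id;

        // At least one transient: the business signature decides.
        return Name == other.Name;
    }

    public override int GetHashCode()
    {
        return (Name ?? "").GetHashCode();
    }
}

public static class TransitivityDemo
{
    // Returns { Obj1 == Obj2, Obj1 == Obj3, Obj2 == Obj3 }.
    public static bool[] Compare()
    {
        var obj1 = new MyObj { Id = "X", Name = "NameOfX" };          // persistent, not modified
        var obj2 = new MyObj { Id = "X", Name = "ChangedNameOfX" };   // persistent, modified
        var obj3 = new MyObj { Id = null, Name = "NameOfX" };         // transient

        return new[] { obj1.Equals(obj2), obj1.Equals(obj3), obj2.Equals(obj3) };
    }
}
```

Compare() yields true, true, false: Obj2 equals Obj1 and Obj1 equals Obj3, but Obj2 does not equal Obj3, which confirms the broken transitivity.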

Sam said:

Good post Bill, I gleaned some very useful pieces from your code here.  Especially the generic type parameter for the PK (IdT).  Made things a lot easier for my base DAO class in the NHibernate-based DALs I create and maintain at work.

I wanted to quickly mention two things I've done in my base DAO class that help in quite a few areas, especially equality comparisons.  One is I have abstract string property named EntityTableName that only has a getter.  All concrete DAO classes then have to implement this string property, and it will always contain the name of the table in the database that the class represents.  Obviously it's not always perfect, because a class could possibly come from more than 1 table, or a table, using discriminator values, might map to more than 1 class, but it is still useful.

The other thing I do is perhaps more useful.  The base DAO class has a private static int that I name m_internalNegativeIdCounter, which is initialized to -1, and a non-static private int named m_internalNegativeId.  These are in no way persisted to the database.  In the constructor for the DAO class, I init the m_internalNegativeId (non-static) to m_internalNegativeIdCounter (static), and then decrement the static var by 1.  I don't HAVE to use negative numbers, I just decided to so that no one would ever confuse THAT id with an actual database ID (since by convention we don't use negative IDs in the database where I work).  So now every single object, be it persisted, transient, attached, detached, etc, has a UNIQUE integer associated with it.  Unless I've overlooked something (thread safety?  I'm not doing any locks, maybe I should), this guarantee of a unique number, regardless of type, seems to be very useful.  It can be used for generating hash codes, equality comparisons of two non-transient objects, etc, to avoid having to compare field by field the values of two objects to determine if they are "equal".

Thoughts?

# July 14, 2008 8:05 PM

Geoff said:

I have to take considerable issue with your implementation, Billy, along the same lines as Rémy's earlier comments.

Two objects which return the same value from GetHashCode() are NOT NECESSARILY Equal. Statistically, the chances of a clash are low, but they are high enough that the operation of things like Hashtables has to perform the following double-check when determining if it already contains a particular object:

1) Call GetHashCode() on the object.

2) Look to see if there is a bucket to match this HashCode.

3) Check each item WITHIN that bucket using Equals to determine if it is the same as the object.

NB: It's not just good enough for the Hashtable to go "ah, I have a bucket with that HashCode, therefore I contain that object", because the object could be a completely different item, that by its very nature returns the same HashCode.

Test it out yourself - you'll find that the Hashtable calls Equals to double-check things every time before it returns a definite result.

At the very least I would recommend that you caution that while your method is functional, and there's a statistically low likelihood of a clash, it's not a complete impossibility.

I'd further suggest that your approach is acceptable in an nHibernate environment where one record must be compared with another, but it would be folly to use this method when working with potentially large collections of business objects.

What I *do* appreciate is that you've satisfied one of the most-often overlooked requirements of a good GetHashCode implementation (from the Object.GetHashCode() MSDN documentation):

"For the best performance, a hash function must generate a random distribution for all input."

Your use of a string satisfies this requirement nicely.

I'm afraid I can't work out, however, why you're not using a StringBuilder (and subsequently StringBuilder.Append()) to build the string on which you subsequently call GetHashCode().

And finally, I'd like to suggest the following alternative GetHashCode() function, which I haven't performance-tested, but which is frequently used within the .NET Framework itself (can be seen via reflection):

public override int GetHashCode() {
    return _field1.GetHashCode() ^ _field2.GetHashCode() ^ _field3.GetHashCode();
}

Thanks, and sorry I haven't come across your post before to raise my concerns!

-Geoff.

# July 31, 2008 11:45 AM

manitra said:

Hi,

My workaround for all those hash code problems is:

- use Id.GetHashCode

- *never* add a domain object to a collection before it has an ID.

A bit obtrusive, but it just works :)

@Geoff

- Using a StringBuilder for concatenating 4 or 5 short strings is a lot slower than the "+" operator.

- The final implementation of GetHashCode using the bitwise operator should just be inside an "unchecked" statement to avoid an overflow exception. That's what Resharper 4+ does.

# August 14, 2008 4:50 AM

Mike Walters said:

you SHOULD be doing hashcodes like this...

public override int GetHashCode()
{
    return _myObj1.GetHashCode() ^ _myObj2.GetHashCode() ^ _myObj3.GetHashCode();
}

# August 29, 2008 9:17 AM

Billy McCafferty said:

Thanks Mike and Geoff for the ^ concatenation idea.  I'll add that to S#arp Architecture.

# September 3, 2008 11:22 AM

Petar Petrov said:

Hi. I don't like the implementation of GetHashCode.

public override int GetHashCode() {
    return (GetType().FullName + "|" +
            CompanyName + "|" +
            ContactName).GetHashCode();
}

First GetType() is very costly, it uses reflection. Second every time it creates a new string. It will occupy a lot of memory. Mike Walters proposes a good hash function(method) but you must check for null any of the objects.

# September 19, 2008 8:03 AM
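
The suggestions in the last few comments (combine per-field hash codes, guard against null, wrap the arithmetic in unchecked) can be put together as follows. This is a sketch on a hypothetical Customer with two string fields; the multiply-and-add pattern and the constants 17 and 31 are conventional choices assumed here, not something taken from the post or from S#arp Architecture.

```csharp
using System;

public class Customer
{
    public string CompanyName;
    public string ContactName;

    public override int GetHashCode()
    {
        // unchecked lets the multiplication wrap around instead of throwing
        // when the code is compiled with overflow checking enabled.
        unchecked
        {
            int hash = 17;
            // A null field contributes 0 instead of causing a NullReferenceException.
            hash = hash * 31 + (CompanyName == null ? 0 : CompanyName.GetHashCode());
            hash = hash * 31 + (ContactName == null ? 0 : ContactName.GetHashCode());
            return hash;
        }
    }
}
```

Two details behind the design choice: XOR by itself cannot overflow, so the unchecked block matters for the multiplication rather than for a pure `^` chain; and a pure XOR chain is order-insensitive, so swapping the values of two fields gives the same hash, which the multiply-and-add form avoids.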

Bouwe said:

-------------------------- quote --------------------------------------

Note that Equals is sealed and cannot be overridden by a DomainObject implementation.  I suppose it could be unsealed, but since I put a lot of work into that method, I don't want anyone mucking it up!

-----------------------------------------------------------------------

When upgrading to NHibernate 2.0 I had to remove the "sealed" keyword from the Equals method to get my app working again; otherwise it complained that I had to make Equals virtual in all DomainObject subclasses...

# October 3, 2008 8:43 AM

Billy McCafferty said:

Thanks for bringing this up Bouwe...I've encountered the exact same behavior as you and had to remove the sealed keyword when using NHibernate 2.0.

# October 3, 2008 2:48 PM

Billy McCafferty said:

Great call Petar...this has been resolved to be much more efficient in the current release of S#arp Architecture.

# October 16, 2008 11:38 AM

Flannery Culp said:

There hasn't been enough talk about how equal hashcodes implies equal objects.

This is all very simple:

GetHashCode returns a random integer - 32 bits.

Birthday Paradox - time for collision is about square root the space size.

Put it together - at about 2^16 instances, we'll have collisions of GetHashCode, without objects being really equal.

2^16 = 65,536 (roughly 64K).

That is not a big number.

If you have a set of 64,000 customers, expect problems.

# November 1, 2008 10:19 AM
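
The estimate above can be made a little more precise with the standard birthday-bound approximation, assuming hash codes are spread uniformly over N = 2^32 values (an idealisation; real hash functions only approximate it). For n objects:

```latex
P(\text{at least one collision}) \approx 1 - e^{-n(n-1)/(2N)}, \qquad N = 2^{32}

n = 2^{16} = 65{,}536:\quad \frac{n(n-1)}{2N} \approx \frac{2^{32}}{2^{33}} = 0.5
\;\Rightarrow\; P \approx 1 - e^{-0.5} \approx 0.39

P = 0.5 \;\text{at}\; n \approx \sqrt{2N\ln 2} \approx 77{,}000
```

So at about 65,000 objects there is roughly a 39% chance that some pair shares a hash code, and the chance passes 50% at around 77,000 objects.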

Frederik Gheysels said:

Very interesting posts here , but ...

What does your GetHashcode method do when the value of one of your keyfields (business signature) changes ?  

I mean, according to MSDN, the hashcode of an object should be constant (never change) during the lifetime of the object ?

msdn.microsoft.com/.../system.object.gethashcode(VS.80).aspx

# November 2, 2008 9:52 AM

Billy McCafferty said:

Frederik,

Interestingly, in that same documentation, they provide an example - the Point class - wherein the hashcode changes if the underlying X/Y values change; therefore, the hashcode could very well change throughout the lifetime of the object.  Can you explain this inferred contradiction?  BTW, we're having some good discussion about this at groups.google.com/.../f76d1678e68e3ece

# November 3, 2008 8:47 AM

Billy McCafferty said:

This issue has finally been brought to a final resolution in the latest version of S#arp Architecture, version 0.9.  Thank you all for your tips, suggestions, and observations!!

# November 14, 2008 12:18 AM

rüya tabiri said:

Thank you..

# November 25, 2008 6:56 AM

Mike Walters said:

To help Frederik out, the initial base.GetHashCode() call calculates a hash based on the system clock. If two objects are instantiated even a fraction of a second apart (even if it's the same class and parameters), they will in fact return different hash codes, thus they are not equal. I suppose one could attempt to create two classes on different machines at the exact same moment and try a comparison with a shared-memory program, a topic for another discussion, although the odds are overwhelmingly against you getting them exact. More than likely this is what MSDN is referring to when they say "never changes"... a more or less relative meaning to the object if you (the programmer) don't change the GetHashCode method. Since the hash already determined its value at the "birth" of its object, it should, in fact, never change.

# December 21, 2008 1:27 AM
