Devlico.Us
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @devlicious

ViNull, Off the Record

Distilled ramblings of Michael C. Neel delivered right to your browser.


Finding the difference between two Arrays, or un-LINQ-ing your code

imageLINQ is great, up to the point when it's not.  Then it's really not great at all.  When we trade simplicity of syntax for performance, we have to keep in mind there may come a point when we have to go back the code and factor out the shiny new toys.

My system is complex, but this task is simple.  I have about 95,000 video files to keep track of in a database.  Try as we might, at some point someone is going to make changes, add or remove files, without going through our software, so a periodic file audit is required.

I have two lists of file names, one is the list as it exists in the database, the other as it exists on disk.  This is greatly simplified, but the initial idea was something like this:

List<String> toAdd = (from f in existingFiles
                      where !indexedFiles.Contains(f)
                      select f).ToList();

List<String> toRemove = (from f in indexedFiles
                         where !existingFiles.Contains(f)
                         select f).ToList();

The power of LINQ, short and sweet.  Sadly, it appears to act like this under the hood:

List<String> toAdd = new List<String>();
foreach (String f in existingFiles)
    if (!indexedFiles.Contains(f))
        toAdd.Add(f);

List<String> toRemove = new List<String>();
foreach (String f in indexedFiles)
    if (!existingFiles.Contains(f))
        toRemove.Add(f);

These statements pegged the CPU for minutes at a time - can you see why?  How many times am I looping through each collection of strings?  Twice? Four times?  No, I'm afraid this is classic "Oh 2 Da N" performance here.  The problems lies in using the Contains() method - this method must search through the collection each time it's called, which is often.  To solve this, we'll have to kick it old school C-style:

indexedFiles.Sort();
existingFiles.Sort();

for (int iE = 0, iI = 0; iE < existingFiles.Count || iI < indexedFiles.Count; ) {
    if (iE >= existingFiles.Count) {
        toRemove.Add(indexedFiles[iI]);
        iI++;
    }
    else if (iI >= indexedFiles.Count) {
        toAdd.Add(existingFiles[iE]);
        iE++;
    }
    else {
        Int32 diff = existingFiles[iE].CompareTo(indexedFiles[iI]);
        if (diff == 0) {
            iE++;
            iI++;
        }
        else if (diff > 0) {
            toRemove.Add(indexedFiles[iI]);
            iI++;
        }
        else if (diff < 0) {
            toAdd.Add(existingFiles[iE]);
            iE++;
        }
    }
}

A quick sort of the lists and we can make some time saving assumptions as well as walk through both lists at the same time (.Net had no trouble sorting lists of 95,000 strings in record breaking time).  We start off comparing the strings at index 0 of each list (the first if and else if will be false - I'll come back to those shortly).  If they match, then the file both exists and is indexed - go to the next file in both lists.  If the existing string is greater than the index string, then we know the file no longer exists (remember, lists are sorted).  In this case only advance the index list to see if it "catches up" with the existing list.  If the existing string is less than the index string, the reverse is done.  If we run out of existing list before we finish the index list, those files no longer exist - which is what the first if statement handles (the first else if handles the other case for running out of index list first).

If you have to work through this for loop on scratch paper, don't feel bad - I did.  It's been a very long time since I've had to get this "raw" with my code.  It was worth it as the performance went from several minutes at 100% CPU to so fast I couldn't believe it had run.

I googled a bit to see if there were other solutions out there, and more importantly some method in the .Net framework that would solve this issue and came up empty.  I'm not convinced there isn't something in the bazillion methods living in the framework, so if you know of something, please let me know!



Comments

Carlos Beppler said:

Try to use System.Collections.Generic.HashSet<T> instead of lists.

To get an HashSet you can change the fist code to:

HashSet<String> toAdd = new HashSet<string>(

   from f in existingFiles

   where !indexedFiles.Contains(f)

   select f).AsEnumerable() );

# October 3, 2008 7:36 PM

Kevin H said:

There's always the .Except() method, which does what you want, and seems to be a little faster even than your mergesort (i'm guessing because it *is* a mergesort written unmanaged)

Adding/removing 1000 out of

Linq w/ contains: 17734.375ms

Except: 15.625ms

MergeSort: 46.875ms

source code follows:

   class Program

   {

       const int NumberOfSamples = 50000;

       static void Main(string[] args)

       {

           Random r = new Random();

           List<int> oldNumbers = new List<int>();

           List<int> newNumbers = new List<int>();

           for (int i = 0; i < NumberOfSamples; i++)

               oldNumbers.Add(r.Next());

           newNumbers.AddRange(oldNumbers);

           for (int i = 0; i < 1000; i++)

               newNumbers.RemoveAt(r.Next(newNumbers.Count));

           for (int i = 0; i < 1000; i++)

               newNumbers.Add(r.Next());

           DateTime start;

           GC.Collect();

           start = DateTime.Now;

           ContainsMethod(oldNumbers, newNumbers);            

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           GC.Collect();

           start = DateTime.Now;

           ExceptMethod(oldNumbers, newNumbers);

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           GC.Collect();

           start = DateTime.Now;

           MergeSortMethod(oldNumbers, newNumbers);

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           Console.ReadKey();

       }

       static void ExceptMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = oldNumbers.Except(newNumbers).ToList();

           List<int> added = newNumbers.Except(oldNumbers).ToList();

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

       static void ContainsMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = (from f in oldNumbers

                                where !newNumbers.Contains(f)

                                select f).ToList();

           List<int> added = (from f in newNumbers

                              where !oldNumbers.Contains(f)

                              select f).ToList();

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

       static void MergeSortMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = new List<int>();

           List<int> added = new List<int>();

           newNumbers.Sort();

           oldNumbers.Sort();

           for (int iE = 0, iI = 0; iE < oldNumbers.Count || iI < newNumbers.Count; ) {

               if (iE >= oldNumbers.Count) {

                   removed.Add(newNumbers[iI]);

                   iI++;

               }

               else if (iI >= newNumbers.Count) {

                   added.Add(oldNumbers[iE]);

                   iE++;

               }

               else {

                   Int32 diff = oldNumbers[iE].CompareTo(newNumbers[iI]);

                   if (diff == 0) {

                       iE++;

                       iI++;

                   }

                   else if (diff > 0) {

                       removed.Add(newNumbers[iI]);

                       iI++;

                   }

                   else if (diff < 0) {

                       added.Add(oldNumbers[iE]);

                       iE++;

                   }

               }

           }

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

   }

# October 3, 2008 7:55 PM

smaclell said:

Since what you are doing appears to be filtering each list based on where they overlap you might want to check out the Intersects and Except extension methods.

rsanidad.wordpress.com/.../linq-except-and-intersect

Good luck.

# October 3, 2008 7:59 PM

charles said:

I have to try this one.

# October 4, 2008 5:54 AM

Dew Drop - October 4, 2008 | Alvin Ashcraft's Morning Dew said:

Pingback from  Dew Drop - October 4, 2008 | Alvin Ashcraft's Morning Dew

# October 4, 2008 10:40 AM

Bill said:

The Except extension method is what you are looking for (though your solution is almost twice as fast because using the extension method you would be sorting the lists twice, contrary to the incorrect results suggested by Keven). My results on the comparison methods above are:

ExceptMethod: 31.25ms

MergeSortMethod: 15.625ms

though I changed the instantiation for loops:

       static void Main(string[] args) {

           Random r = new Random(1);

           List<int> oldNumbers = Enumerable.Range(1, NumberOfSamples).Select(x => r.Next(NumberOfSamples*2)).Distinct().ToList();

           List<int> newNumbers = Enumerable.Range(1, NumberOfSamples).Select(x => r.Next(NumberOfSamples * 2)).Distinct().ToList();

A slight improvement on your code would be (a couple of saved instructions per loop, we are talking microseconds of improvement here):

var removed = new List<int>();

var added = new List<int>();

newNumbers.Sort();

oldNumbers.Sort();

int iE = 0, iI = 0, oldCount = oldNumbers.Count, newCount = newNumbers.Count;

while (iE < oldCount && iI < newCount) {

   var diff = oldNumbers[iE].CompareTo(newNumbers[iI]);

   if (diff == 0) {

       iE++;

       iI++;

   } else if (diff > 0) {

       removed.Add(newNumbers[iI]);

       iI++;

   } else {

       added.Add(oldNumbers[iE]);

       iE++;

   }

}

if (iE >= oldCount) {

   while (iI < newCount) {

       removed.Add(newNumbers[iI]);

       iI++;

   }

} else if (iI >= newCount) {

   while (iE < oldCount) {

       added.Add(oldNumbers[iE]);

       iE++;

   }

}

# October 4, 2008 5:20 PM

Kevin H said:

Well, the DateTime timing isn't very good (15ms resolution, plus the GC might run, etc). I've re-timed it a few times, and it seems that your method is slightly better than the method in the post, is slightly better than Except. However, they are all orders of magnitude faster than the LINQ .contains version, and 500k entries take less than a couple hundred ms, which is significantly less time than it would take to actually load that amount of data from the file system or db.

# October 7, 2008 8:51 PM

Leave a Comment

(required)  
(optional)
(required)  

Enter the numbers above:
Add

About Michael C. Neel

Michael C. Neel is a Digital Media Developer with Jewelry Television and independent consultant with ViNull Software. He is also a board member and Vice President of the East Tennessee .Net Users Group (ETNUG) in his home town of Knoxville, TN. Michael has been published in asp.netPro magazine and continues to publish .NET focused articles on his blog (ViNull.com) and at Devlicio.us with other community developers. A regular speaker at .Net conferences and user groups, Michael travels to most of the states surrounding Tennessee. Check out Devlicio.us!

Our Sponsors

Red-Gate!