I did 6 or 7 more deduplication tests using MARCEdit, with no real success, though some minor progress. On the plus side, I can produce an overzealous list of duplicate records that includes true duplicates along with a few that only look like duplicates (for example, same title, but one is a newer edition than the other). That at least gives us a list to work from, I suppose, although matching on ISBN would give a more accurate and probably more complete list.
In case you're wondering (I know this and my last post are somewhat technical), duplicate records are records for basically the exact same title - same publisher, published in the same year, and so on. When we get e-book record files from vendors, we sometimes get records for the same title from multiple vendors. Some of the vendors have records with OCLC numbers in them, some don't, and sometimes they have OCLC numbers but not the same ones another vendor used (yes, OCLC has duplicate records, lots and lots of them). When we load them, we end up with multiple records for basically the same thing. Ideally, we'd like an e-book that is available from multiple vendors to be accessible on one record.
That's where record deduplication comes in. Right now, we could do our deduping by searching each and every e-book title in the catalog and clearing up duplicates as we come across them. This is not a good idea - we have tens of thousands of e-books, and the number will only grow. The tests I've been doing are part of an attempt to automate deduplication, or at least come up with a list of potential duplicates so that we could avoid having to search every single title in our e-book collection.
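To give a rough idea of what that automated candidate list might look like, here's a minimal sketch (not our actual MARCEdit workflow): group records on a normalized key and flag any key shared by more than one record. The record structure, field names, and sample data below are all made up for illustration - it just shows why matching on ISBN is tighter than matching on title.

```python
from collections import defaultdict

def candidate_duplicates(records, key_field):
    """Group records by a key field and return the groups that contain
    more than one record -- an 'overzealous' list of potential duplicates."""
    groups = defaultdict(list)
    for rec in records:
        key = rec.get(key_field)
        if key:
            # Normalize lightly so trivial differences don't hide matches
            groups[key.strip().lower()].append(rec)
    return [grp for grp in groups.values() if len(grp) > 1]

# Hypothetical records for the same title from two vendors,
# plus a newer edition (placeholder ISBNs, not real ones)
records = [
    {"title": "Data Mining", "year": "2011", "isbn": "9780000000011", "vendor": "A"},
    {"title": "Data Mining", "year": "2011", "isbn": "9780000000011", "vendor": "B"},
    {"title": "Data Mining", "year": "2019", "isbn": "9780000000028", "vendor": "A"},
]

# Title matching flags all three records (the 2019 edition is a false
# positive); ISBN matching flags only the true duplicate pair.
by_title = candidate_duplicates(records, "title")
by_isbn = candidate_duplicates(records, "isbn")
```

In practice the messy part is the normalization - vendor records vary in punctuation, capitalization, and which fields they bother to fill in - but the grouping idea is the same.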
I think I'm going to start reading articles on record deduplication. I probably should have done this earlier - if I immediately find something that could help us, I'm going to kick myself.