Tuesday, October 26, 2010

Trying to dedupe records...and failing

While provider-neutral e-book records are a nice idea, they're a little hard to achieve in practice when you're dealing with vendor e-book record packages. Today will be Round 3 of me trying to figure out how to dedupe our records without having to go through each title one by one.

In theory, deduplication could be done at the record loading stage, using, for instance, ISBNs as a second match point. In practice, this probably wouldn't go well unless we decided to have print and electronic formats on one record: by matching on ISBN, we would end up matching our e-book records with our print records. There are probably other issues with this method that I haven't even thought of. I could do some testing, but I haven't really focused much on this method of record deduplication yet.
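To make the ISBN problem concrete, here is a rough Python sketch of what a load-stage match might look like. It is not how any particular ILS loader works; the record layout (plain dicts with hypothetical "id", "isbns", and "format" keys) and the function names are assumptions made up for illustration. It shows an ISBN-only match pulling in a print record, and the same match restricted to records of the same format.

```python
def normalize_isbn(raw):
    """Drop any parenthetical qualifier, then keep only digits and X."""
    number_part = raw.split("(")[0]
    return "".join(ch for ch in number_part.upper() if ch.isdigit() or ch == "X")


def isbn_set(record):
    """Return the normalized, non-empty ISBNs on a record (hypothetical layout)."""
    isbns = {normalize_isbn(raw) for raw in record["isbns"]}
    isbns.discard("")
    return isbns


def find_matches(incoming, catalog, same_format_only=False):
    """Return ids of catalog records sharing at least one ISBN with `incoming`."""
    incoming_isbns = isbn_set(incoming)
    matches = []
    for rec in catalog:
        if not incoming_isbns & isbn_set(rec):
            continue
        if same_format_only and rec["format"] != incoming["format"]:
            continue
        matches.append(rec["id"])
    return matches


# Made-up sample data: a print record, an existing e-book record,
# and an incoming vendor e-book record that carries the print ISBN.
catalog = [
    {"id": "p1", "isbns": ["0-13-110362-8"], "format": "print"},
    {"id": "e1", "isbns": ["0131103628", "9780131103627"], "format": "electronic"},
]
incoming = {"id": "v1", "isbns": ["0-13-110362-8 (print)"], "format": "electronic"},

incoming = incoming[0]
isbn_only = find_matches(incoming, catalog)
format_aware = find_matches(incoming, catalog, same_format_only=True)
```

With the sample data, the ISBN-only match hits both the print and the electronic record, while the format-aware match hits only the electronic one. The sketch does not convert between ISBN-10 and ISBN-13, so two records that share a title but list different forms of the ISBN would still be missed.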

Instead, I've mostly been looking at methods of deduplication using MarcEdit. The obvious method, using MarcEdit's deduplication tool and trying to dedupe on ISBNs, has so far failed. I'm either using the tool wrong, or it's not working the way it should. The first day I started experimenting, I remember having some success by matching on main title information, so I think I might try that again today. Unfortunately, that would result in multiple editions of one title being flagged as dupes. Still, if it also lists the actual dupes, it would be better than nothing: instead of having to search hundreds of titles, maybe we'd only have to search a few dozen. Or so I hope...
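For what it's worth, the title-matching idea can be sketched outside MarcEdit too. The Python below is only an illustration under assumptions: records are plain dicts with hypothetical "id", "title", and "edition" keys, and the normalization rules (lowercase, strip punctuation, drop a leading English article) are my own guess at what "matching on main title" would need. It groups records by normalized title and keeps only the groups with more than one record, so a person can review those instead of the whole file.

```python
import re
from collections import defaultdict

LEADING_ARTICLES = ("the ", "a ", "an ")


def title_key(title):
    """Build a crude match key: lowercase, no punctuation, no leading article."""
    key = re.sub(r"[^\w\s]", " ", title.lower())
    key = " ".join(key.split())  # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key


def candidate_groups(records):
    """Group records by title key and return only groups with 2+ records."""
    groups = defaultdict(list)
    for rec in records:
        groups[title_key(rec["title"])].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Made-up sample data: two true dupes, one different edition, one unrelated title.
records = [
    {"id": "v1", "title": "The Art of Cataloging /", "edition": "2nd ed."},
    {"id": "v2", "title": "Art of cataloging", "edition": "2nd ed."},
    {"id": "v3", "title": "The art of cataloging.", "edition": "3rd ed."},
    {"id": "v4", "title": "Metadata Basics", "edition": ""},
]

groups = candidate_groups(records)
for key, recs in groups.items():
    # Print the edition next to each id so different editions are easy to spot.
    print(key, [(rec["id"], rec["edition"]) for rec in recs])
```

As expected, the third edition lands in the same group as the two second-edition records. Listing the edition statement beside each id at least makes those false positives quick to rule out by eye.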

2 comments:

  1. Don't you just love a good puzzle?

  2. The strain this one puts on my brain is almost Rubik's Cube level. But I do think (cross my fingers) I stand a chance at solving it.
