Tuesday, October 26, 2010

More deduping, plus an explanation of why it is necessary

I did 6 or 7 more deduplication tests using MARCEdit, with no real success, though some minor progress. On the plus side, I can produce an overzealous list of duplicate records that includes true duplicates along with a few that only look like duplicates (for example, same title, but one is a newer edition than the other). That at least gives us a list to work from, I suppose, although matching on ISBN would give a more accurate and probably more complete list.

In case you're wondering (I know this and my last post are somewhat technical), duplicate records are records for what is essentially the same title: same publisher, same publication year, and so on. When we get e-book record files from vendors, we sometimes get records for the same title from multiple vendors. Some vendors' records have OCLC numbers in them, some don't, and sometimes they have OCLC numbers but not the same ones another vendor used (yes, OCLC has duplicate records, lots and lots of them). When we load them, we end up with multiple records for basically the same thing. Ideally, an e-book that is available from multiple vendors would be accessible on a single record.

That's where record deduplication comes in. Right now, we could do our deduping by searching each and every e-book title in the catalog and clearing up duplicates as we come across them. This is not a good idea - we have tens of thousands of e-books, and the number will only grow. The tests I've been doing are part of an attempt to automate deduplication, or at least come up with a list of potential duplicates so that we could avoid having to search every single title in our e-book collection.
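
To make that concrete, here's a minimal sketch of the kind of candidate list I'm after, written with the pymarc library rather than MARCEdit. The file name is made up, and I'm assuming the ISBNs live in 020 $a with qualifiers like "(pbk.)" that need stripping; any normalized ISBN shared by more than one record gets flagged for a human to check.

import re
from collections import defaultdict

from pymarc import MARCReader

def normalize_isbn(raw):
    """Drop qualifiers like '(pbk.)', then strip everything but digits and X."""
    return re.sub(r'[^0-9Xx]', '', raw.split('(')[0]).upper()

seen = defaultdict(list)  # normalized ISBN -> titles that carry it

with open('ebook_records.mrc', 'rb') as fh:  # hypothetical file name
    for record in MARCReader(fh):
        if record is None:  # pymarc yields None for unreadable records
            continue
        title_field = record['245']
        subfields = title_field.get_subfields('a') if title_field else []
        title = subfields[0] if subfields else '[no title]'
        for field in record.get_fields('020'):
            for raw in field.get_subfields('a'):
                isbn = normalize_isbn(raw)
                if isbn:
                    seen[isbn].append(title)

# Any ISBN attached to more than one record is a potential duplicate.
for isbn, titles in seen.items():
    if len(titles) > 1:
        print(isbn, '=>', '; '.join(titles))

Whatever tool ends up doing the work, the idea is the same: group records on a normalized key, then only hand-check the groups with more than one member.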

I think I'm going to start reading articles on record deduplication. I probably should have done this earlier - if I find something right away that could help us, I'm going to kick myself.

Trying to dedupe records...and failing

While provider-neutral e-book records are a nice idea, they're a little hard to pull off in practice when you're dealing with vendor e-book record packages. Today will be Round 3 of me trying to figure out how to dedupe our records without having to go through each title one by one.

In theory, deduplication could be done at the record loading stage, using, for instance, ISBNs as a second match point. In practice, this probably wouldn't go well, unless we decided to have print and electronic formats on one record - by matching on ISBN, we would end up matching our e-book records with our print records. There are probably other issues with this method that I haven't even thought of. I could do some testing, but I haven't really focused much on this method of record deduplication yet.
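
That said, if we ever do test it, one precaution would be to check whether a record describes an online resource at all before letting it match on ISBN. Here's a hypothetical sketch of that check for a pymarc record - not a tested loader setting, and it assumes book-type records, where 008 position 23 is "form of item":

def looks_electronic(record):
    """Rough test: does this pymarc record describe an online resource?"""
    fixed = record['008']
    # For book-type records, 008/23 is "form of item":
    # 'o' = online, 's' = electronic.
    if fixed is not None and len(fixed.data) > 23 and fixed.data[23] in ('o', 's'):
        return True
    # An 856 field with second indicator 0 (a link to the resource
    # itself) is another common, though imperfect, signal.
    return any(f.indicators[1] == '0' for f in record.get_fields('856'))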

Instead, I've mostly been looking at methods of deduplication using MARCEdit. The obvious method, using MARCEdit's deduplication tool and trying to dedupe on ISBNs, has so far failed. I'm either using the tool wrong, or it's not working the way it should. The first day I started experimenting, I remember having some success by matching on main title information. I think I might try that again today. Unfortunately, that would result in multiple editions of one title being considered dupes. If it also lists actual dupes, it would still be better than nothing. Instead of having to search hundreds of titles, maybe we'd only have to search a few dozen. Or so I hope...
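
Since I keep coming back to title matching, here's roughly what that approach looks like as a pymarc sketch (again, not the MARCEdit tool itself, and the file name is invented). It's deliberately crude, which is exactly why its output is a list of candidates to check by hand rather than confirmed dupes:

from collections import defaultdict

from pymarc import MARCReader

def normalize_title(title):
    """Lowercase and drop punctuation/spacing so trivial differences
    don't hide a match."""
    return ''.join(ch for ch in title.lower() if ch.isalnum())

groups = defaultdict(list)

with open('ebook_records.mrc', 'rb') as fh:  # hypothetical file name
    for record in MARCReader(fh):
        if record is None:
            continue
        field = record['245']
        if field and field.get_subfields('a'):
            groups[normalize_title(field.get_subfields('a')[0])].append(record)

# Anything grouped more than once is a candidate - including, inevitably,
# different editions of the same title.
candidates = {key: recs for key, recs in groups.items() if len(recs) > 1}
print(len(candidates), 'titles appear on more than one record')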

Friday, October 8, 2010

Vacation, catalog maintenance

Wow, it's been almost a month and a half since my last post. My vacation had a little to do with that, but the rest was just...pre-vacation near-burnout, maybe?

My vacation went great. It took me a while to get comfortable with my niece, since I've never really been around babies before, but now I find I feel sad that I won't get to see her very often. At the very least, everyone in her family but her mom and dad is going to miss out on her first birthday - so sad!

Being back at work feels a little weird, but that'll wear off. With SCUUG only a week away, I've been reminding myself how to use MARCEdit for catalog maintenance by working on a project I started looking into right before my vacation. An unknown number of name headings in our catalog are messed up, with subfield d coming before subfield q, rather than after. I had been ignoring this problem, but now it's starting to interfere pretty significantly with my batch authority searching and loading process.

An example of the problem:
Babcock, C. J. $d 1894- $q (Clarence Joseph),

Should be:
Babcock, C. J. $q (Clarence Joseph), $d 1894-

In the past, I occasionally fixed these by hand as I came across them. However, this is annoying, and also bad for my wrist. Global editing is a good thing, and this looked like something that should be fixable globally. I just wasn't sure how.

It turns out it's possible with MARCEdit, and I figured out how to do it all on my own. Woohoo! I'm planning on running the fix for all the oldest records in our catalog (nearly 200,000 I think) over the course of a few weeks. That should take care of most, if not all, of the problem, and then I can get back to batch searching and loading authority records.
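
For anyone curious, the fix boils down to a regular-expression swap. This isn't my exact MARCEdit recipe, but here's the same idea expressed in Python against MARCEdit's mnemonic (.mrk) text format. It only covers 100 and 700 fields, so other heading fields would need the pattern widened:

import re

# Matches a 100 or 700 heading line where $d comes right before $q.
PATTERN = re.compile(r'^(=[17]00.*?)(\$d[^$]*)(\$q[^$]*)', re.MULTILINE)

def fix_heading_order(mrk_text):
    """Swap the $d and $q subfields back into $q-then-$d order."""
    return PATTERN.sub(r'\1\3\2', mrk_text)

line = r'=100  1\$aBabcock, C. J.$d1894-$q(Clarence Joseph),'
print(fix_heading_order(line))
# => =100  1\$aBabcock, C. J.$q(Clarence Joseph),$d1894-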

While playing with all of that, I also learned the first few steps of a new tool in MARCEdit that lets you extract certain records from a larger file, edit the smaller file of records, and (in theory) re-insert the edited records back into the larger file. This will be great for all kinds of projects, once I figure out how to get the reinsertion part to work.
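
I don't have the reinsertion step working in MARCEdit yet, but the underlying idea is simple enough to sketch. Here's a hypothetical pymarc version that matches edited records back into the big file on the 001 control number - which assumes every record has a unique 001 that the edits didn't touch:

from pymarc import MARCReader, MARCWriter

def reinsert(big_file, edited_file, out_file):
    """Write big_file to out_file, swapping in the edited version of any
    record whose 001 appears in edited_file."""
    edited = {}
    with open(edited_file, 'rb') as fh:
        for record in MARCReader(fh):
            if record is not None and record['001'] is not None:
                edited[record['001'].data] = record

    with open(big_file, 'rb') as fh, open(out_file, 'wb') as out:
        writer = MARCWriter(out)
        for record in MARCReader(fh):
            if record is None:
                continue
            key = record['001'].data if record['001'] else None
            writer.write(edited.get(key, record))
        writer.close()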