Posted: October 18th, 2012 by fcs
- Backend DB stats script updated to give improved stats.
- AD feed update to handle displayName better. When will it go to production??? Today!
- Privacy flags… probably a great time to re-write and install where Account Services can take it over!
- root filesystem space on the masters is getting crowded. Need to expand or split. Consensus is to split out the /home space onto its own file system.
- The tape that was eaten by the failed drive was successfully offloaded and reused, but the next tape in the series of tapes used to hold it was short written. DOH! For the Record – tape drives are as evil as printers!
- re-sharding of the email servers is ready to be installed tomorrow after the backup completes.
- Test servers’ 45-day license expired. Thankfully, I don’t need them to test anything right now.
- Writing the recipe for the backup maintenance day (Monday).
- OS upgrades
- DataDomain: DDOS upgrade (scary upgrade)
- DataDomain: 10GbE performance testing (if I can work it out with DD support)
- Develop graph/measurement to trend disk needs for AFTD. Make suggestions to clients about how disk would help them.
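A rough way to start on the AFTD trending item above: fit a straight line to periodic usage samples and extrapolate to the day the disk fills. This is only a sketch. The sample numbers, the capacity, and the function names are made up for illustration, and assuming that usage grows linearly is exactly that, an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days after the last sample until the fitted line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Hypothetical weekly samples: day number and TB used on the disk device.
sample_days = [0, 7, 14, 21, 28]
sample_used = [10.0, 10.5, 11.0, 11.5, 12.0]
print(days_until_full(sample_days, sample_used, capacity_tb=20.0))
```

The same fit run per client would give a number to put in front of them when suggesting more disk.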
Posted: October 17th, 2012 by fcs
- PeopleSoft upgrade will prevent the input of invalid SSNs (so much for our special 9* entries to flag special kinds of employees). I have to work out the code changes to get around this.
- One privacy flag to set.
- New mdb developments to check out. Nice to have a developer’s ear.
- The data from yesterday’s stuck tape has been safely moved to another tape and the tape is being tested via IBM’s itdt full-write test in a known good drive.
- The replacement drive for the one that ate the tape yesterday has arrived and been installed.
- sharding for the email servers is ready to be put in place on Friday.
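To see why the special 9* entries mentioned above would trip a validity check, here is a small sketch based on the published Social Security Administration numbering rules. What the PeopleSoft upgrade actually checks is not known here, so treat the rule set as an assumption. The function names and the "starts with 9" placeholder convention as coded are illustrative only.

```python
import re

SSN_PATTERN = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def is_issuable_ssn(ssn):
    """True if the number could have been issued by the SSA.

    Area numbers 000, 666, and 900-999 are never assigned, nor are
    group 00 or serial 0000.
    """
    match = SSN_PATTERN.match(ssn)
    if not match:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def is_placeholder(ssn):
    """Hypothetical local convention: well-formed, but starting with 9."""
    return bool(SSN_PATTERN.match(ssn)) and ssn.startswith("9")


for candidate in ("123-45-6789", "912-34-5678"):
    print(candidate, is_issuable_ssn(candidate), is_placeholder(candidate))
```

Any validator that follows these rules rejects every placeholder, so the update scripts would need some other field to carry the flag.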
Posted: October 16th, 2012 by fcs
- Still working on moving the database that holds the backend from Oracle’s Berkeley DB (the old Sleepycat DB, or BDB) to Howard Chu’s MDB (Memory-Mapped Database) that he has developed in the past year. It is very fast, requires zero configuration and just bloody works!
- Still getting the RHEL6 replica image worked out, almost there. Just a few more bits and bobs to tweak before I declare it acceptably ready and roll it into production. Aiming for Thanksgiving to do that.
- Watching the newest update to the eduPerson objectClass. It has a couple of new attributes (eduPersonPrincipalNamePrior and eduPersonPrincipalNameTimestamp) that will be of good use to us (with some development work) for remembering the former NetIDs that have been assigned to people, so we don’t reuse them without serious consideration.
- Working on (finally) getting the Employee Privacy Flag (EPF) routines moved to and working on the account management system.
- Dealing with the PeopleSoft upgrade that is coming soon; it appears that it is going to force a large change in the fake SSNs we have been using. Still working to gather the facts and figure out what I need to change in the update scripts.
- I have also become aware of the CIFER project and I need to read information about that to determine if I should advocate for migrating to that platform to make it possible to replace myself at some point.
- Still working on the vault space issue. We are attacking this on multiple fronts.
- We are investigating a large reduction in the amount of data that we keep for the full seven years, likely the largest space consumers being reduced to two years.
- We are changing the way we dispose of the seven year old tapes. Instead of removing them once a year, we will remove them every month. This maximizes the reclaiming of vault space and reduces the burden on me to fit all of them into the destruction bins in July.
- Still working on the pricing and feasibility of a dedicated backup system to handle the VMware ESXi clusters. The two current candidates are AVAMAR and VEEAM. We looked at VADP as provided by NetWorker and found it to be lacking critical features. The biggest problem being the inability to do a disaster recovery (or bare metal restore) from incremental backups that use Changed Block Tracking (CBT). Disk is cheap, but it is not cheap enough to do a daily full backup of our entire VMware infrastructure.
- Our Data Domain’s OS (DDOS) is about to fall off support, so I am working on the upgrade of that to the current version. DD support is not answering questions as rapidly as I had hoped. I may not be able to do the work on October 22 as I had hoped.
- October 22 is the next Backup Maintenance Day. I’m still pulling together the list of work to be done this month.
- And just for fun! One of the tape drives decided to do a failure mode of “stuck tape”. So much for the morning. However, the tape was finally released after power cycling the drive and waving the rubber chicken around the office and over the keyboard a few times. Visual inspection of the tape says it is either ok, or there is hidden damage buried in the supply spool of tape. Work is progressing to move the data to another tape.
- Upgraded the MacBook Pro to OSX 10.8 (Mountain Lion).
- As a result of that, I have found that Thunderbird is not reliable for me: people are not receiving my emails (or are at least claiming that they did not!), and some messages that our own systems tell me were delivered into my INBOX were lost. Therefore, I have stopped using Thunderbird. I tried Apple’s Mail.app, but was unable to find a reliable way to get it to work with gpg to sign and/or encrypt messages that I send out. Therefore, believe it or not, I have returned to pine. I am actually using the re-alpine 2.0.2 code from sourceforge. I found directions HERE that didn’t work for building, but did work to set up the .mailcap and SSL certificates. I found directions HERE on configuring to use gpg with re-alpine. [EDIT: October 17 -- Yeah, this is not the modern way of doing pgp encryption, and it doesn't support the modern MIME/multi-part methods]
- Also, since PGP Whole Disk Encryption is not ready for Mountain Lion 10.8.2, I re-encrypted the system disk using FileVault2.
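The move to monthly disposal of seven-year-old tapes described above boils down to a date comparison run once a month. A minimal sketch, assuming a simple mapping of tape label to the date it was written; the labels, dates, and function names are invented for illustration and have nothing to do with how NetWorker tracks media.

```python
from datetime import date

RETENTION_YEARS = 7  # retention period mentioned in the post


def expiry_date(written):
    """Date on which a tape written on `written` leaves retention."""
    try:
        return written.replace(year=written.year + RETENTION_YEARS)
    except ValueError:
        # Feb 29 with no leap day in the target year: use Mar 1.
        return date(written.year + RETENTION_YEARS, 3, 1)


def tapes_to_pull(tapes, today):
    """Labels of tapes whose retention has ended as of `today`, oldest first."""
    due = [(expiry_date(written), label)
           for label, written in tapes.items()
           if expiry_date(written) <= today]
    return [label for _, label in sorted(due)]


# Hypothetical inventory: tape label -> date written.
inventory = {
    "A00001": date(2005, 9, 30),
    "A00002": date(2005, 10, 14),
    "A00003": date(2005, 11, 2),
}
print(tapes_to_pull(inventory, today=date(2012, 11, 1)))
```

Run monthly, each pass returns roughly a twelfth of what the once-a-year pull in July would.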
Posted: October 2nd, 2012 by fcs
Oh my, it has been two weeks since I last updated this. That’s never good. It means I’ve been running around like a beheaded chicken!
Backup things, LDAP things, meetings and dealing with vendors. Just not a lot of FUN!
The newest is that I’ve got Mountain Lion to upgrade on my MacBook Pro. Of course, to be sure of a flawless Time Machine based upgrade, I have to decrypt the drive in the thing first. Decrypting the drive took all day yesterday. And now, Time Machine won’t finish backing up the disk. I’m guessing that will take all of today. In the past hour, it has managed to back up 2.6MB out of the 388.1MB it needs to. Yeah, I Love Apple. Scratch That … I Love TECHNOLOGY. It always breaks when you really can’t afford to have it do so.
- Clones are digging into the new tapes and of course, first time through they get written a short amount. Of course, the backups wriggle their way across 14 bazillion tapes, so relabeling that one tape wipes out half a dozen others and usually still leaves a tape with almost zilch for data on it going to the vaults. I hate this. I really need to develop a procedure to do that first scraping of the manufacturing dust off new tapes.
- Moving the clone tapes that have been marked full out of the jukebox into the vault system.
- Starting to dig into (with DataDomain) why the 10GbE connection is not performing at expected levels.
- Got a couple of Former Students to manually add.
Posted: September 18th, 2012 by fcs
- Update kerberos server definition so it should take a full tonight
- Manually force a full on skink since it was upgraded from RHEL4 to RHEL6 yesterday
- Storage Array Maintenance
- Replace a failed drive
- Force Fail a failing drive
- Waiting for the rebuild to finish
- Vendor teleconference about expensive options that might work
- Fix the errors from last night’s update (two former students)
- Privacy Flags
Posted: September 18th, 2012 by fcs
- This was backup maintenance day.
- OS upgrades
- Disk maintenance (fsck’s)
- The completely unwanted manual editing of a configuration database because the Linux startup processes are not consistent in the order they find attached devices (aka the tape library) anymore and it changed its SCSI address again.
- Tapes transported from the fire safe to the vaults
- Expired tapes returned from the vaults and made available for re-use
- Weekly cloning started.
Posted: September 14th, 2012 by fcs
- URI formatting questions answered
- No privacy flags! A minor miracle!
- A few update matching failures that required some digging
- More work on a sanitize script to obfuscate enough data to make the LDAP database safe to release to the vendor.
- Do the EMC SR dance – where the heck is 18.104.22.168 (aka NW 8.0 Build 172)? Oh, not quite released yet. When will it be released? Uhm… Maybe by the 18th?
- Chasing Nexsan to figure out where the two drives they said they were sending me have wound up. Update: Nexsan is having trouble obtaining drives to replace the failed 750GB ones with – and they have seen fit to leave me in the dark about this issue. That does not make me a happy customer.
- Update the sharding for nfs3.
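For the sanitize script mentioned above, one common approach is to replace sensitive attribute values with a keyed hash, so that equal values still match each other but the originals are not exposed. The sketch below is not the actual script. The attribute list, the function names, and the choice of HMAC-SHA256 are all assumptions, and it leaves DNs untouched, which a real run would also have to handle.

```python
import hashlib
import hmac

# Hypothetical list of attributes to obfuscate; the real script's list is not given.
SENSITIVE = {"cn", "sn", "givenname", "mail", "telephonenumber"}


def pseudonym(value, key):
    """Keyed hash so equal inputs map to equal outputs without exposing them."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def sanitize_ldif(text, key):
    """Replace values of sensitive attributes in 'attr: value' lines.

    Folded lines are unfolded first. Base64 ('attr:: ...') values are hashed
    as-is. DN lines are passed through unchanged.
    """
    unfolded = text.replace("\n ", "")  # LDIF continuation lines start with one space
    out = []
    for line in unfolded.splitlines():
        attr, sep, value = line.partition(":")
        if sep and attr.lower() in SENSITIVE:
            out.append(f"{attr}: {pseudonym(value.lstrip(': '), key)}")
        else:
            out.append(line)
    return "\n".join(out)


# Made-up entry for illustration.
sample = (
    "dn: uid=x1,ou=people,dc=example,dc=edu\n"
    "uid: x1\n"
    "cn: Alice Example\n"
    "mail: alice@example.edu\n"
)
print(sanitize_ldif(sample, key=b"keep-this-secret"))
```

Keeping the key private matters: without it the vendor cannot rebuild the mapping by hashing guessed names.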
Posted: September 11th, 2012 by fcs
Today is the 11th anniversary of the terrorist attack that collapsed the World Trade Center towers in New York City and killed nearly 3,000 people. Let’s have a moment of silence.
Thank you, now on with our daily report…
- Privacy flags, flags and even more flags
- HYC wants to work with me to find the source of the MDB error. Trying to get schedules synchronized to make that work.
- ldap11 (test replica) exhibited the same failure following Sunday night’s update, so I reverted that one back to the HDB backend yesterday (yeah, I didn’t blog about my work day yesterday, tough!)
- Discovered that when cc is called with both -O2 and -O0 flags, it seems to like the -O2 better (sigh). Edited the Makefile to force -O0, rebuilt, and retested; it still breaks. HYC is now working on a patch.
- Opened ITS#7385 for this issue.
- sharepoint2 and sharepoint3 failed last night. They just up and refused to talk with the backup server. Inquired of the sysadmins if there was a problem.
- Turned in quotes for 300 LTO4 media yesterday. Reduced the usual order by half while we consider the possibility of upgrading the drives to LTO5.
- Still waiting on someone (the lawyers?) to define what has to be vaulted for seven (7) years, so I can adjust the backups and reduce the amount that is going to the vaults – still going to run out of space this year if someone does not make a decision soon.
- SR 49973418 – response from EMC support person (level 1 I assume) that is not relevant and confusing. This should be fun.
- loaded 40 blank tapes into library and labeled them (tested out new script. Yay, it worked and didn’t wipe out all the tapes in the library!)
Posted: September 8th, 2012 by fcs
- WAF hit because I got up and left my pager by the bed and then the 6:30 ldap update on the test server (where I’m working out how the mdb backend will perform) blows up and pages her out of bed. Not a good way to start the weekend.
- The test system has 8GB of memory and was told it could use 80GB for the mdb (memory-mapped database – in case you’re interested).
- The system has memory, file system space, and lousy error messages – which claim that an unnamed device is out of memory.
- Posted most of the details to openldap-technical, we’ll see what that brings.
- Reconfigured to allow the mdb to create a 7GB data base. Trying the update again.
- Still failed at 7GB, so tried 4GB. While the failure is not as dramatic, it still fails at the exact same location.
- Now reloading using the hdb backend to verify that there isn’t something wonky in the 2.4.32 base code.
- hdb had no issue, so it is something in mdb. The OpenLDAP developer believes it is a bug in the code – so more digging later.
- Of equal concern is that the mdb_stat program was not here, it should have been – that’s a build issue that needs further research.
Posted: September 7th, 2012 by fcs
- Privacy flags – bumper crop today
- PeopleSoft manual merges of two entries
- mdb test worked on development primary server, now to test it on the development replica
- ruhroh! mini-httpd is not working correctly on RHEL6! Must fix that. The pidofproc function is not working on RHEL6 like it does on RHEL5. The straight pidof binary command works.
- Adjusted the columns per page to 90 for the Admissions printer. It is now reported to work! YAY!
- The usual mess with some clients dying in the middle of being backed up and the connection not timing out. Thus I have to manually stop the job.
- Discovered a COMIS based full that had been marked suspect (I think because the cloning process broke on the writing side). Therefore, I manually marked it not suspect, and am cloning it again this morning before I move the clone tapes.
- More discussion around how best to back up virtual clients… EMC’s VADP process or something else (like VEEAM or PHD Virtual).
- filesystem sharding re-balanced for second unix file server.
- Further discussion with EMC about the /bin/logger vs /usr/bin/logger notification rules. SIGH.
- Testing out NW8’s Synthetic Full – how does it work???