Posted: December 11th, 2013 by fcs
- Because of the OSP update issues last night, I must modify the update process to deal with them.
- Turns out the issue was that Net::LDAP will happily hand back either the mixed-case or the lower-case form of the attribute name. So, my little hash needed to have both forms. A really proper fix would be to force the attribute name to lower case before checking the hash (and to keep only lower-case versions of the names in the hash).
- Other than the OSP update failure/abend, there were no other issues with the update last night.
- The Windows groups had their nightly panic attacks and I will have to do something about that. My current hypothesis is load on the temporary disk subsystem, which is too small to hold out much longer.
- Worked with EMC on SR 59521968.
- The engineer understands the environment better now.
- Action Items:
- Run nsrget (and provide output to him)
- Update staging.pl to print out contents of SSID file
- Spoke with EMC sales rep about Data Domain and DDBoost and pricing.
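The "really proper fix" for the Net::LDAP attribute-case issue above can be sketched as follows. This is a Python illustration of the idea, not the actual Perl update script; the hash contents and function name are hypothetical.

```python
# Attributes the update script manages. Keys are stored in lower case ONLY,
# so it no longer matters which case form the LDAP library hands back.
MANAGED_ATTRIBUTES = {
    "ospemail": True,          # hypothetical entries for illustration
    "maillocaladdress": True,
}

def is_managed(attr_name: str) -> bool:
    """Fold the attribute name to lower case before the hash lookup,
    so 'OSPEmail' and 'ospemail' both resolve to the same key."""
    return attr_name.lower() in MANAGED_ATTRIBUTES
```

With this, there is no need to keep both mixed-case and lower-case forms of each name in the hash.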
Posted: December 10th, 2013 by fcs
- No issues in last night’s update.
- VPN flipping…
- OSP code change testing, and production roll-out.
- EDIT: 12/11 – Because the OSPEmail attribute does not have an EQUALITY match rule, the update code needs to use replace instead of delete/add. Took two (2) hours (8:30-10:30) last night to manually process the rest of the update.
- Cleaning cartridge expired at 0525. Replaced at 0649.
- No response from EMC about the open issue (59521968). Need to poke them again. Scheduled a call for 10am tomorrow.
- Windows disaster recovery group had three failures. This is getting to be a pattern. Now, if I could only figure out what the pattern indicates :/
- df -h reports 2.3TB left on that second ZFS pool that I’m trying to empty off. Oops! I forgot about the gzip-1 compression. It’s really over 7TB of data. Well, darn! That’s going to take a while to stage off. Perhaps my Christmas present to myself will be getting this done before New Year’s??? (or maybe not)
- A recover of data from over the summer was required. Two tapes were retrieved from the vaults. Recover command instructions were given to the IT support person.
- Nothing back from Numara/BMC – perhaps it is time I gave them a phone call?
- Lots of work trying to adjust to the new firewall rule check in cfengine.
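The replace-instead-of-delete/add change from the OSPEmail EDIT above comes down to which LDAP modification type is sent. A delete of a specific value asks the server to compare values, which requires an EQUALITY matching rule on the attribute; a replace just overwrites whatever is there. A minimal sketch (the changetype constants and helpers below are illustrative, not the real update code):

```python
# LDAP modify changetypes, modeled as simple strings for illustration.
MOD_ADD, MOD_DELETE, MOD_REPLACE = "add", "delete", "replace"

def delete_add_mods(attr, old_value, new_value):
    """The failing approach: deleting `old_value` forces the server to
    match it against stored values, which needs an EQUALITY rule."""
    return [(MOD_DELETE, attr, [old_value]), (MOD_ADD, attr, [new_value])]

def replace_mods(attr, new_value):
    """The working approach: replace needs no value matching at all."""
    return [(MOD_REPLACE, attr, [new_value])]
```

For an attribute like OSPEmail with no EQUALITY rule, only the second form succeeds.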
Posted: December 9th, 2013 by fcs
Theme: Another day, another day
- Update last Friday night had a former student. Did the dance.
- Meeting with OSP. Direction chosen. Marching begins.
- OSP schema updated.
- ACLs verified to protect the new info.
- in-mailgw and ldap.uvm.edu servers updated with the new schema.
- ldaprw was updated with the new schema at 10:30pm. No problems, aside from Nagios noticing that one of the in-mailgw’s had a sync check issue.
- Last week’s clones finished at 1am this morning. OUCH. Started this week’s clones up.
- Answered the questionnaire about NW 8.1.1 and friends – mostly honestly.
- Went to poke EMC about SR 59521968, and discovered they had poked me last Thursday. OUCH! So, I apologized and poked them back.
- BMC issue 293502: our 9.3 install is too far out of date and I really need to upgrade to 11.6. Started doing that. Need BMC to confirm that RHEL6 (glibc 2.12) is officially supported on Footprints v11.
Posted: December 6th, 2013 by fcs
- Work on developing OSP changes uncovered a bug in the production code. It is possible now (though was not when the code was written) for an SSN to match, but not have the correct objectClass in the LDAP entry. modEntry has to check for that now.
- Last night’s update had one issue, a former student. Did the dance.
- Heard back from OSP; merged the former student info with the OSP info and created the account.
- The moving of the bits off the troubled ZFS pool onto tape continues. The 21,156 save sets that started moving yesterday continue. 50 of them have moved. Yes, this is going to take time.
- Updated the list of save sets and the matching directive for NFS1 to hopefully have a “more harmonious outcome” with this weekend’s full save of that file server.
Posted: December 5th, 2013 by fcs
- One issue in last night’s update: a former student. Did the search, fail, add dance.
- Working on some code changes for OSP.
- ldap6dev failed its backup last night – apparently, the Networker installation got wiped out when I was recovering from the mess that yum was in yesterday. That is particularly disgusting.
- Three Windows systems failed their Disaster Recovery backups, two with a message that the remote host had forcibly closed the connection (uhm. what?!?), and the final one with an unknown error in an Oracle *.msb file.
- The INBOX aftd cleaned itself up overnight. YAY, that’s one process that was never supposed to be manual and now isn’t!
- The first 500 save sets on the UVM.SATABeast.003 aftd finished staging to tape after almost 24 hours. Set up the remaining 21,156 save sets (12 TB), sorted by decreasing size, to move to tape as singletons. That should finish some time next week.
- Scheduled maintenance for 11am – 2pm on 12/13 – to upgrade to the new DDOS version.
- Downloaded the 188.8.131.52 DDOS code and documentation.
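The staging plan above (remaining save sets sorted by decreasing size, moved as singletons) is easy to express. A sketch, with an illustrative save-set record shape rather than real Networker output:

```python
def stage_order(savesets):
    """Order save sets largest-first for staging one at a time, so the
    big transfers start immediately and only small stragglers remain at
    the end. Each record is a dict with 'ssid' and 'size' (bytes)."""
    return sorted(savesets, key=lambda s: s["size"], reverse=True)
```

Largest-first also gives an early read on throughput, since the biggest save sets dominate the total transfer time.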
Posted: December 4th, 2013 by fcs
- Yesterday’s decision to start backing up the three systems that were not currently being backed up caused errors in the backup system overnight. Apparently, besides defining the backup client, one has to actually install the backup client software on the system being backed up. Who knew? DOH!
- cfengine configuration complained repeatedly about the condition of one of the three systems. Finally gave up and took the 2×4 to the side of cf-execd’s head (rm -f /var/cfengine/state/*). That and a stop/start of cf-execd still didn’t make it realize things were proper. However, murdering the nsrexecd that had been running since May 23 (kill -9 style) and then starting nsrexecd again did. I guess nsrexecd didn’t like having lgtoclnt removed from the system and the /nsr directory deleted?
- Two issues in last night’s update. An OSP feed with a singleton match and a different last name than the match found, and a Former Student from SIS.
- Former Student: Searched, Not Found, Added.
- OSP: Interrogation email sent… (oy, I need to get more sleep)
- Rolled OpenLDAP 2.4.38-2 onto ldap11 and ldap6dev. Found yum on ldap6dev in a terrible state with multiple copies running and all hung on a lock like fleas on a dog. Wound up doing a modified version of the steps Jim had to take on otter yesterday to clean it up. Someone else will have to figure out why it was in such a state.
- Backups of several of the ldap servers failed. Glowered at the admin until he installed the backup client software on those servers.
- 8.1sp1 testing finds more errors. Time to open an issue with EMC (Issue 59521968 opened)
- A couple of Windows systems failed their disaster recovery backups – I wonder if there was some Tuesday night maintenance that hit them.
- The movement of bits from the degraded ZFS pool to the temporary ZFS pool finished last night. The next step of clearing off everything else from that portion of the degraded ZFS pool to tape is started. However, at just north of 22000 save sets, this is going to take a while!
- Discovered that the INBOX aftd disk volume was set as “manually recyclable”. Oops! That is not supposed to be that way. I fixed it. I’ll bet that is the reason I have had to manually purge the old save sets off it!
Posted: December 3rd, 2013 by fcs
- Last night’s audit failed because a new netid had been created that duplicated an existing sendmail specific alias. The adduser script was updated to verify that we don’t create duplicate mailLocalAddress attributes in LDAP.
- This morning’s log rotations ran out of space. The suggested fix for that is to back up all the replicas and keep a smaller number of the daily logs on the live systems.
- And, there were a few errors in the update last night.
- Two were the result of the discussion with OSP yesterday and their update of their DB.
- The third is a new employee who might be an alumnus with a new name. HR confirms my hypothesis and the employee information was merged into the existing LDAP entry.
- A little bit of VPN permission hacking.
- The first of the three AFTD’s on the “athens” zpool was emptied off overnight and has been removed. Work begins on the second of the three.
- A client failed to back up last night, for the second night in a row. The system administrator let me know that it was a cascade failure: the first night was a failed UPS, the second night a failed ethernet card. We expect it to perform properly tonight.
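The adduser safeguard from the audit failure above (never create a netid whose address collides with an existing mailLocalAddress or sendmail alias) boils down to a case-insensitive membership check. A sketch; in the real script the existing-address set would come from an LDAP search, and the function name is illustrative:

```python
def address_available(candidate: str, existing: set) -> bool:
    """Return True only if `candidate` does not collide with any
    already-assigned address. Email addresses compare
    case-insensitively, so fold both sides to lower case."""
    return candidate.lower() not in {a.lower() for a in existing}
```

The adduser script would refuse to create the netid (and flag it for a human) whenever this returns False.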
Posted: December 2nd, 2013 by fcs
- One issue in the update – a PeopleSoft job that has no employee code associated with it, so it is being ignored. I’ll give it a day and see if HR fixes it on their own before I point it out to them and request it be fixed.
- OSP query about an issue that was resolved (in LDAP) back in the middle of November. Hopefully, I was able to answer the problem sufficiently that they can deal with it (sounds like their DB needs to have a field changed).
- There was an apparent issue with the snapper job over the weekend so the Sunday morning incremental backups of email all failed because the GFS filesystems were not mounted. It appears there is an error path that will allow them to be unmounted and not remounted on penguin5. I manually mounted them and this morning’s backup ran just fine.
- Turns out the issue was that the “Penguin Full” group ran until 11pm Saturday night, so the snapper could not run at 10:30pm. However, snapper only remounted the filesystems on one of the backup email servers, and all of the jobs on Sunday morning tried to run on the other one. Jim will update the snapper script to be sure it remounts the existing snapshots on both nodes if either of the nodes fails the “can I unmount” check.
- Two issues with backups:
- a client appears to have failed – will give the support person a day before asking what’s up
- another Windows client decided to try to back up the B:\ disk, which doesn’t really exist. Geoff will upgrade it to the Networker 8 client to fix that.
- Setting up AFTD’s on the zpool sparta that was created on Friday, using quotas to limit the amount of space each AFTD can consume. However, I have discovered that the RHEL6 NFS client is not honoring the “root=@” parameter, so the filesystems appear to be owned by “nobody”. While this was not a problem with Networker 7, it causes a failure in Networker 8 because nobody is not root.
- After a couple hours of email exchanges with Ben, I have given up on resolving the issue today and put Networker 7 on stornode4 instead of Networker 8.
- ZFS ashift issues: working on staging all the data existing on the athens ZFS pool to tape.
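The snapper fix Jim is planning (remount the existing snapshots on both nodes if either node fails the "can I unmount" check) can be sketched as a pure planning step. Node names and the mount-state model here are illustrative, not the real snapper script:

```python
def remount_plan(nodes, snapshot_fss, mounted):
    """Return the (node, filesystem) pairs that still need mounting so
    that EVERY node has EVERY snapshot filesystem mounted, not just one.
    `mounted` maps node name -> set of currently mounted snapshot fss."""
    return [(node, fs)
            for node in nodes
            for fs in snapshot_fss
            if fs not in mounted.get(node, set())]
```

Sunday morning's failure mode (all snapshots remounted on one node, all jobs scheduled on the other) yields a non-empty plan for the second node, which the script would then execute before backups start.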
Posted: December 2nd, 2013 by fcs
Today’s Theme: Overtime on Salary. Sweet!
- Added a QLogic QLE2462 to bujbod1 and attached the two 1TB/42disk SATABeasts to it. Set up a new ZFS pool (sparta) which will hold most of the data currently in the zpool athens while athens is rebuilt.
- Work (including preliminary monitoring config changes and getting everything shut down) took just about 4 hours.
Posted: November 27th, 2013 by fcs
- Five issues in the update. Two former students, two singleton matches with non-conformant surnames, and one new employee for whom no match was found.
- Two former students searched for, not found, added.
- Two singleton matches were people with the generational suffix in the feed where it didn’t belong, merged them with their existing entries.
- One new employee is a foreign student without an SSN. Found and merged.
- One issue from yesterday – the person confirms the surname we have on record. Information created and merged.
- Performance of staging from the NFS systems to tape appears to be degraded from what was originally measured. Is this due to the issue since October with the ZFS pool, or is this something worse in the Linux NFS client implementation? As it is a deviation from previously measured normal, I shall endeavor to hope that it is related to the issues with the ZFS pool and it will magically go away after that is fixed in the next week.
- New client defined (DR only)
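For reference, "did the dance" throughout these entries is the search, fail, add sequence for former students: search the directory for the incoming person, and only if nothing is found, add a new entry. A sketch modeling the directory as a dict keyed by netid (the real version searches LDAP; names here are illustrative):

```python
def search_fail_add(directory: dict, netid: str, entry: dict) -> str:
    """The 'dance': search first; add only on a miss."""
    if netid in directory:
        return "found"         # existing entry; merge is handled separately
    directory[netid] = entry   # search failed: add the new entry
    return "added"
```

Running it twice for the same netid adds on the first pass and finds on the second, which is exactly why the dance is idempotent across reruns of the update.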