Posted: March 7th, 2014 by fcs
- No issues today.
- Updated a sendmail alias for network services.
- Some VPN group updating for Business Re-engineering.
- CoM/IS: 81 SSIDS/24.894 GB left, 8/3,370.156 in flight.
- Disk: 414/13,668.213 left, 8/1,282.831 in flight.
- Tape: finished on Wednesday.
- Shred: 21TiB/35TiB 60% of pass 1 complete.
- EMC SR 61542494: Finally a response on this Sev 2 after two days…
- Reshard NFS2.
- Talked into upgrading my laptop to Mavericks (OS X 10.9.2) and now I can’t get NMC to run. UGH.
Posted: March 6th, 2014 by fcs
- Issues? Issues!? Issues? We ain’t got no issues!
- That special full backup ran in less than an hour last night and backed up 50,310 files, an improvement over the 1,584 files the last full backed up (when the .nsr file with the skip-everything directive was still present).
- Disk: 469 SSIDS/19,755.776 GB left, 17/202.259 in flight.
- CoM/IS: 105/3,395.051 left, 49/8,623.362 in flight.
- Tape: Finished yesterday.
- Shred: 20TiB/35TiB 57% of pass 1.
- No failed groups (according to NMC) last night. The two email groups continue to run (as normal). Of course, the save group email said that the Windows file server system had errors – but that is a lie… Sadly, I guess until I can get 220.127.116.11 (and the savegrp binary) installed, I have to trust NMC over the email! Eek!
Posted: March 5th, 2014 by fcs
- Issues? Yes, we have issues…
- Three former students. A dancing fool today.
- One OSP person with a singleton match and a different last name. Email to Kay.
- Updated the adduser script to prevent blank/null names from being entered.
- One Windows client and six Linux clients failed to back up overnight. Most of them failed with a “Read Only filesystem” error, which was caused by a SAN hardware disaster at the end of the day on Tuesday.
- CoM/IS full save from the weekend continued plodding along past the 11pm start time, but managed to complete before I arrived in the office (and checked).
- Tape: 0/0 left, 4/2,043.171 in flight. Completed at 12:39pm.
- Disk: 558 SSIDS/25,903.806 GB left; 6/258.792 in flight.
- CoM/IS: started 212 SSIDS/13,989.889 GB
- Shred: 18TiB/35TiB 52% of pass 1 done.
- EMC has a problem. nsrexecd checks for a file named nsrexecd.<pid> in /nsr/run at startup, and if there is a process running with that pid (whether it is nsrexecd or not) it refuses to start. This is fine if nsrexecd is always shut down cleanly. It fails miserably if the system crashes, or if the SAN crashes and causes a read-only filesystem, which won’t let nsrexecd clean up the file.
- Opened SR 61542380: savegrp emails tell lies – to deal with the fact that the save group emails claim failure, but the mminfo, nsrinfo and recover commands indicate that the backup was not a failure. (EMC claims that the NMC backup failure I see in 18.104.22.168 is also a lie and they are fixing that one… sigh)
- EMC has a patched savegrp binary available. Fix is for 22.214.171.124 – they claim it fixes the NMC database backup failure too. Retrieved and tested on the test system. It does indeed fix it. Now, I need them to give me the hot fix for NW153818 as well for 126.96.36.199 and then I can put this into production.
- Opened SR 61542494: /nsr/run not cleaned up after system crash – to deal with nsrexecd failing to start after a system crash because /nsr/run/nsrexecd.1558 exists, and a process with pid 1558 also exists (in this case it was crond, not nsrexecd that was pid 1558).
- Why does machine X have the files I want not backed up??? Just a hunch, but it might have something to do with the .nsr file which contains “+skip: *.*” that is in the root of that filesystem… Well, you removed that when? After the last full save. I see. Let’s give you a special full save, right now!
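The stale pid file problem behind SR 61542494 (above) is a common daemon pitfall: a leftover pid is trusted even though the number now belongs to some other process. A minimal Python sketch of a more defensive check follows. It assumes a Linux-style /proc filesystem and is not how nsrexecd is actually implemented; `pid_belongs_to` and `pidfile_is_stale` are hypothetical names used only for illustration.

```python
import os


def pid_belongs_to(pid, expected_name, proc_root="/proc"):
    """Return True only if a process with this pid exists AND has the expected name."""
    try:
        # /proc/<pid>/comm holds the command name (truncated to 15 characters on Linux).
        with open(os.path.join(proc_root, str(pid), "comm")) as fh:
            return fh.read().strip() == expected_name
    except OSError:
        # No such process, or the entry is unreadable: treat it as not our daemon.
        return False


def pidfile_is_stale(pidfile_path, expected_name, proc_root="/proc"):
    """A file named like 'nsrexecd.1558' is stale unless pid 1558 really is the daemon."""
    pid_text = os.path.basename(pidfile_path).rsplit(".", 1)[-1]
    if not pid_text.isdigit():
        return True  # no usable pid in the file name
    return not pid_belongs_to(int(pid_text), expected_name, proc_root)
```

With a check like this, the crond process that happened to get pid 1558 would not match the name nsrexecd, so the leftover file would be treated as stale instead of blocking startup.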
Posted: March 4th, 2014 by fcs
- Issues: Two Former Students fresh off the boat from Banner. We danced.
- Running a backup of the K drive for the Windows file server group, which failed to back it up and failed to inform us. This Networker 8.1.0.x code has some serious issues.
- Shred: 17 TiB/35TiB 49% of pass 1. This is definitely not going to be done before March 17.
- Cleaning tape replaced.
- CoM/IS full save from the weekend continues to plod along. Dribbling data onto the tape at less than five (5) Megabytes per second. A sure way to damage both the tapes and the tape drive.
- Tape: 20 SSIDS/3,986.454 GB left, 7/1,138.671 in flight.
- Disk: 601/31,346.625 left, 10/1,622.746 in flight.
Posted: March 3rd, 2014 by fcs
- One issue over the weekend, on Friday night:
- New PeopleSoft entry with an invalid SSN – typical of a new STU who is a foreign national. Found the student, merged the PeopleSoft information in.
- OpenLDAP upgraded on all of the ldap.uvm.edu replicas.
- Two LDAP merge operations for Account Services. One was a simple rename, needed because Oracle does not behave properly with account names that start with digits. The second was a case of one person with two accounts. Notified the portal and CatAlert folks about that second change as it will bother their processes.
- There are several weekend full saves that are still running. Three are tied up behind the big windows file server which is using all the available streams for that client and probably won’t finish until late today. All but the CoM/IS one completed before 9am.
- One Windows client has refused to back up three days in a row. Talking with Geoff – a firewall group policy on that system appears to be involved. He’s looking at it.
- Shred: 15 TiB/35TiB 43% of pass 1 complete. This is a pretty fair indication that this process is not going to be able to complete before the monthly downtime. So, I will need to come up with an alternative plan.
- Hybrid installation of Networker 188.8.131.52 on test machine. Installed 184.108.40.206 client, node, server, man packages. Installed 220.127.116.11 NMC package. Will know soon if that allows the backup of the NMC database to work. No, backup of the NMC database is still broken, so it is definitely a bug in the lgtoclnt package, which means that 18.104.22.168 is a non-starter.
- Started DISK and TAPE clones.
Posted: February 28th, 2014 by fcs
- Two issues in update:
- CatCard complaining. Informed CatCard office of the merge and change of UUID that I performed yesterday, so they can update their database.
- Banner can’t find entry X to be deleted. Removed entry X from my look aside entries to deal with (Banner confirmed the duplication yesterday and deleted the one that was newly created).
- Shred: 12 TiB/35TiB 34%. At this rate, I predict about a month (+/- 2 days) to do a single pass. That means it can’t be done between reboots for backup maintenance. What is plan B?
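The one-month estimate can be sanity-checked with simple arithmetic. The sketch below uses the 35 TiB total and two shred readings from this log (6.8 TiB on February 24th and 12 TiB on February 28th). The constant-rate assumption and the `days_per_pass` helper are illustrative only, and the readings were not necessarily taken at the same time of day.

```python
from datetime import date


def days_per_pass(total_tib, first, second):
    """Estimate days for one full pass from two (date, TiB done) readings,
    assuming the shred proceeds at a constant rate."""
    (d1, done1), (d2, done2) = first, second
    rate = (done2 - done1) / (d2 - d1).days  # TiB per day
    return total_tib / rate


estimate = days_per_pass(35, (date(2014, 2, 24), 6.8), (date(2014, 2, 28), 12.0))
print(f"about {estimate:.0f} days per pass")
```

At roughly 1.3 TiB per day this comes out to about 27 days for a single pass, which is consistent with the "about a month" prediction.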
Posted: February 27th, 2014 by fcs
Not going to forget what day it is today…
- Had a few issues in last night’s update.
- One person – two accounts. PeopleSoft and Banner disagree about the person’s SSN. They get to fight it out. PeopleSoft lost that battle. Entries merged.
- Former Student – dance done.
- Banner Course Assignment without any matching employee entry. This seems to be another SSN data-entry problem, because I think this is an existing lecturer and former student with an SSN that is almost identical.
- LDAPRW upgrade to OpenLDAP 2.4.39 and nuke/repave of the ldbm backend was completed in 14 minutes. Beat my own best guess by a minute.
- Set up a new VPN group (well, assigned a new VPN group value to a set of LDAP entries)
- Shred: 11 TiB/35TiB 31% (… will it be done with both passes before March 17?)
Posted: February 26th, 2014 by fcs
- No problems with last night’s update.
- Cleaned up an ou=Former Student denizen for Account Services that belonged to a person who no longer exists.
- Testing out entitlements as a way to protect some accounts from having their VPN usage blocked.
- Clones: CoM/IS and Tape clones finished. That makes all three done this week! YAY!
- Shred: 9.6TiB/35TiB 27% (pass 1 – random stuff)… been over a week now…
- Because of the new checkpoint capable save set options, I needed to update the staging selection logic to stop ignoring incomplete save sets.
Posted: February 25th, 2014 by fcs
- No errors in last night’s update.
- Working on the VPN changes. I think I finally have it. Examining the victims (over 121K) to verify the code is not overzealous. Looks good. Reported results, asking permission to implement. Answer is no – too many that need specific VPNs would be blocked.
- Doubled the size of the /home filesystem on the test server – to make it the same as the production server (tired of banging my head on that ceiling).
- Added a user to a VPN group per request.
- Announced ldaprw outage for 2/27@6:45am.
- VPN blocks placed on Former Students container.
- Pushed removal of VPN blocks update code into production, but the addition of VPN blocks is not ready to go yet due to a problem population.
- Shred: 8.2TiB/35TiB 23% of pass 1
- CoM/IS: 24/4,242.198 GB working, 37/1,705.979 GB left.
- Tape started – that long running full save finally finished.
Posted: February 24th, 2014 by fcs
- No errors in the nightly updates in the past three days.
- Working on the posixGroup-memberuid validity audit.
- Cleaned up a one person two LDAP entries issue. Need to complete the clean up tomorrow by purging the abandoned entry (which was left for today to allow the portal to catch up).
- Shred: 6.8TiB/35TiB 19% random data pass
- Clones: Disk clones finished Saturday afternoon.
- Clones: Started the CoM/IS and Disk clones (tape backups are still running). Disk clones completed.
- 28 of the tapes that were re-labeled on Saturday had issues.
- These issues normally result from the way tapes are chosen for labeling. Suppose a series of tapes holds a single save set. The first tape in the series is re-labeled, which invalidates the saved information about the rest of the tapes in the series. Then the remaining tapes in the series are examined, and each one fails because it no longer matches the saved information. The result is that I need to go back and deal with them manually on a second pass.
- The second pass cleaned up 25 of the 28.
- As the first pass had made all of these “expired” and eligible to be recycled, checking on the final 3 confirms that Networker had relabeled them and started using them again.
- Wrote a script to list out the tapes that should be (or will probably be) in the vaults in barcode order. This will allow me to organize the contents of the vaults in barcode order.
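The actual script is not shown in this log. As a rough illustration of the barcode-order listing idea only, here is a minimal Python sketch. It assumes each volume is available as a record with a barcode and a location; the `vault_listing` function, the field names, and the sample data are all hypothetical.

```python
def vault_listing(volumes, vault_locations):
    """Return the barcodes of volumes located in a vault, sorted in barcode order."""
    in_vault = [v["barcode"] for v in volumes if v["location"] in vault_locations]
    return sorted(in_vault)


# Hypothetical sample data, not real volumes from this environment.
volumes = [
    {"barcode": "B00017", "location": "vault-1"},
    {"barcode": "A00342", "location": "library"},
    {"barcode": "A00120", "location": "vault-2"},
]

for barcode in vault_listing(volumes, {"vault-1", "vault-2"}):
    print(barcode)
```

A plain string sort is enough here because the sample barcodes are fixed-width; barcodes of varying length would need a different sort key.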