
Brad Cowie's blog

13 Feb 2012

With all of our RAIDs behaving this week, I managed to be a bit more productive on other projects. We've been copying all the backups off our current RAID in scribe to a new RAID that we've freed up by upgrading chasm and mojo, which will increase our backup capacity by quite a bit. I also got some more work done on the new voodoo drones for WAND and have authenticated sound working, but there are some interesting pulseaudio bugs to work around: it'll decide to suspend the sound card in your drone when it thinks you're finished with it, even though sound is clearly still playing.

Also spent some time this week cleaning up G.1.02 which is now as clean as G.1.32 and ready for some new students.

13 Feb 2012

So here at WAND we have a lot of hard drives and RAIDs for storing the network traces that we collect from our various monitors. As we all know, hard drives are very unreliable and like to fail when least expected, so after a few fun-to-debug issues with flaky hard drives I decided to do some heavy monitoring of all of our drives and RAIDs.

We mostly have a mixture of 3ware hardware RAIDs and software MDs on our machines. After finding some similar projects, I merged them into a couple of scripts that monitor the state/health of the 3ware RAIDs and MDs themselves in nagios/icinga, and also monitor the actual hard drives behind the 3ware RAIDs for reallocated sectors, which indicate a drive might be starting to fail, so I can look at organising a replacement. I'm also using SMART to monitor a couple of parameters of directly connected drives, such as temperature and reallocated sectors. The next job will be to patch in support for reading SMART through the 3ware controller.
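For illustration, a SMART check along these lines could be sketched in Python as follows. This is not the code from raid_check: the function names and thresholds are hypothetical, and the parser assumes its input looks like the attribute table printed by `smartctl -A`. The exit codes follow the usual nagios plugin convention (0 for OK, 1 for WARNING).

```python
import re

# Nagios plugin exit codes
OK, WARNING = 0, 1


def parse_smart_attributes(text):
    """Map attribute name -> leading integer of the raw value.

    Assumes rows shaped like the `smartctl -A` attribute table:
    ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = re.match(r"\d+", fields[9])
            if raw:
                attrs[fields[1]] = int(raw.group())
    return attrs


def check_drive(attrs, max_realloc=0, max_temp=50):
    """Return (exit_code, message). Thresholds are illustrative defaults."""
    problems = []
    realloc = attrs.get("Reallocated_Sector_Ct", 0)
    temp = attrs.get("Temperature_Celsius", 0)
    if realloc > max_realloc:
        problems.append(f"{realloc} reallocated sectors")
    if temp > max_temp:
        problems.append(f"temperature {temp}C")
    if problems:
        return WARNING, "WARNING: " + ", ".join(problems)
    return OK, f"OK: {realloc} reallocated sectors, {temp}C"
```

A real plugin would feed the output of `smartctl -A /dev/sdX` into `parse_smart_attributes`, print the message, and exit with the returned code so nagios/icinga can pick it up.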

Another feature I've added to our monitoring system is IO graphs for each drive and RAID in our cacti instance, which has turned out to be far more valuable than I initially thought it would be. I spent a long time investigating the best way of doing this and found a great project by Mark Round which did exactly what I wanted. I've forked his project and made a few minor changes, including naming all our MDs with their proper names so that they're easier to identify.

You can find my projects on my github: raid_check and Cacti-iostat-templates

08 Feb 2012

Spent this week dealing with the hardware RAIDs on WAND machines, which wasn't fun. I upgraded the firmware on 2 of our 3ware RAIDs after we found a few interesting bugs that we'd rather not have on our important RAIDs. We have now decommissioned chasm and mojo and all of their data has been moved to either spectre or wraith. With the spare drives we are going to upgrade the backup server.

01 Feb 2012

Spent the first half of the week working on the new voodoo drone setup. I've been trying to get pulseaudio working in the new setup and have non-authenticated sound working; the next job is to work out how to make pulseaudio auth cookies work in our setup. I also keep discovering bugs in the LDAP/kerberos setup, so I'm tweaking configs to make it work nicer.

Spent the second half of this week at NZNOG in Christchurch. There were some pretty interesting talks this year, and I tossed a few ideas around with people about stuff we should be working on. Was good to finally meet some people I had previously only had email contact with.

23 Jan 2012

Got back to WAND this week after an extended holiday. Spent my time doing some routine maintenance stuff, doing security updates on servers and the website. I also spent some time fixing up a few website problems such as the spam comment problem we've been having and also improved the WAND wiki a bit. Also spent a bit of time on the WAND mail server and now we have pretty graphs of mail traffic in cacti.

12 Dec 2011

Spent this week working on the new voodoo development version. Got NFSv4 working with kerberos auth after having to fight with it a bit.

We got our 34 "new" machines and monitors this week from R block lab 1, so spent a morning transporting them from R block to G block. The rest of the week was spent cleaning monitors and keyboards.

We had our LANfest LAN party in Tauranga over the weekend, with WAND/CS being sponsors this year by way of the juniper switches from G.1.04. Everyone seemed to enjoy themselves and the network worked flawlessly, which removed some of the headaches of past LANfests. I'll attempt to get photos of our setup up on www.lanfest.co.nz over the coming days.

05 Dec 2011

This week I finished off my hard drive monitoring and graphing stuff and have it deployed on all machines except the WAND backups server. It has already shown up a few more drives that should probably be replaced soon, as they're showing early signs of failure.

TSG has asked us to document all the machines that WAND has inside 130.217.250.0/24. Luckily I had already spent time setting up the WAND IPAM, so this made the documentation process pretty trivial. I wrote a small module for phpipam to support exporting CSVs (because all it could export was XLS files) and then wrote a little python script to change the CSV into the format that TSG wanted.
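As a rough sketch of what that kind of CSV reshaping looks like, the snippet below renames and reorders columns using Python's standard csv module. The column names on both sides are hypothetical: neither the phpipam export layout nor the format TSG asked for is shown here, so `COLUMN_MAP` is purely an assumption for the example.

```python
import csv

# Hypothetical mapping: input column name -> output column name.
# Output columns are written in this order; unmapped input columns are dropped.
COLUMN_MAP = {
    "ip_addr": "IP address",
    "dns_name": "Hostname",
    "description": "Purpose",
}


def convert(src, dst, column_map=COLUMN_MAP):
    """Copy rows from the CSV file object src to dst, renaming columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst, fieldnames=list(column_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Missing or empty cells become empty strings; whitespace is trimmed.
        writer.writerow(
            {out: (row.get(inp) or "").strip() for inp, out in column_map.items()}
        )
```

Called with two open file objects, for example `convert(open("export.csv", newline=""), open("tsg.csv", "w", newline=""))`, it writes the reshaped file; both file names are placeholders.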

We also have another LANfest (lanfest.co.nz) coming up, so I spent the weekend setting up servers to provide useful network services. This year we're using the WAND juniper switches, which should give us a very nice network. Cheers to Jamie for all the help this year.

25 Nov 2011

Spent this week working primarily on WAND hard drives. I have started to monitor our 3ware RAIDs and software MD arrays. By using metrics from each drive such as reallocated sectors and temperature I hope to be able to quickly find failing drives and source replacements before they die.
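For the software MD side, one simple approach is to look at the member map that the kernel prints in /proc/mdstat, where `[UU]` means every member is up and an underscore marks a missing or failed one. The sketch below assumes that layout; the function name is hypothetical and this is not necessarily how the actual monitoring scripts do it.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid1 sdb1[1] sda1[0]".
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines end with something like "[3/2] [UU_]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded
```

In practice the input would come from reading /proc/mdstat, and a non-empty result would be turned into a nagios warning or critical state.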

Another project I have been working on this week is building a new voodoo based on Debian Squeeze. I have a very early work-in-progress on my desk. Things I hope to add to voodoo this time round are: network sound (pulseaudio), a network file system (AFS? NFSv4?) and a kerberos realm, which seems to be required by all modern network filesystems these days.

21 Nov 2011

I've been back full time at WAND this week after a busy semester of uni. Have spent the week tidying up projects I've been working on a bit over the semester and performing security updates on things.

People were having problems accessing the WAND website over ipv6, so I looked into this with ITS and realised new warlock didn't have a static ipv6 address up on it. In the process I found a few problems ITS have with ipv6 at the moment, such as attempts to access an unroutable address getting stuck in a routing loop on one of their routers. They have also asked us to review the ipv6 firewalling policy for our machines. I took this opportunity to deploy an IP Address Management (IPAM) system to inventory all our machines and what services they host in the v4 and v6 world. The IPAM I rolled out (phpipam) looks good and supports ipv6, but has some quite poor code behind it. I have created a git repo in ~bmc26/git/phpipam/ and made a number of patches that fix some bugs and add some new features, which I'll look at getting into upstream at some point.

We had a problem a few weeks back where a sector on a RAID was incorrectly reallocated, which almost cost us a trace file, but luckily swapping out the disk with the bad sector for a good disk and rebuilding the RAID fixed the issue. I have made a small recursive md5summer that I'm running over our traces so we have up-to-date md5sums for every trace and can work out if trace files ever break.
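A recursive md5summer of this kind can be quite small. The sketch below is an illustration rather than the actual script: the function names are hypothetical, and it assumes the goal is to keep a mapping of relative path to md5sum that can later be compared against a fresh run.

```python
import hashlib
import os


def md5_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large traces never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_tree(root):
    """Walk root recursively and return {relative_path: md5 hex digest}."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            sums[os.path.relpath(path, root)] = md5_file(path)
    return sums


def changed_files(old, new):
    """List files present in both runs whose checksum differs."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

Storing the result of `md5_tree` from one run and passing it to `changed_files` alongside a later run gives the list of trace files whose contents have silently changed.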

Jamie and I have started to look at what we want to do with voodoo over the summer and I've started to build a development voodoo box based on squeeze that we'll try some new things on and see how a squeeze desktop machine behaves.

13 Sep 2011

Spent this week deploying all my nagios/cacti monitoring stuff to all the WAND machines. Found a few bugs that I fixed along the way. I also had to create a few groups in nagios to separate processing servers from servers that we actually care about load on, so I don't get hundreds of emails per day.

You can find all the new graphs of WAND servers and switches here: https://secure.wand.net.nz/cacti/