I’ve been slacking on posting to this blog for a few days — sorry for the lack of updates!
This past week, we’ve been focusing our efforts on getting data to the consortium that is out of the ordinary in one way or another. These included the SAS stacks, the STS stacks (which were not defined for the nightly stacking process and were needed for the SAS diff/magic/release process), the STS ‘uncorrected’ data, a couple of Sweetspot data sets needed by MOPS, and the Run5b data (M31 plus extra data for Paul Sydney to spot-check). We decided on Tuesday to block the nightly science processing so that all of these other data sets could finish. They all finished by this morning, so we have turned the nightly science processing back on. With bad weather last night (only 7 images!) and part of Wednesday night, we only have about 650 images to get through. It looks like the next couple of nights will have poor weather as well, so we should easily catch up…
We had an interesting glitch over the weekend, and by ‘interesting’ I mean ‘annoying’. One of our machines (ipp014) got itself into a confused state where we could not log in or ssh in, but NFS seemed to respond. Other machines could write to its disk, and it would claim to accept the data, but no file would actually be created. The result was what we call a Nebulous ‘widow’: the Nebulous database claimed there was a file, but the filesystem did not have it. It turned out that ipp014 had apparently suffered a kernel panic before this started (usually that results in a full-on crash and NFS blocking). The jobs that got caught in this trap correctly reported that they failed, but the episode caused a number of headaches, mostly because our code assumed that Nebulous widows could not exist. In the end, this failure forced us to address this type of problem, so there was some side benefit.
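To make the ‘widow’ idea concrete, here is a toy consistency check of the sort that catches them. This is not our actual Nebulous code: the real system keeps its state in a proper database server, and the table and column names below are made up purely for illustration.

    import os
    import sqlite3  # stand-in database; the real Nebulous store is not sqlite

    def find_widows(db_path):
        """List entries whose backing file is missing on disk (widows).

        Assumes a hypothetical table instance(uri, path); the real
        Nebulous schema differs -- this only sketches the idea.
        """
        conn = sqlite3.connect(db_path)
        widows = [(uri, path)
                  for uri, path in conn.execute("SELECT uri, path FROM instance")
                  if not os.path.exists(path)]
        conn.close()
        return widows

    for uri, path in find_widows("nebulous.db"):
        print("widow: %s -> missing file %s" % (uri, path))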
Related to, or partly inspired by, the ipp014 problem, Bill Sweeney investigated some of our NFS settings and made a very important improvement. We have long had a problem with jobs occasionally failing because of NFS timeouts, and it gets particularly bad when the load is very high. We have a whole set of tasks that re-try jobs to handle these kinds of failures: because the failures are transient, re-running the job usually succeeds. As a nasty side effect, when some job hits a real error, that job will get reverted over and over until we notice it.
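As a rough sketch of that retry pattern (the function names here are made up, not the actual IPP task names), a transient NFS hiccup clears on a later attempt, while a genuine bug just keeps coming back:

    import time

    MAX_RETRIES = 5

    def run_with_retries(run_job, job):
        """Re-run a job that fails transiently (e.g. an NFS timeout).

        A transient fault usually clears on the next attempt; a real bug
        makes the job fail every time, so it keeps getting re-queued
        until someone notices.  Names are illustrative only.
        """
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                return run_job(job)
            except IOError as err:
                print("attempt %d failed: %s" % (attempt, err))
                time.sleep(30 * attempt)  # back off before trying again
        raise RuntimeError("job %r still failing after %d attempts" % (job, MAX_RETRIES))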
Bill discovered that our NFS configuration was oddly tuned to time out much too quickly: 12 seconds instead of the default of 180 seconds. We adjusted the timeouts back up to a more sensible value on Thursday, and since then we have seen far fewer of the NFS glitch errors. This is really good news, because those glitches were causing at least 10% extra work on the cluster, along with a lot of other headaches.
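For anyone curious where numbers like that live: the client-side NFS mount options timeo (specified in tenths of a second) and retrans together control how long a request is retried before it fails. A quick way to see what a Linux client is actually using is to read /proc/mounts; here is a small sketch:

    def nfs_timeouts():
        """Print each NFS mount point with its timeo/retrans options."""
        with open("/proc/mounts") as mounts:
            for line in mounts:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype.startswith("nfs"):
                    wanted = [opt for opt in options.split(",")
                              if opt.startswith(("timeo=", "retrans="))]
                    print(mountpoint, ",".join(wanted))

    nfs_timeouts()

(I’m not claiming this is how we set or checked ours; it’s just a convenient way to see the values on a given machine.)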