We have just upgraded the operational IPP to a new tag: ipp-20101206.  The main changes in the new build are:

  • ppImage updated to apply the non-linearity correction to deal with the faint-end bias sag.
  • ppSub updated to perform in-line forced photometry on the detected positive sources (optional)
  • updates to dvo tools to allow threaded addstar and dvomerge; dvoverify and dvorepair created
  • compression (gzip) on the mdc output files
  • dark construction fixed to give sensible residuals.

Over the next few days, we will be working on static mask updates, but for now the detrends are the same as before, with only the addition of the non-linearity correction.

Processing Load Tests

I’m using this post to track the processing load tests we are running today. The goal of the test is to measure how the load on Nebulous grows as the number of simultaneous jobs increases. The processing test will run ‘chip’ only on a set of data. We will add machines in 30-minute blocks, increasing from 12 up to 238 machines as the test proceeds. The data set for the test is the STS data waiting to be distributed.

Starting @ 10:55 with 12 nodes

Oops: the data set (STS) is a poor choice: the high star density makes it an outlier in processing time. We want to load Nebulous, not so much the CPUs. Aborting this; Bill is setting up MD03 as an alternative data set.

Restarting @ 12:04 with 1x wave1 (12 nodes)

Update @ 12:39 : add wave3 (15 nodes)

Update @ 13:12 : add compute (24 nodes)

ipp043 crashed @ 13:08 (Bill’s stack)

Update @ 13:59 : add wave2 + wave3 + compute (51 nodes)

Update @ 14:40 : add 2 x wave3 + 2 x compute (78 nodes)

Processing Log 2010-11-25

Happy Thanksgiving.

Bad weather again; no new data.

Enabled diff processing for MD02. The images for the i-band stack for skycell.094 were lost on ipp020, which caused the diff processing for that skycell to fail for diff_id 94750. To allow the run to complete, I updated the database to set quality=8686, fault=0 for that diffSkyfile. To prevent future diffs from being queued, I set the quality of stackSumSkyfile stack_id 36900 to 8686.
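The tweak was a pair of updates along these lines (a rough sketch only: the gpc1 table names come from the log above, but the exact column names, and selecting the diffSkyfile row by diff_id plus skycell, are my assumptions):

-- mark the failed diff skycell as bad quality so the run can complete
UPDATE diffSkyfile
   SET quality = 8686, fault = 0
 WHERE diff_id = 94750 AND skycell_id = 'skycell.094';

-- mark the source stack skycell as bad quality so no further diffs get queued against it
UPDATE stackSumSkyfile
   SET quality = 8686
 WHERE stack_id = 36900;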

Processing Log 2010-11-24 afternoon

Some magicRuns failed to complete. The problem was an error in magictool -definebyquery, which got the inverse bit wrong in certain cases. Tweaked the database to start the runs over, then found the problem in the SQL and fixed it. Ticket #1439.
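Restarting the runs was a manual database tweak, roughly of this form (illustrative only: the magicRun table is named in the log, but the state/fault columns and the condition used to pick out the affected runs are assumptions on my part):

-- reset the faulted magic runs so pantasks picks them up again
UPDATE magicRun
   SET state = 'new', fault = 0
 WHERE fault != 0;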

Stack_id 184549 failed repeatedly: ppStack can’t find a PSF for any of the 4 input warps. Set state to drop and added the note ‘PSF completely bad’. There is a ticket open on this problem, #1427.
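Dropping the run looked roughly like this (a sketch; I am assuming a stackRun table with state and note columns, which may not match the real schema):

-- drop the stack that cannot build a PSF, with a note explaining why
UPDATE stackRun
   SET state = 'drop', note = 'PSF completely bad'
 WHERE stack_id = 184549;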

Remembered that Mark H. reported a lack of stacks and diffs from MD03.20101118. Sure enough, they weren’t queued; now they are. Changed the label on the warps so that they won’t get cleaned up out from under us.
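The label change was something like the following (sketch only: the warpRun table and label column are guesses at the schema, and the new label value here is a placeholder, not the one actually used):

-- re-label the input warps so the cleanup tasks leave them alone
UPDATE warpRun
   SET label = 'MD03.20101118.keep'
 WHERE label = 'MD03.20101118';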

Processing Log 2010-11-24 Morning

Due to high humidity we did not get much data last night. 95 exposures.

It ran through rather quickly. Now the OC133.OSS.20101117 data should have a chance to finish destreaking. As of 12:48 HST there are 311 exposures at the magic stage.

The postage stamp storm from last night has finished. There were 6 jobs stuck because the target chipRun’s magicDSRun was stuck in the goto_cleaned state. This happened due to a bug in the script: if all of the components are already cleaned, the magicDSRun never gets set to cleaned, so the cleanup job keeps running over and over. This can happen if the magicDSRun was set to update but none of the components successfully updated. Fixed the bug and checked in magic_destreak_cleanup.pl.

Processing Log 2010-11-23 Afternoon and Evening

I wasn’t able to log into the blog outside the firewall.

The distribution pantasks died around 14:18 while I was at lunch. Serge restarted it around 15:15.

With some advice from Heather, I got the addstar pantasks up and running.

A warp failed due to corrupt camera outputs. Fixed with runcameraexp.pl.

On the morning of the 24th, the postage stamp server was not making progress on several thousand outstanding jobs. Something was out of whack with the processing of the dependency-checking jobs; restarting the pstamp pantasks fixed the problem. Several requests submitted via the web interface were getting blocked by several thousand jobs submitted by IFA. I lowered the IFA priority, and they finished just as Nigel Metcalfe was typing an email to inquire why his jobs weren’t completing in a timely fashion.

Processing Log 2010-11-23 mid morning

9:45 I queued new camera runs with the reduction SAS_REFERENCE. This will use a PS1 reference catalog that we built. The label is

10:10 AM

Burntool finally finished. (Some stare processing is going on on the cluster this morning. That may have slowed it down.)

For some reason the entry in the date book got lost.

pantasks: ns.show.date
2010-11-18 DROP
2010-11-19 DROP
2010-11-20 CONFIRM_STACKING
2010-11-21 CONFIRM_STACKING
2010-11-22 CONFIRM_STACKING

I added it back in with ns.add.date 2010-11-23 BURNING. It ignored BURNING and set the state to new. A few minutes later the chips were queued.

10:30 Queued camRuns with label SAS.20101118, reduction SAS_REFERENCE. The reference catalog isn’t quite right yet, so I set the label to inactive.

Processing Log 23 November 2010 AM

Bill is processing czar today, 2010-11-23.

5 am OC133.OSS.20101117 had over 700 magic runs pending. The label was not in the distribution pantasks. Added it.

Checked postage stamp server status. All requests are done. Queued data for cleanup.

There were a couple of diff and destreak faults. Reverts were turned off. Turned them on.

6 am One destreak run was still faulted. Turned out to be a corrupt diff skyfile. I fixed this with rundiffskycell.pl. Created an IPP wiki page to record these faults:

http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_Operations/broken_files

6:45 am Summit copy was stuck on one file. The problem was that the node running the process (ipp031) was having NFS issues with the destination host for the file (ipp018).

Killed the process. Ran sudo /usr/local/bin/force.umount ipp018

This left an entry in summit copy’s pzPendingImfile book:

pantasks: show.book pzPendingImfile
exp_name            o5523g0631o
camera                     gpc1
telescope                   ps1
file_id         o5523g0631o34.fits
bytes                  51831360
md5sum          efdae9f34c0ace4c190f8061c869f3df
class                      chip
class_id                  ota34
uri             omitted
epoch           2010-11-23T05:05:56.000000
dateobs         2010-11-23T15:04:30.000000
dbname                     gpc1
pantaskState              CRASH
filename        neb://ipp018.0/gpc1/20101123/o5523g0631o/o5523g0631o.ota34.fits
npages: 1

I needed to create a file called copy.reset containing

book.init pzPendingImfile

and source it in the summit copy pantasks.

A few minutes later, summit copy and registration were done for the day. Burntool started soon afterwards.

Processing Status Update 2010.03.05

I’ve been slacking on posting to this blog for a few days — sorry for the lack of updates!

This past week, we’ve been putting our efforts into getting data to the consortium that is different in some way or another. These included the SAS stacks, the STS stacks, which were not defined for the nightly stacking process and were needed for the SAS diff/magic/release process, the STS ‘uncorrected’ data, a couple of Sweetspot data sets needed by MOPS, and the Run5b data (M31 + extra data for Paul Sydney to spot-check). We decided on Tuesday to block the nightly science processing to ensure all of these other data sets finished. They all finished by this morning, so we have turned the nightly science processing back on. With bad weather last night (only 7 images!) and part of Wednesday night, we only have about 650 images to get through. It looks like the next couple of nights will have poor weather as well, so we should easily catch up…

We had an interesting glitch over the weekend, and by ‘interesting’ I mean ‘annoying’. One of our machines (ipp014) got itself into a confused state where we could not log in or ssh in, but NFS seemed to respond: other machines could write to its disk, and it would claim to accept the data, but no file would actually be created. The result was what we call a Nebulous ‘widow’: the Nebulous database claimed there was a file, but the filesystem did not have it. It turned out that ipp014 claimed to have had a kernel panic before this started (usually that results in a full-on crash and NFS blocks). The jobs that got caught in this trap correctly reported that they failed, but it caused a number of headaches, mostly because our code assumed that Nebulous widows could not exist. In the end, this failure forced us to address this type of problem, so there was some side benefit.

Related to, or partly inspired by, the ipp014 problem, Bill Sweeney investigated some of our NFS settings and made a very important improvement. We have long had a problem with jobs failing occasionally because of NFS timeouts, and this gets particularly bad when the load is very high. We have a whole set of tasks to re-try jobs to handle these kinds of failures; because the failures are ephemeral, re-running the job usually succeeds. As a nasty side effect, when we have a real error in some job, that job will revert over and over until we notice it.

Bill discovered that our NFS configurations were tuned oddly to time out much too quickly: 12 seconds instead of the default of 180 seconds. We adjusted the timeouts back up to a more sensible value on Thursday, and since then we have had far fewer of the NFS glitch errors. This is really good news, because the short timeout was causing at least 10% extra work on the cluster, and a lot of other headaches.

Status of data since 2010.02.25 (Wednesday night)

On Thursday and Friday, we had some delays in processing due to problems with 2 machines, ipp036 and ipp005. On Thursday, ipp036 lost 2 disks, so we needed to avoid all I/O access on that machine until the raid rebuilt. On Friday, ipp005 stopped responding for a while and, because of a BIOS configuration problem, could not be rebooted remotely; the manual reboot by the MHPCC service folks took a while. In addition, Thursday night had the largest number of images we’ve seen to date: 777 total images, of which over 600 were science data. It was not until late in the afternoon on Friday that everything was done with download, registration, and burntool, so processing didn’t start until after 7pm. As of 2:40 pm on Saturday, the chip processing is done, but warp still has 130 exposures to go and diff also has over 100 exposure pairs to go. There were a lot of exposures taken last night as well, but not quite so many as Thursday night; it looks like about 500 science exposures. Hopefully we will be able to catch up tonight…

Also note that in the past couple of nights, there has been no MD field data because of the bright moon. This is the reason for the large number of exposures, but it may also help with processing speed: we do not have as many steps for the non-MD field data, since we do not have to make stacks and stack-stack diffs.

aloha
gene
