Обсуждение: buildfarm animals and 'snapshot too old'
Hi all, today I got a few of errors like these (this one is from last week, though): Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT Content: snapshot to old: Wed May 7 04:36:57 2014 GMT on the new buildfarm animals. I believe it was my mistake (incorrectly configured local git mirror), but it got me thinking about how this will behave with the animals running CLOBBER_CACHE_RECURSIVELY. If I understand the Perl code correctly, it does this: (1) update the repository (2) run the tests (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188) (4) fail if older Now, imagine that the test runs for days/weeks. This pretty much means it's wasted, because the results will be thrown away anyway, no? regards Tomas
On 05/15/2014 12:43 PM, Tomas Vondra wrote: > Hi all, > > today I got a few of errors like these (this one is from last week, though): > > Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT > Content: > snapshot to old: Wed May 7 04:36:57 2014 GMT > > on the new buildfarm animals. I believe it was my mistake (incorrectly > configured local git mirror), but it got me thinking about how this will > behave with the animals running CLOBBER_CACHE_RECURSIVELY. > > If I understand the Perl code correctly, it does this: > > (1) update the repository > (2) run the tests > (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188) > (4) fail if older > > Now, imagine that the test runs for days/weeks. This pretty much means > it's wasted, because the results will be thrown away anyway, no? > The 24 hours runs from the time of the latest commit on the branch in question, not the current time, but basically yes. We've never had machines with runs that long. The longest in recent times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes around 4.5 hours. But we have had misconfigured machines reporting unbelievable snapshot times. I'll take a look and see if we can tighten up the sanity check. It's worth noting that one thing friarbird does is skip the install-check stage - it's almost certainly not going to have terribly much interesting to tell us from that, given it has already run a plain "make check". How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems kinda nuts. cheers andrew
On 15 Květen 2014, 19:46, Andrew Dunstan wrote: > > On 05/15/2014 12:43 PM, Tomas Vondra wrote: >> Hi all, >> >> today I got a few of errors like these (this one is from last week, >> though): >> >> Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT >> Content: >> snapshot to old: Wed May 7 04:36:57 2014 GMT >> >> on the new buildfarm animals. I believe it was my mistake (incorrectly >> configured local git mirror), but it got me thinking about how this will >> behave with the animals running CLOBBER_CACHE_RECURSIVELY. >> >> If I understand the Perl code correctly, it does this: >> >> (1) update the repository >> (2) run the tests >> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188) >> (4) fail if older >> >> Now, imagine that the test runs for days/weeks. This pretty much means >> it's wasted, because the results will be thrown away anyway, no? >> > > > The 24 hours runs from the time of the latest commit on the branch in > question, not the current time, but basically yes. > > We've never had machines with runs that long. The longest in recent > times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes > around 4.5 hours. But we have had misconfigured machines reporting > unbelievable snapshot times. I'll take a look and see if we can tighten > up the sanity check. It's worth noting that one thing friarbird does is > skip the install-check stage - it's almost certainly not going to have > terribly much interesting to tell us from that, given it has already run > a plain "make check". > > How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems > kinda nuts. I don't know. According to this comment from cache/inval.c, it's expected to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS. /** Test code to force cache flushes anytime a flush could happen.** If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYSprovides a* fairly thorough test that the system contains no cache-flush hazards.* However, it also makesthe system unbelievably slow --- the regression* tests take about 100 times longer than normal.** If you're a gluttonfor punishment, try CLOBBER_CACHE_RECURSIVELY. This* slows things by at least a factor of 10000, so I wouldn't suggest*trying to run the entire regression tests that way.It's useful to try* a few simple tests, to make sure that cachereload isn't subject to* internal cache-flush hazards, but after you've done a few thousand* recursive reloads it'sunlikely you'll learn more.*/ regards Tomas
On 05/15/2014 03:57 PM, Tomas Vondra wrote: >> How long does a CLOBBER_CACHE_RECURSIVELY run take? days or weeks seems >> kinda nuts. > I don't know. According to this comment from cache/inval.c, it's expected > to be way slower (~100x) compared to CLOBBER_CACHE_ALWAYS. > > /* > * Test code to force cache flushes anytime a flush could happen. > * > * If used with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS provides a > * fairly thorough test that the system contains no cache-flush hazards. > * However, it also makes the system unbelievably slow --- the regression > * tests take about 100 times longer than normal. > * > * If you're a glutton for punishment, try CLOBBER_CACHE_RECURSIVELY. This > * slows things by at least a factor of 10000, so I wouldn't suggest > * trying to run the entire regression tests that way.It's useful to try > * a few simple tests, to make sure that cache reload isn't subject to > * internal cache-flush hazards, but after you've done a few thousand > * recursive reloads it's unlikely you'll learn more. > */ > Yes, I've seen that. Frankly, a test that takes something like 500 hours is a bit crazy. If we really want to run this in the buildfarm we should probably try to create a massively cut down test schedule for use in this case. Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining CLOBBER_FREED_MEMORY? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Incidentally, should the CLOBBER_CACHE_ALWAYS machines also be defining > CLOBBER_FREED_MEMORY? The former does need the latter or it's not very thorough. However, CLOBBER_FREED_MEMORY is defined automatically by --enable-cassert, so you shouldn't need to use a -D switch for it. regards, tom lane
On 05/15/2014 07:46 PM, Andrew Dunstan wrote: > > On 05/15/2014 12:43 PM, Tomas Vondra wrote: >> Hi all, >> >> today I got a few of errors like these (this one is from last week, >> though): >> >> Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT >> Content: >> snapshot to old: Wed May 7 04:36:57 2014 GMT >> >> on the new buildfarm animals. I believe it was my mistake (incorrectly >> configured local git mirror), but it got me thinking about how this will >> behave with the animals running CLOBBER_CACHE_RECURSIVELY. >> >> If I understand the Perl code correctly, it does this: >> >> (1) update the repository >> (2) run the tests >> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188) >> (4) fail if older >> >> Now, imagine that the test runs for days/weeks. This pretty much means >> it's wasted, because the results will be thrown away anyway, no? >> > > > The 24 hours runs from the time of the latest commit on the branch in > question, not the current time, but basically yes. > > We've never had machines with runs that long. The longest in recent > times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes > around 4.5 hours. But we have had misconfigured machines reporting > unbelievable snapshot times. I'll take a look and see if we can tighten > up the sanity check. It's worth noting that one thing friarbird does is > skip the install-check stage - it's almost certainly not going to have > terribly much interesting to tell us from that, given it has already run > a plain "make check". well I'm not sure about about "misconfigured" but both my personal buildfarm members and pginfra run ones (like gaibasaurus) got errors complaining about "snapshot too old" in the past for long running tests so I'm not sure it is really a "we never had machine with runs that long". So maybe we should not reject those submissions at submission time but rather mark them clearly on the dashboard and leave the final interpretation to a human... Stefan
On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote: > On 05/15/2014 07:46 PM, Andrew Dunstan wrote: >> On 05/15/2014 12:43 PM, Tomas Vondra wrote: >>> Hi all, >>> >>> today I got a few of errors like these (this one is from last week, >>> though): >>> >>> Status Line: 493 snapshot too old: Wed May 7 04:36:57 2014 GMT >>> Content: >>> snapshot to old: Wed May 7 04:36:57 2014 GMT >>> >>> on the new buildfarm animals. I believe it was my mistake (incorrectly >>> configured local git mirror), but it got me thinking about how this will >>> behave with the animals running CLOBBER_CACHE_RECURSIVELY. >>> >>> If I understand the Perl code correctly, it does this: >>> >>> (1) update the repository >>> (2) run the tests >>> (3) check that the snapshot is not older than 24 hours (pgstatus.pl:188) >>> (4) fail if older >>> >>> Now, imagine that the test runs for days/weeks. This pretty much means >>> it's wasted, because the results will be thrown away anyway, no? >>> >> >> The 24 hours runs from the time of the latest commit on the branch in >> question, not the current time, but basically yes. >> >> We've never had machines with runs that long. The longest in recent >> times has been friarbird, which runs CLOBBER_CACHE_ALWAYS and takes >> around 4.5 hours. But we have had misconfigured machines reporting >> unbelievable snapshot times. I'll take a look and see if we can tighten >> up the sanity check. It's worth noting that one thing friarbird does is >> skip the install-check stage - it's almost certainly not going to have >> terribly much interesting to tell us from that, given it has already run >> a plain "make check". > well I'm not sure about about "misconfigured" but both my personal > buildfarm members and pginfra run ones (like gaibasaurus) got errors > complaining about "snapshot too old" in the past for long running tests > so I'm not sure it is really a "we never had machine with runs that > long". So maybe we should not reject those submissions at submission > time but rather mark them clearly on the dashboard and leave the final > interpretation to a human... > > That's a LOT harder and more work to arrange. Frankly, there are more important things to do. I would like to know the circumstances of these very long runs. I drive some of my VMs pretty hard on pretty modest hardware, and they don't come close to running 24 hours. The current behaviour goes back to this commit from December 2011: commit a8b5049e64f9cb08f8e165d0737139dab74e3bce Author: Andrew Dunstan <andrew@dunslane.net> Date: Wed Dec 14 14:38:442011 -0800 Use git snapshot instead of fixed 10 day timeout. The sanity checks made sure that an animal wasn't submitting a snapshot that was too old. But sometimes anold branch doesn't get any changes for more than 10 days. So accept a snapshot that is not more than 1 dayolder than the last known snapshot. Per complaint from Stefan. I'm prepared to increase the sanity check time if there is a serious demand for it, but I'd like to know what to increase it to. cheers andrew
On 15.5.2014 22:56, Andrew Dunstan wrote: > > On 05/15/2014 04:30 PM, Stefan Kaltenbrunner wrote: > >> well I'm not sure about about "misconfigured" but both my personal >> buildfarm members and pginfra run ones (like gaibasaurus) got errors >> complaining about "snapshot too old" in the past for long running tests >> so I'm not sure it is really a "we never had machine with runs that >> long". So maybe we should not reject those submissions at submission >> time but rather mark them clearly on the dashboard and leave the final >> interpretation to a human... >> > > That's a LOT harder and more work to arrange. Frankly, there are more > important things to do. > > I would like to know the circumstances of these very long runs. I drive > some of my VMs pretty hard on pretty modest hardware, and they don't > come close to running 24 hours. > > The current behaviour goes back to this commit from December 2011: > > commit a8b5049e64f9cb08f8e165d0737139dab74e3bce > Author: Andrew Dunstan <andrew@dunslane.net> > Date: Wed Dec 14 14:38:44 2011 -0800 > > Use git snapshot instead of fixed 10 day timeout. > > The sanity checks made sure that an animal wasn't submitting a > snapshot that was too old. But sometimes an old branch doesn't > get any changes for more than 10 days. So accept a snapshot that > is not more than 1 day older than the last known snapshot. Per > complaint from Stefan. > > > I'm prepared to increase the sanity check time if there is a serious > demand for it, but I'd like to know what to increase it to. I doubt there's no "one size fits all" limit. If the machines running "recursive clobber" tests need tens of days to complete the tests. then that limit is pretty useless to most regular animals. So what about keeping the current value for most animals, but allowing an override for some selected ones? I'd expect this to be much simpler to implement, and it shouldn't require any human intervention. Tomas
On 15.5.2014 22:07, Andrew Dunstan wrote: > > Yes, I've seen that. Frankly, a test that takes something like 500 > hours is a bit crazy. Maybe. It certainly is not a test people will use during development. But if it can detect some hard-to-find errors in the code, that might possibly lead to serious problems, then +1 from me to run them at least on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. Also, once you know where it fails the developer can run just that single test (which might take minutes/hours, but not days). > If we really want to run this in the buildfarm we should probably > try to create a massively cut down test schedule for use in this > case. If we can run cut this down in a meaningful way (i.e. without sacrificing most of the benefits) then sure - let's do that. But I think that's what CLOBBER_CACHE_ALWAYS is about. regards Tomas
On 05/15/2014 07:47 PM, Tomas Vondra wrote: > On 15.5.2014 22:07, Andrew Dunstan wrote: >> Yes, I've seen that. Frankly, a test that takes something like 500 >> hours is a bit crazy. > Maybe. It certainly is not a test people will use during development. > But if it can detect some hard-to-find errors in the code, that might > possibly lead to serious problems, then +1 from me to run them at least > on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. > > Also, once you know where it fails the developer can run just that > single test (which might take minutes/hours, but not days). I have made a change that omits the snapshot sanity check for CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675> cheers andrew
On 17.5.2014 19:58, Andrew Dunstan wrote: > > On 05/15/2014 07:47 PM, Tomas Vondra wrote: >> On 15.5.2014 22:07, Andrew Dunstan wrote: >>> Yes, I've seen that. Frankly, a test that takes something like 500 >>> hours is a bit crazy. >> Maybe. It certainly is not a test people will use during development. >> But if it can detect some hard-to-find errors in the code, that might >> possibly lead to serious problems, then +1 from me to run them at least >> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. >> >> Also, once you know where it fails the developer can run just that >> single test (which might take minutes/hours, but not days). > > > > I have made a change that omits the snapshot sanity check for > CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See > <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675> OK, thanks. Seems reasonable. Tomas
On 17.5.2014 22:35, Tomas Vondra wrote: > On 17.5.2014 19:58, Andrew Dunstan wrote: >> >> On 05/15/2014 07:47 PM, Tomas Vondra wrote: >>> On 15.5.2014 22:07, Andrew Dunstan wrote: >>>> Yes, I've seen that. Frankly, a test that takes something like 500 >>>> hours is a bit crazy. >>> Maybe. It certainly is not a test people will use during development. >>> But if it can detect some hard-to-find errors in the code, that might >>> possibly lead to serious problems, then +1 from me to run them at least >>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. >>> >>> Also, once you know where it fails the developer can run just that >>> single test (which might take minutes/hours, but not days). >> >> >> >> I have made a change that omits the snapshot sanity check for >> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See >> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675> > > OK, thanks. Seems reasonable. Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals. The problem is that the git mirror is refreshed only at the very beginning, and while a single branch does not exceed the limit, all the branches do. Could this be solved by keeping a local mirror, without a mirror in the build root? I mean, something like this: git_keep_mirror => 0, scmrepo => '/path/to/local/mirror' And of course a cron script updating the mirror every hour or so. regards Tomas
On 05/19/2014 03:40 PM, Tomas Vondra wrote: > On 17.5.2014 22:35, Tomas Vondra wrote: >> On 17.5.2014 19:58, Andrew Dunstan wrote: >>> On 05/15/2014 07:47 PM, Tomas Vondra wrote: >>>> On 15.5.2014 22:07, Andrew Dunstan wrote: >>>>> Yes, I've seen that. Frankly, a test that takes something like 500 >>>>> hours is a bit crazy. >>>> Maybe. It certainly is not a test people will use during development. >>>> But if it can detect some hard-to-find errors in the code, that might >>>> possibly lead to serious problems, then +1 from me to run them at least >>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. >>>> >>>> Also, once you know where it fails the developer can run just that >>>> single test (which might take minutes/hours, but not days). >>> >>> >>> I have made a change that omits the snapshot sanity check for >>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See >>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675> >> OK, thanks. Seems reasonable. > Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals. > The problem is that the git mirror is refreshed only at the very > beginning, and while a single branch does not exceed the limit, all the > branches do. > > Could this be solved by keeping a local mirror, without a mirror in the > build root? I mean, something like this: > > git_keep_mirror => 0, > scmrepo => '/path/to/local/mirror' > > And of course a cron script updating the mirror every hour or so. > No, the git mirror should be refreshed at the start of each branch build. It's not done at all by run_branches.pl. So your premise is false. This should only happen if the actual run takes more than 24 hours. cheers andrew
On 19.5.2014 23:04, Andrew Dunstan wrote: > > On 05/19/2014 03:40 PM, Tomas Vondra wrote: >> On 17.5.2014 22:35, Tomas Vondra wrote: >>> On 17.5.2014 19:58, Andrew Dunstan wrote: >>>> On 05/15/2014 07:47 PM, Tomas Vondra wrote: >>>>> On 15.5.2014 22:07, Andrew Dunstan wrote: >>>>>> Yes, I've seen that. Frankly, a test that takes something like 500 >>>>>> hours is a bit crazy. >>>>> Maybe. It certainly is not a test people will use during development. >>>>> But if it can detect some hard-to-find errors in the code, that might >>>>> possibly lead to serious problems, then +1 from me to run them at >>>>> least >>>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO. >>>>> >>>>> Also, once you know where it fails the developer can run just that >>>>> single test (which might take minutes/hours, but not days). >>>> >>>> >>>> I have made a change that omits the snapshot sanity check for >>>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See >>>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675> >>>> >>> OK, thanks. Seems reasonable. >> Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals. >> The problem is that the git mirror is refreshed only at the very >> beginning, and while a single branch does not exceed the limit, all the >> branches do. >> >> Could this be solved by keeping a local mirror, without a mirror in the >> build root? I mean, something like this: >> >> git_keep_mirror => 0, >> scmrepo => '/path/to/local/mirror' >> >> And of course a cron script updating the mirror every hour or so. >> > > No, the git mirror should be refreshed at the start of each branch > build. It's not done at all by run_branches.pl. So your premise is > false. This should only happen if the actual run takes more than 24 hours. OK. I think I understand what's wrong. This is a summary of log from 'leech' (attached is the full log): ======================================================================== Sat May 17 23:43:24 2014: buildfarm run for leech:REL8_4_STABLE starting [23:43:24] checking out source ... ... [04:32:47] OK Sun May 18 04:32:51 2014: buildfarm run for leech:REL9_0_STABLE starting [04:32:51] checking out source ... ... [08:58:57] OK Sun May 18 08:59:01 2014: buildfarm run for leech:REL9_1_STABLE starting [08:59:01] checking out source ... ... [14:09:08] OK Sun May 18 14:09:12 2014: buildfarm run for leech:REL9_2_STABLE starting [14:09:12] checking out source ... ... [00:13:59] OK Mon May 19 00:14:04 2014: buildfarm run for leech:REL9_3_STABLE starting [00:14:04] checking out source ... ... [14:26:29] OK Query for: stage=OK&animal=leech&ts=1400451244 Target: http://www.pgbuildfarm.org/cgi-bin/pgstatus.pl/28116b975a4186275d83f7e0f5c3fc92b1a75e85 Status Line: 493 snapshot too old: Fri May 16 07:05:50 2014 GMT Content: snapshot to old: Fri May 16 07:05:50 2014 GMT Web txn failed with status: 1 Mon May 19 14:26:31 2014: buildfarm run for leech:HEAD starting [14:26:31] checking out source ... ... [20:07:23] checking test-decoding Buildfarm member leech failed on HEAD stage test-decoding-check ======================================================================== So the REL9_3_STABLE starts at 00:14, completes at 14:26 (i.e. 14h runtime), but fails with 'snapshot too old'. These are the commits for each branch: REL8_4_STABLE => 1dd0b3eeccaffd33b9c970a91c53fe42692ce8c2 (May 15, 2014) REL9_0_STABLE => 0fc94340753f19da8acca5fc53039adbf2fa3632 (May 16, 2014) REL9_1_STABLE => 39b3739c05688b5cd5d5da8c52fa5476304eff11 (May 16, 2014) REL9_2_STABLE => 0d4c75f4de68012bb6f3bc52ebb58234334259d2 (May 16, 2014) REL9_3_STABLE => d6a9767404cfee7f037a58e445b601af5837e4a5 (May 16, 2014) HEAD => f097d70b7227c1f9aa2ed0af1d6deb05367ef657 (May 19, 2014) IMHO the problem is that d6a97674 was the last revision in the REL9_3_STABLE branch when the test started (00:14), but at 06:06 777d07d7 got committed. So the check at the end failed, because the tested revision was suddenly ~2 days over the limit. This seems wrong to me, because even a very fast test including the commit (e.g. starting at 06:00, finishing at 06:10) would fail exactly like this. This is more probable on the old stable branches, because the commits are not that frequent (on HEAD the commits are usually less than a few hours apart, so the new one won't obsolete the previous one). It's also made more likely to hit by the long runtime, because it increases the probability something will be committed into the branch. And it also makes it more "expensive" because it effectively throws all the cpu time to /dev/null. regards Tomas
Вложения
On 05/19/2014 05:37 PM, Tomas Vondra wrote: > IMHO the problem is that d6a97674 was the last revision in the > REL9_3_STABLE branch when the test started (00:14), but at 06:06 > 777d07d7 got committed. So the check at the end failed, because the > tested revision was suddenly ~2 days over the limit. > > This seems wrong to me, because even a very fast test including the > commit (e.g. starting at 06:00, finishing at 06:10) would fail exactly > like this. > > This is more probable on the old stable branches, because the commits > are not that frequent (on HEAD the commits are usually less than a few > hours apart, so the new one won't obsolete the previous one). It's also > made more likely to hit by the long runtime, because it increases the > probability something will be committed into the branch. And it also > makes it more "expensive" because it effectively throws all the cpu time > to /dev/null. > > Well, the original code was put in for a reason, presumably that we were getting some stale data and wanted to exclude it. So I'm unwilling to throw it out altogether. If someone can propose a reasonable sanity check then I'm prepared to implement it. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Well, the original code was put in for a reason, presumably that we were > getting some stale data and wanted to exclude it. So I'm unwilling to > throw it out altogether. If someone can propose a reasonable sanity > check then I'm prepared to implement it. Would it be practical to make the timeout configurable on a per-animal basis? A small number of hours is certainly sufficient for most of the critters, but for ones that are particularly slow or are doing CLOBBER type tests, we could raise the limit. regards, tom lane
On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > Well, the original code was put in for a reason, presumably that we were > getting some stale data and wanted to exclude it. So I'm unwilling to throw > it out altogether. If someone can propose a reasonable sanity check then I'm > prepared to implement it. While I generally agree that long-established code shouldn't be changed for light or transient causes, I have to admit I'm pretty skeptical about this particular instance. I can't think of any particularly compelling reason why it's BAD for an old result to show up. We now show the commit ID on the main page, so if you see 512abc4 in the middle of a bunch of ef9ab5f's, you'll notice. And if you don't notice, so what? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >> Well, the original code was put in for a reason, presumably that we were >> getting some stale data and wanted to exclude it. So I'm unwilling to throw >> it out altogether. If someone can propose a reasonable sanity check then I'm >> prepared to implement it. > While I generally agree that long-established code shouldn't be > changed for light or transient causes, I have to admit I'm pretty > skeptical about this particular instance. I can't think of any > particularly compelling reason why it's BAD for an old result to show > up. We now show the commit ID on the main page, so if you see 512abc4 > in the middle of a bunch of ef9ab5f's, you'll notice. And if you > don't notice, so what? Robert's got a point here. In my usage, the annoying thing is not animals that take a long time to report in; it's the ones that lie about the snapshot time (which is how you get "512abc4 in the middle of a bunch of ef9ab5f's"). That is an issue of incorrect system clock, not of how long it takes to do the run. I wonder if the buildfarm script could be taught to get the timestamp from an NTP server somewhere? Or at least sanity-check the system clock reading by comparing it to the newest commit timestamp in the git repo. regards, tom lane
On 05/20/2014 07:09 AM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, May 19, 2014 at 7:58 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >>> Well, the original code was put in for a reason, presumably that we were >>> getting some stale data and wanted to exclude it. So I'm unwilling to throw >>> it out altogether. If someone can propose a reasonable sanity check then I'm >>> prepared to implement it. >> While I generally agree that long-established code shouldn't be >> changed for light or transient causes, I have to admit I'm pretty >> skeptical about this particular instance. I can't think of any >> particularly compelling reason why it's BAD for an old result to show >> up. We now show the commit ID on the main page, so if you see 512abc4 >> in the middle of a bunch of ef9ab5f's, you'll notice. And if you >> don't notice, so what? > Robert's got a point here. In my usage, the annoying thing is not animals > that take a long time to report in; it's the ones that lie about the > snapshot time (which is how you get "512abc4 in the middle of a bunch of > ef9ab5f's"). That is an issue of incorrect system clock, not of how long > it takes to do the run. I wonder if the buildfarm script could be taught > to get the timestamp from an NTP server somewhere? Or at least > sanity-check the system clock reading by comparing it to the newest commit > timestamp in the git repo. Showing the commit id is a relatively recent phenomenon, dating back to July 2013. I agree with Robert that it might obsolete this check, so I have disabled it for now. I have also disabled the other timestamp check, on the time the client actually took the snapshot (as opposed to the time of the last commit in the snapshot) for the CLOBBER_CACHE_RECURSIVELY case. Regarding clock skew, I think we can do better then what you suggest. The web transaction code in the client adds its own timestamp just before running the web transaction. It would be quite reasonable to reject reports from machines with skewed clocks based on this value. I'm not sure what a reasonable skew might be. Somewhere in the range of 5 to 15 minutes seems reasonable. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > On 05/20/2014 07:09 AM, Tom Lane wrote: >> Robert's got a point here. In my usage, the annoying thing is not animals >> that take a long time to report in; it's the ones that lie about the >> snapshot time (which is how you get "512abc4 in the middle of a bunch of >> ef9ab5f's"). That is an issue of incorrect system clock, not of how long >> it takes to do the run. I wonder if the buildfarm script could be taught >> to get the timestamp from an NTP server somewhere? Or at least >> sanity-check the system clock reading by comparing it to the newest commit >> timestamp in the git repo. > Regarding clock skew, I think we can do better then what you suggest. > The web transaction code in the client adds its own timestamp just > before running the web transaction. It would be quite reasonable to > reject reports from machines with skewed clocks based on this value. I'm > not sure what a reasonable skew might be. Somewhere in the range of 5 to > 15 minutes seems reasonable. Rather than reject, why not take the result and adjust the claimed start timestamp by the difference between the web transaction timestamp and the buildfarm server's time? regards, tom lane
<div class="moz-cite-prefix">On 21/05/14 01:42, Tom Lane wrote:<br /></div><blockquote cite="mid:24480.1400593335@sss.pgh.pa.us"type="cite"><pre wrap="">Andrew Dunstan <a class="moz-txt-link-rfc2396E" href="mailto:andrew@dunslane.net"><andrew@dunslane.net></a>writes: </pre><blockquote type="cite"><pre wrap="">On 05/20/2014 07:09 AM, Tom Lane wrote: </pre><blockquote type="cite"><pre wrap="">Robert's got a point here. In my usage, the annoying thing is not animals that take a long time to report in; it's the ones that lie about the snapshot time (which is how you get "512abc4 in the middle of a bunch of ef9ab5f's"). That is an issue of incorrect system clock, not of how long it takes to do the run. I wonder if the buildfarm script could be taught to get the timestamp from an NTP server somewhere? Or at least sanity-check the system clock reading by comparing it to the newest commit timestamp in the git repo. </pre></blockquote></blockquote><pre wrap=""> </pre><blockquote type="cite"><pre wrap="">Regarding clock skew, I think we can do better then what you suggest. The web transaction code in the client adds its own timestamp just before running the web transaction. It would be quite reasonable to reject reports from machines with skewed clocks based on this value. I'm not sure what a reasonable skew might be. Somewhere in the range of 5 to 15 minutes seems reasonable. </pre></blockquote><pre wrap=""> Rather than reject, why not take the result and adjust the claimed start timestamp by the difference between the web transaction timestamp and the buildfarm server's time? regards, tom lane </pre></blockquote> I think, that if possible, any such adjustment should be noted along with the original time, so that:<br/><ol><li>the timing issue can be remedied<li>it is possible to link the output to any messages in the machines logetc.<br /></ol><br /> Cheers,<br /> Gavin<br />
On 05/20/2014 03:59 PM, Gavin Flower wrote: > On 21/05/14 01:42, Tom Lane wrote: >> Andrew Dunstan<andrew@dunslane.net> writes: >>> On 05/20/2014 07:09 AM, Tom Lane wrote: >>>> Robert's got a point here. In my usage, the annoying thing is not animals >>>> that take a long time to report in; it's the ones that lie about the >>>> snapshot time (which is how you get "512abc4 in the middle of a bunch of >>>> ef9ab5f's"). That is an issue of incorrect system clock, not of how long >>>> it takes to do the run. I wonder if the buildfarm script could be taught >>>> to get the timestamp from an NTP server somewhere? Or at least >>>> sanity-check the system clock reading by comparing it to the newest commit >>>> timestamp in the git repo. >>> Regarding clock skew, I think we can do better then what you suggest. >>> The web transaction code in the client adds its own timestamp just >>> before running the web transaction. It would be quite reasonable to >>> reject reports from machines with skewed clocks based on this value. I'm >>> not sure what a reasonable skew might be. Somewhere in the range of 5 to >>> 15 minutes seems reasonable. >> Rather than reject, why not take the result and adjust the claimed start >> timestamp by the difference between the web transaction timestamp and the >> buildfarm server's time? >> >> >> > I think, that if possible, any such adjustment should be noted along > with the original time, so that: > > 1. the timing issue can be remedied > 2. it is possible to link the output to any messages in the machines > log etc. > I don't see how that's going to help anyone. Major clock skew is a sign of client misconfiguration. And where would we note this adjustment, and who would do anything about it? We seem to be engaging in a sort of PoohBearism* here. More information is not always better. cheers andrew * Rabbit: Would you like *condensed milk*, or *honey* on your bread? Winnie the Pooh: Both. But never mind the bread,please.
On 05/20/2014 09:42 AM, Tom Lane wrote: >> Regarding clock skew, I think we can do better then what you suggest. >> The web transaction code in the client adds its own timestamp just >> before running the web transaction. It would be quite reasonable to >> reject reports from machines with skewed clocks based on this value. I'm >> not sure what a reasonable skew might be. Somewhere in the range of 5 to >> 15 minutes seems reasonable. > Rather than reject, why not take the result and adjust the claimed start > timestamp by the difference between the web transaction timestamp and the > buildfarm server's time? Right now I have added some logging of clock skew to see what we're actually getting. It won't be visible in the Web UI, but when I have some data I will post more info. cheers andrew