Обсуждение: MSVC SSL test failure
I am getting this test failure 001_ssltests.pl on my test MSVC system when SSL tests are enabled: not ok 110 - certificate authorization fails with revoked client cert with server-side CRL directory: matches # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' # at t/001_ssltests.pl line 618. # 'psql: error: connection to server at "127.0.0.1", port 59491 failed: server closed the connectionunexpectedly # This probably means the server terminated abnormally # before or while processing the request. # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request.' # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' There's nothing terribly suspicious in the server log, so I'm not quite sure what's going on here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > I am getting this test failure 001_ssltests.pl on my test MSVC system > when SSL tests are enabled: > not ok 110 - certificate authorization fails with revoked client cert with server-side CRL directory: matches > # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' > # at t/001_ssltests.pl line 618. > # 'psql: error: connection to server at "127.0.0.1", port 59491 failed: server closed the connectionunexpectedly > # This probably means the server terminated abnormally > # before or while processing the request. > # server closed the connection unexpectedly > # This probably means the server terminated abnormally > # before or while processing the request. > # server closed the connection unexpectedly > # This probably means the server terminated abnormally > # before or while processing the request.' > # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' > There's nothing terribly suspicious in the server log, so I'm not quite > sure what's going on here. Hmm .. does enabling log_connections/log_disconnections produce any useful info? This looks quite a bit like the sort of failure that commit 6051857fc was meant to forestall. I wonder whether reverting that commit changes the results? You might also try inserting a shutdown() call, as we'd decided not to do [1]. regards, tom lane [1] https://www.postgresql.org/message-id/1283317.1638213407%40sss.pgh.pa.us
> On 5 Dec 2021, at 18:03, Andrew Dunstan <andrew@dunslane.net> wrote: > I am getting this test failure 001_ssltests.pl on my test MSVC system > when SSL tests are enabled: > > not ok 110 - certificate authorization fails with revoked client cert with server-side CRL directory: matches > > # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' > # at t/001_ssltests.pl line 618. > # 'psql: error: connection to server at "127.0.0.1", port 59491 failed: server closed the connectionunexpectedly > # This probably means the server terminated abnormally > # before or while processing the request. > # server closed the connection unexpectedly > # This probably means the server terminated abnormally > # before or while processing the request. > # server closed the connection unexpectedly > # This probably means the server terminated abnormally > # before or while processing the request.' > # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' > > There's nothing terribly suspicious in the server log, so I'm not quite > sure what's going on here. Trying out HEAD as of 20 minutes ago in Andres' MSVC setup on Cirrus CI I'm unable to replicate this. Did you remake the keys/certs or are they the files from the repo? Which version of OpenSSL is this using? -- Daniel Gustafsson https://vmware.com/
On 12/5/21 15:14, Daniel Gustafsson wrote: >> On 5 Dec 2021, at 18:03, Andrew Dunstan <andrew@dunslane.net> wrote: >> I am getting this test failure 001_ssltests.pl on my test MSVC system >> when SSL tests are enabled: >> >> not ok 110 - certificate authorization fails with revoked client cert with server-side CRL directory: matches >> >> # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' >> # at t/001_ssltests.pl line 618. >> # 'psql: error: connection to server at "127.0.0.1", port 59491 failed: server closed the connectionunexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request. >> # server closed the connection unexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request. >> # server closed the connection unexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request.' >> # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' >> >> There's nothing terribly suspicious in the server log, so I'm not quite >> sure what's going on here. > Trying out HEAD as of 20 minutes ago in Andres' MSVC setup on Cirrus CI I'm > unable to replicate this. Did you remake the keys/certs or are they the files > from the repo? Which version of OpenSSL is this using? > It's WS2019 Build 1809, VS2019, openssl 1.1.1 (via chocolatey). Can you show me the cirrus.yml file you're using to test with? A URL ref to the results would also help. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
> On 5 Dec 2021, at 23:44, Andrew Dunstan <andrew@dunslane.net> wrote: > Can you show me the cirrus.yml file you're using to test with? I used the 0001 patch from this thread: https://www.postgresql.org/message-id/20211101055720.7mzwtkhzxmorpxth%40alap3.anarazel.de > A URL ref to the results would also help. I can't vouch for how helpful it is, but below is the build info that exists: https://cirrus-ci.com/task/5358152892809216 -- Daniel Gustafsson https://vmware.com/
On 12/5/21 12:50, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I am getting this test failure 001_ssltests.pl on my test MSVC system >> when SSL tests are enabled: >> not ok 110 - certificate authorization fails with revoked client cert with server-side CRL directory: matches >> # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' >> # at t/001_ssltests.pl line 618. >> # 'psql: error: connection to server at "127.0.0.1", port 59491 failed: server closed the connectionunexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request. >> # server closed the connection unexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request. >> # server closed the connection unexpectedly >> # This probably means the server terminated abnormally >> # before or while processing the request.' >> # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' >> There's nothing terribly suspicious in the server log, so I'm not quite >> sure what's going on here. > Hmm .. does enabling log_connections/log_disconnections produce > any useful info? No, that's already turned on (this test is run using a standard buildfarm client). > > This looks quite a bit like the sort of failure that commit > 6051857fc was meant to forestall. I wonder whether reverting > that commit changes the results? You might also try inserting > a shutdown() call, as we'd decided not to do [1]. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/1283317.1638213407%40sss.pgh.pa.us Commenting out the closesocket() worked. I will look at shutdown(). cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 12/5/21 12:50, Tom Lane wrote: >> This looks quite a bit like the sort of failure that commit >> 6051857fc was meant to forestall. I wonder whether reverting >> that commit changes the results? You might also try inserting >> a shutdown() call, as we'd decided not to do [1]. > Commenting out the closesocket() worked. Man, that's annoying. Apparently OpenSSL is doing something to screw up the shutdown sequence. According to [1], the graceful shutdown sequence will happen by default, but changing SO_LINGER or SO_DONTLINGER can get you into abortive shutdown anyway. Maybe they change one of those settings (why?) regards, tom lane [1] https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
> On 6 Dec 2021, at 07:02, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: >> On 12/5/21 12:50, Tom Lane wrote: >>> This looks quite a bit like the sort of failure that commit >>> 6051857fc was meant to forestall. I wonder whether reverting >>> that commit changes the results? You might also try inserting >>> a shutdown() call, as we'd decided not to do [1]. > >> Commenting out the closesocket() worked. > > Man, that's annoying. Apparently OpenSSL is doing something to > screw up the shutdown sequence. According to [1], the graceful > shutdown sequence will happen by default, but changing SO_LINGER > or SO_DONTLINGER can get you into abortive shutdown anyway. > Maybe they change one of those settings (why?) AFAICT they don't touch either, and nothing really sticks out looking at setsockopt calls in either 1.1.1 or 3.0.0. -- Daniel Gustafsson https://vmware.com/
On 12/6/21 01:02, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 12/5/21 12:50, Tom Lane wrote: >>> This looks quite a bit like the sort of failure that commit >>> 6051857fc was meant to forestall. I wonder whether reverting >>> that commit changes the results? You might also try inserting >>> a shutdown() call, as we'd decided not to do [1]. >> Commenting out the closesocket() worked. > Man, that's annoying. Apparently OpenSSL is doing something to > screw up the shutdown sequence. According to [1], the graceful > shutdown sequence will happen by default, but changing SO_LINGER > or SO_DONTLINGER can get you into abortive shutdown anyway. > Maybe they change one of those settings (why?) > > regards, tom lane > > [1] https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2 Yeah, quite annoying, especially because only some combinations of MSVC runtime / openssl version seem to trigger the problem. Adding a shutdown() before the closesocket() also fixes the issue. <https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-shutdown> says: To assure that all data is sent and received on a connected socket before it is closed, an application should use shutdown to close connection before calling closesocket. ... Note The shutdown function does not block regardless of the SO_LINGER setting on the socket. Since we're not expecting anything back from the client I don't think we need any of the recv calls the recipes there suggest. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
> On 6 Dec 2021, at 15:56, Andrew Dunstan <andrew@dunslane.net> wrote: > Yeah, quite annoying, especially because only some combinations of MSVC > runtime / openssl version seem to trigger the problem. > > Adding a shutdown() before the closesocket() also fixes the issue. If you have a patch you're testing I'm happy to run it through Cirrus in as many combinations of MSVC and OpenSSL as I can muster there. -- Daniel Gustafsson https://vmware.com/
Hello Andrew, 06.12.2021 17:56, Andrew Dunstan wrote: > Yeah, quite annoying, especially because only some combinations of MSVC > runtime / openssl version seem to trigger the problem. > > > Adding a shutdown() before the closesocket() also fixes the issue. > Can you confirm that adding shutdown(MyProcPort->sock, SD_BOTH) fixes the issue for many test runs? I don't see the stable test passing here. Without shutdown() the test failed on iterations 1, 5, 4, but with shutdown() it failed too, on iterations 3, 1, 4. Without close() and shutdown() the test passes 20 iterations. (I'm still researching how openssl affects the shutdown sequence.) Best regards, Alexander
On 12/6/21 10:30, Alexander Lakhin wrote: > Hello Andrew, > 06.12.2021 17:56, Andrew Dunstan wrote: >> Yeah, quite annoying, especially because only some combinations of MSVC >> runtime / openssl version seem to trigger the problem. >> >> >> Adding a shutdown() before the closesocket() also fixes the issue. >> > Can you confirm that adding shutdown(MyProcPort->sock, SD_BOTH) fixes > the issue for many test runs? > I don't see the stable test passing here. > Without shutdown() the test failed on iterations 1, 5, 4, but with > shutdown() it failed too, on iterations 3, 1, 4. > Without close() and shutdown() the test passes 20 iterations. > (I'm still researching how openssl affects the shutdown sequence.) I have been getting 100% failures on the SSL tests with closesocket() alone, and 100% success over 10 tests with this: diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 96ab37c7d0..5998c089b0 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -295,6 +295,7 @@ socket_close(int code, Datum arg) * Windows too. But it's a lot more fragile than the other way. */ #ifdef WIN32 + shutdown(MyProcPort->sock, SD_SEND); closesocket(MyProcPort->sock); #endif That said, your results are quite worrying. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
06.12.2021 23:51, Andrew Dunstan wrote: > I have been getting 100% failures on the SSL tests with closesocket() > alone, and 100% success over 10 tests with this: > > > diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c > index 96ab37c7d0..5998c089b0 100644 > --- a/src/backend/libpq/pqcomm.c > +++ b/src/backend/libpq/pqcomm.c > @@ -295,6 +295,7 @@ socket_close(int code, Datum arg) > * Windows too. But it's a lot more fragile than the other way. > */ > #ifdef WIN32 > + shutdown(MyProcPort->sock, SD_SEND); > closesocket(MyProcPort->sock); > #endif > > > That said, your results are quite worrying. My next results are following: It seems that the test failure rate may depend on the specs/environment. With close-only version, having limited CPU usage for my Windows VM to 20%, I've got failures on iterations 10, 2, 1. With 100% CPU I've seen 20 successful runs, then fails on iterations 5, 2. clean&buid and then failed iterations 11, 6, 3. (So maybe caching is another factor.) shutdown(MyProcPort->sock, SD_SEND) apparently fixes the issue, I've got 83 successful runs, but then iteration 84 unfortunately failed: t/001_ssltests.pl .. 106/110 # Failed test 'intermediate client certificate is missing: matches' # at t/001_ssltests.pl line 608. # 'psql: error: connection to server at "127.0.0.1", port 63187 failed: could not receive data from server: Software caused connection abort (0x00002745/10053) # SSL SYSCALL error: Software caused connection abort (0x00002745/10053) # could not send startup packet: No error (0x00000000/0)' # doesn't match '(?^:SSL error: tlsv1 alert unknown ca)' # Looks like you failed 1 test of 110. t/001_ssltests.pl .. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/110 subtests (less 2 skipped subtests: 107 okay) It's not that one that we observed with close-only fix, but it still worrying. And then exactly this fail occurred again, on iteration 8. But "fortunately" I've got the same fail as before: t/001_ssltests.pl .. 106/110 # Failed test 'certificate authorization fails with revoked client cert with server-side CRL directory: matches' # at t/001_ssltests.pl line 618. # 'psql: error: connection to server at "127.0.0.1", port 59220 failed: server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request.' # doesn't match '(?^:SSL error: sslv3 alert certificate revoked)' # Looks like you failed 1 test of 110. t/001_ssltests.pl .. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/110 subtests (less 2 skipped subtests: 107 okay) on 145-th iteration of the test even without close() (I've tried to check whether the aforementioned fail existed before the fix). So probably we found a practical evidence of shutdown() importance we missed before, but it's not the end. There was some test instability even without the close() fix and it remains with the shutdown(...SD_SEND). By the way, while exploring openssl' behavior, I found that SSL_shutdown() has it's own quirks (see [1], return value 0). Maybe now we've encountered one of these. Best regards, Alexander [1] https://www.openssl.org/docs/man3.0/man3/SSL_shutdown.html
Alexander Lakhin <exclusion@gmail.com> writes: > It seems that the test failure rate may depend on the specs/environment. No surprise there, since the issue is almost surely timing-dependent. > shutdown(MyProcPort->sock, SD_SEND) apparently fixes the issue, I've got > 83 successful runs, but then iteration 84 unfortunately failed: > t/001_ssltests.pl .. 106/110 > # Failed test 'intermediate client certificate is missing: matches' > # at t/001_ssltests.pl line 608. > # 'psql: error: connection to server at "127.0.0.1", > port 63187 failed: could not receive data from server: Software caused > connection abort (0x00002745/10053) > # SSL SYSCALL error: Software caused connection abort (0x00002745/10053) > # could not send startup packet: No error (0x00000000/0)' > # doesn't match '(?^:SSL error: tlsv1 alert unknown ca)' > # Looks like you failed 1 test of 110. Hmm. I wonder whether using SD_BOTH behaves any differently. regards, tom lane
Hello Tom, 07.12.2021 19:25, Tom Lane wrote: > Hmm. I wonder whether using SD_BOTH behaves any differently. With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on iterations 1, 2, 3, 16 (just as without shutdown() at all). So shutdown with the SD_SEND flag definitely behaves much better (I've seen over 200 successful iterations). Best regards, Alexander
Alexander Lakhin <exclusion@gmail.com> writes: > 07.12.2021 19:25, Tom Lane wrote: >> Hmm. I wonder whether using SD_BOTH behaves any differently. > With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on > iterations 1, 2, 3, 16 (just as without shutdown() at all). > So shutdown with the SD_SEND flag definitely behaves much better (I've > seen over 200 successful iterations). Fun. Well, I'll put in shutdown with SD_SEND for the moment, and we'll have to see whether we can improve further than that. It does sound like we may be running into OpenSSL bugs/oddities, not only kernel issues, so it may be impossible to do better on our side. regards, tom lane
On 12/7/21 13:22, Tom Lane wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> 07.12.2021 19:25, Tom Lane wrote: >>> Hmm. I wonder whether using SD_BOTH behaves any differently. >> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on >> iterations 1, 2, 3, 16 (just as without shutdown() at all). >> So shutdown with the SD_SEND flag definitely behaves much better (I've >> seen over 200 successful iterations). > Fun. Well, I'll put in shutdown with SD_SEND for the moment, > and we'll have to see whether we can improve further than that. > It does sound like we may be running into OpenSSL bugs/oddities, > not only kernel issues, so it may be impossible to do better > on our side. Yeah. My suspicion is that SD_BOTH is what closesocket() does if shutdown() hasn't been previously called. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com