Обсуждение: PG 10: could not generate random cancel key

Поиск
Список
Период
Сортировка

PG 10: could not generate random cancel key

От
Dean Rasheed
Дата:
Last week I upgraded 15 servers from various pre-10 versions to 10.4.
At first everything looked OK, but then (around 4 days later) one of
them failed with this in the logs:

2018-07-14 01:53:35.840 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.233 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.245 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.553 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.581 BST  LOG:  could not generate random cancel key
2018-07-14 01:54:43.851 BST  WARNING:  worker took too long to start; canceled
2018-07-14 01:54:43.862 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.861 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.874 BST  LOG:  could not generate random cancel key
...

After that it would not accept any new connections until I restarted
postmaster a few hours later. Since then, it has been OK.

It was built using --with-openssl and strong random support enabled,
so it was OpenSSL's RAND_bytes() that failed for some reason. I
attempted to reproduce it with a small C program directly calling
RAND_bytes(), but it refused to fail, even if I disabled haveged and
ran my tests in an @reboot cron job. So this failure is evidently
quite rare, but the documentation for RAND_bytes() says it *can* fail
(returning 0) if it isn't seeded with enough entropy, in which case
more must be added, which we're not doing.

In addition, once it does fail, repeated calls to RAND_bytes() will
continue to fail if it isn't seeded with more data -- hence the
inability to start any new backends until after a postmaster restart,
which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status()
[1] to check that the generator has been seeded with enough data:

    RAND_status() indicates whether or not the CSPRNG has been sufficiently
    seeded. If not, functions such as RAND_bytes(3) will fail.

and if not, RAND_poll() can be used to fix that:

    RAND_poll() uses the system's capabilities to seed the CSPRNG using
    random input obtained from polling various trusted entropy sources. The
    default choice of the entropy source can be modified at build time using
    the --with-rand-seed configure option, see also the NOTES section. A
    summary of the configure options can be displayed with the OpenSSL
    version(1) command.

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

if (RAND_status() == 0)
    RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Thoughts?

Regards,
Dean


[1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3] https://github.com/openssl/openssl/issues/4148

Вложения

Re: PG 10: could not generate random cancel key

От
Michael Paquier
Дата:
On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
> Looking for precedents elsewhere, I found [2] which does exactly that,
> although I'm slightly dubious about the need for the for-loop there. I
> also found a thread [3], which recommends simply doing
>
> if (RAND_status() == 0)
>     RAND_poll();
>
> which seems preferable. Attached is a patch to do this in pg_strong_random().

Checking for the return result of RAND_poll() would also be good thing
to do.  From what I read in OpenSSL code it could fail as well, and
we could combine that with a loop attempted to feed the machinery a
decided amount of times, just failing after successive failures.
--
Michael

Вложения

Re: PG 10: could not generate random cancel key

От
Dean Rasheed
Дата:
On 17 July 2018 at 14:04, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
>> Looking for precedents elsewhere, I found [2] which does exactly that,
>> although I'm slightly dubious about the need for the for-loop there. I
>> also found a thread [3], which recommends simply doing
>>
>> if (RAND_status() == 0)
>>     RAND_poll();
>>
>> which seems preferable. Attached is a patch to do this in pg_strong_random().
>
> Checking for the return result of RAND_poll() would also be good thing
> to do.  From what I read in OpenSSL code it could fail as well, and
> we could combine that with a loop attempted to feed the machinery a
> decided amount of times, just failing after successive failures.

From what I understand from here [1], some parts of OpenSSL call
RAND_poll() once on initialisation, and that's enough to get the PRNG
going. It's not obvious that calling it multiple times would have any
benefit.

They also don't appear to bother checking the return code from
RAND_poll() [2]. If it did fail, there'd not be much you could do
anyway, so you might as well just let it continue and let RAND_bytes()
fail. In fact it may even be possible for RAND_poll() to fail, but
just do enough to cause RAND_bytes() to succeed.

Regards,
Dean


[1] https://wiki.openssl.org/index.php/Random_Numbers
[2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c


Re: PG 10: could not generate random cancel key

От
Robert Haas
Дата:
On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> if (RAND_status() == 0)
>     RAND_poll();

Looks like a recipe for an infinite loop.  At least, I think we ought
to have a CHECK_FOR_INTERRUPTS() in that loop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PG 10: could not generate random cancel key

От
Alvaro Herrera
Дата:
On 2018-Jul-17, Robert Haas wrote:

> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> > if (RAND_status() == 0)
> >     RAND_poll();
> 
> Looks like a recipe for an infinite loop.  At least, I think we ought
> to have a CHECK_FOR_INTERRUPTS() in that loop.

What loop?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PG 10: could not generate random cancel key

От
Robert Haas
Дата:
On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2018-Jul-17, Robert Haas wrote:
>
>> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>> > if (RAND_status() == 0)
>> >     RAND_poll();
>>
>> Looks like a recipe for an infinite loop.  At least, I think we ought
>> to have a CHECK_FOR_INTERRUPTS() in that loop.
>
> What loop?

Ugh, I'm not doing very well today, am I?  I read that as while() but
it says if().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PG 10: could not generate random cancel key

От
Michael Paquier
Дата:
On Tue, Jul 17, 2018 at 01:31:01PM -0400, Robert Haas wrote:
> On Tue, Jul 17, 2018 at 1:27 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> On 2018-Jul-17, Robert Haas wrote:
>>> On Tue, Jul 17, 2018 at 8:33 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>>> if (RAND_status() == 0)
>>>>     RAND_poll();
>>>
>>> Looks like a recipe for an infinite loop.  At least, I think we ought
>>> to have a CHECK_FOR_INTERRUPTS() in that loop.
>>
>> What loop?

The CHECK_FOR_INTERRUPTS() addition could be an addition if the number
of retries gets high enough, but that just does not apply here.

> Ugh, I'm not doing very well today, am I?  I read that as while() but
> it says if().

Time for vacations :)
--
Michael

Вложения

Re: PG 10: could not generate random cancel key

От
Michael Paquier
Дата:
On Tue, Jul 17, 2018 at 02:28:14PM +0100, Dean Rasheed wrote:
> From what I understand from here [1], some parts of OpenSSL call
> RAND_poll() once on initialisation, and that's enough to get the PRNG
> going. It's not obvious that calling it multiple times would have any
> benefit.
>
> They also don't appear to bother checking the return code from
> RAND_poll() [2]. If it did fail, there'd not be much you could do
> anyway, so you might as well just let it continue and let RAND_bytes()
> fail. In fact it may even be possible for RAND_poll() to fail, but
> just do enough to cause RAND_bytes() to succeed.
>
> [1] https://wiki.openssl.org/index.php/Random_Numbers

This quote from the wiki is scary so that's not quite clean either for
Windows:
"Be careful when deferring to RAND_poll on some Unix systems because it
does not seed the generator. See the code guarded with
OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
negative interactions on newer Windows platforms, so your program could
hang or crash depending on the potential issue. See Windows Issues
below."

> [2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c

This repository is outdated, on OpenSSL HEAD I am seeing this used only
in rand_win.c.  And this commit is sort of interesting because there was
a retry loop done with RAND_poll().  Please see this one:
commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
author: Dr. Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
date: Thu, 31 Aug 2017 23:16:22 +0200
committer: Ben Kaduk <kaduk@mit.edu>
date: Wed, 18 Oct 2017 08:39:20 -0500
Fix reseeding issues of the public RAND_DRBG

apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().
--
Michael

Вложения

Re: PG 10: could not generate random cancel key

От
Dean Rasheed
Дата:
On 18 July 2018 at 03:17, Michael Paquier <michael@paquier.xyz> wrote:
>> [1] https://wiki.openssl.org/index.php/Random_Numbers
>
> This quote from the wiki is scary so that's not quite clean either for
> Windows:
> "Be careful when deferring to RAND_poll on some Unix systems because it
> does not seed the generator. See the code guarded with
> OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
> negative interactions on newer Windows platforms, so your program could
> hang or crash depending on the potential issue. See Windows Issues
> below."
>

I think that wiki page is somewhat out of date in places. Both the
Windows issues it links to seem to have been fixed a long time ago, so
I think using RAND_poll() is probably safe now, although perhaps there
are still some Unix platforms on which it won't help either.


>> [2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c
>
> This repository is outdated, on OpenSSL HEAD I am seeing this used only
> in rand_win.c.  And this commit is sort of interesting because there was
> a retry loop done with RAND_poll().  Please see this one:
> commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
> author: Dr. Matthias St. Pierre <Matthias.St.Pierre@ncp-e.com>
> date: Thu, 31 Aug 2017 23:16:22 +0200
> committer: Ben Kaduk <kaduk@mit.edu>
> date: Wed, 18 Oct 2017 08:39:20 -0500
> Fix reseeding issues of the public RAND_DRBG
>
> apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().

OK, I guess that it is possible that an older version of OpenSSL
requires RAND_poll() to be called multiple times. Here's an updated
patch doing that (with up to 8 retries, based on the old OpenSSL
code).

Regards,
Dean

Вложения

Re: PG 10: could not generate random cancel key

От
Michael Paquier
Дата:
On Wed, Jul 18, 2018 at 10:14:56AM +0100, Dean Rasheed wrote:
> OK, I guess that it is possible that an older version of OpenSSL
> requires RAND_poll() to be called multiple times. Here's an updated
> patch doing that (with up to 8 retries, based on the old OpenSSL
> code).

Thanks for the updated version.  This looks safer to me.  It is possible
to simplify the code by removing the external RAND_status() call and
check for RAND_status() first in the loop as per the attached.
--
Michael

Вложения

Re: PG 10: could not generate random cancel key

От
Dean Rasheed
Дата:
On 18 July 2018 at 14:01, Michael Paquier <michael@paquier.xyz> wrote:
> Thanks for the updated version.  This looks safer to me.  It is possible
> to simplify the code by removing the external RAND_status() call and
> check for RAND_status() first in the loop as per the attached.

OK, thanks.

Barring any further comments, I'll push and back-patch this to v10
later this week.

Regards,
Dean