Discussion: The dangers of streaming across versions of glibc: A cautionary tale
The following is a real critical problem that we ran into here at TripAdvisor, but have not yet figured out a clear way to mitigate.
TL;DR:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. In particular, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave: indexes that are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy, it uses these "corrupt" indexes to perform merge joins; merge joins rely heavily on the assumption that the indexes are sorted, so all results of the join past the first poison pill entry are silently not returned. Additionally, if the slave becomes master, the "corrupt" indexes will in some cases be unable to enforce uniqueness, and will quietly allow duplicate values.
Context:
We were doing a hardware upgrade on a large internal machine a couple of months ago. We followed a common procedure here: stand up the new HA pair as streaming replicas of the old system, then fail over to the new pair. All systems involved were running 9.1.9 (though, as we'll see, that is not relevant), built from source.
Immediately after the failover we saw some weird cases with some small indexes. We thought it was because the streaming replication failover had gone poorly (and because we weren't running the latest version of postgres on that machine), so we rebuilt them and moved on. Then last week an important query stopped being optimized as a hash join and turned into a merge join. From that query I generated a simple, single-column join between two tables. That query returns 50 million rows with merge joins disabled and 0 rows with them enabled. Rebuilding the index fixed the issue, but this was an important table, so we did some digging.
Using some query optimizer coercion, I was able to show that 1. the "corrupt" index had the same number of rows as the table, and 2. the index returned rows in a different, but nearly identical, ordering to the one you would get by explicitly sorting the column. Taking a pair of rows that were out of place, I managed to narrow the issue down. Luckily, we were able to find the old server sitting on the floor. The simplest form of the issue is:
SELECT 'M' > 'ஐ';
Root cause:
Depending on your charset, the first character might look like an ASCII 'M'. It is not. The two characters in question are the UTF-8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines running the same version of postgres, in databases with identical character encodings and collations ('en_US.UTF-8'), that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other, or whether you run an initdb and then immediately start postgres.
Digging further led me to master:src/backend/utils/adt/varlena.c:1494,1497. These are the lines where postgres calls strcoll_l and strcoll in order to sort strings in a locale-aware manner.
In the simplest case, the attached C file returns inconsistent results across glibc versions and environments. It just sets the collation to 'en_US.UTF-8' and then compares two one-character strings (the characters above). Depending on the version of glibc you are running, you may see a positive, negative, or zero output. I have observed:
Old Server (CentOS 5.8, kernel 2.6.18-308.24.1.el5):
0 -> glibc-2.5-81.el5_8.7
1 -> glibc-devel-2.5-81.el5_8.7 statically linked
0 -> Source builds of glibc (2.5, 2.6, 2.10)
New Server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64):
-1 -> glibc-2.12-1.132.el6.x86_64
Dev server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64):
-1 -> glibc-2.12-1.107.el6_4.5.x86_64
-1 -> Source build (2.12.2, 2.16, 2.18)
15 -> Source build (HEAD)
Laptop (Ubuntu, kernel 3.11.0-12-generic):
-1 -> 2.17-93ubuntu4
15 -> 2.17-93ubuntu4 statically linked
Mac OS (for comparison only):
62365 -> OSX 10.8, 10.9
From my digging, I have been unable to figure out why glibc returns different results in different situations. It is probably worth getting a discussion going on their mailing lists as well.
Regardless, the reality is that there are different versions of glibc out in the wild, and they do not sort consistently across versions and environments. Streaming replicas rely on the assumption that the sort order within a collation is consistent across machines. If the orders differ ever so slightly, then a single pair of rows with poison pill characters can render an entire index invalid. We were lucky that the first discrepancy was at the beginning of the index. If it had been 85% of the way through, we probably would never have noticed that merge joins were broken on that machine.
We are still discussing internally how we plan to prevent this in the future, but we'd like to open this up to the community for wider discussion.
- Matt K, TripAdvisor
Attachments
On Wed, Aug 6, 2014 at 09:24:17PM +0000, Matthew Kelly wrote:
> The following is a real critical problem that we ran into here at TripAdvisor,
> but have not yet figured out a clear way to mitigate.
>
> TL;DR:
> Streaming replicas—and by extension, base backups—can become dangerously broken
> when the source and target machines run slightly different versions of glibc.
> Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on
> the slave. These indexes are sorted out of order with respect to the strcoll
> running on the slave. Because postgres is unaware of the discrepancy it uses
> these "corrupt" indexes to perform merge joins; merges rely heavily on the
> assumption that the indexes are sorted and this causes all the results of the
> join past the first poison pill entry to not be returned. Additionally, if the
> slave becomes master, the "corrupt" indexes will in cases be unable to enforce
> uniqueness, but quietly allow duplicate values.

No surprise; I have been expecting to hear about such breakage, and am surprised we hear about it so rarely. We really have no way of testing for breakage either. :-(

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ Everyone has their own god. +
On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> No surprise; I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely. We really have no way of testing
> for breakage either. :-(

I guess that TripAdvisor were using some particular collation that had a chance of changing. Sorting rules for English text (so, say, en_US.UTF-8) are highly unlikely to change. That might be much less true for other locales. Unicode Technical Standard #10 states:

"""
Collation order is not fixed. Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame. :-(

LC_IDENTIFICATION serves this purpose on glibc. Here is what en_US looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper@redhat.com>, 2000

LC_IDENTIFICATION
title      "English locale for the USA"
source     "Free Software Foundation, Inc."
address    "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact    ""
email      "bug-glibc-locales@gnu.org"
tel        ""
fax        ""
language   "English"
territory  "USA"
revision   "1.0"
date       "2000-06-24"
%
category  "en_US:2000";LC_IDENTIFICATION
category  "en_US:2000";LC_CTYPE
category  "en_US:2000";LC_COLLATE
category  "en_US:2000";LC_TIME
category  "en_US:2000";LC_NUMERIC
category  "en_US:2000";LC_MONETARY
category  "en_US:2000";LC_MESSAGES
category  "en_US:2000";LC_PAPER
category  "en_US:2000";LC_NAME
category  "en_US:2000";LC_ADDRESS
category  "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""

This is a GNU extension [1].
If the OS adds a new version of a collation, that probably accidentally works a lot of the time, because the collation rule added or removed was fairly esoteric anyway; such is the nature of these things. If it was something that came up a lot, it would surely have been settled by standardization years ago. If OS vendors are not going to give us a standard API for versioning, we're hosed. I thought about suggesting that we hash a strxfrm() blob for about 2 minutes, before realizing that that's a stupid idea. Glibc would be a good start.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html

--
Regards,
Peter Geoghegan
> Over time, collation order will vary: there may be fixes needed as
> more information becomes available about languages; there may be new
> government or industry standards for the language that require
> changes; and finally, new characters added to the Unicode Standard
> will interleave with the previously-defined ones. This means that
> collations must be carefully versioned.

Another idea could be having our own collation data, to isolate any changes from the outside world. I vaguely recall this has been discussed before.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> Another idea could be having our own collation data to isolate any
> changes from the outside world. I vaguely recall this has been
> discussed before.

That's probably the best solution. It would not be the first time that we decided to stop relying on the operating system's facilities due to various problems (e.g. we used to use the C standard library qsort() until about 2006). The only problem is that it's a lot of work.

One possible solution that has been proposed is to adopt ICU [1]. That might allow us to say "this is the official way that PostgreSQL 9.6 sorts Japanese; you may use the old way if you want, but it's incompatible with the new way". ICU would give us a standard versioning interface [2]. They seem to take this seriously, and are aware of our considerations around B-Tree indexes on text.

[1] https://wiki.postgresql.org/wiki/Todo:ICU
[2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning

--
Regards,
Peter Geoghegan
We are currently running with the en_US.UTF-8 collation. It was a decision made long ago, and seeing as we never actually rely on the sort order of internationalized strings (other than for stability, apparently), we have never had any motivation to change this practice.

Some way of versioning collations which is not tied to glibc seems immensely appealing. Without a good way of testing the differences between glibc sort versions, it seems the only safe thing to do at the moment is to guarantee that all streaming replicas run from the exact same OS image. Which is fine until you want to upgrade your OS, and need to do a dump-restore instead of being able to do that in a rolling fashion.

To Bruce's point, the way I was able to test for this issue in a particular index was (approximately):

-- Assuming textfield is what the index is on, this causes the query
-- planner to scan the index and give the position in the index.
CREATE TABLE index_order AS (SELECT textfield, dense_rank() OVER (ORDER BY textfield) AS i_order FROM table);
-- No index here, postgres must sort
CREATE TABLE both_orders AS (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) AS sort_order FROM index_order);
-- If this doesn't return zero, you have a problem
SELECT count(*) FROM both_orders WHERE i_order <> sort_order;

This method is really slow on a big table, and I'm not going to promise it always works, but that is how we found the root cause.

- Matt K

On Aug 6, 2014, at 9:46 PM, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote:
> On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Another idea could be having our own collation data to isolate any
>> changes from the outside world. I vaguely recall this has been
>> discussed before.
>
> That's probably the best solution. It would not be the first time that
> we decided to stop relying on the operating system's facilities due to
> various problems (e.g. we used to use the C standard library qsort()
> until about 2006). The only problem is that it's a lot of work. One
> possible solution that has been proposed is to adopt ICU [1]. That
> might allow us to say "this is the official way that PostgreSQL 9.6
> sorts Japanese; you may use the old way if you want, but it's
> incompatible with the new way". ICU would give us a standard
> versioning interface [2]. They seem to take this seriously, and are
> aware of our considerations around B-Tree indexes on text.
>
> [1] https://wiki.postgresql.org/wiki/Todo:ICU
> [2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning
> --
> Regards,
> Peter Geoghegan
On Thu, Aug 7, 2014 at 03:07:04PM +0000, Matthew Kelly wrote:
> We are currently running with the en_US.UTF-8 collation. It was a
> decision made long ago, and seeing as we never actually rely on the
> sort order of internationalized strings (other than for stability,
> apparently), we have never had any motivation to change this practice.
>
> Some way of versioning collations which is not tied to glibc seems
> immensely appealing. Without a good way of testing the differences
> between glibc sort versions, it seems the only safe thing to do at the
> moment is to guarantee that all streaming replicas run from the exact
> same OS image. Which is fine until you want to upgrade your OS, and
> need to do a dump-restore instead of being able to do that in a
> rolling fashion.
>
> To Bruce's point, the way I was able to test for this issue in a
> particular index was (approximately):
> -- Assuming textfield is what the index is on, this causes the query
> -- planner to scan the index and give the position in the index.
> CREATE TABLE index_order AS (SELECT textfield, dense_rank() OVER (ORDER BY textfield) AS i_order FROM table);
> -- No index here, postgres must sort
> CREATE TABLE both_orders AS (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) AS sort_order FROM index_order);
> -- If this doesn't return zero, you have a problem
> SELECT count(*) FROM both_orders WHERE i_order <> sort_order;
>
> This method is really slow on a big table, and I'm not going to promise
> it always works, but that is how we found the root cause.

We could walk the index looking for inconsistent btree splits, e.g. a split that doesn't match the ordering returned by the existing collation functions.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ Everyone has their own god. +
On Thu, Aug 7, 2014 at 9:46 AM, Bruce Momjian <bruce@momjian.us> wrote:
> We could walk the index looking for inconsistent btree splits, e.g. the
> split doesn't match the ordering returned by the existing collation
> functions.

I'm not sure I follow. I don't think that a tool like my btreecheck tool will necessarily be able to catch anything like this on a standby. Maybe it will, but that isn't guaranteed. For example, the difference in collation rules in question might just not have cropped up yet, but it's still a ticking time-bomb. Or there may only be differences affecting values on internal pages. Things break down very quickly.

In general, once there is an undetected inconsistency in collations between replicas, the varlena B-Tree support function number 1 can violate various invariants that all operator classes must obey. I doubt we want to get into the business of working backwards from a broken state of affairs like that to figure out there is a problem. Rather, I really do think we're compelled to offer better versioning of collations using a versioning interface like glibc's LC_IDENTIFICATION. There is no other way to properly fix the problem. This is a problem that is well understood, and anticipated by the Unicode consortium.

--
Regards,
Peter Geoghegan