Discussion: PRIVATE columns
Currently, ANALYZE collects data on all columns and stores these samples in pg_statistic where they can be seen via the view pg_stats.

In some cases we have data that is private and we do not wish others to see it, such as patient names. This becomes more important when we have row security.

Perhaps that data can be protected, but it would be even better if we simply didn't store value-revealing statistic data at all. Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway.

It would be good if we could collect the overall stats
* NULL fraction
* average width
* ndistinct
yet without storing either the MFVs or histogram. Doing that would avoid inadvertent leaking of potentially private information.

SET STATISTICS 0 simply skips collection of statistics altogether.

To implement this, one way would be to allow

ALTER TABLE foo
  ALTER COLUMN foo1 SET STATISTICS PRIVATE;

Or we could use another magic value like -2 to request this case.

(Yes, I am aware we could use a custom datatype with a custom typanalyze for this, but that breaks other things.)

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
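As a rough illustration of the proposal in the opening message, the following Python sketch (not PostgreSQL code) computes only the three summary statistics it names, NULL fraction, average width and ndistinct, and deliberately keeps no most-frequent-value list or histogram. The names `PrivateColumnStats` and `private_column_stats` are hypothetical. As a simplifying assumption, ndistinct here is just the count of distinct values in the sample, whereas ANALYZE extrapolates from the sample to the whole table.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrivateColumnStats:
    """Summary statistics only; no field holds an actual column value."""
    null_frac: float   # fraction of sampled entries that are NULL
    avg_width: float   # average width in bytes of the non-NULL entries
    n_distinct: int    # distinct non-NULL values seen in the sample


def private_column_stats(sample: Sequence[Optional[str]]) -> PrivateColumnStats:
    """Compute value-free statistics for a sample of text values.

    Hypothetical helper: None stands in for SQL NULL, and width is
    measured as the UTF-8 encoded length of each value.
    """
    if not sample:
        raise ValueError("sample must not be empty")

    non_null = [v for v in sample if v is not None]
    null_frac = (len(sample) - len(non_null)) / len(sample)

    if non_null:
        avg_width = sum(len(v.encode("utf-8")) for v in non_null) / len(non_null)
    else:
        avg_width = 0.0

    # Simplification: count distinct values in the sample only.
    n_distinct = len(set(non_null))

    return PrivateColumnStats(null_frac, avg_width, n_distinct)


stats = private_column_stats(["alice", "bob", None, "alice"])
print(stats.null_frac, stats.n_distinct)
```

The point of the design is that the returned record has no place to put sampled values, which matches the "if it ain't in the bucket" argument made later in the thread.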
On 12/12/2012 1:12 PM, Simon Riggs wrote:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic where they can be seen via the view pg_stats.
>
> In some cases we have data that is private and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security.
>
> Perhaps that data can be protected, but it would be even better if we
> simply didn't store value-revealing statistic data at all. Such
> private data is seldom the target of searches, or if it is, it is
> mostly evenly distributed anyway.

Would protecting it the same way we protect the passwords in pg_authid be sufficient?

Jan

>
> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram.
> Doing that would avoid inadvertent leaking of potentially private information.
>
> SET STATISTICS 0
> simply skips collection of statistics altogether
>
> To implement this, one way would be to allow
>
> ALTER TABLE foo
> ALTER COLUMN foo1 SET STATISTICS PRIVATE;
>
> Or we could use another magic value like -2 to request this case.
>
> (Yes, I am aware we could use a custom datatype with a custom
> typanalyze for this, but that breaks other things)
>
> Thoughts?
>

--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 12 December 2012 19:13, Jan Wieck <JanWieck@yahoo.com> wrote:
> On 12/12/2012 1:12 PM, Simon Riggs wrote:
>>
>> Currently, ANALYZE collects data on all columns and stores these
>> samples in pg_statistic where they can be seen via the view pg_stats.
>>
>> In some cases we have data that is private and we do not wish others
>> to see it, such as patient names. This becomes more important when we
>> have row security.
>>
>> Perhaps that data can be protected, but it would be even better if we
>> simply didn't store value-revealing statistic data at all. Such
>> private data is seldom the target of searches, or if it is, it is
>> mostly evenly distributed anyway.
>
>
> Would protecting it the same way, we protect the passwords in pg_authid, be
> sufficient?

The user backend does need to be able to access the stats data during optimization. It's hard to have data accessible and yet impose limits on the uses to which that can be put. If we have row security on the table but no equivalent capability on the stats, then we'll have leakage, e.g. set statistics 10000, ANALYZE, then leak 10000 credit card numbers.

Selectivity functions are not marked leakproof, nor do people think they can easily be made so. That means the data might be leaked by various means: through error messages, plan selection, skullduggery etc.

If it ain't in the bucket, the bucket can't leak it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/12/2012 12:12 PM, Simon Riggs wrote:
>> Would protecting it the same way, we protect the passwords in pg_authid, be
>> sufficient?
>
> The user backend does need to be able to access the stats data during
> optimization. It's hard to have data accessible and yet impose limits
> on the uses to which that can be put. If we have row security on the
> table but no equivalent capability on the stats, then we'll have
> leakage. e.g. set statistics 10000, ANALYZE, then leak 10000 credit
> card numbers.
>
> Selectivity functions are not marked leakproof, nor do people think
> they can easily be made so. Which means the data might be leaked by
> various means through error messages, plan selection, skullduggery
> etc..
>
> If it ain't in the bucket, the bucket can't leak it.
>

I accidentally responded to Simon off-list on this. I understand the need and think it would be a good thing to have. However, the real opportunity here is to make statistics non-user visible. I can't think of any reason that they need to be visible to the standard user. That holds even if setting the statistics private only makes that one column non-visible.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579
Simon Riggs <simon@2ndQuadrant.com> writes:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic where they can be seen via the view pg_stats.

Only if you have appropriate privileges.

> In some cases we have data that is private and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security.

> Perhaps that data can be protected, but it would be even better if we
> simply didn't store value-revealing statistic data at all.

SET STATISTICS 0 seems like a sufficient solution for people who don't trust the have_column_privilege() protection in the pg_stats view.

In practice I think this is a waste of time, though. Anyone who can bypass the view restriction can probably just read the original table.

(I suppose we could consider marking pg_stats as a security_barrier view to make this even safer. Not sure it's worth the trouble though; the interesting columns are anyarray so it's hard to do much with them mechanically.)

> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram.

Do you have any evidence whatsoever that that's worth the trouble? I'd bet against it. And if we're being paranoid, who's to say that those numbers couldn't reveal useful data in themselves?

regards, tom lane
On 12/12/2012 3:12 PM, Simon Riggs wrote:
> On 12 December 2012 19:13, Jan Wieck <JanWieck@yahoo.com> wrote:
>> On 12/12/2012 1:12 PM, Simon Riggs wrote:
>>>
>>> Currently, ANALYZE collects data on all columns and stores these
>>> samples in pg_statistic where they can be seen via the view pg_stats.
>>>
>>> In some cases we have data that is private and we do not wish others
>>> to see it, such as patient names. This becomes more important when we
>>> have row security.
>>>
>>> Perhaps that data can be protected, but it would be even better if we
>>> simply didn't store value-revealing statistic data at all. Such
>>> private data is seldom the target of searches, or if it is, it is
>>> mostly evenly distributed anyway.
>>
>>
>> Would protecting it the same way, we protect the passwords in pg_authid, be
>> sufficient?
>
> The user backend does need to be able to access the stats data during
> optimization. It's hard to have data accessible and yet impose limits
> on the uses to which that can be put. If we have row security on the
> table but no equivalent capability on the stats, then we'll have
> leakage. e.g. set statistics 10000, ANALYZE, then leak 10000 credit
> card numbers.

As with the encrypted password column of pg_authid, I don't see any reason why the values in the stats columns need to be readable by anyone but a superuser at all. Do you?

Jan

--
Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
On 12 December 2012 20:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> SET STATISTICS 0 seems like a sufficient solution for people who don't
> trust the have_column_privilege() protection in the pg_stats view.

The point here is that a user may *have* privilege on the column and have rights to see some, but not all, rows of the table. But we cannot apply row-level security to individual column values, so neither the row nor the column security applies here, and it appears there is a greater level of risk at this point.

> In practice I think this is a waste of time, though. Anyone who can
> bypass the view restriction can probably just read the original table.

Where the row security would apply.

> (I suppose we could consider marking pg_stats as a security_barrier
> view to make this even safer. Not sure it's worth the trouble though;
> the interesting columns are anyarray so it's hard to do much with them
> mechanically.)

I'm trying to respond in useful ways to your statements that row security might not be very secure. Please advise.

>> It would be good if we could collect the overall stats
>> * NULL fraction
>> * average width
>> * ndistinct
>> yet without storing either the MFVs or histogram.
>
> Do you have any evidence whatsoever that that's worth the trouble?
> I'd bet against it.

All I can say is that uniformly distributed data that is accessed only by equality has no need of MFVs or histograms. Much personal data is so evenly distributed that they are not worth storing, though in some cases it isn't. We don't search for credit cards with a BETWEEN, so estimating the ends of ranges isn't needed. Yet knowing the number of distinct values is important to ensure that we use an index scan. Without stats we tend to do a bitmapindexscan, which seems to be significantly more expensive in practice.

> And if we're being paranoid, who's to say that
> those numbers couldn't reveal useful data in themselves?

I'm talking about privacy. Knowing there are 226,768 credit cards in a table, 0% of them are NULL and they are on average 16 digits wide tells me nothing about individual credit card numbers. Same with patient names. In edge cases we might infer something more when mixed with some external knowledge, but that's a matter for the military.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
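The planning argument in this message, that ndistinct alone is enough for equality lookups on evenly distributed data, can be made concrete with a small Python sketch. It is a simplified model of the uniformity assumption, not the planner's actual code: with no MFV list, each distinct non-NULL value is assumed to match an equal share of the non-NULL rows. The function name `estimate_equality_rows` is hypothetical, and treating every card number as distinct (ndistinct equal to the row count) is an assumption made for the example rather than something stated in the thread.

```python
def estimate_equality_rows(row_count: int, null_frac: float, n_distinct: int) -> float:
    """Estimate rows matching `column = constant` under a uniform distribution.

    Simplified model: non-NULL rows are spread evenly over the distinct
    values, so one value matches (1 - null_frac) / n_distinct of the table.
    """
    if n_distinct <= 0:
        raise ValueError("n_distinct must be positive")
    if not 0.0 <= null_frac <= 1.0:
        raise ValueError("null_frac must be between 0 and 1")
    return row_count * (1.0 - null_frac) / n_distinct


# Figures from the message: 226,768 rows, 0% NULL.
# Assumption for illustration: every card number is distinct.
print(estimate_equality_rows(226_768, 0.0, 226_768))  # -> 1.0
```

An estimate of about one matching row is what makes a plain index scan attractive, and it is obtained here without storing any individual column value.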
2012/12/12 Tom Lane <tgl@sss.pgh.pa.us>:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Currently, ANALYZE collects data on all columns and stores these
>> samples in pg_statistic where they can be seen via the view pg_stats.
>
> Only if you have appropriate privileges.
>
>> In some cases we have data that is private and we do not wish others
>> to see it, such as patient names. This becomes more important when we
>> have row security.
>
>> Perhaps that data can be protected, but it would be even better if we
>> simply didn't store value-revealing statistic data at all.
>
> SET STATISTICS 0 seems like a sufficient solution for people who don't
> trust the have_column_privilege() protection in the pg_stats view.
>
> In practice I think this is a waste of time, though. Anyone who can
> bypass the view restriction can probably just read the original table.
>
> (I suppose we could consider marking pg_stats as a security_barrier
> view to make this even safer. Not sure it's worth the trouble though;
> the interesting columns are anyarray so it's hard to do much with them
> mechanically.)
>

I also agree with Tom's opinion. Even though pg_stats does not have the security_barrier flag now, rows for columns the user has no privilege on are filtered out by have_column_privilege(). That seems to me sufficient protection against the scenario in which users read samples of the contents of columns they have no privilege on.

Admittedly, it is not sufficient protection once we have row-security features; for example, "SET STATISTICS 1000" on a table with fewer than 1000 rows will eventually put a full copy of the column into the pg_statistic catalog... Unlike the column case, the statistics do not record which rows the data came from, so it is not feasible to control access based on the user's privileges.

If we try to protect the collected statistical data (coming from tables with row security), one option is to prohibit access to entries for relations with row security. On the other hand, that would affect compatibility with third-party system monitoring tools that assume pg_statistic is visible...

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>