Обсуждение: Counting the occurences of a substring within a very large text
Hello, I'd like to count the number linebreaks within a string, but I get a memory allocation error when using regexp_matches or regexp_split_to_table. Any idea for an alternative to this problem ? select count(*)-1 from ( select regexp_split_to_table(full_message,'(\n)', 'g') from mytable where id =-2146999703 )foo; ERROR: invalid memory alloc request size 1447215584 regards, Marc Mamin
> Hello, > > I'd like to count the number linebreaks within a string, > but I get a memory allocation error when using regexp_matches or regexp_split_to_table. > > Any idea for an alternative to this problem ? > > select count(*)-1 from > ( select regexp_split_to_table(full_message,'(\n)', 'g') > from mytable > where id =-2146999703 > )foo; > > ERROR: invalid memory alloc request size 1447215584 > > regards, > > Marc Mamin > Hi, what's the size of full_message from mytable where id =-2146999703? Also: Postgres version? OS? any extensions installed? Bye, Chris.
Marc Mamin wrote: > I'd like to count the number linebreaks within a string, > but I get a memory allocation error when using regexp_matches or regexp_split_to_table. > > Any idea for an alternative to this problem ? > > select count(*)-1 from > ( select regexp_split_to_table(full_message,'(\n)', 'g') > from mytable > where id =-2146999703 > )foo; > > ERROR: invalid memory alloc request size 1447215584 Does any of these two work: SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM mytable WHERE id = -2146999703; or SELECT length(full_message) - length(replace(full_message, E'\n', '')) FROM mytable WHERE id = -2146999703; Yours, Laurenz Albe
On 6/24/2015 5:55 AM, Marc Mamin wrote: > Hello, > > I'd like to count the number linebreaks within a string, > but I get a memory allocation error when using regexp_matches or regexp_split_to_table. > > Any idea for an alternative to this problem ? > > select count(*)-1 from > ( select regexp_split_to_table(full_message,'(\n)', 'g') > from mytable > where id =-2146999703 > )foo; > > ERROR: invalid memory alloc request size 1447215584 > > regards, > > Marc Mamin > > If its a large enough string, switching to plperl/plpython might give you a pleasant speedup. There is a small overhead getting the string to/from pg, but the string ops will be much faster. -Andy
> -----Original Message----- > From: Albe Laurenz [mailto:laurenz.albe@wien.gv.at] > Sent: Mittwoch, 24. Juni 2015 13:44 > To: Marc Mamin; Postgres General > Subject: RE: Counting the occurences of a substring within a very large > text > > Marc Mamin wrote: > > I'd like to count the number linebreaks within a string, but I get a > > memory allocation error when using regexp_matches or > regexp_split_to_table. > > > > Any idea for an alternative to this problem ? > > > > select count(*)-1 from > > ( select regexp_split_to_table(full_message,'(\n)', 'g') > > from mytable > > where id =-2146999703 > > )foo; > > > > ERROR: invalid memory alloc request size 1447215584 > > Does any of these two work: > > SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM > mytable WHERE id = -2146999703; > > or > > SELECT length(full_message) - length(replace(full_message, E'\n', '')) > FROM mytable WHERE id = -2146999703; no, they both yeld the same error. a new string functions for this would be nice, as it could certainly be implemented in a more efficient way... BTW: the text to check is a single 350 MB error message from a log file :) > > Yours, > Laurenz Albe
> -----Original Message----- > From: Chris Mair [mailto:chris@1006.org] > Sent: Mittwoch, 24. Juni 2015 13:26 > To: Marc Mamin; Postgres General > Subject: Re: [GENERAL] Counting the occurences of a substring within a > very large text > > > Hello, > > > > I'd like to count the number linebreaks within a string, but I get a > > memory allocation error when using regexp_matches or > regexp_split_to_table. > > > > Any idea for an alternative to this problem ? > > > > select count(*)-1 from > > ( select regexp_split_to_table(full_message,'(\n)', 'g') > > from mytable > > where id =-2146999703 > > )foo; > > > > ERROR: invalid memory alloc request size 1447215584 > > > > regards, > > > > Marc Mamin > > > > Hi, > > what's the size of full_message from mytable where id =-2146999703? 345 MB, Postgres 9.3.6 Marc Mamin > > Also: Postgres version? OS? any extensions installed? > > Bye, > Chris. >
> > > I'd like to count the number linebreaks within a string, but I get > > > a memory allocation error when using regexp_matches or > > regexp_split_to_table. > > > > > > Any idea for an alternative to this problem ? > > > > > > select count(*)-1 from > > > ( select regexp_split_to_table(full_message,'(\n)', 'g') > > > from mytable > > > where id =-2146999703 > > > )foo; > > > > > > ERROR: invalid memory alloc request size 1447215584 > > > > Does any of these two work: > > > > SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM > > mytable WHERE id = -2146999703; > > > > or > > > > SELECT length(full_message) - length(replace(full_message, E'\n', > '')) > > FROM mytable WHERE id = -2146999703; > > > no, they both yeld the same error. > and this fails too, which is more annoying as it looks like a bug: SELECT replace(full_message, E'\n', '') FROM stadium_rprod.aserrorfull_20150623 WHERE id = -2146999703; note that the 345MB text only contains 635 lines. This might be the issue... Marc Mamin > a new string functions for this would be nice, as it could certainly be > implemented in a more efficient way... > > BTW: the text to check is a single 350 MB error message from a > log file :) > > > > > > > > > Yours, > > Laurenz Albe
note that the 345MB text only contains 635 lines. This might be the issue...
There's similar issue discussed here:
Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would still run into problems further down the line managing values of that size.
Geoff
>>On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote: >>note that the 345MB text only contains 635 lines. This might be the issue... >There's similar issue discussed here: >http://www.postgresql.org/message-id/6046.1353874252@sss.pgh.pa.us >Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would stillrun into problems further down the line managing values of that size. Thanks, does it means that the issue would also occur in a (not yet existing) function that would just count the substring, withouthaving to manipulate the text itself ? or in other words, are there string functions that can handle such large text in stream, or is it always necessary to allocenough memory for the whole text first ? Marc
Marc Mamin <M.Mamin@intershop.de> writes: > I'd like to count the number linebreaks within a string, > but I get a memory allocation error when using regexp_matches or regexp_split_to_table. Anything involving a regexp is going to have a maximum input string length of about 256MB, as a result of conversion to pg_wchar format. regexp_split_to_table(), for instance, does this: /* convert string to pg_wchar form for matching */ orig_len = VARSIZE_ANY_EXHDR(orig_str); wide_str = (pg_wchar *) palloc(sizeof(pg_wchar) * (orig_len + 1)); wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len); palloc() would complain for requests beyond 1GB, and sizeof(pg_wchar) is 4, so 256MB is the longest orig_str this can handle. (This coding is on the hairy edge of being a security bug on 32-bit machines, too, but AFAICS it is okay because there's a factor of 4 daylight between the max possible input length of 1GB and overflowing uint32.) In newer branches we could consider using MemoryContextAllocHuge to escape the 1GB limit on wide_str, but it would take some research to be sure that everything involved is using suitable datatypes for string indices and so forth. I'd be a bit worried about the performance implications of throwing such large strings around, anyway. regards, tom lane
Marc Mamin wrote: >>> I'd like to count the number linebreaks within a string, but I get a >>> memory allocation error when using regexp_matches or regexp_split_to_table. >> Does any of these two work: [...] > no, they both yeld the same error. > > a new string functions for this would be nice, as it could certainly be implemented in a more efficient way... This is a rather special case. But it should be easy to write your own C function that does this efficiently. Yours, Laurenz Albe