Re: Shared detoast Datum proposal

Поиск
Список
Период
Сортировка
От Andy Fan
Тема Re: Shared detoast Datum proposal
Дата
Msg-id 8734tmg88e.fsf@163.com
обсуждение исходный текст
Ответ на Re: Shared detoast Datum proposal  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: Shared detoast Datum proposal  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas, 
>
> I took a quick look on this thread/patch today, so let me share a couple
> initial thoughts. I may not have a particularly coherent/consistent
> opinion on the patch or what would be a better way to do this yet, but
> perhaps it'll start a discussion ...

Thank you for this!

>
> The goal of the patch (as I understand it) is essentially to cache
> detoasted values, so that the value does not need to be detoasted
> repeatedly in different parts of the plan. I think that's perfectly
> sensible and worthwhile goal - detoasting is not cheap, and complex
> plans may easily spend a lot of time on it.

exactly.

>
> That being said, the approach seems somewhat invasive, and touching
> parts I wouldn't have expected to need a change to implement this. For
> example, I certainly would not have guessed the patch to need changes in
> createplan.c or setrefs.c.
>
> Perhaps it really needs to do these things, but neither the thread nor
> the comments are very enlightening as for why it's needed :-( In many
> cases I can guess, but I'm not sure my guess is correct. And comments in
> code generally describe what's happening locally / next line, not the
> bigger picture & why it's happening.

there were explaination at [1], but it probably is too high level.
Writing a proper comments is challenging for me, but I am pretty happy
to try more. At the end of this writing, I explained the data workflow,
I am feeling that would be useful for reviewers. 

> IIUC we walk the plan to decide which Vars should be detoasted (and
> cached) once, and which cases should not do that because it'd inflate
> the amount of data we need to keep in a Sort, Hash etc.

Exactly.

> Not sure if
> there's a better way to do this - it depends on what happens in the
> upper parts of the plan, so we can't decide while building the paths.

I'd say I did this intentionally. Deciding such things in paths will be
more expensive than create_plan stage IMO.

> But maybe we could decide this while transforming the paths into a plan?
> (I realize the JIT thread nearby needs to do something like that in
> create_plan, and in that one I suggested maybe walking the plan would be
> a better approach, so I may be contradicting myself a little bit.).

I think that's pretty similar what I'm doing now. Just that I did it
*just after* the create_plan. This is because the create_plan doesn't
transform the path to plan in the top->down manner all the time, the
known exception is create_mergejoin_plan. so I have to walk just after
the create_plan is 
done.

In the create_mergejoin_plan, the Sort node is created *after* the
subplan for the Sort is created.

/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);

+ /*
+  * After the plan tree is built completed, we start to walk for which
+  * expressions should not used the shared-detoast feature.
+  */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);

>
> In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
> explain the overall strategy / reasoning in a bit more detail. Maybe
> it's somewhere in this thread, but that's not great for reviewers.

a lession learnt, thanks.

a revisted version of comments from the lastest patch.

/*
 * set_plan_forbid_pre_detoast_vars_recurse
 *     Walking the Plan tree in the top-down manner to gather the vars which
 * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
 * 
 * plan: the plan node to walk right now.
 * small_tlist: a list of nodes which its subplan should provide them as
 * small as possible.
 */
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)

>
> Similar for the setrefs.c changes. It seems a bit suspicious to piggy
> back the new code into fix_scan_expr/fix_scan_list and similar code.
> Those functions have a pretty clearly defined purpose, not sure we want
> to also extend them to also deal with this new thing. (FWIW I'd 100%%
> did it this way if I hacked on a PoC of this, to make it work. But I'm
> not sure it's the right solution.)

The main reason of doing so is because I want to share the same walk
effort as fix_scan_expr. otherwise I have to walk the plan for 
every expression again. I thought this as a best practice in the past
and thought we can treat the pre_detoast_attrs as a valuable side
effects:(

> I don't know what to thing about the Bitset - maybe it's necessary, but
> how would I know? I don't have any way to measure the benefits, because
> the 0002 patch uses it right away. 

a revisted version of comments from the latest patch. graph 2 explains
this decision.

    /*
     * The attributes whose values are the detoasted version in tts_values[*],
     * if so these memory needs some extra clean-up. These memory can't be put
     * into ecxt_per_tuple_memory since many of them needs a longer life span,
     * for example the Datum in outer join. These memory is put into
     * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
     * invalidated.
     *
     * Bitset rather than Bitmapset is chosen here because when all the members
     * of Bitmapset are deleted, the allocated memory will be deallocated
     * automatically, which is too expensive in this case since we need to
     * deleted all the members in each ExecClearTuple and repopulate it again
     * when fill the detoast datum to tts_values[*]. This situation will be run
     * again and again in an execution cycle.
     * 
     * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
     */
    Bitset       *pre_detoasted_attrs;

> I think it should be done the other
> way around, i.e. the patch should introduce the main feature first
> (using the traditional Bitmapset), and then add Bitset on top of that.
> That way we could easily measure the impact and see if it's useful.

Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
indicate it is too expensive. and after talk with David at [2], I
introduced bitset and use it here. the test case I used comes from [1].
IRCC, there were 5% performance difference because of this.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;

To reproduce the difference, we can replace the bitset_clear() with

bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);

in ExecFreePreDetoastDatum. then it works same as Bitmapset.


> On the whole, my biggest concern is memory usage & leaks. It's not
> difficult to already have problems with large detoasted values, and if
> we start keeping more of them, that may get worse. Or at least that's my
> intuition - it can't really get better by keeping the values longer, right?
>
> The other thing is the risk of leaks (in the sense of keeping detoasted
> values longer than expected). I see the values are allocated in
> tts_mcxt, and maybe that's the right solution - not sure.

about the memory usage, first it is kept as the same lifesplan as the
tts_values[*] which can be released pretty quickly, only if the certain
values of the tuples is not needed. it is true that we keep the detoast
version longer than before,  but that's something we have to pay I
think. 

Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
we forget to release the memory when the tts_values[*] is invalidated
somehow, the memory will be leaked until the end of executor. I think
that will be enough to cause an issue. Currently besides I release such
memory at the ExecClearTuple, I also relase such memory whenever we set
tts_nvalid to 0, the theory used here is:

/*
 * tts_values is treated invalidated since tts_nvalid is set to 0, so
 * let's free the pre-detoast datum.
 */
ExecFreePreDetoastDatum(slot);

I will do more test on the memory leak stuff, since there are so many
operation aginst slot like ExecCopySlot etc, I don't know how to test it
fully. the method in my mind now is use TPCH with 10GB data size, and
monitor the query runtime memory usage.


> FWIW while looking at the patch, I couldn't help but to think about
> expanded datums. There's similarity in what these two features do - keep
> detoasted values for a while, so that we don't need to do the expensive
> processing if we access them repeatedly.

Could you provide some keyword or function names for the expanded datum
here, I probably miss this.

> Of course, expanded datums are
> not meant to be long-lived, while "shared detoasted values" are meant to
> exist (potentially) for the query duration.

hmm, acutally the "shared detoast value" just live in the
TupleTableSlot->tts_values[*], rather than the whole query duration. The
simple case is:

SELECT * FROM t WHERE a_text LIKE 'abc%';

when we scan to the next tuple, the detoast value for the previous tuple
will be relased. 

> But maybe there's something
> we could learn from expanded datums? For example how the varlena pointer
> is leveraged to point to the expanded object.

maybe. currently I just use detoast_attr to get the desired version. I'm
pleasure if we have more effective way.

if (!slot->tts_isnull[attnum] &&
    VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
    Datum        oldDatum;
    MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
        
    oldDatum = slot->tts_values[attnum];
    slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
   (struct varlena *) oldDatum));
 
    Assert(slot->tts_nvalid > attnum);
    Assert(oldDatum != slot->tts_values[attnum]);
    bitset_add_member(slot->pre_detoasted_attrs, attnum);
    MemoryContextSwitchTo(old);
}


> For example, what if we add a "TOAST cache" as a query-level hash table,
> and modify the detoasting to first check the hash table (with the TOAST
> pointer as a key)? It'd be fairly trivial to enforce a memory limit on
> the hash table, evict values from it, etc. And it wouldn't require any
> of the createplan/setrefs changes, I think ...

Hmm, I am not sure I understand you correctly at this part. In the
current patch, to avoid the run-time (ExecExprInterp) check if we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes.
for example an datum for int or a detoast datum should not be saved back
to tts_values[*] due to the small_tlist reason. However these steps can
be generated is based on the output of createplan/setrefs changes. take
the INNER_VAR for example: 

In ExecInitExprRec:

switch (variable->varno)
{
    case INNER_VAR:
            if (is_join_plan(plan) &&
            bms_is_member(attnum,
              ((JoinState *) state->parent)->inner_pre_detoast_attrs))
        {
            scratch.opcode = EEOP_INNER_VAR_TOAST;
        }
        else
        {
            scratch.opcode = EEOP_INNER_VAR;
        }
}

The data workflow is:

1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
decides which Vars should *not* be pre_detoasted because of small_tlist
reason and record it in Plan.forbid_pre_detoast_vars.

2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
detoasted for the specific plan node and record them in it. Currently
only Scan and Join nodes support this feature.

typedef struct Scan
{
        ...
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
..
    /*
     * Records of var's varattno - 1 where the Var is accessed indirectly by
     * any expression, like a > 3.  However a IS [NOT] NULL is not included
     * since it doesn't access the tts_values[*] at all.
     *
     * This is a essential information to figure out which attrs should use
     * the pre-detoast-attrs logic.
     */
    Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;
} Join;

3. during the InitPlan stage, we maintain the
PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.

4. At the ExecExprInterp stage, only the new StepOp do the extra check
to see if the detoast should happen. Other steps doesn't need this
check at all. 

If we avoid the createplan/setref.c changes, probabaly some unrelated
StepOp needs the extra check as well?

When I worked with the UniqueKey feature, I maintained a
UniqueKey.README to summaried all the dicussed topics in threads, the
README is designed to save the effort for more reviewer, I think I
should apply the same logic for this feature.

Thank you very much for your feedback!

v7 attached, just some comments and Assert changes. 

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com
[2]
https://www.postgresql.org/message-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com

-- 
Best Regards
Andy Fan


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: Shared detoast Datum proposal
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: Missing Group Key in grouped aggregate