Discussion: an attempt to fix the Google search problem


an attempt to fix the Google search problem

From
Peter Eisentraut
Date:
It is a well-known problem that a Google search for something in the
PostgreSQL documentation will usually return hits in old documentation
versions first, because those pages have been around for the longest.

I believe I have a promising fix for that.  By adding a <link
rel="canonical"> to the documentation pages that points to the "current"
version, search engines will be encouraged to return the current version
in search results.

I had heard that the Django project had the same problem and got this
solution from there.  See for example the source of this page:
<https://docs.djangoproject.com/en/1.10/topics/db/models/>.  Here is
also some information from Google about this:
<https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html>

I think this is worth trying.  A one-line patch is attached.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
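
To make the proposal above concrete, here is a minimal sketch of how a versioned docs path could be mapped to a canonical link tag. It is not the attached patch (which is not reproduced in this thread). The URL layout `/docs/<version>/static/<page>.html`, the base URL default, and the function name `canonical_tag` are assumptions made for illustration only.

```python
import re
from html import escape

# Assumed URL layout for this sketch: /docs/<version>/static/<page>.html
DOC_PATH = re.compile(r"^/docs/(?P<version>[^/]+)/static/(?P<page>[^/]+\.html)$")


def canonical_tag(path, base="https://www.postgresql.org"):
    """Return a <link rel="canonical"> tag pointing at the "current" copy of a docs page.

    Returns an empty string for paths that do not look like docs pages.
    """
    match = DOC_PATH.match(path)
    if match is None:
        return ""
    page = escape(match.group("page"))  # escape before placing it in an attribute
    return '<link rel="canonical" href="{}/docs/current/static/{}">'.format(base, page)


print(canonical_tag("/docs/9.4/static/datatype-json.html"))
```

Every versioned copy of a page would then carry the same tag, which is what tells a search engine to treat them as copies of the "current" page.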

Re: an attempt to fix the Google search problem

From
Magnus Hagander
Date:


On Wed, Nov 9, 2016 at 6:34 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> It is a well-known problem that a Google search for something in the
> PostgreSQL documentation will usually return hits in old documentation
> versions first, because those pages have been around for the longest.
>
> I believe I have a promising fix for that.  By adding a <link
> rel="canonical"> to the documentation pages that point to the "current"
> version, search engines will be encouraged to return the current version
> search results.
>
> I had heard that the Django project had the same problem and got this
> solution from there.  See for example the source of this page:
> <https://docs.djangoproject.com/en/1.10/topics/db/models/>.  Here is
> also some information from Google about this:
> <https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html>
>
> I think this is worth trying.  A one-line patch is attached.

According to the article you linked, it's important not to link to pages that don't exist. So we should at least verify that the page does exist in the current version (the same way that we do for the links at the top of the pages for old versions).  IIRC someone (sorry, this is a long time ago, can't remember who or why) mentioned that pages can get severely punished if the canonical link goes to a 404.

We did try this at some point ages and ages ago and it didn't help, but I agree it's probably worth another try. But we definitely need to be careful not to destroy our existing Google ranking.

And FWIW, the Django way isn't particularly good either. For example, a Google search for "django 1.8 models" only gives me the 1.10 documentation. But that may still be the lesser of the two evils.

Also, when I search for it, I always get the Greek version (el) first (though it's in English), followed by Japanese. The actual page to look for doesn't show up until the 5th spot. Which goes to show that no approach really works *well* :S


Re: an attempt to fix the Google search problem

From
Peter Eisentraut
Date:
On 11/9/16 12:07 PM, Magnus Hagander wrote:
> By that article you linked, it's important not to link to pages that
> don't exist. So we should at least verify that the page does exist in
> the current version (the same way that we do for the links at the top of
> the pages for old versions).  IIRC someone (sorry, this is a long time
> ago, can't remember who or why) mentioned that the pages can get
> severely punished if the canonical link goes to a 404.

OK, good point.  Can you help with the code?  I suppose it needs
something along the lines of

{% if current in supported_versions %}
...
{% endif %}

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
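
The existence check discussed above could also live in the view rather than the template. The following is a rough sketch under stated assumptions: `canonical_href` and `current_pages` are hypothetical names, the base URL is assumed, and the set of pages in the current docs would have to come from wherever the site already tracks them.

```python
def canonical_href(page, current_pages,
                   base="https://www.postgresql.org/docs/current/static/"):
    """Return the canonical URL for `page`, or None if the current docs lack it.

    Returning None lets the template skip the <link rel="canonical"> tag
    entirely, so the canonical link never points at a 404.
    """
    if page not in current_pages:
        return None
    return base + page


# Hypothetical data: the pages that exist in the current version.
current_pages = {"datatype-json.html", "functions-json.html"}
print(canonical_href("datatype-json.html", current_pages))
print(canonical_href("removed-page.html", current_pages))
```

A template would then only render the tag when the returned value is not None.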



Re: an attempt to fix the Google search problem

From
Daniel Gustafsson
Date:
> On 09 Nov 2016, at 18:07, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Wed, Nov 9, 2016 at 6:34 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> It is a well-known problem that a Google search for something in the
> PostgreSQL documentation will usually return hits in old documentation
> versions first, because those pages have been around for the longest.
>
> I believe I have a promising fix for that.  By adding a <link
> rel="canonical"> to the documentation pages that point to the "current"
> version, search engines will be encouraged to return the current version
> search results.
>
> I had heard that the Django project had the same problem and got this
> solution from there.  See for example the source of this page:
> <https://docs.djangoproject.com/en/1.10/topics/db/models/>.  Here is
> also some information from Google about this:
> <https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html>
>
> I think this is worth trying.  A one-line patch is attached.
>
> By that article you linked, it's important not to link to pages that don't exist. So we should at least verify that
> the page does exist in the current version (the same way that we do for the links at the top of the pages for old
> versions). IIRC someone (sorry, this is a long time ago, can't remember who or why) mentioned that the pages can get
> severely punished if the canonical link goes to a 404.

While I can't cite a source confirming that Google punishes 4XX responses, I have
first-hand experience that they in fact do (or at least have done).

> We did try this at some point ages and ages ago and it didn't help, but I agree it's probably worth another try. But
> we definitely need to be careful not to destroy existing google ranking.

The backing RFC states that the target document must be a duplicate or superset
of the context document, and Google says something similar.  The current version
of a doc page fits that, but we should be careful when doc pages have been
substantially rewritten: targeting a completely different page could lead to
punishment.

cheers ./daniel
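
One conceivable way to act on the "duplicate or superset" caution is to compare the text of the old page with the current one before emitting a canonical link. This is only a sketch: the function name `similar_enough` and the 0.6 threshold are arbitrary assumptions, and nothing in the thread says search engines measure similarity this way.

```python
from difflib import SequenceMatcher


def similar_enough(old_text, current_text, threshold=0.6):
    """Return True if the two page texts overlap enough to be treated as the same page.

    Compares word sequences with difflib; `threshold` is an arbitrary
    cut-off chosen for this sketch, not a value taken from any search engine.
    """
    ratio = SequenceMatcher(None, old_text.split(), current_text.split()).ratio()
    return ratio >= threshold


old = "The json data type stores an exact copy of the input text"
print(similar_enough(old, old))
print(similar_enough("alpha beta gamma delta", "one two three four"))
```

Pages that fall below the threshold would simply not get a canonical link, leaving their old-version copy indexed as before.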


Re: an attempt to fix the Google search problem

From
Greg Stark
Date:
On Wed, Nov 9, 2016 at 4:34 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I believe I have a promising fix for that.  By adding a <link
> rel="canonical"> to the documentation pages that point to the "current"
> version, search engines will be encouraged to return the current version
> search results.

I don't think this "encourages" them to return the current version. I
believe it teaches them that they should only ever return the current
version and the old versions are just copies of it that should never
be returned.

We could perhaps adopt that attitude altogether though. Treat all the
versions as the same page with a selector at the top between the
versions. When you land on a page if you don't pass an explicit
version as a parameter always assume the current version. Unless
people start linking to specific versions in which case the problem
starts all over.


-- 
greg
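
The "one page, version selector" idea above could look roughly like the sketch below: the version comes from an explicit parameter and falls back to the current version otherwise. The `version` query parameter, the function name `requested_version`, and the example URLs are all hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def requested_version(url, known_versions, current):
    """Pick the docs version to render for `url`.

    Uses the (hypothetical) `version` query parameter when it names a known
    version; otherwise falls back to the current version.
    """
    query = parse_qs(urlparse(url).query)
    version = query.get("version", [None])[0]
    return version if version in known_versions else current


known = {"9.5", "9.6"}
print(requested_version("https://example.org/docs/datatype-json.html", known, "9.6"))
print(requested_version("https://example.org/docs/datatype-json.html?version=9.5", known, "9.6"))
```

With this scheme there is a single URL per page for search engines to rank, and the version only matters once a reader picks one.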



Re: an attempt to fix the Google search problem

From
Magnus Hagander
Date:
On Thu, Nov 10, 2016 at 7:32 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Nov 9, 2016 at 4:34 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> I believe I have a promising fix for that.  By adding a <link
>> rel="canonical"> to the documentation pages that point to the "current"
>> version, search engines will be encouraged to return the current version
>> search results.
>
> I don't think this "encourages" them to return the current version. I
> believe it teaches them that they should only ever return the current
> version and the old versions are just copies of it that should never
> be returned.
>
> We could perhaps adopt that attitude altogether though. Treat all the
> versions as the same page with a selector at the top between the
> versions. When you land on a page if you don't pass an explicit
> version as a parameter always assume the current version. Unless
> people start linking to specific versions in which case the problem
> starts all over.

That matches my experience with other sites. The page with the canonical ref is basically removed from the search engine, and any hits from it will go to the other page.

Since we do have the links to previous versions at the top of the page, perhaps that's what we actually want?


Re: an attempt to fix the Google search problem

From
Peter Geoghegan
Date:
On Thu, Nov 10, 2016 at 11:51 AM, Magnus Hagander <magnus@hagander.net> wrote:
> Since we do have the links to previous versions at the top of the page, so
> perhaps that's what we actually want?

In my opinion, that's the ideal. I would definitely vote for that.


-- 
Peter Geoghegan



Re: an attempt to fix the Google search problem

From
Daniel Gustafsson
Date:
> On 10 Nov 2016, at 20:51, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Thu, Nov 10, 2016 at 7:32 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Nov 9, 2016 at 4:34 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > I believe I have a promising fix for that.  By adding a <link
> > rel="canonical"> to the documentation pages that point to the "current"
> > version, search engines will be encouraged to return the current version
> > search results.
>
> I don't think this "encourages" them to return the current version. I
> believe it teaches them that they should only ever return the current
> version and the old versions are just copies of it that should never
> be returned.
>
> We could perhaps adopt that attitude altogether though. Treat all the
> versions as the same page with a selector at the top between the
> versions. When you land on a page if you don't pass an explicit
> version as a parameter always assume the current version. Unless
> people start linking to specific versions in which case the problem
> starts all over.
>
> That matches my experience with other sites. The page with the canonical ref is basically removed from the search
> engine, and any hits from it will go to the other page.
>
> Since we do have the links to previous versions at the top of the page, so perhaps that's what we actually want?

If we want Google searches to always reach the current version unless a
version is explicitly searched for, instrumenting the links to the older versions
with nofollow while keeping them in the sitemap.xml should work.  It does mean
that we kill the accumulated PageRank on the older pages though (which either
of these schemes does AFAICT).

cheers ./daniel
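
A small sketch of the nofollow-plus-sitemap variant described above, assuming the same `/docs/<version>/static/<page>` layout used on the site at the time. The helper names `version_links` and `sitemap_urls` and the base URL default are hypothetical.

```python
from html import escape


def version_links(page, versions, current):
    """Render the per-version links shown at the top of a docs page.

    Links to every version other than `current` get rel="nofollow".
    """
    links = []
    for version in versions:
        rel = "" if version == current else ' rel="nofollow"'
        links.append('<a href="/docs/{v}/static/{p}"{rel}>{v}</a>'.format(
            v=escape(version), p=escape(page), rel=rel))
    return links


def sitemap_urls(page, versions, base="https://www.postgresql.org"):
    """List the URLs of all versions of `page`, so old versions stay in the sitemap."""
    return ["{}/docs/{}/static/{}".format(base, v, page) for v in versions]


for link in version_links("datatype-json.html", ["9.5", "9.6"], "9.6"):
    print(link)
```

The old versions remain discoverable through the sitemap, while the on-page links stop passing ranking to them.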


Re: an attempt to fix the Google search problem

From
Peter Geoghegan
Date:
On Thu, Nov 10, 2016 at 12:12 PM, Daniel Gustafsson <daniel@yesql.se> wrote:
> If we want to Google searches to always reach the current version unless a
> version is explicitly searched, instrumenting the links to the older versions
> with nofollow while keeping them in the sitemap.xml should work.  It does mean
> that we kill the accumulated PageRank on the older pages though (which either
> of these schemes do AFAICT).

Do people really search for something with a specific version
specified? I've never done that, because I rarely particularly need
to, and because I think that it wouldn't work.

-- 
Peter Geoghegan



Re: an attempt to fix the Google search problem

From
Christophe Pettus
Date:
> On Nov 10, 2016, at 12:17, Peter Geoghegan <pg@heroku.com> wrote:
> Do people really search for something with a specific version
> specified?

I do.  It's often successful.

--
-- Christophe Pettus  xof@thebuild.com




Re: an attempt to fix the Google search problem

From
Peter Geoghegan
Date:
On Thu, Nov 10, 2016 at 12:18 PM, Christophe Pettus <xof@thebuild.com> wrote:
> I do.  It's often successful.

If I search for "postgresql jsonb", the first result I see is
"PostgreSQL: Documentation: 9.4: JSON Types", which is what I'd expect
(though, the current version would be better). If I search for
"postgresql jsonb 9.5", I don't see the result "JSON Types" at all,
for any version on the first page of results. I do see "9.5: JSON
Functions and Operators", presumably because those changed in 9.5.

Most of the important information about jsonb is under "JSON Types",
and I don't get to see that at all. I think that putting a version
number into Google is a pretty bad strategy. By making the current
version of each page canonical (or, the latest version that still has
the page), we'd be better off on average. There might be a cost, but
it seems well worth it to me.

-- 
Peter Geoghegan
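
The rule suggested above (canonical target is the current version, or else the latest version that still has the page) can be sketched as a simple lookup. The function name `canonical_version`, the page inventory, and the page name `old-page.html` are made up for illustration.

```python
def canonical_version(page, pages_by_version, versions_newest_first):
    """Return the newest version whose docs still contain `page`, or None."""
    for version in versions_newest_first:
        if page in pages_by_version.get(version, set()):
            return version
    return None


# Hypothetical inventory of which pages exist in which version.
pages_by_version = {
    "9.6": {"datatype-json.html"},
    "9.5": {"datatype-json.html", "old-page.html"},
}
order = ["9.6", "9.5"]
print(canonical_version("datatype-json.html", pages_by_version, order))
print(canonical_version("old-page.html", pages_by_version, order))
```

A page that was dropped in a newer release would then keep a canonical target that actually exists instead of pointing at a missing "current" page.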



Re: an attempt to fix the Google search problem

From
Steve Atkins
Date:
> On Nov 10, 2016, at 12:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Thu, Nov 10, 2016 at 12:12 PM, Daniel Gustafsson <daniel@yesql.se> wrote:
>> If we want to Google searches to always reach the current version unless a
>> version is explicitly searched, instrumenting the links to the older versions
>> with nofollow while keeping them in the sitemap.xml should work.  It does mean
>> that we kill the accumulated PageRank on the older pages though (which either
>> of these schemes do AFAICT).
>
> Do people really search for something with a specific version
> specified? I've never done that, because I rarely particularly need
> to, and because I think that it wouldn't work.

I do quite often. Usually because the question I'm asking is "did feature X exist in version Y".

It's not great, but it gives useful results more often than not.

Cheers, Steve




Re: an attempt to fix the Google search problem

From
Greg Stark
Date:
On Thu, Nov 10, 2016 at 8:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Do people really search for something with a specific version
> specified? I've never done that, because I rarely particularly need
> to, and because I think that it wouldn't work.

Fwiw when I searched for Oracle docs I found including version numbers
*did* reliably produce better results. In that case the problem is
that there are too many blogs and support pages that try to help and
you have to include extra terms to filter out just the most helpful
ones. Including terms like version numbers really helps winnow out the
chaff of pages that are for older or newer versions or aren't
technical enough to specify what version they're for.


I think a more realistic scenario to consider is that someone running an
older release and searching for something like
"ssl_renegotiation_limit" or "wal_level hot_standby" would find
nothing at all rather than documentation for a version that may be
what they're actually running. Now... the fact that
ssl_renegotiation_limit is the *only* such example I could find more
recent than 9.2 may mean it's rare enough that it's not much of a
concern.


-- 
greg



Re: an attempt to fix the Google search problem

From
"Joshua D. Drake"
Date:
On 11/10/2016 12:18 PM, Christophe Pettus wrote:
>
>> On Nov 10, 2016, at 12:17, Peter Geoghegan <pg@heroku.com> wrote:
>> Do people really search for something with a specific version
>> specified?
>
> I do.  It's often successful.

I don't but I bet I am the exception. I just change the URL when I hit 
the page so that it reflects the version I am after.

JD


-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: an attempt to fix the Google search problem

From
Magnus Hagander
Date:


On Fri, Nov 11, 2016 at 12:45 AM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Nov 10, 2016 at 8:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Do people really search for something with a specific version
>> specified? I've never done that, because I rarely particularly need
>> to, and because I think that it wouldn't work.
>
> Fwiw when I searched for Oracle docs I found including version numbers
> *did* reliably produce better results. In that case the problem is
> that there are too many blogs and support pages that try to help and
> you have to include extra terms to filter out just the most helpful
> ones. Including terms like version numbers really helps winnow out the
> chaff of pages that are for older or newer versions or aren't
> technical enough to specify what version they're for.
>
> I think a more realistic scenario to consider is someone running an
> older release and searching for something like
> "ssl_renegotiation_limit" or "wal_level hot_standby" would find
> nothing at all rather than documentation for a version that may be
> what they're actually running. Now.... the fact that
> ssl_renegotiation_limit is the *only* such example could find more
> recent than 9.2 may mean it's rare enough that it's not much of a
> concern

Yeah, that's a bigger issue - there'd be no way at all to get hits on that in Google then, would there? And I think cutting that argument off at supported releases (9.2) is too soon - people migrating *off* earlier versions, for example, would still be interested in searching for them.

Wouldn't the same thing apply to, say, checkpoint_segments, btw? Surely there must be more than just those?

