CXX-3097 update sitemaps and patch mongocxx-3.11.0 pages with redirects #1239

Merged
merged 6 commits into from
Oct 23, 2024

Contributor

@eramongodb eramongodb commented Oct 22, 2024

Summary

Resolves CXX-3097. See also #1238.

This is a followup to #1209:

In advance of the next upcoming release, enables the SITEMAP_URL so that the generated Doxygen API doc pages will include a sitemap.xml which may be referenced by the gh-pages sitemap via manual post-release modifications until a more thorough solution to CXX-3097 is implemented.

This PR along with #1238 implements this thorough solution.

Note

Some iteration and followup changes may be required after examining actual search engine results following the deployment of these changes.

Sitemaps and Sitemap Indexes

The search engine quality of API doc pages is currently very poor:

image

The cached /current page is labeled as mongocxx-3.6.3 with subpages referencing even older versions (3.0.1, 3.1.2, etc.). This suggests that search engines have not indexed the API doc pages in a long time (likely August 15, 2016 per the current sitemap's <lastmod> values), nor with any appreciable thoroughness (API doc pages are not referenced at all by the sitemap). This issue may be directly attributed to the long-outdated sitemap.xml.

Per Google:

A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them. Search engines like Google read this file to crawl your site more efficiently. A sitemap tells search engines which pages and files you think are important in your site, and also provides valuable information about these files. For example, when the page was last updated and any alternate language versions of the page.

Doxygen 1.9.7 implemented a SITEMAP_URL configuration option to generate a sitemap describing the generated API doc pages, which was enabled in #1209 and used during generation of the mongocxx-3.11.0 API doc pages. This PR finally utilizes the generated sitemap by introducing a sitemap index file:

You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes). [...] If you do provide multiple Sitemaps, you should then list each Sitemap file in a Sitemap index file. [...] You can have more than one Sitemap index file. The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file.

Although size limits appear to be the primary motivation behind sitemap index files, this permits a very simple and straightforward integration method for Doxygen-generated sitemap files. See updated release instructions in #1238. These changes should enable search engines to finally crawl the C++ Driver's API doc pages and return up-to-date and relevant results for C++ Driver library interfaces.
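As a rough illustration of the sitemap index idea described above, the following sketch builds an index that lists several sitemap files. The helper names (`buildSitemapIndex`, `escapeXmlText`), the `example.org` URLs, and the dates are hypothetical placeholders for illustration only; the actual files and release steps are those in this PR and #1238.

```javascript
// Hypothetical helper (not part of this PR): escapes characters that are
// special in XML so that URLs containing "&" remain well-formed.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Builds a sitemap index document. Each entry names one sitemap file
// (for example, the site sitemap and a Doxygen-generated sitemap.xml).
function buildSitemapIndex(entries) {
  const items = entries
    .map(
      ({ loc, lastmod }) =>
        "  <sitemap>\n" +
        `    <loc>${escapeXmlText(loc)}</loc>\n` +
        `    <lastmod>${escapeXmlText(lastmod)}</lastmod>\n` +
        "  </sitemap>"
    )
    .join("\n");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    items +
    "\n</sitemapindex>\n"
  );
}

// Placeholder URLs and dates, assumed for illustration only.
const exampleIndexXml = buildSitemapIndex([
  { loc: "https://example.org/sitemap.xml", lastmod: "2024-10-22" },
  { loc: "https://example.org/api/mongocxx-3.11.0/sitemap.xml", lastmod: "2024-10-22" },
]);
console.log(exampleIndexXml);
```

With this shape, a new release would only need one more `<sitemap>` entry (or an updated `<lastmod>`) in the index rather than edits to the generated sitemap files themselves.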

Redirection and Canonical URLs

Due to the use of a symlink for /current (which is expected to be the most often used URL path when referencing or initially navigating to the API doc pages) as well as the GitHub Pages static site structure, finding a satisfying way to inform both users and search engines that /current pages are actually aliases to /mongocxx-<version> equivalents was a challenge. Hugo (used for page generation) does not appear to support a convenient method to implement redirects for subpages, and Doxygen does not support such behavior in its configuration options either. Therefore, a new patch script is introduced instead (in #1238) to directly modify the HTML pages with redirect routines.

Redirection is implemented using the window.location.replace() pattern. This pattern was chosen because it does not generate browser navigation history for the /current page prior to redirect, which avoids the perpetual "Go to Last Page -> Redirected Right Back" problem. The redirect is guarded by a conditional check to avoid perpetual redirection.
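The guarded redirect described above could look roughly like the following. This is a minimal sketch, not the script from #1238: the path prefixes, the hard-coded version, and the function name `versionedPathFor` are assumptions made for illustration.

```javascript
// Hypothetical path prefixes; the real patch script in #1238 may differ.
const CURRENT_PREFIX = "/api/current/";
const VERSIONED_PREFIX = "/api/mongocxx-3.11.0/";

// Returns the versioned path for a /current path, or null when the page is
// already being served from a versioned URL. The null case is the guard
// that prevents perpetual redirection.
function versionedPathFor(pathname) {
  if (!pathname.startsWith(CURRENT_PREFIX)) {
    return null;
  }
  return VERSIONED_PREFIX + pathname.slice(CURRENT_PREFIX.length);
}

// Only runs in a browser; `window` is undefined in other environments.
if (typeof window !== "undefined") {
  const target = versionedPathFor(window.location.pathname);
  if (target !== null) {
    // replace() swaps the current history entry rather than adding one,
    // so the browser's Back button does not return to the /current page.
    window.location.replace(target + window.location.search + window.location.hash);
  }
}
```

Because the same HTML file is served under both the symlinked and versioned paths, the check on `pathname` is what lets one embedded script behave correctly in both locations.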

A "canonical element" is included alongside the redirection script:

A rel="canonical" link element (also known as a canonical element) is an element used in the head section of HTML to indicate that another page is representative of the content on the page.

This (alongside the redirection script itself) indicates to search engines the "canonical URL" of all /current pages:

Canonicalization is the process of selecting the representative –canonical– URL of a piece of content. Consequently, a canonical URL is the URL of a page that Google chose as the most representative from a set of duplicate pages. Often called deduplication, this process helps Google show only one version of the otherwise duplicate content in its search results. [...] Some duplicate content on a site is normal and it's not a violation of Google's spam policies. However, having the same content accessible through many different URLs can be a bad user experience (for example, people might wonder which is the right page, and whether there's a difference between the two) and it may make it harder for you to track how your content performs in search results.

This should ensure search engines understand the /current pages are "aliases" for their versioned URL equivalents and prefer the versioned URL for indexing purposes.
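For illustration, a canonical element for a given page might be generated along these lines. The function name `canonicalLinkTag`, the `example.org` origin, and the directory layout are hypothetical; the sketch assumes an absolute URL is used for `href`, which is a common recommendation for canonical links.

```javascript
// Hypothetical helper (not from this PR): builds the canonical <link>
// element that points a page at its versioned URL.
function canonicalLinkTag(origin, versionDir, pagePath) {
  const href = `${origin}/${versionDir}/${pagePath}`
    // Escape characters that are special inside an HTML attribute value.
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;");
  return `<link rel="canonical" href="${href}"/>`;
}

// Placeholder origin and paths, assumed for illustration only.
console.log(canonicalLinkTag("https://example.org/api", "mongocxx-3.11.0", "classes.html"));
```

The same tag is emitted regardless of whether the page is reached through /current or the versioned path, since both serve the same file.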

These changes have a very nice benefit: when a new release updates the /current symlink, the old API doc pages do not need to be updated at all. The new API doc pages (which also contain the redirect routines) will automatically work with their new /current status, while the old API doc pages will remain reachable via the versioned URL path (which search engines will remember as being canonical, rather than their old /current aliases). The update to the sitemap index entry's <lastmod> field for /current pages should also trigger (re-)indexing of the API doc pages by search engines according to the updated symlink and the new canonical URLs, thus permitting up-to-date search engine results following a new release while preserving the "stability" of old API doc page indexes.

Legacy Pages

The legacy doc pages are given a priority of 0.0 (default is 0.5) to discourage ranking them above the current API doc pages, which are given a priority of 1.0. The obsoleted /categories and /tags are given stub pages which redirect users to the front page.
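The priority values mentioned above correspond to the `<priority>` element of a sitemap `<url>` entry, which accepts values from 0.0 to 1.0. A minimal sketch of generating such entries follows; the function name `sitemapUrlEntry` and the `example.org` URLs are hypothetical and do not reflect the actual site layout.

```javascript
// Hypothetical helper (not from this PR): renders one sitemap <url> entry
// with an explicit priority between 0.0 and 1.0.
function sitemapUrlEntry(loc, priority) {
  if (!(priority >= 0 && priority <= 1)) {
    throw new RangeError("priority must be between 0.0 and 1.0");
  }
  const safeLoc = loc
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return (
    "  <url>\n" +
    `    <loc>${safeLoc}</loc>\n` +
    `    <priority>${priority.toFixed(1)}</priority>\n` +
    "  </url>"
  );
}

// Placeholder URLs, assumed for illustration only.
console.log(sitemapUrlEntry("https://example.org/api/current/", 1.0));
console.log(sitemapUrlEntry("https://example.org/legacy-docs/", 0.0));
```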

@eramongodb eramongodb marked this pull request as ready for review October 22, 2024 15:51
@eramongodb eramongodb requested a review from kevinAlbs October 22, 2024 15:51
@eramongodb eramongodb merged commit 894b5eb into mongodb:gh-pages Oct 23, 2024
@eramongodb eramongodb deleted the cxx-3097-pages branch October 23, 2024 14:28
2 participants