Create updates.xml RSS feed #8908


Merged
merged 5 commits into rust-lang:main from updates-feed on Jun 28, 2024

Conversation


@Turbo87 Turbo87 commented Jun 22, 2024

This PR creates an RSS feed published at https://static.crates.io/rss/updates.xml. The feed is synced with the database in a background job after every successful publish of a new version. It includes the latest 100 published versions with the crate name, version number, crate description, URL and publish date. The feed is created via https://github.com/rust-syndication/rss.

This feature is inspired by https://warehouse.pypa.io/api-reference/feeds.html and roughly matches their structure. In the future we might add the packages.xml (or crates.xml?) feed and the per-package feeds too.

Related:

Example

<item>
    <title>tree-sitter-cpp v0.22.0</title>
    <link>https://crates.io/crates/tree-sitter-cpp/0.22.0</link>
    <description><![CDATA[C++ grammar for tree-sitter]]></description>
    <guid>https://crates.io/crates/tree-sitter-cpp/0.22.0</guid>
    <pubDate>Mon, 15 Apr 2024 01:39:51 +0000</pubDate>
</item>
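The item structure above can be sketched in plain Rust. This is only an illustration of the output shape with hypothetical struct and function names; the actual implementation builds items via the rust-syndication/rss crate, not a hand-rolled serializer.

```rust
// Illustrative sketch only; names (FeedItem, render_item) are hypothetical.
struct FeedItem {
    name: String,
    version: String,
    description: String,
}

fn render_item(item: &FeedItem, pub_date: &str) -> String {
    // Link and GUID both point at the version page on crates.io.
    let url = format!("https://crates.io/crates/{}/{}", item.name, item.version);
    [
        "<item>".to_string(),
        format!("    <title>{} v{}</title>", item.name, item.version),
        format!("    <link>{url}</link>"),
        format!("    <description><![CDATA[{}]]></description>", item.description),
        format!("    <guid>{url}</guid>"),
        format!("    <pubDate>{pub_date}</pubDate>"),
        "</item>".to_string(),
    ]
    .join("\n")
}

fn main() {
    let item = FeedItem {
        name: "tree-sitter-cpp".into(),
        version: "0.22.0".into(),
        description: "C++ grammar for tree-sitter".into(),
    };
    println!("{}", render_item(&item, "Mon, 15 Apr 2024 01:39:51 +0000"));
}
```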

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Jun 22, 2024
@Turbo87 Turbo87 requested a review from a team June 22, 2024 13:08

codecov bot commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 97.85714% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (86dedf5) to head (06c13c1).
Report is 6 commits behind head on main.

Files Patch % Lines
src/worker/jobs/rss/sync_updates_feed.rs 97.74% 4 Missing ⚠️
src/admin/enqueue_job.rs 0.00% 1 Missing ⚠️
src/controllers/krate/publish.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8908      +/-   ##
==========================================
+ Coverage   88.63%   88.72%   +0.09%     
==========================================
  Files         276      278       +2     
  Lines       27645    27925     +280     
==========================================
+ Hits        24502    24777     +275     
- Misses       3143     3148       +5     


@walterhpearce

Is there any reason the title uses a name v123 syntax? Almost everything across the ecosystem uses name-version, and tooling likewise handles parsing for that form. When working with the index, the crates.io APIs, download URLs, and files, it is always with separate name and version fields; when they are combined, it is always with a dash delimiter, easily handled with an rsplit.

I'd request that the title field follow that, so it remains uniform and any automations can continue their common handling of files and URLs.

To be clear, this is mainly a nitpick. I already have to parse out name and version from files, and having to do a unique kind of split to extract info from the RSS irks me 🙃

@Turbo87 Turbo87 force-pushed the updates-feed branch 2 times, most recently from 471e2c6 to b1f21e6 on June 22, 2024 15:18

Turbo87 commented Jun 22, 2024

Is there any reason the title uses a name v123 syntax?

I mostly copied what I've found in the PyPI feeds

Almost everything across the ecosystem uses name-version and likewise handles parsing for such.

I'm more used to seeing name@version, which also solves the ambiguity of - being allowed in the crate name and also in the version number. But I'm not sure people should parse the title field. It might be better to parse the name and version from the URL instead 🤔

easily handled with an rsplit

crate-name-1.0.0-beta.1 disagrees 😉
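The ambiguity is easy to demonstrate with std string methods (a standalone sketch, not code from this PR):

```rust
fn main() {
    // rsplit at the last '-' mis-splits any version with a pre-release tag:
    let (name, version) = "crate-name-1.0.0-beta.1".rsplit_once('-').unwrap();
    assert_eq!((name, version), ("crate-name-1.0.0", "beta.1")); // wrong on both sides

    // name@version has a single unambiguous separator, since '@' can appear
    // neither in a crate name nor in a semver version number:
    let (name, version) = "crate-name@1.0.0-beta.1".split_once('@').unwrap();
    assert_eq!((name, version), ("crate-name", "1.0.0-beta.1")); // correct
    println!("{name} {version}");
}
```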

@Turbo87 Turbo87 force-pushed the updates-feed branch 2 times, most recently from cf6f839 to 342787e on June 22, 2024 15:48
@walterhpearce

I think @ would be more appropriate too. And fair point about the URL parsing.

@LawnGnome

I wonder if it would make more sense to use the extension mechanism in rss to add a couple more fields to the item element that include the name and version separately in a crates.io-specific namespace?


Turbo87 commented Jun 24, 2024

@LawnGnome I don't know RSS well enough to answer that question, but I don't see any reason why not :)

@LawnGnome

@LawnGnome I don't know RSS well enough to answer that question, but I don't see any reason why not :)

I meant the crate more than the standard in that case. 🙂

What I'm thinking here is that if we did something like:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:crates="https://static.crates.io/update.schema.xml">
    <channel>
        <!-- ... -->
        <item>
            <title>foo v1.2.0</title>
            <link>https://crates.io/crates/foo/1.2.0</link>
            <guid>https://crates.io/crates/foo/1.2.0</guid>
            <pubDate>Sat, 22 Jun 2024 15:57:19 +0000</pubDate>
            <crates:name>foo</crates:name>
            <crates:version>1.2.0</crates:version>
        </item>
    </channel>
</rss>

Then that means a user doesn't have to do any string parsing to get the raw crate name and version.

I haven't used the rss crate, but it looks like ItemBuilder::extension() would be the jumping off point to implement that, as far as I can tell.

(Bonus points if we actually publish an XML schema to whatever URL we use for the namespace, but I don't see that as essential for this to work.)
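From the consumer side, the dedicated elements remove the need for any title parsing. A minimal sketch of reading them, using naive substring extraction purely for illustration (a real consumer would use a proper XML parser; the text_of helper here is hypothetical):

```rust
// Naive tag extraction, for illustration only; use a real XML parser in practice.
fn text_of<'a>(xml: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = xml.find(&open)? + open.len();
    let end = xml[start..].find(&close)? + start;
    Some(&xml[start..end])
}

fn main() {
    let item = r#"<item>
        <title>foo v1.2.0</title>
        <crates:name>foo</crates:name>
        <crates:version>1.2.0</crates:version>
    </item>"#;

    // No string parsing of the title required:
    assert_eq!(text_of(item, "crates:name"), Some("foo"));
    assert_eq!(text_of(item, "crates:version"), Some("1.2.0"));
    println!("ok");
}
```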


Turbo87 commented Jun 25, 2024

@LawnGnome sounds good. the extension API of the rss crate is a bit cumbersome, but I managed to make it work now :)


Eh2406 commented Jun 25, 2024

There was a previous discussion https://rust-lang.zulipchat.com/#narrow/stream/318791-t-crates-io/topic/Proposal.3A.20AWS.20SNS.20for.20Crate.20Actions.3F in which @Turbo87 pointed out that this is available from https://github.com/rust-lang/crates.io-index/commits.atom

More generally I'm wondering what design work went into this model? What use cases is this intended for, and what are the threat models of those users? Inasmuch as this is copying from existing ecosystems (you mentioned Pypi) have you talked to the maintainers or users of those ecosystems about what they would do differently?

@LawnGnome LawnGnome left a comment

Oof, you're not wrong about that extension API, particularly when not using the builder API.

Thanks for getting it working!


Turbo87 commented Jun 25, 2024

in which @Turbo87 pointed out that this is available from https://github.com/rust-lang/crates.io-index/commits.atom

which is still correct, though the scalability of the git index is becoming more and more of a problem, and I thought it probably makes sense to have an alternative in case we hit any hard limits in the future.

More generally I'm wondering what design work went into this model?

As written in the PR description, I was primarily just following the PyPI example

What use cases is this intended for

see #127 for example

and what are the threat models of those users?

I'm not sure what you mean by that? How are threat models involved in this?

Inasmuch as this is copying from existing ecosystems (you mentioned Pypi) have you talked to the maintainers or users of those ecosystems about what they would do differently?

No, but as Walter and Adam pointed out, it is useful to have dedicated elements for the crate name and version to make the feed easier to use if not consumed directly by a human.


Eh2406 commented Jun 25, 2024

though the scalability of the git index is becoming more and more of a problem and I thought it probably makes sense to have an alternative in case we will hit any hard limits in the future.

Good point. I haven't thought about the scalability of the git index in a long time. It's probably back to being critical. If it is, I would be happy to help brainstorm short- to medium-term ways to reduce the load. Anyway, good of you to be getting ahead of it.

I was primarily just following the PyPI example

PyPI has many APIs that have not aged well over the past few decades. The general pattern is that things were added because they were easy, and the community locked them in because they were the only thing available. This is amplified by tools being unwilling to switch to newer APIs, because they have to work with a common denominator of mirrors and third-party registries. I don't know where these endpoints landed on the scale from "works fine for what it does" to "why did we do that to ourselves". Would you like me to reach out to PyPI for their thoughts?

What use cases is this intended for

see #127 for example

#127 does not provide a lot of detail, just an "it would be nice". It is also asking for a different feature: an RSS feed for only the packages you subscribe to, or an RSS feed per package. The Zulip conversation was discussing "mirroring published crates" and "scanning/documenting crates as they're published".

I'm not sure what you mean by that? How are threat models involved in this?

You're right that was not a clear way of saying things. Sorry. What I had in mind was:

  1. List all the use cases that are intended.
  2. List all of the ways the infrastructure could fail.
  3. For the cross product of 1 and 2. Describe the impact, how the issue would be resolved, and how important that issue would be to that user.

This has a lot of structure in common with a "threat model", but doesn't actually have anything to do with threats.

So for example, a thing that could go wrong is that aws-sdk could publish all of its (multiple hundreds of) crates, so that more than 100 crates have been published since the last time the follower checked the RSS feed. This would lead to the follower missing publish notifications for those crates. This means any system based on this RSS feed needs to either be comfortable missing publishes or have some background task that looks for missing entries; the RSS feed is then just a fast path to reduce response time. I imagine for the security scanning use case, this could be catastrophic: many attacks are most dangerous during their initial publication. Even for the mirroring use case, there probably needs to be some kind of bound on how long it takes to become consistent. An SLA of "if we see it in the RSS feed we pull it in within 1 min, but if we miss it, it can take us hours" is not catastrophic, but is going to be hard on users.

Speaking of that background task looking for missing entries: we are going to need to provide another API to make that possible, assuming the git index goes away. If we do something Merkle-tree-ish or TUF-ish, then the RSS feed will be redundant.


Turbo87 commented Jun 26, 2024

Would you like me to reach out to PyPI for their thoughts?

sure, sounds like a reasonable idea to ask them about their experience :)

#127 does not provide a lot of detail, just an "it would be nice". It is also asking for a different feature: an RSS feed for only the packages you subscribe to, or an RSS feed per package. The Zulip conversation was discussing "mirroring published crates" and "scanning/documenting crates as they're published".

yeah, I know it's not exactly the same use case, but this PR sets up some of the infrastructure to provide per-crate feeds in the future, which would kinda resolve #127.

So for example, a thing that could go wrong is that aws-sdk could publish all of its (multiple hundreds) crates so that more than 100 crates have been published since the last time the follower checked the RSS feed. This would lead to the follower missing publish notifications for those crates.

We could apply logic like "include all publishes from the past 60min, but at least 100 items", so if the poll interval is set to 50min it shouldn't miss any updates? And if you're offline for a bit then you would need a full sync anyway.
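That selection rule ("everything from the past 60 minutes, but at least 100 items") can be sketched as a plain filter over a newest-first list. This is a standalone illustration with hypothetical names, not the code from this PR:

```rust
/// Ages in minutes, newest first. Keep an entry if it is within the time
/// window OR still inside the minimum item count.
fn select(ages_minutes: &[u32], window_minutes: u32, min_items: usize) -> Vec<u32> {
    ages_minutes
        .iter()
        .enumerate()
        .filter(|&(idx, &age)| idx < min_items || age <= window_minutes)
        .map(|(_, &age)| age)
        .collect()
}

fn main() {
    // Quiet period: only the newest 100 items make the cut, even if some are old.
    let quiet: Vec<u32> = (0..120).map(|i| i * 2).collect(); // ages 0, 2, ..., 238 min
    assert_eq!(select(&quiet, 60, 100).len(), 100);

    // Burst (e.g. an aws-sdk release): everything from the last hour stays,
    // even though it is more than 100 items.
    let burst: Vec<u32> = (0..300).map(|i| i / 10).collect(); // 300 items, ages 0..=29 min
    assert_eq!(select(&burst, 60, 100).len(), 300);
    println!("ok");
}
```

With a 50-minute poll interval and a 60-minute window, a follower sees every item at least once unless it skips a poll, at which point a full sync is needed anyway.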


Eh2406 commented Jun 26, 2024

We could apply logic like "include all publishes from the past 60min, but at least 100 items", so if the poll interval is set to 50min it shouldn't miss any updates? And if you're offline for a bit then you would need a full sync anyway.

I think that would be an improvement, although we should think about scalability. When the rate of publishing doubles two or three more times, how big will the last 60min of publishes be? Perhaps the RSS feed itself could include what time interval it currently guarantees.

Anyway that was just one example, we should enumerate all the failure modes and make sure we have appropriate mitigations in place or document the limitation.

Actually, I made an assumption and might be wrong about it. What stability guarantee are we making about this API? If we have some way to communicate that it is still experimental and subject to rapid change (like we've effectively done for the database dumps), then these are questions I'd like to see answered at some point, not necessarily before merge. If we are not absolutely clear, we will end up Hyrum's-Lawing ourselves into not being able to fix these sorts of things.


Eh2406 commented Jun 26, 2024

I got a reasonably comforting response from the PyPI maintainers. Thank you to Seth Larson and Mike Fiedler. They recommended we look at https://github.com/pypi/warehouse/labels/APIs%2Ffeeds to see what problems people are having with these APIs. After reading, here are some takeaways:

  • These are part of the newer "JSON" APIs, not the problematic "XML-RPC" APIs.
  • I've seen no complaints about maintenance or operations of these APIs.
  • People are using them to watch for changes, for mirroring and scanning purposes, with associated complaints about packages slipping through during high-activity periods.
  • People would like to use them for checking what changed since they last synced, but the feeds are not suitable for that purpose. (Unless you're syncing every few minutes.)

It sounds like these APIs are firmly in the "works fine for what it does" camp.

I'm now feeling that this PR is a good first step toward building something useful for the community. Things will need to be changed as people start relying on this, and some of those use cases will require all-new approaches that are somewhat redundant with this API. But it sounds like you're open to improving this as things are suggested, and to maintaining it after the other sources of this data go away.

Thank you for your tireless work to make crates.io more usable and scalable. I apologize for butting in.

Turbo87 added 4 commits June 27, 2024 17:41
This creates an RSS feed published at https://static.crates.io/rss/updates.xml. The feed is synced with the database in a background job after every successful publish of a new version. It includes the latest 100 published versions with the crate name, version number, crate description, URL and publish date.

Turbo87 commented Jun 27, 2024

We could apply logic like "include all publishes from the past 60min, but at least 100 items"

I've rebased the branch and implemented this logic on top of the existing code. I hope that makes the feed at least a little more useful/reliable.

@Turbo87 Turbo87 requested a review from LawnGnome June 27, 2024 16:22
@Turbo87 Turbo87 merged commit f23cbf7 into rust-lang:main Jun 28, 2024
9 checks passed
@Turbo87 Turbo87 deleted the updates-feed branch June 28, 2024 06:55