Create updates.xml RSS feed #8908


Merged
merged 5 commits into rust-lang:main from updates-feed on Jun 28, 2024

Conversation


@Turbo87 Turbo87 commented Jun 22, 2024

This PR creates an RSS feed published at https://static.crates.io/rss/updates.xml. The feed is synced with the database in a background job after every successful publish of a new version. It includes the latest 100 published versions with the crate name, version number, crate description, URL and publish date. The feed is created via https://github.com/rust-syndication/rss.

This feature is inspired by https://warehouse.pypa.io/api-reference/feeds.html and roughly matches their structure. In the future we might add the packages.xml (or crates.xml?) feed and the per-package feeds too.

Related:

Example

<item>
    <title>tree-sitter-cpp v0.22.0</title>
    <link>https://crates.io/crates/tree-sitter-cpp/0.22.0</link>
    <description><![CDATA[C++ grammar for tree-sitter]]></description>
    <guid>https://crates.io/crates/tree-sitter-cpp/0.22.0</guid>
    <pubDate>Mon, 15 Apr 2024 01:39:51 +0000</pubDate>
</item>
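The item structure above can be sketched in plain Rust. This is only an illustration of the output shape with hypothetical struct and function names; the actual implementation builds items via the rust-syndication/rss crate, not a hand-rolled serializer.

```rust
// Illustrative sketch only; names (FeedItem, render_item) are hypothetical.
struct FeedItem {
    name: String,
    version: String,
    description: String,
}

fn render_item(item: &FeedItem, pub_date: &str) -> String {
    // Link and GUID both point at the version page on crates.io.
    let url = format!("https://crates.io/crates/{}/{}", item.name, item.version);
    [
        "<item>".to_string(),
        format!("    <title>{} v{}</title>", item.name, item.version),
        format!("    <link>{url}</link>"),
        format!("    <description><![CDATA[{}]]></description>", item.description),
        format!("    <guid>{url}</guid>"),
        format!("    <pubDate>{pub_date}</pubDate>"),
        "</item>".to_string(),
    ]
    .join("\n")
}

fn main() {
    let item = FeedItem {
        name: "tree-sitter-cpp".into(),
        version: "0.22.0".into(),
        description: "C++ grammar for tree-sitter".into(),
    };
    println!("{}", render_item(&item, "Mon, 15 Apr 2024 01:39:51 +0000"));
}
```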

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Jun 22, 2024
@Turbo87 Turbo87 requested a review from a team June 22, 2024 13:08

codecov bot commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 97.85714% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (86dedf5) to head (06c13c1).
Report is 6 commits behind head on main.

Files Patch % Lines
src/worker/jobs/rss/sync_updates_feed.rs 97.74% 4 Missing ⚠️
src/admin/enqueue_job.rs 0.00% 1 Missing ⚠️
src/controllers/krate/publish.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8908      +/-   ##
==========================================
+ Coverage   88.63%   88.72%   +0.09%     
==========================================
  Files         276      278       +2     
  Lines       27645    27925     +280     
==========================================
+ Hits        24502    24777     +275     
- Misses       3143     3148       +5     


@walterhpearce

Is there any reason the title uses a name v123 syntax? Almost everything across the ecosystem uses name-version, and tooling likewise handles parsing for that form. When working with the index, the crates.io APIs, download URLs, and files, it is always with separate name and version fields; when they are combined, it is always with a dash delimiter, easily handled with an rsplit.

I'd request that the title field follow that, so it remains uniform and any automations can continue their common handling of files and URLs.

To be clear, this is mainly a nitpick. I already have to parse out name and version from files, and having to do a unique kind of split to extract info from the RSS irks me 🙃

@Turbo87 Turbo87 force-pushed the updates-feed branch 2 times, most recently from 471e2c6 to b1f21e6 on June 22, 2024 15:18

Turbo87 commented Jun 22, 2024

Is there any reason the title uses a name v123 syntax?

I mostly copied what I've found in the PyPI feeds

Almost everything across the ecosystem uses name-version and likewise handles parsing for such.

I'm more used to seeing name@version, which also solves the ambiguity of - being allowed in the crate name and also in the version number. But I'm not sure people should parse the title field. It might be better to parse the name and version from the URL instead 🤔

easily handled with an rsplit

crate-name-1.0.0-beta.1 disagrees 😉
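The ambiguity is easy to demonstrate with std string methods (a standalone sketch, not code from this PR):

```rust
fn main() {
    // rsplit at the last '-' mis-splits any version with a pre-release tag:
    let (name, version) = "crate-name-1.0.0-beta.1".rsplit_once('-').unwrap();
    assert_eq!((name, version), ("crate-name-1.0.0", "beta.1")); // wrong on both sides

    // name@version has a single unambiguous separator, since '@' can appear
    // neither in a crate name nor in a semver version number:
    let (name, version) = "crate-name@1.0.0-beta.1".split_once('@').unwrap();
    assert_eq!((name, version), ("crate-name", "1.0.0-beta.1")); // correct
    println!("{name} {version}");
}
```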

@Turbo87 Turbo87 force-pushed the updates-feed branch 2 times, most recently from cf6f839 to 342787e on June 22, 2024 15:48
@walterhpearce

I think @ would be more appropriate too. And fair point about the URL parsing.

@LawnGnome

I wonder if it would make more sense to use the extension mechanism in rss to add a couple more fields to the item element that include the name and version separately in a crates.io-specific namespace?


Turbo87 commented Jun 24, 2024

@LawnGnome I don't know RSS well enough to answer that question, but I don't see any reason why not :)

@LawnGnome

@LawnGnome I don't know RSS well enough to answer that question, but I don't see any reason why not :)

I meant the crate more than the standard in that case. 🙂

What I'm thinking here is that if we did something like:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:crates="https://static.crates.io/update.schema.xml">
    <channel>
        <!-- ... -->
        <item>
            <title>foo v1.2.0</title>
            <link>https://crates.io/crates/foo/1.2.0</link>
            <guid>https://crates.io/crates/foo/1.2.0</guid>
            <pubDate>Sat, 22 Jun 2024 15:57:19 +0000</pubDate>
            <crates:name>foo</crates:name>
            <crates:version>1.2.0</crates:version>
        </item>
    </channel>
</rss>

Then that means a user doesn't have to do any string parsing to get the raw crate name and version.

I haven't used the rss crate, but it looks like ItemBuilder::extension() would be the jumping off point to implement that, as far as I can tell.

(Bonus points if we actually publish an XML schema to whatever URL we use for the namespace, but I don't see that as essential for this to work.)
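From the consumer side, the dedicated elements remove the need for any title parsing. A minimal sketch of reading them, using naive substring extraction purely for illustration (a real consumer would use a proper XML parser; the text_of helper here is hypothetical):

```rust
// Naive tag extraction, for illustration only; use a real XML parser in practice.
fn text_of<'a>(xml: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = xml.find(&open)? + open.len();
    let end = xml[start..].find(&close)? + start;
    Some(&xml[start..end])
}

fn main() {
    let item = r#"<item>
        <title>foo v1.2.0</title>
        <crates:name>foo</crates:name>
        <crates:version>1.2.0</crates:version>
    </item>"#;

    // No string parsing of the title required:
    assert_eq!(text_of(item, "crates:name"), Some("foo"));
    assert_eq!(text_of(item, "crates:version"), Some("1.2.0"));
    println!("ok");
}
```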


Turbo87 commented Jun 25, 2024

@LawnGnome sounds good. the extension API of the rss crate is a bit cumbersome, but I managed to make it work now :)


Eh2406 commented Jun 25, 2024

There was a previous discussion https://rust-lang.zulipchat.com/#narrow/stream/318791-t-crates-io/topic/Proposal.3A.20AWS.20SNS.20for.20Crate.20Actions.3F in which @Turbo87 pointed out that this is available from https://github.com/rust-lang/crates.io-index/commits.atom

More generally I'm wondering what design work went into this model? What use cases is this intended for, and what are the threat models of those users? Inasmuch as this is copying from existing ecosystems (you mentioned Pypi) have you talked to the maintainers or users of those ecosystems about what they would do differently?

@LawnGnome LawnGnome left a comment

Oof, you're not wrong about that extension API, particularly when not using the builder API.

Thanks for getting it working!


Turbo87 commented Jun 25, 2024

in which @Turbo87 pointed out that this is available from https://github.com/rust-lang/crates.io-index/commits.atom

which is still correct, though the scalability of the git index is becoming more and more of a problem, and I thought it probably makes sense to have an alternative in case we hit any hard limits in the future.

More generally I'm wondering what design work went into this model?

As written in the PR description, I was primarily just following the PyPI example

What use cases is this intended for

see #127 for example

and what are the threat models of those users?

I'm not sure what you mean by that? How are threat models involved in this?

Inasmuch as this is copying from existing ecosystems (you mentioned Pypi) have you talked to the maintainers or users of those ecosystems about what they would do differently?

No, but as Walter and Adam pointed out, it is useful to have dedicated elements for the crate name and version to make the feed easier to use if not consumed directly by a human.


Eh2406 commented Jun 25, 2024

though the scalability of the git index is becoming more and more of a problem and I thought it probably makes sense to have an alternative in case we will hit any hard limits in the future.

Good point. I haven't thought about the scalability of the git index in a long time. It's probably back to being critical. If it is, I would be happy to help brainstorm short- to medium-term ways to reduce the load. Anyway, good of you to be getting ahead of it.

I was primarily just following the PyPI example

PyPI has many APIs that have not aged well over the past few decades. The general pattern is that things were added because they were easy, and the community locked them in because they were the only thing available. This is amplified by tools being unwilling to switch to newer APIs, because they have to work with a common denominator of mirrors and third-party registries. I don't know where these endpoints landed on the scale from "works fine for what it does" to "why did we do that to ourselves". Would you like me to reach out to PyPI for their thoughts?

What use cases is this intended for

see #127 for example

#127 does not provide a lot of detail, just an "it would be nice". It is also asking for a different feature: an RSS feed for only the packages you subscribe to, or an RSS feed per package. The Zulip conversation was discussing "mirroring published crates" and "scanning/documenting crates as they're published".

I'm not sure what you mean by that? How are threat models involved in this?

You're right that was not a clear way of saying things. Sorry. What I had in mind was:

  1. List all the use cases that are intended.
  2. List all of the ways the infrastructure could fail.
  3. For the cross product of 1 and 2. Describe the impact, how the issue would be resolved, and how important that issue would be to that user.

This has a lot of structure in common with a "threat model", but doesn't actually have anything to do with threats.

So for example, a thing that could go wrong is that aws-sdk could publish all of its (multiple hundreds of) crates, so that more than 100 crates have been published since the last time the follower checked the RSS feed. This would lead to the follower missing publish notifications for those crates. This means any system based on this RSS feed needs to either be comfortable missing publishes or have some background task that looks for missing entries; the RSS feed is then just a fast path to reduce response time. I imagine for the security scanning use case, this could be catastrophic: many attacks are most dangerous during their initial publication. Even for the mirroring use case, there probably needs to be some kind of bound on how long it takes to become consistent. An SLA of "if we see it in the RSS feed we pull it in within 1 min, but if we miss it, it can take us hours" is not catastrophic, but is going to be hard on users.

Speaking of that background task looking for missing entries: we are going to need to provide another API to make that possible, assuming the git index goes away. If we do something Merkle-tree-ish or TUF-ish, then the RSS feed will be redundant.


Turbo87 commented Jun 26, 2024

Would you like me to reach out to PyPI for their thoughts?

sure, sounds like a reasonable idea to ask them about their experience :)

#127 does not provide a lot of detail, just an "it would be nice". It is also asking for a different feature: an RSS feed for only the packages you subscribe to, or an RSS feed per package. The Zulip conversation was discussing "mirroring published crates" and "scanning/documenting crates as they're published".

yeah, I know it's not exactly the same use case, but this PR sets up some of the infrastructure to provide per-crate feeds in the future, which would kinda resolve #127.

So for example, a thing that could go wrong is that aws-sdk could publish all of its (multiple hundreds) crates so that more than 100 crates have been published since the last time the follower checked the RSS feed. This would lead to the follower missing publish notifications for those crates.

We could apply logic like "include all publishes from the past 60min, but at least 100 items", so if the poll interval is set to 50min it shouldn't miss any updates? And if you're offline for a bit then you would need a full sync anyway.
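That selection rule ("everything from the past 60 minutes, but at least 100 items") can be sketched as a plain filter over a newest-first list. This is a standalone illustration with hypothetical names, not the code from this PR:

```rust
/// Ages in minutes, newest first. Keep an entry if it is within the time
/// window OR still inside the minimum item count.
fn select(ages_minutes: &[u32], window_minutes: u32, min_items: usize) -> Vec<u32> {
    ages_minutes
        .iter()
        .enumerate()
        .filter(|&(idx, &age)| idx < min_items || age <= window_minutes)
        .map(|(_, &age)| age)
        .collect()
}

fn main() {
    // Quiet period: only the newest 100 items make the cut, even if some are old.
    let quiet: Vec<u32> = (0..120).map(|i| i * 2).collect(); // ages 0, 2, ..., 238 min
    assert_eq!(select(&quiet, 60, 100).len(), 100);

    // Burst (e.g. an aws-sdk release): everything from the last hour stays,
    // even though it is more than 100 items.
    let burst: Vec<u32> = (0..300).map(|i| i / 10).collect(); // 300 items, ages 0..=29 min
    assert_eq!(select(&burst, 60, 100).len(), 300);
    println!("ok");
}
```

With a 50-minute poll interval and a 60-minute window, a follower sees every item at least once unless it skips a poll, at which point a full sync is needed anyway.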


Eh2406 commented Jun 26, 2024

We could apply logic like "include all publishes from the past 60min, but at least 100 items", so if the poll interval is set to 50min it shouldn't miss any updates? And if you're offline for a bit then you would need a full sync anyway.

I think that would be an improvement, although we should think about scalability. When the rate of publishing doubles two or three more times, how big will the last 60min of publishes be? Perhaps the RSS feed itself could include what time interval it currently guarantees.

Anyway that was just one example, we should enumerate all the failure modes and make sure we have appropriate mitigations in place or document the limitation.

Actually, I made an assumption and might be wrong about it. What stability guarantee are we making about this API? If we have some way to communicate that it is still experimental and subject to rapid change (like we've effectively done for the database dumps), then these are questions I'd like to see answered at some point, not necessarily before merge. If we are not absolutely clear, we will end up Hyrum's-Lawing ourselves into not being able to fix these sorts of things.


Eh2406 commented Jun 26, 2024

I got a reasonably comforting response from the PyPI maintainers. Thank you to Seth Larson and Mike Fiedler. They recommended we look at https://github.com/pypi/warehouse/labels/APIs%2Ffeeds to see what problems people are having with these APIs. After reading, here are some takeaways:

  • These are part of the newer "JSON" APIs, not the problematic "XML-RPC" APIs.
  • I've seen no complaints about maintenance or operations of these APIs.
  • People are using them to watch for changes, for mirroring and scanning purposes, with associated complaints about packages slipping through during high-activity periods.
  • People would like to use them for checking what changed since they last synced, but the feeds are not suitable for that purpose. (Unless you're syncing every few minutes.)

It sounds like these APIs are firmly in the "works fine for what it does" camp.

I'm now feeling that this PR is a good first step toward building something useful for the community. Things will need to be changed as people start relying on this, and some of those use cases will require all-new approaches that are somewhat redundant with this API. But it sounds like you're open to improving this as things are suggested, and to maintaining it after the other sources of this data go away.

Thank you for your tireless work to make crates.io more usable and scalable. I apologize for butting in.

Turbo87 added 4 commits June 27, 2024 17:41
This creates an RSS feed published at https://static.crates.io/rss/updates.xml. The feed is synced with the database in a background job after every successful publish of a new version. It includes the latest 100 published versions with the crate name, version number, crate description, URL and publish date.

Turbo87 commented Jun 27, 2024

We could apply logic like "include all publishes from the past 60min, but at least 100 items"

I've rebased the branch and implemented this logic on top of the existing code. I hope that makes the feed at least a little more useful/reliable.

@Turbo87 Turbo87 requested a review from LawnGnome June 27, 2024 16:22
@Turbo87 Turbo87 merged commit f23cbf7 into rust-lang:main Jun 28, 2024
9 checks passed
@Turbo87 Turbo87 deleted the updates-feed branch June 28, 2024 06:55