
Deploy package to conda-forge #11

Closed
wants to merge 2 commits

Conversation


@dominictarro dominictarro commented Dec 7, 2024

Closes #10

A GitHub workflow for deploying the package to conda-forge.
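For context, a rough sketch of the manual build-and-upload steps a workflow like this automates, assuming conda-build and the anaconda-client CLI are installed; the recipe path and noarch output below are assumptions, and the actual workflow in this PR may differ:

```bash
# Build the conda package from a recipe directory (path is hypothetical).
conda build recipe/ --output-folder build/

# Upload the built artifact to an Anaconda.org channel using a token
# with publishing permission (exposed to the workflow as ANACONDA_TOKEN).
anaconda --token "$ANACONDA_TOKEN" upload build/noarch/semchunk-*.tar.bz2
```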

reserve build and deploy for manual run (after publishing to pypi)
@umarbutler
Collaborator

Awesome, thanks for this! What do I need to do to make it work? I noticed for example there's a reference to /~https://github.com/umarbutler/staged-recipes.git and also an ANACONDA_TOKEN secret?

@umarbutler umarbutler self-assigned this Dec 9, 2024
@umarbutler umarbutler added the enhancement label Dec 9, 2024
@dominictarro
Author

@umarbutler

I got the steps working locally and was able to publish to my account's index, https://anaconda.org/dominictarro/semchunk, but it's a bit different for conda-forge. Publishing to conda-forge is a first for me, so we'll learn as we go. conda-forge has its own process that involves forking the staged-recipes repo and opening a PR to merge a branch. This happens within the workflow, but I wasn't able to test it. I'm doing this for a package at work right now and can apply some of that learning here.

You only need to create an Anaconda account and get an API key with publishing permission. Set that as the repository secret ANACONDA_TOKEN.
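If you use the GitHub CLI, one way to set that secret (assuming it lives on umarbutler/semchunk; gh prompts for the value):

```bash
# Store the Anaconda API key as a repository secret.
gh secret set ANACONDA_TOKEN --repo umarbutler/semchunk
```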

@dominictarro
Author

I may have spoken too soon. I'll research this more thoroughly and get back to you.

@umarbutler
Collaborator

> I may have spoken too soon. I'll research this more thoroughly and get back to you.

Keep me posted :) I don’t use Anaconda myself so it’s unfamiliar territory to me.

@dominictarro
Author

> Keep me posted :) I don’t use Anaconda myself so it’s unfamiliar territory to me.

conda-forge/staged-recipes#28590

Got everything set up; I just need you to comment on the PR. The whole PR process through conda-forge/staged-recipes is a one-time thing. Once that is done, they will create a repository in the conda-forge org named semchunk-feedstock that we will be added to as maintainers.

All that maintenance really involves is updating the meta.yaml's version number and its SHA256 hash from PyPI (e.g. for 2.2.0). Fork the feedstock, make the changes on a branch, and open a PR to the parent feedstock.

https://conda-forge.org/docs/maintainer/updating_pkgs/#example-workflow-for-updating-a-package
https://conda-forge.org/docs/maintainer/updating_pkgs/#updating-recipes
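A rough sketch of that update cycle; the fork URL, branch name, and version are illustrative, and the conda-forge docs linked above are authoritative:

```bash
# Grab the sha256 of the new sdist from PyPI's JSON API (version shown is an example).
curl -sL https://pypi.org/pypi/semchunk/2.2.0/json \
  | python -c "import json, sys; print(next(u['digests']['sha256'] for u in json.load(sys.stdin)['urls'] if u['filename'].endswith('.tar.gz')))"

# Fork conda-forge/semchunk-feedstock on GitHub, then:
git clone git@github.com:<your-username>/semchunk-feedstock.git
cd semchunk-feedstock
git checkout -b update-2.2.0

# Edit recipe/meta.yaml: bump the version, paste in the new sha256, and reset the
# build number to 0. Then push and open a PR against conda-forge/semchunk-feedstock.
git commit -am "Update semchunk to 2.2.0"
git push origin update-2.2.0
```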

@umarbutler
Collaborator

@dominictarro So I just need to merge this into my main branch? And set secrets.ANACONDA_TOKEN? There's also secrets.GITHUB_TOKEN; what about that?

@dominictarro
Author

> @dominictarro So I just need to merge this into my main branch? And set secrets.ANACONDA_TOKEN? There's also secrets.GITHUB_TOKEN; what about that?

Sorry, I should have elaborated. This PR and my fork of semchunk can be closed unmerged. They aren't necessary.

I need to update dominictarro/staged-recipes with the latest changes from conda-forge/staged-recipes. I will update the meta.yaml in my fork to use semchunk v2.2.2.

Since this PR can be closed unmerged, no token needs to be created. conda-forge will copy the build from PyPI to the conda-forge "channel" on conda, and they will use their own CI/CD to do so.

To trigger their CI/CD, we just create a PR to the "feedstock" repo that they create with the meta.yaml changes that we want (i.e. version, build hash).

There's a tool they mention, regro-cf-autotick-bot, for automatically updating the feedstock when PyPI changes are detected. I haven't seen semchunk release frequently enough to make it useful, but it's an option.

@umarbutler
Collaborator

@dominictarro Just to clarify: semchunk is already on conda, and when I publish my next release I just need to go to the conda-forge/semchunk-feedstock repo and open a PR modifying conda-forge/semchunk-feedstock/recipe/meta.yaml with whatever needs updating? If so, that should be easy. I have an internal checklist for updating semchunk that I can add that to :)

@dominictarro
Author

@umarbutler correct, and sounds good!

@dominictarro dominictarro deleted the conda branch December 19, 2024 14:33
@umarbutler
Collaborator

Hey @dominictarro, I'm prepping a new release and I'd like to mention how to install semchunk with conda.

Sorry if this is a silly question, but I just want to confirm: is this the best way to install semchunk with conda?

```bash
conda install dominictarro::semchunk
```

Or is it possible to just do:

```bash
conda install semchunk
```

@dominictarro
Author

@umarbutler

The dominictarro::semchunk package was just a test I did on my personal channel (package index). I can remove it. Users should install with:

```bash
conda install conda-forge::semchunk
# or
conda install -c conda-forge semchunk
```

There's a way to set conda-forge as a default channel to install from so you don't have to specify conda-forge, but it involves modifying a config file.
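For reference, that config change is usually the following (it writes to ~/.condarc):

```bash
# Make conda-forge the preferred channel so "conda install semchunk" works directly.
conda config --add channels conda-forge
conda config --set channel_priority strict
```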

You can also share the conda package's listing: https://anaconda.org/conda-forge/semchunk

@dominictarro
Author

@umarbutler

Unrelated, but maybe interesting to some users: I created a variant of semchunk for Rust, semchunk-rs. You can find it below:

https://crates.io/crates/semchunk-rs
/~https://github.com/dominictarro/semchunk-rs

@umarbutler
Collaborator

@dominictarro I actually came across that earlier but didn't have time to look into it further. Thanks for creating that port! I'll link to it in my README.

I had been thinking about trying to speed semchunk up even further by offloading to Rust but I'm not a Rust coder. I was meaning to learn more about it later. Do you think that's possible by combining our work or is your port not as fast as the Python version?

@dominictarro
Author

@umarbutler Frankly, that was my first Rust package, so I'm no expert! My package is for Rust applications, not a Rust backend with Python bindings. I used a Rust-native tokenizer library, rust_tokenizers. I don't know how a Rust backend would interface with sentence-transformers or tiktoken. Want to create an issue so I and others can experiment and see if it will work?

The Rust version clocks in at 6.22s with RoBERTa against the Gutenberg corpus, but you used the GPT-4 tokenizer. I see you recently got your benchmark down from 6.69s to 2.87s. I think if I throw multiprocessing at it I can get the Rust version's numbers down. I'll have to check the changes and see what you did.

@umarbutler
Collaborator

> @umarbutler Frankly, that was my first Rust package, so I'm no expert! My package is for Rust applications, not a Rust backend with Python bindings. I used a Rust-native tokenizer library, rust_tokenizers. I don't know how a Rust backend would interface with sentence-transformers or tiktoken. Want to create an issue so I and others can experiment and see if it will work?

Ah, I think I'll have to do some more research into how the interfacing would work. To be honest, semchunk is already quite efficient; the biggest gains come from using fast tokenizers.

> The Rust version clocks in at 6.22s with RoBERTa against the Gutenberg corpus, but you used the GPT-4 tokenizer. I see you recently got your benchmark down from 6.69s to 2.87s. I think if I throw multiprocessing at it I can get the Rust version's numbers down. I'll have to check the changes and see what you did.

I don't think there was any change except for getting a much faster PC than what I had before 😆 You should try benchmarking the Python library on the same hardware and then compare the results.

@umarbutler
Collaborator

@dominictarro v3.0.0 is out 🍾

@dominictarro
Author

> To be honest, semchunk is already quite efficient; the biggest gains come from using fast tokenizers.

Agreed. The tokenizer is probably >=90% of the compute at this point.
