
Deploy package to conda-forge #11

Closed
wants to merge 2 commits

Conversation


@dominictarro dominictarro commented Dec 7, 2024

Closes #10

A GitHub workflow for deploying the package to conda-forge.
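For context, a rough sketch of the manual build-and-upload steps a workflow like this automates, assuming conda-build and the anaconda-client CLI are installed; the recipe path and noarch output below are assumptions, and the actual workflow in this PR may differ:

```bash
# Build the conda package from a recipe directory (path is hypothetical).
conda build recipe/ --output-folder build/

# Upload the built artifact to an Anaconda.org channel using a token
# with publishing permission (exposed to the workflow as ANACONDA_TOKEN).
anaconda --token "$ANACONDA_TOKEN" upload build/noarch/semchunk-*.tar.bz2
```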

reserve build and deploy for manual run (after publishing to pypi)
@umarbutler
Collaborator

Awesome, thanks for this! What do I need to do to make it work? I noticed for example there's a reference to /~https://github.com/umarbutler/staged-recipes.git and also an ANACONDA_TOKEN secret?

@umarbutler umarbutler self-assigned this Dec 9, 2024
@umarbutler umarbutler added the enhancement label Dec 9, 2024
@dominictarro
Author

@umarbutler

I got the steps working locally and was able to publish to my account's index, https://anaconda.org/dominictarro/semchunk, but it's a bit different for conda-forge. Publishing to conda-forge is a first for me, so we'll learn as we go. conda-forge has its own process that involves forking the staged-recipes repo and opening a PR to merge a branch. This happens within the workflow, but I wasn't able to test it. I'm doing this for a package at work right now and can apply some of that learning here.

You only need to create an Anaconda account and get an API key with publishing permission. Set that as the repository secret ANACONDA_TOKEN.
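If you use the GitHub CLI, one way to set that secret (assuming it lives on umarbutler/semchunk; gh prompts for the value):

```bash
# Store the Anaconda API key as a repository secret.
gh secret set ANACONDA_TOKEN --repo umarbutler/semchunk
```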

@dominictarro
Author

I may have spoken too soon. I'll research this more thoroughly and get back to you.

@umarbutler
Collaborator

> I may have spoken too soon. I'll research this more thoroughly and get back to you.

Keep me posted :) I don’t use Anaconda myself so it’s unfamiliar territory to me.

@dominictarro
Author

> Keep me posted :) I don’t use Anaconda myself so it’s unfamiliar territory to me.

conda-forge/staged-recipes#28590

Got everything set up; I just need you to comment on the PR. The whole PR process through conda-forge/staged-recipes is a one-time thing. Once that is done, they will create a repository in the conda-forge org named semchunk-feedstock that we will be added to as maintainers.

All that maintenance really involves is updating the meta.yaml's version number and its SHA256 hash from PyPI (e.g. for 2.2.0). Fork the feedstock, make the changes on a branch, and open a PR to the parent feedstock.

https://conda-forge.org/docs/maintainer/updating_pkgs/#example-workflow-for-updating-a-package
https://conda-forge.org/docs/maintainer/updating_pkgs/#updating-recipes
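A rough sketch of that update cycle; the fork URL, branch name, and version are illustrative, and the conda-forge docs linked above are authoritative:

```bash
# Grab the sha256 of the new sdist from PyPI's JSON API (version shown is an example).
curl -sL https://pypi.org/pypi/semchunk/2.2.0/json \
  | python -c "import json, sys; print(next(u['digests']['sha256'] for u in json.load(sys.stdin)['urls'] if u['filename'].endswith('.tar.gz')))"

# Fork conda-forge/semchunk-feedstock on GitHub, then:
git clone git@github.com:<your-username>/semchunk-feedstock.git
cd semchunk-feedstock
git checkout -b update-2.2.0

# Edit recipe/meta.yaml: bump the version, paste in the new sha256, and reset the
# build number to 0. Then push and open a PR against conda-forge/semchunk-feedstock.
git commit -am "Update semchunk to 2.2.0"
git push origin update-2.2.0
```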

@umarbutler
Collaborator

@dominictarro So I just need to merge this into my main branch? And set secrets.ANACONDA_TOKEN? There's also secrets.GITHUB_TOKEN; what about that?

@dominictarro
Author

> @dominictarro So I just need to merge this into my main branch? And set secrets.ANACONDA_TOKEN? There's also secrets.GITHUB_TOKEN; what about that?

Sorry, I should have elaborated. This PR and my fork of semchunk can be closed unmerged. They aren't necessary.

I need to update dominictarro/staged-recipes with the latest changes from conda-forge/staged-recipes. I will update the meta.yaml in my fork to use semchunk v2.2.2.

Since this PR can be closed unmerged, no token needs to be created. conda-forge will copy the build from PyPI to the conda-forge "channel" on conda, and they will use their own CI/CD to do so.

To trigger their CI/CD, we just create a PR to the "feedstock" repo that they create with the meta.yaml changes that we want (i.e. version, build hash).

There's a tool they mention, regro-cf-autotick-bot, for automatically updating the feedstock when PyPI changes are detected. I haven't seen semchunk release frequently enough to make it useful, but it's an option.

@umarbutler
Collaborator

@dominictarro Just to clarify: semchunk is already on conda, and when I publish my next release I just need to go to the conda-forge/semchunk-feedstock repo and open a PR modifying conda-forge/semchunk-feedstock/recipe/meta.yaml with whatever needs updating? If so, that should be easy. I have an internal checklist for updating semchunk that I can add that to :)

@dominictarro
Author

@umarbutler correct, and sounds good!

@dominictarro dominictarro deleted the conda branch December 19, 2024 14:33
@umarbutler
Collaborator

Hey @dominictarro, I'm prepping a new release and I'd like to mention how to install semchunk with conda.

Sorry if this is a silly question, but I just want to confirm: is this the best way to install semchunk with conda?

```bash
conda install dominictarro::semchunk
```

Or is it possible to just do:

```bash
conda install semchunk
```

@dominictarro
Author

@umarbutler

The dominictarro::semchunk package was just a test I did on my personal channel (package index). I can remove it. Users should install with:

```bash
conda install conda-forge::semchunk
# or
conda install -c conda-forge semchunk
```

There's a way to set conda-forge as a default channel to install from so you don't have to specify conda-forge, but it involves modifying a config file.
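For reference, that config change is usually the following (it writes to ~/.condarc):

```bash
# Make conda-forge the preferred channel so "conda install semchunk" works directly.
conda config --add channels conda-forge
conda config --set channel_priority strict
```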

You can also share the conda package's listing: https://anaconda.org/conda-forge/semchunk

@dominictarro
Author

@umarbutler

Unrelated, but maybe interesting to some users: I created a variant of semchunk for Rust, semchunk-rs. You can find it below:

https://crates.io/crates/semchunk-rs
/~https://github.com/dominictarro/semchunk-rs

@umarbutler
Collaborator

@dominictarro I actually came across that earlier but didn't have time to look into it further. Thanks for creating that port! I'll link to it in my README.

I had been thinking about trying to speed semchunk up even further by offloading to Rust but I'm not a Rust coder. I was meaning to learn more about it later. Do you think that's possible by combining our work or is your port not as fast as the Python version?

@dominictarro
Author

@umarbutler Frankly, that was my first Rust package, so I'm no expert! My package is for Rust applications, not a Rust backend with Python bindings. I used a Rust-native tokenizer library, rust_tokenizers. I don't know how a Rust backend would interface with sentence-transformers or tiktoken. Want to create an issue so I and others can experiment and see if it will work?

The Rust version clocks in at 6.22s with RoBERTa against the Gutenberg corpus, but you used the GPT-4 tokenizer. I see you recently got your benchmark down from 6.69s to 2.87s. I think if I throw multiprocessing at it I can get the Rust version's numbers down. I'll have to check the changes and see what you did.

@umarbutler
Collaborator

> @umarbutler Frankly, that was my first Rust package, so I'm no expert! My package is for Rust applications, not a Rust backend with Python bindings. I used a Rust-native tokenizer library, rust_tokenizers. I don't know how a Rust backend would interface with sentence-transformers or tiktoken. Want to create an issue so I and others can experiment and see if it will work?

Ah, I think I'll have to do some more research into how the interfacing would work. To be honest, semchunk is already quite efficient; the biggest gains come from using fast tokenizers.

> The Rust version clocks in at 6.22s with RoBERTa against the Gutenberg corpus, but you used the GPT-4 tokenizer. I see you recently got your benchmark down from 6.69s to 2.87s. I think if I throw multiprocessing at it I can get the Rust version's numbers down. I'll have to check the changes and see what you did.

I don't think there was any change except for getting a much faster PC than what I had before 😆 You should try benchmarking the Python library on the same hardware and then compare the results.

@umarbutler
Collaborator

@dominictarro v3.0.0 is out 🍾

@dominictarro
Author

> To be honest, semchunk is already quite efficient; the biggest gains come from using fast tokenizers.

Agreed. The tokenizer is probably >=90% of the compute at this point.
