
LangChain representation does not support LCEL Runnables #1564

Closed
joshuasundance-swca opened this issue Oct 6, 2023 · 6 comments

@joshuasundance-swca
Contributor

joshuasundance-swca commented Oct 6, 2023

I'm sorry to put a list under a single issue like this, but each of the items below is related to bertopic/representation/_langchain.py. I wouldn't normally make so many unsolicited changes at once, even in one file, but I was trying to make the LangChain representation class more robust for my intended use cases, and I think others would benefit from each change.

I realize this may be a lot for one Issue or PR, but I'd love to get the discussion started. :) I would be more than happy to answer any questions and address any concerns.

Thank you.

Problems

  1. As noted in the docstring, bertopic.representation.LangChain currently expects chain to be the output of langchain.chains.question_answering.load_qa_chain.
  2. Important functionality (nr_samples, nr_repr_docs, diversity) is not exposed to the user.
  3. Users have no control over document truncation (similar to #1527, Question/request about representative document truncation).
  4. The .run method is used, which is not part of the standard Runnable interface. .invoke could work, but I think .batch is the preferred method in this case (see the sketch after this list).
  5. Users cannot provide a RunnableConfig, which could help them avoid RateLimitError, as in #1560 (Retry strategies for OpenAI RateLimitError).
  6. The return signature of LangChain.extract_topics is incorrect (specifies float, should be int).
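
For illustration, the proposed call pattern could look something like this (just a sketch with a toy stand-in chain; it assumes a LangChain version recent enough to provide RunnableLambda):

from langchain.schema.document import Document
from langchain.schema.runnable import RunnableLambda

# toy stand-in chain; any Runnable honoring the expected keys works
chain = RunnableLambda(
    lambda inp: {"output_text": f"label for {len(inp['input_documents'])} docs"}
)

docs_per_topic = [
    [Document(page_content="doc a"), Document(page_content="doc b")],
    [Document(page_content="doc c")],
]
inputs = [
    {"input_documents": docs, "question": "What are these documents about?"}
    for docs in docs_per_topic
]

# one .batch call covers every topic at once; RunnableConfig options
# such as max_concurrency limit parallel LLM calls (cf. #1560)
outputs = chain.batch(inputs, config={"max_concurrency": 5})
labels = [output["output_text"].strip() for output in outputs]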

Motivation

The LangChain Expression Language (LCEL) is a powerful declarative syntax for creating Runnable objects in LangChain, which operate similarly to traditional chains. It is built largely around piping and modularity. In the example code below, I show how LCEL can be more powerful than ordinary chains. This is real code I used today, btw. :) It's worth noting that LCEL transformations would also enable the use of any document truncation methods, including token-based ones, as long as the user implements them.

Solution

I forked the repo and solved each of these [perceived] problems in a branch. I realize this is getting ahead of the usual process, but I did so to be sure my proposed solution would actually work. :) All tests still pass, and mypy is satisfied. However, I wanted ample feedback and a chance to collaborate before proposing changes to tests, documentation, etc. In fact, I haven't even changed the docstring yet.

It's worth noting that, for plain Runnable objects, a Dummy wrapper like the one commented out in the code below will work with the current release. You could probably add max_concurrency and other RunnableConfig settings in the wrapper, as sketched next.
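
A minimal version of that idea (a sketch; the config contents are illustrative, and it assumes the chain ends with the {"output_text": ...} mapping described under Notes below):

class Dummy:
    """Wrap a Runnable so it exposes the .run interface the current release expects."""

    def __init__(self, chain, config=None):
        self.chain = chain
        self.config = config  # e.g. {"max_concurrency": 5}

    def run(self, **kwargs) -> str:
        # .run must return a plain string in the current implementation
        return self.chain.invoke(kwargs, config=self.config)["output_text"]

# usage with the current release:
# representation_model = LangChain(Dummy(chain, config={"max_concurrency": 5}), prompt=prompt)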

Notes

Existing uses and functionality should remain unchanged, because any Runnable objects used with the LangChain representation class must use the same input keys expected in the original implementation.

Required input keys:

  • input_documents: List[Document] (LangChain Document objects built from the representative docs)
  • question: str

Required output key:

  • output_text: str
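
Schematically, the contract looks like this (just a sketch; the question string shown is only an illustration):

from langchain.schema.document import Document

# shape of a single input sent to the chain
chain_input = {
    "input_documents": [Document(page_content="doc 1"), Document(page_content="doc 2")],
    "question": "What are these documents about? Please give a single label.",
}

# shape of the output expected back
chain_output = {"output_text": "a short topic label"}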

Example code using forked version

The code below uses Microsoft Presidio to remove PII before sending text to the LLM, then restores it in the returned result.

This is useful if you want to be sure you're not sending PII to the LLM. Similar use cases, including any arbitrary text transformations, would still be supported.

import re

from bertopic.representation import LangChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatAnthropic
from langchain.schema.document import Document
from langchain.schema.language_model import BaseLanguageModel
from langchain_experimental.data_anonymizer.presidio import PresidioReversibleAnonymizer

pat = re.compile(r"\[COMPANY]", re.IGNORECASE)

representation_prompt = "..."


# class Dummy:
#     """Workaround class enabling BERTopic representation with arbitrary RunnableSequences."""
#     def __init__(self, chain):
#         self.chain = chain
#     def run(self, **kwargs) -> str:
#         return self.chain.invoke(kwargs)

## `LangChain(Dummy(chain), prompt)` is valid in BERTopic's current implementation
## but it doesn't support batching or use of `RunnableConfig`


def representation_model_from_llm(
    llm: BaseLanguageModel,
    prompt: str = representation_prompt,
    **kwargs,
) -> LangChain:
    """Create a representation model from an LLM."""

    pii_handler = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
    chain = (
        {
            "input_documents": (
                # remove company and employee names from content before sending to LLM
                lambda inp: [
                    Document(
                        page_content=pii_handler.anonymize(
                            d.page_content.replace("SWCA", "[COMPANY]"),
                            language="en",
                        ),
                    )
                    for d in inp["input_documents"]
                ]
            ),
            "question": (
                lambda inp: pii_handler.anonymize(inp["question"], language="en")
            ),
        }
        | load_qa_chain(llm, chain_type="stuff")

        # plug company and employee names back into the returned text
        | (lambda output: pat.sub("SWCA", output["output_text"]).strip())
        | pii_handler.deanonymize

        # put the content back into the format expected by the `LangChain` representation class
        | (lambda output: {"output_text": output})
    )
    # return LangChain(Dummy(chain), prompt=prompt)
    return LangChain(chain, prompt=prompt, **kwargs)


representation_model = representation_model_from_llm(
    ChatAnthropic(model="claude-2", temperature=0.25),

    # use all docs to generate a label because they're very short
    # and using just the first 4 led to misleading labels
    nr_repr_docs=None,

    # limit concurrency to avoid RateLimitError / code 429
    config=dict(max_concurrency=5)
)
joshuasundance-swca changed the title from "LangChain representation does not support LECL Runnables" to "LangChain representation does not support LCEL Runnables" on Oct 6, 2023
@MaartenGr
Owner

I realize this may be a lot for one Issue or PR, but I'd love to get the discussion started. :) I would be more than happy to answer any questions and address any concerns.

Let me start off by saying thank you for taking the time to explore all of this! Interesting features and definitely something worthwhile implementing.

Let me go through some snippets first and then dive into the main idea of the solution.

Problems

  1. As noted in the docstring, bertopic.representation.LangChain currently expects chain to be the output of langchain.chains.question_answering.load_qa_chain.

If I am not mistaken, your suggestion is to generalize chain to accept any LangChain chain, to allow more control over the underlying pipeline, right?

  2. Important functionality (nr_samples, nr_repr_docs, diversity) is not exposed to the user.
  3. Users have no control over document truncation (similar to #1527, Question/request about representative document truncation).

These are currently implemented in an open PR (#1539) that will be merged soon, so that should be resolved.

  4. The .run method is used, which is not part of the standard Runnable interface. .invoke could work, but I think .batch is the preferred method in this case.

I agree. Seeing as you essentially give it all topics at once instead of iterating over them, it seems that .batch would be best here. Technically, it would also open up async but I think that is currently out of scope.

  5. Users cannot provide a RunnableConfig, which could help them avoid RateLimitError, as in #1560 (Retry strategies for OpenAI RateLimitError).

That's great, indeed opens it up for many interesting options!

  6. The return signature of LangChain.extract_topics is incorrect (specifies float, should be int).

No problem, should be quickly resolved.

Motivation

The LangChain Expression Language (LCEL) is a powerful declarative syntax for creating Runnable objects in LangChain, which operate similarly to traditional chains. It is built largely around piping and modularity. In the example code below, I show how LCEL can be more powerful than ordinary chains. This is real code I used today, btw. :) It's worth noting that LCEL transformations would also enable the use of any document truncation methods, including token-based ones, as long as the user implements them.

Although I agree with the general principle, the one thing I am concerned about is user experience. Traditional chains are what users are likely to know when using LangChain. LangChain, due to the speed of its development, has changed quite drastically a couple of times, and I want to make sure that whatever is implemented is likely to stay. Having said that, since LCEL is tightly integrated with the traditional chains (and a major component of LangSmith), it seems there is little to worry about.

Solution

I forked the repo and solved each of these [perceived] problems in a branch.

This looks great! Very clear and nicely implemented. Since there is already a PR open with some of the suggested implementations, it might make sense to add it there. You could also create a PR (if you are interested) after I merge that one, but I am not sure whether I will merge it this week or in a couple of weeks. What would you prefer?

Notes

Existing uses and functionality should remain unchanged because any Runnable objects used with the LangChain representation class must use the same input keys expected in the original implementation.

That's perfect! If I am not mistaken, it would also mean that there is no change for the user compared with the previous implementation, right? In other words, they could still use their old code (i.e., using load_qa_chain), and the old examples would still work?

Example code using forked version

The code below uses Microsoft Presidio to remove PII before sending text to the LLM, then restores it in the returned result.

Not only a nice example of a more complex LangChain pipeline but something that is highly relevant to using external APIs!

@joshuasundance-swca
Contributor Author

Let me start off by saying thank you for taking the time to explore all of this! Interesting features and definitely something worthwhile implementing.

It's my pleasure. BERTopic is one of my favorite libraries. Before I became aware of BERTopic, I was constructing my own [less robust] embedding / clustering pipelines. I did not have good tools in place for dimensionality reduction, cluster representation, or visualization. BERTopic offers this and more, with plenty of the modularity and flexibility that we all love and need.

If I am not mistaken, your suggestion is to generalize chain to accept any LangChain chain, to allow more control over the underlying pipeline, right?

That is correct. My changes would allow greater flexibility while still accommodating existing use cases. I did not introduce any massive or breaking changes; everything remains backwards-compatible, as far as I can tell. The one exception may be the Runnable class that is imported at the top; I'm not sure when that class was introduced, and it may require a newer version of LangChain than some users have installed. We could examine that possibility and decide how to handle it. Since the "regular" chain interface also implements the batch method, I think we could omit the type hints or write them as string annotations to satisfy type hinting and linting.
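
For example, with a pattern like this (a sketch; the class name is a hypothetical stand-in for the real one, and the import path is the one current as of this writing):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # imported only during static type checking, so users on older
    # LangChain versions that lack Runnable are unaffected at runtime
    from langchain.schema.runnable import Runnable


class LangChainRepresentation:  # hypothetical stand-in for the real class
    def __init__(self, chain: "Runnable", prompt: str = ""):
        self.chain = chain
        self.prompt = prompt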

I agree. Seeing as you essentially give it all topics at once instead of iterating over them, it seems that .batch would be best here. Technically, it would also open up async but I think that is currently out of scope.

That is a possibility, and I'm sure some users would appreciate it, but I don't think it would offer a ton of benefit in most settings compared to just using .batch. Perhaps if it's part of a web app or something. Maybe I'm wrong about that. If you did want to implement it, I suppose it would be straightforward.

Although I agree with the general principle, the one thing I am concerned about is user experience. Traditional chains are what users are likely to know when using LangChain. LangChain, due to the speed of its development, has changed quite drastically a couple of times, and I want to make sure that whatever is implemented is likely to stay. Having said that, since LCEL is tightly integrated with the traditional chains (and a major component of LangSmith), it seems there is little to worry about.

You are certainly not wrong about this. However, since the fork is backwards-compatible and I think the Runnable / LCEL interface is here to stay, I think these are safe changes.

This looks great! Very clear and nicely implemented. Since there is already a PR open with some of the suggested implementations, it might make sense to add it there. You could also create a PR (if you are interested) after I merge that one, but I am not sure whether I will merge it this week or in a couple of weeks. What would you prefer?

Thank you. :) I do not want to make a PR with duplicate work, nor do I wish to rush you. Whatever fits with your normal development pattern would be fine with me. It probably makes the most sense to wait until you merge existing PRs and then make a new one accounting for those changes. In the meantime, users can use the Dummy class or create an [arbitrarily named] LangChain2 class that implements the functionality seen in the fork. I think you can use it as a representation_model as long as it implements the expected method(s).

That's perfect! If I am not mistaken, it would also mean that there is no change for the user compared with the previous implementation, right? In other words, they could still use their old code (i.e., using load_qa_chain), and the old examples would still work?

Correct. I would suggest more testing before merging changes, but I tested this to make sure existing documentation was still valid, working code.

Not only a nice example of a more complex LangChain pipeline but something that is highly relevant to using external APIs!

Thank you. I think this was a compelling example of what can be done with more complex chains, and at some point, I'd love to share more about how I've been using LangChain to achieve better clustering results with BERTopic. :)

Finally, thank you again for all of your hard work and diligence. I look forward to future collaboration.

@MaartenGr
Owner

That is correct. My changes would allow greater flexibility while still accommodating existing use cases. I did not introduce any massive or breaking changes; everything remains backwards-compatible, as far as I can tell. The one exception may be the Runnable class that is imported at the top; I'm not sure when that class was introduced, and it may require a newer version of LangChain than some users have installed. We could examine that possibility and decide how to handle it. Since the "regular" chain interface also implements the batch method, I think we could omit the type hints or write them as string annotations to satisfy type hinting and linting.

I think that upping the version of LangChain would be perfectly okay. The package is known for its quick development and major changes to its API, so it would not be surprising if users needed to bump their LangChain version to get this new functionality.

If omitting the type hints would solve this issue, then that would be my preference. Choosing between full backward compatibility and having type hints, I would definitely choose the former.

That is a possibility, and I'm sure some users would appreciate it, but I don't think it would offer a ton of benefit in most settings compared to just using .batch. Perhaps if it's part of a web app or something. Maybe I'm wrong about that. If you did want to implement it, I suppose it would be straightforward.

I agree. It's okay to leave it for now. If users would want this feature in a later release, then at least it would be straightforward to implement.

You are certainly not wrong about this. However, since the fork is backwards-compatible and I think the Runnable / LCEL interface is here to stay, I think these are safe changes.

Agreed!

Thank you. :) I do not want to make a PR with duplicate work, nor do I wish to rush you. Whatever fits with your normal development pattern would be fine with me. It probably makes the most sense to wait until you merge existing PRs and then make a new one accounting for those changes. In the meantime, users can use the Dummy class or create an [arbitrarily named] LangChain2 class that implements the functionality seen in the fork. I think you can use it as a representation_model as long as it implements the expected method(s).

After some testing, I can merge the relevant PRs this week. After that, I can either start working on your suggested changes or, if you want, you can create the PR yourself. It would be greatly appreciated if you have the time. If not, then I'm more than okay implementing the changes myself.

Thank you. I think this was a compelling example of what can be done with more complex chains, and at some point, I'd love to share more about how I've been using LangChain to achieve better clustering results with BERTopic. :)

That would be great! Any and all use cases are much appreciated.

@zilch42
Contributor

zilch42 commented Oct 10, 2023

Thank you. I think this was a compelling example of what can be done with more complex chains, and at some point, I'd love to share more about how I've been using LangChain to achieve better clustering results with BERTopic. :)

I'd be particularly interested in this!

@MaartenGr
Owner

Thank you. :) I do not want to make a PR with duplicate work, nor do I wish to rush you. Whatever fits with your normal development pattern would be fine with me. It probably makes the most sense to wait until you merge existing PRs and then make a new one accounting for those changes. In the meantime, users can use the Dummy class or create an [arbitrarily named] LangChain2 class that implements the functionality seen in the fork. I think you can use it as a representation_model as long as it implements the expected method(s).

After some testing, I can merge the relevant PRs this week. After that, I can either start working on your suggested changes or, if you want, you can create the PR yourself. It would be greatly appreciated if you have the time. If not, then I'm more than okay implementing the changes myself.

@joshuasundance-swca I merged the relevant PR a couple of days ago, so if you are interested, a PR would be very much appreciated. If you are constrained with respect to time or something else, I completely understand! I will pick it up and start implementing it in a couple of days/weeks.

@joshuasundance-swca
Contributor Author

Wonderful! Thank you. I will submit a PR in the near future.
