Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed broken ainsert_custom_kg() #811

Merged
merged 6 commits into from
Feb 19, 2025
Merged

Fixed broken ainsert_custom_kg() #811

merged 6 commits into from
Feb 19, 2025

Conversation

da-luggas
Copy link
Contributor

Fix for #745

Custom knowledge graph insertion fails due to required properties not provided (tokens, chunk_order_id, full_doc_id, status).

"content": chunk_content,
"source_id": source_id,
"tokens": len(encode_string_by_tiktoken(chunk_content)),
"chunk_order_id": 0,
Copy link
Collaborator

@YanSte YanSte Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about chunk_order_id? it does not exist in the project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea is that the user defines the chunks manually, so they are already chunked and it is not traceable what source documents they were part of and should be treated as individual parts.

Otherwise, we'd have to make it possible for the user to specify the chunk_order_id manually in the custom_kg dictionary passed to the method.

Copy link
Collaborator

@YanSte YanSte Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing.

I didn’t fully understand.

If I understand correctly, chunk_order_id is not necessary because the chunks are manually defined by the user and treated as independent parts, without a link to their original source documents. Is that correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LarFii Could you have a look ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I couldn't make it clear in my previous message. As far as I understand, the chunk_order_id is to provide an ordering for chunks created from a full document in case it needs to be chunked. If that's the case, we don't have this information in the custom kg insertion as per example in the README:

custom_kg = {
    "entities": [
        ...
    ],
    "relationships": [
        ...
    ],
    "chunks": [
        {
            "content": "ProductX, developed by CompanyA, has revolutionized the market with its cutting-edge features.",
            "source_id": "Source1",
        },
        {
            "content": "PersonA is a prominent researcher at UniversityB, focusing on artificial intelligence and machine learning.",
            "source_id": "Source2",
        },
        {
            "content": "None",
            "source_id": "UNKNOWN",
        },
    ],
}

rag.insert_custom_kg(custom_kg)

So a solution would be to either ask the user to pass a chunk_order_id in the custom_kg dict or assume all chunks are individual documents. I'd be happy to hear if I made a correct assumption or there is something I'm missing about the functionality of chunk_order_id.

Copy link
Collaborator

@YanSte YanSte Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

Now it’s clear.

What do you think about changing custom_kg: dict[str, Any] into a structured format, such as a dataclass? This way, we can provide chunk_order_id along with all the necessary information more effectively.

Let me know your thoughts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArnoChenFx or @LarFii Could help me in this issue ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked the fact it's called chunk_order_index and not chunk_order_id. It is fixed now. I proposed a solution where the user can specify a chunk_order_index in the custom_kg and if not it is assumed to be always 0.

@YanSte YanSte requested review from YanSte, HKUDS and LarFii February 18, 2025 17:56
@YanSte
Copy link
Collaborator

YanSte commented Feb 19, 2025

@da-luggas Could please apply the lint ?

Thanks

@YanSte
Copy link
Collaborator

YanSte commented Feb 19, 2025

@da-luggas Could you please update the documentation base and your changes ? And update the code sample.

Thanks !

@YanSte
Copy link
Collaborator

YanSte commented Feb 19, 2025

Super thanks for sharing !

@YanSte
Copy link
Collaborator

YanSte commented Feb 19, 2025

Super Thanks for sharing !

@YanSte YanSte merged commit 6f95ad9 into HKUDS:main Feb 19, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants