Gitingest fetches entire repo for tagged subdirectories #196

Open
jpotw opened this issue Feb 23, 2025 · 3 comments

jpotw commented Feb 23, 2025

Hi! I've noticed some unexpected behavior with gitingest when trying to fetch a subdirectory from a specific tag. It seems to grab the whole repository instead of just the subdirectory, which works fine for branches.

Expected:

When I use a URL that points to a subdirectory within a tag (such as the one in Steps to Reproduce below), gitingest should only fetch the files in that subdirectory, just like it does for branches.


Observed:

gitingest downloads a ton of files (hitting the max file limit on big repos like PyTorch) and seems to ignore the subdirectory part of the tag URL. It's pulling the entire repository for that tag.


Steps to Reproduce:

  1. Tag (Fails): Run:
    gitingest https://github.com/pytorch/pytorch/tree/v2.4.1/torch/distributed/elastic/agent/server
    (Note: this command times out unless --recurse-submodules is removed; see #195, "Why Use --recurse-submodules in clone_repo? It slows down cloning large repos".)

  2. Observe (Tag): You'll see output like this, indicating it's processing the whole repository:
    Maximum file limit (10000) reached
    ... (repeated many times) ...
    Analysis complete! Output written to: digest.txt
    
    Summary:
    Repository: pytorch/pytorch
    Files analyzed: 10000  # Should be much smaller!
    
    Estimated tokens: 16.8M
    ...
    

  3. Branch (Works): Now try the same subdirectory, but on the main branch:
    gitingest https://github.com/pytorch/pytorch/tree/main/torch/distributed/elastic/agent/server

  4. Observe (Branch): You'll see the correct output:
    Analysis complete! Output written to: digest.txt
    
    Summary:
    Repository: pytorch/pytorch
    Files analyzed: 4  # Correct!
    Subpath: /torch/distributed/elastic/agent/server
    
    Estimated tokens: 12.3k
    ...
    

Comparison (Branch Behavior - Working):

Just to confirm, this works perfectly for subdirectories on branches (both main and others):

  • Main: gitingest .../tree/main/... (Correct output: 4 files)
  • Other Branch: gitingest .../tree/gh/qqaatw/26/orig/... (Correct output: 4 files)

I've included the full commands and expected output in the original description, but the key difference is the Files analyzed count.


It seems like gitingest handles tagged subdirectories differently than branch subdirectories, leading to unexpected behavior and hitting the file limit.
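
To illustrate why a tag can be harder to pick out of a URL than it looks, here is a minimal sketch of splitting a /tree/ URL into a ref and a subpath. It is not gitingest's actual parser: `split_ref_and_subpath` and the `known_refs` argument are hypothetical names, and the sketch assumes the caller already has the full set of branch and tag names. Because refs can contain slashes (like gh/qqaatw/26/orig), it tries the longest candidate first.

```python
from urllib.parse import urlparse


def split_ref_and_subpath(url, known_refs):
    """Split a GitHub /tree/ URL into (ref, subpath).

    known_refs is a hypothetical, caller-supplied collection that must
    contain tag names as well as branch names.
    """
    parts = urlparse(url).path.strip("/").split("/")
    # Expected shape: owner / repo / "tree" / <ref parts...> / <subpath parts...>
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a /tree/ URL: {url}")
    tail = parts[3:]
    # Longest match first, so "gh/qqaatw/26/orig" is preferred over a shorter prefix.
    for end in range(len(tail), 0, -1):
        candidate = "/".join(tail[:end])
        if candidate in known_refs:
            return candidate, "/" + "/".join(tail[end:])
    raise ValueError(f"no known branch or tag found in: {url}")
```

If `known_refs` only held branch names, a tag such as v2.4.1 would never match, which is one plausible way a tagged subdirectory could end up being treated differently from a branch one.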

I'd be happy to help investigate and potentially submit a PR if you can confirm this is a bug! Let me know what you think.

Thanks!

@filipchristiansen filipchristiansen added the work in progress This PR is not ready yet but is being worked on label Feb 23, 2025
@filipchristiansen
Collaborator

Thanks for creating this issue! Work is already in progress:

if partial_clone:
    clone_cmd += ["--filter=blob:none", "--sparse"]
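
For context, a rough sketch of how those two flags could fit into a full partial clone of a single subdirectory. This is an assumption about the overall flow, not the code in repository_clone.py; `sparse_clone_commands` and its parameters are hypothetical. It relies on `git clone --branch` accepting tag names as well as branch names, and on `git sparse-checkout set` to limit the working tree afterwards.

```python
def sparse_clone_commands(url, dest, ref, subpath):
    """Build the git commands for a blobless, sparse clone of one subdirectory.

    Returns a list of argv lists; nothing is executed here.
    """
    clone_cmd = [
        "git", "clone",
        "--depth=1",
        "--filter=blob:none",  # skip file contents until they are checked out
        "--sparse",            # start with only top-level files in the working tree
        "--branch", ref,       # accepts a branch name or a tag name
        url, dest,
    ]
    # Widen the sparse working tree to the requested subdirectory only.
    sparse_cmd = ["git", "-C", dest, "sparse-checkout", "set", subpath.lstrip("/")]
    return [clone_cmd, sparse_cmd]
```

Each returned list could be passed to `subprocess.run` in order. Passing the ref explicitly means the same code path serves branches and tags.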

@filipchristiansen
Collaborator

@cyclotruc

@jpotw
Author

jpotw commented Feb 24, 2025

Thanks for opening this issue! I appreciate the heads-up. It looks like some related work is already underway here:

gitingest/src/gitingest/repository_clone.py

Lines 84–85 in d16cbd3:

if partial_clone:  
    clone_cmd += ["--filter=blob:none", "--sparse"]  

Hi, thanks for your response! I’m not entirely sure if this addresses the specific issue I raised, so I’d love to clarify things a bit.


To recap the two issues I submitted:

In #195, I noted that cloning large repos (like PyTorch) within 60 seconds wasn’t feasible. I found that removing --recurse-submodules from this line resolved the issue. Out of curiosity, is there a specific reason the team typically includes this flag? It seems particularly inefficient when dealing with large third-party submodules.
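
To make the trade-off concrete, here is a small sketch of a clone command where submodule recursion is opt-in and the run is bounded by a time budget. The function names and the `recurse_submodules` parameter are hypothetical and do not reflect gitingest's actual API; the 60-second figure would simply be passed in as `timeout_s`.

```python
import subprocess


def build_clone_cmd(url, dest, recurse_submodules=False):
    """Build a shallow clone command; submodules are fetched only on request."""
    cmd = ["git", "clone", "--depth=1"]
    if recurse_submodules:
        cmd.append("--recurse-submodules")
    return cmd + [url, dest]


def run_with_timeout(cmd, timeout_s):
    """Return True if cmd finished in time, False if it was killed on timeout."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True
```

With this shape, a large repo with heavy third-party submodules only pays for them when the caller asks for it.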

In #196 (this issue), I pointed out that tags (e.g., cyclotruc/gitingest/tree/v0.1.3) and their subdirectories aren't recognized properly, unlike commits or branches: the full repository on the main branch is fetched instead. I've identified the root cause and am actively working on a fix.
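
One possible direction, sketched under assumptions rather than taken from the actual fix: list the remote's tags alongside its branches so that a tag in the URL can be recognized as a ref. The helper below parses the text output of `git ls-remote --heads --tags <url>`; `parse_ls_remote` is a hypothetical name.

```python
def parse_ls_remote(output):
    """Split `git ls-remote --heads --tags <url>` output into (branches, tags)."""
    branches, tags = set(), set()
    for line in output.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        _sha, ref = line.split("\t", 1)
        if ref.startswith("refs/heads/"):
            branches.add(ref[len("refs/heads/"):])
        elif ref.startswith("refs/tags/"):
            name = ref[len("refs/tags/"):]
            # Annotated tags are listed twice; the peeled entry ends in "^{}".
            if name.endswith("^{}"):
                name = name[:-3]
            tags.add(name)
    return branches, tags
```

The union of both sets could then be used when matching the ref portion of a /tree/ URL.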


Issue #195 feels like a quick fix, while I’m currently digging into #196.

Could you assign both issues to me so I can take ownership of them?

Thanks so much!
