A graphical application designed to identify and manage duplicate Markdown files within an Obsidian vault. This tool efficiently scans your vault, compares file contents based on text similarity, and groups files that are duplicates or nearly identical. It is ideal for anyone looking to keep their Obsidian vault organized and free of redundant content.
Introduction:
obsidian-deduper is a desktop application built with Python and Tkinter. It scans your Obsidian vault for duplicate Markdown files by reading their contents and scoring how similar they are, using TF-IDF vectorization and cosine similarity from scikit-learn.
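As a rough illustration of that similarity calculation, the sketch below scores a few short strings with scikit-learn. It is not taken from the application's source: the function name `pairwise_similarity` and the sample notes are hypothetical, and it uses scikit-learn's built-in English stop-word list, whereas the application is described as using NLTK stopwords.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_similarity(documents):
    """Return an n x n matrix of cosine similarities between TF-IDF vectors."""
    # Assumption: scikit-learn's built-in stop-word list stands in for the
    # NLTK stopwords that the application is described as using.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    return cosine_similarity(tfidf)


# Hypothetical note contents: two identical notes and one unrelated note.
notes = [
    "Project ideas: build a duplicate finder for Obsidian notes",
    "Project ideas: build a duplicate finder for Obsidian notes",
    "Grocery list: apples, bread, coffee",
]
scores = pairwise_similarity(notes)
```

Here `scores[0][1]` is close to 1.0 because the first two notes are identical, while `scores[0][2]` is 0 because the notes share no terms once stop words are removed.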
Purpose:
- Problem Solved: Helps identify and clean up duplicate or similar files that may clutter your vault.
- Value Proposition:
- Simplifies vault maintenance by automatically detecting duplicates.
- Saves time and reduces manual effort when organizing notes.
- Provides a clear visual interface to preview and delete duplicates safely.
The following software and libraries are required to run obsidian-deduper:
- Python 3.6 or higher
- Description: The programming language used to develop this application.
- Installation: Download and install Python from the official website: https://www.python.org/downloads/
- Tkinter
- Description: A standard GUI toolkit for Python that creates the application's graphical interface.
- Installation:
- Windows and macOS: Usually comes pre-installed with Python.
- Linux: Install via your package manager (e.g., sudo apt-get install python3-tk).
- scikit-learn
- Description: Library used for computing text similarities using techniques like TF-IDF and cosine similarity.
- Installation:
pip install scikit-learn
- NumPy
- Description: Provides support for large, multi-dimensional arrays and matrices along with a collection of mathematical functions.
- Installation:
pip install numpy
- nltk (Natural Language Toolkit)
- Description: Helps in processing and cleaning text data, especially for removing stopwords.
- Installation:
pip install nltk
- Additional Setup: The program will automatically download the NLTK stopwords. If needed, this can also be triggered manually with:
python -m nltk.downloader stopwords
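Before launching the application, it can help to confirm that the requirements listed above are importable. The following is a minimal sketch and not part of obsidian-deduper; the helper name `missing_modules` is hypothetical, and the module names are the usual import names for the packages above (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the requirements listed above. tkinter ships with most
# Python installers but is a separate package on many Linux systems.
REQUIRED_MODULES = ["tkinter", "sklearn", "numpy", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name reported as missing can be installed with the matching command from the list above.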
- Navigate to the obsidian-deduper GitHub repository page.
- Click the green "<> Code" button.
- Select "Download ZIP" from the dropdown menu.
- Save the ZIP file to a preferred location on your computer.
- Windows: Right-click on the downloaded ZIP file and choose "Extract All...".
- macOS: Double-click the ZIP file to automatically extract its content.
- Linux: Use your file manager’s extract option or run the command:
unzip obsidian-deduper.zip
- Open a terminal (Command Prompt on Windows or Terminal on macOS/Linux).
- Change the directory to the extracted repository folder using the cd command. For example:
cd path/to/obsidian-deduper
- Execute the program by running:
python obsidian-deduper.py
- The graphical interface will launch, displaying the "Obsidian Duplicate Finder" window.
Once the program is running, follow these steps to detect and manage duplicate files:
- Select Vault Folder:
- Click the "Select Vault Folder" button.
- In the dialog that appears, navigate to and select the folder containing your Obsidian vault (the directory with your Markdown files).
- Set Similarity Threshold:
- Use the spinbox to set the similarity threshold, as a percentage, above which files are treated as duplicates. A higher percentage means only very similar files are flagged.
- Find Duplicates:
- Click on the "Find Duplicates" button.
- The application will read all Markdown files, compute text similarities, and display duplicate groups.
- A progress bar will update as the scanning process proceeds.
- Review Duplicate Groups:
- The left pane displays groups with the number of similar files, along with details like file path, size, and last modified date.
- Click on a group or individual file to preview its content in the right pane.
- Delete Selected Files:
- Select one or more files (or an entire group) in the treeview.
- Click the "Delete Selected Files" button.
- For duplicate groups, a dialog may appear allowing you to choose specifically which files to remove.
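The scanning and grouping steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's actual code: the function names `collect_markdown_files` and `group_by_threshold` are hypothetical, the similarity matrix is hard-coded instead of being computed from file contents, and transitive grouping (if A matches B and B matches C, all three share a group) is one possible grouping rule that the application may or may not use.

```python
from datetime import datetime
from pathlib import Path


def collect_markdown_files(vault_dir):
    """Recursively list .md files with the details shown for each file."""
    records = []
    for path in sorted(Path(vault_dir).rglob("*.md")):
        stat = path.stat()
        records.append({
            "path": path,
            "size": stat.st_size,
            "modified": datetime.fromtimestamp(stat.st_mtime),
            "text": path.read_text(encoding="utf-8", errors="replace"),
        })
    return records


def group_by_threshold(similarity, threshold_percent):
    """Group file indices whose pairwise similarity meets the threshold.

    Indices are linked transitively with a small union-find structure.
    Only groups with at least two members are returned.
    """
    threshold = threshold_percent / 100.0
    parent = list(range(len(similarity)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(similarity)):
        for j in range(i + 1, len(similarity)):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(similarity)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical similarity scores for three files (values between 0 and 1).
similarity = [
    [1.00, 0.92, 0.10],
    [0.92, 1.00, 0.15],
    [0.10, 0.15, 1.00],
]
duplicate_groups = group_by_threshold(similarity, 85)
```

With a threshold of 85, the first two files form a single group and the third is left out. Raising the threshold to 95 yields no groups, which matches the note above that a higher percentage flags only very similar files.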
- Scenario: A user maintains a personal Obsidian vault for journaling and research notes. Over time, duplicate or near-duplicate entries are created.
- Example:
- Input: Multiple Markdown files with very similar content about "Project Ideas".
- Expected Output: The program groups these files under one duplicate group with an average similarity (e.g., 85%), highlighting them for review or deletion.
- Scenario: A small team uses Obsidian for collaborative documentation. Duplicate files can occur due to multiple contributions.
- Example:
- Input: Two or more files containing almost identical meeting notes.
- Expected Output: The application identifies these as duplicates, allowing the team to consolidate the notes into a single, updated file.
- Scenario: A researcher imports a large collection of Markdown files from various sources. Some files may carry redundant information.
- Example:
- Input: Markdown files with overlapping content related to research data.
- Expected Output: The program groups duplicates together, enabling the user to efficiently remove or combine similar files, thus enhancing dataset quality.
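The average similarity mentioned in the first scenario can be illustrated with a small calculation. The sketch below assumes the figure is the mean over all unordered pairs of files in a group, expressed as a percentage. That definition is an assumption, as are the function name `average_group_similarity` and the sample scores.

```python
from itertools import combinations


def average_group_similarity(similarity, members):
    """Mean pairwise similarity of a group, as a percentage."""
    pairs = list(combinations(members, 2))
    if not pairs:
        # A single file is trivially identical to itself.
        return 100.0
    total = sum(similarity[i][j] for i, j in pairs)
    return 100.0 * total / len(pairs)


# Hypothetical scores for three "Project Ideas" notes.
project_idea_scores = [
    [1.00, 0.90, 0.80],
    [0.90, 1.00, 0.85],
    [0.80, 0.85, 1.00],
]
average = average_group_similarity(project_idea_scores, [0, 1, 2])
```

For these sample scores the three pairs are 0.90, 0.80, and 0.85, so the group average comes out at roughly 85%.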
This repository is continuously updated, and changes to the code may render parts of this README file outdated. No guarantee is made that this file will consistently reflect the current state of the repository.