Semantic Search

An open-source Python library for semantic search, featuring:

  • FAISS for fast vector similarity search.
  • SentenceTransformers for high-quality embeddings.
  • Pluggable database backends (MongoDB, SQLite, Redis, PostgreSQL, MySQL).

With this library, you can efficiently store, index, and retrieve documents based on semantic similarity.


Table of Contents

  1. Features
  2. Databases & FAISS Table
  3. Installation
  4. Usage
  5. Testing
  6. FAQ

Features

  • Multi-DB Support – Choose MongoDB, SQLite, Redis, PostgreSQL, or MySQL.
  • FAISS Index – Build a fast, in-memory index for vector searches.
  • Easy API – add_document(), build_faiss_index(), retrieve().
  • Scalable – Handle millions of embeddings with FAISS.
  • Open Source – Contributions are welcome!

Databases & FAISS Table

| Database   | Best For                           | Advantages                                     | Considerations                                   |
|------------|------------------------------------|------------------------------------------------|--------------------------------------------------|
| MongoDB    | JSON-like docs, horizontal scaling | Great for big data; flexible schema            | No built-in vector search; use an external index |
| SQLite     | Lightweight local storage          | Easy setup; single-file DB                     | Not ideal for concurrent writes                  |
| Redis      | Fast in-memory caching             | Extremely fast reads; good for ephemeral data  | Data stored in RAM; must handle persistence      |
| PostgreSQL | Traditional SQL, robust & reliable | Optional pgvector extension; ACID-compliant    | Requires indexing optimization for large data    |
| MySQL      | Widely used SQL store              | Scalable; familiar to many devs                | No native vector index; manual approach needed   |

FAISS:

  • IndexFlatL2 is used for demonstration (simple and effective for smaller datasets).
  • For larger datasets, consider IVF, HNSW, or GPU-based indexing (see the sketch below).
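
As a point of reference, here is a minimal sketch of the two index styles using the FAISS API directly (outside this library); the dimension and parameters are illustrative assumptions:

import faiss
import numpy as np

dim = 384  # embedding dimension (illustrative)
embeddings = np.random.rand(10000, dim).astype(np.float32)

# Exact search: simple and effective for smaller collections.
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(embeddings)

# Approximate search: IVF partitions vectors into nlist clusters and must be trained first.
nlist = 100
quantizer = faiss.IndexFlatL2(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf_index.train(embeddings)
ivf_index.add(embeddings)
ivf_index.nprobe = 10  # clusters searched per query; higher means better recall, slower search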

Installation

  1. Install from PyPI:

     pip install pysemantic-search
  2. Or install from source for development:

    git clone /~https://github.com/username/semantic-search.git
    cd semantic-search
    pip install -r requirements.txt
    
    Make sure to adjust dependencies (faiss-cpu vs. faiss-gpu) depending on your environment.
    If you plan to use PyTorch on GPU, install torch with CUDA support.
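
     For example, using the standard FAISS wheels on PyPI:

     pip install faiss-cpu   # CPU-only environments
     pip install faiss-gpu   # CUDA-enabled environments (match your CUDA version)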

Usage

Below is a basic example using MongoDB as the database backend. You can switch to other backends by changing db_type.

from semantic_search import SemanticSearch, DatabaseFactory

# 1. Create a database connection (MongoDB example)
db = DatabaseFactory.create_database(
    db_type="mongodb",
    mongo_uri="mongodb://localhost:27017/",
    db_name="semantic_db",
    collection_name="documents"
)

# 2. Initialize SemanticSearch with the DB
search_engine = SemanticSearch(database=db)

# 3. If documents already exist, build FAISS index
try:
    search_engine.build_faiss_index()
except ValueError:
    print("No existing documents found. The FAISS index was not built.")

# 4. Add a new document
search_engine.add_document("Deep learning for NLP is a powerful tool.")

# 5. Retrieve similar documents
query = "Best techniques for NLP deep learning?"
results = search_engine.retrieve(query, top_k=3)
print("Results:", results)

Testing

This library includes pytest tests in the tests/ directory. To run them locally:

  1. Install dev dependencies (pytest, pylint, etc.) from requirements.txt.
  2. Run the test suite:
python -m pytest tests/

FAQ

1. Why do I see a type error about faiss_index.add(x)?

FAISS’s Python bindings are generated via SWIG, so static type checkers (like Pyright) think the signature is add(n, x). We add # type: ignore to bypass this false positive. Runtime usage works fine with add(x).
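
For illustration, a minimal standalone sketch of the workaround (the variable names and dimension are assumptions, not the library's internals):

import faiss
import numpy as np

dim = 384
faiss_index = faiss.IndexFlatL2(dim)
embeddings = np.random.rand(10, dim).astype(np.float32)

# Pyright infers the SWIG signature add(n, x); at runtime add(x) is correct.
faiss_index.add(embeddings)  # type: ignore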

2. Is .cpu() needed for embeddings?

  • By default, SentenceTransformers returns NumPy arrays if you pass convert_to_tensor=False, so .cpu() is not needed.
  • If you use convert_to_tensor=True, you get a PyTorch tensor. Convert it with:
    embedding = embedding.cpu().numpy().astype(np.float32)
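
A minimal sketch of both paths with SentenceTransformers (the model name is illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

# Default / convert_to_tensor=False: a NumPy array, no .cpu() needed.
embedding = model.encode("Deep learning for NLP", convert_to_tensor=False)
embedding = np.asarray(embedding, dtype=np.float32)

# convert_to_tensor=True: a PyTorch tensor; move it to CPU before converting for FAISS.
tensor_embedding = model.encode("Deep learning for NLP", convert_to_tensor=True)
embedding = tensor_embedding.cpu().numpy().astype(np.float32)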

3. Which database is best?

  • MongoDB or PostgreSQL for large data.
  • Redis for fast in-memory lookups.
  • SQLite for small local apps.

4. Can I store embeddings in the DB & FAISS on disk?

The library currently keeps the IndexFlatL2 index in memory only. If you need persistence, FAISS itself can serialize any index (including IndexFlatL2) to disk via faiss.write_index()/faiss.read_index(), or you can store the raw embeddings (e.g. in HDF5) and rebuild the index at startup.
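
A minimal sketch of persisting and reloading an index with the standard FAISS I/O helpers (the file name and dimension are illustrative):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatL2(dim)
index.add(np.random.rand(1000, dim).astype(np.float32))

# Persist the index to disk and load it back later.
faiss.write_index(index, "semantic.index")
index = faiss.read_index("semantic.index")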

5. How do I contribute?

Fork this repo, create a new branch, add tests in tests/, and submit a PR with your changes.

Reference

/~https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/README.md
