Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Add tags into context #195

Open
ankostis opened this issue Feb 15, 2021 · 5 comments
Open

REF: Add tags into context #195

ankostis opened this issue Feb 15, 2021 · 5 comments
Labels
enhancement New feature or request frontend Related to browser extension

Comments

@ankostis
Copy link
Contributor

I see that hypothesis source does not harvest tags, correct?
Although this is fixable, it made me wonder whether tags should be a new column in promnesia db?

@karlicoss
Copy link
Owner

karlicoss commented Feb 16, 2021

Yeah, good suggestion! Just adding to context is a quick & easy workaround.

In general there are all kinds of metadata that could be useful, e.g:

  • org-mode has tags
  • reddit has users/subreddits/etc
  • github (Missing github source? #194) has tags/repos/users
  • instant messaging is associated with a person
  • tweets have authors

I need to think a bit more what's the best way to handle it.. because adding stuff to db is easy. The interesting question is what t do with it in the UI -- although just displaying it could be a good start.
In the future would be interesting also to filter/search by metadata, but that's the hard part since involves frontend changes :)

ankostis added a commit to ankostis/promnesia that referenced this issue Feb 16, 2021
@ankostis
Copy link
Contributor Author

ankostis commented Feb 16, 2021

Just made a PR #199 to include in Hypothesis rows a new context-line with (hash-)-tags; the same can be done with github(labels), gmail(labels).

BUT when trying to search by a hash-prefixed tag e.g. #freedom, the extension takes too much time to respond, and this could be an escaping issue (and probably a security vulnerability?).

These are the logs when searching freedom:

[INFO    2021-02-16 13:36:42 promnesia.server server.py:270] /search freedom
[INFO    2021-02-16 13:36:42 promnesia.server server.py:164] url: freedom
[INFO    2021-02-16 13:36:42 promnesia.server server.py:167] normalised url: freedom
[DEBUG   2021-02-16 13:36:42 promnesia.server server.py:180] query: SELECT visits.norm_url, visits.orig_url, visits.dt, visits.locator_title, visits.locator_href, visits.src, visits.context, visits.duration 
    FROM visits 
    WHERE (visits.norm_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.orig_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.context LIKE '%' || ? || '%' ESCAPE '/') OR (visits.locator_title LIKE '%' || ? || '%' ESCAPE '/')
[DEBUG   2021-02-16 13:36:43 promnesia.server server.py:193] got 57 visits from db
[DEBUG   2021-02-16 13:36:43 promnesia.server server.py:204] responding with 57 visits
127.0.0.1 - - [16/Feb/2021 13:36:43] "POST /search HTTP/1.1" 200 30282

While this is when prefixed with hash(#), where all my visits are returned:

[INFO    2021-02-16 13:38:46 promnesia.server server.py:270] /search #freedom
[INFO    2021-02-16 13:38:46 promnesia.server server.py:164] url: #freedom
[INFO    2021-02-16 13:38:46 promnesia.server server.py:167] normalised url: 
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:180] query: SELECT visits.norm_url, visits.orig_url, visits.dt, visits.locator_title, visits.locator_href, visits.src, visits.context, visits.duration 
    FROM visits 
    WHERE (visits.norm_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.orig_url LIKE '%' || ? || '%' ESCAPE '/') OR (visits.context LIKE '%' || ? || '%' ESCAPE '/') OR (visits.locator_title LIKE '%' || ? || '%' ESCAPE '/')
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:193] got 13035 visits from db
[DEBUG   2021-02-16 13:38:46 promnesia.server server.py:204] responding with 13035 visits
127.0.0.1 - - [16/Feb/2021 13:38:46] "POST /search HTTP/1.1" 200 7632748

@karlicoss
Copy link
Owner

Thanks, the change looks good!

So I think I know why this happens.
Here's the bit of code responsible for searching stuff:

return search_common(
url=url,
where=lambda table, url: or_(
# todo hmm. think about it, not sure if I need proper indexer for fuzzy search etc?
table.c.norm_url .contains(url, autoescape=True),
table.c.orig_url .contains(url, autoescape=True),
table.c.context .contains(url, autoescape=True),
table.c.locator_title.contains(url, autoescape=True),
),
)

, as you can see it searches in several different columns (and somewhat confusingly, paramter name is 'url' even though it really means anything you typed in the search box).

But it delegates to search_common to avoid code duplication between endpoints /~https://github.com/karlicoss/promnesia/blob/master/src/promnesia/server.py#L160

However, this function also normalizes the url passed to it:

logger.info('url: %s', url)
original_url = url
url = canonify(url)
logger.info('normalised url: %s', url)

In particular, at the moment that means stripping out the 'fragment' part of the URL (there is a plan to do something smarter, but still work in progress

promnesia/tests/cannon.py

Lines 127 to 142 in aded41c

# TODO FIXME fragment handling
# ( "https://www.scottaaronson.com/blog/?p=3167#comment-1731882"
# , "scottaaronson.com/blog/?p=3167#comment-1731882"
# ),
# TODO FIXME fragment handling
# ( "https://en.wikipedia.org/wiki/tendon#cite_note-14"
# , "en.wikipedia.org/wiki/tendon#cite_note-14"
# ),
# TODO FIXME fragment handling
# ( "https://physicstravelguide.com/experiments/aharonov-bohm#tab__concrete"
# , "physicstravelguide.com/experiments/aharonov-bohm#tab__concrete"
# ),
)
So as a result, when you search #freedom, it ends up normalizing this to empty string, and this results in matching against the whole database.

So I guess there are several things we could do here

  • non-fix: if you search for freedom without the hash, it should work straightaway :) However the downside is that it would also find all URLs that contain freedom as a substring as an example
    Ans yeah, it's not intuitive that freedom normalizes in freedom even though it's not a valid URL --but I'm not sure was intended, good testcase though to think about.

More reasonable solutions:

  • maybe if normalisation results in empty string, it shouldn't try searching url. However one might want to retrieve their whole database (especially when some kind of pagination is implemented?).
  • maybe the search_common function should guess that the thing you're passing is not a proper URL and not try searching in it. However there are valid cases when you might type some approximate domain name and try to find.. so not sure

This would probably be easier if we had a richer search interface (so e.g. you could tick whether you want to search context/url/etc).

Either way, I'm happy to merge your PR if you are, the issues should be worked around separately.

@ankostis
Copy link
Contributor Author

I would prefer that the PR is self-contained, and it works as expected.
If you don't mind, i would add a committee to try to workaround(1): if the original query-str is not empty and the canonized it is, use the original.

@ankostis
Copy link
Contributor Author

ankostis commented Feb 17, 2021

Actually the same has to happen on the extension js-code [edit:] for bookmarks & history:

normalised_url: normalise_url(url),

Also need to trim the URL before searching; would there be any problem to that?

ankostis added a commit to ankostis/promnesia that referenced this issue Feb 17, 2021
- reason: canonicizing "#tag" eliminates it, fetchi everything,
  as discussed in karlicoss#195
- Missing a respective js-change in the extension for bookmarks/history.
karlicoss pushed a commit that referenced this issue Feb 17, 2021
karlicoss pushed a commit that referenced this issue Feb 17, 2021
- reason: canonicizing "#tag" eliminates it, fetchi everything,
  as discussed in #195
- Missing a respective js-change in the extension for bookmarks/history.
@karlicoss karlicoss added enhancement New feature or request frontend Related to browser extension labels Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request frontend Related to browser extension
Projects
None yet
Development

No branches or pull requests

2 participants