Opened 4 years ago

Closed 4 years ago

#131 closed defect (fixed)

Correct parentheses support in in search parsing

Reported by: jblayloc Owned by: jblayloc
Priority: critical Milestone:
Component: WebSearch Version:
Keywords: INSPIRE Invenio Syntax News Oct Cc:

Description

moved from https://savannah.cern.ch/task/index.php?13998

2010-02-17 08:26, original submission:

search_engine_query_parser should handle nested parens.

This is useful in general, but in particular:

_expand_search_patterns in search_engine_query_parser.py uses the wrong logic:

find a mele or t compton scattering (no quote) should properly be translated as:

author:mele or (title:compton and title:scattering)

however it was being translated as

author:mele or title:compton or title:scattering

Since the parser claims not to be able to handle nested parens,the best I can do at the moment is:

author:mele or title:compton and title:scattering

which works better in that it fails in a slightly less spectacular manner. Fortunately these complex searches are not very likely....

however, being able to handle nested parens would allow us to feed this back to the parens parser and do it right.

2010-02-17 22:17, comment #1:

find a mele or t compton scattering

now becomes

author:mele or (title:compton and title:scattering)

which works unless it is embedded in another parentheses. This is unlikely to happen, so this then is pretty reasonable. However, handling nested parens, maybe parsing spires in a different way, would make other syntax tricks much easier.

2010-04-01 17:44, comment #2:

Another reason this is important is that author searches in SPIRES expand into several boolean queries, and these must be parenthesized, leading to easy nesting...

2010-06-08 19:14, comment #3:

This awaits integration.

For discussion, see thread "SearchQueryParenthesisedParser complete..." starting at https://groups.cern.ch/group/projec...

For code, see commit #5001a568aeb on my branch 'nested_parens_aima'.

Joe

Change History (27)

comment:1 Changed 4 years ago by jblayloc

  • Status changed from new to in_merge

comment:2 Changed 4 years ago by jblayloc

  • Keywords INSPIRE Invenio Syntax added

comment:3 Changed 4 years ago by jblayloc

When this merges, jblayloc has a stack of tickets that should be checked and possibly closed, including:
#67, #181, #189, #190, #191.

comment:4 Changed 4 years ago by jblayloc

I've just freshly cherry-picked this work over top of a checkout of today's master. All the tests pass on my system, and I believe it's still ready for merge. The branch is in my invenio repository, branch name 'trac-131'.

comment:5 Changed 4 years ago by jblayloc

  • Status changed from in_merge to assigned

While reviewing my waiting branches for Tibor, I decided that I should give this another once-over just to make sure everything's right. I'll put it back in_merge when I'm done confirming that everything seems right.

comment:6 Changed 4 years ago by jblayloc

  • Status changed from assigned to in_work

comment:7 Changed 4 years ago by jblayloc

  • Component changed from WebStyle to WebSearch
  • Keywords News added
  • Status changed from in_work to in_merge

This is ready to go.

Tibor, this is a larger change and touches more of the system than any of my other outstanding tickets, but in my tests it nevertheless applies cleanly to the tree alongside them. Despite being large and taking a long time, this patch is a higher priority than any of the other branches I have published.

As per the email I sent the other day about branch priorities (*), the revised list is now:

  • trac-131 (#131 rewrites SearchEngineQueryParser, fixes other bugs)
  • trac-187 (#187 fixes spires syntax bugs)
  • trac-133 (#133 adds h-index to citesummary page)
  • trac-130 (#130 enables search by IRNs)
  • trac-128 (#128 RSS logo rename w/ corresponding INSPIRE patch)

This commit should also close #67, #181, #189, #190, #191.

* The email in question is: my branches waiting for merge/install.

comment:8 Changed 4 years ago by jblayloc

oops; I just realized that I pushed out the wrong version of this. Since I don't think anyone has fetched it yet, I just rebased and squashed and pushed it out again. The current version is correct.

comment:9 Changed 4 years ago by jblayloc

Also now available as 131-Websearch_SearchQueryParenthesisedParser_rewrite

comment:10 Changed 4 years ago by simko

  • Status changed from in_merge to assigned

1) Cosmetic comment: if we take Peter Norvig's code and chances are it
may be updated, we'd better not delete unnecessary stuff to ease later
updates. If chances are we won't need to update often, and if we
modify it substantially already now (even though mostly cosmetically),
then let us rename the file to logicutils.py so that it plays better
with the other utility libraries we have (dateutils.py,
mailutils.py and the like).

2) I started to look at parse_query API changes that this branch
introduces, like:

self.assertEqual(self.parser.parse_query('(expr1) - expr2 + (expr3) | expr4'),
- ['+', 'expr1', '-', 'expr2', '+', 'expr3', '|', 'expr4'])
+ ['+', '+ expr1 | expr4', '+', '- expr2 | expr4', '+', '+ expr3 | expr4'])

in order to check how would they influence the search engine behaviour
(since apparently the output could have been kept the same in this
particular case due to no nested parens). When doing so, I noticed a
few problems:

2a) try to query for:

(ellis -muon (|kaon))

you will get mismatched parens warning;

2b) try the following query:

(ellis )

which gets parsed as:

basic search units are: [['+', '(ellis', '', 'w'], ['+', ')', '', 'w']]

which is not correct;

2c) try queries like:

U(1)
U(1) -SL(2,Z)
(U(1) -SL(2,Z))

The last one fails. (Actually, you can just run
python ./websearch_regression_tests.py --yes-i-know, since we have
a test case for this `apparently nested' parens case that is not
really a nested parens case.)

Can you please have a look at those issues?

comment:11 Changed 4 years ago by tbrooks

  • Keywords Oct added

comment:12 follow-up: Changed 4 years ago by jblayloc

1) Cosmetic comment: if we take Peter Norvig's code and chances are it

Ok, I've changed the logic.py to be logicutils.py. I started to go back and try minimizing the number of changes to the library for easier merging, and I simultaneously remembered that about a third of those changes were necessary to eliminate code dependencies on other AIMA modules, and realized that AIMA hasn't had a significant update in a year or more. So taking over maintenance is probably fine.

2a) try to query for:

(ellis -muon (|kaon))

you will get mismatched parens warning;

This is probably related to the problem outlined in #269. I'll look into this.

2b) try the following query:

(ellis )

which gets parsed as:

basic search units are: +', '(ellis', '', 'w'], ['+', ')', '', 'w?

which is not correct;

This is definitely #269. Note that master is broken as well.

2c) try queries like:

U(1)
U(1) -SL(2,Z)
(U(1) -SL(2,Z))

The last one fails. (Actually, you can just run
python ./websearch_regression_tests.py --yes-i-know, since we have
a test case for this `apparently nested' parens case that is not
really a nested parens case.)

The unit test for U(1) fails against Atlantis in master as well. But I'll do my best to get it working in my branch.

comment:13 follow-up: Changed 4 years ago by tbrooks

So I just went through and confirmed that 2 a,b,c are all behaving the same on a machine running Joe's branch, inspirebeta, and invenio-demo.cern.ch i.e. it looks like they are all three separate from the original fix that this commit is meant to resolve.

While it would be nice to clean these things up now, (i.e Joe is thinking about it, and they may indeed have been caused by other recent commits on search behavior from Rado, me, or Joe, or somewhere else altogether) I don't think any of the 3 are nearly as important on an October timescale as the main commit for nested parens, which fixes a number of user-facing/user-reported issues.

I'd propose to get that moving along, and create new tickets for these additional fixes.

When creating new tickets it is also worth understanding why the regression tests for the (U(1) OR SL(2,Z)) expects a different result with or without parens. Indeed invenio-master does behave differently in the two cases (albeit not in the manner expected by the test case) but I can't see why the test should not expect the same results there (cf.

    def test_special_terms_u1_and_sl_or(self):
        """websearch - query for special terms, U(1) OR SL(2,Z)"""
        self.assertEqual([],
                         test_web_page_content(CFG_SITE_URL + '/search?of=id&p=U%281%29+OR+SL%282%2CZ%29',
                                               expected_text="[57, 79, 80, 88]"))

    def test_special_terms_u1_and_sl_or_parens(self):
        """websearch - query for special terms, (U(1) OR SL(2,Z))"""
        self.assertEqual([],
                         test_web_page_content(CFG_SITE_URL + '/search?of=id&p=%28U%281%29+OR+SL%282%2CZ%29%29',
                                               expected_text="[57, 79, 88]"))

comment:14 Changed 4 years ago by jblayloc

Conversation on 2b should happen on #269, but it is not the query parser; it's being passed in from somewhere else already broken.

comment:15 follow-up: Changed 4 years ago by jblayloc

2a is gibberish syntax and the query parser actually should be breaking on it. But it should be giving a better error message. This is ticketized as #281.

comment:16 in reply to: ↑ 13 Changed 4 years ago by simko

Replying to tbrooks:

So I just went through and confirmed that 2 a,b,c are all behaving the same on a machine running Joe's branch, inspirebeta, and invenio-demo.cern.ch

Actually there is a difference in behaviour: e.g. query (ellis (NOT muon)) with master branch reports that parens are nested, so going to be dropped, and the query returns 11 hits. With Joe's branch, no message is emitted, but no results are returned either, one gets basically an empty page. This is less user-friendly than what the master branch does.

I'd propose to get that moving along, and create new tickets for these additional fixes.

If these are easy to fix, then let's fix them now as part of the "correct parens support" ticket. If they are less easy, then let's fix at least the empty results behaviour. If this require deeper changes, then let's consider even this separately indeed -- but we should at least emit something like "Sorry, no hits found, please avoid using nested parentheses in the query" or the like in this branch. (Of course, only when no hits are printed and when query used nested parens.)

When creating new tickets it is also worth understanding why the regression tests for the (U(1) OR SL(2,Z)) expects a different result with or without parens.

This is indeed a bug that was mistakenly introduced by 7b3804d29a49dd86a64eedba9714795244286653. I have just fixed it. Thanks.

comment:17 in reply to: ↑ 12 Changed 4 years ago by simko

Replying to jblayloc:

The unit test for U(1) fails against Atlantis in master as well.

U(1) works fine for me in master (only nested parens test case fails there, in the true TDD style). But U(1) also works fine in your branch. There is no problem with it, I was just illustrating query refinement, first two being OK, last one failing.

comment:18 in reply to: ↑ 15 Changed 4 years ago by simko

Replying to jblayloc:

2a is gibberish syntax and the query parser actually should be breaking on it.

It is not fully gibberish: (|kaon) should be interpreted as a Boolean expression inside parens, due to the presence of |. |kaon is a legal search unit, meaning "find me <empty string> OR kaon", resulting in everything(*). The whole query then means "find me records containing word ellis AND NOT those containing word muon AND those containing anything". Yes, the third term does not really matter, but it is "legal", because |kaon is legal.

(*) Since we start from the record universe, as it were. Kind of like -kaon returns every record not having word kaon. Though we could muse whether it would not be better to treat |kaon as kaon, returning records containing the word kaon, not everything. As Google seems to be doing.

comment:19 Changed 4 years ago by tbrooks

Regarding (ellis(|kaon)) and gibberish, let me propose to take the discussion over to that ticket, #281, since that, is not as urgent. I will note, re the urgency/lack of urgency of this, that I find empty results here:

http://invenio-demo.cern.ch/search?p=ellis+%28not%28quark%29%29&f=&action_search=Search&c=Atlantis+Institute+of+Fictive+Science&sf=&so=d&rm=&rg=10&sc=1&of=hb

but we can find time to look IRL tomorrow, that will be easier than ticketing about this.

comment:20 Changed 4 years ago by jblayloc

So after hunting for a very long time, struggling to understand why my parser wasn't properly handling the nested parens cases such as (U(1)), I found this in search_engine.py:

1894     # sanity check: do not call parenthesised parser for search terms                                                                                   
1895     # like U(1):                                                                                                                                        
1896     if not re_pattern_parens.search(p):                                                                                                                 
1897         return search_pattern(req, p, f, m, ap, of, verbose, ln, display_nearest_terms_box=display_nearest_terms_box)                                   

...after struggling to fix my code, I discover it's not even being called.

Simply commenting out this line doesn't fix the problem, so something more subtle is needed.

comment:21 Changed 4 years ago by jblayloc

Ok I think it's as fixed as I can make it. Tibor, to get better I think I need your help.

I've pushed my branch 131-WebSearch_SearchQueryParenthesisedParser_rewrite_and_bugfix, which should be clean and ready for merge. I fixed a few small bugs.

I also found the source of the U(1) problem is the set of lines above, which were normally not called for this parenthesised case in the past because search_engine would simply throw an exception when it saw the parens were nested. If you run an ipython session and import the parser, and manually call .parse_query('(U(1) OR SL(2,Z))') though, you'll see that it's output is correct.

I think the next step is to remove the lines above, but I don't understand the consequences of doing that. My unit tests continued to work fine (including the unit test for U(1)) but many regression tests fail. With this patch applied the number of failing regression tests falls from 10 to 6. If search_engine is fixed that number will go to 5. But with that set of lines in #comment20 commented out, the number of failures jumps to something like 19.

I'm out of energy for figuring out why, and I'm going to go sleep.

If possible I'd like to get this branch merged and then worry about fixing search_engine tomorrow. Or Wednesday.

comment:22 Changed 4 years ago by simko

WRT re_pattern_parens regexp match, this was introduced so that query like (...) gets treated by the parenthesized parser only when there is an empty space inside ..., i.e. only when we have a chance of having a Boolean expression to process. Otherwise we consider stuff between brackets as a word, basically. So U(1) would not get processed since not matched by this regexp, but (ellis OR kaon) would get processed, which is mostly what we want.

Now that we support nested parens, I think this regexp should be changed so that parenthesized parser will get invoked if ... contains either en empty space or another paren (=having a nested paren situation).

Note that the role of this regexp is basically to save the search processing time, so that the supposedly slower parenthesized parser would get invoked only when necessary.

Note also that cases like (U(1)) are touched by this regexp, but not the cases I mentioned above like (ellis (NOT muon)) that would get dispatched to the parenthesized parser, but that behaved differently in master and in your branch. (I have not yet looked at your updated branch how it behaves now, these are just comments to your textual musings.) It is mostly for these cases that we have to ensure the sane behaviour in case of no hits found, in case they are hidden inside some complex foo AND bar (NOT baz (fuux OR quux)) OR (xyzzy AND (NOT zyxxy)) sub-expression. So while the regexp is a real issue, it is a kind of parallel one. I guess we should be able to fix this parallel issue relatively easily by altering the regexp as mentioned above.

comment:23 Changed 4 years ago by jblayloc

I have rebased 131 against the new master as of this morning and run its unit and regression tests. This branch is published as 131-WebSearch_SearchQueryParenthesisedParser.

I'll deploy it and start checking out some of these search issues momentarily.

comment:24 Changed 4 years ago by jblayloc

  • Status changed from assigned to in_merge

comment:25 Changed 4 years ago by jblayloc

Hi Tibor,

you had verbally requested a couple changes when we last spoke at CERN. I think that those should now be complete. I squashed them into the previous branch, so you'll have to fetch again. It is pushed as 131-WebSearch_SearchQueryParenthesisedParser.

Thanks!

comment:26 Changed 4 years ago by simko

Actually there were some more (e.g. to have logicutils from the start)
as well as some cosmetics (e.g. pylint enable-msg in more places).
I have now done these, you can take a git diff to see what exactly
I changed while cherry-picking. I'm going to push in a minute, but
nevertheless please keep looking at the cases mentioned in comment:22
and elsewhere, since we should still fix them properly.

comment:27 Changed 4 years ago by Joe Blaylock <jrbl@…>

  • Resolution set to fixed
  • Status changed from in_merge to closed

In [1fc28f35b2ce969037e54905ec08bcc7697fe914]:

WebSearch: SearchQueryParenthesisedParser rewrite

  • New SQPP supports parenthetic subexpressions nested to arbitrary depth.
  • Unit tests corrected to reflect both the new support for parentheses and because some of them should not have been correct with the old parser, either. (fixes #131) (fixes #67) (fixes #181) (fixes #189) (fixes #190) (fixes #191)
  • Introduction of logicutils unit tests.
Note: See TracTickets for help on using tickets.