Opened 3 years ago

Last modified 11 months ago

#363 in_review defect

Number of results shown in XML-ish formats

Reported by: tbrooks Owned by: simko
Priority: major Milestone:
Component: BibFormat Version:
Keywords: Cc:

Description

Many users need to create CVs and export their publication lists to bibliographic management software like EndNote or BibTeX.

These users are usually downloading far more than 25 records, so the limit should be increased dramatically for these sorts of downloads.

At the very least, some sort of notice should be given, rather than just the comment at the top of the EndNote output, which implies that the full result set is included:
http://inspirebeta.net/search?ln=en&p=dorfan%2C+j&f=&action_search=Search&sf=&so=d&rm=&rg=25&sc=0&of=xe

<!-- Search-Engine-Total-Number-Of-Results: 1028 -->

which appears at the top of a file containing only 25 records...

Change History (14)

comment:1 Changed 3 years ago by arwagner

More than 25 records are not an issue in EndNote Tagged (aka RIS) or BibTeX format.

However, in my experience they produce considerable load when imported from an XML format. The main reason is that XML parsers can perform very badly on larger sets, especially if those sets are huge. In a local use case it proved a bad idea to write out more than 100 records in XML; the memory footprint required by the parser can also be quite an issue. For example, in my usage (with Perl XML::XPath), processing 1000 records one record at a time performs much better than processing them in one chunk. For common papers in experimental HEP (with large author lists) far fewer records seem advisable; sometimes a single record is already quite a job for the importer.

Therefore, it seems advisable not to increase the number of records in one chunk, but to offer the user several separate files to download and import. A similar approach is implemented in OAI-PMH.

comment:2 Changed 3 years ago by simko

(1) Note that people can use the Display results selection box to see up to 100 results per page; the limit of 25 is only the default, which people can tweak.

(2) Moreover, people can manipulate the rg (= records in groups of) argument in the URL to download pages containing up to CFG_WEBSEARCH_MAX_RECORDS_IN_GROUPS records per page. This value is set to 200 for INSPIRE. It is kept moderate so that people do not overload the server by downloading tons of stuff in one go.

(3) If people need to download more, they can add the jrec (= jump to record) URL argument, which displays the result set starting from a certain position, exactly like the jump to record selection box in the search interface does.

So, joining all these together, we usually advise people to do stuff like:

$ wget -O z1.xml 'http://inspirebeta.net/search?p=ellis&of=xm&jrec=1&rg=200'
$ sleep 10
$ wget -O z2.xml 'http://inspirebeta.net/search?p=ellis&of=xm&jrec=201&rg=200'
[...]

in order to get everything they want in a gentle way. However, this is perhaps not very user friendly, and is targeted more at power users.
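For illustration, the same chunked fetching could also be scripted, e.g. in Python; the base URL, query, chunk size and total count below are just the values from the example above, not fixed settings:

# Sketch: fetch a result set in chunks of `rg` records using the jrec argument.
# Base URL, query, chunk size and total are assumptions taken from the wget example.
import time
import urllib

BASE = 'http://inspirebeta.net/search'
QUERY = 'ellis'
CHUNK = 200    # stay below CFG_WEBSEARCH_MAX_RECORDS_IN_GROUPS
TOTAL = 1000   # e.g. read from the Search-Engine-Total-Number-Of-Results comment

for i, jrec in enumerate(range(1, TOTAL + 1, CHUNK), start=1):
    params = urllib.urlencode({'p': QUERY, 'of': 'xm', 'jrec': jrec, 'rg': CHUNK})
    urllib.urlretrieve('%s?%s' % (BASE, params), 'z%d.xml' % i)
    time.sleep(10)   # be gentle to the server, as in the wget example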

So, we can raise the visible limit from 200 to, say, 500 or even 1000, and put this upper limit in the visible Display results box so that people can make the selection more easily, but we should not raise it too much, since it may affect performance.

comment:3 Changed 3 years ago by arwagner

The problem referred to in the first comment is rather that people actually do not want more than, say, 100 records in an XML export at a time, at least not once they have tried to import such a file into reference-management software, as it can almost lock up their machine (not the Invenio server, but the user's client machine).

The wget method mentioned in the second comment seems to be what the average user wants, though today not many users will be aware of how simple this solution is. One could probably add logic that generates a result page containing the above URLs as download links, so the user could just use the mouse to fetch them easily.
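As a rough sketch of that logic (the function name and parameters below are hypothetical, not existing Invenio code), such a page could be built from the total hit count and a chunk size:

def chunked_download_links(base_url, total_hits, chunk=100):
    """Return (label, url) pairs splitting a result set into chunks of `chunk` records."""
    links = []
    for jrec in range(1, total_hits + 1, chunk):
        last = min(jrec + chunk - 1, total_hits)
        url = '%s&jrec=%d&rg=%d' % (base_url, jrec, chunk)
        links.append(('records %d-%d' % (jrec, last), url))
    return links

# e.g. chunked_download_links('http://inspirebeta.net/search?p=ellis&of=xm', 800)
# -> 8 link targets of 100 records each, which the result page could render as anchors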

comment:4 follow-up: Changed 3 years ago by simko

BTW, I forgot to comment on the XML comment bit. Yes, it would be very useful to extend it to something like:

<!-- Search-Engine-Query: http://inspire-hep.cern.ch/search?p=ellis&of=xm -->
<!-- Search-Engine-Date: 2010-11-24 16:32:30 -->
<!-- Search-Engine-Total-Number-Of-Results: 2404 -->

and before printing every record, print a comment like:

<!-- Search-Engine-Result: 137 of 2404 -->

and at the end:

<!-- Search-Engine-Next-Results: http://inspire-hep.cern.ch/search?p=ellis&of=xm&jrec=26 -->
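As a rough illustration of the idea (the function and variable names below are hypothetical sketches, not the actual BibFormat code), the output format could wrap the record loop like this:

# Hypothetical sketch of wrapping the XML output with the proposed comments.
# `records`, `query_url`, `total_hits` etc. are placeholders, not real BibFormat API.
import time

def print_xml_with_comments(req, records, total_hits, query_url, jrec=1, rg=25):
    req.write('<!-- Search-Engine-Query: %s -->\n' % query_url)
    req.write('<!-- Search-Engine-Date: %s -->\n' % time.strftime('%Y-%m-%d %H:%M:%S'))
    req.write('<!-- Search-Engine-Total-Number-Of-Results: %d -->\n' % total_hits)
    for position, record_xml in enumerate(records, start=jrec):
        req.write('<!-- Search-Engine-Result: %d of %d -->\n' % (position, total_hits))
        req.write(record_xml)
    if jrec + rg <= total_hits:
        req.write('<!-- Search-Engine-Next-Results: %s&jrec=%d -->\n' % (query_url, jrec + rg))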

comment:5 Changed 3 years ago by tbrooks

Indeed a useful conversation. I think there are two key things that my story illustrates well:

The CV I was dealing with came from a group of folks trying to assemble a CV for Jonathan Dorfan, ex-Director of SLAC, whose publication list is around 800 papers long. This is a typical use case where you need to serve a high-profile user (who naturally has a long list of publications) who needs to do this only once or twice (and thus will not, and need not, learn some unfriendly trick like hacking the URL).

To resolve this I think we need to do two things in response:

1) Raise the upper limit clearly available to users in the dropdown. 25 / 50 / 100 / 1000 (with a warning that it may be slow) would seem like a nice set of choices. Alternatively, one could designate certain formats as no-limit or high-limit formats. Clicking through a list of 800 publications 50 at a time is usually not something folks will want to do, and most people don't write to us to find out how to hack the URL.

2) For those who can't or don't select larger limits on the number of records, the XML comment solution proposed by Tibor is exactly right and will be very helpful.


comment:6 Changed 3 years ago by arwagner

The point that such large lists are necessary and need to be served is crystal clear, and some way to serve them needs to be found. IMHO this can actually be accomplished easily by formats like BibTeX or EndNote Tagged (RIS). Simply dumping it out in XML-style formats might not give the user what they expect here: if your XML parser takes hours and GBs of RAM to process such large structures, it is no fun at all. Therefore it might be a good idea to guide users towards a better-suited format for such large lists.

Some observations, as I have similar use cases in a bibliometric context. These deal not only with publication lists of well-known researchers but with whole institutions (or even countries), so I sometimes need to process more than 10,000 records at a time. Usually I get them through web service interfaces from some database (whether that is fast is another story, but that's life...). The observations here are pretty simple. First, XML is far too chatty, resulting in large amounts of data that need to be transferred. From the end user's point of view this can be neglected in the CV use case on current hardware, but imagine the publication list of SLAC; there the story already gets interesting. From the server's point of view, this might be an issue. The next observation is that if you want to process a decent number of records containing decent bibliographic data (and maybe even links to the references) in XML, you should definitely limit the chunk size. On a usual desktop PC, ~100 records are, in my observation, fine. You can go a bit higher, but fun decreases dramatically, and non-linearly, with the number of records processed at once. Depending on who processes the records, you also have to assume a usual desktop computer, not a workstation. Throwing enough computing power at it may lift the upper limit a bit.

A look at "the competitors" (be they orange, green, or blue) might be in order here. Their web services (a different usage, I'm well aware, but the XML issue is the same) usually limit you to 100 records per query. If you want more, you have to load several chunks on the client side and add decent pauses between your requests, otherwise you trigger the robot-detection system that prevents overloading the database. Similarly, in OAI-PMH you get only a limited number of records per chunk. However, if you download bibliographic data from their web interface, you can get at least 500 records in EndNote Tagged without any issues. So the chunk size for "not XML" is five times as large; one could imagine this means something.

comment:7 in reply to: ↑ 4 ; follow-up: Changed 3 years ago by valkyrie

Replying to simko:

<!-- Search-Engine-Query: http://inspire-hep.cern.ch/search?p=ellis&of=xm -->

How do I get this URL?

comment:8 in reply to: ↑ 7 Changed 3 years ago by simko

Replying to valkyrie:

How do I get this URL?

As discussed on the chat, you can use req.unparsed_uri and the like.
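For instance, a minimal sketch (assuming the mod_python request object, with simplified scheme/host handling):

# Rebuild the full search URL for the Search-Engine-Query comment (sketch).
# req is the mod_python request object; req.unparsed_uri holds the path plus query string.
query_url = 'http://' + req.hostname + req.unparsed_uri
req.write('<!-- Search-Engine-Query: %s -->\n' % query_url)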

comment:9 Changed 3 years ago by valkyrie

  • Status changed from new to in_merge

These changes are available in my invenio AFS repo at SLAC on the branch 363-output_format_hacks. As per the discussion on chat, the dropdown box will show 10, 25, 50, 100, 200 until Inspire's CFG_WEBSEARCH_MAX_RECORDS_IN_GROUPS is changed.

comment:10 Changed 3 years ago by valkyrie

  • Status changed from in_merge to assigned

comment:11 Changed 3 years ago by valkyrie

  • Owner set to valkyrie

comment:12 Changed 3 years ago by valkyrie

  • Status changed from assigned to in_merge

comment:13 Changed 3 years ago by valkyrie

Now with fresh rebase on newest master.

comment:14 Changed 11 months ago by simko

  • Owner changed from valkyrie to simko
  • Status changed from in_merge to in_review