Opened 3 years ago

Closed 2 years ago

#799 closed enhancement (fixed)

RefExtract: introduce author extraction mode

Reported by: simko
Owned by: chayward
Priority: major
Milestone:
Component: DocExtract
Version:
Keywords:
Cc:

Description

RefExtract should be enhanced with an author extraction mode that behaves like giva. That is, given an input PDF file, one should be able to run:

$ refextract --extract-authors -f 1:file.pdf

and RefExtract should study the beginning portion of the file, looking for authors and affiliations, and it should output something like:

    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">Doe, J</subfield>
      <subfield code="u">U. Foo</subfield>
    </datafield>
    <datafield tag="700" ind1=" " ind2=" ">
      <subfield code="a">Bloggs, J</subfield>
      <subfield code="u">U. Bar</subfield>
    </datafield>
    <datafield tag="700" ind1=" " ind2=" ">
      <subfield code="a">Mustermann, E</subfield>
      <subfield code="u">U. Xyzzy</subfield>
      <subfield code="u">U. Zyxxy</subfield>
    </datafield>
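
To make the expected output shape concrete, here is a minimal Python sketch that renders a list of already-detected authors into datafields like the ones above. It is not RefExtract code: the function name `authors_to_marcxml` and the `(name, affiliations)` input structure are assumptions made for illustration, and the detection step itself (finding authors and affiliations in the PDF) is out of scope here.

```python
from xml.sax.saxutils import escape


def authors_to_marcxml(authors):
    """Render (name, affiliations) pairs as MARCXML datafields.

    Hypothetical helper: the first author is emitted under tag 100 and
    every further author under tag 700, mirroring the sample output in
    this ticket. Each affiliation becomes one $u subfield.
    """
    lines = []
    for position, (name, affiliations) in enumerate(authors):
        tag = "100" if position == 0 else "700"
        lines.append('<datafield tag="%s" ind1=" " ind2=" ">' % tag)
        lines.append('  <subfield code="a">%s</subfield>' % escape(name))
        for affiliation in affiliations:
            lines.append('  <subfield code="u">%s</subfield>' % escape(affiliation))
        lines.append('</datafield>')
    return "\n".join(lines)


if __name__ == "__main__":
    # Same authors as in the sample output of this ticket.
    print(authors_to_marcxml([
        ("Doe, J", ["U. Foo"]),
        ("Bloggs, J", ["U. Bar"]),
        ("Mustermann, E", ["U. Xyzzy", "U. Zyxxy"]),
    ]))
```

Names and affiliations are XML-escaped, so an affiliation such as "A & B Inst." would not break the record.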

In other words, refextract would provide two modes: the traditional --extract-references mode, which would remain the default, and a new --extract-authors mode, whose addition is the task of this ticket.
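
The two-mode command line described above could be wired up roughly as follows. This is only a sketch using Python's standard `argparse` module, not the actual refextract option handling: `build_parser`, the `mode` and `fulltexts` attribute names, and the `ID:FILE` reading of the `-f` argument are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a hypothetical parser offering the two extraction modes."""
    parser = argparse.ArgumentParser(prog="refextract")

    # The two modes exclude each other; both flags store into 'mode'.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--extract-references", dest="mode",
                      action="store_const", const="references",
                      help="extract references (default)")
    mode.add_argument("--extract-authors", dest="mode",
                      action="store_const", const="authors",
                      help="extract authors and affiliations")
    # Reference extraction stays the default when no mode flag is given.
    parser.set_defaults(mode="references")

    # Assumed meaning of '-f 1:file.pdf': an identifier, a colon, a file.
    parser.add_argument("-f", dest="fulltexts", action="append",
                        default=[], metavar="ID:FILE")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--extract-authors", "-f", "1:file.pdf"])
    print(args.mode, args.fulltexts)
```

Putting both flags in one mutually exclusive group means that passing them together is rejected by the parser, while existing invocations without any mode flag keep their current behaviour.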

(Note that this may later raise the question of marking detected fields with provenance information in $2 and $9, so that author extraction can be run automatically on the back end and refextract-found fields do not overwrite human-edited fields.)

Change History (2)

comment:1 Changed 2 years ago by simko

  • Status changed from new to in_merge

comment:2 Changed 2 years ago by simko

  • Resolution set to fixed
  • Status changed from in_merge to closed