Opened 2 years ago

Closed 17 months ago

Last modified 17 months ago

#944 closed enhancement (fixed)

Including DocExtract

Reported by: adeiana
Owned by: adeiana
Priority: minor
Milestone:
Component: DocExtract
Version:
Keywords: INSPIRE
Cc:

Description

I am asking to merge the new refextractor.

It is squashed in a single commit. You can access the branch here:
http://invenio-software.org/repo/personal/invenio-adeiana/commit/?h=refextract-merge&id=b37f6956e3472c88871ec0924456fec0283a3235

Change History (7)

comment:1 Changed 2 years ago by adeiana

  • Status changed from new to in_merge

comment:2 Changed 2 years ago by adeiana

  • Keywords INSPIRE added

comment:3 Changed 23 months ago by adeiana

The new branch is here: adeiana/944-refextract

comment:4 Changed 17 months ago by Alessio Deiana <alessio.deiana@…>

  • Resolution set to fixed
  • Status changed from in_merge to closed

In 9c44fffa48aba22a416ebd09a41fa15a97158148:

DocExtract: new docextract and refextract modules

  • Adds DocExtract as a way to easily access all text mining facilities. It will allow extracting references, authors, plots, etc. (closes #944)
  • Moves the refextract scripts from the bibedit module into its own module.
  • Adds a new api to use the refextract module. It includes calls to:
    • update_references(): update references by passing a record id;
    • extract_references_from_*(): extract and parse references from file/url/record id/string;
    • new function that returns the marcxml of the record with updated references;
    • new function to check if a record has a fulltext (pdf) attached.
  • Refextract filters out null characters from the text converted from PDFs, as they are refused by bibupload.
  • Adds several updates to refextract parsing:
    • handling of JHEP-like journals, as they need the last 2 digits of the year prepended to the volume;
    • adds support for ISBN. They are added in a new subfield called $$i;
    • adds support for references like CERN-LHCC2003-01 by transforming it to CERN-LHCC-2003-01;
    • adds a new subfield <subfield code="t">Text</subfield> where refextract stores references to quoted text "Text".
  • Adds a new option to the bibtask mode of refextract, "--no-overwrite", which checks each record for existing references before parsing it. If the record already has references, it skips it.
  • Fixes recent records detection:
    • only stores last_updated when running on recent records. This avoids the case where parsing the most recent record via --recids n updates the last_updated field and makes refextract skip all records preceding n;
    • only updates last_id and last_updated when, respectively, the new id is bigger and the new last_updated is more recent. This prevents storing an old date when parsing old records.
  • Handles the format arXiv:9910.1234 [physics.ins-det].
  • Fixes numeration checking when looking for the end of references.
  • Reworks xbook as a single tag: xbook used to store the book title; the title is now always stored in $$t.
  • New authors recognized:
    • Figuera-O'Farrill
    • P. Pre'
    • Dan V. Schroeder
  • Adds 9+ and w+ to report numbers format.
  • Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).
  • Handles PoS LAT2007 (2007) 12 journal.
  • Handles report numbers like CERN/LHCC/98-013.
  • Handles C67:674,1998 numeration.
  • Adds a new way to recognize journals, which is needed when we recognize short titles. Often the short title or initials of a journal conflict with other names, e.g. DAN (the journal) and Dan (a common first name). We handle it via precise regular expressions.
  • Match Acknowldgment and Acknowledgment as end of sections.
  • Format hep report numbers to hep-th/999999.
  • Recognizes roman numerals as volume numbers.
  • Removes [] and () from o subfield.
  • Removes extra spaces at the end of lines.
  • Does not try to detect C and D as roman numerals. It would result in some series letters being detected instead.
  • Does not detect "B, 07" volumes anymore, since some of these come from different journals (Phys.Rev. and Phys.Rev.B).
  • Format hep-ex report numbers.
  • Tweaks how the beginning and the end of the references sections are found.
  • Allows dashes as separators for numeration.
  • REST api to run refextract.
  • Defaults to inspire format on CLI when running on an inspire site.
  • Handles journals with the series included in the title.
  • Introduces a separator in journals kb: Phys.Rev.B maps to Phys.Rev.;B.
  • Handles Phys.Rev.;B by splitting the B from the journal title and adding it in front of the volume.
  • Repackages docextract and refextract in one directory.
  • Search hook for searching from a reference.
  • Updates binaries to use template.in for custom python binaries paths.
  • Splits daemon functionality which remains in refextract and cli functionality which is moved to docextract.
  • Recognizes publishers.
  • Removes JINST from special journals.
  • Moves special journals kb to a file.

  • Allows extracting references from an arXiv id.
  • kbs loading optimization: they are now cached in memory after being loaded.
  • Create RT tickets after extracting references.
  • Fixes footer removal when references section contains ")".
  • Escape ibid authors for xml (was leading to bibupload failed tasks).
  • Handle erratum-ibid (closes #1014)
  • Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.
  • arXiv papers can have several revisions over the first week, and curation of these papers is delayed by that one week. As a result, we decided to re-extract references when an arXiv record is modified during its first week.
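
To illustrate a few of the rewrites listed above (CERN-LHCC2003-01, hep-lat-9999, astro-php-09, JHEP-like volumes and the Phys.Rev.;B separator), here is a minimal Python sketch. It is not the code from the commit: the function names and the regular expressions are hypothetical, and only the input/output pairs quoted in the list are taken from the commit message.

```python
import re


def normalize_report_number(text):
    """Apply three of the report-number rewrites from the list (hypothetical helper)."""
    # CERN-LHCC2003-01 -> CERN-LHCC-2003-01: insert the missing dash before the year.
    text = re.sub(r"^(CERN-LHCC)(\d{4}-\d+)$", r"\1-\2", text)
    # hep-lat-9999 -> hep-lat/9999: the dash before the number becomes a slash.
    # The set of hep-* prefixes here is an assumption.
    text = re.sub(r"^(hep-(?:th|ph|ex|lat))-(\d+)$", r"\1/\2", text)
    # astro-php-09 -> astro-ph/09: drop the stray "p", dash becomes a slash.
    text = re.sub(r"^astro-php?-(\d+)$", r"astro-ph/\1", text)
    return text


def jhep_volume(year, volume):
    """Prepend the last two digits of the year to a JHEP-like volume."""
    return f"{int(year) % 100:02d}{volume}"


def split_series(kb_title, volume):
    """Move the series letter after ';' in a kb title to the front of the volume."""
    title, sep, series = kb_title.partition(";")
    if sep:
        return title, series + volume
    # No separator in the kb entry: leave title and volume as they are.
    return kb_title, volume
```

For example, normalize_report_number("CERN-LHCC2003-01") gives "CERN-LHCC-2003-01", and split_series("Phys.Rev.;B", "67") gives ("Phys.Rev.", "B67").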
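
The "recent records detection" fix can be pictured with the following sketch. It assumes the state is a (last_id, last_updated) pair and that a run started with explicit record ids leaves that state untouched; both the function name and that second assumption are mine, not taken from the commit.

```python
from datetime import datetime


def update_checkpoint(checkpoint, recid, modified, recent_run):
    """Return the new (last_id, last_updated) pair after processing one record."""
    last_id, last_updated = checkpoint
    if not recent_run:
        # Assumption: explicit runs such as --recids n never move the checkpoint,
        # so records preceding n are not skipped on the next "recent" run.
        return checkpoint
    # Only move forward: a smaller id or an older date must not overwrite the state.
    if recid > last_id:
        last_id = recid
    if modified > last_updated:
        last_updated = modified
    return (last_id, last_updated)
```

Each field is compared on its own, so parsing an old record with a high id advances last_id without storing an old date.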

comment:5 Changed 17 months ago by Alessio Deiana <alessio.deiana@…>

In 9c44fffa48aba22a416ebd09a41fa15a97158148:

DocExtract: new docextract and refextract modules

comment:6 Changed 17 months ago by Alessio Deiana <alessio.deiana@…>

In 9c44fffa48aba22a416ebd09a41fa15a97158148:

DocExtract: new docextract and refextract modules

comment:7 Changed 17 months ago by Alessio Deiana <alessio.deiana@…>

In 9c44fffa48aba22a416ebd09a41fa15a97158148:

DocExtract: new docextract and refextract modules
