Opened 3 years ago

Last modified 2 years ago

#559 assigned enhancement

BibUpload: Cannot bibupload file containing UTF-8 chars

Reported by: grfavre Owned by: skaplun
Priority: major Milestone:
Component: WebSubmit Version:
Keywords: bibdocfile unicode Cc:

Description

When trying to bibupload -r the attached file, the task crashed on first record. Looking at invenio.err, I found a UnicodeError.

I tried this using 4suite and pyRXP. Current setting is
CFG_BIBRECORD_PARSERS_AVAILABLE = ['pyrxp', '4suite', 'minidom']

>>> Traceback details

Traceback (most recent call last):
  File "/var/www/infoscience.epfl.ch/private/infoscience-env/lib/python2.6/site-packages/invenio/bibtask.py", line 754, in _task_run
    if callable(task_run_fnc) and task_run_fnc():
  File "/var/www/infoscience.epfl.ch/private/infoscience-env/lib/python2.6/site-packages/invenio/bibupload.py", line 1987, in task_run_core
    pretend=task_get_option('pretend'))
  File "/var/www/infoscience.epfl.ch/private/infoscience-env/lib/python2.6/site-packages/invenio/bibupload.py", line 343, in bibupload
    rec_xml_new = record_xml_output(record)
  File "/var/www/infoscience.epfl.ch/private/infoscience-env/lib/python2.6/site-packages/invenio/bibrecord.py", line 899, in record_xml_output
    return '\n'.join(marcxml)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)
Locals by frame, innermost last

The file is UTF-8 encoded and contains accented characters in the second record (see line 87).
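The UnicodeDecodeError in the traceback is what Python 2 raises when a join meets a mix of byte strings and unicode objects: the byte strings are implicitly decoded with the default ascii codec, which fails on the first non-ASCII byte (0xc3 is the lead byte of a UTF-8 encoded "é"). Below is a minimal sketch of that mechanism, written in Python 3 so that the implicit step is spelled out. Here py2_style_join is a hypothetical helper for illustration only, not part of Invenio; Python 3 bytes stands in for Python 2 str, and Python 3 str for Python 2 unicode.

```python
def py2_style_join(sep, parts):
    """Mimic how Python 2 joins a list that mixes str and unicode items."""
    if any(isinstance(p, str) for p in parts):
        # One text item is enough to promote the whole join to text, so
        # every byte string gets decoded with the default codec (ascii).
        return sep.join(
            p if isinstance(p, str) else p.decode("ascii") for p in parts
        )
    # Byte strings only: they are concatenated untouched.
    return sep.encode("ascii").join(parts)


# UTF-8 bytes of an accented title; "\u00e9" is "é", encoded as 0xc3 0xa9.
title = "Oxymoron, un tr\u00e9sor".encode("utf-8")

# Byte strings only: the join succeeds.
all_bytes = py2_style_join("\n", [b"<record>", title, b"</record>"])

# One text object sneaks in: the implicit ascii decode of `title` fails.
error = None
try:
    py2_style_join("\n", ["<record>", title, b"</record>"])
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The position reported depends on the data, so it differs from the one in the traceback above; the codec and the offending byte are the same.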

Attachments (4)

buggy.xml (43.4 KB) - added by grfavre 3 years ago.
MARC XML file creating bibupload problem.
task.log (14.8 KB) - added by grfavre 3 years ago.
bibtask_xxx.log file
task.err (52 bytes) - added by grfavre 3 years ago.
bibtask_xxx.err file
invenio.err (12.8 KB) - added by grfavre 3 years ago.
invenio.err file

Change History (20)

Changed 3 years ago by grfavre

MARC XML file creating bibupload problem.

comment:1 Changed 3 years ago by jmartinm

  • Component changed from *general* to BibUpload
  • Summary changed from BibEdit: Cannot bibupload file containing UTF-8 chars to BibUpload: Cannot bibupload file containing UTF-8 chars

I modified the Summary and Component properties as the issue seems to be related to BibUpload rather than to BibEdit

comment:2 Changed 3 years ago by jcaffaro

  • Status changed from new to infoneeded_new

The exception seems really to take place in BibRecord, which lives in ... BibEdit ;-)

In any case the buggy.xml file works for me, both with:

$ ./bibupload -ri buggy.xml

and with:

>> from invenio.bibrecord import create_records
>> my_records = create_records(file('buggy.xml').read())
>> print my_records[1][0]['245']
[([('a', 'Oxymoron, un tr\xc3\xa9sor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheurs')], ' ', ' ', '', 8)]

both with 4suite and pyRXP. In Greg's case the BibUpload insertion fails with both parsers. (We are both on Python 2.6.x; I am running the latest Git master, while Greg is on RC0 AFAIK, but bibrecord.py does not seem to have changed in between.)

Greg, what do you get as result when running the second case above? Can you retry by downloading the file attached to the ticket (in case the encoding got changed in some way during the upload..)?

This reminds me of a behaviour encountered with the minidom parser, which by default returns unicode strings when "printed" instead of encoded byte strings, resulting in a similar issue.

BTW, what do you get with:

>> import sys
>> sys.getdefaultencoding()
'ascii'

comment:3 follow-up: Changed 3 years ago by grfavre

  • Status changed from infoneeded_new to new

Thanks for your reply!
No error traceback after running your excerpt in python shell.

Trying to download and rerun bibupload ended up with the same exception.
Default encoding is also ascii.

Maybe the error could come from my database ? I migrated mine from cdsware 0.5 to 0.7 3 years ago and to invenio 1.0 using the provided sql commands. Is there a way to check this?

Now that you have inserted records without trouble using bibupload -ri, could it be different if you retry using only the -r option? (This is my use case, as I'm trying to upgrade Invenio, not to build a brand new system without data.)

Debian 6.0 "Squeeze"
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
MySQL 5.1.49-3
python-MySQLdb 1.2.2

comment:4 Changed 3 years ago by jcaffaro

grfavre wrote:

No error traceback after running your excerpt in python shell.

ok, but what does it print:

[([('a', 'Oxymoron, un tr\xc3\xa9sor de fiches de...

or:

[([('a', u'Oxymoron, un tr\xc3\xa9sor de fiches de...

What if you do:

my_records[1][0]['245'][0][0][0][1] + 'foo'

You might also try:

$ /opt/invenio/bin/xmlmarclint buggy.xml
$ xmllint buggy.xml

Depending on the above results we might eliminate the possibility of an encoding problem in the DB (I don't think it is the case). Note that I could run bibupload -r buggy.xml without problem.

comment:5 Changed 3 years ago by grfavre

so:

>>> print my_records[1][0]['245']
[([('a', 'Oxymoron, un tr\xc3\xa9sor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheurs')], ' ', ' ', '', 8)]
>>> my_records[1][0]['245'][0][0][0][1] + 'foo'
'Oxymoron, un tr\xc3\xa9sor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheursfoo'
(infoscience-env)kis@kissrv49:~$ xmlmarclint work/buggy.xml
... no result displayed
(infoscience-env)kis@kissrv49:~$ xmllint
... command not found (i don't have sudo rights, cannot install it )

comment:6 in reply to: ↑ 3 Changed 3 years ago by simko

Replying to grfavre:

Maybe the error could come from my database ? I migrated mine from cdsware 0.5 to 0.7 3 years ago and to invenio 1.0 using the provided sql commands. Is there a way to check this?

1) First, you may want to check if your database runs in UTF-8 mode.
Here is an example:

$ /opt/invenio/bin/inveniocfg --detect-system-details
>>> Going to detect system details...
* Hostname: pcuds33
* Invenio version: 1.0.0-rc0.145-6639
* Python version: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)  [GCC 4.4.5]
* Apache version: Apache/2.2.17 (Debian) [/usr/sbin/apache2]
* MySQLdb version: 1.2.2
* MySQL version:
    - version: 5.1.49-3
    - character_set_client: utf8
    - character_set_connection: utf8
    - character_set_database: utf8
    - character_set_results: utf8
    - character_set_server: latin1
    - character_set_system: utf8
    - collation_connection: utf8_general_ci
    - collation_database: utf8_general_ci
    - collation_server: latin1_swedish_ci
>>> System details detected successfully.

(The Latin-1 bits are OK here, as far as the Invenio database and the
client connections are in UTF-8.)

2) If OK, then maybe you were affected by Latin-1 to UTF-8
transformations during your upgrades? Did you convert your table
content from Latin-1 to UTF-8 along the way? (Briefly speaking, I did
this by mysqldump'ing old data in Latin-1 charset, creating new tables
with proper UTF-8 default encoding, and loading the dump, and MySQL
would convert the dump into proper UTF-8.)

You can run some SELECT statements on your bibxxx tables in a UTF-8
capable terminal in order to see if your accents are properly stored
as UTF-8, and not as mangled UTF-8 or something.
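The difference between properly stored UTF-8 and "mangled" UTF-8 (text that was encoded twice during a Latin-1 to UTF-8 migration) can also be checked on the raw bytes. The following is only a heuristic sketch in Python 3; classify_utf8 is a hypothetical helper, not an Invenio function, and it assumes you already have the column value as raw bytes.

```python
def classify_utf8(raw):
    """Classify raw bytes as 'ascii', 'utf8', 'double-encoded' or 'not-utf8'."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"  # e.g. bytes still stored as plain Latin-1
    if all(ord(char) < 128 for char in text):
        return "ascii"  # nothing to check, ASCII is the same in both charsets
    try:
        # If the decoded text fits in Latin-1 and those bytes are valid
        # UTF-8 again, the value was most likely encoded twice.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"
    return "double-encoded"


good = "tr\u00e9sor".encode("utf-8")               # b'tr\xc3\xa9sor'
mangled = good.decode("latin-1").encode("utf-8")   # UTF-8 applied twice
latin1 = "tr\u00e9sor".encode("latin-1")           # b'tr\xe9sor'

for sample in (b"plain", good, mangled, latin1):
    print(repr(sample), classify_utf8(sample))
```

Being a heuristic, it can misreport a value that legitimately contains a sequence such as "Ã©", so treat a 'double-encoded' result as a hint to look at the row, not as proof.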

comment:7 follow-up: Changed 3 years ago by grfavre

Hi Tibor,

(infoscience-env)kis@kissrv49:~$ inveniocfg --detect-system-details
>>> Going to detect system details...
* Hostname: kissrv49
* Invenio version: 1.0.0-rc0
* Python version: 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)  [GCC 4.4.5]
* Apache version: Apache/2.2.16 (Debian) [/usr/sbin/apache2]
* MySQLdb version: 1.2.2
* MySQL version:
    - version: 5.1.49-3
    - character_set_client: utf8
    - character_set_connection: utf8
    - character_set_database: utf8
    - character_set_results: utf8
    - character_set_server: utf8
    - character_set_system: utf8
    - collation_connection: utf8_general_ci
    - collation_database: utf8_general_ci
    - collation_server: utf8_general_ci
>>> System details detected successfully.

The records were dumped as XML and rebibuploaded/reindexed last April (after the database was utf8ized), so it should be OK then.
After verification of a single record it seems OK:

mysql> select * from bib24x where id = 237;
+-----+--------+-----------------------------------------------------------------------------------------------------------------+
| id  | tag    | value                                                                                                           |
+-----+--------+-----------------------------------------------------------------------------------------------------------------+
| 237 | 245__a | Oxymoron, un tr?sor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheurs |
+-----+--------+-----------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from bib24x where id = 237;
+-----+--------+------------------------------------------------------------------------------------------------------------------+
| id  | tag    | value                                                                                                            |
+-----+--------+------------------------------------------------------------------------------------------------------------------+
| 237 | 245__a | Oxymoron, un trésor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheurs |
+-----+--------+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

comment:8 in reply to: ↑ 7 Changed 3 years ago by simko

Replying to grfavre:

mysql> set names utf8;

Just to make sure about this bit, you should not really need to play
with SET NAMES stuff if you use /opt/invenio/bin/dbexec -i to
connect to your database. You should see the accent to be OK right
upfront then. Is this so?

comment:9 Changed 3 years ago by grfavre

I didn't know about this -i option... I copy-pasted from a conventional mysql shell. Using this option, it seems to work though:

(infoscience-env)kis@kissrv49:~/infoscience2/trunk/migration$ dbexec -i
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2544
Server version: 5.1.49-3 (Debian)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from bib24x where id = 237;
+-----+--------+------------------------------------------------------------------------------------------------------------------+
| id  | tag    | value                                                                                                            |
+-----+--------+------------------------------------------------------------------------------------------------------------------+
| 237 | 245__a | Oxymoron, un trésor de fiches de lecture et un atelier de mutualisation des savoirs entre apprenants/chercheurs  |
+-----+--------+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

comment:10 Changed 3 years ago by simko

So everything seems OK from the DB side then, if the tables are
with default charset utf8 and stuff. You can triple check via
SHOW CREATE TABLE bib24x.

I can as well upload your test file alright. Have you tried
verbose bibupload (bibupload -v9) to see what it would print?

Or maybe you have different pyRXP version? I compiled mine long
time ago, 1.13-2.20091117, see Installation/InvenioOnDebian.
When did you compile yours? Just some wild guesses.

comment:11 Changed 3 years ago by grfavre

The table seems correct also:

CREATE TABLE `bib24x` (
  `id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
  `tag` varchar(6) NOT NULL DEFAULT '',
  `value` text NOT NULL,
  PRIMARY KEY (`id`),
  KEY `kt` (`tag`),
  KEY `kv` (`value`(35))
) ENGINE=MyISAM AUTO_INCREMENT=96473 DEFAULT CHARSET=utf8

I reran it. The verbose mode doesn't add much more info when it crashes. The first record worked fine (no accented characters). On the second record, it generated an exception after stage 3 (see attached task.log file).

Changed 3 years ago by grfavre

bibtask_xxx.log file

Changed 3 years ago by grfavre

bibtask_xxx.err file

Changed 3 years ago by grfavre

invenio.err file

comment:12 Changed 3 years ago by grfavre

  • Component changed from BibUpload to WebSubmit
  • Keywords bibdocfile added
  • Priority changed from critical to major
  • Type changed from defect to enhancement

I finally found out the solution.
For some reason, bibupload checks stuff using bibdocfile: it adds comments and descriptions to the MARC using the get_description() and get_comment() functions. These functions retrieve content pickled in a blob in the database (this is really bad design, sorry guys; a database is in no way meant to contain language-specific stuff).

As no escaping is done on the content initially passed to set_comment or set_description, it will then crash when building the MARC if this content was a unicode object rather than an encoded string.

The solution I used was to re-encode all descriptions:

from invenio.dbquery import run_sql
from invenio.bibdocfile import BibRecDocs

recids = run_sql("select id_bibrec from bibrec_bibdoc")

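def stringize(str_like, default='n/a'):
    """Return str_like as a UTF-8 encoded byte string (Python 2)."""
    if isinstance(str_like, str):
        # Already a byte string: keep as is.
        return str_like
    elif isinstance(str_like, unicode):
        return str_like.encode('utf-8')
    elif str_like is None:
        return default
    else:
        raise ValueError("unexpected type: %r" % type(str_like))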


for (recid,) in recids:
    archive = BibRecDocs(recid)
    for bibdoc in archive.bibdocs:
        for bfile in bibdoc.list_all_files():
            description = stringize(bfile.get_description())
            bibdoc.set_description(description, bfile.get_format(), bfile.get_version())

The simplest solution would be to check string-like objects before storing them in the database. One should modify bibdocfile => BibDocMoreInfo and make it escape content before storing it.
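A rough sketch of what such a check could look like, written in Python 3 for illustration. Everything here is an assumption: to_utf8_bytes and MoreInfoSketch are hypothetical names, and the class only mimics the idea of a pickled key/value blob; it is not the real BibDocMoreInfo API. The argument order of set_description follows the call in the script above.

```python
import pickle


def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes, or None if value is None."""
    if value is None:
        return None
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError early if the bytes are not valid UTF-8.
        value.decode("utf-8")
        return value
    raise TypeError("expected text, bytes or None, got %s" % type(value).__name__)


class MoreInfoSketch(object):
    """Hypothetical stand-in for a blob holding per-file descriptions."""

    def __init__(self):
        self.descriptions = {}

    def set_description(self, description, docformat, version):
        # Normalise at the boundary, so the blob only ever holds bytes.
        self.descriptions[(version, docformat)] = to_utf8_bytes(description)

    def get_description(self, docformat, version):
        return self.descriptions.get((version, docformat))

    def serialize(self):
        return pickle.dumps(self.descriptions)


info = MoreInfoSketch()
info.set_description("tr\u00e9sor", ".pdf", 1)        # text object from a caller
info.set_description(b"tr\xc3\xa9sor", ".pdf", 2)     # already encoded bytes
restored = pickle.loads(info.serialize())
```

With the check in the setter, both callers end up storing the same byte string, and a value of the wrong type fails at write time instead of much later when the MARC is built.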

This problem would never have happened if values were stored in a SQL field (which is already encoded by the database). The best possible solution would be to store such content directly in the tables. Admittedly, this effort would cost somewhat more in development time (modifications of the API, tests and migration kits)...

comment:13 Changed 3 years ago by skaplun

  • Owner set to skaplun
  • Status changed from new to assigned

Hi Greg,

indeed the BibDocMoreInfo blob is not a pearl :-) But it was really meant to be treated as such: a blob into which to throw any future thing that might appear, without having to change anything besides the class.

On the other hand it should be protected as you suggested.

What I am curious about, though, is through which workflow unicode instead of str ended up there.

Cheers!

Sam

comment:14 Changed 3 years ago by grfavre

Hi Sam,
thanks for your reply!

This is a typical EPFL workflow: we do not use websubmit.
When creating a record, we instantiate BibRecDocs objects and use them to store files (I don't want to hack directly in the database; it's much more reasonable to rely on an API maintained by others!).
We then upload the correct MARC file generated by our tools.

As your API doesn't say much about encodings, I simply passed the object I got at that time. As Django uses unicode objects, your API received a unicode object rather than an encoded string...

Cheers,
Greg

comment:15 Changed 3 years ago by skaplun

I see.

I will fix this one, but meanwhile, if you wish to try higher-level interfaces, you can use the CLI version of BibDocFile (preferably the latest version from Git) or use FFT tags in the MARC:

<http://cdsweb.cern.ch/help/admin/bibupload-admin-guide#3.6>

Cheers,

Sam

comment:16 Changed 2 years ago by skaplun

  • Keywords unicode added
  • Milestone v1.0 deleted

Finally, I am leaving this ticket open (as a reminder) but removing the 1.0 tag, as in general we are thinking of addressing the unicode vs. str issue more thoroughly in Invenio (e.g. fully migrating to unicode everywhere, as in Flask or Jinja).
