How to extract Chinese lyrics from MIDI files

Walter

Posts: 7

Active Member

Topic starter

Hello, I can extract all characters from Chinese MIDI files, but the Unicode characters don't show the correct Chinese characters.
The character 燮 (U+71EE) is present in the MIDI file as "87 AE", which gives the character 螮.
Can somebody tell me how to extract the Chinese characters correctly from the MIDI file? Thanks.
Best regards, Walter Schurter

Posted : 21/11/2018 7:41 am

Geoff

Posts: 1055

Noble Member

Hello,

I'm not really sure what you're trying to do, or what the problem is.

When you say 'extract' do you mean 'remove', or are you hoping to merely do something with this data?

Next, in your example you seem to be highlighting a specific midi event. However, you highlight the midi data for one event together with the timing data for the following event. Properly, the event you're interested in should include the two timing bytes that come before the data you highlight, and NOT the two bytes following. This may not affect what you're trying to do, but it MIGHT be important.

What software are you using to do what you're trying to do? How 'midi aware' is it? Under normal circumstances, midi data should be 7 bit, the 8th bit has special significance. Text or Lyric data has the length parameter, so this can include 8 bit data, but if it is unicode then the data should be pairs of bytes.

However, you seem to be saying that the data that is actually in the midi file now is wrong? Do you have any control over how the file was created, or are you in fact hoping to CORRECT (as in put right) the data that is there?

Have you determined any 'rule' as to how 'wrong' some of the data is? Are all the bytes wrong, or just some, or just the 8 bit ones? Are they wrong by a specific amount, i.e. xx less or more than they ought to be? Looking again, I note that all 9 bytes of the text data are 8 bit. If each unicode char needs 2 bytes, then what is the 'spare' byte doing? I don't know how the unicode data is stored, maybe the bytes stored in the file are not literal bytes but use some encoding, in which case unicode aware software may show the characters correctly - is that so?

Could you attach a complete midi file? There may be something near the beginning of the file that defines the character encoding? Might not be unicode.

Geoff

Posted : 21/11/2018 11:50 am

Pedro Lopez-Cabanillas

Posts: 154

Estimable Member

There are several encodings that have been historically used to process Chinese scripts on computers before the Unicode era. For instance:
https://en.wikipedia.org/wiki/Big5
https://en.wikipedia.org/wiki/GB_2312
https://en.wikipedia.org/wiki/GB_18030

MIDI files don't provide a standard way to specify the encoding used for text/lyrics events, so you probably need to guess which one has been used and translate the texts from the original encoding to Unicode in order to display it. There are software libraries to infer the encoding from some source text. See:
https://en.wikipedia.org/wiki/Charset_detection
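
As a rough illustration of the guessing approach, here is a minimal Python sketch that tries a few candidate encodings in order and keeps the first one that decodes without errors. The function name guess_decode and the candidate list are assumptions made for illustration; real charset detection libraries use statistical heuristics rather than this simple trial loop.

```python
def guess_decode(raw, candidates=("utf-8", "gb18030", "big5")):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    The candidate list is an illustrative assumption, not a standard.
    """
    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte to a character, so it never fails.
    return "latin-1", raw.decode("latin-1")


if __name__ == "__main__":
    print(guess_decode(bytes.fromhex("e5b48b")))
    print(guess_decode("中文".encode("gb2312")))
```

UTF-8 is tried first because its structure is strict, so text in a legacy encoding rarely passes as valid UTF-8 by accident. The legacy double-byte encodings accept many byte sequences, so a clean decode does not prove the guess is right, which is why a manual override remains useful.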

Attached are screenshots of the KMid2 Linux program rendering two Chinese language karaoke songs. The program uses a heuristic library to guess the source encoding, but also offers the user a drop-down selector for manually choosing another encoding in case the guess fails.

Regards,
Pedro

Posted : 21/11/2018 12:04 pm

Walter

Posts: 7

Active Member

Topic starter

Hello,

First, thanks very much for the fast answers from Geoff and Pedro!

I am making a MIDI player karaoke app which shows the lyrics (see attached image).
As the API which I use in Xcode does not deliver meta events, I extract the lyrics before I start playing the MIDI file.
Everything is working fine, except for the wrong characters, e.g. in Chinese MIDI files.

Here is some additional information:
Adding lyric text with the program "Logic" shows the same effect:
The Unicode bytes in the MIDI file do not correspond to the Chinese lyric characters, so some encoding must have been applied.

What is the specification of the "Bytes before unicode" (e.g. "e7")?
Where can I find a conversion table?

MIDI file downloaded from:
http://www.karaokeden.com/karaoke/lyrics/Chinese/11.kar

Original lyric    In MIDI file      Byte before Unicode
燭羲斕扂崋臘忔    뺲隕覂뒋릫뾔薕
71 ED             be b2             e7
7F B2             96 95             e6
65 95             89 82             e6
62 42             b4 8b             e5
5D 0B             b9 ab             e7
81 D8             bf 94             e5
5F D4             85 95             e8

Note: The MIDI file "11.kar" seems to be corrupt, as the End of Track event "FF 2F 00" at the end is incomplete (only "FF 2F" is present).
After adding the missing byte "00", the MIDI file plays correctly with the Apple API "MusicPlayer" in Xcode.
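
The repair described above can be sketched in Python as follows. This is only a minimal sketch under assumptions: the function name is hypothetical, it only looks at the last two bytes of the file, and it does not check or adjust the length field of the track chunk, which a more careful tool would also verify.

```python
def repair_truncated_end_of_track(data: bytes) -> bytes:
    """Append the missing length byte if the file ends with a bare 'FF 2F'.

    A complete End of Track meta event is FF 2F 00. The track chunk
    length field is left untouched here (an assumption of this sketch).
    """
    if data.endswith(b"\xff\x2f"):
        return data + b"\x00"
    return data


if __name__ == "__main__":
    truncated = bytes.fromhex("00 ff 2f")
    print(repair_truncated_end_of_track(truncated).hex(" "))
```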

Thanks for helping and best regards
Walter

Posted : 22/11/2018 2:08 am

Walter

Posts: 7

Active Member

Topic starter

Sorry, some text is not displayed as I wrote it (see attached image).

Posted : 22/11/2018 2:15 am

Geoff

Posts: 1055

Noble Member

Hello,

I've downloaded the 11.kar file, which clearly is a normal midi (SMF) file apart from the .KAR filename.

As Pedro indicates, there seem to be a number of implementations (character sets) for Chinese, and I assume they are different. I've seen reference to such sets, and they were indicated using a @xxx label, and I wondered if there was such a label in your file. When I check the 11 file, I see at the start of Track 2 @LCHINESE, which I'd guess is the indicator of which character sequence to use. But exactly what @LCHINESE is, I don't know. I've tried a quick check using Google, and nothing relevant comes up.

May need to dig further.

First, you need to check the website you gave as the download source to see if it provides any info.

Secondly, there's an email addr within the midi file, info@vietbel.org - you might try to communicate there to see if you can get info?

Keep trying different options within Google search. I've seen something @xxx, but cannot find it again just yet.

I'll try playing the midi file. I see the settings for instruments for Channels 1 to 10, and the patch numbers, but no text to say what they OUGHT to be, so I'll hope they sound OK with GM. I might not be able to tell if it's wrong? Channel 1 is set to Flute, Channel 2 to Elec Piano, 3 to Strings, etc

Geoff

Posted : 22/11/2018 10:07 am

Geoff

Posts: 1055

Noble Member

I've loaded the file into SynthFont, and it plays OK. Sounds perfectly fine, with the GM instruments, EXCEPT that it's not remotely Chinese! Should it be? Maybe it's just a piece of western music with Chinese lyrics??

Oh, the file did NOT play OK initially, as the music data is all in one track, and SynthFont did not seem to like this. Once the data was split to one channel per track, then it was fine. Maybe SynthFont is at fault, I'll need to try other players. However, SynthFont was not troubled by the FF 2F on the end and coped with this as End of Track and End of File.

Are any of the other files on this site actually Chinese sounding??

Geoff

Posted : 22/11/2018 10:40 am

Walter

Posts: 7

Active Member

Topic starter

Hello Geoff

Thanks for your answer.
The only important thing about the MIDI files on the site where I found "11.kar" is that they all have Chinese lyrics.
The music is not of interest in this case.
I must see if I can find lookup tables for translating the characters from the MIDI file to the real Chinese symbols.

Best regards
Walter

Posted : 22/11/2018 11:12 am

Pedro Lopez-Cabanillas

Posts: 154

Estimable Member

In this case, the lyrics of 11.kar are encoded as UTF-8 (a Unicode encoding). For CJK characters in the Basic Multilingual Plane, which covers the common ones, this encoding requires 3 bytes (not very efficient for Chinese text). For instance, the character 崋 with the Unicode code point U+5D0B is encoded in UTF-8 as e5 b4 8b (in hex).
Reference: https://en.wikipedia.org/wiki/UTF-8
Table of some characters: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=23819&number=1024
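
The bit arithmetic behind this example can be sketched in a few lines of Python. The helper name decode_3byte_utf8 is hypothetical. The E7 lead byte in the second case is an assumption: the first post only quotes the bytes 87 AE, but E7 87 AE is the UTF-8 encoding of U+71EE.

```python
def decode_3byte_utf8(b0, b1, b2):
    """Combine a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_3byte_utf8(0xE5, 0xB4, 0x8B)
print(f"U+{code_point:04X}", chr(code_point))

# The built-in decoder gives the same character.
assert bytes([0xE5, 0xB4, 0x8B]).decode("utf-8") == chr(code_point)

# Reading only the two trailing bytes as one 16-bit unit yields a different
# character, which matches the symptom described in the first post.
raw = bytes([0xE7, 0x87, 0xAE])  # E7 is assumed, see the note above
print(raw.decode("utf-8"), raw[1:].decode("utf-16-be"))
```

In practice there is no need to do this by hand: decoding the whole lyric payload with a UTF-8 decoder handles sequences of every length.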

Here is the output from KMid. It didn't guess the encoding properly, so I had to select UTF-8 by hand.

Regards,
Pedro

Posted : 22/11/2018 11:46 am

Walter

Posts: 7

Active Member

Topic starter

Hello Pedro

I am so happy, it is working!!!!!!!

You are a great person!

Thanks, thanks, thanks, thanks, thanks very much, you spared me a lot of development time.
If you are interested, I can send you a promotion code for my new app called "MIDIplayerS" as soon as it is in the Apple App Store.

Thanks again and best Regards
Walter

Posted : 22/11/2018 12:04 pm

Geoff

Posts: 1055

Noble Member

Aha, that explains the odd number of chars that I mentioned above. They come in groups of 3 rather than pairs as originally indicated.

Pity about the music. Yes, the music in 11.kar is NOT very interesting. A lot of actual Chinese music is. Years ago, our TV here in the UK was showing a documentary series about China, 'Beyond the Clouds' or something like that, and the very Chinese-sounding music was great. I bought the CD of the music, but it turned out some of the music was better than other bits.

Geoff

Posted : 22/11/2018 12:11 pm

Pedro Lopez-Cabanillas

Posts: 154

Estimable Member

Hi Walter,

Glad to help! Thanks for the offer. I don't have an iPad/iPhone, but if you publish a version for macOS, then yes! I will be glad to test it.

On the other hand, if you or anybody is interested in KMid, it is free software available only for the Linux distributions Fedora and Rosa. The source code in C++ is also available for download here: https://sourceforge.net/projects/kmid2/files/2.4.0/

Regards,
Pedro

Posted : 22/11/2018 12:21 pm

Walter

Posts: 7

Active Member

Topic starter

Hello Pedro

Now I am facing a new problem.

Do you know what the first byte (e.g. E6 or C3) means?
For a Vietnamese MIDI file I get, for example, for "ò":
U+00F2 : LATIN SMALL LETTER O WITH GRAVE
In the MIDI file I see: FF 01 02 C3 B2

For a Chinese MIDI file I get, for example, for "燭":
U+71ED : CJK UNIFIED IDEOGRAPH-71ED
In the MIDI file I see: FF 01 03 E7 BE B2
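
Both events quoted above follow the standard meta event layout: FF, a type byte (01 is a text event), a length stored as a variable-length quantity, then that many data bytes. Here is a minimal Python sketch of reading one such event, assuming the delta time in front of it has already been consumed and that the text is UTF-8. The function names are illustrative only.

```python
def read_vlq(data, pos):
    """Read a MIDI variable-length quantity; return (value, next_position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)  # 7 payload bits per byte
        if not byte & 0x80:                   # high bit clear marks the last byte
            return value, pos


def parse_meta_event(data, pos=0):
    """Parse 'FF type length data' at pos; return (type, payload, next_position)."""
    if data[pos] != 0xFF:
        raise ValueError("not a meta event")
    meta_type = data[pos + 1]
    length, pos = read_vlq(data, pos + 2)
    payload = data[pos:pos + length]
    return meta_type, payload, pos + length


if __name__ == "__main__":
    meta_type, payload, _ = parse_meta_event(bytes.fromhex("ff 01 02 c3 b2"))
    print(meta_type, payload.decode("utf-8"))
```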

Is there a rule that says C0 to CF take 2 bytes, E0 to EF take 3 bytes, etc.?

Thanks and best Regards
Walter Schurter

Posted : 23/11/2018 2:58 am

Pedro Lopez-Cabanillas

Posts: 154

Estimable Member

Is there a rule that says C0 to CF take 2 bytes, E0 to EF take 3 bytes, etc.?

It is explained here: https://en.wikipedia.org/wiki/UTF-8#Description

UTF-8 uses a variable number of bytes to encode Unicode characters.

1 byte for code points U+0000 to U+007F
2 bytes for code points U+0080 to U+07FF
3 bytes for code points U+0800 to U+FFFF
4 bytes for code points U+10000 to U+10FFFF
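
Seen from the byte side, the first byte of each sequence tells you how many bytes belong to the character: 00 to 7F is a single byte, C2 to DF starts a 2-byte sequence, E0 to EF a 3-byte sequence, and F0 to F4 a 4-byte sequence. Bytes 80 to BF are continuation bytes, and C0, C1 and F5 to FF never appear in valid UTF-8. A small Python sketch of that rule (the function name is hypothetical):

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if lead_byte <= 0x7F:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    # 80-BF are continuation bytes; C0, C1 and F5-FF are never valid.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    for lead in (0x41, 0xC3, 0xE5, 0xF0):
        print(f"{lead:02X} -> {utf8_sequence_length(lead)} byte(s)")
```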

But you may find MIDI karaoke songs where the encoding is not UTF-8 or even Unicode at all.

Regards,
Pedro

Posted : 23/11/2018 4:26 am

Walter

Posts: 7

Active Member

Topic starter

Hello Pedro

Thanks again for your excellent help.

I think I must handle the labels "@LENGLISH", "@LVietnamese" etc. from the karaoke MIDI files to display the correct 1-byte characters from 00 to FF.
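
One way this could look is sketched below in Python. Everything in it is an assumption for illustration: the helper names are hypothetical, and the table mapping a language label to candidate encodings is a guess, since the label names a language and not an encoding (11.kar carries @LCHINESE and turned out to be UTF-8).

```python
# Hypothetical mapping from the language label to encodings worth trying.
# The "@L" label does not define an encoding, so these are only hints.
CANDIDATES_BY_LANGUAGE = {
    "CHINESE": ("utf-8", "gb18030", "big5"),
    "VIETNAMESE": ("utf-8", "cp1258"),
    "ENGLISH": ("utf-8", "cp1252"),
}


def language_tag(text_event: bytes):
    """Return the upper-cased language from an '@L...' text event, else None."""
    if text_event.startswith(b"@L"):
        return text_event[2:].decode("ascii", errors="replace").strip().upper()
    return None


def candidates_for(tag):
    """Return the encodings to try for a tag, defaulting to UTF-8 only."""
    return CANDIDATES_BY_LANGUAGE.get(tag, ("utf-8",))


if __name__ == "__main__":
    tag = language_tag(b"@LVietnamese")
    print(tag, candidates_for(tag))
```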

Best Regards
Walter Schurter

Posted : 23/11/2018 8:51 am
