Δfلה - Special characters are underestimated sources of error during migrations

Sometimes the devil is really in the details Δfلה – special characters are underestimated sources of error during migrations

A seemingly banal problem in migrations between different network protocols can have major consequences: special characters.

Even if they usually do not turn into hieroglyphs right away, file names with special characters sometimes become unreadable during a migration.

Data migrations are an underestimated mandatory task in many companies. At best, management sees a technical problem that takes a great deal of time, and hardly recognizes the experience, expertise and special solutions it requires. The complexity of the systems increases the need to plan every single step precisely and to think through its consequences for data integrity and the flow of processes. Small things have a big effect here: even today, migrations can fail because of banal special characters in a file name. The reason is that character sets have evolved from the mainframe to the current generation of computers, so that every legacy special character has to be translated anew.

On the timeline of character sets

Mainframes only knew the 128 characters of 7-bit ASCII. Even the umlauts of some European languages based on the Roman alphabet only arrived with ANSI / 437 (DOS / old Unix systems). ISO 8859-X (Windows 3.1 to Windows 98 / Windows NT 3.5) comprised fifteen different character sets whose first half, the conventional letters and numbers, was identical in all variants. The second half, however, provided special characters for different language groups, for example Greek, Thai or Cyrillic. Only with UTF-8 / Unicode (Windows NT 4.0 and higher / Unix / NAS), which represents each character with one to four bytes, can computers use all of these characters without detours. Since then, over a million code points have been available.
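The variable length of UTF-8 can be made visible with a few lines of Python. This is only a minimal sketch; the sample characters are chosen for illustration and do not come from a real data set.

```python
# Sample characters (illustrative choice): ASCII letter, umlaut, euro sign, emoji.
samples = ["A", "ä", "€", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, length in utf8_lengths.items():
    # Print only the code point and the byte count to keep the output ASCII-safe.
    print(f"U+{ord(ch):04X}: {length} byte(s)")
```

A plain ASCII letter still needs only one byte, which is why pure ASCII file names survive most migrations unchanged.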

A lot of data was generated in the early days of networking and migrated over time to new generations of file servers. Much has therefore happened on the way to today's UTF-8, and this can now lead to problems in further migrations. Old DOS or Windows versions use ANSI or ISO 8859-X. This is usually not a problem, since the umlaut setting is negotiated when the network connection is established. Today these versions are found mainly on control computers for production robots, which hardly encounter the umlaut problem anyway. Modern versions of Windows also cause fewer problems, because they use Unicode for every application.

The changing representation of file names can cause problems

However, file names with umlauts cause problems over time, because the same byte value can represent different letters on computers with different character sets. Suppose a computer configured for ISO 8859-1, i.e. for Western Latin characters, wrote an “ä” (hex E4) to the file server. If this file was later read by a computer under ISO 8859-7, i.e. with Greek characters, the same hex value E4 was interpreted and displayed as “δ”. If this file is then written to a new destination by a computer with ISO 8859-7 and conversion to UTF-8, the original “ä” disappears completely: the name is stored with “δ”, and if the byte is copied without conversion, a lone E4 is not a valid UTF-8 sequence at all. Thus, different character sets can reproduce given file names incorrectly. In the worst case, these file names are unreadable for other computers. The problem is bigger than one might think and affects around 400 special characters from different language groups.
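The fate of the byte E4 can be reproduced in Python with the standard codecs. This is a minimal sketch of the effect described above, not a model of any particular file server; the variable names are chosen for illustration.

```python
# The single byte that the ISO 8859-1 client stored for the umlaut.
raw = bytes([0xE4])

as_latin1 = raw.decode("iso8859_1")  # Western European reading: "ä"
as_greek = raw.decode("iso8859_7")   # Greek reading: "δ"

# A Greek-configured host that converts to UTF-8 stores the delta, not the umlaut.
greek_as_utf8 = as_greek.encode("utf-8")

# Copied without conversion, the lone byte is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(greek_as_utf8.hex(" "), is_valid_utf8)
```

The byte on disk never changed; only the character set assumed by the reader did.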

When migrating, there is no way around taking this undesirable “translation activity” into account. An automatic conversion from NFSv3 to NFSv4, which always works in UTF-8, is hardly possible. In any case, the migration experts face the task of analyzing the given data sets accurately. In addition, the files have to be copied host-based, i.e. each file individually. This is not only time-consuming, but also significantly extends the offline time during the migration.

Unix clients can use different protocols

In addition, there is another problem related to the various protocols on Unix clients. Windows has worked exclusively with Unicode for more than ten years and transfers file names in the SMB protocol as Unicode. Existing file names in other encodings can therefore not be transferred as they are and must be converted to Unicode beforehand. Under NFSv4, the configuration of the client no longer plays a role: NFSv4 forces the client to transfer the file name in UTF-8. This is how client and file server understand each other, and not much can go wrong anymore.

Under NFSv3, on the other hand, conditions are still completely disorderly. Here, users of a Unix client can work with different code pages, depending on the terminal. They may therefore write file names in different encodings to the same file server. As a result, two different files with seemingly identical file names can appear in the same directory: both written by the same client, but from different terminals with different code pages.
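Why two such names can coexist becomes clear when the names are viewed as bytes. The following sketch assumes that the server stores NFSv3 names as uninterpreted byte strings; the file name and the set that stands in for a directory are purely illustrative.

```python
# Hypothetical file name typed on two terminals of the same client.
name = "März.txt"

from_latin1_terminal = name.encode("iso8859_1")  # terminal set to ISO 8859-1
from_utf8_terminal = name.encode("utf-8")        # terminal set to UTF-8

# Model the directory as a set of byte strings, as the server would see the names.
directory = {from_latin1_terminal, from_utf8_terminal}

print(from_latin1_terminal.hex(" "))
print(from_utf8_terminal.hex(" "))
print(len(directory), "distinct entries")
```

Both entries display as the same name on a terminal with the matching code page, yet they are two separate files for the server.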

Protocol differences under NFSv3

Invalid or misinterpreted UTF-8 sequences can also occur on a file server that offers multiprotocol access. An example: a Unix client with NFSv3 is configured for UTF-8. The file name “Report_März.txt” is misinterpreted when it is written to a NAS that expects ISO 8859-1 encoding and converts it to UTF-8. Despite this misconfiguration, any other Unix client with NFSv3 would still read “Report_März.txt”. A client using NFSv4, however, reads “Report_MÃ¤rz.txt”. The file is not corrupt and can still be read. But after a migration to the new server, the file “Report_März.txt” can hardly be found via the search function of a Windows computer, because it no longer exists under this correct file name.
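The double conversion in this example can be traced step by step. The sketch below only models the encoding steps with Python's codecs, on the assumption that the NAS converts NFSv3 names between ISO 8859-1 and UTF-8 in both directions; it does not talk to a real server.

```python
original = "Report_März.txt"

# 1. The NFSv3 client is configured for UTF-8 and sends these bytes.
wire_bytes = original.encode("utf-8")

# 2. The NAS assumes ISO 8859-1 and converts what it sees to UTF-8 for storage.
stored = wire_bytes.decode("iso8859_1").encode("utf-8")

# 3. An NFSv4 client receives the stored UTF-8 name unchanged.
nfsv4_view = stored.decode("utf-8")  # "Report_MÃ¤rz.txt"

# 4. For NFSv3 the NAS converts back to ISO 8859-1; the client reads it as UTF-8.
nfsv3_view = stored.decode("utf-8").encode("iso8859_1").decode("utf-8")

print(stored.hex(" "))
```

For NFSv3 clients the two wrong conversions cancel each other out, which is why the misconfiguration often goes unnoticed until another protocol or a migration comes into play.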

The attempt to repair such file names quickly turns into an unpleasant guessing game, because it is usually no longer possible to reconstruct when and why a file name was misinterpreted during conversion. As part of a migration, files can at least be assigned to certain business areas and to the protocols used there. For German locations, it can usually be assumed that the ISO 8859-1 character set was used. It is also possible to check whether conversion was enabled on the file server. In this way, the behavior can at least be understood and corresponding misconfigurations can be tracked down.
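The guessing game can at least be narrowed down by listing how the same bytes read under each plausible character set. The helper `candidate_readings` and its list of encodings are assumptions made for this sketch; in practice the list would follow from the locations and protocols involved.

```python
from typing import Dict, Sequence


def candidate_readings(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8", "iso8859_1", "iso8859_7", "cp437"),
) -> Dict[str, str]:
    """Return every reading of a raw file name that decodes without error."""
    readings = {}
    for encoding in encodings:
        try:
            readings[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # This character set cannot have produced the name; skip it.
            continue
    return readings


# Hypothetical name as found on the source server.
readings = candidate_readings(b"M\xe4rz.txt")
print(sorted(readings))
```

For a German location, the ISO 8859-1 reading would be the first candidate to check; the UTF-8 reading is missing from the result because the bytes are not valid UTF-8.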

If the settings on the file server, and thus the potential errors in converting file names, are taken into account when planning a migration, the affected names can be revealed with a verification algorithm. However, the file names identified as faulty must then be renamed manually on the source. In principle, those responsible should clean up umlaut problems at the source before the migration. This enables a clean migration without errors in the log files.
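One possible shape for such a verification step is sketched below. It is not the algorithm of any specific migration tool: the function names, the verdict labels and the assumption that ISO 8859-1 is the legacy character set are all illustrative. The check only reports suspicious names and leaves the renaming to a person.

```python
import os
from typing import List, Tuple


def classify_name(raw: bytes, legacy: str = "iso8859_1") -> str:
    """Classify one raw file name as ok, invalid UTF-8, or suspected double encoding."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    try:
        # If the text survives a round trip through the legacy set and comes out
        # as different valid UTF-8, it was probably converted twice.
        repaired = text.encode(legacy).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "ok"
    return "ok" if repaired == text else "suspect-double-encoding"


def scan_tree(root: str) -> List[Tuple[bytes, str]]:
    """Walk a directory tree using byte paths and collect all names that are not ok."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for raw in dirnames + filenames:
            verdict = classify_name(raw)
            if verdict != "ok":
                findings.append((os.path.join(dirpath, raw), verdict))
    return findings
```

Calling `scan_tree` on the mounted source export before the migration yields a list of paths to review. The heuristic for double encoding can produce false positives, so its findings are candidates rather than proven errors.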

Conclusion

The migration of unstructured data is in itself a complex task with many pitfalls that requires professional analysis, planning and implementation. Data sets on a NAS have mostly grown historically and have been written or converted using different protocols. Special characters, such as the umlauts common in the German-speaking world, often cause problems here. Without knowledge and experience of which errors can occur where, many IT teams quickly reach their limits. Anyone planning extensive migration projects should seek advice from data and migration experts in advance. These specialists build on decades of expertise and can identify difficulties even before the actual migration.