  1. Puppet
  2. PUP-5728

Change default parsing of manifest files to UTF-8

Details

    • Bug
    • Status: Closed
    • Normal
    • Resolution: Fixed
    • None
    • PUP 4.4.0
    • None
    • Bug Fix
      As part of Puppet 4, Puppet declared UTF-8 to be the only valid encoding for manifest files. Behavior was correct on non-Windows systems, but because Puppet did not explicitly specify the encoding when reading manifests, behavior on Windows was incorrect. On Windows, Ruby loaded files from disk and treated their contents as being in the current local codepage (often IBM437 or 1252). Depending on whether the bytes of the Unicode characters in the manifest could be represented in the current codepage, this led either to a crash while attempting to treat the bytes as characters in that codepage or, more typically, to corruption of the intended strings when they were loaded. This corruption resulted in resources being created that did not match what was specified in the manifest. In addition to manifest files, similar changes were made to the loading of resource templates, Ruby-based Puppet module code (such as custom functions), puppet apply, puppet lookup, and the epp and parser faces.
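
The two failure modes described in this note can be reproduced in plain Ruby on any platform. The sketch below is illustrative only and is not Puppet code; the helper name `reinterpret` is hypothetical. It tags UTF-8 bytes with a codepage encoding, which is roughly what happens when no encoding is specified on read, and then converts them to UTF-8.

```ruby
# Illustrative sketch, not Puppet code. `reinterpret` is a hypothetical helper.
# It treats the UTF-8 bytes of `text` as if they were written in `codepage`,
# then converts the result to UTF-8, mimicking a read with the wrong encoding.
def reinterpret(text, codepage)
  text.dup.force_encoding(codepage).encode('UTF-8')
end

# Corruption: both UTF-8 bytes of "ä" (C3 A4) map to characters in IBM437,
# so the conversion succeeds but yields the wrong characters.
corrupted = reinterpret("Umlaut\u00E4", 'IBM437')

# Crash: the second UTF-8 byte of "Í" (C3 8D) has no mapping in Windows-1252,
# so the conversion raises instead of returning a string.
crashed = begin
  reinterpret("\u00CD", 'Windows-1252')
  false
rescue Encoding::UndefinedConversionError
  true
end
```

Which outcome occurs depends only on whether every byte happens to be defined in the active codepage, which is why the corruption case tends to go unnoticed.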

    Description

      As part of PUP-2564, we changed our language to state that manifest files must be in UTF-8. However, this is not enforced when we actually lex files with the parser, and it causes a host of issues on Windows in particular.

      https://docs.puppetlabs.com/puppet/latest/reference/lang_summary.html#files

      As Josh alluded to, Windows will read the file then treat it as whatever the current codepage is, and this carries a number of issues.

      For instance, take the manifest:

      user { 'Umlautä':
        ensure => present,
        password => 'password'
      }
      

      When running with codepage 437, the string produced by Puppet::FileSystem.read is unsuitable for subsequent use: the UTF-8 bytes for ä (\xC3\xA4) are preserved, but the string is treated as codepage 437 text rather than UTF-8:

      "user { 'Umlaut\xC3\xA4':\n  ensure => present,\n  password => 'password'\n}\n"
      

      So while the user is created, its name is Umlaut├ñ instead of Umlautä, because each byte is interpreted as a separate codepage 437 character when the string is converted. For example, calling .encode('utf-8') on the above string produces:

      "user { 'Umlaut\u251C\u00F1':\n  ensure => present,\n  password => 'password'\n}\n"
      

      If the local codepage is instead set to 65001 (UTF-8), the behavior is correct. However, we should not expect users to do this; we should treat the incoming file as UTF-8, because that is what our documentation specifies and what we claim to expect.
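
The effect of the codepage can be approximated on any platform by passing an explicit encoding to `File.read`. This is a minimal sketch under that assumption, not a trace of what Puppet::FileSystem.read does internally; the file name `site.pp` and the shortened manifest are made up for illustration.

```ruby
require 'tmpdir'

# A shortened version of the manifest from this report, held as UTF-8.
manifest = "user { 'Umlaut\u00E4':\n  ensure => present,\n}\n"

as_437, as_utf8 = Dir.mktmpdir do |dir|
  path = File.join(dir, 'site.pp') # hypothetical file name
  File.binwrite(path, manifest)    # write the raw UTF-8 bytes

  # Same bytes on disk, read under two different assumed encodings:
  # IBM437 stands in for codepage 437, UTF-8 for codepage 65001.
  [File.read(path, encoding: 'IBM437'), File.read(path, encoding: 'UTF-8')]
end

# The bytes are identical; only the encoding attached to them differs.
same_bytes = as_437.bytes == as_utf8.bytes

# Converting the IBM437-tagged string to UTF-8 produces the corrupted name.
converted = as_437.encode('UTF-8')
```

Reading does not alter the bytes. The damage appears only once the mislabeled string is converted or compared with correctly labeled text.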

      The prior PUP-2564 ticket mentioned the inability to specify an encoding during file reading at https://tickets.puppetlabs.com/browse/PUP-2564?focusedCommentId=125526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-125526. I think that problem can be easily fixed, at least in this narrow use case to start.

      I spiked a very quick solution to this problem for the sake of discussion. With the change, the parsed string now correctly represents ä as \u00E4:

      "user { 'Umlaut\u00E4':\n  ensure => present,\n  password => 'password'\n}\n"
      

      I know that it fixes the problem with the manifest above (i.e. you can run any local codepage and the correct user name is used during creation, based on the UTF-8 file contents). However, there could be some additional fallout, and there are certainly other places where files are read that could take a similar tack.
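
One way such a change could look is sketched below. This is an assumption for discussion, not the actual spike: `read_manifest` is a hypothetical helper, and overriding `Encoding.default_external` merely simulates a Windows session running codepage 437.

```ruby
require 'tmpdir'

# Hypothetical helper, not the actual Puppet change: always decode as UTF-8,
# regardless of Encoding.default_external, and reject invalid byte sequences.
def read_manifest(path)
  content = File.read(path, encoding: 'UTF-8')
  raise ArgumentError, "#{path} is not valid UTF-8" unless content.valid_encoding?
  content
end

original = Encoding.default_external
implicit = explicit = nil

begin
  # Simulate a Windows session whose local codepage is 437.
  Encoding.default_external = Encoding::IBM437

  Dir.mktmpdir do |dir|
    path = File.join(dir, 'site.pp') # hypothetical file name
    File.binwrite(path, "user { 'Umlaut\u00E4': ensure => present }\n")

    implicit = File.read(path)     # picks up the simulated codepage
    explicit = read_manifest(path) # unaffected by the codepage
  end
ensure
  # Restore the process-wide default so later code is not affected.
  Encoding.default_external = original
end
```

Specifying the encoding at the read site keeps the result independent of the user's environment, which matches the documented requirement that manifests are UTF-8.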

            People

              Unassigned Unassigned
              ethan Ethan Brown
              Ryan Gard Ryan Gard