File Coverage

blib/lib/Sys/Binmode.pm
Criterion Covered Total %
statement 9 9 100.0
branch n/a
condition n/a
subroutine 4 4 100.0
pod n/a
total 13 13 100.0


line stmt bran cond sub pod time code
1             package Sys::Binmode;
2              
3 13     13   1128477 use strict;
  13         145  
  13         386  
4 13     13   80 use warnings;
  13         28  
  13         2214  
5              
6             =encoding utf-8
7              
8             =head1 NAME
9              
10             Sys::Binmode - A fix for Perl’s system call character encoding
11              
12             =begin html
13              
14             Coverage Status
15              
16             =end html
17              
18             =head1 SYNOPSIS
19              
20             use Sys::Binmode;
21              
22             my $foo = "\xff";
23             $foo .= "\x{100}";
24             chop $foo;
25              
26             # Prints a single octet (0xFF) and a newline:
27             print $foo, $/;
28              
29             # In Perl 5.32 this may print the same single octet, or it may
30             # print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
31             # gives the single octet, just like print:
32             exec 'echo', $foo;
33              
34             =head1 DESCRIPTION
35              
36             tl;dr: Use this module in B<all> new code.
37              
38             =head1 BACKGROUND
39              
40             Ideally, a Perl application doesn’t need to know how the interpreter stores
41             a given string internally. Perl can thus store any Unicode code point while
42             still optimizing for size and speed when storing “bytes-compatible”
43             strings—i.e., strings whose code points all lie below 256. Perl’s
44             “optimized” string storage format is faster and less memory-hungry, but it
45             can only store code points 0-255. The “unoptimized” format, on the other
46             hand, can store any Unicode code point.
47              
48             Of course, Perl doesn’t I<always> optimize “bytes-compatible” strings;
49             Perl can also, if
50             it wants, store such strings “unoptimized” (i.e., in Perl’s internal
51             “loose UTF-8” format), too. For code points 0-127 (ASCII printables,
52             controls, and DEL) there’s actually no
53             difference between the two forms, but for 128-255 the formats differ. (cf.
54             L<perlunicode/The "Unicode Bug">) This means that anything that reads
55             Perl’s internals B<must> differentiate between the two forms in order to
56             use the string correctly.
57              
58             Alas, that differentiation doesn’t always happen. When it doesn’t, Perl
59             outputs code points 128-255 differently depending on whether the
60             containing string is “optimized” or not.
61              
62             Remember, though: Perl applications I<should> I<not> I<care> about
63             Perl’s string storage internals like optimized/unoptimized. (This is why,
64             for example, the L<bytes>
65             pragma is discouraged.) The catch, though, is that without that knowledge,
66             B<the> B<application> B<can’t> B<know> B<what> B<it> B<actually> B<says>
67             B<to> B<the> B<outside> B<world!>
68              
69             Thus, applications must either monitor Perl’s string-storage internals
70             or accept unpredictable behavior, both of which are categorically bad.
71              
72             (Perl’s documentation calls the “unoptimized” format “upgraded”, while
73             it calls the “optimized” format “downgraded”. The rest of this document
74             will favor Perl’s terms.)
75              
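A minimal sketch of the two storage forms described above. It uses only the always-available C<utf8::upgrade> and C<utf8::is_utf8> functions; the variable names are illustrative. It assumes that a plain "\xff" literal starts out downgraded, which is Perl's usual behavior:

```perl
use strict;
use warnings;

# Two copies of the same one-character string (code point 0xFF).
my $downgraded = "\xff";
my $upgraded   = "\xff";

# Force the second copy into Perl's upgraded (internal UTF-8) storage.
utf8::upgrade($upgraded);

# To the application the two strings are identical ...
my $same_string = ( $downgraded eq $upgraded ) ? 1 : 0;
my $same_length = ( length($downgraded) == length($upgraded) ) ? 1 : 0;

# ... but their internal storage flags differ.
my $down_flag = utf8::is_utf8($downgraded) ? 1 : 0;
my $up_flag   = utf8::is_utf8($upgraded)   ? 1 : 0;

print "same string: $same_string, same length: $same_length\n";
print "upgraded flags: $down_flag vs. $up_flag\n";
```

The comparison and length agree, yet only one copy carries the upgraded flag. Code that hands Perl's internal buffer to the outside world without checking that flag emits different bytes for the two copies.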
76             =head1 HOW THIS MODULE (PARTLY) FIXES THE PROBLEM
77              
78             This module provides predictable behavior for Perl’s built-in functions by
79             downgrading all strings before giving them to the operating system. It’s
80             equivalent to—but faster than!—prefixing your system calls with
81             C<utf8::downgrade()> (cf. L<utf8>) on all arguments.
82              
83             Predictable behavior is B<always> a good thing; ergo, you should
84             use this module in B<all> new code.
85              
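For comparison, here is a rough sketch of what downgrading by hand looks like without this module. The helper C<downgraded_copies> is hypothetical and is not part of Sys::Binmode; it relies only on the core C<utf8::downgrade> function, which dies when a string contains a code point above 255:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sys::Binmode): downgrade copies of the
# arguments by hand, as one would have to do without the module.
sub downgraded_copies {
    my @copies = @_;
    for my $arg (@copies) {
        # Dies if $arg contains a code point above 255.
        utf8::downgrade($arg);
    }
    return @copies;
}

my $foo = "\xff";
utf8::upgrade($foo);    # simulate a string Perl happened to store upgraded

my ($clean) = downgraded_copies($foo);
my $clean_is_upgraded = utf8::is_utf8($clean) ? 1 : 0;

# A string with a wide character cannot be downgraded.
my $wide_ok = eval { downgraded_copies("\x{100}"); 1 } ? 1 : 0;

# What a manual call would then look like:
# system 'echo', downgraded_copies($foo);
```

Repeating this before every system call is easy to forget, which is the tedium the module is meant to remove.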
86             =head1 CAVEAT: CHARACTER ENCODING
87              
88             If you apply this module injudiciously to existing code you may see
89             exceptions or character corruption where previously things worked fine.
90              
91             This can
92             happen if you’ve neglected to encode one or more strings before
93             sending them to the OS. Without Sys::Binmode, Perl sends upgraded
94             strings to the OS in UTF-8 encoding. In essence, it’s an implicit
95             UTF-8 auto-encode, which is kind of nice, except that it depends on
96             Perl’s internals, which are unpredictable. Sys::Binmode removes
97             that implicit UTF-8 auto-encode, which of course will break things
98             that need it.
99              
100             The fix is to apply an explicit UTF-8 encode prior to the system call
101             that throws the error. This is what we should do I<anyway>;
102             Sys::Binmode just enforces that better.
103              
104             =head2 Example: The L<utf8> Pragma
105              
106             The widely-used L<utf8> pragma particularly exemplifies this problem.
107              
108             If you have code like this:
109              
110             use utf8;
111              
112             mkdir "épée";
113              
114             … then adding this module will change your program’s behavior in ways you’ll
115             probably dislike.
116              
117             Consider the string C<épée>. Without the C<utf8> pragma (but assuming that
118             the code I<is> actually written in UTF-8) this is 6
119             characters because the two C<é>s are 2 bytes each (so 2 + 1 + 2 + 1),
120             and without the C<utf8> pragma each byte in a string constant becomes its own
121             character, even if multiple bytes make up a single UTF-8 character. Since
122             nothing upgrades that string on its way to
123             C<mkdir>, the OS will receive the intended 6 bytes and create a directory
124             with a UTF-8-encoded name.
125              
126             I<With> C<use utf8>, though, C<épée> is B<4> characters, not 6, because
127             this string is now UTF-8-decoded. Those 4 characters all lie beneath 256,
128             so the string is still bytes-compatible. Thus, if you C<print> that string
129             you’ll get 4 bytes of Latin-1, which probably B<isn’t> what you want.
130              
131             C<mkdir>, though, I<sometimes> still creates a directory with a 6-byte (UTF-8)
132             name. This happens when Perl itself stores C<épée> in upgraded (i.e.,
133             “unoptimized”) form. If that’s the case, that means Perl’s I<internal> buffer
134             of C<épée> is still the 6 bytes of UTF-8, even though to the Perl
135             I<application> it’s a 4-character string. Perl’s C<mkdir> doesn’t care
136             about characters, though; it just gives Perl’s internal buffer to the
137             OS’s create-directory function. So by violating its own abstraction, Perl
138             happens to achieve something that is useful.
139              
140             There are still two problems, though:
141              
142             =over
143              
144             =item * 1. Inconsistency: C<print> sends 4 bytes to the OS while
145             C<mkdir> (again, I<sometimes>) outputs 6.
146              
147             =item * 2. Uncertainty: C<épée> I<may> be stored downgraded rather than
148             upgraded, which would cause C<mkdir> to send 4 bytes instead.
149              
150             =back
151              
152             C<print>’s outputting of 4 bytes here is actually the B<correct> behavior
153             because it doesn’t depend on whether Perl stores the string upgraded or
154             downgraded. Sys::Binmode extends that correct behavior to C<mkdir> and
155             other such Perl commands.
156              
157             Of course, in the end, we want C<mkdir> to receive 6 bytes of UTF-8, not
158             4 bytes of Latin-1. To achieve that, just do as you normally do with
159             C<print>: encode your string before you give it to the OS.
160              
161             use utf8;
162             use Encode;
163              
164             mkdir encode("UTF-8", "épée");
165              
166             This is what your code should look like, regardless of Sys::Binmode;
167             the omitted encoding step was a bug that Perl’s own abstraction-violation
168             bug I<may> have obscured for you. Sys::Binmode fixes Perl’s bug,
169             which makes you fix your own bug, too.
170              
171             =head2 Non-POSIX Operating Systems (e.g., Windows)
172              
173             In a POSIX operating system, an application’s communication with the
174             OS happens entirely through byte strings. Thus, treating all
175             OS-destined strings as byte strings is good and natural.
176              
177             In Windows, though, things are weirder. For example, Windows
178             exposes multiple APIs for creating a directory, and the one Perl uses (as of
179             5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode
180             doesn’t I<break> anything, but it does reinforce one of Perl’s unfortunate
181             limitations on Windows.
182              
183             Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS.
184             For now, as far as I know, that’s everywhere that Perl runs. If that’s not
185             true, please file a bug.
186              
187             =head1 WHERE ELSE THIS PROBLEM CAN APPEAR
188              
189             The unpredictable-behavior problem that this module fixes in core Perl is
190             also common in CPAN’s XS modules due to rampant
191             use of L<SvPV|perlapi/SvPV> and
192             variants. SvPV is basically Perl’s L<bytes> pragma in C: it gives
193             you the string’s
194             internal bytes with no regard for what those bytes represent. This, of course,
195             is problematic for the same reason why the L<bytes> pragma is. XS authors
196             should prefer
197             L<SvPVbyte|perlapi/SvPVbyte>
198             or L<SvPVutf8|perlapi/SvPVutf8> in lieu of
199             SvPV unless the C code in question handles Perl’s encoding abstraction.
200              
201             Note in particular that, as of Perl 5.32, the default XS typemap converts
202             scalars to C C<char *> and C<const char *> via an SvPV variant. This means
203             that any module that uses that conversion logic also has this problem.
204             So XS authors should also avoid the default typemap for such conversions.
205             (Again, though, use of the default typemap in this context is regrettably
206             commonplace.)
207              
208             Before Perl 5.18 this problem also affected %ENV. 5.18 introduced
209             an auto-downgrade when setting %ENV similar to what this module does.
210              
211             =head1 LEXICAL SCOPING
212              
213             If, for some reason, you I<want> Perl’s unpredictable default behavior,
214             you can disable this module for a given block via
215             C<no Sys::Binmode>, thus:
216              
217             use Sys::Binmode;
218              
219             system 'echo', $foo; # predictable/sane/happy
220              
221             {
222              
223             # You should probably explain here why you’re doing this.
224             no Sys::Binmode;
225              
226             system 'echo', $foo; # nasal demons
227             }
228              
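The scoping above relies on Perl's compile-time hints hash, %^H, which this module's import and unimport set and clear. Below is a pure-Perl sketch of that general mechanism. The package C<My::Pragma>, its hint key, and C<in_effect> are all hypothetical. The sketch checks the hint at run time via C<caller>, which is an assumption made for illustration and not necessarily how Sys::Binmode's XS code consults it:

```perl
use strict;
use warnings;

package My::Pragma;    # hypothetical pragma, for illustration only

# "use My::Pragma" would call import at compile time; "no My::Pragma" calls unimport.
sub import   { $^H{'My::Pragma/on'} = 1;      return }
sub unimport { delete $^H{'My::Pragma/on'};   return }

# Reports whether the pragma is enabled in the caller's lexical scope.
sub in_effect {
    my $hints = ( caller(0) )[10];    # hints hash compiled into the calling code
    return ( $hints && $hints->{'My::Pragma/on'} ) ? 1 : 0;
}

package main;

BEGIN { My::Pragma->import }          # stands in for "use My::Pragma;"
my $outer = My::Pragma::in_effect();

my $inner;
{
    BEGIN { My::Pragma->unimport }    # stands in for "no My::Pragma;"
    $inner = My::Pragma::in_effect();
}

my $after = My::Pragma::in_effect();

print "outer=$outer inner=$inner after=$after\n";
```

Because Perl restores %^H at the end of each block, the disabling lasts only for the enclosing block, and the setting is back in force once the block closes.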
229             =head1 AFFECTED BUILT-INS
230              
231             =over
232              
233             =item * C, C, and C
234              
235             =item * C and C
236              
237             =item * File tests (e.g., C<-e>) and the following:
238             C, C, C, C, C,
239             C, C, C, C, C, C, C,
240             C, C, C, C, C,
241             C, C
242              
243             =item * C, C, C, and C (last argument)
244              
245             =item * C
246              
247             =back
248              
249             =head2 Omissions
250              
251             =over
252              
253             =item * C already does as Sys::Binmode would make it do.
254              
255             =item * C
256             but since it’s a performance-sensitive call where upgraded strings are
257             unlikely, this library doesn’t wrap it.
258              
259             =back
260              
261             =head1 KNOWN ISSUES
262              
263             L<autodie> creates functions named, e.g., C<open> in the
264             namespace of the module that C<use>s it. Those functions lack
265             the compiler “hint” that tells Sys::Binmode to do its work; thus,
266             Sys::Binmode has no effect on them.
267             C<CORE::*> functions will still have Sys::Binmode, but of course they won’t
268             throw exceptions.
269              
270             =head1 TODO
271              
272             =over
273              
274             =item * C and the System V IPC functions aren’t covered here.
275             If you’d like them, ask.
276              
277             =item * There’s room for optimization, if that’s gainful.
278              
279             =item * Ideally this behavior should be in Perl’s core distribution.
280              
281             =item * Even more ideally, Perl should adopt this behavior as I<default>.
282             Maybe someday!
283              
284             =back
285              
286             =cut
287              
288             #----------------------------------------------------------------------
289              
290             our $VERSION = '0.04_90';
291              
292             require XSLoader;
293             XSLoader::load(__PACKAGE__, $VERSION);
294              
295             sub import {
296 18     18   389 $^H{ _HINT_KEY() } = 1;
297              
298 18         15626 return;
299             }
300              
301             sub unimport {
302 1     1   1534 delete $^H{ _HINT_KEY() };
303             }
304              
305             #----------------------------------------------------------------------
306              
307             =head1 ACKNOWLEDGEMENTS
308              
309             Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some
310             debugging and design help.
311              
312             =head1 LICENSE & COPYRIGHT
313              
314             Copyright 2021 Gasper Software Consulting. All rights reserved.
315              
316             This library is licensed under the same license as Perl.
317              
318             =cut
319              
320             1;