File Coverage

blib/lib/Sys/Binmode.pm

Criterion	Covered	Total	%
statement	9	9	100.0
branch			n/a
condition			n/a
subroutine	4	4	100.0
pod			n/a
total	13	13	100.0

line	stmt	sub	time	code
1				package Sys::Binmode;
2
3	13	13	1088588	use strict;
	13		153
	13		383
4	13	13	78	use warnings;
	13		24
	13		2229
5
6				our $VERSION = '0.04_91';
7
8				=encoding utf-8
9
10				=head1 NAME
11
12				Sys::Binmode - A fix for Perl’s system call character encoding
13
14				=begin html
15
16
17
18				=end html
19
20				=head1 SYNOPSIS
21
22				use Sys::Binmode;
23
24				my $foo = "\xff";
25				$foo .= "\x{100}";
26				chop $foo;
27
28				# Prints a single octet (0xFF) and a newline:
29				print $foo, $/;
30
31				# In Perl 5.32 this may print the same single octet, or it may
32				# print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
33				# gives the single octet, just like print:
34				exec 'echo', $foo;
35
36				=head1 DESCRIPTION
37
38				tl;dr: Use this module in B<all> new code.
39
40				=head1 BACKGROUND
41
42				Ideally, a Perl application doesn’t need to know how the interpreter stores
43				a given string internally. Perl can thus store any Unicode code point while
44				still optimizing for size and speed when storing “bytes-compatible”
45				strings—i.e., strings whose code points all lie below 256. Perl’s
46				“optimized” string storage format is faster and less memory-hungry, but it
47				can only store code points 0-255. The “unoptimized” format, on the other
48				hand, can store any Unicode code point.
49
50				Of course, Perl doesn’t I<always> optimize “bytes-compatible” strings;
51				Perl can also, if
52				it wants, store such strings “unoptimized” (i.e., in Perl’s internal
53				“loose UTF-8” format), too. For code points 0-127 (ASCII printables,
54				controls, and DEL) there’s actually no
55				difference between the two forms, but for 128-255 the formats differ. (cf.
56				L<perlunicode/The "Unicode Bug">) This means that anything that reads
57				Perl’s internals B<must> differentiate between the two forms in order to
58				use the string correctly.
59
60				Alas, that differentiation doesn’t always happen. When it doesn’t, Perl
61				outputs code points 128-255 differently depending on whether the
62				containing string is “optimized” or not.
63
64				Remember, though: Perl applications I<should> I<not> I<care> about
65				Perl’s string storage internals like optimized/unoptimized. (This is why,
66				for example, the L<bytes>
67				pragma is discouraged.) The catch, though, is that without that knowledge,
68				B<the> B<application> B<can’t> B<know> B<what> B<it> B<actually> B<says>
69				B<to> B<the> B<outside> B<world!>
70
71				Thus, applications must either monitor Perl’s string-storage internals
72				or accept unpredictable behavior, both of which are categorically bad.
73
74				(Perl’s documentation calls the “unoptimized” format “upgraded”, while
75				it calls the “optimized” format “downgraded”. The rest of this document
76				will favor Perl’s terms.)
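To make the two storage forms concrete, here is a minimal sketch that is not part of Sys::Binmode. It relies only on core functions (utf8::upgrade and bytes::length); the variable names are illustrative. It shows that an upgraded and a downgraded copy of the same character compare equal while their internal buffers differ in size.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling the pragma

my $down = "\xff";       # one character, stored downgraded
my $up   = "\xff";
utf8::upgrade($up);      # the same character, now stored upgraded

my $same      = ($down eq $up) ? 1 : 0;    # Perl sees the same string
my $down_size = bytes::length($down);      # size of the internal buffer
my $up_size   = bytes::length($up);        # size of the internal buffer

printf "equal=%d downgraded=%d byte(s) upgraded=%d byte(s)\n",
    $same, $down_size, $up_size;
# prints: equal=1 downgraded=1 byte(s) upgraded=2 byte(s)
```

Anything that hands the internal buffer to the outside world without checking the storage form will therefore emit one byte in the first case and two in the second.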
77
78				=head1 HOW THIS MODULE (PARTLY) FIXES THE PROBLEM
79
80				This module provides predictable behavior for Perl’s built-in functions by
81				downgrading all strings before giving them to the operating system. It’s
82				equivalent to—but faster than!—prefixing your system calls with
83				C<utf8::downgrade()> (cf. L<utf8>) on all arguments.
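As a rough illustration of the manual approach described above, the sketch below downgrades copies of a command's arguments before they would be handed to the OS. The helper name downgraded_copies is hypothetical and is not part of Sys::Binmode; only the core utf8::downgrade and utf8::is_utf8 functions are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: return downgraded copies of the given strings.
# utf8::downgrade() dies if a string contains a code point above 255.
sub downgraded_copies {
    my @copies = @_;
    utf8::downgrade($_) for @copies;
    return @copies;
}

my $foo = "\xff";
$foo .= "\x{100}";       # forces Perl to store the string upgraded
chop $foo;               # back to "\xff", but probably still upgraded

my @args = downgraded_copies('echo', $foo);

# Count how many arguments are still stored upgraded.
my $still_upgraded = grep { utf8::is_utf8($_) } @args;
print "upgraded arguments remaining: $still_upgraded\n";
# prints: upgraded arguments remaining: 0

# system { $args[0] } @args;   # would now receive the single octet 0xFF
```

Doing this by hand at every call site is easy to forget, which is the gap the module is meant to close.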
84
85				Predictable behavior is B<always> a good thing; ergo, you should
86				use this module in B<all> new code.
87
88				=head1 CAVEAT: CHARACTER ENCODING
89
90				If you apply this module injudiciously to existing code you may see
91				exceptions or character corruption where previously things worked fine.
92
93				This can
94				happen if you’ve neglected to encode one or more strings before
95				sending them to the OS. Without Sys::Binmode, Perl sends upgraded
96				strings to the OS in UTF-8 encoding. In essence, it’s an implicit
97				UTF-8 auto-encode, which is kind of nice, except that it depends on
98				Perl’s internals, which are unpredictable. Sys::Binmode removes
99				that implicit UTF-8 auto-encode, which of course will break things
100				that need it.
101
102				The fix is to apply an explicit UTF-8 encode prior to the system call
103				that throws the error. This is what we should do I<anyway>;
104				Sys::Binmode just enforces that better.
105
106				=head2 Example: The L<utf8> Pragma
107
108				The widely-used L<utf8> pragma particularly exemplifies this problem.
109
110				If you have code like this:
111
112				use utf8;
113
114				mkdir "épée";
115
116				… then adding this module will change your program’s behavior in ways you’ll
117				probably dislike.
118
119				Consider the string C<épée>. Without the C<utf8> pragma (but assuming that
120				the code I<is> actually written in UTF-8) this is 6
121				characters because the two C<é>s are 2 bytes each (so 2 + 1 + 2 + 1),
122				and without the C<utf8> pragma each byte in a string constant becomes its own
123				character, even if multiple bytes make up a single UTF-8 character. Since
124				nothing I<normally> upgrades that string on its way to
125				C<mkdir>, the OS will receive the intended 6 bytes and create a directory
126				with a UTF-8-encoded name.
127
128				I<With> C<utf8>, though, C<épée> is B<4> characters, not 6, because
129				this string is now UTF-8-decoded. Those 4 characters all lie beneath 256,
130				so the string is still bytes-compatible. Thus, if you C<print> that string
131				you’ll get 4 bytes of Latin-1, which probably B<isn’t> what you want.
132
133				C<mkdir>, though, I<sometimes> still creates a directory with a 6-byte (UTF-8)
134				name. This happens when Perl itself stores C<épée> in upgraded (i.e.,
135				“unoptimized”) form. If that’s the case, that means Perl’s I<internal> buffer
136				of C<épée> is still the 6 bytes of UTF-8, even though to the Perl
137				I<application> it’s a 4-character string. Perl’s C<mkdir> doesn’t care
138				about characters, though; it just gives Perl’s internal buffer to the
139				OS’s create-directory function. So by violating its own abstraction, Perl
140				happens to achieve something that is I<sometimes> useful.
141
142				There are still two problems, though:
143
144				=over
145
146				=item * 1. Inconsistency: C<print> sends 4 bytes to the OS while
147				C<mkdir> (again, I<sometimes>) outputs 6.
148
149				=item * 2. Uncertainty: C<épée> I<may> be stored downgraded rather than
150				upgraded, which would cause C<mkdir> to send 4 bytes instead.
151
152				=back
153
154				C<print>’s outputting of 4 bytes here is actually the B<correct> behavior
155				because it doesn’t depend on whether Perl stores the string upgraded or
156				downgraded. Sys::Binmode extends that correct behavior to C<mkdir> and
157				other such Perl commands.
158
159				Of course, in the end, we want C<mkdir> to receive 6 bytes of UTF-8, not
160				4 bytes of Latin-1. To achieve that, just do as you normally do with
161				C<print>: encode your string before you give it to the OS.
162
163				use utf8;
164				use Encode;
165
166				mkdir encode("UTF-8", "épée");
167
168				This is what your code should look like, regardless of Sys::Binmode;
169				the omitted encoding step was a bug that Perl’s own abstraction-violation
170				bug I<may> have obscured for you. Sys::Binmode fixes Perl’s bug,
171				which makes you fix your own bug, too.
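The character and byte counts used in this example can be checked with a short sketch. It assumes only the core Encode module; the string is written with escapes so that the result does not depend on the source file's encoding or on the utf8 pragma, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "épée" as a decoded character string (U+00E9 is "é").
my $name = "\x{e9}p\x{e9}e";

my $chars  = length $name;                        # characters in the string
my $latin1 = length encode('ISO-8859-1', $name);  # bytes when sent as Latin-1
my $utf8   = length encode('UTF-8', $name);       # bytes when sent as UTF-8

print "characters=$chars latin1_bytes=$latin1 utf8_bytes=$utf8\n";
# prints: characters=4 latin1_bytes=4 utf8_bytes=6
```

Encoding explicitly means the bytes that reach the OS are fixed by the program, not by how Perl happens to store the string.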
172
173				=head2 Non-POSIX Operating Systems (e.g., Windows)
174
175				In a POSIX operating system, an application’s communication with the
176				OS happens entirely through byte strings. Thus, treating all
177				OS-destined strings as byte strings is good and natural.
178
179				In Windows, though, things are weirder. For example, Windows
180				exposes multiple APIs for creating a directory, and the one Perl uses (as of
181				5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode
182				doesn’t I<break> anything, but it does reinforce one of Perl’s unfortunate
183				limitations on Windows.
184
185				Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS.
186				For now, as far as I know, that’s everywhere that Perl runs. If that’s not
187				true, please file a bug.
188
189				=head1 WHERE ELSE THIS PROBLEM CAN APPEAR
190
191				The unpredictable-behavior problem that this module fixes in core Perl is
192				also common in CPAN’s XS modules due to rampant
193				use of L<SvPV|perlapi/SvPV> and
194				variants. SvPV is basically Perl’s L<bytes> pragma in C: it gives
195				you the string’s
196				internal bytes with no regard for what those bytes represent. This, of course,
197				is problematic for the same reason why the L<bytes> pragma is. XS authors
198				I<generally> should prefer
199				L<SvPVbyte|perlapi/SvPVbyte>
200				or L<SvPVutf8|perlapi/SvPVutf8> in lieu of
201				SvPV unless the C code in question handles Perl’s encoding abstraction.
202
203				Note in particular that, as of Perl 5.32, the default XS typemap converts
204				scalars to C C<char *> and C<const char *> via an SvPV variant. This means
205				that any module that uses that conversion logic also has this problem.
206				So XS authors should also avoid the default typemap for such conversions.
207				(Again, though, use of the default typemap in this context is regrettably
208				commonplace.)
209
210				Before Perl 5.18 this problem also affected %ENV. 5.18 introduced
211				an auto-downgrade when setting %ENV similar to what this module does.
212
213				=head1 LEXICAL SCOPING
214
215				If, for some reason, you I<want> Perl’s unpredictable default behavior,
216				you can disable this module for a given block via
217				C<no Sys::Binmode>, thus:
218
219				use Sys::Binmode;
220
221				system 'echo', $foo; # predictable/sane/happy
222
223				{
224
225				# You should probably explain here why you’re doing this.
226				no Sys::Binmode;
227
228				system 'echo', $foo; # nasal demons
229				}
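Lexically scoped switches like this are conventionally built on Perl's compile-time hint hash, %^H. The sketch below shows that general mechanism with a hypothetical pragma, My::Pragma, and a hypothetical hint key; it is not Sys::Binmode's actual implementation, whose check happens on the XS side and is not shown in this file.

```perl
use strict;
use warnings;

package My::Pragma;    # hypothetical pragma, for illustration only

# "use My::Pragma" sets a hint in the scope currently being compiled.
sub import   { $^H{'My::Pragma/enabled'} = 1; return }

# "no My::Pragma" removes the hint for the rest of the enclosing block.
sub unimport { delete $^H{'My::Pragma/enabled'}; return }

# Report whether the hint was in effect where the caller was compiled.
sub enabled {
    my $hints = (caller(0))[10];
    return $hints && $hints->{'My::Pragma/enabled'} ? 1 : 0;
}

package main;

my @seen;

BEGIN { My::Pragma->import }          # what "use My::Pragma;" would do
push @seen, My::Pragma::enabled();

{
    BEGIN { My::Pragma->unimport }    # what "no My::Pragma;" would do
    push @seen, My::Pragma::enabled();
}

push @seen, My::Pragma::enabled();    # hint is restored after the block

print "@seen\n";
# prints: 1 0 1
```

Because Perl restores %^H at the end of each block, the switch turns itself back on once the inner block closes.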
230
231				=head1 AFFECTED BUILT-INS
232
233				=over
234
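235				=item * C<exec>, C<system>, and C<readpipe>
236
237				=item * C<do> and C<require>
238
239				=item * File tests (e.g., C<-e>) and the following:
240				C<chdir>, C<chmod>, C<chown>, C<chroot>, C<ioctl>,
241				C<link>, C<lstat>, C<mkdir>, C<open>, C<opendir>, C<readlink>, C<rename>,
242				C<rmdir>, C<stat>, C<symlink>, C<sysopen>, C<truncate>,
243				C<unlink>, C<utime>
244
245				=item * C<bind>, C<connect>, C<send>, and C<setsockopt> (last argument)
246
247				=item * C<syscall>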
248
249				=back
250
251				=head2 Omissions
252
253				=over
254
255				=item * C<crypt> already does as Sys::Binmode would make it do.
256
257				=item * C
258				but since it’s a performance-sensitive call where upgraded strings are
259				unlikely, this library doesn’t wrap it.
260
261				=back
262
263				=head1 KNOWN ISSUES
264
265				L<autodie> creates functions named, e.g., C<mkdir> in the
266				namespace of the module that C<use>s it. Those functions lack
267				the compiler “hint” that tells Sys::Binmode to do its work; thus,
268				Sys::Binmode has no effect on them.
269				C<CORE::> functions will still have Sys::Binmode, but of course they won’t
270				throw exceptions.
271
272				=head1 TODO
273
274				=over
275
276				=item * C<dbmopen> and the System V IPC functions aren’t covered here.
277				If you’d like them, ask.
278
279				=item * There’s room for optimization, if that’s gainful.
280
281				=item * Ideally this behavior should be in Perl’s core distribution.
282
283				=item * Even more ideally, Perl should adopt this behavior as I<default>.
284				Maybe someday!
285
286				=back
287
288				=cut
289
290				#----------------------------------------------------------------------
291
292				require XSLoader;
293				XSLoader::load(__PACKAGE__, $VERSION);
294
295				sub import {
296	18	18	384	$^H{ _HINT_KEY() } = 1;
297
298	18		15655	return;
299				}
300
301				sub unimport {
302	1	1	1537	delete $^H{ _HINT_KEY() };
303				}
304
305				#----------------------------------------------------------------------
306
307				=head1 ACKNOWLEDGEMENTS
308
309				Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some
310				debugging and design help.
311
312				=head1 LICENSE & COPYRIGHT
313
314				Copyright 2021 Gasper Software Consulting. All rights reserved.
315
316				This library is licensed under the same license as Perl.
317
318				=cut
319
320				1;