1 | =head1 NAME |
---|
2 | |
---|
3 | perlpacktut - tutorial on C<pack> and C<unpack> |
---|
4 | |
---|
5 | =head1 DESCRIPTION |
---|
6 | |
---|
7 | C<pack> and C<unpack> are two functions for transforming data according |
---|
8 | to a user-defined template, between the guarded way Perl stores values |
---|
9 | and some well-defined representation as might be required in the |
---|
10 | environment of a Perl program. Unfortunately, they're also two of |
---|
11 | the most misunderstood and most often overlooked functions that Perl |
---|
12 | provides. This tutorial will demystify them for you. |
---|
13 | |
---|
14 | |
---|
15 | =head1 The Basic Principle |
---|
16 | |
---|
17 | Most programming languages don't shelter the memory where variables are |
---|
18 | stored. In C, for instance, you can take the address of some variable, |
---|
19 | and the C<sizeof> operator tells you how many bytes are allocated to |
---|
20 | the variable. Using the address and the size, you may access the storage |
---|
21 | to your heart's content. |
---|
22 | |
---|
23 | In Perl, you just can't access memory at random, but the structural and |
---|
24 | representational conversion provided by C<pack> and C<unpack> is an |
---|
25 | excellent alternative. The C<pack> function converts values to a byte |
---|
26 | sequence containing representations according to a given specification, |
---|
27 | the so-called "template" argument. C<unpack> is the reverse process, |
---|
28 | deriving some values from the contents of a string of bytes. (Be cautioned, |
---|
29 | however, that not all that has been packed together can be neatly unpacked - |
---|
30 | a very common experience as seasoned travellers are likely to confirm.) |
---|
31 | |
---|
32 | Why, you may ask, would you need a chunk of memory containing some values |
---|
33 | in binary representation? One good reason is input and output accessing |
---|
34 | some file, a device, or a network connection, whereby this binary |
---|
35 | representation is either forced on you or will give you some benefit |
---|
36 | in processing. Another cause is passing data to some system call that |
---|
37 | is not available as a Perl function: C<syscall> requires you to provide |
---|
38 | parameters stored in the way it happens in a C program. Even text processing |
---|
39 | (as shown in the next section) may be simplified with judicious usage |
---|
40 | of these two functions. |
---|
41 | |
---|
42 | To see how (un)packing works, we'll start with a simple template |
---|
43 | code where the conversion is in low gear: between the contents of a byte |
---|
44 | sequence and a string of hexadecimal digits. Let's use C<unpack>, since |
---|
45 | this is likely to remind you of a dump program, or some desperate last |
---|
46 | message unfortunate programs are wont to throw at you before they expire |
---|
47 | into the wild blue yonder. Assuming that the variable C<$mem> holds a |
---|
48 | sequence of bytes that we'd like to inspect without assuming anything |
---|
49 | about its meaning, we can write |
---|
50 | |
---|
51 | my( $hex ) = unpack( 'H*', $mem ); |
---|
52 | print "$hex\n"; |
---|
53 | |
---|
54 | whereupon we might see something like this, with each pair of hex digits |
---|
55 | corresponding to a byte: |
---|
56 | |
---|
57 | 41204d414e204120504c414e20412043414e414c2050414e414d41 |
---|
58 | |
---|
59 | What was in this chunk of memory? Numbers, characters, or a mixture of |
---|
60 | both? Assuming that we're on a computer where ASCII (or some similar) |
---|
61 | encoding is used: hexadecimal values in the range C<0x41> - C<0x5A> |
---|
62 | indicate an uppercase letter, and C<0x20> encodes a space. So we might |
---|
63 | assume it is a piece of text, which some are able to read like a tabloid; |
---|
64 | but others will have to get hold of an ASCII table and relive that |
---|
65 | first-grader feeling. Not caring too much about which way to read this, |
---|
66 | we note that C<unpack> with the template code C<H> converts the contents |
---|
67 | of a sequence of bytes into the customary hexadecimal notation. Since |
---|
68 | "a sequence of" is a pretty vague indication of quantity, C<H> has been |
---|
69 | defined to convert just a single hexadecimal digit unless it is followed |
---|
70 | by a repeat count. An asterisk for the repeat count means to use whatever |
---|
71 | remains. |
---|
72 | |
---|
73 | The inverse operation - packing byte contents from a string of hexadecimal |
---|
74 | digits - is just as easily written. For instance: |
---|
75 | |
---|
76 | my $s = pack( 'H2' x 10, map { "3$_" } ( 0..9 ) ); |
---|
77 | print "$s\n"; |
---|
78 | |
---|
79 | Since we feed a list of ten 2-digit hexadecimal strings to C<pack>, the |
---|
80 | pack template should contain ten pack codes. If this is run on a computer |
---|
81 | with ASCII character coding, it will print C<0123456789>. |
---|
82 | |
---|
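The two directions can be tried together. The following is only a small sketch; the contents of C<$mem> are made up for the occasion, and any other byte string would serve just as well:

```perl
use strict;
use warnings;

# Hypothetical sample data; any byte string would do.
my $mem = "A MAN A PLAN";

# Bytes to hex digits: two digits per byte.
my ($hex) = unpack( 'H*', $mem );
print "$hex\n";    # 41204d414e204120504c414e

# Hex digits back to bytes: the inverse operation.
my $back = pack( 'H*', $hex );
print "round trip ok\n" if $back eq $mem;
```

Note that C<unpack> is called in list context here, as in the text above; C<pack> always returns a single string.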
83 | |
---|
84 | =head1 Packing Text |
---|
85 | |
---|
86 | Let's suppose you've got to read in a data file like this: |
---|
87 | |
---|
88 | Date      |Description                | Income|Expenditure |
---|
89 | 01/24/2001 Ahmed's Camel Emporium                  1147.99 |
---|
90 | 01/28/2001 Flea spray                                24.99 |
---|
91 | 01/29/2001 Camel rides to tourists     1235.00 |
---|
92 | |
---|
93 | How do we do it? You might think first to use C<split>; however, since |
---|
94 | C<split> collapses blank fields, you'll never know whether a record was |
---|
95 | income or expenditure. Oops. Well, you could always use C<substr>: |
---|
96 | |
---|
97 | while (<>) { |
---|
98 | my $date = substr($_, 0, 11); |
---|
99 | my $desc = substr($_, 12, 27); |
---|
100 | my $income = substr($_, 40, 7); |
---|
101 | my $expend = substr($_, 52, 7); |
---|
102 | ... |
---|
103 | } |
---|
104 | |
---|
105 | It's not really a barrel of laughs, is it? In fact, it's worse than it |
---|
106 | may seem; the eagle-eyed may notice that the first field should only be |
---|
107 | 10 characters wide, and the error has propagated right through the other |
---|
108 | numbers - which we've had to count by hand. So it's error-prone as well |
---|
109 | as horribly unfriendly. |
---|
110 | |
---|
111 | Or maybe we could use regular expressions: |
---|
112 | |
---|
113 | while (<>) { |
---|
114 | my($date, $desc, $income, $expend) = |
---|
115 | m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|; |
---|
116 | ... |
---|
117 | } |
---|
118 | |
---|
119 | Urgh. Well, it's a bit better, but - well, would you want to maintain |
---|
120 | that? |
---|
121 | |
---|
122 | Hey, isn't Perl supposed to make this sort of thing easy? Well, it does, |
---|
123 | if you use the right tools. C<pack> and C<unpack> are designed to help |
---|
124 | you out when dealing with fixed-width data like the above. Let's have a |
---|
125 | look at a solution with C<unpack>: |
---|
126 | |
---|
127 | while (<>) { |
---|
128 | my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_); |
---|
129 | ... |
---|
130 | } |
---|
131 | |
---|
132 | That looks a bit nicer; but we've got to take apart that weird template. |
---|
133 | Where did I pull that out of? |
---|
134 | |
---|
135 | OK, let's have a look at some of our data again; in fact, we'll include |
---|
136 | the headers, and a handy ruler so we can keep track of where we are. |
---|
137 | |
---|
138 |          1         2         3         4         5 |
---|
139 | 1234567890123456789012345678901234567890123456789012345678 |
---|
140 | Date      |Description                | Income|Expenditure |
---|
141 | 01/28/2001 Flea spray                                24.99 |
---|
142 | 01/29/2001 Camel rides to tourists     1235.00 |
---|
143 | |
---|
144 | From this, we can see that the date column stretches from column 1 to |
---|
145 | column 10 - ten characters wide. The C<pack>-ese for "character" is |
---|
146 | C<A>, and ten of them are C<A10>. So if we just wanted to extract the |
---|
147 | dates, we could say this: |
---|
148 | |
---|
149 | my($date) = unpack("A10", $_); |
---|
150 | |
---|
151 | OK, what's next? Between the date and the description is a blank column; |
---|
152 | we want to skip over that. The C<x> template means "skip forward", so we |
---|
153 | want one of those. Next, we have another batch of characters, from 12 to |
---|
154 | 38. That's 27 more characters, hence C<A27>. (Don't make the fencepost |
---|
155 | error - there are 27 characters between 12 and 38, not 26. Count 'em!) |
---|
156 | |
---|
157 | Now we skip another character and pick up the next 7 characters: |
---|
158 | |
---|
159 | my($date,$description,$income) = unpack("A10xA27xA7", $_); |
---|
160 | |
---|
161 | Now comes the clever bit. Lines in our ledger which are just income and |
---|
162 | not expenditure might end at column 46. Hence, we don't want to tell our |
---|
163 | C<unpack> pattern that we B<need> to find another 12 characters; we'll |
---|
164 | just say "if there's anything left, take it". As you might guess from |
---|
165 | regular expressions, that's what the C<*> means: "use everything |
---|
166 | remaining". |
---|
167 | |
---|
168 | =over 3 |
---|
169 | |
---|
170 | =item * |
---|
171 | |
---|
172 | Be warned, though, that unlike regular expressions, if the C<unpack> |
---|
173 | template doesn't match the incoming data, Perl will scream and die. |
---|
174 | |
---|
175 | =back |
---|
176 | |
---|
177 | |
---|
178 | Hence, putting it all together: |
---|
179 | |
---|
180 | my($date,$description,$income,$expend) = unpack("A10xA27xA7xA*", $_); |
---|
181 | |
---|
182 | Now, that's our data parsed. I suppose what we might want to do now is |
---|
183 | total up our income and expenditure, and add another line to the end of |
---|
184 | our ledger - in the same format - saying how much we've brought in and |
---|
185 | how much we've spent: |
---|
186 | |
---|
187 | while (<>) { |
---|
188 | my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_); |
---|
189 | $tot_income += $income; |
---|
190 | $tot_expend += $expend; |
---|
191 | } |
---|
192 | |
---|
193 | $tot_income = sprintf("%.2f", $tot_income); # Get them into |
---|
194 | $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format |
---|
195 | |
---|
196 | $date = POSIX::strftime("%m/%d/%Y", localtime); |
---|
197 | |
---|
198 | # OK, let's go: |
---|
199 | |
---|
200 | print pack("A10xA27xA7xA*", $date, "Totals", $tot_income, $tot_expend); |
---|
201 | |
---|
202 | Oh, hmm. That didn't quite work. Let's see what happened: |
---|
203 | |
---|
204 | 01/24/2001 Ahmed's Camel Emporium                  1147.99 |
---|
205 | 01/28/2001 Flea spray                                24.99 |
---|
206 | 01/29/2001 Camel rides to tourists     1235.00 |
---|
207 | 03/23/2001Totals                     1235.001172.98 |
---|
208 | |
---|
209 | OK, it's a start, but what happened to the spaces? We put C<x>, didn't |
---|
210 | we? Shouldn't it skip forward? Let's look at what L<perlfunc/pack> says: |
---|
211 | |
---|
212 | x A null byte. |
---|
213 | |
---|
214 | Urgh. No wonder. There's a big difference between "a null byte", |
---|
215 | character zero, and "a space", character 32. Perl's put something |
---|
216 | between the date and the description - but unfortunately, we can't see |
---|
217 | it! |
---|
218 | |
---|
219 | What we actually need to do is expand the width of the fields. The C<A> |
---|
220 | format pads any non-existent characters with spaces, so we can use the |
---|
221 | additional spaces to line up our fields, like this: |
---|
222 | |
---|
223 | print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend); |
---|
224 | |
---|
225 | (Note that you can put spaces in the template to make it more readable, |
---|
226 | but they don't translate to spaces in the output.) Here's what we got |
---|
227 | this time: |
---|
228 | |
---|
229 | 01/24/2001 Ahmed's Camel Emporium                  1147.99 |
---|
230 | 01/28/2001 Flea spray                                24.99 |
---|
231 | 01/29/2001 Camel rides to tourists     1235.00 |
---|
232 | 03/23/2001 Totals                      1235.00 1172.98 |
---|
233 | |
---|
234 | That's a bit better, but we still have that last column which needs to |
---|
235 | be moved further over. There's an easy way to fix this up: |
---|
236 | unfortunately, we can't get C<pack> to right-justify our fields, but we |
---|
237 | can get C<sprintf> to do it: |
---|
238 | |
---|
239 | $tot_income = sprintf("%.2f", $tot_income); |
---|
240 | $tot_expend = sprintf("%11.2f", $tot_expend); |
---|
241 | $date = POSIX::strftime("%m/%d/%Y", localtime); |
---|
242 | print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend); |
---|
243 | |
---|
244 | This time we get the right answer: |
---|
245 | |
---|
246 | 01/28/2001 Flea spray                                24.99 |
---|
247 | 01/29/2001 Camel rides to tourists     1235.00 |
---|
248 | 03/23/2001 Totals                      1235.00     1172.98 |
---|
249 | |
---|
250 | So that's how we consume and produce fixed-width data. Let's recap what |
---|
251 | we've seen of C<pack> and C<unpack> so far: |
---|
252 | |
---|
253 | =over 3 |
---|
254 | |
---|
255 | =item * |
---|
256 | |
---|
257 | Use C<pack> to go from several pieces of data to one fixed-width |
---|
258 | version; use C<unpack> to turn a fixed-width-format string into several |
---|
259 | pieces of data. |
---|
260 | |
---|
261 | =item * |
---|
262 | |
---|
263 | The pack format C<A> means "any character"; if you're C<pack>ing and |
---|
264 | you've run out of things to pack, C<pack> will fill the rest up with |
---|
265 | spaces. |
---|
266 | |
---|
267 | =item * |
---|
268 | |
---|
269 | C<x> means "skip a byte" when C<unpack>ing; when C<pack>ing, it means |
---|
270 | "introduce a null byte" - that's probably not what you mean if you're |
---|
271 | dealing with plain text. |
---|
272 | |
---|
273 | =item * |
---|
274 | |
---|
275 | You can follow the formats with numbers to say how many characters |
---|
276 | should be affected by that format: C<A12> means "take 12 characters"; |
---|
277 | C<x6> means "skip 6 bytes" or "character 0, 6 times". |
---|
278 | |
---|
279 | =item * |
---|
280 | |
---|
281 | Instead of a number, you can use C<*> to mean "consume everything else |
---|
282 | left". |
---|
283 | |
---|
284 | B<Warning>: when packing multiple pieces of data, C<*> only means |
---|
285 | "consume all of the current piece of data". That's to say |
---|
286 | |
---|
287 | pack("A*A*", $one, $two) |
---|
288 | |
---|
289 | packs all of C<$one> into the first C<A*> and then all of C<$two> into |
---|
290 | the second. This is a general principle: each format character |
---|
291 | corresponds to one piece of data to be C<pack>ed. |
---|
292 | |
---|
293 | =back |
---|
294 | |
---|
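To see the recap at work from end to end, here is a self-contained sketch. It makes a few assumptions of its own: the ledger lines are built with C<sprintf> rather than read from a file, the date of the totals line is hard-wired, and the last field is formatted 11 characters wide so that it ends in column 58, where the C<Expenditure> header ends.

```perl
use strict;
use warnings;

# Hypothetical ledger records, built with sprintf so that the columns
# are guaranteed to match the layout described above.
my @ledger = map { sprintf( '%-10s %-27s %7s %11s', @$_ ) } (
    [ '01/24/2001', "Ahmed's Camel Emporium",  '',        '1147.99' ],
    [ '01/28/2001', 'Flea spray',              '',        '24.99'   ],
    [ '01/29/2001', 'Camel rides to tourists', '1235.00', ''        ],
);

my ( $tot_income, $tot_expend ) = ( 0, 0 );
for my $line (@ledger) {
    my ( $date, $desc, $income, $expend ) =
        unpack( 'A10 x A27 x A7 x A*', $line );
    $tot_income += $income || 0;    # an empty field counts as zero
    $tot_expend += $expend || 0;
}

# A11, A28 and A8 each include one trailing pad space as separator;
# the last field is right-justified by sprintf, not by pack.
my $totals = pack( 'A11 A28 A8 A*',
    '03/23/2001', 'Totals',
    sprintf( '%.2f',   $tot_income ),
    sprintf( '%11.2f', $tot_expend ) );

print "$_\n" for @ledger, $totals;
```

The same column widths appear twice, once with C<x> for reading and once widened by one for writing, which is exactly the asymmetry of C<x> described in the list above.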
295 | |
---|
296 | |
---|
297 | =head1 Packing Numbers |
---|
298 | |
---|
299 | So much for textual data. Let's get onto the meaty stuff that C<pack> |
---|
300 | and C<unpack> are best at: handling binary formats for numbers. There is, |
---|
301 | of course, not just one binary format - life would be too simple - but |
---|
302 | Perl will do all the finicky labor for you. |
---|
303 | |
---|
304 | |
---|
305 | =head2 Integers |
---|
306 | |
---|
307 | Packing and unpacking numbers implies conversion to and from some |
---|
308 | I<specific> binary representation. Leaving floating point numbers |
---|
309 | aside for the moment, the salient properties of any such representation |
---|
310 | are: |
---|
311 | |
---|
312 | =over 4 |
---|
313 | |
---|
314 | =item * |
---|
315 | |
---|
316 | the number of bytes used for storing the integer, |
---|
317 | |
---|
318 | =item * |
---|
319 | |
---|
320 | whether the contents are interpreted as a signed or unsigned number, |
---|
321 | |
---|
322 | =item * |
---|
323 | |
---|
324 | the byte ordering: whether the first byte is the least or most |
---|
325 | significant byte (or: little-endian or big-endian, respectively). |
---|
326 | |
---|
327 | =back |
---|
328 | |
---|
329 | So, for instance, to pack 20302 to a signed 16 bit integer in your |
---|
330 | computer's representation you write |
---|
331 | |
---|
332 | my $ps = pack( 's', 20302 ); |
---|
333 | |
---|
334 | Again, the result is a string, now containing 2 bytes. If you print |
---|
335 | this string (which is, generally, not recommended) you might see |
---|
336 | C<ON> or C<NO> (depending on your system's byte ordering) - or something |
---|
337 | entirely different if your computer doesn't use ASCII character encoding. |
---|
338 | Unpacking C<$ps> with the same template returns the original integer value: |
---|
339 | |
---|
340 | my( $s ) = unpack( 's', $ps ); |
---|
341 | |
---|
342 | This is true for all numeric template codes. But don't expect miracles: |
---|
343 | if the packed value exceeds the allotted byte capacity, high order bits |
---|
344 | are silently discarded, and unpack certainly won't be able to pull them |
---|
345 | back out of some magic hat. And, when you pack using a signed template |
---|
346 | code such as C<s>, an excess value may result in the sign bit |
---|
347 | getting set, and unpacking this will smartly return a negative value. |
---|
348 | |
---|
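A short experiment makes the round trip and the overflow behaviour tangible. This is a sketch only; the value 40000 is an arbitrary pick that happens to be too large for a signed 16-bit integer, and the negative result assumes the usual two's complement representation:

```perl
use strict;
use warnings;

my $ps = pack( 's', 20302 );
my ($s) = unpack( 's', $ps );
printf "%d bytes, value %d\n", length($ps), $s;    # 2 bytes, value 20302

# 40000 fits into 16 bits only when they are read as unsigned; packed
# with 's' the top bit is taken as the sign bit on the way back.
my ($wrapped) = unpack( 's', pack( 's', 40000 ) );
print "$wrapped\n";                                # -25536

# The unsigned code 'S' has room for it.
my ($unsigned) = unpack( 'S', pack( 'S', 40000 ) );
print "$unsigned\n";                               # 40000
```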
349 | 16 bits won't get you too far with integers, but there are C<l> and C<L> |
---|
350 | for signed and unsigned 32-bit integers. And if this is not enough and |
---|
351 | your system supports 64 bit integers you can push the limits much closer |
---|
352 | to infinity with pack codes C<q> and C<Q>. A notable exception is provided |
---|
353 | by pack codes C<i> and C<I> for signed and unsigned integers of the |
---|
354 | "local custom" variety: Such an integer will take up as many bytes as |
---|
355 | a local C compiler returns for C<sizeof(int)>, but it'll use I<at least> |
---|
356 | 32 bits. |
---|
357 | |
---|
358 | Each of the integer pack codes C<sSlLqQ> results in a fixed number of bytes, |
---|
359 | no matter where you execute your program. This may be useful for some |
---|
360 | applications, but it does not provide for a portable way to pass data |
---|
361 | structures between Perl and C programs (bound to happen when you call |
---|
362 | XS extensions or the Perl function C<syscall>), or when you read or |
---|
363 | write binary files. What you'll need in this case are template codes that |
---|
364 | depend on what your local C compiler compiles when you code C<short> or |
---|
365 | C<unsigned long>, for instance. These codes and their corresponding |
---|
366 | byte lengths are shown in the table below. Since the C standard leaves |
---|
367 | much leeway with respect to the relative sizes of these data types, actual |
---|
368 | values may vary, and that's why the values are given as expressions in |
---|
369 | C and Perl. (If you'd like to use values from C<%Config> in your program |
---|
370 | you have to import it with C<use Config>.) |
---|
371 | |
---|
372 | signed unsigned  byte length in C   byte length in Perl |
---|
373 |   s!     S!      sizeof(short)      $Config{shortsize} |
---|
374 |   i!     I!      sizeof(int)        $Config{intsize} |
---|
375 |   l!     L!      sizeof(long)       $Config{longsize} |
---|
376 |   q!     Q!      sizeof(long long)  $Config{longlongsize} |
---|
377 | |
---|
378 | The C<i!> and C<I!> codes aren't different from C<i> and C<I>; they are |
---|
379 | tolerated for completeness' sake. |
---|
380 | |
---|
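If you are curious what these sizes are on your own machine, a loop like the one below will tell you. It is a sketch that leaves out C<q!> on purpose, since that code is only available where 64-bit integers are supported:

```perl
use strict;
use warnings;
use Config;

# Compare what pack produces with what %Config reports.
for my $row ( [ 's!', 'shortsize' ], [ 'i!', 'intsize' ], [ 'l!', 'longsize' ] ) {
    my ( $code, $key ) = @$row;
    printf "%-2s packs %d bytes; \$Config{%s} is %d\n",
        $code, length( pack( $code, 0 ) ), $key, $Config{$key};
}

# The unadorned codes are fixed: 's' is always 2 bytes, 'l' always 4.
printf "s: %d bytes, l: %d bytes\n",
    length( pack( 's', 0 ) ), length( pack( 'l', 0 ) );
```

The two numbers on each of the first three lines should agree; what they are depends on your C compiler.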
381 | |
---|
382 | =head2 Unpacking a Stack Frame |
---|
383 | |
---|
384 | Requesting a particular byte ordering may be necessary when you work with |
---|
385 | binary data coming from some specific architecture whereas your program could |
---|
386 | run on a totally different system. As an example, assume you have 24 bytes |
---|
387 | containing a stack frame as it happens on an Intel 8086: |
---|
388 | |
---|
389 |      +---------+        +----+----+               +---------+ |
---|
390 | TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    | |
---|
391 |      +---------+        +----+----+               +---------+ |
---|
392 |      |   CS    |        | AL | AH | AX            |   DI    | |
---|
393 |      +---------+        +----+----+               +---------+ |
---|
394 |                         | BL | BH | BX            |   BP    | |
---|
395 |                         +----+----+               +---------+ |
---|
396 |                         | CL | CH | CX            |   DS    | |
---|
397 |                         +----+----+               +---------+ |
---|
398 |                         | DL | DH | DX            |   ES    | |
---|
399 |                         +----+----+               +---------+ |
---|
400 | |
---|
401 | First, we note that this time-honored 16-bit CPU uses little-endian order, |
---|
402 | and that's why the low order byte is stored at the lower address. To |
---|
403 | unpack such an (unsigned) short we'll have to use code C<v>. A repeat |
---|
404 | count unpacks all 12 shorts: |
---|
405 | |
---|
406 | my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) = |
---|
407 | unpack( 'v12', $frame ); |
---|
408 | |
---|
409 | Alternatively, we could have used C<C> to unpack the individually |
---|
410 | accessible byte registers FL, FH, AL, AH, etc.: |
---|
411 | |
---|
412 | my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) = |
---|
413 | unpack( 'C10', substr( $frame, 4, 10 ) ); |
---|
414 | |
---|
415 | It would be nice if we could do this in one fell swoop: unpack a short, |
---|
416 | back up a little, and then unpack 2 bytes. Since Perl I<is> nice, it |
---|
417 | proffers the template code C<X> to back up one byte. Putting this all |
---|
418 | together, we may now write: |
---|
419 | |
---|
420 | my( $ip, $cs, |
---|
421 | $flags,$fl,$fh, |
---|
422 | $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh, |
---|
423 | $si, $di, $bp, $ds, $es ) = |
---|
424 | unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame ); |
---|
425 | |
---|
426 | (The clumsy construction of the template can be avoided - just read on!) |
---|
427 | |
---|
428 | We've taken some pains to construct the template so that it matches |
---|
429 | the contents of our frame buffer. Otherwise we'd either get undefined values, |
---|
430 | or C<unpack> could not unpack all. If C<pack> runs out of items, it will |
---|
431 | supply null strings (which are coerced into zeroes whenever the pack code |
---|
432 | says so). |
---|
433 | |
---|
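Lacking a real 8086 to take a frame from, we can fabricate one with C<pack> and check that the combined template picks it apart as intended. The register values below are invented for the purpose of this sketch:

```perl
use strict;
use warnings;

# A made-up frame: twelve 16-bit values packed in little-endian order.
my $frame = pack( 'v12',
    0x0100, 0x2000, 0x0246,            # IP, CS, FLAGS
    0x1234, 0x5678, 0x9ABC, 0xDEF0,    # AX, BX, CX, DX
    1, 2, 3, 4, 5 );                   # SI, DI, BP, DS, ES

my( $ip, $cs,
    $flags,$fl,$fh,
    $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh,
    $si, $di, $bp, $ds, $es ) =
unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );

printf "length %d\n", length $frame;                        # length 24
printf "AX=%04X AL=%02X AH=%02X\n", $ax, $al, $ah;          # AX=1234 AL=34 AH=12
printf "FLAGS=%04X FL=%02X FH=%02X\n", $flags, $fl, $fh;    # FLAGS=0246 FL=46 FH=02
```

The low byte comes out first in each pair, as befits a little-endian machine.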
434 | |
---|
435 | =head2 How to Eat an Egg on a Net |
---|
436 | |
---|
437 | The pack code for big-endian (high order byte at the lowest address) is |
---|
438 | C<n> for 16 bit and C<N> for 32 bit integers. You use these codes |
---|
439 | if you know that your data comes from a compliant architecture, but, |
---|
440 | surprisingly enough, you should also use these pack codes if you |
---|
441 | exchange binary data, across the network, with some system that you |
---|
442 | know next to nothing about. The simple reason is that this |
---|
443 | order has been chosen as the I<network order>, and all standard-fearing |
---|
444 | programs ought to follow this convention. (This is, of course, a stern |
---|
445 | backing for one of the Lilliputian parties and may well influence the |
---|
446 | political development there.) So, if the protocol expects you to send |
---|
447 | a message by sending the length first, followed by just so many bytes, |
---|
448 | you could write: |
---|
449 | |
---|
450 | my $buf = pack( 'N', length( $msg ) ) . $msg; |
---|
451 | |
---|
452 | or even: |
---|
453 | |
---|
454 | my $buf = pack( 'NA*', length( $msg ), $msg ); |
---|
455 | |
---|
456 | and pass C<$buf> to your send routine. Some protocols demand that the |
---|
457 | count should include the length of the count itself: then just add 4 |
---|
458 | to the data length. (But make sure to read L<"Lengths and Widths"> before |
---|
459 | you really code this!) |
---|
460 | |
---|
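Both ends of such a length-prefixed exchange can be sketched in a few lines. No actual network is involved here; C<$buf> simply stands in for whatever your send and receive routines would carry, and the message text is made up:

```perl
use strict;
use warnings;

my $msg = 'Hello, world';

# Sender: 4-byte big-endian length, then the payload.
my $buf = pack( 'N', length($msg) ) . $msg;
print unpack( 'H8', $buf ), "\n";    # 0000000c

# Receiver: read the length first, then take that many bytes.
my ($len) = unpack( 'N', $buf );
my $payload = substr( $buf, 4, $len );
print "$len: $payload\n";            # 12: Hello, world
```

The hex dump of the first four bytes shows the most significant byte in front, whatever the byte order of the machine running the code.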
461 | |
---|
462 | |
---|
463 | =head2 Floating point Numbers |
---|
464 | |
---|
465 | For packing floating point numbers you have the choice between the |
---|
466 | pack codes C<f> and C<d> which pack into (or unpack from) single-precision or |
---|
467 | double-precision representation as it is provided by your system. (There |
---|
468 | is no such thing as a network representation for reals, so if you want |
---|
469 | to send your real numbers across computer boundaries, you'd better stick |
---|
470 | to ASCII representation, unless you're absolutely sure what's on the other |
---|
471 | end of the line.) |
---|
472 | |
---|
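The difference between the two codes shows up as soon as a value that has no exact binary representation makes the round trip. The following sketch uses 0.1 as an arbitrary example; the byte lengths it prints depend on your system:

```perl
use strict;
use warnings;

my $x = 0.1;

my $f = pack( 'f', $x );
my $d = pack( 'd', $x );
printf "f: %d bytes, d: %d bytes\n", length($f), length($d);

# Single precision keeps fewer digits, so the value that comes back
# is close to, but not equal to, the original.
my ($single) = unpack( 'f', $f );
printf "%.10f\n", $single;
print "single precision lost digits\n" if $single != $x;
```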
473 | |
---|
474 | |
---|
475 | =head1 Exotic Templates |
---|
476 | |
---|
477 | |
---|
478 | =head2 Bit Strings |
---|
479 | |
---|
480 | Bits are the atoms in the memory world. Access to individual bits may |
---|
481 | have to be used either as a last resort or because it is the most |
---|
482 | convenient way to handle your data. Bit string (un)packing converts |
---|
483 | between strings containing a series of C<0> and C<1> characters and |
---|
484 | a sequence of bytes each containing a group of 8 bits. This is almost |
---|
485 | as simple as it sounds, except that there are two ways the contents of |
---|
486 | a byte may be written as a bit string. Let's have a look at an annotated |
---|
487 | byte: |
---|
488 | |
---|
489 |   7 6 5 4 3 2 1 0 |
---|
490 | +-----------------+ |
---|
491 | | 1 0 0 0 1 1 0 0 | |
---|
492 | +-----------------+ |
---|
493 |  MSB           LSB |
---|
494 | |
---|
495 | It's egg-eating all over again: Some think that as a bit string this should |
---|
496 | be written "10001100" i.e. beginning with the most significant bit, others |
---|
497 | insist on "00110001". Well, Perl isn't biased, so that's why we have two bit |
---|
498 | string codes: |
---|
499 | |
---|
500 | $byte = pack( 'B8', '10001100' ); # start with MSB |
---|
501 | $byte = pack( 'b8', '00110001' ); # start with LSB |
---|
502 | |
---|
503 | It is not possible to pack or unpack bit fields - just integral bytes. |
---|
504 | C<pack> always starts at the next byte boundary and "rounds up" to the |
---|
505 | next multiple of 8 by adding zero bits as required. (If you do want bit |
---|
506 | fields, there is L<perlfunc/vec>. Or you could implement bit field |
---|
507 | handling at the character string level, using split, substr, and |
---|
508 | concatenation on unpacked bit strings.) |
---|
509 | |
---|
510 | To illustrate unpacking for bit strings, we'll decompose a simple |
---|
511 | status register (a "-" stands for a "reserved" bit): |
---|
512 | |
---|
513 | +-----------------+-----------------+ |
---|
514 | | S Z - A - P - C | - - - - O D I T | |
---|
515 | +-----------------+-----------------+ |
---|
516 |  MSB           LSB MSB           LSB |
---|
517 | |
---|
518 | Converting these two bytes to a string can be done with the unpack |
---|
519 | template C<'b16'>. To obtain the individual bit values from the bit |
---|
520 | string we use C<split> with the "empty" separator pattern which dissects |
---|
521 | into individual characters. Bit values from the "reserved" positions are |
---|
522 | simply assigned to C<undef>, a convenient notation for "I don't care where |
---|
523 | this goes". |
---|
524 | |
---|
525 | ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign, |
---|
526 | $trace, $interrupt, $direction, $overflow) = |
---|
527 | split( //, unpack( 'b16', $status ) ); |
---|
528 | |
---|
529 | We could have used an unpack template C<'b12'> just as well, since the |
---|
530 | last 4 bits can be ignored anyway. |
---|
531 | |
---|
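Here is a sketch that exercises both bit string codes. The status word is invented: it is written down MSB first with C<B16>, with S and C set in the first byte and O and I set in the second, and then taken apart LSB first as shown above.

```perl
use strict;
use warnings;

# Both orders describe the same byte, 0x8C.
my $msb_first = pack( 'B8', '10001100' );
my $lsb_first = pack( 'b8', '00110001' );
printf "0x%02X 0x%02X\n", ord($msb_first), ord($lsb_first);    # 0x8C 0x8C

# A made-up status word: S and C set, then O and I set.
my $status = pack( 'B16', '10000001' . '00001010' );

my ( $carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign,
     $trace, $interrupt, $direction, $overflow ) =
    split( //, unpack( 'b16', $status ) );

print "carry=$carry sign=$sign interrupt=$interrupt overflow=$overflow\n";
# carry=1 sign=1 interrupt=1 overflow=1
```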
532 | |
---|
533 | =head2 Uuencoding |
---|
534 | |
---|
535 | Another odd-man-out in the template alphabet is C<u>, which packs an |
---|
536 | "uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that |
---|
537 | you won't ever need this encoding technique which was invented to overcome |
---|
538 | the shortcomings of old-fashioned transmission mediums that do not support |
---|
539 | other than simple ASCII data. The essential recipe is simple: Take three |
---|
540 | bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to |
---|
541 | each. Repeat until all of the data is blended. Fold groups of 4 bytes into |
---|
542 | lines no longer than 60 and garnish them in front with the original byte count |
---|
543 | (incremented by 0x20) and a C<"\n"> at the end. - The C<pack> chef will |
---|
544 | prepare this for you, a la minute, when you select pack code C<u> on the menu: |
---|
545 | |
---|
546 | my $uubuf = pack( 'u', $bindat ); |
---|
547 | |
---|
548 | A repeat count after C<u> sets the number of bytes to put into an |
---|
549 | uuencoded line, which is the maximum of 45 by default, but could be |
---|
550 | set to some (smaller) integer multiple of three. C<unpack> simply ignores |
---|
551 | the repeat count. |
---|
552 | |
---|
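A taste of the result, using made-up input: three bytes give four encoded characters plus the count character in front, and a longer string is folded into lines of 45 input bytes each.

```perl
use strict;
use warnings;

my $uubuf = pack( 'u', 'cat' );
print $uubuf;                         # #8V%T

my ($decoded) = unpack( 'u', $uubuf );
print "$decoded\n";                   # cat

# 100 bytes at the default 45 bytes per line: 45 + 45 + 10.
my @lines = split /\n/, pack( 'u', 'x' x 100 );
print scalar(@lines), " lines\n";     # 3 lines
print length( $lines[0] ), "\n";      # 61: one count character plus 60
```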
553 | |
---|
554 | =head2 Doing Sums |
---|
555 | |
---|
556 | An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because |
---|
557 | it's used as a prefix to some other template code. Second, because it |
---|
558 | cannot be used in C<pack> at all, and third, because in C<unpack> it doesn't return the |
---|
559 | data as defined by the template code it precedes. Instead it'll give you an |
---|
560 | integer of I<number> bits that is computed from the data value by |
---|
561 | doing sums. For numeric unpack codes, no big feat is achieved: |
---|
562 | |
---|
563 | my $buf = pack( 'iii', 100, 20, 3 ); |
---|
564 | print unpack( '%32i3', $buf ), "\n"; # prints 123 |
---|
565 | |
---|
566 | For string values, C<%> returns the sum of the byte values saving |
---|
567 | you the trouble of a sum loop with C<substr> and C<ord>: |
---|
568 | |
---|
569 | print unpack( '%32A*', "\x01\x10" ), "\n"; # prints 17 |
---|
570 | |
---|
571 | Although the C<%> code is documented as returning a "checksum": |
---|
572 | don't put your trust in such values! Even when applied to a small number |
---|
573 | of bytes, they won't guarantee a noticeable Hamming distance. |
---|
574 | |
---|
575 | In connection with C<b> or C<B>, C<%> simply adds bits, and this can be put |
---|
576 | to good use to count set bits efficiently: |
---|
577 | |
---|
578 | my $bitcount = unpack( '%32b*', $mask ); |
---|
579 | |
---|
580 | And an even parity bit can be determined like this: |
---|
581 | |
---|
582 | my $evenparity = unpack( '%1b*', $mask ); |
---|
583 | |
---|
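The sums above are easy to check by hand on small inputs. In this sketch the string C<abc> and the two-byte C<$mask> are arbitrary test values:

```perl
use strict;
use warnings;

# Sum of byte values: 97 + 98 + 99.
print unpack( '%32C*', 'abc' ), "\n";      # 294

my $mask = "\xFF\x01";                     # nine bits set
my $bitcount   = unpack( '%32b*', $mask );
my $evenparity = unpack( '%1b*',  $mask );
print "$bitcount $evenparity\n";           # 9 1

# The same bit count done by hand on the unpacked bit string.
my $bits    = unpack( 'b*', $mask );
my $by_hand = ( $bits =~ tr/1// );
print "$by_hand\n";                        # 9
```

The parity is simply the lowest bit of the bit count, which is all that a 1-bit sum can hold.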
584 | |
---|
585 | =head2 Unicode |
---|
586 | |
---|
587 | Unicode is a character set that can represent most characters in most of |
---|
588 | the world's languages, providing room for over one million different |
---|
589 | characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin |
---|
590 | characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with |
---|
591 | characters that are used in several European languages is in the next |
---|
592 | range, up to 255. After some more Latin extensions we find the character |
---|
593 | sets from languages using non-Roman alphabets, interspersed with a |
---|
594 | variety of symbol sets such as currency symbols, Zapf Dingbats or Braille. |
---|
595 | (You might want to visit L<www.unicode.org> for a look at some of |
---|
596 | them - my personal favourites are Telugu and Kannada.) |
---|
597 | |
---|
598 | The Unicode character set associates characters with integers. Encoding |
---|
599 | these numbers in an equal number of bytes would more than double the |
---|
600 | requirements for storing texts written in Latin alphabets. |
---|
601 | The UTF-8 encoding avoids this by storing the most common (from a western |
---|
602 | point of view) characters in a single byte while encoding the rarer |
---|
603 | ones in two or more bytes. |
---|
604 | |
---|
605 | So what has this got to do with C<pack>? Well, if you want to convert |
---|
606 | between a Unicode number and its UTF-8 representation you can do so by |
---|
607 | using template code C<U>. As an example, let's produce the UTF-8 |
---|
608 | representation of the Euro currency symbol (code number 0x20AC): |
---|
609 | |
---|
610 | $UTF8{Euro} = pack( 'U', 0x20AC ); |
---|
611 | |
---|
612 | Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The |
---|
613 | round trip can be completed with C<unpack>: |
---|
614 | |
---|
615 | $Unicode{Euro} = unpack( 'U', $UTF8{Euro} ); |
---|
616 | |
---|
617 | Usually you'll want to pack or unpack UTF-8 strings: |
---|
618 | |
---|
619 | # pack and unpack the Hebrew alphabet |
---|
620 | my $alefbet = pack( 'U*', 0x05d0..0x05ea ); |
---|
621 | my @hebrew = unpack( 'U*', $alefbet ); |
---|
622 | |
---|
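How you inspect the packed string matters. The sketch below assumes a reasonably current Perl, where the result of C<pack 'U'> is a character string; to look at the UTF-8 bytes it encodes a copy with C<utf8::encode>, which is one way of doing so and not the only one.

```perl
use strict;
use warnings;

my $euro = pack( 'U', 0x20AC );

# Seen as characters the string has length 1; encoding a copy to
# UTF-8 reveals the three bytes.
my $bytes = $euro;
utf8::encode($bytes);
printf "%d character, %d bytes: %s\n",
    length($euro), length($bytes), unpack( 'H*', $bytes );
# 1 character, 3 bytes: e282ac

my ($codepoint) = unpack( 'U', $euro );
printf "U+%04X\n", $codepoint;              # U+20AC

my $alefbet = pack( 'U*', 0x05d0 .. 0x05ea );
my @hebrew  = unpack( 'U*', $alefbet );
print scalar(@hebrew), " code points\n";    # 27 code points
```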
623 | |
---|
624 | =head2 Another Portable Binary Encoding |
---|
625 | |
---|
626 | The pack code C<w> has been added to support a portable binary data |
---|
627 | encoding scheme that goes way beyond simple integers. (Details can |
---|
628 | be found at L<Casbah.org>, the Scarab project.) A BER (Binary Encoded |
---|
629 | Representation) compressed unsigned integer stores base 128 |
---|
630 | digits, most significant digit first, with as few digits as possible. |
---|
631 | Bit eight (the high bit) is set on each byte except the last. There |
---|
632 | is no size limit to BER encoding, but Perl won't go to extremes. |
---|
633 | |
---|
634 | my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 ); |
---|
635 | |
---|
636 | A hex dump of C<$berbuf>, with spaces inserted at the right places, |
---|
637 | shows 01 8100 8101 81807F. Since the last byte is always less than |
---|
638 | 128, C<unpack> knows where to stop. |
---|
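The hex dump and the reverse operation can be checked with a short sketch like the following; the variable names C<@values>, C<$dump> and C<@decoded> are illustrative only.

```perl
use strict;
use warnings;

my @values = ( 1, 128, 128 + 1, 128 * 128 + 127 );
my $berbuf = pack( 'w*', @values );

# Hex dump of the buffer, two digits per byte.
my $dump = unpack( 'H*', $berbuf );

# The high bit marks continuation, so unpack finds each boundary itself.
my @decoded = unpack( 'w*', $berbuf );

print "$dump\n";       # 018100810181807f
print "@decoded\n";    # 1 128 129 16511
```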
639 | |
---|
640 | |
---|
641 | =head1 Template Grouping |
---|
642 | |
---|
643 | Prior to Perl 5.8, repetitions of templates had to be made by |
---|
644 | C<x>-multiplication of template strings. Now there is a better way as |
---|
645 | we may use the pack codes C<(> and C<)> combined with a repeat count. |
---|
646 | The C<unpack> template from the Stack Frame example can simply |
---|
647 | be written like this: |
---|
648 | |
---|
649 | unpack( 'v2 (vXXCC)5 v5', $frame ) |
---|
650 | |
---|
651 | Let's explore this feature a little more. We'll begin with the equivalent of |
---|
652 | |
---|
653 | join( '', map( substr( $_, 0, 1 ), @str ) ) |
---|
654 | |
---|
655 | which returns a string consisting of the first character from each string. |
---|
656 | Using pack, we can write |
---|
657 | |
---|
658 | pack( '(A)'.@str, @str ) |
---|
659 | |
---|
660 | or, because a repeat count C<*> means "repeat as often as required", |
---|
661 | simply |
---|
662 | |
---|
663 | pack( '(A)*', @str ) |
---|
664 | |
---|
665 | (Note that the template C<A*> would only have packed C<$str[0]> in full |
---|
666 | length.) |
---|
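A small sketch comparing the variants side by side, using a made-up list of strings in C<@str>:

```perl
use strict;
use warnings;

my @str = qw( Perl Awk Carp Kernel );    # hypothetical sample data

my $joined  = join( '', map( substr( $_, 0, 1 ), @str ) );
my $counted = pack( '(A)' . @str, @str );    # template becomes '(A)4'
my $starred = pack( '(A)*', @str );
my $greedy  = pack( 'A*', @str );            # consumes only $str[0]

print "$joined $counted $starred $greedy\n";    # PACK PACK PACK Perl
```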
667 | |
---|
668 | To pack dates stored as triplets ( day, month, year ) in an array C<@dates> |
---|
669 | into a sequence of byte, byte, short integer we can write |
---|
670 | |
---|
671 | $pd = pack( '(CCS)*', map( @$_, @dates ) ); |
---|
672 | |
---|
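As a rough illustration with invented dates, the same group template also unpacks the buffer, after which the flat list can be regrouped into triplets. The C<splice> loop is just one way of doing the regrouping.

```perl
use strict;
use warnings;

# Hypothetical sample data: ( day, month, year ) triplets.
my @dates = ( [ 24, 12, 2001 ], [ 1, 1, 2002 ], [ 29, 2, 2004 ] );

my $pd = pack( '(CCS)*', map( @$_, @dates ) );

# Unpack the flat list and regroup it into triplets.
my @flat = unpack( '(CCS)*', $pd );
my @back;
push @back, [ splice( @flat, 0, 3 ) ] while @flat;

printf "%d bytes, %d dates\n", length( $pd ), scalar @back;
print join( ' ', map { join '-', @$_ } @back ), "\n";
```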
673 | To swap pairs of characters in a string (with even length) one could use |
---|
674 | several techniques. First, let's use C<x> and C<X> to skip forward and back: |
---|
675 | |
---|
676 | $s = pack( '(A)*', unpack( '(xAXXAx)*', $s ) ); |
---|
677 | |
---|
678 | We can also use C<@> to jump to an offset, with 0 being the position where |
---|
679 | we were when the last C<(> was encountered: |
---|
680 | |
---|
681 | $s = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) ); |
---|
682 | |
---|
683 | Finally, there is also an entirely different approach by unpacking big |
---|
684 | endian shorts and packing them in the reverse byte order: |
---|
685 | |
---|
686 | $s = pack( '(v)*', unpack( '(n)*', $s ) ); |
---|
687 | |
---|
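The three techniques can be tried on a short test string; this sketch simply runs each one and prints the results, with variable names chosen for illustration.

```perl
use strict;
use warnings;

my $s = 'abcdef';    # must have even length

my $by_skip = pack( '(A)*', unpack( '(xAXXAx)*',     $s ) );
my $by_jump = pack( '(A)*', unpack( '(@1A @0A @2)*', $s ) );
my $by_swab = pack( '(v)*', unpack( '(n)*',          $s ) );

print "$by_skip $by_jump $by_swab\n";    # badcfe badcfe badcfe
```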
688 | |
---|
689 | =head1 Lengths and Widths |
---|
690 | |
---|
691 | =head2 String Lengths |
---|
692 | |
---|
693 | In the previous section we've seen a network message that was constructed |
---|
694 | by prefixing the binary message length to the actual message. You'll find |
---|
695 | that packing a length followed by so many bytes of data is a |
---|
696 | frequently used recipe since appending a null byte won't work |
---|
697 | if a null byte may be part of the data. Here is an example where both |
---|
698 | techniques are used: after two null terminated strings with source and |
---|
699 | destination address, a Short Message (to a mobile phone) is sent after |
---|
700 | a length byte: |
---|
701 | |
---|
702 | my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm ); |
---|
703 | |
---|
704 | Unpacking this message can be done with the same template: |
---|
705 | |
---|
706 | ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg ); |
---|
707 | |
---|
708 | There's a subtle trap lurking in the offing: Adding another field after |
---|
709 | the Short Message (in variable C<$sm>) is all right when packing, but this |
---|
710 | cannot be unpacked naively: |
---|
711 | |
---|
712 | # pack a message |
---|
713 | my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio ); |
---|
714 | |
---|
715 | # unpack fails - $prio remains undefined! |
---|
716 | ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg ); |
---|
717 | |
---|
718 | The pack code C<A*> gobbles up all remaining bytes, and C<$prio> remains |
---|
719 | undefined! Before we let disappointment dampen the morale: Perl's got |
---|
720 | the trump card to make this trick too, just a little further up the sleeve. |
---|
721 | Watch this: |
---|
722 | |
---|
723 | # pack a message: ASCIIZ, ASCIIZ, length/string, byte |
---|
724 | my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio ); |
---|
725 | |
---|
726 | # unpack |
---|
727 | ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg ); |
---|
728 | |
---|
729 | Combining two pack codes with a slash (C</>) associates them with a single |
---|
730 | value from the argument list. In C<pack>, the length of the argument is |
---|
731 | taken and packed according to the first code while the argument itself |
---|
732 | is added after being converted with the template code after the slash. |
---|
733 | This saves us the trouble of inserting the C<length> call, but it is |
---|
734 | in C<unpack> where we really score: The value of the length byte marks the |
---|
735 | end of the string to be taken from the buffer. For a string item, this |
---|
736 | combination only makes sense when the second pack code is C<a*>, C<A*> |
---|
737 | or C<Z*>. |
---|
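To make the difference tangible, here is a sketch with invented message values that unpacks the same buffer twice, once with the naive template and once with the length/string combination.

```perl
use strict;
use warnings;

# Hypothetical sample values.
my ( $src, $dst, $sm, $prio ) = ( '4311111', '4322222', 'Hello, world', 3 );

my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );

# Naive template: A* swallows the priority byte along with the text,
# so nothing is left for the final C.
my @naive = unpack( 'Z* Z* C A* C', $msg );

# Length/string template: the length byte tells unpack where to stop.
my @good = unpack( 'Z* Z* C/A* C', $msg );

printf "naive: %d values, good: %d values, prio %d\n",
    scalar @naive, scalar @good, $good[3];
```

Note that the naive call returns the length byte as a separate value, while the C</> form consumes it silently.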
738 | |
---|
739 | The pack code preceding C</> may be anything that's fit to represent a |
---|
740 | number: All the numeric binary pack codes, and even text codes such as |
---|
741 | C<A4> or C<Z*>: |
---|
742 | |
---|
743 | # pack/unpack a string preceded by its length in ASCII |
---|
744 | my $buf = pack( 'A4/A*', "Humpty-Dumpty" ); |
---|
745 | # unpack $buf: '13  Humpty-Dumpty' |
---|
746 | my $txt = unpack( 'A4/A*', $buf ); |
---|
747 | |
---|
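For comparison, a sketch that builds the same string once with a textual length prefix and once with a binary one. The choice of C<n> (a 16-bit big-endian count) is only an example of a numeric length code.

```perl
use strict;
use warnings;

my $text = pack( 'A4/A*', 'Humpty-Dumpty' );    # length as ASCII, padded to 4
my $net  = pack( 'n/a*',  'Humpty-Dumpty' );    # length as 16-bit big-endian

my $from_text = unpack( 'A4/A*', $text );
my $from_net  = unpack( 'n/a*',  $net );

printf "'%s' (%d bytes), binary form %d bytes\n",
    $text, length( $text ), length( $net );
```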
748 | C</> is not implemented in Perls before 5.6, so if your code is required to |
---|
749 | work on older Perls you'll need to C<unpack( 'Z* Z* C')> to get the length, |
---|
750 | then use it to make a new unpack string. For example |
---|
751 | |
---|
752 | # pack a message: ASCIIZ, ASCIIZ, length, string, byte (5.005 compatible) |
---|
753 | my $msg = pack( 'Z* Z* C A* C', $src, $dst, length $sm, $sm, $prio ); |
---|
754 | |
---|
755 | # unpack |
---|
756 | ( undef, undef, $len) = unpack( 'Z* Z* C', $msg ); |
---|
757 | ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg ); |
---|
758 | |
---|
759 | But that second C<unpack> is rushing ahead. It isn't using a simple literal |
---|
760 | string for the template. So maybe we should introduce... |
---|
761 | |
---|
762 | =head2 Dynamic Templates |
---|
763 | |
---|
764 | So far, we've seen literals used as templates. If the list of pack |
---|
765 | items doesn't have fixed length, an expression constructing the |
---|
766 | template is required (whenever, for some reason, C<()*> cannot be used). |
---|
767 | Here's an example: To store named string values in a way that can be |
---|
768 | conveniently parsed by a C program, we create a sequence of names and |
---|
769 | null terminated ASCII strings, with C<=> between the name and the value, |
---|
770 | followed by an additional delimiting null byte. Here's how: |
---|
771 | |
---|
772 | my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C', |
---|
773 | map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 ); |
---|
774 | |
---|
775 | Let's examine the cogs of this byte mill, one by one. There's the C<map> |
---|
776 | call, creating the items we intend to stuff into the C<$env> buffer: |
---|
777 | to each key (in C<$_>) it adds the C<=> separator and the hash entry value. |
---|
778 | Each triplet is packed with the template code sequence C<A*A*Z*> that |
---|
779 | is repeated according to the number of keys. (Yes, that's what the C<keys> |
---|
780 | function returns in scalar context.) To get the very last null byte, |
---|
781 | we add a C<0> at the end of the C<pack> list, to be packed with C<C>. |
---|
782 | (Attentive readers may have noticed that we could have omitted the 0.) |
---|
783 | |
---|
784 | For the reverse operation, we'll have to determine the number of items |
---|
785 | in the buffer before we can let C<unpack> rip it apart: |
---|
786 | |
---|
787 | my $n = $env =~ tr/\0// - 1; |
---|
788 | my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) ); |
---|
789 | |
---|
790 | The C<tr> counts the null bytes. The C<unpack> call returns a list of |
---|
791 | name-value pairs each of which is taken apart in the C<map> block. |
---|
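Put together, the round trip might look like the following sketch. The hash contents are invented, and the approach assumes that neither names nor values contain C<=> or null bytes.

```perl
use strict;
use warnings;

# Hypothetical sample data; values must not contain '=' or null bytes.
my %Env = ( HOME => '/home/alice', SHELL => '/bin/sh', TERM => 'xterm' );

my $env = pack( '(A*A*Z*)' . keys( %Env ) . 'C',
                map( { ( $_, '=', $Env{$_} ) } keys( %Env ) ), 0 );

# One null byte per entry plus the final delimiter.
my $n   = $env =~ tr/\0// - 1;
my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) );

print "$_=$env{$_}\n" for sort keys %env;
```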
792 | |
---|
793 | |
---|
794 | =head2 Counting Repetitions |
---|
795 | |
---|
796 | Rather than storing a sentinel at the end of a data item (or a list of items), |
---|
797 | we could precede the data with a count. Again, we pack keys and values of |
---|
798 | a hash, preceding each with an unsigned short length count, and up front |
---|
799 | we store the number of pairs: |
---|
800 | |
---|
801 | my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env ); |
---|
802 | |
---|
803 | This simplifies the reverse operation as the number of repetitions can be |
---|
804 | unpacked with the C</> code: |
---|
805 | |
---|
806 | my %env = unpack( 'S/(S/A* S/A*)', $env ); |
---|
807 | |
---|
808 | Note that this is one of the rare cases where you cannot use the same |
---|
809 | template for C<pack> and C<unpack> because C<pack> can't determine |
---|
810 | a repeat count for a C<()>-group. |
---|
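A sketch of the counted variant with the same kind of invented data; the buffer length follows from one 2-byte count up front plus a 2-byte count in front of each of the six strings.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my %Env = ( HOME => '/home/alice', SHELL => '/bin/sh', TERM => 'xterm' );

# Number of pairs first, then each key and value with its own length.
my $env = pack( 'S(S/A* S/A*)*', scalar keys( %Env ), %Env );

# The leading count drives the repetition of the group.
my %env = unpack( 'S/(S/A* S/A*)', $env );

printf "%d bytes, %d pairs\n", length( $env ), scalar keys %env;
```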
811 | |
---|
812 | |
---|
813 | =head1 Packing and Unpacking C Structures |
---|
814 | |
---|
815 | In previous sections we have seen how to pack numbers and character |
---|
816 | strings. If it were not for a couple of snags we could conclude this |
---|
817 | section right away with the terse remark that C structures don't |
---|
818 | contain anything else, and therefore you already know all there is to it. |
---|
819 | Sorry, no: read on, please. |
---|
820 | |
---|
821 | =head2 The Alignment Pit |
---|
822 | |
---|
823 | In the consideration of speed against memory requirements the balance |
---|
824 | has been tilted in favor of faster execution. This has influenced the |
---|
825 | way C compilers allocate memory for structures: On architectures |
---|
826 | where a 16-bit or 32-bit operand can be moved faster between places in |
---|
827 | memory, or to or from a CPU register, if it is aligned at an even or |
---|
828 | multiple-of-four or even at a multiple-of-eight address, a C compiler |
---|
829 | will give you this speed benefit by stuffing extra bytes into structures. |
---|
830 | If you don't cross the C shoreline this is not likely to cause you any |
---|
831 | grief (although you should care when you design large data structures, |
---|
832 | or you want your code to be portable between architectures (you do want |
---|
833 | that, don't you?)). |
---|
834 | |
---|
835 | To see how this affects C<pack> and C<unpack>, we'll compare these two |
---|
836 | C structures: |
---|
837 | |
---|
838 | typedef struct { |
---|
839 | char c1; |
---|
840 | short s; |
---|
841 | char c2; |
---|
842 | long l; |
---|
843 | } gappy_t; |
---|
844 | |
---|
845 | typedef struct { |
---|
846 | long l; |
---|
847 | short s; |
---|
848 | char c1; |
---|
849 | char c2; |
---|
850 | } dense_t; |
---|
851 | |
---|
852 | Typically, a C compiler allocates 12 bytes to a C<gappy_t> variable, but |
---|
853 | requires only 8 bytes for a C<dense_t>. After investigating this further, |
---|
854 | we can draw memory maps, showing where the extra 4 bytes are hidden: |
---|
855 | |
---|
856 | 0           +4          +8          +12 |
---|
857 | +--+--+--+--+--+--+--+--+--+--+--+--+ |
---|
858 | |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte |
---|
859 | +--+--+--+--+--+--+--+--+--+--+--+--+ |
---|
860 | gappy_t |
---|
861 | |
---|
862 | 0           +4          +8 |
---|
863 | +--+--+--+--+--+--+--+--+ |
---|
864 | |     l     |  s  |c1|c2| |
---|
865 | +--+--+--+--+--+--+--+--+ |
---|
866 | dense_t |
---|
867 | |
---|
868 | And that's where the first quirk strikes: C<pack> and C<unpack> |
---|
869 | templates have to be stuffed with C<x> codes to get those extra fill bytes. |
---|
870 | |
---|
871 | The natural question: "Why can't Perl compensate for the gaps?" warrants |
---|
872 | an answer. One good reason is that C compilers might provide (non-ANSI) |
---|
873 | extensions permitting all sorts of fancy control over the way structures |
---|
874 | are aligned, even at the level of an individual structure field. And, if |
---|
875 | this were not enough, there is an insidious thing called C<union> where |
---|
876 | the amount of fill bytes cannot be derived from the alignment of the next |
---|
877 | item alone. |
---|
878 | |
---|
879 | OK, so let's bite the bullet. Here's one way to get the alignment right |
---|
880 | by inserting template codes C<x>, which don't take a corresponding item |
---|
881 | from the list: |
---|
882 | |
---|
883 | my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l ); |
---|
884 | |
---|
885 | Note the C<!> after C<l>: We want to make sure that we pack a long |
---|
886 | integer as it is compiled by our C compiler. And even now, it will only |
---|
887 | work for the platforms where the compiler aligns things as above. |
---|
888 | And somebody somewhere has a platform where it doesn't. |
---|
889 | [Probably a Cray, where C<short>s, C<int>s and C<long>s are all 8 bytes. :-)] |
---|
890 | |
---|
891 | Counting bytes and watching alignments in lengthy structures is bound to |
---|
892 | be a drag. Isn't there a way we can create the template with a simple |
---|
893 | program? Here's a C program that does the trick: |
---|
894 | |
---|
895 | #include <stdio.h> |
---|
896 | #include <stddef.h> |
---|
897 | |
---|
898 | typedef struct { |
---|
899 | char fc1; |
---|
900 | short fs; |
---|
901 | char fc2; |
---|
902 | long fl; |
---|
903 | } gappy_t; |
---|
904 | |
---|
905 | #define Pt(struct,field,tchar) \ |
---|
906 | printf( "@%d%s ", (int)offsetof(struct,field), # tchar ); |
---|
907 | |
---|
908 | int main() { |
---|
909 | Pt( gappy_t, fc1, c ); |
---|
910 | Pt( gappy_t, fs, s! ); |
---|
911 | Pt( gappy_t, fc2, c ); |
---|
912 | Pt( gappy_t, fl, l! ); |
---|
913 | printf( "\n" ); |
---|
914 | } |
---|
915 | |
---|
916 | The output line can be used as a template in a C<pack> or C<unpack> call: |
---|
917 | |
---|
918 | my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l ); |
---|
919 | |
---|
920 | Gee, yet another template code - as if we hadn't plenty. But |
---|
921 | C<@> saves our day by enabling us to specify the offset from the beginning |
---|
922 | of the pack buffer to the next item: This is just the value |
---|
923 | the C<offsetof> macro (defined in C<E<lt>stddef.hE<gt>>) returns when |
---|
924 | given a C<struct> type and one of its field names ("member-designator" in |
---|
925 | C standardese). |
---|
926 | |
---|
927 | Neither using offsets nor adding C<x>'s to bridge the gaps is satisfactory. |
---|
928 | (Just imagine what happens if the structure changes.) What we really need |
---|
929 | is a way of saying "skip as many bytes as required to the next multiple of N". |
---|
930 | In fluent Templatese, you say this with C<x!N> where N is replaced by the |
---|
931 | appropriate value. Here's the next version of our struct packaging: |
---|
932 | |
---|
933 | my $gappy = pack( 'c x!2 s c x!4 l!', $c1, $s, $c2, $l ); |
---|
934 | |
---|
935 | That's certainly better, but we still have to know how long all the |
---|
936 | integers are, and portability is far away. Rather than C<2>, |
---|
937 | for instance, we want to say "however long a short is". But this can be |
---|
938 | done by enclosing the appropriate pack code in brackets: C<[s]>. So, here's |
---|
939 | the very best we can do: |
---|
940 | |
---|
941 | my $gappy = pack( 'c x![s] s c x![l!] l!', $c1, $s, $c2, $l ); |
---|
942 | |
---|
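The three ways of getting the layout right can be compared from Perl itself. This sketch assumes 2-byte shorts and a long of 4 or 8 bytes; on such platforms all three templates should yield the same buffer, but that is an assumption, not a guarantee.

```perl
use strict;
use warnings;

my @fields = ( 1, 2, 3, 4 );    # c1, s, c2, l

my $by_fill   = pack( 'cxs cxxx l!',           @fields );
my $by_offset = pack( '@0c @2s @4c @8l!',      @fields );
my $by_align  = pack( 'c x![s] s c x![l!] l!', @fields );

# Measure partial templates to see where pack places s and l.
my $off_s = length( pack( 'c x![s]', 1 ) );
my $off_l = length( pack( 'c x![s] s c x![l!]', 1, 2, 3 ) );

# The alignment codes work in unpack as well.
my @back = unpack( 'c x![s] s c x![l!] l!', $by_align );

printf "s at %d, l at %d, total %d bytes\n", $off_s, $off_l, length( $by_align );
print $by_align eq $by_fill ? "same layout\n" : "layouts differ on this platform\n";
```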
943 | |
---|
944 | =head2 Alignment, Take 2 |
---|
945 | |
---|
946 | I'm afraid that we're not quite through with the alignment catch yet. The |
---|
947 | hydra raises another ugly head when you pack arrays of structures: |
---|
948 | |
---|
949 | typedef struct { |
---|
950 | short count; |
---|
951 | char glyph; |
---|
952 | } cell_t; |
---|
953 | |
---|
954 | typedef cell_t buffer_t[BUFLEN]; |
---|
955 | |
---|
956 | Where's the catch? Padding is neither required before the first field C<count>, |
---|
957 | nor between this and the next field C<glyph>, so why can't we simply pack |
---|
958 | like this: |
---|
959 | |
---|
960 | # something goes wrong here: |
---|
961 | pack( 's!a' x @buffer, |
---|
962 | map{ ( $_->{count}, $_->{glyph} ) } @buffer ); |
---|
963 | |
---|
964 | This packs C<3*@buffer> bytes, but it turns out that the size of |
---|
965 | C<buffer_t> is four times C<BUFLEN>! The moral of the story is that |
---|
966 | the required alignment of a structure or array is propagated to the |
---|
967 | next higher level where we have to consider padding I<at the end> |
---|
968 | of each component as well. Thus the correct template is: |
---|
969 | |
---|
970 | pack( 's!ax' x @buffer, |
---|
971 | map{ ( $_->{count}, $_->{glyph} ) } @buffer ); |
---|
972 | |
---|
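The effect of the trailing fill byte shows up when both templates are applied to the same data. The cells in this sketch are invented, and a group template is used for unpacking as an alternative to C<x>-multiplication.

```perl
use strict;
use warnings;

# Hypothetical sample cells.
my @buffer = (
    { count => 3, glyph => 'a' },
    { count => 1, glyph => 'b' },
    { count => 7, glyph => 'c' },
);
my @items = map { ( $_->{count}, $_->{glyph} ) } @buffer;

my $tight  = pack( 's!a'  x @buffer, @items );    # no padding at the end
my $padded = pack( 's!ax' x @buffer, @items );    # one fill byte per cell

# Unpacking skips the fill byte again.
my @cells = unpack( '(s!ax)*', $padded );

printf "tight: %d bytes, padded: %d bytes\n", length( $tight ), length( $padded );
```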
973 | =head2 Alignment, Take 3 |
---|
974 | |
---|
975 | And even if you take all the above into account, ANSI still lets this: |
---|
976 | |
---|
977 | typedef struct { |
---|
978 | char foo[2]; |
---|
979 | } foo_t; |
---|
980 | |
---|
981 | vary in size. The alignment constraint of the structure can be greater than |
---|
982 | any of its elements. [And if you think that this doesn't affect anything |
---|
983 | common, dismember the next cellphone that you see. Many have ARM cores, and |
---|
984 | the ARM structure rules make C<sizeof (foo_t)> == 4] |
---|
985 | |
---|
986 | =head2 Pointers for How to Use Them |
---|
987 | |
---|
988 | The title of this section indicates the second problem you may run into |
---|
989 | sooner or later when you pack C structures. If the function you intend |
---|
990 | to call expects a, say, C<void *> value, you I<cannot> simply take |
---|
991 | a reference to a Perl variable. (Although that value certainly is a |
---|
992 | memory address, it's not the address where the variable's contents are |
---|
993 | stored.) |
---|
994 | |
---|
995 | Template code C<P> promises to pack a "pointer to a fixed length string". |
---|
996 | Isn't this what we want? Let's try: |
---|
997 | |
---|
998 | # allocate some storage and pack a pointer to it |
---|
999 | my $memory = "\x00" x $size; |
---|
1000 | my $memptr = pack( 'P', $memory ); |
---|
1001 | |
---|
1002 | But wait: doesn't C<pack> just return a sequence of bytes? How can we pass this |
---|
1003 | string of bytes to some C code expecting a pointer which is, after all, |
---|
1004 | nothing but a number? The answer is simple: We have to obtain the numeric |
---|
1005 | address from the bytes returned by C<pack>. |
---|
1006 | |
---|
1007 | my $ptr = unpack( 'L!', $memptr ); |
---|
1008 | |
---|
1009 | Obviously this assumes that it is possible to typecast a pointer |
---|
1010 | to an unsigned long and vice versa, which frequently works but should not |
---|
1011 | be taken as a universal law. Now that we have this pointer, the next question |
---|
1012 | is: How can we put it to good use? We need a call to some C function |
---|
1013 | where a pointer is expected. The read(2) system call comes to mind: |
---|
1014 | |
---|
1015 | ssize_t read(int fd, void *buf, size_t count); |
---|
1016 | |
---|
1017 | After reading L<perlfunc> explaining how to use C<syscall> we can write |
---|
1018 | this Perl function copying a file to standard output: |
---|
1019 | |
---|
1020 | require 'syscall.ph'; |
---|
1021 | sub cat($){ |
---|
1022 | my $path = shift(); |
---|
1023 | my $size = -s $path; |
---|
1024 | my $memory = "\x00" x $size; # allocate some memory |
---|
1025 | my $ptr = unpack( 'L!', pack( 'P', $memory ) ); |
---|
1026 | open( F, $path ) || die( "$path: cannot open ($!)\n" ); |
---|
1027 | my $fd = fileno(F); |
---|
1028 | my $res = syscall( &SYS_read, $fd, $ptr, $size ); |
---|
1029 | print $memory; |
---|
1030 | close( F ); |
---|
1031 | } |
---|
1032 | |
---|
1033 | This is neither a specimen of simplicity nor a paragon of portability but |
---|
1034 | it illustrates the point: We are able to sneak behind the scenes and |
---|
1035 | access Perl's otherwise well-guarded memory! (Important note: Perl's |
---|
1036 | C<syscall> does I<not> require you to construct pointers in this roundabout |
---|
1037 | way. You simply pass a string variable, and Perl forwards the address.) |
---|
1038 | |
---|
1039 | How does C<unpack> with C<P> work? Imagine some pointer in the buffer |
---|
1040 | about to be unpacked: If it isn't the null pointer (which will smartly |
---|
1041 | produce the C<undef> value) we have a start address - but then what? |
---|
1042 | Perl has no way of knowing how long this "fixed length string" is, so |
---|
1043 | it's up to you to specify the actual size as an explicit length after C<P>. |
---|
1044 | |
---|
1045 | my $mem = "abcdefghijklmn"; |
---|
1046 | print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde" |
---|
1047 | |
---|
1048 | As a consequence, C<pack> ignores any number or C<*> after C<P>. |
---|
1049 | |
---|
1050 | |
---|
1051 | Now that we have seen C<P> at work, we might as well give C<p> a whirl. |
---|
1052 | Why do we need a second template code for packing pointers at all? The |
---|
1053 | answer lies behind the simple fact that an C<unpack> with C<p> promises |
---|
1054 | a null-terminated string starting at the address taken from the buffer, |
---|
1055 | and that implies a length for the data item to be returned: |
---|
1056 | |
---|
1057 | my $buf = pack( 'p', "abc\x00efhijklmn" ); |
---|
1058 | print unpack( 'p', $buf ); # prints "abc" |
---|
1059 | |
---|
1060 | |
---|
1061 | |
---|
1062 | This is apt to be confusing, though: As a consequence of the length being |
---|
1063 | implied by the string's length, a number after pack code C<p> is a repeat |
---|
1064 | count, not a length as after C<P>. |
---|
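The different meaning of the number after C<p> and C<P> can be seen in a small sketch. The strings live in named variables that stay in scope, so the packed pointers remain valid while they are used; the variable names are illustrative.

```perl
use strict;
use warnings;

my @names = ( 'alpha', 'beta' );

# After p, the number is a repeat count: two pointers are packed.
my $ptrs  = pack( 'p2', @names );
my @again = unpack( 'p2', $ptrs );    # follows both pointers

# After P, the number is a length: one pointer, five bytes fetched.
my $mem   = 'abcdefghijklmn';
my $piece = unpack( 'P5', pack( 'P', $mem ) );

print "@again $piece\n";    # alpha beta abcde
```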
1065 | |
---|
1066 | |
---|
1067 | Using C<pack(..., $x)> with C<P> or C<p> to get the address where C<$x> is |
---|
1068 | actually stored calls for circumspection. Perl's internal machinery |
---|
1069 | considers the relation between a variable and that address as its very own |
---|
1070 | private matter and doesn't really care that we have obtained a copy. Therefore: |
---|
1071 | |
---|
1072 | =over 4 |
---|
1073 | |
---|
1074 | =item * |
---|
1075 | |
---|
1076 | Do not use C<pack> with C<p> or C<P> to obtain the address of a variable |
---|
1077 | that's bound to go out of scope (thereby freeing its memory) before you |
---|
1078 | are done using the memory at that address. |
---|
1079 | |
---|
1080 | =item * |
---|
1081 | |
---|
1082 | Be very careful with Perl operations that change the value of the |
---|
1083 | variable. Appending something to the variable, for instance, might require |
---|
1084 | reallocation of its storage, leaving you with a pointer into no-man's land. |
---|
1085 | |
---|
1086 | =item * |
---|
1087 | |
---|
1088 | Don't think that you can get the address of a Perl variable |
---|
1089 | when it is stored as an integer or double number! C<pack('P', $x)> will |
---|
1090 | force the variable's internal representation to string, just as if you |
---|
1091 | had written something like C<$x .= ''>. |
---|
1092 | |
---|
1093 | =back |
---|
1094 | |
---|
1095 | It's safe, however, to P- or p-pack a string literal, because Perl simply |
---|
1096 | allocates an anonymous variable. |
---|
1097 | |
---|
1098 | |
---|
1099 | |
---|
1100 | =head1 Pack Recipes |
---|
1101 | |
---|
1102 | Here is a collection of (possibly) useful canned recipes for C<pack> |
---|
1103 | and C<unpack>: |
---|
1104 | |
---|
1105 | # Convert IP address for socket functions |
---|
1106 | pack( "C4", split /\./, "123.4.5.6" ); |
---|
1107 | |
---|
1108 | # Count the bits in a chunk of memory (e.g. a select vector) |
---|
1109 | unpack( '%32b*', $mask ); |
---|
1110 | |
---|
1111 | # Determine the endianness of your system |
---|
1112 | $is_little_endian = unpack( 'c', pack( 's', 1 ) ); |
---|
1113 | $is_big_endian = unpack( 'xc', pack( 's', 1 ) ); |
---|
1114 | |
---|
1115 | # Determine the number of bits in a native integer |
---|
1116 | $bits = unpack( '%32b*', pack( 'I!', ~0 ) ); |
---|
1117 | |
---|
1118 | # Prepare argument for the nanosleep system call |
---|
1119 | my $timespec = pack( 'L!L!', $secs, $nanosecs ); |
---|
1120 | |
---|
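A few of these recipes can be exercised with concrete values, as in this sketch. The bit mask is made up for the occasion.

```perl
use strict;
use warnings;

# Dotted quad to 4 bytes and back.
my $ip     = pack( 'C4', split /\./, '123.4.5.6' );
my $dotted = join '.', unpack( 'C4', $ip );

# Count the set bits: 8 + 4 + 1.
my $mask  = "\xFF\x0F\x01";
my $count = unpack( '%32b*', $mask );

# Exactly one of the two endianness flags is set.
my $is_little_endian = unpack( 'c',  pack( 's', 1 ) );
my $is_big_endian    = unpack( 'xc', pack( 's', 1 ) );

print "$dotted has ", length( $ip ), " bytes; $count bits set\n";
print $is_little_endian ? "little endian\n" : "big endian\n";
```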
1121 | For a simple memory dump we unpack some bytes into just as |
---|
1122 | many pairs of hex digits, and use C<map> to handle the traditional |
---|
1123 | spacing - 16 bytes to a line: |
---|
1124 | |
---|
1125 | my $i; |
---|
1126 | print map( ++$i % 16 ? "$_ " : "$_\n", |
---|
1127 | unpack( 'H2' x length( $mem ), $mem ) ), |
---|
1128 | length( $mem ) % 16 ? "\n" : ''; |
---|
1129 | |
---|
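Wrapped into a subroutine, the dump might look like this sketch. The name C<hexdump> is hypothetical, the result is returned as a string rather than printed, and a C<()>-group replaces the C<x>-multiplied template.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the recipe above; returns the dump as a string.
sub hexdump {
    my ( $mem ) = @_;
    my $i = 0;
    return join '',
        map( ++$i % 16 ? "$_ " : "$_\n", unpack( '(H2)*', $mem ) ),
        length( $mem ) % 16 ? "\n" : '';
}

my $dump = hexdump( join '', 'A' .. 'R' );    # 18 bytes
print $dump;
```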
1130 | |
---|
1131 | =head1 Funnies Section |
---|
1132 | |
---|
1133 | # Pulling digits out of nowhere... |
---|
1134 | print unpack( 'C', pack( 'x' ) ), |
---|
1135 | unpack( '%B*', pack( 'A' ) ), |
---|
1136 | unpack( 'H', pack( 'A' ) ), |
---|
1137 | unpack( 'A', unpack( 'C', pack( 'A' ) ) ), "\n"; |
---|
1138 | |
---|
1139 | # One for the road ;-) |
---|
1140 | my $advice = pack( 'all u can in a van' ); |
---|
1141 | |
---|
1142 | |
---|
1143 | =head1 Authors |
---|
1144 | |
---|
1145 | Simon Cozens and Wolfgang Laun. |
---|
1146 | |
---|