Recipe 1.1 Accessing Substrings

1.1.1 Problem

You want to access or modify just a portion of a string, not the whole thing. For instance, you've read a fixed-width record and want to extract individual fields.

1.1.2 Solution

The substr function lets you read from and write to specific portions of the string.

$value = substr($string, $offset, $count);
$value = substr($string, $offset);

substr($string, $offset, $count) = $newstring;
substr($string, $offset, $count, $newstring);  # same as previous
substr($string, $offset)         = $newtail;

The unpack function gives only read access, but is faster when you have many substrings to extract.

# get a 5-byte string, skip 3 bytes,
# then grab two 8-byte strings, then the rest;
# (NB: only works on ASCII data, not Unicode)
($leading, $s1, $s2, $trailing) =
    unpack("A5 x3 A8 A8 A*", $data);

# split at 5-byte boundaries
@fivers = unpack("A5" x (length($string)/5), $string);

# chop string into individual single-byte characters
# ("a1" rather than "A1": "A" strips trailing spaces, so a lone space would become "")
@chars  = unpack("a1" x length($string), $string);

1.1.3 Discussion

Strings are a basic data type; they aren't arrays of a basic data type. Instead of using array subscripting to access individual characters as you sometimes do in other programming languages, in Perl you use functions like unpack or substr to access individual characters or a portion of the string.

The offset argument to substr indicates the start of the substring you're interested in, counting from the front if positive and from the end if negative. If the offset is 0, the substring starts at the beginning. The count argument is the length of the substring.

$string = "This is what you have";
#         +012345678901234567890  Indexing forwards  (left to right)
#          109876543210987654321- Indexing backwards (right to left)
#           note that 0 means 10 or 20, etc. above

$first  = substr($string, 0, 1);  # "T"
$start  = substr($string, 5, 2);  # "is"
$rest   = substr($string, 13);    # "you have"
$last   = substr($string, -1);    # "e"
$end    = substr($string, -4);    # "have"
$piece  = substr($string, -8, 3); # "you"
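
The count may also be negative, in which case substr leaves that many characters off the end of the string instead of taking a fixed length. A minimal sketch, reusing the same sample string (the variable names $middle and $word are only illustrative):

```perl
use strict;
use warnings;

my $string = "This is what you have";

# Start at offset 5 and stop 5 characters before the end.
my $middle = substr($string, 5, -5);

# A negative offset and a negative count can be combined:
# start 8 from the end, stop 5 from the end.
my $word = substr($string, -8, -5);

print "$middle\n";   # is what you
print "$word\n";     # you
```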

You can do more than just look at parts of the string with substr; you can actually change them. That's because substr is a particularly odd kind of function: an lvaluable one, that is, a function whose return value may itself be assigned a value. (For the record, the others are vec, pos, and keys. If you squint, local, my, and our can also be viewed as lvaluable functions.)

$string = "This is what you have";
print $string;                     # This is what you have
substr($string, 5, 2) = "wasn't";  # change "is" to "wasn't"
                                   # This wasn't what you have
substr($string, -12)  = "ondrous"; # replace the last 12 characters
                                   # This wasn't wondrous
substr($string, 0, 1) = "";        # delete first character
                                   # his wasn't wondrous
substr($string, -10)  = "";        # delete last 10 characters
                                   # his wasn'
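
The four-argument form of substr shown in the Solution does the same replacement but also returns the text that was there before, which is handy when you need the old value. A small sketch (the variable name $old is just for illustration):

```perl
use strict;
use warnings;

my $string = "This is what you have";

# Replace 2 characters at offset 5 and capture what they used to be.
my $old = substr($string, 5, 2, "wasn't");

print "$old\n";      # is
print "$string\n";   # This wasn't what you have
```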

Use the =~ operator and the s///, m//, or tr/// operators in conjunction with substr to make them affect only that portion of the string.

# you can test substrings with =~
if (substr($string, -10) =~ /pattern/) {
    print "Pattern matches in last 10 characters\n";
}

# substitute "at" for "is", restricted to first five characters
substr($string, 0, 5) =~ s/is/at/g;
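
To see what the restriction buys you, compare the same substitution applied to the first five characters and to the whole string. This is only a sketch; the variable names $limited and $whole are made up for the comparison:

```perl
use strict;
use warnings;

my $limited = "This is what you have";
my $whole   = "This is what you have";

substr($limited, 0, 5) =~ s/is/at/g;   # only "This " is searched
$whole =~ s/is/at/g;                    # every "is" is replaced

print "$limited\n";   # That is what you have
print "$whole\n";     # That at what you have
```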

You can even swap values by using several substrs on each side of an assignment:

# exchange the first and last letters in a string
$a = "make a hat";
(substr($a,0,1), substr($a,-1)) = 
(substr($a,-1),  substr($a,0,1));
print $a;                  # prints "take a ham"

Although unpack is not lvaluable, it is considerably faster than substr when you extract numerous values all at once. Specify a format describing the layout of the record to unpack. For positioning, use lowercase "x" with a count to skip forward some number of bytes, an uppercase "X" with a count to skip backward some number of bytes, and an "@" to skip to an absolute byte offset within the record. (If the data contains Unicode strings, be careful with those three: they're strictly byte-oriented, and moving around by bytes within multibyte data is perilous at best.)

# extract column with unpack
$a = "To be or not to be";
$b = unpack("x6 A6", $a);  # skip 6, grab 6
print $b;                  # prints "or not"

($b, $c) = unpack("x6 A2 X5 A2", $a); # forward 6, grab 2; backward 5, grab 2
print "$b\n$c\n";          # prints "or" and "be" on separate lines
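
The "@" positioner mentioned above jumps to an absolute offset instead of moving relative to the current position. A brief sketch on the same sample sentence (the template is in single quotes so that "@9" cannot be mistaken for an array; the variable names are illustrative):

```perl
use strict;
use warnings;

my $line = "To be or not to be";

# Grab 2 bytes from the start, jump to absolute offset 9
# (counting from 0), then grab 3 bytes from there.
my ($start, $word) = unpack('A2 @9 A3', $line);

print "$start|$word\n";   # To|not
```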

Sometimes you prefer to think of your data as being cut up at specific columns. For example, you might want to place cuts right before positions 8, 14, 20, 26, and 30. Those are the column numbers where each field begins. Although you could calculate that the proper unpack format is "A7 A6 A6 A6 A4 A*", this is too much mental strain for the virtuously lazy Perl programmer. Let Perl figure it out for you. Use the cut2fmt function:

sub cut2fmt {
    my(@positions) = @_;
    my $template   = '';
    my $lastpos    = 1;
    foreach my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

$fmt = cut2fmt(8, 14, 20, 26, 30);
print "$fmt\n";            # prints "A7 A6 A6 A6 A4 A*"
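
The template that cut2fmt returns can be handed straight to unpack. The sketch below repeats the subroutine so that it runs on its own, and applies it to a made-up record whose fields begin at columns 1, 8, and 14; the record layout and the variable names are assumptions for illustration only:

```perl
use strict;
use warnings;

# Same logic as cut2fmt above, repeated so this example is self-contained.
sub cut2fmt {
    my (@positions) = @_;
    my $template = '';
    my $lastpos  = 1;
    foreach my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

# Hypothetical record: surname in columns 1-7, first name in 8-13,
# department from column 14 to the end.
my $record = "Smith  John  Engineering";
my $fmt    = cut2fmt(8, 14);
my ($surname, $first, $dept) = unpack($fmt, $record);

print "$fmt\n";                      # A7 A6 A*
print "$surname|$first|$dept\n";     # Smith|John|Engineering
```

Because the "A" format strips trailing spaces, the padding between the fields disappears from the extracted values.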

The powerful unpack function goes far beyond mere text processing. It's the gateway between text and binary data.
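
As a small taste of the binary side, the sketch below packs two integers into raw bytes and reads them back. The "N" and "n" formats are big-endian unsigned 32-bit and 16-bit integers; the values 258 and 7 are arbitrary:

```perl
use strict;
use warnings;

# Pack a 32-bit and a 16-bit unsigned integer, both big-endian.
my $bytes = pack("N n", 258, 7);

print length($bytes), "\n";          # 6
print unpack("H*", $bytes), "\n";    # 000001020007

# The same template turns the bytes back into numbers.
my ($long, $short) = unpack("N n", $bytes);
print "$long $short\n";              # 258 7
```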

In this recipe, we've assumed that all character data is 7- or 8-bit data so that pack's byte operations work as expected.

1.1.4 See Also

The pack, unpack, and substr functions in perlfunc(1) and in Chapter 29 of Programming Perl; use of the cut2fmt subroutine in Recipe 1.24; the binary use of unpack in Recipe 8.24