Recipe 4.4 Implementing a Sparse Array

4.4.1 Problem

An array with large, unoccupied expanses between occupied elements wastes memory. How do you reduce that overhead?

4.4.2 Solution

Use a hash instead of an array.

4.4.3 Discussion

If you assign to the millionth element of an array, Perl allocates a million and one slots to store scalars. Only the last element contains interesting data, leaving earlier ones each set to undef at a cost of four (or more) bytes per unoccupied slot.

In Perl 5.6 and later, if you grow an array by assigning either past the end or directly to $#ARRAY, you can distinguish these implicit undefs from those that would result from assigning undef there by using exists instead of defined, just as you would with a hash.

$#foo = 5;
@bar = ( (undef) x 5 ) ;

printf "foo element 3 is%s defined\n",
        defined $foo[3] ? "" : "n't";
printf "foo element 3 does%s exist\n",
        exists $foo[3] ? "" : "n't";
printf "bar element 3 is%s defined\n",
        defined $bar[3] ? "" : "n't";
printf "bar element 3 does%s exist\n",
        exists $bar[3] ? "" : "n't";

foo element 3 isn't defined
foo element 3 doesn't exist
bar element 3 isn't defined
bar element 3 does exist

However, you still waste a lot of space. That's because Perl's array implementation reserves a contiguous vector of pointers, one for each position up to the highest occupied one.

$real_array[ 1_000_000 ] = 1;       # costs 4+ megabytes

A hash works differently: you pay only for what you really use, not for unoccupied positions. Although a hash element costs somewhat more than an array element because you need to store both the value and its key, with sparse arrays, the savings can be astonishing.

$fake_array{ 1_000_000 } = 1;       # costs 28 bytes
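The difference is easy to see by counting what each structure holds rather than measuring bytes. The following is only a minimal sketch: the variable names $idx, $array_slots, $hash_entries, and $missing are illustrative and not part of the recipe, and exact memory figures vary by Perl version and platform.

```perl
use strict;
use warnings;

my $idx = 1_000_000;

my @real_array;
$real_array[$idx] = 1;          # array grows to cover every lower index

my %fake_array;
$fake_array{$idx} = 1;          # hash stores just this one key/value pair

my $array_slots  = scalar @real_array;        # positions reserved
my $hash_entries = scalar keys %fake_array;   # pairs actually stored

printf "array slots:  %d\n", $array_slots;
printf "hash entries: %d\n", $hash_entries;

# An index that was never assigned simply has no key in the hash.
my $missing = exists $fake_array{500_000} ? 1 : 0;
printf "index 500000 %s\n", $missing ? "is present" : "is absent";
```

The array reports one slot for every index from 0 through 1,000,000, while the hash reports a single entry.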

What's the trade-off? Because a hash's keys aren't ordered, a little more work is needed to sort the numeric keys so you can handle their values in the same order as you would if they were stored as a real array. With an array, you'd just do this to process elements in index order:

foreach $element ( @real_array ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( 0 .. $#real_array ) {
    # do something with $real_array[$idx]
}

Using a hash representation, you should instead do either this to process elements in index order:

foreach $element ( @fake_array{ sort {$a <=> $b} keys %fake_array } ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( sort {$a <=> $b} keys %fake_array ) {
    # do something with $fake_array{$idx}
}
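The explicit {$a <=> $b} comparison matters because sort compares keys as strings by default, which puts 10 ahead of 9. Here is a small sketch with made-up sample data; the variable $highest is an assumed stand-in for what $#real_array would give you with a real array.

```perl
use strict;
use warnings;

# Hypothetical sparse data: only three indices are occupied.
my %fake_array = (9 => 'nine', 10 => 'ten', 1000000 => 'million');

# Default sort compares keys as strings, so the order is not numeric.
my @string_order  = sort keys %fake_array;

# A numeric comparison restores true index order.
my @numeric_order = sort { $a <=> $b } keys %fake_array;

# The largest key plays the role of $#real_array.
my $highest = $numeric_order[-1];

print "string order:  @string_order\n";
print "numeric order: @numeric_order\n";
print "highest index: $highest\n";
```

Sorting costs time proportional to N log N in the number of occupied indices, so if you traverse in order often, it may be worth caching the sorted key list.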

If you don't care about handling elements in a particular order, however, you don't need to go through all that. Just process the values according to their internal order, either like this:

foreach $element ( values %fake_array ) {
    # do something with $element
}

or like this:

# process indices in internal hash order
foreach $idx ( keys %fake_array ) {
    # do something with $fake_array{$idx}
}

If you're determined to use an array, two fairly specialized cases occasionally arise in which you can save substantial amounts of memory by using an alternate storage scheme. Both cases also apply to arrays that are densely populated, not just those that are mostly empty.

The first case shows up when you grow an array by repeatedly appending new elements until its subscripts become large. Because of how Perl reallocates memory for growing arrays, this can use up to four times the memory you really need. If you happen to know how big the array will (or might) eventually become, you can avoid this reallocation overhead either by storing the large subscripts first instead of the small ones:

for ($i = 10_000; $i >= 0; $i--) { $real_array[$i] = 1 }

or by presizing the array by assigning to the special $#ARRAY notation:

$#real_array = 10_000;
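One point to watch when presizing: the array's length changes immediately, so the reserved slots must be filled by index rather than with push. The sketch below illustrates this; the names $slots, $occupied, and $after_fill are assumptions for demonstration only.

```perl
use strict;
use warnings;

my @real_array;
$#real_array = 10_000;                  # reserve indices 0 .. 10_000 up front

my $slots    = scalar @real_array;      # length already reflects the presize
my $occupied = grep { exists $real_array[$_] } 0 .. $#real_array;

printf "slots reserved: %d, slots occupied: %d\n", $slots, $occupied;

# Fill the reserved slots by index; the array does not need to grow.
$real_array[$_] = 1 for 0 .. 10_000;
my $after_fill = scalar @real_array;

printf "slots after filling: %d\n", $after_fill;
```

Because the length is already 10,001 after presizing, a push at that point would add an element at index 10,001 instead of filling slot 0.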

The second special case comes up when each array element holds nothing but a single one-bit value, essentially either a true or a false. For example, suppose you are keeping track of numbered USENET news articles, and you only need to know whether a given article number has been read. For situations like this, use a bit vector instead of a real array:

my $have_read = '';
for ($i = 10_000; $i >= 0; $i--) { vec($have_read, $i, 1) = 1 }

Then you can check to see whether a given article has been read this way:

if (vec($have_read, $artno, 1)) { .... }
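To get a feel for the savings, you can inspect the bit vector directly, since it is an ordinary string. This is a sketch only; the article numbers are invented, and the %32b* unpack template for counting set bits is the idiom shown for unpack in perlfunc(1).

```perl
use strict;
use warnings;

my $have_read = '';

# Hypothetical article numbers that have been read.
for my $artno (3, 17, 10_000) {
    vec($have_read, $artno, 1) = 1;
}

my $bytes   = length $have_read;              # one bit per article number
my $count   = unpack("%32b*", $have_read);    # number of bits that are set
my $seen_17 = vec($have_read, 17, 1);
my $seen_18 = vec($have_read, 18, 1);
my $beyond  = vec($have_read, 50_000, 1);     # reading past the end yields 0

printf "vector is %d bytes long with %d articles marked read\n",
        $bytes, $count;
printf "article 17 read: %d, article 18 read: %d, article 50000 read: %d\n",
        $seen_17, $seen_18, $beyond;
```

Tracking article numbers 0 through 10,000 takes 1,251 bytes here, compared with at least one pointer-sized slot per element in a real array.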

4.4.4 See Also

The vec function in perlfunc(1) and in Chapter 29 of Programming Perl
