5.3 Restriction.pm: Finding Recognition Sites

The time has now come to write the class that creates an object out of a sequence, enzyme name(s), and the map of the location(s) of the enzyme recognition sites in the sequence.

This module depends a great deal on the module Rebase developed in the previous section, but it is fairly short because it just tries to do a small job. This new Restriction class takes a Rebase object (which has the Rebase database translated to regular expressions), some sequence, and a list of enzymes, and uses the regular expressions to find the recognition sites in the sequence. (Note that it doesn't use inheritance; it simply creates a Rebase object to use.)

In this module, the restriction map (the list of locations where the enzymes have recognition sites in the sequence) is obtained through an auxiliary method map_enzyme that simply lists the locations. Clearly, a more graphical display would be easier and more useful. I'll consider that as the book progresses.

5.3.1 The Restriction.pm Module

Here is the Restriction.pm module:

package Restriction;

#
# A class to find locations of restriction enzyme recognition sites in
#  DNA sequence data.
#

use strict;
use warnings;
use Carp;

# Class data and methods
{
   # A list of all attributes with default values.
   # "enzyme" is given as an argument possibly multiple time, set as key to _map hash
    my %_attributes = (
        _rebase      => { },  # A Rebase.pm hash-based object
        # key   = restriction enzyme name
        # value = space-separated string of recognition sites => regular expressions
        _sequence    => '', # DNA sequence data in raw format (only bases)
        _map         => { },# a hash: keys are enzyme names, 
                            # values are arrays of locations 
        _enzyme      => '', # space- or comma-separated enzyme names, 
                            # set as key to _map hash
    );

    # Global variable to keep count of existing objects
    my $_count = 0;

    # Return a list of all attributes
    sub _all_attributes {
        keys %_attributes;
    }

    # Manage the count of existing objects
    sub get_count {
        $_count;
    }
    sub _incr_count {
        ++$_count;
    }
    sub _decr_count {
        --$_count;
    }
}

This opening block shows that Restriction is also a class with a small and simple set of attributes. One is just the DNA sequence data, _sequence.

The attribute _map is a hash that stores the computed restriction map with a key for each restriction enzyme and a value consisting of an array of the positions in the sequence where recognition sites for that enzyme occur.

The attribute _enzyme is one or more enzyme names separated by spaces or commas.

The remaining attribute _rebase is an object of the class I developed in the last section, the class Rebase. Recall that a Rebase object gives you the ability to get regular expressions for recognition sites of restriction enzymes. When you call Restriction->new you must include as one of its arguments a Rebase object, as is shown in the example given in the POD documentation later in this section.

Since a program can have many restriction maps, I also maintain the count of Restriction objects.

5.3.1.1 Initializing Restriction objects

Next, the new constructor method creates and initializes Restriction objects:

# The "new" constructor method, called from class, e.g.
sub new {
    my ($class, %args) = @_;
    # Create a new object
    my $self = bless {  }, $class;

    # Set the attributes for the provided arguments
    foreach my $attribute ($self->_all_attributes(  )) {

        # E.g. attribute = "_name",  argument = "name"
        my($argument) = ($attribute =~ /^_(.*)/);

        if (exists $args{$argument}) {
            if($argument eq 'enzyme') {
                # permit space or comma separated enzyme names
                $args{$argument} =~ s/,/ /g;
            }
            $self->{$attribute} = $args{$argument};
        }
    }

    # Check that the correct arguments are given
    if( not defined $self->{_rebase} ) {
        croak "A Rebase object must be given as an argument";
    }elsif( ref($self->{_rebase}) ne 'Rebase' ) {
        croak "The argument to rebase is not a Rebase object";
    }elsif( not defined $self->{_sequence} ) {
        croak "A sequence must be given as an argument";
    }

    # Calculate the locations for each enzyme, store in _map hash attribute
    foreach my $enzyme (split(" ", $self->{_enzyme})) {
        $self->map_enzyme($enzyme);
    }

    $self->_incr_count;

    return $self;
}

# For this simple class I have no AUTOLOAD or DESTROY

# No get_rebase method, I don't want to pass around a huge hash

# No set mutators: all initialization done by way of "new" constructor

# No clone method.  Each sequence and set of enzymes can be easily calculated
#  by means of a "new" command.
5.3.1.2 The methods explained

The new constructor method checks that the proper and required arguments are provided: a sequence, enzyme(s), and a Rebase object.

Then the new constructor performs the required computation by calling the method map_enzyme for each enzyme to determine the (possibly empty) array of locations in which the enzyme can cut the sequence. These enzyme-array key/value pairs are stored in the _map attribute.

As you can see in the next segment of code from Restriction.pm, the method map_enzyme works by getting the regular expressions that have been calculated for the enzyme and then finding the positions where the regular expressions match the sequence.

The get_regular_expressions method accesses the Rebase object that was passed to the Restriction->new constructor method and appears as the $self->{_rebase} attribute. In that object, the attribute _rebase is the hash of the database, which looks for the value of the key $enzyme.

The match_positions method puts a match of a regular expression in a while loop, where it loops once for each location it finds. The offset of the matching sequence is available in the special variable $-[0] after a successful regular-expression match (see the perlvar section of the Perl documentation for more details).

sub map_enzyme {
    my($self, $enzyme) = @_;

    my(@positions) = (  );

    my(@res) = $self->get_regular_expressions($enzyme);

    foreach my $re (@res) {
        push @positions, $self->match_positions($re);
    }

    @{$self->{_map}{$enzyme}} = @positions;
    return @positions;
}

sub get_regular_expressions {
    my($self, $enzyme) = @_;

    my(%sites) = split(' ', $self->{_rebase}{_rebase}{$enzyme});

    # May have duplicate values
    return values %sites;
}

# Find positions of a regular expression in the sequence
sub match_positions {
    my($self, $regexp) = @_;

    my @positions = (  );

    # Determine positions of regular expression matches
    while ( $self->{_sequence} =~ /$regexp/ig ) {
        push @positions, ($-[0] + 1 );
    }

    return(@positions);
}

Finally, here are some helper methods that report on the attributes of sequence and of the calculated map for a given enzyme (or the entire hash that contains the map of all the enzymes). I didn't use AUTOLOAD here because there are only a handful of attributes that return a few different data types: array, scalar string, and hash.

sub get_enzyme_map {
    my($self, $enzyme) = @_;

    @{$self->{_map}{$enzyme}};
}

sub get_enzyme_names {
    my($self) = @_;

    keys %{$self->{_map}};
}
sub get_sequence {
    my($self) = @_;

    $self->{_sequence};
}

sub get_map {
    my($self) = @_;

    %{$self->{_map}};
}
5.3.1.3 Documentation

Here's the (bare bones) documentation for the class embedded right in the module using the POD documentation language:

=head1 Restriction

Restriction: Given a Rebase object, sequence, and list of restriction enzyme
    names, return the locations of the recognition sites in the sequence

=head1 Synopsis

    use Restriction;

    use Rebase;

    use strict;
    use warnings;

    my $rebase = Rebase->new(
        dbmfile => 'BIONET',
        bionetfile => 'bionet.212'
    );

    my $restrict = Restriction->new(
        rebase => $rebase,
        enzyme => 'EcoRI, HindIII',
        sequence => 'ACGAATTCCGGAATTCG',
    );
   
    print "Locations for EcoRI are ", join(' ', 
                                    $restrict->get_enzyme_map('EcoRI')), "\n";

=head1 AUTHOR

James Tisdall

=head1 COPYRIGHT

Copyright (c) 2003, James Tisdall

=cut

1;

Saving the Synopsis example in a file and running it (making sure the required bionet.212 file is in place) gives the output:

EcoRI data in Rebase is GAATTC GAATTC
Sequence is ACGAATTCCGGAATTCG
Locations for EcoRI are 3 11