Miscellaneous Topics II

HEREDOC
printf
The glob operator
LWP::Simple
Regex and Multi-line strings
Getopt::Long
Named Arguments

HEREDOC

A long series of print statements can get very repetitive. A construct called a here-document, or HEREDOC (borrowed from the shell), can help. Instead of:

    print "\nThere are $count people.\n";
    print "\n";
    print "$women of them are women\n";
    print "$men of them are men.\n\n";

Try this:

    print <<"EOS";

    There are $count people.

    $women of them are women
    $men of them are men.

    EOS

Much less punctuation and easier to read, yes? The string constant extends from the line after the construct <<"EOS" up to, but not including, a line consisting solely of EOS. String interpolation happens because the EOS tag was double-quoted.
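The effect of the quotes around the tag can be seen in a small sketch. The variable $count and its value are made up for illustration; a single-quoted tag turns interpolation off, just as single quotes do for ordinary strings.

```perl
use strict;
use warnings;

my $count = 12;    # hypothetical value for the example

# Double-quoted tag: $count is interpolated, as in a "..." string.
my $interpolated = <<"EOS";
There are $count people.
EOS

# Single-quoted tag: the text is taken literally, as in a '...' string.
my $literal = <<'EOS';
There are $count people.
EOS

print $interpolated;    # There are 12 people.
print $literal;         # There are $count people.
```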

printf

If you want precise control over how numbers are printed and how columns are aligned, there is the printf function (and sprintf, which returns the formatted string instead of printing it). Its first parameter is a format string whose % conversion specifications control the output:

    my $r = 5/6;
    my $f = 15/8;
    print "ratio $r\n";
    print "fraction $f\n";
    my $fmt = "%10s %9.4f\n";
    printf $fmt, 'ratio', $r;
    printf $fmt, 'fraction', $f;

    # the above will print:
    # ratio 0.833333333333333
    # fraction 1.875
    #      ratio    0.8333
    #   fraction    1.8750

See 'perldoc -f sprintf' for the details.
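A few more conversions that come up often, as a minimal sketch. The values are arbitrary; sprintf is used so the results can be stored and inspected before printing.

```perl
use strict;
use warnings;

# sprintf builds the string; printf would print it directly.
my $left   = sprintf '%-10s|', 'ratio';    # left-justify in 10 columns
my $padded = sprintf '%05d',   42;         # zero-pad to 5 digits
my $pct    = sprintf '%.1f%%', 83.333;     # one decimal place, then a literal %

print "$left\n";      # ratio     |
print "$padded\n";    # 00042
print "$pct\n";       # 83.3%
```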

The glob operator

The < and > characters are used for reading lines from a filehandle. They can also be used to get a list of filenames that match a shell-style pattern:

    my $line = <$in>;                  # read from $in
    while (<STDIN>) {}                 # read from STDIN
    my @txt_files = <*.txt>;           # get all .txt files in the current directory
    my @pm_files = <src/perl/*.pm>;    # get all .pm files in src/perl

How does Perl know how to interpret <>? As before, it tries to do "what you mean" and nearly always succeeds. The rules are complicated. If you're curious, see File::Glob and the "I/O Operators" section of perlop.
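When the pattern is built from a variable, the built-in glob() function avoids any doubt about which meaning of <> applies. The sketch below creates a scratch directory with made-up file names so that it is self-contained; note that glob() splits its pattern on whitespace, so this assumes the directory path contains no spaces.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory with a few empty files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt notes.md)) {
    open my $fh, '>', "$dir/$name" or die "cannot create $dir/$name: $!";
    close $fh;
}

# glob() takes the pattern as an ordinary string.
my @txt_files = glob("$dir/*.txt");

print scalar(@txt_files), " .txt files found\n";    # 2 .txt files found
```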

LWP::Simple

In the early days of the web (the 1990s) Perl was used to generate 95% of all dynamic web pages, and it was affectionately termed 'The Duct Tape of the Internet'. Today is different, of course. Here is an example of how easy it is to do web-related work in Perl:

    use LWP::Simple;
    my $html = get("http://en.wikipedia.org/wiki/perl");

This gets the complete HTML source of the URL into a single scalar. You can then parse, slice & dice it with regex!

There is also getstore($url, $fname) which will store the contents of the URL in a file. Very convenient.
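getstore returns the HTTP status code, which is worth checking. A minimal sketch, assuming LWP::Simple is installed; fetch_to_file is a hypothetical helper name, and is_success is one of the status functions that LWP::Simple exports.

```perl
use strict;
use warnings;
use LWP::Simple;    # exports get, getstore, is_success, ...

# Hypothetical helper: fetch $url into $fname and die on failure.
sub fetch_to_file {
    my ($url, $fname) = @_;
    my $status = getstore($url, $fname);    # the HTTP status code
    is_success($status)
        or die "could not fetch $url: HTTP status $status\n";
    return $fname;
}

# Run only when a URL and a file name are given on the command line.
fetch_to_file(@ARGV) if @ARGV == 2;
```

Called as a script with a URL and a file name, it saves the page or stops with the status code that explains why it could not.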

Regex and Multi-line strings

The regular expressions we saw before matched a pattern to a single line of text. Regex can also quite effectively deal with multi-line strings of arbitrary length. There are two things to note about this:

The dot '.' character normally matches any character except a newline. With the /s modifier it will also match a newline.
    use LWP::Simple;
    my $html = get('http://en.wikipedia.org/wiki/perl');
    # we have a multi-line scalar
    $html =~ s{.*?<table>}{}s;
    # this discarded everything up to and including the first <table> tag (the first 53 lines)
The ? after the * in the regex makes it match as little as possible rather than the default behavior of being 'greedy' and matching as much as possible.
When dealing with multi-line strings it is advisable to always use the /m modifier. It changes '^' and '$' from matching at the start or end of the string to matching at the start or end of any line within the string.
The regex meta-characters \A and \z unambiguously match the start and end of the string, with or without /m.
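The modifiers and anchors above can be tried on a short in-memory string instead of a downloaded page. This is only a sketch; the three-line $text is made up for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /s the dot stops at a newline; with /s it crosses lines.
my ($no_s)   = $text =~ /(first.*)/;     # "first line"
my ($with_s) = $text =~ /(first.*)/s;    # the whole string

# Greedy .* runs to the last "line"; non-greedy .*? stops at the first.
my ($greedy) = $text =~ /(.*line)/s;
my ($lazy)   = $text =~ /(.*?line)/s;    # "first line"

# With /m, ^ matches at the start of every line; \A only at the very start.
my @line_starts = $text =~ /^(\w+)/mg;
my @text_starts = $text =~ /\A(\w+)/mg;

# $ tolerates a final newline; \z demands the true end of the string.
my $dollar = $text =~ /line$/  ? 'yes' : 'no';
my $z      = $text =~ /line\z/ ? 'yes' : 'no';

print "line starts: @line_starts\n";    # line starts: first second third
print "text starts: @text_starts\n";    # text starts: first
print "\$ matched: $dollar, \\z matched: $z\n";    # $ matched: yes, \z matched: no
```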

For a Perl script that you can use to explore the above click here. You should be able to understand the source code but may need to consult the lengthy 'perldoc perlre'.

Getopt::Long

As we saw in the discussion of Arrays, command line arguments are placed in @ARGV and you can do whatever you like with them. Many modules have been written to help with this. Getopt::Long is a very sophisticated one and likely the one you'll see the most. Here is one way to use it:

    use Getopt::Long;

    my %opt = (
        nlines => 10,    # default
    );
    GetOptions(\%opt, qw/
        verbose
        nlines=i
        fname=s
    /) or die "usage: $0 -v -n #lines -f fname\n";

    print "starting\n" if $opt{verbose};
    for (1 .. $opt{nlines}) {
        ...
    }
    open my $in, '<', $opt{fname} or die "cannot open $opt{fname}: $!\n";

We'll discuss the \%opt construct later.
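Another common style stores each option in its own variable instead of a hash. The sketch below uses GetOptionsFromArray, which Getopt::Long provides on request, so that a made-up argument list can stand in for @ARGV; the option names are the ones used above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical argument list standing in for @ARGV.
my @args = qw(--verbose --nlines 3 --fname data.txt extra.txt);

my $verbose = 0;
my $nlines  = 10;    # default, as in the hash version
my $fname;

GetOptionsFromArray(
    \@args,
    'verbose'  => \$verbose,    # flag, takes no value
    'nlines=i' => \$nlines,     # requires an integer
    'fname=s'  => \$fname,      # requires a string
) or die "usage: $0 --verbose --nlines N --fname FILE\n";

print "verbose=$verbose nlines=$nlines fname=$fname\n";
print "left over: @args\n";    # left over: extra.txt
```

Recognized options are removed from the array, so whatever remains (here extra.txt) can be treated as ordinary arguments.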

Named Arguments

Hashes can be used to good effect to help with passing many arguments to a subroutine:

    use LWP::Simple;

    sub process {
        my %args = @_;
        my $html = get($args{url});
        my $result;
        for (1 .. $args{count}) {
            ...    # compute $result here
        }
        print "$args{prefix}: $result\n";
    }

    process(
        url    => 'http://www.google.com',
        count  => 45,
        prefix => 'abc',
    );

Note these things:

There is no need to remember the order of the parameters.
The call is self-documenting.
New arguments are easily added.

An improvement to the code in sub process would be to first check the validity of the hash keys.
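One way such a check might look, as a sketch rather than the only approach. The helper name check_args, the default values, and the choice to make url required are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of allowed keys, mirroring the call to process above.
my %ALLOWED = map { $_ => 1 } qw(url count prefix);

sub check_args {
    my %args = @_;
    for my $key (sort keys %args) {
        die "unknown argument '$key'\n" unless $ALLOWED{$key};
    }
    # Supply defaults for anything the caller left out.
    %args = (count => 1, prefix => '', %args);
    die "missing required argument 'url'\n" unless defined $args{url};
    return %args;
}

my %ok = check_args(url => 'http://www.google.com', count => 45);
print "count=$ok{count} prefix='$ok{prefix}'\n";    # count=45 prefix=''

# A misspelled key is caught instead of being silently ignored.
my $error = eval { check_args(ulr => 'http://www.google.com'); 1 } ? '' : $@;
print $error;    # unknown argument 'ulr'
```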

Exercise

For practice with several of these miscellaneous topics, here is a task I made up. It is contrived, but actually not that far off from a 'real' one.

There are two command-line options: --url and --min. The script will be called like this:
    get_tar.pl --url http://www.logicalpoetry.com/perlclass/data.tar --min 30
Make a new directory in /tmp.
Get the tar file at the url into that directory.
Extract/untar the text files within it.
Ignore the text files that are not writable - but do log them (to STDOUT).
The files consist of lines that are colon ':' separated fields.
If the 3rd field has an uppercase letter, add the 4th field (which is numeric) to a running total.
Only include the 4th field if it is greater than the --min value from the command line.
Print a report that includes the number of files in the tar file, the number of lines you read, the number of 4th fields you included in the total, and the final running total.
Write some perldoc for the script.
