This was originally written in early
1999 when I was working for Landcare Research. Presumably ownership,
strictly speaking, belongs to them. Mind you, computing technique isn't
really their focus.
In-Source file caching in perl
When processing large text files in Perl, a moderate portion of the
run-time is spent waiting while the file system waits for the disk to
spin into position (known as rotational latency) before reading the
next record.
Many operating systems attempt to avoid this by caching files, that
is, by fetching more than is needed at any one instant and storing it
in a reserved portion of memory so it can be fetched faster.
However, the amount of memory used for this purpose is usually
small. Windows 95 seems to have a dynamically resizable cache that will
(supposedly) grow and shrink on demand. Even so, I have got good speed
gains from doing the caching myself.
Windows 95, for example, allocates a maximum of 64k for read-ahead
caching: if you request a block of x kbytes, the OS will fetch the next
64k bytes on the chance that you will continue to read from that
portion of the disk (further investigation suggests that this applies
only to caching from the CD). Read-ahead caching generally works 90-95%
of the time. When processing text files larger than this, more of the
time is spent waiting for the disk.
The most obvious way to avoid this is to load the whole file into
memory. Thankfully, this is easy to write in perl.
Without caching:
open (IN,"afile.txt"); while ($in=<IN>) { # until EOF, read a line and put into $in variable chop $in; #remove trailing CR from text # ..... process text } close (IN);
With caching:
open (IN,"afile.txt"); @myin=<IN>; # slurp the whole file into the array close (IN); foreach $temp (@myin) { # step through that array $in=$temp; # don't refer to the var from the array, # as it gets put back in afterward. chomp $in; # ..... process text }
Note
A little code later, it was realized that the first three lines of
the caching version can actually be moved off into a local utilities
module, which cuts down the amount of code used while centralising
error handling.
Now we use something like @myarr = &prepdatafile("afile.txt"); and the
code in the module is something like:
sub prepdatafile {
    open (FH,$_[0]) || die "Cannot open datafile $_[0]";
    my @temparr = <FH>;
    close (FH);
    return @temparr;    # return the slurped lines to the caller
}
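For reference, a minimal sketch of how that might be packaged up,
assuming a hypothetical module file MyUtils.pm (the name and the
Exporter setup here are mine, not the actual module we used):

# MyUtils.pm -- hypothetical local utilities module
package MyUtils;
require Exporter;
@ISA    = qw(Exporter);
@EXPORT = qw(prepdatafile);

sub prepdatafile {
    open (FH,$_[0]) || die "Cannot open datafile $_[0]";
    my @temparr = <FH>;
    close (FH);
    return @temparr;
}

1;   # a module file must end with a true value

Then the main script just needs:

use MyUtils;
@myarr = &prepdatafile("afile.txt");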
In addition, it should be noted that hard disk accesses do consume
CPU cycles; and while slurping the file increases demand for memory,
the OS should manage to refrain from swapping it back out to disk for a
little while...
While processing a 470kb text file this improved performance by
~30%! The code takes a little longer to start up (doing ~1Mb/sec file
reads for the first 2 seconds or so) but then disk activity drops
completely away...
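If you want to reproduce a figure like that on your own files, a rough
sketch using the standard Benchmark module could look like this
(afile.txt is a placeholder; note the slurping pass runs second, so it
also benefits from the OS having just read the file -- swap the order
or run them separately for a fairer comparison):

use Benchmark;

$t0 = new Benchmark;
open (IN,"afile.txt");
while ($in=<IN>) { chomp $in; }   # line-by-line pass (processing omitted)
close (IN);
$t1 = new Benchmark;

open (IN,"afile.txt");
@myin = <IN>;                     # slurping pass
close (IN);
foreach $temp (@myin) { $in=$temp; chomp $in; }
$t2 = new Benchmark;

print "line-by-line: ", timestr(timediff($t1,$t0)), "\n";
print "slurped:      ", timestr(timediff($t2,$t1)), "\n";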
Further hack testing seems to indicate my system tops out at about
3 MB/sec, for both read and write. However, running multiple processes
seriously slows things down (with 5 perl processes the file I/O is down
to 2 MB/sec or less; the perl interpreter doesn't seem to do
multithreaded programs).
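A crude way to check the raw read throughput is to read a suitably
large file (bigfile.dat here is a placeholder) in fixed-size blocks and
time the loop; run it on a file that isn't already sitting in the OS
cache, or the figure will reflect memory speed rather than the disk:

use Benchmark;

$t0 = new Benchmark;
open (IN,"bigfile.dat") || die "Cannot open bigfile.dat";
$bytes = 0;
while (($n = read(IN,$buf,65536)) > 0) {   # read in 64k blocks
    $bytes += $n;
}
close (IN);
$t1 = new Benchmark;
print "read $bytes bytes: ", timestr(timediff($t1,$t0)), "\n";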
Note: This system is most efficient when every record needs to be
processed, in a batch-type environment. In a cold-start interactive
environment (like CGI) it just adds to the overhead and doesn't improve
performance. If we had written a db server daemon it would naturally be
a speed gain to hold the data in memory so as to avoid waiting for the
disk.... up to a point. After that point it makes more sense to have
lots of available memory, and that point is the size of the data.
Sometimes you may only want to cache a subset (an index, or maybe the
core data), as sketched below.
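As a sketch of that subset idea: rather than slurping every record, you
can cache just an index of byte offsets and seek to individual records
on demand. (The colon-separated key format here is a made-up example.)

open (DB,"afile.txt") || die "Cannot open datafile";
%index = ();
$pos = 0;                        # byte offset of the line about to be read
while ($line = <DB>) {
    ($key) = split(/:/, $line);  # assume the key is the first field
    $index{$key} = $pos;
    $pos = tell(DB);
}

# later: fetch one record without rescanning the whole file
sub fetch_record {
    my ($key) = @_;
    return undef unless defined $index{$key};
    seek(DB, $index{$key}, 0);   # 0 = seek relative to start of file
    return scalar <DB>;
}

The index hash costs far less memory than the full file, while still
avoiding a linear scan of the disk for each lookup.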