March 06, 2007

Parsing irregular multi line records with Perl

Parsing irregular multi line records with Perl

Most of the articles I've read about parsing multi line records with perl offer pithy advice to reset the record separator special variable, $/ , to a different value, and walla, your records are easily manageable. However, what happens when your reading log files, as I seem to spend most of my time doing these days, and there isn't a defined record seperator and lots of noise? Log files messages of many applications don't have defined beginnings and ends, but often have messages that span several lines. Furthrmore, you can't use record seperators, because even if can find a decent seperator, there will be lots of garbage inbetween the messages. Take Genesys for example, one of my main areas of work. Genesys provides CTI software, or "computer telephony integration", a software layer used to control voice telephony systems. They have a well defined event distribution system, from the primary server, called a T-Server. The T-Server logs just about everything (if you set the logging up high enough), and the need arose to examine the logs after the fact for a variety of reasons. Here is an example of a really short T-Server event, surrounded by the typical noise in the log file. The event starts with the distribute_event message, and ends with "Interaction message "EventReleased" generated".

@10:08:56.6920 [0] 7.0.207.66 distribute_event: message EventReleased
	AttributeTimeinuSecs	692000
	AttributeTimeinSecs	1107529736 (10:08:56)
	AttributeANI	'4043620021'
	AttributeDNIS	'2828'
	AttributeConnID	0089012d39c0e25c
	AttributeCallID	520618004
	AttributeCallType	2
	AttributeThisDN	'gsbi::'
@10:08:56.6920 [ISCC] Debug: Party object is removed: p:0000000000000000,00000000 n::
10:08:56.692 Int 04544 Interaction message "EventReleased" generated
10:08:56.692 Trc 04542 EventReleased sent to 1156 (4613 lds_clt_b1)
10:08:56.692 Trc 04542 EventReleased sent to 1044 (4612 lds_bhm_p1)

I had requirements that could be summarized as such: determine how many EventReleased where in a file, and in those events, grab specific elements off record, for example, how many of them had AttributeCallType = 2 ? The actual requirements were a little different as can seen by the attached full script, but the dificulty in finding a record start, then grabing elements out of it was the main issue. Well, perl is probably the perfect tool for parsing Genesys log files. And I doubt this is the most effecient way to do, alhtough it worked, so by posting this perhaps somebody will submit other ideas. Here is the full version if you don't need the play by play below.


Ok, the code. First I set a variable of the $event_type, and then open a directory to allow me to run this script on multiple files, slurp each one up into an array called @files. $ARGV[0] means take the first argument off the commandline, such as a directory path, otherwise use . or current directory.
$event_type = "EventEstablished";

$dir= $ARGV[0] || ".";
opendir(DIR, "$dir");
	while($name = readdir(DIR)) {	
		##Put the names in an array, latest files 1st
			print "puting name = $name in the array\n";
			unshift(@files, $name);
	}
closedir(DIR);

Then I declare some variables, variables above the scope of each file I work with. That way I can save some information that spans the entire directory:
my $direventestablished;
my $dirivr;
my $diroutbound;
my $dirnonurs;
my $nodata;
my %fourdns;

Next, the iteration through each file, and local variables specific to that file. There are a bunch of them because once I got this script going I keept mutating it.
foreach   (@files) {

	my $searchfile = $_;
	my ($ineventestablished, $eventestablished,$ivreventfound,$urseventfound,$connid,$dnis,$ani,$thisdn,$thiscalltype,$thisqueue,$ivrstring,$ivr,$nonivr,$urstring,$urs,$nonurs,$outbound,$fourdigit);

Since we're going through each file in the @files array, which is now $searchfile, we use the $searchfile variable to open each file with a new filehandle, SEARCHFILE. When you use the while () in perl, it automatically cycles though every line of the file. So, all the code below the while fires on EVERY line in the files. For Genesys TServer logs, each file is about 10megs. But it only takes a couple of seconds to parse with this script. Go Perl.
print "searching file $searchfile\n";
open(SEARCHFILE, "<$dir/$searchfile") || print "Could not open file $searchfile : $!";
	
		while () {
			chomp;
			#print "chomping file $dir, line = $line"; 

Recognizing when a record started became key to the script. I decided that once the mulit line record started, I would set a variable, and when the next line of the file is being processed if that variable is set it knows to look for elements of that record. To do that a simple regular expression is used (see the =~ and / /, that means its a regular expression). If the current line matches, we know the record started. We also increment a counter of how many events were found.
					
     if ($_ =~ /distribute_event: message $event_type/) {
	$ineventestablished =1;
        $eventestablished++;	
     } 

So, here comes some of the heavy lifiting. The subsequent line from the file, will check if we have started an event, (if $ineventestablished =1) and if so it does much more statements to grab individual elements off the record. Each of the if statements have a regular expression that tries to match the line. I use the ( ) in the regular expression along with the $1 special variable to not only match but return the value of the matched (item) to the $1 variable.
	if ($ineventestablished) {
		if ($_ =~ /IVR_CALL_REASON/) {
			$ivreventfound = 1;
		}

		if ($_ =~ /URS_START/) {
			$urseventfound = 1;
		}

		if ($_ =~ /AttributeConnID\s(.*)/) {
			$connid = $1;
		}

		if ($_ =~ /AttributeDNIS\s+\'(.*)\'/) {
			$dnis = $1;
		}

		if ($_ =~ /AttributeANI\s+\'(.*)\'/) {
			$ani = $1;
		}

		if ($_ =~ /AttributeThisDN\s+\'(.*)\'/) {
			$thisdn = $1;
		}

		if ($_ =~ /AttributeCallType\s+(\d)/) {
			$thiscalltype = $1;
		}
		
		if ($_ =~ /AttributeThisQueue\s+\'(.*)\'/) {
			$thisqueue = $1;
		}

And, of course, this amount of if statements would be applied indefinately on every single line in the file until you turn off the $ineventestablished variable. I do that by matching the end of the record, like so
	#post processing
	if ($_ =~ /Interaction message \"$event_type\" generated/) {
		$ineventestablished = 0;

Now that I've defined that one full, mulit line record has been scanned, I needed to do some calculations on that record before I moved on to the next one, or the next line in the file. So the following code does some of those caluclations, and shows some processing done at the end of the file showing totals.
	if ($ivreventfound) {
		$ivrstring = "HAS BEEN THROUGH IVR ";
		$ivr++;
	} else {
		$ivrstring = "HASNT BEEN THROUGH IVR ";
		$nonivr++;
	}
	
	if ($urseventfound) {
		$urstring = "HAS BEEN THROUGH ROUTER";
		$urs++;
	} else {
		$urstring = "HASNT BEEN THROUGH ROUTER";
		$nonurs++;
	}


	if (!$urseventfound && !$ivreventfound) {
		$nodata++;
	}

	if (length($thisdn) < 5) {
		$fourdigit++;
		$fourdns{$thisdn} = 1;
	}
	

	if ($thiscalltype == 3) {
		$outbound++;
	}

	$ivreventfound = 0;
	$urseventfound = 0;
	#print "Connid = $connid $ivrstring and $urstring ANI = $ani and DNIS = $dnis and this DN = $thisdn and thisQueue = $thisqueue and thiscalltype = $thiscalltype\n";	
  } # end if postprocessing

 } #end eventestablished

} # end while


print "found $eventestablished total $event_type - $ivr IVR EVENTESTABLISHED and $nonivr NON IVR EVENTESTABLISHED and $urs URS EVENTS and $nonurs NON URS EVENTS and $outbound OUTBOUND\n";			
$direventestablished = $direventestablished + $eventestablished;
		
} # end files for looop

print "toatl $event_type = $direventestablished and total nodata $event_type = $nodata\n";
Posted by choppen5 at 10:38 PM