Monday, October 12, 2009

Data validation using awk command

One common use of patterns in awk is simple data validation. In ETL processes especially, you can use them to cleanse source data.

In shell scripting, the =~ operator (inside [[ ]] in bash) is usually used for regular expression matching. There is a nice guide to advanced shell scripting; point 17 of that guide describes regular expressions.
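For comparison, here is a minimal sketch of a =~ match in bash. The sample value and variable names are hypothetical, chosen only for illustration, and the snippet assumes bash rather than a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Hypothetical sample value; not taken from any real data set.
line='42|*8999*|ok'

# Keep the pattern in a variable and leave it unquoted on the right of =~,
# so that bash treats it as a regular expression and not a literal string.
pattern='\*8999\*'

if [[ $line =~ $pattern ]]; then
    result="match"
else
    result="no match"
fi
echo "$result"
```

Quoting the right-hand side (for example "$pattern") would make bash compare it as a plain string, which is a common source of confusion.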

In awk, however, the ~ operator is used for matching. Suppose we want to filter a file whose lines are delimited by "|" and print only those lines whose second column contains the literal text "*8999*". The awk command is:

$ awk -F '|' '{ if ($2 ~ /\*8999\*/) { print NR, "|", $0 } }' "$file"
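To see the filter in action, here is a small sketch that runs it against made-up data. The temporary file and its four rows are assumptions for illustration only. The filter is written in awk's pattern-action form, and the asterisks are escaped inside a regex literal so that they match literally.

```shell
# Hypothetical sample data; the rows are made up for illustration.
file=$(mktemp)
printf '%s\n' \
    '1|*8999*|alpha' \
    '2|8999|beta' \
    '3|x*8999*y|gamma' \
    '4|1234|delta' > "$file"

# Print the line number and the whole line when field 2
# contains the literal text *8999* (asterisks escaped in the regex).
matches=$(awk -F '|' '$2 ~ /\*8999\*/ { print NR, "|", $0 }' "$file")
echo "$matches"

rm -f "$file"
```

With this sample input, only lines 1 and 3 are printed, because line 2 has no asterisks around 8999 and line 4 does not contain 8999 at all:

```
1 | 1|*8999*|alpha
3 | 3|x*8999*y|gamma
```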

Thus, you can validate data using the awk command. This is one of awk's great features.
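For an ETL-style cleansing step, the same idea can route rows into a clean file and a reject file. The sketch below is only an illustration: the validation rule (exactly three fields, numeric first field), the file names, and the sample rows are all assumptions and not part of any real feed.

```shell
# Hypothetical input and rule: a row is valid when it has exactly
# 3 fields and its first field is made of digits only.
workdir=$(mktemp -d)
printf '%s\n' \
    '101|*8999*|alpha' \
    'abc|*8999*|beta' \
    '103|gamma' \
    '104|1234|delta' > "$workdir/source.txt"

# Valid rows go to clean.txt; everything else goes to reject.txt,
# prefixed with its line number so it can be traced back to the source.
awk -F '|' -v good="$workdir/clean.txt" -v bad="$workdir/reject.txt" '
    NF == 3 && $1 ~ /^[0-9]+$/ { print > good; next }
    { print NR ": " $0 > bad }
' "$workdir/source.txt"

clean=$(cat "$workdir/clean.txt")
rejects=$(cat "$workdir/reject.txt")
echo "Clean rows:";    echo "$clean"
echo "Rejected rows:"; echo "$rejects"

rm -rf "$workdir"
```

Here rows 101 and 104 pass the check, while the row with the non-numeric key and the row with a missing column end up in the reject file.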