FixedLengthSlurper for Groovy Fixed Length File Parsing

Since I work in a company that relies heavily on COBOL processing, we have a lot of fixed length files in our batch environment and occasionally I end up interacting with these files in Java. Traditionally we've used an IBM proprietary product to generate Java objects from COBOL copybooks, but there are times that I don't want to go through all the effort of generating these objects if I'm just performing some simple test or analysis of a file.

Since I love Groovy's XmlSlurper and JsonSlurper, I created a FixedLengthSlurper, which is constructed with the format specification of your file and returns a map of named properties containing the parsed data. The constructor arguments come in groups: the size of the field, the name to assign to the field, and an optional closure to format the value. Any number of these groups can be passed. If no formatting closure is provided, the data is simply returned as a String.

def dateTime = new SimpleDateFormat('yyyyMMddahhmm')
def date = new SimpleDateFormat('yyyyMMdd')
def parser = new FixedLengthSlurper(
  13, 'dateTime', { dateTime.parse(it) },
  4, 'type',
  8, 'processDate', { date.parse(it) },
  9, 'numberRecords', { new Integer(it) },
  11, 'amount', { String str -> new BigDecimal(str[0..str.size() - 3] +
                              '.' + str[str.size() - 2..str.size() - 1]) },
  1, 'typeCode')

List values = []
new File('./data/ppld.txt').eachLine { String line ->
  if (!line.startsWith('0000') && !line.startsWith('9999')) {
    values << parser.parseText(line)
  }
}

In this example, the 'dateTime' and 'processDate' properties are stored as Date objects, 'numberRecords' as an Integer, and the 'type' and 'typeCode' properties as Strings. The 'amount' property gets a little extra parsing: the file stores '10044', but that needs to translate into '100.44', so the closure breaks the string into its integer and decimal parts before creating the BigDecimal object.
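The same implied-decimal conversion can also be done without any string slicing: BigDecimal's movePointLeft shifts the decimal point directly. A minimal sketch (not part of the slurper itself; the class name is just for illustration):

```java
import java.math.BigDecimal;

public class ImpliedDecimal {

    // COBOL-style implied-decimal fields store "10044" to mean 100.44;
    // movePointLeft(2) shifts the decimal point two places left, and
    // leading zeros from the fixed-width field are harmless.
    static BigDecimal parseAmount(String raw) {
        return new BigDecimal(raw.trim()).movePointLeft(2);
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("10044"));       // 100.44
        System.out.println(parseAmount("00000010044")); // 100.44
    }
}
```

This would replace the whole 'amount' closure with `{ new BigDecimal(it).movePointLeft(2) }`.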

The source code for the class follows:

class FixedLengthSlurper {

  List formats = []

  /**
   * Constructor.
   * @param vars the formats that should be used when parsing the file
   */
  FixedLengthSlurper(Object ... vars) {
    int varsIndex = 0
    while (varsIndex < vars.size()) {
      //the size and column name must be provided in pairs
      def format = [size: vars[varsIndex], name: vars[varsIndex + 1]]
      varsIndex += 2

      //check next argument to see if a formatting closure was provided
      if (varsIndex < vars.size() && vars[varsIndex] instanceof Closure) {
        format << [formatter: vars[varsIndex]]
        varsIndex++
      }
      formats << format
    }
  }

  /**
   * Reads through the text and applies all formats to break apart the data
   * into mapped properties
   * @param data the fixed length text to parse
   */
  def parseText = { String data ->
    def values = [:]
    int currentIndex = 0

    formats.each { format ->
      //extract the data
      values[format.'name'] =
        data[currentIndex .. (currentIndex + format.'size' - 1)]

      //if a formatting closure was provided, apply its formatting
      if (format.'formatter') {
        values[format.'name'] = format.'formatter'(values[format.'name'])
      }

      //increment the indexing for the next format
      currentIndex += format.'size'
    }

    return values
  }
}

While doing some research to build this class, I also ran across a neat article talking about how to override the Date class to add a "fromString" method to it. I didn't use it in my example since I only used my date parsers in one place, but it's a neat concept for situations, like unit tests, where there is a lot of date parsing going on.

Pivot Database Rows Into A Single Column

There are times when a single result record is desired instead of multiple rows, for instance when joining a main table against multiple records in a child table. Say that I have an email and a recipients table as follows:

EMAILS
--------------------------------------------------
ID | SUBJECT      | EMAIL_TEXT   | SENDER
1  | Great Prices | See circular | AwesomeRetailer 

RECIPIENTS
------------------------
EMAIL_ID | RECIPIENT
1        | bob@aol.com
1        | al@aol.com

Using a JOIN, you'd get back the following records.

ID | SUBJECT      | EMAIL_TEXT   | SENDER          | RECIPIENT
1  | Great Prices | See circular | AwesomeRetailer | bob@aol.com
1  | Great Prices | See circular | AwesomeRetailer | al@aol.com

This gives you all the information you need, but the id, subject, text, and sender values are repeated, which means that as you parse through the results in Java, you have to 'look back' at the previous row to see whether you're processing the same email or a new one.

Alternatively, you can use the XMLAGG aggregate function to combine the records into a single row. This function aggregates the data, and you then unpack it into a string using the XMLSERIALIZE function. This approach requires a sub-select, because XMLAGG operates on a result set.


SELECT ID,
       SUBJECT,
       EMAIL_TEXT,
       SENDER,
       RTRIM((SELECT CAST(XMLSERIALIZE(XMLAGG(XMLTEXT(RECIPIENT || '; ')) AS CLOB)
         AS VARCHAR(500)) FROM RECIPIENTS WHERE EMAIL_ID = EMAILS.ID)) AS RECIPIENTS
  FROM EMAILS

results in

ID | SUBJECT      | EMAIL_TEXT   | SENDER          | RECIPIENTS
1  | Great Prices | See circular | AwesomeRetailer | bob@aol.com; al@aol.com
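On the Java side, the aggregated column then unpacks with a single split instead of the look-back loop. A small sketch (the class name is illustrative; the ';' delimiter matches the XMLAGG expression above):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RecipientSplit {

    // Unpack the aggregated RECIPIENTS column back into individual
    // addresses: split on the ';' delimiter, trim any padding, and
    // drop the empty trailing entry if the separator ends the string.
    static List<String> splitRecipients(String aggregated) {
        return Arrays.stream(aggregated.split(";"))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitRecipients("bob@aol.com; al@aol.com"));
        // [bob@aol.com, al@aol.com]
    }
}
```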

Printing Tabular Data

Sometimes when I'm running tests, or pulling information from a database, I want to see the results in a table so that it's easy to scan the information and see what's there. For instance, let's say that I wanted to see the following information pulled from a bank application.

Account    | Owner           | Balance |
2131545422 | John Smith      | 10.00   |
23409834   | Alexander Banks | 230.03  |
987545453  | Smithy Joe      | 150.22  |
3445       | Katie Parks     | 3021.53 |

So I write a little Groovy script to query the database, and for each result that comes back, I add the mapped data to the printer.

def printer = new TablePrinter()
new Sql(.....).eachRow(....) {
  printer << ['Account#': it.'accountNum',
              'Owner': it.'userId',
              'Balance': it.'balance']
}
printer.printTable()

The keys from the map are used as the title bar and the data is placed into the table. The TablePrinter Groovy class finds the largest value in each column and uses that as the width when printing each record, which produces the nice formatting. Here's the source for the TablePrinter.

class TablePrinter {

  def out = System.out
  Collection rowCalcs = []
  Collection<Collection> rows = []

  /**
   * Cycles through the list of entries in the table to assess the largest
   * value for each 'column', then prints all the results in a table.
   *
   * @param rows the data to display
   */
  static void printTable(Collection<Collection> pRows) {
    if (pRows) {
      def printer = new TablePrinter()
      pRows.each { printer << it }
      printer.printTable()
    }
  }

  /**
   * Adds a map entry to the data being printed.
   * @see #plus(Collection)
   */
  TablePrinter leftShift(Collection pRow) {
    return plus(pRow)
  }

  /**
   * Adds a map entry to the data being printed.
   * @see #plus(Map)
   */
  TablePrinter leftShift(Map pRow) {
    return plus(pRow)
  }

  /**
   * Adds a new row to the values that will be printed and recalculates column
   * sizes. If no row calculations have yet been added, then key set from the
   * map will be assumed to the title row of the table and added to the rows
   * that will be printed.
   *
   * @param pRow the data being added to the table
   * @return reference to this updated object for chaining
   */
  TablePrinter plus(Map pRow) {
    if (rowCalcs.size() == 0) {
      this.plus(pRow.keySet())
    }
    return this.plus(pRow.values())
  }

  /**
   * Adds a new row to the values that will be printed and recalculates
   * column sizes.
   * @param pRow the data being added to the table
   * @return reference to this updated object for chaining
   */
  TablePrinter plus(Collection pRow) {
    rows << pRow
    calculateColumnSize(pRow)
    return this
  }

  /**
   * Creates a format string (for example: "%-10s | %-6s") using column
   * sizes calculated earlier and prints the values from each row in
   * tabular format.
   */
  void printTable() {
    def formatString = rowCalcs.sum { "%-${it}s | " }
    rows.each { row ->
      out.println sprintf(formatString, row.toList())
    }
  }

  /**
   * Adjusts the size of each 'column' by comparing previous calculations
   * against the size of each value in this row. The larger value is stored
   * in the {@link #rowCalcs} variable to be used later when printing.
   *
   * @param row values being added to the print display that need to be
   * sized to possibly adjust the size of each column
   */
  void calculateColumnSize(Collection row) {
    row.eachWithIndex { cell, cellNum ->
      //ensure the storage for the row size calculations is large enough
      while (rowCalcs.size() < cellNum + 1) {
        rowCalcs << 0
      }

      //default null into an empty string, checks explicitly for null to
      // prevent 0 -> ''.
      if (row[cellNum] == null) {
        row[cellNum] = ''
      }

      //store the larger of either the previous calculation or the length of
      //current value
      rowCalcs[cellNum] = Math.max(rowCalcs[cellNum],
                                   String.valueOf(row[cellNum]).length())
    }
  }

}
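For anyone without Groovy on hand, the width calculation at the heart of the class translates directly to plain Java. This is a sketch of the same idea, not a drop-in replacement:

```java
import java.util.Arrays;
import java.util.List;

public class WidthCalc {

    // For each column, the width is the length of the longest cell in
    // that column; the widths become the "%-<n>s | " segments of one
    // format string, mirroring TablePrinter's rowCalcs logic.
    static String buildFormat(List<String[]> rows) {
        int columns = rows.get(0).length;
        StringBuilder format = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            int width = 0;
            for (String[] row : rows) {
                width = Math.max(width, row[col].length());
            }
            format.append("%-").append(width).append("s | ");
        }
        return format.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Account", "Balance"},
            new String[] {"2131545422", "10.00"});
        String format = buildFormat(rows);
        for (String[] row : rows) {
            System.out.println(String.format(format, (Object[]) row));
        }
    }
}
```

The Groovy version recalculates widths incrementally as rows arrive; this sketch does it in one pass at print time, which is simpler when all the data is already in memory.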

Code Highlighting For Posts

Found a neat tool today: hilit.me which will take code and create a HTML friendly version of it that can be used to post to web-sites. Like this blog for instance!

Randomized Data Pool

In my years as a Java developer, I've done a lot of simple performance tests to see how one method of processing compared to another, but one of the issues that I've run into was the effect of database caching on the performance results.

If the two methods of reading the database both read the same value, the second read will always be faster than the first, because the data will already be sitting in the database's cache by the time of the second read. To solve this problem, I created a simple DataPool class which can retrieve random values from a list of data.


import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.math.RandomUtils;

/**
 * Wrapper class which allows a random entry from a list of values to be retrieved.
 * Useful in performance testing to get random values from a list of values.
 *
 * @param <T> defines the object types stored in the data pool
 * @author Brian Bos
 */
public class DataPool<T> {
  private final List<T> data;

  /**
   * Constructor.
   */
  public DataPool() {
    this.data = new ArrayList<T>();
  }

  /**
   * Constructor.
   * @param pData list of data to use for the pool.
   */
  public DataPool(List<T> pData) {
    this.data = pData;
  }

  /**
   * Adds a new entry to the data pool.
   * @param entry the value to add to the data pool
   * @return reference to this object, useful for chaining, also allows the "+"
   *         operator to be used in groovy
   */
  public DataPool<T> plus(T entry) {
    this.data.add(entry);
    return this;
  }

  /**
   * Adds a new entry to the data pool.
   * @param entry the value to add to the data pool
   * @return reference to this object, useful for chaining, also allows the "<<"
   *         operator to be used in groovy
   */
  public DataPool<T> leftShift(T entry) {
    return this.plus(entry);
  }

  /**
   * @return a random value from the data pool
   */
  public T nextValue() {
    return this.data.get(RandomUtils.nextInt(this.data.size()));
  }

  /**
   * @param num the number of values to retrieve
   * @return a random value from the data pool
   */
  public List<T> nextValues(int num) {
    final List<T> values = new ArrayList<T>();
    for (int i = 0; i < num; i++) {
      values.add(this.data.get(RandomUtils.nextInt(this.data.size())));
    }
    return values;
  }

}

I used this class in a groovy script which compared reading data with CICS versus DB2. Before I randomized the data with this approach, the DB2 method looked like it was faster, but that was just because the data was being cached. Using random data gave me a better test and showed that the CICS read was actually faster than the complex query, at least when reading policy by policy.


//prepare data pool
def dataPool = new DataPool<String>()
new File("./performance/policyNumbers.txt").eachLine {
  dataPool << it.trim()
}

//Single policy performance test
def cicsPolicies = ([] << dataPool.nextValue())
def db2Policies = ([] << dataPool.nextValue())
performReads(reader, cicsPolicies, querier, db2Policies)

//multiple policy read performance test
cicsPolicies = dataPool.nextValues(50)
db2Policies = dataPool.nextValues(50)
performReads(reader, cicsPolicies, querier, db2Policies)
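One wrinkle with RandomUtils is that the run isn't repeatable: each execution draws a different set of policies, so two performance runs aren't strictly comparable. If that matters, the same pool idea works with a seeded java.util.Random, which replays the identical sequence every run. A minimal sketch (the class name and seed are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SeededPool {

    private final List<String> data;
    private final Random random;

    // Seeding the Random makes a performance run repeatable: the same
    // seed draws the same sequence of pool entries every time.
    SeededPool(List<String> data, long seed) {
        this.data = data;
        this.random = new Random(seed);
    }

    String nextValue() {
        return data.get(random.nextInt(data.size()));
    }

    public static void main(String[] args) {
        SeededPool pool = new SeededPool(
            Arrays.asList("POL001", "POL002", "POL003"), 42L);
        System.out.println(pool.nextValue());
    }
}
```

Two pools built with the same seed will hand back the same values in the same order, so the CICS and DB2 runs can be replayed against identical inputs.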