Basic Component Encapsulation

From Wings Wiki

Background

Workflow systems utilize models of workflow components to reason about executable codes. These models need to be consistent with the executable code; otherwise, the workflow system will run into failures during workflow generation and during workflow execution.

We refer to encapsulation as the process of defining a clear interface for executing a component. Encapsulating code into a workflow component requires exposing all important properties and requirements of an executable code while ensuring that the code does not make assumptions about its execution environment without exposing them.

Unfortunately, synchronizing arguments, environmental variables, argument order, and platform requirements across different and successive component versions is a tedious and error-prone process. For example, a simple change to a component, such as reordering its argument list, can often be detected only by running the workflows and investigating the sources of failure. A major issue is that the encapsulations of a component are developed separately from the code itself. If the specification of the component is provided by the code developer together with the code, then the encapsulation and the code are more likely to be correct and in sync with each other.

The Wings/Pegasus team has designed a simple methodology to facilitate the component encapsulation process. In this methodology, a Basic Component Encapsulation (BCE) schema must be implemented as a wrapper of any code delivered as a workflow component.

The BCE provides a means to declare the encapsulation of executable codes and is recommended (soon to be required) when a component is uploaded to the Workflow Portal.

Basic Component Encapsulation (BCE)

Encapsulations are written by code developers in the process of writing code that is to be delivered as a workflow component. Unlike declarative models of code, encapsulations are generated with the code itself. Encapsulations are simple and capture basic features of how the code is to be executed. An encapsulation can be a Unix script, a Java program, or a standalone binary.

BCEs provide information to Workflow System administrators such as:

  1. execution requirements, including architecture, memory, and software
  2. environmental variables assumed by the code
  3. error codes and related conditions
  4. unit tests

The unit tests are particularly important because they allow administrators to confirm that the execution environment is configured correctly and that the components succeed (and fail correctly).

BCEs are XML instances of the BCE Schema (download from here: File:Bce.xsd) that encode the following information:

  1. number of input and output files
  2. parameters and their basic types (i.e., number or string)
  3. ordering of parameters and files for code invocation
  4. component name, author, vendor, version, date
  5. required environmental variables (libraries and classpaths)
  6. basic resource requirements (e.g., host type, OS)
  7. unit tests to check key execution error codes and other key execution properties

To provide a uniform interface to its BCE, each component is required to accept a --BCE argument that prints its BCE XML to standard output.
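
A wrapper might honor the --BCE argument along the following lines. This is a minimal sketch in Python, not the Wings implementation: the embedded XML is abbreviated, and run_component, main, and the out parameter are hypothetical names chosen for illustration.

```python
import argparse
import sys

# Abbreviated BCE document, for illustration only.
BCE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<component name="Discretize" author="Fayyad and Irani" vendor="WEKA" version="3.6.2">
    <description>
        <files>
            <file type="input" id="inputData"/>
            <file type="output" id="outputData"/>
        </files>
    </description>
</component>
"""


def run_component(args):
    """Hypothetical placeholder for the wrapped code; returns an exit code."""
    return 0


def main(argv=None, out=None):
    """Print the BCE when --BCE is given; otherwise run the component."""
    out = sys.stdout if out is None else out
    parser = argparse.ArgumentParser(prog="Discretize")
    parser.add_argument("--BCE", action="store_true",
                        help="print the BCE XML to standard output and exit")
    parser.add_argument("-i", dest="input_data")
    parser.add_argument("-o", dest="output_data")
    args = parser.parse_args(argv)
    if args.BCE:
        out.write(BCE_XML)
        return 0
    return run_component(args)
```

A script built this way would end with sys.exit(main()), so that the value returned by main becomes the exit code of the process.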

The remainder of this document provides:

  • examples to illustrate the use of BCE
  • diagrams to illustrate the representation
  • an XML Schema for BCE

Examples of a BCE

This section provides two examples of a BCE. Before showing the two full examples, we discuss some important properties of the different parts of a BCE.

Input and Output Files

<files>
 <file type="input" id="inputData"/>
 <file type="output" id="outputData"/>
</files>

Note that order does not matter here.

Input Parameters

<parameters>
 <param type="int" id="numberOfBins"/>
 <param type="int" id="classIndex"/>
</parameters>

Note that order does not matter here.

Argument Type and Ordering

<arguments>
  <arg type="file"><file idref="Group1_DS"/></arg>
  <arg type="file"><file idref="Group2_DS"/></arg>
  <arg type="parameter" option="-iter "><param idref="Iterations"/></arg>
  <arg type="parameter" option="-name "><param idref="GroupName"/></arg>
  <arg type="file" option="--output="><file idref="OutputGroup_DS"/></arg>
</arguments>

This should be an ordered list corresponding to the order in which the component expects its arguments.

NOTE: at present, the Wings Portal does not guarantee an ordering of arguments. Components, therefore, must be able to handle arguments in any order.
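
To illustrate how the declared order can drive an invocation, the sketch below turns an arguments element into an argument list. It is an illustration under stated assumptions, not the portal's behavior: build_argv is a hypothetical helper, and the rule for joining an option to its value (glue when the option ends in "=", separate tokens otherwise) is an assumption, since the BCE examples do not define it.

```python
import xml.etree.ElementTree as ET

ARGUMENTS_XML = """
<arguments>
  <arg type="file"><file idref="Group1_DS"/></arg>
  <arg type="file"><file idref="Group2_DS"/></arg>
  <arg type="parameter" option="-iter "><param idref="Iterations"/></arg>
  <arg type="parameter" option="-name "><param idref="GroupName"/></arg>
  <arg type="file" option="--output="><file idref="OutputGroup_DS"/></arg>
</arguments>
"""


def build_argv(arguments_xml, values):
    """Return argument tokens in the order the <arg> elements are declared.

    Assumption: an option ending in "=" is glued to its value; any other
    option becomes its own token with surrounding whitespace stripped.
    """
    argv = []
    for arg in ET.fromstring(arguments_xml).findall("arg"):
        ref = arg[0].get("idref")          # the <file> or <param> child
        value = str(values[ref])
        option = arg.get("option", "")
        if option.endswith("="):
            argv.append(option + value)
        elif option.strip():
            argv.extend([option.strip(), value])
        else:
            argv.append(value)
    return argv
```

The values mapping binds each idref to a concrete file name or parameter value, much as a unit test does.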

Component Name, Author, Vendor, Version

<component
  name="Discretize"
  author="Fayyad and Irani"
  vendor="WEKA"
  version="3.6.2">

Resource Requirements

<system architecture="x86" 
 os="linux" 
 os-version="fc6" 
 glibc="2.4.3"/>

Required Environmental Variables

These variables should be thought of as keys which indicate to the system administrators what variables need to be defined.

<environment>
  <name>JAVA_HOME</name>
  <name>CLASSPATH</name>
  <name>LD_LIBRARY_PATH</name>
</environment>

In the case where a component wraps a third-party algorithm, this set of variables should be the superset of variables for the wrapper and the binary. For example, suppose you are writing a Unix script to wrap some third-party software. The third-party software requires LD_LIBRARY_PATH and the script requires JAVA_HOME - both would be included in this set.
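
An administrator could check a host against the declared variables with something like the following sketch. The helper name missing_variables is hypothetical, and treating an empty value as missing is an assumption rather than a BCE rule.

```python
import os
import xml.etree.ElementTree as ET

ENVIRONMENT_XML = """
<environment>
  <name>JAVA_HOME</name>
  <name>CLASSPATH</name>
  <name>LD_LIBRARY_PATH</name>
</environment>
"""


def missing_variables(environment_xml, environ=None):
    """Return the names declared in <environment> that are unset or empty."""
    environ = os.environ if environ is None else environ
    root = ET.fromstring(environment_xml)
    names = [n.text.strip() for n in root.findall("name")]
    return [name for name in names if not environ.get(name)]
```

Called with only the XML, the function inspects the environment of the current process; passing a dictionary makes it easy to try out without touching real settings.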

Exit Codes

The exitcodes element lists the codes with which the component can exit. The workflow system requires a code of 0 for success and any non-zero code for failure.

<exitcodes>
  <exit code="0">SUCCESS</exit>
  <exit code="1">FAILURE: Algorithm Failure</exit>
  <exit code="127">FAILURE: Invalid Input File</exit>
</exitcodes>

There should be at least one unit test for each exit code.
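
The coverage rule can be checked mechanically. The sketch below assumes the element layout of the full examples (exitcodes inside description, unit_test directly under component); untested_exit_codes is a hypothetical helper, and the sample document is abbreviated for illustration.

```python
import xml.etree.ElementTree as ET

COMPONENT_XML = """
<component name="Discretize">
  <description>
    <exitcodes>
      <exit code="0">SUCCESS</exit>
      <exit code="1">FAILURE: Algorithm Failure</exit>
      <exit code="127">FAILURE: Invalid Input File</exit>
    </exitcodes>
  </description>
  <unit_test><exit code="0">Success</exit></unit_test>
  <unit_test><exit code="1">Failure</exit></unit_test>
</component>
"""


def untested_exit_codes(component_xml):
    """Return declared exit codes that no <unit_test> expects."""
    root = ET.fromstring(component_xml)
    declared = [e.get("code") for e in root.findall("./description/exitcodes/exit")]
    tested = {e.get("code") for e in root.findall("./unit_test/exit")}
    return [code for code in declared if code not in tested]
```

For the sample above the function reports code 127, because no unit test exercises the invalid input file case.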

Unit Tests

<unit_test>
  <file idref="inputData">iris.arff</file>
  <file idref="outputData">iris-binned.arff</file>
  <param idref="numberOfBins">10</param>
  <param idref="classIndex">5</param>
  <invocation_string>Discretize -i iris.arff -o iris-binned.arff -B 10 -c 5</invocation_string>
  <exit code="0">Success</exit>
</unit_test>
<unit_test>
  <file idref="inputData">iris.arff</file>
  <file idref="outputData">iris-binned.arff</file>
  <param idref="numberOfBins">10</param>
  <param idref="classIndex">10</param>
  <invocation_string>Discretize -i iris.arff -o iris-binned.arff -B 10 -c 10</invocation_string>
  <exit code="1">Failure</exit>
</unit_test>
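
A unit test of this form could be executed roughly as follows. This is a sketch only: run_unit_test and its runner parameter are hypothetical, and splitting the invocation string with POSIX shell rules is an assumption about how the string is meant to be read.

```python
import shlex
import subprocess
import xml.etree.ElementTree as ET

UNIT_TEST_XML = """
<unit_test>
  <file idref="inputData">iris.arff</file>
  <file idref="outputData">iris-binned.arff</file>
  <param idref="numberOfBins">10</param>
  <param idref="classIndex">10</param>
  <invocation_string>Discretize -i iris.arff -o iris-binned.arff -B 10 -c 10</invocation_string>
  <exit code="1">Failure</exit>
</unit_test>
"""


def run_unit_test(unit_test_xml, runner=None):
    """Run one <unit_test>; return (passed, expected_code, actual_code).

    runner takes an argument list and returns an exit code. It defaults
    to launching the command as a subprocess.
    """
    if runner is None:
        runner = lambda argv: subprocess.run(argv).returncode
    test = ET.fromstring(unit_test_xml)
    command = shlex.split(test.findtext("invocation_string"))
    expected = int(test.find("exit").get("code"))
    actual = runner(command)
    return actual == expected, expected, actual
```

With no runner argument the command is launched through subprocess.run, which requires the component to be installed and on the PATH. Supplying a stand-in runner allows the harness itself to be checked first.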

Examples

Example 1: Simple Component Discretize

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<component name="Discretize" author="Fayyad and Irani" vendor="WEKA" version="3.6.2">
    <description>
        <environment>
            <name>JAVA_HOME</name>
            <name>CLASSPATH</name>
            <name>WEKAHOME</name>
        </environment>
        <files>
            <file type="input" id="inputData"/>
            <file type="output" id="outputData"/>
        </files>
        <parameters>
            <param type="int" id="numberOfBins"/>
            <param type="int" id="classIndex"/>
        </parameters>
        <arguments>
            <arg option="-i" type="file">
                <file idref="inputData"/>
            </arg>
            <arg option="-o" type="file">
                <file idref="outputData"/>
            </arg>
            <arg option="-B" type="parameter">
                <param idref="numberOfBins"/>
            </arg>
            <arg option="-c" type="parameter">
                <param idref="classIndex"/>
            </arg>
        </arguments>
        <exitcodes>
            <exit code="0">Success</exit>
            <exit code="1">Failure</exit>
        </exitcodes>
    </description>
    <unit_test>
        <file idref="inputData">iris.arff</file>
        <file idref="outputData">iris-binned.arff</file>
        <param idref="numberOfBins">10</param>
        <param idref="classIndex">5</param>
        <invocation_string>Discretize -i iris.arff -o iris-binned.arff -B 10 -c 5</invocation_string>
        <exit code="0">Success</exit>
    </unit_test>
    <unit_test>
        <file idref="inputData">iris.arff</file>
        <file idref="outputData">iris-binned.arff</file>
        <param idref="numberOfBins">10</param>
        <param idref="classIndex">10</param>
        <invocation_string>Discretize -i iris.arff -o iris-binned.arff -B 10 -c 10</invocation_string>
        <exit code="1">Failure</exit>
    </unit_test>
</component>
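
Because arguments and unit tests point at files and parameters through idref, a misspelled reference is an easy mistake to make. The sketch below lists references that match no declared id. The helper unresolved_idrefs is hypothetical, and the sample document is deliberately broken (numBins instead of numberOfBins) for illustration.

```python
import xml.etree.ElementTree as ET

BROKEN_XML = """
<component name="Discretize">
  <description>
    <files>
      <file type="input" id="inputData"/>
      <file type="output" id="outputData"/>
    </files>
    <parameters>
      <param type="int" id="numberOfBins"/>
    </parameters>
    <arguments>
      <arg option="-i" type="file"><file idref="inputData"/></arg>
      <arg option="-B" type="parameter"><param idref="numBins"/></arg>
    </arguments>
  </description>
</component>
"""


def unresolved_idrefs(component_xml):
    """Return, sorted, every idref value that matches no declared id."""
    root = ET.fromstring(component_xml)
    declared = {e.get("id") for e in root.iter() if e.get("id") is not None}
    refs = {e.get("idref") for e in root.iter() if e.get("idref") is not None}
    return sorted(refs - declared)
```

An empty result means every reference in the document resolves to a declared file, collection, or parameter.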

Example 2: Component with a Data Collection as an Input

<?xml version="1.0" encoding="UTF-8"?>
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                     xmlns="http://seagull.isi.edu/ikcap/bce"
                     name="union-n-groups" 
                     version="0.1.0" 
                     vendor="SR-ISI" 
                     author="varunr" 
                     date="2007-04-10">
        <description>
               <system architecture="x86" os="linux" os-version="fc6" glibc="2.4.3"/>
               <environment>
                       <name>JAVA_HOME</name>
                       <name>CLASSPATH</name>
                       <name>LD_LIBRARY_PATH</name>
               </environment>
               <files>
                       <collection id="InputGroups_DS" type="input"/>
                       <file id="OutputGroup_DS" type="output"/>
               </files>
               <parameters>
                       <param id="Iterations" type="int"/>
                       <param id="GroupName" type="string"/>
               </parameters>
               <arguments>
                       <arg type="collection">
                               <collection idref="InputGroups_DS"/>
                       </arg>
                       <arg type="parameter" option="-iter ">
                               <param idref="Iterations"/>
                       </arg>
                       <arg type="parameter" option="-name ">
                               <param idref="GroupName"/>
                       </arg>
                       <arg type="file" option="--output=">
                               <file idref="OutputGroup_DS"/>
                       </arg>
               </arguments>
        </description>
        <unit_test>
               <file idref="OutputGroup_DS">Output.group</file>
               <collection idref="InputGroups_DS">
                       <file>Input1.group</file>
                       <file>Input2.group</file>
                       <file>Input3.group</file>
               </collection>
               <param idref="Iterations">5</param>
               <param idref="GroupName">Boaters in Marina</param>
               <invocation_string>
                       union-n-groups Input1.group Input2.group  Input3.group -iter 5 
                                                 -name "Boaters in Marina" --output=Output.group
               </invocation_string>
                <exit code="0">Success</exit>
        </unit_test>
        <unit_test>
               <file idref="OutputGroup_DS">Output2.group</file>
               <collection idref="InputGroups_DS">
                       <file>Input7.group</file>
                       <file>Input8.group</file>
                       <file>Input9.group</file>
                       <file>Input10.group</file>
               </collection>
               <param idref="Iterations">1</param>
               <param idref="GroupName">Vampires in LA</param>
               <invocation_string>
                       union-n-groups Input7.group Input8.group Input9.group Input10.group 
                                        -iter 1 -name "Vampires in LA" --output=Output2.group
               </invocation_string>
                <exit code="127">Failure: Illegal File Format</exit>
        </unit_test>
        <unit_test>
               <file idref="OutputGroup_DS">Output.group</file>
               <collection idref="InputGroups_DS">
                       <file>Input1.group</file>
               </collection>
               <param idref="Iterations">5</param>
               <param idref="GroupName">Rookie Boaters in Marina</param>
               <invocation_string>
                       union-n-groups Input1.group -iter 5 -name "Rookie Boaters in Marina"
                             --output=Output.group
               </invocation_string>
                <exit code="1">Failure: Too few input groups</exit>
        </unit_test>
</component>

Supporting Documents and Software

BCE Schema

Best Practices and Lessons Learned

Software

We provide a Java API to help component authors to quickly and efficiently wrap codes. The source files include the code that was used to build the components for the Data Mining Workflows. You might also take a look at Example BCE Compatible Scripts for some example scripts.

These are also available as a Maven project - just ask us for an account to access our Nexus repository.

Future Work and Extensions of the BCE

  1. Modify the schema to capture two kinds of author/version/date/comment metadata: (a) for the program that is wrapped and (b) for the wrapper itself
  2. Add a license property to the BCE
  3. Remove the argument ordering requirement for BCE because the Wings Portal does not guarantee an ordering
  4. Support the specification of database requirements