Defining new Data and Component Catalogs Tutorial

From Wings Wiki

Overview

In order to create your own workflows in Wings, you need to start by defining what components are available and what kinds of datasets they process. This is done by defining a data catalog and a component catalog:

  • The data catalog describes the kinds of data that can be used or produced by workflow components. This is done by defining datatypes and by assigning datasets (files) to those types.
  • The component catalog defines the workflow components available to create workflows. A workflow component has an associated executable code and a model that consists of definitions of valid inputs and outputs and their required metadata properties.

In this tutorial we show how to set up these two catalogs by way of example.

We assume the reader is already familiar with the Wings Sandbox and its Short Tutorial. The Wings Sandbox and its tutorial provide basic background on how to create valid workflow templates based on existing components, and how the information added to the data and component catalogs is used to assist users in creating valid workflows.

This tutorial is organized by complexity and not all aspects of defining component and data catalogs described here need to be implemented in order to get started running workflows. Within each section, we will show a simple example of what can be achieved using the steps described up until then. It doesn't take much to get started!

For this tutorial, we assume the data and component catalogs are empty. You may be using an account that already has datatypes and components defined. If so, you may want to remove them along with any workflows that are predefined. Or you may leave them as is if they do not bother you, as they will not interfere with the steps in this tutorial.


The WordBags Example

The tutorial describes the steps for creating the data and component catalogs needed for the "WordBags" workflow that is included in the Wings Sandbox. This workflow contains components that count the number of occurrences of words in various files in order to compare the files by topic. The workflow template that we want to create based on these components has the following dataflow diagram:

File:target_workflow.jpg

Roughly, the workflow operates as follows. In the branch on the left we generate a simple "topic model" of HTML and/or PDF files by considering several such files on the same topic and counting the number of times words appear in them. Then, applying a distance measure between word-counts, we compare this topic model to a newly given file processed in the branch on the right. This workflow illustrates the following features of Wings:

  • data collections (processing sets of files in the branch on the left) -- check this if you need a reminder
  • abstract components (using different removeMarkup components depending on whether HTML files or PDF files are given as input) -- check this if you need a reminder
  • the use of semantic constraints to define requirements on the component inputs and parameter values and the corresponding meta-data of their outputs, depending on the meta-data properties of their inputs and the used parameter values -- check this, this, and this if you need a reminder

The components we use are implemented with very simple codes based on standard UNIX tools, python, and perl. But you can upload any codes that are supported by the execution environment of your Wings portal installation. More on this later.

Uploading components

On the main screen of the portal, choose Advanced->"Manage Components" to bring up the Component Browser.

File:comp_editor.jpg

Create a new component "getSortedWords" and add one input, called, e.g., "Input1TextFile". Since we haven't defined any data types yet, choose the generic type "DataObject" from the pull-down menu for the type. Leave the other options as they are for now. Next, add an output "Output1WordListFile". Save these changes when done.

File:comp_getSortedWords.jpg

Currently the icon for the new component is yellow. This indicates that this component is not executable, because it doesn't have any executable programs attached to it. Let's change this and fill the component with life! We need to upload code that can be executed when this component is used in a workflow, taking the parameters as specified.

Note that you can indicate a prefix for each input and output dataset as well as for each input parameter. By default, the form uses, in order, -i1, -i2, etc. for input datasets; -p1, -p2, etc. for input parameters; and -o1, -o2, etc. for output datasets. You can change these in the form if you like, but note that the prefixes must correspond to what your executable code expects.
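To make the default prefix convention concrete, here is a minimal Python sketch of how a component program might group its command-line arguments by prefix. The function name parse_component_args is hypothetical and not part of Wings; the sketch only assumes the default -iN, -pN, and -oN prefixes described above.

```python
import re


def parse_component_args(argv):
    """Group command-line tokens by their -iN / -pN / -oN prefix.

    Hypothetical helper, not part of Wings. Every token that follows a
    prefix is collected as a value of that prefix until the next prefix.
    """
    kinds = {"i": "inputs", "p": "params", "o": "outputs"}
    groups = {"inputs": {}, "params": {}, "outputs": {}}
    current = None
    for token in argv:
        match = re.fullmatch(r"-([ipo])(\d+)", token)
        if match:
            kind = kinds[match.group(1)]
            index = int(match.group(2))
            current = groups[kind].setdefault(index, [])
        elif current is not None:
            current.append(token)
        else:
            raise ValueError(f"value {token!r} appears before any prefix")
    return groups


# Example invocation with one input, one parameter, and one output.
print(parse_component_args(["-i1", "input.txt", "-p1", "5", "-o1", "words.txt"]))
```

If you change the prefixes in the form, a parser like this would have to be adapted to match.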

NOTE: Wings portal installations can be configured to execute component codes locally on the server where the Wings portal software is running (the same as the URL you are visiting), or on the grid that the Wings portal software is set up to access. To see which execution environment is being used in your portal, consult with the site administrator. Any code you upload will need to be compatible with that execution environment. For instance, if local execution is used, and the server on which the portal software is installed is running Ubuntu-Linux, then your code needs to be able to execute on Ubuntu-Linux. If the Wings portal has access to a grid that includes machines of different architectures that run different operating systems, then you can upload any codes that will run on any of them. The Pegasus workflow mapping and execution system will be running in the background and will select the appropriate execution resource.

To facilitate the creation of components, the portal offers a "skeleton" component package (zip) for download, via the Download menu in the top right corner. This can be used for creating components for most members of the *nix family of operating systems (including Linux and Mac OS X).

File:download_skeleton.jpg

Once downloaded, unpack the zip file. There are two script files: "run" and "io.sh". The latter is an auxiliary file that you can ignore. We only need to edit the former, "run". Open this file in a text editor. The script documents itself and marks with "EDIT" the two places that you will probably want to edit. Mainly, you need to specify the number of inputs, parameters, and outputs the script will receive. If you have not changed the prefix arguments for the inputs, parameters, and outputs defined in the Component Browser (i.e., they still follow the schema -i1, -i2, ..., -p1, -p2, ..., -o1, -o2, ...), then the invocation of io.sh (see top of the script) with these numbers will parse all provided arguments into the variables $INPUTS1, $INPUTS2, ..., $PARAMS1, $PARAMS2, ..., $OUTPUTS1, $OUTPUTS2, ... Hence, they can be readily referred to in the main program invocation, which is the other edit you will need to make in the script (see the documentation included in the script). Once done, copy all required files (executables, libraries, resources, ...) into this directory and repackage (zip) the directory.
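As an illustration of the kind of program the run script might invoke for getSortedWords, here is a minimal Python sketch. The behavior of the actual Sandbox component is not specified in this tutorial, so the tokenization (ASCII letters only, lower-cased) and the function names are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def get_sorted_words(text):
    """Return every word in the text, lower-cased and sorted alphabetically.

    Assumption: a word is a run of ASCII letters; repeated words are kept.
    """
    return sorted(re.findall(r"[a-z]+", text.lower()))


def run(input_path, output_path):
    """Read a text file and write its sorted words, one word per line."""
    words = get_sorted_words(Path(input_path).read_text(encoding="utf-8"))
    Path(output_path).write_text("".join(word + "\n" for word in words), encoding="utf-8")


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "input.txt"
    target = Path(workdir) / "words.txt"
    source.write_text("The cat and the hat", encoding="utf-8")
    run(source, target)
    print(target.read_text(encoding="utf-8"))
```

In a real component, the two paths would come from the input and output arguments that the run script receives from Wings.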

Once you are done, click on "Upload Component" in the Component Browser, select the zip file and click "Begin Upload". When the upload is finished, you will see that the icon of the component changes its color from yellow to blue. The component can now be executed.

If you later want to make any changes to the component code, you can download the current version, edit it, and upload it again.

Note that if you use the skeleton run script and follow the instructions in that file, most of the items in the Component_Author's_Checklist are taken care of for you. Just include any required libraries in the zip file that you upload.

For more information on how to encapsulate existing codes as workflow components, see the Pegasus guidelines and some examples of encapsulation of third party codes.


Test It

To test that the component works as intended, let us create a new data object (a text file) and a simple workflow containing just this component, processing the file we are going to create.

Go to "Manage Data" in the "Analysis" menu of the home screen.

File:Manage_Data.jpg

Select "DataObject" on the left and then click "Upload Files". A dialog will appear that lets you upload files into the data catalog. Upload any plain text file (e.g., any file ending in ".txt") for testing.

Now open the workflow editor (Analysis -> Edit Workflows) and create a simple workflow only consisting of one "getSortedWords" node, with associated input and output variables:

File:test_getSortedWords.jpg

If you are unclear on how to do this, please read the creating valid workflow templates section of the short tutorial.

We are now ready to execute the newly created component. Open the "Template Browser" (Analysis->Browse & Run Workflows) and select the simple workflow just created.

As always, at the top on the left you will see a drop down menu from which you can select the text file you have uploaded as input to getSortedWords. When you have done that, run the workflow. Go to the Access Results page to check how the execution progresses. Once completed, you can check that the result is as expected, by clicking on the link for the output file:

File:test_getSortedWords_results.jpg

This was a quick example of how to upload and use your own components. You can now already compose complex workflows and execute them (locally or on the grid).

Now, let us consider how to use the semantic features of Wings. First we will see how to specify semantic properties of data objects and how to organize data types. After that we will see how to specify semantic constraints for components.

Creating Semantic Descriptions of Datasets

If you want Wings to provide assistance in selecting data and parameters and in checking that workflows are valid, you need to create semantic descriptions of datasets.

NOTE: The Wings portal provides a very simple editor for semantic constraints. In the background, these constraints are converted into OWL axioms. OWL can handle more expressive constraints than those that you can input through this editor. If you want to use the full expressivity of OWL, then you will need to use a full-fledged OWL editor. Consult your site administrator to understand how this can be done.

To create descriptions of datasets, go to "Manage Data" in the "Analysis" menu of the home screen:

File:Manage_Data.jpg

Here we will define specific data types and specify which meta-data properties they have. In our example, there are a few such properties that are shared by all the data types we want to use. Therefore, we assign these properties to the top-level data type "DataObject". Select this node and add two properties "hasSize" and "hasLanguage" of types "int" and "string" respectively. Save these changes before moving on.

File:shared_properties.jpg

Now let's create a few data types that we will use in our example:

  • "FormattedFile", which we will use to capture various types of files containing text plus additional markup (e.g., HTML, PDF..);
  • "TextFile", for simple plain text files;
  • "WordListFile", a plain text file that contains one word per line;
  • "WordCountFile", which like the previous type contains one word per line, but no word is repeated; instead, each word is followed by a number indicating how many times it occurred;
  • "PatternFile", a file containing a set of patterns (one per line) that we will want to filter out from the files we process;
  • "LikelihoodFile", a file indicating the likelihood that the query file has the same topic as the model, expressed as a single number between 0.0 and 1.0.
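To illustrate the difference between a WordListFile and a WordCountFile, the following Python sketch turns the lines of a word list into word-count lines. The separator between a word and its count (a single space) and the function name count_words are assumptions; the tutorial only states that each word has a number after it.

```python
from collections import Counter


def count_words(word_list_lines):
    """Turn WordListFile lines (one word per line, repeats allowed) into
    WordCountFile lines (each word once, followed by its count).

    Assumption: word and count are separated by a single space.
    """
    counts = Counter(line.strip() for line in word_list_lines if line.strip())
    return [f"{word} {count}" for word, count in sorted(counts.items())]


for line in count_words(["apple", "banana", "apple"]):
    print(line)
```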

To define each of these data types, select the root node of the datatypes tree (DataObject) and click on "Add Datatype". When you open any of these newly created types, you'll see that they inherit the properties we had created for DataObject.

In addition to the two properties that we defined earlier, we want to express that FormattedFiles have an additional metadata property: topics. Therefore, we select FormattedFile and add a new property "hasTopic" of type "string". Furthermore, our example workflow needs to process two kinds of FormattedFile: HTML files and PDF files. To reflect this, we create appropriate sub-types of FormattedFile. Select FormattedFile and click on "Add Datatype". The newly created types, which we call "HTMLFile" and "PDFFile", will be created as sub-types of FormattedFile and inherit all its properties.

File:FormattedFiles.jpg

Similarly, we create "CommonTermsFile" and "SpecialCharsFile" as sub-types of PatternFile. We will use these files to list common terms and special characters that other workflow components will filter out of the text files before building the topic models.
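The datatype tree and its inherited properties can be pictured with a small Python sketch. This is only an illustration of the inheritance described above, not how Wings stores the catalog (which uses OWL); the dictionaries and the function inherited_properties are hypothetical.

```python
# Parent of each datatype defined in this tutorial (DataObject is the root).
PARENT = {
    "FormattedFile": "DataObject",
    "HTMLFile": "FormattedFile",
    "PDFFile": "FormattedFile",
    "TextFile": "DataObject",
    "WordListFile": "DataObject",
    "WordCountFile": "DataObject",
    "PatternFile": "DataObject",
    "CommonTermsFile": "PatternFile",
    "SpecialCharsFile": "PatternFile",
    "LikelihoodFile": "DataObject",
}

# Properties declared directly on a datatype.
OWN_PROPERTIES = {
    "DataObject": {"hasSize": "int", "hasLanguage": "string"},
    "FormattedFile": {"hasTopic": "string"},
}


def inherited_properties(datatype):
    """Collect the properties of a datatype and of all its ancestors."""
    chain = []
    while datatype is not None:
        chain.append(datatype)
        datatype = PARENT.get(datatype)
    properties = {}
    for name in reversed(chain):  # root first, most specific type last
        properties.update(OWN_PROPERTIES.get(name, {}))
    return properties


print(inherited_properties("HTMLFile"))
print(inherited_properties("CommonTermsFile"))
```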

We now have all the datatypes we need and are ready to upload datasets into the repository. Let us begin by uploading a couple of HTML files, by selecting the HTMLFile datatype and clicking "Upload Files (HTMLFile)". This opens a dialog containing an upload queue. Using the "Add Files to Queue" button, we can select files from our local filesystem for upload.

File:upload_data.jpg

To upload these files, click "Begin Upload". When the upload is completed we can close this dialog. The uploaded files appear underneath their type, HTMLFile. Next, we need to fill in the value for the meta-data properties of the uploaded files. Click on the files and fill in appropriate values for the properties. Remember to save your changes.

File:data_with_prop_values.jpg

We proceed analogously for all files that we will use as input to our analysis. As for the text file that you uploaded earlier and that was assigned the type DataObject, you should delete it and re-upload it, this time as a TextFile.

We have created the datatypes and their respective meta-data properties we need in order to describe the data objects of our application, and we have uploaded the files we need for our analysis.

Test It

In order to test these specifications, go back to the Component Editor, and change the input type for getSortedWords, the component we have created previously, to the more appropriate type "TextFile".

When you now go to the Template Browser again, you will see that only the text file that we uploaded appears in the pull down menu where you select the input to getSortedWords. None of the other files (of different type) appear, since they do not match the required input type. Also, test the following: leave the input field blank and click "Suggest Data". Wings will only suggest the text files that you have uploaded, realizing that no other file type would be appropriate.
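The filtering behind the pull-down menu and "Suggest Data" can be approximated as a sub-type check. The following Python sketch is an analogy under stated assumptions: the file names are made up, and the functions is_a and suggest_data are hypothetical, not Wings functions.

```python
# A fragment of the datatype tree from this tutorial.
PARENT = {
    "FormattedFile": "DataObject",
    "HTMLFile": "FormattedFile",
    "PDFFile": "FormattedFile",
    "TextFile": "DataObject",
}


def is_a(datatype, required_type):
    """True if datatype equals required_type or is one of its sub-types."""
    while datatype is not None:
        if datatype == required_type:
            return True
        datatype = PARENT.get(datatype)
    return False


def suggest_data(catalog, required_type):
    """Return the names of all datasets whose type matches the input type."""
    return [name for name, datatype in catalog.items() if is_a(datatype, required_type)]


# Hypothetical catalog contents: dataset name -> datatype.
catalog = {"notes.txt": "TextFile", "page.html": "HTMLFile", "paper.pdf": "PDFFile"}
print(suggest_data(catalog, "TextFile"))
print(suggest_data(catalog, "FormattedFile"))
```

With the input typed as DataObject, as in the first test, every dataset in the catalog would match.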

Benefits From Doing This

Defining Abstract Components

Wings allows for the definition of "abstract" component classes. These are component classes that themselves do not have an attached executable program. Instead, they serve as an abstract placeholder for a list of concrete, executable components. In many workflows there are abstract steps that need to be performed, where the details of this step depend on many things, including the type of input being processed, the type of output expected, and the chosen values for parameters. Since Wings is able to reason about all these constraints, it is very convenient to use abstract components in a workflow and let Wings decide which specialization to use, given the circumstances, which also makes such workflows more broadly applicable and hence improves the chances of reusability.

For instance, in our example workflow, we would like to work with both HTML files and PDF files. In order to extract the words contained in these files we need to use different scripts (concrete components), but conceptually they achieve the same: they turn a formatted file into a plain-text file which can be processed further independent of its original format. Wings allows us to specify this abstraction and is able to generate the corresponding concrete workflows depending on the type of input it receives.


To achieve this, let us create an abstract component "removeMarkup" with one input of type FormattedFile and one output of type TextFile. After creating this component, instead of creating code for it, we select the component in the tree on the left and click on "Add Component". This creates a new component as a sub-component (specialization) of removeMarkup. The created component, which we call "html2text", inherits the arguments of removeMarkup. In order to reflect our intention that html2text processes HTML files, we change the type of its input to HTMLFile. In the same way, a second component "pdf2text" which processes PDF files can be created.
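Conceptually, choosing a specialization amounts to matching the type of the given input against the input types of the concrete components. The Python sketch below illustrates this idea only; Wings performs the selection through its own reasoning, and the table and the function specialize are hypothetical.

```python
# Concrete components registered under each abstract component,
# together with the input datatype each one accepts.
SPECIALIZATIONS = {
    "removeMarkup": {"html2text": "HTMLFile", "pdf2text": "PDFFile"},
}


def specialize(abstract_component, input_type):
    """Pick the concrete component whose input type matches the given data."""
    for concrete, accepted_type in SPECIALIZATIONS[abstract_component].items():
        if accepted_type == input_type:
            return concrete
    raise ValueError(f"no specialization of {abstract_component} accepts {input_type}")


print(specialize("removeMarkup", "HTMLFile"))
print(specialize("removeMarkup", "PDFFile"))
```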

File:removeMarkup.jpg

Test It

In the Workflow Editor, create a new template consisting of only one "removeMarkup" component, with associated variables. When you now go to "Browse & Run Workflows" and select this template, you will see in the pull-down menu of the input that you can use either HTML files or PDF files.

File:test_removeMarkup.jpg

Depending on which one you choose, Wings will specialize "removeMarkup" into either "html2text" or "pdf2text". To see this, run the workflow on any html file. In "Access Results" you can then see the actual workflow that is executed:

File:test_removeMarkup_results.jpg

Benefits From Doing This

Defining Components that Deal with Data Collections

In the Component Editor, in the widget of inputs and outputs, there is a fourth column "dimensionality" that we have ignored so far. The value in this column, which is 0 by default, states the number of dimensions this input/output has: 0 means it is a single file, 1 means it is a list, 2 means it is a list of lists, etc. On the command line, these collections are passed as space-separated lists. Handling these collections is easy when using io.sh, as is done in the skeleton script. The space-separated lists are then made available in the regular variables ($INPUTS1, ...).

In our example workflow, the "merge" component has a one-dimensional dataset as input.
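As a rough illustration of what a component with a one-dimensional input has to do, the Python sketch below splits a space-separated collection value into file names and combines several word lists into one. What the Sandbox's merge component actually computes is not described in this tutorial, so the simple concatenation and both function names are assumptions.

```python
def split_collection(value):
    """Split a space-separated collection argument into individual file names.

    Note: this simple approach cannot handle file names that contain spaces.
    """
    return value.split()


def merge_word_lists(word_lists):
    """Concatenate several word lists into one (assumed merge behavior)."""
    merged = []
    for words in word_lists:
        merged.extend(words)
    return merged


# A one-dimensional input as it might appear in a variable such as $INPUTS1.
files = split_collection("doc1.words doc2.words doc3.words")
print(files)
print(merge_word_lists([["cat", "hat"], ["cat"], ["dog"]]))
```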

When you create a workflow template that has a component with a data collection, you need to select "Infer Elaborated Template" before you save it. This will cause Wings to propagate the dimensionality throughout the workflow template. In our example, the merge component takes a data collection as input, which causes the left branch of the template to take a data collection as input to the workflow. This is another example of constraint propagation in Wings.

Test It

Try uploading the "merge" component, defining the dimensionality of its input. Then try to create a workflow template that uses it.

Benefits From Doing This

Advanced Semantic Constraints of Components

Wings allows for the specification of more advanced semantic functionality of components via the definition of rules. Use of these rules is entirely optional. As you saw from the previous testing sections, they are not necessary in order to elaborate or run a workflow. Nevertheless, they provide a very expressive mechanism to state semantic constraints in Wings. As a result, writing rules takes time to master. We are working towards providing a better interface for specifying this functionality that makes manual authoring of rules unnecessary.

NOTE: Wings uses the W3C's OWL standard to represent and reason about semantic constraints. However, OWL has limited expressivity. A rule language can be used in combination with OWL to express more advanced constraints. The W3C is currently developing a recommendation (i.e., a standard) for a rule language.

If you are not familiar with rules, we recommend that you read some introductory materials on this subject.

At the moment, Wings supports Jena rules. In the future we plan to support standard rule languages as they become available.

Rules can be entered in the text field in the "Component Rules" tab in the Component Browser. You can use Jena rule syntax and in particular rule built-ins to specify complex conditions. These rules are not associated with any specific component. Instead, in each rule one needs to specify on the left-hand side appropriate conditions under which the rule applies ("fires"), asserting its right-hand side. Hence, if a rule shall only apply to a specific component (which is often the case, see examples below), you need to specify appropriate conditions to limit the application of the rule to only that component. Here are some of the rules that we defined for the example workflow:

File:rules.jpg

To get started, consider the following two stubs, specifying two common shapes of these rules:

Preconditions

[name_preconditions:
 (?c rdf:type pcdom:nameClass)
 (?c pc:hasInput ?in1) (?in1 pc:hasArgumentID "Input1in1")

 # your conditions here, e.g., notEqual(?in1, ?in2), or (?in1 dcdom:hasLanguage "en")

 -> (?c ac:isInvalid "true"^^xsd:boolean)
]

This rule states conditions under which the component may not be used.


Effects

[name_effects:
 (?c rdf:type pcdom:nameClass)
 (?c pc:hasInput ?in1) (?in1 pc:hasArgumentID "Input1in1")
 (?c pc:hasOutput ?out1) (?out1 pc:hasArgumentID "Output1out1")

 # optionally, you can add conditions here to make this effect conditional, 
 # or to get the values of some input properties, e.g., (?in1 dcdom:hasLanguage ?lang)

 -> 
  # Your assertions here. These will be added to the knowledge base when 
  # this type of component is encountered during workflow elaboration. E.g., 
  # (?out1 dcdom:hasLanguage ?lang)

]

This rule states which meta-data properties the outputs of a matching component have. This allows us, for instance, to propagate meta-data properties over the execution of components. The following rule, for example, states that if the input file has some language ?lang, then the output file will also have that language.

 
 [getSortedWords_effects:
  (?c rdf:type pcdom:getSortedWordsClass)
  (?c pc:hasInput ?in1) (?in1 pc:hasArgumentID "Input1TextFile")
  (?c pc:hasOutput ?out1) (?out1 pc:hasArgumentID "Output1WordListFile")
  (?in1 dcdom:hasLanguage ?lang)
 -> 
  (?out1 dcdom:hasLanguage ?lang)
 ]
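For readers who are new to rules, the effect of the getSortedWords rule can be mimicked in a few lines of Python. This is only an analogy and not how Wings or Jena evaluates rules; the function name and the dictionary layout (argument ID mapped to its meta-data properties) are assumptions made for the illustration.

```python
def apply_get_sorted_words_effect(metadata):
    """Copy hasLanguage from the input to the output, if the input has one.

    metadata maps argument IDs to dictionaries of meta-data properties.
    A new dictionary is returned; the argument is left unchanged.
    """
    result = {argument: dict(properties) for argument, properties in metadata.items()}
    language = result.get("Input1TextFile", {}).get("hasLanguage")
    if language is not None:  # the rule only fires when the condition matches
        result.setdefault("Output1WordListFile", {})["hasLanguage"] = language
    return result


print(apply_get_sorted_words_effect({"Input1TextFile": {"hasLanguage": "en"}}))
print(apply_get_sorted_words_effect({"Input1TextFile": {}}))
```

As in the rule, nothing is asserted about the output when the language of the input is unknown.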

For more detailed examples of rules and how to use them for different purposes, see the rules for Data_Mining_Workflows.

Benefits From Doing This
