This package has been deprecated

Author message:

Install bionode-watermill instead


bionode-watermill


Watermill: A Streaming Workflow Engine


Watermill lets you orchestrate tasks using operators like join, junction, and fork. Each task has a lifecycle where

  1. Input glob patterns are resolved to absolute file paths (e.g. *.bam to reads.bam; see the sketch after this list).
  2. The operation is run, passed the resolved input, params, and other props.
  3. The operation completes.
  4. Output glob patterns are resolved to absolute file paths.
  5. Validators are run over the output. By default this checks for non-null files; custom validators can also be passed in.
  6. Post-validations are run, and the task and its output are added to the DAG.
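
To make step 1 concrete, here is a minimal sketch of what glob resolution amounts to, using the standalone glob package's classic callback API purely for illustration (Watermill performs this resolution internally):

const glob = require('glob')

// Resolve a pattern like '*.bam' into absolute paths, e.g. [ '/data/reads.bam' ]
glob('*.bam', { absolute: true }, (err, files) => {
  if (err) throw err
  console.log(files)
})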

What is a task?

A task is the fundamental unit pipelines are built with. For more details, see Task. At a glance, a task is created by passing in props and an operationCreator, which will later be called with the resolved input. Consider this task which takes a "lowercase" file and creates an "uppercase" one:

const { task } = require('bionode-watermill')
const fs = require('fs')
const through = require('through2')

const uppercase = task({
  input: '*.lowercase',
  output: '*.uppercase'
}, function (resolvedProps) {
  const input = resolvedProps.input

  // Stream the resolved input file, uppercase each chunk, and write the result
  return fs.createReadStream(input)
    .pipe(through(function (chunk, enc, next) {
      next(null, chunk.toString().toUpperCase())
    }))
    .pipe(fs.createWriteStream(input.replace(/lowercase$/, 'uppercase')))
})

A "task declaration" like above will not immediately run the task. Instead, the task declaration returns an "invocable task" that can either be called directly or used with an orchestration operator. Tasks can also be created to run shell programs:

const fastqDump = task({
  input: '**/*.sra',
  output: [1, 2].map(n => `*_${n}.fastq.gz`),
  name: 'fastq-dump **/*.sra'
}, ({ input }) => `fastq-dump --split-files --skip-technical --gzip ${input}` )
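
Since nothing runs at declaration time, an invocable task like fastqDump can be used in either of the two ways described above. A minimal sketch, assuming a getSamples task has been declared elsewhere:

const { join } = require('bionode-watermill')

// Option 1: invoke the task directly to run it on its own
fastqDump()

// Option 2: compose it with an orchestration operator; the joined pipeline
// is itself an invocable task, and nothing runs until it is invoked
const sraToFastq = join(getSamples, fastqDump)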

What are orchestrators?

Orchestrators are functions that take tasks as parameters, letting you compose your pipeline from a high-level view. This separates task order from task declaration. For more details, see Orchestration. At a glance, here is a more involved usage of join, junction, and fork:

const pipeline = join(
  junction(
    join(getReference, bwaIndex),
    join(getSamples, fastqDump)
  ),
  trim, mergeTrimEnds,
  decompressReference, // only b/c mpileup did not like fna.gz
  join(
    fork(filterKMC, filterKHMER),
    alignAndSort, samtoolsIndex, mpileupAndCall // 2 instances each of these
  )
)
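
A pipeline composed with these operators is itself an invocable task, so running the whole workflow is just a matter of calling it (a sketch; the tasks referenced above are assumed to be declared elsewhere):

// Kick off the whole workflow; each task receives input resolved from
// the tasks upstream of it in the DAG
pipeline()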

Examples

Who is this tool for?

Waterwheel is for programmers who want an efficient, easy-to-write way to develop complex and dynamic data pipelines while handling as much parallelization as possible. Waterwheel is an npm module and is accessible to anyone willing to learn a little JavaScript. This is in contrast to other tools that develop their own DSL (domain-specific language), which is not useful outside the tool. By leveraging the npm ecosystem and JavaScript on the client, Waterwheel can be built upon for inclusion in web APIs, modern web applications, and native applications through Electron. Look forward to seeing Galaxy-like applications backed by a completely configurable Node API.

Waterwheel is for biologists who understand the importance of experimenting with sample data, parameter values, and tools. Compared to other workflow systems, swapping out parameters and tools is much easier, letting you iteratively compare results and construct more confident inferences. Consider the ability to construct your own Teaser for your data with a simple syntax, while getting the utmost performance out of the box.
