Using parquet files in Nextflow

CSV vs Parquet

CSV is a common plain-text file format where each line represents a row and the fields in each row are separated by a "special" character, usually a comma, tab, or pipe.

Parquet, by contrast, is a columnar storage file format optimized for analytics. It's highly efficient in both storage and performance.

In this post I'll show you how to deal with Parquet files in your Nextflow pipelines using the nf-parquet plugin.

CSV generator

We'll use a Nextflow script to generate a CSV file with random values. By default it will generate one million rows with 21 fields:

  • UUID

  • field1, field2, … field10 as random strings

  • field11, field12, … field20 as random floats

    INFO

    You can specify how many rows you want using the `--rows` parameter
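
For example, to generate a smaller file with only 100,000 rows (just an illustrative value):

nextflow run generator.nf --rows 100000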

generator.nf
import java.util.UUID;

params.out = "$baseDir/data/example.csv"
params.rows = 1_000_000

// characters used to build the random strings
alphabet = (('A'..'Z')+('0'..'9')).join()

// random alphanumeric string of length n
String randomString(int n){
  new Random().with {
    (1..n).collect { alphabet[ nextInt( alphabet.length() ) ] }.join()
  }
}

// random float rendered as a string
String randomFloat(){
    new Random().with{
        nextFloat()
    }.toString()
}

// write the header line: uuid,field1,...,field20
outFile = new File(params.out)
outFile.text = (["uuid"]+(1..20).collect{ "field${it}"}).join(",")+"\n"

// append one CSV line per row: a UUID, 10 random strings and 10 random floats
(1..params.rows).each{
    outFile << (
            [ UUID.randomUUID().toString() ]
            +
            // string length between 1 and 10; nextInt(10) alone could return 0,
            // and (1..0) is a reverse range in Groovy
            (1..10).collect{ randomString(1 + new Random().nextInt(10)) }
            +
            (1..10).collect{ randomFloat() }
        ).join(",")+"\n"
}

nextflow run generator.nf

In my case the script created, after approximately one minute, a data/example.csv of about 191 MB with one million rows (plus the header).

Reading a CSV file

To parse a CSV file, we can use the splitCsv operator provided by Nextflow:

csv.nf
params.input = "$baseDir/data/example.csv"

workflow {
    Channel.fromPath(params.input)
        .splitCsv(header:true)
        .count()
        .view()
}
$ nextflow run csv.nf

 N E X T F L O W   ~  version 25.02.3-edge

Launching `csv.nf` [prickly_heisenberg] DSL2 - revision: 192c2f35f2

1000000

Reading a Parquet file

Before working with the Parquet format, we need to convert our CSV.

INFO

We can use an online tool to convert a CSV file to Parquet, for example https://www.tablab.app/csv/to/parquet, or write a simple Nextflow script (we'll do exactly that in the Writing section below). For the moment we'll use the online tool.

Follow the instructions to upload the generated CSV and download the "Parquet version" into the data subfolder, next to the CSV version.

If you compare the sizes of both files, you can already see the first advantage of the Parquet format:

  • example.csv 191 MB

  • example.parquet 112 MB

    INFO

    If your CSV is only a few hundred KB, the size difference will not be noticeable.

To use Parquet files we need to install the nf-parquet plugin:

nextflow.config
plugins {
     id "nf-parquet"
}
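
Optionally, you can pin the plugin to a specific release to keep runs reproducible. The version below is only a placeholder, check the nf-parquet releases for a real one:

nextflow.config
plugins {
     id "nf-parquet@0.0.2" // placeholder version, replace with an actual release
}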

The plugin provides a splitParquet operator, similar to splitCsv:

parquet.nf
include { splitParquet } from 'plugin/nf-parquet'

params.input = "$baseDir/data/example.parquet"

workflow {
    Channel.fromPath(params.input)
        .splitParquet()
        .count()
        .view()
}

As you can see, you need to include the splitParquet operator defined in the nf-parquet plugin and use it much like splitCsv (they differ only in the allowed parameters).

First benchmark

Now that we have the file in both formats, we can run a simple comparison:

time nextflow run csv.nf
....
real    0m9,085s
user    0m27,703s
sys     0m2,415s

About twenty-seven seconds of user time to parse the CSV file.

time nextflow run parquet.nf
....
real    0m5,814s
user    0m20,854s
sys     0m1,542s

Around twenty-one seconds of user time to parse the Parquet file.

"Only" 7 seconds of difference but it’s the beginner

Reading data

We'll dump a tuple of field1 and field20 (for example) in both formats:

csv_view.nf
params.input = "$baseDir/data/example.csv"

workflow {
    Channel.fromPath(params.input)
        .splitCsv(header:true)
        .map{ row -> tuple(row.field1, row.field20) }
        .view()
}
parquet_view.nf
include { splitParquet } from 'plugin/nf-parquet'

params.input = "$baseDir/data/example.parquet"

workflow {
    Channel.fromPath(params.input)
        .splitParquet()
        .map{ row -> tuple(row.field1, row.field20) }
        .view()
}

As you can see, both pipelines look very similar.

Running the benchmark on my computer:

  • csv took user 0m42,274s

  • parquet took user 0m32,426s

Defining a Schema

Although nf-parquet can parse a Parquet file in "raw" mode as a Map, in a similar way to splitCsv, the real power of the Parquet format comes when you define a schema that tells the library about the structure of the file.

Create a schemas directory in your project and create these files:

module-info.java
module records { (1)
    opens records; (1)
}
1 We'll use "records" as the package name. It can be anything, but it needs to match the one used in the pipeline
Field1Field20.java
package records; (1)

record Field1Field20( (2)
  String uuid,
  String field1,
  String field20
){}
1 the package we specified in module-info.java
2 a Java record with the name and type of each field we want to read

Lastly, we need to compile our schema:

javac --release 17 -d lib/ schemas/*

INFO

This step only needs to be done once, as long as your schema doesn't change

INFO

If you version your pipeline, you need to version at least the schema (the Java files) and generate the binaries before running the pipeline

Now we'll run a "raw" count:

workflow {
    Channel.fromPath(params.input)
        .splitParquet()
        .count()
        .view()
}

vs a "schema" count

workflow {
    Channel.fromPath(params.input)
        .splitParquet( record: Field1Field20 )
        .count()
        .view()
}

As you can see, the only difference is that we pass a record parameter to indicate which fields we're interested in reading (Parquet will not only ignore the other columns, it won't even read them from disk). These are the timings on my computer; a complete version of the schema-based script is sketched after the list:

  • csv took user 0m32,118s

  • "raw" took user 0m20,470s

  • "schema" took user 0m14,590s

Writing

The nf-parquet plugin not only allows us to read Parquet files but also to create them.

INFO

Writing requires a schema (implemented as a Java record, as shown in the previous section)

Create a Row.java record with your schema:

Row.java
package records;

record Row(
  String id,
  String field1,
  String field2,
  String field3,
  String field4,
  String field5,
  String field6,
  String field7,
  String field8,
  String field9,
  String field10,

  Float field11,
  Float field12,
  Float field13,
  Float field14,
  Float field15,
  Float field16,
  Float field17,
  Float field18,
  Float field19,
  Float field20
){}
INFO

You can use nested records, for example a position record with latitude and longitude fields, or more complex structures that would be impossible to represent in CSV

and compile it:

javac --release 17 -d lib/ schemas/*

INFO

The lib folder will now contain a records subfolder with two compiled classes, Field1Field20 and Row

convert.nf
include { splitParquet; toParquet } from 'plugin/nf-parquet' (1)

import records.* (2)

params.index = "$baseDir/data/example.csv"
params.output = "$baseDir/data/converted.parquet"

workflow {
    Channel.fromPath(params.index)
        .splitCsv(header:true)
        .map{ row ->
            new Row( (3)
                row.uuid,
                row.field1, row.field2, row.field3, row.field4, row.field5,
                row.field6, row.field7, row.field8, row.field9, row.field10,
                row.field11 as float,
                row.field12 as float,
                row.field13 as float,
                row.field14 as float,
                row.field15 as float,
                row.field16 as float,
                row.field17 as float,
                row.field18 as float,
                row.field19 as float,
                row.field20 as float
            )
        }
        .toParquet( params.output, [record: Row] )
}
1 include the new operator from the plugin
2 import the package you specified in module-info.java
3 instantiate a Java record with the desired info

If all goes well you’ll have a data/converted.parquet file.
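
As a quick sanity check, you can read the converted file back with the same splitParquet operator and confirm the row count. A small sketch (check.nf is just a hypothetical filename):

check.nf
include { splitParquet } from 'plugin/nf-parquet'

params.input = "$baseDir/data/converted.parquet"

workflow {
    Channel.fromPath(params.input)
        .splitParquet()  // "raw" mode, no schema needed just to count
        .count()
        .view()          // should print 1000000
}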

INFO

Also, if you have a Parquet viewer, you can see that the Parquet file generated online declares all columns as String, while our converted file uses Float for the last columns, following our schema.


This post has been written by a human
