DataSet

#include <pbbam/DataSet.h>
class PacBio::BAM::DataSet

The DataSet class represents a PacBio analysis dataset (e.g. from XML).

It provides resource paths, filters, and metadata associated with a dataset under analysis.

DataSet Type

enum TypeEnum

This enum defines the currently-supported DataSet types.

Values:

GENERIC = 0
ALIGNMENT
BARCODE
CONSENSUS_ALIGNMENT
CONSENSUS_READ
CONTIG
HDF_SUBREAD
REFERENCE
SUBREAD
static DataSet::TypeEnum NameToType(const std::string &typeName)

Converts printable dataset type to type enum.

Return
dataset type enum
Parameters
  • typeName: printable dataset type
Exceptions
  • std::runtime_error: if typeName is unknown

static std::string TypeToName(const DataSet::TypeEnum &type)

Converts dataset type enum to printable name.

Return
printable dataset type
Parameters
  • type: dataset type enum
Exceptions
  • std::runtime_error: if type is unknown
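
The two converters above can be pictured as lookups in a single name table. The following is a minimal, self-contained sketch of that idea, not pbbam's implementation: SketchType, SketchNameToType, and SketchTypeToName are hypothetical stand-ins, and the printable names in the table are assumptions (only "SubreadSet" appears elsewhere on this page).

```cpp
#include <stdexcept>
#include <string>

// Stand-in enum mirroring the values listed above (hypothetical; not pbbam's header).
enum class SketchType {
    GENERIC = 0, ALIGNMENT, BARCODE, CONSENSUS_ALIGNMENT, CONSENSUS_READ,
    CONTIG, HDF_SUBREAD, REFERENCE, SUBREAD
};

struct TypeNameEntry {
    SketchType type;
    const char* name;
};

// Assumed printable names, one per enum value.
const TypeNameEntry kTypeNames[] = {
    {SketchType::GENERIC, "DataSet"},
    {SketchType::ALIGNMENT, "AlignmentSet"},
    {SketchType::BARCODE, "BarcodeSet"},
    {SketchType::CONSENSUS_ALIGNMENT, "ConsensusAlignmentSet"},
    {SketchType::CONSENSUS_READ, "ConsensusReadSet"},
    {SketchType::CONTIG, "ContigSet"},
    {SketchType::HDF_SUBREAD, "HdfSubreadSet"},
    {SketchType::REFERENCE, "ReferenceSet"},
    {SketchType::SUBREAD, "SubreadSet"},
};

// Printable name -> enum; throws std::runtime_error if the name is not in the table.
SketchType SketchNameToType(const std::string& typeName) {
    for (const auto& entry : kTypeNames) {
        if (typeName == entry.name) return entry.type;
    }
    throw std::runtime_error("unknown dataset type name: " + typeName);
}

// Enum -> printable name; throws std::runtime_error if the value is not in the table.
std::string SketchTypeToName(SketchType type) {
    for (const auto& entry : kTypeNames) {
        if (entry.type == type) return entry.name;
    }
    throw std::runtime_error("unknown dataset type");
}
```

Keeping both directions on one table means a new type only has to be added in one place.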

PacBio::BAM::DataSet::TypeEnum Type() const

Fetches the dataset’s type.

Return
dataset type enum

std::string TypeName() const

Fetches the dataset’s type as a printable name.

Return
printable dataset type

DataSet &Type(const PacBio::BAM::DataSet::TypeEnum type)

Edits dataset type.

Return
reference to this dataset object
Parameters
  • type: new dataset type

Constructors & Related Methods

static DataSet FromXml(const std::string &xml)

Creates a DataSet from “raw” XML data.

Parameters
  • xml: DataSetXML text

DataSet()

Constructs an empty, generic DataSet.

DataSet(const DataSet::TypeEnum type)

Constructs an empty DataSet of the type specified.

Parameters
  • type: dataset type
Exceptions
  • std::runtime_error: if type is unknown

DataSet(const BamFile &bamFile)

Constructs a DataSet from a BAM file.

This currently defaults to a SubreadSet, with an ExternalResource pointing to BamFile::Filename.

Parameters
  • bamFile: input BAM file

DataSet(const std::string &filename)

Loads a DataSet from a file.

filename may be one of the following types, indicated by its extension:

  • BAM (“*.bam”)
  • FOFN (“*.fofn”)
  • FASTA (“*.fa” or “*.fasta”)
  • DataSetXML (“*.xml”)

Parameters
  • filename: input filename
Exceptions
  • std::runtime_error: if filename has an unsupported extension, or if a valid DataSet could not be created from its contents
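
The extension-based dispatch described above might look roughly like the sketch below. It is illustrative only: InputKind, EndsWith, and ClassifyInput are hypothetical names, and case-sensitive matching is an assumption, since this page does not say how pbbam compares extensions.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical classification of the four supported input kinds.
enum class InputKind { BAM, FOFN, FASTA, XML };

// Returns true if 'text' ends with 'suffix' (case-sensitive; an assumption).
bool EndsWith(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Maps a filename to an input kind by extension, or throws
// std::runtime_error for an unsupported extension.
InputKind ClassifyInput(const std::string& filename) {
    if (EndsWith(filename, ".bam")) return InputKind::BAM;
    if (EndsWith(filename, ".fofn")) return InputKind::FOFN;
    if (EndsWith(filename, ".fa") || EndsWith(filename, ".fasta")) return InputKind::FASTA;
    if (EndsWith(filename, ".xml")) return InputKind::XML;
    throw std::runtime_error("unsupported extension: " + filename);
}
```

The sketch covers only the first failure mode listed under Exceptions; a file with a supported extension can still fail later if its contents do not yield a valid DataSet.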

DataSet(const std::vector<std::string> &filenames)

Constructs a DataSet from a list of files.

Parameters
  • filenames: input filenames
Exceptions
  • std::runtime_error: if DataSet could not be created from filenames

DataSet(const DataSet &other)
DataSet(DataSet&&)
DataSet &operator=(const DataSet &other)
DataSet &operator=(DataSet&&)
~DataSet()

Operators

DataSet &operator+=(const DataSet &other)

Merges DataSet contents.

Adds the contents of other to this dataset object.

Return
reference to this dataset object
Parameters
  • other: some other dataset to add to this one

Serialization

void Save(const std::string &outputFilename)

Saves dataset XML to file.

Parameters
  • outputFilename: destination for XML contents
Exceptions
  • std::runtime_error: if file could not be opened or if DataSet elements could not be converted to XML

void SaveToStream(std::ostream &out)

Saves dataset XML to output stream, e.g. std::cout, std::stringstream.

Parameters
  • out: destination for XML contents
Exceptions
  • std::runtime_error: if DataSet elements could not be converted to XML

Attributes

const std::string &Attribute(const std::string &name) const

Fetches the value of a DataSet root element’s attribute.

These are the attributes attached to the root dataset element:

<SubreadSet foo="x" bar="y" />

Built-in accessors exist for the standard attributes (e.g. CreatedAt) but additional attributes can be used as well via these generic Attribute methods.

Return
const reference to attribute’s value (empty string if not present)
Parameters
  • name: root element’s attribute name

const std::string &CreatedAt() const

Fetches the value of dataset’s CreatedAt attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &Format() const

Fetches the value of dataset’s Format attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &MetaType() const

Fetches the value of dataset’s MetaType attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &ModifiedAt() const

Fetches the value of dataset’s ModifiedAt attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &Name() const

Fetches the value of dataset’s Name attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &ResourceId() const

Fetches the value of dataset’s ResourceId attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &Tags() const

Fetches the value of dataset’s Tags attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &TimeStampedName() const

Fetches the value of dataset’s TimeStampedName attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &UniqueId() const

Fetches the value of dataset’s UniqueId attribute.

Return
const reference to attribute’s value (empty string if not present)

const std::string &Version() const

Fetches the value of dataset’s Version attribute.

Return
const reference to attribute’s value (empty string if not present)

std::string &Attribute(const std::string &name)

Fetches the value of a DataSet root element’s attribute.

These are the attributes attached to the root dataset element:

<SubreadSet foo="x" bar="y" />

Built-in accessors exist for the standard attributes (e.g. CreatedAt) but additional attributes can be used as well via these generic methods.

A new attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)
Parameters
  • name: root element’s attribute name

std::string &CreatedAt()

Fetches the value of dataset’s CreatedAt attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &Format()

Fetches the value of dataset’s Format attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &MetaType()

Fetches the value of dataset’s MetaType attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &ModifiedAt()

Fetches the value of dataset’s ModifiedAt attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &Name()

Fetches the value of dataset’s Name attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &ResourceId()

Fetches the value of dataset’s ResourceId attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &Tags()

Fetches the value of dataset’s Tags attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &TimeStampedName()

Fetches the value of dataset’s TimeStampedName attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &UniqueId()

Fetches the value of dataset’s UniqueId attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

std::string &Version()

Fetches the value of dataset’s Version attribute.

This attribute will be created if it does not yet exist.

Return
non-const reference to attribute’s value (empty string if this is a new attribute)

DataSet &Attribute(const std::string &name, const std::string &value)

Sets the value of this dataset’s root element attribute name to value.

These are the attributes attached to the root dataset element:

<SubreadSet foo="x" bar="y" />

Built-in accessors exist for the standard attributes (e.g. CreatedAt) but additional attributes can be used as well via these generic methods.

The attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • name: root element’s attribute name
  • value: new value for the attribute
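
The three Attribute overloads (const fetch, non-const fetch, and setter) differ mainly in what happens when the attribute is missing and in what they return. The sketch below models those semantics with a plain map. AttributeHolder and Has are hypothetical names, and the map-backed storage is an assumption, not a description of pbbam's internals.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for an object that carries root-element attributes.
class AttributeHolder {
public:
    // Const fetch: never modifies the object; a missing attribute
    // yields a reference to a shared empty string.
    const std::string& Attribute(const std::string& name) const {
        static const std::string empty;
        const auto it = attributes_.find(name);
        return it == attributes_.end() ? empty : it->second;
    }

    // Non-const fetch: creates the attribute (as an empty string) if missing,
    // then returns a reference the caller can assign through.
    std::string& Attribute(const std::string& name) { return attributes_[name]; }

    // Setter: creates or overwrites, and returns *this so calls can be chained.
    AttributeHolder& Attribute(const std::string& name, const std::string& value) {
        attributes_[name] = value;
        return *this;
    }

    // Helper for the sketch only: reports whether an attribute exists.
    bool Has(const std::string& name) const { return attributes_.count(name) != 0; }

private:
    std::map<std::string, std::string> attributes_;
};
```

With this shape, holder.Attribute("foo") = "x" and holder.Attribute("foo", "x") leave the object in the same state; the setter form is the one that chains.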

DataSet &CreatedAt(const std::string &createdAt)

Sets this dataset’s CreatedAt attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • createdAt: new value for the attribute

DataSet &Format(const std::string &format)

Sets this dataset’s Format attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • format: new value for the attribute

DataSet &MetaType(const std::string &metatype)

Sets this dataset’s MetaType attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • metatype: new value for the attribute

DataSet &ModifiedAt(const std::string &modifiedAt)

Sets this dataset’s ModifiedAt attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • modifiedAt: new value for the attribute

DataSet &Name(const std::string &name)

Sets this dataset’s Name attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • name: new value for the attribute

DataSet &ResourceId(const std::string &resourceId)

Sets this dataset’s ResourceId attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • resourceId: new value for the attribute

DataSet &Tags(const std::string &tags)

Sets this dataset’s Tags attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • tags: new value for the attribute

DataSet &TimeStampedName(const std::string &timeStampedName)

Sets this dataset’s TimeStampedName attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • timeStampedName: new value for the attribute

DataSet &UniqueId(const std::string &uuid)

Sets this dataset’s UniqueId attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • uuid: new value for the attribute

DataSet &Version(const std::string &version)

Sets this dataset’s Version attribute.

This attribute will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • version: new value for the attribute

Child Elements

const PacBio::BAM::Extensions &Extensions() const

Fetches the dataset’s Extensions element.

Return
const reference to child element
Exceptions
  • std::runtime_error: if element does not exist

const PacBio::BAM::ExternalResources &ExternalResources() const

Fetches the dataset’s ExternalResources element.

Return
const reference to child element
Exceptions
  • std::runtime_error: if element does not exist

const PacBio::BAM::Filters &Filters() const

Fetches the dataset’s Filters element.

Return
const reference to child element

const PacBio::BAM::DataSetMetadata &Metadata() const

Fetches the dataset’s DataSetMetadata element.

Return
const reference to child element

const PacBio::BAM::SubDataSets &SubDataSets() const

Fetches the dataset’s DataSets element.

Return
const reference to child element

PacBio::BAM::Extensions &Extensions()

Fetches the dataset’s Extensions element.

This element will be created if it does not yet exist.

Return
non-const reference to child element
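
The const and non-const Extensions accessors treat a missing element differently: the const form is documented to throw, while the non-const form creates the element. A minimal sketch of that pairing follows. ExtensionsSketch and ElementHolder are hypothetical stand-ins, and holding the child in a std::unique_ptr is an assumption made for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a child element.
struct ExtensionsSketch {
    std::vector<std::string> items;
};

// Hypothetical stand-in for the owning dataset object.
class ElementHolder {
public:
    // Const access cannot create anything, so a missing element is an error.
    const ExtensionsSketch& Extensions() const {
        if (!extensions_) throw std::runtime_error("Extensions element does not exist");
        return *extensions_;
    }

    // Non-const access creates the element on first use.
    ExtensionsSketch& Extensions() {
        if (!extensions_) extensions_ = std::make_unique<ExtensionsSketch>();
        return *extensions_;
    }

private:
    std::unique_ptr<ExtensionsSketch> extensions_;
};
```

One practical consequence: code that only holds a const reference may want to catch std::runtime_error, or make sure the element was created earlier through a non-const reference.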

PacBio::BAM::ExternalResources &ExternalResources()

Fetches the dataset’s ExternalResources element.

This element will be created if it does not yet exist.

Return
non-const reference to child element

PacBio::BAM::Filters &Filters()

Fetches the dataset’s Filters element.

This element will be created if it does not yet exist.

Return
non-const reference to child element

PacBio::BAM::DataSetMetadata &Metadata()

Fetches the dataset’s DataSetMetadata element.

This element will be created if it does not yet exist.

Return
non-const reference to child element

PacBio::BAM::SubDataSets &SubDataSets()

Fetches the dataset’s DataSets element.

This element will be created if it does not yet exist.

Return
non-const reference to child element

DataSet &Extensions(const PacBio::BAM::Extensions &extensions)

Sets this dataset’s Extensions element.

This element will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • extensions: new value for the element

DataSet &ExternalResources(const PacBio::BAM::ExternalResources &resources)

Sets this dataset’s ExternalResources element.

This element will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • resources: new value for the element

DataSet &Filters(const PacBio::BAM::Filters &filters)

Sets this dataset’s Filters element.

This element will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • filters: new value for the element

DataSet &Metadata(const PacBio::BAM::DataSetMetadata &metadata)

Sets this dataset’s DataSetMetadata element.

This element will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • metadata: new value for the element

DataSet &SubDataSets(const PacBio::BAM::SubDataSets &subdatasets)

Sets this dataset’s DataSets element.

This element will be created if it does not yet exist.

Return
reference to this dataset object
Parameters
  • subdatasets: new value for the element

Resource Handling

std::vector<std::string> AllFiles() const

Returns all of this dataset’s resource files, with relative filepaths already resolved.

Includes both primary resources (e.g. subread BAM files), as well as all secondary or child resources (e.g. index files, scraps BAM, etc).

Return
vector of (resolved) filepaths
See
DataSet::ResolvedResourceIds

std::vector<BamFile> BamFiles() const

Returns this dataset’s primary BAM resources, with relative filepaths already resolved.

Primary resources are those listed as top-level ExternalResources, not associated files (indices, references, scraps BAMs, etc.).

Return
vector of BamFiles
See
DataSet::ResolvedResourceIds
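
The distinction between primary resources and all files can be modelled as a small tree: top-level entries are primary, and anything nested beneath them is secondary. The sketch below uses hypothetical names (ResourceSketch, PrimaryPaths, AllPaths) and assumes depth-first ordering; it omits path resolution and is not pbbam's data model.

```cpp
#include <string>
#include <vector>

// Hypothetical resource node: a file path plus any associated child resources
// (e.g. index files or scraps BAMs).
struct ResourceSketch {
    std::string path;
    std::vector<ResourceSketch> children;
};

// Primary resources: only the top-level entries.
std::vector<std::string> PrimaryPaths(const std::vector<ResourceSketch>& topLevel) {
    std::vector<std::string> out;
    for (const auto& resource : topLevel) out.push_back(resource.path);
    return out;
}

// Appends a resource and all of its descendants, depth-first.
void CollectAll(const ResourceSketch& resource, std::vector<std::string>& out) {
    out.push_back(resource.path);
    for (const auto& child : resource.children) CollectAll(child, out);
}

// All files: every top-level entry plus everything nested beneath it.
std::vector<std::string> AllPaths(const std::vector<ResourceSketch>& topLevel) {
    std::vector<std::string> out;
    for (const auto& resource : topLevel) CollectAll(resource, out);
    return out;
}
```

For a single subread BAM with an index and a scraps BAM attached, PrimaryPaths yields one entry and AllPaths yields three.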

std::vector<std::string> FastaFiles() const

Returns this dataset’s primary FASTA resources, with relative filepaths already resolved.

Primary resources are those listed as top-level ExternalResources, not associated files (indices, references, scraps BAMs, etc.).

Return
vector of filepaths to FASTA resources
See
DataSet::ResolvedResourceIds

std::vector<std::string> ResolvedResourceIds() const

Returns all primary external resource filepaths, with relative paths resolved.

Primary resources are those listed as top-level ExternalResources, not associated files (indices, references, scraps BAMs, etc.).

Return
vector of resolved resource filepaths
See
ResolvePath

std::string ResolvePath(const std::string &originalPath) const

Resolves a filepath (that may be relative to the dataset).

A DataSet’s resources may be described using absolute filepaths or with relative paths. For absolute paths, nothing is changed from the input. For relative paths, these are resolved using the DataSet’s own path as a starting point. A DataSet’s own path will be one of:

  • the location of its XML or BAM input file, e.g. created using DataSet("foo.xml") or DataSet("foo.bam")
  • the application’s current working directory, for all other DataSet construction methods { DataSet(), DataSet(type), DataSet("foo.fofn") }

Return
resolved path
Parameters
  • originalPath: input file path (absolute or relative)
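
The resolution rule described above can be sketched in a few lines. This is a simplified illustration, not pbbam's implementation: ResolvePathSketch and its baseDir parameter are hypothetical, only POSIX-style paths (leading '/') are treated as absolute, and leaving paths untouched when the base is the current working directory is an assumption.

```cpp
#include <string>

// baseDir stands for the directory of the dataset's own path, e.g. the
// directory containing "foo.xml", or "." for the current working directory.
std::string ResolvePathSketch(const std::string& baseDir, const std::string& originalPath) {
    // Absolute paths are returned unchanged.
    if (!originalPath.empty() && originalPath.front() == '/') return originalPath;

    // A base of "." (or no base) means "relative to the working directory",
    // so the relative path is already usable as-is.
    if (baseDir.empty() || baseDir == ".") return originalPath;

    // Otherwise, join base and relative path with exactly one separator.
    if (baseDir.back() == '/') return baseDir + originalPath;
    return baseDir + '/' + originalPath;
}
```

Under these assumptions, a dataset loaded from /data/run1/foo.xml that lists reads.bam would resolve it to /data/run1/reads.bam, while the same entry in a default-constructed dataset stays relative to the working directory.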

std::set<std::string> SequencingChemistries() const

Return
sequencing chemistry info for all read groups in this dataset
See
ReadGroupInfo::SequencingChemistry

XML Namespace Handling

const NamespaceRegistry &Namespaces() const

Accesses this dataset’s namespace info.

Return
const reference to dataset’s NamespaceRegistry

NamespaceRegistry &Namespaces()

Accesses this dataset’s namespace info.

Return
non-const reference to dataset’s NamespaceRegistry