
Annotation Design Discussion #56

@doulikecookiedough

Questions & Todo:

  • Discuss how Annotations should be implemented in HashStore
  • What format should we use to store annotation content in /hashstore/metadata? JSON-LD or EML?
  • What is HashStore's responsibility when storing annotations?
    • Is the EML document already formed at this point?
    • Where is the content coming from?
    • Who currently creates the EML documents to be stored?
  • Summarize issue discussion into substorage design document

Initial Proposal to kickstart the conversation (the content below is not final, and will likely change):

  • A dataset that is represented by an EML document can be broken down into two components:
    • Attributes that describe the dataset (ex. title, author, method, keywordSet, etc.)
    • Attributes that represent the tables associated with the dataset (ex. dataTable, otherEntity, etc.)
  • A HashStore annotation is a mapping document that should consist of a single parent member and a list that represents the child members
    • This document's location in hashstore/metadata is formed by calculating the SHA-256 hex digest of a given pid and formatId
      • The parent member's value is the id (location) of the parent metadata document in hashstore/metadata
        - The id/location/address of this document is formed by calculating the SHA-256 hex digest of a given pid, formatId and the string "parent". Ex. sha-256(pid + formatId + "parent")
        - This document is composed of the attributes/content that describe the dataset (ex. title, author, method, keywordSet, etc.)
      • The List/HashMap of child members are represented with a number as the key, and the id (location) of the child's metadata document in hashstore/metadata as the value
        - The id/address of each child is formed by calculating the SHA-256 hex digest of a given pid, formatId and (int) key. Ex. sha-256(pid + formatId + 0) where 0 is the first table in the dataset
        - Each child represents a data table in the dataset, or chunk of data that belongs to the dataset
  • Note: The format of the parent/child metadata documents to be stored/chunked requires further discussion/clarification
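The addressing scheme proposed above can be sketched in Python as follows. This is only an illustration: it assumes the pid, formatId and suffix are concatenated as UTF-8 strings with no separator before hashing, and the helper names (`annotation_id`, `parent_id`, `child_id`) are hypothetical and not part of HashStore.

```python
import hashlib


def _sha256_hex(*parts: str) -> str:
    """Return the SHA-256 hex digest of the concatenated parts (assumed UTF-8)."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def annotation_id(pid: str, format_id: str) -> str:
    # Location of the mapping (annotation) document itself
    return _sha256_hex(pid, format_id)


def parent_id(pid: str, format_id: str) -> str:
    # Location of the parent metadata document
    return _sha256_hex(pid, format_id, "parent")


def child_id(pid: str, format_id: str, key: int) -> str:
    # Location of the child document for table number `key` (0-based)
    return _sha256_hex(pid, format_id, str(key))
```

One thing the sketch makes visible: with plain concatenation, different inputs can produce the same string (for example, pid `"a"` with formatId `"bc"` and pid `"ab"` with formatId `"c"`), so a delimiter between the parts may be worth considering when the scheme is finalized.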
---
title: HashStoreAnnotation Class
---
classDiagram
    direction RL
    class HashStoreAnnotation{
        +String Parent
        +List~Dict/KVP~ Children
        +setParent(string)
        +setChildren(List)
        +getContent()
        +setContent()
        +getChildrenTotal()
    }
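A minimal Python sketch of the class in the diagram might look like the following. The method names follow the diagram. Everything else is an assumption for illustration: plain JSON stands in for the serialization format, which is still undecided (JSON-LD or EML), and `setContent` takes a string rather than a stream.

```python
import json
from typing import Dict, List, Optional


class HashStoreAnnotation:
    """Mapping document: one parent id plus an ordered list of child ids."""

    def __init__(self) -> None:
        self.parent: Optional[str] = None
        self.children: List[Dict[int, str]] = []

    def setParent(self, parent: str) -> None:
        self.parent = parent

    def setChildren(self, children: List[Dict[int, str]]) -> None:
        self.children = list(children)

    def getChildrenTotal(self) -> int:
        return len(self.children)

    def getContent(self) -> str:
        # JSON object keys must be strings, so integer keys are written as text
        doc = {
            "parent": self.parent,
            "children": [{str(k): v for k, v in c.items()} for c in self.children],
        }
        return json.dumps(doc)

    def setContent(self, content: str) -> None:
        # Rebuild the in-memory form, converting keys back to integers
        doc = json.loads(content)
        self.parent = doc["parent"]
        self.children = [{int(k): v for k, v in c.items()} for c in doc["children"]]
```

A round trip through `getContent()` and `setContent()` should reproduce the same parent and children, which is a cheap property to test whichever format is eventually chosen.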
Example/flow to store an annotation document:

hs_annotation = HashStoreAnnotation()

// Get and store parent content
// Get and store children content

// Get parent location
dataset_parent = sha-256(pid + formatId + "parent")
// Create child list
dataset_children = [
    {0: sha-256(pid + formatId + 0)},
    {1: sha-256(pid + formatId + 1)},
    ...
]
hs_annotation.setParent(dataset_parent)
hs_annotation.setChildren(dataset_children)

// getContent() will format the document to be written based on the chosen format
hs_annotation_content = hs_annotation.getContent()

hashstore.store_metadata(pid, hs_annotation_content, formatId) 

Example/flow to work with/retrieve an annotation document:

// Retrieve the mapping document
hs_annotation_stream = hashstore.retrieve_metadata(pid, formatId)
hs_annotation = HashStoreAnnotation()
hs_annotation.setContent(hs_annotation_stream)
hsa_parent = hs_annotation.parent
hsa_children = hs_annotation.children

// Iterate over up to the first 1000 table items
for i in range(0, min(1000, hs_annotation.getChildrenTotal())):
     // Each child entry is {key: id}, so look up the id by its key
     rel_path = shard(hsa_children[i][i])
     location = "/hashstore/metadata/" + rel_path
     // ... Do what we will with each child element
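The retrieval flow above calls `shard()` without defining it. A minimal sketch is given below, assuming it splits a hex digest into a fixed number of directory levels followed by the remainder as the file name. The depth of 3 and width of 2 are assumed defaults for illustration and are not stated in this issue.

```python
from typing import List


def shard(hex_id: str, depth: int = 3, width: int = 2) -> str:
    """Split a hex digest into `depth` directories of `width` characters each,
    followed by the remainder of the digest as the file name."""
    if len(hex_id) <= depth * width:
        raise ValueError("id is too short to shard with the given depth and width")
    parts: List[str] = [hex_id[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(hex_id[depth * width:])
    return "/".join(parts)
```

With these assumed defaults, `shard("abcdef0123")` returns `"ab/cd/ef/0123"`, so a child's location would be `/hashstore/metadata/ab/cd/ef/0123`.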
