Return unique hash for input #137

Closed
@DGollings

So I'm busy ingesting shipments; they arrive as either CSV, JSON, XML or EDI.

The interface I'm working on should take an array of shipments, divide that into individual shipments, hash those and store the original input for success/audit/retry/failure tracking reasons. This would make it easier to ingest 99/100 shipments and retry (after localizing and fixing the issue) the one shipment that's invalid for whatever reason.

In order to decide whether something has been ingested correctly, I thought a solution could be hashing each 'unit' of input and storing the original input somewhere as well.

Quite easy for csv

Weird python-and-bash-esque pseudocode:

for line in csv:
  process(line) && hash(line) && gzip(line) -> store result, hash, line in db
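
A minimal Go sketch of the same loop, under a few assumptions: the input is split on newlines (so quoted CSV fields containing line breaks would need a real CSV reader), SHA-256 is the hash, and `record`, `hashAndCompress` and `ingestLines` are hypothetical names standing in for whatever ends up in the db.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// record is a hypothetical row for an audit table: the hash identifies the
// unit of input, and the gzipped bytes preserve the original for retries.
type record struct {
	Hash    string
	Gzipped []byte
}

// hashAndCompress hashes one raw line and gzips it for storage.
func hashAndCompress(line []byte) (record, error) {
	sum := sha256.Sum256(line)
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(line); err != nil {
		return record{}, err
	}
	if err := zw.Close(); err != nil {
		return record{}, err
	}
	return record{Hash: hex.EncodeToString(sum[:]), Gzipped: buf.Bytes()}, nil
}

// ingestLines splits the input into lines and builds one record per line.
func ingestLines(input string) ([]record, error) {
	var out []record
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		rec, err := hashAndCompress(sc.Bytes())
		if err != nil {
			return nil, err
		}
		out = append(out, rec)
	}
	return out, sc.Err()
}

func main() {
	recs, err := ingestLines("id,dest\n1,AMS\n2,RTM\n")
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Println(r.Hash, len(r.Gzipped))
	}
}
```

Because the hash is taken over the raw line bytes, the same line always yields the same hash, which is what makes a single failed shipment easy to find and retry.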

It becomes less so for JSON and XML: even an unmarshal followed by a marshal is not 100% identical to the input.
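
To make the round-trip problem concrete, here is a small Go sketch using only `encoding/json`. The sample document and the helper name `roundTrip` are made up for illustration; the point is that whitespace, key order and number formatting can all change, so the bytes (and therefore the hash) differ even though the data is the same.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// roundTrip decodes raw JSON into a generic value and encodes it again.
func roundTrip(raw []byte) ([]byte, error) {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

func main() {
	raw := []byte(`{ "b": 1.0, "a": "x" }`)
	out, err := roundTrip(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("in:  %s\nout: %s\n", raw, out)
	// The re-encoded form drops whitespace, sorts map keys and prints 1.0 as 1.
	fmt.Println("same bytes:", bytes.Equal(raw, out))
	fmt.Println("same hash: ", sha256.Sum256(raw) == sha256.Sum256(out))
}
```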

Even worse is EDI

So, even though I liked the idea of storing the original, it quickly becomes cumbersome. A decent alternative is hashing and storing the output of transform.Read()

But that comes with several issues

  • I can change the output, and thus the hash, using the schema (not really an issue)
  • it's not the original (but it is more consistent (all JSON)), so kind of bug/feature
  • I don't see what I haven't told omniparser to see, so new fields that might have been added go unnoticed

None of these is a major issue, but they all come from hashing a new representation of the input, not the input itself
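
The trade-offs listed above can be sketched in Go without involving omniparser at all. In this illustration the `shipment` struct is a hypothetical stand-in for "the fields the schema extracts", and `hashTransformed` is a made-up helper that hashes the re-encoded view rather than the raw input.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// shipment is a hypothetical stand-in for the fields a schema extracts.
type shipment struct {
	ID   string `json:"id"`
	Dest string `json:"dest"`
}

// hashTransformed decodes the raw input into the schema's view of it and
// hashes the re-encoded result, so anything outside the schema is dropped.
func hashTransformed(raw []byte) (string, error) {
	var s shipment
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", err
	}
	out, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(out)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, errA := hashTransformed([]byte(`{"id":"1","dest":"AMS"}`))
	// Same shipment, different key order, plus a field the schema ignores.
	b, errB := hashTransformed([]byte(`{"dest":"AMS","id":"1","weight":12}`))
	if errA != nil || errB != nil {
		panic("decode failed")
	}
	fmt.Println("hashes equal:", a == b)
}
```

The two inputs hash identically: that is the consistency benefit, and also the blind spot, since the added `weight` field never reaches the hash.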

I was wondering how hard it would be to hash the input of whatever generates the output?
So:
hash, data, err := transform.Read()

Is your internal data stable enough that you could, say, 'for loop' the IDR input through the sha256 encoder (it supports streaming) and return a stable/unchanging hash?

As in, in theory ["a", "b", "c"] should return the same hash for each of a, b and c regardless of their ordering in the array
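
A rough Go sketch of what streaming a parsed record through SHA-256 could look like. Everything here is an assumption for illustration: `node` is a simplified, hypothetical tree type and not omniparser's actual IDR structure, and the length-prefixed encoding is just one way to keep the byte stream unambiguous.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"hash"
)

// node is a hypothetical, simplified stand-in for a parsed-input tree node;
// it is not omniparser's actual IDR type.
type node struct {
	Name     string
	Value    string
	Children []*node
}

// writeLen streams a length or count as 8 big-endian bytes.
func writeLen(h hash.Hash, n int) {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], uint64(n))
	h.Write(b[:]) // hash.Hash.Write never returns an error
}

// writeField streams one length-prefixed string, so that ("ab","c") and
// ("a","bc") produce different byte streams.
func writeField(h hash.Hash, s string) {
	writeLen(h, len(s))
	h.Write([]byte(s))
}

// writeNode walks the subtree depth-first in document order.
func writeNode(h hash.Hash, n *node) {
	writeField(h, n.Name)
	writeField(h, n.Value)
	writeLen(h, len(n.Children))
	for _, child := range n.Children {
		writeNode(h, child)
	}
}

// hashNode returns the hex SHA-256 of one record subtree.
func hashNode(n *node) string {
	h := sha256.New()
	writeNode(h, n)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := &node{Name: "shipment", Children: []*node{{Name: "id", Value: "a"}}}
	b := &node{Name: "shipment", Children: []*node{{Name: "id", Value: "b"}}}
	// Each record gets its own hasher, so its hash does not depend on where
	// it sits in the batch.
	for _, batch := range [][]*node{{a, b}, {b, a}} {
		for _, rec := range batch {
			fmt.Println(rec.Children[0].Value, hashNode(rec))
		}
	}
}
```

Hashing per record rather than per file is what gives the ordering property: the hash of a is the same whether it arrives first or last. Whether this is stable across library versions depends entirely on how stable the real intermediate representation is, which is the open question here.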

Also, I imagine being able to verify whether a file has been fully processed is interesting for more than one use case
