Description
So I'm busy ingesting shipments; they arrive as either CSV, JSON, XML or EDI.
The interface I'm working on should take an array of shipments, split it into individual shipments, hash those and store the original input for success/audit/retry/failure tracking. This would make it easier to ingest 99/100 shipments and retry (after localizing and fixing the issue) the one shipment that's invalid for whatever reason.
In order to decide whether something has been ingested correctly, I thought a solution could be hashing each 'unit' of input and storing the original input somewhere as well.
Quite easy for CSV.
Weird Python-and-Bash-esque pseudocode:
```
for line in csv:
    process(line) && hash(line) && gzip(line) -> store result, hash, line in db
```
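In actual Go, that per-line flow might look roughly like the sketch below (`process` and `store` are hypothetical stand-ins for my real ingestion and DB code):

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"log"
	"os"
)

// process is a hypothetical stand-in for ingesting a single shipment.
func process(line []byte) error { return nil }

// store is a hypothetical stand-in for persisting result, hash and original in the DB.
func store(processErr error, hash string, gzippedOriginal []byte) {}

func main() {
	f, err := os.Open("shipments.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Bytes()

		procErr := process(line) // ingest one shipment

		sum := sha256.Sum256(line) // hash of the original input line

		var buf bytes.Buffer
		zw := gzip.NewWriter(&buf)
		zw.Write(line) // keep a gzipped copy of the original line
		zw.Close()

		store(procErr, hex.EncodeToString(sum[:]), buf.Bytes())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```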
It becomes less easy for JSON and XML; even a marshal/unmarshal round trip is not 100% identical to the input.
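For example, a round trip through Go's encoding/json already re-orders keys and drops whitespace, so the bytes (and therefore any hash over them) no longer match the original:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

func main() {
	original := []byte(`{"b": 1, "a": 2}`)

	var v map[string]interface{}
	if err := json.Unmarshal(original, &v); err != nil {
		log.Fatal(err)
	}
	roundTripped, err := json.Marshal(v)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(original))     // {"b": 1, "a": 2}
	fmt.Println(string(roundTripped)) // {"a":2,"b":1} -- different bytes, so a different hash
}
```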
Even worse is EDI
So, even though I liked the idea of storing the original, it quickly becomes cumbersome. A decent alternative is hashing and storing the output of transform.Read().
But that comes with several issues:
- I can change the output, and thus the hash, by changing the schema (not really an issue)
- it's not the original (but it is more consistent: everything is JSON), so kind of a bug/feature
- I don't see what I haven't told omniparser to see, so new fields that might have been added to the input go unnoticed
None of these is a major issue, but they're all consequences of hashing a new representation of the input rather than the input itself.
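To make the "more consistent" point concrete with a made-up example (the two JSON blobs below stand in for transform.Read() outputs of the same shipment arriving once as CSV and once as XML):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Hypothetical transform.Read() outputs for the same shipment that arrived
	// once as CSV and once as XML: normalized to the same JSON, they hash identically...
	fromCSV := []byte(`{"id":"S1","weight":"10"}`)
	fromXML := []byte(`{"id":"S1","weight":"10"}`)
	fmt.Printf("%x\n", sha256.Sum256(fromCSV))
	fmt.Printf("%x\n", sha256.Sum256(fromXML))
	// ...but a field the schema doesn't map never reaches this output,
	// so it can't influence the hash either (the "new fields" issue above).
}
```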
I was wondering: how hard would it be to hash the input of whatever generates the output?
So:
`hash, data, err := transform.Read()`
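From the outside, the closest I can get today is hashing the raw bytes as omniparser consumes them, by wrapping the input reader in an io.TeeReader that feeds a sha256 hasher. A minimal sketch, assuming the NewSchema/NewTransform/Read flow from the omniparser README (file names are placeholders, and I may have details of the API slightly wrong):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/jf-tech/omniparser"
	"github.com/jf-tech/omniparser/transformctx"
)

func main() {
	schemaFile, err := os.Open("shipments_schema.json") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer schemaFile.Close()

	schema, err := omniparser.NewSchema("shipments", schemaFile)
	if err != nil {
		log.Fatal(err)
	}

	inputFile, err := os.Open("shipments.edi") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer inputFile.Close()

	// Every byte omniparser pulls from the input also flows into the hasher,
	// so the final sum covers exactly the bytes that were consumed.
	hasher := sha256.New()
	tee := io.TeeReader(inputFile, hasher)

	transform, err := schema.NewTransform("shipments.edi", tee, &transformctx.Ctx{})
	if err != nil {
		log.Fatal(err)
	}

	for {
		out, err := transform.Read() // one shipment as normalized JSON
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		_ = out // ...process/store the record as today...
	}

	// Only meaningful as a whole-file hash if the loop actually reached EOF.
	fmt.Println("input hash:", hex.EncodeToString(hasher.Sum(nil)))
}
```

The obvious drawback is that this gives one hash for the whole file rather than one per record, which is exactly why a per-record hash produced from the IDR itself would be nicer.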
Is your internal data stable enough that you could, say, 'for loop' the IDR input through the sha256 hasher (it supports streaming) and return a stable/unchanging hash?
As in: in theory, ["a", "b", "c"] should return the same hash for "a", "b" and "c" regardless of ordering.
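To spell out the property I mean with plain strings (each unit's hash depends only on its own bytes, not on where it sits in the batch):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Reordering the batch changes nothing about the individual hashes,
	// only the order in which they are produced.
	for _, batch := range [][]string{{"a", "b", "c"}, {"c", "a", "b"}} {
		for _, unit := range batch {
			fmt.Printf("%s -> %x\n", unit, sha256.Sum256([]byte(unit)))
		}
		fmt.Println("---")
	}
}
```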
Also, I imagine being able to verify whether a file has been fully processed is interesting for more than one use case.