The reason there's a lot of startups in the OCR space (us being one of them) is the classic 80/20 rule. Any solution that's 80% accurate just doesn't work for most applications.
Converting a clean .docx into markdown is 10 lines of python. But what about the same document with a screenshot of an excel file? Or complex table layouts? The .NORM files that people actually use. Definitely agree with having a toggle between rules-based/ocr. But if you're looking at company wide docs, you won't always know which to pick.
The response from MarkItDown seems pretty barebones. I expected it to convert the clean pdf table element into a markdown table, but it just pulls the plaintext, which drops the header/column relationship.
> Any solution that's 80% accurate just doesn't work for most applications.
And yet people use LLMs, for which "80% accuracy" is still mostly an aspiration. :-)
I think it's reasonably likely most people companies end up using open source libraries, at least partly because it lets them avoid adding another GDPR sub-processor. Unstructured.io, one of your competitors, goes as far as having an AWS Marketplace setup so customers can use their own infrastruture but still pay them.
LLMs might get better at consuming badly-formatted data, so the data only needs to meet that minimum bar, vs the admittedly very nice output you showed.
> LLMs might get better at consuming badly-formatted data
Oh agreed. There's definitely a meeting in the middle between better ingestion and smarter models. LLMs are already a great fuzzing layer for that type of interpretation. And even with a perfect WYSIWYG text extraction, you're still limited by how coherent the original document was in the first place.
Converting a clean .docx into markdown is 10 lines of python. But what about the same document with a screenshot of an excel file? Or complex table layouts? The .NORM files that people actually use. Definitely agree with having a toggle between rules-based/ocr. But if you're looking at company wide docs, you won't always know which to pick.
Example with one of our test files:
Input: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...
MarkItDown: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...
Ours: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...
The response from MarkItDown seems pretty barebones. I expected it to convert the clean pdf table element into a markdown table, but it just pulls the plaintext, which drops the header/column relationship.