The reason there's a lot of startups in the OCR space (us being one of them) is ...

irskep · on Dec 13, 2024

> Any solution that's 80% accurate just doesn't work for most applications.

And yet people use LLMs, for which "80% accuracy" is still mostly an aspiration. :-)

I think it's reasonably likely most companies end up using open source libraries, at least partly because it lets them avoid adding another GDPR sub-processor. Unstructured.io, one of your competitors, goes as far as having an AWS Marketplace setup so customers can use their own infrastructure but still pay them.

LLMs might get better at consuming badly-formatted data, so the extracted data would only need to meet that minimum bar, rather than match the admittedly very nice output you showed.

themanmaran · on Dec 13, 2024

> LLMs might get better at consuming badly-formatted data

Oh agreed. There's definitely a meeting in the middle between better ingestion and smarter models. LLMs are already a great fuzzing layer for that type of interpretation. And even with a perfect WYSIWYG text extraction, you're still limited by how coherent the original document was in the first place.