Extracting structured data from long text + assessing information uncertainty

Hi all,

I’m considering extracting structured data about companies from reports, research papers, and news articles using an LLM.

I have a structured hierarchy of ~1000 questions (e.g., general info, future potential, market position, financials, products, public perception, etc.).

Some short articles will probably only contain data for ~10 questions, while longer reports may answer 100s.

The structured data extracts (answers to the questions) will be stored in a database. So a single article may create 100s of records in the database.

This is my goal:

Use an LLM to read both long reports (100+ pages) and short articles (<1 page).
Extract relevant data, structure it, and tag it with metadata (source, date, etc.).
Assess reliability (is it marketing, analysis, or speculation?).
- Is some information in the article more reliable than other parts?
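
To make the goal above concrete, here is a minimal sketch of what one database record could look like, using only the Python standard library. The field names, the `question_id` format, and the three reliability labels (taken from the marketing/analysis/speculation split above) are placeholders for illustration, not a settled schema.

```python
from dataclasses import dataclass, asdict
from datetime import date
from enum import Enum


class SourceKind(str, Enum):
    """Reliability label for a single extracted statement (assumed labels)."""
    MARKETING = "marketing"
    ANALYSIS = "analysis"
    SPECULATION = "speculation"


@dataclass(frozen=True)
class ExtractedRecord:
    question_id: str   # hypothetical key into the question hierarchy
    answer: str        # the extracted answer
    source: str        # document the answer came from
    published: date    # publication date of the document
    kind: SourceKind   # reliability label for this specific answer
    evidence: str      # verbatim passage that supports the answer


def to_row(record: ExtractedRecord) -> dict:
    """Flatten a record into plain strings, ready for a database insert."""
    row = asdict(record)
    row["published"] = record.published.isoformat()
    row["kind"] = record.kind.value
    return row
```

Storing the label and the supporting passage per record, rather than per article, would let one article contain both reliable and less reliable answers.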

Questions:

Which LLMs are most suitable for such big tasks? (Reasoning models, specific brands like OpenAI, Anthropic, DeepSeek, Mistral, etc.?)
Is it realistic for an LLM to handle 100s of pages and 100s of questions, with good quality responses?
Should I use chain prompting, or put everything in one large prompt? Putting everything in one large prompt would be the easiest for me. But I'm worried the LLM will give low quality responses if I put too much into a single prompt (the entire article + all the questions + all the instructions).
Will using a framework like LangChain/OpenAI Assistants give better quality responses, or can I just build my own pipeline – does it matter?
Will using Structured Outputs increase quality, or is an output example (JSON) in the prompt enough?
Should I set temperature to 0? Because I don't want the LLM to be creative. I just want it to collect facts from the articles and assess the reliability of these facts.
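
For reference, this is roughly what I understand the chained alternative to one large prompt to look like: split the article into overlapping chunks, split the questions into small batches, and build one prompt per (chunk, batch) pair. It is only a sketch under assumptions: the chunk size, overlap, batch size, and prompt wording are arbitrary placeholders, and the actual model call is left out on purpose.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the previous window has already reached the end of the text.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def batched(items: list[str], n: int) -> list[list[str]]:
    """Group items into consecutive batches of at most n."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def build_prompts(
    text: str,
    questions: list[str],
    chunk_size: int = 8000,   # placeholder value, in characters
    overlap: int = 500,       # placeholder value, in characters
    per_prompt: int = 20,     # placeholder number of questions per prompt
) -> list[str]:
    """Build one prompt for every (chunk, question batch) pair."""
    prompts = []
    for chunk in chunk_text(text, chunk_size, overlap):
        for batch in batched(questions, per_prompt):
            numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(batch, 1))
            prompts.append(
                "Answer only from the text below. "
                "Reply 'not stated' if the text does not answer a question.\n\n"
                f"TEXT:\n{chunk}\n\nQUESTIONS:\n{numbered}"
            )
    return prompts
```

The obvious cost is the number of calls: a document that splits into 50 chunks with 1000 questions in batches of 20 gives 50 × 50 = 2500 prompts, so some way of routing only the relevant questions to each chunk would probably be needed.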

I don't have prior experience with this, but I do know how to stitch together some code to build a pipeline.

I'm wondering whether I should use some tools or specific techniques to get the best possible quality, which rabbit holes to avoid, and is this whole idea a rabbit hole? 🙂

Anyone have experience with something similar?

Or, if you know of any articles or videos that explain how to do something like this, please share them.
I'm willing to spend many days and weeks on making this work, if it's possible.

Thanks in advance for your insights!

submitted by /u/frithjof_v