Information Extraction with Local LLMs
How to use LangExtract to mine structured information from raw text with local LLMs
Local LLMs are becoming increasingly capable across a wide range of tasks.
The latest models support tool calling, accept media inputs, and more.
However, generating structured data from raw text remains a challenge.
Specialized models for information extraction exist, but they are few and often not open source.
You can query a local LLM for information extraction directly, but the results are often poorly structured, or the model may hallucinate information.
Writing your own extraction algorithms is possible, but it requires significant effort and expertise.
This is where LangExtract comes in.
It is a library for extracting structured data from unstructured text using LLMs.
How it works
LangExtract maps unstructured text to structured data using few-shot prompting.
It uses an example-based approach to guide the LLM toward the desired output format.
Moreover, it aligns the LLM's outputs back to spans of the original input.
In other words, the output is grounded in the original input, which helps prevent hallucinations.
If the model supports structured output, LangExtract can leverage that too, but it is not limited to it.
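To make the grounding idea concrete, here is a minimal illustrative sketch in plain Python (not the LangExtract API): a model-proposed span is only trusted if it can be aligned back to the source text.
# Illustrative sketch of grounding, independent of LangExtract's internals:
# keep a span proposed by the LLM only if it aligns to the original input.
source = "The battery life is disappointing."
candidate = "disappointing"  # span proposed by the LLM

start = source.find(candidate)
if start != -1:
    print(f"grounded at chars {start}-{start + len(candidate)}")  # chars 20-33
else:
    print("no alignment -> likely hallucinated, discard")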
Setup
You can install LangExtract via uv or pip:
uv add langextract
# or
pip install langextract
I highly recommend using a virtual environment, for example one managed by uv.
LangExtract is compatible with the OpenAI and Gemini APIs, but for local models you can use Ollama.
To get started, download and install Ollama.
Feel free to use any model you like, but for the sake of this article, I will use gemma3 models.
If you want to use Gemini/OpenAI models, you will need API keys. From this point on, I will focus on local models only.
Usage - Example preparation
As mentioned, LangExtract uses few-shot prompting to guide the LLM.
Based on your use case, you will need to prepare examples.
You can do the following:
from typing import List

from langextract.data import ExampleData, Extraction


def get_product_review_examples() -> List[ExampleData]:
    """Few-shot examples showing the extraction classes and attributes we expect."""
    return [
        ExampleData(
            text="Overall a great laptop, I'd give it a solid 4/5 stars.",
            extractions=[
                Extraction(
                    extraction_class="rating",
                    extraction_text="4/5",
                    attributes={"value": "4", "scale": "5"},
                )
            ],
        ),
        ExampleData(
            text="The item arrived damaged and the support team was unhelpful.",
            extractions=[
                Extraction(
                    extraction_class="negative_keyword",
                    extraction_text="damaged",
                    attributes={"topic": "product_condition"},
                ),
                Extraction(
                    extraction_class="negative_keyword",
                    extraction_text="unhelpful",
                    attributes={"topic": "customer_support"},
                ),
            ],
        ),
        ExampleData(
            text="I love the vibrant screen, but the battery life is disappointing.",
            extractions=[
                Extraction(
                    extraction_class="pro",
                    extraction_text="vibrant screen",
                    attributes={"feature": "screen"},
                ),
                Extraction(
                    extraction_class="con",
                    extraction_text="disappointing",
                    attributes={"feature": "battery"},
                ),
            ],
        ),
        ExampleData(
            text="The main issue was the buggy software, which kept crashing.",
            extractions=[
                Extraction(
                    extraction_class="issue",
                    extraction_text="buggy software",
                    attributes={"component": "software", "problem": "buggy"},
                )
            ],
        ),
        ExampleData(
            text=(
                "Review for the AeroBook Pro: The noise cancellation is top-notch! "
                "A 5-star product."
            ),
            extractions=[
                Extraction(
                    extraction_class="product_name",
                    extraction_text="AeroBook Pro",
                    attributes={"model": "AeroBook Pro"},
                ),
                Extraction(
                    extraction_class="feature",
                    extraction_text="noise cancellation",
                    attributes={"quality": "top-notch"},
                ),
                Extraction(
                    extraction_class="rating",
                    extraction_text="5-star",
                    attributes={"value": "5"},
                ),
            ],
        ),
    ]
As in the example above, you can create a list of ExampleData objects.
These are your few-shot examples. They do not relate to each other; they are here to demonstrate the possibilities.
Each object contains a text and a list of Extraction objects.
Each Extraction object contains the class of the extraction, the text to be extracted, and optional attributes.
The class helps you categorize the extractions and handle them differently depending on your use case.
In some use cases, you only need the extracted text to display to the user. For deeper analysis, however, attributes are very useful. They are key-value pairs that provide additional context about the extraction.
Finding the best examples for your use case is mostly trial and error. You do not want too many examples, as that may confuse the model.
The target is to have enough examples to guide the model without overwhelming it. Experiment with different numbers of examples, formats, and structures to find the optimal balance for your specific use case.
Extraction of information
Before running the extraction, make sure you have Ollama installed and the model you want to use downloaded (e.g., ollama pull gemma3:4b). Start Ollama so the model is available.
The extraction is straightforward.
import textwrap

import langextract as lx

model_identifier = "gemma3:4b"
local_model_url = "http://localhost:11434"

prompt_description = (
    "Extract product review information such as ratings, pros, cons, "
    "issues, and keywords from the following customer reviews."
)

document_content = """
I recently purchased the AeroBook Pro, and I must say, the noise cancellation is top-notch! A 5-star product.
The battery life could be better, though. It barely lasts 6 hours with moderate use.
The customer support was helpful when I reached out for assistance.
Overall, I'm satisfied with my purchase but hope for improvements in future models.
"""

extraction_result: lx.data.AnnotatedDocument | None = None
extraction_error: Exception | None = None

try:
    extraction_result = lx.extract(
        text_or_documents=textwrap.dedent(document_content),
        prompt_description=prompt_description,
        examples=get_product_review_examples(),  # from the previous code block
        model_id=model_identifier,
        model_url=local_model_url,
        temperature=0.0,
        fence_output=False,
        use_schema_constraints=False,
    )
except Exception as exc:
    extraction_error = exc
The lx.extract function takes the following parameters:
- text_or_documents: The text to be processed. It can be a single string, a URL to be downloaded, or a list of Documents.
- prompt_description: A description of the task to be performed. It helps guide the model.
- examples: The few-shot examples prepared earlier.
- model_id: The identifier of the model to be used. For Ollama, this is the model name.
- model_url: The URL of the model server. For Ollama, it is http://localhost:11434.
- temperature: The temperature for the model. Lower values make the output more deterministic.
- fence_output: Whether to expect/generate fenced output (json or yaml).
- use_schema_constraints: Whether to generate schema constraints for models.
The extraction can raise an exception when no output is generated or the output is not parsable, so keep that in mind and handle it accordingly.
Further, you can increase or decrease the number of workers, the character buffer size, or the number of extraction passes to fit your use case, as sketched below.
It is advisable to keep the context per chunk small enough to extract the relevant information.
The quality of your prompt and examples is crucial for good results, and once again, experimentation is key.
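As an illustration, here is a sketch of the same call with those tuning knobs (max_workers, max_char_buffer, extraction_passes); the values are illustrative, not recommendations.
# A sketch of the same call with tuning knobs; values are illustrative.
extraction_result = lx.extract(
    text_or_documents=textwrap.dedent(document_content),
    prompt_description=prompt_description,
    examples=get_product_review_examples(),
    model_id=model_identifier,
    model_url=local_model_url,
    max_workers=4,  # parallel workers used when the input is chunked
    max_char_buffer=1000,  # maximum characters per chunk sent to the model
    extraction_passes=2,  # additional passes can improve recall
)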
Afterwards, once the extraction succeeds, you can access the information via the extraction_result object, an annotated document.
if extraction_result is not None:
    for extraction in extraction_result.extractions or []:
        print(f"Class: {extraction.extraction_class}")
        print(f"Text: {extraction.extraction_text}")
        print(f"Attributes: {extraction.attributes}")
        print("-----")
The structure of the result mirrors the examples you prepared earlier.
It contains a list of Extraction objects, each representing an extracted piece of information.
From this point, you can process the extracted information as per your requirements or analyze it further, for example by grouping extractions by class, as sketched below.
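A minimal sketch in plain Python (no additional LangExtract API) that groups the extracted texts by their class for further analysis:
from collections import defaultdict

# Group extracted texts by their extraction class, e.g. all ratings together.
by_class: dict[str, list[str]] = defaultdict(list)
if extraction_result is not None:
    for extraction in extraction_result.extractions or []:
        by_class[extraction.extraction_class].append(extraction.extraction_text)

print(by_class.get("rating", []))  # e.g. ["5-star"], depending on the model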
Visualization
You can visualize the extraction and see how the extraction results map back to the original text.
output_name = "extraction_results.jsonl"
lx.io.save_annotated_documents([extraction_result], output_name=output_name, output_dir=".")

# lx.visualize returns an HTML object in notebooks and a plain string elsewhere.
html_content = lx.visualize(output_name)
with open("visualization.html", "w") as f:
    if hasattr(html_content, "data"):
        f.write(html_content.data)  # notebook environment
    else:
        f.write(html_content)
Conclusion
If you work with local LLMs and need to extract structured information from unstructured text, LangExtract is a great tool to consider.
It is not perfect: if you use a smaller model that is not trained for structured output, the results may vary.
However, for many use cases, it provides a good starting point and can save you a lot of time and effort.
Best of all, it works with any domain and any kind of text, since it relies on few-shot prompting and grounding in the provided text.
Socials
Thanks for reading this article!
For more content like this, follow me here or on X or LinkedIn.