Models give nonsensical answers, how do I improve them?

Hi all,

I tried building a basic chatbot to read CSVs of real estate transactions. However, it gives nonsensical answers to basic questions such as the average/highest/lowest transacted price or the number of transactions. I tried GPT-2 and am now using Google Flan-T5. Can any experts please look at my code in app.py and provide some suggestions to improve the chatbot?

Learning and would appreciate any help I can get!

Here is my chatbot and datafile:

https://huggingface.co/spaces/chelimin/RE

LLMs are generally not structurally well-suited to mathematically handling massive amounts of data…

One approach is simply to use very large LLMs, or LLMs specialized for mathematical tasks. Personally, though, I think frameworks like LangChain, combined with Tool Calling (Function Calling) or SQL, can let smaller LLMs achieve reasonable results. Manually normalizing the CSV files beforehand probably also helps…
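As a minimal sketch of the SQL idea: load the CSV into SQLite, let SQL do the arithmetic, and only hand the finished numbers to the model. The column names and sample rows below are hypothetical stand-ins for the real transactions file.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real transactions CSV.
SAMPLE_CSV = """address,price
1 Main St,500000
2 Oak Ave,750000
3 Elm Rd,600000
"""

# Load the CSV into an in-memory SQLite table so SQL, not the LLM, does the math.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (address TEXT, price REAL)")
rows = [(r["address"], float(r["price"]))
        for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Answer the numeric part of the question with SQL...
avg_price, max_price, n = conn.execute(
    "SELECT AVG(price), MAX(price), COUNT(*) FROM transactions"
).fetchone()

# ...then hand the verified numbers to the LLM purely for phrasing.
prompt = (
    f"Across {n} transactions, the average price was {avg_price:.0f} "
    f"and the highest was {max_price:.0f}. Summarize this for the user."
)
print(prompt)
```

With tool calling, the LLM would instead generate the SQL query (or choose which aggregate function to call), but the computation itself still runs outside the model.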

Considering future scalability and the freedom to choose LLMs and their backends, using LangChain is probably the way to go. It has a slightly higher learning curve, though.

You are asking the model to behave like a database which is why it gives nonsense answers. LLMs are not designed to scan raw CSV rows or calculate averages on thousands of records. A better approach is to make the data indexable in a text database. Convert each row into a consistent text record and build a simple index. When the user asks a question you query the index first to get the numbers, then pass the result into the model so it can phrase the answer in natural language. This way the math stays correct and the model just handles how the answer is expressed.
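A minimal sketch of that flow in plain Python, assuming hypothetical column names (`town`, `price`) in place of the real file: each row becomes a consistent text record, the numbers are computed in code, and the model only receives the finished figures.

```python
import csv
import io
import statistics

# Hypothetical CSV standing in for the real transaction file.
SAMPLE_CSV = """town,price
Bishan,500000
Tampines,750000
Bishan,600000
"""

# 1. Convert each row into a consistent text record (for retrieval/display)
#    and keep the numeric fields in a parallel structure (for the math).
records, prices = [], []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    records.append(f"Transaction in {row['town']} at price {row['price']}")
    prices.append(float(row["price"]))

# 2. Answer the numeric part of the question in Python, not in the model.
stats = {
    "count": len(prices),
    "average": statistics.mean(prices),
    "highest": max(prices),
    "lowest": min(prices),
}

# 3. The model only phrases the result; the numbers are already correct.
prompt = (
    f"The data has {stats['count']} transactions. "
    f"Average price: {stats['average']:.0f}, highest: {stats['highest']:.0f}, "
    f"lowest: {stats['lowest']:.0f}. Answer the user's question using these figures."
)
print(prompt)
```

The `records` list is what you would feed into a text index for retrieval-style questions; the `stats` dict covers the aggregate questions the original chatbot was getting wrong.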


Thanks, I tried LangChain and it performed worse than Google Flan-T5, returning errors all the time :frowning:


Oh I see…because ChatGPT and MS Copilot could answer questions on CSVs I uploaded, so I thought of building a smart chatbot to query private databases on the go, carrying out conversations


This applies to files other than CSV as well, but for an LLM to make appropriate inferences about files, you need to use the right functions or classes for each file type so the data reaches the LLM in the proper format. I think this is fundamentally the same for any RAG framework like LangChain…

The conversion process (organizing and cleaning the data through preprocessing so it becomes a format the LLM can understand) is more critical than the LLM's inherent performance.
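For instance, a scraped real-estate CSV often has currency symbols, thousands separators, and missing values in its price column. A small sketch of that kind of cleanup, using hypothetical raw values:

```python
import re

# Hypothetical raw values as they might appear in a messy real-estate CSV.
raw_prices = ["$500,000", " 750000 ", "SGD 600,000", "n/a"]

def clean_price(value: str):
    """Strip currency symbols, thousands separators, and whitespace;
    return None for values with no digits at all."""
    digits = re.sub(r"[^\d.]", "", value)
    return float(digits) if digits else None

# Only the cleaned numbers ever reach the LLM (or the math layer).
cleaned = [p for p in (clean_price(v) for v in raw_prices) if p is not None]
print(cleaned)
```

Doing this in preprocessing means the downstream code can trust every value to be a float, which is exactly the kind of guarantee the LLM itself can never provide.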

Even with commercial large-scale generative AI, I believe this functionality isn't built into the neural network itself but is handled by the surrounding program (at present).

Here’s the actual approach for CSV files.

Other Resources


WOW! :open_mouth:Thank you…looks like I have a looong way to go :face_with_peeking_eye:


Yeah. If it’s only for specific CSV files, it probably isn’t that difficult, like the chatbot Spaces above; the source code is simple.
However, when dealing with arbitrary CSV files, or PDFs, there are parts many people are still researching, and it’s often easier to do it in Python than to have an LLM handle it. :laughing: Computers are for calculations, after all…

Got it…by the way…would you have some other CSV examples for me to learn from, based on free open-source LLMs that don’t require chargeable APIs like OpenAI? It can be quite costly for someone just starting out…


some other CSV examples for me to learn from based on free opensource LLMs that don’t require chargeable APIs like OpenAI?

While searching for examples is one approach, if you have a GPU or a MacBook (MPS), even if it’s not top-of-the-line, you can just set up a small local OpenAI-compatible server using Ollama or similar. Aside from electricity costs, it’s free, and you can switch to OpenAI models later. The only difference is whether you access it via localhost or the internet…

Ollama makes it easy to use the vast catalog of models on Hugging Face and automatically configures things like GPUs for most environments, so it’s recommended for beginners.
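To illustrate the "only difference is the URL" point: Ollama exposes an OpenAI-compatible chat endpoint on its default port 11434, so a client just POSTs the usual chat payload to localhost. This sketch only builds the request without sending it, so it runs without a server; the model name `llama3.2` is an assumption, so use whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "llama3.2",  # hypothetical; any model pulled with `ollama pull`
    "messages": [
        {"role": "system", "content": "You phrase pre-computed statistics."},
        {"role": "user", "content": "Average price was 616667 over 3 sales."},
    ],
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would actually send it; it is omitted here
# so the sketch runs without a live server.
print(req.get_full_url())
```

Switching to OpenAI later means changing `OLLAMA_URL` to the OpenAI API endpoint and adding an `Authorization` header with your API key; the payload shape stays the same.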

Alternatively, using Llama.cpp models via LangChain might also be a good option.