Models give nonsensical answers, how do I improve them?

Hi all,

I tried building a basic chatbot to read CSVs of real estate transactions. However, it gives nonsensical answers to basic questions such as the average/highest/lowest transacted price or the number of transactions. I tried GPT-2 and am now using Google Flan-T5. Can any experts please look at my code in app.py and provide some suggestions to improve the chatbot?

Learning and would appreciate any help I can get!

Here is my chatbot and datafile:

https://huggingface.co/spaces/chelimin/RE

LLMs are generally not structurally well-suited to mathematically handling massive amounts of data…

One approach is simply to use very large LLMs, or LLMs specialized for mathematical tasks. Personally, though, I think frameworks like LangChain, combined with Tool Calling (Function Calling) or SQL, can let smaller LLMs achieve reasonable results. Manually normalizing the CSV files beforehand probably also helps…
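As a minimal sketch of the SQL idea: load the CSV into SQLite, let SQL do the arithmetic, and only hand the finished numbers to the model. The column names and sample rows below are hypothetical stand-ins for the real transactions file.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real transactions CSV.
SAMPLE_CSV = """address,price
1 Main St,500000
2 Oak Ave,750000
3 Elm Rd,600000
"""

# Load the CSV into an in-memory SQLite table so SQL, not the LLM, does the math.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (address TEXT, price REAL)")
rows = [(r["address"], float(r["price"]))
        for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

# Answer the numeric part of the question with SQL...
avg_price, max_price, n = conn.execute(
    "SELECT AVG(price), MAX(price), COUNT(*) FROM transactions"
).fetchone()

# ...then hand the verified numbers to the LLM purely for phrasing.
prompt = (
    f"Across {n} transactions, the average price was {avg_price:.0f} "
    f"and the highest was {max_price:.0f}. Summarize this for the user."
)
print(prompt)
```

With tool calling, the LLM would instead generate the SQL query (or choose which aggregate function to call), but the computation itself still runs outside the model.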

Considering future scalability and the freedom to choose LLMs and their backends, using LangChain is probably the way to go. It has a slightly higher learning curve, though.

You are asking the model to behave like a database which is why it gives nonsense answers. LLMs are not designed to scan raw CSV rows or calculate averages on thousands of records. A better approach is to make the data indexable in a text database. Convert each row into a consistent text record and build a simple index. When the user asks a question you query the index first to get the numbers, then pass the result into the model so it can phrase the answer in natural language. This way the math stays correct and the model just handles how the answer is expressed.
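A minimal sketch of that flow in plain Python, assuming hypothetical column names (`town`, `price`) in place of the real file: each row becomes a consistent text record, the numbers are computed in code, and the model only receives the finished figures.

```python
import csv
import io
import statistics

# Hypothetical CSV standing in for the real transaction file.
SAMPLE_CSV = """town,price
Bishan,500000
Tampines,750000
Bishan,600000
"""

# 1. Convert each row into a consistent text record (for retrieval/display)
#    and keep the numeric fields in a parallel structure (for the math).
records, prices = [], []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    records.append(f"Transaction in {row['town']} at price {row['price']}")
    prices.append(float(row["price"]))

# 2. Answer the numeric part of the question in Python, not in the model.
stats = {
    "count": len(prices),
    "average": statistics.mean(prices),
    "highest": max(prices),
    "lowest": min(prices),
}

# 3. The model only phrases the result; the numbers are already correct.
prompt = (
    f"The data has {stats['count']} transactions. "
    f"Average price: {stats['average']:.0f}, highest: {stats['highest']:.0f}, "
    f"lowest: {stats['lowest']:.0f}. Answer the user's question using these figures."
)
print(prompt)
```

The `records` list is what you would feed into a text index for retrieval-style questions; the `stats` dict covers the aggregate questions the original chatbot was getting wrong.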


Thanks, I tried LangChain and it performed worse than Google Flan-T5, returning errors all the time :frowning:


Oh I see…because ChatGPT and MS Copilot could answer questions on CSVs I uploaded, so I thought of building a smart chatbot to query private databases on the go, carrying out conversations


This applies to files other than CSV as well, but for an LLM to make appropriate inferences about files, you need to use the right functions or classes for each file type so the data reaches the LLM in the proper format. I think this is fundamentally the same for any RAG framework like LangChain…

The conversion process (organizing and cleaning the data through preprocessing so it becomes a format the LLM can understand) is more critical than the LLM's inherent performance.
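For instance, a scraped real-estate CSV often has currency symbols, thousands separators, and missing values in its price column. A small sketch of that kind of cleanup, using hypothetical raw values:

```python
import re

# Hypothetical raw values as they might appear in a messy real-estate CSV.
raw_prices = ["$500,000", " 750000 ", "SGD 600,000", "n/a"]

def clean_price(value: str):
    """Strip currency symbols, thousands separators, and whitespace;
    return None for values with no digits at all."""
    digits = re.sub(r"[^\d.]", "", value)
    return float(digits) if digits else None

# Only the cleaned numbers ever reach the LLM (or the math layer).
cleaned = [p for p in (clean_price(v) for v in raw_prices) if p is not None]
print(cleaned)
```

Doing this in preprocessing means the downstream code can trust every value to be a float, which is exactly the kind of guarantee the LLM itself can never provide.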

Even with commercial large-scale generative AI, I believe this functionality isn't built into the neural network itself but is handled by the surrounding program (at present).

Here’s the actual approach for CSV files.

Other Resources


WOW! :open_mouth:Thank you…looks like I have a looong way to go :face_with_peeking_eye:


Yeah. If it’s only for specific CSV files, it probably isn’t that difficult, like the chatbot Spaces above; the source code is simple.
However, when dealing with arbitrary CSV files, or PDFs, there are parts many people are still researching, and it’s often easier to do it in Python than to have an LLM handle it. :laughing: Computers are for calculations, after all…

Got it…by the way…would you have some other CSV examples for me to learn from, based on free open-source LLMs that don’t require chargeable APIs like OpenAI? It can be quite costly for someone just starting out…


some other CSV examples for me to learn from based on free opensource LLMs that don’t require chargeable APIs like OpenAI?

While searching for examples is one approach, if you have a GPU or a MacBook (MPS), even if it’s not top-of-the-line, you can just set up a small local OpenAI-compatible server using Ollama or similar. Aside from electricity costs, it’s free, and you can switch to OpenAI models later. The only difference is whether you access it via localhost or the internet…

Ollama makes it easy to use the vast catalog of models on Hugging Face and automatically configures things like GPUs for most environments, so it’s recommended for beginners.
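To illustrate the "only difference is the URL" point: Ollama exposes an OpenAI-compatible chat endpoint on its default port 11434, so a client just POSTs the usual chat payload to localhost. This sketch only builds the request without sending it, so it runs without a server; the model name `llama3.2` is an assumption, so use whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "llama3.2",  # hypothetical; any model pulled with `ollama pull`
    "messages": [
        {"role": "system", "content": "You phrase pre-computed statistics."},
        {"role": "user", "content": "Average price was 616667 over 3 sales."},
    ],
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would actually send it; it is omitted here
# so the sketch runs without a live server.
print(req.get_full_url())
```

Switching to OpenAI later means changing `OLLAMA_URL` to the OpenAI API endpoint and adding an `Authorization` header with your API key; the payload shape stays the same.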

Alternatively, using Llama.cpp models via LangChain might also be a good option.