pragmatist
Patrick Joyce

February 1, 2025

Playing With LLMs As A Binary Classifier

A friend of mine leads an internal data team for a hospital group.

They recently received the updated 2025 ICD-10 codes. ICD-10 is a coding system of ~74,000 codes used by healthcare providers and insurance companies for billing, tracking public health trends, and medical recordkeeping.

Basically, they got handed a plain text file with 74,260 lines and were asked to annotate the data with various additional bits of information. For example, they want to record whether each ICD code represents an infection:

  • A200 Bubonic plague is an infection
  • Y9366 Activity, soccer is not an infection

Annotating more than 74,000 lines is tedious and time-consuming1, so he hit up the group chat asking how to script an LLM to do this.

This seemed like a fun toy problem to try out some stuff I personally hadn’t done yet, so I decided to see how well I could use LLMs as a binary classifier for “infection/not infection”.

Attempt 1 - Scripting local LLMs on my M1 MacBook Air

I’d been wanting to play directly with some of the open-weights LLMs like Llama, Mistral, Qwen, and DeepSeek, and this seemed like a good excuse.

I installed Simon Willison’s llm command line tool and the gpt4all plugin and started pulling down models and doing some quick testing with a few.
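If you want to follow along, the setup is just a couple of commands (this assumes a Homebrew install; pip install llm works too):

    brew install llm
    llm install llm-gpt4all
    llm models    # list available models; gpt4all models download on first use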

(Embarrassingly, I wasn’t able to figure out how to get any of the Llama models to work.)

qwen2-1_5b-instruct-q4_0 didn’t really follow instructions, outputting a bunch of the prompt and making up some additional codes that weren’t in the input.

But mistral-7b-instruct-v0 at least followed the instructions and output in the right format, so I went down that path a bit. Here is the command line invocation I used:

~/src/icd10cm-codes-classifier > shuf -n 5 /Users/patrick/src/icd10cm-codes-classifier/icd10cm-codes-2025.txt | llm -m mistral-7b-instruct-v0 --system 'You are an expert in medical coding. You are given a list of ICD-10 codes and their corresponding descriptions. Your job is to annotate each line with an additional column. If the ICD code and description is for an infection then add TRUE. If it is for something that is not necessarily an infection add FALSE. Output the icd code, the description and the TRUE/FALSE column. DO NOT ADD ANYTHING ELSE TO THE OUTPUT.'

This is pulling 5 random codes out of the file and prompting mistral to classify them.

And here is the output:

T18120S Food in esophagus causing compression of trachea, sequela TRUE
G501    Atypical facial pain FALSE
H18521  Epithelial (juvenile) corneal dystrophy, right eye FALSE
S72111J Displaced fracture of greater trochanter of right femur, subsequent encounter for open fracture type IIIA, IIIB, or IIIC with delayed healing TRUE
M84756D Complete transverse atypical femoral fracture, unspecified leg, subsequent encounter for fracture with routine healing FALSE
Exception ignored in: <_io.TextIOWrapper name=7 mode='w' encoding='UTF-8'>
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/llm/0.20/libexec/lib/python3.13/site-packages/llm_gpt4all.py", line 294, in __exit__
    sys.stderr = self.original_stderr
OSError: [Errno 9] Bad file descriptor

It “worked”. Sort of. A couple of problems:

  1. The program blew up with an exception. This was common across the LLMs I was running with llm, and I didn’t take the time to properly debug it, because…
  2. Two out of the five classifications were obviously wrong. I’m not a documentation nurse, but I know that “Food in esophagus” is not an infection.

So this wouldn’t actually be useful.

Attempt 2 - Naive Scripting with OpenAI API

It was getting late and I was frustrated that I hadn’t gotten anything useful for my friend yet. So I decided to throw a little money at the problem and use GPT-4o.

I fired up Cursor and used Composer to prompt for a Python script that would iterate over the ~74,000 rows in batches of 100 and then call the OpenAI API with a prompt that I provided.
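The generated script itself is on GitHub (linked at the end), but the shape of it is roughly this. A minimal sketch, assuming the official openai Python client and an OPENAI_API_KEY in the environment; the classify_batch helper, the output filename, and the condensed prompt wording are mine for illustration:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You are an expert in medical coding. For each line, output the ICD code, "
        "the description, and TRUE if it is an infection or FALSE if it is not. "
        "DO NOT ADD ANYTHING ELSE TO THE OUTPUT."
    )

    def classify_batch(lines):
        # One API call per batch of 100 codes
        response = client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": "\n".join(lines)},
            ],
        )
        return response.choices[0].message.content.splitlines()

    with open("icd10cm-codes-2025.txt") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    with open("classified.txt", "w") as out:
        for i in range(0, len(lines), 100):
            for row in classify_batch(lines[i : i + 100]):
                out.write(row + "\n")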

Success! (mostly)

I got responses back, and from spot checking they seemed mostly right.

There were just a couple of problems:

  1. I dropped some data. I ended up with a file with 73,509 rows vs. the 74,260 rows I should have had. I’m not sure what was happening here, but if I was doing this for real I could easily do some integrity checks on each batch to make sure the output ICD codes match the input (see the sketch after this list).
  2. It was super slow. I did this in the most naive way possible of just doing sequential API requests and waiting for each one to finish before starting the next. This was a quick experiment so I just went to bed and let it run overnight, but I could easily parallelize the batches and make this much faster (also sketched below).
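Both fixes are straightforward. Here’s a sketch of what they might look like on top of the classify_batch function and lines list above; the integrity check assumes each line starts with its ICD code, which is true of this file:

    from concurrent.futures import ThreadPoolExecutor

    def codes_match(input_lines, output_lines):
        # Integrity check: the ICD codes coming back should be exactly
        # the ICD codes we sent, in the same order
        sent = [line.split()[0] for line in input_lines]
        received = [line.split()[0] for line in output_lines if line.strip()]
        return sent == received

    batches = [lines[i : i + 100] for i in range(0, len(lines), 100)]

    # Classify batches concurrently instead of one at a time
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(classify_batch, batches))

    # Flag any batch that dropped or invented codes so it can be retried
    bad_batches = [i for i, (inp, out) in enumerate(zip(batches, results))
                   if not codes_match(inp, out)]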

Still, this was good enough for a quick weeknight proof of concept, so the next morning when the job finished I sent the file off to my friend.

Total cost: $19.94

  • gpt-4o-2024-08-06, cached input - $0.13
  • gpt-4o-2024-08-06, input - $3.76
  • gpt-4o-2024-08-06, output - $16.05

All the code is available on GitHub.

Next Steps

Now that I’ve got this little toy test project, I’d like to try a few more experiments on top of it. Look for them in a future part 2:

  • Play with batch sizes. I picked the 100-line batch size out of thin air. I’m nowhere near the per-request token limit, but I also wonder if I’d see any impact on accuracy with larger batch sizes.
  • Use the OpenAI batch API - this will ~halve the cost. And while there is a 24-hour SLA on batch requests, I bet this will end up being significantly faster than my sequential version by having OpenAI process the requests in parallel.
  • Try other models.
    • Compare GPT-4o and 4o-mini - The task is relatively simple, so I’m willing to wager that 4o-mini will also be able to function as a classifier.
    • Access other models in the cloud. My personal laptop only has 16GB of RAM, which limits what it can run. But I can play with larger open-weights models on rented cloud hardware or through hosted APIs.
  • Implement voting - run the LLM “classifier” multiple times. Store each of the results. Possibly call multiple different models and see where they agree and where they disagree. Places of disagreement can be used to highlight codes for manual review. (A sketch of the voting logic follows this list.)
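The tallying for the voting idea might look something like this. Here classify_code is a hypothetical wrapper that classifies a single line with a given model; it isn’t in the current script:

    from collections import Counter

    def vote(code_line, models, runs_per_model=3):
        # Classify the same line multiple times across multiple models
        votes = [classify_code(code_line, model)
                 for model in models
                 for _ in range(runs_per_model)]
        tally = Counter(votes)
        label, count = tally.most_common(1)[0]
        # Anything short of unanimous agreement gets flagged for manual review
        needs_review = count < len(votes)
        return label, needs_review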
  1. They did some quick napkin math: doing this manually would take ~300 hours of labor from a documentation nurse, which would cost ~$18,000.
