6 Examples of Domain-Specific Large Language Models

ODSC - Open Data Science
Sep 6, 2023

Most people who have experience working with large language models such as Google’s Bard or OpenAI’s ChatGPT have worked with a general-purpose LLM rather than an industry-specific one. But as time has gone on, many industries have realized the power of these models and come to understand that, fine-tuned to their field, they could be invaluable. This is why, over the last few months, multiple domain- and industry-specific LLMs have gone live.

Let’s take a look at a few different examples of domain-specific large language models, how said industry is using them, and why they’re making a difference.

Law

Imagine an LLM that can absorb the vast body of legal documents produced thus far by our justice system and then assist lawyers with citing cases and more. Well, that’s the goal behind CaseHOLD. CaseHOLD is a new dataset for legal NLP tasks. It consists of over 53,000 multiple-choice questions, each of which asks the model to identify the relevant holding of a cited case, that is, the legal principle the cited case establishes. CaseHOLD is a challenging task, as the correct answer is often not explicitly stated in the cited case.

The CaseHOLD dataset was created to address the lack of large-scale, domain-specific datasets for legal NLP. It is a valuable resource for researchers in this area, and its difficulty makes it a good benchmark for evaluating the performance of new NLP models.
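To give a feel for what the data looks like, here is a minimal sketch that loads CaseHOLD as packaged in the LexGLUE benchmark on the Hugging Face Hub. The dataset name and field names (`context`, `endings`, `label`) are assumptions based on that public release rather than an official CaseHOLD API.

```python
# Minimal sketch: inspect a CaseHOLD example via the LexGLUE packaging
# on the Hugging Face Hub. Dataset and field names are assumptions based
# on that release, not an official CaseHOLD API.
from datasets import load_dataset

dataset = load_dataset("lex_glue", "case_hold", split="train")

example = dataset[0]
print(example["context"])                        # citing text with the holding masked out
for i, choice in enumerate(example["endings"]):  # five candidate holdings
    print(i, choice)
print("correct holding index:", example["label"])
```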

Biomedical

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. Just using standard NLP models for biomedical text mining often yields unsatisfactory results due to the different word distributions between general and biomedical corpora.

This is where BioBERT comes in. BioBERT is a domain-specific language representation model that starts from BERT and is further pre-trained on large corpora of biomedical text, such as PubMed abstracts. This allows BioBERT to learn the unique features of biomedical language, which helps it perform better when fine-tuned on biomedical text mining tasks such as named entity recognition, relation extraction, and question answering.
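As a rough sketch of how an encoder like this is typically used, the snippet below loads a publicly mirrored BioBERT checkpoint with the Hugging Face transformers library and extracts contextual embeddings for a biomedical sentence. The checkpoint name `dmis-lab/biobert-v1.1` is an assumption based on the authors’ public release.

```python
# Sketch: encode a biomedical sentence with a BioBERT checkpoint.
# The checkpoint name "dmis-lab/biobert-v1.1" is assumed from the
# publicly available Hugging Face mirror of BioBERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

sentence = "EGFR mutations are associated with response to gefitinib in lung cancer."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; these embeddings are typically fed
# into a downstream biomedical NER or relation-extraction head.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```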

Finance

If there is one industry that most would first think of as benefiting from a domain-specific LLM, finance would be at the top of the list. And already, BloombergGPT is causing waves within the industry. So what does it do? This LLM is a 50-billion-parameter model trained on a massive corpus that combines Bloomberg’s financial data with general-purpose text, which allows BloombergGPT to learn the unique features of financial language and to perform better on financial tasks than LLMs that are not specialized for this domain.

BloombergGPT can perform a variety of financial tasks, including sentiment analysis, named entity recognition, and question answering. It has also been shown to perform well on general LLM benchmarks, which suggests that it is a powerful language model that can be used for a variety of tasks.
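BloombergGPT itself is not publicly available, so the sketch below stands in with FinBERT, an openly released finance-tuned BERT model, to illustrate the kind of financial sentiment analysis task described above rather than BloombergGPT’s own interface. The `ProsusAI/finbert` checkpoint name is an assumption about that public release.

```python
# Sketch: financial sentiment analysis with an openly available
# finance-tuned model (FinBERT). BloombergGPT is not public, so this
# only illustrates the task, not BloombergGPT itself.
from transformers import pipeline

sentiment = pipeline("text-classification", model="ProsusAI/finbert")

headlines = [
    "Company X beats earnings expectations and raises full-year guidance.",
    "Regulators open an investigation into Company Y's accounting practices.",
]

for headline in headlines:
    result = sentiment(headline)[0]
    print(f"{result['label']:>8}  {result['score']:.2f}  {headline}")
```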

Code

As LLMs have become more popular, a new community committed to open-source research and development has sprung forth, and with it, StarCoder was born. StarCoder is an LLM that looks to automate some of the more repetitive tasks associated with coding. StarCoder was trained on a dataset of 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories. The Stack includes code from a variety of programming languages, which allows StarCoder to learn the unique features of each language. StarCoder was also fine-tuned on a dataset of 35B Python tokens, which helps it perform well on Python tasks.

StarCoder itself is massive, to say the least: 15.5B parameters and an 8K-token context length. That scale, combined with the code-heavy training data, allows StarCoder to learn the unique features of programming languages and to perform better on code-related tasks than LLMs that are not specialized for this domain.
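As an illustration, here is a minimal sketch that generates a Python completion from the openly released StarCoder weights via transformers. It assumes the gated `bigcode/starcoder` checkpoint on the Hugging Face Hub (access requires accepting the model license) and enough GPU memory for a 15.5B-parameter model.

```python
# Sketch: code completion with the released StarCoder weights.
# Assumes access to the gated "bigcode/starcoder" checkpoint and a GPU
# with enough memory for a 15.5B-parameter model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```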

Medical

Like law, the medical field is drowning in paperwork and data. This is where Google AI’s Med-PaLM comes in. What makes Med-PaLM special is that it is trained and aligned on large amounts of medical text, which allows it to learn the unique features of medical language. Because of this, it has been shown to outperform existing models on a variety of medical tasks, including answering medical questions, summarizing medical text, generating medical reports, identifying medical entities, and predicting clinical outcomes.

Though still not officially released, tests have shown that Med-PaLM can be used to help doctors diagnose diseases, develop new treatments, personalize care for patients, improve patient education, and make healthcare more efficient. Med-PaLM is still under development, but it has the potential to revolutionize the way that healthcare is delivered.

Climate

If there is one domain many may not think of when it comes to LLMs, it’s climate. But climate science, and all the data produced by its researchers, can also benefit from LLMs. Part of the BERT family of models, ClimateBERT is specifically trained on climate-related text. It is a transformer-based model that is further pretrained on over 2 million paragraphs of climate-related text, crawled from sources such as news articles, research papers, and corporate climate reports.

Currently, ClimateBERT has been shown to outperform general-purpose models on a variety of climate-related tasks, such as text classification, sentiment analysis, and fact-checking. It has also been shown to improve downstream performance when it is used as the base model for further fine-tuning.
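As a sketch of the text-classification use case, the snippet below scores climate-related paragraphs with one of the fine-tuned ClimateBERT models the authors have published. The `climatebert/distilroberta-base-climate-sentiment` checkpoint name is an assumption based on those public Hugging Face releases.

```python
# Sketch: classify climate-related paragraphs with a published
# ClimateBERT fine-tune. The checkpoint name is assumed from the
# authors' public Hugging Face releases.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="climatebert/distilroberta-base-climate-sentiment",
)

paragraphs = [
    "We reduced Scope 1 and Scope 2 emissions by 20% compared to our 2019 baseline.",
    "Extreme weather events pose a material risk to our supply chain operations.",
]

for text in paragraphs:
    print(classifier(text)[0], "--", text)
```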

Conclusion

Clearly, large language models, when geared toward specific industries/domains, can unlock even more benefits for those who are willing to take the time and learn this new technology. But, because LLMs are part of the fast-moving NLP ecosystem, standards, ideas, and even methods are quickly changing.

So it’s becoming important to keep up with any and all changes associated with LLMs. And the best place to do this is at ODSC West 2023 this October 30th to November 2nd. With a full track devoted to NLP and LLMs, you’ll enjoy talks, sessions, events, and more that squarely focus on this fast-paced field.

Confirmed sessions include:

  • Personalizing LLMs with a Feature Store
  • Understanding the Landscape of Large Models
  • Building LLM-powered Knowledge Workers over Your Data with LlamaIndex
  • General and Efficient Self-supervised Learning with data2vec
  • Towards Explainable and Language-Agnostic LLMs
  • Fine-tuning LLMs on Slack Messages
  • Beyond Demos and Prototypes: How to Build Production-Ready Applications Using Open-Source LLMs
  • Automating Business Processes Using LangChain
  • Connecting Large Language Models — Common pitfalls & challenges

What are you waiting for? Get your pass today!

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.
