Saturday, September 09, 2023

Understand GenAI Large Language Model limitations, and how Retrieval Augmented Generation can help

 

Add context from private data and documents to GenAI LLMs to reduce hallucinations and increase performance through Retrieval Augmented Generation.

Use Cases for Large Language Models

Key use cases for Large Language Models include:

  • Generation: LLMs can be used to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. For example, LLMs can be used to generate realistic dialogue for chatbots, write news articles, or even create poems.
  • Summarization: LLMs can be used to summarize text, extract the main points of an article or document, and create a shorter version that is still accurate and informative. For example, LLMs can be used to summarize research papers, news articles, or even books.
  • Classification: LLMs can be used to classify text, identify the topic of a document, and determine whether it is positive or negative, factual or opinion, etc. For example, LLMs can be used to classify customer reviews, social media posts, or even medical records.
  • Extraction: LLMs can be used to extract information from text, identify specific entities or keywords, and create a table or list of the extracted information. For example, LLMs can be used to extract contact information from a business card, product information from a website, or even scientific data from a research paper.
  • Q&A: LLMs can be used to answer questions in an informative way, even if they are open ended, challenging, or strange. For example, LLMs can be used to answer questions about a particular topic, provide customer support, or even generate creative reworkings of existing content.

While Generative AI Large Language Models often seem like a panacea, they suffer from a number of key issues:

  1. Hallucinations: models will ‘make stuff up’ if they don’t know an answer. They also suffer from a lack of contextual understanding. Techniques like few-shot prompting can help (see the sketch after this list).
  2. Inference Performance: even the fastest models are slower than a dial-up modem, or a fast typist! They also suffer from high latency, or time to first token. For most queries, expect 10–20 second response times from most models; even with streaming, you’ll wait a few seconds for the first token to be generated!
  3. Inference Cost: LLMs are expensive to run! Some of the top 180B parameter models may need as many as 5xA100 GPUs to run, while even quantized versions of 70B LLAMA would take up a whole GPU! That’s one query at a time. The costs add up. For example, a dedicated A100 might cost as much as $20K a month with a cloud provider! A brute force approach is going to be expensive.
  4. Stale training data: even top models haven’t been trained on ‘recent’ data, and have a cut-off date. Remember, a model doesn’t ‘have access to the internet’. While certain ‘plugins’ do offer ‘internet search’, it’s just a form of RAG, where ‘top 10 internet search query results’ are fed into the prompt as context, for example.
  5. Use with private data: LLMs haven’t been trained on *your* private data, and as such cannot answer questions based on your dataset, unless that data is injected through fine tuning or prompt engineering.
  6. Token limits / context window size: Models are limited by the TOKEN_LIMIT, and most models can process, at best, a few pages of total input/output. You can’t feed a model an entire document and ask for a summary, or ask it to extract facts from the document. You need to chunk documents into pages first, and perform multiple queries.
  7. They only support text: while this sounds obvious (from the name), it also means you can’t just feed a PDF file or Word document to an LLM. You first need to convert that data to text, and chunk it to fit in the token limit, alongside your prompt and some room for output. Conversions aren’t perfect: what happens to your images, tables, or metadata? It also means models can only output text; formatting the output as HTML, DOCX, or other rich text formats requires a lot of heavy lifting in your pipeline.
  8. Lack of transparency / explainability: why did the model generate a particular answer? Techniques such as RAG can help, as you are able to point at the ‘context’ that generated a particular answer, and even display the context. While the LLM answer may not necessarily be correct, you can display the source content that helped generate that answer.
  9. Potential bias, hate, abuse, harm, ethical concerns, etc: sometimes, answers generated by an LLM can be outright harmful. Using the RAG pattern, in addition to HARM filters can help mitigate some of these issues.
  10. Training and fine tuning costs: to put it in perspective, a 70B model like LLAMA2 might need ~2048 A100 GPUs for a month to train, adding up to a $20–40M training cost, not to mention what it takes to download and store the data. The “Training Hardware & Carbon Footprint” section of the LLAMA2 paper suggests a total of 3,311,616 GPU hours was used to train the LLAMA2 family (7B, 13B, 34B, and 70B)!
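To make the few-shot prompting technique from point 1 concrete, here is a minimal sketch in Python, using the Hugging Face transformers library. The prompt wording is illustrative, and gpt2 is only a tiny stand-in model, not a recommendation.

from transformers import pipeline

# Few-shot prompting: the prompt itself carries labeled examples, so the
# model can pattern-match instead of guessing. gpt2 is a tiny stand-in;
# in practice you would use a far more capable model.
generator = pipeline("text-generation", model="gpt2")

FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: Stopped working after a week, and support never replied.
Sentiment: negative

Review: {review}
Sentiment:"""

def classify(review: str) -> str:
    prompt = FEW_SHOT_PROMPT.format(review=review)
    out = generator(prompt, max_new_tokens=3)[0]["generated_text"]
    return out[len(prompt):].strip()  # keep only the newly generated label

print(classify("Great value for the price."))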

It helps to think of Large Language Models (LLMs) like mathematical functions, or your phone’s autocomplete:

f(x) = x’

  • Where the input (x) and the output (x’) are strings. The model starts by looking at the input, then will ‘autocomplete’ the output.
  • For example, f(“What is Kubernetes”) = “Kubernetes, often abbreviated as K8s, is an open-source platform designed to automate deploying, scaling, and operating application containers.”
  • Most chat interfaces will also provide a default system prompt. For LLAMA2, this is: “You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
  • Depending on the model and interface, there may be ‘hidden’ inputs to your model. Many chat interfaces will include a conversational memory, where they insert a moving window of your previous prompts into the current prompt, as context. It would look something like this: “Below are a series of dialogues between a user and an AI assistant…. [dialogues] [new content]“
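A minimal sketch of that moving-window memory trick, assuming a generic string-in, string-out llm callable (all names here are illustrative):

from collections import deque

# The model itself is stateless; the application re-injects recent turns.
SYSTEM_PROMPT = "Below are a series of dialogues between a user and an AI assistant."
MAX_ENTRIES = 6  # keep only the last 3 exchanges, to stay under the token limit

history = deque(maxlen=MAX_ENTRIES)  # old turns silently fall out of the window

def build_prompt(user_message: str) -> str:
    lines = [SYSTEM_PROMPT]
    lines += [f"{role}: {text}" for role, text in history]
    lines += [f"User: {user_message}", "Assistant:"]
    return "\n".join(lines)

def chat(user_message: str, llm) -> str:
    answer = llm(build_prompt(user_message))  # any f(string) -> string model call
    history.append(("User", user_message))
    history.append(("Assistant", answer))
    return answer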

The inputs to a model are a little more complex though:

f(training_data, model_parameters, input_string) = output_string

  • training_data represents the data the model was trained on; different models will provide different answers. While not an ‘input’ as such, what the model was trained on (and how it was trained) is a key factor in the output.
  • model_parameters represent things like “temperature”, “repetition penalty”, “min tokens” or “max tokens”, “top_p”, “top_k”, and other such values.
  • input_string is the combination of prompt and context you give to the model. Ex: “What is Kubernetes” or “Summarize the following document: ${DOCUMENT}”
  • the ‘prompt’ is usually an instruction like “summarize”, “extract”, “translate”, or “classify”, though more complex prompts are common: “Be a helpful assistant that responds to my question…”, etc.
  • The function can process a maximum of TOKEN_LIMIT tokens (total input and output), usually ~4096 tokens (~3,000 words in English, fewer in, say, Japanese). Models with larger TOKEN_LIMITs exist, though they usually don’t perform as well beyond 4096 tokens. This means, in practice, you can’t feed a whole whitepaper to an LLM and ask it to ‘summarize this document’, for example.
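A sketch of how these inputs map onto a real API, using the Hugging Face transformers library (assumed installed). gpt2 is just a small illustrative model, with a 1024-token limit rather than 4096:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_string = "What is Kubernetes?"
inputs = tokenizer(input_string, return_tensors="pt")
print(inputs["input_ids"].shape[1], "tokens in; limit is", model.config.n_positions)

# model_parameters from the list above map onto generate() arguments:
output_ids = model.generate(
    **inputs,
    do_sample=True,          # enable sampling so temperature/top_p/top_k apply
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=100,      # leave room for output within the token limit
)
output_string = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_string)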

What Large Language Models DON’T DO

Learn: A model will not ‘learn’ from interactions (unless specifically trained/fine-tuned).

Remember: A model doesn’t remember previous prompts. In fact, it’s all done with prompt trickery: previous prompts are injected. The API does a LOT of filtering and heavy lifting!

Reason: Think of LLMs like your phone’s autocomplete: they don’t reason, or do math.

Use your data: LLMs don’t provide responses based on YOUR data (databases or files), unless it’s included in the training dataset, or the prompt (ex: RAG).

Use the Internet: An LLM doesn’t have the capacity to ‘search the internet’, or make API calls.

  • In fact, a model does not perform any activity other than converting one string of text into another string of text.
  • Any 3rd party data not in the model will need to be injected into prompts (RAG)

Adding an LLM to your software architecture:

An LLM is much, much slower than a fax modem!

Believe it or not, LLMs are much slower than even a fax modem! At WPM = ((BPS / 10) / 5) * 60, a 9600 baud modem will generate 11520 words / minute.

At an average 30 tokens / second (20 words) for LLAMA-70B, you’re getting 1200 words / minute!

Large models (70B) such as LLAMA2 can be painfully slow. Smaller models (20B, 13B, 7B) are faster, and require less GPU memory to run. Quantized models are also faster, but provide lower quality responses.
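The arithmetic above, as a quick sanity check:

# Back-of-the-envelope throughput comparison from the paragraphs above.
def modem_wpm(bps: int) -> float:
    return ((bps / 10) / 5) * 60   # ~10 bits per character, ~5 characters per word

def llm_wpm(tokens_per_second: float, words_per_token: float = 20 / 30) -> float:
    return tokens_per_second * words_per_token * 60

print(modem_wpm(9600))  # 11520.0 words/minute for a 9600 baud modem
print(llm_wpm(30))      # 1200.0 words/minute at 30 tokens/s (~20 words/s)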

Quantize your model for faster inference

You can load and quantize your model in 8, 4, 3 or even 2 bits, sacrificing quality for faster inference speed.

This is always a tradeoff, as you’re sacrificing model output quality for faster inferencing. Since a quantized model needs less GPU VRAM to run, this also helps you run large models on commodity hardware.
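A minimal sketch of a 4-bit load using transformers with bitsandbytes (both assumed installed, along with a CUDA GPU); the model name is illustrative, and this particular one is gated behind an access request:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Store weights in 4-bit, compute in fp16: less VRAM, some quality loss.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "meta-llama/Llama-2-70b-hf"  # illustrative 70B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)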

Reducing model hallucinations:

LLMs lack context from private data, leading to hallucinations when asked domain or company-specific questions. RAG can help reduce hallucinations by ‘injecting’ context into prompts.
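At the prompt level, RAG is just careful string assembly: retrieved chunks are pasted in as context, and the model is instructed to answer only from them. A minimal sketch (the template wording is illustrative, not canonical):

# Inject retrieved context into the prompt, and constrain the model to it.
RAG_TEMPLATE = """Answer the question using only the context below.
If the answer is not in the context, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n---\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(context=context, question=question)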

Retrieval Augmented Generation and the importance of Vector Databases

A vector database is a specialized database designed to store and query vector embeddings efficiently. Vector embeddings are numerical representations of text, images, audio, or other data. They are used in a variety of machine learning applications, such as natural language processing, image recognition, and recommendation systems.

Near-vector search, or how to search for “Sky” and find “Blue” (see the sketch after this list):

  • Finding the most similar documents to a given document
  • Finding documents that contain a specific keyword or phrase
  • Clustering documents together based on their similarity
  • Ranking documents for a search query
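A small sketch of that “Sky” finds “Blue” behavior using the sentence-transformers library (assumed installed); the documents are made up:

from sentence_transformers import SentenceTransformer, util

# Semantic (near-vector) search: "sky" matches "blue" with no shared keywords.
model = SentenceTransformer("all-MiniLM-L12-v2")

documents = ["The sky is blue today", "Stocks fell sharply", "The grass is green"]
doc_embeddings = model.encode(documents)
query_embedding = model.encode("sky")

scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarities
best = int(scores.argmax())
print(documents[best], float(scores[best]))  # the 'blue sky' sentence ranks first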

Popular vector databases include ChromaDB, Weaviate, and Milvus.

Advantages of using a VectorDB with your LLM, in a Retrieval Augmented Generation Pattern:

  • Insert up-to-date data into your prompts at query time, without retraining the model
  • Cheap, and can work with vast amounts of data
  • While LLMs are SLOW, Vector Databases are FAST!
  • Can help overcome model limitations (such as token limits), as you’re only feeding ‘top search results’ to the LLM, instead of whole documents (see the sketch after this list).
  • Reduce hallucinations by providing context.
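A minimal sketch of the retrieval step using ChromaDB (assumed installed); the collection name and documents are made up, and the top chunks would feed a prompt template like the build_rag_prompt() sketch above:

import chromadb

# Only the top-k most relevant chunks are retrieved and fed to the LLM,
# keeping the prompt well under the token limit.
client = chromadb.Client()  # in-memory; use a persistent client in production
collection = client.create_collection("company_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Our refund policy allows returns within 30 days.",
        "Support is available Monday to Friday, 9am to 5pm.",
    ],
)

results = collection.query(query_texts=["Can I return a product?"], n_results=1)
top_chunks = results["documents"][0]  # context for the LLM prompt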

Loading Documents into your Vector Databases:

Loading data into your vector database typically requires you to convert documents to text, split the text into chunks, then vectorize those chunks using an embedding model. SentenceTransformers offers a number of pre-trained models, such as all-mpnet-base-v2 or all-MiniLM-L12-v2 that perform well for English text.
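A sketch of that loading pipeline, assuming chromadb and sentence-transformers are installed; the fixed-size chunker and the file name are illustrative (real pipelines usually split on sentences or paragraphs, with some overlap between chunks):

import chromadb
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    # Naive fixed-size chunking, purely for illustration.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

embedder = SentenceTransformer("all-mpnet-base-v2")
client = chromadb.Client()
collection = client.create_collection("documents")

text = open("whitepaper.txt").read()  # assumes conversion to text already happened
chunks = chunk_text(text)
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),  # pre-computed vectors
)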

Scaling factor for RAG: what to consider:

  • Vector Database: consider sharding and High Availability
  • Fine Tuning: collecting data to be used for fine tuning
  • Governance and Model Benchmarking: how are you testing your model performance over time, with different prompts, one-shot, and various parameters
  • Chain of Reasoning and Agents
  • Caching embeddings and responses
  • Personalization and Conversational Memory Database
  • Streaming Responses and optimizing performance. A fine tuned 13B model may perform better than a poor 70B one!
  • Calling 3rd party functions or APIs for reasoning or other types of data (ex: LLMs are terrible at reasoning and prediction, consider calling other models)
  • Fallback techniques: fallback to a different model, or default answers
  • API scaling techniques, rate limiting, etc.
  • Async, streaming and parallelization, multiprocessing, GPU acceleration (including embeddings), generating your API using OpenAPI, etc.
  • Retraining your embedding model

RAG Talk from Shipitcon can be found on GitHub and YouTube.


Monday, September 24, 2018

Data Science environment with Docker and Jupyter on the IBM Mainframe

Guide to getting started with Docker, Python and Jupyter Notebook on zLinux.

Here, I'm using Red Hat Enterprise Linux 7.5 to build and deploy Jupyter Notebook in an Ubuntu container. I will go over the steps used to build and run a Docker container.

Oh, and in case you're wondering: why would anyone do this - check out this snippet from the z14 announcement: "Microservices can be built on z14 with Node.js, Java, Go, Swift, Python, Scala, Groovy, Kotlin, Ruby, COBOL, PL/I, and more. They can be deployed in Docker containers where a single z14 can scale out to 2 million Docker containers".
A few basic commands:
Establish the OS release and version. We're running on RHEL 7.5 for s390x.
[cmihai@rh74s390x ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

[cmihai@rh74s390x ~]$ uname -a
Linux rh74s390x.novalocal 3.10.0-693.17.1.el7.s390x #1 SMP Sun Jan 14 10:38:29 EST 2018 s390x s390x s390x GNU/Linux

[cmihai@rh74s390x ~]$ docker --version
Docker version 17.05.0-ce, build 89658be

Setup regular user access, sudo and SSH keys

Create a regular user account

useradd cmihai
passwd cmihai
usermod -aG wheel cmihai
su - cmihai

Add your SSH public key to authorized_hosts

mkdir -p ~/.ssh
echo "YOURKEYHERE" >> ~/.ssh/authorized_keys

Log in as your new user, and forward port 9000:

ssh -L 9000:127.0.0.1:9000 -i cmihai.pem cmihai@myzLinux

Setup docker

Create the Docker group

sudo groupadd docker
sudo usermod -aG docker cmihai

Start Docker

sudo systemctl enable docker
sudo systemctl restart docker.service
sudo systemctl status docker.service

Test docker

docker run s390x/hello-world

Let’s run a simple Ubuntu interactive shell:

docker run --name s390x-ubuntu --hostname s390x-ubuntu --interactive --tty s390x/ubuntu /bin/bash

Building a Docker container for Jupyter Notebook

Create a Dockerfile from the s390x/ubuntu base image.
FROM s390x/ubuntu
MAINTAINER Mihai Criveti

# ADD AND RUN
RUN apt-get update \
    && apt-get install -y python3 python3-pip \
    && pip3 install jupyter \
    && apt-get clean

# COMMAND and ENTRYPOINT:
CMD ["jupyter","notebook","--allow-root","--ip=0.0.0.0","--port=9000"]

# NETWORK
EXPOSE 9000

Build the container:

docker build . --tag "cmihai/jupyter-lite:v1" -f Dockerfile

Run your new container:

docker run --name jupyter --hostname jupyter -p 9000:9000 cmihai/jupyter-lite:v1

Connect to Jupyter Notebook

You can now install dependencies directly from Jupyter:

!apt-get install --yes zlib1g-dev libjpeg-dev
Potential next steps:
  • Consider setting up persistence for your notebooks (ex: VOLUME ["/notebooks"] in Dockerfile)
  • Setup Docker Compose and build multi-tier application specifications, such as connecting your Jupyter Notebook to PostgreSQL, Redis, Spark, etc.
  • Set up other programming languages or kernels (Java, R), or even Zeppelin Notebook
For an interactive tutorial of using Docker for Data Science, check out: https://github.com/crivetimihai/docker-data-science

Monday, August 13, 2018

Learn Cloud Native, Docker, K8s & Istio with free courses and IBM Badges

Microservices are the building blocks for cloud native architecture. Learn how to create and use containers with Docker, manage and orchestrate Containers with Kubernetes and secure and connect microservices with Istio.

Prerequisites:

You should have a good understanding of Linux and some basic concepts such as Version Control (preferably using Git). If you need a refresher, check out these 2 free resources:
  1. Optional: Learn Git Version Control on Katacoda if you've never used Version Control.
  2. Optional: Take up Introduction to Linux by The Linux Foundation on edX.

Get started learning containers with Docker

Find out what containers are, how they differ from Virtual Machines, and what the benefits of using containers are. Install Docker CE on your machine, search for and run containers from Docker Hub, build your first Docker container from a Dockerfile, and publish it to a Docker Registry.
  1. Learn Docker & Containers using Interactive Browser-Based Scenarios on Katacoda.
  2. Test your knowledge and collect the Docker essentials: Extend your apps with containers badge on developerWorks.
  3. Optional: Try my interactive Docker for Data Science course.

Learn container orchestration with Kubernetes

Kubernetes is a platform for managing containerized workloads and services. Learn about the key Kubernetes components and architecture:
  1. Join Introduction to Kubernetes by The Linux Foundation on edX.
  2. Collect the Container & Kubernetes Essentials with IBM Cloud badge.
  3. Complete Kubernetes Introduction on Katacoda.

Cloud Native, Microservices, 12-factor and Istio

Learn about essential cloud-native technologies, the twelve-factor app methodology, microservices, and Istio: an intelligent service mesh for microservices. Istio helps you to connect, secure, control, and observe services.

  1. Collect the Getting started with Microservices with Istio and IBM Cloud Kubernetes Service badge.
  2. Complete the Beyond the Basics: Istio and IBM Cloud Kubernetes Service badge, and also collect the badge for completing the Containers, microservices, Kubernetes, and Istio on the Cloud Learning Path.
  3. Complete the Istio course on Katacoda - use the Istio service mesh to connect, manage, and secure microservices.
  4. Optional: Pursue the IBM Cloud Garage Method Explorer and IBM Cloud Garage Method Advocate badges and learn about key practices such as IBM Design Thinking, Agile, DevOps used in developing and managing Cloud Native applications.

Advanced Courses & Next Steps:

  1. Complete Debugging and Troubleshooting Containers on Katacoda.
  2. Learn Docker in Production on Katacoda.
  3. Learn Docker Security on Katacoda.
  4. Learn Docker Orchestration / Swarm Mode on Katacoda.
  5. Running Containers without Docker on Katacoda.
  6. Configuration Management for Containerized Delivery on edX.

For IBMers:

For IBMers who have access to YourLearning courses and Safari Books Online:
  1. Complete the IBM Cloud Private Consultant Bootcamp, which includes 15+ hours of self-paced learning on Kubernetes, Helm, Docker, Microservices (IBM Garage Method), Cloud Foundry, and introduction to IBM Cloud Private.
  2. Complete other courses and certifications from the ICp series: IBM Cloud Private - Continuous Integration/Continuous Delivery Pipelines.

Books:

  1. Read up on Docker on Safari Books Online: Docker: Up & Running and Using Docker.
  2. Check out: Kubernetes: Up & Running and Kubernetes Cookbook.

Next Steps:

  1. Find and join a Docker or Kubernetes community at meetup.com. Attend a few presentations, talk to people!
  2. Capstone project: find an application you like, create a Docker container for it and publish it on Docker Hub.
For additional Learning Paths, check out my Learn Cloud and AI with IBM Badges article.