{"product_id":"large-language-model-based-solutions-isbn-9781394240722","title":"Large Language Model-Based Solutions","description":"\u003cp\u003e\u003cb\u003eLearn to build cost-effective apps using Large Language Models\u003c\/b\u003e \u003c\/p\u003e\u003cp\u003eIn \u003ci\u003eLarge Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications\u003c\/i\u003e, Principal Data Scientist at Amazon Web Services, Shreyas Subramanian, delivers a practical guide for developers and data scientists who wish to build and deploy cost-effective large language model (LLM)-based solutions. In the book, you'll find coverage of a wide range of key topics, including how to select a model, pre- and post-processing of data, prompt engineering, and instruction fine tuning. \u003c\/p\u003e\u003cp\u003eThe author sheds light on techniques for optimizing inference, like model quantization and pruning, as well as different and affordable architectures for typical generative AI (GenAI) applications, including search systems, agent assists, and autonomous agents. You'll also find: \u003c\/p\u003e\u003cul\u003e\n\u003cli\u003eEffective strategies to address the challenge of the high computational cost associated with LLMs\u003c\/li\u003e \u003cli\u003eAssistance with the complexities of building and deploying affordable generative AI  apps, including tuning and inference techniques\u003c\/li\u003e \u003cli\u003eSelection criteria for choosing a model, with particular consideration given to compact, nimble, and domain-specific models\u003c\/li\u003e\n\u003c\/ul\u003e \u003cp\u003ePerfect for developers and data scientists interested in deploying foundational models, or business leaders planning to scale out their use of GenAI, \u003ci\u003eLarge Language Model-Based Solutions\u003c\/i\u003e will also benefit project leaders and managers, technical support staff, and administrators with an interest or stake in the subject. 
\u003c\/p\u003e\u003cp\u003eIntroduction\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 1: Introduction\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eOverview of GenAI Applications and Large Language Models\u003c\/p\u003e \u003cp\u003eThe Rise of Large Language Models\u003c\/p\u003e \u003cp\u003eNeural Networks, Transformers, and Beyond\u003c\/p\u003e \u003cp\u003eGenAI vs. LLMs: What’s the Difference?\u003c\/p\u003e \u003cp\u003eThe Three-Layer GenAI Application Stack\u003c\/p\u003e \u003cp\u003eThe Infrastructure Layer\u003c\/p\u003e \u003cp\u003eThe Model Layer\u003c\/p\u003e \u003cp\u003eThe Application Layer\u003c\/p\u003e \u003cp\u003ePaths to Productionizing GenAI Applications\u003c\/p\u003e \u003cp\u003eSample LLM-Powered Chat Application\u003c\/p\u003e \u003cp\u003eThe Importance of Cost Optimization\u003c\/p\u003e \u003cp\u003eCost Assessment of the Model Inference Component\u003c\/p\u003e \u003cp\u003eCost Assessment of the Vector Database Component\u003c\/p\u003e \u003cp\u003eBenchmarking Setup and Results\u003c\/p\u003e \u003cp\u003eOther Factors to Consider\u003c\/p\u003e \u003cp\u003eCost Assessment of the Large Language Model Component\u003c\/p\u003e \u003cp\u003eSummary\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 2: Tuning Techniques for Cost Optimization\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eFine-Tuning and Customizability\u003c\/p\u003e \u003cp\u003eBasic Scaling Laws You Should Know\u003c\/p\u003e \u003cp\u003eParameter-Efficient Fine-Tuning Methods\u003c\/p\u003e \u003cp\u003eAdapters Under the Hood\u003c\/p\u003e \u003cp\u003ePrompt Tuning\u003c\/p\u003e \u003cp\u003ePrefix Tuning\u003c\/p\u003e \u003cp\u003eP-tuning\u003c\/p\u003e \u003cp\u003eIA3\u003c\/p\u003e \u003cp\u003eLow-Rank Adaptation\u003c\/p\u003e \u003cp\u003eCost and Performance Implications of PEFT Methods\u003c\/p\u003e \u003cp\u003eSummary\u003c\/p\u003e 
\u003cp\u003e\u003cb\u003eChapter 3: Inference Techniques for Cost Optimization\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eIntroduction to Inference Techniques\u003c\/p\u003e \u003cp\u003ePrompt Engineering\u003c\/p\u003e \u003cp\u003eImpact of Prompt Engineering on Cost\u003c\/p\u003e \u003cp\u003eEstimating Costs for Other Models\u003c\/p\u003e \u003cp\u003eClear and Direct Prompts\u003c\/p\u003e \u003cp\u003eAdding Qualifying Words for Brief Responses\u003c\/p\u003e \u003cp\u003eBreaking Down the Request\u003c\/p\u003e \u003cp\u003eExample of Using Claude for PII Removal\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eProviding Context\u003c\/p\u003e \u003cp\u003eExamples of Providing Context\u003c\/p\u003e \u003cp\u003eRAG and Long Context Models\u003c\/p\u003e \u003cp\u003eRecent Work Comparing RAG with Long Content Models\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eContext and Model Limitations\u003c\/p\u003e \u003cp\u003eIndicating a Desired Format\u003c\/p\u003e \u003cp\u003eExample of Formatted Extraction with Claude\u003c\/p\u003e \u003cp\u003eTrade-Off Between Verbosity and Clarity\u003c\/p\u003e \u003cp\u003eCaching with Vector Stores\u003c\/p\u003e \u003cp\u003eWhat Is a Vector Store?\u003c\/p\u003e \u003cp\u003eHow to Implement Caching Using Vector Stores\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eChains for Long Documents\u003c\/p\u003e \u003cp\u003eWhat Is Chaining? 
\u003c\/p\u003e \u003cp\u003eImplementing Chains\u003c\/p\u003e \u003cp\u003eExample Use Case\u003c\/p\u003e \u003cp\u003eCommon Components\u003c\/p\u003e \u003cp\u003eTools That Implement Chains\u003c\/p\u003e \u003cp\u003eComparing Results\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eSummarization\u003c\/p\u003e \u003cp\u003eSummarization in the Context of Cost and Performance\u003c\/p\u003e \u003cp\u003eEfficiency in Data Processing\u003c\/p\u003e \u003cp\u003eCost-Effective Storage\u003c\/p\u003e \u003cp\u003eEnhanced Downstream Applications\u003c\/p\u003e \u003cp\u003eImproved Cache Utilization\u003c\/p\u003e \u003cp\u003eSummarization as a Preprocessing Step\u003c\/p\u003e \u003cp\u003eEnhanced User Experience\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eBatch Prompting for Efficient Inference\u003c\/p\u003e \u003cp\u003eBatch Inference\u003c\/p\u003e \u003cp\u003eExperimental Results\u003c\/p\u003e \u003cp\u003eUsing the accelerate Library\u003c\/p\u003e \u003cp\u003eUsing the DeepSpeed Library\u003c\/p\u003e \u003cp\u003eBatch Prompting\u003c\/p\u003e \u003cp\u003eExample of Using Batch Prompting\u003c\/p\u003e \u003cp\u003eModel Optimization Methods\u003c\/p\u003e \u003cp\u003eQuantization\u003c\/p\u003e \u003cp\u003eCode Example\u003c\/p\u003e \u003cp\u003eRecent Advancements: GPTQ\u003c\/p\u003e \u003cp\u003eParameter-Efficient Fine-Tuning Methods\u003c\/p\u003e \u003cp\u003eRecap of PEFT Methods\u003c\/p\u003e \u003cp\u003eCode Example\u003c\/p\u003e \u003cp\u003eCost and Performance Implications\u003c\/p\u003e \u003cp\u003eSummary\u003c\/p\u003e \u003cp\u003eReferences\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 4: Model Selection and Alternatives\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eIntroduction to Model Selection\u003c\/p\u003e \u003cp\u003eMotivating Example: The Tale of Two Models 
\u003c\/p\u003e \u003cp\u003eThe Role of Compact and Nimble Models\u003c\/p\u003e \u003cp\u003eExamples of Successful Smaller Models\u003c\/p\u003e \u003cp\u003eQuantization for Powerful but Smaller Models\u003c\/p\u003e \u003cp\u003eText Generation with Mistral 7B\u003c\/p\u003e \u003cp\u003eZephyr 7B and Aligned Smaller Models\u003c\/p\u003e \u003cp\u003eCogVLM for Language-Vision Multimodality\u003c\/p\u003e \u003cp\u003ePrometheus for Fine-Grained Text Evaluation\u003c\/p\u003e \u003cp\u003eOrca 2 and Teaching Smaller Models to Reason\u003c\/p\u003e \u003cp\u003eBreaking Traditional Scaling Laws with Gemini and Phi\u003c\/p\u003e \u003cp\u003ePhi 1, 1.5, and 2 B Models\u003c\/p\u003e \u003cp\u003eGemini Models\u003c\/p\u003e \u003cp\u003eDomain-Specific Models\u003c\/p\u003e \u003cp\u003eStep 1 - Training Your Own Tokenizer\u003c\/p\u003e \u003cp\u003eStep 2 - Training Your Own Domain-Specific Model\u003c\/p\u003e \u003cp\u003eMore References for Fine-Tuning\u003c\/p\u003e \u003cp\u003eEvaluating Domain-Specific Models vs. Generic Models\u003c\/p\u003e \u003cp\u003eThe Power of Prompting with General-Purpose Models\u003c\/p\u003e \u003cp\u003eSummary\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 5: Infrastructure and Deployment Tuning Strategies\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eIntroduction to Tuning Strategies\u003c\/p\u003e \u003cp\u003eHardware Utilization and Batch Tuning\u003c\/p\u003e \u003cp\u003eMemory Occupancy\u003c\/p\u003e \u003cp\u003eStrategies to Fit Larger Models in Memory\u003c\/p\u003e \u003cp\u003eKV Caching\u003c\/p\u003e \u003cp\u003ePagedAttention\u003c\/p\u003e \u003cp\u003eHow Does PagedAttention Work?\u003c\/p\u003e \u003cp\u003eComparisons, Limitations, and Cost Considerations\u003c\/p\u003e \u003cp\u003eAlphaServe\u003c\/p\u003e \u003cp\u003eHow Does AlphaServe Work? 
\u003c\/p\u003e \u003cp\u003eImpact of Batching\u003c\/p\u003e \u003cp\u003eCost and Performance Considerations\u003c\/p\u003e \u003cp\u003eS3: Scheduling Sequences with Speculation\u003c\/p\u003e \u003cp\u003eHow Does S3 Work?\u003c\/p\u003e \u003cp\u003ePerformance and Cost\u003c\/p\u003e \u003cp\u003eStreaming LLMs with Attention Sinks\u003c\/p\u003e \u003cp\u003eFixed to Sliding Window Attention\u003c\/p\u003e \u003cp\u003eExtending the Context Length\u003c\/p\u003e \u003cp\u003eWorking with Infinite Length Context\u003c\/p\u003e \u003cp\u003eHow Does StreamingLLM Work?\u003c\/p\u003e \u003cp\u003ePerformance and Results\u003c\/p\u003e \u003cp\u003eCost Considerations\u003c\/p\u003e \u003cp\u003eBatch Size Tuning\u003c\/p\u003e \u003cp\u003eFrameworks for Deployment Configuration Testing\u003c\/p\u003e \u003cp\u003eCloud-Native Inference Frameworks\u003c\/p\u003e \u003cp\u003eDeep Dive into Serving Stack Choices\u003c\/p\u003e \u003cp\u003eBatching Options\u003c\/p\u003e \u003cp\u003eOptions in DJL Serving\u003c\/p\u003e \u003cp\u003eHigh-Level Guidance for Selecting Serving Parameters\u003c\/p\u003e \u003cp\u003eAutomatically Finding Good Inference Configurations\u003c\/p\u003e \u003cp\u003eCreating a Generic Template\u003c\/p\u003e \u003cp\u003eDefining a HPO Space\u003c\/p\u003e \u003cp\u003eSearching the Space for Optimal Configurations\u003c\/p\u003e \u003cp\u003eResults of Inference HPO\u003c\/p\u003e \u003cp\u003eInference Acceleration Tools\u003c\/p\u003e \u003cp\u003eTensorRT and GPU Acceleration Tools\u003c\/p\u003e \u003cp\u003eCPU Acceleration Tools\u003c\/p\u003e \u003cp\u003eMonitoring and Observability\u003c\/p\u003e \u003cp\u003eLLMOps and Monitoring\u003c\/p\u003e \u003cp\u003eWhy Is Monitoring Important for LLMs? 
\u003c\/p\u003e \u003cp\u003eMonitoring and Updating Guardrails\u003c\/p\u003e \u003cp\u003eSummary\u003c\/p\u003e \u003cp\u003eConclusion\u003c\/p\u003e \u003cp\u003eIndex\u003c\/p\u003e \u003cp\u003e\u003cb\u003eSHREYAS SUBRAMANIAN, PhD, \u003c\/b\u003e is a principal data scientist at AWS, one of the largest organizations building and providing large language models for enterprise use. He is currently advising both internal Amazon teams and large enterprise customers on building, tuning, and deploying Generative AI applications at scale. Shreyas runs machine learning-focused cost optimization workshops, helping participants reduce the costs of machine learning applications on the cloud. Shreyas also actively participates in cutting-edge research and development of advanced training, tuning and deployment techniques for foundation models.\u003c\/p\u003e\u003cp\u003e\u003cb\u003eBalance performance with cost optimization to unlock the potential of AI\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eWith the rise of AI and machine learning, large language models (LLMs) have become increasingly popular, but their high computational costs can be a barrier to entry for many organizations. This book offers cost-effective approaches to building and deploying LLMs. At each stage of the process, from model selection and prompt engineering to fine tuning and deployment, you can minimize costs without unduly sacrificing performance.\u003c\/p\u003e\u003cp\u003eWritten for developers and data scientists, \u003ci\u003eLarge Language Model-Based Solutions \u003c\/i\u003eprovides the practical, technical knowledge needed to implement valuable generative AI applications like search systems, agent assists, and autonomous agents. The book explores techniques for optimizing inference, such as model quantization and pruning, as well as opportunities for reducing costs at the infrastructure level. 
It also considers future trends in LLM cost optimization, so you can remain competitive for the next stage in generative AI.  \u003c\/p\u003e\u003cp\u003eWritten by one of Amazon’s leading data scientists, this book empowers you to overcome the challenges associated with LLMs and successfully implement generative AI.\u003c\/p\u003e","brand":"Wiley","offers":[{"title":"Default Title","offer_id":47989509751013,"sku":"NP9781394240722","price":50.0,"currency_code":"USD","in_stock":false}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/1842\/7735\/files\/9781394240722.jpg?v=1761784393","url":"https:\/\/k12savings.com\/products\/large-language-model-based-solutions-isbn-9781394240722","provider":"K12savings","version":"1.0","type":"link"}