HashiCast Episode 42 - Ming Zhao, IBM
This episode features Ming Zhao, developer at IBM and contributor to Docling. Join us as we learn the basics of AI, RAG, and Docling.
Guests:
- Ming Zhao, Developer
Join us as we learn the basics of AI and RAG, how Docling processes documents for AI, and our most and least favorite cheeses.
Podcast Notes
Transcript
00:00:00 Intro / Outro: Welcome to HashiCast, the self-proclaimed number one podcast about the world of DevOps practices, tools, and practitioners.
00:00:06 Rosemary: Hello everyone. Welcome to another episode of HashiCast. We've been remiss on releasing a normal episode of HashiCast lately, but we have had episodes of CHANGELOG, which is our HashiCorp product releases podcast. Today we have Ming Zhao from IBM on the show. He is an expert on Docling, an open source project for document processing for AI. Ming, tell us a little bit about yourself and how you got started in this space.
00:00:33 Ming: Yeah. So I actually work as a software developer at IBM, focused specifically on open source technologies. When I first started, IBM was hiring people to work on open source projects in an effort to learn more about them - to start integrating open source into IBM technology, but also just generally to build eminence in that space. One issue that I ran into when I first started was coming from an environment - school, college - where everything was an assignment, and I was told exactly what to do at all times. It was really difficult to transition to this open source space where I was kind of just told, hey, go and work in this space, right? So getting started was a huge pain for me. What I ended up doing was just working on a lot of the smaller issues within that space - looking at even just a typo in documentation, adding little examples in the documentation, or if someone has a bug in something, trying to reproduce that bug and not even necessarily solving it, but taking small steps to start interacting with the community. From there, I was able to develop a lot more confidence in interacting with the open source space as a whole. Then I started working on some other open source projects. I started working on the Granite project, which is a family of IBM open source large language models. Then finally I pivoted into Docling. At the time, Docling was just an internal project, but IBM had a huge amount of interest in continuing to grow in the open source space. And so, after a year or more of work within research, it ended up being released as an open source project.
00:02:20 Rosemary: I guess the interesting thing about your story is that you started work within the Python community. A lot of folks who want to get involved with open source communities do start with Python. What do you think is so interesting about the Python community, and why did you feel so welcome in getting started in it in the first place?
00:02:38 Ming: That's a good question. I think the Python community is just very robust. It's a very popular language, particularly within AI - for AI developers, Python is kind of the primary go-to starting language. Because of its robustness and the number of people involved in it, it becomes a lot easier to get directly involved. There are so many people there to support you within any of the open source projects. A lot of the Python-based projects, like PyTorch for example, have these huge communities of people where you can ask questions. There are Slack communities, Discord communities, all these large communities of people where if you ask a question, you're very likely to get an answer within the next couple of days. And I think that differs from some of the smaller-scale projects, where you could have to wait a couple of weeks before you hear back from somebody.
00:03:31 Rosemary: Fun fact: my very first programming language was Python. I was not very diligent about learning it. It just happened that my dad put down one of those old O'Reilly books in front of me and was like, you should learn Python. And of course I was like, I don't care about it that much. So I started learning it. I think my first Hello World was in Python, but interestingly enough, I never got deeply involved in the Python community. It came much later that I ended up getting more involved in the Java community and Go, and then by extension, the HashiCorp tooling, which is written in Go and any other domain specific languages. So it's really funny that I just never got involved in the Python community, even though it was technically my first language. I don't know if that was your first language, that programming language that you learned.
00:04:18 Ming: Well, I actually studied computer engineering in school, and so technically my first language was assembly. We were doing assembly-based coding my freshman year, and I think a lot of people who study computer science may have to go through that in their first year of studying. So technically assembly. But yeah, I do find it interesting that people nowadays are more likely to get involved in the communities. Just in the past couple of years, there's been such huge growth in a lot of these open source communities, and so much more focus on open source projects from both individuals and corporations. A lot of these communities are growing very rapidly, a lot more eyes are on AI development in general. Lots of movement and growth in this space.
00:05:05 Rosemary: All right, so let's chat about Docling. What is it and what problem does it solve? We briefly touched on it, but I think the problem that it solves is a much larger and more complicated situation. Right? Because I remember when I used to work in financial services, everybody was like, how do we process documents? And that was the biggest problem anybody could try to solve. How do you process an actual physical document and turn it into something digital that's usable in the form of data? Does Docling solve that problem? And if not, to what extent does it solve that problem?
00:05:37 Ming: So that is the problem that Docling is meant to solve, right? It's an open source project. It's MIT licensed. The point of it is to take that unstructured document and extract it into a structured format, where the layout of documents is going to be preserved, those tables are going to be preserved, the natural reading order in your document is going to be preserved. All of these things just to make your data more valuable in your AI workflows.
00:05:59 Rosemary: Is there a sequence by which it is processing all of the information in the PDF?
00:06:04 Ming: In your specific case, you have a PDF. Depending on what document type you have, there are actually several document processing pipelines. There is a simple pipeline, right? So ignoring PDF, there's a simple pipeline where if you look at just a Word document or something simple that's easily parsable, Docling is able to take that and convert it into the Docling document format. With PDFs, and scanned PDFs specifically, you can actually parse a standard PDF. Docling is able to parse those standard PDFs, and it might pass them through something like a TableFormer model if there are tables specifically, layout analysis models, that kind of thing, to generate a Docling document. So I've mentioned the Docling document a couple of times. The Docling document is a document representation format that's native to Docling. It's a highly structured and very metadata-rich document format. I encourage you to go and check it out. There's a bunch of documentation on it in the Docling GitHub repo, but essentially it takes every element of text within your document and tags it with what type of text it is. So for example, if you have section headers, basic text, list items, tables - all of those different types of text within your document are going to be tagged and stored in this highly structured format in the Docling document. And so Docling employs a bunch of different models in your PDF pipeline. Depending on the type of PDF, you may need different models. For a scanned PDF you might want to use OCR models. For a table-heavy PDF, you'll use the TableFormer models. All of these models can be toggled on and off within that PDF conversion pipeline to get to where you need to be in the Docling document format. And then from that Docling document format, you're able to export into various formats like the Markdown, HTML, or JSON formats you might want to use.
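To make the idea of a tagged document representation concrete, here is a minimal sketch in plain Python. The element labels and the export function are illustrative stand-ins, not Docling's actual DoclingDocument API - the point is just that every piece of text carries a type tag, which lets an exporter render structure instead of a flat wall of text:

```python
# Toy sketch of a tagged document representation: each text element carries
# a label describing what kind of element it is, so an exporter can render
# structure (headings, lists) instead of undifferentiated text.
from dataclasses import dataclass

@dataclass
class DocItem:
    label: str   # e.g. "section_header", "text", "list_item" (illustrative labels)
    text: str

def export_to_markdown(items):
    """Render tagged items as Markdown, using each label to pick formatting."""
    lines = []
    for item in items:
        if item.label == "section_header":
            lines.append(f"## {item.text}")
        elif item.label == "list_item":
            lines.append(f"- {item.text}")
        else:  # plain body text
            lines.append(item.text)
    return "\n".join(lines)

doc = [
    DocItem("section_header", "Results"),
    DocItem("text", "We evaluate on three datasets."),
    DocItem("list_item", "Dataset A"),
]
print(export_to_markdown(doc))
```

Because the tags survive conversion, the same tagged list could just as easily be exported to HTML or JSON, which is the multi-format export Ming describes.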
00:07:49 Rosemary: How do you decide which models are the best for your document? Because I imagine there's this sort of tuning process you mentioned. Sometimes you might decide to use OCR, you might decide to do this, and so you'll toggle them on and off. In your experience, what do you think is the cycle time for basically the first iteration of passing that document through to the last iteration of when you think that document is usable?
00:08:11 Ming: Again, it kind of depends on the complexity of your document, right? Sometimes it'll take trial and error. A lot of the base settings that Docling includes out of the box are going to work very well for you for the majority of documents. You may want to use different OCR models simply based on what OCR models you have found work best on your specific data. There are several OCR models that are integrated natively within Docling that you can choose from, but a lot of it's going to be trial and error. And the first time you run through the pipeline for your initial document processing, that first document is going to take the longest, because in your setup, you set up the Docling library and import all the imports, and during the first conversion is where Docling is going to download some of the models that it actually needs to do the processing. So if you try Docling the very first time, you upload a page and it takes a while, and you're like, "Wow, this is a lot slower than I thought it would be." Try another page after that and you'll see the performance improvement. That first step is usually what takes a long time, to download all the models. After that, your document processing should increase in speed.
00:09:12 Rosemary: That leads into my next question, which is what if you have thousands of documents? Some of them might be of roughly the same type and category and format, while others are not. How do you process them at scale? Imagine you work with folks who are saying, I need to process hundreds of thousands of documents, and I want to do it as efficiently as possible, but I don't want to have to manually go in and tune all the models that I'm using for every single document. How do you approach this from a scaling perspective?
00:09:40 Ming: A huge part of the goal of Docling is efficiency. As an open source project, one of the things that it does really well is being able to run locally on your device. So if you download Docling, most documents you can actually process using the CPU only - even if it takes a while, it wouldn't take longer than a minute or so. If you have a GPU, then we're thinking several seconds on a very basic GPU. In terms of processing speed - the number of documents that you're actually able to do per second or per minute - Docling has efficiency in mind. There are a lot of accelerators available depending on what kind of GPU you have available to you. Obviously, with large-scale document processing, it's very important to have that speed aspect. In terms of identifying which specific pipeline to use, Docling will auto-route to a pipeline depending on the document type. If you have a simple document, you just run "docling.convert" and it's going to convert using a simpler pipeline. If Docling recognizes a scanned document, it may use OCR automatically. That being said, a lot of the time you will still want to segment your documents.
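The auto-routing Ming describes can be pictured as a simple dispatch on document type. This is an illustrative sketch of the idea, not Docling's internal logic - the pipeline names and the mapping are hypothetical:

```python
# Illustrative sketch: route each input file to a conversion pipeline based
# on its type. Docling does something like this internally; the pipeline
# names and mapping here are made up for illustration.
from pathlib import Path

PIPELINES = {
    ".docx": "simple",   # easily parsable formats go through a simple pipeline
    ".xlsx": "simple",
    ".pdf": "pdf",       # PDFs go through the model-based PDF pipeline
}

def route(path: str) -> str:
    """Pick a pipeline from the file extension; unknown types need handling."""
    suffix = Path(path).suffix.lower()
    return PIPELINES.get(suffix, "unknown")

print(route("report.pdf"))
print(route("notes.docx"))
```

As Ming notes, routing on document *type* is the easy part; routing on document *complexity* (say, detecting that a PDF is scanned and needs OCR) requires inspecting the content, not just the extension.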
00:10:46 Rosemary: I hope that an organization segments their documents in some way already. I assume that the segmentation might be by category or by, I don't know, information on or within the document specifically. I hope they're doing it. But, you know, in case they're not, that might be a sort of first step they need to take themselves proactively.
00:11:06 Ming: At the very least, it can recognize document type. If you have documents that are all in PDF format, it's actually quite easy for it to identify that it's a PDF and route the conversion based on that. If you have a lot of Excel documents, they can be routed to the Excel pipeline. In terms of routing on the specific document type, that's an easier thing to do. Document complexity is a little more difficult. Having a model recognize the complexity of a document and route based on that may require more work on the actual user side than Docling currently does.
00:11:36 Rosemary: I know that some of these PDFs could involve - I'm just thinking of a very common example of PDFs that maybe aren't necessarily in a large enterprise but might be more from an academic perspective - research papers, they have a very specific format, very specific structure. A lot of the text you can generally predict where things are. Tables go in a certain section, most of the time. Figures go in a certain section most of the time. Is this the sort of ideal use case for Docling?
00:12:03 Ming: When Docling was first created, it was used to process a lot of these research paper type documents. One of the big issues with those research paper type documents is that their format is a little more complex than a standard document. A lot of them are going to have at least a two-column format. They're going to have images and figures interspersed within the text. They'll have captions for all these images. They end up having tables in a column or something like that. They tend to be more complex in terms of document format. But that was exactly what Docling was trying to tackle. When you look at a use case where you're converting a scientific paper or a research paper, something that has a lot of images and a lot of tables, the benefit of Docling is that because it's able to recognize where those images are and where those tables are, it can tag those images as images and tag the tables as tables. Which means that when it does the processing, it can recognize, okay, here's a table, I need to use the TableFormer model to extract the table. Here's an image, let me store this image separately for later processing. That's also where the VLM part comes in, too. Docling has a couple of VLMs that are integrated, one of them being IBM's Granite vision model. That would allow you to take the images or charts or figures out of your document, separate from the document itself - extracting and isolating those images and processing them using a VLM, so you can extract better information out of them than if you were to process an entire page with several images at once.
00:13:34 Rosemary: You mentioned VLM. What does VLM stand for, or what is it used for in Docling?
00:13:41 Ming: VLM in this context stands for visual language model. Docling currently, as part of one of its document processing pipelines, leverages the Granite Docling model, which is a VLM, to take an image of any scanned page. You can think of any document format - as long as you can scan it into an image, you can take that image and process it into the Docling document format. Instead of going through the traditional pipeline, which may go through several different models, including OCR models, TableFormer models, all those things, it can actually just take the scanned document as an image and process it directly into text.
00:14:16 Rosemary: Got it. I guess that makes me think about what are the limitations of Docling. You mentioned part of this, which is it's important to treat some of the parts of these documents as a separate processing pipeline. It could be images, it could be something else, in order to get the most context out of it and to use it most effectively. What are some of the limitations that you've encountered in Docling whether it be use cases, things that it can and cannot process, extra steps that you might need to take in order to process something effectively.
00:14:44 Ming: One thing that we've been working on lately, that we've run into some trouble with in Docling, is handwritten forms. A lot of text, especially in government documents, has some handwritten portions. Being able to accurately extract both text and handwriting ends up being somewhat difficult for Docling to do. If you use some of the integrated OCR models, what they end up doing is they take the text, but they also do their best to extract the handwritten portions. You end up with kind of mediocre handwritten portions and then some mediocre text as well. It's a problem that we're trying to solve with Docling by essentially separating those handwritten portions from the text. One of the things that a lot of models, specifically visual language models that do document conversion, don't do is preserve coordinate information for text within the document. If you use a basic extractor, even some of the more advanced ones, you don't really know exactly where each element of text that you're extracting is coming from on the document. You know that it's in the document, but in terms of the specific coordinate-based location, you don't know where that is. If you play with Docling, you'll see that it has this visual grounding function that allows you to draw bounding boxes around each element of text. Preserving that coordinate data is native to the Docling document format. And another thing that allows us to do, when it comes to the handwritten forms, is isolating specific sections, recognizing that they're handwritten, and being able to isolate those specific sections for better processing - like with models that maybe aren't integrated into Docling, to do the handwriting processing in pieces.
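The coordinate-preservation idea can be sketched in a few lines: if every extracted element keeps a bounding box and a label, the regions flagged as handwritten can be pulled out and sent to a handwriting-specialized model, while everything else goes through the normal text path. The element structure and labels below are hypothetical, for illustration only:

```python
# Toy sketch: each extracted element keeps its text, a label, and the
# bounding box it came from on the page, so regions can be routed to
# different models. The structure and labels are illustrative, not Docling's.
from dataclasses import dataclass

@dataclass
class TextElement:
    text: str
    label: str    # e.g. "text" or "handwritten" (hypothetical labels)
    bbox: tuple   # (left, top, right, bottom) in page coordinates

elements = [
    TextElement("Name:", "text", (50, 100, 120, 115)),
    TextElement("Jane Doe", "handwritten", (130, 98, 260, 120)),
]

# Isolate handwritten regions so a handwriting-capable model can be applied
# only to those crops, instead of one OCR model handling everything.
handwritten_regions = [e.bbox for e in elements if e.label == "handwritten"]
print(handwritten_regions)
```

With the coordinates in hand, each region could be cropped out of the page image and processed separately, which is exactly the separation Ming describes working toward.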
00:16:23 Rosemary: As you process the document, what makes that information useful for AI specifically? Does it differ from any other use case? Like in the use case that I was doing with OCR, we were pretty much just extracting these addresses and putting them in a database. That's just some data processing. But what makes the AI use case a little bit more special? I guess you have to be a little bit more conscientious of the structure and the context of the data before it goes to some AI system.
00:16:49 Ming: If you think of one of the most common AI use cases these days, which is RAG, it's being able to ask questions of your documents, ask questions of your corpus of documents. How do you get the text within your documents into your database? How do you get that text in a way that's easily searchable and also easily retrievable? The benefit of using Docling is that instead of taking your documents and just having this wall of text - you can chunk that wall of text as well and embed it into a vector database - with Docling, the structure of the text is preserved in the conversion, which means that when you do chunking, you can also preserve and chunk based on the structure. Should I go into the RAG stuff and talk about chunking?
00:17:38 Rosemary: Yes, you should definitely talk about RAG. I don't think we've ever talked about RAG in any episode. I am familiar with RAG, but I imagine a lot of folks who are in this space may or may not be as familiar with RAG.
00:17:50 Ming: So for those of you who are not familiar with RAG, it's retrieval augmented generation. The idea is you're taking all of the documents that you have - your whole corpus - and you need to somehow get it into a database that you can search out of, right? You need to be able to retrieve items from that database. In order to do that, most of the time you'll use some sort of embeddings model to take your text and embed it into vectors within your database. The issue with that is that all embeddings models have some sort of context window limit. You can't just take your full book and embed the entire book at once using your embeddings model. You need to break that book down into a bunch of different, smaller chunks so that your embeddings model can actually embed each of those individual chunks. So assuming, for example, that your embeddings model has a five hundred and twelve token embedding context window: how can I break this book down into chunks no bigger than five hundred twelve tokens? The naive way to do it is just take your entire text and say, okay, the first five hundred twelve tokens, that's one chunk, the second five hundred twelve tokens, that's another chunk, and so on. The issue is, if you think about any body of text, that chunking may end up splitting a paragraph in half. It may end up splitting a sentence in half. It may encompass one section in its entirety, but then only half of the next section. Using Docling, when you have documents in these highly structured formats, you can then chunk based on the document structure, which means that your chunks are going to be more semantically coherent. When you end up retrieving those chunks in your retrieval step for generation, you have these semantically coherent chunks, these structurally coherent chunks, that are just more valuable when you're doing the generation step. That's a huge piece of what Docling is capable of enabling.
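The difference between naive fixed-size chunking and structure-aware chunking can be sketched in a few lines. This is a toy example: real chunkers count model tokens rather than words, and Docling's own chunkers work directly on the converted document structure, but the failure mode and the fix are the same:

```python
def naive_chunks(text, max_words=8):
    """Split on a fixed word budget, ignoring structure - may cut mid-section."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def structural_chunks(sections, max_words=8):
    """Chunk each section separately so no chunk spans a section boundary."""
    chunks = []
    for heading, body in sections:
        for piece in naive_chunks(body, max_words):
            chunks.append(f"{heading}: {piece}")  # keep the heading as context
    return chunks

text = "Intro text here. Methods are described next in detail for reproducibility."
sections = [
    ("Intro", "Intro text here."),
    ("Methods", "Methods are described next in detail for reproducibility."),
]
print(naive_chunks(text))       # the first chunk straddles the section boundary
print(structural_chunks(sections))  # each chunk stays inside one section
```

The naive split cuts straight through the boundary between the intro and the methods text, while the structure-aware version yields one coherent chunk per section - which is why structure-preserving conversion pays off at retrieval time.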
The other thing is, obviously you have a lot of tables in these documents as well. Instead of just extracting those directly into text - instead of extracting this table and now you have a wall of text with no structure to it where the table should be - Docling can preserve the table structure. You can store it as a CSV, export it in Excel format, you can export to any format you want. But recognizing the table structure and having it stored as a table allows that specific data to be all the more valuable when you end up retrieving it in your AI pipeline.
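The table point can be illustrated with a tiny sketch: once a table has been recognized as rows and columns rather than flattened into a run of text, exporting it losslessly to a format like CSV is trivial. This is toy code, not Docling's exporter:

```python
import csv
import io

# A recognized table kept as structured rows, rather than a flat run of
# text where the cell boundaries have been lost.
table = [
    ["Quarter", "Revenue"],
    ["Q1", "10"],
    ["Q2", "12"],
]

# Exporting structured rows to CSV preserves every cell boundary.
buf = io.StringIO()
csv.writer(buf).writerows(table)
print(buf.getvalue())
```

Had the same table been extracted as the flat string "Quarter Revenue Q1 10 Q2 12", there would be no reliable way to recover which number belongs to which quarter - that is the data loss structure-preserving conversion avoids.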
00:20:07 Rosemary: This is particularly relevant for me. I've been attempting to build an agent that ingests all of my good Terraform configuration and all of my various resources that I've written out there about what is good Terraform, how to write it, good patterns, practices, and principles. I have a book in PDF, as well as a bunch of slides in PDF and PowerPoint. It was really hilarious because processing the book, I thought, was kind of straightforward. It turns out it's not as straightforward as I thought. I did end up having to pass it through Docling because I have a bunch of diagrams and tables that were messing with some of the processing when I tried to just feed it in straight. It's been an interesting journey there. But processing the book after I put it through Docling and did some testing and checking was a lot easier. It was a lot easier to handle. The slides are a different story, because I realized they have a number of diagrams embedded in the slides. So the slides themselves have no words, right? Or at least very few of them. Then there are a lot of these diagrams on there that have labels. That's probably the only text that's on there. Some of these diagrams were drawn outside of PowerPoint, while others were drawn directly using PowerPoint icons. Whenever I try to process these, it comes out completely incoherent because there's not that much text on there. Those relationships are really hard to capture. Do you have any suggestions on how I can process those slides, or is it some other problem that I have to fix?
00:21:51 Ming: If there is a certain amount of difficulty involved in processing these diagrams - especially when some of the diagrams are drawn in the slides themselves and some of them are just essentially images - I would recommend the best thing to do is process your slides straight into images. So make sure that all of those diagrams that are drawn in PowerPoint are also now just scanned images as part of the PowerPoint. I think you can just export your PowerPoint into PDF or something like that. The main reason is that Docling, as a first step, still tries to parse your document. If it's parsable, it will still try to parse it, because parsing is way cheaper than using OCR. The structures that you draw within PowerPoint using shapes and arrows aren't really easy to parse, and there's no simple way for Docling to have any sort of parsing understanding of those. Instead, store them as images separately, or have your PowerPoint as a scanned document. That way, when you process using Docling, all of those figures are going to be stored separately as images and figures, as opposed to a bunch of different pieces of text that are parsed out of the PowerPoint. From there, you can use some VLM to do processing and ingestion of those images and figures, as opposed to worrying about the two different types of figures that are in your PowerPoint.
00:23:10 Rosemary: I knew I should have coded these as a kind of like graph format or something so that I could - hindsight is twenty twenty, right? Because I should have just done it in the first place, where I coded it into some graph format that I could reproduce because actually some of these are long gone. Some of these diagrams are long gone. I can't find them anymore. I can't find the originals and especially the ones in PowerPoint. There's no way I can capture the boxes and the hierarchies and where the arrows are pointing.
00:23:40 Ming: VLMs these days are very powerful. There's a ton of open source VLMs now, too, and multimodal models that can take images and process them into text. They can take an image with a prompt and process it into readable text, or you can even potentially export it into a graph format. I would try out a lot of these open source VLMs, even the closed source ones. And again, what's nice is Docling can extract those figures as images, and you can process those images in your own way. You can take all those images and process them using whatever model you see fit. Even if it's not integrated in Docling already, it's super easy to take any image model, especially an open source one, and set it up to work in your document processing pipeline with Docling. I think that while initially keeping it in graph format may have been better, there's so many solutions for image processing these days, especially with the VLMs, that you may end up with output that's just as good.
00:24:44 Rosemary: Awesome. I think that's worth asking the bigger question, which is, there are a lot of document processing tools out there, especially PDF tools. Why should a developer choose Docling? Is it specifically for AI? Is there a reason for someone to choose Docling that's not just related to AI?
00:25:01 Ming: There's kind of two camps that these document processing applications or libraries fit into. There's these open source ones that tend to be very lightweight and very efficient, but lacking in features. And then there's these closed source ones - Azure has their smart document or document AI service, AWS has something similar, and Unstructured, for example, offers a service as well. Compared to the as-a-service camp, the main thing that Docling enables is local processing. You don't need to pay per page, right? You can use it on whatever hardware you have yourself, but also you're not having to send your data off to a server. If you have particularly sensitive data, or data that needs to be regulated in some way, being able to process it on your own servers is very beneficial. Docling being open source means that you can download it, install it, run it, and process your documents wherever it is that you want them processed. That's the benefit compared to some of these larger closed source providers. If you look at some of the more open source providers, the main thing that Docling adds - and I will be the first to admit, Docling will run slower than a lot of these open source tools - is that those tools just do text extraction, whereas Docling preserves the structure. That structural benefit, at least from our perspective, is worth the increased compute cost. Losing the entire semantic structure of your tables means that you lose out on so much valuable data that you would otherwise want access to. Docling adds an extra layer of processing compared to some of those lower-level open source PDF parsers, just to make that data more valuable.
00:26:56 Rosemary: There must be some burning questions in the Docling community that you get very regularly. I think we're here to set the record straight. What are some frequently asked questions that you get that you wish you could have put on an FAQ page, so that you didn't have to answer them over and over again?
00:27:11 Ming: A lot of the questions are around what is available in Docling, or what it can and can't do. One of the big ones - because Docling is used a lot for these RAG use cases, and I mentioned chunking already - is: can Docling do chunking, or how can I chunk using Docling? Docling actually has some built-in chunking functions within it. Again, check out the docs to take a look at what they do, but they leverage the Docling structure to chunk your text based on document structure, preserving a lot of that structural information in your chunks. Another question that we get asked a lot is: does Docling support OCR? Does Docling use OCR? I think the majority of people, when they think document processing, immediately think OCR models. So how does Docling compare to OCR models? OCR models are a part of what Docling does, right? Docling, in some of its pipelines, depending on your document type, will leverage an OCR model in order to extract more information out of the document. There are OCR models that are integrated within Docling, and you can bring your own OCR model if you have a specific one you want to use. But yes, Docling will leverage OCR if OCR is needed or you want to use OCR for your processing. Some people wonder, will it work on my device? Will it work on this device? Will it work on that device? It will work on anything with a CPU. You can run Docling on anything with a CPU; your processing times are going to vary. Obviously, if you have a specific GPU, there's a Mac accelerator, and there's now NVIDIA accelerators as well - CUDA-based accelerators. Actually, the Docling team very recently worked in partnership with NVIDIA to accelerate the Docling models for use on NVIDIA RTX devices. Any hardware that you have most likely will support Docling use. I think that's the main stuff. A lot of the questions that we get show up during our office hours.
Docling does have some semi-regular office hours now, so if you have any specific questions about Docling, feel free to tune into those. We announce them on all of the Docling social media channels. There's a LinkedIn Docling page, they should be on the website as well, there's a Discord, all that good stuff. If you have any other questions about Docling for the community, feel free to show up and ask them yourself.
00:29:21 Rosemary: For those who are not familiar - and I think we should define it just in case - OCR is optical character recognition. It's the process of recognizing the characters in a document and turning them into machine-readable text.
00:29:32 Ming: Earlier we were talking about the handwritten stuff. Part of the reason why the performance there is somewhat poorer for Docling is because most OCR models are either tuned to do handwriting or tuned to do text, and not a lot of them do both. When you use a text-based OCR model to try to analyze handwriting, it'll mess up. If you use a handwriting-based model, it'll hallucinate a ton on regular text. One thing that Docling does well is, again, recognizing some of those bounding box areas - being able to recognize where individual text comes from. So what we're working on is identifying handwritten text and identifying the bounding boxes that it's in, so that we can then use a more handwriting-capable model to process it.
00:30:19 Rosemary: I've been learning a lot about the AI space. As I mentioned before, I'm not an expert by any means. I know a little bit about a little bit of everything. The interesting challenge that I've encountered with AI is that there's a lot to learn. There's a lot to learn about context engineering and prompt engineering, and if you want to run agents, there's a whole thing on that. A lot of folks have come to me saying they want to learn more about AI and how it applies to their day-to-day job. They're developers, engineers, or operators, but the information is very overwhelming. They'll start searching and they'll quickly find one hundred different resources talking about one hundred different things. Most of the time they really only need to learn about, let's say, how to write a proper prompt. Some of them maybe just need to understand how to write an agent. How did you start learning in this space? Did you find it overwhelming as well? And how did you sort out which were the most important parts for you to learn?
00:31:19 Ming: I think that with the pace of development these days in this space, there is an incredibly overwhelming amount of information available out there. There are a million different courses that you could buy, a million different podcasts to listen to. There's a lot of content out there to learn about the AI space, and accurately identifying the right content is not something that people have an easy time doing, because even for something as simple as learning to write an agent, there are going to be hundreds of different videos teaching you how to do that. Most of my AI learning happened after I started working. I think a lot of what I learned in school set a nice foundation, but in terms of AI and applications, it was more useful learning on the job. What helped me the most was a mentor in the space who showed me the resources that they had found the most valuable when they started, and also the resources they find the most valuable now as they keep learning. Finding the right mentor to show you the resources is great, but then where do you go to find the right mentor? I think that's a huge part of where the open source community comes in, and just the community aspect of all this learning. The benefit of our interconnected world now is that there are all these people out there that are super willing to help, and it's so easy to connect directly with them. Joining any of these communities, joining their Discords, joining their office hours, just interacting with people through GitHub, working on these small issues, will get you connected with people that are very interested in helping.
I know that there are a lot of mentorship programs, which even IBM has, to mentor college students and people that are interested in entering this open source space, to increase the number of people that are using and developing in the field. Because at the end of the day, most of these projects want more people working on them. They want more eyes on them, they want more developers. It is in their best interest to make sure that you learn the things you need to know to be effective in this space. Identify projects that you're interested in, projects where you think, oh, this is going to be something that's really useful to me, or look at the projects that you currently use. If you currently use a project and you realize it's open source, go check out the community, ask around, and if you have an issue, always raise it there. Just interacting with people as a whole, I think, is really beneficial in accelerating that learning process.
00:33:57 Rosemary: In my experience, it helps to also have someone who's knowledgeable in the space to give you a more pragmatic perspective on what's reasonable. There's a lot of innovative thinking on what you can do with AI, but the reality is that, at least whenever I've spoken to folks in the community, most of what we're looking to do with AI is just expediting our workflows as developers. We're not necessarily looking to generate whole entire applications. Some people do, but most of us are looking to expedite the process of code review, of building tests, of making sure our code is compliant. We're looking to take some of the grunt work of scaling code out of our hands and put it into some kind of automated responsibility that we hold accountable. Hopefully, we hold accountable. The thing that helped me the most as I started working in the space was just someone telling me, "Don't think about that yet. We're not there yet. We're thinking about this right now, and this is the most pragmatic use case, the one that makes the most sense and that the developer community is looking into." A lot of folks aren't necessarily looking at the next generation of AI use cases. They're just looking at improving their own productivity, little bit by little bit.
00:35:22 Ming: I think one thing that AI does really well these days is encouraging every little idea that you have and trying to flesh out every single one of them. But what that ends up doing is you end up targeting all these things that maybe are not super practical for your actual use case. You may develop features that are just completely out there, just because AI says, oh yeah, that's a great idea. Every time you ask a question, it's like, "Does this make sense? Yeah, that's a great idea." Having a human in the loop, which would potentially be a mentor of some sort, to ground you and tell you what actually works and what doesn't, I think has become more valuable now because of this kind of over-hyping that AI does for you.
00:36:05 Rosemary: All right. The other thing that I've learned is that AI can be a little cost prohibitive, in that I've quickly run out of credits for all of the various AI models that I've run. I've started to look into a local development stack, and it turns out there's not a universally favored local development stack. You get a mixed set of tools with different requirements. Some say it's better to run on a GPU, some say it's not. It just depends on the model. There's a whole bunch of resources on that. What do you use, or what do you suggest?
00:36:39 Ming: I like that you asked this question, because a huge part of what I did before working in open source was also looking at open source models. I used to work on the Granite family of models that IBM has, which are a bunch of open source models that are, relatively speaking, much, much smaller than some of these frontier models that people use. If you look at these cloud models, like [Claude] Haiku, Sonnet, and Opus, these are hundreds of billions of parameters, up to trillions of parameters, models that you could never possibly run on your local device. They end up being very cost prohibitive because they're just huge models that aren't really specialized for any one use case. A lot of people really like these smaller open source models for their use cases. One application that I use a lot is Ollama. It's a model serving platform with a whole catalog of open source models that are optimized for it, the Granite family being one of them. The Granite Docling model is being worked on to be in Ollama as well. It allows you to run these models on your local device, limited by the capabilities of your device, like what GPUs you have. I think that any sort of open source, locally run application is always good to focus on, because it's easier to scale those up. If you're working on an application locally, it's easier to scale that up into some huge application than it is to take a huge application, scale it down, and run it on your device.
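As a concrete sketch of what Ming describes, a model served locally by Ollama can be called over its REST API, which listens on localhost:11434 by default. The model name "granite3.3" below is a placeholder for whatever model you've pulled with `ollama pull`; this snippet only builds and prints the request, since actually sending it requires a running Ollama server.

```python
import json
import urllib.request

# Default endpoint for Ollama's non-streaming text generation API.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("granite3.3", "Summarize this meeting transcript.")
print(req.full_url)

# To actually run it (requires a running Ollama server with the model pulled):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The same pattern works for any model in the Ollama catalog; only the `model` field changes, which is what makes it easy to swap between small and large local models.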
00:38:20 Rosemary: I've been using Ollama whenever I run out of credits. It's nice to be able to run the models locally on my machine. I've had to tune it just a little bit to run some of these models, but there are so many options out there that you can choose from. While the models might not be perfect, and you might have some situations where they're not going to give you the response that you're looking for, I've not found them that far off. I think the models have been getting better and better, especially for anyone with fewer resources.
00:38:48 Ming: There's such a variety of models available on Ollama, too. You have models that are several hundred million parameters, which can run very efficiently, and then you have thirty-two-billion-parameter models that you maybe need a pretty solid GPU for. You have this huge variation in what models you actually have available there. Tuning that to your own use case and what you have available is always good.
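The hardware requirements Ming mentions follow from a rough rule of thumb (an approximation added here, not something from the episode): a model's memory footprint is about its parameter count times the bytes stored per weight, which is why quantization is what makes larger models fit on consumer hardware.

```python
# Rough rule of thumb: memory ~= parameters x bytes per weight.
# This ignores activation memory and runtime overhead, so treat the
# numbers as ballpark estimates, not exact requirements.

def approx_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the weights, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A few-hundred-million-parameter model at 16-bit precision fits almost anywhere:
print(f"{approx_memory_gb(0.3, 16):.1f} GB")
# A 32-billion-parameter model, even 4-bit quantized, wants a fairly solid GPU:
print(f"{approx_memory_gb(32, 4):.1f} GB")
```

This is why the same 32B model can be out of reach at 16-bit but runnable at 4-bit on a single high-memory consumer GPU.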
00:39:14 Rosemary: We have one final question for you, in true HashiCast tradition. It's been a while since we've done one of these. This is very, very much a less serious question. We usually phrase it as, "This is a slightly less serious question." This is very much not a serious question at all, and it has nothing to do with AI or Docling or anything we talked about today. If you were a cheese, what kind would you be and why?
00:39:44 Ming: Oh, man.
00:39:45 Rosemary: I don't know if you eat cheese or not.
00:39:47 Ming: No, the thing is, I actually hate cheese. I hate to say it, and it's funny, most people are like, what do you mean? I'm actually not even lactose intolerant. I love milk, I love ice cream, but I cannot stand the taste of cheese. Actually, two of my best friends also don't like cheese, so I'm not alone in this. I don't like cheese, but the one cheese that I do tolerate, and I'm not sure if this is rude to say, is the cheese on Domino's pizza. Most times if I order pizza, I'll say, can I get it with no cheese? And it always gets a weird look. Fine, whatever. But for some reason, the cheese on Domino's pizza doesn't trigger the same cheese aversion as regular cheese. It just doesn't taste like cheese to me. And so I can eat that just fine.
00:40:34 Rosemary: That's valid. So you'd be, like, reconstituted cheese.
00:40:37 Ming: Yes, exactly. That's the cheese that I would choose.
00:40:39 Rosemary: I think that's okay. I don't think that's rude. It's just very interesting that it's just the Domino's pizza.
00:40:45 Ming: No, I feel bad saying that about Domino's. But you know, I love Domino's. It's just the reason why I like it.
00:40:52 Rosemary: Funny, the one less serious question that we landed on. You just don't like cheese. So there you go. I would probably be a Brie, like a soft cheese. Just because I'm a soft hearted person. I'm pretty versatile. I guess you could eat Brie with a sandwich. You could have a baguette. You can have it by itself if you really wanted to, I guess.
00:41:09 Ming: Yeah, I mean, I don't like Brie. I've tried a lot of cheeses. I don't like Brie. The worst one is feta cheese. That is so strong. Terrible. Parmesan? I hate getting meatballs and then there's parmesan in them. They don't tell you. They don't ever tell you that there's cheese in the meatball, but then there's parmesan.
00:41:31 Rosemary: I'm sorry. I do have a friend who has a parmesan aversion. Loves pizza, but does not like parmesan. A lot of pizza, like the fancy pizza, they put the parmesan on the top and she's like, I can't stand it. So yeah, to each their own. To each their own. Well, Ming, it was great having you on HashiCast and learning about what cheeses are acceptable and not acceptable, as well as about Docling and AI and how you got into the space. Thank you so much for taking the time to talk to us today.
00:42:01 Ming: Well, thank you for having me, Rosemary.
00:42:03 Intro / Outro: You've been listening to HashiCast. Today's guest was Ming Zhao from IBM. Tune in next time.
»Hosts
- Rosemary Wang, Chief Developer Advocate


