Vox AI can produce material on any topic using Siri-like voice commands and turn it into audiobooks, podcasts, or voiceovers for VSLs, TV commercials, webinars, sales videos, and more.

Text From This Video

Okay. I want you to look at this, because it's a huge problem for large language models like GPT-3 and now GPT-4. But it's not code. It's a list that nations are dealing with. Before we get to the big problem with large language models, let's briefly recap how they work. You're probably familiar with ChatGPT. It's an app that sits on top of a large language model, in this case GPT. ChatGPT's models do natural language processing, which is used in everything from autocomplete to customer service over the phone. GPT is like a used bookshop. It learned from the books in the store.

Large language models basically read a lot and try to master a language. They test themselves by covering up the answers and then checking for accuracy. They can then apply that expertise to identify sentiment, summarize, translate, and provide answers or suggestions depending on the processed data. That last line was written by ChatGPT. You have to do that in videos like these. It's wonderful at this skill because it has read so much. ChatGPT can reword things in the style of Shakespeare because it has read all of Shakespeare, but that is where my waste of paper comes in.

I want to show you something. Yeah? Good? Okay. This is a printout of Common Crawl from 2008 until the present. Common Crawl indexes webpages from across the web in each crawl. This list includes all the languages they think they've indexed. English is immediately apparent here. English makes up around 40% of every crawl. Then there is DEU, for Deutsch: German. Its figures are here.


German is approximately 6% every time, which may not seem like much, but it is. But look at 2023. Here is fin: Finnish. It accounts for just 0.4% of the crawl. This bookshop has inventory issues. The focus is on a narrow selection of languages. According to one report, 20 of the 7,000 languages spoken worldwide make up most NLP research. Let's back up a bit. This is Ruth-Ann Armstrong, a researcher I interviewed who is doing what many researchers are trying to do: build new datasets. We call those 20 languages high-resource languages and the rest low-resource languages. Low-resource languages don't appear much on the Internet as text, so they don't make it into language datasets, and the AI can't understand them.


Recall our secondhand bookshop. It has plenty of James Patterson, Anne Tyler, and Dan Brown novels. English, German, and Chinese are like that: languages with abundant resources. There are also rare books. Those are the low-resource languages. Many models know little about them, or nothing at all. I'm from Jamaica. Jamaica's main language is English, but we also speak Jamaican Patois, a Creole language. Armstrong and her coauthors set out to build a dataset for this mostly spoken language. The goal was not ChatGPT; they wanted their own model to understand it. To do that, Armstrong gathered Jamaican Patois examples in pairs of sentences. She also labeled each pair as agreeing (entailment), disagreeing (contradiction), or neutral.
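The labeling scheme described here, sentence pairs tagged as entailment, contradiction, or neutral, can be sketched in a few lines of Python. This is only an illustration: the class name `SentencePair`, the helper `label_counts`, and the English placeholder sentences are assumptions made up for the example, not rows from Armstrong's dataset or her team's code.

```python
from collections import Counter
from dataclasses import dataclass

# The three labels used for sentence pairs in this kind of dataset.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class SentencePair:
    """One labeled example: sentence A, sentence B, and their relation."""
    sentence_a: str
    sentence_b: str
    label: str

    def __post_init__(self):
        # Reject anything outside the three-way labeling scheme.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


# Hypothetical English placeholders, not real dataset rows.
pairs = [
    SentencePair("She has a fever.", "Her temperature is high.", "entailment"),
    SentencePair("She has a fever.", "Her temperature is normal.", "contradiction"),
    SentencePair("She has a fever.", "She lives near the market.", "neutral"),
]


def label_counts(examples):
    """Count how many examples carry each label (zero if a label is unused)."""
    counts = Counter(pair.label for pair in examples)
    return {label: counts.get(label, 0) for label in LABELS}


print(label_counts(pairs))
```

Checking the label balance like this is a common sanity step when a dataset is labeled by hand, since every one of the hundreds of pairs has to be written and tagged individually.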

You can try it here. Sentence A says someone has a fever. Sentence B says their temperature is high. So that's entailment: they agree. Here's another one: contradiction or entailment? Contradiction. One last one. Neutral: the two statements aren't connected. She did that for about 650 examples. As you can see, this was a lot of work. Jamaican Patois is not on my long list of Common Crawl languages.

I also talked to researchers working on Catalan. They are assessing how well these massive language models perform on Catalan. It is spoken mostly in this autonomous community of Spain. English makes up 92% of GPT-3's training data. German has 1.4%. Spanish accounts for 0.7%. And finally, Catalan: 0.01% of the words in the training set. It still works great. So this situation is a bit different, right? Catalan is in the dataset. According to Common Crawl, Catalan is 0.2335% of their crawl. Not much, but some.

Even with that little data, the big companies' models, like GPT-3 and GPT-4, performed well. The research team had GPT-3 generate three Catalan sentences, then mixed in real sentences, and three native speakers evaluated them. That was our test. The machine did well in their results. However, there is a catch. It works well, but it is still worth building a language-specific model that has been trained and tested for that language.

The issues here are transparency and data volume, not performance. Common Crawl claims to have indexed millions of Catalan words. By GPT-3's account, it read only 140 pages of Catalan. Think of it as a novella. Depending on the performance or goodwill of a few institutions or companies is a problem. One of these firms could drop Catalan. Catalan News alleged that Google was excluding Catalan sites from searches. GPT-3 was trained on much more than Common Crawl. GPT-4's details are unknown. That means there is a lot in this language model that we don't know about. Meta, Microsoft, Baidu, OpenAI, and Google currently run all these bookshops. They don't reveal where the books came from or who wrote them.

A library is being built next to the bookstore. Paris has an underused supercomputer. I was talking to people about it; it's almost down the road.

They had built it, and they said, "Nobody uses this GPU. Can we do anything with it?" Thomas Wolf, co-founder of Hugging Face, worked on BigScience's BLOOM, an initiative to produce an open-source multilingual model. The more we thought about it, the more it made sense to train it in several languages besides English. As we brought in more people, it grew from a Hugging Face project into a massive collaboration. We opened this to everyone. They covered the top languages on Wikipedia, but also included low-resource languages whenever possible. We have very low-resource languages there, mostly African languages. To acquire data there, we chose to engage as much as possible with local communities and ask them what data they thought was valuable and how we might get it. Importantly, we know where the data came from and how it was gathered. Open source is different. You know which books are in the library.

Okay, let me find English. As an English speaker, let's be honest: I'm a huge company's target audience. Even though English makes up well over 40% of Common Crawl, that target audience wants all languages adequately represented. Even though I speak English, I have a Jamaican accent. I recall that Siri didn't understand my accent when it first came out, so I had trouble using it. So the datasets used to train voice assistants needed to be expanded, and adding accents has helped. Imagine what would happen if we expanded another piece of it and created technology for more languages.

To use this model everywhere, you have to trust them. If you trust Microsoft, fine. But if you don't trust them, yeah. We speak it. Catalan is our language. Whether it is a small language or a relatively small one, you may have languages that have a... a lot of speakers in the real world but few online. And so they will vanish.

