Introducing Navarasa: Indic Gemma 7B/2B Instruction-Tuned Models for 9 Indian Languages

Ravi Theja
Mar 6, 2024


Today at Telugu LLM Labs, we are thrilled to share Navarasa, a set of Gemma 7B/2B models instruction-tuned on 9 Indian languages plus English. To our knowledge, this may be the first open Indic instruction-tuned model trained on 9 Indian languages.

Navarasa is a supervised fine-tuned (SFT) model built on the Gemma 7B/2B base models. Last week we released the Telugu Gemma 7B/2B SFT models, trained on curated Telugu datasets from Telugu LLM Labs, and we observed really good performance compared to Llama2-based models.

So we thought: why not scale the Gemma 7B/2B models up to multiple Indian languages? We started by testing the tokenizer on the following 9 Indian languages and English.

  1. Hindi
  2. Telugu
  3. Tamil
  4. Malayalam
  5. Kannada
  6. Gujarati
  7. Bengali
  8. Punjabi
  9. Odia
  10. English

Let’s first check the tokenization of a simple sentence, “India is a great country”, in each of the Indian languages.

Tokenization:

Hindi: भारत एक महान देश है.

Tokenization: [‘भार’, ‘त’, ‘▁एक’, ‘▁मह’, ‘ान’, ‘▁देश’, ‘▁है’, ‘.’]

Telugu: భారతదేశం గొప్ప దేశం.

Tokenization: [‘భ’, ‘ార’, ‘త’, ‘ద’, ‘ే’, ‘శ’, ‘ం’, ‘▁గ’, ‘ొ’, ‘ప్ప’, ‘▁ద’, ‘ే’, ‘శ’, ‘ం’, ‘.’]

Tamil: இந்தியா ஒரு சிறந்த நாடு.

Tokenization: [‘இ’, ‘ந்த’, ‘ிய’, ‘ா’, ‘▁ஒரு’, ‘▁ச’, ‘ிற’, ‘ந்த’, ‘▁நா’, ‘டு’, ‘.’]

Malayalam: ഇന്ത്യ ഒരു മഹത്തായ രാജ്യമാണ്.

Tokenization: [‘ഇ’, ‘ന്’, ‘ത്’, ‘യ’, ‘▁ഒരു’, ‘▁മ’, ‘ഹ’, ‘ത്ത’, ‘ായ’, ‘▁’, ‘രാ’, ‘ജ’, ‘്യ’, ‘മാ’, ‘ണ്’, ‘.’]

Kannada: ಭಾರತ ಒಂದು ಶ್ರೇಷ್ಠ ದೇಶ.

Tokenization: [‘ಭ’, ‘ಾರ’, ‘ತ’, ‘▁’, ‘ಒ’, ‘ಂದು’, ‘▁ಶ’, ‘್ರ’, ‘ೇ’, ‘ಷ್’, ‘ಠ’, ‘▁ದ’, ‘ೇ’, ‘ಶ’, ‘.’]

Gujarati: ભારત એક મહાન દેશ છે.

Tokenization: [‘ભ’, ‘ાર’, ‘ત’, ‘▁એ’, ‘ક’, ‘▁મ’, ‘હ’, ‘ા’, ‘ન’, ‘▁દ’, ‘ે’, ‘શ’, ‘▁છે’, ‘.’]

Bengali: ভারত একটি মহান দেশ।

Tokenization: [‘ভ’, ‘ার’, ‘ত’, ‘▁একটি’, ‘▁ম’, ‘হ’, ‘ান’, ‘▁দ’, ‘েশ’, ‘।’]

Punjabi: ਭਾਰਤ ਇੱਕ ਮਹਾਨ ਦੇਸ਼ ਹੈ।

Tokenization: [‘ਭ’, ‘ਾਰ’, ‘ਤ’, ‘▁ਇ’, ‘ੱਕ’, ‘▁ਮ’, ‘ਹ’, ‘ਾਨ’, ‘▁ਦੇ’, ‘ਸ਼’, ‘▁ਹੈ’, ‘।’]

Odia: ଭାରତ ଏକ ମହାନ ଦେଶ |

Tokenization: [‘ଭ’, ‘ା’, ‘ର’, ‘ତ’, ‘▁’, ‘ଏ’, ‘କ’, ‘▁’, ‘ମ’, ‘ହ’, ‘ା’, ‘ନ’, ‘▁’, ‘ଦ’, ‘େ’, ‘ଶ’, ‘▁|’]
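One rough way to compare the token lists above is to count tokens per whitespace-separated word. The following is a minimal sketch, not part of the released scripts: the sentences and token lists for Hindi, Telugu, and Bengali are copied from the examples above, and the helper names `tokens_per_word` and `ratio_table` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# Sentences and token lists copied from the tokenization examples above.
EXAMPLES: Dict[str, Tuple[str, Sequence[str]]] = {
    "Hindi": (
        "भारत एक महान देश है.",
        ["भार", "त", "▁एक", "▁मह", "ान", "▁देश", "▁है", "."],
    ),
    "Telugu": (
        "భారతదేశం గొప్ప దేశం.",
        ["భ", "ార", "త", "ద", "ే", "శ", "ం", "▁గ", "ొ", "ప్ప",
         "▁ద", "ే", "శ", "ం", "."],
    ),
    "Bengali": (
        "ভারত একটি মহান দেশ।",
        ["ভ", "ার", "ত", "▁একটি", "▁ম", "হ", "ান", "▁দ", "েশ", "।"],
    ),
}


def tokens_per_word(sentence: str, tokens: Sequence[str]) -> float:
    """Average number of tokens per whitespace-separated word (hypothetical helper)."""
    words = sentence.split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    return len(tokens) / len(words)


def ratio_table(examples: Dict[str, Tuple[str, Sequence[str]]]) -> Dict[str, float]:
    """Tokens-per-word ratio for every language in `examples`, rounded to 2 places."""
    return {
        language: round(tokens_per_word(sentence, tokens), 2)
        for language, (sentence, tokens) in examples.items()
    }


if __name__ == "__main__":
    for language, ratio in ratio_table(EXAMPLES).items():
        print(f"{language}: {ratio} tokens per word")
```

A lower ratio means the same text fits in fewer tokens. To produce token lists for other sentences, something like `AutoTokenizer.from_pretrained("google/gemma-7b").tokenize(text)` from Hugging Face Transformers should work, assuming that checkpoint matches the tokenizer used here.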

As you can see, the Gemma tokenizer segments each of these languages reasonably well. Over the past few months, we have seen community efforts to curate instruction datasets across Indian languages, and we are very thankful for each one of them. We consolidated all these community-released instruction datasets across 9 Indian languages to build our model. To preserve the model's English input and response capabilities, we included an English instruction dataset as well.

Datasets:

We used the following datasets, which together contain more than 450K instruction samples.

1. Hindi: ravithejads/samvaad-hi-filtered, HydraIndicLM/hindi_alpaca_dolly_67k (sampled)

2. Telugu: Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized, Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized

3. Tamil: abhinand/tamil-alpaca

4. Kannada: Tensoic/airoboros-3.2_kn, Tensoic/gpt-teacher_kn

5. Malayalam: VishnuPJ/Alpaca_Instruct_Malayalam

6. Gujarati: Tensoic/Alpaca-Gujarati

7. Punjabi: HydraIndicLM/punjabi_alpaca_52K

8. Bengali: HydraIndicLM/bengali_alpaca_dolly_67k (alpaca filtered)

9. Odia: OdiaGenAI/Odia_Alpaca_instructions_52k, OdiaGenAI/gpt-teacher-roleplay-odia-3k

10. English: yahma/alpaca-cleaned
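Consolidating datasets like the ones listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual preprocessing code: it assumes every source already exposes Alpaca-style `instruction`, `input`, and `output` fields, and the function names, the `language` tag, and the toy samples are all hypothetical.

```python
import random
from typing import Dict, Iterable, List

ALPACA_KEYS = ("instruction", "input", "output")


def to_alpaca_record(raw: Dict[str, str], language: str) -> Dict[str, str]:
    """Map one raw sample onto instruction/input/output plus a language tag.

    Assumption: the source uses Alpaca-style column names; a missing or
    None field is treated as an empty string.
    """
    record = {key: (raw.get(key) or "").strip() for key in ALPACA_KEYS}
    record["language"] = language
    return record


def consolidate(
    sources: Dict[str, Iterable[Dict[str, str]]], seed: int = 0
) -> List[Dict[str, str]]:
    """Merge per-language samples, drop incomplete ones, and shuffle."""
    merged = [
        to_alpaca_record(raw, language)
        for language, samples in sources.items()
        for raw in samples
    ]
    # Keep only samples that have both an instruction and a response.
    merged = [r for r in merged if r["instruction"] and r["output"]]
    # Shuffle with a fixed seed so languages are interleaved reproducibly.
    random.Random(seed).shuffle(merged)
    return merged


if __name__ == "__main__":
    # Toy samples for illustration only; they are not taken from the datasets.
    toy_sources = {
        "English": [{"instruction": "Say hello.", "output": "Hello!"}],
        "Hindi": [
            {"instruction": "नमस्ते कहें।", "input": "", "output": "नमस्ते!"},
            {"instruction": "", "output": "dropped: no instruction"},
        ],
    }
    for row in consolidate(toy_sources):
        print(row)
```

Tagging each record with its language makes it easy to check the per-language balance of the merged set before training.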

Training Details:

The models were trained for 2 epochs on a single A100 machine, which took approximately 36 hours for the 7B model and 15 hours for the 2B model. The training and inference scripts can be found in our GitHub repository.

Model Capabilities:

We found the model to have the following capabilities:

  1. Instruction and Input in Native X language, Output in Native X language.
  2. Instruction and Input in English language prompted to respond in Native X language, Output in Native X language.
  3. Instruction in Native X language, Input in English language and Output in Native X language.
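The three modes above differ only in which language the instruction and input are written in. The sketch below builds one prompt per mode, reusing the Hindi and English text from the examples later in this post. The template is an assumption modeled on the Alpaca format of the datasets listed above; the exact template used for training is not stated here, so check the training and inference scripts before relying on it. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical.

```python
# Assumed Alpaca-style template; treat this layout as a placeholder until
# it is checked against the released training scripts.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input_text}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the assumed template; input_text may be empty (e.g. for reasoning)."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(), input_text=input_text.strip()
    )


# Text taken from the question answering example in this post.
HINDI_QUESTION = "Google समाचार ऐप कब जारी किया गया था?"
HINDI_CONTEXT = (
    "Google ने सितंबर 2002 में बीटा संस्करण और जनवरी 2006 में "
    "आधिकारिक ऐप जारी किया।"
)
ENGLISH_CONTEXT = (
    "Google released a beta version in September 2002 "
    "and the official app in January 2006."
)

CAPABILITY_PROMPTS = {
    # 1. Instruction and input in the native language.
    "native instruction, native input": build_prompt(HINDI_QUESTION, HINDI_CONTEXT),
    # 2. English instruction and input, asking for a native-language response.
    "English instruction, native output": build_prompt(
        "Translate the following sentence to Hindi.", "India is a great country."
    ),
    # 3. Native-language instruction with English input.
    "native instruction, English input": build_prompt(HINDI_QUESTION, ENGLISH_CONTEXT),
}

if __name__ == "__main__":
    for mode, prompt in CAPABILITY_PROMPTS.items():
        print(f"--- {mode} ---\n{prompt}")
```

Each prompt ends at the response marker, so the text the model generates after it is the answer.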

Below we show example prompts and responses for reasoning, question answering with context, and translation to native language X when instructed in English.

Examples:

Reasoning:

Instruction: (2+10)/(1+3)? Solve with a detailed explanation.

Note: This task has no input; only the instruction is provided.

  1. Hindi: (2+10)/(1+3)? विस्तृत व्याख्या सहित हल करें
  • Output: (2+10)/(1+3)\n\n= (12)/(4)\n\n= 3\n\nइसलिए, (2+10)/(1+3) का उत्तर 3 है।

2. Telugu: (2+10)/(1+3)? వివరణాత్మక వివరణతో పరిష్కరించండి

  • Output: వివరణ కొరకు, మొదట పారెంథసిస్ లోపల వ్యక్తీకరణను పరిష్కరిస్తాము, అది 2 + 10 = 12. అప్పుడు మనం పారెంథసిస్ పైన ఉన్న వ్యక్తీకరణను పరిష్కరిస్తాము, అంటే 1 + 3 = 4. చివరగా, మనం పారెంథసిస్ పైన ఉన్న వ్యక్తీకరణను పరిష్కరిస్తాము, అంటే 12 ÷ 4 = 3. అందువల్ల, (2 + 10) / (1 + 3) = 3.

3. Tamil: (2+10)/(1+3)? விரிவான விளக்கத்துடன் தீர்க்கவும்

  • Output:கொடுக்கப்பட்ட வெளிப்பாடு படிப்படியாக தீர்க்கப்படும்:\n\n1. அடைப்புக்குறிக்குள் எண்களைச் சேர்க்கவும்: (2 + 10) = 12\n2. அடைப்புக்குறிக்குள் உள்ள மதிப்புகளின் கூட்டுத்தொகைக்கு, அடைப்புக்குறிக்குள் மற்றொரு வெளிப்பாட்டைச் சேர்க்கவும்: (1 + 3) = 4\n3. இப்போது, ​​நாம் இரண்டு வெளிப்பாடுகளை பிரிக்க வேண்டும். இறுதி முடிவு (2+10)/(1+3) = 3.

4. Malayalam: (2+10)/(1+3)? വിശദമായ വിശദീകരണത്തോടെ പരിഹരിക്കുക

  • Output: (2+10)/(1+3) = (12/4) = 3.

5. Kannada: (2+10)/(1+3)? ವಿವರವಾದ ವಿವರಣೆಯೊಂದಿಗೆ ಪರಿಹರಿಸಿ

  • Output: ಹಂತ ಹಂತವಾಗಿ ಅಭಿವ್ಯಕ್ತಿಯನ್ನು ಪರಿಹರಿಸೋಣ.\n\nನೀಡಿರುವ ಅಭಿವ್ಯಕ್ತಿ ಹೀಗಿದೆ:\n\n(2 + 10) / (1 + 3)\n\nಮೊದಲಿಗೆ, ಆವರಣದೊಳಗೆ ಕಾರ್ಯಾಚರಣೆಗಳನ್ನು ಮಾಡೋಣ:\n\n12/4\n\nಈಗ, ವಿಭಾಗವನ್ನು ಮಾಡೋಣ:\n\n3\n\nಆದ್ದರಿಂದ, (2 + 10) / (1 + 3) ಫಲಿತಾಂಶವು 3 ಆಗಿದೆ.

6. Gujarati: (2+10)/(1+3)? વિગતવાર વર્ણન સાથે ઉકેલો

  • Output: આ સમીકરણને (2+10)/(1+3) તરીકે લખી શકાય. પહેલાં અંશ અને છેદ બંનેની ગણતરી કરો: (2+10)=12 અને (1+3)=4. પછી અંશ અને છેદ વડે ભાગ લો: 12/4=3. તેથી, (2+10)/(1+3)નું પરિણામ 3 છે.

7. Bengali: (2+10)/(1+3)? বিস্তারিত ব্যাখ্যা সহ সমাধান করুন

  • Output: প্রথমে (2+10) এবং (1+3) এর মান বের করতে হবে।\n\n(2+10) = 12\n(1+3) = 4\n\nতাহলে সমাধান হবে,\n\n(2+10)/(1+3) = 12/4\n\n= 3\n\nতাই, (2+10)/(1+3) = 3।

8. Punjabi: (2+10)/(1+3)? ਵਿਸਤ੍ਰਿਤ ਵਿਆਖਿਆ ਨਾਲ ਹੱਲ ਕਰੋ

  • Output: (2+10)/(1+3) = (12)/4 = 3।

9. Odia: (2 + 10) / (1 + 3)? ବିସ୍ତୃତ ବ୍ୟାଖ୍ୟା ସହିତ ସମାଧାନ କରନ୍ତୁ |

  • Output: ସମାଧାନ କରିବାକୁ ଆସନ୍ତୁ ସମସ୍ତ କାର୍ଯ୍ୟ ସମାପ୍ତ କରିବାକୁ କାର୍ଯ୍ୟ କରିବା | \ n1 | ପ୍ରଥମେ ବନ୍ଧନୀରେ ସମୀକରଣ ସମାଧାନ କରନ୍ତୁ \ n (2 + 10) = 12 \ n2 | ପରବର୍ତ୍ତୀ ସମୀକରଣକୁ ସମାଧାନ କରନ୍ତୁ ଯାହା ବନ୍ଧନୀ ବାହାରେ \ n1 + 3 = 4 \ n3 | ଶେଷରେ ଦୁଇଟି ସମାଧାନ ହୋଇଥିବା ସମୀକରଣକୁ ଭାଗ କରନ୍ତୁ: \ n12 / 4 = 3 \ n ତେଣୁ ସମୀକରଣର ସମାଧାନ (2 + 10) / (1 + 3) ହେଉଛି 3 |

10. English: (2+10)/(1+3)? Solve with a detailed explanation

  • Output: The given expression is (2 + 10)/(1 + 3).\n\nFirst, we need to solve the numerator and denominator of the fraction separately.\n\nThe numerator is the sum of 2 and 10, which is 12.\n\nThe denominator is the sum of 1 and 3, which is 4.\n\nSo, the expression becomes (12/4).\n\nNow, we can solve the fraction by dividing 12 by 4.\n\n12 divided by 4 is 3.\n\nSo, the final result of the expression (2 + 10)/(1 + 3) is 3.

Question Answering with Context:

The task is to answer the question given some information.

Instruction: When was the Google News app released?

Input: Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web. Google released a beta version in September 2002 and the official app in January 2006.

  1. Hindi Instruction: Google समाचार ऐप कब जारी किया गया था?
  • Hindi Input: Google News Google द्वारा विकसित एक समाचार एग्रीगेटर सेवा है। यह हजारों प्रकाशकों और पत्रिकाओं से व्यवस्थित लेखों के लिंक का निरंतर प्रवाह प्रस्तुत करता है। Google News एंड्रॉइड, iOS और वेब पर एक ऐप के रूप में उपलब्ध है। Google ने सितंबर 2002 में बीटा संस्करण और जनवरी 2006 में आधिकारिक ऐप जारी किया।
  • Output: Google News ऐप सितंबर 2002 में बीटा संस्करण में जारी किया गया था और जनवरी 2006 में आधिकारिक ऐप के रूप में उपलब्ध हुआ था।

2. Telugu Instruction: Google వార్తలు యాప్ ఎప్పుడు విడుదల చేయబడింది?

  • Telugu Input: Google వార్తలు అనేది Google ద్వారా అభివృద్ధి చేయబడిన వార్తా అగ్రిగేటర్ సేవ. ఇది వేలకొద్దీ ప్రచురణకర్తలు మరియు మ్యాగజైన్‌ల నుండి నిర్వహించబడిన కథనాలకు నిరంతర లింక్‌లను అందిస్తుంది. Google వార్తలు Android, iOS మరియు వెబ్‌లో యాప్‌గా అందుబాటులో ఉన్నాయి. గూగుల్ సెప్టెంబరు 2002లో బీటా వెర్షన్‌ను మరియు జనవరి 2006లో అధికారిక యాప్‌ను విడుదల చేసింది.
  • Output: గూగుల్ వార్తలు యాప్ జనవరి 2006లో విడుదలైంది.

3. Tamil Instruction: Google செய்திகள் பயன்பாடு எப்போது வெளியிடப்பட்டது?

  • Tamil Input:கூகுள் செய்திகள் என்பது கூகுள் உருவாக்கிய செய்தி சேகரிப்பு சேவையாகும். ஆயிரக்கணக்கான வெளியீட்டாளர்கள் மற்றும் பத்திரிகைகளிலிருந்து ஒழுங்கமைக்கப்பட்ட கட்டுரைகளுக்கான தொடர்ச்சியான இணைப்புகளை இது வழங்குகிறது. Android, iOS மற்றும் இணையத்தில் Google செய்திகள் ஒரு பயன்பாடாகக் கிடைக்கிறது. கூகுள் செப்டம்பர் 2002 இல் பீட்டா பதிப்பையும், ஜனவரி 2006 இல் அதிகாரப்பூர்வ பயன்பாட்டையும் வெளியிட்டது.
  • Output: ஜனவரி 2006 இல் Google அதிகாரப்பூர்வ Google News பயன்பாட்டை வெளியிட்டது.

4. Malayalam Instruction: എപ്പോഴാണ് Google വാർത്ത ആപ്പ് റിലീസ് ചെയ്തത്?

  • Malayalam Input: എപ്പോഴാണ് Google വാർത്ത ആപ്പ് റിലീസ് ചെയ്തത്?
  • Output: 2006 ജനുവരിയിൽ Google News ഔദ്യോഗിക ആപ്പ് പുറത്തിറങ്ങി.

5. Kannada Instruction: Google News ಅಪ್ಲಿಕೇಶನ್ ಅನ್ನು ಯಾವಾಗ ಬಿಡುಗಡೆ ಮಾಡಲಾಯಿತು?

  • Kannada Input: Google News Google ನಿಂದ ಅಭಿವೃದ್ಧಿಪಡಿಸಲಾದ ಸುದ್ದಿ ಸಂಗ್ರಾಹಕ ಸೇವೆಯಾಗಿದೆ. ಇದು ಸಾವಿರಾರು ಪ್ರಕಾಶಕರು ಮತ್ತು ನಿಯತಕಾಲಿಕೆಗಳಿಂದ ಆಯೋಜಿಸಲಾದ ಲೇಖನಗಳಿಗೆ ನಿರಂತರವಾದ ಲಿಂಕ್‌ಗಳನ್ನು ಪ್ರಸ್ತುತಪಡಿಸುತ್ತದೆ. Android, iOS ಮತ್ತು ವೆಬ್‌ನಲ್ಲಿ Google News ಅಪ್ಲಿಕೇಶನ್‌ನಂತೆ ಲಭ್ಯವಿದೆ. ಗೂಗಲ್ ಸೆಪ್ಟೆಂಬರ್ 2002 ರಲ್ಲಿ ಬೀಟಾ ಆವೃತ್ತಿಯನ್ನು ಮತ್ತು ಜನವರಿ 2006 ರಲ್ಲಿ ಅಧಿಕೃತ ಅಪ್ಲಿಕೇಶನ್ ಅನ್ನು ಬಿಡುಗಡೆ ಮಾಡಿತು.
  • Output: ಜನವರಿ 2006

6. Gujarati Instruction: Google News ઍપ ક્યારે રિલીઝ થઈ?

  • Gujarati Input: ગૂગલ ન્યૂઝ એ ગૂગલ દ્વારા વિકસિત ન્યૂઝ એગ્રીગેટર સેવા છે. તે હજારો પ્રકાશકો અને સામયિકોમાંથી આયોજિત લેખોની લિંક્સનો સતત પ્રવાહ રજૂ કરે છે. Google News એ Android, iOS અને વેબ પર એપ્લિકેશન તરીકે ઉપલબ્ધ છે. ગૂગલે સપ્ટેમ્બર 2002માં બીટા વર્ઝન અને જાન્યુઆરી 2006માં સત્તાવાર એપ બહાર પાડી.
  • Output: Google News એપ સપ્ટેમ્બર 2002માં બીટા વર્ઝન અને જાન્યુઆરી 2006માં સત્તાવાર એપ બનાવવામાં આવી હતી

7. Bengali Instruction: Google News অ্যাপ কবে প্রকাশিত হয়েছিল?

  • Bengali Input: গুগল নিউজ হল একটি নিউজ এগ্রিগেটর সার্ভিস যা গুগল ডেভেলপ করেছে। এটি হাজার হাজার প্রকাশক এবং ম্যাগাজিন থেকে সংগঠিত নিবন্ধগুলির লিঙ্কগুলির একটি অবিচ্ছিন্ন প্রবাহ উপস্থাপন করে। Google News Android, iOS এবং ওয়েবে একটি অ্যাপ হিসেবে উপলব্ধ। গুগল 2002 সালের সেপ্টেম্বরে একটি বিটা সংস্করণ এবং জানুয়ারী 2006 সালে অফিসিয়াল অ্যাপ প্রকাশ করে।
  • Output: Google News অ্যাপ প্রকাশিত হয়েছিল জানুয়ারী 2006 সালে।

8. Punjabi Instruction: Google News ਐਪ ਕਦੋਂ ਜਾਰੀ ਕੀਤੀ ਗਈ ਸੀ?

  • Punjabi Input: ਗੂਗਲ ਨਿਊਜ਼ ਗੂਗਲ ਦੁਆਰਾ ਵਿਕਸਿਤ ਕੀਤੀ ਗਈ ਇੱਕ ਨਿਊਜ਼ ਐਗਰੀਗੇਟਰ ਸੇਵਾ ਹੈ। ਇਹ ਹਜ਼ਾਰਾਂ ਪ੍ਰਕਾਸ਼ਕਾਂ ਅਤੇ ਰਸਾਲਿਆਂ ਤੋਂ ਸੰਗਠਿਤ ਲੇਖਾਂ ਦੇ ਲਿੰਕਾਂ ਦਾ ਨਿਰੰਤਰ ਪ੍ਰਵਾਹ ਪੇਸ਼ ਕਰਦਾ ਹੈ। Google News Android, iOS ਅਤੇ ਵੈੱਬ ‘ਤੇ ਇੱਕ ਐਪ ਵਜੋਂ ਉਪਲਬਧ ਹੈ। ਗੂਗਲ ਨੇ ਸਤੰਬਰ 2002 ਵਿੱਚ ਇੱਕ ਬੀਟਾ ਸੰਸਕਰਣ ਅਤੇ ਜਨਵਰੀ 2006 ਵਿੱਚ ਅਧਿਕਾਰਤ ਐਪ ਜਾਰੀ ਕੀਤਾ।
  • Output: Google News ਐਪ ਜਨਵਰੀ 2006 ਵਿੱਚ ਜਾਰੀ ਕੀਤੀ ਗਈ ਸੀ।

9. Odia Instruction: ଗୁଗୁଲ୍ ନ୍ୟୁଜ୍ ଆପ୍ କେବେ ରିଲିଜ୍ ହେଲା?

  • Odia Input: ଗୁଗୁଲ୍ ଦ୍ୱାରା ବିକଶିତ ଏକ ନ୍ୟୁଜ୍ ଏଗ୍ରିଗେଟର୍ ସେବା | ଏହା ହଜାର ହଜାର ପ୍ରକାଶକ ଏବଂ ପତ୍ରିକାଗୁଡ଼ିକରୁ ସଂଗଠିତ ପ୍ରବନ୍ଧଗୁଡିକ ସହିତ ଲିଙ୍କଗୁଡିକର କ୍ରମାଗତ ପ୍ରବାହକୁ ଉପସ୍ଥାପନ କରେ | ଗୁଗୁଲ୍ ନ୍ୟୁଜ୍ ଆଣ୍ଡ୍ରଏଡ୍, ଆଇଓଏସ୍ ଏବଂ ୱେବରେ ଏକ ଆପ୍ ଭାବରେ ଉପଲବ୍ଧ | ସେପ୍ଟେମ୍ବର 2002 ରେ ଗୁଗୁଲ୍ ଏକ ବିଟା ସଂସ୍କରଣ ଏବଂ ଜାନୁଆରୀ 2006 ରେ ଅଫିସିଆଲ୍ ଆପ୍ ପ୍ରକାଶ କରିଥିଲା ​​|
  • Output: ସେପ୍ଟେମ୍ବର 2002 ରେ ଗୁଗୁଲ୍ ଏକ ବିଟା ସଂସ୍କରଣ ଏବଂ ଜାନୁଆରୀ 2006 ରେ ଅଫିସିଆଲ୍ ଆପ୍ ପ୍ରକାଶ କରିଥିଲା।

10. English Instruction: When was the Google News app released?

  • English Input: Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web. Google released a beta version in September 2002 and the official app in January 2006.
  • Output: The Google News app was released in January 2006.

Translation:

The task here is to translate the text into a specific language.

Instruction: Translate the following sentence to X language.

Input: India is a great country.

  1. Hindi Output: भारत एक महान देश है।

2. Telugu Output: భారతదేశం అద్భుత దేశం.

3. Tamil Output: இந்தியா ஒரு சிறந்த நாடு.

4. Malayalam Output: ഇന്ത്യ ഒരു വലിയ രാജ്യമാണ്

5. Kannada Output: ಭಾರತ ಒಂದು ಶ್ರೇಷ್ಠ ದೇಶ.

6. Gujarati Output: ભારત એક મોટો દેશ છે.

7. Bengali Output: ভারত একটি বড় জাতি।

8. Punjabi Output: ਇੰਡੀਆ ਇੱਕ ਵੱਡਾ ਦੇਸ਼ ਹੈ।

9. Odia Output: ଭାରତ ଏକ ବିଶାଳ ରାଷ୍ଟ୍ର।

As you can see, the responses for the above tasks are good. We used the fine-tuned 7B model for all of them; we found the 2B model a bit inconsistent compared to the 7B model.

This work would not have been possible without the huge community effort across different languages. A big shout-out to everyone for their work over the past few months, which showcases the true power of OSS. Following are the contributors for some of these languages:

  1. Hindi: sarvam.ai
  2. Telugu: Telugu LLM Labs
  3. Tamil: Abhinanda
  4. Kannada: Adarsh and team at Tensoic
  5. Malayalam: Vishnu J Prasad
  6. Odia: OdiaGenAI
  7. Gujarati: Adarsh and team at Tensoic
  8. Punjabi: HydraIndicLM
  9. Bengali: HydraIndicLM

Now that these models are released, our next steps are to create romanized datasets and to build evaluation datasets, so that we can benchmark the models and improve on them.

This work was done in collaboration with Ramsri Goutham Golla as part of the independent Telugu LLM Labs initiative.

References:

Code Repository
