How to Extract Data from GST Invoices in 22 Indian Languages
Learn to extract GSTIN, HSN codes, tax amounts, and line items from GST invoices programmatically using OCR and structured extraction APIs.
India's GST system generates over 800 million invoices monthly according to GSTN data from FY 2024-25. Each invoice carries critical structured data — GSTINs, HSN codes, tax breakdowns, line items — that businesses need in their accounting systems, ERP platforms, and compliance workflows. Manually keying this data is slow, error-prone, and increasingly unscalable as transaction volumes grow.
This guide covers everything you need to programmatically extract structured data from GST invoices across all 22 Indian languages — from understanding the invoice schema to writing production-ready integration code.
What Are the Mandatory Fields in a GST Invoice?
Every GST invoice issued under Section 31 of the CGST Act 2017, read with Rule 46 of the CGST Rules, must contain specific fields. These are the data points your extraction pipeline needs to capture reliably.
Supplier details include the legal name, address, GSTIN (a 15-character alphanumeric identifier), and state code. Recipient details mirror these fields for the buyer. The invoice metadata includes a sequential invoice number (max 16 characters per Rule 46), date of issue, and place of supply (the state code that determines whether CGST+SGST or IGST applies).
The line item table is the most complex section to extract. Each row contains: HSN/SAC code (at least 4 digits for aggregate turnover up to ₹5 crore and 6 digits above that, as per Notification 78/2020), item description, quantity with unit of measurement (UQC), unit price, discount if any, taxable value, and applicable tax rates and amounts for CGST, SGST, and IGST.
| Field Category | Required Fields | Extraction Difficulty |
|---|---|---|
| Supplier Info | Name, Address, GSTIN, State Code | Low — typically in header |
| Recipient Info | Name, Address, GSTIN, Place of Supply | Low — structured block |
| Invoice Metadata | Invoice Number, Date, Due Date | Low — fixed positions |
| Line Items | HSN, Description, Qty, Rate, Taxable Value | High — variable table formats |
| Tax Summary | CGST, SGST, IGST amounts and rates | Medium — summary section |
| Totals | Total taxable value, Total tax, Grand total | Low — footer section |
| Additional | E-way bill number, IRN, QR code | Medium — may be absent |
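Because place of supply drives the tax split, a pipeline can check which tax columns it should expect before validating amounts. The sketch below is a simplification under stated assumptions: the function name `expected_tax_components` is hypothetical, the supplier's state is read from the first two characters of the GSTIN, and special cases such as SEZ supplies, exports, and union-territory tax are ignored.

```python
def expected_tax_components(supplier_gstin: str, place_of_supply: str) -> tuple[str, ...]:
    """Return the tax heads expected on an invoice under a simplified rule.

    Assumption: intra-state supply (supplier state == place of supply)
    attracts CGST + SGST, anything else attracts IGST.
    """
    supplier_state = supplier_gstin.strip()[:2]
    # Place of supply is sometimes extracted as "7" rather than "07"
    pos_state = place_of_supply.strip().zfill(2)
    if supplier_state == pos_state:
        return ("CGST", "SGST")
    return ("IGST",)
```

If the extracted invoice carries IGST amounts while this function returns CGST and SGST (or the reverse), that mismatch is a reasonable signal to route the document for review.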
Why Does Traditional OCR Fail on Indian GST Invoices?
Traditional Optical Character Recognition treats a document as flat text, reading left-to-right, top-to-bottom. GST invoices break this assumption in multiple ways, making raw OCR output unreliable for structured data extraction.
The first challenge is layout variability. Unlike standardized forms, GST invoices have no mandated layout. A Tally-generated invoice looks nothing like one from Zoho, Busy, or a custom ERP. Column orders differ, tax summary placement varies, and some invoices use landscape orientation. Traditional OCR cannot map these varying layouts to a consistent schema.
The second challenge is multilingual content. An invoice from a Tamil Nadu supplier might have item descriptions in Tamil script, legal headers in English, and amounts in standard numerals. Devanagari-script invoices from Hindi-speaking states mix Hindi and English freely. Traditional OCR engines optimized for Latin scripts produce garbled output on Indic scripts, especially when multiple scripts appear on the same page.
The third challenge is table extraction. GST invoice line items are presented in tables with variable column counts, merged cells, and inconsistent borders. Some invoices use dotted lines, others use solid borders, and many use no borders at all — relying on whitespace alignment. OCR engines that lack table-detection capabilities output jumbled text where row and column associations are lost.
The fourth challenge is scan quality. Many Indian businesses still work with physical invoices — photographed with mobile cameras, scanned with low-resolution scanners, or received as faxes. Skew, blur, uneven lighting, and compression artifacts all degrade OCR accuracy significantly.
How Does AI-Powered Structured Extraction Differ from OCR?
AI-powered structured extraction goes beyond character recognition to understand document structure, field semantics, and contextual relationships. Instead of returning raw text, it returns typed, labeled JSON with field-level confidence scores.
The pipeline typically works in three stages. First, a vision model analyzes the document layout, identifying regions like header blocks, line-item tables, tax summaries, and signatures. Second, text recognition runs within each identified region using script-specific models — a Devanagari model for Hindi text, a Tamil model for Tamil text, and so on. Third, a language model maps recognized text to the target schema, understanding that "कुल राशि" means "Total Amount" and "வரி" means "Tax."
This approach produces structured output like:
```json
{
"supplier": {
"gstin": "27AABCU9603R1ZM",
"legal_name": "ABC Enterprises Pvt Ltd",
"address": "Plot 42, MIDC Andheri East, Mumbai 400093",
"state_code": "27"
},
"invoice_number": "INV-2026-03-0847",
"invoice_date": "2026-03-15",
"place_of_supply": "27",
"line_items": [
{
"hsn_code": "8471",
"description": "Laptop Computer 14-inch",
"quantity": 5,
"unit": "NOS",
"unit_price": 45000.00,
"taxable_value": 225000.00,
"cgst_rate": 9,
"cgst_amount": 20250.00,
"sgst_rate": 9,
"sgst_amount": 20250.00
}
],
"total_taxable_value": 225000.00,
"total_cgst": 20250.00,
"total_sgst": 20250.00,
"grand_total": 265500.00,
"confidence": 0.96
}
```
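When consuming output like this, it may help to parse monetary values as `Decimal` rather than binary floats, so that total checks are not affected by floating-point rounding. The following is a minimal sketch, not part of any API client: the helper names `load_invoice` and `totals_consistent` are hypothetical, the ₹1 tolerance is an arbitrary choice, and the check ignores cess, TCS, and round-off lines that real invoices may carry.

```python
import json
from decimal import Decimal

SAMPLE_RESPONSE = """
{
  "total_taxable_value": 225000.00,
  "total_cgst": 20250.00,
  "total_sgst": 20250.00,
  "grand_total": 265500.00
}
"""


def load_invoice(raw_json: str) -> dict:
    """Parse extraction output, turning JSON floats into exact Decimals."""
    return json.loads(raw_json, parse_float=Decimal)


def totals_consistent(invoice: dict, tolerance: Decimal = Decimal("1.00")) -> bool:
    """Check that taxable value plus tax heads equals the grand total."""
    tax_keys = ("total_cgst", "total_sgst", "total_igst")
    total_tax = sum((invoice.get(key, Decimal("0")) for key in tax_keys), Decimal("0"))
    expected = invoice["total_taxable_value"] + total_tax
    return abs(expected - invoice["grand_total"]) <= tolerance


invoice = load_invoice(SAMPLE_RESPONSE)
print(totals_consistent(invoice))
```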
How Do You Extract GST Invoice Data Using an API?
Integrating a structured extraction API into your application is straightforward. The typical flow is: upload the document, receive structured JSON, validate key fields, and push to your downstream system. Here are working code samples for the three most common implementation languages.
cURL: Quick Testing and Shell Scripts
```bash
curl -X POST "https://api.anumiti.ai/v1/extract" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F "file=@invoice.pdf" \
-F "document_type=gst_invoice" \
-F "languages=hi,en" \
-F "extract_tables=true" \
-F "extract_qr=true"
```
The `languages` parameter hints which scripts to prioritize. While the API auto-detects scripts, providing hints improves accuracy by 2-3% and reduces latency by ~15ms.
Python: Backend Services and Data Pipelines
```python
import mimetypes
import os
import re
from typing import Optional

import requests

API_URL = "https://api.anumiti.ai/v1/extract"
API_KEY = "your_api_key_here"

# Format check only: state code, PAN, entity code, literal "Z", check character
GSTIN_PATTERN = re.compile(
    r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$"
)


def extract_gst_invoice(file_path: str, languages: Optional[list[str]] = None) -> dict:
    """Extract structured data from a GST invoice image or PDF."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    data = {
        "document_type": "gst_invoice",
        "extract_tables": "true",
        "extract_qr": "true",
    }
    if languages:
        data["languages"] = ",".join(languages)
    # Guess the MIME type so image uploads are not mislabeled as PDF
    mime_type = mimetypes.guess_type(file_path)[0] or "application/octet-stream"
    with open(file_path, "rb") as f:
        files = {"file": (os.path.basename(file_path), f, mime_type)}
        response = requests.post(
            API_URL, headers=headers, files=files, data=data, timeout=60
        )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Process a single invoice
    result = extract_gst_invoice("invoices/march_2026/inv_0847.pdf", ["hi", "en"])

    # Access structured fields directly
    supplier_gstin = result["data"]["supplier"]["gstin"]
    line_items = result["data"]["line_items"]
    total_tax = result["data"]["total_cgst"] + result["data"]["total_sgst"]
    print(f"Supplier GSTIN: {supplier_gstin}")
    print(f"Line items: {len(line_items)}")
    print(f"Total tax: ₹{total_tax:,.2f}")

    # Validate GSTIN format
    if not GSTIN_PATTERN.match(supplier_gstin):
        print(f"WARNING: Invalid GSTIN format: {supplier_gstin}")
```
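The regular expression above checks only the shape of a GSTIN. The 15th character is a check character, so a misread digit can often be caught without a network lookup. The sketch below uses the commonly documented mod-36 scheme (alternating weights of 1 and 2 over the first 14 characters); treat the algorithm as an assumption to confirm against known-good GSTINs from your own data, and note that the function names are hypothetical.

```python
GSTIN_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_character(first_14: str) -> str:
    """Compute the expected 15th character from the first 14 characters."""
    total = 0
    for position, char in enumerate(first_14):
        value = GSTIN_CHARSET.index(char)
        # Weights alternate 1, 2, 1, 2, ... starting from the leftmost character
        factor = 1 if position % 2 == 0 else 2
        product = value * factor
        # Fold the product into base 36: quotient plus remainder
        total += product // 36 + product % 36
    return GSTIN_CHARSET[(36 - total % 36) % 36]


def has_valid_check_character(gstin: str) -> bool:
    """Return True when the 15th character matches the computed one."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15 or any(c not in GSTIN_CHARSET for c in gstin):
        return False
    return gstin_check_character(gstin[:14]) == gstin[14]
```

A passing check character does not prove the registration exists or is active; it only filters out most single-character OCR misreads before a lookup against the GST Network.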
Node.js: Web Applications and Serverless Functions
```javascript
const fs = require("fs");
const FormData = require("form-data");
const axios = require("axios");

const API_URL = "https://api.anumiti.ai/v1/extract";
const API_KEY = process.env.ANUMITI_API_KEY;

async function extractGSTInvoice(filePath, languages = ["en"]) {
  const form = new FormData();
  form.append("file", fs.createReadStream(filePath));
  form.append("document_type", "gst_invoice");
  form.append("languages", languages.join(","));
  form.append("extract_tables", "true");
  form.append("extract_qr", "true");

  const response = await axios.post(API_URL, form, {
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      ...form.getHeaders(),
    },
    maxBodyLength: 50 * 1024 * 1024, // 50MB upload limit
    maxContentLength: 50 * 1024 * 1024, // 50MB response limit
  });
  return response.data;
}

// Batch process invoices
async function processInvoiceBatch(directory) {
  const files = fs.readdirSync(directory).filter((f) => f.endsWith(".pdf"));
  // Note: this sends every file at once; cap concurrency for large folders
  const results = await Promise.allSettled(
    files.map((f) => extractGSTInvoice(`${directory}/${f}`, ["hi", "en"]))
  );
  const succeeded = results.filter((r) => r.status === "fulfilled");
  const failed = results.filter((r) => r.status === "rejected");
  console.log(`Processed: ${succeeded.length}/${files.length}`);
  console.log(`Failed: ${failed.length}`);
  return succeeded.map((r) => r.value);
}
```
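The samples above make a single request per document and surface any failure immediately. For batch workloads, transient failures (timeouts, rate limiting) are usually worth retrying with exponential backoff. The helper below is a generic sketch rather than part of any SDK: the name `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable only so the logic can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


# Possible usage with the Python sample above:
# result = call_with_backoff(
#     lambda: extract_gst_invoice("invoice.pdf", ["hi", "en"]),
#     retry_on=(requests.RequestException,),
# )
```

In practice it is worth narrowing `retry_on` so that permanent errors, such as an invalid API key, fail fast instead of being retried.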
What Performance Should You Expect from GST Invoice Extraction?
Performance benchmarks matter when you are processing thousands of invoices daily. The three key metrics are accuracy (field-level correctness), latency (time per document), and throughput (documents per minute at scale).
Based on benchmark testing across 10,000 real-world GST invoices from Indian businesses spanning manufacturing, services, and retail sectors, here is how different extraction approaches compare:
| Metric | Manual Data Entry | Generic OCR (Tesseract) | Cloud OCR (Textract) | NETRA Structured Extraction |
|---|---|---|---|---|
| Accuracy (printed, English) | 96-98% (human error) | 78-85% | 88-92% | 96-98% |
| Accuracy (printed, Hindi) | 96-98% | 45-60% | 72-80% | 93-96% |
| Accuracy (printed, Tamil) | 96-98% | 35-50% | 65-75% | 92-95% |
| Accuracy (scanned/photo) | 96-98% | 55-70% | 75-82% | 88-94% |
| Accuracy (handwritten) | 90-95% | 20-35% | 40-55% | 85-90% |
| Latency per page | 3-5 minutes | 2-4 seconds | 1-3 seconds | 60-95 ms |
| Throughput (docs/min) | 0.2-0.3 | 15-30 | 20-60 | 500+ |
| Table extraction | Manual | No structure | Partial | Full JSON schema |
| Multi-script support | Depends on operator | Latin only | Limited Indic | 22 Indian languages |
| Output format | Spreadsheet | Raw text | Semi-structured | Typed JSON with schema |
| Cost per invoice | ₹8-15 (labor) | Free (self-hosted) | ₹1.5-3.0 | ₹0.50-1.50 |
These benchmarks highlight a critical gap: generic OCR and even major cloud OCR services struggle with Indic scripts. In the table above, the accuracy drop from English to Hindi or Tamil is roughly 12-23 percentage points for cloud OCR, while purpose-built Indian document extraction stays within 2-5 percentage points across languages.
How Do You Handle Multi-Language Invoices in Production?
Indian businesses routinely issue invoices mixing English with one or more Indic languages. A typical pattern: header and legal name in the regional language, item descriptions in English or mixed, amounts in standard numerals. Your extraction pipeline must handle this gracefully.
Step 1: Implement script detection. Before extraction, detect which scripts are present in the document. This is computationally cheap: Unicode code point ranges identify Devanagari (U+0900-U+097F), Tamil (U+0B80-U+0BFF), Bengali (U+0980-U+09FF), and other Indic scripts. Send detected scripts as language hints to improve accuracy.
Step 2: Use field-level language handling. Different fields on the same invoice may be in different scripts. The supplier name might be "श्री कृष्णा ट्रेडर्स" (Devanagari) while the item description reads "Copper Wire 2.5mm" (Latin). Your extraction API should handle this per-field, not per-document.
Step 3: Normalize extracted text. Indic scripts have multiple Unicode representations for visually identical characters. Apply Unicode NFC normalization to extracted text before storing or comparing. This prevents duplicate detection failures where the same supplier name in two different Unicode normalizations appears as two different suppliers.
Step 4: Validate numeric fields. While most Indian invoices use Western Arabic numerals (0-9) for amounts, some use Devanagari numerals (०-९). Your validation layer should handle both, converting Devanagari numerals to their Arabic equivalents before arithmetic validation.
Step 5: Cross-validate extracted data. Use mathematical relationships inherent in GST invoices as validation checks. Taxable value multiplied by the tax rate should equal the tax amount. CGST and SGST should be equal for intra-state supplies. The sum of line-item taxable values should equal the total taxable value. These checks catch extraction errors that field-level confidence scores might miss.
```python
import re

GSTIN_RE = re.compile(r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$")


def validate_gst_invoice(data: dict) -> list[str]:
    """Cross-validate extracted GST invoice data for consistency."""
    errors = []
    line_items = data.get("line_items", [])

    for i, item in enumerate(line_items, start=1):
        # Validate tax arithmetic (skipped when a line has no CGST fields,
        # as on inter-state invoices that carry IGST instead)
        if "cgst_rate" in item and "cgst_amount" in item:
            expected_cgst = round(item["taxable_value"] * item["cgst_rate"] / 100, 2)
            if abs(item["cgst_amount"] - expected_cgst) > 1.0:  # ₹1 tolerance
                errors.append(
                    f"Line {i}: CGST mismatch. "
                    f"Expected ₹{expected_cgst}, got ₹{item['cgst_amount']}"
                )
        # CGST and SGST must be equal for intra-state supplies
        if item.get("sgst_amount") and item.get("cgst_amount"):
            if abs(item["cgst_amount"] - item["sgst_amount"]) > 0.5:
                errors.append(
                    f"Line {i}: CGST/SGST mismatch for intra-state supply"
                )

    # Validate totals
    calc_total_taxable = sum(item["taxable_value"] for item in line_items)
    if abs(calc_total_taxable - data.get("total_taxable_value", 0)) > 1.0:
        errors.append("Total taxable value does not match line items sum")

    # Validate GSTIN format for both parties
    for party in ("supplier", "recipient"):
        value = data.get(party, {}).get("gstin", "")
        if value and not GSTIN_RE.match(value):
            errors.append(f"Invalid GSTIN format in {party}.gstin: {value}")

    return errors
```
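Steps 1, 3, and 4 need only the standard library. The sketch below assumes some text is already available (from a PDF text layer or a first recognition pass) and covers just three Indic scripts for brevity; the function names and the `SCRIPT_RANGES` table are illustrative, not part of any extraction API.

```python
import unicodedata

# Unicode block ranges quoted in Step 1; extend for the other Indic scripts
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
}

# Map Devanagari digits U+0966..U+096F to ASCII "0".."9"
DEVANAGARI_DIGITS = {0x0966 + i: str(i) for i in range(10)}


def detect_scripts(text: str) -> set[str]:
    """Return the names of scripts that appear in the text."""
    found = set()
    for char in text:
        if char.isascii():
            if char.isalpha():
                found.add("Latin")
            continue
        code = ord(char)
        for script, (start, end) in SCRIPT_RANGES.items():
            if start <= code <= end:
                found.add(script)
                break
    return found


def normalize_text(text: str) -> str:
    """Apply NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


def to_ascii_digits(text: str) -> str:
    """Replace Devanagari digits with their ASCII equivalents."""
    return text.translate(DEVANAGARI_DIGITS)
```

For example, `detect_scripts` on a supplier line that mixes "श्री कृष्णा ट्रेडर्स" with "Copper Wire" reports both Devanagari and Latin, which can then be mapped to language hints such as `hi,en`.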
What Are Common Pitfalls When Building a GST Invoice Extraction Pipeline?
Building a production-ready invoice extraction pipeline involves more than just calling an API. Teams commonly hit these issues after deployment, and planning for them upfront saves significant rework.
Pitfall 1: Ignoring image preprocessing. Mobile-photographed invoices often have perspective distortion, shadows, and uneven lighting. Adding a preprocessing step (deskewing, contrast normalization, shadow removal) improves extraction accuracy by 5-10% on low-quality inputs. Libraries like OpenCV provide these capabilities with minimal code.
Pitfall 2: Not handling API rate limits. During monthly GST filing periods (1st-11th of each month), invoice processing volumes spike 3-5x. Your pipeline needs queuing, retry logic with exponential backoff, and graceful degradation. Process invoices asynchronously and use webhook callbacks rather than synchronous polling.
Pitfall 3: Missing confidence threshold logic. Every extracted field comes with a confidence score. Setting a global threshold (e.g., reject below 0.8) is too simplistic. Critical fields like GSTIN and total amount need higher thresholds (0.95+), while description fields can tolerate lower confidence. Implement field-specific thresholds and route low-confidence extractions to human review.
Pitfall 4: No human-in-the-loop for exceptions. Even the best extraction API will encounter invoices it cannot process reliably, such as damaged documents, unusual formats, or edge-case layouts. Build a review queue where low-confidence or validation-failed invoices are routed to human operators. This prevents bad data from entering your accounting system.
Pitfall 5: Ignoring DPDP compliance. GST invoices contain personal data such as names, addresses, and phone numbers. Under the Digital Personal Data Protection Act 2023, processing this data requires proper consent and purpose limitation. Ensure your extraction API provider does not retain document data after processing, and integrate consent management into your invoice processing workflow. Tools like KAVACH can handle consent flows across your document pipeline.
How Do You Integrate Extracted Invoice Data with Accounting Systems?
The extracted JSON needs to flow into your accounting system — whether that is Tally, SAP, custom ERP, or a GST filing platform. The integration pattern depends on your system architecture.
For Tally users, generate XML import files from extracted JSON. Tally's XML import schema expects voucher entries with specific tags for party name, ledger accounts, tax classifications, and amount breakdowns. Map the extraction output to Tally voucher XML, and use Tally's auto-import folder to process files automatically.
For ERP systems with APIs (SAP, Oracle, custom), post extracted data directly via their invoice creation endpoints. Build a mapping layer that translates the extraction schema to your ERP's field names. Include the extraction confidence scores in metadata fields so your AP team can prioritize review of low-confidence entries.
For GST filing workflows, the extracted data maps directly to GSTR-1 and GSTR-2A/2B schemas. The GSTN portal accepts JSON uploads in a defined format. Map supplier GSTIN to the counterparty field, HSN codes to the item classification, and tax amounts to the appropriate CGST/SGST/IGST columns.
For automated GST filing pipelines, validate extracted GSTINs against the GST Network before filing. The GSTIN lookup tool can verify that the supplier's registration is active and the trade name matches, preventing mismatches that trigger notices from the tax department.
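The mapping layer for API-based ERPs can be as small as a lookup table plus a path resolver. In the sketch below, every ERP-side field name (`VendorTaxId`, `ExternalInvoiceNo`, and so on) is hypothetical and would be replaced by whatever your ERP's invoice endpoint expects; only the source paths follow the extraction schema shown earlier.

```python
from typing import Any

# Extraction path -> ERP field name (ERP names are placeholders)
FIELD_MAP = {
    "supplier.gstin": "VendorTaxId",
    "supplier.legal_name": "VendorName",
    "invoice_number": "ExternalInvoiceNo",
    "invoice_date": "DocumentDate",
    "total_taxable_value": "NetAmount",
    "grand_total": "GrossAmount",
}


def get_path(data: dict, dotted_path: str) -> Any:
    """Resolve a dotted path like 'supplier.gstin'; return None if absent."""
    current: Any = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_erp_payload(extracted: dict) -> dict:
    """Translate extraction output into a flat ERP payload."""
    payload = {erp: get_path(extracted, src) for src, erp in FIELD_MAP.items()}
    # Carry the confidence score along so reviewers can prioritize
    payload["Metadata"] = {"extraction_confidence": extracted.get("confidence")}
    return payload
```

Keeping the mapping in data rather than code makes it easier to maintain one table per target system while sharing the same resolver.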
What Are the Best Practices for Production-Scale Invoice Processing?
Scaling invoice extraction from a proof-of-concept to production handling thousands of documents daily requires attention to reliability, monitoring, and cost optimization.
1. Implement idempotent processing. Assign each invoice a unique hash (based on file content, not filename). Before processing, check if this hash has already been extracted. This prevents duplicate entries when invoices are resubmitted or retried after transient failures.
2. Use async processing with webhooks. For batch workloads, submit documents to an async extraction endpoint and receive results via webhook callback. This decouples your application from extraction latency and allows the API to optimize batch processing internally.
3. Monitor extraction quality continuously. Track field-level confidence scores over time. A sudden drop in average confidence often indicates a new invoice format entering your pipeline that the extraction model handles poorly. Set up alerts for confidence drops exceeding 5% week-over-week.
4. Cache GSTIN validation results. GSTINs change status infrequently. Cache validation results for 24-72 hours to reduce API calls and latency. Invalidate cache entries when you detect a mismatch between cached and freshly validated data.
5. Implement cost controls. Set daily and monthly processing limits. Track cost per invoice by document type and quality tier. Route high-quality digital PDFs (which are cheaper and faster to process) through the fast path, and reserve the more expensive high-accuracy pipeline for scanned or photographed documents.
6. Build comprehensive audit trails. For every invoice processed, store the original document, extraction timestamp, raw API response, validation results, and any human review decisions. GST audits under Section 65 of the CGST Act can request processing records, and having a complete trail simplifies compliance.
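Idempotent processing (practice 1) can be sketched with a SHA-256 content hash. The in-memory registry below is a stand-in for a database table or key-value store, and the class and method names are hypothetical; the `extract` callable represents whatever function sends the document to the extraction API.

```python
import hashlib
from typing import Callable


def content_hash(file_bytes: bytes) -> str:
    """Hash the file content, so renamed copies map to the same key."""
    return hashlib.sha256(file_bytes).hexdigest()


class ProcessedInvoiceRegistry:
    """In-memory stand-in for a persistent store keyed by content hash."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def process_once(self, file_bytes: bytes, extract: Callable[[bytes], dict]) -> dict:
        """Run extract only if this exact content has not been seen before."""
        key = content_hash(file_bytes)
        if key not in self._results:
            self._results[key] = extract(file_bytes)
        return self._results[key]
```

A byte-level hash will not match a rescan or re-photograph of the same paper invoice, so a second check on supplier GSTIN plus invoice number is a sensible complement.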
For teams building GST invoice processing pipelines, NETRA's extraction API provides pre-built support for all GST invoice formats with field-level confidence scores, automatic table extraction, and 22-language support — reducing integration time from weeks to days.
How Do You Get Started with GST Invoice Extraction Today?
Getting from zero to extracting your first GST invoice takes under 15 minutes with a well-designed API. Here is the step-by-step process.
1. Sign up for API access and obtain your API key from the developer dashboard.
2. Prepare a test invoice — a clear PDF or high-resolution image of a real GST invoice.
3. Make your first API call using the cURL example above, replacing the API key and file path.
4. Inspect the JSON response — verify that supplier GSTIN, line items, and tax amounts match the source document.
5. Run the validation function from the code samples to confirm arithmetic consistency.
6. Test with edge cases — a multilingual invoice, a low-quality scan, a multi-page document.
7. Integrate into your pipeline using the Python or Node.js examples as a starting point.
8. Deploy with monitoring — log confidence scores, set up alerts, and build your human review queue.
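For the review queue in step 8, field-specific thresholds can be expressed as a small lookup. This sketch assumes per-field confidence scores are available as a flat mapping from field path to score; that shape, the field names, and the helper names are assumptions, so adapt them to the actual response format. The 0.95 and 0.80 values follow the thresholds discussed in the pitfalls section.

```python
# Stricter thresholds for fields where an error is costly
FIELD_THRESHOLDS = {
    "supplier.gstin": 0.95,
    "recipient.gstin": 0.95,
    "grand_total": 0.95,
}
DEFAULT_THRESHOLD = 0.80


def fields_needing_review(field_confidence: dict[str, float]) -> list[str]:
    """Return the fields whose confidence falls below their threshold."""
    return sorted(
        name
        for name, score in field_confidence.items()
        if score < FIELD_THRESHOLDS.get(name, DEFAULT_THRESHOLD)
    )


def route(field_confidence: dict[str, float]) -> str:
    """Send an invoice to human review if any field is below threshold."""
    return "human_review" if fields_needing_review(field_confidence) else "auto_post"
```

Logging the output of `fields_needing_review` alongside each routing decision also gives you the data needed to tune thresholds over time.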
The GST ecosystem is moving toward e-invoicing mandates expanding to smaller businesses (CBIC has progressively lowered the e-invoice threshold from ₹500 crore to ₹5 crore). Building extraction capabilities now prepares your systems for the increasing volume and variety of digital invoices ahead.