docs: move docling and textract to Text Processing

docling (document-to-structured-data conversion) and textract (text extraction from Office/PDF files) are document parsing tools, not data analysis or web scraping tools, so Text Processing > General is a more accurate placement.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-05-09 22:53:49 +08:00 · 2026-03-18 23:50:25 +08:00
parent a7c5d84ce9
commit 79c0be0a5c
1 changed file with 2 additions and 2 deletions
@@ -317,7 +317,6 @@ _Libraries for data analysis._
 - [aws-sdk-pandas](https://github.com/aws/aws-sdk-pandas) - Pandas on AWS.
 - [datasette](https://github.com/simonw/datasette) - An open source multi-tool for exploring and publishing data.
 - [desbordante](https://github.com/desbordante/desbordante-core/) - An open source data profiler for complex pattern discovery.
-- [docling](https://github.com/docling-project/docling) - Library for converting documents into structured data.
 - [optimus](https://github.com/hi-primus/optimus) - Agile Data Science Workflows made easy with PySpark.
 - [pandas](https://github.com/pandas-dev/pandas) - A library providing high-performance, easy-to-use data structures and data analysis tools.
 - [pathway](https://github.com/pathwaycom/pathway) - Real-time data processing framework for Python with reactive dataflows.
@@ -971,8 +970,10 @@ _Shells built with Python._
 _Libraries for parsing and manipulating specific text formats._

 - General
+  - [docling](https://github.com/docling-project/docling) - Library for converting documents into structured data.
  - [kreuzberg](https://github.com/kreuzberg-dev/kreuzberg) - High-performance document extraction library with a Rust core, supporting 62+ formats including PDF, Office, images with OCR, HTML, email, and archives.
  - [tablib](https://github.com/jazzband/tablib) - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
+  - [textract](https://github.com/deanmalmgren/textract) - Extract text from any document, Word, PowerPoint, PDFs, etc.
 - Office
  - [docxtpl](https://github.com/elapouya/python-docx-template) - Editing a docx document by jinja2 template
  - [openpyxl](https://openpyxl.readthedocs.io/en/stable/) - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
@@ -1136,7 +1137,6 @@ _Libraries for extracting web contents._
 - [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
 - [requests-html](https://github.com/psf/requests-html) - Pythonic HTML Parsing for Humans.
 - [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
-- [textract](https://github.com/deanmalmgren/textract) - Extract text from any document, Word, PowerPoint, PDFs, etc.
 - [toapi](https://github.com/gaojiuli/toapi) - Every web site provides APIs.

 ## Web Crawling