使用 Nodejs 读取超大文件

近期看了一篇文章，刚开始感觉挺有意思的，作者使用 Node 进行读取一些超大文件并完成一些数据处理，到最后再对比不同的读取方式的效率对比，可是读到后面，发现作者的观点有许多错误的地方，并且举的代码示例有很多自己没有发现的问题，因此在这里简单的记录一下。

原文链接在这里：

Using Node.js to Read Really, Really Large Datasets & Files (Pt 1)
Streams For the Win: A Performance Comparison of Node.js Methods for Reading Large Datasets (Pt 2)

题目内容

这里提出了 4 个问题：

Write a program that will print out the total number of lines in the file.
Notice that the 8th column contains a person’s name. Write a program that loads in this data and creates an array with all name strings. Print out the 432nd and 43243rd names.
Notice that the 5th column contains a form of date. Count how many donations occurred in each month and print out the results.
Notice that the 8th column contains a person’s name. Create an array with each first name. Identify the most common first name in the data and how many times it occurs.

其中数据集在这里：https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip，当你解压后，应该能看到一个 4G 大小左右的 txt 文件，其中数据格式大概是长这样的：

C00092957|A|M2|P|201801249090620780|15|IND|BUTTS, IVAN D|ALEXANDRIA|VA|22311|UNITED STATES POSTAL SERVICE|MGR HRM (DIST)|01262017|1299||8483312|1199787||(IN-KIND)|4012520181503246958
C00092957|A|M2|P|201801249090620780|15|IND|SHAWN, STEVE D|ROCKVILLE|MD|208511402|UNITED STATES POSTAL SERVICE|MGR CUSTOMER SRVCS|01312017|25||PR452039221219|1199787||P/R DEDUCTION ($25.00 MONTHLY)|4012520181503246960
C00092957|A|M2|P|201801249090620779|15|IND|BUTTS, IVAN D|ALEXANDRIA|VA|22311|UNITED STATES POSTAL SERVICE|MGR HRM (DIST)|01182017|100||8475123|1199787|||4012520181503246954
C00092957|A|M2|P|201801249090620779|15|IND|SHAWN, STEVE D|ROCKVILLE|MD|208511402|UNITED STATES POSTAL SERVICE|MGR CUSTOMER SRVCS|01182017|200||8475135|1199787|||4012520181503246956
C00092957|A|M3|P|201801249090620996|15|IND|BRADFORD, ROBERT D|HEWITT|TX|76643|RETIRED|RETIRED|02212017|324||8524782|1199797||(IN-KIND)|4012520181503246964
C00092957|A|M3|P|201801249090620996|15|IND|BRADFORD, ROBERT D|HEWITT|TX|76643|RETIRED|RETIRED|02252017|40||8528229|1199797|||4012520181503246966

作者提出了三个方法：

fs.readFile
fs.createReadStream() & readLine
fs.createReadStream() & event-stream

其实这里是不用大费篇章去讨论这三个方法的效率问题的（详见 pt2），因为第一个方法很明显不适用，如果你尝试把超大文件直接读入内存中，那可能你会收到这个错误：

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (4288772248) is greater than 2 GB

当然，如果你是在运行中造成内存溢出（4G 左右），大概会接受到这个错误：

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory

至于方法 2 和方法 3 ，其实它们效率是一样的，如果代码一致的话。可是作者在举例时确实两种代码写法不一致，导致第二种方法运行时堆栈溢出了，因此得出结论 event-stream 效率更高其实是否不靠谱的。

应该采用哪种方法按行读取超大文件

如果按行读取的话，我推荐 fs.createReadStream() & readLine ，代码示例如下：

import fs from "fs";
import readline from "readline";

const rl = readline.createInterface({
  input: fs.createReadStream("./itcont.txt"),
});

rl.on("line", (line) => {
  // handle line
});
rl.on("close", () => {
  // handle done
});

读取超大的 JSON 文件

使用 JSONStream 包即可完成对超大 JSON 文件的读取：

import fs from "fs";
import JSONStream from "JSONStream";

fs.createReadStream("../data.json", { encoding: "utf-8", flags: "r" })
  .pipe(JSONStream.parse("*"))
  .on("data", (data) => console.log(data));

关于 EventStream

如果你想要搜索相关 event-stream 库的信息，相信你首先看到的不是相关的教学或推荐文章，而是关于一则恶意攻击的新闻：

这个事件的起因是 event-stream 项目的作者由于时间和精力有限，将其维护工作交给了另一位开发者 Right9ctrl，该开发者获得了 event-stream 的控制权，将恶意代码注入。据报道，该恶意程序在默认情况下处于休眠状态，当 BitPay 的 Copay 钱包启动后，就会自动激活，它将会窃取用户钱包内的私钥并发送至 copayapi.host:8080。

鉴于这次事件的恶劣程度，我会尽可能去避免使用该包。简单来说，该包提供了一系列操作流的方法，如拆分、过滤与重组，用户可以链式的进行数据流处理，典型的使用范式为：

查看内存占用

当我们在应用启动时加上 —trace_gc ，即可查看当前程序占用的内存信息：

[38080:0x105600000]     1473 ms: Scavenge 16.7 (26.7) -> 10.0 (26.7) MB, 0.2 / 0.0 ms  (average mu = 1.000, current mu = 1.000) task
[38080:0x105600000]     1489 ms: Scavenge 17.0 (26.7) -> 10.1 (26.7) MB, 0.3 / 0.0 ms  (average mu = 1.000, current mu = 1.000) task
[38080:0x105600000]     1505 ms: Scavenge 17.1 (26.9) -> 10.0 (26.9) MB, 0.2 / 0.0 ms  (average mu = 1.000, current mu = 1.000) task

题目内容​

应该采用哪种方法按行读取超大文件​

读取超大的 JSON 文件​

关于 EventStream​

查看内存占用​

题目内容

应该采用哪种方法按行读取超大文件

读取超大的 JSON 文件

关于 EventStream

查看内存占用