Scientific Computing File Errors: Troubleshooting & Recovery Guide

Understanding Scientific Computing File Errors

Scientific computing relies heavily on specialized file formats designed to store complex numerical data, multidimensional arrays, simulation results, and research findings. These file formats often combine raw data with extensive metadata, enabling researchers to preserve not just results but also the context of experiments and analyses. When errors occur in these files, they can potentially compromise research integrity, delay publication, or lead to loss of irreplaceable experimental data.

This comprehensive guide addresses common file errors in scientific computing across various formats, including HDF5, NetCDF, MATLAB files, Jupyter notebooks, simulation outputs, and other research data formats. From corrupted headers and structural damage to version incompatibilities and metadata inconsistencies, we'll explore the typical issues researchers face when working with scientific data files. Whether you're a researcher, data scientist, engineer, or IT support for scientific computing, this guide provides detailed troubleshooting approaches and recovery techniques to help preserve valuable research data.

Common Scientific Computing File Formats

Before diving into specific errors, it's important to understand the various file formats commonly used in scientific computing:

  • HDF5 (.h5, .hdf5) - Hierarchical Data Format, a versatile format for storing large, complex datasets with rich metadata
  • NetCDF (.nc, .cdf) - Network Common Data Form, widely used in climate science, geosciences, and atmospheric research
  • MATLAB (.mat) - MATLAB's native format for storing workspace variables, widely used in engineering and signal processing
  • Jupyter Notebooks (.ipynb) - JSON-based format that combines code, output, visualizations, and markdown documentation
  • CSV/TSV (.csv, .tsv) - Simple tabular formats commonly used for data exchange
  • Parquet/Arrow (.parquet, .arrow) - Columnar storage formats optimized for big data analytics
  • FITS (.fits, .fit) - Flexible Image Transport System, standard in astronomy and astrophysics
  • NPY/NPZ (.npy, .npz) - NumPy's binary format for storing array data efficiently
  • Domain-specific formats - Formats like PDB (protein structures), GROMACS (molecular dynamics), or ROOT (particle physics)

Each format has specific structures, capabilities, and common issues. Understanding the format you're working with is crucial for effective troubleshooting.
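
If the extension is missing or untrustworthy, a quick look at the file's leading bytes usually reveals the real format before you reach for format-specific tools. The following is a minimal sketch; the signature table covers only a few common formats and the helper name is illustrative:

  # Minimal sketch: guess a scientific file format from its leading bytes
  SIGNATURES = [
      (b'\x89HDF\r\n\x1a\n', 'HDF5 (also underlies NetCDF-4 and MATLAB v7.3)'),
      (b'CDF\x01', 'NetCDF-3 (classic)'),
      (b'CDF\x02', 'NetCDF-3 (64-bit offset)'),
      (b'MATLAB 5.0 MAT-file', 'MATLAB MAT v5/v6/v7'),
      (b'\x93NUMPY', 'NumPy NPY'),
      (b'PK\x03\x04', 'ZIP container (NPZ and other zipped formats)'),
      (b'PAR1', 'Apache Parquet'),
      (b'SIMPLE  =', 'FITS'),
  ]

  def sniff_format(path):
      """Return a best-guess format label based on the file's first bytes."""
      with open(path, 'rb') as f:
          head = f.read(64)
      for magic, label in SIGNATURES:
          if head.startswith(magic):
              return label
      return 'unknown (inspect with a hex editor or the file command)'

  if __name__ == '__main__':
      import sys
      for name in sys.argv[1:]:
          print(name, '->', sniff_format(name))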

Error #1: "HDF5 File Corrupted" or "Cannot Access HDF5 Dataset"

Symptoms

When attempting to open an HDF5 file, you may encounter error messages like "Unable to open HDF5 file," "HDF5 signature not found," or "Cannot read from dataset." Software may fail to load the file entirely, or it might load partially with missing datasets or groups.

Causes

  • File truncation during transfer or storage
  • Corrupted file headers or superblocks
  • Interrupted write operations
  • Storage media failures
  • Incompatible HDF5 library versions
  • Filesystem corruption
  • Network issues during remote access

Solutions

Solution 1: Verify File Integrity with h5check

Use the HDF5 validation tools to identify issues:

  1. Run h5check to validate the file structure:
    h5check filename.h5
  2. Check for detailed error information about corruption location
  3. For more information, inspect the file header and storage properties with h5dump:
    h5dump -pH filename.h5
  4. Review error codes and specific problem areas

Solution 2: Recover with h5repack or h5copy

Try to extract salvageable data:

  1. Use h5repack to create a clean copy of the file:
    h5repack corrupted.h5 repaired.h5
  2. If h5repack fails, try selective extraction with h5copy:
    h5copy -i corrupted.h5 -o extracted.h5 -s /path/to/dataset -d /path/to/dataset
  3. For partially accessible files, selectively copy individual datasets or groups that are still readable

Solution 3: Programmatic Recovery using High-Level Libraries

Use programming libraries with error handling:

  1. Python example using h5py with error handling:
    import h5py
    import numpy as np
    import traceback
    
    # Create a new file for recovered data
    recovered = h5py.File('recovered.h5', 'w')
    
    # Try to open the corrupted file with read-only and error handling
    try:
        with h5py.File('corrupted.h5', 'r') as f:
            # Function to recursively visit and try to copy groups/datasets
            def visit_and_recover(name, obj):
                try:
                    if isinstance(obj, h5py.Group):
                        # Create group in the recovered file if it doesn't exist
                        if name not in recovered:
                            recovered.create_group(name)
                        print(f"Successfully copied group: {name}")
                    elif isinstance(obj, h5py.Dataset):
                        # Try to read and copy the dataset
                        try:
                            data = obj[()]
                            # Recreate dataset in the recovered file
                            if name not in recovered:
                                recovered.create_dataset(name, data=data)
                            print(f"Successfully copied dataset: {name}")
                        except Exception as e:
                            print(f"Failed to recover dataset {name}: {str(e)}")
                except Exception as e:
                    print(f"Error processing {name}: {str(e)}")
                    
            # Visit all objects in the file
            f.visititems(visit_and_recover)
            
    except Exception as e:
        print(f"Failed to open file: {str(e)}")
        traceback.print_exc()
        
    finally:
        # Always close the recovered file
        recovered.close()
        print("Recovery attempt completed.")

Solution 4: Use Low-Level HDF5 Recovery Tools

For more severe corruption, try specialized approaches:

  1. Check whether The HDF Group offers recovery utilities that apply to your case; experimental tools (such as the h5recover prototype for journal/superblock damage) have been published but are not part of the standard HDF5 distribution, so verify availability for your version
  2. For corrupted metadata but intact raw data, consider byte-level extraction tools
  3. Commercial data recovery services specializing in scientific formats may be able to help with severe corruption

Solution 5: Preventive Replication for Critical HDF5 Files

To avoid future data loss, implement protective measures:

  1. Use h5repack periodically to clean and optimize important files:
    h5repack -f GZIP=9 original.h5 optimized.h5
  2. Implement checksumming for datasets:
    h5repack -f FLETCHER32 original.h5 checksummed.h5
  3. Consider storing critical data with redundancy using mirrored HDF5 files
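
The same protections can be built in when files are first written. Below is a minimal h5py sketch, assuming the array fits in memory; the dataset name and chunk shape are placeholders to adapt to your data:

  import h5py
  import numpy as np

  # Minimal sketch: write data with chunking, gzip compression, and
  # Fletcher-32 checksums so later corruption is detected on read
  data = np.random.rand(1000, 1000)  # placeholder array

  with h5py.File('protected.h5', 'w') as f:
      f.create_dataset(
          'results/simulation',      # placeholder dataset path
          data=data,
          chunks=(100, 100),         # chunking confines damage to single chunks
          compression='gzip',
          compression_opts=9,
          fletcher32=True,           # per-chunk checksum verified on every read
      )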

Error #2: "NetCDF Read Error" or "Invalid Dimensions"

Symptoms

When working with NetCDF files, you may encounter errors like "NetCDF: Invalid dimensions," "NetCDF: Not a valid file format," or "Error accessing variable." Parts of the file may be inaccessible, or dimensional information may be inconsistent.

Causes

  • Incomplete file transfers
  • File header corruption
  • Version incompatibilities (NetCDF-3 vs. NetCDF-4)
  • Dimension or variable name corruption
  • Conflicts between dimensional definitions
  • Incorrect attribute types or values
  • Storage or networking issues during write operations

Solutions

Solution 1: NetCDF File Validation and Analysis

Analyze the file structure to identify issues:

  1. Use ncdump to examine the file structure:
    ncdump -h filename.nc
  2. For a deeper check, copy the file with nccopy (a successful full copy exercises every variable) or run a CF-compliance checker such as cfchecks
  3. Check for specific error information pointing to corrupted sections
  4. Verify version compatibility:
    ncdump -k filename.nc

Solution 2: Convert Between NetCDF Versions

Address version incompatibility issues:

  1. Convert NetCDF-4 to NetCDF-3:
    ncks -3 input.nc output.nc
  2. Convert NetCDF-3 to NetCDF-4:
    ncks -4 input.nc output.nc
  3. Try conversion with compression for optimized storage:
    ncks -4 -L 4 input.nc compressed.nc
  4. For specific file format issues, try forcing a format type:
    ncks --fl_fmt=netcdf4_classic input.nc output.nc

Solution 3: Extract Variables and Rebuild the File

Salvage individual components from the damaged file:

  1. Use NCO tools to extract variables selectively:
    ncks -v variable_name input.nc extracted_var.nc
  2. Extract coordinate variables (which carry the dimension values) and their attributes:
    ncks -v coordinate_variable_name input.nc extracted_coord.nc
  3. Merge salvaged components into a new file:
    ncks -A extracted_var1.nc new.nc
    ncks -A extracted_var2.nc new.nc

Solution 4: Programmatic NetCDF Repair with Python

Use the netCDF4 library for controlled file repair:

  1. Python example for selective recovery:
    import netCDF4 as nc
    import numpy as np
    
    # Open a new file for recovered data
    recovered = nc.Dataset('recovered.nc', 'w')
    
    try:
        # Try to open the corrupted file in read-only mode
        with nc.Dataset('corrupted.nc', 'r') as src:
            # Copy dimensions
            for dim_name, dimension in src.dimensions.items():
                try:
                    recovered.createDimension(dim_name, len(dimension) if not dimension.isunlimited() else None)
                    print(f"Copied dimension: {dim_name}")
                except Exception as e:
                    print(f"Failed to copy dimension {dim_name}: {str(e)}")
            
            # Copy global attributes
            for attr_name in src.ncattrs():
                try:
                    recovered.setncattr(attr_name, src.getncattr(attr_name))
                    print(f"Copied global attribute: {attr_name}")
                except Exception as e:
                    print(f"Failed to copy global attribute {attr_name}: {str(e)}")
            
            # Copy variables
            for var_name, variable in src.variables.items():
                try:
                    # Create the variable in the new file
                    var_type = variable.datatype
                    var_dims = variable.dimensions
                    var_out = recovered.createVariable(var_name, var_type, var_dims)
                    
                    # Copy variable attributes
                    for attr_name in variable.ncattrs():
                        var_out.setncattr(attr_name, variable.getncattr(attr_name))
                    
                    # Copy the data
                    var_out[:] = variable[:]
                    print(f"Copied variable: {var_name}")
                except Exception as e:
                    print(f"Failed to copy variable {var_name}: {str(e)}")
                    
    except Exception as e:
        print(f"Error opening corrupted file: {str(e)}")
        
    finally:
        # Close the recovered file
        recovered.close()
        print("Recovery attempt completed.")

Solution 5: CDO and NCO Tools for Advanced Repair

Leverage climate data operators for recovery:

  1. Use CDO to fix common NetCDF issues:
    cdo copy input.nc fixed.nc
  2. Repair time axis issues (the final argument is the time increment, e.g., 1day or 6hour):
    cdo settaxis,yyyy-mm-dd,hh:mm:ss,1day input.nc fixed.nc
  3. Fix grid definition problems:
    cdo setgrid,gridfile.txt input.nc fixed.nc
  4. Try selective data extraction and concatenation for corrupted timeseries:
    cdo seldate,yyyy-mm-dd,yyyy-mm-dd input.nc part1.nc
    cdo seldate,yyyy-mm-dd,yyyy-mm-dd input.nc part2.nc
    cdo mergetime part1.nc part2.nc merged.nc

Error #3: "MATLAB File Format Error" or "MAT-File Variable Import"

Symptoms

When trying to load MATLAB (.mat) files, you may see error messages like "Invalid MAT-file," "Unable to read MAT-file header," or "Error reading variable from file." Variables may be missing, corrupted, or have incorrect types when loaded.

Causes

  • Version incompatibilities (MATLAB 5.0 vs. 7.3 formats)
  • Corrupted file headers
  • Partial file saves due to crashes
  • 64-bit vs. 32-bit data storage issues
  • Platform-specific data format differences
  • Compression errors in newer MAT formats
  • Mixed version saves from different MATLAB versions

Solutions

Solution 1: Try Different MATLAB Loading Options

Adjust loading parameters to accommodate corruption:

  1. In MATLAB, try a standard 'load' first (version flags such as -v6 or -v7.3 apply to save, not load), then fall back to a matfile object for v7.3 HDF5-based files:
    try
        % Standard load handles v6/v7 and, in recent releases, v7.3 files
        data = load('corrupt.mat', '-mat');
    catch
        try
            % v7.3 files are HDF5-based; matfile gives partial, lazy access
            m = matfile('corrupt.mat');
            vars = who(m);
            data = struct();
            for i = 1:numel(vars)
                data.(vars{i}) = m.(vars{i});
            end
        catch
            error('All loading attempts failed');
        end
    end
  2. Try loading variables selectively to isolate corruption:
    % List what variables are in the file
    vars = who('-file', 'corrupt.mat');
    
    % Try loading each variable separately
    for i = 1:length(vars)
        try
            var_data = load('corrupt.mat', vars{i});
            fprintf('Successfully loaded: %s\n', vars{i});
        catch
            fprintf('Failed to load: %s\n', vars{i});
        end
    end

Solution 2: Convert MAT File Versions

Transform between different MATLAB formats:

  1. Load and re-save in a different format:
    % Load whatever can be loaded
    try
        data = load('corrupt.mat');
        
        % Save in older format which might be more robust
        save('recovered_v6.mat', '-struct', 'data', '-v6');
        
        % Or save in newer format
        save('recovered_v7.mat', '-struct', 'data', '-v7');
    catch e
        fprintf('Error during conversion: %s\n', e.message);
    end
  2. For large files that might be using v7.3 (HDF5-based), try HDF5 tools:
    % Use low-level HDF5 functions to access 7.3 format files
    fileinfo = h5info('corrupt.mat');
    datasets = {fileinfo.Datasets.Name};
    
    % Extract datasets one by one
    for i = 1:length(datasets)
        try
            data.(datasets{i}) = h5read('corrupt.mat', ['/' datasets{i}]);
            fprintf('Successfully extracted dataset: %s\n', datasets{i});
        catch
            fprintf('Failed to extract dataset: %s\n', datasets{i});
        end
    end
    
    % Save recovered data
    save('recovered.mat', '-struct', 'data');

Solution 3: Use Third-Party Tools for MAT File Recovery

Leverage alternative libraries for loading MATLAB files:

  1. Python example using scipy.io:
    import scipy.io as sio
    import h5py
    import numpy as np
    
    # Try loading with scipy
    try:
        data = sio.loadmat('corrupt.mat')
        print("Successfully loaded with scipy.io")
        # Save back to a new mat file
        sio.savemat('recovered_scipy.mat', data)
    except Exception as e:
        print(f"scipy.io failed: {str(e)}")
        
        # Try HDF5 approach for v7.3 files
        try:
            with h5py.File('corrupt.mat', 'r') as f:
                # Create a dictionary to hold the data
                data = {}
                
                # Function to recursively visit all objects
                def visit_and_extract(name, obj):
                    if isinstance(obj, h5py.Dataset):
                        try:
                            # Convert to numpy array
                            data[name] = np.array(obj)
                            print(f"Extracted: {name}")
                        except Exception as e:
                            print(f"Failed to extract {name}: {str(e)}")
                
                # Visit all objects
                f.visititems(visit_and_extract)
                
                # Save recovered data with scipy
                if data:
                    sio.savemat('recovered_h5py.mat', data)
                    print("Saved recovered data")
        except Exception as e:
            print(f"HDF5 approach failed: {str(e)}")

Solution 4: Binary Analysis for Header Repair

For advanced users, fix file headers manually:

  1. MATLAB MAT files have specific header structures depending on version:
    • MAT 5.0 format starts with a 128-byte header
    • The header begins with descriptive text that starts with 'MATLAB 5.0 MAT-file'
  2. Use a hex editor to verify and potentially fix simple header corruption
  3. For v7.3 files, use HDF5 header repair tools since they use HDF5 format
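
If you prefer scripting to a hex editor, the fixed layout of the MAT 5.0 header makes it easy to inspect from Python. A minimal sketch (the filename is a placeholder); it only reports the header fields and makes no changes:

  # Minimal sketch: inspect the 128-byte header of a MAT 5.0 file.
  # Layout: 116 bytes descriptive text, 8 bytes subsystem data offset,
  # 2 bytes version field, 2 bytes endian indicator ('MI' or 'IM').
  def inspect_mat_header(path):
      with open(path, 'rb') as f:
          header = f.read(128)
      if len(header) < 128:
          print('Shorter than a MAT 5.0 header; possibly a v4 file or truncated')
          return
      text = header[:116].rstrip(b'\x00 ').decode('ascii', errors='replace')
      version_bytes = header[124:126]
      endian = header[126:128].decode('ascii', errors='replace')
      print('Descriptive text:', text)
      print('Version bytes:', version_bytes.hex())
      print("Endian indicator:", endian, "('MI' or 'IM' depending on writer byte order)")

  inspect_mat_header('corrupt.mat')  # placeholder filename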

Solution 5: Partial Reconstruction from Research Results

When direct recovery fails, reconstruct critical data:

  1. Check for exported figures or data that might contain the essential information
  2. Look for script files that generated the data originally
  3. Check for derivative files or analysis results that might contain copies of variables
  4. If source data for calculations is available, rerun analyses to regenerate results

Error #4: "Jupyter Notebook Parse Error" or "Invalid Notebook Format"

Symptoms

When opening a Jupyter notebook (.ipynb file), you may encounter errors like "Notebook validation failed," "Invalid JSON," or "Unable to parse notebook." JupyterLab or Jupyter Notebook may fail to load the file, or display a corrupted version with missing cells or content.

Causes

  • Corrupted JSON structure
  • Interrupted save operations during kernel activity
  • Notebook server crashes during autosave
  • Merge conflicts in version control systems
  • Manual edits to the notebook file
  • JupyterLab/Notebook version incompatibilities
  • Extremely large output cells causing parsing issues

Solutions

Solution 1: Jupyter Notebook Format Validation and Repair

Check and fix JSON structure issues:

  1. Round-trip the file with nbconvert, which validates the notebook as it reads it:
    jupyter nbconvert --to notebook corrupted.ipynb --output validated.ipynb
  2. For more detailed diagnostics, validate with the nbformat library:
    python -c "import nbformat; nbformat.validate(nbformat.read('corrupted.ipynb', as_version=4))"
  3. Try the notebook repair extension if available:
    pip install nbrepair  # If available
    jupyter nbrepair corrupted.ipynb

Solution 2: Fix JSON Structure Manually

Address specific JSON formatting issues:

  1. Open the .ipynb file in a text editor (it's just JSON)
  2. Look for obvious JSON errors:
    • Missing or extra commas
    • Unclosed brackets or braces
    • Incomplete string values (missing quote marks)
  3. Use an online JSON validator to identify specific syntax errors
  4. Focus on fixing structural issues rather than content initially
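
Python's json module reports the exact line and column of the first syntax error, which makes manual repair considerably faster. A minimal sketch (the filename is a placeholder):

  import json

  # Minimal sketch: report where JSON parsing of a notebook first fails
  def locate_json_error(path):
      with open(path, 'r', encoding='utf-8') as f:
          text = f.read()
      try:
          json.loads(text)
          print('JSON parses cleanly; the problem is likely schema-level, not syntax')
      except json.JSONDecodeError as e:
          print(f'JSON error: {e.msg} at line {e.lineno}, column {e.colno}')
          # Show a little context around the failure point
          lines = text.splitlines()
          start = max(0, e.lineno - 3)
          for i, line in enumerate(lines[start:e.lineno + 2], start=start + 1):
              marker = '>>' if i == e.lineno else '  '
              print(f'{marker} {i}: {line[:120]}')

  locate_json_error('corrupted.ipynb')  # placeholder filename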

Solution 3: Extract Cells and Content Programmatically

Recover individual notebook components:

  1. Python script to extract salvageable cells:
    import json
    import nbformat
    
    # Try to open the corrupted notebook
    try:
        with open('corrupted.ipynb', 'r', encoding='utf-8') as f:
            content = f.read()
        
        # Try to parse the JSON, even if it's partially corrupted
        notebook_data = json.loads(content)
        
        # Extract cells
        cells = []
        if 'cells' in notebook_data:
            for i, cell in enumerate(notebook_data['cells']):
                try:
                    # Validate each cell
                    if 'cell_type' in cell and 'source' in cell:
                        cells.append(cell)
                        print(f"Successfully extracted cell {i}")
                    else:
                        print(f"Skipping cell {i} due to missing required fields")
                except Exception as e:
                    print(f"Error processing cell {i}: {str(e)}")
        
        # Create a new notebook with the salvageable cells
        new_notebook = nbformat.v4.new_notebook()
        new_notebook.cells = cells
        
        # If metadata is available, try to preserve it
        if 'metadata' in notebook_data:
            try:
                new_notebook.metadata = notebook_data['metadata']
            except:
                print("Could not recover metadata")
        
        # Write the repaired notebook
        with open('recovered.ipynb', 'w', encoding='utf-8') as f:
            nbformat.write(new_notebook, f)
        
        print(f"Recovered {len(cells)} cells to recovered.ipynb")
        
    except Exception as e:
        print(f"Failed to recover notebook: {str(e)}")
        
        # If JSON parsing completely fails, try to extract content with regex
        import re
        try:
            with open('corrupted.ipynb', 'r', encoding='utf-8') as f:
                content = f.read()
            
            # Extract code blocks
            code_blocks = re.findall(r'"source":\s*\[(.*?)\]', content, re.DOTALL)
            
            # Create a simple text file with extracted code
            with open('extracted_code.txt', 'w', encoding='utf-8') as f:
                for i, block in enumerate(code_blocks):
                    f.write(f"--- BLOCK {i} ---\n")
                    # Remove JSON formatting
                    cleaned = re.sub(r'",\s*"', '\n', block)
                    cleaned = re.sub(r'"', '', cleaned)
                    # Unescape newlines
                    cleaned = cleaned.replace('\\n', '\n')
                    f.write(cleaned)
                    f.write('\n\n')
            
            print(f"Extracted {len(code_blocks)} code blocks to extracted_code.txt")
        
        except Exception as e2:
            print(f"Even basic content extraction failed: {str(e2)}")

Solution 4: Recover from Jupyter Autosave or Checkpoints

Look for automatic backups created by Jupyter:

  1. Check for checkpoint files in the .ipynb_checkpoints directory:
    ls -la .ipynb_checkpoints/
  2. Restore from the checkpoint version:
    cp .ipynb_checkpoints/notebook_name-checkpoint.ipynb recovered.ipynb
  3. For JupyterLab, check the workspace records, which note which notebooks were open and can point you to other copies:
    ls -la ~/.jupyter/lab/workspaces/

Solution 5: Convert to Other Formats and Rebuild

Try conversion to simpler formats:

  1. If the notebook partially opens, export to a different format:
    jupyter nbconvert --to python corrupted.ipynb
  2. For markdown content:
    jupyter nbconvert --to markdown corrupted.ipynb
  3. Create a new notebook and copy salvageable content from these exports
  4. If output data is critical, try extracting just the HTML:
    jupyter nbconvert --to html corrupted.ipynb

Error #5: "NumPy Array Loading Error" or "NPY Format Issue"

Symptoms

When trying to load NumPy binary files (.npy, .npz), you may encounter errors like "Unable to read array header," "Invalid NPY format," or "Cannot load NPZ file." The data may fail to load entirely, or load with incorrect shapes or data types.

Causes

  • Corrupted file headers
  • Incompatible NumPy versions
  • Endianness issues across different platforms
  • Incomplete file writes
  • Mixed data type corruption
  • Compression errors in NPZ files

Solutions

Solution 1: NumPy Loading with Error Handling

Try different loading approaches:

  1. Python code with flexible loading options:
    import numpy as np
    
    def try_load_npy(filename):
        # Try different approaches to load a potentially corrupted NPY file
        try:
            # Standard approach
            data = np.load(filename)
            print("Standard loading successful")
            return data
        except Exception as e1:
            print(f"Standard loading failed: {str(e1)}")
            
            try:
                # Try with allow_pickle
                data = np.load(filename, allow_pickle=True)
                print("Loading with allow_pickle successful")
                return data
            except Exception as e2:
                print(f"allow_pickle loading failed: {str(e2)}")
                
                try:
                    # Try with fixing
                    data = np.load(filename, allow_pickle=True, fix_imports=True)
                    print("Loading with fix_imports successful")
                    return data
                except Exception as e3:
                    print(f"fix_imports loading failed: {str(e3)}")
                    
                    try:
                        # Try with mmap_mode for large files
                        data = np.load(filename, mmap_mode='r')
                        print("Loading with mmap_mode successful")
                        return data
                    except Exception as e4:
                        print(f"mmap_mode loading failed: {str(e4)}")
                        
                        # All attempts failed
                        print("All loading attempts failed")
                        return None
    
    # For NPZ files
    def try_load_npz(filename):
        try:
            # Standard approach
            data = np.load(filename)
            print("NPZ loading successful")
            print(f"Available arrays: {list(data.keys())}")
            return data
        except Exception as e:
            print(f"NPZ loading failed: {str(e)}")
            
            # Try opening as a zip file
            try:
                import zipfile
                with zipfile.ZipFile(filename) as z:
                    print(f"NPZ file contains: {z.namelist()}")
                    # Extract individual arrays
                    arrays = {}
                    for name in z.namelist():
                        if name.endswith('.npy'):
                            try:
                                with z.open(name) as f:
                                    # Read the file into a BytesIO object
                                    import io
                                    data_bytes = io.BytesIO(f.read())
                                    # Try to load the array
                                    arr = np.load(data_bytes)
                                    arrays[name[:-4]] = arr  # Remove .npy extension
                                    print(f"Successfully extracted array: {name}")
                            except Exception as e2:
                                print(f"Failed to extract {name}: {str(e2)}")
                    return arrays
            except Exception as e3:
                print(f"Zip extraction failed: {str(e3)}")
                return None

Solution 2: Repair NumPy File Headers

Fix header information in corrupted files:

  1. Understanding the NPY format:
    • NPY files start with a magic string ('\x93NUMPY')
    • Followed by two version bytes (major, minor), a header-length field (2 bytes in format version 1.0), and a Python-dict header describing dtype, shape, and memory order
  2. Create a script to fix common header issues:
    import numpy as np
    import struct
    
    def repair_npy_header(corrupted_file, repaired_file, expected_shape, dtype):
        """
        Attempt to repair a corrupted NPY file by reconstructing its header
        
        Parameters:
        corrupted_file - Path to the corrupted NPY file
        repaired_file - Where to save the repaired file
        expected_shape - Tuple with the expected array shape
        dtype - Expected data type (e.g., 'float32', 'int64')
        """
        try:
            # Read the raw data from the corrupted file
            with open(corrupted_file, 'rb') as f:
                content = f.read()
            
            # Check if the magic string is present
            if content.startswith(b'\x93NUMPY'):
                print("Magic string present; the header may be intact, inspect further")
                return False
            
            print("Magic string missing, rebuilding NPY header")
            
            # Build the header dictionary expected by the NPY format
            dtype_obj = np.dtype(dtype)
            header_dict = {
                'descr': dtype_obj.str,
                'fortran_order': False,
                'shape': tuple(expected_shape)
            }
            
            # Serialize and pad so that magic (6) + version (2) + header
            # length (2) + header is a multiple of 64 bytes, ending in '\n'
            header_str = repr(header_dict)
            prefix_len = 6 + 2 + 2
            padding = (64 - (prefix_len + len(header_str) + 1) % 64) % 64
            header_bytes = (header_str + ' ' * padding + '\n').encode('latin1')
            
            magic = b'\x93NUMPY'
            version = struct.pack('BB', 1, 0)                  # format version 1.0
            header_len = struct.pack('<H', len(header_bytes))  # little-endian uint16
            
            # Write the rebuilt header followed by the original bytes,
            # which are assumed to be the raw array data
            with open(repaired_file, 'wb') as f:
                f.write(magic + version + header_len + header_bytes)
                f.write(content)
            
            # Verify that the repaired file now loads
            repaired = np.load(repaired_file)
            print(f"Repaired file loads with shape {repaired.shape} and dtype {repaired.dtype}")
            return True
        
        except Exception as e:
            print(f"Header repair failed: {str(e)}")
            return False

Solution 3: Extract Raw Data and Reconstruct

For severe corruption, extract the raw binary data:

  1. Skip the header and try to recover the raw data:
    import numpy as np
    import os
    
    def extract_raw_data(corrupted_file, output_file, expected_shape, dtype):
        """
        Extract raw data from a corrupted NPY file, skipping the header
        """
        # Determine the data size
        dtype_obj = np.dtype(dtype)
        element_size = dtype_obj.itemsize
        total_elements = np.prod(expected_shape)
        expected_data_size = total_elements * element_size
        
        # Get file size
        file_size = os.path.getsize(corrupted_file)
        
        # Read the file
        with open(corrupted_file, 'rb') as f:
            # Skip potential header (NPY header is typically less than 128 bytes)
            header_size = min(128, file_size - expected_data_size)
            if header_size < 0:
                print("File too small for expected data size")
                return False
            
            f.seek(header_size)
            raw_data = f.read(expected_data_size)
        
        # Reshape the raw data into the expected array
        try:
            array = np.frombuffer(raw_data, dtype=dtype_obj)
            if len(array) == total_elements:
                array = array.reshape(expected_shape)
                # Save the reconstructed array
                np.save(output_file, array)
                print(f"Raw data extracted and saved to {output_file}")
                return True
            else:
                print(f"Extracted data size mismatch: got {len(array)}, expected {total_elements}")
                return False
        except Exception as e:
            print(f"Failed to reconstruct array: {str(e)}")
            return False

Solution 4: NPZ Archive Recovery

For NPZ files (which are ZIP archives), use ZIP recovery:

  1. Use ZIP utilities to check and extract contents:
    import zipfile
    import numpy as np
    import io
    
    def recover_npz(corrupted_npz, output_dir):
        """
        Try to recover individual NPY files from a corrupted NPZ archive
        """
        try:
            # Try to open as a ZIP file
            with zipfile.ZipFile(corrupted_npz, 'r') as z:
                file_list = z.namelist()
                print(f"NPZ archive contains: {file_list}")
                
                success_count = 0
                for name in file_list:
                    if name.endswith('.npy'):
                        try:
                            # Extract the file
                            z.extract(name, output_dir)
                            print(f"Extracted {name} to {output_dir}")
                            
                            # Try to load it
                            arr = np.load(f"{output_dir}/{name}")
                            print(f"Successfully loaded {name}, shape: {arr.shape}, dtype: {arr.dtype}")
                            success_count += 1
                        except Exception as e:
                            print(f"Failed to process {name}: {str(e)}")
                
                print(f"Recovered {success_count} of {len(file_list)} files")
                return success_count > 0
        
        except zipfile.BadZipFile:
            print("File is not a valid ZIP/NPZ archive")
            
            # For severely corrupted ZIP files, try ZIP repair tools or raw extraction
            try:
                # Simple example - in practice, use specialized ZIP repair tools
                with open(corrupted_npz, 'rb') as f:
                    data = f.read()
                
                # Look for NPY file signatures within the data
                npy_sigs = [b'\x93NUMPY']
                positions = []
                
                for sig in npy_sigs:
                    pos = 0
                    while True:
                        pos = data.find(sig, pos)
                        if pos == -1:
                            break
                        positions.append(pos)
                        pos += 1
                
                if positions:
                    print(f"Found {len(positions)} potential NPY headers in the corrupted file")
                    
                    # Try to extract data starting from these positions
                    for i, pos in enumerate(positions):
                        try:
                            # Extract a chunk of data (arbitrary size)
                            chunk = data[pos:pos+10000000]  # 10MB chunk
                            
                            # Try to load as NPY
                            with open(f"{output_dir}/recovered_{i}.npy", 'wb') as f:
                                f.write(chunk)
                            
                            # Test if loadable
                            try:
                                arr = np.load(f"{output_dir}/recovered_{i}.npy")
                                print(f"Successfully recovered array {i}, shape: {arr.shape}")
                            except:
                                print(f"Extracted chunk {i} is not a valid NPY file")
                        except Exception as e:
                            print(f"Failed to extract chunk {i}: {str(e)}")
                    
                    return True
                else:
                    print("No NPY signatures found in the file")
                    return False
            
            except Exception as e:
                print(f"Raw extraction failed: {str(e)}")
                return False

Solution 5: Alternative Storage Format Conversion

When dealing with problematic NumPy binary files, convert to more robust formats:

  1. If you can load the data, save in multiple formats for redundancy:
    import numpy as np
    import h5py
    import pickle
    
    def save_array_multi_format(array, base_filename):
        """
        Save an array in multiple formats for redundancy
        """
        # NumPy binary
        np.save(f"{base_filename}.npy", array)
        
        # Compressed NumPy
        np.savez_compressed(f"{base_filename}.npz", array=array)
        
        # HDF5 format
        with h5py.File(f"{base_filename}.h5", 'w') as f:
            f.create_dataset('array', data=array)
        
        # CSV (for 2D arrays)
        if array.ndim <= 2:
            np.savetxt(f"{base_filename}.csv", array, delimiter=',')
        
        # Python pickle
        with open(f"{base_filename}.pkl", 'wb') as f:
            pickle.dump(array, f)
        
        print(f"Saved array in multiple formats with base name: {base_filename}")

Error #6: "Parquet/Arrow File Corruption" or "Columnar Data Access Issues"

Symptoms

When working with modern columnar storage formats like Parquet or Arrow, you may encounter errors like "Invalid Parquet file," "Footer corruption," or "Arrow metadata error." Only partial data may be accessible, or specific columns might be unreadable.

Causes

  • File truncation during write operations
  • Corrupted file metadata or footers
  • Incompatible format versions
  • Compression-related errors
  • Schema inconsistencies or type violations
  • Library version incompatibilities

Solutions

Solution 1: Parquet Validation and Inspection

Analyze the file structure to identify issues:

  1. Use parquet-tools to examine the file:
    parquet-tools meta corrupted.parquet
    parquet-tools schema corrupted.parquet
  2. For detailed inspection:
    parquet-tools dump corrupted.parquet
  3. Check for specific metadata or row group issues:
    parquet-tools inspect corrupted.parquet

Solution 2: Selective Column and Row Group Reading

Extract accessible portions of the data:

  1. Python example using pyarrow:
    import pyarrow.parquet as pq
    import pandas as pd
    
    def recover_parquet_by_columns(corrupted_file, output_file):
        """
        Attempt to recover a Parquet file by reading columns selectively
        """
        try:
            # Try to read the file metadata
            try:
                parquet_file = pq.ParquetFile(corrupted_file)
                schema = parquet_file.schema
                print(f"Successfully read schema with {len(schema.names)} columns")
                column_names = schema.names
            except Exception as e:
                print(f"Failed to read schema: {str(e)}")
                # Try a different approach to get column names
                try:
                    # Read just the schema without decoding any data
                    schema = pq.read_schema(corrupted_file)
                    column_names = schema.names
                    print(f"Retrieved {len(column_names)} column names from the schema")
                except:
                    print("Cannot determine column names, recovery not possible")
                    return False
            
            # Try reading each column individually
            recovered_columns = {}
            for col in column_names:
                try:
                    # Read just this column
                    column_data = pd.read_parquet(corrupted_file, columns=[col])
                    recovered_columns[col] = column_data[col]
                    print(f"Successfully recovered column: {col}")
                except Exception as e:
                    print(f"Failed to recover column {col}: {str(e)}")
            
            # Combine recovered columns into a DataFrame
            if recovered_columns:
                recovered_df = pd.DataFrame(recovered_columns)
                print(f"Recovered DataFrame with {len(recovered_df)} rows and {len(recovered_columns)} columns")
                
                # Save the recovered data
                recovered_df.to_parquet(output_file)
                print(f"Saved recovered data to {output_file}")
                return True
            else:
                print("No columns could be recovered")
                return False
            
        except Exception as e:
            print(f"Overall recovery failed: {str(e)}")
            return False
    
    def recover_parquet_by_row_groups(corrupted_file, output_file):
        """
        Attempt to recover a Parquet file by reading row groups selectively
        """
        try:
            # Try to open the file and get row group info
            parquet_file = pq.ParquetFile(corrupted_file)
            num_row_groups = parquet_file.num_row_groups
            print(f"File has {num_row_groups} row groups")
            
            # Try to read each row group
            dfs = []
            for i in range(num_row_groups):
                try:
                    row_group = parquet_file.read_row_group(i)
                    df = row_group.to_pandas()
                    dfs.append(df)
                    print(f"Successfully read row group {i} with {len(df)} rows")
                except Exception as e:
                    print(f"Failed to read row group {i}: {str(e)}")
            
            # Combine the recovered row groups
            if dfs:
                recovered_df = pd.concat(dfs, ignore_index=True)
                print(f"Recovered DataFrame with {len(recovered_df)} rows and {len(recovered_df.columns)} columns")
                
                # Save the recovered data
                recovered_df.to_parquet(output_file)
                print(f"Saved recovered data to {output_file}")
                return True
            else:
                print("No row groups could be recovered")
                return False
                
        except Exception as e:
            print(f"Overall recovery failed: {str(e)}")
            return False

Solution 3: Format Conversion Recovery

Convert between formats to bypass corruption:

  1. Try different libraries and formats:
    import pyarrow.parquet as pq
    import pyarrow as pa
    import pandas as pd
    
    def multi_format_recovery(corrupted_file, base_output):
        """
        Try to recover data using multiple format conversions
        """
        recovery_methods = []
        
        # Method 1: PyArrow direct
        try:
            table = pq.read_table(corrupted_file)
            pq.write_table(table, f"{base_output}_pyarrow.parquet")
            recovery_methods.append("pyarrow_direct")
            print("PyArrow direct recovery successful")
        except Exception as e:
            print(f"PyArrow direct failed: {str(e)}")
        
        # Method 2: Via pandas
        try:
            df = pd.read_parquet(corrupted_file)
            df.to_parquet(f"{base_output}_pandas.parquet")
            recovery_methods.append("pandas_parquet")
            print("Pandas parquet recovery successful")
        except Exception as e:
            print(f"Pandas parquet failed: {str(e)}")
        
        # Method 3: Parquet to CSV to Parquet
        try:
            df = pd.read_parquet(corrupted_file)
            csv_path = f"{base_output}.csv"
            df.to_csv(csv_path, index=False)
            print(f"Saved to CSV: {csv_path}")
            
            # Read back from CSV
            df_csv = pd.read_csv(csv_path)
            df_csv.to_parquet(f"{base_output}_via_csv.parquet")
            recovery_methods.append("via_csv")
            print("CSV roundtrip recovery successful")
        except Exception as e:
            print(f"CSV roundtrip failed: {str(e)}")
        
        # Method 4: Convert to Arrow IPC format
        try:
            table = pq.read_table(corrupted_file)
            arrow_path = f"{base_output}.arrow"
            with pa.OSFile(arrow_path, 'wb') as sink:
                with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                    writer.write_table(table)
            
            # Read back from Arrow
            with pa.memory_map(arrow_path, 'rb') as source:
                reader = pa.RecordBatchFileReader(source)
                arrow_table = reader.read_all()
            
            pq.write_table(arrow_table, f"{base_output}_via_arrow.parquet")
            recovery_methods.append("via_arrow")
            print("Arrow IPC roundtrip successful")
        except Exception as e:
            print(f"Arrow IPC roundtrip failed: {str(e)}")
        
        # Summary
        if recovery_methods:
            print(f"Successfully recovered data using: {', '.join(recovery_methods)}")
            return True
        else:
            print("All recovery methods failed")
            return False

Solution 4: Repair Parquet Footer and Metadata

For advanced users, fix file structure issues:

  1. Understanding Parquet structure:
    • Parquet files have a footer with metadata at the end
    • The file ends with a 4-byte footer length followed by the 4-byte magic 'PAR1'
    • Corrupted footers often cause most recovery issues
  2. Python example to fix truncated files (advanced):
    import struct
    import os
    import pyarrow.parquet as pq
    
    def repair_truncated_parquet(corrupted_file, fixed_file):
        """
        Attempt to repair a truncated Parquet file by reconstructing the footer
        Note: This is a simplified example and may not work for all cases
        """
        try:
            # First, make a copy of the corrupted file
            with open(corrupted_file, 'rb') as f_in, open(fixed_file, 'wb') as f_out:
                f_out.write(f_in.read())
            
            # Try to extract schema information from a similar file or first part of the file
            try:
                # This assumes that part of the file is valid and schema can be read
                partial_schema = pq.read_schema(corrupted_file)
                print(f"Retrieved partial schema with {len(partial_schema.names)} columns")
                
                # In a real implementation, you would now:
                # 1. Reconstruct proper row group metadata
                # 2. Recalculate column chunk offsets and sizes
                # 3. Build a new file footer with proper statistics
                # 4. Write the footer to the end of the file
                # 5. Append the footer length (4 bytes) and the 'PAR1' magic (4 bytes)
                
                print("Full footer reconstruction requires detailed Parquet format knowledge")
                print("Consider using specialized Parquet repair tools for serious corruption")
                
                return True
            except Exception as e:
                print(f"Schema extraction failed: {str(e)}")
                return False
                
        except Exception as e:
            print(f"Repair attempt failed: {str(e)}")
            return False

Solution 5: Use Specialized Arrow/Parquet Tools

Leverage dedicated utilities for recovery:

  1. For Arrow IPC files, pyarrow's validation APIs (Table.validate and RecordBatch.validate) can check integrity batch by batch, as shown in the sketch after this list
  2. Consider commercial or specialized data recovery tools designed for columnar formats
  3. Search for recovery utilities in the Apache Arrow and Parquet community resources
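
In the absence of a dedicated validator, pyarrow itself can serve as one: walking an Arrow IPC file batch by batch surfaces corrupt record batches individually instead of failing the whole read. A minimal sketch (the filename is a placeholder):

  import pyarrow as pa

  # Minimal sketch: validate an Arrow IPC file one record batch at a time
  def check_arrow_file(path):
      with pa.memory_map(path, 'rb') as source:
          reader = pa.ipc.open_file(source)
          print(f'Schema: {reader.schema}')
          total = reader.num_record_batches
          readable = 0
          for i in range(total):
              try:
                  batch = reader.get_batch(i)
                  batch.validate(full=True)  # deep validation of all buffers
                  readable += 1
              except Exception as e:
                  print(f'Batch {i} failed: {e}')
          print(f'{readable} of {total} record batches readable')

  check_arrow_file('corrupted.arrow')  # placeholder filename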

Error #7: "Domain-Specific Format Errors" (FITS, PDB, etc.)

Symptoms

When working with specialized scientific formats like FITS (astronomy), PDB (molecular structures), or other domain-specific formats, you may encounter errors like "Invalid header," "Structure validation failed," or "Cannot parse format." The files may fail to load in specialized software, or display incorrectly.

Causes

  • Format-specific structural corruption
  • Incompatible format versions or extensions
  • Missing required metadata or fields
  • Software version incompatibilities
  • File transfer or encoding issues
  • Domain-specific constraints violations

Solutions

Solution 1: FITS File Recovery (Astronomy)

For corrupted FITS files used in astronomy:

  1. Use FITS utilities to examine and fix the file:
    # Summarize the HDU structure (astropy command-line script)
    fitsinfo corrupted.fits
    
    # Inspect the headers
    fitsheader corrupted.fits
    
    # Check standards compliance and checksums
    fitscheck corrupted.fits
    
    # Detailed verification with HEASARC's fitsverify
    fitsverify corrupted.fits
  2. Python example using astropy:
    from astropy.io import fits
    import numpy as np
    
    def recover_fits(corrupted_file, output_file):
        """
        Attempt to recover data from a corrupted FITS file
        """
        try:
            # Try opening with various options
            try:
                hdul = fits.open(corrupted_file, ignore_missing_end=True)
                print("Successfully opened FITS file with ignore_missing_end")
            except Exception as e1:
                print(f"Standard open failed: {str(e1)}")
                try:
                    hdul = fits.open(corrupted_file, ignore_missing_end=True, checksum=False)
                    print("Successfully opened FITS file with checksum disabled")
                except Exception as e2:
                    print(f"Checksum disabled open failed: {str(e2)}")
                    return False
            
            # Process each HDU (Header Data Unit)
            salvaged_hdus = []
            for i, hdu in enumerate(hdul):
                try:
                    # Check if header is readable
                    header = hdu.header
                    print(f"HDU {i} has readable header with {len(header)} keywords")
                    
                    # Check if data is accessible
                    try:
                        data = hdu.data
                        if data is not None:
                            print(f"HDU {i} has data with shape {data.shape} and type {data.dtype}")
                            # Create a new HDU with the salvaged data
                            if isinstance(hdu, fits.PrimaryHDU):
                                new_hdu = fits.PrimaryHDU(data=data, header=header)
                            else:
                                new_hdu = fits.ImageHDU(data=data, header=header)
                            salvaged_hdus.append(new_hdu)
                        else:
                            print(f"HDU {i} has no data")
                            salvaged_hdus.append(fits.ImageHDU(header=header))
                    except Exception as e:
                        print(f"Could not access data in HDU {i}: {str(e)}")
                        # Try to salvage just the header
                        salvaged_hdus.append(fits.ImageHDU(header=header))
                except Exception as e:
                    print(f"Could not process HDU {i}: {str(e)}")
            
            # Create a new FITS file with salvaged HDUs
            if salvaged_hdus:
                new_hdul = fits.HDUList(salvaged_hdus)
                new_hdul.writeto(output_file, overwrite=True)
                print(f"Wrote {len(salvaged_hdus)} HDUs to {output_file}")
                return True
            else:
                print("No HDUs could be salvaged")
                return False
                
        except Exception as e:
            print(f"Overall recovery failed: {str(e)}")
            return False

Solution 2: PDB File Repair (Molecular Structures)

For protein and molecular structure files:

  1. Use a structure validation tool if one is installed (for example, pdb_validate from the pdb-tools suite):
    pdb_validate corrupted.pdb
  2. Python example using Biopython:
    from Bio import PDB
    import re
    
    def repair_pdb(corrupted_file, output_file):
        """
        Attempt to repair a corrupted PDB file
        """
        try:
            # Try the PDB parser in permissive mode
            parser = PDB.PDBParser(QUIET=True, PERMISSIVE=True)
            try:
                structure = parser.get_structure('structure', corrupted_file)
                print("Successfully parsed PDB with permissive parser")
                
                # If successful, write to a new file
                io = PDB.PDBIO()
                io.set_structure(structure)
                io.save(output_file)
                print(f"Saved repaired structure to {output_file}")
                return True
            except Exception as e:
                print(f"Permissive parsing failed: {str(e)}")
            
            # If parsing fails completely, try line-by-line repair
            with open(corrupted_file, 'r') as f:
                lines = f.readlines()
            
            # Filter for valid ATOM/HETATM records
            valid_lines = []
            atom_pattern = re.compile(r'^(ATOM|HETATM)(\s*\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+[-\d\.]+\s+[-\d\.]+\s+[-\d\.]+).*$')
            
            for line in lines:
                if line.startswith(('ATOM', 'HETATM')):
                    match = atom_pattern.match(line)
                    if match:
                        # This is a valid-looking ATOM/HETATM record
                        valid_lines.append(line)
                elif line.startswith(('TER', 'END', 'HEADER', 'TITLE', 'REMARK')):
                    # Keep these administrative records
                    valid_lines.append(line)
            
            if valid_lines:
                # Ensure we have END record
                if not any(line.startswith('END') for line in valid_lines):
                    valid_lines.append('END\n')
                
                # Write the cleaned file
                with open(output_file, 'w') as f:
                    f.writelines(valid_lines)
                
                print(f"Wrote {len(valid_lines)} valid records to {output_file}")
                
                # Try parsing again
                try:
                    structure = parser.get_structure('fixed', output_file)
                    print("Successfully parsed the repaired PDB file")
                    return True
                except Exception as e:
                    print(f"Parsing of repaired file still failed: {str(e)}")
                    return False
                    
            else:
                print("No valid ATOM/HETATM records found")
                return False
                
        except Exception as e:
            print(f"Overall repair attempt failed: {str(e)}")
            return False

Solution 3: General Approach for Domain-Specific Formats

Apply these general principles to any specialized format:

  1. Understand the file structure:
    • Study the format specification if available
    • Identify critical header/metadata sections vs. data sections
    • Learn what validation constraints apply to the format
  2. Use domain-specific validation tools:
    • Most scientific domains have format-specific validators
    • Run with permissive options when available
  3. Create a minimal valid file:
    • Study examples of minimal valid files in the format
    • Compare headers and structures with your corrupted file
    • Sometimes combining a valid header with your data can work
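
One quick way to apply the "minimal valid file" idea is to byte-compare the start of a known-good file in the same format against the corrupted one; the first diverging byte usually marks where the header damage begins. A generic sketch with placeholder filenames:

  # Minimal sketch: find where a corrupted file's header diverges from a
  # known-good file in the same format
  def compare_headers(good_path, bad_path, length=256):
      with open(good_path, 'rb') as f:
          good = f.read(length)
      with open(bad_path, 'rb') as f:
          bad = f.read(length)
      for offset, (g, b) in enumerate(zip(good, bad)):
          if g != b:
              print(f'First difference at byte offset {offset}: '
                    f'expected 0x{g:02x}, found 0x{b:02x}')
              return offset
      print(f'First {min(len(good), len(bad))} bytes are identical')
      return None

  compare_headers('known_good.dat', 'corrupted.dat')  # placeholder filenames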

Solution 4: Format Conversion Recovery

Use alternative formats when direct repair fails:

  1. Identify common interchange formats in your scientific domain
  2. If partial reading works, export to a simpler format:
    • For structural data: Convert to simpler formats like mmCIF or SDF
    • For image data: Export to TIFF or other standard formats
    • For tabular data: Export to CSV or TSV
  3. If raw data is crucial, extract the binary data blocks and rebuild

Solution 5: Consult Domain Experts

Seek specialized help for critical files:

  1. Scientific domains often have mailing lists or forums for format issues
  2. Contact the original software developers for recovery guidance
  3. Consider professional data recovery services that specialize in scientific data

Preventative Measures for Scientific Computing File Errors

Taking proactive steps can significantly reduce the risk of scientific data file issues:

  1. Regular File Validation: Use format-specific validation tools routinely
  2. Multiple Format Storage: Save critical results in multiple file formats
  3. Versioned Backups: Implement systematic backup procedures with versioning
  4. Checksumming: Calculate and store file checksums with your data
  5. Use Robust Storage Formats: Prefer formats with built-in validation (HDF5 with checksums, etc.)
  6. Atomic File Operations: Use temporary files and atomic renames for safer saves (see the sketch after this list)
  7. Metadata Documentation: Document data formats and structures separately
  8. Version Control: Use Git LFS or similar for tracking data files
  9. Automated Testing: Implement automated validation in data processing pipelines
  10. Software Updates: Keep scientific libraries and tools current
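
Item 6 is worth illustrating because it prevents the most common cause of truncated files: a crash or interruption in the middle of a save. A minimal sketch, assuming a POSIX-style filesystem; the writer callback is a placeholder for whatever call produces your file:

  import os
  import tempfile
  import numpy as np

  # Minimal sketch of an atomic save: write to a temporary file in the same
  # directory, flush it to disk, then atomically rename over the target
  def atomic_save(target_path, write_func):
      directory = os.path.dirname(os.path.abspath(target_path))
      fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
      try:
          with os.fdopen(fd, 'wb') as tmp:
              write_func(tmp)         # caller writes the file contents
              tmp.flush()
              os.fsync(tmp.fileno())  # make sure the bytes reach the disk
          os.replace(tmp_path, target_path)  # atomic on the same filesystem
      except Exception:
          os.unlink(tmp_path)
          raise

  # Example: save a NumPy array atomically (np.save accepts open file objects)
  atomic_save('results.npy', lambda f: np.save(f, np.arange(10)))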

Best Practices for Scientific Data File Management

Follow these best practices to minimize problems with scientific computing files:

  1. Format Selection: Choose appropriate formats based on data characteristics and needs
  2. Version Control Integration: Use Git LFS or DVC for large scientific datasets
  3. Standardized Naming: Implement consistent file naming with version indicators
  4. Metadata Management: Include comprehensive metadata within files
  5. Data Publication Preparation: Validate files before submission to repositories
  6. Documentation: Document data structures and dependencies
  7. Format Conversion Testing: Verify round-trip conversions preserve data integrity (see the sketch after this list)
  8. Dependency Management: Track software dependencies that affect file formats
  9. Storage Media Selection: Use appropriate storage for different data lifecycle stages
  10. Recovery Planning: Develop and test data recovery procedures in advance
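
Item 7 can be automated: after any format conversion, read the result back and check it against the original. A minimal sketch using pandas and Parquet with placeholder data:

  import numpy as np
  import pandas as pd

  # Minimal sketch: verify that a Parquet round trip preserves the data
  original = pd.DataFrame({
      'time': np.arange(100, dtype=np.int64),
      'value': np.random.rand(100),
  })

  original.to_parquet('roundtrip_check.parquet')
  restored = pd.read_parquet('roundtrip_check.parquet')

  # Raises AssertionError with a detailed diff if anything changed
  pd.testing.assert_frame_equal(original, restored)
  print('Round trip preserved dtypes and values')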

Scientific Computing File Repair Software and Tools

Several specialized tools can help troubleshoot and repair scientific data files:

  • Format-Specific Tools:
    • h5check, h5repack, h5dump (HDF5)
    • nccopy, ncdump, ncgen (NetCDF)
    • fitsverify, fitscheck, fitsinfo (FITS)
    • pdb_validate and related pdb-tools scripts (PDB)
    • parquet-tools (Parquet)
  • Programming Libraries:
    • h5py, PyTables (Python for HDF5)
    • netCDF4-python (Python for NetCDF)
    • astropy (Python for FITS)
    • Biopython, PyMOL (Molecular structures)
    • pyarrow (Arrow/Parquet)
  • General Data Analysis Tools:
    • Pandas (Python data analysis)
    • NumPy (Array operations)
    • Jupyter Notebooks (Interactive analysis)
  • Domain-Specific Software:
    • DS9, CASA (Astronomy)
    • VMD, PyMOL (Molecular visualization)
    • Climate Data Operators (CDO) (Climate science)
  • Low-Level Inspection Tools:
    • hexdump, xxd (Hex editors)
    • strings (Text extraction)
    • file (File type identification)

Having appropriate tools for your specific scientific domain is essential for effective troubleshooting and recovery.

Advanced Considerations for High-Performance Computing Data

For scientific data used in high-performance computing environments, consider these additional factors:

Parallel File Access and Corruption

  • Parallel file systems like Lustre or GPFS introduce additional complexity
  • File striping across multiple storage targets can complicate recovery
  • Use parallel-aware tools and libraries (Parallel HDF5, Parallel NetCDF)
  • Implement proper locking mechanisms for concurrent access
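
As a concrete example of a parallel-aware library, h5py can open files collectively through MPI-IO when built with parallel support. A minimal sketch, assuming an MPI-enabled HDF5/h5py build and mpi4py:

  # Run with, e.g.: mpiexec -n 4 python parallel_write.py
  from mpi4py import MPI
  import h5py
  import numpy as np

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()

  # Each rank writes its own slice of a shared dataset; collective dataset
  # creation avoids the partial-write corruption common with naive sharing
  with h5py.File('parallel.h5', 'w', driver='mpio', comm=comm) as f:
      dset = f.create_dataset('data', shape=(comm.Get_size(), 1000), dtype='f8')
      dset[rank, :] = np.random.rand(1000)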

Big Data Considerations

  • For extremely large datasets (TB+), standard tools may be insufficient
  • Consider specialized big data repair approaches using distributed computing
  • Implement chunking strategies for manageable error isolation
  • Build redundancy into data storage from the beginning
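
Chunked storage also makes damage easier to isolate after the fact, because each chunk can be read and checksum-verified independently. A minimal h5py sketch, assuming the dataset was written with chunking (h5py 2.10 or newer for iter_chunks); the file and dataset names are placeholders:

  import h5py
  import numpy as np

  # Minimal sketch: read a chunked HDF5 dataset chunk by chunk so that a
  # corrupt chunk is skipped instead of failing the whole read
  with h5py.File('large_results.h5', 'r') as f:
      dset = f['simulation/output']
      recovered = np.zeros(dset.shape, dtype=dset.dtype)
      bad_chunks = []
      for chunk_slice in dset.iter_chunks():
          try:
              recovered[chunk_slice] = dset[chunk_slice]
          except Exception:
              bad_chunks.append(chunk_slice)
      print(f'Skipped {len(bad_chunks)} unreadable chunk(s)')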

Long-term Data Preservation

  • Scientific data often needs to remain accessible for decades
  • Consider format obsolescence in long-term archiving strategies
  • Document recovery procedures with the archived data
  • Include sample code for reading/interpreting the data
  • Store multiple representation formats when possible

Conclusion

Scientific computing file errors present unique challenges due to the specialized formats, complex data structures, and high value of research data. Whether dealing with HDF5 corruption, NetCDF dimension issues, or domain-specific format problems, a methodical approach to troubleshooting and recovery is essential to preserve valuable scientific information.

Prevention is the most effective strategy, and implementing good scientific data management practices—including format selection, validation, backup procedures, and documentation—can significantly reduce the likelihood of encountering serious file issues. When problems do arise, approach them systematically, starting with format-specific validation and using the appropriate specialized tools for your scientific domain.

By following the guidance in this article and utilizing appropriate tools, researchers and data scientists should be well-equipped to handle most scientific computing file errors they may encounter, ensuring that valuable research data remains accessible and usable for analysis and reproducibility.